Comparison of the Current RefSeq, Ensembl and EST Databases for Counting Genes and Gene Discovery

FEBS Letters 579 (2005) 690–698
FEBS 29200

Thomas P. Larsson, Christian G. Murray, Tobias Hill, Robert Fredriksson, Helgi B. Schiöth*

Department of Neuroscience, Uppsala University, BMC Box 593, 751 24 Uppsala, Sweden

Received 15 November 2004; revised 13 December 2004; accepted 13 December 2004
Available online 29 December 2004
Edited by Gianni Cesareni

*Corresponding author. Fax: +46 18 51 15 40.
E-mail addresses: [email protected] (T.P. Larsson), [email protected] (H.B. Schiöth).
doi:10.1016/j.febslet.2004.12.046

Abstract

Large amounts of refined sequence material in the form of predicted, curated and annotated genes and expressed sequence tags (ESTs) have recently been added to the NCBI databases. We matched the transcript-sequences of RefSeq, Ensembl and dbEST in an attempt to provide an updated overview of how many unique human genes can be found. The results indicate that there are about 25000 unique genes in the union of RefSeq and Ensembl with 12–18% and 8–13% of the genes in each set unique to the other set, respectively. About 20% of all genes had splice variants. There are a considerable number of ESTs (2200000) that do not match the identified genes, and we used an in-house pipeline to identify 22 novel genes from Genscan predictions that have considerable EST coverage. The study provides an insight into the current status of human gene catalogues and shows that considerable refinement of methods and datasets is needed to come to a conclusive gene count.

© 2004 Federation of European Biochemical Societies. Published by Elsevier B.V. All rights reserved.

Keywords: Expressed sequence tag; RefSeq; Ensembl; Databases; Genscan

1. Introduction

One of the most intriguing questions in human biology is the number and identity of the human genes. Three years after the release of the genomic sequence [1,2], there are still large uncertainties about the exact number of human genes. The number of known and hypothetical genes in current databases such as RefSeq and Ensembl is continuously growing but is still less than the predicted number of genes [1–8], indicating that these datasets are incomplete. Moreover, there are significant differences between these datasets. Identification and annotation of human genes is likely to continue to be an important issue within the field of genetic research for several years to come.

Historically, the highest estimates for the human gene count have been based on clustering of expressed sequence tags (ESTs), and many uncertainties with such methods have been pointed out. Fields and colleagues used a method based on similarity between known cDNA sequences and ESTs to calculate an approximate number of human genes [3]. They estimated that if clustered at high stringency, the resulting EST-clusters would represent about 50% of all genes, which from their ~35000 clusters would correspond to about 60000–70000 genes. Davidson and Burke [4] used a similar method based on clustering human EST sequences into transcription units to reach a number of about 70000 genes, with between 1.2 and 1.5 different transcripts per gene. Both of these methods suffered from the fact that normalization of cDNA libraries may enrich for contaminants and aberrant clones. The problem of genomic contamination has been estimated to affect 5–8% of all ESTs [9,10]. Moreover, the former method included singleton clusters, which may also cause overestimation of the total number of genes.

Another high number was reached by extrapolating from the correlation between human genes and genomic CpG islands. It was estimated that half of the human genes were accompanied by a neighboring region of above average CpG dinucleotide content. The number of CpG islands was calculated to be around 40000, which led to an estimate of 80000 human genes [5]. This early estimate assumed that each CpG-island corresponds to a unique gene, while a later analysis of chromosome 22 showed that this is only true for about 60% of the islands. The public sequencing project reported the existence of only about 29000 islands in non-repeatmasked areas in the analysis of the finished genomic sequence. With these data, the estimate of this method would be lowered to about 35000 human genes, a number not far from the currently popular quote derived from both sequencing projects [1,2].

Predictions made by ab initio programs such as Genscan [11] have proven to be a rich source of novel genes. In 2001, the false positive rate for Genscan was estimated on chromosome 22 using microarray analysis. This study reported a false positive fraction of 17%, compared to the original estimate by the chromosome 22 sequencing group of 27.5% [12,13]. Using this assumption and predicting that the rate is constant over the other chromosomes, this would mean that 7368 (0.17 × 43000) genes from the whole Genscan set of 43000 genes are false predictions. The remaining set of about 35600 human predictions is significantly larger than any known gene set (RefSeq, Ensembl, SWISS-PROT, TrEMBL), of which the Ensembl set is the largest, currently containing 29802 human transcripts. Evidence of actual transcription of predicted genes can be obtained by identifying mRNA and EST sequences matching the predictions. The growth of dbEST from 2 million sequences in 2001 to over 5 million today means that the likelihood of finding transcript evidence for novel predictions is better than ever. Since 2002, the University of California Santa Cruz (UCSC) has provided data on genomic alignments for all human ESTs, providing means to more specifically match ESTs with gene predictions based on genomic location rather than sequence similarity. Cross matching ab initio predictions such as Genscan predictions with ESTs thus remains an important method to verify gene predictions and identify new genes.

Creation of a complete set of known and hypothetical genes requires merging of existing datasets into a non-redundant set of genes. Such merging is often based on a sequence similarity threshold [14,15] that is prone to misjudgements, since it is difficult to find a threshold that always gives the correct result. Another method of merging sets of transcripts is to align all sequences to the genome and determine which sequences from each set are not overlapped by any sequence from the other set. However, for transcript-sequences that are not defined directly from the genomic sequence, it can be a problem to find good quality genomic alignments. For example, the UCSC table refGene, which contains positions for RefSeq transcripts as acquired by BLAT, only includes about 22000 of the 27000 transcripts in the set. The inability to find good quality alignments can be problematic when using this method. Using genomic sequence to define transcript-sequences, as is done by Ensembl, should alleviate this problem. The increased quality of the genome sequences should also make this problem smaller.

The continuous changes of the databases make it a difficult and time-consuming issue to practically "count" known unique genes in an unbiased fashion. Since the publication of the first analysis of the human genome, the NCBI RefSeq catalogue of human transcripts [16] has grown from just above 11000 entries [1] to over 27000 [RefSeq release 4], while the EMBL/Sanger Ensembl database [17] currently contains [...] novel genes from the Genscan set, and we present 22 novel genes with extensive EST support identified by filtering Genscan predictions through a special in-house pipeline.

2. Materials and methods

2.1. Data sources

2.1.1. NCBI. The 2004-04-05 version of the RefSeq nucleotide transcript-sequences of Homo sapiens was downloaded from NCBI (ftp://ftp.ncbi.nih.gov/RefSeq/) as files in FASTA format. Time of download was used as an identifier due to the continuous update of the RefSeq dataset. For simplicity, instead of sorting RefSeq transcripts according to all six official categories (Model, Inferred, Predicted, Provisional, Validated and Reviewed), transcripts with name prefixes N (NM/NR) and X (XM/XR) were defined as reviewed and predicted, respectively. At the time of download, this file contained 27602 sequences. The human ESTs were downloaded as a FASTA file from NCBI (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/est_human.gz). At the time of download, the file contained 5484645 ESTs.

2.1.2. Ensembl. The 2004-04-05 Ensembl human transcript set based on the NCBI genome assembly 34 was downloaded in cDNA FASTA format from Ensembl (ftp://ftp.ensembl.org/pub). The Ensembl transcripts fall into three categories: known, novel and pseudogene. In each comparison with RefSeq, we let known correspond to reviewed and novel to predicted. When referred to as Ensembl genes or Ensembl transcripts, only sequences of statuses known and novel are intended unless otherwise specified. At the time of download, this file contained 31609 sequences divided into 26309, 3493 and 1807 known, novel and pseudogenes, respectively. The ESTgene dataset version 25.34 for the NCBI34 assembly was downloaded using the EnsMart retrieval system. At the time of download, this set consisted of 24980 sequences.

2.1.3. UCSC. The human genome assembly based on NCBI 34 was downloaded as unmasked FASTA files from UCSC (ftp://hgdownload.cse.ucsc.edu/goldenPath/hg16/).
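
The prefix rule described in section 2.1.1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it assumes FASTA header lines that carry a RefSeq accession such as NM_000014, and the function names, the regular expression and the sample records are all hypothetical.

```python
import re
from collections import Counter

# Prefix grouping as described in section 2.1.1: N -> reviewed, X -> predicted.
PREFIX_CLASS = {"NM": "reviewed", "NR": "reviewed",
                "XM": "predicted", "XR": "predicted"}

# Assumed accession shape: NM/NR/XM/XR, an underscore, then digits.
ACCESSION_RE = re.compile(r"\b([NX][MR])_\d+")


def classify_header(header):
    """Return 'reviewed', 'predicted' or 'other' for one FASTA header line."""
    match = ACCESSION_RE.search(header)
    if match is None:
        return "other"
    return PREFIX_CLASS[match.group(1)]


def count_classes(fasta_lines):
    """Count sequences per class; each line starting with '>' is one sequence."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            counts[classify_header(line)] += 1
    return counts


# Hypothetical records, for illustration only.
example = [
    ">gi|1|ref|NM_000014.4| Homo sapiens example mRNA",
    "ATGGCC",
    ">gi|2|ref|XM_123456.1| PREDICTED: Homo sapiens example",
    "ATGTTT",
    ">gi|3|ref|NR_001234.1| Homo sapiens example non-coding RNA",
    "GGGCCC",
]
print(count_classes(example))
```

For a real download, the same function could be fed the lines of an opened FASTA file; counting the header lines in this way is also how a total such as the 27602 sequences quoted above would be checked.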
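
The back-of-envelope gene-count estimates quoted in the Introduction can be re-derived from the figures given there. The sketch below only repeats that arithmetic under one plausible reading of the text (about half of all genes carry a CpG island, and about 60% of islands mark a unique gene). The variable names are ours, and the inputs are the rounded values from the text, so small differences from the published numbers are to be expected.

```python
# CpG-island estimate: about half of all genes are accompanied by an island [5].
islands_original = 40_000
genes_original = islands_original / 0.5

# Revised inputs: about 29000 islands, of which about 60% mark a unique gene.
islands_revised = 29_000
genes_revised = islands_revised * 0.60 / 0.5

# Genscan estimate: a 17% false positive fraction applied to the whole set.
genscan_total = 43_000
false_positives = 0.17 * genscan_total
remaining = genscan_total - false_positives

print(round(genes_original))    # 80000
print(round(genes_revised))     # 34800
print(round(false_positives))   # 7310
print(round(remaining))         # 35690
```

With the rounded inputs, the false-positive count comes to 7310 rather than the 7368 quoted in the text. That would be consistent with a Genscan set slightly larger than 43000 (7368 / 0.17 ≈ 43341), although this is an inference from the arithmetic and not a figure stated in the paper.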
Recommended publications
  • Gene Prediction: the End of the Beginning
    Meeting report. Gene prediction: the end of the beginning. Colin Semple. Address: Department of Medical Sciences, Molecular Medicine Centre, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK. E-mail: [email protected] Published: 28 July 2000. Genome Biology 2000, 1(2):reports4012.1–4012.3. The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2000/1/2/reports/4012 © GenomeBiology.com (Print ISSN 1465-6906; Online ISSN 1465-6914). A report from the conference entitled Genome Based Gene Structure Determination, Hinxton, UK, 1-2 June, 2000, organised by the European Bioinformatics Institute (EBI). The draft sequence of the human genome will become available later this year. For some time now it has been accepted that this will mark a beginning rather than an end. A vast amount of work will remain to be done, from detailing [...] Reducing genomes to genes. All ab initio gene prediction programs have to balance sensitivity against accuracy. It is often only possible to detect all the real exons present in a sequence at the expense of detecting many false ones. Alternatively, one may accept only predictions scoring above a more stringent threshold but lose those real exons that have lower scores. The trick is to try and increase accuracy without any large loss of sensitivity; this can be done by comparing the prediction with additional, independent evidence.
  • Life Will Never Be the Same Annual Genome Sequencing and Biology Meeting, Cold Spring Harbor Laboratory, USA
    Yeast 2000; 17: 241–243. Meeting Review. Life will never be the same. Annual Genome Sequencing and Biology Meeting, Cold Spring Harbor Laboratory, USA. May 2000. M.A. Strivens*, MRC Mammalian Genetics Unit and UK Mouse Genome Centre, Harwell, UK. *Correspondence to: M. A. Strivens, MRC Mammalian Genetics Unit and UK Mouse Genome Centre, Harwell, Oxfordshire OX11 0RD, UK. E-mail: [email protected] It seems appropriate that the Cold Spring Harbor Genome Sequencing and Biology Meeting, which witnessed the creation of the Human Genome Organization (HUGO) in 1988, should this year present three major advances in genomic science: the completion of the finished sequence of Drosophila melanogaster; the announcement that 85% of the genome of Homo sapiens is now in draft sequence; and the complete, finished sequence of a second human chromosome, chromosome 21. Other major sessions of the meeting focused on single nucleotide polymorphisms (SNPs), ethical, legal and social implications (ELSI), as well as comparative and functional genomics. The crux of the whole-genome shotgun strategy is the assembly technique. Gene Myers (Celera Genomics) reported on how the 'double-barrelled' shotgun2 approach had given a significant advantage to the computer algorithms employed in the assembly of the fly genome. The assembly system employs a bottom-up, nucleating strategy, initially assembling small islands of sequence of high confidence (diverting the assembly of repeat regions to later stages) and then searching for other sequence (including orientation data from clone end-sequencing) to join the islands together. In addition, he confirmed a less than 0.5% error rate in the [...]
  • Herding Hemingway's Cats: Understanding How Our Genes Work
    Also available in the Bloomsbury Sigma series: Sex on Earth by Jules Howard; p53 – The Gene that Cracked the Cancer Code by Sue Armstrong; Atoms Under the Floorboards by Chris Woodford; Spirals in Time by Helen Scales; Chilled by Tom Jackson; A is for Arsenic by Kathryn Harkup; Breaking the Chains of Gravity by Amy Shira Teitel; Suspicious Minds by Rob Brotherton. HERDING HEMINGWAY'S CATS: UNDERSTANDING HOW OUR GENES WORK. Kat Arney. Bloomsbury Sigma, an imprint of Bloomsbury Publishing Plc. 50 Bedford Square, London WC1B 3DP, UK; 1385 Broadway, New York NY 10018, USA. www.bloomsbury.com BLOOMSBURY and the Diana logo are trademarks of Bloomsbury Publishing Plc. First published 2016. Copyright © Kat Arney, 2016. Kat Arney has asserted her right under the Copyright, Designs and Patents Act, 1988, to be identified as Author of this work. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage or retrieval system, without prior permission in writing from the publishers. No responsibility for loss caused to any individual or organisation acting on or refraining from action as a result of the material in this publication can be accepted by Bloomsbury or the author. Quote on p. XXX reprinted by permission of Edward Monkton. British Library Cataloguing-in-Publication Data: A catalogue record for this book is available from the British Library.
  • cDNA: The Essentials (ADNc : les incontournables)
    Genomic chronicles. cDNA: the essentials, by Bertrand Jordan. médecine/sciences 2001; 17: 81-4. At the very beginning of the 1990s, sequencing even one or two hundred kilobases of human DNA was a struggle, and the scientific results of such work seemed rather meagre compared with the effort and funds invested [1]. The beginnings of ESTs. The cDNA option, with its partial but massive sequencing, quickly emerged as a realistic alternative to complete sequencing. Launched single-handedly by Craig Venter [2] and widely publicized by the scandal raised by attempts to patent these partial sequences, the EST (expressed sequence [...] Redundant by design (since partial sequences are determined for clones picked at random from cDNA libraries), this collection was analysed by systems known as gene indexes, which compare all the sequences in order to group them into clusters, each meant to represent one transcript; it was from these data that the now much-debated estimate of about 100 000 human genes was made [3]. The usefulness of ESTs was further increased by the mapping of a large number of them onto our genome: carried out on a massive scale using radiation hybrids, this mapping led in 1998 to the positioning of more than 30 000 tags [4]. [...] more sophisticated computational analysis, and above all an awareness of the need to plan a "Very Large Sequencing" operation as an industrial undertaking, made it possible and almost affordable to read megabases of DNA, despite the absence of a major technical revolution. This progress became obvious to everyone when the complete yeast sequence was obtained in 1996. The 13 megabases deciphered were by far the largest set ever obtained, and showed that such projects had become viable; and the usefulness of these data was attested by the discovery [...]
  • What Is a Gene, Post-ENCODE? History and Updated Definition
    Perspective. What is a gene, post-ENCODE? History and updated definition. Mark B. Gerstein,1,2,3,9 Can Bruce,2,4 Joel S. Rozowsky,2 Deyou Zheng,2 Jiang Du,3 Jan O. Korbel,2,5 Olof Emanuelsson,6 Zhengdong D. Zhang,2 Sherman Weissman,7 and Michael Snyder2,8. 1Program in Computational Biology & Bioinformatics, Yale University, New Haven, Connecticut 06511, USA; 2Molecular Biophysics & Biochemistry Department, Yale University, New Haven, Connecticut 06511, USA; 3Computer Science Department, Yale University, New Haven, Connecticut 06511, USA; 4Center for Medical Informatics, Yale University, New Haven, Connecticut 06511, USA; 5European Molecular Biology Laboratory, 69117 Heidelberg, Germany; 6Stockholm Bioinformatics Center, Albanova University Center, Stockholm University, SE-10691 Stockholm, Sweden; 7Genetics Department, Yale University, New Haven, Connecticut 06511, USA; 8Molecular, Cellular, & Developmental Biology Department, Yale University, New Haven, Connecticut 06511, USA. Published by Cold Spring Harbor Laboratory Press. While sequencing of the human genome surprised us with how many protein-coding genes there are, it did not fundamentally change our perspective on what a gene is. In contrast, the complex patterns of dispersed regulation and pervasive transcription uncovered by the ENCODE project, together with non-genic conservation and the abundance of noncoding RNA genes, have challenged the notion of the gene. To illustrate this, we review the evolution of operational definitions of a gene over the past century, from the abstract elements of heredity of Mendel and Morgan to the present-day ORFs enumerated in the sequence databanks. We then summarize the current ENCODE findings and provide a computational metaphor for the complexity.
  • Memoir and the Laboratory. Narrative in the Age of the Genome
    Choksey, Lara. "Memoir and the laboratory." Narrative in the Age of the Genome. London: Bloomsbury Academic, 2021. 83–118. Bloomsbury Collections. <http://dx.doi.org/10.5040/9781350102576.0009>. Copyright © Lara Choksey 2021. You may share this work for non-commercial purposes only, provided you give attribution to the copyright holder and the publisher, and provide a link to the Creative Commons licence. 3. Memoir and the laboratory. The Human Genome Project took place in the same decade that saw a proliferation of popular memoirs in Anglo American publishing, heaving on airport bestseller shelves and piling up on buffet tables in chain bookshops. The extension of chain bookshops from the high street to warehouse superstores in shopping parks on the outskirts of cities meant a boom for popular non-fiction. Biotechnology was also going through a moment in the sun, with billions invested in laboratories around the world working to sequence the human genome, and the bulk of funding coming from the United States. To justify massive amounts of funding, and to ensure a legacy for the project and its key players, several of the Human Genome Project's main protagonists published science memoirs of the project. These memoirs, particularly James Watson's DNA: The Secret of Life (2003) and J. Craig Venter's A Life Decoded: My Genome, My Life (2007), quickly ascended bestseller lists, sandwiched between sports stars and film actors. Venter and Watson draw techniques from the blockbuster memoir: a constant (often explicit) refrain of 'you won't believe what happened next', cascading vignettes of overcoming obstacles for the sake of a higher goal, shock twists in the tale and a cast of protagonists, antagonists and commodities plucked from the thrillers and romances that competed for shelf space.
  • What Is a Gene? The Idea of Genes as Beads on a DNA String Is Fast Fading
    NEWS FEATURE. NATURE | Vol 441 | 25 May 2006, pages 398–399. © 2006 Nature Publishing Group. Illustration: C. Darkin. WHAT IS A GENE? The idea of genes as beads on a DNA string is fast fading. Protein-coding sequences have no clear beginning or end and RNA is a key part of the information package, reports Helen Pearson. 'Gene' is not a typical four-letter word. It is not offensive. It is never bleeped out of TV shows. And where the meaning of most four-letter words is all too clear, that of gene is not. The more expert scientists become in molecular genetics, the less easy it is to be sure about what, if anything, a gene actually is. Rick Young, a geneticist at the Whitehead [...] Laurence Hurst at the University of Bath, UK. "All of that information seriously challenges our conventional definition of a gene," says molecular biologist Bing Ren at the University of California, San Diego. And the information challenge is about to get even tougher. Later this year, a glut of data will be released from the international Encyclopedia of DNA Elements (ENCODE) project. [...] previously unimagined scope of RNA. The one gene, one protein idea is coming under particular assault from researchers who are comprehensively extracting and analysing the RNA messages, or transcripts, manufactured by genomes, including the human and mouse genome. Researchers led by Thomas Gingeras at the company Affymetrix in Santa [...]
  • The Mental Health Between Epigenetics and Individual Beliefs
    Mental Health Global Challenges XXI Century Dramnescu Marin Conference proceedings – 2018 The mental health between epigenetics and individual beliefs Dramnescu Marin Bucharest Academy of Economic Studies, Bucharest, Romania Abstract. Mental health is an integrative concept that is not limited to dysfunctions or accentuations of psychic processes or mechanisms of thought. The research effort focused on the idea that mental health is a functional optimum found at the intersection of cellular behavior, the physical environment, the external environment, with all its subtypes, the environment in which the individual manifests itself and the subjective, psychological environment, dominated mainly by unconscious behavioral routines, beliefs, values, and ultimately individual perspective on life. Mental health represents and manifests itself as an emerging process resulting from the correlated functioning of the biological, physiological, and in particular cellular mechanisms, the various, random and / or permanent influences and stimuli of the physical, social and professional environment and the superior motivational structures of the type of beliefs and individual perspective on life. Epigenetics is a contemporary discipline derived from genetics that includes the environmental context as an important part of heredity. Currently, this discipline strongly influences a variety of areas, including medicine, psychiatry and psychology. Keywords. Mental Health, Cell Behavior, Epigenetic Approach, Beliefs. 1. Introduction It is unanimously accepted that the becoming of a person is an effect of the interaction among genetic factors, environmental factors and educational factors. Their share in the individual's development differs according to the stage of the development, but their influence on mental health in particular and on the formation of personality in general is permanent and becomes manifest throughout the ontogenetic.
  • Cartilage Disease & Regeneration Genomics in Osteoarthritis
    Darryl D'Lima, M.D. The Donald P. Shiley Visiting Lectureship, Wednesday, Feb. 11, 2009, Scripps Green Hospital. Cartilage Disease & Regeneration: Darryl D. D'Lima, MD, PhD. Genomics in Osteoarthritis: Clifford W. Colwell Jr, MD. Shiley Center for Orthopaedic Research & Education, Scripps Clinic. Genetics: single gene disorders (cystic fibrosis, Huntington's); multifactorial (heart disease, diabetes, OA). Genomics: study of genes and how genetic variations affect function and disease. Human Genome Project: a genome is the complete set of DNA for an organism. Human genome: ~3 billion pairs of nucleotides; 23 pairs of chromosomes; location of each gene on the chromosome (normal volunteers). How many genes are there? 20,000 - 75,000; C. elegans; GeneSweep winner: 24,947. Genome trivia: chromosome 1 has the most genes = 2968; Y (male) chromosome has the fewest = 231; X (female) chromosome = 2000. Human Genome Project: less than 2% of the genome codes for proteins; the functions are unknown for over 50% of discovered genes; junk DNA = repeated sequences that do not code for proteins make up at least 50% of the human genome. Is this junk? O lny srmat poelpe can raed tihs. I cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid, aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be in the rghit pclae. Human Genome Project: 99.9% nucleotide bases are exactly the same in all people.
  • Expanded Human Gene Tally Reignites Debate After 15 Years, Researchers Still Can’T Agree on How Many Genes Are in the Human Genome
    NEWS IN FOCUS. GENETICS. Expanded human gene tally reignites debate. After 15 years, researchers still can't agree on how many genes are in the human genome. BY CASSANDRA WILLYARD. One of the earliest attempts to estimate the number of genes in the human genome involved tipsy geneticists, a bar in Cold Spring Harbor, New York, and pure guesswork. That was in 2000, when a draft human genome sequence was still in the works; geneticists were running a sweepstake on how many genes humans have, and wagers ranged from tens of thousands to hundreds of thousands. Almost two decades later, scientists armed [...] But many geneticists aren't yet convinced that all the newly proposed genes will stand up to close scrutiny. Their criticisms underscore just how difficult it is to identify new genes, or even to define what a gene is. "People have been working hard at this for 20 years, and we still don't have the answer," says Steven Salzberg, a computational biologist at Johns Hopkins University in [...] an average of around 40,000. These days, the span of estimates has shrunk, with most now between 19,000 and 22,000, but there is still disagreement (see 'Gene tally'). Salzberg's team used data from the Genotype-Tissue Expression (GTEx) project, which sequenced RNA from more than 30 different tissues taken from several hundred cadavers. RNA is the intermediary between DNA and proteins. The researchers wanted to identify genes that encode a protein and those that don't, but that still have an important role in cells.
  • The Gene Guessing Game
    Yeast 2000; 17: 218–224. Review Article. The gene guessing game. Ian Dunham*, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. *Correspondence to: I. Dunham, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. Abstract: A recent flurry of publications and media attention has revived interest in the question of how many genes exist in the human genome. Here, I review the estimates and use genomic sequence data from human chromosomes 21 and 22 to establish my own prediction. Copyright © 2000 John Wiley & Sons, Ltd. Keywords: human genome; genes; DNA sequence; chromosome 22; EST. Introduction: How many genes are there in the human genetic parts list? Forget the 5 year plans [19] and the press releases [18]; the end of the human genome project will come when we have an accurate answer to this question. Sure, there will be a few tens or maybe even hundreds not identified, and perhaps we won't have the full structure of every gene, but we will have a pretty good idea of how many genes there are. [...] functional RNAs are essential to the cell, it is the protein-coding genes that concern most gene counters [4,10]. Mostly this is because the protein-coding genes must contain the bulk of the functionality and therefore interest, but in part it is also because it is thought that they might be easier to count. Of course, we all know what we mean by a protein-coding gene. However these genes have a number of methods to increase complexity from a single region of DNA, including alternative use of promoters, [...]
  • Individualized Nutritional Recommendations: Do We Have the Measurements Needed to Assess Risk and Make Dietary Recommendations?
    Proceedings of the Nutrition Society (2004), 63, 167–172. DOI:10.1079/PNS2003325. © The Author 2004. The Summer Meeting of the Nutrition Society was held at King's College, London on 7–10 July 2003. Symposium on 'Implications for dietary guidelines of genetic influences on requirements'. Individualized nutritional recommendations: do we have the measurements needed to assess risk and make dietary recommendations? Lenore Arab, School of Public Health, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA. Is the information currently available to adjust nutritional recommendations and develop individualized nutrition? No. There is not even the information needed for setting dietary recommendations with confidence now at the group level. Will it be available soon? The answer to this question depends on the drive and will of the nutritional community, the success in recruiting funding to the area, the education of nutritionists and the spawning of great ideas and approaches. The emerging tools of genomics, proteomics and metabolomics are enabling the in-depth study of relationships between diet, genetics and metabolism. The advent of technologies can be compared with the discovery of the microscope and the new dimensions of scientific visualization enabled by that discovery. Nutritionists stand at the crest of new waves of data that can be generated, and new methods for their digestion will be required. To date, the study of dietary requirements has been based largely on a black box approach. Subjects are supplemented or depleted and clinical outcomes are observed. Few recommendations are based on metabolic outcomes. Metabolomics and nutrigenomics promise tools with which recommendations can be refined to meet individual requirements and the potential of individualized nutrition can be explored.