Discovery of the Human Genome Sequence in the Public and Private Databases

Current Biology Vol 11 No 20, R808. Feature

Genomes: Much heat has been generated in discussions about the key human genome sequence databases, generated by the Human Genome Project and Celera, and what specific features each offers genome researchers. Stephen W. Scherer and Joseph Cheung, who are intense users of both, offer a personal assessment of the developing contents.

Deservedly, there has been much celebration over the publication of two draft versions of the human genome sequence. There have also been other recent assemblies of the sequence, producing more complete coverage and reliable DNA sequence annotation. However, to date, a finished reference sequence of the human genome does not exist. Furthermore, only a fraction of the genes and other important biological features of chromosomes have been characterized. The goal of this piece is to share our experiences with other scientists contemplating if and how they might benefit from subscribing to the Celera DNA sequence database.

Our observations are based on having access to the Celera Discovery System through an ‘academic’ subscription for the past year. We are also intense users (and contributors) of data in the Human Genome Project (HGP) databases. Many of our experiences are based on gene mapping and sequencing studies of human chromosome 7, but also through positional cloning studies in other regions of the genome. We are most often asked to comment subjectively on the following three datasets:

Important DNA sequence sources

(i) The Celera version of the human genome published in February (called component-3 or C3; data at http://www.celera.com/) and their more recent component-4 (C4) assembly (by subscription since August 2001). The C3 assembly was derived from combining 14,808 Mb of Whole Genome Shotgun (WGS) Celera sequence with 4,405 Mb from the HGP. C4 builds on C3 using improved algorithms as well as additional Celera sequences and new HGP data as of December 2000;

(ii) The successive assemblies of the clone-based approach of the HGP from the February publication up until August 2001 (up-to-date statistics for the HGP sequence can be found at http://www.ebi.ac.uk/genomes/mot/). The best websites for accessing HGP data are listed in Table 1. HGP does not have Celera data in their assemblies;

(iii) The Celera mouse genome (available since June 2001), assembled solely using a WGS based on approximately 6X genome coverage with DNA from three different mouse strains.

Figure 1. [Plot: order of markers on chromosome 7 (y-axis, 0 to 7000, p-arm and q-arm indicated) against order of markers on Celera mapped scaffolds (x-axis).]

The order of 5343 chromosome 7 DNA markers present in the C4 scaffolds (each scaffold in a different color) was almost entirely consistent with the marker order established by hand-curated data from radiation and somatic cell hybrid, yeast and bacterial-artificial chromosome, and genetic mapping experiments. The 246 markers that did not fall into these larger scaffolds were all found in smaller ones or in the Celera fragment database. The 22 DNA markers that are not in the expected order tend to map to the centromere or to intrachromosomal duplications. Over 98% of known markers could be placed on the map.
Table 1. General characteristics of the Celera and HGP sequence databases.*

Each entry gives the Celera rating first, then the Human Genome Project rating.

Accessibility†
  To data: Good; Good to excellent
  Via cytolocation: Very good (mirrors public data); Very good
  Via gene or marker: Good; Excellent
  Via DNA sequence: Excellent; Good

Coverage
  Euchromatin: Outstanding; Good (~50% still in draft)
  Pericentromeric: Good; Good
  Large duplication: Not represented; Better than Celera

Accuracy
  Internal accuracy: Excellent; Excellent
  Long-range order and orientation: Outstanding; Good, continues to improve

Gene annotation‡
  Known genes: Very good; Very good
  New genes: Rudimentary; Rudimentary

Other strengths§
  Celera:
    DNA sequence in fragment database often assists in gap filling
    Long sequence scaffolds favor genome-wide comparison/annotation
    Availability of assembled mouse sequence to assist human annotation
  Human Genome Project:
    Ease of accessibility to data at multiple websites
    Availability of clones to confirm or complete sequencing and mapping
    Clone-based strategy essential for completion of difficult regions

Recommendations (wish list)
  Celera:
    Be more dynamic incorporating latest public data
    Make clones available for sequencing of gap regions
    Release human component 4 and mouse data on DVD to academic subscribers
    Sequence a third mammalian genome to assist comparative analyses
  Human Genome Project:
    Increase resolution and accuracy of cytolocations
    Top up and finish human sequence
    Increase efforts to incorporate highly-curated data from community
    Sequence a third mammalian genome to assist comparative analyses

*Based on survey of 10 users of varying levels of sophistication; bioinformatics analysts (4), molecular biologists (3), medical geneticists (3).

†While the Celera database is generally user friendly with excellent service support, the limited number of portals per academic subscription can inhibit accessibility. Data retrieval can sometimes be slow. Navigating/searching HGP databases is more intuitive primarily since familiar nomenclature is used compared with the obscure identifiers often found in Celera. The favorite entry points to public DNA sequence data based on cytolocation, gene marker, and by DNA sequence itself are UCSC Golden Path (http://genome.ucsc.edu/) and the BAC resources (http://www.ncbi.nlm.nih.gov/genome/cyto/hbrc.shtml), Locus Link at NCBI (http://www.ncbi.nlm.nih.gov/LocusLink/) and again Golden Path, respectively. 'Ensembl' (http://www.ensembl.org/) was also a good entry point into the public data. Medical geneticists often use the Genome Database (http://www.gdb.org/) or GeneCards (http://bioinfo.weizmann.ac.il/cards/index.html).

‡It is still premature to comment on the accuracy and completeness of the overall annotation of genes since many are based only on gene-prediction algorithms. Earlier versions of the Celera and HGP assemblies/annotation often missed contiguous gene family members but in both cases this continues to improve. For academic subscribers the supporting evidence for new Celera transcripts is not intuitively available to the end user. HGP is also more dynamic in updating new cDNA and gene data from the literature.

§There are multitudes of DNA sequence analysis programs available in the public domain not mentioned (see http://searchlauncher.bcm.tmc.edu/). Celera has basic Blast search capabilities, GO ontology, Panther Ontology (proprietary), and Genome Browser, which is an outstanding (proprietary) gene-model building tool (available to corporate but not academic subscribers). Celera's mouse genome assembly used the 129X1/SvJ, DBA/2J, and A/J strains and the HGP is sequencing C57BL6/J.

We have summarized our experiences using Celera compared to HGP information in Table 1. Both datasets, and the accompanying annotation, have strengths and weaknesses. While we constantly access both, if website hits alone were counted, the HGP would win out over Celera primarily because of ease of accessibility and increased number of entry points to the DNA sequence. In our group, more sophisticated analysts performing large-scale annotation experiments usually occupy our laboratory’s single-portal access (per subscription) to Celera (the release of the C3 assembly on DVD has relieved some of this pressure). Molecular biologists and medical geneticists almost always start by accessing the public databases to find out what can be found or what is missing, and then they check Celera.

[…] identification of large (and sometimes small) genes that would have otherwise been fragmented or missing and, therefore, not detected using HGP data. For example, using […]

[…] of 50 kb of DNA (consistent with a chromosome 21 gene size of 57 kb). This suggests that the annotation of genes, in particular by the HGP, will become more accurate as the genome sequence moves from draft to finished form. As Hogenesch and colleagues have shown, however, the current Celera and Ensembl (HGP) sets of predicted genes are largely mutually exclusive, suggesting that even when a consensus genome sequence is achieved, the resulting gene maps will still vary greatly.

An example of where the HGP clone-based strategy outperforms the Celera WGS approach is in proper assembly of large nearly identical DNA segments that occur in more than one copy in the genome. Such duplications might account for up to 5% of human DNA. When duplications are >50 kb in size, in our experience, they are not represented in large C3 or C4 scaffolds (they are found in the Celera ‘fragment’ database). The same sequences may also be underrepresented or mistakenly assembled by the HGP. However, we have found the HGP data usually to be more representative for these chromosomal regions with the added advantage of having access to a physical resource (the clone) for confirmatory analyses. For example, duplications involved in Williams–Beuren syndrome at 7q11.23 are not represented in Celera scaffolds, but they are better covered by the HGP. The same seems to be true for duplications flanking […]

Figure 2. [VISTA plot: four 50 kb windows spanning 0k to 200k, each with a 50% to 100% scale on the y-axis and CFTR exons 1 to 27 marked.]

Comparison of 200 kb of Celera human (C4) and mouse DNA sequence encompassing the cystic fibrosis (CFTR) gene on human chromosome 7 and mouse chromosome 6, respectively. Each window represents 50 kb of syntenic DNA sequence displayed using the program VISTA (http://www-gsd.lbl.gov/vista). Each of the 27 CFTR exons was present in the assembled mouse sequence. Blue shading represents exons and red highlights other highly conserved sequences.
