R808 Current Biology Vol 11 No 20

Feature

Discovery of the human genome sequence in the public and private databases

Genomes: Much heat has been generated in discussions about the key human genome sequence databases, generated by the Human Genome Project and Celera, and what specific features each offers genome researchers. Stephen W. Scherer and Joseph Cheung, who are intense users of both, offer a personal assessment of the developing contents.

Deservedly, there has been much celebration over the publication of two draft versions of the human genome sequence. There have also been other recent assemblies of the sequence, producing more complete coverage and reliable DNA sequence annotation. However, to date, a finished reference sequence of the human genome does not exist. Furthermore, only a fraction of the genes and other important biological features have been characterized. The goal of this piece is to share our experiences with other scientists contemplating whether and how they might benefit from subscribing to the Celera DNA sequence database.

Our observations are based on having access to the Celera Discovery System through an ‘academic’ subscription for the past year. We are also intense users of (and contributors to) data in the Human Genome Project (HGP) databases. Many of our experiences are based on mapping and sequencing studies of human chromosome 7, but also on positional cloning studies in other regions of the genome. We are most often asked to comment subjectively on the following three datasets:

Important DNA sequence sources

(i) The Celera version of the human genome published in February 2001 (called component-3 or C3; data at http://www.celera.com/) and their more recent component-4 (C4) assembly (by subscription since August 2001). The C3 assembly was derived from combining 14,808 Mb of Whole Genome Shotgun (WGS) Celera sequence with 4,405 Mb from the HGP. C4 builds on C3 using improved algorithms as well as additional Celera sequences and new HGP data as of December 2000;

(ii) The successive assemblies of the clone-based approach of the HGP from the February publication up until August 2001 (up-to-date statistics for the HGP sequence can be found at http://www.ebi.ac.uk/genomes/mot/). The best websites for accessing HGP data are listed in Table 1. The HGP does not include Celera data in its assemblies;

(iii) The Celera mouse genome (available since June 2001), assembled solely using a WGS approach based on approximately 6X genome coverage with DNA from three different mouse strains.

Figure 1

[Plot of the order of markers on Celera mapped scaffolds; p-arm and q-arm indicated.]

The order of 5343 chromosome 7 DNA markers present in the C4 scaffolds (each scaffold in a different color) was almost entirely consistent with the marker order established by hand-curated data from radiation and somatic cell hybrid, yeast and bacterial artificial chromosome, and genetic mapping experiments. The 246 markers that did not fall into these larger scaffolds were all found in smaller ones or in the Celera fragment database. The 22 DNA markers that are not in the expected order tend to map to the centromere or to intrachromosomal duplications. Over 98% of known markers could be placed on the map.
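A comparison of the kind summarized in Figure 1, where the marker order in an assembly is checked against an independently curated map, can be illustrated with a short sketch. This is not the procedure used for Figure 1, which the text does not describe; it is one common way to flag out-of-order markers, by finding the longest run of markers whose order agrees with the reference (a longest increasing subsequence) and reporting the rest as discordant. The function name and the marker names (M1 to M6) are hypothetical.

```python
from bisect import bisect_left


def discordant_markers(reference_order, assembly_order):
    """Return markers whose position in the assembly breaks the reference order.

    Markers are converted to their rank in the reference map and read in
    assembly order. Markers outside the longest increasing subsequence of
    those ranks are reported as discordant. Markers are assumed unique.
    """
    rank = {marker: i for i, marker in enumerate(reference_order)}
    # Keep only markers present in both maps, in assembly order.
    shared = [m for m in assembly_order if m in rank]
    ranks = [rank[m] for m in shared]

    # Longest increasing subsequence with back-pointers, O(n log n).
    tails = []        # tails[k]: index of the smallest tail of a run of length k + 1
    tail_values = []  # rank values parallel to tails, used for binary search
    prev = [None] * len(ranks)
    for i, r in enumerate(ranks):
        k = bisect_left(tail_values, r)
        if k == len(tails):
            tails.append(i)
            tail_values.append(r)
        else:
            tails[k] = i
            tail_values[k] = r
        prev[i] = tails[k - 1] if k > 0 else None

    # Walk back from the end of the longest run to collect concordant markers.
    concordant = set()
    i = tails[-1] if tails else None
    while i is not None:
        concordant.add(shared[i])
        i = prev[i]
    return [m for m in shared if m not in concordant]


# Hypothetical example: one marker (M5) is placed too early in the assembly.
curated_map = ["M1", "M2", "M3", "M4", "M5", "M6"]
scaffold_order = ["M1", "M2", "M5", "M3", "M4", "M6"]
print(discordant_markers(curated_map, scaffold_order))  # → ['M5']
```

In practice the flagged markers would then be inspected by hand, since the caption notes that out-of-order markers tend to lie near the centromere or in intrachromosomal duplications rather than being random errors.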

Table 1

General characteristics of the Celera and HGP sequence databases.*

Category                              Celera                              Human Genome Project (HGP)

Accessibility†
  To data                             Good                                Good to excellent
  Via cytolocation                    Very good (mirrors public data)     Very good
  Via gene or marker                  Good                                Excellent
  Via DNA sequence                    Excellent                           Good

Coverage
  Euchromatin                         Outstanding                         Good (~50% still in draft)
  Pericentromeric                     Good                                Good
  Large duplication                   Not represented                     Better than Celera

Accuracy
  Internal accuracy                   Excellent                           Excellent
  Long-range order and orientation    Outstanding                         Good, continues to improve

Gene annotation‡
  Known genes                         Very good                           Very good
  New genes                           Rudimentary                         Rudimentary

Other strengths§
  Celera:
    DNA sequence in fragment database often assists in gap filling
    Long sequence scaffolds favor genome-wide comparison/annotation
    Availability of assembled mouse sequence to assist human annotation
  HGP:
    Ease of accessibility to data at multiple websites
    Availability of clones to confirm or complete sequencing and mapping
    Clone-based strategy essential for completion of difficult regions

Recommendations (wish list)
  Celera:
    Be more dynamic incorporating latest public data
    Make clones available for sequencing of gap regions
    Release human component 4 and mouse data on DVD to academic subscribers
    Sequence a third mammalian genome to assist comparative analyses
  HGP:
    Increase resolution and accuracy of cytolocations
    Top up and finish human sequence
    Increase efforts to incorporate highly-curated data from community
    Sequence a third mammalian genome to assist comparative analyses

*Based on a survey of 10 users of varying levels of sophistication: bioinformatics analysts (4), molecular biologists (3), medical geneticists (3).

†While the Celera database is generally user friendly with excellent service support, the limited number of portals per academic subscription can inhibit accessibility. Data retrieval can sometimes be slow. Navigating and searching HGP databases is more intuitive, primarily because familiar nomenclature is used, compared with the obscure identifiers often found in Celera. The favorite entry points to public DNA sequence data based on cytolocation, gene or marker, and DNA sequence itself are UCSC Golden Path (http://genome.ucsc.edu/) and the BAC resources (http://www.ncbi.nlm.nih.gov/genome/cyto/hbrc.shtml), Locus Link at NCBI (http://www.ncbi.nlm.nih.gov/LocusLink/), and again Golden Path, respectively. 'Ensembl' (http://www.ensembl.org/) was also a good entry point into the public data. Medical geneticists often use the Genome Database (http://www.gdb.org/) or GeneCards (http://bioinfo.weizmann.ac.il/cards/index.html).

‡It is still premature to comment on the accuracy and completeness of the overall annotation of genes since many are based only on gene-prediction algorithms. Earlier versions of the Celera and HGP assemblies/annotation often missed contiguous gene family members, but in both cases this continues to improve. For academic subscribers the supporting evidence for new Celera transcripts is not intuitively available to the end user. HGP is also more dynamic in updating new cDNA and gene data from the literature.

§There are multitudes of DNA sequence analysis programs available in the public domain not mentioned (see http://searchlauncher.bcm.tmc.edu/). Celera has basic Blast search capabilities, GO ontology, Panther Ontology (proprietary), and Genome Browser, which is an outstanding (proprietary) gene-model building tool (available to corporate but not academic subscribers). Celera's mouse genome assembly used the 129X1/SvJ, DBA/2J, and A/J strains, and the HGP is sequencing C57BL/6J.

We have summarized our experiences using Celera compared with HGP information in Table 1. Both datasets, and the accompanying annotation, have strengths and weaknesses. While we constantly access both, if website hits alone were counted, the HGP would win out over Celera, primarily because of ease of accessibility and the increased number of entry points to the DNA sequence. In our group, more sophisticated analysts performing large-scale annotation experiments usually occupy our laboratory’s single-portal access (per subscription) to Celera (the release of the C3 assembly on DVD has relieved some of this pressure). Molecular biologists and medical geneticists almost always start by accessing the public databases to find out what can be found or what is missing, and then they check Celera. In some cases Celera’s data is more complete and/or accurate than the HGP; in other cases it is not (Table 1).

For example, when annotating a chromosomal region for genes we most often use Celera sequence initially, since it almost always represents longer continuous stretches of DNA sequence (scaffolds) than are currently found in the public database (Figure 1). This approach can lead to the identification of large (and sometimes small) genes that would have otherwise been fragmented or missing and, therefore, not detected using HGP data. For example, using Celera we have published manuscripts describing the CELSR2 (26 kb at 1p13–p21), RBM15 (8 kb at 1p13), c7orf10 (700 kb at 7p14), IMMP2L (860 kb at 7q31), RAY1/ST7 (220 kb at 7q31), CORTBP2 (170 kb at 7q31), and CASPR2 (2300 kb at 7q35) genes that, at the time, were not properly represented in HGP data. Our analysis of over 100 known full-length genes on chromosome 7 indicates that they encompass an average of 50 kb of DNA (consistent with a chromosome 21 gene size of 57 kb). This suggests that the annotation of genes, in particular by the HGP, will become more accurate as the genome sequence moves from draft to finished form. As Hogenesch and colleagues have shown, however, the current Celera and Ensembl (HGP) sets of predicted genes are largely mutually exclusive, suggesting that even when a consensus genome sequence is achieved, the resulting gene maps will still vary greatly.

An example of where the HGP clone-based strategy outperforms the Celera WGS approach is in the proper assembly of large, nearly identical DNA segments that occur in more than one copy in the genome. Such duplications might account for up to 5% of human DNA. When duplications are >50 kb in size, in our experience, they are not represented in large C3 or C4 scaffolds (they are found in the Celera ‘fragment’ database). The same sequences may also be underrepresented or mistakenly assembled by the HGP. However, we have found the HGP data usually to be more representative for these chromosomal regions, with the added advantage of having access to a physical resource (the clone) for confirmatory analyses. For example, duplications involved in Williams–Beuren syndrome at 7q11.23 are not represented in Celera scaffolds, but they are better covered by the HGP. The same seems to be true for duplications flanking microdeletion and pericentromeric regions, as well as polymorphic genomic duplications such as those observed on chromosome 15 in panic disorder. As in the latter case, some discrepancies found in different versions of the genome may be due to variation between the source(s) of DNA analyzed.

Importance of the mouse

The availability of the Celera mouse genome sequence has already become an indispensable resource for interpreting the human genome. We have tested 952 human chromosome 7 genes and found 832 (87%) of the mouse orthologs to be accurately assembled into scaffolds assigned to 8 different murine chromosomes (six representing known syntenies and two requiring confirmatory mapping). The murine sequence has been instrumental in defining human gene structure (Figure 2), finding new genes, annotating regulatory regions, and of course in biological studies of the mouse.

Figure 2

Comparison of 200 kb of Celera human (C4) and mouse DNA sequence encompassing the cystic fibrosis (CFTR) gene on human chromosome 7 and mouse chromosome 6, respectively. Each window represents 50 kb of syntenic DNA sequence displayed using the program VISTA (http://www-gsd.lbl.gov/vista). Each of the 27 CFTR exons was present in the assembled mouse sequence. Blue shading represents exons and red highlights other highly conserved sequences.

In addition, since many of the problematic duplications in the human genome described earlier are relatively recent in origin (occurring after divergence of mouse and human), the mouse sequence can often serve as a ruler to refine the human sequence. The HGP is also sequencing the mouse genome using a combined WGS and clone-based strategy, but an assembled genome sequence has not yet been obtained.

Incremental gains

So, in the end, until someone completes a definitive version of the human genome comparable to that available for chromosomes 21 and 22, but also with comprehensive annotation, the question “which is better” remains irrelevant. Any advantage the HGP or Celera might have over the other is incremental in nature. Gains by the HGP are usually small but swift, while Celera’s are massive but less dynamic. In fact, much of the discovery is fueled by having the ability to compare, contrast, and combine the different versions of the genome. For the past 12 months the availability of large amounts of human sequence at Celera not yet in the public databases more than justified our investment. We anticipate the same accelerated rate of discovery over the next year by having access to an assembled mouse genome otherwise not available in the public domain.

Ultimately, as the absolute value of base pairs levels out, the true measurement of value in these or any other databases will come from achieving a much higher level of DNA sequence, gene, and annotation, beyond what is now available.

Further reading

International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome sequence. Nature 2001, 409:860-921.
Venter JC, Adams MD, Meyers EW, Li PW, Mural PJ, Sutton GG, et al.: The sequence of the human genome. Science 2001, 291:1304-1361.
Green ED, Chakravarti A: The human genome sequence expedition: views from the “base camp”. Genome Res 2001, 11:645-651.
Katsanis N, Worley KC, Lupski JR: An evaluation of the draft human genome sequence. Nat Genet 2001, 29:88-91.
Hogenesch JB, Ching KA, Batalov S, Su AI, Walker JR, Zhou Y, et al.: A comparison of the Celera and Ensembl predicted gene sets reveals little overlap in novel genes. Cell 2001, 106:413-415.
Eichler EE: Segmental duplications: what’s missing, misassigned, and misassembled – and should we care? Genome Res 2001, 11:653-656.
Gratacos M, Nadal M, Martin-Santos R, Pujana MA, Gago J, Peral B, et al.: A polymorphic genomic duplication on human chromosome 15 is a susceptibility factor for panic and phobic disorders. Cell 2001, 106:367-379.
Stein L: Genome annotation: from sequence to biology. Nat Rev Genet 2001, 2:493-503.

Address: Genetics and Genomic Biology, The Hospital for Sick Children, Toronto, Canada.