Whole-Genome Shotgun Assembly and Analysis of the Genome
Total Page:16
File Type:pdf, Size:1020Kb
R ESEARCH A RTICLE of the human genome, it contains a comparable complement of protein-coding genes, as in- Whole-Genome Shotgun ferred from random genomic sampling (3). Subsequently, more-targeted analyses (5–9) Assembly and Analysis of the showed that the Fugu genome has remarkable homologies to the human sequence. The intron- Genome of Fugu rubripes exon structure of most genes is preserved be- tween Fugu and human, in some cases with Samuel Aparicio,2,1* Jarrod Chapman,3 Elia Stupka,1* conserved alternative splicing (10). The relative Nik Putnam,3 Jer-ming Chia,1 Paramvir Dehal,3 compactness of the Fugu genome is accounted Alan Christoffels,1 Sam Rash,3 Shawn Hoon,1 Arian Smit,4 for by the proportional reduction in the size of introns and intergenic regions, in part owing to Maarten D. Sollewijn Gelpke,3 Jared Roach,4 Tania Oh,1 3 1 3 1 the relative scarcity of repeated sequences like Isaac Y. Ho, Marie Wong, Chris Detter, Frans Verhoef, those that litter the human genome. Conserva- 3 1 3 3 Paul Predki, Alice Tay, Susan Lucas, Paul Richardson, tion of synteny was discovered between hu- Sarah F. Smith,5 Melody S. Clark,5 Yvonne J. K. Edwards,5 mans and Fugu (5, 6), suggesting the possibil- Norman Doggett,6 Andrey Zharkikh,7 Sean V. Tavtigian,7 ity of identifying chromosomal elements from Dmitry Pruss,7 Mary Barnstead,8 Cheryl Evans,8 Holly Baden,8 the common ancestor. Noncoding sequence Justin Powell,9 Gustavo Glusman,4 Lee Rowen,4 Leroy Hood,4 comparisons detected core conserved regulato- ry elements in mice (11). This methodology has Y. H. Tan,1 Greg Elgar,5* Trevor Hawkins,3*† 1 3 1,10 subsequently been used for identifying con- Byrappa Venkatesh, * Daniel Rokhsar, * Sydney Brenner * served elements in several other loci (12–24). These remarkable homologies, conserved over The compact genome of Fugu rubripes has been sequenced to over 95% the 450 million years since the last common coverage, and more than 80% of the assembly is in multigene-sized scaffolds. ancestor of humans and teleost fish, combined In this 365-megabase vertebrate genome, repetitive DNA accounts for less than with the compact nature of the Fugu sequence, one-sixth of the sequence, and gene loci occupy about one-third of the genome. led to the formation of the Fugu Genome Con- As with the human genome, gene loci are not evenly distributed, but are sortium to sequence the pufferfish genome. clustered into sparse and dense regions. Some “giant” genes were observed that had average coding sequence sizes but were spread over genomic lengths Whole-Genome Shotgun Sequencing significantly larger than those of their human orthologs. Although three-quar- and Assembly of the Fugu rubripes ters of predicted human proteins have a strong match to Fugu, approximately Genome a quarter of the human proteins had highly diverged from or had no pufferfish Sequencing and assembly. Shotgun libraries homologs, highlighting the extent of protein evolution in the 450 million years were prepared from genomic DNA that had since teleosts and mammals diverged. Conserved linkages between Fugu and been purified from the testis of a single animal human genes indicate the preservation of chromosomal segments from the to minimize complications due to allelic poly- common vertebrate ancestor, but with considerable scrambling of gene order. morphisms. These polymorphisms are estimat- ed to occur at 0.4% of the nucleotides in our Introduction sons between the genomes of different animals individual fish, ϳfourfold as many as in human Most of the genetic information that governs will guide future approaches to understanding (25). We set out to generate ϳ6ϫ genome how humans develop and function is encoded gene function and regulation. A decade ago, coverage of the Fugu genome (Table 1). Sev- in the human genome sequence (1, 2), but our analysis of the compact genome of the puffer- eral plasmid libraries with 2- and 5.5-kb understanding of the sequence is limited by our fish Fugu rubripes was proposed (3) as a cost- inserts were constructed and end-sequenced ability to retrieve meaning from it. Compari- effective way to illuminate the human sequence by dye terminator and dye primer chemis- through comparative analysis within the verte- tries. The bulk of the sequence coverage brates. We report here the sequencing and ini- resulted from 2-kb libraries (Table 1). 1Institute of Molecular and Cell Biology, 30 Medical Drive, Singapore 117609. 2University of Cambridge, tial analysis of the Fugu genome, the first pub- However, the 5.5-kb library provided cru- Department of Oncology, Hutchison–MRC Research licly available draft vertebrate genome to be cial intermediate-range linking information Centre, Cambridge CB22XZ,UK. 3U.S. DoE Joint published after the human genome. By compar- for assembly. Genome Institute, 2800 Mitchell Drive, Walnut Creek, ison with mammalian genomes the task was Reads passing the primary quality and CA 94598, USA. 4Institute for Systems Biology, 1441 North 34th Street, Seattle, WA 98103, USA. 5MRC UK modest, since almost an order of magnitude less vector screens (“passing reads”) were assem- HGMP Resource Centre, Hinxton, Cambridge CB10 effort is needed to obtain a comparable amount bled into scaffolds by means of JAZZ, a 1SB, UK. 6Los Alamos National Laboratory, Los of information. modular suite of tools for large shotgun as- Alamos, NM 87545, USA. 7Myriad Genetics Inc., 320 Fugu rubripes, commonly known as “tora- semblies that incorporates both read-overlap Wakara Way, Salt Lake City, UT 84108, USA. 8Celera Genomics, 45 West Gude Drive, Rockville, MD 20850, fugu,” is a teleost fish belonging to the Order and read-pairing information USA. 9Paradigm Therapeutics Ltd., Physiological Lab- Tetraodontiformes and Family Tetraodontidae. The 3.71 million passing reads were assem- oratory, Cambridge CB23EG, UK. 10Salk Institute, Its natural habitat spans the Sea of Japan, the bled into 12,381 scaffolds longer than 2 kb, for 10010 North Torrey Pines Road, La Jolla, San Diego, East China Sea, and the Yellow Sea. Early a total of 332.5 Mb. The scaffolding range and CA 92037–1099, USA. work (4) suggested that Tetraodontiformes contiguity of the assembly are shown in table *To whom correspondence and requests for materials should be addressed. E-mail: [email protected] have low nuclear DNA content [less than 500 S1. A total of 745 scaffolds longer than 100 kb (S.A.), [email protected] (E.S.), [email protected] million base pairs (Mb) per haploid genome], account for 35% of the assembled sequence (G.E.), [email protected] which led to the conjecture that the genomes of (119.5 Mb); 1908 scaffolds longer than 50 kb (T.H.), [email protected] (B.V.), dsrokhsar@lbl. these creatures were compact in organization. account for 60% of the assembly (200.8 Mb); gov (D.R.), [email protected] (S.B.). †Present address: Amersham Biosciences, 928 East Although the Fugu genome is unusually small 4108 scaffolds longer than 20 kb account for Arques Avenue, Sunnyvale, CA 945085, USA. for a vertebrate, at about one-eighth the length 81% of the assembly (271 Mb). www.sciencemag.org SCIENCE VOL 297 23 AUGUST 2002 1301 R ESEARCH A RTICLE These scaffolds contain 45,024 contigs that of the remaining four have 6- to 8.5-kb pieces shotgun assembly by a 500-bp inversion at one total 322.5 Mb of assembled sequence. The missing from the scaffolds in regions that are end of the BAC. Small cloning inversions have remaining 10 Mb of scaffold sequence consists clearly repetitive, on the basis of their depth of been noted on BAC and cosmid clone ends in of 32,621 “captured” or “sequence-mapped” high-quality coverage. The fourth (GenBank ac- previous studies and may explain this discrep- gaps (i.e., gaps flanked by contigs that are cession number AH007668) contains the T cell ancy. The JAZZ assembly at this location is connected by spanning clones). These gaps receptor (TCR)–␣ locus and was matched in the supported by strong paired-end linking informa- were up to 4 kb in length, with an average size assembly only within the coding sequence of the tion; raw sequence data for the BAC itself were of 306 base pairs (bp). These gaps are indicated V region, suggesting a cloning or assembly unavailable. As the BAC and shotgun sequences in the scaffold sequence by runs of N’s whose problem. Similarly, all exons of a well-annotat- are from different individual fish, this is a pos- length is the best estimate of gap size on the ed set of 209 Fugu genes from GenBank could sible polymorphism (25). Unlike the human ge- basis of the spanning clones; by convention, be located within the assembly, with the excep- nome, there is no chromosomal or genetic in- gaps projected to be shorter than 50 bp are tion of two odorant receptor genes that were formation on gene loci that requires integration, indicated by 50 N’s. Gaps account for Ͻ3% of found in the unassembled reads. The single- nor in this present assembly was a physical the total scaffold length. exon odorant receptor genes are often found in clone map integrated with the genome sequence. Five percent of the passing reads were with- tandem arrays separated by repetitive sequence, The scaffolds are therefore not mapped onto held from assembly as being from high percent which may account for their absence in the Fugu chromosomes. nucleotide identity, high–copy number repeats current assembly. (25). About 20% of these reads have sisters The accuracy of the sequence was measured Preliminary Annotation and Analysis placed in the assembly and therefore should by comparing the assembly consensus with the of the Fugu Genome contribute to filling in some captured gaps; this finished sequence of cosmid 165K09 (Gen- We annotated the scaffolds with putative gene gap closing is ongoing.