Initial Sequencing and Analysis of the Human Genome

International Human Genome Sequencing Consortium*

* A partial list of authors appears on the opposite page. Affiliations are listed at the end of the paper.

The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

The rediscovery of Mendel's laws of heredity in the opening weeks of the 20th century [1–3] sparked a scientific quest to understand the nature and content of genetic information that has propelled biology for the last hundred years. The scientific progress made falls naturally into four main phases, corresponding roughly to the four quarters of the century. The first established the cellular basis of heredity: the chromosomes. The second defined the molecular basis of heredity: the DNA double helix. The third unlocked the informational basis of heredity, with the discovery of the biological mechanism by which cells read the information contained in genes and with the invention of the recombinant DNA technologies of cloning and sequencing by which scientists can do the same.

The last quarter of a century has been marked by a relentless drive to decipher first genes and then entire genomes, spawning the field of genomics. The fruits of this work already include the genome sequences of 599 viruses and viroids, 205 naturally occurring plasmids, 185 organelles, 31 eubacteria, seven archaea, one fungus, two animals and one plant.

Here we report the results of a collaboration involving 20 groups from the United States, the United Kingdom, Japan, France, Germany and China to produce a draft sequence of the human genome. The draft genome sequence was generated from a physical map covering more than 96% of the euchromatic part of the human genome and, together with additional sequence in public databases, it covers about 94% of the human genome. The sequence was produced over a relatively short period, with coverage rising from about 10% to more than 90% over roughly fifteen months. The sequence data have been made available without restriction and updated daily throughout the project. The task ahead is to produce a finished sequence, by closing all gaps and resolving all ambiguities. Already about one billion bases are in final form and the task of bringing the vast majority of the sequence to this standard is now straightforward and should proceed rapidly.

The sequence of the human genome is of interest in several respects. It is the largest genome to be extensively sequenced so far, being 25 times as large as any previously sequenced genome and eight times as large as the sum of all such genomes. It is the first vertebrate genome to be extensively sequenced. And, uniquely, it is the genome of our own species.

Much work remains to be done to produce a complete finished sequence, but the vast trove of information that has become available through this collaborative effort allows a global perspective on the human genome. Although the details will change as the sequence is finished, many points are already clear.

● The genomic landscape shows marked variation in the distribution of a number of features, including genes, transposable elements, GC content, CpG islands and recombination rate. This gives us important clues about function. For example, the developmentally important HOX gene clusters are the most repeat-poor regions of the human genome, probably reflecting the very complex coordinate regulation of the genes in the clusters.

● There appear to be about 30,000–40,000 protein-coding genes in the human genome, only about twice as many as in worm or fly. However, the genes are more complex, with more alternative splicing generating a larger number of protein products.

● The full set of proteins (the 'proteome') encoded by the human genome is more complex than those of invertebrates. This is due in part to the presence of vertebrate-specific protein domains and motifs (an estimated 7% of the total), but more to the fact that vertebrates appear to have arranged pre-existing components into a richer collection of domain architectures.

● Hundreds of human genes appear likely to have resulted from horizontal transfer from bacteria at some point in the vertebrate lineage. Dozens of genes appear to have been derived from transposable elements.

● Although about half of the human genome derives from transposable elements, there has been a marked decline in the overall activity of such elements in the hominid lineage. DNA transposons appear to have become completely inactive and long-terminal repeat (LTR) retroposons may also have done so.

● The pericentromeric and subtelomeric regions of chromosomes are filled with large recent segmental duplications of sequence from elsewhere in the genome. Segmental duplication is much more frequent in humans than in yeast, fly or worm.

● Analysis of the organization of Alu elements explains the longstanding mystery of their surprising genomic distribution, and suggests that there may be strong selection in favour of preferential retention of Alu elements in GC-rich regions and that these 'selfish' elements may benefit their human hosts.

● The mutation rate is about twice as high in male as in female meiosis, showing that most mutation occurs in males.

● Cytogenetic analysis of the sequenced clones confirms suggestions that large GC-poor regions are strongly correlated with 'dark G-bands' in karyotypes.

● Recombination rates tend to be much higher in distal regions (around 20 megabases (Mb)) of chromosomes and on shorter chromosome arms in general, in a pattern that promotes the occurrence of at least one crossover per chromosome arm in each meiosis.

● More than 1.4 million single nucleotide polymorphisms (SNPs) in the human genome have been identified. This collection should allow the initiation of genome-wide linkage disequilibrium mapping of the genes in the human population.

In this paper, we start by presenting background information on the project and describing the generation, assembly and evaluation of the draft genome sequence. We then focus on an initial analysis of the sequence itself: the broad chromosomal landscape; the repeat elements and the rich palaeontological record of evolutionary and biological processes that they provide; the human genes and proteins and their differences and similarities with those of other

860 © 2001 Macmillan Magazines Ltd NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com

Genome Sequencing Centres (Listed in order of total genomic sequence contributed, with a partial list of personnel. A full list of contributors at each centre is available as Supplementary Information.)

Whitehead Institute for Biomedical Research, Center for Genome Research: Eric S. Lander1*, Lauren M. Linton1, Bruce Birren1*, Chad Nusbaum1*, Michael C. Zody1*, Jennifer Baldwin1, Keri Devon1, Ken Dewar1, Michael Doyle1, William FitzHugh1*, Roel Funke1, Diane Gage1, Katrina Harris1, Andrew Heaford1, John Howland1, Lisa Kann1, Jessica Lehoczky1, Rosie LeVine1, Paul McEwan1, Kevin McKernan1, James Meldrim1, Jill P. Mesirov1*, Cher Miranda1, William Morris1, Jerome Naylor1, Christina Raymond1, Mark Rosetti1, Ralph Santos1, Andrew Sheridan1, Carrie Sougnez1, Nicole Stange-Thomann1, Nikola Stojanovic1, Aravind Subramanian1 & Dudley Wyman1

The Sanger Centre: Jane Rogers2, John Sulston2*, Rachael Ainscough2, Stephan Beck2, David Bentley2, John Burton2, Christopher Clee2, Nigel Carter2, Alan Coulson2, Rebecca Deadman2, Panos Deloukas2, Andrew Dunham2, Ian Dunham2, Richard Durbin2*, Lisa French2, Darren Grafham2, Simon Gregory2, Tim Hubbard2*, Sean Humphray2, Adrienne Hunt2, Matthew Jones2, Christine Lloyd2, Amanda McMurray2, Lucy Matthews2, Simon Mercer2, Sarah Milne2, James C. Mullikin2*, Andrew Mungall2, Robert Plumb2, Mark Ross2, Ratna Shownkeen2 & Sarah Sims2

Washington University Genome Sequencing Center:

Biotechnology: André Rosenthal12, Matthias Platzer12, Gerald Nyakatura12, Stefan Taudien12 & Andreas Rump12

Beijing Genomics Institute/Human Genome Center: Huanming Yang13, Jun Yu13, Jian Wang13, Guyang Huang14 & Jun Gu15

Multimegabase Sequencing Center, The Institute for Systems Biology: Leroy Hood16, Lee Rowen16, Anup Madan16 & Shizen Qin16

Stanford Genome Technology Center: Ronald W. Davis17, Nancy A. Federspiel17, A. Pia Abola17 & Michael J. Proctor17

Stanford Human Genome Center: Richard M. Myers18, Jeremy Schmutz18, Mark Dickson18, Jane Grimwood18 & David R. Cox18

University of Washington Genome Center: Maynard V. Olson19, Rajinder Kaul19 & Christopher Raymond19

Department of Molecular Biology, Keio University School of Medicine: Nobuyoshi Shimizu20, Kazuhiko Kawasaki20 & Shinsei Minoshima20

University of Texas Southwestern Medical Center at Dallas: Glen A. Evans21†, Maria Athanasiou21 & Roger Schultz21

University of Oklahoma's Advanced Center for Genome Technology: Bruce A. Roe22, Feng Chen22 & Huaqin Pan22

Max Planck Institute for Molecular Genetics: Juliane

