BIT 815L DEEP SEQUENCING
GENOMICS*

RM Kelly
Department of Chemical and Biomolecular Engineering
North Carolina State University
3313 Partners II Bldg
Phone: (919) 515-6396
Email: [email protected]

*Based on lecture notes provided by Michael W.W. Adams, Department of Biochemistry and Molecular Biology, University of Georgia

20th Century: The Era of Physics - combustion engine, microelectronics, nuclear energy, cell phones
21st Century: The Era of Biology - molecular medicine, pharmaceuticals, biotechnology, bioenergy
GENOMICS - will drive biology and biotechnology in the 21st Century

The Genomics Revolution
[Figure: number of genomes and metagenomes sequenced per year, 1995-2011, with E. coli (8th)* and Human (44th)* marked. Source: Genomes OnLine Database, http://genomesonline.org/]

History of the Genomics Revolution?
Driving Force: “Completion” of the Human Genome
February, 2001
http://www.nature.com/nature/journal/v409/n6822/abs/409860a0.html
http://www.sciencemag.org/cgi/content/abstract/291/5507/1304

Nature 431, 931-945 (21 October 2004)
Finishing the euchromatic sequence of the human genome
International Human Genome Sequencing Consortium

No TIME article
No press conferences
No political celebrations!

Abstract
The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers 99% of the euchromatic genome and is accurate to an error rate of 1 event per 100,000 bases.
Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.

Oct. 2004 - Finished … ? Not quite …… annotating/polishing individual chromosomes ……

The last human chromosome (#1) sequence is “finished”
May 2006: Finally - the last chromosome is manually annotated

17 May 2006
Researchers have reached a landmark point in one of the world's most important scientific projects by sequencing the last chromosome in the human genome, the so-called 'book of life'. The sequence of chromosome 1, published in Nature this week, took a team at the Wellcome Trust Sanger Institute and colleagues in the UK and USA ten years to sequence. Chromosome 1 is the largest of the human chromosomes, containing about 8 per cent of all human genetic information. It is home to 3141 genes, nearly twice as many as the average chromosome, and linked to more than 350 known diseases, including cancer, Parkinson's and Alzheimer's disease, and high cholesterol. The sequence is expected to help researchers around the world find novel diagnostics and treatments for cancers, autism, mental disorders and other illnesses. This was the final chromosome to be sequenced by the Human Genome Project, started in 1990 to identify the genes and DNA sequences that provide a blueprint for human beings.

Reference Human Genome (composite of anonymous donors)
“Completed” May, 2006
22,300 protein-coding genes
> 99% of euchromatin (2.8 of 3.2 Gb)
Error: < 1 in 100,000

Genome Reference Consortium (2009)
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/index.shtml (Jan.
20th, 2011 - still updating ….)

NCBI Human Genome
http://www.ncbi.nlm.nih.gov/projects/genome/guide/human/
RefSeq, Announcements, Statistics

HGP: 1990 - 2006, 16 years, ~ $3 billion
Mammalian Gene Collection: http://mgc.nci.nih.gov/ (‘Completed’ March 2009)

CDS: CoDing Sequence, the region of nucleotides that corresponds to the sequence of amino acids in the predicted protein. The CDS includes start and stop codons, therefore coding sequences begin with an "ATG" and end with a stop codon.

Human-human: Much of our genetic variation is caused by single-nucleotide differences in our DNA code: these are called single nucleotide polymorphisms, or SNPs. As a result, each of us has a unique genetic code that typically differs in about three million nucleotides from every other person (1 in ~ 1,500).

Sequence the genomes of individual humans?
$100 million, Sanger sequencing. Ref: 12.3 Mb variation/2.8 Gb; 3.2 M SNPs; ~850,000 indels!
James D. Watson: <$1 million, “454” sequencing (2nd Gen). 3.3 M SNPs. Ref: + 29 Mb new!
Ethics? 2001 draft of human genome - anonymous donors ….

Nature 452, 872-876 (17 April 2008)
The complete genome of an individual by massively parallel DNA sequencing

A Remarkable Issue of Nature - Nov. 2008!
“454” sequencing; 3 M SNPs
Nature. 2008 Nov 6;456(7218):60-5. Comment in: Nature. 2008 Nov 6;456(7218):49-51.
The diploid genome sequence of an Asian individual.
Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, Yang Z, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, Qin J, Ma L, Li G, Yang Z, Zhang G, Yang B, Yu C, Liang F, Li W, Li S, Li D, Ni P, Ruan J, Li Q, Zhu H, Liu D, Lu Z, Li N, Guo G, Zhang J, Ye J, Fang L, Hao Q, Chen Q, Liang Y, Su Y, San A, Ping C, Yang S, Chen F, Li L, Zhou K, Zheng H, Ren Y, Yang L, Gao Y, Yang G, Li Z, Feng X, Kristiansen K, Wong GK, Nielsen R, Durbin R, Bolund L, Zhang X, Li S, Yang H, Wang J.
Beijing Genomics Institute at Shenzhen, Shenzhen 518000, China. [email protected]

Here we present the first diploid genome sequence of an Asian individual. The genome was sequenced to 36-fold average coverage using massively parallel sequencing technology. We aligned the short reads onto the NCBI human reference genome to 99.97% coverage, and guided by the reference genome, we used uniquely mapped reads to assemble a high-quality consensus sequence for 92% of the Asian individual's genome. We identified approximately 3 million single-nucleotide polymorphisms (SNPs) inside this region, of which 13.6% were not in the dbSNP database. Genotyping analysis showed that SNP identification had high accuracy and consistency, indicating the high sequence quality of this assembly. We also carried out heterozygote phasing and haplotype prediction against HapMap CHB and JPT haplotypes (Chinese and Japanese, respectively), sequence comparison with the two available individual genomes (J. D. Watson and J. C. Venter), and structural variation identification. These variations were considered for their potential biological impact. Our sequence data and analyses demonstrate the potential usefulness of next-generation sequencing technologies for personal genomics.

Nature. 2008 Nov 6;456(7218):53-9. Comment in: Nature. 2008 Nov 6;456(7218):49-51.
Accurate whole human genome sequencing using reversible terminator chemistry.
Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, …………………………. J, Quijano VJ, Raczy C, Rae VH, Rawlings SR, Chiva Rodriguez A, Roe PM, Rogers J, Rogert Bacigalupo MC, Romanov N, Romieu A, Roth RK, Rourke NJ, Ruediger ST, Rusman E, Sanches-Kuiper RM, Schenker MR, Seoane JM, Shaw RJ, Shiver MK, Short SW, Sizto NL, Sluis JP, Smith MA, Ernest Sohna Sohna J, Spence EJ, Stevens K, Sutton N, Szajkowski L, Tregidgo CL, Turcatti G, Vandevondele S, Verhovsky Y, Virk SM, Wakelin S, Walcott GC, Wang J, Worsley GJ, Yan J, Yau L, Zuerlein M, Rogers J, Mullikin JC, Hurles ME, McCooke NJ, West JS, Oaks FL, Lundberg PL, Klenerman D, Durbin R, Smith AJ.
Illumina Cambridge Ltd. (Formerly Solexa Ltd), Essex CB10 1XL, UK. [email protected]

DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit. Sequencing projects have traditionally used long (400-800 base pair) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intraspecies genetic variation. Here we report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high-quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30x average depth of paired 35-base reads.
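To give a feel for what ">30x average depth of paired 35-base reads" implies, the sketch below estimates how many reads that depth requires, using the usual relation depth = (number of reads x read length) / genome size. It is only a rough illustration: the 3.2 Gb genome size is an assumption taken from the "2.8 of 3.2 Gb" figure quoted earlier in these notes, the function names are hypothetical, and it ignores unmapped, duplicate, and low-quality reads.

```python
import math


def reads_needed(genome_size_bp: int, read_length_bp: int, target_depth: int) -> int:
    """Smallest number of reads whose total bases reach the target average depth."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


def average_depth(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


# Assumed inputs (not stated together in the abstract itself).
GENOME_SIZE_BP = 3_200_000_000  # ~3.2 Gb human genome, assumption from these notes
READ_LENGTH_BP = 35             # paired 35-base reads, from the abstract
TARGET_DEPTH = 30               # lower bound of ">30x", from the abstract

n_reads = reads_needed(GENOME_SIZE_BP, READ_LENGTH_BP, TARGET_DEPTH)
n_pairs = math.ceil(n_reads / 2)  # each pair contributes two reads

print(f"reads needed: {n_reads:,}")
print(f"read pairs needed: {n_pairs:,}")
print(f"depth achieved: {average_depth(n_reads, READ_LENGTH_BP, GENOME_SIZE_BP):.2f}x")
```

Under these assumptions the answer is on the order of a few billion short reads, which is why this kind of re-sequencing depends on massively parallel instruments rather than Sanger-style long reads.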
We characterize four million single-nucleotide polymorphisms and four hundred thousand structural variants, many of which were previously unknown. Our approach is effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.

HGP/Venter by Sanger. Watson/Asian by 454. Nigerian by Illumina.

Nature 456, 66-72 (6 November 2008)
Received 28 May 2008; Accepted 16 September 2008
DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome
Timothy J.