BIT 815L DEEP SEQUENCING

GENOMICS*

RM Kelly

Department of Chemical and Biomolecular Engineering North Carolina State University

3313 Partners II Bldg Phone: (919) 515-6396 Email: [email protected]

*Based on Lecture Notes provided by: Michael W.W. Adams, Department of Biochemistry and Molecular Biology, University of Georgia

20th Century: The Era of Physics

- combustion engine, microelectronics, nuclear energy, cell phones

21st Century: The Era of Biology

- molecular medicine, pharmaceuticals, biotechnology, bioenergy

GENOMICS

- will drive biology and biotechnology in the 21st Century

The Genomics Revolution

[Figure: number of genomes and metagenomes sequenced per year, 1995-2011, with E. coli (8th)* and Human (44th)* marked. Source: Genomes OnLine Database, http://genomesonline.org/]

History of the Genomics Revolution?

Driving Force: “Completion” of the Human Genome

February, 2001
http://www.nature.com/nature/journal/v409/n6822/abs/409860a0.html
http://www.sciencemag.org/cgi/content/abstract/291/5507/1304

Nature 431, 931-945 (21 October 2004)
Finishing the euchromatic sequence of the human genome
International Human Genome Sequencing Consortium

- No TIME article
- No press conferences
- No political celebrations!

Abstract

The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers 99% of the euchromatic genome and is accurate to an error rate of 1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.

Oct. 2004 - Finished … ?

Not quite …… annotating/polishing individual chromosomes ……

May 2006: Finally - the last (#1) chromosome is manually annotated

The last human chromosome sequence is “finished”

17 May 2006

Researchers have reached a landmark point in one of the world's most important scientific projects by sequencing the last chromosome in the human genome, the so-called 'book of life'.

The sequence of chromosome 1, published in Nature this week, took a team at the Wellcome Trust Sanger Institute and colleagues in the UK and USA ten years to sequence.

Chromosome 1 is the largest of the human chromosomes, containing about 8 per cent of all human genetic information. It is home to 3141 genes, nearly twice as many as the average chromosome, and linked to more than 350 known diseases, including cancer, Parkinson's and Alzheimer's disease, and high cholesterol.

The sequence is expected to help researchers around the world find novel diagnostics and treatments for cancers, autism, mental disorders and other illnesses.

This was the final chromosome to be sequenced by the Human Genome Project, started in 1990 to identify the genes and DNA sequences that provide a blueprint for human beings.

Reference Human Genome (composite of anonymous donors)

“Completed” May, 2006

- 22,300 protein-coding genes
- > 99% of euchromatin (2.8 of 3.2 Gb)
- Error: < 1 in 100,000

Genome Reference Consortium (2009) http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/index.shtml (Jan. 20th, 2011 – still updating ….)

NCBI Human Genome
http://www.ncbi.nlm.nih.gov/projects/genome/guide/human/
RefSeq, Announcements, Statistics

HGP: 1990 – 2006, 16 years, ~ $3 billion

Mammalian Gene Collection (‘Completed’ March 2009)
http://mgc.nci.nih.gov/

CDS: CoDing Sequence, the region of nucleotides that corresponds to the sequence of amino acids in the predicted protein. The CDS includes start and stop codons; therefore, coding sequences begin with an "ATG" and end with a stop codon.

Human-human: Much of our genetic variation is caused by single-nucleotide differences in our DNA code: these are called single nucleotide polymorphisms, or SNPs. As a result, each of us has a unique genetic code that typically differs in about three million nucleotides from every other person (1 in ~ 1,500).
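The CDS definition above can be turned into a quick sanity check. The sketch below is illustrative only: the function name `looks_like_cds` is hypothetical, and it assumes the standard genetic code (stop codons TAA, TAG, TGA) and a DNA string given 5' to 3' on the coding strand. It tests only the shape described above (ATG start, whole codons, a single stop codon at the end), not whether the sequence is a real gene.

```python
# Assumption: standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def looks_like_cds(seq: str) -> bool:
    """Return True if seq has the basic shape of a CDS.

    Checks: length is a multiple of 3, first codon is ATG,
    last codon is a stop codon, and no stop codon occurs earlier in frame.
    """
    s = seq.upper()
    if len(s) < 6 or len(s) % 3 != 0:
        return False
    codons = [s[i:i + 3] for i in range(0, len(s), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # A premature in-frame stop would end translation before the last codon.
    return not any(c in STOP_CODONS for c in codons[1:-1])


if __name__ == "__main__":
    print(looks_like_cds("ATGGCCTAA"))     # start + one codon + stop
    print(looks_like_cds("ATGTAAGCCTAA"))  # premature in-frame stop
```

Real annotation pipelines also handle alternative start codons, introns, and partial genes, none of which this check attempts.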

Sequence the genomes of individual humans?

- J. Craig Venter: $100 million, Sanger sequencing; 3.2 M SNPs, ~850,000 indels! (vs. Ref: 12.3 Mb variation/2.8 Gb)
- James D. Watson: <$1 million, “454” sequencing (2nd Gen); 3.3 M SNPs (vs. Ref: + 29 Mb new!)

Ethics? 2001 draft of human genome – anonymous donors ….

Nature 452, 872-876 (17 April 2008)
The complete genome of an individual by massively parallel DNA sequencing

A Remarkable Issue of Nature - Nov. 2008!

Nature. 2008 Nov 6;456(7218):60-5. Comment in: Nature. 2008 Nov 6;456(7218):49-51.
The diploid genome sequence of an Asian individual.
Illumina sequencing; 3 M SNPs

Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, Yang Z, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, Qin J, Ma L, Li G, Yang Z, Zhang G, Yang B, Yu C, Liang F, Li W, Li S, Li D, Ni P, Ruan J, Li Q, Zhu H, Liu D, Lu Z, Li N, Guo G, Zhang J, Ye J, Fang L, Hao Q, Chen Q, Liang Y, Su Y, San A, Ping C, Yang S, Chen F, Li L, Zhou K, Zheng H, Ren Y, Yang L, Gao Y, Yang G, Li Z, Feng X, Kristiansen K, Wong GK, Nielsen R, Durbin R, Bolund L, Zhang X, Li S, Yang H, Wang J.

Beijing Genomics Institute at Shenzhen, Shenzhen 518000, China. [email protected]

Here we present the first diploid genome sequence of an Asian individual. The genome was sequenced to 36-fold average coverage using massively parallel sequencing technology. We aligned the short reads onto the NCBI human reference genome to 99.97% coverage, and guided by the reference genome, we used uniquely mapped reads to assemble a high-quality consensus sequence for 92% of the Asian individual's genome. We identified approximately 3 million single-nucleotide polymorphisms (SNPs) inside this region, of which 13.6% were not in the dbSNP database. Genotyping analysis showed that SNP identification had high accuracy and consistency, indicating the high sequence quality of this assembly. We also carried out heterozygote phasing and haplotype prediction against HapMap CHB and JPT haplotypes (Chinese and Japanese, respectively), sequence comparison with the two available individual genomes (J. D. Watson and J. C. Venter), and structural variation identification. These variations were considered for their potential biological impact. Our sequence data and analyses demonstrate the potential usefulness of next-generation sequencing technologies for personal genomics.

Nature. 2008 Nov 6;456(7218):53-9. Comment in: Nature. 2008 Nov 6;456(7218):49-51. Accurate whole human genome sequencing using reversible terminator chemistry.

Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, …………………………. J, Quijano VJ, Raczy C, Rae VH, Rawlings SR, Chiva Rodriguez A, Roe PM, Rogers J, Rogert Bacigalupo MC, Romanov N, Romieu A, Roth RK, Rourke NJ, Ruediger ST, Rusman E, Sanches-Kuiper RM, Schenker MR, Seoane JM, Shaw RJ, Shiver MK, Short SW, Sizto NL, Sluis JP, Smith MA, Ernest Sohna Sohna J, Spence EJ, Stevens K, Sutton N, Szajkowski L, Tregidgo CL, Turcatti G, Vandevondele S, Verhovsky Y, Virk SM, Wakelin S, Walcott GC, Wang J, Worsley GJ, Yan J, Yau L, Zuerlein M, Rogers J, Mullikin JC, Hurles ME, McCooke NJ, West JS, Oaks FL, Lundberg PL, Klenerman D, Durbin R, Smith AJ.

Illumina Cambridge Ltd. (Formerly Solexa Ltd), Essex CB10 1XL, UK. [email protected]

DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit. Sequencing projects have traditionally used long (400-800 base pair) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intraspecies genetic variation. Here we report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high-quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30x average depth of paired 35-base reads. We characterize four million single-nucleotide polymorphisms and four hundred thousand structural variants, many of which were previously unknown. Our approach is effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.

HGP/Venter by Sanger. Watson by 454. Asian/Nigerian by Illumina.
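To get a feel for what ">30x average depth of paired 35-base reads" implies, average depth can be estimated as (number of reads x read length) / genome size. The sketch below is a back-of-the-envelope calculation, not the authors' method: the function names are hypothetical, the 3.2 Gb genome size is taken from the reference-genome figures earlier in these notes, and it assumes every read maps and coverage is uniform, which real data never quite achieve.

```python
def mean_depth(n_reads: int, read_len: int, genome_bp: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    return n_reads * read_len / genome_bp


def reads_needed(depth: int, genome_bp: int, read_len: int) -> int:
    """Smallest number of reads giving at least `depth`-fold average coverage.

    Uses integer ceiling division to avoid floating-point rounding.
    """
    total_bases = depth * genome_bp
    return -(-total_bases // read_len)


if __name__ == "__main__":
    GENOME_BP = 3_200_000_000  # assumption: ~3.2 Gb human genome
    READ_LEN = 35              # read length quoted in the abstract above
    n = reads_needed(30, GENOME_BP, READ_LEN)
    print(f"reads for 30x: {n:,}")
    print(f"depth check:   {mean_depth(n, READ_LEN, GENOME_BP):.2f}x")
```

Under these assumptions, 30x depth with 35-base reads takes on the order of a few billion reads, which is why short-read platforms rely on massive parallelism.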

Nature 456, 66-72 (6 November 2008) Received 28 May 2008; Accepted 16 September 2008 DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome

Timothy J. Ley1,2,3,4,9, ……….. John F. DiPersio1,4 & Richard K. Wilson2,3,4

Washington University School of Medicine, St. Louis, Missouri 63108, USA, University of Washington, Seattle, Washington 98195, USA

This article is distributed under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike licence (http://creativecommons.org/licenses/by-nc-sa/3.0/), which permits distribution, and reproduction in any medium, provided the original author and source are credited. This licence does not permit commercial exploitation, and derivative works must be licensed under the same or similar licence.

Acute myeloid leukaemia is a highly malignant haematopoietic tumour that affects about 13,000 adults in the United States each year. The treatment of this disease has changed little in the past two decades, because most of the genetic events that initiate the disease remain undiscovered. Whole-genome sequencing is now possible at a reasonable cost and timeframe to use this approach for the unbiased discovery of tumour-specific somatic mutations that alter the protein-coding genes. Here we present the results obtained from sequencing a typical acute myeloid leukaemia genome, and its matched normal counterpart obtained from the same patient's skin. We discovered ten genes with acquired mutations; two were previously described mutations that are thought to contribute to tumour progression, and eight were new mutations present in virtually all tumour cells at presentation and relapse, the function of which is not yet known. Our study establishes whole-genome sequencing as an unbiased method for discovering cancer-initiating mutations in previously unidentified genes that may respond to targeted therapies.

Can anyone get their genome sequenced?

http://www.knome.com/

March 4, 2008

Jan. 11, 2009
http://www.nytimes.com/2009/01/11/magazine/11Genome-t.html

http://www.knome.com/
May 2008

http://www.genome.gov/24519851 July 23, 2009

Limiting factor: not the technology – but what it all means …….. (and costs are coming down ……) http://www.knome.com/

2011: individuals?

2011: cost?

Knome – Case Studies

Exomes vs. Complete Sequence vs. Epigenetics?

Nature – Sept. 2009

Nature 461, 272-276
Targeted capture and massively parallel sequencing of 12 human exomes

Sarah B. Ng1, Emily H. Turner1, Peggy D. Robertson1, Steven D. Flygare1, Abigail W. Bigham2, Choli Lee1, Tristan Shaffer1, Michelle Wong1, Arindam Bhattacharjee4, Evan E. Eichler1,3, Michael Bamshad2, Deborah A. Nickerson1 & Jay Shendure1

Department of Genome Sciences, Department of Pediatrics, University of Washington, Howard Hughes Medical Institute, Seattle, Washington 98195, USA; Agilent Technologies, Santa Clara, California 95051, USA

Genome-wide association studies suggest that common genetic variants explain only a modest fraction of heritable risk for common diseases, raising the question of whether rare variants account for a significant fraction of unexplained heritability. Although DNA sequencing costs have fallen markedly, they remain far from what is necessary for rare and novel variants to be routinely identified at a genome-wide scale in large cohorts. We have therefore sought to develop second-generation methods for targeted sequencing of all protein-coding regions (‘exomes’), to reduce costs while enriching for discovery of highly penetrant variants. Here we report on the targeted capture and massively parallel sequencing of the exomes of 12 humans. These include eight HapMap individuals representing three populations, and four unrelated individuals with a rare dominantly inherited disorder, Freeman–Sheldon syndrome (FSS). We demonstrate the sensitive and specific identification of rare and common variants in over 300 megabases of coding sequence. Using FSS as a proof-of-concept, we show that candidate genes for Mendelian disorders can be identified by exome sequencing of a small number of unrelated, affected individuals. This strategy may be extendable to diseases with more complex genetics through larger sample sizes and appropriate weighting of non-synonymous variants by predicted functional impact.

Exome: protein-coding ~1% of the genome, or ~30 Mb split into ~180,000 exons

http://www.23andme.com/
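The exome figures quoted above (~30 Mb in ~180,000 exons, ~1% of the genome) can be cross-checked with a few lines of arithmetic. This is only a rough sketch: the constant names are made up for illustration, and the 3.2 Gb genome size is an assumption carried over from the reference-genome figures earlier in these notes.

```python
# Figures quoted in the notes (approximate).
EXOME_BP = 30_000_000      # ~30 Mb of protein-coding sequence
N_EXONS = 180_000          # ~180,000 exons
# Assumption: total human genome size of ~3.2 Gb.
GENOME_BP = 3_200_000_000

# Average exon length implied by the two exome figures.
mean_exon_bp = EXOME_BP / N_EXONS

# Fraction of the genome that is exome.
exome_fraction = EXOME_BP / GENOME_BP

if __name__ == "__main__":
    print(f"mean exon length: {mean_exon_bp:.0f} bp")
    print(f"exome fraction:   {exome_fraction:.2%}")
```

An average exon well under 200 bp is one reason exome sequencing depends on targeted capture: the coding sequence is scattered across the genome in many short pieces.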

Sequencing vs Arrays/Chips

Cost of Complete Sequence? Jan. 2009 Knome $95,000

Now? January 1, 2010 Science (2010) 327, 78-81

3rd Gen Sequencing! http://www.everygenome.com/

January, 2011

The Genomics Revolution

[Figure: number of genomes sequenced per year, 1995-2011, with E. coli (8th)* and Human (44th)* marked and the annotation "You? < $10,000?"]

Knome: http://www.knome.com/
Illumina: http://www.everygenome.com

What can we learn from our close relatives?

http://www.ncbi.nlm.nih.gov/genome/guide/chimp/

- You-neighbor: 3 M SNPs, 1.0 M indels
- You-chimp: 35 M SNPs, 5.0 M indels
- Venter vs Ref Human: 12.3 Mb variation/2.8 Gb
- Venter vs Chimp: 150 Mb variation/2.8 Gb
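The Venter comparisons above are easier to read as percentages of the 2.8 Gb compared. The sketch below is a simple unit conversion, assuming 1 Gb = 1,000 Mb; the function name `percent_variation` is hypothetical and the inputs are just the numbers quoted in these notes.

```python
def percent_variation(variant_mb: float, compared_gb: float) -> float:
    """Variant sequence as a percentage of the sequence compared."""
    return 100 * variant_mb / (compared_gb * 1000)  # 1 Gb = 1,000 Mb


if __name__ == "__main__":
    human = percent_variation(12.3, 2.8)    # Venter vs reference human
    chimp = percent_variation(150.0, 2.8)   # Venter vs chimp
    print(f"Venter vs Ref Human: {human:.2f}%")
    print(f"Venter vs Chimp:     {chimp:.2f}%")
    print(f"chimp/human ratio:   {chimp / human:.1f}x")
    print(f"SNP ratio (35 M / 3 M): {35 / 3:.1f}x")
```

On these figures, both the variant-sequence ratio and the SNP ratio put the human-chimp difference at roughly an order of magnitude above the human-human difference.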


Nov., 2008

http://www.sciencedaily.com/releases/2008/11/081105191731.htm Genome Res. 2008 Nov;18(11):1698-710.  Copy number variation and evolution in humans and chimpanzees. Perry GH et al.

Copy number variants (CNVs) underlie many aspects of human phenotypic diversity and provide the raw material for gene duplication and gene family expansion. However, our understanding of their evolutionary significance remains limited. We performed comparative genomic hybridization on a single human microarray platform to identify CNVs among the genomes of 30 humans and 30 chimpanzees as well as fixed copy number differences between species. We found that human and chimpanzee CNVs occur in orthologous genomic regions far more often than expected by chance and are strongly associated with the presence of highly homologous intrachromosomal segmental duplications. By adapting population genetic analyses for use with copy number data, we identified functional categories of genes that have likely evolved under purifying or positive selection for copy number changes. In particular, duplications and deletions of genes with inflammatory response and cell proliferation functions may have been fixed by positive selection and involved in the adaptive phenotypic differentiation of humans and chimpanzees.

"This is the first study of this scale, comparing directly the genomes of many humans and chimpanzees," says Dr Richard Redon, from the Wellcome Trust Sanger Institute, a leading author of the study. "By looking at only one 'reference' sequence for human or chimpanzee, as has been done previously, it is not possible to tell which differences occur only among individual chimpanzees or humans and which are differences between the two species.

"This is our first view of those two important legacies of evolution."

Rather than examining single-letter differences in the genomes (so-called SNPs), the researchers looked at copy number variation (CNV) - the gain or loss of regions of DNA. CNVs can affect many genes at once and their significance has only been fully appreciated within the last two years. The team looked at genomes of 30 chimpanzees and 30 humans: a direct comparison of this scale or type has not been carried out before.

The comparison uncovered CNVs that are present in both species as well as copy number differences (CNDs) between the two species. CNDs are likely to include genes that have influenced evolution of each species since humans and chimpanzees diverged some six million years ago.

"Broadly, the two genomes have similar patterns and levels of CNVs - around 70-80 in each individual - of which nearly half occur in the same regions of the two species' genomes," continues Dr Redon. "But beyond that similarity we were able to find intriguing evidence for key sets of genes that differ between us and our nearest relative."

One of the genes affected by CNVs is CCL3L1, for which lower copy numbers in humans have been associated with increased susceptibility to HIV infection. Remarkably, the study of 60 human and chimpanzee genomes found no evidence for fixed CNDs between human and chimp and no within-chimp CNV. Rather, they found that a nearby gene called TBC1D3 was reduced in number in chimpanzee compared to human: typically, there were eight copies in human, but apparently only one in all chimpanzees.

The authors suggest that it might be evolutionary selection of CNDs in TBC1D3 that have driven the population differences. Consistent with this novel observation, TBC1D3 is involved in cell proliferation (favoured category) and is on a core region for duplication - a focal point for large regions of duplication in human genome.

"It is evident that there has been striking turnover in gene content between humans and chimpanzees, and some of these changes may have resulted from exceptional selection pressures," explains Dr George Perry from Arizona State University and Brigham and Women's Hospital, another leading author of the study. "For example, a surprisingly high number of genes involved in the inflammatory response - APOL1, APOL4, CARD18, IL1F7, IL1F8 - are completely deleted from chimp genome. In humans, APOL1 is involved in resistance to the parasite that causes sleeping sickness, while IL1F7 and CARD18 play a role in regulating inflammation: therefore, there must be different regulations of these processes in chimpanzees.

"We already know that inactivation of an immune system gene from the human genome is being positively selected: now we have an example of similar consequences in the chimpanzee."

CNVs in humans and chimpanzees often occur in equivalent genomic locations: most lie in regions of the genomes, called segmental duplications, that are particularly 'fragile'. However, one in four of the 355 CNDs that the team found do not overlap with CNVs within either species - suggesting that they are variants that are 'fixed' in each species and might mark …

A closer relative?

Nov., 2006

Science, Vol. 314, 17 November 2006
65 kb
Humans and chimpanzees diverged some 6,000,000 years ago.
Nov., 2006

~1 Mb "454"
736,941 same CHN
10,167 same HN
3,447 same HC
434 same NC
Nov. 2006

New York Times, October 19, 2007
http://www.nytimes.com/2007/05/31/science/31cnd-gene.html

Aug., 2008
Mt DNA 16.6 kb
Cell. 2008 Aug 8; 134(3):416-26

2008-2009?
http://www.nature.com/genomics/human/papers/articles.html
http://www.sciencemag.org/content/vol291/issue5507/

July, 2009

Science 17 July 2009: Vol. 325, no. 5938, pp. 318 – 321
Retrieval and Analysis of Five Neandertal mtDNA Genomes
Adrian W. Briggs, Jeffrey M. Good ……….. Svante Pääbo

Analysis of Neandertal DNA holds great potential for investigating the population history of this group of hominins, but progress has been limited due to the rarity of samples and damaged state of the DNA. We present a method of targeted ancient DNA sequence retrieval that greatly reduces sample destruction and sequencing demands and use this method to reconstruct the complete mitochondrial DNA (mtDNA) genomes of five Neandertals from across their geographic range. We find that mtDNA genetic diversity in Neandertals that lived 38,000 to 70,000 years ago was approximately one-third of that in contemporary modern humans. Together with analyses of mtDNA protein evolution, these data suggest that the long-term effective population size of Neandertals was smaller than that of modern humans and extant great apes.

Diversity of five individuals (not complete sequence)

Science 328, 710-722
Science – May 2010
A Draft Sequence of the Neandertal Genome

Richard E. Green, Johannes Krause, Adrian W. Briggs, Tomislav Maricic, Udo Stenzel, Martin Kircher, Nick Patterson, Heng Li, Weiwei Zhai, Markus Hsi-Yang Fritz, Nancy F. Hansen, Eric Y. Durand, Anna-Sapfo Malaspinas, Jeffrey D. Jensen, Tomas Marques-Bonet, Can Alkan, Kay Prüfer, Matthias Meyer, Hernán A. Burbano, Jeffrey M. Good, Rigo Schultz, Ayinuer Aximu-Petri, Anne Butthof, Barbara Höber, Barbara Höffner, Madlen Siegemund, Antje Weihmann, Chad Nusbaum, Eric S. Lander, Carsten Russ, Nathaniel Novod, Jason Affourtit, Michael Egholm, Christine Verna, Pavao Rudan, Dejana Brajković, Željko Kućan, Ivan Gušić, Vladimir B. Doronichev, Liubov V. Golovanova, Carles Lalueza-Fox, Marco de la Rasilla, Javier Fortea, Antonio Rosas, Ralf W. Schmitz, Philip L. F. Johnson, Evan E. Eichler, Daniel Falush, Ewan Birney, James C. Mullikin, Montgomery Slatkin, Rasmus Nielsen, Janet Kelso, Michael Lachmann, David Reich, Svante Pääbo.

Neandertals, the closest evolutionary relatives of present-day humans, lived in large parts of Europe and western Asia before disappearing 30,000 years ago. We present a draft sequence of the Neandertal genome composed of more than 4 billion nucleotides from three individuals. Comparisons of the Neandertal genome to the genomes of five present-day humans from different parts of the world identify a number of genomic regions that may have been affected by positive selection in ancestral modern humans, including genes involved in metabolism and in cognitive and skeletal development. We show that Neandertals shared more genetic variants with present-day humans in Eurasia than with present-day humans in sub-Saharan Africa, suggesting that gene flow from Neandertals into the ancestors of non-Africans occurred before the divergence of Eurasian groups from each other.

http://www.sciencemag.org/content/328/5979/710.full.pdf
NY Times: May 6, 2010

Nature 468, 1053-1060
Nature – Dec. 2010
Genetic history of an archaic hominin group from Denisova Cave in Siberia

David Reich, Richard E. Green, Martin Kircher, Johannes Krause, Nick Patterson, Eric Y. Durand, Bence Viola, Adrian W. Briggs, Udo Stenzel, Philip L. F. Johnson, Tomislav Maricic, Jeffrey M. Good, Tomas Marques-Bonet, Can Alkan, Qiaomei Fu, Swapan Mallick, Heng Li, Matthias Meyer, Evan E. Eichler, Mark Stoneking, Michael Richards, Sahra Talamo, Michael V. Shunkov, Anatoli P. Derevianko, Jean-Jacques Hublin et al.

Using DNA extracted from a finger bone found in Denisova Cave in southern Siberia, we have sequenced the genome of an archaic hominin to about 1.9-fold coverage. This individual is from a group that shares a common origin with Neanderthals. This population was not involved in the putative gene flow from Neanderthals into Eurasians; however, the data suggest that it contributed 4–6% of its genetic material to the genomes of present-day Melanesians. We designate this hominin population ‘Denisovans’ and suggest that it may have been widespread in Asia during the Late Pleistocene epoch. A tooth found in Denisova Cave carries a mitochondrial genome highly similar to that of the finger bone. This tooth shares no derived morphological features with Neanderthals or modern humans, further indicating that Denisovans have an evolutionary history distinct from Neanderthals and modern humans.
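To give a feel for what "1.9-fold coverage" implies, here is a minimal Python sketch using the common Poisson approximation, under which a genome sequenced to mean depth c is expected to have a fraction 1 - e^(-c) of its bases covered by at least one read. The uniform-sampling model and the 3,000 Mb genome size (borrowed from the human entry in the genome table of these notes) are illustrative assumptions, not figures from the Denisova paper. Ancient DNA is fragmented and unevenly sampled, so real coverage would likely differ.

```python
import math


def fraction_covered(mean_coverage: float) -> float:
    """Expected fraction of bases hit by at least one read (Poisson depth model)."""
    return 1.0 - math.exp(-mean_coverage)


def total_sequence_mb(mean_coverage: float, genome_size_mb: float) -> float:
    """Total aligned sequence (Mb) implied by a given mean coverage."""
    return mean_coverage * genome_size_mb


MEAN_COVERAGE = 1.9          # stated in the abstract
ASSUMED_GENOME_MB = 3000.0   # assumption: human-sized genome, as in the table

covered = fraction_covered(MEAN_COVERAGE)
needed = total_sequence_mb(MEAN_COVERAGE, ASSUMED_GENOME_MB)

print(f"Expected fraction covered at least once: {covered:.2f}")
print(f"Aligned sequence implied: {needed:.0f} Mb")
```

Under these assumptions, roughly 85% of positions would be seen at least once, which is one reason low-coverage ancient genomes are usually analysed statistically rather than site by site.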

Completed Eukaryotic Genomes (First 18): 1997 - 2003

Organism                                Type           Genome (Mb)  # ORFs  Date
Caenorhabditis briggsae (worm)          NEMATODES      104          19,500  11/17/03
Neurospora crassa                       FUNGI          43           10,082  4/23/03
Ciona intestinalis (sea squirt)         FISH           117                  12/13/02
Mus musculus (mouse) 15th               MAMMALS        3,000        30,000  12/5/02
Anopheles gambiae (mosquito)            ARTHROPODA     278          14,000  10/4/02
Plasmodium yoelii yoelii                PROTOZOA       23           5,878   10/3/02
Plasmodium falciparum                   PROTOZOA       23           5,268   10/3/02
Takifugu rubripes (fugu) (pufferfish)   FISH           400                  8/23/02
Oryza sativa L. ssp. Indica (rice)      PLANTS         420          50,000  4/5/02
Oryza sativa ssp. Japonica (rice)       PLANTS         420          50,000  4/5/02
Schizosaccharomyces pombe (yeast)       FUNGI          14           4,824   2/21/02
Encephalitozoon cuniculi                MICROSPORIDIA  2.5          1,997   11/22/01
Guillardia theta                        CRYPTOMONAS    0.55         464     4/26/01
Homo sapiens (human) 5th                PRIMATES       3,000        35,000  2/15/01
Arabidopsis thaliana (mustard)          PLANTS         115          25,498  12/14/00
Drosophila melanogaster (fly)           ARTHROPODA     137          14,100  3/24/00
Caenorhabditis elegans (worm)           NEMATODES      97           19,099  12/11/98
Saccharomyces cerevisiae (yeast)        FUNGI          12           6,294   5/29/97

Nature 420, 515 - 516 (05 December 2002); doi:10.1038/420515a
Dec., 2002
Comparative genomics: The mouse that roared

MARK S. BOGUSKI

Mark S. Boguski is at the Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North D4 100, PO Box 19024, Seattle, Washington 98109, USA. e-mail: [email protected]

Ensembl Mouse Database: http://www.ensembl.org/Mus_musculus/index.html
NCBI Mouse Resources: http://www.ncbi.nlm.nih.gov/genome/guide/mouse/

The laboratory mouse has become an indispensable tool for investigators in many areas of biomedical research. The availability of the full mouse genome sequence will immeasurably advance both the character and the pace of discovery.

"Know then thyself, presume not God to scan; The proper study of mankind is man," wrote Alexander Pope in 1733. What better reason could there have been to sequence the human genome? But the planners of the Human Genome Project realized that the data could not be fully understood, or used to advance biomedicine, in isolation. Indeed, many of the "lessons learned and promises kept" (ref. 1) have been derived from the study of model organisms. Mus musculus, a species of mouse, has been one of the five key model organisms sequenced since the beginnings of the Human Genome Project. In 1998–99 the US National Institutes of Health published an action plan for mouse genomics (ref. 2) which, among other things, called for a working draft sequence of the mouse genome by 2003. An international Mouse Genome Sequencing Consortium (ref. 3) has now achieved this goal. The particular mouse strain concerned is C57BL/6J (Fig. 1), and the consortium's findings are reported on page 520 of this issue.

Nature 420, 520 - 562 (05 December 2002); doi:10.1038/nature01262
Dec., 2002
Initial sequencing and comparative analysis of the mouse genome
Mouse Genome Sequencing Consortium

The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.

Genome Reference Consortium (2009)
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/index.shtml

The mouse genome is about 14% smaller than the human genome (2.5 Gb compared with 2.9 Gb). The difference probably reflects a higher rate of deletion in the mouse lineage.

Over 90% of the mouse and human genomes can be partitioned into corresponding regions of conserved synteny, reflecting segments in which the gene order in the most recent common ancestor has been conserved in both species.

At the nucleotide level, approximately 40% of the human genome can be aligned to the mouse genome. These sequences seem to represent most of the orthologous sequences that remain in both lineages from the common ancestor, with the rest likely to have been deleted in one or both genomes.

The neutral substitution rate has been roughly half a nucleotide substitution per site since the divergence of the species, with about twice as many of these substitutions having occurred in the mouse compared with the human lineage.

By comparing the extent of genome-wide sequence conservation to the neutral rate, the proportion of small (50–100 bp) segments in the mammalian genome that is under (purifying) selection can be estimated to be about 5%. This proportion is much higher than can be explained by protein-coding sequences alone, implying that the genome contains many additional features (such as untranslated regions, regulatory elements, non-protein-coding genes, and chromosomal structural elements) under selection for biological function.

Humans and rodents diverged some 80,000,000 years ago.
Humans and chimpanzees diverged some 6,000,000 years ago.
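The mouse-human figures quoted above can be checked with simple arithmetic. The Python sketch below is only a back-of-the-envelope illustration: the input numbers come from the text, but splitting the total neutral substitution rate 2:1 between the mouse and human branches, and applying the 40% and 5% fractions to a 2.9 Gb human genome, are simplifying assumptions made here for illustration.

```python
# Genome sizes quoted in the text (Gb)
HUMAN_GB = 2.9
MOUSE_GB = 2.5

# Relative size difference: should come out near the quoted "about 14%"
size_deficit = (HUMAN_GB - MOUSE_GB) / HUMAN_GB

# Neutral substitutions per site since divergence (both lineages combined),
# with "about twice as many" on the mouse branch (assumed exact 2:1 split)
TOTAL_SUBS_PER_SITE = 0.5
MOUSE_TO_HUMAN_RATIO = 2.0
human_branch = TOTAL_SUBS_PER_SITE / (1.0 + MOUSE_TO_HUMAN_RATIO)
mouse_branch = TOTAL_SUBS_PER_SITE - human_branch

# Fractions of the human genome, converted to absolute amounts of sequence
ALIGNED_FRACTION = 0.40      # alignable to mouse at the nucleotide level
CONSTRAINED_FRACTION = 0.05  # estimated to be under purifying selection
aligned_gb = ALIGNED_FRACTION * HUMAN_GB
constrained_mb = CONSTRAINED_FRACTION * HUMAN_GB * 1000.0

print(f"Mouse genome smaller by: {size_deficit:.1%}")
print(f"Substitutions per site, human branch: {human_branch:.2f}")
print(f"Substitutions per site, mouse branch: {mouse_branch:.2f}")
print(f"Human sequence alignable to mouse: {aligned_gb:.2f} Gb")
print(f"Sequence under selection: {constrained_mb:.0f} Mb")
```

The point of the last two lines is the contrast the text draws: the amount of sequence estimated to be under selection is several times larger than protein-coding sequence alone could account for.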

Completed Eukaryotic Genomes (Next 13, 2004)

Organism                                        Type        Genome (Mb)  # ORFs  Date
Cryptosporidium hominis                         PROTOZOA    9.2          3,994   10/28/04
Tetraodon nigroviridis (puffer fish)            FISH        385          22,400  10/21/04
Thalassiosira pseudonana (marine diatom)        PLANT       25           11,242  10/1/04
Candida glabrata                                FUNGI       12.3         5,499   7/1/04
Debaryomyces hansenii                           FUNGI       12.2         7,084   7/1/04
Kluyveromyces lactis                            FUNGI       10.6         5,504   7/1/04
Yarrowia lipolytica                             FUNGI       20.5         7,180   7/1/04
Phanerochaete chrysosporium (white rot fungus)  FUNGI       30                   5/2/04
Cyanidioschyzon merolae                         ALGAE       16.5         5,331   4/8/04
Rattus norvegicus (Rat) 22nd                    MAMMALS     2,750        20,973  4/1/04
Cryptosporidium parvum                          PROTOZOA    10.4         3,807   3/25/04
Ashbya (Eremothecium) gossypii                  FUNGI       9            4,720   3/4/04
Bombyx mori (silkworm)                          ARTHROPODA  530                  2/29/04

Nature 428, 493 - 521 (01 April 2004); doi:10.1038/nature02426
Apr., 2004
Genome sequence of the Brown Norway rat yields insights into mammalian evolution

Rat Genome Sequencing Project Consortium

The laboratory rat (Rattus norvegicus) is an indispensable tool in experimental medicine and drug development, having made inestimable contributions to human health. We report here the genome sequence of the Brown Norway (BN) rat strain. The sequence represents a high-quality 'draft' covering over 90% of the genome. The BN rat sequence is the third complete mammalian genome to be deciphered, and three-way comparisons with the human and mouse genomes resolve details of mammalian evolution. This first comprehensive analysis includes genes and proteins and their relation to human disease, repeated sequences, comparative genome-wide studies of mammalian orthologous chromosomal regions and rearrangement breakpoints, reconstruction of ancestral karyotypes and the events leading to existing species, rates of variation, and lineage-specific and lineage-independent evolutionary events such as expansion of gene families, orthology relations and protein evolution.

Rat Genome Resources
http://www.ncbi.nlm.nih.gov/genome/guide/rat/
http://www.ensembl.org/Rattus_norvegicus/mapview?chr=1

Darwin believed that "natural selection will always act very slowly, often only at long intervals of time" (ref. 1). The consequences of evolution over timescales of approximately 1,000 millions of years (Myr) and 75 Myr were investigated in publications comparing the human with invertebrate and mouse genomes, respectively (refs 2, 3). Here we describe changes in mammalian genomes that occurred in a shorter time interval, approximately 12–24 Myr (refs 4, 5) since the common ancestor of rat and mouse.

The comparison of these genomes has produced a number of insights:

The rat genome (2.75 gigabases, Gb) is smaller than the human (2.9 Gb) but appears larger than the mouse (initially 2.5 Gb (ref. 3) but given as 2.6 Gb in NCBI build 32, see http://www.ncbi.nlm.nih.gov/genome/seq/NCBIContigInfo.html).

The rat, mouse and human genomes encode similar numbers of genes. The majority have persisted without deletion or duplication since the last common ancestor. Intronic structures are well conserved.

Some genes found in rat, but not mouse, arose through expansion of gene families. These include genes producing pheromones, or involved in immunity, chemosensation, detoxification or proteolysis.

Almost all human genes known to be associated with disease have orthologues in the rat genome but their rates of synonymous substitution are significantly different from the remaining genes.

About 3% of the rat genome is in large segmental duplications, a fraction intermediate between mouse (1–2%) and human (5–6%). These occur predominantly in pericentromeric regions. Recent expansions of major gene families are due to these genomic duplications.

Rat versus Mouse: ~ 20 Myr

The eutherian core of the rat genome (that is, bases that align orthologously to mouse and human) comprises a billion nucleotides (40% of the euchromatic rat genome) and contains the vast majority of exons and known regulatory elements (1–2% of the genome). A portion of this core constituting 5–6% of the genome appears to be under selective constraint in rodents and primates, while the remainder appears to be evolving neutrally.

Approximately 30% of the rat genome aligns only with mouse, a considerable portion of which is rodent-specific repeats. Of the non-aligning portion, at least half is rat-specific repeats.

More genomic changes occurred in the rodent lineages than the primate: (1) These rodent genomic changes include approximately 250 large rearrangements between a hypothetical murid ancestor and human, approximately 50 from the murid ancestor to rat, and about the same from the murid ancestor to mouse. (2) A threefold-higher rate of base substitution in neutral DNA is found along the rodent lineage when compared with the human lineage, with the rate on the rat branch 5–10% higher than along the mouse branch. (3) Microdeletions occur at an approximately twofold-higher rate than microinsertions in both rat and mouse branches.

A strong correlation exists between local rates of microinsertions and microdeletions, transposable element insertion, and nucleotide substitutions since divergence of rat and mouse, even though these events occurred independently in the two lineages.

H: 681 Mb of 2,800 Mb is primate-specific

NOTE: Coding regions ~ 1% of total DNA
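The proportions quoted in the rat comparison can be turned into absolute amounts of sequence with a few lines of arithmetic. The Python sketch below is a rough illustration only: the inputs are the figures stated in the notes, and treating the "~1%" coding fraction as exactly 1% of the 2,800 Mb human total is an assumption made here for the sake of the example.

```python
# Eutherian core: "a billion nucleotides (40% of the euchromatic rat genome)"
EUTHERIAN_CORE_GB = 1.0
CORE_FRACTION = 0.40
implied_euchromatic_rat_gb = EUTHERIAN_CORE_GB / CORE_FRACTION

# "Human: 681 Mb of 2,800 Mb is primate-specific"
PRIMATE_SPECIFIC_MB = 681.0
HUMAN_TOTAL_MB = 2800.0
primate_specific_fraction = PRIMATE_SPECIFIC_MB / HUMAN_TOTAL_MB

# "Coding regions ~ 1% of total DNA" (assumed to be exactly 1% here)
CODING_FRACTION = 0.01
coding_mb = CODING_FRACTION * HUMAN_TOTAL_MB

print(f"Implied euchromatic rat genome: {implied_euchromatic_rat_gb:.2f} Gb")
print(f"Primate-specific share of human genome: {primate_specific_fraction:.1%}")
print(f"Approximate human coding sequence: {coding_mb:.0f} Mb")
```

Read this way, the primate-specific portion of the human genome is on the order of twenty times larger than the coding portion, which helps frame the "Rat vs Mouse?" and "Rat vs Human?" comparisons that follow.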

Rat vs Mouse?

Rat vs Human?

Ensembl.Human - 6vsRat

Completed Eukaryotic Genomes (First 18): 1997 - 2003

Organism Type Genome (Mb) # ORFs Date

Caenorhabditis briggsae (worm) NEMATODES 104 19,500 11/17/03

Neurospora crassa FUNGI 43 10,082 4/23/03

Ciona intestinalis (sea squirt) FISH 117 12/13/02 Mus musculus (mouse)15th MAMMALS 3,000 30,000 12/5/02 Anopheles gambiae (mosquito) ARTHROPODA 278 1