Sources of Genome Variation Data

Ananyo Choudhury, Shaun Aron, Scott Hazelhurst, Zané Lombard
Wits Bioinformatics

• 1000 Genomes
• HapMap
• Human c Variation Project
• Research data
  - North African
  - Southern
  - Other African
• New data coming soon…

1000 Genomes
A Deep Catalogue of Human Genetic Variation

Whose genome was sequenced?

2001: draft human genome sequence
2004: “finished” human genome
International Human Genome Sequencing Consortium

The human genome reference sequence does not represent an exact match for any one person's genome.

The draft genome is composed of the DNA of an estimated 10 to 20 anonymous individuals across different racial and ethnic groups.

IHGSC. Nature (2001) 409: 860-921

With the exception of monozygotic twins, every one of us is genetically different from every other human who ever lived.

American Express 1990 Advertisement
http://www.childrenofsalem.com/days/kids/ericbran/ericbran1.html

Genetic variation and drug responses

Because of genetic differences, different people respond differently to the same drug.

Example:
• In the 1950s, anaesthetists began using the muscle relaxant succinylcholine.
• A small proportion of patients went into life-threatening breathing arrest.
• Succinylcholine is normally metabolized by cholinesterase, but 1 out of 2,500 people carries two defective copies of the gene for this enzyme.
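As a rough worked example of what the 1 in 2,500 figure implies, the sketch below assumes Hardy-Weinberg equilibrium (an assumption that is not stated on the slide) and treats 1/2,500 as the frequency of people with two defective copies. Under that assumption the defective-allele frequency is the square root of that figure. The function name is illustrative.

```python
import math

def hardy_weinberg_from_homozygotes(homozygote_freq):
    """Return (allele_freq, carrier_freq) assuming Hardy-Weinberg equilibrium.

    homozygote_freq is q^2, the fraction of people with two defective copies.
    """
    q = math.sqrt(homozygote_freq)   # defective-allele frequency
    p = 1.0 - q                      # normal-allele frequency
    carrier_freq = 2.0 * p * q       # people with exactly one defective copy
    return q, carrier_freq

q, carriers = hardy_weinberg_from_homozygotes(1 / 2500)
print(f"defective allele frequency: {q:.3f}")
print(f"carrier frequency: {carriers:.4f} (about 1 in {1 / carriers:.0f})")
```

Under this assumption, carriers of a single defective copy are far more common than people with two defective copies.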

See: http://www.mdbrowse.com/Druginf/S/succinylcholine.htm

Diseases associated with genetic variations

Type of Disease | Gene Responsible | Inheritance
Phenylketonuria (PKU) | Phenylalanine hydroxylase (PAH) | Autosomal recessive
Cystic fibrosis | Cystic fibrosis transmembrane conductance regulator (CFTR) | Autosomal recessive
Sickle-cell anemia | Beta hemoglobin (HBB) | Autosomal recessive
Huntington's disease | Huntingtin (HTT) | Autosomal dominant
Myotonic dystrophy type 1 | Dystrophia myotonica-protein kinase (DMPK) | Autosomal dominant
Polycystic kidney disease 1 and 2 | Polycystic kidney disease 1 (PKD1) and polycystic kidney disease 2 (PKD2), respectively | Autosomal dominant
Hemophilia A | Coagulation factor VIII (F8) | X-linked recessive
Muscular dystrophy, Duchenne type | Dystrophin (DMD) | X-linked recessive
Hypophosphatemic rickets, X-linked dominant | Phosphate-regulating endopeptidase homologue, X-linked (PHEX) | X-linked dominant
Rett's syndrome | Methyl-CpG-binding protein 2 (MECP2) | X-linked dominant
Spermatogenic failure, nonobstructive, Y-linked | Ubiquitin-specific peptidase 9Y, Y-linked (USP9Y) | Y-linked

Description, molecular basis known (by inheritance): Autosomal 3,732; X-Linked 282; Y-Linked 4; Mito 28; total 4,046

Geography and the evolution of skin color

Predicted skin color = annual average UVMED × (0.1088) + 72.7483.

Jablonski & Chaplin. Journal of Human Evolution (2000) 39, 57–106
Jablonski. Annu. Rev. Anthropol. 2004. 33:585–623

Evolutionary histories and cause of death are often correlated

Ramos E and Rotimi C, BMC Medical Genomics, 2009

Most diseases and traits involve both environmental and genetic components

[Figure: bar chart for traits A1 to A40 (scale 0 to 200) showing the genetic component and the environmental component of each. Environmental component label: "Nutrition, Pathogens, Pollutants, Lifestyle & also other /SNPs"]

Era of GWAS

As of 03/02/14, the catalogue includes 1,823 publications and 12,508 SNPs
http://www.genome.gov/GWAStudies/

Moving beyond the genome …

The 2008 SNP Submissions for the James Watson Genome totaled 3,542,364

The 2008 SNP Submissions for the J. Craig Venter Genome totaled 4,018,050

The 2008 SNP Submissions for the Individual Chinese Genome totaled 5,077,954

The 2008 SNP Submissions for the Individual Korean Genome totaled 1,750,224

2001: Draft human genome
2007: First individual human genome
2009: 1000 Genomes Project

Populations in 1000 Genomes Phase 1

Why do we need to sequence so many populations?

Journey of Homo sapiens

100 k years ago Khoisan

Ramos E and Rotimi C, BMC Medical Genomics, 2009

Evolutionary histories are strongly engraved in genomes

Ancestry Informative Markers

SNPs specific to a population

Allele frequencies of a large number of SNPs show strong population biases
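One common way to flag candidate ancestry informative markers is to rank SNPs by the absolute allele-frequency difference between two populations. The sketch below illustrates that idea only. The SNP identifiers, frequencies, and the 0.5 cutoff are made-up values, not data from the 1000 Genomes Project, and the function name is hypothetical.

```python
def rank_aims(freqs_pop_a, freqs_pop_b, min_delta=0.5):
    """Rank SNPs by absolute allele-frequency difference between two populations.

    freqs_pop_a / freqs_pop_b map SNP id -> frequency of the same allele.
    Returns (snp_id, delta) pairs with delta >= min_delta, largest first.
    """
    shared = set(freqs_pop_a) & set(freqs_pop_b)
    deltas = [(snp, abs(freqs_pop_a[snp] - freqs_pop_b[snp])) for snp in shared]
    informative = [(snp, d) for snp, d in deltas if d >= min_delta]
    return sorted(informative, key=lambda item: item[1], reverse=True)

# Hypothetical allele frequencies (illustrative only).
pop_a = {"snp1": 0.95, "snp2": 0.50, "snp3": 0.10, "snp4": 0.30}
pop_b = {"snp1": 0.05, "snp2": 0.45, "snp3": 0.80, "snp4": 0.35}

for snp, delta in rank_aims(pop_a, pop_b):
    print(snp, round(delta, 2))
```

In this toy input only snp1 and snp3 pass the cutoff; the other two SNPs have similar frequencies in both populations and carry little ancestry information.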

1000 Genomes Projected populations

[Map legend: 1000 Genomes, HapMap, HapMap 3, New populations]

STEPS
• SNP Calling
• Structural Variations
• Formats

Trio project (Pilot II)

• whole-genome shotgun sequencing at high coverage (average 42X) of two families
  - one Yoruba from Ibadan, Nigeria (YRI)
  - one of European ancestry in Utah (CEU)
• Each trio includes two parents and one daughter. Each of the offspring was sequenced using three platforms and by multiple centers.

Low-coverage project (Pilot I)

• whole-genome shotgun sequencing at low coverage (2–6X) of 1092 genomes from more than 10 populations

Exon project (Pilot III)

• targeted capture of 8,140 exons from 906 randomly selected genes (total of 1.4 Mb) followed by sequencing at high coverage (average 50X) in 697 individuals from 7 populations
• 1092 individuals from >10 populations

Outcomes

What differs between individuals?

• 3-4,000,000 variants
• 10-11,000 nonsynonymous changes
• 220-250 in-frame indels
• 80-100 premature stop codons
• 40-50 splice site disruptions
• 50-100 HGMD “recessive disease causing” mutations

How different is your genome from the reference Human genome?

De novo mutation in trios

• 1001 mutations selected (CEU)
  - 49 true germline mutations
  - Estimated rate: 1.2 × 10^-8
  - Other 952 were either somatic or cell line mutations
• 669 mutations (YRI)
  - 35 true germline mutations
  - Estimated rate: 1.0 × 10^-8
  - Other 634 were either somatic or cell line mutations
• Across the two trio offspring, a single, synonymous, coding germline mutation was observed

What is new in me????

Revising Disease association

Phase 3 Data
• African: ACB, ASW, ESN, GWD, LWK, MSL, YRI
• Asian: BEB, CDX, CHB, CHS, GIH, ITU, JPT, KHV, PJL, STU
• American: CLM, MXL, PEL, PUR
• European: CEU, FIN, GBR, IBS, TSI

Take home ….

• Measurement of human DNA variation is important
• 1000 Genomes is a key project: it provides the location, frequency and local structure of approx 36M SNPs, 1M short dels, and 14k SVs, >50%
• Expected to contain 95% of the currently accessible variants
• Each person has ~275 loss-of-function variants in annotated genes and 50-100 variants previously implicated in inherited disorders
• Rate of de novo germline base substitution mutations is approximately 10^-8 per bp per generation
• More out there

Thank You

HapMap

Single nucleotide polymorphisms (SNPs)

• Most common genetic variant
• SNPs are used as markers to locate genes in DNA sequences - useful in disease mapping
• Testing 12 million common SNPs would be extremely expensive
  - For a case-control study with 1,000 cases & 1,000 controls
  - Genotype all individuals for all SNPs
  - That adds up to 24 billion genotypes
  - Imagine this approach cost 50 cents a genotype.
  - That’s R12 billion for each disease - completely out of the question!

How HAPMAP could benefit human health
• Provide an extensive resource that researchers can use to discover the genetic variants involved in disease and individual responses to therapeutic agents
• Learn much more about the origins of illnesses and about ways to prevent, diagnose and treat them
• Association studies
• Customizable treatment, new therapies

GOAL OF HAPMAP
• The International HapMap Project aims to identify a large fraction of the genetic diversity in the human species
• Enable scientists to take advantage of how SNPs and other genetic variants are organised on chromosomes
  - Genetic variants that are near each other tend to be inherited together.
  - E.g. people who have an A rather than a G at a particular location can have identical genetic variants at other SNPs in the chromosomal region surrounding the A.
  - These regions of linked variants are known as haplotypes. This phenomenon is influenced by recombination & linkage disequilibrium

Recombination
• Origins of haplotypes

Linkage Disequilibrium
• The non-random association between alleles in a population
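To make the definition of linkage disequilibrium concrete, the sketch below computes the standard pairwise measures D, |D'| and r² for two biallelic SNPs from phased haplotype counts. The haplotype counts are invented for illustration and the function name is hypothetical; it is a minimal sketch, not the procedure used by the HapMap Project.

```python
def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Compute D, |D'| and r^2 for two biallelic SNPs from haplotype counts.

    n_AB etc. are counts of the four phased haplotypes (alleles A/a and B/b).
    """
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total   # frequency of allele A at SNP 1
    p_B = (n_AB + n_aB) / total   # frequency of allele B at SNP 2

    # D: observed haplotype frequency minus the frequency expected
    # if the two SNPs were independent.
    d = p_AB - p_A * p_B

    # |D'| rescales |D| by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2: squared correlation between the alleles at the two SNPs.
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2

# Only two of the four possible haplotypes are seen: complete LD.
print(ld_measures(60, 0, 0, 40))
# All four haplotypes at the frequencies expected under independence: no LD.
print(ld_measures(25, 25, 25, 25))
```

The two calls correspond to the two extremes: two SNPs that form only two haplotypes, and two SNPs that form all four haplotypes in equilibrium proportions.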

[Figure: Low LD (Linkage Equilibrium): 2 SNPs = 4 Haplotypes. High LD: 2 SNPs = 2 Haplotypes]

Premise of HapMap

SNPs, Haplotypes & tagSNPs

SNPs and haplotype blocks. (A) SNPs. Shown is a short stretch of DNA from four versions of the same chromosome region in different people. Most of the DNA sequence is identical in these chromosomes, but three bases are shown where variation occurs. Each SNP has two possible alleles; the first SNP in panel A has the alleles cytosine and thymine. (B) Haplotypes. A haplotype is made up of a particular combination of alleles at nearby SNPs. Shown here are the observed genotypes for 20 SNPs that extend across 6,000 bases of DNA. Only the variable bases are shown, which include the three SNPs that are shown in panel A. For this region, most of the chromosomes in a survey turn out to have haplotypes 1-4. (C) Tag SNPs. Genotyping just the three tag SNPs out of the 20 SNPs is sufficient to identify these four haplotypes uniquely. For instance, if a particular chromosome has the pattern A-T-C at these three tag SNPs, this pattern matches the pattern determined for haplotype 1.

Haplotypes
• SNPs that occur together suggest an underlying structure to the genome
• SNPs occur in blocks, of which there are common varieties
• ~65% to 85% of the human genome is organized in haplotypes
• If blocks can be easily identified, they could be an important tool for studying genetic variation in relation to disease, drug response, etc.

• Founded in 2002
• Participating institutions and funding from Japan, UK, Canada, China, USA and Nigeria
• “...develop a haplotype map of the human genome, which will describe the common patterns of human DNA sequence variation”

Strategy

1. Recruit individuals that represent global diversity
2. Genotype SNPs for all individuals
3. Identify chromosomal regions with groups of strongly associated SNPs – haplotypes
4. Determine linkage disequilibrium between SNPs
5. Identify tagSNPs for the haplotypes

Populations sampled
• Yoruba people in Ibadan, Nigeria
  - 30 both-parent-and-adult-child trios
• Japanese in Tokyo
  - 45 unrelated individuals
• Han Chinese in Beijing
  - 45 unrelated individuals
• U.S. Utah residents of northern and western European ancestry
  - 30 trios
  - Residents with ancestry from Northern and Western Europe

Genotyping
• 11 centers for typing: Canada, China, Japan, UK, USA
• Genotyped at least one common SNP every 5 kb
• The Phase I HapMap contained 1,007,329 SNPs that passed a set of quality control filters
  - SNPs with MAF ≥ 0.05 chosen
• The HapMap Project contributed ~6 million new SNPs to dbSNP
  - In 2005 dbSNP contained 9.2 million candidate human SNPs, of which 3.6 million had been validated by both alleles having been seen two or more times during discovery (‘double-hit’ SNPs), and 2.4 million had genotype data

Haplotyping
• Phased haplotypes were generated using the program PHASE version 2.0
• Each allele in a genotype is assigned to one or the other parental chromosome using computer algorithms
• The numbers and size of possible haplotypes are limited because of recombination events

Haplotype output

Nature 2005

LD Measures
• D prime (D')
  - D is the difference between the observed and the expected haplotype frequency; D' is D divided by its maximum possible value given the allele frequencies.
  - D' (normalised LD) is less sensitive to allele frequencies than other LD measures.
  - A score of 1 = complete LD
• R square (r²)
  - The square of the correlation coefficient r, a measure of the effect of X in reducing the uncertainty in predicting Y.
  - Gives information on the sample size required to detect association.
  - A score of 1 = complete LD
• Logarithm of Odds (LOD) Score
  - The logarithm of odds: a statistical measure of the likelihood that two genetic markers occur together on the same chromosome and are inherited as a single unit of DNA (co-segregation).
  - A score of >2 = LD

LD Plots

• The triangle plot is constructed by connecting every pair of SNPs along lines at 45 degrees to the horizontal track line.
• The colour of the diamond at the position that two SNPs intersect indicates the amount of LD: more intense colours indicate higher LD.
• A diamond indicates missing data

LD AND tagSNPs
• Reduce the number of SNPs needed to genotype a region (use few tagSNPs)
  - High LD: few SNPs sampled
  - Low LD: more SNPs sampled
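The idea of needing fewer tagSNPs where LD is high can be sketched as a greedy selection over pairwise r² values. The haplotypes below are invented, the 0.8 threshold is an arbitrary choice for this example, and the function names are hypothetical. This only illustrates the principle; it is not the tagging procedure used by the HapMap Project.

```python
from itertools import combinations

def r_squared(col_x, col_y):
    """r^2 between two SNPs given 0/1 allele codes across the same haplotypes."""
    n = len(col_x)
    p_x = sum(col_x) / n
    p_y = sum(col_y) / n
    p_xy = sum(1 for x, y in zip(col_x, col_y) if x == 1 and y == 1) / n
    denom = p_x * (1 - p_x) * p_y * (1 - p_y)
    if denom == 0:
        return 0.0  # monomorphic SNP: r^2 undefined, treated here as no LD
    return (p_xy - p_x * p_y) ** 2 / denom

def greedy_tag_snps(haplotypes, threshold=0.8):
    """Greedily pick tag SNPs so every SNP has r^2 >= threshold with some tag."""
    n_snps = len(haplotypes[0])
    columns = [[hap[i] for hap in haplotypes] for i in range(n_snps)]
    # For each SNP, the set of SNPs it can stand in for (including itself).
    covers = {i: {i} for i in range(n_snps)}
    for i, j in combinations(range(n_snps), 2):
        if r_squared(columns[i], columns[j]) >= threshold:
            covers[i].add(j)
            covers[j].add(i)
    untagged = set(range(n_snps))
    tags = []
    while untagged:
        # Pick the SNP that captures the most still-untagged SNPs.
        best = max(sorted(untagged), key=lambda i: len(covers[i] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags

# Hypothetical phased haplotypes (rows) over 5 SNPs (columns), alleles coded 0/1.
haplotypes = [
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
]
print(greedy_tag_snps(haplotypes))
```

In this toy data the first four SNPs (indices 0 to 3) are in complete LD with each other, so a single tag covers all of them, while the last SNP is only weakly correlated with the rest and needs its own tag.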

Interesting findings

A: Similarity of allele frequencies in CHB/JPT samples.
• These were subsequently analyzed jointly

B: Identification of recombination hot spots
• 21,617 identified recombination hotspots
• ~1 per 122 kb

C: Haplotype sizes vary across populations due to migrational history
• Haplotypes in non-African populations tend to be longer than in African populations

D: LD correlates to genomic features
• Areas of very high and very low LD have the highest density of genes
• LD low: associated with immune and neuro-physiological genes
• LD elevated: associated with cell cycle regulators, DNA damage responses, DNA/RNA metabolism.
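As a quick consistency check on finding B, multiplying the number of hotspots by the quoted average spacing gives the approximate length of sequence those two figures span. This is plain arithmetic on the numbers from the slide; the variable names are illustrative.

```python
hotspots = 21_617    # recombination hotspots identified
spacing_kb = 122     # roughly one hotspot per 122 kb

span_kb = hotspots * spacing_kb
span_gb = span_kb / 1_000_000   # 1 Gb = 1,000,000 kb

print(f"{span_kb:,} kb, about {span_gb:.2f} Gb")
```

The product is of the same order as the size of the human genome, so the two figures are mutually consistent.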

HAPMAP – Phase Comparison

 | Phase 1 | Phase 2 | Phase 3
Samples & POP panels | 269 samples (4 panels) | 270 samples (4 panels) | 1,184 samples (11 panels)
Genotyping centers | HapMap International Consortium | Perlegen | Broad & Sanger
Unique SNPs | 1.1 M | 3.8 M (phase I+II) | 1.6 M (Affy 6.0 & Illumina 1M)
Sequence Data | --- | --- | Sequenced ten 100-kb regions (n=692)
Reference | Nature (2005) 437:p1299-1320 | Nature (2007) 449:p851-861 | Nature (2010) 467:p52-58

Human Genome Diversity Project

Aim: to collect a wide range of human diversity (indigenous populations)

http://web.stanford.edu/group/rosenberglab/diversity.html

Key African Data Sets

Publicly available
• May et al. 2013. 10.1186/1471-2164-14-644. South Africans from Soweto
• Henn et al. 2013. 10.1371/journal.pgen.1002397. North Africans
• Pickrell et al. 2012. 10.1038/ncomms2140. Khoi-san data
• Schlebusch et al. 10.1126/science.1227721. Khoi-san, Coloured SA, “SW” and “SE” Bantu speakers

Other key data
• African Genome Variation Project: genotyping 2.5 million genetic variants in 100 individuals each from over 10 ethnic groups across sub-Saharan Africa

Other data not public
• Some key papers, data sets not available

HAPMAP Phase III

LABEL | POPULATION SAMPLE | # Samples
ASW | African ancestry in Southwest USA | 90
CEU | Utah residents with Northern and Western European ancestry from the CEPH collection | 180
CHB | Han Chinese in Beijing, China | 90
CHD | Chinese in Metropolitan Denver, Colorado | 100
GIH | Gujarati Indians in Houston, Texas | 100
JPT | Japanese in Tokyo, Japan | 91
LWK | Luhya in Webuye, Kenya | 100
MEX | Mexican ancestry in Los Angeles, California | 90
MKK | Maasai in Kinyawa, Kenya | 180
TSI | Toscans in Italy | 100
YRI | Yoruba in Ibadan, Nigeria | 180
Total | | 1,301

HapMap 3 Samples

• 1,184 samples from diverse populations (N=11)
• Individual and community consent for thorough genetic ascertainment (up to complete resequencing) and public sharing of data on the Internet

Interesting Outcomes
• Of the SNPs identified through sequencing, 77% were new (i.e. not previously in dbSNP) and 99% of those had a MAF < 5%
  - Reveals that many more variants remain to be found, especially rare variants

The International HapMap 3 Consortium, Nature Sept 2010; 467:52-58

Interesting Outcomes
• Confirmed that non-African diversity is largely a subset of African diversity
• African samples provided a more complete discovery resource for variant sites in non-African samples than the converse
• However, it does not work as well for rare variants
  - Rare variants could likely be more important in population-specific contributions to disease?
  - Underscores the value of next-gen sequencing of whole genomes within various populations to find rare variants that contribute to disease.

Using HAPMAP data: Population substructure in Africans
• Here is the result of running ADMIXTURE on the three African HapMap-3 populations, using about 440K SNPs, including Tuscans as a non-African group.

[Figure: ADMIXTURE plot with population labels LWK, MKK, TSI, YRI]
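ADMIXTURE writes its ancestry estimates to a .Q file with one row per individual and one column per ancestral cluster, and a result like the one described above is often summarised by averaging those rows within each population. The sketch below does that on a small in-memory example. The proportions and population assignments are invented for illustration and are not the HapMap 3 results; the function name is hypothetical.

```python
from collections import defaultdict

def mean_ancestry_by_population(q_text, populations):
    """Average ADMIXTURE-style ancestry proportions within each population.

    q_text: whitespace-separated proportions, one individual per line.
    populations: population label for each individual, in the same order.
    """
    rows = [[float(v) for v in line.split()] for line in q_text.strip().splitlines()]
    if len(rows) != len(populations):
        raise ValueError("number of rows and population labels differ")
    grouped = defaultdict(list)
    for label, row in zip(populations, rows):
        grouped[label].append(row)
    # Column-wise mean of the rows belonging to each population.
    return {
        label: [sum(col) / len(col) for col in zip(*members)]
        for label, members in grouped.items()
    }

# Invented example with K = 2 clusters and two individuals per population.
q_text = """
0.90 0.10
0.80 0.20
0.05 0.95
0.15 0.85
"""
labels = ["YRI", "YRI", "TSI", "TSI"]

for pop, means in mean_ancestry_by_population(q_text, labels).items():
    print(pop, [round(m, 2) for m in means])
```

With real output, the labels would come from the sample metadata in the same order as the genotype input given to ADMIXTURE.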