Applications of Human Genome Sequencing: Personal and Disease Genomics

Michael Snyder

February 11, 2014

Conflicts: Personalis, Genapsys

Outline

• General introduction on human variation
• Disease genome sequencing
  – Cancer, Mystery Diseases
• Personal genomics

Genetic Variation Among People: Three Types

1) Single nucleotide variants (SNVs)
   GATTTAGATCGCGATAGAG
   GATTTAGATCTCGATAGAG
   3.7 Million/person

2) Short Indels (Insertions/Deletions 1-100 bp)
   GATTTAGATCGCGATAGAG
   GATTTAGA------TAGAG
   300-600K/person

3) Structural Variation (SV): Large Blocks (1000bp) of DNA that are Deleted, Inserted or Inverted

- People Have >2,500 differences (>2.5kb) Relative to the Reference Human Genome Sequence
- Likely responsible for many human differences and diseases
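The SNV and short-indel examples above can be checked mechanically. The following is a minimal sketch, assuming the two sequences are already aligned to equal length and that "-" marks deleted bases in the sample; the function name `compare_aligned` is hypothetical and this is not how a production variant caller works.

```python
def compare_aligned(reference, sample):
    """Compare two pre-aligned sequences of equal length.

    Returns a list of (kind, 0-based position, detail) tuples.
    '-' characters in `sample` are treated as deleted reference bases.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    variants = []
    i = 0
    while i < len(reference):
        if sample[i] == "-":
            # Extend over the whole run of gap characters: one deletion.
            j = i
            while j < len(sample) and sample[j] == "-":
                j += 1
            variants.append(("DEL", i, reference[i:j]))
            i = j
        else:
            if reference[i] != sample[i]:
                variants.append(("SNV", i, reference[i] + ">" + sample[i]))
            i += 1
    return variants


REFERENCE = "GATTTAGATCGCGATAGAG"
print(compare_aligned(REFERENCE, "GATTTAGATCTCGATAGAG"))  # SNV example
print(compare_aligned(REFERENCE, "GATTTAGA------TAGAG"))  # deletion example
```

Real pipelines work from read alignments and quality scores rather than a single pair of strings; the sketch only illustrates what the two example pairs encode.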

Size Distribution of CNV in a Human Genome

~2500 SVs >2.5kb per Person

VCFS

* Analysis of SV Breakpoints

Homologous Recombination 14%

Nonhomologous Recombination 56%

Retrotransposons 30%

17% of SVs Affect Genes

Non-allelic homologous recombination (NAHR; breakpoints in OR51A2 and OR51A4)

Olfactory Receptor Gene Fusion

Heterogeneity in Olfactory Receptor Genes (Examined 851 OR Loci)

CNVs affect: 93 genes, 151 pseudogenes (ψgenes)

The Cost of DNA Sequencing is Dropping

Human Genome Cost ~$3K
http://www.genome.gov/

1 Billion Human DNA Sequences

A Personal Genome Sequence is Determined by Comparing to a Reference Genome Sequence

30X: 75-100 b Map to Reference Genome
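As a rough check on how 30X coverage with 75-100 b reads relates to about a billion sequences, coverage is (number of reads × read length) / genome size. The sketch below assumes a haploid genome size of 3.1 Gb, which is not stated on the slide; the function names are illustrative.

```python
import math

# Assumption: approximate haploid human genome size (not given on the slide).
GENOME_SIZE_BP = 3_100_000_000


def reads_needed(coverage, read_length_bp, genome_size_bp=GENOME_SIZE_BP):
    """Reads required so that reads * read_length / genome_size >= coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def coverage_from_reads(n_reads, read_length_bp, genome_size_bp=GENOME_SIZE_BP):
    """Average depth of coverage obtained from n_reads of a given length."""
    return n_reads * read_length_bp / genome_size_bp


for read_length in (75, 100):
    print(read_length, "b reads:", reads_needed(30, read_length), "reads for 30X")
```

Under this assumption, 30X needs on the order of 0.9 to 1.2 billion reads, in line with the "1 Billion Human DNA Sequences" figure above.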

Snyder et al. Genes Dev 2010;24:423-431

Methods to Find SVs

1. Paired ends
2. Read depth (or aCGH)
3. Split read
4. Match with database

[Figure: for methods 1-3, a deletion shown relative to the reference; sequenced paired-ends, reads, and read counts (against a zero level) are mapped to the reference genome]

[Snyder et al. Genes & Dev. 2010]

Variant Calling Pipeline

Challenges with Genome Sequencing

1) Accuracy (particularly indels and SVs)

2) Coverage

3) Phasing

Genome Sequencing Reveals Many Variants (3.3 M Hi conf. SNVs, 217K Indels and 3K SVs)

• Complete Genomics: 35 b paired ends (150X)
• Illumina: 100 b paired ends (120X)

[Venn diagram of SNV calls: Illumina only 345K (9%); both platforms 3.30M (89%); Complete Genomics only 100K (2%)]

Hugo Lam

SNV Comparison

• Complete Genomics: 35 b paired ends (150X)
• Illumina: 100 b paired ends (120X)

[Venn diagram of SNV calls, Illumina vs. Complete Genomics: 345K (9%), 3.30M (89%), 100K (2%). Annotations on the diagram: Ti/Tv = 2.14, 1.40 and 1.68; Sanger validation 20/20, 2/15 and 17/18; 31 and 3 disease-associated SNPs]
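The Ti/Tv values quoted for the SNV comparison are transition/transversion ratios, a common quality check on SNV call sets. A minimal sketch of the calculation follows; the function names and the toy input are illustrative, not taken from the analysis behind the slide.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition swaps purine for purine (A<->G) or pyrimidine for pyrimidine (C<->T)."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snvs):
    """Compute Ti/Tv from an iterable of (ref_base, alt_base) pairs."""
    transitions = 0
    transversions = 0
    for ref, alt in snvs:
        if ref == alt:
            continue  # not a variant; ignore
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions; ratio is undefined")
    return transitions / transversions


# Toy example: three transitions and two transversions.
toy_calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(toy_calls))
```

Genome-wide human SNV sets typically show Ti/Tv a little above 2, so a call set with a markedly lower ratio is often taken as a sign of more false positives.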

Hugo Lam, Michael Clark, Rui Chen

Sequencing Accuracy

Sequencing the Same Genome Twice

146,100 SNPs (3.7%)

Personalis

Variant Detection vs Sequencing Depth

Exome-seq (80X) and WGS-specific detection (45X)

Exome-seq can find variants missed by WGS

Local phasing + population data = highly phased blocks

Percent SNPs phased: 98.2%
Switch accuracy: 99.9%+

Moleculo: Volodymyr Kuleshov, Michael Kertesz

Genome Sequencing

1) Understanding and treating human disease
   - Cancer
   - Mystery Diseases
   - Prenatal testing

2) Examine human variation (1000 Genomes Project)

3) Personal genome sequencing for examining disease risk?

Cancer Genome Sequencing

1) Cancer is a genetic disease: both inherited and somatic

2) 10-20 “driver” mutations

3) Each cancer is unique

4) Sequence genomes (cancer tissue and normal) find genetic changes and suggest possible therapies

Vogelstein et al., Science, March 2013

Cancer Genome Sequencing

Normal (30-40X) vs. Cancer (>60X)

GATTTAGATCGCGATAGAG
GATTTAGATCACGATAGAG

Whole Genome Sequencing of Esophageal Cancer

Small Somatic Variants

[Chart labels: Deletions, Insertions, SNV, Substitutions; values shown: 1695, 2178, 3720, 14950]

197 Deletions
117 Amplifications

>1000 Cancer Genomes Have Been Sequenced

The Cancer Genome Atlas

• Brain (Glioma, Glioblastoma)
• Breast
• Gastrointestinal (Colon, Rectal, Stomach)
• Gynecologic (Ovarian, Cervical, Uterine)
• Head and Neck
• Hematologic (AML)
• Melanoma
• Lung
• Urologic (Kidney, Prostate)

Results of Cancer Genomics

• 10-100,000 SNVs for many cancer types + structural variants
• 473 cancer genes have been identified
• Most common genes are already known, e.g. p53, RB1, KRAS, BRCA1/2, etc.

- Every patient has a different spectrum of mutations

Bell et al., 2011, Nature

More Results

- Mutations often fall in common pathways, e.g. Wnt, PI3K/AKT, Ras/Raf, TP53, RB, etc. (~12 pathways; Vogelstein)

- Different types of cancer can have the same mutated gene, e.g. EGFR mutated in NSCLC and colon cancer

- It is common to find a potentially “druggable” mutation from sequencing

TCGA-identified Pathways in Ovarian Cancer

[Figure panels: RB and PI3K/AKT Signaling; Homologous Recombination; Notch Signaling]

Copy number alterations from metastatic colon tumor

Chromosome 7: Two amplification regions

[Figure: log2 ratio (genomic copy number) versus chromosome 7 coordinates (Mb, NCBI 37), showing the 7p arm, centromere (CEN) and 7q arm, with the normal diploid copy number marked]

• Two regional amplifications with complex genomic structure.
• Both loci > 10 copy number showing statistical significance.

With Hanlee Ji

Pharmacogenomics: Matching Drugs to Disease

Her2 Mutated in 25% of Breast Cancer

Iressa®:
– Used to treat non-small cell lung carcinoma
– 10% of patients respond
– Works only if patients' tumors have very specific mutations in EGFR

CML (Chronic Myeloid Leukemia) - Gleevec

1) Chromosome translocation involving Abl protein kinase and bcr

2) Gleevec binds the active site of Abl and inhibits it; causes remission

3) Gleevec binds other kinases that are mutated in other cancers, so it can be used to treat them.

Genome Sequencing and Mystery Disease

• Rare Mendelian Diseases: Some examples
  – Miller’s syndrome
  – Charcot-Marie Disease
  – Brain formation (exome)
  – Congenital diarrhea (exome)
  – Cranial facial formation (exome)

Family with Charcot-Marie Disease

• Neuropathy
  – Heterogeneous disease: many different genes mapped
• Sequence genome to 30X coverage
• 3.4 M SNPs (561,719 novel)
  – 2,255,102 in intergenic regions
  – 1,165,204 in genes, introns etc.
• 174 nonsynonymous SNPs in region of interest
• 54 related to neuropathies
• Ultimately zoomed in on SH3TC2 gene: full-blown disease has two mutations: Y169H (missense), R954X (nonsense)*
• Single heterozygotes have some phenotypes

*Implicated previously Lupski et al 2010 NEJM
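The SH3TC2 result is a compound heterozygote: two different heterozygous mutations in the same gene, one on each chromosome copy. A minimal sketch of screening for that pattern is below. It assumes phased genotypes written as "1|0" or "0|1" (variant on the first or second haplotype); the phase assigned to each SH3TC2 mutation and the second gene are made up for illustration.

```python
from collections import defaultdict


def compound_het_genes(variants):
    """Return genes carrying heterozygous variants on both haplotypes.

    `variants` is an iterable of (gene, change, phased_genotype) where
    phased_genotype is "1|0" or "0|1".
    """
    haplotypes_hit = defaultdict(set)
    for gene, _change, phased_genotype in variants:
        first, second = phased_genotype.split("|")
        if first == second:
            continue  # skip homozygous or reference calls
        if first == "1":
            haplotypes_hit[gene].add(1)
        if second == "1":
            haplotypes_hit[gene].add(2)
    return sorted(g for g, haps in haplotypes_hit.items() if haps == {1, 2})


# Hypothetical phasing: the slide gives the two mutations but not their phase.
example = [
    ("SH3TC2", "Y169H", "0|1"),
    ("SH3TC2", "R954X", "1|0"),
    ("GENE_B", "variant_1", "0|1"),  # made-up gene: both hits on one haplotype
    ("GENE_B", "variant_2", "0|1"),
]
print(compound_het_genes(example))
```

Without phase information the same screen can only flag genes with two or more heterozygous damaging variants as candidates; parental genotypes or long-range phasing are then needed to confirm that the variants lie on different chromosomes.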

Solving Mystery Diseases: Dizygotic Twins: Dopamine Responsive Dystonia

• Constantly sick, colicky, failed to meet milestones, “floppy”; MRI showed some abnormalities
• Children diagnosed with X dystonia
• Trial of L-DOPA showed dramatic improvement in 2 days
• Sequenced genomes: found mutation in SPR gene
• Administered dopamine + serotonin precursor

From Richard Gibbs, Baylor

Solving Mystery Diseases: Child With Variety of Conditions
Developmentally Delayed, Significant Health Issues

          SNVs        Private    Indels
Father    3,119,588   596,691    750,522
Mother    3,125,880   581,754    723,379
Child     3,118,638   33,158     673,809

SNVs: single nucleotide variants
Indels: insertions/deletions (~<100bp)

Candidates: TCP10L2, SUPV3L1, PIEZO1, DNAH2, NGLY1, FANCA, WFS1

Lessons Learned

1) Overall success rate for identifying causative mutations is low: 25%

2) Information not always directly actionable but still valuable.

3) Best success with a) Specific phenotypes b) Large families

4) Need large database to share information: Recurrence is key. ClinVar

Fetal DNA Sequencing

1) Cell-free fetal DNA can be detected in maternal blood as early as 4-5 weeks gestation

2) 4-10% of circulating DNA is fetal (increases with pregnancy)

3) Targeted detection of mutations

4) Whole genome sequencing routinely used to detect trisomies: Down’s (Chr. 21), Chromosome 18 and Chromosome 13. 99% sensitivity

5) Taking over from amniocentesis

Fetal DNA Sequencing
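A rough illustration of why read counting can reveal a trisomy even when only 4-10% of circulating DNA is fetal: if a fraction f of the DNA is fetal and the fetus carries three copies of a chromosome, reads from that chromosome are enriched by a factor of ((1 - f)·2 + f·3) / 2 = 1 + f/2. The sketch below scores a sample against euploid reference samples with a z-score; the reference fractions are made-up numbers and the function names are illustrative, not the method of any particular test.

```python
from statistics import mean, stdev


def trisomy_enrichment(fetal_fraction):
    """Expected fold-increase in reads from a chromosome that is trisomic in the fetus."""
    return 1.0 + fetal_fraction / 2.0


def chromosome_z_score(sample_fraction, reference_fractions):
    """Z-score of a sample's read fraction for one chromosome vs. euploid references."""
    mu = mean(reference_fractions)
    sigma = stdev(reference_fractions)
    return (sample_fraction - mu) / sigma


# Hypothetical fractions of reads mapping to chromosome 21 in euploid pregnancies.
reference = [0.0130, 0.0131, 0.0129, 0.0130, 0.0130]

# A trisomy 21 pregnancy with 10% fetal DNA: 5% more chromosome 21 reads.
sample = mean(reference) * trisomy_enrichment(0.10)
print(chromosome_z_score(sample, reference))
```

Because the expected shift is only a few percent, the approach depends on counting many reads per chromosome so that the reference spread is much smaller than f/2.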

Srinivasan A, Bianchi DW, Huang H, et al. Am J Hum Genet 2013; 1–10.

Personal Genome Sequencing:

Can genome sequencing of a healthy person be useful in health care?

Health Is a Product of Genome + Environment

[Diagram: Genome, Exposome, Health]

Personalized Medicine: Combine Genomic and Other Omic Information

[Slide graphic: a block of genomic sequence (GGTTCCAAAAGTTTATTGGATGCCGTTTCA...) labeled “Genomic”; a second build of the slide adds “Transcriptomic, Proteomic”]

Presently: Few Tests (<20)

1. Predict risk, 2. Diagnose, 3. Monitor, 4. Treat, & 5. Understand Disease States

Personal “Omics” Profiling (POP)

• Genome
• Epigenome
• Transcriptome (mRNA, miRNA, isoforms, edits)
• Proteome
• Cytokines
• Metabolome
• Autoantibody-ome
• Microbiome

Initially 40K molecules/measurements; now billions!

Personal Omics Profile: 44 months; 66 Timepoints; 6 Viral Infections

Chen et al., Cell 2012

Accurate Genome Sequencing

Whole Genome Sequencing
• Complete Genomics: 35 b paired ends (150X)
• Illumina: 100 b paired ends (120X)

Exome Sequencing
• Nimblegen
• Illumina
• Agilent

[Venn diagram, Illumina vs. CG: 345K (9%), 3.30M (89%), 100K (2%)]

3.3 M Hi conf. SNVs, 217K Indels and 3K SVs: 2 or more platforms (plus low confidence variants)

[Figure labels: M, P]

Percent SNPs phased: ~80%

- Misses many variants (only assigns certain hets)
- Cannot properly assign de novo variants

Local phasing + population data = highly phased blocks

Percent SNPs phased: 98.2%
Switch accuracy: 99.9%+

Moleculo: Volodymyr Kuleshov, Michael Kertesz

Many SNVs are Expressed

RNA
• 2.67 B 100 b PE reads
• 30,963 (40 reads or more)
• 1,797 nonsynonymous
• 8 nonsense

Protein
• >130 Hi Confidence

RNA Editing
• 2,376 Hi confidence

Allele Specific Expression

Jennifer Li-Pook-Than

Medical Interpretation Pipeline

[Flowchart: All variants (~3.5M); Rare/novel variants (<5%); branches Non-coding (mRNA, tRNA, miRNA, splice, UTR; sub-labels: stability, rate, seed sequence, miRNA targets) and Coding (synonymous; nonsynonymous (1320)); SIFT / PP2 damaging (234); OMIM/curated Mendelian disease (51)]

Rick Dewey & Euan Ashley

Curated List of Rare Variants (SNVs, All heterozygous)

Missense
• ALAD, ABCC2, ACADVL, ADAMTS13, AGRN, BAAT, CDS1, CHD7, COL4A3, CTSD, DGCR2, DLD, DYSF, EPCAM, FGFR1OP, FKRP, GAA, GNAI2, HSPB1, IGKC, ITPR1, MED12, MKS1, NTRK1, PCM1, PKD1, PLEKHG5, PMS2, PRSS1, PTCH2, SERPINA1, SETX, SYNE1, TERT, TTN, VWF, ZFPM2, PNPLA2.

Nonsense
• PRAMEF2, PLCXD2, NUP54, RP1L1, PIK3C2G, NDE1, GGN, CYP2A7, IGKC

Not Rare But Important
• KCNJ11, KLF14, GCKR …

Bolded genes are expressed in PBMC (RNA).

Rare Variants in Disease Genes (51 Total)

[Repeat of the missense, nonsense and “not rare but important” gene lists, annotated with associated conditions: Aplastic Anemia, High Cholesterol, Diabetes]

Disease risk profile from Integration Using VariMed

Rong Chen and Atul Butte

Disease risk profile: Type 2 diabetes (T2D)

Prevalence: 27%

Gene       SNP          Genotype  LR    Studies  Samples  Probability
TCF7L2     rs7903146    CT        1.18  49       140717   30%
-          rs10811661   TC        0.85  18       154141   27%
KCNQ1      rs2237892    CT        0.80  13       6570     23%
FTO        rs8050136    CC        0.87  10       63470    20%
CDKAL1     rs7754840    GG        0.91  10       51327    19%
SLC30A8    rs13266634   CT        0.94  9        145718   18%
IGF2BP2    rs4402960    GT        1.06  8        104401   19%
KCNJ11     rs5219       TT        1.15  7        87066    21%
-          rs1111875    TT        0.88  6        93188    19%
PPARGC1A   rs2970847    TC        1.31  4        5558     24%
-          rs7578326    AA        1.07  3        94337    25%
-          rs4457053    GG        1.12  3        94337    27%
KCNQ1      rs231362     GG        1.07  2        94337    29%
ARAP1      rs1552224    AA        1.03  2        94337    29%
JAZF1      rs864745     TC        1.00  2        89920    29%
RBMS1      rs1020731    GA        0.95  2        84605    28%
-          rs9300039    CC        1.05  2        42170    29%
WFS1       rs10010131   GG        1.07  2        30248    31%
EPO        rs1617640    AA        1.48  2        4011     39%
TP53INP1   rs896854     TC        1.01  1        94337    40%
-          rs5945326    AA        1.09  1        94337    42%
-          rs12779790   AG        1.06  1        89920    43%
-          rs1153188    TA        0.96  1        89920    42%
THADA      rs7578597    TT        1.03  1        89920    43%
-          rs4607103    CC        1.04  1        89920    44%
-          rs17036101   GG        1.02  1        89920    44%
MTNR1B     rs10830963   CC        0.94  1        16061    43%
HNF1B      rs4430796    GG        1.13  1        11320    46%
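The Probability column above appears consistent with a running update of the pre-test probability (the 27% prevalence) by each SNP's likelihood ratio (LR) in turn: convert probability to odds, multiply by the LR, convert back. The sketch below illustrates that calculation and assumes the SNPs are treated as independent. It is an interpretation of the table, not the VariMed implementation, and because the published LRs are rounded it only reproduces the table approximately.

```python
def update_probability(prior_probability, likelihood_ratio):
    """One Bayesian update: probability -> odds, multiply by LR, odds -> probability."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


def cumulative_probabilities(prevalence, likelihood_ratios):
    """Apply each LR in order, returning the running post-test probability."""
    probabilities = []
    current = prevalence
    for lr in likelihood_ratios:
        current = update_probability(current, lr)
        probabilities.append(current)
    return probabilities


# First three rows of the table: TCF7L2, rs10811661, KCNQ1 (rs2237892).
first_rows = cumulative_probabilities(0.27, [1.18, 0.85, 0.80])
print([round(100 * p) for p in first_rows])
```

An LR above 1 raises the estimate and an LR below 1 lowers it, which is why the running probability dips to the high teens mid-table before the later risk genotypes push it up.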

Rong Chen and Atul Butte

GLUCOSE LEVELS

[Figure: glucose (mg/dL, axis 80-160) versus day number relative to 1st infection (-150 to 650). HbA1c (%) readings by day number: 6.4 (329), 6.7 (369), 4.9 (476), 5.4 (532), 5.3 (546), 4.7 (602). Marked periods: HRV infection (day 0-21), RSV infection (day 289-311), lifestyle change (day 380-current).]

High Interest Drug-Related Variants

Gene      SNP          Patient genotype  Drug(s) Affected
-         rs10811661   C/T               Troglitazone (Increased Beta-Cell Function)
CYP2C19   rs12248560   C/T               Clopidogrel (Increased Activation)
LPIN1     rs10192566   G/G               Rosiglitazone (Increased Effect)
SLC22A1   rs622342     A/A               Metformin (Increased Effect)
VKORC1    rs9923231    C/T               Warfarin (Lower Dose Required)

Expression of 50 Cytokines

[Heat map: color key 0 to 1; columns labeled HRV, RSV and “?” (DAY 0, DAY 0, DAY 12). Row labels: LIF, NGF, G-CSF, IFN-G, IL-12-P70, IL-5, IL-13, TNF-A, IL-1B, IL-4, IL-2, IL-7, IL-17, TGF-b, EOTAXIN, PDGFBB, ENA78, MCP-1, GRO ALPHA, IL-17F, IFN-b, FGFb, IL-12P40, GM-CSF, IL-10, TNF-B, MCP-3, M-CSF, SCF, VEGF, MIP1a, sFAS ligand, IL-15, PAI-1, IP10, IFN-a, MIG, IL-1a, CRP, V-CAM-1, ICAM-1, IL-1RA, RANTES, Trail, Insulin, MIP-1B, CD40Ligand, Resistin, HGF, LEPTIN, IL-8, TGF-a, UNK.I.S, UNK.II.S, UNK.Z.S, UNK.V.S, UNK.X.S, UNK.III.S, UNK.VI.S, UNK.IX.S, UNK.XI.S, UNK.XII.S, UNK.XV.S, UNK.VIII.S, UNK.XIII.S, UNK.XIV.S, UNK.XVI.S, UNK.XVII.S]

Transcriptome, Proteome, Metabolome Analysis Summary: Processing Steps

[Flowchart: Raw datasets (transcriptome, proteome, metabolome); Data preprocessing (QC, normalization, statistical simulation); Common framework in Fourier space (data classification: autocorrelated (I) and spike sets (II) & (III)); Clustering and enrichment analysis]

(1) Preprocessing
(2) Common Classification Scheme
(3) Clustering and Enrichment Analysis
    - Overall trends (autocorrelation)
    - Spikes at specific timepoints

George Mias

Integrated Analysis of Proteome, Transcriptome, Metabolome Dynamics: Overall trend

[Figure label: RSV]

George Mias

Dynamical Outcomes for Integrated Analysis of Proteome, Transcriptome, Metabolome

[Figure panels: Glucose Regulation of Insulin Secretion; Platelet Plug Formation. Labels: RSV, 18 days]

George Mias

Study of 10 Healthy People: 5 Asian, 5 European
Dewey, Grove, Pan, Ashley, Quertermous et al.

- Median 5 reportable disease risk associations (ACMG) per individual (range 2-6)
- 3 follow-up diagnostic tests (range 0-10)
- Cost $362-$1427 per individual
- 54 minutes per variant

Many Unaddressed Challenges

1) Interpreting non protein coding regions

2) DNA Methylation

3) Microbiome (with George Weinstock)

4) Exposome

5) Other

Mapping Regulatory Information to Personal Genomes

Two approaches:

1) RegulomeDB: Assembling regulatory information from the ENCODE Project and other sources.

2) Mapping transcription factor binding in different people.

RegulomeDB

Data Type                                    Types                                 Features      Genomic Coverage (bp)
Transcription Factor ChIP-Seq (ENCODE)       495 conditions / cell lines           7,721,822     230,795,743
Transcription Factor ChIP-Seq (non-ENCODE)   32 conditions / cell lines            397,534       140,534,725
Transcription Factor ChIP-exo                1 condition                           35,161        2,604,066
Histone Modifications                        284 conditions / cell lines / marks   23,055,241    2,805,205,184
DNase I Hypersensitive Sites                 114 conditions / cell lines           20,710,098    614,973,579
FAIRE Sites                                  25 conditions / cell lines            4,816,196     476,386,909
DNase I Footprints                           50 cell lines                         128,266,803   178,722,370
Predicted Binding (PWMs)                     1,158 motifs                          239,713,973   1,151,732,122
eQTLs                                        142,945 SNPs                          142,945       142,945
dsQTLs                                       6,069 SNPs                            6,069         6,069
Manual Annotations                           6 Genomic Regions                     282           11,607
VISTA Enhancers                              1,448 Enhancers                       1,325         1,658,146
Validated SNPs affecting binding             855 SNPs                              855           855

Alan Boyle

Damaging Variation in an Individual

[Figure: damaging variants in the protein-coding and non-coding (regulatory) regions of a gene]

Alan P Boyle, RegulomeDB

CAPN1 Compound Heterozygote

[Figure: CAPN1 and its regulatory region on the maternal and paternal chromosomes]

• Calcium-sensitive cysteine protease in brain synapses
• Protective against Alzheimer’s Disease (Trinchese et al. 2008)

Alan P Boyle, RegulomeDB

DNA Methylation

• Deep Sequencing: two time points analyzed
  a) 1.5 B uniquely mapped reads (50X)
  b) 2.69 B uniquely mapped reads (89.6X)

• ~19,000 non CG disruption allele differential methylated CGs

• 539 allele differential methylated regions (DMRs)

• Identified well known regions: H19, GNAS

• Identified many novel regions

Differential DNA Methylation

Novel Imprinting at SEC22B Locus

DMRs Overlap with Regulatory Regions

[Bar chart: y-axis 0 to 1; categories TSS, TES, 5UTR, 3UTR, exon, gene, DNaseI, TF]

Possible Phenotypic Consequences of DMRs?

[Figure: 10 kb scale, chr17:21,302,276-21,333,703; maternal (M) and paternal (P) tracks, scale 1 to -1; gene KCNJ18]

Damaging SNPs in the Paternal Allele

KCNJ18, OMIM: Susceptibility to thyrotoxic periodic paralysis

The Future?

Genomic Sequencing, Omes and Other Information: Home Sensors

[Slide graphic: block of genomic sequence (GGTTCCAAAAGTTTATTGGATGCCGT...)]

1. Predict risk 2. Diagnose early 3. Monitor 4. Treat

http://www.baby-connect.com/

Three Impact Areas for Sequencing a Healthy Person’s Genome

• Predisposition testing – determines who is at risk for a disease
• Pharmacogenomics – determines which drug and how much of the drug is best for treating a given individual
• Issues – how should a person’s personal genetic information be used?

Predisposition Testing: Examples

• BRCA1/BRCA2: screening in women with a significant family history may help predict their risk for breast or ovarian cancer
• Long QT syndrome (cardiac condition): screening of individuals with LQTS may help establish what type of LQTS is present and guide disease management.
  a) Different mutations have different effects; can cause arrhythmia
  b) Some drugs work better for some patients than others.
• Type 2 diabetes: screening individuals with impaired glucose tolerance for TCF7L2 may help identify those who are at high risk for progressing to type 2 diabetes

• ~1700 loci implicated in disease: can search these.

Predisposition Testing: Snyder (heterozygous variants)

1) SERPINA1 defining the Pi Z allele in alpha-1 antitrypsin (carrier), confirming prior.

2) Variant (A202T) in TERT associated with predisposition to aplastic anemia.

3) NTRK1 associated with congenital insensitivity to pain and anhidrosis (carrier).

4) IGKC (nonsense mutation) associated with immunoglobulin kappa light chain deficiency (mild clinical phenotype with recurrent URI)

Pharmacogenomics

• People differ in how they respond to drugs in part due to genetics

• E.g. 20% to 40% of people with hypertension, depression, and diabetes do not respond to drug treatment [1]

• Genetic factors influence:
  – Drug concentration in the body
  – How active a drug is against its target in the body

[1] Haga et al. JAMA. 2004;291:2869-2871.

Pharmacogenomics: Drug Responses

Cytochrome P450s
- Involved in drug metabolism
- ~80 genes; many exhibit sex-specific expression

Warfarin
- Anticoagulant
- CYP2C9 & VKORC1 alleles determine sensitivity

Predicted Drug Response from a Genome Sequence

Advantages of Pharmacogenomics

1) Targeted therapies
   a) Only give drugs to patients who will respond
   b) Only give to patients who will not have adverse side effects
2) Can give the right dose

How valuable is a genome sequence for medicine?

• Right now few loci are predictive of disease

• Most diseases are complex
  - Diabetes, schizophrenia, autism
  - Many can be better explained by family history, e.g. height
  - Why can’t the relevant loci be detected?

Conclusions

1) Genome sequencing is here

2) Improvements are still ongoing

3) It will have an important impact on understanding human disease and on medicine

Most Genome Sequencing Projects Ignore SVs

Project                               Technology         Paired End  SNPs; Short Indel    SVs                                     New Seq.  Genotype  Reference
European-Venter                       Sanger             Yes         3M; 0.3M             0.2M (>1000bp)                          1M        Limited   Levy et al., 2007
European-Watson                       454                No          3M; 0.2M             Limited                                 No        No        Wheeler et al., 2008
European-Quake                        Helicos            No          3M                   Limited                                 No        No        Pushkarev et al., 2009
Asian                                 Illumina           Partially   3M; 0.1M             2.7K (>100bp)                           No        No        Wang et al., 2008
HapMap Sample; Yoruban 18507          Illumina           Yes         4M; 10K              0.1K                                    No        No        Bentley et al., 2008
HapMap Sample; Yoruban 18507          SOLiD              Partially   4M; 0.2M             5.5K (unknown definition)               No        No        McKernan et al., 2009
Korean                                Illumina           Yes         3M                   Limited                                 No        No        Ahn et al., 2009
Korean-AK1                            Illumina           Yes         3.45M; 0.17M         ~300 CNVs                               No        No        Kim et al., 2009
Three human genomes                   Complete Genomics  Yes         3.2-4.5M; 0.3-0.5M   Limited (50-90K block substitutions)    No        Limited   Drmanac et al., 2009
AML genome & normal counterpart       Illumina           No          3.8M; 0.7K           Limited                                 No        No        Ley et al., 2008
AML genome                            Illumina           Yes         64                   Limited                                 No        No        Mardis et al., 2009
Melanoma genome                       Illumina           Yes         32K; 1K              51                                      No        No        Pleasance et al., 2009a
Lung cancer genome                    SOLiD              Yes         23K; 65              392                                     No        No        Pleasance et al. 2009b

Why Are SVs Not Studied More?

• Often involves repeated regions (transposons, duplicated regions)

• Rearrangements are complex

• People have not appreciated their importance

Paired-End Mapping (PEMer)

[Workflow figure; fragments carry “Bio” labels]
1. Genomic DNA: shear to 3 kb fragments
2. Adaptor ligation (Bio)
3. Circularize
4. Random cleavage
5. Bio-labeled fragments, 200-300bp
6. 454 Sequencing (250bp reads, 400K reads/run)
7. Map paired ends to reference genome

Korbel et al., 2007 Science

Sequence Read Depth Analysis (CNVnator)

[Figure: reads from an individual sequence are mapped to the reference genome; counting mapped reads gives a read depth signal relative to a zero level]
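The read-depth idea in the figure can be sketched in a few lines: count mapped reads in fixed-size bins, then flag bins whose depth falls well below the typical depth. This is a deliberately simple thresholding sketch, not CNVnator's mean-shift method; the bin size, the 0.5 threshold and the function names are assumptions for illustration.

```python
from statistics import median


def bin_read_depth(read_starts, region_length, bin_size):
    """Count reads per bin from 0-based read start positions."""
    n_bins = -(-region_length // bin_size)  # ceiling division
    depths = [0] * n_bins
    for start in read_starts:
        depths[start // bin_size] += 1
    return depths


def low_depth_bins(depths, fraction=0.5):
    """Indices of bins whose depth is below `fraction` of the median depth."""
    typical = median(depths)
    return [i for i, depth in enumerate(depths) if depth < fraction * typical]


def merge_bins(bin_indices, bin_size):
    """Merge adjacent bin indices (sorted) into (start, end) regions."""
    regions = []
    for i in bin_indices:
        start, end = i * bin_size, (i + 1) * bin_size
        if regions and regions[-1][1] == start:
            regions[-1] = (regions[-1][0], end)  # extend the previous region
        else:
            regions.append((start, end))
    return regions


# Toy data: one read every 10 bp over 1 kb, with no reads from 400 to 599.
reads = [pos for pos in range(0, 1000, 10) if not 400 <= pos < 600]
depths = bin_read_depth(reads, region_length=1000, bin_size=100)
print(merge_bins(low_depth_bins(depths), bin_size=100))
```

A fixed threshold like this is sensitive to noise and to the choice of bin size, which is part of the motivation for smoothing approaches such as the one described next.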

Novel method, CNVnator, mean-shift approach

• For each bin, the attraction (mean-shift) vector points in the direction of bins with the most similar RD signal
• No prior assumptions about number, sizes, haplotype, frequency and density of CNV regions
• Achieves discontinuity-preserving smoothing
• Derived from image-processing applications

Alexej Abyzov

CNVnator on RD data

NA12878, Solexa 36 bp paired reads, ~28x coverage

SV/CNV: vs. Known SVs (Database of Genomic Variants + 1KG)

# Methods   Total    DGV + 1KG   %
Three       479      463         96.7%
Two         2,326    2,091       89.9%
One         53,694   6,094       11.3%
Total       56,499   8,646       15.3%
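Comparing SV calls against a database of known SVs needs a rule for when two intervals count as the same event. One common choice is reciprocal overlap, sketched below. The 50% threshold is an assumption, since the slide does not state the matching criterion, and the intervals are assumed to lie on the same chromosome.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for intervals a=(start, end), b=(start, end)."""
    a_start, a_end = a
    b_start, b_end = b
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def fraction_known(calls, known, min_reciprocal_overlap=0.5):
    """Fraction of calls matching at least one known SV at the given threshold."""
    if not calls:
        return 0.0
    matched = sum(
        1
        for call in calls
        if any(reciprocal_overlap(call, k) >= min_reciprocal_overlap for k in known)
    )
    return matched / len(calls)


# Toy intervals (made up): the first call matches a known SV, the second does not.
calls = [(100, 200), (1000, 2000)]
known = [(110, 210)]
print(fraction_known(calls, known))

# The percentage column of the table is simply matched / total, e.g. for three methods:
print(round(100 * 463 / 479, 1))
```

The table's pattern, where calls supported by more methods are far more often already known, is what one would expect if multi-method support is a reasonable proxy for call confidence.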

* Preliminary results -- Methods: RP, SR, RD; Platform

Pharmacogenomics: High Interest Drug-Related Variants

Gene      SNP          Patient genotype  Drug(s) Affected
-         rs10811661   C/T               Troglitazone (Increased Beta-Cell Function)
CYP2C19   rs12248560   C/T               Clopidogrel (Increased Activation)
LPIN1     rs10192566   G/G               Rosiglitazone (Increased Effect)
SLC22A1   rs622342     A/A               Metformin (Increased Effect)
VKORC1    rs9923231    C/T               Warfarin (Lower Dose Required)

Can help tell a) which drugs to take/avoid and b) dosing

Mapping Regulatory Variation to Personal Genomes

Two approaches:

1) RegulomeDB: Assembling regulatory information from the ENCODE Project and other sources.

2) Mapping transcription factor binding in different people.

RegulomeDB

Data Type                                        Count
Transcription Factor ChIP-Seq (ENCODE)           495 conditions / cell lines
Transcription Factor ChIP-Seq (non-ENCODE)       156 conditions / cell lines
DNase I Hypersensitive Sites                     114 cell lines
DNase I Footprints (predicted with CENTIPEDE)    19 cell lines
Predicted Binding (PWMs)                         751 motifs
eQTLs                                            130,868 SNPs
Chromatin Marks (Epigenomics Road Map)           >100
Validated SNPs affecting binding                 2,217 SNPs
Manual Curation                                  3 loci

Alan Boyle

Variability at CD2 and ZNF804A loci in Humans

[Figure: NFκB and Pol II binding signal tracks for individuals 19193, 19099, 18505, 18951, 18526, 15510, 10847, 12878, 12892 and 12891, with IgG control]

7.5% of NFκB and 25% of Pol II binding regions vary

Variable Binding Sites Can Be Associated with Disease Risk, e.g. Atherosclerosis

rs6135095 genotypes:
19099   C/C
18505   C/C
18951   A/A
18526   A/A
10847   C/C
12878   A/C
12892   A/C
12891   A/C

[Figure: NFKB binding track (12878), Chr 20, 1,420,000-1,424,000]

Variable Binding Associated Personal Sequences

[Figure panels A, B]

Do You Want To Get Your Genome Sequenced?

Why You Might

1) Might want to know if you are likely to get certain diseases

2) Might Want To Know Sensitivity To Medication (Pharmacogenomics)

3) Just curious

Why You Might Not

1) Might not want to know if you are likely to get certain diseases (e.g. incurable diseases)

2) Most diseases are complex: we don’t really know enough to make good use of the information

3) Worry about genetic privacy/ discrimination by employers, insurance companies

Misuse of Genetic Information— Possible Concerns

• Discrimination (health insurance, employment)
• Misinterpretation by consumers
  – Many genetic markers indicate increased risk/predisposition to a condition, not absolute certainty that the condition will exist
  – You may have a genetic predisposition to diabetes but may never develop the disease

Genetic Testing and the Career of Eddy Curry

• Hypertrophic Cardiomyopathy (HCM)
  – HCM makes excessive exertion potentially fatal
  – Star basketball player Reggie Lewis (Celtics) died of it

From: cache.viewimages.com

Genetic Discrimination and the Career of Eddy Curry

From: www.raisport.rai.it

• Experienced heart arrhythmia when he was playing for the Chicago Bulls in 2005
• He was benched because of the possibility he had HCM
• Bulls made contract offers contingent on Curry taking a genetic test for HCM
• Curry refused because a positive test would effectively end his NBA career

From: www.raisport.rai.it

Issues

• Should an employer be allowed to require an employee to take a genetic test?
• Should Eddy Curry have the test done?

Genetic Information Nondiscrimination Act

From: images.businessweek.com

• Recently passed by Congress
• Protects individuals against discrimination (in health insurance or employment) based on their genetic information
• Intended to encourage Americans to take advantage of genetic testing as part of their medical care

• Does not cover long term disability or life insurance!

Flow chart for determining a personal genome sequence.

Snyder et al. Genes Dev. 2010;24:423-431

Understand and Treat Human Disease

1) Late Stage Cancer

2) Analyzing familial diseases

Exome Sequencing (exome ~2% of the whole genome)

[Figure: cost per genome over time, log scale from $1 to $100,000,000. Annotations: the cost of whole genome sequencing remains relatively high; exome sequencing is <1/10th the cost of WGS.]

http://www.genome.gov/sequencingcosts/

Exome Sequencing

Image from Bamshad et al (2011) and Metzker (2009)

Effect of Motif Associated SNPs on Binding

[Figure: NFκB and Pol II binding tracks at the PIK3R1 locus for individuals 19193, 19099, 18505, 18951, 18526, 15510, 10847, 12878, 12892 and 12891, with IgG control; per-individual genotypes are shown for two NFκB motif-associated SNPs and one Pol II SNP; position weight matrices shown for NFκB, TATA box, CCAAT box and GC box]
