Applications of Human Genome Sequencing: Personal and Disease Genomics
Michael Snyder
February 11, 2014
Conflicts: Personalis, Genapsys
Outline
• General introduction on human variation
• Disease genome sequencing – Cancer, Mystery Diseases
• Personal genomics
Genetic Variation Among People: Three Types
1) Single nucleotide variants (SNVs)
GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG
3.7 Million/person
2) Short Indels (Insertions/Deletions 1-100 bp)
GATTTAGATCGCGATAGAG
GATTTAGA------TAGAG
300-600K/person
3) Structural Variation (SV): Large Blocks (1000bp) of DNA that are Deleted, Inserted or Inverted
- People Have >2,500 differences (>2.5kb) Relative to the Reference Human Genome Sequence
- Likely responsible for many human differences and diseases
Size Distribution of CNV in a Human Genome
~2500 SVs >2.5kb per Person
* VCFS
Analysis of SV Breakpoints
Homologous Recombination 14%
Nonhomologous Recombination 56%
Retrotransposons 30%
17% of SVs Affect Genes
Non-allelic homologous recombination (NAHR; breakpoints in OR51A2 and OR51A4)
Olfactory Receptor Gene Fusion
Heterogeneity in Olfactory Receptor Genes (Examined 851 OR Loci)
CNVs affect: 93 Genes, 151 pseudogenes (ψgenes)
The Cost of DNA Sequencing is Dropping
Human Genome Cost ~$3K
http://www.genome.gov/
1 Billion Human DNA Sequences
A Personal Genome Sequence is Determined by Comparing to a Reference Genome Sequence
30X coverage: 75-100 b reads mapped to the reference genome
Snyder et al. Genes Dev 2010;24:423-431
Methods to Find SVs
1. Paired ends (figure: deletion; sequenced paired-ends mapped to the reference genome)
2. Read depth (or aCGH) (figure: deletion; read count along the reference relative to the zero level)
3. Split read (figure: deletion; read mapped to the reference genome)
4. Match with database
[Snyder et al. Genes & Dev. 2010]
Variant Calling Pipeline
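As a rough illustration of calling variants by comparison to a reference, the sketch below piles up gap-free aligned reads and reports positions where a non-reference base is well supported. It is a toy, not the pipeline shown on the slide: the function name `call_snvs`, the thresholds `min_depth` and `min_alt_fraction`, and the example reads are all assumptions made for illustration.

```python
from collections import Counter

def call_snvs(reference, reads, min_depth=4, min_alt_fraction=0.25):
    """Toy SNV caller (hypothetical). `reads` holds (start, sequence) pairs,
    already aligned to `reference` without gaps; `start` is 0-based."""
    # Count the bases observed at every reference position.
    pileup = [Counter() for _ in reference]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    calls = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too little coverage to call anything here
        for base, n in counts.items():
            if base != reference[pos] and n / depth >= min_alt_fraction:
                calls.append((pos, reference[pos], base, n, depth))
    return calls

# Reference and variant taken from the SNV example earlier in the deck;
# the individual reads are made up.
reference = "GATTTAGATCGCGATAGAG"
reads = [
    (0, "GATTTAGATCTCGAT"),
    (2, "TTTAGATCTCGATAG"),
    (4, "TAGATCGCGATAGAG"),
    (5, "AGATCTCGATAGAG"),
    (6, "GATCGCGATAGAG"),
]
snv_calls = call_snvs(reference, reads)
```

Each call is a tuple (position, reference base, alternate base, supporting reads, depth). With both alleles supported at the same site, the toy data resembles a heterozygous SNV.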
Challenges with Genome Sequencing
1) Accuracy (particularly indels and SVs)
2) Coverage
3) Phasing
Genome Sequencing Reveals Many Variants (3.3 M Hi conf. SNVs, 217K Indels and 3K SVs)
• Complete Genomics: 35 b paired ends (150X)
• Illumina: 100 b paired ends (120X)
SNV calls: Illumina only 345K (9%); both platforms 3.30M (89%); Complete Genomics only 100K (2%)
Hugo Lam
SNV Comparison
Platform set             SNVs    Share   Ti/Tv   Sanger validation
Illumina only            345K    9%      1.40    2/15
Both platforms           3.30M   89%     2.14    20/20
Complete Genomics only   100K    2%      1.68    17/18
31 Disease-Associated SNPs; 3 Disease-Associated SNPs
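The Ti/Tv values quoted for the SNV comparison are transition/transversion ratios. A minimal sketch of how such a ratio can be computed from a list of (reference base, alternate base) pairs follows; the function names and the example SNVs are illustrative assumptions, not the code used for the comparison.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """Transition: purine<->purine or pyrimidine<->pyrimidine change.
    Assumes ref != alt and both are one of A, C, G, T."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES

def ti_tv_ratio(snvs):
    """Ratio of transitions to transversions for (ref, alt) pairs."""
    ti = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = len(snvs) - ti
    if tv == 0:
        raise ValueError("no transversions; ratio undefined")
    return ti / tv

# Made-up SNVs: four transitions and two transversions.
example_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"),
                ("A", "C"), ("G", "T")]
ratio = ti_tv_ratio(example_snvs)
```

A call set whose ratio falls well below that of the concordant calls, as the platform-specific sets on the slide do, is one hint that it contains more errors.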
Hugo Lam, Michael Clark, Rui Chen
Sequencing Accuracy
Sequencing the Same Genome Twice
146,100 SNPs (3.7%)
Personalis
Variant Detection vs Sequencing Depth
Exome-seq (80X)- and WGS (45X)-specific detection
Exome-seq can find variants missed by WGS
Local phasing + population data = highly phased blocks
Percent SNPs phased 98.2% Switch accuracy 99.9%+
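Switch accuracy can be read as the fraction of adjacent heterozygous-site pairs whose relative phase agrees with the truth. The sketch below assumes a single phased block encoded as 0/1 haplotype labels per site; the encoding and the function name are assumptions, and this is not the Moleculo evaluation code.

```python
def switch_accuracy(predicted, truth):
    """Fraction of adjacent het-site pairs with no switch error.
    `predicted` and `truth` are equal-length lists of 0/1 saying which
    haplotype carries the alternate allele at each heterozygous site."""
    if len(predicted) != len(truth):
        raise ValueError("haplotype lists must have the same length")
    if len(predicted) < 2:
        raise ValueError("need at least two heterozygous sites")
    # A site is 'flipped' when prediction and truth disagree. Haplotype
    # labels are arbitrary, so only changes in the flipped state count.
    flipped = [p != t for p, t in zip(predicted, truth)]
    switches = sum(1 for a, b in zip(flipped, flipped[1:]) if a != b)
    return 1 - switches / (len(predicted) - 1)

truth = [0, 1, 0, 0, 1, 1, 0, 1]
predicted = [0, 1, 0, 0, 0, 0, 1, 0]   # one switch after the fourth site
accuracy = switch_accuracy(predicted, truth)
```

Because only changes of state are counted, a prediction that is the exact mirror image of the truth scores as fully accurate, which matches the fact that the two haplotypes have no intrinsic order.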
Moleculo: Volodymyr Kuleshov, Michael Kertesz
Genome Sequencing
1) Understanding and treating human disease
- Cancer
- Mystery Diseases
- Prenatal testing
2) Examine human variation (1000 Genomes Project)
3) Personal genome sequencing for examining disease risk?
Cancer Genome Sequencing
1) Cancer is a genetic disease: both inherited and somatic
2) 10-20 “driver” mutations
3) Each cancer is unique
4) Sequence genomes (cancer tissue and normal) to find genetic changes and suggest possible therapies
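The tumor/normal comparison in point 4 can be sketched as a set difference: variants seen in the tumor but not in the matched normal are candidate somatic changes. The sequences below are toy data (the germline change at position 7 is invented), and the function names are hypothetical. Real pipelines work on aligned reads with error models rather than on finished sequences.

```python
def variants_vs_reference(reference, sample):
    """Set of (position, ref_base, alt_base) where an equal-length sample
    sequence differs from the reference (substitutions only)."""
    if len(reference) != len(sample):
        raise ValueError("toy comparison needs equal-length sequences")
    return {(i, r, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s}

def somatic_variants(reference, normal, tumor):
    """Variants present in the tumor but absent from the matched normal."""
    return sorted(variants_vs_reference(reference, tumor)
                  - variants_vs_reference(reference, normal))

reference = "GATTTAGATCGCGATAGAG"
normal    = "GATTTAGCTCGCGATAGAG"   # carries one germline variant
tumor     = "GATTTAGCTCACGATAGAG"   # germline variant plus one somatic change

germline = sorted(variants_vs_reference(reference, normal))
somatic = somatic_variants(reference, normal, tumor)
```

Subtracting the normal removes inherited variants, so only the tumor-specific change remains in `somatic`.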
Vogelstein et al., Science, March 2013
Cancer Genome Sequencing
Normal (30-40X) vs. Cancer (>60X)
GATTTAGATCGCGATAGAG
GATTTAGATCACGATAGAG
Whole Genome Sequencing of Esophageal Cancer
Small Somatic Variants
[Chart: Deletions, Insertions, SNV, Substitutions; values shown: 1695, 2178, 3720, 14950]
197 Deletions, 117 Amplifications
>1000 Cancer Genomes Have Been Sequenced
The Cancer Genome Atlas
Brain (Glioma, Glioblastoma)
Breast
Gastrointestinal (Colon, Rectal, Stomach)
Gynecologic (Ovarian, Cervical, Uterine)
Head and Neck
Hematologic (AML)
Melanoma
Lung
Urologic (Kidney, Prostate)
Results of Cancer Genomics
• 10-100,000 SNVs for many cancer types + structural variants
• 473 cancer genes have been identified
• Most common genes are already known, e.g. p53, RB1, KRAS, BRCA1/2, etc.
- Every patient has a different spectrum of mutations
Bell et al., 2011, Nature
More Results
- Mutations often fall in common pathways, e.g. Wnt, PI3K/AKT, Ras/Raf, TP53, RB, etc.; ~12 pathways (Vogelstein)
- Different types of cancer can have the same mutated gene, e.g. EGFR mutated in NSCLC and colon cancer
- Common to find a potentially “druggable” mutation from sequencing
TCGA-identified Pathways in Ovarian Cancer
RB and PI3K/AKT Signaling Homologous Recombination
Notch Signaling
Copy number alterations from metastatic colon tumor
Chromosome 7: Two amplification regions
Normal diploid copy number
[Figure: log2 ratio (genomic copy number) vs. chromosome 7 coordinates (Mb), NCBI 37; Chr 7p arm, CEN, Chr 7q arm]
• Two regional amplifications with complex genomic structure.
• Both loci have copy number > 10, with statistical significance.
With Hanlee Ji
Pharmacogenomics: Matching Drugs to Disease
Her2 Mutated in 25% of Breast Cancer
Iressa®:
– Used to treat non-small cell lung carcinoma
– 10% of patients respond
– Works only if patients' tumors have very specific mutations in EGFR
CML (Chronic Myeloid Leukemia) - Gleevec
1) Chromosome translocation involving Abl protein kinase and bcr
2) Gleevec binds the active site of Abl and inhibits it; causes remission
3) Gleevec binds other kinases that are mutated in other cancers, so it can be used to treat them.
Genome Sequencing and Mystery Disease
• Rare Mendelian Diseases: Some examples
– Miller’s syndrome
– Charcot-Marie-Tooth Disease
– Brain formation (exome)
– Congenital diarrhea (exome)
– Craniofacial formation (exome)
Family with Charcot-Marie-Tooth Disease
• Neuropathy
– Heterogeneous disease: many different genes mapped
• Sequence genome to 30X coverage
• 3.4 M SNPs (561,719 novel)
– 2,255,102 intergenic
– 1,165,204 in genes, introns etc.
• 174 nonsynonymous SNPs in region of interest
• 54 related to neuropathies
• Ultimately zoomed in on SH3TC2 gene: full-blown disease has two mutations, Y169H (missense) and R954X (nonsense)*
• Single heterozygotes have some phenotypes
*Implicated previously Lupski et al 2010 NEJM
Solving Mystery Diseases: Dizygotic Twins: Dopamine Responsive Dystonia
• Constantly sick, colicky, failed to meet milestones, “floppy”; MRI showed some abnormalities
• Children diagnosed with X dystonia
• Trial of L-DOPA showed dramatic improvement in 2 days
• Sequenced genomes; found mutation in SPR gene
• Administered dopamine + serotonin precursor
From Richard Gibbs, Baylor
Solving Mystery Diseases: Child With Variety of Conditions
Developmentally Delayed, Significant Health Issues
          SNVs        Private    Indels
Father    3,119,588   596,691    750,522
Mother    3,125,880   581,754    723,379
Child     3,118,638   33,158     673,809
SNVs: Single nucleotide variants; Indels: Insertions/deletions (~<100bp)
Candidates: TCP10L2, SUPV3L1, PIEZO1, DNAH2, NGLY1, FANCA, WFS1
Lessons Learned
1) Overall success rate for identifying causative mutations is low (25%)
2) Information not always directly actionable but still valuable.
3) Best success with a) Specific phenotypes b) Large families
4) Need large database to share information: Recurrence is key. ClinVar
Fetal DNA Sequencing
1) Cell-free fetal DNA can be detected in maternal blood as early as 4-5 weeks gestation
2) 4-10% of circulating DNA is fetal → increases with pregnancy
3) Targeted detection of mutations
4) Whole genome sequencing routinely used to detect trisomies: Down’s (Chr. 21), Chromosome 18 and Chromosome 13. 99% sensitivity
5) Taking over from amniocentesis
Fetal DNA Sequencing
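One common way to detect a trisomy from maternal plasma sequencing is to compare the fraction of reads mapping to a chromosome against euploid reference pregnancies. The sketch below uses a simple z-score. The read counts are invented, the cut-off of 3 is an assumed convention, and the function names are hypothetical; it is not the method of the cited study.

```python
from statistics import mean, stdev

def chromosome_fraction(read_counts, chrom):
    """Fraction of all mapped reads that fall on `chrom`."""
    return read_counts[chrom] / sum(read_counts.values())

def trisomy_z_score(sample_counts, reference_samples, chrom="chr21"):
    """Z-score of the sample's chromosome fraction relative to a set of
    euploid reference samples (all inputs are dicts of read counts)."""
    ref_fractions = [chromosome_fraction(c, chrom) for c in reference_samples]
    mu = mean(ref_fractions)
    sigma = stdev(ref_fractions)
    return (chromosome_fraction(sample_counts, chrom) - mu) / sigma

# Invented counts per one million mapped reads; "other" lumps all remaining chromosomes.
reference_samples = [
    {"chr21": 13000, "other": 987000},
    {"chr21": 13050, "other": 986950},
    {"chr21": 12950, "other": 987050},
    {"chr21": 13020, "other": 986980},
    {"chr21": 12980, "other": 987020},
]
z_trisomy = trisomy_z_score({"chr21": 13650, "other": 986350}, reference_samples)
z_euploid = trisomy_z_score({"chr21": 13010, "other": 986990}, reference_samples)
```

With a fetal fraction f, a trisomic fetus raises the expected chromosome 21 fraction by a factor of about 1 + f/2, so at 10% fetal DNA the shift is only about 5%. Detecting a shift that small is why deep counting and tight reference statistics matter.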
Srinivasan A, Bianchi DW, Huang H, et al. Am J Hum Genet 2013; 1–10.
Personal Genome Sequencing:
Can genome sequencing of a healthy person be useful in health care?
Health Is a Product of Genome + Environment
Genome + Exposome → Health
Personalized Medicine: Combine Genomic and Other Omic Information
Genomic, Transcriptomic, Proteomic
GGTTCCAAAAGTTTATTGGATGCCGTTTCA GTACATTTATCGTTTGCTTTGGATGCCCTA ATTAAAAGTGACCCTTTCAAACTGAAATTC ATGATACACCAATGGATATCCTTAGTCGAT AAAATTTGCGAGTACTTTCAAAGCCAAATG AAATTATCTATGGTAGACAAAACATTGACC AATTTCATATCGATCCTCCTGAATTTATTG GCGTTAGACACAGTTGGTATATTTCAAGTG ACAAGGACAATTACTTGGACCGTAATAGAT TTTTTGAGGCTCAGCAAAAAAGAAAATGGA AATTAATTTTGAAGTGCCATTGA…
Presently: Few Tests (<20)
1. Predict Risk, 2. Diagnose, 3. Monitor, 4. Treat, & 5. Understand Disease States
Personal “Omics” Profiling (POP)
Genome and Epigenome
Transcriptome (mRNA, miRNA, isoforms, edits)
Proteome
Cytokines
Metabolome
Autoantibody-ome
Microbiome
Initially 40K Molecules/Measurements; Now Billions!
Personal Omics Profile: 44 months; 66 Timepoints; 6 Viral Infections
Chen et al., Cell 2012
Accurate Genome Sequencing
Whole Genome Sequencing
• Complete Genomics: 35 b paired ends (150X)
• Illumina: 100 b paired ends (120X)
Exome Sequencing
• Nimblegen
• Illumina
• Agilent
SNV calls: Illumina only 345K (9%); both platforms 3.30M (89%); CG only 100K (2%)
3.3 M Hi conf. SNVs, 217K Indels and 3K SVs: 2 or more Platforms (Plus low confidence Variants)
M P
Percent SNPs phased ~80%
- Misses many variants (only assigns certain hets)
- Cannot properly assign de novo variants
Local phasing + population data = highly phased blocks
Percent SNPs phased 98.2%; Switch accuracy 99.9%+
Moleculo: Volodymyr Kuleshov, Michael Kertesz
Many SNVs are Expressed
RNA: 2.67 B 100 b PE reads; 30,963 (40 reads or more); 1,797 nonsynonymous; 8 nonsense
Protein: >130 Hi Confidence
RNA Editing: 2,376 Hi confidence
Allele Specific Expression
Jennifer Li-Pook-Than
Medical Interpretation Pipeline
All variants: ~3.5M
Rare/novel variants (<5%)
Non-Coding: mRNA stability, tRNA, miRNA (seed sequence, miRNA targets), Splice, UTR
Coding: Synonymous (rate), Nonsynonymous (1320)
SIFT / PP2: Damaging (234)
OMIM/Curated Mendelian disease (51)
Rick Dewey & Euan Ashley
Curated List of Rare Variants (SNVs, All heterozygous)
Missense
• ALAD, ABCC2, ACADVL, ADAMTS13, AGRN, BAAT, CDS1, CHD7, COL4A3, CTSD, DGCR2, DLD, DYSF, EPCAM, FGFR1OP, FKRP, GAA, GNAI2, HSPB1, IGKC, ITPR1, MED12, MKS1, NTRK1, PCM1, PKD1, PLEKHG5, PMS2, PRSS1, PTCH2, SERPINA1, SETX, SYNE1, TERT, TTN, VWF, ZFPM2, PNPLA2.
Nonsense
• PRAMEF2, PLCXD2, NUP54, RP1L1, PIK3C2G, NDE1, GGN, CYP2A7, IGKC
Not Rare But Important • KCNJ11 , KLF14, GCKR …
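The interpretation pipeline behind the curated list (all variants, then rare/novel, then nonsynonymous, then predicted damaging, then known disease genes) is a sequence of filters. The sketch below shows the funnel shape only. The `Variant` fields, the gene placeholders and the example records are invented, and the real pipeline relies on tools and databases (SIFT, PP2, OMIM) that are represented here by precomputed boolean flags.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    gene: str                    # placeholder names, not real annotations
    population_frequency: float  # allele frequency in a reference population
    consequence: str             # e.g. "missense", "nonsense", "synonymous"
    predicted_damaging: bool     # stand-in for a SIFT/PP2-style prediction
    in_disease_database: bool    # stand-in for an OMIM/curated lookup

def interpretation_funnel(variants, max_frequency=0.05):
    """Apply the filters in order and report how many variants survive each."""
    rare = [v for v in variants if v.population_frequency < max_frequency]
    nonsyn = [v for v in rare if v.consequence in {"missense", "nonsense"}]
    damaging = [v for v in nonsyn if v.predicted_damaging]
    curated = [v for v in damaging if v.in_disease_database]
    return {
        "all": len(variants),
        "rare": len(rare),
        "nonsynonymous": len(nonsyn),
        "damaging": len(damaging),
        "curated": [v.gene for v in curated],
    }

example_variants = [
    Variant("GENE_A", 0.30, "missense", True, True),     # too common
    Variant("GENE_B", 0.01, "synonymous", False, False),  # not protein-changing
    Variant("GENE_C", 0.02, "missense", False, True),     # predicted benign
    Variant("GENE_D", 0.001, "nonsense", True, False),    # not in disease list
    Variant("GENE_E", 0.004, "missense", True, True),     # survives all filters
]
funnel = interpretation_funnel(example_variants)
```

The 5% frequency cut-off mirrors the "<5%" on the slide. Variants that are common but still clinically relevant, such as those listed under "Not Rare But Important", would need a separate path because this funnel discards them at the first step.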
Bolded Genes expressed in PBMC (RNA).
Rare Variants in Disease Genes (51 Total)
[Same Missense, Nonsense and Not Rare But Important gene lists as on the curated list, with phenotype labels: Aplastic Anemia, High Cholesterol, Diabetes]
Disease risk profile from Integration Using VariMed
Rong Chen and Atul Butte
Disease risk profile: T2D (Type 2 diabetes)
Gene        SNP          Genotype   LR     Studies   Samples   Probability
Prevalence                                                     27%
TCF7L2      rs7903146    CT         1.18   49        140717    30%
-           rs10811661   TC         0.85   18        154141    27%
KCNQ1       rs2237892    CT         0.80   13        6570      23%
FTO         rs8050136    CC         0.87   10        63470     20%
CDKAL1      rs7754840    GG         0.91   10        51327     19%
SLC30A8     rs13266634   CT         0.94   9         145718    18%
IGF2BP2     rs4402960    GT         1.06   8         104401    19%
KCNJ11      rs5219       TT         1.15   7         87066     21%
-           rs1111875    TT         0.88   6         93188     19%
PPARGC1A    rs2970847    TC         1.31   4         5558      24%
-           rs7578326    AA         1.07   3         94337     25%
-           rs4457053    GG         1.12   3         94337     27%
KCNQ1       rs231362     GG         1.07   2         94337     29%
ARAP1       rs1552224    AA         1.03   2         94337     29%
JAZF1       rs864745     TC         1.00   2         89920     29%
RBMS1       rs1020731    GA         0.95   2         84605     28%
-           rs9300039    CC         1.05   2         42170     29%
WFS1        rs10010131   GG         1.07   2         30248     31%
EPO         rs1617640    AA         1.48   2         4011      39%
TP53INP1    rs896854     TC         1.01   1         94337     40%
-           rs5945326    AA         1.09   1         94337     42%
-           rs12779790   AG         1.06   1         89920     43%
-           rs1153188    TA         0.96   1         89920     42%
THADA       rs7578597    TT         1.03   1         89920     43%
-           rs4607103    CC         1.04   1         89920     44%
-           rs17036101   GG         1.02   1         89920     44%
MTNR1B      rs10830963   CC         0.94   1         16061     43%
HNF1B       rs4430796    GG         1.13   1         11320     46%
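The Probability column appears to follow from starting at the population prevalence and multiplying the odds by each genotype's likelihood ratio (LR) in turn. The sketch below reproduces the first three rows under that reading. It treats the SNPs as independent, which is an assumption, and the function names are illustrative rather than taken from VariMed.

```python
def update_probability(prior, likelihood_ratio):
    """Convert a probability to odds, scale by the LR, convert back."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

def cumulative_risk(prevalence, likelihood_ratios):
    """Running post-test probability after applying each LR in order."""
    probability = prevalence
    trajectory = []
    for lr in likelihood_ratios:
        probability = update_probability(probability, lr)
        trajectory.append(probability)
    return trajectory

# Prevalence and the first three LRs from the T2D table
# (TCF7L2 rs7903146, rs10811661, KCNQ1 rs2237892).
trajectory = cumulative_risk(0.27, [1.18, 0.85, 0.80])
percentages = [round(100 * p) for p in trajectory]
```

Working in odds keeps each update to a single multiplication, and the order of the SNPs does not affect the final probability, only the intermediate values plotted along the way.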
GLUCOSE LEVELS
[Figure: Glucose (mg/dL, axis 80-160) vs. Day Number (relative to 1st infection, axis -150 to 650)]
HbA1c (%) by day number: 6.4 (329), 6.7 (369), 4.9 (476), 5.4 (532), 5.3 (546), 4.7 (602)
Events: HRV infection (day 0-21); RSV infection (day 289-311); lifestyle change (day 380-current)
High Interest Drug-Related Variants
Gene      SNP          Patient genotype   Drug(s) affected
-         rs10811661   C/T                Troglitazone (Increased Beta-Cell Function)
CYP2C19   rs12248560   C/T                Clopidogrel (Increased Activation)
LPIN1     rs10192566   G/G                Rosiglitazone (Increased Effect)
SLC22A1   rs622342     A/A                Metformin (Increased Effect)
VKORC1    rs9923231    C/T                Warfarin (Lower Dose Required)
Expression of 50 Cytokines
[Heat map: color key value 0-1; columns labeled HRV, RSV, ?; DAY 0, DAY 0, DAY 12]
Rows: LIF, NGF, G-CSF, IFN-G, IL-12-P70, IL-5, IL-13, TNF-A, IL-1B, IL-4, IL-2, IL-7, IL-17, TGF-b, EOTAXIN, PDGFBB, ENA78, MCP-1, GRO ALPHA, IL-17F, IFN-b, FGFb, IL-12P40, GM-CSF, IL-10, TNF-B, MCP-3, M-CSF, SCF, VEGF, MIP1a, sFAS ligand, IL-15, PAI-1, IP10, IFN-a, MIG, IL-1a, CRP, V-CAM-1, ICAM-1, IL-1RA, RANTES, Trail, Insulin, MIP-1B, CD40Ligand, Resistin, HGF, LEPTIN, IL-8, TGF-a, plus unidentified entries (UNK.*)
Transcriptome, Proteome, Metabolome Analysis Summary: Processing Steps
(1) Preprocessing: raw datasets (Transcriptome, Proteome, Metabolome); QC, normalization, statistical simulation
(2) Common Classification Scheme: Fourier space common framework; data classification into Autocorrelated (I) and Spike Sets (II) & (III)
(3) Clustering and Enrichment Analysis
- Overall trends (autocorrelation)
- Spikes at specific timepoints
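To make the split between "autocorrelated" and "spike" time series concrete, the sketch below classifies a series using lag-1 autocorrelation and a z-score spike test. This is a simplified time-domain stand-in: the slide describes a Fourier-space framework, and the thresholds, function names and example series here are all assumptions.

```python
from statistics import mean, pstdev

def lag1_autocorrelation(series):
    """Lag-1 autocorrelation of an evenly sampled series."""
    m = mean(series)
    denom = sum((x - m) ** 2 for x in series)
    if denom == 0:
        return 0.0
    num = sum((a - m) * (b - m) for a, b in zip(series, series[1:]))
    return num / denom

def spike_timepoints(series, z_threshold=2.0):
    """Indices whose value lies at least z_threshold SDs from the mean."""
    m, s = mean(series), pstdev(series)
    if s == 0:
        return []
    return [i for i, x in enumerate(series) if abs(x - m) / s >= z_threshold]

def classify(series, autocorr_threshold=0.5, z_threshold=2.0):
    """Label a series as showing an overall trend, a spike, or neither."""
    if lag1_autocorrelation(series) >= autocorr_threshold:
        return "autocorrelated"
    if spike_timepoints(series, z_threshold):
        return "spike"
    return "unclassified"

trend = [1, 2, 3, 4, 5, 6, 7, 8]      # steady drift over time
spiky = [0, 0, 0, 0, 10, 0, 0, 0]     # single excursion at one timepoint
noisy = [1, 2, 1, 2, 1, 2, 1, 2]      # alternating, no trend and no spike
labels = [classify(s) for s in (trend, spiky, noisy)]
```

Real omics series are unevenly sampled, so a spectral approach is better suited to them than the fixed-lag calculation used here.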
George Mias
Integrated Analysis of Proteome, Transcriptome, Metabolome Dynamics: Overall trend
[Figure: RSV]
George Mias
Dynamical Outcomes for Integrated Analysis of Proteome, Transcriptome, Metabolome
Glucose Regulation of Insulin Secretion
Platelet Plug Formation
[Figure: RSV, 18 days]
George Mias
Study of 10 Healthy People: 5 Asian, 5 European
Dewey, Grove, Pan, Ashley, Quertermous et al.
- Median 5 reportable disease risk associations (ACMG) per individual (range 2-6)
- 3 follow-up diagnostic tests (range 0-10)
- Cost $362-$1427 per individual
- 54 minutes per variant