Computational personal genomics: selection, regulation, epigenomics, disease
Manolis Kellis
Broad Institute of MIT and Harvard MIT Computer Science & Artificial Intelligence Laboratory ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA TATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA TCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG CACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCA TGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATAT CCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT GGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA TAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC CAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG TTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG GCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA AATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA ATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT AGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA GTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG ACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG CTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC GAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA AAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA AATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA TTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT CCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT AGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA GTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA GCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA CAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC CACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT GTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCT ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA TATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA TCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG CACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCAGenes TGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATATRegulatory motifs CCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT GGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA Encode TAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC Control CAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG proteins TTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG gene expression GCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA AATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA ATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT AGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA GTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG ACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG CTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC GAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA AAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA AATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA TTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT CCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT AGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA GTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA GCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA CAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC CACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT GTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCT Personal genomics today: 23 and We Recombination breakpoints
Me vs. my brother Family Inheritance Family
Dad’s mom My dad Mom’s dad
Human ancestry Human
Disease risk Disease
Genomics: Regions mechanisms drugs Systems: genes combinations pathways Understanding human variation and human disease Gene annotation Roles in gene/chromatin regulation (Coding, 5’/3’UTR, RNAs) Activator/repressor signatures Evolutionary signatures CATGACTG CATGCCTG Disease-associated Non-coding annotation variant (SNP/CNV/…) Other evidence of function Chromatin signatures Signatures of selection (sp/pop) • Challenge: from loci to mechanism, pathways, drug targets
Goal: A systems-level understanding of genomes and gene regulation: • The regulators: Transcription factors, microRNAs, sequence specificities • The regions: enhancers, promoters, and their tissue-specificity • The targets: TFstargets, regulatorsenhancers, enhancersgenes • The grammars: Interplay of multiple TFs prediction of gene expression The parts list = Building blocks of gene regulatory networks Tools for interpreting the human genome • Evolutionary signatures Genome annotation – Distinct signatures for proteins, ncRNAs, miRNAs, motifs – Read-through, excess-constraint, networks,human selection • Chromatin signatures Dynamic regulatory regions – Define chromatin states from combinations of histone marks – Distinct classes of promoter/enhancer/transcribed/repressed • Activity signatures Link enhancer networks – Activity-based linking of regulators enhancers targets – Testing of 1000s of enhancer activator / repressor motifs • Personal genomics Interpret disease mechanism – Disease-associations: Mechanistic predictions for variants – Beyond top hits: 2000+ GWAS T1D-associated enhancers – Global methylation changes in Alzheimer’s: NRSF, ELK1, CTCF Evolutionary signatures reveal genes, RNAs, motifs Compare 29 mammals
Increased conservation pinpoints functional regions Distinct patterns of change distinguish diff. functions Protein-coding genes - Codon Substitution Frequencies - Reading Frame Conservation RNA structures - Compensatory changes - Silent G-U substitutions microRNAs - Shape of conservation profile - Structural features: loops, pairs - Relationship with 3’UTR motifs Regulatory motifs - Mutations preserve consensus Lindblad-Toh Nature 2011 - Increased Branch Length Score Stark Nature 2007 - Genome-wide conservation Compare 29 mammals: Reveal constrained positions NRSF motif
• Reveal individual transcription factor binding sites • Within motif instances reveal position-specific bias • More species: motif consensus directly revealed Human constraint outside conserved regions Active regions Average diversity (heterozygosity)
Aggregate over the genome
Ward and Kellis Science 2012
• Non-conserved regions: • Conserved regions: – ENCODE-active regions – Non-ENCODE regions show reduced diversity show increased diversity Lineage-specific constraint in Loss of constraint in human biochemically-active regions when biochemically-inactive Human-specific enhancer functions play regulatory roles
Transcription initiation from Pol2 promoter Transcription coactivator activity Transcription factor binding Chromatin binding Negative regulation of transcription, DNA-dependent Transcription factor complex Protein complex Protein kinase activity Nerve growth factor receptor signaling pathway Signal transducer activity Protein serine/threonine kinase activity Negative regulation of transcription from Pol2 prom Protein tyrosine kinase activity In utero embryonic development
Regulatory genes: Transcription, Chromatin, Signaling. Developmental enhancers: embryo, nerve growth Tools for interpreting the human genome • Evolutionary signatures Genome annotation – Distinct signatures for proteins, ncRNAs, miRNAs, motifs – Read-through, excess-constraint, networks,human selection • Chromatin signatures Dynamic regulatory regions – Define chromatin states from combinations of histone marks – Distinct classes of promoter/enhancer/transcribed/repressed • Activity signatures Link enhancer networks – Activity-based linking of regulators enhancers targets – Testing of 1000s of enhancer activator / repressor motifs • Personal genomics Interpret disease mechanism – Disease-associations: Mechanistic predictions for variants – Beyond top hits: 2000+ GWAS T1D-associated enhancers – Global methylation changes in Alzheimer’s: NRSF, ELK1, CTCF Integrate epigenomics datasets in multiple cell types
Epigenomic maps • Epigenetic modifications • DNA/histone/nucleosome
2. DNA • Encode epigenetic state 1. Histone methylation modifications • Histone code hypothesis • Distinct function for distinct combinations of marks? • Hundreds of histone marks • Astronomical number of histone mark combinations • How do we find biologically relevant ones?
3. DNA • Unsupervised approach accessibility • Probabilistic model • Explicit combinatorics Chromatin state dynamics across nine cell types
Predicted linking
Correlated activity • Single annotation track for each cell type • Summarize cell-type activity at a glance • Can study 9-cell activity pattern across Epigenomics Roadmap: 90 complete epigenomes
Interpret GWAS, global effects, reveal relevant cell types Enhancer modules associated with tissue identity
Clustering of 500,000 distal enhancers
• Tissue-specific disease-relevant tissues and processes Dissect motifs in 10,000s of human enhancers
54000+ measurements (x2 cells, 2x repl) • Predict activators/repressors based on activity correlations • Validate by engineering enhancers disrupting causal motifs Example activator: conserved HNF4 motif match WT expression specific to HepG2 Motif match disruptions reduce expression to background
Non-disruptive changes maintain expression
Random changes depend on effect to motif match Tools for interpreting the human genome • Evolutionary signatures Genome annotation – Distinct signatures for proteins, ncRNAs, miRNAs, motifs – Read-through, excess-constraint, networks,human selection • Chromatin signatures Dynamic regulatory regions – Define chromatin states from combinations of histone marks – Distinct classes of promoter/enhancer/transcribed/repressed • Activity signatures Link enhancer networks – Activity-based linking of regulators enhancers targets – Testing of 1000s of enhancer activator / repressor motifs • Personal genomics Interpret disease mechanism – Disease-associations: Mechanistic predictions for variants – Beyond top hits: 2000+ GWAS T1D-associated enhancers – Global methylation changes in Alzheimer’s: NRSF, ELK1, CTCF Revisiting disease- xx associated variants
• Disease-associated SNPs enriched for enhancers in relevant cell types • E.g. lupus SNP in GM enhancer disrupts Ets1 predicted activator GED: Global repression of brain enhancers in AD MAP Memory and Aging Project + ROS Religious Order Study Dorsolateral PFC Genotype (1M SNPs x700 ind.)
Reference Chromatin states
Methylation (450k probes x 700 ind)
• Variation in methylation patterns largely genotype driven • Global signature of repression in 1000s regulatory regions: hypermethylation, enhancer states, brain regulator targets Global hyper-methylation in 1000s of AD-associated loci
Top 7000 probes value - P
480,000 probes, ranked by Alzheimer’s association Methylation
Alzheimer’s-associated probes are hypermethylated • Global effect across 1000s of probes – Rank all probes by Alzheimer’s association – 7000 probes increase methylation (repressed) – Enriched in brain-specific enhancers – Near motifs of brain-specific regulators Complex disease: genome-wide effects Covers computational challenges associated with personal genomics: - genotype phasing and haplotype reconstruction resolve mom/dad chromosomes - exploiting linkage for variant imputation co-inheritance patterns in human population - ancestry painting for admixed genomes result of human migration patterns - predicting likely causal variants using functional genomics from regions to mechanism - comparative genomics annotation of coding/non-coding elements gene regulation - relating regulatory variation to gene expression or chromatin quantitative trait loci - measuring recent evolution and human selection selective pressure shaped our genome - using systems/network information to decipher weak contributions combinatorics - challenge of complex multi-genic traits: height, diabetes, Alzheimer's 1000s of genes MIT Computational Biology group Compbio.mit.edu
Mike Lin Ben Holmes Soheil Luke Angela Feizi Bob Yen Ward Altshuler Mukul Chris Bansal Bristow Stefan Washietl Pouya Kheradpour Matt Manolis Irwin Rachel Eaton Kellis Jason Jessica Daniel Ernst Wu Jungreis Marbach Sealfon Louisa DiStefano Dave Loyal Goff Sushmita Hendrix Roy Stata3 Stata4