Computational personal : selection, regulation, , disease

Manolis Kellis

Broad Institute of MIT and Harvard MIT Computer Science & Artificial Intelligence Laboratory ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA TATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA TCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG CACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCA TGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATAT CCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT GGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA TAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC CAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG TTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG GCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA AATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA ATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT AGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA GTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG ACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG CTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC GAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA AAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA AATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA TTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT CCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT AGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA GTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA GCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA CAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC CACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT GTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCT ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA TATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA TCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG CACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCAGenes TGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATATRegulatory motifs CCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT GGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA Encode TAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC Control CAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG proteins TTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG gene expression GCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA AATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA ATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT AGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA GTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG ACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG CTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC GAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA AAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA AATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA TTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT CCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT AGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA GTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA GCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA CAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC CACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT GTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCT Personal genomics today: 23 and We Recombination breakpoints

Me vs. my brother Family Inheritance Family

Dad’s mom My dad Mom’s dad

Human ancestry Human

Disease risk Disease

Genomics: Regions  mechanisms  drugs Systems: genes  combinations  pathways Understanding human variation and human disease Gene annotation Roles in gene/chromatin regulation (Coding, 5’/3’UTR, RNAs)  Activator/repressor signatures  Evolutionary signatures CATGACTG CATGCCTG Disease-associated Non-coding annotation variant (SNP/CNV/…) Other evidence of function  Chromatin signatures  Signatures of selection (sp/pop) • Challenge: from loci to mechanism, pathways, drug targets

Goal: A systems-level understanding of and gene regulation: • The regulators: Transcription factors, microRNAs, sequence specificities • The regions: enhancers, promoters, and their tissue-specificity • The targets: TFstargets, regulatorsenhancers, enhancersgenes • The grammars: Interplay of multiple TFs  prediction of gene expression  The parts list = Building blocks of gene regulatory networks Tools for interpreting the human • Evolutionary signatures  Genome annotation – Distinct signatures for proteins, ncRNAs, miRNAs, motifs – Read-through, excess-constraint, networks,human selection • Chromatin signatures  Dynamic regulatory regions – Define chromatin states from combinations of histone marks – Distinct classes of promoter/enhancer/transcribed/repressed • Activity signatures  Link enhancer networks – Activity-based linking of regulators  enhancers  targets – Testing of 1000s of enhancer activator / repressor motifs • Personal genomics  Interpret disease mechanism – Disease-associations: Mechanistic predictions for variants – Beyond top hits: 2000+ GWAS T1D-associated enhancers – Global methylation changes in Alzheimer’s: NRSF, ELK1, CTCF Evolutionary signatures reveal genes, RNAs, motifs Compare 29 mammals

Increased conservation pinpoints functional regions Distinct patterns of change distinguish diff. functions Protein-coding genes - Codon Substitution Frequencies - Reading Frame Conservation RNA structures - Compensatory changes - Silent G-U substitutions microRNAs - Shape of conservation profile - Structural features: loops, pairs - Relationship with 3’UTR motifs Regulatory motifs - Mutations preserve consensus Lindblad-Toh Nature 2011 - Increased Branch Length Score Stark Nature 2007 - Genome-wide conservation Compare 29 mammals: Reveal constrained positions NRSF motif

• Reveal individual transcription factor binding sites • Within motif instances reveal position-specific bias • More species: motif consensus directly revealed Human constraint outside conserved regions Active regions Average diversity (heterozygosity)

Aggregate over the genome

Ward and Kellis Science 2012

• Non-conserved regions: • Conserved regions: – ENCODE-active regions – Non-ENCODE regions show reduced diversity show increased diversity  Lineage-specific constraint in  Loss of constraint in human biochemically-active regions when biochemically-inactive Human-specific enhancer functions play regulatory roles

Transcription initiation from Pol2 promoter Transcription coactivator activity Transcription factor binding Chromatin binding Negative regulation of transcription, DNA-dependent Transcription factor complex Protein complex Protein kinase activity Nerve growth factor receptor signaling pathway Signal transducer activity Protein serine/threonine kinase activity Negative regulation of transcription from Pol2 prom Protein tyrosine kinase activity In utero embryonic development

Regulatory genes: Transcription, Chromatin, Signaling. Developmental enhancers: embryo, nerve growth Tools for interpreting the • Evolutionary signatures  Genome annotation – Distinct signatures for proteins, ncRNAs, miRNAs, motifs – Read-through, excess-constraint, networks,human selection • Chromatin signatures  Dynamic regulatory regions – Define chromatin states from combinations of histone marks – Distinct classes of promoter/enhancer/transcribed/repressed • Activity signatures  Link enhancer networks – Activity-based linking of regulators  enhancers  targets – Testing of 1000s of enhancer activator / repressor motifs • Personal genomics  Interpret disease mechanism – Disease-associations: Mechanistic predictions for variants – Beyond top hits: 2000+ GWAS T1D-associated enhancers – Global methylation changes in Alzheimer’s: NRSF, ELK1, CTCF Integrate epigenomics datasets in multiple cell types

Epigenomic maps • Epigenetic modifications • DNA/histone/nucleosome

2. DNA • Encode epigenetic state 1. Histone methylation modifications • Histone code hypothesis • Distinct function for distinct combinations of marks? • Hundreds of histone marks • Astronomical number of histone mark combinations • How do we find biologically relevant ones?

3. DNA • Unsupervised approach accessibility • Probabilistic model • Explicit combinatorics Chromatin state dynamics across nine cell types

Predicted linking

Correlated activity • Single annotation track for each cell type • Summarize cell-type activity at a glance • Can study 9-cell activity pattern across Epigenomics Roadmap: 90 complete epigenomes

Interpret GWAS, global effects, reveal relevant cell types Enhancer modules associated with tissue identity

Clustering of 500,000 distal enhancers

• Tissue-specific disease-relevant tissues and processes Dissect motifs in 10,000s of human enhancers

54000+ measurements (x2 cells, 2x repl) • Predict activators/repressors based on activity correlations • Validate by engineering enhancers disrupting causal motifs Example activator: conserved HNF4 motif match WT expression specific to HepG2 Motif match disruptions reduce expression to background

Non-disruptive changes maintain expression

Random changes depend on effect to motif match Tools for interpreting the human genome • Evolutionary signatures  Genome annotation – Distinct signatures for proteins, ncRNAs, miRNAs, motifs – Read-through, excess-constraint, networks,human selection • Chromatin signatures  Dynamic regulatory regions – Define chromatin states from combinations of histone marks – Distinct classes of promoter/enhancer/transcribed/repressed • Activity signatures  Link enhancer networks – Activity-based linking of regulators  enhancers  targets – Testing of 1000s of enhancer activator / repressor motifs • Personal genomics  Interpret disease mechanism – Disease-associations: Mechanistic predictions for variants – Beyond top hits: 2000+ GWAS T1D-associated enhancers – Global methylation changes in Alzheimer’s: NRSF, ELK1, CTCF Revisiting disease- xx associated variants

• Disease-associated SNPs enriched for enhancers in relevant cell types • E.g. lupus SNP in GM enhancer disrupts Ets1 predicted activator GED: Global repression of brain enhancers in AD MAP Memory and Aging Project + ROS Religious Order Study Dorsolateral PFC (1M SNPs x700 ind.)

Reference Chromatin states

Methylation (450k probes x 700 ind)

• Variation in methylation patterns largely genotype driven • Global signature of repression in 1000s regulatory regions: hypermethylation, enhancer states, brain regulator targets Global hyper-methylation in 1000s of AD-associated loci

Top 7000 probes value - P

480,000 probes, ranked by Alzheimer’s association Methylation

Alzheimer’s-associated probes are hypermethylated • Global effect across 1000s of probes – Rank all probes by Alzheimer’s association – 7000 probes increase methylation (repressed) – Enriched in brain-specific enhancers – Near motifs of brain-specific regulators Complex disease: genome-wide effects Covers computational challenges associated with personal genomics: - genotype phasing and reconstruction  resolve mom/dad chromosomes - exploiting linkage for variant imputation  co-inheritance patterns in human population - ancestry painting for admixed genomes  result of human migration patterns - predicting likely causal variants using  from regions to mechanism - annotation of coding/non-coding elements  gene regulation - relating regulatory variation to gene expression or chromatin  quantitative trait loci - measuring recent evolution and human selection  selective pressure shaped our genome - using systems/network information to decipher weak contributions  combinatorics - challenge of complex multi-genic traits: height, diabetes, Alzheimer's  1000s of genes MIT Computational Biology group Compbio.mit.edu

Mike Lin Ben Holmes Soheil Luke Angela Feizi Bob Yen Ward Altshuler Mukul Chris Bansal Bristow Stefan Washietl Pouya Kheradpour Matt Manolis Irwin Rachel Eaton Kellis Jason Jessica Daniel Ernst Wu Jungreis Marbach Sealfon Louisa DiStefano Dave Loyal Goff Sushmita Hendrix Roy Stata3 Stata4