The Epigenome Tools 2: ChIP-Seq and Data Analysis

Chongzhi Zang

[email protected] http://zanglab.com

PHS5705: Public Health Genomics March 20, 2017 1 Outline

• Epigenome: basics review • ChIP-seq overview • ChIP-seq data analysis

2 Epigenome

The epigenome is a multitude of chemical compounds that can tell the genome what to do. The epigenome is made up of chemical compounds and proteins that can attach to DNA and direct such actions as turning genes on or off, controlling the production of proteins in particular cells. -- from genome.gov

Original figure from ENCODE, Darryl Leja (NHGRI), Ian Dunham (EBI) 3 Epigenomic marks

• DNA • Histone marks – Covalent modifications – Histone variants • regulators – Histone modifying – Chromatin remodeling complexes • * factors

4 Histone modifications

• Nucleosome Core Particles • Core : H2A, H2B, H3, H4 Notation: • Covalent modifications on histone tails include: methylation (me), (ac), … • Histone variants • Histone modifications are implicated in influencing gene expression. Allis C. et al. . 2006

5 einsronigteTS sascae ih2 modifications 25 2-kb with a associated as is defined TSS, was ( which the red), surrounding in region (highlighted region promoter ehltosaeeepie ytegnmclcsfor histone locus and genomic the by histone exemplified these are of patterns distribution The nw as known AUEGENETICS NATURE H3K14ac and H4K8ac affects p300, GCN5 or of CBP depletion Additionally, not initiation- 19). but the PCAF, (ref. different with II or interacts with Pol p300 RNA whereas elongation- associate competent the II, with Pol associates can RNA PCAF example, competent (HATs) For genes. transferases of regions acetyl histone osbepten,ol ml rcineit tpooes f4,339 Of promoters. at the exists Of fraction H2A.Z. small and a methylations only 19 association acetylations, patterns, for possible 18 promoters the gene of 12,541 each the with of each examined we way, P de c a * * * * * * * * * * oietf h atrso itn oictosi nunbiased an in modifications histone of patterns the identify To * * o Number of combinatorial patterns

H3K4ac H3K14ac H2BK5ac H2BK20ac H4K16ac H4K12ac H3K4me3 H3K4me1 H3K36me1 H3R2me1 H3R2me2 H3K79me1 H3K79me2 H3K79me3 H3K36ac H2BK12ac H4K5ac H2AK5ac H2BK5me1 H3K4me2 H3K9me1 H3K18ac H3K27me1 H2AK9ac H4K20me1 H4K20me3 H3K9ac H4K8ac H2BK120ac H3K27me2 H3K23ac H2A.Z H4K91ac Histone modifications associate with with associate modifications Histone 10,000 1,000 10 100 10 Number ofenhancersassociatedwithonepattern Chr10: 1 À 184 PRKCBP1 7 16 12 14 12 10 17 15 13 15 23 10 39 15 4 8 3 6 6 7 8 8 5 5 9 6 3 7 5 3 5 6 7 3 2 5 4 8 4 1 ). Fractions of enhancers IL2RA regulation of gene expression gene of regulation 0.05 0.10 0.15 0.20 0.25 0.30 0.35 6145000 0 H2AK5ac Differential expression 10 H2AK9ac ,wihi xrse nCD4 in expressed is which ), OUE40 VOLUME H2A.Z

log2 (fold-change) 6150000 H2BK5ac H2BK5me1 100

H2BK12ac CD28RE H2BK20ac

H2BK120ac 6155000 H3K4ac [ 1,000 H3K4me1 UBR7 NUMBER H3K4me2 H3K4me3

H3K9ac 6160000 H3K9me1 10,000 H3K9me2 H3K9me3 H3K14ac [ UY2008 JULY + H3K18ac * * * * * * * * * * * * * * * * * Percentage of enhancers with * b el ( cells T H3K23ac

H2AK5ac H4K12ac H4K20me1 H3K4ac H2BK5me1 H2AK9ac H3K23ac H3K36me1 H3K27me2 H3K27ac H4K20me3 H4K5ac H3K9me2 H3K9me3 H3K14ac H3K4me2 H2A.Z H3K18ac H2BK20ac H3K4me1 H3K27me3 H3K4me3 H3K9ac H3K36me3 H3R2me1 H3R2me2 H3K79me1 H3K79me2 H3K79me3 H2BK120ac H2BK5ac H3K36ac H3K9me1 H4K8ac H4R3me2 H3K27me1 higher expression H2BK12ac H4K91ac H4K16ac

100 H3K27ac 30 40 50 60 70 80 90 10 20 H3K27me1 0 Chr12: 01010010,000 1,000 100 10 1 ZMYND8 H3K27me2

H3K27me3 23 17 10 16 11 12 17 30 11 25 i.1g Fig. 3 8 7 5 7 4 6 4 9 7 3 6 5 3 4 5 5 5 4 3 2 2 3 5 5 6 5 4 4 Enhancers-associated geneexpression H2AZ, H3K4me1, H3K4me2, H3K4me3, H3K9me1: 21 H3K9me1: H3K4me3, H3K4me2, H3K4me1, H2AZ, H3K36ac H3K36me1 66810000 H3K36me3 H3K79me1 .The ). CNS22 H2AZ, H3K4me2, H3K4me3, H3K9me1: 37 H3K9me1: H3K4me3, H3K4me2, H2AZ, (also H3K79me2 H3K79me3 20

Wang, Zang et al. al. et Zang Wang, H3R2me1 . H3R2me2 H4K5ac 66830000 ekcreainwt xrsinwti oicto pattern modification a within showed of expression modifications extent ( H2BK5ac with the correlation and determine weak H3K79me3 uniquely the not the Although expression, do activation. patterns with correlated modification modifications other most whereas H4K20me3, and H3K9me3 H3K9me2, H3K27me2, including h vrg eeepeso iho ihu ahmodification each without or with expression ( gene average the mature for required being not inactive their function. the with T-cell in consistent patterns, enriched are I cell– transmission development, class synaptic in involved and genes signaling many cell contrast, the In shown). in (data roles house-keeping not enriched their with are consistent patterns, metabolism III class and active physiology cellular in involved H3K4me1, H4K20me1: 21 H4K20me1: H3K4me1, upeetr i.5 Fig. Supplementary i.2c Fig. H4K8ac ocreaeec oicto ihgn xrsin ecompared we expression, gene with modification each correlate To H4K12ac No modification: 1,817 modification: No H2AZ, H3K4me3: 20 H3K4me3: H2AZ, 36 H3K4me1: H2AZ, All enhancers:4,179 H4K16ac H4K20me1 IFNG H4K20me1: 42 H3K36me3: 41 H3K27me3: 65 H3K4me3: 34 H3K4me3: 64 H3K4me1: .HK7e a mn ru frpesv ak also marks repressive of group a among was H3K27me3 ).

H2AK5ac: 63 H2AK5ac: H4K20me3 H4K91ac H2AZ: 67 H2AZ: 66850000 Nat Genet Genet Nat 2008 online). utpegnsad315wt nyoegene one ( only each with 3,165 with and associated genes multiple are 1,174 patterns, detected eeOtlg nlsssget htgenes that suggests analysis Ontology Gene ( backbone modification addition the to in H3K79me1/2/3 H4K16ac, and H2BK5me1, H4K20me1 includes it gene and sion, intermediate expres- H4K16ac, highest the shows plus III with Class expression. backbone discussed correlates the (as containswhich modifications or II 17 below), Class of con- class. sisting backbone also modification a this modification or H3K36me3 no to containing or belong patterns H3K4me3 The only but contain acetylations. H2A.Z also no and H3K9me1 patterns H3K4me1/2/3, These low with correlate expression. and H3K27me3 class contain in patterns I six of Four expression. in to III ing and II (I, classes into three patterns top these classify roughly can we ( reference a of as expression genes mean all the using in patterns, genes these of expression the genes. examined 62 next than We more with associated each are ie soitdwt ahpten ( pattern. each with associated sites and The sites. hypersensitive DNase 4,179 at modifications 38 ( the modifications. of each with associated ( enhancers left. the on asterisks indicated by are CNS22 at modifications Significant shown. are CNS22, , downstream the at ( patterns left. modification the on asterisks by indicated the of red) IL2RA in (highlighted enhancer CD28RE the ( enhancers. 3 Figure h S ftenaetkongn.Al all All, gene. known nearest the of to TSS enhancer the an assigning largest by ten patterns the modification with expression gene of analysis atr sindicated. is pattern each with associated of sites number hypersensitive The sites. hypersensitive I DNase 6 x xsidctstenme fhypersensitive of number the indicates axis i.2a Fig. ee infiatmdfiain are modifications Significant gene. atrso itn oictosat modifications histone of Patterns y xsidctstenme fpatterns, of number the indicates axis a

h itn oicto atr at pattern modification histone The )

.Te1 otpeaetpatterns prevalent most 13 The ). d atrso histone of Patterns ) c h rcin of fractions The ) i.2b Fig. IFNG LETTERS i.2b Fig. .I em that seems It ). eeadits and gene b i.2b Fig. e Histone ) Correlation ) accord- ) .Our ). 899 “Functions” of histone marks

Table 3. Distinctive Chromatin Features of Genomic Elements Functional Annotation Histone Marks References Promoters H3K4me3 Bernstein et al., 2005; Kim et al., 2005; Pokholok et al., 2005 Bivalent/Poised Promoter H3K4me3/H3K27me3 Bernstein et al., 2006 Transcribed Gene Body H3K36me3 Barski et al., 2007 Enhancer (both active and poised) H3K4me1 Heintzman et al., 2007 Poised Developmental Enhancer H3K4me1/H3K27me3 Creyghton et al., 2010; Rada-Iglesias et al., 2011 Active Enhancer H3K4me1/H3K27ac Creyghton et al., 2010; Heintzman et al., 2009; Rada-Iglesias et al., 2011 Polycomb Repressed Regions H3K27me3 Bernstein et al., 2006; Lee et al., 2006 H3K9me3 Mikkelsen et al., 2007 survey DNA enrichment at 500 representative loci using the steps in a row, or Sequential-ChIP-seq, can uncover histone  nCounter probe system. High-quality antibodies are distinctRivera &PTMs Ren Cell on2013 the7 same molecule or chromatin-associated proteins from IgG patterns and the occupancy distribution correlates in the same complex. Several groups combined bisulfite with a logical set of chromatin states. Validating antibody re- sequencing with ChIP giving rise to BisChIP-seq and ChIP-BS- agents for enrichment specificity and robustness ensures good seq (Brinkman et al., 2012; Statham et al., 2012). Long-distance quality ChIP-seq data sets with high signal to noise ratios. DNA interactions mediated by a specific protein can be profiled Given that ChIP-seq is a mature technology, the technical using chromatin interaction analysis by paired-end-tag restrictions of the technique are well defined by its users. These sequencing, or ChIA-PET (Fullwood et al., 2009). We anticipate restrictions include the need for large amounts of starting mate- other inventive uses of ChIP technology to continue to uncover rial, limited resolution, and the dependence on antibodies. undiscovered roles of histone modifications and histone variants Improvements to ChIP-seq have been developed to address in transcriptional regulation. these limitations and expand the possibilities of its use. Collect- ing enough starting material for ChIP-seq can be challenging Mapping of Chromatin Structures because experiments typically require 1 million (histone modifi- Nucleosome Positioning cations) to 5 million (TFs and chromatin modifiers) cells. Although Moving up the hierarchy of genomic organization, we now look this is feasible when studying fast dividing cell lines, the chal- beyond the DNA and histone modifications to the positioning lenge arises when studying primary cells and rare populations of along the genome. Our epigenome at its such as cancer stem cells or progenitor cells. ChIP-seq samples most basic level is repeating units of 147 base pairs wrapped of 50,000 cells or less are possible with the ChIP-nano protocol 1.7 times around each nucleosome with varying distances of (Adli and Bernstein, 2011). Key method modifications achieve linker DNA between each unit. Even this extremely simplistic effective chromatin fragmentation in small volumes, ensure model is complex because nucleosome positioning can both minimal sample handling and loss by washing samples in col- inhibit and promote factor binding (Bell et al., 2011). First, umns, and reduce background signal. Another procedure, called nucleosomes can be positioned to obstruct or reveal specific ChIP-exo, improves the limited resolution from fragmentation DNA sequences. Second, because modifications on histone tails heterogeneity after chromatin is prepared by sonication (Rhee serve as binding platforms for transcriptional regulators, nucleo- and Pugh, 2011). As its name suggests, sonicated and immuno- some positioning regulates factor recruitment. And finally, precipitated DNA is treated with a 50-to-30 exonuclease to digest nucleosomes are suggested to inhibit transcription by slowing DNA to the footprint of the crosslinked protein such that progression of RNA polymerase II as it transcribes through a sequencing results are nucleotide resolution. This type of high- gene body. From a medical perspective, it will be important to resolution protein-binding data is most beneficial for uncovering determine the possible role of aberrant nucleosome positioning motifs of specific binding proteins and the effect of sequence as caused by disease-associated SNPs, insertions, deletions, variants on protein-binding affinity. Profiling genome-wide and translocations. DNA-protein interactions with ChIP-seq is technically chal- Our understanding of the regulation of nucleosome positioning lenging when studying novel proteins or protein isoforms, such came from studies of smaller genomes, such as those in yeast as a histone variant, that lacks a robust or specific antibody. and fly (Jiang and Pugh, 2009). Nucleosome positioning along In this case, an obvious approach is to transiently or stably ex- DNA is influenced by favorable DNA sequence composition, press a protein of interest (POI) with a tag or epitope that can the actions of ATP-dependent nucleosome remodelers, and be readily ChIP’ed. Controls are necessary to ensure the fusion strongly positioned nucleosomes (Mavrich et al., 2008; Narlikar protein’s localization is not altered by nonendogenous expres- et al., 2013; Yuan et al., 2005). Although we understand the sion levels, protein instability, steric inherence, or other effects main determinants of nucleosome positioning, the exact contri- of the tag itself. bution of each is unclear and currently under debate. A ChIP step can be added to other genomic profiling ap- The most common method for profiling genome-wide proaches for integrated epigenomic profiling. First, two ChIP nucleosome positioning is microcococal nuclease digestion of

44 Cell 155, September 26, 2013 ª2013 Elsevier Inc. H3K4me3/H3K27me3 Bivalent Domain

H3K4me3 Repressed

H3K27me3

Remained

Poised Induced

From: https://pubs.niaaa.nih.gov/publications/arcr351/77-85.htm 8 ChIP-seq: Profiling epigenomes with sequencing

histone

nucleosome

ATAC-seq

Original figure from ENCODE, Darryl Leja (NHGRI), Ian Dunham (EBI) 9 Published ChIP-seq datasets are skyrocketing We are entering the Big Data era Number of ChIP-seq datasets on GEO

3000

2500 ✤ 200920115066 2000

✤ 2012GEO5241 1500

1000 ✤ 500

0

!

Mei et al. Nucleic Acids Research 2016 10 Chromatin (ChIP)

11 Protein-DNA crosslinking in vivo (for TF)

12 Chop the chromatin using sonication (TF) or micrococal nuclease (MNase) digestion (histone)

13 Specific factor-targeting antibody

14 Immunoprecipitation

15 DNA purification

16 PCR amplification and sequencing

17 ChIP-seq data analysis overview

Scale 500 bases hg19 chr19: 15,308,000 15,308,100 15,308,200 15,308,300 15,308,400 15,308,500 15,308,600 15,308,700 15,308,800 15,308,900 15,309,000 15,309,100 15,309,200 User Supplied Track

@ILLUMINA-8879DC:231:KK:3:1:1070:945 1:Y:0: NNNAATACAGTCAGAAACATATCATATTGGAGAATA #################################### @ILLUMINA-8879DC:231:KK:3:1:1153:945 1:Y:0: NNNAAGCACACAGAAGATAACTAAACAATCAAGTAG #################################### @ILLUMINA-8879DC:231:KK:3:1:1222:945 1:Y:0: NNNAAGGGTCTTGAGAAGAAATCATTCTGGATGGCA #################################### @ILLUMINA-8879DC:231:KK:3:1:1304:939 1:Y:0: NNNCCAGGCTCCCGCGATTCTCCTGCCTCAGCTTCT #################################### @ILLUMINA-8879DC:231:KK:3:1:1354:945 1:Y:0: NNNCTCTTCCTTAGCTAAACTTTCAACTAAGCCAAA #################################### @ILLUMINA-8879DC:231:KK:3:1:1411:932 1:Y:0: NNNGTAGGACCATTGGCGTTGCGACACAAAAAATTT #################################### @ILLUMINA-8879DC:231:KK:3:1:1496:937 1:Y:0: NNNTTCATCGGGTTGAGAGTCCCCTTGTTGCATGCA #################################### @ILLUMINA-8879DC:231:KK:3:1:1533:939 1:Y:0: NNNATTTTCCCGTTCCAGGTCGCAATTTCCGCCGTT #################################### @ILLUMINA-8879DC:231:KK:3:1:1573:940 1:Y:0: NNNGGGGTGCGCCTTTAGTCCCAGCTACTCAGGAAC ####################################

18 ChIP-seq data analysis overview

• Where in the genome do these sequence reads come from? - Sequence alignment and quality control • What does the enrichment of sequences mean? - Peak calling • What can we learn from these data? – Downstream analysis and integration

19 ChIP-seq data analysis: basic processing

• alignment of each sequence read: bowtie or BWA

cannot map to the reference genome ✗ can map to multiple loci in the genome ✗ can map to a unique location in the ✔ genome • redundancy control:

Langmead et al. 2009, ✔ Zang et al. 2009 20 Original algorithm of MACS v1 ChIP-seq datad analysis: Peak calling

• DNA fragment size estimation • pile-up profiling peak model cross-correlation

!"#$%&'( d

!)*+,)-

Peak Model HFinput2sequence-chr1-tag-correlation 0.055

0.35 forward tags reverse tags 0.05

0.30 0.045 0.04 0.25 0.035 4 0.20 0.03

Percentage 0.025 0.15 0.02 • Peak/signal 0.10 0.015

0.05 0.01 detection 0.005 −600 −400 −200 0 200 400 600 0 50 100 150 200 250 300 350 400

Distance to the middle

21 ChIP-seq data analysis: Peak calling

• Sharp peaks • Broad peaks transcription factor binding, Histone modifications, DNase, ATAC-seq “super-enhancers” Diffuse MACS (Zhang, 2008) dynamic background SICER (Zang, 2009) Poisson model Spatial clustering of localized weak signal and integrative Poisson model

NOTCH1

H3K27ac

22 Wang, Zang et al. 2014 MACS

• Model-based Analysis for ChIP-Seq

• Ta g distribution along the genome ~ Poisson distribution (λBG = total tag / genome size) • ChIP-seq show local biases in the genome – Chromatin and sequencing bias – 200-300bp control windows have to few tags ChIP – But can look further Control

Dynamic λlocal = 300bp max(λBG, [λctrl, λ1k,] λ5k, λ10k) 1kb 5kb 10kb

http://liulab.dfci.harvard.edu/MACS/ Zhang et al, Genome Bio, 2008 SICER

• Spatial-clustering Identification of ChIP-Enriched Regions

5kb

10kb ★★★★★ omictools.com

Zang et al. Bioinformatics 2009 24 ChIP-seq peak calling: Parameters

Parameter Remarks Species and reference genome version, Genome e.g. hg38, hg18, mm10, mm9 Fraction of the mappable genome, vary in Effective genome rate species, read length, etc. Estimated by default; can specify DNA fragment size otherwise Data resolution, usually nucleosome Window size periodicity length, i.e. 200bp (for SICER only) Allowable gaps between Gap size eligible windows, usually 2 or 3 windows P-value cut-off Threshold for peak calling, from model Threshold for peak calling, BH correction False discovery rate (FDR) cut-off from p-value.

25 ChIP-seq data analysis: Review

1. Read mapping (sequence alignment)

2. Peak calling: MACS or SICER 1. QC 2. DNA fragment size estimation (for Single-end) 3. Pile-up profile generation 4. Peak/signal detection

3. Downstream analysis/integration

26 Data formats

• fastq: raw sequences

• BED: chr11 10344210 10344260 255 0 - chr4 76649430 76649480 255 0 + chr3 77858754 77858804 255 0 + chr16 62688333 62688383 255 0 + chr22 33031123 33031173 255 0 -

• SAM/BAM: aligned sequencing reads

• bedGraph, Wig, bigWig: pile-up profiles for browser visualization

27 Data flow

Raw sequence • fastq reads

Aligned • BAM/BED reads Bowtie/BWA Reference genome Profile; • bedGraph/Wig/bigWig

Peaks • BED MACS/SICER

28 Galaxy: web-interface analysis platform • https://usegalaxy.org/

29 Run MACS on Cistrome, a Galaxy-based platform • http://cistrome.org/ap/

30 Run SICER on Galaxy-based platforms • http://services.cbib.u-bordeaux.fr/galaxy/

31 ChIP-seq: Downstream analysis

• Data visualization – UCSC genome browser: http://genome.ucsc.edu/ – WashU epigenome browser: http://epigenomegateway.wustl.edu/ – IGV: http://software.broadinstitute.org/software/igv/ • Meta analysis – CEAS: http://liulab.dfci.harvard.edu/CEAS/ • Integration with gene expression – BETA: http://cistrome.org/BETA/ – MARGE: http://cistrome.org/MARGE/ • Integration with other epigenomic data – GREAT: http://great.stanford.edu – ENCODE SCREEN: http://screen.umassmed.edu/ – MANCIE: https://cran.r-project.org/package=MANCIE – Cistrome DB: http://cistrome.org/db/ 32 BETA: Binding Expression Target Analysis

∆ij • Regulatory Potential P (g )= exp − i " λ # ∈! TSS j S(i)

i

j

33 MARGE: A big data driven, integrative regression and semi- supervised approach for predicting functional enhancers

samples sample samples enhancer samples selection prediction 34 Wang, Zang et al. Genome Res 2016 ENCODE https://www.encodeproject.org/

35 Cistrome Data Browser http://cistrome.org/db/

36 ChIP-seq data analysis: Review

1. Read mapping (sequence alignment)

2. Peak calling: MACS or SICER 1. QC 2. DNA fragment size estimation (for Single-end) 3. Pile-up profile generation 4. Peak/signal detection

3. Downstream analysis/integration

37 Summary

• ChIP-seq is used to profile epigenomes • ChIP-seq data analysis • MACS for narrow peaks • SICER for broad peaks • Online tools and resources

38 Further Reading

The cancer epigenome: Concepts, challenges, and therapeutic opportunities

Science 17 Mar 2017: Vol . 355,FRONTIERS Issue IN CANCER6330, THERAPY pp.1147-1152 http://science.sciencemag.org/content/355/6330/1147 Heterochromatin Euchromatin Heterochromatin evidence has implicated epigenetic regulators as the primary targets that establish the fertile Writer Reader Eraser soil for malignant outgrowth. Several groups have established that clonal hematopoiesis, whereby hematopoietic stem and progenitor cell clones carrying somatic mutations give rise to the ma- jority of blood cells, occurs in at least 10% of people older than 65 years of age (25–27). The three most common mutations driving clonal Nucleosome DNA Histones stem cell expansion occur in genes encoding the epigenetic regulators DNMT3A, TET2, and ASXL1 BRD2/3/4 (25–27). These mutations confer a competitive SMARCA2 self-renewal advantage to the hematopoietic stem EZH2 SMARCA4 cells harboring them, resulting in clonal expan- MMSET BAZ2A sion of these cells (28). The precise molecular DOT1L PB1 mechanisms governing this increased capacity SETD7 BRPF1 for self-renewal are still under investigation, MLL1 ATAD2 but it is increasingly clear that clonal hemato- PRMT1 L3MBTL3 poiesis is a prelude to a variety of hematopoietic PRMT3 WDR5 malignancies, including myelodysplastic syndromes Nucleus PRMT5 CREBBP BRD9 HDAC SMYD2 EP300 BRD7 JMJD3/UTX IDH1* and acute myeloid leukemia (AML) (25–27). More- DNMT G9a/GLP MOZ CBX7 LSD1 IDH2* over, there is accumulating evidence that pre- leukemic cells containing mutations in epigenetic Therapies DNA Histone modifcations DNA and regulators such39 as the de novo DNA methyltrans- targeting: modifcations histone modifcations ferase DNMT3A can survive conventional chemo- therapies and serve as the nidus for leukemia Fig. 1. Current and emerging epigenetic therapies. Chromatin exists in two major states: an open relapse (29). It remains to be established whether on March 20, 2017 relaxed conformation called euchromatin, within which most transcriptionally active genes reside, and a the lessons learned about the central role of epi- more condensed compact state called heterochromatin, which is largely transcriptionally silent. The genetic regulators in the origins of clonal malig- dynamic transition between these states is mediated by chromatin modifications such as methylation nant hematopoiesis are broadly applicable to other and acetylation, which are laid down by epigenetic writers, bound by epigenetic readers, and removed by malignancies. epigenetic erasers. Many epigenetic proteins have more than one functional domain, allowing them to function as epigenetic readers and writers or epigenetic readers and erasers. A growing number of small- Cancer mutations in “dark matter” molecule drugs are being developed to target these epigenetic regulators. Highlighted in red are the targets affect chromatin regulation for epigenetic therapies that are either in routine clinical use or currently being evaluated in clinical trials. As we near completion of a comprehensive an- IDH1 and IDH2 are marked with asterisks because although they are not epigenetic proteins, mutations notation of recurrent mutations in protein-coding in these proteins profoundly affect epigenetic erasers of DNA methylation (TET proteins) and histone genes, it is clear that we have just scratched the methylation (Jumonji-C domain proteins). Inhibitors of IDH1 and IDH2 reduce levels of the oncometabolite surface of the cancer genome. The mutation rate 2HG and alleviate the inhibition of these epigenetic erasers. of the noncoding regulatory genome, or so-called http://science.sciencemag.org/ “dark matter,” is nearly double that of coding of the hallmarks of cancer, EMT is underpinned H3K4, respectively—largely work in an opposing regions, although what role these noncoding mu- by alterations in epigenetic regulators (22). manner to repress or facilitate gene expression. tations play in driving oncogenesis is the subject At the molecular level, these cell state trans- Members of these two essential complexes are of ongoing investigation (30). Such mutations itions are largely mediated through the collabo- some of the most commonly mutated epigenetic occur in multiple gene promoters and enhancer rative action of transcription factors and epigenetic regulators in cancer (6, 13). Some malignancies, elements and are found in a range of cancers regulators. The diversity of histone and DNA mod- such as germinal center–derived B cell lymphomas, (31, 32). A pioneering example was the discovery ifications introduces a complexity that can subtly contain mutations in the PcG protein EZH2 and in of mutations within the promoter region of TERT, Downloaded from alter transcription programs within the cell (4). the TrxG member MLL2, alongside mutations in the gene that encodes the catalytic subunit of These chromatin modifications, including acet- histone acetyltransferases such as CREBBP [cyclic telomerase, in more than 70% of melanomas ylation, methylation, and phosphorylation, are not adenosine monophosphate response element– (33, 34). Although it is well recognized that can- static entities but constitute a dynamically chang- binding protein (CREB) binding protein] and EP300 cer cells have high telomerase activity, muta- ing and complex landscape that evolves in a cell (E1A-binding protein) (24). These findings would tions within the coding region of the telomerase context–dependent fashion. Notably, chromatin suggest that impaired resolution of the germinal B gene are not common in human cancer genomes. modifications that activate or repress transcrip- cell transcription program of these cells prevents Interestingly, the TERT promoter mutations ap- tion are not always mutually exclusive, as evi- maturation and is a seminal event in the initia- pear to increase the expression of TERT by creat- denced by “bivalent domains” marking genes tion and maintenance of these malignancies. ing a de novo binding motif for the ETS family of poised for transcription in normal and malignant transcription factors. cells (23). First described in embryonic stem cells, Epigenetic dysregulation contributes to In addition to the promoter mutations, mu- genes associated with bivalent promoters, which the origin of cancer tations that alter enhancer elements have also SCIENCE have the concurrent presence of both the activa- Cancer is thought to develop through clonal ex- been discovered and demonstrated to be func- ting histone modification H3K4me3 and the pansion of mutant premalignant stem and progen- tionally relevant. Chromosomal translocations repressive modification H3K27me3, have been itor cells. These cells subsequently acquire further involving inv(3)(q21q26.2) or t(3;3)(q21;q26.2) implicated in the development of cancer and mutations that provide a subclonal growth and result in AML with a poor prognosis; this is in are noted to undergo DNA hypermethylation at proliferative advantage, which ultimately man- part due to the aberrant and sustained expres- their CpG islands (23). The Polycomb (PcG) and ifests in a clinically diagnosed malignancy. This sion of the proto-oncogene EVI1.Detailedstudies Trithorax (TrxG) complexes—which are respon- evolutionary model of cancer has been well stud- have now revealed that the translocation leads to

GRAPHIC: ADAPTED BY K. SUTLIFF/ sible for the methylation of histones H3K27 and ied in the hematopoietic system, where recent the relocation of a distal enhancer for the GATA2

Dawson, Science 355,1147–1152 (2017) 17 March 2017 2of6 Thank you very much!

[email protected] http://zanglab.com

40