Epigenomic analysis reveals DNA motifs regulating histone modifications in human and mouse

Vu Ngo(a,1), Zhao Chen(b,1), Kai Zhang(a), John W. Whitaker(b), Mengchi Wang(a), and Wei Wang(a,b,c,2)

(a) Graduate Program of Bioinformatics and Systems Biology, University of California, San Diego, La Jolla, CA 92093-0359; (b) Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, CA 92093-0359; and (c) Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA 92093-0359

Edited by Steven Henikoff, Fred Hutchinson Cancer Research Center, Seattle, WA, and approved January 3, 2019 (received for review August 6, 2018)

Histones are modified by enzymes that act in a locus-, cell-type-, and developmental stage-specific manner. The recruitment of enzymes to chromatin is regulated at multiple levels, including interaction with sequence-specific DNA-binding factors. However, the DNA-binding specificity of the regulatory factors that orchestrate specific histone modifications has not been broadly mapped. We have analyzed 6 histone marks (H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K9me3, H3K36me3) across 121 human cell types and tissues from the NIH Roadmap Epigenomics Project as well as 8 histone marks (with addition of H3K4me2 and H3K9ac) from the mouse ENCODE Consortium. We have identified 361 and 369 DNA motifs in human and mouse, respectively, that are the most predictive of each histone mark. Interestingly, 107 human motifs are conserved between the two species. In the human embryonic stem cell line H1, we mutated only the found DNA motifs at particular loci, and the significant reduction of H3K27ac levels validated the regulatory roles of the perturbed motifs. The functionality of these motifs was also supported by the evidence that histone-associated motifs, especially H3K4me3 motifs, overlap with expression quantitative trait loci (eQTL) SNPs in cancer patients significantly more often than known and random motifs. Furthermore, we observed possible feedback loops that control chromatin dynamics, as the found motifs appear in the promoters or enhancers associated with various histone modification enzymes. These results pave the way toward revealing the molecular mechanisms of epigenetic events, such as histone modification dynamics and epigenetic priming.

epigenomics | cis-regulatory elements | locus specificity | chromatin dynamics | CRISPR

Significance

How the locus-specific histone modifications are achieved is not fully understood. One of the contributing mechanisms is that DNA binding molecules recognize specific sequences and their binding recruits or stabilizes the histone modification enzyme complexes. Comprehensive identification of such sequence patterns is the first step toward revealing possible regulatory grammar for establishing histone modifications. In this study, we have cataloged the DNA motifs tightly associated with six and eight important histone modifications in human and mouse, respectively. We show that mutating the found motifs at particular loci led to significant reduction of the histone modification levels. These histone-associated motifs, especially H3K4me3 motifs, overlap with expression quantitative trait loci SNPs in cancer patients significantly more often than known motifs, further suggesting their regulatory roles. We also found possible feedback loops mediated by these motifs, implicating their possible roles in histone modification dynamics and epigenetic priming.

Author contributions: V.N., J.W.W., and W.W. designed research; V.N., Z.C., K.Z., and J.W.W. performed research; M.W. contributed new reagents/analytic tools; V.N., Z.C., K.Z., and J.W.W. analyzed data; and V.N., Z.C., and W.W. wrote the paper with contribution from all authors.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Published under the PNAS license.
(1) V.N. and Z.C. contributed equally to this work.
(2) To whom correspondence should be addressed. Email: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1813565116/-/DCSupplemental.

Histone modifications play key roles in many biological processes. Mammalian genomes contain histone-modifying enzymes that are responsible for modifying histone tails by adding or removing chemical groups, such as methyl and acetyl groups. The placement of histone modifications is precisely regulated to ensure that specific regulatory elements and genes are correctly activated or repressed in a given cell type, environment, or developmental stage. Understanding the mechanisms that regulate locus-specific modification in a cell-state–dependent manner is critical toward uncovering the grammar of epigenetic regulation.

A possible mechanism to establish or maintain locus-specific histone modification is through binding of sequence-specific proteins or noncoding RNAs, which recruit or enhance the modifying enzymes' binding to a particular locus. Other factors can contribute to this specificity, such as DNA methylation, chromatin accessibility, and 3D chromatin contacts. Because histone modifications are wiped out and reestablished in the zygote, the information encoded in the DNA sequence is pivotal to initiate the process of locus-specific histone modifications. Despite the existence of other contributing factors, it is still critical to comprehensively catalog the sequence motifs that can provide locus-specific guidance for the enzymatic functions, which can be the first step toward fully decoding the mechanisms regulating locus specificity of histone modifications. Furthermore, if particular DNA motifs are associated with histone modifications in many and diverse cell types, they are likely important or even causally related to histone modifications. An analogy is that a transcription factor (TF) recognizes the same DNA motif but its binding sites are cell-type–dependent. However, if we identify all motifs enriched in the TF binding sites across a large and diverse set of cell types, the most common motif is likely the one recognized by the TF. Histone modifications are more complicated than a single TF binding, and one histone mark can be regulated by multiple factors recognizing different motifs. Therefore, a comparative analysis across diverse cell types/tissues is critical.

Recently, machine learning approaches have proven to be useful in understanding epigenetic processes. For example, a support vector machine has been used to predict the impact of SNPs on DNase I sensitivity in their native genomic context (1). Prediction of histone modifications solely from knowledge of TF binding both at promoters and at potential distal regulatory elements (2) was done using a logistic regression-based classifier or using k-mer features to train a logistic regression model that distinguishes peak sequences from flanking regions (3). Our previous work also demonstrated that DNA motifs are predictive of histone modifications and DNA methylation in five cell types (4). All of these works have suggested the possibility of deciphering the

grammar encoded in the genome regulating epigenetic modifications, but the scope of the previous studies is still limited.

Furthermore, because the sequences of many histone-modifying enzymes are conserved, it would also be interesting to investigate whether the regulatory grammar that controls the placement of histone modification is conserved. However, a direct comparison between the human and mouse genome is unlikely to identify these motifs because they may be dispersed in the overall nonconserved genomic regions. A strategy to circumvent this difficulty is to uncover the DNA motifs associated with the same histone modification patterns in different species and then compare the similarities between them to assess their conservation.

Here, we present a comprehensive survey of histone modification-associated motifs in a large set of diverse cell types and tissues in both human and mouse (5, 6). Comparative analyses have revealed that 107 motifs are conserved between human and mouse. Furthermore, in the human embryonic stem cell line H1, mutating the motifs led to significant perturbation of the H3K27ac levels. We also found that the histone-associated motifs are likely to overlap with SNPs in cancer patients, which indicates their regulatory functions.

Results

Identification of DNA Motifs in 121 Human Cell Types. To have a comprehensive catalog of the cis-regulatory elements that are involved in regulating the human epigenome, we used Epigram (4) to analyze the data of six histone modifications from 121 different cell types and tissues generated by the NIH Roadmap Epigenomics Project (6) and ENCODE (5) (Fig. 1A).

In general, Epigram looks for enriched motifs that best differentiate the foreground from the background sequences. The program first computes an enrichment score for each k-mer based on how often it appears in the input sequences compared with the shuffled input sequences and a genomic background. k-mers are then ranked based on their final weights:

$$W = \log(PP) \times \left[\log(E_{wg}) + \log(E_{sh})\right],$$

with $W$ as the k-mer's enrichment weight, $PP$ as the proportion of sequences that contain the k-mer over the total number of input sequences, $E_{wg}$ as the k-mer's enrichment over the genomic background, and $E_{sh}$ as its enrichment over the shuffled input. Position weight matrices (PWMs) are then generated by first picking a top k-mer and enriched k-mers similar to itself to construct a "seed" PWM, which is then extended by adding more enriched k-mers that are a few base pairs shifted from the original one. The motifs are then further ranked and filtered based on how well they differentiate the foreground from the background using LASSO (least absolute shrinkage and selection operator) logistic regression. The final set of motifs is then evaluated by random forest.

Epigram was individually applied to each dataset (see Materials and Methods for details). For each histone modification in each sample, Epigram found DNA motifs that discriminate enrichment peaks of the mark under consideration from a background of regions that do not overlap with any peak of the six histone modifications. Importantly, the background has the same GC content, number of regions, and sequence lengths as the foreground to avoid inflated prediction results caused by simple features or an unbalanced dataset (4). In our previous paper (4), we performed several additional analyses to remove confounding factors, such as some histone marks preferring particular genomic regions (e.g., H3K4me3 in promoters). Our analyses showed that the identified motifs can discriminate the modified regions from different backgrounds. Given the large number of experiments we analyzed in this study, we did not repeat these additional analyses for each experiment. We achieved good performance, with average areas under the curve (AUCs) ranging from 0.71 to 0.91 (Fig. 1A and Dataset S2).

In total, Epigram identified 65,361 motifs. Because some motifs are likely to be shared between different cell types or histone modifications, it is not surprising that many motifs were found multiple times. To reduce the redundancy, we used a motif distance metric to quantify the similarity between different motifs, based on which we hierarchically clustered the motifs (see Materials and Methods for details). The resulting tree was then cut using a threshold of 0.15, corresponding to a P value of $\sim 10^{-3}$ that was calculated using a distribution of similarity distances for randomly shuffled motifs (SI Appendix, Fig. S1 shows the process and an example of a cluster). Motifs within the distance threshold were considered to represent the same motif, and the examples shown in SI Appendix, Fig. S1 are visibly similar. The motif with the highest enrichment score within each cluster was selected as the representative, where the enrichment score was computed by comparing the occurrence of a motif in the histone

Fig. 1. Identification of DNA motifs associated with six histone modifications in 121 human cells and tissues. (A, Left) The workflow of identifying motifs associated with histone modifications. (Right) The average AUCs for each mark across all 121 human cell types. (B) Association of the final motif set with the histone marks across all 121 human cell types. The x axis represents each motif cluster in the final set, color-coded by their associated histone marks. The y axis represents the ChIP-seq experiments ordered by histone modifications. Black spots inside the matrix show whether a motif cluster was found in a ChIP-seq experiment. (C) Number of motif clusters per type. (D) Example of de novo motifs matched to the known motifs.
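As an illustration of the mark-assignment step summarized in panels B and C, the hypergeometric enrichment test applied to each motif cluster can be sketched as follows. This is a minimal sketch, not the authors' code: the function name `hypergeom_upper_tail` is hypothetical, and the upper-tail convention P(X >= k) is an assumption, because the text does not state which tail definition was used. The counts are those given in the Results for cluster H3K4me3+H3K27ac_872.

```python
from math import comb


def hypergeom_upper_tail(k: int, total: int, successes: int, draws: int) -> float:
    """P(X >= k) for a hypergeometric draw (assumed tail convention).

    total     -- size of the background (all motifs found by Epigram)
    successes -- background motifs that came from experiments of one mark
    draws     -- number of motifs in the cluster being tested
    k         -- cluster members that came from experiments of that mark
    """
    upper = min(successes, draws)
    # Exact integer arithmetic; a single division at the end avoids overflow.
    numerator = sum(
        comb(successes, i) * comb(total - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(total, draws)


# Counts quoted in the text for cluster H3K4me3+H3K27ac_872.
TOTAL_MOTIFS = 65_361
CLUSTER_SIZE = 384

p_h3k4me3 = hypergeom_upper_tail(133, TOTAL_MOTIFS, 10_936, CLUSTER_SIZE)
p_h3k27ac = hypergeom_upper_tail(84, TOTAL_MOTIFS, 8_839, CLUSTER_SIZE)

P_CUTOFF = 1e-3  # P value cut-off stated in the text
for mark, p in (("H3K4me3", p_h3k4me3), ("H3K27ac", p_h3k27ac)):
    print(f"{mark}: P = {p:.3g}, associated = {p < P_CUTOFF}")
```

For these counts the text reports P values of 1.01 × 10^-16 (H3K4me3) and 1.65 × 10^-5 (H3K27ac). The sketch need not reproduce those values exactly, since the tail convention is assumed, but both tails fall below the 10^-3 cut-off, so the cluster is assigned to both marks.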

modification peaks of interest and the background. As a result, we obtained motif clusters. To identify the most confident motifs, we selected the largest clusters so that together they capture roughly 50% of the original motifs. In the end, there are 361 clusters with at least 40 individual motifs (containing 52.6% of our total starting motifs); the resulting clusters are shown in Fig. 1B (see example motifs in Dataset S1).

To determine whether a motif cluster is mark-specific or shared between marks, we counted the number of times that its member motifs were found to be predictive of each mark in any cell or tissue. We then performed a hypergeometric test (P value cut-off of $10^{-3}$) to identify the statistically significant association between the motif cluster and marks. The background of the hypergeometric test was the original set of 65,361 motifs. For each cluster, the hypergeometric test was based on all members of that cluster. For example, cluster H3K4me3+H3K27ac_872 had 384 motifs in total, among which 133 were identified from H3K4me3 experiments and 84 were found in H3K27ac experiments, while the background contained 10,936 of the total 65,361 motifs obtained from H3K4me3 experiments and 8,839 obtained from H3K27ac experiments; the P value was thus $1.01 \times 10^{-16}$ for association with mark H3K4me3 and $1.65 \times 10^{-5}$ for H3K27ac. Among the 361 motifs, 303 are associated with only one histone mark, indicating their high specificity to histone modification. For these mark-specific motifs, H3K36me3 and H3K9me3 contribute a large portion (117 and 89 motifs, respectively), and the motifs associated with narrow marks are inclined to be shared between marks (Fig. 1C). Because broad marks like H3K36me3 often cover whole gene bodies, identified motifs can come from intron or exon regions. These are confounding factors in predicting H3K36me3 signals. Because H3K36me3 has been shown to be important for splicing (7), the found motifs can be important for both establishing H3K36me3 and regulating splicing. Some H3K36me3 motif clusters contain some motifs associated with H3K4me1 (Fig. 1B). However, when the background was taken into account, these motif clusters did not pass the hypergeometric test for H3K4me1 enrichment and thus were not classified as such.

Among the 58 motifs associated with more than one histone mark, a large portion are motifs associated with H3K27ac, H3K4me3, and H3K4me1. In general, broad histone marks do not share motifs with narrow marks. Furthermore, the multimark motifs are largely associated with functional combinations of histone marks. For example, H3K27ac shares a significant number of motifs with H3K4me3 and H3K4me1, which is reasonable because H3K4me3/H3K27ac and H3K4me1/H3K27ac mark active promoters and enhancers, respectively. In contrast, H3K4me3 and H3K4me1 do not share motifs with each other. We also found that H3K27me3 and H3K4me1 share motifs, which is not surprising as they together mark poised enhancers (8). H3K27me3 and H3K4me3 also share motifs as these motifs occur in bivalent promoters, which are important in early embryogenesis (9).

The majority of the found motifs do not match any known motif: in human, 71 of 361 motifs have a match using TomTom (14) at an e-value cut-off of 0.1 (examples of known motif matches are in Fig. 1D and SI Appendix, Fig. S2D). We have provided the complete list of the identified motifs and whether they match any known motif in Dataset S1. Numerous identified motifs are known to be important for histone modifications. For example, the c-JUN motif was found to be associated with H3K27ac in our analysis, which is consistent with the previous studies showing the regulatory role of c-JUN on histone modifications, such as Ser10 phosphorylation of histone H3, acetylation of histones H3 and H4, and recruitment of histone deacetylase 3 (HDAC3), NF-κB subunits, and RNA polymerase II across the ccl2 locus (10). In c-JUN–deficient cells, HDAC3 binding around the ccl2 locus was low compared with nondeficient cells, leading to increased histone acetylation levels in the 5′ region of the transcription start site (TSS) (8). Other examples include SP1 and SP3 motifs that are known to recruit HDAC1 to repress transcription of various genes; HDAC inhibitors can target SP1 sites to activate transcription (11). Thus, it makes sense to find these motifs within promoter/enhancer-specific histone marks. We also found the motif recognized by the cAMP response element-binding protein (CREB). CREB is known to recruit CBP (CREB-binding protein), which has intrinsic histone acetyltransferase (HAT) activity (12).

Experimentally Validating the Possible Regulatory Roles of DNA Motifs on Histone Modifications. We selected H3K27ac for experimental validation as it marks both active promoters and enhancers. We took a strategy of mutating the motifs, rather than deleting the entire region, using CRISPR/Cas9 to validate the direct impact of the motifs we identified on histone modifications. The advantage of this approach is that it keeps the investigated sequence at the same length and thus avoids effects on H3K27ac caused by sequence deletion rather than motif disruption (SI Appendix, Tables S2–S4). On the other hand, this strategy limited our choice of cell lines to embryonic stem cells, in which recombination is possible, compared with fully differentiated cells. Therefore, we focused on the histone-associated motifs identified from H1 embryonic stem cells and four H1-derived cells representing early developmental stages (Mes, MSC, NPC, TRO cells) (4). For the experiment, we chose motif clusters that are associated with H3K27ac and scanned the two regions: one from the top-ranked predicted locus in chromosome 3 (chr3) by Epigram and one from the middle ranked in chromosome 1 (chr1). The regions are ranked by the number of trees within the random forest model in Epigram that had correctly predicted the regions as containing H3K27ac modification, which indicates the confidence of prediction. These two regions are about 300 bp long, making them suitable for motif shuffling with the genomic-editing strategy. The score cut-off chosen for each motif to call occurrence was the score that best differentiated the foreground from the background in Epigram. The chr3 site contains four motifs, including three matched to known ones (TEAD4, GATA, and JUNB), and the chr1 site contains seven motifs, including two matched to known ones (TEAD4 and ZBTB33) (SI Appendix, Table S2).

Because the strategy to introduce shuffled motifs into the genome leaves residual loxP sites after selection cassette removal (Materials and Methods and Fig. 2 A and B), we need to consider the residual loxP sequence when comparing the histone modification between unmodified and modified cells. Modified cells in which both alleles carry loxP but which are heterozygous at the motif regions are ideal for our purpose, because the allelic WT motif regions serve as a control in comparing the status of histone modification. We used a CRISPR plasmid and a homology-directed repair (HDR) donor plasmid to cotransfect the H1 cells, followed by Puromycin selection, clonal isolation, and genotyping PCR. By analyzing the genotyping PCR and Sanger sequencing results of the clones from the two chromosome loci, we found that about half of the clones had the loxP cassette in both alleles but the shuffled motifs region in only one of the alleles (named motif-shuffled heterozygote), and the rest of the clones had a loxP cassette and WT motifs region in both alleles (named loxP-control homozygote). No clones with the loxP cassette and shuffled motifs region in both alleles were found. These results were ascribed to the possibility that the spacer region may also serve as the homology arm along with the left homology arm to facilitate the integration of the loxP cassette (indicated with the light blue dashed line in Fig. 2B). For our purpose, the motif-shuffled heterozygote should be ideal for comparing the H3K27ac level between alleles within the same cell, while the loxP-control homozygote would serve as a control for comparing the potential effect of residual loxP sequence to unmodified cells. For each locus, we selected a motif-shuffled heterozygote and a loxP-control homozygote from the sequenced

[Fig. 2 panel labels: (A) pCas9-GFP-gRNA and HDR donor plasmid constructs; (B) wild-type, loxP-control, and mutant alleles through Puromycin selection and expression of Cre recombinase, then expansion for ChIP assay; (C) Chromosome 1: 144166402-144167402 (hg18), H1 cell H3K27ac ChIP-seq with allele-specific probes WT1, WT2, MS1, and MS2; (D) Chromosome 3: 178116895-178117895 (hg18), H1 cell H3K27ac ChIP-seq with allele-specific probes WT1, WT2, MS1, and MS2.]

Fig. 2. Experimental validation of the found motifs on regulating histone modifications. (A) CRISPR and donor constructs: MfeI restriction enzyme site indicated for gRNA cloning. (B) Schematic flowchart of CRISPR-mediated knockin in H1 ESCs. (C) ChIP-qPCR results of chr1 locus: H3K27ac ChIP-seq peaks of the region are shown (Upper); WT, loxP-control, and mutant alleles are shown (Lower) with WT motifs marked in red and sequence-shuffled motifs marked in yellow; black lines indicate the survey region of ChIP-qPCR; ChIP-qPCR results are shown at the bottom, error bar is shown for three biological replicates with *P = 0.0371 for probe #1 and **P = 0.0072 for probe #2. (D) ChIP-qPCR result of chr3 locus: H3K27ac ChIP-seq peaks of the region are shown (Upper); WT, loxP-control, and mutant alleles are shown (Lower) with WT motifs marked in red and sequence-shuffled motifs marked in yellow; black lines indicate the survey region of ChIP-qPCR; ChIP-qPCR results are shown at the bottom, error bar is shown for three biological replicates with **P = 0.0026 for probe #1 and **P = 0.002872 for probe #2.
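The percent input method used to quantify the ChIP-qPCR signal in panels C and D can be illustrated with a short calculation. This is a minimal sketch under stated assumptions: the Ct values, the 1% input fraction, and the function names `percent_input` and `relative_change` are hypothetical, and the additional normalization to internal control regions described in the text is not modeled.

```python
from math import log2


def percent_input(ct_ip: float, ct_input: float, input_fraction: float) -> float:
    """Percent of input recovered in the ChIP sample.

    ct_ip          -- qPCR Ct of the immunoprecipitated sample
    ct_input       -- qPCR Ct of the saved input sample
    input_fraction -- fraction of chromatin saved as input (0.01 means 1%)
    """
    # Shift the input Ct to what a 100% input sample would have produced.
    adjusted_input_ct = ct_input - log2(1.0 / input_fraction)
    return 100.0 * 2.0 ** (adjusted_input_ct - ct_ip)


def relative_change(mutant_signal: float, wt_signal: float) -> float:
    """Fractional change of the mutant-allele signal relative to the WT allele."""
    return (mutant_signal - wt_signal) / wt_signal


# Hypothetical Ct values for one allele-specific probe pair (not from the paper).
wt_signal = percent_input(ct_ip=23.0, ct_input=25.0, input_fraction=0.01)
ms_signal = percent_input(ct_ip=24.0, ct_input=25.0, input_fraction=0.01)

print(f"WT allele: {wt_signal:.2f}% of input")
print(f"MS allele: {ms_signal:.2f}% of input")
print(f"Change:    {relative_change(ms_signal, wt_signal):+.0%}")
```

With these made-up values, a one-cycle delay in the mutant-allele probe corresponds to a halving of the recovered signal; the decreases reported for the chr1 and chr3 loci come from the measured data, not from this example.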

clones to proceed with Cre-mediated loxP cassette removal. Subsequent clones were identified by genotyping PCR for the loxP cassette removal from both alleles and further confirmed by Sanger sequencing.

We performed H3K27ac ChIP assays on three different genotypes for each locus, including unmodified H1 with two WT alleles, a loxP-control homozygote with two loxP-control alleles, and a motif-shuffled heterozygote with one loxP-control allele and one mutant allele. Regarding the allele-specific probes for qPCR, we designed two pairs of primers for each allele based on the sequence differences between the WT and sequence-shuffled motifs (MS) (Fig. 2 C and D). The ChIP-qPCR result was analyzed with the percent input method and further normalized to internal controls of regions with low and high H3K27ac level to minimize the variability of processing different samples. In the chr1 locus, both WT probes (WT1 and WT2) showed similar H3K27ac levels between the samples of H1 and loxP-control homozygote, indicating that the effect caused by the residual loxP sequence was negligible in this locus (Fig. 2C). In the sample of motif-shuffled heterozygote, the allele-specific probe chr1-MS1 showed about a 60% decrease of the H3K27ac level compared with its paired probe chr1-WT1 with P = 0.0072 from three biological replicates, while another pair of probes showed a 33% decrease with P = 0.0371 (Fig. 2C). Similar results were found in the chr3 locus (Fig. 2D). These results validated the regulatory roles of the identified motifs on histone modifications. Note that the top-ranked chr3 locus has four motifs mutated but showed more significant alteration of the H3K27ac signals compared with the middle-ranked chr1 locus with seven motifs mutated.

DNA Motifs Associated with Histone Modifications in Mouse Embryonic Tissues. To investigate how the DNA motifs associated with histone modifications have evolved, we conducted the same analysis in the mouse ENCODE dataset that contains 8 histone modifications (H3K4me1/2/3, H3K9ac/me3, H3K27ac/me3, H3K36me3) from 12 embryonic tissues at 7 different developmental stages (SI Appendix, Table S5). To be consistent with all of the other analyses done by the ENCODE Consortium, we used the peaks called by the ENCODE DCC. While the performance on H3K4me3 is comparable between human and mouse, the average AUCs of Epigram for each mark in mouse are slightly (about 0.04–0.05) lower than in human. There can be many reasons for this

difference, one of which is the difference in data quality: most of the human data are from cell lines with a higher quality than the mouse data obtained from tissues that are composed of heterogeneous cell types. We indeed observed a lower number of broad peaks in the mouse samples than in the human samples: for example, several thousand H3K9me3 peaks in many mouse tissues compared with an average of 40,000 peaks in human; an average of 13,000 H3K27me3 peaks in mouse and 32,000 in human; and 22,000 H3K36me3 peaks in mouse and 47,000 in human. Despite that, the performances are still at a significant level (AUC of 0.7–0.95 in SI Appendix, Fig. S2A and also in Dataset S2).

We identified 48,080 motifs in mouse. After hierarchical clustering, we obtained 5,086 motif clusters. To focus on the most confident motifs and achieve a number of motifs comparable to that in human, we selected the clusters using a size cut-off so that the resulting clusters contain roughly 50% of all of the original motifs. With a size cut-off of 30, we ended up with 369 clusters, containing 50.8% of the total motifs. Among these 369 motifs, 94 are matched with known motifs (Dataset S1). Similar to the human results, a majority (263 motifs) of the 369 motifs are specific to their respective histone marks, while 89 motifs are associated with two or more marks (SI Appendix, Fig. S2B). This number is increased compared with that of the human analysis (58 shared motifs), likely because we included two more narrow-peak histone modifications (H3K9ac and H3K4me2) and narrow marks tend to share motifs with each other, as observed in human.

The distribution of mark-specific motifs in mouse resembles that in human (SI Appendix, Fig. S2 C and D). For example, the H3K9me3- and H3K36me3-specific motifs account for a large portion of the motifs, but another broad mark, H3K27me3, has only several specific motifs. Furthermore, numerous DNA motifs matched to the known ones, such as c-Jun, SP1, SP3, and CREB discussed above, were also found in mouse, and some additional example motifs [such as USF1, known to recruit histone modification complexes (13)] are shown in SI Appendix, Fig. S2D.

Histone Modification-Associated Motifs Are Conserved Between Human and Mouse. The similarities from the two independent analyses in two species indicate that the possible regulatory relationship between DNA motifs and histone modifications may be conserved. In fact, among the 361 human and 369 mouse motifs, 107 are conserved [TomTom (14) e-value cut-off of 0.1]. Among these 107 conserved motifs, a majority are associated with the same or similar histone marks (Fig. 3A): 24 with the same mark, and 67 with at least one shared mark in the multimark-associated motifs or with different marks that occur in similar regions. For example, except for one motif, the H3K4me3 human motifs are all associated with H3K4me2, H3K9ac, H3K27me3, or H3K27ac in mouse. A small portion (16 motifs) of conserved motifs have different mark associations between human and mouse (example motifs in SI Appendix, Table S1).

We next examined whether these conserved motifs appear in the conserved regions. PhastCons (15) scores from multiple alignment of human and 45 vertebrate genomes (including mouse) were plotted for these motifs. The conserved motifs appear in regions having significantly higher PhastCons scores than the nonconserved ones (Fig. 3C). Among the conserved motifs, motifs associated with the same marks show the overall highest PhastCons scores, which is not unexpected. Manual examination of example loci also confirmed this trend (Fig. 3C). The conserved motifs with different mark association and the nonconserved motifs can be the fast-evolved or species-specific ones. Their appearance in the regions with relatively lower PhastCons scores is not surprising.

For the 91 motifs retaining the same or similar marks between human and mouse, the conservation patterns vary. H3K4me3 is a promoter mark and its associated motifs appear in the most conserved regions. H3K4me1 is an enhancer mark and also appears in promoters; its motifs' PhastCons scores are also higher than the background. Noticeably, the PhastCons score shows a dip at the H3K4me3/1 motifs. We performed SPAMO (16) analysis on the H3K4me3 motif loci versus TSS and found that conserved H3K4me3 motifs are frequently 6–7 bp downstream of the TSS with a P value of $1.89 \times 10^{-7}$ (SI Appendix, Fig. S4). Because TSSs are the most conserved (Fig. 3B), this creates the dip when plotting the PhastCons scores by centering the motifs and without considering the orientation of the promoters. We suspect that the dip in the H3K4me1 plot may result from a similar reason because enhancers may have orientation bias indicated by unidirectional eRNA signals (17). The motifs associated with the other four marks all show a peak in the PhastCons score at the motif location. H3K9me3 motifs are the most conserved, surrounded by a background-level PhastCons score; note that some known TF motifs, such as ELK1, SOX2, NFYA, and NANOG, have similar peaky conservation at the motif sites but not every known motif shows such a pattern. For example, TEAD1/TEAD4 and GATA1 have low conservation at motif sites (Fig. 3B). Because H3K9me3 marks heterochromatin, it is not surprising that the PhastCons scores in the nearby regions are the same as the background. What is striking is that the H3K9me3 motifs are the most conserved based on PhastCons scores, and this is true for both conserved and nonconserved H3K9me3 motifs between human and mouse. As a comparison, the regions immediately around the H3K27ac, H3K27me3, or H3K36me3 motifs are less conserved than the background, which indicates that these regions are overall fast evolved; but the motif positions have a peaky PhastCons score that suggests their functional importance.

Histone Modification-Associated Motifs Overlap with Disease Expression Quantitative Trait Loci SNPs. To further explore the functional roles of the found motifs, we took the SNP data in The Cancer Genome Atlas (18) (TCGA) on 32 cancers to determine whether these motifs are important in diseases. We first used Matrix eQTL (19) to identify expression quantitative trait loci (eQTL) SNPs from the mutation and RNA-sequencing (RNA-seq) data. The resulting eQTL SNPs were then overlapped with the human motifs' loci (an example is shown in SI Appendix, Fig. S3) to determine whether the histone-associated motifs overlap with eQTL SNPs more often than random and known motifs. We calculated the number of overlaps per kilobase of SNPs per million base of motif binding sequences (OPKM) to compare between the motifs (Materials and Methods).

The distributions of H3K4me3, H3K27ac, and H3K27me3 motifs over gene bodies (Fig. 4A) show that these motifs are more specific to the first 10% of the gene body, which is close to their promoters and the first exons. H3K9me3, H3K36me3, and H3K4me1 motifs are more spread out over the genome and, consistently, they are roughly equally distributed over gene bodies. TCGA SNPs concentrate in the second half and the first 10% of the gene body. As a result, the overlaps between SNPs and histone motifs have a bias toward the gene's 3′ end. However, we did observe that a significant number of overlaps occur at the first 10% of the gene body (Fig. 4A). The first exon is known to be important for establishing the histone modifications needed for gene transcription. H3K4me3 and H3K27ac are active promoter marks, while H3K27me3 indicates repressed or poised promoters. That their associated motifs overlap with eQTL SNPs at the beginning of the gene body in cancer patients is thus reasonable, and this observation also indicates the functionality of the found motifs.

The human motifs associated with H3K4me3 overlap significantly more often with eQTL SNPs compared with both known and random motifs (Fig. 4B). A Mann–Whitney U test was used to calculate the P value of each mark's log(OPKM) distribution

[Figure 3 graphic. (A) Distribution of conserved motifs between human and mouse (%; Same, Similar, and Different among conserved motifs, versus non-conserved motifs). (B) PhastCons scores around promoters (TSS, 5′ to 3′) and around some known TF motifs (NANOG, NFYA, ELK1, ELF1, SOX2, TEAD1, TEAD4, GATA1), from −0.25 kb to +0.25 kb of the center, with background. (C) PhastCons scores around motifs, from −2.5 kb to +2.5 kb of the center, for Same, Similar, Different, and Non-conserved motifs, with random loci as baseline; example genome browser views (90 kb) showing gene, PhastCons, H3K4me1, H3K4me3, H3K9me3, H3K27ac, H3K27me3, and H3K36me3 tracks.]

Fig. 3. Comparison of histone-associated motifs in human and mouse. (A) Distribution of conserved/nonconserved histone motifs between human and mouse. Among conserved motifs, there are three categories: motifs that retain histone mark labels between human and mouse (Same), motifs that have similar histone mark labels between human and mouse (Similar), and motifs that have different histone mark labels between human and mouse (Different). (B) Average PhastCons scores around promoters and some example known motifs’ loci. (C, Left) Average Phastcons scores around histone motifs’ loci. (Right) Example genome browser view of human histone motif loci and PhastCons scores.


Fig. 4. Histone-associated motifs tend to overlap with eQTL SNPs in cancer patients. (A) Distribution of TCGA SNPs and histone motifs over gene body. Each gene's body is split into 10 equal bins. (B) Distribution of log(OPKM) of different histone motif types compared with known motifs and random motifs. (C) Log(OPKM) of histone motif and known motif per cancer type. (D) Example of motifs with highest average log(OPKM).

over all of the 32 cancers being shifted to the right of the known motifs'. For example, the log(OPKM) distributions for H3K4me3-associated motifs and known motifs have means of 0.969 (SD of 1.726) and −0.163 (SD of 1.22), respectively. The corresponding P values for all H3K4me3, H3K27ac, and H3K27me3 motifs were 0.00, 1.045 × 10^−68, and 8.43 × 10^−5, respectively. Among the histone motifs that overlap the most with cancer SNPs, several match with the known motifs, such as ZNF639, NRF1, and CREB (Fig. 4D), that have been shown to be related to cancer development. For example, in liver, inactivation of the NRF1 gene can lead to hepatic neoplasia (20). ZNF639 protein has been shown to be associated with the pathogenesis of oral and esophageal squamous cell carcinomas (21). CREB protein is mutated in more than 85% of microsatellite instability colon cancer cell lines (22).

Note that these known motifs only have intermediate log(OPKM), and the de novo motifs (motifs that did not have a significant match with HOCOMOCOv10 when using TomTom) have even higher log(OPKM). This suggests that the found motifs are biologically relevant. It is also important to note that the TCGA SNPs were measured by SNP arrays, which are designed with probes focused on promoters and gene bodies. This genomic location bias may explain why Epigram motifs associated with histone marks such as H3K9me3 and H3K4me1 do not overlap more with TCGA SNPs than the known motifs. Interestingly, the conserved H3K4me3 motifs showed more overlap compared with the nonconserved motifs (Fig. 4B), which suggests that these conserved motifs are more relevant to gene expression. Surprisingly, compared with randomly shuffled motifs, only H3K4me3 motifs overlap more with the SNPs (P value of 8.95 × 10^−109), and the known motifs overlap even less than the random motifs (P value ∼0.0), regardless of whether the random motifs were generated from shuffling histone motifs or known motifs. A possible explanation is that the known motifs are crucial for housekeeping functions and disease SNPs avoid disrupting them to facilitate proliferation of tumor cells. This speculation needs further experimental investigation.
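As a rough illustration of the OPKM statistic compared above (defined in Materials and Methods as overlaps per kilobase of SNPs per million bases of motif loci), the following Python sketch computes OPKM and its logarithm. The function names, the example counts, and the use of the natural logarithm are assumptions made for illustration; the text does not state the log base or how motifs with zero overlaps are treated.

```python
import math
from typing import Optional


def opkm(num_overlaps: int, num_snps: int, num_bases: int) -> float:
    """Overlaps per kilobase of SNPs per million bases covered by motif loci."""
    if num_snps <= 0 or num_bases <= 0:
        raise ValueError("num_snps and num_bases must be positive")
    # Scale SNP count to thousands (x 10^3) and covered bases to millions (x 10^6).
    return num_overlaps / (num_snps * num_bases) * 1e6 * 1e3


def log_opkm(num_overlaps: int, num_snps: int, num_bases: int) -> Optional[float]:
    """Natural log of OPKM (log base is an assumption); None if there are no overlaps."""
    value = opkm(num_overlaps, num_snps, num_bases)
    return math.log(value) if value > 0 else None


# Hypothetical counts for one motif in one cancer type (not taken from the paper).
example = opkm(num_overlaps=50, num_snps=10_000, num_bases=2_000_000)
print(f"OPKM = {example:.3f}, log(OPKM) = {log_opkm(50, 10_000, 2_000_000):.3f}")
```

Normalizing by both the SNP count and the number of bases covered makes values comparable across cancers with different mutation loads and across motifs with different genomic footprints.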

Different cancers have drastically different mutation rates (23) (Fig. 4C). When considering OPKM, the cancers with relatively lower mutation rates have higher OPKM values than those with relatively higher mutation rates, which indicates that the somatic mutations tend to occur in the histone-associated motifs. For example, LAML (acute myeloid leukemia) has significantly lower somatic mutation frequency than lung adenocarcinoma (LUAD); consistently, LAML has significantly higher OPKM with all motifs than LUAD (Fig. 4C). We have calculated the correlation between the average mutation rate and average OPKM per cancer for all histone mark-related motifs, and the Spearman correlation score is −0.635 with a P value of 9.1 × 10^−5. This observation also supports the functionality of the found motifs. In cancers with low mutation rates, each mutation is likely more important than those in cancers with higher mutation rates. The higher OPKM for these cancers suggests that histone motifs are important and among the first to be altered in cancers.

Histone Modification Can Be Regulated via Positive or Negative Feedback Loops. To further characterize the found motifs, we collected the top 5,000 occurring loci for each motif and performed GREAT (24) analysis. The majority of the motifs did not show any enriched Gene Ontology (GO) terms, which is not unexpected because these motifs are associated with histone modifications generally needed for every biological process. However, 17 motifs were found associated with histone modifications. For example, motif H3K4me3_3087, identified as a GCC-box motif (consensus sequence CGCCGCCGCCGC), is highly specific to H3K4me3 (for example, 63% of its occurring loci are within the H3K4me3 peaks in the H1 cells) and was found in 118 of 121 cell lines or tissues; interestingly, its enriched GO terms include histone lysine methylation (hypergeometric P value 4.7366 × 10^−8). In fact, motif H3K4me3_3087 is located at the promoter regions of several methyltransferases (examples in Fig. 5A). An example is SET Domain Containing 3 (SETD3), a protein important for development that can act as a transcription coactivator and histone methyltransferase (25). Thus, SETD3 can potentially activate the transcription of itself and other methyltransferases, further regulating the differentiation process. Another example is Lysine Demethylase 6A (KDM6A), which specifically demethylates Lys-27 of histone H3. KDM6A's demethylation of Lys-27 is accompanied by methylation of Lys-4 of histone H3 (26), which can potentially further up-regulate KDM6A. For the H3K4me3_3087 motif, its activation is likely to exert a positive feedback through enhancing active marks, including H3K4me1/2/3, and inhibiting repressive marks, including H3K27me3 and H3K9me3.

We have investigated whether H3K4me3 motifs avoid or prefer the promoters of the histone methylation enzymes. Even though H3K4me3 marks the majority of promoters, not every H3K4me3 motif appears in all promoters; in fact, each H3K4me3 motif appears in only 7.8% of the gene promoters on average. From the GREAT database, we identified 105 genes that have terms related to histone methylation for both positive and negative regulations (methyltransferase and demethylase), such as KDM4C, KDM4A, SUZ12, KDM4D, TET2, KDM8, SETD2, and SETD3. We counted the number of H3K4me3 motif matches that are within promoters of histone methylation enzyme genes and found a significant increase compared with promoters of protein-coding genes that are not considered histone-methylation related by GREAT (P value of 1.472 × 10^−6 given by a Mann–Whitney U test) (Fig. 5E). This is consistent with the fact that GREAT analysis picked up the associations with histone methylation in these H3K4me3 motifs.

Interestingly, we observed that the H3K27ac-associated motifs seem to form negative feedbacks on acetylation. The possible feedback mechanisms are derived from the motifs' occurrence in both the promoters and enhancers closest to the histone modification enzymes. For example, HDAC genes' promoters all contain H3K27ac-related motifs. Motif H3K27ac_4280 CCTCCTCCC, found in 39 cells/tissues (P value 2.72 × 10^−3), appears in the promoters of HDAC1/HDAC2 (Fig. 5B) and numerous other deacetylases. HDAC1/2 are responsible for lysine deacetylation of the core histones (H2A, H2B, H3, H4) as annotated in the UniProt database and are specifically documented to deacetylate H3K9ac in the GREAT annotation (Fig. 5B). This may suggest a negative feedback loop of histone acetylation: the H3K27ac motifs are responsible for establishing/maintaining the H3K27ac signal in the promoters of HDACs, which suggests that the HDACs are transcribed; the transcribed HDACs, in turn, deacetylate H3K9ac and/or H3K27ac marks in the genome.

These observations in human were also confirmed in mouse (Fig. 5 C and D). Of 369 mouse motifs, 91 have enriched GO terms related to histone modification. Fig. 5C shows an example motif, H3K4me3_4223, that is highly specific to the H3K4me3 mark (86.6% of the loci appear within H3K4me3 peaks in mouse forebrain E11.5; 35 of 66 tissue time points contain this motif). This motif appears in the promoter regions of several histone methyltransferases, such as MLL1, also known as KMT2A (appearing in the human example in Fig. 5A), which is a catalytic subunit of the MLL1/MLL complex that facilitates the methylation of H3K4 and forms a positive feedback to H3K4me3. Similar to the human motifs, acetylation-associated mouse motifs provide negative feedbacks. For example, H3K9ac_4441 (65.15% of its occurrences are located within H3K9ac peaks in mouse forebrain E11.5, and it was found in 40 of 66 tissue time points) appears in the promoter regions of HDACs. The negative feedback loop involves several genes previously seen in the human example (HDAC2-family genes, Sirt1). This illustrates that methylation/acetylation processes can be controlled by interplays of several factors involving feedback loops.

Discussion

Just as identifying gene-coding sequences in the genome was the first step toward understanding gene expression and function, we argue that cataloguing motifs associated with histone modifications would pave the way toward revealing the molecular mechanisms of how the information encoded in the genomic sequence is read to regulate histone modification in a tissue- and time-dependent manner. Taking advantage of the epigenomic data generated by the ENCODE and the Roadmap Epigenomics projects in diverse cell types and tissues, we have established the most comprehensive catalog of DNA motifs associated with histone modifications in both human and mouse. The regulatory function of some of these motifs on local histone modifications was validated by the drastic change of H3K27ac upon only mutating the relevant motifs, and supported by their significant overlap with eQTL SNPs in cancer patients. Of particular interest, the cancers with lower somatic mutation frequency tend to have a larger portion of mutations overlapping with histone-associated motifs than the cancers with higher somatic mutation frequency, which also supports that the found motifs are functionally important.

Furthermore, the comparison between human and mouse motifs showed that a large portion of the found motifs is conserved. Therefore, the insights obtained from mouse embryogenesis can facilitate studying human development. A surprising observation is that the conservation at the motif loci is significantly different from the neighboring regions, such as a dip of PhastCons score at the H3K4me3 motifs compared with the surrounding regions, which is completely different from the conservation profiles of the known TF motifs. Indeed, there are only a few found motifs similar to the known TF motifs, which may partially explain why the interplay between DNA sequence and histone modifications remains largely mysterious.
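Returning to the mutation-frequency observation above, the rank correlation between per-cancer mutation rate and average OPKM (reported in Results as a Spearman correlation of −0.635) is the kind of statistic sketched below. This is a minimal pure-Python illustration, not the authors' code; the function names and the toy values are hypothetical, and ties are handled with average ranks.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-cancer values: mutation rate versus average OPKM.
mutation_rate = [0.4, 1.2, 3.5, 8.0, 9.1]
average_opkm = [5.1, 4.0, 4.2, 1.3, 0.9]
print(f"Spearman correlation = {spearman(mutation_rate, average_opkm):.3f}")
```

A rank-based measure is a natural choice here because mutation rates span orders of magnitude across cancer types, so a linear correlation would be dominated by the most highly mutated cancers.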

[Figure 5 graphic. Genome browser views with RefSeq gene, histone signal, and motif tracks: (A) chrX: 44,873,175-45,112,779, KDM6A, H3K4me3, motif H3K4me3_3087; (B) chr1: 32,292,086-32,333,635, HDAC2, H3K27ac, motif H3K27ac_4280; (C) chr9: 44,803,354-44,881,274, Kmt2a, H3K4me3, motif H3K4me3_4223; (D) chr10: 36,974,543-37,001,888, Hdac2, H3K9ac, motif H3K9ac_4441.]

Fig. 5. Histone modification can be regulated by positive or negative feedback loops. (A) Positive feedback loops in human H3K4me3. (B) Negative feedback loops in human H3K27ac. (C) Positive feedback loops in mouse H3K4me3. (D) Negative feedback loops in mouse H3K9ac. Histone signals for genome browser views are from H1 cell for human and Forebrain-E11.5 sample for mouse. Green arrows denote positive regulation, red arrows are for negative regulation, gray lines indicate association between motifs and genes (motif occurring in the promoters or enhancers of a gene). (E) Average number of H3K4me3 motif matches within promoter regions of histone methylation genes and other protein-coding genes. Promoter regions are defined as 1,000 bp centered at the TSS. Only protein-coding genes were considered. Genes that do not have any matches were not included. The two distributions of number of matches are greatly different from each other; a Mann–Whitney U test gives a P value of 1.472 × 10^−6. *n is the number of H3K4me3 motif matches per gene's promoter.

(E) Histone-methylation enzymes' promoters have more H3K4me3 motif matches than other genes'
Gene set | No. genes | Avg. n* | Top genes, in decreasing order of occurrences
Histone-methylation genes | 105 | 10.527 | CARM1, EHMT1, WHSC1, KDM6A, KMT2C
Other protein-coding genes | 15,241 | 6.553 | POU3F3, ZFHX3, MEX3D, RNF213, PBX3

Another interesting observation is that the histone-associated motifs appear to relate to histone modification enzymes; for example, the H3K4me3 motifs tend to be associated with methyltransferases, suggesting positive feedbacks, and H3K27ac motifs tend to be associated with deacetylases, which indicates possible negative feedbacks. Because the temporal deposition of histone modifications is particularly important in development and differentiation, the feedbacks provided by the histone-associated motifs may guide studies to reveal the mechanisms, such as histone dynamics and epigenetic priming.

Materials and Methods

Data Processing. For human data, ChIP-seq experiments using antibodies for six different histone modifications in 121 cell types were used to assess the predictability of histone modification from DNA motifs. The six histone modifications are H3K4me1, H3K4me3, H3K27me3, H3K27ac, H3K9me3, and H3K36me3. Each of the ChIP-seq experiments had at least two replicates, and input control samples are also provided. Mapped reads were made monoclonal using HOMER. For mouse data, ChIP-seq experiments used antibodies for 8 different histone modifications in 12 embryonic cell types at 7 different developmental time points. The eight histone modifications include the six used in the human data with the addition of H3K9ac and H3K4me2 (SI Appendix, Table S5).

The human data were processed as described previously by Whitaker et al. (4). HOMER was used to call peaks for the ChIP-seq data. We used two different criteria for narrow histone peaks (H3K27ac, H3K4me1, and H3K4me3) and broad peaks (H3K27me3, H3K36me3, and H3K9me3). The mouse data were processed by the ENCODE Processing Pipeline. The pipeline calls peaks separately on each of the biological replicates using MACS2 (27). Note that the resulting replicated peaks were significantly shorter than peaks obtained from HOMER. Therefore, to match our human data, we further merged histone peaks within 1,000 bp for the narrow peaks, and within 2,500 bp for the broad peaks.

Making the Sets of Sequences for the Prediction of Histone Modification by Epigram. We ran Epigram to compare regions that are enriched with an epigenomic modification to regions that do not possess any of the modifications being considered. The enriched regions, or foreground, were the high-confidence regions that were identified as the intersect of two or more replicates (as described above). To establish a background, we took all of the continuous stretches in the genome that were 100% mappable but do not overlap with any of the histone modification peaks. Regions of the genome are not 100% mappable if the DNA sequence is replicated elsewhere in the genome. This replication of DNA sequences reduces mappability, as it is a requirement of the mapping procedure that reads map uniquely. To measure regions' mappability, we used a precomputed dataset that considered 35-bp reads mapping uniquely within the genome. When considering overlap between 100% mappable regions and histone modification peaks, the union of all peaks was used rather than the high-confidence regions (the intersect of two or more replicates).

Applying Epigram to Each Data. Epigram was individually applied to different datasets (each corresponding to one cell type-histone mark pair). For example, Epigram identifies 100 motifs from the H1-H3K4me3 data, then 120 motifs from H9-H3K27ac data, and so forth. We combined all of these motifs together and removed redundancy among them.

Motif Clustering. We used a standard hierarchical clustering algorithm to group similar motifs. To calculate the similarity between motifs, we first aligned the motifs. Let m1, m2 be two motif PWMs: m1 and m2 were aligned together with a gap penalty that increases based on the number of overhanging positions. For each overlapping position, a Jensen–Shannon Divergence score is calculated. These scores were then averaged to get the overall score. The average score per position was calculated as:

\[ \mathrm{Average}(\mathrm{alignment}) = \frac{1}{n}\sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}, \]

with n being the number of overlapping positions.
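To make the per-position scoring concrete, the sketch below computes a Jensen–Shannon divergence for each pair of aligned PWM columns and combines them into the average score. It assumes the averaging formula reads as (1/n) times the square root of the summed squared scores, that each column is a probability distribution over A, C, G, T, and that log base 2 is used; none of these details, nor the function names, are confirmed by the text, and the gap penalty is not modeled.

```python
import math
from typing import List, Sequence

Column = Sequence[float]  # probabilities for A, C, G, T at one motif position


def jensen_shannon(p: Column, q: Column) -> float:
    """Jensen-Shannon divergence between two distributions (log base 2 assumed)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u: Column, v: Column) -> float:
        # Terms with u_i == 0 contribute nothing; v_i > 0 whenever u_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def average_alignment_score(cols1: List[Column], cols2: List[Column]) -> float:
    """(1/n) * sqrt(sum of squared per-position scores) over n overlapping columns."""
    if len(cols1) != len(cols2) or not cols1:
        raise ValueError("expected the same non-zero number of overlapping columns")
    n = len(cols1)
    squared = sum(jensen_shannon(p, q) ** 2 for p, q in zip(cols1, cols2))
    return math.sqrt(squared) / n


# Two hypothetical 3-column motifs that differ only at the last position.
motif_a = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]]
motif_b = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25], [0.0, 1.0, 0.0, 0.0]]
print(f"average score = {average_alignment_score(motif_a, motif_b):.3f}")
```

Because the per-position scores are squared before summing, one strongly mismatched column raises the score more than the same total divergence spread thinly over several columns.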

This averaging method puts more weight on large differences between the PWMs at single positions than small differences over several positions. For each alignment, a gap penalty is added to alignments that do not maximize the overlapping portion. The alignment distance was computed as:

\[ D(A_{ij}) = \mathrm{Average}(\mathrm{alignment}) + 0.05\,\frac{2k - 1}{n}, \]

with k being the number of gaps in the alignment.

The distances of all possible alignments, including reverse-complementary ones, were computed and the smallest one was taken as the distance between m1 and m2. Then, a hierarchical tree was constructed using average linkage. Motif clustering was done in two steps. First, motifs from the same histone mark were clustered together and the motif with the highest information content was selected to represent each cluster. Then, the representative motifs of all different marks were clustered together. We used a height cut-off of 0.15 to cut the resulting tree for each clustering step.

Histone-Associated Motifs Forming Feedback Loops Analysis. We first used GREAT to analyze the functions of the motifs. For each motif, the top 5,000 loci with the lowest P values were analyzed. In the case when more than 5,000 top loci have the same P value, we randomly picked 5,000 loci from those. The default background was used. For GO-term enrichment cut-offs, we used a false-discovery rate of <0.05 for both the binomial and hypergeometric tests. We tested with 5,000, 4,000, and 3,000 top loci for H3K4me3_3087. In all three cases, the GO term "histone methyltransferase activity" was enriched, albeit at different P values: 4.3244e-12 for 5,000 loci, 2.7109e-8 for 4,000 loci, and 3.4110e-6 for 3,000 loci. Thus, changing the number of loci picked will likely not change the results significantly.

Based on the GREAT results, we examined the functions of the target genes of the histone modification motifs and constructed a network to represent their relationship. First, the motifs were filtered for enriched terms that contain histone acetylation/deacetylation or histone methylation/demethylation based on the GREAT annotation, which can be slightly different from the other databases. For each motif, we identified the genes involved in each of the terms (i.e., the enzymes such as methyltransferase and demethylase). We then determined the histone residue that each enzyme modifies and whether the modification is positive or negative (e.g., methylation or demethylation), based on the GREAT database. The motifs were connected to the specific histone marks. Finally, we pooled together the relationships between motifs and genes, genes and histone marks, and histone marks and motifs to build a network.

Call eQTL SNPs in the TCGA Data. Processed RNA-seq and mutation data were downloaded from Firehose (28). The data contain tumor samples from 32 cancer types. The R package MatrixeQTL was used to find eQTL SNPs. The linear model was chosen. Age, gender, race, days to death, and days to last follow-up were taken from clinical data to use as covariates. To associate a SNP with a gene, we used SNPs that are on the same chromosome and at most 10^6 bp away from the gene's TSS. The false-discovery rate cut-off was chosen at 0.05. We further accounted for linkage disequilibrium using PLINK (29) and SNP data from the 1000 Genomes Project (30). For linkage disequilibrium pruning, the minor allele frequency cut-off was 0.10 and the variance inflation factor threshold was 1.5. eQTL SNPs that were in the pruned-out set were removed from further analyses.

OPKM Calculations. The OPKM of a motif is defined as the number of SNPs overlapping with all of the motif-occurring loci divided by the number of SNPs (in thousands) and the number of base pairs covered by all of the motif-occurring loci (in millions). Thus, the formula is:

\[ \mathrm{OPKM}(\mathrm{cancer}) = \frac{\mathrm{numOfOverlaps}}{\mathrm{numOfSNPs} \times \mathrm{numOfBases}} \times 10^{6} \times 10^{3}. \]

Plotting PhastCons Scores. For each motif, 5,000 loci were picked randomly. The resulting sites were combined based on histone mark association. The PhastCons scores derived from multiple alignments of 45 vertebrate genomes with human were used. To plot the baseline, PhastCons scores of 100,000 randomly chosen 10-bp sites in hg19 were also calculated.

SPAMO Analysis. The original SPAMO algorithm determines whether a particular distance between a pair of motifs is significantly enriched. In general, the program calculates the distribution of distances between the primary and secondary motifs' loci and looks for overrepresented distances. We adapted the algorithm to our case with the primary loci being TSS and the secondary loci being the histone H3K4me3 motifs' loci.

Experimental Validation Protocols. The processes for CRISPR/Cas9 and donor construct design, plasmid construction, cell culture and electroporation, and H3K27ac ChIP-qPCR analysis are detailed in SI Appendix.

Data. Clustered motifs for human (361 motifs) and mouse (369 motifs) can be found in Dataset S1. Additional information can be found in the companion website wanglab.ucsd.edu/star/MouseENCODE/HistoneMotifs.

ACKNOWLEDGMENTS. This project is partially supported by NIH Grants U54HG006997 and R01HG009626 and California Institute of Regenerative Medicine Grant RB5 07012.

1. Lee D, et al. (2015) A method to predict the impact of regulatory variants from DNA sequence. Nat Genet 47:955–961.
2. Benveniste D, Sonntag H-J, Sanguinetti G, Sproul D (2014) Transcription factor binding predicts histone modifications in human cell lines. Proc Natl Acad Sci USA 111:13367–13372.
3. Setty M, Leslie CS (2015) SeqGL identifies context-dependent binding signals in genome-wide regulatory element maps. PLoS Comput Biol 11:e1004271.
4. Whitaker JW, Chen Z, Wang W (2015) Predicting the human epigenome from DNA motifs. Nat Methods 12:265–272, 7, 272.
5. Yue F, et al.; Mouse ENCODE Consortium (2014) A comparative encyclopedia of DNA elements in the mouse genome. Nature 515:355–364.
6. Kundaje A, et al.; Roadmap Epigenomics Consortium (2015) Integrative analysis of 111 reference human epigenomes. Nature 518:317–330.
7. Kolasinska-Zwierz P, et al. (2009) Differential chromatin marking of introns and expressed exons by H3K36me3. Nat Genet 41:376–381.
8. Calo E, Wysocka J (2013) Modification of enhancer chromatin: What, how, and why? Mol Cell 49:825–837.
9. Vastenhouw NL, Schier AF, Akhtar A, Neugebauer K (2012) Bivalent histone modifications in early embryogenesis. Curr Opin Cell Biol 24:374–386.
10. Wolter S, et al. (2008) c-Jun controls histone modifications, NF-kappaB recruitment, and RNA polymerase II function to activate the ccl2 gene. Mol Cell Biol 28:4407–4423.
11. Sowa Y, et al. (1999) Histone deacetylase inhibitor activates the p21/WAF1/Cip1 gene promoter through the Sp1 sites. Ann N Y Acad Sci 886:195–199.
12. Ogryzko VV, Schiltz RL, Russanova V, Howard BH, Nakatani Y (1996) The transcriptional coactivators p300 and CBP are histone acetyltransferases. Cell 87:953–959.
13. Huang S, Li X, Yusufzai TM, Qiu Y, Felsenfeld G (2007) USF1 recruits histone modification complexes and is critical for maintenance of a chromatin barrier. Mol Cell Biol 27:7991–8002.
14. Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS (2007) Quantifying similarity between motifs. Genome Biol 8:R24.
15. Siepel A, et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15:1034–1050.
16. Whitington T, Frith MC, Johnson J, Bailey TL (2011) Inferring transcription factor complexes from ChIP-seq data. Nucleic Acids Res 39:e98.
17. Mikhaylichenko O, et al. (2018) The degree of enhancer or promoter activity is reflected by the levels and directionality of eRNA transcription. Genes Dev 32:42–57.
18. Weinstein JN, et al.; Cancer Genome Atlas Research Network (2013) The Cancer Genome Atlas pan-cancer analysis project. Nat Genet 45:1113–1120.
19. Shabalin AA (2012) Matrix eQTL: Ultra fast eQTL analysis via large matrix operations. Bioinformatics 28:1353–1358.
20. Xu Z, et al. (2005) Liver-specific inactivation of the Nrf1 gene in adult mouse leads to nonalcoholic steatohepatitis and hepatic neoplasia. Proc Natl Acad Sci USA 102:4120–4125.
21. Gen Y, et al. (2010) SOX2 identified as a target gene for the amplification at 3q26 that is frequently detected in esophageal squamous cell carcinoma. Cancer Genet Cytogenet 202:82–93.
22. Ionov Y, Matsui S, Cowell JK (2004) A role for p300/CREB binding protein genes in promoting cancer progression in colon cancer cell lines with microsatellite instability. Proc Natl Acad Sci USA 101:1273–1278.
23. Lawrence MS, et al. (2013) Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499:214–218.
24. McLean CY, et al. (2010) GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol 28:495–501.
25. Eom GH, et al. (2011) Histone methyltransferase SETD3 regulates muscle differentiation. J Biol Chem 286:34733–34742.
26. Lee MG, et al. (2007) Demethylation of H3K27 regulates polycomb recruitment and H2A ubiquitination. Science 318:447–450.
27. Zhang Y, et al. (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol 9:R137. doi:10.1186/gb-2008-9-9-r137.
28. Broad Institute TCGA Genome Data Analysis Center (2016) Data from "Analysis-ready standardized TCGA data from Broad GDAC Firehose 2016_01_28 run," 10.7908/C11G0KM9. Available at http://gdac.broadinstitute.org/runs/stddata__2016_01_28/. Accessed February 11, 2017.
29. Chang CC, et al. (2015) Second-generation PLINK: Rising to the challenge of larger and richer datasets. Gigascience 4:7.
30. Auton A, et al.; 1000 Genomes Project Consortium (2015) A global reference for human genetic variation. Nature 526:68–74.
