Supplementary Material

Inferring active cis-regulatory modules to predict functional regulatory elements

Contents

Supplementary Results
  S1. Workflow of ChIP-GSM
  S2. CRM inference comparison using ChIP-seq data of K562 cells
  S3. K562 TF expression and functions
  S4. Genes regulated by ChIP-GSM inferred CRMs for K562 promoter
  S5. ChIP-seq data simulation
Supplementary Methods
References

Supplementary Results

S1. Workflow of ChIP-GSM

HOMER package

Step 1: Transfer the ChIP-seq SAM file to a tag file readable by HOMER:
    makeTagDirectory TF_tag_folder -format sam TF.sam
Step 2: Generate window-based read counts at gene promoter regions:
    annotatePeaks.pl tss hg19 -size 20000 -hist 500 -ghist -d TF_tag_folder

ChIP-GSM R script

Step 3: Load sample and input ChIP-seq count data for each TF; normalize the total number of read tags at all windows to 1e7.
Step 4: Collect candidate TF modules.

Sampling loop (starting at loop num = 1; at the end of each pass, the sampled modules at all regions are stored as a map and loop num is increased by 1):

Step 5: Determine the total number of read tags to be assigned to foreground or background regions.
Step 6: Sample weights at foreground regions and toss read tags one by one to each region accordingly.
Step 7: Sample weights at background regions and toss read tags one by one to each region accordingly.
Step 8: Sample the mean parameter of the binding distance distribution.
Step 9: Sample the variance parameter of the noise distribution.
Step 10: Sample a TF module at each region.

Fig. S1. Workflow of the ChIP-GSM approach.

ChIP-seq data pre-processing

Step 1: We use HOMER (v4.9) to process ChIP-seq read profiles (BAM files; aligned to the human reference genome hg19) into read tag files, in which the 5' start locations and the strand directions of the read tags on each chromosome are stored in an individual file.

Step 2: We partition human cell type-specific enhancer-like or promoter-like regions (hg19 coordinates; downloaded from the ENCODE web server, http://screen.encodeproject.org/) into 500 bp segments and count the reads uniquely aligned to each segment. Regions shorter than 500 bp are extended to 500 bp around the original region center.

Step 3: We normalize the total number of reads in each ChIP-seq profile to $10^7$. This step is necessary to eliminate bias caused by differences in sequencing depth when we pool multiple ChIP-seq profiles for a joint analysis.
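As an illustration of Step 3 only, the following minimal Python sketch rescales one profile so that its window counts sum to the target depth. The function name `normalize_counts` and the example counts are hypothetical and are not part of the ChIP-GSM R script.

```python
def normalize_counts(counts, target_total=1e7):
    """Scale window read counts so that they sum to target_total."""
    total = sum(counts)
    if total == 0:
        raise ValueError("profile contains no reads")
    scale = target_total / total
    return [c * scale for c in counts]


# Hypothetical read counts for five windows of one TF profile.
profile = [12, 0, 3, 85, 0]
normalized = normalize_counts(profile)
```

Applying the same target to every profile puts sample and control counts on a common scale before they are pooled.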

Step 4: ChIP-GSM can either take a list of pre-generated cis-regulatory modules (CRMs) as input or automatically identify a set of candidate CRMs based on the input ChIP-seq data.

Cis-regulatory module inference using ChIP-GSM

Step 5: We initialize the model by assigning possible CRM structures to each region. Then, based on the read counts, we determine which regions potentially contain binding of each TF. For example, a region with more than 10 reads, or with 2-fold enrichment over the control ChIP-seq profile, is likely to contain a binding site (a foreground region); otherwise it is a background region. We then roughly estimate the total number of reads sequenced from foreground and from background regions. After that, we process foreground and background regions separately, because we assume that their read counts follow different distributions.
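The initial foreground/background call in Step 5 can be sketched as below. This is a hedged illustration: the function name, the use of "greater than or equal to" for the 2-fold test, and the pseudocount of 1 for empty control windows are assumptions, not details given in the text.

```python
def is_foreground(sample_count, control_count, min_reads=10, min_fold=2.0):
    """Initial guess of whether a region contains a binding site.

    Assumptions (not stated in the source): the fold test is inclusive and
    a control count of zero is replaced by a pseudocount of 1.
    """
    if sample_count > min_reads:
        return True
    return sample_count >= min_fold * max(control_count, 1)


# Hypothetical (sample, control) read counts for three regions.
calls = [is_foreground(s, c) for s, c in [(15, 20), (6, 2), (5, 4)]]
```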

Step 6: For each TF, we calculate a weight for each foreground region based on its read count, assuming a Power-Law distribution. The Power-Law distribution parameters are TF-specific and can be obtained by fitting the distribution to the read counts in each TF ChIP-seq profile. We assign reads (the total number is determined in Step 5) one by one to the foreground regions according to their weights.
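Assigning reads one by one according to region weights amounts to a weighted draw with replacement. A minimal sketch, assuming the weights have already been computed; `toss_reads` and the example weights are hypothetical names and values.

```python
import random
from collections import Counter


def toss_reads(weights, n_reads, rng):
    """Assign n_reads to regions, each read drawn in proportion to the weights."""
    regions = range(len(weights))
    draws = rng.choices(regions, weights=weights, k=n_reads)
    tally = Counter(draws)
    return [tally.get(region, 0) for region in regions]


rng = random.Random(0)
# Hypothetical weights for four regions; the last one never receives reads.
counts = toss_reads([0.5, 0.3, 0.2, 0.0], n_reads=1000, rng=rng)
```

The returned counts always sum to `n_reads`, so the total determined in Step 5 is preserved.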

Step 7: For the background regions of the same TF, we calculate a weight for each region based on its read count and a Gamma distribution assumption. The parameters of this Gamma distribution are obtained by fitting the read counts of the control ChIP-seq profile. We assign the remaining read tags in the current TF ChIP-seq profile one by one to the background regions according to their weights.

Step 8: Specifically for promoter studies, for each TF we estimate the mean parameter of the Exponential distribution that models the relative distance of foreground regions to the nearest transcription start site (TSS), using their genomic locations. Steps 5 to 8 are repeated for each TF.

Step 9: We estimate the variance of the residuals between observed and assigned read counts across all segments of all TFs.

Step 10: For each region, we calculate a conditional probability for each candidate CRM given the assigned read counts and randomly select a CRM according to this conditional probability distribution. The region is then classified as a foreground region for the TFs within the sampled CRM and as a background region for the other TFs. The results of Step 10 are recorded and then passed back to Step 5 to start the next round of sampling.

We run the sampling process until the sampler appears to converge to the equilibrium distribution and then start accumulating samples of CRMs. After enough samples have been collected, the sampling frequency of each module-region pair serves as the posterior probability of binding occurrence, from which the foreground regions regulated by each module can be identified.
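How sampling frequencies turn into posterior probabilities can be sketched as follows. This is a simplified illustration that assumes each stored sample is a dictionary from region to sampled module and that a fixed burn-in length is chosen by the user; both are assumptions, as are all names.

```python
from collections import Counter


def posterior_frequencies(sampled_maps, burn_in):
    """Estimate P(module | region) as the sampling frequency after burn-in."""
    kept = sampled_maps[burn_in:]
    if not kept:
        raise ValueError("no samples left after burn-in")
    tallies = {}
    for region_to_module in kept:
        for region, module in region_to_module.items():
            tallies.setdefault(region, Counter())[module] += 1
    n_kept = len(kept)
    return {
        region: {module: count / n_kept for module, count in tally.items()}
        for region, tally in tallies.items()
    }


# Hypothetical chain of five sampled maps for a single region "r1".
chain = [{"r1": "A"}, {"r1": "A"}, {"r1": "B"}, {"r1": "A"}, {"r1": "A"}]
posterior = posterior_frequencies(chain, burn_in=1)
```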

S2. CRM inference comparison using ChIP-seq data of K562 cells

[Figure S2 image not reproduced: TF composition of each inferred CRM.]

Fig. S2. Top 30 ChIP-GSM inferred CRMs from K562 promoter regions.

[Figure S3 image not reproduced. Panels: (a) and (b); axis label: Transcription factors; color bar from 0 to 1.]

Fig. S3. Inferred CRMs using the Rulefit approach (1). (a) Hierarchical clustering of the relative importance (RI) matrix estimated by Rulefit. (b) The 11 inferred CRMs, shown in the same group colors as for ChIP-GSM; Rulefit is not able to predict more specific associations of TFs.

[Figure S4 image not reproduced: TF composition of each CRM inferred by ISA.]

Fig. S4. CRMs inferred by ISA (2). CRMs predicted by ISA can be roughly clustered into four major groups (similar to the Groups 1, 2, 4 and 6 inferred by ChIP-GSM). However, CRMs in Groups 5 and 8 identified by ChIP-GSM are missing in the results of ISA.

[Figure S5 image not reproduced: TF composition of each CRM inferred by Plaid.]

Fig. S5. CRMs inferred by Plaid (3). Plaid identified five modules with only high-level large-scale associations captured.

S3. K562 TF expression and functions

[Figure S6 image not reproduced: TF-by-TF co-expression heat map with a color bar from 0 to 4; colored rectangles mark Groups 1 to 8.]

Fig. S6. Co-expression of TFs in inferred CRMs. Gene expression data was downloaded from GEO database with ID: GSE1036. Color bar represents –log10(p-value) of Pearson correlation coefficient. Rectangles with different colors represent the ChIP-GSM identified CRM groups.

TF functions:

In Group 1, the similarity for TFs is very significant (red blocks). Two associations, RBBP5-PHF8-SAP30-HDAC1 and EGR1-ZBTB7A, have been validated in K562 cells (1,4). A novel association PHF8–ZBTB7A is predicted in CRMs 9 and 37; the p-value of the mRNA expression correlation is 3.75e-2.

In Group 2, all TFs (i.e., CBX2, CBX8, RNF2, EZH2 and SUZ12) are Polycomb-group proteins whose associations have been established in stem cells and further verified in K562 cells (5). However, their expression patterns are quite diverse, so there is no significant correlation among them.

In Group 3, pairwise correlations are prominent, but there is no significant three-way correlation. One association, GATA1-GATA2-TAL1, has previously been observed in a whole-genome study (1). Because existing evidence supports the associations GATA3-GATA2 and GATA3-TEAD4 (6,7), a high-order association GATA2-GATA3-TEAD4 can be expected. However, owing to the lack of GATA3 ChIP-seq data, only GATA2-TEAD4 is identified; the p-value of their expression correlation is 3.29e-2. GATA2-ZBTB33 is confirmed by the enrichment of GATA2 binding signals at regions regulated by ZBTB33 in K562 cells (8).

In Group 4, the combinations of TFs are more diverse and the supporting evidence from mRNA expression correlation is weak. The co-binding of RBBP5 and CHD1 has been observed at active promoters enriched for H3K4me3 (9,10). SIRT6 is also highly enriched at active promoters, and its association with CHD1 is related to productive initiation (5).

Group 5 is a large group where two high-order associations, SP2-NFYA-NFYB and FOSL1-FOS-JUND-CEBPB, have been observed, which are also validated by another study on K562 cells (11). The overall correlation of mRNA expression of Group 5 TFs is very strong, as dense dark red units are observed in the Group 5 block. Here, a novel association, FOSL1-NFYA-NFYB, is also predicted, which is strongly supported by their co-binding enrichment at promoter regions of a large number of genes as well as their significant mRNA expression correlation (the core dark red block in Group 5).

In Group 6, the association RAD21-SMC3-CTCF can be verified from the fact that both RAD21 and SMC3 are members of the cohesin complex, which is known to interact with CTCF (12) and act as an insulator (13). Their three-way mRNA co-expression can be expected from their highly significant pairwise correlations. Since CTCFL is a paralog of CTCF, its interaction with RAD21 or SMC3 is also supported by the enrichment of their binding signals at the same regions.

Group 7 has only three TFs, but the association of BACH1 with the MAF protein complex, which includes MAFF and MAFK, is very strong, especially with MAFK, with which it forms a heterodimer (14). As shown in Fig. S6, the expression of BACH1 is strongly correlated with that of each of MAFF and MAFK.

In Group 8, E2F6 may be recruited by MAX to gene promoters at E-boxes (CACGTG) through protein-protein interaction (15). MAX, USF1 and BHLHE40 (BHLHB2) are E-box binding factors, and two of their associations, MAX-USF1 and BHLHE40-USF1, are identified by the ChIP-GSM approach. The correlation between E2F6 and MAX is very high. Since MAX and BHLHE40 are connected only indirectly through USF1 (whose expression profile is not available in our study), the correlation between MAX and BHLHE40 is weaker but still significant.

S4. Genes regulated by ChIP-GSM inferred CRMs for K562 promoter

Fig. S7. Target genes regulated by selected 30 CRMs functioning at K562 promoter.

S5. ChIP-seq data simulation

Breast cancer MCF-7 cell line

Case 1: ER-α, GATA3, SIN3A, NR2F2
Case 2: CTCF, E2F1, MBD3, NR2F2, PML, POLR2A, RAD21
Case 3: CEBPB, CREB1, CTCF, E2F1, EP300, ER-α, FOXA1, GATA3, MBD3, MYC, NR2F2, PBX1, PML, POLR2A, RAD21, SIN3A, TCF7, TCF12, TLE3, ZNF217

[Figure S8 image not reproduced. Panels: (A) ER-α, (B) GATA3, (C) NR2F2, (D) SIN3A. Each panel shows the distributions of read count (x-axis, 0 to 150) for background regions (input data) and for foreground regions.]

Fig. S8. Distributions of simulated ChIP-seq read counts for Case 1. For each TF, we count the TF ChIP-seq read tags falling in each region (500 bp around the peak summit). We also count read tags in the matched control ChIP-seq profile. Repeating the pre-processing for every TF, we select regions with non-zero read counts of at least two TFs. Then, among these regions, for each TF, we randomly and respectively perturb the read counts at binding or non-binding regions. For each case, we generate 10 replicates.

[Figure S9 image not reproduced. Panels (A) and (B) plot the F-measure (y-axis, 0.5 to 1) of ChIP-GSM, SignalSpider, jMOSAICS and ChromHMM for Case 1, Case 2 and Case 3.]

Fig. S9. Comparing ChIP-GSM against competing methods on CRM inference using realistically simulated ChIP-seq data. (A) F-measure of each method on all regions; (B) F-measure of each method on regions with weak binding events. SignalSpider and jMOSAICS were developed to infer CRMs using binding signals of a small number of TFs. ChromHMM was developed to identify chromatin states using histone modifications. ChIP-GSM provides improved performance, especially when there are a large number of TFs with weak binding events.
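For readers who want to reproduce such a comparison, the sketch below computes an F-measure from sets of predicted and true binding regions. It assumes the balanced F1 definition (harmonic mean of precision and recall), which the text does not state explicitly; the function name and region identifiers are hypothetical.

```python
def f_measure(predicted, truth):
    """Balanced F-measure (F1) of predicted versus true binding regions."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical region identifiers: 3 of 4 predictions are correct,
# and 3 of 5 true regions are recovered.
score = f_measure({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```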

Supplementary Methods

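$X_{k,t}$ (read count at a foreground region with $b(c_k,t) = 1$) follows a Power-Law distribution:

$$P(X_{k,t}) \sim \frac{\gamma_t - 1}{X_{\min}} \left( \frac{X_{k,t}}{X_{\min}} \right)^{-\gamma_t}. \tag{S-1}$$

$I_{k,t}$ (read count at a background region with $b(c_k,t) = 0$) follows a Gamma distribution:

$$P(I_{k,t}) \sim (I_{k,t})^{\alpha_{I,t} - 1} \exp(-\beta_{I,t} I_{k,t}). \tag{S-2}$$

$N_{k,t}$ (read count noise at each region) follows a Gaussian distribution:

$$P(N_{k,t}) = \frac{1}{\sqrt{2\pi\sigma_N^2}} \exp\left( -\frac{N_{k,t}^2}{2\sigma_N^2} \right). \tag{S-3}$$

$\sigma_N^2$ (noise variance at all regions) follows an Inverse Gamma distribution:

$$P(\sigma_N^2) = \frac{1}{C_N} (\sigma_N^2)^{-\alpha_N - 1} \exp\left( -\frac{\beta_N}{\sigma_N^2} \right). \tag{S-4}$$

$d_{k,t}$ (relative distance of each region to the nearest TSS) follows an Exponential distribution for foreground regions or a Uniform distribution for background regions:

$$\begin{cases} P\left(d_{k,t} \mid \sum_m c_{k,m} b_{m,t} = 1\right) \sim \lambda_t \exp(-\lambda_t d) \\ P\left(d_{k,t} \mid \sum_m c_{k,m} b_{m,t} = 0\right) \sim \mathrm{Uniform}(d) \end{cases}, \tag{S-5}$$

where $\lambda_t$ (exponential factor) follows a Gamma distribution as follows:

$$P(\lambda_t) = \frac{1}{C_\lambda} \lambda_t^{\alpha_{\lambda,t} - 1} \exp(-\lambda_t \beta_{\lambda,t}). \tag{S-6}$$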

Gibbs sampling procedure:

(1) Determine the total number of tags

For the t-th TF, we randomly assign a weight $p_{k,t}$ to the k-th region according to a Gamma distribution. Based on the initial selection or the previous round of sampling, we have a module-region regulatory map $C$ and know which module or TF is binding at which region. For any region with state $b(c_k,t) = 1$, we amplify its weight by a factor of $F$. Then we can roughly estimate the total number of read tags assigned to foreground regions as follows:

$$R_{t,X} = R_t \, \frac{F \sum_k b(c_k,t)\, p_{k,t}}{F \sum_k b(c_k,t)\, p_{k,t} + \sum_k \left(1 - b(c_k,t)\right) p_{k,t}}, \tag{S-7}$$

where $R_t$ is the total number of read tags of the t-th TF, which is normalized to $10^7$ in this study. The total number of read tags $R_{t,I}$ assigned to background regions for the t-th TF can be calculated as $R_t - R_{t,X}$.
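The split of the read budget in Eq. (S-7) can be worked through numerically. The sketch below is illustrative only; the function name, the example weights and the amplification factor of 3 are assumptions chosen to make the arithmetic easy to follow.

```python
def split_read_budget(weights, is_bound, amplification, total_reads=1e7):
    """Split total_reads between foreground and background as in Eq. (S-7)."""
    foreground = amplification * sum(w for w, b in zip(weights, is_bound) if b)
    background = sum(w for w, b in zip(weights, is_bound) if not b)
    reads_foreground = total_reads * foreground / (foreground + background)
    return reads_foreground, total_reads - reads_foreground


# One bound region with weight 1 (amplified to 3) and two unbound regions
# with weights 1 and 2 (sum 3), so half of the reads go to the foreground.
r_fg, r_bg = split_read_budget([1.0, 1.0, 2.0], [True, False, False], 3.0)
```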

(2) Sampling read counts for foreground regions

The conditional probability of $X_{k,t}$ can be calculated as follows:

$$\begin{aligned} P(X \mid Y_{k,t}, d_k, b(c_k,t) = 1, F) &\propto P(Y_{k,t} \mid X, b(c_k,t) = 1, \sigma_N^2)\, P(X)\, P(d_k \mid b(c_k,t) = 1) \\ &\propto \frac{1}{\sigma_N} \exp\left[ -\frac{(Y_{k,t} - X)^2}{2\sigma_N^2} \right] \cdot \frac{\gamma_t - 1}{X_{\min}} \left( \frac{X}{X_{\min}} \right)^{-\gamma_t} \cdot \lambda_t \exp(-\lambda_t d_k). \end{aligned} \tag{S-8}$$

We calculate a weight $p_{k,t}$ for each region according to Eq. (S-8) as follows:

$$p_{k,t} = \frac{1}{C_{X,t}} \exp\left[ -\frac{(Y_{k,t} - X'_{k,t})^2}{2\sigma_N^2} \right] \cdot \frac{\gamma_t - 1}{X_{\min}} \left( \frac{X'_{k,t}}{X_{\min}} \right)^{-\gamma_t} \cdot \lambda_t \exp(-\lambda_t d_k), \tag{S-9}$$

where $C_{X,t} = \sum_k \exp\left[ -\frac{(Y_{k,t} - X'_{k,t})^2}{2\sigma_N^2} \right] \cdot \frac{\gamma_t - 1}{X_{\min}} \left( \frac{X'_{k,t}}{X_{\min}} \right)^{-\gamma_t} \cdot \lambda_t \exp(-\lambda_t d_k)$.

We first assign $X_{\min}$ reads to each foreground region to meet the requirement of the Power-Law distribution. Then we assign the remaining $R_{t,X} - \sum_k b(c_k,t)\, X_{\min}$ reads to all foreground regions according to the weight $p_{k,t}$ at each region and obtain a read count $X_{k,t}$ for each foreground region.
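The two-stage assignment for foreground regions (a floor of $X_{\min}$ reads, then a weighted distribution of the remainder) can be sketched as below. The weights are taken as given rather than computed from Eq. (S-9), and the function name and example values are hypothetical.

```python
import random
from collections import Counter


def assign_foreground_reads(weights, total_reads, x_min, rng):
    """Give every region x_min reads, then spread the rest by weight."""
    n_regions = len(weights)
    remaining = total_reads - n_regions * x_min
    if remaining < 0:
        raise ValueError("not enough reads to give every region x_min reads")
    extra = Counter(rng.choices(range(n_regions), weights=weights, k=remaining))
    return [x_min + extra.get(k, 0) for k in range(n_regions)]


rng = random.Random(7)
# Hypothetical weights for three foreground regions of one TF.
reads = assign_foreground_reads([0.6, 0.3, 0.1], total_reads=500, x_min=10, rng=rng)
```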

(3) Sampling read counts for background regions

The conditional probability of $I_{k,t}$ can be calculated as follows:

$$\begin{aligned} P(I \mid Y_{k,t}, b(c_k,t) = 0, F) &\propto P(Y_{k,t} \mid I, b(c_k,t) = 0, \sigma_N^2)\, P(I)\, P(d_k \mid b(c_k,t) = 0) \\ &\propto \frac{1}{\sigma_N} \exp\left[ -\frac{(Y_{k,t} - I)^2}{2\sigma_N^2} \right] \cdot \left[ I^{\alpha_{I,t} - 1} \exp(-\beta_{I,t} I) \right] \cdot \frac{\Delta d}{d_p}. \end{aligned} \tag{S-10}$$

We calculate a weight for each background region according to Eq. (S-10) and assign $R_{t,I}$ reads to background regions according to their weights. In detail, we randomly sample a temporary count $I'_{k,t}$ based on Eq. (S-10) by varying $I$ from 1 to $X_{upper}$. Then, the weight at a background region can be calculated as:

$$p_{k,t} = \frac{1}{C_{I,t}} \exp\left[ -\frac{(Y_{k,t} - I'_{k,t})^2}{2\sigma_N^2} \right] \cdot (I'_{k,t})^{\alpha_{I,t} - 1} \exp(-\beta_{I,t} I'_{k,t}), \tag{S-11}$$

where $C_{I,t} = \sum_k \exp\left[ -\frac{(Y_{k,t} - I'_{k,t})^2}{2\sigma_N^2} \right] \cdot (I'_{k,t})^{\alpha_{I,t} - 1} \exp(-\beta_{I,t} I'_{k,t})$.

We assign $R_{t,I}$ reads to all background regions according to the weight $p_{k,t}$ at each region and finally obtain a read count $I_{k,t}$.

(4) Sampling exponential parameter of relative distance

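The conditional probability of $\lambda_t$ can be calculated as follows:

$$\begin{aligned} P(\lambda_t \mid D, C, F) &\propto \left[ \prod_k P(d_k \mid b(c_k,t) = 1) \right] P(\lambda_t) \\ &\propto \left[ \prod_{k \mid b(c_k,t) = 1} \lambda_t \exp(-\lambda_t d_k) \right] \lambda_t^{\alpha_{\lambda,t} - 1} \exp(-\lambda_t \beta_{\lambda,t}) \\ &\propto \lambda_t^{\alpha_{\lambda,t} - 1 + \sum_k b(c_k,t)} \exp\left( -\lambda_t \left( \beta_{\lambda,t} + \sum_k b(c_k,t)\, d_k \right) \right). \end{aligned} \tag{S-12}$$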

It can be seen from the form of the equation that Eq. (S-12) is also a Gamma distribution.
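Because Eq. (S-12) is a Gamma distribution with shape $\alpha_{\lambda,t} + \sum_k b(c_k,t)$ and rate $\beta_{\lambda,t} + \sum_k b(c_k,t)\, d_k$, a draw of $\lambda_t$ can be sketched with the Python standard library as below. The prior values and distances are hypothetical; note that `random.gammavariate` takes a scale parameter, so the rate is inverted.

```python
import random


def sample_lambda(foreground_distances, alpha_prior, beta_prior, rng):
    """Draw lambda_t from the Gamma posterior implied by Eq. (S-12)."""
    shape = alpha_prior + len(foreground_distances)
    rate = beta_prior + sum(foreground_distances)
    # gammavariate(shape, scale): the scale is the reciprocal of the rate.
    return rng.gammavariate(shape, 1.0 / rate)


rng = random.Random(42)
# Hypothetical TSS distances (bp) of three foreground regions of one TF.
lambda_t = sample_lambda([100.0, 200.0, 300.0], alpha_prior=1.0, beta_prior=1e-3, rng=rng)
```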

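(5) Sampling variance of noise component

After repeating the steps according to Eq. (S-7) ~ (S-12) for all TFs, the conditional probability of $\sigma_N^2$ can be calculated as follows:

$$\begin{aligned} P(\sigma_N^2 \mid Y, X, I, C) &\propto \left[ \prod_k \prod_t P(Y_{k,t} \mid Y'_{k,t}, \sigma_N^2) \right] P(\sigma_N^2) \\ &\propto \left[ \prod_k \prod_t \frac{1}{\sigma_N} \exp\left( -\frac{(Y_{k,t} - Y'_{k,t})^2}{2\sigma_N^2} \right) \right] \sigma_N^{-2\alpha_N - 2} \exp\left( -\beta_N \frac{1}{\sigma_N^2} \right) \\ &\propto (\sigma_N^2)^{-(\alpha_N + KT/2) - 1} \exp\left( -\left[ \beta_N + \frac{1}{2} \sum_k \sum_t (Y_{k,t} - Y'_{k,t})^2 \right] \frac{1}{\sigma_N^2} \right). \end{aligned} \tag{S-13}$$

It can be found that Eq. (S-13) is still an Inverse-gamma distribution.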
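Since Eq. (S-13) is an Inverse-gamma distribution with shape $\alpha_N + KT/2$ and scale $\beta_N + \frac{1}{2} \sum_k \sum_t (Y_{k,t} - Y'_{k,t})^2$, one way to draw $\sigma_N^2$ is to take the reciprocal of a Gamma draw. The sketch below assumes the observed and assigned counts are flattened over all regions and TFs; names and values are hypothetical.

```python
import random


def sample_noise_variance(observed, assigned, alpha_prior, beta_prior, rng):
    """Draw sigma_N^2 from the Inverse-gamma posterior implied by Eq. (S-13)."""
    residual_ss = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, assigned))
    shape = alpha_prior + len(observed) / 2.0   # alpha_N + KT/2
    scale = beta_prior + 0.5 * residual_ss      # beta_N + half the residual sum of squares
    # If G ~ Gamma(shape, rate=scale), then 1/G ~ Inverse-gamma(shape, scale).
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)


rng = random.Random(3)
observed = [10, 12, 9, 11, 10, 13, 8, 11]   # hypothetical Y values
assigned = [10] * 8                          # hypothetical Y' values
sigma2 = sample_noise_variance(observed, assigned, alpha_prior=2.0, beta_prior=1.0, rng=rng)
```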

(6) Sampling CRMs for foreground regions

For the k-th region, the conditional probability of each CRM, $P(c_k = m \mid Y, X, I, D, B, F)$, is calculated and normalized by $C_k = \sum_m P(c_k = m \mid Y, X, I, D, B, F)$.

For each region, we sample a CRM according to its conditional probability distribution and update the binding state of each region.
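Drawing a CRM from its conditional probability distribution at one region can be sketched as below. The unnormalized scores stand in for the conditional probabilities, whose exact form is not reproduced here; the module names and the function name are hypothetical.

```python
import random


def sample_module(scores, rng):
    """Normalize candidate-CRM scores and draw one CRM for a region."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("at least one candidate needs a positive score")
    modules = list(scores)
    probabilities = [scores[m] / total for m in modules]
    chosen = rng.choices(modules, weights=probabilities, k=1)[0]
    return chosen, dict(zip(modules, probabilities))


rng = random.Random(1)
# Hypothetical unnormalized scores for three candidate CRMs at one region.
scores = {"GATA1-TAL1": 0.6, "GATA2-TEAD4": 0.3, "background": 0.1}
chosen, probabilities = sample_module(scores, rng)
```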

The MATLAB version of Elastic Net Logistic Regression package can be downloaded from https://web.stanford.edu/~hastie/glmnet_matlab/.

Parameter settings are listed as follows:

options.alpha = 0.1;
options.nlambda = 100;
options.standardize = true;
options.intr = true;
options.thresh = 1e-7;
options.cl = [-Inf;Inf];
options.maxit = 1e+5;
options.ltype = 'Newton';
options.standardize_resp = false;
options.mtype = 'ungrouped';
family = 'binomial';

References

1. Gerstein, M.B., Kundaje, A., Hariharan, M., Landt, S.G., Yan, K.K., Cheng, C., Mu, X.J., Khurana, E., Rozowsky, J., Alexander, R. et al. (2012) Architecture of the human regulatory network derived from ENCODE data. Nature, 489, 91-100.
2. Bergmann, S., Ihmels, J. and Barkai, N. (2003) Iterative signature algorithm for the analysis of large-scale gene expression data. Physical review. E, Statistical, nonlinear, and soft matter physics, 67, 031902.
3. Lazzeroni, L. and Owen, A. (2002) Plaid models for gene expression data. Stat Sinica, 12, 61-86.
4. Giannopoulou, E.G. and Elemento, O. (2013) Inferring chromatin-bound protein complexes from genome-wide binding assays. Genome research, 23, 1295-1306.
5. Ram, O., Goren, A., Amit, I., Shoresh, N., Yosef, N., Ernst, J., Kellis, M., Gymrek, M., Issner, R., Coyne, M. et al. (2011) Combinatorial patterning of chromatin regulators uncovered by genome-wide location analysis in human cells. Cell, 147, 1628-1639.
6. Ralston, A., Cox, B.J., Nishioka, N., Sasaki, H., Chea, E., Rugg-Gunn, P., Guo, G., Robson, P., Draper, J.S. and Rossant, J. (2010) Gata3 regulates trophoblast development downstream of Tead4 and in parallel to Cdx2. Development, 137, 395-403.
7. Home, P., Saha, B., Ray, S., Dutta, D., Gunewardena, S., Yoo, B., Pal, A., Vivian, J.L., Larson, M., Petroff, M. et al. (2012) Altered subcellular localization of transcription factor TEAD4 regulates first mammalian cell lineage commitment. Proceedings of the National Academy of Sciences of the United States of America, 109, 7362-7367.
8. Blattler, A., Yao, L., Wang, Y., Ye, Z., Jin, V.X. and Farnham, P.J. (2013) ZBTB33 binds unmethylated regions of the genome associated with actively expressed genes. Epigenetics & chromatin, 6, 13.
9. Smith, E. and Shilatifard, A. (2010) The chromatin signaling pathway: diverse mechanisms of recruitment of histone-modifying enzymes and varied biological outcomes. Molecular cell, 40, 689-701.
10. Sims, R.J., 3rd, Chen, C.F., Santos-Rosa, H., Kouzarides, T., Patel, S.S. and Reinberg, D. (2005) Human but not yeast CHD1 binds directly and selectively to histone H3 methylated at lysine 4 via its tandem chromodomains. The Journal of biological chemistry, 280, 41789-41792.
11. Ernst, J. and Kellis, M. (2013) Interplay between chromatin state, regulator binding, and regulatory motifs in six human cell types. Genome research, 23, 1142-1154.
12. Wendt, K.S. and Peters, J.M. (2009) How cohesin and CTCF cooperate in regulating gene expression. Chromosome Res, 17, 201-214.
13. Xie, D., Boyle, A.P., Wu, L., Zhai, J., Kawli, T. and Snyder, M. (2013) Dynamic trans-acting factor colocalization in human cells. Cell, 155, 713-724.
14. Okita, Y., Kamoshida, A., Suzuki, H., Itoh, K., Motohashi, H., Igarashi, K., Yamamoto, M., Ogami, T., Koinuma, D. and Kato, M. (2013) Transforming growth factor-beta induces transcription factors MafK and Bach1 to suppress expression of the heme oxygenase-1 gene. The Journal of biological chemistry, 288, 20658-20667.
15. Ogawa, H., Ishiguro, K., Gaubatz, S., Livingston, D.M. and Nakatani, Y. (2002) A complex with chromatin modifiers that occupies E2F-and Myc-responsive genes in G(0) cells. Science, 296, 1132-1136.