Understanding Transcriptional Regulation by Integrative Analysis of Transcription Factor Binding Data


Chao Cheng,1,2 Roger Alexander,1,2 Renqiang Min,1,2 Jing Leng,2 Kevin Y. Yip,1,2,3 Joel Rozowsky,1,2 Koon-kiu Yan,1,2 Xianjun Dong,4 Sarah Djebali,5 Yijun Ruan,6 Carrie A. Davis,7 Piero Carninci,8 Timo Lassman,8 Thomas R. Gingeras,7 Roderic Guigó,5 Ewan Birney,9 Zhiping Weng,4 Michael Snyder,10 and Mark Gerstein1,2,11,12

1Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA; 2Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA; 3Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong; 4Program in Bioinformatics and Integrative Biology, Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, Massachusetts 01655, USA; 5Center for Genomic Regulation (CRG) and UPF, Dr. Aiguader, 88, 08003 Barcelona, Spain; 6Genome Institute of Singapore, Singapore 138672; 7Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11791, USA; 8RIKEN Omics Science Center, Yokohama Institute, Yokohama, Kanagawa, Japan; 9European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, United Kingdom; 10Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA; 11Department of Computer Science, Yale University, New Haven, Connecticut 06520, USA

Statistical models have been used to quantify the relationship between gene expression and transcription factor (TF) binding signals. Here we apply the models to the large-scale data generated by the ENCODE project to study transcriptional regulation by TFs. Our results reveal a notable difference in the prediction accuracy of expression levels of transcription start sites (TSSs) captured by different technologies and RNA extraction protocols. In general, the expression levels of TSSs with high CpG content are more predictable than those with low CpG content. For genes with alternative TSSs, the expression levels of downstream TSSs are more predictable than those of the upstream ones. Different TF categories and specific TFs vary substantially in their contributions to predicting expression. Between two cell lines, the differential expression of TSS can be precisely reflected by the difference of TF-binding signals in a quantitative manner, arguing against the conventional on-and-off model of TF binding. Finally, we explore the relationships between TF-binding signals and other chromatin features such as histone modifications and DNase hypersensitivity for determining expression. The models imply that these features regulate transcription in a highly coordinated manner.

[Supplemental material is available for this article.]

12Corresponding author
E-mail [email protected]
Article and supplemental material are at http://www.genome.org/cgi/doi/10.1101/gr.136838.111. Freely available online through the Genome Research Open Access option.
Genome Research 22:000–000 © 2012, Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/12; www.genome.org

Transcription factors (TFs) are critical for the transcriptional regulation of gene expression (Takahashi and Yamanaka 2006; Vaquerizas et al. 2009). In humans, they represent the largest family of proteins, accounting for around 10% of genes (Babu et al. 2004). There are two types of TFs: general and sequence-specific. The former TFs act cooperatively with RNA polymerase II and are ubiquitously involved in the transcription of a large fraction of genes (Lee and Young 2000). The latter TFs bind specific subsets of target genes, leading to distinct spatiotemporal patterns of gene expression (Kadonaga 2004). Although systematic gene expression quantification has been available for a decade from microarray experiments (Schena et al. 1995), only recently has the genome-wide identification of TF-binding sites become possible, owing to the development of chromatin immunoprecipitation followed by microarray (ChIP-chip) and sequencing (ChIP-seq) technologies (Ren et al. 2000; Johnson et al. 2007).

In several previous studies, statistical models were constructed to study the regulatory functions of TF on gene expression based on the gene expression and TF-binding data (Ouyang et al. 2009; Cheng and Gerstein 2011). These studies showed that TF-binding signals around the transcription start sites (TSSs) of genes are predictive of gene expression levels with fairly high accuracy. But these studies have the following limitations: First, estimates of gene expression have relied on probes (microarray) or sequence reads (RNA-seq) spread across a gene, possibly across multiple unknown isoforms of that gene. It is often difficult to accurately determine the expression level of each transcript based on such data, which limits the predictive power of these models. Second, the numbers of TFs used in these models were quite limited and perhaps not representative (12 TFs in both studies). Third, the TF-binding data were available for only a single cell line, so it was not possible to investigate the specificity of the models by examining the degree to which differential TF binding between two conditions affects differential expression of genes in those conditions.

Fortunately, the ENCODE project has generated a large amount of data that enables us to overcome all of these limitations (The ENCODE Project Consortium 2012). In addition to expression quantification of transcripts from RNA-seq (Wang et al. 2009) and
RNA–PET experiments (Ruan et al. 2007; JO Luo, JM Fullwood, YJ Koh, L Veeravalli, S Djebali, R Guigo, C Davis, T Gingeras, A Shahab, Y Ruan, et al., in prep.), the consortium has also used Cap Analysis of Gene Expression (CAGE) to quantify the expression levels of >130,000 TSSs (annotated by GENCODE). In contrast to RNA-seq, CAGE is a technology that directly measures the transcriptional signal at the TSS of genes (Shiraki et al. 2003; T Lassmann, P Carninci, in prep.). In total, the expression data include 267 expression profiles, representing RNA samples in multiple cell lines that are extracted from different cellular components using different RNA extraction protocols. Moreover, the ENCODE project has generated >400 TF-binding profiles for more than 120 human TFs or transcription related proteins, including both general and sequence-specific TFs (Gerstein et al. 2012). The completeness of the ENCODE data enables us to study the transcriptional regulation of TFs more accurately and comprehensively.

In this study, we apply our previously developed model (Cheng and Gerstein 2011) to the ENCODE data to better understand transcriptional regulation. We quantify the relationship between TF-binding signals around TSS and the expression level of TSS measured by different technologies, and we study the relative contribution of different TF categories and of individual TFs. We compare the regulatory difference between different types of TSS. We also show that differential expression of genes can be determined largely by the differential binding of TFs. Finally, we explore how TFs coordinate with other chromatin features (e.g., chromatin modifications and DNase hypersensitivity) to regulate transcription.

Results

Relating TF-binding signals to gene expression levels

The ENCODE project has performed a large-scale analysis of gene expression and transcription factor (TF) binding in multiple human cell lines. In the gene expression data, the transcription levels of ~130,000 GENCODE-annotated TSSs were quantified using three different technologies: cap analysis of gene expression (CAGE), RNA–PET, and RNA-seq in multiple cellular components, and with several different RNA extraction protocols. Meanwhile, the binding sites of ~120 TFs in the human genome were determined by ChIP-seq experiments (Gerstein et al. 2012). These data sets enable us to investigate the relationship between TF binding and gene expression in a systematic and quantitative manner.

We have previously shown in mouse that the expression levels of transcripts can be accurately reflected by TF-binding signals in their TSS regions (Cheng and Gerstein 2011). In this study, we aim to validate this result using data from CAGE, which directly measures the expression levels of TSSs, and to investigate the influences of different technologies and RNA extraction methods on TSS expression quantification. We constructed models to quantify the ability of TF-binding signals to statistically predict the expression levels of promoters. Unless stated otherwise, we represent the binding strength of a TF in a promoter by its average ChIP-seq signal in a 100-bp region centered on the TSS. We combined the TSS expression data with TF-binding data and then divided them into a training data set and a test data set. A model was trained on the training data set and then applied to the test data to predict the expression levels of TSSs (see Methods for details). The relationship between expression and TF binding was quantified by the correlation between predicted and actual expression levels (R), or by the coefficient of determination (R²), the percentage of variance of gene expression explained by the model. In order to evaluate the stability of our results, we built models using four different machine-learning methods: random forest (RF), support vector regression (SVR), multivariate adaptive regression splines (MARS), and multiple linear regression (MLR). Performance of the first three methods was roughly comparable, and was better than MLR, implying a nonlinear relationship between TF binding and TSS expression (Supplemental Fig. S1). In this article, to simplify presentation we focus on results from the RF method for models with multiple predictors and the SVR method for models with a single predictor (see Methods for details). Results from different methods are highly consistent and lead to the same conclusions, e.g., the relative importance of different TFs for predicting gene expression.

Our results indicate that TF-binding signals around the TSS are informative for "predicting" their expression levels. For example, Figure 1A shows the consistency between predicted and actual expression levels of TSSs measured by CAGE of whole-cell Poly A+ RNA in K562 cells. TF binding accounts for at least 67% of the variance of expression levels (R² = 0.67). In total, there are 267 promoter expression profiles representing 12 different human cell lines in our data set. The performance of the model is not directly comparable between cell lines, because different numbers of TF-binding data sets are available for different cell lines. Since the most complete data were from K562, we chose this cell line for further analysis. The expression levels of a large fraction of TSSs (~50% on average) are not detected (RPKM = 0) in any of these K562 data sets. Thus, we developed a more complicated model that first classifies TSSs into expressed and nonexpressed categories and then adopts a regression model to predict the expression levels for the expressed TSSs only (The ENCODE Project Consortium 2012). When applied to the TF data, this model achieves results very consistent with the methods without a classification step in terms of the R² value and the relative importance of different TFs. We therefore focus on the classification-free models in the rest of this analysis.

Figure 1. Accuracy of the TF model for predicting TSS expression levels. (A) Consistency of predicted values with expression levels measured by CAGE in Poly A+ RNA samples extracted from whole cells. (B) Comparison of predictive accuracies of the TF model for expression data generated by three different technologies: CAGE, RNA–PET, and RNA-seq. (C) Comparison of predictive accuracies of the TF model for expression data from three different RNA extraction protocols: Poly A+, Poly A-, and total RNA. (D) Comparison of predictive accuracies of the TF model for expression data in different cellular components. In B–D, only data sets from K562 are used. The binding signals of 40 TFSSs are used as predictors. HCP and LCP are high and low CpG content promoters, respectively. Separate models are constructed for ALL, HCP, and LCP categories.

We compared the impact of different technologies, cellular components, and RNA extraction protocols on the "prediction accuracy" of models. We used the binding signals of 40 TFs to
predict each of the 57 K562 expression profiles and compared the resulting accuracies in terms of R² values. We found that the highest predictive accuracy was achieved for TSS expression data from CAGE (Fig. 1B). RNA-seq, as a method for quantifying expression at the transcript level, seems unable to precisely capture the expression levels of TSSs. Furthermore, prediction accuracies vary significantly among different RNA extraction protocols with Poly A+ > Poly A- > Total RNA (Fig. 1C). No obvious difference was observed between the prediction accuracies for expression data from different cellular components (Fig. 1D). It can also be seen that expression levels of promoters with high CpG content (HCP) are easier to predict than those with low CpG content (LCP). We will investigate the effect of CpG content on gene expression in more detail below.

Contribution of different TFs to the regulation of gene expression

The ENCODE project has generated ChIP-seq data for a large number of DNA-binding proteins. These proteins can be roughly classified into six different categories, including sequence-specific TFs (TFSS), general or nonspecific TFs (TFNS), chromatin structure factors (ChromStr), chromatin remodeling factors (ChromRem), histone methyltransferases (HISase), and Pol3-associated factors (Pol3F) (Supplemental Table S1). For each TF, we constructed a model of expression prediction using it as the single predictor. We compared their capability for predicting expression levels of TSSs in K562 (e.g., whole cell poly A+ RNA). We found that individually, TFs in the TFNS category were significantly more predictive than proteins in other groups (P = 0.004, t-test), whereas proteins from the ChromRem and Pol3F categories were significantly less predictive (P = 0.0004 and P = 0.006, respectively, t-test) (Fig. 2A; Supplemental Table S1). TFs in the TFNS category are implicated in general transcriptional regulation. For instance, the TATA-binding protein (TBP) is a common subunit required by all three of the human RNA polymerases, I, II, and III (Kornberg 2007). Binding of these general TFs is essential for transcriptional initiation of most promoters, and therefore it makes sense that their binding signals have the highest predictive capabilities for gene expression. In contrast, it is expected that TFs in the Pol3F category are, in general, less predictive, because RNA Pol III is involved in initiating transcription of only a small fraction of promoters.

For each of the 40 TFSSs assayed in K562, we investigated its individual predictive power in a degenerate model that uses this TF as a single predictor (Fig. 2B). Strikingly, each TF alone can predict TSS expression levels of all genes with fairly high accuracy. As shown, the binding signal of MAX alone can explain 55% of the variance in expression of all TSS, which is only ~12% lower than the variance explained by the full model (67%). The R² in a degenerate model indicates the power of a TF for predicting expression individually.

In the full model, the relative importance of TFs for predicting the expression levels of promoters is roughly reflected by their Relative Importance score (RI score, see Methods) (Fig. 2C). We use the standard RI metrics of different machine learning methods, which indicate the contribution of TFs after considering their intercorrelations in a model, and thus provide complementary information to the individual predictive power. Specifically, in a random forest model the RI of a TF is calculated as the increase of prediction error (%IncMSE) when binding data for this TF is permuted. In general, highly predictive TFs have more binding peaks, particularly in the TSS proximal regions. We found in the full model that the top five most important TFs in K562 are YY1, E2F4, MYC, MAX, and ELF1. We also examined the effect of TF–TF interaction on the predictive accuracy. Our results indicated that including interaction terms in the model did not lead to further improvement.

In principle, we would expect the binding of transcriptional activators to positively correlate with gene expression levels, and a negative correlation for transcriptional repressors. Surprisingly, we observe a positive correlation between the expression level of TSSs and the binding signal of most ENCODE TFs (Supplemental Table S2). For instance, the binding of REST, which represses neuronal genes in non-neuronal tissues (Schoenherr and Anderson 1995), is positively correlated with gene expression (r = 0.70). This implies that TF occupancy alone may not be sufficient to determine the function of a TF at a locus, as has been demonstrated in a recent study (Lickwar et al. 2012). For many TFs, their binding signal in a DNA region may simply reflect the accessibility of the local chromatin structure.

The effect of promoter CpG content on gene expression

The CpG content of promoters in eukaryotes has been shaped by DNA methylation (Deaton and Bird 2011). Cytosines in CpG dinucleotides can be methylated to form 5-methylcytosine, which undergoes a high rate of mutation into uracil. Meanwhile, methylation of CpG sites within the promoter is a critical regulatory mechanism to inactivate a gene (Pai et al. 2011). As a consequence, genes repressed in germ-line cells or early developmental stages tend to have lower CpG content in their promoters (Deaton and Bird 2011). When genes are repressed by methylation of CpG cytosines in their promoters, those cytosines tend to mutate to uracil, so there is a sort of "evolutionary arms race" between CpG-based repression and mutation to uracil that lowers CpG content.

Figure 2. The capabilities of different TFs to predict TSS expression level. (A) Comparison of the predictive accuracies of individual DNA-binding proteins in six different categories. (*) Indicates that the predictive powers of TFs in a corresponding category are significantly different from those of the other TFs. (B) The predictive accuracy of using each individual TFSS as the single predictor. (C) The relative importance of each TFSS in the Random Forest model. The calculation is based on the CAGE expression data in Poly A+ RNA samples extracted from K562 whole cells. Note that TFSS labels are shared by B and C.
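
The relative importance score described in this section (the increase in prediction error when the binding data for one TF is permuted) can be illustrated with a minimal, pure-Python sketch. This is not the random forest %IncMSE computation used in the study: the toy predictor, the synthetic "binding signals," the two hypothetical TFs, and all function names are assumptions made only for this example.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, column, rng):
    """Percent increase in MSE after shuffling one predictor column."""
    baseline = mse(y, [predict(row) for row in X])
    shuffled = [row[column] for row in X]
    rng.shuffle(shuffled)  # break the link between this predictor and y
    X_perm = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, shuffled)]
    permuted = mse(y, [predict(row) for row in X_perm])
    return 100.0 * (permuted - baseline) / baseline

rng = random.Random(0)

# Synthetic data (assumption): column 0 is an informative TF, column 1 is not.
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [2.0 * a + rng.gauss(0.0, 0.1) for a, _ in X]

def predict(row):
    """Toy stand-in for a fitted model; it only uses the first TF."""
    return 2.0 * row[0]

ri_a = permutation_importance(predict, X, y, 0, rng)
ri_b = permutation_importance(predict, X, y, 1, rng)
print(ri_a, ri_b)
```

In this toy setting, permuting the informative column raises the error sharply, whereas permuting the column the predictor ignores leaves it unchanged. Random forest implementations typically compute the score on out-of-bag samples and average over trees, which this sketch omits.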


We calculated normalized CpG content for all GENCODE promoters (see Methods). As shown in Figure 3A, normalized CpG content follows a bimodal distribution, based on which we divided promoters into two classes: high CpG promoters (HCP) and low CpG promoters (LCP). HCP promoters are more highly expressed than LCP promoters as measured by CAGE experiments in all expression profiles. For example, in K562 whole-cell Poly A+ RNA, 62% of HCP promoters are expressed, while only 15.5% of LCP promoters are expressed (Fig. 3B). Furthermore, among the expressed TSSs, the expression level of HCP promoters is significantly higher than that of LCP promoters (Fig. 3C).

We have shown in Figure 1 that the expression levels of HCP promoters are easier to predict than those of LCP promoters. We further compared the relative importance of each TF for predicting the expression levels of HCP and LCP promoters. As shown in Figure 3D, the relative importance (RI) scores for the HCP model are generally greater than those for the LCP model, consistent with the higher predictive accuracy of the HCP model. The orders of the RI scores in the two models are roughly consistent, with the exception of E2F4. In the HCP model, E2F4 is the second most important TF, but in the LCP model its relative importance is very low. Consistently, the average binding signal of E2F4 at the TSS site is lower in LCP promoters than in HCP promoters (Fig. 3E). The binding signal of E2F4 alone accounts for 47% of the variance of expression levels for HCP promoters, but only 14% for LCP promoters (Fig. 3F). This finding implies that the regulation of E2F4 on gene expression might be affected by the status of CpG sites. In fact, it has been demonstrated that E2F binding can be regulated by CpG methylation (Campanero et al. 2000).

In promoters whose low expression level is mediated by CpG methylation, the methylated CpG dinucleotides have a relatively high chance to mutate into UpG. Especially for promoters repressed in germline cells or in early developmental stages, such mutations can be passed on to the next generation, resulting in a reduction in CpG content in that promoter region in future generations. We examined the correlation coefficient of normalized CpG content with expression levels of promoters in different cell lines. We found that the best correlation was obtained in H1HESC (H1 human embryonic stem cells), indicating that CpG content best reflects promoter expression status in this cell line. This indicates that gene expression and DNA methylation in germline cells or early developmental stages might be more similar to H1HESC than other cell lines. We also examined the effect of using CpG content for classifying expressed and nonexpressed promoters. As shown in Figure 3H, this method of classification achieves its highest accuracy (AUC = 0.82; see Methods for details) in H1HESC, with much lower accuracy in HEPG2 cells.

Figure 3. The relationship between promoter CpG content and expression level. (A) The distribution of normalized CpG content for all human GENCODE TSSs. (B) The fraction of expressed TSSs in HCPs and LCPs. (C) The distributions of expression levels of expressed HCPs and LCPs. (D) The relative importance of each TF in the HCP- and LCP-specific models. (E) The aggregated binding signals of E2F4 around the TSS of HCPs and LCPs. (F) The predictive accuracies of HCP- and LCP-specific models using E2F4 as the single predictor. (G) The Spearman correlation coefficients between normalized CpG content and expression levels in different cell lines (CAGE data for Poly A+ RNA from whole cells). (H) The accuracies of using normalized CpG content to classify expressed and nonexpressed promoters in H1HESC and HEPG2. In B–F, the CAGE expression data for RNA extracted from K562 whole cells are used.

Regulation of alternative TSS by TFs

Many genes have multiple transcriptional start sites. Specifically, ~35% of genes annotated by GENCODE possess more than one TSS (Harrow et al. 2012). To investigate whether there are systematic differences in the regulation of different classes of TSS, we selected all genes with alternative TSSs and collected the first and the second TSS of these genes to form two TSS sets (the average distance between the first and the second TSS is 236 bp). Then we constructed separate models for the first TSS and the second TSS sets. Using expression data from the CAGE and RNA–PET experiments, we achieved higher predictive accuracy for the second TSS set (Fig. 4). The same trend was observed in RNA-seq data only when the short RNA extraction protocol was adopted. Higher prediction accuracy was achieved for the first TSS set for RNA-seq data using other RNA extraction protocols. It is unlikely that these results are caused by the CpG content issue, because the two TSS sets are similar in their CpG contents (56.2% and 55.2% of TSSs in the first and second set, respectively, are HCPs). Moreover, there is no significant difference in the expression levels between the two TSS sets. Our results imply that expression levels of the downstream TSS might rely more on TF regulation, while other chromatin features might have more influence on the transcription of the first TSS. In addition, the relative importance of TFs is different between the two models (Supplemental Fig. S2). For instance, MXI1 is the second most predictive TF in the model for the first TSS set, but it shows only a low relative importance in the model for the second TSS set. Thus, there might exist distinct regulatory mechanisms between the first and the other TSSs as suggested in Davuluri et al. (2008) and Wray et al. (2003).

Correlation of differential gene expression with differential TF binding

TF binding is regulated in a cell-type specific manner, so we expect that in two different cell lines, differential TF binding should be correlated with differential TSS expression. We investigated this hypothesis using the data in K562 and GM12878, which were derived from erythroleukemia cells and normal lymphoblastoid
cells, respectively. We selected promoters with more than fourfold expression difference between the two cell lines and constructed a K562-specific model (K-model) and a GM12878-specific model (G-model) (used 22 shared TFs in both models). When applied to whole-cell Poly A+ RNA expression data, the K-model explains 55% of the variance in the expression level of promoters in K562, but only 16% of the variance in GM12878 (Fig. 5A). Similarly, the G-model accounts for much more variance of expression in GM12878 (49%) than in K562 (34%). Moreover, TFs exhibit different relative importance in the two cell lines. For example, SP1 shows a relatively stronger effect on gene expression in GM12878, whereas MAX and ETS1 have a stronger effect in K562 (Fig. 5C).

Figure 4. Comparison of accuracies of the TF model for predicting the expression level of the first and second TSS of genes. The binding signals of 40 TFSSs are used as the predictors, and only promoters from genes with at least two TSSs are included in the models. The calculation is based on expression data from K562. RNA-seq (s) and RNA-seq (o) represent RNA-seq data using small-RNA extraction protocol and other protocols, respectively.

We next examined the effectiveness of predicting differential expression based on differential binding of TFs in promoter regions. The binding differences (log2) in K562 versus GM12878 were calculated for 22 TFs for which the ChIP-seq data were available in both cell lines. A model using those differences as predictors explains 53% of the variance in expression differences (log2 ratios) of TSSs between K562 and GM12878 (whole cell Poly A+ RNA extraction) (Fig. 5B). We also explored the relative importance of TFs in the differential expression model. Interestingly, we find that the TFs important for differential expression (e.g., YY1) are in general those that are important in both the K-model and the G-model. TFs with higher RI scores in only one cell line (e.g., SP1, MAX, and ETS1) show quite limited contributions to predicting differential expression of promoters (Fig. 5C).

In addition to the regression models, we also constructed classification models. Specifically, we selected 4493 K562-specific (log2(K562/GM12878) > 2) and 8183 GM12878-specific (log2(GM12878/K562) > 2) TSSs, and examined the capability of each individual TF for discriminating these two TSS categories (using the TF as the single classifier). As shown in Figure 5D, all of these TFs can classify the two TSS categories, with YY1 achieving the highest classification accuracy (AUC = 0.86). Similar results were achieved when different thresholds were used to select K562 and GM12878 specific TSSs.

Figure 5. Cell line specificity of the TF model. (A) Models trained and tested on data from the same cell line result in higher predictive accuracies. K Model and G Model represent models trained with data from K562 and GM12878, respectively. (B) Consistency of predicted log2 fold changes with the experimentally measured differences between K562 and GM12878. Differential binding of 22 TFs is used as the predictor set in a predictive model of differential expression. (C) The relative importance of TFs in K562- and GM12878-specific models as well as the predictive model for differential expression. (D) The power of each individual TF for classifying K562- and GM12878-specific promoters (log2 fold change > 2). CAGE expression data in Poly A+ RNA extracted from K562 and GM12878 whole cells were used in the calculation.

Relationship between histone modifications and TF-binding signals

We have previously shown that both TF binding and histone modification are predictive of expression levels of genes (Cheng and Gerstein 2011; Cheng et al. 2011b). In fact, at promoter regions, TF-binding signals and histone modification signals are highly correlated. Active genes are generally bound by transcriptional activators in their promoters and associated with strong signals of active histone marks in their promoters and gene bodies. We thus quantified the relationship between histone modifications and TF-binding signals using the predictive models.

We find that histone modification can be predicted accurately by the binding signals of TFs at the TSS regions. As shown in Figure 6, the TF-binding signal at the TSS of genes can predict H3K4me3 signals around the TSS with very high accuracy (R² = 0.85). It is also highly predictive of the signals of other histone marks, such as H3K9ac and H3K79me3 (see Supplemental Fig. S3). More interestingly, the TF-binding signals can predict the patterns of histone marks, i.e., the positions where they are located. For example, the best prediction accuracy was achieved right at the TSS for H3K4me3, which is known to be a mark for active promoters (Koch et al. 2007). In contrast, high predictive accuracy was obtained at the TSS and in the transcribed region of genes for H3K36me3, which is a histone mark for the gene body (Kolasinska-Zwierz et al. 2009). The relative importance of TFs is different for predicting different histone modification types, but MAX, YY1, ETS1, and E2F6 are generally the most informative ones (see Supplemental Fig. S4; Supplemental Table S3).

Interplay between TF binding and other chromatin features for regulating gene expression

The expression levels of promoters are strongly correlated with the local chromatin structure around the promoter regions. On one hand, chromatin structure is largely determined by nucleosome density (Lee et al. 2007) and histone modifications (Kouzarides 2007), which are in turn influenced by TFs (Narlikar et al. 2002). On the other hand, chromatin structure influences accessibility of the underlying DNA to TFs (Li et al. 2007). The chromatin structure of DNA can be captured by two technologies: DNase hypersensitivity (Follows et al. 2006; Sabo et al. 2006) and Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE) experiments
(Giresi et al. 2007). We thus applied models to investigate the relationships between gene expression and TF binding (including both TFSSs and TFNSs), histone modifications, DNase, and FAIRE data generated by ENCODE. Given the TFSS-binding data and another chromatin feature X (where X can be histone modification, general TF binding, DNase, FAIRE, or nucleosome occupancy data), we constructed five models to calculate the fractions of the variance of promoter expression levels (R²) explained by TFSS-binding data alone (TFSS model), X data alone (X model), a combination of TFSS binding and X data (TFSS + X model), the additional variance explained by TFSS-binding data after considering the X data (TFSS|X model), and the additional variance explained by X data after considering the TFSS-binding data (X|TFSS model) (Fig. 7; Supplemental Table S4).

The binding data of sequence-specific TFs and general TFs (Pol II, TATA-binding proteins, etc.) account for at least 74% of the variance in gene expression levels (the TFSS + TFNS model). The remaining variance of gene expression levels (26%) is mainly determined by post-transcriptional regulation. General TFs alone account for 73% of variance (the TFNS model), and explain 8% additional variance after considering the sequence-specific TF-binding data (the TFNS|TFSS model). This 8% additional variance is basically what is regulated at the transcriptional level but not captured by the binding data of those 40 TFSSs in the TFSS model, e.g., distal regulation by enhancers and regulation contributed by other factors. After taking into account general TF binding, the additional variance contributed by TFSS binding (the TFSS|TFNS model) is very limited (3%).

After considering the histone modification data, binding of TFSS accounts for a further 13% of additional variance in gene expression levels (the TFSS|HM model), and 8% vice versa (the HM|TFSS model). This suggests that the contributions of TFSS binding and histone modification to aggregate expression of TSS are highly but not completely redundant. Each provides extra information that is not accounted for by the other. We note that here we only use histone modification signals at the TSS regions (100 bp). Since histone modifications affect a broad region around genes, the actual variance that can be explained by the HM model should be even larger (Cheng et al. 2011b; X Dong, M Greven, A Kundaje, S Djebali, BJ Brown, C Cheng, M Gerstein, GR Serra, E Birney, Z Weng, in prep.).

The additional variance explained by TFSS-binding data after considering the data of DNase (the TFSS|DNase model), FAIRE (the TFSS|FAIRE model), and nucleosome occupancy (the TFSS|Nucleosome model) is 16%, 23%, and 37%, respectively. In contrast, after taking into account the TFSS-binding data, the additional variance further explained by these other chromatin features is negligible (<1%), and including them in a model cannot further improve the prediction accuracy for TSS expression. In fact, a combined model including all of these five categories of features leads to an accuracy of R² = 0.74.

Discussion

TFs and histone modifications are two critical factors that coordinately regulate gene transcription. The regulatory mechanisms of these and other factors are summarized in Figure 8. First, TFs and histone modifications can regulate the initiation of transcription by interacting with RNA polymerase and other general TFs and recruiting them to the TSS (see points 5, 6, 7, and 8 in Fig. 8), or by changing the accessibility of promoters to them via modulating chromatin structure (see points 3 and 4 in Fig. 8) (Mitchell and Tjian 1989; Li et al. 2007). This regulation is achieved with the assistance of chromatin modifiers and other chromatin-associated proteins, e.g., proteins that specifically recognize and bind modified histones (Kouzarides 2007). For these reasons, TF-binding data, histone modification data, and the data that capture local chromatin structure (e.g., DNase and FAIRE) are all predictive of expression levels of genes (Fig. 7). Second, these factors are inter-related and coordinately participate in transcriptional regulation. For example, TFs such as YY1 can influence histone modifications by recruiting histone modifiers to a DNA region (Yang et al. 1997); and conversely, histone modifications can affect TF binding by directly recruiting them or indirectly by changing the accessibility of DNA regions to them (Li et al. 2007). As a consequence, TF-binding and histone-modification signals are often highly correlated in TSS proximal regions. Due to this high coordination, they share a similar amount of information for "predicting" gene expression levels (Cheng and Gerstein 2011); i.e., they are redundant. Third, the transcription status of genes can in turn affect the TF-binding and histone modifications by interacting with TFs and histone modifiers (Okitsu et al. 2010). A recent study shows that TAF3, the TBP-associated core promoter factor, interacts with CTCF to form DNA loops that connect core promoters with promoter-distal sites, implying that general TFs might regulate chromatin structure of distal regions (Liu et al. 2011).

Figure 6. The effectiveness of TF-binding signals for predicting histone-modification patterns around the TSS of promoters. The binding signals of 40 TFSSs are used as the predictors. Both the TF-binding and the histone-modification data are from K562.

Figure 7. The relationship of the TFSS-binding data with five types of chromatin features for predicting promoter expression. For each type of chromatin feature, we constructed five models to calculate the fraction of variance of promoter expression levels explained by the TFSS alone (TFSS), by each feature alone (X), by a combination of TFSS and feature X (TFSS+X), as well as the additional variance explained by TFSS after taking feature X into account (TFSS|X) and vice versa (X|TFSS). Feature X represents general transcription factors (TFNS), histone modifications (HM), DNase signal, FAIRE signal, or nucleosome occupancy. CAGE expression data in Poly A+ RNA extracted from K562 whole cells were used in the calculation.
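The conditional models compared in Figure 7 (for example X|TFSS) can be sketched as a two-stage fit: expression is first predicted from TFSS binding, and feature X is then used to predict the residuals. The Python sketch below is an illustration under stated assumptions, not the authors' implementation: it uses a single predictor per group and ordinary least squares in place of the machine-learning models, the data are simulated, the names `fit_line`, `residuals`, and `partition_variance` are hypothetical, and the additional variance is expressed as a fraction of the total variance of expression, which is one plausible reading of "additional variance explained".

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(x, y):
    """Residuals of y after a least-squares fit on x."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def partition_variance(tfss, feature_x, expression):
    """Return (R2 of the TFSS model, additional R2 of the X|TFSS model)."""
    ss_total = sum_of_squares(expression)
    # Stage 1: TFSS model, expression predicted from TFSS binding alone.
    res_tfss = residuals(tfss, expression)
    r2_tfss = 1.0 - sum_of_squares(res_tfss) / ss_total
    # Stage 2: X|TFSS model, feature X predicts what the TFSS model left over.
    res_after_x = residuals(feature_x, res_tfss)
    r2_x_given_tfss = (sum_of_squares(res_tfss) - sum_of_squares(res_after_x)) / ss_total
    return r2_tfss, r2_x_given_tfss


# Simulated data: feature X is correlated with TFSS binding, as in the paper's setting.
rng = random.Random(0)
tfss = [rng.gauss(0, 1) for _ in range(500)]
feature_x = [0.8 * t + 0.6 * rng.gauss(0, 1) for t in tfss]
expression = [t + 0.5 * f + rng.gauss(0, 1) for t, f in zip(tfss, feature_x)]

r2_tfss, r2_x_given_tfss = partition_variance(tfss, feature_x, expression)
print(f"TFSS model R2 = {r2_tfss:.3f}, X|TFSS additional R2 = {r2_x_given_tfss:.3f}")
```

Swapping the roles of `tfss` and `feature_x` gives the TFSS|X quantity. When the two predictors are strongly correlated, both conditional terms shrink, which is the redundancy pattern described for TF binding and histone modifications.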



Such feedback from transcription status to TF binding and histone modifications complicates the cause-and-effect relationship between TF binding, histone modifications, and gene expression. Taken together with previous studies, our analysis reveals a highly coordinated system for regulation of gene expression that consists of TFs, histone modifications, RNA Polymerase, and other chromatin-related proteins.

Figure 8. Regulatory mechanism of TF binding, histone modification, and other chromatin features on gene expression.

In previous studies, it has been shown that TF binding and histone modifications are predictive of expression levels of mRNA transcripts measured by RNA-seq or microarrays (Ouyang et al. 2009; Cheng and Gerstein 2011). These studies also showed that expression levels from RNA-seq could be more accurately predicted than those from microarrays, indicating the higher precision of the former. In eukaryotes, many genes have multiple transcripts, which might start from different TSSs. Technically, it is often difficult to quantify precisely the expression level of each transcript by RNA-seq or microarray. We overcame this problem in this study by focusing on TSS regions, relating TF-binding signal around TSS with expression levels of TSS. CAGE is by nature the technology to quantify expression levels of TSS. For RNA–PET and RNA-seq data, we also calculate the TSS expression levels by focusing on TSS proximal regions. Overall, at the TSS level we obtained higher predictive accuracy compared with those models for predicting expression of transcripts. Our results also suggest that CAGE can best capture the expression levels of TSS. In addition, the accuracy of TSS expression quantification is also dependent on the RNA-extraction protocol being used, with highest performance achieved in Poly A+ RNA. For RNA-seq data the expression levels for TSS and transcript are both available, and we find that the TF models can predict transcript expression with a slightly higher accuracy than TSS expression (Supplemental Table S5). This indicates that RNA-seq, unlike CAGE, more accurately quantifies the expression levels for transcripts than for TSSs.

TF-binding signals used in the TF models capture regulatory information at the transcriptional level. Gene expression levels, however, are also determined by post-transcriptional factors like mRNA degradation. It is therefore more difficult for the TF model to predict the expression levels of genes that are regulated strongly at the post-transcriptional level. We performed gene ontology (GO) analysis on poorly predicted genes (i.e., genes with the largest residuals in the TF model). We find significant enrichment for some GO categories, e.g., involvement in cell cycle control (Supplemental Table S6). In addition, TSSs whose expression levels are underestimated by the TF model (y > ŷ) tend to have higher expression variance across different cell lines.

We have previously shown that the histone-modification model for gene expression prediction is tissue specific (Cheng and Gerstein 2011). In this work, we show that the TF model is also tissue specific, or more precisely, cell line specific (Fig. 5A). The best prediction accuracy is achieved when the TF-binding data and TSS-expression data from the same cell line are used. Note that to predict the expression in a cell line, we always use the TF-binding data from the same cell line, although the model might be trained from the other cell line. Thus, the higher performance of the model in the matched cell line is not caused by differential TF binding; instead, it reflects the different regulatory mechanisms between K562 and GM12878. In addition, TFs show different relative importance in different cell lines. A TF might be active and exhibit significant influence on gene expression in K562, but inactive with little effect on gene expression in GM12878. For example, SP1 shows a relatively stronger effect on gene expression in GM12878 than in K562, while MAX and ETS1 show the opposite trend. Conventionally, TF binding is often regarded as an on/off event. However, Figure 5B shows that the differential expression of TSS can be precisely reflected by the difference of TF-binding signals between two cell lines. This suggests that a quantitative measure of TF binding should be used for studying the TF–gene regulatory relationship (Biggin 2011; Cheng et al. 2011a).

Based on normalized CpG content, TSSs can be categorized into HCPs and LCPs. TSSs in the former class tend to have higher expression levels. Our results show that the expression levels of HCPs can be more accurately predicted than those of LCPs by TF-binding signals. We also find that the relative importance of some TFs is different between HCPs and LCPs. Methylation of CpG sites around TSS can represent another mechanism of gene expression regulation. In fact, it has been shown that binding of E2F factors was affected by the methylation status of their binding sites (Campanero et al. 2000; Landolin et al. 2010). Thus, the difference between the HCP model and the LCP model might reflect cooperation between TF binding and DNA methylation for transcriptional regulation.

The first TSS of a gene might be recognized in a different way from the other TSSs by the transcriptional machinery. For instance, it might recognize different TSSs independently, or alternatively, it tends to recognize the most upstream TSS but skip it in a certain frequency to initiate transcription at a downstream TSS. A recent study of the glucocorticoid (GR) and estrogen (ER) nuclear receptors (Voss et al. 2011) found that GR is a "driver" TF, while ER is a "passenger" TF that benefits from "assisted loading" from GR. It was posited that driver TFs bind to closed but breathing chromatin and recruit chromatin remodeling factors to open the chromatin fully. Passenger TFs only bind to chromatin that has been opened already by driver TFs or some other chromatin remodeling mechanism, so they benefit from assisted loading. In our study of the first and second TSS of genes, we were better able to model the second TSS from TF-binding data. We also found that YY1 best predicts expression of the set of first TSSs. It is known that YY1 can recruit chromatin remodeling factors as expected from a driver TF (Yang et al. 1997). These facts lead us to postulate that, for most genes, driver TFs bind to the first TSS and recruit chromatin remodelers, which then open the chromatin around the second TSS. This hypothesis can explain the relative predictive power of our models: When a passenger TF binds near the second TSS, its power to predict second TSS expression is boosted by the fact that chromatin remodeling has already occurred near the first TSS. Likewise, some of the predictive power of TF binding at the first TSS goes to predict transcription of the second TSS.



We show here that TF binding is highly predictive of gene expression levels using human ENCODE data, and we have previously shown the same using mouse data (Cheng and Gerstein 2011). In yeast, several studies have been performed to relate gene expression with motif existence, TF–DNA-binding data, or histone modification data (Kurdistani et al. 2004; Yuan et al. 2006). For example, Yuan et al. (2006) constructed a linear regression model to predict transcription rates of yeast genes. They showed that three types of histone acetylations alone accounted for 18% of the variance (R² = 0.18) of transcription rates, and the R² increased to 33% if TF-binding motif and nucleosome occupancy data were also included in the model. Furthermore, Li et al. (2010) showed in another study that TF binding was predictive of intrinsic expression noise of yeast genes, indicating that TF binding impacts not only the levels but also the fluctuation of gene expression. In addition, many other studies focused on identifying regulatory motifs or TFs underlying a biological process via combining expression data with TF-binding data or sequence motif analysis (Conlon et al. 2003; Yu et al. 2003; Tsai et al. 2005; Li and Zhan 2008). In the future, with more data available it would be more practical to perform similar analysis in higher organisms.

Methods

Data processing

All of the data used in this work were generated by the ENCODE project. The expression data of GENCODE TSSs were produced using three different technologies (CAGE, RNA–PET, and RNA-seq). The data include a total of 267 expression profiles, representing expression profiles for RNA samples in 12 different cell lines extracted from six different cellular components (whole-cell, cytosolic, nuclear, and nuclear subcompartments, namely chromatin, nucleoplasm, and nucleolus) using four different protocols (Poly A+, Poly A-, total, and short RNA). Note that the samples are not evenly collected from different cell lines; a large fraction of them are from K562 and GM12878. To facilitate the comparison of data from different technologies, the RNA-sequencing data were processed to obtain expression levels of the TSSs (T Lassmann, P Carninci, in prep.). The RNA–PET expression of a TSS is defined as the total number of 5′ tags within a 101-bp window centered on the TSS. For RNA-seq experiments, the expression level of a TSS is calculated as the sum of expression levels of all transcripts initiated from it. TSS expression levels are normalized and represented as RPM (reads per million) for CAGE, RNA–PET, and short RNA-seq data, or RPKM (reads per kilobase per million) for long RNA-seq (Poly A+, Poly A- and total RNA) data. The expression levels of transcripts (based on GENCODE v7 annotation) were measured as RPKM and calculated using the software FLUX CAPACITOR.

The genome-wide TF-binding data were obtained from ChIP-seq experiments. The data include >400 binding profiles, representing the binding of >120 TFs and chromatin factors in many different cell lines. Again, the most complete data were available from K562 and GM12878. We calculated the binding strengths of each TF at all of the GENCODE TSSs. Specifically, we calculated and averaged the number of reads covering a 100-bp DNA region centering on each TSS, resulting in the binding signal for this TSS. We choose the 100-bp region for two reasons: (1) We have previously shown that TF-binding signals in a narrow DNA region around the TSS achieve the highest prediction accuracy; (2) for genes with multiple TSSs the average distance between the first and the second TSSs is ~200 bp. In fact, when we increased the window size from 100 to 300, 500 until 1500 bp, we observed a gradual decrease of predictive accuracy by the TF model (Supplemental Fig. S5).

The other data sets, including histone modification, DNase I hypersensitivity, FAIRE, and nucleosome occupancy, were also generated by the ENCODE project using high-throughput sequencing technologies. The data were processed in the same way as for the TF-binding data. The human promoters/TSSs were annotated by the GENCODE project, version 7 (Harrow et al. 2012). In this work, we focus our analysis on ~130,000 high-confidence TSSs.

Categorization of DNA-binding proteins

In this work, we mainly focus on using sequence-specific TFs for predicting the expression levels of promoters. In some cases, however, the model was extended to general TFs and other DNA-binding proteins. Basically, we categorized the DNA-binding proteins with ChIP-seq data available in six categories: sequence-specific TFs (TFSS), general or nonspecific TFs (TFNS), chromatin structure factors (ChromStr), chromatin remodeling factors (ChromRem), histone methyltransferases (HISase), and Pol3-associated factors (Pol3F).

Models for predicting TSS expression levels

To understand the relationship between TF-binding signals and the expression levels of promoters, we constructed predictive models based on four different machine-learning methods: RF (random forest), MARS (multivariate adaptive regression splines), SVR (support vector regression), and MLR (multivariate linear regression). In these models, the binding signals (the average read coverage at each nucleotide) in a particular bin (e.g., the 100-bp bin at the TSS) for a set of TFs (e.g., sequence-specific TFs) were used as the predictors to predict the response variable Y (i.e., the expression levels of promoters). The promoter expression levels are distributed over an exponential range, so to stabilize variance we use log2-transformed values as the response variable with 0.03 as pseudo-count.

To evaluate the performance of the predictive models, we randomly selected 2000 promoters as the training data and the remaining as the test data. A model was trained on the training data and applied to predicting the expression levels of promoters in the test data (Ŷ_i). The predictive accuracy of the model can be measured by the correlation (R) between the predicted values (Ŷ_i) and the actual experimental expression levels (Y_i). Predictive accuracy can also be measured by the coefficient of determination (R²), the fraction of variance of gene expression explained by the model, which is defined as follows:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},$$

where $\bar{y}$ is the mean gene expression level.

For each model, we generated 10 groups of training and test data, and averaged the resulting R or R² as the predictive accuracy. The variation of R/R² is low, indicating that the training data set with 2000 promoters is large enough to achieve stable predictions.

To estimate the predictive power of an individual TF, we predicted the expression levels using an SVR model with the binding signal of the TF as the single predictor. It is also informative to show the relative contribution of each predictor in a model with multiple predictors. We use the "%IncMSE" (increase of mean squared error) calculated from the Random Forest method to represent the relative importance (RI) of TFs.
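The evaluation procedure for the expression models (log2 transform with a pseudo-count, repeated random training/test splits, and R² on the held-out promoters) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' R code: a one-predictor least-squares fit stands in for the RF, MARS, SVR, and MLR models, the pseudo-count is assumed to be added before taking the logarithm, and the names `log2_expression`, `r_squared`, and `repeated_split_r2` are hypothetical.

```python
import math
import random

PSEUDO_COUNT = 0.03  # pseudo-count stated in the text


def log2_expression(value):
    """log2-transform an expression level (assumes the pseudo-count is added first)."""
    return math.log2(value + PSEUDO_COUNT)


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - hi) ** 2 for yi, hi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x (stand-in for the real models)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def repeated_split_r2(x, y, n_train, n_repeats=10, seed=0):
    """Average test-set R2 over repeated random training/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(y)))
    scores = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        train, test = indices[:n_train], indices[n_train:]
        intercept, slope = fit_line([x[i] for i in train], [y[i] for i in train])
        predicted = [intercept + slope * x[i] for i in test]
        scores.append(r_squared([y[i] for i in test], predicted))
    return sum(scores) / len(scores)


# Toy data: a binding signal and noisy expression levels derived from it.
rng = random.Random(1)
binding = [rng.uniform(0, 10) for _ in range(300)]
expression = [log2_expression(2 ** (0.5 * b + rng.gauss(0, 1))) for b in binding]
print(repeated_split_r2(binding, expression, n_train=100))
```

In the paper's setting, `n_train` would be 2000 and the predictor would be the full matrix of TF-binding signals; the averaging over 10 splits is what allows the stability of R and R² to be assessed.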


To compute %IncMSE, the values of each TF in the test data were permuted and the prediction error (mean squared error of all genes) in the test data was recalculated using the original model. Compared with the unpermuted data, permutation of a TF will, in general, result in an increase of prediction error. Such an increase (i.e., %IncMSE) is used as a measurement of relative importance of a TF in the model (Breiman 2001). A TF with a higher %IncMSE value relative to other TFs in the model has higher importance for predicting the gene expression level.

The R packages "randomForest", "earth", and "e1071" were utilized to implement these models (R Development Core Team 2011).

Models for predicting differential gene expression

In the differential gene-expression model, the response variable "Y" was calculated as the log2 ratio of the expression levels in K562 versus GM12878 (log2 K562/GM12878), and the predictors "Xs" were calculated as the log ratio of binding signals between the two cell lines. The predictors in this model are 22 TFs for which the binding data are available for both the K562 and GM12878 cell lines. The pseudo-count (0.03) was used during the calculation to avoid extreme values caused by small expression levels. The same approaches as described in the preceding section were used for evaluating model performance and calculating relative importance of TFs.

Classification of promoters specific to K562 and GM12878

In addition to the regression models, we also constructed classification models to examine the effectiveness of classifying individual TSSs as either K562-specific or GM12878-specific based on the strength of TF-binding signals. We first identified K562-specific and GM12878-specific TSSs according to their expression in Poly A+ RNA extracted from whole cells. Promoters expressed with more than fourfold higher levels in one cell line versus the other were defined as cell-type-specific TSSs. We constructed models using RF and SVM (support vector machine) to classify the two types of TSSs. The classification accuracy was measured by the AUC (Area Under the ROC curve) in the cross-validation data, where the ROC curve (receiver operating characteristic) is a graphic plot of the sensitivity versus 1-specificity. The AUC takes a value within [0, 1], with a greater value indicating higher performance of a classification model.

Models for predicting histone modifications

We also constructed models to predict histone modification signals at different positions relative to the TSS by using the TF-binding signal in 100-bp bins around the TSS as the predictors. With these models, we examined the power of TF-binding signals for inferring histone-modification signals of 12 different types, including H3K4me1, H3K4me2, H3K4me3, H3K36me3, H3K9me1, H3K9me3, H3K27me3, H4K20me1, H3K79me2, H3K9ac, H3K27ac, and H2az. The DNA regions around TSS ([−4 kb, 4 kb]) were divided into 80 bins, each 100 bp in size. For each bin the histone modification signals associated with promoters were examined by the models. In these models the response variable Y (histone modification signal) was log2 transformed.

Models for understanding the relationships of different chromatin features

The expression levels of promoters are correlated with chromatin structure, which is influenced by histone modifications, nucleosome occupancy, and TF binding. Chromatin structure can also be captured by DNase I hypersensitivity and FAIRE data. Thus, all of these chromatin features are predictive of the expression levels of promoters. Using the ENCODE data, we investigated the relationship of five groups of chromatin features (general TF binding, histone modification, nucleosome occupancy, DNase I hypersensitivity, and FAIRE signals) with the TFSS-binding features in the context of predicting gene expression levels. For each group X, we constructed five different models. Three of the models use chromatin features in the group X (the X model), the binding signals of TFSS (the TFSS model), or a combination of them (the TFSS+X model) as the predictors, respectively. In the remaining two models, we examined the predictive power of features in X after considering the TFSS-binding signals (the X|TFSS model), and vice versa (the TFSS|X model). Specifically, for the X|TFSS model, we first predicted the expression levels of promoters (Ŷ) based on the binding signals, and then used the features in X to predict the residuals (Y − Ŷ). We calculated the R² for each of the five models. The R² of the X|TFSS model indicates the additional variance explained by the chromatin features in group X after already taking into account the TFSS-binding signal.

Calculation of normalized CpG content

We calculated the normalized CpG content of all GENCODE promoters in 2-kb DNA regions centered around their TSSs using the method described in Saxonov et al. (2006). Briefly, the normalized CpG content is calculated by dividing the observed number of CpG dinucleotides by the expected number in a promoter. Normalized CpG contents for promoters followed a bimodal distribution (Fig. 3A). Setting the cutoff value between low and high normalized CpG to 0.4 best separated the two peaks in the distribution. Promoters with a normalized CpG content above the cutoff value were classified as high CpG content promoters (HCP), and the remaining promoters were classified as low CpG content promoters (LCP). Approximately, the normalized CpG content reflects whether or not a CpG island exists near a TSS (e.g., many HCPs are located near a CpG island). It considers the CpG enrichment in the DNA regions centering directly on the TSS, and thereby is more practical than the CpG island-based method for classifying promoters.

Data access

All data are publicly available on the UCSC Genome Browser (http://genome.ucsc.edu/ENCODE/downloads.html).

Acknowledgments

We thank the ENCODE Consortium for the rich data and insightful discussions. We also thank Dr. Anshul Kundaje and Dr. Ben Brown for valuable comments and suggestions. We acknowledge support from the NIH and from the AL Williams Professorship funds.

References

Babu MM, Luscombe NM, Aravind L, Gerstein M, Teichmann SA. 2004. Structure and evolution of transcriptional regulatory networks. Curr Opin Struct Biol 14: 283–291.
Biggin MD. 2011. Animal transcription networks as highly connected, quantitative continua. Dev Cell 21: 611–626.
Breiman L. 2001. Random Forests. Mach Learn 45: 5–32.
Campanero MR, Armstrong MI, Flemington EK. 2000. CpG methylation as a mechanism for the regulation of E2F activity. Proc Natl Acad Sci 97: 6481–6486.
Cheng C, Gerstein M. 2011. Modeling the relative relationship of transcription factor binding and histone modifications to gene
expression levels in mouse embryonic stem cells. Nucleic Acids Res 40: 553–568.
Cheng C, Min R, Gerstein M. 2011a. TIP: A probabilistic method for identifying transcription factor target genes from ChIP-seq binding profiles. Bioinformatics 27: 3221–3227.
Cheng C, Yan KK, Yip KY, Rozowsky J, Alexander R, Shou C, Gerstein M. 2011b. A statistical framework for modeling gene expression using chromatin features and application to modENCODE datasets. Genome Biol 12: R15. doi: 10.1186/gb-2011-12-2-r15.
Conlon EM, Liu XS, Lieb JD, Liu JS. 2003. Integrating regulatory motif discovery and genome-wide expression analysis. Proc Natl Acad Sci 100: 3339–3344.
Davuluri RV, Suzuki Y, Sugano S, Plass C, Huang TH. 2008. The functional consequences of alternative promoter use in mammalian genomes. Trends Genet 24: 167–177.
Deaton AM, Bird A. 2011. CpG islands and the regulation of transcription. Genes Dev 25: 1010–1022.
The ENCODE Project Consortium. 2012. Integrative analysis of the human genome. Nature (in press).
Follows GA, Dhami P, Gottgens B, Bruce AW, Campbell PJ, Dillon SC, Smith AM, Koch C, Donaldson IJ, Scott MA, et al. 2006. Identifying gene regulatory elements by genomic microarray mapping of DNaseI hypersensitive sites. Genome Res 16: 1310–1319.
Gerstein BM, Kundaje A, Hariharan M, Landt GS, Yan K, Cheng C, Mu JX, Khurana E, Rozowsky J, Alexander R, et al. 2012. Analysis of the human regulatory code and network using ENCODE data. Nature (in press).
Giresi PG, Kim J, McDaniell RM, Iyer VR, Lieb JD. 2007. FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin. Genome Res 17: 877–885.
Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S, et al. 2012. GENCODE: The reference human genome annotation for the ENCODE project. Genome Res (this issue). doi: 10.1101/gr.135350.111.
Johnson DS, Mortazavi A, Myers RM, Wold B. 2007. Genome-wide mapping of in vivo protein-DNA interactions. Science 316: 1497–1502.
Kadonaga JT. 2004. Regulation of RNA polymerase II transcription by sequence-specific DNA binding factors. Cell 116: 247–257.
Koch CM, Andrews RM, Flicek P, Dillon SC, Karaoz U, Clelland GK, Wilcox S, Beare DM, Fowler JC, Couttet P, et al. 2007. The landscape of histone modifications across 1% of the human genome in five human cell lines. Genome Res 17: 691–707.
Kolasinska-Zwierz P, Down T, Latorre I, Liu T, Liu XS, Ahringer J. 2009. Differential chromatin marking of introns and expressed exons by H3K36me3. Nat Genet 41: 376–381.
Kornberg RD. 2007. The molecular basis of eukaryotic transcription. Proc Natl Acad Sci 104: 12955–12961.
Kouzarides T. 2007. Chromatin modifications and their function. Cell 128: 693–705.
Kurdistani SK, Tavazoie S, Grunstein M. 2004. Mapping global histone acetylation patterns to gene expression. Cell 117: 721–733.
Landolin JM, Johnson DS, Trinklein ND, Aldred SF, Medina C, Shulha H, Weng Z, Myers RM. 2010. Sequence features that drive human promoter function and tissue specificity. Genome Res 20: 890–898.
Lee TI, Young RA. 2000. Transcription of eukaryotic protein-coding genes. Annu Rev Genet 34: 77–137.
Lee W, Tillo D, Bray N, Morse RH, Davis RW, Hughes TR, Nislow C. 2007. A high-resolution atlas of nucleosome occupancy in yeast. Nat Genet 39: 1235–1244.
Li B, Carey M, Workman JL. 2007. The role of chromatin during transcription. Cell 128: 707–719.
Li H, Zhan M. 2008. Unraveling transcriptional regulatory programs by integrative analysis of microarray and transcription factor binding data. Bioinformatics 24: 1874–1880.
Li J, Min R, Vizeacoumar FJ, Jin K, Xin X, Zhang Z. 2010. Exploiting the determinants of stochastic gene expression in Saccharomyces cerevisiae for genome-wide prediction of expression noise. Proc Natl Acad Sci 107: 10472–10477.
Lickwar CR, Mueller F, Hanlon SE, McNally JG, Lieb JD. 2012. Genome-wide protein-DNA binding dynamics suggest a molecular clutch for transcription factor function. Nature 484: 251–255.
Liu Z, Scannell DR, Eisen MB, Tjian R. 2011. Control of embryonic stem cell lineage commitment by core promoter factor, TAF3. Cell 146: 720–731.
Mitchell PJ, Tjian R. 1989. Transcriptional regulation in mammalian cells by sequence-specific DNA binding proteins. Science 245: 371–378.
Narlikar GJ, Fan HY, Kingston RE. 2002. Cooperation between complexes that regulate chromatin structure and transcription. Cell 108: 475–487.
Okitsu CY, Hsieh JC, Hsieh CL. 2010. Transcriptional activity affects the H3K4me3 level and distribution in the coding region. Mol Cell Biol 30: 2933–2946.
Ouyang Z, Zhou Q, Wong WH. 2009. ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc Natl Acad Sci 106: 21521–21526.
Pai AA, Bell JT, Marioni JC, Pritchard JK, Gilad Y. 2011. A genome-wide study of DNA methylation patterns and gene expression levels in multiple human and chimpanzee tissues. PLoS Genet 7: e1001316. doi: 10.1371/journal.pgen.1001316.
R Development Core Team. 2011. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. http://www.R-project.org.
Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, et al. 2000. Genome-wide location and function of DNA binding proteins. Science 290: 2306–2309.
Ruan Y, Ooi HS, Choo SW, Chiu KP, Zhao XD, Srinivasan KG, Yao F, Choo CY, Liu J, Ariyaratne P, et al. 2007. Fusion transcripts and transcribed retrotransposed loci discovered through comprehensive transcriptome analysis using Paired-End diTags (PETs). Genome Res 17: 828–838.
Sabo PJ, Kuehn MS, Thurman R, Johnson BE, Johnson EM, Cao H, Yu M, Rosenzweig E, Goldy J, Haydock A, et al. 2006. Genome-scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. Nat Methods 3: 511–518.
Saxonov S, Berg P, Brutlag DL. 2006. A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc Natl Acad Sci 103: 1412–1417.
Schena M, Shalon D, Davis RW, Brown PO. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270: 467–470.
Schoenherr CJ, Anderson DJ. 1995. The neuron-restrictive silencer factor (NRSF): A coordinate repressor of multiple neuron-specific genes. Science 267: 1360–1363.
Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, Kawaji H, Kodzius R, Watahiki A, Nakamura M, Arakawa T, et al. 2003. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci 100: 15776–15781.
Takahashi K, Yamanaka S. 2006. Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell 126: 663–676.
Tsai HK, Lu HH, Li WH. 2005. Statistical methods for identifying yeast cell cycle transcription factors. Proc Natl Acad Sci 102: 13532–13537.
Vaquerizas JM, Kummerfeld SK, Teichmann SA, Luscombe NM. 2009. A census of human transcription factors: Function, expression and evolution. Nat Rev Genet 10: 252–263.
Voss TC, Schiltz RL, Sung MH, Yen PM, Stamatoyannopoulos JA, Biddie SC, Johnson TA, Miranda TB, John S, Hager GL. 2011. Dynamic exchange at regulatory elements during chromatin remodeling underlies assisted loading mechanism. Cell 146: 544–554.
Wang Z, Gerstein M, Snyder M. 2009. RNA-Seq: A revolutionary tool for transcriptomics. Nat Rev Genet 10: 57–63.
Wray GA, Hahn MW, Abouheif E, Balhoff JP, Pizer M, Rockman MV, Romano LA. 2003. The evolution of transcriptional regulation in eukaryotes. Mol Biol Evol 20: 1377–1419.
Yang WM, Yao YL, Sun JM, Davie JR, Seto E. 1997. Isolation and characterization of cDNAs corresponding to an additional member of the human histone deacetylase gene family. J Biol Chem 272: 28001–28007.
Yu H, Luscombe NM, Qian J, Gerstein M. 2003. Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends Genet 19: 422–427.
Yuan GC, Ma P, Zhong W, Liu JS. 2006. Statistical assessment of the global regulatory role of histone acetylation in Saccharomyces cerevisiae. Genome Biol 7: R70. doi: 10.1186/gb-2006-7-8-r70.

Received December 21, 2011; accepted in revised form April 30, 2012.

Genome Research www.genome.org