Identifying and mapping cell-type-specific chromatin programming of gene expression

Troels T. Marstrand(a) and John D. Storey(a,b,1)

(a) Lewis-Sigler Institute for Integrative Genomics and (b) Department of Molecular Biology, Princeton University, Princeton, NJ 08544

Edited by Wing Hung Wong, Stanford University, Stanford, CA, and approved January 2, 2014 (received for review July 2, 2013)

A problem of substantial interest is to systematically map variation in chromatin structure to gene-expression regulation across conditions, environments, or differentiated cell types. We developed and applied a quantitative framework for determining the existence, strength, and type of relationship between high-resolution chromatin structure in terms of DNaseI hypersensitivity and genome-wide gene-expression levels in 20 diverse human cell types. We show that ∼25% of genes show cell-type-specific expression explained by alterations in chromatin structure. We find that distal regions of chromatin structure (e.g., ±200 kb) capture more genes with this relationship than local regions (e.g., ±2.5 kb), yet the local regions show a more pronounced effect. By exploiting variation across cell types, we were capable of pinpointing the most likely hypersensitive sites related to cell-type-specific expression, which we show have a range of contextual uses. This quantitative framework is likely applicable to other settings aimed at relating continuous genomic measurements to gene-expression variation.

epigenetics | gene regulation | computational biology | association | encode

Significance

In order for genes to be expressed in humans, the DNA corresponding to a gene and its regulatory elements must be accessible. It is hypothesized that this accessibility and its effect on gene expression plays a major role in defining the different cell types that make up a human. We have only recently been able to make the measurements necessary to model DNA accessibility and gene-expression variation in multiple human cell types at the genome-wide level. We develop and apply a new quantitative framework for identifying locations in the human genome whose DNA accessibility drives cell-type-specific gene expression.

Author contributions: J.D.S. designed research; T.T.M. and J.D.S. performed research; T.T.M. and J.D.S. contributed new reagents/analytic tools; T.T.M. analyzed data; and T.T.M. and J.D.S. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Freely available online through the PNAS open access option.

(1) To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1312523111/-/DCSupplemental.

Humans, like all other multicellular organisms, possess a large number of distinct cell types, each of which is specialized for a particular function within the body. Cells from a variety of tissue types exhibit different gene-expression profiles relating to their function, where typically only a fraction of the genome is expressed. As all somatic cells share the same genome, specialization is in part achieved by physically sequestering regions containing nonessential genes into heterochromatin structures. Genes that are needed for the particular task of the cell type display an accessible chromatin structure allowing for the binding of transcription factors and other related DNA machinery and subsequent gene expression.

To date, most studies have been limited to considering the chromatin accessibility surrounding the promoter region of genes, which is typically proximal (<10 kb) to the transcription region in just one or very few cell types or experimental conditions (1–3). However, it is also of interest to understand how larger regions (>10 kb) of chromatin structure relate to a gene’s expression variation across multiple cell types, disease states, or environmental conditions. Recently, several large-scale international collaborations have started to generate data that can be used for this purpose (4, 5), although doing so requires new developments in computational methods (6–8).

A collection of landmark papers from the Encyclopedia of DNA Elements (ENCODE) project were recently published that summarize their most recent efforts to comprehensively understand functional elements in the human genome (e.g., refs. 5, 9, 10). Using ENCODE data, we undertook a well-targeted genome-wide investigation to characterize the relationship between variations in chromatin structure and gene-expression levels across 20 diverse human cell lines (SI Appendix, Table S1). We used data on chromatin structure as ascertained through DNaseI hypersensitivity (DHS) measured by next-generation deep-sequencing technology and gene-expression data measured by Affymetrix exon arrays. Replicated data on 10 cell lines were also used to assess the robustness of our method.

Relating DHS to gene-expression levels across multiple cell types is challenging because the DHS represents a continuous variable along the genome not bound to any specific region, and the relationship between DHS and gene expression is largely uncharacterized. To exploit variation across cell types and test for cell-type-specific relationships between DHS and gene expression, the measurement units must be placed on a common scale, the continuous DHS measure associated to each gene in a well-defined manner, and all measurements considered simultaneously. Moreover, the chromatin and gene-expression relationship may only manifest in a single cell type, making standard measures of correlation between the two uninformative because their relationship is not linear over a continuous range, as shown in Fig. 1 (further details in SI Appendix and Figs. S1–S5).

The computational approach developed here provides a powerful, tractable, and intuitive way of representing these data and capturing biologically informative relationships. We were able to characterize the level to which variation of chromatin accessibility is associated with gene-expression variation in a cell-type-specific manner. Within genomic segments of significant chromatin gene-expression concordance, our methodology is further capable of pinpointing the most likely local sites related to the detected association. We show that such sites are context specific and can be shared across genes within a single cell type or across several cell types. Our quantitative framework has some generality in that it may be readily applied to associate any quantitative measure along the genome to gene-expression variation.

Results

Genome-Wide Profiling of Chromatin Accessibility and Gene Expression. We used data on genome-wide, high-resolution chromatin accessibility measurements for 20 distinct human primary and culture cell lines that were obtained by an established sequencing-based method (11). In principle, accessible “open” chromatin is cleaved

www.pnas.org/cgi/doi/10.1073/pnas.1312523111 | PNAS | Published online January 27, 2014 | E645–E654

[Figure 1, panels A–E. (A) Gene expression values per cell type for HNF4A. (B) DNase I hypersensitive sites for HNF4A, selected cell types (chr20, 42,350,000–42,500,000; tracks: RefSeq genes, DHS data, placental mammal basewise conservation by PhyloP, ENCODE transcription factor ChIP-seq). (C) Scaled and centered data, DHS volume versus gene expression, with HepG2 at 44°, Hela at 87°, and CACO2 at 14°; ARS 13.0 with Hela and 13.5 without Hela (q-value < 10e-07 in both cases); correlation 0.592 with Hela (q-value 0.107) and 0.715 without Hela (q-value 0.024). (D) Resulting ARS per cell type for HNF4A. (E) Local ARS profiles for selected cell types (Hela, HepG2, CACO2).]

Fig. 1. Overview of data and proposed approach. (A) Gene-expression measurements for 20 cell lines on an example gene, HNF4A. (B) DHS fragment sequencing counts in a region about the gene. (C) The DHS signal is captured by summing the overall number of fragments over a given segment size (e.g., ±100 kb) about the gene’s TSS to obtain a DHS volume. After global normalization, the gene-expression data and DHS volume measures are scaled to lie on the unit interval [0,1] and the data are centered about the origin according to the 2D medoid. For the HNF4A example, three outliers are clearly visible; for example, HepG2 displays both chromatin accessibility and active gene expression, whereas HeLa displays only chromatin accessibility. The goal is to quantitatively capture the isolated relationship seen in HepG2 and assess whether this relationship is statistically significant. Traditional measures of linear correlation are not suitable for identifying this type of signal, as shown by the substantial change seen after removal of a single cell line, HeLa, even though the data for HeLa are expected to exist for many genes and cell lines. The proposed ARS is robust to HeLa because the measure is based on angular placement and the median distance to the medoid of the data (dashed circle). (D) The ARS is calculated by first quantifying the relative distance to the origin for each cell line in a robust manner. An angular penalty for each cell line is then calculated to quantify cell types concordant in both expression and DHS measures. This quantity is measured in terms of angular distance from the 45° line, and it is then multiplied by the respective relative distance to give an overall score for each cell line. The maximum score is taken as the statistic for the given gene, allowing a comparison across all genes. (E) A local version of the ARS we introduce can pinpoint DHS “peaks” contributing the most to the detected association. See main text for details on the proposed methods.
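As a rough illustration of panel C of Fig. 1, the sketch below counts fragments in a window about a TSS, rescales values to the unit interval, and centers the points on their 2D medoid. It is a minimal sketch under stated assumptions, not the authors' software: each fragment is assumed to be summarized by a single base-pair coordinate, the global normalization step is omitted, and the function names (`dhs_volume`, `scale_to_unit_interval`, `center_on_medoid`) and toy values are hypothetical.

```python
import numpy as np


def dhs_volume(fragment_positions, tss, half_window_bp=100_000):
    """Count fragments whose position lies within tss +/- half_window_bp.

    Assumption: each fragment is summarized by one base-pair coordinate.
    """
    positions = np.asarray(fragment_positions)
    return int(np.sum(np.abs(positions - tss) <= half_window_bp))


def scale_to_unit_interval(values):
    """Linearly rescale values so the minimum maps to 0 and the maximum to 1."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)


def center_on_medoid(x, y):
    """Shift 2D points so that their medoid sits at the origin.

    The medoid is the observed point with the smallest summed Euclidean
    distance to all other points.
    """
    points = np.column_stack([x, y]).astype(float)
    pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    medoid = points[np.argmin(pairwise.sum(axis=1))]
    return points - medoid


# Toy example: hypothetical per-cell-type values for one gene.
expression = scale_to_unit_interval([120.0, 95.0, 110.0, 1400.0])
volume = scale_to_unit_interval([900.0, 1100.0, 1000.0, 52000.0])
centered = center_on_medoid(expression, volume)
```

Centering on an observed point (the medoid) rather than on the mean keeps a single extreme cell type from dragging the origin toward itself, which is the property the caption relies on.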

by the nonspecific endonuclease DNaseI, and the cleaved fragments are sequenced to provide a high-resolution, genome-wide map of DHS for every cell type (SI Appendix, Table S2). The interpretation of these data is that increased fragment counts within a region are indicative of greater chromatin accessibility. To investigate the impact of regional chromatin accessibility on gene-expression variation, we likewise used genome-wide gene-expression measurements in each cell line from Affymetrix exon arrays (SI Appendix, Table S4). A total of 19,215 genes were analyzed after preprocessing (Materials and Methods).

With these quantifications, we sought to characterize the relationship between chromatin accessibility and gene expression in a cell-type-specific manner, summarized in Figs. 1 and 2. What we mean by “cell-type specific” is that, when pairing gene-expression and chromatin-accessibility measures according to cell type, we observe a relationship between the two measures, typically for one

[Figure 2, Steps 1–4. Step 1: raw data and normalized data, DHS volume versus gene expression (TH1, K562, and GM06990 labeled). Step 2: relative distance to origin and penalized angular distance, per cell line. Step 3: angle-ratio statistics (ARS) per cell line; ARSmax compared with randomized data statistics. Step 4: identify statistically significant genes with DHS and expression concordance.]

Fig. 2. Overview of ARS method, applied to example gene CD69. (Step 1) For a given gene, DHS volume and gene expression are calculated for all 20 cell lines as described in the text. DHS volume and gene expression are respectively scaled to lie on the unit interval [0,1] and then median centered before considering their joint distribution. Each cell type corresponds to a single point. (Step 2) To form the “ratio” component of the ARS, r_i, the distance from the origin to each point is calculated and then scaled by the median distance (Left). The angular distance between each point and the identity line is calculated and evaluated in an exponential function to determine an angular penalty a_i for each cell type (Right). (Step 3) The final ARS is computed as the product of the normalized distances and the angular penalties, ARS_i = a_i × r_i. The maximal statistic ARS_max is calculated for each gene and the corresponding cell type recorded, in this case the TH1 cell line. (Step 4) A randomization method is performed to generate null data, upon which null ARS_max are calculated. These are compared with the observed ARS_max values to calculate the statistical significance of each gene.
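The four steps in Fig. 2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. The exponential penalty is described only as data-derived, so the form exp(−θ/τ) with a fixed τ (`tau_deg`) is a placeholder. Permuting expression values across cell types is one simple randomization and may differ from the procedure in Materials and Methods. The add-one empirical P value is a common convention rather than something stated in the text. All function names are hypothetical.

```python
import numpy as np


def _unit_scale(values):
    """Linearly rescale values to the unit interval [0, 1]."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)


def angle_ratio_statistics(dhs_volume, expression, tau_deg=20.0):
    """Return one ARS_i per cell type for a single gene."""
    # Step 1: scale each measure to [0, 1], then median center it.
    x = _unit_scale(expression)
    y = _unit_scale(dhs_volume)
    x = x - np.median(x)
    y = y - np.median(y)

    # Step 2 (ratio): distance to the origin relative to the median distance.
    dist = np.hypot(x, y)
    med = np.median(dist)
    r = dist / med if med > 0 else np.zeros_like(dist)

    # Step 2 (angle): angular distance, in degrees, to the identity line.
    theta = np.degrees(np.arctan2(y, x))
    dev = np.abs(theta - 45.0) % 180.0
    dev = np.minimum(dev, 180.0 - dev)  # now in [0, 90]
    a = np.exp(-dev / tau_deg)  # ASSUMED penalty form; tau_deg is hypothetical

    # Step 3: ARS_i = a_i * r_i; the gene-level statistic is the maximum.
    return a * r


def randomized_ars_max(dhs_volume, expression, n_random=200, seed=0):
    """Null ARS_max values from shuffling expression across cell types.

    ASSUMED randomization scheme, used here for illustration only.
    """
    rng = np.random.default_rng(seed)
    expression = np.asarray(expression, dtype=float)
    return np.array([
        angle_ratio_statistics(dhs_volume, rng.permutation(expression)).max()
        for _ in range(n_random)
    ])


def empirical_p_value(observed, null_values):
    """Add-one empirical P value: share of null statistics >= the observed one."""
    null_values = np.asarray(null_values, dtype=float)
    return (1 + np.sum(null_values >= observed)) / (1 + null_values.size)


# Toy gene: 20 cell types, with cell type 7 high in both measures.
rng = np.random.default_rng(1)
expr = rng.normal(100.0, 5.0, size=20)
dhs = rng.normal(1000.0, 50.0, size=20)
expr[7], dhs[7] = 400.0, 4000.0
ars = angle_ratio_statistics(dhs, expr)
ars_max_cell = int(np.argmax(ars))
p_value = empirical_p_value(ars.max(), randomized_ars_max(dhs, expr))
```

Dividing by the median distance rather than the mean keeps the ratio component from being deflated by the very outlier it is meant to detect, and measuring the angle to the identity line in both directions lets the same score register a joint increase or a joint decrease.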

or very few cell types. To this end, the cell-type-specific chromatin profiles were quantified by integrating the DHS fragment counts over increasingly larger genomic segments relative to the gene of interest (SI Appendix, Fig. S6) to obtain a cell-type-specific regional DHS volume. We selected a range of segments that were likely to encompass all proximal [transcriptional start site (TSS) ± 2.5 kb] and most distal regulatory elements (TSS ± 50, ± 100, ± 150, ± 200, and ± 100 kb minus proximal 2.5 kb, and ± 200 kb minus proximal 2.5 kb). In addition, to account for copy number variation and chromosome-arm-related effects, the obtained DHS volumes were scaled on either side of the centromere to arrive at equilibrium across samples (SI Appendix, Fig. S7). Alternative representations of DHS signal (8, 12) could be used at this step, although we did not identify any advantages in doing so. Gene-expression values were summarized as the mean intensity across all probe sets linked to a given RefSeq gene.

Detecting Cell-Type-Specific Chromatin Accessibility and Gene-Expression Concordance. Due to the “on–off” nature of DHS and subsequent transcription, there will not necessarily be a linear relationship between DHS and gene-expression measures. Using correlation or correlation-like statistics to associate the

two measurements across all cell types proved to be unreliable and uninformative (further details in SI Appendix and Figs. S1–S5). One of the key types of relationships we sought to detect is of the type shown in Fig. 1, where one or very few cell types are outliers from the others. The standard Pearson correlation statistic is not well suited for this scenario. First, it requires the data to be jointly normal to obtain parametric P values, but the normal assumption does not hold for these data (SI Appendix, Fig. S2). Second, this correlation statistic is unstable when there are outliers, even when using permutation-based P values, demonstrated directly on these data (SI Appendix, Figs. S1 and S3). The rank-based Spearman correlation statistic is a potential alternative, but it shows very poor power relative to the method proposed here as shown in Fig. 3 (see also SI Appendix, Figs. S4 and S5). For example, at a false discovery rate (FDR) ≤ 0.05, the proposed method identifies 2,538 genes with a cell-type-specific DHS and gene-expression relationship, whereas the Spearman statistic identifies only 286 (Fig. 3 and SI Appendix, Figs. S4 and S5).

The statistic proposed here is designed to be appropriate for scenarios when both measurements are restricted to a narrow relative range, with one or very few cell types appearing as distinct outliers. To evaluate the relationship between the DHS volume of a genomic segment and gene expression, we took into account the compactness of the measurements versus any distinct outliers in both dimensions and whether the outliers were concordant in both measurements (i.e., a simultaneous increase or decrease) to form an overall composite measure called an angle ratio statistic (ARS; detailed in Figs. 1 and 2, Materials and Methods, and SI Appendix). To summarize, we first scale and median center the DHS volume and expression data, respectively, for a given gene. We then calculate the relative distance of each cell type to the overall center of the data, which serves as a way to measure the degree to which each cell type is an outlier. To measure concordance of DHS volume and gene expression, we calculate the angular distance between each point and the 45° line of identity, penalizing points farther away from the line of identity according to a data-derived exponential function. These two quantities are then multiplied to form an ARS_i value for each cell type (i = 1, 2, ..., 20), and the maximal value ARS_max is the overall statistic that quantifies cell-type-specific DHS volume and gene-expression concordance for a given gene.

To identify statistically significant genes from ARS_max, we constructed a null distribution based on randomization of the observed experimental data (Materials and Methods, SI Appendix and Fig. S8). ARS_max values obtained from the randomized data were used as a basis for determining a P value of the observed ARS_max for each gene. FDR-based statistical significance and the proportion of genes with a true chromatin accessibility and gene-expression relationship were estimated from the P values (ref. 13; Fig. 3B).

We estimate that ∼25% of genes show concordance between chromatin accessibility and gene-expression variation in a cell-type-specific manner. Although our strategy is capable of detecting outliers showing negative concordance (decreased chromatin accessibility and decreased gene expression), none were found to be significant at FDR ≤ 0.05. The number of significant genes increased by inclusion of distal DHS volume (Fig. 3B, column 2), indicating that distal chromatin programming effects are more widespread in a genome-wide sense (14). On the other hand, using the proximal DHS volume we observe a greater empirical effect size compared with the distal DHS volumes (Fig. 3B, columns 3–5). This observation is explained by the aggregation of genes significant for the same cell type along the genome (15). Testing whether one or more significant genes within a ±100-kb region were associated with the same cell type we found that 481 out of 668 significant genes within the specified boundary stem from the same cell type (Fisher's exact test P < 2.2e-16; SI Appendix and Fig. S14). It is however important to note that inclusion of increasingly distal regions also increases the noise in the DHS volume, wherefore the effect size and ultimately the number of true associations starts to decline (Fig. 3A).

[Figure 3, panels A–C. (A) Number of significant genes at different FDRs (FDR < 1%, FDR < 5%, FDR < 10%) versus DHS segment size (2.5 kb, 50 kb, 100 kb, 150 kb, 200 kb, 500 kb, 1 Mb, 2 Mb) for ARS and correlation. (B) Statistical significance by DHS segment size for ARS and Spearman correlation. (C) Heat maps of relative ARS_i values and correlation cross-products, genes by cell lines, each with a color key and histogram.]

Fig. 3. Statistical significance for ARS and correlation across genomic segments. (A) Depicts the number of significant genes found at increasingly larger genomic segments for ARS and Spearman correlation, respectively (solid line is ARS and dashed line is Spearman correlation). (B) Statistical significance according to DHS volume segment size. Column 2 shows the percentage of genes estimated to have concordant DHS volume and gene-expression variation as captured by ARS_max (1 − π̂_0, as estimated in ref. 13). Columns 3–5 show the number of statistically significant genes at various FDR cutoffs. Although the 2.5-kb window shows more significant genes at the stringent FDR cutoffs, indicating a larger effect size, the overall percentage of genes showing a relationship is notably lower than the more distal DHS volumes. Compared with Spearman correlation, ARS is more powerful at detecting these associations (see SI Appendix for further details). (C) The relative ARS_i values across all cell types for significant genes in the ±100 kb region versus the analogous components for Spearman correlation (the cross-product terms that sum to form the overall correlation). The ARS_i values distinguish cell lines that have a strong DHS and expression concordance substantially more clearly than the Spearman correlation, showing that the traditional correlation is more likely to generate spurious results from small changes to the data. Enrichment of biological functions for the significant genes found by either method corroborates this finding (SI Appendix, Fig. S13).

Experimental Replication. To assess reproducibility, we tested the concordance of significant results among replicated data for 10 cell types. Based on two independent measurements of DHS and

gene expression, respectively, we calculated the fraction of predictions preserved in all four-way comparisons (SI Appendix). We found that between 86 and 91% of significant genes (FDR ≤ 0.05) were identical (SI Appendix, Fig. S15).

Gene Ontology and Pathway Analysis. To determine the biological coherence of the set of genes found to be significant for each cell type, we performed a gene ontology (GO) enrichment analysis (16). The method computes enrichment within the process and function components of GO categories and assigns a numerical significance to the findings. In nearly all cases the results were in agreement with the actual biology; see http://encode.princeton.edu/ for results on all DHS segment sizes. For example, human T cells showed a strong enrichment of T-cell receptor related genes, whereas hepatic cells showed enrichment of lipid metabolism-related genes. KEGG pathways (17, 18) were likewise enriched in a cell-type-specific manner. For example, HepG2 (a hepatocellular carcinoma cell line) showed significant enrichment for genes within the bile-acid synthesis and drug metabolism, and HL60 (a human promyelocytic leukemia cell line) showed significant enrichment within the hematopoietic cell lineage.

Furthermore, all genes detected within each cell type at FDR < 0.05 (±100 kb DHS volume) were analyzed through the use of Ingenuity Pathways Analysis (Ingenuity Systems, www.ingenuity.com). For all but 3 cases out of 20 (two cell types likely had too few significant genes detected to get reliable annotations), the category “physiological system development and function” was in clear correspondence with that expected given the cell type (SI Appendix, Fig. S13). For instance, TH1 was enriched for cell-mediated immune response, K562 for hematological system development and function, and H7ESC for embryonic development. For each gene, there tended to be low relative ARS_i across the remaining cell types, indicating that we detected truly cell-type-specific genes as clear outliers on a genome-wide scale. However, some cases showed large relative ARS_i in a few tissues, which prompted us to investigate these instances further.

Among genes with a statistically significant ARS_max statistic, the remaining ARS_i were inspected for possible substructures. We calculated relative ARS values within each gene by dividing all ARS_i by ARS_max. In addition to many instances of singular outliers, we detected a gradient behavior among significant genes, where a few cell types were evident as outliers (SI Appendix, Fig. S16).

Local ARS Profiles. The DHS data themselves provide a rich source of information about regulatory elements in the genome. However, when used in conjunction with gene-expression data across differing cell types, there is an opportunity to discover which locations of chromatin accessibility drive gene expression in a cell-type-specific manner. This goal prompted us to develop a technique to model the relationship for fine-scale segments of DHS volume across the larger segments. As the above strategy focused on examination of chromatin gene-expression interactions over genomic segments, investigation of fine-scale patterns allowed us to: (i) validate that distal regulatory regions were indeed present as peaks in chromatin accessibility concordant with gene expression in a cell-type-specific manner, (ii) perform sequence analyses of these chromatin accessibility peaks, (iii) compare localized associations across cell types or within a single gene, and (iv) provide a framework for quantifying regions of interest on a continuous scale for investigation of regulatory elements.

We therefore extended our approach to allow one to identify and map DHS sites to genes on which they show strong evidence for playing a regulatory role in a cell-type-specific manner. This was carried out by providing a fine-scale version of the ARS quantification, called a local ARS profile, for genes with a statistically significant ARS_max statistic over a larger segment. The peaks of the local ARS profiles pinpoint which DHS are most influential in explaining the cell-type-specific gene-expression variation, thereby indicating that they have the most regulatory potential. We retained the gene-expression values for a given significant gene, and now considered the DHS volume within nonoverlapping consecutive regions at a high resolution of 60-base-pair windows. The ARS was calculated for each 60-bp window, which can then be plotted over the entire region used in identifying the gene as statistically significant. (The original sequence alignments for the DHS data were mapped into 20-bp windows spanning the entire genome, so we chose windows of size 60 bp to smooth the local DHS measure across neighboring sites.) For example, for a gene significant with respect to a ±200-kb DHS volume, we calculated ∼6,700 local ARS values for each cell type (400 kb / 60 bp ≈ 6,700 windows). These can then be plotted in such a way that the signal emanating from that location is visible, loosely analogous to a logarithm of odds (LOD) score profile in linkage analysis. Additional steps were taken, involving scaling across the 60-bp windows to preserve a valid interpretation of their relative magnitudes (SI Appendix).

We first selected the subset of local ARS profile “peaks” by thresholding the local ARS profiles in a principled manner (Materials and Methods), and we analyzed both positional biases and sequence compositions as they relate to function. We then analyzed the entire trajectories of local ARS profiles at specific loci, showing that they identify both known and putative regulatory DHS for given genes.

Positional Bias of Putative Regulatory DHS. Because the overall statistical significance increases when calculating DHS volume over more distal regions up to 200 kb (Fig. 3), we investigated the positional bias of local ARS peaks in a cell-type-specific manner. Fig. 4A shows smoothed densities of positional local ARS peak counts by cell type, which exhibit high cell-type-specific differences, specifically the density around the TSS. Random densities were generated by randomly assigning positional counts to tissues in equal proportions to the observed counts, where it can be seen that the cell-type differences are no longer present (SI Appendix, Fig. S17). This points to the existence of cell-type-specific differences in the base-pair distance of regulatory DHS to TSS.

Sequence Analysis of Peaks in Local ARS Profiles. We next sought to characterize the functional significance of sequences corresponding to local ARS peaks. Because a general indicator of functionality is conservation, we extracted the conservation track values (phastCons44wayPrimate, hg18) (19) corresponding to the local ARS peaks and to the negative control set (Materials and Methods). Values range from 0 to 1, with 1 indicating the most conserved. The regions with local ARS peaks were significantly more conserved than regions from the negative control set (Kolmogorov–Smirnov P < 2.2e-16; SI Appendix, Fig. S18), indicating substantial conservation of sequences corresponding to local ARS peaks.

DNase-I hypersensitive sites are well established markers of regulatory and other DNA-binding proteins. We therefore sought to establish whether known cell-type-specific transcription-factor binding sites (TFBSs) are overrepresented in the local ARS peaks relative to the negative control set (Materials and Methods). Because regions distal to the TSS are rarely studied in this context, we eliminated all local ARS peaks and negative controls that fell within ±10 kb of the TSS. This step was taken to demonstrate that the proposed approach is capable of detecting distal TFBS, up to 200 kb from the TSS.

We used the JASPAR database (20) to identify TFBS that are differentially represented in the local ARS peaks relative to the negative control set (Materials and Methods). The over- and underrepresented TFBS show distinct cell-type-specific patterns (21) and provide a rich insight into cell-type-specific gene regulation (Fig. 4B), several of which are listed here:

[Figure 4, panels A and B. (A) Density of local ARS peaks versus distance from the TSS, one curve per cell type (CACO2, GM06990, H7ESC, HAEpiC, Hela, HepG2, HL60, HRCE, K562, SKNSH, TH1). (B) Heat map of log2 fold-change for transcription factor models (rows: GABPA, Egr1, ELK4, NFYA, ELK1, Klf4, SP1, TFAP2A, E2F1, CTCF, HNF1A, Pax6, HNF1B, HNF4A, PPARG::RXRA, Lhx3, MIZF, Pou5f1, Sox2, znf143, Zfx, RXR::RAR_DR5, NR3C1, TEAD1, Tal1::Gata1, REST, SOX9, FOXI1, Foxd3, FOXL1, NKX3−1, Hltf, ARID3A, Pdx1, Prrx2) by cell type (columns), with color key and count histogram.]

Fig. 4. Analysis of local ARS profiles. (A) Distribution of local ARS peaks relative to the TSS according to cell type. The positional bias of cell-type-specific local ARS peaks is measured by the density of local ARS peaks within cell lines with respect to position relative to the TSS. Clear differences in the amount of distal regulation are seen across the cell types, and the density around the TSS differs markedly among cell types. For example, HL60 shows a more proximal signal relative to that of HAEpiC. (B) Transcription-factor binding site analysis among local ARS peaks occurring 10 to 200 kb from the TSS. Sequences corresponding to local ARS peaks within significant cell-type-specific genes were searched with known transcription-factor binding site models, and the relative over- and underrepresentation was assessed based on a negative control set. Instances of absolute log2 fold-change ≥2 are displayed within the relevant cell types. Overrepresentation is indicative of a preferential transcription factor binding site, and is therefore a likely regulatory candidate for the observed gene expression. Underrepresented sites indicate factors that should be avoided to maintain proper cell-type-specific expression profiles. For instance, Sox2 and Pou5f1 (Oct4) were observed solely overrepresented in the embryonic cells, H7ESC.
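As a rough illustration of the effect size displayed in panel B, the following Python sketch computes a log2 fold change of motif hits per base pair in local ARS peaks versus a negative control set and applies the absolute log2 fold-change ≥2 display cutoff. The function names and the hit counts are hypothetical. Only the 100-bp region size and the region counts (32,063 peak and 148,423 control regions remaining after segments within ±10 kb of the TSS are removed) come from the text, and the exact binomial test used for significance is not reproduced here.

```python
import math

REGION_BP = 100  # each local ARS peak or control region spans 100 bp (from the text)


def log2_fold_change(peak_hits, peak_regions, control_hits, control_regions):
    """log2 ratio of motif hits per base pair: local ARS peaks vs. negative control.

    Both hit counts must be positive, otherwise the ratio is undefined.
    """
    peak_rate = peak_hits / (peak_regions * REGION_BP)
    control_rate = control_hits / (control_regions * REGION_BP)
    return math.log2(peak_rate / control_rate)


def is_displayed(log2_fc, cutoff=2.0):
    """True when the absolute log2 fold change reaches the display cutoff."""
    return abs(log2_fc) >= cutoff


# Hypothetical hit counts for one motif in one cell type (illustration only).
fc = log2_fold_change(peak_hits=900, peak_regions=32063,
                      control_hits=800, control_regions=148423)
print(f"log2 fold change = {fc:.2f}, displayed = {is_displayed(fc)}")
```

Normalizing by base pairs rather than by region count matters because the peak and control sets differ in size by roughly a factor of five.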

• Among the hepatocyte nuclear factors we found HNF1B (transcription factor 2, TCF-2) and HNF4A to have significant chromatin–gene expression concordance in HRCE (a human renal cortical epithelial cell line) and HepG2, respectively (SI Appendix, Figs. S19 and S20). Furthermore, we found the local ARS profiles in the respective tissues to display a marked overrepresentation of the factor in question, HNF1B in HRCE and HNF4A in HepG2. Mutations in HNF1B have been associated with a broad range of renal diseases (22), and HNF4A is essential for hepatocyte differentiation and morphology (23).
• H7ESC (a human embryonic stem cell line) was found to show overrepresentation of SOX2 and POU5F1 (Oct-4), both essential for self-renewal in undifferentiated stem cells.
• NFYA (a CCAAT-binding protein) was found overrepresented in almost all tissues. This factor is essential for enhancer function by recruiting distal transcription factors to the proximal promoter region (24). The ubiquitous CCAAT-binding-factor family is linked to cellular differentiation in a variety of tissues (25).
• Retinoid X receptors (RXR)–retinoic acid receptors (RAR) were found in human amniotic epithelial cells (HAEpiC). The coexpression of RAR and RXR (26) is essential for proper placental development, and RXR null mouse mutants are lethal after 10 d due to placental defects (27).
• Forkhead binding sites were found to be primarily underrepresented; specifically, FOXD3 was underrepresented in, among others, the leukemic cell types. Silencing of FOXD3 by aberrant chromatin modification has been implicated in leukemogenesis (26). Overexpression of FOXD3 prevents neural crest formation (28). It is interesting to note that binding sites for the factor were underrepresented in SKNSH, a neuroblastoma derived from neural crest cells.

• NF-κB was found overrepresented in TH1, where it promotes the expression of, among others, interleukin 12 (IL-12), essential for TH1 development (29).

The differentially represented TFBSs were distributed largely distal. For all cell types, from 68 to 79% were located more than ±50 kb away from the TSS. We repeated the analysis with only the proximal regions (±10 kb from the TSS), and we found that several important known cell-type-specific motifs were no longer detected (SI Appendix, Fig. S21).

To validate our in silico TFBS predictions with an independent data source, we downloaded the UCSC genome tracks for POU5F1 (Oct-4) in HESC and NFYA in K562 (http://genome.ucsc.edu/ENCODE/). (Note that HESC is not completely identical to H7ESC, but both are embryonic stem cells.) We next calculated the overlap between the POU5F1 binding sites determined from the ChIP-seq data and the positive local ARS peaks that we identified in H7ESC. Of the 1,614 peaks used for TFBS analysis in H7ESC, 200 overlapped with the POU5F1 ChIP-seq calls; of the remaining 37,205 positive local ARS peaks in other cell types, where POU5F1 was not predicted to be enriched, only 119 overlapped. Similarly for K562, of the 4,861 positive local ARS peaks, 439 overlapped ChIP-seq calls for NFYA. For cell types where NFYA was not detected [HRCE, SKNSH (a neuroblastoma cell line), and HeLa], of 8,176 positive local ARS peaks, only 72 had overlapping ChIP-seq calls. Hence, we obtain an enrichment of 39-fold for POU5F1 ($P < 10^{-16}$) and 10-fold for NFYA ($P < 10^{-16}$), respectively. Overall, this supports the method's ability to map relevant regulatory regions on a fine scale.

Single-Nucleotide Polymorphisms. The local ARS peaks were also investigated with respect to SNPs found to be significant in genome-wide association studies (GWAS). The database compiled by Johnson and O'Donnell (30), containing a total of 52,622 unique SNPs associated with disease phenotypes (56,411 total associations), was mapped against the genomic coordinates of the local ARS peaks. We found that only 42 SNPs fell within local ARS peaks, and the expected value under completely random placement of SNPs is 68. A statistical test indicates that there is significant underrepresentation of GWAS SNPs (two-sided exact test $P < 0.001$). This is likely linked to the conserved state of the sequences corresponding to local ARS peaks (shown above) and to the fact that these regions harbor cell-type-specific regulatory elements, as indicated by the above TFBS analysis.

This does not preclude GWAS SNPs from appearing in local ARS peaks. Indeed, such an occurrence may be particularly noteworthy. An interesting example is that of rs1890131 (31), which lies in a local ARS peak driving the ARS-based statistical significance of three different genes (SI Appendix, Fig. S22): CREG1 in HL60, RCSD1 in TH1, and CD247 in TH1 (SI Appendix, Fig. S23). The SNP rs1890131 has been associated with coronary spasms via CREG1, where cellular repressor of E1A-stimulated genes (CREG) attenuates cardiac hypertrophy (32) by blocking mitogen-activated protein kinases/extracellular signal-regulated kinase (MEK-ERK)1/2-dependent signaling. RCSD1 (also termed capZ-interacting protein) is suggested to regulate the ability of CapZ to remodel actin filament assembly in vivo (33), and thereby affects cardiac isometric tension generation (34). Last, CD247 (TCRζ) is a master regulator of adaptive immune responses, where loss of TCRζ-chain expression may contribute to the inflammatory process that triggers coronary instability by accumulating in coronary plaques (35). The SNP rs1890131 is centered in a hub of transcription-factor binding sites as detected by the ENCODE transcription-factor track (SI Appendix, Fig. S24). The local ARS peak containing this SNP contributed significantly to the association detected for all three genes, which are distal and in different orientation from the site of the SNP (SI Appendix, Fig. S22).

Mapping Putative Regulatory DHS to Genes. We also investigated the utility of considering the entire trajectory of local ARS profiles at a locus to characterize the regulatory architecture of cell-type-specific expression. We investigated in detail two well-characterized examples of regulatory interactions at the β-globin (HBB) locus control region and at the stem cell leukemia (SCL) gene, also known as TAL1, with several more appearing in SI Appendix, Figs. S25–S37. It can be seen from these analyses that the local ARS profiles provide a means to map DHS sites to genes in a cell-type-specific manner.

[Fig. 5 graphic. (A) chr11:5,210,000–5,260,000 (hg18), 20-kb scale bar; RefSeq genes HBB, HBD, HBBP1, HBG1, HBG2, HBE1; local ARS tracks for HBB and HBE1 in K562 and HL60; raw DHS signal for K562, HL60, BJ, CACO2, and Th1. (B) chr1:47,410,000–47,470,000 (hg18), 20-kb scale bar; RefSeq genes PDZK1IP1 and TAL1; local ARS tracks for TAL1 and PDZK1IP1 in CACO2, HL60, HRCE, and K562; raw DHS signal for CACO2, HL60, HRCE, and K562.]

Fig. 5. Mapping putative regulatory DHS with local ARS profiles at two loci. (A) The β-globin locus control region. The DHS data for five cell lines (out of 20) are shown, as well as the local ARS profiles for HBB and HBE1 in the K562 and HL60 cell lines. The transparent yellow boxes indicate regulatory regions, specifically hypersensitive regions 1–5 (HS1–5), together with a less characterized site upstream of HBD. It can be seen that HBB and HBE1 show different local ARS profiles, indicative of differences in use of regulatory elements. The local ARS profile shows no peak in HL60 despite the existence of a hypersensitive site when considering the DNaseI profile alone. The full data and local ARS profiles for all 20 cell lines and both genes are displayed in SI Appendix, Figs. S25–S29. (B) TAL1 locus. We identified TAL1 as statistically significant with its maximal ARS in the K562 cell line across all tested genomic segments. Local ARS profiles show a dominant effect from the +40 enhancer region (green box), spanning PDZK1IP1. DHS signals across multiple cell types were correctly not detected to be associated with the expression of TAL1. Furthermore, note that even though the DHS data for TAL1 and PDZK1IP1 are largely overlapping, they nevertheless have distinct local ARS profiles due to their different patterns of gene expression. This demonstrates that ARS is capable of separating interwoven signals across cell types for neighboring genes, and that there is information to be gained by combining DHS and gene-expression profiling. The full data for all 20 cell lines and local ARS profiles are displayed in SI Appendix, Figs. S31–S33.

The HBB (β-globin) locus control region (LCR) comprises an array of functional elements that in vivo gives rise to five major DNase I hypersensitive sites (HS1–HS5; refs. 36–38; Fig. 5) upstream of HBE1 (ε-globin) on the short arm of chromosome 11. All five sites were present in cell line K562 according to our DHS data (see SI Appendix, Figs. S25–S29 for complete data across all 20 cell types, and SI Appendix, Fig. S30 for local ARS profiles across all genes at this locus control region). Although the DHS volume at these sites contributed to both HBE1 and HBB yielding statistically significant ARS values, the relative importance of HS1–5 differs significantly between these two genes, as clearly detected by the local ARS profiles (Fig. 5A).

In the case of HBE1, we observed local ARS peaks for HS1 at −6.1 kb and to a lesser extent HS3 and HS4 (−14.7 and −18 kb, respectively). For HBB we observed similar local ARS profiles for HS1, HS3, and HS4, and smaller local ARS values for HS2 (−10.9 kb). It has previously been shown that HS1 is a stable chromatin structure (37) throughout development and essential for HBE1 expression (39) due to a GATA-1 binding site, and HS2, 3, and 4 show synergistic enhancement of expression of HBB by formation of the LCR holocomplex (40–42). Finally, the element upstream of HBD has also been reported to specifically enhance transcription of HBB (43). Although HS5 is present in the DHS data for K562, similar open chromatin structures were detected in other tissues. HS5 (−21 kb) is not in concordance with tissue-specific gene expression of either HBE1 or HBB, an observation in line with this site's function as an insulator and CTCF-binding site (44).

T-cell acute lymphocytic leukemia protein 1 (TAL1) encodes a basic helix–loop–helix protein, which is essential for the formation of all hematopoietic lineages (SI Appendix, Figs. S31–S33 for all data across the 20 cell types). Previous studies using chromatin structure, comparative sequence analysis, transfection assays (45, 46), and transgenic mice (47–51) have identified a total of five enhancers modulating the expression of TAL1. We detect TAL1 as significant, with K562 as the maximal cell line, across all tested genomic segments (from ±2.5 to ±200 kb), with the most significant $\mathrm{ARS}_{\max}$ occurring for ±50 kb. Further investigation by the local ARS profile (Fig. 5B) showed that although proximal regulatory sites were correctly identified, the most dominant signal is by far confined to the +40 enhancer region and is an order of magnitude greater than other signals. Although the TAL1 +40 region is downstream of PDZK1IP1, it was not linked to the expression of this gene, which was detected as significant in HRCE. The +40 enhancer region has been shown to direct expression in transgenic mice to primitive, but not definitive, erythroblasts, such as the phenotype displayed by K562. This example demonstrates that our methodology is capable of identifying regions of regulatory potential, which otherwise requires laborious effort to annotate.

Local ARS profiles showed both differences and similarities across genes as well as cell types. A few examples included:
• CCR2 and CCR5 were significant for two different cell types, HL60 and TH1, respectively (SI Appendix, Fig. S34).
• Part of the HOX cluster crucial for kidney development in mammals (HOXD8, HOXD4, and HOXD3) showed identical local ARS profiles (SI Appendix, Fig. S35), and all were significant genes in HRCE (52).
• Another example of shared profiles, but across several cell types instead of across several genes, was seen with LOXL2, a gene essential for biogenesis of connective tissue, which is detected as an outlier in human skeletal muscle cells (SKMC) and has high relative ARS values in HAEpiC and BJ (skin fibroblast) (SI Appendix, Fig. S36). Further fine-scale investigation showed a solid overlap in the local ARS profiles (SI Appendix, Fig. S37).

These observations point to a potentially widespread sharing of regulatory mechanisms both across genes and cell types.

Web Resource
To provide an interface for the community to use the results from this work, the local ARS profiles across any given gene in any of the 20 cell types can be calculated via our web service at http://encode.princeton.edu/, where all results encompassing the larger DHS regions are also searchable.

Discussion
As the epigenome in multicellular organisms is a dynamic entity whose variation leads to reprogramming of gene expression (53), it is a likely candidate in the etiology of disease complementary to that of mutations in DNA (54, 55). It is therefore of considerable interest to identify and characterize the regulatory regions contributing to gene-expression variation with respect to a given disease.

We have presented a framework for quantifying relationships between chromatin structure and gene expression across multiple conditions (here, cell types), facilitating avenues for understanding cellular responses by localizing and characterizing regions of regulatory potential. The local ARS profiles we introduced allow specific hypersensitive regions to be associated with condition-specific gene expression, thereby conferring contextual regulatory information not obtainable using DHS data alone. This effectively pinpoints a short list of primary candidates for further functional studies. We found the peaks from the local ARS profiles in statistically significant segments to be both highly conserved and enriched for known transcription-factor binding sites as far as 200 kb from transcription start sites. Although beyond the scope of the current work, we believe our approach could be used in conjunction with quantitative trait analyses to increase the power for detecting true cis- and trans-acting SNPs that act by interfering with transcription factor binding sites, which in turn leads to altered DHS signals, in a similar manner as Degner et al. (56).

As measurements from high-throughput sequencing platforms become commonplace in molecular biology, there will be an increasing demand for the development of new statistical approaches for these data. A major challenge is that sequencing measurements are rarely in units directly relatable to one another; e.g., DHS measures chromatin accessibility, ChIP-seq measures binding affinity, RNA-seq measures RNA molecule abundance, etc. Our framework provides the initial development of a statistic that captures relationships among these measurements and enables statistical testing of associations among them. Moreover, by exploiting variation across multiple conditions, the sensitivity of our approach should only increase with additional data and sources of variation. Hence, the presented framework can likely be applied to test for associations between appropriate continuous quantitative genomic measurements and gene expression, thereby facilitating a comparable basis for metaanalyses on the interplay of epigenetic features.

Materials and Methods
DHS and Gene-Expression Data. The data used in this study were generated through the ENCODE consortium and are publicly available. Established cell lines and primary cells used in this study were procured from commercial or other sources as listed in SI Appendix, Table S1. The cells were cultured as per the vendor recommendations, and individual cell growth protocols are available in the University of California, Santa Cruz (UCSC) human genome browser. The DHS data are available at the UCSC genome browser by downloading the track IDs listed in SI Appendix, Table S2 and the web address shown therein. Normalized probe-level expression data were obtained from the Gene Expression Omnibus; the accession numbers for all arrays are shown in SI Appendix, Table S4. Probes were mapped to genes according to HG18 using bowtie (57), allowing for two mismatches and up to 10 maps to the genome, including the best match. Only probe sets for which all probes had a unique best match and fully corresponded to exon boundaries found in RefSeq annotations (HG18) were retained for further analysis. If a RefSeq gene had multiple splice variants, these were aggregated to a metagene structure. In the rare event that a gene mapped to currently ambiguous regions (e.g., chr6_random), such regions were not included. To arrive at a gene-specific expression value, the mean expression across all probe sets within the exon boundaries of the gene model was calculated. This yielded expression measures for 19,215 genes on 20 cell lines.

Statistical Methods. The ARS algorithm and statistical analyses were written in the R programming language (58). The main ARS algorithm, results, GO analyses, and preprocessed data are available at http://encode.princeton.edu/. Complete details of the ARS algorithm, including the null randomization strategy and estimation of the angular penalty, are provided in SI Appendix.

A schematic of the method is shown in Fig. 1. We represented the measurements of a single gene by two paired vectors $x = (x_1, \ldots, x_m)$ for gene expression and $y = (y_1, \ldots, y_m)$ for DHS volume, where $m$ is the number of cell types under consideration (here, $m = 20$). To place the two variables on a common scale, each vector was scaled by its maximum observation, $x^s = x / \max\{x_1, \ldots, x_m\}$ and $y^s = y / \max\{y_1, \ldots, y_m\}$, so that all values are now in $[0,1]$. Each vector was then centered by its median, $\mathrm{med}(x^s)$ and $\mathrm{med}(y^s)$, to form $x^* = x^s - \mathrm{med}(x^s)$ and $y^* = y^s - \mathrm{med}(y^s)$. Hence the data for a given gene and segment are now centered around the 2D medoid where the center of mass of the data lies at the origin. If there is little variation across the multiple cell types, all points would cluster around the medoid, and singular cell types displaying greater variation would be present as distinct outliers (SI Appendix, Figs. S8 and S9). To gauge potential outliers, the Euclidean distances $d_i = \sqrt{x_i^{*2} + y_i^{*2}}$ were calculated for every cell type $i = 1, \ldots, m$ to produce the distance vector $d = (d_1, \ldots, d_m)$. We formed a ratio statistic according to $r_i = d_i / \mathrm{med}(d)$, thereby quantifying the relative distance of each point to the medoid.

Although the ratios $r_i$ describe the dispersion of the data, they do not account for any concordance between the measurements. A perfectly concordant relationship between the two measurements would result in points lying along the 45° diagonal identity line. We therefore calculated the angle $\theta_i$ for each data point $(x_i^*, y_i^*)$ relative to the unit vector $(1,0)$ for $i = 1, \ldots, m$, where $0 \le \theta_i \le 360$. The angular penalty involves first calculating the smaller of the two angular distances between $\theta_i$ and the identity line, denoted as $\Delta_i$. For example, $\Delta_i = |45 - \theta_i|$ for $0 \le \theta_i < 135$. The angular penalty is calculated as $a_i = \exp(c \times \Delta_i)$, where $c \le 0$ is determined empirically to satisfy a correct null distribution (SI Appendix). Therefore, the value $a_i$ measures the penalized angular distance of $(x_i^*, y_i^*)$ from the identity line in a symmetric fashion (SI Appendix, Fig. S10). The statistic applied to each $(x_i^*, y_i^*)$ pair is then $\mathrm{ARS}_i = a_i \times r_i$, with the gene's overall statistic being the maximum, $\mathrm{ARS}_{\max} = \max(\mathrm{ARS}_1, \mathrm{ARS}_2, \ldots, \mathrm{ARS}_m)$. In addition to calculating these quantities for each gene, we also recorded the ordering of the cell types as determined by their relative $\mathrm{ARS}_i$ values.

Inclusion of the angular penalty had a twofold purpose. First, it correctly eliminated points that were outliers in only one dimension, gene expression or DHS alone, and therefore not of interest here because there is no direct relationship between the two measurements. Second, penalizing such points acted as a tuning parameter adjusting for the degree of off-diagonal noise in the data, and thereby ensured a correct null distribution and P values (SI Appendix, Fig. S11). The specific value of $c$ was determined such that observed null P values over $(p \ge 0.5)$ had a Uniform(0,1) distribution according to a Kolmogorov–Smirnov test (SI Appendix). Reassuringly, this led to nearly identical values for $c$ across all genomic segment sizes of DHS volume considered (SI Appendix, Fig. S12).

The scaled data $x^s$ and $y^s$ for all genes were aggregated into a single distribution in the unit square $[0,1] \times [0,1]$. From this, randomized data sets were created by sampling 20 points in a way that preserves the fact that either one point must lie on $(1,1)$ or two points lie on $(x,1)$ and $(1,y)$, respectively. The 20 sampled points are then median centered and the $\mathrm{ARS}_{\max}$ statistic is calculated. We performed this 100 times to generate 100 sets of null $\mathrm{ARS}_{\max}$ statistics for every gene (for a total of 100 × 19,215 null statistics). A P value was then formed for each gene by calculating the frequency by which null statistics exceed the observed statistic. The P values were then used to calculate FDR q values for the genes, as previously described (13). See SI Appendix for full details on this randomization method.

Selecting Local ARS Profile Peaks for Further Analysis. We first identified genes called significant at FDR < 0.10 for the ARS analysis performed on the segment size of ±200 kb about the TSS. We recorded the maximal cell type for each of these genes (i.e., the cell type yielding the $\mathrm{ARS}_{\max}$ value), producing a list of significant gene/cell-type pairs. We limited our selection of gene/cell-type pairs to those cell types that were maximal at this threshold for at least 100 genes. For each of these selected gene/cell-type pairs, we scaled its local ARS profile by the maximal value in the ±200 kb segment about the TSS. All DNA sequences ±50 bp with scaled local ARS profile value >0.5 were then selected as local ARS peaks. Likewise, all DNA sequences ±50 bp with scaled local ARS profile value <0.2 were selected as the negative control set. The local ARS peak set consisted of a total of 38,819 100-bp regions, and the negative control set consisted of 156,060 100-bp regions.

TFBS Analysis. We took the above local ARS peaks and negative control set, and we eliminated all segments within ±10 kb of the TSS, reducing the number of local ARS peak segments from 38,819 to 32,063 and negative control segments from 156,060 to 148,423. These were searched with all nonredundant vertebrate position count matrices in the JASPAR database (20). The position count matrices were converted to position weight matrices using a uniform background, and a matrix-specific threshold of 0.8 of the maximal log odds score was used. Significant over- or underrepresentation was determined by exact binomial tests where the probability was based on the frequency of hits per base pair in the negative control sequences. Effect size was calculated as log2 fold change between number of hits per base pair in the local ARS peaks versus the negative control set.

ACKNOWLEDGMENTS. We thank the Stamatoyannopoulos lab for useful discussions and suggestions, Shane Neph for help with collating gene ontology analyses, Richard Sandstrom for information on experimental details, Michael Hudock for assistance with computations, and Lance Parsons for building the website. The publicly available ENCODE data used in this work were generated by the Stamatoyannopoulos lab. This research was supported in part by National Institutes of Health Grants U54 HG004592 and R01 HG002913.
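The per-gene ARS statistic described under Statistical Methods (maximum scaling, median centering, distance ratio, angular penalty, and the maximum over cell types) can be sketched in Python as follows. This is a minimal illustration and not the authors' R implementation: the function names are hypothetical, the default penalty constant `c = -0.05` is a placeholder (the text only states that c ≤ 0 is determined empirically), and the guard against a zero median distance is an added assumption.

```python
import math
from statistics import median


def angular_distance(theta_deg):
    """Smaller angular distance (degrees) from theta to the identity line,
    which points along both 45 and 225 degrees."""
    def circular(a, b):
        diff = abs(a - b) % 360.0
        return min(diff, 360.0 - diff)
    return min(circular(theta_deg, 45.0), circular(theta_deg, 225.0))


def ars_scores(x, y, c=-0.05):
    """Per-cell-type scores ARS_i = a_i * r_i for one gene.

    x: gene expression, y: DHS volume (equal-length sequences, positive maxima).
    c: angular penalty constant (c <= 0); -0.05 is a placeholder value.
    """
    xs = [v / max(x) for v in x]          # scale each vector by its maximum
    ys = [v / max(y) for v in y]
    mx, my = median(xs), median(ys)
    xc = [v - mx for v in xs]             # center each vector by its median
    yc = [v - my for v in ys]
    d = [math.hypot(a, b) for a, b in zip(xc, yc)]
    med_d = median(d)
    if med_d == 0:                        # added guard, not described in the text
        raise ValueError("median distance is zero; ratio statistic undefined")
    scores = []
    for a, b, dist in zip(xc, yc, d):
        theta = math.degrees(math.atan2(b, a)) % 360.0
        penalty = math.exp(c * angular_distance(theta))
        scores.append(penalty * dist / med_d)
    return scores


def ars_max(x, y, c=-0.05):
    """Return (ARS_max, index of the maximal cell type)."""
    scores = ars_scores(x, y, c)
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


# Toy data for five cell types (illustration only; the study uses m = 20).
expression = [1, 2, 3, 4, 10]
dhs_volume = [2, 1, 3, 4, 10]
score, idx = ars_max(expression, dhs_volume)
print(f"ARS_max = {score:.3f} at cell type index {idx}")
```

In the toy data the last cell type is an outlier in both measurements and lies on the identity line, so its penalty is 1 and it is selected as the maximal cell type. A point that is extreme in only one of the two measurements sits 45° from the identity line and is down-weighted by exp(45c).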

1. Xi H, et al. (2007) Identification and characterization of cell type-specific and ubiquitous chromatin regulatory structures in the human genome. PLoS Genet 3(8):e136.
2. Boyle AP, et al. (2008) High-resolution mapping and characterization of open chromatin across the genome. Cell 132(2):311–322.
3. Song F, et al. (2005) Association of tissue-specific differentially methylated regions (TDMs) with differential gene expression. Proc Natl Acad Sci USA 102(9):3336–3341.
4. Satterlee JS, Schübeler D, Ng HH (2010) Tackling the epigenome: Challenges and opportunities for collaboration. Nat Biotechnol 28(10):1039–1044.
5. Bernstein BE, et al.; ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414):57–74.
6. Hawkins RD, Hon GC, Ren B (2010) Next-generation genomics: An integrative approach. Nat Rev Genet 11(7):476–486.
7. Heintzman ND, et al. (2009) Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature 459(7243):108–112.
8. Ernst J, Kellis M (2010) Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat Biotechnol 28(8):817–825.
9. Gerstein MB, et al. (2012) Architecture of the human regulatory network derived from ENCODE data. Nature 489(7414):91–100.
10. Thurman RE, et al. (2012) The accessible chromatin landscape of the human genome. Nature 489(7414):75–82.
11. Sabo PJ, et al. (2004) Discovery of functional noncoding elements by digital analysis of chromatin structure. Proc Natl Acad Sci USA 101(48):16837–16842.
12. Day N, Hemmaplardh A, Thurman RE, Stamatoyannopoulos JA, Noble WS (2007) Unsupervised segmentation of continuous genomic data. Bioinformatics 23(11):1424–1426.
13. Storey JD, Tibshirani R (2003) Statistical significance for genomewide studies. Proc Natl Acad Sci USA 100(16):9440–9445.
14. Sanyal A, Lajoie BR, Jain G, Dekker J (2012) The long-range interaction landscape of gene promoters. Nature 489(7414):109–113.
15. Sproul D, Gilbert N, Bickmore WA (2005) The role of chromatin structure in regulating the expression of clustered genes. Nat Rev Genet 6(10):775–781.
16. Eden E, Navon R, Steinfeld I, Lipson D, Yakhini Z (2009) GOrilla: A tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics 10:48–48.
17. Kanehisa M (2002) The KEGG database. Novartis Found Symp 247:91–101, discussion 101–103, 119–128, 244–252.
18. Dennis G, Jr., et al. (2003) DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 4(5):3.
19. Siepel A, et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15(8):1034–1050.
20. Portales-Casamar E, et al. (2010) JASPAR 2010: The greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res 38(Database issue):D105–D110.
21. Arvey A, Agius P, Noble WS, Leslie C (2012) Sequence and chromatin determinants of cell-type-specific transcription factor binding. Genome Res 22(9):1723–1734.
22. Ulinski T, et al. (2006) Renal phenotypes related to hepatocyte nuclear factor-1beta (TCF2) mutations in a pediatric cohort. J Am Soc Nephrol 17(2):497–503.
23. Parviz F, et al. (2003) Hepatocyte nuclear factor 4alpha controls the development of a hepatic epithelium and liver morphogenesis. Nat Genet 34(3):292–296.

24. Wright KL, et al. (1994) CCAAT box binding protein NF-Y facilitates in vivo recruitment of upstream DNA binding transcription factors. EMBO J 13(17):4042–4053.
25. Lekstrom-Himes J, Xanthopoulos KG (1998) Biological role of the CCAAT/enhancer-binding protein family of transcription factors. J Biol Chem 273(44):28545–28548.
26. Sapin V, Ward SJ, Bronner S, Chambon P, Dollé P (1997) Differential expression of transcripts encoding retinoid binding proteins and retinoic acid receptors during placentation of the mouse. Dev Dyn 208(2):199–210.
27. Sapin V, Dollé P, Hindelang C, Kastner P, Chambon P (1997) Defects of the chorioallantoic placenta in mouse RXRalpha null fetuses. Dev Biol 191(1):29–41.
28. Pohl BS, Knöchel W (2001) Overexpression of the transcriptional repressor FoxD3 prevents neural crest formation in Xenopus embryos. Mech Dev 103(1-2):93–106.
29. Murphy TL, Cleveland MG, Kulesza P, Magram J, Murphy KM (1995) Regulation of interleukin 12 p40 expression through an NF-kappa B half-site. Mol Cell Biol 15(10):5258–5267.
30. Johnson AD, O'Donnell CJ (2009) An open access database of genome-wide association results. BMC Med Genet 10:6.
31. Suzuki S, et al. (2007) A novel genetic marker for coronary spasm in women from a genome-wide single nucleotide polymorphism analysis. Pharmacogenet Genomics 17(11):919–930.
32. Bian Z, et al. (2009) Cellular repressor of E1A-stimulated genes attenuates cardiac hypertrophy and fibrosis. J Cell Mol Med 13(7):1302–1313.
33. Eyers CE, et al. (2005) The phosphorylation of CapZ-interacting protein (CapZIP) by stress-activated protein kinases triggers its dissociation from CapZ. Biochem J 389(Pt 1):127–135.
34. Pyle WG, La Rotta G, de Tombe PP, Sumandea MP, Solaro RJ (2006) Control of cardiac myofilament activation and PKC-betaII signaling through the actin capping protein, CapZ. J Mol Cell Cardiol 41(3):537–543.
35. Ammirati E, et al. (2008) Expansion of T-cell receptor zeta dim effector T cells in acute coronary syndromes. Arterioscler Thromb Vasc Biol 28(12):2305–2311.
36. Tuan D, Solomon W, Li Q, London IM (1985) The "beta-like-globin" gene domain in human erythroid cells. Proc Natl Acad Sci USA 82(19):6384–6388.
37. Forrester WC, Thompson C, Elder JT, Groudine M (1986) A developmentally stable chromatin structure in the human beta-globin gene cluster. Proc Natl Acad Sci USA 83(5):1359–1363.
38. Grosveld F, et al. (1990) The dominant control region of the human beta-globin domain. Ann N Y Acad Sci 612:152–159.
39. Shimotsuma M, Okamura E, Matsuzaki H, Fukamizu A, Tanimoto K (2010) DNase I hypersensitivity and epsilon-globin transcriptional enhancement are separable in locus control region (LCR) HS1 mutant human beta-globin YAC transgenic mice. J Biol Chem 285(19):14495–14503.
40. Fraser P, Hurst J, Collis P, Grosveld F (1990) DNaseI hypersensitive sites 1, 2 and 3 of the human beta-globin dominant control region direct position-independent expression. Nucleic Acids Res 18(12):3503–3508.
41. Molete JM, et al. (2001) Sequences flanking hypersensitive sites of the beta-globin locus control region are required for synergistic enhancement. Mol Cell Biol 21(9):2969–2980.
42. Tolhuis B, Palstra RJ, Splinter E, Grosveld F, de Laat W (2002) Looping and interaction between hypersensitive sites in the active beta-globin locus. Mol Cell 10(6):1453–1465.
43. Acuto S, et al. (1996) An element upstream from the human delta-globin-encoding gene specifically enhances beta-globin reporter gene expression in murine erythroleukemia cells. Gene 168(2):237–241.
44. Tanimoto K, et al. (2003) Human beta-globin locus control region HS5 contains CTCF- and developmental stage-dependent enhancer-blocking activity in erythroid cells. Mol Cell Biol 23(24):8946–8952.
45. Fordham JL, Göttgens B, McLaughlin F, Green AR (1999) Chromatin structure and transcriptional regulation of the stem cell leukaemia (SCL) gene in mast cells. Leukemia 13(5):750–759.
46. Göttgens B, et al. (2002) Establishing the transcriptional programme for blood: The SCL stem cell enhancer is regulated by a multiprotein complex containing Ets and GATA factors. EMBO J 21(12):3039–3050.
47. Chapman MA, et al. (2003) Comparative and functional analyses of LYL1 loci establish marsupial sequences as a model for phylogenetic footprinting. Genomics 81(3):249–259.
48. Göttgens B, et al. (2001) Long-range comparison of human and mouse SCL loci: Localized regions of sensitivity to restriction endonucleases correspond precisely with peaks of conserved noncoding sequences. Genome Res 11(1):87–97.
49. Göttgens B, et al. (2002) Transcriptional regulation of the stem cell leukemia gene (SCL): Comparative analysis of five vertebrate SCL loci. Genome Res 12(5):749–759.
50. Sánchez M, et al. (1999) An SCL 3′ enhancer targets developing endothelium together with embryonic and adult haematopoietic progenitors. Development 126(17):3891–3904.
51. Sinclair AM, et al. (1999) Distinct 5′ SCL enhancers direct transcription to developing brain, spinal cord, and endothelium: Neural expression is mediated by GATA factor binding sites. Dev Biol 209(1):128–142.
52. Di-Poï N, Zákány J, Duboule D (2007) Distinct roles and regulations for HoxD genes in metanephric kidney development. PLoS Genet 3(12):e232.
53. Wong AH, Gottesman II, Petronis A (2005) Phenotypic differences in genetically identical organisms: The epigenetic perspective. Hum Mol Genet 14(Spec No 1):R11–R18.
54. Schneider E, et al. (2010) Spatial, temporal and interindividual epigenetic variation of functionally important DNA methylation patterns. Nucleic Acids Res 38(12):3880–3890.
55. Hatchwell E, Greally JM (2007) The potential role of epigenomic dysregulation in complex human disease. Trends Genet 23(11):588–595.
56. Degner JF, et al. (2012) DNase I sensitivity QTLs are a major determinant of human expression variation. Nature 482(7385):390–394.
57. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25.
58. R Development Core Team (2009) R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna).

www.pnas.org/cgi/doi/10.1073/pnas.1312523111