Computational Characterization of Chromatin Domain Boundary-Associated Genomic Elements Seungpyo Hong and Dongsup Kim*

Published online 23 August 2017 Nucleic Acids Research, 2017, Vol. 45, No. 18 10403–10414 doi: 10.1093/nar/gkx738 Computational characterization of chromatin domain boundary-associated genomic elements Seungpyo Hong and Dongsup Kim* Department of Bio and Brain Engineering, KAIST, Daejeon, Republic of Korea Received January 03, 2017; Revised July 18, 2017; Editorial Decision August 10, 2017; Accepted August 14, 2017 ABSTRACT units of chromatin interact with each other to form larger units (4). The basic structural units of chromatin are nucle- Topologically associated domains (TADs) are 3D ge- osomes, which have a 10-nm fiber structure; their dynamic nomic structures with high internal interactions that interaction leads to the formation of larger-scale chromatin play important roles in genome compaction and gene domains characterized by high intra-domain and low inter- regulation. Their genomic locations and their asso- domain interactions (5). Two classes of chromatin domain ciation with CCCTC-binding factor (CTCF)-binding have been described; the smaller of these, which are referred sites and transcription start sites (TSSs) were re- to as physical or contact domains, constitute about 185 000 cently reported. However, the relationship between bp (6) and interact with each other to form larger topolog- TADs and other genomic elements has not been ically associated domains (TADs) of ∼1 million bp (7). systematically evaluated. This was addressed in the Although the existence of various genomic structures has present study, with a focus on the enrichment of been experimentally demonstrated, how they are formed is largely unknown; TADs are of particular interest given these genomic elements and their ability to predict that they are implicated in gene regulation (8,9). It is the TAD boundary region. We found that consensus thought that particular genomic elements are located at the CTCF-binding sites were strongly associated with boundaries of TADs and are involved in their formation. TAD boundaries as well as with the transcription CCCTC-binding factors (CTCFs) are more frequently ob- factors (TFs) Zinc finger protein (ZNF)143 and Yin served at TAD boundaries than in other regions, as are Yang (YY)1. TAD boundary-associated genomic ele- transcription starting sites (TSSs) of housekeeping (HK) ments include DNase I-hypersensitive sites, H3K36 genes and histone modifications such as H3K36 trimethy- trimethylation, TSSs, RNA polymerase II, and TFs lation (H3K36me3) (10). In fact, CTCF-binding sites are such as Specificity protein 1, ZNF274 and SIX home- key features of TAD boundaries since they can interact with obox 5. Computational modeling with these genomic CTCF-binding sites of another TAD boundary and thereby elements suggests that they have distinct roles in segregate the activities of adjacent TADs in association with cohesin complexes (6). However, given that CTCF- TAD boundary formation. We propose a structural binding sites are frequently found outside the TAD bound- model of TAD boundaries based on these findings ary, these sites are not in themselves sufficient to demar- that provides a basis for studying the mechanism of cate TADs. Hence, in one of previous studies, histone mod- chromatin structure formation and gene regulation. ification marks were incorporated into the prediction of TAD boundaries along with CTCF-binding sites (11). One INTRODUCTION other previous study examining the association between 76 The human genome is tightly packed into the cell nucleus, DNA-binding proteins and chromatin loop anchors located and yet is readily accessible to other cellular components at the ends of contact domains found that the transcrip- for the maintenance of cellular activity and response to ex- tion factors (TFs) Zinc finger protein (ZNF)143 and Yin ternal signals. The physical structure of the genome respon- Yang (YY)1 were enriched at TADs (6); the enrichment of sible for this apparent paradox has been sought for a long ZNF143 at TADs was later confirmed by another group time. Biochemical and microscopic studies revealed that the (12). However, these studies did not systematically search genome has different levels of organization, including nu- for genomic elements associated with TAD boundaries nor cleosomes, chromatin domains, and chromosome territories did they evaluate their predictive power. The predictabil- (1–3). Recent advances in chromatin conformation capture ity measure could clarify the association between genomic techniques have enabled detailed examination of genomic elements and TAD boundaries, since it takes into account structures; it was suggested that the human genome has a both coverage and data enrichment. For example, TSSs of fractal-like architecture in which the small local structural HK genes are the most highly enriched genomic element *To whom correspondence should be addressed. Tel: +82 42 350 4317; Fax: +82 42 350 4310; Email: [email protected] C The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] 10404 Nucleic Acids Research, 2017, Vol. 45, No. 18 at TAD boundaries, but have weak predictive power since transcription factor binding site peaks from multiple exper- they are present in only a small fraction of TAD boundaries. iments in various cell lines (wgEncodeRegTfbsClusteredV3 Conversely, CTCF binding sites are good predictors of the table). TF signals obtained from the H1-human embry- boundaries though they are not particularly enriched at the onic stem cell (hESC) cell line were separately considered boundaries than other genomic elements. in order to evaluate the cell type dependency of the TAD In the present study, we investigated the association be- structure. Non-TFBS signals such as locations of DNase I- tween TAD boundaries and various genomic elements in- hypersensitive sites and TSSs were also retrieved from the cluding TF-binding sites (TFBSs), TSSs, histone modifi- UCSC Genome Browser (13) and were tagged as ‘Other’. cations, and DNase I-hypersensitive sites. We first identi- The full list of data used in this study is shown in Sup- fied genomic elements that are enriched at TAD bound- plementary Table S3. TSSs were divided into three groups: aries. We then employed machine-learning methodology to TSSs of mRNA (TSS), TSSs of mRNA and non-coding determine those that are important for accurately predict- RNAs (TSS-ALL), and TSSs of HK genes (TSS-HK). HK ing TAD boundaries (Figure 1). We confirmed that con- genes were identified from a previously published list (15). sensus CTCF-binding sites present in multiple cell lines The center of each signal was taken as its genomic location; were better predictors of TAD boundaries, and were as- for example, a CTCF peak located at 16 110–16 390 on chro- sociated with TFs such as ZNF143. We also searched for mosome 1 was assumed to be at 16 250. genomic elements that represent the structural characteris- tic of TAD boundaries by measuring their enrichment at Analysis of consensus CTCF-binding sites and their ability to predict these boundaries, and identified TAD boundary-associated genomic elements such as We found that genomic elements associated with consen- ZNF143, YY1, MYC complexes, Specificity protein (SP)1, sus CTCF-binding sites occupied a narrower region, and we ± ZNF274 and SIX homeobox (SIX)5 as well as DNase I- therefore reduced the range of analysis to 15 kb. Consen- hypersensitive sites, H3K36me3 and TSSs. We generated a sus CTCF-binding sites were defined as signals with a peak ≥ predictive model based on these genomic elements and ana- score 600 and supported by 60 or more experiments. lyzed their localization at specific sites in the TAD. Our findings reveal a number of novel TAD boundary-associated Position-specific linear model (PSLM) genomic elements whose functions can provide insight into To evaluate the predictive power of each genomic element, the mechanisms of gene regulation. we devised a computational model, PSLM, that considers both position and density of genomic elements in genomic MATERIALS AND METHODS segments and classifies them into boundaries or TADs. In the model, the analyzed segments––of boundaries or TADs and boundaries TADs––were divided into 11 bins and the number of each A total of 2994 TADs with a length >200 kb were selected genomic element in each bin was counted, which are used as from 3062 previously defined TADs (10). The meeting ends 11D feature vectors that represent the positional preference of two adjacent TADs were designated as a boundary.When of the genomic element around the boundary or the TAD. there was a gap between the two ends, these were merged For each genomic element, boundary and TAD segments into a boundary if the spacing was <100 kb. There were were converted into feature vectors, and a predictive model 1254 gapped cases, of which 718 were merged. Ultimately, that segregated boundary and TAD feature vectors was gen- 3586 boundaries were analyzed in this study. The genomic erated. A linear model was developed using the Bayesian locations of TADs and boundaries are shown in Supple- Ridge method (16), which resembles linear regression but mentary Tables S1 and S2. eliminates abnormally

Computational Characterization of Chromatin Domain Boundary-Associated Genomic Elements Seungpyo Hong and Dongsup Kim*

2017.08.28 Anne Barry-Reidy Thesis Final.Pdf

Genome-Wide Inference of Natural Selection on Human Transcription Factor Binding Sites

Accompanies CD8 T Cell Effector Function Global DNA Methylation

Association of Gene Ontology Categories with Decay Rate for Hepg2 Experiments These Tables Show Details for All Gene Ontology Categories

Supplementary Data

Detecting Global in Uence of Transcription

Supplementary Table S4. FGA Co-Expressed Gene List in LUAD

Transcriptome Analysis of Complex I-Deficient Patients Reveals Distinct

Supporting Information

Integration of 198 Chip-Seq Datasets Reveals Human Cis-Regulatory Regions

Transcriptional and Post-Transcriptional Regulation of ATP-Binding Cassette Transporter Expression

Virtual Chip-Seq: Predicting Transcription Factor Binding