Motif Signatures in Stretch Enhancers Are Enriched for Disease-Associated Genetic Variants
Total Page:16
File Type:pdf, Size:1020Kb
Quang et al. Epigenetics & Chromatin (2015) 8:23 DOI 10.1186/s13072-015-0015-7 RESEARCH Open Access Motif signatures in stretch enhancers are enriched for disease‑associated genetic variants Daniel X Quang1,2, Michael R Erdos3, Stephen C J Parker4*† and Francis S Collins3† Abstract Background: Stretch enhancers (SEs) are large chromatin-defned regulatory elements that are at least 3,000 base pairs (bps) long, in contrast to the median enhancer length of 800 bps. SEs tend to be cell-type specifc, regulate cell- type specifc gene expression, and are enriched in disease-associated genetic variants in disease-relevant cell types. Transcription factors (TFs) can bind to enhancers to modulate enhancer activity, and their sequence specifcity can be represented by motifs. We hypothesize motifs can provide a biological context for how genetic variants contribute to disease. Results: We integrated chromatin state, gene expression, and chromatin accessibility [measured as DNase I Hyper- sensitive Sites (DHSs)] maps across nine diferent cell types. Motif enrichment analyses of chromatin-defned enhancer sequences identify several known cell-type specifc “master” factors. Furthermore, de novo motif discovery not only recovers many of these motifs, but also identifes novel non-canonical motifs, providing additional insight into TF binding preferences. Across the length of SEs, motifs are most enriched in DHSs, though relative enrichment is also observed outside of DHSs. Interestingly, we show that single nucleotide polymorphisms associated with diseases or quantitative traits signifcantly overlap motif occurrences located in SEs, but outside of DHSs. Conclusions: These results reinforce the role of SEs in infuencing risk for diseases and suggest an expanded regu- latory functional role for motifs that occur outside highly accessible chromatin. Furthermore, the motif signatures generated here expand our understanding of the binding preference of well-characterized TFs. Background (referred to as SEs in this paper) [3]. By our defnition, Chromatin immunoprecipitation combined with high- SEs have a length of at least 3,000 DNA base pairs (bps) throughput sequencing (ChIP-seq) can identify the and are much larger than typical enhancers (TEs), which genome-wide locations of target proteins, including tran- we defned to be any enhancer less than or equal to the scription factors (TFs), RNA Polymerase II, and cova- median enhancer length of 800 bps. SEs are generally cell lently modifed histones [1]. ChIP-seq datasets for several type specifc, associated with increased cell-specifc gene chromatin marks and the sequence-specifc factor CTCF expression, and tend to harbor disease-relevant genetic can be computationally integrated to discover combina- variants derived from genome-wide association stud- torial and spatial patterns that produce a consistent anno- ies (GWASs). In our previous paper [3], we showed that tation of promoter, enhancer, insulator, transcribed, and enrichment for GWAS variants increases with the length repressed chromatin states. In a recent study, we profled of enhancers, but we did not try to defne the precise the chromatin states of several cell lines using Chrom- relationship of GWAS variants to motifs located within HMM [2], which revealed the presence of large gene the enhancers—that is the goal of this paper. SEs share control elements that we designated “stretch enhancers” several traits with another recently defned class of large enhancers designated super-enhancers by Young and *Correspondence: [email protected] colleagues [4, 5]. Like SEs, super-enhancers drive cell- †Stephen CJ Parker and Francis S Collins contributed equally type-specifc gene expression; however, super-enhancers 4 Departments of Computational Medicine and Bioinformatics and Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA have been defned by the disproportionate abundance of Full list of author information is available at the end of the article Mediator or histone 3 lysine 27 acetylation (H3K27ac) © 2015 Quang et al. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/ publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Quang et al. Epigenetics & Chromatin (2015) 8:23 Page 2 of 14 signal [6], whereas SEs are defned by patterns of histone Of the ten cell types profled, nine also have DNase-seq modifcations, or chromatin states. Enhancers can func- data available. DNase-seq is a method used to identify tion independently of their endogenous spatial contexts, the genome-wide locations of DHSs, which are regions which is a property exploited in luciferase assays to meas- of the genome that are highly sensitive to cleavage by ure enhancer activity by placing enhancers upstream DNase I and mark regulatory elements such as enhancers of reporter genes. Tis suggests that the information and promoters [8]. DHSs generally mark regions that are required for the enhancer activity is encoded in the more accessible for TF binding and are enriched for TF underlying DNA sequences. We, therefore, hypothesize binding motifs. that the sequence content of the enhancers can provide Figure 1a displays the accessible chromatin, chromatin additional insight into the relationship between enhancer state, and gene expression profles across the nine cell function and enhancer length, which we address by stud- types around the ABCC8 locus. Gene expression RNA- ying how SEs difer from TEs. seq [9] tracks clearly show that the ABCC8 transcript is Enhancer sequences are known to be enriched for exclusively expressed in the islet sample, and the chroma- transcription factor binding sites, which contribute to tin state tracks show this gene body is covered by several enhancer activity. Upon binding to their recognition islet-specifc SEs. Tis integrative approach can identify motifs, some TF proteins can form complexes with other cell-specifc chromatin and expression profles to provide proteins, which can alter the 3D conformation of chro- a basis for further understanding the functional efects of matin and recruit RNA Polymerase II to promote the SNPs in common, complex diseases. In the ABCC8 exam- transcription of target genes located in cis, sometimes ple, lead type 2 diabetes (T2D) GWAS SNPs (red arrow 2 at considerable distances. Te motif, or sequence bind- heads) and several linked SNPs (r ≥ 0.8) (green bars) ing specifcity, for a TF can be represented as a posi- overlap islet-specifc chromatin states. Enhancers have tion weight matrix (PWM) that specifes the nucleotide been shown to overlap multiple linked SNPs more often frequency at each position along the binding sequence. than expected at random [3, 10], suggesting that multiple Recently, less complex nucleotide patterns—like dinu- enhancer variants work together in concert to alter gene cleotide repeats—were shown to contribute to enhancer expression and contribute to disease susceptibility. activity [7]. From the ABCC8 example, we can infer some properties In this study, we analyze the motif signatures of SEs about the interplay between chromatin states and DNase and how they difer from those of TEs. We scan enhancer I hypersensitivity in SEs. First, we notice that the SEs sequences using known motifs from databases to iden- encompass several DHSs, but the entire SE does not dis- tify TFs that are characteristic of each cell type studied, play DNase hypersensitivity. Unsurprisingly due to their which we are also able to recover using de novo motif length, SEs contain proportionately more DHSs than TEs discovery. We investigate how GWAS single nucleotide do (Additional fle 2: Figure S1). Second, strong enhancer polymorphisms (SNPs) can afect TF motif signatures states in islets in ABCC8, for example (Figure 1a), tend in SEs, which can provide important clues about how to overlap spikes in the islet DNase-seq signal. By aggre- genetic variations contribute to disease risk. gating the DNase-seq tag density relative to DNase-seq peak summits in either SEs or TEs, we fnd there is rela- Results and discussion tively little diference in DNase I hypersensitivity between Systematic chromatin state, DNase hypersensitivity, the classes of enhancers (Figure 1b). However, there is a and gene expression profling across nine diverse cell large diference in the H3K27ac ChIP-seq signal (Fig- types ure 1c). Notably, there is a dip in the H3K27ac signal in In our previous study, we used the ChromHMM algo- the center of the DNase peak summit, which is a refec- rithm to systematically integrate ChIP-seq histone tion of bound TFs that can displace histones. Tis TF- modifcation and CTCF datasets and uniformly profle induced confguration of chromatin architecture is much chromatin states across ten diverse cell types. Tese more pronounced in SEs than it is in TEs. Of relevance, ChromHMM segmentations are used to profle SEs, this H3K27ac dip feature recently provided the central which are defned as regions of at least 3,000 bps con- motivation for a novel computational algorithm to detect taining contiguous segments marked as enhancer states. factor binding sites [11]. Given the pronounced dip in SEs TEs are defned similarly, but are less than or equal to versus TEs, this recent computational algorithm likely 800 bps in length. Although SEs only constitute the top picked up mostly on SE-mediated factor binding sites. 10% of enhancers in terms of length, they represent a In generating the signal histogram plots, we accounted disproportionately large percentage of the total number for the minor diference in mappability between SEs of nucleotides among all enhancers (Additional fle 1: and TEs. Generally, SEs are slightly more mappable than Table S1). TEs (Additional fle 2: Figure S2), which suggests that SE Quang et al.