Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
1 Human cardiac cis-regulatory elements, their cognate transcription factors
2 and regulatory DNA sequence variants
3
4 Dongwon Lee1,†,*, Ashish Kapoor1,‡, Alexias Safi2, Lingyun Song2, Marc K. Halushka3,
5 Gregory E. Crawford2, Aravinda Chakravarti1,†,*
6
7 1McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of
8 Medicine, Baltimore, MD 21205, United States; 2Department of Pediatrics and Center
9 for Genomic & Computational Biology, Duke University Medical Center, Durham, NC
10 27708, United States; 3Department of Pathology, Johns Hopkins University School of
11 Medicine, Baltimore, MD 21205, United States.
12
13 †Current address: Center for Human Genetics & Genomics, New York University School
14 of Medicine, New York, NY, USA
15 ‡Current address: University of Texas Health Science Center at Houston, Houston, TX,
U SA 16 USA
17
18 Running title: A human cardiac cis-regulatory map
19
20 Keywords: cis-regulatory elements, transcription factors, regulatory variants, QT
21 interval, heritability
22
23 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
24 *Send all correspondence to:
25
26 Aravinda Chakravarti, Ph.D. & Dongwon Lee, Ph.D.
27 Center for Human Genetics and Genomics
28 New York University School of Medicine
29 435 East 30th Street, SM Room 803/4
30 New York, NY 10016, USA
31 (T) 212-263-8023
32 (E) [email protected], [email protected]
2 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
33 Abstract
34 cis-regulatory elements (CRE), short DNA sequences through which transcription
35 factors (TF) exert regulatory control on gene expression, are postulated to be the major
36 sites of causal sequence variation underlying the genetics of complex traits and diseases.
37 We present integrative analyses, combining high-throughput genomic and epigenomic
38 data with sequence-based computations, to identify the causal transcriptional
39 components in a given tissue. We use data on adult human hearts to demonstrate that a)
40 sequence-based predictions detect numerous, active, tissue-specific CREs missed by
41 experimental observations, b) learned sequence features identify the cognate TFs, c)
42 CRE variants are specifically associated with cardiac gene expression, and, d) a
43 significant fraction of the heritability of exemplar cardiac traits (QT interval, blood
44 pressure, pulse rate) is attributable to these variants. This general systems approach can
45 thus identify candidate causal variants and the components of gene regulatory networks
46 (GRN) to enable understanding of the mechanisms of complex disorders at a tissue or
47 cell-type basis.
3 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
48 Introduction
49 The majority of sequence variation affecting inter-individual complex disease risk
50 is common and noncoding, in contrast to the rare coding variation underlying
51 Mendelian disorders (Chakravarti and Turner 2016). These findings are consistent with
52 the intense natural selection on coding sequence variation and the much weaker
53 selection on noncoding variation (Asthana et al. 2007). Most complex traits arise from
54 multi-genic effects with identified risk alleles having small effects so that no individual
55 effect is either necessary or sufficient (Chakravarti and Turner 2016). Thus, there must
56 be considerable redundancy in non-coding functions. The best candidates for these non-
57 coding functions are cis-regulatory elements (CREs) of gene expression because they are
58 the primary agents of regulatory control, and affect disease risk through altered gene
59 expression (Davidson 2010; Maurano et al. 2012; Chatterjee et al. 2016; Phillips-
60 Cremins and Corces 2013).
61
62 Understanding the regulatory genomic architecture is, therefore, important for
63 elucidating the etiologies of complex disorders, particularly at the tissue- and cell-type
64 levels. This involves identifying the numbers of TFs and enhancers engaged, their
65 genomic distribution and the feedback mechanisms required for expression control of
66 each gene, namely, the components of its gene regulatory network (GRN) (Davidson
67 2010). Additionally, we need to quantify the effect of sequence variation within the GRN
68 on gene expression. These questions can now be answered using high-throughput ChIP-
69 seq (Barski et al. 2007; Johnson et al. 2007; Robertson et al. 2007), DNase-seq (Boyle et
70 al. 2008; John et al. 2011), and ATAC-seq (Buenrostro et al. 2013) experimental data
4 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
71 from the ENCODE (The ENCODE Project Consortium 2012) and Roadmap epigenomics
72 projects (Roadmap Epigenomics Consortium et al. 2015), on many cell types, tissues,
73 developmental times and states, in conjunction with human sequence variation (1KGP)
74 (The 1000 Genomes Project Consortium 2015) and its effects on gene expression across
75 human tissues (GTEx) (The GTEx Consortium 2015). The fulcrum of a GRN are its
76 enhancers (CREs), however, their complete experimental detection is not possible
77 because CRE activity is affected by many factors (Misteli 2001; Degner et al. 2012; He et
78 al. 2014): (1) functional strength is variable across enhancers, (2) activity is stochastic
79 across cells, (3) detected activity varies by experimental protocol and sample, and, (4)
80 resolution of a CRE’s genomic location from current data is approximate. In principle,
81 ChIP-seq data are superior but the general lack of TF-specific antibodies is a major
82 roadblock to this approach. We solve these major impediments instead by using a
83 generalizable genome sequence-based machine learning approach (Lee et al. 2011;
84 Ghandi et al. 2014; Lee et al. 2015). This method learns the short genome sequence
85 features that maximally discriminate sequences underlying CREs from random genome
86 sequences. The learned models are then used to identify additional CREs with similar
87 sequence features, including combinations of transcription factor binding sites, that
88 failed detection through experiments.
89
90 We focus here on identifying the GRN ‘parts list’ of the adult human heart, and
91 the effects of CRE genetic variation on an exemplary cardiac trait, the
92 electrocardiographic QT interval (QTi), causal to sudden cardiac death and other
93 conduction disorders (Tomaselli et al. 1994). We show the generalizability of our results
5 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
94 by using these cardiac enhancers to explain phenotypic variation in systolic/diastolic
95 blood pressure and pulse rate but not BMI. We focus on the heart because of the
96 extensive genome-wide association study (GWAS) literature implicating CRE variation
97 in many cardiovascular diseases and phenotypes (Arking et al. 2014; Kapoor et al. 2014;
98 the CARDIoGRAMplusC4D Consortium 2015; Eppinga et al. 2016), but also because the
99 QTi is likely affected by the cardiac system alone. A number of human cardiac CRE
100 maps exist (May et al. 2012; Narlikar et al. 2010) with a recent study identifying distal
101 heart enhancers based on H3K27ac and EP300/CREBBP profiles (Dickel et al. 2016). At
102 least one study has also used DNase-seq and histone modification ChIP-seq data for
103 heart tissues to identify functional regulatory variants associated with electrocardiogram
104 traits (Wang et al. 2016). Our study improves on these maps by comprehensive
105 identification of CREs. In addition, we identify their corresponding TFs and analyze the
106 effects of genetic variants within these heart CREs for heart-related traits.
107
108 Results
109 Experimental detection of cardiac CREs is incomplete
110 A schematic overview of our approach is shown in Fig. S1. We first performed
111 DNase-seq to identify open chromatin regions through DNase I hypersensitive sites
112 (DHSs) as potential CREs in two adult human hearts (left ventricles) together with three
113 publicly available heart DNase-seq data sets (Fig. 1A). Peak calling with MACS2 (Zhang
114 et al. 2008) identified varying numbers of DHSs (50,000 ~ 110,000) across the five
115 samples (Methods) (Supplemental Fig. S2A). We defined DHSs as 600bp regions
116 centered at the identified summits. DHSs frequently cluster in small regions, and
6 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
117 22~32% of extended DHSs overlap their neighbors forming larger DHSs with multiple
118 summits (Supplemental Fig. S2B). We next compared heart DHSs with other tissue
119 DHSs using the top 50,000 DHSs from each sample and the Jaccard index (number of
120 bases in the intersection over the union for each pair). These comparisons show higher
121 similarity (~50%) between the adult cardiac samples than with other tissues (~30%),
122 including fetal heart (Methods) (Supplemental Fig. S2C). Thus, DHS maps do uncover
123 tissue-relevant CREs although many regions are open across many tissues. Nevertheless,
124 technical and biological variation affect the adult cardiac DHS maps. To maximally
125 detect CREs, we aggregated all DHSs across the five adult heart replicates and identified
126 ~160,000 distinct regions (“observed DHSs”) covering ~4% of the genome. Using
127 multiple samples is important because ~40% of DHSs were detected only once across
128 the replicates while only ~22% detected across all five (Fig. 1B).
129
130 Sequence-based models can predict additional CREs
131 To elucidate the sequence code for these cardiac CREs, we used the gapped k-mer
132 support vector machines (SVM) method, gkm-SVM (Ghandi et al. 2014), widely
133 accepted as one of the state-of-the-art sequence-based methods (Kreimer et al. 2017).
134 Also, extracting predictive sequence features from gkm-SVM is easier than from other
135 methods. We used an improved software LS-GKM (Lee 2016) that enabled training on
136 larger data sets (~100,000 DHSs) (Supplemental Fig. S3A). The gkm-SVM attained
137 considerable accuracy in classifying CREs from non-CREs on reserved chromosome 9
138 data (Fig. 1C) (Methods). Of note, the choice of a specific chromosome did not affect the
139 method’s performance as its accuracy was unchanged with 5-fold cross validation
7 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
140 (Supplemental Fig. S3B). Importantly, our model trained on a single data set can predict
141 DHSs observed in the other data sets and systematically assigns higher scores to DHSs
142 experimentally missed than random genomic regions, demonstrating that sequence-
143 based predictions do indeed identify true DHSs (Supplemental Methods; Supplemental
144 Fig. S4A).
145
146 The number of samples used for model building determines a model’s accuracy.
147 Our tests show that using all 5 samples identified an additional 5~10% true DHSs than
148 any single sample although the maximum accuracy was rapidly attained (Supplemental
149 Methods; Supplemental Fig. S4B). Further, the randomness of negative sets for training
150 is also a major source of variation. To minimize this effect we averaged ten different
151 models, trained on independently generated negative sets, and used the combined
152 model for DHS prediction at balanced precision and recall rates (~43%) (Fig. 1D)
153 (Methods).
154
155 Scanning the entire genome yielded an additional 88,026 distinct “predicted”
156 heart DHSs after excluding regions that overlapped observed ones (Fig. 2A) (Methods).
157 Thus, these predicted DHSs do not suffer from overfitting by construction. For
158 validations, we used several functional annotations to compare them to observed
159 regions. First, we assessed sequence conservation against randomly permuted regions
160 (Supplemental Fig. S5A): both observed DHSs and, to a lesser extent, predicted DHSs
161 significantly overlap genomic conserved elements (Davydov et al. 2010) (Binomial test,
162 P < 2.2 × 10-16). Second, we asked how frequently predicted DHSs were in open
8 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
163 chromatin in other tissues. We defined two different DHS sets: (a) a universal set by
164 combining DHSs from all ENCODE and Roadmap data sets, except from adult heart,
165 and (b) a heart-related set of DHSs from cardiac-related samples only (Methods). Both
166 observed and predicted DHSs significantly overlapped both DHS classes (Supplemental
167 Fig. S5B, C): 94.7% of observed and 57.7% of predicted DHSs are open in heart-related
168 tissues with an additional 30.7% in other cell-types (Fig. 2A). Third, we compared
169 H3K27ac histone modification marks in heart tissues to show the same pattern: 52.3%
170 of observed but only 10.9% of predicted DHSs overlapped H3K27ac marked regions
171 (Supplemental Fig. S5D). Yet, we identified ~7,500 additional regions, under the
172 stringent criteria that predicted regions overlap H3K27ac marks and are heart-related
173 DHSs, not detected by DNase-seq alone. Predicted DHSs are largely restricted to
174 specific tissues and cells (Fig. 2B; Supplemental Fig. S5E) while nearly half of observed
175 DHSs are open across many different non-cardiac cell-types. Moreover, predicted DHSs
176 show systematically weaker DHS signals than observed DHSs (Supplemental Fig. S5F).
177
178 Varying degrees of overlap with functional annotations led us to further
179 investigate potential sources of variation in CRE identification. We divided predicted
180 DHSs into four distinct categories based on overlap: heart H3K27ac histone
181 modifications and heart related DHSs (tier 1); heart related DHSs only (tier 2); universal
182 DHSs only (tier 3); and, the remainder (tier 4). For each of these categories, we
183 considered the (1) proportion of ambiguous bases, (2) average SNP (single nucleotide
184 polymorphism) frequency per base (common and rare SNPs, 1% MAF as a threshold), (3)
185 proportional overlap with H3K9me3 histone modifications in heart tissues, a
9 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
186 representative heterochromatin mark (Nakayama et al. 2001), and (4) proportional
187 overlap with FANTOM5 enhancers (Lizio et al. 2015) (Methods). As a positive control,
188 we included observed DHSs for comparison (Supplemental Table S1). First, the
189 predicted DHSs in the lower tiers (3 and 4) have poorer mappability: predicted DHSs in
190 tier 4 have 2~3.5 times more ambiguous bases than observed DHSs, depending on read
191 length. Since the functional annotations (H3K27ac and heart related DHSs) were also
192 identified by sequencing, the variability in CRE detection can be explained at least in
193 part by differential mappability. SNP frequencies also follow the expected pattern;
194 higher SNP frequency in the lower tiers (2 & 3) imply that these predicted DHSs are
195 more variable making read mapping difficult. Predicted DHSs in tier 4 show the lowest
196 SNP frequency because their extremely poor mappability reduces SNP identification
197 (Nielsen et al. 2011). DHSs predicted in the lower tiers are more enriched in
198 heterochromatin, suggesting that chromatin organization, a feature difficult to predict
199 using sequence-based models only, influences CRE detection as well. Lastly, we used
200 FANTOM5 enhancers (Lizio et al. 2015) as an orthogonal enhancer validation set and
201 compared them to predicted DHSs. Significant proportions of predicted DHSs in tiers 1,
202 2 and 3 (12.3%, 5.9%, and 2.7%) were FANTOM5 enhancers (Binomial test, P < 2.2 × 10-
203 16 for all cases) but tier 4 had only 0.5% FANTOM5 enhancers (P < 0.97).
204
205 Another potential cause of variation in CRE detection lies in peak calling, with
206 some predicted DHSs being missed being below our detection threshold. Thus, we
207 recalled DHS peaks with relaxed thresholds (False Discovery Rate<0.1, 0.15, 0.2 and
208 0.25) and identified larger numbers of DHSs (193k, 205k, 242k, and 280k, respectively).
10 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
209 We calculated the proportion of the original predicted DHSs overlapping these newly
210 identified DHSs (Supplemental Table S2) to show that predicted DHSs, especially in the
211 higher tiers, overlap many of these less stringent DHSs. Thus, a significant fraction of
212 predicted DHSs are likely true. Taken together, our results suggest that sequence-based
213 predictions of an additional ~55% cardiac CREs is highly complementary to their
214 experimental detection, and necessary for obtaining comprehensive CRE maps.
215
216 Sequence-based models can also predict cardiac TFs
217 We next tested whether the sequence features of gkm-SVM allow identification of
218 the cognate TFs through their TF binding sequences (TFBS), in the corresponding
219 tissue/cell types (Lee et al. 2011; Gorkin et al. 2012). First, we used the distribution of all
220 11-mer weights in the SVM model and compared it to those that match TFBSs for known
221 heart TFs. Indeed, TFBSs for CTCF and MEF2A as well as other known cardiac factors
222 are significantly enriched in the top 5th percentile of the SVM weight distribution, while
223 an exemplar non-cardiac factor such as POU5F1 is not (Fig. 3A). Thus, to search for
224 cardiac TFs, we used the CIS-BP database (Weirauch et al. 2014) augmented by 93 new
225 position weight matrices (PWMs) of C2H2 Zinc-Finger (ZF) TFs (Schmitges et al. 2016),
226 to show that 11-mers matching 54% of PWMs (473/868) associated with 506 TFs were
227 enriched in the top 5th percentile of the distribution (Supplemental Data S1). Expectedly,
228 11-mers matching these predicted cardiac TFs systematically have larger SVM weights
229 than those that don’t (Supplemental Fig. S6). Since not all TFs are expressed in any
230 given tissue, we further restricted attention to the 334 TFs expressed in heart left
231 ventricles using gene expression profiles from the GTEx project (Supplmental Methods);
11 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
232 Fisher’s exact test confirmed that predicted TFs were significantly associated with
233 cardiac gene expression (Fig. 3B).
234
235 These results are physiologically relevant since left ventricles and atrial
236 appendages from adult hearts were the two most significant tissues among the 53 GTEx
237 tissue gene expression profiles we examined (Fig. 3C). Many other tissues also showed
238 significant association because many TFs are expressed across multiple tissues. Thus,
239 enrichment tests after removing 353 commonly (>90% of tissues) expressed TFs
240 dramatically reduced the significance for every tissue except for the two heart tissues
241 (Fig. 3D). Thus, cardiac CRE functions are activated by a large number of both
242 commonly expressed and cardiac-specific TFs. Our analyses revealed 78 selectively
243 expressed and CRE-enriched TFs in heart tissues of which 29 are C2H2 ZFs with diverse
244 binding specificities with the remaining 49 belonging to seven other major classes; 16
245 basic-helix-loop-helix (bHLH), 9 Nuclear Receptor (NR), 5 Sox, 4 Homeodomain (HD),
246 4 basic leucine zipper (bZIP), 3 GATA, and 3 ETS transcription factors (Fig. 4). Many
247 expressed TFs in the same family have similar sequence specificities, explaining some
248 part of the expected functional redundancy.
249
250 We validated these predicted cardiac TFs also using ENCODE ChIP-seq data
251 (Supplemental Methods) based on two classes: potential cardiac TF ChIP-seq data (n =
252 1,054 for 176 TFs) and the remainder (n = 313 for 126 TFs). For each, we calculated the
253 overlap between ChIP-seq peaks and our observed and predicted DHSs to show that
254 candidate cardiac TF-bound regions overlap heart DHSs significantly more often than
12 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
255 non-cardiac TFs (P < 2.2 × 10-16 for observed DHSs, P < 2.84 × 10-16 for predicted DHSs,
256 with one-tailed two-sample Kolmogorov-Smirnov tests) (Supplemental Fig. S7).
257
258 Common cardiac DHSs are mostly promoters and CTCF sites
259 Many DHSs are accessible across diverse cell types (Xi et al. 2007; Song et al.
260 2011; Thurman et al. 2012). These common DHSs are typically enriched in transcription
261 start sites (TSSs) or CTCF binding sites or both, with the remainder enriched in cell-type
262 specific distal enhancers. To discriminate the cardiac DHSs, we defined common DHSs
263 as regions that are open in ≥30% of all ENCODE/ Roadmap tissue samples (Lee et al.
264 2015) (Supplemental Methods). Consistent with previous observations (Xi et al. 2007;
265 Song et al. 2011; Thurman et al. 2012), ~47% of observed heart DHSs are common DHSs
266 of which ~63% overlap TSSs or CTCF bound regions (Supplemental Fig. S8). To
267 understand the sequence features of these classes of DHSs we separately trained a gkm-
268 SVM model after removing these common DHSs. In this heart-specific model,
269 consistent with the CTCF ChIP-seq peak overlap analysis, most of the predictive 11-mers
270 that showed a major decrease in SVM scores (Z-score differences > 6) were CTCF
271 binding sites. In the top 5 motifs showing a major decrease, were additionally ZBTB4,
272 PLAG1, SP4 and KLF10 binding sites, known to be promoter binding TFs (Supplemental
273 Fig. S8C, D). These analyses provide a principled, statistical way to distinguish TSSs,
274 CTCF binding sites and distal enhancers from DHS data, on a tissue or cell-type specific
275 manner.
276
13 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
277 Predicted cardiac regulatory variants affect chromatin accessibility and
278 gene expression
279 Our sequence-based model allows systematic predictions of whether a DNA
280 sequence variant within a CRE is likely to affect its function, thus making detection of
281 (cardiac) regulatory variants possible. To assess this function, we used the deltaSVM
282 method (Lee et al. 2015; Beer 2017; Kreimer et al. 2017) to first assess a small set of
283 high-confidence cardiac regulatory variants associated with allele-biased DHSs
284 (Methods). We tested two different gkm-SVM models, one trained on all heart DHSs
285 ("generic model”) and the other on cardiac-specific DHSs after removing common DHSs
286 (“specific model”): deltaSVM scores from both models were significantly correlated with
287 their allele-biased chromatin accessibility (Fig. 5A; Supplemental Fig. S9A). Both
288 models achieved comparable precision, 46% and 52% at 40% recall, using 4× larger sets
289 of control variants with no allele bias (Fig. 5B). Although their performance is similar
290 they do not predict the same variants since ~30% of variants are unique to each model
291 (Supplemental Fig. S9B). This result is consistent with the k-mer weight distribution
292 differences between the models (Supplemental Fig. S8C) and suggests that the specific
293 model is better at detecting cardiac-specific TFBSs by ignoring CTCF and general
294 promotor TFBSs.
295
296 These results prompted the question whether deltaSVM predicted variants
297 affected local gene expression? We identified all high scoring variants within heart
298 DHSs (Supplemental Data S2) and compared them to GTEx expression quantitative
299 trait loci (eQTLs) from 44 different tissues (Methods): deltaSVM variants predicted by
14 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
300 the generic model are significantly associated with eQTLs in many tissues (Fig. 5C). This
301 significance is strongly correlated with the number of eQTLs in CREs, which is also
302 correlated with the sample size used for gene expression studies (The GTEx Consortium
303 2015). On the other hand, variants predicted by the specific model are mostly associated
304 with eQTLs from heart tissues although their overall statistical significance is lower (Fig.
305 5D). We concluded that despite the pleiotropy of many regulatory variants, the specific
306 model does identify cardiac-specific regulatory variants with increased specificity, but at
307 the cost of missing some regulatory variants common to many tissues. Further,
308 inclusion of predicted DHSs enhanced detection by increasing the statistical significance
309 of eQTL associations (Supplemental Fig. S9C, D).
310
311 Predicted cardiac regulatory variants explain cardiac phenotypes
312 We hypothesize that these predicted causal variants, within observed and
313 predicted CREs, explain a significant fraction of cardiac phenotype heritability which we
314 tested using QTi GWAS (Arking et al. 2014), an intermediate trait involved in long QT
315 syndrome and sudden cardiac death (Tomaselli et al. 1994). We used our published QTi
316 meta-analysis on 76,061 European ancestry subjects and ~2.7 million SNPs and
317 performed Q-Q analysis. First, variants within heart CREs (red dots) are significantly
318 enriched in QTi GWAS SNPs in comparison to all common SNPs (black dots). A subset
319 of these heart CRE variants, predicted to be causal by deltaSVM, especially by the
320 specific model (green dots), are further enriched in QTi-associated variants (Fig. 6A).
321
15 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
322 Taken together these variants contribute substantially to QTi heritability, as
323 shown for some functional annotations of the genome relative to other complex
324 phenotypes (Yang et al. 2011). To quantify this contribution, we used linkage
325 disequilibrium (LD) score regression methods on GWAS summary statistics (Finucane
326 et al. 2015) to evaluate the heritability contribution from CRE causal variants
327 (Supplemental Methods). 11.2% of QTi variation was explained by all common
328 autosomal variants (1KGP SNPs with MAF > 5% in European ancestry subjects). Using
329 predefined cell-type group functional annotations (Finucane et al. 2015), the
330 cardiovascular cell-type group contributed to the most significant enrichment as
331 compared to other tissues (Fig. 6B); this cell-type group includes lung tissues so that the
332 specificity for QTi may be diluted. Consequently, we tested our heart DHS based
333 annotations by comparing the enrichment under various definitions of causality (Fig.
334 6C). We first evaluated different DHS lengths, discovering that a higher heritability
335 (61%) could be explained by extending DHSs to 2kb lengths, although higher
336 enrichment (7.9×) was achieved at the original definition of 600bp. We surmise that
337 narrower DHSs truncate some CREs, missing true variants. As a comparison, we also
338 evaluated a recently published cardiac CRE map based on EP300 and H3K27ac bound
339 regions from multiple cardiac developmental stages (Dickel et al. 2016). This
340 enrichment is comparable to that achieved by our 2kb heart DHSs (5.4× vs. 5.3×; one-
341 tailed Z-score P < 0.51). The average size of Dickel’s cardiac CREs is bigger than our 2kb
342 heart DHSs (3,211 bp versus 2,467 bp) but the number of elements is smaller (82,119
343 versus 133,102). The majority of Dickel’s CREs (49,506 out of 82,119, or 60%) overlap
344 our DHSs, but many regions are still detected by just one approach, suggesting that each
16 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
345 method detects somewhat different types of CREs. Next, adding H3K27ac marks to
346 DHSs further increases (from 5.7× to 8.5×) the enrichment by effectively filtering out
347 less informative DHSs. While the difference between these two enrichments is only
348 marginally significant (one-tailed Z-score P < 0.06), H3K27ac marks alone capture
349 more variants than H3K27ac overlapping DHS variants (7.7% vs. 5.8%) but explains a
350 smaller heritability fraction (44.3% vs. 48.8%) (Fig. 6C; Supplemental Fig. S10). Thus,
351 some H3K27ac peaks do not capture CRE variants, consistent with the fact that
352 H3K27ac peaks are typically located at CRE boundaries. Adding deltaSVM predictions
353 greatly increased enrichment, but explained a smaller fraction of heritability, likely due
354 to false negatives in deltaSVM predictions. Nonetheless, our heart DHS-based
355 annotations, combined with deltaSVM predictions and H3K27ac marks, achieved a
356 highly significant 33.1× enrichment. Tier 1 predicted DHSs consistently increase the
357 explained heritability by 5~10% (i.e. 1~3% additional heritability) without
358 compromising enrichment (Fig. 6C; Supplemental Fig. S10). These results confirm that
359 regulatory sequence variation is the major source of QTi phenotypic variation and that
360 we can identify a majority of such causal variants. Of note, including predicted DHSs
361 with weaker functional support (Tiers 2,3, and 4) decreased overall SNP heritability
362 (Supplemental Table S3). This is probably due to the assumption in the LD score
363 regression method that effect sizes are normally distributed. Under this assumption,
364 adding a large number of SNPs with very small effect sizes can introduce a downward
365 bias. It is also consistent with our observation that many predicted DHSs are weaker
366 compared to observed ones, making them less contributory to the phenotype.
367
17 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
368 To demonstrate our method's broad applicability, we also analyzed three
369 additional cardiac phenotypes (systolic blood pressure, diastolic blood pressure, and
370 pulse rate) and one non-cardiac phenotype (BMI) using the UK Biobank project public
371 data (Sudlow et al. 2015) (Methods). Similar to the QTi result, predicted cardiac
372 regulatory variants significantly contributed to all three cardiac-relevant phenotypes but
373 not BMI (Supplemental Fig. S11). The heritability of blood pressure explained by the
374 cardiac variants is consistently less than that for QTi and pulse rate, suggesting that
375 regulatory variants active in other tissues, such as kidney and blood vessels (Hoffmann
376 et al. 2017), also contribute to blood pressure phenotypes.
377
378 The heart CRE map we generated and the deltaSVM predictions of causality for
379 specific variants within these CREs can be used to probe each of the known QTi GWAS
380 loci in much greater detail (Supplemental Data S3). The QTi GWAS meta-analysis
381 (Arking et al. 2014) identified 35 independent loci with genome-wide significant (P < 5 ×
382 10-8) common variants. By restricting attention to significant SNPs and their LD proxies
383 (r2 > 0.9), we identified 149 variants as potentially regulatory (Supplemental Table S4).
384 Of these, only ~50% (n = 72) are highly associated (r2 > 0.6) with one of the 67 sentinel
385 SNPs (cf. some of the 35 loci have multiple independent sentinel SNPs). On the other
386 hand, 108 variants have alternative measures of high association (|D’| > 0.9) suggesting
387 that many index SNPs may tag multiple regulatory variants within their haplotypes.
388 Note that these regulatory variants are not uniformly distributed across the associated
389 loci: 75% are located within 8 loci (PLN, NOS1AP, ELP6, LITAF, CNOT1, SCN5A,
390 LAPTM4B, and KCNH2). These variants are significantly enriched in GTEx eQTLs as
18 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
391 well: 104 of 149 variants are eQTLs in at least one tissue (2.7× enrichment; binomial test,
-30 392 P < 3.6 × 10 ), of which 57 are eQTLs in the two heart tissues (7.5× enrichment; P < 2.2
393 × 10-34) implying that these specific variants affect the QTi phenotype by perturbing
394 gene expression in the heart.
395
396 As an independent validation of deltaSVM predictions, we finally analyzed a
397 small number of GWAS variants obtained from a recently published study (Wang et al.
398 2016). In this report, 18 putative functional SNPs were selected based on sub-threshold
399 GWAS significance (5 × 10-8 < P < 1 × 10-4) for the QTi and QRS duration, additionally
400 supported by other functional annotations (histone modification marks, DHSs, or
401 eQTLs). These were tested by luciferase assays in human iPSC-derived cardiomyocytes.
402 Of these, 14 were found within our heart DHSs, of which five were predicted to be
403 potentially regulatory by deltaSVM. All of these deltaSVM predicted SNPs showed
404 differential enhancer activities by the luciferase assay (100% specificity) but the study
405 also identified 5 additional regulatory SNPs (50% false negative rate) (Supplemental
406 Table S5). Although the small sample size explains the lack of statistical significance
407 (Fisher’s exact test P < 0.125), the result is consistent with our predictions.
408
409 Discussion
410 This study shows that sequence-based models can allow the systematic detection
411 of specific non-coding CREs, the TFs that engage them, and sequence variants that
412 affect their binding to regulate target gene expression. These models also allow
413 prediction of the functional effects of noncoding sequence variation within these CREs
19 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
414 on human phenotypes, as shown here for cardiac traits. Although much more progress
415 is required, improved epigenomic and genomic data sets, data at cell-type resolution,
416 more refined sequence-based models, yet more sophisticated machine learning
417 algorithms, can lead to reading and assessing the entire human genome sequence of
418 thousands of individuals comprehensively. These advances have important implications
419 for understanding the role of both regulatory and structural variation in both Mendelian
420 and complex disease. In the short run, as our results on the QT interval, blood pressure
421 and pulse rate demonstrate, we have specific variants with strong a posteriori evidence
422 of regulatory effects on specific genes that regulate these phenotypes. These predictions
423 are specific because they fail to predict non-cardiac phenotypes (BMI). Thus, high-
424 throughput functional tests of specific variants in specific CREs modulated by specific
425 TFs and affecting specific target cardiac genes are likely to be fruitful. In turn, our
426 cardiac model with genome-wide QT interval or other cardiac trait data can also enable
427 prediction of additional genes, not merely loci, which modulate these traits. Eventually,
428 the specific pattern of use of cardiac enhancers and their target genes across many traits
429 is likely to teach us many new facets of cardiac physiology.
430
431 The research described here is integrative, general and broadly applicable to all
432 cell-types and tissues. In time, such regulatory CRE maps, their cognate TFs and
433 sequence variants affecting CRE activity can be routinely constructed for a wide variety
434 of cell types and tissues. The comparative analyses of such data will teach us a great deal
435 about what is common and what is specific regulation for each cell type and the GRNs
436 within them. Recently, Pritchard and colleagues have advanced the hypothesis that most
20 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
437 complex traits and diseases are omnigenic, arising from the perturbations of thousands
438 of genes by neighboring sequence variants: although some of these genes have a
439 physiologic role many (most?) others merely happen to be expressed in disease/trait-
440 relevant cell types (Boyle et al. 2017). They attribute this behavior to GRNs that are so
441 interconnected that numerous ‘trait-irrelevant’ genes affect the functions of a much
442 smaller set of core (‘trait-relevant’) genes. The comparative analyses of the models we
443 describe will be essential for distinguishing the core from the peripheral trait genes, as
444 well as assess the effects of different tissues on a given disease.
445
446 Methods
447 Heart DNase-seq data sets
448 We collected two human heart left ventricle (LV) samples and performed DNase-seq
449 experiments as previously described (Song and Crawford 2010) (Supplemental
450 Methods). In addition to the DNase-seq data sets we generated, three additional heart
451 DNase-seq data sets were obtained from the ENCODE and the Roadmap Epigenomics
452 projects using the mapped reads provided by the consortia (GSM1027322,
453 ENCFF000SPN, and ENCFF000SPP). To identify DHSs, we processed the mapped
454 DNase-seq reads (hg19) using MACS2 (Zhang et al. 2008) with the following parameters
455 “-g hs –nomodel –shift -50 –extsize 100”. 600 bp regions centered at the identified
456 summits were used to define DHSs. The identified DHSs from all five samples were then
457 merged to observe a total of 164,235 distinct regions. For pairwise comparisons, we
458 selected the top 50,000 regions based on the MACS2 P-values from each sample and
459 calculated the Jaccard index for each pair. For comparative analyses, we chose four
21 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
460 adult tissues (ovary, pancreas, psoas muscle, and small intestine) as well as five fetal
461 tissues (adrenal gland, brain, heart, kidney, and spinal cord) all available from the
462 Roadmap project.
463
464 Genome-wide prediction of cis-regulatory elements using gkm-SVM
465 For each of the heart DHS data sets, we trained the gkm-SVM models as previously
466 described with some modifications, and systematically evaluated its ability to predict
467 new DHSs missed by experiments (Supplemental Methods). To identify the best model,
468 we evaluated the combined model as well as the five individual ones by varying SVM
469 thresholds. The combined model, which averages the SVM scores over the five heart
470 data sets, consistently outperformed other models and was chosen as the best model for
471 subsequent genome-wide CRE prediction. For predicting CREs, we scored the whole
472 human genome (hg19) for every 600bp interval with a 100bp sliding window. We
473 applied a SVM score cut-off (>0.9) that balanced precision and recall values estimated
474 from the test set (precision=recall=0.43). In this study we used hg19 as a reference
475 genome and not GRCh38 because realigning reads to the new reference does not
476 substantially alter hg19-based versus GRCh38-based gkm-SVM models (Supplemental
477 Methods).
478
479 Genomic properties of detected DHS elements
480 For each heart DHS set (observed and predicted), we calculated the proportion of
481 regions overlapping three different genomic annotations: a set of conserved regions, and
482 two different sets of open chromatin regions—a universal DHS set and a heart-related
22 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
483 DHS set. We used GERP++ evolutionarily constrained elements (Davydov et al. 2010)
484 as conserved regions, and defined an overlap if at least 50bp of the GERP++ element(s)
485 was contained. For open chromatin regions, we downloaded publicly available DNase-
486 seq data sets from the ENCODE UCSC genome browser (genome.ucsc.edu/encode) and
487 the Roadmap Epigenomics project website (www.roadmapepigenomics.org) and
488 identified DHS peaks as described above. We defined the universal DHS set by merging
489 all DHSs from 797 DNase-seq data sets (counting replicates separately, when available)
490 except for the three heart DNase-seq data sets. To reduce false positive DHSs, we
491 excluded regions that were detected only once across all data sets, resulting in 967,724
492 unique DHSs covering ~45% of the genome. To define a cardiac-relevant DHS set we
493 only used samples derived from muscles, blood vessels, and fetal hearts (n = 126),
494 resulting in 605,466 heart-related elements covering ~19% of the genome. We defined
495 an overlap by at least 300bp. For histone modification marks of enhancers, we used
496 H3K27ac ChIP-seq peaks and defined 118,597 distinct cardiac H3K27ac marked regions
497 (Supplemental Methods). As a negative control, we created 100 independent sets for
498 each of the heart DHS sets (observed and predicted) by randomizing their genomic
499 positions. We then calculated the average proportion of regions overlapping genomic
500 annotations and performed binomial tests using the average as a parameter p of the null
501 distribution.
502
503 To investigate the potential source of variation in CRE identification, we evaluated
504 additional genomic properties of the predicted DHSs: mappability, heterochromatin
505 DNA, and SNP frequencies (Supplemental Methods). To further validate our predictions,
23 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
506 we used CAGE based enhancers from the FANTOM5 (Functional ANnoTation Of the
507 Mammalian genome) project as an independent set (Lizio et al. 2015) and calculated the
508 proportion of the predicted DHSs overlapping these elements (>1bp overlap).
509
510 Predictive sequence feature analysis
511 To determine potential cardiac TFs, we identified TFs whose binding sites have
512 systematically higher gkm-SVM weights. Specifically, we first scored all non-redundant
513 11-mers (N = 411/2 = 2,097,152) using gkm-SVM. We averaged these SVM scores over
514 the five data sets to reduce the variance between biological replicates, and normalized
515 them to zero means. In parallel, we also identified 868 distinct motifs associated with
516 904 human TFs (Supplemental Data S4 and S5; Supplemental Methods). For each of
517 these 868 motifs, we found all 11-mers matching the motif, determined by FIMO (Bailey
518 et al. 2009; Grant et al. 2011) with default setting, and tested whether they were
519 enriched in the top 5% of the 11-mer score distribution. To resolve the issue that FIMO
520 cannot align k-mers to PWMs longer than the k-mers, we generated sets of shorter
521 PWMs of lengths between 8bp and 11bp tiling across full-length PWMs longer than 8bp.
522 We then defined “a hit” when a given 11-mer matched any of these PWMs. We tested the
523 null hypotheses that the expected number of the motif matching 11-mers found in the
524 top 5% scoring 11-mers is equal to 5% of all matching 11-mers, using a Poisson
525 distribution, and identified PWMs with the Bonferroni corrected P < 0.01. We note that
526 our motif identification is robust with respect to the choice of k-mers: identical analysis
527 using 10-mers and 12-mers yielded 446 and 514 motifs, respectively, among which 436
528 (96.1%) and 466 (94.6%) overlap the original 473 motifs with 11-mers.
24 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
529
530 Allele-biased heart DHSs
531 To evaluate deltaSVM predictions, we identified DNA sequence variants that directly
532 affected chromatin accessibility in cardiac tissues by using allele-biased mapping of
533 DNase-seq reads (Supplemental Methods). In each of the 17 heart DNase-seq data sets
534 (fetal and adult combined), we identified all heterozygous regions with at least 10 reads,
535 and calculated P-values of allele-biased DHSs using QuASAR (Harvey et al. 2015). We
536 combined P-values using Fisher’s method and identified 100 high-confidence allele-
537 biased DHS SNPs (P < 0.001 and observed in at least 3 samples with the same direction
538 of effect). For precision-recall analysis, we also identified a 4× larger control set of SNPs
539 that do not exhibit allele-specific alignment (combined P > 0.9 and observed in at least 4
540 samples) by random sampling. To estimate standard errors, we generated ten
541 independent control SNP sets.
542
543 deltaSVM analysis of variants within CREs
544 We adapted the deltaSVM method (Lee et al. 2015) to predict the regulatory effect of
545 variants within observed and predicted DHSs. We identified ~610,000 common
546 variants with at least 1% MAF in European ancestry 1KGP subjects residing within
547 cardiac CREs: of these, ~420,000 and ~190,000 were in observed and predicted CREs,
548 respectively. Next, we extracted 21 bp sequences centered at each of these CRE variants
549 from the reference genome (hg19) and calculated the SVM scores of both alleles using
550 the heart DHS gkm-SVM models. The final deltaSVM score is the average of the
551 differences in these scores from five different heart gkm-SVM models. These were
25 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
552 repeated using gkm-SVM models trained on heart restricted DHSs. ~15% of these
553 variants have potential cardiac regulatory effects given a deltaSVM cut-off that yields 40%
554 recall when predicting allele-biased heart DHS SNPs. Note that all of the deltaSVM
555 significant SNPs predicted in this study are, by definition, restricted to the heart CREs.
556 To investigate the relationship between deltaSVM predicted causal variants and their
557 effect on gene expression, we checked for their overlap with GTEx eQTLs (V6P),
558 restricting attention to eQTLs within ±50kbp of the associated genes, to increase
559 specificity. Using the variants in the cardiac DHSs, we tested for associations using the
560 χ2 test with the binary variables of whether the variants have significant deltaSVM
561 scores or not and whether they are eQTLs in a given tissue or not.
562
563 Data access
564 All sequencing reads from this study have been submitted to the NCBI Gene Expression
565 Omnibus (http://www.ncbi.nlm.nih.gov/geo) under accession number GSE104989.
566 Supplemental data and custom scripts are available in the GitHub repository at
567 https://github.com/Dongwon-Lee/heart_cre_map and in Supplemental Material.
568
569 Acknowledgements
570 We are grateful for the assistance of Dr. Jody Hooper in obtaining autopsy hearts and to
571 Dr. Dan Arking for providing summary statistics from his published meta-analysis of the
572 QT-interval GWAS. This study has benefitted greatly from advice and discussions with
573 Dr. Michael A. Beer, as well as constructive comments from the Chakravarti and
574 Crawford laboratories. The research reported here was supported by the computational
26 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
575 resources of the Maryland Advanced Research Computing Center (MARCC) and NIH
576 grants GM104469, HL086694 and HL128782.
577
578 Author contributions
579 D.L., A.K., and A.C. conceived and designed the study; M.K.H. collected the adult heart
580 tissues; A.S., L.S., and G.E.C. performed DNase-seq experiments; D.L. conducted all
581 computational analyses; and, D.L. and A.C. wrote the manuscript. All authors were
582 involved in manuscript revision.
583
584 Disclosure declaration: The authors declare that they have no competing interests.
585
586 References
587 Arking DE, Pulit SL, Crotti L, van der Harst P, Munroe PB, Koopmann TT, Sotoodehnia
588 N, Rossin EJ, Morley M, Wang X, et al. 2014. Genetic association study of QT
589 interval highlights role for calcium signaling pathways in myocardial
590 repolarization. Nat Genet 46: 826–836.
591 Asthana S, Noble WS, Kryukov G, Grant CE, Sunyaev S, Stamatoyannopoulos JA. 2007.
592 Widely distributed noncoding purifying selection in the human genome. Proc
593 Natl Acad Sci 104: 12410–12415.
594 Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble
595 WS. 2009. MEME SUITE: tools for motif discovery and searching. Nucleic Acids
596 Res 37: W202–W208.
27 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
597 Barski A, Cuddapah S, Cui K, Roh T-Y, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K.
598 2007. High-Resolution Profiling of Histone Methylations in the Human Genome.
599 Cell 129: 823–837.
600 Beer MA. 2017. Predicting enhancer activity and variant impact using gkm-SVM. Hum
601 Mutat 38: 1251–1258.
602 Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, Furey TS, Crawford
603 GE. 2008. High-Resolution Mapping and Characterization of Open Chromatin
604 across the Genome. Cell 132: 311–322.
605 Boyle EA, Li YI, Pritchard JK. 2017. An Expanded View of Complex Traits: From
606 Polygenic to Omnigenic. Cell 169: 1177–1186.
607 Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. 2013. Transposition of
608 native chromatin for fast and sensitive epigenomic profiling of open chromatin,
609 DNA-binding proteins and nucleosome position. Nat Methods 10: 1213–1218.
610 Bulik-Sullivan BK, Loh P-R, Finucane HK, Ripke S, Yang J, Schizophrenia Working
611 Group of the Psychiatric Genomics Consortium, Patterson N, Daly MJ, Price AL,
612 Neale BM. 2015. LD Score regression distinguishes confounding from
613 polygenicity in genome-wide association studies. Nat Genet 47: 291–295.
614 Chakravarti A, Turner TN. 2016. Revealing rate-limiting steps in complex disease
615 biology: The crucial importance of studying rare, extreme-phenotype families.
616 BioEssays 38: 578–586.
28 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
617 Chatterjee S, Kapoor A, Akiyama JA, Auer DR, Lee D, Gabriel S, Berrios C, Pennacchio
618 LA, Chakravarti A. 2016. Enhancer Variants Synergistically Drive Dysfunction of
619 a Gene Regulatory Network In Hirschsprung Disease. Cell 167: 355-368.e10.
620 Davidson EH. 2010. The Regulatory Genome: Gene Regulatory Networks In
621 Development And Evolution. Academic Press.
622 Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S. 2010. Identifying
623 a High Fraction of the Human Genome to be under Selective Constraint Using
624 GERP++. PLoS Comput Biol 6: e1001025.
625 Degner JF, Pai AA, Pique-Regi R, Veyrieras J-B, Gaffney DJ, Pickrell JK, De Leon S,
626 Michelini K, Lewellen N, Crawford GE, et al. 2012. DNase I sensitivity QTLs are
627 a major determinant of human expression variation. Nature 482: 390–394.
628 Dickel DE, Barozzi I, Zhu Y, Fukuda-Yuzawa Y, Osterwalder M, Mannion BJ, May D,
629 Spurrell CH, Plajzer-Frick I, Pickle CS, et al. 2016. Genome-wide compendium
630 and functional assessment of in vivo heart enhancers. Nat Commun 7: 12923.
631 Eppinga RN, Hagemeijer Y, Burgess S, Hinds DA, Stefansson K, Gudbjartsson DF, van
632 Veldhuisen DJ, Munroe PB, Verweij N, van der Harst P. 2016. Identification of
633 genomic loci associated with resting heart rate and shared genetic predictors with
634 all-cause mortality. Nat Genet 48: 1557–1563.
635 Finucane HK, Bulik-Sullivan B, Gusev A, Trynka G, Reshef Y, Loh P-R, Anttila V, Xu H,
636 Zang C, Farh K, et al. 2015. Partitioning heritability by functional annotation
637 using genome-wide association summary statistics. Nat Genet 47: 1228–1235.
29 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
638 Ghandi M, Lee D, Mohammad-Noori M, Beer MA. 2014. Enhanced Regulatory
639 Sequence Prediction Using Gapped k-mer Features. PLoS Comput Biol 10:
640 e1003711.
641 Gorkin DU, Lee D, Reed X, Fletez-Brant C, Bessling SL, Loftus SK, Beer MA, Pavan WJ,
642 McCallion AS. 2012. Integration of ChIP-seq and machine learning reveals
643 enhancers and a predictive regulatory sequence vocabulary in melanocytes.
644 Genome Res 22: 2290–2301.
645 Grant CE, Bailey TL, Noble WS. 2011. FIMO: scanning for occurrences of a given motif.
646 Bioinformatics 27: 1017–1018.
647 Harvey CT, Moyerbrailean GA, Davis GO, Wen X, Luca F, Pique-Regi R. 2015. QuASAR:
648 quantitative allele-specific analysis of reads. Bioinformatics 31: 1235–1242.
649 He HH, Meyer CA, Hu SS, Chen M-W, Zang C, Liu Y, Rao PK, Fei T, Xu H, Long H, et al.
650 2014. Refined DNase-seq protocol and data analysis reveals intrinsic bias in
651 transcription factor footprint identification. Nat Methods 11: 73–78.
652 Hoffmann TJ, Ehret GB, Nandakumar P, Ranatunga D, Schaefer C, Kwok P-Y, Iribarren
653 C, Chakravarti A, Risch N. 2017. Genome-wide association analyses using
654 electronic health records identify new loci influencing blood pressure variation.
655 Nat Genet 49: 54–64.
656 John S, Sabo PJ, Thurman RE, Sung M-H, Biddie SC, Johnson TA, Hager GL,
657 Stamatoyannopoulos JA. 2011. Chromatin accessibility pre-determines
658 glucocorticoid receptor binding patterns. Nat Genet 43: 264–268.
30 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
659 Johnson DS, Mortazavi A, Myers RM, Wold B. 2007. Genome-Wide Mapping of in Vivo
660 Protein-DNA Interactions. Science 316: 1497–1502.
661 Kapoor A, Sekar RB, Hansen NF, Fox-Talbot K, Morley M, Pihur V, Chatterjee S,
662 Brandimarto J, Moravec CS, Pulit SL, et al. 2014. An Enhancer Polymorphism at
663 the Cardiomyocyte Intercalated Disc Protein NOS1AP Locus Is a Major Regulator
664 of the QT Interval. Am J Hum Genet 94: 854–869.
665 Kreimer A, Zeng H, Edwards MD, Guo Y, Tian K, Shin S, Welch R, Wainberg M, Mohan
666 R, Sinnott-Armstrong NA, et al. 2017. Predicting gene expression in massively
667 parallel reporter assays: A comparative study. Hum Mutat 38: 1240–1250.
668 Lee D. 2016. LS-GKM: a new gkm-SVM for large-scale datasets. Bioinformatics 32:
669 2196–2198.
670 Lee D, Gorkin DU, Baker M, Strober BJ, Asoni AL, McCallion AS, Beer MA. 2015. A
671 method to predict the impact of regulatory variants from DNA sequence. Nat
672 Genet 47: 955–961.
673 Lee D, Karchin R, Beer MA. 2011. Discriminative prediction of mammalian enhancers
674 from DNA sequence. Genome Res 21: 2167–2180.
675 Lizio M, Harshbarger J, Shimoji H, Severin J, Kasukawa T, Sahin S, Abugessaisa I,
676 Fukuda S, Hori F, Ishikawa-Kato S, et al. 2015. Gateways to the FANTOM5
677 promoter level mammalian expression atlas. Genome Biol 16: 22.
31 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
678 Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, Reynolds AP,
679 Sandstrom R, Qu H, Brody J, et al. 2012. Systematic Localization of Common
680 Disease-Associated Variation in Regulatory DNA. Science 337: 1190–1195.
681 May D, Blow MJ, Kaplan T, McCulley DJ, Jensen BC, Akiyama JA, Holt A, Plajzer-Frick
682 I, Shoukry M, Wright C, et al. 2012. Large-scale discovery of enhancers from
683 human heart tissue. Nat Genet 44: 89–93.
684 Misteli T. 2001. Protein Dynamics: Implications for Nuclear Architecture and Gene
685 Expression. Science 291: 843–847.
686 Nakayama J, Rice JC, Strahl BD, Allis CD, Grewal SIS. 2001. Role of Histone H3 Lysine
687 9 Methylation in Epigenetic Control of Heterochromatin Assembly. Science 292:
688 110–113.
689 Narlikar L, Sakabe NJ, Blanski AA, Arimura FE, Westlund JM, Nobrega MA,
690 Ovcharenko I. 2010. Genome-wide discovery of human heart enhancers. Genome
691 Res 20: 381–392.
692 Nielsen R, Paul JS, Albrechtsen A, Song YS. 2011. Genotype and SNP calling from next-
693 generation sequencing data. Nat Rev Genet 12: 443–451.
694 Phillips-Cremins JE, Corces VG. 2013. Chromatin Insulators: Linking Genome
695 Organization to Cellular Function. Mol Cell 50: 461–474.
32 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
696 Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen
697 A, Heravi-Moussavi A, Kheradpour P, Zhang Z, Wang J, et al. 2015. Integrative
698 analysis of 111 reference human epigenomes. Nature 518: 317–330.
699 Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier
700 B, Varhol R, Delaney A, et al. 2007. Genome-wide profiles of STAT1 DNA
701 association using chromatin immunoprecipitation and massively parallel
702 sequencing. Nat Meth 4: 651–657.
703 Schmitges FW, Radovani E, Najafabadi HS, Barazandeh M, Campitelli LF, Yin Y, Jolma
704 A, Zhong G, Guo H, Kanagalingam T, et al. 2016. Multiparameter functional
705 diversity of human C2H2 zinc finger proteins. Genome Res 26: 1742–1752.
706 Song L, Crawford GE. 2010. DNase-seq: A High-Resolution Technique for Mapping
707 Active Gene Regulatory Elements across the Genome from Mammalian Cells.
708 Cold Spring Harb Protoc 2010.
709 http://cshprotocols.cshlp.org/content/2010/2/pdb.prot5384 (Accessed January
710 14, 2015).
711 Song L, Zhang Z, Grasfeder LL, Boyle AP, Giresi PG, Lee B-K, Sheffield NC, Gräf S, Huss
712 M, Keefe D, et al. 2011. Open chromatin defined by DNaseI and FAIRE identifies
713 regulatory elements that shape cell-type identity. Genome Res 21: 1757–1767.
714 Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, Downey P, Elliott P, Green
715 J, Landray M, et al. 2015. UK Biobank: An Open Access Resource for Identifying
33 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
716 the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLOS
717 Med 12: e1001779.
718 The 1000 Genomes Project Consortium. 2015. A global reference for human genetic
719 variation. Nature 526: 68–74.
720 the CARDIoGRAMplusC4D Consortium. 2015. A comprehensive 1000 Genomes-based
721 genome-wide association meta-analysis of coronary artery disease. Nat Genet 47:
722 1121–1130.
723 The ENCODE Project Consortium. 2012. An integrated encyclopedia of DNA elements
724 in the human genome. Nature 489: 57–74.
725 The GTEx Consortium. 2015. The Genotype-Tissue Expression (GTEx) pilot analysis:
726 Multitissue gene regulation in humans. Science 348: 648–660.
727 Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, Sheffield NC,
728 Stergachis AB, Wang H, Vernot B, et al. 2012. The accessible chromatin
729 landscape of the human genome. Nature 489: 75–82.
730 Tomaselli GF, Beuckelmann DJ, Calkins HG, Berger RD, Kessler PD, Lawrence JH, Kass
731 D, Feldman AM, Marban E. 1994. Sudden cardiac death in heart failure. The role
732 of abnormal repolarization. Circulation 90: 2534–2539.
733 Wang X, Tucker NR, Rizki G, Mills R, Krijger PH, Wit E de, Subramanian V, Bartell E,
734 Nguyen X-X, Ye J, et al. 2016. Discovery and validation of sub-threshold genome-
735 wide association study loci using epigenomic signatures. eLife 5: e10557.
34 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
736 Weirauch MT, Yang A, Albu M, Cote AG, Montenegro-Montero A, Drewe P, Najafabadi
737 HS, Lambert SA, Mann I, Cook K, et al. 2014. Determination and Inference of
738 Eukaryotic Transcription Factor Sequence Specificity. Cell 158: 1431–1443.
739 Xi H, Shulha HP, Lin JM, Vales TR, Fu Y, Bodine DM, McKay RDG, Chenoweth JG,
740 Tesar PJ, Furey TS, et al. 2007. Identification and Characterization of Cell Type–
741 Specific and Ubiquitous Chromatin Regulatory Structures in the Human
742 Genome. PLOS Genet 3: e136.
743 Yang J, Manolio TA, Pasquale LR, Boerwinkle E, Caporaso N, Cunningham JM,
744 Andrade M de, Feenstra B, Feingold E, Hayes MG, et al. 2011. Genome
745 partitioning of genetic variation for complex traits using common SNPs. Nat
746 Genet 43: 519–525.
747 Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nussbaum C, Myers
748 RM, Brown M, Li W, et al. 2008. Model-based Analysis of ChIP-Seq (MACS).
749 Genome Biol 9: R137.
750
751 Figure legends
752 Figure 1. A machine learning algorithm (gkm-SVM) accurately predicts cis-
753 regulatory elements. (A) An example of heart DNase-seq signals (raw data) and
754 peaks (MACS2) at the GATA4 locus across multiple human samples. (B) A genome-
755 wide heatmap of DNase-seq read densities in 1,000 bp windows centered at heart DHSs.
756 Randomly sampled 1,000 regions were used. Regions were grouped based on the
757 configuration of the DHS peaks across the five samples with at least one observed DHS.
35 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
758 (C) ROC curves of gkm-SVM models for five replicates against reserved test sets. (D)
759 Comparisons of the fraction of DHSs overlapping predicted regions (precision) and
760 fraction of predicted regions overlapping DHSs (recall).
761
762 Figure 2. Learned sequence features predicts additional CREs. (A) Venn
763 diagram of observed and predicted DHSs, and their proportions in non-cardiac cell-
764 types. (B) Heatmap of DHS signal intensities in heart-related cell types and tissues at
765 observed (left) and predicted (right) DHSs. Randomly sampled 5,000 regions from
766 observed and predicted DHSs were used. Observed DHSs are mostly located in
767 ubiquitously open chromatin regions in cardiac relevant cells and tissues, while
768 predicted DHSs show greater cell-type specificity.
769
770 Figure 3. Learned sequence features identifies cardiac TFs. (A) SVM weight
771 distributions of PWM-matched 11-mers for the top two scoring TFs (CTCF and MEF2A),
772 five well-known cardiac-specific TFs (GATA4, NKX2-5, HAND1, HAND2, SRF, and
773 HEY2), and a cardiac-irrelevant TF (POU5F1). The fifth percentile is shown as a green
774 line. The enrichment is defined as the fraction of TFBS matching 11-mers in the top 5th
775 percentile of all 11-mers compared to the expected fraction (5%). (B) 2x2 contingency
776 tables comparing TFs enriched in heart DHSs with TFs expressed in heart left ventricles
777 (top) and atrial appendages (bottom). (C, D) -log10(P) of one-sided Fisher’s exact test
778 for every tissue tested using all TFs (C) and after removing commonly-expressed TFs
779 (D); the two tissues from adult heart (atrial appendage and left ventricle) are
36 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
780 highlighted in red with the Bonferroni corrected P-value threshold (0.05) shown as a
781 dashed line.
782
783 Figure 4. A list of 78 TFs enriched in heart DHSs as well as selectively
784 expressed in cardiac tissues. The 78 TFs were grouped based on their DNA binding
785 domain families. These PWMs were also clustered based on their sequence specificity.
786 For each TF, cluster number (Clus), name, gene expression (FPKM) in left ventricle, fold
787 enrichment (nfolds), and PWM are shown, respectively.
788
789 Figure 5. Identification of common, cardiac-specific regulatory sequence
790 variants. (A) deltaSVM scores from the cardiac-specific model as compared to allele-
791 biased chromatin accessibility (C: Pearson correlation coefficient, n: number of variants,
792 P: t-distribution P-value). (B) Precision-recall curves of deltaSVM scores of allele-
793 biased DHSs against 4× larger control SNP sets; the dashed green line indicates the
794 recall rate 40%. Error bars are the standard errors calculated from ten independently
795 sampled control SNP sets. (C, D) Statistical significance (P) of deltaSVM SNPs from the
796 generic (C) and specific (D) cardiac models as compared to eQTLs using χ2 test; The two
797 heart tissues are highlighted (AA: Heart atrial appendages, LV: Heart left ventricles);
798 The Bonferroni corrected P-value threshold (0.05) is shown as a dashed line.
799
800 Figure 6. Cardiac regulatory variants explain QT-interval heritability. (A) Q-
801 Q plots of QTi GWAS results using different subsets of common genome-wide sequence
802 variants (numbers of variants in parentheses); The dashed gray line indicates the
37 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
803 genome-wide significance threshold (P < 5 × 10-8). (B) Comparison of enrichment P-
804 values between ten pre-defined cell-type group functional annotations. (C) Enrichment
2 805 values, estimated as the fraction of the heritability (hG ) explained by variants over the
806 fraction of the SNPs, for various heart DHS-based annotations (Cadiovascular: the
807 predefined cell-type group functional annotation for cardiovascular tissues, H3K27ac:
808 H3K27ac marks in heart, Dickel: heart CRE map from Dickel et al. 2016, obsDHS 600bp:
809 observed DHSs, obsDHS 1kb/2kb: observed DHSs with 1kb/2kb extension, allDHS 2kb:
810 observed and predicted DHSs with 2kb extension, dSVM: deltaSVM predicted variants);
811 a combination of multiple annotations indicates intersection of them. For example,
812 “obsDHS 2kbp & dSVM” means a set of variants predicted by deltaSVM and overlapping
813 observed DHSs with 2kb extension; the proportion of SNPs are in parenthesis; Error
814 bars denote standard errors estimated by a block Jackknife method; no enrichment is
815 shown as a dashed line.
38 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
A 11,520 kb 11,540 kb 11,560 kb 11,580 kb 11,600 kb B
Chaklab1 0 2 4 6
log2 (DHS reads)
Chaklab2
Chaklab 1 Chaklab 2 Encode 1 Encode 2 Roadmap ENCODE1
ENCODE2 n=5
Roadmap
Combined n=4
GATA4 n=3 C D Chaklab 1
Chaklab 2 n=2 Encode 1
Encode 2 Chaklab 1 (0.917) Roadmap Chaklab 2 (0.908) Combined Encode 1 (0.866) Precision Sensitivity Encode 2 (0.889) Roadmap (0.919) n=1 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 1-Specificity Recall Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
A Observed DHSs
0.3% Heart related DHSs 85,990 5.0% Other DHSs 94.7% Fetal Heart tissues None Vascular endothelial cells 76,445 Muscle tissues
Myoblasts 11.6% Heart related DHSs 88,026 Myosatellites 30.7% 57.7% Other DHSs Myocytes None Predicted DHSs Fibroblasts B Smooth muscle cells
0 40 80 0 40 80 Percentile Percentile Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
A CTCF MEF2A GATA4 NKX2−5 (n = 14,173) (n = 5,142) (n = 1,280) (n = 1,280) All 11−mers All 11−mers All 11−mers All 11−mers Density Density Density Density 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0
−1 0 1 2 3 −1 0 1 2 3 −1 0 1 2 3 −1 0 1 2 3 SVM weights SVM weights SVM weights SVM weights
HAND1/2 SRF HEY2 POU5F1 (n = 5,732) (n = 7,193) (n = 3,580) (n = 6,253) All 11−mers All 11−mers All 11−mers All 11−mers Density Density Density Density 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0
−1 0 1 2 3 −1 0 1 2 3 −1 0 1 2 3 −1 0 1 2 3 SVM weights SVM weights SVM weights SVM weights C B Heart Left Ventricle )
FPKM<1 FPKM>1 Total Fisher P ( Not 10
261 137 398 log enriched −
Enriched 172 334 506 0 5 10 15 20 Total 433 471 904 Ovary Lung Liver Testis SpleenThyroid Uterus Vagina Pituitary Prostate Bladder -21 Stomach Brain−SN Pancreas Brain−ACC P < 1.82 × 10 (Fisher’s exact) Artery−aorta Brain−NAcc Whole blood Artery−Nervetibial −tibial Brain−cortex Adrenal gland Brain−caudate Colon−sigmoid Fallopian tube Kidney−cortex Artery−coronary BrainBrain−amygdala−putamen Muscle−skeletal Esophagus−GEJ Adipose−visceral Brain−spinal cord Colon−Braintransverse−cerebellum Heart−left ventricle CervixSkin−ectocervix−sun exposed Cervix−endocervix Minor salivaryBrain−hippocampus gland Brain−frontal Esophaguscortex −mucosa Brain−hypothalamus Skin−not sun exposed Heart−atrial appendage Esophagus−muscularis Breast−mammary tissueAdipose−subcutaneous Transformed fibroblasts Heart Atrial Appendage Brain−cerebellar hemisphere D EBV transformed lymphocytes Small intestine−terminal ileum FPKM<1 FPKM>1 Total
Not ) 262 136 398 enriched Fisher P (
Enriched 167 339 506 10 log − Total 429 475 904
P < 4.43 × 10-23 (Fisher’s exact) 0 2 4 6 8
Ovary Lung Liver Testis SpleenThyroid Uterus Vagina Pituitary Prostate Bladder Stomach Brain−SN Pancreas Brain−ACC Artery−aorta Brain−NAcc Whole blood Artery−Nervetibial −tibial Brain−cortex Adrenal gland Brain−caudate Colon−sigmoid Fallopian tube Kidney−cortex Artery−coronary BrainBrain−amygdala−putamen Muscle−skeletal Esophagus−GEJ Adipose−visceral Brain−spinal cord Colon−Braintransverse−cerebellum Heart−left ventricle CervixSkin−ectocervix−sun exposed Cervix−endocervix Minor salivaryBrain−hippocampus gland Brain−frontal Esophaguscortex −mucosa Brain−hypothalamus Skin−not sun exposed Heart−atrial appendage Esophagus−muscularis Breast−mammary tissueAdipose−subcutaneous Transformed fibroblasts
Brain−cerebellar hemisphere EBV transformed lymphocytes Small intestine−terminal ileum Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
Zinc Finger Nuclear receptor Clus Name FPKM nfolds PWM Clus Name FPKM nfolds PWM 2 ZNF71 1.80 2.30 13 NR4A3 2.83 2.44 3 ZNF563 1.47 3.14 13 RXRG 1.90 2.42 3 ZBTB42 3.03 2.06 13 ESRRG 3.02 2.23 5 BCL11A 1.47 4.77 13 ESRRB 1.66 2.01 6 KLF5 1.11 6.46 13 RARB 4.87 1.99 6 WT1 1.12 4.56 13 THRB 2.69 1.62 6 SP4 1.24 3.85 13 PPARG 5.05 1.60 6 EGR3 1.03 3.37 18 AR 3.64 2.64 6 KLF8 1.82 2.46 35 PGR 1.91 1.89 6 ZFY 1.38 1.82 SOX 6 ZNF684 2.11 1.52 Clus Name FPKM nfolds PWM 6 RREB1 5.69 1.36 9 SOX15 1.29 4.50 6 PLAG1 1.82 1.33 9 SOX8 1.09 3.79 7 SNAI3 1.41 2.58 9 SOX9 6.59 3.40 8 OSR1 1.42 1.40 9 SOX6 1.20 2.20 12 ZNF264 1.11 1.60 9 SOX17 6.64 1.44 17 ZNF586 2.14 1.98
21 ZNF418 1.88 1.33 bZIP 24 BCL6B 5.84 1.41 Clus Name FPKM nfolds PWM 25 ZNF582 1.25 1.46 1 NFE2L3 1.54 4.22 26 REST 3.70 2.46 11 CREB5 2.24 9.64 26 ZNF415 4.67 1.58 11 CREB3L42.13 7.09 26 ZNF594 1.60 1.46 11 CREB3L113.46 2.98 27 ZNF549 1.06 1.38 30 ZNF708 1.20 1.86 Homeodomain 33 ZNF423 1.21 1.50 Clus Name FPKM nfolds PWM 38 ZNF250 1.07 1.73 3 PKNOX2 1.88 2.32 38 ZSCAN311.52 1.40 7 MEIS2 5.76 3.04 38 ZNF547 1.05 1.32 7 MEIS1 4.78 2.48 10 NKX2−5 112.36 2.03 bHLH Clus Name FPKM nfolds PWM GATA 3 MSC 5.44 5.82 Clus Name FPKM nfolds PWM 3 TCF21 3.01 4.60 20 GATA4 45.54 2.87 3 ATOH8 7.64 2.33 20 GATA6 26.82 1.86 3 TAL1 2.57 2.13 20 GATA2 7.81 1.44 3 LYL1 4.04 2.13 ETS 3 TCF15 9.91 1.92 Clus Name FPKM nfolds PWM 4 MLXIPL 1.36 4.82 5 ERG 4.08 10.56 4 MITF 10.72 3.38 5 ETV1 5.66 9.60 4 HEY1 9.30 2.99 5 ELF4 2.42 6.73 4 HEYL 22.67 2.71 4 HEY2 20.23 2.59 Others 4 MYC 8.22 1.81 Clus Name FPKM nfolds PWM 6 ARNT2 1.17 1.75 9 FOXS1 2.61 2.01 17 EBF1 3.20 2.76 15 NFATC1 3.96 2.34 32 HAND2 31.92 2.34 24 STAT4 1.25 2.22 32 HAND1 19.47 2.34 25 TEAD4 5.78 2.53 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
A B C = 0.636, n = 100 specific (0.52) ● P = 5.7e−13 ● generic (0.46) ● ● ●
● ● ● random (0.20) ● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●●● ● ●● ● ●● ● ● ●● ● ● ● ●●●● ● ● ●● ● ● ●● ● ●● ●● ●● ● ● ● ● ● ● ●● ●● ● ● ●● ● ●●● ●
− 1 0 1● 2 3
● ● Average precision Average ●
● − 2 deltaSVM w/ specific model − 3 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Fraction of reads mapping to alternate alleles Recall
C Generic model D Specific model
● ● AA LV ● ● ●
● ● ● ● ● ● ● LV ● ● ● ) ● )
P AA P ● ● ● ● ●
( ( ● ● ● ●
10 10 ● ●
● ● ● ● ● ● ● ●
− log ● − log ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 0 1 2 3 4 5 6 7
0 10000 30000 50000 0 10000 30000 50000
# of eQTLs in heart DHSs # of eQTLs in heart DHSs Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
A ● B ●●● Cardiovascular ● all HapMap3 (n=2,710,220) ● SkeletalMuscle ● ● ● heart DHSs (n=191,782) ●● ● Hematopoietic ● ●
) ● deltaSVM generic (n=24,708) GI ● deltaSVM specific (n=29,125) Liver observed ● ● P (
10 Other
log Kidney −
● Adrenal Pancreas CNS Connective Bone 0 10 20 30 40
0 1 2 3 4 0 1 2 3 4
− log10(P expected) − log10(P) C (Proportion of h2g)/(Proportion of SNPs) 0 10 20 30 40
Dickel ( 9.3%) H3K27ac ( 7.7%) allDHS 2kbp (11.8%) Cardiovascular (11.1%) obsDHS 600bpobsDHS ( 4.2%)obsDHS 1kbp ( 6.4%) 2kbp (11.4%) obsDHS 2kbpallDHS & dSVM 2kbp &( 1.9%)dSVM ( 2.0%) obsDHS 2kbpallDHS & H3K27ac 2kbp & H3K27ac ( 5.8%) ( 6.2%)
obsDHS 2kbpallDHS & H3K27ac 2kbp & H3K27ac & dSVM &( 0.9%)dSVM ( 1.0%) Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press
Human cardiac cis-regulatory elements, their cognate transcription factors, and regulatory DNA sequence variants
Dongwon Lee, Ashish Kapoor, Alexias Safi, et al.
Genome Res. published online August 23, 2018 Access the most recent version at doi:10.1101/gr.234633.118
Supplemental http://genome.cshlp.org/content/suppl/2018/08/31/gr.234633.118.DC1 Material
P
Accepted Peer-reviewed and accepted for publication but not copyedited or typeset; accepted Manuscript manuscript is likely to differ from the final, published version.
Open Access Freely available online through the Genome Research Open Access option.
Creative This manuscript is Open Access.This article, published in Genome Research, is Commons available under a Creative Commons License (Attribution-NonCommercial 4.0 License International license), as described at http://creativecommons.org/licenses/by-nc/4.0/. Email Alerting Receive free email alerts when new articles cite this article - sign up in the box at the Service top right corner of the article or click here.
Advance online articles have been peer reviewed and accepted for publication but have not yet appeared in the paper journal (edited, typeset versions may be posted when available prior to final publication). Advance online articles are citable and establish publication priority; they are indexed by PubMed from initial publication. Citations to Advance online articles must include the digital object identifier (DOIs) and date of initial publication.
To subscribe to Genome Research go to: https://genome.cshlp.org/subscriptions
Published by Cold Spring Harbor Laboratory Press