Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

1 Human cardiac cis-regulatory elements, their cognate transcription factors

2 and regulatory DNA sequence variants

3

4 Dongwon Lee1,†,*, Ashish Kapoor1,‡, Alexias Safi2, Lingyun Song2, Marc K. Halushka3,

5 Gregory E. Crawford2, Aravinda Chakravarti1,†,*

6

7 1McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of

8 Medicine, Baltimore, MD 21205, United States; 2Department of Pediatrics and Center

9 for Genomic & Computational Biology, Duke University Medical Center, Durham, NC

10 27708, United States; 3Department of Pathology, Johns Hopkins University School of

11 Medicine, Baltimore, MD 21205, United States.

12

13 †Current address: Center for Human Genetics & Genomics, New York University School

14 of Medicine, New York, NY, USA

15 ‡Current address: University of Texas Health Science Center at Houston, Houston, TX,

U SA 16 USA

17

18 Running title: A human cardiac cis-regulatory map

19

20 Keywords: cis-regulatory elements, transcription factors, regulatory variants, QT

21 interval, heritability

22

23 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

24 *Send all correspondence to:

25

26 Aravinda Chakravarti, Ph.D. & Dongwon Lee, Ph.D.

27 Center for Human Genetics and Genomics

28 New York University School of Medicine

29 435 East 30th Street, SM Room 803/4

30 New York, NY 10016, USA

31 (T) 212-263-8023

32 (E) [email protected], [email protected]

2 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

33 Abstract

34 cis-regulatory elements (CRE), short DNA sequences through which transcription

35 factors (TF) exert regulatory control on expression, are postulated to be the major

36 sites of causal sequence variation underlying the genetics of complex traits and diseases.

37 We present integrative analyses, combining high-throughput genomic and epigenomic

38 data with sequence-based computations, to identify the causal transcriptional

39 components in a given tissue. We use data on adult human hearts to demonstrate that a)

40 sequence-based predictions detect numerous, active, tissue-specific CREs missed by

41 experimental observations, b) learned sequence features identify the cognate TFs, c)

42 CRE variants are specifically associated with cardiac , and, d) a

43 significant fraction of the heritability of exemplar cardiac traits (QT interval, blood

44 pressure, pulse rate) is attributable to these variants. This general systems approach can

45 thus identify candidate causal variants and the components of gene regulatory networks

46 (GRN) to enable understanding of the mechanisms of complex disorders at a tissue or

47 cell-type basis.

3 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

48 Introduction

49 The majority of sequence variation affecting inter-individual complex disease risk

50 is common and noncoding, in contrast to the rare coding variation underlying

51 Mendelian disorders (Chakravarti and Turner 2016). These findings are consistent with

52 the intense natural selection on coding sequence variation and the much weaker

53 selection on noncoding variation (Asthana et al. 2007). Most complex traits arise from

54 multi-genic effects with identified risk alleles having small effects so that no individual

55 effect is either necessary or sufficient (Chakravarti and Turner 2016). Thus, there must

56 be considerable redundancy in non-coding functions. The best candidates for these non-

57 coding functions are cis-regulatory elements (CREs) of gene expression because they are

58 the primary agents of regulatory control, and affect disease risk through altered gene

59 expression (Davidson 2010; Maurano et al. 2012; Chatterjee et al. 2016; Phillips-

60 Cremins and Corces 2013).

61

62 Understanding the regulatory genomic architecture is, therefore, important for

63 elucidating the etiologies of complex disorders, particularly at the tissue- and cell-type

64 levels. This involves identifying the numbers of TFs and enhancers engaged, their

65 genomic distribution and the feedback mechanisms required for expression control of

66 each gene, namely, the components of its gene regulatory network (GRN) (Davidson

67 2010). Additionally, we need to quantify the effect of sequence variation within the GRN

68 on gene expression. These questions can now be answered using high-throughput ChIP-

69 seq (Barski et al. 2007; Johnson et al. 2007; Robertson et al. 2007), DNase-seq (Boyle et

70 al. 2008; John et al. 2011), and ATAC-seq (Buenrostro et al. 2013) experimental data

4 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

71 from the ENCODE (The ENCODE Project Consortium 2012) and Roadmap epigenomics

72 projects (Roadmap Epigenomics Consortium et al. 2015), on many cell types, tissues,

73 developmental times and states, in conjunction with human sequence variation (1KGP)

74 (The 1000 Genomes Project Consortium 2015) and its effects on gene expression across

75 human tissues (GTEx) (The GTEx Consortium 2015). The fulcrum of a GRN are its

76 enhancers (CREs), however, their complete experimental detection is not possible

77 because CRE activity is affected by many factors (Misteli 2001; Degner et al. 2012; He et

78 al. 2014): (1) functional strength is variable across enhancers, (2) activity is stochastic

79 across cells, (3) detected activity varies by experimental protocol and sample, and, (4)

80 resolution of a CRE’s genomic location from current data is approximate. In principle,

81 ChIP-seq data are superior but the general lack of TF-specific antibodies is a major

82 roadblock to this approach. We solve these major impediments instead by using a

83 generalizable genome sequence-based machine learning approach (Lee et al. 2011;

84 Ghandi et al. 2014; Lee et al. 2015). This method learns the short genome sequence

85 features that maximally discriminate sequences underlying CREs from random genome

86 sequences. The learned models are then used to identify additional CREs with similar

87 sequence features, including combinations of binding sites, that

88 failed detection through experiments.

89

90 We focus here on identifying the GRN ‘parts list’ of the adult human heart, and

91 the effects of CRE genetic variation on an exemplary cardiac trait, the

92 electrocardiographic QT interval (QTi), causal to sudden cardiac death and other

93 conduction disorders (Tomaselli et al. 1994). We show the generalizability of our results

5 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

94 by using these cardiac enhancers to explain phenotypic variation in systolic/diastolic

95 blood pressure and pulse rate but not BMI. We focus on the heart because of the

96 extensive genome-wide association study (GWAS) literature implicating CRE variation

97 in many cardiovascular diseases and phenotypes (Arking et al. 2014; Kapoor et al. 2014;

98 the CARDIoGRAMplusC4D Consortium 2015; Eppinga et al. 2016), but also because the

99 QTi is likely affected by the cardiac system alone. A number of human cardiac CRE

100 maps exist (May et al. 2012; Narlikar et al. 2010) with a recent study identifying distal

101 heart enhancers based on H3K27ac and EP300/CREBBP profiles (Dickel et al. 2016). At

102 least one study has also used DNase-seq and histone modification ChIP-seq data for

103 heart tissues to identify functional regulatory variants associated with electrocardiogram

104 traits (Wang et al. 2016). Our study improves on these maps by comprehensive

105 identification of CREs. In addition, we identify their corresponding TFs and analyze the

106 effects of genetic variants within these heart CREs for heart-related traits.

107

108 Results

109 Experimental detection of cardiac CREs is incomplete

110 A schematic overview of our approach is shown in Fig. S1. We first performed

111 DNase-seq to identify open chromatin regions through DNase I hypersensitive sites

112 (DHSs) as potential CREs in two adult human hearts (left ventricles) together with three

113 publicly available heart DNase-seq data sets (Fig. 1A). Peak calling with MACS2 (Zhang

114 et al. 2008) identified varying numbers of DHSs (50,000 ~ 110,000) across the five

115 samples (Methods) (Supplemental Fig. S2A). We defined DHSs as 600bp regions

116 centered at the identified summits. DHSs frequently cluster in small regions, and

6 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

117 22~32% of extended DHSs overlap their neighbors forming larger DHSs with multiple

118 summits (Supplemental Fig. S2B). We next compared heart DHSs with other tissue

119 DHSs using the top 50,000 DHSs from each sample and the Jaccard index (number of

120 bases in the intersection over the union for each pair). These comparisons show higher

121 similarity (~50%) between the adult cardiac samples than with other tissues (~30%),

122 including fetal heart (Methods) (Supplemental Fig. S2C). Thus, DHS maps do uncover

123 tissue-relevant CREs although many regions are open across many tissues. Nevertheless,

124 technical and biological variation affect the adult cardiac DHS maps. To maximally

125 detect CREs, we aggregated all DHSs across the five adult heart replicates and identified

126 ~160,000 distinct regions (“observed DHSs”) covering ~4% of the genome. Using

127 multiple samples is important because ~40% of DHSs were detected only once across

128 the replicates while only ~22% detected across all five (Fig. 1B).

129

130 Sequence-based models can predict additional CREs

131 To elucidate the sequence code for these cardiac CREs, we used the gapped k-mer

132 support vector machines (SVM) method, gkm-SVM (Ghandi et al. 2014), widely

133 accepted as one of the state-of-the-art sequence-based methods (Kreimer et al. 2017).

134 Also, extracting predictive sequence features from gkm-SVM is easier than from other

135 methods. We used an improved software LS-GKM (Lee 2016) that enabled training on

136 larger data sets (~100,000 DHSs) (Supplemental Fig. S3A). The gkm-SVM attained

137 considerable accuracy in classifying CREs from non-CREs on reserved 9

138 data (Fig. 1C) (Methods). Of note, the choice of a specific chromosome did not affect the

139 method’s performance as its accuracy was unchanged with 5-fold cross validation

7 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

140 (Supplemental Fig. S3B). Importantly, our model trained on a single data set can predict

141 DHSs observed in the other data sets and systematically assigns higher scores to DHSs

142 experimentally missed than random genomic regions, demonstrating that sequence-

143 based predictions do indeed identify true DHSs (Supplemental Methods; Supplemental

144 Fig. S4A).

145

146 The number of samples used for model building determines a model’s accuracy.

147 Our tests show that using all 5 samples identified an additional 5~10% true DHSs than

148 any single sample although the maximum accuracy was rapidly attained (Supplemental

149 Methods; Supplemental Fig. S4B). Further, the randomness of negative sets for training

150 is also a major source of variation. To minimize this effect we averaged ten different

151 models, trained on independently generated negative sets, and used the combined

152 model for DHS prediction at balanced precision and recall rates (~43%) (Fig. 1D)

153 (Methods).

154

155 Scanning the entire genome yielded an additional 88,026 distinct “predicted”

156 heart DHSs after excluding regions that overlapped observed ones (Fig. 2A) (Methods).

157 Thus, these predicted DHSs do not suffer from overfitting by construction. For

158 validations, we used several functional annotations to compare them to observed

159 regions. First, we assessed sequence conservation against randomly permuted regions

160 (Supplemental Fig. S5A): both observed DHSs and, to a lesser extent, predicted DHSs

161 significantly overlap genomic conserved elements (Davydov et al. 2010) (Binomial test,

162 P < 2.2 × 10-16). Second, we asked how frequently predicted DHSs were in open

8 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

163 chromatin in other tissues. We defined two different DHS sets: (a) a universal set by

164 combining DHSs from all ENCODE and Roadmap data sets, except from adult heart,

165 and (b) a heart-related set of DHSs from cardiac-related samples only (Methods). Both

166 observed and predicted DHSs significantly overlapped both DHS classes (Supplemental

167 Fig. S5B, C): 94.7% of observed and 57.7% of predicted DHSs are open in heart-related

168 tissues with an additional 30.7% in other cell-types (Fig. 2A). Third, we compared

169 H3K27ac histone modification marks in heart tissues to show the same pattern: 52.3%

170 of observed but only 10.9% of predicted DHSs overlapped H3K27ac marked regions

171 (Supplemental Fig. S5D). Yet, we identified ~7,500 additional regions, under the

172 stringent criteria that predicted regions overlap H3K27ac marks and are heart-related

173 DHSs, not detected by DNase-seq alone. Predicted DHSs are largely restricted to

174 specific tissues and cells (Fig. 2B; Supplemental Fig. S5E) while nearly half of observed

175 DHSs are open across many different non-cardiac cell-types. Moreover, predicted DHSs

176 show systematically weaker DHS signals than observed DHSs (Supplemental Fig. S5F).

177

178 Varying degrees of overlap with functional annotations led us to further

179 investigate potential sources of variation in CRE identification. We divided predicted

180 DHSs into four distinct categories based on overlap: heart H3K27ac histone

181 modifications and heart related DHSs (tier 1); heart related DHSs only (tier 2); universal

182 DHSs only (tier 3); and, the remainder (tier 4). For each of these categories, we

183 considered the (1) proportion of ambiguous bases, (2) average SNP (single nucleotide

184 polymorphism) frequency per base (common and rare SNPs, 1% MAF as a threshold), (3)

185 proportional overlap with H3K9me3 histone modifications in heart tissues, a

9 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

186 representative heterochromatin mark (Nakayama et al. 2001), and (4) proportional

187 overlap with FANTOM5 enhancers (Lizio et al. 2015) (Methods). As a positive control,

188 we included observed DHSs for comparison (Supplemental Table S1). First, the

189 predicted DHSs in the lower tiers (3 and 4) have poorer mappability: predicted DHSs in

190 tier 4 have 2~3.5 times more ambiguous bases than observed DHSs, depending on read

191 length. Since the functional annotations (H3K27ac and heart related DHSs) were also

192 identified by sequencing, the variability in CRE detection can be explained at least in

193 part by differential mappability. SNP frequencies also follow the expected pattern;

194 higher SNP frequency in the lower tiers (2 & 3) imply that these predicted DHSs are

195 more variable making read mapping difficult. Predicted DHSs in tier 4 show the lowest

196 SNP frequency because their extremely poor mappability reduces SNP identification

197 (Nielsen et al. 2011). DHSs predicted in the lower tiers are more enriched in

198 heterochromatin, suggesting that chromatin organization, a feature difficult to predict

199 using sequence-based models only, influences CRE detection as well. Lastly, we used

200 FANTOM5 enhancers (Lizio et al. 2015) as an orthogonal enhancer validation set and

201 compared them to predicted DHSs. Significant proportions of predicted DHSs in tiers 1,

202 2 and 3 (12.3%, 5.9%, and 2.7%) were FANTOM5 enhancers (Binomial test, P < 2.2 × 10-

203 16 for all cases) but tier 4 had only 0.5% FANTOM5 enhancers (P < 0.97).

204

205 Another potential cause of variation in CRE detection lies in peak calling, with

206 some predicted DHSs being missed being below our detection threshold. Thus, we

207 recalled DHS peaks with relaxed thresholds (False Discovery Rate<0.1, 0.15, 0.2 and

208 0.25) and identified larger numbers of DHSs (193k, 205k, 242k, and 280k, respectively).

10 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

209 We calculated the proportion of the original predicted DHSs overlapping these newly

210 identified DHSs (Supplemental Table S2) to show that predicted DHSs, especially in the

211 higher tiers, overlap many of these less stringent DHSs. Thus, a significant fraction of

212 predicted DHSs are likely true. Taken together, our results suggest that sequence-based

213 predictions of an additional ~55% cardiac CREs is highly complementary to their

214 experimental detection, and necessary for obtaining comprehensive CRE maps.

215

216 Sequence-based models can also predict cardiac TFs

217 We next tested whether the sequence features of gkm-SVM allow identification of

218 the cognate TFs through their TF binding sequences (TFBS), in the corresponding

219 tissue/cell types (Lee et al. 2011; Gorkin et al. 2012). First, we used the distribution of all

220 11-mer weights in the SVM model and compared it to those that match TFBSs for known

221 heart TFs. Indeed, TFBSs for CTCF and MEF2A as well as other known cardiac factors

222 are significantly enriched in the top 5th percentile of the SVM weight distribution, while

223 an exemplar non-cardiac factor such as POU5F1 is not (Fig. 3A). Thus, to search for

224 cardiac TFs, we used the CIS-BP database (Weirauch et al. 2014) augmented by 93 new

225 position weight matrices (PWMs) of C2H2 Zinc-Finger (ZF) TFs (Schmitges et al. 2016),

226 to show that 11-mers matching 54% of PWMs (473/868) associated with 506 TFs were

227 enriched in the top 5th percentile of the distribution (Supplemental Data S1). Expectedly,

228 11-mers matching these predicted cardiac TFs systematically have larger SVM weights

229 than those that don’t (Supplemental Fig. S6). Since not all TFs are expressed in any

230 given tissue, we further restricted attention to the 334 TFs expressed in heart left

231 ventricles using gene expression profiles from the GTEx project (Supplmental Methods);

11 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

232 Fisher’s exact test confirmed that predicted TFs were significantly associated with

233 cardiac gene expression (Fig. 3B).

234

235 These results are physiologically relevant since left ventricles and atrial

236 appendages from adult hearts were the two most significant tissues among the 53 GTEx

237 tissue gene expression profiles we examined (Fig. 3C). Many other tissues also showed

238 significant association because many TFs are expressed across multiple tissues. Thus,

239 enrichment tests after removing 353 commonly (>90% of tissues) expressed TFs

240 dramatically reduced the significance for every tissue except for the two heart tissues

241 (Fig. 3D). Thus, cardiac CRE functions are activated by a large number of both

242 commonly expressed and cardiac-specific TFs. Our analyses revealed 78 selectively

243 expressed and CRE-enriched TFs in heart tissues of which 29 are C2H2 ZFs with diverse

244 binding specificities with the remaining 49 belonging to seven other major classes; 16

245 basic-helix-loop-helix (bHLH), 9 Nuclear (NR), 5 Sox, 4 Homeodomain (HD),

246 4 basic (bZIP), 3 GATA, and 3 ETS transcription factors (Fig. 4). Many

247 expressed TFs in the same family have similar sequence specificities, explaining some

248 part of the expected functional redundancy.

249

250 We validated these predicted cardiac TFs also using ENCODE ChIP-seq data

251 (Supplemental Methods) based on two classes: potential cardiac TF ChIP-seq data (n =

252 1,054 for 176 TFs) and the remainder (n = 313 for 126 TFs). For each, we calculated the

253 overlap between ChIP-seq peaks and our observed and predicted DHSs to show that

254 candidate cardiac TF-bound regions overlap heart DHSs significantly more often than

12 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

255 non-cardiac TFs (P < 2.2 × 10-16 for observed DHSs, P < 2.84 × 10-16 for predicted DHSs,

256 with one-tailed two-sample Kolmogorov-Smirnov tests) (Supplemental Fig. S7).

257

258 Common cardiac DHSs are mostly promoters and CTCF sites

259 Many DHSs are accessible across diverse cell types (Xi et al. 2007; Song et al.

260 2011; Thurman et al. 2012). These common DHSs are typically enriched in transcription

261 start sites (TSSs) or CTCF binding sites or both, with the remainder enriched in cell-type

262 specific distal enhancers. To discriminate the cardiac DHSs, we defined common DHSs

263 as regions that are open in ≥30% of all ENCODE/ Roadmap tissue samples (Lee et al.

264 2015) (Supplemental Methods). Consistent with previous observations (Xi et al. 2007;

265 Song et al. 2011; Thurman et al. 2012), ~47% of observed heart DHSs are common DHSs

266 of which ~63% overlap TSSs or CTCF bound regions (Supplemental Fig. S8). To

267 understand the sequence features of these classes of DHSs we separately trained a gkm-

268 SVM model after removing these common DHSs. In this heart-specific model,

269 consistent with the CTCF ChIP-seq peak overlap analysis, most of the predictive 11-mers

270 that showed a major decrease in SVM scores (Z-score differences > 6) were CTCF

271 binding sites. In the top 5 motifs showing a major decrease, were additionally ZBTB4,

272 PLAG1, SP4 and KLF10 binding sites, known to be promoter binding TFs (Supplemental

273 Fig. S8C, D). These analyses provide a principled, statistical way to distinguish TSSs,

274 CTCF binding sites and distal enhancers from DHS data, on a tissue or cell-type specific

275 manner.

276

13 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

277 Predicted cardiac regulatory variants affect chromatin accessibility and

278 gene expression

279 Our sequence-based model allows systematic predictions of whether a DNA

280 sequence variant within a CRE is likely to affect its function, thus making detection of

281 (cardiac) regulatory variants possible. To assess this function, we used the deltaSVM

282 method (Lee et al. 2015; Beer 2017; Kreimer et al. 2017) to first assess a small set of

283 high-confidence cardiac regulatory variants associated with allele-biased DHSs

284 (Methods). We tested two different gkm-SVM models, one trained on all heart DHSs

285 ("generic model”) and the other on cardiac-specific DHSs after removing common DHSs

286 (“specific model”): deltaSVM scores from both models were significantly correlated with

287 their allele-biased chromatin accessibility (Fig. 5A; Supplemental Fig. S9A). Both

288 models achieved comparable precision, 46% and 52% at 40% recall, using 4× larger sets

289 of control variants with no allele bias (Fig. 5B). Although their performance is similar

290 they do not predict the same variants since ~30% of variants are unique to each model

291 (Supplemental Fig. S9B). This result is consistent with the k-mer weight distribution

292 differences between the models (Supplemental Fig. S8C) and suggests that the specific

293 model is better at detecting cardiac-specific TFBSs by ignoring CTCF and general

294 promotor TFBSs.

295

296 These results prompted the question whether deltaSVM predicted variants

297 affected local gene expression? We identified all high scoring variants within heart

298 DHSs (Supplemental Data S2) and compared them to GTEx expression quantitative

299 trait loci (eQTLs) from 44 different tissues (Methods): deltaSVM variants predicted by

14 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

300 the generic model are significantly associated with eQTLs in many tissues (Fig. 5C). This

301 significance is strongly correlated with the number of eQTLs in CREs, which is also

302 correlated with the sample size used for gene expression studies (The GTEx Consortium

303 2015). On the other hand, variants predicted by the specific model are mostly associated

304 with eQTLs from heart tissues although their overall statistical significance is lower (Fig.

305 5D). We concluded that despite the pleiotropy of many regulatory variants, the specific

306 model does identify cardiac-specific regulatory variants with increased specificity, but at

307 the cost of missing some regulatory variants common to many tissues. Further,

308 inclusion of predicted DHSs enhanced detection by increasing the statistical significance

309 of eQTL associations (Supplemental Fig. S9C, D).

310

311 Predicted cardiac regulatory variants explain cardiac phenotypes

312 We hypothesize that these predicted causal variants, within observed and

313 predicted CREs, explain a significant fraction of cardiac phenotype heritability which we

314 tested using QTi GWAS (Arking et al. 2014), an intermediate trait involved in long QT

315 syndrome and sudden cardiac death (Tomaselli et al. 1994). We used our published QTi

316 meta-analysis on 76,061 European ancestry subjects and ~2.7 million SNPs and

317 performed Q-Q analysis. First, variants within heart CREs (red dots) are significantly

318 enriched in QTi GWAS SNPs in comparison to all common SNPs (black dots). A subset

319 of these heart CRE variants, predicted to be causal by deltaSVM, especially by the

320 specific model (green dots), are further enriched in QTi-associated variants (Fig. 6A).

321

15 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

322 Taken together these variants contribute substantially to QTi heritability, as

323 shown for some functional annotations of the genome relative to other complex

324 phenotypes (Yang et al. 2011). To quantify this contribution, we used linkage

325 disequilibrium (LD) score regression methods on GWAS summary statistics (Finucane

326 et al. 2015) to evaluate the heritability contribution from CRE causal variants

327 (Supplemental Methods). 11.2% of QTi variation was explained by all common

328 autosomal variants (1KGP SNPs with MAF > 5% in European ancestry subjects). Using

329 predefined cell-type group functional annotations (Finucane et al. 2015), the

330 cardiovascular cell-type group contributed to the most significant enrichment as

331 compared to other tissues (Fig. 6B); this cell-type group includes lung tissues so that the

332 specificity for QTi may be diluted. Consequently, we tested our heart DHS based

333 annotations by comparing the enrichment under various definitions of causality (Fig.

334 6C). We first evaluated different DHS lengths, discovering that a higher heritability

335 (61%) could be explained by extending DHSs to 2kb lengths, although higher

336 enrichment (7.9×) was achieved at the original definition of 600bp. We surmise that

337 narrower DHSs truncate some CREs, missing true variants. As a comparison, we also

338 evaluated a recently published cardiac CRE map based on EP300 and H3K27ac bound

339 regions from multiple cardiac developmental stages (Dickel et al. 2016). This

340 enrichment is comparable to that achieved by our 2kb heart DHSs (5.4× vs. 5.3×; one-

341 tailed Z-score P < 0.51). The average size of Dickel’s cardiac CREs is bigger than our 2kb

342 heart DHSs (3,211 bp versus 2,467 bp) but the number of elements is smaller (82,119

343 versus 133,102). The majority of Dickel’s CREs (49,506 out of 82,119, or 60%) overlap

344 our DHSs, but many regions are still detected by just one approach, suggesting that each

16 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

345 method detects somewhat different types of CREs. Next, adding H3K27ac marks to

346 DHSs further increases (from 5.7× to 8.5×) the enrichment by effectively filtering out

347 less informative DHSs. While the difference between these two enrichments is only

348 marginally significant (one-tailed Z-score P < 0.06), H3K27ac marks alone capture

349 more variants than H3K27ac overlapping DHS variants (7.7% vs. 5.8%) but explains a

350 smaller heritability fraction (44.3% vs. 48.8%) (Fig. 6C; Supplemental Fig. S10). Thus,

351 some H3K27ac peaks do not capture CRE variants, consistent with the fact that

352 H3K27ac peaks are typically located at CRE boundaries. Adding deltaSVM predictions

353 greatly increased enrichment, but explained a smaller fraction of heritability, likely due

354 to false negatives in deltaSVM predictions. Nonetheless, our heart DHS-based

355 annotations, combined with deltaSVM predictions and H3K27ac marks, achieved a

356 highly significant 33.1× enrichment. Tier 1 predicted DHSs consistently increase the

357 explained heritability by 5~10% (i.e. 1~3% additional heritability) without

358 compromising enrichment (Fig. 6C; Supplemental Fig. S10). These results confirm that

359 regulatory sequence variation is the major source of QTi phenotypic variation and that

360 we can identify a majority of such causal variants. Of note, including predicted DHSs

361 with weaker functional support (Tiers 2,3, and 4) decreased overall SNP heritability

362 (Supplemental Table S3). This is probably due to the assumption in the LD score

363 regression method that effect sizes are normally distributed. Under this assumption,

364 adding a large number of SNPs with very small effect sizes can introduce a downward

365 bias. It is also consistent with our observation that many predicted DHSs are weaker

366 compared to observed ones, making them less contributory to the phenotype.

367

17 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

368 To demonstrate our method's broad applicability, we also analyzed three

369 additional cardiac phenotypes (systolic blood pressure, diastolic blood pressure, and

370 pulse rate) and one non-cardiac phenotype (BMI) using the UK Biobank project public

371 data (Sudlow et al. 2015) (Methods). Similar to the QTi result, predicted cardiac

372 regulatory variants significantly contributed to all three cardiac-relevant phenotypes but

373 not BMI (Supplemental Fig. S11). The heritability of blood pressure explained by the

374 cardiac variants is consistently less than that for QTi and pulse rate, suggesting that

375 regulatory variants active in other tissues, such as kidney and blood vessels (Hoffmann

376 et al. 2017), also contribute to blood pressure phenotypes.

377

378 The heart CRE map we generated and the deltaSVM predictions of causality for

379 specific variants within these CREs can be used to probe each of the known QTi GWAS

380 loci in much greater detail (Supplemental Data S3). The QTi GWAS meta-analysis

381 (Arking et al. 2014) identified 35 independent loci with genome-wide significant (P < 5 ×

382 10-8) common variants. By restricting attention to significant SNPs and their LD proxies

383 (r2 > 0.9), we identified 149 variants as potentially regulatory (Supplemental Table S4).

384 Of these, only ~50% (n = 72) are highly associated (r2 > 0.6) with one of the 67 sentinel

385 SNPs (cf. some of the 35 loci have multiple independent sentinel SNPs). On the other

386 hand, 108 variants have alternative measures of high association (|D’| > 0.9) suggesting

387 that many index SNPs may tag multiple regulatory variants within their haplotypes.

388 Note that these regulatory variants are not uniformly distributed across the associated

389 loci: 75% are located within 8 loci (PLN, NOS1AP, ELP6, LITAF, CNOT1, SCN5A,

390 LAPTM4B, and KCNH2). These variants are significantly enriched in GTEx eQTLs as

18 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

391 well: 104 of 149 variants are eQTLs in at least one tissue (2.7× enrichment; binomial test,

-30 392 P < 3.6 × 10 ), of which 57 are eQTLs in the two heart tissues (7.5× enrichment; P < 2.2

393 × 10-34) implying that these specific variants affect the QTi phenotype by perturbing

394 gene expression in the heart.

395

396 As an independent validation of deltaSVM predictions, we finally analyzed a

397 small number of GWAS variants obtained from a recently published study (Wang et al.

398 2016). In this report, 18 putative functional SNPs were selected based on sub-threshold

399 GWAS significance (5 × 10-8 < P < 1 × 10-4) for the QTi and QRS duration, additionally

400 supported by other functional annotations (histone modification marks, DHSs, or

401 eQTLs). These were tested by luciferase assays in human iPSC-derived cardiomyocytes.

402 Of these, 14 were found within our heart DHSs, of which five were predicted to be

403 potentially regulatory by deltaSVM. All of these deltaSVM predicted SNPs showed

404 differential enhancer activities by the luciferase assay (100% specificity) but the study

405 also identified 5 additional regulatory SNPs (50% false negative rate) (Supplemental

406 Table S5). Although the small sample size explains the lack of statistical significance

407 (Fisher’s exact test P < 0.125), the result is consistent with our predictions.

408

409 Discussion

410 This study shows that sequence-based models can allow the systematic detection

411 of specific non-coding CREs, the TFs that engage them, and sequence variants that

412 affect their binding to regulate target gene expression. These models also allow

413 prediction of the functional effects of noncoding sequence variation within these CREs

19 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

414 on human phenotypes, as shown here for cardiac traits. Although much more progress

415 is required, improved epigenomic and genomic data sets, data at cell-type resolution,

416 more refined sequence-based models, yet more sophisticated machine learning

417 algorithms, can lead to reading and assessing the entire sequence of

418 thousands of individuals comprehensively. These advances have important implications

419 for understanding the role of both regulatory and structural variation in both Mendelian

420 and complex disease. In the short run, as our results on the QT interval, blood pressure

421 and pulse rate demonstrate, we have specific variants with strong a posteriori evidence

422 of regulatory effects on specific that regulate these phenotypes. These predictions

423 are specific because they fail to predict non-cardiac phenotypes (BMI). Thus, high-

424 throughput functional tests of specific variants in specific CREs modulated by specific

425 TFs and affecting specific target cardiac genes are likely to be fruitful. In turn, our

426 cardiac model with genome-wide QT interval or other cardiac trait data can also enable

427 prediction of additional genes, not merely loci, which modulate these traits. Eventually,

428 the specific pattern of use of cardiac enhancers and their target genes across many traits

429 is likely to teach us many new facets of cardiac physiology.

430

431 The research described here is integrative, general and broadly applicable to all

432 cell-types and tissues. In time, such regulatory CRE maps, their cognate TFs and

433 sequence variants affecting CRE activity can be routinely constructed for a wide variety

434 of cell types and tissues. The comparative analyses of such data will teach us a great deal

435 about what is common and what is specific regulation for each cell type and the GRNs

436 within them. Recently, Pritchard and colleagues have advanced the hypothesis that most

20 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

437 complex traits and diseases are omnigenic, arising from the perturbations of thousands

438 of genes by neighboring sequence variants: although some of these genes have a

439 physiologic role many (most?) others merely happen to be expressed in disease/trait-

440 relevant cell types (Boyle et al. 2017). They attribute this behavior to GRNs that are so

441 interconnected that numerous ‘trait-irrelevant’ genes affect the functions of a much

442 smaller set of core (‘trait-relevant’) genes. The comparative analyses of the models we

443 describe will be essential for distinguishing the core from the peripheral trait genes, as

444 well as assess the effects of different tissues on a given disease.

445

446 Methods

447 Heart DNase-seq data sets

448 We collected two human heart left ventricle (LV) samples and performed DNase-seq

449 experiments as previously described (Song and Crawford 2010) (Supplemental

450 Methods). In addition to the DNase-seq data sets we generated, three additional heart

451 DNase-seq data sets were obtained from the ENCODE and the Roadmap Epigenomics

452 projects using the mapped reads provided by the consortia (GSM1027322,

453 ENCFF000SPN, and ENCFF000SPP). To identify DHSs, we processed the mapped

454 DNase-seq reads (hg19) using MACS2 (Zhang et al. 2008) with the following parameters

455 “-g hs –nomodel –shift -50 –extsize 100”. 600 bp regions centered at the identified

456 summits were used to define DHSs. The identified DHSs from all five samples were then

457 merged to observe a total of 164,235 distinct regions. For pairwise comparisons, we

458 selected the top 50,000 regions based on the MACS2 P-values from each sample and

459 calculated the Jaccard index for each pair. For comparative analyses, we chose four

21 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

460 adult tissues (ovary, pancreas, psoas muscle, and small intestine) as well as five fetal

461 tissues (adrenal gland, brain, heart, kidney, and spinal cord) all available from the

462 Roadmap project.

463

464 Genome-wide prediction of cis-regulatory elements using gkm-SVM

465 For each of the heart DHS data sets, we trained the gkm-SVM models as previously

466 described with some modifications, and systematically evaluated its ability to predict

467 new DHSs missed by experiments (Supplemental Methods). To identify the best model,

468 we evaluated the combined model as well as the five individual ones by varying SVM

469 thresholds. The combined model, which averages the SVM scores over the five heart

470 data sets, consistently outperformed other models and was chosen as the best model for

471 subsequent genome-wide CRE prediction. For predicting CREs, we scored the whole

472 human genome (hg19) for every 600bp interval with a 100bp sliding window. We

473 applied a SVM score cut-off (>0.9) that balanced precision and recall values estimated

474 from the test set (precision=recall=0.43). In this study we used hg19 as a reference

475 genome and not GRCh38 because realigning reads to the new reference does not

476 substantially alter hg19-based versus GRCh38-based gkm-SVM models (Supplemental

477 Methods).

478

479 Genomic properties of detected DHS elements

480 For each heart DHS set (observed and predicted), we calculated the proportion of

481 regions overlapping three different genomic annotations: a set of conserved regions, and

482 two different sets of open chromatin regions—a universal DHS set and a heart-related

22 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

483 DHS set. We used GERP++ evolutionarily constrained elements (Davydov et al. 2010)

484 as conserved regions, and defined an overlap if at least 50bp of the GERP++ element(s)

485 was contained. For open chromatin regions, we downloaded publicly available DNase-

486 seq data sets from the ENCODE UCSC genome browser (genome.ucsc.edu/encode) and

487 the Roadmap Epigenomics project website (www.roadmapepigenomics.org) and

488 identified DHS peaks as described above. We defined the universal DHS set by merging

489 all DHSs from 797 DNase-seq data sets (counting replicates separately, when available)

490 except for the three heart DNase-seq data sets. To reduce false positive DHSs, we

491 excluded regions that were detected only once across all data sets, resulting in 967,724

492 unique DHSs covering ~45% of the genome. To define a cardiac-relevant DHS set we

493 only used samples derived from muscles, blood vessels, and fetal hearts (n = 126),

494 resulting in 605,466 heart-related elements covering ~19% of the genome. We defined

495 an overlap by at least 300bp. For histone modification marks of enhancers, we used

496 H3K27ac ChIP-seq peaks and defined 118,597 distinct cardiac H3K27ac marked regions

497 (Supplemental Methods). As a negative control, we created 100 independent sets for

498 each of the heart DHS sets (observed and predicted) by randomizing their genomic

499 positions. We then calculated the average proportion of regions overlapping genomic

500 annotations and performed binomial tests using the average as a parameter p of the null

501 distribution.

502

503 To investigate the potential source of variation in CRE identification, we evaluated

504 additional genomic properties of the predicted DHSs: mappability, heterochromatin

505 DNA, and SNP frequencies (Supplemental Methods). To further validate our predictions,

23 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

506 we used CAGE based enhancers from the FANTOM5 (Functional ANnoTation Of the

507 Mammalian genome) project as an independent set (Lizio et al. 2015) and calculated the

508 proportion of the predicted DHSs overlapping these elements (>1bp overlap).

509

510 Predictive sequence feature analysis

511 To determine potential cardiac TFs, we identified TFs whose binding sites have

512 systematically higher gkm-SVM weights. Specifically, we first scored all non-redundant

513 11-mers (N = 411/2 = 2,097,152) using gkm-SVM. We averaged these SVM scores over

514 the five data sets to reduce the variance between biological replicates, and normalized

515 them to zero means. In parallel, we also identified 868 distinct motifs associated with

516 904 human TFs (Supplemental Data S4 and S5; Supplemental Methods). For each of

517 these 868 motifs, we found all 11-mers matching the motif, determined by FIMO (Bailey

518 et al. 2009; Grant et al. 2011) with default setting, and tested whether they were

519 enriched in the top 5% of the 11-mer score distribution. To resolve the issue that FIMO

520 cannot align k-mers to PWMs longer than the k-mers, we generated sets of shorter

521 PWMs of lengths between 8bp and 11bp tiling across full-length PWMs longer than 8bp.

522 We then defined “a hit” when a given 11-mer matched any of these PWMs. We tested the

523 null hypotheses that the expected number of the motif matching 11-mers found in the

524 top 5% scoring 11-mers is equal to 5% of all matching 11-mers, using a Poisson

525 distribution, and identified PWMs with the Bonferroni corrected P < 0.01. We note that

526 our motif identification is robust with respect to the choice of k-mers: identical analysis

527 using 10-mers and 12-mers yielded 446 and 514 motifs, respectively, among which 436

528 (96.1%) and 466 (94.6%) overlap the original 473 motifs with 11-mers.

24 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

529

530 Allele-biased heart DHSs

531 To evaluate deltaSVM predictions, we identified DNA sequence variants that directly

532 affected chromatin accessibility in cardiac tissues by using allele-biased mapping of

533 DNase-seq reads (Supplemental Methods). In each of the 17 heart DNase-seq data sets

534 (fetal and adult combined), we identified all heterozygous regions with at least 10 reads,

535 and calculated P-values of allele-biased DHSs using QuASAR (Harvey et al. 2015). We

536 combined P-values using Fisher’s method and identified 100 high-confidence allele-

537 biased DHS SNPs (P < 0.001 and observed in at least 3 samples with the same direction

538 of effect). For precision-recall analysis, we also identified a 4× larger control set of SNPs

539 that do not exhibit allele-specific alignment (combined P > 0.9 and observed in at least 4

540 samples) by random sampling. To estimate standard errors, we generated ten

541 independent control SNP sets.

542

543 deltaSVM analysis of variants within CREs

544 We adapted the deltaSVM method (Lee et al. 2015) to predict the regulatory effect of

545 variants within observed and predicted DHSs. We identified ~610,000 common

546 variants with at least 1% MAF in European ancestry 1KGP subjects residing within

547 cardiac CREs: of these, ~420,000 and ~190,000 were in observed and predicted CREs,

548 respectively. Next, we extracted 21 bp sequences centered at each of these CRE variants

549 from the reference genome (hg19) and calculated the SVM scores of both alleles using

550 the heart DHS gkm-SVM models. The final deltaSVM score is the average of the

551 differences in these scores from five different heart gkm-SVM models. These were

25 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

552 repeated using gkm-SVM models trained on heart restricted DHSs. ~15% of these

553 variants have potential cardiac regulatory effects given a deltaSVM cut-off that yields 40%

554 recall when predicting allele-biased heart DHS SNPs. Note that all of the deltaSVM

555 significant SNPs predicted in this study are, by definition, restricted to the heart CREs.

556 To investigate the relationship between deltaSVM predicted causal variants and their

557 effect on gene expression, we checked for their overlap with GTEx eQTLs (V6P),

558 restricting attention to eQTLs within ±50kbp of the associated genes, to increase

559 specificity. Using the variants in the cardiac DHSs, we tested for associations using the

560 χ2 test with the binary variables of whether the variants have significant deltaSVM

561 scores or not and whether they are eQTLs in a given tissue or not.

562

563 Data access

564 All sequencing reads from this study have been submitted to the NCBI Gene Expression

565 Omnibus (http://www.ncbi.nlm.nih.gov/geo) under accession number GSE104989.

566 Supplemental data and custom scripts are available in the GitHub repository at

567 https://github.com/Dongwon-Lee/heart_cre_map and in Supplemental Material.

568

569 Acknowledgements

570 We are grateful for the assistance of Dr. Jody Hooper in obtaining autopsy hearts and to

571 Dr. Dan Arking for providing summary statistics from his published meta-analysis of the

572 QT-interval GWAS. This study has benefitted greatly from advice and discussions with

573 Dr. Michael A. Beer, as well as constructive comments from the Chakravarti and

574 Crawford laboratories. The research reported here was supported by the computational

26 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

575 resources of the Maryland Advanced Research Computing Center (MARCC) and NIH

576 grants GM104469, HL086694 and HL128782.

577

578 Author contributions

579 D.L., A.K., and A.C. conceived and designed the study; M.K.H. collected the adult heart

580 tissues; A.S., L.S., and G.E.C. performed DNase-seq experiments; D.L. conducted all

581 computational analyses; and, D.L. and A.C. wrote the manuscript. All authors were

582 involved in manuscript revision.

583

584 Disclosure declaration: The authors declare that they have no competing interests.

585

586 References

587 Arking DE, Pulit SL, Crotti L, van der Harst P, Munroe PB, Koopmann TT, Sotoodehnia

588 N, Rossin EJ, Morley M, Wang X, et al. 2014. Genetic association study of QT

589 interval highlights role for calcium signaling pathways in myocardial

590 repolarization. Nat Genet 46: 826–836.

591 Asthana S, Noble WS, Kryukov G, Grant CE, Sunyaev S, Stamatoyannopoulos JA. 2007.

592 Widely distributed noncoding purifying selection in the human genome. Proc

593 Natl Acad Sci 104: 12410–12415.

594 Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble

595 WS. 2009. MEME SUITE: tools for motif discovery and searching. Nucleic Acids

596 Res 37: W202–W208.

27 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

597 Barski A, Cuddapah S, Cui K, Roh T-Y, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K.

598 2007. High-Resolution Profiling of Histone Methylations in the Human Genome.

599 Cell 129: 823–837.

600 Beer MA. 2017. Predicting enhancer activity and variant impact using gkm-SVM. Hum

601 Mutat 38: 1251–1258.

602 Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, Furey TS, Crawford

603 GE. 2008. High-Resolution Mapping and Characterization of Open Chromatin

604 across the Genome. Cell 132: 311–322.

605 Boyle EA, Li YI, Pritchard JK. 2017. An Expanded View of Complex Traits: From

606 Polygenic to Omnigenic. Cell 169: 1177–1186.

607 Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. 2013. Transposition of

608 native chromatin for fast and sensitive epigenomic profiling of open chromatin,

609 DNA-binding and nucleosome position. Nat Methods 10: 1213–1218.

610 Bulik-Sullivan BK, Loh P-R, Finucane HK, Ripke S, Yang J, Schizophrenia Working

611 Group of the Psychiatric Genomics Consortium, Patterson N, Daly MJ, Price AL,

612 Neale BM. 2015. LD Score regression distinguishes confounding from

613 polygenicity in genome-wide association studies. Nat Genet 47: 291–295.

614 Chakravarti A, Turner TN. 2016. Revealing rate-limiting steps in complex disease

615 biology: The crucial importance of studying rare, extreme-phenotype families.

616 BioEssays 38: 578–586.

28 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

617 Chatterjee S, Kapoor A, Akiyama JA, Auer DR, Lee D, Gabriel S, Berrios C, Pennacchio

618 LA, Chakravarti A. 2016. Enhancer Variants Synergistically Drive Dysfunction of

619 a Gene Regulatory Network In Hirschsprung Disease. Cell 167: 355-368.e10.

620 Davidson EH. 2010. The Regulatory Genome: Gene Regulatory Networks In

621 Development And Evolution. Academic Press.

622 Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S. 2010. Identifying

623 a High Fraction of the Human Genome to be under Selective Constraint Using

624 GERP++. PLoS Comput Biol 6: e1001025.

625 Degner JF, Pai AA, Pique-Regi R, Veyrieras J-B, Gaffney DJ, Pickrell JK, De Leon S,

626 Michelini K, Lewellen N, Crawford GE, et al. 2012. DNaseI sensitivity QTLs are

627 a major determinant of human expression variation. Nature 482: 390–394.

628 Dickel DE, Barozzi I, Zhu Y, Fukuda-Yuzawa Y, Osterwalder M, Mannion BJ, May D,

629 Spurrell CH, Plajzer-Frick I, Pickle CS, et al. 2016. Genome-wide compendium

630 and functional assessment of in vivo heart enhancers. Nat Commun 7: 12923.

631 Eppinga RN, Hagemeijer Y, Burgess S, Hinds DA, Stefansson K, Gudbjartsson DF, van

632 Veldhuisen DJ, Munroe PB, Verweij N, van der Harst P. 2016. Identification of

633 genomic loci associated with resting heart rate and shared genetic predictors with

634 all-cause mortality. Nat Genet 48: 1557–1563.

635 Finucane HK, Bulik-Sullivan B, Gusev A, Trynka G, Reshef Y, Loh P-R, Anttila V, Xu H,

636 Zang C, Farh K, et al. 2015. Partitioning heritability by functional annotation

637 using genome-wide association summary statistics. Nat Genet 47: 1228–1235.

29 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

638 Ghandi M, Lee D, Mohammad-Noori M, Beer MA. 2014. Enhanced Regulatory

639 Sequence Prediction Using Gapped k-mer Features. PLoS Comput Biol 10:

640 e1003711.

641 Gorkin DU, Lee D, Reed X, Fletez-Brant C, Bessling SL, Loftus SK, Beer MA, Pavan WJ,

642 McCallion AS. 2012. Integration of ChIP-seq and machine learning reveals

643 enhancers and a predictive regulatory sequence vocabulary in melanocytes.

644 Genome Res 22: 2290–2301.

645 Grant CE, Bailey TL, Noble WS. 2011. FIMO: scanning for occurrences of a given motif.

646 Bioinformatics 27: 1017–1018.

647 Harvey CT, Moyerbrailean GA, Davis GO, Wen X, Luca F, Pique-Regi R. 2015. QuASAR:

648 quantitative allele-specific analysis of reads. Bioinformatics 31: 1235–1242.

649 He HH, Meyer CA, Hu SS, Chen M-W, Zang C, Liu Y, Rao PK, Fei T, Xu H, Long H, et al.

650 2014. Refined DNase-seq protocol and data analysis reveals intrinsic bias in

651 transcription factor footprint identification. Nat Methods 11: 73–78.

652 Hoffmann TJ, Ehret GB, Nandakumar P, Ranatunga D, Schaefer C, Kwok P-Y, Iribarren

653 C, Chakravarti A, Risch N. 2017. Genome-wide association analyses using

654 electronic health records identify new loci influencing blood pressure variation.

655 Nat Genet 49: 54–64.

656 John S, Sabo PJ, Thurman RE, Sung M-H, Biddie SC, Johnson TA, Hager GL,

657 Stamatoyannopoulos JA. 2011. Chromatin accessibility pre-determines

658 binding patterns. Nat Genet 43: 264–268.

30 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

659 Johnson DS, Mortazavi A, Myers RM, Wold B. 2007. Genome-Wide Mapping of in Vivo

660 -DNA Interactions. Science 316: 1497–1502.

661 Kapoor A, Sekar RB, Hansen NF, Fox-Talbot K, Morley M, Pihur V, Chatterjee S,

662 Brandimarto J, Moravec CS, Pulit SL, et al. 2014. An Enhancer Polymorphism at

663 the Cardiomyocyte Intercalated Disc Protein NOS1AP Locus Is a Major Regulator

664 of the QT Interval. Am J Hum Genet 94: 854–869.

665 Kreimer A, Zeng H, Edwards MD, Guo Y, Tian K, Shin S, Welch R, Wainberg M, Mohan

666 R, Sinnott-Armstrong NA, et al. 2017. Predicting gene expression in massively

667 parallel reporter assays: A comparative study. Hum Mutat 38: 1240–1250.

668 Lee D. 2016. LS-GKM: a new gkm-SVM for large-scale datasets. Bioinformatics 32:

669 2196–2198.

670 Lee D, Gorkin DU, Baker M, Strober BJ, Asoni AL, McCallion AS, Beer MA. 2015. A

671 method to predict the impact of regulatory variants from DNA sequence. Nat

672 Genet 47: 955–961.

673 Lee D, Karchin R, Beer MA. 2011. Discriminative prediction of mammalian enhancers

674 from DNA sequence. Genome Res 21: 2167–2180.

675 Lizio M, Harshbarger J, Shimoji H, Severin J, Kasukawa T, Sahin S, Abugessaisa I,

676 Fukuda S, Hori F, Ishikawa-Kato S, et al. 2015. Gateways to the FANTOM5

677 promoter level mammalian expression atlas. Genome Biol 16: 22.

31 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

678 Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, Reynolds AP,

679 Sandstrom R, Qu H, Brody J, et al. 2012. Systematic Localization of Common

680 Disease-Associated Variation in Regulatory DNA. Science 337: 1190–1195.

681 May D, Blow MJ, Kaplan T, McCulley DJ, Jensen BC, Akiyama JA, Holt A, Plajzer-Frick

682 I, Shoukry M, Wright C, et al. 2012. Large-scale discovery of enhancers from

683 human heart tissue. Nat Genet 44: 89–93.

684 Misteli T. 2001. Protein Dynamics: Implications for Nuclear Architecture and Gene

685 Expression. Science 291: 843–847.

686 Nakayama J, Rice JC, Strahl BD, Allis CD, Grewal SIS. 2001. Role of Histone H3 Lysine

687 9 Methylation in Epigenetic Control of Heterochromatin Assembly. Science 292:

688 110–113.

689 Narlikar L, Sakabe NJ, Blanski AA, Arimura FE, Westlund JM, Nobrega MA,

690 Ovcharenko I. 2010. Genome-wide discovery of human heart enhancers. Genome

691 Res 20: 381–392.

692 Nielsen R, Paul JS, Albrechtsen A, Song YS. 2011. Genotype and SNP calling from next-

693 generation sequencing data. Nat Rev Genet 12: 443–451.

694 Phillips-Cremins JE, Corces VG. 2013. Chromatin Insulators: Linking Genome

695 Organization to Cellular Function. Mol Cell 50: 461–474.

32 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

696 Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen

697 A, Heravi-Moussavi A, Kheradpour P, Zhang Z, Wang J, et al. 2015. Integrative

698 analysis of 111 reference human epigenomes. Nature 518: 317–330.

699 Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier

700 B, Varhol R, Delaney A, et al. 2007. Genome-wide profiles of STAT1 DNA

701 association using chromatin immunoprecipitation and massively parallel

702 sequencing. Nat Meth 4: 651–657.

703 Schmitges FW, Radovani E, Najafabadi HS, Barazandeh M, Campitelli LF, Yin Y, Jolma

704 A, Zhong G, Guo H, Kanagalingam T, et al. 2016. Multiparameter functional

705 diversity of human C2H2 proteins. Genome Res 26: 1742–1752.

706 Song L, Crawford GE. 2010. DNase-seq: A High-Resolution Technique for Mapping

707 Active Gene Regulatory Elements across the Genome from Mammalian Cells.

708 Cold Spring Harb Protoc 2010.

709 http://cshprotocols.cshlp.org/content/2010/2/pdb.prot5384 (Accessed January

710 14, 2015).

711 Song L, Zhang Z, Grasfeder LL, Boyle AP, Giresi PG, Lee B-K, Sheffield NC, Gräf S, Huss

712 M, Keefe D, et al. 2011. Open chromatin defined by DNaseI and FAIRE identifies

713 regulatory elements that shape cell-type identity. Genome Res 21: 1757–1767.

714 Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, Downey P, Elliott P, Green

715 J, Landray M, et al. 2015. UK Biobank: An Open Access Resource for Identifying

33 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

716 the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLOS

717 Med 12: e1001779.

718 The 1000 Genomes Project Consortium. 2015. A global reference for human genetic

719 variation. Nature 526: 68–74.

720 the CARDIoGRAMplusC4D Consortium. 2015. A comprehensive 1000 Genomes-based

721 genome-wide association meta-analysis of coronary artery disease. Nat Genet 47:

722 1121–1130.

723 The ENCODE Project Consortium. 2012. An integrated encyclopedia of DNA elements

724 in the human genome. Nature 489: 57–74.

725 The GTEx Consortium. 2015. The Genotype-Tissue Expression (GTEx) pilot analysis:

726 Multitissue gene regulation in humans. Science 348: 648–660.

727 Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, Sheffield NC,

728 Stergachis AB, Wang H, Vernot B, et al. 2012. The accessible chromatin

729 landscape of the human genome. Nature 489: 75–82.

730 Tomaselli GF, Beuckelmann DJ, Calkins HG, Berger RD, Kessler PD, Lawrence JH, Kass

731 D, Feldman AM, Marban E. 1994. Sudden cardiac death in heart failure. The role

732 of abnormal repolarization. Circulation 90: 2534–2539.

733 Wang X, Tucker NR, Rizki G, Mills R, Krijger PH, Wit E de, Subramanian V, Bartell E,

734 Nguyen X-X, Ye J, et al. 2016. Discovery and validation of sub-threshold genome-

735 wide association study loci using epigenomic signatures. eLife 5: e10557.

34 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

736 Weirauch MT, Yang A, Albu M, Cote AG, Montenegro-Montero A, Drewe P, Najafabadi

737 HS, Lambert SA, Mann I, Cook K, et al. 2014. Determination and Inference of

738 Eukaryotic Transcription Factor Sequence Specificity. Cell 158: 1431–1443.

739 Xi H, Shulha HP, Lin JM, Vales TR, Fu Y, Bodine DM, McKay RDG, Chenoweth JG,

740 Tesar PJ, Furey TS, et al. 2007. Identification and Characterization of Cell Type–

741 Specific and Ubiquitous Chromatin Regulatory Structures in the Human

742 Genome. PLOS Genet 3: e136.

743 Yang J, Manolio TA, Pasquale LR, Boerwinkle E, Caporaso N, Cunningham JM,

744 Andrade M de, Feenstra B, Feingold E, Hayes MG, et al. 2011. Genome

745 partitioning of genetic variation for complex traits using common SNPs. Nat

746 Genet 43: 519–525.

747 Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nussbaum C, Myers

748 RM, Brown M, Li W, et al. 2008. Model-based Analysis of ChIP-Seq (MACS).

749 Genome Biol 9: R137.

750

751 Figure legends

752 Figure 1. A machine learning algorithm (gkm-SVM) accurately predicts cis-

753 regulatory elements. (A) An example of heart DNase-seq signals (raw data) and

754 peaks (MACS2) at the GATA4 locus across multiple human samples. (B) A genome-

755 wide heatmap of DNase-seq read densities in 1,000 bp windows centered at heart DHSs.

756 Randomly sampled 1,000 regions were used. Regions were grouped based on the

757 configuration of the DHS peaks across the five samples with at least one observed DHS.

35 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

758 (C) ROC curves of gkm-SVM models for five replicates against reserved test sets. (D)

759 Comparisons of the fraction of DHSs overlapping predicted regions (precision) and

760 fraction of predicted regions overlapping DHSs (recall).

761

762 Figure 2. Learned sequence features predicts additional CREs. (A) Venn

763 diagram of observed and predicted DHSs, and their proportions in non-cardiac cell-

764 types. (B) Heatmap of DHS signal intensities in heart-related cell types and tissues at

765 observed (left) and predicted (right) DHSs. Randomly sampled 5,000 regions from

766 observed and predicted DHSs were used. Observed DHSs are mostly located in

767 ubiquitously open chromatin regions in cardiac relevant cells and tissues, while

768 predicted DHSs show greater cell-type specificity.

769

770 Figure 3. Learned sequence features identifies cardiac TFs. (A) SVM weight

771 distributions of PWM-matched 11-mers for the top two scoring TFs (CTCF and MEF2A),

772 five well-known cardiac-specific TFs (GATA4, NKX2-5, HAND1, HAND2, SRF, and

773 HEY2), and a cardiac-irrelevant TF (POU5F1). The fifth percentile is shown as a green

774 line. The enrichment is defined as the fraction of TFBS matching 11-mers in the top 5th

775 percentile of all 11-mers compared to the expected fraction (5%). (B) 2x2 contingency

776 tables comparing TFs enriched in heart DHSs with TFs expressed in heart left ventricles

777 (top) and atrial appendages (bottom). (C, D) -log10(P) of one-sided Fisher’s exact test

778 for every tissue tested using all TFs (C) and after removing commonly-expressed TFs

779 (D); the two tissues from adult heart (atrial appendage and left ventricle) are

36 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

780 highlighted in red with the Bonferroni corrected P-value threshold (0.05) shown as a

781 dashed line.

782

783 Figure 4. A list of 78 TFs enriched in heart DHSs as well as selectively

784 expressed in cardiac tissues. The 78 TFs were grouped based on their DNA binding

785 domain families. These PWMs were also clustered based on their sequence specificity.

786 For each TF, cluster number (Clus), name, gene expression (FPKM) in left ventricle, fold

787 enrichment (nfolds), and PWM are shown, respectively.

788

789 Figure 5. Identification of common, cardiac-specific regulatory sequence

790 variants. (A) deltaSVM scores from the cardiac-specific model as compared to allele-

791 biased chromatin accessibility (C: Pearson correlation coefficient, n: number of variants,

792 P: t-distribution P-value). (B) Precision-recall curves of deltaSVM scores of allele-

793 biased DHSs against 4× larger control SNP sets; the dashed green line indicates the

794 recall rate 40%. Error bars are the standard errors calculated from ten independently

795 sampled control SNP sets. (C, D) Statistical significance (P) of deltaSVM SNPs from the

796 generic (C) and specific (D) cardiac models as compared to eQTLs using χ2 test; The two

797 heart tissues are highlighted (AA: Heart atrial appendages, LV: Heart left ventricles);

798 The Bonferroni corrected P-value threshold (0.05) is shown as a dashed line.

799

800 Figure 6. Cardiac regulatory variants explain QT-interval heritability. (A) Q-

801 Q plots of QTi GWAS results using different subsets of common genome-wide sequence

802 variants (numbers of variants in parentheses); The dashed gray line indicates the

37 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

803 genome-wide significance threshold (P < 5 × 10-8). (B) Comparison of enrichment P-

804 values between ten pre-defined cell-type group functional annotations. (C) Enrichment

2 805 values, estimated as the fraction of the heritability (hG ) explained by variants over the

806 fraction of the SNPs, for various heart DHS-based annotations (Cadiovascular: the

807 predefined cell-type group functional annotation for cardiovascular tissues, H3K27ac:

808 H3K27ac marks in heart, Dickel: heart CRE map from Dickel et al. 2016, obsDHS 600bp:

809 observed DHSs, obsDHS 1kb/2kb: observed DHSs with 1kb/2kb extension, allDHS 2kb:

810 observed and predicted DHSs with 2kb extension, dSVM: deltaSVM predicted variants);

811 a combination of multiple annotations indicates intersection of them. For example,

812 “obsDHS 2kbp & dSVM” means a set of variants predicted by deltaSVM and overlapping

813 observed DHSs with 2kb extension; the proportion of SNPs are in parenthesis; Error

814 bars denote standard errors estimated by a block Jackknife method; no enrichment is

815 shown as a dashed line.

38 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

A 11,520 kb 11,540 kb 11,560 kb 11,580 kb 11,600 kb B

Chaklab1 0 2 4 6

log2 (DHS reads)

Chaklab2

Chaklab 1 Chaklab 2 Encode 1 Encode 2 Roadmap ENCODE1

ENCODE2 n=5

Roadmap

Combined n=4

GATA4 n=3 C D Chaklab 1

Chaklab 2 n=2 Encode 1

Encode 2 Chaklab 1 (0.917) Roadmap Chaklab 2 (0.908) Combined Encode 1 (0.866) Precision Sensitivity Encode 2 (0.889) Roadmap (0.919) n=1 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 1-Specificity Recall Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

A Observed DHSs

0.3% Heart related DHSs 85,990 5.0% Other DHSs 94.7% Fetal Heart tissues None Vascular endothelial cells 76,445 Muscle tissues

Myoblasts 11.6% Heart related DHSs 88,026 Myosatellites 30.7% 57.7% Other DHSs Myocytes None Predicted DHSs Fibroblasts B Smooth muscle cells

0 40 80 0 40 80 Percentile Percentile Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

A CTCF MEF2A GATA4 NKX2−5 (n = 14,173) (n = 5,142) (n = 1,280) (n = 1,280) All 11−mers All 11−mers All 11−mers All 11−mers Density Density Density Density 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0

−1 0 1 2 3 −1 0 1 2 3 −1 0 1 2 3 −1 0 1 2 3 SVM weights SVM weights SVM weights SVM weights

HAND1/2 SRF HEY2 POU5F1 (n = 5,732) (n = 7,193) (n = 3,580) (n = 6,253) All 11−mers All 11−mers All 11−mers All 11−mers Density Density Density Density 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0

−1 0 1 2 3 −1 0 1 2 3 −1 0 1 2 3 −1 0 1 2 3 SVM weights SVM weights SVM weights SVM weights C B Heart Left Ventricle )

FPKM<1 FPKM>1 Total Fisher P ( Not 10

261 137 398 log enriched −

Enriched 172 334 506 0 5 10 15 20 Total 433 471 904 Ovary Lung Liver Testis SpleenThyroid Uterus Vagina Pituitary Prostate Bladder -21 Stomach Brain−SN Pancreas Brain−ACC P < 1.82 × 10 (Fisher’s exact) Artery−aorta Brain−NAcc Whole blood Artery−Nervetibial −tibial Brain−cortex Adrenal gland Brain−caudate Colon−sigmoid Fallopian tube Kidney−cortex Artery−coronary BrainBrain−amygdala−putamen Muscle−skeletal Esophagus−GEJ Adipose−visceral Brain−spinal cord Colon−Braintransverse−cerebellum Heart−left ventricle CervixSkin−ectocervix−sun exposed Cervix−endocervix Minor salivaryBrain−hippocampus gland Brain−frontal Esophaguscortex −mucosa Brain−hypothalamus Skin−not sun exposed Heart−atrial appendage Esophagus−muscularis Breast−mammary tissueAdipose−subcutaneous Transformed fibroblasts Heart Atrial Appendage Brain−cerebellar hemisphere D EBV transformed lymphocytes Small intestine−terminal ileum FPKM<1 FPKM>1 Total

Not ) 262 136 398 enriched Fisher P (

Enriched 167 339 506 10 log − Total 429 475 904

P < 4.43 × 10-23 (Fisher’s exact) 0 2 4 6 8

Ovary Lung Liver Testis SpleenThyroid Uterus Vagina Pituitary Prostate Bladder Stomach Brain−SN Pancreas Brain−ACC Artery−aorta Brain−NAcc Whole blood Artery−Nervetibial −tibial Brain−cortex Adrenal gland Brain−caudate Colon−sigmoid Fallopian tube Kidney−cortex Artery−coronary BrainBrain−amygdala−putamen Muscle−skeletal Esophagus−GEJ Adipose−visceral Brain−spinal cord Colon−Braintransverse−cerebellum Heart−left ventricle CervixSkin−ectocervix−sun exposed Cervix−endocervix Minor salivaryBrain−hippocampus gland Brain−frontal Esophaguscortex −mucosa Brain−hypothalamus Skin−not sun exposed Heart−atrial appendage Esophagus−muscularis Breast−mammary tissueAdipose−subcutaneous Transformed fibroblasts

Brain−cerebellar hemisphere EBV transformed lymphocytes Small intestine−terminal ileum Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

Zinc Finger Clus Name FPKM nfolds PWM Clus Name FPKM nfolds PWM 2 ZNF71 1.80 2.30 13 NR4A3 2.83 2.44 3 ZNF563 1.47 3.14 13 RXRG 1.90 2.42 3 ZBTB42 3.03 2.06 13 ESRRG 3.02 2.23 5 BCL11A 1.47 4.77 13 ESRRB 1.66 2.01 6 KLF5 1.11 6.46 13 RARB 4.87 1.99 6 WT1 1.12 4.56 13 THRB 2.69 1.62 6 SP4 1.24 3.85 13 PPARG 5.05 1.60 6 EGR3 1.03 3.37 18 AR 3.64 2.64 6 KLF8 1.82 2.46 35 PGR 1.91 1.89 6 ZFY 1.38 1.82 SOX 6 ZNF684 2.11 1.52 Clus Name FPKM nfolds PWM 6 RREB1 5.69 1.36 9 SOX15 1.29 4.50 6 PLAG1 1.82 1.33 9 SOX8 1.09 3.79 7 SNAI3 1.41 2.58 9 SOX9 6.59 3.40 8 OSR1 1.42 1.40 9 SOX6 1.20 2.20 12 ZNF264 1.11 1.60 9 SOX17 6.64 1.44 17 ZNF586 2.14 1.98

21 ZNF418 1.88 1.33 bZIP 24 BCL6B 5.84 1.41 Clus Name FPKM nfolds PWM 25 ZNF582 1.25 1.46 1 NFE2L3 1.54 4.22 26 REST 3.70 2.46 11 CREB5 2.24 9.64 26 ZNF415 4.67 1.58 11 CREB3L42.13 7.09 26 ZNF594 1.60 1.46 11 CREB3L113.46 2.98 27 ZNF549 1.06 1.38 30 ZNF708 1.20 1.86 Homeodomain 33 ZNF423 1.21 1.50 Clus Name FPKM nfolds PWM 38 ZNF250 1.07 1.73 3 PKNOX2 1.88 2.32 38 ZSCAN311.52 1.40 7 MEIS2 5.76 3.04 38 ZNF547 1.05 1.32 7 MEIS1 4.78 2.48 10 NKX2−5 112.36 2.03 bHLH Clus Name FPKM nfolds PWM GATA 3 MSC 5.44 5.82 Clus Name FPKM nfolds PWM 3 TCF21 3.01 4.60 20 GATA4 45.54 2.87 3 ATOH8 7.64 2.33 20 GATA6 26.82 1.86 3 TAL1 2.57 2.13 20 GATA2 7.81 1.44 3 LYL1 4.04 2.13 ETS 3 TCF15 9.91 1.92 Clus Name FPKM nfolds PWM 4 MLXIPL 1.36 4.82 5 ERG 4.08 10.56 4 MITF 10.72 3.38 5 ETV1 5.66 9.60 4 HEY1 9.30 2.99 5 ELF4 2.42 6.73 4 HEYL 22.67 2.71 4 HEY2 20.23 2.59 Others 4 8.22 1.81 Clus Name FPKM nfolds PWM 6 ARNT2 1.17 1.75 9 FOXS1 2.61 2.01 17 EBF1 3.20 2.76 15 NFATC1 3.96 2.34 32 HAND2 31.92 2.34 24 STAT4 1.25 2.22 32 HAND1 19.47 2.34 25 TEAD4 5.78 2.53 Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

A B C = 0.636, n = 100 specific (0.52) ● P = 5.7e−13 ● generic (0.46) ● ● ●

● ● ● random (0.20) ● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●●● ● ●● ● ●● ● ● ●● ● ● ● ●●●● ● ● ●● ● ● ●● ● ●● ●● ●● ● ● ● ● ● ● ●● ●● ● ● ●● ● ●●● ●

− 1 0 1● 2 3

● ● Average precision Average ●

● − 2 deltaSVM w/ specific model − 3 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

Fraction of reads mapping to alternate alleles Recall

C Generic model D Specific model

● ● AA LV ● ● ●

● ● ● ● ● ● ● LV ● ● ● ) ● )

P AA P ● ● ● ● ●

( ( ● ● ● ●

10 10 ● ●

● ● ● ● ● ● ● ●

− log ● − log ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5 10 15 0 1 2 3 4 5 6 7

0 10000 30000 50000 0 10000 30000 50000

# of eQTLs in heart DHSs # of eQTLs in heart DHSs Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

A ● B ●●● Cardiovascular ● all HapMap3 (n=2,710,220) ● SkeletalMuscle ● ● ● heart DHSs (n=191,782) ●● ● Hematopoietic ● ●

) ● deltaSVM generic (n=24,708) GI ● deltaSVM specific (n=29,125) Liver observed ● ● P (

10 Other

log Kidney −

● Adrenal Pancreas CNS Connective Bone 0 10 20 30 40

0 1 2 3 4 0 1 2 3 4

− log10(P expected) − log10(P) C (Proportion of h2g)/(Proportion of SNPs) 0 10 20 30 40

Dickel ( 9.3%) H3K27ac ( 7.7%) allDHS 2kbp (11.8%) Cardiovascular (11.1%) obsDHS 600bpobsDHS ( 4.2%)obsDHS 1kbp ( 6.4%) 2kbp (11.4%) obsDHS 2kbpallDHS & dSVM 2kbp &( 1.9%)dSVM ( 2.0%) obsDHS 2kbpallDHS & H3K27ac 2kbp & H3K27ac ( 5.8%) ( 6.2%)

obsDHS 2kbpallDHS & H3K27ac 2kbp & H3K27ac & dSVM &( 0.9%)dSVM ( 1.0%) Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

Human cardiac cis-regulatory elements, their cognate transcription factors, and regulatory DNA sequence variants

Dongwon Lee, Ashish Kapoor, Alexias Safi, et al.

Genome Res. published online August 23, 2018 Access the most recent version at doi:10.1101/gr.234633.118

Supplemental http://genome.cshlp.org/content/suppl/2018/08/31/gr.234633.118.DC1 Material

P

Accepted Peer-reviewed and accepted for publication but not copyedited or typeset; accepted Manuscript manuscript is likely to differ from the final, published version.

Open Access Freely available online through the Genome Research Open Access option.

Creative This manuscript is Open Access.This article, published in Genome Research, is Commons available under a Creative Commons License (Attribution-NonCommercial 4.0 License International license), as described at http://creativecommons.org/licenses/by-nc/4.0/. Email Alerting Receive free email alerts when new articles cite this article - sign up in the box at the Service top right corner of the article or click here.

Advance online articles have been peer reviewed and accepted for publication but have not yet appeared in the paper journal (edited, typeset versions may be posted when available prior to final publication). Advance online articles are citable and establish publication priority; they are indexed by PubMed from initial publication. Citations to Advance online articles must include the digital object identifier (DOIs) and date of initial publication.

To subscribe to Genome Research go to: https://genome.cshlp.org/subscriptions

Published by Cold Spring Harbor Laboratory Press