1 Synthetic CpG islands reveal DNA sequence

2 determinants of structure

3

4 Elisabeth Wachter1, Timo Quante1, Cara Merusi1, Aleksandra Arczewska1,

5 Francis Stewart2, Shaun Webb,1 and Adrian Bird1#

6

7

8 1 The Wellcome Trust Centre for Cell Biology 9 University of Edinburgh 10 The King’s Buildings 11 Mayfield Road 12 Edinburgh 13 EH9 3JR 14 UK 15 16 2 Genomics & Biotechnology Center 17 Technische Universitaet 18 Dresden 19 Tatzberg 47 20 01307 Dresden 21 Germany 22

23 #Corresponding author: [email protected]; phone: +44-(0)131-650-5670

24

25 Running title: CpG island DNA affects chromatin structure

1 26 Abstract

27

28 The mammalian genome is punctuated by CpG islands (CGIs), which differ

29 sharply from the bulk genome by being rich in G+C and the dinucleotide CpG.

30 CGIs often include transcription initiation sites and display “active”

31 marks, notably 4 . In embryonic stem cells (ESCs)

32 some CGIs adopt a “bivalent” chromatin state bearing simultaneous “active” and

33 “inactive” chromatin marks. To determine whether CGI chromatin is

34 developmentally programmed at specific genes or is imposed by shared features

35 of CGI DNA, we integrated artificial CGI-like DNA sequences into the ESC genome.

36 We found that bivalency is the default chromatin structure for CpG-rich, G+C-rich

37 DNA. A high CpG density alone is not sufficient for this effect, as A+T-rich

38 sequence settings invariably provoke de novo DNA methylation leading to loss of

39 CGI signature chromatin. We conclude that both CpG-richness and G+C-richness

40 are required for induction of signature chromatin structures at CGIs.

41

42 Introduction

43 CpG islands (CGIs) are stretches of atypical genomic DNA sequences that mark

44 most gene promoters (Bird, 1986; Deaton and Bird, 2011). Though individually

45 unique in nucleotide sequence, mammalian CGIs share two features that often

46 correlate but can vary independently: a G+C-rich base composition (65% versus

47 40% for the bulk genome) and high density of the dinucleotide CpG (5-10 fold

48 higher than the bulk genome). Whereas bulk genomic DNA is globally methylated,

49 CGIs associated with promoters invariably lack DNA methylation. It has been

50 proposed that CGIs function as generic platforms for gene regulation, probably

2 51 via proteins that bind to CpG and influence chromatin modification (Blackledge

52 and Klose, 2011; Deaton and Bird, 2011). This hypothesis has gained credence

53 due to accumulating evidence that enrichment or depletion of specific chromatin

54 marks at CGIs is linked to proteins that bind to unmethylated CpG. So far, all such

55 proteins possess CXXC zinc finger domains that bind specifically to the CpG dyad

56 in duplex DNA (Lee et al., 2001). Cfp1, for example, is a CXXC domain-containing

57 component of the Set1/COMPASS complex (Lee and Skalnik, 2005), which

58 generates H3K4 methylation, a signature chromatin mark at non-methylated

59 CGIs (Bernstein et al., 2006; Guenther et al., 2007). Accordingly, Cfp1 is

60 concentrated at CGIs as determined by chromatin immunoprecipitation (ChIP)

61 and its absence is associated with reduced H3K4 methylation at many CGIs

62 (Clouaire et al., 2012; Thomson et al., 2010). The H3K4 methyltransferases Mll1

63 and Mll2 are also CXXC proteins and each is found at CGIs in mouse embryonic

64 stem cells (ESCs) as determined by ChIP-Seq (Denissov et al., 2014; Hu et al.,

65 2013). Similarly the CXXC domain-containing proteins Kdm2a and Kdm2b are

66 enriched at CGIs. Kdm2a is an H3K36 demethylase that contributes to depletion

67 of at CGI promoters (Blackledge et al., 2010), whereas Kdm2b

68 facilitates recruitment of the PRC1 complex to transcriptionally silent CGIs

69 (Farcas et al., 2012; Wu et al., 2013).

70

71 Previous studies have shown that CGI-like DNA sequences can impose an altered

72 chromatin state in ESCs. Promoter-less CGI-like DNA sequences of invertebrate

73 origin in mouse ES cells were initially shown to cause local enrichment of

74 trimethylation of lysine 4 of histone H3 () and to recruit the CpG

75 binding protein Cfp1 (Thomson et al., 2010). Similar experiments using G+C-rich

3 76 DNA derived from bacteria created chromatin marked by both and

77 H3K4me3 (Mendenhall et al., 2010). Whereas H3K4me3 is characteristic of

78 active promoters, H3K27me3 is associated with transcriptional repression via

79 the polycomb complex. The coincidence of these two marks in so-called

80 “bivalent” chromatin is thought to be a feature of genes that are poised to

81 become either active or silent during early development (Azuara et al., 2006;

82 Bernstein et al., 2006; Voigt et al., 2013). Many native CGIs in ESCs adopt a

83 “bivalent” chromatin structure when transcriptionally silent.

84

85 Despite evidence that CXXC proteins play a role at CGIs, the hypothesis that CpG

86 density is the critical determinant of CGI function has not been directly tested.

87 Here we vary the DNA sequence composition of artificial promoterless CGIs to

88 assess the relative importance of G+C-richness and high CpG density. Our assay

89 relies upon chromosomal integration of artificial CGI-like sequences into the

90 genome of ESCs. We show that a bivalent chromatin configuration and absence

91 of DNA methylation represent the default state of biologically inert CGI-like DNA

92 sequences. While a high CpG density is essential, it is not sufficient to guarantee

93 the bivalent chromatin structure, as CpG-rich DNA sequences with a G+C-poor

94 average base composition do not acquire this chromatin signature. In fact A+T-

95 rich, G+C poor insertions with a CpG frequency matching that of CGIs

96 consistently become DNA methylated in ESCs, although the bivalent

97 configuration can be restored if the dominant DNA methylation is removed. Our

98 findings demonstrate that CpG-richness is essential for the formation of bivalent

99 chromatin, whereas G+C-richness is required to exclude DNA methylation, which

100 when present is dominant over the other chromatin marks.

4 101

102 Results

103 Formation of a novel bivalent domain at artificial CGI-like sequences

104 Comparison between CGIs and an equivalent number of sequences from the bulk

105 genome shows that both CpG frequency and G+C richness are distinct in CGIs

106 compared with bulk genomic DNA of mouse (Figure 1A). A previous study

107 showed that bacterial DNA with CGI-like features organised bivalent chromatin

108 in ESCs (Mendenhall et al., 2010). In order to verify and extend this result we

109 introduced CGI-like DNA sequences of ~1000 nucleotide pairs (the average

110 length of native CGIs) into a bacterial artificial chromosome (BAC) containing a

111 human “gene desert” using recombineering (see Figure 1 – figure supplement 1).

112 A computer-generated CGI-like sequence (Artificial CGI 1) was designed with a

113 CpG frequency of one per ~10 base pairs and a base composition of 65% G+C

114 (Figure 1A, Figure 1-figure supplement 2 and Figure 1-source data 1 ). A second

115 CGI-like sequence (PuroGFP; Figure 1A, Figure 1-figure supplement 2 and Figure

116 1-source data 1) was of prokaryotic origin, being derived from a promoter-less

117 bacterial puromycin gene adjacent to codon-optimised green fluorescent protein

118 coding sequence (Thomson et al., 2010). All constructs lack a promoter, allowing

119 us to focus on the interaction between DNA sequence and chromatin

120 modification without the complicating involvement of transcription. The gene

121 desert regions flanking CGI-like sequences are intended to insulate against

122 effects of the genomic and chromatin environment at different BAC integrations

123 sites.

124

5 125 Three independently transfected stable ESC lines with low copy number random

126 integrations of the BAC containing Artificial CGI 1 and one cell line with the

127 PuroGFP CGI were selected for ChIP analysis (see Figure 1-figure supplement 1).

128 As controls we monitored active genes (Sox2 and Gapdh), which are marked by

129 H3K4me3, an endogenous bivalent gene (Hoxc8), which carries both H3K27me3

130 and H3K4me3, and an inter-genic region of chromosome 15 (m15), which bears

131 neither mark. The CGI-like insertions consistently generated a bivalent

132 chromatin structure marked by H3K4me3 and H3K27me3 (Figure 1B and C and

133 Figure 1-figure supplement 1). We refer to DNA sequence domains as bivalent

134 using the convention that H3K4me3 and H3K27me3 marks coincide at a single

135 integration site. It is possible that some cells in the population harbour an

136 integrant marked by only one of these marks whereas other cells possess only

137 the other mark. This configuration has not been detected at the few ESC bivalent

138 domains tested so far. More likely is that the two marks are interspersed at a

139 given bivalent sequence domain, though we did not test this experimentally [see

140 (Voigt et al., 2013)]. The levels of H3K4me3 at the artificial CGI were similar to

141 those at the endogenous bivalent gene Hoxc8, whereas H3K4me3 levels in the

142 DNA sequences flanking the CGI were unaffected by the insertion. H3K27me3

143 was less discrete, spreading variably into flanking regions of the human BAC, but

144 Suz12, a component of the PRC2 complex that deposits the H3K27me3 mark, was

145 tightly localised to the CGI-like sequence in each case. We tested an unrelated

146 artificial CGI-like sequence with similar overall sequence properties (Artificial

147 CGI 2; Figure 1-figure supplement 1), this time integrated at a recombination

148 cassette within the beta globin locus of mouse ESCs (Lienert et al., 2011).

149 Independent replicate stable transformant cell lines again showed consistent

6 150 presence of a bivalent chromatin structure (data not shown). The DNA

151 methylation status of the integrated artificial CGIs was investigated by bisulfite

152 sequencing. This showed that all the insertions reproducibly maintained a low

153 level of DNA methylation (Figure 1D & Figure 1-figure supplement 1). DNA

154 methylation levels at CGI-like sequences inserted into the gene desert (~10%)

155 were consistently somewhat higher than in the cassette exchange system (0.5%)

156 and were somewhat higher than an endogenous control CGI in the same DNA

157 samples (Dlx5; ~4-5%, data not shown). Despite this variability, it is evident that

158 all of these artificial DNA sequences maintain a largely non-methylated status.

159

160 H3K4me3 at an artificial CGI is independent of Cfp1 and RNA polymerase II

161 To test whether these promoterless artificial DNA sequences were

162 transcriptionally inert, we performed ChIP with antibodies recognising three

163 differentially phosphorylated forms of RNA polymerase II in replicate cell lines.

164 None displayed a peak over the insert, whereas control active genes were RNA

165 polymerase-positive as expected (Figure 2A and Figure 2-figure supplement 1).

166 Consistent with the absence of RNA polymerase, we did not detect significant

167 peaks of histone acetylation over the CGI-like insertions (Figure 2B). In summary,

168 data derived from three distinct promoter-less CGI-like sequences indicate that

169 bivalent chromatin is the default state for transcriptionally inert CGI-like DNA in

170 ESCs.

171

172 We next asked whether the CXXC protein Cfp1 is enriched at the CGI-like

173 sequences. To facilitate detection of Cfp1, we introduced the BAC containing the

174 artificial CGI into a transgenic cell line expressing a Cfp1-GFP fusion protein

7 175 (Denissov et al., 2014). Having verified that a bivalent domain was formed at the

176 artificial CGI in these cells (data not shown), ChIP was performed on three

177 independent cell lines using an anti-GFP antibody. We consistently observed

178 discrete enrichment of Cfp1 at the CGI-like insertion (Figure 2C). To determine

179 whether the formation of a bivalent domain at the inserted artificial CGI-like

180 sequence was dependent on Cfp1, the artificial CGI was introduced into Cfp1 -/-

181 mouse ES cells (Carlone & Skalnik, 2001) and three independent lines were

182 analysed by ChIP. H3K4me3 levels at the artificial CGI were clearly detectable in

183 the Cfp1 -/- cells, indicating that Cfp1 is not required for the formation of

184 H3K4me3 levels at the bivalent domain (Figure 2D). Depletion of Cfp1 was

185 previously reported to preferentially cause a decrease of H3K4me3 at active

186 genes without affecting non-productive genes (Clouaire et al, 2012). In

187 agreement with this finding, we observed reduced H3K4me3 at the active

188 control gene Sox2 compared to wildtype cells (compare Figures. 2D and 1C). We

189 also followed the fate of chromatin modifications during differentiation of ESCs

190 to neuronal progenitor cells (Figure 2 – figure supplement 1B) and found a

191 consistent drop in H3K4me3 accompanied by persistent or increased H3K27me3

192 (Figure 2-figure supplement 1). This transition from bivalency to H3K27me3

193 marking alone matches that at native CGI-associated genes that remain

194 transcriptionally silent during differentiation (Bernstein et al., 2006).

195

196 A high CpG frequency is necessary for the creation of bivalent domain

197 Although G+C content and CpG frequency are related features, they can be varied

198 independently (Figure 1A). To establish the importance of these features for

199 determination of bivalent chromatin, we varied CpG frequency and G+C content

8 200 in 1000 base pair long artificial DNA sequences (Figure 1-figure supplement 2

201 and Figure 1-source data 1). An artificial CGI with a base composition similar to

202 that of a normal CGI (65% G+C) but with a low density of CpGs, similar to that of

203 the bulk genome (1CpG/100bp), was designed (Low CpG / High G+C). This Low

204 CpG / High G+C sequence failed to create bivalent chromatin as neither

205 H3K4me3 nor H3K27me3 was detected in three independent ESC lines (Figure

206 3A). We note that the relative values of control and experimental data points are

207 consistent between experiments although we observe variability in the absolute

208 precipitation levels due to the use of different antibody suppliers between

209 experiments over an extended time period. Our conclusion from this data is that

210 a G+C-rich base composition alone is insufficient to recruit either H3K4me3 or

211 H3K27me3.

212

213 A+T-rich CGIs become reproducibly DNA methylated

214 This result raised the possibility that CpG frequency alone determines the

215 chromatin state, with G+C content playing no role. To test this idea, we generated

216 four different artificial DNA sequences that were CpG-rich to the same level as

217 typical CGIs (10CpGs/100bp), but relatively A+T-rich in overall base composition

218 (three of 40% and one of 50% G+C on average; Figure 1-figure supplement 2 and

219 Figure 1-source data 1). Contrary to expectation, none of these insertions

220 generated a focus of bivalent chromatin in multiple independent cell lines

221 (Figure 4A and Figure 4-figure supplement 1). A potential explanation for this

222 finding came from an analysis of DNA methylation status, which showed that in

223 replicate cell lines the CGIs had all become densely methylated at CpGs (Figure

224 4B and Figure 4-figure supplement 2). The striking contrast between the

9 225 consistent methylation-free status of three separate G+C-rich, CpG-rich

226 integrants and the reproducible dense methylation of four unrelated A+T-rich,

227 CpG-rich sequences of the same length indicates that base composition is a

228 strong determinant of DNA methylation status. A plot of G+C-content against

229 percentage CpG methylation showed a sharp transition between 50 and 60%

230 G+C (Figure 4D). Interestingly, a CGI-like insertion with a base composition of

231 55% G+C (MeCP2-eGFP) studied previously (Thomson et al., 2010) showed an

232 intermediate DNA methylation level, suggesting that it lies on the transition

233 point for triggering de novo methylation (Figure 4D).

234

235 Removal of DNA methylation restores the bivalent domain

236 The results indicate that DNA methylation is dominant over both H3K4me3 and

237 H3K27me3, as in its presence neither chromatin mark is observed at the

238 artificial CGIs. To test this hypothesis, we asked if removal of DNA methylation

239 could restore bivalent chromatin at A+T-rich CpG-rich sequences by using

240 mutant ESCs lacking the de novo DNA methyltransferases Dnmt 3a and Dnmt 3b

241 (Okano et al., 1999), which display severe DNA hypomethylation (Figure 4-figure

242 supplement 2). To ensure that histone modifying enzyme activities are not

243 disrupted in the Dnmt 3a/3b double knock out cells, we generated stable cell

244 lines with the artificial CGI-containing BAC and confirmed that a bivalent domain

245 was observed over the insertion (Figure 4-figure supplement 2). When the High

246 CpG /High A+T construct was introduced into Dnmt 3a/3b knock out cells it now

247 remained unmethylated, and importantly, bivalent chromatin was detected at

248 the inserted sequence (Figure 4C). We noticed that H3K4me3 was reproducibly

249 weaker than controls in Dnmt 3a/Dnmt 3b double mutant cells, whereas a robust

10 250 H3K27me3 signal was obtained. This may indicate that A+T-rich DNA is less able

251 to recruit H3K4me3 despite its high CpG density. The dominance of the

252 polycomb-associated H3K27me3 mark at A+T-rich CGIs was also seen when DNA

253 methylation was partially reduced by growing cells for 10days in 2i medium,

254 which enhances the pluripotent state. In line with previous reports (Ficz et al.,

255 2013; Habibi et al., 2013), we found that the average level of genomic CpG

256 methylation was reduced from ~95% to ~55% (Figure 4-figure supplement 2).

257 Whereas cells grown in serum-containing medium lacked both the tested histone

258 marks, 2i-grown cells displayed a strong increase in H3K27me3 without the

259 appearance of noticeable H3K4me3 (Figure 4-figure supplement 2). We conclude

260 that while CpG frequency is a key feature of CGIs that determines the signature

261 chromatin marks at CGIs, A+T-richness confers an intrinsic susceptibility to de

262 novo methylation that is dominant over these chromatin modifications (Figure

263 4E).

264

265 Status of endogenous DNA domains with atypical CpG or G+C composition

266 Our study assessed the role of G+C-richness versus CpG-richness on chromatin

267 structure by experimentally varying these sequence parameters individually in

268 synthetic DNA domains. The question arises: do equivalent atypical sequences

269 exist naturally in the mouse genome and if so what is their chromatin

270 modification status? We searched first for G+C-rich (≥61%), CpG deficient

271 (≤1/100bp) sequences of ≥500bp in length and identified 1,954 examples. None

272 of these coincided with bivalent chromatin in ESCs, in agreement with our

273 results using synthetic DNA. To determine if A+T-rich, CpG-rich sequences also

274 exist naturally in mice, we divided the genome into windows of 100bp and

11 275 calculated the G+C content and CpG density of each. To ensure that we selected

276 regions with profiles consistently above the thresholds (as is the case for our

277 artificial constructs) rather than short CpG dense regions flanked by AT rich

278 DNA, we subtracted windows with ≥50% GC content or <5 CpGs and searched

279 for blocks of adjoining windows. Uniformly A+T-rich and CpG-rich DNA

280 sequences of this kind were absent in the mouse genome. Even when the criteria

281 were relaxed to include sequences of only 500 base pairs in length (the

282 approximate minimum size of CGIs), we found no examples. As A+T-rich regions

283 are consistently DNA methylated, it seems likely that, in the absence of strong

284 selection, the high mutation rate of 5-methylcytosine effectively resists CpG

285 accumulation at these domains over evolutionary time.

286

287 Discussion

288

289 Epigenetic marks are often considered to be sensitive to both developmental and

290 environmental signals. Our data reinforce the complementary view that the

291 underlying genetic information, independent of nurture or developmental

292 history, plays an important role in setting up CGI chromatin structures. In ESCs

293 at least, each of the major CGI chromatin configurations is informed by DNA

294 sequence. We find that the non-methylated state of CGIs depends on G+C-rich

295 base composition and that both the signature histone marks H3K4me3 and

296 H3K27me3 depend on a high density of the non-methylated CpG dyad. The

297 importance of primary DNA sequence in determining bulk features of the

298 epigenome have been suggested elsewhere. For example, DNA sequence motifs

299 that are recognised by transcription factors strongly influence DNA methylation

12 300 patterns (Lienert et al., 2011; Stadler et al., 2011) and inter-individual variation

301 in other epigenomic features maps to sites of human DNA sequence

302 heterogeneity that are probably causal (Kasowski et al., 2013; Kilpinen et al.,

303 2013; McVicker et al., 2013). Our data show that even gross features of genomic

304 DNA, such as base composition and the frequency of the 2-base pair sequence

305 CpG, influence the epigenome.

306

307 Do our findings reflect the relationship between naturally occurring DNA

308 sequences and chromatin modification? Almost all native CGIs in mouse ESCs are

309 non-methylated at the DNA level and coincide with peaks of H3K4me3, which is

310 often seen as the signature histone mark of CGIs (Bernstein et al., 2006;

311 Thomson et al., 2010). About one third of the H3K4me3-marked CGIs in mouse

312 ES cells also carry H3K27me3 and are therefore defined as bivalent (Ku et al.,

313 2008). H3K27me3 is usually relatively dispersed compared with the discrete

314 localisation of components of the PRC2 complex, which deposit this mark. It is

315 estimated that at least 97% of peaks corresponding to the PRC2 component Ezh2

316 coincide with CGIs (Ku et al., 2008). This has led to the suggestion that polycomb

317 is targeted to G+C-rich or CpG-rich DNA (Long et al., 2013; Lynch et al., 2012;

318 Mendenhall et al., 2010; Tanay et al., 2007). It was shown previously that CpG

319 density within the non-methylated CGI fraction as a whole is proportional to the

320 H3K4me3 ChIP-seq signal in both mouse and human cells (Illingworth et al.,

321 2010). We asked whether the same relationship holds true for bivalent CGIs and

322 whether it also applies to H3K27me3. Within a set of 2,547 exclusively bivalent

323 promoters from mouse ESCs (Denissov et al., 2014; Marks et al., 2012), we found

324 that 92% coincide with CGIs. Within this bivalent group, CGI length and CpG

13 325 density both correlate positively with H3K4me3 and also H3K27me3 levels

326 (Figure 4 – figure supplement 3). In contrast to typical CGI-like sequences,

327 endogenous G+C-rich, CpG-poor sequences do not display a bivalent chromatin

328 structure. A+T-rich, CpG-rich domains of CGI-like dimensions, however, are

329 effectively absent from the mouse genome. The sum of available data argues

330 strongly that a high abundance of the dinucleotide CpG is a key precondition for

331 the formation of bivalent chromatin.

332

333 A possible mechanism for recruitment of H3K4me3 involves DNA binding by

334 H3K4 methyltransferases, each of which is associated with a CpG-binding CXXC

335 domain. In Mll1 and Mll2 the DNA binding domains are within the SET-

336 containing protein themselves (Allen et al., 2006; Cierpicki et al., 2010), whereas

337 in the case of Set1/COMPASS the CXXC domain resides in the Cfp1 protein

338 component of the multi-subunit complex (Lee and Skalnik, 2005). The H3K4me3

339 component of bivalent CGIs in ESCs depends on Mll2 (Denissov et al., 2014; Hu et

340 al., 2013). Accordingly, we have found that bivalent CGIs form normally in Cfp1-

341 /- ES cells. Since the binding of the CXXC domain to CpG is abolished by

342 methylation of the cytosine moiety (Lee et al., 2001), this may explain why DNA

343 methylation prevents the formation of H3K4me3 even in the presence of high

344 CpG densities. The mechanism responsible for recruiting PRC2, which is the

345 complex responsible for the establishment of H3K27me3, remains uncertain, as

346 no CpG binding component of PRC2 has yet been detected. (Hu et al., 2013;

347 Thomson et al., 2010; Wu et al., 2013). A recent study, however, indicates that

348 the CXXC-domain protein KDM2B can target the PRC1 complex to CGIs and

14 349 recruit PRC2 secondarily, which would accord with our findings (Blackledge et

350 al., 2014).

351

352 Bivalent chromatin has attracted attention due to the proposal that it represents

353 a poised transcriptional state in pluripotent cells (Bernstein et al., 2006; Voigt et

354 al., 2013). It is suggested that the poised state is resolved during differentiation

355 as the affected gene becomes either transcriptionally activated or repressed.

356 Recent evidence has clearly established that H3K27me and H3K4me can be

357 present on the same , albeit on different histone tails (Voigt et al.,

358 2012). The biological significance of bivalent chromatin has recently become less

359 certain, however. Whereas silent CGI promoters in ESCs grown in serum usually

360 exhibit a bivalent chromatin structure, this chromatin configuration is

361 significantly reduced in 2i medium, which discourages differentiation and is

362 thought to induce a more pronounced pluripotent state (Marks et al., 2012). Also,

363 cells in which the histone methyltransferase Mll2 is depleted lose H3K4me3 at

364 hitherto bivalent CGIs, but this has no detectable effect on the induction kinetics

365 of the associated genes upon differentiation (Hu et al., 2013). The evidence

366 remains inconclusive, but it raises the possibility that bivalency is not an

367 essential precondition for gene activation during differentiation. Here we find

368 that bivalency is not confined to poised developmental genes but is a default

369 response to any CpG-rich DNA sequences, even when these are completely

370 artificial. It is apparent from this that the bivalent chromatin structure is not

371 reserved for developmentally important genes, but is a response to general

372 features of local DNA sequence.

373

15 374 Bulk mammalian DNA has a base composition of 40% G+C and is globally

375 methylated at CpGs to an average level of ~65%, whereas G+C-rich CGIs are

376 usually DNA methylation-free. The transfection experiments reported here

377 recapitulate this distinction as G+C-rich CGI-like DNA reproducibly resisted DNA

378 methylation, whereas A+T-rich DNA reliably became densely methylated. Our

379 findings raise the possibility that this broadly binary pattern of global DNA

380 methylation may be determined by base composition. CpG density did not affect

381 this susceptibility to methylation, which occurred even when CpG occurred at

382 densities typical of endogenous CGIs. Two simple alternative explanations are

383 possible: either A+T-rich DNA attracts de novo methylation, or G+C-richness

384 excludes it, leaving the bulk genome to be methylated by default. Both Dnmt 3L

385 and Dnmt 3A have the potential to be repelled by H3K4me3 (Jia et al., 2007; Ooi

386 et al., 2007), suggesting that recruitment of H3K4 methyltransferases with CpG-

387 binding CXXC domains protects CGIs against de novo methylation. We find,

388 however, that A+T-rich CpG-rich sequences are reproducibly DNA methylated

389 despite their ability to attract H3K4me3 when DNA methylation is absent. Since

390 these sequences were initially non-methylated when transfected into cells, it

391 may be that H3K4me3 alone is not sufficient to exclude CpG methylation from

392 these regions. This would accord with our previous observation that a

393 moderately G+C-rich artificial CGI when inserted into ES cells was extensively

394 methylated despite the presence of H3K4me3 on non-methylated copies

395 (Thomson et al., 2010). Alternatively, A+T-rich CpG-rich sequences may attract

396 H3K4me3 less robustly than G+C-rich CGIs, thereby allowing access to de novo

397 DNA methyltransferases.

398

16 399 The CGI phenomenon is conserved throughout the vertebrate lineage, but

400 interestingly the G+C-richness characteristic of mammalian and bird CGIs is not

401 seen in some vertebrate groups (Long et al., 2013). Fish for example have non-

402 methylated CGI-like sequences at promoters, but these are not markedly G+C-

403 rich compared with the bulk genome (Cross et al., 1991; Long et al., 2013). Since

404 A+T-rich sequences are susceptible to de novo DNA methylation in mammals, it

405 follows that some vertebrate groups may rely on a different set of mechanisms to

406 prevent methylation at G+C-poor CGIs. Identifying potential components that

407 discriminate between mammalian CGIs and the bulk genome is a priority for

408 future work.

409

410

411 Materials and Methods

412

413 Mouse ES cell lines

414 ES cells were grown in gelatinized dishes in Glasgow MEM (Gibco) supplemented

415 with 15% fetal bovine serum (Hyclone), 1% sodium pyruvate, 1% non-essential

416 amino acids, 0.1% β-mercaptoethanol, 100U/ml penicillin, 100 μg/ml

417 streptomycin and leukemia inhibitory factor (LIF). E14TG2a ES cells were used

418 as wild-type ES cells. Cfp1-/- cells were a gift from David Skalnik and have been

419 described previously (Carlone et al., 2005). Dnmt 3a/3b double knock out mouse

420 ESC (DKOs) were as described (Okano et al, 1999). TC1-mES cells (a gift from Dr

421 Ann Dean) with the hygromycin/thymidine kinase cassette in the β-globin locus

422 (Lienert et al., 2011) were used for recombination-mediated cassette exchange.

423 Cfp1-GFP tagged mouse ES cells have been described (Denissov et al, 2014). This

17 424 integrated BAC contains the whole Cfp1 gene with regulatory elements and GFP

425 fused to the last codon of to create a C-terminal GFP tag.

426

427 Recombineering

428 Custom Perl scripts were used to create random sequences with specific

429 frequencies of CpG, G+C content and length (https://github.com/swebb1/cpg_tools).

430 Supplementary File 1 lists the CGI-like sequences used in this study. Artificial

431 DNAs were synthesised (GeneArt) and cloned into a plasmid containing a

432 selection cassette and homology arms for recombineering. CGI-constructs were

433 introduced into human gene desert BACs by recombineering using the Red/ET

434 system by Gene Bridges. Briefly, bacteria containing the gene desert BAC of

435 interest (gene desert 1 mChr1:81,106,616-81,153,886 or gene desert 2

436 mChr18:36,042,881-36,175,341) were cultivated in LB medium plus

437 chloramphenicol at 37°C o/n. On the next day 40ng Red/ET plasmid were

438 electroporated (1350 V, 10 μF, 600 Ohms) into the cells containing the BAC and

439 incubated in LB containing chloramphenicol and tetracycline at 30° for at least

440 15h. Next day 50μl of 10% L-arabinose were added to induce the expression of

441 the Red/ET recombination proteins and samples were incubated at 37°C for 1h.

442 Cells were electroporated with 200-300ng of linearized DNA containing the CGI-

443 like sequence, a kanamycin selection cassette and homology arms. Cells were re-

444 suspended in 1ml of SOC medium and recovered for 1-2h at 37°C. Cells were

445 plated on plates containing chloramphenicol and kanamycin. Plates were

446 incubated at 37° o/n. Colonies were screened by colony PCR and control digests

447 for successful recombination. The linearized BAC containing the CGI-like

448 constructs and a selection cassette was used for transfection of mESCs.

18 449

450 ES cell transfection

451 Linearized BAC DNA (0.5 to 2μg) containing the CGI-like sequences and a

452 selection cassette flanked by Frt sites was used to transfect 60% confluent

453 mouse ESC growing in a 6-well plate using LipofectamineTM LTX PlusTM

454 (Invitrogen). DNA was made up with OptiMem (GIBCO) to 500μl, 2.5μl PLUS

455 reagent were added and incubated for 5 min at RT. Afterwards 6.25μl

456 Lipofectamine were added, incubated for 30min at RT and added to the ES cells.

457 Cells were split in a range of different ratios (10-0.1% of transfected cells) 24h

458 after transfection and plated onto 10cm2 dishes. Next day, selection medium

459 containing the appropriate antibiotic (G418 250 μg/ml or Blasticidin 3 μg/ml)

460 was added and cells were grown until colonies were ready to be picked. Clones

461 were analysed for incorporation of the constructs by PCR and 2-3 independent

462 clones with low copy number integration were selected for the excision of the

463 selection cassette. Circular plasmid (50μg) containing a eukaryotic expression

464 cassette for Flp or Dre were added to 2x107 cells and electroporated at 250V and

465 500μF using a BioRad electroporator (GenePulser Xcell). Cells were left to

466 recover for 20 min at RT and seeded at different dilutions. Next day 0.8μg/ml

467 puromycin were added for 48h and cells were cultured until colonies were big

468 enough for picking. Successful excision was confirmed by Southern blotting.

469 Artificial CGIs for insertion into the beta-Globin locus via recombination

470 mediated cassette exchange were synthesised (GeneArt) with flanking inverted

471 loxp sites and cloned into pBSIISK+ and electroporated into TC-1 ES cells

472 carrying a Hygromycin/Thymidine Kinase double selection cassette in the beta-

473 Globin locus (Lienert et al, 2011). Cells (4x106) were pre-selected with

19 474 hygromycin for 10d, electroporated with 25µg L1-artificial CGI-1L construct and

475 20µg pCAGGS-Cre and selection for positive clones with 3um Ganciclovir was

476 started 2d after electroporation and continued for 8-10d. Clones were tested for

477 successful insertion of artificial CGIs by PCR screen and Southern Blot. For

478 differentiation of mESC into neural precursors cells were plated (4x106

479 cells/dish) in 15ml EB medium (ES cell medium with 10% FBS and no LIF). After

480 4 days in EB medium trans-retinoic acid (Sigma) was added to start neuronal

481 differentiation. Medium was changed every 2 days. On day 8 EBs were disrupted,

482 trypsinized and used for formaldehyde crosslinking.

483

484 Chromatin Immunoprecipitation

485 Chemical crosslink of chromatin was performed for 10 minutes at room

486 temperature by addition of formaldehyde to a final concentration of 1%.

487 Crosslinking was stopped by addition of glycine to a final concentration of 0.125

488 mM. After 5 minutes incubation cells were washed twice with ice-cold 1x PBS. If

489 acetylation was examined, sodium butyrate was added to the PBS. Cells were

490 centrifuged for 5 minutes at 330 g at 4°C and washed once in wash buffer A

491 (0.25% Triton X-100, 10mM EDTA pH 8.0, 0.5mM EGTA pH 7.5, 10mM, HEPES

492 pH 7.5) and B (0.2M NaCl, 1mM EDTA pH 8.0, 0.5mM EGTA pH 7.5, 10mM HEPES

493 pH 7.5). Each buffer contained PMSF to a final concentration of 1μg/ml, 1x

494 Proteinase Inhibitor Complete Mix (Roche) and 10mM sodium butyrate. Cells

495 were re-suspended in lysis buffer (20% SDS, 0.5M EDTA, 1M Tris pH 8.1, 1X

496 complete protease inhibitors, 1μg/ml PMSF, 10mM sodium butyrate). Chromatin

497 was sheared to 400-600 base pair fragments by sonication of the cell lysate using

20 498 a twin Bioruptor (Diagenode) for 15 cycles, 30 sec on, 30 sec off on high setting.

499 The lysate was centrifuged for 10 minutes at 16,000 g at 4°C to collect cellular

500 debris. Chromatin concentration was measured on a Nanodrop

501 spectrophotometer. Chromatin (30μg or 100μg) was used for ChIP with

502 antibodies against histone modifications or other proteins. For the ChIP,

503 chromatin was made up to 100μl with lysis buffer and 900μl of dilution buffer

504 (20mM Tris-HCl, pH 8, 150mM NaCl, 1% Triton X-100, 1mM EDTA) were added.

505 Antibodies were added as specified and samples were incubated o/n at 4 °C on a

506 rotating wheel. The next day debris was removed by centrifugation for 5min at

507 16000 x g and 50μl dynabeads protein G (Life Technologies) were added to the

508 supernatant of the IPs and samples were incubated on a rotating wheel for 2

509 hours at 4 °C. IPs were washed 3x with 1ml ice-cold wash buffer 1 (20mM Tris-

510 HCl, pH 8, 150mM NaCl, 1% Triton X-100, 1mM EDTA, 0.1% SDS), 2x with ice-

511 cold wash buffer 2 (20mM Tris-HCl, pH 8, 500mM NaCl, 1% Triton X-100, 1mM

512 EDTA, 0.1% SDS) and 1x with ice-cold TE. 100μl or 50 μl of freshly prepared 10%

513 Chelex 100 Resin (100-200 mesh, BioRad) were added to samples or input

514 respectively. Samples were boiled for 12 min, cooled to RT. Proteinase K (2μl of

515 20mg/ml) was added and incubated at 55 °C for 30’ while shaking. After heating

516 to 100°C, samples were spun down and 60μl supernatant were transferred to

517 fresh tubes. Each sample was made up to 300μl total volume with 10mM Tris

518 pH8, 0.1 mM EDTA and assayed by q-PCR.

519

520 qPCR

521 Supplementary File 1 lists the qPCR primers used in this study. qPCR was used to

522 assess enrichment of specific regions in ChIP samples and to determine the copy

21 523 number of BAC DNA inserted into the mouse genome. qPCR reactions (10μl)

524 contained SYBR Green SensiMix (Bioline), 250nM primers and 3µl ChIP DNA or

525 100ng DNA for the determination of copy numbers. PCR was carried out using a

526 Roche Lightcycler and cycling conditions were as follows; initial denaturation at

527 94°C for 10min followed by 45 cycles of denaturation at 94°C for 10 sec, primer

528 annealing for 10 sec and primer extension at 72°C for 15 sec. Using the Roche

529 Lightcycler software, SYBR Green fluorescence measurements were plotted

530 relative to cycle number and the 2nd derivative maximum method was used to

531 determine the cycle threshold values (Ct) for each sample. Values for duplicate of

532 triplicate ChIP samples were calculated as %-input and copy number of artificial

533 CGIs was assessed relative to Sox2 qPCR signal.

534

535 Antibodies

536 α-H3K4me3 (Abcam-8580), α-H3K4me1(Abcam-8895), α-H3K27me3 (Millipore-

537 07-449), -H3K9/K14ac (Abcam 12179), -H3 (Abcam 1791), -SUZ12 (Abcam

538 12073-100), -RNA Pol II N20 (Santa-Cruz -899), -RNA Pol II S5P (Abcam

539 5131), -RNA Pol II unphosphorylated CTD (Abcam 817), -GFP (Chromotec

540 GFP-TRAP-A gta-20), -IgG (Invitrogen 10500C).

541

542 Bisulfite genomic sequencing

543 Bisulfite conversion of genomic DNA was carried out using the EpiTect Bisulfite

544 Kit from Qiagen. The converted DNA was used for PCR amplification of regions of

545 interest, Supplementary File 2 lists the CGI-like sequences used in this study. PCR

546 products were gel-purified and cloned using the Stratagene blunt end cloning kit.

547 Positive clones were sent for sequencing.

22 548

549 Acknowledgments

550 We thank Ann Dean, Dirk Schubeler and David Skalnik for sharing cell lines,

551 Sukhdeep Singh for bioinformatic insights and Martha Koerner, Matt Lyst and

552 Sabine Lagger for critical comments on the manuscript. The work was supported

553 by a Programme Grant from the Wellcome Trust. E.W. was funded by Wellcome

554 Trust 4 year PhD studentships and T.Q. holds a Marie Curie fellowship.

555

556 Author contributions

557 E.W. created and analysed cell lines and interpreted data. A.A created High CpG

558 Medium G+C cell lines and analysed and interpreted data. T.Q. created and

559 analysed cell lines in the beta-globin locus and analysed data. C.M. carried out

560 selection cassette excision, bisulfite sequencing, southern blot analysis and data

561 analysis and interpretation. S.W. provided extensive informatic support for

562 designing constructs and analysing native CGI chromatin. F.S. shared

563 unpublished cell lines and contributed ideas and analysis. E.W., T.Q. and A.B.

564 conceived the project and wrote the manuscript. All authors commented on the

565 final version of the manuscript.

566

567 Competing financial interests

568 The authors declare no competing financial interests.

569

570

23 571 References

572

573 Allen, M.D., Grummitt, C.G., Hilcenko, C., Min, S.Y., Tonkin, L.M., Johnson, C.M.,

574 Freund, S.M., Bycroft, M., and Warren, A.J. (2006). Solution structure of the

575 nonmethyl-CpG-binding CXXC domain of the leukaemia-associated MLL histone

576 methyltransferase. The EMBO journal 25, 4503-4512. 7601340 [pii]

577 10.1038/sj.emboj.7601340

578 Azuara, V., Perry, P., Sauer, S., Spivakov, M., Jorgensen, H.F., John, R.M., Gouti, M.,

579 Casanova, M., Warnes, G., Merkenschlager, M., et al. (2006). Chromatin signatures

580 of pluripotent cell lines. Nature cell biology 8, 532-538. 10.1038/ncb1403

581 Bernstein, B.E., Mikkelsen, T.S., Xie, X., Kamal, M., Huebert, D.J., Cuff, J., Fry, B.,

582 Meissner, A., Wernig, M., Plath, K., et al. (2006). A bivalent chromatin structure

583 marks key developmental genes in embryonic stem cells. Cell 125, 315-326.

584 Bird, A.P. (1986). CpG-rich islands and the function of DNA methylation. Nature

585 321, 209-213.

586 Blackledge, N.P., Farcas, A.M., Kondo, T., King, H.W., McGouran, J.F., Hanssen, L.L.,

587 Ito, S., Cooper, S., Kondo, K., Koseki, Y., et al. (2014). Variant PRC1 complex-

588 dependent H2A ubiquitylation drives PRC2 recruitment and polycomb domain

589 formation. Cell 157, 1445-1459. 10.1016/j.cell.2014.05.004

590 Blackledge, N.P., and Klose, R. (2011). CpG island chromatin: a platform for gene

591 regulation. 6, 147-152. 13640 [pii]

592 Blackledge, N.P., Zhou, J.C., Tolstorukov, M.Y., Farcas, A.M., Park, P.J., and Klose,

593 R.J. (2010). CpG islands recruit a histone H3 lysine 36 demethylase. Mol Cell 38,

594 179-190. 10.1016/j.molcel.2010.04.009

24 595 Carlone, D.L., Lee, J.H., Young, S.R., Dobrota, E., Butler, J.S., Ruiz, J., and Skalnik,

596 D.G. (2005). Reduced genomic cytosine methylation and defective cellular

597 differentiation in embryonic stem cells lacking CpG binding protein. Mol Cell Biol

598 25, 4881-4891. 25/12/4881 [pii]

599 10.1128/MCB.25.12.4881-4891.2005

600 Cierpicki, T., Risner, L.E., Grembecka, J., Lukasik, S.M., Popovic, R., Omonkowska,

601 M., Shultis, D.D., Zeleznik-Le, N.J., and Bushweller, J.H. (2010). Structure of the

602 MLL CXXC domain-DNA complex and its functional role in MLL-AF9 leukemia.

603 Nat Struct Mol Biol 17, 62-68. nsmb.1714 [pii]

604 10.1038/nsmb.1714

605 Clouaire, T., Webb, S., Skene, P., Illingworth, R., Kerr, A., Andrews, R., Lee, J.H.,

606 Skalnik, D., and Bird, A. (2012). Cfp1 integrates both CpG content and gene

607 activity for accurate H3K4me3 deposition in embryonic stem cells. Genes &

608 development 26, 1714-1728. 10.1101/gad.194209.112

609 Cross, S., Kovarik, P., Schmidtke, J., and Bird, A.P. (1991). Non-methylated islands

610 in fish genomes are GC-poor. Nucleic Acids Research 19, 1469-1474.

611 Deaton, A.M., and Bird, A. (2011). CpG islands and the regulation of transcription.

612 Genes & development 25, 1010-1022. 25/10/1010 [pii]

613 10.1101/gad.2037511

614 Denissov, S., Hofemeister, H., Marks, H., Kranz, A., Ciotta, G., Singh, S.,

615 Anastassiadis, K., Stunnenberg, H.G., and Stewart, A.F. (2014). Mll2 is required for

616 H3K4 trimethylation on bivalent promoters in embryonic stem cells, whereas

617 Mll1 is redundant. Development 141, 526-537. 10.1242/dev.102681

618 Farcas, A.M., Blackledge, N.P., Sudbery, I., Long, H.K., McGouran, J.F., Rose, N.R.,

619 Lee, S., Sims, D., Cerase, A., Sheahan, T.W., et al. (2012). KDM2B links the

25 620 Polycomb Repressive Complex 1 (PRC1) to recognition of CpG islands. Elife 1,

621 e00205. 10.7554/eLife.00205

622 00205 [pii]

623 Ficz, G., Hore, T.A., Santos, F., Lee, H.J., Dean, W., Arand, J., Krueger, F., Oxley, D.,

624 Paul, Y.L., Walter, J., et al. (2013). FGF signaling inhibition in ESCs drives rapid

625 genome-wide demethylation to the epigenetic ground state of pluripotency. Cell

626 Stem Cell 13, 351-359. 10.1016/j.stem.2013.06.004

627 Guenther, M.G., Levine, S.S., Boyer, L.A., Jaenisch, R., and Young, R.A. (2007). A

628 chromatin landmark and transcription initiation at most promoters in human

629 cells. Cell 130, 77-88.

630 Habibi, E., Brinkman, A.B., Arand, J., Kroeze, L.I., Kerstens, H.H., Matarese, F.,

631 Lepikhov, K., Gut, M., Brun-Heath, I., Hubner, N.C., et al. (2013). Whole-genome

632 bisulfite sequencing of two distinct interconvertible DNA methylomes of mouse

633 embryonic stem cells. Cell Stem Cell 13, 360-369. 10.1016/j.stem.2013.06.002

634 Hu, D., Garruss, A.S., Gao, X., Morgan, M.A., Cook, M., Smith, E.R., and Shilatifard, A.

635 (2013). The Mll2 branch of the COMPASS family regulates bivalent promoters in

636 mouse embryonic stem cells. Nat Struct Mol Biol 20, 1093-1097.

637 10.1038/nsmb.2653

638 Illingworth, R.S., Gruenewald-Schneider, U., Webb, S., Kerr, A.R., James, K.D.,

639 Turner, D.J., Smith, C., Harrison, D.J., Andrews, R., and Bird, A.P. (2010). Orphan

640 CpG islands identify numerous conserved promoters in the mammalian genome.

641 PLoS Genet 6. 10.1371/journal.pgen.1001134

642 Jia, D., Jurkowska, R.Z., Zhang, X., Jeltsch, A., and Cheng, X. (2007). Structure of

643 Dnmt3a bound to Dnmt3L suggests a model for de novo DNA methylation.

644 Nature 449, 248-251. 10.1038/nature06146

26 645 Kasowski, M., Kyriazopoulou-Panagiotopoulou, S., Grubert, F., Zaugg, J.B.,

646 Kundaje, A., Liu, Y., Boyle, A.P., Zhang, Q.C., Zakharia, F., Spacek, D.V., et al. (2013).

647 Extensive variation in chromatin states across humans. Science 342, 750-752.

648 10.1126/science.1242510

649 Kilpinen, H., Waszak, S.M., Gschwind, A.R., Raghav, S.K., Witwicki, R.M., Orioli, A.,

650 Migliavacca, E., Wiederkehr, M., Gutierrez-Arcelus, M., Panousis, N.I., et al. (2013).

651 Coordinated effects of sequence variation on DNA binding, chromatin structure,

652 and transcription. Science 342, 744-747. 10.1126/science.1242463

653 Ku, M., Koche, R.P., Rheinbay, E., Mendenhall, E.M., Endoh, M., Mikkelsen, T.S.,

654 Presser, A., Nusbaum, C., Xie, X., Chi, A.S., et al. (2008). Genomewide analysis of

655 PRC1 and PRC2 occupancy identifies two classes of bivalent domains. PLoS Genet

656 4, e1000242. 10.1371/journal.pgen.1000242

657 Lee, J.H., and Skalnik, D.G. (2005). CpG-binding protein (CXXC finger protein 1) is

658 a component of the mammalian Set1 histone H3-Lys4 methyltransferase

659 complex, the analogue of the yeast Set1/COMPASS complex. The Journal of

660 biological chemistry 280, 41725-41731.

661 Lee, J.H., Voo, K.S., and Skalnik, D.G. (2001). Identification and characterization of

662 the DNA binding domain of CpG-binding protein. The Journal of biological

663 chemistry 276, 44669-44676.

664 Lienert, F., Wirbelauer, C., Som, I., Dean, A., Mohn, F., and Schubeler, D. (2011).

665 Identification of genetic elements that autonomously determine DNA

666 methylation states. Nature genetics 43, 1091-1097. 10.1038/ng.946

667 ng.946 [pii]

668 Long, H.K., Sims, D., Heger, A., Blackledge, N.P., Kutter, C., Wright, M.L., Grutzner,

669 F., Odom, D.T., Patient, R., Ponting, C.P., et al. (2013). Epigenetic conservation at

27 670 gene regulatory elements revealed by non-methylated DNA profiling in seven

671 vertebrates. Elife 2, e00348. 10.7554/eLife.00348

672 Lynch, M.D., Smith, A.J., De Gobbi, M., Flenley, M., Hughes, J.R., Vernimmen, D.,

673 Ayyub, H., Sharpe, J.A., Sloane-Stanley, J.A., Sutherland, L., et al. (2012). An

674 interspecies analysis reveals a key role for unmethylated CpG dinucleotides in

675 vertebrate Polycomb complex recruitment. The EMBO journal 31, 317-329.

676 10.1038/emboj.2011.399

677 Marks, H., Kalkan, T., Menafra, R., Denissov, S., Jones, K., Hofemeister, H., Nichols,

678 J., Kranz, A., Stewart, A.F., Smith, A., et al. (2012). The transcriptional and

679 epigenomic foundations of ground state pluripotency. Cell 149, 590-604.

680 10.1016/j.cell.2012.03.026

681 McVicker, G., van de Geijn, B., Degner, J.F., Cain, C.E., Banovich, N.E., Raj, A.,

682 Lewellen, N., Myrthil, M., Gilad, Y., and Pritchard, J.K. (2013). Identification of

683 genetic variants that affect histone modifications in human cells. Science 342,

684 747-749. 10.1126/science.1242429

685 Mendenhall, E.M., Koche, R.P., Truong, T., Zhou, V.W., Issac, B., Chi, A.S., Ku, M.,

686 and Bernstein, B.E. (2010). GC-rich sequence elements recruit PRC2 in

687 mammalian ES cells. PLoS Genet 6, e1001244. 10.1371/journal.pgen.1001244

688 Okano, M., Bell, D.W., Haber, D.A., and Li, E. (1999). DNA methyltransferases

689 Dnmt3a and Dnmt3b are essential for de novo methylation and mammalian

690 development. Cell 99, 247-257.

691 Ooi, S.K., Qiu, C., Bernstein, E., Li, K., Jia, D., Yang, Z., Erdjument-Bromage, H.,

692 Tempst, P., Lin, S.P., Allis, C.D., et al. (2007). DNMT3L connects unmethylated

693 lysine 4 of histone H3 to de novo methylation of DNA. Nature 448, 714-717.

28 694 Stadler, M.B., Murr, R., Burger, L., Ivanek, R., Lienert, F., Scholer, A., van

695 Nimwegen, E., Wirbelauer, C., Oakeley, E.J., Gaidatzis, D., et al. (2011). DNA-

696 binding factors shape the mouse methylome at distal regulatory regions. Nature

697 480, 490-495. 10.1038/nature10716

698 nature10716 [pii]

699 Tanay, A., O'Donnell, A.H., Damelin, M., and Bestor, T.H. (2007). Hyperconserved

700 CpG domains underlie Polycomb-binding sites. Proc Natl Acad Sci U S A 104,

701 5521-5526. 10.1073/pnas.0609746104

702 Thomson, J.P., Skene, P.J., Selfridge, J., Clouaire, T., Guy, J., Webb, S., Kerr, A.R.,

703 Deaton, A., Andrews, R., James, K.D., et al. (2010). CpG islands influence

704 chromatin structure via the CpG-binding protein Cfp1. Nature 464, 1082-1086.

705 nature08924 [pii]

706 10.1038/nature08924

707 Voigt, P., LeRoy, G., Drury, W.J., 3rd, Zee, B.M., Son, J., Beck, D.B., Young, N.L.,

708 Garcia, B.A., and Reinberg, D. (2012). Asymmetrically modified . Cell

709 151, 181-193. 10.1016/j.cell.2012.09.002

710 Voigt, P., Tee, W.W., and Reinberg, D. (2013). A double take on bivalent

711 promoters. Genes & development 27, 1318-1338. 10.1101/gad.219626.113

712 Wu, X., Johansen, J.V., and Helin, K. (2013). Fbxl10/Kdm2b recruits polycomb

713 repressive complex 1 to CpG islands and regulates H2A ubiquitylation. Mol Cell

714 49, 1134-1146. 10.1016/j.molcel.2013.01.016

715

716

29 717 Figure legends

718

719 Figure 1. A novel bivalent chromatin domain is formed at promoter-less

720 artificial CGI-like sequences integrated within a gene desert in mouse ESCs.

721 (A) CpG frequency and G+C content of CGIs in the mouse genome (blue circles)

722 and an equivalent number of equal-sized (1000 base pair) random fragments of

723 bulk genomic DNA (red circles). (B) Map of human gene desert 2 (grey bars;

724 Chr1:81,106,616-81,153,886) showing the integration site of the Artificial CGI-

725 like construct (purple box). Black boxes at the ends indicate bacterial BAC

726 sequences. Black bars above indicate the position of Q-PCR amplicons (not to

727 scale). (C) Representative anti-H3K4me3 and H3K27me3 ChIP profiles

728 (normalized to H3 ChIP) and Suz12 ChIP profiles (% Input; n=3) for 3

729 independently transfected cell lines. Shaded box includes primers spanning the

730 Artificial CGI. ChIP control amplicons are derived from the TSS of the active

731 genes Sox2 (S) and GAPDH (G); the TSS of bivalent gene Hoxc8 (H) and an

732 inconspicuous negative control region on mouse chromosome 15 (-). Error bars

733 indicate the standard deviation of PCR replicates. (D) Bisulfite sequencing of the

734 3 cell lines shown in (C). In the map above, blue strokes show CpGs in the CGI-

735 like insert and the clear box indicates the bisulfite amplicon. Methylated and

736 unmethylated CpGs are depicted as filled and open circles, respectively. This

737 figure has two supplements:

738 Supplement 1 - Bivalent chromatin at artificial CGI-like sequences in mouse ESCs.

739 Supplement 2 - Synthetic DNA elements with different sequence properties.

740

741

30 742 Figure 2. H3K4me3 at a promoter-less artificial CGI forms independently of

743 Cfp1 and RNA polymerase II

744 (A) Map of gene desert 2 with integrated Artificial CGI-like construct labeled as

745 in Figure 1B. Representative ChIP with an antibody specific for the N-terminus of

746 RNA polymerase II for 3 independent cell lines (% Input over IgG; n=2). (B)

747 Representative anti-H3K9/K14 acetylation ChIP profiles for 3 independently

748 transfected cell lines normalized to H3 ChIP (n=2). (C) Mouse ES cells expressing

749 GFP-tagged Cfp1 were transfected with Artificial CGI construct and bound Cpf1

750 was assayed by ChIP with anti-GFP antibodies in 3 independent cell lines (n=2).

751 (D) Representative anti-H3K3me3 ChIP for three independent Cfp1-/- mouse ES

752 cells transfected with the Artificial CGI construct (n=2). Control ChIP amplicons

753 are as in Figure 1C. Error bars indicate standard deviation of PCR replicates. This

754 figure has one supplement:

755 Supplement 1 - H3K4me3 at an artificial CGI without Cfp1 or RNA polymerase II

756

757

758

759 Figure 3. High G+C content is not sufficient to create a bivalent chromatin

760 domain

31 761 Map of gene desert 2 showing the integration site of the Low CpG / High G+C (L-

762 CpG / H-G+C) construct labeled as in Figure 1B. Representative anti-H3K3me3

763 and H3K27me3 ChIP profiles (normalized to H3) and Suz12 ChIP profiles (%

764 Input; n=3) are shown for three independent transfected cell lines. Shaded bar

765 includes primers spanning the Low CpG / High G+C construct. Control ChIP

766 amplicons are as in Figure 1C. Error bars indicate standard deviation of PCR

767 replicates.

768

769

770 Figure 4. CpG-rich DNA sequences on an A+T-rich background fail to form

771 bivalent chromatin and reproducibly acquire DNA methylation

772 (A) Above: Map of gene desert 2 indicating the integration site of the High CpG /

773 Low G+C 1 (H-CpG / L-G+C 1) construct as in Figure 1B. Representative anti-

774 H3K4me3, H3K27me3 and Suz12 ChIPs shown (n=3) for each of two

775 independently transfected cell lines (3rd line not shown). The shaded bar

776 includes primers spanning the High CpG / Low G+C 1 construct. (B) Bisulfite

777 sequence analysis of the two cell lines shown in (A). Clear box indicates bisulfite

778 amplicon. In the map above, blue strokes show CpGs in the CGI-like insert and

779 the clear box indicates the bisulfite amplicon. Methylated and unmethylated

780 CpGs are depicted as filled and open circles, respectively. (C) The High CpG / Low

781 G+C construct was integrated into Dnmt 3a-/- Dnmt 3b-/- double mutant mouse

782 ES cells. Representative H3K4me3, H3K27me3 and Suz12 ChIPs are shown (n=3).

783 Upper right panel shows bisulfite sequence analysis of a cell line containing the

784 High CpG / Low G+C construct in Dnmt 3a/b -/- cells, presented as in panel (B).

785 (D) The relationship between G+C content of constructs analysed in this study

32 786 and their DNA methylation status. Data for Mecp2-eGFP and Nanog-PuroGFP

787 refer to cell lines reported previously (Thomson et al, 2010), but reanalyzed for

788 this study. (E) Diagrams depicting the influence of CGI sequence composition on

789 chromatin structure. Upper panel: Sequences with high CpG frequency and high

790 C+G content attract both H3K4 and H3K27 methyltransferase to establish

791 bivalent chromatin domains and they remain unmethylated. SET1A/1B and

792 MLL1/2 complexes contain CXXC domains that may target H3K4me3 to CGIs. The

793 mechanism by which the PRC2 complex is targeted is unknown. Middle panel:

794 Without CpGs, H3K4 and K27 methyltransferases are not recruited even when

795 the DNA is G+C-rich. Lower panel: A+T-rich DNA fails to form a bivalent

796 chromatin structure, even when the CpG density is high and is consistently

797 subject to de novo methylation.

798 This figure has the following supplements:

799 Supplement 1: CpG-rich, A+T-rich DNA sequences do not form bivalent

800 chromatin.

801 Supplement 2: DNA methylation at CpG-rich, A+T-rich DNA sequences blocks

802 bivalent chromatin.

803

804 Legends to Figure supplements

805

806 Figure 1 - figure supplement 1.

807 Bivalent chromatin at artificial CGI-like sequences in mouse ESCs.

808 Schematic representation of the experimental protocol for insertion of artificial

809 CGIs into the mouse ESC genome. A linearized plasmid containing the CGI like

810 sequence, a selection cassette flanked by FRT sites, a single LoxP site and 5’ and

33 811 3’ homology regions were integrated into a human gene desert within a BAC via

812 recombineering. The linearized BAC was then transfected into mouse ESCs.

813 Colonies containing the BAC were screened for clones with low copy number

814 integration and transfected with Flp to excise the selection cassette. Successful

815 excision was confirmed by PCR and Southern blotting. Selected clones were used

816 for analysis of chromatin modification and DNA methylation. (B) Diagram above

817 shows the CGI like sequence PuroGFP integrated into a human gene desert 1

818 (mChr18:36,042,881-36,175,341) and the position of primer pairs used for ChIP

819 (not to scale). Representative H3K4me3, H3K27me3 and Suz12 ChIP qPCR

820 experiments (n=2). The shaded box includes primers spanning the PuroGFP

821 construct. (C) Diagram above indicates the mouse ß-globin locus showing the

822 integration site of the Artificial CGI 2 construct. Black bars indicate position of

823 primers used for qPCR. H3K4me3, H3K27me3 and Suz12 ChIP analysis are

824 shown for a representative cell line (data for 2 other cell lines not shown).

825 Shading indicates primers spanning the Artificial CGI 2. (D) Bisulfite sequencing

826 of the Artificial CGI 2 construct. In the map above, blue strokes show CpGs in the

827 CGI-like insert and the clear box indicates the bisulfite amplicon. Methylated and

828 unmethylated CpGs are depicted as filled and open circles, respectively. ChIP

829 control amplicons as described in Figure 1C but including the TSS of the active

830 gene ß-actin. Error bars indicate standard deviation of PCR replicates.

831

832 Figure 1 - figure supplement 2.

833 Synthetic DNA elements with different sequence properties.

34 834 (A) Sequence profiles of CGI-like constructs flanked on either side by 1 kb of

835 human gene desert. Upper panels: % G+C plots; lower panels: CpGs-per-100bp

836 plots. X-axis length shows length in base pairs (bp).

837

838 Figure 1 – source data 1

839 Table showing properties of constructs used in these studies, including length,

840 base composition and CpG frequency (observed over expected CpG ratio; o/e) of

841 all tested constructs. Note that the o/e ratio takes the overall G+C content into

842 account and therefore the High CpG/ Low G+C sequences have a high o/e ratio.

843

844 Figure 2 - figure supplement 1.

845 H3K4me3 at an artificial CGI independent of Cfp1 and RNA polymerase II

846 (A) Scheme above shows gene desert 2 (mChr1:81,106,616-81,153,886) with

847 integration site of the Artificial CGI-like construct (see Figure 1B).

848 Representative ChIP with an antibody specific for the unphosphorylated C-

849 terminus of RNA Pol II and the Serine 5 phosphorylated C-terminus of RNA Pol II

850 for 3 independent cell lines (n=2). Shaded bar indicates primers spanning the

851 Artificial CGI. (B) Undifferentiated mouse ES cells were cultured for 4 days

852 without LIF and then another 4 days in the presence of retinoic acid (RA). All

853 panels photographed at 10x magnification. (C): H3K4me3 and H3K27me3 in

854 ESCs versus neural precursor cells for two independent cell lines (n=2). Error

855 bars indicate standard deviation of PCR replicates.

856

857 Figure 4 - figure supplement 1.

858 CpG-rich, A+T-rich DNA sequences do not form bivalent chromatin.s

35 859 (A & B) Above: diagrams (see Figure 1B) of gene desert 2 with (A) inserted A+T-

860 rich, CpG-rich construct #2 (H-CpG/L-G+C 2) and (B) medium-A+T, CpG-rich

861 construct (H-CpG/M-G+C). Below: representative anti-H3K3me3 and H3K27me3

862 ChIP normalized to H3 and Suz12 ChIP shown as enrichment over the negative

863 control region (n=3) for two independently transfected cell lines for each

864 construct. Shaded bars indicate primers spanning the CGI-like constructs. (C)

865 Scheme showing the mouse ß-globin locus (see Figure 1-figure supplement 1)

866 with integration site of the A+T-rich, CpG rich construct #3 (H-CpG/L-G+C 3)

867 construct. Black bars indicate position of primers used for Q-PCR. Representative

868 H3K4me3 and H3K27me3 ChIP analyses (data for cell lines 2 and 3 not shown).

869 Shaded bar indicates primers spanning the H-CpG/L-G+C 3 construct.

870

871 Figure 4 - figure supplement 2.

872 DNA methylation at CpG-rich, A+T-rich DNA sequences blocks bivalent

873 chromatin.

874 (A) Bisulfite sequence analysis of the H-CpG/L-G+C 2 and 3 and H-CpG/M-G+C

875 constructs. Clear box indicates bisulfite amplicon and blue vertical lines show

876 position of CpGs. Methylated CpGs are depicted as filled black circles,

877 unmethylated CpGs as empty white circles. (B) Bisulfite sequence analysis of IAP

878 elements in wt versus Dnmt 3a/3b knockout mouse ES cells. (C) Dnmt 3a/3b

879 knockout cells can form bivalent chromatin when transfected with the Artificial

880 CGI 1 (see Figure 1C). Representative H3K3me3 and H3K27me3 ChIP normalized

881 to H3 for 2 independent cell lines. (D) Partial loss of DNA methylation causes

882 increased H3K27me3 at the A+T-rich, CpG-rich CGI 1, but does not show elevated

883 H3K4me3. Wt mouse ESCs were transfected and grown for 10 days in either

36 884 normal medium or medium + 2i inhibitors. Representative H3K3me3 and

885 H3K27me3 ChIP normalized to H3 shown (second cell line not shown). (E)

886 Bisulfite sequencing of the H-CpG/L-G+C 1 after 10 days in medium +2i shows

887 reduced DNA methylation at the inserted DNA.

888

889 Figure 4 - figure supplement 3.

890 CpG density and CGI length at bivalent CGIs correlate positively with

891 H3K4me3 and H3K27me3 levels in mouse ESCs.

892 Bivalent CGIs identified in published ChIPseq analyses (Denissov et al., 2014;

893 Marks et al., 2012) were divided into four equal bins based on length or CpG

894 density and plotted against levels of H3K4me3 or H3K27me3 (read counts).

895

37 A B Gene desert 2 55.4 kb 1 2 3 4 8 9 10 11 1 kb

Artificial CGI 1 GC content 567 0.3 0.4 0.5 0.6 0.7

0.0 0.5 1.0 1.5 CpG o/e C Cell line 1 Cell line 2 Cell line 3 H3K4me3 ChIP

10 6 8 8 5 6 6 3 4 4 4 2 3 2 1.5 3 1.5

H3K4me3 / H3 / H3K4me3 1.0 2 1.0 H3K4me3 / H3 H3K4me3 / H H3K4me3 / 0.5 1 0.5 0.0 0 0.0 - 1 2 3 4 5 6 7 8 9 S G H - 1 2 3 4 5 6 7 8 9 S - 1 2 3 4 5 6 7 8 9 0 1 S G H 10 11 10 11 G H 1 1 Left Artificial Right Controls Left Artificial Right Controls Left Artificial Right Controls CGI 1 CGI1 CGI1

H3K27me3 ChIP

0.5 0.8 0.5

0.4 0.4 0.6

0.3 0.3 0.4 0.2 0.2 H3K27me3 / H3 H3K27me3 / H3K27me3 / H3 H3K27me3 /

H3K27me3 / H3 H3K27me3 / 0.2 0.1 0.1

0.0 0.0 0.0 1 2 3 4 5 6 7 8 9 - 1 2 3 4 5 6 7 8 9 S H - 1 2 3 4 5 6 7 8 9 S - 10 11 S G H 10 11 G 10 11 G H Left Artificial Right Controls Left Artificial Right Controls Left Artificial Right Controls CGI 1 CGI 1 CGI 1 Suz12 ChIP

8 8 6

6 6 4

4 4 % Input % Input % Input 2 2 2

0 0 0 1 2 3 4 5 6 7 8 9 - 1 2 3 4 5 6 7 8 9 S H 1 2 3 4 5 6 7 8 9 S - 10 11 S G H 10 11 G 10 11 G H Left Artificial Right Controls Left Artificial Right Controls Left Artificial Right Controls CGI 1 CGI 1 CGI 1

D Artificial CGI 1 CpGs

Cell line 1 Cell line 2 Cell line 3 9.2 % methylated 7.2 % methylated 12.4 % methylated

Wachter et al, 2014 Figure 1 A

Gene desert 2 55.4 kb 1 2 3 4 8 9 10 11 1 kb

Arti cial CGI 1 5 6 7

Cell line 1 Cell line 2 Cell line 3 RNA Pol II ChIP

5,00 0 8,00 0 1,00 0

4,00 0 6,00 0 800 G G G g 3,00 0 g

g 600 I I I / / / / 4,00 0 / / I I I I I I l l 2,00 0 l 400 o o o P P 2,00 0 P 1,00 0 200

0 0 0 1 2 3 4 5 6 7 8 9 0 1 - 1 2 3 4 5 6 7 8 9 0 1 S - 0 1 - 1 1 S G H 1 1 G H 1 2 3 4 5 6 7 8 9 1 1 S G H Left Arti cial Right Controls Left Arti cial Right Controls Left Arti cial Right Controls CGI 1 CGI 1 CGI 1

B H3ac ChIP

8 5 15 3 3 3 H H H

/ / / 4

6 c c c 10

14a 3 14a 14a K K K 4 / / / 9 9 9 2 K K K

3 5 3 3

2 H H H 1

0 0 0 1 2 3 4 5 6 7 8 9 0 1 - 1 2 3 4 5 6 7 8 9 0 1 - 1 2 3 4 5 6 7 8 9 0 1 - 1 1 S H 1 1 S H 1 1 S H Left Arti cial Right Controls Left Arti cial Right Controls Left Arti cial Right Controls CGI 1 CGI 1 CGI 1 C Cfp1 GFP - ChIP 0.5 0.2 0 0.1 5

0.4 0.1 5 t t t u u 0.1 0 u 0.3 np np np I I I

0.1 0 % % % 0.2 0.0 5 0.0 5 0.1

0.0 0.0 0 0.0 0 1 2 3 4 5 6 7 8 9 0 1 - 1 2 3 4 5 6 7 8 9 0 1 - 1 2 3 4 5 6 7 8 9 0 1 - 1 1 S G H 1 1 S G H 1 1 S G H Left Arti cial Right Controls Left Arti cial Right Controls Left Arti cial Right Controls CGI 1 CGI 1 CGI 1 D Cfp1 -/- ES cells

H3K4me3 ChIP 8 6 8 3 3 3 H H H

/ 6 / 6 / 3 3 3 4 e e e m m 4 m 4 4 4 4 K K K 3 3 3 2 H H 2 H 2

0 0 0 1 2 3 4 5 6 7 8 9 0 1 - 0 1 - 1 2 3 4 5 6 7 8 9 0 1 - 1 1 S G H 1 2 3 4 5 6 7 8 9 1 1 S G H 1 1 S G H Left Arti cial Right Controls Left Arti cial Right Controls Left Arti cial Right Controls CGI 1 CGI 1 CGI 1

Wachter et al, 2014 Figure 2 A Gene desert 2 55.4 kb 1 2 3 4 8 9 10 11 1 kb

Low CpG/ High G+C 5 6 7

Cell-line 1 Cell-line 2 Cell-line 3 H3K4me3 ChIP

20 50 1.4 1.2 3 3 40 3 15 H H H

1.0 / / / 30 3 3 3 e e e 0.8 10 20 m m m 4 4 4 1. 0 1.5 0.0 4 K K K 3 3 3 1.0 H H H 0. 5 0.0 2 0.5

0. 0 0.0 0.0 0 1 2 3 4 5 6 7 8 9 0 1 - 1 2 3 4 5 6 7 8 9 0 1 - 1 2 3 4 5 6 7 8 9 0 1 S - 1 1 S G H 1 1 S G H 1 1 G H Left L-CpG Right Controls Left L-CpG Right Controls Left L-CpG Right Controls H-G+C H-G+C H-G+C H3K27me3 ChIP 1.5 1.0 1.0 3 3 3 H H

H 0.8 0.8

/ / / 3 3 3 1.0 e e e 0.6 0.6 m m m 7 7 7 2 2 2 0.4

0.4 K K K 3

3 0.5 3 H H H 0.2 0.2

0.0 0.0 0.0 1 2 3 4 5 6 7 8 9 0 1 - 1 2 3 4 5 6 7 8 9 0 1 - 1 2 3 4 5 6 7 8 9 0 1 - 1 1 G S H 1 1 S G H 1 1 S G H Left L-CpG Right Controls Left L-CpG Right Controls Left L-CpG Right Controls H-G+C H-G+C H-G+C Suz12 ChIP

0.8 6 1.0

0.6 0.8 t t 4 t u u u 0.6 np np np I I I

0.4 % % % 0.4 2 0.2 0.2

0.0 0 0.0 1 2 3 4 5 6 7 8 9 0 1 - 1 2 3 4 5 6 7 8 9 0 1 - 1 2 3 4 5 6 7 8 9 0 1 S H - 1 1 G S H 1 1 G S H 1 1 G Left L-CpG Right Controls Left L-CpG Right Controls Left L-CpG Right Controls H-G+C H-G+C H-G+C

Wachter et al, 2014 Figure 3 A Gene desert 2 B 55.4 kb 1 2 3 4 8 9 10 11 High CpG / Low G+C 1 1 kb CpGs

High CpG/ Low G+C 1 Cell-line 1 Cell-line 2 93.3 % methylated 93.3 % methylated 5 6 7 Cell-line 1 Cell-line 2 H3K4me3 ChIP 8 10 3 3 8 H H

6 / / 3 3 6 e e m

m 4 4 4 K K 0.4 0.4 3 3 D H H 0.2 0.2

0.0 0.0 100 1 2 3 4 5 6 7 8 9 0 1 - 1 2 3 4 5 6 7 8 9 0 1 - 1 1 G H 1 1 G H Left H-CpG Right Controls Left H-CpG Right Controls L-G+C 1 L-G+C 1 80

H3K27me3 ChIP n

0.8 0.8 io t 60 la 3 3

y Artificial C GI 1 H H

h / t / 0.6 0.6 Artificial C GI 2 3 3 e e High C pG / Low G +C 1

Me 40

m m High C pG / Low G +C 2 7 7 0.4 0.4 % 2 2 High C pG / Low G +C 3 K K 3 3 20 High C pG / Med G +C H H 0.2 0.2 Mecp2-eGF P Nanog-P uroGF P 0.0 0.0 1 2 3 4 5 6 7 8 9 0 1 - 1 2 3 4 5 6 7 8 9 0 1 - 1 1 G H 1 1 G H 0 Left H-CpG Right Controls Left H-CpG Right Controls 30 40 50 60 70 L-G+C 1 L-G+C 1 G+C % Suz12 ChIP 0.8 2.5

2.0 E 0.6 t t u u 1.5 np np I I

0.4 SET1A/1B % % 1.0 MLL1/2 0.2 CFP1 0.5

0.0 0.0 CG AGC CG G C CG GT CG GT CG C 1 2 3 4 5 6 7 8 9 0 1 - 1 2 3 4 5 6 7 8 9 0 1 - 1 1 G H 1 1 G H ? ? Left H-CpG Right Controls Left H-CpG Right Controls JARID2

L-G+C 1 L-G+C 1 PRC2

C Dnmt 3a/3b -/- cells H3K 4me3 C hIP High CpG / Low G+C 1 10 CpGs MLL1/2 8 SET1A/1B

3 6

H CFP1

4

/ 5 % methylated

3 2 e GG ACCC AGCCC TGGGCCTGG m 4

K 0.0 4 ? 3 H 0.0 2 ? PRC2 0.0 0 1 2 3 4 5 6 7 8 9 0 1 JARID2 1 1 S G H - Left H-CpG Right Controls L-G+C 1 MLL1/2 S uz12 C hIP H3K 27me3 C hIP SET1A/1B 0.5 4 CFP1 DNMT 5

f 3A/3B 1 3

0.4 o

H m t /

r

n 3 3 e

e ? v e 0.3 o m

m CG AT TCG TATATTA CG AAT CG A h 2 7 c i 1 2 r ? 0.2 z K n

u 2 3 E

S ? H 0.1 PRC2 0.0 1 - JARID2 1 2 3 4 5 6 7 8 9 0 1 - 1 2 3 4 5 6 7 8 9 0 1 S G H 1 1 S G H 1 1 Left H-CpG Right Controls Left H-CpG Right Controls H3K4me3 Unmethylated CpG L-G+C 1 L-G+C 1 H3K27me3 Methylated CpG Wachter et al, 2014 Figure 4