bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

An All-to-All Approach to the Identification of Sequence-Specific Readers for Epigenetic DNA Modifications on Cytosine

Guang Song1,6, Guohua Wang2,6, Ximei Luo2,3,6, Ying Cheng4, Qifeng Song1, Jun Wan3, Cedric Moore1, Hongjun Song5, Peng Jin4, Jiang Qian3,7,*, Heng Zhu1,7,8,*

1Department of Pharmacology and Molecular Sciences, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA 2School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China 3Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA 4Department of Human Genetics, Emory University School of Medicine, Atlanta, GA 30322, USA 5Department of Neuroscience and Mahoney Institute for Neurosciences, University of Pennsylvania, Philadelphia, PA 19104, USA 6These authors contributed equally 7Senior author 8Lead Contact *Correspondence: [email protected] (H.Z.), [email protected] (J.Q.).

1 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

SUMMARY Epigenetic modifications of DNA in play important roles in many biological processes. Identification of readers of these epigenetic marks is a critical step towards understanding the underlying molecular mechanisms. Here, we report the invention and application of an all-to-all approach, dubbed Digital Affinity Profiling via Proximity Ligation (DAPPL), to simultaneously profile human TF-DNA interactions using mixtures of random DNA libraries carrying four different epigenetic modifications (i.e., 5-methylcytosine, 5- hydroxymethylcytosine, 5-formylcytosine, and 5-carboxylcytosine). Many that recognize consensus sequences carrying these modifications in symmetric and/or hemi-modified forms were identified for the first time. We further demonstrated that the four modifications in different sequence contexts could either enhance or suppress TF binding strength. Moreover, many modifications can also affect TF binding specificity. Furthermore, symmetric modifications showed a stronger effect to either enhance or suppress TF-DNA interactions than hemi-modifications. Finally, in vivo evidence suggested that USF1 and USF2 might regulate transcription via hydroxymethylcytosine- binding activity in weak enhancers in human embryonic stem cells.

2 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

INTRODUCTION

In mammals, DNA methylation of cytosine at the 5-position (mC) serves as a major epigenetic mechanism to regulate transcription and plays a critical role in many cellular functions, such as genomic imprinting and X--inactivation, repressing transposable elements and regulating transcription (Bird, 2002; Sharp et al., 2011). Aberrant DNA methylation is a hallmark of many human diseases and cancers. Indeed, abnormal methylation has become a potential biomarker for cancer detection, diagnosis, and prognosis used in liquid biopsy (Herceg and Hainaut, 2007). In addition, dysregulation in DNA methylation has been found to be associated with other diseases, such as schizophrenia and autism spectrum disorders (Nardone et al., 2014; Wockner et al., 2014).

Recent efforts revealed that 5mC can be oxidized by three mammalian ten eleven translocation (Tet) proteins to form 5-hydroxymethylcytosine (hmC) (Ito et al., 2011). Sequential oxidation by the Tet enzymes can further convert hmC to 5-formylcytosine (fC) and 5-carboxycytosine (caC), which would eventually lead to active DNA demethylation (He et al., 2011; Song et al., 2013; Wheldon et al., 2014). However, it remains largely unknown whether hmC, fC, and caC merely serve as intermediates in the process of active DNA demethylation or whether they have their own physiological functions (Song and Pfeifer, 2016). Recently, genome-wide mapping of 5hmC revealed that 5hmC is present in many tissues and cell types, especially enriched in embryonic stem cells and neurons (Pastor et al., 2011; Ruzov et al., 2011; Stroud et al., 2011; Sun et al., 2013; Szulwach et al., 2011). In addition to hmC, fC was also found as a stable DNA modification in mammalian genomes (Bachman et al., 2014; Shen et al., 2013; Song et al., 2013; Tan and Shi, 2012). These studies indicated that the oxidized forms of 5mC might have their own physiological functions, and identification of their “readers” and “effectors” might help elucidate their roles in various biological processes (Jin et al., 2016; Kyster Nanan, 2018; Spruijt et al., 2013; Wang et al., 2017; Wu and Zhang, 2017).

Methylated cytosine can also remain asymmetric in the mammalian genomes and such sites are referred as hemi-methylated (Xu and Corces, 2018). A prevailing view is that hemi- methylation is transient and occurs, perhaps, by chance, and that the fate of hemi- methylated DNA is to become fully methylated or unmethylated by replication-coupled dilution (Sharif and Koseki, 2018). However, two recent studies revealed that approximately 10% of CpGs remain stably hemi-methylated in embryonic stem cells and trophoblast stem cells (Sharif et al., 2016; Zhao et al., 2014). In a recent study, Xu and Corces demonstrated that elimination of hemi-methylation causes reduced the frequency of CTCF/cohesion interactions at these loci, suggesting hemi-methylation as a stable epigenetic mark regulating CTCF-mediated chromatin interactions (Xu and Corces, 2018). Intriguingly, they reported that hemi-methylated sites could be inherited over several cell divisions, suggesting that this DNA modification could happen by design and be maintained as a stable epigenetic state. Furthermore, in non-CpG context (i.e., CpA, CpT, or CpC) 5mC is asymmetrical and hemi-hmC can arise via Tet oxidation (Ficz et al., 2011; Pastor et al., 2011; Tahiliani et al., 2009; Wu and Zhang, 2011). As others and we have shown, mCpA is found more often in the gene bodies and represses gene transcription through MeCP2 binding (Gabel et al., 2015; Guo et al., 2014; Kinde et al., 2016). In the context of the palindromic CpG

3 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

dinucleotide, hmCpG presumably exists as a fully hydroxymethylated form in cells; however, they can transiently become hemi-hmC after semi-conservative DNA replication (Hashimoto et al., 2012). Following this logic, hemi-fC and hemi-5caC should also exist, at least transiently, in the mammalian genomes.

Despite acceleration of efforts to map mC and hmC at single resolution in various biological processes and species, it remains a challenge to establish the causality between different types of epigenetic DNA modifications and physiological outcomes. Therefore, the identification of ‘readers’ and ‘effectors’ for mC, hmC, fC and caC in both symmetric and hemi-forms will serve as a critical stepping stone to translate epigenetic signals into biological actions and to decipher the epigenetic ‘codes’ governing biological processes.

In the past, various types of high-throughput technologies, such as -binding array (PBA), (TF) array, yeast one-hybrid, SMiLE-seq and high-throughput SELEX, were developed to profile protein-DNA interactions (PDIs) mostly in the absence of any epigenetic modifications (Badis et al., 2009; Hu et al., 2013; Hu et al., 2009; Isakova et al., 2017; Jolma et al., 2013; Vidal et al., 1996). To identify potential readers for mC, our team was among the first to survey the majority of human TFs with 154 symmetrically methylated DNA probes (Hu et al., 2013), and later Taipale and colleagues individually screened several hundred TF proteins against symmetrically methylated random DNA libraries using SELEX (Yin et al., 2017). In addition, generic DNA sequences, carrying symmetrically modified mC, hmC, fC or caC, were used to pull down and identify potential binding proteins through MS/MS analysis (Iurlaro et al., 2013; Spruijt et al., 2013).

Although these high-throughput efforts have generated a massive amount of data, none of them can be multiplexed to exhaustively survey both the protein (e.g., the entire human TF family) and DNA spaces (e.g., a random DNA library) simultaneously. Due to these design limitations, no systematic efforts have been reported to identify readers for symmetric hmC, fC, or caC modifications, let alone readers for hemi-mC, -fC, or -caC modifications.

To overcome this huge technology bottleneck, herein we report the invention and application of Digital Affinity Profiling via Proximity Ligation (i.e., DAPPL) as an all-to-all approach to exhaustively surveying human TFs with mixtures of random DNA libraries carrying mC, hmC, fC, or caC modifications in either symmetric- or hemi-form to identify sequence- and modification-specific readers. Using specific DNA fragments to covalently barcode each of 1,235 individually purified human TF proteins, we could connect the identity of a TF protein to its captured DNA fragments via proximity ligation in a highly multiplexed reaction. Using this DAPPL approach, we identified numerous novel readers for symmetric-hmC, -fC, and - caC, and hemi-mC, -hmC, -fC, and -caC. We observed that all four modifications could either enhance or suppress TF-DNA interactions and, in some cases, could alter TF-binding specificity. We also observed and experimentally validated that symmetric modifications have a stronger effect to either enhance or suppress TF-DNA interactions than hemi- modifications. Finally, in vivo evidence suggested that USF1 and USF2 might regulate transcription via hydroxymethylcytosine-binding activity in weak enhancers in human embryonic stem cells.

4 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

RESULTS

Establishment of Digital Affinity Profiling via Proximity Ligation To create an all-to-all approach for unbiased, comprehensive profiling of TF-DNA interactions, we invented and optimized the DAPPL approach using DNA-barcoded human TFs and a random 20-mer DNA library as a proof-of-principle experiment. First, 1,615 (1,235 of which are unique) TF and co-factor proteins (Table S1) were purified as GST fusions and captured on glutathione beads. Second, a set of DNA barcode sequences were designed such that any two single nucleotide mis-incorporations, insertions, and/or deletions would not lead to misassignment of TF identity. Third, each protein was covalently tethered to a single-stranded DNA (ssDNA) oligo (i.e., anchor oligo) through a thiol–maleimide “click” chemistry linkage, which was annealed to the 3’-end of a unique barcode oligo, and converted to double-stranded DNA (dsDNA) with a DNA polymerization reaction (using DNA polymerase I Klenow fragment) on beads (Figure S1). Fourth, after the removal of free DNA oligos, an aliquot of each barcoded TF protein on beads was mixed together to generate the TF mixtures. Finally, a random 20-mer dsDNA library (20-mer) was synthesized and capped by a BsaI restriction site for proximity ligation and a fixed sequence for PCR priming (Figure 1).

To carry out the DAPPL reactions, the 20-mer DNA library (~1 X 1012 species) was incubated with the barcoded TF mixtures, followed by stringent washing steps to remove unbound DNA. After crosslinking, a Golden Gate Assembly reaction (Potapov et al., 2018) was performed to ligate the DNA barcodes of the TFs to their captured DNA fragments. The ligated DNA products were PCR-amplified with specific primers to construct libraries for deep sequencing (Figure 1).

Analysis of the DAPPL Datasets and Experimental Validation A total of ~158 million reads were obtained, 74.48% of which contained the expected DAPPL products, each comprising a TF barcode and the DNA sequence it captured. The median number of qualified unique reads per TF was 9,075. To determine the binding consensus for a given TF, we next determined its 6-mer frequencies by sliding a 6-mer window along the 20-mer sequences it captured (Figure S2). To remove possible noise signals resulting from non-specific binding to either the glutathione beads or the GST tag, we compared the 6-mer subsequence frequencies of each TF with those obtained with the barcoded GST on beads, which were included as a negative control. Those 6-mer subsequences that were found preferentially enriched by a given TF were extracted and mapped back the original 20-mer sequences, which were then used to deduce candidate consensus sequences using HOMER motif analysis (Heinz et al., 2010).

To generate library-specific thresholds for identifying significant consensus, we performed the same HOMER analysis against 10,000 randomly selected GST-bound sequences. This procedure was repeated 400 times and 400 P values of the top consensus were obtained. Only when the P value of a candidate consensus was lower than 95% of the 400 P values, was it considered significant and used for further analysis (Figure S2).

5 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

A total of 129 unique TFs produced significant consensus sequences (Table S2). Many TFs could recover their previously reported consensus sequences as shown in Figure 2A. Of the 129 TFs 84 TFs have previously known consensus sequences as archived in CIS-BP database, 69 (82.1%) of which were found to be significantly similar (P < 0.05; Tomtom motif comparison tool) to their known ones, suggesting that our DAPPL approach reliably recovered true TF binding motifs (Weirauch et al., 2014) (Figures 2B, Table S2). Furthermore, we examined the reproducibility of DAPPL by taking advantage of the redundancy in our ORF collection. Of the 237 pairs of identical TF proteins, 10 produced significant consensus sequences, based on which we calculated the Pearson correlation coefficient (PCC) of the 6-mer frequencies between each identical TF pair. All of them showed high PPC values, ranging from 0.506 to 0.995, indicating that these assays were highly reproducible (see examples in Figure 2C).

Interestingly, 45 TFs were, for the first time, identified with new consensus sequences using our DAPPL approach (Table S2). To validate these binding activities, we employed an in vitro gel-shift assay (i.e., EMSA) to validate the DAPPL results. Three randomly chosen TFs, namely TEAD2, SFMBT1, and DMTF1, were purified and incubated with their corresponding DNA motifs end-labeled with Cy5 (Figure 2D and Table S3). Without exception, all three TF proteins formed complexes with their newly identified consensus, while they did not show any detectable binding activities with the negative control DNA fragments. The EMSA results were further supported by binding kinetics and affinity studies using a real-time, label-free method (i.e., OCTET) (Figure 2E).

Taken together, we developed an all-to-all approach (i.e., DAPPL) that allowed us to simultaneously identify protein-DNA interactions in vitro in a highly multiplexed fashion. Our bioinformatics analyses and experimental validation demonstrated that DAPPL is highly reproducible and capable of finding both known and novel motifs with high fidelity.

Novel Epigenetic Modification-Associated TF Binding Activities To harness the power of DAPPL, we applied it to identify TF-DNA interactions in the context of four epigenetic modifications, including mC, hmC, fC, and caC in both symmetric and hemi forms. Because the complete modifications of hmC, fC, or caC cannot be obtained with an enzymatic reaction, four random DNA libraries were synthesized, each of which carried a CpG site with mC, hmC, fC or caC flanked by two 8-mer random sequences; a three-nucleotide barcode was added to each sequence to identify the modification (Figure S3A). To ensure symmetric modifications at the middle CpG sites, the complementary strands of each library were synthesized with Klenow reaction in the presence of the corresponding modified dCTPs (i.e., 5-methyl-dCTP; 5-hydroxymethyl-dCTP; 5-formyl- dCTP; 5-carboxy-dCTP). For comparison, an unmodified library of the same design was also synthesized (Figure S3B). An equal amount of the five DNA libraries was mixed to generate the DNA reaction mixture with symmetric modifications (Figure S3C). In parallel, the four hemi-modified DNA libraries and the unmodified library were also synthesized and mixed at an equal molar ratio.

The two library mixtures carrying the symmetric or hemi modifications were then separately incubated with the TF mixtures to carry out the DAPPL reactions, as described above

6 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

(Figure 1, S2 and S3C). Using the same computational approach, 44, 97, 84, and 107 TFs were identified that recognized specific DNA consensus sequences containing symmetric modifications of mC, hmC, fC, and caC, respectively (left panel; Figure 3A and Table S4). Similarly, 99, 103, 118, and 139 TFs were identified to recognize specific DNA consensus sequences containing hemi-modifications of mC, hmC, fC, and caC, respectively (right panel; Figure 3A and Table S4).

Venn diagram analysis revealed that many TFs could recognize both symmetric and hemi modifications (Figure 3B). Interestingly, the newly identified binding activities to both symmetric- and hemi-mC, -hmC, -fC, and -caC modifications are widespread across all of the major TF subfamilies (Figure 3C, Table S4). Overall, more TFs were found to interact with hemi-modified DNA consensus sequences than symmetrically modified ones. Furthermore, no significant preference to a particular symmetric or hemi-modification type was observed within each major TF subfamily (two right panels; Figure 3C), suggesting potentially rapid evolution of this type of binding specificity.

Perhaps not surprisingly, we observed that highly conserved homologs often recognized highly similar consensus sequences and shared the same preference for a particular modification, indicating that these newly discovered activities are conserved among closely related homologs. For example, MEIS1/2/3 all recognized a similar consensus carrying 5caC in a sequence of 5’-TGTcaCG in both symmetric and hemi forms (Figure 3D).

Three Scenarios of Epigenetic Modification Impact on TF-DNA Interactions To examine how different epigenetic modifications affect TF-DNA interactions, we compared 6-mer subsequence frequencies obtained with each modified library against the unmodified counterparts. Three major trends were observed across all four epigenetic modifications. First, TF-DNA binding strength was enhanced by a particular modification (Figure 4, and S4A). For example, many 6-mer subsequences captured by HMBOX1 were significantly enriched in the symmetric carboxylated library, suggesting that carboxylation enhanced HMBOX1-DNA interactions. Second, TF-DNA binding strength was suppressed by a particular modification. For instance, the 6-mer frequencies of OVOL2 were found significantly lower with all four symmetric modification libraries, suggesting that these modifications significantly suppressed OVOL2-DNA interactions. In the third scenario, the interactions between TFs and DNA were not significantly affected by any modifications, showing similar binding strength regardless of the modifications (Figure 4, and S4A). Moreover, some modifications could enhance and suppress the binding strength of a particular TF depended on the sequence context (labeled as bimodal in Figure 4A). For example, formylation could enhance the binding strength of ETS1 in a consensus of 5’- CGGAfCGTA, while reducing in a different consensus of 5’-CCGGAAGT (Figure S4A).

Interestingly, different modifications exhibited different impacts on the same proteins. For example, MEIS2 was found to prefer methylated and carboxylated DNA, while hydroxymethylation and formylation reduced its binding strength (Row 32; Figure 4A). Crx, on the other hand, showed strong binding strength to carboxylated DNA while the other three modifications greatly suppressed its binding strength (Row 35; Figure 4A). Indeed, this phenomenon is widespread among many TFs as summarized in Figure 4A. Overall, 14, 1,

7 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

7, and 28 TFs were found to prefer 5mC, 5hmC, 5fC, and 5caC modifications, respectively, whereas methylation, hydroxymethylation, formylation, and carboxylation suppressed binding activities of 21, 24, 25, and 25 TFs, respectively. On the other hand, 9, 8, 6, and 2 TFs were insensitive respectively to methylation, hydroxymethylation, formylation, or carboxylation (i.e., “No preference” in Figure 4A). The same major trends were also observed for hemi-modifications (Figures S4B and S5).

To experimentally validate the observed preference, TBX2, HMBOX1, OVOL2, and DMTF1 were purified to perform EMSA analysis because each of them showed a distinct behavior to the various epigenetic modifications. Using the consensus sequences obtained from our bioinformatics analysis, TBX2 demonstrated stronger binding strength to methylated and formylated consensus sequences than to the unmodified counterpart, in a close agreement with the DAPPL results. However, neither hydroxymethylation nor carboxylation enhanced TBX2’s binding to the same consensus (upper panel; Figure 4B). In the case of HMBOX1, carboxylation in a consensus of 5’-CGACTAA substantially enhanced its binding activity as compared with unmodified, methylated, hydroxymethylated, and formylated counterparts (2nd panel; Figure 4B). On the other hand, OVOL2 preferred the unmodified DNA consensus, whereas all the other reduced or completely inhibited its binding activity (3rd panel; Figure 4B). Similarly, all four modifications were confirmed to suppress DMTF1’s binding activity as compared with the un-modified counterpart, although carboxylation showed the least effect (bottom panel; Figure 4B). Overall, all of the EMSA assays validated our DAPPL results, suggesting that different modifications could impose differential impacts to binding activities of the same TFs.

Impact of Different Epigenetic Modifications on TF-Binding Specificity We also observed that the carboxylated consensus (5’-caCGACTAA) identified for HMBOX1 was significantly different from its known consensus (5’-TAACTA) (Berger et al., 2008; Jolma et al., 2013), suggesting that carboxylation could alter TF binding specificity. This phenomenon is widespread across all four modifications. For example, IRX2 preferred 5’- mCGTTA, which is substantially different from its known unmodified consensus 5’- TTACACG (Figure 5A). Similarly, FOXC2 and HOMEZ showed altered consensus sequences in the presence of different modifications (Figure 5A). On the other hand, while the binding strength of TBX2 and RFX2 was enhanced by methylation and carboxylation, respectively, neither altered their binding specificity (Figure 5B).

To systematically examine the impact of the four modifications on TF binding specificity, we determined the TF consensus sequences that were found preferentially bound to modified sequences in both symmetric and hemi forms. Specifically, we extracted those 6-mers enriched in modified libraries as compared with un-modified ones and constructed the modification-preferred consensus sequences (Figure S4). We next compared sequence similarity between these newly obtained consensus sequences with those obtained with unmodified libraries, as well as those archived in CIS-BP (Weirauch et al., 2014). For example, 13 symmetric 5mC-preferred consensus sequences were identified, four (30.8%) of which are significantly different from the unmodified counterparts. Similarly, 26.3% of the caC-preferred consensus sequences are significantly different (Figure 5C). Overall, 27.5%

8 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

of the modification-preferred consensus sequences are significantly different from the unmodified counterparts.

A similar observation was made for hemi modification-preferred consensus sequences. For example, hemi-modifications altered binding specificity of NR2E1, YBX1, HMBOX1, and JDP2 (Figure S6A). On the other hand, the binding specificity of PKNOX2 and EGR2 was not affected by hydroxymethylation (Figure S6B). Overall, 35.85% consensus sequences with the four modifications were significantly different from the binding unmodified consensus sequence (Figure S6C).

Stronger Effects Imposed by Symmetric Modifications to TF-DNA Interactions One interesting phenomenon observed was that the extent of modification-enhanced or – suppressed binding strength was in general greater for symmetric than hemi modifications. For instance, although methylation in both symmetric and hemi forms enhanced HOMEZ’s binding activity, the enriched 6-mers showed steeper slope for symmetric methylation (red lines in the left panel; Figure 6A), suggesting that symmetric methylation strengthened HOMEZ’s binding activity more than hemi-methylation. In the modification-suppressed cases, symmetric modifications also showed a stronger effect. For example, the slope of methylation-suppressed 6-mers of OVOL2 was flatter for symmetric methylation, suggesting that symmetric mCpG nearly abolishes the interactions while hemi-methylation was partially tolerated (left panel; Figure 6B).

To systematically and quantitatively assess this phenomenon, we identified 30 TF-DNA interactions that were enhanced by both symmetric and hemi-modifications and 60 interactions suppressed by both symmetric and hemi-modifications (Figure 4, S4 and S5). Next, we quantified the TF binding preference for a given modification by deducing the slope in the corresponding scatter plot as illustrated in Figure 6A. Note that the larger the slope, the stronger the preference for that modification. Overall, in 16 (53.33%) of the 30 modification-enhanced cases, symmetric modifications showed higher slopes than the hemi- modifications (right panel; Figure 6A). Similarly, in 58 (96.7%) of the 60 modification- suppressed cases, symmetric modifications showed lower slopes than the hemi- modifications in all four forms (right panel; Figure 6B). These results suggest that symmetric modifications can strongly enhance or suppress TF-DNA interactions while the effects of hemi-modifications are more modest.

To quantitatively confirm the above observation, we selected HOMEZ, TBX2, TBX3, TBX20, and STAT5A, as modification-enhanced cases, to determine their binding kinetics and affinity to DNA fragments with symmetric or hemi-modifications. Taking HOMEZ as an example, we synthesized three versions of DNA oligos carrying unmodified C, mC, or hmC and then converted to the symmetric and hemi modified dsDNA probes in the sequence context of 5’-TATCGATA. Note that the “pure” symmetric sequences were generated by selecting probes such that the only modified C to be incorporated is at the desired modification site. Judged by the EMSA assays as a semi-quantitative measurement, the symmetric modifications appeared to have a stronger effect than the hemi counterparts (left panel; Figure 6C). Next, we quantitatively determined the binding kinetics and affinity using the OCTET method. After the Kon and Koff values were determined at different protein

9 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

concentrations, the affinity Kd values were deduced for the DNA probes carrying symmetric- 5mC, hemi-5mC, symmetric-5hmC, and hemi-5hmC to be 243, 553, 135, and 345 nM, respectively; while the Kd value with the unmodified probe was 5.90 M. Similarly, STAT5A, TBX2, TBX3, and TBX20 all showed stronger affinity to DNA probes carrying symmetric- 5caC and 5fC than those carrying the respective hemi-modifications (right panel; Figures 6C & S7). As an example of modification-suppressed cases, binding and affinity studies also confirmed that symmetrically methylated consensus sequences suppressed the binding activities of OVOL2 more than the hemi-modified ones (bottom panel; Figure 6C). Taken together, our binding kinetics and affinity studies confirmed that symmetric epigenetic modifications had a stronger impact to either enhance or suppress TF-DNA interactions than hemi-modifications.

Potential Function of hmC Readers in Human Stem Cells To explore potential physiological functions of newly identified epigenetic modification readers, we decided to focus on hmC readers because it is more prevalently found in tissues, in particular, human embryonic stem cells, among the oxidized forms of mC. Since the existing hmC map of H1 is of low coverage, we first employed a selective chemical labeling method (Cheng et al., 2018) to extensively map hmC peaks in human embryonic stem cell H1. A total of 3,892 peaks containing hmC were identified, 1,783 of which are located in open chromatin regions.

Of the 143 unique hmC readers (symmetric and/or hemi) identified in this study, seven have available ChIP-seq data in H1 cells (Consortium, 2012). Therefore, we superimposed the available ChIP-seq peaks with the hmC peaks to identify overlapping peaks (Figure 7A). We found that the numbers of overlapping peaks were 229, 68, 36, and 46 for USF1, USF2, TAF7, and ATF2, respectively (Figure 7B). The observed overlapping is significant as determined with a random permutation of ChIP-seq peaks (Figure 7B; see Methods and Materials for more details). Although overlapping peaks were also observed for NRF1, RXRA, and SRF, none of them was significant on the basis of the same criteria. Moreover, using the overlapping peaks, we recovered consensus sequences for both USF1 and USF2 that were significantly similar to those obtained from DAPPL assays (Figure 7C), suggesting that the two TFs recognize hmC in a similar sequence context in H1 cells.

To examine possible functions of hmC-dependent TF-DNA interactions, we investigated the chromatin status of the overlapping peaks of USF1 and USF2 in H1 cells. Using chromatin status annotation (i.e., ChromHMM), we found that the overlapping peaks of both TFs showed a different profile than the overall ChIP-seq peaks and were mostly enriched in weak enhancer regions (Figure 7D). Furthermore, removal of ChIP-seq peaks with moderate to high methylation levels from the overall ChIP-seq peaks did not affect this observation. These results suggested that USF1 and USF2’s might regulate transcription via hmC-binding activity in weak enhancers in H1 cells.

10 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

DISCUSSIONS

In this study, we developed an all-to-all approach, dubbed as DAPPL, to enable highly multiplexed profiling of protein-DNA interactions in the context of DNA epigenetic modifications. Conceptually, the establishment of this all-to-all approach should revolutionize means of the traditional high-throughput screenings, because most of the current high- throughput approaches come in the form of one-to-all. In other words, only one bait at a time, either in the form of small molecule (e.g., drug screening), protein (e.g., PBA and SELEX), an expression construct (e.g., yeast one-hybrid), or a piece of DNA (e.g., a protein array and pull down-MS/MS), is used to screen against a collection of potential interacting partners. The use of unique DNA tags to barcode each TF protein allowed us to connect the TF identities to the DNA fragments they captured, serving as the means de-convolute the binding specificity in an all-to-all screening. We envision that this new concept can be applied to survey other types of interactions, including protein-RNA and protein-protein interactions in vitro.

To our satisfaction, our DAPPL approach recovered several known methylated cytosine readers, such as methylated DNA readers MeCP2, MBD4, MEIS1, HOMEZ, TBX2/3/20, RFX2 and PKNOX2 (Bartke et al., 2010; Spruijt et al., 2013; Yin et al., 2017). In two recent studies, DNA baits, carrying 5mC, 5hmC, 5fC or 5caC modifications, were used to pull down potential binding proteins in cell lysates to identify potential modification readers (Iurlaro et al., 2013; Spruijt et al., 2013). Although the binding specificity of these readers was not known, quite a few of them were also recovered in our study, such as a hmC reader MeCP2, seven fC readers (CNBP, CSDA, GTF2I, NRF1, PURA, FOXP4, and ), and seven caC readers (ZBTB7B, CTCF, NRF1, SMARCC1, ZBTB7A, ZBTB7B, and ZNF187) (Iurlaro et al., 2013; Nanan et al., 2018; Spruijt et al., 2013). Other than MeCP2, CTCF, and p53, we identified the consensus sequences carrying the corresponding modifications for the rest of the readers, demonstrating the advantage of the DAPPL approach.

Because the TFs were sampled against the five libraries with different modifications simultaneously in the DAPPL assays, we were able to predict whether a modification in a specific sequence context would enhance or suppress TF-DNA interactions. Using both EMSA and quantitative kinetics studies, all of the tested predictions were successfully verified (Figures 6C and S7). Interestingly, Yin et al also employed bioinformatics analysis to predict those TF-DNA interactions that could be enhanced (i.e., MethylPlus) or suppressed by symmetric DNA methylation (i.e., MethylMinus), although no experimental validation was provided (Yin et al., 2017). Of the 14 TFs identified as methylation-enhanced cases in our study, three proteins, PKNOX2, MEIS2, and MEIS3, were also identified as MethylPlus; whereas the other 11 TFs were not included in the study by Yin et al. As one of the 11 proteins, HOMEZ was confirmed to bind to the hemi and symmetric mC-carrying consensus sequences 10- and 25-fold stronger than the unmodified counterpart, respectively (Figure 6C). Of the 19 methylation-suppressed TFs identified in this study, 11 were also included in the Yin study and predicted as MethylMinus. Eight of the 11 are shared by the two studies. As one of them, OVOL2’s binding activity to symmetric-methylated DNA was undetectable, while its binding affinity dropped to 15 M to hemi-methylated DNA, 3-fold weaker as compared with the unmethylated counterpart (Figure 6C). Therefore, the high agreement

11 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

with the two studies suggested that cytosine modification-enhanced or –suppressed TF- DNA interactions might be a widespread phenomenon in humans.

Compared with intensive studies of symmetric modifications, the biological roles of hemi epigenetic modifications remain obscure and underexplored (Sharif and Koseki, 2018). Emerging evidence suggests that hemi-5mC is a stable epigenetic mark because a small fraction of hemi-mC could be inherited and have been associated with transcription regulation (Xu and Corces, 2018). Moreover, in non-CpG context (e.g., CpA, CpT, or CpC) 5mC is asymmetrical and mCpA, in particular, is often found in gene bodies and correlated with gene transcription in mammalian stem cells and neurons (Gabel et al., 2015; Kinde et al., 2016). However, the molecular functions of such hemi epigenetic marks remain largely unknown because of a lack of readers. Previously, only a handful of proteins, such as Dnmt1 and UHRF1, were reported to recognize hemi-5mC in cells (Li et al., 2018; Liu et al., 2013). For the first time, we identified 18, 11, 15, and 24 proteins could preferentially recognize consensus sequences carrying hemi-5mC, -5hmC, -5fC, and -5caC, respectively (Figures S4B and S5). Consistent with results observed with symmetric modifications, many hemi modifications could either enhance or suppress TF-DNA interactions, although to a lesser extent as evidenced by kinetics studies. We believe that our results and analyses will pave the way for the elucidation of the physiological roles of these types of epigenetic marks.

ACKNOWLEDGEMENTS

This work was supported in part by the NIH (R01AG061852, R24 AA025017, U01CA200469, P50AA026116 to H.Z.; R01GM111514; to H.Z. & J.Q.; R01EY029548; R01EY024580 to J.Q.; R37NS047344 to H.S.; P01NS097206 to H.S. & P.J.; NS051630; AG062256; AG060757; AG052476 to P.J.). We would like to thank Dr. Jef D. Boeke for his support and insightful discussion during this project and Dr. Jun Shen for providing us human embryonic stem cells.

AUTHOR CONTRIBUTIONS

G.S. conducted the majority of wet-lab experiments with help from Y.C., Q.S.. C.M. X.L. and G.W. made the major contributions to developing bioinformatics algorithms and data analyses with help from J.W.. J.B., H.S., and P.J. provided critical reagents and supports to this study. H.Z. and J.Q. conceived the idea, supervised this study and wrote the manuscript with the help from all the authors.

DECLARATION OF INTERESTS

The authors declare no competing interests.

12 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

FIGURE LEGENDS

Figure 1. Principle and execution of the DAPPL approach The principle of the DAPPL approach is to utilize the unique DNA barcodes tethered to TF proteins as identifiers to de-convolute the DNA sequences that a given TF captures in a highly multiplexed binding assay. First, each purified TF protein on glutathione beads was individually conjugated with a unique DNA barcode. Second, barcoded TF proteins on beads are mixed and incubated with a random 20-mer DNA library. Third, after removal of unbound DNA fragments, a TF-capture DNA fragment can be ligated to the DNA barcode conjugated on that particular TF via Golden Gate Assembly because it is now brought to the proximity of the TF barcode DNA. Fourth, the ligated products can be PCR-amplified by utilizing the constant sequences attached to the TF barcodes and one end of the 20-mer DNA library. Finally, the sequences obtained with Next-generation sequencing will be de- convoluted and analyzed with our bioinformatics tools (see Figure S2 for more details).

Figure 2. Identification of novel TF binding motifs by DAPPL (A) Examples of recovering known consensus sequences (boxed) using TF-enriched 6-mers (red dots) obtained via comparison with the GST negative controls. (B) The comparison between the DAPPL-identified motifs and the known ones revealed that the intersecting motifs are significantly similar (P < 0.05; Tomtom motif comparison tool). Upper panel: 83 the 129 TFs with newly identified consensus sequences have known ones archived in CIS-BP. Lower panel: 82% (68) of the 83 overlapping motifs is significantly similar to the known ones. (C) Consensus sequences obtained with DAPPL are highly reproducible as evidenced by high Pearson correlation coefficient between results of redundant TF pairs. (D) Validation of three novel consensus sequences for TEAD2, DMTF, and SFMBT1 using electrophoretic mobility shift assays. (E) Measurement of binding kinetics and affinity values between the three novel consensus sequences and the corresponding TFs using the OCTET system.

Figure 3. Identification of Readers for Various Epigenetic Modifications (A) Venn diagram analysis of newly identified readers for mC, hmC, fC, and caC modifications in symmetric and hemi forms. See also Table S4 for more details. (B) Venn diagram analysis between symmetric and hemi-modification readers in the context of mCpG, hmCpG, fCpG and caCpG modifications. (C) The newly identified binding activities to both symmetric and hemi-mC, -hmC, -fC, and - caC modifications are widespread to all of the major TF subfamilies. Left panel: More TFs were found to interact with hemi-modified DNA consensus sequences than symmetrically modified ones. Right panels: No significant preference to a particular symmetric or hemi- modifications was observed within each major TF subfamily (D) Highly conserved homologs often recognized highly similar consensus sequences and shared the preference for the same modifications. For example, MEIS1/2/3 all recognized a similar consensus carrying 5caC in a sequence of 5’-TGTcaCG in both symmetric and hemi forms (Figure 3D). Letter E in the consensus sequences represents 5caC. The top three 6- mer sequences are highlighted.

13 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 4. Three scenarios of Symmetric Modification impact on TF-DNA Interactions (A) A given TF-DNA interaction can be enhanced (red bricks), suppressed (blue bricks), or indifferent (grey bricks) by any of the four symmetric epigenetic modifications. The same scenarios were also observed for hemi-modifications (Figures S4B and S5A). (B) Validation of modification-enhanced (i.e., TBX2 and HMBOX1) or suppressed (i.e., OVOVL2 and DMTF) TF-DNA interactions using EMSA assays. The sequences the DNA used in the EMSA are shown in each box with the modified C labeled in red.

Figure 5. Impact of symmetric epigenetic modifications on TF binding specificity (A) Examples of deviated consensus sequences affected across all four epigenetic modifications. Known consensus sequences (boxed) in the absence of any modifications are compared with those carrying a modified cytosine. Letter E in the consensus sequences represents modified cytosine. (B) Examples (i.e., TBX2 and RFX2) of TF binding specificity not affected by cytosine modifications, although with enhanced binding affinity. (C) Summary of impact of different modifications on TF binding specificity.

Figure 6. Symmetric modifications have a greater impact on TF-DNA interactions than hemi modifications (A) Symmetric modifications enhance TF’s binding to modified consensus sequences more than hemi modifications. Left panel: The enriched 6-mers (red dots) show a steeper slope for symmetric methylation than hemi methylation, although methylation in both symmetric and hemi forms enhanced HOMEZ’s binding activity. Right panel: In 16 (53.33%) of the 30 modification-enhanced cases symmetric modifications showed higher slopes than the hemi- modifications. Letter E in the consensus sequences represents modified cytosine. (B) Hemi modifications suppress TF’s binding to modified consensus sequences less than symmetric modifications. Left panel: The suppressed 6-mers (blue dots) show a flatter slope for symmetric methylation than hemi methylation for OVOL2. Right panel: In 58 (96.7%) of the 60 modification-suppressed cases symmetric modifications showed flatter slopes than the hemi-modifications. (C) Validation and binding kinetics and affinity studies. Left panel: EMSA assays were used to measure binding strength between modified DNA motifs with the corresponding TFs at various concentrations to obtain semi-quantitative measurements. Right panel: The OCTET system was employed as a real-time, label-free method to determine binding kinetics and affinity values for DNA-TF interactions. The red and black arrows indicate when the DNA probe immobilized on an OCTET biosensor was dipped into TF solution and wash buffer, respectively. The deduced KD values are listed for each assay performed at various TF concentrations.

Figure 7. Potential Roles of hmCG Readers in Human Embryonic Stem Cells (A) A genomic region of overlapping 5hmC and USF1’s ChIP-seq peaks. The newly identified USF1’s hmC consensus found in the peak region is underlined. (B) Statistical significance of the overlapping peaks. 5,000 random assignments of TF ChIP- seq peaks in the genome produced a distribution of the numbers of peaks overlapping with

14 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

hmC peaks. The observed numbers of the overlapping peaks were indicated with the red arrows for USF1, USF2, TAF7 and ATF2. (C) Comparison of the consensus sequences obtained with DAPPL and those with sequences in the overlapping peaks. Letter E in the consensus sequences represents modified cytosine. (D) Distribution of chromatin states of the overlapping peaks (blue bars) compared with all of the ChIP-seq peaks (red bars). The overlapping peaks of both USF1 and USF2 were enriched in weak enhancer regions with significant P values.

STAR★METHODS

Experimental Model and Subject Details Cultured Saccharomyces cerevisiae was used to overexpress the recombinant human TF and co-factor proteins. All human ORFs were cherry-picked from the Ultimate Human ORF Collection or generated in our laboratory, followed by subcloning into a yeast expression vector (pEGH-A) through the Gateway recombinant cloning system. Recombinant plasmids were rescued into DH5α and verified by BsrGI restriction endonuclease digestion. Plasmids with correct size inserts were transformed into yeast for protein purification.

Method Details

Transcription Factor Clones and Protein Purification Using the existing human ORF expression library, each TF protein was expressed in yeast, purified as N-terminal GST fusion, and captured on glutathione beads, as described previously (Jeong et al., 2012). In brief, each yeast clone was grown in 800 µL of SC- Ura/glucose liquid medium overnight as primary cultures. 50 µl of saturated yeast cultures were transferred in 8 mL of SC-Ura/raffinose liquid medium until the O.D. 600 reached 0.6, followed by induction with 2% galactose for 6 hours at 30C. The harvested yeast cell pellets were stored at -80C and/or immediately lysed in lysis buffer (50 mM Tris-HCl at pH7.5 with 100 mM NaCl, 1 mM EGTA, 0.01% tritonX-100, 0.1% beta-mercaptoethanol, 1 mM PMSF, and Roche protease inhibitor tablet). Next, the cell lysates were centrifuged at 3,500 g for 10 min at 4°C; the supernatants were transferred to a new plate with prewashed glutathione beads, and incubated overnight at 4°C to capture the GST fusion proteins. Subsequently, the beads were washed three times with wash buffer I (50 mM HEPES at pH7.5 with 500 mM NaCl, 1 mM EGTA, 10% glycerol, and 0.1% beta-mercaptoethanol) and three times with wash buffer II (50 mM HEPES at pH7.5 with 100 mM NaCl, 1 mM EGTA, 10% glycerol, and 0.1% beta-mercaptoethanol) successively. All of TF protein beads were stored at -80C until use. Together with expression of transcription factors, GST tag was expressed and purified as a negative control.

For EMSA and DNA-TF conjugation product tests, the TF proteins were eluted from glutathione beads with elution buffer (50 mM HEPES at pH 8.0 with 100 mM NaCl, 50 mM KAc, 5 mM MgCl2, 40 mM glutathione, 10% glycerol and 0.05% Tween-20). For the TF- DNA binding kinetics assay, glycerol (a high refractive index component) was refrained from

15 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

elution buffer (50 mM HEPES at pH 8.0 with 100 mM NaCl, 50 mM KAc, 5mM MgCl2, 40 mM glutathione, and 0.05% Tween-20).

Conjugation of Anchor Oligo to TF Proteins in a 96-Well Format To conjugate a unique DNA barcode to each protein, a common oligo (i.e., the anchor oligo) was first used to covalently conjugate to each TF protein. In brief, the conjugation between the anchor oligo and each protein was achieved through a “click” chemistry reaction between the hydro sulphonyl group on cysteine residue of TF or GST proteins and a maleimide group tethered to the 5’-end of the anchor oligo. Before the conjugation assay, the maleimide-2,5-dimethylfuran cycloadduct on the anchor oligo was converted to maleimide via a Retro-Diels-Alder reaction. After cooling to room temperature the lyophilized oligos were dissolved to a final concentration of 50 µM in PBS buffer and added to the above purified TF or GST proteins on glutathione beads in 96-well plates. The conjugation was achieved by incubation at room temperature for 1 hour. Subsequently, the unconjugated oligos were removed with three stringent washes (50 mM HEPES at pH7.5 with 100 mM NaCl, 1 mM EGTA, 10% glycerol, and 0.1% BME) and the anchor oligo- conjugated TFs were store at -80C until use.

Assignment of DNA Barcodes to TF Proteins To assign a unique identity to each protein, a collection of 2,000 address oligos were synthesized, each of which contains a BsaI recognized and a cutting site (GGTCTCCGACT) at the 5’-end, a 7-11 nt unique DNA barcode sequence and a 20-nt consistent sequence, complementary to the anchor oligo (Figure S1A, Table S1 and S2). Next, address oligos were individually annealed to the anchor oligos conjugated on the TF proteins in a 96-well format, followed by a Klenow reaction (1 X NEB Buffer 2 with 0.6 mM dNTP mix, 1 U Klenow, 12 uM address oligos) at 37C for 30 min to synthesize the complementary strands. Free address oligos were removed with three washes in the same washing buffer described above. Considering the variation in protein expression level and differences in TF’s DNA binding capability, we combine every 192 barcoded TF proteins to a single test tube for DAPPL assay.

DNA Library Preparation For the 20-mer DNA library, template DNA oligo was synthesized to have 20 random nucleosides flanked by two consistent sequences: one for amplifying DAPPL product (i.e., 5’-GGGAGAAGGTCATCAAGAGG) and the other with a BsaI restriction site and its cutting site (i.e., 5’-GGCATGCAGCCACTATAAGCTTCGAAGACTTGAGACCAT). Double-stranded 20-mer DNA library was generated by annealing the template oligo with a complementary primer oligos with 1:1 ratio, followed by a Klenow reaction. By design, after BsaI digestion (i.e. 5’-GGTCTC(N1)/(N5)-3’) the sticky end of the 20-mer DNA is complementary to that of each TF’s DNA barcode such that they can be annealed and ligated.

For DNA libraries carrying four different epigenetic modifications, including 5mCpG, 5hmCpG, 5fCpG, and 5caCpG, the template oligos were synthesized to encode a 5’ Primer 2 (5’-CACATCCTTCACATTAATCC), an 18-mer sequence with a modified CpG in the middle flanked by two 8-mer random sequences, a short sequence encoding the library identity, and a BsaI restriction site and its cutting site (Figure S3). These oligos were then

16 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

converted into dsDNA libraries in the presence or absence of the modified dCTPs to create symmetric or hemi libraries, respectively, with the Klenow reactions as described above. Each library was purified with a QIAquick PCR Purification kit according to the manufactures’ instructions. The design of each modified library is shown in Table S5. In addition, as an unmodified DNA library of the same design was also created as a negative control.

Finally, the four libraries with four different symmetric modifications and the unmodified library were mixed in equal molar ratio to form the mixture of symmetric modification libraries. A mixture of the hemi modification libraries was created using the same method in parallel.

Development of Digital Affinity Profiling via Proximity Ligation (DAPPL) Reactions To develop and optimize the DAPPL reactions, we incubated each TF mixture on beads with 200 nmole of the 20-mer DNA library in a TF binding buffer (10 mM Tris-HCl at pH 7.5 with 50 mM NaCl, 1mM DTT, and 4% glycerol) for at room temperature for 30 min. After three stringent washes in binding buffer and PBS buffer, the protein-DNA complexes on the beads were crosslinked by 0.1% formaldehyde for 10 min and quenched with Tris-Glycine buffer (pH 7.5). Next, the beads was washed in TBST buffer (0.01% tween-20) for three times and equilibrated by 1 X T4 DNA ligase buffer. A Golden Gate Assembly reaction was performed by adding a master reaction mixture (227 µL of ddH2O with 30 μL 10 X T4 DNA ligase reaction buffer, 3 μL 100× bovine serum albumin, 20 µl of 100 U of BsaI, and 20 µL of 600 U T4 DNA ligase) to the semi-dry beads and incubated for 1 hour at 37C (Potapov et al., 2018). After Proteinase K treatment and phenol/chloroform extraction the ligation products were PCR-amplified with a pair of primers complementary to Primers 1 and 2. A Next- generation Sequencing library was constructed and subjected to deep-sequencing on an Illumina NextSeq500 sequencer.

DAPPL Reactions to Discover Readers for Different Epigenetic DNA modifications The optimized protocols described above were applied to identify potential epigenetic modification readers. To identify symmetric modification readers, we incubated each TF mixture on beads with 200 nmole of the mixture of symmetric modification libraries in a TF binding buffer (10 mM Tris-HCl at pH 7.5 with 50 mM NaCl, 1mM DTT, and 4% glycerol) for at room temperature for 30 min. After three stringent washes in binding buffer and PBS buffer, the protein-DNA complexes on the beads were crosslinked by 0.1% formaldehyde for 10 min and quenched with Tris-Glycine buffer (pH 7.5). Next, the beads was washed in TBST buffer (0.01% tween-20) for three times and equilibrated by 1 X T4 DNA ligase buffer. Golden Gate Assembly was performed by adding a master reaction mixture (227 µl of ddH2O with 30 μL 10 X T4 DNA ligase reaction buffer, 3 μL 100× bovine serum albumin, 20 µl of 100 U of BsaI, and 20 µl of 600 U T4 DNA ligase) to the semi-dry beads and incubated for 1 hour at 37C. After Proteinase K treatment and phenol/chloroform extraction the ligation products were PCR-amplified with a pair of primers complementary to Primers 1 and 2. A next-generation sequencing library was constructed and subjected to deep-sequencing on an Illumina NextSeq500 sequencer.

17 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

To identify hemi modification readers, the exact protocol described above was applied. The analyses of the deep-seq datasets are described below.

Generation of PWM Models Since each TF was fused with GST, we first excluded the DNA binding activity contributed by the GST tag. To do so, we first extracted the 6-mers by a sliding window of length 6 moving along the sequencing reads one nucleotide at a time. The 6-mer occurrences were obtained for the binding sequences to a particular TF. Similarly, the 6-mer occurrences were also obtained for the binding sequences for the GST proteins, which were included in each multiplexed binding assays as a negative control. Next, we compared the 6-mer sequence occurrences between those obtained with each TF and the GST counterparts. We then determined the 6-mers that were enriched for a given TF by comparing their occurrences with the GST counterparts. The count of the 6-mers bound by a TF was referred as:

푋 = {푥1, 푥2, … … 푥4096}

The count of the 6-mers bound by GST was referred as:

푌 = {푦1, 푦2, … … 푦4096}

The occurrence pairs of the 6-mers were referred as:

푃 = {푝1(푥1, 푦1), … … 푝4096(푥4096, 푦4096)}

The enriched 6-mers for a given TF were empirically determined by:

푥푖 푋 푝푖 ∈ {(푥푖, 푦푖)| 푥푖 > 푚푒푑(푋) & > (1, 푄3( )) } 푦푖 푌

The collection of the enriched 6-mers was used to construct the binding consensus sequences (i.e., motifs) using HOMER (http://biowhat.ucsd.edu/HOMER). The enriched 6- mers were first mapped back to the original library sequences. These sequences were used as input for HOMER. Each motif was associated with a P value. To determine the cutoff of the P values to identify significant motifs, we performed 400 non-return random sampling on the GST-bound sequences with a sample size of 10,000 in each library. Each of the 400 runs generated a top P value. These 400 P values were considered as a null distribution and the P value at 95 percentile was used as the cutoff to calculate the significant motifs.

Identification of epigenetic modification dependent TF-DNA interactions We identified sequences that were specific to either modified or unmodified sequences by comparing the sequence frequencies from modified library versus unmodified library for a given protein. To correct the possible library-specific bias, we first performed the Lowess data normalization (Berger et al., 2004) for GST from different modified libraries. Here, X = {x1, x2,…xn} and Y = {y1, y2,…yn} were used to denote the 6-mer frequencies obtained from unmodified versus modified DNA libraries for GST proteins. A log2-based scatterplot of 6-

18 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

mer intensity (A = log2(X + Y)) versus ratio (M = log2(X/Y)) was plotted. A local weighted linear regression was used to calculate a regression curve from the corresponding scatterplot. This curve was then used to correct systematic deviations of 6-mer frequencies between two libraries for each TF.

We then compared the adjusted frequencies of 6-mer obtained from modified library and unmodified library. By plotting the 6-mer frequencies obtained with modified versus unmodified DNA libraries, we identified those sequences that were preferentially bound by TFs in either modified or unmodified format. We next estimated the expected deviation of frequencies of 6-mers for GST proteins from diagonal line (y = x), which represents the difference between the modified and unmodified sequences. We calculated the distance (DistGST) between a 6-mer bound by GST and diagonal line y = x. We set the distance cutoff as:

퐷푖푠푡푐푢푡표푓푓 = 푎푣푔(퐷푖푠푡퐺푆푇) + 6 × 푠푑(퐷푖푠푡퐺푆푇)

If a 6-mer bound by a TF located outside the distance cutoff in the scatterplot (i.e., DistTF > DistGST), it was considered as a candidate carrying either enhanced or suppressed activity to the TF. After we identified the candidate 6-mers, we calculated the log2-based slope for each TF. If the slope is greater than 0.5 or smaller than -0.5, those 6-mers were used to construct the modification-specific consensus sequences.

Quantifying similarity between motifs Tomtom (Gupta et al., 2007) was used to quantify the similarity between two motifs. Two motifs with a P value < 0.05 were considered similar. When multiple motifs were associated with a TF in CIS-BP, we used the smallest P value.

Quantification of the TF binding preference We used the slope of the enriched 6-mers in the scatterplot as a proxy to evaluate binding preference to either modified or unmodified sequences. The count of N enriched 6-mers in unmodified library was referred as:

푋 = {푥1, 푥2, … … 푥푁}

The count of the enriched 6-mers in modified library was referred as:

푌 = {푦1, 푦2, … … 푦푁}

The slope was calculated as:

푦 ∑푁 푖 푖=1 푥 S = 푖 푁

Electrophoretic Mobility Shift Assay Single-stranded DNA oligos were designed and synthesized by IDT (Integrated DNA Technologies, Inc. Skokie, IL) and Genelink (Genelink Inc, Orlando, FL) as shown in Table

19 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

S5. They were next converted to dsDNA with a end-labeled Cy5 T7 primer using the Klenow reactions in the presence or absence of the desired modified dCTPs (New England Biolabs, Ipswich, MA). The dsDNA were cleaned up with QIAquick PCR Purification Kit (Qiagen, Hilden, Germany).

To perform EMSA assays, a Cy5-labeled DNA probe at a final concentration of 5 nM was incubated with its corresponding TF protein(s) in a binding bufffer (25 mM HEPES at pH 8.0 with 100 mM NaCl, 5% (v/v) glycerol, 50 mM KAc, 5 mM MgCl2, 1 mM DTT and 0.1 mg/mL bovine serum albumin) at room temperature (∼22°C) for one hour. The reaction mixture was then loaded onto a 5% Criterion TBE Gel (Bio-Rad Laboratories, Inc.Hercules, CA), and electrophoresed for 90 min at 80 V in 0.5 × TBE buffer. Formation of protein-DNA complexes were detected by scanning the TBE gel using Odyssey® CLx Imaging System (LI-COR Biosciences, Lincoln, NE).

TF-DNA Binding Kinetics and Affinity Measurement with the OCTET System The TF-DNA binding kinetic assays were performed on OCTET RED96, equipped with SAX (High Precision Streptavidin Biosensor) biosensor tips (ForteBio, Menlo Park, CA). The biotin labeled dsDNA probes were generated using the same method as described above but with a biotin-labeled T7 primer. Each DNA probe was then diluted to a final concentration of 0.1 µM in the DNA-binding buffer (50 mM HEPES at pH 8.0 with 100 mM NaCl, 50mM KAc, 5mM MgCl2, 40 mM glutathione, 5 mg/mL BSA, and 0.05% Tween-20). The corresponding TF proteins purified and diluted into the DNA-binding in a series of two- fold dilutions. The binding kinetics of TF-DNA interactions was measured according to the manufactures’ instruction (ForteBio, Menlo Park, CA). In short, a streptavidin-coated biosensor was first incubated in the DNA binding buffer for 10-min to establish the baseline, followed by dipping to a well containing the DNA probe to capture the biotinylated probe. After another around 600-sec baseline establishment in the binding buffer, the biosensor was dipped into a TF protein well to obtain the on-curve until signal saturation. The off-curve was obtained by transferring the biosensor to a well containing fresh binding buffer until the off-curve became flat. Data analysis and curve-fitting were performed using ForteBio’s Data Analysis Software, based on which the Kon and Koff values were determine for each binding assay. The KD values were deduced by taking the ratios of Koff /Kon.

Genome-wide 5hmC Profiling of Human Embryonic Stem Cell H1 Genomic DNA was isolated from human embryonic stem cell H1 with standard protocols. The cell pellet of two million H1 cells was suspended in 500 µL digestion buffer (100 mM Tris-HCl at pH 8.5 with 5 mM EDTA, 1% SDS, 200 mM NaCl) on ice and then treated with Proteinase K at 55°C overnight. After Phenol/Chloroform (25:24:1 saturated with 10 mM Tris, pH 8.0, 1 mM EDTA) extraction the aqueous phase was transferred to a new test tube, mixed with same volume of isopropanol, and stored at -80°C overnight to precipitate the genomic DNA. After spin down with 12,000 rpm for 10 min at 4°C, the DNA pellet was washed with 75% ethanol, air-dried and dissolved in Nuclease-free Water. To perform the hmCG-ChIP-seq, 5hmC-specific chemical labeling and enrichment of DNA fragments with 5hmC were performed using a previously described method (Irier et al., 2014). DNA libraries were prepared following the Illumina protocol of ‘Preparing Samples for ChIP Sequencing of

20 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

DNA’ (Part no. 111257047 Rev. A) using genomic DNA or 5hmC-captured DNA and then subjected to deep-seq on the Illumina Hi-seq 2000 machine.

Potential Function of hmC Readers in H1 Cells For hmCG-ChIP-seq data, we aligned these fastq files into reference genome and used MACS2 to do peak calling separately for two repetitions (with setting: macs2 callpeak -t input_file -f BAM -g hs -n output_prefix -B -q 0.01). We used IDR (Irreproducible Discovery Rate) framework to measure consistency between two replicates. Those shared regions with IDR less than 0.05 were regard as consistency and reproducible peaks. Totally we identified 3892 hmCG modified regions. We downloaded all narrowPeak files about TFs ChIP-Seq datasets in H1-hESC cell lines from ENCODE and UCSC. There are 7 TFs have binding sites overlap with hmCG modified regions. By Sampling random regions on the genome, USF1, USF2 and TAF7 and ATF2 have an enrichment than random regions. Because of the small number of overlaps for TAF7 and ATF2, we only used the MEME to discover the motif in the binding region of USF1 and USF2 in vivo.

Next, we investigated the chromatin status of TF binding under hmCG modification. Combined WGBS dataset, we regard the TF binding regions with average methylation less than 0.2 and not overlapped with hmCG modified regions as background. Using ChromHMM annotation of H1-hESC, we found that TF binding under hmCG modification haves different chromatin states. With the fisher-test we found the USF1 hmCG modification binding regions are enriched in weak enhancers, insulator and weak transcribed regions and USF2 hmCG modification binding regions are enriched in weak enhancers and weak transcribed regions.

DATA AVAILABLEITITY DAPPL seq data that support the findings of this study have been submitted to the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) under accession no. PRJNA531722. Data of the genome-wide mapping of 5hmC in human embryonic stem cells have been deposited under the accession no. PRJNA541742.

SUPPLEMENTAL FIGURES

Figure S1. DNA-Barcode 1,235 Unique Human TF Proteins, Related to Figure 1 A. Design of the anchor oligo with a maleimide-2, 5-dimethylfuran moiety attached at the 5’- end. B. Design of the barcode oligos annotated with BsaI’s recognition (yellow) and cutting site (arrow), a random 8-mer unique molecule identifier (UMI; red), a barcode sequence (7- 11 nt in length), and a 20-nt consistent sequence complementary to Primer 1 in anchor oligo. C. Schematics of barcoding TF proteins with dsDNA. D. Quality control of purified TF proteins with Coomassie stain. E. Examination of DNA-barcoded TF proteins (ZBED2, p53, and MLX as examples) with Coomassie stain. Arrows indicate the upper shift of barcoded proteins. F. Examination of DNA-barcoded TF proteins with EB stain in the same gel as in (S1E). Arrows indicate the dsDNA barcodes tethered on the proteins.

Figure S2. Motif Discovery Pipeline, Related to Figure 2

21 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

All of the binding sequences were extracted and assigned to each TF protein using the associated TF barcodes, followed by further partitioning using the corresponding DNA library barcodes. To identify the binding motifs for each TF, binding sequences were parsed into six base pair sequences (i.e. 6-mers). To exclude the contribution from GST, we compared the 6-mer frequencies between each TF and the GST counterparts. The 6-mers that were found enriched for the TFs were considered to be true binding sequences (red spots). The full-length sequences were retrieved for the enriched 6-mers and used to discover candidate motifs using HOMER. To determine the significance cutoff, we performed 400 random samplings of the GST-bound sequences with a sample size of 10,000 from each library. These sequences were used to identify binding motifs using HOMER and the corresponding 400 P values were used to construct a null distribution. The P value at 95 percentile in the distribution was used as the cutoff to determine the significant motifs.

Figure S3. Identification of New Epigenetic DNA Modification Readers Using the DAPPL Approach, Related to Figure 4 A. Design of randomized DNA libraries carrying four different epigenetic modifications. Each library carries a CpG (i.e., EG) site with methylation, hydroxymethylation, formylation, or 5- carboxylation, flanked by two 8-mer random sequences and a 3-nt barcode (i.e., Library ID) to identify the modification. B. Design of the unmodified DNA library. C. Schematics of identifying new epigenetic modification readers using a mixture of five randomized DNA libraries (S3A and S3B) carrying either symmetric or hemi modifications using the DAPPL approach.

Figure S4. Impact of symmetric and hemi modifications to TF-DNA interactions, Related to Figure 4 A. Comparison the 6-mer subsequence frequencies obtained with each symmetrically modified library against the unmodified one for a given TF revealed that 14, 1, 7, and 28 TFs preferred 5mC, 5hmC, 5fC, and 5caC modifications, respectively, and that methylation, hydroxymethylation, formylation, and carboxylation suppressed binding activities of 21, 24, 25, and 25 TFs, respectively. We also observed that 9, 8, 6, and 2 TFs were insensitive to methylation, hydroxymethylation, formylation, or carboxylation, respectively. B. When the same analysis was applied to the results obtained with the four hemi modification libraries, we found that 18, 11, 15, and 24 TFs preferred 5mC, 5hmC, 5fC, and 5caC modifications, respectively, and that methylation, hydroxymethylation, formylation, and carboxylation suppressed binding activities of 19, 19, 22, and 21 TFs, respectively. We also observed that 14, 21, 12, and 14 TFs were insensitive to methylation, hydroxymethylation, formylation, or carboxylation, respectively.

Figure S5. Impact of Hemi Modifications on TF-DNA Interactions, Related to Figures 4, 5 and S4 A. Examples of deviated consensus sequences affected across all four epigenetic modifications. Known consensus sequences (boxed) in the absence of the modifications are compared with those carrying a modified cytosine. B. Examples (i.e., PKNOX2 and EGR2) of TF binding specificity not affected by cytosine modifications, although with slightly

22 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

enhanced binding affinity. C. Summary of the impact of various hemi-modifications on the TF binding specificity.

Figure S6. Impact of Hemi Modifications on TF Binding Specificity, Related to Figures 4 and S4 A given TF-DNA interaction can be enhanced (red bricks), suppressed (blue bricks), or indifferent (grey bricks) by any of the four hemi epigenetic modifications.

Figure S7. Additional Validation for TBX2, TBX3 and TBX20, Related to Figure 6 A. Semi-quantitative comparison of binding strength between TBX2/3/20 and their consensus sequences with either unmodified, hemi, or symmetric cytosine using EMSA assays. B. Binding kinetics and affinity studies of the TBX proteins quantitatively confirmed our DAPPL analysis and the EMSA results. The red and black arrows indicate when the DNA probe sensors were dipped into the TF protein solutions and wash buffer, respectively. Affinity values (KD) are represented as mean ± SD and deduced from the Kon and Koff values measured in four independent assays performed at four different concentrations of each TBX protein.

SUPPLEMENTAL TABLES

Table S1. Barcode Sequences for the TF and GST Proteins, Related to Figure 1

Table S2. Consensus Sequences Identified with the DAPPL Approach Using the 20- mer DNA Library, Related to Figure 2

Table S3. DNA Oligos Used for the DAPPL and Validation Assays, Related to Figures 1, 4, 6, S3 and S6

Table S4. Newly Identified Readers for Different Symmetric and Hemi Modifications, Related to Figure 3

23 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

REFERENCE:

Bachman, M., Uribe-Lewis, S., Yang, X., Williams, M., Murrell, A., and Balasubramanian, S. (2014). 5-Hydroxymethylcytosine is a predominantly stable DNA modification. Nature chemistry 6, 1049-1055. Badis, G., Berger, M.F., Philippakis, A.A., Talukder, S., Gehrke, A.R., Jaeger, S.A., Chan, E.T., Metzler, G., Vedenko, A., Chen, X., et al. (2009). Diversity and complexity in DNA recognition by transcription factors. Science 324, 1720-1723. Bartke, T., Vermeulen, M., Xhemalce, B., Robson, S.C., Mann, M., and Kouzarides, T. (2010). Nucleosome-interacting proteins regulated by DNA and histone methylation. Cell 143, 470-484. Berger, J.A., Hautaniemi, S., Jarvinen, A.K., Edgren, H., Mitra, S.K., and Astola, J. (2004). Optimized LOWESS normalization parameter selection for DNA microarray data. BMC bioinformatics 5, 194. Berger, M.F., Badis, G., Gehrke, A.R., Talukder, S., Philippakis, A.A., Pena-Castillo, L., Alleyne, T.M., Mnaimneh, S., Botvinnik, O.B., Chan, E.T., et al. (2008). Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell 133, 1266-1276. Bird, A. (2002). DNA methylation patterns and epigenetic memory. & development 16, 6- 21. Cheng, Y., Li, Z., Manupipatpong, S., Lin, L., Li, X., Xu, T., Jiang, Y.H., Shu, Q., Wu, H., and Jin, P. (2018). 5-Hydroxymethylcytosine alterations in the human postmortem brains of autism spectrum disorder. Human molecular genetics 27, 2955-2964. Consortium, E.P. (2012). An integrated encyclopedia of DNA elements in the . Nature 489, 57-74. Ficz, G., Branco, M.R., Seisenberger, S., Santos, F., Krueger, F., Hore, T.A., Marques, C.J., Andrews, S., and Reik, W. (2011). Dynamic regulation of 5-hydroxymethylcytosine in mouse ES cells and during differentiation. Nature 473, 398-402. Gabel, H.W., Kinde, B., Stroud, H., Gilbert, C.S., Harmin, D.A., Kastan, N.R., Hemberg, M., Ebert, D.H., and Greenberg, M.E. (2015). Disruption of DNA-methylation-dependent long gene repression in Rett syndrome. Nature 522, 89-93. Guo, J.U., Su, Y., Shin, J.H., Shin, J., Li, H., Xie, B., Zhong, C., Hu, S., Le, T., Fan, G., et al. (2014). Distribution, recognition and regulation of non-CpG methylation in the adult mammalian brain. Nature neuroscience 17, 215-222. Gupta, S., Stamatoyannopoulos, J.A., Bailey, T.L., and Noble, W.S. (2007). Quantifying similarity between motifs. Genome biology 8, R24. Hashimoto, H., Liu, Y., Upadhyay, A.K., Chang, Y., Howerton, S.B., Vertino, P.M., Zhang, X., and Cheng, X. (2012). Recognition and potential mechanisms for replication and erasure of cytosine hydroxymethylation. Nucleic acids research 40, 4841-4849. He, Y.F., Li, B.Z., Li, Z., Liu, P., Wang, Y., Tang, Q., Ding, J., Jia, Y., Chen, Z., Li, L., et al. (2011). Tet- mediated formation of 5-carboxylcytosine and its excision by TDG in mammalian DNA. Science 333, 1303-1307. Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y.C., Laslo, P., Cheng, J.X., Murre, C., Singh, H., and Glass, C.K. (2010). Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Molecular cell 38, 576- 589.

24 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Herceg, Z., and Hainaut, P. (2007). Genetic and epigenetic alterations as biomarkers for cancer detection, diagnosis and prognosis. Molecular oncology 1, 26-41. Hu, S., Wan, J., Su, Y., Song, Q., Zeng, Y., Nguyen, H.N., Shin, J., Cox, E., Rho, H.S., Woodard, C., et al. (2013). DNA methylation presents distinct binding sites for human transcription factors. eLife 2, e00726. Hu, S., Xie, Z., Onishi, A., Yu, X., Jiang, L., Lin, J., Rho, H.S., Woodard, C., Wang, H., Jeong, J.S., et al. (2009). Profiling the human protein-DNA interactome reveals ERK2 as a transcriptional repressor of interferon signaling. Cell 139, 610-622. Irier, H., Street, R.C., Dave, R., Lin, L., Cai, C., Davis, T.H., Yao, B., Cheng, Y., and Jin, P. (2014). Environmental enrichment modulates 5-hydroxymethylcytosine dynamics in hippocampus. Genomics 104, 376-382. Isakova, A., Groux, R., Imbeault, M., Rainer, P., Alpern, D., Dainese, R., Ambrosini, G., Trono, D., Bucher, P., and Deplancke, B. (2017). SMiLE-seq identifies binding motifs of single and dimeric transcription factors. Nature methods 14, 316-322. Ito, S., Shen, L., Dai, Q., Wu, S.C., Collins, L.B., Swenberg, J.A., He, C., and Zhang, Y. (2011). Tet proteins can convert 5-methylcytosine to 5-formylcytosine and 5-carboxylcytosine. Science 333, 1300-1303. Iurlaro, M., Ficz, G., Oxley, D., Raiber, E.A., Bachman, M., Booth, M.J., Andrews, S., Balasubramanian, S., and Reik, W. (2013). A screen for hydroxymethylcytosine and formylcytosine binding proteins suggests functions in transcription and chromatin regulation. Genome biology 14, R119. Jeong, J.S., Jiang, L., Albino, E., Marrero, J., Rho, H.S., Hu, J., Hu, S., Vera, C., Bayron-Poueymiroy, D., Rivera-Pacheco, Z.A., et al. (2012). Rapid identification of monospecific monoclonal antibodies using a human proteome microarray. Molecular & cellular proteomics : MCP 11, O111 016253. Jin, S.G., Zhang, Z.M., Dunwell, T.L., Harter, M.R., Wu, X., Johnson, J., Li, Z., Liu, J., Szabo, P.E., Lu, Q., et al. (2016). Tet3 Reads 5-Carboxylcytosine through Its CXXC Domain and Is a Potential Guardian against Neurodegeneration. Cell reports 14, 493-505. Jolma, A., Yan, J., Whitington, T., Toivonen, J., Nitta, K.R., Rastas, P., Morgunova, E., Enge, M., Taipale, M., Wei, G., et al. (2013). DNA-binding specificities of human transcription factors. Cell 152, 327-339. Kinde, B., Wu, D.Y., Greenberg, M.E., and Gabel, H.W. (2016). DNA methylation in the gene body influences MeCP2-mediated gene repression. Proceedings of the National Academy of Sciences of the United States of America 113, 15114-15119. Kyster Nanan, D.S., Maria Prigge, Morgan Thenoz, Allissa Dillman, Mariana Mandler, Shalini Oberdoerffer (2018). TET-catalyzed 5-carboxylcytosine promotes CTCF binding to suboptimal sequences genome-wide. bioRxiv doi: https://doi.org/10.1101/480525. Li, T., Wang, L., Du, Y., Xie, S., Yang, X., Lian, F., Zhou, Z., and Qian, C. (2018). Structural and mechanistic insights into UHRF1-mediated DNMT1 activation in the maintenance DNA methylation. Nucleic acids research 46, 3218-3231. Liu, X., Gao, Q., Li, P., Zhao, Q., Zhang, J., Li, J., Koseki, H., and Wong, J. (2013). UHRF1 targets DNMT1 for DNA methylation through cooperative binding of hemi-methylated DNA and methylated H3K9. Nature communications 4, 1563. Nardone, S., Sams, D.S., Reuveni, E., Getselter, D., Oron, O., Karpuj, M., and Elliott, E. (2014). DNA methylation analysis of the autistic brain reveals multiple dysregulated biological pathways. Translational psychiatry 4, e433.

25 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Pastor, W.A., Pape, U.J., Huang, Y., Henderson, H.R., Lister, R., Ko, M., McLoughlin, E.M., Brudno, Y., Mahapatra, S., Kapranov, P., et al. (2011). Genome-wide mapping of 5- hydroxymethylcytosine in embryonic stem cells. Nature 473, 394-397. Potapov, V., Ong, J.L., Kucera, R.B., Langhorst, B.W., Bilotti, K., Pryor, J.M., Cantor, E.J., Canton, B., Knight, T.F., Evans, T.C., Jr., et al. (2018). Comprehensive Profiling of Four Base Overhang Ligation Fidelity by T4 DNA Ligase and Application to DNA Assembly. ACS synthetic biology 7, 2665-2674. Ruzov, A., Tsenkina, Y., Serio, A., Dudnakova, T., Fletcher, J., Bai, Y., Chebotareva, T., Pells, S., Hannoun, Z., Sullivan, G., et al. (2011). Lineage-specific distribution of high levels of genomic 5- hydroxymethylcytosine in mammalian development. Cell research 21, 1332-1342. Sharif, J., Endo, T.A., Nakayama, M., Karimi, M.M., Shimada, M., Katsuyama, K., Goyal, P., Brind'Amour, J., Sun, M.A., Sun, Z., et al. (2016). Activation of Endogenous Retroviruses in Dnmt1(-/-) ESCs Involves Disruption of SETDB1-Mediated Repression by NP95 Binding to Hemimethylated DNA. Cell stem cell 19, 81-94. Sharif, J., and Koseki, H. (2018). Hemimethylation: DNA's lasting odd couple. Science 359, 1102- 1103. Sharp, A.J., Stathaki, E., Migliavacca, E., Brahmachary, M., Montgomery, S.B., Dupre, Y., and Antonarakis, S.E. (2011). DNA methylation profiles of human active and inactive X . Genome research 21, 1592-1600. Shen, L., Wu, H., Diep, D., Yamaguchi, S., D'Alessio, A.C., Fung, H.L., Zhang, K., and Zhang, Y. (2013). Genome-wide analysis reveals TET- and TDG-dependent 5-methylcytosine oxidation dynamics. Cell 153, 692-706. Song, C.X., Szulwach, K.E., Dai, Q., Fu, Y., Mao, S.Q., Lin, L., Street, C., Li, Y., Poidevin, M., Wu, H., et al. (2013). Genome-wide profiling of 5-formylcytosine reveals its roles in epigenetic priming. Cell 153, 678-691. Song, J., and Pfeifer, G.P. (2016). Are there specific readers of oxidized 5-methylcytosine bases? BioEssays : news and reviews in molecular, cellular and developmental biology 38, 1038-1047. Spruijt, C.G., Gnerlich, F., Smits, A.H., Pfaffeneder, T., Jansen, P.W., Bauer, C., Munzel, M., Wagner, M., Muller, M., Khan, F., et al. (2013). Dynamic readers for 5-(hydroxy)methylcytosine and its oxidized derivatives. Cell 152, 1146-1159. Stroud, H., Feng, S., Morey Kinney, S., Pradhan, S., and Jacobsen, S.E. (2011). 5- Hydroxymethylcytosine is associated with enhancers and gene bodies in human embryonic stem cells. Genome biology 12, R54. Sun, Z., Terragni, J., Borgaro, J.G., Liu, Y., Yu, L., Guan, S., Wang, H., Sun, D., Cheng, X., Zhu, Z., et al. (2013). High-resolution enzymatic mapping of genomic 5-hydroxymethylcytosine in mouse embryonic stem cells. Cell reports 3, 567-576. Szulwach, K.E., Li, X., Li, Y., Song, C.X., Wu, H., Dai, Q., Irier, H., Upadhyay, A.K., Gearing, M., Levey, A.I., et al. (2011). 5-hmC-mediated epigenetic dynamics during postnatal neurodevelopment and aging. Nature neuroscience 14, 1607-1616. Tahiliani, M., Koh, K.P., Shen, Y., Pastor, W.A., Bandukwala, H., Brudno, Y., Agarwal, S., Iyer, L.M., Liu, D.R., Aravind, L., et al. (2009). Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1. Science 324, 930-935. Tan, L., and Shi, Y.G. (2012). Tet family proteins and 5-hydroxymethylcytosine in development and disease. Development 139, 1895-1902. Vidal, M., Brachmann, R.K., Fattaey, A., Harlow, E., and Boeke, J.D. (1996). Reverse two-hybrid and one-hybrid systems to detect dissociation of protein-protein and DNA-protein interactions. 26 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Proceedings of the National Academy of Sciences of the United States of America 93, 10315- 10320. Wang, D., Hashimoto, H., Zhang, X., Barwick, B.G., Lonial, S., Boise, L.H., Vertino, P.M., and Cheng, X. (2017). MAX is an epigenetic sensor of 5-carboxylcytosine and is altered in multiple myeloma. Nucleic acids research 45, 2396-2407. Weirauch, M.T., Yang, A., Albu, M., Cote, A.G., Montenegro-Montero, A., Drewe, P., Najafabadi, H.S., Lambert, S.A., Mann, I., Cook, K., et al. (2014). Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158, 1431-1443. Wheldon, L.M., Abakir, A., Ferjentsik, Z., Dudnakova, T., Strohbuecker, S., Christie, D., Dai, N., Guan, S., Foster, J.M., Correa, I.R., Jr., et al. (2014). Transient accumulation of 5-carboxylcytosine indicates involvement of active demethylation in lineage specification of neural stem cells. Cell reports 7, 1353-1361. Wockner, L.F., Noble, E.P., Lawford, B.R., Young, R.M., Morris, C.P., Whitehall, V.L., and Voisey, J. (2014). Genome-wide DNA methylation analysis of human brain tissue from schizophrenia patients. Translational psychiatry 4, e339. Wu, H., and Zhang, Y. (2011). Mechanisms and functions of Tet protein-mediated 5- methylcytosine oxidation. Genes & development 25, 2436-2452. Wu, X., and Zhang, Y. (2017). TET-mediated active DNA demethylation: mechanism, function and beyond. Nature reviews Genetics 18, 517-534. Xu, C., and Corces, V.G. (2018). Nascent DNA methylome mapping reveals inheritance of hemimethylation at CTCF/cohesin sites. Science 359, 1166-1170. Yin, Y., Morgunova, E., Jolma, A., Kaasinen, E., Sahu, B., Khund-Sayeed, S., Das, P.K., Kivioja, T., Dave, K., Zhong, F., et al. (2017). Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science 356. Zhao, L., Sun, M.A., Li, Z., Bai, X., Yu, M., Wang, M., Liang, L., Shao, X., Arnovitz, S., Wang, Q., et al. (2014). The dynamics of DNA methylation fidelity during mouse embryonic stem cell self- renewal and differentiation. Genome research 24, 1296-1307.

27

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 1

X 1,235 N20 DNA library TF Mixture Primer 2 TFn Primer 2

Primer 2 I

GST Primer 1 TF Barcode Bsa UMI

GSH bead Primer 2 Wash 1.10 X 1012

Unbound DNA

TF I.D. DNA seq Bound DNA Bound DNA Primer 1 UMI Primer 2

TF TF NGS n BsaI n Ligase GST GST

GSH bead GSH bead Ligation site bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 2

A B 12 ) ) 3

8 3 8 (x 10 (x (x 10 (x γ 4 45 84 501 4 NRF1 C/EBP 0 0 0 4 8 0 4 8 12 DAPPL CIS-BP GST (x 103) GST (x 103) Different ) ) 3 4 8 1 17.9% (x 10 (x (x 10 (x 4

MLX 82.1% ELK1

0 0 0 1 0 4 8 Similar GST (x 104) GST (x 103)

C YBX2 USF1 MEIS2 MLX CSDA FOXC2

Clone 1

Clone 2

PCC 0.745 0.844 0.995 0.955 0.693 0.974

D TEAD2 DMTF SFMBT1

Consensus + + - - + + - - + + - - TF protein - + + + - + + + - + + + T7 - - + - - - + - - - + - N8 - - - + - - - + - - - + E 24.8 ± 2.7 µM 24.4 ± 8.1 µM 48.3 ± 25.3 µM 0.6 0.6 0.4

0.3 0.3 0.2 Binding Binding (nm) 0 0 0 0 800 1600 2400 0 200 400 600 0 500 1000 1500 Time (s) Time (s) Time (s) bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 3

A Symmetric readers Hemi readers B mCG hmCG hmCG fCG hmCG fCG mCG caCG mCG caCG 4 40 59 40 57 46

Symmetric Hemi Symmetric Hemi

fCG caCG

29 55 63 38 69 70

Symmetric Hemi Symmetric Hemi

C TF subfamilies Symmetric Hemi mCG bHLH 27 bHLH 51 hmCG bZIP 16 bZIP 19 fCG /DP 4 E2F/DP 11 caCG ETS 25 ETS 55 Symmetric HMG box HMG box mCG 3 8 Homeo Homeo hmCG 52 83 RFX 10 RFX 8 fCG TFII 11 TFII 8 Hemi caCG T-box 8 T-box 15 0 50 100 ZNF 91 ZNF 99 Count Others 142 Others 191 HMG box E2F/DP RFX TFII 0 25 50 75 100 (%) 0 25 50 75 100 (%) CTF/NF-1 T-box bZIP ETS bHLH Homeo ZNF Other CG mCG hmCG fCG caCG

D MEIS1 MEIS2 MEIS3 ATGTCG ATGTCG ATGTCG 6 TGTCGA 12 TGTCGA GTGTCG )

3 TGTCGT GTGTCG 8 TGTCGA caCG 4 8

4 MEIS (x10 MEIS 2 4 Symmetric

0 0 0 0 2 4 6 0 4 8 12 0 4 8

TGTCGT CTGTCG CTGTCG 8 20 TGTCGA 20 TGTCGA TGTCGA

) TGTCGT 3 CTGTCG TGTCGT caCG 4 10 10 Hemi MEIS(x10

0 0 0 0 4 8 0 10 20 0 10 20 GST (x103) GST (x103) GST (x103) bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 4

A B Symmetric modification mC hmC fC caC RFX2 TBX2 FOXC2 Probe DNA: ZBTB7B HOMEZ 5’-T(A/C)ACGCCTC TBX2 TBX3 C/C mC/mC hmC/hmC fC/fC caC/caC MEIS1 TF MEIS3 - + - + - + - + - + IRX2 CTCF EN2 HMBOX1 LHX NHLH1 NHLH2 PRDN16 TULP4 STAT5A HMBOX1 ZBTB7A Probe DNA: ZNF256 5’-CGACTAA ZNF468 ZBTB7A ZNF449 C/C mC/mC hmC/hmC fC/fC caC/caC ZNF691 TF - + - + - + - + - + ZSCAN5A OTX1 TBX20 ZSCAN21 FOXP4 LIN28A PKNOX2 MEIS2 ETS1 MLX OVOL2 CRX Probe DNA: C/EBPγ METTL3 5’-CCGTTA OVOL2 HSF2 C/C mC/mC hmC/hmC fC/fC caC/caC TF - + - + - + - + - + USF1 USF2 HSF1 TFAP2C NR2E1 ELF1 NRF1 ELK1 SPDEF DMTF1 TFAP2A DMTF1 NFATC1 Probe DNA: TFAP2B 5’-(A/C/G)CATCCG TFAP2E ZBED2 SFMBT1 C/C mC/mC hmC/hmC fC/fC caC/caC PKNOX1 TF - + - + - + - + - + TAL2 PURA ZNF187 EGR2

Enhanced Suppressed Bimodal No data No preference bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 5

A C HMBOX1 FOXC2 Symmetric modification mC hmC fC caC HOMEZ METTL3 LIN28A HMBOX1 MLX MYOD1 NHLH2 FOXC2 HOMEZ IRX2 IRX2 ZBTB7B RFX2 TBX2 TBX3 MEIS1 MEIS2 MEIS3 PKNOX2 B C/EBPγ CRX TBX2 RFX2 EN2 LHX4 OTX1 STAT5A ZBTB7A ETS1 TBX20 FOXP4 Different Similar bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 6

A B HOMEZ Enhanced OVOL2 Suppressed 5 mC hmC 0.6 fC ) 2 4 caC

HOMEZ 3 0.4

C/C (x103) 2

0.2 Symmetric modificationSymmetric (log mC modificationSymmetric 1 hmC fC OVOL2 caC 0 0.0 0 1 2 3 4 5 0.0 0.2 0.4 0.6

Hemi modification (log2) Hemi modification

C C/C mC/C mC/mC 500 500 500 C/C: 5.90 ± 2.82 µM mC/C: 553 ± 113 nM mC/mC: 234 ± 26.9 nM 0 0 0 nM

HOMEZ-DNA 1.2 1.2 1.2 complex 0.6 0.6 0.6 4500 nM Binding (nm) Binding (nm) Binding (nm) Binding 0.0 0.0 0.0 2250 nM Free DNA 0 400 800 1200 0 400 800 1200 0 400 800 1200 1125 nM Time (s) Time (s) Time (s) 563 nM C/C hmC/C hmC/hmC 375 nM 500 500 500 hmC/C: 345 ± 56.5 nM hmC/hmC: 135 ± 32.3 nM 0 0 0 nM 188 nM 94 nM HOMEZ-DNA 1.2 1.2 47 nM complex 0.6 0.6 Binding (nm) Binding (nm) Binding Free DNA 0.0 0.0 0 400 800 1200 0 400 800 1200 Time (s) C/C fC/C fC/fC Time (s) 50 50 50 C/C: 17.4 ± 10.6 nM fC/C: 7.23 ± 0.96 nM fC/fC: 6.04 ± 1.53 nM 0 0 0 nM TBX3-DNA 3 3 3 complex 2 2 2 50 nM 25 nM 1 1 1 13 nM 6 nM Binding (nm) Binding (nm) Binding (nm) Binding Free DNA 0 0 0 0 400 800 1200 0 400 800 1200 0 400 800 1200 C/C caC/C caC/caC Time (s) Time (s) Time (s) 600 600 600 C/C: 384 nM caC/C: 59.1 ± 23.2 nM caC/caC: 14.4 ± 3.23 nM 0 0 0 nM STAT5A-DNA 1.8 1.8 1.8 complex 1.2 1.2 1.2 680 nM 340 nM 0.6 0.6 0.6 170 nM 85 nM

Free DNA (nm) Binding (nm) Binding (nm) Binding 0.0 0.0 0.0 0 400 800 1200 0 400 800 1200 0 400 800 1200

C/C mC/C mC/mC Time (s) Time (s) mC/mC: n/a Time (s) 300 300 300 C/C: 4.93 ± 4.24 µM mC/C: 14.9 ± 3.18 µM 0 0 0 nM 3 3 3

OVOL2-DNA 2 2 2 300 nM complex 200 nM 1 1 1 150 nM 100 nM Binding (nm) Binding (nm) Binding (nm) Binding 0 0 0 Free DNA 0 400 800 1200 0 400 800 1200 0 400 800 1200 Time (s) Time (s) Time (s) bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 7

A B USF1 USF2 chr14 24784500 24785500 24786500 0.06 hmCG 0.10 0.04

Density Density 0.05 USF1 0.02 Observed Observed

0.00 0.00 0 40 80 120 229 0 20 40 68 CCCTACCACGTGTGTGAA Count Count chr14:24785563-24785579 TAF7 ATF2

C ChIP-seq DAPPL 0.09 0.09

0.06 0.06 USF1 Density Density 0.03 Observed 0.03 Observed

USF2 0.00 0.00 0 10 20 36 0 10 20 46 Count Count D USF1 USF2 Active Promoter 0.3% 1%

Active Promoter 22% ActiveActive Promoter Promoter 32%

e

e t t 6% 11%

WeakWeak Promoter Promoter 21% WeakWeak Promoter Promoter 26%

a

a t

t 9% 12% S

S InactivInactivee Promoter Promoter 10% InactivInactivee Promoter Promoter 10%

8% factor(Names) 10% factor(Names)

n

n i

i StrongStrong Enhancer Enhancer 6% StrongStrong Enhancer Enhancer 7%

t

t a a 41% P=1.5e−18 38%

WeakWeak Enhancer Enhancer 18% WeakWeak Enhancer Enhancer 17% P=9.4e−5 m

m 11% 8% o

o InsulatorInsulator 5% P=2.0e−4 InsulatorInsulator 4% r

r 10% 7% h h P=1.5e−4

Weak Weaktranscr transcribedibed 4% Weak Weaktranscr transcribedibed 2% C 15% C 13% background HeterochromatinHeterochromatin 13% HeterochromatinHeterochromatin 3% 00 100.1 200.2 300.3 400.4 00 100.1 200.2 300.3 400.4 hmCG binding sites FractionRatio RatioFraction

7 bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure S1

A Anchor oligo: maleimide-5’ TACGATCATAAACTTAGTTCACTTCAACAAGAATGATAT GATTTAGTTCAGGACGTAGG B Spacer Primer 1 Barcode oligo: 5’-GACTCACGGTCTCC↓GACTGGNNNNNNNNACTATTAAAGGCCCTACGTCCTGAACTAAATC BsaI UMI Barcode Primer 1 rev

C -3’ 3’- Barcode Oligo

TFn Anchor oligo TFn TFn

GST SH GST  -3’ GST  -3’ 1 hr @ RT Anneal GSH bead GSH bead GSH bead

TF Mixture TFn

Klenow GST  Combine

GSH bead D

kDa 130 100

70 55

35

25

E F ZBED2 TP53 MLX ZBED2 TP53 MLX DNA BC - + - + - + - + - + - + kDa kDa 130 130 100 100 70 70

55 55

35 35 25 25

Coomassie blue Ethidium bromide bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure S2

Generate 6-mers AATATCGCAACAATCTGTGA N-mer Input AATATC AATATCG Protein AATATCGC N-6 Normalize AATATCGCA .fasta AATATCGCAA … … Extract special binding 6-mers 6 AGATGT GCGCAA ) 3 ) 3 8 CGCAAT

(x10 4 GATGTT ATTGCG (x10

CAGATA γ Retrieve original sequences 4 2 GST ZNF238

containing the enriched C/EBP .fasta 6-mers 0 0 0 2 4 6 0 4 8 GST (x103) GST (x103) Discover candidate motifs CDF ZNF238: Build null distribution Retrieve significant Background C/EBPγ: of p-value calculated … … motifs (p-value ≤0.05) (CDF) with HOMER 0 0.2 0.4 0.6 0.8 1.0 0.8 0.60.4 0.2 0 10 30 50 70 -log(p) bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure S3

A 5’-CACATCCTTCACATTAATCCNNNNNNNNEGNNNNNNNNCATTCACTAACTAATCXXXACCTCGA↓GACTTGAGACCAT Primer 2 16-mer DNA Library I.D. BsaI

B 5’-CACATCCTTCACATTAATCCNNNNNNNNCGNNNNNNNNCATTCACTAACTAATCCGTACCTCGA↓GACTTGAGACCAT Primer 2 16-mer DNA Library I.D. BsaI

unmodified library mC library hmC library C

Pool 5 libraries 16-mer DNA Libraries 4.29x109 4.29x109 4.29x109 NNNNNNNN*CGNNNNNNNN

BsaI Primer 2 fC library caC library Library I.D.

4.29x109 4.29x109

Bound DNA Bound DNA TF I.D. DNA seq

Primer 1 UMI Primer 2 Wash TFn TFn

GST GST BsaI NGS Unbound DNA GSH bead Ligase GSH bead Library I.D. Ligation site Figure S4

A

CEBPG

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

CRX

Logo (E)

Logo (S)

CTCF

Logo (E)

Logo (S)

DMTF1

Logo (E)

Logo (S) E2F1

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

EGR2

Logo (E)

Logo (S)

ELF1

Logo (E)

Logo (S)

ELK1

Logo (E)

Logo (S) EN2

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

ETS1

Logo (E)

Logo (S)

FOXC2

Logo (E)

Logo (S)

FOXP4

Logo (E)

Logo (S) HMBOX1

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

HOMEZ

Logo (E)

Logo (S)

HSF1

Logo (E)

Logo (S)

HSF2

Logo (E)

Logo (S) IRX2

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

LHX4

Logo (E)

Logo (S)

LIN28A

Logo (E)

Logo (S)

MEIS1

Logo (E)

Logo (S) MEIS2

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

MEIS3

Logo (E)

Logo (S)

METTL3

Logo (E)

Logo (S)

MLX

Logo (E)

Logo (S) MYOD1

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

NFATC1

Logo (E)

Logo (S)

NHLH1

Logo (E)

Logo (S)

NHLH2

Logo (E)

Logo (S) NR2E1

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

NRF1

Logo (E)

Logo (S)

OTX1

Logo (E)

Logo (S)

OVOL2

Logo (E)

Logo (S) PKNOX1

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

PKNOX2

Logo (E)

Logo (S)

PRDM16

Logo (E)

Logo (S)

PURA

Logo (E)

Logo (S) RFX2

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

SFMBT1

Logo (E)

Logo (S)

SPDEF

Logo (E)

Logo (S)

STAT5A

Logo (E)

Logo (S) TAL2

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

TBX2

Logo (E)

Logo (S)

TBX3

Logo (E)

Logo (S)

TBX20

Logo (E)

Logo (S) TFAP2A

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

TFAP2B

Logo (E)

Logo (S)

TFAP2C

Logo (E)

Logo (S)

TFAP2E

Logo (E)

Logo (S) TULP4

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

USF1

Logo (E)

Logo (S)

USF2

Logo (E)

Logo (S)

ZBED2

Logo (E)

Logo (S) ZBTB7A

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

ZBTB7B

Logo (E)

Logo (S)

ZNF187

Logo (E)

Logo (S)

ZNF256

Logo (E)

Logo (S) ZNF449

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

ZNF468

Logo (E)

Logo (S)

ZNF691

Logo (E)

Logo (S)

ZSCAN21

Logo (E)

Logo (S) ZSCAN5A

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S) B

CEBPB

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S) CEBPG

Logo (E)

Logo (S) CRX

Logo (E)

Logo (S)

DMTF1

Logo (E)

Logo (S) E2F1

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

EGR2

Logo (E)

Logo (S)

ELF2

Logo (E)

Logo (S)

ELK1

Logo (E)

Logo (S) EN2

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

ERG

Logo (E)

Logo (S)

ETS1

Logo (E)

Logo (S)

FOXC2

Logo (E)

Logo (S) FOXP4

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

GRHL2

Logo (E)

Logo (S)

HMBOX1

Logo (E)

Logo (S)

HMGN2

Logo (E)

Logo (S) HOMEZ

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

HSF1

Logo (E)

Logo (S)

HSF2

Logo (E)

Logo (S)

JDP2

Logo (E)

Logo (S) LHX4

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

MAGEA4

Logo (E)

Logo (S)

MEIS1

Logo (E)

Logo (S)

MEIS2

Logo (E)

Logo (S) MEIS3

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

METTL3

Logo (E)

Logo (S)

MLX

Logo (E)

Logo (S)

NFATC1

Logo (E)

Logo (S) NHLH2

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

NR2E1

Logo (E)

Logo (S)

NRF1

Logo (E)

Logo (S)

OTUD4

Logo (E)

Logo (S) OTX1

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

OVOL2

Logo (E)

Logo (S)

PKNOX1

Logo (E)

Logo (S)

PKNOX2

Logo (E)

Logo (S) POU6F1

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

PPP1R13L

Logo (E)

Logo (S)

RBPJ

Logo (E)

Logo (S)

RFX2

Logo (E)

Logo (S) RHOXF1

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

RXRA

Logo (E)

Logo (S)

SATB2

Logo (E)

Logo (S)

SFMBT1

Logo (E)

Logo (S) SMARCE1

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

SPDEF

Logo (E)

Logo (S)

STAT5A

Logo (E)

Logo (S)

TBX2

Logo (E)

Logo (S) TBX3

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

TBX20

Logo (E)

Logo (S)

TBXT

Logo (E)

Logo (S)

TERF1

Logo (E)

Logo (S) TFAP2B

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

TFAP2D

Logo (E)

Logo (S)

TFAP2E

Logo (E)

Logo (S)

TRIM24

Logo (E)

Logo (S) TRIM28

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

USF1

Logo (E)

Logo (S)

USF2

Logo (E)

Logo (S)

YBX1

Logo (E)

Logo (S) ZBTB7B

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

ZCRB1

Logo (E)

Logo (S)

ZFP36L2

Logo (E)

Logo (S)

ZNF187

Logo (E)

Logo (S) ZNF213

bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Logo (E)

Logo (S)

ZNF518A

Logo (E)

Logo (S) bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure S5 Hemi-modification mC hmC fC caC FOXC2 MEIS1 ZBTB7B HMBOX1 TBX2 RFX2 TBX3 FOXP4 EN2 YBX1 PPP1R13L POU6F1 ZNF518A ZCRB1 RXRA TRIM28 TBX20 TBXT ELF2 GRHL2 C/EBPβ LHX4 MAGEA4 NHLH2 STAT5A SATB2 ZFP36L2 MEIS2 MEIS3 NRF1 HSF2 MLX C/EBPγ PKNOX2 NR2E1 PKNOX1 HOMEZ METTL3 TFAP2B TFAP2E ZNF213 TRIM24 EGR2 OTUD4 JDP2 DMTF1 E2F1 USF2 CRX HSF1 SPDEF OVOL2 SFMBT1 USF1 NFATC1 TERF1 ERG OTX1 TFAP2D ELK1 SMARCE1 ETS1 HMGN2 RBPJ RHOXF1 ZNF187

Enhanced Suppressed Bimodal No data No preference bioRxiv preprint doi: https://doi.org/10.1101/638700; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure S6

A C NR2E1 YBX1 Hemi-modification mC hmC fC caC HMBOX1 NR2E1 PKNOX1 FOXP4 YBX1 TBXT HSF2 POU6F1 HMBOX1 JDP2 TFAP2E JDP2 ZBTB7B EN2 FOXC2 MEIS1 MEIS2 MEIS3 C/EBPγ PKNOX2 HOMEZ B TBX2 TBX3 PKNOX2 EGR2 RFX2 METTL3 NRF1 EGR2 TBX20 LHX4 MLX NHLH2 STAT5A Different Similar Figure Figure B A TBX20 TBX3 TBX2 complex TBX2

Binding (nm) Binding (nm) Binding (nm) FreeDNA 0.0 0.4 0.8 1.2 - 0 1 2 DNA 0 1 2 3 bioRxiv preprint C/C: 0 0 0 C/C: C/C S7 0 : : 16.9 72.7 17.4 C/C 400 400 400 ± ± ± 29.3 50 Time (s) Time 10.6 14.3 Time (s) Time Time Time (s) 0 doi: nM fC nM nM certified bypeerreview)istheauthor/funder.Allrightsreserved.Noreuseallowedwithoutpermission. /C 800 800 https://doi.org/10.1101/638700 800 50 0 fC / fC 1200 1200 1200 50 nM complex TBX3 Binding (nm) Binding (nm) 0.0 Binding0.4 0.8 (nm)1.2 0 1 2 0 1 2 3 FreeDNA fC 0 0 0 fC fC - DNA /C : /C /C : : : 8.36 44 0 7.23 ± 400 400 400 ; C/C this versionpostedMay16,2019. ± 34 ± Time (s) Time Time (s) Time Time Time (s) 5.43 0.96 nM 50 0 nM nM 800 800 800 fC /C 50 0 1200 1200 1200 fC / fC 50 nM

Binding (nm) Binding (nm) 0.0 Binding0.4 0.8 (nm)1.2 0 1 2 complex TBX20 0 1 2 3 0 0 fC 0 fC fC FreeDNA / / / fC fC fC The copyrightholderforthispreprint(whichwasnot - DNA : : 4.54 : : 6.04 8.07 400 400 400 0 ± ± ± Time (s) Time Time Time (s) Time (s) Time 3.37 1.53 C/C 4.94 50 nM nM nM 0 800 800 800 fC /C 50 1200 1200 0 1200 fC / fC 50 nM