Identification and Functional Characterisation of Non-Canonical DNA Methylation Readers in the Mammalian Brain

Sufyaan Mohamed

School of Molecular Sciences
ARC Centre of Excellence in Plant Energy Biology
Harry Perkins Institute of Medical Research
2020

Supervisors: Professor Ryan Lister (60%), Associate Professor Ozren Bogdanovic (20%), Dr Yuliya Karpievitch (10%), Professor Ian Small (5%), Associate Professor Monika Murcha (5%)

i. Thesis Declaration:

I, Sufyaan Mohamed, certify that: The work in this thesis has been substantially accomplished during enrolment in this degree. This thesis does not contain material which has been accepted for the award of any other degree or diploma in my name, in any university or other tertiary institution. No part of this work will, in the future, be used in a submission in my name, for any other degree or diploma in any university or other tertiary institution without the prior approval of The University of Western Australia and where applicable, any partner institution responsible for the joint award of this degree. This thesis does not contain any material previously published or written by another person, except where due reference has been made in the text. The work(s) presented here are not in any way a violation or infringement of any copyright, trademark, patent, or other rights whatsoever of any person. Mass spectrometry samples were processed by Ino Karemaker. The ProteoMM code used for the analysis of the human and mouse datasets described in Chapter III was written by Dr Yuliya Karpievitch. This thesis contains published work and/or work prepared for publication, some of which has been co-authored.

Signature:

Date: 07/12/2020

ii. Abstract

DNA methylation is a covalent modification found in all vertebrates and functions as an additional layer of information through which transcription may be controlled. Changes in transcription may be modulated by proteins that bind to DNA methylation, termed methyl ‘readers’. DNA methylation (mC) within mammals occurs predominantly in the CG dinucleotide context (mCG), but methylation in non-CG contexts (mCH, where H = A, C or T) exists within restricted cell types. In the brain, mCH accumulates to become the dominant form of DNA methylation in adult neurons, coinciding with synaptogenesis. Whilst many mCG reader proteins have been identified and characterised, the relatively recent discovery of mCA has meant that only one mCH reader, MECP2, has been discovered. This PhD focused on identifying potential mCG and mCA reader candidate proteins in the human and mouse brain. DNA pull-downs utilising methylated probes and unmethylated controls were coupled with quantitative mass spectrometry (MS) to screen for potential mCG and mCA binders in human and mouse brain.

Chapter 1 presents an introduction to DNA methylation in mammals, and the mechanisms by which this epigenetic mark is deposited, removed and read. Additionally, Chapter 1 discusses some well characterised mCG reader proteins, their effects on transcription and the experimental challenges faced in characterising their binding. Chapter 2 contains the methods relevant to the experiments discussed in Chapters 3, 4 and 5. A novel, multivariate proteomics analysis tool, ProteoMM, was developed to analyse the DNA pull-down MS data. The rationale behind the development of ProteoMM, its optimisation and its efficacy are detailed in Chapter 3. The identification of novel mCG and mCA readers is discussed in Chapters 4 and 5 respectively. Results from the mCA screen established a list of candidate mCA readers. The top mCA reader, MBD2, was chosen for biochemical validation experiments to confirm a direct, specific interaction with mCA. The recombinant expression, purification and DNA binding ability of this protein are also detailed in Chapter 5. This work constitutes the first comprehensive combined analysis of mCG readers in human and mouse brain. ProteoMM identified a significant overlap in the binding of proteins enriched in each species. Further, this study constitutes the first mCA reader screen, and confirms a direct, specific affinity of MBD2 for mCA, providing a crucial repository of mCG and mCA binders that future studies can build upon (Chapter 6).

iii. Acknowledgements

There are many names and faces of people that cross my mind when thinking about the completion of my PhD. First and foremost is my family, who have supported me through the highs, lows and stressful situations of my PhD. You were the backbone I needed, emotionally and financially, especially in the latter years of my PhD. Another individual I am eternally grateful to is my partner Nathan for his understanding, patience, support and confidence. I would like to thank my supervisor Prof. Ryan Lister for his feedback, support, and for inspiring a sense of drive and excellence within me, helping me become a better scientist. I would also like to thank my co-supervisors Prof. Ian Small, Assoc. Prof. Monika Murcha and Assoc. Prof. Ozren Bogdanovic for being good mentors and individuals I could go to for advice when needed. I am really grateful to Dr Yuliya Karpievitch for the numerous sessions spent in her office in which I learnt vital programming skills. I would also like to extend my gratitude to Dr Ethan Ford, a master molecular biologist and great individual with sound advice in my early years, as well as Dr Marina Oliva for her advice, banter and comedic relief. Some other notable post-docs include Dr Daniel Poppe for his patience, and for sharing his knowledge and time in cell culture and microscopy, Christian Pflueger for his knowledge (in all matters) and Jahnvi Pflueger for her time in ensuring the laboratory functioned efficiently and for running the numerous NGS libraries I had prepared. Thanks to Tessa for helping revise parts of my thesis, and to the many other students in the lab. Extending out of the Lister lab, I would like to thank Prof. Charlie Bond for making time, having patience and allowing me to learn about recombinant protein expression in his lab.
Special thanks to Dr Gavin Knott and Dr Amanda Blythe, among the other always friendly members of this lab, for their invaluable advice and patience in teaching me how to operate the sensitive chromatography systems. I would also like to thank Dr Cathie Small, who has been a fond part of my memories since I was introduced to the PEB family as an undergraduate. I would also like to extend thanks to the admin team, most notably Deb for handling orders, errors and anything lab reagent related, and Geetha for always being a friendly face, helpful and ever dependable in all matters related to admin. Thanks go out to the many individuals and PhD students that made my experience a more wholesome one. Some have been and gone, but I am grateful to have known you all. Special mention to Katharina, Karina, Jakob, Jon, Dennis, Tim, Arnold and Max. Thank you for the memories and support; in so many instances, you were all instrumental to my wellbeing.

This research was supported by an Australian Government Research Training Program (RTP) Scholarship.

iv. Authorship declaration

This thesis contains work that has been prepared for publication.

Details of work: Identification of mCG and mCA readers in human and mouse brain

Location in thesis: Chapters 3, 4 and 5

Contribution(s) to work: The processing of mass spectrometry samples was performed by Dr Ino Karemaker. ProteoMM was developed by Dr Yuliya Karpievitch, who was responsible for the analysis code concerning the integration, imputation, and normalisation of these datasets. All remaining experiments were performed by the student. These include isolation of protein extract from human and mouse brain, DNA pull-downs, Western blots, optimising and benchmarking ProteoMM, recombinant DNA cloning, recombinant protein expression and purification, electrophoretic mobility shift assays, and all other data analysis and plots within the thesis.

Student signature: Date: 07/12/2020

I, Ryan Lister, certify that the student’s statements regarding their contribution to each of the works listed above are correct.

Coordinating supervisor signature: Date: 07/12/2020

Table of contents

i. Thesis Declaration
ii. Abstract
iii. Acknowledgements
iv. Authorship declaration

Chapter I Introduction
  Epigenetics
    I.1.1 Definition of epigenetics
  Sculpting the epigenome landscape
    I.2.1 Layers of the epigenome
  Features of DNA methylation in mammals
    I.3.1 DNA methylation at CpG islands (CGIs)
    I.3.2 DNA methylation at enhancers and intergenic regions
    I.3.3 DNA methylation and alternative splicing
    I.3.4 CH methylation
  Writing, maintenance, and removal of DNA methylation in mammalian genomes
    I.4.1 Writers of DNA methylation
    I.4.2 Erasure of DNA methylation
  Readers of DNA methylation
    I.5.1 The MBD family
    I.5.2 Set and RING-associated (SRA) family
    I.5.3 Kaiso and the Broad complex, Tramtrack, Bric-à-brac or Poxvirus Zinc-finger (BTB/POZ) family
    I.5.4 Expansion of the mCG reader repertoire and the need for contextually relevant, multifaceted characterisation approaches
    I.5.5 A need for mCH reader characterisation
  Outline of thesis
  References

Chapter II Materials and methods
  DNA pull-down coupled to Mass spectrometry
    II.1.1 Nuclei isolation and protein extraction from mammalian brain
    II.1.2 Preparation of biotinylated probes
    II.1.3 DNA pull-downs
    II.1.4 Nuclear enrichment confirmation by Western blot
    II.1.5 On bead trypsin Digest
  Overview of mass spectrometry analysis using ProteoMM
    II.2.1 Identification of mC readers in human and mouse by ProteoMM
    II.2.2 Eigen MS normalisation and model-based imputation
    II.2.3 Model-based differential expression and presence/absence analysis
    II.2.4 Identification of Transcription factors, interactors and protein family information within data
    II.2.5 Benchmarking ProteoMM
    II.2.6 analysis
  Validation of mC reader binding
    II.3.1 RT-PCR
    II.3.2 Cloning of mCA reader candidates
    II.3.3 Protein expression
    II.3.4 Protein Purification
    II.3.5 Probe design
    II.3.6 Electrophoretic Mobility Shift Assay (EMSA)
  References
  Supplementary information

Chapter III Development and optimisation of ProteoMM, a multivariate statistical analysis tool
  Summary
  Introduction
    III.1.1 Overview of MS proteomics
    III.1.2 The identification of protein interactions by MS
    III.1.3 Tandem affinity purification
    III.1.4 Quantitative Mass spectrometry by isotopic labelling
    III.1.5 Quantitative Mass spectrometry by label-free methods
    III.1.6 Challenges in the analysis of MS data
    III.1.7 ProteoMM, a novel multivariate, multi-dataset peptide level analysis tool
  Results
    III.2.1 Probe design
    III.2.2 DNA pull-down optimisation
    III.2.3 Eigen MS implementation and model-based imputation
    III.2.4 Differential expression and presence/absence analysis
    III.2.5 Validation of ProteoMM by mCG reader verification
    III.2.6 Benchmarking ProteoMM on a subset of common high confidence proteins
    III.2.7 Benchmarking ProteoMM by comparisons to a SELEX repository of mCG/CG readers
    III.2.8 Comparisons of mCG and CG DBD-containing proteins with published data
  Discussion
    III.3.1 ProteoMM, a novel multivariate differential expression proteomics analysis tool
    III.3.2 Implementation of Eigen MS normalisation
    III.3.3 Missingness and the need for dataset-tailored imputation
    III.3.4 External experimental considerations
    III.3.5 Challenges in benchmarking ProteoMM
    III.3.6 Comparative assessment of ProteoMM and Perseus based analyses
    III.3.7 Overall validation of ProteoMM by comparisons with externally published data
  References

Chapter IV Identification of novel mCG readers in human and mouse brain
  Summary
  Introduction
    IV.1.1 Epigenetic regulation of methylated CGIs
    IV.1.2 Epigenetic regulation at unmethylated CGIs
    IV.1.3 Methods adopted for the assessment of binding
  Results
    IV.2.1 Global assessment of mCG/CG datasets
    IV.2.2 Identification of novel mCG and CG readers in human and mouse
    IV.2.3 Identification of mCG readers in human or mouse-limited datasets
    IV.2.4 Gene ontology analyses of mCG and CG readers in human and mouse brain
  Discussion
    IV.3.1 Validation of the affinity pull-down results through the identification of known mCG and CG readers
    IV.3.2 Readers exhibiting unexpected binding behaviour
    IV.3.3 Identification of novel mCG readers
    IV.3.4 Transcriptional effector complexes and
    IV.3.5 Limitations of the pull-down and in mCG reader characterisation
    IV.3.6 Building upon the mCG reader repertoire in the mammalian brain
  References
  Supplementary information

Chapter V Identification and characterisation of mCA readers within human and mouse brain
  Summary
  Introduction
    V.1.1 Non-CG Methylation in mammalian cells and tissues
    V.1.2 Writers of mCH
    V.1.3 Erasure of mCH
    V.1.4 Distribution of mCH and evidence for its roles in coordinating biological processes
    V.1.5 Non-CG methylation in the mammalian brain (Expansion from point 1.5)
    V.1.6 MECP2, an mCH reader critical for neural development
    V.1.7 Towards the need for novel mCH reader identification and characterisation
  Results
    V.2.1 Global assessment of mCA/CA datasets
    V.2.2 Identification of novel mCA and CA readers in human and mouse
    V.2.3 Identification of mCA readers in human or mouse-limited datasets
    V.2.4 Coupling protein interactors to identified DBD-containing proteins in human and mouse brain
    V.2.5 Recombinant expression, purification and validation of MBD2 as an mCA reader
    V.2.6 Gene ontology analyses of mCA and CA readers in human and mouse brain
    V.2.7 Summary of mC and C binders by classification into distinct protein regulatory complexes
  Discussion
    V.3.1 Identification of CA readers in human and mouse brain
    V.3.2 Identification of mCA readers in human and mouse brain
    V.3.3 Expression and purification of MBD proteins
    V.3.4 Biochemical validation of MBD2 as an mCA reader
    V.3.5 Capturing protein reader conservation in human and mouse brain
    V.3.6 Limitations to the techniques employed within this study
    V.3.7 Towards in vivo mCA reader characterisation
  References
  Supplementary information

Chapter VI General discussion
  Summary
  Potential implications for identified mCG binders
  The first mCA reader screen in mammals
  mC reader conservation in human and mouse brain
  Experimental strategies for downstream characterization of mC readers
  Concluding remarks
  References

Chapter I Introduction

Epigenetics

I.1.1 Definition of epigenetics

The use of the term ‘epigenetics’ has shifted over time, with common usage now associated with our ever-increasing understanding of the role of information superimposed upon the DNA to regulate gene expression in eukaryotes. Seventy-eight years ago, Conrad Waddington defined epigenetics as the process by which genotypes gave rise to phenotypes1,2. Genome sequencing reinforced our understanding that the majority of cells within an organism contain the same DNA sequence. However, this sequence information alone is not enough to account for developmental changes that result in stable, disparate cellular identities with distinct functions. Consequently, the term epigenetics later evolved to encompass changes in gene function arising from heritable meiotic and/or mitotic processes, not explained by changes to the underlying DNA sequence3. Today, epigenetics has been refined to include factors acting upon the DNA, which are collectively termed the epigenome4. For example, the chromatin state, an important epigenetic feature controlling DNA accessibility, is influenced by chemical modifications to histones and by the addition of methyl groups to cytosine residues on DNA. The prerequisite for heritability as a defining feature of epigenetics is debated, despite mechanisms for the maintenance of epigenomic features existing for both histone modifications5 and DNA methylation6. Some chromatin modification marks, due to their transient nature, are not heritable but are involved in inducing cell type-specific changes in chromatin structure that are permissive to gene transcription7. Similarly, DNA methylation, which can influence gene expression, may be deposited de novo by some members of the DNA methyltransferase (DNMT) family8.
Neurons, for example, require epigenetic modifications such as DNA methylation for normal control of gene expression and proper functioning9, but these modifications are not heritable by virtue of the post-mitotic nature of neurons, which calls into question whether heritability is necessary for a mark to be considered epigenetic. Given these discoveries and the diverse mechanisms by which these modifications function and are perpetuated, a popular prevailing definition of epigenetics includes changes in gene expression not explained by changes to the underlying DNA sequence, and involving potentially stable marks or “the structural adaptation of chromosomal regions so as to register, signal or perpetuate altered activity states”, which may be heritable10.

Sculpting the epigenome landscape

The sculpting and maintenance of the epigenome is reliant on multiple biochemical processes that add or remove covalent modifications to DNA or histone proteins and that can induce changes to chromatin, the complex of DNA and protein11. These processes generate patterns of modifications that can permit or inhibit transcription, offering multiple possible layers of transcriptional modulation. Alteration of the accessibility of DNA is important for modulating the binding of factors that interact with DNA and affect genome activity, and can be influenced by alteration of modifications on the amino acid tails of histone proteins5. These alterations may be influenced by RNAs that recruit chromatin modifiers to loci bound by RNA transcripts12. Methylation of cytosine bases in DNA offers another biochemical signal that can influence transcriptional regulation, for example by inhibiting DNA polymerase activity or transcription factor binding at promoters or enhancers, or by influencing chromatin accessibility through recruitment of methyl-binding proteins and associated co-repressors13,14.

Analysis of the epigenome has revealed extensive modulation and dynamics of epigenetic modifications within organisms, in which unique patterns of chemical modifications of the genome occur by coordinated, precisely regulated mechanisms. Locus-specific15,16 and genome-wide interrogation experiments17–21 have delineated the histone and DNA methylation patterns in many organisms, tissues, cell types, and developmental stages. Deposition and maintenance of histone modifications or DNA methylation is achieved by specific ‘writer’ proteins, whilst ‘reader’ proteins recognise these marks in order to orchestrate transcriptional changes. However, the diversity of reader proteins has been challenging to characterise. More recently, high-throughput screens have been developed utilising techniques such as protein affinity screens coupled with mass spectrometry to identify novel proteins that recognise various epigenomic marks such as DNA methylation, DNA hydroxymethylation, or histone modifications22,23. These experiments serve as important initial identification screens to isolate candidate proteins with potential roles in reading the epigenome and altering genome conformation or activity.

The importance of epigenetic marks has been demonstrated by functional studies such as knockout experiments or by large-scale analysis of these marks in different experimental conditions. Some of these studies, for example, have contributed to our understanding of the epigenome at various stages of development such as within human embryonic stem cells and derived differentiated cells, or within the developing embryo24,25. Some well-defined processes that are reliant upon the epigenome include gene expression26, tissue differentiation25, X-chromosome inactivation27, imprinting28 and suppression of transposable elements29. Cell-type-specific gene expression changes, for example, have been attributed to cell-type-specific histone and DNA methylation patterns30,31. In addition, it is well established that high levels of DNA methylation32, deposition of specific histone marks33, and RNA-directed processes34 are responsible for X-chromosome inactivation and important in transposable element silencing29,35 and genomic imprinting28.

The spatial and temporal output of information encoded by DNA is therefore finely controlled by an interplay between histone modifications, DNA methylation and non-coding RNAs, and by the effect that these modifications have upon genome accessibility and the cellular factors that influence genome activity. These epigenome modifications may therefore be thought of as additional layers that are superimposed upon the genome, adding contextually appropriate information. Particularly pertinent to this thesis is the role of DNA methylation within mammals, and how interpretation of this mark, which is specifically added or removed at precise genomic elements, contributes to normal cellular function, and therefore healthy development and homeostasis.

I.2.1 Layers of the epigenome

Histone proteins, their variants, and their covalent modifications constitute a complex layer of the epigenome that is fundamental to nuclear processes including transcription, DNA repair and DNA replication36,37. These nuclear processes may be mediated in cis (local chromatin compaction) or in trans (higher-order arrangement of chromatin)38. Two copies each of histone proteins H2A, H2B, H3 and H4 are organised into an octamer that wraps ~147 bp of DNA in a left-handed superhelical conformation39,40 to form the nucleosome. Nucleosomes are joined by H1 linker proteins to form repeating structural units that are the main protein component of chromatin41. Changes in chromatin architecture that facilitate or inhibit access to DNA can be controlled by chemical modifications to histone tails42. The effects of some post-translational modifications (PTMs) on transcription are highly conserved throughout eukaryotes. For example, acetylation at specific residues on H4, deposited by enzymes originally discovered in Saccharomyces cerevisiae (S. cerevisiae) and Drosophila melanogaster (D. melanogaster), generally coincides with a chromatin state permissive to gene expression throughout eukaryotes43,44, referred to as euchromatin. On the other hand, deacetylation is generally associated with gene silencing owing to a more compact state, referred to as heterochromatin45. To date, more than 100 PTMs have been described, leading to the ‘histone code hypothesis’37, which proposes that histone modifications act sequentially or in combination to regulate transcriptional events by influencing local chromatin architecture. These modifications are deposited by complex, combinatorial mechanisms to coordinate specific transcriptional outputs11. An alternative hypothesis suggested that the plethora of PTMs works sequentially to provide stability and robustness to gene function and chromatin architecture, but disputed the combinatorial aspect of the histone code hypothesis based on a lack of evidence46. Instead, a signalling histone hypothesis was suggested, drawing from observations that chromatin architecture is robustly maintained by mechanisms similar to those of signalling networks within the cell. These include signal-receiving docking sites that occur as dense collections47,48 and evidence for feedback mechanisms49 that confer robustness or adaptability to chromatin architecture. Another explanation suggests that a layer of degeneracy, or redundancy, exists, based on observations that the same cellular outcome may be obtained by distinct modifications to histones50. This became apparent when genome-wide histone modification patterns were integrated with two RNA-seq datasets. This analysis demonstrated that different histone patterns resulted in similar transcriptional effects, highlighting a layer of degeneracy in the histone code51.

The combinatorial action of histone modifications and its effect on transcription remain a subject of debate. Biochemical analysis of some histone readers suggests that some proteins are capable of ‘multivalent’ modification binding, displaying increased affinity for histone substrates when two modifications are present compared to one52. In other cases, the binding of one histone reader may impede the binding of another53. Mass spectrometry (MS) has also been invaluable in identifying PTM readers associated with certain histone marks, and has identified numerous readers and writers with cooperative binding potential54,55. However, there is little biological evidence to suggest that histone modifications act via combinatorial means. Despite the theoretically large number of possible PTM combinations, genome-wide analysis of histone modification patterns reveals strong co-occurrence of different histone modifications, such that only a small fraction of the possible combinations is observed within the genome56,57. Nevertheless, chromatin immunoprecipitation sequencing (ChIP-seq) in many cell types has revealed spatially and temporally correlated combinations of histone marks that can reliably be used to annotate the genome58,59, and recent advances in single-cell analysis have confirmed that these marks are present within the same cell, rather than being a reflection of cellular diversity within a pool of analysed cells60. Addressing the combinatorial aspect of histone regulation requires biochemical characterisation studies, MS studies and genome-wide analyses to be conducted in parallel, which has made it challenging to decipher. Instead, the employment of these methods in separate studies has suggested that chromatin architecture is controlled by combinatorial mechanisms, and that some proteins may modulate the deposition or readout of these events52,53,61. A brief overview of PTM writers and readers within this process is given below.

Interrogation of PTMs and their functions has led to the identification of specific classes of enzymes that function as chromatin ‘writers’ and ‘readers’, regulating and interpreting the histone modification landscape within the cell. Writer proteins alter PTMs by the addition or removal of chemical groups, through modifications that include acetylation, methylation, phosphorylation, ubiquitination and sumoylation37. Reader proteins function to specifically read the PTM code and recruit epigenetic factors to defined PTM sites61. Lysine is a commonly studied histone residue because of its ability to undergo multiple mutually exclusive modifications, and it is associated with both gene activation and repression. Histone H3 lysine (H3K) methylation is generally associated with activation when lysine residues 4, 36 or 79 are methylated, while methylation of lysine residues 9 or 27 is typically associated with repression62. Deposition of H3K9 methylation, for example, is mediated by the writer Su(var)3-9 and drives the heterochromatin stabilisation in mice that is required for genome stability63. Numerous studies have established chromatin writers and readers as crucial intermediates in the cross-talk between chromatin and transcriptional events. For example, characterisation of the Polycomb group (PcG) revealed two distinct PcG subcomplexes responsible for writing and reading methylation of H3K27 to promote chromatin compaction64,65, resulting in transcriptional repression. Upon recognition of PTMs, chromatin readers or writers may recruit other cellular factors to modify transcriptional activity. A recent study demonstrated this concept by implicating the histone reader bromodomain protein ZMYND8 in transcriptional repression and DNA repair, through recruitment of the nucleosome remodelling and deacetylation (NuRD) complex66. Firstly, the histone writer KDM5A (an H3K4 histone demethylase) is recruited to transcriptionally active damaged loci marked by H3K4me3.
KDM5A then converts H3K4me3 to H3K4me1, providing binding sites for ZMYND8. ZMYND8 harbours a specific affinity for H3K4me1, but not H3K4me3, and is able to mediate repair of these loci by homologous recombination whilst simultaneously inducing their transcriptional repression via its association with NuRD66. This study illustrates both the specificity of histone writers and readers, and the complexity underlying their effects on transcription through association with other proteins and protein complexes. Whilst numerous histone readers and writers have been linked to specific effector-mediated transcriptional regulation, bulk characterisation of PTMs has revealed a variety of genome-wide patterns.

The characterisation of histone marks at certain genomic elements such as enhancers or promoters, or between cell types, has resulted in genome-wide chromatin state maps that provide rich sources of information complementing other genomic profiling techniques like RNA-seq. Enrichment of H3K4me3 signal by ChIP-seq, for example, is commonly observed on promoters of active genes15. Enrichment of H3K9me3, on the other hand, is often associated with silenced pericentromeric heterochromatin, and is implicated in impeding the reprogramming of cell identity by shielding lineage-related genes from activation by transcription factors (TFs)67. Whilst some histone marks and their PTMs have been well characterised and show defined correlations with transcription, by and large the interpretation of PTMs and their effects on transcription is not straightforward, sometimes producing disparate outcomes reflective of cell type-specific or developmental processes. The presence of other PTMs at sites marked by H3K4me3 and H3K9me3 may alter the genomic output in ways inconsistent with the general observations of these marks, resulting in a dynamism that is challenging to disentangle. For example, embryonic stem cells (ESCs) exhibit bivalent domains marked by H3K4me3 and H3K27me3 that correlate with regions of the genome that are transcriptionally silenced but poised for activation18. Current work is focused on unravelling the histone code in developmental and cell type-specific contexts, while substantial efforts are being made to understand the influence of conserved PTM inheritance processes in newly formed embryos68.
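The general correlations described above (H3K4me3 with active promoters, H3K9me3 with silenced heterochromatin, H3K27 methylation with repression, and co-occurring H3K4me3 and H3K27me3 with bivalent, poised domains) can be illustrated as a simple lookup. The following is a minimal sketch only: the function name `promoter_state` and the label strings are hypothetical, and real chromatin state maps are derived from quantitative ChIP-seq signal across many marks rather than from presence or absence of three marks.

```python
from typing import Set


def promoter_state(marks: Set[str]) -> str:
    """Return a coarse, illustrative chromatin-state label for one promoter.

    `marks` is the set of histone marks detected at the promoter.
    The rules are simplifications of the general correlations in the text.
    """
    has_k4 = "H3K4me3" in marks    # commonly seen at promoters of active genes
    has_k9 = "H3K9me3" in marks    # associated with silenced heterochromatin
    has_k27 = "H3K27me3" in marks  # associated with repression

    if has_k4:
        if has_k9:
            # Co-occurring marks with opposing associations are not
            # interpreted here; the text notes such cases are hard to disentangle.
            return "ambiguous"
        # H3K4me3 together with H3K27me3: silenced but poised for activation.
        return "bivalent" if has_k27 else "active"
    if has_k9:
        return "heterochromatin"
    if has_k27:
        return "repressed"
    return "unannotated"


if __name__ == "__main__":
    for example in ({"H3K4me3"}, {"H3K4me3", "H3K27me3"}, {"H3K9me3"}, set()):
        print(sorted(example), "->", promoter_state(example))
```

A rule-based lookup like this is intentionally rigid; it makes explicit why combinations of marks that deviate from the general pattern require cell type-specific and developmental context to interpret.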

RNA constitutes another layer of epigenomic regulation. Transcripts that do not encode a protein product are termed non-coding RNAs (ncRNAs) and are transcribed from a substantial proportion of the mammalian genome. Different ncRNAs are classified by their cellular roles, location, and most importantly, size. Small ncRNAs include small interfering RNA (siRNA), microRNA (miRNA) and piwi (P-element Induced WImpy testis in Drosophila)-interacting RNA (piRNA), whilst longer transcripts are termed long non-coding RNA (lncRNA)69,70. In humans, about 65% of genes are transcribed into short ncRNAs, whilst lncRNAs represent more than 20% of the human genome71,72. Short ncRNAs and lncRNAs are vital to various cellular processes, many of which rely on an association with epigenetic factors such as chromatin modifiers, implicating ncRNAs in the epigenetic regulation of gene expression73–75. The first described ncRNA-related epigenetic mechanism was the regulation of X chromosome inactivation by the master regulator X-inactive specific transcript (Xist). Xist promotes chromosome-wide heterochromatinisation and, among its many studied functions34, associates with the PcG chromatin remodelling complex Polycomb repressive complex 2 (PRC2), resulting in chromosome-wide H3K27 methylation of the inactive X chromosome76. PRC2 has since been implicated in RNA-directed chromatin remodelling through associations with about 20% of all large intergenic ncRNAs expressed in various cell types77. In this example, ncRNAs have been posited to participate in epigenetic regulation by acting as guides that recruit chromatin remodelling complexes to specific sites genome-wide. Other studies suggest that ncRNAs may act as decoy units that sequester chromatin-modifying enzymes, preventing their effects at certain loci27,76. For example, the lncRNA Tsix regulates X chromosome inactivation in an antagonistic manner to Xist by binding and sequestering PRC2 units, preventing their spread along the inactivated chromosome76. Another report has suggested that Xist and Tsix may form RNA dimers that are processed by the endoribonuclease Dicer, yielding siRNAs that coordinate epigenetic regulation of X chromosome inactivation via complex multifaceted mechanisms27. Lastly, characterisation of lncRNAs such as HOTAIR has revealed that lncRNAs may act as docking sites for protein complexes. HOTAIR originates from the HOXC gene cluster and has been implicated in genomic imprinting through trans-silencing of the HOXD cluster up to 40 kb away. Characterisation of HOTAIR revealed an association with lysine-specific demethylase 1 (LSD1) and CoREST complexes, suggesting that ncRNAs may act as scaffolds upon which other protein complexes mediate chromatin modifications in cis and trans78. Hotair gene knockouts in mice result in H3K4me3 gain and, to a lesser extent, H3K27me3 loss at HOX gene clusters and additional target genes79. The roles of lncRNAs as epigenetic effectors have also been highlighted through depletion of RNA processing units like Dicer or Argonaute, which leads to the accumulation of lncRNAs accompanied by abnormal histone architecture80,81.

D. melanogaster and C. elegans have made excellent model systems for identifying and characterising the functions of siRNAs and piRNAs82–85. The roles of siRNAs and piRNAs within the mammalian epigenome are not as well characterised; however, transcriptional silencing by siRNA has been linked to DNA methylation and histone modification processes in many cell types and disease states86,87. Evidence for the roles of piRNAs in coordinating epigenetic processes stems from their association with Piwi proteins. Piwi proteins bind to piRNA but are also direct regulators of PcG expression and possess the ability to bind various PcG proteins88,89. Whilst the mechanisms governing piRNA regulation remain poorly understood, a study has suggested piRNAs recruit heterochromatin protein 1 (HP1) to specific genomic loci, which results in RNA polymerase inhibition90. piRNAs have been proposed to tether the PIWI proteins MILI and MIWI2 to transposable elements (TEs), resulting in methylation of LINE and IAP transposons in the testis, whilst deletion of either protein results in aberrant transposon methylation. MIWI2 has also been suggested to have a regulatory role in methylation deposition within germ cells by acting upstream of DNMT3L91,92. The MIWI2 interactome in foetal gonocytes has recently been defined, identifying DNMT3A, DNMT3L and SPOCD1 as components of a repressive chromatin remodelling and de novo methylation apparatus required for young TE silencing93. Loss of Spocd1 in mice resulted in male infertility but did not affect piRNA biogenesis or MIWI2 targeting, indicating that de novo DNA methylation of TEs is driven by this protein and is crucial to the mouse genome DNA methylation state93. In summary, the role played by RNAs in the regulation and maintenance of the epigenome is becoming more apparent. However, further studies are required to understand and characterise these elements that encompass a substantial portion of the human genome.

Finally, one of the most extensively studied epigenomic marks is the covalent modification of cytosine bases at the fifth position in DNA, termed DNA methylation. Cytosine DNA methylation is the most studied covalent modification of DNA and is broadly present in eukaryotes including plants, animals, and fungi (excluding some notable examples)94. DNA methylation is most commonly found at CpG sites (CG methylation, mCG) within the genome, deposited symmetrically on both strands of the duplex DNA95. The connection between DNA methylation and gene repression dates back to 1982 when methylated DNA was observed as transcriptionally inactive when transfected into Xenopus oocytes and cultured mammalian cells96,97. Subsequent studies have revealed that its deposition and maintenance are tightly controlled processes fundamental to many biological processes95,98. Assays such as sodium bisulfite treatment of DNA coupled with developments in high throughput DNA sequencing now enable high resolution, single base resolution mapping of the sites of DNA methylation throughout entire genomes, termed the methylome. These have revealed that a significant fraction of genomic methylation patterns are similar throughout many cell types and developmental stages95,99. Exceptions to these observations arise in the formation of the germline and during preimplantation where progressive demethylation events occur in both parental genomes24,100,101. Additionally, bisulfite sequencing of cell populations and at single cell level has revealed the presence of differentially methylated regions (DMRs) in the genome that are important in maintaining cell type specific transcriptional signatures102,103. Comprehensive methylome characterization has also identified the presence of non-CG methylation within mammals in specific cell types, where the methylcytosine is followed by a base other than G (also called mCH, or CH methylation, where H=A, C or T). 
A prime example is within human and mouse brains, and in particular in neurons, where CH methylation is deposited after birth during synaptogenesis, showing enrichment at genes required for cell type-specific functions104. Despite our recent, improved understanding of CH methylation distribution and dynamics, many questions remain unanswered, such as the mechanisms underlying the deposition and removal of CH methylation, its readout and effect upon transcription or chromatin state, and its potential involvement in development and disease. In contrast, CG methylation has been extensively studied, owing to its widespread occurrence in the mammalian genome. As such, its deposition dynamics, correlations with transcription at defined genomic elements, and the readers and writers of this mark are well established, as detailed in the following section.
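The CG versus CH distinction described above, and the way bisulfite sequencing reports a methylation level, can be illustrated with a minimal Python sketch. The function names and the example sequence are hypothetical and chosen only for illustration; the sketch assumes a single strand read 5' to 3' and ignores strand symmetry, incomplete conversion, and sequencing error.

```python
def cytosine_context(seq, pos):
    """Classify the cytosine at 0-based index `pos` as 'CG', 'CH' or 'NA'.

    'CH' means the next base is A, C or T (H = A, C or T).
    'NA' is returned when there is no downstream base or it is ambiguous (e.g. N).
    """
    seq = seq.upper()
    if seq[pos] != "C":
        raise ValueError("position is not a cytosine")
    if pos + 1 >= len(seq):
        return "NA"
    next_base = seq[pos + 1]
    if next_base == "G":
        return "CG"
    if next_base in ("A", "C", "T"):
        return "CH"
    return "NA"


def context_counts(seq):
    """Count cytosines in each sequence context along one strand."""
    counts = {"CG": 0, "CH": 0, "NA": 0}
    for i, base in enumerate(seq.upper()):
        if base == "C":
            counts[cytosine_context(seq, i)] += 1
    return counts


def methylation_level(c_reads, t_reads):
    """Fraction of reads that stayed C after bisulfite conversion.

    Unmethylated cytosines are read as T after conversion, methylated ones as C.
    """
    total = c_reads + t_reads
    if total == 0:
        raise ValueError("no read coverage at this cytosine")
    return c_reads / total
```

For instance, `context_counts("ACGTCATCC")` separates the single CG cytosine from the two CH cytosines, with the final cytosine left unclassified because no downstream base is available.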

Features of DNA methylation in mammals

I.3.1 DNA methylation at CpG islands (CGIs)

Whole-genome bisulfite sequencing (WGBS) analyses have revealed that 60-80% of the methylated cytosines in human and mouse genomes occur in symmetric dinucleotides, with methylated cytosines adjacent to a guanine, known as CpG methylation, or methylation in the CG context (mCG)20,105. CGIs, named for their atypical enrichment in CG dinucleotide content, span on average ~1000bp and are commonly associated with gene promoters21. CGI characteristics are generally shared between humans and mice with respect to proximity to genes, suggesting a conserved role in gene regulation. Large scale DNA methylation analysis of human promoters using microarrays established that the majority of CGI promoters are hypomethylated and transcriptionally active, while genes with ‘low’ CG content promoters tend to be highly methylated106. It has also been demonstrated that CG density correlates positively with the H3K4me3 histone mark globally, suggestive of an interdependent mechanistic link between transcriptional activation and CGI patterning107, whilst low CG content promoters are prone to cell type-specific differences in promoter methylation levels108. When methylated, these genes show an inverse relationship between methylation levels and gene expression109. ‘Orphan’ CGIs, named so because they are not associated with canonical promoters, and instead lie downstream of the transcriptional start site (TSS) or within intergenic regions, encompass 50% of mammalian CGIs and exhibit transcriptional initiation properties. Unlike CGIs at known promoters, orphan CGIs tend to be methylated during development, resulting in transcriptional silencing, suggesting that orphan CGIs are implicated in more subtle regulatory functions110.
In contrast, recent CRISPR/Cas9 knock-ins of poised enhancers with or without orphan CGIs within mouse embryonic stem cells (mESCs) have demonstrated an ability for hypomethylated orphan CGIs to boost physical and functional communications between poised enhancers and distally located developmental genes111.
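The notion of "atypical enrichment in CG dinucleotide content" can be made concrete with a short Python sketch that computes GC fraction and the observed/expected CpG ratio for a sequence. The thresholds used here (length of at least 200 bp, GC fraction of at least 0.5, observed/expected ratio of at least 0.6) are the commonly quoted Gardiner-Garden and Frommer criteria; they are an assumption brought in for illustration and are not taken from this chapter, and the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence.

    Expected CpG count is (number of C * number of G) / sequence length.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    observed = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    expected = (c * g) / n
    gc_fraction = (c + g) / n
    obs_exp = observed / expected if expected > 0 else 0.0
    return gc_fraction, obs_exp


def looks_like_cgi(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply illustrative CpG island thresholds (assumed, not from the text)."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and obs_exp >= min_obs_exp
```

A sequence with few CpG dinucleotides relative to its C and G content gives a low observed/expected ratio, which is the typical state of bulk mammalian DNA outside CGIs.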

Long term stable gene repression is commonly associated with methylated CGI promoters, and is partly responsible for the generalised notion that promoter mCG causes gene silencing. For example, DNA methylation correlates with repression of genes residing on the inactive X chromosome112, genes expressed within germ cells113, and at CGI promoters in somatic cells, which is thought to contribute to the maintenance of transcriptional silencing. Transcriptional control is mediated by complex regulatory interactions between chromatin state and cross-talk with transcription factor binding. DNA methylation is thought to sterically hinder transcription factor binding, preventing transcriptional initiation, or alternatively, promote binding of transcription factors with specialised mCG binding domains that themselves may promote gene repression. One mechanism involves proteins binding to mCG resulting in the recruitment of chromatin modifying enzymes to remove activating marks like H3K4me3114. Despite a range of established examples of mCG-driven transcriptional repression, much still remains unknown. One such question is whether DNA methylation is a cause or effect of gene silencing. In an effort to explore this, analysis of DNA methylation at the Hprt gene on the inactive X chromosome showed that DNA methylation was only deposited after the gene had been silenced115. This indicated that genes on the X chromosome were suppressed by other cellular mechanisms and that DNA methylation served to reinforce inactivation. However, other studies suggest nucleosome presence and specific histone modifications may deter or attract proteins like DNA methyltransferases, providing the necessary environment for de novo methylation116–118. Current evidence suggests that mCG can either be the cause of transcriptional repression, or be deposited to reinforce and stably maintain a repressed state. This has recently been more directly observed and tested in targeted epigenome editing experiments employing DNA methyltransferase effectors coupled to targeted DNA binding proteins such as dCas9119,120. Whilst many studies have conducted locus-specific DNA methylation targeting experiments, the employment of epigenetic targeting tools to induce methylation on endogenous reporters genome-wide is only beginning to be investigated. In one study, induced promoter methylation was not sufficient to repress transcription at many active promoters, suggesting DNA methylation is not always sufficient for transcriptional silencing and can be reliant upon other factors for stable gene repression at many loci121,122.
Current evidence suggests that DNA methylation frequently may not function as the primary molecular mark directing gene silencing and chromatin reconfiguration, but may require additional epigenetic machinery and changes to achieve stable repression123. Epigenome editing studies that build upon the current body of work will be key in unravelling the cause and effect of DNA methylation and establishing the circumstances that require long term stabilization of active genes.

I.3.2 DNA methylation at enhancers and intergenic regions

Enhancers are situated at variable distances upstream or downstream of a gene’s transcriptional start site and control its expression through DNA looping. Most enhancers contain low to intermediate CG levels and exhibit dynamic methylation levels between different cell types and states139. Enhancer regions that are in an active state generally display an intermediate or low level of DNA methylation, which may be explained by several biological processes including deposition and removal of mCG that may be in constant competition with each other, or inefficient maintenance of mCG through cell divisions124. However, cellular heterogeneity is thought to be a primary contributor to these ‘low-methylated regions’ (LMRs), a product of averaging the methylated and unmethylated binary states of cytosine105, which has recently been demonstrated through single-cell WGBS experiments125. DNA methylation at enhancer and distal regulatory elements has been implicated in determining cell-type identity. For example, different subsets of T cells contain large numbers of DMRs in enhancers associated with genes required for T cell differentiation. Differential methylation of these enhancers was subsequently shown to affect enhancer activity in reporter assays126. The capacity of TFs to access enhancer regions associated with transcription of genes involved in cellular differentiation is one mechanism of control used by the cell to regulate differentiation. The introduction of glucocorticoids in mammalian cells has been observed to cause demethylation of distal regulatory elements, and the methylation state of these elements can affect glucocorticoid receptor binding127. These observations also suggest that enhancer methylation is a dynamic process and may be modulated by passive or active processes to achieve normal cellular division and differentiation processes.
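The averaging argument for LMRs can be shown with a minimal Python sketch: each cell contributes a binary state (methylated or unmethylated) at a given cytosine, and a bulk measurement reports the mean across cells. The function name and the example population are hypothetical, and the sketch assumes equal sampling of every cell.

```python
def bulk_methylation(cell_states):
    """Mean methylation at one cytosine across a population of cells.

    `cell_states` holds one value per cell: 1 if methylated, 0 if unmethylated.
    """
    if not cell_states:
        raise ValueError("at least one cell is required")
    if any(state not in (0, 1) for state in cell_states):
        raise ValueError("states must be binary (0 or 1)")
    return sum(cell_states) / len(cell_states)


# A mixed population: 3 cells methylated and 7 unmethylated at the same site.
mixed_population = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
bulk_level = bulk_methylation(mixed_population)
```

In this toy population no individual cell is 30% methylated, yet the bulk value is intermediate, which is why single-cell data are needed to separate heterogeneity from genuinely partial methylation.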

Over half of the human genome consists of repeat DNA sequences128, commonly referred to as repeat elements, from which two classifications arise: tandem repeats (TR) and TEs. TRs exist as repeat sequences within satellite DNA in centromeric or pericentromeric regions of eukaryotic chromosomes129. Heterochromatin associated proteins, RNA interference (RNAi), and DNMTs constitute epigenetic factors that ensure silencing of centromeric and pericentromeric satellite DNA, ensuring proper chromosome function16,130. In line with this hypothesis was a notable observation in which DNMT3B-null HeLa cells exhibited hypoacetylation of minor satellite repeats that correlated with chromosome mispairing and delayed sister chromatid segregation131. As for TEs, their intergenic methylation constitutes a large fraction of the methylation observed in mammalian genomes132. TEs consist of three major classes: long interspersed nuclear repeats (LINE), short interspersed nuclear repeats (SINE) and long terminal repeats (LTR). LINE and LTR elements contain promoter like activities that are hypermethylated to prevent their expression and promote genome stability133.

With the exception of CGIs, most of the CpG dinucleotides in the mammalian genome are methylated134. WGBS analysis between individuals has revealed that differences in methylation patterns are greater between different tissues from the same individual than between individuals135. The conservation of DNA methylation between individuals suggests its precise deposition and removal are finely controlled processes required for development. The dynamism observed between tissues, especially within gene coding and intergenic regulatory regions of the genome, suggests it plays important cell-type specific regulatory roles. The function of DNA methylation at genomic elements such as promoters106, enhancers136, and insulators137 has been extensively studied, and it is often associated with the suppression of transcription or the occlusion of DNA binding proteins at these elements, or simply the binding of proteins to regulatory regions of DNA. Indeed, the binding of some TFs at distal regulatory elements influences DNA methylation levels and can be used to identify active regulatory regions105. These TFs are important in the maintenance of cell identity and in reprogramming of these regions by inducing active demethylation and facilitating necessary changes in gene expression105,138.

I.3.3 DNA methylation and alternative gene splicing

DNA methylation within gene bodies has attracted great interest because of its apparently paradoxical role in promoting transcription139. This has prompted investigation of this mark in the regulation of transcript processing. Transcripts produced by RNA polymerase II (RNAP II) undergo multiple processing steps including capping, splicing, 3’ processing and polyadenylation, before being exported to the cytoplasm. Alternative splicing increases the coding capacity of the genome and is crucial for normal cellular functioning. Splicing events are coupled to transcription by the recruitment of proteins with specialised RNA processing functions140,141. RNAP II binding kinetics are fundamental to splicing events, providing opportunities for binding of both positive and negative splicing factors142,143, whilst RNAP II pausing is enriched at splice sites144. Several histone modifications influence splicing by altering chromatin architecture, indirectly affecting RNAP II elongation and/or pausing145,146, whilst other modifications are involved in the direct recruitment of splicing factors12. The roles of DNA methylation in splicing are less well understood. Genome-wide analyses reveal higher levels of DNA methylation at exons compared with introns147,148, whilst exons generally excluded by splicing machinery have lower levels of DNA methylation than do constitutively included exons149. The mechanisms linking methylation with splicing remain largely unknown. The effects of intragenic DNA methylation on splicing have been investigated for CCCTC-binding factor (CTCF) within naive lymphocytes, where CTCF is involved in the splicing of exon 5 within the CD45 gene. In the absence of intragenic DNA methylation, CTCF binding promoted the inclusion of upstream exons through the mediation of RNAP II pausing150.
An interesting aspect of splicing, pertaining to this thesis, is the action of methyl readers at positions important for splicing and the downstream recruitment of splicing factors. The methyl-binding domain proteins (MBDs) are well-established recruiters of histone-modifying enzymes, whose recruitment may provide indirect association with, or inhibit, splicing factor recruitment. MECP2 for example, directly binds to alternatively spliced exons, whilst its overexpression or inhibition alters splicing patterns151. MECP2 also associates with splicing factor Prpf3 and SWI/SNF, which regulates RNAP II elongation152. The comprehensive identification of DNA methylation readers and their characterisation, coupled with RNA-sequencing approaches, will be important experiments to conduct in order to ascertain the potential roles of DNA methylation in transcript regulation153.

I.3.4 CH methylation

Until the development of high throughput DNA sequencing, limitations in technology restricted the measurement of DNA methylation throughout the genome to the CG context, or to only relatively small portions of the genome. For example, commonly used techniques relied upon methylation-sensitive restriction enzymes that specifically recognise mCG, or affinity enrichment assays utilising proteins or antibodies with known mCG binding capabilities106,154,155. The development of high throughput DNA sequencing technologies has overcome the low resolution and low coverage caveats associated with prior methodologies that enriched for CpG rich regions of the genome154,156, enabling reliable detection of methylcytosine at single base resolution, including DNA methylation that occurs outside of the canonical CpG sequence context19,20,157. In comparison to CG methylation, the genomic distribution and functional implications of CH methylation remain a relatively new area of investigation. CH methylation, or mCH, is a non-symmetric DNA methylation mark prevalent in mammalian pluripotent stem cells and neurons20,104,158. Motif analysis of CH methylation reveals a preference for mCAG motifs in ESCs and mCAC in neurons, suggesting CH methylation governs specific processes in the two cell types20,104. Additionally, neurons and pluripotent stem cells display distinct mCH patterning that correlates with cell-type specific transcriptional regulatory processes. The first genome-wide single-base methylome identified that mCH accounts for nearly one-quarter of all methylation events present in ESCs. mCH density was enriched in gene bodies and positively correlated with transcript abundance, but was depleted in protein binding sites and enhancers. Furthermore, mCH disappeared upon differentiation and was restored in induced pluripotent stem cells, indicating that mCH is a common property of pluripotency20.
CH methylation has since been identified as a biomarker for endodermal differentiation in a study that correlated lower mCH levels with reduced differentiation capacity159. Unlike within ESCs, gene body CH methylation in the brain displays an inverse correlation with transcription104. Initially low at birth, mCH levels rise, coinciding with synaptogenesis, to become the dominant form of methylation in adult human neurons. Glia display very low mCH levels, but exhibit specifically localized hypermethylated CH at a set of genes that are active and CH hypomethylated in neurons, suggesting that it may silence specific neuronal genes in the glial genomes104. Analysis of mCH marks within brain has revealed associations with various brain disease pathologies including Alzheimer’s disease160,161. The cell-type-specific and developmentally dynamic nature of mCH during brain development also suggests a role in neurodevelopment. Mecp2, a known mCH binder162, and a protein whose disruption is causative in Rett Syndrome163, has been observed to bind highly methylated long genes within the brain and recruit the corepressor (NCoR) complex to disrupt transcriptional initiation of these genes164,165. Part of this thesis addresses the roles of mCA binding proteins in the mammalian brain and, as such, will be discussed in more detail in Chapter 5.

Writing, maintenance, and removal of DNA methylation in mammalian genomes

I.4.1 Writers of DNA methylation

The maintenance of DNA methylation states at defined genomic loci is reliant upon proper deposition and removal processes. Cytosine methylation is achieved through the covalent transfer of a methyl group from S-adenosyl methionine (SAM) to the fifth carbon of cytosine. Within mammals, members of the DNMT family use similar catalytic mechanisms characterized by the formation of a covalent reaction intermediate between enzyme and the substrate base166. These proteins, first observed within bacteria as part of their restriction-modification systems, appear to have been transferred to eukaryotes on more than one occasion167,168. DNMT1 was first identified based on the homology of the catalytic motif that is highly conserved in bacterial cytosine-5 DNMTs169. The human genome encodes five DNMT proteins: DNMT1, DNMT3A, and DNMT3B catalyse the addition of methyl groups onto genomic DNA8,170, whereas DNMT2 methylates tRNA171,172, and DNMT3L does not possess catalytic activity but stimulates de novo DNA methylation through association with DNMT3173. Early work established that each DNMT member has specific developmental expression patterns and specific activities in vitro. DNMT1 is ubiquitously expressed whilst particular isoforms of DNMT3A and DNMT3B enzymes are highly expressed in early embryos and germ cells where active de novo DNA methylation takes place. In comparison, somatic tissues contain lower DNMT3 levels whilst a catalytically inactive isoform of DNMT3B is expressed, influencing the activity of other DNMT3s8,174–176.

Biochemical analysis revealed that DNMT1 exhibited a high preference for hemimethylated DNA, followed by CG sites within the genome, and had almost no activity at CH sites177,178. Within the cell, DNMT1 localised to replication foci in S phase, and with heterochromatin in late S and G2 phases179–181. These studies and numerous others have implicated DNMT1 in ensuring proper maintenance of DNA methylation patterns through cell divisions. Dnmt1-deficient mice exhibit embryonic lethality at E9.5, prior to the 8-somite stage, and display neural tube defects182,183. Inactivation of the catalytic domain of Dnmt1 resulted in similar developmental defects, suggesting it functions as a DNA methylase that is critical to development184. Recombinant Dnmt3a and Dnmt3b proteins were able to methylate unmethylated DNA, whilst inactivation of both enzymes affected de novo DNA methylation but had no effect on the maintenance of imprinted genes8,185. Recent studies have shown DNMT3B associates with the histone mark H3K36me3, DNMT3L, and H3K4, indicating mechanisms by which selective recruitment of DNMT3 enzymes may be achieved186,187. The distinct patterns of CH methylation in ESCs and neurons are potentially due to differential expression of DNMT3A and DNMT3B in these cell types. ESCs contain high levels of DNMT3B, which preferentially interacts with H3K36me3188, resulting in gene body hypermethylation at CAG DNA sequences in actively transcribed genes189. In addition, mouse190 and human191 ESCs deficient in both DNMT3A and DNMT3B have significantly reduced mCA levels, indicating that these enzymes may be responsible for mCA deposition. Dnmt3a deficiency in mice is fatal around 4 weeks after birth, whilst deficiencies in Dnmt3b are lethal at E9.5-E10.5, with embryos exhibiting growth impairments and neural tube defects8.

Despite clear sequence conservation of DNMT enzymes, DNMT2 and DNMT3L are evolutionary adaptations of the original DNMTs192. Several DNA methylation deficient eukaryotic organisms possess DNMT2 as their sole DNMT-like protein, suggesting this protein has alternative roles to DNA methylation193. Biochemical and genetic approaches revealed that human DNMT2 does not methylate DNA but localises to the cytoplasm and methylates small RNAs, playing roles in genome protection by regulating the activity of RNA viruses and retrotransposons171. Additional domains present within DNMTs suggest these proteins also participate in transcriptional and post-transcriptional processes, for example by actively participating in gene repression through associations with zinc fingers194, histone deacetylases195, or histone methyltransferases196. Dnmt3a has been observed to regulate enhancers by maintaining DNA hydroxymethylation levels through association with p63, a process required for adult human epidermal stem cell functioning197. The roles of DNMT proteins, whilst primarily studied in the context of DNA methylation, extend beyond it, and include a diverse array of critical regulatory cellular processes.

I.4.2 Erasure of DNA methylation

Passive and active DNA demethylation are critical processes required for dynamic DNA methylation patterning, for example in pluripotent cells within the early embryo, or in erasing parental origin-specific imprints in primordial germ cells94. Dilution of DNA methylation through cell division in the absence of DNMT1/UHRF1 results in progressive passive demethylation, demonstrated within the mouse maternal genome where DNA methylation was lost passively through successive rounds of cell division28,198. Active DNA demethylation utilises specific proteins with deamination, oxidation, or DNA repair properties. Deamination of methylated cytosine by AID (activation-induced deaminase) / APOBEC (apolipoprotein B mRNA-editing enzyme complex) produces thymine, resulting in a thymine-guanine mismatch. DNA repair proteins, like MBD4, an enzyme active in the base-excision repair (BER) pathway, are then recruited to replace the thymine with unmethylated cytosine199,200. Whilst APOBEC proteins possess the potential for DNA demethylation, their role in mediating active DNA demethylation in vivo remains unknown and controversial for a variety of reasons. First, AID/APOBEC activity is much more efficient with ssDNA as a substrate and exhibits substantially higher affinity for unmethylated cytosines relative to methylated cytosines201. Second, APOBEC proteins are confined mainly to muscle and heart tissue, suggesting that these proteins are unlikely to be the principal mediators of active DNA demethylation. Furthermore, deamination of both methylated and unmethylated cytosines is mutagenic and potentially harmful to the cell, making this an unlikely mechanism for DNA demethylation202. Aid/Apobec-driven DNA demethylation has been supported136 and later challenged203 by studies in zebrafish. The roles of AID/APOBEC in DNA demethylation have also been investigated in mammals, producing results that support204–206 or counter207,208 their proposed roles in DNA demethylation.
For example, one study204 employed siRNA knockdown of AID, which affected demethylation levels of pluripotent genes in human fibroblast and mouse ESCs, perturbing their expression. ChIP-seq analysis also identified that AID preferentially bound methylated promoters over unmethylated promoters, and in particular, promoters that were subject to DNA demethylation during reprogramming204. Aid-deficient mice contain three times lower DNA methylation levels than wild-type controls, while residual DNA methylation levels within treated samples suggest a layer of redundancy or external demethylation mechanisms exist within mammals209.
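The passive route described at the start of this section, in which methylation is diluted through cell division when maintenance is absent, can be sketched numerically in Python. This is an idealised model under stated assumptions: maintenance activity is completely absent, no de novo methylation or active removal occurs, and every newly synthesised strand is unmethylated, so the fraction of methylated strands halves at each replication. The function name is hypothetical.

```python
def passive_dilution(initial_fraction, divisions):
    """Fraction of methylated strands remaining after replication without maintenance.

    Idealised assumption: each round of replication pairs every parental strand
    with a new, unmethylated strand, halving the methylated fraction.
    """
    if not 0.0 <= initial_fraction <= 1.0:
        raise ValueError("initial_fraction must lie between 0 and 1")
    if divisions < 0:
        raise ValueError("divisions must be non-negative")
    return initial_fraction * 0.5 ** divisions


# Methylated fraction over the first four divisions, starting fully methylated.
trajectory = [passive_dilution(1.0, n) for n in range(5)]
```

Under these assumptions the methylated fraction falls to 6.25% of its starting value after four divisions; real loci deviate from this whenever maintenance is only partially impaired or active demethylation also contributes.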

The Ten Eleven Translocation (TET) enzymes facilitate DNA demethylation through iterative oxidation of 5-methylcytosine (5mC), yielding 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC), and 5-carboxylcytosine (5caC) derivatives210. The TET-driven oxidised products 5fC/5caC may be passively removed through cell division211, or excised by DNA glycosylases such as thymine DNA glycosylase (TDG) and repaired by BER212. TET1 was initially discovered as a fusion partner of the H3K4 methyltransferase MLL1 in acute myeloid leukemia213. Its overexpression in cultured cells resulted in a global reduction in 5mC levels214, whilst recombinant expression and biochemical analysis of TET1 (and in subsequent studies, TET2/3) revealed an affinity for 5mC that was catalytically oxidised to 5hmC215. TET proteins are involved in multiple biological processes critical to stages of embryonic development and in more generalised gene regulatory functioning. The erasure of methylation marks in the germline and developing embryo resets the embryo for the establishment of DNA methylation216. In contrast to the maternal genome, which is demethylated by passive processes, the paternal genome relies on passive and active DNA demethylation mechanisms. Tet3 is highly expressed in the mouse oocyte and zygote until the 2-cell stage. Initial studies concluded that Tet3 was responsible for converting the majority of DNA methylation in the male pro-nucleus to 5hmC, which is subsequently replaced by cytosine through passive DNA demethylation processes217–219. However, a recent study has shown that Tet3-driven 5hmC is not required for the loss of paternal 5mC, and that the loss of 5mC and accumulation of 5hmC are temporally disconnected220. The second wave of DNA demethylation takes place in primordial germ cells (PGCs). Tet1 and Tet2 are expressed in mouse PGCs at E9-E11, concomitant with a rise and then dip in 5hmC levels. Genetic studies have revealed that Tet1 and Tet2 deficient mice display an aberrant DNA demethylation pattern and altered expression in genes relating to meiotic and imprint functioning in PGCs221,222.
Conversely, TET knockouts in human ESCs are mostly viable and display less severe developmental phenotypes in comparison to DNMT knockouts, reflective of a more complex DNA demethylation network with some redundancy. Tet1 knockouts exhibit reduced postnatal body size with mild developmental delays221. Tet2 knockouts have increased hematopoietic stem cell self-renewal223, while Tet3 knockouts arrest at E11.5218. In mouse ESCs, Tet1/2 double-knockouts and Tet1/2/3 triple-knockouts show drastically depleted 5hmC levels and increased DNA methylation levels at enhancers224,225. The TET proteins also show differential expression in postnatal development with capacities to influence DNA methylation levels that are important to normal cellular functioning. The mammalian adult brain, for example, contains high levels of 5hmC required for normal brain functioning. The TET proteins are expressed in a cell-type-specific manner within the mammalian brain, and in coordination with DNA demethylation events104,226. Tet2/3 deficiencies result in differentiation defects, whilst their overexpression stimulates neurogenesis227. Within the adult mouse brain, Tet1 deficiencies may cause defects in spatial learning and short-term memory228. Genome-wide analyses in these studies revealed that numerous genes are hypermethylated within their promoter, while another study revealed perturbation in neuron-specific genes229. These studies highlight an important developmental role for TETs in DNA methylation processes required within the embryo and within somatic cells.

While ChIP-seq and gene knockout studies have provided insights into the pathways and potential functions of TET enzymes, multiple methods that exploit specific chemical properties of cytosine have been used to map the distribution of modified cytosines across the genome. Genome-wide sequencing of 5mC and its oxidised derivatives offers unbiased approaches to investigate DNA methylation and demethylation patterns and dynamics. The utilisation of these approaches, each with their strengths and weaknesses, has helped shape our understanding of each modification at defined genomic elements across development230. For example, within mouse ESCs, affinity-based methods showed that 5hmC is enriched at intermediate or low CGIs and that 5fC/5caC accumulate at transcriptionally inactive or poised promoters231. Other studies within mouse ESCs have observed enrichment of 5hmC within intragenic regions, especially within 3’ ends, a hallmark of active transcription232,233, while enrichment of 5hmC and 5fC at tissue-specific enhancers marks these loci as being in a ‘poised’ state234,235. Within the human brain, 5hmC signatures have been associated with gene expression in inhibitory neurons that differ significantly from excitatory neurons160. Correlations of 5fC, 5caC, and 5hmC with gene regulatory processes, such as those within the mouse ESC studies discussed above and within the brain160, have stimulated a recent wave of research into identifying and characterising novel DNA-binders and protein interactors that bind to these marks. The utilisation of DNA pull-downs coupled with Mass Spectrometry (MS) has been invaluable in identifying and characterising novel readers of demethylation processes22. For example, Spruijt et al. (2013) characterised UHRF2 as a 5hmC binder in neural progenitor cells, and identified numerous potential candidates with specific affinity for 5fC and 5caC. These proteins included TDGs, transcription factors, and chromatin regulators, prompting investigations into the effects of these identified readers on 5hmC and its derivatives22,236. Advances in mass spectrometry and DNA pull-down approaches have also been adopted for high throughput identification of DNA methylation readers, which, as discussed below, have greatly enhanced our understanding of TF binding dynamics and the roles of DNA methylation.

Readers of DNA methylation

I.5.1 The MBD family

The identification of mCG readers dates back to 1989, when mouse protein extract was incubated with methylated DNA. This led to the discovery of a 120 kDa protein complex called the MeCP1 complex, composed of MBD2 and NuRD237. However, the first protein directly proven to bind mCG was MECP2238. Amino acid sequence analysis identified a core 70 amino acid domain with an affinity for mCG, called the Methyl Binding Domain (MBD), and a transcriptional repression domain (TRD)239. Information from expressed sequence tag databases coupled with genome sequencing databases later revealed five additional proteins with highly homologous motifs resembling the MBD found in MECP2, appropriately termed the MBD family13. The MBD family represents the first identified protein family to ‘read’ mCG (Figure 1.1). Subsequent research efforts successfully characterised the mCG binding capabilities of each MBD member within human and mouse. Binding to mCG was validated by gel shift assays, in which each recombinantly expressed MBD member was incubated with methylated and unmethylated DNA. These assays revealed that MECP2, MBD1, MBD2, and MBD4 proteins specifically bound mCG, whilst MBD3 bound both methylated and non-methylated DNA in a non-specific manner13. Recently discovered members, MBD5 and MBD6, appear not to bind methylated DNA240.

MECP2 is the best-studied MBD member and has two isoforms, both of which are ubiquitously expressed. MECP2a is predominantly expressed in placenta, liver, and skeletal muscle, whilst MECP2b is more abundant within the brain, especially within neurons, at levels almost 1:1 with nucleosome levels241–243. Since its initial discovery as the first mCG reader, extensive biochemical and genomic characterisation has revealed additional roles in chromatin regulation, hmC recognition, and regulation of splicing244,245. The MBD of MECP2 shows a strong selective affinity for dsDNA containing symmetrically methylated CG dinucleotides in vitro238,239. Early in vivo analysis in mouse revealed Mecp2 occupancy at heterochromatic foci that are GC rich and heavily methylated. The study also established the MBD as the crucial determinant of in vivo binding246. Subsequent to these initial studies, MECP2 was also observed to associate with unmethylated DNA, for example at Slc6a2, an unmethylated CpG island promoter, or at the unmethylated maternal H19 gene247,248. In addition, MECP2 contains a highly basic N-terminal domain and A/T hook motifs that associate with the minor groove of AT-rich duplex DNA249. In line with these observations, numerous biochemical studies have concluded that MECP2 is also capable of binding unmethylated DNA249–252. Importantly, however, these studies conclude that although MECP2 binds unmethylated DNA, it binds mCG with ~3-fold higher affinity253,254. These findings mirror the in vivo binding dynamics observed for MECP2. Various ChIP-seq experiments have confirmed that MECP2 demonstrates locus-specific, methylation-dependent binding, whilst genome-wide binding analyses revealed that mCG is the primary determinant of MECP2 binding, showing a linear dependence on local mCG density29,255,256. Despite exhibiting affinity for unmethylated DNA, the primary role of MECP2, demonstrated by various biochemical and genome binding analyses, is the recognition of mCG, which is the primary driver of MECP2 localisation.
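
The roughly threefold preference for mCG described above can be put into perspective with a simple single-site equilibrium binding model. The sketch below is illustrative only: the function name, the dissociation constants, and the free protein concentrations are hypothetical values in arbitrary units chosen for this example, not measurements from the cited studies, and the model ignores cooperativity, chromatin context, and competition.

```python
def fraction_bound(protein_conc, kd):
    """Fractional occupancy of a single site: theta = [P] / ([P] + Kd)."""
    if protein_conc < 0 or kd <= 0:
        raise ValueError("protein_conc must be >= 0 and kd must be > 0")
    return protein_conc / (protein_conc + kd)


# Hypothetical dissociation constants (arbitrary units), assuming only that
# the unmethylated site binds with 3-fold lower affinity than the mCG site.
KD_METHYLATED = 1.0
KD_UNMETHYLATED = 3.0 * KD_METHYLATED

for conc in (0.1, 1.0, 10.0):
    meth = fraction_bound(conc, KD_METHYLATED)
    unmeth = fraction_bound(conc, KD_UNMETHYLATED)
    # The occupancy ratio is largest when protein is limiting.
    print(f"[P]={conc}: mCG={meth:.2f}, CG={unmeth:.2f}, ratio={meth / unmeth:.2f}")
```

Under these assumptions, the occupancy ratio between methylated and unmethylated sites approaches the threefold affinity ratio only when protein is limiting, and shrinks towards 1 as both classes of site saturate.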

Whilst initial studies focused on mCG binding, MECP2 has also emerged as a protein with multifaceted roles. For example, it functions in 5hmC recognition245 and in mediating global chromatin structure29,243. Early in vitro studies reported that the MBD family exhibited no affinity or non-specific affinity for 5hmC257,258. Neurons accumulate 5hmC to levels ~10-fold higher than in ESCs concomitant with MECP2 expression, prompting a more comprehensive investigation of MECP2-5hmC binding164,259. Indeed, MECP2 was identified as the sole 5hmC candidate in a DNA pull-down approach using 5hmC probes in rodent brain protein extracts. A direct affinity for 5hmC was later confirmed by EMSA analysis, demonstrating a specific interaction with 5hmC260. A conflicting study also observed that MECP2 bound 5hmC, but concluded that MECP2 had high affinity for hmCA only, and that its affinity for hmCG resembled its affinity for unmethylated DNA164. The conflicting conclusions of these studies have since been reconciled. The probes used in the Mellén et al. study were chosen based on certain genomic loci, and PCR amplified in a way that incorporated hmCG but, inadvertently, also hmCA260. Furthermore, each probe had an underrepresentation of CG, explaining the observed high affinity hmC interaction260. It is now established that MECP2 binds specifically to hmCA but the presence of hmCG does not appear to inhibit binding. The biological implications of these biochemical findings remain unclear within the brain, especially because hmCA exists at very few regions within the mammalian brain104. Structural analysis revealed that MECP2 participates in complex DNA interactions, forming DNA loops and inducing close spatial arrangement of two DNA molecules through two separate binding surfaces261. Therefore, it is unsurprising that in addition to gene-specific transcriptional control, MECP2 also regulates global chromatin architecture.
In vitro studies have shown that high levels of MECP2 can form and stabilise nucleosomal arrays with mC or unmethylated DNA, although mC appears to further stabilise this interaction250. Studies within neurons have demonstrated that a functional MBD is required to reinforce and stabilise chromatin clustering and chromatin residence time250,262, whilst in the absence of mC, binding is re-distributed to regions with increased DNase hypersensitivity, marked by H3K4me1 and H3K27ac256. Chromatin looping alters the 3D landscape of chromatin and involves a structural rearrangement of distant loci. In vitro studies have demonstrated that MECP2 produces chromatin looping by homodimerisation261. A complex mechanism was reported whereby, within methylated imprinted loci, MECP2 was shown to induce chromatin looping through recruitment of ATRX and subsequently members of the cohesin complex and CTCF263. Chromatin looping and subsequent histone modifications led to the silencing of the imprinted region, which was lost in MECP2 mutant cells due to a loss of MECP2-ATRX formation264. A similar observation was made at a different set of imprinted loci, Dlx5 and Dlx6, within MECP2 knockout mice265.

MBD1 is the largest protein in the MBD family that binds a symmetrically methylated CG dinucleotide sequence13. In addition, MBD1 contains a TRD involved in protein interactions that drive MBD1’s repressive capabilities, and is the only MBD member that possesses zinc coordinating CXXC domains266,267. Alternative splicing in humans and mice produces variants with two or three CXXC domains268,269. Interrogation of each CXXC domain by in vitro band shift analysis revealed that CXXC-1 and CXXC-2 are unable to bind DNA, whilst the CXXC-3 domain has affinity for unmethylated probes containing CG repeats270. In the absence of an MBD, the CXXC-3 domain efficiently localises to non-methylated CG-dense pericentric heterochromatin. This interaction was demonstrated in Dnmt1-deficient cells, whilst the CXXC-1 and CXXC-2 domains were unable to target Mbd1 to methylated or unmethylated DNA in vivo270. Analysis of the MBD within MBD1 has demonstrated that it functions in driving MBD1 localisation to methylated DNA, and is primarily responsible for its regulation of target genes. Despite its ability to bind to DNA, the CXXC-3 domain is largely dispensable and more likely involved in protein stability upon MBD binding256,266. Knockout of MBD1 in mice is not embryonically lethal but affects neural stem cell differentiation and leads to autism-like defects including reduced social interaction, learning deficits, anxiety, abnormal brain serotonin activity and defective sensorimotor gating271–273. In human disease, MBD1 is implicated in autism spectrum disorders and a range of cancers including lung, endometrial, and pancreatic cancer274–276.

MBD2 has been well characterised, and binds specifically to mCG with no appreciable affinity for hmC22,277. Three isoforms of MBD2 have been characterised, with distinct expression profiles and biological functions that are conferred largely by the inclusion or exclusion of various domains. MBD2a and MBD2b are ubiquitously expressed, while the third isoform, MBD2c, is specific to testis and ESCs13. MBD2a and MBD2b contain an MBD and transcriptional repression domain (TRD), whilst the inclusion of an RG-rich N-terminal domain in MBD2a contributes to distinct protein functionality. MBD2c, on the other hand, possesses no TRD but retains MBD functionality. Regardless of which isoform is analysed, genomic binding patterns of MBD2 resemble those of MBD1 in that it broadly associates with densely methylated regions, with no detectable sequence specificity but with DNA binding that correlates with CG DNA methylation density256. Analysis of the genomic elements bound by MBD2 reveals enrichment within highly methylated TSS, promoters, and exons278. Knockout of MBD2 in mice is not embryonically lethal, but neurological behavioural assessment experiments have observed deficits in pup nurturing and nesting behaviour, hypoactivity, and low body weight279,280. The lack of a more severe phenotype has been attributed to redundant, overlapping functions of the MBDs within the brain that may compensate for a defective or absent MBD2. Within disease, MBD2 is implicated in colorectal, prostate, and brain cancers281–283.

Studies of each isoform reveal that MBD2 participates in subtle, but complex, gene regulatory mechanisms. Domain characterisation has confirmed that the TRD is responsible for NuRD recruitment and subsequent gene repression. The constituent proteins of NuRD have nucleosome remodelling and histone deacetylase activity. NuRD consists of ATP-dependent remodelling enzymes CHD3/4, histone deacetylases HDAC1/2, histone chaperones RBBP4/7, DNA binding proteins GATAD2A/B, and metastasis-associated proteins MTA1/2/3284. NuRD localisation to mCG dense regions is driven by MBD2, inducing histone deacetylation and chromatin compaction285,286. This interaction model was posited by an early study utilising a cell culture based reporter assay286–288. A recent genome-wide binding evaluation of MBD2 has built upon these observations and raised more questions about the genomic binding dynamics of MBD2256. Baubec et al. found that MBD2 binds methylated DNA, but the presence of the TRD enables interactions with unmethylated DNA256. This was demonstrated through a biotin-streptavidin based ChIP-seq experiment of MBD2c in ESCs. The experiment demonstrated that MBD2c, an isoform lacking the TRD, binds exclusively to mCG. Importantly, NuRD was not observed to interact with this isoform owing to the absence of a TRD, in line with previous findings256,286. The study also concluded that the TRD increases the number of potential genomic binding sites within the cell, as MBD2 isoforms with the TRD bound to both methylated and unmethylated loci256. The ability of MBD2 to recognise unmethylated DNA has not been mechanistically elucidated. Perhaps NuRD-mediated mechanisms, rather than MBD2 itself, are responsible for binding to unmethylated DNA, through GATAD2A/B, which possess DNA-binding domains. The ramifications of this binding behaviour within a biological setting remain unclear. It has been shown that MBD2a and MBD2b bind tissue-specific, intermediate-to-high CG promoters that are unmethylated, but it remains unclear which isoform, or whether both, exhibits this binding pattern256,278,289. There is also evidence to indicate that post-translational modification of MBD2a in its RG-rich domain by PRMT1 or PRMT5 alters its genomic targets and may reduce its ability to interact with NuRD290,291. The presence of the RG-rich domain in MBD2a, therefore, confers unique binding patterns on this isoform and potentially alters its effects on gene expression through disruption of its interaction with NuRD. Last, some studies, albeit few, support a role for MBD2 in active transcription and suggest that this is, at least in some cases, isoform-specific292,293. Whilst some major regulatory roles of MBD2 have been characterised, future work is required to uncover the more subtle isoform-driven genomic targets and their effects on gene expression.


Amino acid sequence similarity analyses reveal a 70% similarity between MBD3 and MBD213. Both genes contain identical intron/exon structures, resulting in three isoforms, which implies a common origin from a single ancestral gene294,295. Despite this similarity, MBD3, unlike MBD2, binds to mCG non-specifically due to a replacement of a tyrosine residue with phenylalanine within a crucial region of its MBD that dramatically reduces its affinity for mCG253,279. Additionally, MBD3 was absent in a 5hmC MS pull-down in mouse ESCs, likely explained by its known affinity for unmodified CG sites in vivo22. Further, enrichment of MBD3 within unmethylated genomic regions has confirmed this observation whilst raising more questions about the types of genomic elements bound by MBD3. Independent studies in human cell lines have demonstrated that MBD3 localises at CG rich promoters, gene bodies, and enhancers of active genes marked by H3K4me2/3 and H3K27ac278,296. Within mice, one study reported MBD3 enrichment in TSS of CG-rich promoters marked by 5hmC297, whilst another study concluded that MBD3 bound unmethylated enhancers marked by H3K4me1 and H3K27ac independent of 5hmC and 5mC256. Genome-wide binding analyses of MBD3 by ChIP-seq utilised different antibodies, cell lines, and sequencing pipelines, which, together with distinct cell-type- or isoform-specific binding mechanisms, may explain the incongruent conclusions from each study278,296,297. MBD3, like MBD2, also associates with NuRD but is responsible for recruiting a distinct Mi-2/NuRD complex to distinct genomic loci. For example, NuRD is required for embryonic pluripotency, yet only Mbd3 knockout is embryonically lethal, whilst Mbd2 knockout mice display mild defects279,295. In addition, thorough inspection of each complex reveals a set of common Mi-2/NuRD subunits and additional distinct protein interactors specific to each complex. MBD2, unlike MBD3, co-purifies with Protein Arginine Methyltransferase 5 (PRMT5) and recruits PRMT5 to CG islands in a DNA methylation-dependent manner290. In contrast, within mESCs, MBD3/NuRD repression is mediated by association with Zic2, an enhancer-binding factor required for ESC specification298. Despite high sequence homology294, and a shared subset of protein interactions and complexes285, tandem affinity purification coupled to mass spectrometry (TAP-MS) analysis reveals that each protein associates with different proteins290, whilst genome-wide binding256,278 and knockout studies279 have demonstrated that MBD2 and MBD3 display unique binding signatures and regulate diverse genomic processes.

MBD4 is expressed in somatic tissue and ESCs and exhibits several isoforms13. Biochemical analyses have revealed that MBD4 binds symmetrical mCG dinucleotides but exhibits a higher affinity for mCG/TG mismatches, resulting from a deaminated mCG on one strand299. Inspection of MBD4 localisation in vivo reveals a linear correlation with mCG density, but also a significant amount of binding to unmethylated CG rich promoters marked by active chromatin marks. Like MBD1 and MBD2, MBD4 primarily relies on mCG for its localisation, demonstrated by observing MBD4 at DAPI-dense chromocenters, a localisation that is lost in DNMT3 triple-knockout lines256. Similar to other MBDs, MBD4 is thought to alter transcription through recruitment of chromatin remodelers. A GST-MBD4 pull-down identified Sin3A and HDAC1 as interaction partners that promote transcriptional repression of the hypermethylated p16INK4a and hMLH1 promoters in a reporter gene assay to the same levels as MBD2 and MECP2300. Despite these observations, MBD4 has primary roles in DNA repair, through its helix-hairpin-helix (HhH) domain that enables MBD4 to function as a DNA repair protein301. The HhH domain catalyses the removal of thymine and uracil paired with guanine within CG sites299,302,303. The essential role of MBD4 in DNA repair is exemplified by MBD4 knockout mice, which exhibit two to three times more mCG-TG transitions and display increased levels of tumorigenesis303,304. MBD4 is proposed to regulate cell cycle progression and trigger cellular apoptosis, playing roles in genome surveillance through interactions with Fas-associated death domain (FADD)305. Recent analyses have observed MBD4 in indirect demethylation pathways, mediating excision of 5-hydroxymethyluracil (5-hmU), an intermediate byproduct of TET-mediated demethylation306. Within zebrafish, AID and Mbd4 were observed at loci that undergo demethylation, whilst the removal of Mbd4 resulted in the remethylation of some genes136. A more recent study was not able to reproduce these findings in zebrafish, concluding there is no evidence for active DNA demethylation driven by AID and MBD4203, and the role of MBD4 in active DNA demethylation of the early embryo remains controversial307.

MBD5 and MBD6 are the most poorly characterised members of the MBD family. MBD5 has two known isoforms whilst MBD6 has only one240. Their MBDs retain similar structure to MBD1 and MECP2, despite deletions and insertions of 9 and 6 amino acids from the first third and last third of their MBDs respectively308,309. MBD5 and MBD6 displayed no affinity for mCG by electrophoretic mobility shift assay (EMSA) and localised to chromocenters regardless of genome-wide hypomethylation in Dnmt1 deficient cells, indicating that their physiological roles are methylation independent240. As with other MBDs, MBD5 and MBD6 are thought to be transcriptional repressors, however the evidence supporting this is limited. For example, MBD5 and MBD6 interact with the human polycomb deubiquitinase complex (PR-DUB), which serves as a precursor for H3K27me3 mediated silencing310,311. Distinct functional roles for each protein remain a question of interest. Whilst each protein is ubiquitously expressed, isoforms of each protein display subtle expression differences, and both MBD5 and MBD6 are highly expressed within the testes240. MBD5 isoform 1 is highly expressed in the brain, whilst isoform 2 is expressed more highly within oocytes than in all other tissues310. Furthermore, MBD5 mutations have been linked to a range of neurodevelopmental disorders312,313. MBD6 is also expressed in the brain and might play roles in neurodegenerative disease314.


I.5.2 SET and RING-associated (SRA) family

UHRF1 and UHRF2 are highly homologous SRA domain-containing proteins involved in cell cycle progression315 that also contain ubiquitin-like (UBL), tudor, and plant homeodomain (PHD) domains316. UHRF1 is involved in the maintenance of methylation within mammals by binding hemimethylated double-stranded CG dinucleotides, whilst this remains debatable for UHRF2317,318. The PHD domain in UHRF1 recognises the unmodified N-terminus of histone H3 and di-/tri-methylated lysine on histone H3 and is required for proper maintenance of DNA methylation319,320. DNMT1-directed DNA methylation during DNA replication is regulated by UHRF1, the deletion of which is embryonic lethal315. Mouse stem cells lacking UHRF1 show a substantially hypomethylated genome that fails to maintain a higher-order chromatin structure and exhibit augmented transcription of repetitive DNA elements321,322. The inability of UHRF2 to compensate for the lack of UHRF1 in UHRF1-deficient cells indicates that UHRF2 has non-redundant functional roles that remain elusive. Biochemical characterisations of UHRF2 have revealed that this protein binds to 5hmC with high affinity but binds to methylated and unmethylated DNA indiscriminately. For example, crystal structures of UHRF2 bound to hemimethylated mCG dsDNA have demonstrated that UHRF2 binds mCG with comparable affinity to unmethylated DNA323. More recently, the affinity of each protein for methylated, unmethylated, and hemimethylated DNA and hydroxymethylated counterparts was measured using fluorescence polarisation. This biochemical analysis concluded that UHRF1 bound hemimethylated DNA with higher affinity than the other probes, whilst UHRF2 exhibited a lack of affinity for methylated, hemimethylated, and hydroxymethylated probes316. This study was limited to a specific probe sequence context, and is at odds with another study in which UHRF2 demonstrated highest affinity for double-stranded hydroxymethylated DNA, followed by hemi-hydroxymethylated DNA, compared to methylated and unmethylated probes323. Results from an MS-proteomics screen are consistent with this observation, showing an enrichment of UHRF2 on 5hmC over unmethylated probes22. Gene knockout experiments in mice suggest that tissue-specific regulatory roles may explain the affinity of UHRF2 for 5hmC and mCG recognition324. UHRF2 binds 5hmC, and its deletion in mice results in reduced 5hmC levels in the brain that coincide with defects in memory acquisition and retention325. Although UHRF2-deficient mice are viable, they displayed abnormal electrical brain activities and developed seizures, which were attributed to lower mCG levels at defined genomic loci. DNA methylation analysis indicated that unlike UHRF1, whose knockout affects global mC levels326, UHRF2 knockouts exhibit reduced mC levels at UHRF2 target genes324. Whether this is a direct or indirect effect, and whether these subtle reductions in mC levels drive changes in gene expression, remains to be determined.

I.5.3 Kaiso and the Broad complex, Tramtrack, Bric-à-brac or Poxvirus Zinc-finger (BTB/POZ) family

The Cys2His2 (C2H2) zinc finger (ZF) motif is one of the most abundant DNA binding motifs in the human proteome, utilising tandem arrays of ZF domains that bind DNA with sequence specificity327. Kaiso is a BTB/POZ-ZF protein encoded by the ZBTB33 gene and represents the original member of the BTB/POZ family of ZFs that engage in methyl-dependent and independent DNA binding328–330. Since its discovery, the transcriptional regulatory effects of Kaiso have been studied in various contexts including cellular proliferation, apoptosis, cellular migration/invasion, and in various cancer models331–335. Gene reporter and protein purification assays have revealed that Kaiso is capable of transcriptional repression of methylated promoters through a proposed mechanism involving recruitment of NCoR, HDAC, and SMRT (silencing mediator of retinoic acid and thyroid hormone receptor) complexes328,336,337. Interrogation of its binding sites by ChIP-sequencing has surprisingly revealed a preference for lowly methylated regions and hints at transcriptional enhancing roles338. However, these results contrast with numerous biochemical and ChIP-qPCR studies that establish with confidence that ZBTB33 does bind to its consensus mCG motif and that this binding is also observed in vivo339–343. Other BTB/POZ members, ZBTB4 and ZBTB38, bind a single mCG site, unlike Kaiso, which requires two mCG sites. The differences in affinity for each BTB/POZ member are defined by the amino acid composition of each ZF at its DNA binding surface and/or the combinations of ZFs within each protein344,345. Both proteins, like Kaiso, demonstrate transcriptional repression of reporter constructs in transfection assays and are implicated in a range of cellular processes including apoptosis, cellular proliferation, differentiation, and various cancers346–348. Recent analysis of the amino acid composition within ZF domains of BTB/POZ proteins that bind mCG revealed that conserved lysine and arginine residues at distinct sites are required for mCG binding. The presence or absence of these residues at critical positions may predict mCG binding and, consequently, suggests the existence of many more mCG-binding BTB/POZ proteins that have yet to be identified349. SILAC-based DNA pull-down and Nucleosome Affinity Purification (SNAP) experiments, in which proteins are identified based on their preference for methylated DNA or histone H3 methylation, identified several BTB/POZ members whose binding behaviour seems to reflect the amino acid hypothesis. For example, ZBTB12, ZBTB40, and ZBTB33 had moderate to high affinity for mCG or methylated H3. Zinc finger domains within these proteins contained combinations of the critical lysine and arginine residues required for mCG binding, whilst those excluded from methylated DNA, namely ZBTB9, ZBTB2, and ZBTB25, had no lysine or arginine within their ZF motifs23.

Figure 1.1: Protein domains present within the mC reader families. Adapted for MBD14,38, BTB/POZ344,349, and SRA350 families.

I.5.4 Expansion of the mCG reader repertoire and the need for contextually relevant, multifaceted characterisation approaches

In recent years, many high throughput mCG reader screens have been developed that identified numerous mCG binders outside the classical families described for mCG recognition. The employment of MS-proteomics or methyl-sensitive SELEX (systematic evolution of ligands by exponential enrichment) approaches, for example, provides robust, systematic, and comprehensive interrogation of DNA binders, and these approaches have been successfully employed to screen for readers with an affinity for many DNA modifications, for example mCG or 5hmC. MS-based proteomics screens offer advantages over SELEX-based approaches, as they may be implemented in a cell type of choice, identifying DNA binding proteins and protein interactors that are contextually relevant22,351. These approaches have successfully identified numerous mCG binders, revealing that the human proteome consists of a surprisingly large fraction of DNA binders with an affinity for mCG. These DNA-binding screens have also revealed that some proteins bind to both methylated and unmethylated DNA. Whether these DNA binding interactions are biologically relevant or artifacts of biochemical approaches often requires further investigation. The artificial nature of in vitro DNA binding experiments means they do not capture the complexity of genomic DNA. These approaches are therefore inadequate in addressing the complex binding behaviour observed for some proteins that bind multiple target loci within the cell in order to regulate a multitude of genes. In these cases, DNA binding may be governed by specific changes in DNA methylation present in specific cell types, or binding potential may vary for various reasons, including the presence of co-interacting proteins or the presence or absence of certain histone marks.
Multiple in vitro and in vivo binding experiments are needed to overcome the limitations of artificial DNA-binding screens used to identify DNA binders, and are especially important in cases where proteins have been observed to bind to both methylated and unmethylated DNA. Some examples of proteins outside the classical mCG reader families, or DNA binders with affinity for methylated and unmethylated DNA are described below.

KLF4 represents an excellent example of a DNA binder that was observed to bind methylated and unmethylated DNA, and required further evaluation of its binding profile. KLF4, a member of the Krüppel-like family of ZF transcription factors and one of the four Yamanaka reprogramming factors352,353, was identified by a DNA pull-down approach coupled to MS. Subsequent structural characterisation determined that KLF4 exhibited a ~1.5-fold higher affinity for methylated binding elements than for corresponding unmethylated DNA. However, ChIP-bisulfite sequencing revealed KLF4 enrichment at both methylated and unmethylated loci. The binding behaviour of KLF4 has therefore been refined, in line with observations from these and other experiments. It is now accepted that KLF4 exhibits a binding profile that correlates with developmental mCG patterning, but in many cases may bind to unmethylated CG dinucleotides22,354. These observations illustrate the need to follow up biochemical analysis with in vivo binding data, because in vitro assays are informative but do not capture the complexity of DNA binding within the cell. The need for contextually relevant characterisation approaches is similarly demonstrated by ZFP57, which was reported to play a role in genomic imprinting through mCG recognition based upon biochemical assays, ChIP-seq analysis, and gene knockout experiments355–357. ZFP57 is a Krüppel-associated box (KRAB) ZF domain protein required for the maintenance of maternal and paternal gene imprinting. Zfp57 was initially demonstrated to bind two asymmetric mCG sites in dsDNA with high affinity357. Analysis of ChIP-seq data revealed a binding enrichment within imprinting control regions that suggested that Zfp57 may bind to hemimethylated DNA sites in vivo. Further characterisation revealed that Zfp57 binds to hemimethylated DNA and interacts with TRIM28, leading to DNMT1 and UHRF1 recruitment in a process essential for imprinting356. In line with this, loss of Zfp57 within the zygote causes partial neonatal lethality, and loss of maternal Zfp57 results in failure to establish proper imprinting, while eliminating Zfp57 in maternal and zygotic cells results in embryonic lethality355. Through the utilisation of biochemical, gene knockout, and ChIP-seq analysis tools, the roles of Zfp57 have been characterised to a large extent.

Early Growth Response 1 (EGR1) belongs to a rapidly expanding list of proteins with mCG affinity that also display affinity for unmethylated CG DNA327. A fluorescence polarisation assay of EGR1 in complex with the recognition sequence 5’ GCG(T/G)GGGCG 3’ was undertaken where the underlined cytosine was either fully/hemimethylated or oxidised. The affinities of EGR1 for the hemimethylated and oxidised versions of the 9 bp motif were much reduced when compared to the methylated and unmethylated motifs. The overall conclusion posited by the authors was that EGR1 discriminates primarily against the oxidised derivatives within the sequence, rather than between methylated and unmethylated C. A binding constant for EGR1 in complex with the methylated substrate was measured at 0.13 µM, and reduced by a factor of ~2.8 for the unmethylated probe358. An independent study focusing on the structure of EGR1 in complex with the same 9 bp motif indicates that EGR1 binds this motif regardless of methylation status359. Genome-wide inspection of EGR1 binding by ChIP-sequencing analysis revealed that, among the top 1,000 peaks in the mouse brain and human monocytic cell lines, EGR1 exhibits a preference for promoter regions with GC-rich sequences that are highly similar to the 9 bp motif tested above360. Within mouse brain, EGR1 binding co-localised with the activating histone mark H3K9ac, and to a lesser extent the repressive mark H3K27me3, suggesting EGR1 primarily binds unmethylated DNA but may in certain cases inhibit gene expression361,362. Whether gene repression in these cases is orchestrated by mCG recognition remains unknown. What is known about EGR1 target loci is that these genomic elements are also frequently bound by other transcription factors363, and that a subset of these binding sites are bound by MBD proteins364. Interrogation of EGR1 binding in vivo reveals a preference for unmethylated DNA, but with some binding at methylated loci. Based on these observations, it is reasonable to conclude that EGR1 binds both mCG and CG, and that discrimination between them within the cell arises from as yet unknown regulatory protein interactions, which may be cell type-specific, and/or from EGR1 binding kinetics. The number of EGR1 binding sites within the genome exceeds the number of EGR1 molecules within the cell365. The binding kinetics of EGR1 may, therefore, be reliant on its level of expression, its half-life, and the availability of potential protein competitors sharing the same binding motif, for example, SP1 and CREB361. It has also been proposed that the under-representation of EGR1 at methylated loci is due to an inability to compete with MBDs for binding at these loci366. Alternatively, the under-representation of EGR1 at methylated loci may be explained by cell-type-specific protein interactors or post-translational modifications that affect the ability of EGR1 to bind to mCG. Last, experimental factors also need consideration, for example, the quality of the antibody used within a ChIP-seq experiment.
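
As a worked illustration of the EGR1 affinities quoted above, the following minimal Python sketch converts the two binding constants into fractional site occupancy. It is not taken from the cited studies and rests on several assumptions: the 0.13 µM value is treated as an equilibrium dissociation constant (Kd), the ~2.8-fold difference is read as a weaker affinity (higher Kd) for the unmethylated probe, and binding is modelled as a simple 1:1 equilibrium with free protein in excess over DNA sites. The function and constant names are hypothetical.

```python
# Illustrative sketch only: simple 1:1 equilibrium binding, theta = [P] / ([P] + Kd).
# ASSUMPTION: the reported 0.13 uM constant is a dissociation constant (Kd).
KD_METHYLATED_UM = 0.13

# ASSUMPTION: "reduced by a factor of ~2.8" is read as ~2.8-fold weaker affinity,
# i.e. a ~2.8-fold higher Kd for the unmethylated probe.
FOLD_WEAKER_UNMETHYLATED = 2.8
KD_UNMETHYLATED_UM = KD_METHYLATED_UM * FOLD_WEAKER_UNMETHYLATED


def fraction_bound(protein_um: float, kd_um: float) -> float:
    """Return the fraction of DNA sites occupied at a given free protein concentration."""
    if protein_um < 0 or kd_um <= 0:
        raise ValueError("protein_um must be >= 0 and kd_um must be > 0")
    return protein_um / (protein_um + kd_um)


def occupancy_table(concentrations_um):
    """Return (concentration, methylated occupancy, unmethylated occupancy) tuples."""
    return [
        (
            conc,
            fraction_bound(conc, KD_METHYLATED_UM),
            fraction_bound(conc, KD_UNMETHYLATED_UM),
        )
        for conc in concentrations_um
    ]


if __name__ == "__main__":
    # Hypothetical protein concentrations spanning both Kd values.
    for conc, meth, unmeth in occupancy_table([0.01, 0.13, 0.364, 1.0, 10.0]):
        print(f"[EGR1] = {conc:6.3f} uM  methylated = {meth:.2f}  unmethylated = {unmeth:.2f}")
```

Under these assumptions the two probes differ most in occupancy at protein concentrations near the Kd values and converge as protein concentration rises. This is one way to see why expression level and competing binders, rather than intrinsic affinity alone, may shape where EGR1 is found in vivo.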

The aforementioned examples typify the need for follow-up characterisation of DNA binders identified in DNA-binding screens. They include proteins with a mixed affinity for methylated and unmethylated DNA, as well as proteins that required subsequent in vitro and in vivo validation to verify their affinities for the DNA baits used in the initial screens. Such follow-up experiments may include assays such as EMSA, ChIP, and TAP-MS, which help overcome caveats associated with initial screening and provide a more comprehensive characterisation of each DNA binder by identifying possible DNA binding loci or protein interactors that help elicit transcriptional change. This has proved especially important where proteins display an affinity for both methylated and unmethylated CG substrates. The recent implementation of high throughput DNA-binding screens has produced many novel DNA binding candidates with an affinity for mCG. When complemented with protein-interaction and genomic binding data, these candidates have been integral to understanding the molecular repercussions of mCG, highlighting novel mechanisms by which proteins binding to mCG influence various cellular processes.

I.5.5 A need for mCH reader characterisation

The extensive DNA binding literature on mCG readers has contributed substantially to our understanding of how this mark regulates a variety of cellular processes, for example, transcription, chromatin state, and genomic imprinting. Information pertaining to readers of mCH may similarly contribute to our understanding of the molecular processes in which this modification participates, but is currently lacking from the literature. Such studies are required to understand how mCH exerts its effects and the mechanisms by which this occurs. Its deposition, localisation patterns, expression patterns, and associations with transcription suggest that mCH plays important roles within pluripotent stem cells and in the mammalian brain25,104. Thus far, only one mCH reader, MECP2, has been discovered. Characterisation of its affinity for mCA and its genomic localisation has revealed some of the regulatory repercussions associated with mCA patterning in neurons164,165,245. However, the mechanisms by which this modification exerts its effects remain largely unknown, especially in pluripotent stem cells, for which no mCH reader has been discovered. Existing technologies such as DNA pull-downs coupled to MS provide useful screens for identifying mCH readers in these cell types and tissues, yielding candidate DNA binding proteins that may participate in as yet unknown biological processes required for healthy cellular function.

Outline of thesis

Protein binding to mCG and mCH provides a means through which DNA methylation can exert its effects in localised and gene-specific ways. Controlled spatial and temporal expression of transcription factors and chromatin modifiers influences developmental processes by coordinating changes in chromatin structure and gene expression. Initial studies aimed at the identification and characterisation of methyl-reader proteins played integral roles in understanding the readout of mC. Elucidation of the canonical MBDs, for example, helped link gene repression to mCG, simultaneously substantiating the mechanisms by which these readers bind, recruit, and, in most cases, influence local chromatin architecture237,284. The large body of mCG studies has identified and classified the binding capabilities of mCG binders to a relatively large degree. Typically, biochemical validation assays designed to assess the binding affinity of candidate readers are complemented by genome-wide elucidation of binding dynamics by ChIP-seq coupled to WGBS. In recent years, the development of assays such as SELEX and high throughput DNA-affinity pull-downs coupled to MS has enabled the large-scale identification of mCG reader candidates in various tissues and cell types22,23,351. The relatively recent emergence of mCH as an atypical DNA methylation mark largely restricted to pluripotent cells and the mammalian brain has left its potential binding proteins relatively uncharacterised. As yet, only one mCA reader, MECP2, has been identified and characterised as a regulator of neuronal development164,367. To this end, the aim of this thesis was to identify and characterise novel mC readers with high affinity for mCG and mCA within the human and mouse brain by employing a high throughput mC reader screen using DNA pull-down approaches coupled to MS.

Chapter 2 contains all relevant materials and methods pertaining to the experiments conducted in the subsequent results chapters. Chapter 3 discusses the development of ProteoMultiMatrix (ProteoMM), a novel multivariate MS statistical analysis tool that was developed and optimised to confidently address mC reader conservation in the human and mouse brain. Information pertaining to MS analyses and their challenges is therefore included within the introductory section of Chapter 3. The aim of Chapter 4 was to identify mCG readers within the human and mouse brain through the utilisation of ProteoMM. This dataset provides a rich source of mCG and CG binders within the human and mouse brain, identifying both previously characterised and novel DNA binders and interactors.

The correlation between a rise in CA methylation and synaptogenesis, together with its distinct patterning in neurons and glia, suggests that it is involved in neurodevelopmental cellular programs. However, the proteins involved in the readout of CA methylation and their downstream molecular effects remain largely unknown. To address this, Chapter 5 presents the first high throughput mCA reader screen employed within the human and mouse brain, aimed at identifying possible mCA binding candidates. Further, a direct affinity for mCA was confirmed through recombinant expression and protein purification of the top mCA binding candidate. Chapter 6 discusses the results presented in Chapters 3, 4, and 5, together with their implications and importance for understanding the complex epigenetic regulatory network of mammalian brain development.

References

1. Waddington, C. H. The epigenotype. Endeavour 1, 18–20 (1942). 2. Waddington, C. H. Epigenetics and evolution. in Symp. Soc. Exp. Biol vol. 7 186–199 (1953). 3. Russo, V. E. A., Martienssen, R. A. & Riggs, A. D. Epigenetic mechanisms of gene regulation. (1996).

I-42 4. Bernstein, B. E., Meissner, A. & Lander, E. S. The mammalian epigenome. Cell 128, 669–681 (2007). 5. Henikoff, S., Furuyama, T. & Ahmad, K. Histone variants, nucleosome assembly and epigenetic inheritance. Trends Genet. 20, 320–326 (2004). 6. Bestor, T. H. & Tycko, B. Creation of genomic methylation patterns. Nat. Genet. 12, 363 (1996). 7. Mahadevan, L. C., Willis, A. C. & Barratt, M. J. Rapid histone H3 phosphorylation in response to growth factors, phorbol esters, okadaic acid, and protein synthesis inhibitors. Cell 65, 775–783 (1991). 8. Okano, M., Bell, D. W., Haber, D. A. & Li, E. DNA methyltransferases Dnmt3a and Dnmt3b are essential for de novo methylation and mammalian development. Cell 99, 247–257 (1999). 9. Hong, E. J., West, A. E. & Greenberg, M. E. Transcriptional control of cognitive development. Curr. Opin. Neurobiol. 15, 21–28 (2005). 10. Bird, A. Perceptions of epigenetics. Nature 447, 396–398 (2007). 11. Goldberg, A. D., Allis, C. D. & Bernstein, E. Epigenetics: a landscape takes shape. Cell 128, 635–638 (2007). 12. Luco, R. F. et al. Regulation of alternative splicing by histone modifications. Science 327, 996–1000 (2010). 13. Hendrich, B. & Bird, A. Identification and characterization of a family of mammalian methyl-CpG binding proteins. Mol. Cell. Biol. 18, 6538–6547 (1998). 14. Fatemi, M. & Wade, P. A. MBD family proteins: reading the epigenetic code. J. Cell Sci. 119, 3033–3037 (2006). 15. Liang, G. et al. Distinct localization of histone H3 acetylation and H3-K4 methylation to the transcription start sites in the human genome. Proc. Natl. Acad. Sci. U. S. A. 101, 7357–7362 (2004). 16. Lehnertz, B., Ueda, Y., Derijck, A. & Braunschweig, U. Suv39h-mediated histone H3 lysine 9 methylation directs DNA methylation to major satellite repeats at pericentric heterochromatin. Curr. Biol. (2003). 17. Schübeler, D. et al. The histone modification pattern of active genes revealed through genome-wide chromatin analysis of a higher eukaryote. 
Genes Dev. 18, 1263–1271 (2004). 18. Bernstein, B. E. et al. A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell 125, 315–326 (2006). 19. Lister, R. et al. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133, 523–536 (2008). 20. Lister, R. et al. Human DNA methylomes at base resolution show widespread

I-43 epigenomic differences. Nature 462, 315–322 (2009). 21. Deaton, A. M. & Bird, A. CpG islands and the regulation of transcription. Genes Dev. 25, 1010–1022 (2011). 22. Spruijt, C. G. et al. Dynamic readers for 5-(hydroxy)methylcytosine and its oxidized derivatives. Cell 152, 1146–1159 (2013). 23. Bartke, T. et al. Nucleosome-interacting proteins regulated by DNA and histone methylation. Cell 143, 470–484 (2010). 24. Mayer, W., Niveleau, A., Walter, J., Fundele, R. & Haaf, T. Embryogenesis: demethylation of the zygotic paternal genome. Nature 403, 501 (2000). 25. Xie, W. et al. Epigenomic analysis of multilineage differentiation of human embryonic stem cells. Cell 153, 1134–1148 (2013). 26. Reynolds, N. et al. NuRD suppresses pluripotency gene expression to promote transcriptional heterogeneity and lineage commitment. Cell Stem Cell 10, 583–594 (2012). 27. Ogawa, Y., Sun, B. K. & Lee, J. T. Intersection of the RNA interference and X- inactivation pathways. Science 320, 1336–1341 (2008). 28. Howell, C. Y. et al. Genomic imprinting disrupted by a maternal effect mutation in the Dnmt1 gene. Cell 104, 829–838 (2001). 29. Lorincz, M. C., Schübeler, D. & Groudine, M. Methylation-mediated proviral silencing is associated with MeCP2 recruitment and localized histone H3 deacetylation. Mol. Cell. Biol. 21, 7913–7922 (2001). 30. Heintzman, N. D. et al. Histone modifications at human enhancers reflect global cell- type-specific gene expression. Nature 459, 108–112 (2009). 31. Ziller, M. J. et al. Genomic distribution and inter-sample variation of non-CpG methylation across human cell types. PLoS Genet. 7, e1002389 (2011). 32. Yasukochi, Y. et al. X chromosome-wide analyses of genomic DNA methylation states and gene expression in male and female neutrophils. Proc. Natl. Acad. Sci. U. S. A. 107, 3704–3709 (2010). 33. Casas-Delucchi, C. S. et al. Histone acetylation controls the inactive X chromosome replication dynamics. Nat. Commun. 2, 222 (2011). 34. Pontier, D. B. 
& Gribnau, J. Xist regulation and function explored. Hum. Genet. 130, 223–236 (2011). 35. Conley, A. B., Miller, W. J. & Jordan, I. K. Human cis natural antisense transcripts initiated by transposable elements. Trends Genet. 24, 53–56 (2008). 36. Skvortsova, K., Iovino, N. & Bogdanović, O. Functions and mechanisms of epigenetic inheritance in animals. Nat. Rev. Mol. Cell Biol. 19, 774–790 (2018). 37. Strahl, B. D. & Allis, C. D. The language of covalent histone modifications. Nature 403,

I-44 41–45 (2000). 38. Fyodorov, D. V., Zhou, B.-R., Skoultchi, A. I. & Bai, Y. Emerging roles of linker histones in regulating chromatin structure and function. Nat. Rev. Mol. Cell Biol. 19, 192–206 (2018). 39. Luger, K., Mäder, A. W., Richmond, R. K., Sargent, D. F. & Richmond, T. J. Crystal structure of the nucleosome core particle at 2.8 A resolution. Nature 389, 251–260 (1997). 40. Bates, D. L. & Thomas, J. O. Histones H1 and H5: one or two molecules per nucleosome? Nucleic Acids Res. 9, 5883–5894 (1981). 41. Kornberg, R. D. Chromatin structure: a repeating unit of histones and DNA. Science 184, 868–871 (1974). 42. Talbert, P. B. & Henikoff, S. Histone variants—ancient wrap artists of the epigenome. Nat. Rev. Mol. Cell Biol. (2010). 43. O’Neill, L. P. & Turner, B. M. Histone H4 acetylation distinguishes coding regions of the human genome from heterochromatin in a differentiation‐dependent but transcription‐ independent manner …. EMBO J. (1995). 44. Lee, D. Y., Hayes, J. J., Pruss, D. & Wolffe, A. P. A positive role for histone acetylation in transcription factor access to nucleosomal DNA. Cell 72, 73–84 (1993). 45. Braunstein, M., Sobel, R. E., Allis, C. D., Turner, B. M. & Broach, J. R. Efficient transcriptional silencing in Saccharomyces cerevisiae requires a heterochromatin histone acetylation pattern. Mol. Cell. Biol. 16, 4349–4356 (1996). 46. Schreiber, S. L. & Bernstein, B. E. Signaling network model of chromatin. Cell 111, 771– 778 (2002). 47. Nan, X., Campoy, F. J. & Bird, A. MeCP2 is a transcriptional repressor with abundant binding sites in genomic chromatin. Cell 88, 471–481 (1997). 48. Splinter, E. et al. CTCF mediates long-range chromatin looping and local histone modification in the β-globin locus. Genes Dev. 20, 2349–2354 (2006). 49. Bornstein, C. et al. A negative feedback loop of transcription factors specifies alternative dendritic cell chromatin States. Mol. Cell 56, 749–762 (2014). 50. Maleszka, R., Mason, P. H. & Barron, A. B. 
Epigenomics and the concept of degeneracy in biological systems. Brief. Funct. Genomics 13, 191–202 (2014). 51. Althammer, S., Pagès, A. & Eyras, E. Predictive models of gene regulation from high- throughput epigenomics data. Comp. Funct. Genomics 2012, 284786 (2012). 52. Ruthenburg, A. J. et al. Recognition of a mononucleosomal histone modification pattern by BPTF via multivalent interactions. Cell 145, 692–706 (2011). 53. Eustermann, S. et al. Combinatorial readout of histone H3 modifications specifies localization of ATRX to heterochromatin. Nat. Struct. Mol. Biol. 18, 777–782 (2011).

I-45 54. Taverna, S. D. et al. Long-distance combinatorial linkage between methylation and acetylation on histone H3 N termini. Proc. Natl. Acad. Sci. U. S. A. 104, 2086–2091 (2007). 55. Young, N. L. et al. High throughput characterization of combinatorial histone codes. Mol. Cell. Proteomics 8, 2266–2284 (2009). 56. Huff, J. T., Plocik, A. M., Guthrie, C. & Yamamoto, K. R. Reciprocal intronic and exonic histone modification regions in humans. Nat. Struct. Mol. Biol. 17, 1495–1499 (2010). 57. Kharchenko, P. V. et al. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature 471, 480–485 (2011). 58. Heintzman, N. D. et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat. Genet. 39, 311–318 (2007). 59. Ernst, J. & Kellis, M. ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods 9, 215–216 (2012). 60. Sadeh, R., Launer-Wachs, R., Wandel, H., Rahat, A. & Friedman, N. Elucidating Combinatorial Chromatin States at Single-Nucleosome Resolution. Mol. Cell 63, 1080– 1088 (2016). 61. Wang, L. et al. Hierarchical recruitment of polycomb group silencing complexes. Mol. Cell 14, 637–646 (2004). 62. Berger, S. L. The complex language of chromatin regulation during transcription. Nature 447, 407–412 (2007). 63. Peters, A. H. et al. Loss of the Suv39h histone methyltransferases impairs mammalian heterochromatin and genome stability. Cell 107, 323–337 (2001). 64. Cao, R. et al. Role of histone H3 lysine 27 methylation in Polycomb-group silencing. Science 298, 1039–1043 (2002). 65. Blackledge, N. P. et al. Variant PRC1 complex-dependent H2A ubiquitylation drives PRC2 recruitment and polycomb domain formation. Cell 157, 1445–1459 (2014). 66. Gong, F., Clouaire, T., Aguirrebengoa, M., Legube, G. & Miller, K. M. Histone demethylase KDM5A regulates the ZMYND8–NuRD chromatin remodeler to promote DNA repair. J. Cell Biol. jcb.201611135 (2017). 67. Becker, J. 
S., Nicetto, D. & Zaret, K. S. H3K9me3-Dependent Heterochromatin: Barrier to Cell Fate Changes. Trends Genet. 32, 29–41 (2016). 68. Hontelez, S. et al. Embryonic transcription is controlled by maternally defined chromatin state. Nat. Commun. 6, 10148 (2015). 69. Zaratiegui, M., Irvine, D. V. & Martienssen, R. A. Noncoding RNAs and gene silencing. Cell 128, 763–776 (2007). 70. Ponting, C. P., Oliver, P. L. & Reik, W. Evolution and functions of long noncoding RNAs. Cell 136, 629–641 (2009).

I-46 71. Zadissa, A., Searle, S., Barnes, I. & Bignell, A. GENCODE: the reference human genome annotation for The ENCODE Project. Research (2012). 72. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012). 73. Hung, T. et al. Extensive and coordinated transcription of noncoding RNAs within cell- cycle promoters. Nat. Genet. 43, 621–629 (2011). 74. Grote, P. et al. The tissue-specific lncRNA Fendrr is an essential regulator of heart and body wall development in the mouse. Dev. Cell 24, 206–214 (2013). 75. Loewer, S. et al. Large intergenic non-coding RNA-RoR modulates reprogramming of human induced pluripotent stem cells. Nat. Genet. 42, 1113–1117 (2010). 76. Zhao, J., Sun, B. K., Erwin, J. A., Song, J.-J. & Lee, J. T. Polycomb proteins targeted by a short repeat RNA to the mouse X chromosome. Science 322, 750–756 (2008). 77. Khalil, A. M. et al. Many human large intergenic noncoding RNAs associate with chromatin-modifying complexes and affect gene expression. Proc. Natl. Acad. Sci. U. S. A. 106, 11667–11672 (2009). 78. Marchese, F. P. & Huarte, M. Long non-coding RNAs and chromatin modifiers: their place in the epigenetic code. Epigenetics 9, 21–26 (2014). 79. Li, L. et al. Targeted disruption of Hotair leads to homeotic transformation and gene derepression. Cell Rep. 5, 3–12 (2013). 80. Volpe, T. A. et al. Regulation of heterochromatic silencing and histone H3 lysine-9 methylation by RNAi. Science 297, 1833–1837 (2002). 81. Hall, I. M. et al. Establishment and maintenance of a heterochromatin domain. Science 297, 2232–2237 (2002). 82. Lin, H. & Yin, H. A novel epigenetic mechanism in Drosophila somatic cells mediated by Piwi and piRNAs. Cold Spring Harb. Symp. Quant. Biol. 73, 273–281 (2008). 83. Burton, N. O., Burkhart, K. B. & Kennedy, S. Nuclear RNAi maintains heritable gene silencing in Caenorhabditis elegans. Proc. Natl. Acad. Sci. U. S. A. 108, 19683–19688 (2011). 84. Castel, S. E. 
& Martienssen, R. A. RNA interference in the nucleus: roles for small RNAs in transcription, epigenetics and beyond. Nat. Rev. Genet. 14, 100–112 (2013). 85. Lee, Y. C. G. & Karpen, G. H. Pervasive epigenetic effects of Drosophila euchromatic transposable elements impact their evolution. Elife 6, (2017). 86. Morris, K. V., Chan, S. W.-L., Jacobsen, S. E. & Looney, D. J. Small interfering RNA- induced transcriptional gene silencing in human cells. Science 305, 1289–1292 (2004). 87. Zhou, W., Wang, J., Man, W.-Y., Zhang, Q.-W. & Xu, W.-G. siRNA silencing EZH2 reverses cisplatin-resistance of human non-small cell lung and gastric cancer cells. Asian Pac. J. Cancer Prev. 16, 2425–2430 (2015).

I-47 88. Sugiaman-Trapman, D. et al. Characterization of the human RFX transcription factor family by regulatory and target gene analysis. BMC Genomics 19, 181 (2018). 89. Lin, H. piRNAs in the germ line. Science 316, 397 (2007). 90. Huang, X. A. et al. A major epigenetic programming mechanism guided by piRNAs. Dev. Cell 24, 502–516 (2013). 91. Bourc’his, D. & Bestor, T. H. Meiotic catastrophe and retrotransposon reactivation in male germ cells lacking Dnmt3L. Nature 431, 96–99 (2004). 92. Kuramochi-Miyagawa, S. et al. DNA methylation of retrotransposon genes is regulated by Piwi family members MILI and MIWI2 in murine fetal testes. Genes Dev. 22, 908–917 (2008). 93. Zoch, A. et al. SPOCD1 is an essential executor of piRNA-directed de novo DNA methylation. Nature 584, 635–639 (2020). 94. Feng, S., Jacobsen, S. E. & Reik, W. Epigenetic reprogramming in plant and animal development. Science 330, 622–627 (2010). 95. Smith, Z. D. & Meissner, A. DNA methylation: roles in mammalian development. Nat. Rev. Genet. 14, 204–220 (2013). 96. Vardimon, L., Kressmann, A. & Cedar, H. Expression of a cloned adenovirus gene is inhibited by in vitro methylation. Proceedings of the (1982). 97. Stein, R., Razin, A. & Cedar, H. In vitro methylation of the hamster adenine phosphoribosyltransferase gene inhibits its expression in mouse L cells. Proceedings of the National (1982). 98. Bird, A. DNA methylation patterns and epigenetic memory. Genes Dev. 16, 6–21 (2002). 99. Rakyan, V. K. et al. DNA methylation profiling of the human major histocompatibility complex: a pilot study for the human epigenome project. PLoS Biol. 2, e405 (2004). 100. Monk, M., Boubelik, M. & Lehnert, S. Temporal and regional changes in DNA methylation in the embryonic, extraembryonic and germ cell lineages during mouse embryo development. Development 99, 371–382 (1987). 101. Hajkova, P. et al. Chromatin dynamics during epigenetic reprogramming in the mouse germ line. Nature 452, 877–881 (2008). 102. Doi, A. et al. 
Differential methylation of tissue- and cancer-specific CpG island shores distinguishes human induced pluripotent stem cells, embryonic stem cells and fibroblasts. Nat. Genet. 41, 1350–1353 (2009). 103. Luo, C. et al. Single-cell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex. Science 357, 600–604 (2017). 104. Lister, R. et al. Global epigenomic reconfiguration during mammalian brain development. Science 341, 1237905 (2013).

I-48 105. Stadler, M. B. et al. DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature 480, 490 (2011). 106. Weber, M. et al. Distribution, silencing potential and evolutionary impact of promoter DNA methylation in the human genome. Nat. Genet. 39, 457–466 (2007). 107. Kelly, T. K. et al. H2A.Z Maintenance during Mitosis Reveals Nucleosome Shifting on Mitotically Silenced Genes. Mol. Cell 39, 901–911 (2010). 108. Farthing, C. R. et al. Global Mapping of DNA Methylation in Mouse Promoters Reveals Epigenetic Reprogramming of Pluripotency Genes. PLoS Genet. 4, e1000116 (2008). 109. Weber, M. et al. Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nat. Genet. 37, 853–862 (2005). 110. Illingworth, R. S. et al. Orphan CpG islands identify numerous conserved promoters in the mammalian genome. PLoS Genet. 6, e1001134 (2010). 111. van IJcken, W., Landeira, D. & Rada-Iglesias, A. Orphan CpG islands boost the regulatory activity of poised enhancers and dictate the responsiveness of their target genes. bioRxiv (2020). 112. Hellman, A. & Chess, A. Gene Body-Specific Methylation on the Active X Chromosome. Science 315, 1141–1143 (2007). 113. Messerschmidt, D. M. & Knowles, B. B. DNA methylation dynamics during epigenetic reprogramming in the germline and preimplantation embryos. Genes (2014). 114. Ballas, N., Grunseich, C., Lu, D. D., Speh, J. C. & Mandel, G. REST and its corepressors mediate plasticity of neuronal gene chromatin throughout neurogenesis. Cell 121, 645–657 (2005). 115. Lock, L. F., Takagi, N. & Martin, G. R. Methylation of the Hprt gene on the inactive X occurs after chromosome inactivation. Cell 48, 39–46 (1987). 116. Zilberman, D., Coleman-Derr, D., Ballinger, T. & Henikoff, S. Histone H2A.Z and DNA methylation are mutually antagonistic chromatin marks. Nature 456, 125 (2008). 117. You, J. S. et al. 
OCT4 establishes and maintains nucleosome-depleted regions that provide additional layers of epigenetic regulation of its target genes. Proc. Natl. Acad. Sci. U. S. A. 108, 14497–14502 (2011). 118. Gahurova, L. et al. Transcription and chromatin determinants of de novo DNA methylation timing in oocytes. Epigenetics Chromatin 10, 1–19 (2017). 119. Jia, D., Jurkowska, R. Z., Zhang, X., Jeltsch, A. & Cheng, X. Structure of Dnmt3a bound to Dnmt3L suggests a model for de novo DNA methylation. Nature 449, 248–251 (2007). 120. Pflueger, C. et al. A modular dCas9-SunTag DNMT3A epigenome editing system overcomes pervasive off-target activity of direct fusion dCas9-DNMT3A constructs.

I-49 Genome Res. 28, 1193–1206 (2018). 121. Ford, E. E., Grimmer, M. R., Stolzenburg, S. & Bogdanovic, O. Frequent lack of repressive capacity of promoter DNA methylation identified through genome-wide epigenomic manipulation. bioRxiv (2017). 122. Galonska, C. et al. Genome-wide tracking of dCas9-methyltransferase footprints. Nat. Commun. 9, 597 (2018). 123. Kungulovski, G. & Jeltsch, A. Epigenome Editing: State of the Art, Concepts, and Perspectives. Trends Genet. 32, 101–113 (2016). 124. Song, Y. et al. Dynamic Enhancer DNA Methylation as Basis for Transcriptional and Cellular Heterogeneity of ESCs. Mol. Cell 75, 905–920.e6 (2019). 125. Smallwood, S. A. et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat. Methods 11, 817–820 (2014). 126. Schmidl, C., Klug, M., Boeld, T. J. & Andreesen, R. Lineage-specific DNA methylation in T cells correlates with histone methylation and enhancer activity. Genome (2009). 127. Wiench, M. et al. DNA methylation status predicts cell type‐specific enhancer activity. EMBO J. 30, 3028–3039 (2011). 128. de Koning, A. P. J., Gu, W., Castoe, T. A., Batzer, M. A. & Pollock, D. D. Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet. 7, e1002384 (2011). 129. Ahmed, M. & Liang, P. Transposable elements are a significant contributor to tandem repeats in the human genome. Comp. Funct. Genomics 2012, 947089 (2012). 130. Jagannathan, M. & Yamashita, Y. M. Function of Junk: Pericentromeric Satellite DNA in Chromosome Maintenance. Cold Spring Harb. Symp. Quant. Biol. 82, 319–327 (2017). 131. Gopalakrishnan, S., Sullivan, B. A., Trazzi, S., Della Valle, G. & Robertson, K. D. DNMT3B interacts with constitutive centromere protein CENP-C to modulate DNA methylation and the histone code at centromeric regions. Hum. Mol. Genet. 18, 3178– 3193 (2009). 132. Szak, S. T. et al. Molecular archeology of L1 insertions in the human genome. Genome Biol. 3, research0052 (2002). 133. 
Mouse Genome Sequencing Consortium et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002). 134. Ehrlich, M. et al. Amount and distribution of 5-methylcytosine in human DNA from different types of tissues of cells. Nucleic Acids Res. 10, 2709–2721 (1982). 135. Eckhardt, F. et al. DNA methylation profiling of human 6, 20 and 22. Nat. Genet. 38, 1378–1385 (2006). 136. Rai, K. et al. DNA demethylation in zebrafish involves the coupling of a deaminase, a glycosylase, and gadd45. Cell 135, 1201–1212 (2008).

Promoter methylation of Egr-1 site contributes to fetal hypoxia-mediated PKCε gene repression in the developing heart. Am. J. Physiol. Regul. Integr. Comp. Physiol. 304, R683–9 (2013). 364. Ogishima, T. et al. Promoter CpG hypomethylation and transcription factor EGR1 hyperactivate heparanase expression in bladder cancer. Oncogene 24, 6765–6772 (2005). 365. Kemme, C. A., Esadze, A. & Iwahara, J. Influence of quasi-specific sites on kinetics of target DNA search by a sequence-specific DNA-binding protein. Biochemistry 54, 6684– 6691 (2015). 366. Kemme, C. A., Marquez, R., Luu, R. H. & Iwahara, J. Potential role of DNA methylation

I-65 as a facilitator of target search processes for transcription factors through interplay with methyl-CpG-binding proteins. Nucleic Acids Res. 45, 7751–7759 (2017). 367. Skene, P. J., Illingworth, R. S., Webb, S. & Kerr, A. R. W. Neuronal MeCP2 is expressed at near histone-octamer levels and globally alters the chromatin state. Mol. Cell (2010).


Materials and methods

DNA pull-down coupled to Mass spectrometry

II.1.1 Nuclei isolation and protein extraction from mammalian brain

Excess mouse brain tissue was obtained from Dr Julian Heng, who was sacrificing C57BL/6 mice for his approved experimental procedures. Dr Heng’s animal ethics protocols were approved by the Harry Perkins Institute Animal Ethics Committee. Approval for the use of human brain tissue was obtained from the UWA Human Research Ethics Office. Protein extract was obtained from adult whole mouse brain (8 weeks) and adult human frontal cortex tissue. Nuclei isolation was performed as described below. All steps were performed at 4°C. Tissue immersed in liquid nitrogen was broken into 1g segments using a mortar and pestle, ensuring the tissue remained frozen throughout. Ground tissue was then homogenised in a Tenbroeck homogeniser containing chilled Lysis buffer. Lysis buffer consisted of 0.32M filtered sucrose solution, 5mM CaCl2, 3mM Mg(CH3COO)2, 0.1mM EDTA, 10mM Tris-HCl, pH 8.0, 1mM DTT, 0.1% Triton-X and 1X Roche complete EDTA-free protease inhibitor tablets.

Cushion solution containing 1.8M sucrose, 3mM Mg(CH3COO)2 and 10mM Tris-HCl, pH 8.0 was layered into 13.2mL ultracentrifugation tubes. Homogenate was filtered through a 40µm strainer to remove unwanted debris before loading. Each tube contained 8.5mL of Cushion solution and 3.5mL of Lysis buffer containing homogenate. Ultracentrifuge tubes were spun at 25,259 rpm for 2.5 hours at 4°C. Pellets were resuspended in 100µL chilled extraction buffer containing 20mM Tris-HCl, pH 7.9, 420mM KCl, 1.5mM MgCl2, 0.5mM DTT and 1X Roche complete EDTA-free protease inhibitor tablet. Total nuclear protein concentration was determined by Bradford assay. The protein lysate in each replicate comprised a mixture of brain samples, because each replicate required more protein than could be obtained from a single brain tissue sample.

II.1.2 Preparation of biotinylated probes

Complementary 5’ biotinylated ssDNA oligonucleotides were synthesised by IDT (Table S2.3) and contained the following sequences: Oligonucleotide mCG1 (5’ biotin- GAT GAT GAmC GAmC GAmC GAmC GAT GAT G-3’), oligonucleotide mCG2 (5’ biotin- CAT CAT mCGT mCGT mCGT mCGT CAT CAT C-3’). Oligonucleotide CG1 (5’ biotin- GAT GAT GAC GAC GAC GAC GAT GAT G-3’), oligonucleotide CG2 (5’ biotin- CAT CAT CGT CGT CGT CGT CAT CAT C-3’). Oligonucleotide mCA (5’ biotin- GAT GAT GTA mCAC TAmC ACT AmCA CTA mCAC ATG ATG-3’). Oligonucleotide CA1 (5’ biotin- GAT GAT GTA CAC TAC ACT ACA CTA CAC ATG ATG-3’, as listed in Table S2.3), oligonucleotide CA2 (5’ biotin- CAT CAT GTG TAG TGT AGT GTA GTG TAC ATC ATC-3’). Oligonucleotide pairs mCG1 and mCG2, CG1 and CG2, mCA and CA2, and CA1 and CA2 were annealed in 1X NEB buffer 2.0 by placing complementary sequences in a 95°C water-bath for 2 minutes and cooling gradually to room temperature.
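Annealing relies on each forward oligonucleotide being the exact reverse complement of its partner. The following Python sketch is an illustrative check only and was not part of the original workflow. It assumes the probe sequences as listed in Table S2.3, treats methylated cytosine (mC) as C for base pairing, and uses hypothetical function and variable names.

```python
# Illustrative check: each forward probe should be the reverse complement of its partner.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def clean(seq):
    """Strip spaces and the 'm' methyl marker, leaving plain bases."""
    return seq.replace(" ", "").replace("m", "")


def reverse_complement(seq):
    """Reverse complement of a probe, ignoring methyl markers."""
    return "".join(COMPLEMENT[base] for base in reversed(clean(seq)))


def methyl_count(seq):
    """Number of methylated cytosines written as 'mC' in a probe."""
    return seq.count("mC")


# (forward, reverse) pairs, sequences as given in Table S2.3
PAIRS = {
    "mCG": ("GAT GAT GAmC GAmC GAmC GAmC GAT GAT G",
            "CAT CAT mCGT mCGT mCGT mCGT CAT CAT C"),
    "CG": ("GAT GAT GAC GAC GAC GAC GAT GAT G",
           "CAT CAT CGT CGT CGT CGT CAT CAT C"),
    "mCA": ("GAT GAT GTA mCAC TAmC ACT AmCA CTA mCAC ATG ATG",
            "CAT CAT GTG TAG TGT AGT GTA GTG TAC ATC ATC"),
    "CA": ("GAT GAT GTA CAC TAC ACT ACA CTA CAC ATG ATG",
           "CAT CAT GTG TAG TGT AGT GTA GTG TAC ATC ATC"),
}


def pairs_are_complementary(pairs):
    """Map each pair name to True if forward and reverse strands fully base-pair."""
    return {name: reverse_complement(fwd) == clean(rev)
            for name, (fwd, rev) in pairs.items()}


if __name__ == "__main__":
    for name, ok in pairs_are_complementary(PAIRS).items():
        print(name, "complementary" if ok else "MISMATCH")
```

Because the mCA forward strand is paired with the unmethylated CA2 strand, the mCA duplex is hemi-methylated, whereas the mCG duplex carries methyl groups on both strands.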

II.1.3 DNA pull-downs

All steps were performed at 4°C. DNA was immobilised with 10µg of Streptavidin Sepharose High Performance beads (GE Healthcare). Each wash involved inverting 1mL of solution 10 times followed by a 2 minute, 4000 rpm centrifugation to collect the bead fraction. Beads were washed once in phosphate buffered saline (PBS) containing 0.1% NP40, and once in DNA binding buffer containing 1M NaCl, 10mM Tris-HCl, pH 8.0, 1mM EDTA and 0.05% NP40. A total of 9.5µg of DNA diluted in 600µL of DNA binding buffer was incubated with beads for 30 minutes. To confirm that all DNA was biotinylated and that bead capture was efficient, 500ng of un-incubated and 500ng of incubated DNA were loaded on a 2% agarose gel stained with ethidium bromide at 1:10,000. DNA-bound beads were washed once in DNA binding buffer and twice in protein incubation buffer containing 150mM KCl, 50mM Tris-HCl, pH 8.0, 0.25% NP40, 1mM DTT and 1X Roche EDTA-free protease inhibitor tablets. A total of 450µg of nuclear protein extract from mouse whole brain or human frontal cortex, supplemented with 10µg of competitor poly(dA-dT) nucleic acid, was incubated with the DNA-bound beads in a 600µL volume for 90 minutes. Beads were washed 3 times in protein incubation buffer and twice in PBS. Protein elution from beads was carried out in 50µL of solution containing 2M Urea dissolved in 100mM Tris-HCl, pH 7.5 and 10mM DTT.

II.1.4 Nuclear enrichment confirmation by Western blot

Equal concentrations of protein extracts prior to and after DNA pull-downs were incubated at 90°C for 5 minutes before being loaded into a Mini-Protean TGX Precast Gel (Bio-Rad). The gel was run at 120V for 100 minutes and proteins were transferred to a PVDF membrane with the Trans-Blot® Turbo™ Mini PVDF Transfer Pack using the Trans-Blot® Turbo™ Transfer System (Bio-Rad). Transfer conditions were as per “Protean Mini TXG” settings. The membrane was incubated with Ponceau S for 2 minutes and split into two. Ponceau S stain was removed by washing twice in DI water for 5 minutes, followed by one wash in PBS for 15 minutes. Blocking was achieved by incubating the membrane in PBS, 0.1% tween supplemented with 2.5% milk powder for 45 minutes at room temperature. Incubations with primary antibodies were at 4°C overnight in PBS, 0.1% tween supplemented with 0.5% milk powder using a primary polyclonal rabbit anti-MeCP2 (Abcam, ab2828) at 3µg/mL and primary monoclonal mouse anti-alpha tubulin (GenScript, A01410) at 0.5µg/mL. Membranes were washed as follows: 4X in PBS, 0.1% tween supplemented with 0.5% milk powder for 5 min, 2X in PBS, 0.1% tween for 30 seconds and 2X in PBS for 30 seconds, all at room temperature. Incubation with the secondary antibodies was performed at room temperature for 60 minutes using horseradish peroxidase-conjugated goat anti-rabbit (Thermo Fisher) at 1:2000 dilution to detect the anti-MeCP2 primary, and horseradish peroxidase-conjugated rabbit anti-mouse (Thermo Fisher) at 1:10 000 dilution to detect the mouse anti-alpha tubulin primary. Membranes were then washed again using the same wash series. The two Clarity ECL Western Blotting substrate reagents (Bio-Rad) were mixed in a 1:1 ratio to produce a chemiluminescence signal.

II.1.5 On bead trypsin Digest

Stage tips were activated with 50µL methanol and placed within 1.5mL Eppendorf tubes for fluid collection. Stage tips were washed once in 50µL of 0.5% acetic acid solution, and twice in 50µL of 0.5% acetic acid, 80% acetonitrile solution. Samples were loaded on stage tips and washed once in 50µL of 0.5% acetic acid, 80% acetonitrile solution before being sent for mass spectrometry analysis.


Figure 2.1: Overview of affinity pull-down for the identification of mC readers. Protein extracts (coloured proteins) derived from frozen human frontal cortex and mouse whole brain were incubated with methylated probes in the CA and CG contexts and their unmethylated controls and subject to mass spectrometry analysis.

Overview of mass spectrometry analysis using ProteoMM

Peptide and protein counts were recorded by MaxQuant version 1.5.1.0 [1] and the ‘peptide’ output text file containing raw peptide intensities was used for statistical analysis. Differential expression (DE) and presence/absence (P/A) analyses were performed using ProteoMM, a Bioconductor R package for peptide-level differential expression analysis. ProteoMM is capable of single and multiple proteomic dataset analyses (Figure 2.1).

ProteoMM performs peptide-level normalisation and imputation [2–4] on each dataset before combining proteins that are common to both datasets. Analyses of proteins shared between the two datasets are termed ‘combined’ analyses. Proteins observed in only one of the constituent datasets are analysed separately in what is termed a ‘limited’ analysis. ProteoMM outputs two text files, one for each type of analysis implemented: differential expression (DE) or presence/absence (P/A). As illustrated in Figure 3.5, ProteoMM yields one DE and one P/A text file for the combined analyses. Each constituent dataset within the combined analyses is assigned its own log2 fold change (log2FC) and p-value. This is an important feature of ProteoMM because it allows proteins to be discriminated based on three possible outcomes: proteins significant in both datasets, proteins significant only in dataset 1, or proteins significant only in dataset 2, despite having peptide observations in both datasets. Proteins only observed in dataset 1 or 2 within the limited dataset are assigned DE and P/A text files corresponding to each individual analysis.

II.2.1 Identification of mC readers in human and mouse by ProteoMM

The input format for ProteoMM requires two columns containing peptide sequence and unique protein ID. Common proteins share the same ID between datasets and are used to separate “combined” human and mouse proteins from “human-limited” and “mouse-limited” proteins observed in only one species. All other metadata columns, such as peptide intensities, are propagated through the analysis pipeline. Both human and mouse datasets, for each context, contained 6 raw intensity columns corresponding to 3 technical replicates each for the methylated and unmethylated contexts. All 0’s were replaced with NAs and values were log2 transformed [3]. Missing values between replicates were plotted to assess replicate quality, EigenMS normalisation was performed and missing values were imputed using model-based imputation [2,3,5].
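The pre-processing described above can be sketched in a few lines of Python. This is a minimal illustration, not ProteoMM code (ProteoMM is an R package). The record layout, function names and toy intensities are assumptions made for the example.

```python
import math


def log2_or_missing(value):
    """Treat a raw intensity of 0 as missing (None); otherwise log2 transform it."""
    return None if value == 0 else math.log2(value)


def preprocess(rows):
    """Add a 'log2' list to each peptide record; the input rows are left unchanged."""
    return [{**row, "log2": [log2_or_missing(v) for v in row["intensities"]]}
            for row in rows]


def split_by_species(human_rows, mouse_rows):
    """Separate protein IDs into combined, human-limited and mouse-limited sets."""
    human_ids = {row["protein_id"] for row in human_rows}
    mouse_ids = {row["protein_id"] for row in mouse_rows}
    return {
        "combined": human_ids & mouse_ids,
        "human_limited": human_ids - mouse_ids,
        "mouse_limited": mouse_ids - human_ids,
    }


def missing_fraction(rows):
    """Fraction of missing values in each replicate column of pre-processed rows."""
    n_cols = len(rows[0]["log2"])
    return [sum(row["log2"][j] is None for row in rows) / len(rows)
            for j in range(n_cols)]


# Toy data (hypothetical): 3 methylated then 3 unmethylated replicate intensities.
HUMAN = [
    {"peptide": "AAK", "protein_id": "P1", "intensities": [1024, 2048, 0, 512, 256, 0]},
    {"peptide": "LLR", "protein_id": "P2", "intensities": [4096, 4096, 2048, 0, 0, 0]},
]
MOUSE = [
    {"peptide": "AAK", "protein_id": "P1", "intensities": [512, 1024, 1024, 256, 0, 128]},
    {"peptide": "GGK", "protein_id": "P3", "intensities": [64, 0, 128, 64, 32, 32]},
]

if __name__ == "__main__":
    print(preprocess(HUMAN)[0]["log2"])
    print(split_by_species(HUMAN, MOUSE))
    print(missing_fraction(preprocess(HUMAN)))
```

In this toy example, protein P1 would enter the combined analysis, while P2 and P3 would be handled in the human-limited and mouse-limited analyses respectively.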

II.2.2 Eigen MS normalisation and model-based imputation

EigenMS incorporates ANOVA and singular value decomposition (SVD) to capture and remove biases from LC-MS metabolomics peak intensity measurements. ANOVA is used to capture and preserve the variation between methylated and unmethylated probes. SVD is then applied to a matrix of residuals to find and remove any systematic bias trends. Bias trends are determined by a permutation test [6]. EigenMS normalisation and model-based imputation were implemented at the peptide level prior to a roll-up to protein level and were applied to peptides within each independent dataset. The pre-processed DE dataset contained peptides with observations in both methylated and unmethylated contexts, providing sufficient informative data such that normalisation and imputation could be reliably implemented.
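The two-stage idea behind this normalisation (set aside the treatment effect, then remove the dominant trend in the residuals) can be sketched in plain Python. This is a simplified illustration and not the EigenMS implementation: it assumes complete data, removes exactly one bias trend instead of choosing the number of trends by permutation test, and finds the leading singular vector by power iteration. All names and the toy matrix are hypothetical.

```python
def group_means(row, groups):
    """Mean of one peptide's values within each treatment group."""
    means = {}
    for g in set(groups):
        vals = [x for x, gi in zip(row, groups) if gi == g]
        means[g] = sum(vals) / len(vals)
    return means


def residual_matrix(data, groups):
    """ANOVA-like step: subtract each peptide's group mean, preserving treatment effects."""
    residuals = []
    for row in data:
        means = group_means(row, groups)
        residuals.append([x - means[g] for x, g in zip(row, groups)])
    return residuals


def leading_trend(residuals, n_iter=200):
    """Leading right singular vector of the residual matrix, via power iteration."""
    # Start from the largest residual row so the start vector lies in the row space.
    v = list(max(residuals, key=lambda r: sum(x * x for x in r)))
    for _ in range(n_iter):
        scores = [sum(x * vj for x, vj in zip(row, v)) for row in residuals]
        v = [sum(s * row[j] for s, row in zip(scores, residuals))
             for j in range(len(v))]
        norm = sum(x * x for x in v) ** 0.5
        if norm == 0:  # no residual variation left to model
            break
        v = [x / norm for x in v]
    return v


def remove_one_trend(data, groups):
    """Subtract the rank-1 bias component (leading residual trend) from the data."""
    residuals = residual_matrix(data, groups)
    v = leading_trend(residuals)
    cleaned = []
    for row, res in zip(data, residuals):
        score = sum(x * vj for x, vj in zip(res, v))
        cleaned.append([x - score * vj for x, vj in zip(row, v)])
    return cleaned


def residual_ss(data, groups):
    """Total within-group sum of squares across all peptides."""
    return sum(x * x for row in residual_matrix(data, groups) for x in row)


# Toy log2 intensities (hypothetical): the third replicate of each probe runs high.
GROUPS = ["m", "m", "m", "u", "u", "u"]
DATA = [
    [10.0, 10.2, 11.1, 8.0, 8.1, 9.0],
    [12.0, 12.1, 13.0, 12.0, 12.2, 13.1],
    [9.0, 9.1, 10.2, 10.0, 10.1, 11.0],
]

if __name__ == "__main__":
    cleaned = remove_one_trend(DATA, GROUPS)
    print(residual_ss(DATA, GROUPS), residual_ss(cleaned, GROUPS))
```

Because the trend is estimated from residuals only, each peptide's group means (and therefore the methylated versus unmethylated difference) are unchanged by the correction, while within-group scatter is reduced.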

II.2.3 Model-based differential expression and presence/absence analysis

Differential expression analysis was applied to the combined, human-limited and mouse-limited datasets across two probe conditions. Conditions one and two refer to proteins bound to the methylated CG (mCG) over unmethylated CG probes and to the methylated CA (mCA) over unmethylated CA probes, respectively (Figure 2.1). The F-statistics from each dataset are summed to give a test statistic that indicates whether the experimental groups of interest differ from one another. P-values are estimated through permutation tests that produce a null distribution of the sum of F-statistics. The log2FC values obtained for proteins within the DE analysis represent the average of the log2 rolled-up, normalised and imputed peptide intensity values of each methylated probe over its unmethylated counterpart. Proteins were called significant at a fold-change cut-off of 1.2 and a p-value threshold of 0.05, determined by comparisons to external experiments in which previously identified mCG or CG binders were characterised.
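A permutation test on a summed F-statistic can be illustrated as follows. This is a minimal sketch under stated assumptions, not the ProteoMM implementation: it works on one protein-level value per replicate, enumerates every relabelling of replicates within each dataset independently, and uses hypothetical names and toy values.

```python
from itertools import combinations


def f_statistic(group_a, group_b):
    """One-way ANOVA F-statistic for two groups (between df = 1)."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    grand = (sum(group_a) + sum(group_b)) / (n_a + n_b)
    between = n_a * (mean_a - grand) ** 2 + n_b * (mean_b - grand) ** 2
    within = (sum((x - mean_a) ** 2 for x in group_a)
              + sum((x - mean_b) ** 2 for x in group_b))
    return between / (within / (n_a + n_b - 2))


def all_relabelled_f(values, n_first):
    """F-statistic for every way of assigning n_first samples to the first group."""
    indices = range(len(values))
    stats = []
    for chosen in combinations(indices, n_first):
        first = [values[i] for i in chosen]
        second = [values[i] for i in indices if i not in chosen]
        stats.append(f_statistic(first, second))
    return stats


def sum_f_permutation_test(datasets):
    """datasets: list of (methylated, unmethylated) log2 intensity lists for one protein."""
    observed = sum(f_statistic(m, u) for m, u in datasets)
    null = [0.0]
    for m, u in datasets:
        stats = all_relabelled_f(m + u, len(m))
        null = [total + s for total in null for s in stats]
    # Small tolerance so relabellings equivalent to the observed one are counted.
    p_value = sum(s >= observed - 1e-9 for s in null) / len(null)
    return observed, p_value


def log2_fold_change(methylated, unmethylated):
    """Difference of mean log2 intensities (methylated over unmethylated)."""
    return sum(methylated) / len(methylated) - sum(unmethylated) / len(unmethylated)


# Toy log2 intensities (hypothetical) for one protein seen in both species.
HUMAN = ([10.1, 10.4, 10.2], [8.0, 8.3, 8.1])
MOUSE = ([11.0, 11.2, 11.5], [9.9, 9.6, 9.8])

if __name__ == "__main__":
    print(sum_f_permutation_test([HUMAN, MOUSE]))
    print(log2_fold_change(*HUMAN))
```

With 3 replicates per probe, a single dataset has only 20 possible relabellings, so the smallest p-value an exhaustive test of this kind can return is 2/20 = 0.1. Summing the statistic across two datasets enlarges the null distribution to 400 combinations, which is one way of seeing why a combined analysis can reach stricter significance thresholds.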

The P/A analysis deals with peptides belonging to a protein that were observed in one probe condition but had no observed intensities in the corresponding probe. As with DE analysis, P/A was applied to the combined, human-limited and mouse-limited datasets across two probe conditions. These proteins are not in the normalised data because they contain no observations within the corresponding context and cannot be subjected to model-based normalisation and imputation. Proteins are visualised on a volcano plot, with ‘percent observed’ corresponding to the number of observed values within a particular context divided by all potential values within that context. For example, if protein A had 4 peptides within the mCG/CG dataset with observations only for the mCG probe, it has a potential of 12 observations (4 peptides × 3 replicates = 12 total possible observations). If 9 of those 12 have recorded intensity values, it will have a percent observed value of 75% (observed peptide observations / total possible observations × 100). The threshold for P/A was set at 50% with a p-value cut-off of 0.1.
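The percent observed calculation in the worked example can be written out as a short Python sketch. The function names and the toy peptide values are hypothetical, and whether the thresholds are inclusive is an assumption made for illustration.

```python
def percent_observed(peptides):
    """peptides: one list of replicate values per peptide; None marks a missing value."""
    total = sum(len(replicates) for replicates in peptides)
    observed = sum(v is not None for replicates in peptides for v in replicates)
    return 100.0 * observed / total


def passes_pa_threshold(pct, p_value, min_pct=50.0, max_p=0.1):
    """Apply the P/A cut-offs (inclusive comparison is an assumption)."""
    return pct >= min_pct and p_value <= max_p


# Protein A: 4 peptides x 3 mCG replicates, 9 of 12 values recorded (toy values).
PROTEIN_A = [
    [21.3, 20.9, 21.1],
    [19.8, None, 20.2],
    [22.0, 21.7, None],
    [None, 18.9, 19.4],
]

if __name__ == "__main__":
    pct = percent_observed(PROTEIN_A)
    print(pct, passes_pa_threshold(pct, p_value=0.04))
```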

II.2.4 Identification of Transcription factors, interactors and protein family information within data

Ensembl, InterPro, and GO ID information were downloaded using biomaRt, part of Bioconductor version 3.4 [7]. DNA-binding domain (DBD)-containing protein information was appended onto each dataset using GO ‘DNA binding’ and ‘nucleic acid binding’ term identification. The GO IDs for DNA or nucleic acid binding properties were selected and matched to each dataset. Protein domain information was appended onto the dataset for DBD-containing proteins through InterPro ID information downloaded from the BioMart repository. Ensembl peptide IDs and GeneIDs were used to link DBD-containing proteins and protein family datasets with the MS data. To determine the efficiency of matching, a published transcription factor-only dataset was used as a proxy. The GO ID-based matching system with DBD-containing proteins attached identified 99% of the proteins within the transcription factor dataset and was therefore adopted for use in the proteomics datasets. It is important to note that a positive match with a GO ID for DNA binding indicates the presence of a DNA binding domain but not necessarily a transcription factor, since many proteins contain nucleic acid binding domains in addition to other domains such as protein interaction domains. Therefore, this list was loosely termed DBD-containing proteins and the remaining proteins were classified broadly into a protein ‘interactor’ category.

II.2.5 Benchmarking ProteoMM

To ensure ProteoMM performed the analysis correctly, the human and mouse mCG/CG datasets were analysed independently with Perseus and then compared to ProteoMM at the per-protein and global dataset levels. The output from MaxQuant containing LFQ and raw intensity values is compatible with Perseus and was used as the input for the Perseus-based analyses. The Perseus analysis was therefore implemented on both LFQ and raw intensity values. Briefly, each analysis was subjected to the following pipeline. Proteins with “Reverse”, “Only identified by site” and “Contaminants” matches were filtered out. Intensity values were log2 transformed, and filtered to require 3 valid values among the 6 replicates. Imputation was performed by drawing from a normal distribution with the following settings: width = 0.3 and downshift = 1.8. A two sample t-test was performed with an FDR of 0.05.
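The valid-value filter and down-shifted normal imputation used in the Perseus pipeline can be sketched as below. This is an illustration of the general approach (missing values drawn from a normal distribution whose mean is shifted down by `downshift` standard deviations and whose spread is `width` standard deviations of the observed values in a sample), not Perseus code. The use of the sample standard deviation, the fixed random seed and all names are assumptions.

```python
import random
import statistics


def has_enough_valid_values(row, minimum=3):
    """Keep a protein only if at least `minimum` replicate values are present."""
    return sum(v is not None for v in row) >= minimum


def impute_from_downshifted_normal(column, width=0.3, downshift=1.8, seed=0):
    """Replace None in one sample column with draws from a down-shifted normal."""
    observed = [v for v in column if v is not None]
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)  # sample standard deviation (assumption)
    rng = random.Random(seed)        # fixed seed so the sketch is reproducible
    return [v if v is not None else rng.gauss(mean - downshift * sd, width * sd)
            for v in column]


# Toy log2 intensities for one sample column (hypothetical values).
COLUMN = [24.1, 22.8, None, 25.3, 23.9, None, 26.0, 24.7]

if __name__ == "__main__":
    print(impute_from_downshifted_normal(COLUMN))
    print(has_enough_valid_values([21.0, None, 20.5, None, None, 19.9]))
```

Drawing from a narrow, down-shifted distribution reflects the assumption that values are missing mainly because they fell below the detection limit, which differs from the model-based imputation used by ProteoMM.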


A subset of proteins was chosen for initial comparisons to gauge overlap between ProteoMM and Perseus by LFQ or Perseus by raw intensity. It was reasoned that proteins significant in both human and mouse represent high confidence interactors in every analysis method. This subset was already available from the combined list in ProteoMM and was obtained for Perseus-based analyses by the intersection of human and mouse proteins significant within each dataset. The distributions in log2FC among all 3 methods were largely similar. Therefore, a base cut-off of log2FC ≥ 1.2 and p-value ≤ 0.05 was used for all analyses, given that these were the parameters used in the DE analysis for ProteoMM. The subset of proteins common to human and mouse meeting the above significance threshold from each analysis was tabulated and inspected individually to assess the performance of ProteoMM on a per-protein basis when compared to Perseus by LFQ or raw intensity.

For a more global comparison, the three analysis outputs were compared to a repository of SELEX-based methyl-sensitive transcription factor data that comprehensively profiled transcription factors with an affinity for or repulsion to methylation in the CG context [8]. For levelled comparisons between Perseus (whose output was per species) and ProteoMM, proteins within the combined analysis were merged with the respective species-limited datasets to create “entire” species datasets. For example, the human log2FC, GeneID and p-values within the combined analysis were merged with the human-limited datasets (also containing log2FC, GeneID and p-values) to create the entire-human dataset. Classifications from the SELEX repository were grouped into ‘methyl-plus’ for transcription factors with an affinity for mCG and ‘methyl-minus’ for those repelled by mCG. The term ‘other’ was adopted for cases in which the outcome of SELEX was inconclusive, the transcription factors bound both mCG and CG, or methylation had little effect in either context. Using these criteria, the fraction of proteins matching each SELEX classification was calculated over the total number of transcription factors matching to SELEX and expressed as a percentage at each incrementally changing log2FC threshold.

The log2FC thresholds for all 3 analysis pipelines ranged from -2 to 2 in increments of 0.2, whereby negative log2FC values represent ‘called’ CG binders matching to methyl-minus SELEX data whilst positive log2FC values represent ‘called’ mCG binders matching to methyl-plus SELEX data.

In addition to comparisons with SELEX, all significant mCG and CG DNA binders within the DE and P/A analyses were matched against previous studies that identified proteins with an affinity for or repulsion to mCG. The output of ProteoMM was therefore assessed in terms of the number of proteins in agreement with or conflicting with previous studies. This was also important because it enabled identification of novel DBD-containing proteins not identified in previous studies with affinity for mCG and CG.

II.2.6 Gene ontology analysis

Significantly enriched proteins from combined DE, species-limited DE and P/A analyses were merged to create complete human and complete mouse enriched mCG, CG, mCA and CA lists. Gene ontology analysis was performed using DAVID 6.8 [9] with the human and mouse lists as input and the corresponding human and mouse whole genome background lists already present in the DAVID database. GO direct terms “biological process”, “cellular component” and “molecular function” were selected, with an EASE setting of 0.05. GO IDs and p-values were then submitted to REVIGO [10] with “Medium (0.7)” similarity output settings and the “SimRel” semantic similarity measure. An intermediary R scatterplot script containing semantic space ‘x’, semantic space ‘y’, GO terms and log10 p-value data was exported for each list. Final R scripts were generated with appropriate parameters and modified aesthetics.

Validation of mC reader binding

II.3.1 RT-PCR

RNA was obtained from human frontal cortex using the RNeasy mini kit (Qiagen) and was reverse transcribed using SuperScript II Reverse Transcriptase and gene-specific primers for MECP2 (see Table S2.1). Briefly, 175mg of human frontal cortex brain tissue was used as input per RNA extraction and resuspended in 35µL RNase free water. For cDNA synthesis, 1µg of RNA was used per cDNA reaction with 1µL of 10µM gene specific primer.

II.3.2 Cloning of mCA reader candidates

The insertion of full-length MECP2 and MBD2 into pETM11 and pETM41 was performed as follows. MECP2 cDNA derived from the human frontal cortex was PCR amplified with primers incorporating 5’ NcoI and 3’ NotI restriction enzyme cut sites. These were used to ligate MECP2 into the pETM11 and pETM41 plasmid backbones. A gBlock ordered from Integrated DNA Technologies (IDT) was obtained for MBD2 with engineered 5’ NcoI and 3’ NotI restriction enzyme cut sites used for ligation into pETM11 and pETM41 plasmid backbones (Table S2.2). Correct insertion, and the absence of mutations in each sequence, was verified by sequencing. The pETM11 and pETM41 plasmids contained a lac operator under the control of a T7 promoter and a gene conferring kanamycin resistance. In addition, each plasmid harbours 5’ and 3’ 6XHistidine tags (HIS), enclosing a Tobacco Etch Virus (TEV) cleavage site situated upstream of the insertion site of the multiple cloning site (MCS). The pETM41 plasmid also contained a downstream Maltose-binding protein (MBP) gene that is co-expressed with the product cloned within the MCS.

The insertion of MBD-domain only sequences into pETM11 was as follows. A sequence containing the MBD domain of MECP2 and MBD2 was selected for insertion into the bacterial expression plasmid pETM11. Each MBD sequence was ordered as a gBlock (IDT), after undergoing bacterial codon optimisation using the IDT codon optimisation tool. The sequence selected for MBD-MECP2 was taken from a prior publication [11]. Each sequence also contained 5’ NcoI and 3’ NotI restriction enzyme cut sites used for ligation into the pETM11 backbone. Following ligation, each insert was sent for sequencing to the Garvan Institute of Medical Research.

II.3.3 Protein expression

Each plasmid was transformed by electroporation of 100ng of DNA into Rosetta BL21(DE3) E. coli, which were grown overnight at 37°C on Luria-Bertani (LB) agar plates supplemented with 50µg/mL kanamycin and 120µg/mL chloramphenicol. A single colony was grown in 5mL pre-culture LB media supplemented with 50µg/mL kanamycin, 1mM ZnCl2 and 1mM MgCl2 overnight at 37°C. For MBD-MECP2, an aliquot of pre-culture was transferred at a ratio of 1:500 to LB with the same supplements as the pre-culture and grown at 37°C until an OD of 0.7-0.8 before rapidly cooling the culture to 16°C. IPTG was added at a final concentration of 0.4mM and the culture was grown overnight at 16°C. For MBD-MBD2, an aliquot of pre-culture was transferred at a ratio of 1:500 to LB with the same supplements as the pre-culture and grown at 30°C until an OD of 0.7-0.8 before rapidly cooling the culture to 16°C. IPTG was added at a final concentration of 0.4mM and the culture was grown for 4 hours at 16°C.

II.3.4 Protein Purification

Cells were harvested by spinning at 4°C at 6000 x g for 10 minutes. All steps were performed at 4°C and all buffers were pre-chilled to 4°C. Bacterial pellets were resuspended in chilled lysis buffer at a ratio of 1g pellet:0.5mL lysis buffer. Lysis buffer contained 150mM NaCl, 20mM HEPES, pH 7.5, 10% Glycerol and freshly added 0.5mM TCEP, 1mM PMSF, 1X cOmplete ULTRA Mini EDTA-free protease inhibitor tablets (Roche) and 1µL of Benzonase® Nuclease at 250 units/µL. Protein extract was obtained through 3 rounds of sonication at 40% amplitude with bursts of 10 seconds on and 10 seconds off. His-tag isolation of protein was carried out as follows. The lysate was spun at 21,000 x g for 30 minutes at 4°C. The clarified supernatant was incubated for 90 minutes at 4°C with pre-equilibrated Protino Ni-NTA beads (Clontech) at a ratio of 1mL beads:10mL lysis buffer containing 10mM Imidazole. Each centrifugation step was performed at 500 x g for 5 minutes in a swing bucket centrifuge cooled to 4°C. Following incubation, beads were centrifuged and washed twice in 10mL chilled protein wash buffer containing 150mM NaCl, 20mM HEPES, pH 7.5, 10% Glycerol, 20mM Imidazole and freshly added 0.5mM TCEP, each time being collected by centrifugation. Beads were then transferred to 10mL Poly-Prep chromatography columns (Bio-Rad) and washed once in protein wash buffer containing 30mM imidazole and twice more in protein wash buffer containing 40mM and 50mM imidazole respectively. Proteins were eluted at a ratio of 1mL beads:2mL protein elution buffer containing 75mM NaCl, 20mM HEPES, pH 7.5, 10% Glycerol, 250mM Imidazole and freshly added 0.5mM TCEP. Nucleic acid contamination was removed by running the purified extract through a 5mL HiTrap heparin HP column (VWR). Protein fractions were obtained by salt exchange chromatography through an increasing NaCl gradient protocol beginning at 75mM and ending at 2M NaCl in buffer containing 20mM HEPES, pH 7.5, 10% Glycerol and freshly added 0.5mM TCEP. Peak fraction eluates corresponding to each protein were pooled and dialysed for 5 hours at 4°C in 2L of EMSA binding buffer (150mM KCl, 20mM HEPES, pH 7.5, 10% Glycerol and freshly added 0.5mM TCEP) and concentrated in Vivaspin protein concentrators (VWR) with MWCO 10,000 and MWCO 3000 for MBD-MECP2 and MBD-MBD2 respectively (Figure 5.6). Protein concentration was determined by Bradford assay.

II.3.5 Probe design

Complementary 5’ 6-FAM (fluorescein) ssDNA oligonucleotides were synthesised by IDT (Table S2.3) and contained the following sequences:

Oligonucleotide mCG1 (5’ 6-FAM- GAT GAT GAmC GAmC GAmC GAmC GAT GAT G-3’). Oligonucleotide mCG2 (5’ 6-FAM- CAT CAT mCGT mCGT mCGT mCGT CAT CAT C-3’). Oligonucleotide CG1 (5’ 6-FAM- GAT GAT GAC GAC GAC GAC GAT GAT G-3’). Oligonucleotide CG2 (5’ 6-FAM- CAT CAT CGT CGT CGT CGT CAT CAT C-3’). Oligonucleotide mCA (5’ 6-FAM- GAT GAT GTA mCAC TAmC ACT AmCA CTA mCAC ATG ATG-3’). Oligonucleotide CA1 (5’ 6-FAM- GAT GAT GTA CAC TAC ACT ACA CTA CAC ATG ATG-3’, as listed in Table S2.3). Oligonucleotide CA2 (5’ 6-FAM- CAT CAT GTG TAG TGT AGT GTA GTG TAC ATC ATC-3’).

Oligonucleotide pairs mCG1 and mCG2, CG1 and CG2, mCA and CA2, and CA1 and CA2 were annealed in 1X NEB buffer 2.0 by placing complementary sequences in a 95°C water-bath for 2 minutes and cooling gradually to room temperature.

II.3.6 Electrophoretic Mobility Shift Assay (EMSA)

Non-denaturing gels were cast with a 6%, 29:1 Acrylamide:Bis solution mix (Bio-Rad), and contained 0.5X Tris-Borate EDTA and 2.5% glycerol. Electrophoretic Mobility Shift Assay (EMSA) binding conditions were as follows. The binding reaction was carried out in EMSA binding buffer containing 150mM KCl, 20mM HEPES, pH 7.5, 1.5mM MgCl2, 0.2mM β-mercaptoethanol, 10% Glycerol and freshly added 0.5mM TCEP. Each reaction contained 2µg of Bovine Serum Albumin as a non-specific protein additive, and Poly-dI-dC at 50ng/reaction was used as a non-specific DNA competitor. A 1X working concentration of protein corresponded to 320nM MBD-MECP2 and 110nM MBD-MBD2, incubated in each reaction with fluorescent probes at 25nM. In reactions containing specific competitor DNA, the competitor was at 100X the concentration of the labelled probe. Binding reactions were incubated at 4°C for 30 minutes and run at 100V for 30 minutes.
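The concentrations implied by these binding conditions can be made explicit with a short calculation. The Python sketch below only restates the figures given above; it assumes that “100X” refers to a 100-fold molar excess over the 25nM labelled probe, and the names are hypothetical.

```python
PROBE_NM = 25.0                                           # labelled probe per reaction
PROTEIN_1X_NM = {"MBD-MECP2": 320.0, "MBD-MBD2": 110.0}   # 1X working concentrations
COMPETITOR_FOLD = 100                                     # specific competitor excess


def competitor_concentration_nm(probe_nm=PROBE_NM, fold=COMPETITOR_FOLD):
    """Unlabelled specific competitor concentration in nM."""
    return probe_nm * fold


def protein_to_probe_ratio(protein_nm, probe_nm=PROBE_NM):
    """Molar excess of protein over labelled probe."""
    return protein_nm / probe_nm


if __name__ == "__main__":
    print(competitor_concentration_nm() / 1000.0, "uM competitor")
    for name, conc in PROTEIN_1X_NM.items():
        print(name, round(protein_to_probe_ratio(conc), 1), "fold over probe")
```

Under these assumptions the specific competitor is present at 2.5µM, and the 1X protein concentrations correspond to roughly 12.8-fold (MBD-MECP2) and 4.4-fold (MBD-MBD2) molar excesses over the probe.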


References

1. Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized ppb-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 26, 1367 (2008).
2. Karpievitch, Y. et al. A statistical framework for protein quantitation in bottom-up MS-based proteomics. Bioinformatics 25, 2028–2034 (2009).
3. Karpievitch, Y. V. et al. Normalization of peak intensities in bottom-up MS-based proteomics using singular value decomposition. Bioinformatics 25, 2573–2580 (2009).
4. Shearer, J. J. et al. Inorganic Arsenic–Related Changes in the Stromal Tumor Microenvironment in a Prostate Cancer Cell–Conditioned Media Model. Environ. Health Perspect. 124, 1009–1015 (2016).
5. Taverner, T. et al. DanteR: an extensible R-based tool for quantitative analysis of -omics data. Bioinformatics 28, 2404–2406 (2012).
6. Karpievitch, Y. V., Nikolic, S. B., Wilson, R., Sharman, J. E. & Edwards, L. M. Metabolomics data normalization with EigenMS. PLoS One 9, e116221 (2014).
7. Smedley, D. et al. The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic Acids Res. 43, W589–98 (2015).
8. Yin, Y. et al. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science 356, (2017).
9. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57 (2009).
10. Supek, F., Bošnjak, M., Škunca, N. & Šmuc, T. REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS One 6, e21800 (2011).
11. Hashimoto, H. et al. Recognition and potential mechanisms for replication and erasure of cytosine hydroxymethylation. Nucleic Acids Res. 40, 4841–4849 (2012).

Supplementary information

Table S2.1: Primers used within thesis

Primer ID | Sequence | Target | Purpose
O51 | TTCTGGCCCTGGTTAGGTCT | gRNA | MeCP2 cDNA synthesis
O74 | TTCTGGCCCTGGTTAGGTCT | gRNA | MBD2 cDNA synthesis
O69 | AGGATTCCATGGTAGCTGGGATGTTAGG | PCR | MeCP2 forward
O70 | CTCGCGGCCGCTCAGCTAACTCTCTCGGTCACGGG | PCR | MeCP2 reverse
O73 | TAATACGACTCACTATAGGG | Sequencing | 5’ MCS pETM11 sequencing
O72 | GCTAGTTATTGCTCAGCGG | Sequencing | 3’ MCS pETM11 sequencing
O48 | AGGAGAAGAGACAACAGCTGCC | Sequencing | Internal sequencing primer MECP2
O50 | CCCTGAAGCCACGAAACTCT | Sequencing | Internal sequencing primer MECP2
O91 | ACTGTCCAGCGTTACCTCCT | Sequencing | Internal sequencing primer MBD2

Table S2.2: Plasmids

Fragment | Genetic element | Origin | Plasmid ID
F1 | MeCP2 | gRNA | pETM11-MeCP2
F2 | MBD2 | gRNA | pETM11-MBD2
F3 | MeCP2 | gRNA | pETM41-MeCP2
F4 | MBD2 | gRNA | pETM41-MBD2
F5 | MBD-MeCP2 | gBlock | pETM11-MBD-MeCP2
F6 | MBD-MBD2 | gBlock | pETM11-MBD-MBD2

Table S2.3: Tagged Oligonucleotides. Note 5Biosg denotes a 5’ biotin mark, iMe-dC a methyl mark and 56-FAM the tagged 5’ fluorescein molecule.

Primer ID | Sequence | Target | Purpose
Oligo 1 | 5Biosg/GAT GAT GA/iMe-dC/ GA/iMe-dC/ GA/iMe-dC/ GA/iMe-dC/ GAT GAT G | 5’ Biotin tagged mCG forward | DNA pull-down
Oligo 2 | CAT CAT /iMe-dC/GT /iMe-dC/GT /iMe-dC/GT /iMe-dC/GT CAT CAT C | mCG reverse | DNA pull-down
Oligo 3 | 5Biosg/GAT GAT GAC GAC GAC GAC GAT GAT G | 5’ Biotin tagged CG forward | DNA pull-down
Oligo 4 | CAT CAT CGT CGT CGT CGT CAT CAT C | CG reverse | DNA pull-down
Oligo 5 | 5Biosg/GAT GAT GTA /iMe-dC/AC TA/iMe-dC/ ACT A/iMe-dC/A CTA /iMe-dC/AC ATG ATG | 5’ Biotin tagged mCA forward | DNA pull-down
Oligo 6 | 5Biosg/GAT GAT GTA CAC TAC ACT ACA CTA CAC ATG ATG | 5’ Biotin tagged CA forward | DNA pull-down
Oligo 7 | CAT CAT GTG TAG TGT AGT GTA GTG TAC ATC ATC | mCA/CA reverse | DNA pull-down
Oligo 8 | 56-FAM/GAT GAT GA/iMe-dC/ GA/iMe-dC/ GA/iMe-dC/ GA/iMe-dC/ GAT GAT G | 5’ 6 FAM tagged mCG forward | EMSA
Oligo 9 | CAT CAT /iMe-dC/GT /iMe-dC/GT /iMe-dC/GT /iMe-dC/GT CAT CAT C | mCG reverse | EMSA
Oligo 10 | 56-FAM/GAT GAT GAC GAC GAC GAC GAT GAT G | 5’ 6 FAM tagged CG forward | EMSA
Oligo 11 | CAT CAT CGT CGT CGT CGT CAT CAT C | CG reverse | EMSA
Oligo 12 | 56-FAM/GAT GAT GTA /iMe-dC/AC TA/iMe-dC/ ACT A/iMe-dC/A CTA /iMe-dC/AC ATG ATG | 5’ 6 FAM tagged mCA forward | EMSA
Oligo 13 | 56-FAM/GAT GAT GTA CAC TAC ACT ACA CTA CAC ATG ATG | 5’ 6 FAM tagged CA forward | EMSA
Oligo 14 | CAT CAT GTG TAG TGT AGT GTA GTG TAC ATC ATC | mCA/CA reverse | EMSA
Oligo 15 | 56-FAM/GA TGA TGT ATA CTA TAC TAT ACT ATA CAT GAT G | 5’ 6 FAM tagged TA forward | EMSA
Oligo 16 | CAT CAT GTA TAG TAT AGT ATA GTA TAC ATC ATC | TA reverse | EMSA

Table S2.4: Antibodies

ID | Source | Company
AB2828 | Primary polyclonal rabbit anti-MeCP2 | Abcam
A01410 | Primary monoclonal mouse anti-alpha tubulin | GenScript
MeCP2 | horseradish peroxidase, goat anti-rabbit | Thermo Fisher
Actin | horseradish peroxidase, rabbit anti-mouse | Thermo Fisher

Development and optimisation of ProteoMM, a multivariate statistical analysis tool

Summary

Mass spectrometry (MS) based proteomics enables large-scale evaluation of protein functions, making proteome-wide analysis possible. In particular, the application of quantitative MS to crude protein lysates has been a powerful tool for identifying protein-protein and protein-DNA interactions. However, MS proteomics presents inherent challenges: sample preparation can be costly or delicate, and the resulting large-scale datasets require complex analysis pipelines. In addition, MS proteomics is permeated by missing data that requires specialised downstream analysis pipelines utilising various normalisation and imputation methods. Thus, in undertaking an affinity-proteomics analysis of mC binding proteins, there was a need for a robust, user-friendly, R-based statistical analysis tool (ProteoMM) capable of analysing multiple MS datasets simultaneously. To this end, ProteoMM, an MS analysis tool, was written by Dr Yuliya Karpievitch (Lister lab). My responsibility was to ensure that the script developed by Dr Karpievitch ran without errors to produce a streamlined analysis pipeline. I was also responsible for writing the scripts tailored to the results. These include matching proteins and interactors to GO databases, and graphically displaying data as heatmaps, scatterplots, volcano plots, UpSet plots and GO plots generated from the REVIGO tool. Together with Dr Karpievitch, I used ProteoMM to assess mC reader conservation in brain tissue derived from human and mouse samples using peptide intensity driven, label-free proteomics analysis approaches. The ProteoMM analysis pipeline was benchmarked against existing proteomics tools using raw intensity and LFQ intensity values analysed by Perseus, a commonly employed MS analysis tool.
The percentage of correctly enriched proteins within each method was ascertained by comparisons to known mCG and CG DNA-binding proteins that have already been characterised. ProteoMM outperformed existing LFQ generated values and provided superior results to Perseus raw peptide intensity-based analyses, by identifying a higher number of differentially expressed (DE) proteins with already established affinity for mCG within these datasets. These results demonstrate the capabilities of ProteoMM in maximising protein coverage through incorporation of more sophisticated, reliable and informative normalisation and imputation methods relative to other methods tested.


Introduction

The identification and characterisation of proteins and their interactions are fundamental to biology given their integral roles in all cellular pathways and processes. Most diseases can be linked to protein dysfunction, either by changes to their sequence or modification states that affect protein structure and function. The term proteome was first introduced in 1996 to describe the landscape of all proteins and their modifications, expressed by cells at a certain time point within a given tissue1. MS-based proteomics enables the evaluation of proteins and their interactions and now enables proteome-wide analyses in many organisms and cell types. Detailed within this chapter is an overview of the current MS proteomics approaches relevant to this project. Further, this chapter discusses the rationale behind the development of a new MS statistical analysis pipeline called ProteoMM that improves upon previous univariate analyses by maximising the statistical output through the incorporation of multiple datasets that are analysed at the peptide level.

III.1.1 Overview of MS proteomics

MS proteomics can be divided into two main branches: bottom-up and top-down proteomics. Both depend on ionisation of the peptides or proteins, which are then accelerated in an electric or magnetic field to yield the mass-to-charge ratio (m/z) used to identify the protein2,3. Bottom-up proteomics, whereby proteins are enzymatically or chemically cleaved into smaller peptides, is the conventional and most dominant form of MS proteomics. Prior to analysis, peptides are fractionated using chromatography-based methods. Top-down proteomics refers to the analysis of full-length proteins without cleavage by chemical or enzymatic means. As such, full-length approaches offer a different set of advantages, permitting the characterisation of PTMs, protein isoforms, and protein complexes4. Whilst bottom-up proteomics can be readily performed on a range of instruments, top-down proteomics requires specific instrumentation and presents inherent pitfalls such as incomplete protein ionisation and difficulty in characterising proteins of increasing molecular weight5. In recent years, the analysis of intact proteins has become more successful, with technological advances providing improved ion focusing, higher resolution, and greater accuracy6. The choice of bottom-up or top-down MS reflects the research or clinical question. Within this project, bottom-up proteomics was employed to identify novel mC readers within the human and mouse brain.
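As a brief illustration of the mass-to-charge ratio mentioned above, the sketch below converts between a neutral monoisotopic mass and the m/z of a positively charged ion, assuming protonation as the only source of charge (the usual case for positive-mode peptide analysis). The 1500 Da peptide mass is a hypothetical value chosen for the example.

```python
PROTON_MASS_DA = 1.007276  # mass of a proton in daltons


def mz(neutral_mass_da: float, charge: int) -> float:
    """m/z of an [M + zH]z+ ion for a given neutral mass and charge state."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass_da + charge * PROTON_MASS_DA) / charge


def neutral_mass(mz_value: float, charge: int) -> float:
    """Inverse of mz(): recover the neutral mass from an observed m/z."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return charge * (mz_value - PROTON_MASS_DA)


# Hypothetical peptide of 1500 Da observed at charge states 1+ to 3+.
for z in (1, 2, 3):
    print(f"z = {z}: m/z = {mz(1500.0, z):.4f}")
```

The same peptide therefore appears at a different m/z for each charge state, which is why charge assignment is needed before a mass can be inferred.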


The first amino acid ionisation was produced by irradiating an alanine and tryptophan mixture within a matrix using a 266nm laser, and played a major role in the development of MS proteomics7. Two major techniques are now routinely used to generate ions in MS proteomics, and the choice depends on the state of the material. The first, matrix-assisted laser desorption ionisation (MALDI), relies on laser pulses passing through samples embedded within a solid, dry matrix. The second, electrospray ionisation (ESI), ionises samples in solution8. MALDI is reliant upon crystalline, matrix-based sources and is a robust, cheap method for the analysis of relatively simple, intact peptide mixtures8. Complex mixtures can be analysed by MALDI, but require a careful choice of a suitable matrix and optimisation of analyte concentration9. Applications of MALDI encompass a variety of areas ranging from clinical screening to biological assessment of protein interactions10,11 and analysis of DNA, RNA, and their modifications9,12. ESI is the ionisation approach applicable to this project. It generates ions from solution-based sources and enables processing of more delicate samples, for example when studying more subtle structural amino acid compositions, as it preserves non-covalent interactions13. In addition, ESI is more easily coupled to liquid-based chromatographic separation techniques and is therefore well suited to the separation and analysis of more complex protein mixtures8,14,15. ESI has a range of diverse applications within proteomics, and also within drug and biomarker discovery, where it enables characterisation of ligand-target interactions16,17. Recently, ESI has also been used for lipid profiling within single cells18 and, for example, to identify natural metabolites that specifically target primary leukaemia cells within a pool of heterogeneous cells, leaving the healthy cells untouched19.

Following ionisation, trapped peptide cations undergo collision-induced dissociation (CID) or electron capture dissociation (ECD), resulting in fragmentation20. CID increases the kinetic energy of ions, which then collide with neutral molecules or with one another, while ECD relies on electron bombardment to generate smaller ions21. Various instruments incorporate variations of ionisation and dissociation mechanisms. Triple quadrupole mass spectrometers use lower-energy CID: peptides are filtered by mass in the first quadrupole (Q1), accelerated into Q2, where they undergo fragmentation, and finally scanned in Q3. Each instrument has its own advantages and disadvantages, differing in resolution, accuracy, sensitivity, m/z range, processing time and ion source. The instrument used within this experiment was an Orbitrap Fusion™ Tribrid™ Mass Spectrometer, which offers flexible fragmentation and dissociation methods. ESI was used to generate peptide ions, whilst fragmentation was by higher-energy collisional dissociation (HCD). HCD is a form of CID utilising higher radio frequency voltages that induce higher energy collisions22. HCD is slower than CID because fragmentation is performed in an external HCD cell and the fragments are then transferred to another cell, where they are cooled before detection. However, unlike low-energy CID, HCD is not inhibited by low molecular weight compounds and generates more ions with increased resolution and was, therefore, the method of choice23,24.

III.1.2 The identification of protein interactions by MS

Most proteins within the cell exist as part of larger ‘cellular machines’ and represent specific associations of many gene products functioning cooperatively25,26. Their macromolecular organisation is dependent on energy-driven conformational changes, specific post-translational modifications, and chaperone-assisted processes that vary in response to cellular requirements27,28. Initial attempts to investigate protein interactions used large yeast two-hybrid experiments, in which the interaction of two proteins of interest results in transcriptional activation of a reporter gene. However, caveats to yeast two-hybrid assays included high false positive and false negative rates, a lack of stoichiometric information and a limited set of test conditions, making affinity purification mass spectrometry (AP-MS) approaches more desirable29,30. Streamlined, robust quantitative proteomics overcomes many of the problems facing yeast two-hybrid systems and has become the method of choice for identifying protein interactions. Quantitative MS, unlike yeast two-hybrid and AP-MS, is not limited to single protein affinity purifications and allows for high-throughput measurements of protein abundance between samples, offering high proteome coverage and wide applicability31.

III.1.3 Tandem affinity purification

Tandem-affinity purification (TAP) is a form of AP-MS that relies upon the fusion of a tag onto the target protein within the host cell or organism, followed by purification and analysis32. Commonly used, commercially available tags such as FLAG, TAP, and GFP enable robust, streamlined purification. However, the introduction of a tag may interfere with protein interactions or subcellular protein localisation33,34. The use of antibodies against proteins or their tags is another commonly employed method, but is limited by antibody availability and unwanted cross-reactivity of the antibody35. Alternatives include streptavidin-based pull-downs that target an artificially synthesised biotinylated peptide used to tag the protein of interest, but these inherently enrich for many endogenous biotinylated proteins that constitute non-specific interactions36. AP-MS is also more generally susceptible to increased background noise, which may arise if the tagged protein is expressed above endogenous levels. Lastly, advances in MS technologies have produced instruments with higher peptide detection sensitivity that can detect low peptide levels, making discrimination from background difficult and leading to more false positives. More stringent purification protocols, whilst reducing background levels, risk losing biologically relevant weaker interactions37. In recent years, quantitative MS approaches have been developed to overcome the problems afflicting AP-MS, such as false positive identifications, whilst allowing for lower stringency purification schemes38.

III.1.4 Quantitative Mass spectrometry by isotopic labelling

Traditional quantitative MS relies on the incorporation of differentially labelled isotopic molecules. The labels may be introduced by chemical modification of peptides or by metabolic labelling of intact proteins during cell culture39. The identification of protein interactions by quantitative MS relies on the premise that a protein or complex will be enriched in one condition over another (for example, a control condition), which enables reliable discrimination of highly abundant background proteins from specific interactors.

ICAT (isotope-coded affinity tags) is an example of chemical labelling that targets cysteine residues. The reactive group is attached to an isotopically coded linker carrying (typically) a biotin tag. For quantitative comparisons, one sample is labelled with the isotopically ‘light’ probe whilst the other is labelled with the isotopically ‘heavy’ probe. Samples are usually then combined, to minimise sample handling error, prior to proteolytic digestion and avidin affinity chromatography. Identical peptides from each condition will elute at the same time but have different intensities, thereby providing quantitative information40,41. For example, ICAT was used to identify Mafk-associated proteins and characterise its interaction dynamics before and after erythroid differentiation in MEL cells42. Chemical labelling offers advantages over metabolic labelling in that it can be used on animal tissue, but generally exhibits reduced quantitative accuracy. The favoured metabolic labelling method is SILAC (stable isotope labelling by amino acids in cell culture), in which the cell culture growth medium contains stable isotope-labelled derivatives of certain amino acids (usually arginine or lysine). The pull-down and its control are grown in ‘heavy’ or ‘light’ media, combined, fragmented and then subjected to MS. Peptides with different labels can be distinguished by a shift in mass and quantified by comparing relative signal intensities. SILAC is considered the most accurate quantitation method because sample mixing minimises handling errors and biases. However, its implementation requires specialised protocols and expensive materials that limit its use. SILAC has been used extensively to identify chromatin readers and within DNA-binding assays that identified differential readers of mC and its oxidised derivatives in mouse embryonic stem cells (mESCs), as well as mC and hmC readers in Neural Progenitor Cells (NPCs)43,44. Because SILAC is performed in cell culture systems, its applicability is limited to tissues for which representative cell culture systems are available. Label-free approaches have therefore recently been developed to overcome this caveat.
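The mass shift and intensity comparison that underlie SILAC quantification can be sketched as follows. This is an illustration only: it assumes the commonly used 13C6/15N2-lysine and 13C6/15N4-arginine labels, which are not specified in the text above, and the peptide sequence and intensities are hypothetical.

```python
import math

# Assumption: "heavy" labels are 13C6 15N2-lysine and 13C6 15N4-arginine.
# Mass differences relative to the unlabelled residues, in daltons.
HEAVY_SHIFT_DA = {"K": 8.014199, "R": 10.008269}


def heavy_mz_shift(peptide: str, charge: int) -> float:
    """Expected m/z offset between the heavy and light forms of a peptide."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    total_shift = sum(HEAVY_SHIFT_DA.get(residue, 0.0) for residue in peptide)
    return total_shift / charge


def log2_heavy_light(heavy_intensity: float, light_intensity: float) -> float:
    """log2 ratio of heavy to light signal for one peptide pair."""
    return math.log2(heavy_intensity / light_intensity)


# Hypothetical tryptic peptide with a single C-terminal lysine, observed at 2+.
shift = heavy_mz_shift("ELVISK", charge=2)
ratio = log2_heavy_light(heavy_intensity=8.0e6, light_intensity=1.0e6)
print(f"m/z shift: {shift:.4f}, log2(H/L): {ratio:.1f}")
```

A positive log2(H/L) indicates enrichment in whichever condition was grown in heavy medium; in practice label-swap replicates are often used to control for labelling artefacts.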

III.1.5 Quantitative Mass spectrometry by label-free methods

In label-free proteomics, quantification is achieved by comparing protein abundance between samples using spectrometric peptide signal intensities or the number of MS/MS spectra matched to peptides (spectral counting)45,46. Unlike labelling-based approaches, label-free methods allow for simultaneous quantification and identification of proteins without time-intensive and costly labelling. More importantly, label-free proteomics can be applied to any protein source, including animal-derived protein extracts. Analyses, however, require complex algorithms and statistical normalisation methods to derive quantification information. For example, MaxQuant, a software package for quantitative proteomics, offers label-free quantification (LFQ) intensities derived from raw intensities that are normalised on multiple levels, ensuring that LFQ intensities reflect the relative amounts of the proteins47. MaxQuant also calculates intensity-based absolute quantification (iBAQ) values, which are proportional to the molar quantities of proteins and are obtained by dividing a protein's raw intensities by its number of theoretical peptides. This may be useful in determining the stoichiometric ratio of a complex of interest, as was done for the MBD3/NuRD and PRC2 complexes48. The accuracy, precision and reproducibility of label-free quantification methods, while comparable to those of labelled alternatives, are reduced because samples are prepared and measured independently. A study comparing label-free approaches to metabolic and chemical labelling methods demonstrated that the spectral counting method provided the deepest coverage for identification of proteins, but that its quantification performance was worse than that of labelling approaches, especially in quantification reproducibility49. The method of choice depends upon a variety of factors including sample type, budget, time, and expertise, both in the lab and bioinformatically. Regardless of which method is employed, missing data is a major analysis caveat facing MS-based analyses.
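The iBAQ calculation described above can be sketched in a few lines. This is not MaxQuant's implementation: the in-silico digestion rule (cleavage after K or R unless followed by P) and the 6 to 30 residue length window are assumptions made for illustration, and the sequence and intensities are toy values.

```python
from typing import List


def tryptic_peptides(sequence: str, min_len: int = 6, max_len: int = 30) -> List[str]:
    """In-silico tryptic digest without missed cleavages (assumed rule set)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        at_end = i == len(sequence) - 1
        cleave = residue in "KR" and not at_end and sequence[i + 1] != "P"
        if cleave or at_end:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # Keep only peptides in the length window assumed to be observable.
    return [p for p in peptides if min_len <= len(p) <= max_len]


def ibaq(peptide_intensities: List[float], sequence: str) -> float:
    """Summed raw intensity divided by the number of theoretical peptides."""
    n_theoretical = len(tryptic_peptides(sequence))
    if n_theoretical == 0:
        raise ValueError("no theoretical peptides for this sequence")
    return sum(peptide_intensities) / n_theoretical


# Toy protein: two observable tryptic peptides plus a short C-terminal remnant.
toy_sequence = "AAAAAAKGGGGGGRCC"
toy_intensities = [4.0e6, 2.0e6]
print(tryptic_peptides(toy_sequence))        # ['AAAAAAK', 'GGGGGGR']
print(ibaq(toy_intensities, toy_sequence))   # 3000000.0
```

Dividing the iBAQ values of two subunits of a complex then gives an estimate of their relative stoichiometry.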

III.1.6 Challenges in the analysis of MS data

LC-MS/MS may incur between 10-50% missing data, which is often filtered out, resulting in a loss of potentially informative data50–52. If peptides with missing values are not filtered out, it is recommended that the observed data are first normalised and that missing values are then imputed using one of the accepted statistical methods53,54. Imputation methods for MS can be classed into two categories: single-value replacement or local-based imputation55. Single-value imputation replaces missing values by constant or randomly selected values56. A quick and simple single-value imputation method often used within proteomics sets the missing value at half the global or peptide minimum, reflecting the instrument’s limit of detection (LOD), but is the least reliable imputation method57,58. Random tail imputation (RTI) is another commonly used single-value imputation method, in which missing values are taken from the left tail of a single modelled distribution that is left-tailored59. This approach is more representative of the distribution of the missing values than selecting a minimum value, because it relies on a value modelled from the data, and may perform well for left-censored datasets in which low-intensity values, like peptide abundances, may be missing. RTI is commonly used because it is relatively easy to implement. Local-based imputation methods derive information from the structure of the dataset and impute on the basis of this information. Local-similarity based imputation methods rely on a ‘similarity’ measurement, using information from other peptides with similar intensity profiles within the same dataset to derive information for missing observations. First, a set of peptides closest to the peptide with the missing value is identified. Next, the missing value is substituted with a weighted value derived from a distance matrix calculated from the neighbouring peptides. This can be done by K nearest neighbours (KNN), which uses information from peptides with similar peak intensity profiles, by local least squares, a regression-based estimation, or by a number of other methods55,56.
Each method has its own advantages and disadvantages, which need to be considered before the analysis. For example, an important assumption when using local-similarity approaches is that genes/proteins are regulated in an inter-dependent manner and that highly correlative observations reflect a common biological role or common gene/protein network60. However, this assumption may not hold for a subset of proteins, and this needs to be weighed against the proportion of proteins that do show a correlation when deciding whether to use local-similarity approaches. To address the research question within this experiment, a novel multivariate and multi-dataset analysis R package based on the statistical method described in Karpievitch et al. 2009 was developed and benchmarked against existing analyses, to ascertain its performance capabilities and suitability as detailed below.
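The three generic strategies described above (half-minimum replacement, random tail imputation and KNN) can be sketched as follows. This is a minimal illustration and is neither the Perseus nor the ProteoMM implementation; the `shift` and `width` parameters of the down-shifted normal distribution and the inverse-distance weighting are assumptions, and the intensity values are invented.

```python
import math
import random
import statistics
from typing import List, Optional

Row = List[Optional[float]]  # one peptide across replicates; None = missing


def observed(matrix: List[Row]) -> List[float]:
    return [v for row in matrix for v in row if v is not None]


def half_min_impute(raw: List[Row]) -> List[List[float]]:
    """Single-value imputation: half the global minimum of raw intensities."""
    fill = min(observed(raw)) / 2
    return [[fill if v is None else v for v in row] for row in raw]


def random_tail_impute(log_matrix: List[Row], rng: random.Random,
                       shift: float = 1.8, width: float = 0.3) -> List[List[float]]:
    """RTI: draw from a narrow normal shifted into the left tail of the data."""
    values = observed(log_matrix)
    mu, sd = statistics.mean(values), statistics.stdev(values)
    return [[rng.gauss(mu - shift * sd, width * sd) if v is None else v
             for v in row] for row in log_matrix]


def knn_impute(log_matrix: List[Row], k: int = 2) -> List[Row]:
    """KNN: inverse-distance weighted mean of the k most similar peptides."""
    result = [row[:] for row in log_matrix]
    for i, row in enumerate(log_matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(log_matrix):
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                    candidates.append((dist, other[j]))
            nearest = sorted(candidates)[:k]
            if nearest:
                weights = [1.0 / (d + 1e-9) for d, _ in nearest]
                result[i][j] = sum(w * x for w, (_, x) in zip(weights, nearest)) / sum(weights)
    return result


# Toy log2 intensities: 4 peptides x 3 replicates.
log_peptides: List[Row] = [
    [20.0, 20.2, None],
    [20.1, 20.3, 20.5],
    [25.0, 25.1, 25.3],
    [15.0, None, 15.2],
]
print(knn_impute(log_peptides, k=1)[0])
print(random_tail_impute(log_peptides, random.Random(0))[3])
```

In the toy data the first peptide's missing value is filled from its closest neighbour (the second peptide), whereas RTI ignores peptide identity and draws every missing value from the same low-intensity distribution.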

III.1.7 ProteoMM, a novel multivariate, multi-dataset peptide level analysis tool

Analysis of MS proteomics datasets presents significant challenges due to sampling variation, sample size, differences in magnitudes between datasets, and missing data. Whilst numerous approaches to statistically deal with these challenges exist, no single pipeline or programme is tailored to all analyses. Programmes like Perseus offer a more interactive and user-friendly interface but limit user customisation and dataset tailoring, which may be required for more complex experiments61. Numerous R packages streamlined for users, such as DanteR, enable statistical proteomics analysis with more tailored approaches to issues like imputation, normalisation, and data filtering, but again may not provide the right tools for every analysis, or may require many different tools, resulting in a complicated analysis pipeline62–65. Currently, there is no MS-based statistical tool that incorporates two or more datasets to allow multivariate analysis. Current analysis pipelines only perform univariate analysis and are restricted by the copious amounts of missing data characteristic of MS-based proteomics66. There is a need for easily adoptable, multivariate analysis tools that simultaneously incorporate two or more datasets, enabling more flexible and statistically powerful approaches that are applicable to a wide range of biological questions; for example, within this project, addressing the biological conservation and differences that exist between species across a range of conditions.

We developed ProteoMM for the simultaneous mC reader conservation analysis of proteins obtained from human and mouse brain tissue lysate. ProteoMM provides a robust, automated differential expression (DE) analysis pipeline for single and multiple proteomics datasets. It utilises a previously published EigenMS normalisation and a sophisticated imputation model that deals with left-censored data, within the observable range of peptide abundances, by borrowing information from multiple peptides within a protein53. A major assumption in this analysis approach is that peptides belonging to a particular protein exhibit similar patterns for each tested condition. EigenMS uses a combination of analysis of variance (ANOVA) and Singular Value Decomposition (SVD) to remove biases present in the LC-MS data while preserving the variation of interest67. A single p-value and effect size estimate is produced for differences in protein abundances between conditions. The pipeline also incorporates presence/absence (P/A) analysis of proteins in cases whereby DE is not suitable. Detailed within this chapter is an analysis of ProteoMM performance and its capabilities, which was implemented by comparisons of its output with Perseus, a commonly employed MS analysis tool. Briefly, outputs from the Perseus-based analysis, conducted on LFQ and raw peptide intensities, were compared to the raw intensity-based ProteoMM output. Results from each analysis were compared to existing repositories of DNA binders classed as mCG binders or non-mCG binders, to ascertain which method identified the greatest percentage of correctly called DNA binding proteins.

Results

III.2.1 Probe design

To identify readers of CG methylation (mCG readers) within the human and mouse brain, methylated and unmethylated control probes were used within the mCG/CG DNA pull-down. Repetitive CG sequences within the mammalian genome exist within CGIs, commonly associated with promoters, and are important transcriptional platforms within the cell68. DNA methylation within these elements is an important determinant of protein binding and transcriptional output69. Most CGIs remain unmethylated and are permissive to transcription, or subject to repression via the polycomb repressive complex70. The unmethylated (CG) and methylated (mCG) probes were modelled on these DNA elements to identify potential binders of these regulatory elements within the cell, and contained a repeat array of four CG dinucleotides separated by thymine. The mCG probe is identical to the CG probe in DNA sequence, differing only in its cytosine methylation state (Figure 3.1). The mCG probe successfully identified proteins that bind DNA methylation in the CG dinucleotide context and highlighted many previously characterised mCG readers and some novel mCG candidates. Similarly, the CG probe attracted many already characterised DNA binders that are known to bind DNA in a methyl-independent manner. This chapter describes the design, performance, and assessment of ProteoMM as an analysis tool. The identified proteins binding to mCG and CG are discussed in detail within Chapter 4. Differential binding to each probe was determined by comparing peptide intensities observed for the mCG probe over the CG probe. Using this method, the performance of ProteoMM was judged by its ability to correctly identify proteins binding to each context by comparisons with published findings.
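The comparison of intensities for the mCG probe over the CG probe can be illustrated with a generic log2 fold change and Welch's t statistic. This is a simplified stand-in and not the model-based test used by ProteoMM; the triplicate intensities are invented for the example, and no p-value is computed because that requires the t distribution.

```python
import math
import statistics
from typing import List, Tuple


def log2_values(intensities: List[float]) -> List[float]:
    return [math.log2(x) for x in intensities]


def log2_fold_change(methylated: List[float], unmethylated: List[float]) -> float:
    """Difference of mean log2 intensities (methylated minus unmethylated)."""
    return statistics.mean(log2_values(methylated)) - statistics.mean(log2_values(unmethylated))


def welch_t(a: List[float], b: List[float]) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = statistics.variance(a) / len(a)
    var_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1))
    return t, df


# Hypothetical triplicate intensities for one protein on each probe.
mcg = [4096.0, 8192.0, 16384.0]
cg = [512.0, 1024.0, 2048.0]

lfc = log2_fold_change(mcg, cg)
t_stat, dof = welch_t(log2_values(mcg), log2_values(cg))
print(f"log2 fold change = {lfc:.1f}, t = {t_stat:.3f}, df = {dof:.1f}")
```

A positive fold change indicates preferential recovery on the methylated probe, and a negative one indicates preferential recovery on the unmethylated control.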

For the mCA/CA DNA pull-down, probes in the mCA context and the control CA context were also designed, and the pull-downs were performed in parallel to the mCG/CG DNA pull-downs. The design of these probes was based on the following. The occurrence of mCH within mammals is mostly restricted to ESCs, iPSCs, and brain cell types. Of all mCH sites within the mammalian brain, methylation in the CA dinucleotide context is the most abundant and, in particular, methylation of the CAC motif in human and mouse brain predominates over CAG, with each being methylated at 47% and 20% respectively71. Therefore, a repeat array of four CAC motifs, each flanked by a TA dinucleotide, was incorporated into two probes. One probe remained unmethylated whilst the other was hemimethylated at the CA site, to resemble the predominance and sequence bias of CH methylation in the mammalian brain. The identified proteins binding to mCA and CA are discussed in detail within Chapter 5.
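The probe designs can be checked directly against the unmethylated oligonucleotide sequences listed in Table S2.3 (Oligos 3, 4, 6 and 7). The short sketch below confirms that each forward strand carries four copies of its motif and that the listed reverse strand is its exact reverse complement; the helper names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unmodified DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_motif(seq: str, motif: str) -> int:
    """Number of (possibly overlapping) occurrences of motif in seq."""
    size = len(motif)
    return sum(1 for i in range(len(seq) - size + 1) if seq[i:i + size] == motif)


# Unmethylated probe sequences from Table S2.3, with tags and spacing removed.
CG_FORWARD = "GAT GAT GAC GAC GAC GAC GAT GAT G".replace(" ", "")
CG_REVERSE = "CAT CAT CGT CGT CGT CGT CAT CAT C".replace(" ", "")
CA_FORWARD = "GAT GAT GTA CAC TAC ACT ACA CTA CAC ATG ATG".replace(" ", "")
CA_REVERSE = "CAT CAT GTG TAG TGT AGT GTA GTG TAC ATC ATC".replace(" ", "")

print(count_motif(CG_FORWARD, "CG"), reverse_complement(CG_FORWARD) == CG_REVERSE)
print(count_motif(CA_FORWARD, "CAC"), reverse_complement(CA_FORWARD) == CA_REVERSE)
```

Because the mCA and CA probes share a single unmethylated reverse strand (Oligo 7), the methylated duplex carries 5-methylcytosine on the forward strand only, consistent with the hemimethylated design described above.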

Figure 3.1: Overview of affinity pull-down for the identification of mC readers. Protein extracts derived from frozen human frontal cortex and mouse whole brain were incubated with methylated probes in the CA and CG contexts and their unmethylated controls and subject to mass spectrometry analysis.

III.2.2 DNA pull-down optimisation

Isolation of crude protein extract was achieved by homogenisation of human frontal cortex and mouse whole brain tissue. Enrichment of the nuclear protein fraction was achieved by ultracentrifugation in a sucrose gradient and by affinity purification with the DNA probes used in the pull-down. Nuclear protein enrichment by the affinity purification was confirmed by western blot, comparing cytoplasmic (anti-ACTIN) versus nuclear (anti-MECP2) protein abundance before and after the pull-down (Figure 3.2). Results from the western blot indicate a significant reduction in actin after the pull-down compared to before, whilst MECP2 abundance remains constant. The two bands observed for MECP2 correspond to the two isoforms of the protein. This result indicated that the DNA pull-down successfully enriched for nuclear proteins with affinity for each probe and decreased cytoplasmic protein contaminants. Following this result, the DNA pull-down was performed in triplicate for each DNA probe within each species, generating three technical replicates in each of the methylated (mCG and mCA) and unmethylated (CG and CA) contexts before being subjected to MS analysis. The identified proteins from the mCG/CG and mCA/CA DNA pull-downs are discussed, analysed, and further characterised at length in Chapters 4 and 5 respectively.

[Figure 3.2 image. Panel A) Nuclear extract enrichment: western blot for the nuclear marker (MECP2) and the cytoplasmic marker (ACTIN), with input, mCA and CA lanes and size markers at 55kDa and 45kDa. Panel B) Missing peptides within Human datasets: bar plots for Human CA (CA1 to CA3, mCA1 to mCA3) and Human CG (CG1 to CG3, mCG1 to mCG3). Panel C) Missing peptides within Mouse datasets: bar plots for Mouse CA (CA1 to CA3, mCA1 to mCA3) and Mouse CG (CG1 to CG3, mCG1 to mCG3).]

Figure 3.2: (A) Confirmation of nuclear enrichment after the DNA pull-down using DNA probes in the methylated and unmethylated CA context. Nuclear (anti-MECP2) and cytoplasmic (anti-ACTIN) antibodies are reflective of nuclear and cytoplasmic fractions. The input lanes correspond to protein abundance prior to the pull-down, whilst the mCA and CA labelled lanes correspond to protein abundance after incubation with those DNA probes. (B, C) Bar plots of the number of missing values (y-axis) for all replicates in human (B) and mouse (C) for CA (left) and CG (right) datasets.

III.2.3 EigenMS implementation and model-based imputation

Prior to the EigenMS normalisation and model-based imputation, ProteoMM generates a bar graph of missing peptide information for each replicate that is analysed (Figure 3.2). Missing data in MS experiments may arise from incorrectly identified peptides, from intensities below the detection limits of the mass spectrometer, or from incomplete digestion of a protein that produces no measurement for that peptide within a sample. The total number of missing peptide values for all proteins per replicate, as produced by ProteoMM, is displayed within Figure 3.2. The percentage of missing values for Human mCG/CG and Human mCA/CA was 38.7% and 35.4% respectively. The percentage of missing values for Mouse mCG/CG and Mouse mCA/CA was 41.6% and 41.0% respectively. In general, mCG/CG datasets had ~2.5X more missing peptides than mCA/CA datasets. This was observed in both mouse and human and among all replicates (Figure 3.2 B and C). Given that the proportion of missing values was consistent between samples and datasets, and that the same patterns for CG and CA were seen globally, the missing values were attributed to biological factors rather than to any obvious errors in the experiment.
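The per-replicate missing value counts and the overall percentage of missing values can be computed from a peptide by replicate intensity table as sketched below. This is a generic illustration rather than ProteoMM's own plotting function, and the toy table is invented.

```python
from typing import Dict, List, Optional

Row = List[Optional[float]]  # one peptide across replicates; None = missing


def missing_per_replicate(matrix: List[Row], replicate_names: List[str]) -> Dict[str, int]:
    """Number of peptides without an intensity in each replicate (bar heights)."""
    return {name: sum(1 for row in matrix if row[j] is None)
            for j, name in enumerate(replicate_names)}


def percent_missing(matrix: List[Row]) -> float:
    """Missing values as a percentage of all peptide x replicate cells."""
    total = sum(len(row) for row in matrix)
    missing = sum(1 for row in matrix for value in row if value is None)
    return 100.0 * missing / total


replicates = ["CG1", "CG2", "CG3", "mCG1", "mCG2", "mCG3"]
toy_peptides: List[Row] = [
    [21.0, None, 20.8, 23.1, 23.0, None],
    [None, None, 18.2, 19.0, None, 19.4],
    [25.3, 25.1, 25.0, 25.2, 25.4, 25.1],
    [17.9, 18.1, None, None, None, None],
]

print(missing_per_replicate(toy_peptides, replicates))
print(f"{percent_missing(toy_peptides):.1f}% missing")
```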

EigenMS normalisation was implemented for the mCG/CG and mCA/CA datasets, displayed in Figure 3.3 and Figure 3.4 respectively. For each dataset, the top 3 trends explaining variance in the data are plotted for raw (left), residual (middle), and normalised (right) data for each replicate, and the variance explained by each trend is given as a percentage above each line graph. Raw trends are indicative of total variation between treatment groups. Residual data are used to capture any biases present in the data. Theoretically, samples 1-3 should display higher similarity to each other than to samples 4, 5 and 6 because they belong to the same treatment group. In the mouse mCG/CG dataset, for example, trend 2 in the raw data reflects differences between methylated and unmethylated groups consistently across all replicates. However, trend 2 in the raw data constitutes only 18% of the total variation, whilst trend 1 contains 55% of the variation in the mouse data. Further, assessment of bias trends within the residual data plot indicates that replicates 2 and 5, as well as 3 and 4, show higher similarity, which may impair comparisons across treatment conditions, thus necessitating normalisation. After normalisation, 67% of the variation in the data was explained by the differences between treatment groups. In general, the top EigenMS normalised trend in each dataset represented a minimum of 67% of the variation in the data, whilst trend 2 constituted ~30%. The bias attributable to trend 2 was not removed, to avoid over-normalising the data.
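The idea of extracting trends from raw and residual data can be sketched with a singular value decomposition. This is a simplified illustration and not the EigenMS implementation: it assumes a complete peptide-by-sample matrix with no missing values, the function names are hypothetical, and the toy data are simulated with an artificial bias shared by samples 2 and 5.

```python
import numpy as np

def eigentrends(matrix, n_trends=3):
    """Top trends across samples and the percent of variation each explains."""
    # Centre each peptide (row) so trends describe variation between samples.
    centred = matrix - matrix.mean(axis=1, keepdims=True)
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    percent = 100.0 * singular_values**2 / np.sum(singular_values**2)
    return vt[:n_trends], percent[:n_trends]

def anova_residuals(matrix, groups):
    """Subtract each treatment group's mean from every peptide (one-way ANOVA residuals)."""
    groups = np.asarray(groups)
    residual = np.array(matrix, dtype=float)
    for group in np.unique(groups):
        columns = groups == group
        residual[:, columns] -= residual[:, columns].mean(axis=1, keepdims=True)
    return residual

# Hypothetical toy data: 200 peptides x 6 samples (1-3 unmethylated, 4-6 methylated).
rng = np.random.default_rng(0)
groups = ["CG", "CG", "CG", "mCG", "mCG", "mCG"]
treatment = rng.normal(0, 1.0, (200, 1)) * np.array([0, 0, 0, 1, 1, 1])
bias = rng.normal(0, 2.0, (200, 1)) * np.array([0, 1, 0, 0, 1, 0])  # samples 2 and 5
noise = rng.normal(0, 0.2, (200, 6))
raw = 20.0 + treatment + bias + noise

raw_trends, raw_percent = eigentrends(raw)        # trends in the raw data
residual = anova_residuals(raw, groups)           # treatment effect removed
bias_trends, bias_percent = eigentrends(residual) # candidate bias trends
print(np.round(raw_percent, 1), np.round(bias_percent, 1))
```

Trends found in the residual matrix cannot reflect the treatment groups, because the group means have been subtracted, so they are candidates for bias. Deciding how many of them to remove is the step at which over-normalisation must be avoided.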

Model-based imputation generates values for missing data based on information from the peptides originating within the same protein. If the missing observation comes from a peptide of lower abundance, then the missing value has a higher chance of being left-censored. Left-censored values are guided by the likelihood model, whereby values are imputed from the left tail of the peptide abundance distribution. In cases where a missing observation comes from a high abundance peptide, such observations are more likely to be missing completely at random and are imputed from the entire peptide abundance distribution. ProteoMM has an inbuilt plotting function that allows the user to visually track the number of peptides belonging to a protein and the output of normalisation and imputation of the proteins with their underlying peptide observations. Examples are shown for MECP2 in the human mCG/CG pull-down (Figure 3.3C) and for Mbd2 in the mouse mCA/CA pull-down (Figure 3.4C). Each peptide observation and its intensity value across replicates are displayed in a different colour. Breaks in the lines represent missing peptide observations present in the raw (left) and normalised (middle) plots, which are filled in by imputation (right). The average protein abundance value returned by ProteoMM for the protein is shown as a solid bold black line.


Figure 3.3: The top 3 eigentrends identified in the raw, residual, and EigenMS normalised mCG/CG dataset for (A) human, and (B) mouse. Circles 1-6 correspond to 6 samples (3 replicates from mCG and 3 replicates from CG). These trends capture the main patterns of variation in the raw data, and biases observed in the residual data where the biological variation between treatment groups has been removed using ANOVA. The last column shows the eigentrends detected in the normalised data after bias trends have been removed. The second trend in the raw data shows systematic differences between mCG and CG (middle trend). The other two eigentrends in the raw data capture systematic bias as determined by EigenMS. The top two bias trends in the residual data are the first and third trends captured in the raw data (the second trend is rotated around the x-axis, which does not change the meaning of the trend). Normalised data show a clearer difference between the treatment groups, which is captured by the most significant (top) trend as is expected post normalisation. (C) The effect of normalisation and imputation for a specified protein (MECP2) at the peptide level. Each dot represents a peptide observation. Trends for each peptide across a set of replicates are plotted as coloured lines. Each line belongs to a different peptide detected for the MECP2 protein. The solid black line denotes the average abundance for all peptides rolled up to protein level and confirms that the differences between treatment groups are more obvious in normalised and imputed data.


Figure 3.4: The top 3 eigentrends identified in the raw, residual, and EigenMS normalised mCA/CA dataset for (A) human, and (B) mouse. Circles 1-6 correspond to 6 samples (3 replicates from mCA and 3 replicates from CA). These trends capture the main patterns of variation in the raw data, and biases observed in the residual data where the biological variation between treatment groups has been removed using ANOVA. The last column shows the eigentrends detected in the normalised data after bias trends have been removed. The second trend in the raw data shows systematic differences between mCA and CA (middle trend). The other two eigentrends in the raw data capture systematic bias as determined by EigenMS. The top two bias trends in the residual data are the first and third trends captured in the raw data (the second trend is rotated around the x-axis, which does not change the meaning of the trend). Normalised data show a clearer difference between the treatment groups, which is captured by the most significant (top) trend as is expected post normalisation. (C) The effect of normalisation and imputation for a specified protein (Mbd2) at the peptide level. Each dot represents a peptide observation. Trends for each peptide across a set of replicates are plotted as coloured lines. Each line belongs to a different peptide detected for the Mbd2 protein. The solid black line denotes the average abundance for all peptides rolled up to protein level and confirms that the differences between treatment groups are more obvious in normalised and imputed data.

III.2.4 Differential expression and presence/absence analysis

ProteoMM implements EigenMS normalisation and imputation at the peptide level which are rolled up to the protein level for visualization. A schematic of the analysis pipeline is illustrated in Figure 3.5. Normalization of the raw data is performed on the individual datasets. Proteins detected in human and mouse are then subjected to a “combined” analysis. Combined analysis matches proteins in both datasets and thus more peptide observations are available for statistical analysis. This lends higher statistical power to the analysis, allowing those proteins to be classified more confidently. This is useful when proteins have very few observations in each species that may otherwise result in their exclusion. This approach is reliant on the premise that proteins in human and mouse exhibit similar binding behaviour. However, the pipeline also allows for discrimination of proteins where binding behaviour differs. Within the combined analysis, DE and P/A analyses are performed, yielding significance values for both species using the combined peptide information. Proteins within the combined analysis may yield significance in both species or in either species. For example, a protein within the combined analysis observed as significant in mouse-only is termed “combined-mouse”. Proteins that were not observed in both species’ datasets are subject to their own DE and P/A analyses. These proteins were limited to observations within either human or mouse and are therefore termed “human-limited” or “mouse-limited”. Proteins within this list are not necessarily missing from the proteome of one species and present in the other. Rather, these proteins could be detected in one species and not the other due to a lack of peptide observations. This may be explained by differences in protein abundances or because the peptide was incorrectly labelled. 
Another explanation for the presence of a species-limited protein is that its counterpart in the other species was subject to P/A analysis because too few peptides were recorded within that dataset to perform DE analysis. For example, KLF16 bound mCG with no observations for the CG probes in the human dataset, but had detectable peptides for both the mCG and CG probes in mouse. The lack of peptides in human meant that KLF16 could not be analysed within the combined analysis. It was therefore analysed independently in the human P/A and mouse-limited datasets. Proteins analysed by ProteoMM are summarised in Table 3.1. The Table separates the proteins that meet the significance thresholds for each dataset into DNA-binding domain (DBD)-containing proteins and interactors (see section 2.2.4). These thresholds were set at log2 fold change (log2FC) ≥ 1.2 and p-value ≤ 0.05 for DE analysis, and percent observed ≥ 50% and p-value ≤ 0.1 for P/A analysis. The choice of log2FC value within the DE analysis was based on validation of the pull-down by comparing the log2FC values to externally published data available for known mCG readers (see sections 3.2.6, 3.2.7, and 3.3.6). In the P/A analysis, enrichment is expressed as the percentage of peptide observations recorded in a given context when there are no observations in the corresponding probe condition, for example peptides detected in mCA replicates and no peptides detected in any CA replicate. A detected peptide may produce 3 possible observations, one for each of the 3 replicates. An observation therefore refers to a peptide intensity measurement recorded in one replicate for that peptide. Percent observed is the number of peptide observations recorded across the replicates of a particular context divided by the number of potential peptide observations across those replicates. For example, the protein DST within the P/A analysis for the mCA probe had 2 peptides detected. This means there was the potential to record 6 mCA peptide observations for DST (2 peptides × 3 mCA replicates). DST recorded 3 of these 6 possible observations, so its percent observed value is 3/6, which equals 50%.
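The percent observed calculation and the two sets of significance thresholds can be written out as a short sketch. The function names are hypothetical and this is not ProteoMM code. The DE check uses the absolute log2FC on the assumption that the sign only indicates which probe is preferred, and the p-value passed for DST is a placeholder because the text does not report it.

```python
def percent_observed(n_peptides, n_replicates, n_observations):
    """Recorded peptide observations as a percentage of all potential observations."""
    potential = n_peptides * n_replicates
    if potential == 0:
        raise ValueError("at least one peptide and one replicate are required")
    return 100.0 * n_observations / potential

def passes_de(log2fc, p_value, fc_cutoff=1.2, p_cutoff=0.05):
    """DE threshold: |log2FC| >= 1.2 and p-value <= 0.05 (sign gives the preferred probe)."""
    return abs(log2fc) >= fc_cutoff and p_value <= p_cutoff

def passes_pa(pct_observed, p_value, pct_cutoff=50.0, p_cutoff=0.1):
    """P/A threshold: percent observed >= 50% and p-value <= 0.1."""
    return pct_observed >= pct_cutoff and p_value <= p_cutoff

# DST in the mCA P/A analysis: 2 peptides x 3 replicates, 3 observations recorded.
dst_percent = percent_observed(n_peptides=2, n_replicates=3, n_observations=3)
dst_significant = passes_pa(dst_percent, p_value=0.1)  # p-value is a placeholder
print(dst_percent, dst_significant)
```

With 3 of 6 potential observations recorded, DST sits exactly on the 50% cut-off, so whether it is called significant depends on its P/A p-value.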
The combined P/A datasets comprise proteins that were common to both mouse and human in the mCG/CG and mCA/CA contexts, but with peptides detected in only one of the two probe states for a context (methylated or unmethylated). Only one significant protein was observed in the combined P/A mCG/CG dataset. This was the KAISO member ZBTB44, for which peptides were observed in the mCG replicates in mouse and human and no peptides were observed in the CG replicates (i.e. exhibiting a present/absent [P/A] state in the comparison of the proteins detected for the methylated and unmethylated probes). Usually, the combined mCG/CG P/A analysis would have its own volcano plot. However, because only one protein was detected as significant, the combined P/A and species-limited P/A datasets were merged to create one human and one mouse P/A dataset, which simplified the analyses and plots. Using mouse mCG/CG as an example, the combined mCG/CG P/A dataset was merged with the mouse-limited P/A dataset to produce one volcano plot depicting all enriched mCG and CG readers observed for mouse P/A.


Figure 3.5: Schematic of a two dataset differential expression (DE) and presence/absence (P/A) analysis by ProteoMM.

An upset plot (Figure 3.6) presents an alternative summary of the data by listing every possible intersection of all significantly called proteins in human and mouse. This plot was used to assess the success of the DNA pull-down of methyl binding proteins and the output of ProteoMM by ensuring no discrepancies in significantly called proteins were observed. In generating this plot, all proteins deemed significant from each probe context were merged into one “entire” dataset. The 4 resulting datasets were entire-human mCG/CG, entire-human mCA/CA, entire-mouse mCG/CG, and entire-mouse mCA/CA. To create these datasets, the proteins exceeding the defined significance thresholds in each binding context from the combined, species-limited, and P/A analyses were merged into one. Using the human dataset as an example, the entire-human mCG/CG dataset was obtained by merging combined-human with human-limited and human P/A for all proteins deemed significant. This is a merge of all proteins exceeding log2FC ≥ 1.2 and p-value ≤ 0.05 for the mCG/CG DE analysis in combined-human and human-limited, and percent observed ≥ 50% and p-value ≤ 0.1 for the human mCG/CG P/A analysis. The intersections by context and species were then plotted. Importantly, no proteins within the mCG/CG datasets display significance in the other species for the opposing probe condition. In other words, there were no proteins with significantly enriched affinity for mCG in human that were also observed as significantly enriched for CG in mouse, and vice versa. TAGLN3, an actin filament binding protein, was detected in human mCA and mouse CA. TAGLN3 is not a DNA binding protein and is more than likely tethered to a protein bound to each probe condition. Whilst it cannot be ruled out that this interaction reflects protein differences between human and mouse, it is more likely that the observed differences are a result of background protein-protein interactions. Alternatively, this result could be a false positive that occurred due to a lack of peptide observations for this set of probes. Protein affinity experiments like TAP-MS are required before any definitive conclusions can be made. Of note, human mCG/CG and mouse mCG/CG display the greatest number of shared proteins, followed by human mCG and mouse mCG. The third-largest intersection comprises proteins identified in human that are repelled by DNA methylation regardless of context. Two proteins were significantly enriched in human and mouse mCA, whilst 1 protein was significantly enriched in human and mouse mCG and mCA. The significance thresholds set (log2FC ≥ 1.2 and p-value ≤ 0.05 for the DE analysis, and percent observed ≥ 50% and p-value ≤ 0.1 for the P/A analysis) yielded coherent results from both datasets and were therefore chosen.
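The construction of the “entire” datasets and the check for proteins called in opposite probe states between species can be sketched with set operations. The function names are hypothetical, and apart from TAGLN3 the protein names are placeholders. Matching human and mouse proteins by upper-casing their names is a simplifying assumption, not the orthologue matching used in the analysis.

```python
def entire_dataset(*significant_sets):
    """Union of significant proteins from the combined, species-limited and P/A analyses."""
    merged = set()
    for proteins in significant_sets:
        # Upper-case names so that e.g. mouse 'Tagln3' matches human 'TAGLN3' (assumption).
        merged |= {name.upper() for name in proteins}
    return merged

def opposite_state_calls(human, mouse):
    """Proteins enriched for the methylated probe in one species and the unmethylated in the other."""
    return (human["methylated"] & mouse["unmethylated"]) | (
        human["unmethylated"] & mouse["methylated"]
    )

# Arguments are (combined, species-limited, P/A) significant sets; names are placeholders.
human_ca = {
    "methylated": entire_dataset({"TAGLN3", "PROT_A"}, {"PROT_B"}, set()),
    "unmethylated": entire_dataset({"PROT_C"}, set(), {"PROT_D"}),
}
mouse_ca = {
    "methylated": entire_dataset({"Prot_a"}, set(), {"Prot_e"}),
    "unmethylated": entire_dataset({"Tagln3"}, {"Prot_c"}, set()),
}

conflicts = opposite_state_calls(human_ca, mouse_ca)
shared_methylated = human_ca["methylated"] & mouse_ca["methylated"]
print(conflicts, shared_methylated)
```

An empty conflict set corresponds to the mCG/CG result described above, whilst a single conflicting protein corresponds to the TAGLN3 observation in the mCA/CA datasets.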

Dataset (total proteins)   Context               DBD-containing   Interactors   Total significant proteins

Combined CG DE (1131)      Combined mCG          15               3             18
                           Combined CG           16               4             20
                           Combined-human mCG    15               31            46
                           Combined-human CG     5                15            20
                           Combined-mouse mCG    3                6             9
                           Combined-mouse CG     3                7             10
Combined CA DE (1117)      Combined mCA          2                0             2
                           Combined CA           3                0             3
                           Combined-human mCA    3                6             9
                           Combined-human CA     27               13            40
                           Combined-mouse mCA    5                8             13
                           Combined-mouse CA     0                0             0
Human-limited DE (550)     Human mCG             11               16            27
                           Human CG              12               28            40
Human Pres/abs (182)       Human mCG             9                6             15
                           Human CG              6                20            26
Human-limited DE (545)     Human mCA             5                5             10
                           Human CA              23               34            57
Human Pres/abs (201)       Human mCA             2                12            14
                           Human CA              4                9             13
Mouse-limited DE (1421)    Mouse mCG             25               37            62
                           Mouse CG              17               29            46
Mouse Pres/abs (198)       Mouse mCG             2                8             10
                           Mouse CG              4                17            21
Mouse-limited DE (1416)    Mouse mCA             7                33            40
                           Mouse CA              7                10            17
Mouse Pres/abs (217)       Mouse mCA             7                12            19
                           Mouse CA              3                9             12

Table 3.1: The proportion of DNA-binding domain (DBD)-containing proteins and interactors exceeding significance thresholds within combined and limited DE and within P/A mCG/CG and mCA/CA datasets.


Figure 3.6: Intersection of entire-human and entire-mouse significantly called proteins within the mCA/CA and mCG/CG datasets. Significance thresholds were set at log2FC ≥ 1.2 and p-value ≤ 0.05 for the DE analysis and percent observed ≥ 50% and p-value ≤ 0.1 for the P/A analysis. The DE analysis comprises the combined list of proteins, filtered for significance by species and merged with the corresponding species-limited dataset. The size of each intersection is displayed by a bar plot and its corresponding node. Nodes connected by lines indicate an intersection between datasets.

III.2.5 Validation of ProteoMM by mCG reader verification

The log2FC distributions generated by ProteoMM were compared to those from Perseus by LFQ and Perseus by raw intensity, represented as violin plots (Figure 3.7). The distributions appear similar: they are concentrated around zero and share similar 1st and 3rd quartile ranges. The global minimum and maximum values for the Perseus-based methods are more extreme, whilst their overall distribution of points is less centred around zero. This may indicate that Perseus has a higher level of sensitivity in discriminating between probe conditions, or this observation may reflect differences in each method's normalisation.

Figure 3.7: Distribution of log2FC values within ProteoMM, Perseus by LFQ, and Perseus by raw intensity using human and mouse mCG/CG datasets.

III.2.6 Benchmarking ProteoMM on a subset of common high confidence proteins

A subset of proteins was chosen and compared across the Perseus and ProteoMM platforms to assess whether ProteoMM was performing as expected. It was also important to assess the ability of ProteoMM to correctly capture all common proteins and perform a robust combined analysis. The selection of the subset of proteins was therefore based on two reasons. First, it was reasoned that regardless of the analysis approach, the highest confidence readers amongst all three analysis pipelines were proteins significant in both human and mouse. Second, comparing common readers between datasets would also test the integration of multiple datasets within ProteoMM and the efficacy of the combined dataset approach used by ProteoMM. Perseus analyses were performed separately for human and mouse. To obtain combined Perseus datasets, LFQ and raw intensity datasets in human and mouse were pre-filtered for significance (log2FC ≥ 1.2 and p-value ≤ 0.05), and significant proteins from each species were merged. Given the relative consistency in distributions between each analysis pipeline, the same significance threshold used in the ProteoMM DE analysis (log2FC ≥ 1.2 and p-value ≤ 0.05) was set for the Perseus-based analyses. Applying this threshold to a subset of proteins enabled ProteoMM to be compared to each Perseus analysis in a ‘per protein’ manner, where significantly called proteins in both species could be compared between each method. The significant mCG readers determined by each method were extracted and tabulated to compare each pipeline (Table 3.2). Ticks correspond to datasets in which a protein was deemed significant, whilst crosses correspond to observations for a protein within that dataset (and its analysis type) that did not meet the set significance thresholds. For example, RFXAP surpassed the threshold for combined-human and not combined-mouse in ProteoMM, whilst being significant in both human and mouse in each Perseus analysis. Table 3.2 is clustered based on significance in each method to simplify the different outcomes. Figure 3.8 contains raw peptide, and normalised and imputed peptide, trends across mCG and CG replicate conditions for a select group of proteins in the Table to demonstrate how ProteoMM functions for each analysis outcome. This enabled an in-depth look, at the peptide level, at how ProteoMM functioned and why a protein was deemed significant or not in comparison to the Perseus-based analyses.

Any matches between the combined Perseus datasets and the human-limited or mouse-limited datasets would indicate that ProteoMM was not analysing or merging datasets correctly, as these proteins should be automatically merged and analysed within the combined analysis. The only circumstance in which such proteins should be found in a species-limited dataset (for one species) is under P/A scenarios. In these cases, a protein would be discarded from the combined dataset because it had no observations in a probe condition for one of the two species. The protein with insufficient peptides would be shunted to P/A, whilst its counterpart in the other species, with sufficient peptide information, would be placed in the species-limited dataset. The ability of ProteoMM to correctly identify and merge significant proteins in both species and combine them into one dataset is demonstrated within Table 3.2, because no proteins enriched in the Perseus methods match proteins in the ProteoMM species-limited datasets unless the counterpart in the other species was subject to P/A. This is expected because ProteoMM was designed to discard proteins with limited observations from DE analysis, and indicates that the P/A analysis pipeline is working as expected. The proteins KLF12, KLF16, and FOXO3 represent situations in which P/A is employed by ProteoMM, preventing these proteins from being analysed within the combined analysis. For example, a lack of peptide observations in the human CG probes resulted in KLF16 being shunted to P/A analysis (Figure 3.8C). Since there were enough peptides in the mouse for DE analysis, Klf16 was analysed as part of the mouse-limited DE dataset.
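The routing of a protein to the combined, species-limited, or P/A analysis, as described above, can be sketched as a simple decision function. This is an illustration of the logic in the text and not ProteoMM code: the function names are hypothetical, the observation counts are invented, and "no observations in a probe condition" is taken as the only criterion for P/A, whereas the actual minimum peptide requirement is not stated here.

```python
def probe_state(counts):
    """Classify one species by which probe conditions have peptide observations."""
    methylated, unmethylated = counts["methylated"], counts["unmethylated"]
    if methylated > 0 and unmethylated > 0:
        return "DE"
    if methylated > 0:
        return "P/A:methylated"
    if unmethylated > 0:
        return "P/A:unmethylated"
    return "absent"

def route(observations):
    """Map each species to the analysis in which the protein would be tested."""
    states = {species: probe_state(c) for species, c in observations.items()}
    unique = set(states.values())
    if unique == {"DE"}:
        return {species: "combined DE" for species in states}
    if len(unique) == 1 and next(iter(unique)).startswith("P/A"):
        return {species: "combined P/A" for species in states}
    routes = {}
    for species, state in states.items():
        if state == "DE":
            routes[species] = f"{species}-limited DE"
        elif state.startswith("P/A"):
            routes[species] = f"{species} P/A"
        else:
            routes[species] = "not detected"
    return routes

# KLF16-like case: no human CG observations, both probes observed in mouse (counts invented).
klf16_like = route({
    "human": {"methylated": 4, "unmethylated": 0},
    "mouse": {"methylated": 5, "unmethylated": 3},
})
# ZBTB44-like case: observed only for the methylated probe in both species (counts invented).
zbtb44_like = route({
    "human": {"methylated": 2, "unmethylated": 0},
    "mouse": {"methylated": 3, "unmethylated": 0},
})
print(klf16_like, zbtb44_like)
```

Under these assumptions, a protein appears in a species-limited DE dataset only when it is undetected in the other species or when its counterpart there is routed to P/A, which is the behaviour the comparison with Perseus was used to confirm.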

Within Table 3.2, cluster 1 contains mCG readers that were significantly called within ProteoMM combined-human and combined-mouse as well as within both Perseus methods, and constitutes the majority of proteins. This is examined at the peptide level for ZBTB4 (Figure 3.8A), which contains numerous peptides in human and mouse that show clear enrichment for mCG. This cluster represents proteins that are called with high confidence, showing numerous peptides exhibiting similar trend patterns in both species. The remaining clusters represent proteins that do not display universal significance in all analysis methods. Cluster 2 contains proteins that were present in human and mouse, significant in both Perseus analyses but displayed significance using ProteoMM in either human or mouse. ZFP91, for example, met significance thresholds in both Perseus methods but was not significant in the combined-human or combined-mouse. A closer inspection at the peptide level reveals enrichment for mCG in human and slight enrichment in mouse (Figure 3.8B), but at log2FC levels that did not correspond to significance. Cluster 3 contains proteins significant in ProteoMM only, whilst cluster 4 displays one protein that was significant in Perseus by LFQ only and within ProteoMM for mouse only. Lastly, cluster 5 corresponds to proteins with significance in Perseus by raw intensity and mixed significance in ProteoMM. Within this cluster is NKX2-2, which displays significance in combined-mouse only but was significant within both the human and mouse datasets that were analysed using Perseus by raw intensity. Inspection of NKX2-2 at the raw peptide level is more in line with ProteoMM than with Perseus by raw intensity. NKX2-2 is clearly enriched in mouse but displays no discernible enrichment in human (Figure 3.8D).


Table 3.2: Proteins within this Table were enriched for mCG and are common to human and mouse in one or a combination of analysis methods. These analysis methods include ProteoMM and Perseus by LFQ or Perseus by raw intensity. Each cluster groups proteins by binding patterns observed within each dataset. Ticks correspond to significance within the respective dataset. Crosses correspond to observations of a protein within that dataset that was not significant.


Figure 3.8: Examples of proteins and their peptides analysed by ProteoMM that exhibited significance in human and mouse mCG for Perseus by LFQ or raw intensity. Raw (left panel), normalised (middle panel) and normalised and imputed (right panel) peptide observations in human (left) and mouse (right) for select protein examples across mCG and CG replicates. The transition from CG to mCG is demarcated by a dotted line running through the centre of each plot.

III.2.7 Benchmarking ProteoMM by comparisons to a SELEX repository of mCG/CG readers

A global comparison of each analysis method was implemented by comparing all the proteins predicted to be enriched for binding methylated CG (mCG) probes or unmethylated CG (CG) probes to a previously published (“external”) SELEX-based transcription factor dataset. The SELEX experiment comprehensively profiled the binding affinities of human transcription factors for DNA methylation (72). Methods pertaining to this comparison are discussed in detail in section 2.2.5. Briefly, transcription factors within the SELEX experiment that were attracted to CG methylation were termed methyl-plus, whilst those repelled by CG methylation were termed methyl-minus. Some transcription factors within the SELEX experiment exhibited mixed affinity for methylated and unmethylated DNA, whilst others remained inconclusive. These transcription factors were termed ‘other’ within this analysis to simplify comparisons.

DBD-containing proteins from the Perseus and ProteoMM analysis methods were matched to transcription factors within the SELEX experiment, and the percentage matching each SELEX class was plotted as a function of increasing or decreasing log2FC (Figure 3.9). Log2FC refers to peptide abundances detected in the mCG probe over the CG probe, with positive and negative log2FC values corresponding to proteins with a higher abundance for the mCG (mCG binders) and CG (CG binders) probes respectively. As log2FC cut-offs become more stringent, one would expect the total number of proteins matched to SELEX to decrease because fewer proteins make the defined cut-off. A more stringent positive log2FC cut-off would also be expected to increase the percentage of mCG binders matching to methyl-plus and decrease those that match to methyl-minus or other. Conversely, a more negative log2FC cut-off should increase the percentage of CG binders matching to methyl-minus and decrease those that match to methyl-plus or other. While this trend is observed for ProteoMM, it is not as pronounced for the Perseus-based methods. The percentage of mCG binders matching to methyl-plus increases as a function of increasing log2FC for ProteoMM but not for the Perseus methods, which remain largely unchanged. The same increase in percentage was observed for the CG binders (proteins that showed enriched binding to unmethylated CG probes compared to methylated CG probes), which matched methyl-minus transcription factors at a greater percentage as log2FC values became more negative. Looking at the total numbers of DBD-containing proteins matched to SELEX transcription factors between each method, the LFQ datasets in human and mouse match fewer mCG binders and fewer CG binders to SELEX methyl-plus and methyl-minus transcription factors, respectively, when compared to the methods using raw intensity (Perseus by raw intensity and ProteoMM). This suggests that using raw peptide abundances produces a higher number of identified mCG and CG binders compared to the modified LFQ abundances generated by MaxQuant, and that these mCG and CG binders are correctly classified. In comparing ProteoMM to Perseus, ProteoMM vastly outperforms both Perseus methods, identifying ~2.5X more proteins than either Perseus method and showing the same directionality for mCG binders with methyl-plus and CG binders with methyl-minus.
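The threshold sweep behind a plot like Figure 3.9 can be sketched as follows. This is a hypothetical illustration, not the analysis code: the function name, the transcription factor names, the log2FC values, and the SELEX class assignments are all invented for the example.

```python
def selex_match(log2fc_by_protein, selex_class, threshold):
    """Count proteins passing a log2FC cut-off that match SELEX, and the percent per class.

    A positive threshold selects mCG binders (log2FC >= threshold);
    a negative threshold selects CG binders (log2FC <= threshold).
    """
    if threshold >= 0:
        selected = [p for p, fc in log2fc_by_protein.items() if fc >= threshold]
    else:
        selected = [p for p, fc in log2fc_by_protein.items() if fc <= threshold]
    matched = [p for p in selected if p in selex_class]
    counts = {"methyl-plus": 0, "methyl-minus": 0, "other": 0}
    for protein in matched:
        counts[selex_class[protein]] += 1
    n_matched = len(matched)
    percent = {k: (100.0 * v / n_matched if n_matched else 0.0) for k, v in counts.items()}
    return n_matched, percent

# Invented example values; TF_F has no SELEX entry and is therefore never matched.
log2fc = {"TF_A": 2.1, "TF_B": 1.4, "TF_C": 0.5, "TF_D": -0.9, "TF_E": -1.6, "TF_F": 0.3}
selex = {
    "TF_A": "methyl-plus",
    "TF_B": "methyl-plus",
    "TF_C": "methyl-minus",
    "TF_D": "methyl-minus",
    "TF_E": "methyl-minus",
}

loose_n, loose_pct = selex_match(log2fc, selex, 0.4)     # mCG binders, lenient cut-off
strict_n, strict_pct = selex_match(log2fc, selex, 1.2)   # mCG binders, stringent cut-off
cg_n, cg_pct = selex_match(log2fc, selex, -0.8)          # CG binders
print(loose_n, loose_pct, strict_n, strict_pct, cg_n, cg_pct)
```

In this toy example, raising the cut-off from 0.4 to 1.2 reduces the number of matched proteins whilst raising the methyl-plus percentage, which is the behaviour expected of a method whose log2FC values track binding preference.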

The accuracy with which human mCG binders match methyl-plus is much lower than the accuracy with which human CG binders match methyl-minus for every analysis pipeline. For example, the human CG binders match to methyl-minus with 100% accuracy for Perseus by LFQ, and at log2FC thresholds as low as -0.4 and -0.8 for Perseus by raw intensity and ProteoMM respectively. This indicates that the proteins I found to be enriched for binding to CG compared to mCG by pull-down-MS experiments tended to also exhibit a higher affinity for CG than mCG in the independent SELEX experiments. The highest methyl-plus match is ProteoMM at 71%. The percentage of correctly called methyl-minus mouse CG binders by ProteoMM is comparable to both Perseus methods. However, as with human CG, ProteoMM identifies many more DBD-containing proteins in the mouse dataset. The overall trends indicate that ProteoMM identified more correctly called mCG binders, with a 93% success rate in mouse, and also identified more DBD-containing proteins than either Perseus-based method. Using the log2FC thresholds decided upon for all downstream analysis plots as an example (log2FC ≥ 1.2 for mCG binders and log2FC ≤ -1.2 for CG binders), ProteoMM identifies 2 more mCG binders and 6 more CG binders for human, and 6 more mCG and CG binders for mouse, compared to the Perseus-based approaches.

Figure 3.9: Proteins within ProteoMM and Perseus-based analyses were matched to a methyl-sensitive SELEX-based transcription factor repository (72). Transcription factors with an affinity for mCG within the repository are classed as ‘methyl-plus’, whilst those enriched for unmethylated CG are classed ‘methyl-minus’. Those transcription factors with mixed affinity are classed as ‘other’. Bar plots represent the percentage of transcription factors from ProteoMM and Perseus analyses that matched to the SELEX database at incrementally defined log2FC values. Proteins with positive log2FC values are called mCG binders, whilst those with negative log2FC values are called CG binders. The numbers within each segment of each stacked bar plot represent the percentage of all proteins at that particular log2FC threshold that were assigned to the Methyl-Plus, Methyl-Minus, or Other categories in the SELEX data.

III.2.8 Comparisons of mCG and CG DBD-containing proteins with published data

All identified DBD-containing proteins that exceeded significance thresholds set at log2FC ≥ 1.2 and p-value ≤ 0.05 for DE analysis, and percent observed ≥ 50% and p-value ≤ 0.1 for P/A analysis, were compared to previously published datasets that identified mCG readers and readers repelled by mCG. These were mostly obtained from high-throughput DNA binding screens such as DNA pull-downs coupled to MS, SELEX, and microarray-based approaches (44,72–75). For the reference information for each DBD-containing protein, see tables S4.1-S4.5. The results of this comparison are displayed in Figure 3.10, which contains the proportion (bars) and intersections (nodes) of each dataset with each other and with previously identified mCG or CG binders. In black are DBD-containing proteins for which no reference was found, indicating that these proteins remain uncharacterised and represent novel DBD-containing proteins and potential transcription factors that may directly bind to the DNA probes they were enriched for. The highest number of novel DBD-containing proteins was found within the mouse-limited dataset, which identified 15 novel candidate CG DBD-containing proteins and 12 novel candidate mCG DBD-containing proteins. In blue are DBD-containing proteins whose identification in ProteoMM was in agreement with previously published data. In red are the DBD-containing proteins whose identification in ProteoMM conflicts with previously published data. For simplification, the P/A datasets in mouse and human were combined into P/A mCG and P/A CG because these datasets had fewer overall proteins and because no information is lost when comparing to external datasets.

There is a high level of concordance with published data for the combined mCG and CG lists. No conflicting observations with external studies are observed for the combined mCG DBD-containing proteins list, whilst there is only one DBD-containing protein, ARNT2, within the combined CG list that has reported conflicting binding behaviour. This is consistent with published binding data for ARNT2, which has been observed as a binder of certain methylated motifs (75) and of unmethylated CG motifs (72). The P/A lists for mCG and CG contain no conflicting observations with published data and, together with the combined lists, represent high confidence proteins.

Within the combined analysis are proteins that met significance thresholds in only mouse or human, despite having peptides in both species. In general, these DBD-containing proteins are in agreement with previous classifications, with the exception of Ddb2, which was significant for mouse CG but reported to bind mCG in an external pull-down (44). The human-limited mCG proteins had peptide observations in human and not mouse. Within this list are 3 proteins that conflict with other datasets in that they have been reported to exhibit affinity for unmethylated CG. These proteins are BHLHE40, MAFF (72), and ZBTB14 (74). There were no conflicting results within mouse-limited mCG; however, it is worth mentioning that ZBTB33, observed within this list, has been reported to bind methylated and unmethylated DNA.

Figure 3.10: The numbers of proteins analysed by ProteoMM and their enrichment in each dataset (intersections) are represented by bars and nodes respectively. Proteins in concordance with previously published studies are represented in blue (44,72–75), whilst those conflicting with previously published studies are represented in red.

Discussion

III.3.1 ProteoMM, a novel multivariate differential expression proteomics analysis tool

MS datasets are often large and complex, requiring specialised analysis. Automated, user-friendly tools offer limited customisation and may not provide a pipeline suited to all analyses, for example comparisons that require multi-dataset integration. One such comparison is interspecies conservation, an important and frequent biological question that enables the identification of molecular processes that are indispensable to an organism because they are conserved. There is a need for a tool that incorporates each dataset into a unified analysis in order to compare conservation across species more reliably. The utility of ProteoMM was tested in human and mouse, chosen because the two species share high levels of protein conservation: similarity between their protein-coding genes averages about 85%76. Mouse models are also used to better understand human biology because most proteins function similarly in human and mouse. This made it possible to benchmark ProteoMM by assessing the overlap of proteins that showed enrichment for the same set of probes in both species. If the upset plot (Figure 3.10) had shown a large fraction of proteins binding opposite probe conditions between species, the efficacy of ProteoMM would be in question. However, we do not observe this; as one would expect, we observe a large overlap between species. Because a large proportion of proteins from each species bind in the same context, ProteoMM appears to function as intended and should make a reliable tool for comparative analysis of more divergent species. To our knowledge, no MS proteomics analysis tools incorporate multi-dataset integration into their pipelines, owing to the inherent complexity and challenges of analysing MS data.
Instead, to address a common question such as interspecies conservation, all current tools analyse each dataset independently before comparing them. Conclusions on conservation and differences are then drawn only after each individual dataset has been analysed. Whilst valid, a univariate approach is subject to loss of informative data, because peptides with too few observations are filtered out at the peptide level. Combining datasets provides additional statistical power by lending more observations to peptides common between species, such that peptides filtered out in a univariate analysis have more observations in the combined dataset and pass the filtering threshold. A multivariate approach also adds observations to proteins that already exceed the peptide count threshold; more observations increase statistical power and lend higher confidence to those common peptides. Lastly, employing customised statistical analyses such as the one employed within this project is time-consuming and requires bioinformatics expertise that an average user may not possess. ProteoMM performs the peptide-level differential expression analysis that all current MS tools perform, but across multiple proteomics datasets simultaneously. It is an R package that requires minimal R expertise and provides a robust, simple yet customisable pipeline for the analysis of multiple MS proteomics datasets.
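The effect of combining datasets on peptide-level filtering can be illustrated with a toy example. This is a sketch under stated assumptions rather than the ProteoMM filtering routine: the peptide identifiers, intensities and the minimum-observation threshold (`MIN_OBS`) are hypothetical, and peptides are assumed to be matched between species by identical identifiers.

```python
from typing import Dict, List, Optional, Set

Intensities = List[Optional[float]]  # None marks a missing observation


def n_observed(values: Intensities) -> int:
    """Count the non-missing intensities for one peptide."""
    return sum(v is not None for v in values)


def passing_peptides(data: Dict[str, Intensities], min_obs: int) -> Set[str]:
    """Peptides with at least min_obs observed intensities."""
    return {pep for pep, vals in data.items() if n_observed(vals) >= min_obs}


def combine(a: Dict[str, Intensities], b: Dict[str, Intensities]) -> Dict[str, Intensities]:
    """Pool observations for peptides shared by the two datasets."""
    return {pep: a.get(pep, []) + b.get(pep, []) for pep in set(a) | set(b)}


MIN_OBS = 4  # hypothetical filtering threshold

# Hypothetical log intensities across six runs per species.
human = {
    "PEP_A": [21.3, None, 20.9, None, None, None],   # 2 observations
    "PEP_B": [19.8, 20.1, 20.4, 19.9, None, 20.0],   # 5 observations
}
mouse = {
    "PEP_A": [22.0, 21.7, None, 21.9, None, None],   # 3 observations
    "PEP_C": [18.2, None, None, None, None, None],   # 1 observation
}

human_pass = passing_peptides(human, MIN_OBS)
mouse_pass = passing_peptides(mouse, MIN_OBS)
combined_pass = passing_peptides(combine(human, mouse), MIN_OBS)
```

In this toy case PEP_A fails the filter in each species analysed alone but passes once its human and mouse observations are pooled, whereas PEP_C remains excluded.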

III.3.2 Implementation of EigenMS normalisation

MS data are subject to inherent biases introduced by sample handling and by the instrument, such as run order. It is essential to downstream analyses that these biases be corrected in order to draw correct biological conclusions from the data. To overcome bias between technical replicates within the mouse and human data, EigenMS normalisation was employed. EigenMS is an adaptation of surrogate variable analysis (SVA), a normalisation method frequently applied to large-scale complex datasets such as microarray data, to overcome variation arising from technical factors without overfitting77. Like SVA, EigenMS removes systematic biases by using the singular value decomposition of residual peak intensities to find trends of significant variation within the datasets that are not attributable to the experimental factors of interest. It is a well-established method that is easily employed within a proteomics analysis pipeline, performs well in comparison to other normalisation methods (local regression, variance stabilisation, quantile, Progenesis and median normalisation) without over-correcting data, and is capable of dealing with wide-scale missing values62,78. EigenMS normalisation was applied to each dataset, and its performance was assessed by inspecting raw and residual trends. The script automatically generates visual representations of raw, residual and normalised trends and outputs the percent variance that each trend contributes to the whole dataset. This allows the user to control the level of normalisation and prevent overfitting of data (Figures 3.3 and 3.4).
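The core idea of the normalisation described above can be written compactly. The following is a simplified sketch rather than the exact EigenMS procedure: the symbols Y, Ŷ, R, d_k and K are notation introduced here for illustration, complete data are assumed, and the number of trends removed (K) is treated as an input.

```latex
% Requires amsmath. Notation is assumed for illustration only.
% Y        : peptides x samples matrix of log peak intensities
% \hat{Y}  : fit of the experimental factors of interest (e.g. group means)
% R        : residual intensities not explained by those factors
\begin{align}
R &= Y - \hat{Y} \\
R &= U D V^{\top} = \sum_{k} d_k \, u_k v_k^{\top} \\
\text{percent variance of trend } k &= 100 \times \frac{d_k^{2}}{\sum_{j} d_j^{2}} \\
Y_{\text{norm}} &= Y - \sum_{k=1}^{K} d_k \, u_k v_k^{\top}
\end{align}
```

Because the trends are estimated from the residuals only, removing them leaves the fitted group effects in Ŷ untouched, which is why the percent variance reported for each trend is a useful guide to how many trends to remove without overfitting.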

III.3.3 Missingness and the need for dataset-tailored imputation

Apart from unknown systematic biases, MS proteomics is commonly affected by missing data55,79. In MS, peptide intensity values may be missing for a number of reasons. In case one, the peptides are present but incorrectly identified. In case two, they are present but at an abundance beyond the detection capabilities of the instrument, while in case three they are not present at all. The danger of overestimating peptide abundances arises when observed values are used to impute missing values in these circumstances. Random missing values are a type of ‘missingness’ that affects a small proportion of peptides and results from stochastic fluctuations and random errors during the data acquisition process. Additionally, they can originate from inaccurate peak detection and deconvolution of co-eluting compounds; however, it is hard to distinguish which of these processes is responsible. For example, peptide precursor ions are detected in the first scan (MS1); peaks are then ranked and selected for fragmentation and MS2 in a time-dependent manner, so inaccurate or failed peak detection leads to random missing values. In case one, a peptide may be present in one sample and not in another simply because it was misidentified for the above-mentioned reasons80. The second case is a type of missingness that is not random and originates from values that fall below the limit of detection of the instrument56. Caution must therefore be taken when replacing missing values, because a missing value can arise in two scenarios. It may be random (case one), whereby there is no significant difference between experimental conditions, or biological (case two), whereby the peptide had low abundance in one condition and was therefore not detected. Left censoring of all peptides within this experiment may result in false positives, because it leans on the assumption that a missing observation reflects a lack of protein affinity for that probe.

As previously mentioned, single-digit replacement of left-censored values is a common method of imputation in MS proteomics. Its caveats include distortion in the distribution of missing values, underestimation of the standard deviation, and a potential increase in false positives within the dataset81. Newer methods such as k-nearest neighbours (KNN) and local least squares, along with various other local-similarity-based imputation methods explained in the introduction, have been developed and may reduce the amount of data filtered out, but they only work for abundance-dependent missingness. Additionally, these more complex imputation methods require some bioinformatics expertise, because each produces statistical outputs that the user needs to consider carefully prior to analysis.

Imputation within ProteoMM uses a method that combines single-digit replacement and local-similarity-based imputation to replace missing data more reliably and intuitively. The model generates ‘random’ values for missing intensities using an automated filtering routine in which formal concepts of information content from maximum likelihood theory guide the selection and exclusion of peptides in the analysis53. For example, the model assumes a normal distribution for all peptides belonging to a protein in a particular comparison group. These peptides are then modelled with both abundance-dependent and censoring-dependent likelihoods, resulting in a cumulative distribution function. This model has been tested on simulated and biological data at the protein and peptide level and produces unbiased results. Overfitting is addressed by inspecting the p-value distribution.
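To illustrate why a censoring-aware likelihood behaves differently from using observed values alone, the sketch below fits the mean of a normal distribution to a peptide with three observed and three missing intensities. It is not the ProteoMM model: it treats every missing value as left-censored at a single detection limit, fixes the standard deviation, and uses hypothetical intensities and names throughout.

```python
import math
from statistics import NormalDist, mean


def censored_loglik(mu, sigma, observed, n_missing, detection_limit):
    """Log-likelihood of a normal model with left-censored missing values.

    Observed intensities contribute the normal density; each missing value
    contributes the probability of falling below the detection limit (the
    cumulative distribution function evaluated at that limit).
    """
    dist = NormalDist(mu, sigma)
    loglik = sum(math.log(dist.pdf(x)) for x in observed)
    loglik += n_missing * math.log(dist.cdf(detection_limit))
    return loglik


# Hypothetical log intensities for one peptide in one comparison group.
observed = [20.1, 20.6, 21.0]
n_missing = 3
detection_limit = 19.5   # hypothetical instrument limit on the log scale
sigma = 1.0              # held fixed to keep the sketch short

# Simple grid search for the mean that maximises the censored likelihood.
grid = [17.0 + 0.01 * i for i in range(601)]
mu_hat = max(
    grid,
    key=lambda m: censored_loglik(m, sigma, observed, n_missing, detection_limit),
)
naive_mean = mean(observed)
```

Under these assumptions the censored estimate `mu_hat` is lower than `naive_mean`, reflecting the point made above that imputing from observed values alone risks overestimating abundance when values are missing because they fall below the detection limit.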

III.3.4 External experimental considerations

Whilst both human and mouse contain similar levels of missing values per context (Figure 3.2), the mouse datasets contained significantly more peptide observations and proteins. External experimental factors, such as the type of tissue (which differed by brain region between mouse and human) as well as tissue age and state, contributed to mouse and human protein heterogeneity. Experimental analysis of frontal cortex was possible for human only, because humans possess a large, well-developed frontal cortex that provides sufficient material for the pull-down assays. Whole brain was the only feasible tissue source for the mouse experiments in order to obtain enough material. Secondly, sample preparation quality could have significantly affected data quality. Human post-mortem brain tissue is commonly obtained after a significant post-mortem interval and frozen for prolonged periods extending into decades, whilst the mouse tissue was used fresh or within a few days of preservation. Whilst these factors were unavoidable, they may explain some of the differences observed between the human and mouse datasets, and most likely account for the observed differences in protein enrichment. As stated above in section III.3.1, human and mouse share highly similar protein sequences with similar functions. Enrichment differences are therefore most likely not biological, but reflect differences in peptide abundance arising from protein extracts that differ by tissue source and age of preservation. The type and preservation state of tissue would also affect protein heterogeneity in the DNA pull-downs. Whole-brain protein extracts are derived from more cell types than frontal cortex, resulting in a higher number of proteins, which may explain the greater number of DBD-containing proteins observed in the mouse dataset.
Differences in protein extract quality could be attributed to the mouse samples having been prepared from fresh or newly frozen tissue, compared with the longer post-mortem interval and prolonged frozen storage of the human frontal cortex, which affect protein integrity and stability. As a result, and in line with observations, the mouse dataset contained a larger number of unique proteins and more peptides per protein, enabling more accurate analysis that resulted in a greater proportion of significantly called proteins.

III.3.5 Challenges in benchmarking ProteoMM

ProteoMM was assessed by comparing its output to that of other proteomic statistical methods and to external data. Perseus was chosen for comparison because it is a widely used proteomics analysis platform, implemented by many research groups owing to its relatively short analysis pipeline and ease of implementation. In addition, the MS output files were readily compatible with the input formats accepted by Perseus. To conduct a fair comparison, analysis was done on the same two datasets using raw peptide intensity information. An additional Perseus analysis using LFQ values was included to compare raw intensity-based analysis against LFQ intensities, an accepted and published input format developed for quantitative analysis of MS data47. There was a need to compare each method to an external source of mCG binders in a methodical and robust manner. The only existing dataset that has comprehensively profiled mCG and CG binders was from a repository based on methyl-sensitive SELEX experiments. SELEX-based experiments offer many advantages over MS-based affinity studies, enabling detection of reliable, high-affinity in vitro DNA binding sequence information, but are limited to DNA binders only. Therefore, SELEX only enabled comparisons of DBD-containing proteins, but was nevertheless adequate in providing an indiscriminate external source of information against which to compare each analysis output.

III.3.6 Comparative assessment of ProteoMM and Perseus based analyses

The log2FC distributions for each method were plotted as violin plots in Figure 3.7. As expected, all three outputs generate log2FC distributions with their highest density around 0, indicative of background interactions. Perseus LFQ and ProteoMM contain the largest proportion of log2FC values around zero for the mouse and human mCG/CG datasets.

Additionally, all methods have similar log2FC distributions when comparing the first and third quartiles, suggesting that the distributions of log2FC values produced by each method are similar between human and mouse. The first protein-based comparison focused on a subset of high-confidence readers that were enriched for mCG in human and mouse. All common mCG readers for each method were tabulated, compared on a per-protein basis and clustered by observations within each method to determine how ProteoMM performs the combined dataset analysis and to assess the overlap with each Perseus method (Table 3.2). Cluster one consisted of proteins enriched in all methods of analysis and made up the majority of proteins, indicating that, to a large extent, all three methods perform similarly. Secondly, KLF12, unlike the other proteins within the cluster, was not analysed in the combined analysis but within mouse-limited and P/A human, highlighting the multifaceted approach of ProteoMM. KLF12 did not have sufficient peptide observations within the human data and was therefore not subject to DE analysis within the combined dataset. Instead, KLF12 was analysed as part of human P/A and its mouse counterpart was shuffled to mouse-limited. This was a positive outcome because it demonstrated that, in cases of ‘few-no’ observations in one species, the combined dataset integration successfully shuffled these proteins to P/A and their counterparts to the species-limited datasets whilst maintaining significance in both species. This trend was also observed for FOXO3 and KLF16. ProteoMM, through P/A, indicates that these proteins are very strongly repelled by or attracted to CG and mCG respectively, information that is lost in both Perseus-based methods. Observations in human-limited or mouse-limited would have indicated a problem with ProteoMM at the combined dataset integration level. Observing human-limited or mouse-limited significant proteins in P/A-only cases therefore confirmed that ProteoMM was matching and analysing the combined proteins as expected. Global comparisons to SELEX indicate that ProteoMM and Perseus perform similarly in identifying CG binders, especially in the human data, where methyl-minus is correctly assigned to all identified proteins from a log2FC as low as -0.6 (Figure 3.9). The mCG binder calls are more complex; however, at log2FC values of ~1.2, all methods seem to call the same proportions of proteins as methyl-plus and “other”. Values of log2FC ≤ -1.2 and log2FC ≥ 1.2 were therefore chosen as significance thresholds for CG and mCG readers respectively, for all datasets.
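The reasoning used to settle on a log2FC threshold, namely checking how the proportion of SELEX-concordant calls changes as the cut-off is raised, can be sketched as a simple sweep. This is an illustration only: the protein identifiers, log2FC values and SELEX labels below are invented placeholders and not results from this study, and the function name is hypothetical.

```python
def concordance_by_threshold(log2fc_by_protein, selex_label, thresholds):
    """For each threshold, count SELEX-annotated proteins called as mCG
    binders (log2FC >= threshold) and the fraction labelled methyl-plus.

    Returns a list of (threshold, n_called, fraction_methyl_plus) tuples;
    the fraction is None when no annotated protein passes the threshold.
    """
    rows = []
    for t in thresholds:
        called = [
            p for p, fc in log2fc_by_protein.items()
            if fc >= t and p in selex_label
        ]
        n_plus = sum(selex_label[p] == "methyl-plus" for p in called)
        fraction = n_plus / len(called) if called else None
        rows.append((t, len(called), fraction))
    return rows


# Placeholder data: P7 has no SELEX annotation and is therefore ignored.
log2fc_by_protein = {
    "P1": 2.1, "P2": 1.5, "P3": 1.3, "P4": 0.9,
    "P5": 0.7, "P6": 0.2, "P7": 1.0,
}
selex_label = {
    "P1": "methyl-plus", "P2": "methyl-plus", "P3": "other",
    "P4": "methyl-minus", "P5": "methyl-plus", "P6": "methyl-minus",
}

rows = concordance_by_threshold(log2fc_by_protein, selex_label, [0.0, 0.6, 1.2])
```

Raising the threshold trades the number of called proteins against the fraction that agree with the external annotation, which is the balance between coverage and background discussed in this section.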

Differences between methods, illustrated in clusters 2-5 of Table 3.2, correspond to cases in which significance in human and mouse mCG was observed in some, but not all, of the three analysis methods. To understand the outcomes of the comparisons and provide reasoning based on the raw data, the raw peptides were plotted in Figure 3.8 for a select number of cases from Table 3.2. To understand how ProteoMM functioned in these cases, the normalised peptides and the normalised and imputed peptides were plotted alongside the corresponding raw peptides. Cluster 2, for example, contained proteins that were significant in both Perseus methods but for which ProteoMM exhibited combined-human only significance (RFXAP) or no significance in mouse or human (ZFP91). Raw peptides for ZFP91 in human and mouse are shown in Figure 3.8B and show enrichment of ZFP91 for mCG that is more apparent in human than in mouse. This is reflected in the log2FC values assigned to each species by ProteoMM: 1 in human and 0.6 in mouse. ZFP91 highlights an important concept in the analysis of MS data, enrichment versus significance. Whilst enriched in both species, ZFP91 did not meet the significance threshold set for the combined analysis. When analysing large datasets with complex patterns of observations, it is important to consider the threshold carefully and to recognise that it is a balance between obtaining the highest-confidence readers and the lowest amount of background. This is reinforced further by the raw peptides of NKX2-2, which was significant in combined-mouse and Perseus by raw intensity. From the peptide plots, the significance in mouse is immediately apparent, whereas within human the situation seems more balanced. The respective log2FC values for human and mouse are -0.8 and 1.9.

Some important differences arise between the methods when taking a global approach and comparing each analysis to a methyl-sensitive SELEX-based repository of transcription factors. Firstly, it is apparent that analyses using raw intensities, by both ProteoMM and Perseus, result in more matched transcription factors, potentially indicating a higher level of coverage (protein identification). This result is consistent with a previous study comparing label-free methods to raw intensity-based MS49. Between the methods utilising raw intensities, ProteoMM identifies ~2.5× more transcription factors than Perseus. Whilst this number drops significantly as a function of log2FC for ProteoMM, it still outperforms both Perseus-based methods in terms of the proportion of correctly called transcription factors and the number of transcription factors at that log2FC. This holds true for mCG and CG in the human and mouse datasets.

In summary, whilst it was challenging to design a comparison to assess the performance of ProteoMM, this was successfully achieved. Comparisons were made with Perseus by utilising the same datasets and comparing a subset of proteins within the combined analysis, allowing per-protein inspection of how each tool performs. In addition, the outputs of each analysis were compared globally to a previously published dataset from independent SELEX-based experiments that comprehensively profiled, and generated a repository of, methylation-dependent binding data for DNA binding proteins. The results indicate that ProteoMM successfully integrates multiple datasets into a multivariate analysis whilst maintaining limited datasets for proteins only observed in one of the integrated datasets (and in P/A cases). Secondly, the use of ProteoMM dramatically increases the number of proteins discovered whilst maintaining the same levels of background relative to bona fide interactors as all methods tested. In comparing ProteoMM to all published work regardless of experimental or analysis source, ProteoMM reliably identified proteins in accordance with published data, with very few discrepancies. In addition, many novel mCG binders and interactors that have not been identified before were identified using ProteoMM. Explanations for discrepancies and novel mCG interactors will be discussed in Chapter 4, which focuses on identified and novel mCG and CG readers.

III.3.7 Overall validation of ProteoMM by comparisons with externally published data

An important validation of the pull-down and of the successful implementation of ProteoMM relied on reference and database searches that characterised the binding behaviour of each protein. The reference tables containing dataset, protein alias and reference source information are Tables S4.1-S4.5, as part of Chapter 4. Figure 3.10 contains a summary of the proteins within each mCG/CG dataset output from ProteoMM. These proteins are intersected with proteins that have been experimentally determined to bind methylated or unmethylated CG dinucleotide DNA sequences in a variety of other studies. Black bars and nodes represent novel proteins that will be discussed in Chapter 4, which pertains to novel mCG readers identified in the human and mouse brain. Of relevance are the blue bars and nodes, which depict intersections of significantly called DBD-containing proteins within each dataset that align with binding classifications observed in previously conducted experiments. Red bars represent proteins identified by ProteoMM whose classification conflicts with a previous study’s characterisation of that protein. As expected, common mCG, the high-confidence list, contains the largest number of proteins that have been experimentally determined to bind mCG, followed by mouse-limited mCG, combined CG and combined-human mCG. The common set of proteins represents a list of the highest-confidence readers, because ProteoMM incorporates more peptides from both datasets within the analysis, increasing the statistical power. This is apparent when looking at the list of mCG readers in the combined analysis. Of the proteins called as significant in the combined analysis for mCG, human and mouse had 30 and 18 mCG readers, respectively, of which 15 were common. Within this set of 15, 13 have previously been experimentally verified as bona fide mCG readers, and the set contains the highest number of intersections with previously published data when compared across datasets.
Within the combined mCG analysis, no proteins displayed inconsistencies with published data, and two potentially novel mCG binders were identified (Figure 3.10). The well-characterised mC reader proteins MECP2 and MBD2, belonging to the MBD family, and the BTB/POZ member ZBTB4 were among the common mCG readers. Within the significantly called combined CG reader list, 21 and 19 CG binders were enriched for human and mouse respectively, of which 16 were enriched in both species. Common CG contained 10 intersections that aligned with previously published data, second only to common mCG. Surprisingly, common CG contained one protein, ARNT2, that conflicted with external studies75. Lastly, within the combined analysis is DDB2, significantly enriched for mouse CG but displaying no mCG enrichment in human (not at levels deemed significant). DDB2 is an interesting example of a protein that exhibited differential binding enrichment for the probes between species. Previous binding studies implicate DDB2 in mCG recognition, which is at odds with the results of the mouse pull-down. The molecular reasons behind the binding behaviours exhibited by ARNT2 and DDB2 will be discussed further in Chapter 4.
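The intersection counts reported for the combined mCG analysis (30 human readers, 18 mouse readers, 15 common, of which 13 were previously verified) follow from ordinary set operations of the kind summarised in Figure 3.10. The sketch below reproduces only the arithmetic; the protein identifiers are placeholders, not the actual protein lists.

```python
# Placeholder identifiers sized to match the counts reported in the text.
common_ids = {f"common_{i}" for i in range(15)}
human_mcg = common_ids | {f"human_only_{i}" for i in range(15)}
mouse_mcg = common_ids | {f"mouse_only_{i}" for i in range(3)}

# Placeholder for the common readers with prior experimental support.
previously_verified = {f"common_{i}" for i in range(13)}

common = human_mcg & mouse_mcg             # significant in both species
human_only = human_mcg - mouse_mcg         # significant in human only
mouse_only = mouse_mcg - human_mcg         # significant in mouse only
all_readers = human_mcg | mouse_mcg        # significant in either species

agree = common & previously_verified       # concordant with published data
novel = common - previously_verified       # no published reference found
```

With these sizes, 15 readers are significant in human only, 3 in mouse only, and the common set splits into 13 previously verified and 2 potentially novel readers, matching the figures given above.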

Human-limited and mouse-limited datasets were analysed independently of one another, similar to the standard approach to MS analysis. Owing to a lack of peptide observations in the corresponding species, proteins enriched within these datasets have lower statistical power behind their calls. Indeed, the dataset with the most conflicting intersections with previously published data is the human-limited dataset. For example, of the 20 mCG readers identified in human, 3 have been documented as binders of unmethylated DNA. Among the discrepancies in human are MAFF and BHLHE40, which were identified as mCG binders in human-limited but are classified as DNA binders repelled by mC in SELEX-based experiments, and, contentiously, ZBTB1472. As mentioned above, the possible molecular mechanisms underlying some of these discrepancies will be discussed in Chapter 4. The higher number of conflicts within this dataset arises from two major technical limitations, namely sample input and MS analysis. As mentioned, the human tissue was subject to a post-mortem interval and was frozen for prolonged periods, potentially reducing dataset quality in terms of protein coverage and missing values. The latter could significantly complicate the normalisation and imputation of this dataset. Secondly, the human-limited and mouse-limited datasets are constrained by peptide information originating from one species instead of two, slightly reducing the confidence of called, enriched proteins. Whilst mouse-limited was also reliant on univariate statistical analysis, the dataset was of a higher quality, possibly due to fresh input material. This may explain why, for example, TAGLN3 was enriched for human mCA and mouse CA.
In general, whilst containing slightly more inconsistencies with published data, the human dataset is still a reliable repository of proteins, validated by the number of previously identified DNA binders and protein interactors that were enriched for CG and mCG respectively.

In summary, the power of multivariate analysis and the conserved role of particular readers within the common dataset contribute to a high-confidence mCG and CG reader list. The human-limited and mouse-limited datasets do not necessarily provide species-specific proteins, because it is not within the limits of current MS technology to capture the entire proteome within a single experiment. It is more likely that differences between mouse and human are the result of missing observations within datasets, which are common to MS experiments, and/or reflect differences in tissue type, age and preservation state. Nevertheless, significant proteins observed in each species can still be informative and can expand the repository of known mC readers, or readers repelled by mC, within each species. From the comparisons tested, ProteoMM outperforms both Perseus methods in the number of proteins identified and the proportion of correctly called proteins. To our knowledge, it is the first multivariate analysis tool available for the analysis of MS data and, through an automated, easily employable R script, conducts more sophisticated normalisation and imputation procedures that increase the reliability of differentially called readers.

References

1. Wilkins, M. R. et al. From Proteins to Proteomes: Large Scale Protein Identification by Two-Dimensional Electrophoresis and Amino Acid Analysis. Biotechnology 14, 61 (1996).
2. Botelho, D. et al. Top-down and bottom-up proteomics of SDS-containing solutions following mass-based separation. J. Proteome Res. 9, 2863–2870 (2010).
3. McLafferty, F. W. et al. Top-down MS, a powerful complement to the high capabilities of proteolysis proteomics. FEBS J. 274, 6256–6268 (2007).
4. Smith, L. M., Kelleher, N. L. & Consortium for Top Down Proteomics. Proteoform: a single term describing protein complexity. Nat. Methods 10, 186–187 (2013).
5. Compton, P. D., Zamdborg, L., Thomas, P. M. & Kelleher, N. L. On the scalability and requirements of whole protein mass spectrometry. Anal. Chem. 83, 6868–6874 (2011).
6. Tolmachev, A. V., Robinson, E. W., Wu, S., Paša-Tolić, L. & Smith, R. D. FT-ICR MS optimization for the analysis of intact proteins. Int. J. Mass Spectrom. 281, 32–38 (2009).
7. Karas, M., Bachmann, D., Bahr, U. & Hillenkamp, F. Matrix-assisted ultraviolet laser desorption of non-volatile compounds. Int. J. Mass Spectrom. Ion Process. 78, 53–68 (1987).
8. Aebersold, R. & Mann, M. Mass spectrometry-based proteomics. Nature 422, 198–207 (2003).
9. Duncan, M. W., Nedelkov, D., Walsh, R. & Hattan, S. J. Applications of MALDI Mass Spectrometry in Clinical Chemistry. Clin. Chem. 62, 134–143 (2016).
10. Shevchenko, A. et al. Linking genome and proteome by mass spectrometry: large-scale identification of yeast proteins from two dimensional gels. Proceedings of the National Academy of Sciences 93, 14440–14445 (1996).
11. Gessel, M. M., Norris, J. L. & Caprioli, R. M. MALDI imaging mass spectrometry: spatial molecular analysis to enable a new age of discovery. J. Proteomics 107, 71–82 (2014).
12. Meyer, K. & Ueland, P. M. Use of matrix-assisted laser desorption/ionization time-of-flight mass spectrometry for multiplex genotyping. Adv. Clin. Chem. 53, 1–29 (2011).
13. Ofori-Acquah, S. F. et al. Mass Spectral Analysis of Asymmetric Hemoglobin Hybrids: Demonstration of Hb FS (α2γβS) in Sickle Cell Disease. Anal. Biochem. 298, 76–82 (2001).
14. Gale, D. C. Small Volume and Low Flow-Rate Electrospray Ionization Mass Spectrometry of Aqueous Samples; 1993. Rapid Commun. Mass Spectrom.
15. Emmett, M. R. & Caprioli, R. M. Micro-electrospray mass spectrometry: Ultra-high-sensitivity analysis of peptides and proteins. J. Am. Soc. Mass Spectrom. 5, 605–613 (1994).
16. Ganem, B., Li, Y. T. & Henion, J. D. Detection of noncovalent receptor-ligand complexes by mass spectrometry. J. Am. Chem. Soc. 113, 6294–6296 (1991).
17. Rosu, F., Gabelica, V., Houssier, C. & De Pauw, E. Determination of affinity, stoichiometry and sequence selectivity of minor groove binder complexes with double-stranded oligodeoxynucleotides by electrospray ionization mass spectrometry. Nucleic Acids Res. 30, e82–e82 (2002).
18. Huang, Q., Mao, S., Khan, M., Zhou, L. & Lin, J.-M. Dean flow assisted cell ordering system for lipid profiling in single-cells using mass spectrometry. Chem. Commun. 54, 2595–2598 (2018).
19. Earl, D. C. et al. Discovery of human cell selective effector molecules using single cell multiplexed activity metabolomics. Nat. Commun. 9, 39 (2018).
20. Frese, C. Development and Application of Novel Electron Transfer Dissociation-based Technologies for Proteomics. (Utrecht University, 2013).
21. Creese, A. J. & Cooper, H. J. Liquid chromatography electron capture dissociation tandem mass spectrometry (LC-ECD-MS/MS) versus liquid chromatography collision-induced dissociation tandem mass spectrometry (LC-CID-MS/MS) for the identification of proteins. J. Am. Soc. Mass Spectrom. 18, 891–897 (2007).
22. Sleno, L. & Volmer, D. A. Ion activation methods for tandem mass spectrometry. J. Mass Spectrom. 39, 1091–1112 (2004).
23. Olsen, J. V. et al. Higher-energy C-trap dissociation for peptide modification analysis. Nat. Methods 4, 709–712 (2007).
24. Jedrychowski, M. P. et al. Evaluation of HCD- and CID-type fragmentation within their respective detection platforms for murine phosphoproteomics. Mol. Cell. Proteomics 10, M111.009910 (2011).
25. Alberts, B. & Miake-Lye, R. Unscrambling the puzzle of biological machines: the importance of the details. Cell 68, 415–420 (1992).
26. Blackstock, W. P. & Weir, M. P. Proteomics: quantitative and physical mapping of cellular proteins. Trends Biotechnol. 17, 121–127 (1999).
27. Alberts, B. The cell as a collection of protein machines: preparing the next generation of molecular biologists. Cell 92, 291–294 (1998).
28. Goh, C.-S., Milburn, D. & Gerstein, M. Conformational changes associated with protein–protein interactions. Curr. Opin. Struct. Biol. 14, 104–109 (2004).
29. Fromont-Racine, M., Rain, J.-C. & Legrain, P. Building protein-protein networks by two-hybrid mating strategy. Methods Enzymol. 350, 513–524 (2002).
30. Ito, T. et al. Toward a protein-protein interaction map of the budding yeast: a comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proceedings of the National Academy of Sciences 97, 1143–1147 (2000).
31. Ong, S.-E. & Mann, M. Mass spectrometry–based proteomics turns quantitative. Nat. Chem. Biol. 1, 252–262 (2005).
32. Puig, O. et al. The tandem affinity purification (TAP) method: a general procedure of protein complex purification. Methods 24, 218–229 (2001).
33. Cristea, I. M., Williams, R., Chait, B. T. & Rout, M. P. Fluorescent proteins as proteomic probes. Mol. Cell. Proteomics 4, 1933–1941 (2005).
34. Stadler, C. et al. Immunofluorescence and fluorescent-protein tagging show high correlation for protein localization in mammalian cells. Nat. Methods 10, 315–323 (2013).
35. Selbach, M. & Mann, M. Protein interaction screening by quantitative immunoprecipitation combined with knockdown (QUICK). Nat. Methods 3, 981–983 (2006).
36. Wang, J. et al. A protein interaction network for pluripotency of embryonic stem cells. Nature 444, 364–368 (2006).
37. Smits, A. H. & Vermeulen, M. Exploring Chromatin Readers Using High-Accuracy Quantitative Mass Spectrometry-Based Proteomics. in Systems Analysis of Chromatin-Related Protein Complexes in Cancer (eds. Emili, A., Greenblatt, J. & Wodak, S.) 133–148 (Springer New York, 2014).
38. Ranish, J. A. et al. The study of macromolecular complexes by quantitative proteomics. Nat. Genet. 33, 349–355 (2003).
39. Vermeulen, M., Hubner, N. C. & Mann, M. High confidence determination of specific protein–protein interactions using quantitative mass spectrometry. Curr. Opin. Biotechnol. 19, 331–337 (2008).
40. Gygi, S. P. et al. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat. Biotechnol. 17, 994–999 (1999).
41. Colangelo, C. M. & Williams, K. R. Isotope-coded affinity tags for protein quantification. Methods Mol. Biol. 328, 151–158 (2006).
42. Brand, M. et al. Dynamic changes in transcription factor complexes during erythroid differentiation revealed by quantitative proteomics. Nat. Struct. Mol. Biol. 11, 73–80 (2004).
43. Vermeulen, M. Identifying chromatin readers using a SILAC-based histone peptide pull-down approach. Methods Enzymol. 512, 137–160 (2012).
44. Spruijt, C. G. et al. Dynamic readers for 5-(hydroxy)methylcytosine and its oxidized derivatives. Cell 152, 1146–1159 (2013).
45. Roy, S. M. & Becker, C. H. Quantification of proteins and metabolites by mass spectrometry without isotopic labeling. Methods Mol. Biol. 359, 87–105 (2007).
46. Zybailov, B. et al. Statistical analysis of membrane proteome expression changes in Saccharomyces cerevisiae. J. Proteome Res. 5, 2339–2347 (2006).
47. Cox, J. et al. Accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. Mol. Cell. Proteomics 13, 2513–2526 (2014).
48. Smits, A. H., Jansen, P. W. T. C., Poser, I., Hyman, A. A. & Vermeulen, M. Stoichiometry of chromatin-associated protein complexes revealed by label-free quantitative mass spectrometry-based proteomics. Nucleic Acids Res. 41, e28 (2013).
49. Li, Z. et al. Systematic comparison of label-free, metabolic labeling, and isobaric chemical labeling for quantitative proteomics on LTQ Orbitrap Velos. J. Proteome Res. 11, 1582–1590 (2012).
50. Wang, W. et al. Quantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards. Anal. Chem. 75, 4818–4826 (2003).
51. Oberg, A. L. et al. Statistical analysis of relative labeled mass spectrometry data from complex samples using ANOVA. J. Proteome Res. 7, 225–233 (2008).
52. Lazar, C., Gatto, L., Ferro, M., Bruley, C. & Burger, T. Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies. J. Proteome Res. 15, 1116–1125 (2016).
53. Karpievitch, Y. et al. A statistical framework for protein quantitation in bottom-up MS-based proteomics. Bioinformatics 25, 2028–2034 (2009).
54. Karpievitch, Y. V., Dabney, A. R. & Smith, R. D. Normalization and missing value imputation for label-free LC-MS analysis. BMC Bioinformatics 13 Suppl 16, S5 (2012).
55. Webb-Robertson, B.-J. M. et al.
Review, evaluation, and discussion of the challenges of missing value imputation for mass spectrometry-based label-free global proteomics. J. Proteome Res. 14, 1993–2001 (2015). 56. Troyanskaya, O. et al. Missing value estimation methods for DNA microarrays. Bioinformatics 17, 520–525 (2001). 57. Polpitiya, A. D. et al. DAnTE: a statistical tool for quantitative analysis of -omics data. Bioinformatics 24, 1556–1558 (2008). 58. Clough, T., Thaminy, S., Ragg, S., Aebersold, R. & Vitek, O. Statistical protein quantification and significance analysis in label-free LC-MS experiments with complex designs. BMC Bioinformatics 13 Suppl 16, S6 (2012). 59. Deeb, S. J., D’Souza, R. C. J., Cox, J., Schmidt-Supprian, M. & Mann, M. Super-SILAC allows classification of diffuse large B-cell lymphoma subtypes by their protein expression profiles. Mol. Cell. Proteomics 11, 77–89 (2012).

III-129 60. Oh, S., Kang, D. D., Brock, G. N. & Tseng, G. C. Biological impact of missing-value imputation on downstream analyses of gene expression profiles. Bioinformatics 27, 78– 86 (2011). 61. Tyanova, S. et al. The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat. Methods 13, 731 (2016). 62. Karpievitch, Y. V. et al. Normalization of peak intensities in bottom-up MS-based proteomics using singular value decomposition. Bioinformatics 25, 2573–2580 (2009). 63. Gatto, L. & Lilley, K. S. MSnbase-an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics 28, 288– 289 (2012). 64. Taverner, T. et al. DanteR: an extensible R-based tool for quantitative analysis of -omics data. Bioinformatics 28, 2404–2406 (2012). 65. Smedley, D. et al. The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic Acids Res. 43, W589–98 (2015). 66. Taylor, S. L., Ruhaak, L. R., Weiss, R. H., Kelly, K. & Kim, K. Multivariate two-part statistics for analysis of correlated mass spectrometry data from multiple biological specimens. Bioinformatics 33, 17–25 (2017). 67. Karpievitch, Y. V., Nikolic, S. B., Wilson, R., Sharman, J. E. & Edwards, L. M. Metabolomics data normalization with EigenMS. PLoS One 9, e116221 (2014). 68. Deaton, A. M. & Bird, A. CpG islands and the regulation of transcription. Genes Dev. 25, 1010–1022 (2011). 69. Du, Q., Luu, P.-L., Stirzaker, C. & Clark, S. J. Methyl-CpG-binding domain proteins: readers of the epigenome. Epigenomics 7, 1051–1073 (2015). 70. Beck, S. et al. CpG island-mediated global gene regulatory modes in mouse embryonic stem cells. Nat. Commun. 5, 5490 (2014). 71. Lister, R. et al. Global epigenomic reconfiguration during mammalian brain development. Science 341, 1237905 (2013). 72. Yin, Y. et al. Impact of cytosine methylation on DNA binding specificities of human transcription factors. 
Science 356, (2017). 73. Bartels, S. J. J., Spruijt, C. G., Brinkman, A. B. & Jansen, P. A SILAC-based screen for Methyl-CpG binding proteins identifies RBP-J as a DNA methylation and sequence- specific binding protein. PLoS One (2011). 74. Bartke, T. et al. Nucleosome-interacting proteins regulated by DNA and histone methylation. Cell 143, 470–484 (2010). 75. Hu, S. et al. DNA methylation presents distinct binding sites for human transcription factors. Elife 2, e00726 (2013). 76. Makałowski, W., Zhang, J. & Boguski, M. S. Comparative analysis of 1196 orthologous

III-130 mouse and human full-length mRNA and protein sequences. Genome Res. 6, 846–857 (1996). 77. Leek, J. T. & Storey, J. D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, 1724–1735 (2007). 78. Välikangas, T., Suomi, T. & Elo, L. L. A systematic evaluation of normalization methods in quantitative label-free proteomics. Brief. Bioinform. 19, 1–11 (2018). 79. Wang, X., Anderson, G. A., Smith, R. D. & Dabney, A. R. A hybrid approach to protein differential expression in mass spectrometry-based proteomics. Bioinformatics 28, 1586–1591 (2012). 80. O’Brien, J. J. et al. The effects of nonignorable missing data on label-free mass spectrometry proteomics experiments. Ann. Appl. Stat. 12, 2075–2095 (2018). 81. Gelman, A. & Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models. (Cambridge University Press, 2006).

III-131

Identification of novel mCG readers in human and mouse brain

Summary

In mammals, DNA methylation in the CG context (mCG) is critical to many biological processes, including transcriptional repression, genome integrity, and suppression of transposable elements. Much of our understanding of the function of mCG is linked to the characterisation and disruption of proteins that can “read” mCG. mCG readers are proteins within certain protein families that exhibit affinity for mCG. The classical mCG reader families comprise the methyl-CpG binding domain (MBD), SET and RING-associated (SRA), and Broad-Complex, Tramtrack and Bric-a-brac or Pox virus and Zinc finger (BTB/POZ) domain protein families. Recent advances in proteomics combined with enrichment methods now enable rapid identification of mCG readers, and have highlighted a diverse network of transcription factors and interactors involved in the direct binding of mCG and in modulating various cellular processes, including transcriptional repression and splicing. To screen for potential mCG readers within the mouse brain, and to assess their conservation between species, DNA pull-downs coupled to mass spectrometry were employed. The use of methylated and unmethylated DNA probes in the CG context identified many proteins that bound to, or were repelled by, mCG. These DNA-binding domain (DBD)-containing proteins and various protein interactors were subjected to a combined analysis to identify proteins with high affinity for mCG in both human and mouse. A separate species-limited analysis was also employed to identify proteins enriched in either species. This study constitutes the first mCG reader screen employed in both human and mouse brain and provides a repository of mCG readers within both, upon which future studies can build.

Introduction

Transcription factors modulate changes in gene expression through sequence-specific DNA binding. DNA methylation provides an additional layer of complexity, as it can modulate the binding of proteins with specialised mC-binding domains or interfere with DNA binding. Gene activation and repression rely on a controlled interplay between methylation, chromatin states, and transcription factor occupancy. The characterisation of mCG readers has therefore been the subject of much investigation, providing opportunities to understand the functions of proteins with mCG-binding capabilities within the context of development and disease. The importance of mCG readers in mammals is demonstrated by several observations. Genomes of all vertebrates are globally methylated, and this is coupled with an expansion in the numbers of proteins that read and write DNA methylation1. This indicates that the mCG mark is a fundamental constituent of mammalian genomes, and is supported by knockout studies showing that mCG deposition and its readout are crucial to the correct regulation of biological processes2–5. Importantly, DNA methylation is actively regulated by deposition, maintenance and removal mechanisms. The importance of the mCG readout is also evident from the high level of conservation of MBD proteins, which extends beyond mammals. For example, the 12 MBD proteins identified in Arabidopsis were characterised by comparison with the human MBD sequences found in MBD1, MBD2, and methyl-CpG binding protein 2 (MECP2), whilst Xenopus laevis MBD3 was identified based on amino acid sequence similarity to its human counterpart6,7.

IV.1.1 Epigenetic regulation of methylated CGIs

The methylation of CG dinucleotides has traditionally been associated with transcriptional repression8. The recruitment of mCG readers acts to sterically hinder transcriptional processes or to influence chromatin architecture through co-repressor recruitment, thereby inhibiting transcription9,10. Each MBD family member may recruit distinct repressive complexes upon binding to DNA11–13. Recruitment is driven by specialised domains, such as the MBD or secondary domains present within the MBD family, for example the transcriptional repression domain (TRD). In addition to binding mCG through its MBD, MBD1 mediates repression through interaction with SUV39H1 (Figure 4.1), which methylates lysine 9 of histone H3 (H3K9), and with heterochromatin protein 1 (HP1)14. However, repression of MBD1 target genes occurs primarily through interactions of the TRD with proteins that direct histone methylation/deacetylation and maintain heterochromatin13,15. For example, the TRD of MBD1 associates with MBD1-containing chromatin-associated factor 1 (MCAF1)/ATFa-associated modulator (AM) and directs trimethylation of H3K9 (Figure 4.1) upon recruitment of the methyltransferase SETDB116–18. Similarly, MBD2 binds to mCG and relies on associations with repressive complexes like NuRD to inhibit transcription.

The Nucleosome Remodelling and Deacetylase (NuRD) complex connects histone deacetylase and ATP-dependent chromatin remodelling activities with gene repression19,20. NuRD is essential in DNA damage repair, genomic stability, and chromatin organisation7,21. Members of the NuRD complex include HDAC1/2, MTA1/2/3, RBBP4/7, GATAD2A/B, MBD2/3, and CHD3/422. The MTA proteins act as scaffolds for the arrangement of the complex, permitting nucleosome interactions with the RBBPs23,24. Gene repression, displayed in Figure 4.1, is mediated by HDACs in both a methylation-dependent and a methylation-independent manner, through removal of acetyl groups from modified lysine residues on histone tails25. The presence of NuRD family members in organisms like Drosophila and C. elegans, which lack DNA methylation, indicates that these complexes are well conserved and can function independently of DNA methylation26,27. Whether MBD2 could recruit NuRD by methylation-dependent mechanisms therefore initially remained uncertain12. Evidence for methylation-dependent NuRD activity was first provided by Feng and Zhang, who demonstrated a biochemical link between nucleosome remodelling, histone deacetylation, and DNA methylation-dependent gene repression28. Despite some initial speculation in the literature as to whether MBD2 binding to mCG resulted in complete NuRD assembly, it is now widely accepted that MBD2 defines a distinct NuRD complex that is recruited to methylated loci. Unlike MBD3, MBD2 exhibits binding at methylated CG islands (CGIs) genome-wide29, suggesting that methylation-dependent binding by MBD2 couples NuRD recruitment to a discrete set of loci to achieve transcriptional repression. In line with this, knockout of MBD3, but not MBD2, is embryonic lethal, highlighting the importance of each member in regulating diverse biological processes30. It is now accepted that MBD3 mediates methylation-independent repression (Figure 4.1) by binding to unmethylated loci, and thus complements the regulation of methylated loci bound by MBD2.

Figure 4.1: Binding of MBDs to DNA results in complex recruitment that induces transcriptional repression. MBD1-mediated repression by trimethylation of H3K9 occurs through MCAF1 or SUV39H1 complex recruitment (adapted from17,31). The MBD2 and MBD3 NuRD complexes are mutually exclusive12,32. The figure shows the common core NuRD complex members that are recruited to methylated DNA by MBD2 or to unmethylated DNA by MBD3. Subsequent histone deacetylation results in gene repression (adapted from29).

HDAC1/2 and RBBP4/7 have also been identified in a large histone deacetylase complex known as the Sin3 complex. As with NuRD, Sin3 recruitment to chromatin is mediated by DNA-binding proteins to facilitate histone deacetylation and transcriptional repression (Figure 4.2). Sin3 functions predominantly as a repressor, being recruited to the promoters of actively transcribed genes, but Sin3 knockouts in various species result in a combination of up- and down-regulated genes33–35. It could be argued that the upregulation of some genes arises from indirect gene cascade events downstream of initial Sin3-mediated repression in a cellular setting with complex gene circuitry. However, examples of direct Sin3-mediated transcriptional activation exist, such as its association with Nanog, resulting in increased Nanog transcription and maintenance of pluripotency in ESCs36,37. Sin3 therefore represents a complex with both repressive and activating characteristics, involved in the intricate regulation of many cellular processes such as maintenance of pluripotency, various developmental processes, and cell cycle regulation38,39. Sin3 may also be recruited to methylated DNA, dependent on MECP2 localisation at mCG sites, resulting in transcriptional repression40,41. MECP2 interacts with many repressive complexes (Figure 4.2), including the SWI/SNF (Switch/Sucrose Non-Fermentable) complex, the CoREST (Corepressor of RE1-silencing transcription factor) complex, and the SMRT/NCoR (Silencing Mediator for Retinoid or Thyroid receptor/Nuclear receptor CoRepressor) complex42. The nucleosome remodelling SWI/SNF complex is recruited to repressed, hyper-methylated sites bound by MECP2, providing a link between nucleosome remodelling and mCG43. The CoREST complex consists of HDAC1/2, the scaffolding protein CoREST, and lysine-specific demethylase 1 (LSD1), but not RBBP4/744,45. In addition to histone deacetylation, the complex may catalyse demethylation of H3K9me1/2 and H3K4me1/2 through LSD146. CoREST-MECP2 localisation at mCG sites is required for neuronal gene silencing in ESCs by Sin3A-mediated deacetylation and SUV39H1-mediated histone methylation47. The co-repressor SMRT and its homologue NCoR contain a highly conserved repression domain and recruit HDAC3, GPS2, and TBL1 to DNA bound by various proteins like BCL6, ETO and MEF2C48–50. Both SMRT/NCoR and CoREST utilise HDACs to achieve transcriptional repression and have important roles in early embryonic development51. For example, NCoR knockout mice have defects in neural cell differentiation52. Like MECP2, Kaiso associates with a co-repressor: it binds specifically to NCoR and not to SMRT, despite the high homology between the two. In vitro binding analysis revealed a sequence-specific, methylation-dependent binding affinity for Kaiso that is distinct from that of MECP2. In vivo binding analysis identified NCoR-Kaiso enrichment at the methylated MTA2 promoter, which resulted in HDAC-dependent MTA2 repression53.

In summary, mCG readers are vital to the readout of mCG and constitute critical components of the epigenome. The binding of mCG readers has been well characterised and results in co-recruitment of other proteins that influence chromatin architecture, resulting in transcriptional changes. The importance of mCG regulation is highlighted by gene knockout experiments and by the implication of these factors in disease. However, loci without DNA methylation may also undergo transcriptional repression. This is evident from studies of transcriptional regulation in yeast, some aspects of which are conserved in humans. Non-methylated CGIs recruit a distinct set of proteins and constitute specific binding modalities that, through protein complex recruitment and subsequent changes in chromatin structure, result in transcriptional activation or repression.

Figure 4.2: MECP2 binding to methylated CGIs and subsequent repressive complex recruitment. Sin3, CoREST, and NCoR/SMRT complex recruitment results in transcriptional repression by histone deacetylation. Recruitment of the CoREST complex results in H3K27me3/H3K9me3 deposition and H3K4me3 removal. The MECP2-SWI/SNF complex induces chromatin compaction, resulting in gene repression. Kaiso specifically associates with NCoR and not SMRT, despite the high homology between the two co-repressors. Adapted from42,43,53,54.

IV.1.2 Epigenetic regulation at unmethylated CGIs

Methylation-independent DNA binding events drive cellular processes that are responsible for the transcriptional regulation of non-methylated genomic elements. TET-mediated DNA demethylation at CGIs reshapes the chromatin environment. For example, demethylation by TET2 plays a role in de novo establishment of H3K4me3/H3K27me3 bivalent domains55. Deletion of Tet1 and Tet2 in mESCs results in loss of H3K27me3 at bivalent domains55, whilst studies in human ESCs have shown that TETs are required for the protection of bivalent promoters from DNA methylation, ensuring lineage-specific transcription at these genes upon differentiation56. A few protein complexes that regulate CGIs and are pertinent to this thesis are described below, including the Polycomb Group (PcG) complex, Set1C/COMPASS (Complex Proteins Associated with Set1), and MLL (Mixed-lineage leukaemia) complexes. Epigenetic silencing via mCG recognition is complemented by the PcG protein complexes, which utilise epigenetic processes to modulate cell lineage pathways in a methylation-independent manner57–59.

The PcG repressive protein complex is well conserved and is present within several unicellular organisms, plants, and animals60–62. PcG orthologues have expanded from 18 up to 37 members in mammals63. The number of PcG protein interactors remains unknown, with members still being discovered, contributing to the complexity of the PcG and resulting in sub-classification into complexes composed of distinct Polycomb repressive complex (PRC) members64. PRC1 and PRC2 constitute two major complexes that have been studied extensively, and each contains subcomplexes that participate in different regulatory pathways involving distinct proteins57. For example, Tet1 has been shown to bind to CGIs and recruit PRC2, resulting in the silencing of PRC2-specific developmental regulators required for stem cell maintenance in mESCs65. Core components of PRC1 include the E3 ubiquitin ligase RING1B and one Polycomb group ring finger (PCGF) protein. PRC2 contains the core components Suz12 (suppressor of zeste 12), which contains a ZF domain; Eed (embryonic ectoderm development), which contains a WD40 repeat domain; and Ezh1/2 (enhancer of zeste 1 or 2), which has methyltransferase activity58. PRC1 and PRC2 target different histone modifications, namely monoubiquitination of H2AK119 (H2AK119ub1) and di-/trimethylation of H3K27 (H3K27me2/3), respectively64. Whilst the roles played by each complex are largely mutually exclusive, molecular and functional characterisation of each complex has revealed some inter-complex dependency; for example, recruitment of PRC1 subtypes may require PRC2-catalysed H3K27me3 at certain genomic positions66,67. PRC1 may be broadly divided into canonical (cPRC1) and non-canonical (ncPRC1) complexes based on the chromobox (CBX) proteins that drive the recruitment of cPRC1 to chromatin68,69. However, ncPRC1 is of particular relevance to this project (Figure 4.3), as it is recruited to unmethylated CGIs by the ZF-CXXC motif within KDM2B70,71. This was demonstrated by tamoxifen-mediated deletion of the ZF-CXXC domain using a Cre-recombinase mouse ESC line. The deletion resulted in ablation of PCGF1/PRC1 complex targeting in vivo, and underlined the CXXC DNA-binding domain as the primary localisation determinant for ncPRC172. Deletion analysis revealed that minimal ncPRC1 assembly required a PCGF1 and BCORL1 (BCL6 corepressor-like 1) heterodimer, whilst RING1B was dispensable for assembly73. Other protein interactors that comprise ncPRC1 include YY1 (Yin Yang 1), SKP1 (S-phase kinase-associated protein 1), and USP7 (ubiquitin-specific protease 7)58,68. The crucial role of KDM2B has been demonstrated through knockouts, whereby a 40% global reduction in H2AK119ub1 was observed, whilst its deletion in ESCs results in premature differentiation71,73. The PcG complexes have diverse roles in mammals and are essential repressive complexes that are in balance with local active chromatin states. The interplay between the two processes underpins many crucial biological processes74,75.

Another example of epigenetic control at the chromatin level, independent of DNA methylation, is methylation of histone H3 at lysine 4 (H3K4), which may exist in mono- (H3K4me1), di- (H3K4me2), or tri- (H3K4me3) methylated states and is a hallmark of euchromatin76,77. SET domain proteins, named after the Drosophila proteins Su(var)3-9, Enhancer of zeste [E(z)], and trithorax, are well-studied chromatin-modifying enzymes that epigenetically regulate transcription (Figure 4.3) via chromatin control78,79. The histone-modifying enzymes that constitute the Set1C/COMPASS and MLL/COMPASS complexes have been studied primarily in Saccharomyces cerevisiae and are responsible for methylation of H3K480,81. The human orthologues SET1A/B and their homologues MLL1-4 perform similar biological roles in complex with members of COMPASS, as reflected in their catalytic abilities and global H3K4me deposition profiles. Global analysis of H3K4me distribution in mammals revealed patterns that are largely similar to those in yeast but differ in H3K4me2 and H3K4me3 levels, which peak at the 5’ ends of genes, whilst enhancer elements contain H3K4me1 and are devoid of H3K4me382,83. The essential core of COMPASS consists of ASH2 (Absent, Small, or Homeotic-like), DPY30 (Dumpy-30), WDR5 (WD repeat-containing 5), and RBBP5 (Retinoblastoma-binding protein 5)84–86. COMPASS may associate with the other histone methyl writers MLL1-4 to form the COMPASS-like or MLL/COMPASS complex. Recruitment of MLL and subsequent H3K4 methylation results in gene activation, but only at distinct regions at which H3K4me deposition is mediated by MLL/COMPASS87. In vitro and in vivo analyses suggest that the SET1/COMPASS complex plays a more widespread role in H3K4me3 deposition in mammalian cells than MLL/COMPASS88. Global ChIP-seq and gene expression analysis of Mll1 knockouts demonstrates selective H3K4me differences that are coupled to gene expression changes, affecting only 5% of genes, which include the Hox gene family89. This observation was unsurprising, given that MLL was already known to regulate Hox gene expression90. However, it also demonstrates that transcriptional activation of MLL target loci is multifaceted, given the complexity of the MLL proteins, which partake in chromatin, DNA, protein, and RNA interactions through their multiple domains.

A highly conserved SET domain is responsible for H3K4me deposition when complexed with COMPASS, as mentioned above. However, CXXC and AT-hook domains that bind DNA may also drive recruitment. Cysteine-rich PHD domains permit histone interactions, and DNA MTase-like regions are responsible for complex-dependent histone acetylation. In addition, MLL proteins may participate in numerous protein interactions, such as with Menin, which is required for DNA-associated biological complexes, and with other complexes like the SWI/SNF family91,92. The presence of H3 and H4 acetylation as a marker of transcriptional activation is well studied, and its removal is often linked with DNA methylation and transcriptional repression40,93. The relationship between histone acetylation and DNA methylation was also explored at an MLL1 binding site in the Hoxc8 promoter, adding another dimension of transcriptional modulation to the capabilities of MLL1. Mll knockout cells contained hypoacetylated H3 and H4 within the 5’ enhancer of Hoxc8 compared to WT cells, which coincided with hypermethylation of the Hoxc8 promoter and enhancer. Importantly, the acetylation states of H3 and H4 in Mll1 knockout cells were restored to WT levels upon exogenous expression of Mll1, but the hypermethylated state remained, suggesting that MLL1 plays a protective role against methylation, rather than a role in erasing it94. Numerous studies have since demonstrated that the conserved CXXC domain present in every MLL member is involved in CG binding and is critical in protecting these sites from methylation95,96. The interplay of these domains in a biological setting creates a very complex transcriptional situation, in which promoter activation may rely on one of many scenarios, or a cumulative mix of them, playing out at the DNA, RNA, protein, and/or chromatin level. How these structures influence temporal and spatial transcriptional outputs or drive MLL complex specificity remains unknown.

IV.1.3 Methods adopted for the assessment of transcription factor binding

Previously, the potential of a protein to specifically bind a DNA sequence was assessed on a per-protein basis. Assays such as EMSA were used, in which a protein-probe interaction results in a shift on a native polyacrylamide gel. The shift is visualised by incorporation of a radioactive or fluorescent label onto the protein or probe. External competitors, such as methylated or unmethylated probes, were added to assess protein specificity. Alternatively, other binding assays like fluorescence anisotropy were employed and yielded a quantitative measure of binding affinity. These approaches were time-consuming, low-throughput, and reliant upon purification of each protein that was assessed. These caveats greatly limited the ability to investigate the binding potential of proteins. The development of affinity proteomics coupled to mass spectrometry has enabled high-throughput DNA binding screens to be employed for various DNA probe contexts and modifications in many tissues and cell culture-based systems97. SELEX has also recently been applied to most transcription factors within the human proteome, using methylated and unmethylated DNA libraries to ascertain their sensitivity to methylation98. These approaches have yielded comprehensive repositories of binding information about DNA-binding proteins that are repelled by or attracted to mCG. Each approach, however, has its advantages and disadvantages. The SELEX approach provides high-affinity motif discrimination for each transcription factor but is reliant on the purification of every protein assessed and requires sophisticated bioinformatics pipelines to obtain motif information. DNA pull-downs are subject to certain limitations, such as the choice of DNA probe and, equally, the choice of control probe, but unlike SELEX, they offer information about DNA binders and their cellular interactions. Furthermore, DNA pull-downs, unlike EMSA or SELEX approaches, offer a binding environment more representative of the cellular setting, because DNA binders within the protein lysate compete with each other for binding of the probe. Pull-downs also offer an additional advantage in that they allow for the capture of DNA-binding proteins and identification of their native protein complexes, which themselves may alter or enhance the binding of the principal DNA binder. These features make DNA pull-downs attractive, as they offer a more complex, ‘native’ setting. Therefore, the focus of this PhD was to employ DNA pull-downs to characterise mCG readers within the human and mouse brain, based on their robust protocols and their suitability for assessing the binding of proteins within a cell type or tissue of choice. As described in Chapter 3, the DNA pull-down was coupled to mass spectrometry and ProteoMM was employed for its analysis. This chapter details the analysis of the mCG/CG dataset, focusing on the identification of novel mCG and CG DBD-containing proteins and their cellular interactors within the human and mouse brain.

Figure 4.3: Transcriptional repression by protein complexes recruited to CGIs upon transcription factor binding to unmethylated CGIs. The PcG subcomplex PRC1 is recruited by KDM2B binding to unmethylated DNA and may also be dependent on PRC2 recruitment in certain genomic contexts. SET1 or MLL subunits in complex with COMPASS bind unmethylated CGIs, resulting in H3K4me3.

4.2 Results

IV.2.1 Global assessment of mCG/CG datasets

Prior to the identification of mCG readers, the quality of the DNA pull-down dataset generated using mCG and CG probes in human and mouse brain (mCG/CG dataset) was assessed by inspection of missing data (Figure 4.4A). The quality and distribution of the data were also assessed by plotting, for each protein, the normalised mean protein intensity (the averaged value of all peptides belonging to a protein) against the ratio of its mCG to CG protein intensities on the log2 scale (log2FC) (Figure 4.4B). Human and mouse mCG/CG datasets have ~15,000 missing peptide observations distributed uniformly across replicates. Replicate 3 in the human experiment contained slightly more missing observations, potentially indicative of sample handling, run order bias, or an unidentified source of variation. An MA plot (Figure 4.4B) was generated subsequent to EigenMS normalisation to ensure this effect was corrected for. Both human and mouse datasets centre around zero, indicative of background binding, and have similar profiles. Replicate data were also inspected through generation of a heatmap (Figure 4.6). Proteins displayed were filtered using broad enrichment thresholds (log2FC ≥ 1 and p-value ≤ 0.05 for DE, and log2FC ≥ 0.5 and p-value ≤ 0.05 for P/A) for a global inspection of the mCG/CG dataset. Replicates within the heatmap cluster together, and intensities are largely in agreement with each other.
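As an illustration of the quantities described above, the following Python sketch computes the two MA-plot coordinates for a single protein and applies the broad heatmap filter. It is not the ProteoMM implementation: the function names are hypothetical, and the filter assumes that the log2FC cut-off applies to the absolute fold change, so that both mCG-enriched and CG-enriched proteins are retained.

```python
import math


def ma_values(mcg_intensity, cg_intensity):
    """Return (A, M) for one protein from untransformed intensities.

    M is the log2 fold change (mCG over CG); A is the mean of the two
    log2 intensities. Both inputs must be positive.
    """
    log_mcg = math.log2(mcg_intensity)
    log_cg = math.log2(cg_intensity)
    m = log_mcg - log_cg          # log2FC, plotted on the y-axis
    a = (log_mcg + log_cg) / 2    # mean log2 intensity, plotted on the x-axis
    return a, m


def passes_broad_filter(log2fc, p_value, analysis):
    """Broad heatmap filter with the cut-offs quoted in the text.

    analysis is "DE" (cut-off 1.0) or "PA" (cut-off 0.5).
    Assumption: the cut-off is applied to |log2FC|.
    """
    cutoff = {"DE": 1.0, "PA": 0.5}[analysis]
    return abs(log2fc) >= cutoff and p_value <= 0.05
```

A protein with equal intensity on both probes gives M = 0, which corresponds to the background binding around which both datasets centre.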

Figure 4.4: Human (left) and mouse (right) mCG/CG datasets. A) Bar plot of missing values within each dataset. B) MA plot of normalised data. Each dot represents a protein intensity post normalisation, plotted against its corresponding fold-change (log2FC) value.

IV.2.2 Identification of novel mCG and CG readers in human and mouse

Proteins identified by differential expression (DE) analysis that are common to the human and mouse mCG/CG datasets represent high-confidence mCG or CG readers. Some have been previously characterised, and some are novel. These proteins are displayed in scatterplots generated for the combined DE dataset (Figure 4.5), and in volcano plots for human-limited DE and P/A (Figure 4.7) and mouse-limited DE and P/A (Figure 4.8). For each dataset, tables in the supplementary information section list references to the studies that characterised the binding of proteins described as ‘already characterised’ within this thesis. Within the human and mouse combined dataset, 10 novel mCG DBD-containing proteins were observed. Of these 10, 2 were detected as significant mCG readers in both mouse and human, and 8 displayed significance in human despite having peptides in mouse (Figure 3.10). For easier visualisation of enrichment in both species, the log2FC values for each protein in human and mouse were plotted on a scatterplot (Figure 4.5).

Proteins with a positive log2FC value bound the mCG probe with higher affinity, whilst those repelled by mCG have a negative log2FC value. For illustrative purposes, scatterplots are split into DNA binding domain-containing proteins (top) and protein interactors (bottom). Labelled proteins within each scatterplot are those significantly enriched for mCG or CG in both species (top right or bottom left quadrants, respectively) or in one species only (middle vertical or horizontal quadrants). Significance thresholds for the combined analysis were set at log2FC ≥ 1.2 and p-value ≤ 0.05. For example, the upper right and lower left quadrants contain ‘high confidence’ proteins called significant in both human and mouse that are attracted to mCG or repelled by mCG, respectively. Within the DNA binding domain-containing plots, proteins depicted in bold represent potential novel DBD-containing proteins that had not been characterised as having affinity for mCG prior to this experiment. In addition, the supplementary section of this chapter lists all previously identified DBD-containing proteins within the scatterplot, with a reference to the studies that previously characterised their binding. It is important to note that the DNA binding domain-containing proteins represent proteins with DNA binding or nucleic acid binding capabilities as determined by GO ID matching (see section 2.2.4). These proteins may not necessarily function primarily as transcription factors, despite containing DNA-binding domains. Well-characterised DBD-containing proteins with demonstrated affinity for mCG that were enriched for the mCG probe in human and mouse include MBD2, MECP2, ZBTB4, and KLF13, which are indeed transcription factors. In contrast, MTA2 and MTA3, whilst containing nucleic acid binding modules, have primary functions in binding histone tails as part of the NuRD complex [99]. The pull-down also identified novel mCG and CG readers with DNA-binding capabilities. Novel mCG binders FOXO1 and EGR3 are high-confidence candidates because their enrichment for mCG was consistently observed in human and mouse pull-downs. Conversely, proteins in the bottom left quadrants are repelled by CG methylation or have affinity for the CG probes used within the DNA pull-down. These include KMT2A and KDM2B, and the interactors RBBP5 and RNF2, which belong to the MLL and PcG complexes; as described in section 4.1.2, these are well-characterised CG-binding complexes.
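The quadrant logic of the combined scatterplot can be sketched as follows. This is an illustrative Python example only, not the analysis code used in this thesis. The function name and category labels are hypothetical, and it assumes that CG enrichment is called with the mirrored cut-off (log2FC ≤ -1.2), which the text implies but does not state explicitly.

```python
def classify_reader(fc_human, p_human, fc_mouse, p_mouse,
                    fc_cutoff=1.2, p_cutoff=0.05):
    """Assign a protein to a scatterplot region from per-species statistics.

    fc_* are log2(mCG/CG) values; p_* are the matching p-values.
    Assumption: CG enrichment uses the mirrored cut-off (fc <= -fc_cutoff).
    """
    def call(fc, p):
        # Return the probe a protein is significantly enriched for, if any.
        if p > p_cutoff:
            return None
        if fc >= fc_cutoff:
            return "mCG"
        if fc <= -fc_cutoff:
            return "CG"
        return None

    human = call(fc_human, p_human)
    mouse = call(fc_mouse, p_mouse)

    if human and human == mouse:
        return f"conserved {human}"      # top right or bottom left quadrant
    if human and mouse:
        return "discordant"              # opposite calls in the two species
    if human:
        return f"human {human} only"
    if mouse:
        return f"mouse {mouse} only"
    return "not significant"


if __name__ == "__main__":
    print(classify_reader(1.5, 0.01, 1.3, 0.04))
    print(classify_reader(1.5, 0.01, 0.3, 0.50))
```

Under these assumptions, a protein significant for mCG in both species lands in the ‘conserved mCG’ class, corresponding to the top right quadrant.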

The scatterplot also enables visualisation of proteins that were significant in one species only, despite being analysed in both human and mouse datasets (Figure 4.5). For example, previously characterised proteins PBX1 and PBX3 (among others) were significant in human mCG but did not meet significance in mouse. Combined-human mCG contained the greatest number of novel mCG-binding candidates. Interesting candidates include ZFP62, because it remains largely uncharacterised, and CSTF2, which will be discussed further in section 4.3.3. Numerous proteins with mRNA processing, nuclear import, and histone-modifying capabilities were among the many interactors enriched in human mCG, such as the EXOSC family, KPNA proteins, and others required for neuronal maintenance. The terms “CG readers” or “CG binders” will be used to describe proteins that were enriched for the CG probe control condition (unmethylated CG probe). These proteins likely harbour affinity for the specific CG probe sequence used in the DNA pull-down and/or are strongly sensitive to DNA methylation and therefore excluded from the mCG probe. Within the combined list enriched solely for human CG, two uncharacterised CG-binding candidates, THAP4 and SND1, were identified among many already characterised CG binders. Interactors DPY30, WDR5, and MLL1 enriched in combined-human CG are additional MLL or Set1C/COMPASS complex members. All significantly enriched mCG binders in combined-mouse have been previously identified. Of note, members of the ncPRC1-PcG complex were enriched for combined-mouse CG.

Very few proteins were detected within P/A datasets, as these represent the remainder of proteins that could not be analysed by the DE analysis (Figure 4.7 and Figure 4.8). Further, the combined analysis represents only a subset of all proteins detected in human and mouse. Consequently, the P/A combined analyses contained very few proteins, and only one protein, ZBTB44, was significantly enriched in combined mCG P/A. To simplify the results, the P/A combined analysis was merged with the species-limited P/A to create one human P/A and one mouse P/A set. The human and mouse P/A sets are therefore the product of the merged ‘combined’ and species ‘limited’ P/A datasets, matched by GeneID to create P/A plots by species. To generate each plot, the percentage of peptides observed out of the total possible peptide observations is plotted against the corresponding p-values from the merged combined and species-limited dataset. Results for the P/A will, therefore, be discussed below, within the species-limited results section. For a broader, less stringent assessment of proteins observed within the DNA pull-down, an extended list of mCG and CG readers, subset at a DE cut-off of log2FC ≥ 1 and filtered for p-value ≤ 0.05, was generated (Figure 4.6). The figure displays hierarchical clustering of proteins within the human and mouse CG DE and P/A datasets, which fall into 4 major subsets. Subset 1 (S1) constitutes a conserved class of readers broadly repelled by CG methylation. S2 contains proteins that display enrichment for mCG in mouse. S3, the largest subset, contains a conserved list of mCG readers, whilst S4 contains readers that display an affinity for mCG in human only.

Figure 4.5: Common mCG and CG readers in human and mouse for DBD-containing proteins (top) and protein interactors (bottom). Proteins displayed meet significance threshold log2 (mCG/CG) ≥ 1.2 and p-value ≤ 0.05. Proteins in bold represent potential novel DBD-containing proteins with an affinity for mCG.

Figure 4.6: Hierarchical clustering of normalised and imputed proteins within the combined CG analysis DE and P/A datasets. Protein intensities from DE and P/A datasets were merged. Missing values were assigned zero. Each replicate was normalised by subtracting the row mean and dividing by the standard deviation. Proteins portrayed were filtered for log2FC ≥ 1 and p-value ≤ 0.05 for DE and log2FC ≥ 0.5 and p-value ≤ 0.05 for P/A. Clustering identified 4 main subsets: proteins repelled by mCG in human and mouse (S1), proteins with affinity for mCG in mouse (S2), proteins with conserved mCG affinity (S3), and proteins with mCG affinity in human DE and P/A (S4). Numbers within S1 and S2 correspond to sections of the heatmap belonging to each protein, for easier visualisation of the name and location of each protein on the heatmap.
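The row normalisation described in the caption above can be illustrated with a short Python sketch. It is not the code used to produce the heatmap. The function name is hypothetical, missing values are represented as None, and the sample standard deviation is assumed because the caption does not specify which estimator was used.

```python
import statistics


def row_zscore(intensities):
    """Z-score one protein (row) across replicates.

    Missing values (None) are set to zero first, as stated in the caption.
    Assumption: the sample standard deviation is used; needs >= 2 values.
    """
    filled = [0.0 if value is None else float(value) for value in intensities]
    mean = statistics.mean(filled)
    sd = statistics.stdev(filled)
    if sd == 0:
        # A constant row carries no contrast; return zeros, avoid dividing by 0.
        return [0.0] * len(filled)
    return [(value - mean) / sd for value in filled]


if __name__ == "__main__":
    print(row_zscore([1.0, 2.0, 3.0]))
    print(row_zscore([None, 2.0, 4.0]))
```

Scaling each row in this way places proteins of very different abundance on a common scale, so that clustering reflects the pattern across replicates and not absolute intensity.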

IV.2.3 Identification of mCG readers in human or mouse-limited datasets

The species-limited datasets contain the proteins observed in either human or mouse mCG/CG datasets. For these proteins, DE and P/A analyses were conducted independently of the combined analysis to produce “human-limited” and “mouse-limited” datasets. The human-limited DE and human-limited P/A analyses identified 4 and 5 novel potential mCG readers respectively, whilst the mouse-limited DE dataset identified 8. The DE (top) and P/A (bottom) results for human-limited and mouse-limited are presented in Figures 4.7 and 4.8, respectively. As with the combined scatterplots, DNA binding domain-containing proteins (left) and protein interactors (right) were split for illustrative purposes. Proteins that were called as significant are labelled. A significance threshold of log2FC ≥ 1.2 and p-value ≤ 0.05 was used for the DE analysis, whilst a significance threshold of percent observed ≥ 50% and p-value ≤ 0.1 was used for P/A. In addition, the supplementary section of this chapter lists all identified DBD-containing proteins within the volcano plots, with a reference to the studies that previously characterised their binding.
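To make the P/A threshold concrete, the sketch below computes the percentage of observed peptides and applies the stated cut-offs (percent observed ≥ 50% and p-value ≤ 0.1). It is an illustrative Python example under assumed inputs, not the ProteoMM implementation. The function names are hypothetical, and peptide observations are represented as a flat list in which None marks a missing value.

```python
def percent_observed(peptide_observations):
    """Percentage of non-missing values out of all possible observations."""
    total = len(peptide_observations)
    if total == 0:
        return 0.0
    observed = sum(1 for value in peptide_observations if value is not None)
    return 100.0 * observed / total


def passes_pa_threshold(peptide_observations, p_value,
                        min_percent=50.0, p_cutoff=0.1):
    """Apply the P/A cut-offs quoted in the text to one protein."""
    return (percent_observed(peptide_observations) >= min_percent
            and p_value <= p_cutoff)


if __name__ == "__main__":
    observations = [1.0, None, 2.0, None]
    print(percent_observed(observations))
    print(passes_pa_threshold(observations, p_value=0.08))
```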

Whilst the majority of DNA-binding domain-containing proteins in the species-limited DE datasets have already been implicated in mCG recognition (for example ZBTB14, ZBTB44, KLF3, and KLF16), ZNF575, SOX13, and SCAF4 within the human-limited set are novel. The enriched mCG interactors are involved in a range of cellular activities, such as gene repression (HDAC2), cell cycle (CDKN1C), and cell migration (CEMIP). Human mCG P/A was enriched for already characterised mCG binders KLF12, KLF16, ZBTB4, and POU2F2, and identified 5 novel mCG binders. Notable novel human mCG-binding candidates include ZNF445, ZNF683, FOXO3, and KLF9. Mouse-limited mCG identified 8 novel mCG readers, more than any other dataset (Figure 3.10 and Figure 4.8). Examples of note include Foxo3, Foxp1, and Sox1. As with human-limited mCG, the mouse-limited mCG interactors identified provide potential links between mCG and many cellular mechanisms, including gene repression (Gatad1), cell motility (Coro1a/b), and hormone-induced cellular responses governed by protein interactors like GABRG2 and SLC410. Mouse P/A analysis identified an already characterised mCG binder (Zbtb44) and one novel DNA binding domain-containing protein, Sp9. Known DNA binders Tcf12, Mtf1, and Cxxc4 were among those enriched for mouse-limited CG, whilst BCOR complex members Bcorl and Bcor, and PRC complex members Yaf and Pcgf1, were among the enriched interactors in mouse-limited P/A.

Figure 4.7: Human-limited (top) and human P/A (bottom) mCG and CG readers for DBD-containing proteins (left) and protein interactors (right). Proteins displayed meet significance threshold log2(mCG/CG) ≥ 1.2 and p-value ≤ 0.05. For human P/A, proteins displayed pass the significance threshold of percent observed ≥ 50% and p-value ≤ 0.1. Proteins in bold represent potential novel DBD-containing proteins with an affinity for mCG.

Figure 4.8: Mouse-limited (top) and Mouse P/A (bottom) mCG and CG readers for DBD-containing proteins (left) and protein interactors (right). For mouse-limited, proteins displayed meet significance threshold log2(mCG/CG) ≥ 1.2 and p-value ≤ 0.05. For mouse P/A, proteins displayed pass the significance threshold of percent observed ≥ 50% and p-value ≤ 0.1. Proteins in bold represent potential novel DBD-containing proteins with an affinity for mCG.

IV.2.4 Gene ontology analyses of mCG and CG readers in human and mouse brain

Functional annotations for enriched proteins in human and mouse mCG/CG DNA pull-downs were assigned using DAVID, a GO analysis database [100]. The REVIGO tool was utilised for visual output of the DAVID results, condensing redundant GO terms and representing relationships between terms in a two-dimensional ‘semantic space’, in which similar terms are grouped closer together. Term significance is based on the p-values generated by DAVID. GO terms linked to proteins predicted to have a higher affinity for CG methylation that were similar in human and mouse can be summarised as follows. DNA, nucleic acid, and transcription factor binding were among the enriched biological GO terms, driven primarily by the KLF, EGR, FOX, SOX, RFX, and BTB/POZ families enriched in human and mouse. DNA methylation terms were driven by the BTB/POZ and MBD families, whilst ATP-dependent chromatin remodelling by NuRD complex family members was driven by GATAD2B, RBBP4/7, MBD2, and MTA2. It is unsurprising that some tissue-specific terms, like regulation of neuron differentiation in mouse mCG or differentiation in human mCG, are observed, given that the DNA pull-downs were performed with brain tissue. These terms were driven by an assortment of proteins, such as Bag1 or Nkx-2 in mouse and FOX/EGR members common to the human and mouse mCG/CG DNA pull-downs, which are highly expressed within the brain [101–103]. Of interest within the mCG cellular component terms is the enrichment of complexes like the ESC/E(Z) complex in human, or NuRD, which is common to both species. GO molecular function terms include methyl-CpG binding, partially driven by the MBDs and Zbtb4 enriched within the pull-down, as well as chromatin binding, driven by MTA2/3, FOXO1, MBD2, and MECP2. Of all the GO terms within the human and mouse CG GO REVIGO scatterplots, very few were common to both species. Cellular component terms MLL and PcG were enriched in human and mouse alongside histone H3-K4 methylation or mono-ubiquitination, which is expected given that these modifications arise from each complex. MLL was driven primarily by core complex members ASH2L and RBBP5, while PcG complex members PCGF1, RNF2, KDM2B, and BCOR were enriched in human and mouse pull-downs.

Figure 4.9: GO analysis of proteins enriched for mCG in human (left) and mouse (right). Biological process, cellular component and molecular function GO terms are visualised by REVIGO, which places terms within an arbitrary ‘semantic space’ based on their similarity.

Figure 4.10: GO analysis of proteins enriched for CG in human (left) and mouse (right). Biological process, cellular component and molecular function GO terms are visualised by REVIGO, which places terms within an arbitrary ‘semantic space’ based on their similarity.

4.3 Discussion

IV.3.1 Validation of the affinity pull-down results through the identification of known mCG and CG readers

Extensive efforts have been made to identify and characterise mammalian mCG readers, given that the deposition of CG methylation and its readout are fundamental cellular processes crucial to development and disease. Assessment of the mCG readers detected in this study across the human and mouse mCG/CG datasets reveals many previously experimentally verified mCG readers, including proteins within classical mCG reader families. Since the identification of the MBD family, numerous other proteins and protein families have been established as mCG binders. In total, 42 readers whose affinity for mCG has already been established were identified in my CG context DNA pull-down experiments from human and mouse brain. Figures 3.9, 3.10, 4.5, 4.7, and 4.8 provide strong reassurance that these experiments were highly effective at detecting proteins with affinity for methylated DNA. References to the studies that characterised the binding of these proteins are contained within tables in the supplementary section of this thesis. Canonical mCG readers identified include 2 MBD members (MECP2 [104], MBD2 [105]), 1 SRA member (UHRF1 [97]), and 4 BTB/POZ members (ZBTB4/14/33 and 44 [97,106]). Of the other MBDs, MBD1, a well-characterised mCG binder [105], was observed within the human dataset only and, surprisingly, was only moderately enriched for mCG and did not meet significance at the thresholds set. The lower fold change enrichment for methylated probes associated with MBD1 most likely reflects some affinity for the CG probe, mediated by a CXXC domain [107]. MBD3 has no affinity for mCG based on past studies [108], and here was observed only in the mouse-limited dataset with a low background enrichment for the CG probes. MBD4, despite having major roles in DNA repair, has been shown to bind mCG in biochemical assays [109], DNA pull-downs [97], and in cell culture [110]. The enrichment of human MBD4 for mCG probes in the experiments conducted here is therefore unsurprising. Less well characterised and more recently identified readers with an affinity for mCG were also identified within this pull-down, namely the transcription factors KLF3/12/13/16, POU2F2, FOXK1/2, FOXJ3, and EGR1, and the zinc finger protein ZNF174 [97,98,111,112]. Regulatory factor X (RFX) transcription factor proteins RFX1/3 and RFX5, as well as their interactors RFXANK and RFXAP, were also enriched for binding mCG probes within the brain pull-downs here. Each RFX member identified displays a broad expression profile across many tissues, but very high expression in the brain [113]. Initially named methylated DNA binding protein (MDBP), RFX1 was purified from human placenta and demonstrated methylation-dependent, sequence-specific affinity for DNA [114–116]. Subsequent characterisation of RFX1 led to the classification of the RFX family, with each member displaying similar DNA binding properties [117,118]. Therefore, it is unsurprising that each member, like RFX1, was observed as an mCG binder in the pull-down experiments presented here, and these members have been previously identified in mCG reader screens [97,106]. Recent characterisation of RFX5 revealed that mCG recognition is conferred by a winged-helix (WH) domain [97]. The WH domain is present in many DNA binding proteins, some of which also display mCG affinity. Those identified in this screen and in external experiments include members of the FOX family of transcription factors, which will be discussed in more detail in the novel mCG reader section of this chapter below. The success of these pull-downs is further substantiated by the presence of protein interactors involved in chromatin modifications linking mCG binding to gene repression, in particular members of the NuRD repressive complex GATAD2A/B, RBBP4/7, and HDAC1/2, which co-localised with MBD2 at mCG sites in vivo [29]. Each interactor was enriched for mCG in human and mouse, with the exception of GATAD2A, which only displayed affinity in mouse mCG within the combined analysis. Overall, the detection in these experiments of many proteins previously characterised as having a higher binding affinity for mCG strongly supports the efficacy of this approach and the value of the results for identification of new candidate methylated DNA binding proteins.

IV.3.2 Readers exhibiting unexpected binding behaviour

Of all significant proteins observed, very few contradict previously published data and, in general, the analysis classifies proteins consistently with current knowledge in the field. Specifically, 63 proteins are in agreement with previously published data and 5 display unexpected DNA binding behaviour (Figure 3.10). The 5 proteins that were enriched for a binding context opposite to that previously described are detailed below per dataset. An external SILAC-based MS affinity pull-down analysis observed a moderate exclusion of ZBTB14 from its methyl-CG probe [106], in contrast to my pull-down experiments, in which ZBTB14 was enriched for the mCG probe in the human-limited dataset. Whilst an affinity for mCG would make sense, given that ZBTB14 is a member of the BTB/POZ family that has reported methyl-binding capabilities, not all BTB/POZ members recognise mC [119]. ZBTB14 may exhibit dynamic binding that differs by experiment due to differences in tissue type, protein modifications, isoforms, or technical aspects of the assay. Furthermore, competition for DNA binding with other DNA binders or the presence of protein interactors may alter the binding characteristics of ZBTB14. The SILAC-based approach that observed ZBTB14 as a CG binder used protein extracts of tissue culture origin (HeLaS3 cells), which differ substantially from the human frontal cortex tissue used within my pull-down experiments; inherent differences in protein constituents could therefore alter its binding. Furthermore, the probes used within each pull-down were different, which could result in enrichment of different proteins. This concept is highlighted by BTB/POZ member ZBTB33, the first member of the BTB/POZ family to be characterised for its mCG binding capabilities [97,98,106,120], which was enriched within my mouse-limited mCG binder set. ZBTB33 possesses affinity for both mCG and CG and, as one study has reported, binds preferentially to unmethylated DNA in vivo within the two cell culture-based systems used [121]. Future biochemical validation experiments utilising different mCG probe contexts will be required to discern the affinity ZBTB14 has for mCG, in parallel with investigations of the genome-wide binding signature of ZBTB14.

The basic leucine zipper (bZIP) transcription factor MAFF and the basic helix-loop-helix (BHLH) transcription factor BHLHE40 were enriched for mCG in the human-limited set. Contrary to my pull-down results, a methyl-sensitive SELEX-based experiment classified their affinities as ‘methyl-minus’, meaning they were repelled by methylation and enriched for binding to unmethylated DNA. The SELEX experiment, which comprehensively characterised the affinity of 542 transcription factors, concluded that most bZIP and BHLH transcription factors are repelled by mCG [98]. A SELEX-based strategy is an excellent method of ascertaining the binding motifs of transcription factors but, unlike a pull-down, cannot capture DNA binding that occurs in complex with other proteins, or cell-, tissue-, or isoform-specific binding patterns. These complex binding activities, in which specific protein isoforms, dimerisation, or interaction events directly modulate mCG recognition, may therefore be lost using SELEX approaches unless each isoform is expressed for investigation. The bZIP family bind DNA through a basic domain and also homo/heterodimerise with other proteins through leucine zipper domains, forming complex interactomes with multifaceted transcriptional activation and repression outputs [122,123]. The BHLH domain within BHLHE40 mediates DNA binding and, in conjunction with its distinctive Orange domain, forms multimeric DNA binding complexes that negatively regulate a variety of cellular processes [124]. Therefore, MAFF and BHLHE40 may bind unmethylated DNA when in isolation (such as within the SELEX experiment), but this may change when in complex with other DNA binding proteins. In this scenario, MAFF or BHLHE40 may not have directly bound to mCG within the pull-down, but may have been part of larger protein complexes that did. Because the result of the pull-down is at odds with the SELEX experiment, further investigation is required. A TAP-MS experiment utilising methylated DNA incubated with protein lysate containing endogenously tagged MAFF or BHLHE40 would help to address whether known MAFF or BHLHE40 interactors are responsible for their enrichment for mCG within the pull-down. Results of the TAP-MS experiment could determine whether MAFF or BHLHE40 only bind mCG when dimerised with another protein. In this case, it is possible that either protein is pulled down as part of an interaction complex and is not the principal binder of mCG.

ARNT2, enriched in the experiments conducted here for binding to human and mouse CG (i.e. the unmethylated CG probe), is another example of a BHLH DNA binding protein that undergoes homo/heterodimerisation and presents binding activities here that conflict with previous findings. Whilst the previously described methyl-sensitive SELEX study [98] classified ARNT2 as ‘methyl-minus’, consistent with the pull-down conducted here, ARNT2 was also reported to bind mCG in a protein microarray-based approach, which was then validated by EMSA [125]. ARNT2, like MAFF and BHLHE40, may undergo homo- or heterodimerisation events that alter its DNA-binding affinities, or exhibit cell type-specific isoforms or modifications. It is known that ARNT2 heterodimerises with several proteins in response to developmental and environmental stimuli [126]. For example, an NCOR2-mediated ARNT2 homodimerisation complex docked at enhancers functions to mediate neuronal activity-dependent gene expression. Upon reaching a threshold of neuronal stimulation, the ARNT2 homodimer is converted to ARNT2-NPAS4 heterodimers that trigger transcription of NPAS4-dependent neuronal activity-regulated genes [126,127]. It is interesting that the direct binding analyses, the published SELEX [98] and EMSA [125] studies, produced conflicting results, suggesting that ARNT2 exhibits nucleic acid sequence specificity and selectivity for methylated or unmethylated DNA. This DNA-binding selectivity and its complex interactome are probably reflective of a complex transcriptional network reliant on the dimerisation state of ARNT2.

The final protein that is in opposition with previous findings is DNA damage-binding protein 2 (Ddb2), which binds the products of DNA damaged by UV radiation but also possesses transcriptional regulatory activity [128]. An independent study that performed a pull-down in mouse observed Ddb2 binding to mCG [97], in contrast to the results of my pull-down experiments. Here, Ddb2 was detected as a CG binder in mouse but not in human, where it displayed slight, but not significant, affinity for mCG. It is likely that the observed reciprocal affinity is not a species-specific difference, but arises because Ddb2 binds CG and is not strongly repelled by mCG. In other words, Ddb2 may bind mCG, but this affinity is not specific, and it is also capable of binding unmethylated CG dinucleotides. Alternatively, the presence of co-interactors in each DNA pull-down may influence its affinity for mCG, but this requires confirmation.

Most discrepancies within my pull-down experiments involve proteins with intermediate or non-specific binding affinity for mCG, or are constrained by experimental factors that differ between studies. My pull-down approach, while informative, is highly reliant on the DNA probe used as bait, and limited by the control probe used in differential enrichment quantitation. While SELEX-based experiments do not suffer from this drawback, they are not as reflective of the native biological setting as my DNA pull-down experiments, and do not identify interacting proteins. For example, it was thought that ZBTB4 bound a single methylated CG, unlike ZBTB33, which requires two methylated CGs [129,130], but based on results from my pull-downs, it seems that the occurrence of repetitive mCG sequences on the mCG probe does not hinder either protein’s affinity for mCG. Binding conditions or protein mixtures derived from a certain cell type, or a defined population of cell types, may also influence the outcome of the pull-down. It was somewhat surprising that UHRF2 was among the high-confidence mCG readers in both mouse and human brain pull-downs here. Previous studies have shown that the SRA member UHRF1 is involved in mCG maintenance, but subsequent characterisation has also linked UHRF1 with hmC recognition with secondary mCG reader capabilities [97]. This is in line with other studies that observed UHRF2 as having no sequence specificity toward methylated or unmethylated DNA, but a high affinity for 5hmC [131,132]. Buffer conditions such as salt and additives greatly alter DNA binding kinetics and protein stability, and may also influence the number and type of proteins enriched in all in vitro affinity-based approaches. While the current consensus on UHRF2 indicates that its primary role is in 5hmC binding, differences in experimental conditions such as probe design, buffer conditions, and tissue type could contribute to inconsistencies within the literature. Thus, UHRF2 may function as an mCG reader, at least in certain cases or cell types, perhaps through interactions with other protein complexes [133].

In summary, the results of my pull-down experiments were assessed by comparison with previously published reports and current knowledge in the field, and are largely in agreement with previously published data. However, the DNA pull-down also identified a diverse set of proteins for which there has been no previous characterisation of mCG binding capacity. These DBD-containing proteins and interactors constitute potentially novel brain-specific mCG readers. Detailed below is a description of what is known about these proteins and a discussion of the likelihood of each binding to mCG.

IV.3.3 Identification of novel mCG readers

EGR3 was one of two EGR family proteins enriched for mCG in the human and mouse datasets (Figure 4.5). The EGR family are DNA binding proteins and transcriptional regulators that, within the brain, modulate changes in gene expression underlying neuronal plasticity [134,135]. EGR1, also enriched in binding to mCG in human and mouse, has previously been characterised as an mCG binder by multiple in vitro and in vivo experiments [111,136]. EGR1 and EGR3 contain a conserved array of three ZF motifs that facilitate sequence-specific DNA binding, which suggests that specific cis-regulatory elements may be bound by both proteins. On this basis, it is plausible that EGR3, like EGR1, may bind to mCG with high affinity. Functional studies have shown that each protein plays a distinct role in memory. For example, mice with disruptions to the Egr1 gene show a specific loss in the maintenance of late long-term potentiation (LTP) and are unable to form long-term memories, whilst short-term memories are unaffected [137]. Egr3 knockout mice have motor function abnormalities and dysfunction in early LTP, resulting in deficits in short-term memory formation [138]. Beyond these behavioural mouse experiments, the identification of EGR3 as an mCG binder raises interesting questions regarding its expression, cellular localisation, and propensity for binding mCG in vitro and in vivo, as well as the downstream biological consequences of this binding in memory formation.

FOXO1 was the second novel mCG candidate identified in the human and mouse mCG combined DE analysis. The Forkhead domain is a type of winged-helix (WH) DNA binding domain present in the Forkhead Box (FOX) family, consisting of over 40 proteins that have been identified in mammals139. Enriched FOX members identified in the pull-down experiments presented here whose affinity for mCG has already been verified include FOXJ3, FOXK1, and FOXK297,98,106. FOXJ3 and FOXK1 (Figure 4.5) were enriched for mCG in human and mouse, whilst Foxk2, despite having peptides in human, only displayed significant enrichment in the mouse dataset (Figure 4.5). The mouse-limited analysis identified 3 novel mCG binding Fox protein candidates, Foxp1, Foxn3, and Foxo3 (Figure 4.8). Interestingly FOXO3 was also enriched in human P/A mCG (Figure 4.8). The likelihood of FOXO3 binding mCG with high affinity is strongly supported by the enrichment of FOXO3 for mCG within both species, especially within human P/A. Members of the FOX family have evolved general biological roles and specialised tissue-specific functions. For example, the FOXA subfamily are pioneer factors that play important roles in early development, organogenesis and in metabolism and homeostasis in adults, whilst FOXO and FOXP are involved in cellular survival/proliferation and immune-related processes, respectively139–141. Given the diverse biological roles of the FOX family and its subfamilies, it would be reasonable to speculate that some members have evolved to bind mCG, like the BTB/POZ family, which also comprises many members that bind DNA in a methylation-dependent and independent manner119,142. There remains a need for validation of each identified FOX mCG reader, and characterisation of its DNA binding affinities. The Forkhead DNA binding domain encodes three alpha-helices, three beta-sheets,

and two large loops (‘winged’ regions) that flank the third beta-sheet143,144. Further structural characterisation of the domain, or of key amino acids within the conserved domains of these proteins, is needed within the context of mCG binding. For example, many members of the RFX family that are implicated in mCG recognition, like RFX1/3, RFXAP, and RFXANK, were also enriched for mCG in this pull-down (Figure 4.5 and Figure 4.7)97,106. It is known, at least for RFX5, that the WH domain is responsible for mCG recognition97,145. By association, the WH domain in other RFX members may confer mCG binding affinity, given the high degree of similarity within the conserved domains of every family member. The functional consequences of mCG readout by FOX proteins are lacking entirely from the literature, both for members already identified by external studies and novel members identified in this pull-down. The large size of the FOX family is paralleled by its many biological functions, which has complicated in vivo characterisation of this family. It is known that the WH-DNA binding interface is influenced by posttranslational modifications like phosphorylation, acetylation, and ubiquitination146,147 that may abolish or alter DNA binding in certain cell types, as demonstrated for FOXO1 in muscle cells148. Such modifications may enhance or abolish the affinity for mCG in certain developmental or cell-type settings. Future experiments that aim to assess the genomic localisation of each identified mCG binder and the transcriptional consequences associated with this localisation are needed. These studies may provide crucial insights into the complexity of mCG binding, highlighting potential cell-type or developmentally restricted binding events modulated by each FOX member. 
Investigation of the biochemical and in vivo dynamics of FOX proteins in mCG readout may also improve our understanding of how protein modifications exert their effect on mCG recognition, whilst more comprehensively characterising the temporal dynamics of FOX proteins and CG methylation.

The novel mCG binders identified within this pull-down can be linked to a variety of cellular functions including differentiation, cell cycle maintenance, RNA regulation, chromatin binding, genome maintenance, and oncogenesis. Some interesting mCG binding candidates are detailed below. The pull-down experiments identified 6 KLF members with an already verified affinity for mCG (Figure 4.5, Figure 4.7 and Figure 4.8) and one novel KLF member (KLF9, enriched in human P/A, in Figure 4.7). ChIP-seq coupled with RNA-seq in human glioblastoma stem-like cells revealed a genome-wide binding preference of KLF9 for promoter regions of genes that were repressed according to RNA-seq149, indicating that KLF9 binds and negatively regulates GC-rich promoters that may be methylated. Of the mCG readers enriched for RNA regulation, numerous proteins had roles within mRNA splicing and transcript regulation. These include CSTF2 in combined-human (Figure 4.5), SCAF4 and LARP4B in human-limited and human P/A respectively (Figure 4.7), whilst SETX and CSTF2t (tau variant of CSTF2) were identified

in mouse-limited (Figure 4.8). The implication of mCG readers in the regulation of splicing is unsurprising. It is now established that transcription and splicing may occur simultaneously, and intragenic DNA methylation patterns show an association with spliced transcripts150. The correlation between the two is suggestive of an interplay that may be mediated by mCG readers with splicing potential. One such example is MECP2, which has been shown to act as a splicing modulator151. However, by and large, the involvement of mCG readers within splicing regulation remains unknown. It is likely that these candidates represent mCG readers with splicing potential, but this, along with their roles in general development or neuronal development, remains to be investigated. The human and mouse pull-downs also identified proteins involved in neuronal development. Examples include SRY-related HMG-BOX (SOX) protein 1 (Figure 4.8), which is required for neural fate determination, and SOX13 (Figure 4.7), which is thought to participate in neuronal differentiation and specification152,153. PAX7, which was enriched in binding to mCG in the mouse-limited set (Figure 4.8), is required for neural crest development154. Many observed mCG readers are also implicated in various cancers. For example, the translocation of the PAX7 gene is present in an aggressive lung cancer subtype155. Similarly, the translocation of TLX3, whose protein product was identified in mouse-limited mCG (Figure 4.8), is associated with T-cell acute lymphoblastic leukaemia156. Lastly, many DBD-containing proteins with unknown roles were also identified. 
Some examples include an uncharacterised HMG-BOX-containing protein, BBX, identified in the combined human mCG reader set (Figure 4.5), and 6 novel Zinc Finger proteins enriched for mCG whose functions and DNA binding characteristics remain unknown (ZFP62 in Figure 4.5, ZNF174, ZNF575, ZNF445, ZNF683 in Figure 4.7, and ZNF464 in Figure 4.8).

IV.3.4 Transcriptional effector complexes and gene expression

In addition to the identification of DBD-containing proteins that likely have a direct affinity for binding methylated or unmethylated DNA, an advantage of the DNA pull-down is its ability to simultaneously identify protein interactors or complexes bound to the enriched DNA binding proteins. Proteins that do not contain any nucleic acid or DNA binding motifs were classified broadly as ‘interactors’. These interactors are present in the scatterplot for human and mouse (Figure 4.5, Figure 4.6) and in the human and mouse-limited DE and P/A analysis (Figures 4.6 and 4.7 respectively). For example, the enrichment of canonical NuRD complex members MBD2, RBBP4/7, HDAC1/2, CHD3/4, and GATAD2A/B validates the results of the pull-down and demonstrates its ability to capture cellular complexes that are biologically relevant.
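The split between DBD-containing proteins and ‘interactors’ described above can be illustrated with a minimal sketch. It assumes that each enriched protein carries a list of GO identifiers and that a protein counts as a DNA binder if it is annotated with GO:0003676 (nucleic acid binding) or GO:0003677 (DNA binding). The exact GO IDs used in this thesis are not stated in this section, so this choice, the function names, and the example proteins are illustrative assumptions only.

```python
# Minimal sketch of a GO-based split into 'DNA binder' and 'interactor'.
# Assumption: these two GO molecular-function IDs define the DNA-binder class.
# GO:0003676 = nucleic acid binding, GO:0003677 = DNA binding.
BINDING_GO_IDS = {"GO:0003676", "GO:0003677"}


def classify_protein(go_ids):
    """Return 'DNA binder' if any annotation is a binding term, else 'interactor'."""
    return "DNA binder" if BINDING_GO_IDS & set(go_ids) else "interactor"


def classify_all(annotations):
    """Classify every protein in a {protein: [GO IDs]} mapping."""
    return {protein: classify_protein(ids) for protein, ids in annotations.items()}


# Hypothetical annotations for illustration only; not taken from the thesis data.
example_annotations = {
    "PROTEIN_A": ["GO:0003677", "GO:0005634"],  # DNA binding + nucleus
    "PROTEIN_B": ["GO:0005634"],                # nucleus only
}
labels = classify_all(example_annotations)
```

A real annotation set would usually also need child terms of these GO IDs to be taken into account. A label of ‘DNA binder’ assigned this way indicates annotated binding capability, not demonstrated direct binding to the probe.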

The brain pull-downs were enriched for many complexes and GO processes related to transcription, mRNA, and non-coding RNA (ncRNA) processes (Figure 4.9 and Figure 4.10). Recent improvements in the mapping of the non-coding transcriptome have revealed multiple important roles of ncRNAs within neuronal development. The fastest evolving regions of the primate genome are non-coding sequences that produce ncRNAs primarily involved in regulating neural development genes157. Their precise spatial and temporal expression correlates with factors involved in brain organisation and maturation158, whilst other ncRNAs are vital to the regulation and maintenance of retrotransposons159. For example, a bidirectional interaction between miRNAs and the DBD-containing protein CREB is responsible for LTP and memory160. The nuclear exosome complex is an RNA surveillance complex that functions co-transcriptionally and orchestrates processes vital to cellular function and gene expression, including mRNA quality control, turnover, and splicing161–163. Members of the nuclear exosome complex were among those enriched for mCG binding, and may represent a novel cellular mechanism linking mCG to RNA regulation and turnover. RNA maturation and degradation are important processing steps underlying correct gene expression and are fundamental to a variety of biological processes such as genome integrity maintenance, DNA damage, cellular differentiation, RNA export, and splicing164. Nuclear exosome complex members EXOSC2/3/5/9/10 were enriched for mCG binding in the combined human and mouse mCG/CG dataset (Figure 4.5), and represent surprising novel mCG reader candidates that may link mCG to RNA surveillance and possible splicing mechanisms. The roles of RNA surveillance within the cell are far more complicated than once thought, and technologies like RNA-seq are gradually unravelling its involvement in a multitude of biological processes. 
To date, no links have been reported between any of the nuclear exosome complex members and mCG or proteins that bind mCG. The enrichment of these RNA processing factors in these mCG pull-downs may hint at a functional relationship between RNA monitoring and mCG recognition within the brain. This is especially relevant given the importance of ncRNAs within the brain, or within splicing regulation, and the intimate link between transcription and splicing. It is known that intragenic methylation may function in preventing spurious transcription165. One hypothetical process may involve the exosome complex binding, or being recruited by, an mCG reader in competition with RNA polymerase at hypermethylated intragenic loci where cryptic transcripts are produced, and mediating their decay. Another possibility is that the nuclear exosome complex associates with other splicing factors enriched for mCG within the human and mouse pull-downs. Known members of the splicing factor complex that were enriched within the mCG interactors combined datasets are CSTF1/2/3 and PRPF3/4. CSTF2 was highly enriched (Figure 4.5) whilst CSTF1/3 and PRPF3/4 were enriched and clustered within a complex with high affinity for mCG in the heatmap (Figure 4.6). Whilst involved in RNA recognition and

splicing, CSTF2 retains the ability to bind to DNA. Whether this factor constitutes the primary binder that then recruits other splicing or RNA surveillance machinery remains unknown. Another possibility is that the entire complex has been co-recruited by the binding of another protein within the pull-down, which is likely for PRPF3. The multifunctional protein MECP2 associates with PRPF3 and has been directly linked with splicing151, but no studies have established a link with CSTF proteins. These may represent a unique splicing complex that is recruited to mCG sites by MECP2 or by a mutually exclusive complex containing a novel DNA binder.

As with the enrichment of NuRD for the mCG probes, validation that the pull-downs successfully enriched for proteins with affinity for CG stems partly from the identification of PcG, Set1C/COMPASS, and MLL1 complex members (Figure 4.6 and Figure 4.10). Each complex has been linked to transcriptional control by methylation-independent mechanisms. The PcG complex contains two subcomplexes, PRC1 and PRC2, that each contain sub-components that associate with DNA and/or chromatin57,58. Of relevance to the pull-down is ncPRC1, a subcomponent of PRC1 distinguished by its direct recruitment to unmethylated CGIs, unlike other PRC1/PRC2 subcomponents that associate with chromatin (Figure 4.3). KDM2B, which was identified as a DBD-containing protein in human and mouse combined CG, is responsible for binding to CGIs in vivo and recruiting ncPRC1 constituents that repress transcription through chromatin modifications70,71. Other established protein interactors constituting the ncPRC1 complex58,68 that were specifically enriched for CG in human and mouse datasets and clustered together (Figure 4.6) include RNF2, YY1, WDR5, and RING1 (Figure 4.5) and PCGF1, BCOR and BCORL1 (Figure 4.7 and Figure 4.8, human P/A and mouse-limited respectively). This result demonstrates the stringency and specificity of the pull-down, which specifically enriched for members of the PcG complex that are involved in direct DNA binding and not other PcG components associated with chromatin binding and subsequent complex recruitment. Also enriched within the combined mCG/CG DNA pull-down (Figure 4.5 and Figure 4.6) were essential core members of the COMPASS complex84–86, ASH2, DPY30, WDR5, and RBBP5, and the transcriptional activator MLL1 (also known as KMT2A)87. In opposition to the PcG recruitment, which is repressive, the MLL/COMPASS recruitment deposits H3K4me3 and results in gene activation80,81. 
Together, the two complexes represent the dynamic potential of unmethylated CGIs within mammals and their ability to positively or negatively regulate transcription. This observation may explain why GO terms for the CG probe encompassed more generalized cellular and biological functions, when compared to the mCG enriched GO terms. In other words, the biological consequences of mCG recognition are coupled to more constrained or defined molecular processes, like chromatin remodelling, histone deacetylation, and transcriptional repression, whereas proteins binding to CG

dinucleotides are not as constrained in their associated processes and are responsible for a plethora of cellular and developmental mechanisms that result from both gene activation and repression events.

IV.3.5 Limitations of the pull-down and in mCG reader characterisation

Biochemical validation and DNA affinity pull-down assays are limited by the selection of defined DNA sequences used within the assay. The design of probes cannot fully capture the underlying sequence complexity bound by all proteins within the cell and inherently enriches for protein families or subtypes that have a higher affinity for the probes selected. Experimental factors such as washing and incubation times or buffer types may also influence the outcome of the experiment. Increased salt or detergent concentrations, for example, have been used to isolate proteins embedded within chromatin. A buffer that resembled native cellular conditions was chosen with the aim of identifying proteins binding to DNA in their native conditions, and because higher salt concentrations reduce the affinity of proteins for DNA. Selecting alternative methylated mCG probes harbouring different sequence contexts, or altering the experimental pull-down conditions, may enrich for a different subset or family of proteins. The systematic evaluation of transcription factors by SELEX experiments has enabled more robust and unbiased characterisation of transcription factor binding and revealed that many more transcription factors are sensitive to CG methylation state than was previously thought98. However, components known to influence the DNA binding landscape within the nucleus, such as nucleosomes and histone modifications, are not present in these assays. As such, none of these assays recapitulates an in vivo setting. As discussed in sections 1.5.4 and 4.3.2, many proteins, like ZBTB33 (ref. 121), display a dynamic DNA binding profile that may not be comprehensively determined by artificial in vitro assays. A combinatorial approach utilising screening and confirmation assays coupled with in vivo binding characterisation is currently the best way to determine the binding behaviour of DBD-containing proteins whilst minimising the limitations discussed above.

Differences in the types of tissue used in the human and mouse pull-downs are another limitation of the study and may affect the study of mCG reader conservation in human and mouse brain. This may, for example, offer another explanation for the enrichment of more generalised GO terms in CG when compared to mCG. Whole-brain tissue used in the mouse DNA pull-down had a greater cellular heterogeneity than the frontal cortex used in the human pull-down. This may explain the greater variety and number of proteins observed in the mouse dataset.

However, by and large, comparisons between human and mouse proved feasible, as a large overlap of proteins binding to the mCG dinucleotide was observed.
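One simple way to quantify such an overlap is to compare the two candidate lists as sets. The sketch below is illustrative only and is not the analysis performed in this thesis: it assumes that human and mouse orthologues can be matched by upper-casing gene symbols (a simplification that fails for orthologues with different names), and the gene names and the function name overlap_summary are hypothetical placeholders.

```python
# Minimal sketch: summarise the overlap between two lists of enriched proteins.
# Assumption: orthologues are matched by case-insensitive gene symbol only.

def normalise(symbols):
    """Upper-case gene symbols so that mouse 'Gene2' matches human 'GENE2'."""
    return {symbol.upper() for symbol in symbols}


def overlap_summary(human, mouse):
    """Return shared and species-specific symbols plus the Jaccard index."""
    h, m = normalise(human), normalise(mouse)
    shared = h & m
    union = h | m
    return {
        "shared": sorted(shared),
        "human_only": sorted(h - m),
        "mouse_only": sorted(m - h),
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Placeholder identifiers for illustration only; not results from the pull-down.
human_hits = ["GENE1", "GENE2", "GENE3"]
mouse_hits = ["Gene2", "Gene3", "Gene4", "Gene5"]
summary = overlap_summary(human_hits, mouse_hits)
```

Because the mouse pull-down used whole brain and the human pull-down used frontal cortex, a low index computed this way would not by itself indicate a lack of conservation.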

Both a strength and a limitation of DNA pull-downs is that they enrich for direct binders of the DNA probes as well as any attached proteins. Whilst this is useful for identifying protein complexes, a major assumption is made in plots containing DNA binders, where proteins are classified as such based on GO annotations for nucleic acid or DNA binding. It cannot be ruled out that some of these DNA binders, despite having DNA binding capabilities, are tethered to a larger protein complex in which another protein directly bound the DNA probes. Subsequent TAP-MS and DNA binding experiments are required to confirm direct interactions with the DNA probes. Lastly, the characterisation of protein complexes and interactions is based on, and limited by, current knowledge. The aforementioned interactions were generated by STRING or GO analysis, which rely on databases that curate protein interactions based on externally validated experiments. While this is useful in ascertaining the pull-down's success, it is limited to known interactions established through previously validated experimental work. Many other protein interactors identified in the pull-down were not assigned to any relevant biological or interaction network. These proteins could represent novel brain-specific and generalised interaction networks that link the binding of identified mCG DNA binders within the human and mouse pull-down study to various cellular outcomes.

IV.3.6 Building upon the mCG reader repertoire in the mammalian brain

The numerous proteins presented within this chapter represent novel mCG candidates linked to a variety of cellular mechanisms, expanding the growing list of mCG readers. Future studies are needed to biochemically validate candidates identified in this screen and couple these results to in vivo binding experiments. The pull-down also contains a repository of protein interactors with the capacity to influence gene expression in a multitude of biological contexts. Follow-up experiments utilising methods like TAP-MS may decouple direct DNA binding from protein interactions and identify protein complexes involved in the regulation or output of mCG. Together, these studies may deepen our understanding of mCG readout in the brain and help identify critical epigenetic processes underlying healthy brain development and functioning. An additional pull-down with CA and mCA probes was conducted in parallel to the mCG/CG DNA pull-down with the aim of identifying novel mCA readers involved in the regulation of neurodevelopment. Discussed within the next chapter, this experiment represents the first screen identifying mCA readers in human and mouse brain.

References

1. Hendrich, B. & Tweedie, S. The methyl-CpG binding domain and the evolving role of DNA methylation in animals. Trends Genet. 19, 269–277 (2003). 2. Li, E., Bestor, T. H. & Jaenisch, R. Targeted mutation of the DNA methyltransferase gene results in embryonic lethality. Cell 69, 915–926 (1992). 3. Okano, M., Bell, D. W., Haber, D. A. & Li, E. DNA methyltransferases Dnmt3a and Dnmt3b are essential for de novo methylation and mammalian development. Cell 99, 247–257 (1999). 4. Fatemi, M. & Wade, P. A. MBD family proteins: reading the epigenetic code. J. Cell Sci. 119, 3033–3037 (2006). 5. Roussel-Gervais, A., Naciri, I., Kirsh, O. & Kasprzyk, L. Loss of the methyl-CpG–binding protein ZBTB4 alters mitotic checkpoint, increases aneuploidy, and promotes tumorigenesis. Cancer Res. (2017). 6. Zemach, A. & Grafi, G. Characterization of Arabidopsis thaliana methyl‐CpG‐binding domain (MBD) proteins. Plant J. (2003). 7. Wade, P. A. et al. Mi-2 complex couples DNA methylation to chromatin remodelling and histone deacetylation. Nat. Genet. 23, 62–66 (1999). 8. Eden, S. & Cedar, H. Role of DNA methylation in the regulation of transcription. Curr. Opin. Genet. Dev. 4, 255–259 (1994). 9. Ng, H. H. & Bird, A. DNA methylation and chromatin modification. Curr. Opin. Genet. Dev. 9, 158–163 (1999). 10. Maurano, M. T. et al. Role of DNA Methylation in Modulating Transcription Factor Occupancy. Cell Rep. 12, 1184–1195 (2015). 11. Ng, H. H. et al. MBD2 is a transcriptional repressor belonging to the MeCP1 histone deacetylase complex. Nat. Genet. 23, 58–61 (1999). 12. Le Guezennec, X. et al. MBD2/NuRD and MBD3/NuRD, two distinct complexes with different biochemical and functional properties. Mol. Cell. Biol. 26, 843–851 (2006). 13. Du, Q., Luu, P.-L., Stirzaker, C. & Clark, S. J. Methyl-CpG-binding domain proteins: readers of the epigenome. Epigenomics 7, 1051–1073 (2015). 14. Watanabe, S., Ichimura, T., Tsuruzoe, S. & Shinkai, Y. 
Methyl-CpG binding domain 1 (MBD1) interacts with the Suv39h1-HP1 heterochromatic complex for DNA methylation-based transcriptional repression. of Biological Chemistry (2003). 15. Villa, R. et al. The methyl-CpG binding protein MBD1 is required for PML-RARα function. Proc. Natl. Acad. Sci. U. S. A. 103, 1400–1405 (2006). 16. Ichimura, T., Ohkuma, Y., Chiba, T. & Saya, H. MCAF mediates MBD1-dependent

transcriptional repression. and cellular biology (2003). 17. Ichimura, T. et al. Transcriptional repression and heterochromatin formation by MBD1 and MCAF/AM family proteins. J. Biol. Chem. 280, 13928–13935 (2005). 18. Wang, H. et al. mAM facilitates conversion by ESET of dimethyl to trimethyl lysine 9 of histone H3 to cause transcriptional repression. Mol. Cell 12, 475–487 (2003). 19. Xue, Y. et al. NURD, a novel complex with both ATP-dependent chromatin-remodeling and histone deacetylase activities. Mol. Cell 2, 851–861 (1998). 20. Millard, C. J. et al. The structure of the core NuRD repression complex provides insights into its interaction with chromatin. Elife 5, e13941 (2016). 21. Smeenk, G., Wiegant, W. W. & Vrolijk, H. The NuRD chromatin–remodeling complex regulates signaling and repair of DNA damage. J. Cell Biol. (2010). 22. Allen, H. F., Wade, P. A. & Kutateladze, T. G. The NuRD architecture. Cell. Mol. Life Sci. 70, 3513–3524 (2013). 23. Millard, C. J., Fairall, L. & Schwabe, J. W. R. Towards an understanding of the structure and function of MTA1. Cancer Metastasis Rev. 33, 857–867 (2014). 24. Schmitges, F. W. et al. Histone methylation by PRC2 is inhibited by active chromatin marks. Mol. Cell 42, 330–341 (2011). 25. Berger, S. L. The complex language of chromatin regulation during transcription. Nature 447, 407–412 (2007). 26. Kehle, J. et al. dMi-2, a hunchback-interacting protein that functions in polycomb repression. Science 282, 1897–1900 (1998). 27. Shi, Y. & Mello, C. A CBP/p300 homolog specifies multiple differentiation pathways in Caenorhabditis elegans. Genes Dev. (1998). 28. Feng, Q. & Zhang, Y. The MeCP1 complex represses transcription through preferential binding, remodeling, and deacetylating methylated nucleosomes. Genes Dev. 15, 827–832 (2001). 29. Günther, K. et al. Differential roles for MBD2 and MBD3 at methylated CpG islands, active promoters and binding to exon sequences. Nucleic Acids Res. 41, 3010–3021 (2013). 30. 
Hendrich, B., Guy, J., Ramsahoye, B., Wilson, V. A. & Bird, A. Closely related proteins MBD2 and MBD3 play distinctive but interacting roles in mouse development. Genes Dev. 15, 710–723 (2001). 31. Fujita, N. et al. Mechanism of transcriptional regulation by methyl-CpG binding protein MBD1. Mol. Cell. Biol. 20, 5107–5118 (2000). 32. Gong, F., Clouaire, T., Aguirrebengoa, M., Legube, G. & Miller, K. M. Histone demethylase KDM5A regulates the ZMYND8–NuRD chromatin remodeler to promote DNA repair. J. Cell Biol. jcb.201611135 (2017).

33. van der Torre, J., Wong, W. H. & DePinho, R. A. mSin3A corepressor regulates diverse transcriptional networks governing normal and neoplastic growth and survival. Genes (2005). 34. Pile, L. A., Spellman, P. T., Katzenberger, R. J. & Wassarman, D. A. The SIN3 deacetylase complex represses genes encoding mitochondrial proteins: implications for the regulation of energy metabolism. J. Biol. Chem. 278, 37840–37848 (2003). 35. Bernstein, B. E., Tong, J. K. & Schreiber, S. L. Genomewide studies of histone deacetylase function in yeast. Proc. Natl. Acad. Sci. U. S. A. 97, 13708–13713 (2000). 36. Baltus, G. A., Kowalski, M. P., Tutter, A. V. & Kadam, S. A positive regulatory role for the mSin3A-HDAC complex in pluripotency through Nanog and . J. Biol. Chem. 284, 6998–7006 (2009). 37. Saunders, A. et al. The SIN3A/HDAC Corepressor Complex Functionally Cooperates with NANOG to Promote Pluripotency. Cell Rep. 18, 1713–1726 (2017). 38. Laherty, C. D., Lawrence, Q. A. & Armstrong, A. P. Mad proteins contain a dominant transcription repression domain. and cellular biology (1996). 39. Heideman, M. R. et al. Sin3a-associated Hdac1 and Hdac2 are essential for hematopoietic stem cell homeostasis and contribute differentially to hematopoiesis. Haematologica 99, 1292–1303 (2014). 40. Jones, P. L. et al. Methylated DNA and MeCP2 recruit histone deacetylase to repress transcription. Nat. Genet. 19, 187–191 (1998). 41. Nan, X. et al. Transcriptional repression by the methyl-CpG-binding protein MeCP2 involves a histone deacetylase complex. Nature 393, 386–389 (1998). 42. Della Ragione, F., Vacca, M., Fioriniello, S., Pepe, G. & D’Esposito, M. MECP2, a multi-talented modulator of chromatin architecture. Brief. Funct. Genomics 15, 420–431 (2016). 43. Harikrishnan, K. N. et al. Brahma links the SWI/SNF chromatin-remodeling complex with MeCP2-dependent transcriptional silencing. Nat. Genet. 37, 254–264 (2005). 44. You, A., Tong, J. K., Grozinger, C. M. & Schreiber, S. L. 
CoREST is an integral component of the CoREST-human histone deacetylase complex. Proc. Natl. Acad. Sci. U. S. A. 98, 1454–1458 (2001). 45. Shi, Y. et al. Histone demethylation mediated by the nuclear amine oxidase homolog LSD1. Cell 119, 941–953 (2004). 46. Metzger, E. et al. LSD1 demethylates repressive histone marks to promote androgen-receptor-dependent transcription. Nature 437, 436–439 (2005). 47. Ballas, N., Grunseich, C., Lu, D. D., Speh, J. C. & Mandel, G. REST and its corepressors mediate plasticity of neuronal gene chromatin throughout neurogenesis. Cell 121, 645–657 (2005).

48. Urnov, F. D., Wolffe, A. P. & Guschin, D. Molecular mechanisms of corepressor function. Curr. Top. Microbiol. Immunol. 254, 1–33 (2001). 49. Glass, C. K. & Rosenfeld, M. G. The coregulator exchange in transcriptional functions of nuclear receptors. Genes Dev. 14, 121–141 (2000). 50. Oberoi, J. et al. Structural basis for the assembly of the SMRT/NCoR core transcriptional repression machinery. Nat. Struct. Mol. Biol. 18, 177–184 (2011). 51. Itoh, T. et al. Structural and functional characterization of a cell cycle associated HDAC1/2 complex reveals the structural basis for complex assembly and nucleosome targeting. Nucleic Acids Res. 43, 2033–2044 (2015). 52. Hermanson, O., Jepsen, K. & Rosenfeld, M. G. N-CoR controls differentiation of neural stem cells into astrocytes. Nature 419, 934–939 (2002). 53. Yoon, H.-G., Chan, D. W., Reynolds, A. B., Qin, J. & Wong, J. N-CoR Mediates DNA Methylation-Dependent Repression through a Methyl CpG Binding Protein Kaiso. Mol. Cell 12, 723–734 (2003). 54. Hayakawa, T. & Nakayama, J.-I. Physiological roles of class I HDAC complex and histone demethylase. J. Biomed. Biotechnol. 2011, 129383 (2011). 55. Kong, L. et al. A primary role of TET proteins in establishment and maintenance of De Novo bivalency at CpG islands. Nucleic Acids Res. 44, 8682–8692 (2016). 56. Verma, N. et al. TET proteins safeguard bivalent promoters from de novo methylation in human embryonic stem cells. Nat. Genet. 50, 83–95 (2018). 57. Simon, J. A. & Kingston, R. E. Occupying chromatin: Polycomb mechanisms for getting to genomic targets, stopping transcriptional traffic, and staying put. Mol. Cell 49, 808–824 (2013). 58. Aranda, S., Mas, G. & Di Croce, L. Regulation of gene transcription by Polycomb proteins. Sci Adv 1, e1500737 (2015). 59. Geisler, S. J. & Paro, R. Trithorax and Polycomb group-dependent regulation: a tale of opposing activities. Development 142, 2876–2887 (2015). 60. Goodrich, J. et al. 
A Polycomb-group gene regulates homeotic gene expression in Arabidopsis. Nature 386, 44–51 (1997). 61. Shaver, S., Casas-Mollano, J. A., Cerny, R. L. & Cerutti, H. Origin of the polycomb repressive complex 2 and gene silencing by an E(z) homolog in the unicellular alga Chlamydomonas. Epigenetics 5, 301–312 (2010). 62. Jamieson, K., Rountree, M. R., Lewis, Z. A., Stajich, J. E. & Selker, E. U. Regional control of histone H3 lysine 27 methylation in Neurospora. Proc. Natl. Acad. Sci. U. S. A. 110, 6027–6032 (2013). 63. Ringrose, L. & Paro, R. Epigenetic regulation of cellular memory by the Polycomb and Trithorax group proteins. Annu. Rev. Genet. 38, 413–443 (2004).

64. Di Croce, L. & Helin, K. Transcriptional regulation by Polycomb group proteins. Nat. Struct. Mol. Biol. 20, 1147–1155 (2013). 65. Wu, H. et al. Dual functions of Tet1 in transcriptional regulation in mouse embryonic stem cells. Nature 473, 389–393 (2011). 66. Cao, R. et al. Role of histone H3 lysine 27 methylation in Polycomb-group silencing. Science 298, 1039–1043 (2002). 67. Wang, L. et al. Hierarchical recruitment of polycomb group silencing complexes. Mol. Cell 14, 637–646 (2004). 68. Gao, Z. et al. PCGF homologs, CBX proteins, and RYBP define functionally distinct PRC1 family complexes. Mol. Cell 45, 344–356 (2012). 69. Tavares, L. et al. RYBP-PRC1 complexes mediate H2A ubiquitylation at polycomb target sites independently of PRC2 and H3K27me3. Cell 148, 664–678 (2012). 70. Gearhart, M. D., Corcoran, C. M., Wamstad, J. A. & Bardwell, V. J. Polycomb group and SCF ubiquitin ligases are found in a novel BCOR complex that is recruited to BCL6 targets. Mol. Cell. Biol. 26, 6880–6889 (2006). 71. He, J. et al. Kdm2b maintains murine embryonic stem cell status by recruiting PRC1 complex to CpG islands of developmental genes. Nat. Cell Biol. 15, 373–384 (2013). 72. Blackledge, N. P. et al. Variant PRC1 complex-dependent H2A ubiquitylation drives PRC2 recruitment and polycomb domain formation. Cell 157, 1445–1459 (2014). 73. Wong, S. J. et al. KDM2B Recruitment of the Polycomb Group Complex, PRC1.1, Requires Cooperation between PCGF1 and BCORL1. Structure 24, 1795–1801 (2016). 74. Mendenhall, E. M. et al. GC-rich sequence elements recruit PRC2 in mammalian ES cells. PLoS Genet. 6, e1001244 (2010). 75. Lynch, M. D. et al. An interspecies analysis reveals a key role for unmethylated CpG dinucleotides in vertebrate Polycomb complex recruitment. EMBO J. 31, 317–329 (2012). 76. Litt, M. D., Simpson, M., Gaszner, M., Allis, C. D. & Felsenfeld, G. Correlation between histone lysine methylation and developmental changes at the chicken beta-globin locus. 
Science 293, 2453–2455 (2001). 77. Noma K, Allis, C. D. & Grewal, S. I. Transitions in distinct histone H3 methylation patterns at the heterochromatin domain boundaries. Science 293, 1150–1155 (2001). 78. Stassen, M. J., Bailey, D., Nelson, S., Chinwalla, V. & Harte, P. J. The Drosophila trithorax proteins contain a novel variant of the nuclear receptor type DNA binding domain and an ancient conserved motif found in other chromosomal proteins. Mech. Dev. 52, 209–223 (1995). 79. Jenuwein, T., Laible, G., Dorn, R. & Reuter, G. SET domain proteins modulate chromatin domains in eu- and heterochromatin. Cell. Mol. Life Sci. 54, 80–93 (1998).

80. Miller, T. et al. COMPASS: a complex of proteins associated with a trithorax-related SET domain protein. Proc. Natl. Acad. Sci. U. S. A. 98, 12902–12907 (2001). 81. Mikheyeva, I. V., Grady, P. J. R., Tamburini, F. B., Lorenz, D. R. & Cam, H. P. Multifaceted genome control by Set1 Dependent and Independent of H3K4 methylation and the Set1C/COMPASS complex. PLoS Genet. 10, e1004740 (2014). 82. Barski, A. et al. High-resolution profiling of histone methylations in the human genome. Cell 129, 823–837 (2007). 83. Heintzman, N. D. et al. Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature 459, 108–112 (2009). 84. Wysocka, J. et al. WDR5 associates with histone H3 methylated at K4 and is essential for H3 K4 methylation and vertebrate development. Cell 121, 859–872 (2005). 85. Steward, M. M. et al. Molecular regulation of H3K4 trimethylation by ASH2L, a shared subunit of MLL complexes. Nat. Struct. Mol. Biol. 13, 852–854 (2006). 86. Dou, Y. et al. Regulation of MLL1 H3K4 methyltransferase activity by its core components. Nat. Struct. Mol. Biol. 13, 713–719 (2006). 87. Tyagi, S., Chabes, A. L., Wysocka, J. & Herr, W. E2F activation of S phase promoters via association with HCF-1 and the MLL family of histone H3K4 methyltransferases. Mol. Cell 27, 107–119 (2007). 88. Wu, M. et al. Molecular regulation of H3K4 trimethylation by Wdr82, a component of human Set1/COMPASS. Mol. Cell. Biol. 28, 7337–7344 (2008). 89. Wang, P., Lin, C., Smith, E. R. & Guo, H. Global analysis of H3K4 methylation defines MLL family member targets and points to a role for MLL1-mediated H3K4 methylation in the regulation of transcriptional …. and cellular biology (2009). 90. Milne, T. et al. The mixed lineage leukemia protein (MLL) targets SET domain methyltransferase activity to Hox gene promoters. in Blood vol. 100 137A–137A (AMER SOC HEMATOLOGY 1900 M STREET. NW SUITE 200, WASHINGTON, DC 20036 USA, 2002). 91. Debernardi, S. et al. 
The MLL fusion partner AF10 binds GAS41, a protein that interacts with the human SWI/SNF complex. Blood 99, 275–281 (2002). 92. Hughes, C. M. et al. Menin associates with a trithorax family histone methyltransferase complex and with the locus. Mol. Cell 13, 587–597 (2004). 93. Verdone, L., Agricola, E., Caserta, M. & Di Mauro, E. Histone acetylation in gene regulation. Brief. Funct. Genomic. Proteomic. 5, 209–221 (2006). 94. Milne, T. A. et al. MLL targets SET domain methyltransferase activity to Hox gene promoters. Mol. Cell 10, 1107–1117 (2002). 95. Erfurth, F. E. et al. MLL protects CpG clusters from methylation within the Hoxa9 gene, maintaining transcript expression. Proc. Natl. Acad. Sci. U. S. A. 105, 7517–7522

IV-173

(2008). 96. Cierpicki, T. et al. Structure of the MLL CXXC domain–DNA complex and its functional role in MLL-AF9 leukemia. Nat. Struct. Mol. Biol. 17, 62 (2009). 97. Spruijt, C. G. et al. Dynamic readers for 5-(hydroxy)methylcytosine and its oxidized derivatives. Cell 152, 1146–1159 (2013). 98. Yin, Y. et al. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science 356, (2017). 99. Wu, M. et al. The MTA family proteins as novel histone H3 binding proteins. Cell Biosci. 3, 1 (2013). 100. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57 (2009). 101. Pabst, O., Herbrand, H., Takuma, N. & Arnold, H. H. NKX2 gene expression in neuroectoderm but not in mesendodermally derived structures depends on sonic hedgehog in mouse embryos. Dev. Genes Evol. 210, 47–50 (2000). 102. Qi, Y. et al. Control of oligodendrocyte differentiation by the Nkx2.2 homeodomain transcription factor. Development 128, 2723–2733 (2001). 103. Götz, R. et al. Bag1 is essential for differentiation and survival of hematopoietic and neuronal cells. Nat. Neurosci. 8, 1169–1178 (2005). 104. Meehan, R. R., Lewis, J. D., McKay, S., Kleiner, E. L. & Bird, A. P. Identification of a mammalian protein that binds specifically to DNA containing methylated CpGs. Cell 58, 499–507 (1989). 105. Hendrich, B. et al. Genomic structure and chromosomal mapping of the murine and human Mbd1, Mbd2, Mbd3, and Mbd4 genes. Mamm. Genome 10, 906–912 (1999). 106. Bartke, T. et al. Nucleosome-interacting proteins regulated by DNA and histone methylation. Cell 143, 470–484 (2010). 107. Jørgensen, H. F., Ben-Porath, I. & Bird, A. P. Mbd1 is recruited to both methylated and nonmethylated CpGs via distinct DNA binding domains. Mol. Cell. Biol. 24, 3387–3395 (2004). 108. Kaji, K. et al. The NuRD component Mbd3 is required for pluripotency of embryonic stem cells. Nat. Cell Biol. 
8, 285–292 (2006). 109. Hendrich, B., Hardeland, U., Ng, H. H., Jiricny, J. & Bird, A. The thymine glycosylase MBD4 can bind to the product of deamination at methylated CpG sites. Nature 401, 301–304 (1999). 110. Baubec, T., Ivánek, R., Lienert, F. & Schübeler, D. Methylation-dependent and - independent genomic targeting principles of the MBD protein family. Cell 153, 480–492 (2013). 111. Hashimoto, H. et al. Wilms tumor protein recognizes 5-carboxylcytosine within a specific

IV-174

DNA sequence. Genes Dev. 28, 2304–2313 (2014). 112. Zandarashvili, L., White, M. A., Esadze, A. & Iwahara, J. Structural impact of complete CpG methylation within target DNA on specific complex formation of the inducible transcription factor Egr-1. FEBS Lett. 589, 1748–1753 (2015). 113. Sugiaman-Trapman, D. et al. Characterization of the human RFX transcription factor family by regulatory and target gene analysis. BMC Genomics 19, 181 (2018). 114. Huang, L. H., Wang, R., Gama-Sosa, M. A., Shenoy, S. & Ehrlich, M. A protein from human placental nuclei binds preferentially to 5-methylcytosine-rich DNA. Nature 308, 293–295 (1984). 115. Wang, R. Y., Zhang, X. Y. & Ehrlich, M. A human DNA-binding protein is methylation- specific and sequence-specific. Nucleic Acids Res. 14, 1599–1614 (1986). 116. Zhang, X. Y. et al. The major histocompatibility complex class II promoter-binding protein RFX (NF-X) is a methylated DNA-binding protein. Mol. Cell. Biol. 13, 6810–6818 (1993). 117. Reith, W. et al. RFX1, a transactivator of hepatitis B virus enhancer I, belongs to a novel family of homodimeric and heterodimeric DNA-binding proteins. Mol. Cell. Biol. 14, 1230–1244 (1994). 118. Emery, P., Durand, B., Mach, B. & Reith, W. RFX proteins, a novel family of DNA binding proteins conserved in the eukaryotic kingdom. Nucleic Acids Res. 24, 803–807 (1996). 119. de Dieuleveult, M. & Miotto, B. DNA Methylation and Chromatin: Role(s) of Methyl-CpG- Binding Protein ZBTB38. Epigenetics insights vol. 11 2516865718811117 (2018). 120. Bartels, S. J. J. et al. A SILAC-based screen for Methyl-CpG binding proteins identifies RBP-J as a DNA methylation and sequence-specific binding protein. PLoS One 6, e25884 (2011). 121. Blattler, A. et al. ZBTB33 binds unmethylated regions of the genome associated with actively expressed genes. Epigenetics Chromatin 6, 13 (2013). 122. Kannan, M. B., Solovieva, V. & Blank, V. 
The small MAF transcription factors MAFF, MAFG and MAFK: current knowledge and perspectives. Biochim. Biophys. Acta 1823, 1841–1846 (2012). 123. Katsuoka, F. & Yamamoto, M. Small Maf proteins (MafF, MafG, MafK): History, structure and function. Gene 586, 197–205 (2016). 124. Sun, H., Ghaffari, S. & Taneja, R. bHLH-Orange Transcription Factors in Development and Cancer. Transl. Oncogenomics 2, 107–120 (2007). 125. Hu, S. et al. DNA methylation presents distinct binding sites for human transcription factors. Elife 2, e00726 (2013). 126. Okur, Z. & Scheiffele, P. The Yin and Yang of Arnt2 in Activity-Dependent Transcription.

IV-175

Neuron 102, 270–272 (2019). 127. Bersten, D. C., Bruning, J. B., Peet, D. J. & Whitelaw, M. L. Human variants in the neuronal basic helix-loop-helix/Per-Arnt-Sim (bHLH/PAS) transcription factor complex NPAS4/ARNT2 disrupt function. PLoS One 9, e85768 (2014). 128. Huang, S. et al. DDB2 Is a Novel Regulator of Wnt Signaling in Colon Cancer. Cancer Res. 77, 6562–6575 (2017). 129. Filion, G. J. P. et al. A family of human zinc finger proteins that bind methylated DNA and repress transcription. Mol. Cell. Biol. 26, 169–181 (2006). 130. Sasai, N., Nakao, M. & Defossez, P.-A. Sequence-specific recognition of methylated DNA by human zinc-finger proteins. Nucleic Acids Res. 38, 5015–5022 (2010). 131. Zhou, T. et al. Structural basis for hydroxymethylcytosine recognition by the SRA domain of UHRF2. Mol. Cell 54, 879–886 (2014). 132. Vaughan, R. M. et al. Comparative biochemical analysis of UHRF proteins reveals molecular mechanisms that uncouple UHRF2 from DNA methylation maintenance. Nucleic Acids Res. 46, 4405–4416 (2018). 133. Liu, Y. et al. UHRF2 regulates local 5-methylcytosine and suppresses spontaneous seizures. Epigenetics 12, 551–560 (2017). 134. Cheval, H. et al. Distinctive features of Egr transcription factor regulation and DNA binding activity in CA1 of the hippocampus in synaptic plasticity and consolidation and reconsolidation of fear memory. Hippocampus 22, 631–642 (2012). 135. Poirier, R. et al. Distinct functions of egr gene family members in cognitive processes. Front. Neurosci. 2, 47–55 (2008). 136. Koldamova, R. et al. Genome-wide approaches reveal EGR1-controlled regulatory networks associated with neurodegeneration. Neurobiol. Dis. 63, 107–114 (2014). 137. Jones, M. W. et al. A requirement for the immediate early gene Zif268 in the expression of late LTP and long-term memories. Nat. Neurosci. 4, 289–296 (2001). 138. Gallitano-Mendel, A. et al. The immediate early gene early growth response gene 3 mediates adaptation to stress and novelty. 
Neuroscience 148, 633–643 (2007). 139. Jackson, B. C., Carpenter, C., Nebert, D. W. & Vasiliou, V. Update of human and mouse forkhead box (FOX) gene families. Hum. Genomics 4, 345–352 (2010). 140. Friedman, J. R. & Kaestner, K. H. The Foxa family of transcription factors in development and metabolism. Cell. Mol. Life Sci. 63, 2317–2328 (2006). 141. Golson, M. L. & Kaestner, K. H. Fox transcription factors: from development to disease. Development 143, 4558–4570 (2016). 142. Prokhortchouk, A. V., Aitkhozhina, D. S., Sablina, A. A., Ruzov, A. S. & Prokhortchouk, E. B. Kaiso, a New Protein of the BTB/POZ Family, Specifically Binds to Methylated DNA Sequences. Russ. J. Genet. 37, 603–609 (2001).


143. Weigel, D. & Jäckle, H. The fork head domain: a novel DNA binding motif of eukaryotic transcription factors? Cell 63, 455–456 (1990).
144. Hannenhalli, S. & Kaestner, K. H. The evolution of Fox genes and their role in development and disease. Nat. Rev. Genet. 10, 233–240 (2009).
145. Zhu, H., Wang, G. & Qian, J. Transcription factors as readers and effectors of DNA methylation. Nat. Rev. Genet. 17, 551–565 (2016).
146. Vogt, P. K., Jiang, H. & Aoki, M. Triple layer control: phosphorylation, acetylation and ubiquitination of FOXO proteins. Cell Cycle 4, 908–913 (2005).
147. Obsil, T. & Obsilova, V. Structure/function relationships underlying regulation of FOXO transcription factors. Oncogene 27, 2263–2275 (2008).
148. Bois, P. R. J., Brochard, V. F., Salin-Cantegrel, A. V. A., Cleveland, J. L. & Grosveld, G. C. FoxO1a-cyclic GMP-dependent kinase I interactions orchestrate myoblast fusion. Mol. Cell. Biol. 25, 7645–7656 (2005).
149. Ying, M. et al. Kruppel-like factor-9 (KLF9) inhibits glioblastoma stemness through global transcription repression and integrin α6 inhibition. J. Biol. Chem. 289, 32742–32756 (2014).
150. Gelfman, S., Cohen, N., Yearim, A. & Ast, G. DNA-methylation effect on cotranscriptional splicing is dependent on GC architecture of the exon–intron structure. Genome Res. (2013).
151. Long, S. W., Ooi, J. Y. Y., Yau, P. M. & Jones, P. L. A brain-derived MeCP2 complex supports a role for MeCP2 in RNA processing. Biosci. Rep. 31, 333–343 (2011).
152. Pevny, L. H., Sockanathan, S., Placzek, M. & Lovell-Badge, R. A role for SOX1 in neural determination. Development 125, 1967–1978 (1998).
153. Wang, Y., Bagheri-Fam, S. & Harley, V. R. SOX13 is up-regulated in the developing mouse neuroepithelium and identifies a sub-population of differentiating neurons. Brain Res. Dev. Brain Res. 157, 201–208 (2005).
154. Murdoch, B., DelConte, C. & García-Castro, M. I. Pax7 lineage contributions to the mammalian neural crest. PLoS One 7, e41089 (2012).
155. Sorensen, P. H. B. et al. PAX3-FKHR and PAX7-FKHR gene fusions are prognostic indicators in alveolar rhabdomyosarcoma: a report from the children’s oncology group. J. Clin. Oncol. 20, 2672–2679 (2002).
156. Su, X. Y. et al. Various types of rearrangements target TLX3 locus in T-cell acute lymphoblastic leukemia. Genes Chromosomes Cancer 41, 243–249 (2004).
157. Pollard, K. S. et al. An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443, 167–172 (2006).
158. Fiore, R., Khudayberdiev, S., Saba, R. & Schratt, G. MicroRNA function in the nervous system. Prog. Mol. Biol. Transl. Sci. 102, 47–100 (2011).


159. Coufal, N. G. et al. L1 retrotransposition in human neural progenitor cells. Nature 460, 1127–1131 (2009).
160. Wu, J. & Xie, X. Comparative sequence analysis reveals an intricate network among REST, CREB and miRNA in mediating neuronal gene expression. Genome Biol. 7, R85 (2006).
161. Kim, M. et al. Distinct pathways for snoRNA and mRNA termination. Mol. Cell 24, 723–734 (2006).
162. Andersson, R. et al. Nuclear stability and transcriptional directionality separate functionally distinct RNA species. Nat. Commun. 5, 5336 (2014).
163. Pefanis, E. et al. Noncoding RNA transcription targets AID to divergently transcribed loci in B cells. Nature 514, 389–393 (2014).
164. Ogami, K., Chen, Y. & Manley, J. L. RNA surveillance by the nuclear RNA exosome: mechanisms and significance. Noncoding RNA 4, (2018).
165. Neri, F. et al. Intragenic DNA methylation prevents spurious transcription initiation. Nature 543, 72–77 (2017).
166. Iurlaro, M. et al. A screen for hydroxymethylcytosine and formylcytosine binding proteins suggests functions in transcription and chromatin regulation. Genome Biol. 14, R119 (2013).
167. Hsu, P. J. et al. Ythdc2 is an N6-methyladenosine binding protein that regulates mammalian spermatogenesis. Cell Res. 27, 1115 (2017).


Supplementary information

Table S4.1: Combined dataset corresponds to proteins observed in human and mouse mCG/CG datasets. Combined-human represents a significantly enriched protein in human but not mouse and combined-mouse represents a significantly enriched protein in mouse but not human. Proteins marked by a * denotation are observed as statistically enriched within the mCA/CA DNA pull-downs (see chapter 5 supplementary tables). Proteins in red were enriched for a DNA binding context opposite to that observed within the DNA pull-down (i.e. enriched for mCG within the mCG/CG DNA pull-down but published to bind CG or vice versa).

Species             GeneID    Reference      Species             GeneID    Reference
Combined mCG        EGR1      111,112        Combined CG         APEX1*    --
Combined mCG        EGR3      --             Combined CG         ARNT2     98,125
Combined mCG        FOXJ3     98             Combined CG         ASH2L     97
Combined mCG        FOXK1     97             Combined CG         ATXN1     --
Combined mCG        FOXO1*    --             Combined CG         CGGBP1    106
Combined mCG        KLF13     98             Combined CG         CIC       --
Combined mCG        MBD2*     105,106        Combined CG         CXXC5     106
Combined mCG        MECP2     97             Combined CG         DEAF1     106
Combined mCG        MTA1      97,106         Combined CG         KAT6B*    --
Combined mCG        MTA2*     97,106         Combined CG         KDM2B     166
Combined mCG        MTA3*     97,106         Combined CG         KMT2A     --
Combined mCG        RFX1*     97             Combined CG         MYPOP*    --
Combined mCG        RFX5      97,106         Combined CG         RBM45     97
Combined mCG        UHRF2     97             Combined CG         TFE3*     98
Combined mCG        ZBTB4     97,106         Combined CG         TFEB*     98,106
Combined-human mCG  BBX       --             Combined CG         ZMYND11   97
Combined-human mCG  CHD4      106            Combined-human CG   SND1      --
Combined-human mCG  CHD5      97,106         Combined-human CG   THAP4     --
Combined-human mCG  CSTF2     --             Combined-human CG   USF1*     98,106
Combined-human mCG  EHD2      --             Combined-human CG   USF2*     98,106
Combined-human mCG  NTHL1     --             Combined-human CG   YY1       98
Combined-human mCG  PBX1      97,98,106      Combined-mouse CG   Ddb2      97
Combined-human mCG  PBX3      97,106         Combined-mouse CG   Fosl2     97
Combined-human mCG  RCOR3     --             Combined-mouse CG   Top3A     106
Combined-human mCG  RECQL5*   --
Combined-human mCG  RFXAP     97,106
Combined-human mCG  SSB       --
Combined-human mCG  WIZ       97
Combined-human mCG  YTHDC2*   167
Combined-human mCG  ZFP62     --
Combined-mouse mCG  Foxk2     97
Combined-mouse mCG  Nfat5     98
Combined-mouse mCG  Nkx2      97

Table S4.2: Human-limited mCG/CG dataset corresponds to proteins observed in human only. Proteins marked by a * denotation are observed as statistically enriched within the mCA/CA DNA pull-downs (see chapter 5 supplementary tables). Proteins in red were enriched for a DNA binding context opposite to that observed within the DNA pull-down (i.e. enriched for mCG within the mCG/CG DNA pull-down but published to bind CG or vice versa).

Species             GeneID    Reference      Species             GeneID    Reference
Human-limited mCG   BHLHE40*  98             Human-limited CG    KDM2A     97
Human-limited mCG   CHD3      106            Human-limited CG    MAX*      97,98
Human-limited mCG   KLF3      97,98          Human-limited CG    MXI1*     --
Human-limited mCG   MAFF      98             Human-limited CG    NPAS3     --
Human-limited mCG   MBD4      97,106,166     Human-limited CG    OLIG2*    --
Human-limited mCG   SCAF4     --             Human-limited CG    PRRX1     --
Human-limited mCG   SOX13*    --             Human-limited CG    RNASEH1   --
Human-limited mCG   VEZF1*    --             Human-limited CG    SREBF1*   98
Human-limited mCG   ZBTB14    106            Human-limited CG    SYNJ1     --
Human-limited mCG   ZNF174    98             Human-limited CG    TBPL1     --
Human-limited mCG   ZNF575    --             Human-limited CG    TFAP4*    --
                                             Human-limited CG    ZNF134    --

Table S4.3: Mouse-limited mCG/CG dataset corresponds to proteins observed in mouse only. Proteins marked by a * denotation are observed as statistically enriched within the mCA/CA DNA pull-downs (see chapter 5 supplementary tables). Proteins in red were enriched for a DNA binding context opposite to that observed within the DNA pull-down (i.e. enriched for mCG within the mCG/CG DNA pull-down but published to bind CG or vice versa).

Species             GeneID    Reference      Species             GeneID    Reference
Mouse-limited mCG   Cstf2t    --             Mouse-limited CG    Hivep2*   --
Mouse-limited mCG   En2       98             Mouse-limited CG    Smad2     --
Mouse-limited mCG   Foxn3     --             Mouse-limited CG    Crebzf    --
Mouse-limited mCG   Foxo3*    --             Mouse-limited CG    Zic1      --
Mouse-limited mCG   Foxp1     --             Mouse-limited CG    Smad3     --
Mouse-limited mCG   Gcfc2     --             Mouse-limited CG    Flywch1   --
Mouse-limited mCG   Klf12*    97             Mouse-limited CG    Mtf1*     98
Mouse-limited mCG   Klf16*    98             Mouse-limited CG    Gtf2e2    --
Mouse-limited mCG   Meis3*    98             Mouse-limited CG    Nme2*     --
Mouse-limited mCG   Pax7      --             Mouse-limited CG              --
Mouse-limited mCG   Rax       98             Mouse-limited CG    Fbxl19    --
Mouse-limited mCG   Znf464    --             Mouse-limited CG    Ovol2     --
Mouse-limited mCG   Rfx3      97             Mouse-limited CG    Tcf12     --
Mouse-limited mCG   Rfxank    97,106         Mouse-limited CG    Smad4     --
Mouse-limited mCG   Sap30l    --             Mouse-limited CG    Npas1     --
Mouse-limited mCG   Setx      --             Mouse-limited CG    Cxxc4     --
Mouse-limited mCG   Sox1      --             Mouse-limited CG    St18      --
Mouse-limited mCG   Sp1       98
Mouse-limited mCG   Tfcp2     97,98
Mouse-limited mCG   Thra      --
Mouse-limited mCG   Tlx3      --
Mouse-limited mCG   Ubp1      97,98
Mouse-limited mCG   Wrn*      --
Mouse-limited mCG   Zbtb33    97,98,106,120
Mouse-limited mCG   Zhx2      106
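The dataset labels used in Tables S4.1 to S4.3 follow a simple set logic, which can be sketched as below. This is a minimal illustration, not the analysis code used in this thesis: the function name `categorise`, the four input sets, and the reading of "Combined" as "significantly enriched in both species" are assumptions made for the example, and the protein keys in the usage example are placeholders rather than real data.

```python
def categorise(observed_human, observed_mouse, enriched_human, enriched_mouse):
    """Label each enriched protein by the dataset it would fall into.

    All four arguments are sets of protein identifiers (hypothetical inputs,
    assumed to be mapped to a shared orthologue-level key beforehand).
    """
    # An enriched protein must also have been observed in that species.
    observed_human = observed_human | enriched_human
    observed_mouse = observed_mouse | enriched_mouse

    labels = {}
    for protein in enriched_human | enriched_mouse:
        in_human = protein in observed_human
        in_mouse = protein in observed_mouse
        if in_human and in_mouse:
            # Observed in both species: split by where it is significant.
            if protein in enriched_human and protein in enriched_mouse:
                labels[protein] = "Combined"
            elif protein in enriched_human:
                labels[protein] = "Combined-human"
            else:
                labels[protein] = "Combined-mouse"
        elif in_human:
            labels[protein] = "Human-limited"
        else:
            labels[protein] = "Mouse-limited"
    return labels


# Placeholder identifiers only; these are not proteins from the tables.
example = categorise(
    observed_human={"A", "B", "C", "D"},
    observed_mouse={"A", "B", "C", "E"},
    enriched_human={"A", "B", "D"},
    enriched_mouse={"A", "C", "E"},
)
```

In this toy input, "B" is observed in both species but significant only in human, so it would be labelled Combined-human, whereas "D" is never observed in mouse and would be labelled Human-limited.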


Table S4.4: Human P/A mCG/CG corresponds to entire-human P/A created by merging combined-human P/A and human-limited P/A. Proteins marked by a * denotation are observed as statistically enriched within the mCA/CA DNA pull-downs (see chapter 5 supplementary tables).

Species             GeneID    Reference      Species             GeneID    Reference
Human P/A mCG       FOXO3*    --             Human P/A CG        ENO1      --
Human P/A mCG       KLF12*    98             Human P/A CG        HIVEP2*   --
Human P/A mCG       KLF16*    98             Human P/A CG        LRPPRC*   --
Human P/A mCG       KLF9      --             Human P/A CG        OLIG3*    --
Human P/A mCG       LARP4B    --             Human P/A CG        TCF7L2    98
Human P/A mCG       POU2F2    98             Human P/A CG        ZIC2*     --
Human P/A mCG       ZBTB44*   97
Human P/A mCG       ZNF445    --
Human P/A mCG       ZNF683*   --

Table S4.5: Mouse P/A mCG/CG corresponds to entire-mouse P/A created by merging combined-mouse P/A and mouse-limited P/A. Proteins marked by a * denotation are observed as statistically enriched within the mCA/CA DNA pull-downs (see chapter 5 supplementary tables).

Species             GeneID    Reference      Species             GeneID    Reference
Mouse P/A mCG       Sp9       98             Mouse P/A CG        Bnc2      97
Mouse P/A mCG       Zbtb44*   97             Mouse P/A CG        Esrra     --
                                             Mouse P/A CG        Mitf*     --
                                             Mouse P/A CG        Mtf2      --


Identification and characterisation of mCA readers within human and mouse brain

Summary

The most prevalent DNA modification in eukaryotic genomes is cytosine methylation, which drives cell type-specific changes in concert with other epigenetic processes such as histone modifications. In most somatic cells, DNA methylation occurs overwhelmingly in the CG dinucleotide context; however, non-CG methylation (mCH, where H = A, C, or T) contributes a substantial portion of the overall DNA methylation within pluripotent stem cells and neurons in the brain. The most frequent form of mCH is in the CA dinucleotide context (mCA). Negligible at birth, CH methylation increases throughout mammalian brain development, coinciding with synaptogenesis and neuronal maturation, and becomes the most abundant form of DNA methylation in the adult brain1. Furthermore, mCH deposition in the brain is highly conserved and inversely correlated with gene expression, showing cell type-specific gene expression patterns1. However, the mechanistic links between the deposition of mCH and its transcriptional readout are poorly understood. So far, only one mC reader, MECP2, has been demonstrated to bind to mCH sites in vitro and in vivo2. The loss of MECP2 results in Rett Syndrome (RTT), in which symptoms begin to manifest at the time when mCH starts to accumulate rapidly in maturing neurons, while overexpression of MECP2 causes another disorder, MECP2 duplication syndrome3,4. To understand the regulatory repercussions of mCH within the brain, here I conducted a DNA pull-down coupled to mass spectrometry utilising human frontal cortex and mouse whole brain incubated with methylated and unmethylated probes in the CA context. This affinity proteomics screen was successfully implemented and identified numerous candidate mCA readers, including MBD2, providing a foundation for further characterisation of the roles of mCA and its readers within the brain. The selective affinity of the top mCA reader candidate, MBD2, for mCA was confirmed by gel shift analysis using a recombinantly expressed and purified MBD domain of MBD2. Thus, this study provides a foundational set of candidate mCH binding proteins in the mammalian brain, and promising evidence implicating MBD2 in mCA regulation, underpinning future efforts to elucidate its role in mCA readout and function in the mammalian brain.

Introduction


Methyl-readers couple DNA methylation to transcriptional output, and constitute fundamental epigenetic processes within the cell. The characterisation of mCG readers in particular has established defined molecular regulatory networks that affect transcription directly or indirectly by influencing local chromatin structure. The recent implication of mCH (and particularly its most abundant form, mCA) in brain development has raised similar questions about how this distinct covalent modification of DNA transmits information at the molecular level to influence neuronal function and brain development. However, several characteristics, such as its cellular specificity and deposition signature, have made mCA a challenging epigenetic mark to study. As such, only MECP2 has been implicated in the recognition and biological readout of mCA thus far.

V.1.1 Non-CG Methylation in mammalian cells and tissues

The detection of CH methylation within mammalian genomes has been challenging for a number of reasons. Firstly, mCH is absent or present at very low levels in most somatic cells5, and when present is co-localised with extensive genome-wide mCG. Secondly, only a fraction of non-CG sites are methylated within a cell population. As a result, mCH had in the past been dismissed as an artefact of incomplete conversion of unmethylated cytosines during bisulfite treatment of DNA. Developments in whole-genome bisulfite sequencing (WGBS) have resulted in more efficient, cost-effective technologies, enabling robust mCH detection in many tissues6. These studies have revealed non-random mCH patterning in various cell types, with especially high levels in ESCs and neurons. For example, mCH has been reproducibly detected in human primary myocytes, oocytes, skeletal muscle, adipose tissue, adrenal glands, aorta, heart, psoas and gastric systems by high-coverage methylome analysis5,7. Whilst the distribution of CH methylation in these systems correlates with tissue-specific functions, further research is required to understand the biological implications of mCH in these tissues. In comparison, neurons1 and ESCs8 contain much higher levels of mCH. Pluripotent cells, for example, harbour an abundance of non-CG methylation, which can be detected in ESCs8,9, somatic nuclear transfer stem cells10, iPSCs11, and germline cells and tissue12,13. Approximately a quarter of the identified methylcytosines in human H1 ESCs are in the non-CG context, constituting a significant fraction of methylcytosines in the human genome. Differentiation of these lines results in a loss of mCH that is re-established in iPSCs produced from differentiated cells8,14. High levels of mCH are also observed in female germ cells, oocytes, polar bodies12, and pre-mitotic prospermatogonia13,15. 
Sperm contain no observable levels of mCH owing to passive removal of mCH through cell division within pre-mitotic prospermatogonia. After fertilisation, the zygote therefore contains non-CG methylation at maternal alleles only. This too is lost by passive mechanisms, and its role in the germline and newly formed zygote remains unclear16,17.

V.1.2 Writers of mCH

DNMT3a, DNMT3b, and DNMT3L are involved in the deposition of CH methylation; however, the presence of all three is not necessary for mCH deposition and maintenance in every cell type or tissue18,19. The implication of DNMT3a in the deposition and maintenance of CH methylation stems from a variety of experiments. Exogenous expression of DNMT3a in Drosophila introduces mCG and mCA, demonstrating the ability of the enzyme to methylate CA sites20, albeit at a tenfold lower frequency than CG sites21. Depletion of mCH following knockdown of DNMT3a in multiple mCH-containing cell types, including ESCs22,23, oocytes24, and neurons25,26, supports its role in both deposition and maintenance of mCH in these tissues. Assessment of DNA binding sites in human and mouse brain reveals enrichment of DNMT3a at sites marked by mCH, indicating that mCH is reliant on DNMT3a in vivo1,25. The accumulation of mCH is accompanied by an increase in the abundance of Dnmt3a in the postnatal developing mouse brain1. Conditional knockout of Dnmt3a in the mouse brain inhibits the accumulation of mCA and causes a decrease in mCT and mCC26. DNMT3b has also been implicated in CH methylation, primarily at CT and CA sites19. Its expression in ESCs is critical for the deposition and maintenance of CH methylation. Knockdown of DNMT3a in ESCs does not abolish CH methylation completely, and DNMT3b is able to partially compensate for DNMT3a, revealing a level of interdependency or redundancy for this function within ESCs. This is supported by the knockdown of DNMT3b and subsequent deletion of DNMT3a, which led to further reductions in mCH27. DNMT3b knockouts reduce mCH in ESCs far more dramatically than DNMT3a knockouts, implying that DNMT3b is likely the dominant enzyme responsible for mCH in human ESCs22. Unlike DNMT3a, DNMT3b is expressed at low levels in the brain, indicating that it is unlikely to play a major role in mCH deposition there1. Despite being catalytically inactive, DNMT3L is essential for mCH in ESCs, where its deletion results in the almost complete disappearance of mCH13,23. This observation likely arises from disruption of the DNMT3a/b-DNMT3L complexes, but this remains to be confirmed. In mouse prospermatogonia, however, disruption of the ATRX-Dnmt3-Dnmt3L (ADD) domain results in global mCH reduction15. Therefore, DNMT3L likely relies on associations with DNMT3a/b to coordinate cell type-specific mCH patterning throughout development.

The mechanisms by which DNMTs are recruited to and methylate CH sites within the genome also remain unknown. It has been proposed that histone modifications may attract or exclude DNMTs from genomic loci, as illustrated by the aforementioned example in mouse prospermatogonia involving the ADD domain and histone H315. Another proposition implicates nucleosome positioning in mCH patterning by limiting the accessibility of DNA to DNMTs. This may explain the ~180 bp periodicity of mCH observed in tissues containing CH methylation25. The differences in CH methylation patterns and bias for different mCH motifs in the brain compared to ESCs have also been investigated. DNMT3B is highly expressed within ESCs and binds selectively to gene bodies through a PWWP domain that recognises H3K36me3, and is able to methylate CAG sites in these actively transcribed genes28. The absence of mCH in expressed genes within neurons is attributed to the low abundance of DNMT3B in the adult brain, whilst the bias for mCAC is attributed to DNMT3A rather than DNMT3B29.
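The distinction between CG and CH sites, and between trinucleotide motifs such as CAG and CAC, comes down to the bases immediately 3' of the cytosine. A minimal sketch of that classification is given below; the function name `cytosine_context` is hypothetical, the input is assumed to be a forward-strand sequence with a 0-based position, and reverse-strand cytosines (which would need the reverse complement) are deliberately not handled.

```python
def cytosine_context(seq, pos):
    """Return (class, trinucleotide) for the cytosine at 0-based index pos.

    class is "CG" if the next base is G, "CH" if it is A, C or T, and
    "unknown" if the next base is missing or ambiguous (e.g. N).
    """
    seq = seq.upper()
    if seq[pos] != "C":
        raise ValueError("position does not hold a cytosine")

    # Up to three bases starting at the cytosine; shorter near the 3' end.
    tri = seq[pos:pos + 3]
    next_base = tri[1:2]

    if next_base == "G":
        context_class = "CG"
    elif next_base in ("A", "C", "T"):
        context_class = "CH"
    else:
        context_class = "unknown"
    return context_class, tri


# The cytosine at index 1 of "ACACGT" sits in a CAC motif (a CH site),
# while the cytosine at index 3 is followed by G (a CG site).
first = cytosine_context("ACACGT", 1)
second = cytosine_context("ACACGT", 3)
```

Returning the trinucleotide alongside the CG/CH class keeps motif-level tallies, such as CAG versus CAC counts, a matter of counting the second element.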

V.1.3 Erasure of mCH

Currently, there is a lack of evidence implicating active DNA demethylation processes in the removal of CH methylation. The current consensus therefore regards passive DNA demethylation as the only mechanism by which mCH is removed. For example, there is no evidence implicating TET or DNA repair enzymes in the active removal of CH methylation30. Tet-assisted bisulfite sequencing (TAB-seq) detected negligible 5hmC at non-CG sites in ESCs and brain cells31–33. Two explanations have been proposed for this observation. The first proposes that 5hmC at non-CG sites undergoes rapid turnover because TET activity is much more efficient there than at CG sites31. This explanation is supported by findings that TET1 has higher activity at methylated non-CG sites, but it has not been directly demonstrated in vivo31. The second proposes that CH removal is instead attributed to passive demethylation over successive cell divisions in which mCH is not re-established by DNMT3 enzymes. In the latter case, the modulation of TET activity may rely on as-yet unidentified co-interacting proteins that abolish its catalytic activity, or on sequestration of TET proteins that prevents their activity at non-CG sites. While both proposals require experimental validation, the consensus for passive DNA demethylation is strengthened by two observations. Firstly, a maintenance enzyme, like DNMT1 for CG methylation, has not been discovered for mCH34, implying there is no replication-dependent mechanism for mCH maintenance. Consequently, continual de novo methylation by DNMT3s is needed to maintain the CH landscape in dividing cells, and conversely, cell division itself provides an inherent mechanism by which mCH may be removed. Secondly, post-mitotic cells such as neurons do not passively lose CH methylation through cell division and, lacking active CH demethylation processes, accumulate an abundance of mCH1.
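The passive-dilution argument can be made concrete with a simple worked relation. As a minimal sketch, assume (these are illustrative assumptions, not results from this thesis) that a given CH site is methylated on one strand in a fraction m_0 of cells, that no de novo methylation by DNMT3 enzymes and no active demethylation occur, and that the mark is not copied onto the newly synthesised strand at replication. Each division then passes the methylated strand to only one of the two daughter duplexes, so the methylated fraction after n divisions halves each time:

```latex
m_n = \frac{m_0}{2^{\,n}}, \qquad n = 0, 1, 2, \dots
```

Under these assumptions, five divisions would leave m_5 = m_0/32, roughly 3% of the starting level, whereas a post-mitotic neuron (n = 0) retains whatever mCH it has acquired.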


V.1.4 Distribution of mCH and evidence for its roles in coordinating biological processes

Although mCH is a prevalent mark within certain cell types and tissues, evidence of its potential functions is only just beginning to emerge. Genome-wide inspection of mCH content and patterning reveals that mCH is non-uniformly enriched within the genome. For example, 79% of the mCH content in ESCs is in the 5’-CAG-3’ sequence context, referred to as mCAG9, whereas in neurons an enrichment in 5’-CAC-3’, termed mCAC, is observed1. These sites are conserved in human and mouse brain7, show hypomethylation at protein-DNA interaction sites35, and are associated with gene repression1. Within human brain, the binding of MECP2 to mCA sites in long genes is required for gene repression26,36. Several genomic features, including gene bodies12,37, repeat elements27,38, and inactive enhancers39, are typically enriched for mCH. Active enhancers, transcription factor binding sites8, promoters9,12, and regions of the genome termed mCH deserts1, which span megabase regions within the brain, are usually devoid of mCH. CH methylation is also thought to complement the non-uniform spatial nature of CG methylation, as tissues with CH methylation display enrichment in the regions of low CG dinucleotide density that encompass the bulk of mammalian genomes40. To date, the analysis of CH methylation at defined loci has implicated mCH in a range of biological processes. Outlined below, these include transcriptional and splicing regulation, protein binding, and modulation of TEs.

Analysis of mCH patterning in numerous tissues reveals a higher abundance of mCH in gene bodies than within intergenic regions1,8,12. ESCs contain higher levels of CH methylation in intragenic than in intergenic regions owing to a PWWP domain within Dnmt3b that recognises H3K36me3 in expressed genes28. In addition, CH methylation is abundant at splice sites, and higher in exons than in introns, indicating that CH methylation might play roles in splicing via mechanisms that are yet to be elucidated9,41. However, functional evidence, for example the identification of splicing factors participating in this process, is required to confirm that this observation is not merely an artefact of sequence composition bias at these sites. The hypermethylation of TEs occurs in both CG and CH contexts. SINE repeats contain the highest levels of mCH in human ESCs, but mCH is also found at other TEs such as LINEs38. It is now understood that DNMT3a, DNMT3b, and DNMT3L are responsible for constitutive de novo CH methylation at these sequences23.

Tissues with non-CG methylation are generally mCH deficient in their promoter sequences. Some studies have speculated that this observation is a byproduct of the necessary exclusion of DNMTs by various cellular factors at unmethylated CGIs, which may simultaneously inhibit mCH deposition at these genomic elements9. However, it seems likely that CH promoter hypomethylation is a controlled cellular process, given that CH hypermethylation observed at promoters correlates negatively with transcription42,43. In addition, mCH may occlude the binding of DNA-binding domain (DBD)-containing proteins in similar mechanistic ways to mCG44, or permit the binding of specialised mCH binders to coordinate transcriptional output45. Similarly, enhancer-mediated gene expression processes may be controlled by active deposition of mCH in certain cell types. Studies in ESCs reveal that inactive enhancers are enriched for mCH whilst active enhancers, including enhancers required for the binding of embryonic transcription factors, are hypomethylated in the CH context8. Recently, MECP2 has been observed to bind to and repress enhancers within topologically associated domains enriched in mCA and mCG46. Thus far MECP2 is the only protein demonstrated to bind mCA in vivo, leaving open the possibility that other mCA readers may function similarly, or in stimulating demethylation of these regions required for cellular reprogramming, as has been observed for mCG47,48. These observations make it likely that CH methylation plays generalised and cell type-specific gene regulatory roles in tissues where it is observed, and that its deposition is unlikely to be merely a byproduct of de novo DNMT-driven deposition of mCG.

V.1.5 Non-CG methylation in the mammalian brain (Expansion from point 1.5)

Several compelling features of CH methylation within mammalian brains have attracted interest in recent years. Firstly, the mammalian brain undergoes unique epigenomic reconfiguration during neural differentiation. Pluripotent cells lose mCH as they differentiate into neural progenitor cells8. As neural progenitor cells differentiate further into neurons and glia, these cells, unlike most other somatic tissues, regain mCH. The onset of CH methylation coincides with synaptogenesis and synaptic pruning, indicative of a link between the two processes1. Further, CH methylation rises sharply as the brain matures but plateaus at maturity, again hinting at an undefined role in neural development6. The abundance of mCH within neurons surpasses mCG levels and achieves the highest recorded level of mCH within human and mouse tissues. Initially observed within the frontal cortex1,25, this pattern has also been observed in other brain regions like the mouse dentate gyrus1,25. While intriguing, a causal link between CH methylation and synaptogenesis has not been established. Secondly, analysis of genome-wide localisation patterns of mCH and mCG within neurons and glia reveals patterns that are coupled to defined biological outcomes. The neuronal genome is hypermethylated in the CH context, except for regions of active gene expression. Glia display low levels of CH methylation, but are hypermethylated at some genes involved in neuronal development. Clustering mCG and mCH levels within the brain identified a subset of genes associated with glial development and functioning that display intragenic mCH and mCG hypermethylation in neurons. Conversely, these same genes display gene body mCG and mCH hypomethylation within glia. The same study also identified higher genic mCH levels within glial populations in genes required for synaptogenesis and neural development1. The identification of these genes and the differential motif preference for mCH in ESCs and neurons implies that the epigenome undergoes coordinated changes that are restricted to specific non-CG containing loci.

The mammalian brain is an extremely complex organ consisting of many neuronal and supporting cell types that are critical for neuronal maturation processes and cognitive functions. As such, a multifaceted approach that can identify and characterise mCH patterning in brain cell type sub-populations, together with an understanding of the transcriptional consequences of mCH deposition, is needed to gain better insights into the role of the epigenome in genome regulation and neurodevelopment. Methylome analysis of excitatory and inhibitory neuronal subtypes revealed an abundance of CH methylation in both populations, but with distinct patterns and widespread differences in their genomic distribution49. These DNA methylation patterns were coupled to ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) and used to identify developmental expression of transcription factors underlying each neuronal subtype using footprinting and motif analysis, providing potential links between DNA methylation and transcriptional changes responsible for neuronal complexity49. A recently developed single-cell methylome assay has been used to identify and characterise neuronal subtypes based on their DNA methylation signature, providing an unbiased method of characterising methylomes within distinct cell populations that make up the brain50. Other efforts have also focused extensively on providing mechanistic links between mCH deposition and transcriptional output. Guo et al. showed that in vitro introduction of a non-CG methylated plasmid resulted in transcriptional repression at levels comparable to mCG, whilst reductions in intragenic mCH through Dnmt3a knockdown in the adult mouse dentate gyrus increased transcription of mCH-enriched genes but not of mCH-depleted genes25. This study provided evidence for mCH as a repressive modification and identified MECP2 as an mCH reader that may modulate transcriptional repression at mCH sites.

V.1.6 MECP2, an mCH reader critical for neural development

Characteristics of MECP2 such as its expression dynamics, correlations with mCH deposition, and its neurodevelopmental pathology suggest that its ability to bind mCH may underpin critical epigenetic processes required for normal brain development and cell function. Initially discovered as a protein with mCG binding capabilities51, MECP2 has attracted interest because of its ubiquitous expression, role in chromatin regulation, hmC recognition, abilities to control alternative splicing (although recent analysis of transcriptomic data disputes this52), and because its disruption results in Rett syndrome, a genetic brain disorder2,53,54. First, among its multifaceted functions, MECP2 has been demonstrated to bind to and regulate mCH within the human and mouse brain26,36. This interaction was initially proposed on the basis of ChIP-seq analysis, in which Mecp2 peaks were observed in hypermethylated mCG and mCH contexts within mature mouse brain55. Biochemical analysis demonstrated a direct MECP2-mCA interaction, confirmed by two separate studies using recombinantly expressed MECP2 and mCA DNA probes25,26. In particular, Gabel et al. concluded that MECP2 selectively bound mCA with similar - albeit slightly lower - affinity when compared to mCG and that this affinity decreased for mCT and mCC, a finding that implied MECP2 may selectively modulate mCA in vivo. Second, MECP2 expression increases substantially during neurodevelopment, mirroring the rise in mCH, reaching an abundance level similar to that of histones45,56. Third, several mutations within the X-linked MECP2 gene result in Rett syndrome3,57. Notably, the observation that the appearance of mCH in the brain is consistent with a rise in MECP2 expression and the delayed onset of Rett syndrome led to the hypothesis that aberrant readout of mCH by MECP2 may contribute to Rett syndrome pathology.

To test this hypothesis, Chen et al. examined the binding profile of Mecp2 in adult mouse hypothalamus and confirmed that Mecp2 preferentially binds genes with high mCH. The disruption of physiological Mecp2 levels led to misregulation of these mCH hypermethylated genes, whilst the transcription of genes with low mCH remained unchanged36. In both subsets of genes, mCG levels were relatively consistent, indicating that the primary determinant for Mecp2 regulation was mCH. Importantly, this study established that genic mCH readout by Mecp2 is a critical developmental process that occurs in the mature brain and that may explain the delayed onset of Rett syndrome36. Another study observed that long genes (>100kb) rather than short genes were prone to misregulation in the cortices of patients with Rett syndrome26. The investigators also found that Mecp2 bound these long genes, which are CA hypermethylated, and that deletion of Dnmt3a or disruption of Mecp2 led to the misregulation of these genes. This study, therefore, established that genes bound by Mecp2 were longer than the genome-wide average and contained high levels of mCH. This apparent bias for long genes has since been challenged and attributed to microarray and RNA-seq biases arising from technical variations such as PCR amplification and intra-sample variations in large datasets58. Within these datasets, smaller fold-changes in transcription after polymerase chain reaction amplification lead to an overestimation of long gene expression levels due to baseline variability between datasets. When baseline variability was estimated from a set of randomised controls, Mecp2 misregulation of long genes in RTT mouse model datasets disappeared58. In addition, the deletion of Dnmt3a within the Gabel et al. study resulted in severe brain malformations and postnatal death and affected both mCG and mCH levels, preventing an attempt to uncouple mCH from mCG in order to better characterise its role26. The challenge in future studies will be in perturbing Dnmt3a levels at defined time points in ways that are not fatal and that do not confound the experiment by affecting both mCG and mCH levels. These experiments will be vital not only to understanding the role of MECP2 in mCH readout and epigenetic mechanisms underlying Rett syndrome, but also in identifying and characterising the molecular dynamics of future mCH binding candidates and their roles in neurodevelopment.

V.1.7 Towards the need for novel mCH reader identification and characterisation

The identification of readers with mCG binding characteristics has been invaluable to understanding and exploring the functional output of mCG; however, this is largely lacking for mCH. Some studies have investigated the affinity of some members of the MBD family for mCH. For example, one study observed that MECP2, but not MBD2, selectively recognised mCA with high affinity using isothermal titration calorimetry-based binding assays. However, the MBD domain of MBD2 that they analysed was isolated from chicken, rather than human or mouse, and the probe used was based on a known MECP2 binding site59. Another more recent analysis employing isothermal titration calorimetry-based binding assays found that human MECP2, MBD2, and MBD4 preferred mCAC over mCAH, but concluded that the presence of a complementary TG nucleotide was the primary determinant of MECP2 and MBD2 binding, despite observing a lower affinity for the CA probe than for the mCA probe60. Apart from MECP2, there is no other protein implicated in the specific recognition of mCH. The accumulation of mCA through neurodevelopment, concomitant with MECP2 expression, sparked interest in its role as an mCH binder. The observation that MECP2 did bind mCH, coupled with follow-up in vivo characterisation, has improved our understanding of the potential roles of mCH in genome regulation and some of the potential molecular mechanisms underpinning Rett syndrome. However, a systematic screening approach to identify novel mCH candidates is currently lacking. The identification and characterisation of mCH readers may help decipher the molecular dynamics underlying the mammalian neuro-epigenome and facilitate future understanding of brain development and neural function. Therefore, in my thesis research a DNA pull-down approach coupled to mass spectrometry was employed in human and mouse brains to identify potential mCA binders. In addition, the top mCH candidate, MBD2, was selected for recombinant expression and purification followed by in vitro binding validation assays to confirm selective affinity for mCA. The significance of this study, along with pitfalls and limitations in the identification and characterisation of mCH readers, is discussed, in addition to perspectives and possibilities for future avenues of research.

Results

V.2.1 Global assessment of mCA/CA datasets

As described in the outline of this thesis in Chapter 1 (Section 1.6), a DNA pull-down utilising biotinylated mCA and CA DNA probes was performed to identify readers of mCA in human and mouse brain (Figure 3.1). The quality of the mCA/CA datasets was assessed by inspection of missing data and by visualisation of normalised mean protein intensity against log2 mCA/CA (Figure 5.1). Human and mouse mCA/CA datasets had ~6,000 missing peptide observations distributed uniformly across replicates, indicating no sample bias and minimal errors in sample handling. Replicate quality was also assessed by generation of a heatmap (Figure 5.5) displaying protein intensity for mCA and CA replicates (proteins pre-filtered at log2FC ≥ 0.5 and p-value ≤ 0.05 for DE, and at a significance threshold of percent observed ≥ 50% and p-value ≤ 0.1 for P/A). An MA plot (Figure 5.1B) was generated to check that each dataset is centred around zero, reflecting the background interactions that would be expected for most of the proteins. The human mCA/CA dataset was distributed below zero whilst the mouse dataset was distributed above zero, indicating that the human dataset was slightly skewed towards the CA replicates and the mouse dataset towards the mCA replicates. The bar plot does not indicate that this is due to technical errors, given the uniform distribution of missing values. Therefore, the observations in the MA plot are most likely attributable to biological rather than technical factors.
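The centring check described above can be illustrated with a minimal sketch. It assumes linear-scale replicate intensities per protein, and the function names (`ma_values`, `median_m`) are hypothetical; it is not the ProteoMM code used in the thesis, only an illustration of how the two axes of an MA plot are commonly derived.

```python
import math
import statistics

def ma_values(mca_intensities, ca_intensities):
    """Return (A, M) for one protein from linear-scale replicate intensities.

    M is the log2 fold change (mCA over CA); A is the mean log2 intensity.
    Averaging replicates before taking logs is an assumption of this sketch.
    """
    mean_mca = sum(mca_intensities) / len(mca_intensities)
    mean_ca = sum(ca_intensities) / len(ca_intensities)
    m = math.log2(mean_mca) - math.log2(mean_ca)
    a = 0.5 * (math.log2(mean_mca) + math.log2(mean_ca))
    return a, m

def median_m(proteins):
    """Median M over all proteins; a value near zero suggests a centred dataset."""
    return statistics.median(ma_values(mca, ca)[1] for mca, ca in proteins)

# Illustrative values only: one protein with a four-fold mCA enrichment.
a, m = ma_values([8.0, 8.0, 8.0], [2.0, 2.0, 2.0])
print(a, m)  # 2.0 2.0
```

A median M below zero would correspond to a dataset shifted towards the CA replicates, and a median above zero to one shifted towards the mCA replicates.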

Figure 5.1: Human (left) and mouse (right) mCA/CA datasets. A) Bar plot of missing values within each dataset. B) MA plot of normalised data. Each dot represents a protein intensity post normalisation, plotted against its corresponding fold-change (log2FC) value.

V.2.2 Identification of novel mCA and CA readers in human and mouse

Differential expression (DE) analysis of proteins within the “combined” human and mouse mCA/CA datasets identified a list of mCA binding candidates. These candidates represent high confidence readers given their enrichment in both datasets. For easier visualisation, the log2FC values for each protein in human and mouse were plotted on a scatterplot (Figure 5.2). Proteins with a positive log2FC value represent proteins with a higher affinity for the methylated mCA probe, whilst those with affinity for CA (unmethylated CA) have a negative log2FC value. Proteins meeting significance thresholds have log2FC ≥ 1.2 and p-value ≤ 0.05. For illustrative purposes, scatterplots are split into DBD-containing proteins (top) and protein interactors (bottom). Labelled proteins within each scatterplot display significant enrichment in combined mCA and CA (top right or bottom left quadrants respectively) or by species (middle vertical or horizontal quadrants). For example, the upper right and lower left quadrants contain high confidence proteins (observed in both human and mouse) enriched for binding mCA or CA respectively. Within the plots that contain DBD-containing proteins, proteins depicted in bold represent proteins that have previously published affinity for mCG or repulsion from methylated CG binding, respectively. In addition, the supplementary section of this chapter contains a table of all identified DBD-containing proteins within the scatterplots and a reference associated with the studies that previously characterised their binding to methylated DNA. As mentioned in Chapter 4, it is important to note that the DNA binding-domain proteins represent proteins with DNA binding or nucleic acid binding capabilities, but are not necessarily DNA binding proteins.
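The quadrant logic of the scatterplot can be sketched as follows. This is a minimal illustration, assuming that each protein carries a per-species log2FC and p-value, and that CA enrichment uses the mirrored cut-off (log2FC ≤ -1.2); the function names and the "discordant" label are hypothetical and not taken from the thesis pipeline.

```python
LOG2FC_CUTOFF = 1.2   # threshold stated in the text
P_CUTOFF = 0.05       # threshold stated in the text

def call(log2fc, p_value):
    """Classify one species-level result as 'mCA', 'CA' or 'ns' (not significant).

    Using -LOG2FC_CUTOFF for CA enrichment is an assumption of this sketch.
    """
    if p_value <= P_CUTOFF and log2fc >= LOG2FC_CUTOFF:
        return "mCA"
    if p_value <= P_CUTOFF and log2fc <= -LOG2FC_CUTOFF:
        return "CA"
    return "ns"

def classify(human, mouse):
    """Combine human and mouse (log2FC, p-value) pairs into a scatterplot category."""
    h, m = call(*human), call(*mouse)
    if h == "ns" and m == "ns":
        return "not significant"
    if h == m:
        return f"combined {h}"          # top right (mCA) or bottom left (CA)
    if m == "ns":
        return f"combined-human {h}"    # significant in human only
    if h == "ns":
        return f"combined-mouse {m}"    # significant in mouse only
    return "discordant"                 # opposite preferences in the two species

# Illustrative values only, not measurements from the pull-down.
print(classify((1.8, 0.01), (1.5, 0.02)))   # combined mCA
print(classify((1.6, 0.03), (0.7, 0.04)))   # combined-human mCA
```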

Two transcription factor candidates were enriched for binding to mCA in human and mouse combined: MBD2, a well-studied mCG binder with repressive functions, and POU3F2, more commonly known as BRN2, which is involved in neuronal differentiation. Proteins that were detected in both species and analysed by DE analysis, but that exhibit significantly enriched binding in only one species, are referred to as “combined-human” or “combined-mouse”, respectively. These mCA candidates were the NuRD complex members MTA2 and MTA3 in combined-human (meaning they were detected in human and mouse DE datasets, but exceeded significance thresholds in human only), whilst Kat6b met significance in combined-mouse (Figure 5.2). These proteins were enriched to varying degrees in the corresponding species but not highly enough to meet the cut-off of log2(mCA/CA) ≥ 1.2 in both. The pull-down also identified a list of candidate mCA readers that have not been implicated in mCG recognition and may constitute a class of DBD-containing proteins with specialised mCA binding capabilities. Within the combined analysis (Figure 5.2) these include PH5FA and ZFP128 for combined-human, and Ybx3, Srek1, Ppie, and Dazap in combined-mouse.

Candidate reader proteins identified as having a strong affinity for CA are also displayed in Figure 5.2. The terms “CA readers” or “CA binders” will be used to describe proteins that were enriched for the CA probe condition. These proteins likely harbour affinity for the specific CA probe sequence and/or are strongly excluded by DNA methylation. CA readers that have been documented to bind CG in external experiments are indicated in bold in the scatterplot. The DE analysis identified three DBD-containing proteins that have been documented as CG binders previously and display affinity for CA probes within human and mouse. These proteins, displayed in Figure 5.2, are USF1, USF2, and MNT. Among the many enriched CA binders within the human-limited dataset, of note are those that display repulsion to mCG in previously published experiments61, including ATF2, TFEB, and CLOCK, among others. Also within this list are proteins that were repelled by mCG in the mCG/CG DE and P/A datasets (Chapter 4) but not identified in previous studies. These include APEX1 and MYPOP (combined CG, Figure 4.5), HIVEP2 (mouse-limited CG and human P/A, Figures 4.7 and 4.8 respectively), LRPPRC (human P/A CG, Figure 4.7) and MXI1. Interestingly, FOXO1 and FOXO3 were statistically enriched for CA but exhibited a strong preference for mCG in the human and mouse mCG/CG DNA pull-downs (FOXO1 in combined mCG, Figure 4.5, and FOXO3 in mouse-limited and human P/A, Figures 4.6 and 4.7). Lastly, PBX3, RECQL5, and YTHDC2 exhibited an affinity for CA and were statistically enriched for mCG (combined-human mCG, Figure 4.5).

Figure 5.2: Common mCA and CA readers in human and mouse for DBD-containing proteins (top) and protein interactors (bottom). Proteins displayed meet the significance threshold log2(mCA/CA) ≥ 1.2 and p-value ≤ 0.05. Proteins in bold represent DNA binding domain-containing proteins with experimentally validated affinity for mCG that were enriched for mCA or those enriched for CA with an experimentally validated affinity for CG.

V.2.3 Identification of mCA readers in human or mouse-limited datasets

The human-limited DE and P/A analyses identified 5 and 2 potential mCA readers (Figure 5.3) respectively, whilst the mouse-limited DE and P/A analyses identified 7 novel mCA readers each (Figure 5.4). Again, as with the combined scatterplots, DBD-containing proteins (left) and protein interactors (right) were split for illustrative purposes. Labelled proteins represent those meeting significance criteria in each dataset. For the DE analyses in human-limited and mouse-limited, a significance threshold of log2FC ≥ 1.2 and p-value ≤ 0.05 was adopted. For the P/A analyses in human and mouse, a significance threshold of percent observed ≥ 50% and p-value ≤ 0.1 was set. Percent observed corresponds to the number of observations (peptide intensities belonging to a protein of interest) within a particular context divided by all potential observations in that context. An observation refers to the detection and recording of a peptide intensity value in a replicate. Each peptide has the potential to be observed 3 times, given 3 replicates were performed. For example, if “protein A” has 4 peptides detected, and the mass spectrometer records 4 peptide intensity values across all 3 mCA replicates, protein A has 12 potential observations (3 mCA replicates multiplied by 4 peptides) but a percent observed value of 33% (4 observations over 12 potential observations). Within the DBD-containing protein plots, proteins depicted in bold represent DBD-containing proteins with previously published mCG or CG binding affinity, for mCA and CA enriched proteins respectively. In addition, the supplementary section of this chapter contains all identified DBD-containing proteins within the volcano plots and a reference associated with the studies that previously characterised their binding. The majority of proteins identified in the species-limited datasets have not been implicated in mCG binding, except for Zhx3 and Meis3, identified in mouse mCA DE and P/A analyses, respectively. There are, however, some promising candidates enriched for mCA including Pou3f4, identified in human-limited mCA. Other promising candidates whose role in mCA regulation will be described in the discussion include MYRF, identified in human-limited and mouse P/A, and Klf7, enriched in mouse P/A mCA.
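The percent observed calculation can be written out as a short sketch. The data layout (one list of per-replicate intensities per peptide, with `None` marking a missing value) and the function name are assumptions made for illustration; only the definition of the ratio and the 50% threshold come from the text.

```python
def percent_observed(peptide_intensities):
    """Percentage of recorded intensities among all potential observations.

    peptide_intensities maps each peptide of a protein to a list holding one
    entry per replicate, with None where no intensity was recorded.
    """
    potential = sum(len(values) for values in peptide_intensities.values())
    observed = sum(
        1
        for values in peptide_intensities.values()
        for value in values
        if value is not None
    )
    return 100.0 * observed / potential

# "Protein A": 4 peptides x 3 mCA replicates = 12 potential observations,
# of which 4 were recorded (intensity values are illustrative only).
protein_a = {
    "peptide_1": [1.2e6, None, None],
    "peptide_2": [None, 8.4e5, None],
    "peptide_3": [None, None, 3.1e5],
    "peptide_4": [2.0e6, None, None],
}
print(round(percent_observed(protein_a), 1))  # 33.3
```

Under the P/A threshold given above, this protein would fall short of the 50% percent observed requirement in the mCA context.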

Numerous CA binders were also observed within the human and mouse-limited DE and P/A analyses. A few of these enriched CA probe binding DBD-containing proteins have been implicated in the binding of the CG probe, including MAX, BHLHE40, MLX, SREBF1, and ZNF385D (Figure 5.3). In addition to already verified DBD-containing proteins repelled by mC, OLIG3 and TFAP4 were identified as CG probe binders within the mCG/CG DNA pull-down, and were also repelled in the mCA context, constituting novel DBD-containing proteins repelled by mC in the CG and CA contexts. BHLHE40 was the only DBD-containing protein enriched for the CA probes with published affinity for the mCG probes (Figure 4.7). Possible reasons for this are discussed in Chapter 4 (Section 4.3.2). Within mouse, Mtf1 and Mitf were identified within the mCG/CG DNA pull-down experiments as binders of CG (Figure 4.8) and may constitute proteins that are repelled by mCG. Conversely, Rfx1 (a published mCG binder62) and Wrn exhibited an affinity for CA (Figure 4.8) and bound to the mCG probe within the mCG/CG DNA pull-down (Chapter 4, Figure 4.5 for Rfx1 and Figure 4.8 for Wrn).

Figure 5.3: Human-limited (top) and human P/A (bottom) mCA and CA readers for DBD-containing proteins (left) and protein interactors (right). For human-limited, proteins displayed meet the significance threshold log2(mCA/CA) ≥ 1.2 and p-value ≤ 0.05. For human P/A, proteins displayed meet the significance threshold of percent observed ≥ 50% and p-value ≤ 0.1. Proteins in bold represent DNA binding domain-containing proteins with experimentally validated affinity for mCG that were enriched for mCA or those enriched for CA with an experimentally validated affinity for CG.

Figure 5.4: Mouse-limited (top) and Mouse P/A (bottom) mCA and CA readers for DBD-containing proteins (left) and protein interactors (right). For mouse-limited, proteins displayed meet significance threshold log2(mCA/CA) ≥ 1.2 and p-value ≤ 0.05. For mouse P/A, proteins displayed meet the significance threshold of percent observed ≥ 50% and p-value ≤ 0.1. Proteins in bold represent DNA binding domain-containing proteins with experimentally validated affinity for mCG that were enriched for mCA or those enriched for CA with an experimentally validated affinity for CG.

V.2.4 Coupling protein interactors to identified DBD-containing proteins in human and mouse brain

To assess potential conservation of mCA readers in human and mouse brain, a heatmap was generated using normalised and imputed protein intensity values observed in CA and mCA pull-down replicates from human and mouse DNA pull-downs (Figure 5.5). When hierarchical clustering using a Spearman correlation is implemented, replicates cluster by binding context (mCA or CA) rather than by species. This indicates that the characteristics of proteins with affinity for methylation overcome interspecies variation between human and mouse. Proteins were then uploaded to STRING, a protein-protein interaction network database, to identify potential functional protein interaction networks present for mCA and CA. Surprisingly, NuRD complex members (proteins within Fig. S2 and in red, Figure 5.5) show enrichment for mCA, representing a potentially novel regulatory mechanism underlying mCA readout in human and mouse brain. These proteins are MBD2, GATAD2B, RBBP4, MTA1, MTA2, and MTA3. It is well established that NuRD associates with MBD263–65 and that this complex is responsible for the repression of loci harbouring CG methylation. Given this observation, MBD2 was chosen for biochemical experiments to validate its affinity for mCA.
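As an illustration of the similarity measure behind such clustering, the sketch below computes a Spearman correlation between two intensity profiles and converts it to a distance. The choice of 1 - rho as the distance, the helper names, and the example values are assumptions; the thesis does not state the linkage method or the software used for clustering.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

def spearman_distance(x, y):
    """Distance for clustering; 0 means identical rank order (assumed form)."""
    return 1.0 - spearman(x, y)

# Two hypothetical replicate profiles with the same rank order of proteins.
human_mca = [5.1, 7.3, 2.2, 9.8]
mouse_mca = [4.0, 6.5, 1.9, 12.4]
print(round(spearman_distance(human_mca, mouse_mca), 6))  # 0.0
```

Because the measure depends only on rank order, profiles from different species can cluster together when proteins are enriched in the same relative order, even if absolute intensities differ.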

Figure 5.5: Hierarchical clustering of normalised and imputed proteins within the combined CA analysis DE and P/A datasets. Protein intensities from DE and P/A datasets were merged. Missing values were assigned zero. Each replicate was normalised by subtracting the row mean and dividing by the row standard deviation. Proteins displayed were filtered for log2FC ≥ 0.5 and p-value ≤ 0.05 for DE, and for a significance threshold of percent observed ≥ 50% and p-value ≤ 0.1 for P/A. Clustering identified 2 main subsets pertaining to proteins repelled by mCA in human and mouse (S1) and proteins with affinity for mCA in human and mouse (S2). Numbers within S1 and S2 correspond to sections of the heatmap belonging to each protein for easier visualisation of the name and location of each protein on the heatmap.
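The row normalisation described in the caption amounts to a per-protein z-score. A minimal sketch follows, assuming one row of replicate intensities per protein with `None` for missing values, and assuming the sample standard deviation; the function name is hypothetical and the thesis does not specify which standard deviation estimator was used.

```python
import statistics

def row_zscore(row):
    """Z-score one protein's intensities across replicates.

    Missing values (None) are set to zero first, as stated in the caption.
    The sample standard deviation is an assumption of this sketch.
    """
    filled = [0.0 if value is None else float(value) for value in row]
    mean = statistics.mean(filled)
    sd = statistics.stdev(filled)
    if sd == 0:
        # A constant row carries no contrast; return zeros rather than divide by zero.
        return [0.0] * len(filled)
    return [(value - mean) / sd for value in filled]

print(row_zscore([1, 2, 3]))  # [-1.0, 0.0, 1.0]
```

Scaling each row this way places proteins of very different abundance on a common scale, so that the heatmap colours reflect relative enrichment between mCA and CA replicates rather than absolute intensity.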

V.2.5 Recombinant expression, purification and validation of MBD2 as an mCA reader

While MECP2 has been confirmed to bind mCA, whether MBD2 does so remains uncertain. As mentioned in Section 5.1.7, using isothermal titration calorimetry, the MBD domain of MBD2 from chicken was unable to bind mCA, whilst a subsequent study using human MBDs reported that MBD2 preferred mCAC over other forms of mCH59,60. To investigate whether MBD2 binds mCA, an electrophoretic mobility shift assay (EMSA) was performed on recombinantly expressed and purified protein. MECP2 was also expressed and purified, and used as a positive control, given that it has already been established to selectively bind mCA in vitro2,26,36. To maximise the chances of successful expression and purification in E. coli, the methyl binding domains (MBDs) of MBD2 and MECP2 were selected, codon optimised, and successfully cloned into a pETM11 vector backbone containing a 5’ 6X HIS (histidine) tag commonly used in bacterial expression and purification. The sizes of each are as follows: MBD-MECP2 (Figure 5.6A) was 157 amino acids and migrates at 17.5 kilodaltons (kDa), or at 14kDa with the 6XHIS tag cleaved; MBD-MBD2 (Figure 5.6B) was 192 amino acids (with 6XHIS) and migrates at 20kDa. The HIS tag was unable to be cleaved for this protein product. Both proteins were successfully expressed and purified using a mixture of bead-based nickel and HPLC column purifications. The MBD-MBD2 band (lane 6) was then subjected to gel excision and mass spectrometry analysis to ensure it was the correct protein, and not a contaminant.

Figure 5.6: Expression and purification of (A) MBD-MECP2 and (B) MBD-MBD2. Protein gel stained pictures (Top) depict various stages of protein purification pipeline. Arrows indicate bands corresponding to each protein at expression and nickel purification stage, HPLC purification stage and reverse IMAC stage. Note reverse IMAC (cleavage of HIS tag) was possible for MBD-MECP2 only.

Panel A, expression and purification of MBD-MeCP2 (14kDa); gel sections: expression and nickel purification, HPLC, cleavage of 6XHIS. Lanes: 1) Uninduced fraction; 2) Insoluble fraction; 3) Soluble fraction MBD-MeCP2.6XHIS; 4) Nickel purification MBD-MeCP2.6XHIS; 5) HPLC fraction tube 32 MBD-MeCP2.6XHIS; 6) TEV digestion waste; 7) Reverse-IMAC TEV digested fraction (MBD-MeCP2, ~14kDa). MW: molecular weight marker (15, 20 and 25kDa bands indicated).

Panel B, expression and purification of MBD-MBD2.6XHIS (20kDa); gel sections: expression and nickel purification, HPLC. Lanes: 1) Insoluble fraction; 2) Soluble fraction: MBD-MBD2.6XHIS; 3) Purified fraction: MBD-MBD2.6XHIS; 4) Heparin fraction 28; 5) Heparin fraction 36; 6) Fraction 42 MBD-MBD2.6XHIS (~20kDa). MW: molecular weight marker (15, 20 and 25kDa bands indicated).

Heparin HPLC elution profiles for MBD-MeCP2 and MBD-MBD2 (bottom); axes: column volume (ml) and tube number against intensity of absorbance (mAU). The 215nm (peptide absorption λ) peak at tube 32 is the MBD-MECP2.6XHIS fraction; the 215nm peak at tube 42 is the MBD-MBD2.6XHIS fraction.

To establish whether MBD2 selectively binds mCA and whether this binding is conferred via its MBD domain, an EMSA was performed. Briefly, this entailed incubation of the purified MBD domains with fluorescently labelled mCG and mCA DNA probes, followed by separation on a non-denaturing protein gel. To ascertain how specific the interaction was, an unlabelled (non-fluorescent) competitor probe at 100X the concentration of the fluorescent probe was added. To quantitatively assess the results of the EMSA, ImageJ was used to generate band intensity measurements of the recorded shifts in the EMSA replicates (Figures 5.7 and 5.8). As expected, both MBD-MECP2 and MBD-MBD2 bound the mCG DNA probes. For MBD-MECP2, the presence of the 100X unlabelled CG competitor probe (unmethylated) reduced the shift for mCG slightly, but not significantly, unlike the unlabelled mCG competitor (methylated). The addition of unlabelled CG competitor probe (unmethylated) did not disrupt the MBD-MBD2-mCG fluorescent complex, but the addition of unlabelled mCG competitor probe (methylated) significantly reduced the fluorescent shift, indicating a specific affinity for mCG.

When determining the specificity for mCA, 3 different unlabelled competitor probes at 100X the concentration of fluorescently labelled mCA were used. These were probes containing mCA, CA, and TA repeats. The mCA probe was specifically modelled on the dominant form of mCH within the brain, mCAC1. Its complementary unmethylated counterpart was incorporated as a control probe. The decision to include a TA probe was based on the structural similarity of thymine to methyl-cytosine60. The inclusion of this probe therefore tested whether both MBD domains bound mCA selectively rather than because mCA resembles a TA sequence. As expected, MECP2 bound mCA with high affinity, especially when competing with unlabelled TA. Whilst the unlabelled CA probe reduced the intensity of the shift, the reduction was not as pronounced as that observed for unlabelled mCA. In comparison, the unlabelled CA and TA probes did not reduce the MBD-MBD2-mCA band at 1X protein concentration but produced a decreased shift at 1.5X. Importantly, at both protein concentrations, the greatest reduction in band intensity was observed with the unlabelled mCA probe. These findings independently validate the results of the DNA pull-down in human and mouse, and provide the first evidence implicating MBD2 in the specific recognition of mCA by biochemical analysis.

Figure 5.7: Binding characterisation of MECP2 for mCG and mCA. All probes used within the analysis are displayed at the top of the figure (A). Gel shift analysis (B) utilising methylated CG and CA fluorescent probes in the presence of purified MBD domains from MECP2 at 1X and 2X concentration (320nM and 640nM respectively). An unlabelled competitor at 100X the concentration of the fluorescent probe was used at these concentrations to demonstrate a specific affinity for each MBD to the methylated probes. Gel shift intensities were analysed using ImageJ. The intensities of each shift were normalised by dividing all shift intensity values by the highest shift intensity (in all 3 replicates) and expressing this value as a percentage (C). A two-sample, two-tailed t-test was then implemented between each protein and probe only condition against protein, probe and unlabelled competitor for each protein concentration condition. P-values within statistical significance ranges are summarised with asterisks.

Figure 5.8: Binding characterisation of MBD2 for mCG and mCA. All probes used within the analysis are displayed at the top of the figure (A). Gel shift analysis (B) utilising methylated CG and CA fluorescent probes in the presence of the purified MBD domain from MBD2 at 1X and 1.5X concentration (1100 nM and 1650 nM, respectively). An unlabeled competitor at 100X the concentration of the fluorescent probe was used at these concentrations to demonstrate a specific affinity of the MBD for the methylated probes. Gel shift intensities were analysed using ImageJ. The intensity of each shift was normalised by dividing all shift intensity values by the highest shift intensity (across all 3 replicates) and expressing this value as a percentage (C). A two-sample, two-tailed t-test was then performed for each protein concentration, comparing the protein and probe only condition against the protein, probe and unlabeled competitor condition. P-values within statistical significance ranges are summarised with asterisks.
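The quantification described in the captions of Figures 5.7 and 5.8 (normalisation to the highest shift intensity, followed by a two-sample, two-tailed t-test) can be illustrated with a minimal Python sketch. It is not the analysis code used in this thesis. The intensity values and condition names are invented, and the equal-variance (Student) form of the t-test is an assumption, since the captions do not state which variant was applied.

```python
import math
from statistics import mean, variance


def normalise_to_max(replicates):
    """Express every shift intensity as a percentage of the highest
    intensity found across all replicates and conditions."""
    top = max(v for rep in replicates for v in rep.values())
    return [{cond: 100.0 * v / top for cond, v in rep.items()} for rep in replicates]


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sample_t_test(a, b, steps=20000):
    """Two-sample, two-tailed t-test assuming equal variances (assumption).
    Returns (t, p); p comes from Simpson integration of the t density
    between 0 and |t|, so steps must be even."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    h = abs(t) / steps
    area = t_pdf(0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return t, max(0.0, 1.0 - 2.0 * area)


# Invented ImageJ band intensities for three replicates (illustration only).
raw = [
    {"protein+probe": 950.0, "protein+probe+competitor": 120.0},
    {"protein+probe": 1000.0, "protein+probe+competitor": 150.0},
    {"protein+probe": 900.0, "protein+probe+competitor": 100.0},
]
norm = normalise_to_max(raw)
bound = [rep["protein+probe"] for rep in norm]
competed = [rep["protein+probe+competitor"] for rep in norm]
t_stat, p_value = two_sample_t_test(bound, competed)
print(f"t = {t_stat:.2f}, p = {p_value:.2g}")
```

With three replicates per condition the test has only four degrees of freedom, so the p-value is sensitive to the spread between replicates.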

V.2.6 Gene ontology analyses of mCA and CA readers in human and mouse brain

Functional annotations for enriched proteins in the human and mouse mCA/CA DNA pull-downs were assigned using DAVID, a GO analysis database66. The REVIGO tool was utilised to visualise the DAVID output, condensing redundant GO terms and representing relationships between the remaining terms in a two-dimensional ‘semantic space’, in which similar terms are grouped closer together. Term significance is based on the p-values generated by DAVID. Summation of GO processes revealed many cellular processes that are enriched for the mCA readers detected in the human and mouse brain. These include regulation of transcription, chromatin remodelling, RNA binding, protein binding, protein transport, and cell motility, whilst NuRD was highly enriched for human mCA (Figure 5.9). Both the abundance and the highly diverse enrichment of unrelated terms in CA (Figure 5.10) are driven by DBD-containing proteins and protein interactors with affinity for the CA probe. These processes pertain to aging, neuronal regulation, differentiation, histone activity, protein activity, and DNA binding.
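To illustrate how a term-level enrichment p-value of this kind arises, the sketch below computes a plain one-sided hypergeometric tail probability. This is an assumption-laden illustration rather than DAVID's own calculation: DAVID reports a modified Fisher's exact p-value (the EASE score), and the counts used here are invented.

```python
from math import comb


def hypergeom_enrichment_p(hits, list_size, term_size, background_size):
    """P(X >= hits) when list_size proteins are drawn without replacement
    from a background of background_size proteins, of which term_size
    carry the GO term of interest."""
    upper = min(list_size, term_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, list_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Invented counts: 5 of 50 enriched proteins carry a term that annotates
# 100 of 10,000 background proteins.
p = hypergeom_enrichment_p(hits=5, list_size=50, term_size=100, background_size=10000)
print(f"enrichment p-value = {p:.3g}")
```

A small p-value indicates that the term occurs among the enriched proteins more often than expected from its frequency in the background.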

Figure 5.9: GO analysis of proteins enriched for mCA in human (left) and mouse (right). Biological process, cellular component and molecular function GO terms are plotted by REVIGO, which places terms within an arbitrary ‘semantic space’ based on their similarity.

Figure 5.10: GO analysis of proteins enriched for CA in human (left) and mouse (right). Biological process, cellular component and molecular function GO terms are plotted by REVIGO, which places terms within an arbitrary ‘semantic space’ based on their similarity.

V.2.7 Summary of mC and C binders by classification into distinct protein regulatory complexes

The significantly enriched proteins from the mCG/CG and mCA/CA DNA pull-downs in human and mouse were further inspected by hierarchical clustering of proteins within each dataset using normalised protein intensities, displayed in Figure 5.11. Three main subsets of proteins can be identified based on general patterns of binding observed in the heatmap. Proteins from each subset were then analysed using STRING and filtered for proteins with known, experimentally validated interactions in order to identify regulatory protein networks enriched for each subset. Subset 1 (S1) contains proteins repelled by mC and is enriched for protein complexes responsible for the transcriptional regulation of unmethylated DNA. These include members of the PcG, COMPASS, and MLL1 complexes in addition to some histone methyltransferase complex members (ASH2L, WDR5, RNF2, RBBP5, DPY30, and the methyltransferase KMT2A). In contrast, S2, containing proteins with affinity for mCG in human and mouse, was enriched for NuRD, SWI/SNF and Sin3, in line with previously identified protein regulatory networks involved in the suppression of genes that are methylated in the CG context. Surprisingly, as discussed in section 4.3.4, the clustering also revealed enrichment of nuclear exosome complex members, possibly implicating RNA regulatory processes in mCG binding. Within S2 is a smaller subset, S3, that exhibits both mCG and mCA binding enrichment. Members of this subset belong to the NuRD complex (comprising CHD4, MTA1, MTA2 and MTA3), of which MBD2 is the principal DNA binder and recruiter. Results from the DNA pull-downs therefore reveal some level of conservation in the proteins that read and regulate mCA, mCG, and the unmethylated DNA substrates.
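The idea behind grouping proteins by their binding pattern can be sketched with a small agglomerative clustering example in Python. The distance metric and linkage used for Figure 5.11 are not stated here, so Euclidean distance and average linkage are assumptions, and the protein names and intensity profiles (ordered CG, mCG, CA, mCA) are invented.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two intensity profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles, n_clusters):
    """Start with one cluster per protein and repeatedly merge the two
    clusters with the smallest mean pairwise distance (average linkage)
    until n_clusters remain."""
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(
                    euclidean(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                ) / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Invented z-scored intensities in the order (CG, mCG, CA, mCA).
profiles = {
    "repelled_1": [1.0, -1.0, 0.9, -0.9],
    "repelled_2": [0.9, -0.8, 1.0, -1.1],
    "mC_reader_1": [-1.0, 1.0, -0.8, 0.8],
    "mC_reader_2": [-0.9, 1.1, -0.7, 0.6],
}
subsets = average_linkage(profiles, n_clusters=2)
print(subsets)
```

Proteins with similar profiles across the four probes end up in the same subset, which is why members of one complex tend to cluster together when they are pulled down as a unit.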

Figure 5.11: Heatmap generated by hierarchical clustering summarising significantly enriched proteins and their interaction networks for mCA, mCG, CA, and CG DE and P/A datasets in human and mouse. Protein intensities from DE and P/A datasets were merged. Missing values were assigned zero. Each row was normalised by subtracting the row mean and dividing by the row standard deviation. Proteins portrayed were filtered for log2FC ≥ 1.2 and p-value ≤ 0.05 for DE, and for percent observed ≥ 50% and p-value ≤ 0.1 for P/A. Hierarchical clustering generated 3 observable subsets: proteins largely repelled by mC in human and mouse (S1), proteins with an affinity for mCG in human and mouse (S2), and a subset of proteins within S2 (S3) with affinity for mC in human and mouse.
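The row normalisation and significance filters stated in the caption of Figure 5.11 can be written out as a short Python sketch. The function names are hypothetical, the use of the sample (n - 1) standard deviation is an assumption, and the example row is invented; only the thresholds are taken from the caption.

```python
from statistics import mean, stdev


def zscore_row(intensities):
    """Assign zero to missing values (None), then subtract the row mean
    and divide by the row standard deviation (sample form, an assumption)."""
    filled = [0.0 if v is None else float(v) for v in intensities]
    mu, sd = mean(filled), stdev(filled)
    if sd == 0:
        return [0.0] * len(filled)  # flat row: no variation to scale
    return [(v - mu) / sd for v in filled]


def passes_de(log2fc, p_value):
    """DE filter from the caption: log2FC >= 1.2 and p-value <= 0.05."""
    return log2fc >= 1.2 and p_value <= 0.05


def passes_pa(percent_observed, p_value):
    """P/A filter from the caption: percent observed >= 50% and p-value <= 0.1."""
    return percent_observed >= 50.0 and p_value <= 0.1


# Invented row with one missing value.
row = zscore_row([2.0, None, 4.0])
print(row)
```

Scaling each row in this way lets proteins of very different abundance be compared by binding pattern rather than by absolute intensity.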

V.3 Discussion

These methyl-sensitive DNA pull-downs provide a comprehensive repository of potential mCA candidates through identification of DBD-containing proteins that potentially bind to mCA, in addition to some protein interactors that may form complexes responsible for the readout of mCA in the mammalian brain. Biochemical validation demonstrated that MBD2 has a high affinity for mCA, thereby verifying the results obtained from the pull-down and establishing MBD2 as a prime candidate for future characterisation. However, the technical limitations of this study and the future challenges in characterising the binding of the mCA readers identified here must be considered when interpreting the results.

V.3.1 Identification of CA readers in human and mouse brain

The mCA/CA DNA pull-down constitutes the first methylation-sensitive DNA affinity screen for proteins from the mammalian brain that successfully identified a range of DBD-containing proteins and their interactors with affinity for mCA, or repelled by mCA. Some of the DBD-containing proteins that were statistically enriched for the CA probes are known to be repelled by CG methylation in published DNA-binding experiments. Results from the mCA/CA DNA pull-down therefore establish a set of DBD-containing proteins that are largely repelled by mC regardless of methylation context within the human and mouse brain, and some proteins with affinity for CAC. Some of these factors were enriched for CA in both species, for example, USF1, USF2, and MNT within the combined CA analysis (Figure 5.2). USF1 and USF2 were documented as methyl-minus transcription factors (repelled by mCG) with a strong preference for CAC within their binding motifs in the methyl-sensitive SELEX experiment61. Other proteins enriched for CAC within the SELEX experiment include ARNTL, TFE3, TFEB (Figure 5.2), BHLHE40, MAX, MLX, and SREBF1 (Figure 5.3). These proteins illustrate that while a subset of proteins is enriched for CA primarily because of strong exclusion by mCA, others represent bona fide transcription factors with affinity for CAC nucleotides. Comparisons with the mCG/CG DNA pull-downs yielded a subset of DBD-containing proteins repelled by mC that have not been characterised before. Some examples include APEX1, MXI1 (within combined CA, Figure 5.2), OLIG3 (much like its counterpart OLIG2), TFAP4 (within human-limited, Figure 5.3), and Mtf1 and Mitf (within mouse-limited CA, Figure 5.4). These DBD-containing proteins bound to CA and CG in respective DNA pull-downs, constituting novel CA and CG binders that are largely repelled by mC. Whilst the primary aim of this experiment was to identify mCA binders, the pull-down was also valuable in identifying potential sensitivity of DBD-containing proteins. These DBD-containing proteins and their interactors may provide information for future investigations of transcriptional control processes in the brain, or of the general biological function of proteins that bind to unmethylated CA repeats and are sensitive to methylation state. Lastly, some DBD-containing proteins with an affinity for the CA probes are known mCG binders or were enriched for mCG within the CG DNA pull-down. These included the FOX members FOXO1 and FOXO3, PBX3, ZNF683, and Rfx1, among others. More work is required to establish whether the in vivo binding dynamics of these DBD-containing proteins resemble results from the DNA pull-down, and to determine the biological relevance of their affinity for CA or exclusion by mCA.

V.3.2 Identification of mCA readers in human and mouse brain

The DNA pull-down successfully identified candidate mCA readers in human and mouse brain, in line with the primary objective of this project. Discussed in detail below are the DBD-containing proteins observed in human and/or mouse that were enriched for mCA, and their likelihood of participating in mCA binding based on current literature. The high confidence DBD-containing proteins significantly enriched for human and mouse mCA were POU3F2 and MBD2 (Figure 5.2). POU3F2 belongs to the POU-III class of neural transcription factors and is involved in cortical neural migration67, layer protection68, and neurogenesis69. Deletion of Pou3f2 results in loss of specific neuronal lineages within the hypothalamus, and subsequent loss of the posterior pituitary gland70,71. POU3F2 is a high confidence mCA reader for three reasons. First, POU3F2 was enriched in both human and mouse mCA/CA DNA pull-downs. Second, POU3F2 has previously been demonstrated to bind mCG61. Last, it has a critical role in neurogenesis. Forced expression of Pou3f2 in combination with Myt1l converts fibroblasts into neurons; however, the mechanisms underlying the binding of these two transcription factors and their roles in promoting neurogenesis remain unknown72. POU3F2 has also been implicated in regulating gene expression together with SOX proteins. For example, Sox2 and Pou3f2 work synergistically to regulate Nestin expression73. Analysis of POU binding sites reveals that the domain binds to the 8 bp sequence 5’-ATGCAAAT-3’74,75, but it has also been observed to bind to and regulate many other genes through sequences that differ from this motif76,77. The SELEX experiment61 observed POU3F2 binding to many sequences. Some of these sequences contain mCG and others contain no CG sequences, highlighting a broad binding profile. An important consideration of this analysis is that some of these sequences also contained TA dinucleotides, which bear a structural similarity to mCA and could potentially explain why POU3F2 was enriched for the mCA probe. This will be an important consideration if an affinity for mCA is to be confirmed by biochemical experiments.

Genome-wide analysis of POU3F2 binding in melanoma cells revealed many distinct POU3F2 binding sites, suggesting that its binding within the cell mirrors the versatility observed in vitro78. One aspect underlying the structural binding plasticity of POU proteins is their ability to homodimerise, or to heterodimerise with other proteins such as MYT1L and SOX79. Interestingly, MYT1L was also enriched for mCA in human and mouse (Figure 5.5) but did not meet the significance thresholds set for the scatterplot (Figure 5.2). Sox21 was similarly enriched for mCA, and has been suggested to bind adjacent to Pou3f2 within the Hes5 promoter and repress transcription, thereby promoting hippocampal neurogenesis in adults80. Whether mCA recognition by POU3F2 is dependent upon heterodimerisation with MYT1L or SOX21, or whether each DBD-containing protein binds mCA by mutually exclusive processes, remains unknown. Individual SOX and POU factors bind near-identical DNA sequences, yet each protein regulates a unique subset of genes81. The selective partnership of POU and SOX proteins, termed the ‘partner code’, is reliant upon spatial and temporal expression of these proteins to cooperatively modulate genes in developmentally relevant settings81,82. Future biochemical validation and co-immunoprecipitation experiments are required to confirm the binding affinity of POU3F2 for mCA and to determine if this binding is dependent on complexing with proteins like MYT1L or SOX21. In parallel, analysis of the genome-wide binding dynamics of POU3F2 within the brain is required to understand the transcriptional targets of POU3F2 and ascertain whether mCA sites are occupied by this TF.

MBD2 was the second high confidence mCA binding candidate identified in human and mouse mCA. While its roles in neurogenesis are not as well defined as those of POU3F2, the role that MBD2 has in the readout of mCG and its downstream effects on transcription have been extensively studied, especially its potential for the recruitment of NuRD, a nucleosome remodelling complex (Figure 4.2). Among the many constituents of NuRD are MTA proteins that may act as scaffolds for other proteins and exert their effects as part of larger transcriptional complexes or by modulating chromatin directly83. MBD2 was the only NuRD complex member enriched for mCA in the human and mouse sets that met the stringent cut-off thresholds set. MTA2 and MTA3 were highly enriched in human and mouse but were determined to be significant only in human-limited (Figure 5.2). MTA1, RBBP4, and GATAD2B (also members of the NuRD complex) displayed affinity for mCA, but did not meet the significance thresholds set. Hierarchical clustering of proteins in the human and mouse pull-down illustrates their enrichment (Figure 5.5, subset S2) and reveals that all of these NuRD complex members cluster together, indicating that they exhibit highly similar binding behaviour in these experiments, most likely driven by being pulled down within the same protein complex. MBD2 has been established as the principal DNA binder within NuRD, and together with the MTA proteins, recruits other NuRD complex constituents65. GATAD2A and GATAD2B form exclusive NuRD complexes, the latter dominating within the brain due to elevated GATAD2B expression levels within brain tissue84.

Myelin regulatory factor (MYRF) is a transcriptional activator enriched in the human-limited mCA and mouse P/A mCA/CA datasets (Figures 5.3 and 5.4, respectively). Its enrichment in both datasets, especially within P/A, suggests that this protein has a strong affinity for mCA or is repelled by CA. Studies in mice have mostly implicated Myrf in myelination events and oligodendrocyte differentiation. For example, genetic studies reveal that constitutive Myrf-deficient mice exhibit severe myelination defects85, whilst Myrf deletion in mature oligodendrocytes affects oligodendroglial identity and results in failure of CNS integrity and maintenance of myelination86. Molecular characterisation of Myrf within mice reveals a genome-wide preference for regulatory regions in oligodendrocyte-specific genes87. ChIP-seq motif analysis reveals a preference for the binding of 5’-C[A/T]GGCA[C/G]-3’ sequences present in its target genes. Higher specificity DNA-binding is conferred by homo-trimerisation of Myrf88. Nevertheless, the molecular readout of mCH in oligodendrocytes by Myrf is unlikely, for two reasons based on our current understanding of mCH and Myrf DNA-binding. Firstly, glial cells contain much less mCH than neurons, but do harbour mCH at a select set of genes required for neuronal functions, presumably to silence these genes that are not required in glia. Genome-wide binding analysis of Myrf has observed that this protein binds to glial and not neuronal-specific genes, making it unlikely to mediate mCH readout at these loci. Secondly, the mCH that accumulates at glial-specific genes through development is inversely correlated with gene expression. This is at odds with the transcriptional activation observed for MYRF. Within C. elegans, myrf-1 is essential for synaptic rewiring and plasticity, regulating transcription in response to extracellular cues by translocating from the endoplasmic reticulum into the nucleus89. Myrf in mice is tethered to the endoplasmic reticulum, and undergoes cleavage and translocation to the nucleus where it functions as a DNA binding protein. However, the roles of Myrf in regulating synaptic plasticity in mammals are unexplored.
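As an illustration of how the Myrf consensus quoted above, 5’-C[A/T]GGCA[C/G]-3’, could be located in a DNA sequence, the following Python sketch translates the motif into a regular expression and scans both strands. The helper names and the test sequence are invented for the example; this is not the ChIP-seq motif analysis referred to in the text.

```python
import re

# The consensus from the text written as a regular expression.
MYRF_MOTIF = "C[AT]GGCA[CG]"

# Complementing the brackets as well keeps character classes intact
# when the pattern is reversed.
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "[": "]", "]": "["}


def reverse_complement_pattern(pattern):
    """Reverse-complement a motif made of bases and [..] character classes."""
    return "".join(_COMPLEMENT[ch] for ch in reversed(pattern))


def find_motif(sequence, pattern=MYRF_MOTIF):
    """Return sorted (start, strand, match) tuples for motif hits on both
    strands; the lookahead allows overlapping hits to be reported."""
    seq = sequence.upper()
    hits = []
    for strand, pat in (("+", pattern), ("-", reverse_complement_pattern(pattern))):
        for m in re.finditer(f"(?=({pat}))", seq):
            hits.append((m.start(), strand, m.group(1)))
    return sorted(hits)


# Invented sequence carrying one site on each strand.
hits = find_motif("TTCAGGCACTTGTGCCTGAA")
print(hits)
```

Positions are zero-based and refer to the forward strand, so a hit on the minus strand is reported at the forward-strand coordinate of its reverse complement.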

Several DBD-containing proteins were significantly enriched within the combined mCA analysis (Figure 5.2). These include a largely uncharacterised zinc finger protein, ZFP128, observed in combined-human mCA; the splicing factors PHF5A and Srek1, enriched within combined-human and combined-mouse, respectively; the mRNA binders Ppie and Dazap, enriched in combined-mouse mCA; and Kat6b, a histone acetyltransferase with proposed roles in brain development. The last protein, Ybx3, was significantly enriched in combined-mouse mCA. Interestingly, its closely related family member YBX1, with which it shares 46% amino acid sequence similarity, was also enriched for mCA, albeit mildly (Figure 5.5). Both proteins belong to the Cold Shock Domain (CSD) family and possess RNA binding capabilities that are utilised within splicing and translational processes in the cytoplasm. Additionally, CSD proteins may undergo post-translational modifications stimulating translocation to the nucleus, where each member modulates transcription through an additional DBD90. Much of what is known about YBX1/3 concerns their roles in immune response and stress. The biological functions of CSD proteins are diverse and range from transcriptional regulation, splicing, and translation to the orchestration of exosomal RNA content91. Ybx1 has been observed to play critical roles in neural stem cell processes92. In line with this, Ybx1 knockouts in mice are not viable, and exhibit severe brain defects92. Ybx3 knockouts, on the other hand, are viable and exhibit normal birthweight and organ development93. Analysis of the expression dynamics of each protein suggests both CSD proteins play important roles in brain development. Both show near ubiquitous expression in radial glia, neuroblasts, and neurons. Ybx3 is expressed in neuroglial cells and is believed to regulate the release of neuropeptides such as vasopressin and oxytocin within the hypothalamus94. In addition, YBX1 and YBX3 play numerous roles in brain disorders like brain tumours92, epilepsy95, major depression, and Alzheimer's disease96. The characterisation of YBX1/3 may not be straightforward but will yield some interesting insights into the roles of each CSD protein within brain development. Some possible scenarios, based on current knowledge of CSD proteins, are discussed below. Potential mCA-dependent transcriptional regulation may occur through direct binding of YBX1/3 to mCA via their DNA-binding domain97,98. Whilst this is likely, CSD proteins also interact with a plethora of proteins, in which case YBX1/3 may bind indirectly to the mCA probe through interactions with another protein with affinity for mCA and modulate transcription within a larger protein complex. Another major role for CSD proteins is their regulation of RNA-related processes90. Future protein interaction studies may be employed to identify complexes or protein regulatory networks within the brain that contain other DBD-containing proteins that may be the primary binders, or to identify proteins specific to known CSD processes like RNA regulation, thereby establishing potential links between mCA readout and RNA processes like splicing. Protein interaction studies may therefore prove crucial in the characterisation of each CSD protein and help disentangle molecular mechanisms through identification of co-eluting proteins. For example, the interactors identified may coincide with RNA regulatory factors already found to be enriched for mCA within these DNA pull-down experiments. STRING analysis revealed numerous previously identified co-eluting proteins that have been associated with either YBX1, YBX3 or both. These include nuclear exosome and spliceosomal complex members EXOSC2/9 and SKIV2L2 as well as RNA processing members like EBNA1BP2, PHF5A, and SNRPD3, among many others (Figure 5.2 and Figure 5.5). Alternatively, many studies have focused on the role CSD proteins play within kinase-mediated cellular cascades99,100. Alongside the enriched DBD-containing proteins, the mCA/CA DNA pull-downs enriched for numerous proteins with affinity for mCA that are involved in various intracellular signalling pathways, like MAP2K1 (Figure 5.5), CAMKK2, SRPK2 (Figure 5.3), and Jakmip1 (Figure 5.4). A cellular mechanism involving the translocation of YBX1/3 into the nucleus and the modulation of genes harbouring mCA downstream of an extracellular cascade would be interesting and unsurprising, given the crucial role intracellular signalling plays within brain plasticity101. However, independent experimental validation of the affinity of these DNA binding proteins for mCA would first need to be conducted.

Numerous other DBD-containing proteins were highly enriched for mCA in the human-limited and mouse-limited datasets. The human-limited DE analysis identified the hormone inducible repressor SPEN as a potential mCA binder. Members of the SPEN protein family contain RNA recognition and protein interaction motifs and are implicated in diverse biological processes from embryogenesis to ageing, through the regulation of various signalling pathways102. Notably, the SHARP and SPOC domains within SPEN bind to the corepressors SMRT, NCOR103, and CtBP104. Most interestingly, SPEN has been demonstrated to associate with histone deacetylases and co-immunoprecipitate with 5 NuRD complex members, including HDAC1/2, MTA2, RBBP4, and MBD3105. While the molecular roles of SPEN within the brain remain largely unknown, the potential regulatory roles of SPEN in mCA recognition are made more plausible by findings from other studies. Firstly, Spen is ubiquitously expressed in the postnatal brain but is concentrated in layers containing postmitotic neurons105, whilst conditional knockouts cause severe hypoplasia in the postnatal brain106. Secondly, despite its roles in RNA and protein recognition, Spen has been identified as an mCG binder in a SILAC-based proteomics screen107. Furthermore, a direct and specific association with mCG was confirmed by EMSA within the same study107. These observations establish SPEN as a highly expressed protein within post-mitotic neurons, which contain high levels of mCA. In addition, the co-immunoprecipitation of SPEN with NuRD complex members, most notably the members also identified in this screen for mCA (Figure 5.2 and Figure 5.5), may corroborate a mechanism by which Spen binds to mCA sites to recruit NuRD members. SPEN is therefore an interesting candidate mCA reader that requires further characterisation to understand its role within post-mitotic neurons, and whether this has any biological consequences for neuronal development through the readout of mCA. 
In the future, co-immunoprecipitation studies together with ChIP-seq analysis could be undertaken to unravel the potential role of SPEN in the readout of mCA within the mammalian brain. Such experiments may uncover whether NuRD is recruited by SPEN, or if SPEN is part of a larger protein complex in which MBD2 is the principal binder being recruited to mCA sites.

A few proteins of note that were significantly enriched for mCA within the mouse-limited dataset (Figure 5.4) include Klf7, which plays crucial roles in neuronal morphogenesis and survival108, Ssbp2, which is involved in genome maintenance109, and Kdm1b, a lysine demethylase critical to the establishment of maternal genomic imprinting110. In addition to POU3F2, discussed above, another protein belonging to the POU family of neural transcription factors also exhibited enrichment for mCA. Pou3f4, enriched within the mouse-limited mCA DE dataset (Figure 5.4), is located on the X chromosome and its disruption causes X-linked deafness111. Studies in rats revealed that Pou3f4 is expressed during embryonic development within the brain, neural tube, and otic capsule at E15.5 and E17.5112. Mutations in mice result in similar phenotypes to humans and exhibit loss of the otic mesenchyme and shortening of the cochlea. As with POU3F2, Pou3f4 was enriched for many motifs within the methyl-sensitive SELEX experiments61, including motifs that contained mCG, CA and TA sequences in the primary binding site. Biochemical validation will need to address binding to mCA and control for these SELEX sequences to ensure any mCA binding is specific. As discussed above, the selective DNA-binding of POU members may be reliant upon heterodimerisation with other proteins, and DNA binding selectivity of POU factors may arise from cooperative DNA binding events, for example, cooperativity with the SOX family81,82. Co-immunoprecipitation experiments will, therefore, be crucial in deciphering the Pou3f4 protein interactions, in addition to biochemical and in vivo binding characterisation experiments. Potential co-interacting candidates captured within this screen include Pou3f2 and Sox21, discussed above. In addition, Sox13 was significantly enriched in the mouse P/A mCA set (Figure 5.4). Sox13 is a DBD-containing protein involved in the regulation of embryonic development and in the determination of cell fate113. It is expressed in the developing CNS, including the neural tube and developing brain114, and is upregulated in the cortical plate, where it likely participates in differentiation of a specific subset of neurons115. Apart from its temporal and spatial expression dynamics, its affinity for DNA remains uncharacterised, and its genome-wide binding dynamics within the brain are unknown. The identification of candidate SOX and POU proteins within the human and mouse brain has provided an initial discovery that future studies can build upon, including whether an interaction between Pou3f4 and Sox21, Sox13 or Pou3f2 is required for the selective binding and readout of mCA.

In addition to the proteins already discussed, several zinc finger proteins were enriched in the combined mCA and the human or mouse-limited datasets (Figure 5.2). These include ZFP128, enriched in human and mouse but deemed significant in human only, ZNF384 and ZNF775 in human-limited, ZFR2 in human P/A (Figure 5.3), Jazf1 and Zhx3 in mouse-limited, and MYNN, a zinc finger and BTB/POZ protein (Figure 5.4). Mammalian two-hybrid and co-purification assays indicate that ZFP128 may act as a suppressor and modulate BMP and TGF-beta signal transduction with SMAD proteins116,117. Zhx3 is a known transcriptional repressor and has been identified as an mCG binder in a previous mCG reader screen118, but has no published association with mCA. All other zinc fingers remain largely uncharacterised. Further work is needed to identify the expression patterns within the brain of each enriched zinc finger protein and to characterise their biological roles.

Given the novelty of mCA, to date only MECP2 has been verified as an mCA binder. Therefore, all other proteins detected within the mCA proteomics screen require substantial characterisation that includes, but is not limited to, biochemical validation, co-immunoprecipitation, and ChIP-seq experiments in order to characterise their roles in mCA readout. These experiments require time and extensive effort that was not within the scope of this project. Therefore, a decision was made to validate the affinity of one mCA reader detected within the study. MBD2 was chosen based on several observations. Firstly, MBD2 was one of only two DNA binders significantly enriched in both mouse and human mCA DNA pull-downs (Figure 5.2). Secondly, MBD2 belongs to a protein family known for its mCG binding capabilities in mammals and its mC binding activity in plants119, whilst the MBD family member MECP2 has recently been discovered to bind mCA in vitro and within the mammalian brain2,26. Lastly, the pull-down revealed enrichment of repressive co-interacting proteins belonging to the NuRD complex, offering a potentially interesting regulatory mechanism by which loci marked by mCA are bound by MBD2, resulting in transcriptional repression. This study therefore undertook the first steps in characterising MBD2’s binding potential for mCA by confirming a direct, specific affinity of MBD2 for mCA.

V.3.3 Expression and purification of MBD proteins

In order to assess the binding affinity of MBD2 for mCA, MBD2 was cloned and expressed in a bacterial system. Bacterial expression of MBD2 was conducted in parallel with that of MECP2, which was used as a positive control given that it has already been established to bind to mCA in vitro. Whilst not optimal for mammalian protein expression, recombinant expression in bacteria was chosen because it is cost-effective and convenient. A caveat to bacterial expression is an increased probability of unsuccessful expression of a eukaryotic protein. This stems from an inability of bacteria to express or appropriately fold eukaryotic protein, often resulting in non-functional or insoluble protein products, especially with larger proteins. These problems arose despite the pre-emptive choice of the Rosetta BL21(DE3) E. coli strain. This strain allows high-efficiency protein expression upon IPTG induction, under the control of a T7 promoter, and is designed to enhance eukaryotic protein expression through incorporation of tRNAs commonly lacking in prokaryotes. Initial attempts at expression and purification of the full-length MBD proteins were unsuccessful due to poor protein stability and improper folding. Proteins were then N-terminally tagged with maltose-binding protein, which is used to improve expression and protein solubility. However, this did not improve the expression or stability of either MBD protein. To overcome these problems, only a select portion of MECP2 and MBD2 was cloned and expressed. The rationale behind the selection of a fragment for expression was that the MBD was most likely involved in the recognition of mCA, as this domain is universally responsible for mCG recognition in all mammalian MBD proteins that bind mCG. Therefore, subsequent attempts were aimed at expressing and purifying a truncated portion of each MBD protein harbouring only the MBD domain. To further increase the probability of successful expression and purification, the cloned sequence of each MBD fragment was codon optimised for bacterial expression. Partial purification of each MBD protein was achieved by nickel-based bead purification utilising a cloned His-tag at the C-terminus of each protein. The second phase of purification ensured a more thorough purification and simultaneous elimination of any protein-bound nucleic acid by performing HPLC column-based heparin purification. The MBD domains of MECP2 and MBD2 were successfully expressed and purified using this approach (Figure 5.6).

V.3.4 Biochemical validation of MBD2 as an mCA reader

Subsequent to expression and purification of each protein, MBD-MECP2 and MBD-MBD2 were subjected to biochemical validation experiments to determine if MBD2 was capable of recognising mCA with high specificity, comparable to results already published for MECP2. An EMSA was performed in triplicate with fluorescently labelled mCG and mCA probes to demonstrate an affinity of each protein for each methylation context. To demonstrate specificity for mCG and mCA, unlabeled competitor DNA at 100X the concentration of the fluorescent probe was used in order to ascertain whether the addition of these probes outcompetes the interaction of each protein with mCA and mCG. Band intensity measurements were obtained to quantitatively assess the affinity of MBD-MECP2 and MBD-MBD2 for each methylated context (Figures 5.7 and 5.8).

Prior to the mCA validation experiments, mCG validation experiments were performed to confirm protein functionality, thereby excluding factors that could alter or nullify each protein’s ability to bind DNA. Firstly, the selection of just the MBD domains entailed the risk of affecting the secondary structure of each protein, leading to downstream DNA-binding problems. To minimise this risk, an MBD fragment of MECP2 that had already been purified and tested was used [120]. To my knowledge, there was no similar sequence available for human MBD2, so a sequence that encoded the MBD and some flanking regions was cloned. Secondly, protein functionality may also have been affected by an inability of the bacteria to fold each protein appropriately, as was the case for the full-length proteins. However, both MBD-MECP2 and MBD-MBD2 successfully bound the mCG probes in a concentration-dependent manner, indicating that their functionality was not compromised by construct manipulation or by bacterial expression. To demonstrate specificity for mCG, unlabelled competitor CG and mCG probes were added to separate reaction mixtures. The results demonstrate that both proteins bound mCG specifically, as competitor DNA at 100X the concentration of fluorescent probe significantly reduced the shift only when the competitor was methylated. Quantitative analysis of band intensity mirrors this observation. Interestingly, the addition of unlabelled CG competitor did not reduce the shift created by MBD-MBD2 as much as it reduced the shift created by MBD-MECP2, suggesting that MBD-MBD2 bound the mCG probes with higher affinity than MBD-MECP2.

Having established that each protein retained its ability to bind mCG, the MBDs were then tested for mCA binding. As expected, MBD-MECP2 bound the mCA probe, and this shift increased with protein concentration. The addition of unlabelled competitor in the CA, TA, and mCA contexts suggests that MBD-MECP2 exhibits high affinity for the mCA probe, even in the presence of highly abundant unmethylated competitor DNA. Quantitative analysis of MBD-MECP2 binding indicated that the shift of the fluorescent mCA probe was not significantly affected by TA competitor, even at 100X the concentration of fluorescent mCA probe, and was only minimally reduced by CA competitor. In contrast, and in line with the expectation for a high-affinity mCA interaction, the addition of unlabelled mCA competitor reduced the shift almost completely. Analysis of MBD-MBD2 binding supports the results of the pull-down and demonstrates a specific, high-affinity interaction between the MBD domain of MBD2 and the mCA probe. Moreover, quantitative analysis indicates that unlabelled CA and TA competitors at 100X the concentration of fluorescent probe do not dramatically reduce the shift of the fluorescent mCA probe, whereas the shift is almost eliminated when unlabelled mCA is added to the reaction mixture.

Previous studies have shown that MECP2 binds to CA DNA sequences, and this was also observed through biochemical interrogation of the MBD domain within MECP2 [60,121]. One study investigating the MBD domain of MBD2 concluded that it had no specific affinity for mCA. That study, however, used DNA probes modelled on a known MECP2 binding site and a cloned MBD fragment from chicken. Its findings conflict not only with the findings of this study but also with those of Liu et al., who investigated the binding of MBDs to mCH, mCA, and CA [60]. Liu et al. observed a higher affinity for mCA over other forms of mCH for MECP2, MBD2, and MBD4. Their probes contained only a single methylated CA site, but the complementary TG sequence was judged to be the primary determinant of DNA binding rather than the mCA site. Binding was therefore observed for both the mCA and CA probes, but the CA sequence, whether methylated or not, was deemed non-essential. The methyl group of 5-methylcytosine sits at position 5 of the pyrimidine ring, the same position as the methyl group of thymine. Based on this structural similarity, it is unsurprising that the MBD domains of MECP2 and MBD2 are capable of binding to TG [60]. To investigate whether thymine within the TG sequence of the probes used by Liu et al. was the primary determinant of binding, an additional TA probe was used within the EMSA. This probe addressed whether thymine is the essential base required for binding by assessing its effect, as a competitor, on the amount of fluorescent mCA probe bound by MBD2. The CA probe, identical to the mCA probe but without methylation, was utilised in the same way to test whether the methylation state of cytosine in the mCA context was the primary determinant of specificity for these MBD proteins.

Results from the EMSA experiments demonstrate that both MBD-MECP2 and MBD-MBD2 preferentially bind mCA despite the presence of highly concentrated CA and TA probes. The results also indicate that binding to CA is more pronounced for MBD-MECP2 than for MBD-MBD2, as competitor CA DNA caused a larger shift reduction for MBD-MECP2 than for MBD-MBD2. Despite this, both MBDs largely retain their affinity for mCA in the presence of both CA and TA probes, as neither unmethylated probe was able to outcompete mCA and interact with either MBD to an appreciable extent. The TA probe served as an important control and, together with the CA probe, demonstrates that while each MBD may bind to TA or CA probes (perhaps through recognition of the complementary TG dinucleotides), this interaction may be limited to contexts that lack surrounding mCA. However, the effect of the number and proximity of mCA sites within a probe on MECP2 and MBD2 binding needs to be investigated further. The relevance of these situations in vivo also remains a question for future work. Furthermore, the results of the EMSA indicate that MBD-MBD2 has a specific affinity for mCA that is not affected by the presence of CA or TA dinucleotides, even at highly abundant levels.

The biological consequences of mCA readout by MECP2 are beginning to be unravelled [2,46]; however, MBD2 remains largely uncharacterised in this respect. A recent study demonstrated that replacing the MBD domain of MECP2 with the MBD domain of MBD2, generating a chimeric protein called MM2, disrupted proper neuronal functioning [122]. The authors showed that MM2 retained mCG binding (albeit reduced) but was unable to bind mCAC or mCAT in vivo. These findings, obtained using gene knock-in approaches in mice, are in opposition to the EMSA findings in this thesis. It is important to note, however, that differences between the protein sequences and tertiary structures used in this thesis and in that study may underlie these observations. Firstly, the cloned MBD fragment from MBD2 used in this thesis contains an additional 81 amino acids upstream of the core MBD used by the authors of that study. Secondly, MM2 differs from MECP2 in its tertiary structure, which influences its physiological properties and may explain the reduced binding capacity of MM2 for mCG when compared to wildtype samples [122]. It could be argued that this change in tertiary structure may also explain why MM2 was unable to bind mCH. The MBD-MBD2 protein used in the EMSA in this study is similarly non-physiological, which may contribute to the contrast with the findings of Tillotson et al. 2021. Intriguingly, however, native MBD2 was significantly enriched for mCA in the human and mouse DNA pull-downs, prompting the need for further characterisation of the ability of MBD2 to specifically bind mCH sequences.

V.3.5 Capturing protein reader conservation in human and mouse brain

Integration of the mCG/CG and mCA/CA DNA pull-downs in human and mouse brain reveals a conserved subset of DNA-binding proteins and protein interactors with shared biological characteristics (Figure 5.11). Some of these members were enriched for the CG and CA DNA probes, suggesting that the corresponding protein complexes, namely PRC1 and MLL1, are potentially involved in binding to unmethylated loci similar to the probes used in this DNA pull-down. Briefly, the PRC1 complex affects gene regulation through changes to chromatin structure and, for example, is necessary for striking a balance between stem cell proliferation and differentiation through recognition of chromatin or DNA [123]. With regard to the results obtained in the DNA pull-down, confirmation experiments are required that address the propensity for PRC1 to be recruited to CA repeats and, more broadly, the transcriptional repercussions of PRC1 binding to CG and CA DNA elements in the brain. In situ hybridisation has revealed PRC1 expression in progenitor cells during neurogenesis and within mature neuronal structures. However, the PRC1 complex is composed of many members whose variations contribute to distinct biological roles [123]. This is mirrored by observations that the composition of expressed PRC1 complexes varies spatially and temporally within the developing and mature mammalian brain [124]. It is still unknown how the many combinations of core proteins and their isoforms affect gene regulation through changes to chromatin, for example through the monoubiquitination of H2A or the modulation of H3K27me3 target genes [125]. Other questions concerning what drives PRC1 recruitment also remain. Whilst some studies have demonstrated that PRC1 is recruited to histones via CBX [125,126], others have demonstrated the potential for recruitment to unmethylated CG repeats through DNA-binding proteins like KDM2B [127] (Figure 4.4). Indeed, KDM2B was enriched within the DNA pull-down for CA and CG and, given the lack of histones within the pull-down, is arguably the most likely candidate responsible for PRC1 enrichment on unmethylated DNA substrates. However, there remains a need to confirm this interaction directly, especially for the CA probes, given that KDM2B has not previously been reported to bind this DNA context. Alternatively, one cannot rule out the possibility that another DBD-containing protein within the DNA pull-down is responsible for PRC1 recruitment. Thus, many unanswered questions remain concerning the cell type-specific roles of PRC1 in the brain. Disentangling cell type-specific or locus-specific complex recruitment may prove necessary in ascertaining how combinations of PRC1 components achieve distinct biologically relevant outcomes within the developing and mature brain.

The MLL complexes are part of the larger COMPASS-like complex family and are best known for their involvement in leukaemia, but mutations in these genes also cause neurodevelopmental disorders such as Kabuki syndrome (which causes moderate intellectual disability) and Wiedemann-Steiner syndrome (which causes developmental delays) [128,129]. Like the PcG complexes, the MLL and larger COMPASS-like complexes also associate with a multitude of accessory proteins in a spatiotemporal manner. The primary role of these complex members is in global H3K4me3 deposition. However, more localised gene expression changes have been demonstrated through interactions with DNA binders like ZNF335, which modulates NPC proliferation through inhibition of genes required for neural differentiation [130,131]. As with the PcG complex observation, enrichment of COMPASS-like and MLL members within the mCG/CG and mCA/CA DNA pull-downs suggests the presence of an as yet unidentified DNA-binding protein that binds to unmethylated DNA and is responsible for the subsequent recruitment of accessory proteins within these complexes. The DNA pull-down experiment therefore provided an important initial platform for identifying the protein complexes described above that may be involved in the readout of DNA elements with CA and CG repeats in the human and mouse brain. Subsequent TAP-MS experiments are required to confirm this observation and to decipher the constituents of each complex. In parallel, the biological roles of each complex, such as the genome-wide binding of the DBD-containing proteins that recruit these complexes and their effects on transcription, may be investigated through ChIP-seq and RNA-seq approaches.

Clustered together within Figure 5.11 (S3) is a group of proteins with a shared affinity for mCG and mCA in human and mouse. Closer inspection of this subset reveals proteins belonging to the NuRD protein complex, which couples histone deacetylase and chromatin remodelling activities with transcriptional repression [132,133]. Targeting of NuRD to mCG sites in vivo has been attributed to MBD2 [64]. Given that MBD2 is also within this subset, it is highly likely that these proteins were tethered to MBD2 within the protein lysate used for the DNA pull-downs. The localisation of NuRD to sites marked by mCG, driven by MBD2, has been well studied [64,134], and explains why these members were enriched for the mCG probe. Whether this interaction exists at mCA sites in vivo and exerts effects on genes required for neurodevelopment is an exciting possibility, but remains a question for future studies. The results of the EMSA support a direct and specific interaction between MBD2 and mCA (Figure 5.8). The EMSA, however, is limited by probe selection and does not offer protein interaction information.

Most studies have focused on the characterisation of NuRD complex members within ESCs and progenitor cells [135–137]. Studies of NuRD within postmitotic tissue, including the brain, suggest that it plays crucial roles in neurodevelopmental processes [138], but these roles are poorly understood. Individual members of NuRD have been studied within the context of neuronal differentiation and maturation [139,140]. For example, CHD4 promotes early differentiation of progenitors, CHD5 facilitates neuronal migration, and CHD3 is responsible for layer specification within the cortex [141]. The same study also demonstrated that distinct NuRD complex variants are recruited to different regulatory elements essential for brain development [141]. Another study, using RNAi of NuRD target genes, demonstrated that repression of these genes is fundamental to synaptic connectivity within the mammalian brain [142]. These studies provide some evidence implicating NuRD in neurodevelopmental processes, but have not addressed important questions such as the methylation state of these loci or the mechanisms underlying NuRD recruitment. Results from the pull-down indicate that NuRD recruitment may be facilitated by mCA sites within the mammalian brain and by MBD2-based recognition of these sites. Future studies will therefore be required to ascertain whether MBD2 binds to mCA in vivo and whether the various members of the NuRD complex co-localise at these sites. Identification of these sites, coupled with genetic studies and RNA-seq approaches, may reveal a subset of mCA-driven loci that are under the control of NuRD and fundamental to healthy brain development.

V.3.6 Limitations to the techniques employed within this study

As mentioned in section 4.3.5, DNA pull-downs provide an excellent method for initial exploratory identification of DNA-binding proteins and their interactors, but have a number of inherent limitations. Careful probe design is paramount to the identification of proteins that are biologically relevant in context. Despite this, DNA pull-downs rely upon artificial DNA sequences that do not capture the underlying complexity of genomic DNA. The assay therefore enriches for a subset of proteins that is limited by the underlying nucleotide design. The design of an adequate control DNA probe is an important consideration within the DNA pull-down because DE analysis is highly reliant upon differences in peptide abundances between the two probe conditions. For example, MECP2 was not significantly enriched for mCA within the DNA pull-down. However, as discussed earlier, in the absence of mCA, MECP2 will bind to CA sequences with an appreciable affinity [60]. Therefore, the presence of MECP2 on the CA probe likely rendered the DE value non-significant. This highlights the need for biochemical verification and the utilisation of multiple control probes. The results of the EMSA confirm that MECP2 binds to mCA with high affinity, in agreement with numerous previous studies [2,59,60], by demonstrating that MECP2 prefers mCA when CA is present and binds it with a much higher affinity. The artificial nature of DNA pull-downs extends beyond probe selection and includes technical factors such as incubation and buffer conditions. In addition, the DNA pull-down cannot recapitulate the native nuclear environment, for example, due to the lack of histone modifications and nucleosomes. Results from the DNA pull-down are also influenced by the types and preservation states of the brain tissue used. Material for the human experiment was limited to post-mortem tissue available from the brain bank that had been frozen for prolonged periods, which could potentially have affected protein lysate quality. Due to the amount of protein lysate required for the pull-downs, the only feasible option for mouse was to use whole brain tissue. As such, the mouse and human DNA pull-downs were subject to differences in protein heterogeneity, reflective of their different sources. Therefore, care needs to be taken when interpreting the results of the DNA pull-down, bearing in mind that the artificial nature of the assay does not fully recapitulate the biological complexity within the cell.
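
The effect of control-probe binding on the enrichment value can be illustrated with a toy calculation. The sketch below is not the ProteoMM analysis used in this thesis: the abundances are invented, and the function name and pseudocount are hypothetical choices made for illustration. It only shows that a protein retained on both the methylated and the control probe yields a log2 ratio close to zero, whereas the same methylated-probe signal over a weakly bound control yields a large ratio.

```python
import math


def log2_enrichment(methylated, control, pseudocount=1.0):
    """log2 ratio of methylated-probe to control-probe abundance.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a protein is absent from one condition.
    """
    return math.log2((methylated + pseudocount) / (control + pseudocount))


# Invented abundances (arbitrary units), not measurements from this study.
retained_on_control = log2_enrichment(1023.0, 511.0)  # also binds the CA probe
selective_binder = log2_enrichment(1023.0, 15.0)      # barely binds the CA probe

print(f"binder retained on control: {retained_on_control:.1f}")
print(f"selective binder:           {selective_binder:.1f}")
```

Under these assumptions, a reader such as MECP2 that binds CA appreciably in the absence of mCA would resemble the first case, which is why additional control probes and biochemical verification are useful.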

The protein purification pipeline was a major limitation to the biochemical verification, and hence the subsequent characterisation, of proteins enriched for mCA in this study. The large-scale expression capabilities offered by bacterial expression systems are relatively cost-effective, making them attractive options for protein expression. However, bacterial systems are not ideal for mammalian proteins, as demonstrated by the inability of the bacterial system to successfully express full-length MECP2 and MBD2. In order to effectively confirm the binding behaviour of the mCA readers identified within this screen, a high-throughput protein expression system, such as the baculovirus or mammalian cell culture systems, is required. Cell-free protein expression systems were trialled within this study but were not suitable for large-scale protein production and downstream protein purification. Mammalian protein expression systems offer many advantages over bacterial expression systems. They produce proteins with structures highly similar to the native structure, increasing the probability of protein stability and activity. This is achieved because expression takes place within an environment that is biologically and physiologically similar to the native environment of the protein of interest. The result is a higher probability that proteins, especially complex proteins, will undergo proper folding and post-translational modification. Stable, viable protein product, produced at high levels through such a system, could enable rapid high-throughput purification, allowing many of the mCA readers identified within the DNA pull-down to be evaluated simultaneously.

Lastly, the EMSA, much like the DNA pull-down, suffers from all the caveats associated with an artificial DNA-binding assay. Therefore, while the EMSA experiments support the results of the DNA pull-down by demonstrating a direct interaction between MBD-MBD2 and the mCA probe, they are limited by the choice of probes used within the assay. In addition, the results from the EMSA provide only a slightly higher level of confidence that MBD2 may bind to and regulate mCA within the brain. More complex in vivo experiments are needed to fully characterise the role of MBD2 in mCA recognition in the brain, and these remain a task for future work. Attempts at in vivo characterisation were made within this project. A mouse embryonic stem cell line with endogenously tagged MBD family members was acquired from Prof. Dirk Schübeler with the aim of performing ChIP-seq, using the endogenous biotinylated tag, on neurons differentiated from these cells. However, there was no published protocol capable of differentiating ESCs into neurons with the mCH levels observed in mature mouse neurons. Significant time was spent trying to optimise differentiation of these ESCs into neurons with high mCH levels, but results were inconsistent and produced a mixed population of cells with very few neurons. Downstream experiments were unfeasible due to low neuron numbers, even with fluorescence-activated nuclei sorting (FANS) for neuronal nuclei. Attempts were also made to perform a methyl-cap experiment [143], in which genomic DNA from sorted NeuN nuclei from mouse brain was incubated with the MBDs used in the EMSA and pulled down using their His-tags. The products were sequenced and analysed but did not yield sufficient coverage and binding resolution to discern binding of either protein to mCA. The unsuccessful nature of these experiments prevented further characterisation of these proteins and highlighted avenues of research needed to explore the binding dynamics of mCA readers in future experiments.

V.3.7 Towards in vivo mCA reader characterisation

One way to understand the functional implications of epigenetic features such as mCH is to study the proteins that read the modification and couple its deposition to transcriptional changes. Very few studies have aimed to identify novel mCH readers, given that mCH has only relatively recently begun to be characterised. To date, MECP2 and MBD2 are the only identified and verified mCH binders. In order to identify potential mCH binders within the human and mouse brain in a systematic way, a DNA pull-down coupled to mass spectrometry was employed. The pull-down identified a list of high-confidence readers that require further characterisation, establishing this study as a platform upon which future studies can build. The selection of MBD2 as the top mCA-binding candidate was based on its enrichment for mCA in human and mouse within the DNA pull-down (Figure 5.2) and on its well-characterised affinity for mCG. In addition, hierarchical clustering revealed that proteins commonly associated with MBD2 when it is bound to mCG in vivo were also enriched in the mCA binding context, a novel observation (Figure 5.5). The MBD domain of MBD2 was recombinantly expressed and purified to confirm its affinity for mCA, thereby opening a new avenue of research into the in vivo binding dynamics of this methyl-binding protein. Future work is now needed to ascertain the fraction of genomic loci harbouring mCA that are bound by MBD2, combined with RNA-seq and gene knockout experiments to elucidate the molecular readout of mCA within the mammalian brain. In parallel, experiments that focus on the protein regulatory networks involved in the readout of mCA are needed. For example, co-immunoprecipitation experiments may be employed to verify the co-interacting proteins observed within this pull-down or to identify novel proteins recruited to mCA via binding of the DBD-containing proteins identified within the mCA reader screen. These experiments will be crucial in expanding upon the known roles of protein complexes like NuRD and investigating their involvement in the readout of mCA, and in determining whether MBD2 binds to mCA in vivo and whether this results in transcriptional repression conferred by NuRD recruitment. These same approaches may be utilised for the identification and characterisation of other protein interactors enriched for mCA within the mCA/CA DNA pull-down, revealing new regulatory protein networks crucial to the epigenomic readout of cellular processes within the brain.

The success of these studies will be reliant on overcoming many challenges associated with studying mCA in human and mouse brain, many of which prevented further characterisation of MBD2 binding and of its association with NuRD. One such challenge is the complexity of the brain, which is composed of many cell types, including many neuronal cell types. This problem has already been partially resolved through FANS, which separates brain tissue into its neuronal and non-neuronal constituents, allowing for better discrimination of mCH within the neuronal cells that possess it in abundance. Another major problem lies in the deposition of mCH. Besides mCH being present in only a subset of cell types, separating its effects from those of mCG is difficult given that both DNA methylation contexts generally co-occur in the same genomic regions. This is especially important for proteins like MBD2 that bind to mCG genome-wide. To circumvent this problem for MECP2, deep sequencing of MECP2 ChIP-seq libraries has been adopted by multiple studies to achieve the resolution necessary for mCA detection. This approach is not ideal due to increased costs and its reliance on more material (which may not be available), but future advances within the field may provide easier means by which to study protein localisation at mCA sites. For example, mice lacking mCH but not mCG may be useful in identifying mCA-driven loci occupied by MBD2, or by any other protein identified within this screen. However, this would require the precise knockdown of DNMT3a at developmentally critical timepoints to affect mCH without perturbing mCG levels to a significant extent. Obtaining the required amount of brain tissue is another challenge, especially for experiments in which a significant amount of tissue is required. This limitation affected the experimental approach of the pull-down and is a major caveat in the subsequent characterisation of the identified mCA readers. For example, the DNA pull-downs required a substantial amount of brain tissue, which prevented the pre-sorting of brain tissue that would have allowed the identification of neuron-specific mCA readers. Currently, cell culture-based systems that overcome this limitation in other tissues are inadequate because the extremely low levels of mCA found within neurons developed in cell culture are not reflective of mCA in the mature brain in either abundance or localisation. Ideally, if a cell culture-based system were available in which the distribution of mCA closely resembled that observed in neurons from the brain, a SILAC approach could be adopted on sorted cell culture-derived nuclei. Access to adequate tissue or cell culture-derived neurons was a major limitation preventing subsequent characterisation experiments and remains a challenge going forward in the elucidation of mCA reader binding dynamics. Studying the readout of mCA and how this mark contributes to neurodevelopment has therefore been challenging, but advances in these methodologies will together provide a comprehensive toolkit for the identification and downstream characterisation of mCA readers.

References

1. Lister, R. et al. Global epigenomic reconfiguration during mammalian brain development. Science 341, 1237905 (2013).
2. Kinde, B., Gabel, H. W., Gilbert, C. S., Griffith, E. C. & Greenberg, M. E. Reading the unique DNA methylation landscape of the brain: Non-CpG methylation, hydroxymethylation, and MeCP2. Proc. Natl. Acad. Sci. U. S. A. 112, 6800–6806 (2015).
3. Amir, R. E. et al. Rett syndrome is caused by mutations in X-linked MECP2, encoding methyl-CpG-binding protein 2. Nat. Genet. 23, 185–188 (1999).
4. Ramocki, M. B., Tavyev, Y. J. & Peters, S. U. The MECP2 duplication syndrome. Am. J. Med. Genet. A 152A, 1079–1088 (2010).
5. Schultz, M. D. et al. Corrigendum: Human body epigenome maps reveal noncanonical DNA methylation variation. Nature 530, 242 (2016).
6. He, Y. & Ecker, J. R. Non-CG Methylation in the Human Genome. Annu. Rev. Genomics Hum. Genet. 16, 55–77 (2015).
7. Varley, K. E. et al. Dynamic DNA methylation across diverse human cell lines and tissues. Genome Res. 23, 555–567 (2013).
8. Lister, R. et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462, 315–322 (2009).
9. Laurent, L. et al. Dynamic changes in the human methylome during differentiation. Genome Res. 20, 320–331 (2010).
10. Ma, H. et al. Abnormalities in human pluripotent cells due to reprogramming mechanisms. Nature 511, 177–183 (2014).
11. Lister, R. et al. Hotspots of aberrant epigenomic reprogramming in human induced pluripotent stem cells. Nature 471, 68–73 (2011).
12. Guo, H. et al. The DNA methylation landscape of human early embryos. Nature 511, 606–610 (2014).
13. Ichiyanagi, T., Ichiyanagi, K., Miyake, M. & Sasaki, H. Accumulation and loss of asymmetric non-CpG methylation during male germ-cell development. Nucleic Acids Res. 41, 738–745 (2013).
14. Xie, W. et al. Epigenomic analysis of multilineage differentiation of human embryonic stem cells. Cell 153, 1134–1148 (2013).
15. Vlachogiannis, G. et al. The Dnmt3L ADD Domain Controls Cytosine Methylation Establishment during Spermatogenesis. Cell Rep. 11, 990 (2015).
16. Wang, L. et al. Programming and inheritance of parental DNA methylomes in mammals. Cell 157, 979–991 (2014).
17. Gahurova, L. et al. Transcription and chromatin determinants of de novo DNA methylation timing in oocytes. Epigenetics Chromatin 10, 1–19 (2017).
18. Gowher, H. & Jeltsch, A. Enzymatic properties of recombinant Dnmt3a DNA methyltransferase from mouse: the enzyme modifies DNA in a non-processive manner and also methylates non-CpG sites. J. Mol. Biol. 309, 1201–1208 (2001).
19. Suetake, I., Miyazaki, J., Murakami, C., Takeshima, H. & Tajima, S. Distinct enzymatic properties of recombinant mouse DNA methyltransferases Dnmt3a and Dnmt3b. J. Biochem. 133, 737–744 (2003).
20. Ramsahoye, B. H. et al. Non-CpG methylation is prevalent in embryonic stem cells and may be mediated by DNA methyltransferase 3a. Proc. Natl. Acad. Sci. U. S. A. 97, 5237–5242 (2000).
21. Gowher, H. & Jeltsch, A. Molecular enzymology of the catalytic domains of the Dnmt3a and Dnmt3b DNA methyltransferases. J. Biol. Chem. 277, 20409–20414 (2002).
22. Liao, J. et al. Targeted disruption of DNMT1, DNMT3A and DNMT3B in human embryonic stem cells. Nat. Genet. 47, 469–478 (2015).
23. Arand, J. et al. In vivo control of CpG and non-CpG DNA methylation by DNA methyltransferases. PLoS Genet. 8, e1002750 (2012).
24. Shirane, K. et al. Mouse oocyte methylomes at base resolution reveal genome-wide accumulation of non-CpG methylation and role of DNA methyltransferases. PLoS Genet. 9, e1003439 (2013).
25. Guo, J. U. et al. Distribution, recognition and regulation of non-CpG methylation in the adult mammalian brain. Nat. Neurosci. 17, 215–222 (2014).
26. Gabel, H. W. et al. Disruption of DNA-methylation-dependent long gene repression in Rett syndrome. Nature 522, 89–93 (2015).
27. Ziller, M. J. et al. Genomic distribution and inter-sample variation of non-CpG methylation across human cell types. PLoS Genet. 7, e1002389 (2011).
28. Baubec, T. et al. Genomic profiling of DNA methyltransferases reveals a role for DNMT3B in genic methylation. Nature 520, 243–247 (2015).
29. Lee, J.-H., Park, S.-J. & Nakai, K. Differential landscape of non-CpG methylation in embryonic stem cells and neurons caused by DNMT3s. Sci. Rep. 7, 11295 (2017).
30. Wu, H. & Zhang, Y. Reversing DNA methylation: mechanisms, genomics, and biological functions. Cell 156, 45–68 (2014).
31. Guo, J. U., Su, Y., Zhong, C., Ming, G.-L. & Song, H. Hydroxylation of 5-methylcytosine by TET1 promotes active DNA demethylation in the adult brain. Cell 145, 423–434 (2011).
32. Yu, M. et al. Base-resolution analysis of 5-hydroxymethylcytosine in the mammalian genome. Cell 149, 1368–1380 (2012).
33. Hu, L. et al. Crystal structure of TET2-DNA complex: insight into TET-mediated 5mC oxidation. Cell 155, 1545–1555 (2013).
34. Bostick, M. et al. UHRF1 plays a role in maintaining DNA methylation in mammalian cells. Science 317, 1760–1764 (2007).
35. Kim, T.-K. et al. Widespread transcription at neuronal activity-regulated enhancers. Nature 465, 182–187 (2010).
36. Chen, L. et al. MeCP2 binds to non-CG methylated DNA as neurons mature, influencing transcription and the timing of onset for Rett syndrome. Proc. Natl. Acad. Sci. U. S. A. 112, 5509–5514 (2015).
37. Xie, W. et al. Base-resolution analyses of sequence and parent-of-origin dependent DNA methylation in the mouse genome. Cell 148, 816–831 (2012).
38. Guo, W., Chung, W.-Y., Qian, M., Pellegrini, M. & Zhang, M. Q. Characterizing the strand-specific distribution of non-CpG methylation in human pluripotent cells. Nucleic Acids Res. 42, 3009–3016 (2014).
39. Lomvardas, S. et al. Interchromosomal interactions and olfactory receptor choice. Cell 126, 403–413 (2006).
40. Jones, P. A. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat. Rev. Genet. 13, 484–492 (2012).
41. Chen, P.-Y., Feng, S., Joo, J. W. J., Jacobsen, S. E. & Pellegrini, M. A comparative analysis of DNA methylation across human embryonic stem cell lines. Genome Biol. 12, R62 (2011).
42. Barres, R. et al. Weight loss after gastric bypass surgery in human obesity remodels promoter methylation. Cell Rep. 3, 1020–1027 (2013).
43. Barrès, R. et al. Non-CpG Methylation of the PGC-1α Promoter through DNMT3B Controls Mitochondrial Density. Cell Metab. 10, 189–198 (2009).
44. Inoue, S. & Oishi, M. Effects of methylation of non-CpG sequence in the promoter region on the expression of human synaptotagmin XI (syt11). Gene 348, 123–134 (2005).
45. Skene, P. J., Illingworth, R. S., Webb, S. & Kerr, A. R. W. Neuronal MeCP2 is expressed at near histone-octamer levels and globally alters the chromatin state. Mol. Cell (2010).
46. Clemens, A. W. et al. MeCP2 Represses Enhancers through Chromosome Topology-Associated DNA Methylation. Mol. Cell 77, 279–293.e8 (2020).
47. Stadler, M. B., Murr, R., Burger, L., Ivanek, R. & Lienert, F. DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature (2011).
48. Feldmann, A. et al. Transcription factor occupancy can mediate active turnover of DNA methylation at regulatory regions. PLoS Genet. 9, e1003994 (2013).
49. Mo, A. et al. Epigenomic Signatures of Neuronal Diversity in the Mammalian Brain. Neuron 86, 1369–1384 (2015).

50. Luo, C. et al. Single-cell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex. Science 357, 600–604 (2017).
51. Meehan, R. R., Lewis, J. D., McKay, S., Kleiner, E. L. & Bird, A. P. Identification of a mammalian protein that binds specifically to DNA containing methylated CpGs. Cell 58, 499–507 (1989).
52. Chhatbar, K., Cholewa-Waclaw, J., Shah, R., Bird, A. & Sanguinetti, G. Quantitative analysis questions the role of MeCP2 in alternative splicing. Cold Spring Harbor Laboratory 2020.05.25.115154 (2020) doi:10.1101/2020.05.25.115154.
53. Mellén, M., Ayata, P., Dewell, S., Kriaucionis, S. & Heintz, N. MeCP2 Binds to 5hmC Enriched within Active Genes and Accessible Chromatin in the Nervous System. Cell 151, 1417–1430 (2012).
54. Du, Q., Luu, P.-L., Stirzaker, C. & Clark, S. J. Methyl-CpG-binding domain proteins: readers of the epigenome. Epigenomics 7, 1051–1073 (2015).
55. Skene, P. J. et al. Neuronal MeCP2 is expressed at near histone-octamer levels and globally alters the chromatin state. Mol. Cell 37, 457–468 (2010).
56. Shahbazian, M. D., Antalffy, B., Armstrong, D. L. & Zoghbi, H. Y. Insight into Rett syndrome: MeCP2 levels display tissue- and cell-specific differences and correlate with neuronal maturation. Hum. Mol. Genet. 11, 115–124 (2002).
57. Guy, J., Cheval, H., Selfridge, J. & Bird, A. The role of MeCP2 in the brain. Annu. Rev. Cell Dev. Biol. 27, 631–652 (2011).
58. Raman, A. T. et al. Apparent bias toward long gene misregulation in MeCP2 syndromes disappears after controlling for baseline variations. Nat. Commun. 9, 3225 (2018).
59. Sperlazza, M. J., Bilinovich, S. M., Sinanan, L. M., Javier, F. R. & Williams, D. C., Jr. Structural Basis of MeCP2 Distribution on Non-CpG Methylated and Hydroxymethylated DNA. J. Mol. Biol. 429, 1581–1594 (2017).
60. Liu, K. et al. Structural basis for the ability of MBD domains to bind methyl-CG and TG sites in DNA. J. Biol. Chem. 293, 7344–7354 (2018).
61. Yin, Y. et al. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science 356, (2017).
62. Spruijt, C. G. et al. Dynamic readers for 5-(hydroxy)methylcytosine and its oxidized derivatives. Cell 152, 1146–1159 (2013).
63. Le Guezennec, X. et al. MBD2/NuRD and MBD3/NuRD, two distinct complexes with different biochemical and functional properties. Mol. Cell. Biol. 26, 843–851 (2006).
64. Baubec, T., Ivánek, R., Lienert, F. & Schübeler, D. Methylation-dependent and -independent genomic targeting principles of the MBD protein family. Cell 153, 480–492 (2013).
65. Torchy, M. P., Hamiche, A. & Klaholz, B. P. Structure and function insights into the NuRD chromatin remodeling complex. Cell. Mol. Life Sci. 72, 2491–2507 (2015).

66. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57 (2009).
67. McEvilly, R. J., de Diaz, M. O., Schonemann, M. D., Hooshmand, F. & Rosenfeld, M. G. Transcriptional regulation of cortical neuron migration by POU domain factors. Science 295, 1528–1532 (2002).
68. Sugitani, Y. et al. Brn-1 and Brn-2 share crucial roles in the production and positioning of mouse neocortical neurons. Genes Dev. 16, 1760–1765 (2002).
69. Castro, D. S. et al. Proneural bHLH and Brn proteins coregulate a neurogenic program through cooperative binding to a conserved DNA motif. Dev. Cell 11, 831–844 (2006).
70. Schonemann, M. D. et al. Development and survival of the endocrine hypothalamus and posterior pituitary gland requires the neuronal POU domain factor Brn-2. Genes Dev. 9, 3122–3135 (1995).
71. Nakai, S. et al. The POU domain transcription factor Brn-2 is required for the determination of specific neuronal lineages in the hypothalamus of the mouse. Genes Dev. 9, 3109–3121 (1995).
72. Pfisterer, U. et al. Direct conversion of human fibroblasts to dopaminergic neurons. Proc. Natl. Acad. Sci. U. S. A. 108, 10343–10348 (2011).
73. Tanaka, S. et al. Interplay of SOX and POU factors in regulation of the Nestin gene in neural primordial cells. Mol. Cell. Biol. 24, 8834–8846 (2004).
74. LaBella, F., Sive, H. L., Roeder, R. G. & Heintz, N. Cell-cycle regulation of a human histone H2b gene is mediated by the H2b subtype-specific consensus element. Genes Dev. 2, 32–39 (1988).
75. Klemm, J. D., Rould, M. A., Aurora, R., Herr, W. & Pabo, C. O. Crystal structure of the Oct-1 POU domain bound to an octamer site: DNA recognition with tethered DNA-binding modules. Cell 77, 21–32 (1994).
76. Goodall, J. et al. The Brn-2 transcription factor links activated BRAF to melanoma proliferation. Mol. Cell. Biol. 24, 2923–2931 (2004).
77. Alazard, R. et al. Identification of the ‘NORE’ (N-Oct-3 responsive element), a novel structural motif and composite element. Nucleic Acids Res. 33, 1513–1523 (2005).
78. Kobi, D. et al. Genome-wide analysis of POU3F2/BRN2 promoter occupancy in human melanoma cells reveals Kitl as a novel regulated target gene. Pigment Cell Melanoma Res. 23, 404–418 (2010).
79. Reményi, A. et al. Differential dimer activities of the transcription factor Oct-1 by DNA-induced interface swapping. Mol. Cell 8, 569–580 (2001).
80. Matsuda, S. et al. Sox21 promotes hippocampal adult neurogenesis via the transcriptional repression of the Hes5 gene. J. Neurosci. 32, 12543–12557 (2012).

81. Wilson, M. & Koopman, P. Matching SOX: partner proteins and co-factors of the SOX family of transcriptional regulators. Curr. Opin. Genet. Dev. 12, 441–446 (2002).
82. Kamachi, Y., Uchikawa, M. & Kondoh, H. Pairing SOX off: with partners in the regulation of embryonic development. Trends Genet. 16, 182–187 (2000).
83. Günther, K. et al. Differential roles for MBD2 and MBD3 at methylated CpG islands, active promoters and binding to exon sequences. Nucleic Acids Res. 41, 3010–3021 (2013).
84. Pierson, T. M. et al. The NuRD complex and macrocephaly associated neurodevelopmental disorders. Am. J. Med. Genet. C Semin. Med. Genet. 181, 548–556 (2019).
85. Koenning, M. et al. Myelin gene regulatory factor is required for maintenance of myelin and mature oligodendrocyte identity in the adult CNS. J. Neurosci. 32, 12528–12542 (2012).
86. Emery, B. et al. Myelin gene regulatory factor is a critical transcriptional regulator required for CNS myelination. Cell 138, 172–185 (2009).
87. Bujalka, H. et al. MYRF is a membrane-associated transcription factor that autoproteolytically cleaves to directly activate myelin genes. PLoS Biol. 11, e1001625 (2013).
88. Kim, D. et al. Homo-trimerization is essential for the transcription factor function of Myrf for oligodendrocyte differentiation. Nucleic Acids Res. 45, 5112–5125 (2017).
89. Meng, J. et al. Myrf ER-Bound Transcription Factors Drive C. elegans Synaptic Plasticity via Cleavage-Dependent Nuclear Translocation. Dev. Cell 41, 180–194.e7 (2017).
90. van Roeyen, C. R. C. et al. Y-box protein 1 mediates PDGF-B effects in mesangioproliferative glomerular disease. J. Am. Soc. Nephrol. 16, 2985–2996 (2005).
91. Lindquist, J. A., Brandt, S., Bernhardt, A., Zhu, C. & Mertens, P. R. The role of cold shock domain proteins in inflammatory diseases. J. Mol. Med. 92, 207–216 (2014).
92. Fotovati, A., Abu-Ali, S., Wang, P. S., Deleyrolle, L. P. & Lee, C. YB-1 bridges neural stem cells and brain tumor–initiating cells via its roles in differentiation and cell growth. Cancer Res. (2011).
93. Lu, Z. H., Books, J. T. & Ley, T. J. YB-1 is important for late-stage embryonic development, optimal cellular stress responses, and the prevention of premature senescence. Mol. Cell. Biol. 25, 4625–4637 (2005).
94. Hindmarch, C., Yao, S., Beighton, G., Paton, J. & Murphy, D. A comprehensive description of the transcriptome of the hypothalamoneurohypophyseal system in euhydrated and dehydrated rats. Proc. Natl. Acad. Sci. U. S. A. 103, 1609–1614 (2006).
95. Unkrüer, B. et al. Cellular localization of Y-box binding protein 1 in brain tissue of rats, macaques, and humans. BMC Neurosci. 10, 28 (2009).
96. Viggars, A. P. et al. Alterations in the blood brain barrier in ageing cerebral cortex in relationship to Alzheimer-type pathology: a study in the MRC-CFAS population neuropathology cohort. Neurosci. Lett. 505, 25–30 (2011).

97. Bhullar, J. & Sollars, V. E. YBX1 expression and function in early hematopoiesis and leukemic cells. Immunogenetics 63, 337–350 (2011).
98. Horn, G., Hofweber, R., Kremer, W. & Kalbitzer, H. R. Structure and function of bacterial cold shock proteins. Cell. Mol. Life Sci. 64, 1457–1470 (2007).
99. Brandt, S. et al. Cold shock Y-box protein-1 participates in signaling circuits with auto-regulatory activities. Eur. J. Cell Biol. 91, 464–471 (2012).
100. Lindquist, J. A. & Mertens, P. R. Cold shock proteins: from cellular mechanisms to pathophysiology and disease. Cell Commun. Signal. 16, 63 (2018).
101. Davis, S. & Laroche, S. Mitogen-activated protein kinase/extracellular regulated kinase signalling and memory stabilization: a review. Genes Brain Behav. 5, 61–72 (2006).
102. Ariyoshi, M. & Schwabe, J. W. R. A conserved structural motif reveals the essential transcriptional repression function of Spen proteins and their role in developmental signaling. Genes Dev. 17, 1909–1920 (2003).
103. Oswald, F. et al. SHARP is a novel component of the Notch/RBP-Jκ signalling pathway. EMBO J. 21, 5417–5426 (2002).
104. Oswald, F., Winkler, M. & Cao, Y. RBP-Jκ/SHARP recruits CtIP/CtBP corepressors to silence Notch target genes. Mol. Cell. Biol. (2005).
105. Shi, Y. et al. Sharp, an inducible cofactor that integrates nuclear receptor repression and activation. Genes Dev. 15, 1140–1151 (2001).
106. Yabe, D. et al. Generation of a conditional knockout allele for mammalian Spen protein Mint/SHARP. Genesis 45, 300–306 (2007).
107. Bartels, S. J. J., Spruijt, C. G., Brinkman, A. B. & Jansen, P. A SILAC-based screen for Methyl-CpG binding proteins identifies RBP-J as a DNA methylation and sequence-specific binding protein. PLoS One (2011).
108. Laub, F. et al. Transcription factor KLF7 is important for neuronal morphogenesis in selected regions of the nervous system. Mol. Cell. Biol. 25, 5699–5711 (2005).
109. Li, J. et al. Requirement for ssbp2 in hematopoietic stem cell maintenance and stress response. J. Immunol. 193, 4654–4662 (2014).
110. Ciccone, D. N. et al. KDM1B is a histone H3K4 demethylase required to establish maternal genomic imprints. Nature 461, 415–418 (2009).
111. de Kok, Y. J. et al. Association between X-linked mixed deafness and mutations in the POU domain gene POU3F4. Science 267, 685–688 (1995).
112. Le Moine, C. & Young, W. S., 3rd. RHS2, a POU domain-containing gene, and its expression in developing and adult rat. Proc. Natl. Acad. Sci. U. S. A. 89, 3285–3289 (1992).
113. Lefebvre, V. The SoxD transcription factors--Sox5, Sox6, and Sox13--are key cell fate modulators. Int. J. Biochem. Cell Biol. 42, 429–432 (2010).

114. Wang, Y., Ristevski, S. & Harley, V. R. SOX13 exhibits a distinct spatial and temporal expression pattern during chondrogenesis, neurogenesis, and limb development. J. Histochem. Cytochem. 54, 1327–1333 (2006).
115. Wang, Y., Bagheri-Fam, S. & Harley, V. R. SOX13 is up-regulated in the developing mouse neuroepithelium and identifies a sub-population of differentiating neurons. Brain Res. Dev. Brain Res. 157, 201–208 (2005).
116. Jiao, K., Zhou, Y. & Hogan, B. L. M. Identification of mZnf8, a mouse Krüppel-like transcriptional repressor, as a novel nuclear interaction partner of Smad1. Mol. Cell. Biol. 22, 7633–7644 (2002).
117. Zwijsen, A., Verschueren, K. & Huylebroeck, D. New intracellular components of bone morphogenetic protein/Smad signaling cascades. FEBS Lett. 546, 133–139 (2003).
118. Bartke, T. et al. Nucleosome-interacting proteins regulated by DNA and histone methylation. Cell 143, 470–484 (2010).
119. Zemach, A. & Grafi, G. Characterization of Arabidopsis thaliana methyl-CpG-binding domain (MBD) proteins. Plant J. (2003).
120. Hashimoto, H. et al. Recognition and potential mechanisms for replication and erasure of cytosine hydroxymethylation. Nucleic Acids Res. 40, 4841–4849 (2012).
121. Klose, R. J. et al. DNA binding selectivity of MeCP2 due to a requirement for A/T sequences adjacent to methyl-CpG. Mol. Cell 19, 667–678 (2005).
122. Tillotson, R. et al. Neuronal non-CG methylation is an essential target for MeCP2 function. Mol. Cell 81, 1260–1275.e12 (2021).
123. Kuehner, J. N. & Yao, B. The Dynamic Partnership of Polycomb and Trithorax in Brain Development and Diseases. Epigenomes 3, 17 (2019).
124. Vogel, T., Stoykova, A. & Gruss, P. Differential expression of polycomb repression complex 1 (PRC1) members in the developing mouse brain reveals multiple complexes. Dev. Dyn. 235, 2574–2585 (2006).
125. Tavares, L. et al. RYBP-PRC1 complexes mediate H2A ubiquitylation at polycomb target sites independently of PRC2 and H3K27me3. Cell 148, 664–678 (2012).
126. Gao, Z. et al. PCGF homologs, CBX proteins, and RYBP define functionally distinct PRC1 family complexes. Mol. Cell 45, 344–356 (2012).
127. Gearhart, M. D., Corcoran, C. M., Wamstad, J. A. & Bardwell, V. J. Polycomb group and SCF ubiquitin ligases are found in a novel BCOR complex that is recruited to BCL6 targets. Mol. Cell. Biol. 26, 6880–6889 (2006).
128. Hannibal, M. C. et al. Spectrum of MLL2 (ALR) mutations in 110 cases of Kabuki syndrome. Am. J. Med. Genet. A 155A, 1511–1516 (2011).
129. Jones, W. D. et al. De novo mutations in MLL cause Wiedemann-Steiner syndrome. Am. J. Hum. Genet. 91, 358–364 (2012).

130. Sun, Y.-M. et al. Distinct profiles of REST interactions with its target genes at different stages of neuronal development. Mol. Biol. Cell 16, 5630–5638 (2005).
131. Yang, Y. J. et al. Microcephaly gene links trithorax and REST/NRSF to control neural stem cell proliferation and differentiation. Cell 151, 1097–1112 (2012).
132. Xue, Y. et al. NURD, a novel complex with both ATP-dependent chromatin-remodeling and histone deacetylase activities. Mol. Cell 2, 851–861 (1998).
133. Millard, C. J. et al. The structure of the core NuRD repression complex provides insights into its interaction with chromatin. Elife 5, e13941 (2016).
134. Feng, Q. & Zhang, Y. The MeCP1 complex represses transcription through preferential binding, remodeling, and deacetylating methylated nucleosomes. Genes Dev. 15, 827–832 (2001).
135. Hong, W. et al. FOG-1 recruits the NuRD repressor complex to mediate transcriptional repression by GATA-1. EMBO J. 24, 2367–2378 (2005).
136. Kaji, K. et al. The NuRD component Mbd3 is required for pluripotency of embryonic stem cells. Nat. Cell Biol. 8, 285–292 (2006).
137. Zhang, J. et al. Harnessing of the nucleosome-remodeling-deacetylase complex controls lymphocyte development and prevents leukemogenesis. Nat. Immunol. 13, 86–94 (2011).
138. Hoffmann, A. & Spengler, D. Chromatin Remodeling Complex NuRD in Neurodevelopment and Neurodevelopmental Disorders. Front. Genet. 10, 682 (2019).
139. Wu, L. M. N. et al. Zeb2 recruits HDAC-NuRD to inhibit Notch and controls Schwann cell differentiation and remyelination. Nat. Neurosci. 19, 1060–1072 (2016).
140. Hirota, A., Nakajima-Koyama, M., Ashida, Y. & Nishida, E. The nucleosome remodeling and deacetylase complex protein CHD4 regulates neural differentiation of mouse embryonic stem cells by down-regulating p53. J. Biol. Chem. 294, 195–209 (2019).
141. Nitarska, J. et al. A Functional Switch of NuRD Chromatin Remodeling Complex Subunits Regulates Mouse Cortical Development. Cell Rep. 17, 1683–1698 (2016).
142. Yamada, T. et al. Promoter decommissioning by the NuRD chromatin remodeling complex triggers synaptic connectivity in the mammalian brain. Neuron 83, 122–134 (2014).
143. Kangaspeska, S. et al. Transient cyclical methylation of promoter DNA. Nature 452, 112–115 (2008).

Supplementary information

Table S5.1: The combined dataset corresponds to proteins observed in both the human and mouse mCA/CA datasets. Combined-human denotes a protein significantly enriched in human but not mouse, and combined-mouse denotes a protein significantly enriched in mouse but not human. Proteins marked with an asterisk (*) were statistically enriched within the mCG/CG DNA pull-downs (see chapter 4 supplementary tables).

Species               GeneID    Reference   Species               GeneID     Reference

Combined mCA          MBD2*     118,143     Human and Mouse CA    MNT        118
Combined mCA          POU3F2    61          Human and Mouse CA    USF1*      61
Combined-human mCA    KAT6B*    --          Human and Mouse CA    USF2*      61
Combined-human mCA    MTA2*     62          Combined-human CA     APEX1*     --
Combined-human mCA    MTA3*     62          Combined-human CA     ARNTL      61
Combined-human mCA    PHF5A     --          Combined-human CA     ATF2       61
Combined-human mCA    ZFP128    --          Combined-human CA     CHD6       --
Combined-mouse mCA    Dazap1    --          Combined-human CA     CLOCK      61
Combined-mouse mCA    Ppie      --          Combined-human CA     FOXO1*
Combined-mouse mCA    Srek1     --          Combined-human CA     FOXO3*     --
Combined-mouse mCA    Ybx3      --          Combined-human CA     HIVEP2*    --
                                            Combined-human CA     JDP2       --
                                            Combined-human CA     LRPPRC*    --
                                            Combined-human CA     MXI1*      --
                                            Combined-human CA     MYPOP*     --
                                            Combined-human CA     NRF1       61
                                            Combined-human CA     PAX6       --
                                            Combined-human CA     PBX3*      --
                                            Combined-human CA     POLB       --
                                            Combined-human CA     RBM42      62
                                            Combined-human CA     RBM4B      --
                                            Combined-human CA     RECQL5*    --
                                            Combined-human CA     SCRT1      --
                                            Combined-human CA     SIN3A      --
                                            Combined-human CA     SP3        61,118
                                            Combined-human CA     TAF9       62
                                            Combined-human CA     TCF4       --
                                            Combined-human CA     TFE3*      61
                                            Combined-human CA     TFEB*      61
                                            Combined-human CA     YTHDC2*    --

Table S5.2: The human-limited mCA/CA dataset corresponds to proteins observed in human only. Proteins marked with an asterisk (*) were statistically enriched within the mCG/CG DNA pull-downs (see chapter 4 supplementary tables).

Species               GeneID    Reference   Species               GeneID     Reference

Human-limited mCA     ELF1                  Human-limited CA      BHLHE40*   61
Human-limited mCA     MYRF      --          Human-limited CA      HMGB1      --
Human-limited mCA     SPEN      107         Human-limited CA      HMGB3      --
Human-limited mCA     ZNF384                Human-limited CA      LBR        --
Human-limited mCA     ZNF775    --          Human-limited CA      LONP1      --
                                            Human-limited CA      MAX*       61,62
                                            Human-limited CA      MAZ        --
                                            Human-limited CA      MLX        61,118
                                            Human-limited CA      MLXIPL     --
                                            Human-limited CA      OLIG2*     --
                                            Human-limited CA      SREBF1*    61
                                            Human-limited CA      STAT1      --
                                            Human-limited CA      TFAP4*     --
                                            Human-limited CA      TRIM33     --
                                            Human-limited CA      VEZF1*     --
                                            Human-limited CA      ZIC2*      --
                                            Human-limited CA      ZNF148     --
                                            Human-limited CA      ZNF385A    --
                                            Human-limited CA      ZNF385D    61
                                            Human-limited CA      ZNF579     --
                                            Human-limited CA      ZNF629     --
                                            Human-limited CA      ZNF71      --
                                            Human-limited CA      ZSCAN31    --

Table S5.3: The mouse-limited mCA/CA dataset corresponds to proteins observed in mouse only. Proteins marked with an asterisk (*) were statistically enriched within the mCG/CG DNA pull-downs (see chapter 4 supplementary tables).

Species               GeneID    Reference   Species               GeneID     Reference

Mouse-limited mCA     Jazf1                 Mouse-limited CA      Bhlhe23    --
Mouse-limited mCA     Kdm1b                 Mouse-limited CA      Mtf1*      --
Mouse-limited mCA     Klf7      --          Mouse-limited CA      Neurod1    --
Mouse-limited mCA     Naca      --          Mouse-limited CA      Nucb1      --
Mouse-limited mCA     Pou3f4    --          Mouse-limited CA      Pa2g4      --
Mouse-limited mCA     Ssbp2     --          Mouse-limited CA      Rfx1*
Mouse-limited mCA     Zhx3      118         Mouse-limited CA      Wrn*       --

Table S5.4: The human P/A mCA/CA dataset corresponds to the entire-human P/A dataset, created by merging the combined-human P/A and human-limited P/A datasets. Proteins marked with an asterisk (*) were statistically enriched within the mCG/CG DNA pull-downs (see chapter 4 supplementary tables).

Species               GeneID    Reference   Species               GeneID     Reference

Human P/A mCA         CPSF6     --          Human P/A CA          OLIG3*     --
Human P/A mCA         ZFR2      --          Human P/A CA          POLG2      --
                                            Human P/A CA          ZNF354A    --
                                            Human P/A CA          ZNF628     --
                                            Human P/A CA          ZNF683*    --

Table S5.5: The mouse P/A mCA/CA dataset corresponds to the entire-mouse P/A dataset, created by merging the combined-mouse P/A and mouse-limited P/A datasets. Proteins marked with an asterisk (*) were statistically enriched within the mCG/CG DNA pull-downs (see chapter 4 supplementary tables).

Species               GeneID    Reference   Species               GeneID     Reference

Mouse P/A mCA         DDX4      --          Mouse P/A CA          BHLHE40*   61
Mouse P/A mCA         EME2      --          Mouse P/A CA          MITF*      --
Mouse P/A mCA         MEIS3*    61          Mouse P/A CA          TCF3       --
Mouse P/A mCA         MYNN      --
Mouse P/A mCA         MYRF*     --
Mouse P/A mCA         NME2*     --
Mouse P/A mCA         SOX13*    --

General discussion

Summary

The discovery of mC readers has progressed considerably owing to recent advances in MS-proteomics; however, reliable statistical tools that are applicable to multi-dataset analyses are lacking. Here, ProteoMM was developed to provide a robust, streamlined, and user-friendly approach to MS analyses by implementing enhanced normalisation and imputation methods that are currently unavailable in many commonly used statistical tools. The efficacy of ProteoMM was confirmed by benchmarking it against an existing, commonly used proteomics analysis tool and comparing the output of each tool against previously published mCG binding data. ProteoMM correctly called a greater number of already characterised DNA-binding domain (DBD)-containing proteins with an affinity for mCG. ProteoMM was then used to identify novel DBD-containing proteins and interacting proteins with an affinity for mCG, mCA, CG, and CA DNA baits. Numerous novel candidate mCG and CG binding proteins were identified, providing a repository of proteins with affinity for each substrate within the human and mouse brain. In addition, the mCA/CA DNA pull-down provides the first exploratory mCA reader characterisation study, thereby expanding the current list of mCA binders within the mammalian brain, which thus far has been limited to MECP2. The list of protein interactors enriched for mCA suggests that this modification may recruit the repressive NuRD complex through recognition by MBD2, whose direct affinity for mCA was validated by biochemical approaches. The mCG/CG and mCA/CA DNA pull-downs therefore identified many novel mCG, mCA, CG, and CA readers that require extensive further characterisation to ascertain their interactions and effects on gene expression within the unique neuro-epigenetic landscape.
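
The benchmarking step summarised above compares each tool's list of called proteins against previously published mCG binders. A minimal Python sketch of that comparison follows. It is not ProteoMM code: the function name, the tool labels and the gene sets are hypothetical placeholders chosen for illustration, and recall is only one of several measures that could be used.

```python
def recall_of_known_binders(called, known):
    """Fraction of previously published binders that a tool recovers."""
    called, known = set(called), set(known)
    if not known:
        raise ValueError("reference set of known binders is empty")
    return len(called & known) / len(known)


# Hypothetical placeholder sets; these are not results from this thesis.
known_mcg_binders = {"MBD2", "MECP2", "ZBTB33", "RFX5"}
calls_tool_a = {"MBD2", "MECP2", "RFX5", "FOXK1"}
calls_tool_b = {"MBD2", "FOXK1"}

recall_a = recall_of_known_binders(calls_tool_a, known_mcg_binders)
recall_b = recall_of_known_binders(calls_tool_b, known_mcg_binders)
```

A higher recall on the same input data means that a tool recovers more of the already characterised binders, which is the sense in which one tool "correctly called a greater number" of them.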

Potential implications for identified mCG binders

CG methylation constitutes the bulk of DNA methylation found within mammals and is highly conserved. The majority of CG dinucleotides in the human genome are methylated, except within certain limited regions such as active gene regulatory elements. At promoters and TEs, CG methylation coincides with transcriptional repression, which is mediated by two main mechanisms1,2. First, CG methylation can impede the binding of transcription factors or RNA polymerase. Second, a subset of DNA binding proteins has evolved specialised DBDs that recognise, bind to, and modulate transcription at hypermethylated loci. Subsequent to binding, these DBD-containing proteins coordinate repression through interactions with proteins that alter the local chromatin state, rendering the hypermethylated site inaccessible to other proteins3,4. To date, many DBD-containing proteins with mCG binding capabilities have been identified and characterised. These include members of classical mCG binding families such as the MBD and Kaiso families5,6, whilst recent advances in DNA-binding screens have identified numerous novel mCG binders. Among these screens are methyl-sensitive SELEX screens that have comprehensively screened the human proteome7. DNA pull-downs coupled to MS have been invaluable in identifying not only DBD-containing proteins that bind to CG dinucleotides, but also biologically relevant protein interactions within many cell types8,9. The results from the mCG/CG DNA pull-down within this study were compared with the many DBD-containing proteins with known mCG binding capabilities identified by methyl-sensitive SELEX or other published DNA pull-down screens. A large overlap with external studies was observed, suggesting that these DBD-containing proteins and their interactors are expressed within the adult human and mouse brain, where they partake in methylation-dependent gene regulation. For many of these proteins, no biochemical validation experiments have been conducted, and such experiments are required to confirm a direct interaction with mCG. Alongside this, a subset of proteins requires further in vivo characterisation to understand their genomic binding and protein regulatory networks. The DNA pull-down also identified many protein interactors, most likely acting via a DNA binding protein to elicit transcriptional repression. Experiments such as TAP-MS are required to disentangle the list of observed protein interactors and isolate defined regulatory complexes prior to the developmental characterisation of each complex. Together, the genome-wide binding dynamics and DBD-containing protein interaction experiments may reveal how each DBD-containing protein influences gene expression at target loci and, to a larger extent, contributes to healthy brain functioning.

In addition to proteins already identified in external DNA binding screens, many novel DBD-containing proteins were also identified, for example, members of the FOX family and the EXOSC complex. The FOX family is a diverse protein family comprising over 40 members. The forkhead domain, conserved among all FOX members, contains a winged-helix (WH) domain, which has been implicated in the recognition of mCG by the methyl reader RFX5. While mCG binding by FOXJ3, FOXK1, and FOXK2 has been verified7,9,10, that of FOXP1, FOXN3, and FOXO3, which were enriched for mCG within the DNA pull-down, has not. Further, the FOX family members already observed to have an affinity for mCG have not been subjected to structural or biochemical characterisation experiments that would verify if and how the winged-helix domain participates in the recognition of mCG. These experiments would contribute to our current understanding of how various DNA binding domains interact with methylated DNA, and may also prove useful in predicting whether other, as yet unidentified, proteins with similar domains bind methylated DNA. This is especially relevant in the case of the WH domain, which is not exclusive to the FOX family.

Numerous proteins implicated in RNA-related processes were enriched for mCG within the DNA pull-down, revealing potentially novel mechanisms related to splicing and RNA surveillance within the mammalian brain. Splicing factors CSTF1/2/3 and PRPF1/3, for example, were enriched for mCG. There is some evidence implicating CG methylation in the regulation of splicing, but much remains unknown. PRPF3 in particular has been established to associate with MECP2, which guides the splicing complex to mCG sites in vivo11. Whether MECP2 also associates with PRPF1, and whether this is a shared or unique splicing complex within the brain, remains to be determined. In addition to these proteins, other RNA-related proteins were also observed within the mCG/CG DNA pull-down, specifically, all members of the nuclear exosome complex. Thus far, RNA turnover and quality control, which are known functions of the nuclear exosome, have not been mechanistically linked to mCG. The results of the DNA pull-down are suggestive of a link between RNA modulation and CG methylation. Further investigation may unravel how this complex is recruited, verify a relationship between mCG and the nuclear exosome constituents, and establish whether this complex is unique to the brain or functions more broadly in many cell types.

The first mCA reader screen in mammals

Whilst CG methylation predominates in most tissues, CH methylation is found in abundance in pluripotent stem cells and within the mammalian brain. Despite being low at birth, CH methylation accumulates alongside neurogenesis and is the more abundant form of DNA methylation within mature neurons12. This accumulation, coupled with cell type-specific effects on transcription, has prompted questions regarding the cellular mechanisms underlying mCH readout. MECP2 is the only mCH reader that has been comprehensively characterised, binding selectively to mCA in vitro and in vivo13–15. Mutations within the MECP2 gene give rise to an X-linked disorder known as Rett Syndrome, whilst overexpression of MECP2 results in MECP2 duplication syndrome16. These observations hint at an undiscovered regulatory role for mCA in neurodevelopment, which may be deciphered through an understanding of the proteins that bind to this modification and their interactions. For MECP2, this was confirmed by analysis of its in vivo binding dynamics, which verified a localisation at mCH sites, independent of mCG13,15. Further, these loci were misregulated upon disruption of MECP2, providing a molecular basis for the delayed onset of Rett Syndrome17. Whilst MECP2 has been investigated, there have been no attempts to identify and characterise other potential mCA binders or interactors within the mammalian brain that may regulate important neurodevelopmental processes coupled to mCA.

The mCA/CA DNA pull-down described here represents the first experiment to identify and characterise mCA binders and protein interactors in the human and mouse brain. Results from this DNA pull-down suggest that MECP2 is not the only DBD-containing protein that binds to mCA. Numerous DBD-containing proteins, many with attributed roles in neurodevelopment, for example POU3F2 and POU3F4, were significantly enriched for the mCA modification within the DNA pull-down. These results open a new avenue of research into how this novel modification is interpreted and provide new candidates that can now be further characterised to understand how mCA is read and the molecular consequences of this mark for gene expression during neurodevelopment.

mC reader conservation in human and mouse brain

The combined human and mouse analysis produced by ProteoMM enabled comparisons of enriched proteins observed within each binding context. This analysis was successfully implemented and detected proteins in both species with an affinity for mCG, mCA, CG, and CA. Proteins enriched within the combined analysis represent a set of conserved proteins with similar biological functions in human and mouse brain. As such, these proteins most likely participate in important regulatory processes within the brain related to their affinities for the probes used within this DNA pull-down (Figure 4.6 and Figure 5.2). Further, two clusters, one repelled by mC and one attracted to mC, were identified when the mCA/CA and mCG/CG datasets were overlaid and clustered based on protein intensity (Figure 5.8). GO analysis of proteins within these clusters identified several important protein complexes. In line with what is currently known about these complexes, the PRC1 and MLL/COMPASS complex members were repelled by the mCG probe18,19, and were similarly repelled by mCA to a significant extent. The results suggest that these complexes likely coordinate gene expression at methylation-deficient loci within the brain, at both CG and CA sites. Biochemical validation and in vivo binding data are needed to ascertain whether these complexes are simply repelled by mCA or whether they bind to CA with appreciable affinity and localise at CA elements in the mammalian brain.
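
The distinction between proteins attracted to mC and proteins repelled by mC in both pull-downs can be illustrated with a simple rule on log2(methylated/unmethylated) ratios. The Python sketch below is a simplification under stated assumptions: the thesis identified these groups by clustering protein intensities (Figure 5.8), whereas this sketch applies a fixed threshold, and the function name, the threshold of 1.0 and the example values are all hypothetical.

```python
def classify_mc_response(lfc_mcg, lfc_mca, threshold=1.0):
    """Label a protein from its log2(methylated / unmethylated) ratios.

    lfc_mcg: log2 fold change in the mCG/CG pull-down
    lfc_mca: log2 fold change in the mCA/CA pull-down
    threshold: hypothetical cut-off, not a value taken from the thesis
    """
    if lfc_mcg >= threshold and lfc_mca >= threshold:
        return "attracted to mC"
    if lfc_mcg <= -threshold and lfc_mca <= -threshold:
        return "repelled by mC"
    return "context-specific or unchanged"


# Made-up fold changes for three placeholder proteins.
example = {
    "protein_A": (2.3, 1.8),    # enriched on both methylated probes
    "protein_B": (-1.9, -1.4),  # depleted on both methylated probes
    "protein_C": (2.1, 0.1),    # enriched on the mCG probe only
}
labels = {name: classify_mc_response(*lfc) for name, lfc in example.items()}
```

Under this rule a complex such as PRC1 would fall in the "repelled by mC" group only if its members were depleted on both methylated probes, which is the pattern described above.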

The DNA pull-downs identified a subset of proteins with affinity for both mCG and mCA. Most notably, this was observed for the repressive NuRD complex, which couples histone deacetylation and chromatin remodelling with transcriptional repression at loci to which it is guided20. Importantly, MBD2 was observed within this cluster and was highly enriched in human and mouse for both the mCG and mCA binding contexts. MBD2 is a well-characterised mCG binder that is responsible for recruiting NuRD to mCG-enriched loci in vivo. While this result is unsurprising, it additionally indicates that NuRD is involved in the regulation of brain development, a mechanism that has not been comprehensively explored. Excitingly, the enrichment of MBD2 and NuRD components for the mCA probe suggests that NuRD likely participates in transcriptional regulation of hypermethylated mCA loci. A specific interaction of MBD2 with mCA was confirmed by EMSA (Figure 5.10), as an initial step towards the characterisation of MBD2 as an mCA binder. Further experimental work is required to understand the repercussions of these results; however, as detailed below, this was not within the scope of this project and is dependent on advances within the neuro-epigenomic field.

Experimental strategies for downstream characterization of mC readers

Thus far, proteomics has been essential in identifying proteins that bind to particular DNA modifications in many tissue types and for a variety of modifications8,9. Other approaches like SELEX also offer a means of investigating transcription factor binding7. These approaches have circumvented the pitfalls of earlier methods like TAP-MS or yeast two-hybrid systems, which lacked the ability to screen protein interactions in a high-throughput manner. In addition, recent developments within quantitative MS now enable label-free protein screens that do not rely on isotope labelling or cell culture-based systems. Whilst some tools exist to analyse label-free data, no existing tool allows two or more datasets to be analysed jointly; comparisons are instead limited to results that are compared post-analysis. ProteoMM was developed to overcome this limitation and provides more sophisticated imputation and normalisation procedures aimed at maximising the information recovered from MS datasets, which are fraught with copious amounts of missing data. ProteoMM is therefore not limited to DNA pull-down MS data, but is a valuable tool that may be applied to future and existing MS datasets requiring group comparisons.

The identification of numerous DBD-containing proteins and interactors within the mCG/CG and mCA/CA datasets suggests these proteins are expressed within mature human and mouse brain, participating in biological activities pertinent to their observed enrichment. Many of these represent novel DBD-containing proteins that have not previously been associated with binding to that context in any cell type. The proteins observed as enriched for each context constitute a repository of protein information that other studies may build upon. Protein interactions may be disentangled through TAP-MS experiments, whilst characterization of DBD-containing protein binding will rely on a mixture of biochemical and in vivo work. Some commonly applied biochemical analyses like EMSA or Microscale Thermophoresis (MST) may be used to verify the binding of observed DBD-containing proteins to a particular context but are artificial in nature, lacking the underlying complexity of genomic DNA and cellular factors such as other proteins or nucleosomes. Nevertheless, these are the necessary initial steps for the characterization of DBD-containing proteins identified in this experiment. Beyond these, a sound understanding of the genome-wide binding dynamics of each protein is required. ChIP-seq coupled to bisulfite treatment of the same genomic DNA is invaluable in ascertaining the binding sequences of proteins and their methylation states. Gene knockout (KO) experiments, which quantify the effect of a candidate protein on gene expression, may help to establish causality. Binding data may be overlaid with RNA-seq data from the same tissue or gene KO experiment to comprehensively characterize the binding of each protein, its effects on gene expression upon binding, whether this is methylation-dependent and consistent with results from the DNA pull-down, and lastly, the repercussions of these events on neurodevelopment.
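The proposed overlay of binding, methylation and expression data can be sketched as follows. This is a minimal illustration under stated assumptions: each gene is already annotated with whether a candidate reader binds it (for example from ChIP-seq), its mCA level (for example from bisulfite sequencing) and its expression change in a KO (for example from RNA-seq). The record fields, the mCA cut-off and all example values are hypothetical and do not come from this study.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class GeneRecord:
    name: str
    bound: bool        # candidate reader peak overlaps the gene
    mca_level: float   # fraction of methylated CA sites in the gene
    ko_log2fc: float   # log2(KO / wild type) expression change


def summarise_ko_effect(
    genes: List[GeneRecord],
    mca_cutoff: float = 0.05,  # assumed threshold for calling a gene mCA-rich
) -> Dict[str, Optional[float]]:
    """Mean KO expression change per binding and methylation group."""
    groups: Dict[str, List[float]] = {
        "bound_methylated": [],
        "bound_unmethylated": [],
        "unbound": [],
    }
    for gene in genes:
        if not gene.bound:
            key = "unbound"
        elif gene.mca_level >= mca_cutoff:
            key = "bound_methylated"
        else:
            key = "bound_unmethylated"
        groups[key].append(gene.ko_log2fc)
    # Empty groups are reported as None rather than raising an error
    return {key: (mean(vals) if vals else None) for key, vals in groups.items()}


# Hypothetical genes for illustration only.
example_genes = [
    GeneRecord("gene_a", True, 0.10, 1.5),
    GeneRecord("gene_b", True, 0.08, 0.5),
    GeneRecord("gene_c", True, 0.01, 0.25),
    GeneRecord("gene_d", False, 0.12, -0.25),
    GeneRecord("gene_e", False, 0.00, 0.25),
]
summary = summarise_ko_effect(example_genes)
```

Under this scheme, a methylation-dependent repressor would be expected to show a larger positive mean change in the bound and methylated group than in the other two groups; a real analysis would additionally need replicate-aware statistics rather than group means.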

The potential experiments detailed above, which would enable the comprehensive characterisation of mCA readers identified in this DNA pull-down, are hampered by inherent challenges associated with studying the mammalian brain. The brain is a complex organ composed of neuronal and non-neuronal cell types and their subtypes. As such, FACS or FANS may be utilised to address protein behaviour in a subset of brain cell types. This is an essential requirement for studying mCA readers, whose mark is almost exclusively restricted to neurons. Current advances in the field such as cell culture and organoid differentiation will be required to overcome brain tissue limitations and to conduct experiments at numerous developmental timepoints, at which mCA levels differ. The intermingled nature of mCA and mCG within the brain is only partially resolved with sorting. Approaches to retain mCH levels whilst reducing mCG will also be useful in disentangling the binding of mCA readers, especially because many also bind to mCG. The future development of appropriate epigenome editing tools may also be of use, if they are able to achieve precise induction of mCA to allow for easier investigation of mCA binders and causal roles of the modification. Subsequent characterisation of mCA readers will be dependent on the progress and implementation of these strategies in the future.

Concluding remarks

The mCG/CG and mCA/CA DNA pull-downs were successful in identifying numerous proteins with affinity for mCA and mCG, in line with the overarching aim of this project, which was to evaluate mC reader conservation in human and mouse brain. Results from the mCG/CG DNA pull-down are largely consistent with existing information for readers implicated in mCG recognition. The enrichment of MBD2, MECP2, and various Kaiso family members in human and mouse, along with other existing and novel mCG binders, suggests these proteins participate in similar epigenomic readout processes. How these proteins influence gene expression programmes in the brain to coordinate processes involved in neurodevelopment and neural function is largely unknown and remains to be explored. The mCA/CA DNA pull-down represents the first mCA reader screen in mammals and identified numerous potential DBD-containing proteins and interactors. Together, both DNA pull-downs identified a conserved list of proteins and potential regulatory networks that may function within brain development in methylation-dependent and methylation-independent contexts. Foremost within this list is MBD2, whose specificity for mCA was verified by biochemical approaches. The enrichment of NuRD complex members for mCG and, surprisingly, mCA suggests that this complex may coordinate transcriptional repression of loci within the brain harbouring mCA, and represents an exciting avenue of potential research. In combination with advances in the field of neuro-epigenetics, the results of this assay may be used to characterise mCG and/or mCA binders in order to understand the role of these modifications within the mammalian brain.


References

1. Pfeifer, G. P., Tang, M. & Denissenko, M. F. Mutation hotspots and DNA methylation. Curr. Top. Microbiol. Immunol. 249, 1–19 (2000).
2. Jurkowska, R. Z., Jurkowski, T. P. & Jeltsch, A. Structure and function of mammalian DNA methyltransferases. Chembiochem 12, 206–222 (2011).
3. Lock, L. F., Takagi, N. & Martin, G. R. Methylation of the Hprt gene on the inactive X occurs after chromosome inactivation. Cell 48, 39–46 (1987).
4. Ford, E. E., Grimmer, M. R., Stolzenburg, S. & Bogdanovic, O. Frequent lack of repressive capacity of promoter DNA methylation identified through genome-wide epigenomic manipulation. bioRxiv (2017).
5. Hendrich, B. & Tweedie, S. The methyl-CpG binding domain and the evolving role of DNA methylation in animals. Trends Genet. 19, 269–277 (2003).
6. Zhigalova, N. A., Zhenilo, S. V., Aithozhina, D. S. & Prokhortchouk, E. B. Bifunctional role of the zinc finger domains of the methyl-DNA-binding protein Kaiso. Mol. Biol. 44, 233–244 (2010).
7. Yin, Y. et al. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science 356, (2017).
8. Vermeulen, M. Identifying chromatin readers using a SILAC-based histone peptide pull-down approach. Methods Enzymol. 512, 137–160 (2012).
9. Spruijt, C. G. et al. Dynamic readers for 5-(hydroxy)methylcytosine and its oxidized derivatives. Cell 152, 1146–1159 (2013).
10. Bartke, T. et al. Nucleosome-interacting proteins regulated by DNA and histone methylation. Cell 143, 470–484 (2010).
11. Long, S. W., Ooi, J. Y. Y., Yau, P. M. & Jones, P. L. A brain-derived MeCP2 complex supports a role for MeCP2 in RNA processing. Biosci. Rep. 31, 333–343 (2011).
12. Lister, R. et al. Global epigenomic reconfiguration during mammalian brain development. Science 341, 1237905 (2013).
13. Gabel, H. W. et al. Disruption of DNA-methylation-dependent long gene repression in Rett syndrome. Nature 522, 89–93 (2015).
14. Liu, K. et al. Structural basis for the ability of MBD domains to bind methyl-CG and TG sites in DNA. J. Biol. Chem. 293, 7344–7354 (2018).
15. Clemens, A. W. et al. MeCP2 Represses Enhancers through Chromosome Topology-Associated DNA Methylation. Mol. Cell 77, 279–293.e8 (2020).
16. Kaufmann, W. E., Johnston, M. V. & Blue, M. E. MeCP2 expression and function during brain development: implications for Rett syndrome’s pathogenesis and clinical evolution. Brain Dev. 27 Suppl 1, S77–S87 (2005).
17. Chen, L. et al. MeCP2 binds to non-CG methylated DNA as neurons mature, influencing transcription and the timing of onset for Rett syndrome. Proc. Natl. Acad. Sci. U. S. A. 112, 5509–5514 (2015).
18. Zhang, P., Bergamin, E. & Couture, J.-F. The many facets of MLL1 regulation. Biopolymers 99, 136–145 (2013).
19. Wong, S. J. et al. KDM2B Recruitment of the Polycomb Group Complex, PRC1.1, Requires Cooperation between PCGF1 and BCORL1. Structure 24, 1795–1801 (2016).
20. Baubec, T., Ivánek, R., Lienert, F. & Schübeler, D. Methylation-dependent and -independent genomic targeting principles of the MBD protein family. Cell 153, 480–492 (2013).
