Computational Approaches To Predict Effect Of Epigenetic Modifications On Transcriptional Regulation Of Gene Expression

Sharmi Banerjee

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Electrical Engineering

Pratap Tokekar, Co-chair
Xiaowei Wu, Co-chair
William Baumann
Inyoung Kim
Anil Vullikanti

September 05, 2019
Blacksburg, Virginia

Keywords: Epigenetic factors, gene expression, transcription factors, histone marks, DNA methylation

Copyright 2019, Sharmi Banerjee

(ABSTRACT)

This dissertation presents applications of machine learning and statistical approaches to infer protein-DNA binding in the presence of epigenetic modifications. Epigenetic modifications are chemical alterations that regulate gene expression while leaving the DNA sequence itself unaltered. They are heritable and reversible, and often involve the addition or removal of certain chemical groups. Histone modification is an epigenetic change that alters the histone proteins, thereby changing the structure of chromatin (DNA wound around histone proteins), whereas DNA methylation is the addition of methyl groups to a cytosine base adjacent to a guanine base. Epigenetic factors often interfere in gene expression regulation by promoting or inhibiting the binding of proteins known as transcription factors to the DNA. Transcription is the first step of gene expression, in which a particular segment of DNA is copied into messenger RNA (mRNA). Transcription factors orchestrate gene activity and are crucial for normal cell function in any organism. For example, deletion or mutation of certain transcription factors such as MEF2 has been associated with neurological disorders such as autism and schizophrenia. In this dissertation, different computational pipelines are described that use mathematical models to explain how protein-DNA binding is mediated by histone modifications and DNA methylation in different regions of the brain at different stages of development.

(GENERAL AUDIENCE ABSTRACT)

A cell is the basic unit of any living organism. Cells contain a nucleus that holds DNA, the self-replicating material often called the blueprint of life. To sustain life, cells must respond to changes in their environment. Gene expression regulation, a process in which specific regions of the DNA (genes) are copied into messenger RNA (mRNA) molecules and then translated into proteins, determines the fate of a cell. It is known that various environmental factors (such as diet, stress and social interaction) and biological factors often indirectly affect gene expression regulation. In this dissertation, we use machine learning approaches to predict how certain biological factors interfere indirectly with gene expression by changing specific properties of DNA. We expect our findings will help in understanding the interplay of these factors in gene expression.

Acknowledgments

The dissertation owes so much to so many. Foremost among them is Dr. David Xie, whom I thank for his constant guidance and encouragement, and for giving me enough freedom to pursue my goals for personal development. It has been a privilege and an honor. I would like to thank Dr. Xiaowei Wu for being my mentor throughout my PhD and guiding me more than I could have expected. I would also like to thank the rest of my committee, Dr. Pratap Tokekar, Dr. William Baumann, Dr. Inyoung Kim and Dr. Anil Vullikanti, for their valuable feedback, insightful discussions and hard questions. My sincere thanks also go to Dr. Jason Xuan and Dr. Hamed Sari-Sarraf for getting me excited about research and helping me find my feet during the initial phase of the process. I would like to express my deepest gratitude to Dr. Michel Pleimling for providing me with an assistantship in the last two years of my PhD. It would have been impossible for me to continue my doctoral work without his help. I could not have endured this journey without the support of my friends Sajal, Jiyoung, Meghna, Kwang, Jianlin and Xiaoran, who have become my family in Blacksburg. I would like to thank my parents and sister for keeping my spirits up at times of failure. I thank my sister-in-law, who shared with me her valuable academic experience and helped in critically reviewing this dissertation. Last but not least, I would like to thank my husband, who bolstered me through the peaks and valleys of this journey. This dissertation belongs to him as much as it belongs to me.

Contents

List of Figures

List of Tables

1 Introduction

2 Identification of transcriptional regulatory modules in distinct chromatin states in mouse neural stem cells
2.1 Overview
2.2 Key challenges
2.2.1 Integrating histone and transcription factors
2.2.2 Finding the optimum clustering algorithm
2.2.3 Resolution of the TRMs within chromatin states
2.3 Building blocks of the computational pipeline
2.3.1 Data-sets
2.3.2 Chromatin state identification through genome segmentation
2.3.3 TF binding pattern clustering via Dirichlet Process Mixture of Log Gaussian Cox Processes (DPM-LGCP)
2.4 Case studies
2.4.1 Simulated data
2.4.2 Real data
2.4.3 Functional enrichment of transcription factors within chromatin states
2.4.4 Gene expression and motif analysis
2.5 Results
2.5.1 Genome segmentation and chromatin state identification
2.5.2 Genome annotation enrichment
2.5.3 Relative enrichment around TSS
2.5.4 State sizes
2.5.5 Functional annotation of nucleosome and domain level states
2.5.6 Chromatin state preference of individual TF binding and gene expression regulation
2.5.7 Chromatin state and preferential TF clustering
2.5.8 Estimated clusters in chromatin states
2.5.9 Protein-DNA binding motif preferences in chromatin states
2.6 Known protein-protein interactions predicted by the algorithm
2.7 Robustness of the proposed clustering technique
2.7.1 Effect of varying domain level state sizes
2.7.2 Effect of varying number of initial clusters
2.7.3 Effect of permutation of peaks
2.8 Comparison of clustering results to other methods
2.9 Time complexity analysis
2.10 Discussion

3 Recursive motif analyses identify brain epigenetic regulatory modules
3.1 Overview
3.2 Key challenges
3.2.1 Motif enrichment prediction in DMS
3.3 Building blocks of the computational pipeline
3.3.1 Merging nearby DMS
3.3.2 Clustering DMRs for motif enrichment
3.3.3 Data analysis of Whole Genome Bisulphite Sequencing (WGBS-seq)
3.3.4 Recursive identification of TRMs from motifs
3.3.5 Preparing motif libraries
3.3.6 Analysis of single cell RNA-seq
3.4 Results
3.4.1 A comprehensive motif database compiled for epigenetic regulatory module identification
3.4.2 Genome distribution of hypermethylated CpG sites identified in TET1KO and TET2KO frontal cortices
3.4.3 TRMs identified in differentially methylated clusters
3.4.4 Brain ChIP-seq data support the TRMs predicted
3.4.5 Transcription factors under the influence of DNA methylation
3.5 Discussion

4 Discussion and future work
4.1 TRM analysis with brain single cell methylomes
4.2 Cell type specific PPI obtained through ETRM prediction
4.3 Co-expression of ETRMs in single cell and bulk RNA-seq
4.4 Summary
4.5 Contribution of the dissertation

Bibliography

Appendices

Appendix A First Appendix
A.1 Dirichlet Process mixture of log Gaussian Cox processes
A.1.1 INLA approximation of the LGCP model
A.1.2 Algorithm for posterior inference

List of Figures

2.1 A two-step process to identify chromatin-state-specific transcriptional regulatory modules. diHMM is used to segment the genome into multiple chromatin states, followed by application of the proposed clustering method to identify transcriptional regulatory modules in distinct states. Downstream analyses include gene expression comparison and de-novo motif comparison across different chromatin states.

2.2 TF peak enrichment in different chromatin states based on a threshold requirement of 10 peaks for each TF.

2.3 TF peak enrichment in different chromatin states based on a threshold requirement of 5 peaks for each TF.

2.4 Application of the clustering algorithm on simulated data. (A) True and estimated binding intensities of twenty transcription factors in three clusters. (B) Illustration of the clustering result obtained in (A).

2.5 Application of the clustering algorithm on real data. (A) Distribution of ChIP-seq peaks visualized in IGV window and (B) the corresponding estimated binding intensities of the clusters. The proteins in each cluster are shown with the same color.

2.6 (a) Nucleosome level emission matrix generated by diHMM. Functional annotations of the nucleosome level states are shown in the color bar on the left. Scale varies linearly between 0 and 1. (b) Fractional genome coverage for nucleosome and domain level states. Scale varies logarithmically between 10^-4 and 1. (c) Combined nucleosome-domain fold change obtained by diHMM. Functional annotations of the states are shown in the color bar on the left. Scale varies logarithmically between 0.5 and 50.

2.7 Nucleosome states genome annotation enrichment.

2.8 Nucleosome and domain level state distribution around TSS.

2.9 Quartile box plots of log10 of all nucleosome-level (a), or domain-level (b) state sizes. Box plot whiskers extend to 1.5 times the interquartile range.

2.10 (a) Enrichment (in log scale) of TF peaks in different chromatin states showing binding preference of individual TFs. (b) Comparison of average expression (in log scale) of proximal genes (+/- 2kb from TSS) in different domain level chromatin states. Genes were mapped to the nucleosome-level states for the corresponding domain-level states. (c) Comparison of average expression (in log scale) of proximal genes (+/- 2kb from TSS) mapped to individual TFs in the Broad Promoter state and in (d) the Poised Enhancer state.

2.11 (a), (b) Estimated cluster binding intensities along with the individual TF binding intensities in the Broad Promoter and Poised Enhancer states, respectively. In each figure, the estimated binding intensities of the individual TFs are shown in dotted lines and the estimated binding intensities of the clusters are shown in solid lines. TFs in each cluster are shown in the same color as that of the cluster. The X axis represents the genomic locations mapped on the real line between 0 and 50. The Y axis represents the estimated binding intensities, both for the individual TFs and for the identified clusters. (c), (d) Pairwise protein co-binding probabilities corresponding to (a) and (b) respectively. (e), (f) Comparison of proximal gene expressions (TPM) regulated by the clusters in (a) and (b) respectively. Only those clusters having (1) multiple TFs and (2) proximal genes for at least two TFs are shown in the figure to explain the combinatorial regulation of gene expressions by multiple TFs.

2.12 Gene expression comparison for individual proteins among different chromatin states.

2.13 Gene expression comparison for predicted clusters among different chromatin states.

2.14 Estimated cluster intensities in different chromatin states.

2.15 Estimated cluster intensities in different chromatin states.

2.16 Effect of chromatin states and co-binding partner on binding motifs. (A) De-novo motifs obtained using MEME for ASCL1 are similar to the consensus motif in both Broad Promoter and Polycomb Repressed states although the co-factors of ASCL1 are different in the two states. (B) De-novo motifs obtained using MEME for TCF3 show differences in motifs between the two states with different co-factors. The motifs in the active state resemble the β-catenin/TCF/LEF motif whereas the motifs in the repressed state resemble the E-Box consensus motif.

2.17 Estimated intensity curves of the identified clusters in 10 sub-samples, each with 1,000 randomly selected windows in a Broad Promoter domain (D5).

3.1 Recursive identification of Transcriptional Regulatory Modules (TRMs) in differentially methylated sites.

3.2 Recursive “key” TF and ETRM prediction flowchart.

3.3 Distribution of five motif categories, viz. HOMER, MethylPlus, MethylMinus, Canonical (from MeDReaders) and Methylated (from MeDReaders), among the different DNA binding domains. The domains are arranged in clockwise decreasing order, with the Homeobox domain containing the most motifs from the five categories and the SMAD domain containing the fewest.

3.4 Venn diagram showing the number of shared motifs among the five categories.

3.5 Genome distribution of TET1KO and TET2KO differentially methylated sites.

3.6 Methylation profiles of three DMR clusters identified (A-C) and the corresponding TRMs predicted (D-F) for the TET1KO brain methylome. The range of methylation levels was set as 0 (blue) to 1 (yellow). For each iteration cycle, the number of DMRs included in the analysis is shown in the bar on the right. The key TF identified in each iteration cycle is marked with a double ring. The TFs are colored according to TF family, annotated by DNA binding domain.

3.7 Methylation profiles of three DMR clusters identified (A-C) and the corresponding TRMs predicted (D-F) for the TET2KO brain methylome. The range of methylation levels was set as 0 (blue) to 1 (yellow). For each iteration cycle, the number of DMRs included in the analysis is shown in the bar on the right. The key TF identified in each iteration cycle is marked with a double ring. The TFs are colored according to TF family, annotated by DNA binding domain.

3.8 The enrichment of TF binding sites in TET1KO and TET2KO TRMs. Each color value in the figures represents the estimate of the odds ratio based on the conditional Maximum Likelihood Estimate derived from Fisher’s exact test. For TET2KO, the two NF1 TRMs identified from cluster 1 and cluster 3 were combined and labeled as one NF1-TRM in the bottom panel.

3.9 (A-D) Four types of motifs shown. (A) DNA methylation has little Egr1 motif. (B) No CpG site was observed in the methylated Mef2c motif. (C) Primary motif of Lhx1 contained a CpG site. (D) Primary motif of Sox14 contained a CpG site. (E-H) Gene expression level (Transcripts Per Million or TPM, in log scale) shown for the genes in the predicted TRMs in Figure 3,4. (I-L) Average methylation level of the differentially methylated sites containing the motifs from the predicted TRMs in Figure 3,4. The drop in methylation level at the 6-week and 10-week time points is due to the fact that they were wild type samples in the DMS identification procedure.

4.1 Neurological disorders hypothesized to be due to a relative change in excitatory to inhibitory neuron activity. The two extremes show Rett syndrome, caused by less excitation and more inhibition, and autism, caused by more excitation and less inhibition.

4.2 Applying a recursive motif enrichment on 16 single cell methylome DMRs, 20 key TFs were predicted. (A) Fisher’s exact test showing the enrichment of the TFs in the 16 DMRs. (B)-(G) Methylation profiles across the 16 DMRs occupied by the key TFs show that the DMRs of most excitatory neurons containing the Egr1 motif are hypo-methylated, whereas DMRs containing the Mef2b motif are hypo-methylated in both excitatory and inhibitory neurons. Finally, only specific DMRs containing Nur77, Tgif2 and Bach2 motifs are hypo-methylated. (H) The 20 TFs can be broadly classified as enriched in excitatory neurons only, enriched in both excitatory and inhibitory neurons, or enriched in inhibitory neurons only.

4.3 Cell type specific protein interaction network generated from the predicted ETRMs for three key TFs: Egr1, Mef2b and Bach2. The three nodes represent the three TFs and the edges represent the co-enriched motifs in specific neuron sub-types. Based on the co-enrichment, cell type specific edges and hence sub-networks can be constructed.

4.4 Expression level of general and neuron sub-type specific TFs and some of the co-enriched motifs from the corresponding ETRMs during development in the forebrain. (A) Expression goes up during development, specifically at the postnatal day. (B) Expression reaches its peak during development after the postnatal day. (C) Contrasting patterns in expression observed for the four TFs. Interestingly, although both the Tgif2 and Nur77 motifs were highly significant in the mDL-3 excitatory neuron, the former shows a decrease in expression during development whereas the latter shows an increase.

4.5 Co-expression of specific TFs within single cell RNA-seq data. TFs such as Egr1 and Klf6, and Mef2d and Nf1, are shown to have similar expression levels and are expressed in a similar number of cells. The Klf motif was significantly co-enriched in the Egr1 ETRM, and the Nf1 motif was significantly enriched in the Mef2 ETRM.

List of Tables

S1 Known interactions predicted by the algorithm

S2 Error rates of clustering on a Broad Promoter domain with different numbers of initial clusters

S3 Comparison of clustering results with K-means

Chapter 1

Introduction

In 1859, Darwin proposed that all life on Earth shares a common ancestry. Nearly a century later, the structure of DNA was elucidated, providing a molecular basis for the unity of all life. DNA, short for deoxyribonucleic acid, is the self-replicating genetic material that carries the instructions for the development, growth, reproduction, and functioning of all living organisms. A stretch of DNA that codes for a specific protein, or for an RNA chain with a specific function in the organism, is called a gene and is the basic unit of heredity. Genetic differences in the tissue-specific expression of proteins [81] are known to give rise to the observable differences among species. Ants look different from dogs and dogs look different from humans, but all three species originated from a shared ancestor that lived hundreds of millions of years ago. Such differences are often due to genes being turned on and off at different times and in different tissues.

The set of instructions dictating when and where a gene is expressed is written in sequences of DNA bases within the regulatory region of the gene. These instructions are referred to as the “gene regulatory code”. This code is deciphered by a class of proteins called transcription factors (TFs), which bind to specific sequences of DNA (or “DNA words”) and change the rate at which genetic information is transcribed from DNA to messenger RNA (mRNA), thereby turning genes on or off. This process is known as transcription and is followed by translation, in which mRNA molecules are decoded into chains of amino acids that form proteins. The overall phenomenon, often called gene expression, is thus the process by which DNA sequences are expressed as proteins. Differences in gene expression among species could therefore be due to differences in the instructions present within the regulatory regions of specific genes.

Research in the past few decades has shown the relationship of genetic mutations to various diseases including cancer [80]. In addition, it has been observed that gene expression can be regulated without directly altering the DNA sequence [2], by instead changing certain chemical properties of the DNA. Such changes, known as epigenetic modifications, are caused by epigenetic factors and are reversible and heritable. Several environmental factors, such as diet, social interaction, and toxic chemicals present in the air, are known to be associated with epigenetic modifications and thus alter the genome in either a constructive or a destructive way [1].

Histone modification and DNA methylation are two well-studied epigenetic factors. Histones are proteins found in eukaryotic cell nuclei and are the chief protein components of chromatin1, acting as spools around which DNA winds. Histone modification involves chemical changes to histone proteins through processes such as methylation, phosphorylation, and acetylation, and affects gene expression. Such modifications render the chromatin either ‘open’ (euchromatin), allowing transcription factors to bind to the DNA, or ‘closed’ (heterochromatin), decreasing transcription factor binding. DNA methylation involves the addition or removal of a methyl group to or from the DNA molecule and is known to affect transcription factor binding, thereby promoting or repressing gene expression [123].

In this dissertation, computational pipelines are described that apply machine-learning-based models to ChIP-seq, RNA-seq and DNA methylation data to predict the effect of histone modification and DNA methylation on transcription factor binding. To validate the hypothesis, these models were tested against data collected from brain tissues at different time points of brain development.

1 Nuclear DNA does not appear in free strands but is tightly condensed and wrapped around nuclear proteins to fit inside the nucleus, forming a structure called chromatin.

The first research chapter of this dissertation (Chapter 2) discusses how clusters of transcription factors bind to the DNA under different chromatin states defined by unique combinations of histone modifications [10]. A multi-layer Markov based approach is first used to segment the genome into distinct chromatin states, and then a novel non-parametric Bayesian clustering technique is applied to predict specific groups of TFs that may bind together in an active DNA state versus in a repressed DNA state. This helps in understanding why certain genes are expressed in different states. It is established that transcription factor co-binding is related to the chromatin state of the genome. Data from published literature were used to model the distribution of histone marks and transcription factors and to study the transcriptional regulation of gene expression across the chromatin states. Differences in gene expression levels are identified between genes regulated by a single TF and genes jointly regulated by the predicted transcriptional regulatory modules (TRMs). Such combinatorial regulation of gene expression by transcription factors is important for identifying target genes of specific transcription factors involved in normal cell functions or in diseased conditions. It is also observed that the binding motifs (unique sequences of nucleotide bases that represent a transcription factor) for certain TFs change between two chromatin states. These observations support the dependence of transcription factor binding on the chromatin state of the genome. The key contributions of this chapter are the idea of predicting transcriptional regulatory modules in distinct chromatin states and the assembly of a computational pipeline that combines different mathematical algorithms to implement it.
The work from this chapter has culminated in one publication [10].

The second research chapter (Chapter 3) explores the effect of DNA methylation on transcription factor binding during different stages of brain development and within different brain cell types [11, 39]. DNA methylation is known to interfere directly or indirectly with transcription factor-DNA interactions, often occupying promoters2 and generally hindering TF binding. Although intensive research has been done to predict specific transcription factors (either experimentally or computationally) that are enriched in specific brain regions at certain developmental stages, transcription factor co-enrichment in the presence of DNA methylation has not been systematically explored. The objective here is to identify the specific transcriptional regulatory modules that are enriched in genomic loci with distinct methylation profiles. Such genomic loci are often referred to as differentially methylated sites (DMS) or differentially methylated regions (DMRs), meaning genomic loci with different DNA methylation status across different biological samples or time points. The proposed idea can be summarized in two basic steps: (1) cluster DMRs that share similar methylation levels among different biological samples, and (2) recursively identify distinct transcriptional regulatory modules (TRMs) within each DMR cluster using TF motifs (a set of nucleotide sequences that represent a TF) and a novel motif search algorithm. In the absence of ChIP-seq data, such a systematic identification of TRMs helps in understanding the effect of DNA methylation on transcription factor binding through enrichment of TF motifs and validation via RNA-seq data. The key contribution of this chapter is the design and implementation of a recursive motif enrichment and transcriptional regulatory module identification algorithm.
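The second step can be illustrated with a minimal Python sketch, given under several assumptions. It is not the algorithm developed in this dissertation: it assumes each DMR has already been reduced to the set of motifs found in it, uses a one-sided hypergeometric test for enrichment against a hypothetical set of background regions, and in each round takes the most enriched motif as the “key” TF, reports the motifs co-enriched in the DMRs that carry it, and repeats on the remaining DMRs. The names enrichment_p, motif_p and find_modules, the threshold alpha, and the motif labels in the toy data are all hypothetical.

```python
from math import comb


def enrichment_p(k, n, K, N):
    """One-sided hypergeometric tail P(X >= k).

    k of the n foreground regions carry the motif; K of all N regions do.
    """
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / comb(N, n)


def motif_p(motif, inside, outside):
    """Enrichment p-value of `motif` in `inside` relative to `inside` + `outside`."""
    k = sum(motif in s for s in inside.values())
    K = k + sum(motif in s for s in outside.values())
    return enrichment_p(k, len(inside), K, len(inside) + len(outside))


def find_modules(dmrs, background, alpha=0.05):
    """Recursively peel off one key TF and its co-enriched partners per round.

    `dmrs` and `background` map region ids (assumed distinct) to sets of motifs.
    Returns a list of (key_motif, [partner_motifs]) tuples, one per round.
    """
    motifs = sorted(set().union(*dmrs.values())) if dmrs else []
    if not motifs:
        return []
    p, key = min((motif_p(m, dmrs, background), m) for m in motifs)
    if p >= alpha:  # nothing significant is left: stop recursing
        return []
    with_key = {r: s for r, s in dmrs.items() if key in s}
    rest = {r: s for r, s in dmrs.items() if key not in s}
    outside = {**rest, **background}
    partners = [m for m in sorted(set().union(*with_key.values()) - {key})
                if motif_p(m, with_key, outside) < alpha]
    return [(key, partners)] + find_modules(rest, background, alpha)


# Toy input with invented motif occurrences, for illustration only.
toy_dmrs = {
    "d1": {"TF_A", "TF_B"}, "d2": {"TF_A", "TF_B"}, "d3": {"TF_A", "TF_B"},
    "d4": {"TF_A"}, "d5": {"TF_C", "TF_D"}, "d6": {"TF_C", "TF_D"},
    "d7": {"TF_C"}, "d8": {"TF_E"},
}
toy_background = {f"b{i}": {"TF_E"} for i in range(12)}
print(find_modules(toy_dmrs, toy_background))
# prints [('TF_A', ['TF_B']), ('TF_C', ['TF_D'])]
```

Each round removes every DMR that contains the chosen key motif, so the recursion always terminates. A real analysis would also need motif scanning, multiple-testing correction and a carefully matched background, none of which is shown here.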
A second contribution is the idea of partitioning differentially methylated regions by similarity of methylation profile before identifying significant motifs enriched within these DMRs. The work from this chapter has culminated in two publications [11, 39].

2 Promoters are located near the transcription start sites of genes, upstream on the DNA, and lead to initiation of transcription of a particular gene.

The dissertation concludes with explorations on classifying neuron types into excitatory and inhibitory on the basis of the transcriptional regulatory modules enriched within each type. It is hypothesized that memories are represented by massively connected neural circuits in the brain. The ability of synapses3 to grow stronger or weaker over time in response to high or low activity is defined as synaptic plasticity and is a crucial neurochemical foundation for learning and memory [70]. Specifically, experience-governed synaptic plasticity triggers multiple aspects of memory and is especially vital early in the life cycle of an organism, during important periods of development of cortical circuits when sensory experiences are needed [103]. Upon arrival of an action potential, a neuron releases small molecules known as neurotransmitters. This results in either a positive or a negative change in the membrane potential of the receiving neuron, depending upon the neurotransmitter. Excitatory neurons are those that cause a positive change, and inhibitory neurons are those that cause a negative change. Npas4, an early-response transcription factor, is induced in excitatory neurons upon calcium influx and has been known to control excitatory-inhibitory balance within neural circuits [14]. Significant research is being done to study the ratio of excitatory to inhibitory neuron activity in the brain and its impact on neurological disorders such as autism [33, 79, 85]. As transcription factors work together in groups or pathways, it is important to study the network of transcription factors that are involved in excitatory and inhibitory neurons. A recently published dataset consisting of differentially methylated regions from single-cell methylomes for sixteen different neuron sub-types in mammalian cortex [65] has been integrated with a recursive key TF and transcriptional regulatory module prediction approach to classify the neuron sub-types based on transcriptional regulatory networks.

Several key transcription factors have been identified for each neuron sub-type; these are analogous to hub genes in a protein interaction network, connected with multiple binding partners. Results show that some of the transcription factors can be further classified as specifically enriched within an excitatory (or inhibitory) sub-type or generally enriched in all excitatory (or inhibitory) neurons. Multi-omics data sets such as single-cell methylation and bulk RNA-seq data have been leveraged to predict cell-type specific protein interaction sub-networks. Such a systematic prediction of cell-type specific protein interaction networks within the context of epigenetic regulation will greatly help in gaining new insight into transcriptional regulation driven by epigenetic factors.

3 A synapse is a structure that allows a neuron to transmit an electrical or chemical signal to another neuron or to a target cell.

Chapter 2

Identification of transcriptional regulatory modules in distinct chromatin states in mouse neural stem cells

2.1 Overview

Transcription factors (TFs) are proteins that bind to specific sequences on the DNA and help in the transcription of DNA to messenger RNA by targeting the genes located nearby. Transcription factor binding locations on the DNA can be inferred through a technique known as chromatin immunoprecipitation followed by sequencing (ChIP-seq). Precise detection of regulatory regions located near these protein coding genes is crucial for accurate gene regulation in response to developmental and environmental stimuli. These specific DNA sequences are typically recognized by TFs that recruit the transcriptional machinery. Although this task may appear simple, many TFs recognize similar consensus DNA-binding sites, and a genome may contain thousands of consensus or similar sequences, both functional and nonfunctional.

In a ChIP experiment, formaldehyde cross-links proteins to their bound DNA. Cells are homogenized, and the chromatin is sheared and immunoprecipitated with antibody-bound magnetic beads (the antibodies are specific to the transcription factors). The immunoprecipitated DNA is then used as the input for a next-generation sequencing library preparation protocol and sequenced. Sequencing produces millions of short reads1 covering the transcription factor binding sites (TFBS), which are aligned to a reference genome. After a few pre-processing steps, a peak-calling tool is used to identify the ChIP-seq ‘peaks’ that are the potential TF-DNA binding regions. As compelling as it might be to the research community, ChIP-seq comes with a few challenges: (1) the quality of the antibody used, since a sensitive and specific antibody gives a high level of enrichment; (2) the use of controls in the experiment, since a ChIP-seq peak should be compared with the same region in a matched control (typical controls are input DNA, DNA obtained from IP without antibody, or DNA obtained using an antibody against a protein that is not known to be involved in DNA binding); and (3) the depth of sequencing, since more prominent peaks are identified with fewer reads, whereas weaker peaks require greater depth.

A number of computational tools have been developed to identify TFBS, among which PeakSeq [91] and MACS2 [32] are popular for calling peaks for a single TF ChIP-seq profile. Both of these tools use local Poisson or binomial statistics to identify peaks. With more and more single TF binding information being accrued, recent efforts have been devoted to integrating ChIP-seq data from multiple TFs to uncover TF-TF interactions. Signal Spider [118] uses a Gaussian mixture model to approximate the read intensity of ChIP-seq profiles to identify genomic regions co-regulated by multiple TFs. Sharmin et al. [97] adopted K-mer reverse component frequencies and motif binding scores to identify general and cell-type specific TF binding rules. Cha and Zhou developed a method based on inhomogeneous Poisson processes and Ripley’s K-function to detect TF clustering and ordering patterns [18].
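To make the idea of local Poisson statistics concrete, the following is a minimal Python sketch of a window-level enrichment test. It is an illustration only and not the implementation of MACS2 or PeakSeq: the function names (poisson_sf, call_peaks), the per-window read counts, the flank size and the significance cutoff are all assumptions, and the ChIP and control libraries are assumed to be non-empty and already scaled to the same sequencing depth.

```python
import math


def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_peaks(chip_counts, control_counts, flank=2, alpha=1e-3):
    """Return indices of windows whose ChIP read count is unlikely under the background.

    The background rate for a window is the larger of the genome-wide mean
    control count and the mean control count in the surrounding windows.
    """
    n = len(chip_counts)
    genome_rate = sum(control_counts) / n
    peaks = []
    for i, k in enumerate(chip_counts):
        lo, hi = max(0, i - flank), min(n, i + flank + 1)
        local_rate = sum(control_counts[lo:hi]) / (hi - lo)
        lam = max(genome_rate, local_rate)
        if poisson_sf(k, lam) < alpha:
            peaks.append(i)
    return peaks


# Hypothetical read counts per genomic window (invented for illustration).
chip = [2, 3, 1, 2, 30, 2, 3, 1, 2, 2]
control = [2, 2, 2, 2, 3, 2, 2, 2, 2, 2]
print(call_peaks(chip, control))  # prints [4]
```

Taking the larger of the genome-wide and local control rates, in the spirit of the dynamic background used by MACS-style callers, guards against local biases in the control; a production tool would additionally estimate fragment size, remove duplicate reads and correct for multiple testing.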
Growing genome-wide ChIP studies have also revealed new insights into the interplay between TF binding and histone modifications. In mammalian genomes, TFs could

1a read is an inferred sequence of base pairs corresponding to all or part of a single DNA fragment 2.1. Overview 9 have hundreds of thousands of potential binding sites but only bind to a small subset in a given cell type. In addition, not all binding events would have significant impact on gene expression [67]. Most TFs prefer to bind open chromatin regions which are highly accessible and nucleosome-depleted. Such chromatin regions are often with specific histone modifica- tions enriched in promoters and enhancers, such as H3K4me1 and H3K27ac marks. It was found that histone-modification-dependent TF binding could be protein family specific (Xin et al, Genome Research 2018). Thus, chromatin features identified from DNase-I hypersen- sitivity and histone modifications could be used to predict the binding of TF members from some families (Liu et al., Nucleic Acids Research 2016). On the other hand, a small number of TFs can act as pioneers, which have the ability to reach inaccessible chromatin regions and shape the chromatin landscape to facilitate the binding of other TFs. These pioneer TFs carry DNA binding domains recognizing partial or degenerate motifs on the nucleosome surface [102]. In addition to DNA accessibility, the co-occupancy of additional co-factors would restrict the bindings of some TFs to highly selected loci within accessible chromatin states. Kouzarides found that the affinity of transcription factors to their binding sites may be altered by epigenetic changes in chromatin structure [50]. Liu et al. used histone marks and DNA methylation data to establish the correlation between epigenetic changes and TF binding preferences [62], and in a later work found that co-occupancy of TFs may be pre- dicted using distinct chromatin features identified from DNase-I hypersensitivity, histone modifications, and GC content [63]. 
Recently, semi-automated genome annotation (SAGA) tools have been developed to segment and label the genome into different chromatin states. The underlying hypothesis is that differences in the spatial distribution of histone marks lead to distinct characteristics of the genome. For example, active enhancers are often enriched in H3K27ac and H3K4me1 whereas active promoters are often enriched in H3K27ac and H3K4me3 [99]. Some of the popular SAGA tools are Segway [41], ChromHMM [28], and diHMM [69]. ChromHMM [28] and diHMM [69] use hidden Markov models whereas Segway [41] applies a dynamic Bayesian network to segment the genome and identify distinct chromatin states. Unlike Segway and ChromHMM, which perform genome segmentation and classification at a single scale, diHMM can segment the genome at multiple scales, such as the nucleosome (narrow) and domain (broad) levels. Despite these achievements in genome segmentation, little effort has been made to explore transcription factor binding patterns across distinct chromatin states. To that end, a two-step process (Figure 2.1) is discussed to investigate how chromatin configuration may affect the binding affinity of TFs. In the first step, diHMM, a chromatin state identification tool, is applied to the aligned BAM files of the histone marks to segment the genome into different states. The subsequent step uses a newly developed non-parametric Bayesian clustering method to group TFs that have similar binding patterns in each identified chromatin state into transcriptional regulatory modules.
Based on the identified transcriptional regulatory modules, downstream analyses are performed to compare (a) the expression level of proximal genes regulated by individual TFs and predicted TF clusters and (b) the binding sequences of a TF among different chromatin states. This analytical procedure is then applied to data sets generated from mouse neural stem cells (NSC). The results not only confirm several known transcriptional co-binding rules in different chromatin states, such as JMJD3-SMAD3 in the Promoter state [29], ASCL1-FOXO3 in the Enhancer state [116], and CTCF-SMC1 in the Boundary state [88], but also suggest additional interactions, for instance MIZ1-TCF4 in the Promoter state. It is also shown that the regulatory effects of the predicted transcriptional modules on proximal genes vary across chromatin states. Finally, the de-novo binding sequences compiled from TF peaks were shown to be dependent on the chromatin state of the genome. Such systematic analyses of transcriptional regulation in different chromatin states provide valuable insights into the role of histone marks in regulating TF-TF interactions.

Figure 2.1: A two-step process to identify chromatin-state-specific transcriptional regulatory modules. diHMM is used to segment the genome into multiple chromatin states followed by application of the proposed clustering method to identify transcriptional regulatory modules in distinct states. Downstream analyses include gene expression comparison and de-novo motif comparison across different chromatin states.

2.2 Key challenges

2.2.1 Integrating histone and transcription factors

Sophisticated computational approaches have been developed that use histone modification data to segment the genome into distinct chromatin states. Histone proteins help wrap and condense the DNA into chromatin, and transcription factors tend to bind to open-chromatin DNA. ChIP-seq is used to infer the binding sites of both histone proteins and transcription factors. Thus one approach to jointly study histones and transcription factors is to combine them in a hierarchical model such as ChromHMM [28]. While it is technically possible to jointly analyze histones and transcription factors in one single mathematical model, the idea presents some challenges. 1) Chromatin states are primarily formed by histone modifications. Thus, a computational model alone cannot validate the predicted states from a model that includes both histone modifications and transcription factors. 2) Transcription factors primarily recognize sequences on DNA near gene promoters and bind to those sequences, thereby turning the genes on or off. Histone modifications are found both near gene promoters and hundreds of kilo-base pairs away from gene promoters (in regions known as enhancers), which sometimes results in DNA looping. Such looping sometimes opens up the DNA (open chromatin) and enables binding of transcription factors that otherwise might not bind. Looping brings the ChIP-seq peaks of enhancer histones and transcription factors together, and without considering the 3-D structure or topologically associating domains (TADs) it is difficult to infer the direct or indirect effect of histones on transcription factor binding. Keeping in mind the challenges involved in joint analysis of histones and transcription factors, a two-step computational pipeline was proposed that first segments the genome into distinct chromatin states using the histones and then predicts transcriptional regulatory modules within the chromatin states.

2.2.2 Finding the optimum clustering algorithm

The optimal clustering algorithm is expected to group together transcription factors having similar distributions of binding sites. Approaches such as K-means and Gaussian mixture models (GMMs) [118] can predict genomic regions co-regulated by multiple transcription factors. Several indices can be used to determine the optimum number of clusters in a data-set [19], and the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) can be used for GMMs [17]. However, K-means works well only when the clusters have a spherical shape and are of similar size. A GMM can be thought of as a generalization of K-means that has the advantage of probabilistic cluster assignment. With respect to predicting TRMs, both the intensity and the spatial distribution of the transcription factor binding sites are of interest. This is because two TFs might have approximately the same number of peaks or binding sites while the distribution of the sites is very different. On the other hand, one TF might have many more binding sites than another TF while the overall distribution is similar for both of them. Thus a clustering algorithm should separate the two TFs in the first case but group them in the same cluster in the latter case in order to capture the combinatorial regulation of gene expression by TFs. These conditions make the adoption of an inhomogeneous point process a natural choice [106]. The clustering approach discussed in this dissertation is inspired by previous models [18], with specific improvements as discussed later in detail.
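The two cases described above can be illustrated with a toy comparison. The sketch below is only an illustration and is not the clustering method proposed in this chapter: it bins peak positions on a hypothetical interval [0, 50] and compares the normalized histograms with a total variation distance. The function names, the number of bins, and the peak positions are all assumptions made for the example.

```python
def binned_shape(positions, lo=0.0, hi=50.0, n_bins=5):
    """Return the fraction of peaks falling in each of n_bins equal-width bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for p in positions:
        # Clamp the right edge of the interval into the last bin.
        idx = min(int((p - lo) / width), n_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def shape_distance(pos_a, pos_b):
    """Total variation distance between two normalized peak histograms."""
    a, b = binned_shape(pos_a), binned_shape(pos_b)
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical peak positions on a common interval [0, 50].
tf_a = [2, 4, 6, 8, 41, 43]                          # 6 peaks, mostly near the left end
tf_b = [12, 24, 26, 28, 44, 46]                      # 6 peaks, different layout
tf_c = [1, 2, 3, 4, 5, 6, 7, 8, 9, 41, 42, 43, 44]   # 13 peaks, layout similar to tf_a

same_count_distance = shape_distance(tf_a, tf_b)     # equal counts, different shape
same_shape_distance = shape_distance(tf_a, tf_c)     # different counts, similar shape
```

In this toy setting, tf_a and tf_b have the same number of peaks but a large shape distance, whereas tf_a and tf_c differ in peak count but have nearly identical normalized shapes.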

2.2.3 Resolution of the TRMs within chromatin states

The final challenge pertains to fixing the resolution of the genomic regions during clustering. This is important because a smaller clustering resolution would result in fewer transcription factor binding sites and different predicted TRMs compared to larger clustering resolution.

One approach that was adopted at the beginning consisted of applying the clustering within individual chromatin state windows generated by diHMM [69]. This approach had the limitation that, for most of the genome, only a few TFs had 10 or more binding sites in an individual window. Having too few binding sites made the estimated intensity functions similar for all TFs, resulting in a single large cluster. Initial clustering results on individual windows of the different chromatin states, including Broad Promoter windows, are shown for some TFs in Figure 2.2. The minimum number of required TF binding sites was set to 10. This result shows that only two TFs, JMJD3 and SMAD3, are in one cluster in the Broad Promoter and Boundary chromatin states, whereas a number of clusters are predicted in the Upstream Enhancer state. This is not surprising because Broad Promoter windows were on average small and Upstream Enhancer windows were larger. Consequently, Broad Promoter had few TFs whereas Upstream Enhancer had many TFs. Reducing the minimum number of binding sites for clustering from 10 to 5 increased the number of clusters in Broad Promoter, as shown in Figure 2.3. However, it remained a limitation of the clustering approach that a certain number of peaks or binding sites for a TF was needed to detect clusters with different binding intensity patterns. To overcome this problem, windows for each domain were merged and shifted on an imaginary real line to form contiguous regions of binding sites. This approach enabled detection of unique binding patterns of different TFs within the same cluster and among different clusters, while retaining enough resolution to compare the clustering results among different domain windows of the same chromatin state.
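The merge-and-shift step can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the implementation used in this work: all windows are taken to lie on a single chromosome and to be non-overlapping, peaks are represented by their centers, and the function name and coordinates are hypothetical.

```python
def merge_and_shift(windows, peaks):
    """Place same-chromosome windows end to end and map peaks onto that line.

    windows: list of (start, end) genomic intervals, assumed non-overlapping.
    peaks:   list of peak-center positions on the same chromosome.
    Returns the sorted positions of the peaks that fall inside a window,
    expressed on the merged (gap-free) coordinate line.
    """
    merged = []
    offset = 0  # total length of the windows already placed on the line
    for start, end in sorted(windows):
        for p in peaks:
            if start <= p < end:
                merged.append(offset + (p - start))
        offset += end - start
    return sorted(merged)


windows = [(1000, 1500), (9000, 9200), (20000, 21000)]  # hypothetical state windows
peaks = [1100, 1450, 9100, 5000, 20500]                 # 5000 lies outside all windows
print(merge_and_shift(windows, peaks))  # [100, 450, 600, 1200]
```

Peaks outside every window are dropped, and the gaps between windows disappear, so sparse windows of the same state contribute to one contiguous set of binding sites.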

Figure 2.2: TF peak enrichment in different chromatin states based on a threshold requirement of 10 peaks for each TF

Figure 2.3: TF peak enrichment in different chromatin states based on a threshold requirement of 5 peaks for each TF

2.3 Building blocks of the computational pipeline

2.3.1 Data-sets

ChIP-seq data used in this study were obtained from the ChIP-Atlas [83] database for mouse neural stem cells. Twenty-one proteins were included in this study. Among these proteins, P300 is a co-activator, RAD21 is a subunit of the cohesin complex, SMCHD1 is a non-canonical member of the SMC protein family, NUP153 is one of the building blocks of the nuclear pore complex, JMJD3 is a lysine-specific demethylase that demethylates H3K27me2 or H3K27me3, KDM1A is also a histone demethylase that demethylates both H3K4me and H3K9me, RNF2 is an E3 ubiquitin-protein ligase that mediates monoubiquitination of histone H2A at lysine 119 (H2AK119ub), BMI1 is a ring finger protein and a major component of the polycomb group complex 1 (PRC1), and SMAD3 and SMAD4 are signal transducers and transcriptional modulators mediating multiple signaling pathways. The remaining proteins include several other transcription factors such as OLIG2, FOXO3, ASCL1, MAX, NFIC, TCF3, SOX2, SOX21, SOX9, and POU5F1, many of which are known to play key roles in neural stem cells [53, 72]. Three major pre-processing steps were conducted: alignment to the reference genome using Bowtie2 [55] in default mode, conversion from SAM to BAM format using SAMtools [58] version 0.1.19, and peak calling using MACS2 [32] version 2.1.0 to obtain binding regions of the proteins. Gene expression data were downloaded from GSE70872 (untreated samples).

2.3.2 Chromatin state identification through genome segmentation

diHMM [69] is a hidden Markov model based tool that can model the presence or absence of a histone mark to a high degree of accuracy. It can segment and annotate the genome into different chromatin states at multiple scales by modeling the genome-wide distribution of histone marks. By default, diHMM has two scales of classification: (a) nucleosome level, which has finer-resolution chromatin state windows of around 200 base pairs (bp) in length, and (b) domain level, which is formed by stitching together similar nucleosome-level windows and thus has broader chromatin state windows extending over 100kbp-long regions [69]. As explained by the diHMM authors, the domain-level states identified by diHMM are able to recapitulate known patterns in the chromatin literature and capture functional differences among diverse regulatory elements [69]. The first step in identifying chromatin states is to binarize uniquely aligned BAM files, which is implemented in ChromHMM [28], a predecessor of diHMM. The diHMM software provides several nucleosome and domain level statistics, including nucleosome-level emissions, combined nucleosome-level fold enrichment for each domain, fractional genome coverage of each nucleosome and domain level state, and nucleosome and domain state lengths. These statistics, together with the relative distance of nucleosome and domain level states from the transcription start site (TSS) and the enrichment of nucleosome level states in genomic regions, were jointly analyzed to annotate each nucleosome and domain level state with a biologically relevant functional category.
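The idea behind binarizing aligned reads into presence or absence calls can be sketched with a simple Poisson background test. The sketch below is only loosely in the spirit of that step and should not be read as the ChromHMM or diHMM implementation: the bin size, the p-value cutoff, the use of the genome-wide mean as background, and all function names are assumptions made for illustration.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed from the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def binarize(read_starts, genome_length, bin_size=200, p_cutoff=1e-4):
    """Return one 0/1 call per bin: 1 if the bin's read count is unexpectedly high."""
    n_bins = math.ceil(genome_length / bin_size)
    counts = [0] * n_bins
    for pos in read_starts:
        counts[pos // bin_size] += 1
    lam = sum(counts) / n_bins  # genome-wide mean count per bin as background rate
    return [1 if poisson_upper_tail(c, lam) < p_cutoff else 0 for c in counts]


# Hypothetical toy track: 12 reads pile up in one bin, 3 reads are scattered.
reads = [650] * 12 + [10, 1100, 1700]
calls = binarize(reads, genome_length=2000)
```

With these toy values only the bin that contains the pile-up of 12 reads is called as present; bins with a single read are consistent with the background rate.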

2.3.3 TF binding pattern clustering via Dirichlet Process Mixture of Log Gaussian Cox Processes (DPM-LGCP)

The input to the proposed clustering algorithm is the set of binding regions generated by a peak calling tool (MACS2 in this case). Using these regions as input, the center of each TF peak was treated as a binary binding event, and these binding events were modeled along the genome by an inhomogeneous Poisson process (IP) for the following reasons: (i) a binding site falling into a minuscule interval is a rare event ($n \to \infty$, $p \to 0$, $np \to \lambda$), and (ii) the non-uniform distribution of TF peaks at different locations can be well characterized by the intensity function of the inhomogeneous Poisson process. For a TF with $n$ binding site locations, these locations were mapped to points in a closed interval $D$ on the real line, denoted by $S = \{s_1, \ldots, s_n\}$. Following the inhomogeneous Poisson process model setting, the likelihood of observing $S$ can be written as [100]

$$
f(S \mid \lambda(s)) = \exp\left\{ |D| - \int_D \lambda(s)\,ds \right\} \prod_{j=1}^{n} \lambda(s_j), \tag{2.1}
$$

where $\lambda(s)$, $s \in D$, is the intensity function. The Poisson process likelihood (2.1) provides the basis for non-parametric clustering of TFs based on their binding patterns, resulting in identification of modules of co-binding TFs sharing similar regulatory functions. For a given ChIP-seq data-set of $N$ TFs coming from $K$ clusters (with $K$ unknown), it is assumed that TFs in the same cluster share a common intensity function, distinct from those in other clusters. This constitutes a point process clustering problem with an unknown number of clusters. A Dirichlet process mixture of log Gaussian Cox processes (DPM-LGCP) model is proposed, which places a Dirichlet process (DP) prior on the latent log intensity functions to facilitate clustering of the intensity functions. Let $S_i$ denote the binding site locations of the $i$th TF. The DPM-LGCP model can be described as follows:

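$$
\begin{aligned}
S_i \mid \lambda_i(s) &\sim \mathrm{IP}(\lambda_i(s)), \quad s \in D, \quad i = 1, \ldots, N, \\
\log(\lambda_i(s)) &= z_i(s), \quad z_i(s) \sim G, \\
G &\sim \mathrm{DP}(m, G_0), \quad G_0 = \mathrm{GP}(0, C_\theta),
\end{aligned}
\tag{2.2}
$$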

where $G$ is a random distribution with a DP prior. The DP prior is characterized by two parameters $m$ and $G_0$, where $m$ is the precision parameter and $G_0$ is the base measure. The base measure $G_0$ is assumed to be a Gaussian process with mean 0 and covariance kernel $C_\theta(\cdot, \cdot)$, and $\theta$ contains parameters that control the shape of the covariance kernel.

The introduction of this DP prior on the latent log intensity functions naturally facilitates clustering of the N point processes based on their intensity functions. With this model, neither the number of clusters nor an ad-hoc distance measure between two point processes needs to be specified.
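To make the role of the intensity function concrete, the logarithm of the likelihood in Equation (2.1) can be evaluated directly when the intensity is piecewise constant. The sketch below is illustrative only: the piecewise-constant form, the function name, and the numeric values are assumptions, and the model in this chapter uses a smooth log-Gaussian intensity instead.

```python
import bisect
import math


def log_likelihood(points, edges, rates):
    """Log of Eq. (2.1) for a piecewise-constant intensity.

    edges: bin boundaries e_0 < e_1 < ... < e_B covering the interval D.
    rates: positive intensity value on each bin [e_b, e_{b+1}).
    """
    length_d = edges[-1] - edges[0]  # |D|
    # The integral of the intensity over D is a sum of rate * bin width.
    integral = sum(rate * (edges[b + 1] - edges[b]) for b, rate in enumerate(rates))
    log_prod = 0.0
    for s in points:
        # Index of the bin containing s; the right end of D goes to the last bin.
        b = min(bisect.bisect_right(edges, s) - 1, len(rates) - 1)
        log_prod += math.log(rates[b])
    return length_d - integral + log_prod


edges = [0.0, 25.0, 50.0]    # D = [0, 50] split into two halves
sites = [5.0, 10.0, 40.0]    # hypothetical binding site locations
ll_left_heavy = log_likelihood(sites, edges, [0.2, 0.02])
ll_right_heavy = log_likelihood(sites, edges, [0.02, 0.2])
```

Because two of the three sites lie in the left half, the left-heavy intensity yields the larger log-likelihood. With a unit intensity everywhere the expression evaluates to zero, which reflects that Equation (2.1) is a density relative to the unit-rate Poisson process.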

To overcome the difficulty of calculating the marginal likelihood of the point process $S_i$, approximate and efficient posterior inference using the Integrated Nested Laplace Approximations (INLA) package [93, 100] has been employed. The INLA approximation of the LGCP links the latent Gaussian process $z_i(s)$ to a discrete spatial Gaussian Markov random field $z = (z_1, \ldots, z_P)$ [60, 100, 109]. By doing so, the continuous covariance kernel of $z_i(s)$ can be transformed into a discrete precision matrix of the B-spline basis coefficients on a regular grid, which enables very fast covariance computation [92]. Finally, posterior inference on the assignment of TFs into clusters can be performed through a Markov chain Monte Carlo (MCMC) algorithm using Neal's Gibbs sampler [77].

An equivalent representation of the above DPM-LGCP model is based on the stick-breaking representation of the DPM model. Let $c_i \in \{1, 2, \ldots\}$ be a latent variable indicating the cluster assignment of the $i$th point process, i.e. $c_i = k$ means that process $i$ comes from the $k$th cluster. It is assumed that point processes from the same cluster share a common intensity function. Let $\{\lambda^*_k\}$ denote the unique intensity functions. The stick-breaking representation of the DPM-LGCP model is written as

$$
\begin{aligned}
S_i \mid c_i, \lambda^*_{c_i}(s) &\sim \text{inhomo-Poisson}(\lambda^*_{c_i}(s)), \quad i = 1, \ldots, N; \\
\Pr\{c_i = k\} &= \pi_k, \quad k = 1, 2, \ldots; \\
\pi_k &= V_k \prod_{l=1}^{k-1} (1 - V_l), \quad V_k \sim \mathrm{Beta}(1, m); \\
\lambda^*_k(s) &= \exp\{z^*_k(s)\}, \quad z^*_k(s) \sim \mathrm{GP}(0, C_\theta).
\end{aligned}
$$
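The stick-breaking construction of the cluster weights can be sketched in a few lines. The example below draws only the first few weights of the infinite sequence and is a hedged illustration: the truncation level, the seed, the value of the precision parameter, and the function name are assumptions, and the sketch does not perform any posterior inference.

```python
import random


def stick_breaking_weights(m, n_sticks, rng):
    """Draw the first n_sticks weights pi_k = V_k * prod_{l<k} (1 - V_l)."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a cluster
    for _ in range(n_sticks):
        v = rng.betavariate(1.0, m)  # V_k ~ Beta(1, m)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights, remaining


rng = random.Random(0)  # fixed seed so the sketch is reproducible
weights, leftover = stick_breaking_weights(m=1.0, n_sticks=10, rng=rng)
```

The drawn weights and the leftover stick length always sum to one. A smaller precision parameter m tends to put most of the mass on the first few clusters, while a larger m spreads it over more clusters.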

With the model described above, approximate Bayesian inference can be performed using the Integrated Nested Laplace Approximations (INLA) package [93]. In particular, the INLA package provides the approximated marginal likelihood, which can be combined with the DPM sampling. Details of the approximation and the computational algorithm are provided in the Appendix.

2.4 Case studies

2.4.1 Simulated data

Three 1-D inhomogeneous Poisson processes with log-Gaussian intensity were generated. Next, binding site locations of 20 transcription factors were simulated by drawing independent samples from these inhomogeneous Poisson processes. In Figure 2.4, the solid curves denote the true intensities of the three processes.
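One standard way to draw locations from a 1-D inhomogeneous Poisson process is thinning of a homogeneous process. The sketch below illustrates that technique under stated assumptions and is not the simulation code used for Figure 2.4: the intensity is a fixed, hypothetical bump rather than a draw from a Gaussian process, and the interval, bound, seed, and function names are chosen only for the example.

```python
import math
import random


def simulate_inhomogeneous_poisson(intensity, lo, hi, lam_max, rng):
    """Simulate points on [lo, hi] by thinning a homogeneous process of rate lam_max.

    Requires intensity(s) <= lam_max for every s in [lo, hi].
    """
    points = []
    s = lo
    while True:
        s += rng.expovariate(lam_max)  # next candidate from the homogeneous process
        if s > hi:
            break
        if rng.random() < intensity(s) / lam_max:  # keep with probability lambda(s) / lam_max
            points.append(s)
    return points


def bump_intensity(s):
    """Hypothetical intensity: the exponential of a smooth log-intensity bump at s = 15."""
    return math.exp(1.0 - ((s - 15.0) / 6.0) ** 2)


# The bump peaks at e (about 2.72), so 3.0 is a valid upper bound on [0, 50].
points = simulate_inhomogeneous_poisson(bump_intensity, 0.0, 50.0, 3.0, random.Random(1))
```

Repeating the call with different random generators gives independent realizations from the same intensity, which mirrors how several simulated TFs can share one cluster-level intensity.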

Using the simulated binding site locations, the TFs were assigned according to the cluster setting. The estimated clustering results after ten MCMC iterations are shown in the summary below:

1. initial error rate: 0.3947

2. estimation error rate: 0

3. overall true marginal likelihood: 76.7315

4. overall estimated marginal likelihood: 76.73146

5. true marginal likelihood per cluster: 27.2765, 14.9562, 34.4986

6. estimated marginal likelihood per cluster: 27.2765, 14.9562, 34.4986

Figure 2.4: Application of the clustering algorithm on simulated data. (A) True and estimated binding intensities of twenty transcription factors in three clusters. (B) Illustration of the clustering result obtained in (A).

2.4.2 Real data

An example of clustering on real data is shown in Figure 2.5A,B. The left panels show the ChIP-seq peak distribution of proteins obtained from ChIP-Atlas in a specific genomic region, visualized using the IGV software. The proposed algorithm uses these genome locations as input, re-scales them to an imaginary real line (from 0 to 50), and estimates the binding intensity of each TF using an inhomogeneous Poisson point process. Finally, proteins sharing similar intensity patterns are clustered together. The right panels show the estimated cluster intensities. The values on the Y-axis represent the magnitude of the intensity function, and the X-axis represents the bins of mapped genome locations on the real line. Proteins belonging to a cluster are marked in the same color as that cluster.
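The re-scaling step can be sketched as a linear map from a genomic region onto the interval [0, 50]. The example below is a simplified illustration: the region boundaries, the peak positions, and the function names are hypothetical, and the binned counts are only a crude histogram stand-in for the smooth intensity estimated by the model.

```python
def rescale(positions, region_start, region_end, target_max=50.0):
    """Linearly map genomic positions in [region_start, region_end] to [0, target_max]."""
    span = region_end - region_start
    return [(p - region_start) / span * target_max for p in positions]


def crude_intensity(scaled, target_max=50.0, n_bins=10):
    """Peaks per unit length in each bin of the re-scaled line."""
    width = target_max / n_bins
    counts = [0] * n_bins
    for s in scaled:
        counts[min(int(s / width), n_bins - 1)] += 1
    return [c / width for c in counts]


# Hypothetical peak centers inside a 1 Mb region starting at position 1,000,000.
peak_centers = [1_020_000, 1_040_000, 1_520_000]
scaled = rescale(peak_centers, 1_000_000, 2_000_000)  # approximately [1.0, 2.0, 26.0]
rates = crude_intensity(scaled)
```

Mapping every region to the same interval makes intensities comparable across regions of different genomic length.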

2.4.3 Functional enrichment of transcription factors within chromatin states

Enrichment of each TF within each chromatin state was estimated to analyze the positional preference of TF binding. Enrichment of a particular TF in a particular chromatin state is calculated as (m/n)/(M/N), where m is the number of peaks of the TF in that state, n is the total number of peaks of the TF in all states, M is the size of the state, and N is the total size of all states. Functional enrichment analysis of TF peaks identifies specific chromatin states that are highly enriched in TF binding [4] as well as states that are depleted in TF binding, and thus provides insight into the genome-wide binding preferences of TFs governed by different histone marks.
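The enrichment ratio defined above translates directly into a small function. The sketch below simply evaluates (m/n)/(M/N); the function name and the numbers in the example are hypothetical and are not taken from the data analyzed in this chapter.

```python
def enrichment(m, n, M, N):
    """Return (m / n) / (M / N).

    m: number of peaks of the TF in the state
    n: total number of peaks of the TF in all states
    M: size of the state (for example in base pairs)
    N: total size of all states (same unit as M)
    """
    if n <= 0 or M <= 0 or N <= 0:
        raise ValueError("n, M, and N must be positive")
    return (m / n) / (M / N)


# Hypothetical example: 30 of 100 peaks fall in a state covering 5% of the total size,
# so the peaks are about six times denser there than a uniform spread would give.
value = enrichment(m=30, n=100, M=5_000_000, N=100_000_000)
```

Values above 1 indicate that the TF binds the state more often than expected from the state's size alone, and values below 1 indicate depletion.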

Figure 2.5: Application of the clustering algorithm on real data. (A) Distribution of ChIP-seq peaks visualized in IGV window and (B) the corresponding estimated binding intensities of the clusters. The proteins in each cluster are shown with the same color.

2.4.4 Gene expression and motif analysis

In this study, the overall expression of genes was compared among different chromatin states, in addition to the regulatory effect of individual TFs and predicted TF modules on genes within each chromatin state. RSEM [57] was used to estimate gene expression in transcripts per million (TPM). Each gene had three expression values obtained from triplicates. For comparison of gene expression among chromatin states, genes were selected by proximity (+/- 2kb from TSS) to nucleosome-level state windows for the corresponding domain-level state windows. For comparison of gene expression regulated by different TFs within a specific chromatin state, genes were selected by proximity (+/- 2kb from TSS) to TF peaks. Expression levels of each gene over replicates were averaged and log transformed. The effect of chromatin states on TF motifs [7] was also explored, since the nucleotide sequence of transcription factor binding sites is a critical factor for recruiting both the primary TF and its co-factors. MEME [5] software was used to discover de-novo motif sequences of TFs, and those sequences were compared to the consensus TF motifs from the HOMER [40] motif library.
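The gene selection and expression summary described above can be sketched as follows. This is a minimal illustration under several assumptions that the text does not specify: peaks are represented by their centers on a single chromosome, the log transform is taken as log2 with a pseudocount of 1, and the gene names, coordinates, and function names are hypothetical.

```python
import math


def genes_near_peaks(tss_by_gene, peak_centers, max_dist=2000):
    """Return genes whose TSS lies within max_dist bp of at least one peak center."""
    return sorted(
        gene
        for gene, tss in tss_by_gene.items()
        if any(abs(tss - p) <= max_dist for p in peak_centers)
    )


def mean_log_expression(tpm_replicates, pseudocount=1.0):
    """Average TPM over replicates, then log2-transform (base and pseudocount assumed)."""
    mean_tpm = sum(tpm_replicates) / len(tpm_replicates)
    return math.log2(mean_tpm + pseudocount)


tss_by_gene = {"geneA": 10_000, "geneB": 50_000, "geneC": 12_500}  # hypothetical TSS positions
selected = genes_near_peaks(tss_by_gene, peak_centers=[11_000, 90_000])
summary = mean_log_expression([3.0, 7.0, 11.0])  # triplicate TPM values for one gene
```

Here geneA and geneC are selected because their TSSs lie within 2 kb of the first peak, and the triplicate values average to 7 TPM before the transform.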

2.5 Results

2.5.1 Genome segmentation and chromatin state identification

As described in the methods section, diHMM segments a genome into distinct chromatin states and outputs the states as regions within two BED files labeled by nucleosome and domain indexes (e.g. N1, N2, ... and D1, D2, ..., respectively). For the nucleosome level states, annotation of the chromatin states to functionally relevant categories was performed by using information from the emission probabilities of the nucleosome states (Figure 2.6), fractional genome coverage (Figure 2.6), relative enrichment in different genomic regions (Figure 2.7), and distribution of nucleosome states around TSS (Figure 2.8A). Similarly, by comparing the nucleosome-level fold enrichments in each domain level state and the distribution of the domain level states around TSS (Figure 2.8B), the domain-level states were further grouped into different broader functional categories as shown in Figure 2.6.

2.5.2 Genome annotation enrichment

CpG island sites, RefSeq transcripts, and Repeats for the mm10 genome were downloaded from the UCSC database to calculate the fraction of genomic annotation for each nucleosome-level state (Figure 2.7). Active Promoter states were enriched in the 5' UTR regions, whereas the Transcriptional Elongation states were enriched in the Intron regions. In addition, the Repetitive state was highly enriched in the Satellite regions.

2.5.3 Relative enrichment around TSS

Relative enrichment of the nucleosome level states within a maximum of +/- 4kb, and of the domain level states within a maximum of +/- 80kb, around transcription start sites (TSS) was calculated. From the spatial distribution of the nucleosome level and domain level states (Figure 2.8), the Active Promoters and Bivalent Promoters at nucleosome level and the Broad Promoters at domain level are enriched around the TSS. On the other hand, Transcriptional Elongation at nucleosome level and Upstream Enhancer at domain level are enriched away from the TSS. Repetitive and Low Signal states are enriched sporadically on either side of the TSS.

Figure 2.6: (a) Nucleosome level emission matrix generated by diHMM. Functional annotations of the nucleosome level states are shown in the color bar on the left. Scale varies linearly between 0 and 1. (b) Fractional genome coverage for nucleosome and domain level states. Scale varies logarithmically between $10^{-4}$ and 1. (c) Combined nucleosome-domain fold change obtained by diHMM. Functional annotations of the states are shown in the color bar on the left. Scale varies logarithmically between 0.5 and 50.

Figure 2.7: Nucleosome states genome annotation enrichment.

Figure 2.8: Nucleosome and domain level state distribution around TSS.

2.5.4 State sizes

From the nucleosome and domain state sizes (Figure 2.9), the Upstream Enhancer and Low Signal states at domain level have wider state sizes compared to the Broad Promoter and Super Enhancer states. The Low Signal states not only have larger fractional genome coverage at both nucleosome and domain levels (Figure 2.6B) but also have in general larger state sizes. The Low Coverage states, on the other hand, have very small state sizes. Some of the Broad Promoter states also have smaller state sizes.

2.5.5 Functional annotation of nucleosome and domain level states

The nucleosome and domain level states identified by diHMM were annotated into functional states based on various statistics. Among the nucleosome level states there were: Active Promoters (N1-N5: enriched in H3K27ac, H3K4me3, H3K4me2, and enriched around TSS), Promoter Flanking states (N6-N8: enriched in H3K27ac, H3K4me3, and H3K4me2 flanking TSS), Bivalent Promoter (N9 and N10: enriched in H3K27me3 and H3K4me2 or H3K4me3), Poised Enhancer (N11 and N12: strong enrichment in H3K27me3, and weak enrichment in H3K4me1), Strong Enhancer (N13: enriched in H3K27ac and H3K4me1), Weak Enhancer (N14: enriched in H3K4me1), Transcribed Enhancer (N15-N17: enriched in H3K36me3 and sometimes in H3K4me3), Transcriptional Elongation (N18 and N19: enriched in H3K36me3), CTCF promoter (N20 and N21: enriched in CTCF and enriched around TSS), CTCF (N22: enriched in CTCF), H4K20me1 (N23: enriched only in H4K20me1), Polycomb Repressed (N24: enriched only in H3K27me3), Heterochrom/Low Signal (N25-N27: low enrichment in almost all marks and spanning a large proportion of the genome, Figure 2.6), Repetitive (N28: enriched in multiple marks and in Satellite regions, Figure 2.7), H3K9ac (N29: enriched only in H3K9ac), and Low Coverage (N30: low fractional genome coverage, Figure 2.6B).

Figure 2.9: Quartile box plots of log10 of all nucleosome-level (a), or domain-level (b) state sizes. Box plot whiskers extend to 1.5 times the interquartile range.

Among the domain level states there were: Broad Promoter (D1-D7: enriched in active promoter and promoter flanking states), Bivalent Promoter (D8-D12: enriched in bivalent promoter, poised enhancer, and Polycomb Repressed states), Poised Enhancer (D13: enriched in poised enhancer), Super Enhancer (D14: enriched in active enhancer), Upstream Enhancer (D15-D17: enriched in weak enhancer and upstream from TSS), Transcribed (D18-D20: enriched in the transcribed elongation state and distributed over a broad region from TSS), Boundary (D21: enriched in CTCF domain), Polycomb Repressed (D22-D24: enriched in Poised Enhancer and Polycomb Repressed), Low Signal (D25-D26: not enriched in any state), and Low Coverage (D27-D30: low fractional genome coverage). In addition, at both nucleosome and domain level the Low Signal state had the largest state sizes whereas the Low Coverage state had the smallest state sizes (Figure 2.9).
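For downstream scripting, the domain-level annotation listed above can be captured as a small lookup table. The sketch below encodes only the state ranges and category names given in the text; the data structure and the function name are illustrative choices, not part of the diHMM software.

```python
# (first index, last index, functional category) for the domain-level states D1-D30.
DOMAIN_STATE_RANGES = [
    (1, 7, "Broad Promoter"),
    (8, 12, "Bivalent Promoter"),
    (13, 13, "Poised Enhancer"),
    (14, 14, "Super Enhancer"),
    (15, 17, "Upstream Enhancer"),
    (18, 20, "Transcribed"),
    (21, 21, "Boundary"),
    (22, 24, "Polycomb Repressed"),
    (25, 26, "Low Signal"),
    (27, 30, "Low Coverage"),
]


def domain_label(state):
    """Map a domain-level state name such as 'D14' to its functional category."""
    index = int(state.lstrip("D"))
    for first, last, label in DOMAIN_STATE_RANGES:
        if first <= index <= last:
            return label
    raise ValueError(f"unknown domain-level state: {state}")


labels = [domain_label(f"D{i}") for i in range(1, 31)]
```

A lookup of this kind makes it straightforward to group the windows of several domain-level states under one functional category before merging them for clustering.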

2.5.6 Chromatin state preference of individual TF binding and gene expression regulation

To analyze the distribution of protein-DNA binding sites in each chromatin state, ChIP-seq data were integrated with the chromatin state map of mouse neural stem cells (NSCs) (Figure 2.10). For most proteins, the binding events occur in open chromatin regions, although some pioneer transcription factors have the ability to bind directly to condensed chromatin and recruit co-factors [24, 102, 125]. In both active and repressed states, enrichment of pioneer TFs as well as other proteins (that might have been recruited by the former) was observed. BMI1, which is known to bind to regions marked by both H3K27me3 and H3K4me3 [12], was found to be highly enriched in the Bivalent Promoter and Poised Enhancer states (Figure 2.10). In addition, most TFs were found to be enriched in the Super Enhancer states, except for RAD21, BMI1, SMCHD1, and NUP153. A similar observation was made by the authors of [72], who showed that OLIG2, the NFI family, SOX2, SOX9, TCF3, FOXO3, ASCL1, SOX21, and MAX were associated with active enhancer regions. Next, to study the regulatory effect of histone marks on proximal genes, expression levels of genes with promoters were compared among different chromatin states. The proximal genes in the Broad Promoter state had a higher median expression than those in the Polycomb Repressed or Low Coverage states (Figure 2.10B). To understand the influence of chromatin states on transcriptional regulation, genes were grouped in each state based on the presence of different TF binding sites surrounding their TSS. As expected, the median expression of the genes in the states rich in active regulators was higher and more consistent than that in the repressed states, where a differential gene expression pattern was observed (Figure 2.10C,D, and Figure 2.12).

Figure 2.10: (a) Enrichment (in log scale) of TF peaks in different chromatin states showing binding preference of individual TFs. (b) Comparison of average expression (in log scale) of proximal genes (+/- 2kb from TSS) in different domain level chromatin states. Genes were mapped to the nucleosome-level states for the corresponding domain-level states. (c) Comparison of average expression (in log scale) of proximal genes (+/- 2kb from TSS) mapped to individual TFs in the Broad Promoter state and in (d) the Poised Enhancer state.

2.5.7 Chromatin state and preferential TF clustering

The distributions of ChIP-seq peaks across distinct chromatin states indicate that functionally relevant proteins may have similar binding patterns. Co-occupancy of proteins in a specific chromatin state was determined through a non-parametric Bayesian clustering approach that identifies the combinatorial binding patterns of proteins. Each state at the domain level had multiple windows over different chromosomes across the genome. As discussed in the challenges section, most windows contained very few peaks, although the average domain-level window length ranged from 3.8 kb to over 450 kb. This prevented prediction of modules within a single domain window. To ensure that the unique properties of the domain-level states were preserved during clustering, all windows of a single domain-level state (e.g. D1) were merged across the entire genome, and the genome positions were mapped to a common interval [0, 50] on an imaginary real line. Adopting this approach for all domain-level states eliminated the problem that different domains may have different sizes. Next, for each domain-level state, the proposed algorithm used these mapped binding locations, computed the individual binding intensity of each protein, and clustered proteins with similar intensity patterns together to construct transcriptional regulatory modules. This process was repeated for each domain-level state.

The predicted regulatory modules in different chromatin states are visualized by showing the estimated binding intensities of the proteins and the corresponding clusters in Figure 2.11 and in Figures 2.14 and 2.15. Clustering results in two contrasting states, Broad Promoter (Figure 2.11(a)) and Poised Enhancer (Figure 2.11(b)), showed noticeable differences in the shape of the binding intensity of both individual proteins and the predicted clusters. In addition, the set of co-factors for different proteins varied between the two states.
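The merging of domain-level windows and the rescaling of genome positions to [0, 50] can be illustrated with a minimal sketch. The text does not spell out the exact mapping, so the version below makes two assumptions: windows of one state are concatenated in sorted (chromosome, start) order, and positions are rescaled linearly by the total window length. The function name `map_peaks_to_interval` and the tuple layout are hypothetical.

```python
from typing import List, Tuple

Window = Tuple[str, int, int]  # (chromosome, start, end), end exclusive


def map_peaks_to_interval(windows: List[Window],
                          peaks: List[Tuple[str, int]],
                          upper: float = 50.0) -> List[float]:
    """Concatenate the windows of one domain-level state and rescale
    peak positions linearly onto [0, upper].

    Assumption (not from the source): windows are concatenated in
    sorted (chromosome, start) order before rescaling.
    """
    ordered = sorted(windows)
    total = sum(end - start for _, start, end in ordered)
    if total <= 0:
        raise ValueError("windows must have positive total length")

    mapped = []
    for chrom, pos in peaks:
        offset = 0  # combined length of the windows that precede the match
        for w_chrom, start, end in ordered:
            if w_chrom == chrom and start <= pos < end:
                mapped.append(upper * (offset + pos - start) / total)
                break
            offset += end - start
        # peaks that fall outside every window of this state are skipped
    return mapped
```

With two windows of 100 bp each, a peak at the midpoint of the first window lands at 12.5 and a peak at the midpoint of the second lands at 37.5, so states with very different window sizes end up on the same scale.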
BMI1 is known to bind to repressed and poised states [12] and was predicted as a single-protein cluster in the Poised Enhancer (Figure 2.11(b)) and Bivalent Promoter states (Figure 2.14). In other states such as Broad Promoter, Super Enhancer, and Upstream Enhancer, BMI1 was predicted with RNF2, RAD21, or SMCHD1 (Figures 2.14 and 2.15). It is worth noting that both BMI1 and RNF2 are components of the Polycomb group multi-protein complex, whereas SMCHD1, a non-canonical member of the SMC super-family, is also known to be associated with transcriptional repression [20] and polycomb recruitment mechanisms [34]. The proposed approach was able to cluster several other functionally relevant proteins that shared similar binding patterns. Examples include JMJD3-SMAD3 (Figure 2.11) in most chromatin states (in [29], the authors found that JMJD3 is recruited to gene promoters by SMAD3 in neural stem cells and is essential to activate TGF-β-responsive genes) and FOXO3-NFIC-SOX-TCF3 (Figure 2.14) in Upstream Enhancer states (in [72], the authors showed interactions among the NFI family, TCF3, SOX2, SOX9, and FOXO3). Additional predicted protein-protein interactions are shown in Table S1.

Figure 2.11: (a), (b) Estimated cluster binding intensities along with the individual TF binding intensities in the Broad Promoter and Poised Enhancer states, respectively. In each figure, the estimated binding intensities of the individual TFs are shown as dotted lines and the estimated binding intensities of the clusters are shown as solid lines. TFs in each cluster are shown in the same color as that of the cluster. The X axis represents the genomic locations mapped on the real line between 0 and 50. The Y axis represents the estimated binding intensities, both for the individual TFs and for the identified clusters. (c), (d) Pairwise protein co-binding probabilities corresponding to (a) and (b), respectively. (e), (f) Comparison of proximal gene expression (TPM) regulated by the clusters in (a) and (b), respectively. Only those clusters having (1) multiple TFs and (2) proximal genes for at least two TFs are shown in the figure to explain the combinatorial regulation of gene expression by multiple TFs.

To assess the strength of association between two co-binding proteins, a pairwise protein co-binding probability matrix was calculated from the posterior samples of the MCMC procedure (Figure 2.11). Each value in Figure 2.11 indicates the frequency of observing the corresponding two proteins in the same cluster out of the total 200 MCMC iterations. A high protein co-binding probability (indicated by darker color) provides stronger evidence of the existence of the protein pair in a cluster. Further, a three-fold assessment of the robustness of the clustering algorithm was performed.

Next, the expression levels (transcripts per kilobase million, TPM) of proximal genes regulated by the predicted clusters in each state were examined to understand transcriptional regulation by combinatorial binding of proteins in different chromatin states. The results showed that the median expression levels of the genes regulated by distinct clusters were close to each other in the Broad Promoter state (Figure 2.11). In contrast, the median expression level of the proximal genes combinatorially regulated by the FOXO3-RAD21-SMAD4 cluster in Poised Enhancer was higher than that of the genes combinatorially regulated by the other cluster (Figure 2.11). Similar behavior was observed in the Bivalent Promoter, Upstream Enhancer and Boundary states (Figure 2.13).
These results show that gene expression could change due to combinatorial binding of proteins in different chromatin states. Across the 30 domain states, every state contained some proteins with no binding locations near proximal gene promoters (Figure 2.12). Additionally, the Broad Promoter, Upstream Enhancer and Transcribed states had more binding sites near proximal genes than repressed states such as Polycomb Repressed or Poised Enhancer.
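The pairwise co-binding probability described in this section is the fraction of MCMC iterations in which two proteins share a cluster. A minimal sketch of that calculation follows, assuming the sampler output is available as one list of cluster labels per iteration; the function name `cobinding_probability` and this input layout are assumptions, not the dissertation's implementation.

```python
from typing import List, Sequence


def cobinding_probability(label_samples: Sequence[Sequence[int]]) -> List[List[float]]:
    """Return the matrix whose (i, j) entry is the fraction of MCMC
    iterations in which protein i and protein j share a cluster label.

    label_samples[t][i] is the (hypothetical) cluster label of protein i
    at iteration t. Only equality of labels within an iteration matters.
    """
    n_iter = len(label_samples)
    n_proteins = len(label_samples[0])
    counts = [[0] * n_proteins for _ in range(n_proteins)]
    for labels in label_samples:
        for i in range(n_proteins):
            for j in range(n_proteins):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[count / n_iter for count in row] for row in counts]
```

The matrix is symmetric with ones on the diagonal, and it is unaffected by relabeling of clusters between iterations because labels are only compared within each iteration.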

2.5.8 Estimated clusters in chromatin states

Figures 2.14 and 2.15 show the estimated binding intensities in the 28 chromatin states. In each figure the dotted lines represent the binding intensities of the individual proteins and the solid lines represent the estimated intensities of the identified clusters. The corresponding proteins in each cluster are shown in the same color as that of the cluster.

Figure 2.12: Gene expression comparison for individual proteins among different chromatin states.

Figure 2.13: Gene expression comparison for predicted clusters among different chromatin states.

Figure 2.14: Estimated cluster intensities in different chromatin states.

Figure 2.15: Estimated cluster intensities in different chromatin states.

2.5.9 Protein-DNA binding motif preferences in chromatin states

It is known that local epigenetic states affect the binding of proteins to their targets, and that protein-DNA binding may prevent or facilitate epigenetic changes at the binding sites [13, 120]. A protein is known to bind to the DNA with different motifs depending on the presence of its co-binding partners [7]. To examine the influence of chromatin states and co-binding partners on the binding sequences of a protein, ChIP-seq peaks for each protein that overlapped each chromatin state were grouped, and the binding motifs of the protein in an active state (Broad Promoter/Super Enhancer) and a repressed state (Poised Enhancer/Polycomb Repressed) were analyzed (Figure 2.16). The MEME suite [6] was used to identify de-novo motif sequences, and from the results the motif that matched the candidate protein’s consensus motif or was known as a secondary motif was selected. Neither the HOMER [40] nor the JASPAR [73] database documents a reference motif for BMI1, KDM1A, JMJD3, NPAS3, NUP153, RNF2, RAD21, P300, or SMCHD1. For the remaining proteins with known motifs, genomic sequences from two different subsets of peaks that overlapped two contrasting chromatin states were extracted as mentioned before and de-novo motifs were predicted.

Based on the MEME results, a protein’s binding preferences may be broadly categorized into one of three types: (1) De-novo sequences that closely matched the protein’s consensus motif, such as for ASCL1 (Figure 2.16(a)), MAX, NFIC, FOXO3, and TFs from the SOX family. (2) De-novo sequences that either did not match the consensus/secondary motifs or matched the consensus motif but were weakly enriched. It has been observed that the ATF/CREB motifs (‘TGAYRTCA’) are often enriched in genes targeted by β-catenin/TCF/LEF [59, 108]. For TCF3, highly enriched de-novo sequences resembling its consensus motif were predicted in the repressed state (Figure 2.16). However, in the active state the ‘TGACGTCA’ pattern was highly enriched.
Figure 2.16: Effect of chromatin states and co-binding partners on binding motifs. (A) De-novo motifs obtained using MEME for ASCL1 are similar to the consensus motif in both the Broad Promoter and Polycomb Repressed states although the co-factors of ASCL1 are different in the two states. (B) De-novo motifs obtained using MEME for TCF3 show differences in motifs between the two states with different co-factors. The motifs in the active state resemble the β-catenin/TCF/LEF motif whereas the motifs in the repressed state resemble the E-Box consensus motif.

This could imply that TCF3 might have been recruited by other co-factors, resulting in indirect binding in that particular state. For OLIG2, both active and repressed chromatin states contained de-novo sequences resembling its consensus motif. However, these sequences were highly enriched in the repressed state and weakly enriched in the active state. The fact that the E-value (footnote 2) of the de-novo sequences of OLIG2 was not significant in the active state might suggest indirect binding in that state, probably governed by other factors. (3) De-novo sequences resembling the secondary motifs, such as for the SMAD family. For SMAD4, sequences with the ‘GCCGC’ pattern were highly enriched in both active and repressed chromatin states, as reported previously in [43], where the authors found that SMAD4 can bind to both methylated and un-methylated motifs of distinct sequences. Similarly, for SMAD3, highly enriched sequences rich in ‘GC’ content were found in both chromatin states; such sequences have been reported as secondary SMAD3 motifs, often associated with known SMAD binding partners in TGF-β responses [113]. Interestingly, for POU5F1 the E-Box element ‘CANNTG’ was significantly enriched in both active and repressed chromatin states.
In [123], the authors had also observed that the E-Box motif was significantly enriched, with a p-value of $10^{-6}$, in a POU5F1 ChIP-seq experiment on ES cells with a Dnmt1, Dnmt3A and Dnmt3B triple knockout, whereas the consensus POU5F1 motif was weakly enriched, with a p-value of 0.1.
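Consensus strings such as ‘TGAYRTCA’ and ‘CANNTG’ use IUPAC degenerate nucleotide codes (Y = C or T, R = A or G, N = any base), which is why ‘TGACGTCA’ counts as an instance of the ATF/CREB motif. The short sketch below shows one way to scan a sequence for such a consensus. It is an illustration only and is not how MEME scores motifs; the helper names `iupac_to_regex` and `find_motif` are hypothetical.

```python
import re
from typing import List

# Standard IUPAC nucleotide codes expressed as regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(consensus: str) -> str:
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[base] for base in consensus.upper())


def find_motif(consensus: str, sequence: str) -> List[int]:
    """Return 0-based start positions of all (possibly overlapping)
    exact matches of the consensus on the given strand."""
    # A lookahead keeps the scan from skipping overlapping sites.
    pattern = re.compile("(?=" + iupac_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]
```

For example, `find_motif("TGAYRTCA", "AATGACGTCATT")` reports a site at position 2, while a sequence containing ‘TGAGGTCA’ is not reported because G is not allowed at the Y position.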

2.6 Known protein-protein interactions predicted by the algorithm

Table S1 lists some of the known interactions predicted by the algorithm. The first column shows the co-occupancy of proteins predicted by the algorithm together with the states where the co-occupancy was observed. The second column provides references that have also reported the predicted co-occupancies.

Footnote 2: The E-value is an estimate of the statistical significance of each motif that MEME finds (the "motif E-value").

2.7 Robustness of the proposed clustering technique

2.7.1 Effect of varying domain level state sizes

On a diHMM domain, 1,000 windows were repeatedly sampled and clustering was performed on these windows. To assess the consistency of the results, error rates were computed from the difference between the clusters estimated on the entire domain and the clusters estimated on each sub-sample of 1,000 windows. To compute the error rate, a binary protein-protein pair matrix of size N by N is defined, where N is the number of proteins. The (i, j)th entry of the matrix is 1 if protein i and protein j are in the same cluster and 0 otherwise. Because this pair matrix is symmetric, only its upper triangular half (excluding the diagonal) is used, and it is converted to a vector of length $\binom{N}{2}$. Such a vector is obtained for the clustering result of each sub-sample and for the ground truth (the clustering on the entire domain), and the error rate is reported by comparing the vectors obtained from sub-samples to that from the ground truth. The average error rate of misclassifying an edge was 0.1, which suggests that the clustering results are fairly robust to varying domain-level state sizes. Figure 2.17 shows the estimated intensity curves of the identified clusters in 10 such sub-samples, each with 1,000 randomly selected windows in a Broad Promoter domain (D5).
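The pair-vector comparison can be sketched in a few lines. The text does not state how the two vectors are compared, so the version below assumes that the error rate is the fraction of protein pairs whose co-clustering indicator differs between the sub-sample and the reference clustering. The function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def pair_vector(labels: Sequence[int]) -> List[int]:
    """Flatten the strict upper triangle of the co-clustering matrix:
    1 if proteins i and j share a cluster, 0 otherwise (length N choose 2)."""
    return [1 if labels[i] == labels[j] else 0
            for i, j in combinations(range(len(labels)), 2)]


def edge_error_rate(reference: Sequence[int], estimate: Sequence[int]) -> float:
    """Fraction of protein pairs (edges) whose indicator differs between
    the reference clustering and a sub-sample clustering.

    Assumption (not from the source): mismatches are counted with equal weight.
    """
    if len(reference) != len(estimate):
        raise ValueError("both clusterings must cover the same proteins")
    ref, est = pair_vector(reference), pair_vector(estimate)
    return sum(r != e for r, e in zip(ref, est)) / len(ref)
```

For the 21 proteins in this study the vector has 210 entries, so under this assumption an error rate of 0.1 corresponds to 21 mismatched pairs. The measure does not depend on how clusters are numbered, since it compares only pair memberships.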

2.7.2 Effect of varying number of initial clusters

Like any unsupervised clustering algorithm, the proposed clustering process starts with a random initial cluster setting. To test whether the initial number of clusters affects the final results on a domain, the initial number of clusters provided to the algorithm was varied and the clustering results were checked under these different settings. Since the total number of proteins used in the study is 21, the initial number of clusters was set to 2 and then gradually increased to 21 (i.e., each protein forms a different cluster). The clustering results were largely invariant to the initial number of clusters provided as prior to the non-parametric Bayesian clustering algorithm: the estimated error rate was 0.0 for 3 to 21 initial clusters and 0.2 for 2 initial clusters. The error rates are shown in Table S2.

Figure 2.17: Estimated intensity curves of the identified clusters in 10 sub-samples, each with 1,000 randomly selected windows in a Broad Promoter domain (D5).

2.7.3 Effect of permutation of peaks

As an example, a specific region on Chromosome X spanning positions 109933423 to 143803727 was selected (Figure 2.5C). The result is shown in Figure 2.5D, where two distinct binding patterns were observed: one for the cluster consisting of JMJD3 and SMAD3 and one for the cluster consisting of OLIG2. Next, all the binding locations within this region were shuffled, a set of peaks was randomly assigned to each protein, and clustering was performed on this new randomized set of locations. This procedure was repeated 1,000 times to calculate the p-value of observing two distinct clusters, with JMJD3-SMAD3 in one cluster and OLIG2 in another, under the null hypothesis that all proteins share the same binding intensity. The null distribution of the clustering pattern from all 1,000 shuffled data sets was examined and compared with the clustering result generated from the observed locations. It was found that the clusters obtained with the real data did not arise by chance, but reflect the true structure underlying the protein-DNA bindings.
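The shuffling procedure can be sketched as a generic permutation test. Several details below are assumptions rather than the dissertation's implementation: each protein keeps its original number of peaks after shuffling, the callable `has_pattern` is a hypothetical stand-in for running the clustering and checking for the pattern of interest (for example, JMJD3 and SMAD3 together and OLIG2 apart), and the add-one correction is a common convention that the text does not mention.

```python
import random
from typing import Callable, Dict, List


def permutation_p_value(peaks: Dict[str, List[int]],
                        has_pattern: Callable[[Dict[str, List[int]]], bool],
                        n_perm: int = 1000,
                        seed: int = 0) -> float:
    """Estimate how often the clustering pattern appears when binding
    locations are shuffled across proteins.

    peaks maps a protein name to its binding locations in the region.
    has_pattern is a hypothetical stand-in: it should cluster the
    shuffled data and report whether the pattern of interest is present.
    """
    rng = random.Random(seed)
    names = list(peaks)
    pooled = [pos for name in names for pos in peaks[name]]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = {}, 0
        for name in names:
            size = len(peaks[name])  # assumption: peak counts are preserved
            shuffled[name] = sorted(pooled[start:start + size])
            start += size
        if has_pattern(shuffled):
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

A small p-value under this scheme indicates that the observed pattern rarely appears when all proteins share the same pool of binding locations, which is the null hypothesis stated above.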

2.8 Comparison of clustering results to other methods

The clustering results of the proposed algorithm were compared with those of K-means. Instead of applying these two clustering methods directly on the binding locations of the proteins, the individual protein binding intensities were first estimated and used as inputs for clustering (assuming each protein was in its own cluster). The optimal number of clusters for K-means was obtained using the NbClust package [19]. As shown in Table S3, the optimal number of clusters for the K-means approach was found to be 2 for most of the chromatin states. Nevertheless, the cluster compositions that contain the regulatory TF modules are very similar to those of the proposed approach.

2.9 Time complexity analysis

1. Chromatin state identification performance: diHMM was run to identify chromatin states with the default parameter setting of 30 nucleosome-level and 30 domain-level states, using neural stem cell data with binarized BAM files for eight histone marks and CTCF. This analysis was conducted on a 24-core Haswell-EP E5-2680 v3 workstation with a 2.50 GHz dual-processor node and 128 GB RAM. It took about ten days to complete the analysis.

2. Clustering performance: DPM-LGCP was applied to detect the clustering pattern of proteins binding on different chromatin states identified by diHMM. For a total of 21 proteins and 30 chromatin states, it took about two and a half hours to complete the analysis.

2.10 Discussion

Development of semi-automated genome annotation tools has enabled genome segmentation and identification of distinct chromatin states at finer resolutions. In this study, a two-step process was designed to identify transcriptional regulatory modules within distinct chromatin states. After segmenting the genome using the diHMM software, a novel nonparametric Bayesian clustering algorithm was applied to identify clusters of co-binding TFs on the segmented genome. In existing methods, distance thresholds and empirical tests were adopted to define pairwise co-bound regions and context-dependent co-regulators [21, 47, 56, 84]. The statistically principled approach used in this study models TF binding site locations by inhomogeneous Poisson processes, and places a Dirichlet process prior on the random distribution of the latent log-intensity functions to facilitate clustering of TF binding patterns. Such a nonparametric Bayesian clustering procedure is based on the joint likelihood rather than pairwise TF relationships, and is therefore more flexible in capturing the intricate TF-DNA binding patterns in ChIP-seq data. More specifically, this approach does not require pre-specified parameters such as window size, distance threshold, and number of clusters, and hence achieves a greater degree of robustness.
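The model described in this paragraph can be written compactly. The following is a hedged sketch in notation that is not taken from the dissertation: $s_{i1}, \dots, s_{in_i}$ denote the mapped binding locations of protein $i$ on $[0, 50]$, $f_i$ is its latent log-intensity function, $G$ is the random distribution of the $f_i$, and $\alpha$ and $G_0$ are the concentration parameter and base measure of the Dirichlet process ($G_0$ is assumed here to be a Gaussian process law, as the name DPM-LGCP suggests).

\[
\lambda_i(s) = \exp\{f_i(s)\}, \qquad
p(s_{i1}, \dots, s_{in_i} \mid f_i) = \exp\Big(-\int_0^{50} \lambda_i(s)\,ds\Big) \prod_{j=1}^{n_i} \lambda_i(s_{ij}),
\]
\[
f_i \mid G \sim G, \qquad G \sim \mathrm{DP}(\alpha, G_0).
\]

Because draws from a Dirichlet process are discrete with probability one, several proteins can share the same log-intensity function. Proteins that share a function form one cluster, which is why the number of clusters does not have to be fixed in advance.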

The approach was applied to ChIP-seq data obtained from ChIP-Atlas [83] for neural stem cells, which are rapidly gaining importance in cell replacement therapy. ChIP-seq can produce millions of short reads that may result in varying strengths of signal intensities along the genome. One limitation of the approach is the variation that results from technical biases in library construction, polymerase chain reaction (PCR) and sequence mapping. This may generate nonfunctional or indirect binding events that lead to weak peak signals. Another possible limitation of the approach is that it does not handle the three-dimensional structural information of the histone marks. This restricts downstream gene expression analyses to the gene promoters present in the Enhancer states. While not in the scope of the current dissertation, including such information may improve the accuracy of the model and enable the prediction of long-distance Enhancer activity.

Despite the limitations, several interesting findings were observed using the TF clustering approach implemented in this study. It has been known that TF binding sites are not randomly distributed but rather clustered together at enhancer or promoter regions. Hence, some specific TFs may team up to have a significant epigenetic impact on gene expression. For instance, Cusanovich et al. [25] showed that single TF knockdown only results in expression changes of a small set of genes and that functional TF binding frequently occurs at active enhancers harboring a large number of TF binding sites. In this study, transcriptional regulatory modules identified in different chromatin states revealed several known TF interactions in neural stem cells, for example, JMJD3-SMAD3 in the Broad Promoter and Bivalent Promoter states [29], the SOX family and NFI in the Enhancer states [116], and MAX-FOXO3-OLIG2 in Upstream Enhancer [72]. These results suggest chromatin-state-specific TF-TF co-occupancy. In addition, diverse gene expression levels were observed through combinatorial regulation by the predicted transcriptional regulatory modules in different states. The uncovered links between gene expression and TF binding patterns on a genome-wide scale enhance our understanding of how chromatin-state-specific TF regulatory networks are assembled to coordinate tissue differentiation and cell specification.

An important objective in the study of transcription regulation is to understand the binding specificity and affinity of a transcription factor. A transcription factor may have several thousand DNA binding sites along the genome, which collectively can be represented as a motif, a consensus sequence demonstrating the nucleotide preferences at each position of the binding site. The recognition site for a TF ranges from 8 to 21 bases [73]. According to the ENCODE project, strong binding motifs can be identified in approximately 86% of the DNA loci occupied by sequence-specific transcription factors [22]. However, the factors that decide how a TF selects its binding sites from all possible choices across the genome are largely unexplored. It has been gradually recognized that chromatin states and the co-binding partners of a TF could influence the TF’s binding preference [48]. In this study, de-novo motifs obtained from subsets of TF peaks that overlapped different chromatin states were observed to differ for some TFs. TF binding sequences could thus be determined by TF class and chromatin state together. For example, the de-novo sequences predicted for the pioneer TFs resembled the consensus position weight matrix (PWM) across distinct chromatin states, whereas for some TFs (such as the SMAD family) the sequences resembled secondary motifs in specific chromatin states. Further, the prediction of TF binding preferences might help identify indirect TF binding when the predicted de-novo sequences do not match the consensus or the secondary PWM but belong to a different TF class [108, 120]. In conclusion, this work will help the research community understand how histone mark composition and combinatorial TF binding together regulate gene expression in neural stem cells.

| Predicted interactions in chromatin states | Interactions mentioned in other works |
| --- | --- |
| BMI1 and RNF2 in Broad Promoter, Upstream Enhancer, Transcribed, Boundary states | Monoubiquitination of H2AX mediated by the RNF2-BMI1 complex is critical for the efficient formation of γ-H2AX and functions as a proximal regulator in DNA damage response [86] |
| JMJD3, SMAD3 in Broad Promoter, Bivalent Promoter, Super Enhancer, Upstream Enhancer, Transcribed states | JMJD3 and SMAD3 co-localize at the TSS of TGF-β responsive genes in NSCs [29] |
| OLIG2, KDM1A, TCF3 in Broad Promoter, Super Enhancer, Upstream Enhancer, Transcribed, Polycomb Repressed states | KDM1A is involved in de-methylation of p300 HAT which activates OLIG2 through TCF3 [111] |
| NPAS3, SOX2 in Upstream Enhancer, Poised Enhancer states | Interactions among important TFs in NSC, among which are SOX2 and NPAS3 [31] |
| NUP153, SMAD4 observed in Upstream Enhancer state | The C-terminal MH2 domain of SMAD3 had the primary nuclear import activity allowing direct contact with FG-repeat containing CAN/Nup214 and NUP153 [122] |
| POU5F1, SOX21 in Broad Promoter, Super Enhancer, Upstream Enhancer, Transcribed, Polycomb Repressed states | SOX21 expression might be regulated by heterogeneous POU5F1 and/or SOX2 activity in the embryo [35] |
| P300, SMAD family in Bivalent Promoter, Super Enhancer states | During astrocyte fate determination, the activated SMAD family works with STAT3, which is then mediated by P300/CBP, resulting in the induction of astrocytic gene expression [44] |
| NFIC, SOX, FOX and basic helix-loop-helix (bHLH) family in Broad Promoter, Bivalent Promoter, Super Enhancer states | These TFs are crucial members of the cis-regulatory network [72] |
| POU5F1, SOX2 observed in Bivalent Promoter, Transcribed states | OCT4 and SOX2 co-regulate multiple genes [107, 124] |
| RAD21, POU5F1, SOX2 in Broad Promoter, Super Enhancer, Upstream Enhancer, Transcribed states | RAD21 has a specific cohesin binding pattern that is characterized by CTCF-independent co-localization of cohesin with pluripotency-related transcription factors POU5F1 and SOX2, among others [82] |
| TCF3, P300 observed in Super Enhancer, Upstream Enhancer states | E2A co-eluted with P300, CBP, and PCAF [16] |
| POU5F1, RNF2 in Broad Promoter, Super Enhancer, Upstream Enhancer states | POU5F1 and RNF2 in embryonic stem cells [105] |

Table S1: Known interactions predicted by the algorithm

| Initial number of clusters | Estimated error rate |
| --- | --- |
| 2 | 0.2 |
| 3-21 | 0.0 |

Table S2: Error rates of clustering on a Broad Promoter domain with different number of initial clusters.

| Chromatin state | DPM-LGCP | K-means |
| --- | --- | --- |
| Super Enhancer (D14) | (1) ASCL1, JMJD3, KDM1A, NFIC, NPAS3, OLIG2, P300, SMAD3, SOX2, SOX9, TCF3; (2) BMI1, MAX, NUP153, SMCHD1; (3) FOXO3, POU5F1, RAD21, RNF2, SMAD4, SOX21 | (1) ASCL1, JMJD3, KDM1A, NFIC, NPAS3, OLIG2, P300, SMAD3, SOX2, SOX9, TCF3; (2) BMI1, FOXO3, MAX, POU5F1, RAD21, RNF2, SMAD4, SOX21, NUP153, SMCHD1 |
| Upstream Enhancer (D15) | (1) ASCL1, JMJD3, KDM1A, NFIC, OLIG2, SMAD3, SOX2, TCF3; (2) BMI1, NUP153, SMCHD1; (3) FOXO3, NPAS3, P300, POU5F1, RNF2, SOX9; (4) MAX, RAD21, SMAD4, SOX21 | (1) ASCL1, FOXO3, JMJD3, KDM1A, NFIC, NPAS3, OLIG2, P300, POU5F1, SMAD3, SOX2, SOX9, TCF3; (2) BMI1, MAX, NUP153, RAD21, RNF2, SMAD4, SMCHD1, SOX21 |
| Polycomb Repressed (D22) | (1) ASCL1, JMJD3, KDM1A, NFIC, NPAS3, OLIG2, P300, SMAD3, SOX2, SOX9, TCF3; (2) BMI1, MAX, RAD21, SMCHD1, NUP153; (3) FOXO3, POU5F1, RNF2, SMAD4, SOX21 | (1) ASCL1, JMJD3, KDM1A, NFIC, OLIG2, P300, SMAD3, SOX2, SOX9, TCF3; (2) BMI1, FOXO3, MAX, NPAS3, POU5F1, RAD21, RNF2, SMAD4, SOX21, SMCHD1, NUP153 |
| Boundary (D21) | (1) ASCL1, OLIG2; (2) FOXO3, NPAS3, P300, POU5F1, SOX9; (3) BMI1, RNF2, SMAD4, SOX21; (4) JMJD3, KDM1A, NFIC, RAD21, SMAD3, SOX2, TCF3; (5) MAX, NUP153, SMCHD1 | (1) ASCL1, FOXO3, JMJD3, KDM1A, NFIC, NPAS3, OLIG2, P300, SMAD3, SOX2, TCF3; (2) BMI1, MAX, POU5F1, RAD21, RNF2, SMAD4, SOX21, SMCHD1, NUP153, SOX9 |

Table S3: Comparison of clustering results with K-means

Chapter 3

Recursive motif analyses identify brain epigenetic regulatory modules

3.1 Overview

Cytosine bases of eukaryotic DNA are transformed to 5-methylcytosine by DNA methyltransferase (DNMT) enzymes in the process called DNA methylation [87]. The modified cytosine nucleotides are typically adjacent to a guanine nucleotide, resulting in two methylated cytosine residues placed diagonally to each other on opposite DNA strands. DNA methylation is known to cause gene silencing, such as the silencing of tumor suppressor genes in cancer cells. DNA methylation contributes to the dynamics of chromatin conformation, and thus interferes directly or indirectly with the interactions between the DNA molecule and its binding proteins. For instance, methylation of gene promoters may prevent the binding of transcription activators but facilitate the recruitment of transcription suppressors [15]. To study the interplay of DNA methylation and transcription factor binding, a computational pipeline is discussed that expedites the identification of epigenetic regulatory modules from dynamically methylated genomic loci (Figure 3.1). The pipeline comprises the following steps: 1) according to correlation in methylation change, the genomic loci of interest are grouped using the WGCNA [54] clustering algorithm; 2) enrichment of motifs for transcription factors is determined using a recursive motif finding algorithm. To demonstrate the power of such an analytical procedure, methylome data sets generated for the frontal cortices of Tet1KO and Tet2KO mice were used. Such a systematic identification of transcriptional regulatory modules (TRMs) in cell-type-specific DNA methylation data will help shed light on the interplay of DNA methylation and transcription factors. To support the predictions made from the steps above, ChIP-seq and bulk RNA-seq data were used. The former was used to show enrichment of TFs using actual peaks and to compare the results with the regulatory modules predicted using motif enrichment tools. The latter was used to explore the co-expression of genes in cell-type-specific conditions and to compare the results with the recursive module identification step. The main contributions of this work are the new recursive identification of transcriptional regulatory modules using TF sequences, the clustering of differentially methylated sites to identify enriched transcription factor motifs, the integration of methylated motifs with existing public motifs specifically for searching TF motifs near differentially methylated sites, and the integration of expression data to further explain the predicted TRMs in the differentially methylated sites.

3.2 Key challenges

3.2.1 Motif enrichment prediction in DMS

Two important questions in the prediction of motifs or transcriptional regulatory modules (TRMs) within differentially methylated regions were whether to first partition the regions into groups with correlated methylation profiles, and which clustering approach to choose from the many available options such as K-means, Gaussian mixture modeling or Weighted Gene Co-expression Network Analysis (WGCNA). In addition, although a motif enrichment tool such as HOMER [40] can identify motifs enriched within a set of genomic loci, transcriptional regulatory module prediction required motifs co-enriched near the motif of interest. To that end, a recursive motif enrichment tool was developed. However, it was not clear whether the recursive motif enrichment tool alone would be enough to detect the TRMs in specific subsets of a large data matrix or whether the differentially methylated sites (DMS) needed to be clustered first. Multiple approaches were tried, such as clustering the DMS using K-means and determining the number of clusters through the silhouette index. The TET1KO DMS were chosen as ground truth because experimental validation of some key TFs such as Egr1 was already established. The K-means approach typically generated three clusters, partitioning the DMS into hyper-methylation, hypo-methylation and mixed methylation levels in different samples. However, subsequent motif identification showed enrichment of very few key TFs, with p-values slightly lower than $10^{-10}$, which is the threshold set by HOMER for known TF motif enrichment. Recursive motif identification without clustering identified a few key TFs with lower p-values than the previous approach. WGCNA combined with recursive motif enrichment prediction not only identified multiple key TFs in each cluster but also recovered many of the key TFs from the experimentally validated lists. A possible reason could be that the clustering emphasized grouping regions with higher correlations. Further, it is known that DMS with similar correlation profiles are associated with similar TRMs. Hence, the final computational pipeline included WGCNA as the clustering approach on DMS.

Figure 3.1: Recursive identification of Transcriptional Regulatory Modules (TRMs) in differentially methylated sites.

3.3 Building blocks of the computational pipeline

3.3.1 Merging nearby DMS

For differentially methylated regions (DMRs), the methylated regions are already defined, similar to TF peaks. Such regions can readily be used as input for TRM identification. However, in the absence of DMRs, or if the number of DMRs is very small, recursive identification of TRMs is not feasible. In such cases differentially methylated sites (DMS) can be used if enough DMS are available. Because a DMS is a single location on the genome, it must be extended into a region before searching for potentially enriched motifs. For each DMS, any neighboring DMS located within 200bp was merged with the current DMS. Thus, the new region started at the original DMS and ended at the nearby DMS. For multiple neighboring DMS within 200bp, the starting location was the same (the current DMS) whereas the ending location was the farthest DMS located within 200bp. Finally, all regions or single DMS (in the absence of neighboring DMS) were extended by +/-100bp and the resulting regions were used as input for recursive TRM identification.
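The merging rule described in this subsection can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline's actual code: the function name `merge_dms`, the restriction to a single chromosome, and the reading that the 200bp window is measured from the first DMS of each region (rather than chained from neighbor to neighbor) are all assumptions.

```python
def merge_dms(positions, merge_dist=200, flank=100):
    """Merge DMS positions on one chromosome into regions (illustrative sketch).

    Assumption: a region starts at the current DMS and ends at the farthest
    DMS lying within `merge_dist` bp of that start; every region (or lone
    DMS) is then extended by `flank` bp on both sides.
    """
    pos = sorted(set(positions))
    regions = []
    i = 0
    while i < len(pos):
        start = end = pos[i]
        j = i + 1
        # absorb every neighbor within merge_dist of the region start
        while j < len(pos) and pos[j] - start <= merge_dist:
            end = pos[j]
            j += 1
        # extend by the flank, never going below coordinate 0
        regions.append((max(0, start - flank), end + flank))
        i = j
    return regions


if __name__ == "__main__":
    print(merge_dms([1000, 1150, 1190, 1500]))
```

Measuring the window from the region start keeps merged regions short (at most 200bp before flanking), which matches the description of a fixed starting location; a chained interpretation would allow arbitrarily long regions.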

3.3.2 Clustering DMRs for motif enrichment

DNA methylome profiling typically generates thousands of potential differentially methylated sites. The methylation data included in this study were collected from different brain cell types. Since past research has shown that TF binding is often determined by DNA methylation [123], the objective was to identify cell type specific transcriptional regulatory modules in differentially methylated sites. WGCNA [54] was used to group each DMR set into different clusters. When applied to a very large matrix that demands very high computing power, WGCNA automatically divides the matrix into smaller blocks and performs a two-level clustering. In the first step, DMRs are pre-clustered with projective k-means into different blocks that are only weakly correlated with one another. Next, for each block, it performs a network analysis by identifying clusters of highly correlated DMRs and estimating a representative DMR profile for each cluster. Finally, clusters whose representative profiles are highly correlated are merged. Thus, block-wise network analysis significantly reduces the memory footprint and the computational complexity. The size of a block depends on the hardware of the system; the user needs to specify the largest number of DMRs that can fit in a block. In the WGCNA clustering, regions that are not part of any co-methylated module, which indicates a poor quality DMS, are assigned the label 0 and were not considered in our analysis.
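WGCNA groups DMRs by how strongly their methylation profiles correlate across samples. The core of that idea, a soft-thresholded ("weighted") correlation, can be sketched as follows. This is only an illustration of the unsigned adjacency a_ij = |cor(x_i, x_j)|^beta; it is not the WGCNA R package, it omits the block-wise pre-clustering and module detection, and the function names and the value beta = 6 are assumptions (the soft-thresholding power used in this study is not stated).

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def soft_threshold_adjacency(profiles, beta=6):
    """Unsigned weighted adjacency: |correlation| raised to the power beta.

    `profiles` is a list of per-region methylation vectors (one value per
    sample). Raising to beta > 1 shrinks weak correlations towards zero
    while keeping strong ones close to one.
    """
    n = len(profiles)
    adj = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a = abs(pearson(profiles[i], profiles[j])) ** beta
            adj[i][j] = adj[j][i] = a
    return adj


if __name__ == "__main__":
    demo = [[0.1, 0.5, 0.9], [0.2, 0.6, 1.0], [0.9, 0.2, 0.6]]
    for row in soft_threshold_adjacency(demo):
        print([round(v, 4) for v in row])
```

In the demo, the first two regions rise together across samples and keep an adjacency near 1, while the third region's moderate correlation with them is pushed close to 0.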

3.3.3 Data analysis of Whole Genome Bisulphite Sequencing (WGBS-seq)

The raw FASTQ data sets were trimmed of bases with low sequencing quality using Trim Galore with the following parameters: -q 28, --illumina, --length 30. After trimming, sequencing data were aligned to the mm10 mouse genome using Bismark and Bowtie2. Using the deduplication module in Bismark, PCR duplicated read pairs were removed, followed by extraction of methylation information at both CpG and CH (H as A, C, or T) sites. Fisher's exact test was used to evaluate the significance of differential methylation at CpG sites. Briefly, a contingency table was constructed where the rows indicated the two conditions and the columns indicated the number of methylated cytosines and unmethylated cytosines. In the test, CpG sites were required to have at least 10X read coverage. In order to control the FDR, a sequential permutation method was employed [9]. In each permutation, a new contingency table was constructed by randomly assigning reads to cells with the same methylation probability for each sample. A total of 1000 permutations were performed for each CpG site. The number of true null hypotheses (m0) was estimated by a histogram method (Ryan Lister et al, Science, 2013). Based on the estimated m0, the adjusted p-value for each CpG site was calculated. Thus the differentially methylated sites (DMSs) were identified. To determine differentially methylated regions (DMRs), a two-step approach was developed. First, any two adjacent DMSs at most 500bp apart were merged into a cluster. Each cluster had to include at least 5 CpG sites, and at least 80% of its DMSs had to tend toward being methylated, or toward being unmethylated, in one of the conditions. All clusters fulfilling these requirements were considered DMR candidates. Second, at least 80% of the CpG sites in a candidate DMR had to tend toward being methylated or unmethylated in one of the conditions, and each CpG site was required to have a methylation difference of at least 0.1.
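The per-site test can be illustrated with a small, dependency-free implementation of the two-sided Fisher's exact test on the 2x2 table described above (rows: conditions; columns: methylated and unmethylated read counts). This is a sketch under assumptions: the function names are hypothetical, the two-sided p-value sums all tables that are no more probable than the observed one (a common convention, not necessarily the one used in this study), and the permutation-based FDR control is not reproduced.

```python
from math import comb


def passes_coverage(meth, unmeth, min_cov=10):
    """True if a CpG site has at least `min_cov` reads in one sample."""
    return meth + unmeth >= min_cov


def fisher_exact_two_sided(meth_a, unmeth_a, meth_b, unmeth_b):
    """Two-sided Fisher's exact p-value for the table
        [[meth_a, unmeth_a],
         [meth_b, unmeth_b]].
    """
    row_a = meth_a + unmeth_a
    col_meth = meth_a + meth_b
    n = row_a + meth_b + unmeth_b
    lo = max(0, row_a + col_meth - n)   # smallest feasible meth_a
    hi = min(row_a, col_meth)           # largest feasible meth_a

    def weight(x):
        # unnormalised hypergeometric probability, kept as an exact integer
        return comb(col_meth, x) * comb(n - col_meth, row_a - x)

    observed = weight(meth_a)
    extreme = sum(w for w in (weight(x) for x in range(lo, hi + 1))
                  if w <= observed)
    return extreme / comb(n, row_a)


if __name__ == "__main__":
    # condition A: 8 methylated / 2 unmethylated reads; condition B: 1 / 5
    print(fisher_exact_two_sided(8, 2, 1, 5))
```

Working with integer weights avoids floating-point ties when deciding which tables count as "at least as extreme" as the observed one.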

3.3.4 Recursive identification of TRMs from motifs

ChIP-seq data analysis was done using MACS2 [32] with default parameters and q-value. Known motif enrichment analysis was performed using the script findMotifs.pl in HOMER with the parameter "-mset vertebrates". The concept of recursive motif identification is not new. In a previous paper [64], the authors leveraged the ENCODE human ChIP-seq database comprising 765 peak data sets and proposed a motif discovery pipeline based on recursive entropy minimization. Based on the hypothesis that a TF can bind to the DNA either directly with a specific sequence or indirectly as a co-factor of other sequence specific TFs, the algorithm recursively masked the sequences of TF motifs detected at the previous iteration and selected the motif that had the lowest entropy in the current iteration. While the approach is compelling and immensely helpful in predicting the consequences of sequence variations in known binding locations, it relies on the availability of ChIP-seq data. Since DNA-methylation and ChIP-seq experiments generate genome wide CpG methylated regions and protein-DNA binding regions respectively, and given the availability of numerous known and de novo motif enrichment tools developed in the past, recursive identification of co-factors of candidate TFs is proposed here from differentially methylated sites alone, without using ChIP-seq data. In the proposed approach (Figure 3), HOMER is first used to identify known motifs in the entire data set (the input data set can be either a DMS cluster or the entire DMS). Next, a candidate TF is selected based on a selection criterion (the motif with the most significant p-value and occupying the maximum number of sites in the input data set); the regions in the data containing its motif are identified, and the center of each region is shifted based on the offset predicted by HOMER. Known motifs are then identified in this new, shifted DMS set. Finally, the regions from the input data set containing the candidate TF's motifs are discarded and the next significant candidate motif is identified. This process is repeated until no significant TF is left.

Figure 3.2: Recursive “key” TF and ETRM prediction flowchart.
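The recursive loop described in this subsection can be sketched in Python as follows. This is an illustrative skeleton only: `find_enriched` is a hypothetical callable standing in for a HOMER known-motif run (it is assumed to return one record per motif with its p-value and the identifiers of the regions containing it), the data layout is an assumption, and the re-centering of regions on the HOMER-predicted offset is omitted.

```python
def recursive_trm(regions, find_enriched, p_cutoff=1e-10):
    """Recursively peel off key TFs and their co-enriched motifs (sketch).

    regions       : dict mapping region id -> sequence (or any payload)
    find_enriched : hypothetical stand-in for a motif enrichment run; takes a
                    dict of regions and returns a list of dicts with keys
                    "motif", "p" and "regions" (ids of regions with the motif)
    """
    remaining = dict(regions)
    modules = []
    while remaining:
        hits = [h for h in find_enriched(remaining)
                if h["p"] <= p_cutoff and h["regions"]]
        if not hits:
            break  # no significant TF left
        # key TF: most significant p-value, ties broken by most regions covered
        key = min(hits, key=lambda h: (h["p"], -len(h["regions"])))
        subset = {r: remaining[r] for r in key["regions"] if r in remaining}
        if not subset:
            break  # guard against an enrichment result that removes nothing
        # motifs co-enriched within the key TF's regions form the module
        cofactors = [h["motif"] for h in find_enriched(subset)
                     if h["p"] <= p_cutoff and h["motif"] != key["motif"]]
        modules.append({"key": key["motif"],
                        "cofactors": cofactors,
                        "n_regions": len(subset)})
        # discard the key TF's regions and continue with what is left
        for r in subset:
            del remaining[r]
    return modules
```

Passing the enrichment step in as a function keeps the loop independent of how motifs are actually scored, so the same skeleton could wrap a command-line call or a precomputed table.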

3.3.5 Preparing motif libraries

HOMER (Hypergeometric Optimization of Motif EnRichment) is a well-known motif discovery tool used for known and de novo motif discovery from sequencing data such as ChIP-seq, GRO-Seq, RNA-Seq, DNase-Seq and Hi-C data [40]. It contains several motif databases: (1) the ‘known’ motif database, which is mostly based on the analysis of high quality public ChIP-seq experiments; these motifs contain the transcription factor name and the DNA binding domain along with the GEO accession number; (2) the JASPAR motif database, which is primarily based on published binding site selection experiments (SELEX) in addition to data from in vitro microarray based factor affinity experiments; and (3) organism-centric motif databases for Drosophila [52] (fruit fly) motifs, Arabidopsis [104] and other plant motifs, and Saccharomyces cerevisiae motifs [38, 68]. In recent years, several studies have explored the transcription factor binding preference on methylated DNA using methylation-sensitive SELEX approaches.

The HOMER known motif library is primarily based on the analysis of public ChIP-seq data sets. The data set is obtained from high-quality ChIP-seq experiments in which the most significant motif predicted by HOMER resembled the consensus site for the TF with the given DNA binding domain. These motifs are annotated in the HOMER software as ‘known’ motifs. HOMER also contains motifs from other sources such as JASPAR or other public databases, which are not annotated as ‘known’ motifs. The difference in annotation is due to the fact that HOMER has optimized the detection thresholds of the ‘known’ motifs. This threshold can be calculated automatically by HOMER for consensus sequences and determines whether a given sequence in a data set will be recognized by the motif. A higher threshold means that only perfect matches to the motif will be recognized, whereas a lower threshold means that sequences with more mismatches will also be recognized. As recommended by HOMER, a “guess and check” approach is required to obtain the appropriate detection threshold that recognizes the correct sequences. However, in the absence of consensus sequences, the position frequency matrix (PFM) can be used directly and the detection threshold can be set to zero. For instance, in the HOMER database, the detection threshold for the PFMs obtained from JASPAR has been set to zero. The motif libraries thus prepared were added to the HOMER database as individual motif files and concatenated to all.motifs for ‘vertebrates’. All methylated motifs have ‘methylated’ appended to their names so that they can be recognized easily. To check whether a motif from one or more of the three libraries was enriched in our study, the findMotifs.pl command from HOMER was used.
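The role of the detection threshold can be illustrated with a small scoring sketch. It assumes a log-odds score of the form sum over positions of ln(p / 0.25), with a uniform background, which is how motif matrices with a "log odds detection threshold" are commonly scored; the function names, the pseudo-probability floor and the toy two-column matrix are illustrative assumptions, not HOMER's code or one of the motifs used in this study.

```python
import math

BASES = "ACGT"


def log_odds_score(window, pfm, background=0.25, floor=0.001):
    """Log-odds score of an uppercase A/C/G/T window against a PFM.

    `pfm` is a list of columns, each giving the probabilities of A, C, G, T.
    Probabilities are floored at `floor` to avoid log(0).
    """
    score = 0.0
    for base, column in zip(window, pfm):
        p = max(column[BASES.index(base)], floor)
        score += math.log(p / background)
    return score


def find_matches(sequence, pfm, threshold):
    """Start positions whose window scores at or above the detection threshold."""
    width = len(pfm)
    return [i for i in range(len(sequence) - width + 1)
            if log_odds_score(sequence[i:i + width], pfm) >= threshold]


if __name__ == "__main__":
    toy_pfm = [[0.97, 0.01, 0.01, 0.01],   # position 1 prefers A
               [0.01, 0.97, 0.01, 0.01]]   # position 2 prefers C
    print(find_matches("TTACGT", toy_pfm, threshold=0.0))
    print(find_matches("TTACGT", toy_pfm, threshold=3.0))
```

With a threshold of zero, any window that is more likely under the motif than under the background is reported; raising the threshold progressively restricts matches to near-perfect instances, which is the trade-off described above.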

3.3.6 Analysis of single cell RNA-seq

Sequencing data sets were downloaded from the GEO database (GSE108761), where mice were euthanized on postnatal days 5, 10, 16, and 21. Each time point includes data from eight mice in total, processed as four independent samples of two mice each. Therefore, all cells from the same time point were pooled together. As in the previous study, cells in the pooled data with fewer than 500 or more than 15,000 UMI counts were excluded. In addition, cells with abnormally high expression of mitochondrial genes (> 10% of total expression) were discarded as well. Cells were then clustered using the Seurat R package [95]. According to the previous study (Brian T.K. et al, PNAS, 2017), due to the developmental regulation of neuronal markers, a combination of Snap25, Slc17a6, and Stmn2 was used to identify excitatory neurons. Gad1, Olig1, Aqp4, Cldn5, Vtn, Cx3cr1, and Mrc1 were used to identify inhibitory neurons, oligodendrocytes, astrocytes, endothelial and smooth muscle cells, pericytes, microglia, and macrophages, respectively. These marker genes were used to assign clusters to distinct cell types. Then the top 10 markers for each cluster were selected. A total of 825 transcription factors (TFs) were downloaded from another study (Merja Heinniemi et al, Nat. Method, 2014). In order to determine the co-expression relationship between any two TFs, a conditional probability matrix was constructed. Before constructing the matrix, TFs expressed in fewer than 100 cells were discarded. Let $N_A$, $N_B$ and $N_{AB}$ denote the number of cells expressing $\mathrm{TF}_A$, the number of cells expressing $\mathrm{TF}_B$ and the number of cells simultaneously expressing $\mathrm{TF}_A$ and $\mathrm{TF}_B$, respectively. Then the probability of $\mathrm{TF}_B$ being expressed given that $\mathrm{TF}_A$ is expressed is:

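$$P(\mathrm{TF}_B \mid \mathrm{TF}_A) = \frac{P(\mathrm{TF}_A, \mathrm{TF}_B)}{P(\mathrm{TF}_A)} = \frac{N_{AB}}{N_A} \tag{3.1}$$

Similarly, the probability of $\mathrm{TF}_A$ being expressed given that $\mathrm{TF}_B$ is expressed is:

$$P(\mathrm{TF}_A \mid \mathrm{TF}_B) = \frac{P(\mathrm{TF}_A, \mathrm{TF}_B)}{P(\mathrm{TF}_B)} = \frac{N_{AB}}{N_B} \tag{3.2}$$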

In addition, to assess the significance of the co-expression between any two TFs, a hypergeometric test was used, in which the null hypothesis is that the target TFs are expressed independently of each other. The test statistic is the number of cells simultaneously expressing $\mathrm{TF}_A$ and $\mathrm{TF}_B$. Let $N$ denote the total number of cells in the single cell RNA-seq data. Then, under the null hypothesis, the test statistic follows a hypergeometric distribution, that is

$$P(N_{AB}) = \frac{\binom{N_B}{N_{AB}} \binom{N - N_B}{N_A - N_{AB}}}{\binom{N}{N_A}} \tag{3.3}$$

In the test, the p-value was calculated from the cumulative probability of $N_{AB}$. If the p-value is less than 0.05, the null hypothesis is rejected and co-expression or co-repulsion between the target TFs is inferred. In order to compare the probability of $\mathrm{TF}_A$ and $\mathrm{TF}_B$ being expressed simultaneously with its expectation, a representation factor (RF) was defined as:

$$RF = \frac{P(\mathrm{TF}_A, \mathrm{TF}_B)}{P(\mathrm{TF}_A) \cdot P(\mathrm{TF}_B)} = \frac{N_{AB} \cdot N}{N_A \cdot N_B} \tag{3.4}$$

If RF is greater than 1, the degree of co-expression between the target TFs is greater than expected; if RF is less than 1, it is lower than expected. In particular, the target TFs are independent of each other when RF is close to 1.
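The quantities in Equations 3.1 to 3.4 can be computed directly from the four cell counts. The following Python sketch shows one way to do so; the function name and the returned field names are hypothetical, and because the text does not state which tail of the hypergeometric distribution was used, both the upper tail (co-expression) and the lower tail (co-repulsion) are reported as an assumption.

```python
from math import comb


def coexpression_stats(n_a, n_b, n_ab, n_total):
    """Conditional probabilities, hypergeometric tail p-values and RF.

    n_a, n_b : number of cells expressing TF_A and TF_B
    n_ab     : number of cells expressing both
    n_total  : total number of cells (N)
    """
    denom = comb(n_total, n_a)

    def pmf(k):
        # Equation 3.3: probability of exactly k co-expressing cells
        return comb(n_b, k) * comb(n_total - n_b, n_a - k) / denom

    k_min = max(0, n_a + n_b - n_total)
    k_max = min(n_a, n_b)
    return {
        "p_b_given_a": n_ab / n_a,                                    # Eq. 3.1
        "p_a_given_b": n_ab / n_b,                                    # Eq. 3.2
        "p_upper": sum(pmf(k) for k in range(n_ab, k_max + 1)),       # P(X >= n_ab)
        "p_lower": sum(pmf(k) for k in range(k_min, n_ab + 1)),       # P(X <= n_ab)
        "rf": n_ab * n_total / (n_a * n_b),                           # Eq. 3.4
    }


if __name__ == "__main__":
    # made-up counts for illustration only
    print(coexpression_stats(n_a=100, n_b=200, n_ab=40, n_total=1000))
```

With these made-up counts, 20 co-expressing cells would be expected under independence, so observing 40 gives RF = 2 and a small upper-tail p-value.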

3.4 Results

3.4.1 A Comprehensive motif database compiled for epigenetic regulatory module identification

HOMER's known motif database (as described before) was augmented with two additional motif data sets (Figure 3.3). The first is a combined set of un-methylated (or canonical) and methylation-related motifs compiled in the MeDReaders database according to published literature [114]. This data set contains motifs for 731 transcription factors (601 for human and 130 for mouse) that may bind to methylated DNA. In addition, the MeDReaders database also provides methylation-related motifs predicted using in silico approaches for 292 transcription factors (287 for human and 5 for mouse). In addition to this motif library, a second data set was prepared using the position frequency matrices (PFMs) for over 500 transcription factors obtained with a methylation-sensitive SELEX approach [123]. The motifs in this data set can be broadly classified as (a) MethylMinus: the consensus sequence obtained from the motif contains one or more CGs, the methylation of which negatively affects TF binding; (b) MethylPlus: the consensus sequence obtained from the motif contains one or more CGs, the methylation of which enhances TF binding. Thus, the aforementioned two motif data sets were integrated with the HOMER known motif database and the motifs were classified into five types: (1) methylation-related motifs and (2) un-methylated (canonical) motifs from MeDReaders; (3) MethylPlus and (4) MethylMinus motifs identified with the methylation-sensitive SELEX approach; and (5) motifs from the HOMER database.

PFMs from these data sets were compared to retain only unique PFMs, in addition to 364 motifs from the HOMER database, 864 canonical motifs from MeDReaders, 22 non-canonical motifs from MeDReaders, 191 MethylPlus motifs and 143 MethylMinus motifs. The transcription factors associated with these motifs are summarized in Figure 2a. The TFs documented in our integrated motif database can be classified according to their DNA-binding domains, as shown in Figure 2b using a Circos [51] plot. Because no CG pattern was present in either their canonical or methylated motifs, transcription factors with the MADS domain were not present in the MethylPlus or MethylMinus categories. Interestingly, transcription factors with the Homeodomain (HM) and C2H2 zinc finger domains frequently recognize motifs in the MethylPlus, Canonical and Methylated categories. On the other hand, transcription factors with HMG and ETS domains were present in all categories except MethylPlus. Although the information on methylation preference is still incomplete for many TFs, current knowledge suggests that differences in DNA binding domains could be a key factor controlling whether a TF interacts with its methylated binding sites.

3.4.2 Genome distribution of hypermethylated CpG sites identified in TET1KO and TET2KO frontal cortices

With this unique motif database, the analytical pipeline (Figure 3.1) was applied to two methylome data sets generated for the frontal cortices of TET1KO and TET2KO mice. The TET1KO methylome was generated with reduced representation bisulfite sequencing to enrich for genomic regions rich in CpG dinucleotides by digestion with MseI and MluCI enzymes, with recognition sites TTAA and AATT, respectively. The TET2KO methylome was generated with whole genome bisulfite sequencing, and thus has broader genome-wide coverage. Interestingly, 42,558 DMSs were identified for TET1KO but only 12,900 DMSs for TET2KO. In addition, only 643 common DMSs were identified between TET1KO and TET2KO tissues (Figure 3.5). This result suggests that TET1 and TET2 enzymes are indispensable for distinct sets of CpG sites during brain development. Compared to the genome distribution of TET2KO DMSs, TET1KO DMSs tend to localize inside or adjacent to genes (Figure 3.5). For instance, promoters host 4.3% of TET1KO DMSs and 1.9% of TET2KO DMSs, respectively. Thus, promoter methylation seems more susceptible to the loss of TET1 than to the loss of TET2. Since both TET1 and TET2 are DNA demethylation enzymes, the direct consequence of TET1 or TET2 loss is a methylation increase at their corresponding targets. On the other hand, the methylation losses observed in TET1KO or TET2KO mice are likely to be indirect consequences. It is therefore not surprising that the majority of DMSs identified (75.1% for TET1KO and 67.3% for TET2KO) are hypermethylated in TET1KO or TET2KO mice. In order to understand the gain of methylation observed in TET1KO and TET2KO mice, further analyses were performed on the hypermethylated regions surrounding the DMSs identified in TET1KO and TET2KO mice.

Figure 3.3: Distribution of the five motif categories, viz. HOMER, MethylPlus, MethylMinus, Canonical (from MeDReaders) and Methylated (from MeDReaders), among the different DNA binding domains. The domains are arranged in clockwise decreasing order, with the Homeobox domain containing the most motifs from the five categories and the SMAD domain containing the fewest.

Figure 3.4: Venn diagram showing the number of shared motifs among the five categories.

Figure 3.5: Genome distribution of TET1KO and TET2KO differentially methylated sites.

3.4.3 TRMs identified in differentially methylated clusters

The first step in the identification of epigenetic regulatory modules followed our previous procedure [66], which is based on the assumption that genomic loci sharing similar methylation profiles during development or among cell types might be co-regulated by a common set of TFs. To perform the co-methylation co-regulation analysis, 66 methylomes were collected for mouse forebrain, midbrain and hindbrain regions during development and for sorted mouse brain cells including astrocytes, oligodendrocytes, excitatory neurons, PV neurons and VIP neurons. Next, the Weighted Correlation Network Analysis (WGCNA) algorithm [54] was applied to cluster differentially methylated regions into groups that show highly correlated methylation profiles across the methylomes. For both TET1KO and TET2KO, the WGCNA algorithm identified three co-methylated clusters, each showing a distinct methylation profile (Figure 3.6, 3.7). Although TET1KO and TET2KO mice share a very small number of DMSs, as noted earlier, similar methylation profiles were observed for some clusters in the two genotypes. For instance, the first clusters in TET1KO and in TET2KO both show decreased methylation during brain development and have the lowest methylation level in excitatory neurons among all brain cell types. For TET1KO, the second cluster shows low methylation during embryonic development phases and increased methylation in postnatal frontal cortex, whereas the third cluster shows hypo-methylation in most samples except in TET1KO. For TET2KO, the second cluster shows increased methylation in various kinds of neurons vs glial cells, whereas the third cluster shows hyper-methylation in astrocytes, E17.5 neurons, and oligodendrocytes.

Figure 3.6: Methylation profiles of the three DMR clusters identified (A-C) and the corresponding TRMs predicted (D-F) for the TET1KO brain methylome. The range of methylation levels was set from 0 (blue) to 1 (yellow). For each iteration cycle, the number of DMRs included in the analysis is shown on the right bar. The key TF identified in each iteration cycle is marked with a double ring. The TFs are colored according to the TF family annotated with the DNA binding domain.

Figure 3.7: Methylation profiles of the three DMR clusters identified (A-C) and the corresponding TRMs predicted (D-F) for the TET2KO brain methylome. The range of methylation levels was set from 0 (blue) to 1 (yellow). For each iteration cycle, the number of DMRs included in the analysis is shown on the right bar. The key TF identified in each iteration cycle is marked with a double ring. The TFs are colored according to the TF family annotated with the DNA binding domain.

For each cluster, the HOMER software was applied to analyze the sequence of each genomic locus and identify transcription factor motifs. As recommended by the HOMER software, a motif with an enrichment p-value ≤ 10^-10 was considered significant. The top significant motifs are shown for each TRM (Figure 3.6, Figure 3.7). Despite the fact that the majority of DMSs are different in TET1KO and TET2KO mice, several transcription factors were identified to be associated with both TET1KO and TET2KO, including the MEF2 family with the MADS domain (MCM1, Agamous, Deficiens, and Serum response factor), the SOX family with the HMG domain (high mobility group), and Olig2 and Ascl1 with the bHLH domain (basic helix-loop-helix). Interestingly, for TET1KO, transcription factors with the Zinc Finger domain were enriched in all three clusters, but cluster 1 showed motif enrichment for Egr1 whereas clusters 2 and 3 showed motif enrichment for Zic3 and Zeb1, respectively. Egr1 is involved in the consolidation of new memories and is critical for early postnatal brain development [27]. Zic3 regulates the expansion of neuronal progenitors [45] while Zeb1 keeps neurons in an immature state by preventing neuronal polarization [101].

Next, the recursive motif identification approach was applied to each cluster to investigate motif distribution. During the recursive motif searching process, the exclusion of some genomic regions in a cluster may lead to substantial changes in the enrichment p-values for TF motifs. For example, in cluster 1 of TET1KO, the Egr2, Sp5, and Klf14 motifs have enrichment p-values of 10^-5, 10^-3 and 10^-6, respectively. However, all three motifs were significantly enriched in the Egr1 TRM, a subset of cluster 1 that was obtained through the recursive motif identification approach.
Thus, the recursive TRM identification approach predicted clusters of motifs enriched near a key TF, which would otherwise go unnoticed if all input sequences were analyzed together. The recursive motif identification approach also led to several interesting observations. Some TFs with different DNA binding domains are likely to form a TRM together, such as NF1 (CTF domain), Tgif2 (Homeobox domain) and Ascl1 (bHLH domain). On the other hand, some TRMs are composed of TFs with DNA binding domains of the same class, including TFs with the ZF domain (EGR family), MADS domain (MEF2) and HMG domain (SOX family). These results shed some light on how TFs may be organized into regulatory complexes, for instance, heterodimers of TFs from the same family. Pioneer TFs such as Ascl1 [115] were observed to have motifs enriched in multiple TRMs. This may reflect the fact that pioneer TFs act at early stages and participate in the epigenetic regulation of genomic regions in multiple sub-clusters. Interestingly, Sox3 and Sox10 were predicted to be key TFs in TET2KO clusters. Transient expression of Sox3 has been reported in neural progenitors, whereas Sox10 is known as a critical regulator involved in multiple stages of neural crest development. The genomic regions related to Sox3 TRMs are hypo-methylated whereas those for the Sox10 TRM are hyper-methylated during development.

Transcription factor binding could be under the influence of DNA methylation, histone modifications, and chromatin structure. For TRMs identified for both TET1KO and TET2KO brains, several methylation-related motifs were enriched in the differentially methylated regions. As classified in our motif database, motifs from the following categories were identified for their corresponding TFs: (1) ‘MethylMinus’ motifs for Sox12, Atf3, Batf and Tcf12; (2) ‘MethylPlus’ motifs for Lhx1, Lhx4, Lhx6, Lhx9 and Nanog; (3) ‘Methylation-related’ motifs, which include ‘motifs with little effect by DNA methylation’ for Tlx2 and Myog and ‘motifs with no CpG’ for Atoh1, MEF2 and SOX family members. Having ‘motifs with little effect by DNA methylation’ or ‘motifs with no CpG’ would enable a TF to disregard the methylation status of its binding sites. Such TFs may work as forerunners together with TFs with ‘MethylPlus’ motifs. Logically, the enrichment of TFs with ‘MethylMinus’ motifs may depend on the interactions between forerunners and DNA demethylation enzymes to demethylate their binding sites. Thus, the identification of TRMs together with the information on methylation-related motifs provides information on how TFs may assemble at differentially methylated loci.

3.4.4 Brain ChIP-seq data support the TRMs predicted

Brain ChIP-seq data were used to determine the binding frequencies of transcription factors on the genomic sequences of each differentially methylated sub-cluster with an identified TRM (Figure 3.8). Not surprisingly, most TRMs were enriched for the binding sites of the corresponding transcription factors. For instance, the top TFs with binding sites enriched in AP-1-TRM are FOS and JUN proteins, the two sub-units of a typical AP-1 protein complex. Similar results were observed for Atoh-TRM, Egr1-TRM, Mef2-TRM, NF1-TRM, and Sox-TRM. Although no ChIP-seq study for Zic3 in mouse brain was available, genome-wide analysis of Zic3 binding sites in zebrafish embryos revealed a distribution biased towards distal intergenic regions that may act as functional enhancers [117]. Zic3-TRM is enriched for the binding sites of the chromatin domain boundary protein CTCF and its partner Smc1 (Structural maintenance of chromosomes protein 1), which is a key component of the cohesin ring complex [42]. Four transcription factors with binding sites depleted from Zic3-TRM but highly enriched in Zeb1-TRM are Nup153, Foxa2, Bmi1, and Smchd1 (Figure 3.8A). Although the interactions of Zeb1 with these transcription factors are not well studied, Zeb1, Foxa2, Bmi1, and Smchd1 were reported to be critical for epithelial mesenchymal transition [23, 26, 98, 101], an early neural developmental process in which the neuroepithelium of the dorsal neural tube gives rise to neural crest cells.

For both TET1KO and TET2KO methylomes, the neuronal activity-related transcription factor Npas4 and the co-factor CBP have binding sites clustered together in AP-1-TRM and Mef2-TRM (Figure 3.8A). On the other hand, the brain-development-related transcription factors Pax6, Tcf4, and Brn1 are enriched in Lhx-TRM and Sox-TRM (Figure 3.8B). Mef2-TRM, NF1-TRM, and Sox-TRM were identified for both TET1KO and TET2KO DMRs. Despite the similarity in the enrichment of corresponding transcription factors, a number of TFs show distinct binding preferences for these three TRMs identified in the mouse brains lacking the two different TET enzymes. In particular, Nup153 and Smchd1 binding sites are highly enriched in regions regulated by Mef2-TRM in the Tet2KO brain but depleted from those in the Tet1KO brain. Although little information is currently available regarding the interactions of TET1 and TET2 enzymes with transcription factors such as MEF2 family members, the result suggests that TET1 and TET2 are likely involved in two independent epigenetic regulatory pathways, i.e. via different combinations of TFs with MEF2.

Figure 3.8: The enrichment of TF binding sites in TET1KO and TET2KO TRMs. Each color value in the figures represents the estimate of the odds ratio based on the conditional Maximum Likelihood Estimate derived from Fisher’s exact test. For TET2KO, the two NF1 TRMs identified from cluster 1 and cluster 3 were combined and labeled as one NF1-TRM in the bottom panel.

3.4.5 Transcription factors under the influence of DNA methylation

As reported by past research, DNA methylation can affect TF binding positively, negatively, both positively and negatively, or not at all, resulting in different classes of motifs whose consensus sequences may or may not contain CGs. Bulk RNA-seq data were used to compare the change in expression level of the TFs predicted to be in the same TRM with the change in methylation level of the DMRs containing the TF motifs, for four types of motifs: (1) motifs with little effect from DNA methylation (Figure 3.9A), (2) methylated motifs with no CpG observed (Figure 3.9B), (3) MethylPlus motifs (Figure 3.9C) and (4) MethylMinus motifs (Figure 3.9D). In case (1), the expression level increased for Egr1, Egr2 and Klf9 whereas the methylation level decreased during development (Figure 3.9E, I). The expression level of Sp5, however, decreased during development. In case (2), the expression level of Mef2c/a/d increased during development whereas Mef2b expression did not show such a trend (Figure 3.9F). The methylation level of the DMRs containing MEF2 family motifs decreased during development (Figure 3.9J). In case (3), the expression levels of Lhx1, Lhx2 and Isl1 showed a decreasing trend whereas Lhx3, Nkx6-1 and Nanog did not show such a trend (Figure 3.9G). The methylation level of the DMRs containing these motifs first decreased and then increased (Figure 3.9K). Finally, in case (4), except for Sox15 and Sox10, the TFs from the SOX family showed a decrease in expression level (Figure 3.9H). The methylation level of the DMRs containing SOX family motifs increased and then peaked (Figure 3.9L).

Figure 3.9: (A:D) Four types of motifs shown. (A) DNA methylation has little effect on the Egr1 motif. (B) No CpG site was observed in the methylated Mef2c motif. (C) The primary motif of Lhx1 contained a CpG site. (D) The primary motif of Sox14 contained a CpG site. (E:H) Gene expression level (Transcripts Per Million or TPM, in log scale) shown for the genes in the predicted TRMs in Figure 3,4. (I:L) Average methylation level of the differentially methylated sites containing the motifs from the predicted TRMs in Figure 3,4. The drop in methylation level at the 6 week and 10 week time points is due to the fact that they were wild type samples in the DMS identification procedure.

3.5 Discussion

In this chapter, a computational pipeline was proposed to maximize the mechanistic understanding of methylation changes, and its application to brain methylomes derived from TET1KO and TET2KO mice was demonstrated. To our knowledge, this is the first toolkit taking advantage of several lines of information embedded in existing data sets: 1) methylome data sets collected for DMR clustering; 2) a motif database compiled for searching methylation-associated TFs; and 3) ChIP-seq data sets providing additional experimental evidence.

From a technical perspective, the approach consisted of two complementary steps: clustering based on methylation correlation and recursive motif identification based on sequence analysis. These two steps serve the important purpose of grouping genomic loci under the same epigenetic regulation mechanism and thus improve the likelihood of identifying TFs with motifs significantly enriched in each sub-group. Our pipeline accepts DMRs identified with different thresholds defined by various kinds of software designed for methylation data analysis. It is worth mentioning that TRMs identified with our pipeline are not based on a single DMR but rather on thousands of DMRs sharing a similar methylation profile. Therefore, a slight change in the list of DMRs is unlikely to result in striking differences in the TRMs predicted. More accurate grouping of DMRs sharing similar methylation patterns may be achieved with the increasing number of methylomes generated for diverse conditions, developmental stages and distinct cell types. The recursive TF motif identification makes it possible to further explore the differences within clusters at the sequence level and to reveal diverse mechanisms driving the epigenetic dynamics. The combination of these two steps also helped to explore subtle differences in epigenome regulation within a specific cell type at a given developmental stage.
Our pipeline adopted the WGCNA algorithm to group genomic regions with highly correlated methylation profiles. The weighted adjacency in WGCNA puts emphasis on pairs of regions with high correlation at the expense of pairs with low correlation, and the weight can be selected using the scale-free topology criterion [54]. WGCNA also scales to large inputs by splitting large matrices into smaller blocks that fit within the available RAM, which results in faster computation.

For differentially methylated regions, a window size comparable to the width of a typical ChIP-seq peak is recommended. In a recent study on defining features for ChIP-seq peak calling algorithms, the authors suggested treating the ChIP-seq peak as a 200 bp window surrounding the peak center [110]. They observed that the performance of some peak calling tools dropped when a shorter window width was set (75 bp per window).

Finally, HOMER motif analysis is limited to the set of TFs with a known motif position weight matrix (PWM). For instance, based on ChIP-seq data, Nup153, Bmi1, and Smchd1 were enriched in cluster 3 of TET1KO. However, no motif is documented for these three TFs in the HOMER or JASPAR [73] databases. In addition, some transcription factors recognize DNA structure (shape-based recognition) rather than a stretch of DNA sequence, and thus not all TFs have motifs documented in the motif databases.

From the biological perspective, a number of brain TRMs were identified that were composed of transcription factors from the same family. This result is consistent with the known fact that some TFs, such as Mef2 [119] and AP-1 [89] (a heterodimer composed of Fos and Jun proteins), tend to form homodimers or heterodimers with members of the same family. Notably, TFs from the same family may not act on the same locus simultaneously but may substitute for each other in sequential order during development.
For instance, Sox2 and Sox3 keep neuronal differentiation genes silent in neural progenitor cells and are replaced by Sox11 when terminal differentiation initiates [94]. The lack of comprehensive data sets could reduce the power of the pipeline to provide accurate predictions. For instance, the brain ChIP-seq data sets used in this study were generated from diverse neural stem cell lines and brain tissues at different developmental stages. Currently, very limited co-binding information is available for these transcription factors because few ChIP-seq data sets have been generated under the desired experimental conditions. Additionally, TRMs were identified from different sets of genomic loci, but these loci may interact with each other in 3D through chromatin folding. Future efforts are required to integrate chromatin configuration information into TRM modeling. Despite these limitations, the predicted TRMs can still provide useful starting points for further dedicated research, and the analytic procedure described in this study will assist in the ultimate interpretation of the causes and consequences of methylation alterations. Finally, the combination of HOMER with the recursive motif search algorithm for regulatory module identification may be applied to genomic regions of interest beyond the differentially methylated loci demonstrated in this study.

Chapter 4

Discussion and future work

The importance of DNA methylation has been firmly established for neuronal differentiation [76], neural plasticity [8, 49, 61, 71, 96] and functions [74, 78, 90]. During brain development, de novo DNA methylation occurs at the promoters of germ line-specific genes to repress pluripotency in progenitor cells, while methylation loss at promoters activates neuron-specific genes [76]. Disorders in the epigenetic machinery have been linked to many neurological diseases [36, 46, 112]. For instance, mutations in the DNA methyltransferase DNMT3B lead to defective brain development [37], and mutations in the methyl-cytosine binding protein MECP2 have been linked to Rett syndrome [3]. In addition, aberrant DNA methylation may lead to the premature activation of neuronal progenitor cells and, potentially, the development of brain tumors [30]. One of the concepts used to explain the pathological changes in an autistic brain is an increase in the ratio of excitatory to inhibitory (E/I) neuron activity. This proposition is based mainly on the symptoms associated with autism spectrum disorders (Figure 4.1). One experiment testing this hypothesis was performed on two of the six layers of the mouse cortex and showed that the increase in E/I ratio was a necessary adaptive compensation [103]. It is worthwhile to explore how transcription factor(s) capable of defining and classifying the two neuron types (excitatory and inhibitory) can be computationally predicted, and how the binding of these transcription factors changes during brain development.


Figure 4.1: Neurological disorders hypothesized to be due to a relative change in excitatory to inhibitory neuron activity. The two extremes show Rett syndrome, caused by less excitation and more inhibition, and autism, caused by more excitation and less inhibition.

4.1 TRM analysis with brain single cell methylomes

Computational pipelines that can predict transcriptional regulatory modules within genomic regions with specific properties of the DNA can be used to classify brain cell types, specifically excitatory and inhibitory neurons. As described in the previous chapter, a recursive motif identification approach was proposed to infer epigenetic transcriptional regulatory modules (TRMs) associated with differentially methylated regions (DMRs) in TET1KO and TET2KO. This pipeline can be used more generally to infer TRMs in any given set of regions of interest. Recent research has predicted excitatory and inhibitory cell types in the mouse brain through DNA methylation. Using this pipeline, DMRs were first grouped into co-methylated clusters according to their methylation correlations across a large number of methylomes. Within each cluster, the DMRs sharing similar methylation profiles are likely to be co-regulated by a common set of transcription factors, which are identified with a recursive motif search algorithm. To predict TRMs associated with brain cell specification, the pipeline was applied to brain single cell methylomes generated for 3,377 neurons derived from mouse frontal cortex [65]. On average, 1.4 million reads covering 4.7% of the mouse genome were obtained for each single neuron. According to their methylation profiles, the 3,377 neurons have been divided into 16 sub populations, including 10 excitatory neuron sub types and 6 inhibitory neuron sub types. For a given neuronal sub type, genomic regions with methylation levels significantly lower than in the other sub types were defined as CG-DMRs. Altogether, a total of 575,524 genomic loci covering 5.8% of the genome were determined as CG-DMRs for the 16 neuron sub types. 73.2% of these CG-DMRs were located more than 10 kb from the transcription start site, which suggests that distal enhancers may play a key role in brain cell specification.
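The recursive motif search applied within each cluster can be sketched as follows. This is a simplified illustration rather than the pipeline itself. It assumes motif occurrences have already been scanned into sets of region IDs, uses a one-sided hypergeometric test in place of HOMER's enrichment scoring, and, after recording the most significant motif, deletes that motif and sets aside the regions carrying it before the next round. The function names, the background set, the stopping threshold `alpha`, and the toy data are hypothetical, and the search for co-enriched motifs around each key TF is omitted.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when drawing n regions from N, of which K carry the motif."""
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / comb(N, n)


def recursive_motif_search(targets, background, hits, alpha=0.05, max_rounds=10):
    """Repeatedly pick the most enriched motif, record it, then delete it.

    targets and background are disjoint sets of region IDs; hits maps each
    motif name to the set of region IDs containing that motif.
    Returns a list of (motif, p_value, target_regions_with_motif).
    """
    remaining = set(targets)
    motifs = dict(hits)
    found = []
    for _ in range(max_rounds):
        if not remaining or not motifs:
            break
        N = len(remaining) + len(background)
        scored = []
        for name, regions in motifs.items():
            k = len(regions & remaining)
            K = k + len(regions & background)
            scored.append((hypergeom_tail(k, len(remaining), K, N), name))
        p_value, best = min(scored)
        if p_value > alpha:
            break
        with_motif = motifs.pop(best) & remaining
        found.append((best, p_value, with_motif))
        remaining -= with_motif  # next round looks only at unexplained regions
    return found


# Hypothetical cluster of ten DMRs and ten background regions.
targets = {f"t{i}" for i in range(1, 11)}
background = {f"b{i}" for i in range(1, 11)}
hits = {
    "Egr1": {"t1", "t2", "t3", "t4", "t5", "t6"},
    "Mef2c": {"t5", "t6", "t7", "t8", "t9", "t10", "b1"},
    "Sox3": {"t1", "b2", "b3", "b4", "b5"},
}
found = recursive_motif_search(targets, background, hits)
```

In this toy run the first round selects Egr1 and the second round, restricted to the four regions not explained by Egr1, selects Mef2c. A motif that is only the second most enriched in the full set can therefore become the top motif once the regions of the first are set aside.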
With the recursive motif identification approach, 20 unique TFs were identified in different clusters of the DMRs (Figure 4.2A,H): Egr1, Mef2b, Atoh1, Junb, Tbr1, Neurog2, Rfx6, Atf3, Nur77, Fra1, Tgif2, Nf1, Oct4, Rorgt, Lhx3, Sox3, Tcf21, Mafa, Ap4 and Nkx6.1. Some of the TFs showing higher enrichment in excitatory neurons were Egr1, Junb and Atoh1. TFs showing enrichment in both types of neurons were Mef2b, Nf1 and Oct4 (Pou5f1), whereas some of the TFs showing higher enrichment in the inhibitory neuron type were Mafa, Lhx3 and Sox3. TF classification within the two neuron types showed good agreement with expression levels as well as with the number of expressing cells for certain pairs of TFs, such as Egr1 and Junb, and Nf1 and Mef2d, which showed similar expression levels and were expressed in a similar number of single cells. Some of the “key” TFs were important during development, such as Mef2c, which was among the top TFs from postnatal day 0 up to 22 months. On the other hand, Sox11 was among the top TFs in embryonic days. In sorted cell types too, some of the “key” TFs were among the top enriched, such as Sox17 in endothelial cells, Nkx6.2 in mature oligodendrocytes, and Lhx5 in neurons.

4.2 Cell type specific PPI obtained through ETRM prediction

The ETRMs for the same “key TF” showed differences in composition (different co-enriched motifs) across the neuron sub types, meaning that the co-enriched motifs were neuron sub type specific. To compare neuron sub type specific motif co-enrichment, a few TFs were selected: Egr1, Mef2b and Bach2. In this context, a co-enriched motif is defined as one with an enrichment p-value of 10^-10 or less near the “key TF”. Comparing the results (Figure 4.2B), it was found that for Egr1, most regions were hypo-methylated in the excitatory neurons except for mL6-1 and mIn-1. Egr1 was predicted as a key TF in mL4, mL23, mL5-1 and mDL-2. It was significantly enriched in mDL-1 (10^-43), mDL-3 (10^-88), mIn-1 (10^-37), mL5-2 (10^-44) and mL6-2 (10^-153). Among the excitatory neurons, it was not enriched only in mL6-1, and it was not enriched in any inhibitory neuron.

Figure 4.2: Applying recursive motif enrichment to the DMR sets of the 16 single cell methylome sub types, 20 key TFs were predicted. (A) Fisher's exact test showing the enrichment of the TFs in the 16 DMR sets. (B):(G) Methylation profiles across the 16 DMR sets occupied by the key TFs show that the DMRs of most excitatory neurons containing the Egr1 motif are hypo-methylated, whereas DMRs containing the Mef2b motif are hypo-methylated in both excitatory and inhibitory neurons. Finally, only specific DMRs containing Nur77, Tgif2 and Bach2 motifs are hypo-methylated. (H) The 20 TFs can be broadly classified as enriched in excitatory neurons, enriched in both excitatory and inhibitory neurons, or enriched in inhibitory neurons only.

Mef2b was a key TF in mIn-1, mL23, mL4, mL5-1, mL5-2, mL6-1, mNdnf-1, mNdnf-2, mPv, mSst-2 and mVip (Figure 4.2C). It was enriched in mDL-1 (10^-60), mDL-2 (10^-114), mDL-3 (10^-97), mL6-2 (10^-185) and mSst-1 (10^-60). Bach2 was a key TF in mPv (Figure 4.2G). It was enriched in mDL-1 (10^-60), mDL-2 (10^-204), mDL-3 (10^-79), mL23 (10^-407), mL4 (10^-229), mL5-2 (10^-64), mL6-2 (10^-81), mSst-1 (10^-64) and mSst-2 (10^-26). It was not enriched in mIn-1, mL5-1, mL6-1, mNdnf-1, mNdnf-2 and mVip. Combining these results, a cell type specific protein-protein interaction (PPI) network has been proposed for three key TFs, Egr1, Mef2b and Bach2 (Figure 4.3). Such a PPI network will not only show how a TF may selectively recruit another binding partner in a specific neuron type but will also support research on the impact of excitatory and inhibitory imbalance in certain neurological disorders [85].
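The enrichment values listed above can be tabulated to see which neuron sub types are shared between the three key TFs, which is one possible starting point for cell type specific edges. The sketch stores each p-value as its log10 exponent, because a value such as 10^-407 is smaller than the smallest positive double-precision float and would be stored as zero. Applying the 10^-10 co-enrichment cutoff to these values and treating shared sub types as candidate edges are assumptions made for illustration. Sub types in which a TF was itself the key TF are not included.

```python
from itertools import combinations

# log10 enrichment p-values as reported in the text (exponents only).
LOG10_P = {
    "Egr1": {"mDL-1": -43, "mDL-3": -88, "mIn-1": -37, "mL5-2": -44,
             "mL6-2": -153},
    "Mef2b": {"mDL-1": -60, "mDL-2": -114, "mDL-3": -97, "mL6-2": -185,
              "mSst-1": -60},
    "Bach2": {"mDL-1": -60, "mDL-2": -204, "mDL-3": -79, "mL23": -407,
              "mL4": -229, "mL5-2": -64, "mL6-2": -81, "mSst-1": -64,
              "mSst-2": -26},
}
LOG10_CUTOFF = -10  # illustrative threshold: p <= 10^-10


def enriched_subtypes(tf):
    """Sub types in which the TF passes the cutoff."""
    return {s for s, log_p in LOG10_P[tf].items() if log_p <= LOG10_CUTOFF}


def shared_subtypes():
    """Map each TF pair to the sorted sub types in which both are enriched."""
    edges = {}
    for a, b in combinations(LOG10_P, 2):
        edges[(a, b)] = sorted(enriched_subtypes(a) & enriched_subtypes(b))
    return edges


edges = shared_subtypes()
# Sub types in which all three TFs are enriched.
common = sorted(set.intersection(*(enriched_subtypes(tf) for tf in LOG10_P)))
```

With the values above, all three TFs are enriched together in mDL-1, mDL-3 and mL6-2, while Mef2b and Bach2 additionally share mDL-2 and mSst-1.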

4.3 Co-expression of ETRMs in single cell and bulk RNA-seq

TF classification within the excitatory and inhibitory types showed good agreement with single cell RNA-seq data and with bulk RNA-seq data from development for specific groups of TFs, such as (1) Egr1, Junb and Klf6, (2) Nf1 and Mef2d, and (3) Bach2 and Lhx2. These TFs showed similar expression levels and were expressed in a similar number of single cells, as inferred from single cell RNA-seq analysis. Interestingly, Mef2a and Klf9 showed similar behavior in single cell RNA-seq. The motif of Klf9 is very similar to the Egr1 motif, and both Egr1 and Mef2b ETRMs were predicted in two excitatory neurons, mL23 and mL4. In bulk RNA-seq data, Egr1, Junb and Tbr1, which showed higher enrichment in excitatory neurons than in inhibitory neurons, showed an increase in gene expression level during development.

Figure 4.3: Cell type specific protein interaction network generated from the predicted ETRMs for three key TFs: Egr1, Mef2b and Bach2. The three nodes represent the three TFs and the edges represent the co-enriched motifs in a specific neuron sub type. Based on the co-enrichment, cell type specific edges, and hence sub networks, can be constructed.

Figure 4.4: Expression level of general and neuron sub type specific TFs and some of the co-enriched motifs from the corresponding ETRMs during development in the forebrain. (A) Expression goes up during development, specifically at the postnatal day. (B) Expression reaches its peak during development after the postnatal day. (C) Contrasting patterns in expression are observed for the four TFs. Interestingly, although both the Tgif2 and Nur77 motifs were highly significant in the mDL-3 excitatory neuron, the former shows a decrease in expression during development whereas the latter shows an increase.

Mef2b and Nf1, which were enriched in both excitatory and inhibitory neurons, did not share the same trend in expression level during development. However, Mef2d and Nf1 both showed an increase in expression during development. Contrasting patterns of gene expression during development were observed for Nur77 and Tgif2, although both motifs were significantly enriched in the mDL-3 excitatory neuron. Bach2, which was highly enriched in the mPv neuron, showed a decrease in expression during development.
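The single cell comparison used in this section, where TFs are compared by expression level and by the number of cells expressing them, can be sketched as below. The expression values, the detection rule (a value above zero counts as expressed) and the 25% relative tolerances used to call two TFs similar are hypothetical, since the criteria actually used are not restated here.

```python
from statistics import mean


def summarize(counts):
    """Return {tf: (mean expression in expressing cells, number of expressing cells)}."""
    summary = {}
    for tf, values in counts.items():
        expressed = [v for v in values if v > 0]
        summary[tf] = (mean(expressed) if expressed else 0.0, len(expressed))
    return summary


def similar(a, b, summary, level_tol=0.25, cell_tol=0.25):
    """True if two TFs differ by at most the given relative tolerances."""
    (level_a, cells_a), (level_b, cells_b) = summary[a], summary[b]
    level_ok = abs(level_a - level_b) <= level_tol * max(level_a, level_b)
    cells_ok = abs(cells_a - cells_b) <= cell_tol * max(cells_a, cells_b)
    return level_ok and cells_ok


# Hypothetical normalized expression values for eight single cells.
counts = {
    "Egr1": [5.0, 0.0, 4.0, 6.0, 0.0, 5.0, 0.0, 4.0],
    "Junb": [4.5, 0.0, 5.0, 5.5, 0.0, 0.0, 4.0, 5.0],
    "Bach2": [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.5],
}
summary = summarize(counts)
egr1_junb_similar = similar("Egr1", "Junb", summary)
egr1_bach2_similar = similar("Egr1", "Bach2", summary)
```

Requiring agreement on both quantities separates a TF that is weakly expressed in many cells from one that is strongly expressed in a few, which a single average over all cells would not distinguish.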

4.4 Summary

Research from the past few decades has reinforced the idea of mining the vast repositories of biological data with machine learning and statistical tools to generate insights from the data and to act on those insights. Machine learning and statistical models have been used in the past decades to understand complex biological systems, from studying diseases to developing precision medicines [75]. If data generation was the bottleneck in the past, data analysis and inference have become the main concern now [121]. Further, most biological systems cannot be fully understood without exploiting multi-omics data, which poses the further problem of integrating multiple heterogeneous data sets with very high dimensions. This dissertation has followed the path of integrating multi-omics data, applying mathematical models to describe the data, deriving insights from the data and proposing next steps to validate the insights. Specifically, in the first project, a non-parametric Bayesian clustering approach was applied to transcription factor binding sites within distinct chromatin states to predict genome wide protein-protein interactions across different genomic loci. In the second project, a novel recursive motif identification algorithm was applied to transcription factor motifs to predict genome wide protein-protein interactions in co-methylated genomic loci. While there is definite scope for improvement in the work, it has led to the design of preliminary biological experiments to confirm the co-binding of some of the transcription factors of interest predicted in brain development. Such predictions, if validated, can be used to further develop data driven approaches that complement experiments in brain development, drug discovery in precision medicine and yield improvement in agriculture.

Figure 4.5: Co-expression of specific TFs within single cell RNA-seq data. TF pairs such as Egr1 and Klf6, and Mef2d and Nf1, are shown to have similar expression levels and to be expressed in a similar number of cells. The Klf motif was significantly co-enriched in the Egr1 ETRM, and the Nf1 motif was also significantly enriched in the Mef2 ETRM.

4.5 Contribution of the dissertation

1. Histone marks affecting transcription factor binding: Genome wide transcription factor co-binding has been predicted with mathematical models in the past. In this dissertation, a systematic study of transcriptional co-binding within distinct chromatin states has been performed using a two step process. In the first step, a genome segmentation tool based on a Markov model has been used, and in the second step a novel Bayesian non-parametric clustering algorithm has been proposed to predict different clusters of transcription factors. The predicted clusters of transcription factors can be thought of as a general representation of TF-TF co-binding within the chromatin states. Based on a literature survey, this is the first study to perform genome wide prediction of multiple transcription factor co-binding across chromatin states.

2. DNA-methylation affecting transcription factor binding: DNA-methylation is a stable epigenetic mark that is capable of classifying different cell types. In this dissertation, transcriptional regulatory modules sharing co-methylated genomic loci were predicted. Recursive deletion of the most significant transcription factor motif is performed at each iteration, and co-enriched motifs are predicted using a differential motif discovery algorithm. One of the key findings from this project has been the possible interaction between two transcription factors, Egr1 and the Mef2 family, which have been studied individually in the past and are known to be crucial for excitatory and inhibitory neurons but have not been studied jointly. An integrated motif database from published literature has helped in understanding the different binding specificities of transcription factor motifs from the same DNA-binding domain.

3. The ongoing work in this dissertation builds on the recursive algorithm proposed in the second project to differentiate between two classes of neurons: excitatory and inhibitory neurons. There are two main hypotheses for this project: (1) certain neurological disorders often show an imbalance of excitatory to inhibitory activity; (2) specific transcription factors are activated in excitatory and inhibitory neurons. An integrative analysis of multi-omics data, analysing transcription factor binding motifs, bulk RNA-seq data and single cell RNA-seq data, can help predict cell type specific protein interaction networks. Such networks are not only useful for differentiating brain cell types but can also be extended to other cell types where the interest is in predicting groups of transcription factors that have different preferences in different cells.

Bibliography

[1] Qudeer Ahmed Abdul, Byung Pal Yu, Hae Young Chung, Hyun Ah Jung, and Jae Sue Choi. Epigenetic modifications of gene expression by lifestyle and environment. Archives of pharmacal research, 40(11):1219–1237, 2017.

[2] C David Allis and Thomas Jenuwein. The molecular hallmarks of epigenetic control. Nature Reviews Genetics, 17(8):487, 2016.

[3] Ruthie E Amir, Ignatia B Van den Veyver, Mimi Wan, Charles Q Tran, Uta Francke, and Huda Y Zoghbi. Rett syndrome is caused by mutations in x-linked mecp2, encoding methyl-cpg-binding protein 2. Nature genetics, 23(2):185, 1999.

[4] Carlos L Araya, Trupti Kawli, Anshul Kundaje, Lixia Jiang, Beijing Wu, Dionne Vafeados, Robert Terrell, Peter Weissdepp, Louis Gevirtzman, Daniel Mace, et al. Regulatory analysis of the c. elegans genome with spatiotemporal resolution. Nature, 512(7515):400–405, 2014.

[5] Timothy L Bailey, Nadya Williams, Chris Misleh, and Wilfred W Li. Meme: discovering and analyzing dna and protein sequence motifs. Nucleic acids research, 34 (suppl 2):W369–W373, 2006.

[6] Timothy L Bailey, Mikael Boden, Fabian A Buske, Martin Frith, Charles E Grant, Luca Clementi, Jingyuan Ren, Wilfred W Li, and William S Noble. Meme suite: tools for motif discovery and searching. Nucleic acids research, 37(suppl 2):W202–W208, 2009.

[7] Abha Singh Bais, Naftali Kaminski, and Panayiotis V Benos. Finding subtypes of transcription factor motif pairs with distinct regulatory roles. Nucleic acids research, 39(11):e76–e76, 2011.

[8] Nurit Ballas, Christopher Grunseich, Diane D Lu, Joan C Speh, and Gail Mandel. Rest and its corepressors mediate plasticity of neuronal gene chromatin throughout neurogenesis. Cell, 121(4):645–657, 2005.

[9] Tim Bancroft, Chuanlong Du, and Dan Nettleton. Estimation of false discovery rate using sequential permutation p-values. Biometrics, 69(1):1–7, 2013.

[10] Sharmi Banerjee, Hongxiao Zhu, Man Tang, Xiaowei Wu, Wuchun Feng, and Hehuang Xie. Identifying transcriptional regulatory modules among different chromatin states in mouse neural stem cells. Frontiers in genetics, 9:731, 2018.

[11] Sharmi Banerjee, Xiaoran Wei, and Hehuang Xie. Recursive motif analyses identify brain epigenetic transcription regulatory modules. Computational and structural biotechnology journal, 17:507–515, 2019.

[12] Resham Bhattacharya, Soumyajit Banerjee Mustafi, Mark Street, Anindya Dey, and Shailendra Kumar Dhar Dwivedi. Bmi-1: At the crossroads of physiological and pathological biology. Genes & diseases, 2(3):225–239, 2015.

[13] Adam Blattler and Peggy J Farnham. Cross-talk between site-specific transcription factors and dna methylation states. Journal of Biological Chemistry, 288(48):34287–34294, 2013.

[14] Brenda L Bloodgood, Nikhil Sharma, Heidi Adlman Browne, Alissa Z Trepman, and Michael E Greenberg. The activity-dependent transcription factor npas4 regulates domain-specific inhibition. Nature, 503(7474):121, 2013.

[15] Joan Boyes and Adrian Bird. Dna methylation inhibits transcription indirectly via a methyl-cpg binding protein. Cell, 64(6):1123–1134, 1991.

[16] Curtis Bradney, Mark Hjelmeland, Yasuhiko Komatsu, Minoru Yoshida, Tso-Pang Yao, and Yuan Zhuang. Regulation of e2a activities by histone acetyltransferases in b lymphocyte development. Journal of Biological Chemistry, 278(4):2370–2376, 2003.

[17] Kenneth P Burnham and David R Anderson. Multimodel inference: understanding aic and bic in model selection. Sociological methods & research, 33(2):261–304, 2004.

[18] Maria Cha and Qing Zhou. Detecting clustering and ordering binding patterns among transcription factors via point process models. , 30(16):2263–2271, 2014.

[19] Malika Charrad, Nadia Ghazzali, Veronique Boiteau, Azam Niknafs, and Maintainer Malika Charrad. Package nbclust. Journal of Statistical Software, 61:1–36, 2014.

[20] Kelan Chen, Jiang Hu, Darcy L Moore, Ruijie Liu, Sarah A Kessans, Kelsey Breslin, Isabelle S Lucet, Andrew Keniry, Huei San Leong, Clare L Parish, et al. Genome-wide binding and mechanistic analyses of smchd1-mediated epigenetic regulation. Proceedings of the National Academy of Sciences, 112(27):E3535–E3544, 2015.

[21] Xi Chen, Han Xu, Ping Yuan, Fang Fang, Mikael Huss, Vinsensius B Vega, Eleanor Wong, Yuriy L Orlov, Weiwei Zhang, Jianming Jiang, et al. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell, 133(6):1106–1117, 2008.

[22] ENCODE Project Consortium et al. An integrated encyclopedia of dna elements in the human genome. Nature, 489(7414):57–74, 2012.

[23] Ita Costello, Sonja Nowotschin, Xin Sun, Arne W Mould, Anna-Katerina Hadjantonakis, Elizabeth K Bikoff, and Elizabeth J Robertson. Lhx1 functions together with otx2, foxa2, and ldb1 to govern anterior mesendoderm, node, and midline development. Genes & development, 29(20):2108–2122, 2015.

[24] Isabel Cuesta, Kenneth S Zaret, and Pilar Santisteban. The forkhead factor foxe1 binds to the thyroperoxidase promoter during thyroid cell differentiation and modifies compacted chromatin structure. Molecular and cellular biology, 27(20):7302–7314, 2007.

[25] Darren A Cusanovich, Bryan Pavlovic, Jonathan K Pritchard, and Yoav Gilad. The functional consequences of variation in transcription factor binding. PLoS genetics, 10 (3):e1004226, 2014.

[26] Peixin Dong, Masanori Kaneuchi, Hidemichi Watari, Junichi Hamada, Satoko Sudo, Jingfang Ju, and Noriaki Sakuragi. Microrna-194 inhibits epithelial to mesenchymal transition of endometrial cancer cells by targeting oncogene bmi-1. Molecular cancer, 10(1):99, 2011.

[27] Florian Duclot and Mohamed Kabbaj. The role of early growth response 1 (egr1) in brain plasticity and neuropsychiatric disorders. Frontiers in behavioral neuroscience, 11:35, 2017.

[28] Jason Ernst and Manolis Kellis. Chromhmm: automating chromatin-state discovery and characterization. Nature methods, 9(3):215–216, 2012.

[29] Conchi Estarás, Naiara Akizu, Alejandra García, Sergi Beltrán, Xavier de la Cruz, and Marian A Martínez-Balbás. Genome-wide analysis reveals that smad3 and jmjd3 hdm co-activate the neural developmental program. Development, 139(15):2681–2691, 2012.

[30] Mirco Fanelli, S Caprodossi, L Ricci-Vitiani, A Porcellini, F Tomassoni-Ardori, Stefano Amatori, F Andreoni, Mauro Magnani, R De Maria, A Santoni, et al. Loss of pericentromeric dna methylation pattern in human glioblastoma is associated with altered dna methyltransferases expression and involves the stem cell compartment. Oncogene, 27(3):358, 2008.

[31] Xuefeng Fang, Jae-Geun Yoon, Lisha Li, Wei Yu, Jiaofang Shao, Dasong Hua, Shu Zheng, Leroy Hood, David R Goodlett, Gregory Foltz, et al. The sox2 response program in glioblastoma multiforme: an integrated chip-seq, expression microarray, and microrna analysis. BMC genomics, 12(1):11, 2011.

[32] Jianxing Feng, Tao Liu, Bo Qin, Yong Zhang, and Xiaole Shirley Liu. Identifying chip-seq enrichment using macs. Nature protocols, 7(9):1728–1740, 2012.

[33] R Gao and Peter Penzes. Common mechanisms of excitatory and inhibitory imbalance in schizophrenia and autism spectrum disorders. Current molecular medicine, 15(2): 146–167, 2015.

[34] Anne-Valerie Gendrel, Anwyn Apedaile, Heather Coker, Ausma Termanis, Ilona Zvetkova, Jonathan Godwin, Y Amy Tang, Derek Huntley, Giovanni Montana, Steven Taylor, et al. Smchd1-dependent and-independent pathways determine developmental dynamics of cpg island methylation on the inactive x chromosome. Developmental cell, 23(2):265–279, 2012.

[35] Mubeen Goolam, Antonio Scialdone, Sarah JL Graham, Iain C Macaulay, Agnieszka Jedrusik, Anna Hupalowska, Thierry Voet, John C Marioni, and Magdalena Zernicka-Goetz. Heterogeneity in oct4 and sox2 targets biases cell fate in 4-cell mouse embryos. Cell, 165(1):61–74, 2016.

[36] Monika Gos. Epigenetic mechanisms of gene expression regulation in neurological diseases. Acta Neurobiol Exp, 73:19–37, 2013.

[37] R Scott Hansen, Cisca Wijmenga, Ping Luo, Ann M Stanek, Theresa K Canfield, Corry MR Weemaes, and Stanley M Gartler. The dnmt3b dna methyltransferase gene is mutated in the icf immunodeficiency syndrome. Proceedings of the National Academy of Sciences, 96(25):14412–14417, 1999.

[38] Christopher T Harbison, D Benjamin Gordon, Tong Ihn Lee, Nicola J Rinaldi, Kenzie D Macisaac, Timothy W Danford, Nancy M Hannett, Jean-Bosco Tagne, David B Reynolds, Jane Yoo, et al. Transcriptional regulatory code of a eukaryotic genome. Nature, 431(7004):99, 2004.

[39] Jianlin He, Xiguang Xu, Aboozar Monavarfeshani, Sharmi Banerjee, Michael A Fox, and Hehuang Xie. Retinal-input-induced epigenetic dynamics in the developing mouse dorsal lateral geniculate nucleus. Epigenetics & chromatin, 12(1):13, 2019.

[40] Sven Heinz, Christopher Benner, Nathanael Spann, Eric Bertolino, Yin C Lin, Peter Laslo, Jason X Cheng, Cornelis Murre, Harinder Singh, and Christopher K Glass. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and b cell identities. Molecular cell, 38(4):576–589, 2010.

[41] Michael M Hoffman, Orion J Buske, Jie Wang, , Jeff A Bilmes, and William Stafford Noble. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature methods, 9(5):473–476, 2012.

[42] Sjoerd Johannes Bastiaan Holwerda and Wouter de Laat. Ctcf: the protein, the binding partners, the binding sites and their chromatin loops. Philosophical Transactions of the Royal Society B: Biological Sciences, 368(1620):20120369, 2013.

[43] Shaohui Hu, Jun Wan, Yijing Su, Qifeng Song, Yaxue Zeng, Ha Nam Nguyen, Jaehoon Shin, Eric Cox, Hee Sool Rho, Crystal Woodard, et al. Dna methylation presents distinct binding sites for human transcription factors. Elife, 2, 2013.

[44] Itaru Imayoshi and Ryoichiro Kageyama. bhlh factors in self-renewal, multipotency, and fate choice of neural progenitor cells. Neuron, 82(1):9–23, 2014.

[45] Takashi Inoue, Maya Ota, Miyuki Ogawa, Katsuhiko Mikoshiba, and Jun Aruga. Zic1 and zic3 regulate medial forebrain development through expansion of neuronal progenitors. Journal of Neuroscience, 27(20):5461–5473, 2007.

[46] Mira Jakovcevski and Schahram Akbarian. Epigenetic mechanisms in neurological disease. Nature medicine, 18(8):1194, 2012.

[47] Hongkai Ji, Steven A Vokes, and Wing H Wong. A comparative analysis of genome-wide chromatin immunoprecipitation data for mammalian transcription factors. Nucleic acids research, 34(21):e146–e146, 2006.

[48] Arttu Jolma, Yimeng Yin, Kazuhiro R Nitta, Kashyap Dave, Alexander Popov, Minna Taipale, Martin Enge, Teemu Kivioja, Ekaterina Morgunova, and Jussi Taipale. Dna-dependent formation of transcription factor pairs alters their binding specificity. Nature, 527(7578):384–388, 2015.

[49] Jun Kohyama, Takuro Kojima, Eriko Takatsuka, Toru Yamashita, Jun Namiki, Jenny Hsieh, Fred H Gage, Masakazu Namihira, Hideyuki Okano, Kazunobu Sawamoto, et al. Epigenetic regulation of neural cell differentiation plasticity in the adult mammalian brain. Proceedings of the National Academy of Sciences, 105(46):18012–18017, 2008.

[50] Tony Kouzarides. Chromatin modifications and their function. Cell, 128(4):693–705, 2007.

[51] Martin Krzywinski, Jacqueline Schein, Inanc Birol, Joseph Connors, Randy Gascoyne, Doug Horsman, Steven J Jones, and Marco A Marra. Circos: an information aesthetic for comparative genomics. Genome research, 19(9):1639–1645, 2009.

[52] Ivan V Kulakovskiy, Alexander V Favorov, and Vsevolod J Makeev. Motif discovery and motif finding from genome-mapped dnase footprint data. Bioinformatics, 25(18): 2318–2325, 2009.

[53] Samuel A Lambert, Arttu Jolma, Laura F Campitelli, Pratyush K Das, Yimeng Yin, Mihai Albu, Xiaoting Chen, Jussi Taipale, Timothy R Hughes, and Matthew T Weirauch. The human transcription factors. Cell, 172(4):650–665, 2018.

[54] Peter Langfelder and Steve Horvath. Wgcna: an r package for weighted correlation network analysis. BMC bioinformatics, 9(1):559, 2008.

[55] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with bowtie 2. Nature methods, 9(4):357–359, 2012.

[56] Yuju Lee and Qing Zhou. Co-regulation in embryonic stem cells via context-dependent binding of transcription factors. Bioinformatics, 29(17):2162–2168, 2013.

[57] Bo Li and Colin N Dewey. Rsem: accurate transcript quantification from rna-seq data with or without a reference genome. BMC bioinformatics, 12(1):323, 2011.

[58] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. The sequence alignment/map format and samtools. Bioinformatics, 25(16):2078–2079, 2009.

[59] Wen-Hui Lien, Lisa Polak, Mingyan Lin, Kenneth Lay, Deyou Zheng, and Elaine Fuchs. In vivo transcriptional governance of hair follicle stem cells by canonical wnt regulators. Nature cell biology, 16(2):179, 2014.

[60] Finn Lindgren, Håvard Rue, and Johan Lindström. An explicit link between gaussian fields and gaussian markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423–498, 2011.

[61] Jia Liu and Patrizia Casaccia. Epigenetic regulation of oligodendrocyte identity. Trends in neurosciences, 33(4):193–201, 2010.

[62] Liang Liu, Guangxu Jin, and Xiaobo Zhou. Modeling the relationship of epigenetic modifications to transcription factor binding. Nucleic acids research, 43(8):3873–3885, 2015.

[63] Liang Liu, Weiling Zhao, and Xiaobo Zhou. Modeling co-occupancy of transcription factors using chromatin features. Nucleic acids research, 44(5):e49–e49, 2016.

[64] Ruipeng Lu, Eliseos J Mucaki, and Peter K Rogan. Discovery and validation of information theory-based transcription factor and cofactor binding site motifs. Nucleic acids research, 45(5):e27–e27, 2016.

[65] Chongyuan Luo, Christopher L Keown, Laurie Kurihara, Jingtian Zhou, Yupeng He, Junhao Li, Rosa Castanon, Jacinta Lucero, Joseph R Nery, Justin P Sandoval, et al. Single-cell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex. Science, 357(6351):600–604, 2017.

[66] Yanting Luo, Jianlin He, Xiguang Xu, Ming-an Sun, Xiaowei Wu, Xuemei Lu, and Hehuang Xie. Integrative single-cell omics analyses reveal epigenetic heterogeneity in mouse embryonic stem cells. PLoS computational biology, 14(3):e1006034, 2018.

[67] Buyong Ma, Chung-Jung Tsai, Yongping Pan, and . Why does binding of proteins to dna or proteins to proteins not necessarily spell function?, 2010.

[68] Kenzie D MacIsaac, Ting Wang, D Benjamin Gordon, David K Gifford, Gary D Stormo, and Ernest Fraenkel. An improved map of conserved regulatory sites for saccharomyces cerevisiae. BMC bioinformatics, 7(1):113, 2006.

[69] Eugenio Marco, Wouter Meuleman, Jialiang Huang, Kimberly Glass, Luca Pinello, Jianrong Wang, Manolis Kellis, and Guo-Cheng Yuan. Multi-scale chromatin state annotation using a hierarchical hidden markov model. Nature Communications, 8, 2017.

[70] Stephen J Martin, Paul D Grimwood, and Richard GM Morris. Synaptic plasticity and memory: an evaluation of the hypothesis. Annual review of neuroscience, 23(1):649–711, 2000.

[71] Keri Martinowich, Daisuke Hattori, Hao Wu, Shaun Fouse, Fei He, Yan Hu, Guoping Fan, and Yi E Sun. Dna methylation-related chromatin remodeling in activity-dependent bdnf gene regulation. Science, 302(5646):890–893, 2003.

[72] Juan L Mateo, Debbie LC van den Berg, Maximilian Haeussler, Daniela Drechsel, Zachary B Gaber, Diogo S Castro, Paul Robson, Q Richard Lu, Gregory E Crawford, Paul Flicek, et al. Characterization of the neural stem cell gene regulatory network identifies olig2 as a multifunctional regulator of self-renewal. Genome research, 25(1):41–56, 2015.

[73] Anthony Mathelier, Oriol Fornes, David J Arenillas, Chih-yu Chen, Grégoire Denay, Jessica Lee, Wenqiang Shi, Casper Shyr, Ge Tan, Rebecca Worsley-Hunt, et al. Jaspar 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic acids research, 44(D1):D110–D115, 2016.

[74] Michael J Meaney and Anne C Ferguson-Smith. Epigenetic regulation of the neural transcriptome: the meaning of the marks. Nature neuroscience, 13(11):1313, 2010.

[75] Riccardo Miotto, Fei Wang, Shuang Wang, Xiaoqian Jiang, and Joel T Dudley. Deep learning for healthcare: review, opportunities and challenges. Briefings in bioinformatics, 19(6):1236–1246, 2017.

[76] Fabio Mohn, Michael Weber, Michael Rebhan, Tim C Roloff, Jens Richter, Michael B Stadler, Miriam Bibel, and Dirk Schübeler. Lineage-specific polycomb targets and de novo dna methylation define restriction and potential of neuronal progenitors. Molecular cell, 30(6):755–766, 2008.

[77] Radford M Neal. Markov chain sampling methods for dirichlet process mixture models. Journal of computational and graphical statistics, 9(2):249–265, 2000.

[78] Erika D Nelson, Ege T Kavalali, and Lisa M Monteggia. Activity-dependent suppression of miniature neurotransmission through the regulation of dna methylation. Journal of Neuroscience, 28(2):395–406, 2008.

[79] Sacha B Nelson and Vera Valakh. Excitatory/inhibitory balance and circuit homeostasis in autism spectrum disorders. Neuron, 87(4):684–698, 2015.

[80] Richard M Neve, Koei Chin, Jane Fridlyand, Jennifer Yeh, Frederick L Baehner, Tea Fevr, Laura Clark, Nora Bayani, Jean-Philippe Coppe, Frances Tong, et al. A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer cell, 10(6):515–527, 2006.

[81] Kazuhiro R Nitta, Arttu Jolma, Yimeng Yin, Ekaterina Morgunova, Teemu Kivioja, Junaid Akhtar, Korneel Hens, Jarkko Toivonen, Bart Deplancke, Eileen EM Furlong, et al. Conservation of transcription factor binding specificities across 600 million years of bilateria evolution. Elife, 4:e04837, 2015.

[82] Anja Nitzsche, Maciej Paszkowski-Rogacz, Filomena Matarese, Eva M Janssen-Megens, Nina C Hubner, Herbert Schulz, Ingrid de Vries, Li Ding, Norbert Huebner, Matthias Mann, et al. Rad21 cooperates with pluripotency transcription factors in the maintenance of embryonic stem cell identity. PloS one, 6(5):e19470, 2011.

[83] Shinya Oki, Tazro Ohta, Go Shioi, Hideki Hatanaka, Osamu Ogasawara, Yoshihiro Okuda, Hideya Kawaji, Ryo Nakaki, Jun Sese, and Chikara Meno. Chip-atlas: a data-mining suite powered by full integration of public chip-seq data. EMBO reports, 19(12), 2018.

[84] Yuriy L Orlov, Mikael E Huss, Roy Joseph, Han Xu, Vinsensius B Vega, Yew K Lee, Wee S Goh, Jane S Thomsen, Edwin C-W Cheung, Neil D Clarke, et al. Genome-wide statistical analysis of multiple transcription factor binding sites obtained by chip-seq technologies. In Proceedings of the 1st ACM workshop on Breaking frontiers of computational biology, pages 11–18. ACM, 2009.

[85] Emily K Osterweil. Upsetting the excitatory-inhibitory balance hypothesis of autism. Science Translational Medicine, 11(484):eaax2730, 2019.

[86] Mei-Ren Pan, Guang Peng, Wen-Chun Hung, and Shiaw-Yih Lin. Monoubiquitination of h2ax protein regulates dna damage response signaling. Journal of Biological Chemistry, 286(32):28599–28607, 2011.

[87] Theresa Phillips. The role of methylation in gene expression. Nature Education, 1(1):116, 2008.

[88] Adam R Prickett, Nikolaos Barkas, Ruth B McCole, Siobhan Hughes, Samuele M Amante, Reiner Schulz, and Rebecca J Oakey. Genome-wide and parental allele-specific analysis of ctcf and cohesin dna binding in mouse brain reveals a tissue-specific binding pattern and an association with imprinted differentially methylated regions. Genome research, 23(10):1624–1635, 2013.

[89] Juan R Riesgo-Escovar and Ernst Hafen. Common and distinct roles of dfos and djun during drosophila development. Science, 278(5338):669–672, 1997.

[90] Tania L Roth, Eric D Roth, and J David Sweatt. Epigenetic regulation of genes in learning and memory. Essays in biochemistry, 48:263–274, 2010.

[91] Joel Rozowsky, Ghia Euskirchen, Raymond K Auerbach, Zhengdong D Zhang, Theodore Gibson, Robert Bjornson, Nicholas Carriero, Michael Snyder, and Mark B Gerstein. Peakseq enables systematic scoring of chip-seq experiments relative to controls. Nature biotechnology, 27(1):66–75, 2009.

[92] Havard Rue and Leonhard Held. Gaussian Markov random fields: theory and applications. CRC press, 2005.

[93] Håvard Rue, Sara Martino, and Nicolas Chopin. Approximate bayesian inference for latent gaussian models by using integrated nested laplace approximations. Journal of the royal statistical society: Series b (statistical methodology), 71(2):319–392, 2009.

[94] Abby Sarkar and Konrad Hochedlinger. The sox family of transcription factors: versatile regulators of stem and progenitor cell fate. Cell stem cell, 12(1):15–30, 2013.

[95] Rahul Satija, Jeffrey A Farrell, David Gennert, Alexander F Schier, and . Spatial reconstruction of single-cell gene expression data. Nature biotechnology, 33(5): 495, 2015.

[96] Hiroki Setoguchi, Masakazu Namihira, Jun Kohyama, Hirotsugu Asano, Tsukasa Sanosaka, and Kinichi Nakashima. Methyl-cpg binding proteins are involved in restricting differentiation plasticity in neurons. Journal of neuroscience research, 84(5):969–979, 2006.

[97] Mahfuza Sharmin, Héctor Corrada Bravo, and Sridhar Hannenhalli. Heterogeneity of transcription factor binding specificity models within and across cell lines. Genome Research, 26(8):1110–1123, 2016.

[98] Natalie D Shaw, Harrison Brand, Zachary A Kupchinsky, Hemant Bengani, Lacey Plummer, Takako I Jones, Serkan Erdin, Kathleen A Williamson, Joe Rainger, Alexei Stortchevoi, et al. Smchd1 mutations associated with a rare muscular dystrophy can also cause isolated arhinia and bosma arhinia microphthalmia syndrome. Nature genetics, 49(2):238, 2017.

[99] Daria Shlyueva, Gerald Stampfel, and Alexander Stark. Transcriptional enhancers: from properties to genome-wide predictions. Nature Reviews Genetics, 15(4):272–286, 2014.

[100] Daniel Simpson, Janine Baerbel Illian, Finn Lindgren, Sigrunn H Sørbye, and Havard Rue. Going off grid: Computationally efficient inference for log-gaussian cox processes. Biometrika, 103(1):49–70, 2016.

[101] Shalini Singh, Danielle Howell, Niraj Trivedi, Ketty Kessler, Taren Ong, Pedro Rosmaninho, Alexandre ASF Raposo, Giles Robinson, Martine F Roussel, Diogo S Castro, et al. Zeb1 controls neuron differentiation and germinal zone exit by a mesenchymal-epithelial-like transition. Elife, 5:e12717, 2016.

[102] Abdenour Soufi, Meilin Fernandez Garcia, Artur Jaroszewicz, Nebiyu Osman, Matteo Pellegrini, and Kenneth S Zaret. Pioneer transcription factors target partial dna motifs on nucleosomes to initiate reprogramming. Cell, 161(3):555–568, 2015.

[103] Ivo Spiegel, Alan R Mardinly, Harrison W Gabel, Jeremy E Bazinet, Cameron H Couch, Christopher P Tzeng, David A Harmin, and Michael E Greenberg. Npas4 regulates excitatory-inhibitory balance within neural circuits through cell-type-specific gene programs. Cell, 157(5):1216–1229, 2014.

[104] Nils Ole Steffens, Claudia Galuschka, Martin Schindler, Lorenz Bülow, and Reinhard Hehl. Athamap: an online resource for in silico transcription factor binding sites in the arabidopsis thaliana genome. Nucleic acids research, 32(suppl 1):D368–D372, 2004.

[105] Putty-Reddy Sudhir and Chung-Hsuan Chen. Proteomics-based analysis of protein complexes in pluripotent stem cells and cancer biology. International journal of molecular sciences, 17(3):432, 2016.

[106] Matthew A Taddy, Athanasios Kottas, et al. Mixture modeling for marked poisson processes. Bayesian Analysis, 7(2):335–362, 2012.

[107] Shinya Tanaka, Yusuke Kamachi, Aki Tanouchi, Hiroshi Hamada, Naihe Jing, and Hisato Kondoh. Interplay of sox and pou factors in regulation of the nestin gene in neural primordial cells. Molecular and cellular biology, 24(20):8834–8846, 2004.

[108] Kenzui Taniue, Akiko Kurimoto, Yasuko Takeda, Takeshi Nagashima, Mariko Okada-Hatakeyama, Yuki Katou, Katsuhiko Shirahige, and Tetsu Akiyama. Asbel–tcf3 complex is required for the tumorigenicity of colorectal cancer cells. Proceedings of the National Academy of Sciences, 113(45):12739–12744, 2016.

[109] Benjamin M Taylor and Peter J Diggle. Inla or mcmc? a tutorial and comparative evaluation for spatial prediction in log-gaussian cox processes. Journal of Statistical Computation and Simulation, 84(10):2266–2284, 2014.

[110] Reuben Thomas, Sean Thomas, Alisha K Holloway, and Katherine S Pollard. Features that define the best chip-seq peak calling algorithms. Briefings in bioinformatics, 18(3):441–450, 2016.

[111] Igor F Tsigelny, Valentina L Kouznetsova, Nathan Lian, and Santosh Kesari. Molecular mechanisms of olig2 transcription factor in brain cancer. Oncotarget, 7(33):53074, 2016.

[112] Rocio G Urdinguio, Jose V Sanchez-Mut, and Manel Esteller. Epigenetic mechanisms in neurological diseases: genes, syndromes, and therapies. The Lancet Neurology, 8(11):1056–1072, 2009.

[113] Ana Tufegdzic Vidakovic, Oscar M Rueda, Stephin J Vervoort, Ankita Sati Batra, Mae Akilina Goldgraben, Santiago Uribe-Lewis, Wendy Greenwood, Paul J Coffer, Alejandra Bruna, and Carlos Caldas. Context-specific effects of tgf-β/smad3 in cancer are modulated by the epigenome. Cell reports, 13(11):2480–2490, 2015.

[114] Guohua Wang, Ximei Luo, Jianan Wang, Jun Wan, Shuli Xia, Heng Zhu, Jiang Qian, and Yadong Wang. Medreaders: a database for transcription factors that bind to methylated dna. Nucleic acids research, 46(D1):D146–D151, 2017.

[115] Orly L Wapinski, Thomas Vierbuchen, Kun Qu, Qian Yi Lee, Soham Chanda, Daniel R Fuentes, Paul G Giresi, Yi Han Ng, Samuele Marro, Norma F Neff, et al. Hierarchical mechanisms for direct reprogramming of fibroblasts to neurons. Cell, 155(3):621–635, 2013.

[116] Ashley E Webb, Elizabeth A Pollina, Thomas Vierbuchen, Noelia Urbán, Duygu Ucar, Dena S Leeman, Ben Martynoga, Madhavi Sewak, Thomas A Rando, François Guillemot, et al. Foxo3 shares common targets with ascl1 genome-wide and inhibits ascl1-dependent neurogenesis. Cell reports, 4(3):477–491, 2013.

[117] Cecilia L Winata, Igor Kondrychyn, Vibhor Kumar, Kandhadayar G Srinivasan, Yuriy Orlov, Ashwini Ravishankar, Shyam Prabhakar, Lawrence W Stanton, Vladimir Korzh, and Sinnakaruppan Mathavan. Genome wide analysis reveals zic3 interaction with distal regulatory elements of stage specific developmental genes in zebrafish. PLoS genetics, 9(10):e1003852, 2013.

[118] Ka-Chun Wong, Yue Li, Chengbin Peng, and Zhaolei Zhang. Signalspider: probabilistic pattern discovery on multiple normalized chip-seq signal profiles. Bioinformatics, 31(1):17–24, 2015.

[119] Yongqing Wu, Raja Dey, Aidong Han, Nimanthi Jayathilaka, Michael Philips, Jun Ye, and Lin Chen. Structure of the mads-box/mef2 domain of mef2a bound to dna and its implication for myocardin recruitment. Journal of molecular biology, 397(2):520–533, 2010.

[120] Beibei Xin and Remo Rohs. Relationship between histone modifications and transcription factor binding is protein family specific. Genome research, pages gr–220079, 2018.

[121] Chunming Xu and Scott A Jackson. Machine learning and complex biological data, 2019.

[122] Lan Xu, Claudio Alarcón, Seda Çöl, and Joan Massagué. Distinct domain utilization by smad3 and smad4 for nucleoporin interaction and nuclear import. Journal of Biological Chemistry, 278(43):42569–42577, 2003.

[123] Yimeng Yin, Ekaterina Morgunova, Arttu Jolma, Eevi Kaasinen, Biswajyoti Sahu, Syed Khund-Sayeed, Pratyush K Das, Teemu Kivioja, Kashyap Dave, Fan Zhong, et al. Impact of cytosine methylation on dna binding specificities of human transcription factors. Science, 356(6337):eaaj2239, 2017.

[124] Hong-Bing Yu, Rory Johnson, Galih Kunarso, and Lawrence W Stanton. Coassembly of rest and its cofactors at sites of gene repression in embryonic stem cells. Genome research, 21(8):1284–1293, 2011.

[125] Kenneth S Zaret and Jason S Carroll. Pioneer transcription factors: establishing competence for gene expression. Genes & development, 25(21):2227–2241, 2011.

Appendices

Appendix A

First Appendix

A.1 Dirichlet Process mixture of log Gaussian Cox processes

A.1.1 INLA approximation of the LGCP model

To review the modeling of a spatial point process using INLA, let us consider the LGCP model of a single point process $S$. INLA adopts an explicit link between the latent Gaussian process $z(s)$ and a discrete spatial Gaussian Markov random field $z = (z_1, \ldots, z_P)$ [60, 100, 109]. In particular, one assumes that the covariance kernel of $z(s)$ follows a Matérn covariance structure with

$$C_\theta(x, y) = \frac{\sigma^2}{\Gamma(\nu)\, 2^{\nu-1}} (\kappa |x - y|)^{\nu} K_\nu(\kappa |x - y|),$$

where $K_\nu(\cdot)$ is the modified Bessel function of the second kind, $\nu > 0$ is the smoothing parameter, $\kappa > 0$ is the range parameter, and $\sigma^2$ is the marginal variance. With this assumption, one can show that $z(s)$ is the solution of the stochastic partial differential equation (SPDE):
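As a quick numerical check of the Matérn kernel above, the following sketch evaluates it for half-integer smoothness, where $K_\nu$ has a closed form. The function names (`bessel_k_half_integer`, `matern_cov`, `matern_cov_nu_three_halves`) are hypothetical and not part of INLA; only $\nu = 1/2$ and $\nu = 3/2$ are supported, which is a simplifying assumption of this example.

```python
import math


def bessel_k_half_integer(nu, x):
    """K_nu(x) for nu in {0.5, 1.5}, using the known closed forms."""
    base = math.sqrt(math.pi / (2.0 * x)) * math.exp(-x)
    if nu == 0.5:
        return base
    if nu == 1.5:
        return base * (1.0 + 1.0 / x)
    raise ValueError("this sketch only supports nu = 0.5 and nu = 1.5")


def matern_cov(r, sigma2=1.0, kappa=1.0, nu=1.5):
    """sigma2 / (Gamma(nu) 2^(nu-1)) * (kappa r)^nu * K_nu(kappa r)."""
    if r == 0.0:
        return sigma2  # limit of the formula as the distance goes to zero
    x = kappa * abs(r)
    scale = sigma2 / (math.gamma(nu) * 2.0 ** (nu - 1.0))
    return scale * x ** nu * bessel_k_half_integer(nu, x)


def matern_cov_nu_three_halves(r, sigma2=1.0, kappa=1.0):
    """Simplified form for nu = 3/2: sigma2 * (1 + kappa r) * exp(-kappa r)."""
    x = kappa * abs(r)
    return sigma2 * (1.0 + x) * math.exp(-x)
```

For $\nu = 3/2$ the general expression collapses to $\sigma^2 (1 + \kappa r) e^{-\kappa r}$, and for $\nu = 1/2$ to the exponential kernel $\sigma^2 e^{-\kappa r}$; the two functions above can be compared directly to confirm this.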

$$(\kappa^2 - \Delta)^{\alpha/2} z(s) = W(s), \tag{A.1}$$

where $\alpha = \nu + 1/2$ is an integer (in general $\alpha = \nu + d/2$ for a $d$-dimensional domain, and here $d = 1$; by default $\alpha = 2$ in the INLA package), $\Delta = \partial^2/\partial s^2$ is the Laplacian operator, and $W(s)$ is Gaussian white noise. The solution $z(s)$ of the SPDE can be approximated numerically by expanding $z(s)$ on a B-spline basis $\phi_j$:

$$z(s) \approx \sum_{j=1}^{P} \phi_j(s) z_j, \tag{A.2}$$

where $\phi_j$ is chosen to be the piecewise linear basis constructed using B-splines of order 1 on the interval $D$. This expansion transforms the continuous covariance of $z(s)$ into a discrete precision matrix $Q$ of the B-spline basis coefficients $z = (z_1, \ldots, z_P)^T$. $Q$ is approximated using the properties of a Gaussian Markov random field (GMRF; note that a GMRF is defined discretely), which takes advantage of the local correlation structure in $Q$ to facilitate fast computation. Through the B-spline approximation, an explicit link is constructed between the continuous Gaussian process $z(s)$ and the discretely defined GMRF $z$.

The implementation in INLA contains several steps. First, a regular grid $t = (t_1, \ldots, t_P)$ is constructed on $D$, and a 1-D B-spline basis of order 1 is built on this grid. Under this construction, $\phi_j(t)$ is a piecewise linear basis function that takes the value 1 at node $t_j$ and zero at the rest of the grid points. Accordingly,

$$\lambda(s) = \exp\{z(s)\} \approx \exp\Big\{\sum_{j=1}^{P} \phi_j(s) z_j\Big\}.$$
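The basis expansion of the log-intensity can be illustrated with a few lines of plain Python. This is a minimal sketch, not the INLA implementation: `hat_basis` and `intensity` are hypothetical helper names, and the grid and coefficient values are made up for illustration.

```python
import math


def hat_basis(s, grid):
    """Return [phi_1(s), ..., phi_P(s)] for piecewise linear hat functions.

    phi_j equals 1 at grid[j], 0 at every other node, and is linear in between.
    """
    if s < grid[0] or s > grid[-1]:
        raise ValueError("s must lie inside the interval covered by the grid")
    phi = [0.0] * len(grid)
    for j in range(len(grid) - 1):
        left, right = grid[j], grid[j + 1]
        if left <= s <= right:
            w = (s - left) / (right - left)
            phi[j] = 1.0 - w      # weight of the left node
            phi[j + 1] = w        # weight of the right node
            break
    return phi


def intensity(s, grid, z):
    """lambda(s) = exp(sum_j phi_j(s) z_j), the expansion of exp{z(s)}."""
    phi = hat_basis(s, grid)
    return math.exp(sum(p * zj for p, zj in zip(phi, z)))


# Illustrative values only (not taken from the dissertation).
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
z = [0.0, 1.0, 0.5, -0.5, 0.0]
```

At a node the expansion returns the coefficient itself, so `intensity(0.25, grid, z)` equals $e^{1}$; between nodes it interpolates linearly on the log scale, so the value at $s = 0.375$ is $e^{0.75}$.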

With the basis approximation, the integral in Equation (1) can be further approximated by

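$$\int_D \lambda(s)\, ds \approx \sum_{j=1}^{P} \lambda(t_j)\, \tilde{\alpha}_j \approx \sum_{j=1}^{P} \exp\Big\{\sum_{k=1}^{P} \phi_k(t_j) z_k\Big\} \tilde{\alpha}_j.$$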

Here $\tilde{\alpha}_j$ is the weight for numerical integration, which takes the value $|D_j|$, where $D_j$ is the support of $\phi_j(\cdot)$. The exponential part of Equation (1) can then be approximated by

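$$\exp\Big\{|D| - \int_D \lambda(s)\, ds\Big\} \approx \exp\Big\{|D| - \sum_{j=1}^{P} \tilde{\alpha}_j \exp\Big\{\sum_{k=1}^{P} \phi_k(t_j) z_k\Big\}\Big\} = \exp\{|D| - \tilde{\alpha}^T \exp(A_1 z)\},$$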

where $\tilde{\alpha} = (\tilde{\alpha}_1, \ldots, \tilde{\alpha}_P)^T$ and $[A_1]_{jk} = \phi_k(t_j)$, $j = 1, \ldots, P$, $k = 1, \ldots, P$. The product part of Equation (1) can be approximated by

$$\prod_{i=1}^{n} \lambda(s_i) \approx \prod_{i=1}^{n} \exp\Big\{\sum_{j=1}^{P} \phi_j(s_i) z_j\Big\} = \prod_{i=1}^{n} \exp\{[A_2 z]_i\},$$

where $[A_2]_{ij} = \phi_j(s_i)$, $i = 1, \ldots, n$, $j = 1, \ldots, P$. Therefore, the log likelihood based on Equation (1) is approximated by

$$\log f(S \mid \lambda(s)) \approx |D| - \tilde{\alpha}^T \exp(A_1 z) + 1^T A_2 z. \tag{A.3}$$
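Approximation (A.3) can be evaluated directly once $A_1$, $A_2$, and the integration weights are built. The sketch below does this in plain Python under stated assumptions: `hat_basis`, `trapezoid_weights`, and `lgcp_loglik` are hypothetical names, and the weights are passed in explicitly. The usage example takes each weight as the integral of the corresponding hat function (the trapezoid rule), which is an assumption of this example rather than the weight definition used in the text.

```python
import math


def hat_basis(s, grid):
    """Piecewise linear basis values [phi_1(s), ..., phi_P(s)] on `grid`."""
    phi = [0.0] * len(grid)
    for j in range(len(grid) - 1):
        left, right = grid[j], grid[j + 1]
        if left <= s <= right:
            w = (s - left) / (right - left)
            phi[j], phi[j + 1] = 1.0 - w, w
            break
    return phi


def trapezoid_weights(grid):
    """Assumed weights: integral of each hat function over the interval."""
    w = [0.0] * len(grid)
    for j in range(len(grid) - 1):
        h = grid[j + 1] - grid[j]
        w[j] += h / 2.0
        w[j + 1] += h / 2.0
    return w


def lgcp_loglik(points, grid, z, weights):
    """Evaluate |D| - alpha~^T exp(A1 z) + 1^T A2 z from (A.3)."""
    size_D = grid[-1] - grid[0]
    A1 = [hat_basis(t, grid) for t in grid]      # P x P, [A1]_jk = phi_k(t_j)
    A2 = [hat_basis(s, grid) for s in points]    # n x P, [A2]_ij = phi_j(s_i)
    A1z = [sum(a * zk for a, zk in zip(row, z)) for row in A1]
    A2z = [sum(a * zk for a, zk in zip(row, z)) for row in A2]
    integral = sum(w * math.exp(v) for w, v in zip(weights, A1z))
    return size_D - integral + sum(A2z)


# Illustrative usage on D = [0, 1] with two observed points.
grid = [i / 10 for i in range(11)]
points = [0.2, 0.55]
weights = trapezoid_weights(grid)
loglik_flat = lgcp_loglik(points, grid, [0.0] * len(grid), weights)
```

With $z \equiv 0$ the intensity is 1 everywhere, the integral term equals $|D|$, and the expression evaluates to 0. With a constant $z = \log 2$ it gives $1 - 2 + 2\log 2$, which is a convenient sanity check for any reimplementation.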

Rewriting $\eta = \exp\{(z^T A_1^T, z^T A_2^T)^T\}$, $\alpha = (\tilde{\alpha}^T, 0_{n \times 1}^T)^T$, and constructing the pseudo-observation $y = (0_{P \times 1}^T, 1_{n \times 1}^T)^T$, the LGCP likelihood in Equation (1) can be approximated using a Poisson density

$$f(y \mid z) \approx C \prod_{i=1}^{n+P} \eta_i^{y_i} \exp\{-\alpha_i \eta_i\},$$

which is a product of conditionally independent Poisson terms with mean $\alpha_i \eta_i$ and observation $y_i$. INLA approximates the posterior marginals $\pi(z_i \mid y)$ using numerical integration:

$$\tilde{\pi}(z_i \mid y) = \int \tilde{\pi}(z_i \mid \theta, y)\, \tilde{\pi}(\theta \mid y)\, d\theta.$$

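The marginal posterior $\tilde{\pi}(\theta \mid y)$ is an approximation to $\pi(\theta \mid y)$ using

$$\tilde{\pi}(\theta \mid y) \propto \left. \frac{\pi(z, \theta, y)}{\tilde{\pi}_G(z \mid \theta, y)} \right|_{z = z^*(\theta)},$$

where $\tilde{\pi}_G(z \mid \theta, y)$ is the Gaussian approximation of the full conditional of $z$, and $z^*(\theta)$ is the mode of the full conditional for $z$ at a given $\theta$. The marginal likelihood $\pi(y)$ can be approximated by

$$\tilde{\pi}(y) = \int \left. \frac{\pi(\theta, z, y)}{\tilde{\pi}_G(z \mid \theta, y)} \right|_{z = z^*(\theta)} d\theta$$

using numerical integration. The marginal likelihood of $S$ is denoted as $H(S) \approx \tilde{\pi}(y)$ in the DPM algorithm below.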
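The ratio evaluated at the mode is a Laplace approximation, and its behaviour can be seen on a toy model. The sketch below is not INLA: it assumes a single latent value $z \sim N(0, 1/\tau)$ with one observation $y \sim \text{Poisson}(e^{z})$, holds the hyperparameter $\theta = \tau$ fixed (so only the integrand of the outer integral is computed), and compares the result with brute-force quadrature. All function names are hypothetical.

```python
import math


def log_joint(z, y, tau):
    """log pi(z, y | tau) for y ~ Poisson(exp(z)), z ~ N(0, 1/tau)."""
    log_lik = y * z - math.exp(z) - math.lgamma(y + 1)
    log_prior = 0.5 * math.log(tau / (2.0 * math.pi)) - 0.5 * tau * z * z
    return log_lik + log_prior


def laplace_log_marginal(y, tau, n_iter=50):
    """log of pi(z, y) / pi_G(z | y) evaluated at the mode z*."""
    z = 0.0
    for _ in range(n_iter):
        grad = y - math.exp(z) - tau * z   # first derivative of log_joint in z
        hess = -math.exp(z) - tau          # second derivative (always negative)
        z -= grad / hess                   # Newton step towards the mode
    precision = math.exp(z) + tau
    # Density of the Gaussian approximation N(z*, 1/precision) at its own mode.
    log_gauss_at_mode = 0.5 * math.log(precision / (2.0 * math.pi))
    return log_joint(z, y, tau) - log_gauss_at_mode, z


def quadrature_log_marginal(y, tau, lo=-10.0, hi=10.0, n=20001):
    """Reference value: trapezoid-rule integral of pi(z, y) over z."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        w = 0.5 if i in (0, n - 1) else 1.0
        total += w * math.exp(log_joint(lo + i * h, y, tau))
    return math.log(total * h)


lap, z_star = laplace_log_marginal(3, 1.0)
ref = quadrature_log_marginal(3, 1.0)
```

For this toy model the two log marginal likelihoods agree closely because the full conditional of $z$ is nearly Gaussian. In the full scheme this inner step would be repeated over a grid of $\theta$ values and combined by numerical integration.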

A.1.2 Algorithm for posterior inference

Using a Gibbs sampler, posterior inference for the DPM-LGCP model can be performed by the following algorithm.

Algorithm:

Step 0 Initialize the cluster assignment $c = (c_1, \ldots, c_N)$.

Step 1 For $i = 1 : N$, do the following: Exclude $c_i$ from $c$. Denote the resulting number of clusters as $K^-$ and the cluster sizes as $\{n_k^-\}$. Sample $c_i$ from the following distribution:

$$(c_i \mid c_{-i}, \{S_i\}) = \begin{cases} k & \text{with prob. } \propto n_k^- H(S_i \mid \{S_l, c_l = k, l \neq i\}), \quad k = 1, \ldots, K^-, \\ K^- + 1 & \text{with prob. } \propto m H(S_i), \end{cases}$$

end For.

Repeat Step 1 for a pre-specified number of iterations.

Here, $H(S_i \mid \{S_l, c_l = k, l \neq i\}) = H(S_i, \{S_l, c_l = k, l \neq i\}) / H(\{S_l, c_l = k, l \neq i\})$, and all components are calculated using INLA.
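The structure of Step 1 can be sketched in Python as follows. This is an illustration under strong assumptions, not the dissertation's implementation: the INLA-based marginal likelihood $H$ is replaced by a hypothetical closed-form stand-in `log_H` (a homogeneous Poisson process on $[0, 1]$ with a Gamma$(a, b)$ prior on its rate), $m$ is read as the Dirichlet process concentration parameter (`m_conc`), and all names are made up for this example.

```python
import math
import random


def log_H(patterns, a=1.0, b=1.0):
    """Stand-in log marginal likelihood of a group of point patterns on [0, 1].

    Assumes a constant intensity with a Gamma(a, b) prior; INLA is NOT used here.
    """
    n_points = sum(len(p) for p in patterns)
    m = len(patterns)
    return (a * math.log(b) - math.lgamma(a)
            + math.lgamma(a + n_points)
            - (a + n_points) * math.log(b + m))


def gibbs_sweep(patterns, c, m_conc, rng):
    """One pass of Step 1: resample every c_i given the other assignments."""
    N = len(patterns)
    for i in range(N):
        # Exclude c_i and group the remaining patterns by cluster label.
        members = {}
        for l in range(N):
            if l != i:
                members.setdefault(c[l], []).append(patterns[l])
        labels = list(members)
        log_w = []
        for k in labels:
            others = members[k]
            # H(S_i | others) = H(S_i, others) / H(others), on the log scale.
            log_pred = log_H(others + [patterns[i]]) - log_H(others)
            log_w.append(math.log(len(others)) + log_pred)
        # Weight for opening a new cluster: m * H(S_i).
        log_w.append(math.log(m_conc) + log_H([patterns[i]]))
        top = max(log_w)
        weights = [math.exp(v - top) for v in log_w]  # stabilised weights
        pick = rng.choices(range(len(weights)), weights=weights)[0]
        if pick < len(labels):
            c[i] = labels[pick]
        else:
            c[i] = max(labels, default=-1) + 1  # label unused by the others
    return c


def run_demo(seed=0, n_sweeps=5):
    """Three empty patterns and three dense ones, all started in one cluster."""
    rng = random.Random(seed)
    sparse = [[] for _ in range(3)]
    dense = [[k / 200.0 for k in range(200)] for _ in range(3)]
    patterns = sparse + dense
    c = [0] * len(patterns)
    for _ in range(n_sweeps):
        gibbs_sweep(patterns, c, m_conc=1.0, rng=rng)
    return c
```

Working on the log scale and subtracting the largest weight before exponentiating avoids underflow when the predictive terms differ by hundreds of log units. In the demo the dense patterns end up sharing one label while the empty patterns leave that cluster, which is the qualitative behaviour expected of the sampler; replacing `log_H` with an INLA-based estimate would recover the scheme described above.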