Computational Approaches To Predict Effect Of Epigenetic Modifications On Transcriptional Regulation Of Gene Expression
Sharmi Banerjee
Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Electrical Engineering
Pratap Tokekar, Co-chair
Xiaowei Wu, Co-chair
William Baumann
Inyoung Kim
Anil Vullikanti
September 05, 2019
Blacksburg, Virginia
Keywords: Epigenetic factors, gene expression, transcription factors, histone marks, DNA methylation
Copyright 2019, Sharmi Banerjee
Computational Approaches To Predict Effect Of Epigenetic Modifications On Transcriptional Regulation Of Gene Expression
Sharmi Banerjee
(ABSTRACT)
This dissertation presents applications of machine learning and statistical approaches to infer protein-DNA binding in the presence of epigenetic modifications. Epigenetic modifications are alterations to the DNA that regulate gene expression while leaving the DNA sequence itself unaltered. They are heritable and reversible, and often involve the addition or removal of certain chemical groups. Two well-studied examples are histone modification, which alters the histone proteins and thereby changes the structure of chromatin (DNA wound around histone proteins), and DNA methylation, which adds methyl groups to a cytosine base adjacent to a guanine base. Epigenetic factors often interfere in gene expression regulation by promoting or inhibiting the binding of certain proteins to DNA. Such proteins are known as transcription factors. Transcription is the first step of gene expression, in which a particular segment of DNA is copied into messenger RNA (mRNA). Transcription factors orchestrate gene activity and are crucial for normal cell function in any organism. For example, deletion or mutation of certain transcription factors such as MEF2 has been associated with neurological disorders such as autism and schizophrenia. In this dissertation, different computational pipelines are described that use mathematical models to explain how protein-DNA binding is mediated by histone modifications and DNA methylation affecting different regions of the brain at different stages of development.
Computational Approaches To Predict Effect Of Epigenetic Modifications On Transcriptional Regulation Of Gene Expression
Sharmi Banerjee
(GENERAL AUDIENCE ABSTRACT)
A cell is the basic unit of any living organism. Cells contain a nucleus that holds DNA, a self-replicating material often called the blueprint of life. To sustain life, cells must respond to changes in their environment. Gene expression regulation, a process in which specific regions of the DNA (genes) are copied into messenger RNA (mRNA) molecules and then translated into proteins, determines the fate of a cell. It is known that various environmental factors (such as diet, stress, and social interaction) and biological factors often indirectly affect gene expression regulation. In this dissertation, we use machine learning approaches to predict how certain biological factors interfere indirectly with gene expression by changing specific properties of DNA. We expect our findings will help in understanding the interplay of these factors in gene expression.
Acknowledgments
The dissertation owes so much to so many. Foremost among them is Dr. David Xie, whom I thank for his constant guidance and encouragement, and for giving me enough freedom to pursue my goals for personal development. It has been a privilege and an honor. I would like to thank Dr. Xiaowei Wu for being my mentor throughout my PhD and guiding me more than I could have expected. I would also like to thank the rest of my committee: Dr. Pratap Tokekar, Dr. William Baumann, Dr. Inyoung Kim and Dr. Anil Vullikanti for their valuable feedback, insightful discussions and hard questions. My sincere thanks also go to Dr. Jason Xuan and Dr. Hamed Sari-Sarraf for getting me excited about research and helping me find my feet during the initial phase of the process. I would like to express my deepest gratitude to Dr. Michel Pleimling for providing me with an assistantship in the last two years of my PhD. It would have been impossible for me to continue my doctoral work without his help. I could not have endured this journey without the support of my friends Sajal, Jiyoung, Meghna, Kwang, Jianlin and Xiaoran, who have become my family in Blacksburg. I would like to thank my parents and sister for keeping my spirits up at times of failure. I thank my sister-in-law, who shared with me her valuable academic experience and helped in critically reviewing this dissertation. Last but not least, I would like to thank my husband, who bolstered me through the peaks and valleys of this journey. This dissertation belongs to him as much as it belongs to me.
Contents
List of Figures
List of Tables
1 Introduction
2 Identification of transcriptional regulatory modules in distinct chromatin states in mouse neural stem cells
2.1 Overview
2.2 Key challenges
2.2.1 Integrating histone and transcription factors
2.2.2 Finding the optimum clustering algorithm
2.2.3 Resolution of the TRMs within chromatin states
2.3 Building blocks of the computational pipeline
2.3.1 Data-sets
2.3.2 Chromatin state identification through genome segmentation
2.3.3 TF binding pattern clustering via Dirichlet Process Mixture of Log Gaussian Cox Processes (DPM-LGCP)
2.4 Case studies
2.4.1 Simulated data
2.4.2 Real data
2.4.3 Functional enrichment of transcription factors within chromatin states
2.4.4 Gene expression and motif analysis
2.5 Results
2.5.1 Genome segmentation and chromatin state identification
2.5.2 Genome annotation enrichment
2.5.3 Relative enrichment around TSS
2.5.4 State sizes
2.5.5 Functional annotation of nucleosome and domain level states
2.5.6 Chromatin state preference of individual TF binding and gene expression regulation
2.5.7 Chromatin state and preferential TF clustering
2.5.8 Estimated clusters in chromatin states
2.5.9 Protein-DNA binding motif preferences in chromatin states
2.6 Known protein-protein interactions predicted by the algorithm
2.7 Robustness of the proposed clustering technique
2.7.1 Effect of varying domain level state sizes
2.7.2 Effect of varying number of initial clusters
2.7.3 Effect of permutation of peaks
2.8 Comparison of clustering results to other methods
2.9 Time complexity analysis
2.10 Discussion
3 Recursive motif analyses identify brain epigenetic regulatory modules
3.1 Overview
3.2 Key challenges
3.2.1 Motif enrichment prediction in DMS
3.3 Building blocks of the computational pipeline
3.3.1 Merging nearby DMS
3.3.2 Clustering DMRs for motif enrichment
3.3.3 Data analysis of Whole Genome Bisulphite Sequencing (WGBS-seq)
3.3.4 Recursive identification of TRMs from motifs
3.3.5 Preparing motif libraries
3.3.6 Analysis of single cell RNA-seq
3.4 Results
3.4.1 A Comprehensive motif database compiled for epigenetic regulatory module identification
3.4.2 Genome distribution of hypermethylated CpG sites identified in TET1KO and TET2KO frontal cortices
3.4.3 TRMs identified in differentially methylated clusters
3.4.4 Brain ChIP-seq data support the TRMs predicted
3.4.5 Transcription factors under the influence of DNA methylation
3.5 Discussion
4 Discussion and future work
4.1 TRM analysis with brain single cell methylomes
4.2 Cell type specific PPI obtained through ETRM prediction
4.3 Co-expression of ETRMs in single cell and bulk RNA-seq
4.4 Summary
4.5 Contribution of the dissertation
Bibliography
Appendices
Appendix A First Appendix
A.1 Dirichlet Process mixture of log Gaussian Cox processes
A.1.1 INLA approximation of the LGCP model
A.1.2 Algorithm for posterior inference
List of Figures
2.1 A two-step process to identify chromatin-state-specific transcriptional regulatory modules. diHMM is used to segment the genome into multiple chromatin states, followed by application of the proposed clustering method to identify transcriptional regulatory modules in distinct states. Downstream analyses include gene expression comparison and de-novo motif comparison across different chromatin states.
2.2 TF peak enrichment in different chromatin states based on a threshold requirement of 10 peaks for each TF.
2.3 TF peak enrichment in different chromatin states based on a threshold requirement of 5 peaks for each TF.
2.4 Application of the clustering algorithm on simulated data. (A) True and estimated binding intensities of twenty transcription factors in three clusters. (B) Illustration of the clustering result obtained in (A).
2.5 Application of the clustering algorithm on real data. (A) Distribution of ChIP-seq peaks visualized in an IGV window and (B) the corresponding estimated binding intensities of the clusters. The proteins in each cluster are shown with the same color.
2.6 (a) Nucleosome level emission matrix generated by diHMM. Functional annotations of the nucleosome level states are shown in the color bar on the left. Scale varies linearly between 0 and 1. (b) Fractional genome coverage for nucleosome and domain level states. Scale varies logarithmically between 10^-4 and 1. (c) Combined nucleosome-domain fold change obtained by diHMM. Functional annotations of the states are shown in the color bar on the left. Scale varies logarithmically between 0.5 and 50.
2.7 Nucleosome states genome annotation enrichment.
2.8 Nucleosome and domain level state distribution around TSS.
2.9 Quartile box plots of log10 of all nucleosome-level (a), or domain-level (b) state sizes. Box plot whiskers extend to 1.5 times the interquartile range.
2.10 (a) Enrichment (in log scale) of TF peaks in different chromatin states showing binding preference of individual TFs. (b) Comparison of average expression (in log scale) of proximal genes (+/- 2kb from TSS) in different domain level chromatin states. Genes were mapped to the nucleosome-level states for the corresponding domain-level states. (c) Comparison of average expression (in log scale) of proximal genes (+/- 2kb from TSS) mapped to individual TFs in the Broad Promoter state and in (d) the Poised Enhancer state.
2.11 (a), (b) Estimated cluster binding intensities along with the individual TF binding intensities in the Broad Promoter and Poised Enhancer states, respectively. In each figure, the estimated binding intensities of the individual TFs are shown as dotted lines and the estimated binding intensities of the clusters are shown as solid lines. TFs in each cluster are shown in the same color as that of the cluster. The X axis represents the genomic locations mapped on the real line between 0 and 50. The Y axis represents the estimated binding intensities, both for the individual TFs and for the identified clusters. (c), (d) Pairwise protein co-binding probabilities corresponding to (a) and (b) respectively. (e), (f) Comparison of proximal gene expression (TPM) regulated by the clusters in (a) and (b) respectively. Only those clusters having (1) multiple TFs and (2) proximal genes for at least two TFs are shown in the figure to explain the combinatorial regulation of gene expression by multiple TFs.
2.12 Gene expression comparison for individual proteins among different chromatin states.
2.13 Gene expression comparison for predicted clusters among different chromatin states.
2.14 Estimated cluster intensities in different chromatin states.
2.15 Estimated cluster intensities in different chromatin states.
2.16 Effect of chromatin states and co-binding partner on binding motifs. (A) De-novo motifs obtained using MEME for ASCL1 are similar to the consensus motif in both Broad Promoter and Polycomb Repressed states although the co-factors of ASCL1 are different in the two states. (B) De-novo motifs obtained using MEME for TCF3 show differences in motifs between the two states with different co-factors. The motifs in the active state resemble the β-catenin/TCF/LEF motif whereas the motifs in the repressed state resemble the E-Box consensus motif.
2.17 Estimated intensity curves of the identified clusters in 10 sub-samples, each with 1,000 randomly selected windows in a Broad Promoter domain (D5).
3.1 Recursive identification of Transcriptional Regulatory Modules (TRMs) in differentially methylated sites.
3.2 Recursive “key” TF and ETRM prediction flowchart.
3.3 Distribution of five motif categories, viz. HOMER, MethylPlus, MethylMinus, Canonical (from MeDReaders) and Methylated (from MeDReaders), among the different DNA binding domains. The domains are arranged in clockwise decreasing order, with the Homeobox domain containing the most motifs from the five categories and the SMAD domain containing the fewest.
3.4 Venn diagram showing the number of shared motifs among the five categories.
3.5 Genome distribution of TET1KO and TET2KO differentially methylated sites.
3.6 Methylation profiles of three DMR clusters identified (A-C) and the corresponding TRMs predicted (D-F) for the TET1KO brain methylome. The range of methylation levels was set as 0 (blue) to 1 (yellow). For each iteration cycle, the number of DMRs included in the analysis is shown on the right bar. The key TF identified in each iteration cycle is marked with a double ring. The TFs are colored according to TF family, annotated by DNA binding domain.
3.7 Methylation profiles of three DMR clusters identified (A-C) and the corresponding TRMs predicted (D-F) for the TET2KO brain methylome. The range of methylation levels was set as 0 (blue) to 1 (yellow). For each iteration cycle, the number of DMRs included in the analysis is shown on the right bar. The key TF identified in each iteration cycle is marked with a double ring. The TFs are colored according to TF family, annotated by DNA binding domain.
3.8 The enrichment of TF binding sites in TET1KO and TET2KO TRMs. Each color value in the figures represents the estimate of the odds ratio based on the conditional Maximum Likelihood Estimate derived from Fisher’s exact test. For TET2KO, the two NF1 TRMs identified from cluster 1 and cluster 3 were combined and labeled as one NF1-TRM in the bottom panel.
3.9 (A:D) Four types of motifs shown. (A) DNA methylation has little effect on the Egr1 motif. (B) No CpG site was observed in the methylated Mef2c motif. (C) Primary motif of Lhx1 contained a CpG site. (D) Primary motif of Sox14 contained a CpG site. (E:H) Gene expression level (Transcripts Per Million or TPM in log scale) shown for the genes in the TRMs predicted in Figure 3,4. (I:L) Average methylation level of the differentially methylated sites containing the motifs from the predicted TRMs in Figure 3,4. The drop in methylation level at the 6-week and 10-week time points is due to the fact that they were wild type samples in the DMS identification procedure.
4.1 Neurological disorders hypothesized to be due to a relative change in excitatory to inhibitory neuron activity. The two extremes show Rett syndrome, caused by less excitation and more inhibition, and autism, caused by more excitation and less inhibition.
4.2 Applying a recursive motif enrichment on 16 single cell methylome DMRs, 20 key TFs were predicted. (A) Fisher’s exact test showing the enrichment of the TFs in the 16 DMRs. (B):(G) Methylation profiles across the 16 DMRs occupied by the key TFs show that the DMRs of most excitatory neurons containing the Egr1 motif are hypo-methylated, whereas DMRs containing the Mef2b motif are hypo-methylated in both excitatory and inhibitory neurons. Finally, only specific DMRs containing Nur77, Tgif2 and Bach2 motifs are hypo-methylated. (H) The 20 TFs can be broadly classified as enriched in excitatory neurons, enriched in both excitatory and inhibitory neurons, or enriched in inhibitory neurons only.
4.3 Cell type specific protein interaction network generated from the predicted ETRMs for three key TFs: Egr1, Mef2b and Bach2. The three nodes represent the three TFs and the edges represent the co-enriched motifs in a specific neuron sub-type. Based on the co-enrichment, cell type specific edges and hence sub-networks can be constructed.
4.4 Expression level of general and neuron sub-type specific TFs and some of the co-enriched motifs from the corresponding ETRMs during development in the forebrain. (A) Expression goes up during development, specifically at the postnatal day. (B) Expression reaches its peak during development after the postnatal day. (C) Contrasting patterns in expression observed for the four TFs. Interestingly, although both the Tgif2 and Nur77 motifs were highly significant in the mDL-3 excitatory neuron, the former shows a decrease in expression during development whereas the latter shows an increase.
4.5 Co-expression of specific TFs within single cell RNA-seq data. TFs such as Egr1 and Klf6, and Mef2d and Nf1, are shown to have similar expression levels and are expressed in a similar number of cells. The Klf motif was significantly co-enriched in the Egr1 ETRM, and the Nf1 motif was also significantly enriched in the Mef2 ETRM.
List of Tables
S1 Known interactions predicted by the algorithm
S2 Error rates of clustering on a Broad Promoter domain with different number of initial clusters
S3 Comparison of clustering results with K-means
Chapter 1
Introduction
In 1859, it was proposed that all life on Earth shares a common ancestry. Roughly a century later, the discovery of the structure of DNA established the unity of all life. DNA, short for deoxyribonucleic acid, is the self-replicating genetic material that carries the instructions for the development, growth, reproduction, and functioning of all life. A stretch of DNA that codes for a specific protein, or for an RNA chain with a specific function in the organism, is called a gene and is the basic unit of heredity. Genetic differences in the tissue-specific expression of proteins [81] are known to underlie many of the observable differences among species. Ants look different from dogs and dogs look different from humans, but all three species descended from a shared ancestor that lived hundreds of millions of years ago. Such differences are often due to genes being turned on and off at different times and in different tissues.

The instructions dictating when and where a gene is expressed are written in sequences of DNA bases within the regulatory region of the gene. These instructions are referred to as the “gene regulatory code”. This code is deciphered by a class of proteins called transcription factors (TFs), which bind to specific sequences of DNA (or “DNA words”) and change the rate at which genetic information is transcribed from DNA to messenger RNA (mRNA), effectively turning genes on or off. This process is known as transcription and is followed by another process known as translation, in which the mRNA molecules are decoded into chains of amino acids that form proteins as end products. This phenomenon, often called gene expression, is thus the process by which DNA sequences are expressed as proteins. Differences in gene expression among species could therefore be due to differences in the instructions present within the regulatory regions of specific genes.

Research in the past few decades has shown the relationship of genetic mutations to various diseases, including cancer [80]. In addition, it has been observed that gene expression can be regulated without directly altering the DNA sequence [2], by instead changing certain chemical properties of the DNA. Such changes, known as epigenetic modifications, are caused by epigenetic factors and are reversible and heritable. Several environmental factors, such as diet, social interaction, and toxic chemicals present in the air, are known to be associated with epigenetic modifications and thus alter the genome in either a constructive or a destructive way [1]. Histone modification and DNA methylation are two well-studied epigenetic factors. Histones are proteins found in eukaryotic cell nuclei and are the chief protein components of chromatin^1, acting as spools around which DNA winds. Histone modification involves changes to histone proteins through processes such as methylation, phosphorylation, and acetylation, and affects gene expression. Such modifications render the chromatin either ‘open’ (euchromatin), allowing transcription factors to bind to the DNA, or ‘closed’ (heterochromatin), decreasing transcription factor binding. DNA methylation involves the addition or removal of a methyl group to or from the DNA molecule and is known to affect transcription factor binding, thereby promoting or repressing gene expression [123].
In this dissertation, computational pipelines are described that apply machine learning based models on ChIP-seq, RNA-seq and DNA-methylation data to predict the interplay of histone modification and DNA-methylation on transcription factor binding. To validate the hypothesis, these models were tested against data collected from brain tissues at different time points of brain development.

^1 Nuclear DNA does not appear in free strands but is tightly condensed and wrapped around nuclear proteins to fit it inside the nucleus, forming a structure called chromatin.
The first part of this dissertation (Chapter 2) discusses how clusters of transcription factors bind to the DNA under different chromatin states defined by unique combinations of histone modifications [10]. A multi-layer Markov-based approach is first used to segment the genome into distinct chromatin states, and then a novel non-parametric Bayesian clustering technique is applied to predict specific groups of TFs that may bind together in an active DNA state versus a repressed DNA state. This helps in understanding why certain genes are expressed in different states. It is established that transcription factor co-binding is related to the chromatin state of the genome. Data from published literature were used to model the distribution of histone marks and transcription factors and to study the transcriptional regulation of gene expression across the chromatin states. Differences in gene expression levels are identified between genes regulated by a single TF and genes jointly regulated by the predicted transcriptional regulatory modules. Such combinatorial regulation of gene expression by transcription factors is important for identifying target genes of specific transcription factors involved in normal cell functions or in diseased conditions. It is also observed that the binding motifs (unique sequences of nucleotide bases that represent a transcription factor) of certain TFs change between two chromatin states. The dependence of transcription factor binding on the chromatin state of the genome is thus firmly established. The key contribution of this chapter is the proposal to predict transcriptional regulatory modules in distinct chromatin states and the assembly of a computational pipeline that implements the idea using different mathematical algorithms.
The work from this chapter has culminated in one publication [10].

The second part (Chapter 3) explores the effect of DNA methylation on transcription factor binding during different stages of brain development and within different brain cell types [11, 39]. DNA methylation is known to interfere directly or indirectly with transcription factor-DNA interactions, often occupying the promoters^2 and generally hindering TF binding. Although intensive research has been done to predict, either experimentally or computationally, specific transcription factors that are enriched in specific brain regions at certain developmental stages, transcription factor co-enrichment in the presence of DNA methylation has not been systematically explored. The objective here is to identify the specific transcriptional regulatory modules that are enriched in genomic loci with distinct methylation profiles. Such genomic loci are often referred to as differentially methylated sites (DMS) or differentially methylated regions (DMRs), meaning genomic loci with different DNA methylation status across different biological samples or time points. The proposed idea can be summarized in two basic steps: (1) cluster DMRs sharing similar methylation levels among different biological samples, and (2) recursively identify distinct transcriptional regulatory modules (TRMs) within each DMR cluster using TF motifs (a set of nucleotide sequences that represent a TF) and a novel motif search algorithm. In the absence of ChIP-seq data, such a systematic identification of TRMs helps in understanding the effect of DNA methylation on transcription factor binding through enrichment of TF motifs and validation via RNA-seq data. The key contribution of this chapter is the design and implementation of a recursive motif enrichment and transcriptional regulatory module identification algorithm.
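To make the notion of motif enrichment within a DMR cluster concrete, the following is a minimal sketch of a one-sided Fisher's exact test (the upper tail of a hypergeometric distribution). It is an illustration under stated assumptions and not the dissertation's motif search algorithm: the function names, the motif labels, and all counts are hypothetical, and the recursive selection of key TFs is not shown.

```python
from math import comb

def enrichment_pvalue(hits_in_cluster, cluster_size,
                      hits_in_background, background_size):
    """One-sided Fisher's exact test for motif enrichment.

    Returns the probability of seeing at least `hits_in_cluster`
    motif-containing regions in the cluster if motif occurrences were
    spread at random over cluster and background regions.
    All argument names are hypothetical.
    """
    total = cluster_size + background_size
    total_hits = hits_in_cluster + hits_in_background
    denom = comb(total, cluster_size)
    upper = min(cluster_size, total_hits)
    tail = sum(
        comb(total_hits, x) * comb(total - total_hits, cluster_size - x)
        for x in range(hits_in_cluster, upper + 1)
    )
    return tail / denom

def rank_motifs(cluster_hits, background_hits, cluster_size, background_size):
    """Sort motifs from most to least enriched (smallest p-value first)."""
    scores = {
        motif: enrichment_pvalue(hits, cluster_size,
                                 background_hits.get(motif, 0),
                                 background_size)
        for motif, hits in cluster_hits.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1])

# Made-up counts: regions containing each motif in a cluster of 10 DMRs
# versus 10 background regions.
ranked = rank_motifs({"motif_A": 8, "motif_B": 2},
                     {"motif_A": 2, "motif_B": 3}, 10, 10)
print(ranked[0][0])
```

In a recursive scheme, one plausible use of such a ranking would be to take the top-ranked motif as a candidate key TF and repeat the test on the regions that contain it; this is an assumption about how the test could be applied, not a description of the algorithm developed in Chapter 3.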
A second contribution is the proposal to partition differentially methylated regions by similarity of methylation profile before identifying significant motifs enriched within these DMRs. The work from this chapter has culminated in two publications [11, 39].

^2 Promoters are located near the transcription start sites of genes, upstream on the DNA, and lead to initiation of transcription of a particular gene.

The dissertation concludes with explorations on classifying neuron types into excitatory and inhibitory on the basis of the transcriptional regulatory modules enriched within each type. It is hypothesized that memories are represented by massively connected neural circuits in the brain. The ability of synapses^3 to grow strong or weak over time in response to high or low activity is defined as synaptic plasticity and is a crucial neuro-chemical foundation for learning and memory [70]. Specifically, experience-governed synaptic plasticity triggers multiple aspects of memory and is vital especially early in the life cycle of an organism, during important periods of development of cortical circuits when sensory experiences are needed [103]. Upon arrival of an action potential, a neuron is known to release small molecules known as neuro-transmitters. This results in either a positive or a negative change in action potential, depending upon the neuro-transmitters. Excitatory neurons are those that cause a positive change, and inhibitory neurons are those that cause a negative change. Npas4, an early-response transcription factor, is induced in excitatory neurons upon calcium influx and has been known to control the excitatory-inhibitory balance within neural circuits [14]. Significant research is being done to study the ratio of excitatory to inhibitory neuron activity in the brain and its impact on neurological disorders such as autism [33, 79, 85]. As transcription factors work together in groups or pathways, it is important to study the network of transcription factors that are involved in excitatory and inhibitory neurons. A recently published dataset consisting of differentially methylated regions from single-cell methylomes for sixteen different neuron sub-types in mammalian cortex [65] has been integrated with a recursive key TF and transcriptional regulatory module prediction approach to classify the neuron sub-types based on transcriptional regulatory networks.
Several key transcription factors have been identified for each neuron sub-type that are analogous to hub genes in a protein interaction network connected with multiple binding partners. Results show that some of the transcription factors can be further classified as specifically enriched within an excitatory (or inhibitory) sub-type or generally enriched in all excitatory (or inhibitory) neurons. Multi-omics data sets such as single-cell methylation and bulk RNA-seq data have been leveraged to predict cell-type specific protein interaction sub-networks. Such a systematic prediction of cell-type specific protein interaction networks within the context of epigenetic regulation will greatly help gain new insight into transcriptional regulation driven by epigenetic factors.

^3 A synapse is a structure that allows a neuron to transmit an electrical or chemical signal to another neuron or to a target cell.

Chapter 2
Identification of transcriptional regulatory modules in distinct chromatin states in mouse neural stem cells
2.1 Overview
Transcription factors (TFs) are proteins that bind to specific sequences on the DNA and help in the transcription of DNA to messenger RNA by targeting the genes located nearby. Transcription factor binding locations on the DNA can be inferred through a technique known as chromatin immunoprecipitation followed by sequencing (ChIP-seq). Precise detection of regulatory regions located near these protein coding genes is crucial for accurate gene regulation in response to developmental and environmental stimuli. These specific DNA sequences are typically recognized by TFs that recruit the transcriptional machinery. Although this task is not difficult in principle, many TFs recognize similar consensus DNA-binding sites, and a genome may contain thousands of consensus or similar sequences, both functional and nonfunctional.

In a ChIP experiment, formaldehyde cross-links proteins to their bound DNA. Cells are homogenized and the chromatin is sheared and immunoprecipitated with antibody-bound magnetic beads (the antibodies are specific to the transcription factors). The immunoprecipitated DNA is then used as the input for a next-generation sequencing library preparation protocol, after which it is sequenced. Sequencing produces millions of short reads^1 covering the transcription factor binding sites (TFBS), which are aligned to a reference genome. After a few pre-processing steps, a peak-calling tool is used to identify the ChIP-seq ‘peaks’ that are the potential TF-DNA binding regions.

As compelling as it might be to the research community, ChIP-seq comes with a few challenges: (1) the quality of the antibody used: a sensitive and specific antibody will give a high level of enrichment; (2) the use of a control in the experiment: a ChIP-seq peak should be compared with the same region in a matched control, and typical controls are input DNA, DNA obtained from IP without antibody, or DNA obtained using an antibody against a protein that is not known to be involved in DNA binding; and (3) the depth of sequencing: more prominent peaks are identified with fewer reads, whereas weaker peaks require greater depth.

A number of computational tools have been developed to identify TFBS, among which PeakSeq [91] and MACS2 [32] are popular for calling peaks for a single TF ChIP-seq profile. Both of these tools use local Poisson or Binomial statistics to identify peaks. With more and more single TF binding information being accrued, recent efforts have been devoted to integrating ChIP-seq data from multiple TFs to uncover TF-TF interactions. Signal Spider [118] uses a Gaussian mixture model to approximate the read intensity of ChIP-seq profiles to identify genomic regions co-regulated by multiple TFs. Sharmin et al. [97] adopted k-mer reverse complement frequencies and motif binding scores to identify general and cell-type specific TF binding rules. Cha and Zhou developed a method based on inhomogeneous Poisson processes and Ripley’s K-function to detect TF clustering and ordering patterns [18].
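The local Poisson statistic mentioned above can be illustrated with a small sketch. It assumes read counts have already been tallied per genomic window and scores a ChIP window against a background rate taken as the largest of a genome-wide rate and several control-derived local rates, in the spirit of a dynamic local background. The function names, the inputs and the numbers are hypothetical; this is not the implementation of PeakSeq or MACS2.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), with lam > 0.

    The upper tail is summed directly so that very small p-values
    are not lost to floating-point cancellation.
    """
    if k <= 0:
        return 1.0
    # P(X = k), computed in log space for numerical stability.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    # Keep adding terms until they stop contributing to the sum.
    while term > 0.0 and (i < lam or term > total * 1e-17):
        total += term
        i += 1
        term *= lam / i
    return min(total, 1.0)

def window_pvalue(chip_count, local_control_rates, genome_rate):
    """Score one window of ChIP reads against a conservative background.

    `local_control_rates` are expected reads per window estimated from
    the control sample at several surrounding scales (hypothetical
    inputs); the largest available rate is used as the Poisson mean.
    """
    lam = max(genome_rate, *local_control_rates)
    return lam, poisson_sf(chip_count, lam)

# Made-up example: 30 ChIP reads in a window whose expected background
# is at most 6.5 reads.
lam, p = window_pvalue(30, [4.0, 6.5], 5.0)
print(lam, p)
```

Taking the maximum over several background estimates makes the test conservative in regions where the control itself shows locally elevated coverage, which is one reason a matched control is listed among the challenges above.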
Growing genome-wide ChIP studies have also revealed new insights into the interplay between TF binding and histone modifications. In mammalian genomes, TFs could have hundreds of thousands of potential binding sites but only bind to a small subset in a given cell type. In addition, not all binding events would have a significant impact on gene expression [67]. Most TFs prefer to bind open chromatin regions, which are highly accessible and nucleosome-depleted. Such chromatin regions often carry specific histone modifications enriched in promoters and enhancers, such as the H3K4me1 and H3K27ac marks. It was found that histone-modification-dependent TF binding could be protein family specific (Xin et al., Genome Research 2018). Thus, chromatin features identified from DNase-I hypersensitivity and histone modifications could be used to predict the binding of TF members from some families (Liu et al., Nucleic Acids Research 2016). On the other hand, a small number of TFs can act as pioneers, which have the ability to reach inaccessible chromatin regions and shape the chromatin landscape to facilitate the binding of other TFs. These pioneer TFs carry DNA binding domains that recognize partial or degenerate motifs on the nucleosome surface [102]. In addition to DNA accessibility, the co-occupancy of additional co-factors would restrict the binding of some TFs to highly selected loci within accessible chromatin states. Kouzarides found that the affinity of transcription factors for their binding sites may be altered by epigenetic changes in chromatin structure [50]. Liu et al. used histone marks and DNA methylation data to establish the correlation between epigenetic changes and TF binding preferences [62], and in a later work found that co-occupancy of TFs may be predicted using distinct chromatin features identified from DNase-I hypersensitivity, histone modifications, and GC content [63].
¹ A read is an inferred sequence of base pairs corresponding to all or part of a single DNA fragment.
Recently, semi-automated genome annotation (SAGA) tools have been developed to segment and label the genome into different chromatin states. The underlying hypothesis is that differences in the spatial distribution of histone marks lead to distinct characteristics of the genome. For example, active enhancers are often enriched in H3K27ac and H3K4me1, whereas active promoters are often enriched in H3K27ac and H3K4me3 [99]. Some of the popular SAGA tools are Segway [41], ChromHMM [28], and diHMM [69]. ChromHMM [28] and diHMM [69] use hidden Markov models, whereas Segway [41] applies a dynamic Bayesian network to segment the genome and identify distinct chromatin states. Unlike Segway and ChromHMM, which perform genome segmentation and classification at a single scale, diHMM can segment the genome at multiple scales, such as the nucleosome (narrow) and domain (broad) levels. Despite these achievements in genome segmentation, little effort has been made to explore transcription factor binding patterns across distinct chromatin states. To that end, a two-step process (Figure 2.1) is discussed to investigate how chromatin configuration may affect the binding affinity of TFs. In the first step, the diHMM software, a chromatin state identification tool, is applied to the aligned BAM files of the histone marks to segment the genome into different states. The second step uses a newly developed non-parametric Bayesian clustering method to group TFs that have similar binding patterns in each identified chromatin state into transcriptional regulatory modules.
Based on the identified transcriptional regulatory modules, downstream analyses are performed to compare (a) the expression level of proximal genes regulated by individual TFs and predicted TF clusters and (b) the binding sequences of a TF among different chromatin states. This analytical procedure is then applied to data sets generated from mouse neural stem cells (NSC). The results not only confirm several known transcriptional co-binding rules in different chromatin states, such as JMJD3-SMAD3 in the Promoter state [29], ASCL1-FOXO3 in the Enhancer state [116], and CTCF-SMC1 in the Boundary state [88], but also show other interactions, for instance MIZ1-TCF4 in the Promoter state. It is also shown that the regulatory effects of the predicted transcriptional modules on proximal genes vary across chromatin states. Finally, the de-novo binding sequences compiled from TF peaks were shown to depend on the chromatin state of the genome. Such systematic analyses of transcriptional regulation in different chromatin states provide valuable insights into the role of histone marks in regulating TF-TF interactions.
Figure 2.1: A two-step process to identify chromatin-state-specific transcriptional regulatory modules. diHMM is used to segment the genome into multiple chromatin states followed by application of the proposed clustering method to identify transcriptional regulatory modules in distinct states. Downstream analyses include gene expression comparison and de-novo motif comparison across different chromatin states.
2.2 Key challenges
2.2.1 Integrating histone and transcription factors
Sophisticated computational approaches have been developed that use histone modification data to segment the genome into distinct chromatin states. Histone proteins help wrap and condense the DNA into chromatin, and transcription factors tend to bind to open-chromatin DNA. ChIP-seq is used to infer the binding sites of both histone proteins and transcription factors. Thus one approach to jointly study histones and transcription factors is to combine them in a hierarchical model such as ChromHMM [28]. While it is technically possible to jointly analyze histones and transcription factors in one single mathematical model, the idea presents some challenges. (1) Chromatin states are primarily formed by histone modifications. Thus, a computational model alone cannot validate the predicted states from a model that includes both histone modifications and transcription factors. (2) Transcription factors primarily recognize sequences on DNA near gene promoters and bind to those sequences, thereby turning the genes on or off. Histone modifications are found both near gene promoters and in regions hundreds of kilobase pairs away from gene promoters (known as enhancers), which sometimes results in DNA looping. Such looping sometimes opens up the DNA (open chromatin) and enables binding of transcription factors that otherwise might not bind. Looping brings the ChIP-seq peaks of enhancer histones and transcription factors into proximity, and without considering the 3-D structure or the topologically associating domain (TAD) it is difficult to infer the direct or indirect effect of histones on transcription factor binding. Keeping in mind the challenges involved in joint analysis of histones and transcription factors, a two-step computational pipeline was proposed that first segments the genome into distinct chromatin states using the histone marks and then predicts transcriptional regulatory modules within the chromatin states.
2.2.2 Finding the optimum clustering algorithm
The optimal clustering algorithm is expected to group together transcription factors having similar distributions of binding sites. Approaches such as K-means and Gaussian mixture models (GMMs) [118] can predict genomic regions co-regulated by multiple transcription factors. Several indices can be used to determine the optimum number of clusters in a data-set [19], and the Akaike information criterion (AIC) and Bayesian information criterion (BIC) serve the same purpose in GMMs [17]. However, K-means works well only when the clusters have spherical shape and are of similar size. A GMM can be thought of as a generalization of K-means with the advantage of probabilistic cluster assignment. With respect to predicting TRMs, both the intensity and the spatial distribution of the transcription factor binding sites are of interest. This is because two TFs might have approximately the same number of peaks or binding sites while the distributions of the sites are very different. Conversely, one TF might have many more binding sites than another TF while the overall distribution is similar for both. A clustering algorithm should therefore separate the two TFs in the first case but group them in the same cluster in the second case in order to capture the combinatorial regulation of gene expression by TFs. These conditions make the adoption of an inhomogeneous point process a natural choice [106]. The clustering approach discussed in this dissertation is inspired by previous models [18], with specific improvements discussed later in detail.
2.2.3 Resolution of the TRMs within chromatin states
The final challenge pertains to fixing the resolution of the genomic regions during clustering. This is important because a smaller clustering resolution would result in fewer transcription factor binding sites and different predicted TRMs compared to a larger clustering resolution.
One approach adopted at the beginning consisted of applying the clustering within individual chromatin state windows generated by diHMM [69]. This approach had the limitation that, for most of the genome, only a few TFs had 10 or more binding sites in an individual window. Having too few binding sites made the estimated intensity functions similar for all TFs, resulting in a single big cluster. Initial clustering results on individual windows of the different chromatin states are shown for some TFs in Figure 2.2. The minimum number of required TF binding sites was set to 10. This result shows that only two TFs, JMJD3 and SMAD3, are in one cluster in the Broad Promoter and Boundary chromatin states, whereas a number of clusters are predicted in the Upstream Enhancer state. This is not surprising because Broad Promoter windows were on average small and Upstream Enhancer windows were larger. Consequently, Broad Promoter had few TFs whereas Upstream Enhancer had many TFs. Reducing the minimum number of binding sites for clustering to 5 increased the number of clusters in Broad Promoter, as shown in Figure 2.3. However, the clustering approach remained limited by the need for a certain number of peaks or binding sites per TF to detect clusters with different binding intensity patterns. To overcome this problem, windows for each domain were merged and shifted on an imaginary real line to form contiguous regions of binding sites. This approach enabled detection of unique binding patterns of different TFs within the same cluster and among different clusters while retaining enough resolution to compare the clustering results among different domain windows of the same chromatin state.
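The merge-and-shift step described above can be sketched in a few lines. This is a minimal illustration rather than the pipeline's actual code: the function name `map_peaks_to_line`, the half-open window convention, and the use of a single coordinate axis (one chromosome) are assumptions made for the example. The target interval length of 50 follows the interval used later in the chapter.

```python
from bisect import bisect_right


def map_peaks_to_line(windows, peaks, line_length=50.0):
    """Concatenate windows end to end and rescale peak positions to [0, line_length].

    windows: list of (start, end) half-open intervals, assumed non-overlapping
             and on a single coordinate axis (hypothetical simplification).
    peaks:   list of peak-center coordinates.
    Peaks that fall outside every window are dropped.
    """
    windows = sorted(windows)
    starts = [start for start, _ in windows]
    # offsets[i] is the total length of all windows that precede window i.
    offsets = []
    total = 0
    for start, end in windows:
        offsets.append(total)
        total += end - start
    if total == 0:
        return []
    mapped = []
    for p in sorted(peaks):
        i = bisect_right(starts, p) - 1  # last window starting at or before p
        if i >= 0 and p < windows[i][1]:
            shifted = offsets[i] + (p - windows[i][0])
            mapped.append(line_length * shifted / total)
    return mapped


# Two 1 kb windows; the peak at 3000 lies between them and is dropped.
print(map_peaks_to_line([(1000, 2000), (5000, 6000)], [1500, 3000, 5500]))
# prints [12.5, 37.5]
```

Because gaps between windows are removed before rescaling, two windows of the same state contribute to one contiguous region regardless of how far apart they sit on the genome.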
Figure 2.2: TF peak enrichment in different chromatin states based on a threshold requirement of 10 peaks for each TF.
Figure 2.3: TF peak enrichment in different chromatin states based on a threshold requirement of 5 peaks for each TF.
2.3 Building blocks of the computational pipeline
2.3.1 Data-sets
ChIP-seq data used in this study were obtained from the ChIP-Atlas [83] database for mouse neural stem cells. Twenty-one proteins were included in this study. Among these proteins, P300 is a co-activator; RAD21 is a subunit of the cohesin complex; SMCHD1 is a non-canonical member of the SMC protein family; NUP153 is one of the building blocks of the nuclear pore complex; JMJD3 is a lysine-specific demethylase that demethylates H3K27me2 or H3K27me3; KDM1A is also a histone demethylase, which demethylates both H3K4me and H3K9me; RNF2 is an E3 ubiquitin-protein ligase that mediates monoubiquitination of H2AK119Ub; BMI1 is a ring finger protein and a major component of the polycomb group complex 1 (PRC1); and SMAD3 and SMAD4 are signal transducers and transcriptional modulators mediating multiple signaling pathways. The remaining proteins are transcription factors such as OLIG2, FOXO3, ASCL1, MAX, NFIC, TCF3, SOX2, SOX21, SOX9, and POU5F1, many of which are known to play key roles in neural stem cells [53, 72]. Three major pre-processing steps were conducted: alignment to the reference genome using Bowtie2 [55] in default mode, conversion from SAM to BAM format using SAMtools [58] version 0.1.19, and peak calling using MACS2 [32] version 2.1.0 to obtain the binding regions of the proteins. Gene expression data were downloaded from GSE70872 (untreated samples).
2.3.2 Chromatin state identification through genome segmentation
diHMM [69] is a hidden Markov model based tool that can model the presence or absence of a histone mark to a high degree of accuracy. It can segment and annotate the genome into different chromatin states at multiple scales by modeling the genome-wide distribution of histone marks. By default, diHMM has two scales of classification: (a) the nucleosome level, which has finer-resolution chromatin state windows of around 200 base-pair (bp) length, and (b) the domain level, which is formed by stitching together similar nucleosome-level windows and thus has broader chromatin state windows extending over 100 kbp-long regions [69]. As explained by the diHMM authors, the domain-level states identified by diHMM are able to recapitulate known patterns in the chromatin literature and capture functional differences among diverse regulatory elements [69]. The first step to identify chromatin states is to binarize uniquely aligned BAM files, which is implemented in ChromHMM [28], a predecessor of diHMM. The diHMM software provides several nucleosome and domain level statistics, including nucleosome-level emissions, combined nucleosome-level fold enrichment for each domain, fractional genome coverage of each nucleosome and domain level state, and nucleosome and domain state lengths. These statistics, together with the relative distance of nucleosome and domain level states from the transcription start site (TSS) and the enrichment of nucleosome level states in genomic regions, were jointly analyzed to annotate each nucleosome and domain level state with a biologically relevant functional category.
2.3.3 TF binding pattern clustering via Dirichlet Process Mixture of Log Gaussian Cox Processes (DPM-LGCP)
The input to the proposed clustering algorithm is the set of binding regions generated by a peak calling tool (MACS2 in this case). Using these regions as input, the center of each TF peak was treated as a binary binding event, and these binding events were modeled along the genome by an inhomogeneous Poisson process (IP) for the following reasons: (i) the event of a binding site falling into a minuscule interval is a rare event (n → ∞, p → 0, np → λ), and (ii) the non-uniform distribution of TF peaks at different locations can be well characterized by the intensity function of the inhomogeneous Poisson process. For a TF with n binding site locations, these locations were mapped to points in a closed interval D on the real line, denoted by S = {s_1, ..., s_n}. Following the inhomogeneous Poisson process model setting, the likelihood of observing S can be written as [100]
\[
f(S \mid \lambda(s)) = \exp\left( |D| - \int_D \lambda(s)\, ds \right) \prod_{j=1}^{n} \lambda(s_j), \tag{2.1}
\]
where λ(s), s ∈ D, is the intensity function. The Poisson process likelihood (2.1) provides the basis for non-parametric clustering of TFs based on their binding patterns, resulting in identification of modules of co-binding TFs sharing similar regulatory functions. For a given ChIP-seq data-set of N TFs coming from K clusters (with K unknown), it is assumed that TFs in the same cluster share a common intensity function, distinct from those in other clusters. This constitutes a point process clustering problem with an unknown number of clusters. A Dirichlet process mixture of log Gaussian Cox processes (DPM-LGCP) model is proposed, which places a Dirichlet process (DP) prior on the latent log intensity functions to facilitate clustering of the intensity functions. Let S_i denote the binding site locations of the ith TF. The DPM-LGCP model can then be described as follows:
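\[
\begin{aligned}
S_i \mid \lambda_i(s) &\sim \mathrm{IP}(\lambda_i(s)), \quad s \in D, \quad i = 1, \dots, N, \\
\log(\lambda_i(s)) &= z_i(s), \quad z_i(s) \sim G, \\
G &\sim \mathrm{DP}(m, G_0), \quad G_0 = \mathrm{GP}(0, C_\theta),
\end{aligned} \tag{2.2}
\]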
where G is a random distribution with a DP prior. The DP prior is characterized by two parameters m and G_0, where m is the precision parameter and G_0 is the base measure. The base measure G_0 is assumed to be a Gaussian process with mean 0 and covariance kernel C_θ(·, ·), and θ contains parameters that control the shape of the covariance kernel.
The introduction of this DP prior on the latent log intensity functions naturally facilitates clustering of the N point processes based on their intensity functions. With this model, neither the number of clusters nor an ad-hoc distance measure between two point processes needs to be specified.
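As a concrete reading of the likelihood in Equation (2.1), the sketch below evaluates its logarithm for an intensity that is constant within bins, so the integral over D reduces to a finite sum. The piecewise-constant intensity and the function name `ipp_log_likelihood` are assumptions made purely for illustration; the model in this chapter uses a log Gaussian Cox process with INLA-based inference, which this example does not reproduce.

```python
import math
from bisect import bisect_right


def ipp_log_likelihood(points, bin_edges, rates):
    """Log of Equation (2.1) for an intensity that is constant within each bin.

    points:    event locations, assumed to lie in [bin_edges[0], bin_edges[-1]]
    bin_edges: increasing bin boundaries covering the interval D
    rates:     positive intensity value per bin (len(bin_edges) - 1 values)
    """
    length_d = bin_edges[-1] - bin_edges[0]
    # Integral of lambda(s) over D: sum of rate times bin width.
    integral = sum(r * (b - a) for r, a, b in zip(rates, bin_edges, bin_edges[1:]))
    log_lik = length_d - integral
    for s in points:
        # Index of the bin containing s; the last bin is closed on the right.
        k = min(bisect_right(bin_edges, s) - 1, len(rates) - 1)
        log_lik += math.log(rates[k])
    return log_lik


# With a unit intensity the two exponent terms cancel and every log(rate) is zero.
print(ipp_log_likelihood([0.5, 1.5], [0.0, 1.0, 2.0], [1.0, 1.0]))  # prints 0.0
```

Two TFs with the same number of sites can receive very different likelihoods under the same intensity, because the product term depends on where the sites fall rather than on how many there are.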
To overcome the difficulty of calculating the marginal likelihood of the point process S_i, approximate and efficient posterior inference using the Integrated Nested Laplace Approximations (INLA) package [93, 100] has been employed. The INLA approximation of the LGCP links the latent Gaussian process z_i(s) to a discrete spatial Gaussian Markov random field z = (z_1, ..., z_P) [60, 100, 109]. By doing so, the continuous covariance kernel of z_i(s) can be transformed into a discrete precision matrix of the B-spline basis coefficients on a regular grid, which enables very fast covariance computation [92]. Finally, posterior inference on the assignment of TFs to clusters can be performed through a Markov chain Monte Carlo (MCMC) algorithm using Neal's Gibbs sampler [77].
An equivalent representation of the above DPM-LGCP model is based on the stick-breaking representation of the DPM model. Let c_i ∈ {1, 2, ...} be a latent variable indicating the cluster assignment of the ith point process, i.e. c_i = k means that process i comes from the kth cluster. It is assumed that point processes from the same cluster share a common intensity function. Let {λ*_k} denote the unique intensity functions. The stick-breaking representation of the DPM-LGCP model is written as
\[
\begin{aligned}
S_i \mid c_i, \lambda^*_{c_i}(s) &\sim \text{inhomo-Poisson}(\lambda^*_{c_i}(s)), \quad i = 1, \dots, N; \\
\Pr\{c_i = k\} &= \pi_k, \quad k = 1, 2, \dots; \\
\pi_k &= V_k \prod_{l=1}^{k-1} (1 - V_l), \quad V_k \sim \mathrm{Beta}(1, m); \\
\lambda^*_k(s) &= \exp\{z^*_k(s)\}, \quad z^*_k(s) \sim \mathrm{GP}(0, C_\theta).
\end{aligned}
\]
With the model described above, approximate Bayesian inference can be performed using the Integrated Nested Laplace Approximations (INLA) package [93]. In particular, the INLA package provides the approximated marginal likelihood, which can be combined with the DPM sampling. Details of the approximation and the computational algorithm are provided in the Appendix.
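The stick-breaking weights π_k can be illustrated with a short simulation. The sketch below draws V_k from Beta(1, m) and builds the first few weights; truncating the infinite sequence at `n_sticks` is an assumption made so the example terminates, and the function name, seed, and value m = 1 are illustrative choices, not settings taken from this study.

```python
import random


def stick_breaking_weights(m, n_sticks, rng):
    """First n_sticks weights pi_k = V_k * prod_{l<k} (1 - V_l), with V_k ~ Beta(1, m)."""
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(n_sticks):
        v = rng.betavariate(1.0, m)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


rng = random.Random(0)
weights = stick_breaking_weights(m=1.0, n_sticks=20, rng=rng)
# Draw cluster labels for 21 processes (one per protein in this study) from the
# truncated weights; random.choices normalizes the weights internally.
labels = rng.choices(range(len(weights)), weights=weights, k=21)
print(len(set(labels)), "distinct clusters among", len(labels), "draws")
```

A smaller precision parameter m concentrates the mass on the first few sticks and therefore favors fewer clusters, while a larger m spreads the mass over many sticks.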
2.4 Case studies
2.4.1 Simulated data
Three 1-D inhomogeneous Poisson processes with Log-Gaussian intensity were generated. Next, binding site locations of 20 transcription factors were simulated by drawing indepen- dent samples from these inhomogeneous Poisson processes. In Figure 2.4, the solid curves denote the true intensities of the three processes.
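One common way to draw binding site locations from an inhomogeneous Poisson process is thinning: candidates are generated from a homogeneous process with a rate that bounds the intensity, and each candidate is kept with probability intensity / bound. The sketch below assumes this method for illustration, since the text does not state how the samples were drawn. The smooth `log_intensity` function is a hypothetical stand-in for a Gaussian process draw, and the interval [0, 50] is borrowed from the real-data mapping described later.

```python
import math
import random


def simulate_ipp(intensity, lam_max, lo, hi, rng):
    """Simulate event locations on [lo, hi] by thinning.

    intensity(s) must satisfy 0 <= intensity(s) <= lam_max on [lo, hi].
    """
    events = []
    s = lo
    while True:
        s += rng.expovariate(lam_max)  # next candidate from the homogeneous process
        if s > hi:
            return events
        if rng.random() * lam_max <= intensity(s):  # keep with prob intensity/lam_max
            events.append(s)


def log_intensity(s):
    # Hypothetical smooth log intensity; its maximum value is 1.5.
    return 0.5 + math.sin(s / 5.0)


rng = random.Random(1)
lam_max = math.exp(1.5)  # upper bound on exp(log_intensity(s))
sites = simulate_ipp(lambda s: math.exp(log_intensity(s)), lam_max, 0.0, 50.0, rng)
print(len(sites), "simulated binding sites")
```

Calling `simulate_ipp` repeatedly with the same intensity gives independent samples, which mirrors how several TFs in one simulated cluster share a common intensity function.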
Using the simulated binding site locations, the TFs were assigned according to the cluster setting. The estimated clustering results after ten MCMC iterations are shown in the summary below.
1. initial error rate: 0.3947
2. estimation error rate: 0
3. overall true marginal likelihood: 76.7315
4. overall estimated marginal likelihood: 76.73146
5. true marginal likelihood per cluster: 27.2765, 14.9562, 34.4986
6. estimated marginal likelihood per cluster: 27.2765, 14.9562, 34.4986
Figure 2.4: Application of the clustering algorithm on simulated data. (A) True and estimated binding intensities of twenty transcription factors in three clusters. (B) Illustration of the clustering result obtained in (A).
2.4.2 Real data
An example of clustering on real data is shown in Figure 2.5A,B. The left panels show the ChIP-seq peak distribution of proteins obtained from ChIP-Atlas in a specific genomic region, visualized using the IGV software. The proposed algorithm uses these genome locations as input, re-scales them to an imaginary real line (from 0 to 50), and estimates the binding intensity of each TF using an inhomogeneous Poisson point process. Finally, proteins sharing similar intensity patterns are clustered together. The right panels show the estimated cluster intensities. The values on the Y-axis represent the magnitude of the intensity function, and the X-axis represents the bins of mapped genome locations on the real line. Proteins belonging to a cluster are marked in the same color as that cluster.
2.4.3 Functional enrichment of transcription factors within chromatin states
Enrichment of each TF within each chromatin state was estimated to analyze the positional preference of TF binding. Enrichment of a particular TF in a particular chromatin state is calculated as (m/n)/(M/N), where m is the number of peaks of the TF in that state, n is the total number of peaks of the TF in all states, M is the size of the state, and N is the total size of all states. Functional enrichment analysis of TF peaks identifies the chromatin states that are highly enriched in TF binding [4] as well as the states that are depleted in TF binding, and thus provides insight into the genome-wide binding preferences of TFs governed by different histone marks.
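The enrichment ratio (m/n)/(M/N) defined above translates directly into code. The function name and the numbers in the example are hypothetical and chosen only so the arithmetic is easy to follow; they are not values from this study.

```python
def tf_enrichment(peaks_in_state, total_peaks, state_size, total_size):
    """Enrichment (m/n)/(M/N) of one TF in one chromatin state.

    peaks_in_state: m, number of peaks of the TF in the state
    total_peaks:    n, number of peaks of the TF in all states
    state_size:     M, size of the state (for example, in base pairs)
    total_size:     N, total size of all states (same unit as state_size)
    """
    return (peaks_in_state / total_peaks) / (state_size / total_size)


# A quarter of the peaks fall in a state covering one eighth of the genome,
# so the TF is twice as frequent there as expected from the state's size.
print(tf_enrichment(50, 200, 125_000, 1_000_000))  # prints 2.0
```

Values above 1 indicate that the TF binds the state more often than its size alone would predict, and values below 1 indicate depletion.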
Figure 2.5: Application of the clustering algorithm on real data. (A) Distribution of ChIP-seq peaks visualized in an IGV window and (B) the corresponding estimated binding intensities of the clusters. The proteins in each cluster are shown with the same color.
2.4.4 Gene expression and motif analysis
In this study, the overall expression of genes was compared among different chromatin states, in addition to the regulatory effect of individual TFs and predicted TF modules on genes within each chromatin state. RSEM [57] was used to estimate gene expression in transcripts per million (TPM). Each gene had three expression values obtained from triplicates. For comparison of gene expression among chromatin states, genes were selected by proximity (+/- 2kb from TSS) to nucleosome-level state windows for the corresponding domain-level state windows. For comparison of gene expression regulated by different TFs within a specific chromatin state, genes were selected by proximity (+/- 2kb from TSS) to TF peaks. Expression levels of each gene were averaged over replicates and log transformed. The effect of chromatin states on TF motifs [7] was also explored, since the nucleotide sequence of transcription factor binding sites is a critical factor for recruiting both the primary TF and its co-factors. The MEME [5] software was used to discover de-novo motif sequences of TFs, and those sequences were compared to the consensus TF motifs from the HOMER [40] motif library.
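The gene selection and expression summary described above can be sketched as follows. The function names, gene labels, and coordinates are hypothetical. The base-2 logarithm and the pseudocount of 1 are assumptions made so that genes with zero TPM remain defined; the text states only that averaged values were log transformed.

```python
import math


def genes_near_peaks(tss_by_gene, peak_centers, max_dist=2000):
    """Genes whose TSS lies within max_dist bp of at least one peak center."""
    return sorted(
        gene
        for gene, tss in tss_by_gene.items()
        if any(abs(tss - p) <= max_dist for p in peak_centers)
    )


def mean_log_expression(tpm_replicates, pseudocount=1.0):
    """Average TPM over replicates, then log2-transform (pseudocount is an assumption)."""
    mean_tpm = sum(tpm_replicates) / len(tpm_replicates)
    return math.log2(mean_tpm + pseudocount)


tss_by_gene = {"GeneA": 10_000, "GeneB": 50_000}  # hypothetical TSS coordinates
print(genes_near_peaks(tss_by_gene, [11_500, 80_000]))  # prints ['GeneA']
print(mean_log_expression([3.0, 7.0, 11.0]))  # prints 3.0
```

Averaging before the log transform, as the text describes, keeps a single replicate with zero expression from dominating the summary.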
2.5 Results
2.5.1 Genome segmentation and chromatin state identification
As described in the methods section, diHMM segments a genome into distinct chromatin states and outputs the states as regions within two BED files labeled by nucleosome and domain indexes (e.g. N1, N2... and D1, D2... respectively). For the nucleosome level states, annotation of the chromatin states into functionally relevant categories was performed using information from the emission probabilities of the nucleosome states (Figure 2.6), fractional genome coverage (Figure 2.6), relative enrichment in different genomic regions (Figure 2.7), and the distribution of nucleosome states around the TSS (Figure 2.8A). Similarly, by comparing the nucleosome-level fold enrichments in each domain level state and the distribution of the domain level states around the TSS (Figure 2.8B), the domain-level states were further grouped into broader functional categories, as shown in Figure 2.6.
2.5.2 Genome annotation enrichment
CpG island sites, RefSeq transcripts, and Repeats for the mm10 genome were downloaded from the UCSC database to calculate the fraction of genomic annotation for each nucleosome-level state (Figure 2.7). Active Promoter states were enriched in the 5' UTR regions, whereas the Transcriptional Elongation states were enriched in the Intron regions. In addition, the Repetitive state was highly enriched in the Satellite regions.
2.5.3 Relative enrichment around TSS
Relative enrichment of the nucleosome level states within a maximum of +/- 4kb, and of the domain level states within a maximum of +/- 80kb, around transcription start sites (TSS) was calculated. From the spatial distribution of the nucleosome level and domain level states (Figure 2.8), the Active Promoters and Bivalent Promoters at the nucleosome level and the Broad Promoters at the domain level are enriched around the TSS. On the other hand, Transcriptional Elongation at the nucleosome level and Upstream Enhancer at the domain level are enriched away from the TSS. Repetitive and Low Signal states are enriched sporadically on either side of the TSS.
Figure 2.6: (a) Nucleosome level emission matrix generated by diHMM. Functional annotations of the nucleosome level states are shown in the color bar on the left. Scale varies linearly between 0 and 1. (b) Fractional genome coverage for nucleosome and domain level states. Scale varies logarithmically between 10^-4 and 1. (c) Combined nucleosome-domain fold change obtained by diHMM. Functional annotations of the states are shown in the color bar on the left. Scale varies logarithmically between 0.5 and 50.
Figure 2.7: Nucleosome states genome annotation enrichment.
Figure 2.8: Nucleosome and domain level state distribution around TSS.
2.5.4 State sizes
From the nucleosome and domain state sizes (Figure 2.9), the Upstream Enhancer and Low Signal states at domain level have wider state sizes compared to the Broad Promoter and Super Enhancer states. The Low Signal states not only have larger fractional genome coverage at both nucleosome and domain levels (Figure 2.6B) but also have in general larger state sizes. The Low Coverage states, on the other hand, have very small state sizes. Some of the Broad Promoter states also have smaller state sizes.
2.5.5 Functional annotation of nucleosome and domain level states
The nucleosome and domain level states identified by diHMM were annotated into functional states based on various statistics. Among the nucleosome level states there were: Active Promoters (N1-N5: enriched in H3K27ac, H3K4me3, H3K4me2, and enriched around TSS), Promoter Flanking states (N6-N8: enriched in H3K27ac, H3K4me3, and H3K4me2 flanking TSS), Bivalent Promoter (N9 and N10: enriched in H3K27me3 and H3K4me2 or H3K4me3), Poised Enhancer (N11 and N12: strong enrichment in H3K27me3 and weak enrichment in H3K4me1), Strong Enhancer (N13: enriched in H3K27ac and H3K4me1), Weak Enhancer (N14: enriched in H3K4me1), Transcribed Enhancer (N15-N17: enriched in H3K36me3 and sometimes in H3K4me3), Transcriptional Elongation (N18 and N19: enriched in H3K36me3), CTCF promoter (N20 and N21: enriched in CTCF and enriched around TSS), CTCF (N22: enriched in CTCF), H4K20me1 (N23: enriched only in H4K20me1), Polycomb Repressed (N24: enriched only in H3K27me3), Heterochrom/Low Signal (N25-N27: low enrichment in almost all marks and spanning a large proportion of the genome, Figure 2.6), Repetitive (N28: enriched in multiple marks and in Satellite regions, Figure 2.7), H3K9ac (N29: enriched only in H3K9ac), and Low Coverage (N30: low fractional genome coverage, Figure 2.6B).
Among the domain level states there were: Broad Promoter (D1-D7: enriched in active promoter and promoter flanking states), Bivalent Promoter (D8-D12: enriched in bivalent promoter, poised enhancer, and Polycomb Repressed states), Poised Enhancer (D13: enriched in poised enhancer), Super Enhancer (D14: enriched in active enhancer), Upstream Enhancer (D15-D17: enriched in weak enhancer and upstream from TSS), Transcribed (D18-D20: enriched in the Transcriptional Elongation state and distributed over a broad region from TSS), Boundary (D21: enriched in CTCF domain), Polycomb Repressed (D22-D24: enriched in Poised Enhancer and Polycomb Repressed), Low Signal (D25-D26: not enriched in any state), and Low Coverage (D27-D30: low fractional genome coverage). In addition, at both the nucleosome and domain levels the Low Signal state had the largest state sizes whereas the Low Coverage state had the smallest state sizes (Figure 2.9).
Figure 2.9: Quartile box plots of log10 of all nucleosome-level (a), or domain-level (b) state sizes. Box plot whiskers extend to 1.5 times the interquartile range.
2.5.6 Chromatin state preference of individual TF binding and gene expression regulation
To analyze the distribution of protein-DNA binding sites in each chromatin state, ChIP-seq data were integrated with the chromatin state map of mouse neural stem cells (NSCs) (Figure 2.10). For most proteins, the binding events occur in open chromatin regions, although some pioneer transcription factors have the ability to bind directly to condensed chromatin and recruit co-factors [24, 102, 125]. In both active and repressed states, enrichment of pioneer TFs as well as other proteins (that might have been recruited by the former) was observed. BMI1, which is known to bind to regions marked by both H3K27me3 and H3K4me3 [12], was found to be highly enriched in the Bivalent Promoter and Poised Enhancer states (Figure 2.10). In addition, most TFs were found to be enriched in the Super Enhancer states, except for RAD21, BMI1, SMCHD1, and NUP153. A similar observation was made in [72], where the authors showed that OLIG2, the NFI family, SOX2, SOX9, TCF3, FOXO3, ASCL1, SOX21, and MAX were associated with active enhancer regions. Next, to study the regulatory effect of histone marks on proximal genes, expression levels of genes with promoters were compared among different chromatin states. The proximal genes in the Broad Promoter state had a higher median expression than those in the Polycomb Repressed or Low Coverage states (Figure 2.10B). To understand the influence of chromatin states on transcriptional regulation, genes were grouped in each state based on the presence of different TF binding sites surrounding their TSS. As expected, the median expression of the genes in the states rich in active regulators was higher and more consistent compared to that in the repressed states, where a differential gene expression pattern was observed (Figure 2.10C,D, and Figure 2.12).
Figure 2.10: (a) Enrichment (in log scale) of TF peaks in different chromatin states showing binding preference of individual TFs. (b) Comparison of average expression (in log scale) of proximal genes (+/- 2kb from TSS) in different domain level chromatin states. Genes were mapped to the nucleosome-level states for the corresponding domain-level states. (c) Comparison of average expression (in log scale) of proximal genes (+/- 2kb from TSS) mapped to individual TFs in the Broad Promoter state and in (d) the Poised Enhancer state.
2.5.7 Chromatin state and preferential TF clustering
The distributions of ChIP-seq peaks across distinct chromatin states indicate that functionally relevant proteins may have similar binding patterns. Co-occupancy of proteins was determined in a specific chromatin state through a non-parametric Bayesian clustering approach that identifies the combinatorial binding patterns of proteins. Each state at the domain level had multiple windows over different chromosomes across the genome. As discussed in the challenges section, most windows contain very few peaks, although the average domain-level window length ranged from 3.8 kb to over 450 kb. This prevented prediction of modules within a single domain window. To ensure that the unique properties of the domain-level states were preserved during clustering, all windows of a single domain-level state (e.g. D1) were merged across the entire genome, and the genome positions were mapped to a common interval [0, 50] on an imaginary real line. Adopting this approach for all domain-level states eliminated the problem that different domains may have different sizes. Next, for each domain-level state, the proposed algorithm used these mapped binding locations, computed the individual binding intensity of each protein, and clustered proteins with similar intensity patterns together to construct transcriptional regulatory modules. This process was repeated for each domain-level state. The predicted regulatory modules in different chromatin states are visualized by showing the estimated binding intensities of the proteins and the corresponding clusters in Figure 2.11 and in Figures 2.14 and 2.15. Clustering results in two contrasting states, Broad Promoter (Figure 2.11(a)) and Poised Enhancer (Figure 2.11(b)), showed noticeable differences in the shape of the binding intensity of both individual proteins and the predicted clusters. In addition, the set of co-factors for different proteins varied between the two states.
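The merging and rescaling step described above can be illustrated with a minimal sketch. It assumes that the windows of one domain-level state are concatenated in the order given and that the concatenated length is rescaled linearly onto [0, 50]; the exact mapping used in the analysis is not specified in this section. The function name `map_to_common_interval` and the toy coordinates are hypothetical.

```python
def map_to_common_interval(windows, peaks, upper=50.0):
    """Map peak positions inside the windows of one state onto [0, upper].

    windows: list of (chrom, start, end) tuples belonging to one state.
    peaks:   list of (chrom, position) tuples for one protein.
    Assumption: windows are concatenated in the given order and the
    concatenated length is rescaled linearly to [0, upper].
    """
    total = sum(end - start for _, start, end in windows)
    mapped = []
    for chrom, pos in peaks:
        offset = 0  # cumulative length of the windows already passed
        for w_chrom, start, end in windows:
            if chrom == w_chrom and start <= pos < end:
                mapped.append((offset + pos - start) / total * upper)
                break
            offset += end - start
        # peaks that fall outside every window of this state are ignored
    return mapped


# Toy example (made-up coordinates): two windows of one state.
windows = [("chr1", 1000, 2000), ("chr2", 500, 1500)]
peaks = [("chr1", 1500), ("chr2", 1000), ("chr3", 10)]
mapped = map_to_common_interval(windows, peaks)
```

Because every state is mapped onto the same interval, binding intensities estimated in states with very different total window lengths become comparable on a common axis.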
BMI1 is known to bind to repressed and poised states [12] and was predicted as a single-protein cluster in the Poised Enhancer (Figure 2.11(b)) and Bivalent Promoter states (Figure 2.14). In other states such as Broad Promoter, Super Enhancer, and Upstream Enhancer, BMI1 was predicted with RNF2, RAD21, or SMCHD1 (Figures 2.14 and 2.15). It is worth noting that both BMI1 and RNF2 are components of the Polycomb group multi-protein complex, whereas SMCHD1, a non-canonical member of the SMC super-family, is also known to be associated with transcriptional repression [20] and polycomb recruitment mechanisms [34]. The proposed approach was able to cluster several other functionally relevant proteins that shared similar binding patterns, for example, JMJD3-SMAD3 (Figure 2.11) in most chromatin states (in [29], the authors found that JMJD3 is recruited to gene promoters by SMAD3 in neural stem cells and is essential to activate TGF-β-responsive genes) and FOXO3-NFIC-SOX-TCF3 (Figure 2.14) in Upstream Enhancer states (in [72], the authors showed interactions among NFI family, TCF3, SOX2, SOX9, and FOXO3). Additional predicted protein-protein interactions are shown in Table S1.
To assess the strength of association between two co-binding proteins, a pairwise protein co-binding probability matrix was calculated from the posterior samples of the MCMC procedure (Figure 2.11(c), (d)). Each value in the matrix indicates the frequency of observing the corresponding two proteins in the same cluster out of the total 200 MCMC iterations. A high protein co-binding probability (indicated by darker color) provides stronger evidence of the existence of the protein pair in a cluster. Further, a three-fold assessment of the robustness of the clustering algorithm was performed.
Figure 2.11: (a), (b) Estimated cluster binding intensities along with the individual TF binding intensities in the Broad Promoter and Poised Enhancer states, respectively. In each figure, the estimated binding intensities of the individual TFs are shown as dotted lines and the estimated binding intensities of the clusters are shown as solid lines. TFs in each cluster are shown in the same color as that of the cluster. The X axis represents the genomic locations mapped on the real line between 0 and 50. The Y axis represents the estimated binding intensities, both for the individual TFs and for the identified clusters. (c), (d) Pairwise protein co-binding probabilities corresponding to (a) and (b), respectively. (e), (f) Comparison of proximal gene expressions (TPM) regulated by the clusters in (a) and (b), respectively. Only those clusters having (1) multiple TFs and (2) proximal genes for at least two TFs are shown in the figure to explain the combinatorial regulation of gene expression by multiple TFs.
Next, the expression levels (in Transcripts Per Kilobase Million, TPM) of proximal genes regulated by the predicted clusters in each state were examined to understand transcriptional regulation by combinatorial binding of proteins in different chromatin states. The results showed that the median expression levels of the genes regulated by distinct clusters are close to each other in the Broad Promoter state (Figure 2.11(e)). In contrast, the median expression level of the proximal genes combinatorially regulated by the FOXO3-RAD21-SMAD4 cluster in the Poised Enhancer state was higher than that of the genes combinatorially regulated by the other cluster (Figure 2.11(f)). Similar behavior was observed in the Bivalent Promoter, Upstream Enhancer and Boundary states (Figure 2.13).
These results show that gene expression could change due to combinatorial binding of proteins in different chromatin states. Overall, across the 30 domain states, every state contained some proteins with no binding locations near proximal gene promoters (Figure 2.12). Additionally, the Broad Promoter, Upstream Enhancer and Transcribed states had more binding sites near proximal genes than repressed states such as Polycomb Repressed or Poised Enhancer.
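The pairwise co-binding probability described in this section, that is, the fraction of MCMC iterations in which two proteins fall in the same cluster, can be sketched as follows. This is an illustrative sketch only: the function name `co_binding_matrix`, the representation of each posterior sample as a protein-to-label dictionary, and the four toy samples (standing in for the 200 iterations) are assumptions, not the implementation used in the analysis.

```python
def co_binding_matrix(samples, proteins):
    """Return the pairwise co-clustering frequency over posterior samples.

    samples:  list of dicts mapping protein name -> cluster label,
              one dict per retained MCMC iteration (assumed format).
    proteins: ordered list of protein names defining rows and columns.
    """
    n = len(proteins)
    counts = [[0] * n for _ in range(n)]
    for labels in samples:
        for i, p in enumerate(proteins):
            for j, q in enumerate(proteins):
                # count iterations where both proteins share a cluster
                if labels[p] == labels[q]:
                    counts[i][j] += 1
    return [[c / len(samples) for c in row] for row in counts]


# Toy posterior samples (made-up cluster labels, 4 iterations).
proteins = ["BMI1", "RNF2", "SOX2"]
samples = [
    {"BMI1": 0, "RNF2": 0, "SOX2": 1},
    {"BMI1": 0, "RNF2": 0, "SOX2": 1},
    {"BMI1": 0, "RNF2": 1, "SOX2": 1},
    {"BMI1": 2, "RNF2": 2, "SOX2": 0},
]
prob = co_binding_matrix(samples, proteins)
```

Only equality of labels within an iteration is compared, so the result is unaffected by cluster labels being renumbered from one iteration to the next.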
2.5.8 Estimated clusters in chromatin states
Figures 2.14 and 2.15 show the estimated binding intensities in the 28 chromatin states. In each figure the dotted lines represent the binding intensities of the individual proteins and the solid lines represent the estimated intensities of the identified clusters. The corresponding proteins in each cluster are shown in the same color as that of the cluster.
Figure 2.12: Gene expression comparison for individual proteins among different chromatin states.
Figure 2.13: Gene expression comparison for predicted clusters among different chromatin states.
Figure 2.14: Estimated cluster intensities in different chromatin states.
Figure 2.15: Estimated cluster intensities in different chromatin states.
2.5.9 Protein-DNA binding motif preferences in chromatin states
It is known that local epigenetic states affect the binding of proteins to their targets, and that protein-DNA binding may prevent or facilitate epigenetic changes at the binding sites [13, 120]. A protein is known to bind to the DNA with different motifs depending on the presence of its co-binding partners [7]. To examine the influence of chromatin states and co-binding partners on the binding sequences of a protein, ChIP-seq peaks for each protein that overlapped with each chromatin state were grouped, and the binding motifs of the protein in an active state (Broad Promoter/Super Enhancer) and a repressed state (Poised Enhancer/Polycomb Repressed) were analyzed (Figure 2.16). The MEME suite [6] was used to identify de-novo motif sequences, and from the results the motif that matched the candidate protein's consensus motif or was known as a secondary motif was selected. Neither the HOMER [40] nor the JASPAR [73] database documents a reference motif for BMI1, KDM1A, JMJD3, NPAS3, NUP153, RNF2, RAD21, P300, and SMCHD1. For the remaining proteins with known motifs, genomic sequences from two different subsets of peaks overlapping with two contrasting chromatin states were extracted as mentioned before, and de-novo motifs were predicted. Based on the MEME results, a protein's binding preferences may be broadly categorized into one of three types: (1) De-novo sequences that closely matched the protein's consensus motif, such as ASCL1 (Figure 2.16(a)), MAX, NFIC, FOXO3, and TFs from the SOX family. (2) De-novo sequences that either did not match the consensus/secondary motifs or matched the consensus motif but were weakly enriched. It has been observed that the ATF/CREB motifs ('TGAYRTCA') are often enriched in genes targeted by β-catenin/TCF/LEF [59, 108]. For TCF3, highly enriched de-novo sequences resembling its consensus motif were predicted in the repressed state (Figure 2.16). However, in the active state it was observed that the 'TGACGTCA' pattern was highly enriched.
Figure 2.16: Effect of chromatin states and co-binding partner on binding motifs. (A) De-novo motifs obtained using MEME for ASCL1 are similar to the consensus motif in both Broad Promoter and Polycomb Repressed states although the co-factors of ASCL1 are different in the two states. (B) De-novo motifs obtained using MEME for TCF3 show differences in motifs between the two states with different co-factors. The motifs in the active state resemble the β-catenin/TCF/LEF motif whereas the motifs in the repressed state resemble the E-Box consensus motif.
This could imply that TCF3 might have been recruited by other co-factors, resulting in indirect binding in that particular state. For OLIG2, both active and repressed chromatin states contained de-novo sequences resembling its consensus motif. However, these sequences were highly enriched in the repressed state and weakly enriched in the active state. The fact that the E-value (footnote 2) of the de-novo sequences of OLIG2 was not significant in the active state might suggest indirect binding in that state, probably governed by other factors. (3) De-novo sequences resembling secondary motifs, such as those of the SMAD family. For SMAD4, sequences with the 'GCCGC' pattern were highly enriched in both active and repressed chromatin states, as reported previously in [43], where the authors found that SMAD4 can bind to both methylated and un-methylated motifs of distinct sequences. Similarly, for SMAD3, highly enriched sequences rich in 'GC' content were found in both chromatin states; such sequences have been reported as secondary SMAD3 motifs, often associated with known SMAD binding partners in TGF-β responses [113]. Interestingly, for POU5F1 the E-Box element 'CANNTG' was significantly enriched in both active and repressed chromatin states.
In [123], the authors had also observed that the E-Box motif was significantly enriched, with a p-value of 10^-6, in a POU5F1 ChIP-seq experiment on ES cells with Dnmt1, Dnmt3A and Dnmt3B triple knockout, whereas the consensus POU5F1 motif was weakly enriched, with a p-value of 0.1.
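The degenerate patterns quoted in this section ('CANNTG', 'TGAYRTCA') use IUPAC nucleotide codes, where N stands for any base, Y for C or T, and R for A or G. The following minimal sketch shows how such a pattern can be checked against a set of sequences by plain string matching. It is not the MEME procedure and computes no enrichment statistic or E-value; the function names and the toy sequences are hypothetical, and only the forward strand is scanned.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Compile an IUPAC consensus string into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))


def count_sequences_with_motif(motif, sequences):
    """Count sequences containing at least one forward-strand match."""
    pattern = iupac_to_regex(motif)
    return sum(1 for seq in sequences if pattern.search(seq.upper()))


# Toy sequences (made up for illustration, not taken from the data).
seqs = ["TTCAGCTGAA", "GGTGACGTCAGG", "AAAAAAAA"]
ebox_hits = count_sequences_with_motif("CANNTG", seqs)
creb_hits = count_sequences_with_motif("TGAYRTCA", seqs)
```

In this toy input, the first sequence contains the E-Box instance 'CAGCTG' and the second contains 'TGACGTCA', the specific ATF/CREB instance mentioned for TCF3 in the active state.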
2.6 Known protein-protein interactions predicted by the algorithm
Some of the known interactions predicted by the algorithm are shown in Table S1. The first column shows co-occupancy of the proteins predicted by the algorithm together with the states where the co-occupancy was observed. The second column provides references that had also reported the predicted co-occupancies.
2 E-value: an estimate of the statistical significance of each motif that MEME finds (the "motif E-value").
2.7 Robustness of the proposed clustering technique
2.7.1 Effect of varying domain level state sizes
On a diHMM domain, 1000 windows were repeatedly sampled and clustering was performed on these windows. To assess the consistency of the results, error rates were computed from the difference between the clusters estimated on the entire domain and the clusters estimated on each sample of 1000 windows. To compute the error rate, a binary protein-protein pair matrix of size N by N is defined, where N is the number of proteins. The (i, j)th entry of the matrix is 1 if protein i and protein j are in the same cluster and 0 otherwise. Next, the upper triangular half of this pair matrix is taken (the matrix is symmetric) and converted to a