Uncovering Networks by Integrating One Dimensional 'Omics and Three Dimensional Chromatin Structure

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Xun Lan

Graduate Program in Integrated Biomedical Science Program

The Ohio State University

2012

Dissertation Committee:

Assistant Professor Victor Jin, Advisor

Associate Professor Kun Huang

Professor Jeffrey Parvin

Assistant Professor Qianben Wang

Copyright by

Xun Lan

2012

Abstract

Transcriptional regulation is a critical mediator of many normal cellular processes as well as disease progression. It involves a process by which different transcription factors bind to specific short DNA sequences termed cis-regulatory elements (CREs), such as promoters, enhancers, silencers and insulators, and thus control the transcription of different genes. Here we discuss both experimental and computational challenges in profiling Transcription Factor Binding Sites (TFBSs) using ChIP-based high throughput techniques. We then describe a method (w-ChIPeaks) for identifying TFBSs using such techniques.

The accessibility of CREs is often influenced by epigenetic modifications including DNA methylation, histone acetylation and methylation, which can be associated with the activation or repression of genes. Methyl-CpG binding domain sequencing (MBD-seq) is widely used to survey DNA methylation patterns. We generated high depth MBD-seq data in MCF-7 cells and developed a bi-asymmetric-Laplace model (BALM) to estimate the methylation level of individual CpG dinucleotides. Clonal bisulfite sequencing results showed that the methylation status of tested regions was accurately detected with high resolution using the proposed model. Using the predicted methylation score, genome wide analysis showed a moderate negative correlation between DNA methylation and DNase hypersensitivity, which is an indication of nucleosome depletion.


Transcription factors (TFs) often co-localize at CREs, form protein complexes, and collaboratively regulate gene expression. Machine learning and Bayesian approaches have been used to identify TF modules in a one-dimensional context. However, recent studies using high-throughput technologies have shown that TF interactions should also be considered in three-dimensional nuclear space. Taking K562 as a model cell line, we have analyzed publicly available Hi-C data, which enables genome-wide unbiased capturing of chromatin interactions, using a Mixture Poisson Regression Model and a power-law decay background to define a highly specific set of interacting genomic regions. We integrated multiple ENCODE Consortium resources with the Hi-C results, including DNase-seq data and ChIP-seq data for 45 transcription factors and 9 histone modifications. Different interacting loci display distinct epigenetic status and relationships with TFBSs. As expected, many of the transcription factors show binding patterns specific to interacting loci that encompass promoters or enhancers. This work indicates that protein-protein interactions may serve as a driving force of chromatin dynamic reorganization.


Dedication

This document is dedicated to my family.


Acknowledgments

I am very grateful to my advisor, Dr. Victor Jin, for his guidance and supervision of my Ph.D. study and research.

I would like to acknowledge my wonderful committee, Dr. Kun Huang, Dr. Jeffrey Parvin and Dr. Qianben Wang, for their help and inspiration with my project and dissertation.

I would also like to acknowledge Dr. Joanna Groden for her support for my study and career.

I would like to thank my peer colleague, William Hankey, for his help with my classes and, more importantly, for proofreading virtually all my scientific writing. And now he has to read another 200 pages.

Finally, and most importantly, I would like to thank my beautiful wife for her support and encouragement. I would like to thank my twin baby girls for the great joy they bring to me.


Vita

2005 ...... B.S. Biotechnology, Ocean University of China

2007 - 2008 ...... Graduate Research Assistant, University of North Dakota

2008 - 2009 ...... Graduate Teaching Assistant, University of Memphis

2009 - Present ...... Graduate Research Associate, The Ohio State University

Publications

Lan, X., Witt, H., Katsumura, K., Ye, Z., Wang, Q., Bresnick, E.H., Farnham, P.J. and Jin, V.X. Integration of Hi-C and ChIP-seq data reveals distinct types of chromatin hubs. Nucleic Acids Res. Accepted.

Lan, X., Farnham, P.J. and Jin, V.X. Uncovering transcription factor modules using one-dimensional and three-dimensional analyses. J Biol Chem, Accepted.

Kenney, B.A., Lan, X., Huang, T.H.-M., Farnham, P.J. and Jin, V.X. (2012) Using ChIPMotifs for De Novo Motif Discovery of OCT4 and ZNF263 Based on ChIP-Based High-Throughput Experiments. Methods in Molecular Biology, 802, 323-34.


Lan, X. and Jin, V.X. Experimental and computational challenges from array-based to sequence-based ChIP techniques. Current Bioinformatics, Review, Accepted.

Liu, R., Lei, J.X., Luo, C., Lan, X., Chi, L., Deng, P., Lei, S., Ghribi, O. and Liu, Q.Y. (2012) Increased EID1 nuclear translocation impairs synaptic plasticity and memory function associated with pathogenesis of Alzheimer's disease. Neurobiol Dis, 45, 902-912.

Lan, X., Bonneville, R., Apostolos, J., Wu, W. and Jin, V.X. (2011) W-ChIPeaks: a comprehensive web application tool for processing ChIP-chip and ChIP-seq data. Bioinformatics, 27, 428-430.

Zuo, T., Liu, T.M., Lan, X., Weng, Y.I., Shen, R., Gu, F., Huang, Y.W., Liyanarachchi, S., Deatherage, D.E., Hsu, P.Y. et al. (2011) Epigenetic silencing mediated through activated PI3K/AKT signaling in breast cancer. Cancer Res, 71, 1752-1762.

Lan, X., Adams, C., Landers, M., Dudas, M., Krissinger, D., Marnellos, G., Bonneville, R., Xu, M., Wang, J., Huang, T.H.-M. et al. (2011) High resolution detection and analysis of CpG dinucleotides methylation using MBD-seq technology. PLoS One, 6, e22226.

Frietze, S., Lan, X., Jin, V.X. and Farnham, P.J. (2010) Genomic targets of the KRAB and SCAN domain-containing protein 263. J Biol Chem, 285, 1393-1403.

Bapat, S.A., Jin, V., Berry, N., Balch, C., Sharma, N., Kurrey, N., Zhang, S., Fang, F., Lan, X., Li, M. et al. (2010) Multivalent epigenetic marks confer microenvironment-responsive epigenetic plasticity to ovarian cancer cells. Epigenetics, 5, 716-729.

Gan, L., Qiao, S., Lan, X., Chi, L., Luo, C., Lien, L., Yan Liu, Q. and Liu, R. (2008) Neurogenic responses to amyloid-beta plaques in the brain of Alzheimer's disease-like transgenic (pPDGF-APPSw,Ind) mice. Neurobiol Dis, 29, 71-80.

Chen, X., Lan, X., Roche, I., Liu, R. and Geiger, J.D. (2008) Caffeine protects against MPTP-induced blood-brain barrier dysfunction in mouse striatum. J Neurochem, 107, 1147-1157.

Zhong, Q., Zhang, Q., Wang, Z., Qi, J., Chen, Y., Li, S., Sun, Y., Li, C. and Lan, X. (2008) Expression profiling and validation of potential reference genes during Paralichthys olivaceus embryogenesis. Mar Biotechnol (NY), 10, 310-318.

Zhong, Q., Zhang, Q., Chen, Y., Sun, Y., Qi, J., Wang, Z., Li, S., Li, C. and Lan, X. (2008) The isolation and characterization of myostatin gene in Japanese flounder (Paralichthys olivaceus): Ubiquitous tissue expression and developmental specific regulation. Aquaculture, 280, 247-255.

Wang, J.#, Lan, X.#, Hsu, P.-Y., Huang, K., Parvin, J., Huang, T.H.-M. and Jin, V.X. Genome-wide analysis uncovers high frequency, strong differential chromosomal interactions and their associated epigenetic patterns in E2 mediated gene regulation. Submitted.

#Co-1st authors

Fields of Study

Major Field: Integrated Biomedical Science Program


Table of Contents

Abstract ...... ii

Dedication ...... iv

Acknowledgments...... v

Vita ...... vi

List of Tables ...... xv

List of Figures ...... xvii

Chapter 1: Introduction ...... 1

Chapter 2: Experimental and computational challenges from array-based to sequence-based ChIP techniques ...... 8

Introduction ...... 9

From ChIP-chip to ChIP-seq ...... 11

Experimental Challenges of ChIP-seq ...... 13

Optimization of PCR cycle number during Whole Genome Amplification (WGA) ...... 13

Specificity of the antibody ...... 15

Computational Challenges ...... 15

Utilization of multiple mapped tags ...... 16

Peak detection ...... 17

Maximize the resolution ...... 18

Sequencing depth and coverage ...... 19

Data normalization ...... 20

Conclusions and Perspectives ...... 21

Chapter 3: A bin-based enrichment level threshold (BELT) algorithm to analyze ChIP-seq data ...... 23

Introduction ...... 24

Results ...... 26

Evaluation of BELT by synthetic data ...... 26

Performance of BELT on real ChIP-seq data ...... 28

Validation by de novo motif discovery ...... 30

Comparison with other ChIP-seq programs ...... 31

Discussion ...... 33

Materials and Methods ...... 36

An overview of the BELT algorithm ...... 36

Datasets ...... 37

Calculation of an average fragment length for a ChIP sample ...... 37

Decoding the ChIP fragment position ...... 38


Determination of significant enrichment level thresholds ...... 38

Definition of a peak and localization of the target site ...... 39

Data normalization ...... 39

Calculation of the p-value ...... 39

Ranking the resultant peaks ...... 40

Estimation of False Discovery Rate (FDR) ...... 40

Discovery of motifs ...... 43

Program implementation ...... 43

Chapter 4: High resolution detection and analysis of CpG dinucleotides methylation using MBD-seq technology ...... 46

Introduction ...... 47

Results ...... 49

Overview of experimental design and analysis ...... 49

MBD-seq data analysis ...... 55

Optimal depth of MBD-seq ...... 58

Resolution and efficiency of BALM ...... 61

Clonal bisulfite sequencing validation ...... 65

Discussion ...... 68

Materials and Methods ...... 71


Methylated DNA enrichment and high-throughput sequencing (MBD-seq) ...... 71

Clonal Sanger bisulfite sequencing ...... 72

Public datasets ...... 74

BALM algorithm ...... 74

Estimation of False Discovery Rate (FDR) ...... 78

Program implementation ...... 79

Chapter 5: Integration of Hi-C and ChIP-seq data reveals distinct types of chromatin linkages ...... 81

Introduction ...... 82

Results ...... 84

Identifying interacting loci ...... 84

Clustering interacting loci ...... 91

Associating transcription factor binding sites with the sets of clustered interacting chromatin loci ...... 99

Correlating gene expression with different sets of interacting loci ...... 105

Discussion ...... 114

Materials and Methods ...... 119

Overview of the integrated data analysis flow ...... 119

Modeling of Hi-C data to identify interacting loci ...... 120


Self-ligation filtering threshold determination ...... 125

Apriori algorithm ...... 128

Epigenetic and transcription factor data analysis ...... 128

RNA-seq analyses...... 129

Chapter 6: Genome-wide analysis uncovers high frequency, strong differential chromosomal interactions and their associated epigenetic patterns in E2-mediated gene regulation ...... 131

Introduction ...... 132

Results ...... 134

Identification of E2-mediated high frequency chromosomal interacting regions ... 134

Characterization of E2/ERα-regulated genes in both highest and lowest frequency chromosomal interacting regions ...... 141

Correlation between E2-mediated chromosomal interaction frequency and epigenetic modifications ...... 144

Quantification of E2-mediated differential chromosomal interactions ...... 148

Correlation between E2-mediated differential chromosomal interactions and epigenetic modifications ...... 153

Characterization of E2/ERα-regulated genes in high frequent and strong differential chromosomal interactions ...... 156

Discussion ...... 157

Materials and Methods ...... 160

Hi-C Method ...... 160

Chromosome interaction frequency...... 161

Processing ChIP-seq of histone modifications, ERα and Pol-II, MBD-seq and FAIRE-seq data ...... 162

Validations of ERα regulated interaction by quantitative 3C-PCR ...... 162

Chapter 7: Summary of computational approaches and perspectives ...... 164

Computational methods...... 165

Detecting Transcription Factor Binding Sites (TFBSs) ...... 165

Identifying TF modules from one-dimensional 'omics data ...... 166

Identifying chromatin interactions ...... 168

Identifying TF modules from three-dimensional 'omics data ...... 170

A pathway for the identification of three-dimensional TF modules ...... 173

Appendix A: BELT manual ...... 176

Appendix B: BALM manual ...... 180

References ...... 190


List of Tables

Table 1 Summary of MBD-seq tags from different elutions...... 49

Table 2 Estimated BALM parameters...... 53

Table 3 PCR primers...... 73

Table 4 Number of loops in amplification regions...... 89

Table 5 Number of enriched regions of epigenetic marks...... 91

Table 6 Formation frequency of interaction between different clusters...... 95

Table 7 Number of binding sites of 45 factors...... 100

Table 8 Percentage of binding site associated with chromatin interaction...... 102

Table 9 Number of tags produced by RNA-seq...... 109

Table 10 Distribution of chromosomal interaction frequency in the . ... 135

Table 11 Chromosomal regions (1 Mb resolution) with highest interaction frequencies (>=30% in control condition; T0)...... 136

Table 12 Functional annotation of genes located in the top 10 cold and the top 10 hot chromosomal interaction regions by using DAVID...... 142

Table 13 Number of strong chromosomal interaction changes between the E2-treated and control conditions...... 151

Table 14 Number of general chromosomal interaction changes between E2-treated and control conditions...... 151


Table 15 Top 10 chromosomal regions (1 Mb resolution) with the most lost interactions and the most gained interactions, respectively...... 153


List of Figures

Figure 1 Experimental techniques to investigate transcription factor binding sites and chromatin interactions...... 5

Figure 2 Experimental and computational challenges in ChIP-seq...... 11

Figure 3 Example of experimental error...... 14

Figure 4 Representative regions of MBD-seq before and after utilizing multiple mapped tags...... 17

Figure 5 Summary of five steps in the BELT algorithm...... 26

Figure 6 Summary of detected peaks on four synthetic data...... 27

Figure 7 A summary of testing different combinations of bin size and threshold on four datasets...... 29

Figure 8 Accuracy of the BELT algorithm...... 31

Figure 9 Comparison of the accuracy of the BELT algorithm with other publicly available software...... 32

Figure 10 A trend analysis elucidates the program's running time on each of the four datasets...... 33

Figure 11 Overview of experimental design and analysis...... 50

Figure 12 Bi-asymmetric-Laplace model...... 51

Figure 13 Tags distribution around transcription factor binding sites...... 52

Figure 14 Correlation of tags density of different salt concentrations...... 56


Figure 15 Correlation of CpG methylation score of different salt concentrations...... 57

Figure 16 Correlation between CpG island methylation score and DNase hypersensitivity in MCF-7 cell line...... 58

Figure 17 Coverage and saturation of MBD-seq experiments...... 59

Figure 18 Coverage and saturation of MBD-seq experiments...... 60

Figure 19 Resolution of BALM...... 63

Figure 20 A comparison of BALM with MACS and QuEST on the result of MBD-seq data in MCF-7 cells...... 64

Figure 21 Comparison of algorithm efficiency in terms of execution time...... 65

Figure 22 Clonal bisulfite sequencing validation...... 66

Figure 23 Validation of MBD-seq using bisulfite sequencing technique (ESPN, PLEKHG5, HOXA11, PLAU)...... 67

Figure 24 Validation of MBD-seq using bisulfite sequencing technique (MC5R, PIK3C3, KIAA0427, C18ORF24)...... 67

Figure 25 Validation of MBD-seq using bisulfite sequencing technique (PFKL, ARVCF, SBF1)...... 68

Figure 26 Comparison between GRAI of MCF-7 cell line based on the MBD-Seq input data and the result of end-sequencing profiling technique developed by Volik et al...... 71

Figure 27 Strategy for analyzing Hi-C data...... 85

Figure 28 Hi-C analysis and genomic interactions in K562 cells...... 86

Figure 29 Distribution of hybrid fragments from the Hi-C data of K562 and GM06990 cells...... 87


Figure 30 Number of loops in the amplified regions of chr 22...... 90

Figure 31 Epigenetic modification distribution pattern...... 92

Figure 32 Clustering interacting loci...... 93

Figure 33 Epigenetic status of loci 2 of each major cluster...... 94

Figure 34 Relative distance of the interacting loci to a transcription start site...... 96

Figure 35 Epigenetic status of inter-chromosomal interacting loci...... 98

Figure 36 Correlation between CpG island methylation and DNase hypersensitivity in K562 cell line...... 100

Figure 37 Preferential binding pattern of transcription factors (TFs)...... 101

Figure 38 Interacting loci are bound by a network of transcription factors...... 103

Figure 39 Gene expression analyses reveal two types of chromatin linkages...... 106

Figure 40 Canonical pathways associated with Type II chromatin hubs...... 107

Figure 41 Relationship between different clusters and repeat types...... 108

Figure 42 Correlation of replicates for RNA-seq in K562 cell line...... 110

Figure 43 Gene expression changes induced by GATA1 and GATA2 knockdown...... 112

Figure 44 GO analysis of genes whose expression is altered by knockdown of GATA1 or GATA2...... 113

Figure 45 Schematic model of GATA-regulated chromatin linkages...... 117

Figure 46 Flow chart of data processing...... 119

Figure 47 Distribution of the digested DNA fragment lengths...... 126

Figure 48 Determination of self-ligation cutoff distance...... 127

Figure 49 Chromosomal interaction hotspots - chr3...... 138


Figure 50 Chromosomal interaction hotspots - chr17...... 139

Figure 51 Chromosomal interaction hotspots – chr20...... 140

Figure 52 3C-PCR confirms that the de novo looping formations were observed in THRAP1 and C16orf65 loci upon 1 hr E2 treatment...... 141

Figure 53 A heat map showing the expression levels for 69 genes in Top 10 hot regions after E2-treatment...... 143

Figure 54 Correlation between histone modification and chromosomal interaction frequency (1Mb resolution) for top 10 hot and cold regions...... 146

Figure 55 Correlation coefficient matrices for epigenetic markers and chromosomal interaction frequency...... 148

Figure 56 Distribution of relative ratios of interaction changes...... 150

Figure 57 Number of gain (positive value) and loss (negative value) interactions after E2 treatment was counted for every region (1Mb resolution) in the human genome based on the four types of the strongest chromosomal interactions...... 152

Figure 58 A heat map of change of histone modifications between E2-treated and control conditions for four types of chromosomal interactions...... 154

Figure 59 A heat map of dynamic change of histone modifications between control and E2 treated experiments for four types of chromosomal interactions...... 155

Figure 60 Time-course gene expression profiles after E2 treatment for genes included in the top 10 most frequent interaction changes after E2 treatment...... 157

Figure 61 A pathway to identify one-dimensional and three-dimensional TF modules...... 174


Chapter 1: Introduction

Lan, X., Farnham, P.J. and Jin, V.X. Uncovering transcription factor modules using one-dimensional and three-dimensional analyses. J Biol Chem, Accepted.


Transcriptional regulation is a critical step in the transmission of information from genotype (i.e. DNA) to phenotype (e.g. expression of genes and noncoding or micro RNAs). A large number of proteins, namely transcription factors (TFs), play an essential role in gene regulatory networks by binding to short DNA sequences called cis-regulatory elements (CREs), which include promoters, enhancers, silencers and insulators (1). In many cases, multiple TFs function as a regulatory complex (hereafter referred to as a TF module) at a CRE. The concept of cooperative regulation by TFs bound near each other in the genome has been recognized for decades. For example, the enhancer located upstream of the interferon-β gene (IFNB1) has served as a wonderful example of cooperative regulation (2,3). Another example of TF modules is the regulatory region upstream of the Drosophila gene eve. Expression of the eve gene is modulated by different combinations of multiple TFs that bind to the upstream CRE (4). Although the identification of modules composed of TFs bound to adjacent genomic sites is increasing, due in part to ChIP-seq analysis of a large set of TFs by the ENCODE Consortium (Integrative Analysis of the Human Genome, The ENCODE Consortium, submitted), it has recently become clear that cooperative regulation can be achieved by means other than the interaction of TFs bound next to each other on the genome. TFs can also be brought into close spatial proximity with other TFs bound to a regulatory element located a great distance away via the three-dimensional conformation of a chromosome. In fact, three-dimensional genomic organization, which brings together two distant loci, has been shown to be involved in both gene regulation and nuclear compartmentalization (5-8).

The regulatory effects of a TF bound to a CRE can be either active or repressive, often switching from one to the other depending on other interacting factors. Therefore, a detailed understanding of the association of TFs with other TFs bound at adjacent or distal sites is required to comprehend the complex molecular mechanisms involved in transcriptional regulation of the genome. Also, elucidating cell type-specific TF modules may help to understand the mechanisms driving cell differentiation and disease progression. New experimental techniques facilitated by high throughput sequencing allow investigators to more globally address questions concerning the relationship between three-dimensional chromatin organization and transcription factor modules.

However, it is a remarkably complex task to extend the analysis of TF modules from a one-dimensional to a three-dimensional scale, requiring tremendous efforts from both experimental and computational biologists, as well as effective communication and collaboration among these specialists.

Although a variety of methods have been developed to investigate TF binding throughout the genome (9), the technique of Chromatin ImmunoPrecipitation (ChIP) is the most common. This technique, which was developed during the 1980s and 1990s, has been modified extensively for the analysis of site-specific factors and histones (10-17). The steps in a ChIP experiment include: 1) crosslinking TFs to the genome; 2) shearing DNA (usually by sonication) to fragments ranging from 100 bp to 500 bp in length; 3) enriching for TF-DNA complexes using target TF-specific antibodies; 4) removing proteins by reversing the crosslinks; and 5) purifying the enriched DNA fragments for further analyses (Figure 1A). When ChIP was first developed, a Polymerase Chain Reaction (PCR) assay would be performed to determine if the TF bound to a specific genomic position. Although this assay is still used to study single loci, the sequencing of the human genome (18-20) and the development of high throughput technologies (21) have enabled genome-wide profiling of TF binding sites. ChIP-chip, a high throughput technique that hybridizes the enriched DNA fragments to microarrays (22-25), was first used to survey TF binding sites genome-wide. ChIP-seq (26-31), a new technology that combines ChIP and massively parallel sequencing (on platforms such as the Illumina Genome Analyzer and HiSeq machines, the ABI SOLiD, and the Roche 454), was first developed in 2007 and has rapidly proved to have several advantages, such as complete coverage of the unique portions of the genome along with high resolution and sensitivity.

Although the aforementioned techniques provide a detailed map of TF binding sites, they are not sufficient to identify TF modules coordinated by higher order chromatin organization. In the past, co-localization methods such as Fluorescent In Situ Hybridization (FISH) have been used to investigate three-dimensional chromatin structures (32). However, the higher resolution of the chromosome conformation capture (3C) technique (33) has greatly improved our ability to examine the effects of chromatin conformation on transcriptional regulation. The 3C assay can detect pairs of genomic loci that are in close proximity in the three-dimensional space of the nucleus. In a 3C experiment, formaldehyde is used to crosslink non-adjacent regions of chromatin that are spatially close. The DNA is then digested with a restriction enzyme and the fragments within the crosslinked complexes are joined by ligation. This is followed by crosslink reversal and PCR using primers specific for two different genomic regions. A high signal for the hybrid DNA sequence indicates a high ligation rate between the two genomic loci, which is likely produced by their close proximity and high interaction frequency (Figure 1B).

Figure 1 Experimental techniques to investigate transcription factor binding sites and chromatin interactions.

Several high throughput variations on the 3C assay have been developed that allow larger scale screening of chromatin interactions. For example, chromosome conformation capture-on-chip (4C) (34,35) detects many genomic regions interacting with one particular locus using a microarray containing a set of specifically designed probes. After the crosslink reversal step of 3C, a second restriction enzyme digestion step is performed to shorten the hybrid fragments; the small hybrid fragments are then circularized and subjected to inverse PCR. To identify the interacting regions for the locus of interest (which is called the bait), specific primers within the bait region of the circularized hybrid fragment are designed such that they face the portion of the circularized fragment that is derived from the interacting region. After amplification, the products of inverse PCR are hybridized to the custom microarray. The major obstacle to wide application of the 4C assay is that it can only detect regions interacting with one chosen genomic locus per experiment. Another 3C-based large scale DNA interaction profiling method is chromosome conformation capture carbon copy (5C) (36). Similar to 4C, 5C allows detection of many potential chromatin interactions, but a multiplex ligation-mediated amplification (LMA) step distinguishes it from 4C. The universal primers of LMA are designed to fit near the restriction enzyme cutting sites and have a specific orientation so that the amplification products in the 5C library theoretically contain the ligation junctions of all the hybrid fragments in the 3C library. The 5C library is analyzed using microarrays or next generation sequencing. Both 4C and 5C involve multiplex primer or probe designs, which substantially increase the cost and decrease the applicability of the assay. Two newly developed next generation sequencing-based techniques, ChIA-PET (37,38) and Hi-C (39), are more suitable for unbiased identification of chromatin interactions across the entire genome. ChIA-PET incorporates enrichment of crosslinked complexes that contain a target protein using antibody-based immunoprecipitation before crosslink reversal; thus, it has mainly been used to interrogate interactomes of a specific TF, such as the Estrogen Receptor α (ERα) (37) or CCCTC-binding factor (CTCF) (38). Hi-C, on the other hand, uses biotin labeling of the ligation sites, which are then purified using avidin. In contrast to ChIA-PET, Hi-C can map the location of all possible interacting loci in the genome in an unbiased manner, but does not provide information as to which TFs are involved in formation of the different interactions. However, by combining other experimental and computational assays with Hi-C (as described in Chapter 5), one can link TFs to sets of interacting loci, allowing the identification of three-dimensional TF modules.


Chapter 2: Experimental and computational challenges from array-based to sequence-based ChIP techniques

Lan, X. and Jin, V.X. Experimental and computational challenges from array-based to sequence-based ChIP techniques. Current Bioinformatics, Review, Accepted.


INTRODUCTION

Transcription factors (TFs) constitute a large portion of the proteins encoded by the human genome, numbering ~3,000. Of these, ~1,400 are considered to be sequence-specific DNA binding factors. These TFs play an essential role in gene regulatory networks by binding to cis-regulatory elements (CREs) on the DNA, such as promoters, enhancers, silencers, and insulators. CREs contain short DNA sequences, namely motifs, which can be specifically recognized by certain TFs. There is growing evidence that the accessibility of these CREs is determined by the status of the chromatin, which is linked to DNA methylation and histone modifications (40-45). The expression of a set of downstream target genes is orchestrated by distinct binding patterns of TFs at CREs in different cell types, or in the same cell type under different conditions. Understanding cell type-specific chromatin states and TF binding may shed light on the mechanisms of cell differentiation in normal development and disease progression.

The most widely applied technique to survey in vivo TF binding sites on DNA is Chromatin ImmunoPrecipitation (ChIP). The completion of the human genome sequencing project (18-20) and the advent of many high-throughput technologies (21) have allowed us to globally profile many transcription factors, DNA methylation and histone modifications, and to elucidate the transcriptional regulation mechanisms in living cells. ChIP-based technology has evolved rapidly in recent years, from hybridization with spotted or tiling microarrays (ChIP-chip) (22-25), to SAGE-like tags (ChIP-SAGE) (42) or paired-end tag sequencing (ChIP-PET) (46), to current massively parallel sequencing (ChIP-seq) (26-31). The new ChIP-seq technology (Helicos HeliScope, Illumina Genome Analyzer, ABI SOLiD, Roche 454, Illumina HiSeq) has clearly demonstrated several advantages over previous methods, such as covering the whole genome, providing the highest resolution and mapping accuracy, greatly reducing the cost per experiment of profiling a particular TF genome-wide, and offering fast, parallel sequencing.

Despite the rapid growth of this field, scientists still face major challenges in optimizing experimental conditions and interpreting large datasets (Figure 2). Experimental challenges include copy number variation in aberrant genomes (for example, in cancer cells), determination of the PCR cycle number during Whole Genome Amplification (WGA), differences in PCR and sequencing efficiency caused by GC content variation, and the specificity of the antibody. Computational challenges, which mainly reside in ChIP-seq data analysis, include harnessing the multiply mapped tags derived from an experiment, identifying peaks with a low False Discovery Rate (FDR), and estimating the saturating sequencing depth for each factor. In this review, we focus on issues that have emerged in high throughput data processing and discuss the advantages and disadvantages of various strategies.


Figure 2 Experimental and computational challenges in ChIP-seq.

FROM CHIP-CHIP TO CHIP-SEQ

ChIP-chip was first published in 1999 and dominated genome-wide profiling of Transcription Factor Binding Sites (TFBSs) during the early 2000s (24,47). Instead of running numerous PCRs after a ChIP experiment, in a ChIP-chip experiment the DNA enriched by ChIP is hybridized to a microarray carrying from thousands to millions of probes. Genomic sites whose hybridization signal is enriched relative to a background are detected as TFBSs. Two major types of arrays have been commercialized for TFBS identification: promoter arrays and whole genome arrays (48,49). The early arrays focused mainly on gene promoter regions, since it was believed that TFs predominantly bind to the promoter region of a gene to regulate its expression. A promoter array includes probes for the promoters of all or part of the annotated genes in a given genome; for example, the NimbleGen Human promoter array contains ~26,000 promoters ranging from 1 to 5 kb in length. As the recognition of non-promoter-binding transcription factors grew, promoter arrays could no longer fulfill the task of detecting TFBSs across the whole genome in an unbiased manner. Whole genome tiling arrays were designed to resolve this issue by placing probes every 100 bp throughout the human genome. As microarray technology improved, the probe density increased from 380,000 per array to 2,100,000 per array; accordingly, the total number of slides required for complete genome coverage decreased from 38 to 10. Although the number of required slides decreased substantially, few labs can afford to run a ChIP-chip experiment with complete genome coverage.

Combined with Next Generation Sequencing (NGS), ChIP-seq has provided more promising genome-wide profiling capability since it was first introduced in 2007 (27,50,51). The enriched DNA from a ChIP experiment is size selected and a library is built via Whole Genome Amplification (WGA), which involves random priming PCR of up to 16 cycles. This procedure is designed to meet the large amount of DNA required by the sequencing step of a ChIP-seq experiment. In the past 4 years, the capacity of NGS machines has increased from 10 million uniquely mapped reads per flow cell to over 1 billion.

ChIP-seq is superior to ChIP-chip in several respects (27,50,51). Firstly, it covers the whole genome without bias when sequencing depth is high, whereas the coverage of an array depends on its type: an array can only cover the regions for which it was originally designed. Secondly, the resolution of ChIP-seq can reach 20 bp when combined with analysis tools (BALM), while the resolution of ChIP-chip ranges from 30 bp to 100 bp depending on the array design. Thirdly, the cost of a single ChIP-seq experiment has fallen significantly since the technology was developed. With the new Illumina HiSeq machine, a single run (~$7000) is sufficient to profile the genome-wide binding patterns of ~50 transcription factors (50 * 20 million = 1 billion reads) if bar-coded tags are used, whereas one set of 10 tiling array slides costs around $5000. Last but not least, the amount of DNA required for ChIP-seq is relatively low (~10-50 ng), especially for the new HiSeq machine, compared to ChIP-chip (~2 µg). Despite these advantages of ChIP-seq, ChIP-chip is still useful in that an array can be designed to focus on only a few genes or regions of interest, thus reducing the cost when genome-wide profiling is not necessary. In addition, the huge volume of data produced by ChIP-seq experiments places high demands on data storage, transfer and analysis. In contrast, ChIP-chip data are much smaller and easier to handle.

EXPERIMENTAL CHALLENGES OF CHIP-SEQ

Optimization of PCR cycle number during Whole Genome Amplification (WGA)

Next Generation Sequencing (NGS) machines require a large amount of precipitated DNA from a ChIP experiment. To overcome the insufficiency in enriched DNA, WGA is usually applied to artificially increase the DNA quantity. However, over-amplification can introduce bias in the PCR products. Since random priming is not perfectly random, some fragments are primed, and thus amplified, more easily than others. The more cycles are run, the more bias accumulates in the PCR products. As a result, many of the final sequence tags have exactly the same sequence and are mapped to the same location on the reference genome (Figure 3A). Fortunately, many data analysis software tools (17,52,53) take this bias into account and remove redundant amplification products before further analysis. However, this type of downstream filtering cannot increase the data quality; it only lowers the background. It is therefore essential to optimize the cycle number to improve the efficiency of ChIP-seq experiments. One protocol, developed in O'Geen et al. (11), is to perform qPCR on the ChIP-seq library using both target and negative control primer pairs. An input library derived from the 10% input sample serves as a control to normalize the qPCR data and determine the relative enrichment of a given target.
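The downstream filtering of redundant amplification products mentioned above can be illustrated with a minimal sketch. It is not the procedure of any of the cited tools: the representation of a tag as a (chromosome, position, strand) tuple and the `max_copies` parameter are assumptions made for this example.

```python
from collections import Counter


def remove_duplicate_tags(tags, max_copies=1):
    """Keep at most `max_copies` tags per (chrom, pos, strand) location.

    `tags` is a list of hypothetical (chrom, pos, strand) tuples; the input
    order of the retained tags is preserved.
    """
    seen = Counter()
    kept = []
    for tag in tags:
        if seen[tag] < max_copies:
            kept.append(tag)
        seen[tag] += 1
    return kept


def duplication_rate(tags):
    """Fraction of tags that are redundant copies of an earlier tag."""
    if not tags:
        return 0.0
    return 1.0 - len(set(tags)) / len(tags)
```

A high duplication rate in a library may be a sign that too many amplification cycles were used, although the filter itself only removes the redundant tags and does not recover lost complexity.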

Figure 3 Example of experimental error.


Specificity of the antibody

Antibody quality is critical to the success of a ChIP-seq experiment (Figure 3B).

There are a few approaches to validate the specificity of a particular antibody. First, a simple western blot showing a band of the correct size is an indication of a specific antibody. Second, IP-western can determine whether the antibody indeed immunoprecipitates the target protein. Third, the antibody can be considered to work if it yields a binding pattern similar to that of a publicly validated antibody. Finally, siRNA knockdown followed by a ChIP experiment can serve as an indirect method of validation: a reduction in ChIP signal is expected for an antibody with high specificity.

COMPUTATIONAL CHALLENGES

Copy number variation in aberrant genomes

Extensive efforts have been made to elucidate transcriptional deregulation and aberrant epigenetic events in cancer development, and a large amount of high-throughput data derived from cancer model cell lines has been produced by many labs across the country. One of the major projects producing ChIP-seq data is the ENCODE project, which focuses on a few normal and cancer model cell lines, the majority of whose genomes are disrupted, for example MCF-7, LNCaP and K562 (54,55). Copy number variation in these genomes can produce false positive targets that compromise the accuracy of data analysis tools. In a recent study, we developed a software tool named BALM (chapter 4) (56), in which a genome region amplification index (GRAI) is applied to eliminate the effect of copy number variation. It sets a higher threshold for identifying enriched sites in amplified genomic regions to better control the False Discovery Rate (FDR). The effectiveness of the GRAI was tested on genomic regions on the q arm of chromosome 20 in the MCF-7 cell line, which were reported to be highly amplified. It was demonstrated that the GRAI precisely captured the copy number changes in the disrupted genome.

Utilization of multiple mapped tags

Sequenced tags from a ChIP-seq experiment are mapped to the reference genome using short tag alignment tools such as BWA (57), Bowtie (58) and others (59,60). However, since the genomes of higher organisms are exceedingly repetitive, in a typical ChIP-seq experiment a large portion of the tags (~30%) can be mapped to multiple locations on the reference genome. These tags are generally discarded by many analysis tools, including MACS, QuEST, BELT and others (17,52,53). Effectively utilizing these multiple mapped tags is very important for increasing the sequencing depth and for detecting interactions between the target protein and DNA more accurately, which in turn leads to more precise and significant biological interpretation (Figure 4). A computational tool being developed in our laboratory (Wang et al, unpublished), LOcating Non-Unique matched Tags (LONUT), is intended to improve the detection of enriched regions in ChIP-seq and MBD-seq data. The LONUT algorithm uses two parameters (the distance of a tag to the closest peak, and the score of that peak) to select one mapped location from among the multiple matched locations of a tag. The recovered tags were validated by de novo motif discovery, CpG methylation scores, ChIP-qPCR and pyrosequencing.
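The idea of choosing one location for a multiple mapped tag from the two parameters named above (distance to the closest peak and the score of that peak) can be sketched as follows. The way the two parameters are combined here (`score / (1 + distance)`), the `max_distance` cutoff and all function names are assumptions for illustration only; the text does not specify the actual LONUT decision rule.

```python
def assign_multimapped_tag(candidates, peaks, max_distance=200):
    """Pick one candidate location for a tag that maps to several places.

    candidates: list of (chrom, pos) alignments of the same tag.
    peaks: list of (chrom, summit, score) called from uniquely mapped tags.
    Returns the chosen (chrom, pos), or None if no candidate lies within
    `max_distance` of a peak.
    """
    best, best_weight = None, 0.0
    for chrom, pos in candidates:
        nearby = [(abs(pos - summit), score)
                  for c, summit, score in peaks if c == chrom]
        if not nearby:
            continue
        # Closest peak on the same chromosome.
        dist, score = min(nearby, key=lambda item: item[0])
        if dist > max_distance:
            continue
        # Hypothetical weighting: favor high-scoring, nearby peaks.
        weight = score / (1.0 + dist)
        if weight > best_weight:
            best, best_weight = (chrom, pos), weight
    return best
```

Tags that are not close to any peak are left unassigned in this sketch, which keeps the recovered tags from inflating background regions.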

Figure 4 Representative regions of MBD-seq before and after utilizing multiple mapped tags.

Peak detection

Currently available computational tools for ChIP-seq data are reviewed in Park (2009) (50). A common approach is to shift sequence tags toward the binding site by a certain number of base pairs and then locate the binding site by finding the summit within a peak region, assuming a Poisson or binomial distribution for the tags. This approach has proved effective and is widely used in the ChIP-seq user community. Tools of this kind include MACS (52), SISSRs (61) and QuEST (17). A recent approach used a mixture model, which showed advantages over several widely used programs in its ability to separate closely positioned peaks (62). Despite the accuracy and efficiency of these tools, few of them are available as easily accessible online web tools for processing ChIP-seq data. In our laboratory, we have recently developed a comprehensive web application for processing ChIP-seq data, called W-ChIPeaks (53). W-ChIPeaks employs a probe-based (or bin-based) enrichment threshold to define peaks and applies statistical methods to control the false discovery rate of identified peaks. The web tool includes one web interface for ChIP-seq, wBELT (bin-based enrichment threshold level), which was tested on previously published experimental data. The novel features of our web tool include comprehensive output of identified peaks in different formats (GFF, BED and wig files), annotation of the genes to which these peaks are related, graphical interpretation and visualization of the results, and a user-friendly web interface.
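The common strategy described at the start of this section (shifting tags toward the binding site, then testing enrichment under a Poisson assumption) can be sketched as below. This is a generic illustration and not the implementation of MACS, SISSRs or QuEST; the uniform genome-wide background rate and the function names are assumptions.

```python
import math


def shift_tags(tags, shift):
    """Move each tag toward the presumed binding site.

    tags: list of (pos, strand); '+' tags move downstream, '-' tags upstream.
    """
    return [pos + shift if strand == "+" else pos - shift
            for pos, strand in tags]


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed by direct summation."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)          # P(X = 0)
    cdf = term
    for i in range(1, k):          # accumulate P(X = 1) .. P(X = k - 1)
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def window_pvalue(positions, start, end, genome_size):
    """Enrichment p-value of a window under a uniform Poisson background."""
    lam = len(positions) * (end - start) / genome_size
    k = sum(1 for p in positions if start <= p < end)
    return poisson_sf(k, lam)
```

Direct summation is adequate for the small counts in this example; for very large expected counts a numerically safer routine would be preferable.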

Maximize the resolution

The accuracy of an algorithm is crucial for further processing and interpretation of the predicted target sites. Accurate localization of the target site may provide a more precise prediction of its relationship to a point of interest such as the transcription start site of a gene, a CpG island or an exon-intron boundary. By maximizing the log likelihood of the tags in a given peak region, the position of TFBSs can be estimated with high accuracy. Unlike TFBSs, epigenetic marks such as DNA methylation are highly abundant in the genome and are densely distributed. Although single-base resolution of DNA methylation profiling can be achieved using MethylC-seq (63,64) or BS-seq (65), the high cost, especially with multiple samples, hinders wide adoption of these types of experiments (66). An economical and straightforward approach is Methyl-CpG binding domain protein sequencing (MBD-seq, a variation of ChIP-seq used to profile DNA methylation genome-wide) (67), which utilizes the high affinity of MBD protein for methylated DNA. Low-resolution prediction of DNA methylation sites from MBD-seq data, caused by the lack of an appropriate analysis tool, may result in misinterpretation of the function of certain CpG islands. A recently developed program is capable of distinguishing two closely positioned CpG dinucleotides by applying the EM algorithm to approximate a bi-asymmetric Laplace mixture model (BALM) (chapter 4). Hence, the program can provide higher resolution genome-wide DNA methylation prediction from MBD-seq experiments (68).

Sequencing depth and coverage

How many tags are needed for an accurate profile of the target protein or modification? This question has been haunting the sequencing community since ChIP-seq was developed. Sequencing depth directly affects the sensitivity of a ChIP-seq experiment. Under the same experimental conditions, higher sensitivity can be reached by increasing the number of tags sequenced. It was estimated that 12 million uniquely mapped reads are sufficient to survey a transcription factor's binding sites in the human genome. This estimate may not generalize to every type of ChIP-seq data, since the distribution of every transcription factor and epigenetic modification is different and the properties of the capture protein (usually an antibody) also vary greatly. A high affinity antibody requires fewer sequenced tags than a low affinity antibody to identify target protein-DNA interactions, because the number of background noise tags is reduced by the improved specificity. In our recent study (68), we showed that Methyl-CpG binding domain protein sequencing (MBD-seq) requires ~100 million sequence tags to reach the saturation point of whole genome coverage in the MCF-7 breast cancer cell line. A typical transcription factor has from ~1,000 to ~20,000 binding sites genome-wide; however, there are ~28,000,000 possible methylation sites in the human genome, and it has been shown that the majority (>90%) of these sites are heavily methylated in differentiated cells (64). Thus, much higher depth is required for MBD-seq than for regular TF ChIP-seq to achieve complete genome-wide detection. Different histone modifications also show distinct patterns across the genome (40-44). For example, similar to DNA methylation, the repressive mark H3K27 tri-methylation shows broad peaks in heterochromatin regions, usually covering millions of bps with multiple genes embedded. In contrast, H3K4 methylation shows sharp, transcription factor-like peaks at cis-regulatory elements, usually promoters and enhancers (69). Careful design of the number of tags for a sequencing experiment should consider both the prevalence of the target protein-DNA interaction and the sensitivity and specificity of the capture protein.
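One generic way to examine whether a dataset has reached its saturation point is to subsample the tags at increasing depths and count how many sites are still covered. The sketch below assumes this subsampling approach; it is not the procedure used in (68), and the `min_tags` coverage criterion and the representation of each tag by a site identifier are assumptions.

```python
import random
from collections import Counter


def covered_sites(tags, min_tags=2):
    """Number of distinct sites supported by at least `min_tags` tags."""
    counts = Counter(tags)
    return sum(1 for c in counts.values() if c >= min_tags)


def saturation_curve(tags, fractions=(0.25, 0.5, 0.75, 1.0),
                     min_tags=2, seed=0):
    """Return (depth, covered sites) pairs for random subsamples of `tags`.

    tags: list of site identifiers (for example bin indices), one per tag.
    The curve flattens once additional tags stop revealing new sites.
    """
    rng = random.Random(seed)
    curve = []
    for fraction in fractions:
        n = int(round(fraction * len(tags)))
        sample = rng.sample(tags, n)   # sampling without replacement
        curve.append((n, covered_sites(sample, min_tags)))
    return curve
```

If the number of covered sites is still rising steeply at the full depth, the experiment has probably not reached saturation for that factor or modification.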

Data normalization


This issue arises when comparing data across different cell types or cells under different conditions. When control data are available, or when multiple samples are being compared, a global normalization is needed because the sequencing depths of different samples are rarely at the same level. How can the data be normalized effectively to enable unbiased comparison among samples? In a recent paper, we applied robust linear normalization to address this issue. In a typical transcription factor ChIP-seq experiment, although peak regions possess a high enrichment level of tags, they account for only a small portion of the whole genome. The majority of the tags in a ChIP-seq dataset are nonspecific background noise, and the genome-wide distributions of background noise in different samples usually correlate well with each other (r >= 0.8). Thus, we assume that, despite the distinct patterns of tag enrichment at binding regions, the majority of the genome should have similar tag enrichment. Robust linear regression is well suited to this case, since it reduces the effect of outliers (peaks at the binding sites), in contrast with least squares regression, which is sensitive to outliers (70-72).
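The effect of a robust fit on bin-level counts can be illustrated with a small sketch. The text does not state which robust estimator was used, so the Theil-Sen estimator (median of pairwise slopes) is substituted here as one simple, well-known choice; the function names and the use of per-bin tag counts as input are assumptions.

```python
from itertools import combinations
from statistics import median


def theil_sen(x, y):
    """Robust line fit y ~ slope * x + intercept (Theil-Sen estimator).

    Quadratic in the number of points, so suitable only for illustration
    or for a subsample of bins.
    """
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)
              if x[j] != x[i]]
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


def normalize_to_reference(sample_counts, reference_counts):
    """Rescale sample bin counts onto the scale of the reference sample."""
    slope, intercept = theil_sen(reference_counts, sample_counts)
    return [(c - intercept) / slope for c in sample_counts]
```

Because the slope is a median over all pairs of bins, a handful of strongly enriched bins (the peaks) has little influence on the fitted scaling, whereas a least squares fit would be pulled toward them.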

CONCLUSIONS AND PERSPECTIVES

It is now clear that ChIP-seq is becoming a dominant experimental technique for profiling in vivo transcription factor binding sites, and the discovery of these sites is a further step toward understanding the transcriptional network that regulates gene expression and function. ChIP-seq offers not only higher resolution but also cleaner data at a lower cost than array-based ChIP-chip at a genome-wide scale. Although many issues exist in the ChIP-seq experimental protocol, computational tools are now able to address many challenges, such as data normalization, sequencing depth and multiple matched tags. A more challenging question now is how to integrate distinct types of data to gain deeper insight into regulatory mechanisms. Future studies in this field may focus on developing computational approaches for integration with other high-throughput data types ('omics). 1) Integration of RNA-seq data. For example, integrating a given site-specific TF's ChIP-seq data with RNA-seq data from the same cell may elucidate the gene regulatory networks directed by that TF. 2) Integration of histone modification ChIP-seq, MBD-seq or BS-seq (two technologies used to map the DNA methylome), and RNA-seq data, which may reveal the interplay between the transcriptome and the epigenome. 3) Integration of Hi-C data, a new technique for investigating the chromatin interactome. This may help decipher long-range inter- and intrachromosomal interactions and clarify how promoter-enhancer and enhancer-enhancer interactions contribute to gene regulation.


Chapter 3: A bin-based enrichment level threshold (BELT) algorithm to analyze ChIP-seq data

Lan, X., Bonneville, R., Apostolos, J., Wu, W. and Jin, V.X. (2011) W-ChIPeaks: a comprehensive web application tool for processing ChIP-chip and ChIP-seq data. Bioinformatics, 27, 428-430.


INTRODUCTION

The completion of the human genome sequencing project (18-20) and the advent of many high-throughput technologies (21) have allowed us to globally profile thousands of transcription factors and elucidate the transcriptional regulation mechanisms in living cells. ChIP-based technology is a high-throughput technology that has evolved rapidly in recent years, from hybridization with spotted or tiling microarrays (ChIP-chip) (22-25), to SAGE-like tags (ChIP-SAGE) (42) or paired-end tag sequencing (ChIP-PET) (46), to current massively parallel sequencing (ChIP-seq) (26-30). The new ChIP-seq technology (Helicos HeliScope, Illumina Genome Analyzer, ABI SOLiD, Roche 454) has clearly demonstrated several advantages over previous methods, such as whole-genome coverage, higher resolution and mapping accuracy, greatly reduced cost per run, and faster parallel sequencing.

With the availability of large biological datasets derived from numerous ChIP-seq experiments (26-31,73), one major challenge for biologists using this technology is the high computational burden of mapping raw reads, effectively visualizing data, detecting signal-enriched regions and precisely locating target sites. Several peak identification programs were designed to control a variety of systematic errors that arise in DNA amplification, sequencing, image processing and read mapping (17,52,61,74-79). However, in addition to these well-addressed systematic biases, issues such as unequal sequencing depth among samples and large region amplification in aberrant genomes still concern biologists. Meanwhile, besides protein-DNA interactions, ChIP-seq technology has been applied to identify open chromatin, DNA methylation, histone modifications and so on. In these latter applications, a wide peak region is not sufficient to help biologists interpret their data, and there is an emerging demand to locate the target site precisely.

Here we present new software, a bin-based enrichment level threshold (BELT), for analyzing ChIP-seq data (Figure 5). The algorithm starts by defining a series of bin sizes and counting the density of reads, then calculates an average fragment length for a ChIP sample by considering the direction of reads, decodes the fragment positions by shifting reads, and employs a percentile rank statistic to determine significant enrichment level thresholds. During this process, BELT further defines enriched regions (peaks) and locates the target site by taking the average of the fragments' positions, and finally employs Monte Carlo simulation of ChIP data based on signal-to-noise ratio levels to estimate the false discovery rates of identified peaks. The performance of our algorithm was assessed on synthetic data and on ChIP-seq data for four well-characterized human transcription factors: insulator protein CTCF (CCCTC-binding factor) (26), ERα (80), FOXA1 (52) and NRSF (neuron-restrictive silencer factor, also known as REST, for repressor element-1 silencing transcription factor) (17).

Figure 5 Summary of five steps in the BELT algorithm.

RESULTS

Evaluation of BELT by synthetic data

To evaluate our algorithm, we tested it on synthetic ChIP-seq data in which various numbers of peaks were simulated in each dataset. Testing on synthetic data provides insight into our algorithm and allows us to obtain a set of optimized parameters, especially since we know the exact position of each simulated target site. Four synthetic datasets, containing 5,779, 11,402, 15,725 and 22,876 peaks respectively, were generated using an integrated simulator. The synthetic data were generated based on signal-to-noise ratio levels from the real data (see MATERIALS AND METHODS). This procedure simulated distributions of millions of reads similar to those observed in the real data. We first wanted to test how the bin size would affect peak detection; thus, various bin sizes from 25 to 150 bp were chosen for this purpose (Figure 6).

Figure 6 Summary of detected peaks on four synthetic data.

At bin sizes above 50 bp, 95% of peaks were identified by BELT in all four datasets, regardless of the number of peaks in the data. The fewest peaks were detected at the smallest bin size of 25 bp (Figure 6A). This may be because this bin size is shorter than the sequencing reads (usually 27 to 75 bp from a typical Solexa/Illumina sequencing machine).


Next, we tested the accuracy of the detected peaks by checking whether they were located at the positions where they had been generated. The accuracy of an algorithm is crucial for further processing of datasets after peaks have been defined. Precise localization of a target site, such as a binding motif or epigenetic modification, may provide better insight into the relationship between this target site and a point of interest such as a gene.

We define accuracy as a measurement of the average distance between the recovered peaks and the designed target site. Therefore, we plotted the distance between each recovered peak vs. its corresponding designed target site (Figure 6B). We found that the majority of the recovered peaks are 30 bases or closer to the designed target site with the threshold of 0.999. BELT reaches the best performance when the bin size is 50 bp. Both smaller and larger bin sizes cause a decrease in accuracy. Meanwhile, a smaller bin size leads to less specificity and larger bin size leads to less sensitivity. A similar result was obtained when we tested the recovered peaks with the threshold of 0.9995.
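The accuracy measure defined above, the average distance between recovered peaks and designed target sites, can be computed as in the following sketch. The function names are hypothetical, positions are assumed to lie on a single chromosome, and the 30-base cutoff is taken from the text only as an example value.

```python
from bisect import bisect_left


def nearest_distance(pos, sorted_targets):
    """Distance from `pos` to the closest value in a sorted, non-empty list."""
    i = bisect_left(sorted_targets, pos)
    neighbours = sorted_targets[max(0, i - 1):i + 1]
    return min(abs(pos - t) for t in neighbours)


def localization_accuracy(peaks, targets, max_dist=30):
    """Return (mean distance, fraction of peaks within `max_dist`).

    peaks: predicted target positions; targets: designed target positions.
    """
    ordered = sorted(targets)
    distances = [nearest_distance(p, ordered) for p in peaks]
    mean_distance = sum(distances) / len(distances)
    within = sum(1 for d in distances if d <= max_dist) / len(distances)
    return mean_distance, within
```

Reporting both the mean distance and the fraction of peaks within a fixed window guards against a few distant peaks dominating the summary.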

Performance of BELT on real ChIP-seq data

After successfully assessing BELT on synthetic data, we evaluated the performance of the program on published ChIP-seq data. Four ChIP-seq datasets of well-characterized transcription factors, CTCF, ER, FOXA1 and NRSF, were used in our study. We chose these datasets because they vary widely in sequencing depth (~2.94 million reads for CTCF, ~3.91 million reads for FOXA1, ~8.31 million reads for ER and ~8.81 million reads for NRSF), since we wanted to examine how the depth of sequencing and noise levels affect the performance of the program. The program first estimated the average fragment length (L) for CTCF, FOXA1, ER and NRSF to be 91 bp, 137 bp, 99 bp and 114 bp, respectively. All reads were then shifted toward their 3' ends by half of L. Different combinations of two important parameters, threshold and bin size, were tested (Figure 7).
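A minimal sketch of estimating the average fragment length from read direction and then shifting reads by half of L is given below. The text does not describe how BELT computes L, so the strand-offset search used here (the offset at which forward-strand and reverse-strand 5' ends coincide most often) is an assumption, as are the function names.

```python
from collections import Counter


def estimate_fragment_length(plus_starts, minus_starts, max_shift=300):
    """Offset (bp) that best aligns '+' read 5' ends with '-' read 5' ends.

    plus_starts / minus_starts: 5' end coordinates of reads on each strand.
    """
    plus = Counter(plus_starts)
    minus = Counter(minus_starts)
    best_shift, best_score = 0, -1
    for shift in range(max_shift + 1):
        score = sum(n * minus.get(pos + shift, 0) for pos, n in plus.items())
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift


def shift_reads(plus_starts, minus_starts, fragment_length):
    """Shift every read toward its 3' end by half the fragment length."""
    half = fragment_length // 2
    shifted = [p + half for p in plus_starts]
    shifted += [m - half for m in minus_starts]
    return sorted(shifted)
```

After shifting, reads from both strands of the same fragment population pile up at the fragment centers, which is where the target site is expected to lie.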

Figure 7 A summary of testing different combinations of bin size and threshold on four datasets.

In all four datasets, the FDRs at all bin sizes (25 bp to 150 bp) and at thresholds greater than 0.999 are less than 1%. At the threshold of 0.995, bin sizes greater than 100 bp give FDRs of less than 10% for all datasets except FOXA1. We also found that the FDRs of ER and NRSF were generally much lower than those of CTCF and FOXA1. These results strongly indicate that the depth of sequencing (represented by the number of uniquely mapped reads) is a significant factor affecting peak detection. The results also suggest that a threshold of 0.999 combined with a bin size of 50 bp or larger is a possible optimal parameter setting for any given dataset.

Validation by de novo motif discovery

One way to validate the specificity of BELT is to perform de novo motif discovery on the identified peaks. This examines whether the identified peaks are indeed true binding sites for the respective TFs. We applied ChIPMotifs, developed in our laboratory (81,82), to these four TFs. We tested a set of high-stringency peaks for each TF: 1046 for CTCF, 1213 for FOXA1, 1962 for ER and 1870 for NRSF. A 400 bp sequence centered on the estimated binding site of each peak was retrieved for motif discovery. Our motif analysis (Figure 8A) revealed that the detected motifs are similar to the canonical motifs previously reported for each of these TFs (27,83-85). We also found that the majority of motifs indeed occur close to the predicted position (Figure 8B). Overall, at least one motif was recovered in 1034 of 1046 (99%) peaks for CTCF, 1213 of 1213 (100%) peaks for FOXA1, 1962 of 1962 (100%) peaks for ER and 1496 of 1870 (80%) peaks for NRSF. These results illustrate the high specificity of the algorithm as well as the accuracy of the program's prediction of motif location. Detailed output from ChIPMotifs for each TF is accessible from our website (http://motif.bmi.ohio-state.edu/BELT/download).


Figure 8 Accuracy of the BELT algorithm.

Comparison with other ChIP-seq programs

We compared BELT to three other publicly available ChIP-seq peak-detection programs: MACS (52), QuEST (17) and SISSR (61). Because each program implements a different algorithm and defines significance thresholds differently, we performed peak finding on the four datasets described above (CTCF, FOXA1, ER and NRSF) with a variety of parameters in order to make the comparison meaningful. Due to the lack of control data for CTCF, we were not able to compare QuEST and BELT on this dataset. We compared the number of overlapping peaks identified by BELT and the other programs, and the relative distance from the predicted binding site to the real motif. First, we identified a set of peaks for each dataset using optimized bin size and threshold parameters in our program. We then identified a similar number of peaks with each of the other programs on each dataset by exhaustively testing different parameters. These parameters were selected to ensure a similar significance level for each program. The results showed that all overlap rates were at least 74%; the lowest overlap rate, 74%, was between BELT and QuEST on the NRSF data. Importantly, our new approach showed higher accuracy in target site localization than the other programs tested in this study (Figure 9).
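The overlap rate between two peak sets can be computed as in the sketch below. The definition used here (the fraction of peaks in the first set that overlap at least one peak in the second set by one base or more) and the half-open (chrom, start, end) interval representation are assumptions, since the text does not state the exact overlap criterion.

```python
def overlap_rate(peaks_a, peaks_b):
    """Fraction of intervals in peaks_a that overlap any interval in peaks_b.

    Each peak is a (chrom, start, end) tuple with a half-open interval.
    """
    by_chrom = {}
    for chrom, start, end in peaks_b:
        by_chrom.setdefault(chrom, []).append((start, end))
    hits = 0
    for chrom, start, end in peaks_a:
        others = by_chrom.get(chrom, [])
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < e and s < end for s, e in others):
            hits += 1
    return hits / len(peaks_a) if peaks_a else 0.0
```

The measure is not symmetric, so when two programs report different numbers of peaks it is worth computing it in both directions.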

Figure 9 Comparison of the accuracy of the BELT algorithm with other publicly available software.

We also compared the execution time of all four programs. A trend analysis shows each program's running time on each of the four datasets (Figure 10). In general, the execution time increased with the total number of reads. The faster execution time of BELT was due to its C and C++ implementation, as compared to MACS (Python), SISSR (Perl) and QuEST (Perl).

Software execution time (seconds)

         CTCF   FOXA1   ER     NRSF
BELT     25     55      75     160
MACS     100    150     400    500
QuEST    1200   1300    1400   1500
SISSR    600    750     850    950

Figure 10 A trend analysis of each program's running time on each of the four datasets.

DISCUSSION

Several aspects contribute to the improvement of BELT over other programs:

Firstly, it mimics the ChIP process by reversing it. This is achieved by statistically determining the average fragment length for each particular ChIP sample instead of taking a user-estimated input (as in MACS). Accurate calculation of the average fragment length and statistical estimation of position allow precise localization of the target site relative to the gene or region of interest, and thus allow users to interpret their data in a more appropriate and effective way.


Secondly, it modulates specificity and sensitivity by adjusting thresholds and bin sizes. This feature greatly increases the program's flexibility in dealing with a variety of data types and with datasets of different quality and depth.

BELT takes a different strategy for estimating the FDR than most other programs. A Monte Carlo approach is applied to generate simulated datasets based on the characteristics of the data being processed, while the majority of peak finding programs adopt a Poisson, binomial or negative binomial distribution in data simulation. Our fitting test showed that neither the input (control) nor the experimental ChIP-seq data follow any of these distributions. Therefore, data simulation based on the signal-to-noise ratio level of the real data can control the false discovery rate more effectively.
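A simulation-based FDR estimate that avoids a parametric distribution can be sketched as follows. This is an illustration only: the background is simulated here by resampling empirical bin counts from a background sample, which is an assumption and not the signal-to-noise-ratio procedure implemented in BELT; the function and parameter names are hypothetical.

```python
import random


def monte_carlo_fdr(chip_bins, background_bins, threshold,
                    n_sim=100, seed=0):
    """Estimate the FDR of bins whose read count is >= `threshold`.

    chip_bins: observed per-bin read counts in the ChIP sample.
    background_bins: empirical per-bin counts used to simulate noise.
    The FDR is the mean number of simulated bins passing the threshold
    divided by the number of observed bins passing it.
    """
    rng = random.Random(seed)
    observed = sum(1 for c in chip_bins if c >= threshold)
    if observed == 0:
        return 0.0
    false_total = 0
    for _ in range(n_sim):
        # Draw a noise-only dataset of the same size from the empirical counts.
        simulated = rng.choices(background_bins, k=len(chip_bins))
        false_total += sum(1 for c in simulated if c >= threshold)
    return min(1.0, (false_total / n_sim) / observed)
```

Because the simulated counts are drawn from observed background values rather than from a fitted Poisson or binomial model, the estimate inherits whatever overdispersion the real background has.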

As many efforts are made to understand abnormal transcriptional regulation and aberrant epigenetic events in cancer development, more and more ChIP-seq data derived from cancer model cell lines are becoming available. The genomes of many of these model cell lines are disrupted, for example MCF-7, LNCaP and K562. BELT increases the number of simulated background noise reads in amplified genome regions to better control the FDR there.

Robust linear normalization enables effective comparison among samples. When control data are available, or when different samples are being compared, a global normalization is needed because the sequencing depths of two different samples are rarely at the same level. Although peak regions possess a high enrichment level of reads, they make up only a very small portion of the whole genome. The majority of the reads in a ChIP-seq dataset are nonspecific background noise, and the genome-wide distributions of background noise in different samples usually correlate well with each other (r >= 0.8). Thus, we assume that the enrichment patterns of the majority of genomic regions should not differ. Robust linear regression is well suited to this case, since it reduces the effect of outliers, in contrast with least squares regression, which is sensitive to outliers (52,77). For data that are not linearly correlated, we suggest using nonlinear normalization methods such as LOESS (85).

Another noteworthy feature of BELT is that it implements multiple statistical parameters to evaluate the significance of identified peaks. In addition to an FDR, which mainly evaluates significance for a percentile level of peaks, we define an empirical formula (formula 3) to calculate a score for each individual peak. Peak scores allow us to rank the peaks. This is especially useful when two biological replicates are available for comparison. A plot of the percentage overlap against the number of ranked peaks in two biological replicates of ZNF263 ChIP-seq data (86) shows that the overlap decreases as the number of peaks increases. This finding supports the notion that the peak score is a useful measure of the identified peaks. A third parameter of BELT is the p-value, which is defined only when multiple datasets are compared. It measures the significance of the difference in enrichment between datasets at the same peak region. BELT outputs a set of passed peaks (with a p-value less than or equal to 0.05) as well as a set of failed peaks (with a p-value greater than 0.05).

Last but not least, it produces results in a relatively shorter time on given data than other programs. This will be extremely helpful given that sequencing depth is now much higher and the number of reads in one experiment may rise to hundreds of millions as parallel sequencing capacity grows rapidly.

In summary, current peak-finding programs are powerful and are designed to handle a variety of systematic biases produced by the ChIP-seq procedure. However, with the rapid growth in capacity and the wide application of ChIP-seq technology, data analysis requires more complete bias control, more precise localization of target sites and higher processing speed. Our new program is an effort to address these emerging demands. Validation on four sets of published data, as well as comparison with other available programs, demonstrated that BELT provides another useful ChIP-seq analysis tool for the sequencing user community.

MATERIALS AND METHODS

An overview of the BELT algorithm

In a ChIP-seq experiment, fragments of genomic DNA of interest (interacting with a certain TF, exhibiting a certain modification, etc.) are precipitated by a specific antibody. One or both ends of each of these fragments are sequenced simultaneously using high-throughput sequencing techniques. The precise location of each sequenced read is recorded by mapping it to the genome. The strategy of our algorithm is as follows: starting from the locations of these reads, BELT reverses the procedure of ChIP-seq, traces back the locations of the fragments that were precipitated and then identifies the binding regions (peaks).


The BELT algorithm includes five steps: 1) defining a series of bins by evenly dividing the genome, with bin sizes varying from 50 bp to 150 bp, and counting the density of reads in each bin; 2) for single-end sequencing, calculating an average fragment length for a ChIP sample by considering the direction of the reads, and decoding the fragment positions by shifting reads (for paired-end sequencing, the precise fragment positions are known); 3) determining significant enrichment threshold levels by a percentile rank statistic method; 4) defining enriched regions (peaks) and locating the target site within each identified peak by taking the average of the fragment positions; and 5) using Monte Carlo simulation to model the background based on the signal-noise-ratio of the ChIP-seq data and to estimate false discovery rates.

When multiple samples are compared, such as IgG, input data, control data and experimental data, a Fisher's exact test is applied to compute the p-value for identified peaks (Figure 5).

Datasets

ChIP-seq data for the human transcription factors CTCF in CD4+ T cells, FOXA1 in MCF-7 cells, ERα (shortened as ER) in MCF-7 cells, and NRSF in Jurkat lymphoblast cells were downloaded for evaluating the performance of the BELT program. All datasets are single end and are available at http://motif.bmi.ohio-state.edu/BELT/download.

Calculation of an average fragment length for a ChIP sample

The read-enriched areas on the two strands should pair up at a target site and show a bimodal pattern. In the first step, BELT sets a very high threshold (0.99995) and finds >=100 well-paired peaks (if fewer than 100 pairs are found, the threshold is reduced and the process is repeated until at least 100 are found). Then, BELT calculates the average fragment length at each of these highly significant target sites using Formula 1),

$$ l = \frac{\sum P_r}{n_r} - \frac{\sum P_f}{n_f} + 1 \tag{1} $$

where $P_f$ and $P_r$ are the read positions on the forward and reverse strands respectively, and $n_f$ and $n_r$ are the numbers of reads on the forward and reverse strands respectively.

To avoid excluding long fragments whose ends fall outside the enriched area, we extend the area by 200 bp on each side. The genome-wide average fragment length is computed as $L = \bar{l}$, the mean of $l$ over these sites, and is used to shift reads, search for possible target sites, etc.
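As an illustration only, the fragment length estimate can be sketched in a few lines of Python. The sketch assumes Formula 1 is read as the mean reverse-strand read position minus the mean forward-strand read position, plus one, and that L is the plain mean of the per-site values. The function names and the toy coordinates are hypothetical and are not part of BELT.

```python
from statistics import mean

def site_fragment_length(forward_positions, reverse_positions):
    """Per-site estimate l: mean reverse position - mean forward position + 1
    (assumed reading of Formula 1)."""
    return mean(reverse_positions) - mean(forward_positions) + 1

def genome_fragment_length(sites):
    """L: average of the per-site estimates over all well-paired sites."""
    return mean(site_fragment_length(f, r) for f, r in sites)

# Toy well-paired sites: (forward read positions, reverse read positions).
sites = [
    ([100, 110, 120], [290, 300, 310]),  # l = 300 - 110 + 1 = 191
    ([1000, 1020], [1200, 1220]),        # l = 1210 - 1010 + 1 = 201
]
L = genome_fragment_length(sites)        # mean of 191 and 201
```

In a real run the site list would hold the 100 or more well-paired peaks found at the high threshold, each extended by 200 bp on both sides.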

Decoding the ChIP fragment position

To estimate the fragment positions of the ChIP sample, we shift all reads towards the mid-point of the fragment by L/2 whereby each of these resulting points is considered as a representation of one fragment.

Determination of significant enrichment level thresholds

We define a series of bins, with sizes varying from 25 to 150 bp, by evenly dividing the genome, and count the density of reads in each bin. The user's input then defines the significant enrichment level thresholds, based on a percentile ranking statistic method: after the enrichment levels of all bins are sorted, the threshold is taken at the percentile given by the confidence level. By default, five thresholds are defined, from the 0.99 to the 0.99999 percentile level.
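The read shifting and percentile thresholding described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the BELT implementation: reads are (position, strand) pairs on a single chromosome, empty bins count as zero enrichment, and the percentile is taken by simple index lookup in the sorted bin values. All names are hypothetical.

```python
from collections import Counter

def shift_read(position, strand, L):
    """Move a read L/2 towards the assumed fragment midpoint."""
    half = L // 2
    return position + half if strand == "+" else position - half

def bin_counts(reads, L, bin_width):
    """Count shifted reads per fixed-width bin; returns {bin_index: count}."""
    return Counter(shift_read(p, s, L) // bin_width for p, s in reads)

def percentile_threshold(counts, n_bins, level):
    """Enrichment value at the given percentile rank over all n_bins bins."""
    # Bins without reads are included as zeros so the rank is genome-wide.
    values = sorted(list(counts.values()) + [0] * (n_bins - len(counts)))
    index = min(int(level * n_bins), n_bins - 1)
    return values[index]

# Toy example: a 10 kb "genome" split into 200 bins of 50 bp.
reads = [(100, "+"), (110, "+"), (300, "-"), (310, "-"), (5000, "+")]
counts = bin_counts(reads, L=200, bin_width=50)
threshold = percentile_threshold(counts, n_bins=200, level=0.999)
```

With real data the same lookup would be repeated for each bin size and each of the five default confidence levels.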

Definition of a peak and localization of the target site

A peak is defined as a set of continuous bins that have an average enrichment level higher than the threshold. Our algorithm only allows gaps with one bin in length, since larger gaps might indicate that there are two closely located peaks. After having defined a peak, we calculated the exact target site for that peak using Formula 2) with the assumption that the fragments are symmetrically distributed around a target site.

$$ P_m = \frac{\sum (P_f + d) + \sum (P_r - d)}{n_t} \tag{2} $$

where $P_m$ denotes the exact target site (binding motif, modification site, etc.), $n_t$ is the number of reads in the region, $P_f$ and $P_r$ are the read positions on the forward and reverse strands respectively, and $d$ is the distance shifted (equal to L/2).
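A short sketch of the peak definition and target site localization follows. It makes two simplifying assumptions that should be read as such: each bin is tested individually against the threshold (rather than testing the average enrichment of a candidate region), and Formula 2 is read as the mean of forward reads shifted by +d and reverse reads shifted by -d. The function names are hypothetical.

```python
def call_peaks(bin_values, threshold):
    """Group above-threshold bins into peaks, tolerating one-bin gaps.

    Returns a list of (first_bin, last_bin) index pairs, inclusive.
    """
    peaks = []
    start = last = None
    for i, value in enumerate(bin_values):
        if value <= threshold:
            continue
        if start is not None and i - last <= 2:  # adjacent, or a one-bin gap
            last = i
        else:
            if start is not None:
                peaks.append((start, last))
            start = last = i
    if start is not None:
        peaks.append((start, last))
    return peaks

def target_site(forward_positions, reverse_positions, d):
    """Mean of shifted read positions (assumed reading of Formula 2)."""
    shifted = [p + d for p in forward_positions]
    shifted += [p - d for p in reverse_positions]
    return sum(shifted) / len(shifted)

peaks = call_peaks([0, 5, 6, 0, 7, 0, 0, 8, 0], threshold=4)
site = target_site([100, 110], [290, 300], d=95)
```

In the toy input, the single empty bin at index 3 is bridged, while the two-bin gap at indices 5 and 6 splits the signal into two peaks.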

Data normalization

By default, robust linear regression is applied to normalize data with different sequencing depths, since we assume that all experiments were performed under the same controllable conditions. First, the enrichment levels of 10 kb bins are counted for each sample. Then a normalization factor is calculated by performing robust linear regression between the bin enrichment levels of the two samples. The two samples are then adjusted to comparable enrichment levels using the normalization factor.
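To illustrate why a robust fit matters here, the sketch below compares two slope estimates on toy bin counts. The text does not specify which robust regression estimator BELT uses, so the median of per-bin ratios is used here purely as a simple stand-in for a robust through-origin slope; the least-squares slope is shown for contrast. Names and numbers are hypothetical.

```python
from statistics import median

def robust_factor(sample_bins, control_bins):
    """Median of per-bin ratios: a simple outlier-resistant slope estimate."""
    ratios = [s / c for s, c in zip(sample_bins, control_bins) if c > 0]
    return median(ratios)

def least_squares_factor(sample_bins, control_bins):
    """Ordinary least-squares slope through the origin, for comparison."""
    sxy = sum(s * c for s, c in zip(sample_bins, control_bins))
    sxx = sum(c * c for c in control_bins)
    return sxy / sxx

# Toy 10 kb bin counts: the last bin contains a strong peak in the sample.
control = [10, 20, 30, 40, 50]
sample = [20, 40, 60, 80, 1000]

factor = robust_factor(sample, control)         # driven by the background bins
naive = least_squares_factor(sample, control)   # pulled upwards by the peak bin
scaled_control = [c * factor for c in control]  # control on the sample's scale
```

The background bins all suggest a factor of 2, which the robust estimate recovers, whereas the single enriched bin inflates the least-squares slope several fold.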

Calculation of the p-value

BELT performs a Fisher's exact test and calculates a p-value for each peak if sample comparison is performed. Peaks with a p-value less than 0.05 are defined as passed peaks; all others are considered failed peaks. Both passed and failed peaks are recorded as output.
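For readers unfamiliar with the test, a self-contained sketch of a one-sided Fisher's exact test is given below. How BELT builds the 2x2 table and whether it uses a one-sided or two-sided test are not stated in the text; the layout used here (reads inside and outside the peak, for the ChIP sample and the control) and the one-sided alternative are assumptions for illustration.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided p-value for the table [[a, b], [c, d]].

    Assumed layout: a = ChIP reads in the peak, b = ChIP reads elsewhere,
    c = control reads in the peak, d = control reads elsewhere.
    Returns P(X >= a) under the hypergeometric null distribution.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / total

p_value = fisher_exact_greater(3, 1, 1, 3)
passed = p_value < 0.05
```

For genome-scale read counts, an established implementation such as `scipy.stats.fisher_exact` would normally be preferred over this direct summation.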

Ranking the resultant peaks

Peaks are ranked by a score which measures their “quality/significance” and is empirically defined in Formula 3). This score is also used to rank the peaks in a particular percentile. We take several factors into account: the length of a peak, the average score of bins, and the shape of a peak.

m Sa 3) SpLog2   where Sp is the score of a peak; Sa is the average score of each bin in the peak; m denotes the number of bins in the peak defined as m = Lp / Lw, Lp: the length of the peak; and Lw: the width of a bin.

Importantly, by multiplying the average score and the number of bins in the peak, the score increases the weight of a peak's shape in determining its rank. In general, if two peaks contain the same number of reads, the narrower one is favoured; among peaks of the same width, the more highly enriched peak is favoured.

Estimation of False Discovery Rate (FDR)

For a percentile rank r and a test statistic Zk, we want to test a null hypothesis,

$$ H_{k0}: E(P_k) = 0 \tag{4} $$

where, for peaks $P_k$, k = 1, …, n, E is the expected value of the number of false positive peaks among all n peaks claimed as true at level r. We define this E value as the false discovery rate (FDR):

$$ FDR(r) = E\left[ \frac{FP(r)}{TP(r)} \right] \tag{5} $$

where FP(r) is the number of false positive peaks at level r and TP(r) is the number of peaks claimed as true peaks at level r.

In practice, BELT generates simulated datasets to compute the FDR, where each dataset includes simulated peaks and background noise reads based on the real ChIP-seq data. The procedure of data generation is described in the following section (Generation of synthetic, simulated background data).

Let Nb denote the number of peaks formed by simulated background data and Nt denote the total number of peaks detected from ChIP data. The FDR can be re-written as

$$ FDR = \frac{N_b}{N_t} \times 100\% \tag{6} $$

Generation of synthetic, simulated background data

We define a signal-noise-ratio (SNR) level by Formula 7), which is the basis for the Monte Carlo simulation of data:

$$ SNR = \frac{R_s}{R_o} \tag{7} $$

where $R$ ($R = R_o + R_s$) denotes the total number of reads, $R_s$ denotes the number of reads that fall into peaks, and $R_o$ denotes the number of reads that are not in any peak.

A). Simulated peaks (target sites)

(a) Randomly generate n target sites (e.g. 10 nt in length) on the genome.
(b) For each site, generate a certain number of artificial fragments that mimic the ChIP sample.
(c) Randomly define each fragment's length, varying from 100 to 300 nt.
(d) Randomize each fragment's position around the site; all fragments should cover the target site.
(e) For single-end sequencing, record the coordinates of one end of each fragment; for paired-end sequencing, record both ends, whereby each end represents one read. (The simulation is based on a real sample from which the repeated regions have already been removed by the peak-finding process; thus, we do not need to exclude the repeat regions in our simulation process.)

B). Randomly generate background noise reads

(a) After obtaining the coordinates of the reads in A), we randomly generate coordinates for background fragments throughout the genome. The number of fragments is equal to R(1-SNR) for single-end sequencing or R(1-SNR)/2 for paired-end sequencing.
(b) We record the coordinates of one or both ends of the fragments as reads. If the aberrant genome flag is on, an amplification factor is calculated for amplified regions. An amplified region is defined as a large non-pericentromeric, non-repetitive genomic region (>=10 kb) that has a significantly higher input enrichment than the majority of normal genomic regions. The number of background noise reads generated in such a region is multiplied by the amplification factor.
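Steps A) and B) can be sketched for the single-end case as below. This is a simplified illustration, not BELT's simulator: the target site is treated as a single coordinate, the sequenced end of each fragment is chosen at random, the number of background fragments is passed in directly rather than derived from the SNR, and the amplification factor is omitted. All names are hypothetical.

```python
import random

def simulate_peak_reads(site, n_fragments, rng, min_len=100, max_len=300):
    """Single-end reads from random fragments that all cover `site`."""
    reads = []
    for _ in range(n_fragments):
        length = rng.randint(min_len, max_len)
        # Any start in this range keeps the site inside the fragment.
        start = rng.randint(site - length + 1, site)
        end = start + length - 1
        if rng.random() < 0.5:
            reads.append((start, "+"))  # left end read on the forward strand
        else:
            reads.append((end, "-"))    # right end read on the reverse strand
    return reads

def simulate_background_reads(n_fragments, genome_length, rng,
                              min_len=100, max_len=300):
    """Single-end reads from fragments placed uniformly along the genome."""
    reads = []
    for _ in range(n_fragments):
        length = rng.randint(min_len, max_len)
        start = rng.randrange(0, genome_length - length + 1)
        end = start + length - 1
        reads.append((start, "+") if rng.random() < 0.5 else (end, "-"))
    return reads

rng = random.Random(0)  # fixed seed so the sketch is reproducible
peak_reads = simulate_peak_reads(site=5000, n_fragments=50, rng=rng)
background = simulate_background_reads(200, genome_length=100_000, rng=rng)
synthetic = peak_reads + background
```

Because every simulated fragment covers the site, forward reads fall at or upstream of it and reverse reads at or downstream of it, which reproduces the bimodal strand pattern that the peak finder relies on.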

C). Synthetic data


A synthetic dataset is defined as a pre-determined number of simulated peaks plus background noise reads; it combines the reads generated in both A) and B) and is used for our evaluation purposes.

D). Simulated data

A simulated dataset is defined as a number of permuted reads within peaks in the corresponding ChIP data plus background noise reads. It is used to determine the FDR for each real dataset.

Discovery of motifs

Four datasets at high stringency levels (~1000 peaks each for CTCF and FOXA1, and ~2000 peaks for ER and NRSF) were obtained from the BELT program and used for detecting novel motifs with the program ChIPMotifs (http://motif.bmi.ohio-state.edu/ChIPMotifs), developed in our laboratory. Sequence logos were constructed using WebLogo (http://weblogo.berkeley.edu/).

Program implementation

BELT is implemented in C and C++. The source code is platform independent and was compiled and tested on Linux (Fedora 10) and OS X with the gcc compiler, and on Windows XP and Vista with the Microsoft Visual C++ 9.0 compiler. It has a command line interface as well as a simple GUI (Buckle) to help users specify the parameters. The documentation, source code and compiled binaries for Linux, OS X, Windows XP and Vista can be downloaded from our website at http://motif.bmi.ohio-state.edu/BELT/download.


The program takes read information files in various formats, such as BED, ELAND, EXTENDED ELAND, GFF and SAM/BAM. The default input file format is ELAND; users can use the -b option for BED format, the -S option for SAM format, the -e option for EXTENDED ELAND format, etc. It also provides an option (-n) for non-standard file formats, which allows users to specify the columns in the file that contain the chromosome ID, the read's starting and ending coordinates, and the strand of the read (e.g. -n 0 1 2 5). If control data are provided, the -c option specifies the control file name (e.g. -c XXX_IgG.txt). The -c option can also be used to compare different samples.

Assembly information for some of the extensively studied genomes is embedded in BELT, e.g. human (hg18, hg19) and mouse (mm8). Users may use the -g option to specify the genome assembly of their dataset, e.g. -g mm8; the default assembly is hg18. For genomes that are not embedded, users need to provide the genome information, such as its size, the number of chromosomes, and the length of each chromosome. By default, peaks within pericentromeric and repetitive regions are excluded from the output using a species-specific repetitive region table. However, users can explicitly turn this filter off with the -R option. The -A option is recommended if the genome under study is disrupted.

BELT automatically calculates the average fragment length. By default, it performs peak calling using five bin sizes from 25 to 150 and five confidence levels from 0.99 to 0.99999, and then generates a peak file for each combination of these parameters. Users can decide which bin size and threshold to use after comparing the numbers of peaks found and the FDR. The -w option specifies the bin size and the -p option specifies the confidence level of the bin enrichment, e.g. -w 50 100 150 -p 0.995 0.999.

In addition, the -o and -W options allow the program to generate a bin fragment enrichment level file in variable-step and fixed-step WIG format respectively, after the ChIP fragment positions have been decoded. These files can be easily visualized in any genome browser that takes WIG files as input, such as the UCSC genome browser (87) and the Integrated Genome Browser (88). Tools that convert files between different formats are integrated in the program (Appendix A).


Chapter 4: High resolution detection and analysis of CpG dinucleotides methylation using MBD-seq technology

Lan, X., Adams, C., Landers, M., Dudas, M., Krissinger, D., Marnellos, G., Bonneville, R., Xu, M., Wang, J., Huang, T.H.-M. et al. (2011) High resolution detection and analysis of CpG dinucleotides methylation using MBD-seq technology. PLoS One, 6, e22226.


INTRODUCTION

The advance of next generation sequencing technology has revolutionized the research fields of transcriptional regulation and systems biology (21-23,25). ChIP-sequencing (ChIP-seq) has become a leading technology to interrogate in vivo protein-DNA interactions (26,27,29). Recently, in addition to protein-DNA interactions, massively parallel sequencing has been used to identify open chromatin (89), histone modifications (26,28,90) and DNA methylation.

DNA methylation is one of the major epigenetic mechanisms and plays an important role in a variety of cancers (91-93). Several high throughput profiling techniques (MeDIP-seq (94-96), MIRA-seq (97), MBD-seq (67), MethylCap-seq (98), MethylC-seq (63,64), BS-seq (65)) have been developed to study genome-wide methylation patterns. Affinity-based enrichment of methylated DNA sequences with methyl-CpG binding domain proteins followed by next generation sequencing (MBD-seq) (67), utilizing the MethylMiner™ Methylated DNA Enrichment kit, has been shown to be a powerful alternative to MeDIP-seq and to the whole-methylome sequencing technology BS-seq (63,64,67).


In MBD-seq experiments, high coverage of methylated CpG dinucleotides can be achieved by increased sequencing depth; however, as the sequencing depth increases so does the cost and the computational resource requirement. No optimal sequencing depth has been given by previous studies.

MACS (52), QuEST (17), SISSRs (61), PICS (62), and many other peak identification programs (74,76-79,99) have been developed for ChIP-seq data analysis; however, the majority of these programs were designed to locate transcription factor binding sites (TFBSs). DNA methylation sites differ from TFBSs in that methylated CpG dinucleotides are highly abundant in most differentiated cells, so the signal peaks in MBD-seq data are densely distributed. This characteristic calls for a computational analysis program with higher resolution, since the aforementioned programs fail to resolve the methylation level of individual CpG dinucleotides.

Several recent studies applied methods based on tag density (67,100) or on tag counts normalized by the CpG density (66). As with many of the peak detection programs (99), low resolution is the major disadvantage of using tag-density-based methods for MBD-seq data analysis.

In this study, we performed high-depth MBD-seq in the human breast cancer MCF-7 cell line. The result shows that with ~100 million unique mapped tags (approximately five lanes on a GAII sequencer) from the 500mM and 1000mM elutions, the coverage of the MBD-seq data comes close to a saturation point. A bi-asymmetric-Laplace model (BALM) was developed to analyze MBD-seq data. We compared the resolution of BALM to that of several ChIP-seq analysis tools. The results demonstrate the program's superior ability to distinguish the methylation statuses of closely positioned CpG sites.

This study demonstrates that MBD-seq combined with the new program is potentially a powerful tool to capture genome-wide DNA methylation profiles with high efficiency and resolution.

RESULTS

Overview of experimental design and analysis

In MBD-seq experiments, fragments of methylated genomic DNA are precipitated by a specific capture protein. A single end of each of these fragments is sequenced using high-throughput sequencing techniques. The precise locations of the sequenced tags are recorded by mapping them to the reference genome.

In this study, a total of ~0.5 billion tags were generated from MBD-seq performed in the MCF-7 cell line (Table 1).

Salt Concentration   Total Mapped Tags   Unique Mapped Tags
500mM                130,965,175         83,777,039
1000mM               154,623,791         84,000,909
2000mM               139,465,768         75,499,798
Input                72,970,099          50,840,361

Table 1 Summary of MBD-seq tags from different elutions.

A flow chart of the major steps of the experiments is shown in Figure 11A. Briefly, a recombinant form of the human MBD2 protein was applied to precipitate methylated DNA from genomic DNA. Three separate libraries were constructed under three different elution salt concentrations (500mM, 1000mM, 2000mM), sequenced, and subsequently aligned to hg18 using the Bowtie mapping software (58) (MATERIALS AND METHODS, MBD-seq section).

Figure 11 Overview of experimental design and analysis.

To detect the methylation level of each CpG dinucleotide, we present a statistically oriented algorithm based on a bi-asymmetric-Laplace model. The model aims to precisely recapitulate the bimodal distribution of tags over target sites in a ChIP-seq experiment (52,101) (Figure 11, Figure 12A).

Figure 12 Bi-asymmetric-Laplace model.


Figure 13 Tags distribution around transcription factor binding sites.

This model was chosen based on the following facts. First, tag density decreases exponentially in both directions from the summit of each model. Second, an asymmetric exponential family distribution of the lengths of gel-electrophoresis-selected ChIP fragments is observed in paired-end sequencing data. More importantly, the proposed model yields a lower goodness-of-fit value in both MBD-seq and TFBS ChIP-seq data than the previously described Gaussian (17) and t-distribution (62) models (Figure 12B, Figure 13). A detailed list of estimated BALM parameters for each dataset is provided in Table 2.

Data            Strand    θ     σ         κ
MBD in MCF-7    forward   -45   78.6999   0.854124
MBD in MCF-7    reverse   45    77.5207   1.12074
MBD in H1       forward   -77   83.9296   1.22654
MBD in H1       reverse   77    88.5777   0.779785
MBD in T cell   forward   -53   73.0629   0.92573
MBD in T cell   reverse   54    72.0422   1.0732
MBD in HCT116   forward   -47   70.1349   0.897699
MBD in HCT116   reverse   47    69.6271   1.0302
CTCF            forward   -27   40.2385   1.01702
CTCF            reverse   28    40.3395   0.986449
FOXA1           forward   -45   71.8979   0.975041
FOXA1           reverse   45    71.591    1.00957
ER              forward   -35   67.7709   0.917369
ER              reverse   36    67.7936   1.09561
NRSF            forward   -36   55.1012   1.00033
NRSF            reverse   36    53.6163   0.959186

Table 2 Estimated BALM parameters.

All four tested public transcription factor datasets have fewer than 10 million unique mapped tags, which demonstrates that obtaining an accurate model does not require extremely high sequencing depth (Figure 13). An overview of the algorithm is given below (Figure 11B):

1. Perform an initial scan for enriched regions using a tag shifting method as in BELT (53). If input data are available, a Fisher's exact test is performed to filter out regions that are not significantly more enriched than the input, and a genome region amplification index (GRAI) is calculated from the background enrichment level of the input data. Set t>0, s=1 (t is the total number of iterations, s denotes the current iteration number).

2. Measure tag distribution over target sites.

3. Model the tag distribution over target sites as a bi-asymmetric-Laplace distribution and estimate the parameters using the maximum likelihood estimators (102-104).

4. Calculate the weighted enrichment for each nucleotide in the genome. Weighted enrichment is defined as the tag enrichment weighted by each tag's position relative to the nucleotide using the BALM. Then scan the genome for regions whose weighted enrichment is higher than a local threshold, which is weighted by the GRAI. Perform a Fisher's exact test to filter out regions that are not significantly enriched compared to the input.

5. Within each enriched region, a BALM mixture is constructed. The center of each bi-asymmetric-Laplace distribution represents one target site. Unknown parameters are estimated using the EM algorithm (105,106). The number of components (each component can be interpreted as a target site) is determined by the Bayesian Information Criterion (BIC) (107,108). Then update the list of enriched regions and target sites. For MBD-seq data, a methylation level is inferred from the mixture model for each CpG dinucleotide.

6. If s

7. Output a list of enriched regions as well as the precise locations of the predicted target sites occurring in these regions. For MBD-seq data, a file containing the methylation level of each CpG dinucleotide in the genome is generated.

The major novelty of the algorithm is that it accurately estimates the methylation level of each CpG dinucleotide with high resolution in tag-enriched regions by maximizing the likelihood of the given tags via expectation maximization (EM) (109-111). A more extensive discussion of the algorithm is given in the BALM algorithm section of MATERIALS AND METHODS.
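To give a feel for the shape of the model, the sketch below evaluates an asymmetric Laplace density with a location θ, a scale σ and a skewness κ, one density per strand. The exact parameterization used by BALM is not restated in this section, so the commonly used three-parameter form (location, scale, skewness) is assumed here; it should be checked against the BALM algorithm section before being relied on. The example parameter values are the CTCF estimates from Table 2; the function names are hypothetical.

```python
from math import exp, sqrt

def asymmetric_laplace_pdf(x, theta, sigma, kappa):
    """Asymmetric Laplace density (assumed parameterization).

    Decays at rate sqrt(2)*kappa/sigma to the right of theta and at rate
    sqrt(2)/(sigma*kappa) to the left, so kappa controls the skew.
    """
    scale = sqrt(2) / sigma * kappa / (1 + kappa ** 2)
    if x >= theta:
        return scale * exp(-sqrt(2) * kappa / sigma * (x - theta))
    return scale * exp(-sqrt(2) / (sigma * kappa) * (theta - x))

def bi_asymmetric_laplace_pdf(offset, strand, params):
    """Density of a tag at `offset` from the target site, by strand."""
    theta, sigma, kappa = params[strand]
    return asymmetric_laplace_pdf(offset, theta, sigma, kappa)

# CTCF estimates from Table 2: (theta, sigma, kappa) per strand.
CTCF = {"forward": (-27, 40.2385, 1.01702),
        "reverse": (28, 40.3395, 0.986449)}

# Approximate the integral of the forward density with a 1 bp Riemann sum.
total = sum(bi_asymmetric_laplace_pdf(x, "forward", CTCF)
            for x in range(-2000, 2001))
```

The forward and reverse densities peak on opposite sides of the target site, which is how a two-strand model of this kind can capture the bimodal tag pattern described above.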

MBD-seq data analysis

The average tag density throughout known genes (RefSeq hg18) for all three experiments is plotted in Figure 14A. The results show that lower methylation levels occur at both the transcription start sites (TSS) and the transcription termination sites (TTS), which is consistent with observations in previous studies using bisulfite padlock probes with microarrays (112) as well as bisulfite sequencing (113). The tag distribution patterns of the 1000mM and 2000mM elutions are similar; however, that of the 500mM elution is substantially different from the other two (Figure 14A). Log tag count correlation analysis found a higher correlation for 1000mM vs 2000mM (r = 0.973, p = 0.000) than for 500mM vs 1000mM (r = 0.884, p = 0.000) and 500mM vs 2000mM (Figure 14B,C,D).

Figure 14 Correlation of tags density of different salt concentrations.

The BALM analysis results for three representative regions on chromosome 21 showed that the 500mM elution captures a distinct portion of the methylated genomic regions compared to the 1000mM and 2000mM elutions (Figure 15A). To quantitatively compare the methylated sites detected at the three salt concentrations, CpG sites with a methylation score in the top 20% (5,632,772 out of 28,163,863 total CpG sites in the human genome) of each salt concentration dataset were selected. A 78.4% overlap was observed between the 1000mM and 2000mM elutions, while the overlaps between 500mM and 1000mM and between 500mM and 2000mM were only 35.9% and 31.4% respectively (Figure 15B). The detected methylation scores of the top 20% CpG sites in 1000mM and 2000mM were well correlated (r = 0.838, p = 0.000); however, there was no strong correlation between the CpG sites of the 500mM and 1000mM elutions (r = 0.286, p = 0.000) (Figure 15C, D).
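The top-20% overlap comparison can be expressed compactly as below. This is only a sketch of the set arithmetic: scores are held in dictionaries keyed by a CpG identifier, the overlap is reported relative to the size of the first set, and the ten toy scores per dataset are invented for illustration and do not come from the MBD-seq data.

```python
def top_fraction(scores, fraction=0.2):
    """Identifiers of the CpG sites with the highest scores."""
    k = int(len(scores) * fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

def overlap_percent(sites_a, sites_b):
    """Percentage of sites in sites_a that are also in sites_b."""
    return 100 * len(sites_a & sites_b) / len(sites_a)

# Toy methylation scores for ten CpG sites in two elution datasets.
elution_a = dict(enumerate([0.9, 0.8, 0.1, 0.2, 0.3, 0.0, 0.4, 0.5, 0.6, 0.7]))
elution_b = dict(enumerate([0.95, 0.1, 0.9, 0.2, 0.3, 0.0, 0.4, 0.5, 0.6, 0.7]))

top_a = top_fraction(elution_a)
top_b = top_fraction(elution_b)
shared = overlap_percent(top_a, top_b)
```

Since both top sets have the same size when the same fraction is applied to the same CpG universe, the overlap percentage is symmetric in this setting.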

Figure 15 Correlation of CpG methylation score of different salt concentrations.

CpG islands are often important cis-regulatory elements distributed in the genome. To test the effect of CpG island methylation on gene expression genome wide, we define a CpG island methylation score as the average methylation score of all CpG dinucleotides within that CpG island. We then calculated the correlation between the promoter CpG island methylation score and gene expression. Although no strong negative correlation (r = -0.159, p = 0.000) is observed, the expression difference between the group of genes with hypermethylated promoter CpG islands (score > 0.8) and the group with hypomethylated promoter CpG islands (score < 0.2) is statistically significant (Student's t-test, p = 0.009). This result is consistent with current knowledge that gene expression is regulated at multiple levels and that CpG island methylation might affect the accessibility of the regulatory elements to activating or repressive transcription factors rather than directly promote or suppress gene expression (66).

Thus, we performed correlation analysis of the CpG islands' DNA methylation score and DNase hypersensitivity, which measures the accessibility of the DNA (ENCODE consortium). Genome-wide analysis showed a medium negative correlation (r = -0.454, p = 0.000) between these two factors (Figure 16).

Figure 16 Correlation between CpG island methylation score (x-axis) and CpG island DNase hypersensitivity (y-axis) in the MCF-7 cell line.

Optimal depth of MBD-seq

We compared the coverage of the MCF-7 MBD-seq data to that of three publicly available MBD-seq datasets. The result shows that increased sequencing depth provides higher CpG coverage by tags (Figure 17, Figure 18A).

Figure 17 Coverage and saturation of MBD-seq experiments.

To maximize the efficiency of MBD-seq experiments, we first needed to determine an optimal combination of elution salt concentrations. Thus, we performed CpG coverage analysis (66) on five datasets: the three original MBD-seq datasets obtained under different salt concentrations, a double concentration dataset that combined the 500mM and 1000mM datasets, and a triple concentration dataset that combined all three (Figure 18B). The triple concentration dataset showed increased depth because more tags were included; however, only minimal improvement in coverage was observed compared to the double concentration dataset. Meanwhile, the double concentration dataset showed significantly increased coverage because the 500mM and 1000mM datasets complement each other. Having decided that the 500mM and 1000mM combination is the optimal elution condition, we performed saturation analysis to optimize the sequencing depth. Using increasing fractions of randomly sampled data from the combined 500mM and 1000mM dataset, tag coverage was calculated and plotted in Figure 18C.

Figure 18 Coverage and saturation of MBD-seq experiments.

Different levels of tag coverage tend to become saturated as the number of sequenced tags grows. Beyond the 60% point (approximately 100 million tags), an increase in the sampled tag fraction does not cause a significant increase in CpG coverage. MBD-seq experiments reached optimal efficiency when using a combination of the 500mM and 1000mM salt concentration elutions with ~100 million unique mapped tags sequenced in MCF-7 cells. Cancer cells in general contain less genome-wide DNA methylation than their normal counterparts. This should also be taken into account when estimating the optimal sequencing depth for a given experiment.
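The saturation analysis can be sketched as a subsampling loop. The sketch below rests on simplifying assumptions that are not taken from the text: tags are represented only by their start coordinate on one chromosome, every tag is extended to a fixed fragment length in the forward direction, and a CpG counts as covered when at least `min_depth` extended tags overlap it. Function names and toy coordinates are hypothetical.

```python
import random
from bisect import bisect_left, bisect_right

def cpg_coverage(tag_starts, cpg_positions, fragment_length, min_depth=1):
    """Fraction of CpG sites overlapped by at least min_depth extended tags."""
    starts = sorted(tag_starts)
    covered = 0
    for cpg in cpg_positions:
        # Tags starting in [cpg - fragment_length + 1, cpg] overlap the CpG.
        depth = (bisect_right(starts, cpg)
                 - bisect_left(starts, cpg - fragment_length + 1))
        if depth >= min_depth:
            covered += 1
    return covered / len(cpg_positions)

def saturation_curve(tag_starts, cpg_positions, fragment_length, fractions, rng):
    """CpG coverage for random subsamples of increasing size."""
    curve = []
    for fraction in fractions:
        sample = rng.sample(tag_starts, int(len(tag_starts) * fraction))
        curve.append((fraction, cpg_coverage(sample, cpg_positions, fragment_length)))
    return curve

tags = [0, 100, 200]
cpgs = [50, 150, 400]
curve = saturation_curve(tags, cpgs, fragment_length=100,
                         fractions=[0.0, 1.0], rng=random.Random(0))
```

On real data, the point at which the curve flattens as the fraction grows marks the depth beyond which additional sequencing adds little CpG coverage.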

Resolution and efficiency of BALM

Because there are few widely accepted standalone programs available for MBD-seq data analysis, we compared the resolution of BALM with several popular peak-detection programs designed for the analysis of transcription factor ChIP-seq data, which is based on the same principle as MBD-seq and shares similar procedures. These programs apply algorithms similar to or more advanced than the existing MBD-seq analysis methods, most of which are based on tag density. MACS, QuEST, SISSRs and PICS were chosen because each of these programs implements a different algorithm and uses different statistical methods. For example, the MACS algorithm shifts sequence tags towards the binding site by a certain number of base pairs and then locates the binding site by calculating the summit within a peak region. QuEST identifies binding sites using a tag enrichment profile of a peak region built with a Gaussian kernel. SISSRs screens for binding sites in a certain window using a threshold on the tag counts on both the forward and reverse strands, calculated based on a Poisson distribution. Recently, a mixture model showed advantages over several widely used programs in its ability to separate closely positioned peaks (62). For a comprehensive comparison, we included PICS, which applies a mixture t-distribution model to probabilistically infer binding sites.


Firstly, to test each algorithm's ability to separate closely positioned target sites, we generated spike-in data using the human transcription factor ERα dataset in MCF-7 cells (80). Briefly, three well-defined peaks in the ERα dataset, representing a low depth peak, a medium depth peak and a high depth peak respectively, all of which are detectable by all five programs, were inserted at random positions in the genome. A second peak was then inserted at a close distance from the first peak (100bp, 50bp, 25bp).

The five programs were then applied to detect the spike-in peaks. The result shows that all of the programs identified the spike-in peak regions; however, only BALM accurately located the two separate peaks within a peak region, at 50bp resolution for low depth peaks and 25bp resolution for medium and high depth peaks (Figure 19). This demonstrates the high resolution of BALM and indicates that the ability of the program to separate closely positioned peaks increases with sequencing depth, which provides strong evidence for the high accuracy of the statistical model. The resolution limit of BALM is ~50bp for low depth regions, and better resolution can be achieved in high depth regions.

Secondly, we compared the result of BALM to those of MACS and QuEST using the triple concentration MBD-seq dataset in the MCF-7 cell line. SISSRs and PICS were excluded from this comparison because the number of tags exceeded the capacity of the programs. All of the programs detected broad methylated genomic regions; however, in addition to methylated regions, BALM calculates the methylation score of each CpG dinucleotide within each region. Three representative regions on chromosome 7 show that the proposed algorithm provides users with higher resolution detection and more information on DNA methylation than programs originally designed for TFBS detection (Figure 20).

Figure 19 Resolution of BALM.


Figure 20 A comparison of BALM with MACS, QuEST on the result of MBD-seq data in MCF-7 cell.

Applying an advanced algorithm raises concerns about the program's efficiency. We therefore compared the efficiency of these algorithms by measuring computational resource consumption. BALM is relatively slow for small datasets because sophisticated statistical methods are applied; however, its running time is not strongly sensitive to the number of tags. A trend analysis shows the programs' execution times on datasets with different tag numbers (Figure 21).

Execution time (mins)

Program               ER (8,821,805 tags)   MBD-seq 1000mM (154,623,791 tags)   MBD-seq combined (425,054,734 tags)
MACS                  4                     120                                 234
QuEST single thread   20                    322                                 630
QuEST 6 threads       18                    286                                 511
PICS single thread    34                    576                                 1023
PICS 6 threads        28                    482                                 934
BALM single thread    45                    102                                 199
BALM 6 threads        14                    34                                  65

Figure 21 Comparison of algorithm efficiency in terms of execution time.

In general, the execution time of all the programs tested increases with the total number of tags. The shorter execution time of BALM on high depth datasets might be due to its C and C++ implementation, as compared to MACS (Python), QuEST (Perl), SISSRs (Perl) and PICS (R).

Clonal bisulfite sequencing validation

To validate the methylation sites identified from MBD-seq and assess the resolution and accuracy of BALM, we performed standard clonal bisulfite sequencing in randomly selected regions in MCF-7 cells. Eleven regions were tested, including 2 unmethylated, 3 partially methylated and 6 fully methylated regions, with both dense and sparse CpG sites. Comparing the bisulfite sequencing results for a total of 178 CpG dinucleotide loci with the corresponding methylation scores produced by BALM yielded a Pearson correlation coefficient of r = 0.879 (p < 0.001). The results demonstrate that the prediction of DNA methylation by BALM is accurate and reliable not only in sparse but also in dense CpG regions (Figure 22, Figure 23, Figure 24, Figure 25).

Figure 22 Clonal bisulfite sequencing validation.


Figure 23 Validation of MBD-seq using bisulfite sequencing technique (ESPN, PLEKHG5, HOXA11, PLAU).

Figure 24 Validation of MBD-seq using bisulfite sequencing technique (MC5R, PIK3C3, KIAA0427, C18ORF24).


Figure 25 Validation of MBD-seq using bisulfite sequencing technique (PFKL, ARVCF, SBF1).

DISCUSSION

MBD-seq is widely used as a cost-efficient method to investigate genome-wide methylation patterns. In this study, we attempted to determine the optimal conditions for the MBD-seq experiment. We performed high depth MBD-seq under three different elution salt concentrations (500mM, 1000mM, 2000mM) in the MCF-7 cell line. The analysis indicates that different salt concentrations can be used in MBD-seq to yield distinct populations of methylated DNA fragments. The results show that with ~100 million uniquely mapped tags (approximately five lanes on a GAII sequencer) from the 500mM and 1000mM eluates, the coverage of the MBD-seq data approaches saturation. The affinity of the MBD protein for methylated DNA increases with the density of methylated CpG sites (66). This makes MBD-seq a highly effective method for measuring the methylation status of CpG islands, within which CpG sites are densely distributed. As many studies focus on CpG island methylation, accurate measurement can be achieved at a lower sequencing depth.

Interestingly, a medium negative correlation between CpG island methylation score and DNase hypersensitivity is observed. However, this result does not establish a causal relationship between DNA methylation and chromatin status. H3K4 methylation has been reported as an active histone modification mark on the chromatin that affects the accessibility of cis-elements. The relationship between H3K4 methylation and DNA methylation may be interesting for further study, given the negative correlation between DNA methylation and chromatin accessibility.

To finely determine the methylation level of each CpG dinucleotide in the genome, we developed a statistical model named BALM. Several noteworthy features increase the accuracy of the presented algorithm.

Firstly, unlike TFBSs, methylated CpG dinucleotides are highly abundant in the genome and densely distributed. The new program is capable of distinguishing two closely positioned target sites by applying the EM algorithm to fit a BALM mixture. Indeed, the proposed algorithm increases the resolution of MBD-seq from 150bp to 50bp, and up to 25bp in highly enriched regions (66). A methylation score is calculated for each genomic CpG site based on the statistical model. This score is an effective indication of the probability that the position is methylated. It therefore allows users to interpret the data in a more appropriate and effective way.


Secondly, as many efforts are being made to understand abnormal transcriptional regulation and aberrant epigenetic events in cancer development, more high-throughput data derived from cancer model cell lines are becoming available. The genomes of the majority of these model cell lines, for example MCF-7, LNCaP and K562, are disrupted. Another important feature of the program is the genome region amplification index (GRAI), which increases the threshold of the weighted enrichment as well as the number of simulated background noise tags to better control the FDR in amplified genomic regions. For example, several genomic regions on the q arm of chromosome 20 in the MCF-7 cell line were reported to have a high amplification number (114). To assess the bias control provided by the GRAI, the index of 20q generated from the input data of MBD-seq was plotted together with the result from the study of Volik S et al (114) (Figure 26). The comparison showed that the GRAI closely reflected the copy number changes of different genomic regions.

In summary, we produced extra high sequencing depth MBD-seq data in the MCF-7 cell line and determined the optimal parameters for the MBD-seq experiment. The new algorithm was specifically designed to address the emerging demand for higher resolution detection of the methylation level of densely distributed CpG dinucleotides. Through validation using clonal bisulfite sequencing, this study demonstrates that the combination of BALM and MBD-seq could serve as a powerful method to capture genome-wide DNA methylation profiles with high efficiency, high resolution and low cost.


[Plot: amplification number (y-axis, 0 to 30) along chromosome 20q (x-axis); series: Volik S et al., GRAI]

Figure 26 Comparison between GRAI of MCF-7 cell line based on the MBD-Seq input data and the result of end-sequencing profiling technique developed by Volik et al.

MATERIALS AND METHODS

Methylated DNA enrichment and high-throughput sequencing (MBD-seq)

Purified genomic DNA from the breast cancer cell line MCF-7 (BioChain, Hayward, CA) was fragmented using an S2 non-contact Adaptive Focused Acoustics™ ultrasonicator (Covaris, Woburn, MA) as described in the SOLiD™ 3 fragment library protocol to generate randomly fragmented DNA of 50-350 bp in length. Fragmented DNA was then subjected to enrichment with the MethylMiner™ methylated DNA kit (which uses a recombinant form of the human MBD2 protein) according to the manufacturer's protocol, and two methylated fractions (500 mM and 1000 mM salt eluates) were isolated. The recovered eluted mass of DNA was 6.9% (3.47 mg) of the total mass loaded (50 mg). Subsequent elution at a very high NaCl concentration (3.5 M) followed by digestion with proteinase K showed that less than 10% of the captured DNA remained on the beads after elution with 1 M NaCl. Separately, 25 μg of fragmented genomic DNA was enriched with a MethylMiner™ kit and eluted as a single fraction with buffer containing 2000 mM NaCl. Unenriched genomic DNA fragments and the 500 mM, 1000 mM, and 2000 mM DNA fractions were then used to construct standard fragment libraries using a combination of adaptor ligation and nick translation (SOLiD™ Fragment Library Construction Kit, Invitrogen). Library DNA was size-selected (inserts were ~100-200 bp) by gel-purification from 2% agarose E-Gel® EX gels prior to PCR amplification, attachment to beads, and emulsion PCR. Libraries were sequenced in 4-well deposition chambers on a SOLiD™ 3 Analyzer, and sequenced tags of 50 bases in length were obtained. The resulting tag sequence csfasta and quality files were aligned to the human genome (NCBI Build 36.1, UCSC hg18) using the Bowtie mapping software (58).

Clonal Sanger bisulfite sequencing

2.5 µg of MCF-7 cell line genomic DNA (BioChain) was bisulfite converted with the MethylCode Bisulfite Conversion Kit (Invitrogen) in 5 reactions at 500 ng scale each. PCR amplification of the equivalent of 50 ng of starting converted DNA was performed with C to T conversion specific primers (avoiding CpG regions) using the following PCR mix: 2 units of AccuPrime Taq DNA Polymerase High Fidelity enzyme (Invitrogen), 1X AccuPrime PCR Buffer II and 0.2 µM of each primer, in a final volume of 100 µl. Cycling conditions were as follows: initial denaturation at 94°C for 2 min; 40 cycles of denaturation at 94°C for 15 sec, annealing at 53-62°C for 30 sec and extension at 68°C for 1 min; final extension at 68°C for 5 min; hold at 4°C. Detailed information about the primers used for each amplicon is listed in Table 3.

BiSearch Primer    Sequence 5'-3'
PLAU-F419          GGGGTTTGAGGTAGTTTTAGGTAAGTT
PLAU-S875          CTGCRAAAACAAATAAACCCTAACC
ARVCF-F49          GGGGGGATTGGAGTTATTTTTA
ARVCF-R401         AATACCTAACATATATATCTAAACAACCCTCTC
SBF1-F12           GGGGGTGTATTTTGTATTTTGGT
SBF1-R383          CTAACCATAACTTACCTAACCTCCTACTTAC
PFKL-F7            GTGTTATTTGGGAAATTTTAGGTAGAAT
PFKL-R379          ACATCCTAAAAATAACACCACAAAAA
HOXA11-F4+2        GGTGTAATTTATGTTGGTTGGG
HOXA11-R4+3        CTTCCCAAAACAAATCTATAAAAAAA
MC5R-F7+7          TGTAGTTTATTGGTTATTGTAGTGGATAG
MC5R-R6-2+6        CCAAAAAAAACATATATATATACAAAAACAC
ESPN -898F         TGGAAGGTAGGGTTTTTTGTAATTT
ESPN-1180R         CAAACAAACAAATTCATTCATCTACC
PLEKHG5 -1146F     TGGTTTTTTTTTGTTAGGTAGAGAGG
PLEKHG5-1420R      ATATTCCCAAAACTTTACCAAAAAA
PIK3C3 -883F       GATAGTTGAGAATAAGATGAGTATATGGTGT
PIK3C3-1198R       AACACTCCTACATAACCACCTTAAAAAA
KIAA0427 -846F     AAAATTTTAGGAATTTAGGTTTTTAGTAGG
KIAA0427-1241R     CCTCACAAAACCCTCTTAATAAATAC
C18orf24 -1187F    TTTATTTTTTTTTTTTTGGTTTGGG
C18orf24-1443R     AAACCATCCTTTAAACCTCTAAAAA

Table 3 PCR primers.


PCR products (1-2 µl) were cloned into the pCR4-TOPO sequencing vector (Invitrogen) and transformed into TOP10 chemically competent E. coli (Invitrogen). Transformations were then plated on LB + 100 Ampicillin plates and incubated at 37°C overnight. Up to 20 individual colonies of each amplicon were grown overnight in 1 ml BRM + 100 µg/ml cultures in a 96 well culture block at 37°C and 300 RPM. Plasmid DNA was isolated using the PureLink HQ 96 Plasmid Purification Kit (Invitrogen), and 1 µg of each clone was sequenced by Sanger sequencing using the M13 Reverse primer from the TOPO kit (Invitrogen).

Public datasets

MethylC-seq data in H1 human embryonic stem cells (64); MBD-seq data in H1 human embryonic stem cells (66), human T cells (100) and HCT116 human colon cancer cells (67); and ChIP-seq data for the human transcription factors CTCF in CD4+ T cells (26), FOXA1 in MCF-7 cells (52), ERα in MCF-7 cells (80), and NRSF in Jurkat lymphoblast cells (27) were downloaded for evaluating the performance of the program. All datasets are available at http://motif.bmi.ohio-state.edu/BALM/.

BALM algorithm

Initial scan for target sites

In the first step, the average fragment length is determined by taking the average distance between the mean position of the tags on the forward strand and the mean position of the tags on the reverse strand in the top enriched bimodal pileup regions (53). Then, all tags are shifted towards the midpoint by half of the average fragment length. A sliding window of 50bp is used to scan for regions that have a higher fragment count than a predetermined threshold weighted by the genome region amplification index (GRAI) (see below for a detailed description). Within each of these regions, a target site is calculated by taking the average of the positions of the fragments in the region. Shifted tag positions are used for the initial detection of target sites, while the original tag positions are used in constructing the model and in subsequent analysis.
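The initial scan can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BALM source code: the function names, the base threshold of 4 and the use of fixed 50bp bins in place of a true sliding window are all hypothetical simplifications.

```python
from statistics import mean


def estimate_fragment_length(forward, reverse):
    """Distance between the mean forward and mean reverse tag positions of one bimodal pileup."""
    return mean(reverse) - mean(forward)


def shift_tags(forward, reverse, fragment_length):
    """Move every tag toward the fragment midpoint by half the fragment length."""
    half = fragment_length / 2
    return sorted([p + half for p in forward] + [p - half for p in reverse])


def initial_target_sites(shifted, window=50, threshold=4, grai=1.0):
    """Average the shifted positions in each run of adjacent enriched windows.

    Simplification (assumption): fixed bins of `window` bp stand in for a sliding window.
    """
    cutoff = threshold * grai  # threshold weighted by the amplification index
    bins = {}
    for pos in shifted:
        bins.setdefault(int(pos // window), []).append(pos)

    sites, region, previous = [], [], None
    for index in sorted(bins):
        enriched = len(bins[index]) > cutoff
        adjacent = previous is not None and index == previous + 1
        if region and not (enriched and adjacent):
            sites.append(mean(region))  # close the current region
            region = []
        if enriched:
            region.extend(bins[index])
            previous = index
    if region:
        sites.append(mean(region))
    return sites
```

With five forward tags around position 104 and five reverse tags around position 304, the estimated fragment length is 200 and all shifted tags fall into a single enriched window; raising the GRAI raises the cutoff and can suppress the call.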

Parameter estimation of the BALM model

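The density function for an asymmetric Laplace distribution (ALD) is,

\[
f(\theta, \sigma, \kappa; x) = \frac{\sqrt{2}}{\sigma}\,\frac{\kappa}{1+\kappa^{2}}
\begin{cases}
\exp\left(-\dfrac{\sqrt{2}\,\kappa}{\sigma}\,|x-\theta|\right) & \text{for } x \ge \theta \\[2ex]
\exp\left(-\dfrac{\sqrt{2}}{\sigma\kappa}\,|x-\theta|\right) & \text{for } x < \theta
\end{cases}
\tag{1}
\]

Two steps are followed to obtain the maximum likelihood estimators of the ALD (102,103),

1. Given sample size n, find $x_m$, $1 \le m \le n$, that minimizes the function $H(x_m)$,

\[
H(x_m) = 2\ln\left(\sqrt{x_m^{+}} + \sqrt{x_m^{-}}\right) \tag{2}
\]

where

\[
x_m^{+} = \frac{1}{n}\sum_{j=1}^{n}(x_j - x_m)^{+}, \qquad
x_m^{-} = \frac{1}{n}\sum_{j=1}^{n}(x_j - x_m)^{-},
\]

\[
(x_j - x_m)^{+} =
\begin{cases}
x_j - x_m, & x_j \ge x_m \\
0, & x_j < x_m
\end{cases}
\qquad
(x_j - x_m)^{-} =
\begin{cases}
x_m - x_j, & x_j \le x_m \\
0, & x_j > x_m
\end{cases}
\]

In the above equations, $x_j$ is the jth element of x and $x_m$ is the element that minimizes the function $H(x_m)$.

2. Set

\[
\hat{\theta} = x_m \tag{3}
\]

\[
\hat{\sigma} = \sqrt{2}\,\sqrt[4]{x_m^{+}}\,\sqrt[4]{x_m^{-}}\left(\sqrt{x_m^{+}} + \sqrt{x_m^{-}}\right) \tag{4}
\]

\[
\hat{\kappa} = \sqrt[4]{x_m^{-}} \,/\, \sqrt[4]{x_m^{+}} \tag{5}
\]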
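The two estimation steps can be sketched as follows. This is a minimal illustration, assuming that the function to minimize is H(x_m) = 2 ln(sqrt(x_m+) + sqrt(x_m-)), where x_m+ and x_m- are the mean positive and negative parts of the deviations from x_m; the function name `ald_mle` and the error handling for degenerate samples are hypothetical and are not taken from the BALM source.

```python
import math


def ald_mle(sample):
    """Return (theta, sigma, kappa) for an asymmetric Laplace fit to `sample`."""
    n = len(sample)
    best = None
    for xm in sample:
        plus = sum(x - xm for x in sample if x > xm) / n   # mean positive part
        minus = sum(xm - x for x in sample if x < xm) / n  # mean negative part
        root = math.sqrt(plus) + math.sqrt(minus)
        h = 2.0 * math.log(root) if root > 0 else -math.inf
        if best is None or h < best[0]:
            best = (h, xm, plus, minus)

    _, theta, plus, minus = best
    if plus == 0 or minus == 0:
        # The minimum sits at the edge of the sample, so kappa would be 0 or infinite.
        raise ValueError("degenerate sample: no interior asymmetric Laplace estimate")
    sigma = math.sqrt(2) * (plus * minus) ** 0.25 * (math.sqrt(plus) + math.sqrt(minus))
    kappa = (minus / plus) ** 0.25
    return theta, sigma, kappa
```

For a symmetric, peaked sample such as [-4, -1, 0, 0, 0, 1, 4], the minimizing element is 0 and the estimated skewness parameter kappa is 1, as expected for a symmetric Laplace shape.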

Maximization of the likelihood of tags within enriched regions using the EM algorithm

Within a signal enriched region, multiple target sites might exist. These sites are estimated by maximizing the likelihood of the given tags using a BALM mixture,

\[
\hat{\theta}_{mle} = \arg\max_{\theta} \sum_{i=1}^{n} \ln f(x_i \mid \theta) \tag{6}
\]

where f is the probability density function of the BALM mixture and n is the number of tags.

Determination of the best mixture model

The Bayesian Information Criterion (BIC) is used to determine the number of components (target sites) within a signal enriched region.

\[
BIC(M_l) = -2\ln L(X, M_l) + d \log N \tag{7}
\]

where $M_l$ is the model being tested, $L(X, M_l)$ is the likelihood that the given sample is generated from model $M_l$, d is the number of free parameters in model $M_l$ and N is the sample size, in this case the number of tags in a given enriched region.
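As a rough illustration of how the criterion selects the number of target sites, the sketch below scores candidate mixtures and keeps the one with the smallest BIC. The function names are hypothetical, the log-likelihood values in the usage note are invented for illustration, and the parameter count d = 2m - 1 (m locations plus m - 1 free mixing proportions) is an assumption rather than a value stated in the text.

```python
import math


def bic(log_likelihood, n_free_params, n_tags):
    """BIC = -2 ln L + d log N for one candidate model."""
    return -2.0 * log_likelihood + n_free_params * math.log(n_tags)


def select_component_count(log_likelihoods, n_tags):
    """Pick the number of components m with the smallest BIC.

    `log_likelihoods` maps m to the maximized log-likelihood of the m-component fit.
    Assumption: d = 2m - 1 free parameters per candidate model.
    """
    scores = {m: bic(ll, 2 * m - 1, n_tags) for m, ll in log_likelihoods.items()}
    return min(scores, key=scores.get), scores
```

For example, with 100 tags and hypothetical log-likelihoods of -520, -480 and -478 for one, two and three components, the third component improves the fit too little to offset its penalty, so two components are selected.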

Calculation of weighted enrichment level


The weight of a tag with respect to a target site is determined by the relative position of that tag to the target site. The weight is proportional to a probability P modeled by the BALM described above. The weighted enrichment score of each position is calculated as follows,

\[
E_{pos} = \sum_{i=1}^{n} P(x_i \mid ts = pos) \tag{8}
\]

where ts represents the target site, pos represents a genomic position and n is the total number of observations around that target site.

Generation of the genome region amplification index (GRAI)

The GRAI is an indicator of the copy number of a certain genomic region. When input data are available, a local background enrichment level can be calculated by counting the tags that mapped to that local region. The index is constructed by taking the ratio of the local background enrichment to the genome background enrichment.

\[
GRAI_i = \frac{n_i L_g}{L_i N} \tag{9}
\]

where $n_i$ denotes the number of input tags in the ith region, $L_g$ denotes the genome size, $L_i$ represents the length of the ith region and N is the total number of input tags.
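The index is simply the local input tag density divided by the genome-wide input tag density, which the short sketch below makes explicit. The function names are hypothetical, and treating the weighted threshold as a plain product of a base threshold and the index is an assumption made for illustration.

```python
def grai(region_tags, region_length, total_tags, genome_size):
    """Ratio of local input tag density to genome-wide input tag density."""
    local_density = region_tags / region_length
    genome_density = total_tags / genome_size
    return local_density / genome_density


def weighted_threshold(base_threshold, index):
    """Scale a base enrichment threshold by the index (assumed to be a simple product)."""
    return base_threshold * index
```

A 1,000 bp region holding 200 of 1,000,000 input tags in a 10,000,000 bp genome has twice the genome-wide density, so its index is 2 and a base threshold of 5 would become 10.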

The EM algorithm

Briefly, each tag $x_i$ is treated as an independent observation, $c_i$ indicates the mixture component from which $x_i$ originated, and k denotes the kth component. Thus the components can be written as,

\[
X_i \mid (C_i = k) \sim BALM(\theta_k, \sigma_k, \kappa_k)
\]

Let $\tau_k$ denote the proportion of the kth component; thus, in a mixture model with m components,

\[
\sum_{k=1}^{m} \tau_k = 1
\]

σ and κ are treated as constants since these parameters are known and identical for all components. The unknown parameters to be estimated are,

\[
\Theta = (\tau, \theta)
\]

The log likelihood function can be written as,

\[
\ln L(\Theta; x, c) = \sum_{i=1}^{n} \ln \sum_{k=1}^{m} \tau_k f(X_i; \theta_k) \tag{10}
\]

where f is the probability density function of the bi-asymmetric-Laplace distribution and n is the number of observations.

Using Bayes' rule, the following formulas can be derived to update the parameters for the next iteration (110),

\[
\tau_k^{(s+1)} = \frac{1}{n}\sum_{i=1}^{n} S_{k,i}^{(s)} \tag{11}
\]

\[
\theta_k^{(s+1)} = \frac{\sum_{i=1}^{n} S_{k,i}^{(s)} x_i}{\sum_{i=1}^{n} S_{k,i}^{(s)}} \tag{12}
\]

where s denotes the iteration step and

\[
S_{k,i}^{(s)} = \frac{\tau_k^{(s)}\, p(x_i \mid \theta_k^{(s)})}{\sum_{j=1}^{m} \tau_j^{(s)}\, p(x_i \mid \theta_j^{(s)})}
\]

represents the conditional probability that $x_i$ is from the kth component at step s.

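The update rules of the EM algorithm can be sketched as below. Because the bi-asymmetric-Laplace density itself is not written out in this section, the sketch swaps in a single asymmetric Laplace density as the component density; this substitution, the function names and the fixed iteration count are assumptions made for illustration only.

```python
import math


def ald_pdf(x, theta, sigma, kappa):
    """Asymmetric Laplace density, used here as a stand-in for the BALM component density."""
    scale = math.sqrt(2) / sigma
    norm = scale * kappa / (1 + kappa ** 2)
    if x >= theta:
        return norm * math.exp(-scale * kappa * (x - theta))
    return norm * math.exp(-scale / kappa * (theta - x))


def em_locations(tags, initial_thetas, sigma, kappa, iterations=50):
    """Estimate mixing proportions and locations with sigma and kappa held constant."""
    thetas = list(initial_thetas)
    m, n = len(thetas), len(tags)
    taus = [1.0 / m] * m
    for _ in range(iterations):
        # E-step: conditional probability that tag i comes from component k.
        resp = []
        for x in tags:
            weights = [taus[k] * ald_pdf(x, thetas[k], sigma, kappa) for k in range(m)]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M-step: update the proportion and the location of every component.
        for k in range(m):
            mass = sum(r[k] for r in resp)
            taus[k] = mass / n
            thetas[k] = sum(r[k] * x for r, x in zip(resp, tags)) / mass
    return taus, thetas
```

Starting from rough guesses of 80 and 320 for two groups of tags centered at 100 and 300, the locations move to the group centers and each component receives about half of the tags.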
Estimation of False Discovery Rate (FDR)


Monte-Carlo simulation is performed to generate simulated data and compute the FDR. Each dataset includes simulated peaks and background noise tags based on the real MBD-seq data (53).

Program implementation

BALM is implemented in C and C++. The source code is platform independent and was compiled and tested on Linux (Fedora 10) and OS X with the gcc compiler, and on Windows XP and Vista with the Microsoft Visual C++ 9.0 compiler. The documentation, source code and compiled binaries for Linux, OS X, Windows XP and Vista can be downloaded at http://motif.bmi.ohio-state.edu/BALM/download.

The program accepts tag information in various file formats, such as BED, ELAND, EXTENDED ELAND, GFF, SAM and Bowtie alignment files. The default input file format is ELAND. It also provides an option for non-standard file formats, which allows users to specify the columns containing the chromosome name, start, end, and strand of a tag (e.g. -n 0 1 2 5). If control data are available, the -c option specifies the control file name (e.g. -c XXX_IgG.txt). The -c option can also be used to compare different samples.

Some of the most commonly used genome assemblies are included, e.g. Human (hg18, hg19) and Mouse (mm8, mm9). The default assembly is hg18. For genomes that are not included, users need to provide genome information, such as size, chromosome number, and length of chromosomes. By default, pericentromeric and repetitive regions are excluded from the output using a species-specific repetitive region table; however, this filter can be turned off if desired.

As in BELT, the options -o and -W allow the program to generate files with fragment enrichment levels across the genome, in variable step and fixed step WIG format respectively, after the tags have been shifted (Appendix B).

A user-friendly GUI for BALM is provided to help biologists specify the parameters for running the program.


Chapter 5: Integration of Hi-C and ChIP-seq data reveals distinct types of chromatin

linkages

Lan, X., Witt, H., Katsumura, K., Ye, Z., Wang, Q., Bresnick, E.H., Farnham, P.J. and

Jin, V.X. Integration of Hi-C and ChIP-seq data reveals distinct types of chromatin hubs.

Nucleic Acids Res. Accepted.


INTRODUCTION

Transcriptional regulation involves a process by which different transcription factors bind to specific short DNA sequences termed cis-regulatory elements (CREs), such as promoters, enhancers, silencers and insulators, and thus control the transcription of different genes. The accessibility of these CREs is often influenced by epigenetic modifications including histone acetylation and methylation, which can be associated with the activation or repression of genes. For example, H3K27ac is found at both active enhancers and promoters (43,45); H3K4 mono-, di- and tri-methylation is linked to gene activation (26,43,116); H3K27me3 is a mark of repressed regions (40-44); and H3K36me3 identifies transcribed regions (26,117).

ChIP-seq and DNase-seq are high throughput experimental technologies that have been shown to be effective in defining detailed maps of transcription factor binding sites (TFBSs), histone modifications, and open chromatin regions. Such techniques have been adopted by the ENCODE Consortium (http://encodeproject.org/ENCODE/) for the identification of many different TFBSs in various cell types, such as K562, GM12878, HepG2, and HeLa (55); see http://encodeproject.org/ENCODE/cellTypes.html for a list of all ENCODE cell types. Many studies have shown that certain transcription factors, such as , the family, and YY1, usually bind to promoter regions whereas many other factors, such as GATA1, TCF7L2 (also called TCF4) and α preferentially bind to distal regions that could be more than 20 kb away from a known transcriptional start site (TSS) (12,118-124). Although some distal binding sites may function as promoters for unannotated protein-coding and/or non-coding genes, it is clear that binding sites can control gene regulation via specific three-dimensional (3D) conformations of the chromatin that bring them into close spatial contact with distant promoters (5-8).

The development of the chromosome conformation capture technique (33) has greatly facilitated our understanding of the effects of chromatin conformation on transcriptional regulation, owing to its greatly increased resolution over traditional co-localization techniques such as fluorescent in situ hybridization (32). Recently, by coupling this technique with next generation sequencing technologies, Hi-C has, for the first time, enabled unbiased genome-wide capture of chromatin interactions (39). This study identified thousands of interacting loci in both K562 and GM06990 cells and identified nuclear substructures termed "fractal globules". A recent review (32) has proposed that there might be four types of genomic interactions, including contacts associated with nuclear lamina, nuclear pores, and the as well as intra- and inter-chromosomal contacts. Although these recent studies provide great advances, many computational and biological challenges remain in organizing and deciphering Hi-C data. For example, the Hi-C data was initially modeled as a simple probability matrix and the identified interacting loci are thus at a 1 Mb scale. However, if the Hi-C data is modeled based on a statistical distribution of the real data, the interactions can not only be determined at finer scales, but can also be differentiated into different types of interacting events (e.g. intra- vs. inter-chromosomal interactions and random vs. proximate ligation events). Also, the initial studies did not attempt to understand how epigenetic modifications correlate with the 3D chromatin interactions, nor did they investigate how the binding of transcription factors might play a role in 3D genome organization. Although a recent study (125) correlated CTCF binding sites with Hi-C data to investigate genome-wide CTCF-mediated interactions, it was purely an in silico computational analysis and did not comprehensively utilize other publicly available transcription factor binding data.

In our study, we have integrated the available K562 Hi-C data with multiple data sets from the ENCODE Consortium, including ChIP-seq data for 45 TFs and 9 histone modifications and DNase-seq data for open chromatin, to dissect the underlying mechanisms of chromatin organization and its impact on genome regulation. We identified 12 distinct chromatin clusters that can be categorized into two different types. Our integrated analysis suggests that transcription factors and chromatin modifiers assemble to form functional complexes that bring distant elements into close proximity. To test this hypothesis, we used knockdown of transcription factors and RNA-seq analyses to provide genome-wide evidence that Hi-C data can identify sets of biologically relevant interacting loci.

RESULTS

Identifying interacting loci


Using the aforementioned Mixture Poisson Regression Model (MPRM) and the power-law decay background (Figure 27, Figure 28A), 96,137 interacting loci with an FDR of 5.76% were selected from a total of 23,337,840 hybrid fragments from K562 cells; see the Hi-C data analysis section of MATERIALS AND METHODS for a more extensive description of the data modeling and analysis.

Figure 27 Strategy for analyzing Hi-C data.


Figure 28 Hi-C analysis and genomic interactions in K562 cells.


Consistent with the previous study (39), most of the 96,137 interactions are intra-chromosomal (95%) and within a distance of 1 million bp (75%) (Figure 29A).

Figure 29 Distribution of hybrid fragments from the Hi-C data of K562 and GM06990 cells.

Circos plots illustrating the genome-wide inter-chromosomal interactions and two examples of intra-chromosomal interactions for individual chromosomes are shown in Figure 28B, C. To demonstrate the applicability of the Hi-C analysis procedure, we also re-analyzed the Hi-C data from GM06990 cells. We identified 83,785 chromatin interactions, of which 82,683 are intra-chromosomal and only 1,102 are inter-chromosomal. Similar to the analysis of K562 cells, more than 86% of the interactions in GM06990 cells are within a distance of 1 million base pairs, which indicates that the predominance of regional interactions is not a unique property of K562 cells (Figure 29B).

K562 cells are derived from the bone marrow of a patient with chronic myelogenous leukemia and are characterized by the presence of the Philadelphia Chromosome, a specific chromosomal abnormality that results from a reciprocal translocation between chromosomes 9 and 22, creating a fusion gene in which the ABL1 gene on chromosome 9 (region q34) is juxtaposed to a part of the BCR ("breakpoint cluster region") gene on chromosome 22 (region q11). Because 9q34 and 22q11 are fused in K562 cells, interactions between these two regions would be labeled as inter-chromosomal although they are actually intra-chromosomal in the K562 genome. In fact, 780 of the 4806 inter-chromosomal interactions in K562 cells are between 9q34 and 22q11, suggesting that a portion of the inter-chromosomal interactions identified in cancer cells may be intra-chromosomal for that particular genome. In addition to the t(9;22)(q34;q11) translocation, a previous study (126) described at least four other chromosomal translocations in K562 cells, namely, der(10)t(3;10)(p21.3;q23), der(18)t(1;18)(p32;q21), der(21)t(1;21)(q23;p11), and der(12)t(12;21)(p12;q21). Interestingly, there are only 2 interactions between 3p21.3 and 10q23 and no interactions were found between the other two pairs of chromosomal fusions. These analyses suggest that the 9;22 translocation may have different properties than the other translocations. In fact, the region corresponding to the 9;22 translocation is highly amplified. To investigate a possible correlation between looping and amplification, we first identified all amplified regions in K562 cells using Sole-search, an integrated ChIP-seq peak-calling program that not only identifies binding sites but also performs an analysis of amplified and deleted regions of the input genome (127,128). We identified 102 amplified regions in K562 cells, with 6,166 long range interactions found within the amplicons. Overlapping these regions with the set of interacting loci in K562 cells, we found that only 41 of the amplified regions contained mapped interacting loci (Table 4).

chr   start          end            Fold amplification   Number of loops
1     16,762,800     16,767,399     6.45                 1
1     16,797,800     16,843,599     4.25                 10
1     141,825,000    141,828,199    4.54                 2
1     143,634,000    143,658,399    3.62                 4
1     143,697,400    143,829,199    5.63                 80
1     154,450,800    154,454,599    4.45                 2
1     232,978,400    232,985,599    5.32                 3
2     132,712,600    132,753,599    6.94                 2
2     161,843,600    161,846,199    4.57                 3
3     75,842,200     75,877,399     4.61                 6
3     196,908,800    196,929,399    3.67                 5
4     31,600         57,999         5.34                 16
6     31,739,200     31,743,999     4.52                 9
6     31,764,800     31,766,999     4.23                 6
6     31,811,400     31,815,199     4.27                 8
7     106,285,400    106,300,199    3.53                 10
9     67,902,400     67,917,999     8.2                  12
9     122,593,000    122,707,799    3.82                 59
9     132,595,800    133,146,399    6.9                  1060
10    46,477,200     46,479,799     4.46                 2
10    51,744,400     51,767,599     3.43                 9
11    51,424,400     51,438,199     4.7                  2
13    79,996,000     80,367,199     3.92                 240
13    89,237,200     91,268,799     3.79                 1053
13    91,748,000     92,132,999     3.86                 196
13    92,652,400     92,821,399     4.6                  181
13    107,299,400    107,458,999    5.19                 180
17    38,735,600     38,739,199     5.27                 5
17    38,819,800     38,823,399     4.94                 1
17    42,566,800     42,570,799     6.47                 2
21    9,719,600      9,746,799      4.49                 13
21    10,193,200     10,203,799     3.61                 2
22    15,633,600     15,679,799     3.85                 45
22    17,386,400     18,627,599     4.01                 745
22    18,663,000     18,665,399     4.35                 2
22    19,064,000     19,360,799     4.33                 173
22    19,404,000     19,794,999     4.56                 221
22    20,128,600     20,141,799     4.03                 8
22    20,248,600     20,904,799     5.6                  464
22    21,020,200     21,294,199     5.39                 287
22    21,326,200     21,963,199     6.61                 1037

Table 4 Number of loops in amplification regions.


Thus, not all amplified regions are involved in long-range interactions. There was no relationship between the fold amplification and the number of loops. For example, the most highly amplified regions of K562 cells (more than 16.5 fold) had no long-range interactions, whereas the 94th ranked amplicon (3.79 fold) had 1,053 long-range interactions. Interestingly, the amplified regions encompassing the BCR and ABL1 genes had 1,037 and 1,060 long range interactions, respectively, comprising more than 34% of all interactions associated with amplified genomic regions of K562 cells (see Figure 30 for an illustration of the number of loops in the amplified regions of chr 22).

Figure 30 Number of loops in the amplified regions of chr 22.


To further investigate potential issues in our analyses due to the existence of genomic rearrangements in K562 cells, we searched for interactions around other previously identified fusion sites in K562 cells. Twenty-five fusion sites detected by the FusionMap software were tested, and none of these sites has interactions within a distance of 20kb.

Clustering interacting loci

To determine the relationship between epigenomic modifications and the identified genomic interactions, we mapped 9 histone modification marks (H3K4me1/2/3, H3K36me3, H4K20me1, H3K9ac, H3K27ac, H3K9me3 and H3K27me3) and regions identified as open chromatin by DNase hypersensitivity onto the set of identified interacting loci. In the initial clustering, we also included CTCF, an insulator protein known to influence chromatin structure (129). Using our peak finding programs, wBELT and BALM (53,68), we first identified genomic regions that are enriched for each mark (Table 5).

Epigenetic status    Region type      Number of regions
Dnase                sharp peaks      48196
H3K27ac              sharp peaks      43766
H3K4me1              sharp peaks      92812
H3K4me2              sharp peaks      43779
H3K4me3              sharp peaks      39333
H3K9ac               sharp peaks      32118
H3K27me3             broad regions    4354
H3K36me3             broad regions    3586
H4K20me1             broad regions    9404
H3K9me3              broad regions    3182

Table 5 Number of enriched regions of epigenetic marks.

We observed that the enriched regions of H3K9me3, H3K27me3, H3K36me3, or H4K20me1 showed broad peak patterns (>1000bp), in contrast to regions marked by H3K4me1/2/3, H3K9ac or H3K27ac, which showed sharp peak patterns over a relatively small region (~200bp); this is in line with previous studies (26,43) (Figure 31).

Figure 31 Epigenetic modification distribution pattern. TSS, Transcription start site; GES, Gene End Site.

We defined an interacting locus as associated with a certain epigenetic mark if it was within a broad peak region of H3K9me3, H3K27me3, H3K36me3 or H4K20me1, or if a sharp peak of open chromatin, CTCF, H3K4me1/2/3, H3K9ac or H3K27ac was in the interacting DNA segment. The peak score from the wBELT output was used to define the intensity of the epigenetic status of that locus. Since many interacting loci may be associated with several peak scores from different marks, the intensities were standardized among different marks. We then performed hierarchical clustering (130) on the interacting loci using Cluster 3.0 software (http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm, Stanford University, 1998-99) with Pearson correlation as the distance measure. We used epigenetic information from only one end of the hybrid fragments to cluster the 96,137 interacting loci pairs into 12 groups (Figure 32A loci 1).
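The standardization and the correlation-based distance can be illustrated with a small sketch. It does not reproduce Cluster 3.0; the z-score form of the standardization, the definition of the distance as 1 minus the Pearson correlation, and the function names are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Z-score one mark's peak scores across loci so that marks become comparable.

    Undefined (division by zero) when all scores of the mark are identical.
    """
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def pearson_distance(a, b):
    """1 - Pearson correlation between the mark profiles of two loci.

    Undefined (division by zero) when either profile is constant.
    """
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / norm
```

Two loci whose mark profiles rise and fall together have a distance near 0 and tend to be merged early by hierarchical clustering, while loci with opposite profiles have a distance near 2.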

Figure 32 Clustering interacting loci.

Interestingly, the second locus of a pair often showed a pattern of epigenetic status very similar to that of the first locus (Figure 32A loci 2). All clusters can potentially interact with all the other clusters of chromatin; however, an interaction between two loci of the same cluster forms at a much higher rate (Chi-square test for every cluster, p value < 0.00001) (Figure 33, Table 6).

Figure 33 Epigenetic status of loci 2 of each major cluster.

For example, in cluster 6 (identified as having only H3K9me3 at loci 1), which constitutes 7.3% of the total interactions, 3307 out of 6980 (47.4%) of the loci 2 have an epigenetic status (H3K9me3 only) similar to that of loci 1. This formation rate of an interaction between two regions with H3K9me3 is much higher than the random formation rate of 7.3% (p < 1E-20, Chi-square test). Conversely, in cluster 6, only 163 out of 6980 (2.3%) of the loci 2 have an epigenetic status that is similar to that of cluster 9 (H3K4me1 and partially DNase). This formation rate of an interaction between clusters 6 and 9 is much lower than the random rate of 10.2% (p < 1E-20, Chi-square test).
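The comparison of an observed formation rate with the random rate can be sketched as a one-degree-of-freedom chi-square test. The exact layout of the test used in the analysis is not stated, so the goodness-of-fit form below (observed versus expected counts of loci 2 inside and outside the target cluster) and the function name are assumptions made for illustration.

```python
import math


def chi_square_vs_random(observed, total, random_rate):
    """Compare an observed count with the count expected at the random formation rate."""
    expected = total * random_rate
    stat = ((observed - expected) ** 2 / expected
            + ((total - observed) - (total - expected)) ** 2 / (total - expected))
    # Upper tail of the chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return observed / total, stat, p_value
```

Applied to the counts quoted above (3307 of 6980 against a random rate of 7.3%, and 163 of 6980 against 10.2%), this form of the test reproduces the observed rates of 47.4% and 2.3% and yields p values far below 1E-20 in both cases.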

Loci1      Loci2 clusters
clusters       1      2      3      4      5      6      7      8      9     10     11     12
1            157     31      3      7     61     34     31     52    148     74    131    656
2              -    449      7      6    135     38     54    106    247    151    415    646
3              -      -     15      1      7      6      6      4     19     10     32    105
4              -      -      -     36     17     10     10     25     43     28     30    166
5              -      -      -      -    954    152    157    285    574    472    922   1081
6              -      -      -      -      -   3307     62     52    163    210    492   2376
7              -      -      -      -      -      -    373    162    389    252    186    548
8              -      -      -      -      -      -      -    671    561    496    459    708
9              -      -      -      -      -      -      -      -   2799    991   1167   2398
10             -      -      -      -      -      -      -      -      -   2254   1350   1524
11             -      -      -      -      -      -      -      -      -      -   5611   3220
12             -      -      -      -      -      -      -      -      -      -      -  28066

Table 6 Formation frequency of interaction between different clusters.

To compare the density of the individual epigenetic marks between clusters, we plotted the distribution of the average intensities for each mark in each cluster (Figure 32B). Each of the clusters indeed has a distinct distribution of epigenetic marks. Clusters 7 and 8 appear similar in the clustergram (Figure 32A); however, cluster 7 has much higher levels of H3K9ac and H3K27ac than cluster 8.

Figure 34 Relative distance of the interacting loci to a transcription start site.

To examine how these different clusters of chromatin interactions may correlate with gene structure and transcriptional regulation, we determined the relative distance between the transcription start site (TSS) and the interacting loci of each cluster (Figure 34). For example:

1) We found a higher promoter presence in the interacting loci of clusters 7 and 8, which is consistent with the abundance of H3K9ac and H3K27ac in these clusters.

2) Cluster 10 loci are not at promoters, and loci 1 and loci 2 in this cluster are both enriched with the gene body marks H4K20me1 and H3K36me3, suggesting that cluster 10 may represent interactions between two regions that are both actively being transcribed.

3) Cluster 9 had several interesting properties. Loci 1 in this cluster are not directly at promoters (as determined by distance from a TSS and the lack of H3K9ac and H3K27ac marks), nor are they at active enhancers (they lack H3K27ac). However, these loci lack repressive marks and are marked by open chromatin and H3K4 monomethylation, suggesting that these regions are available for transcription factor binding. Further analyses of the TFs that bind to cluster 9 loci are provided below.

4) Other clusters that were not enriched in promoter regions correspond to CTCF insulator binding sites (cluster 5), had marks of epigenetic silencing (clusters 6 and 11), or had none of the epigenetic modifications that were analyzed by the ENCODE Consortium (cluster 12).

The analysis presented above is representative of intra-chromosomal interactions, since the majority of the interactions in K562 cells are in this category. To determine whether the epigenetic patterns of the inter-chromosomal interactions differ from those of the intra-chromosomal interactions, we performed the clustering analysis using only the inter-chromosomal interactions. A large portion of these interactions are formed by regions that lack any of the epigenetic marks analyzed by the ENCODE Consortium, which is similar to cluster 12 in the combined analysis of intra- and inter-chromosomal interactions. Strikingly, a substantial portion of the inter-chromosomal interactions involve regions that are marked by H4K20me1 (active gene body regions), H3K36me3 (active gene body regions) and H3K27me3 (epigenetic silencing) (Figure 35).

Figure 35 Epigenetic status of inter-chromosomal interacting loci.

Associating transcription factor binding sites with the sets of clustered interacting chromatin loci

It is possible that the identified paired loci are brought together by an interaction between a transcription factor bound to loci 1 and another transcription factor bound to loci 2 in each paired set. CTCF has previously been shown to be highly correlated with looping (125). However, as shown in Figure 32, we found that a large portion of the chromatin interactions are not associated with a CTCF binding site. In addition, CTCF can potentially interact with other types of chromatin (Figure 33, cluster 5), suggesting that CTCF bound to loci 1 may interact with another TF at loci 2 to create a loop.

Because certain transcription factor-mediated chromatin interactions have been associated with different types of gene regulation, it was possible that the different clusters may be regulated by distinct sets of transcription factors. The availability of a large set of ChIP-seq data from the ENCODE Consortium provided the opportunity to test this hypothesis.

We used the BALM program (68) to identify sets of genome-wide binding sites for 45 transcription factors using publicly available ChIP-seq data (Table 7). We found that a majority of the binding sites for the 45 factors are associated with DNase hypersensitive regions, which is an indication of nucleosome depletion (131) (Figure 32A). Furthermore, an inverse correlation between DNA methylation and DNase hypersensitivity was observed in these open chromatin regions (Figure 36, MATERIALS AND METHODS).

Factor     Num. of binding sites   Factor     Num. of binding sites   Factor     Num. of binding sites
ATF3       9342                    HEY1       12862                   SETDB1     7933
BDP1       4308                    INI1       16285                   SIN3AK-20  9164
BRF1       12820                   JUND       11857                   SIRT6      15673
BRF2       7290                    MAX        11215                   SIX5       7727
BRG1       9005                    NELFE      5804                    SRF        14837
CFOS       8417                    NFATC1     9261                    TAF1       7222
CJUN       11031                   NFE2       15918                   TFIIIC     11777
CMYC       8714                    NFYA       10496                   TR4        2289
CTCF       8847                    NFYB       12539                   USF1       10786
           7450                    NRSF       7522                    XRCC4      12610
E2F6       12680                   POL2       9991                    YY1        7153
EGR1       10219                   POL3       8423                    ZNF263     5450
GABP       11833                   PU1        12176                   ZNF274     15377
GATA1      10828                   RAD21      11962
GATA2      8284                    RFX3       11473
GTF2B      11045                   RPC155     10707

Table 7 Number of binding sites of 45 factors.

Figure 36 Correlation between CpG island methylation and DNase hypersensitivity in the K562 cell line.

This result is consistent with our previous study (68) and many other studies which have demonstrated that a majority of TF-DNA interactions require the DNA fragment to be in an “open” status (132), which is associated with nucleosome depletion (5), DNA hypo-methylation (133), and specific side chain modifications of histones (134). Open chromatin can reflect both promoter regions and other distal regulatory regions such as enhancer and repressor elements. To determine the preferential binding behavior of factors to promoters vs. distal regulatory elements, we defined a promoter-distal ratio as the occurrence rate of a factor binding in a promoter region divided by the occurrence rate of that same factor binding in a non-promoter region. This analysis defined two distinct groups of transcription factors. For example, 14% of the non-promoter open chromatin regions contain GATA1 binding sites, while only 5% of the promoter open chromatin regions have GATA1 binding sites. Thus, the promoter-distal ratio is 0.36, which indicates that GATA1 preferably binds to non-promoter, open chromatin regions.

Factors such as MYC and E2F4 are highly enriched in promoter open chromatin regions, whereas factors such as GATA1 and GATA2 are overrepresented in non-promoter open chromatin regions (Figure 37).
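As a worked illustration of the promoter-distal ratio defined above, the following sketch recomputes the GATA1 example from the two occurrence rates given in the text. The function name `promoter_distal_ratio` is hypothetical, and the threshold of 1 used in the comment is simply the point at which the two rates are equal.

```python
def promoter_distal_ratio(promoter_rate, distal_rate):
    """Occurrence rate of a factor in promoter open chromatin divided by its
    occurrence rate in non-promoter open chromatin."""
    if distal_rate <= 0:
        raise ValueError("distal_rate must be positive")
    return promoter_rate / distal_rate

# GATA1: 5% of promoter and 14% of non-promoter open chromatin regions
# contain a binding site (values taken from the text).
gata1_ratio = promoter_distal_ratio(0.05, 0.14)

# A ratio below 1 indicates a preference for non-promoter regions,
# a ratio above 1 a preference for promoters.
gata1_prefers_distal = gata1_ratio < 1.0
```

Rounded to two decimal places, the GATA1 ratio is 0.36, matching the value reported above.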

Figure 37 Preferential binding pattern of transcription factors (TFs).

We next correlated the identified TFBSs with the 12 sets of interacting loci to determine which sets of TFs are preferentially associated with each type of loci. We first defined an interacting locus as associated with a specific TF if a binding site for that TF was found in the interacting DNA segment. We then ranked the TFs according to the percentage of their binding sites that are associated with interacting loci (Table 8).

TF         Percentage    TF         Percentage
CTCF       53.10%        NFYB       40.18%
USF1       50.19%        SIN3AK-20  39.31%
RAD21      49.85%        E2F6       39.24%
SIX5       48.26%        POL2       39.04%
PU1        47.72%        NFYA       38.99%
CFOS       46.82%        NFE2       38.84%
NRSF       46.56%        INI1       38.62%
BRG1       45.80%        XRCC4      38.60%
ZNF263     45.56%        RPC155     38.41%
CMYC       45.47%        TFIIIC     37.50%
CJUN       44.66%        GTF2B      37.10%
GATA2      44.44%        NELFE      36.85%
MAX        44.11%        SIRT6      36.44%
GATA1      44.02%        HEY1       34.98%
E2F4       43.99%        RFX3       34.78%
EGR1       43.75%        SETDB1     33.97%
GABP       43.71%        BDP1       32.99%
TAF1       43.48%        ZNF274     32.51%
JUND       41.98%        POL3       32.08%
SRF        41.84%        NFATC1     32.04%
YY1        41.70%        BRF2       31.67%
ATF3       41.17%        BRF1       30.77%
TR4        40.45%

Table 8 Percentage of binding sites associated with chromatin interactions.
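The percentage reported for each TF can be thought of as the share of its binding sites that fall inside any interacting locus. The sketch below shows one way to compute such a percentage. It assumes that binding sites are represented by a single coordinate, that loci are non-overlapping half-open intervals, and that the function name `percent_sites_in_loci` and all coordinates are hypothetical; the actual overlap criterion used in this analysis may differ.

```python
import bisect

def percent_sites_in_loci(sites, loci):
    """Percentage of binding sites (chrom, position) that fall inside any of
    the interacting loci (chrom, start, end), treated as half-open and
    non-overlapping intervals."""
    intervals = {}
    for chrom, start, end in loci:
        intervals.setdefault(chrom, []).append((start, end))
    starts = {}
    for chrom in intervals:
        intervals[chrom].sort()
        starts[chrom] = [s for s, _ in intervals[chrom]]

    hits = 0
    for chrom, pos in sites:
        if chrom not in intervals:
            continue
        # Rightmost locus whose start is <= pos, if any.
        i = bisect.bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos < intervals[chrom][i][1]:
            hits += 1
    return 100.0 * hits / len(sites) if sites else 0.0

# Invented coordinates for illustration only.
example_loci = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 0, 50)]
example_sites = [("chr1", 150), ("chr1", 300), ("chr2", 10), ("chr3", 5)]
example_percentage = percent_sites_in_loci(example_sites, example_loci)
```

In this toy input two of the four sites lie inside a locus, so the percentage is 50.0.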

Not surprisingly, CTCF, which is thought to be a major determinant of looping, ranked first in this list (53% of CTCF sites were associated with interacting chromatin loci). However, many of the TFs showed a similarly high percentage of binding sites associated with the identified interacting loci, suggesting that most TFs may be involved in looping. Hierarchical clustering (130) was performed to classify the factors (Figure 38A).

Figure 38 Interacting loci are bound by a network of transcription factors.

Similar to the analysis of the epigenetic marks, the peak score from the BALM output was used to represent the binding affinity of the TF at that locus, and the scores were standardized among different TFs. The clustering result showed several major groups of transcription factors with distinct binding preferences for loci with a distinct epigenetic status. For example, one group of factors includes CTCF and RAD21, which can bind to insulators that are open (i.e. marked by DNase hypersensitivity) (cluster 5). The majority of the transcription factors bind specifically to loci that are marked by H3K9ac and H3K27ac and are likely to be active promoters (clusters 7 and 8) (26,43). However, 3 site-specific transcription factors (c-Jun, GATA1, and GATA2) and 3 chromatin regulators (BRG1, INI1, SIRT6) bind specifically to loci in cluster 9, which are open chromatin marked by H3K4 mono-methylation but not by promoter (H3K9ac) or active enhancer (H3K27ac) modifications.

To find the concurrence of these proteins at the two ends of the interacting loci, we applied the Apriori algorithm (135), which is a widely used data mining method for searching for association rules in large datasets of transactions (MATERIALS AND METHODS). The Apriori algorithm revealed a protein interaction network through DNA looping (Figure 38B). First, our analysis showed a high concurrence of CTCF and RAD21 at the ends of interacting loci, which is consistent with their similar binding preference and with previous reports (136). Second, proteins in the Polymerase III machinery, including POL3, TFIIIC, BRF1 and BDP1, are also linked together through long distance DNA looping. Third, E2F4 and RNA polymerase were highly linked, consistent with previous studies (119). Finally, we note that c-Jun, GATA1, GATA2, INI1, and BRG1 are closely linked. Interestingly, nearly all of the factor nodes have loops connected to themselves, which might be due to the cross linking of the distant DNA elements to the protein during the ChIP-seq procedure. This may also explain why some highly enriched peaks detected in a TF ChIP-seq experiment lack a consensus motif for that TF (12).
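The frequent-itemset step of the Apriori algorithm can be sketched as follows. This is a generic, minimal version for illustration, not the implementation used in this work. The encoding of each interaction as a transaction of items such as "CTCF@1" (a factor bound at loci 1) and "RAD21@2" (a factor bound at loci 2), the example transactions, and the support threshold are all assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset whose support (fraction of
    transactions containing it) is at least min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {}
        for candidate in current:
            support = sum(1 for t in transactions if candidate <= t) / n
            if support >= min_support:
                level[candidate] = support
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-itemset candidates and prune any
        # candidate that has an infrequent k-subset (the Apriori property).
        k += 1
        current = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in level for sub in combinations(union, k - 1)
            ):
                current.add(union)
    return frequent

# Hypothetical transactions: one per interaction, listing the factors bound
# at each end of the interacting pair.
example_transactions = [
    {"CTCF@1", "RAD21@2"},
    {"CTCF@1", "RAD21@2", "GATA1@1"},
    {"CTCF@1", "RAD21@2"},
    {"GATA1@1", "GATA2@2"},
    {"POL3@1", "TFIIIC@2"},
]
frequent_sets = apriori(example_transactions, min_support=0.5)
```

In this toy input only CTCF@1, RAD21@2 and their pair reach the threshold, each with a support of 0.6. Frequent pairs that span the two ends of an interaction are the kind of co-occurrence from which the edges of a factor network could be drawn.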

Correlating gene expression with different sets of interacting loci

To examine the influence of looping on gene expression, we assigned each interacting locus to a known gene if the locus was located within the gene body or was less than 10 kb from the transcription start site of the gene. The expression levels of the genes in loci 1 and paired loci 2 for each cluster were then plotted against a whole genome gene expression profile (Figure 39). Interestingly, the interacting loci not only possess a similar epigenetic status; the expression of the associated genes is also co-regulated. We defined 4 co-active clusters (24.8% of total interactions) as Type I linkages and 8 co-repressive clusters (75.2% of total interactions) as Type II linkages. Specifically, clusters 7, 8, 9 and 10 were composed of paired loci in which the nearest genes to both loci were more active than a set of randomly selected genes