
The Pennsylvania State University

The Graduate School

Eberly College of Science

ELUCIDATING BIOLOGICAL FUNCTION OF GENOMIC DNA WITH ROBUST

SIGNALS OF BIOCHEMICAL ACTIVITY:

INTEGRATIVE GENOME-WIDE STUDIES OF ENHANCERS

A Dissertation in

Biochemistry, Microbiology and Molecular Biology

by

Nergiz Dogan

© 2014 Nergiz Dogan

Submitted in Partial Fulfillment

of the Requirements

for the Degree of

Doctor of Philosophy

August 2014

The dissertation of Nergiz Dogan was reviewed and approved* by the following:

Ross C. Hardison T. Ming Chu Professor of Biochemistry and Molecular Biology Dissertation Advisor Chair of Committee

David S. Gilmour Professor of Molecular and Cell Biology

Anton Nekrutenko Professor of Biochemistry and Molecular Biology

Robert F. Paulson Professor of Veterinary and Biomedical Sciences

Philip Reno Assistant Professor of Anthropology

Scott B. Selleck Professor and Head of the Department of Biochemistry and Molecular Biology

*Signatures are on file in the Graduate School

ABSTRACT

Genome-wide measurements of epigenetic features such as histone modifications and occupancy by transcription factors and coactivators provide the opportunity to understand more globally how genes are regulated. While much effort is being put into integrating the marks from various combinations of features, the contribution of each feature to the accuracy of enhancer prediction is not known. We began with predictions of 4,915 candidate erythroid enhancers based on genomic occupancy by TAL1, a key hematopoietic transcription factor that is strongly associated with gene induction in erythroid cells. Seventy of these DNA segments occupied by TAL1 (TAL1 OSs) were tested by transient transfections of cultured hematopoietic cells, and 56% of these were active as enhancers. Sixty-six TAL1 OSs were evaluated in transgenic mouse embryos, and 65% of these were active enhancers in various tissues. Inclusion of additional epigenetic features improved the prediction accuracy, with combinations of TAL1, GATA1, EP300, H3K4me1, and H3K27ac giving high accuracy of enhancer prediction (70%-75% success depending on method of clustering) while retaining strong discriminatory power, with good sensitivity (Sn, up to 84%) and specificity (Sp, up to 80%). Importantly, it was shown that activating histone marks in the absence of key transcription factors or an open chromatin profile are a weak predictor of enhancer activity and had no discriminatory power to predict enhancers. Motifs that distinguish active from inactive TAL1 OSs implicate the IRF, STAT, and FOX families as candidate positive co-factors with TAL1, while REST (NRSF) and the HOX family are implicated in inactivity. While signals for evolutionary constraint were weak over the entire TAL1-bound DNA segments regardless of activity in either assay, phylogenetic preservation of a TF-binding site motif was associated with enhancer activity.

Additionally, we reported that conservation of GATA1 occupancy is linked to pleiotropic functions, meaning that the conserved occupied segments are enhancers in multiple tissues, including non-hematopoietic tissues. Furthermore, the TAL1-bound enhancers validated in enhancer assays were assigned to their target genes, which include not only erythroid but also non-erythroid genes.


TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS
Chapter-1 Introduction to Regulation of Transcription by Cis-Regulatory Modules (CRMs)
  1.1 Regulation of transcription
  1.2 Cis-regulatory modules (CRMs)
    1.2.1 Enhancers
    1.2.2 Identify enhancers
    1.2.3 How do enhancers influence human disease?
    1.2.4 How important are enhancers for evolution
  1.3 Regulation of erythropoiesis via transcriptional enhancers and histone modifications
    1.3.1 Transcription factors
    1.3.2 Histone modifications
  1.4 Statement of Thesis
Chapter-2 Epigenetic and Genetic Features that Lead To Discovery of Enhancer Function
  2.1 Introduction
  2.2 Results
    2.2.1 Genome-wide prediction of regulated erythroid CRMs by TAL1 occupancy
    2.2.2 Occupancy by TAL1 is a strong predictor of enhancer activity
    2.2.3 Impact of additional epigenetic features in predicting enhancer activity in the presence of TAL1 binding
    2.2.4 Confirmation of predictive power of EP300 occupancy, H3K4 monomethylation and H3K27 acetylation in enhancer prediction
    2.2.5 Effective combinations of epigenetic features for prediction of enhancers
    2.2.6 Motifs that distinguish TAL1-bound enhancers from inactive TAL1 OSs
    2.2.7 Conservation as an illuminator, not a predictor
  2.3 Discussion
  2.4 Methods
    2.4.1 ChIP-seq data for epigenetic features
    2.4.2 Enhancer assays by transient transfection for K562 cells
    2.4.3 Transgenic mouse assays (VISTA Enhancer Browser)
    2.4.4 Clustering algorithms
    2.4.5 Measuring discriminatory power of transcription factors and histone modifications to identify enhancers
    2.4.6 Identification of significantly enriched motifs by employing the computer program, Discriminating Matrix Enumerator
    2.4.7 Analyses of sequence conservation and motif preservation
  2.5 Data access
Chapter-3 Developing transfection methods for cell lines that allow interrogation of different aspects of erythroid differentiation
  3.1 Transient transfections to test CRMs
  3.2 Transcription factors present in different cell line models of erythroid cells
  3.3 Erythroid enhancement with HBG1 in K562 cells
  3.4 GATA1 responsive expression from HBG1 promoter in G1E-ER4 system
  3.5 GATA1-dependent enhancement with Vav2 promoter in G1E-ER4 system
  3.6 Potential repressor effect of GATA1-ER on expression in G1E-ER4 system
  3.7 Discussion
  3.8 Methods
    3.8.1 Enhancer assays by transient transfection for G1E-ER4 cell system
Chapter-4 Conserved Transcription Factor Occupancy and Enhancer Usage in Different Cell Systems and Multiple Tissues
  4.1 Introduction
  4.2 Contribution of conserved GATA1 occupancy to enhancer activity in different cell systems
  4.3 Conserved occupancy is associated with enhancer activity in multiple tissues
  4.4 Discussion
  4.5 Methods
    4.5.1 Functional assays in K562, G1E, and G1E-ER4 cell transfection systems
    4.5.2 Enhancer assays by transient transfection for MEL cells
    4.5.3 Transgenic mouse reporter assays
Chapter-5 Summary and Future Directions
References
Appendix-A Supplementary Figures and Tables
Appendix-B Principles Of Regulatory Information Conservation Between Mouse And Human


LIST OF FIGURES

Figure 1.1 Diagram of a classic gene regulatory region
Figure 1.2 DNA looping model
Figure 1.3 Features of CRMs
Figure 1.4 Consequences of deletion and mutations of the limb enhancer of SHH
Figure 1.5 Several inflammatory disease-associated GWAS SNPs in a genomic region
Figure 1.6 Stages of erythropoiesis
Figure 1.7 GATA1, TAL1 and their partners at erythroid promoter region
Figure 1.8 Genomic methods for predicting enhancers
Figure 2.1 Genome-wide prediction of TAL1 OSs as erythroid enhancers
Figure 2.2 Erythroid enhancer activity of TAL1 OSs in a transient transfection assay
Figure 2.3 Tissue-specific enhancer activities of TAL1 OSs in transgenic mouse assays
Figure 2.4 Classification of TAL1 OSs based on epigenetic features
Figure 2.5 Enhancer activities of TAL1 OSs partitioned by clusters
Figure 2.6 Meta-analysis of contributions of epigenetic features to enhancer activity in transient transfection assays
Figure 2.7 Identification of motifs enriched in the TAL1-bound enhancers and inactive segments
Figure 2.8 Contributions of WGATAR motif preservation to enhancer activity of TAL1 OSs
Figure 3.1 Summary of some differences in the panel of transcription factors in G1E and G1E-ER4 cells
Figure 3.2 Illustration of constructs being tested
Figure 3.3 Relative expression levels driven by HBG1 promoter and by GATA1-bound enhancers (GHP88 and GHP181), with HBG1 promoter in K562 cells
Figure 3.4 Relative expression levels driven by HBG1 promoter without an insert and with GATA1-bound inserts GHP88 and GHP181, and Vav2 promoter without an insert in the G1E and G1E-ER4 cells, following a time course after activation of GATA1-ER
Figure 3.5 GATA1-dependent enhancement with Vav2 promoter after transfection into G1E and G1E-ER4 cells
Figure 4.1 Prediction of enhancers based on conserved GATA1 occupancy between mouse and human
Figure 4.2 Transient transfection results of 5 conserved GATA1 occupied segments
Figure 4.3 Conserved GATA1-occupancy is linked to enhancement in multiple tissues and chromatin accessibility
Figure 4.4 Distribution of numbers of co-bound TFs and conserved occupancy
Figure 4.5 Enhancers discovered by GATA1 binding are active in tissues with other GATA factors
Supplemental Figure 2.1 TAL1-bound enhancers
Supplemental Figure 2.2 The meta-data, 273 mouse DNA segments
Supplemental Figure 2.3 Motif analyses
Supplemental Figure 2.4 Conservation scores
Supplemental Figure 2.5 Percentage of TAL1 OSs within enhancer-promoter units in each cluster
Supplemental Figure 2.6 UCSC genome browser views of two regions including two validated TAL1-bound enhancers


LIST OF TABLES

Table 2.1 ChIP-seq datasets
Table 4.1 Transient transfection results of 5 GATA1 occupied segments (from ChIP-seq in 24 hour-induced G1E-ER4 cell line) conserved between mouse and human and tested for different cells
Table 4.2 Result of in-vivo enhancer assay of 10 GATA1 occupied segments that are conserved between mouse and human
Supplemental Table 2.1 Primer information on 70 TAL1 OSs transiently transfected into K562 cells
Supplemental Table 2.2 Individual transient transfection results on 70 TAL1-bound regions
Supplemental Table 2.3 Results of 66 TAL1-bound segments in transgenic mouse reporter assay at E11.5
Supplemental Table 2.4 Genomic coordinates of meta-data (273 DNA segments) and their response to transient transfection assay
Supplemental Table 2.5 p-values of t-test on each epigenetic feature of meta-data (273 DNA segments)
Supplemental Table 2.6 Feature combination assessment (TAL1, GATA, EP300 and different combinations of them with H3K4me1, H3K4me3 and H3K27ac) using sensitivity, specificity and precision values
Supplemental Table 2.7 The top 10 overrepresented motifs (identified by DME) in TAL1-bound active enhancers over inactive segments in transient transfection assay
Supplemental Table 2.8 The top 10 overrepresented motifs (identified by DME) in TAL1-bound inactive segments over active enhancers in transient transfection assay
Supplemental Table 2.9 The top 10 overrepresented motifs (identified by DME) in TAL1-bound positives over negatives in transgenic mice assay
Supplemental Table 2.10 The top 10 overrepresented motifs (identified by DME) in TAL1-bound negatives over positives in transgenic mice assay
Supplemental Table 2.11 Matched motifs (found by TOMTOM) in only active enhancers and in only inactives
Supplemental Table 2.12 Conservation scores of meta-data (273 DNA segments) tested in transfection assay and 66 TAL1 OSs tested in transgenic mice
Supplemental Table 2.13 Preserved WGATAR motif (1 for preserved motif; 0 for not preserved motif; -1 for no motif) at 151 GATA1-bound and 151 GATA1-TAL1 co-bound segments (1: occupancy and 0: no occupancy)
Supplemental Table 2.14 Closest genes to TAL1 bound in vivo enhancers and the expression levels of some genes observed in G1E, uninduced-, 24 hour-induced- and 30 hour-induced- G1E-ER4 cells
Supplemental Table 2.15 Linking TAL1-bound enhancers to target genes covered by Enhancer-Promoter Units


ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to my advisor, Dr. Ross Hardison, for providing me with this research opportunity, and for his patience, enthusiasm, motivation, and enormous knowledge.

I gratefully thank my committee members for their valuable comments and advice in the process of my studies.

I am thankful to the great members of the Hardison Laboratory, current and previous, for their practical and great help with my thesis.

Finally, and most importantly, I would like to thank my grandfather Musa Palit, grandmother Suna Palit, my mother Sebahat Dogan, my father Yaser Dogan, my brother Yunus Dogan, my fiancé Kivanc Artun, and other members of my family. I could not work and live happily without their love, support, encouragement, sensitivity, and deep understanding.

Chapter 1

Introduction to Regulation of Transcription by Cis-Regulatory Modules (CRMs)

1.1 Regulation of transcription

Multicellular organisms develop from a single-celled zygote into more specialized cell types by the process of cell differentiation. Different cell types share the same genome. If different cells have the same genetic blueprint, how do they become so different? What drives the differentiation process? How is this complex process orchestrated? The answer lies in differential gene expression.

Gene expression is the process by which the DNA sequence of a gene is used to direct the synthesis of proteins and cell structures. A particular cell type requires the expression of specific genes and regulates the expression of multiple genes. Gene expression in eukaryotes occurs in two main stages: (1) transcription, a process in which RNA polymerase produces messenger RNA (mRNA) by transcribing genetic information from the DNA sequence, and (2) translation, a post-transcriptional process in which mRNA is used to produce a polypeptide with a specific sequence. RNA polymerase II (Pol II) is localized to particular sites on the genome where transcription occurs. The initiation of transcription requires recruitment of the transcription machinery. The components of the general transcription machinery are Pol II and several general transcription factors (GTFs: TFIIA, TFIIB, TFIID, TFIIE, TFIIF and TFIIH). Pol II and the GTFs form a pre-initiation complex (PIC) that assembles at promoters to initiate transcription. The formation of the PIC commonly starts with binding of TFIID to the TATA box, the initiator (Inr), and/or the downstream promoter element (DPE), followed by recruitment of other GTFs and Pol II (Thomas and Chiang 2006). This promoter-bound complex is sufficient to conduct a basal level of transcription; nevertheless, it is not enough for activator-dependent transcription. General cofactors are needed to convey regulatory signals between gene-specific activators and the general transcription machinery. In addition, numerous sequence-specific DNA-binding transcription factors (TFs) are involved in regulation of transcription through regulatory elements called cis-regulatory modules (CRMs), for instance transcriptional enhancers, promoters, insulators, silencers and locus control regions. These elements function through the recruitment of transcription cofactors. Some transcription cofactors, such as TATA-binding protein (TBP)-associated factors (TAFs), Mediator, and EP300/CBP, are directly linked to the general transcription machinery. Others are indirectly associated with transcription, controlling chromatin states through histone modifications, chromatin remodeling or DNA methylation (Kadonaga 2004; Thomas and Chiang 2006). Epigenetic features refer to biochemical events associated with gene expression and regulation that occur during development and differentiation without any change in DNA sequence. Gene activity can be monitored, and CRMs comprehensively predicted, using epigenetic features such as histone modifications, transcripts, occupancy by TFs, and DNase-hypersensitive sites (DHSs) (Cosma 2002; Jaenisch and Bird 2003; Goldberg et al. 2007; Kouzarides 2007; Hardison 2012).

1.2 Cis-regulatory modules (CRMs)

The process of gene expression plays a crucial role in the lineage commitment and cellular differentiation that lead to animal development. Gene expression patterns and binding patterns of transcription factors differ among cell types even though the DNA sequence of these cells is the same. Critical players in gene expression include cis-regulatory modules, the DNA sequences that regulate the level, timing and tissue-specificity of expression of target genes located in cis, meaning on the same chromosome. The modules are clusters of regulatory elements, which are binding sites for transcription factors. The two families of CRMs are promoters and distal regulatory elements (Fig. 1.1). A promoter, the first distinct family of CRMs, consists of a core promoter and nearby promoter elements. Promoters are required for transcriptional initiation at the correct position. The second main group of CRMs, distal regulatory elements, is composed of enhancers, silencers, insulators, and locus control regions (LCRs). These cis-acting transcriptional regulatory DNA elements work either to enhance or to repress transcription of genes. Enhancers increase expression levels, silencers decrease expression, and insulators are boundary elements that block the interaction between promoters and enhancers (Maston et al. 2006; Wallace and Felsenfeld 2007; Jeziorska et al. 2009; Hardison and Taylor 2012).

Figure 1.1 Diagram of a classic gene regulatory region (Maston et al 2006).


Identification and characterization of regulatory networks is important for a global understanding of gene regulation in developmental biology. Changes in gene regulatory modules can be major determinants of evolution, and genetic variants in regulatory modules can lead to aberrant regulation of genes and their products, which is linked to many complex disease processes, including cancers, heart diseases, diabetes and some autoimmune diseases (Furniss et al. 2008; Zeggini et al. 2008; Visel et al. 2010; Bonifer and Cockerill 2011; Attanasio et al. 2013).

1.2.1 Enhancers

Enhancers are a category of cis-acting DNA sequences that up-regulate transcription and are crucial for developmental gene expression. The first enhancers were identified in the genome of the SV40 (Simian Virus 40) tumor virus, where they notably enhanced transcription of a heterologous human gene containing a promoter (Banerji et al. 1981; Atchison 1988). These regulatory elements usually extend a few hundred base pairs (bp) and contain binding site sequences for transcription factors (typically 6- to 20-bp motifs). They can function at different distances from their target promoter(s) and can be found upstream or downstream of genes, or within introns (Lettice et al. 2003; Kleinjan and van Heyningen 2005). Thus, the question to be addressed is: “How do these long-distance transcriptional control elements work at distances of several hundred kilobases or even megabases?” The answer lies in the DNA-looping model, in which the enhancer and core promoter are brought into close proximity by looping, which is thought to be formed by cohesin and Mediator complexes, as illustrated in Figure 1.2 (Vilar and Saiz 2005; Amano et al. 2009; Shlyueva et al. 2014).

Figure 1.2 DNA looping model (Shlyueva et al. 2014). Panels: (a) ChIP-seq for a TF; (b) DNase-seq; (c) ChIP-seq for chromatin marks; (d) ChIP-seq for Mediator and cohesin; (e) ChIA-PET and chromosome conformation capture methods.

Based on the classical model, general transcription factors and RNA polymerase II are recruited to gene promoters by sequence-specific activators and co-regulators (Orphanides and Reinberg 2002; Thomas and Chiang 2006). Unusually, the pre-initiation complex is formed not only at core promoters but also at enhancers, so that the timing of transcription activation can be controlled during development and differentiation (Szutorisz et al. 2005). Upon androgen induction, Pol II and histone acetyltransferases (HATs) are recruited to the enhancer first, and Pol II binds at the promoter of the androgen-responsive prostate-specific antigen gene, PSA, at a later stage (Shang et al. 2002; Louie et al. 2003). Additionally, Pol II and the HAT CREB-binding protein (CBP) bind at the enhancer element that regulates activation of the pDβ1 gene located in the T-cell receptor (TCR)-β locus. The binding to the enhancer controls transcriptional initiation in T cells (Spicuglia et al. 2002). Therefore, binding of Pol II to enhancers suggests binding by general transcription factors to these transcriptional regulatory elements (Szutorisz et al. 2005).

1.2.2 Identify enhancers

It is a challenge to identify cis-regulatory modules (CRMs) within large, complex genomes. Three major approaches, either singly or in combination, can be used to find CRMs: (1) evolutionary constraint in non-coding regions, (2) clusters of short DNA motifs, and (3) epigenetic marks on chromatin, such as transcription factors and histone modifications specific for different CRMs (Fig. 1.3) (Hardison and Taylor 2012). The first approach uses evolutionary constraint to find the subset of CRMs under strong purifying selection. Strong evolutionary constraint in non-coding DNA segments can be used to identify functional sequences in genomes. About half of the CRMs predicted based on strongly constrained noncoding sequences are active enhancers (Pennacchio et al. 2006). Moreover, conserved noncoding sequences from alignments, combined with preservation of motif instances, also identified active enhancers with good accuracy (Wang et al. 2006).

Secondly, CRMs can be predicted from motif instances. Regulatory elements such as enhancers are clusters of motifs for TF binding sites, and these clusters can be bound by TFs. However, not every motif instance is bound by a TF. For example, there are 8 million motif instances for binding GATA factors (WGATAR motif) in the mouse genome, and just 15,000 are bound in erythroid cells, which means that only about 1 in 500 motif instances is bound. Thus, we need to know which instances are actually bound by the transcription factors. The third approach utilizes epigenetic signatures to identify CRMs. Specific epigenetic features are strongly associated with CRMs. Histone tails carry modifications characteristic of each class of regulatory element. These epigenetic features can be mapped with high resolution and accuracy by using antibodies against histone modifications or TFs. H3K4me1 and H3K27ac are two powerful histone marks for finding enhancers with good accuracy (Heintzman et al. 2007; Heintzman et al. 2009; Creyghton et al. 2010; Zentner et al. 2011). The success rate of identifying enhancers by using transcription factor occupancy is encouraging, yet not perfect. Enhancers were predicted based on tissue-specific TFs, such as GATA1 in erythroid cells (Cheng et al. 2008) and MYOD in muscle cells (Cao et al. 2010), with a 40-50% success rate. Heart enhancers were identified using the presence of EP300 with high accuracy (75%; Blow et al. 2010). EP300 plus H3K4me1 predicted enhancers in melanocytes with 70% to 86% success (Gorkin et al. 2012).
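The gap between motif instances and bound sites can be made concrete with a short sketch. The Python below is illustrative only and is not the method used in this work; the function names `find_wgatar` and `fraction_bound` and the toy sequence are hypothetical. It scans both strands of a sequence for WGATAR (W = A or T, R = A or G; the reverse complement is YTATCW) and then repeats the arithmetic for the counts quoted in the text.

```python
import re

# IUPAC codes in the WGATAR motif: W = A or T, R = A or G.
# Lookahead patterns are used so that overlapping instances are all reported.
WGATAR = re.compile(r"(?=([AT]GATA[AG]))")
# Reverse complement of WGATAR is YTATCW (Y = C or T), i.e. the minus strand.
YTATCW = re.compile(r"(?=([CT]TATC[AT]))")


def find_wgatar(seq):
    """Return sorted (start, strand) pairs for every WGATAR instance in seq."""
    seq = seq.upper()
    hits = [(m.start(), "+") for m in WGATAR.finditer(seq)]
    hits += [(m.start(), "-") for m in YTATCW.finditer(seq)]
    return sorted(hits)


def fraction_bound(n_bound, n_instances):
    """Fraction of motif instances that are occupied by the factor."""
    return n_bound / n_instances


# Toy sequence, made up for illustration (not taken from any genome).
toy = "CCAGATAAGGTTATCTCC"
print(find_wgatar(toy))  # [(2, '+'), (10, '-')]

# Counts quoted in the text: ~8 million instances, ~15,000 bound.
print(round(1 / fraction_bound(15_000, 8_000_000)))  # 533
```

With the quoted counts the ratio is about 1 in 533, consistent with the approximate figure of 1 in 500, so a motif scan alone overpredicts occupancy by more than two orders of magnitude.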

Figure 1.3 Features of CRMs

1.2.3 How do enhancers influence human disease?

Because regulatory elements are so important during development, their misregulation is likely to affect phenotype. Mutations in distal cis-regulatory elements, the great majority of which are enhancers, contribute to human diseases (Hindorff et al. 2009), and variation in enhancers causes Mendelian disorders (Kleinjan and van Heyningen 2005; Noonan and McCallion 2010). A striking example with severe phenotypic consequences is dysregulation of sonic hedgehog (SHH) expression, which leads to limb malformations. Point mutations in the limb enhancer that controls expression of SHH from approximately 1 megabase away were shown to bring about preaxial polydactyly in different families (Fig. 1.4A,C). Deletion of this enhancer causes severely truncated limbs (Fig. 1.4B) (Lettice et al. 2003; Sagai et al. 2005; Visel et al. 2009). Another example is Van Buchem disease, in which the sclerostin gene is misregulated as a result of deletion of a long-range bone enhancer located ~35 kb downstream of the gene (Loots et al. 2005). Another clear illustration is that microdeletions 900 kb upstream of the POU3F4 gene are linked to X-linked deafness type 3 (de Kok et al. 1996). In addition, the locus control region (LCR), the main upstream control region for the β-globin locus, includes hypersensitive sites (cis-regulatory sites) that drive increased expression of the β-globin genes in erythroid cells (Grosveld et al. 1987; Tolhuis et al. 2002). Translocations in the β-globin locus lead to thalassaemias. Moreover, thalassaemias arise from disruption of the association between the globin genes and their long-range cis-regulatory elements (Kleinjan and Lettice 2008).

The National Human Genome Research Institute (NHGRI) Catalog of Published Genome-Wide Association Studies (GWAS) is a publicly available collection of published GWAS assaying single nucleotide polymorphisms (SNPs) and SNP-trait/disease associations (Hindorff et al. 2009; Welter et al. 2013) (http://www.genome.gov/gwastudies). A majority of SNPs linked to traits and diseases occur within non-coding functional regulatory elements: 88% of trait-associated SNPs mapped by GWAS are located in either intronic or intergenic regions (Hindorff et al. 2009). A recent study tested the hypothesis that SNPs linked to diseases tend to be located in non-coding regulatory elements such as enhancers. It showed that non-coding SNPs associated with asthma, the model system used in the study, are enriched in regulatory regions marked by histone modifications such as H3K4me1 and H3K27ac (Gerasimova et al. 2013).


Figure 1.4 Consequences of deletion and mutations of the limb enhancer of SHH.

A. The limb enhancer of SHH is found approximately 1 Mb away from its target promoter. The red arrow indicates that the enhancer targets expression of the gene to the posterior of the developing limb bud in a transgenic mouse enhancer assay. B. Wild-type mouse (left) and mouse with truncated limbs caused by targeted deletion of the limb enhancer (right). C. Preaxial polydactyly due to point mutations in the orthologous sequence of this enhancer in human.

Complex traits and disease phenotypes can be associated with transcription factor (TF) occupancy. Some trait-associated SNPs overlap with DNase-hypersensitive regions and TF binding sites, and they are especially enriched in regions associated with enhancers and transcription start sites across various cell types (ENCODE Project Consortium 2012). For instance, Figure 1.5 shows several GWAS SNPs associated with inflammatory diseases that reside close to GATA2-occupied segments. Specifically, the Crohn’s disease-associated SNP (rs1174250) overlaps with a GATA2 binding signal in HUVECs (human umbilical vein endothelial cells) (Fig. 1.5) (ENCODE Project Consortium 2012).

Figure 1.5 Several inflammatory disease-associated GWAS SNPs in a genomic region.

The Crohn’s disease-associated SNP “rs1174250” coincides with GATA2 occupancy and a DNase I profile, indicating possible functionality.

1.2.4 How important are enhancers for evolution

Mutation and selection are agents of genome evolution. Genomic changes in an enhancer can alter its functionality. Constrained enhancers are presumed to improve fitness; however, it is unlikely that active enhancers are neutrally evolving DNA (Pennacchio et al. 2013). Almost a fifth of non-coding human genomic regions overlap mobile elements, and they are under purifying selection (Lindblad-Toh et al. 2011). Most of these regions are associated with gene regulatory elements and very probably enhancers (Pennacchio et al. 2006; Visel et al. 2008).

Various studies suggest that the evolution of particular phenotypes, ranging from insects to humans, is possibly driven by mutations in enhancers. For instance, mutations in enhancers in insects are involved in Drosophila body and wing pigmentation and in the development of larval trichomes. Human-specific mutations alter the expression domains of enhancers of critical developmental genes (Carroll 2008; Levine 2010; Wittkopp and Kalay 2012). Various adaptations specific to human populations, such as lactase persistence, can be regulated by mutations in the regulatory network (Fang et al. 2012). Furthermore, more than 80% of loci under adaptive evolution are probably regulatory regions in the marine-to-freshwater adaptation of sticklebacks. Conserved non-coding intergenic regions located proximal to marine-freshwater divergent loci are associated with regulatory evolution (Jones et al. 2012).

The FANTOM5 CAGE (Cap Analysis of Gene Expression) atlas defines 43,011 enhancer candidates detected by bidirectional capped transcription. While bidirectional CAGE tags strongly predict cell-type-specific enhancer activity, a subset of the enhancers were ubiquitous and twice as conserved. Ubiquitous enhancers are not only marked by typical chromatin enhancer marks but also have a higher H3K4me3 signal (Andersson et al. 2014). In addition, pleiotropic enhancer activity, in which a DNA segment shows activity in multiple tissues, is commonly detected for TF-occupied segments in mouse whose orthologs in humans are also bound by the orthologous TF (Cheng et al. submitted).


1.3 Regulation of erythropoiesis via transcriptional enhancers and histone modifications

Induction of lineage-specific genes and repression of genes from outside the lineage bring about commitment and differentiation of multipotent cells.

Transcriptional regulation includes CRMs bound by transcription factors, function of coactivators and corepressors, and histone modifications (Tsiftsoglou et al. 2009; Lee and Young 2013).

Mouse erythropoiesis was used in this study as a model system for investigating mechanisms of regulation by CRMs. This process of differentiation is thought to occur hierarchically, following a series of cell fate decisions. The induction and repression of specific genes during hematopoiesis drive the commitment of hematopoietic stem cells (HSC) and progenitor cells into specific lineages, and determine the differentiation and maturation of these lineages. The commitment of HSC into a common myeloid progenitor (CMP) results in either of two bi-potential cells: the colony-forming unit (CFU)-erythrocyte/megakaryocyte or the CFU-granulocyte-macrophage (Philipsen et al. 2009) (Fig. 1.6).

Figure 1.6 Stages of erythropoiesis (Philipsen et al. 2009)


1.3.1 Transcription factors

Transcription factors (TFs) can be categorized into two classes based on their function in regulation: control of initiation and control of elongation (Fuda et al. 2009; Rahl et al. 2010; Selth et al. 2010; Zhou et al. 2012). TFs commonly interact with cofactors, protein complexes that coregulate transcription by activating it (coactivators) or repressing it (corepressors). Most TFs are involved in transcription initiation by recruiting coactivators, such as EP300, general transcription factors and the mediator complex. Some coactivators, such as CBP (CREB-binding protein) or EP300 (P300 or E1A binding protein P300), have histone acetyltransferase (HAT) activity, and others can recruit proteins with HAT activity to promoters. These co-activating proteins acetylate the amine group of histone lysine residues, neutralizing the positive charges of the histone proteins; the DNA then unwinds from the histones and becomes accessible to binding by TFs. In contrast, corepressors can repress transcriptional initiation by binding to TFs and recruiting histone deacetylases (HDACs), which hydrolyse the acetyl groups from acetylated lysine residues. This increases the positive charge on histones and makes DNA less accessible for transcription (Goodson et al. 2005). In addition, the mediator complex functions as a transcriptional coactivator in eukaryotes and is critical for RNA polymerase II-dependent transcription. This multiprotein complex is involved not only in activator-dependent transcription but also in activator-independent transcription. For example, it can be important for controlling the formation of the initiation complex (Sikorski and Buratowski 2009; Taatjes 2010). Recent studies have highlighted the importance of cofactors that are crucial to the formation and maintenance of DNA looping between enhancers and core promoter elements during transcription initiation.

Transcription factor TAL1 (T-Cell Acute Lymphocytic Leukemia 1), also called SCL, belongs to the basic Helix-Loop-Helix (bHLH) family. TAL1 binds to the E-box, which has the consensus sequence CANNTG, by forming a heterodimer with E2A, HEB or E2-2 (Cantor et al. 2002) (Fig. 1.7). GATA1 is a protein required for the survival and maturation of primitive and definitive erythroid precursors (Fujiwara et al. 1996). GATA1 has two zinc fingers (C- and N-finger) and an N-terminal activation domain. The C-finger is involved in the recognition of the consensus WGATAR sequence. The N-finger increases the stability of the interaction between the C-finger and the DNA sequence and is necessary for recruiting cofactors such as FOG1 (Martin et al. 1990; Trainor et al. 2000) (Fig. 1.4). Together, GATA1 and TAL1 are strongly associated with induction of gene expression in erythroid cells (Wozniak et al. 2008; Tripic et al. 2009; Cheng et al. 2009; Soler et al. 2010). In addition to these two master erythropoietic regulator proteins, the co-activator protein EP300, which is a histone acetyltransferase, is critical for gene regulation and is a powerful signature for finding enhancers genome-wide.
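The two consensus sequences named here, CANNTG for the E-box and WGATAR for the GATA motif, are written in IUPAC nucleotide codes (N is any base, W is A or T, R is A or G). As a minimal sketch of how such motifs might be located in a DNA sequence, the following Python code translates a consensus string into a regular expression and scans both strands. The function names and the test sequence are hypothetical and are not taken from the analyses in this thesis.

```python
import re

# IUPAC nucleotide codes needed for the two consensus motifs in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "[ACGT]",  # any base
    "W": "[AT]",    # weak: A or T
    "R": "[AG]",    # purine: A or G
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def motif_to_regex(motif):
    """Translate an IUPAC consensus string into a compiled regular expression."""
    body = "".join(IUPAC[base] for base in motif)
    # The lookahead allows overlapping matches to be reported.
    return re.compile("(?=(" + body + "))")


def find_motif(seq, motif):
    """Return sorted (start, strand, match) hits; start is 0-based on the forward strand."""
    seq = seq.upper()
    pattern = motif_to_regex(motif)
    width = len(motif)
    hits = {(m.start(), "+", m.group(1)) for m in pattern.finditer(seq)}
    for m in pattern.finditer(reverse_complement(seq)):
        # Map the reverse-strand match back to forward-strand coordinates.
        hits.add((len(seq) - m.start() - width, "-", m.group(1)))
    return sorted(hits)


# Hypothetical test sequence containing one E-box and one GATA motif.
example = "GGCAGCTGTTAGATAAGCC"
ebox_hits = find_motif(example, "CANNTG")
gata_hits = find_motif(example, "WGATAR")
```

Because CAGCTG is its own reverse complement, the E-box in this example is reported once on each strand at the same position, whereas the GATA motif is found only on the forward strand.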

Figure 1.7 GATA1, TAL1 and their partners at an erythroid promoter region (source: Ross Hardison).

1.3.2 Histone modifications

The nucleosome, the core unit of chromatin, consists of an octamer of the four core histones (H2A, H2B, H3 and H4) and the DNA wrapped around it (Grant 2001). This structure controls the accessibility of cis-regulatory sequences to transcription factors, so it is important to regulation of gene expression. The N-terminal tail of each histone protein provides sites for different post-translational modifications, such as methylation, acetylation and phosphorylation (Grant 2001; Jenuwein et al. 2001; Turner 2005). Acetylation (of H3 and H4) is linked to gene activation, while methylation is associated with either activation (H3K4 methylation) or repression (H3K9 and H3K27 methylation) (Jenuwein et al. 2001). Histone acetylation signals accessible chromatin and activation of transcription (Grant 2001; Kouzarides 2007; Robertson et al. 2008; Hon et al. 2009). Enrichment of H3K4me3 is associated with active promoters (Barski et al. 2007; Birney et al. 2007; Heintzman et al. 2007; Kouzarides 2007). The presence of H3K4me1 with little or no trimethylation of lysine 4 on H3 is a powerful predictor of enhancers, with good accuracy (Heintzman et al. 2007; Heintzman et al. 2009). Furthermore, enrichment of H3K27ac distinguishes active enhancers from poised enhancers and inactive enhancers (Creyghton et al. 2010; Zentner et al. 2011). These findings suggest that histone modifications play a critical role in transcriptional regulation.
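The associations described in this paragraph can be summarized as a simple decision rule. The Python sketch below is illustrative only: the function name, the use of yes/no peak calls, and the output labels are assumptions introduced here, and real analyses work with quantitative signals rather than binary calls.

```python
def classify_segment(h3k4me1, h3k4me3, h3k27ac):
    """Label a DNA segment from boolean peak calls for three histone marks.

    The rule encodes only the associations stated in the text:
    H3K4me3 with active promoters, H3K4me1 without H3K4me3 with enhancers,
    and H3K27ac with active (rather than poised or inactive) enhancers.
    """
    if h3k4me3:
        return "promoter-like"
    if h3k4me1 and h3k27ac:
        return "active enhancer-like"
    if h3k4me1:
        return "poised or inactive enhancer-like"
    return "no enhancer signature"


# Hypothetical peak calls for four segments: (H3K4me1, H3K4me3, H3K27ac).
segments = {
    "seg1": (True, False, True),
    "seg2": (True, False, False),
    "seg3": (True, True, True),
    "seg4": (False, False, False),
}
labels = {name: classify_segment(*marks) for name, marks in segments.items()}
```

A rule of this kind ignores transcription factor occupancy and chromatin accessibility, so it should be read as a restatement of the cited associations rather than as a working predictor.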

Enhancers are DNA segments accessible to proteins; nevertheless, these cis-acting transcriptional elements are not constantly open and may require stimuli to become accessible, for instance following erythrocyte differentiation and treatment (He et al. 2010; Hu et al. 2011; Morrissey CS 2013). These stimuli lead to dynamic nucleosome repositioning, which involves chromatin remodelers such as BAF (BRG1-Associated Factors), the human analogs of SWI/SNF (SWItch/Sucrose NonFermentable) (Hargreaves and Crabtree 2011). Moreover, pioneer factors, such as FOXA1, recruit chromatin remodelers to open chromatin and so facilitate binding of TFs at certain enhancers (Kaestner 2010).

Acetylation of lysine residues is the most dynamic histone modification, since this modification controls changes in chromatin structure and gene transcription (Neely and Workman 2002; Sanchez and Zhou 2009). Chromatin remodeling and histone modification change chromatin structure to initiate transcription. Histone acetyltransferases (HATs), chromatin-modifying complexes such as GCN5 and EP300/CBP, use acetyl-CoA to acetylate lysines in the N-terminal tails of the core histones of the nucleosome. Acetylation reduces the strength of interactions between nucleosomes, causing the chromatin fiber to decondense. Furthermore, acetylation functions as a signature recognized by bromodomains in Swi/Snf and TAF1. EP300 (adenoviral E1A-associated protein of 300 kDa)/CBP (CREB-binding protein) is one main family of HATs. Members of this family have HAT domains nearly 500 residues long, as well as bromodomains and three cysteine-histidine rich domains that promote protein interactions. The acetyl signature enables transcription factors to recognize and interact with the acetylated histone tails via their bromodomains. The highly conserved bromodomains are crucial to the binding of proteins to acetylated lysine (Sanchez and Zhou 2009; Filippakopoulos and Knapp 2014).

ChIP-seq (chromatin immunoprecipitation followed by massively parallel DNA sequencing) uses antibodies against a protein of interest to identify binding sites of transcription factors genome-wide (Fig. 1.8A). Accessible DNA regions, where active regulatory elements reside, are depleted of nucleosomes, and these regions can be identified by DNase I digestion followed by deep sequencing (Fig. 1.8B). ChIP-seq can also be used to detect nucleosomes that flank active enhancers and carry typical enhancer-associated histone marks (Fig. 1.8C) (Shlyueva et al. 2014). Data sets from DNase-seq for open chromatin and ChIP-seq for three transcription factors (TAL1, GATA1, EP300) and five histone modifications (H3K4me1, H3K4me3, H3K27ac, H3K9me3, H3K27me3) were used in this study.

Figure 1.8 Genomic methods for predicting enhancers.

A. ChIP-seq for a transcription factor. B. DNase-seq for open chromatin. C. ChIP-seq for chromatin marks (Shlyueva et al. 2014).

1.4 Statement of Thesis

In this thesis, I studied the association between biological function of genomic DNA segments and robust signals of biochemical activity.

Chapter 2 provides new information on functional characterization of biochemically active regulatory elements. I show that the biochemical characteristic of being occupied by TAL1 (determined by ChIP-seq in erythroblasts) is sufficient in a majority of cases to observe enhancement ability of these bound DNA segments. In addition, I explore the effectiveness of combinations of different epigenetic features for prediction of enhancers. Furthermore, motif and conservation analyses were also done on enhancers to find out the contributions of motif preservation and conserved TF occupancy to enhancer activity. A manuscript describing the studies in this chapter, entitled “Epigenetic and genetic features that lead to discovery of enhancer function” has been submitted.

Chapter 3 describes work undertaken to improve transfection protocols in cell lines with GATA1 knock-out and rescue genotypes, thereby allowing studies of wild-type and mutant cis-regulatory modules while controlling changes in the trans environment.

The studies in this chapter showed GATA1-dependent enhancement with two different promoters and a repressive effect of the GATA1-ER hybrid protein on expression.

Chapter 4 of this thesis reveals the association of conserved transcription factor occupancy with enhancer usage in multiple tissues. Conserved GATA1 occupancy between mouse and human revealed enhancer activity in multiple tissues. We have contributed this insight that conserved TF occupancy is associated with activity of ubiquitous enhancers, along with some of the key data in Chapter 4, to a paper that is a collaboration with Michael Snyder’s lab. This paper, with Yong Cheng and Zhihai Ma as co-first authors, is entitled “Principles of regulatory information conservation revealed by comparing mouse and human transcription factor binding profiles.” The paper has been reviewed by Nature, and the revised version of the manuscript was resubmitted to Nature.

Chapter 5 presents general discussion and the future of this work.


Chapter 2

Epigenetic and Genetic Features that Lead To Discovery of Enhancer

Function

Statement of collaboration

Chapter 2 contains data used in a research article, “Epigenetic and genetic features that lead to discovery of enhancer function” (Dogan et al. submitted). Nergiz

Dogan, the author of this thesis, carried out most of the analyses in this chapter and performed the experiments on cloning, cell culture, transient transfections, and enhancer assays in cultured cells. Maria Long contributed to cell culture experiments and some transfections. Weisheng Wu contributed to analysis of some of the ChIP-seq data for

TAL1, GATA1, and H3K4me1/H3K4me3. Christapher S. Morrissey and Kuan-Bei provided data for conservation scores and conserved WGATAR motif, respectively.

Yong Cheng and Deepti Jain provided conserved occupancy data for GATA1. Previously published ChIP-seq data produced by the Ross Hardison laboratory were used, and Cheryl A. Keller contributed to conducting ChIP-seq experiments. Transgenic mouse reporter assays were conducted in Len A. Pennacchio’s Laboratory.


2.1 Introduction

Accurate identification of cis-regulatory modules (CRMs) is essential for understanding mechanisms of gene regulation, modeling regulation during differentiation, and interpreting the effects of genetic variants associated with complex traits.

Challenges to meeting this goal are formidable (Hardison and Taylor 2012). CRMs are most frequently located in the vast majority of the genome that does not code for proteins; thus the search space is enormous. No clear grammar for interpreting the DNA sequences of CRMs has been discovered, so examination of the primary DNA sequence currently has poor power for identifying CRMs (Janky and van Helden 2008). Evidence of strong evolutionary constraint (Hardison 2000; Aparicio et al. 2002; Pennacchio et al. 2006) has consistently revealed CRMs, but these are enriched for certain classes of genes and tissues (Woolfe et al. 2005; Visel et al. 2009; Blow et al. 2010) and do not capture the full spectrum of regulatory regions (Schmidt et al. 2010; Mouse ENCODE Consortium 2014). Application of machine learning to find discriminatory patterns in alignments of known regulatory regions (Taylor et al. 2006; Wang et al. 2006; Göttgens et al. 2010) has also been successful (to an almost 50% success rate in favorable circumstances). Other methods for predicting CRMs based on conservation of noncoding regions have been less successful (Attanasio et al. 2008), illustrating a need for improving CRM prediction by methods in addition to conservation or alignment-based approaches.

At a biochemical level, CRMs are clusters of transcription factor (TF) binding site motifs in the DNA, and they are active when the TFs are bound. The complex of bound proteins tends to interact with co-activators such as EP300 or CBP, and the TF-bound CRM is flanked by nucleosomes with characteristic histone modifications, such as H3K4 monomethylation (H3K4me1) and H3K27 acetylation (H3K27ac) (He et al. 2010; Zentner et al. 2011). All these epigenetic features (proteins or modifications, including those in chromatin, that lie on top of the DNA sequence) have been used for predictions of CRMs.

For instance, the presence of H3K4me1 with little or no trimethylation (H3K4me3) predicted enhancers with good accuracy (Heintzman et al. 2007; Heintzman et al. 2009), enrichment by H3K27ac separated active enhancers from poised enhancers and inactive enhancers (Creyghton et al. 2010; Zentner et al. 2011), and the presence of

EP300 identified heart enhancers with high accuracy (75%, Blow et al. 2010). The combination of H3K4me1 plus EP300 was highly accurate (70% to 86%) for identifying enhancers in melanocytes (Gorkin et al. 2012). Using tissue-specific TFs such as

GATA1 in erythroid cells (Cheng et al. 2008) and MYOD in muscle cells (Cao et al.

2010) to predict enhancers had a good success rate (40-50%), but lower than that for co-activators. Enhancers have also been predicted by integration of several epigenetic features using segmentation in a hidden Markov model framework (Ernst and Kellis

2012); these were shown to be active at a 25% to 41% rate in a high throughput assay

(Kheradpour et al. 2013).

Despite the clear power of epigenetic approaches for CRM prediction, the feature or features with the strongest contribution to predictive accuracy are not known. While it is reasonable to expect integration of diverse signals to improve accuracy, this needs to be tested more broadly. Different groups of features could work better for particular cell or tissue types. The type of assay done for CRMs could also affect the outcome.

To shed additional light on these issues, we used data from the mouse ENCODE project (Mouse ENCODE Consortium 2014) to examine the role of multiple epigenetic features in determining activity of enhancers (tested experimentally) in a tractable system of differentiation. Our initial focus was occupancy by TAL1, since this TF plays multiple key roles in hematopoiesis and is needed for differentiation of erythroid progenitor cells into maturing erythroblasts (Porcher et al. 1996). Experiments using conditional Tal1 knockout mutants and rescue show that TAL1 is required for both specification and differentiation of erythroid and megakaryocytic cells (Schlaeger et al. 2005). Furthermore, the co-binding of TAL1 with GATA1 is strongly associated with gene induction (Wozniak et al. 2008; Tripic et al. 2009; Cheng et al. 2009; Soler et al. 2010).

Thus we began with high quality datasets on TAL1 occupancy in a mouse cell line model for maturing erythroblasts (Wu et al. 2011; Wu et al. 2014) as a single-factor predictor of erythroid CRMs. Remarkably, a majority of the DNA segments predicted as CRMs by this one factor and tested by reporter gene assays in either transfected cells in culture or transgenic mouse embryos were active as enhancers. Integrative analysis of additional epigenetic features at 273 DNA segments showed that the combination of occupancy by

TAL1, GATA1, and EP300 and the histone modifications H3K4me1 and H3K27ac gave a high success rate (71%) while capturing a large fraction of enhancers.

2.2 Results

2.2.1 Genome-wide prediction of regulated erythroid CRMs by TAL1 occupancy

DNase I hypersensitive sites (DHSs) and/or histone modifications in chromatin can be used to predict a large number of candidate regulatory regions (including promoters, enhancers, and insulators) in specific cell types (ENCODE Project

Consortium 2012; Shen et al. 2012; Thurman et al. 2012; Mortazavi et al. 2013). A large survey using cap-analysis of gene expression revealed evidence for over 40,000 enhancers across multiple mammalian tissues (Andersson et al. 2014). We focused on a subset of predicted CRMs that are likely to be involved in regulation during erythroid differentiation, i.e. DNA segments occupied by TAL1 (Wozniak et al. 2008; Tripic et al.

2009; Wu et al. 2011; Huang and Brandt 2000; Elnitski et al. 2001; Wu et al. 2014).

[Figure 2.1, panels A-D (graphics not reproduced). A. UCSC Genome Browser view of mouse (NCBI37/mm9) chr18:32,680,045-32,727,617 (47,573 bp) around Gypc. B. Numbers of TAL1 OSs overlapping (+) or not overlapping (-) each feature: DHSs 4323/592; GATA1 4202/713; EP300 3119/1796; H3K27ac 4370/545; H3K4me1 4740/175; H3K4me3 2830/2085; H3K27me3 335/4580; H3K9me3 248/4667. C. ChromHMM state proportions in G1E, ER4, Ery and Megs. D. Histogram of the number of TAL1 peaks by distance to the closest TSS (kb).]

Figure 2.1 Genome-wide prediction of TAL1 OSs as erythroid enhancers.

A. Example of epigenetic marks overlapping with a TAL1 peak with the body of Gypc. The tracks displayed on the UCSC Genome Browser (Kent et al. 2002) show, in descending order, the DNA segment tested for enhancer activity, occupancy by TAL1 and GATA1 in mouse erythroid cells, the gene model for Gypc, DNase hypersensitive sites in G1E cells, G1E-ER4 cells treated with estradiol, and mouse primary fetal liver-derived early erythroid progenitor cells (EPC CD117+, CD71+, TER119-) and differentiating erythroblasts (EPC CD117-, CD71+, TER119+) (Mouse ENCODE Consortium 2012) followed by ChromHMM segmentations based on histone modifications (Mouse ENCODE Consortium 2012). The colors correspond to states enriched for combinations of histone modifications indicated on the last line. B. Overlap of TAL1 peaks with other epigenetic features (listed in Table 1). C. The proportions of TAL1 OSs (determined in G1E-ER4 cells treated with estradiol) found in each of the nine ChromHMM states in each of four hematopoietic cell types. D. Distribution of positions of TAL1 OSs relative to the transcription start sites (TSSs) of RefSeq genes (Pruitt et al. 2013).

We began with the set of 4915 DNA segments determined by ChIP-seq to be bound by TAL1 (TAL1 occupied DNA segments or TAL1 OSs) (Table 1) in G1E-ER4 cells treated with beta-estradiol, which is a model for erythroblasts differentiating in response to restoration of GATA1 (Weiss et al. 1997; Gregory et al. 1999). The TAL1

OSs overlapped with other features suggestive of regulatory function, as illustrated by the example of a candidate CRM in an intron of the gene Gypc, which encodes the erythroid membrane protein glycophorin C (Fig. 2.1A). TAL1 was bound in regions of accessible chromatin; 88% were in DHSs in G1E-ER4 cells (Wu et al. 2011; Morrissey

CS 2013). Genome-wide, the peaks of TAL1 binding coincided with binding by GATA1

(86%) in the same cell system (Cheng et al. 2009) and the presence of the coactivator

EP300 (64%) in murine erythroleukemia (MEL) cells (Mouse ENCODE Consortium submitted; Cheng et al. submitted) (Fig. 2.1B). As expected (Zhang et al. 2009; Arvey et al. 2012; Kundaje et al. 2012), a majority of the TAL1 bound regions were acetylated at

H3K27 (89%) in MEL cells, and the bound regions were almost uniformly in (or flanked closely by) chromatin with the activating histone modifications H3K4me1 (96%) and

H3K4me3 (58%) in the G1E-ER4 cells. Only a small minority had the H3K27me3 (7%) or

H3K9me3 (5%) modifications associated with gene repression (Fig. 2.1B). The patterns of histone modifications were integrated into discrete chromatin states using the program chromHMM (Ernst and Kellis 2012; Mouse ENCODE Consortium submitted). This revealed chromatin states associated with enhancers (dominated by H3K4me1 alone or in combination with H3K4me3 or H3K36me3) around the TAL1 occupied segment in

Gypc (Fig. 2.1A) and a majority of TAL1 OSs genome-wide (Fig. 2.1C). Moreover, the chromatin states change to the nonpermissive or quiescent states in cell types in which

Gypc is not expressed (Fig. 2.1A).
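The overlap percentages quoted in this paragraph follow from the counts shown in Fig. 2.1B and the total of 4915 TAL1 OSs. The Python sketch below recomputes them as a consistency check. The pairing of each count with its feature is inferred here from the percentages in the text (each pair of counts sums to 4915), and the variable and function names are hypothetical.

```python
TOTAL_TAL1_OS = 4915

# TAL1 OSs overlapping each feature: the (+) counts read from Fig. 2.1B.
# The assignment of counts to features is an inference, not stated in the text.
overlap_counts = {
    "DHS": 4323,
    "GATA1": 4202,
    "EP300": 3119,
    "H3K27ac": 4370,
    "H3K4me1": 4740,
    "H3K4me3": 2830,
    "H3K27me3": 335,
    "H3K9me3": 248,
}

# Percentages as reported in the text.
reported_percent = {
    "DHS": 88, "GATA1": 86, "EP300": 64, "H3K27ac": 89,
    "H3K4me1": 96, "H3K4me3": 58, "H3K27me3": 7, "H3K9me3": 5,
}


def percent_overlap(count, total=TOTAL_TAL1_OS):
    """Percentage of TAL1 OSs that overlap a feature."""
    return 100.0 * count / total


computed_percent = {
    feature: round(percent_overlap(count), 1)
    for feature, count in overlap_counts.items()
}

# Difference between computed and reported values, in percentage points.
differences = {
    feature: abs(computed_percent[feature] - reported_percent[feature])
    for feature in overlap_counts
}
```

Under this pairing every computed value agrees with the reported percentage to within one percentage point; the values for GATA1 and EP300 fall about half a point below the figures quoted in the text.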

The presence of the promoter-associated histone modification H3K4me3 (Birney et al. 2007; Heintzman et al. 2007) suggested that some of the TAL1 OSs were close to transcription start sites (TSS). Indeed, 478 TAL1 OSs were within 1 kb of an annotated

TSS and 65 TAL1 OSs were within 100 bp of a TSS (Fig. 2.1D). Almost all (94%) of the 478 TSS-proximal TAL1 OSs were in genomic segments enriched by the H3K4me3 mark. While these TSS-proximal TAL1 OSs may overlap a gene promoter, it is difficult to distinguish activity directly in a promoter versus in an enhancer that is located adjacent to the promoter. Hence all the TAL1 OSs were considered candidates for enhancers, even though some could also be in DNA segments active as promoters.
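Grouping TAL1 OSs by their distance to the closest TSS, as in Fig. 2.1D, requires a nearest-neighbor lookup for each peak. The Python sketch below shows one way this could be done with a binary search over sorted TSS positions on a single chromosome. The coordinates, the function names, and the choice of inclusive cutoffs are assumptions for illustration; the text does not specify how peak position or the boundaries were defined.

```python
from bisect import bisect_left


def distance_to_nearest_tss(position, sorted_tss):
    """Absolute distance (bp) from a position to the closest TSS.

    sorted_tss must be a non-empty, ascending list of TSS coordinates
    on the same chromosome as the position.
    """
    i = bisect_left(sorted_tss, position)
    # Only the TSSs immediately left and right of the position can be closest.
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(position - tss) for tss in candidates)


def tss_class(position, sorted_tss, near_bp=100, proximal_bp=1000):
    """Assign a peak to a distance class using the two cutoffs from the text."""
    distance = distance_to_nearest_tss(position, sorted_tss)
    if distance <= near_bp:
        return "within 100 bp"
    if distance <= proximal_bp:
        return "within 1 kb"
    return "distal"


# Hypothetical TSS coordinates and peak midpoints on one chromosome.
tss_positions = [1000, 5000, 20000]
peak_midpoints = [1050, 5600, 12000]
classes = [tss_class(p, tss_positions) for p in peak_midpoints]
```

In this sketch the classes are mutually exclusive, so a count of peaks "within 1 kb" in the sense of the text would be the sum of the first two classes.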

2.2.2 Occupancy by TAL1 is a strong predictor of enhancer activity

From the set of 4915 DNA segments bound by TAL1, 70 were chosen to be tested for enhancer activity after transient transfection in the human cell line K562, which has properties of erythroid and megakaryocytic cells (Supplemental Table 2.1; Benz Jr. et al. 1980). The 70 were chosen randomly from eight groups of TAL1 OSs characterized by additional epigenetic features, as described in the next sections. DNA fragments containing the TAL1 OSs, and ranging in length from 243 to 1365 bp (average

690 bp), were cloned into a plasmid with a reporter gene encoding firefly luciferase driven by the promoter for the human HBG1 gene, which is able to respond to enhancers bound by TAL1 (Fig. 2.2A; Elnitski et al. 2001; Wang et al. 2006). Each TAL1

OS was tested in multiple experiments with both biological and technical replicates and a co-transfection control (Fig. 2.2A, B; Methods). The transfection assay typically showed consistently high levels of luciferase activity for the test constructs that pass the threshold for enhancement (two fold increase compared to the parental vector), as illustrated for the TAL1 OS in the intron of Gypc (Fig. 2.2B). The results for all 70 TAL1

OSs were summarized as box-plots (Fig. 2.2C; values are in Supplemental Table 2.2).

Of the tested TAL1 OSs, 39 (56%) produced at least a two-fold increase. Many were strongly active, with 26 giving at least a three-fold increase, and the most active one generating a median effect of 23-fold. Activity of another 7 (10%) TAL1 OSs fell in a “threshold zone”, which was less than the two-fold needed to be declared an enhancer by this assay, but greater than 1.5-fold, which is over three standard deviations above the median of the negative controls. The remaining 24 TAL1 OSs were not active in this enhancer assay. While the transient transfection assay reveals the ability to increase expression from a plasmid that acquires some aspects of chromatin structure (Reeves et al. 1985), the DNA does not integrate into the genome, and the activity is assayed in a single cell type.

[Figure 2.2, panels A-C (graphics not reproduced). A. Schematic: a predicted enhancer (TAL1 OS) upstream of HBG1pr-firefly luciferase, compared to the parental HBG1pr-firefly luciferase vector, each with a TKpr-Renilla luciferase co-transfection control. B. Fold change in activity (axis 0 to 8) for Rep1 and Rep2 of the negative control and of TAL1-2105 (Gypc intron). C. Fold change in activity for the 70 TAL1 OSs; legend: Inactive (34%), Threshold (10%), Enhancer (56%).]

Figure 2.2 Erythroid enhancer activity of TAL1 OSs in a transient transfection assay.

A. In each expression vector, a TAL1 OS is inserted upstream of a firefly luciferase reporter gene expressed from the human Aγ-globin gene promoter (HBG1pr). After transfection the expression level of the test construct is compared to that from the parental vector, in both cases normalized to the expression of a co-transfection control plasmid with the Renilla luciferase gene expressed from the promoter for a viral gene encoding thymidine kinase (TKpr). B. Results of the enhancer assays of a negative control vector and an expression vector containing TAL1 OS from the Gypc intron. Eight technical replicates for each of two biological replicates (different days of transfection) are shown. C. Activities in enhancer assays on 70 TAL1 OSs, ordered by enhancer activity. The distribution of results (biological and technical replicates) for each TAL1 OS is shown as a box plot, with the internal line indicating the median of at least two biological replicates. Boxes for TAL1 OSs inactive in this assay are shaded blue, those in the threshold zone are pink, and those with activity are shaded red.
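The enhancer calls in the transient transfection assay rest on a normalized fold change and two cutoffs: at least two-fold for an enhancer, and between 1.5-fold and two-fold for the threshold zone. The Python sketch below illustrates that calculation. Summarizing replicate wells by their median is an assumption made here for simplicity, the function names are hypothetical, and the luminescence readings are invented for the example.

```python
from statistics import median


def normalized_activity(firefly, renilla):
    """Firefly luciferase signal normalized to the Renilla co-transfection control."""
    return firefly / renilla


def fold_change(test_wells, parental_wells):
    """Median normalized activity of the test construct relative to the parental vector.

    Each argument is a list of (firefly, renilla) readings from replicate wells.
    """
    test = median(normalized_activity(f, r) for f, r in test_wells)
    parental = median(normalized_activity(f, r) for f, r in parental_wells)
    return test / parental


def call_enhancer(fold, enhancer_cutoff=2.0, threshold_cutoff=1.5):
    """Classify a fold change using the cutoffs described in the text."""
    if fold >= enhancer_cutoff:
        return "enhancer"
    if fold > threshold_cutoff:
        return "threshold zone"
    return "inactive"


# Invented readings: (firefly, renilla) for three replicate wells each.
test_wells = [(800, 100), (900, 100), (1000, 100)]
parental_wells = [(300, 100), (310, 100), (290, 100)]

example_fold = fold_change(test_wells, parental_wells)
example_call = call_enhancer(example_fold)
```

Normalizing each well to its own Renilla signal before comparing constructs corrects for differences in transfection efficiency between wells, which is the purpose of the co-transfection control described in Fig. 2.2A.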

We then turned to assays for tissue-specific enhancement in transgenic mouse embryos (Pennacchio et al. 2006). A large number of human and mouse candidate

CRMs, predicted by interspecies conservation of noncoding sequences (Pennacchio et al. 2006), EP300 occupancy (Visel et al. 2009), or other features (ENCODE Project

Consortium 2012) have been tested for the ability to increase expression of a beta- galactosidase reporter gene driven by a minimal promoter in a tissue-specific manner at embryonic day 11.5 of mice. Of the 4915 erythroid TAL1 OSs, 66 (mouse DNA fragments or their orthologs in humans) have been tested in the mouse transient transgenic assay, as recorded in the VISTA Enhancer Browser (Visel et al. 2007).

Remarkably 43 of these (65%) were reproducibly active in one or more tissues (Fig.

2.3A). These were active in a range of tissues, with the greatest number in heart and midbrain (Fig. 2.3B; Supplemental Table 2.3).

Nine TAL1 OSs were tested in both assays. All nine were active in distinctive patterns of mouse tissues, and seven of the nine were also active in the K562 transient transfection assay (Fig. 2.3C; Supplemental Fig. S2.1A, B). This suggests that these assays for enhancer activity are robust, with a large majority of the tested DNA fragments active in both assays.

Surprisingly, these 43 TAL1-bound enhancers were only rarely active in tissues in which TAL1 is typically thought to be playing a role, i.e. the hematopoietic fetal liver and bone marrow. However, in addition to its role in hematopoietic tissues, TAL1, co-operating with TAL2 and GATA2, is involved in GABAergic neurogenesis in the midbrain

(Achim et al. 2013). The enhancement in other tissues could reflect pleiotropic functions of a subset of enhancers, in which a DNA segment is active as an enhancer in multiple tissues, perhaps through the action of TFs paralogous to the TF whose occupancy was used in ascertaining the predicted CRM. Such pleiotropic enhancer activity was observed frequently for TF-occupied DNA segments in mouse whose orthologs in humans were also bound by the orthologous TF (Cheng et al. submitted).

Enhancer promoter units (EPUs) were used to assign TAL1-bound enhancers that were tested and validated in transient transfections and/or transgenic mouse reporter assays to their target gene(s). EPUs were identified by mapping the cis-regulatory sequences in a set of 19 tissues and cell types in the mouse genome. A total of 8,792 EPUs were defined using Spearman correlation coefficients between H3K4me1 signals at enhancers and Pol II occupancy at promoters. The number of enhancers in each EPU ranges from 1 to 180, and the number of promoters from 1 to 129 (Shen et al. 2012). Fifty-nine TAL1-occupied enhancer segments (30 tested in transgenic mice, 25 transiently transfected into K562 cells, and 4 tested in both enhancer assays) were covered by EPUs. The assigned RefSeq gene(s) of each TAL1-bound enhancer are listed in Supplemental Table 2.15.

As two examples, UCSC Genome Browser views of ChIP-seq and RNA-seq data in induced G1E-ER4 cells illustrate two enhancer-promoter units that include TAL1-bound enhancers, along with the expression levels of Cltc and Hhex (Supplemental Fig. S2.6A-B).

[Figure 2.3, panels A-C. A: Negative, 23 (35%); Positive, 43 (65%). B: TAL1-bound enhancers active in different tissues (Heart, Midbrain, Blood vessels, Forebrain, Hindbrain, Neural Tube, Liver, Limb, Melanocytes, Branchial arch, Dorsal root ganglion, Facial mesenchyme, Other). C: Transgenic mouse (E11.5) and transient transfection (K562).]

Figure 2.3 Tissue-specific enhancer activities of TAL1 OSs in transgenic mouse assays.

A. Partitions of 66 TAL1 OSs by enhancer activity in transgenic mouse. TAL1 OSs were tested for the ability to increase expression of a beta-galactosidase reporter gene driven by the Hsp68 promoter in a tissue-specific manner at embryonic day 11.5 (E11.5) of mice. B. Distribution of tissues showing enhancement by the TAL1 OSs. Some TAL1 OSs were reproducibly active in multiple tissues, and each of those tissues was counted for the distribution. C. Comparison of the results of the two enhancer assays on nine TAL1 OSs. Stained mouse images are from the VISTA Enhancer Browser.

The closest genes to TAL1-bound in vivo enhancers were listed, and their expression levels in G1E, uninduced G1E-ER4, 24 hour-induced G1E-ER4 and 30 hour-induced G1E-ER4 cells were analyzed (Supplemental Table 2.14). Five responsive genes (at least a 1.5-fold change in expression level in 24 hour-induced or 30 hour-induced G1E-ER4 cells compared to G1E or uninduced G1E-ER4 cells) were found: Dpf3, Agpat4, Ier5, Gypc, and Bcl2l1. Dpf3 (D4, Zinc And Double PHD Fingers, Family 3) encodes a transcriptional regulator that belongs to the neuron-specific chromatin remodeling complex (nBAF) and binds acetylated histones. Agpat4 (1-Acylglycerol-3-Phosphate O-Acyltransferase 4) encodes an integral membrane protein that converts lysophosphatidic acid to phosphatidic acid and is involved in phospholipid biosynthesis. Ier5 (Immediate Early Response 5) plays a key role in mediating the cellular response to mitogenic signals. Gypc (Glycophorin C) encodes an integral membrane glycoprotein involved in regulating the mechanical stability of red cells. The last responsive gene identified in this study, Bcl2l1 (BCL2-Like 1), encodes a member of the BCL-2 protein family, whose members act as anti- or pro-apoptotic regulators in a variety of cellular events, such as cell death (http://www.genecards.org/).

2.2.3 Impact of additional epigenetic features in predicting enhancer activity in the presence of TAL1 binding

We hypothesized that inclusion of additional features associated with enhancers would improve the accuracy of enhancer prediction. The ratio of monomethylation to trimethylation of H3K4 (expressed as the log base 2) was computed on each TAL1 OS as a potential method to separate putative enhancers, expected to have a high ratio, from putative promoters, expected to have a low ratio (Birney et al. 2007; Heintzman et al. 2007). Given the strong association of gene induction with co-occupancy by GATA1 with TAL1 (Wozniak et al. 2008; Tripic et al. 2009; Cheng et al. 2009; Soler et al. 2010), we predicted that TAL1 OSs also bound by GATA1 would be active as enhancers more frequently than those without GATA1. Thus, we also computed the level of GATA1 occupancy for each TAL1 OS. These measurements were combined in a data matrix along with the signal strength for TAL1 occupancy. 4648 TAL1 OSs were organized into clusters consisting of fairly homogeneous combinations of the three features, using k-means clustering, k=8 (Fig. 2.4A). Clustering at higher values for k did not bring out additional distinctive groups.
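The feature matrix and clustering step described above can be sketched as follows. This is a minimal illustration only, not the code used in this study: the pseudocount in `log2_ratio`, the toy signal values, the deterministic choice of starting centroids, and the use of k=2 on six toy segments (rather than k=8 on 4648 TAL1 OSs) are all assumptions made to keep the example small and reproducible.

```python
import math

def log2_ratio(k4me1, k4me3, pseudocount=1.0):
    """log2(H3K4me1 / H3K4me3); the pseudocount is an assumption to avoid dividing by zero."""
    return math.log2((k4me1 + pseudocount) / (k4me3 + pseudocount))

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=100):
    """Plain Lloyd's k-means with a simple deterministic start (evenly spaced points)."""
    if k > len(points):
        raise ValueError("k cannot exceed the number of points")
    step = len(points) // k
    centroids = points[::step][:k]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical signals per TAL1 OS: (H3K4me1, H3K4me3, TAL1, GATA1).
signals = [
    (7, 1, 5.0, 4.0),
    (15, 3, 5.5, 4.5),
    (7, 1, 4.5, 5.0),
    (1, 7, 1.0, 0.5),
    (3, 15, 1.5, 0.0),
    (1, 7, 0.5, 1.0),
]
# Data matrix with the three features: log2 ratio, TAL1 signal, GATA1 signal.
features = [(log2_ratio(m1, m3), tal1, gata1) for m1, m3, tal1, gata1 in signals]
labels, centroids = kmeans(features, k=2)
print(labels)
```

In this toy input the first three segments have a high H3K4me1/H3K4me3 ratio with strong TAL1 and GATA1 signals (enhancer-like), and the last three have a low ratio (promoter-like), so the two groups separate cleanly. Real signals would normally be scaled before clustering so that no single feature dominates the distance.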

Two of the clusters (1 and 2) had a higher level of H3K4 trimethylation than monomethylation, characteristic of regions proximal to promoters, and as expected, cluster 1 had a majority of TAL1 OSs within 1 kb of a TSS (Fig. 2.4B). Despite the high level of H3K4me3, only 28% of the TAL1 OSs in cluster 2 were within 1 kb of an annotated TSS. As expected from the higher levels of H3K4 monomethylation than trimethylation, a large majority of TAL1 OSs in clusters 3-8 were distal to the closest TSS. Clusters 1, 4 and 5 were particularly high in GATA1 occupancy, while clusters 2, 3 and 8 were low, with many not overlapping GATA1 peak calls (Fig. 2.4A). The currently known erythroid CRMs (Wu et al. 2011) were distributed among the clusters, but few were in clusters 2 and 8 (Fig. 2.4C). Almost all TAL1 OSs in each cluster were in DHSs, except for cluster 2 (Fig. 2.4D). Over half the TAL1 OSs in each cluster were occupied by EP300, with the exception of cluster 2 (Fig. 2.4D). Frequency of occupancy by EP300 was particularly high in clusters 4, 5 and 6, with the fraction bound ranging from 80% to 90%.

[Figure 2.4, panels A-D. A: columns log2(K4m1/K4m3), TAL1, GATA1. B: distribution of TAL1 OSs within each cluster (%) by distance bin, [0-0.1] kb, (0.1-1], (1-2], (2-5], (5-50], (50-500], (500-2000], for clusters 1-8. C: number of Ref CRMs for clusters 1-8. D: % of TAL1 OSs marked by DHSs and EP300 for clusters 1-8.]

Figure 2.4 Classification of TAL1 OSs based on epigenetic features.

A. TAL1 OSs clustered by the ChIP-seq signals of H3K4me1 and H3K4me3 (log2 of the ratio), TAL1, and GATA1 (k-means clusters, k=8). B. Distributions of TAL1 OSs positions in each cluster relative to the TSSs of genes. C. The numbers of known reference CRMs overlapping with TAL1 OSs in each cluster. D. Percentages of TAL1 OSs marked by DNaseI hypersensitive sites and co-bound by EP300; numbers of sites are over each column.

Additionally, a considerable fraction of the TAL1 OSs in each cluster, ranging from 64% to 83%, was covered by enhancer-promoter units (Supplemental Fig. S2.5) (Shen et al. 2012).

Our hypothesis predicts that the rate of discovery of active enhancers should vary significantly among the clusters, but this was only true to a limited extent. The 70 TAL1 OSs tested for enhancer activity in transient transfections were randomly chosen from each of the eight clusters, with seven to eleven DNA segments from each (Supplemental Table 2.2). Surprisingly, the frequency of positives for the tested TAL1 OSs was not dramatically different among most of the clusters (Fig. 2.5A); the proportion that gave a greater than two-fold increase in expression ranged from 50 to 75% for almost all the clusters. The only exception is cluster eight, in which only three of the ten tested fragments were active. This cluster is notable for a low level of GATA1 binding signal in the TAL1 OSs, and these results support an absence of GATA1 coupled with the presence of TAL1 as a feature of TAL1 OSs that are less likely to be enhancers.

We were surprised to find that particularly high levels of TAL1 and GATA1 signal did not correlate with higher frequency of enhancer activity. Clusters 4 and 5 had the highest levels of GATA1 (Fig. 2.4A), but while cluster 5 had one of the highest rates of positives (71%, Fig. 2.5A), cluster 4 had a lower rate (50%). DNA segments in Cluster 4 also had the highest level of TAL1. The cluster with the highest rate of enhancer activity, cluster 6 (75%, Fig. 2.5A) had moderate amounts of TAL1 and GATA1 (Fig. 2.4A). We emphasize that this result is conditional on the fact that all the tested DNA segments were occupied by TAL1.

We were also surprised to discover that TAL1 OSs with lower ratios of H3K4me1/H3K4me3 (clusters 1 and 2) and proximity to a TSS (cluster 1) were active as enhancers at a relatively high rate in the transient transfection assay (Fig. 2.5A). While many TAL1 OSs in cluster 1 were expected to be promoters (63.2% are within 1 kb of a TSS), the results show that they also contain promoter-proximal enhancer activity.

The G+C content of the DNA segments was very similar among the activity categories (Fig. 2.5B), indicating that this feature has little impact, again conditional on the fact that the DNA segment is bound by TAL1. The presence of EP300 at a TAL1 OS does contribute to discriminating TAL1 OSs that are active versus inactive in this assay, but not dramatically. The fraction of enhancer-active TAL1 OSs with EP300 (80%) was larger than the fraction of inactive TAL1 OSs with EP300 (61%) (Fig. 2.5B). Moreover, among the TAL1 OSs in cluster 8, none of the inactive regions were bound by EP300.

The limited impact of additional epigenetic features on success of predicting enhancers (but conditional on TAL1 occupancy) is also seen for the transient transgenic mouse assays. The TAL1 OSs determined to be active as enhancers in each of the two assays were distributed similarly among the eight clusters (Fig. 2.5C; Supplemental Table 2.3). This includes the lower frequency of successful predictions when TAL1 is bound in the absence of GATA1 or in the presence of a low level of GATA1 binding signal (cluster 8). The major difference between the two assays is that no DNA segments from or orthologous to those in cluster 2 were included in the transgenic mouse assays.

[Figure 2.5, panels A-C. A: fold change in activity for the tested TAL1 OSs in each cluster: C1 (63%), C2 (50%), C3 (63%), C4 (50%), C5 (71%), C6 (75%), C7 (55%), C8 (30%). B: GC-content of TAL1 OSs (%) and EP300 co-occupied TAL1 OSs (%) for the Inactive, Threshold, and Active categories. C: % of TAL1 OSs in each cluster (clusters 1-8) for transient transfection into K562 (Enhancer, Threshold, Inactive) and transgenic mice (Positive, Negative).]

Figure 2.5 Enhancer activities of TAL1 OSs partitioned by clusters.

A. Enhancer activities of TAL1 OSs tested by transient transfection assays in K562 cells, grouped in clusters by epigenetic features. The names of individual TAL1 OSs are given along the x-axis, and the percent active in each cluster is listed. The distinctive properties of each TAL1 OS cluster are summarized in the three colored bars, derived from Fig. 2.4A. B. The percentage of GC-content and EP300 co-occupancy is shown for the tested TAL1 OSs whose activities fall into each of three categories: inactive, threshold, and active enhancers. C. Activities of TAL1 OSs grouped in clusters by epigenetic features, shown for both enhancer assays: transient transfection into K562 cells (left) and transgenic mice at E11.5 (right).

2.2.4 Confirmation of predictive power of EP300 occupancy, H3K4 monomethylation and H3K27 acetylation in enhancer prediction

Despite the limited impact of these epigenetic features (Fig. 2.4A, Fig. 2.5A, B) on accuracy of enhancer prediction, given the presence of bound TAL1, they still could contribute to effective enhancer prediction in a different context. We conducted a meta-analysis by expanding the set of DNA segments examined to all those tested in the K562 transient transfection assay in our laboratory (Wang et al. 2006; Cheng et al. 2008; this study), including negative controls (Supplemental Table 2.4). We analyzed these 273 DNA segments for the presence or absence of eight epigenetic features: binding of three transcription factors (ChIP-seq for TAL1 and GATA1 in G1E-ER4 cells and EP300 in MEL cells) and enrichment for five histone modifications (ChIP-seq for H3K4me1, H3K4me3, H3K27me3, H3K9me3 and H3K27ac in G1E-ER4 cells) (Supplemental Fig. S2.2A; Supplemental Table 2.4). We confirmed that three commonly used epigenetic predictors of enhancers, H3K4me1, H3K27ac, and occupancy by the co-activator EP300, were strongly positively associated with enhancers (Fig. 2.6A). The range of activities is consistently and significantly higher for DNA segments with these features than those without them. Furthermore, the presence of either TAL1 or GATA1 was also predictive of enhancer activity (Fig. 2.6A; Supplemental Table 2.5). The presence of H3K4me3 on the chromatin also was associated with higher activity in this assay, perhaps reflecting promoter-proximal enhancers. In contrast, the repressive histone modifications were associated with less activity, with a median fold change around one (no enhancement; Fig. 2.6A; Supplemental Fig. S2.2C; Supplemental Table 2.4).

2.2.5 Effective combinations of epigenetic features for prediction of enhancers

The epigenetic features were analyzed individually in Fig. 2.6A, but they tend to occur together in various combinations. Therefore, we also searched for frequently occurring combinations of features and determined their contributions to enhancer activity (Fig. 2.6B). The 273 DNA segments were organized into clusters consisting of different combinations of the 8 features using two unsupervised clustering methods. Clustering by k-means (k=14, centroid model) revealed a pattern of the feature combinations that distinguish active enhancers from inactive regions. We also used Density Based Spatial Clustering of Applications with Noise, or DBSCAN (Ester et al. 1996), to identify more homogeneous clusters while placing the less informative, non-homogeneous combinations into a separate category. DBSCAN produced 19 homogeneous clusters of feature combinations that contribute differentially to the response to enhancer assay (Fig. 2.6B; Supplemental Table 2.4).
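To illustrate how DBSCAN can separate homogeneous feature combinations from a leftover category, the following is a minimal sketch on binary feature vectors. The distance measure (Hamming distance), the parameter values (`eps=0`, `min_samples=2`), and the toy vectors are assumptions chosen for illustration; the settings used for the analysis in this study are not restated here. With `eps=0`, only identical feature patterns are neighbors, so each cluster is perfectly homogeneous and patterns seen fewer than `min_samples` times are labeled as noise (-1).

```python
def hamming(a, b):
    """Number of features at which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def dbscan(points, eps, min_samples):
    """Textbook DBSCAN; returns one label per point, with -1 meaning noise."""
    NOISE = -1
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(len(points)) if hamming(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs_j = neighbors(j)
            if len(nbrs_j) >= min_samples:
                queue.extend(nbrs_j)  # j is a core point, so keep expanding
    return labels

# Hypothetical presence/absence vectors in the order
# (TAL1, GATA1, EP300, H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K9me3).
segments = [
    (1, 1, 1, 1, 1, 1, 0, 0),
    (1, 1, 1, 1, 1, 1, 0, 0),
    (1, 1, 1, 1, 1, 1, 0, 0),
    (0, 0, 0, 0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0, 0, 0, 0),
    (1, 0, 0, 1, 0, 0, 1, 0),  # unique pattern, expected to be noise
]
cluster_labels = dbscan(segments, eps=0, min_samples=2)
print(cluster_labels)
```

A larger `eps` would merge feature patterns that differ at a few positions, trading homogeneity for larger clusters.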

The combination of TAL1, GATA1, EP300, H3K4me1, and H3K27ac was the most effective predictor of enhancement activity, capturing the largest number of enhancers (40) and at a high frequency (56% of the 71 tested in DBSCAN cluster 1) (Fig. 2.6B). This conclusion was strengthened by examining clusters with DNA segments missing one or more of these positive features. Members of cluster 6 had all the positive features except TAL1 occupancy, and the rate of activity declined to 33.3%. Similarly, DNA segments in cluster 7 had the other positive features but lacked both EP300 and TAL1 occupancy, and the success rate dropped to 18.2% (2 out of 11). Likewise, exclusion of H3K27ac in cluster 2 decreased the success rate to 0%. We do not include H3K4me3 as a positive predictor of enhancers, even though the DNA segments in cluster 1 have this mark, because the absence of this modification from DNA segments in cluster 3 is associated with an even higher rate of validation (71%). While the small cluster 4 appears to show that DNA segments can be active as enhancers without EP300, this lack of EP300 signal could result from the fact that this feature was measured in a highly related but not identical cell line (MEL cells, not 24 h-induced G1E-ER4 cells). Thus it is possible that the active enhancers in cluster 4 may be bound by EP300 specifically in G1E-ER4 cells.

[Figure 2.6, panels A-C. A: fold change in activity for DNA segments with (+) and without (-) each of the eight features; asterisks (* or **) mark the comparisons. B: presence or absence of each feature for the tested DNA segments, with DBSCAN clusters 1-19 plus a non-homogeneous category, k-means clusters, and activity. C: ROC plot of sensitivity versus 1-specificity. Labeled feature combinations without 4m3: G+27ac, P+27ac, T, T+27ac, T+4m1+27ac, T+G+4m1+27ac, T+G+P+4m1+27ac. Labeled feature combinations with 4m3: T+G+4m3, T+G+P+4m1+4m3+27ac. Histone marks without TFs: no TF+4m1+27ac. Key: T=TAL1, G=GATA1, P=EP300, 4m1=H3K4me1, 4m3=H3K4me3, 27ac=H3K27ac.]

D.

Categories of             No TF                               At least 1 TF                       All 3 TFs
Histone Modifications     Enh(%)   Thr(%)   Inac(%)  Total(#) Enh(%)   Thr(%)   Inac(%)  Total(#) Enh(%)   Thr(%)   Inac(%)  Total(#)
No histone mark           13.6     13.6     72.7     22       -        -        -        -        -        -        -        -
K4me1 only                0        20       80       5        50       0        50       4        100      0        0        1
K4me3 only                0        0        100      1        0        0        100      1        -        -        -        -
K4me1+K4me3               0        0        100      4        25       25       50       4        0        33.3     66.7     3
K4me1+K27ac               5.6      22.2     72.2     18       54.8     6.5      38.7     31       70.6     0        29.4     17
K4me3+K27ac               0        33.3     66.7     3        66.7     33.3     0        3        0        100      0        1
K4me1+K4me3+K27ac         16.7     33.3     50       6        50       19.2     30.8     104      56.3     19.7     23.9     71
K27me3 only               7.7      7.7      84.6     13       50       0        50       2        -        -        -        -
K9me3 only                0        28.6     71.4     14       -        -        -        -        -        -        -        -
K4me1+K27me3              0        33.3     66.7     9        0        50       50       2        0        100      0        1
K4me3+K27me3              20       0        80       5        -        -        -        -        -        -        -        -
K4me1+K4me3+K27me3        0        16.7     83.3     6        -        -        -        -        -        -        -        -
K4me1+K4me3+K9me3         0        0        100      1        -        -        -        -        -        -        -        -
K4me1+K27ac+K27me3        -        -        -        -        100      0        0        1        100      0        0        1
K27ac+K27me3+K9me3        -        -        -        -        100      0        0        1        -        -        -        -
K4me1+K4me3+K27ac+K27me3  0        0        100      2        66.7     0        33.3     6        100      0        0        2
K4me1+K4me3+K27ac+K9me3   -        -        -        -        33.3     33.3     33.3     3        50       50       0        2
K27ac+K9me3               0        0        100      1        -        -        -        -        -        -        -        -
K27me3+K9me3              -        -        -        -        100      0        0        1        -        -        -        -

Figure 2.6 Meta-analysis of contributions of epigenetic features to enhancer activity in transient transfection assays.

A. Distributions of enhancer activities of DNA segments marked by each feature individually. The asterisks indicate statistically significant difference in activity between the presence and absence of the features. B. Combinations of epigenetic features and their association with enhancer activity. Each row of the diagram pertains to one of the 273 tested DNA segments. The presence and absence of the eight epigenetic features are represented by black and light grey color, respectively. The tested segments were organized into clusters both by k-means (k=12) and DBSCAN (Ester et al. 1996) clustering of the binary representation of the epigenetic features. Each tested DNA segment is also categorized by activity in the transient transfection assay. C. Evaluation of the discriminatory power of each TF and different combinations of TFs with histone marks by a receiver-operator characteristic (ROC) plot. The sensitivity and 1-specificity are plotted for the six features (three histone modifications: H3K4me1, H3K4me3, H3K27ac, and three TFs: TAL1, GATA1, EP300) or different combinations of these features. The data points give the Sn and 1-Sp of each TF or different combinations of TFs with histone modifications. The top discriminators with the best performance do not have the H3K4me3 mark (dots in the upper left), and are separated in the ROC graph from the weaker discriminators, in which the combinations include H3K4me3. The feature(s) with no discriminatory power are shown by the dots to the right of or along the diagonal in the ROC graph. D. Contributions of histone modification signals versus TF binding to enhancer activity. The tested DNA segments were grouped by the patterns of histone modification on each row, and then subdivided by the presence of TFs in the groups of columns. The percentages of tested DNA fragments that fall into the indicated activity bins are given, along with total numbers of DNA segments in each histone modification category.

Absence of all eight epigenetic features (cluster 19) was associated with inactivity in the assay; only 3 of 22 tested DNA segments (13.6%) were active (Fig. 2.6B). Likewise, the DNA segments carrying the repressive marks H3K27me3 (cluster 16) or H3K9me3 (cluster 17) had only one (7.7%) or no active enhancers, respectively.

The “active” histone modifications alone, in the absence of key TF occupancy, were not strong predictors of enhancer activity. None of the five DNA segments in cluster 18 (marked by only H3K4me1), one of the 18 segments in cluster 14 (H3K4me1 and H3K27ac), and none of the four DNA segments in cluster 12 (H3K4me1 and H3K4me3) were active as enhancers in this dataset. In addition, only 33.3% of the DNA segments marked by only H3K4me1 and H3K27ac (in the absence of the three TFs) are in regions of accessible chromatin in G1E-ER4 cells (Supplemental Table 2.4).

The analysis of DBSCAN clusters pointed out important trends, but some of the clusters were small and could capture only a small fraction of potential enhancers. In order to better measure sensitivity as well as specificity, we evaluated the discriminatory performance of each transcription factor (TAL1, GATA1, EP300) or various combinations of them with three histone modifications (H3K4me1, H3K4me3, H3K27ac). For 65 combinations of epigenetic features (including individual features, Supplemental Table 2.6), the group of tested DNA segments that had a given set of features was considered the positive predictions for enhancer activity, and the remaining tested DNA fragments were considered negatives. These predictions were then evaluated by the results of the enhancer assay, thereby separating the positive predictions into true and false positives and the negative predictions into true and false negatives. Thus we could calculate the sensitivity (Sn) and specificity (Sp) for enhancer prediction by each set of features, utilizing the information from all 273 tested DNA fragments at each set of features. We displayed the results in a ROC (Receiver Operating Characteristic) graph (Fig. 2.6C; Supplemental Table 2.6), so that the best discriminators generated points in the upper left of the graph, whereas feature(s) with low discriminatory power generated points along or to the right of the diagonal (Fig. 2.6C). The sets of features were described by the minimum requirement for inclusion, so that, e.g., DNA segments with both TAL1 and H3K27ac were a subset of the group of DNA segments with TAL1.
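The evaluation just described can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the analysis code: the function name `evaluate`, the toy segments, and the treatment of activity as a simple active/inactive flag are hypothetical (how threshold-level segments were counted is not restated here). A segment is a positive prediction when it carries at least the required features (the minimum requirement for inclusion), and Sn = TP/(TP+FN), Sp = TN/(TN+FP), precision = TP/(TP+FP).

```python
def evaluate(segments, required):
    """Score one feature combination against assay results.

    segments: list of (set of features present, bool active in the assay)
    required: set of features that a segment must carry, at minimum,
              to count as a positive prediction
    """
    tp = fp = tn = fn = 0
    for features, active in segments:
        predicted = required <= features  # subset test: minimum requirement
        if predicted and active:
            tp += 1
        elif predicted and not active:
            fp += 1
        elif active:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as None.
        return num / den if den else None

    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "sn": ratio(tp, tp + fn),         # sensitivity
        "sp": ratio(tn, tn + fp),         # specificity
        "precision": ratio(tp, tp + fp),
    }

# Hypothetical tested segments: (features present, active as enhancer?).
tested = [
    ({"TAL1", "GATA1", "H3K27ac"}, True),
    ({"TAL1", "H3K27ac"}, True),
    ({"TAL1"}, False),
    ({"H3K4me1", "H3K27ac"}, False),
    ({"GATA1", "H3K27ac"}, True),
    (set(), False),
]
tal1_only = evaluate(tested, {"TAL1"})
tal1_k27ac = evaluate(tested, {"TAL1", "H3K27ac"})
# One ROC point per feature combination: (1 - Sp, Sn).
roc_points = [(1 - r["sp"], r["sn"]) for r in (tal1_only, tal1_k27ac)]
print(tal1_only)
print(tal1_k27ac)
```

In the toy data, requiring H3K27ac in addition to TAL1 removes the one false positive, so specificity and precision rise while sensitivity is unchanged; the positive predictions for TAL1 plus H3K27ac are a subset of those for TAL1.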

Twenty-eight different combinations of TAL1, GATA1, EP300, H3K4me1 and H3K27ac had strong performances, generating points in the upper left dotted ellipse (Fig. 2.6C). These feature combinations captured 62-84% of the true enhancers while rejecting 59-80% of the DNA segments without enhancer activity (in the transient transfection assay). Enrichment of H3K27ac in the presence of EP300 binding was the best discriminator in this dataset, with a Sn of 74% and a Sp of 72%. However, many other combinations gave essentially equivalent performances to H3K27ac plus EP300, shown by the cluster of points in the middle of the upper left ellipse. Notable features and combinations among them were TAL1 binding (Sn=74% and Sp=69%), TAL1 plus H3K27ac (Sn=70% and Sp=73%), and GATA1 plus H3K27ac (Sn=82% and Sp=63%). Enrichment by both H3K4me1 and H3K27ac in the presence of TAL1 binding or TAL1-GATA1 co-binding were also excellent discriminators, with Sn of 69% for both and Sp of 74% or 75%, respectively. Furthermore, the combination of these two histone marks and all three TFs (TAL1, GATA1 and EP300) had excellent Sp (80%) with the highest precision (60%), along with good Sn (62%). Precision in identifying enhancers using the five features ranged from 50% to 60% (Supplemental Fig. S2.2B; Supplemental Table 2.6).

Including H3K4me3 in the set of diagnostic features reduced discriminatory power (Fig. 2.6C). For example, adding H3K4me3 to the combination of five features with the highest precision (i.e. using H3K4me3 along with H3K4me1, H3K27ac, TAL1, GATA1, and EP300) decreased the Sn from 62% to 48% and the precision from 60% to 57%. Likewise, TAL1-GATA1 co-binding in the DNA segments enriched for H3K4me3 had low discriminatory power.

DNA segments associated with only the three activating histone modifications in various combinations but in the absence of any TF had no discriminatory power (points in the lower dotted ellipse in Fig. 2.6C). DNA segments bound only by TAL1 but with no activating histone modifications also had no discriminatory power (the point on the lowest left of the diagonal in Fig. 2.6C).

These results show that many combinations of binding by TAL1, GATA1, and EP300 in the presence of H3K4me1 and H3K27ac had strong power for predicting enhancer activity. This result was confirmed by partitioning the DNA segments in a supervised manner, based on the presence of the epigenetic features (Fig. 2.6D). Over 70% of the DNA segments with all 5 positive features were active as enhancers in the transfection assays. DNA segments lacking TF occupancy were rarely active as enhancers.

2.2.6 Motifs that distinguish TAL1-bound enhancers from inactive TAL1 OSs

We hypothesized that additional proteins binding to the TAL1 OSs may contribute to their activity in each enhancer assay. One prediction of this hypothesis is that motifs representing the binding sites for such proteins would be present at significantly different frequencies in active versus inactive groups of TAL1 OSs. Thus, we used the Discriminating Matrix Enumerator program, DME2 (Smith et al. 2005), to identify such differentially enriched motifs, analyzing separately the sets of TAL1 OSs tested in each assay (Fig. 2.7A). This produced four sets of enriched motifs, with enrichment in active (positive) or inactive (negative) TAL1 OSs determined by each assay. From the 200 motifs in each set, the top ten motifs identified by DME2 (listed as scoring matrices in Supplemental Tables 2.7-2.10) were analyzed further to identify candidate proteins or representatives of protein families that may bind to the discriminating motifs (Supplemental Fig. S2.3B). These 40 top DME2 motifs were searched against two databases of known motifs, JASPAR (Mathelier et al. 2013) and UniPROBE (Newburger and Bulyk 2009), using the motif comparison tool TOMTOM (Gupta et al. 2007). A total of 108 known motifs aligned by TOMTOM had an E-value of <1; these motifs had good statistical support and were analyzed further (Supplemental Table 2.11). The 56 motifs enriched in both the active and inactive (positive and negative) sets were removed.

The DME1 motif highly enriched in the TAL1 OSs with enhancer activity had significant matches to the known factor binding sites for three families of proteins: IRFs, STATs, and FOX proteins (Fig. 2.7B). The DME73 (Fig. 2.7B) and DME23 (Supplemental Table 2.11) discriminatory motifs contain a GATA binding site motif. Given the prevalence of this motif in all TAL1 OSs, its enrichment in enhancers suggests that enhancer-active TAL1 OSs have multiple instances of the motif. This result is consistent with the positive role of GATA1 in determining enhancement by TAL1 OSs (Fig. 2.5A). The enhancer-enriched DME25 discriminatory motif contains a match to the binding site for SMAD3, a member of the family of proteins mediating TGF-beta signaling and serving as activators of target genes.

The discriminatory motifs enriched in the TAL1 OSs that are not active as enhancers reveal two potential protein partners with negative effects (Fig. 2.7C). DME41 matches the binding site motif for HOXD10, and DME22 matches the binding site motif for RE1-silencing transcription factor (REST). The latter (also known as NRSF) is a transcriptional repressor of neural genes in non-neural tissues.

[Figure 2.7, panels A-C. A: flow chart. TAL1 OSs tested by transient transfection (39 active, 24 inactive) and in transgenic mice (43 positive, 23 negative) were analyzed with the Discriminating Motif Enumerator (DME), giving 200 motifs enriched in each of the four sets (actives, inactives, positives, negatives). The top 10 motifs in each set were selected and compared with MEME-TOMTOM: 108 motifs matched to 63 protein binding sites from databases (E-value<1); 56 motifs occurred in both enhancers and non-enhancers, 41 motifs were enriched only in enhancers, and 11 motifs were enriched only in non-enhancers. B and C: motif logos.]

Figure 2.7 Identification of motifs enriched in the TAL1-bound enhancers and inactive segments.

A. The overall procedure for identifying TF binding site motifs that contribute to discrimination of TAL1 OSs by activity. B. Motifs that distinguish TAL1 OSs that are active enhancers from those that are inactive. The motif discovered by DME2 is given on the first line of each box, followed by the known TF binding site motifs discovered by TOMTOM, all shown as aligned logos. C. Motifs that distinguish TAL1 OSs that are inactive enhancers from those that are active.

2.2.7 Conservation as an illuminator, not a predictor

The overall level of sequence conservation surrounding the TAL1 OSs, based on interspecies comparisons, was not strongly associated with activity in either enhancer assay. The large majority of both sets of TAL1 OSs, tested in transient transfections or in transgenic mice, had quite low levels of conservation aggregated across the bound segments. The level of conservation, estimated by the phastCons score (Siepel et al. 2005) and phyloP (Pollard et al. 2010), had weakly positive associations with level of activity in the transient transfection assay (correlation coefficients from linear regression R=+0.173 and R=+0.015, respectively) and with enhancer results in transgenic mice (R=+0.273 and R=+0.216, respectively, from logistic regression) (Supplemental Table 2.12; Supplemental Fig. S2.4A, B).

We expected that evolutionary constraint would be most intense on the protein binding sites, and thus we focused on preservation of a TF binding site motif between mouse and human to monitor more localized constraint. The TF binding site motif most strongly associated with TAL1 occupancy is actually the GATA motif (WGATAR), reflecting a strong role of GATA factors in directing the binding of TAL1 (Kassouf et al. 2010; Wu et al. 2014). Individual instances of this motif in a bound site can be lineage specific or deeply preserved (the aligned sequences have a strong match to the binding site motif in each species over multiple clades; Fig. 2.8A). Using the program CladiMo (King DC. 2009), each DNA segment investigated was categorized as having (a) a GATA motif preserved between mouse and human, (b) a GATA motif in mouse but not human, or (c) no GATA motif in mouse. The categories were assigned hierarchically from a to c, so that a DNA segment with both preserved and lineage-specific motifs was placed in the preserved category (Supplemental Table 2.13).
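The hierarchical assignment to categories a, b, and c can be sketched as follows. This is a simplified illustration and not the CladiMo algorithm: it assumes a pairwise, gap-free mouse sequence with the human sequence aligned column by column, treats WGATAR as an exact consensus match on either strand (W = A or T, R = A or G; reverse complement YTATCW), and uses hypothetical function names and toy sequences.

```python
import re

# WGATAR on the forward strand, or its reverse complement YTATCW.
MOTIF = r"[AT]GATA[AG]|[CT]TATC[AT]"
MOTIF_FULL = re.compile(MOTIF)
MOTIF_SCAN = re.compile(r"(?=(" + MOTIF + r"))")  # lookahead finds overlapping hits
MOTIF_LENGTH = 6

def motif_starts(sequence):
    """Start positions of every WGATAR match (either strand) in a sequence."""
    return [m.start() for m in MOTIF_SCAN.finditer(sequence.upper())]

def categorize(mouse, human_aligned):
    """Return 'a' (preserved), 'b' (mouse only), or 'c' (no motif in mouse).

    Assumes mouse and human_aligned have the same length and share alignment
    columns; gaps in the human sequence are written as '-'.
    """
    starts = motif_starts(mouse)
    if not starts:
        return "c"
    for start in starts:
        human_site = human_aligned[start:start + MOTIF_LENGTH].upper()
        if MOTIF_FULL.fullmatch(human_site):
            return "a"  # one preserved instance is enough (hierarchy a > b > c)
    return "b"

# Toy examples (hypothetical sequences).
preserved = categorize("CCAGATAAGG", "CCTGATAGGG")
mouse_only = categorize("CCAGATAAGG", "CCAGACAAGG")
no_motif = categorize("CCCCGGGGCC", "CCAGATAAGG")
# Two mouse motifs, only the second preserved in human: still category 'a'.
mixed = categorize("AGATAACCTTATCT", "AGACAACCTTATCA")
print(preserved, mouse_only, no_motif, mixed)
```

Because the loop returns as soon as any mouse motif instance is matched in the aligned human sequence, a segment that carries both preserved and lineage-specific instances falls into the preserved category, mirroring the hierarchy described above.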

This analysis reveals a significant association of enhancer activity with preservation of the GATA binding site motif. Within the meta-analysis set of 273 DNA segments tested for enhancer activity by transient transfection into K562 cells, 151 are bound in vivo by GATA1. Of these, 72 have WGATAR motifs preserved in human and mouse, 73 have motifs in mouse but not human (not preserved), and 6 have no GATA motif. The distribution of enhancer activities was higher for DNA segments with the preserved motif than for those in which the motif is not preserved (median 2.3-fold increase compared with a 1.7-fold increase, respectively; Fig. 2.8B). Furthermore, the frequency of observing enhancer activity was higher for the group with the preserved GATA motif (60% compared to 41% for the group with non-preserved motifs). A subset of 115 DNA segments were bound by both TAL1 and GATA1, and the co-bound segments with a preserved GATA motif show a higher enhancement activity (median 2.4-fold increase in activity versus a median of 1.7-fold increase; Fig. 2.8B) and higher frequency of enhancement (65% versus 44%) when compared with the co-bound segments with a motif that is not phylogenetically preserved. These results confirm and extend previous observations (Cheng et al. 2008).

[Figure 2.8, panels A and B. Panel B: dot plots of fold change in activity for 151 GATA1-bound segments (p-value=0.017) and 115 GATA1-TAL1 co-bound segments (p-value=0.038), partitioned by WGATAR motif category (no motif, not preserved, preserved).]

Figure 2.8 Contributions of WGATAR motif preservation to enhancer activity of TAL1 OSs.

A. Examples of occurrence of a deeply preserved and a lineage-specific WGATAR motif in a TAL1 peak, discovered by CladiMo. B. The distribution of enhancer activities for DNA segments is represented as dot plots and partitioned by preservation of the GATA1 binding site motif. The total numbers of occupied segments bound by GATA1 (left) or co-bound by GATA1 and TAL1 (right) in each category are given at the bottom of the plots. The red internal line indicates the median of enhancer activity in each category (no motif, non-preserved motif, or preserved motif between human and mouse).


2.3 Discussion

The results of this study provide several important insights into how multiple epigenetic features contribute to the identification of active enhancers. Predicting enhancers based on occupancy by a single critical TF, TAL1, gave a success rate of 56% in transient transfections of hematopoietic cells and 65% in transient transgenic mice. These rates are higher than those achieved previously based on GATA1 binding in erythroid cells (Cheng et al. 2008) or MYOD binding in muscle cells (Cao et al. 2010).

Obviously, TFs with active regulatory roles in the cell types of interest should be used for predicting enhancers, but some TFs have better predictive power than others. The binding pattern of TAL1 is highly dynamic across hematopoietic differentiation and during maturation within a lineage (Wu et al. 2014). Such dynamically active TFs may be more directly involved in regulation (Wang et al. 2014) and hence serve as better predictors of enhancers.

Within the set of DNA segments bound by TAL1, inclusion of additional positive epigenetic features can improve the accuracy of enhancer prediction. Inclusion of GATA1 occupancy and modification at H3K4me1 increased the success rate to 75% (transient transfection) and 83% (transgenic mice). Expanding the analysis to include more DNA segments with a greater diversity of epigenetic features showed that various combinations of TAL1, GATA1, and EP300 occupancy in chromatin modified by H3K4me1 and H3K27ac were the strongest discriminators for prediction of enhancers; many combinations of these features gave Sn and Sp of approximately 70%. H3K4me1 and H3K27ac modifications were shown previously to be characteristic of active enhancers (Creyghton et al. 2010; Zentner et al. 2011), and our results provide strong confirmation for that in hematopoietic cells. In contrast, we also show that these modifications in the absence of occupancy by key TFs were not associated with enhancer activity. Thus additional features such as TFs should be included along with appropriate histone modifications in enhancer predictions. Binding by the co-activator EP300, which is a histone acetyl transferase (HAT), was rarely found in the absence of the H3K27ac mark. While these rare EP300-bound DNA segments without H3K27ac could represent a technical limitation of the study (they were determined in similar but not identical cell types), it is possible that they represent regions where EP300/CBP occupancy does not produce histone acetylation (Holmqvist et al. 2012; Holmqvist and Mannervik 2012). This potential block to the HAT activity was associated with lack of activity in the enhancer assay (2 out of 7 success rate).

The accuracy of enhancer prediction described in this work is high (up to 75%), similar to some of the most effective predictors used previously (Visel et al. 2009; Blow et al. 2010). While this progress is encouraging, the fact remains that about a quarter of all predictions did not show activity in the assays. These predicted CRMs (active or inactive) possess a striking array of features strongly associated with activity, such as TF occupancy, positive histone modifications, and DNase accessibility. The success rates for the CRM predictions are similar for both the transient transfection and the transgenic mouse assays. Thus it is unlikely that one assay is seriously under- or over-counting the biologically meaningful enhancers. Rather, we suggest that other functional roles could be played by the predicted CRMs that were inactive in the enhancer assays. These other roles could include enhancement in tissues or at developmental stages not examined here. Future work with high throughput assays (Melnikov et al. 2012; White et al. 2013; Smith et al. 2013) and activity-based selections (Arnold et al. 2013; Murtha et al. 2014; Dickel et al. 2014) may interrogate a broader set of lineages and conditions.

Also, the predicted CRMs could have roles in negative regulation (Huang and Brandt 2000). The motif enrichment supports this hypothesis. One of the motifs significantly enriched in the TAL1 OSs that are not active as enhancers matches the binding site motif for REST (NRSF). This is a known repressor, and thus it is possible that co-occupancy of REST with TAL1 may lead to down-regulation of expression. The expression profile for NRSF in BioGPS (Wu et al. 2009) reveals strong expression in early erythroblasts, supporting a potential role for this repressor during erythropoiesis.

The motif enrichment implicated IRFs and their partners STAT2::STAT1 (or other members of these protein families) as candidate proteins that may be co-bound at TAL1 OSs that are active as enhancers. The family of interferon regulatory factors, or IRFs, has been suggested previously as positive partners of GATA1 in a similar motif enrichment analysis focused on GATA1 occupancy (Zhang et al. 2009). Also, IRF binding site motifs were enriched within a set of high confidence erythroid enhancers (Xu et al. 2012). Further support for a positive impact of IRF proteins in erythroid enhancement comes from an automated analysis of co-occupancy from human ENCODE ChIP-seq data. The FactorBook (Wang et al. 2012) listing for IRF1 binding sites in human K562 cells shows substantial co-binding by TAL1, GATA2, and GATA1. During interferon signaling, the STAT1::STAT2 heterodimer forms a complex with IRF9 in an active multiprotein complex. Thus this complex may play a positive role in enhancement at some TAL1 OSs. Motif enrichment analysis also implicated the FOX (Forkhead box) family proteins in enhancement by TAL1 OSs. Members of the forkhead family of winged-helix TFs play various roles in development, metabolism, immunology and cancer (Lehmann et al. 2003; Tuteja and Kaestner 2007; Lam et al. 2013); for example, FOXJ2 and FOXO3 are involved in erythropoiesis (Marinkovic et al. 2007; Yang et al. 2009). They have also been described as pioneer factors that mediate the specificity of chromatin remodeling complexes, such as BAF, at certain enhancers (Kaestner 2010).

The TAL1 OSs examined in this study have not been strongly conserved since the divergence of rodents and primates, just like many other DNA segments associated with regulation (Odom et al. 2007; ENCODE Project Consortium 2012; Denas et al. submitted). The DNA segments (100 bp or longer) centered on the peaks of occupancy tend to have low phastCons and phyloP scores. However, when we focus on the most informative transcription factor binding motifs within these occupied DNA segments, we find substantial preservation of the motif across mammalian evolution in the active enhancers. This lends further support to the conclusion that evolutionary constraint on enhancement activity leads to preservation of the binding site motif (Cheng et al. 2008).

2.4 Methods

2.4.1 ChIP-seq data for epigenetic features

ChIP-seq data files for multiple epigenetic features in G1E-ER4 cells induced by estradiol to mature for 24 hours (Hardison lab) and MEL cells (Snyder lab) were obtained from the mouse ENCODE Consortium (Mouse ENCODE Consortium 2014; Wu et al. 2011). Datasets, numbers of peaks and filenames for downloads are given in Table 2.1.

2.4.2 Enhancer assays by transient transfection for K562 cells

Seventy DNA segments occupied by TAL1 in G1E-ER4 cells treated for 24 hr with estradiol, with mean size of ~1 kb (Supplemental Table 2.2, 2.3), were amplified from mouse DNA and cloned into a plasmid vector with the firefly luciferase reporter gene driven by the HBG1 promoter (Wang et al. 2006). The test constructs were transiently transfected into K562 cells in a 96-well plate using 0.14 µg of plasmid DNA containing firefly luciferase reporter and 0.00035 µg of co-transfection control plasmid expressing Renilla luciferase in OptiMEM medium, adding 0.14 µl of PLUS Reagent and 0.21 µl of Lipofectamine LTX per well. The cells were plated at 2.8 x 10^4 cells per well. Each plasmid was transfected in quadruplicate wells for each experiment, and each was tested in at least two separate experiments.

Cell extracts were harvested 48 hours after the transfection, and firefly and Renilla luciferase activities were measured in Promega’s dual luciferase assay. For each of the quadruplicate transfections, at least two measurements were made on the cell lysates, for a total of eight measurements of both firefly and Renilla luciferases for each construct in each experiment. The ratio of firefly luciferase activity of the test DNA to the Renilla luciferase activity of the co-transfection control was normalized by the ratio of firefly luciferase activity from the parental vector to the Renilla luciferase activity of the co-transfection control to get a fold change. The tested fragments that have at least a 2-fold increase in activity are considered active enhancers.
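The normalization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the readings are hypothetical, and averaging the per-measurement firefly/Renilla ratios before dividing is one plausible way to combine replicate measurements; the text does not specify how replicates were aggregated.

```python
from statistics import mean


def fold_change(test_ff, test_ren, parental_ff, parental_ren):
    """Mean firefly/Renilla ratio of the test construct divided by the
    mean firefly/Renilla ratio of the parental (promoter-only) vector."""
    test_ratios = [ff / ren for ff, ren in zip(test_ff, test_ren)]
    parental_ratios = [ff / ren for ff, ren in zip(parental_ff, parental_ren)]
    return mean(test_ratios) / mean(parental_ratios)


def is_active(fc, threshold=2.0):
    """Apply the 2-fold activity threshold stated in the text."""
    return fc >= threshold


# Hypothetical luminometer readings (arbitrary units), four wells each.
test_ff = [900.0, 1100.0, 1000.0, 1000.0]
test_ren = [100.0, 100.0, 100.0, 100.0]
parental_ff = [400.0, 400.0, 400.0, 400.0]
parental_ren = [100.0, 100.0, 100.0, 100.0]

fc = fold_change(test_ff, test_ren, parental_ff, parental_ren)
print(f"fold change = {fc:.2f}, active = {is_active(fc)}")
```

With these made-up readings the test construct averages a ratio of 10 against a parental ratio of 4, so it would pass the 2-fold threshold.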

2.4.3 Transgenic mouse assays (VISTA Enhancer Browser)

The set of TAL1 OSs were evaluated for enhancer activities in mouse transgenic assays by mining data in the VISTA Enhancer Browser (http://enhancer.lbl.gov/). In transgenic mouse assays, candidate DNA segments are cloned into an Hsp68-promoter-LacZ reporter vector. The embryos are generated and evaluated for reproducible LacZ activity at embryonic stage E11.5. The predicted regions showing reproducible expression in the same tissue in at least three independent transgenic mouse embryos were defined as positive enhancers. DNA segments tested in transgenic mice (examining both mouse DNA segments and the human orthologs of mouse TAL1 OSs) that overlapped with a TAL1 OS for at least 50% of the TAL1 OS were included in the study.


2.4.4 Clustering algorithms

The ChIP-seq signal strength of H3K4me1, H3K4me3, TAL1, and GATA1 occupancy was calculated on each TAL1 OS in G1E-ER4 cells treated with estradiol for 24 hr (Fig. 2.4A). The TAL1 OSs were clustered into eight categories by k-means clustering (center-based) (McQueen JB 1967) based on the log2 transformed ratio of H3K4me1 to H3K4me3 levels, the TAL1 occupancy levels, and the GATA1 occupancy levels (Supplemental Table 2.4). Clustering was repeated 100 times, and only those OSs that were placed in the same cluster at least 50 times were retained for the subsequent assays and display (4648 TAL1 OSs). In each iteration of clustering, the identity of a cluster was determined by the rank of its median H3K4me1/H3K4me3 ratio among all clusters. The clustering was displayed by heatmaps. The coloring of the first column was set so that the intervals with zero value (meaning the H3K4me1/H3K4me3 ratio equals one) were in black. The coloring of the third column was set so that the intervals with value 0.11 were in pink. This value was determined so that most (over 90%) intervals with larger values were called as GATA1 OSs, and most (over 90%) intervals with smaller values were not called as GATA1 OSs.
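Because k-means assigns arbitrary label numbers on each run, repeated clusterings must be aligned before membership can be compared. The sketch below illustrates the two steps described above, relabeling clusters by the rank of their median log2 ratio and keeping only consistently placed items. The function names and toy data are hypothetical, and the k-means step itself is assumed to have been run elsewhere.

```python
from collections import Counter
from statistics import median


def relabel_by_median_rank(labels, log2_ratios):
    """Replace arbitrary cluster labels with the rank (0 = lowest) of each
    cluster's median log2(H3K4me1/H3K4me3) ratio."""
    by_cluster = {}
    for lab, val in zip(labels, log2_ratios):
        by_cluster.setdefault(lab, []).append(val)
    ordered = sorted(by_cluster, key=lambda lab: median(by_cluster[lab]))
    rank_of = {lab: rank for rank, lab in enumerate(ordered)}
    return [rank_of[lab] for lab in labels]


def stable_members(relabeled_runs, min_agree):
    """Return {index: cluster} for items whose most frequent cluster
    across runs occurs at least min_agree times."""
    kept = {}
    for i in range(len(relabeled_runs[0])):
        cluster, count = Counter(run[i] for run in relabeled_runs).most_common(1)[0]
        if count >= min_agree:
            kept[i] = cluster
    return kept


# Toy example: six segments, three clustering runs with arbitrary labels.
log2_ratios = [-2.0, -1.5, 0.1, 0.3, 2.2, 2.5]
raw_runs = [
    [2, 2, 0, 0, 1, 1],
    [1, 1, 2, 2, 0, 0],
    [0, 0, 0, 1, 2, 2],
]
runs = [relabel_by_median_rank(r, log2_ratios) for r in raw_runs]
kept = stable_members(runs, min_agree=3)
print(runs)
print(kept)
```

In the study the corresponding settings were 100 runs and a threshold of 50; the toy example uses 3 runs and a threshold of 3 only to keep the output readable.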

In addition, the meta-data (273 DNA segments) were grouped by similarity in patterns of the presence or absence of 8 epigenetic features that contribute differentially to enhancer activity, using k-means clustering (k=14) and DBSCAN (Density-Based Spatial Clustering of Applications with Noise; 19 clusters) (Ester et al. 1996) (Supplemental Table 2.5). Clusters were displayed as a heat map (Fig. 2.6B). Presence of an epigenetic feature in a DNA segment is shown in black, and absence of the feature is shown in light grey. Comparison of the clusters formed by k-means with the clusters of DBSCAN revealed that most of the clusters formed by k-means were sub-classified by DBSCAN into different clusters as well as a different set of non-homogenous patterns. We set the DBSCAN parameters as follows: the size of the ε-neighborhood of a DNA segment, Eps=1, and the minimum number of DNA segments showing a homogenous pattern of epigenetic features required to form a cluster, minpts=3. Homogenous clusters formed by DBSCAN were numbered from 1 to 19 and shown in different colors. Non-homogenous patterns of DNA segments were put together in the lower part of the heat map.
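For readers unfamiliar with DBSCAN, the following is a minimal pure-Python sketch of the algorithm applied to binary presence/absence vectors with Eps=1 and minpts=3. It is not the implementation used in this work: the Euclidean distance metric, the rule that a point counts as its own neighbor, and the toy feature patterns are all assumptions made for illustration.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Return one label per point: 1, 2, ... for clusters, -1 for noise."""
    UNSEEN, NOISE = 0, -1
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # Indices within eps of point i (the point itself is included).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] != UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point joins the cluster
            if labels[j] != UNSEEN:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:   # core point: keep expanding
                queue.extend(nbrs)
    return labels


# Hypothetical presence (1) / absence (0) patterns for 8 features.
pattern_a = (1, 1, 1, 0, 0, 0, 0, 0)
pattern_b = (0, 0, 0, 0, 0, 1, 1, 1)
odd_one = (1, 0, 1, 0, 1, 0, 1, 0)
segments = [pattern_a] * 3 + [pattern_b] * 3 + [odd_one]

labels = dbscan(segments, eps=1, min_pts=3)
print(labels)
```

Under these assumptions, two segments are neighbors when their patterns differ in at most one feature, so a pattern shared by at least three segments forms a cluster and an isolated pattern is reported as noise.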

2.4.5 Measuring discriminatory power of transcription factors and histone modifications to identify enhancers

The discriminatory power of each feature (and combination of features) was evaluated by sensitivity (recall), specificity and precision (positive predictive value) (Supplemental Table 2.7). Sensitivity is the fraction of the active enhancers that carry the set of features. Specificity is the fraction of the inactive DNA segments that lack the set of features. Precision is the fraction of the DNA segments with the set of features that are active enhancers.
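As a worked illustration, the three measures can be computed from the standard confusion-matrix counts, where a "positive call" is a DNA segment carrying the feature set and the "truth" is its activity in the enhancer assay. The function name and the example data below are hypothetical.

```python
def discriminatory_power(has_features, is_active):
    """Compute sensitivity, specificity and precision from paired booleans.

    has_features[i]: segment i carries the feature set (the prediction).
    is_active[i]:    segment i was active in the enhancer assay (the truth).
    """
    tp = fp = fn = tn = 0
    for feat, act in zip(has_features, is_active):
        if feat and act:
            tp += 1
        elif feat and not act:
            fp += 1
        elif not feat and act:
            fn += 1
        else:
            tn += 1
    # Note: a denominator is zero if a class is absent from the data.
    return {
        "sensitivity": tp / (tp + fn),  # recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),    # positive predictive value
    }


# Hypothetical set of ten tested segments.
has_features = [True, True, True, True, False, False, False, False, False, False]
is_active = [True, True, True, False, True, True, False, False, False, False]

metrics = discriminatory_power(has_features, is_active)
print(metrics)
```

Here 3 of 5 active segments carry the features (sensitivity 0.6), 4 of 5 inactive segments lack them (specificity 0.8), and 3 of 4 segments with the features are active (precision 0.75).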

2.4.6 Identification of significantly enriched motifs by employing the computer program, Discriminating Matrix Enumerator

A computational method called Discriminating Matrix Enumerator (DME2, beta version 2008_08_30) (Smith et al. 2005) was used to identify overrepresented motifs of size 10 (described by scoring matrices) in TAL1-bound enhancers and non-enhancers identified by two enhancer assays: (1) “Enhancers” and “Inactive” regions determined by enhancer assay in transiently transfected K562 cells; (2) “Positive” and “Negative” regions identified by enhancer assay in transgenic mice. Before DME was run, the G+C contents of the 4 data sets (Enhancer, Inactive, Positive, Negative) produced by the two enhancer assays were checked. G+C contents (%) were similar between foreground and background sets, so DME could be run (Supplemental Fig. S2.3A). DME2 software discovers the relative enrichment of position weight matrices in the foreground set checked against the background (Supplemental Tables 2.7-2.10). Discriminatory power of each enriched motif was evaluated by the relative enrichment score given by DME (Smith et al. 2005). From the 200 motifs in each set, the 10 highest-scoring enriched DME motifs were analyzed further for matches to known binding sites for mammalian transcription factors in the motif databases JASPAR and UniPROBE Mouse (Mathelier et al. 2014; Newburger and Bulyk 2009) via the motif comparison tool TOMTOM (Gupta et al. 2007). The 108 known motifs aligned by TOMTOM with good statistical support (E-value < 1) were analyzed further. The 56 motifs enriched in both the enhancer and non-enhancer sets were removed. After filtering the aligned motifs from the output of TOMTOM, transcription factor binding sites for different families of proteins enriched in only one category remained (Supplemental Table 2.12).

2.4.7 Analyses of sequence conservation and motif preservation

PhastCons (Siepel et al. 2005) and PhyloP (Pollard et al. 2010) conservation scores of each tested TAL1 peak were obtained from the conservation tracks of the UCSC genome database. PhastCons scores show the level of conservation of each nucleotide of a conserved element in a multispecies alignment (on a scale from 0, not conserved, to 1, fully conserved). The PhyloP score is a measurement of the conservation or divergence of a specific alignment position (high positive values indicate purifying selection, while negative scores indicate acceleration). We calculated the average score for each tested TAL1 peak over a window centered on the middle of the called peak and extending 50 bp on each side (Supplemental Table 2.13).
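The window averaging can be sketched as follows. This is a simplified illustration: it assumes per-base scores have already been extracted into a Python list covering a region that contains the peak, uses 0-based half-open coordinates, and ignores missing values; the function name and the toy scores are hypothetical.

```python
def mean_score_around_peak(scores, region_start, peak_start, peak_end, flank=50):
    """Average per-base scores in a window of `flank` bp on each side of the
    midpoint of the called peak.

    scores:       per-base scores for the region starting at region_start
    region_start: genomic coordinate of scores[0] (0-based)
    """
    center = (peak_start + peak_end) // 2
    lo = max(center - flank - region_start, 0)
    hi = min(center + flank - region_start, len(scores))
    window = scores[lo:hi]
    if not window:
        raise ValueError("peak window does not overlap the scored region")
    return sum(window) / len(window)


# Toy region of 300 bp starting at coordinate 1000:
# unconstrained flanks (0.0) around a fully conserved core (1.0).
scores = [0.0] * 100 + [1.0] * 100 + [0.0] * 100

core = mean_score_around_peak(scores, 1000, 1100, 1200)     # window 1100-1200
shifted = mean_score_around_peak(scores, 1000, 1050, 1150)  # window 1050-1150
print(core, shifted)
```

The second call shows how a peak whose midpoint sits at the edge of a conserved block yields an intermediate average, which is one reason segment-wide averages can look weak even when a short motif is preserved.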

The CladiMo software package (King et al. 2009) was used to find all WGATAR motif instances in alignments of multiple mammalian genome sequences. Each DNA segment was categorized as having (1) a GATA motif preserved between mouse and human, (2) a GATA motif in mouse, not human, or (3) no GATA motif in mouse (Supplemental Table 2.13).
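The hierarchical categorization can be illustrated with a short sketch. It is not CladiMo and does not reproduce its method: it assumes a pairwise, column-aligned mouse and human sequence of equal length, treats a motif as preserved only when both species match WGATAR (or its reverse complement YTATCW) at the same alignment position, and uses hypothetical sequences.

```python
import re

# WGATAR (W = A/T, R = A/G) or its reverse complement YTATCW (Y = C/T).
GATA_MOTIF = re.compile(r"[AT]GATA[AG]|[CT]TATC[AT]")


def motif_positions(aligned_seq):
    """Start positions of 6-mers matching the motif on either strand."""
    seq = aligned_seq.upper()
    return [i for i in range(len(seq) - 5) if GATA_MOTIF.fullmatch(seq[i:i + 6])]


def categorize(mouse_aln, human_aln):
    """Assign the highest applicable category, checked in hierarchical order."""
    mouse_hits = motif_positions(mouse_aln)
    if not mouse_hits:
        return "no motif"
    human_hits = set(motif_positions(human_aln))
    if any(pos in human_hits for pos in mouse_hits):
        return "preserved"
    return "not preserved"


# Hypothetical aligned sequences; the mouse segment has two motif instances.
mouse = "CCAGATAAGGCCTGATAGCC"
human_one_kept = "CCAGATAAGGCCTGCTAGCC"   # first instance preserved
human_none_kept = "CCAGCTAAGGCCTGCTAGCC"  # both instances lost

print(categorize(mouse, human_one_kept))
print(categorize(mouse, human_none_kept))
print(categorize("CCCCCCCCCC", "CCCCCCCCCC"))
```

A segment with both a preserved and a lineage-specific instance is placed in the preserved category, matching the hierarchical assignment described in the Results.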

2.5 Data access

ChIP-seq data are deposited in GEO as the Series GSE51338, and they are available from the Mouse ENCODE portal (http://mouse.encodedcc.org); for review, the Username is “mouseencode” and the password is “reviewers”.

Table 2.1. ChIP-seq datasets

Feature                  Cell line    Number of peaks   Filename at UCSC Genome Browser
TAL1                     G1E-ER4+E2   4,915             wgEncodePsuTfbsG1eer4e2Tal1ME0S129InputPk.broadPeak.gz
GATA1                    G1E-ER4+E2   13,123            wgEncodePsuTfbsG1eer4e2Gata1aME0S129InputPk.broadPeak.gz
EP300                    MEL          31,342            wgEncodeSydhTfbsMelP300IggrabPkV2.narrowPeak.gz,
                                                        wgEncodeSydhTfbsMelP300sc584IggrabPk.narrowPeak.gz
H3K4me1                  G1E-ER4+E2   105,231           wgEncodePsuHistoneG1eer4e2H3k04me1ME0S129InputPk.broadPeak.gz
H3K4me3                  G1E-ER4+E2   72,495            wgEncodePsuHistoneG1eer4e2H3k04me3ME0S129InputPk.broadPeak.gz
H3K27me3                 G1E-ER4+E2   53,587            wgEncodePsuHistoneG1eer4e2H3k27me3ME0S129InputPk.broadPeak.gz
H3K9me3                  G1E-ER4+E2   69,929            wgEncodePsuHistoneG1eer4e2H3k09me3ME0S129InputPk.broadPeak.gz
H3K27ac                  G1E-ER4+E2   31,535            *submission in progress
DNaseI hypersensitivity  G1E-ER4+E2   93,705            wgEncodePsuDnaseG1eer4S129ME0Diffd24hPkRep1.narrowPeak.gz,
                                                        wgEncodePsuDnaseG1eer4S129ME0Diffd24hPkRep2.narrowPeak.gz

Note: G1E-ER4+E2: cells treated with estradiol for 24 hr


Chapter 3

Developing transfection methods for cell lines that allow interrogation of different aspects of erythroid differentiation

Statement of collaboration

Nergiz Dogan, the author of the thesis, performed most of the work. Christine Dorman contributed to some of the experiments described in this chapter, including transient transfections and enhancer assays.


3.1 Transient transfections to test CRMs

Genome-wide identification of functional DNA sequences requires computational and experimental methods. In contrast to protein-coding genes, systematic rules for genome-wide discovery of CRMs have not yet been elucidated, even though different predictive approaches are being explored (Hardison and Taylor 2012). Functional characterization of candidate regulatory elements, in particular those that show evidence of being acted upon biochemically, is essential to understand what distinguishes functionally active from inactive DNA segments. Moreover, functional assays can improve the predictive power of computational tools and provide insights as to how to test predicted CRMs more efficiently in high throughput assays.

One approach to testing for enhancer function is a gain-of-function assay using reporter genes. For decades, transient transfections have been used to ascertain whether a given DNA segment is capable of boosting the level of expression of a reporter gene after introduction into a host cell. There are well-known limits to the information revealed by this assay. Any information on tissue specificity is an inference based on the tissue of origin of the cell line and knowledge of the TF profile in the cell line. The tested DNA construct is removed from its chromosomal location and is tested adjacent to a promoter driving the reporter gene. The reporter gene construct does not integrate into the host cell chromosomes, but it does acquire histones and a chromatin structure. While these issues need to be kept in mind when interpreting the results of the transient transfection assay, it is important not to underestimate its value. The assay is specific; most DNA segments are not active in the assay. DNA segments active in this assay are often shown to be active in vivo in other studies. Thus the assay does have value in revealing whether a tested DNA fragment has the ability to regulate the level of expression of a target gene.

The results of the transient transfection assay are dependent on the trans-environment of the cells. This can add to the power of the assay. For example, transfection into a panel of cell types that vary, naturally or by engineering, in the transcription factors (TFs) present can lead to insights into the TFs and co-activators that can act on the tested DNA segment to affect levels of gene expression.

In this chapter of my thesis, I describe work aimed at developing transfection protocols for the cell lines comprising a GATA1 knockout and rescue system, G1E and G1E-ER4. These cells have a trans-environment closer to erythroid progenitors and erythroblasts than does the commonly used K562 cell line. This human cell line is derived from cancer cells of a patient with chronic myelogenous leukemia. While it shows some properties of megakaryocytic and erythroid progenitors, it does not differentiate strongly along either lineage. In contrast, the G1E cells have a transcriptional profile similar to erythroid progenitor cells (Pilon et al. 2011), and G1E-ER4 cells can differentiate into cells similar to maturing erythroblasts after addition of estradiol (Weiss et al. 1997, Welch et al. 2004). Importantly, if reliable transfection assays could be developed for these cell lines, experiments could be designed to examine the effects of loss and gain of the key TF GATA1. Activity dependent on the presence of GATA1 could be assayed directly.

This chapter is included to provide a record of my efforts developing these transfection protocols. The transfections do work, and the results obtained to date are informative (they are used in Chapter 4, for example). However, most of the intriguing results have not been investigated more thoroughly. The initial results could provide fuel for informative further research.

3.2 Transcription factors present in different cell line models of erythroid cells

Transcriptional regulation of gene expression during erythroid differentiation is critically dependent on the hematopoietic transcription factor GATA1. GATA2, another hematopoietic transcription factor, is crucial to the early stages of hematopoiesis and plays a significant role in the proliferation of the early precursors (Tsai et al. 1994). GATA2 shares a similar DNA binding domain and similar binding site preferences with GATA1 (Ko and Engel, 1993).

The complex of human β-like globin genes contains a cluster of developmentally regulated genes arranged in the order of 5’-ε-Gγ-Aγ-δ-β-3’, encoding HBE1 (the embryonic epsilon-globin), HBG2 and HBG1 (the fetal G-gamma- and A-gamma-globins), and HBD and HBB (adult delta- and beta-globin), respectively (Patrinos et al. 2004). In transient transfection of human K562 leukemia cells, the luciferase reporter gene is driven by the promoter of the HBG1 gene. This gene is expressed in K562 cells, which have erythroid features and are readily and efficiently transfectable (Benz et al. 1980). These cells also have both GATA1 and GATA2.

G1E is a mouse erythroid cell line derived from Gata1 knockout embryonic stem cells. Cells from the subline, G1E-ER4, have an estrogen-activatable form of GATA1 (GATA1-ER) (Welch et al. 2004) (Fig. 3.1). G1E and G1E-ER4 cells prior to treatment with estradiol have high levels of GATA2. Once GATA1-ER is activated by estradiol, the expression of the Gata2 gene is rapidly repressed, and the level of GATA2 plummets.

By performing experiments before and after restoration of GATA1 (see Methods in section 3.8.1), information on specific effects of GATA1 and GATA2 can be gleaned. Is the expression response dependent on GATA1 or GATA2? Is it independent of GATA factors? Is it repressed by GATA2 or GATA1-ER? Or is it activated by GATA1-ER upon estradiol treatment (+E2)? These are some of the questions that can be addressed using the assays described in this chapter.

[Figure 3.1 schematic: trans environment (constitutive and induced factors) and cis-regulatory regions driving the luciferase (LUC) reporter in G1E cells (GATA2 present), uninduced G1E-ER4 cells (GATA2 and GATA1-ER present), and G1E-ER4 cells induced with E2 (GATA1-ER+E2 present). Legend: G2, GATA2; G1, GATA1; G1-ER, GATA1-ER; G1-ER+E2, GATA1-ER+E2; E2, beta-estradiol. Other factors shown: TAL1-E47, NFE2, KLF1, FOG1. Motif shown: WGATAA.]

Figure 3.1 Summary of some differences in the panel of transcription factors in G1E and G1E-ER4 cells.

3.3 Erythroid enhancement with HBG1 promoter in K562 cells

In enhancer assays, predicted enhancers amplified from mouse DNA were inserted upstream of the firefly luciferase reporter gene. In almost all our previous experiments (Wang et al. 2006; Cheng et al. 2008) and in experiments from several other laboratories (Bresnick et al. 2005; Bresnick et al. 2010), the promoter used to drive expression is the promoter of the HBG1 gene. This promoter has a binding site for GATA1 (Fig. 3.2), and it is known to respond to erythroid enhancers.

[Figure 3.2 schematic of reporter constructs: HBG1 promoter-LUC; enhancers + HBG1 promoter-LUC; Vav2 promoter-LUC; enhancers + Vav2 promoter-LUC. Symbol legend: WGATAA (GATA1 motif).]

Figure 3.2 Illustration of constructs being tested

In addition to the firefly luciferase expression plasmid, a Renilla luciferase reporter vector was also used as a co-transfection control. Strong enhancers used as controls and for developing the assays in other cell types are GHP88 (mm9 assembly coordinates: chr7: 88863914-88864563) and GHP181 (mm9 assembly coordinates: chr7: 111009146-111010117). These are two GATA1-binding peaks previously identified from GATA1 ChIP-chip data, and shown to be active enhancers by transfection of K562 cells (Cheng et al. 2008). Expression of both firefly and Renilla luciferases was measured in at least quadruplicate determinations for each plasmid in at least three independent transfections. The Welch two-sample t-test was performed on the set of expression measurements for each predicted enhancer, comparing it to the set of values from the parental vector. The activity from a predicted enhancer is considered significant if its p-value is smaller than 0.05.
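The Welch test statistic and its approximate degrees of freedom (Welch-Satterthwaite) can be computed with the Python standard library alone, as sketched below. The measurements are hypothetical, and the sketch stops short of the p-value, which requires the Student t distribution; one option, assuming SciPy is available, is `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two means (sample variance / n).
    se2_a = variance(sample_a) / na
    se2_b = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


# Hypothetical firefly/Renilla ratios: enhancer construct vs parental vector.
enhancer = [9.0, 10.0, 11.0, 10.0]
parental = [4.0, 5.0, 3.0, 4.0]

t_stat, dof = welch_t(enhancer, parental)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

Unlike the pooled-variance t-test, this form does not assume that the enhancer and parental measurements share a common variance, which is a sensible default when constructs differ widely in expression level.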

[Figure 3.3: construct schematic (HBG1 promoter-LUC; enhancers GHP88 and GHP181 + HBG1 promoter-LUC; WGATAA, GATA1 motif) and plot of FF/REN (RLU) for HBG1, GHP88, and GHP181.]

Figure 3.3. Relative expression levels driven by HBG1 promoter and by GATA1-bound enhancers (GHP88 and GHP181), with HBG1 promoter in K562 cells.

As a positive control in tests of enhancer activity of predicted enhancers in K562 cells, reporter vectors including the HBG1 promoter without any target insert, or with the GATA1 hit peaks GHP88 and GHP181 inserted upstream of the promoter, were transiently transfected into K562 cells. The results confirm that GHP88 and GHP181 lead to a significant increase in the ratio of firefly luciferase to Renilla luciferase expression compared to the parental vector (only HBG1 promoter) (Fig. 3.3). Since K562 cells have both key hematopoietic transcription factors GATA1 and GATA2, we cannot examine specific effects of the GATA factors on expression in these cells.

3.4 GATA1 responsive expression from HBG1 promoter in G1E-ER4 system

The level of expression determined by the regulatory elements can be different depending on the cell types in which they are assayed. Therefore, and also in order to see specific effects of GATA1, the strong enhancers previously identified in K562 cells, GHP88 and GHP181, were tested for their ability to enhance expression from the HBG1 promoter by transfection into G1E and G1E-ER4 cells, the latter as a time course after activation of GATA1-ER. Expression from the HBG1 promoter alone increases after activating GATA1-ER, as expected for this GATA1-dependent promoter. Insertion of the GATA1-bound enhancers caused an increase in expression in G1E, showing that GHP88 and GHP181 can enhance expression even in the absence of GATA1 (Fig. 3.4). GATA2 may play a role in increasing expression from the HBG1 promoter in G1E cells, which have GATA2 but not GATA1 (Fig. 3.1). Furthermore, the expression levels driven by enhancers linked to the HBG1 promoter increase dramatically after activating GATA1-ER, as expected for GATA1-dependent activity for both enhancer and promoter (Fig. 3.4).

However, it is difficult to dissociate GATA1 effects on promoters versus enhancers using these constructs. While the expression from the enhancer-containing constructs increased dramatically after activation of GATA1, to a level far above that of the promoter alone, the ratio (fold enhancement) is not substantially increased.

[Figure 3.4: four panels (HBG1; GHP88 w/HBG1; GHP181 w/HBG1; Vav2) plotting FF/REN (RLU) in G1E cells and in G1E-ER4 cells at time points after estradiol treatment.]

Figure 3.4 Relative expression levels driven by HBG1 promoter without an insert and with GATA1-bound inserts GHP88 and GHP181, and Vav2 promoter without an insert in the G1E and G1E-ER4 cells, following a time course after activation of GATA1-ER.

Red color in the boxplots represents GHPs that drive statistically significant increased expression compared with the promoter without GATA1-bound insert.

3.5 GATA1-dependent enhancement with Vav2 promoter in G1E-ER4 system

Whereas the HBG1 promoter responds to GATA1 by itself, I discovered that the Vav2 promoter, which lacks a GATA1 binding site motif, does not (Fig. 3.4 and Fig. 3.5). Thus I investigated whether this promoter would be a more sensitive one for revealing GATA1-dependent enhancement.

CRMs previously predicted and identified as enhancers in K562 cells (R3, R8, R13, R10) (Wang et al. 2006) were cloned into luciferase expression constructs driven by the Vav2 promoter, and transfected into G1E and G1E-ER4 cells, before and after estradiol treatment. All four CRMs enhance expression from the Vav2 promoter even in the absence of GATA1 (in G1E cells), but the degree of enhancement increases dramatically when GATA1-ER is activated (Fig. 3.5).

[Figure 3.5: boxplots of FF/REN (RLU) for Vav2 only and for R3, R8, R13, and R10 on the Vav2 promoter, in G1E cells and in G1E-ER4 cells at time points after estradiol treatment.]

Figure 3.5 GATA1 dependent enhancement with Vav2 promoter after transfection into G1E and G1E-ER4 cells.

Red color in the boxplots depicts enhancers on the Vav2 promoter. Mm9 assembly coordinates: (R3) chr2:27199488-27199723; (R8) chr2:27252065-27252987; (R13) chr8:124837163-124837562; (R10) chr2:27184263-27184417.

3.6 Potential repressor effect of GATA1-ER on expression in G1E-ER4 system

Experiments conducted in the G1E system suggest that the GATA1-ER protein, prior to activation with the ligand, can repress expression. For several constructs with either the HBG1 promoter (Fig. 3.4) or the Vav2 promoter (Fig. 3.5), a lower level of relative expression (Firefly to Renilla Luciferase) was observed in un-induced G1E-ER4 cells (containing non-activated GATA1-ER) compared to G1E cells (containing no GATA1). This result was not obtained for all constructs, and more experiments are warranted to explore it more thoroughly.

3.7 Discussion

In this chapter, I describe the protocol (see Methods) and initial results for transfecting reporter genes into G1E and G1E-ER4 cells. Transfection of a predicted regulatory module in HBG1 promoter constructs into K562 cells or G1E-ER4 cells can reveal enhancers. Transfection of the constructs with the Vav2 promoter into G1E cells also shows enhancement, and this parental plasmid may be a more effective way to reveal GATA1-dependent enhancement.

These results are presented here as a record of efforts to develop the methods and to provide a starting point for future development. The assays in their current form are informative, as illustrated by their use in the next chapter.

3.8 Methods


3.8.1 Enhancer assays by transient transfection for G1E-ER4 cell system

The goal was to transfect predicted enhancers into G1E and G1E-ER4 cells quickly, accurately, and efficiently. Candidate enhancers were amplified from mouse DNA and inserted upstream of the firefly luciferase reporter gene driven by the HBG1 promoter (containing a WGATAA motif) or the Vav2 promoter (chr2: 27,282,306-27,283,216 (mm9)) (not containing a WGATAA motif). In addition to the firefly luciferase expression plasmid, a Renilla luciferase reporter vector was used as a co-transfection control. The test constructs were transiently transfected into G1E and G1E-ER4 cells in a 96-well plate using 0.14 µg of plasmid DNA containing the firefly luciferase reporter and 0.0007 µg of co-transfection control plasmid expressing Renilla luciferase in OptiMEM medium, adding 0.29 µl of DMRIE-C Reagent per well. The cells were plated at 4.6 × 10^4 cells per well. Each plasmid was transfected in quadruplicate wells for each experiment, and each was tested in at least two separate experiments.
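The per-well amounts above scale linearly when a master mix is prepared for replicate wells. The following is a minimal sketch of that arithmetic and is not part of the original protocol; the function name `master_mix` and the 10% pipetting overage are assumptions introduced for illustration.

```python
# Per-well amounts taken from the protocol text above.
PER_WELL = {
    "firefly_plasmid_ug": 0.14,
    "renilla_plasmid_ug": 0.0007,
    "dmrie_c_ul": 0.29,
}


def master_mix(n_wells, overage=0.10):
    """Scale the per-well amounts to n_wells plus a fractional overage.

    The overage (extra volume to cover pipetting loss) is an assumed
    convenience parameter, not a value from the protocol.
    """
    if n_wells <= 0:
        raise ValueError("n_wells must be positive")
    factor = n_wells * (1.0 + overage)
    return {name: amount * factor for name, amount in PER_WELL.items()}


if __name__ == "__main__":
    # Quadruplicate wells for one construct, as described in the protocol.
    for name, amount in master_mix(4).items():
        print(f"{name}: {amount:.4f}")
```

Setting `overage=0.0` returns the exact amounts for the requested number of wells.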

Rapidly growing G1E and G1E-ER4 cells, below passage number 20, were split into antibiotic-free medium 24 hours before transfection. The growth medium was Iscove’s Modified Dulbecco’s Medium (Invitrogen #12440) supplemented with 15% fetal bovine serum, 3 ml Kit Ligand (KL) conditioned medium, 6.2 ml of Monothioglycerol (MTG), and 100 ml Erythropoietin (EPO; 10,000 U/ml). Cells were plated in OptiMEM reduced serum medium (Invitrogen #31985) without antibiotics, supplemented with EPO, KL, and MTG per well.

For harvesting and the enhancer assay, 96-well plates were spun down in a centrifuge with plate holders at 1,500 rpm for 10 minutes. The remaining medium was vacuumed off, and the cell pellets were lysed in 50 µl of 1X passive lysis buffer per well. After the plates were vortexed with a plate vortexer, they were incubated for at least 15 minutes at room temperature. Luciferase Assay Reagent II (LARII) and Stop&Glo reagents (from Promega’s Dual Luciferase Assay Kit) were freshly prepared. 10 µl of cell lysate was transferred into a white 96-well plate. Firefly and Renilla luciferase activities were measured for each well using 50 µl of LARII per well, read for 6 seconds, then 50 µl of Stop&Glo per well, read for 6 seconds.
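The readout of this assay is the firefly signal normalized to the Renilla co-transfection control, compared with the parental promoter-only construct. A minimal sketch of that calculation follows. The function names and the example readings are hypothetical, and summarizing replicate wells by the median is an assumption based on the median values reported in Table 4.1.

```python
from statistics import median


def relative_expression(firefly, renilla):
    """Return the firefly/Renilla ratio for each well (paired readings)."""
    if len(firefly) != len(renilla):
        raise ValueError("firefly and renilla readings must be paired per well")
    return [ff / rn for ff, rn in zip(firefly, renilla)]


def fold_change(test_ratios, control_ratios):
    """Median ratio of the test construct over the promoter-only median."""
    return median(test_ratios) / median(control_ratios)


# Hypothetical luminometer readings for quadruplicate wells.
control = relative_expression([100, 110, 90, 100], [50, 55, 45, 50])
test = relative_expression([420, 400, 380, 400], [50, 50, 50, 50])
print(fold_change(test, control))
```

With these made-up readings the promoter-only wells all give a ratio of 2.0 and the test wells a median ratio of 8.0, so the fold change is 4.0.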


Chapter 4

Conserved Transcription Factor Occupancy and Enhancer Usage in

Different Cell Systems and Multiple Tissues

Statement of collaboration

Chapter 4 includes some key results used in a research article, “Principles of regulatory information conservation revealed by comparing mouse and human transcription factor binding profiles” (Cheng et al. 2014). A revised version of the manuscript has been resubmitted to Nature; a copy is in Appendix B. Nergiz Dogan, the author of the thesis, performed the transient transfection experiments on conserved GATA1 OSs in this chapter, including the cloning, transient transfections, and enhancer assays in the different cell transfection systems. She also contributed to the data analysis and interpretation. Transgenic mouse reporter assays were conducted in Len A. Pennacchio’s Laboratory. Weisheng Wu ran the ChromHMM analyses.


4.1 Introduction

Building a clearer understanding of the evolution of regulatory networks is essential for interpreting research on mouse model systems as it could apply to human biology and biomedical research. The broad landscape of regulatory networks is similar in mouse and human, but the details of regulatory processes at orthologous DNA segments can differ significantly (Yue et al. 2014). Transcription factor (TF) binding at some DNA segments is deeply preserved over evolution, e.g. across all eutherian mammals, while other binding sites are restricted to one or a few closely related species.

The functional interpretation of this phylogenetic diversity among TF occupied segments (TF OSs) is not clear, and likely will be complex. One interpretation of the TF OSs with limited conservation is that they are under relaxed or very low levels of purifying selection, and thus they have limited or no function. However, they could also be lineage-specific regulators, and results from the mouse ENCODE consortium (Yue et al. 2014) support this possibility at many occupied sites. Even at the subset of TF OSs whose DNA sequences are conserved between mouse and human (conservation of DNA sequence), binding by orthologous TFs is not conserved at most of these sites (Schmidt et al. 2010; Cheng et al. submitted). Occupancy by transcription factors and coactivators is a powerful, albeit imperfect, predictor of enhancers (Cheng et al. 2009; Visel et al. 2009; Blow et al. 2010). The level of overall sequence conservation (across an entire CRM, not limited to TF binding site motifs) is only weakly associated with enhancer activity (Chapter 2).

The roles of TF OSs that are preserved among many lineages are also not fully defined. Deep conservation of noncoding genomic DNA segments, even without evidence of TF binding, can be used to identify a subset of regulatory elements, such as developmental enhancers (Visel et al. 2008). Moreover, evolutionary constraint on enhancement activity brings about preservation of the binding site motif across mammals (Cheng et al. 2008; Dogan et al. submitted). Despite the abundance of TF OSs at which binding is not conserved between human and mouse, a substantial subset of TF OSs do show conservation of occupancy (Cheng et al. submitted). This subset contains many of the classically studied enhancers. Does the greater level of evolutionary constraint at these TF OSs reflect purifying selection on critical functions (reflecting a magnitude of functional contribution), or could it result from multiple layers of selection on pleiotropic functions (reflecting the number of functions)? Clearly, functional assays on candidate regulatory regions at which TF binding is conserved between human and mouse are crucial to resolve such questions.

In the common model for gene regulation, tissue-specific transcription factors activate target genes by binding to enhancers active in a specific tissue or in a small number of tissues. However, it is not known how frequently enhancers discovered in one tissue are also active in other tissues. This possibility is particularly intriguing for enhancers bound by TFs that are also members of a multi-protein family. For example, the mammalian genome encodes six GATA proteins (GATA1/2/3/4/5/6). These factors share a highly conserved DNA binding domain. The conserved region spans 109 amino acid residues, 75% of which are identical across the six GATA factors (Zhou et al. 2012). The preferred DNA binding site is almost identical for each factor. If an enhancer containing such a binding site were also used as an enhancer in another tissue, with the binding site being recognized by a paralogous GATA factor, then an additional layer of selection would be applied to this enhancer, i.e. activity in two (or more) tissues rather than one. Thus, utilization of an enhancer in multiple tissues could increase the evolutionary constraint on the TF occupied segment to the extent that binding of the TF to orthologous regions in other mammals is preserved.

Based on the observations above, we hypothesized that conservation of occupancy by GATA1, a master regulator of erythropoiesis, between mouse and human predicts enhancers involved in regulation in multiple tissues.

4.2 Contribution of conserved GATA1 occupancy to enhancer activity in different cell systems

The regulation of gene expression during erythroid differentiation is crucially dependent on the transcription factor GATA1. Enhancer activity is significantly correlated with evolutionary constraint that preserves the GATA factor binding motif across mammals (Fig. 4.1A). The previous work on this correlation (Cheng et al. 2008; Dogan et al. submitted) provides an opportunity to relate a preserved binding site motif to conserved occupancy. For this purpose, conservation of GATA1 occupancy between mouse and human was used to predict enhancers. Figure 4.1B shows an example of a DNA segment occupied by GATA1 in mouse cells: 24 hour-induced G1E-ER4 (an estradiol-dependent, GATA1-rescued subline of G1E, see Section 3.4), erythroblast, and MEL (murine erythroleukemia) cells. Figure 4.1C displays occupancy by GATA1 in human cells: K562 (human myelogenous leukemia line) and PBDE (peripheral blood-derived erythroblast) cells. The region illustrated in Figure 4.1 (hs1862) is one of the regions tested in enhancer assays (Tables 4.1 and 4.2).

Figure 4.1 Prediction of enhancers based on conserved GATA1 occupancy between mouse and human.

A. GATA factor binding motif preserved across mammals. B. GATA1 occupied segments in mouse cell lines: 24 hour-induced G1E-ER4, erythroblasts, and MEL cells. C. GATA1 occupied segments in human cell lines: K562 and PBDE cells.

We chose ten DNA segments bound by GATA1 in mouse whose orthologs in human are also bound by GATA1. Moreover, all ten conserved GATA1 OSs were in regions with DHS peaks. Five of these DNA segments were tested in transfection assays in four hematopoietic cell types: K562, MEL, G1E, and G1E-ER4 cells (see Section 3.4). The transfections revealed that 3 out of 5 conserved GATA1 occupied segments were active enhancers in at least two cell types. The distributions of results for the transfection assays are plotted in Figure 4.2, and the results are summarized in Table 4.1. The DNA segment hs1859 was active as an enhancer in all cell types except K562, while hs1862 acted as an enhancer in K562 and MEL cells and in G1E-ER4 cells only after induction. The segment hs1866 was also an active enhancer in late stage erythroblasts. Performing experiments before and after restoration of GATA1 revealed that the expression response for hs1859, hs1862, and hs1866 was dependent on GATA1. Expression constructs with these candidate enhancers showed an increased level of activity upon induction by estradiol treatment (+E2) to activate GATA1-ER (Fig. 4.2A, B; Table 4.1).

[Figure 4.2 image: boxplots. Panel A: K562 (left) and un-induced MEL (right); x-axis: hs1854, hs1858, hs1859, hs1862, hs1866; y-axis: fold change in activity. Panel B: one plot each for hs1854, hs1858, hs1859, hs1862, hs1866; x-axis: G1E and G1E-ER4 at 0h, 24h, 48h; y-axis: fold change in activity.]

Figure 4.2 Transient transfection results of 5 conserved GATA1 occupied segments.

A. The results of 5 DNA segments that are bound by GATA1 in mouse and human, transfected into K562 cells (left) and uninduced MEL cells (right). B. The enhancer activity response of the 5 elements tested in G1E cells and in G1E-ER4 cells that were un-induced (0h) or induced for 24 or 48 hours.


Table 4.1 Transient transfection results of 5 GATA1 occupied segments (from ChIP-seq in the 24 hour-induced G1E-ER4 cell line) conserved between mouse and human and tested in different cells.

Enhancer activity levels (median values) shown in red are enhancers, with at least a 2-fold change in activity, in the different cell lines. Values shown in pink fall in the threshold zone (1.5 ≤ fold change < 2). Blue indicates inactivity.

Fold change in activity (median) for conserved GATA1 OSs

ENCODE ID             EN64    EN68    EN69    EN72    EN76
VISTA ID              hs1854  hs1858  hs1859  hs1862  hs1866
K562                  0.75    0.48    1.49    2.06    0.68
Uninduced MEL         0.98    0.88    3.39    3.05    1.62
G1E                   1.24    1.23    2.49    1.36    1.83
Uninduced G1E-ER4     0.97    1.12    2.38    1.32    1.47
24h-induced G1E-ER4   1.32    1.17    2.95    2.90    2.13
48h-induced G1E-ER4   1.46    1.39    3.83    3.52    2.35
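The thresholds stated in the table legend can be applied mechanically to the median values above. The sketch below does so under the reading that at least 2-fold is an enhancer and 1.5-fold up to (but not including) 2-fold is the threshold zone. The function names are introduced here for illustration, and the code counts assay conditions (rows of the table) rather than distinct cell types.

```python
# Median fold changes from Table 4.1 (rows: assay condition; columns: segment).
SEGMENTS = ["hs1854", "hs1858", "hs1859", "hs1862", "hs1866"]
TABLE_4_1 = {
    "K562": [0.75, 0.48, 1.49, 2.06, 0.68],
    "Uninduced MEL": [0.98, 0.88, 3.39, 3.05, 1.62],
    "G1E": [1.24, 1.23, 2.49, 1.36, 1.83],
    "Uninduced G1E-ER4": [0.97, 1.12, 2.38, 1.32, 1.47],
    "24h-induced G1E-ER4": [1.32, 1.17, 2.95, 2.90, 2.13],
    "48h-induced G1E-ER4": [1.46, 1.39, 3.83, 3.52, 2.35],
}


def classify(fold):
    """Map a median fold change to the categories of the table legend."""
    if fold >= 2.0:
        return "enhancer"
    if fold >= 1.5:
        return "threshold"
    return "inactive"


def active_conditions(segment):
    """List the assay conditions in which a segment reaches 2-fold."""
    col = SEGMENTS.index(segment)
    return [name for name, row in TABLE_4_1.items()
            if classify(row[col]) == "enhancer"]


if __name__ == "__main__":
    for seg in SEGMENTS:
        print(seg, active_conditions(seg))
```

Under these thresholds, hs1859, hs1862, and hs1866 reach 2-fold in two or more conditions, while hs1854 and hs1858 do not reach it in any.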

4.3 Conserved occupancy is associated with enhancer activity in multiple tissues

TF OSs enriched for enhancer-related chromatin marks reveal that functionality of enhancers in multiple tissues is linked to conserved occupancy (Shen et al. 2012; Cheng et al. resubmitted).

Our hypothesis is that GATA1 occupied segments that are functional in multiple tissues are under increased selective pressure, and so their occupancy is more likely to be conserved. To test experimentally whether conserved GATA1 OSs are active in different tissues, the ten GATA1 OSs whose occupancy is conserved between mouse and human (previous section) were tested for the ability to drive reporter gene expression in transgenic mouse reporter assays at embryonic day 11.5 in Len Pennacchio’s Laboratory (http://enhancer.lbl.gov; Visel et al. 2007). Nine out of ten DNA segments showed reproducible enhancer activity in various tissues, such as heart, brain, liver, and blood vessels; however, they were hardly active in erythroid tissues such as fetal liver. Among the ten conserved GATA1 OSs tested, three were active only in midbrain and another three only in heart (Fig. 4.3A; Table 4.2).

The patterns of chromatin accessibility and histone modifications across mouse tissues provided additional strong evidence for enhancer activity in multiple tissues. The patterns of histone modifications were used to segment the mouse genome into chromatin states (using the program chromHMM; Ernst and Kellis 2012). Chromatin with histone modifications reflecting gene activation is depicted with warm colors (Mouse ENCODE Consortium submitted; Fig. 4.3B). All of the DNA segments with TF occupancy conserved between mouse and human showed both chromatin states and DNase hypersensitivity indicative of gene regulatory regions active in multiple tissues, such as heart, liver, and brain in addition to erythroid cells (Cheng et al. submitted). The patterns for two enhancers (hs1857 and hs1859) are illustrated in Figure 4.3B. For instance, hs1857 was in a region of active chromatin in MEL, heart, liver, and brain, and hs1859 was in active chromatin in MEL and brain. These patterns of active chromatin match the patterns of reporter gene expression in the transient transgenic mouse assay. We conclude that nine of the ten tested DNA segments with conserved occupancy are pleiotropic in the sense that they are active in multiple tissues.

Figure 4.3 Conserved GATA1-occupancy is linked to enhancement in multiple tissues and chromatin accessibility (Cheng et al. submitted).

A. Stained embryos with activity in different tissues (see Table 4.2). B. This panel represents genes, enhancers predicted by histone modifications, chromatin states, factor occupancy, and DNase hypersensitivity signals across different tissues for regions including hs1857 and hs1859, which are two GATA1 occupied segments conserved between mouse and human.

Table 4.2 Result of in vivo enhancer assay of 10 GATA1 occupied segments that are conserved between mouse and human.

ENCODE ID  VISTA ID  Chr    Start (mm9)  End (mm9)  Transgenic mice (E11.5)
EN64       hs1854    chr8   125235741    125237725  heart[7/8]
EN65       hs1855    chr11  102224347    102226111  negative
EN66       hs1856    chr11  105995398    105997153  heart[6/7]
EN67       hs1857    chr4   117501319    117504204  midbrain (mesencephalon)[5/5]
EN68       hs1858    chr2   103733167    103735821  heart[4/7], liver[4/7]
EN69       hs1859    chr7   116256647    116260817  neural tube[8/8], hindbrain (rhombencephalon)[5/8], midbrain (mesencephalon)[8/8], forebrain[8/8], heart[7/8], blood vessels[5/8], liver[4/8]
EN70       hs1860    chr2   167874564    167876364  midbrain (mesencephalon)[5/5]
EN72       hs1862    chr1   156885742    156887787  heart[6/6]
EN76       hs1866    chr19  37566512     37571839   blood vessels[5/5]
EN77       hs1867    chr2   69836566     69837905   midbrain (mesencephalon)[6/8]
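The tissue annotations in Table 4.2 have the form tissue[positive embryos/total embryos], so the tallies quoted in the text can be checked with a short script. The sketch below is an illustration only: the parsing function and regular expression are introduced here, and the minimum of three positive embryos follows the reproducibility criterion described in the Methods (Section 4.5.3).

```python
import re

# Annotations copied from Table 4.2, keyed by VISTA ID.
TABLE_4_2 = {
    "hs1854": "heart[7/8]",
    "hs1855": "negative",
    "hs1856": "heart[6/7]",
    "hs1857": "midbrain (mesencephalon)[5/5]",
    "hs1858": "heart[4/7], liver[4/7]",
    "hs1859": ("neural tube[8/8], hindbrain (rhombencephalon)[5/8], "
               "midbrain (mesencephalon)[8/8], forebrain[8/8], heart[7/8], "
               "blood vessels[5/8], liver[4/8]"),
    "hs1860": "midbrain (mesencephalon)[5/5]",
    "hs1862": "heart[6/6]",
    "hs1866": "blood vessels[5/5]",
    "hs1867": "midbrain (mesencephalon)[6/8]",
}

# Tissue name followed by [positive/total]; commas separate entries.
ENTRY = re.compile(r"([^,\[\]]+?)\s*\[(\d+)/(\d+)\]")


def parse_tissues(annotation, min_embryos=3):
    """Return {tissue: (positive, total)} for reproducible tissues."""
    tissues = {}
    for name, positive, total in ENTRY.findall(annotation):
        tissue = name.split("(")[0].strip()  # drop the parenthetical synonym
        if int(positive) >= min_embryos:
            tissues[tissue] = (int(positive), int(total))
    return tissues


parsed = {seg: parse_tissues(text) for seg, text in TABLE_4_2.items()}
positive = [seg for seg, tissues in parsed.items() if tissues]
heart_only = [seg for seg, t in parsed.items() if set(t) == {"heart"}]
midbrain_only = [seg for seg, t in parsed.items() if set(t) == {"midbrain"}]
print(len(positive), heart_only, midbrain_only)
```

An entry of "negative" contains no bracketed counts and therefore parses to an empty dictionary.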

Another aspect of pleiotropy is the number of TFs that bind to a particular DNA segment. We found that conserved TF occupied segments tended to be bound by multiple TFs, to a significantly greater extent than the DNA segments without conserved occupancy (p-value < 2.2e-16, two-tailed t-test; Fig. 4.4; Cheng et al. submitted). This implies that co-occupancy by multiple TFs is associated with a higher level of purifying selection on the sequences occupied by TFs.

Figure 4.4 Distribution of numbers of co-bound TFs and conserved occupancy (Cheng et al. submitted).

4.4 Discussion

The functional constraints that lead to conservation of TF occupancy are not understood, but identifying those constraints would lead to more meaningful interpretations of the diversity of conservation profiles observed for TF OSs. In this chapter, conserved GATA1 occupancy between human and mouse was used to predict candidate enhancer sequences. Two gain-of-function assays, transient transfection into hematopoietic cell lines and transgenic mouse reporter assay at E11.5, were used to test for enhancer function. The results of the enhancer assays revealed that conserved GATA1 occupancy between mouse and human is associated with pleiotropic functionality, in particular, with activity in various tissues. DNA segments where GATA1 binding is conserved at orthologous regions are involved in erythropoiesis and are active in a range of tissues, with the greatest number in heart, midbrain, and vasculature.

In addition, Cheng et al. have shown that TF OSs whose occupancy is conserved between mouse and human are co-bound by multiple TFs. This can be viewed as an additional form of pleiotropy. Remarkably, GATA1 OSs in chromatin that is accessible across more tissues display the most conserved occupancy between mouse and human (Cheng et al. submitted).

Enhancers discovered by their conservation of TF occupancy are active in multiple tissues, and these could be the tissues that express paralogs of the TF whose occupancy was initially studied. Specifically, enhancers discovered by conserved GATA1 binding are active in tissues that express other GATA factors (Fig. 4.5). These findings suggest that the same factor-dependent enhancer, bound by GATA1 in erythroid cells, not only can function to boost expression of target genes in those cells, but also is used in brain (bound by GATA3) and heart (bound by GATA4) to enhance expression of target genes. These target genes could be different from the ones targeted by the GATA1-bound enhancer in erythroid cells. The activity in various tissues does not need to be mediated by the same GATA family member. Paralogous GATA proteins that bind the same DNA motif, such as GATA5 or GATA6, possibly occupy the GATA1 OSs with conserved occupancy and pleiotropic functions.

Tissues (columns): Heart, Brain, Vasculature, Liver, Pancreas, Lung, Intestine, Ovary, Testis, Erythroid/Megakaryocyte, T-lymphocyte

GATA factor (rows)

GATA1 +

GATA2 + + +

GATA3 + + +

GATA4 + + + + +

GATA5 +

GATA6 + + + + + + +

FOG1 +

FOG2 + + + +

Figure 4.5 Enhancers discovered by GATA1 binding are active in tissues with

other GATA factors.

The enhancers predicted on the basis of conserved GATA1 occupancy were active in transient transfections at a high rate, especially when the host cells were MEL, G1E, and G1E-ER4 cells. These cell lines were developed as models for erythroid differentiation, and provide a versatile battery of cell environments in which to investigate many mechanisms of erythroid regulation. While the K562 cell line is readily transfectable and can reveal enhancer activity, it does not appear to be an ideal system in which to study this category of CRMs. While it is striking that the enhancers were not active in erythroid tissues in the transient transgenic mouse embryos, this more likely reflects a limitation in the assay rather than a failure of erythroid enhancement. Specifically, at the time of the assay (embryonic day 11.5), little hematopoietic activity is in the liver, and extraembryonic hematopoietic tissues such as the yolk sac are not examined in this assay. In future studies, it would be informative to investigate later stages of gestation, such as embryonic day 14.5, when the fetal liver is an active hematopoietic organ. However, this would require a different dissection and staining protocol, since at this stage the skin will obscure internal organs. Other animals, such as fish, could be informative recipients of the transgenes.

Other recent studies support this general model, in which activity in multiple tissues contributes to stronger selective constraint. Bidirectional FANTOM5 CAGE (Cap Analysis of Gene Expression) tags have been shown to strongly predict not only cell-type-specific enhancers but also ubiquitous enhancers, a subset of enhancers that are twice as conserved as cell-type-specific enhancers (Andersson et al. 2014).

In summary, the multiple layers of constraint from being used in multiple tissues may lead to stronger evolutionary pressure exerted as more purifying selection that in turn preserves TF occupancy in orthologous regions. The activity in multiple tissues may preclude the turnover in motifs and binding that is seen commonly in other cis-regulatory modules.

4.5 Methods

4.5.1 Functional assays in K562, G1E, and G1E-ER4 cell transfection systems

Five conserved GATA1 occupied segments (Table 4.1) were tested in enhancer assays in different cell transfection systems: K562, G1E, G1E-ER4, and MEL cells (Fig. 4.2) (see sections 2.4.2, 3.8.1, and 4.5.2 for details of the transient transfection protocols for K562, G1E and G1E-ER4, and MEL cells, respectively).

4.5.2 Enhancer assays by transient transfection for MEL cells

Five conserved GATA1 occupied segments were amplified from mouse DNA and cloned into a plasmid vector with the firefly luciferase reporter gene driven by the HBG1 promoter. The test constructs were transiently transfected into MEL (murine erythroleukemia) cells in a 96-well plate using 0.14 µg of plasmid DNA containing the firefly luciferase reporter and 0.0018 µg of co-transfection control plasmid expressing Renilla luciferase in OptiMEM medium, adding 0.14 µl of PLUS Reagent and 0.21 µl of Lipofectamine LTX per well. The cells were plated at 3.0 × 10^4 cells per well. Each plasmid was transfected in quadruplicate wells for each experiment, and each was tested in at least two separate experiments.

4.5.3 Transgenic mouse reporter assays

For transgenic mouse assays, ten conserved GATA1 OSs were amplified and cloned into a LacZ reporter vector with the minimal Hsp68 promoter. This part of the study was done in collaboration with Len Pennacchio at Lawrence Berkeley National Laboratory, ENCODE. The embryos were generated and assessed for reproducible LacZ activity at stage E11.5 (Visel et al. 2007). Candidate regions showing reproducible expression, meaning the same pattern in at least three independent transgenic mouse embryos, were identified as positive.


Chapter 5

Summary and Future Directions

One of the challenges of the post-genome era is understanding the mechanisms that control gene expression. Regulation of transcription occurs at two interconnected levels: transcription factors and the transcriptional apparatus, and chromatin and its regulators. The focus of this thesis was on the principles of transcriptional control by enhancers.

Genome-wide discovery of functional DNA sequences, such as enhancers, requires computational and experimental methods. In contrast to protein-coding genes, no systematic rules for genome-wide identification of CRMs have yet been elucidated, although different predictive approaches are being explored. Reported success rates of different approaches to identify enhancers reveal that the most reliable method is using enhancer-associated biochemical marks such as histone modifications and occupancy by transcription factors. However, currently available approaches, including those based on epigenetic features, are still imperfect (Visel et al. 2009; Blow et al. 2010; Hardison and Taylor 2012).

Functional characterization of candidate regulatory elements, in particular those that show evidence of being acted upon biochemically, is essential to understand what distinguishes functionally active from inactive DNA segments. For this purpose, in this thesis, we analyzed the contribution of epigenetic features, individually or in combination, to enhancer prediction. We found that the best discriminator was the enrichment of H3K27ac in the presence of EP300 binding, with a sensitivity of 74% and a specificity of 72%. However, other combinations including H3K4me1, H3K4me3, TAL1, GATA1 and/or EP300 gave very similar performances to H3K27ac plus EP300. These observations also revealed a striking finding: DNA segments enriched for the three activating histone marks (H3K4me1, H3K4me3, H3K27ac) in various combinations, yet in the absence of any transcription factor, showed no discriminatory power. Furthermore, many of these DNA segments are not in regions of open chromatin, suggesting that it is less likely that these regions are bound by other TFs not used in this study.
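For reference, sensitivity and specificity are computed from the counts of tested segments as follows. This is a minimal sketch; the counts in the example are hypothetical numbers chosen only so that the ratios come out to 74% and 72%, and they are not the counts from this study.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of active enhancers that the feature set predicts."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of inactive segments that the feature set rejects."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts for illustration only (not data from the thesis).
print(sensitivity(true_pos=37, false_neg=13))  # 37 of 50 active segments
print(specificity(true_neg=36, false_pos=14))  # 36 of 50 inactive segments
```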

Enhancers play a crucial role in regulating gene expression during mammalian development; nonetheless, finding the genes that enhancers control is complicated because enhancers can act independently of their location and are commonly distal from their target genes. The most widely used approach is to assign enhancers to the closest transcription start sites. In the current study, a newer method based on Enhancer-Promoter Units (EPUs) was used to link enhancers to their target genes. The results provided a list of target genes controlled by TAL1-bound enhancers validated in enhancer assays. This list includes not only hematopoiesis/erythropoiesis related genes, such as Gata1, Lmo2, Gypa, Gypc, Hhex, Hba-a1, Hba-a2, Hba-x, Hbq1, Hbb-b1, and Hbb-b2, but also non-hematopoietic genes, such as genes linked to neural lineages (Neurod2, Nrxn3, Dpf3, Lmo1) and a gene encoding a mitochondrial ribosomal protein, Mrpl32 (http://www.genecards.org/). Also, miR-144 and miR-451 were detected. miR-144 functions with miR-451 to control the expression of various genes associated with erythropoiesis. Moreover, transcription of the miR-144 microRNA precursor is thought to be activated by GATA4, which is involved in development of heart and liver, and miR-144 may be a potential therapeutic tool for treatment of ischemic heart disease (Zhang et al. 2010).
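The closest-TSS assignment mentioned above can be sketched in a few lines. This illustrates only that baseline approach, not the EPU method used in the study; the function name, gene names, and coordinates are hypothetical, and the sketch assumes all positions are on one chromosome and sorted.

```python
from bisect import bisect_left


def nearest_tss(enhancer_mid, tss_list):
    """Return the gene whose TSS lies closest to an enhancer midpoint.

    tss_list holds (position, gene) pairs sorted by position on a single
    chromosome. Ties are resolved in favor of the lower coordinate.
    """
    if not tss_list:
        raise ValueError("tss_list must not be empty")
    positions = [pos for pos, _ in tss_list]
    i = bisect_left(positions, enhancer_mid)
    # Only the neighbors on either side of the insertion point can be closest.
    candidates = tss_list[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda t: abs(t[0] - enhancer_mid))[1]


# Hypothetical TSS coordinates for illustration only.
tss = [(1000, "geneA"), (5000, "geneB"), (9000, "geneC")]
print(nearest_tss(5600, tss))
print(nearest_tss(7100, tss))
```

In this example the midpoint 5600 is assigned to geneB (600 bp away) and 7100 to geneC (1,900 bp away, versus 2,100 bp to geneB). Because the assignment uses distance alone, an enhancer that acts on a more distant gene would be assigned incorrectly, which is the limitation noted in the text.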

In this study, one of the notable findings is that GATA1 occupied segments conserved between human and mouse showed reproducible enhancer activity in vivo in various tissues, such as heart, brain, liver, and blood vessels, and also in erythroid-related cultured cells. Enhancers discovered by GATA1 binding are active in tissues with other GATA factors. This suggests that the same GATA factor-dependent enhancer may be used in erythroid cells (GATA1), heart (GATA4), and brain (GATA3) for different targets.

Functional assays can improve the predictive power of computational tools and provide insights as to how to test predicted CRMs more efficiently. Massively parallel reporter assays that have recently been developed enable in vivo functional testing of transcriptional regulatory elements at higher throughput. Thus, they will allow screening and dissection of the large variety of enhancers, as well as other regulatory elements, that have been identified by the ENCODE project. The National Human Genome Research Institute (NHGRI) launched the ENCODE (Encyclopedia of DNA Elements) Project in September 2003. The project aims to identify all functional elements in the human genome. All data from human and mouse generated by ENCODE are released into public databases (www.genome.gov/encode). As a result, detectable biochemical activity annotates the majority of the human genome. The key question here is “Where do genome-wide mapping studies go next?” The next step is going beyond releasing the data, for example integrating the data from the ENCODE project with other data sets, such as GWAS SNPs. Importantly, the map of functional elements provides a good source of annotation for analyzing the locations of GWAS single nucleotide variants, many of which are found in functional genome regions identified by ENCODE data.

Taken together, more experimental investigation will give a deeper understanding of the association of biological function of regulatory elements with biochemical activity. Thus, knowledge of regulatory sequences and their functional impact may provide novel approaches to discover potential therapies for human disease.

References

Achim K, Peltopuro P, Lahti L, Tsai H-H, Zachariah A, Åstrand M, Salminen M, Rowitch D, Partanen J. 2013. The role of Tal2 and Tal1 in the differentiation of midbrain GABAergic neuron precursors. Biology Open 2: 990-997.

Amano T, Sagai T, Tanabe H, Mizushina Y, Nakazawa H, Shiroishi T. 2009. Chromosomal dynamics at the Shh locus: limb bud-specific differential regulation of competence and active transcription. Dev Cell 16: 47-57.

Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, Chen Y, Zhao X, Schmidl C, Suzuki T, et al. 2014. An atlas of active enhancers across human cell types and tissues. Nature 507: 455-461.

Aparicio S, Chapman J, Stupka E, Putnam N, Chia JM, Dehal P, Christoffels A, Rash S, Hoon S, Smit A, et al. 2002. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297: 1301-1310.

Arnold CD, Gerlach D, Stelzer C, Boryń ŁM, Rath M, Stark A. 2013. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science 339: 1074-1077.

Arvey A, Agius P, Noble WS, Leslie C. 2012. Sequence and chromatin determinants of cell-type-specific transcription factor binding. Genome Res 22: 1723-1734.

Attanasio C, Nord AS, Zhu Y, Blow MJ, Li Z, Liberton DK, Morrison H, Plajzer-Frick I, Holt A, Hosseini R, et al. 2013. Fine tuning of craniofacial morphology by distant-acting enhancers. Science 342: 1241006.

Atchison ML. 1988. Enhancers: mechanisms of action and cell specificity. Annu Rev Cell Biol 4: 127-153.

Attanasio C, Reymond A, Humbert R, Lyle R, Kuehn MS, Neph S, Sabo PJ, Goldy J, Weaver M, Haydock A, et al. 2008. Assaying the regulatory potential of mammalian conserved non-coding sequences in human cells. Genome Biol 9: R168.

Banerji J, Rusconi S, Schaffner W. 1981. Expression of a β-globin gene is enhanced by remote SV40 DNA sequences. Cell 27: 299-308.

Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K. 2007. High-resolution profiling of histone methylations in the human genome. Cell 129: 823-837.

Benz EJ Jr, Murnane MJ, Tonkonow BL, Berman BW, Mazur EM, Cavallesco C, Jenko T, Snyder EL, Forget BG, Hoffman R. 1980. Embryonic-fetal erythroid characteristics of a human leukemic cell line. Proc Natl Acad Sci 77: 3509-3513.

Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, et al. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447: 799-816.

Blow MJ, McCulley DJ, Li Z, Zhang T, Akiyama JA, Holt A, Plajzer-Frick I, Shoukry M, Wright C, Chen F, et al. 2010. ChIP-seq identification of weakly conserved heart enhancers. Nat Genet 42: 806-810.

Bonifer C, Cockerill PN. 2011. Chromatin mechanisms regulating gene expression in health and disease. Adv Exp Med Biol 711: 12-25.

Bresnick EH, Martowicz ML, Pal S, Johnson KD. 2005. Developmental control via GATA factor interplay at chromatin domains. J Cell Physiol 205: 1-9.

Bresnick EH, Lee HY, Fujiwara T, Johnson KD, Keles S. 2010. GATA switches as developmental drivers. J Biol Chem 285: 31087-31093.

Canto AB, Katz SG, Orkin SH. 2002. Distinct domains of the GATA-1 cofactor FOG-1 differentially influence erythroid versus megakaryocytic maturation. Mol Cell Biol 22: 4268-4279.

Cao Y, Yao Z, Sarkar D, Lawrence M, Sanchez GJ, Parker MH, MacQuarrie KL, Davison J, Morgan MT, Ruzzo WL, et al. 2010. Genome-wide MyoD binding in skeletal muscle cells: a potential for broad cellular reprogramming. Dev Cell 18: 662-674.

Carroll SB. 2008. Evo-devo and an expanding evolutionary synthesis: a genetic theory of morphological evolution. Cell 134: 25-36.

Cheng Y, King DC, Dore LC, Zhang X, Zhou Y, Zhang Y, Dorman C, Abebe D, Kumar SA, Chiaromonte F, et al. 2008. Transcriptional enhancement by GATA1-occupied DNA segments is strongly associated with evolutionary constraint on the binding site motif. Genome Res 18: 1896-1905.

Cheng Y, Ma Z, Kim B-H, Cayting P, Boyle AP, Wu W, Sundaram V, Xing X, Dogan N, Li J, et al. 2014. Principles of regulatory information conservation revealed by comparing mouse and human transcription factor binding profiles. Nature: re-submitted.

Cheng Y, Wu W, Kumar SA, Yu D, Deng W, Tripic T, King DC, Chen KB, Zhang Y, Drautz D, et al. 2009. Erythroid GATA1 function revealed by genome-wide analysis of transcription factor occupancy, histone modifications, and mRNA expression. Genome Res 19: 2172-2184.

Cosma MP. 2002. Ordered recruitment: gene-specific mechanism of transcription activation. Mol Cell 10: 227-236.

Cotney J, Leng J, Oh S, DeMare LE, Reilly SK, Gerstein MB, Noonan JP. 2012. Chromatin state signatures associated with tissue-specific gene expression and enhancer activity in the embryonic limb. Genome Res 22: 1069-1080.

Creyghton MP, Cheng AW, Welstead GG, Kooistra T, Carey BW, Steine EJ, Hanna J, Lodato MA, Frampton GM, Sharp PA, et al. 2010. Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc Natl Acad Sci 107: 21931-21936.

de Kok YJ, Vossenaar ER, Cremers CW, Dahl N, Laporte J, Hu LJ, Lacombe D, Fischel-Ghodsian N, Friedman RA, Parnes LS, et al. 1996. Identification of a hot spot for microdeletions in patients with X-linked deafness type 3 (DFN3) 900 kb proximal to the DFN3 gene POU3F4. Hum Mol Genet 5: 1229-1235.

Denas O, Sandstrom R, Cheng Y, Beal K, Herrero J, Hardison RC, Taylor J. Genome-wide comparative analysis reveals human-mouse regulatory landscape and evolution. Genome Biol: submitted.

Dickel DE, Zhu Y, Nord AS, Wylie JN, Akiyama JA, Afzal V, Plajzer-Frick I, Kirkpatrick A, Göttgens B, Bruneau BG, et al. 2014. Function-based identification of mammalian enhancers using site-specific integration. Nat Methods 11: 566-571.

Dogan N, Wu W, Morrissey CS, Chen KB, Stonestrom A, Long M, Keller CA, Cheng Y, Jain D, Visel A, et al. 2014. Epigenetic and genetic features that lead to discovery of enhancer function. In prep

Elnitski L, Li J, Noguchi CT, Miller W, Hardison R. 2001. A negative cis-element regulates the level of enhancement by hypersensitive site 2 of the beta-globin locus control region. J Biol Chem 276: 6289-6298.

ENCODE Project Consortium. 2012. An integrated encyclopedia of DNA elements in the human genome. Nature 489: 57-74.

Ernst J, Kellis M. 2012. ChromHMM: automating chromatin-state discovery and characterization. Nat Methods 9: 215-216.

Ester M, Kriegel H-P, Sander J, Xu X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96) (ed. E Simoudis, J Han, U Fayyad), pp. 226-231. AAAI Press: Palo Alto.

Fang L, Ahn JK, Wodziak D, Sibley E. 2012. The human lactase persistence-associated SNP -13910*T enables in vivo functional persistence of lactase promoter-reporter transgene expression. Hum Genet 131: 1153-1159.

Filippakopoulos P, Knapp S. 2014. Targeting bromodomains: epigenetic readers of lysine acetylation. Nat Rev Drug Discov 13: 337-356.

Friedman JR, Kaestner KH. 2006. The FOXA family of transcription factors in development and metabolism. Cell Mol Life Sci 63: 2317-2328.

Fuda NJ, Ardehali MB, Lis JT. 2009. Defining mechanisms that regulate RNA polymerase II transcription in vivo. Nature 461: 186-192.

Furniss D, Lettice LA, Taylor IB, Critchley PS, Giele H, Hill RE, Wilkie AOM. 2008. A variant in the sonic hedgehog regulatory sequence (ZRS) is associated with triphalangeal thumb and deregulates expression in the developing limb. Hum Mol Genet 17: 2417-2423.

Gerasimova A, Chavez L, Li B, Seumois G, Greenbaum J, Rao A, Vijayanand P, Peters B. 2013. Predicting cell types and genetic variations contributing to disease by combining GWAS and epigenetic data. PLoS ONE 8: e54359.

Goldberg AD, Allis CD, Bernstein E. 2007. Epigenetics: a landscape takes shape. Cell 128: 635-638.

Goodson M, Jonas BA, Privalsky MA. 2005. Corepressors: custom tailoring and alterations while you wait. Nucl Recept Signal 3: 1-8.

Gorkin DU, Lee D, Reed X, Fletez-Brant C, Bessling SL, Loftus SK, Beer MA, Pavan WJ, McCallion AS. 2012. Integration of ChIP-seq and machine learning reveals enhancers and a predictive regulatory sequence vocabulary in melanocytes. Genome Res 22: 2290-2301.

Göttgens B, Ferreira R, Sanchez MJ, Ishibashi S, Li J, Spensberger D, Lefevre P, Ottersbach K, Chapman M, Kinston S, et al. 2010. cis-Regulatory remodeling of the SCL locus during evolution. Mol Cell Biol 30: 5741-5751.

Grant PA. 2001. A tale of histone modifications. Genome Biol 2: REVIEWS0003.

Gregory T, Yu C, Ma A, Orkin SH, Blobel GA, Weiss MJ. 1999. GATA-1 and erythropoietin cooperate to promote erythroid cell survival by regulating bcl-xL expression. Blood 94: 87-96.

Grosveld F, van Assendelft GB, Greaves DR, Kollias G. 1987. Position-independent, high-level expression of the human beta-globin gene in transgenic mice. Cell 51: 975-985.

Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. 2007. Quantifying similarity between motifs. Genome Biol 8: R24.

Hardison RC. 2000. Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet 16: 369-372.

Hardison RC. 2012. Genome-wide epigenetic data facilitate understanding of disease susceptibility association studies. J Biol Chem 287: 30932-30940.

Hardison RC, Taylor J. 2012. Genomic approaches towards finding cis-regulatory modules in animals. Nat Rev Genet 13: 469-483.

Hargreaves DC, Crabtree GR. 2011. ATP-dependent chromatin remodeling: genetics, genomics and mechanisms. Cell Res 21: 396-420.

He HH, Meyer CA, Shin H, Bailey ST, Wei G, Wang Q, Zhang Y, Xu K, Ni M, Lupien M, et al. 2010. Nucleosome dynamics define transcriptional enhancers. Nat Genet 42: 343-347.

Heintzman ND, Hon GC, Hawkins RD, Kheradpour P, Stark A, Harp LF, Ye Z, Lee LK, Stuart RK, Ching CW, et al. 2009. Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature 459: 108-112.

Heintzman ND, Stuart RK, Hon G, Fu Y, Ching CW, Hawkins RD, Barrera LO, Van Calcar S, Qu C, Ching KA, et al. 2007. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet 39: 311-318.

Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA. 2009. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 106: 9362-9367.

Holmqvist PH, Mannervik M. 2012. Genomic occupancy of the transcriptional co-activators p300 and CBP. Transcription 4: 18-23.

Holmqvist PH, Boija A, Philip P, Crona F, Stenberg P, Mannervik M. 2012. Preferential genome targeting of the CBP co-activator by Rel and Smad proteins in early Drosophila melanogaster embryos. PLoS Genet 8: e1002769.

Huang S, Brandt SJ. 2000. mSin3A regulates murine erythroleukemia cell differentiation through association with the TAL1 (or SCL) transcription factor. Mol Cell Biol 20: 2248-2259.

Hu G, Schones DE, Cui K, Ybarra R, Northrup D, Tang Q, Gattinoni L, Restifo NP, Huang S, Zhao K. 2011. Regulation of nucleosome landscape and transcription factor targeting at tissue-specific enhancers by BRG1. Genome Res 21: 1650-1658.

Janky R, van Helden J. 2008. Evaluation of phylogenetic footprint discovery for predicting bacterial cis-regulatory elements and revealing their evolution. BMC Bioinformatics 9: 37.

Jaenisch R, Bird A. 2003. Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nat Genet 33: 245-254.

Jenuwein T, Allis CD. 2001. Translating the histone code. Science 293: 1074-1080.

Jeziorska DM, Jordan KW, Vance KW. 2009. A systems biology approach to understanding cis-regulatory module function. Semin Cell Dev Biol 20: 856-862.

Jones FC, Grabherr MG, Chan YF, Russell P, Mauceli E, Johnson J, Swofford R, Pirun M, Zody MC, White S, et al. 2012. The genomic basis of adaptive evolution in threespine sticklebacks. Nature 484: 55-61.

Kadonaga JT. 2004. Regulation of RNA polymerase II transcription by sequence-specific DNA binding factors. Cell 116: 247-257.

Kaestner KH. 2010. The FoxA factors in organogenesis and differentiation. Curr Opin Genet Dev 20: 527-532.

Kassouf MT, Hughes JR, Taylor S, McGowan SJ, Soneji S, Green AL, Vyas P, Porcher C. 2010. Genome-wide identification of TAL1’s functional targets: insights into its mechanisms of action in primary erythroid cells. Genome Res 20: 1064-1083.

Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. 2002. The human genome browser at UCSC. Genome Res 12: 996-1006.

Kheradpour P, Ernst J, Melnikov A, Rogov P, Wang L, Zhang X, Alston J, Mikkelsen TS, Kellis M. 2013. Systematic dissection of regulatory motifs in 2000 predicted human enhancers using a massively parallel reporter assay. Genome Res 23: 800-811.

King DC. 2009. Phylogenetic conservation of cis-regulatory regions using sequence alignability and cladistic motifs. In Integrative Biosciences, Dissertation, pp. 1-91. The Pennsylvania State University.

King DC, Taylor J, Zhang Y, Cheng Y, Lawson HA, Martin J, ENCODE groups for Transcriptional Regulation and Multispecies Sequence Analysis, Chiaromonte F, Miller W, Hardison RC. 2007. Finding cis-regulatory elements using comparative genomics: some lessons from ENCODE data. Genome Res 17: 775-786.

Kleinjan DA, Lettice LA. 2008. Long-range gene control and genetic disease. Adv Genet 61: 339-388.

Kleinjan DA, van Heyningen V. 2005. Long-range control of gene expression: emerging mechanisms and disruption in disease. Am J Hum Genet 76: 8-32.

Ko LJ, Engel JD. 1993. DNA-binding specificities of the GATA transcription factor family. Mol Cell Biol 13: 4011-4022.

Kouzarides T. 2007. Chromatin modifications and their function. Cell 128: 693-705.

Kundaje A, Kyriazopoulou-Panagiotopoulou S, Libbrecht M, Smith CL, Raha D, Winters EE, Johnson SM, Snyder M, Batzoglou S, Sidow A. 2012. Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements. Genome Res 22: 1735-1747.

Lam EWF, Brosens JJ, Gomes AR, Koo C-Y. 2013. Forkhead box proteins: tuning forks for transcriptional harmony. Nat Rev Cancer 13: 482-495.

Lee TI, Young RA. 2013. Transcriptional regulation and its misregulation in disease. Cell 152: 1237-1251.

Lehmann OJ, Sowden JC, Carlsson P, Jordan T, Bhattacharya SS. 2003. Fox’s in development and disease. Trends Genet 19: 339-344.

Lettice LA, Heaney SJH, Purdie LA, Li L, de Beer P, Oostra BA, Goode D, Elgar G, Hill RE, de Graaff E. 2003. A long-range Shh enhancer regulates expression in the developing limb and fin and is associated with preaxial polydactyly. Hum Mol Genet 12: 1725-1735.

Levine M. 2010. Transcriptional enhancers in animal development and evolution. Curr Biol 20: R754-R763.

Lindblad-Toh K, Garber M, Zuk O, Lin MF, Parker BJ, Washietl S, Kheradpour P, Ernst J, Jordan G, Mauceli E. 2011. A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478: 476-482.

Loots GG, Kneissel M, Keller H, Baptist M, Chang J, Collette NM, Ovcharenko D, Plajzer-Frick I, Rubin EM. 2005. Genomic deletion of a long-range bone-enhancer misregulates sclerostin in Van Buchem disease. Genome Res 15: 928-935.

Louie MC, Yang HQ, Ma A-H, Xu W, Zou JX, Kung H-J, Chen H-W. 2003. Androgen-induced recruitment of RNA polymerase II to a nuclear receptor-p160 coactivator complex. Proc Natl Acad Sci USA 100: 2226-2230.

MacQueen JB. 1967. Some methods for classification and analysis of multivariate observations. In 5th Symposium on Mathematical Statistics and Probability (ed. LM Le Cam, J Neyman), pp. 281-297. Berkeley, CA: University of California Press,1.

Marinkovic D, Zhang X, Yalcin S, Luciano JP, Brugnara C, Huber T, Ghaffari S. 2007. Foxo3 is required for the regulation of oxidative stress in erythropoiesis. J Clin Invest 117: 2133-2144.

Martin D, Orkin S. 1990. Transcriptional activation and DNA binding by the erythroid factor GF-1/NF-E1/Eryf 1. Genes & Dev 4: 1886-1898.

Maston GA, Evans SK, Green MR. 2006. Transcriptional regulatory elements in the human genome. Annu Rev Genomics Hum Genet 7: 29-59.

Mathelier A, Zhao X, Zhang AW, Parcy F, Worsley-Hunt R, Arenillas DJ, Buchman S, Chen CY, Chou A, Ienasescu H, et al. 2014. JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Res 42: D142-D147.

Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, Reynolds AP, Sandstrom R, Qu H, Brody J, et al. 2012. Systematic localization of common disease-associated variation in regulatory DNA. Science 337: 1190-1195.

Melnikov A, Murugan A, Zhang X, Tesileanu T, Wang L, Rogov P, Feizi S, Gnirke A, Callan CG Jr, Kinney JB, et al. 2012. Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nat Biotechnol 30: 271-277.

Morrissey CS. 2013. Understanding the epigenetics of erythroid differentiation through the power of deep sequencing. In Bioinformatic and Genomics, Dissertation, pp. 1-118. The Pennsylvania State University.

Mortazavi A, Pepke S, Jansen C, Marinov GK, Ernst J, Kellis M, Hardison RC, Myers RM, Wold BJ. 2013. Integrating and mining the chromatin landscape of cell-type specificity using self-organizing maps. Genome Res 23: 2136-2148.

Mouse ENCODE Consortium. 2014. An integrated and comparative encyclopedia of DNA elements in the mouse genome. Nature: submitted.

Mouse ENCODE Consortium. 2012. An encyclopedia of mouse DNA elements (Mouse ENCODE). Genome Biol 13: 418.

Murtha M, Tokcaer-Keskin Z, Tang Z, Strino F, Chen X, Wang Y, Xi X, Basilico C, Brown S, Bonneau R, et al. 2014. FIREWACh: high-throughput functional detection of transcriptional regulatory modules in mammalian cells. Nat Methods, doi:10.1038/NMETH.2885.

Neely K, Workman J. 2002. Histone acetylation and chromatin remodeling. Which comes first? Mol Genet Metab 76: 1-5.

Newburger DE, Bulyk ML. 2009. UniPROBE: an online database of protein binding microarray data on protein-DNA interactions. Nucleic Acids Res 37: D77-82.

Noonan JP, McCallion AS. 2010. Genomics of long-range regulatory elements. Annu Rev Genomics Hum Genet 11: 1-23.

Odom DT, Dowell RD, Jacobsen ES, Gordon W, Danford TW, MacIsaac KD, Rolfe PA, Conboy CM, Gifford DK, Fraenkel E. 2007. Tissue-specific transcriptional regulation has diverged significantly between human and mouse. Nat Genet 39: 730-732.

Orphanides G, Reinberg D. 2002. A unified theory of gene expression. Cell 108: 439-451.

Patrinos GP, de Krom M, de Boer E, Langeveld A, Imam AM, Strouboulis J, de Laat W, Grosveld FG. 2004. Multiple interactions between regulatory regions are required to stabilize an active chromatin hub. Genes Dev 18: 1495-1509.

Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, Minovitsky S, Dubchak I, Holt A, Lewis KD, et al. 2006. In vivo enhancer analysis of human conserved non-coding sequences. Nature 444: 499-502.

Pennacchio LA, Bickmore W, Dean A, Nobrega MA, Bejerano G. 2013. Enhancers: five essential questions. Nat Rev Genet 14: 288-295.

Pilon AM, Subramanian SA, Kumar SA, Steiner LA, Cherukuri PF, Wincovitch S, Anderson SM, Mullikin JC, Gallagher PG, Hardison RC, et al. 2011. Genome-wide ChIP-Seq reveals a dramatic shift in the binding of the transcription factor erythroid Kruppel-like factor (EKLF) during erythrocyte differentiation. Blood 118: e139-e148.

Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. 2010. Detection of non-neutral substitution rates on Mammalian phylogenies. Genome Res 20: 110-121.

Porcher C, Swat W, Rockwell K, Fujiwara Y, Alt FW, Orkin SH. 1996. The T cell leukemia oncoprotein SCL/tal-1 is essential for development of all hematopoietic lineages. Cell 86: 47-57.

Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, et al. 2014. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res 42: D756-763.

Rada-Iglesias A, Bajpai R, Swigut T, Brugmann SA, Flynn RA, Wysocka J. 2010. A unique chromatin signature uncovers early developmental enhancers in humans. Nature 470: 279-283.

Rahl PB, Lin CY, Seila AC, Flynn RA, McCuine S, Burge CB, Sharp PA, Young RA. 2010. c-Myc regulates transcriptional pause release. Cell 141: 432-445.

Reeves R, Gorman CM, Howard B. 1985. Minichromosome assembly of non-integrated plasmid DNA transfected into mammalian cells. Nucleic Acids Res 13: 3599-3615.

Sagai T, Hosoya M, Mizushina Y, Tamura M, Shiroishi T. 2005. Elimination of a long-range cis-regulatory module causes complete loss of limb-specific Shh expression and truncation of the mouse limb. Development 132: 797-803.

Sanchez R, Zhou M-M. 2009. The role of human bromodomains in chromatin biology and gene transcription. Curr Opin Drug Discov Devel 12: 659-665.

Schmidt D, Wilson MD, Ballester B, Schwalie PC, Brown GD, Marshall A, Kutter C, Watt S, Martinez-Jimenez CP, Mackay S, et al. 2010. Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science 328: 1036-1040.

Schlaeger TM, Mikkola HK, Gekas C, Helgadottir HB, Orkin SH. 2005. Tie2Cre-mediated gene ablation defines the stem-cell leukemia gene (SCL/tal1)-dependent window during hematopoietic stem-cell development. Blood 105: 3871-3874.

Selth LA, Sigurdsson S, Svejstrup JQ. 2010. Transcript Elongation by RNA Polymerase II. Annu Rev Biochem 79: 271-293.

Shang Y, Myers M, Brown M. 2002. Formation of the androgen receptor transcription complex. Mol Cell 9: 601-610.

Shen Y, Yue F, McCleary DF, Ye Z, Edsall L, Kuan S, Wagner U, Dixon J, Lee L, Lobanenkov VV, et al. 2012. A map of the cis-regulatory sequences in the mouse genome. Nature 488: 116-120.

Shlyueva D, Stampfel G, Stark A. 2014. Transcriptional enhancers: from properties to genome-wide predictions. Nat Rev Genet 15: 272-286.

Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al. 2005. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15: 1034-1050.

Sikorski TW, Buratowski S. 2009. The basal initiation machinery: beyond the general transcription factors. Curr Opin Cell Biol 21: 344-351.

Smith AD, Sumazin P, Zhang MQ. 2005. Identifying tissue-selective transcription factor binding sites in vertebrate promoters. Proc Natl Acad Sci 102: 1560-1565.

Smith RP, Taher L, Patwardhan RP, Kim MJ, Inoue F, Shendure J, Ovcharenko I, Ahituv N. 2013. Massively parallel decoding of mammalian regulatory sequences supports a flexible organizational model. Nat Genet 45: 1021-1028.

Soler E, Andrieu-Soler C, de Boer E, Bryne JC, Thongjuea S, Stadhouders R, Palstra RJ, Stevens M, Kockx C, van Ijcken W, et al. 2010. The genome-wide dynamics of the binding of Ldb1 complexes during erythroid differentiation. Genes Dev 24: 277-289.

Spicuglia S, Kumar S, Yeh J-H, Vachez E, Chasson L, Gorbatch S, Cautres J, Ferrier P. 2002. Promoter activation by enhancer-dependent and -independent loading of activator and coactivator complexes. Mol Cell 10: 1479-1487.

Szutorisz H, Dillon N, Tora L. 2005. The role of enhancers as centres for general transcription factor recruitment. Trends Biochem Sci 30: 593-599.

Taatjes DJ. 2010. The human mediator complex: a versatile, genome-wide regulator of transcription. Trends Biochem Sci 35: 315-322.

Taylor J, Tyekucheva S, King DC, Hardison RC, Miller W, Chiaromonte F. 2006. ESPERR: learning strong and weak signals in genomic sequence alignments to identify functional elements. Genome Res 16: 1596-1604.

Thomas MC, Chiang CM. 2006. The general transcription machinery and general cofactors. Crit Rev Biochem Mol Biol 41: 105-178.

Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, Sheffield NC, Stergachis AB, Wang H, Vernot B, et al. 2012. The accessible chromatin landscape of the human genome. Nature 489: 75-82.

Tolhuis B, Palstra RJ, Splinter E, Grosveld F, de Laat W. 2002. Looping and interaction between hypersensitive sites in the active beta-globin locus. Mol Cell 10: 1453-1465.

Trainor CD, Ghirlando R, Simpson MA. 2000. GATA zinc finger interactions modulate DNA binding and transactivation. J Biol Chem 275: 28157-28166.

Tripic T, Deng W, Cheng Y, Zhang Y, Vakoc CR, Gregory GD, Hardison RC, Blobel GA. 2009. SCL and associated proteins distinguish active from repressive GATA transcription factor complexes. Blood 113: 2191-2201.

Tsai FY, Keller G, Kuo FC, Weiss M, Chen J, Rosenblatt M, Alt FW, Orkin SH. 1994. An early haematopoietic defect in mice lacking the transcription factor GATA-2. Nature 371: 221-226.

Tsiftsoglou AS, Vizirianakis IS, Strouboulis J. 2009. Erythropoiesis: model systems, molecular regulators, and developmental programs. IUBMB Life 61: 800-830.

Turner BM. 2005. Reading signals on the nucleosome with a new nomenclature for modified histones. Nat Struct Mol Biol 12: 110-112.

Tuteja G, Kaestner KH. 2007. SnapShot: forkhead transcription factors I. Cell 130: 1160.

Tuteja G, Kaestner KH. 2007. Forkhead transcription factors II. Cell 131: 192.

Vilar JM, Saiz L. 2005. DNA looping in gene regulation: from the assembly of macromolecular complexes to the control of transcriptional noise. Curr Opin Genet Dev 15: 136-144.

Visel A, Blow MJ, Li Z, Zhang T, Akiyama JA, Holt A, Plajzer-Frick I, Shoukry M, Wright C, Chen F, et al. 2009. ChIP-Seq accurately predicts tissue-specific activity of enhancers. Nature 457: 854-858.

Visel A, Minovitsky S, Dubchak I, Pennacchio LA. 2007. VISTA Enhancer Browser-a database of tissue-specific human enhancers. Nucleic Acids Res 35: D88-92.

Visel A, Prabhakar S, Akiyama JA, Shoukry M, Lewis KD, Holt A, Plajzer-Frick I, Afzal V, Rubin EM, Pennacchio LA. 2008. Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nat Genet 40: 158-160.

Visel A, Rubin EM, Pennacchio LA. 2009. Genomic views of distant-acting enhancers. Nature 461: 199-205.

Visel A, Zhu Y, May D, Afzal V, Gong E, Attanasio C, Blow MJ, Cohen JC, Rubin EM, Pennacchio LA. 2010. Targeted deletion of the 9p21 non-coding coronary artery disease risk interval in mice. Nature 464: 409-412.

Wallace JA, Felsenfeld G. 2009. We gather together: insulators and genome organization. Curr Opin Genet Dev 17: 400-407.

Wang H, Zhang Y, Cheng Y, Zhou Y, King DC, Taylor J, Chiaromonte F, Kasturi J, Petrykowska H, Gibb B, et al. 2006. Experimental validation of predicted mammalian erythroid cis-regulatory modules. Genome Res 16: 1480-1492.

Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, Pierce BG, Dong X, Kundaje A, Cheng Y, et al. 2012. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res 22: 1798-1812.

Wang H, Zang C, Taing L, Arnett KL, Wong YJ, Pear WS, Blacklow SC, Liu XS, Aster JC. 2014. NOTCH1-RBPJ complexes drive target gene expression through dynamic interactions with superenhancers. Proc Natl Acad Sci 111: 705-710.

Weiss MJ, Yu C, Orkin SH. 1997. Erythroid-cell-specific properties of transcription factor GATA-1 revealed by phenotypic rescue of a gene-targeted cell line. Mol Cell Biol 17: 1642-1651.

Welch JJ, Watts JA, Vakoc CR, Yao Y, Wang H, Hardison RC, Blobel GA, Chodosh LA, Weiss MJ. 2004. Global regulation of erythroid gene expression by transcription factor GATA1. Blood 104: 3136-3147.

White MA, Myers CA, Corbo JC, Cohen BA. 2013. Massively parallel in vivo enhancer assay reveals that highly local features determine the cis-regulatory function of ChIP-seq peaks. Proc Natl Acad Sci 110: 11952-11957.

Wittkopp PJ, Kalay G. 2012. Cis-regulatory elements: molecular mechanisms and evolutionary processes underlying divergence. Nat Rev Genet 13: 59-69.

Woolfe A, Goodson M, Goode DK, Snell P, McEwen GK, Vavouri T, Smith SF, North P, Callaway H, Kelly K, et al. 2005. Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol 3: e7.

Wozniak RJ, Keles S, Lugus JJ, Young KH, Boyer ME, Tran TM, Choi K, Bresnick EH. 2008. Molecular hallmarks of endogenous chromatin complexes containing master regulators of hematopoiesis. Mol Cell Biol 28: 6681-6694.

Wu C, Orozco C, Boyer J, Leglise M, Goodale J, Batalov S, Hodge CL, Haase J, Janes J, Huss JW, et al. 2009. BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biol 10: R130.

Wu W, Cheng Y, Keller CA, Ernst J, Kumar SA, Mishra T, Morrissey C, Dorman CM, Chen KB, Drautz D, et al. 2011. Dynamics of the epigenetic landscape during erythroid differentiation after GATA1 restoration. Genome Res 21: 1659-1671.

Wu W, Morrissey CS, Keller CA, Mishra T, Pimkin M, Blobel GA, Weiss MJ, Hardison RC. 2014. Dynamic shifts in occupancy by TAL1 are guided by GATA factors and drive large-scale reprogramming of gene expression during hematopoiesis. Genome Res: in press.

Xu J, Shao Z, Glass K, Bauer DE, Pinello L, Van Handel B, Hou S, Stamatoyannopoulos JA, Mikkola HK, Yuan GC, et al. 2012. Combinatorial assembly of developmental stage-specific enhancers controls gene expression programs during human erythropoiesis. Dev Cell 23: 796-811.

Yang GH, Wang F, Yu J, Wang XS, Yuan JY, Zhang JW. 2009. MicroRNAs are involved in erythroid differentiation control. J Cell Biochem 107: 548-556.

Zeggini E, Scott LJ, Saxena R, Voight BF, Marchini JL, Hu T, de Bakker PI, Abecasis GR, Almgren P, Andersen G, et al. 2008. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat Genet 40: 638-645.

Zentner GE, Tesar PJ, Scacheri PC. 2011. Epigenetic signatures distinguish multiple classes of enhancers with distinct cellular functions. Genome Res 21: 1273-1283.

Zhang X, Wang X, Zhu H, Zhu C, Wang Y, Pu WT, Jegga AG, Fan GC. 2010. Synergistic effects of the GATA-4-mediated miR-144/451 cluster in protection against simulated ischemia/reperfusion-induced cardiomyocyte death. J Mol Cell Cardiol 49: 841-850.

Zhang Y, Wu W, Cheng Y, King DC, Harris RS, Taylor J, Chiaromonte F, Hardison RC. 2009. Primary sequence and epigenetic determinants of in vivo occupancy of genomic DNA by GATA1. Nucleic Acids Res 37: 7024-7038.

Zhou P, He A, Pu WT. 2012. Regulation of GATA4 transcriptional activity in cardiovascular development and disease. Curr Top Dev Biol 100: 143-169.

Zhou Q, Li T, Price DH. 2012. RNA polymerase II elongation control. Annu Rev Biochem 81: 119-143.

Appendix A

Supplementary Figures and Tables

A. [Images: side views of four whole mouse embryos, labeled as follows.]

Tissue expressed    TAL1 ID (peak #)
Heart               TAL1-2105
Midbrain            TAL1-3553
Forebrain           TAL1-3467
Limb                TAL1-3897, 3898

B.

TAL1 peak ID  VISTA ID  Enhancer activity in K562 (fold change)  Tissues with enhancer activity in mouse embryos, E11.5 [reproducibility]
TAL1_1578     mm104     Enhancer (10.1±1.3)                      Heart [7/7]; melanocytes [5/7]; liver [3/7]
TAL1_1496     hs796     Enhancer (6.3±0.9)                       Forebrain [4/5]
TAL1_2105     mm291     Enhancer (6.3±1.3)                       Heart [6/7]
TAL1_2302     hs1866    Enhancer (5.8±1.5)                       Blood vessels [5/5]
TAL1_1123     hs1466    Enhancer (4.5±0.6)                       Neural tube [8/8]; hindbrain (rhombencephalon) [8/8]; midbrain (mesencephalon) [8/8]; dorsal root ganglion [7/8]; forebrain [6/8]; limb [8/8]; branchial arch [8/8]; heart [6/8]
TAL1_2750     hs1860    Enhancer (3.9±0.4)                       Midbrain (mesencephalon) [5/5]
TAL1_3467     hs840     Enhancer (3.2±0.4)                       Forebrain [10/10]
TAL1_1020     hs1385    Inactive (1.4±0.4)                       Hindbrain (rhombencephalon) [3/5]; midbrain (mesencephalon) [4/5]
TAL1_250      hs1862    Inactive (0.4±0.2)                       Heart [6/6]

Supplemental Figure 2.1 TAL1-bound enhancers

A. Four examples of side views of whole mouse embryos with in vivo enhancer activity at E11.5. Tested DNA segments that are bound by TAL1 are enhancers in different tissues. B. Comparison of the results of 9 DNA segments bound by TAL1 peaks tested in two enhancer assays (Supplemental Tables 2, 3).

A. [Bar chart: number of DNA segments transiently transfected into K562 (x-axis, 0 to 250) with (+) or without (-) each of eight features: TAL1, GATA1, EP300, K4me1, K4me3, K27ac, K27me3, K9me3.]

B. [Precision-recall plot: Precision (y-axis, 0.0 to 1.0) versus Recall (x-axis, 0.0 to 1.0). Labeled feature sets: T+no histone mark; T+G+P+4m1+27ac; T+27ac; T+G+P+4m1+4m3+27ac; T+4m1+27ac; T+G+4m3; no TF+4m1. Key: T=TAL1, G=GATA1, P=EP300, 4m1=H3K4me1, 4m3=H3K4me3, 27ac=H3K27ac.]

C.

Feature       Enhancer (#)  Threshold (#)  Inactive (#)  Total (#)
TAL1 (+)      67            19             37            123
TAL1 (-)      23            27             100           150
GATA1 (+)     76            26             49            151
GATA1 (-)     14            20             88            122
EP300 (+)     69            21             35            125
EP300 (-)     21            25             102           148
K4me1 (+)     80            36             90            206
K4me1 (-)     10            10             47            67
K4me3 (+)     62            27             60            149
K4me3 (-)     28            19             77            124
K27ac (+)     80            31             68            179
K27ac (-)     10            15             69            94
K27me3 (+)    10            6              32            48
K27me3 (-)    80            40             105           225
K9me3 (+)     3             5              13            21
K9me3 (-)     87            41             124           252

The (+) and (-) rows for each feature sum to 273 DNA segments.

Supplemental Figure 2.2 The meta-data, 273 mouse DNA segments

A. For a compiled set of 273 mouse DNA segments tested for enhancer activity in transiently transfected K562 cells, the number with a significant signal for each of eight epigenetic features is plotted. Many DNA segments carry more than one feature (Supplemental Table 4). B. Precision-recall plot for DNA segments with or without different sets of features (Supplemental Table 6). C. Categorization of the DNA segments (from the meta-data; 273 in total) by their response in the enhancer assay, in the presence or absence of each of the 8 epigenetic features individually (Supplemental Table 4).
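The counts in panel C are sufficient to recompute a precision and a recall value for each single feature. The following is a minimal Python sketch of that arithmetic, not the procedure used to draw panel B. It assumes that only segments scored as Enhancer count as true positives and that Threshold segments are grouped with the non-enhancers; the names COUNTS and precision_recall are hypothetical.

```python
# Counts copied from panel C as (enhancer, threshold, inactive)
# for segments with (+) and without (-) each feature.
COUNTS = {
    "TAL1":   {"+": (67, 19, 37), "-": (23, 27, 100)},
    "GATA1":  {"+": (76, 26, 49), "-": (14, 20, 88)},
    "EP300":  {"+": (69, 21, 35), "-": (21, 25, 102)},
    "K4me1":  {"+": (80, 36, 90), "-": (10, 10, 47)},
    "K4me3":  {"+": (62, 27, 60), "-": (28, 19, 77)},
    "K27ac":  {"+": (80, 31, 68), "-": (10, 15, 69)},
    "K27me3": {"+": (10, 6, 32),  "-": (80, 40, 105)},
    "K9me3":  {"+": (3, 5, 13),   "-": (87, 41, 124)},
}


def precision_recall(feature, counts=COUNTS):
    """Return (precision, recall) for predicting enhancers from one feature.

    Assumption: Threshold segments are treated as non-enhancers.
    """
    with_feature = counts[feature]["+"]
    without_feature = counts[feature]["-"]
    true_positives = with_feature[0]                      # enhancers carrying the feature
    predicted = sum(with_feature)                         # all segments carrying the feature
    all_enhancers = with_feature[0] + without_feature[0]  # enhancers in the whole set
    return true_positives / predicted, true_positives / all_enhancers


for name in COUNTS:
    precision, recall = precision_recall(name)
    print(f"{name:7s} precision={precision:.2f} recall={recall:.2f}")
```

Excluding the Threshold segments instead of counting them as negatives would raise every precision value, so the choice should be stated whenever such numbers are compared with panel B.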

A. [Plot: G+C content (%) (y-axis, 0 to 80) for four data sets: Enh, Inac, Pos, Neg.]

B. [Histograms of enriched motifs: Frequency (y-axis) versus DME score (x-axis) in four panels: Enhancer and Inactive (transient transfection), Positive and Negative (transgenic mice).]

Supplemental Figure 2.3. Motif analyses

A. Distribution of G+C content (%) in four data sets: enhancers and inactive regions identified in transient transfection assays, and positives and negatives from transgenic mouse assays. B. Frequency of 200 motifs by DME score. The top 10 motifs (5% of the 200 motifs) were used in further analyses (Supplemental Tables 7-10).
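The G+C content shown in panel A is the percentage of G and C bases in a sequence. The short Python sketch below shows that calculation; the function name `gc_percent` is hypothetical, and the example input is the forward primer for TAL1 peak 842 from Supplemental Table 2.1, used only as a convenient sequence rather than as one of the segments analyzed in the figure.

```python
def gc_percent(seq):
    """Return the G+C content of a DNA sequence as a percentage."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


if __name__ == "__main__":
    # Forward primer for TAL1 peak 842 (Supplemental Table 2.1), illustration only.
    print(gc_percent("CTGGAAGGATTCACCCTGAC"))
```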


A.

[Panel A: frequency of TAL1 OSs (y-axis) by conservation score (x-axis). Rows: transient transfection (top), transgenic mice (bottom). Columns: PhastCons score (0.0 to 1.0), PhyloP score (0.0 to 2.5).]

B.

[Panel B: conservation score (y-axis; PhastCons score, left; PhyloP score, right) against fold change in activity (log2; x-axis, -2 to 4).]


Supplemental Figure 2.4. Conservation scores

A. Frequency of TAL1 OSs tested in the two enhancer assays, by PhastCons and PhyloP score. B. Relationship between the conservation scores, PhastCons (left) and PhyloP (right), and the enhancer activity of TAL1 OSs in the transient transfection assay.

[Bar chart: % of clustered TAL1 OSs within EPUs (x-axis, 0 to 100) for each cluster (y-axis). Bar labels:]

Cluster  Label
C1       1127/1567
C2       684/882
C3       478/616
C4       321/386
C5       182/220
C6       303/400
C7       219/343
C8       191/234

Supplemental Figure 2.5 Percentage of TAL1 OSs within enhancer-promoter units in each cluster.
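The bar labels in this figure can be turned into percentages with a few lines of Python. The sketch below assumes that each label reads as "TAL1 OSs within EPUs / TAL1 OSs in the cluster"; that reading is inferred from the axis title and is not stated explicitly in the source. The names `EPU_LABELS` and `epu_percentages` are hypothetical.

```python
# Bar labels transcribed from Supplemental Figure 2.5.
# Assumed meaning: (TAL1 OSs within EPUs, TAL1 OSs in the cluster).
EPU_LABELS = {
    "C1": (1127, 1567),
    "C2": (684, 882),
    "C3": (478, 616),
    "C4": (321, 386),
    "C5": (182, 220),
    "C6": (303, 400),
    "C7": (219, 343),
    "C8": (191, 234),
}


def epu_percentages(labels):
    """Return the percentage of TAL1 OSs within EPUs for each cluster."""
    return {cluster: 100.0 * inside / total
            for cluster, (inside, total) in labels.items()}


if __name__ == "__main__":
    for cluster, pct in epu_percentages(EPU_LABELS).items():
        print(f"{cluster}: {pct:.1f}%")
```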


Supplemental Figure 2.6 UCSC genome browser views of two regions including two validated TAL1-bound enhancers

A. TAL1 peak 793 and B. TAL1 peak 2302.


Supplemental Table 2.1 Primer information on 70 TAL1 OSs transiently transfected into K562 cells.

The 70 TAL1-bound regions were tested for enhancer activity by dual luciferase reporter assay. The forward and reverse primers used to amplify the TAL1-bound DNA segments are listed.

TAL1 peak ID  Forward  Reverse
43  TGTGGCAGAGTTAGTGTTTATGAGGAAC  ACTCAAGGAAGCCTCCTAGTAATC
63  GGGTCCTTTACAGGTTCTCAAATTGT  TTGCCACGTAACTACTGACACCGA
127  CAGCTCGATCTATCAATCCACAG  TCCGATGGCTTGCTCCTGCTAT
184  TTCCTTGTTGGAAGCCACAACTGC  CCCAAAGCTCATCAAGGAAGGTAG
196  AGAAGGTGTCCGCCTCTCTCTACA  TCCGAGTCCTGTTCCACGTCTGATT
201  ACTGTGCTTGGAGAGTCCCAGAAA  TTACAAACCCAGGTCTCACCCTCA
244  TTACATACGCACCACTGAGTGAGGAC  TGTAGGCTCTGTTGCAACCTGT
250  CCCTGCCTTTCCTCTGTCATAACT  AGGATGAACTCACCTGGCGTCTAT
379  ATACCAACCGTATTTCCCGTGC  TACACATTCCAGCAAGGAGATGCC
541  TCCAACACCCTCTTCTGCTGTCAT  AGACTGTACACTCACATAGGCACG
612  AAGTGAGCACTGACTACTGAAA  CCTGGCTCCTTGACTCATATT
734  GCTTCTCCAACATTGCCCTGCAAA  GCTGGTTTGTGGATTCTGCCCTTT
758  ATATCCTTGTGCTTCTGACCCTCC  GTAAGATGGAGCTCTTGGTGTCCT
840  GCAGACAGTCATTAGCCTGAAA  CTGCGCTCTGCATCTTTAGT
842  CTGGAAGGATTCACCCTGAC  CGTCCACCATTGCTTCTCT
987  ATCAGTTAAGCCCGCCCTCTTGAT  CCCAGCTGACTATAAAGGGACTAGGA
1020  CCTTGTGAACTTGGAAGAGTCGGT  ATACCAGGGAGACAAGAAAGCTGC
1025  AGGTTAGTTCAGGTAGCCCTGTCA  TCTCCTTTGAGAAACAGGCTGAGT
1050  ATTGAAGGCCACAGCAACTA  CATTCTCAGCGCTAGGATTACA
1123  GCAGCGGCTGTCCTTTCTTCT  TGGCCGGTTCCAGGAATGTGTTCT
1195  AACACAAGCCAAATGTGCTGGG  GGTTGTTTCCATTTCCTGTCTAAGGC
1464  AATTTCTCACATTCCTCCTCCGGG  TCCTACACTCTCACCCAATCCTGA
1496  ACAGACTTCGGAGATGCCTT  GGCCCAAATAGATGAATACCGAGAC
1506  TGAACTTCCAGGCCATCTGCCTTT  TTGATTGTGGAGTGCTGCATTCGG
1515  CTGGTCTGTAGGTGGTGAAAG  CATTGAGCAAGCAAACTGGAG
1529  AAACACGTGACCTCCATGGTCTGT  CCGCCTGTCTCCTCACTAACATAGA
1578  AACTTCTGCTGGAGCCACTTACCA  ATACAGTGAGACATGGAGCACCCA
1742  ATGTGCCTGACTGGACTCACTCCT  CAGCAAGGCTGTAATCTCTGTTCAGC
1774  TACCACAGAAAGAAGGCAGGAGCA  TCTTCACTCCCACCAACAGCATGA
1796  GAGAGCGGTGAGGAAGACAGTAAA  CAGGAGGAAATGTAGCTCTGTGGT
1799  AGTCAACCTTTAAGGGAGCAGACA  TGTGCAATTTGAGGTGAACATGGG
1863  GCGCGCAGGAGAAGTTGCTATAAA  TGGCTGCTGGGTAGAGATTGACTA
2105  ATGATGCTGTCAAGCAGGGTGTGT  AAGCAGAGAGTAACTGTGACGGGT
2135  AGGACTCTCAGCTTTGTGTATGT  TGAAAGCAGCGAGCAGATATGCAG
2190  GTCCAGGAATCCTAACAAAGGCAC  GTTCAAGTTGAGGGTAGGAAATGGC
2302  GAGCAACTTTGCCTCAAGTTC  CAGGCTCCACAACTCTTGAT
2311  TGGACTTCACGGCTCTAAATGCCT  TGGCTCAGAGTGCTTGCTCTTCAT
2434  CTTGGGCACCCAGGACGAACTTT  GTTGGTTAAGAGTGATCTGTACTGGCCC
2471  TTTCTCTTACTGGGCATCTCCTGACC  TGTTCTCACAGCTTCTCCTCGGAT
2578  GAGACACTGGTTCATTCCTCAA  GCTATGTCCCTTGTATTCCTAGTT
2747  TGTTAGTGCGAGCACCCAAACTCA  AATCGTTGACTCTCCATGAGGCCACT
2750  AGGCTACACAGCAGTGTTCTCTTG  CGCGCAGCGCTAAGTAAACAGAAA
2794  TCAGACTACAGTGAGGGACACTTG  GGTCTTCAGACACATCAGAAGAGGGT
2950  ATGGAGGCACTCTGGTGGAACATT  TCTAGCAGAGGTCTCAATGCTGGT
2980  CCCGGCTATAACTTTAACCTTGAG  GGTTCAGTGAGGCACCCTATCTAA
3022  GCTTCAATTCAGTTTAGAGTCCAGGC  AGGGAGCTGGGATAAAGACAGGA
3158  AGAGACTCTAGATCTGCTAGTCAGGG  CCCTTCTTGTCACTCCAGGAAGTC
3278  TGTTGTCCTTGAGAGGGAGCAGTT  TTTGGAGCTTCCACCGATTCTCCT
3365  CCACTGTAAGAGCAGTACATACTTGG  CCCTAGTCAGCAGAAAGTCGTCTA
3461  AGAGGGCACTGGTAAGTCCCACA  TAATTGCTTATTTGCCACGGGATATCTAC
3467  TTCTGGTGTAAACCCTTCCCTGGT  CTCACACCCAAGCACACATGCAAT
3657  CATAACTGCCTTCTTCACCCAGCA  AGAGTCTCAACTGGCCTTTGGC
3663  GCTGCTGTGTGAATGCTCTTTGTG  TGGGTTAGGAGGAAGAAGACTGGA
3800  CTGCTGCTGCTGCTTCTTCTTCTT  ACATTTGGGAAAGTGGTCCTGCTG
3879  TGCCAGTGAGACCCAGGGATATTT  CCATGCCTGACTTTCAGAAGCACA
3920  GGGCATTGATGGAGTTAACAGA  TGGCTGGCTCTCTTGAAATG
3953  GGTATCGTGGACCTTGCTCAAGAAGT  TTTGAGGCAAGCCCAAATTCCTGC
3960  GGGACAAAGGCATGAACCACCTTT  TGCATGCGTGTGTATGAATGTGCG
4157  CATTTACGGGACACCTCAAACACC  TGGGACCCATGCCTCACATATACT
4199  AACAGGGTACCAGTGAATGTCTGC  TGCATCTTGACTAGTTCCCACACC
4249  TAAGCACCCAGCCTGGCATT  ACTAGAGGATGTGTCGGATGGGAT
4361  AATGCTAACCAGGCCTCAAACCTC  GCAGAGATAGCCTGTGTTCTGTGT
4371  TGGCATGGTTAATTCCTTGGGCTG  AGGTGTCCCAGTGAGTTTAGCCTT
4423  CCGCAGCTTAATAAGGGCTGATTTCC  GATTCAGCCCAGCCATAGAGCAAT
4652  CCCAAGATGCCCAGAATCC  CATCTCCTTTGGGCTGCTTA
4769  CCAGCAGATTGAAAGTTTAGTGCAGACG  ATCCACACCCACCCTTGTACACTT
4783  AGTCACACGGGCAAGCAAATGAAG  GCATGTGTACATGTGGTTATATATGTGCAG
4816  AAGGCTTGGGCTAGAGAGATGGTT  AACACTCATGGAGGACAGAAGCAG
4851  TGGCTGGCAGGATTCCTTGGTTAT  ACAAGGGATGAGTCAGCACTCAGA
4858  GAGAAGTAGAAATAGAGACTAGATTGGGTG  AGCCTTCTTCCCAGCTTCTCCTTT


Supplemental Table 2.2 Individual transient transfection results on 70 TAL1-bound regions

Fold changes in firefly luciferase activity relative to the parental vector, normalized to the co-transfected Renilla luciferase control, are listed.

Chr (mm9)  Start  End  TAL1 peak ID  Individual transfection results on TAL1 OSs
chr1  132196529  132197197  184  19.7 20.3 24.7 23.2 19.9 19.0 24.6 23.4 22.8 18.9 23.1 23.1 24.3 20.2 23.7 23.7
chr13  14634853  14635762  1195  12.5 15.4 16.5 11.5 12.6 12.4 16.0 12.7 12.9 13.1 13.1 17.4 11.6 13.1 13.5 18.2
chr12  32734927  32735442  1025  11.8 11.7 11.6 14.5 11.5 11.6 11.7 14.2 8.3 10.3 15.3 10.1 8.5 10.2 15.6 10.0
chrX  7418047  7418661  4858  13.0 12.4 10.0 12.4 13.0 12.6 10.2 12.6 9.7 8.6 8.9 11.4 9.5 8.7 8.9 12.1
chr15  66657520  66658263  1578  12.1 11.8 9.5 10.7 12.3 12.0 9.5 10.7 8.8 9.4 8.9 11.3 8.7 9.4 8.8 11.2
chr3  146068529  146069275  3022  10.7 11.6 10.6 8.7 10.7 11.7 10.6 9.1 8.5 9.4 11.1 9.2 8.3 9.2 10.8 9.0
chr7  111009280  111010007  4199  7.5 7.9 9.0 9.1 7.6 8.6 9.5 9.4 10.0 13.8 12.2 12.6 10.6 15.7 13.3 13.4
chr18  38654476  38654937  2135  5.1 4.8 4.6 7.1 7.1 5.5 4.5 4.3 5.6 11.6 8.1 6.5 7.8 6.3 6.6 11.5 8.0 7.4 8.8 7.6 8.2 7.9 9.2 7.9
chr9  123864720  123865709  4851  9.2 9.7 6.6 8.7 12.7 10.3 5.0 11.0 3.7 6.3 5.0 6.5 3.9 6.3 7.6 7.6 6.5 7.0 5.6 6.8 6.8 7.1 5.7 7.2
chr7  16880022  16880853  3960  3.5 11.0 3.6 3.1 6.0 6.9 7.9 5.1 5.1 3.4 7.1 5.9 6.2 8.6 7.0 3.0 7.9 6.3 8.3 7.1 8.3 6.6 8.7 6.7
chr14  118593772  118594280  1496  7.1 5.4 7.2 7.4 7.2 5.5 7.6 7.3 5.0 6.7 5.7 5.6 5.1 7.0 5.9 5.7
chr18  32701721  32702830  2105  4.7 5.1 3.8 6.0 5.0 5.0 3.9 6.1 6.5 6.5 7.2 7.7 6.9 6.9 7.5 8.2
chr19  37568375  37569012  2302  7.0 5.1 6.3 8.7 7.1 5.3 6.7 8.9 5.4 3.9 6.2 4.8 5.4 3.9 6.3 4.4
chr1  88375618  88376290  127  5.3 2.5 5.0 2.0 4.5 6.0 2.8 3.8 7.0 5.3 3.5 3.3 4.9 6.2 4.3 7.0 6.0 4.9 4.0 4.9 6.3 5.0 4.2 5.0
chr11  77882879  77883715  758  4.7 4.1 6.7 3.0 4.8 4.2 6.8 3.0 5.1 5.2 4.7 4.0 5.0 5.2 4.5 4.0
chr12  88141648  88142376  1123  4.4 5.1 4.4 5.9 4.5 5.1 4.4 5.9 4.5 4.3 5.5 4.4 4.7 4.4 5.8 4.6
chr1  153800663  153801297  244  4.6 6.1 4.9 5.2 4.7 6.2 5.1 5.3 4.2 3.1 4.4 3.1 4.2 3.2 4.5 3.2
chr7  133185585  133186156  4249  6.4 7.6 4.3 4.1 6.5 7.8 4.4 4.2 2.9 3.4 2.8 4.9 2.8 3.4 2.8 4.8
chr4  106680675  106681320  3158  4.1 3.7 3.6 5.3 4.2 3.7 3.5 5.2 5.0 5.0 2.9 4.0 5.1 4.9 3.0 3.9
chr2  167875757  167876211  2750  4.0 3.3 4.0 3.1 4.0 3.5 4.2 3.2 4.0 3.2 4.0 3.9 3.9 3.2 4.0 4.0
chr4  155162230  155162903  3365  3.0 4.4 4.1 3.8 3.1 4.4 4.2 3.8 5.7 3.6 1.9 4.0 2.0 5.6 3.6 4.0
chr8  37346436  37347275  4371  2.6 5.1 3.4 3.3 2.6 5.5 3.4 3.4 2.5 3.6 4.1 6.8 2.6 4.0 4.5 7.5
chr9  45611360  45612037  4652  4.4 3.3 4.0 3.7 4.4 3.4 4.2 3.7 2.7 3.1 2.4 3.0 2.6 3.1 2.4 3.0
chr10  116543101  116543855  541  3.7 2.6 3.9 2.4 3.8 2.6 4.1 2.4 3.7 2.6 3.6 2.6 4.0 2.8 3.9 2.8
chr5  85240440  85241048  3467  3.6 3.2 3.1 3.2 3.7 3.3 3.2 3.2 3.0 2.6 2.8 4.1 3.0 2.6 2.9 4.1
chr8  83017196  83018542  4423  2.5 4.1 2.4 4.0 2.6 4.2 2.4 4.0 3.0 3.0 2.5 3.1 3.3 3.2 2.7 3.4
chr6  72229401  72230032  3800  7.1 3.7 9.5 3.0 6.0 5.9 5.4 2.8 2.2 1.9 3.3 2.9 3.7 1.9 1.7 3.2 4.2 2.9 2.4 2.6 4.3 3.0 2.2 2.6
chr7  91871088  91871876  4157  3.0 2.8 4.6 2.5 3.1 2.8 4.5 2.6 2.9 2.1 3.0 2.2 3.2 2.2 3.3 2.3
chr16  57241944  57242302  1799  2.8 2.7 3.3 2.8 2.8 2.8 3.3 2.9 2.5 2.2 2.8 1.7 2.6 2.2 2.9 1.8
chr8  35160152  35161438  4361  2.5 2.6 2.3 1.9 2.6 2.7 2.4 1.9 2.5 3.3 3.2 2.6 2.8 3.5 3.5 2.8
chr6  120530027  120530545  3879  2.3 2.1 3.1 2.6 2.4 2.1 3.1 2.6 2.9 2.7 2.1 1.8 3.0 2.7 2.2 1.9
chr16  49839564  49840556  1796  3.7 2.7 4.4 2.9 3.8 2.7 4.6 2.9 1.8 1.8 1.9 2.0 1.7 1.8 1.8 1.9
chr2  167626792  167627467  2747  2.2 2.9 2.4 2.2 2.3 2.9 2.4 2.3 2.0 2.0 1.3 3.2 2.0 2.0 1.3 3.1
chr10  24354921  24355529  379  2.0 2.7 2.0 1.9 2.0 2.7 2.0 1.9 3.4 2.1 2.3 2.5 3.3 2.1 2.3 2.5
chr12  58298497  58299143  1050  2.1 2.6 2.0 4.2 2.2 2.7 2.0 4.4 1.9 2.2 2.4 1.9 1.8 2.2 2.5 1.9
chr4  134275289  134275961  3278  4.3 5.2 4.1 3.9 3.4 4.5 2.0 1.8 1.6 1.9 2.9 2.1 2.4 1.6 2.1 2.1 2.3 1.8 1.9 1.8 2.2 1.9 2.0 2.2
chr5  77087161  77087852  3461  1.7 2.1 2.0 1.8 1.7 2.2 2.1 1.9 2.7 2.1 2.0 2.2 2.8 2.4 2.2 2.4
chr14  123430438  123430680  1506  2.3 2.4 2.8 3.7 2.4 2.4 2.8 3.8 1.7 1.8 1.5 1.7 1.7 1.8 1.6 1.7
chr16  34044485  34045111  1774  2.1 2.9 1.9 2.6 2.1 2.8 1.9 2.5 1.7 2.1 1.8 2.0 1.8 2.1 1.9 2.0
chr15  8711603  8712036  1515  1.9 2.5 2.0 2.3 1.9 2.6 2.1 2.4 3.5 1.7 1.9 1.4 3.4 1.7 1.9 1.4
chr1  135029194  135029593  196  2.3 2.3 1.8 2.1 2.3 2.4 1.8 2.1 1.2 2.3 1.6 1.8 1.2 2.4 1.6 1.9
chr19  40887714  40888455  2311  1.5 1.4 2.1 1.2 1.6 1.4 2.1 1.2 1.8 1.6 2.1 2.2 1.9 1.8 2.3 2.4
chr2  28477394  28477920  2434  1.4 1.4 2.2 2.1 1.4 1.5 2.2 2.2 1.2 1.7 1.7 1.8 1.2 1.7 1.8 1.8
chr6  135146524  135147153  3920  1.6 1.6 2.6 1.7 1.6 1.7 2.7 1.7 2.5 1.0 1.9 1.3 2.5 1.1 2.0 1.3
chr7  6126508  6127068  3953  1.6 1.7 1.9 1.8 1.6 1.7 2.0 1.9 1.3 1.4 1.7 1.5 1.2 1.4 1.8 1.6
chr3  121950048  121950842  2980  1.6 1.9 1.5 1.7 1.7 1.9 1.5 1.8 1.0 1.5 1.3 0.9 1.0 1.5 1.3 1.0
chr3  103179283  103179942  2950  1.3 1.5 1.4 1.9 1.3 1.5 1.4 1.8 1.5 1.8 1.5 1.2 1.4 1.8 1.5 1.2
chr14  71023649  71024262  1464  0.7 0.7 0.6 1.3 0.7 0.7 0.6 1.2 1.6 1.9 1.8 1.6 1.6 1.8 1.8 1.6
chr12  29502832  29503567  1020  1.3 1.5 1.5 2.0 1.3 1.6 1.5 2.0 1.3 0.6 1.6 1.1 1.3 0.7 1.7 1.2
chr1  38056216  38056808  43  1.7 1.2 1.2 1.5 1.8 1.2 1.2 1.5 2.3 1.0 0.5 2.3 2.2 1.0 0.5 2.3
chr5  147127885  147128511  3657  1.4 1.6 1.2 1.4 3.2 1.8 1.3 1.2 0.7 1.2 1.5 0.7 0.7 1.6 1.2 0.6 1.8 1.9 1.0 2.0 1.9 1.9 1.0 2.0
chr1  52580874  52581510  63  1.8 1.3 1.5 1.4 1.8 1.3 1.4 1.3 1.3 1.3 1.0 1.0 1.4 1.3 1.0 1.1
chr11  72295276  72296391  734  1.1 1.7 1.3 1.3 1.2 1.7 1.3 1.3 2.3 1.2 1.5 1.1 2.5 1.2 1.7 1.2
chr2  109749209  109749667  2578  0.6 1.5 1.6 1.7 0.6 1.5 1.6 1.8 1.1 0.8 0.9 1.3 1.1 0.8 0.9 1.3
chr9  102674523  102674817  4783  1.6 2.3 1.6 0.8 0.7 1.1 0.8 0.9 0.8 1.8 1.0 1.2 1.8 1.1 1.4 1.4 1.0 1.1 1.5 1.5 1.0 1.2 1.6 1.6
chr11  23993994  23994533  612  1.2 1.1 1.1 1.0 1.3 1.1 1.1 1.0 1.2 0.9 1.4 1.2 1.2 0.9 1.4 1.2
chr15  12838058  12838708  1529  1.5 0.8 1.7 2.2 2.1 1.3 1.2 1.0 0.7 0.9 0.8 0.9 1.3 0.8 0.5 1.2 1.2 0.9 1.1 0.9 1.3 0.9 1.1 1.0
chr5  148253420  148253996  3663  0.7 1.1 1.2 1.3 0.7 1.1 1.2 1.3 1.3 0.9 0.6 0.7 1.3 0.9 0.6 0.7
chr9  96141676  96142381  4769  0.7 0.8 0.9 0.6 0.7 0.8 0.9 0.6 1.0 1.5 1.3 1.1 1.0 1.6 1.3 1.1
chr11  96802210  96802881  842  0.9 1.4 0.7 1.1 0.9 1.4 0.7 1.2 1.0 1.4 0.7 0.6 1.1 1.5 0.7 0.6
chr11  96779483  96779756  840  0.8 1.4 1.1 1.2 0.9 1.4 1.1 1.2 0.8 0.7 0.6 0.9 0.8 0.7 0.6 1.0
chr3  21894787  21895266  2794  1.6 0.6 0.6 0.7 0.5 0.4 0.5 0.7 0.6 0.4 0.5 1.7 0.5 0.8 1.0 1.2 0.7 1.3 0.9 0.5 0.7 1.4 1.0 0.5
chr17  4625931  4626888  1863  0.4 0.7 0.5 0.6 0.4 0.7 0.5 0.6 0.5 0.6 0.6 0.5 0.5 0.7 0.6 0.5
chr18  78319653  78320300  2190  0.7 0.6 0.4 0.4 0.7 0.6 0.4 0.4 0.5 0.5 0.5 0.3 0.5 0.5 0.5 0.3
chr11  121294437  121295272  987  0.5 0.5 0.7 0.5 0.5 0.5 0.7 0.5 0.3 0.4 0.5 0.5 0.3 0.4 0.5 0.5
chr16  17637963  17639064  1742  0.5 0.6 0.4 0.4 0.5 0.6 0.5 0.4 0.6 0.4 0.5 0.4 0.7 0.4 0.6 0.5
chr1  135721557  135722921  201  0.3 0.4 0.4 0.3 0.3 0.4 0.4 0.3 0.5 1.1 0.3 0.3 0.5 1.2 0.4 0.3
chr1  156886224  156886842  250  0.7 0.8 0.3 0.1 0.3 0.2 0.5 0.3 0.5 0.3 0.3 0.2 0.2 0.3 0.3 0.3 0.6 0.5 0.7 0.4 0.6 0.5 0.6 0.4
chr9  110570916  110571692  4816  0.3 0.5 0.3 0.5 0.3 0.5 0.3 0.5 0.2 0.2 0.2 0.1 0.2 0.2 0.2 0.1
chr2  35191661  35192274  2471  0.2 0.2 0.3 0.2 0.2 0.2 0.3 0.2 0.1 0.2 0.2 0.3 0.1 0.2 0.2 0.3
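To relate the individual fold changes in this table to a single activity value and to the log2 scale used in Supplemental Figure 2.4B, the Python sketch below summarizes the replicates for one segment. Using the median as the summary statistic is an assumption of this sketch; the table does not state how replicates were combined. The function name `summarize_replicates` is hypothetical, and the input values are the 16 results listed for TAL1 peak 184.

```python
import math
import statistics

# Individual transfection results listed for TAL1 peak 184 in this table.
PEAK_184 = [19.7, 20.3, 24.7, 23.2, 19.9, 19.0, 24.6, 23.4,
            22.8, 18.9, 23.1, 23.1, 24.3, 20.2, 23.7, 23.7]


def summarize_replicates(fold_changes):
    """Return (median fold change, log2 of that median).

    The median is an assumed summary statistic, chosen for this sketch only.
    """
    if not fold_changes:
        raise ValueError("no replicate values given")
    med = statistics.median(fold_changes)
    return med, math.log2(med)


if __name__ == "__main__":
    med, log2_med = summarize_replicates(PEAK_184)
    print(f"median fold change = {med:.3f}, log2 = {log2_med:.2f}")
```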


Supplemental Table 2.3 Results of 66 TAL1-bound segments in transgenic mouse reporter assay at E11.5.

For each TAL1 peak, the corresponding element in the VISTA Enhancer Browser and the k-means cluster illustrated in Figure 4A are listed.

Chr (TAL1) (mm9)  Start (TAL1)  End (TAL1)  TAL1 peak ID  Chr (VISTA) (mm9)  Start (VISTA)  End (VISTA)  VISTA element ID  Cluster no (X50percentstablekmeansid)
chr12  29502832  29503567  1020  chr12  29501669  29503770  hs1385  8
chr12  84173911  84175203  1105  chr12  84173382  84175400  mm78  4
chr12  88141648  88142376  1123  chr12  88141522  88144552  hs1466  3
chr12  90222271  90223529  1130  chr12  90222342  90223370  mm296  4
chr13  34919805  34920515  1216  chr13  34919435  34920768  mm180  7
chr14  118593772  118594280  1496  chr14  118593058  118594652  hs796  3
chr15  66657520  66658263  1578  chr15  66655863  66658724  mm104  8
chr17  12397492  12398656  1882  chr17  12396817  12398613  mm156  4
chr1  156886224  156886842  250  chr1  156885742  156887787  hs1862  7
chr18  32701721  32702830  2105  chr18  32701598  32702895  mm291  4
chr18  75662321  75663806  2184  chr18  75661251  75665484  mm18  6
chr18  75662321  75663806  2184  chr18  75662017  75663321  mm257
chr19  37568375  37569012  2302  chr19  37566512  37571839  hs1866  5
chr2  31983174  31984266  2456  chr2  31983250  31984106  mm194  4
chr2  45027025  45027862  2495  chr2  45026663  45028475  hs1802  6
chr2  103733606  103734648  2564  chr2  103733167  103735821  hs1858  4
chr2  152614615  152616061  2677  chr2  152613921  152616703  hs2050  6
chr2  167875757  167876211  2750  chr2  167874564  167876364  hs1860  1
chr2  172860208  172861567  2756  chr2  172860299  172862341  mm92  6
chr3  145627288  145627941  3020  chr3  145627063  145627893  mm311  3
chr4  117502412  117503328  3188  chr4  117501319  117504204  hs1857  6
chr4  131632510  131633977  3253  chr4  131631431  131635142  mm80  6
chr4  133285425  133286226  3269  chr4  133284966  133286220  hs569  5
chr5  38900658  38901482  3422  chr5  38900507  38901138  mm253  6
chr5  85240440  85241048  3467  chr5  85240355  85241286  hs840  8
chr5  120586118  120586932  3553  chr5  120585787  120586943  mm144  6
chr5  120586118  120586932  3553  chr5  120585896  120586933  hs1673
chr6  50840651  50841395  3759  chr6  50839577  50842587  hs1677  6
chr6  122343482  122344193  3897  chr6  122342623  122346341  mm94  5
chr6  122344239  122344657  3898  chr6  122342623  122346341  mm94  5
chr7  116258121  116259041  4207  chr7  116256647  116260817  hs1859  4
chr8  87254405  87255295  4435  chr8  87252801  87256604  mm21  6
chr8  122817261  122818516  4518  chr8  122815912  122818693  mm196  4
chr8  122895394  122896558  4521  chr8  122894800  122896576  mm99  4
chr8  125235866  125236576  4555  chr8  125235741  125237725  hs1854  7
chr9  63892510  63893729  4694  chr9  63892591  63894756  mm69  4
chr9  70682019  70683270  4737  chr9  70682343  70683041  mm190  6
chr11  11736803  11737167  594  chr11  11736363  11738001  hs2059  5
chr11  21978925  21979264  604  chr11  21978419  21979455  hs690  7
chr11  32145196  32146357  626  chr11  32145270  32146411  mm101  4
chr11  65463268  65464062  712  chr11  65462408  65465145  mm85
chr11  77236866  77237466  752  chr11  77235971  77237943  mm146
chr11  77236866  77237466  752  chr11  77235985  77237958  hs1675
chr11  86400220  86400815  793  chr11  86399698  86401331  mm127  1
chr11  98253860  98254298  854  chr11  98253216  98256320  hs1769  7
chr11  98254393  98255110  855  chr11  98253216  98256320  hs1769  6
chr12  88156953  88157678  1124  chr12  88156495  88158532  hs1624  5
chr14  69922592  69923552  1454  chr14  69922497  69923744  mm181  6
chr14  69935941  69937195  1455  chr14  69935881  69937057  mm215  4
chr15  66973594  66974313  1581  chr15  66973595  66975198  mm105  8
chr15  83353446  83354080  1659  chr15  83353437  83354501  hs1731  8
chr1  155880862  155881608  248  chr1  155880986  155881987  mm153  4
chr19  9018417  9019166  2236  chr19  9018686  9020249  hs2056  3
chr19  24526931  24527416  2272  chr19  24526476  24528770  hs1395  8
chr2  38391368  38392292  2485  chr2  38390575  38392603  mm160  6
chr2  152642938  152644140  2680  chr2  152643108  152643933  mm161  6
chr3  60406761  60407630  2837  chr3  60406133  60407586  mm163  5
chr3  100242086  100243806  2931  chr3  100241120  100244013  mm97  6
chr1  183187098  183187939  316  chr1  183186068  183189794  mm24  4
chr5  23849251  23849605  3383  chr5  23847456  23851575  hs1908  8
chr5  115716914  115717769  3535  chr5  115716995  115717570  mm307  6
chr6  120124224  120125340  3876  chr6  120124252  120125226  mm168  8
chr11  32150587  32151471  627  chr11  32150452  32152344  mm95  5
chr11  45618653  45619607  636  chr11  45618819  45619547  mm224  4
chr11  57464978  57465599  675  chr11  57464180  57466246  mm133  7
chr11  57464978  57465599  675  chr11  57464215  57466190  hs1662
chr1  59010525  59012054  70  chr1  59009748  59013257  mm88  4
chr11  95200815  95202005  830  chr11  95200522  95202441  hs1666  1
chr11  95217466  95218800  831  chr11  95217123  95219510  mm142  4
chr11  95217466  95218800  831  chr11  95217139  95219484  hs1671
chr11  102224771  102225712  885  chr11  102224347  102226111  hs1855  4


VISTA element ID  Tissues
hs1385  hindbrain (rhombencephalon)[3/5] | midbrain (mesencephalon)[4/5]
mm78  midbrain (mesencephalon)[6/8] | heart[6/8] | facial mesenchyme[7/8]
hs1466  neural tube[8/8] | hindbrain (rhombencephalon)[8/8] | midbrain (mesencephalon)[8/8] | dorsal root ganglion[7/8] | forebrain[6/8] | limb[8/8] | branchial arch[8/8] | heart[6/8]
mm296  midbrain (mesencephalon)[10/11] | heart[4/11]
mm180  other[5/6]
hs796  forebrain[4/5]
mm104  heart[7/7] | melanocytes[5/7] | liver[3/7]
mm156  other[3/4]
hs1862  heart[6/6]
mm291  heart[6/7]
mm18  midbrain (mesencephalon)[3/4] | heart[4/4]
mm257  midbrain (mesencephalon)[6/8] | heart[8/8]
hs1866  blood vessels[5/5]
mm194  midbrain (mesencephalon)[6/9] | heart[6/9]
hs1802  midbrain (mesencephalon)[8/8]
hs1858  heart[4/7] | liver[4/7]
hs2050  blood vessels[5/5]
hs1860  midbrain (mesencephalon)[5/5]
mm92  heart[5/11]
mm311  midbrain (mesencephalon)[4/7]
hs1857  midbrain (mesencephalon)[5/5]
mm80  blood vessels[10/10]
hs569  heart[4/9]
mm253  heart[4/4]
hs840  forebrain[10/10]
mm144  midbrain (mesencephalon)[6/8]
hs1673  neural tube[6/9] | midbrain (mesencephalon)[7/9]
hs1677  neural tube[6/6] | hindbrain (rhombencephalon)[6/6] | midbrain (mesencephalon)[3/6]
mm94  limb[3/5] | other[3/5]
mm94  limb[3/5] | other[3/5]
hs1859  neural tube[8/8] | hindbrain (rhombencephalon)[5/8] | midbrain (mesencephalon)[8/8] | forebrain[8/8] | heart[7/8] | blood vessels[5/8] | liver[4/8]
mm21  midbrain (mesencephalon)[6/11] | heart[6/11]
mm196  heart[4/6]
mm99  heart[8/10]
hs1854  heart[7/8]
mm69  heart[17/17]
mm190  heart[4/8]
hs2059  heart[8/9]
hs690  midbrain (mesencephalon)[7/11]
mm101  heart[9/12] | blood vessels[7/12]
mm85  heart[7/7]
mm146  heart[9/9]
hs1675  heart[8/10]
mm127  heart[3/6]
hs1769  heart[6/7]
hs1769  heart[6/7]
hs1624  negative
mm181  negative
mm215  negative
mm105  negative
hs1731  negative
mm153  negative
hs2056  negative
hs1395  negative
mm160  negative
mm161  negative
mm163  negative
mm97  negative
mm24  negative
hs1908  negative
mm307  negative
mm168  negative
mm95  negative
mm224  negative
mm133  negative
hs1662  negative
mm88  negative
hs1666  negative
mm142  negative
hs1671  negative
hs1855  negative
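The tissue annotations above follow a regular pattern, `tissue[x/y]` entries separated by `|`, or the single word `negative`. The Python sketch below parses one such cell into a dictionary. It assumes that `x/y` gives the number of embryos showing the pattern out of the number examined, which is how such counts are usually read but is not stated in this table; the names `TISSUE_RE` and `parse_tissues` are hypothetical.

```python
import re

# One entry looks like "midbrain (mesencephalon)[6/8]".
TISSUE_RE = re.compile(r"^(?P<tissue>.+?)\[(?P<hits>\d+)/(?P<total>\d+)\]$")


def parse_tissues(cell):
    """Parse a Tissues cell into {tissue: (hits, total)}.

    A cell reading "negative" yields an empty dictionary.
    """
    cell = cell.strip()
    if cell == "negative":
        return {}
    parsed = {}
    for part in cell.split("|"):
        match = TISSUE_RE.match(part.strip())
        if match is None:
            raise ValueError(f"unrecognized tissue entry: {part!r}")
        parsed[match["tissue"].strip()] = (int(match["hits"]), int(match["total"]))
    return parsed


if __name__ == "__main__":
    # Cell for VISTA element mm78, copied from the table above.
    print(parse_tissues(
        "midbrain (mesencephalon)[6/8] | heart[6/8] | facial mesenchyme[7/8]"))
```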


Supplemental Table 2.4 Genomic coordinates of meta-data (273 DNA segments) and their response to transient transfection assay.

The presence (1) or absence (0) of 8 epigenetic features at the 273 DNA segments is also listed, together with the cluster assignments from the k-means (14 clusters) and DBSCAN (19 homogeneous clusters) clustering algorithms applied to the meta-data.
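As a simple way to read the presence/absence matrix in this table, the Python sketch below groups segments that share an identical binary feature profile. This is plain exact-match grouping, shown only to illustrate the data layout; it is not the k-means or DBSCAN procedure named above. The five example rows are copied from the table, and the names `FEATURES`, `ROWS`, `profile_label`, and `group_by_profile` are hypothetical.

```python
from collections import defaultdict

FEATURES = ("TAL1", "GATA1", "EP300", "H3K4me1",
            "H3K4me3", "H3K27ac", "H3K27me3", "H3K9me3")

# A few rows copied from the table: segment ID -> presence (1) / absence (0).
ROWS = {
    "TAL1__184":  (1, 1, 1, 1, 1, 1, 0, 0),
    "Btg2R9":     (1, 1, 1, 1, 1, 1, 0, 0),
    "Gata2R5":    (0, 1, 1, 1, 1, 1, 0, 0),
    "TAL1__1195": (1, 1, 1, 1, 0, 1, 0, 0),
    "GHP25":      (0, 0, 0, 0, 0, 0, 0, 1),
}


def profile_label(profile):
    """Name a binary profile by the features that are present."""
    present = [name for name, flag in zip(FEATURES, profile) if flag]
    return "+".join(present) if present else "none"


def group_by_profile(rows):
    """Group segment IDs by identical feature profile (not a clustering algorithm)."""
    groups = defaultdict(list)
    for segment_id, profile in rows.items():
        groups[profile].append(segment_id)
    return dict(groups)


if __name__ == "__main__":
    for profile, ids in group_by_profile(ROWS).items():
        print(profile_label(profile), ids)
```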

Chr(mm9)  Start  End  ID  Activity
chr1  132196529  132197197  TAL1__184  23.100
chr3  146068529  146069275  TAL1__3022  9.986
chr18  38654476  38654937  TAL1__2135  7.294
chr7  16880022  16880853  TAL1__3960  6.607
chr1  135989522  135989792  Btg2R9  6.363
chr14  118593772  118594280  TAL1__1496  6.341
chr18  32701721  32702830  TAL1__2105  6.270
chr8  124837163  124837562  Zfpm1R13  5.977
chr19  37568375  37569012  TAL1__2302  5.804
chr7  118147598  118148147  GHP221  5.517
chr7  82756750  82757349  GHP53  5.380
chr6  88066819  88067388  Gata2R1  4.907
chr1  88375618  88376290  TAL1__127  4.895
chr11  77882879  77883715  TAL1__758  4.702
chr12  88141648  88142376  TAL1__1123  4.542
chr7  132572012  132572611  GHP293  4.340
chr7  133185585  133186156  TAL1__4249  4.248
chr7  86503239  86503938  GHP68  4.169
chr4  106680675  106681320  TAL1__3158  4.031
chr7  73341400  73342099  GHP10  3.596
chr8  37346436  37347275  TAL1__4371  3.520
chr7  128123364  128124013  GHP264  3.304
chr7  130509502  130510051  GHP275  3.248
chr9  45611360  45612037  TAL1__4652  3.232
chr7  134913927  134914526  GHP309  3.157
chr8  83017196  83018542  TAL1__4423  3.053
chr6  72229401  72230032  TAL1__3800  2.980
chr7  91871088  91871876  TAL1__4157  2.863
chr7  118322848  118323397  GHP222  2.813
chr8  35160152  35161438  TAL1__4361  2.630
chr7  71194021  71194720  GHP4  2.605
chr7  132615562  132616211  GHP296  2.422
chr16  49839564  49840556  TAL1__1796  2.342
chr7  135557758  135558307  GHP314  2.314
chr7  116258367  116259066  GHP205  2.296
chr7  106623318  106623984  GHP156  2.236
chr12  58298497  58299143  TAL1__1050  2.195
chr7  109390843  109392019  GHP172  2.165
chr5  77087161  77087852  TAL1__3461  2.123
chr7  87353740  87354439  GHP74  2.025
chr7  107992742  107993391  GHP167  1.986
chr7  107615488  107616487  GHP163  1.975
chr7  106973326  106974333  GHP159  1.878
chr8  124843371  124843635  Zfpm1R10  1.862
chr19  40887714  40888455  TAL1__2311  1.752
chr7  117918205  117918754  GHP216  1.734
chr2  28477394  28477920  TAL1__2434  1.728
chr1  135969753  135969902  Btg2R3  1.681
chr6  135146524  135147153  TAL1__3920  1.679
chr7  134910576  134911275  GHP308  1.626
chr7  88202775  88203374  GHP82  1.580
chr7  91799621  91800720  GHP106  1.521
chr8  124853571  124853860  Zfpm1R12  1.519
chr3  121950048  121950842  TAL1__2980  1.501
chr14  71023649  71024262  TAL1__1464  1.414
chr1  38056216  38056808  TAL1__43  1.380
chr11  72295276  72296391  TAL1__734  1.303
chr6  135115673  135115946  Hebp1R2  1.223
chr8  124845689  124845892  Zfpm1R19  1.132
chr8  124810751  124811295  Zfpm1R3  1.093
chr7  107327713  107328262  GHP160  1.082
chr5  148253420  148253996  TAL1__3663  0.997
chr7  87774111  87774810  GHP78  0.970
chr11  96802210  96802881  TAL1__842  0.957
chr8  124833200  124833376  Zfpm1R8  0.753
chr16  17637963  17639064  TAL1__1742  0.476
chr1  135721557  135722921  TAL1__201  0.394
chr7  91046669  91047368  GHP101  0.315
chr7  135815708  135816257  GHP316  0.280
chr9  110570916  110571692  TAL1__4816  0.260
chr2  35191661  35192274  TAL1__2471  0.224
chr6  88152903  88153447  Gata2R5  15.634
chr8  124838838  124839093  Zfpm1R24  4.116
chr6  88139617  88140016  Gata2R8  3.780
chr8  124846623  124847102  Zfpm1R14  2.559
chr8  124820526  124820950  Zfpm1R21  2.000
chr7  109397425  109397874  GHP173  1.770
chr8  124814848  124815272  Zfpm1R6  1.583
chr7  97211041  97211927  GHP117  1.201
chr8  124850001  124850270  Zfpm1R11  1.166
chr8  124820151  124820506  Zfpm1R7  1.129
chr8  124807803  124808237  Zfpm1R5  1.115
chr6  88140542  88141081  Gata2R3  0.752
chr8  124830554  124831203  Zfpm1R4  0.613
chr7  97277676  97278225  GHP118  0.491
chr7  133269983  133270532  GHP297  0.418
chr13  14634853  14635762  TAL1__1195  13.063
chr12  32734927  32735442  TAL1__1025  11.597
chr7  111009280  111010007  TAL1__4199  9.735
chr7  111009146  111010117  GHP181  7.665
chr9  123864720  123865709  TAL1__4851  6.814
chrX  146984294  146984495  Alas2R1  4.877
chr4  155162230  155162903  TAL1__3365  3.900
chrX  146999208  146999549  Alas2R3  3.332
chr7  111014108  111014707  GHP182  3.228
chr10  116543101  116543855  TAL1__541  3.207
chr7  116254241  116254990  GHP204  2.437
chr16  34044485  34045111  TAL1__1774  2.013
chr5  147127885  147128511  TAL1__3657  1.367
chr7  111058491  111059040  GHP183  0.976
chr9  96141676  96142381  TAL1__4769  0.970
chr3  21894787  21895266  TAL1__2794  0.677
chr1  156886224  156886842  TAL1__250  0.378
chr7  104490017  104490566  GHP150  2.750
chr6  88141777  88142006  Gata2R7  1.873
chr7  71096727  71097876  GHP2  1.014
chr7  77238268  77238717  GHP25  1.805
chr7  77456358  77456907  GHP27  1.777
chr7  77558135  77558684  GHP28  1.768
chr7  78534090  78534639  GHP31  1.683
chr7  77659306  77659855  GHP29  1.244
chr7  78181804  78182353  GHP30  1.168
chr7  77289559  77290129  GHP26  1.124
chr7  77036182  77036881  GHP23  1.120
chr7  78647752  78648301  GHP32  0.981
chr7  77104468  77105517  GHP24  0.967
chr7  76974115  76974664  GHP22  0.730
chr7  76507384  76507933  GHP20  0.493
chr7  76852202  76852751  GHP21  0.349
chr7  80201188  80201787  GHP42  0.334
chr7  112696459  112697108  GHP193  1.934
chr7  123344417  123345116  GHP246  1.803
chr7  107336599  107337148  GHP161  1.535
chr6  88081846  88082276  Gata2R9  1.301
chr7  73613801  73614350  GHP12  1.295
chr7  90245535  90246084  GHP94  1.050
chr7  112561519  112562068  GHP191  1.026
chr6  38592265  38592444  Hipk2R33  0.680
chr7  99540280  99541080  GHN419  0.603
chr8  124808263  124808907  Zfpm1R2  1.785
chr11  121294437  121295272  TAL1__987  0.482
chr7  133491588  133492187  GHP300  0.345
chr6  38721281  38721450  Hipk2R16  2.182
chr6  38796863  38797087  Hipk2R27  1.938
chr7  73302100  73302699  GHP9  1.785
chr6  38711397  38711542  Hipk2R23  1.726
chr7  108867272  108867821  GHP170  1.715
chr6  135091565  135092125  Hebp1R3  1.332
chr7  135065445  135066044  GHP311  1.331
chr7  73342767  73343574  GHN37  1.321
chr7  107727090  107727639  GHP164  1.276
chr6  38704031  38704437  Hipk2R4  1.186
chr6  38707478  38707762  Hipk2NC4  1.066
chrX  147000391  147000733  Alas2NC1  1.020
chr6  38800035  38800364  Hipk2R28  0.958
chr6  38806152  38806301  Hipk2R30  0.953
chr7  88970980  88971579  GHP89  0.814
chr6  38782189  38782723  Hipk2NC1  0.795
chr6  38715235  38715384  Hipk2R25  0.658
chr6  88272307  88272589  Gata2NC2  0.640
chr8  124825010  124825309  Zfpm1R18  2.097
chr7  87438711  87439660  GHP75  1.695
chr8  124831546  124831700  Zfpm1R29  1.614
chr6  88153903  88154127  Gata2NC1  1.280
chr8  124823870  124824145  Zfpm1R28  1.148
chr8  124839961  124840270  Zfpm1R9  1.134
chr7  86887199  86887748  GHP72  3.080
chr7  112900289  112900988  GHP196  2.255
chr8  124854446  124854615  Zfpm1R27  1.946
chr8  124899439  124899655  Zfpm1NC1  1.666
chr7  87330823  87331372  GHP73  1.655
chr7  71146719  71147318  GHP3  1.557
chr7  130521146  130522045  GHP276  1.071
chr8  124852939  124853188  Zfpm1R15  0.989
chr7  73224076  73224875  GHP8  0.523
chr7  133540211  133540860  GHP301  0.443
chr7  88973238  88973787  GHP90  0.335
chr7  6126508  6127068  TAL1__3953  1.647
chr12  29502832  29503567  TAL1__1020  1.404
chr2  109749209  109749667  TAL1__2578  1.200
chr15  12838058  12838708  TAL1__1529  1.037
chr7  107833031  107833730  GHP165  0.880
chr7  70865082  70865631  GHP1  2.473
chr7  111267298  111267897  GHP186  2.388
chr7  72834698  72835647  GHP7  2.116
chr7  132234331  132235580  GHP291  1.610
chr7  104665107  104666061  GHN478  1.600
chr7  122241704  122242253  GHP241  1.573
chr7  114218509  114219108  GHP198  1.472
chr7  73894613  73895162  GHP14  1.456
chr7  114704028  114704577  GHP199  1.360
chr7  115415642  115416191  GHP201  1.268
chr7  99412978  99413527  GHP127  1.260
chr7  113438116  113438965  GHP197  0.994
chr7  111119394  111119943  GHP184  0.936
chrX  147281215  147281371  Alas2NC2  0.898
chr7  115940055  115941254  GHP203  0.848
chr7  121505468  121506067  GHP234  0.741
chr7  80593212  80593977  GHN159  0.652
chr7  90844040  90844989  GHP100  0.630
chr7  114839819  114840368  GHP200  0.613
chr7  115474285  115474884  GHP202  0.583
chr7  130857235  130857784  GHP279  0.579
chr7  131129748  131130297  GHP283  0.484
chr6  38824639  38824903  Hipk2R39  1.510
chr7  97306681  97307481  GHN391  1.245
chr6  38824490  38824648  Hipk2R40  0.480
chr6  38649010  38649300  Hipk2R19  1.493
chr7  72445253  72445802  GHP6  0.636
chr7  72445253  72445802  GHP6  0.511
chr7  84135423  84136249  GHN213  0.407
chrX  7418047  7418661  TAL1__4858  10.789
chr7  134234693  134235292  GHP304  6.176
chr6  120530027  120530545  TAL1__3879  2.487
chr10  24354921  24355529  TAL1__379  2.212
chr7  120228608  120229207  GHP228  2.170
chr3  103179283  103179942  TAL1__2950  1.471
chr7  110976236  110976785  GHP180  0.320
chr7  118604074  118604623  GHP223  2.030
chr7  99419545  99420144  GHP128  1.945
chr7  108648673  108649222  GHP169  1.467
chr7  122770207  122771206  GHP243  1.328
chr7  73908565  73909114  GHP15  1.141
chr2  27007744  27007934  Vav2NC2  1.100
chr2  27249343  27249572  Vav2R4  0.941
chr7  98623314  98623913  GHP122  0.756
chr7  116632200  116632749  GHP206  0.649
chr7  111232683  111233032  GHP185  0.539
chr7  70735684  70736133  GHP0  0.528
chr7  99547805  99548404  GHP130  0.436
chr7  120023689  120024338  GHP227  0.419
chr2  27184263  27184417  Vav2R10  2.679
chr8  124757292  124757442  Zfpm1NC2  1.325
chr7  73722956  73723555  GHP13  1.292
chr8  124632865  124633128  Zfpm1NC4  0.897
chr2  27220045  27220284  Vav2NC1  0.854
chr6  88168966  88169500  Gata2R6  1.886
chr8  124769944  124770211  Zfpm1NC3  1.257
chr7  73425051  73425600  GHP11  1.206
chr8  124779650  124780054  Zfpm1R16  1.144
chr7  88516140  88516989  GHP87  0.992
chr6  38616598  38616976  Hipk2NC3  0.403
chr7  118020599  118021198  GHP219  1.951
chr2  27293121  27293330  Vav2R6  1.214
chr7  90120140  90120689  GHP93  1.150
chr7  79090283  79091223  GHN133  0.891
chr7  90212473  90213357  GHN322  0.818
chr1  153800663  153801297  TAL1__244  4.519
chr7  130114466  130115369  GHP270  2.604
chr8  124806654  124806897  Zfpm1R1  2.465
chr1  135029194  135029593  TAL1__196  1.958
chr17  4625931  4626888  TAL1__1863  0.561
chr7  88863914  88864563  GHP88  3.347
chr2  167626792  167627467  TAL1__2747  2.251
chr7  91735511  91736060  GHP105  2.053
chr7  75118605  75119154  GHP18  1.522
chr18  78319653  78320300  TAL1__2190  0.491
chr7  70644202  70645021  GHN6  1.069
chr7  74441122  74441828  GHP17  0.730
chr2  27199488  27199723  Vav2R3  4.629
chr2  27244974  27245564  Vav2R7  2.881
chr7  134991063  134992012  GHP310  0.523
chr7  104358431  104359030  GHP147  5.817
chr7  135444428  135445076  GHP313  3.660
chr2  27267361  27267660  Vav2R5  2.982
chr7  104881929  104882528  GHP152  2.045
chr2  167875757  167876211  TAL1__2750  3.941
chr7  74355378  74355948  GHP16  1.364
chr1  52580874  52581510  TAL1__63  1.318
chr15  66657520  66658263  TAL1__1578  10.094
chr4  134275289  134275961  TAL1__3278  2.133
chr15  8711603  8712036  TAL1__1515  1.975
chr14  123430438  123430680  TAL1__1506  2.080
chr7  107830000  107830921  GHN534  1.269
chr7  75356326  75356875  GHP19  0.908
chr1  136002653  136002802  Btg2R8  1.431
chr7  86291470  86292288  GHN240  0.482
chr16  57241944  57242302  TAL1__1799  2.765
chr9  102674523  102674817  TAL1__4783  1.156
chr5  85240440  85241048  TAL1__3467  3.190
chr11  23993994  23994533  TAL1__612  1.156
chr11  96779483  96779756  TAL1__840  0.905
chr7  112857304  112857753  GHP194  0.723

k- ID TAL1 GATA1 EP300 H3K4me1 H3K4me3 H3K27ac H3K27me3 H3K9me3 DBSCAN DHS means TAL1__184 1 1 1 1 1 1 0 0 1 1 1 TAL1__3022 1 1 1 1 1 1 0 0 1 1 1 TAL1__2135 1 1 1 1 1 1 0 0 1 1 1 TAL1__3960 1 1 1 1 1 1 0 0 1 1 1 Btg2R9 1 1 1 1 1 1 0 0 1 1 1 TAL1__1496 1 1 1 1 1 1 0 0 1 1 1 136

TAL1__2105 1 1 1 1 1 1 0 0 1 1 1 Zfpm1R13 1 1 1 1 1 1 0 0 1 1 1 TAL1__2302 1 1 1 1 1 1 0 0 1 1 1 GHP221 1 1 1 1 1 1 0 0 1 1 1 GHP53 1 1 1 1 1 1 0 0 1 1 1 Gata2R1 1 1 1 1 1 1 0 0 1 1 1 TAL1__127 1 1 1 1 1 1 0 0 1 1 1 TAL1__758 1 1 1 1 1 1 0 0 1 1 1 TAL1__1123 1 1 1 1 1 1 0 0 1 1 1 GHP293 1 1 1 1 1 1 0 0 1 1 1 TAL1__4249 1 1 1 1 1 1 0 0 1 1 1 GHP68 1 1 1 1 1 1 0 0 1 1 1 TAL1__3158 1 1 1 1 1 1 0 0 1 1 1 GHP10 1 1 1 1 1 1 0 0 1 1 1 TAL1__4371 1 1 1 1 1 1 0 0 1 1 1 GHP264 1 1 1 1 1 1 0 0 1 1 1 GHP275 1 1 1 1 1 1 0 0 1 1 1 TAL1__4652 1 1 1 1 1 1 0 0 1 1 1 GHP309 1 1 1 1 1 1 0 0 1 1 1 TAL1__4423 1 1 1 1 1 1 0 0 1 1 1 TAL1__3800 1 1 1 1 1 1 0 0 1 1 1 TAL1__4157 1 1 1 1 1 1 0 0 1 1 1 GHP222 1 1 1 1 1 1 0 0 1 1 1 TAL1__4361 1 1 1 1 1 1 0 0 1 1 1 GHP4 1 1 1 1 1 1 0 0 1 1 1 GHP296 1 1 1 1 1 1 0 0 1 1 1 TAL1__1796 1 1 1 1 1 1 0 0 1 1 1 GHP314 1 1 1 1 1 1 0 0 1 1 1 GHP205 1 1 1 1 1 1 0 0 1 1 1 GHP156 1 1 1 1 1 1 0 0 1 1 1 TAL1__1050 1 1 1 1 1 1 0 0 1 1 1 GHP172 1 1 1 1 1 1 0 0 1 1 1 TAL1__3461 1 1 1 1 1 1 0 0 1 1 1 GHP74 1 1 1 1 1 1 0 0 1 1 1 GHP167 1 1 1 1 1 1 0 0 1 1 1 GHP163 1 1 1 1 1 1 0 0 1 1 1 GHP159 1 1 1 1 1 1 0 0 1 1 1 Zfpm1R10 1 1 1 1 1 1 0 0 1 1 1 TAL1__2311 1 1 1 1 1 1 0 0 1 1 1 GHP216 1 1 1 1 1 1 0 0 1 1 1 TAL1__2434 1 1 1 1 1 1 0 0 1 1 1 Btg2R3 1 1 1 1 1 1 0 0 1 1 1 TAL1__3920 1 1 1 1 1 1 0 0 1 1 1 137

GHP308 1 1 1 1 1 1 0 0 1 1 1
GHP82 1 1 1 1 1 1 0 0 1 1 1
GHP106 1 1 1 1 1 1 0 0 1 1 1
Zfpm1R12 1 1 1 1 1 1 0 0 1 1 1
TAL1__2980 1 1 1 1 1 1 0 0 1 1 1
TAL1__1464 1 1 1 1 1 1 0 0 1 1 1
TAL1__43 1 1 1 1 1 1 0 0 1 1 1
TAL1__734 1 1 1 1 1 1 0 0 1 1 1
Hebp1R2 1 1 1 1 1 1 0 0 1 1 1
Zfpm1R19 1 1 1 1 1 1 0 0 1 1 1
Zfpm1R3 1 1 1 1 1 1 0 0 1 1 1
GHP160 1 1 1 1 1 1 0 0 1 1 1
TAL1__3663 1 1 1 1 1 1 0 0 1 1 1
GHP78 1 1 1 1 1 1 0 0 1 1 1
TAL1__842 1 1 1 1 1 1 0 0 1 1 1
Zfpm1R8 1 1 1 1 1 1 0 0 1 1 1
TAL1__1742 1 1 1 1 1 1 0 0 1 1 1
TAL1__201 1 1 1 1 1 1 0 0 1 1 1
GHP101 1 1 1 1 1 1 0 0 1 1 1
GHP316 1 1 1 1 1 1 0 0 1 1 1
TAL1__4816 1 1 1 1 1 1 0 0 1 1 1
TAL1__2471 1 1 1 1 1 1 0 0 1 1 1
Gata2R5 0 1 1 1 1 1 0 0 3 6 1
Zfpm1R24 0 1 1 1 1 1 0 0 3 6 1
Gata2R8 0 1 1 1 1 1 0 0 3 6 1
Zfpm1R14 0 1 1 1 1 1 0 0 3 6 1
Zfpm1R21 0 1 1 1 1 1 0 0 3 6 1
GHP173 0 1 1 1 1 1 0 0 3 6 1
Zfpm1R6 0 1 1 1 1 1 0 0 3 6 1
GHP117 0 1 1 1 1 1 0 0 3 6 0
Zfpm1R11 0 1 1 1 1 1 0 0 3 6 1
Zfpm1R7 0 1 1 1 1 1 0 0 3 6 1
Zfpm1R5 0 1 1 1 1 1 0 0 3 6 1
Gata2R3 0 1 1 1 1 1 0 0 3 6 1
Zfpm1R4 0 1 1 1 1 1 0 0 3 6 1
GHP118 0 1 1 1 1 1 0 0 3 6 1
GHP297 0 1 1 1 1 1 0 0 3 6 1
TAL1__1195 1 1 1 1 0 1 0 0 7 3 1
TAL1__1025 1 1 1 1 0 1 0 0 7 3 1
TAL1__4199 1 1 1 1 0 1 0 0 7 3 1
GHP181 1 1 1 1 0 1 0 0 7 3 1
TAL1__4851 1 1 1 1 0 1 0 0 7 3 1
Alas2R1 1 1 1 1 0 1 0 0 7 3 1
TAL1__3365 1 1 1 1 0 1 0 0 7 3 1

Alas2R3 1 1 1 1 0 1 0 0 7 3 1
GHP182 1 1 1 1 0 1 0 0 7 3 1
TAL1__541 1 1 1 1 0 1 0 0 7 3 1
GHP204 1 1 1 1 0 1 0 0 7 3 1
TAL1__1774 1 1 1 1 0 1 0 0 7 3 1
TAL1__3657 1 1 1 1 0 1 0 0 7 3 1
GHP183 1 1 1 1 0 1 0 0 7 3 1
TAL1__4769 1 1 1 1 0 1 0 0 7 3 1
TAL1__2794 1 1 1 1 0 1 0 0 7 3 1
TAL1__250 1 1 1 1 0 1 0 0 7 3 1
GHP150 0 1 1 1 0 1 0 0 4 8 1
Gata2R7 0 1 1 1 0 1 0 0 4 8 1
GHP2 0 1 1 1 0 1 0 0 4 8 1
GHP25 0 0 0 0 0 0 0 1 8 17 0
GHP27 0 0 0 0 0 0 0 1 8 17 0
GHP28 0 0 0 0 0 0 0 1 8 17 0
GHP31 0 0 0 0 0 0 0 1 8 17 0
GHP29 0 0 0 0 0 0 0 1 8 17 0
GHP30 0 0 0 0 0 0 0 1 8 17 0
GHP26 0 0 0 0 0 0 0 1 8 17 0
GHP23 0 0 0 0 0 0 0 1 8 17 0
GHP32 0 0 0 0 0 0 0 1 8 17 0
GHP24 0 0 0 0 0 0 0 1 8 17 0
GHP22 0 0 0 0 0 0 0 1 8 17 0
GHP20 0 0 0 0 0 0 0 1 8 17 0
GHP21 0 0 0 0 0 0 0 1 8 17 0
GHP42 0 0 0 0 0 0 0 1 8 17 0
GHP193 0 0 0 1 0 0 1 0 13 15 0
GHP246 0 0 0 1 0 0 1 0 13 15 1
GHP161 0 0 0 1 0 0 1 0 13 15 0
Gata2R9 0 0 0 1 0 0 1 0 13 15 1
GHP12 0 0 0 1 0 0 1 0 13 15 0
GHP94 0 0 0 1 0 0 1 0 13 15 0
GHP191 0 0 0 1 0 0 1 0 13 15 0
Hipk2R33 0 0 0 1 0 0 1 0 13 15 1
GHN419 0 0 0 1 0 0 1 0 13 15 0
Zfpm1R2 1 1 1 1 1 0 0 0 1 2 1
TAL1__987 1 1 1 1 1 0 0 0 1 2 1
GHP300 1 1 1 1 1 0 0 0 1 2 1
Hipk2R16 0 0 0 1 0 1 0 0 5 14 1
Hipk2R27 0 0 0 1 0 1 0 0 5 14 0
GHP9 0 0 0 1 0 1 0 0 5 14 1
Hipk2R23 0 0 0 1 0 1 0 0 5 14 0

GHP170 0 0 0 1 0 1 0 0 5 14 0
Hebp1R3 0 0 0 1 0 1 0 0 5 14 0
GHP311 0 0 0 1 0 1 0 0 5 14 1
GHN37 0 0 0 1 0 1 0 0 5 14 1
GHP164 0 0 0 1 0 1 0 0 5 14 0
Hipk2R4 0 0 0 1 0 1 0 0 5 14 0
Hipk2NC4 0 0 0 1 0 1 0 0 5 14 1
Alas2NC1 0 0 0 1 0 1 0 0 5 14 0
Hipk2R28 0 0 0 1 0 1 0 0 5 14 0
Hipk2R30 0 0 0 1 0 1 0 0 5 14 0
GHP89 0 0 0 1 0 1 0 0 5 14 1
Hipk2NC1 0 0 0 1 0 1 0 0 5 14 0
Hipk2R25 0 0 0 1 0 1 0 0 5 14 0
Gata2NC2 0 0 0 1 0 1 0 0 5 14 0
Zfpm1R18 0 0 0 1 1 1 0 0 11 9 1
GHP75 0 0 0 1 1 1 0 0 11 9 1
Zfpm1R29 0 0 0 1 1 1 0 0 11 9 1
Gata2NC1 0 0 0 1 1 1 0 0 11 9 1
Zfpm1R28 0 0 0 1 1 1 0 0 11 9 1
Zfpm1R9 0 0 0 1 1 1 0 0 11 9 1
GHP72 0 1 0 1 1 1 0 0 3 7 1
GHP196 0 1 0 1 1 1 0 0 3 7 1
Zfpm1R27 0 1 0 1 1 1 0 0 3 7 1
Zfpm1NC1 0 1 0 1 1 1 0 0 3 7 1
GHP73 0 1 0 1 1 1 0 0 3 7 1
GHP3 0 1 0 1 1 1 0 0 3 7 1
GHP276 0 1 0 1 1 1 0 0 3 7 1
Zfpm1R15 0 1 0 1 1 1 0 0 3 7 1
GHP8 0 1 0 1 1 1 0 0 3 7 1
GHP301 0 1 0 1 1 1 0 0 3 7 1
GHP90 0 1 0 1 1 1 0 0 3 7 1
TAL1__3953 1 1 0 1 0 1 0 0 6 5 1
TAL1__1020 1 1 0 1 0 1 0 0 6 5 1
TAL1__2578 1 1 0 1 0 1 0 0 6 5 1
TAL1__1529 1 1 0 1 0 1 0 0 6 5 1
GHP165 1 1 0 1 0 1 0 0 6 5 1
GHP1 0 0 0 0 0 0 0 0 2 19 0
GHP186 0 0 0 0 0 0 0 0 2 19 0
GHP7 0 0 0 0 0 0 0 0 2 19 0
GHP291 0 0 0 0 0 0 0 0 2 19 0
GHN478 0 0 0 0 0 0 0 0 2 19 0
GHP241 0 0 0 0 0 0 0 0 2 19 0
GHP198 0 0 0 0 0 0 0 0 2 19 0

GHP14 0 0 0 0 0 0 0 0 2 19 0
GHP199 0 0 0 0 0 0 0 0 2 19 0
GHP201 0 0 0 0 0 0 0 0 2 19 0
GHP127 0 0 0 0 0 0 0 0 2 19 1
GHP197 0 0 0 0 0 0 0 0 2 19 0
GHP184 0 0 0 0 0 0 0 0 2 19 0
Alas2NC2 0 0 0 0 0 0 0 0 2 19 0
GHP203 0 0 0 0 0 0 0 0 2 19 0
GHP234 0 0 0 0 0 0 0 0 2 19 0
GHN159 0 0 0 0 0 0 0 0 2 19 0
GHP100 0 0 0 0 0 0 0 0 2 19 0
GHP200 0 0 0 0 0 0 0 0 2 19 0
GHP202 0 0 0 0 0 0 0 0 2 19 0
GHP279 0 0 0 0 0 0 0 0 2 19 0
GHP283 0 0 0 0 0 0 0 0 2 19 0
Hipk2R39 0 0 0 0 1 1 0 0 11 10 1
GHN391 0 0 0 0 1 1 0 0 11 10 0
Hipk2R40 0 0 0 0 1 1 0 0 11 10 1
Hipk2R19 0 0 0 1 1 0 0 0 9 12 1
GHP6 0 0 0 1 1 0 0 0 9 12 0
GHP6 0 0 0 1 1 0 0 0 9 12 0
GHN213 0 0 0 1 1 0 0 0 9 12 0
TAL1__4858 1 1 0 1 1 1 0 0 6 4 1
GHP304 1 1 0 1 1 1 0 0 6 4 1
TAL1__3879 1 1 0 1 1 1 0 0 6 4 1
TAL1__379 1 1 0 1 1 1 0 0 6 4 1
GHP228 1 1 0 1 1 1 0 0 6 4 1
TAL1__2950 1 1 0 1 1 1 0 0 6 4 1
GHP180 1 1 0 1 1 1 0 0 6 4 1
GHP223 0 0 0 0 0 0 1 0 10 16 0
GHP128 0 0 0 0 0 0 1 0 10 16 0
GHP169 0 0 0 0 0 0 1 0 10 16 1
GHP243 0 0 0 0 0 0 1 0 10 16 0
GHP15 0 0 0 0 0 0 1 0 10 16 0
Vav2NC2 0 0 0 0 0 0 1 0 10 16 0
Vav2R4 0 0 0 0 0 0 1 0 10 16 1
GHP122 0 0 0 0 0 0 1 0 10 16 0
GHP206 0 0 0 0 0 0 1 0 10 16 0
GHP185 0 0 0 0 0 0 1 0 10 16 0
GHP0 0 0 0 0 0 0 1 0 10 16 0
GHP130 0 0 0 0 0 0 1 0 10 16 0
GHP227 0 0 0 0 0 0 1 0 10 16 0
Vav2R10 0 0 0 0 1 0 1 0 9 13 0

Zfpm1NC2 0 0 0 0 1 0 1 0 9 13 1
GHP13 0 0 0 0 1 0 1 0 9 13 0
Zfpm1NC4 0 0 0 0 1 0 1 0 9 13 1
Vav2NC1 0 0 0 0 1 0 1 0 9 13 0
Gata2R6 0 0 0 1 1 0 1 0 9 11 1
Zfpm1NC3 0 0 0 1 1 0 1 0 9 11 1
GHP11 0 0 0 1 1 0 1 0 9 11 1
Zfpm1R16 0 0 0 1 1 0 1 0 9 11 1
GHP87 0 0 0 1 1 0 1 0 9 11 1
Hipk2NC3 0 0 0 1 1 0 1 0 9 11 1
GHP219 0 0 0 1 0 0 0 0 2 18 1
Vav2R6 0 0 0 1 0 0 0 0 2 18 1
GHP93 0 0 0 1 0 0 0 0 2 18 0
GHN133 0 0 0 1 0 0 0 0 2 18 0
GHN322 0 0 0 1 0 0 0 0 2 18 0
TAL1__244 1 0 1 0 0 1 1 1 0 NOISE 1
GHP270 0 1 0 0 1 1 0 0 0 NOISE 1
Zfpm1R1 0 1 1 0 1 1 0 0 0 NOISE 1
TAL1__196 1 1 1 0 1 1 0 0 0 NOISE 1
TAL1__1863 1 1 0 0 1 0 0 0 0 NOISE 1
GHP88 1 1 1 1 1 1 0 1 1 NOISE 1
TAL1__2747 1 1 1 1 1 1 1 0 1 NOISE 1
GHP105 1 1 1 1 1 1 1 0 1 NOISE 1
GHP18 1 1 1 1 1 1 0 1 1 NOISE 1
TAL1__2190 1 0 1 1 1 1 1 0 1 NOISE 1
GHN6 0 0 0 0 1 0 0 0 2 NOISE 0
GHP17 0 1 0 1 0 0 0 0 2 NOISE 0
Vav2R3 0 1 1 1 1 0 0 0 3 NOISE 1
Vav2R7 0 0 1 1 1 1 1 0 3 NOISE 1
GHP310 0 1 0 1 1 1 0 1 3 NOISE 1
GHP147 0 0 1 1 0 1 0 0 4 NOISE 1
GHP313 0 1 0 1 0 1 0 0 4 NOISE 1
Vav2R5 0 0 1 1 0 1 0 0 4 NOISE 1
GHP152 0 1 0 1 0 1 0 0 4 NOISE 0
TAL1__2750 1 1 0 1 1 1 1 0 6 NOISE 1
GHP16 1 1 0 1 1 1 1 0 6 NOISE 1
TAL1__63 1 1 0 1 0 0 0 0 6 NOISE 1
TAL1__1578 1 1 1 1 0 0 0 0 7 NOISE 1
TAL1__3278 1 1 1 1 0 1 1 0 7 NOISE 1
TAL1__1515 1 1 1 1 0 0 1 0 7 NOISE 0
TAL1__1506 1 0 0 0 0 0 1 1 8 NOISE 0
GHN534 0 0 0 0 0 1 0 1 8 NOISE 0
GHP19 0 0 0 1 1 0 0 1 8 NOISE 0

Btg2R8 0 0 0 1 1 1 1 0 9 NOISE 1
GHN240 0 0 0 1 1 1 1 0 9 NOISE 1
TAL1__1799 1 0 0 0 0 0 1 0 10 NOISE 0
TAL1__4783 1 0 0 0 0 0 1 0 10 NOISE 1
TAL1__3467 1 0 0 1 0 0 0 0 12 NOISE 0
TAL1__612 1 0 0 1 0 1 0 0 12 NOISE 1
TAL1__840 1 0 0 1 0 1 0 0 12 NOISE 1
GHP194 0 0 1 1 0 0 1 0 13 NOISE 0

Supplemental Table 2.5 P-values of Welch t-tests on each epigenetic feature of the meta-data (273 DNA segments)

Welch t-test, conf.level=0.95    p-value
TAL1 (+), TAL1 (-)               1.206E-07    Bound significantly greater
GATA1 (+), GATA1 (-)             1.74E-09     Bound significantly greater
p300 (+), p300 (-)               1.45E-08     Bound significantly greater
H3K4me1 (+), H3K4me1 (-)         7.248E-09    Enriched significantly greater
H3K4me3 (+), H3K4me3 (-)         0.0221       Enriched significantly greater
H3K27ac (+), H3K27ac (-)         4.11E-08     Enriched significantly greater
H3K27me3 (+), H3K27me3 (-)       1.72E-05     Enriched significantly less
H3K9me3 (+), H3K9me3 (-)         0.001328     Enriched significantly less
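As a rough illustration of the test named in the table header, the sketch below computes the Welch (unequal-variance) t statistic and the Welch-Satterthwaite degrees of freedom for two groups of signal values. The function name `welch_t` and the two sample lists are hypothetical and are not taken from the dissertation's data; obtaining a p-value would additionally require the t distribution (for example via a statistics package), which is omitted here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x), variance(y)  # sample variances (n - 1 denominator)
    se2 = vx / nx + vy / ny            # squared standard error of the mean difference
    t = (mean(x) - mean(y)) / sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df


# Hypothetical signal values for one feature in "(+)" and "(-)" segments.
positive = [5.1, 6.3, 4.8, 7.0, 5.9]
negative = [2.0, 3.1, 2.7, 1.9, 3.4, 2.5]
t_stat, dof = welch_t(positive, negative)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

A positive t here corresponds to the "significantly greater" direction in the table, and a negative t to "significantly less".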


Supplemental Table 2.6 Feature combination assessment (TAL1, GATA1, EP300 and their combinations with H3K4me1, H3K4me3 and H3K27ac) using sensitivity, specificity and precision values

Epigenetic Features/Combinations    Sensitivity (Recall)  Specificity  Precision (PPV)
TAL1                                0.744                 0.694        0.545
GATA1                               0.844                 0.590        0.503
EP300                               0.767                 0.694        0.552
K4me1+GATA1                         0.822                 0.601        0.503
TAL1+GATA1                          0.700                 0.716        0.548
GATA1+EP300                         0.722                 0.705        0.546
K27ac+TAL1                          0.700                 0.732        0.563
K4me1+K27ac+GATA1                   0.800                 0.634        0.518
K4me1+EP300                         0.744                 0.699        0.549
K4me1+K27ac+EP300                   0.722                 0.727        0.565
K4me1+TAL1                          0.711                 0.710        0.547
K4me3+GATA1                         0.656                 0.672        0.496
K4me3+EP300                         0.567                 0.743        0.520
K27ac+TAL1+GATA1                    0.689                 0.749        0.574
K27ac+GATA1                         0.822                 0.628        0.521
K27ac+EP300                         0.744                 0.721        0.568
K4me1+GATA1+EP300                   0.711                 0.710        0.547
TAL1+EP300                          0.644                 0.765        0.574
K4me1+TAL1+GATA1                    0.700                 0.727        0.558
K27ac+TAL1+EP300                    0.633                 0.787        0.594
K4me1+K4me3+GATA1                   0.633                 0.683        0.496
K4me1+K27ac+TAL1                    0.689                 0.738        0.564
K4me3+K27ac+GATA1                   0.644                 0.694        0.509
K4me1+K4me3+K27ac+GATA1             0.622                 0.699        0.505
K4me1+K4me3+EP300                   0.556                 0.749        0.521
K4me1+K27ac+TAL1+GATA1              0.689                 0.754        0.579
K4me3+K27ac+EP300                   0.556                 0.760        0.532
K4me1+K4me3+K27ac+EP300             0.544                 0.765        0.533
K4me1+K27ac+GATA1+EP300             0.689                 0.732        0.559
TAL1+GATA1+EP300                    0.633                 0.770        0.576
K4me1+TAL1+EP300                    0.633                 0.770        0.576
K27ac+TAL1+GATA1+EP300              0.622                 0.792        0.596
K4me1+TAL1+GATA1+EP300              0.633                 0.776        0.582
K4me3+TAL1                          0.544                 0.776        0.544
K4me1+K4me3+TAL1                    0.544                 0.787        0.557
K4me3+K27ac+TAL1                    0.544                 0.798        0.570
K4me1+K4me3+K27ac+TAL1              0.544                 0.803        0.576
K4me1+K4me3+TAL1+GATA1              0.544                 0.792        0.563
K4me3+K27ac+TAL1+GATA1              0.544                 0.803        0.576
K4me1+K4me3+K27ac+TAL1+GATA1        0.544                 0.809        0.583
K4me3+TAL1+EP300                    0.478                 0.798        0.538
K4me1+K4me3+TAL1+EP300              0.478                 0.803        0.544
K4me1+K27ac+TAL1+EP300              0.622                 0.792        0.596
K4me3+GATA1+EP300                   0.556                 0.749        0.521
K27ac+GATA1+EP300                   0.700                 0.727        0.558
K4me1+K4me3+GATA1+EP300             0.544                 0.754        0.521
K4me1+K27ac+TAL1+GATA1+EP300        0.622                 0.798        0.602
K4me3+K27ac+GATA1+EP300             0.544                 0.765        0.533
K4me1+K4me3+K27ac+GATA1+EP300       0.533                 0.770        0.533
K4me3+K27ac+TAL1+EP300              0.478                 0.814        0.558
K4me1+K4me3+K27ac+TAL1+EP300        0.478                 0.820        0.566
K4me3+TAL1+GATA1+EP300              0.478                 0.803        0.544
K4me1+K4me3+TAL1+GATA1+EP300        0.478                 0.809        0.551
K4me3+K27ac+TAL1+GATA1+EP300        0.478                 0.820        0.566
K4me1+K4me3+K27ac+TAL1+GATA1+EP300  0.478                 0.825        0.573
K4me3+TAL1+GATA1                    0.444                 0.732        0.449
no activating histone marks+TAL1    0.022                 0.995        0.667
no activating histone marks-TF(s)   0.044                 0.754        0.082
K4me1-TF(s)                         0.022                 0.732        0.039
K4me3-TF(s)                         0.022                 0.858        0.071
K27ac-TF(s)                         0.022                 0.847        0.067
K4me1+K4me3-TF(s)                   0.011                 0.902        0.053
K4me1+K27ac-TF(s)                   0.022                 0.869        0.077
K4me3+K27ac-TF(s)                   0.011                 0.945        0.091
K4me1+K4me3+K27ac-TF(s)             0.011                 0.962        0.125
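For readers who want to see how the three columns relate to one another, the following minimal sketch computes sensitivity, specificity and precision from a 2x2 confusion matrix. The table itself does not list the underlying counts; the counts used below are hypothetical, chosen only so that the rounded results coincide with the TAL1 row, and the function name `classification_metrics` is likewise an assumption of this sketch.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, precision) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # recall: fraction of active segments that carry the feature
    specificity = tn / (tn + fp)  # fraction of inactive segments that lack the feature
    precision = tp / (tp + fp)    # PPV: fraction of feature-positive segments that are active
    return sensitivity, specificity, precision


# Hypothetical counts (not given in the table); they sum to 273 segments.
sn, sp, ppv = classification_metrics(tp=67, fp=56, tn=127, fn=23)
print(f"{sn:.3f} {sp:.3f} {ppv:.3f}")  # 0.744 0.694 0.545
```

Because precision depends on the ratio of active to inactive segments as well as on sensitivity and specificity, a feature set can gain specificity while its PPV changes only modestly, as several rows of the table show.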


Supplemental Table 2.7 The top 10 overrepresented motifs (identified by DME) in TAL1-bound active enhancers over inactive segments in transient transfection assay.

DME172                  DME14
ID SVMAGMGSAG           ID AGGCARRSYY
P0   A   C   G   T      P0   A   C   G   T
01   0   4  14   0      01  15   0   0   0
02   6   7   5   0      02   0   0  15   0
03   8  10   0   0      03   0   0  15   0
04  18   0   0   0      04   0  15   0   0
05   0   0  18   0      05  15   0   0   0
06  13   5   0   0      06   4   0  11   0
07   0   0  18   0      07   9   0   6   0
08   0   8  10   0      08   0   6   9   0
09  18   0   0   0      09   0   6   0   9
10   0   0  18   0      10   0  10   0   5
AT BGCOUNT=0            AT BGCOUNT=0
AT CORRECTEDBGCOUNT=0   AT CORRECTEDBGCOUNT=0
AT FGCOUNT=18           AT FGCOUNT=15
AT INFO=1.60577         AT INFO=1.60936
AT SCORE=254.744        AT SCORE=228.629

DME174                  DME3
ID KGKTKTGTWY           ID GGRRSAGBAR
P0   A   C   G   T      P0   A   C   G   T
01   0   0   7  12      01   0   0  16   0
02   0   0  19   0      02   0   0  16   0
03   0   0  10   9      03  10   0   6   0
04   0   0   0  19      04  10   0   6   0
05   0   0   8  11      05   0  12   4   0
06   0   0   0  19      06  16   0   0   0
07   0   0  19   0      07   0   0  16   0
08   0   0   0  19      08   0   7   8   1
09   7   0   0  12      09  16   0   0   0
10   0   3   0  16      10   6   0  10   0
AT BGCOUNT=0            AT BGCOUNT=0
AT CORRECTEDBGCOUNT=0   AT CORRECTEDBGCOUNT=0
AT FGCOUNT=19           AT FGCOUNT=16
AT INFO=1.60443         AT INFO=1.60465
AT SCORE=253.4          AT SCORE=224.466


DME7                    DME101
ID YWCTGYWKTT           ID GGVDGGSARA
P0   A   C   G   T      P0   A   C   G   T
01   0   5   0  11      01   0   0  15   0
02   4   0   0  12      02   0   0  15   0
03   0  16   0   0      03   4   2   9   0
04   0   0   0  16      04   6   0   5   4
05   0   0  16   0      05   0   0  15   0
06   0  12   0   4      06   0   0  15   0
07   7   0   0   9      07   0   6   9   0
08   0   0   6  10      08  15   0   0   0
09   0   0   0  16      09   8   0   7   0
10   0   0   0  16      10  15   0   0   0
AT BGCOUNT=0            AT BGCOUNT=0
AT CORRECTEDBGCOUNT=0   AT CORRECTEDBGCOUNT=0
AT FGCOUNT=16           AT FGCOUNT=15
AT INFO=1.60096         AT INFO=1.60316
AT SCORE=241.646        AT SCORE=219.387

DME74                   DME79
ID MABMTGTTKT           ID WTWTKRKTTT
P0   A   C   G   T      P0   A   C   G   T
01   4  11   0   0      01   2   0   0  12
02  15   0   0   0      02   0   0   0  14
03   0   7   2   6      03   4   0   0  10
04   8   7   0   0      04   0   0   0  14
05   0   0   0  15      05   0   0   5   9
06   0   0  15   0      06   4   0  10   0
07   0   0   0  15      07   0   0   3  11
08   0   0   0  15      08   0   0   0  14
09   0   0   3  12      09   0   0   0  14
10   0   0   0  15      10   0   0   0  14
AT BGCOUNT=0            AT BGCOUNT=0
AT CORRECTEDBGCOUNT=0   AT CORRECTEDBGCOUNT=0
AT FGCOUNT=15           AT FGCOUNT=14
AT INFO=1.60311         AT INFO=1.61812
AT SCORE=236.218        AT SCORE=216.487


DME1                    DME8
ID BTTKRKTTTS           ID TTTGDWTKCW
P0   A   C   G   T      P0   A   C   G   T
01   0   2   3  13      01   0   0   0  14
02   0   0   0  18      02   0   0   0  14
03   0   0   0  18      03   0   0   0  14
04   0   0   1  17      04   0   0  14   0
05   4   0  14   0      05   5   0   4   5
06   0   0   4  14      06   8   0   0   6
07   0   0   0  18      07   0   0   0  14
08   0   0   0  18      08   0   0   4  10
09   0   0   0  18      09   0  14   0   0
10   0   5  13   0      10   3   0   0  11
AT BGCOUNT=0            AT BGCOUNT=0
AT CORRECTEDBGCOUNT=0   AT CORRECTEDBGCOUNT=0
AT FGCOUNT=18           AT FGCOUNT=14
AT INFO=1.6003          AT INFO=1.60462
AT SCORE=232.033        AT SCORE=214.234
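As a reading aid for the count matrices above, the sketch below converts a matrix of per-position A, C, G, T counts into an IUPAC string by including every base that is observed at least once at a position. This permissive rule is an assumption made for illustration, not a description of how DME derives its ID strings: it reproduces the IDs of DME172 and DME14 from their matrices, but not every ID string in these supplemental tables follows it. The names `IUPAC` and `permissive_consensus` are hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of allowed bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def permissive_consensus(rows):
    """Map each (A, C, G, T) count row to the IUPAC code of all bases with count > 0."""
    codes = []
    for counts in rows:
        observed = frozenset(base for base, n in zip("ACGT", counts) if n > 0)
        codes.append(IUPAC[observed])  # raises KeyError for an all-zero row
    return "".join(codes)


# Count rows transcribed from the DME172 and DME14 matrices in Supplemental Table 2.7.
DME172 = [(0, 4, 14, 0), (6, 7, 5, 0), (8, 10, 0, 0), (18, 0, 0, 0), (0, 0, 18, 0),
          (13, 5, 0, 0), (0, 0, 18, 0), (0, 8, 10, 0), (18, 0, 0, 0), (0, 0, 18, 0)]
DME14 = [(15, 0, 0, 0), (0, 0, 15, 0), (0, 0, 15, 0), (0, 15, 0, 0), (15, 0, 0, 0),
         (4, 0, 11, 0), (9, 0, 6, 0), (0, 6, 9, 0), (0, 6, 0, 9), (0, 10, 0, 5)]

print(permissive_consensus(DME172))  # SVMAGMGSAG
print(permissive_consensus(DME14))   # AGGCARRSYY
```

Each row of a matrix sums to the motif's FGCOUNT (18 for DME172, 15 for DME14), which offers a quick consistency check when transcribing the tables.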


Supplemental Table 2.8 The top 10 overrepresented motifs (identified by DME) in TAL1-bound inactive segments over active enhancers in transient transfection assay.

DME22                   DME107
ID GSTGCWGHAG           ID MAAGSSCKRG
P0   A   C   G   T      P0   A   C   G   T
01   0   1  12   0      01   5   4   0   0
02   0  12   1   0      02   9   0   0   0
03   0   0   0  13      03   9   0   0   0
04   0   0  13   0      04   0   0   9   0
05   0  12   1   0      05   0   1   8   0
06   7   0   0   6      06   0   3   6   0
07   0   0  13   0      07   0   9   0   0
08   8   4   0   1      08   0   0   4   5
09  10   0   0   3      09   1   0   8   0
10   0   0  13   0      10   0   0   9   0
AT BGCOUNT=7            AT BGCOUNT=1
AT CORRECTEDBGCOUNT=2   AT CORRECTEDBGCOUNT=0
AT FGCOUNT=13           AT FGCOUNT=9
AT INFO=1.70914         AT INFO=1.61233
AT SCORE=156.445        AT SCORE=137.938

DME32                   DME35
ID AAGRVCKGGV           ID DGGVAGCCWK
P0   A   C   G   T      P0   A   C   G   T
01  10   0   0   0      01   5   0   1   3
02  10   0   0   0      02   0   0   9   0
03   0   0  10   0      03   0   0   9   0
04   4   0   6   0      04   3   5   1   0
05   1   2   7   0      05   9   0   0   0
06   0  10   0   0      06   0   0   9   0
07   0   0   2   8      07   0   9   0   0
08   0   0  10   0      08   0   9   0   0
09   0   0  10   0      09   7   0   0   2
10   4   5   1   0      10   0   0   8   1
AT BGCOUNT=1            AT BGCOUNT=2
AT CORRECTEDBGCOUNT=0   AT CORRECTEDBGCOUNT=0
AT FGCOUNT=10           AT FGCOUNT=9
AT INFO=1.61857         AT INFO=1.63188
AT SCORE=154.086        AT SCORE=137.605


DME46                   DME29
ID CTSARVCHSG           ID RGGGDGSRGS
P0   A   C   G   T      P0   A   C   G   T
01   0  11   0   0      01   1   0   8   0
02   0   0   0  11      02   0   0   9   0
03   0   3   8   0      03   0   0   9   0
04  11   0   0   0      04   0   0   9   0
05   2   0   9   0      05   3   0   0   6
06   1   7   3   0      06   0   0   9   0
07   0  11   0   0      07   0   1   8   0
08   3   1   0   7      08   4   0   5   0
09   0   2   9   0      09   0   0   9   0
10   0   0  11   0      10   0   3   6   0
AT BGCOUNT=5            AT BGCOUNT=3
AT CORRECTEDBGCOUNT=1   AT CORRECTEDBGCOUNT=0
AT FGCOUNT=11           AT FGCOUNT=9
AT INFO=1.62585         AT INFO=1.66681
AT SCORE=147.1          AT SCORE=135.239

DME97                   DME60
ID ARDGHKAGGC           ID GGCAGVWGRV
P0   A   C   G   T      P0   A   C   G   T
01  11   0   0   0      01   0   0  10   0
02   2   0   9   0      02   0   0  10   0
03   2   0   7   2      03   0  10   0   0
04   0   0  11   0      04  10   0   0   0
05   6   3   0   2      05   0   0  10   0
06   0   0   8   3      06   2   2   6   0
07  11   0   0   0      07   9   0   0   1
08   0   0  11   0      08   0   0  10   0
09   0   0  11   0      09   4   0   6   0
10   2   9   0   0      10   4   3   3   0
AT BGCOUNT=6            AT BGCOUNT=5
AT CORRECTEDBGCOUNT=1   AT CORRECTEDBGCOUNT=1
AT FGCOUNT=11           AT FGCOUNT=10
AT INFO=1.63203         AT INFO=1.60715
AT SCORE=145.82         AT SCORE=134.808


DME76                   DME196
ID WDGVMCCTGG           ID ARGCARGADM
P0   A   C   G   T      P0   A   C   G   T
01   5   0   0   4      01   9   0   0   0
02   1   0   7   1      02   4   0   5   0
03   0   0   9   0      03   0   0   9   0
04   2   4   3   0      04   0   9   0   0
05   2   7   0   0      05   9   0   0   0
06   0   9   0   0      06   2   0   7   0
07   0   9   0   0      07   0   0   9   0
08   0   0   0   9      08   9   0   0   0
09   0   0   9   0      09   3   0   5   1
10   0   0   9   0      10   6   3   0   0
AT BGCOUNT=1            AT BGCOUNT=2
AT CORRECTEDBGCOUNT=0   AT CORRECTEDBGCOUNT=0
AT FGCOUNT=9            AT FGCOUNT=9
AT INFO=1.6191          AT INFO=1.60162
AT SCORE=138.932        AT SCORE=133.406

Supplemental Table 2.9 The top 10 overrepresented motifs (identified by DME) in TAL1-bound positives over negatives in the transgenic mouse assay.

DME38                   DME73
ID KRKGGWKGGG           ID SYTVTCTSTG
P0   A   C   G   T      P0   A   C   G   T
01   0   0  18   7      01   0  13   6   3
02   5   0  20   0      02   0   7   0  15
03   0   0  19   6      03   0   0   0  22
04   0   0  25   0      04   7   7   8   0
05   0   0  25   0      05   0   0   0  22
06  12   0   0  13      06   0  22   0   0
07   0   0  16   9      07   0   0   0  22
08   0   0  25   0      08   0  15   7   0
09   0   0  25   0      09   0   0   0  22
10   0   0  25   0      10   0   0  22   0
AT BGCOUNT=2            AT BGCOUNT=3
AT CORRECTEDBGCOUNT=3   AT CORRECTEDBGCOUNT=5
AT FGCOUNT=25           AT FGCOUNT=22
AT INFO=1.61841         AT INFO=1.61157
AT SCORE=312.739        AT SCORE=288.702


DME40                   DME17
ID SCTKSCTSMC           ID GKGRWGGGGB
P0   A   C   G   T      P0   A   C   G   T
01   0  10  11   0      01   0   0  20   0
02   0  21   0   0      02   0   0  18   2
03   0   0   0  21      03   0   0  20   0
04   0   0  17   4      04   4   0  16   0
05   0  14   7   0      05  13   0   0   7
06   0  21   0   0      06   0   0  20   0
07   0   0   0  21      07   0   0  20   0
08   0   6  15   0      08   0   0  20   0
09   4  17   0   0      09   0   0  20   0
10   0  21   0   0      10   0   6   8   6
AT BGCOUNT=1            AT BGCOUNT=1
AT CORRECTEDBGCOUNT=1   AT CORRECTEDBGCOUNT=1
AT FGCOUNT=21           AT FGCOUNT=20
AT INFO=1.60928         AT INFO=1.62827
AT SCORE=301.105        AT SCORE=279.68

DME23                   DME36
ID GKBTTATCWS           ID GGSWKKGGMC
P0   A   C   G   T      P0   A   C   G   T
01   0   0  19   0      01   0   0  18   0
02   0   0  13   6      02   0   0  18   0
03   0  11   4   4      03   0   7  11   0
04   0   0   0  19      04   5   0   0  13
05   0   0   0  19      05   0   0  14   4
06  19   0   0   0      06   0   0  14   4
07   0   0   0  19      07   0   0  18   0
08   0  19   0   0      08   0   0  18   0
09   9   0   0  10      09  11   7   0   0
10   0  10   9   0      10   0  18   0   0
AT BGCOUNT=0            AT BGCOUNT=0
AT CORRECTEDBGCOUNT=0   AT CORRECTEDBGCOUNT=0
AT FGCOUNT=19           AT FGCOUNT=18
AT INFO=1.60123         AT INFO=1.6104
AT SCORE=297.556        AT SCORE=278.805


DME15                   DME25
ID CHGSCTGMCW           ID CTSCTGKSKS
P0   A   C   G   T      P0   A   C   G   T
01   0  20   0   0      01   0  22   0   0
02   2   1   0  17      02   0   0   0  22
03   0   0  20   0      03   0  10  12   0
04   0  11   9   0      04   0  22   0   0
05   0  20   0   0      05   0   0   0  22
06   0   0   0  20      06   0   0  22   0
07   0   0  20   0      07   0   0   9  13
08   5  15   0   0      08   0   5  17   0
09   0  20   0   0      09   0   0  10  12
10   7   0   0  13      10   0  15   7   0
AT BGCOUNT=1            AT BGCOUNT=2
AT CORRECTEDBGCOUNT=1   AT CORRECTEDBGCOUNT=3
AT FGCOUNT=20           AT FGCOUNT=22
AT INFO=1.65517         AT INFO=1.6051
AT SCORE=297.124        AT SCORE=277.16

DME1                    DME28
ID NCHGCCYCTG           ID CTSKCTSMRG
P0   A   C   G   T      P0   A   C   G   T
01   4  11   1  38      01   0  19   0   0
02   0  54   0   0      02   0   0   0  19
03  10   2   0  42      03   0   5  14   0
04   0   0  54   0      04   0   0   7  12
05   0  54   0   0      05   0  19   0   0
06   0  54   0   0      06   0   0   0  19
07   0   6   0  48      07   0   7  12   0
08   0  54   0   0      08   6  13   0   0
09   0   0   0  54      09  16   0   3   0
10   0   0  54   0      10   0   0  19   0
AT BGCOUNT=1            AT BGCOUNT=1
AT CORRECTEDBGCOUNT=1   AT CORRECTEDBGCOUNT=1
AT FGCOUNT=54           AT FGCOUNT=19
AT INFO=1.6025          AT INFO=1.62109
AT SCORE=291.018        AT SCORE=270.293


Supplemental Table 2.10 The top 10 overrepresented motifs (identified by DME) in TAL1-bound negatives over positives in the transgenic mouse assay.

DME13                   DME23
ID TYWCHTCTSC           ID VYTKGGGDTG
P0   A   C   G   T      P0   A   C   G   T
01   0   0   0  12      01   2   2   8   0
02   0   3   0   9      02   0   6   0   6
03   4   0   0   8      03   0   0   0  12
04   0  12   0   0      04   0   0  11   1
05   2   9   0   1      05   0   0  12   0
06   0   0   0  12      06   0   0  12   0
07   0  12   0   0      07   0   0  12   0
08   0   0   0  12      08   1   0   7   4
09   0   6   6   0      09   0   0   0  12
10   0  12   0   0      10   0   0  12   0
AT BGCOUNT=1            AT BGCOUNT=2
AT CORRECTEDBGCOUNT=0   AT CORRECTEDBGCOUNT=1
AT FGCOUNT=12           AT FGCOUNT=12
AT INFO=1.62791         AT INFO=1.62558
AT SCORE=186.17         AT SCORE=174.914

DME86                   DME130
ID TKTKYCTCTS           ID KGTYTBTKCT
P0   A   C   G   T      P0   A   C   G   T
01   0   0   0  14      01   0   0   4   9
02   0   0   6   8      02   0   0  13   0
03   0   0   0  14      03   0   0   0  13
04   0   0   3  11      04   0   4   0   9
05   0   7   0   7      05   0   0   0  13
06   0  14   0   0      06   0   6   4   3
07   0   0   0  14      07   0   0   0  13
08   0  14   0   0      08   0   0   8   5
09   0   2   0  12      09   0  13   0   0
10   0   6   8   0      10   0   0   0  13
AT BGCOUNT=5            AT BGCOUNT=4
AT CORRECTEDBGCOUNT=2   AT CORRECTEDBGCOUNT=2
AT FGCOUNT=14           AT FGCOUNT=13
AT INFO=1.63681         AT INFO=1.63966
AT SCORE=182.947        AT SCORE=172.434


DME60                   DME162
ID CTDKBKTCTT           ID BTTWCCTCYS
P0   A   C   G   T      P0   A   C   G   T
01   0  14   0   0      01   0   6   4   2
02   0   0   0  14      02   0   0   0  12
03   3   0   4   7      03   0   0   0  12
04   0   0   3  11      04   3   0   0   9
05   0   2   4   8      05   0  12   0   0
06   0   0   4  10      06   0  12   0   0
07   0   0   0  14      07   0   0   0  12
08   0  14   0   0      08   0  12   0   0
09   0   0   0  14      09   0   4   0   8
10   0   0   0  14      10   0   7   5   0
AT BGCOUNT=2            AT BGCOUNT=2
AT CORRECTEDBGCOUNT=1   AT CORRECTEDBGCOUNT=1
AT FGCOUNT=14           AT FGCOUNT=12
AT INFO=1.60548         AT INFO=1.60756
AT SCORE=182.631        AT SCORE=170.874

DME79                   DME28
ID YTKYTYMTCT           ID TKYYCTGMCY
P0   A   C   G   T      P0   A   C   G   T
01   0   7   0   8      01   0   0   0  12
02   0   0   0  15      02   0   0   7   5
03   0   0   5  10      03   0   4   0   8
04   0  10   0   5      04   0   6   0   6
05   0   0   0  15      05   0  12   0   0
06   0   7   0   8      06   0   0   0  12
07   6   9   0   0      07   0   0  12   0
08   0   0   0  15      08   1  11   0   0
09   0  15   0   0      09   0  12   0   0
10   0   0   0  15      10   0   7   0   5
AT BGCOUNT=5            AT BGCOUNT=2
AT CORRECTEDBGCOUNT=2   AT CORRECTEDBGCOUNT=1
AT FGCOUNT=15           AT FGCOUNT=12
AT INFO=1.60327         AT INFO=1.6106
AT SCORE=182.535        AT SCORE=169.458


DME165                  DME41
ID GCTTCTBWYY           ID TTWRTTTWRT
P0   A   C   G   T      P0   A   C   G   T
01   0   0  13   0      01   0   1   0  11
02   0  13   0   0      02   0   0   0  12
03   0   0   0  13      03   4   0   0   8
04   0   0   0  13      04   8   0   4   0
05   0  13   0   0      05   0   0   0  12
06   0   0   0  13      06   0   0   0  12
07   0   5   3   5      07   0   0   0  12
08   7   0   0   6      08   4   0   0   8
09   0   4   0   9      09   6   0   6   0
10   0   6   0   7      10   0   0   0  12
AT BGCOUNT=3            AT BGCOUNT=3
AT CORRECTEDBGCOUNT=1   AT CORRECTEDBGCOUNT=1
AT FGCOUNT=13           AT FGCOUNT=12
AT INFO=1.61884         AT INFO=1.63094
AT SCORE=177.095        AT SCORE=167.671


Supplemental Table 2.11 Matched motifs (found by TOMTOM) found only in active enhancers and only in inactive segments.

Matched motifs in only active enhancers

Category     DME#    Enriched Motif by DME  Matched Motifs_TOMTOM  p-value       E-value  q-value
POSoverNEG   DME36   GGSWKKGGMC             Klf1                   8.88E-06      0.005    0.010
POSoverNEG   DME73   SYTVTCTSTG             TAL1::GATA1            3.36E-05      0.020    0.040
POSoverNEG   DME40   SCTKSCTSMC             EHF                    5.38204E-05   0.032    0.064
POSoverNEG   DME38   KRKGGWKGGG             Ascl2_secondary        0.000122395   0.072    0.039
POSoverNEG   DME23   GKBTTATCWS             GATA2                  0.000162235   0.096    0.080
ENHoverINAC  DME174  GGRRSAGBAR             Ets1                   0.000165971   0.098    0.088
POSoverNEG   DME23   GKBTTATCWS             Gata4                  0.000204232   0.121    0.080
POSoverNEG   DME25   CTSCTGKSKS             Smad3_primary          0.000246497   0.146    0.291
POSoverNEG   DME73   SYTVTCTSTG             Gata4                  0.000255729   0.151    0.101
ENHoverINAC  DME3    KGKTKTGTWY             Foxj3_primary          0.000294644   0.174    0.189
ENHoverINAC  DME3    KGKTKTGTWY             Foxl1_secondary        0.000319483   0.189    0.189
ENHoverINAC  DME1    BTTKRKTTTS             Irf4_primary           0.0004111     0.243    0.274
POSoverNEG   DME40   SCTKSCTSMC             EWSR1-FLI1             0.000449575   0.266    0.265
ENHoverINAC  DME101  GGVDGGSARA             MZF1_5-13              0.000455578   0.269    0.190
POSoverNEG   DME17   GKGRWGGGGB             Smad3_secondary        0.000496504   0.293    0.063
ENHoverINAC  DME101  GGVDGGSARA             PPARG::RXRA            0.000526972   0.311    0.190
POSoverNEG   DME38   KRKGGWKGGG             Esrra_secondary        0.000539849   0.319    0.090
POSoverNEG   DME73   SYTVTCTSTG             GATA2                  0.000540437   0.319    0.160
POSoverNEG   DME17   GKGRWGGGGB             Zfp410_secondary       0.000544423   0.322    0.063
POSoverNEG   DME17   GKGRWGGGGB             Esrra_secondary        0.000596557   0.353    0.063
POSoverNEG   DME38   KRKGGWKGGG             RREB1                  0.000800137   0.473    0.104
POSoverNEG   DME25   CTSCTGKSKS             Myf6_secondary         0.000819327   0.484    0.399
POSoverNEG   DME1    NCHGCCYCTG             Tcfap2a_secondary      0.000903988   0.534    0.302
ENHoverINAC  DME1    BTTKRKTTTS             Foxl1_secondary        0.000940153   0.556    0.274
ENHoverINAC  DME1    BTTKRKTTTS             Foxa2_primary          0.000960547   0.568    0.274
ENHoverINAC  DME1    BTTKRKTTTS             STAT2::STAT1           0.000995009   0.588    0.274
ENHoverINAC  DME79   WTWTKRKTTT             Hoxb13_3479.1          0.00110759    0.655    0.828
POSoverNEG   DME15   CHGSCTGMCW             Bhlhb2_secondary       0.00111977    0.662    0.541
POSoverNEG   DME25   CTSCTGKSKS             Bach1::Mafk            0.00125931    0.744    0.399
POSoverNEG   DME1    NCHGCCYCTG             Tcfap2c_secondary      0.00127602    0.754    0.302
POSoverNEG   DME1    NCHGCCYCTG             Myog                   0.00129152    0.763    0.302
POSoverNEG   DME15   CHGSCTGMCW             INSM1                  0.00129411    0.765    0.541
POSoverNEG   DME40   SCTKSCTSMC             PAX5                   0.00133449    0.789    0.395
ENHoverINAC  DME8    TTTGDWTKCW             FOXP2                  0.00138068    0.816    0.598
ENHoverINAC  DME1    BTTKRKTTTS             Foxk1_primary          0.00139959    0.827    0.274
ENHoverINAC  DME1    BTTKRKTTTS             Irf6_primary           0.00139959    0.827    0.274
POSoverNEG   DME17   GKGRWGGGGB             Pax4                   0.00144363    0.853    0.129
POSoverNEG   DME38   KRKGGWKGGG             MZF1_5-13              0.00144414    0.853    0.169
ENHoverINAC  DME8    TTTGDWTKCW             Foxj1_primary          0.00165576    0.979    0.598
ENHoverINAC  DME174  GGRRSAGBAR             ELF1                   0.00167203    0.988    0.265
ENHoverINAC  DME174  GGRRSAGBAR             Crx                    0.00168971    0.999    0.265

Matched motifs in only inactives

Category     DME#    Enriched Motif by DME  Matched Motifs_TOMTOM  p-value       E-value  q-value
NEGoverPOS   DME41   TTWRTTTWRT             Hoxd10_2368.2          0.0002        0.120    0.133
NEGoverPOS   DME41   TTWRTTTWRT             Hoxa9_2622.2           0.0002        0.134    0.133
NEGoverPOS   DME130  KGTYTBTKCT             RUNX2                  0.0002        0.139    0.277
NEGoverPOS   DME60   CTDKBKTCTT             Sox11_primary          0.0003        0.201    0.355
NEGoverPOS   DME60   CTDKBKTCTT             Sox4_primary           0.0006        0.355    0.355
INACoverENH  DME76   WDGVMCCTGG             MAX                    0.0009        0.555    0.853
INACoverENH  DME22   GSTGCWGHAG             REST                   0.0011        0.665    0.697
INACoverENH  DME22   GSTGCWGHAG             Tcf12                  0.0012        0.697    0.697
INACoverENH  DME60   GGCAGVWGRV             Myb                    0.0013        0.752    0.324
INACoverENH  DME60   GGCAGVWGRV             Tcf3                   0.0014        0.811    0.324
NEGoverPOS   DME41   TTWRTTTWRT             Hoxa10_2318.1          0.0016        0.945    0.513

Supplemental Table 2.12 Conservation scores of meta-data (273 DNA segments) tested in transfection assay and 66 TAL1 OSs tested in transgenic mice.

Chr (mm9)  Start  End  ID  Phylop (100bp centered)  PhastCons (100bp centered)  Enhancer assay
chr1 132196529 132197197 TAL1_184 0.089 0 K562
chr13 14634853 14635762 TAL1_1195 0 0 K562
chr12 32734927 32735442 TAL1_1025 0.06846 0 K562
chrX 7418047 7418661 TAL1_4858 0 0.23883 K562
chr15 66657520 66658263 TAL1_1578 -0.07935 0 K562
chr3 146068529 146069275 TAL1_3022 0 0 K562
chr7 111009280 111010007 TAL1_4199 0 0.23256 K562
chr18 38654476 38654937 TAL1_2135 0 0.05046 K562
chr9 123864720 123865709 TAL1_4851 -0.00126 0 K562
chr7 16880022 16880853 TAL1_3960 0 0.74198 K562
chr14 118593772 118594280 TAL1_1496 0 0 K562
chr18 32701721 32702830 TAL1_2105 0 0.28848 K562
chr19 37568375 37569012 TAL1_2302 0 0 K562
chr1 88375618 88376290 TAL1_127 0.07654 0 K562
chr11 77882879 77883715 TAL1_758 0 0 K562
chr12 88141648 88142376 TAL1_1123 -0.1256 0 K562
chr1 153800663 153801297 TAL1_244 -0.12017 0 K562

chr7 133185585 133186156 TAL1_4249 0 0.06159 K562
chr4 106680675 106681320 TAL1_3158 -0.10136 0.00281 K562
chr2 167875757 167876211 TAL1_2750 0.37703 0.38919 K562
chr4 155162230 155162903 TAL1_3365 0.10211 0.1871 K562
chr8 37346436 37347275 TAL1_4371 -0.03665 0 K562
chr9 45611360 45612037 TAL1_4652 0.12467 0 K562
chr10 116543101 116543855 TAL1_541 0.8007 0.96279 K562
chr5 85240440 85241048 TAL1_3467 1.795 0.9986 K562
chr8 83017196 83018542 TAL1_4423 0.16889 0 K562
chr6 72229401 72230032 TAL1_3800 0 0 K562
chr7 91871088 91871876 TAL1_4157 0 0.01223 K562
chr16 57241944 57242302 TAL1_1799 -0.08054 0.04892 K562
chr8 35160152 35161438 TAL1_4361 0.07724 0 K562
chr6 120530027 120530545 TAL1_3879 0 0.95806 K562
chr16 49839564 49840556 TAL1_1796 0.27409 0.40064 K562
chr2 167626792 167627467 TAL1_2747 -0.01629 0.04672 K562
chr10 24354921 24355529 TAL1_379 -0.12322 0.00382 K562
chr12 58298497 58299143 TAL1_1050 0.21507 0 K562
chr4 134275289 134275961 TAL1_3278 -0.12253 0.00484 K562
chr5 77087161 77087852 TAL1_3461 0.34311 0.27508 K562
chr14 123430438 123430680 TAL1_1506 0 0 K562
chr16 34044485 34045111 TAL1_1774 0.09197 0.0207 K562
chr15 8711603 8712036 TAL1_1515 0.29823 0 K562
chr1 135029194 135029593 TAL1_196 0.05302 0 K562
chr19 40887714 40888455 TAL1_2311 0 0 K562
chr2 28477394 28477920 TAL1_2434 0.35253 0.24124 K562
chr6 135146524 135147153 TAL1_3920 0 0.08377 K562
chr7 6126508 6127068 TAL1_3953 0 0.03164 K562
chr3 121950048 121950842 TAL1_2980 0 0 K562
chr3 103179283 103179942 TAL1_2950 0 0 K562
chr14 71023649 71024262 TAL1_1464 0 0 K562
chr12 29502832 29503567 TAL1_1020 0.16225 0 K562
chr1 38056216 38056808 TAL1_43 0.06829 0 K562
chr5 147127885 147128511 TAL1_3657 0 0 K562
chr1 52580874 52581510 TAL1_63 0.03766 0 K562
chr11 72295276 72296391 TAL1_734 0 0 K562
chr2 109749209 109749667 TAL1_2578 0.04986 0.08731 K562
chr9 102674523 102674817 TAL1_4783 -0.08314 0 K562
chr11 23993994 23994533 TAL1_612 0 0 K562
chr15 12838058 12838708 TAL1_1529 0.32639 0 K562
chr5 148253420 148253996 TAL1_3663 0.07009 0.10254 K562
chr9 96141676 96142381 TAL1_4769 0.04844 0 K562
chr11 96802210 96802881 TAL1_842 0 0 K562

chr11 96779483 96779756 TAL1_840 0 0 K562
chr3 21894787 21895266 TAL1_2794 0 0 K562
chr17 4625931 4626888 TAL1_1863 0 0 K562
chr18 78319653 78320300 TAL1_2190 0 0.15407 K562
chr11 121294437 121295272 TAL1_987 0 0 K562
chr16 17637963 17639064 TAL1_1742 -0.31727 0.00265 K562
chr1 135721557 135722921 TAL1_201 0.02661 0 K562
chr1 156886224 156886842 TAL1_250 0.57015 0 K562
chr9 110570916 110571692 TAL1_4816 0.15057 0 K562
chr2 35191661 35192274 TAL1_2471 -0.1933 0.00273 K562
chr7 111009146 111010117 GHP181 0 0.34844 K562
chr7 134234693 134235292 GHP304 0 0.00952 K562
chr7 104358431 104359030 GHP147 0 0.33806 K562
chr7 118147598 118148147 GHP221 0 0.18185 K562
chr7 82756750 82757349 GHP53 0 0.55837 K562
chr7 132572012 132572611 GHP293 0 0.08687 K562
chr7 86503239 86503938 GHP68 0 0.00164 K562
chr7 135444428 135445076 GHP313 0 0.00257 K562
chr7 73341400 73342099 GHP10 0 0.00125 K562
chr7 88863914 88864563 GHP88 0 0.03901 K562
chr7 128123364 128124013 GHP264 0 0.13983 K562
chr7 130509502 130510051 GHP275 0 0.01071 K562
chr7 111014108 111014707 GHP182 0 0.13652 K562
chr7 134913927 134914526 GHP309 0 0.19597 K562
chr7 86887199 86887748 GHP72 0 0.0026 K562
chr7 118322848 118323397 GHP222 0 0.02296 K562
chr7 104490017 104490566 GHP150 0 0.02319 K562
chr7 71194021 71194720 GHP4 0 0.00681 K562
chr7 130114466 130115369 GHP270 0 0.00255 K562
chr7 116254241 116254990 GHP204 0 1 K562
chr7 132615562 132616211 GHP296 0 0.02623 K562
chr7 135557758 135558307 GHP314 0 0.00538 K562
chr7 116258367 116259066 GHP205 0 0.92525 K562
chr7 112900289 112900988 GHP196 0 0.16672 K562
chr7 106623318 106623984 GHP156 0 0.42567 K562
chr7 120228608 120229207 GHP228 0 0.02705 K562
chr7 109390843 109392019 GHP172 0 0.09025 K562
chr7 72834698 72835647 GHP7 0 0.00548 K562
chr7 91735511 91736060 GHP105 0 0.01386 K562
chr7 104881929 104882528 GHP152 0 0.00803 K562
chr7 87353740 87354439 GHP74 0 0.07598 K562
chr7 107992742 107993391 GHP167 0 0.02949 K562
chr7 107615488 107616487 GHP163 0 0.0269 K562

chr7 106973326 106974333 GHP159 0 0.54894 K562
chr7 123344417 123345116 GHP246 0 0.03399 K562
chr7 109397425 109397874 GHP173 0 0.00381 K562
chr7 117918205 117918754 GHP216 0 0.00909 K562
chr7 87438711 87439660 GHP75 0 0.00585 K562
chr7 87330823 87331372 GHP73 0 0.4304 K562
chr7 134910576 134911275 GHP308 0 0.23294 K562
chr7 88202775 88203374 GHP82 0 0.00379 K562
chr7 71146719 71147318 GHP3 0 0.03176 K562
chr7 91799621 91800720 GHP106 0 0.57083 K562
chr7 74355378 74355948 GHP16 0 0 K562
chr7 97211041 97211927 GHP117 0 0.07779 K562
chr7 107327713 107328262 GHP160 0 0.00649 K562
chr7 130521146 130522045 GHP276 0 0.11039 K562
chr7 71096727 71097876 GHP2 0 0.07897 K562
chr7 88516140 88516989 GHP87 0 0.62479 K562
chr7 111058491 111059040 GHP183 0 0.19998 K562
chr7 87774111 87774810 GHP78 0 0.32473 K562
chr7 107833031 107833730 GHP165 0 0.12745 K562
chr7 90844040 90844989 GHP100 0 0.12546 K562
chr7 73224076 73224875 GHP8 0 0.08011 K562
chr7 72445253 72445802 GHP6 0 0.00527 K562
chr7 97277676 97278225 GHP118 0 0.94009 K562
chr7 133540211 133540860 GHP301 0 0.18645 K562
chr7 133269983 133270532 GHP297 0 0.00217526 K562
chr7 133491588 133492187 GHP300 0 0.04165 K562
chr7 88973238 88973787 GHP90 0 0.00487 K562
chr7 110976236 110976785 GHP180 0 0.20678 K562
chr7 91046669 91047368 GHP101 0 0.04802 K562
chr7 135815708 135816257 GHP316 0 0.00682 K562
chr7 70735684 70736133 GHP0 0 0.005 K562
chr7 70865082 70865631 GHP1 0 0.0058 K562
chr7 72445253 72445802 GHP6 0 0.00527 K562
chr7 73302100 73302699 GHP9 0 0.00171 K562
chr7 73425051 73425600 GHP11 0 0.00729 K562
chr7 73613801 73614350 GHP12 0 0.03905 K562
chr7 73722956 73723555 GHP13 0 0.08906 K562
chr7 73894613 73895162 GHP14 0 0.76353 K562
chr7 73908565 73909114 GHP15 0 0.04825 K562
chr7 74441122 74441828 GHP17 0 0.159295 K562
chr7 75118605 75119154 GHP18 0 0.00821 K562
chr7 75356326 75356875 GHP19 0 0.01697 K562
chr7 76507384 76507933 GHP20 0 0.00582 K562

chr7 76852202 76852751 GHP21 0 0.0103137 K562
chr7 76974115 76974664 GHP22 0 0.01001 K562
chr7 77036182 77036881 GHP23 0 0.99959 K562
chr7 77104468 77105517 GHP24 0 0.11184 K562
chr7 77238268 77238717 GHP25 0 0.01035 K562
chr7 77289559 77290129 GHP26 0 0.00956 K562
chr7 77456358 77456907 GHP27 0 0.01875 K562
chr7 77558135 77558684 GHP28 0 0.02823 K562
chr7 77659306 77659855 GHP29 0 0.77 K562
chr7 78181804 78182353 GHP30 0 0.56071 K562
chr7 78534090 78534639 GHP31 0 0.0074 K562
chr7 78647752 78648301 GHP32 0 0.004 K562
chr7 80201188 80201787 GHP42 0 0 K562
chr7 88970980 88971579 GHP89 0 0.02651 K562
chr7 90120140 90120689 GHP93 0 0.00874 K562
chr7 90245535 90246084 GHP94 0 0.01105 K562
chr7 98623314 98623913 GHP122 0 0.03818 K562
chr7 99412978 99413527 GHP127 0 0.11392 K562
chr7 99419545 99420144 GHP128 0 0.01089 K562
chr7 99547805 99548404 GHP130 0 0.00631 K562
chr7 107336599 107337148 GHP161 0 0.00636 K562
chr7 107727090 107727639 GHP164 0 0.76639 K562
chr7 108648673 108649222 GHP169 0 0.7572 K562
chr7 108867272 108867821 GHP170 0 0.00841 K562
chr7 111119394 111119943 GHP184 0 0.00676 K562
chr7 111232683 111233032 GHP185 0 0.28823 K562
chr7 111267298 111267897 GHP186 0 0.00709 K562
chr7 112561519 112562068 GHP191 0 0.91711 K562
chr7 112696459 112697108 GHP193 0 0.02346 K562
chr7 112857304 112857753 GHP194 0 0.75328 K562
chr7 113438116 113438965 GHP197 0 0 K562
chr7 114218509 114219108 GHP198 0 0.44612 K562
chr7 114704028 114704577 GHP199 0 0.04216 K562
chr7 114839819 114840368 GHP200 0 0.01591 K562
chr7 115415642 115416191 GHP201 0 0.58072 K562
chr7 115474285 115474884 GHP202 0 0.04409 K562
chr7 115940055 115941254 GHP203 0 0.02904 K562
chr7 116632200 116632749 GHP206 0 0.02914 K562
chr7 118020599 118021198 GHP219 0 0.00119 K562
chr7 118604074 118604623 GHP223 0 0.38115 K562
chr7 120023689 120024338 GHP227 0 0.0647 K562
chr7 121505468 121506067 GHP234 0 0.87875 K562
chr7 122241704 122242253 GHP241 0 0.52402 K562

chr7 122770207 122771206 GHP243 0 0.99986 K562 chr7 130857235 130857784 GHP279 0 0.150982 K562 chr7 131129748 131130297 GHP283 0 0.12055 K562 chr7 132234331 132235580 GHP291 0 0 K562 chr7 134991063 134992012 GHP310 0 0.94809 K562 chr7 135065445 135066044 GHP311 0 0.75464 K562 chr6 88152903 88153447 Gata2R5 0 0.97179 K562 chr8 124837163 124837562 Zfpm1R13 0.48134 0 K562 chr6 88066819 88067388 Gata2R1 0 0.88411 K562 chrX 146984294 146984495 Alas2R1 0 0.41162 K562 chr2 27199488 27199723 Vav2R3 0.29682 0.36388 K562 chr8 124838838 124839093 Zfpm1R24 0.5536 0 K562 chr6 88139617 88140016 Gata2R8 0 0.98749 K562 chrX 146999208 146999549 Alas2R3 0 0.67181 K562 chr2 27267361 27267660 Vav2R5 0.47307 0.32046 K562 chr8 124846623 124847102 Zfpm1R14 1.19905 0 K562 chr8 124806654 124806897 Zfpm1R1 1.22337 0 K562 chr8 124825010 124825309 Zfpm1R18 2.82566 0 K562 chr8 124820526 124820950 Zfpm1R21 1.27927 0 K562 chr6 38796863 38797087 Hipk2R27 0 0.30117 K562 chr6 88168966 88169500 Gata2R6 0 0.94748 K562 chr6 88141777 88142006 Gata2R7 0 0.85356 K562 chr8 124843371 124843635 Zfpm1R10 0.40128 0 K562 chr8 124808263 124808907 Zfpm1R2 0.41717 0 K562 chr6 38711397 38711542 Hipk2R23 0 0.71971 K562 chr1 135969753 135969902 Btg2R3 0.61718 0 K562 chr8 124814848 124815272 Zfpm1R6 1.85273 0 K562 chr8 124853571 124853860 Zfpm1R12 1.61654 0 K562 chr6 38824639 38824903 Hipk2R39 0 0.51133 K562 chr1 136002653 136002802 Btg2R8 0.03109 0 K562 chr6 135091565 135092125 Hebp1R3 0 0.23521 K562 chr6 88081846 88082276 Gata2R9 0 0.63537 K562 chr6 135115673 135115946 Hebp1R2 0 0.04278 K562 chr2 27293121 27293330 Vav2R6 0.15325 0.1356 K562 chr6 38704031 38704437 Hipk2R4 0 0.97037 K562 chr8 124850001 124850270 Zfpm1R11 1.04784 0 K562 chr8 124823870 124824145 Zfpm1R28 1.18775 0 K562 chr8 124779650 124780054 Zfpm1R16 1.10001 0 K562 chr8 124839961 124840270 Zfpm1R9 0.68605 0 K562 chr8 124845689 124845892 Zfpm1R19 0.64507 0 K562 chr8 124820151 124820506 Zfpm1R7 0.07559 0 K562 chr8 124807803 124808237 Zfpm1R5 0.02925 0 
K562 chr8 124810751 124811295 Zfpm1R3 0.90411 0 K562

chr8 124852939 124853188 Zfpm1R15 0.75985 0 K562 chr6 38800035 38800364 Hipk2R28 0 0.95868 K562 chr2 27249343 27249572 Vav2R4 0.18545 0.02629 K562 chr8 124833200 124833376 Zfpm1R8 0.3838 0 K562 chr6 88140542 88141081 Gata2R3 0 0.99998 K562 chr8 124830554 124831203 Zfpm1R4 1.0195 0 K562 chr6 38824490 38824648 Hipk2R40 0 0.97949 K562 chr1 135989522 135989792 Btg2R9 0.05381 0 K562 chr6 38721281 38721450 Hipk2R16 0 0.7437 K562 chr2 27184263 27184417 Vav2R10 0.16389 0.33665 K562 chr2 27244974 27245564 Vav2R7 -0.13542 0.00471 K562 chr8 124854446 124854615 Zfpm1R27 0.02033 0 K562 chr8 124831546 124831700 Zfpm1R29 -0.30482 0 K562 chr7 90212473 90213357 GHN322 0 0.00166667 K562 chr7 107830000 107830921 GHN534 0 0.06179 K562 chr7 79090283 79091223 GHN133 0 0.04312 K562 chr7 80593212 80593977 GHN159 0 0.00637 K562 chr7 84135423 84136249 GHN213 0 0.01004 K562 chr7 86291470 86292288 GHN240 0 0.00829 K562 chr7 73342767 73343574 GHN37 0 0.04838 K562 chr7 97306681 97307481 GHN391 0 0.00528 K562 chr7 99540280 99541080 GHN419 0 0.13763 K562 chr7 104665107 104666061 GHN478 0 0.00457 K562 chr7 70644202 70645021 GHN6 0 0.14691 K562 chrX 147000391 147000733 Alas2NC1 0 0.01774 K562 chrX 147281215 147281371 Alas2NC2 0 0.48481 K562 chr6 88153903 88154127 Gata2NC1 0 0.00209 K562 chr6 88272307 88272589 Gata2NC2 0 0.00409 K562 chr6 38782189 38782723 Hipk2NC1 0 0.04576 K562 chr6 38649010 38649300 Hipk2R19 0 0.00291 K562 chr6 38707478 38707762 Hipk2NC4 0 0.00601 K562 chr2 27220045 27220284 Vav2NC1 0.08946 0.01966 K562 chr2 27007744 27007934 Vav2NC2 0.15981 0.01535 K562 chr8 124899439 124899655 Zfpm1NC1 -0.05585 0 K562 chr8 124757292 124757442 Zfpm1NC2 0.15337 0 K562 chr8 124769944 124770211 Zfpm1NC3 0.06271 0 K562 chr8 124632865 124633128 Zfpm1NC4 0.00668001 0 K562 chr6 38715235 38715384 Hipk2R25 0 0.09778 K562 chr6 38806152 38806301 Hipk2R30 0 0.00947 K562 chr6 38592265 38592444 Hipk2R33 0 0.15163 K562 chr6 38616598 38616976 Hipk2NC3 0 0.00451 K562 chr11 86400220 86400815 TAL1_793 0 0 E11.5 
chr2 167875757 167876211 TAL1_2750 0.37703 0.38919 E11.5

chr3 145627288 145627941 TAL1_3020 0 0 E11.5 chr12 88141648 88142376 TAL1_1123 -0.1256 0 E11.5 chr14 118593772 118594280 TAL1_1496 0 0 E11.5 chr11 32145196 32146357 TAL1_626 0 0 E11.5 chr12 84173911 84175203 TAL1_1105 0.17931 0 E11.5 chr12 90222271 90223529 TAL1_1130 1.09518 0 E11.5 chr17 12397492 12398656 TAL1_1882 0 0.06312 E11.5 chr18 32701721 32702830 TAL1_2105 0 0.28848 E11.5 chr2 31983174 31984266 TAL1_2456 -0.04058 0.01213 E11.5 chr8 122817261 122818516 TAL1_4518 0.07853 0 E11.5 chr8 122895394 122896558 TAL1_4521 -0.11089 0 E11.5 chr9 63892510 63893729 TAL1_4694 0.8283 0 E11.5 chr2 103733606 103734648 TAL1_2564 0.49672 0.4418 E11.5 chr7 116258121 116259041 TAL1_4207 0 0.9681 E11.5 chr6 122343482 122344193 TAL1_3897 0 0.0036 E11.5 chr6 122344239 122344657 TAL1_3898 0 0.01142 E11.5 chr11 11736803 11737167 TAL1_594 0 0 E11.5 chr19 37568375 37569012 TAL1_2302 0 0 E11.5 chr4 133285425 133286226 TAL1_3269 2.0333 0.99818 E11.5 chr18 75662321 75663806 TAL1_2184 0 0.11112 E11.5 chr2 172860208 172861567 TAL1_2756 0.16728 0.36031 E11.5 chr4 131632510 131633977 TAL1_3253 -0.09273 0.00323 E11.5 chr5 38900658 38901482 TAL1_3422 0 0 E11.5 chr5 120586118 120586932 TAL1_3553 0.58219 0.44953 E11.5 chr8 87254405 87255295 TAL1_4435 1.26089 0 E11.5 chr9 70682019 70683270 TAL1_4737 0.13076 0 E11.5 chr11 98254393 98255110 TAL1_855 0 0 E11.5 chr2 45027025 45027862 TAL1_2495 2.31783 0.99977 E11.5 chr2 152614615 152616061 TAL1_2677 0.49959 0.29835 E11.5 chr4 117502412 117503328 TAL1_3188 0.42997 0.48094 E11.5 chr6 50840651 50841395 TAL1_3759 0 0.92665 E11.5 chr13 34919805 34920515 TAL1_1216 0 0 E11.5 chr1 156886224 156886842 TAL1_250 0.57015 0 E11.5 chr11 21978925 21979264 TAL1_604 0 0 E11.5 chr11 98253860 98254298 TAL1_854 0 0 E11.5 chr8 125235866 125236576 TAL1_4555 -0.15333 0 E11.5 chr15 66657520 66658263 TAL1_1578 -0.07935 0 E11.5 chr12 29502832 29503567 TAL1_1020 0.16225 0 E11.5 chr5 85240440 85241048 TAL1_3467 1.795 0.9986 E11.5 chr11 95200815 95202005 TAL1_830 0 0 E11.5 chr1 
59010525 59012054 TAL1_70 0.099 0 E11.5 chr1 183187098 183187939 TAL1_316 0.07465 0 E11.5 chr11 95217466 95218800 TAL1_831 0 0 E11.5

chr11 102224771 102225712 TAL1_885 0 0 E11.5 chr11 32150587 32151471 TAL1_627 0 0 E11.5 chr12 88156953 88157678 TAL1_1124 1.03967 0 E11.5 chr14 69922592 69923552 TAL1_1454 0 0 E11.5 chr2 38391368 38392292 TAL1_2485 -0.00439 0.02661 E11.5 chr3 100242086 100243806 TAL1_2931 0 0 E11.5 chr11 57464978 57465599 TAL1_675 0 0 E11.5 chr15 83353446 83354080 TAL1_1659 -0.06298 0 E11.5 chr19 24526931 24527416 TAL1_2272 0 0 E11.5 chr5 23849251 23849605 TAL1_3383 -0.21455 0.00689 E11.5 chr15 66973594 66974313 TAL1_1581 0.10503 0 E11.5 chr3 60406761 60407630 TAL1_2837 0 0 E11.5 chr14 69935941 69937195 TAL1_1455 0 0 E11.5 chr6 120124224 120125340 TAL1_3876 0 0.35196 E11.5 chr1 155880862 155881608 TAL1_248 0.22922 0 E11.5 chr11 45618653 45619607 TAL1_636 0 0 E11.5 chr2 152642938 152644140 TAL1_2680 0.09174 0.00957 E11.5 chr5 115716914 115717769 TAL1_3535 0.58507 0.4338 E11.5 chr19 9018417 9019166 TAL1_2236 0 0 E11.5 chr11 65463268 65464062 TAL1_712 0.26571 0.16543 E11.5 chr11 77236866 77237466 TAL1_752 0.71749 0.43424 E11.5


Supplemental Table 2.13 Preserved WGATAR motif (1 for preserved motif; 0 for not preserved motif; -1 for no motif) at 151 GATA1-bound and 151 GATA1-TAL1 co-bound segments (1: occupancy and 0: no occupancy).

Chr (mm9)  Start  End  ID  GATA1 occupancy  TAL1 occupancy  Preserved WGATAR motif  Enhancer activity
chr8  124843371  124843635  TAL1_184  1  1  1  23.100
chr3  146068529  146069275  Gata2R5  1  0  1  15.634
chr7  87353740  87354439  TAL1_1195  1  1  -1  13.063
chr5  77087161  77087852  TAL1_1025  1  1  0  11.597
chr7  111058491  111059040  TAL1_4858  1  1  0  10.789
chr1  135029194  135029593  TAL1_1578  1  1  0  10.094
chr8  124853571  124853860  TAL1_3022  1  1  1  9.986
chr8  124845689  124845892  TAL1_4199  1  1  1  9.735
chr8  124838838  124839093  GHP181  1  1  1  7.665
chr2  28477394  28477920  TAL1_2135  1  1  1  7.294
chr8  124852939  124853188  TAL1_4851  1  1  0  6.814
chr2  109749209  109749667  TAL1_3960  1  1  -1  6.607
chr12  32734927  32735442  Btg2R9  1  1  0  6.363
chr7  107992742  107993391  TAL1_1496  1  1  1  6.341
chr7  117918205  117918754  TAL1_2105  1  1  1  6.270
chr7  104490017  104490566  GHP304  1  1  0  6.176
chr7  97277676  97278225  Zfpm1R13  1  1  1  5.977
chr1  135969753  135969902  TAL1_2302  1  1  1  5.804
chr7  88863914  88864563  GHP221  1  1  1  5.517
chr7  132615562  132616211  GHP53  1  1  1  5.380
chrX  7418047  7418661  Gata2R1  1  1  1  4.907
chr16  34044485  34045111  TAL1_127  1  1  0  4.895
chr1  132196529  132197197  Alas2R1  1  1  1  4.877
chr7  107833031  107833730  TAL1_758  1  1  1  4.702
chr7  74441122  74441828  Vav2R3  1  0  1  4.629
chr7  104881929  104882528  TAL1_1123  1  1  -1  4.542
chr7  134913927  134914526  GHP293  1  1  0  4.340
chr8  124820151  124820506  TAL1_4249  1  1  0  4.248
chr16  49839564  49840556  GHP68  1  1  1  4.169
chr1  156886224  156886842  Zfpm1R24  1  0  1  4.116
chr3  121950048  121950842  TAL1_3158  1  1  1  4.031
chr7  88202775  88203374  TAL1_2750  1  1  1  3.941
chr14  71023649  71024262  TAL1_3365  1  1  0  3.900
chr7  111009146  111010117  Gata2R8  1  0  1  3.780
chr8  124846623  124847102  GHP313  1  0  0  3.660
chr18  38654476  38654937  GHP10  1  1  1  3.596
chr7  107327713  107328262  TAL1_4371  1  1  1  3.520
chr12  58298497  58299143  GHP88  1  1  0  3.347

chr6  88152903  88153447  Alas2R3  1  1  1  3.332
chr7  130509502  130510051  GHP264  1  1  1  3.304
chr7  111014108  111014707  GHP275  1  1  0  3.248
chr15  12838058  12838708  TAL1_4652  1  1  0  3.232
chr4  106680675  106681320  GHP182  1  1  1  3.228
chr7  87774111  87774810  TAL1_541  1  1  0  3.207
chr7  71194021  71194720  GHP309  1  1  0  3.157
chr7  135557758  135558307  GHP72  1  0  0  3.080
chr7  130521146  130522045  TAL1_4423  1  1  0  3.053
chr1  52580874  52581510  TAL1_3800  1  1  0  2.980
chr8  124850001  124850270  TAL1_4157  1  1  1  2.863
chrX  146999208  146999549  GHP222  1  1  0  2.813
chr7  134234693  134235292  GHP150  1  0  0  2.750
chr8  124810751  124811295  TAL1_4361  1  1  1  2.630
chr7  116254241  116254990  GHP4  1  1  0  2.605
chr9  45611360  45612037  GHP270  1  0  1  2.604
chr11  121294437  121295272  Zfpm1R14  1  0  1  2.559
chr11  72295276  72296391  TAL1_3879  1  1  1  2.487
chr8  124830554  124831203  Zfpm1R1  1  0  1  2.465
chr7  135444428  135445076  GHP204  1  1  1  2.437
chr7  86887199  86887748  GHP296  1  1  1  2.422
chr6  88141777  88142006  TAL1_1796  1  1  1  2.342
chr6  120530027  120530545  GHP314  1  1  1  2.314
chr7  73341400  73342099  GHP205  1  1  1  2.296
chr4  155162230  155162903  GHP196  1  0  0  2.255
chr8  124814848  124815272  TAL1_2747  1  1  0  2.251
chr19  37568375  37569012  GHP156  1  1  1  2.236
chr7  74355378  74355948  TAL1_379  1  1  0  2.212
chr7  91735511  91736060  TAL1_1050  1  1  1  2.195
chr7  128123364  128124013  GHP228  1  1  0  2.170
chr12  88141648  88142376  GHP172  1  1  1  2.165
chr3  103179283  103179942  TAL1_3278  1  1  0  2.133
chr12  29502832  29503567  TAL1_3461  1  1  1  2.123
chr7  16880022  16880853  GHP105  1  1  0  2.053
chr8  124837163  124837562  GHP152  1  0  0  2.045
chr7  112900289  112900988  GHP74  1  1  1  2.025
chr7  106973326  106974333  TAL1_1774  1  1  0  2.013
chr1  135721557  135722921  Zfpm1R21  1  0  1  2.000
chr11  77882879  77883715  GHP167  1  1  0  1.986
chr15  8711603  8712036  TAL1_1515  1  1  1  1.975
chr1  88375618  88376290  GHP163  1  1  1  1.975
chr7  109397425  109397874  TAL1_196  1  1  1  1.958
chr7  133491588  133492187  Zfpm1R27  1  0  0  1.946
chr7  118147598  118148147  GHP159  1  1  0  1.878
chr7  111009280  111010007  Gata2R7  1  0  1  1.873

chr17  4625931  4626888  Zfpm1R10  1  1  1  1.862
chr7  133269983  133270532  Zfpm1R2  1  1  1  1.785
chr7  132572012  132572611  GHP173  1  0  0  1.770
chr6  135146524  135147153  TAL1_2311  1  1  0  1.752
chr8  37346436  37347275  GHP216  1  1  0  1.734
chr8  124899439  124899655  TAL1_2434  1  1  1  1.728
chr13  14634853  14635762  Btg2R3  1  1  1  1.681
chr6  135115673  135115946  TAL1_3920  1  1  0  1.679
chr3  21894787  21895266  Zfpm1NC1  1  0  -1  1.666
chr7  116258367  116259066  GHP73  1  0  1  1.655
chr7  97211041  97211927  TAL1_3953  1  1  0  1.647
chr8  35160152  35161438  GHP308  1  1  0  1.626
chr7  135815708  135816257  Zfpm1R6  1  0  1  1.583
chr10  24354921  24355529  GHP82  1  1  1  1.580
chr6  72229401  72230032  GHP3  1  0  0  1.557
chr7  133185585  133186156  GHP18  1  1  0  1.522
chr1  135989522  135989792  GHP106  1  1  0  1.521
chr7  134991063  134992012  Zfpm1R12  1  1  1  1.519
chr7  91799621  91800720  TAL1_2980  1  1  0  1.501
chr7  75118605  75119154  TAL1_2950  1  1  0  1.471
chr8  124820526  124820950  TAL1_1464  1  1  1  1.414
chr4  134275289  134275961  TAL1_1020  1  1  1  1.404
chr8  124807803  124808237  TAL1_43  1  1  0  1.380
chr1  38056216  38056808  TAL1_3657  1  1  0  1.367
chr7  82756750  82757349  GHP16  1  1  0  1.364
chr9  96141676  96142381  TAL1_63  1  1  0  1.318
chr11  96802210  96802881  TAL1_734  1  1  1  1.303
chr7  109390843  109392019  Hebp1R2  1  1  1  1.223
chr14  118593772  118594280  GHP117  1  0  0  1.201
chr7  134910576  134911275  TAL1_2578  1  1  0  1.200
chr7  73224076  73224875  Zfpm1R11  1  0  1  1.166
chr7  133540211  133540860  Zfpm1R19  1  1  1  1.132
chr9  110570916  110571692  Zfpm1R7  1  0  1  1.129
chr7  91046669  91047368  Zfpm1R5  1  0  1  1.115
chr7  88973238  88973787  Zfpm1R3  1  1  1  1.093
chr6  88066819  88067388  GHP160  1  1  0  1.082
chr10  116543101  116543855  GHP276  1  0  0  1.071
chr7  107615488  107616487  TAL1_1529  1  1  1  1.037
chr6  88139617  88140016  GHP2  1  0  0  1.014
chr5  147127885  147128511  TAL1_3663  1  1  0  0.997
chr16  17637963  17639064  Zfpm1R15  1  0  1  0.989
chr2  167875757  167876211  GHP183  1  1  0  0.976
chr2  167626792  167627467  GHP78  1  1  0  0.970
chr7  71096727  71097876  TAL1_4769  1  1  0  0.970
chr8  124833200  124833376  TAL1_842  1  1  0  0.957

chrX  146984294  146984495  GHP165  1  1  1  0.880
chr2  35191661  35192274  Zfpm1R8  1  1  0  0.753
chr15  66657520  66658263  Gata2R3  1  0  1  0.752
chr2  27199488  27199723  GHP17  1  0  0  0.730
chr7  71146719  71147318  TAL1_2794  1  1  0  0.677
chr7  110976236  110976785  Zfpm1R4  1  0  1  0.613
chr8  124808263  124808907  TAL1_1863  1  1  -1  0.561
chr7  106623318  106623984  GHP8  1  0  -1  0.523
chr7  130114466  130115369  GHP310  1  0  0  0.523
chr18  32701721  32702830  GHP118  1  0  1  0.491
chr6  88140542  88141081  TAL1_987  1  1  0  0.482
chr8  124854446  124854615  TAL1_1742  1  1  0  0.476
chr7  118322848  118323397  GHP301  1  0  0  0.443
chr8  83017196  83018542  GHP297  1  0  0  0.418
chr19  40887714  40888455  TAL1_201  1  1  0  0.394
chr7  6126508  6127068  TAL1_250  1  1  1  0.378
chr7  91871088  91871876  GHP300  1  1  0  0.345
chr7  120228608  120229207  GHP90  1  0  0  0.335
chr7  86503239  86503938  GHP180  1  1  0  0.320
chr9  123864720  123865709  GHP101  1  1  0  0.315
chr8  124806654  124806897  GHP316  1  1  0  0.280
chr5  148253420  148253996  TAL1_4816  1  1  0  0.260
chr7  87330823  87331372  TAL1_2471  1  1  1  0.224


Supplemental Table 2.14 Closest genes to TAL1-bound in vivo enhancers and the expression levels of some genes observed in G1E cells and in uninduced, 24-hour-induced, and 30-hour-induced G1E-ER4 cells.

Closest gene names in bold indicate the expressed genes in the cell system. The pooled expression values of each expressed gene in each cell type are listed as log2 FPKM (fragments per kilobase per million sequenced reads).

TAL1 peak ID  VISTA ID  Closest gene (mouse)  G1E  0h_ER4  24h_ER4  30h_ER4
1020  hs1385  Tssc1(intragenic)  4.37  4.82  5.05  5.86
1105  mm78  1700085C21Rik-Dpf3  2.42  2.35  3.21  4.05
1123  hs1466  Gm6772-6430527G18Rik  3.37  3.75  3.35  3.08
1130  mm296  Nrxn3(intragenic)  -  -  -  -
1216  mm180  Fam50b-Prpf4b  -  -  -  -
1496  hs796  Gpr180-Sox21  2.69  3.11  2.84  1.91
1578  mm104  Sla-Wisp1  -  -  -  -
1882  mm156  Agpat4(intragenic)  2.72  2.59  3.38  3.92
250  hs1862  Cacna1e-Ier5  2.83  3.43  5.24  4.19
2105  mm291  Gypc(intragenic)  5.20  4.76  8.53  8.06
2184  mm18  Ctif(intragenic)
2184  mm257  Ctif(intragenic)
2302  hs1866  Hhex-Exoc6  7.08  7.67  8.68  7.24
2456  mm194  Ppapdc3-Prrc2b  -  -  -  -
2495  hs1802  Gm13476-1700019E08Rik  -  -  -  -
2564  hs1858  4930547E08Rik-Lmo2  5.69  6.85  7.98  7.61
2677  hs2050  Bcl2l1(intragenic)  5.36  5.39  9.21  8.61
2750  hs1860  Fam65c-Pard6b  -  -  -  -
2756  mm92  Rbm38-Ctcfl  -  -  -  -
3020  mm311  2410004B18Rik-Syde2  -  -  -  -
3188  hs1857  Klf17-Slc6a9  -  -  -  -
3253  mm80  Epb4.1-Oprd1  -  -  -  -
3269  hs569  Arid1a(intragenic)  6.15  5.99  7.68  6.96
3422  mm253  Slc2a9-Wdr1  -  -  -  -
3467  hs840  Epha5-Cenpc1  -  -  -  -
3553  mm144  Gm10390(intragenic)  -  -  -  -
3553  hs1673  Gm10390(intragenic)  -  -  -  -
3759  hs1677  C530044C16Rik-Mir148a  -  -  -  -
3897  mm94  Phc1-Rimklb  -  -  -  -
3898  mm94  Phc1-Rimklb  -  -  -  -
4207  hs1859  Ric3-Lmo1  -  -  -  -
4435  mm21  Nfix(intragenic)  -  -  -  -
4518  mm196  Fam92b-Gse1  5.42  5.76  7.46  7.27
4521  mm99  Fam92b-Gse1  5.42  5.76  7.46  7.27
4555  hs1854  Cbfa2t3-Acsf3  6.63  6.48  7.32  7.00
4694  mm69  1700055C04Rik-1110036E04Rik  -  -  -  -
4737  mm190  Lipc(intragenic)  -  -  -  -
594  hs2059  Ddc(intragenic)  -  -  -  -
604  hs690  Ehbp1(intragenic)  -  -  -  -
626  mm101  Nprl3(intragenic)  -  -  -  -
712  mm85  Gm12295-Map2k4  -  -  -  -
752  mm146  Ssh2(intragenic)  -  -  -  -
752  hs1675  Ssh2(intragenic)  -  -  -  -
793  mm127  Mir21-Ptrh2  -  -  -  -
854  hs1769  Pgap3(intragenic)  -  -  -  -
855  hs1769  Pgap3(intragenic)  -  -  -  -


Supplemental Table 2.15 Linking TAL1-bound enhancers to target genes covered by Enhancer-Promoter Units

Enh. TAL1 ID  Assigned promoters (TSS)  Genes
244  153937450  1700025G04Rik
250  156946766  Ier5
541  116500870,116585000  Cct2,Frs2
594  11586215,11708965,11790404,11927974,11937423  Ikzf1,Fignl1,Ddc,Grb10,Grb10
626  32167614,32176599,32183671,32183674,32186963,32196488,32196491,32200068  Nprl3,Hba-x,Hba-a1,Hba-a2,F830116E18Rik,Hba-a1,Hba-a2,Hbq1
752  77277414,77303180,77306913  Coro6,Ankrd13b,Git1
758  77886506,77886671,77893885,77908174  Mir144,Mir451,Eral1,BC017647
793  86397660,86497324,86497484,86497582,86570994  Mir21,Tmem49,Ptrh2,Ptrh2,Cltc
854  98190959,98210051,98219697,98245124,98247945,98261804,98273797,98300302,98308147,98407345,98412410  Neurod2,Ppp1r1b,Stard3,Tcap,Pnmt,Pgap3,Erbb2,1810046J19Rik,Grb7,Ikzf3,Zpbp2
855  98190959,98210051,98219697,98245124,98247945,98261804,98273797,98300302,98308147,98407345,98412410  Neurod2,Ppp1r1b,Stard3,Tcap,Pnmt,Pgap3,Erbb2,1810046J19Rik,Grb7,Ikzf3,Zpbp2
1025  32746144  Prkar2b
1050  58261630,58298459,58331398,58331411  Slc25a21,Slc25a21,Gm5081,Mipol1
1105  84828658  Dpf3
1130  90191833  Nrxn3
1195  14615493,14705304,14705508,14722511,14722524  Hecw1,Mrpl32,Psma2,AW209491
1216  34967362  Prpf4b
1496  118636252,118666379  Sox21,Gm9376
1496  118636252,118666379  Sox21,Gm9376
1578  66644155,66663391  Sla,Sla
1578  66644155,66663391  Sla,Sla
1796  49855766  Cd47
1882  12312149,12511526  Agpat4,Map3k4

2105  32719688  Gypc
2105  32719688  Gypc
2302  37509330,37624907  Hhex,Exoc6
2302  37509330,37624907  Hhex,Exoc6
2456  31951170  Ppapdc3
2564  103798151,103809713,103810443  Lmo2,Lmo2,Lmo2
2747  167457505,167487044,167514414,167516707,167757826,167836093,167906503  Ube2v1,Tmem189,Cebpb,A530013C23Rik,Ptpn1,Fam65c,Pard6b
2750  167457505,167487044,167514414,167516707,167757826,167836093,167906503  Ube2v1,Tmem189,Cebpb,A530013C23Rik,Ptpn1,Fam65c,Pard6b
2750  167457505,167487044,167514414,167516707,167757826,167836093,167906503  Ube2v1,Tmem189,Cebpb,A530013C23Rik,Ptpn1,Fam65c,Pard6b
2756  172765794,172805342,172825637,172847402,172945026,172978573,173044423,173102034  Bmp7,Spo11,Rae1,Rbm38,Ctcfl,Pck1,Zbp1,Pmepa1
3020  145587341,145600995  Bcl10,2410004B18Rik
3022  146041908,146067650  Gm10636,Ssx2ip
3158  106584074  Ssbp3
3188  117507862  Slc6a9
3253  131604993,131628012,131631228  Epb4.1,Epb4.1,Epb4.1
3278  134108081,134260205  Sepn1,Man1c1
3422  38871545,38874624,38893391,38952834  Slc2a9,Slc2a9,Slc2a9,Wdr1
3461  77269623  C530008M17Rik
3467  83784220,84846407  Tecrl,Epha5
3467  83784220,84846407  Tecrl,Epha5
3553  120566521,120881894,120922772,120926555,120953632  Rbm19,Lhx5,Sdsl,Sds,Plbd2
3800  72253241,72295169,72297310,72305476,72312375,72330462,72340661,72364326  Sftpb,Usp39,0610030E20Rik,Tmem150a,Rnf181,Vamp5,Vamp8,Ggcx
3897  122287033  Phc1
3898  122287033  Phc1
3960  16872144,16894931,16973134  Ccdc9,Bbc,Sae1

4157  91754452,91827861  Fah,Zfand6
4199  109902441,109907985,109923039,109951993,110005785,110025635,110036665,110048882,110058351,110076143,110096980,110104134,110113615,110122504,110133676,110189994,110199813,110234033,110246256,110271296,110304259,110322149,110335116,110360375,110368233,110401019,110403414,110468926,110477001,110486569,110495440,110507727,110532514,110565209,110591664,110599852,110618554,110641390,110655458,110666896,110698881,110708992,110713791,110732537,110745831,110752102,110760865,110788652,110809832,110819543,110865109,110880441,110889836,110905864,110920108,110926857,110940344,110962417,110962437,110989035,110991676,111001721,111030755,111042985,111054855,111063575,111085895,111095081,111127689,111151772,111161214,111170830,111188311,111198866,111208114,111217543,111233592,111254794  Olfr560,Olfr78,Olfr561,Olfr564,Olfr566,Olfr568,Olfr569,Olfr570,Olfr571,Olfr572,Olfr574,Olfr575,Olfr576,Olfr577,Olfr578,Olfr582,Olfr583,Olfr584,Olfr585,Olfr586,Olfr589,Olfr591,Olfr592,Olfr593,Olfr594,Dub2a,Usp-ps,Olfr597,Olfr598,Olfr599,Olfr600,Olfr601,Olfr604,Usp17l5,Olfr605,Olfr606,Olfr608,Olfr609,Olfr610,Olfr611,Olfr613,Olfr615,Olfr616,Olfr617,Olfr618,Olfr619,Olfr620,Olfr622,Olfr623,Olfr624,Olfr243,Olfr628,Olfr629,Olfr630,Olfr69,Olfr68,Olfr67,Hbb-b2,Hbb-b1,Gm5736,Hbb-bh1,Hbb-y,Olfr66,Olfr64,Olfr65,Olfr631,Olfr632,Olfr633,Olfr635,Olfr638,Olfr639,Olfr640,Olfr641,Olfr642,Olfr643,Olfr644,Olfr645,Olfr646
4207  116313822  Lmo1
4249  133225925  Gsg1l
4423  83017943  Gypa
4435  87298268,87324239,87346722,87356164,87356180,87364540,87370830,87380884,87396009,87396156  Nfix,Nfix,Dand5,Dand5,Gadd45gip1,Rad23a,Calr,Farsa,Syce2,Syce2
4555  124975222,124984592,124999964,125000042,125075229,125091914,125100807,125135241,125135525,125146635,125202075,125223009,125299404,125365477,125372273,125565897,125589407  9330133O14Rik,Snai3,Rnf166,Ctu2,Fam38a,Cdt1,Aprt,Galns,Trappc2l,Pabpn1l,Cbfa2t3,Cbfa2t3,Acsf3,Gm16378,Cdh15,Ankrd11,Spg7
4652  45207791,45211220,45214159,45238375,45636721  Fxyd2,Fxyd2,Fxyd2,Dscaml1,Cep164
4694  63869866,63964953  Smad6,Lctl
4737  70782615  Lipc
4851  123715594,123760782,123761020,123771246,123836859,123883525,123893241,123936702,124016921,124036281  Cxcr6,Fyco1,Fyco1,Xcr1,Cdkn2d,Ccr1,Ccr1l1,Ccr3,Ccr2,Ccr5
4858  7367120,7400968,7418956,7455431,7461369,7476354,7476523,7496947,7505733,7525015,7544941,7595618  Gripap1,Kcnd1,Otud5,Pim2,Slc35a2,Pqbp1,Timm17b,Pcsk1n,Eras,Hdac6,Gata1,Glod5

Appendix B

Principles Of Regulatory Information Conservation Between Mouse And Human

Yong Cheng*1, Zhihai Ma*1, Bong-Hyun Kim2, Weisheng Wu3, Philip Cayting1, Alan P. Boyle1, Vasavi Sundaram4, Xiaoyun Xing4, Nergiz Dogan3, Jingjing Li1, Ghia Euskirchen1, Shin Lin1, 5, Yiing Lin1, 6, Axel Visel5, Trupti Kawli1, Xinqiong Yang1, Dorrelyn Patacsil1, Cheryl A. Keller3, Belinda Giardine3, The mouse ENCODE Consortium, Anshul Kundaje1, Ting Wang4, Len A. Pennacchio7, Zhiping Weng2, Ross C. Hardison3#, Michael P. Snyder1#

* These authors contributed equally to the work

# To whom correspondence should be addressed:

Michael P. Snyder ([email protected]) Ross C. Hardison ([email protected])

Affiliations:

1. Department of Genetics, Stanford University, Stanford, California 94305, USA
2. Program in Bioinformatics and Integrative Biology, Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, MA 01605.
3. Center for Comparative Genomics and Bioinformatics, Huck Institutes of the Life Sciences, Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802
4. Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri, USA.
5. Division of Cardiovascular Medicine, Stanford University, Stanford, California 94304.
6. Department of Surgery, Washington University School of Medicine, St. Louis, Missouri 63110.
7. Lawrence Berkeley National Laboratory, Genomics Division, Berkeley, California, USA


Summary

To broaden our understanding of the evolution of gene regulation mechanisms in mammals, we generated occupancy profiles for 32 orthologous transcription factors (TFs) in human and mouse erythroid progenitor and lymphoblast cell lines. By combining the genome-wide TF occupancy repertoires, associated epigenetic signals, and TF co-association patterns, we deduced several evolutionary principles of gene regulatory features operating since the human and mouse lineages diverged ~75 million years ago. The genomic distribution profiles, primary binding motifs, chromatin states, and DNA methylation preferences are well conserved for TF-occupied sequences (TF OSs) in both species. However, the extent to which orthologous DNA segments are bound by orthologous TFs in human and mouse varies both among TFs and with genomic location: binding at promoters is more highly conserved than binding at distal elements. Divergence of TF occupancy is associated with changes of chromatin state and DNA methylation level. Importantly, TF OSs whose occupancy is conserved between human and mouse tend to be pleiotropic; they function in multiple tissues and also co-associate with multiple TFs. Single nucleotide variants (SNVs) at sites with high regulatory potential or associated with phenotypes are significantly enriched in occupancy-conserved TF OSs.

Determining the similarities and differences between human and mouse regulatory networks will not only improve our understanding of the evolution of regulatory mechanisms, but also has implications for biomedical research performed on mouse models. Recent genome-wide binding studies of eight TFs in multiple species uncovered several regulatory networks that have been highly rewired since mouse and human divergence1-4. These results contrast sharply with other data showing that conservation of genomic DNA sequences can be a useful guide to discovery of regulatory regions5 and that the regulatory landscape can be highly conserved among more distant species6. Considering the large numbers of known TFs and their functional diversity, comprehensive studies on a broader range of TFs are needed to resolve these apparent discrepancies. Furthermore, our knowledge of the functional consequences of either divergence or conservation of TF occupancy remains limited. Therefore, we generated and analyzed a large dataset of genome-wide binding profiles for 32 TFs in mouse and human. We focused on TF occupancy in cell line models for erythroid progenitors (K562 and MEL) and lymphoblasts (GM12878 and CH12) in human and mouse (Extended Data Fig. 1a), also showing that the results are similar to those obtained in embryonic stem cells (Supplementary Fig. 3). Chromatin immunoprecipitation with massively parallel sequencing (ChIP-seq) assays were conducted according to ENCODE standards7 (see Methods in Supplementary Information).

These genome-wide binding data for a large and diverse set of TFs revealed both conserved and non-conserved features of TF occupancy between mouse and human. First, the location of TF binding relative to gene features is conserved between orthologous TF OSs. Although most TFs can reside at both promoters and distal sites, each shows a pronounced preference (Fig. 1a and Extended Data Fig. 2a and 2b). The preference is strongly conserved between human and mouse (R=0.8; Extended Data Fig. 2c). The one exception is ETS1, which preferentially binds proximal to promoters in human but not mouse (Fig. 1a). ETS1 is responsible for the mouse-specific expression of T-cell thymus marker Thy-1 (ref. 8), and we hypothesize that this dramatic difference in its binding location may contribute to immune system differences between human and mouse9. Second, while the primary motifs of most sequence-specific TFs are conserved between human and mouse, the secondary motifs (e.g. motifs of associated factors; see Supplement) tend to be lineage-specific (Fig. 1b, Extended Data Fig. 2d).

The preferred chromatin states, defined by histone modifications, for binding orthologous TFs are also conserved between mouse and human. Using data on five histone modifications, the mouse and human genomes were segmented into eight chromatin states (Fig. 1c, Extended Data Fig. 3b). Most TF OSs are located in states characteristic of promoters and enhancers (states 1-4). In contrast, approximately 50% of OSs for the CTCF/cohesin complex (CTCF, RAD21 and SMC3)11,12 are located in states 5 and 8, which mark quiescent regions with very low signal for all the histone modifications. MAFK also shows a preference for quiescent regions. Interestingly, both the CTCF/cohesin complex and MAFK13 can mediate long-range interactions in the genome. The state preference is conserved between mouse and human (Fig. 1c; R=0.9), suggesting that the functions of the occupied segments are similar in the two species. Indeed, the proportion of predicted enhancers14 within TF OSs is also conserved (R=0.7) (Extended Data Fig. 4).
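The state-preference comparison above can be illustrated with a small sketch. It assumes, as one plausible reading, that the reported R is a Pearson correlation between the per-state fractions of occupied segments in the two species; the text does not say exactly how R was computed. The function names and the toy state labels are hypothetical.

```python
from collections import Counter
from math import sqrt

N_STATES = 8  # the text segments each genome into eight chromatin states


def state_fractions(states):
    """Return the fraction of occupied segments falling in each state 1..8."""
    counts = Counter(states)
    total = len(states)
    return [counts.get(s, 0) / total for s in range(1, N_STATES + 1)]


def pearson_r(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up chromatin-state labels for the occupied segments of one TF.
human_states = [1, 1, 2, 3, 3, 4, 5, 8]
mouse_states = [1, 1, 1, 2, 3, 4, 4, 8]

human_frac = state_fractions(human_states)
mouse_frac = state_fractions(mouse_states)
r = pearson_r(human_frac, mouse_frac)
print(f"state-preference correlation: {r:.2f}")
```

In a real analysis the state labels would come from intersecting ChIP-seq peaks with a genome segmentation, and the correlation might be taken across TFs as well as across states.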

We also examined DNA methylation profiles in TF OSs by using both methylated DNA immunoprecipitation (MeDIP) and DNA digestion with methyl-sensitive restriction enzymes (MRE) followed by sequencing15. The TF OSs are highly enriched for MRE-seq signals but depleted of MeDIP-seq signals, showing that TF OSs are generally hypomethylated in both species (Fig. 1d, Extended Data Fig. 3c).

The TF binding regions are enriched for conservation of DNA sequences, showing a strong signal for evolutionary constraint within +/- 50 bp of ChIP-seq peak summits (Fig. 2a). This result indicates that purifying selection has acted on DNA sequences in many of the TF OSs, but it does not mean that all TF OSs are uniformly under constraint. Indeed, many TF OSs do not align between mouse and human16 because either they are lineage-specific sequences such as transposable elements17 or they have diverged to an extent that they no longer align.

We then focused on the subset of TF OSs whose sequences aligned between mouse and human to determine whether orthologous DNA sequences are also occupied by orthologous TFs (details in Supplementary methods). Surprisingly, the fraction of TF OSs at which occupancy was conserved varied markedly both among TFs and with the genomic locations (Fig. 2b). Conservation of occupancy is consistently higher in the promoter regions and lower in distal regions for almost all TFs, suggesting that the promoters may be under stronger selection compared with distal enhancers. Conserved promoter occupancy is observed both for factors that bind near promoters (NRF1, MAZ) as well as factors with a minority of binding sites in promoter regions (e.g. MEF2A and TAL1). A significant exception is the CTCF/cohesin complex, which not only shows high levels of occupancy conservation as described previously18, but the conservation remains high at proximal, middle and distal regions relative to the TSS (Fig. 2b). These patterns of variation in conservation in occupancy are robust. One potential confounding factor is the tendency for promoter sequences to be more conserved than other regulatory regions, but adjusting the occupancy conservation by the sequence conservation difference revealed similar trends (Extended Data Fig. 5a). Likewise, removal of the few TFs for which markedly different numbers of peaks were called between mouse and human did not change the patterns of conservation of occupancy (Extended Data Fig. 5b, Supplementary Information).
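A minimal sketch of the occupancy-conservation measure described above, under stated assumptions: peaks from one species have already been mapped to the other genome by some alignment-based step, and each carries a distance to the nearest TSS. The distance cutoffs for "proximal", "middle" and "distal" are hypothetical placeholders, since the text does not give them here, and all function names are illustrative.

```python
from collections import defaultdict


def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def distance_class(dist_to_tss, proximal=1_000, middle=10_000):
    """Bin an absolute distance to the nearest TSS; cutoffs are hypothetical."""
    if dist_to_tss <= proximal:
        return "proximal"
    if dist_to_tss <= middle:
        return "middle"
    return "distal"


def conservation_by_class(mapped_peaks, other_species_peaks):
    """Fraction of alignable peaks whose ortholog is also occupied, per class.

    mapped_peaks: list of (ortholog_interval, dist_to_tss) pairs for peaks of
    one species already expressed in the other species' coordinates.
    """
    total = defaultdict(int)
    conserved = defaultdict(int)
    for interval, dist in mapped_peaks:
        cls = distance_class(dist)
        total[cls] += 1
        if any(overlaps(interval, p) for p in other_species_peaks):
            conserved[cls] += 1
    return {cls: conserved[cls] / total[cls] for cls in total}


# Toy data: human peaks in mouse coordinates, plus mouse peaks for the same TF.
human_in_mouse = [
    (("chr7", 100, 300), 200),            # proximal, overlaps a mouse peak
    (("chr7", 5_000, 5_200), 400),        # proximal, no mouse peak
    (("chr8", 900, 1_100), 50_000),       # distal, overlaps a mouse peak
    (("chr8", 9_000, 9_300), 80_000),     # distal, no mouse peak
    (("chr8", 20_000, 20_300), 70_000),   # distal, no mouse peak
]
mouse_peaks = [("chr7", 250, 450), ("chr8", 1_000, 1_200)]

fractions = conservation_by_class(human_in_mouse, mouse_peaks)
print(fractions)
```

The linear scan over peaks keeps the sketch short; genome-scale data would normally use sorted intervals or an interval tree.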

Next we investigated how epigenetic factors influence TF binding at orthologous sites between mouse and human. As expected, the distribution of chromatin states is highly similar for occupancy-conserved TF OSs (Fig. 2c and Extended Data Fig. 6). For TF OSs that can be aligned between the two species but lack occupancy conservation, a smaller proportion of these orthologous sequences were in enhancer-associated states (states 3 and 4) and a larger proportion were in either repressed (state 7) or quiescent (states 5 and 8) chromatin. Thus species-specific loss of TF occupancy at many sites is accompanied by a shift to repressive or quiescent chromatin. In contrast, the promoter states were maintained in the second species even with the loss of TF binding. This result suggests that other TFs may help maintain conservation of a promoter state in these regions. We also searched for changes in the level of DNA methylation between TF OSs and their orthologous sequences. DNA methylation levels remained low in both species for occupancy-conserved TF OSs (Fig. 2d and Extended Data Fig. 7), but the DNA methylation levels were significantly increased in the unbound, orthologous sequences. Thus, species-specific loss of TF occupancy is also associated with species-specific increases in DNA methylation.

We hypothesized that TF OSs with regulatory functions in multiple tissues would be under increased selective pressure, and thus more likely to be conserved in occupancy. In order to test this hypothesis, we first examined DNase I hypersensitive sites (DHSs) across 55 tissues and cell lines16 to measure the chromatin accessibility of each TF OS among different tissues. Since DHSs are a proxy for regulatory element activity19, TF OS regions accessible in multiple tissues are more likely to function in those tissues. Chromatin accessibility of TF OSs presents wide variation, ranging from tissue-specific to ubiquitous patterns (Fig. 3a). Strikingly, the TF OSs with more pervasive chromatin accessibility across different tissues show the highest extent of occupancy conservation between mouse and human. The association between tissue usage and occupancy conservation is general; it was observed for most of the TFs examined (Extended Data Fig. 8b and 8c). This association is also robust to several potential confounding factors. CTCF/cohesin complexes, which are abundant and conserved across different tissue types and species18,20, might be expected to bias the result, but instead we obtained comparable results after removing all the genomic regions occupied by CTCF, RAD21 or SMC3 (Extended Data Fig. 8a). The conservation of promoter regions among multiple tissues and species14 also might be expected to bias our analysis, but, after removal of occupancy-conserved TF OSs that lie within 2kb of TSSs, we still found that the association between tissue usage and TF occupancy conservation holds for distal TF OSs (Extended Data Fig. 8d and 8e). Furthermore, specifically examining distal TF OSs that overlapped with enhancers predicted by chromatin signals14 showed that broad tissue usage of presumptive enhancers tracks strongly with conservation of occupancy between mouse and human (Fig. 3b).
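The relationship between tissue usage and occupancy conservation can be illustrated with a short Python sketch. It assumes each TF OS carries one boolean DHS call per tissue plus a conserved/not-conserved flag; the bin edges, the helper `calls`, and the toy data are hypothetical and are not taken from the analysis reported here.

```python
from bisect import bisect_left


def tissue_usage(dhs_calls):
    """Number of tissues in which a TF OS overlaps a DHS."""
    return sum(1 for call in dhs_calls if call)


def conservation_by_usage(sites, edges):
    """Fraction of occupancy-conserved TF OSs within bins of tissue usage.

    sites: list of (dhs_calls, occupancy_conserved) pairs.
    edges: sorted inclusive upper bounds of the usage bins (hypothetical choice).
    Returns (upper_bound, n_sites, fraction_conserved) for each non-empty bin.
    """
    tallies = [[0, 0] for _ in edges]  # per bin: [conserved, total]
    for dhs_calls, conserved in sites:
        # First bin whose upper bound is >= usage; clamp to the last bin.
        idx = min(bisect_left(edges, tissue_usage(dhs_calls)), len(edges) - 1)
        tallies[idx][0] += int(conserved)
        tallies[idx][1] += 1
    return [(edge, n, c / n) for edge, (c, n) in zip(edges, tallies) if n > 0]


def calls(k, n_tissues=55):
    """Toy helper: a DHS call vector that is accessible in k of n_tissues tissues."""
    return [True] * k + [False] * (n_tissues - k)


# Toy sites, invented for illustration only.
toy_sites = [
    (calls(1), False), (calls(1), True),
    (calls(5), False), (calls(8), True),
    (calls(40), True), (calls(55), True),
]
usage_table = conservation_by_usage(toy_sites, edges=[1, 10, 55])
```

With real data, the same table could be computed separately for promoter-proximal and distal TF OSs, or after excluding CTCF/cohesin-bound regions, to mirror the robustness checks described above.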

A prediction of our hypothesis is that occupancy-conserved TF OSs will be active in multiple tissues. To experimentally test this prediction, we chose ten GATA1 OSs whose occupancy is conserved between human and mouse. Even though the OSs were chosen based on the occupancy profile of an erythroid-specific regulatory factor, all ten conserved OSs overlapped with DHS peaks and predicted enhancers in multiple tissues, such as liver and brain (Fig. 3c). When tested for in vivo enhancer activity in transgenic mouse reporter assays at embryonic day 11.5, nine of the ten showed strong, reproducible in vivo enhancer activity, but rarely were they active only in erythroid tissues such as fetal liver (Fig. 3c). We expanded our analysis to examine other mouse GATA1 OSs that overlapped with previously tested enhancers deposited in the VISTA Enhancer Browser (http://enhancer.lbl.gov)21. Six GATA1 OSs that are specific to mouse generated positive enhancer assays; only one (17%) showed expression in tissues other than blood vessels and heart. In contrast, among 12 occupancy-conserved GATA1 OSs with in vivo enhancer activity, 6 (50%) were active in non-erythroid tissues such as midbrain (Supplementary Table S5).

Since precise gene regulation requires complex interactions among different TFs, we speculated that differences in conservation of TF occupancy may be related, at least in part, to different co-association partners. By calculating the occupancy signals for all the TFs in each TF OS, we found that, in general, occupancy-conserved TF OSs tend to be bound by more TFs than lineage-specific TF OSs (p-value < 2.2e-16, two-tailed t-test; Fig. 4a), suggesting that co-association with multiple TFs increases the level of purifying selection on the occupied sequences. Furthermore, by examining each co-associated TF pair (Fig. 4b), we determined whether the co-associations were more enriched in occupancy-conserved versus species-specific binding sites (Fig. 4c, Extended Data Fig. 9). The relationships fell into three categories. In the first, co-association of TFs is not linked with occupancy conservation. For example, RAD21 is highly associated with CTCF; however, the co-association of RAD21-occupied sequences with CTCF occurs with equivalent frequency at occupancy-conserved binding sites and species-specific binding sites. In the second category, TF co-association is negatively correlated with occupancy conservation. For example, the co-association of MYC OSs with EP300, an enhancer-associated factor, is highly enriched in the mouse-specific binding sites. In the last category, TF co-association is positively correlated with occupancy conservation. For example, the co-association of MYC OSs with the co-repressor SIN3A is highly enriched in occupancy-conserved sequences, indicating that MYC-associated repressors tend to be conserved between human and mouse.

In a previous study, we assigned putative regulatory potential to genome variation by combining high-throughput experimental data sets, computational predictions, and manual annotation22. Interestingly, even though conservation was not considered during the previous classifications, we found that SNVs with high regulatory potential were highly enriched in occupancy-conserved TF OSs (p-value < 5.5e-07, Fisher's exact test; Extended Data Table 1a). Moreover, examination of the distribution of GWAS SNPs as a function of TF OS occupancy conservation revealed a significant enrichment of GWAS SNPs in occupancy-conserved TF OSs (p-value < 2.2e-16, Fisher's exact test) compared with the background distribution of all dbSNPs. When examining individual phenotypes, we found that SNPs associated with several phenotypes, such as type I diabetes, are significantly enriched in occupancy-conserved TF OSs (p-value = 0.019, Fisher's exact test). However, SNPs associated with other phenotypes, such as pulmonary function, are enriched in human-specific TF OSs (p-value = 0.027, Fisher's exact test; Extended Data Table 1b). Thus, although GWAS SNPs are generally enriched in occupancy-conserved TF OSs, this enrichment is phenotype-specific.
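For readers unfamiliar with the enrichment test used above, the following is a minimal pure-Python sketch of a two-sided Fisher's exact test on a 2x2 table. The function name and the toy counts are assumptions for illustration; they are not the counts behind the reported p-values, and this direct enumeration is only practical for small tables (genome-wide counts would call for a statistics library).

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Toy table, invented for illustration only.
# Rows: trait-associated SNPs, background SNPs.
# Columns: inside occupancy-conserved TF OSs, outside.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

A small p-value here indicates that the two SNP classes differ in how often they fall inside occupancy-conserved TF OSs; the direction of the difference is read from the table itself.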

Here we report that the conservation of TF occupancy associates with pleiotropic functions. This observation was further validated by in vivo enhancer assays in transgenic mice. To our knowledge, this is the first systematic investigation and validation of the relationship between pleiotropic TF OSs and their occupancy conservation. The pleiotropic functions of a regulatory module subject it to multiple constraints that preserve the underlying motifs and occupancy patterns. The roles in different tissues need not be carried out by the same TF. Paralogous proteins that bind to the same DNA motif (e.g. GATA5 or GATA6) could be the active proteins at the GATA1 OSs with conserved occupancy and pleiotropic functions. These and other principles of TF occupancy conservation can be elucidated in future studies.

References

1. Odom, D. T. et al. Tissue-specific transcriptional regulation has diverged significantly between human and mouse. Nat. Genet. 39, 730–732 (2007).
2. Schmidt, D. et al. Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science 328, 1036–1040 (2010).
3. Stefflova, K. et al. Cooperativity and Rapid Evolution of Cobound Transcription Factors in Closely Related Mammals. Cell 154, 530–540 (2013).
4. Kunarso, G. et al. Transposable elements have rewired the core regulatory network of human embryonic stem cells. Nat. Genet. 42, 631–634 (2010).
5. Pennacchio, L. A. & Rubin, E. M. Genomic strategies to identify mammalian regulatory sequences. Nat. Rev. Genet. 2, 100 (2001).
6. He, Q. et al. High conservation of transcription factor binding and evidence for combinatorial regulation across six Drosophila species. Nat. Genet. 43, 414–420 (2011).
7. Landt, S. G. et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 22, 1813 (2012).
8. Tokugawa, Y., Koyama, M. & Silver, J. A molecular basis for species differences in Thy-1 expression patterns. Mol. Immunol. 34, 1263 (1997).
9. Mestas, J. & Hughes, C. C. W. J. Immunol. 172, 2731 (2004).
10. Ravasi, T. et al. An Atlas of Combinatorial Transcriptional Regulation in Mouse and Man. Cell 140, 744–752 (2010).
11. Nitzsche, A. et al. RAD21 Cooperates with Pluripotency Transcription Factors in the Maintenance of Embryonic Stem Cell Identity. PLoS ONE 6, e19470 (2011).
12. Merkenschlager, M. & Odom, D. T. CTCF and Cohesin: Linking Gene Regulatory Elements with Their Targets. Cell 152, 1285–1297 (2013).
13. Sawado, T., Igarashi, K. & Groudine, M. Activation of beta-major globin gene transcription is associated with recruitment of NF-E2 to the beta-globin LCR and gene promoter. Proc. Natl. Acad. Sci. U.S.A. 98, 10226 (2001).
14. Shen, Y. et al. A map of the cis-regulatory sequences in the mouse genome. Nature 488, 116–120 (2012).
15. Xie, M. et al. DNA hypomethylation within specific transposable element families associates with tissue-specific enhancer landscape. Nat. Genet. 45, 836–841 (2013).
16. MouseENCODE Consortium. An Integrated and Comparative Encyclopedia of DNA Elements in the Mouse Genome. (submitted)
17. Sundaram, V. & Cheng, Y. Widespread contribution of transposable elements to the innovation of gene regulatory networks. Genome Res. (accepted)
18. Schmidt, D. et al. Waves of Retrotransposon Expansion Remodel Genome Organization and CTCF Binding in Multiple Mammalian Lineages. Cell 148, 335–348 (2012).
19. Gross, D. S. & Garrard, W. T. Nuclease hypersensitive sites in chromatin. Annu. Rev. Biochem. 57, 159–197 (1988).
20. Heintzman, N. D. et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat. Genet. 39, 311–318 (2007).
21. Visel, A., Minovitsky, S., Dubchak, I. & Pennacchio, L. A. Nucleic Acids Res. (2007). doi:10.1093/nar/gkl822
22. Boyle, A. P. et al. Genome Res. (2012).
23. Hardouin, S. N. & Nagy, A. Mouse models for human disease. Clin. Genet. 57, 237 (2000).
24. Schaub, M. A., Boyle, A. P., Kundaje, A., Batzoglou, S. & Snyder, M. Linking disease associations with regulatory information in the human genome. Genome Res. 22, 1759 (2012).
25. Maurano, M. T. et al. Science 337 (2012).

VITA

Nergiz Dogan

EDUCATION:

Ph.D. (2014) Biochemistry, Microbiology and Molecular Biology
The Pennsylvania State University, University Park, PA, USA

M.S. (2008) Biotechnology
Izmir Institute of Technology, Izmir, TURKEY

B.S. (2003) Biology, Majored in Fundamental and Industrial Microbiology
Ege University, Izmir, TURKEY

PUBLICATIONS:

1. Dogan N, Wu W, Morrissey CS, Chen K-B, Stonestrom A, Long M, Keller CA, Cheng Y, Jain D, Visel A, Pennacchio LA, Weiss MJ, Blobel GA, Hardison RC. Epigenetic and genetic features that lead to discovery of enhancer function (submitted).

2. Cheng Y, Ma Z, Kim B-H, Wu W, Cayting P, Boyle AP, Sundaram V, Xing X, Dogan N, Li J, Euskirchen G, Lin S, Lin Y, Visel A, Kawli T, Yang X, Patacsil D, Keller CA, Giardine B, The mouse ENCODE Consortium, Kundaje A, Wang T, Pennacchio LA, Weng Z, Hardison RC, Snyder MP. Principles of regulatory information conservation between mouse and human. Nature: revised and resubmitted.

3. The mouse ENCODE Consortium et al. An integrated and comparative encyclopedia of DNA elements in the mouse genome (submitted).

4. Dogan N, Tari C. 2008. Characterization of three-phase partitioned exo-polygalacturonase from Aspergillus sojae with unique properties. Biochemical Engineering Journal 39: 43-50.

5. Tari C, Dogan N, Gogus N. 2008. Biochemical and thermal characterization of crude exo-polygalacturonase produced by Aspergillus sojae. Food Chemistry 4: 824-829.