Identifying tissue specific distal regulatory sequences in the mouse genome

by

Julie Chih-yu Chen

A thesis submitted in conformity with the requirements for the degree of Master Cell and Systems Biology University of Toronto

© Copyright by Julie Chih-yu Chen 2011

Abstract

Epigenetic modifications, transcription factor (TF) availability and chromatin conformation influence how a genome is interpreted by the transcriptional machinery responsible for gene expression. Enhancers buried in non-coding regions are associated with significant differences in histone marks between cell types. In contrast, gene promoters show more uniform modifications across cell types. In this report, enhancer identification is first carried out using an enhancer-associated feature in mouse erythroid cells. Taking advantage of public domain ChIP-Seq data sets in mouse embryonic stem cells, an integrative model is then used to assess features in enhancer prediction, and subsequently locate enhancers. Significant associations with multiple TF bound loci, higher expression in the closest genes, and active enhancer marks support functionality and tissue-specificity of these enhancers. Motif enrichment analysis further determines known and novel TFs regulating the target cell type.

Furthermore, the features identified can facilitate more accurate enhancer prediction in other cell types.


“You cannot open a book without learning something.”

Confucius


Acknowledgments

My time at U of T has been a life changing experience. As a result of the environment and my commitment during this time, my desire to pursue research has never been more evident. Much more than I set out to achieve, I have expanded my horizons and developed diverse sets of skills both in research and in personal growth.

I would like to thank my supervisor, Dr. Jennifer Mitchell, for the introduction to the research, the opportunity to learn wet lab techniques, the experience, and the frequent discussions that shaped the biological framework of the study. I am particularly grateful for her guidance on biological interpretations and her help in editing my NSERC application. I would also like to thank Dr. Quaid Morris for computational guidance in the mouse embryonic stem cell project, and for the discussions that enriched my research experience.

I also thank both of my committee members, Dr. Sue Varmuza and Dr. Nicholas Provart, for their input on my study and all the help with the completion of my graduate study. I am especially thankful to Dr. Nicholas Provart for the thorough editing of my thesis.

I am very grateful to the government funding agencies, OGS and NSERC, for the awards that supported my research, and to the conference organizations, BioC2010 and CDB Symposium 2011, for the travel fellowships, which allowed me not only to present my posters, but also to interact with other researchers and broaden my scope of knowledge.

I would like to thank my lab mates, Anandi Bhattacharya and Mike Schwartz, and Jessica Yang and Dr. Yunchen Gong from the Guttman lab for the discussions on biological and bioinformatics research, and for being there to brighten up the days. I also thank Dr. Paul Boutros for his final and very helpful input, and Dr. Ieuan Clay for the opportunity to participate in one of his projects. Furthermore, a great deal of the bioinformatics and statistical skills applied in the thesis were acquired during periods when I was supervised by Dr. Chao A. Hsiung, Dr. I-Shou Chang and Dr. Von-Wun Soo. I am deeply grateful for their influence, and for the opportunities to learn and develop these skills.

Lastly, I would like to thank my parents and my best friend, Hui-yi, for the constant support and encouragement from the other end of the world throughout my graduate study. I also thank my family, Andy, Alice, Cat and n, for their warm and loving support. I could not have reached this point without them.

Finally, on a non-scientific note, I am grateful for the five positive messages in the fortune cookies, which had strong relevance to the stages of my life at the moments I received them.


Table of Contents

Acknowledgments

Table of Contents

Declaration

List of Abbreviations

List of Tables

List of Figures

List of Appendices

Chapter 1 Introduction

1.1. One genome, multiple epigenomes and transcriptomes

1.2. Distal regulatory elements: Enhancers

1.2.1. Significance not to be overlooked

1.2.2. Epigenetic states at enhancers in relation to tissue specificity

1.2.3. Regulation of gene expression through chromatin looping

1.3. Features predictive of enhancers

1.3.1. Interaction of proteins and enhancers

1.3.2. Histone modification states at enhancers

1.3.3. Active and poised enhancers in embryonic stem cells

1.4. Computational approaches relevant to enhancer identification

1.4.1. Position specific matrices and comparative genomics

1.4.2. Integrative modeling of ChIP-Seq data

1.5. Thesis overview

Chapter 2 Methods

2.1. Methods for enhancer identification in mouse erythroid cell

2.1.1. Mapping of datasets to the mouse genome

2.1.2. Enrichment of nucRNA-Seq and ChIP-Seq datasets

2.1.3. Conservation, motif identification, and function annotation analyses

2.1.4. Native ChIP-qPCR of H3K4me1

2.2. Methods for enhancer identification in mouse ES cells

2.2.1. Public datasets

2.2.2. Data pre-processing

2.2.3. Training data sets

2.2.4. Feature combination assessment using Naive Bayes

2.2.5. Feature extraction with lasso regularized multinomial logistic regression

2.2.6. Absolute gene expression in mouse ES cell

2.2.7. functional enrichment analysis

2.2.8. Association with multiple transcription factor bound loci

2.2.9. Supervised motif analysis

2.2.10. Comparison to other high throughput sequencing datasets

Chapter 3 Enhancer Identification in Erythroid Cells: A Biologically Directed Approach

3.1. Introduction

3.2. Results

3.2.1. A closer look at the Hbb locus control region

3.2.2. Identification of putative enhancers

3.2.3. Overlap with transcription factors and conserved regions of the genome

3.2.4. Multiple transcription factor peaks in proximity to putative enhancers

3.2.5. H3K4me1 ChIP-qPCR results support putative enhancers

3.2.6. Conservation analysis of RNAPII+/nucRNA- regions

3.2.7. Supervised motif enrichment analysis

3.2.8. Gene Ontology analysis of putative erythroid enhancers

3.3. Discussion

Chapter 4 Modeling Approaches to Enhancer Identification in Mouse Embryonic Stem Cells

4.1. Introduction

4.2. Results

4.2.1. Feature extraction improves enhancer prediction

4.2.2. Systematic ranking of enhancer signatures

4.2.3. Classified enhancer and promoter-like candidates

4.2.4. Enhancer and promoter-like candidates coordinately regulate gene expression

4.2.5. Top enhancer regions locate near genes encoding transcription factors

4.2.6. Enhancer candidates are associated with multiple transcription factors

4.2.7. Mouse ES cell enhancer candidates

4.2.7.1. Previously validated enhancers

4.2.7.2. Novel enhancer clusters overlapping multiple binding sites

4.2.7.3. Novel enhancers with fewer transcription factor binding peaks

4.2.8. Transcription factor motif enrichment at ES cell enhancer candidates

4.2.8.1. ES cell related transcription factors

4.2.8.2. Differentiation and developmentally related transcription factors

4.2.9. Identified enhancers are mainly active and cell type-specific

4.2.9.1. Enhancer activities

4.2.9.2. Tissue specificity of enhancers

4.2.10. Comparison to other enhancer associated factors

4.2.10.1. H3K27ac and CHD7

4.2.10.2. 5-hydroxymethylcytosine versus active enhancers

4.3. Discussion

4.3.1. Future work

Chapter 5 General Discussion

5.1. Discussion

5.2. Summary

References

Appendices


Declaration

This dissertation presents solely the results of my own work under the guidance and support acknowledged in the next page. Furthermore, it is my every intention to provide further evidence for the Confucius quote.

Signed,

Julie Chih-yu Chen


List of Abbreviations

3C           Chromosome Conformation Capture
5hmC         5-Hydroxymethylcytosine
5mC          5-Methylcytosine
AUC          Area under curve
ChIP-qPCR    Chromatin immunoprecipitation quantitative polymerase chain reaction
ChIP-Seq     Chromatin immunoprecipitation coupled with high throughput sequencing
Clover       Cis-eLement OVER representation
DAVID        Database for Annotation, Visualization and Integrated Discovery
Enh          Putative enhancer region
Enh&PrL      Set of genes with at least one Enh and PrL candidates closest to them
ES cell      Embryonic stem cell
GEO          Gene Expression Omnibus
GO           Gene Ontology
miRNA        Micro RNA
MTL          Multiple transcription factor-bound loci
             N-Myc and c-Myc ChIP-Seq co-bound regions
nucRNA       Nuclear RNA
OSN          OCT4, SOX2 and NANOG ChIP-Seq co-bound regions
PhastCons    PHylogenetic Analysis with Space/Time models Conservation
PrL          Putative promoter-like regions
PSSM         Position-Specific Scoring Matrix
RAG          Rabbit anti goat
RNAPII       RNA polymerase II
RNAPII-ser5  Serine 5 phosphorylated RNA polymerase II
RNA-Seq      Ribonucleic acid identified through high throughput sequencing
RPKM         Number of Reads Per Kilobase of exon region per Million mapped reads
SISSRs       Site Identification from Short Sequence Reads algorithm
TF           Transcription factor
TSS          Transcription start site


List of Tables

TABLE 1-1. POTENTIAL DISTAL REGULATORY FUNCTIONAL ELEMENTS.

TABLE 2-1. A LIST OF HIGH THROUGHPUT SEQUENCING DATA SETS USED FOR ENHANCER IDENTIFICATION IN MOUSE ERYTHROID CELLS.

TABLE 2-2. DATA SETS AND GENOMIC FEATURES USED FOR ENHANCER IDENTIFICATION IN MOUSE ES CELLS.

TABLE 3-1. LIST OF TOP 50 DISTAL ENHANCER CANDIDATES IN ERYTHROID CELLS IDENTIFIED FROM RNAPII+/NUCRNA-.

TABLE 3-2. CHI-SQUARE ASSOCIATION OF ERYTHROID ENHANCERS WITH TFS AND ENHANCER CANDIDATES.

TABLE 3-3. SIGNIFICANT ASSOCIATION WITH CONSERVATION.

TABLE 3-4. MOTIFS SIGNIFICANTLY ENRICHED IN SUBSETS OF CONSERVED ERYTHROID CELL ENHANCERS.

TABLE 3-5. GENE ONTOLOGY ENRICHMENT OF THE NEAREST TRANSCRIBED GENES OF PUTATIVE ENHANCERS IN ERYTHROID CELLS.

TABLE 4-1. MOTIF ENRICHMENT IN MOUSE ES CELL PUTATIVE ENHANCERS.


List of Figures

FIGURE 3-1. HBB LCR REGION OVERLAPPING MULTIPLE TRANSCRIPTION FACTORS.

FIGURE 3-2. DISTRIBUTION OF RNAPII+/NUCRNA- THROUGHOUT THE GENOME.

FIGURE 3-3. VENN DIAGRAM OF RNAPII-SER5, OVERLAPPED BINDING OF ALL TFS, AND NUCRNA PRESENCE.

FIGURE 3-4. A VALIDATED ENHANCER REGION 75KB UPSTREAM OF LMO2, ONE OF THE KEY REGULATORS IN ERYTHROID CELLS.

FIGURE 3-5. GRAPHS OF ENHANCER CANDIDATES IDENTIFIED AROUND PIM1 AND KLF13.

FIGURE 3-6. CHIP-QPCR RESULTS OF H3K4ME1 SUPPORT ERYTHROID ENHANCERS IDENTIFIED.

FIGURE 3-7. DISTRIBUTION OF PHASTCONS SCORES.

FIGURE 3-8. ILLUSTRATION OF MOTIFS SIGNIFICANTLY ENRICHED IN SUBSETS OF CONSERVED ERYTHROID CELL ENHANCERS.

FIGURE 4-1. ASSESSING FEATURE COMBINATIONS AS ENHANCER SIGNATURES USING NAIVE BAYES.

FIGURE 4-2. FEATURE RANKING DETERMINED USING MULTINOMIAL LOGISTIC REGRESSION WITH LASSO REGULARIZATION.

FIGURE 4-3. HEATMAP OF FEATURES USED IN LASSO REGRESSION FOR TOP 50 ENHANCER AND PROMOTER-LIKE CANDIDATES.

FIGURE 4-4. GENOMIC DISTRIBUTION OF CANDIDATES AND THEIR ASSOCIATION WITH GENE EXPRESSION.

FIGURE 4-5. GENE ONTOLOGY ANALYSIS OF THE ENHANCER AND PROMOTER-LIKE CANDIDATE SETS.

FIGURE 4-6. BARPLOTS OF PERCENT OVERLAP WITH MULTIPLE TRANSCRIPTION FACTOR BOUND LOCI.

FIGURE 4-7. FOUR KNOWN MOUSE ES CELLS ENHANCERS THAT INTERACT WITH NEARBY PROMOTERS THROUGH LOOPING MECHANISMS.

FIGURE 4-8. NOVEL PUTATIVE ENHANCER REGIONS AROUND SOX2 GENE.

FIGURE 4-9. OTHER NOVEL PUTATIVE ENHANCER REGIONS IN MOUSE ES CELLS.

FIGURE 4-10. ACTIVE ENHANCERS AND CELL SPECIFICITY OF TOP ENHANCERS (PROB≥0.8).

FIGURE 4-11. OVERLAPPING PERCENTAGES WITH 5-HYDROXYMETHYLCYTOSINE AND OTHER FEATURES.


List of Appendices

APPENDIX 1. ENHANCER CANDIDATE LIST OF THE MOUSE ERYTHROID CELLS.

APPENDIX 2. PRIMERS DESIGNED AND USED IN ERYTHROID STUDY.

APPENDIX 3. WEIGHTS OF FEATURES USED IN MULTINOMIAL LOGISTIC REGRESSION FOR ES CELLS.

APPENDIX 4. TOP ENHANCER CANDIDATE LIST OF MOUSE EMBRYONIC STEM CELLS.

APPENDIX 5. PREDICTION OF LUCIFERASE-VALIDATED ENHANCER POSITIVES AND NEGATIVES.

APPENDIX 6. DETAILED PLOTS FOR NOVEL PUTATIVE ENHANCER REGIONS.

APPENDIX 7. ACTIVE ENHANCERS AND CELL SPECIFICITY OF ALL ENHANCER CANDIDATES.

APPENDIX 8. FEATURE COEFFICIENTS DETERMINED FROM LASSO REGULARIZATION (H3K27AC INCLUDED).


Chapter 1

Introduction


1.1. One genome, multiple epigenomes and transcriptomes

Every cell in our bodies contains the same genetic material, but what differentiates a neuron from a muscle cell is the way in which that material is interpreted. Epigenetic differences, transcription factor (TF) availability and differences in chromatin accessibility and folding influence how the genome is interpreted by the transcriptional machinery responsible for gene expression. Originating from a single genome, differences in epigenomes and transcriptomes contribute to tissue-specific gene expression, which in turn shapes the unique functionality of each cell type.

Considering the developmental process, various progenitor and (terminally) differentiated cell types arise gradually as differentiation takes place. Epigenetic studies have associated DNA methylation and histone modification with tissue-specific gene expression and embryonic development in mammals (Mikkelsen et al., 2007; Ong and Corces, 2011). Thus, investigating the tissue-specific interplay of TFs and histone modification at genomic elements will help decipher regulation of gene expression and elucidate cellular identity.

1.2. Distal regulatory elements: Enhancers

1.2.1. Significance not to be overlooked

Previously, interactions of TFs with proximal promoter regions have been the focus of attention in the regulation of gene expression, in part because of the limited coverage of microarray technology (-10 kb/+2.45 kb of transcription start sites). In addition to promoters, gene expression can also be controlled by transcription factor complexes binding at enhancer and/or insulator regions (Heintzman et al., 2009; Ong and Corces, 2011). Unlike promoters, enhancers do not need to be close to the genes they regulate, and can be located upstream or downstream of them. Furthermore, mutations in the DNA sequences of distant-acting enhancers contribute to various diseases (Visel et al., 2009b). Notably, targeted deletion in mice of the sonic hedgehog (Shh) limb enhancer, located 1 Mb from Shh, resulted in complete loss of Shh expression and severely truncated limbs (Sagai et al., 2005).


Enhancers can be located at great distances from the genes they regulate, which makes them difficult to identify. The advent of massively parallel sequencing technology coupled with chromatin immunoprecipitation (ChIP-Seq) (Kharchenko et al., 2008) has empowered genome-wide investigation of TF occupied sites and histone modifications at detailed resolution. Such technology has shed light on genome annotation and the identification of distal regulatory elements. In ChIP-Seq data sets, high proportions of TF binding peaks are located in intergenic regions >10 kb from the transcription start sites (TSSs) of annotated genes in both the mouse and human genomes, implying that a high proportion of gene regulation occurs via distally located regulatory elements (Table 1-1).
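As an illustration of how such a proportion can be obtained, the following minimal Python sketch computes the percentage of peaks lying more than 10 kb from the nearest TSS. It is not the procedure used to compile Table 1-1: the function names, the toy coordinates and the simplification of each gene to a single TSS position are assumptions made for this example only.

```python
from bisect import bisect_left

def distance_to_nearest_tss(peak_pos, sorted_tss):
    """Distance (bp) from a peak position to the nearest TSS in a sorted list."""
    i = bisect_left(sorted_tss, peak_pos)
    candidates = []
    if i < len(sorted_tss):          # nearest TSS at or to the right of the peak
        candidates.append(sorted_tss[i] - peak_pos)
    if i > 0:                        # nearest TSS to the left of the peak
        candidates.append(peak_pos - sorted_tss[i - 1])
    return min(candidates)

def percent_distal(peaks, tss_by_chrom, cutoff=10_000):
    """Percentage of (chrom, position) peaks more than `cutoff` bp from any TSS."""
    if not peaks:
        return 0.0
    sorted_tss = {chrom: sorted(pos) for chrom, pos in tss_by_chrom.items()}
    distal = 0
    for chrom, pos in peaks:
        tss = sorted_tss.get(chrom, [])
        # A chromosome without any annotated TSS is counted as distal (assumption).
        if not tss or distance_to_nearest_tss(pos, tss) > cutoff:
            distal += 1
    return 100.0 * distal / len(peaks)

# Toy data (hypothetical coordinates, not taken from any data set in the thesis).
tss_by_chrom = {"chr1": [50_000, 200_000], "chr2": [10_000]}
peaks = [("chr1", 52_000), ("chr1", 120_000), ("chr2", 30_500), ("chr2", 9_000)]
distal_pct = percent_distal(peaks, tss_by_chrom)
print(distal_pct)
```

A binary search over sorted TSS positions keeps the lookup fast for genome-scale peak lists; a full analysis would measure distance to the whole transcript rather than to a single TSS coordinate.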


Table 1-1. Potential distal regulatory functional elements.

The percentage of reported TF peaks located more than 10kb away from the closest Ensembl transcript in mouse and human are listed. * denotes the value noted in the paper. The corresponding technique, cell type used and the citation are also listed. MCF7 is a human breast adenocarcinoma cell line. G1E-ER4 is a subline of Gata-1− erythroblast line which stably expresses an estrogen-activated form of GATA1 (cultured in beta-estradiol for 24 hours).

TF         >10 kb from a gene (%)  Technique  Tissue/Cell type                     Species       Publication
NANOG      45                      ChIP-Seq   H1 ES cell                           Homo sapiens  (Lister et al., 2009)
SOX2       45
P300       48
           22
OCT4       19
TAF1       9
Oestrogen  37                      ChIA-PET   MCF7                                 Homo sapiens  (Fullwood et al., 2009)
c-Myc      6                       ChIP-Seq   E14 ES cells                         Mus musculus  (Chen et al., 2008)
           15
Esrrb      22
Nanog      36
Sox2       35
Oct4       26
N-Myc      8
P300       28
Smad1      34
Stat3      26
Suz12      6
GATA1      39*                     ChIP-Seq   Induced erythroleukemia (MEL) cells  Mus musculus  (Yu et al., 2009)
GATA1      14                      ChIP-Seq   G1E-ER4                              Mus musculus  (Cheng et al., 2009)
KLF1       21                      ChIP-Seq   Primary Erythroid Cell               Mus musculus  (Tallack et al., 2010)


1.2.2. Epigenetic states at enhancers in relation to tissue specificity

Contrary to the prior belief that the 99% of mammalian sequence that is non-coding is junk DNA, studies have found that enhancers buried in non-coding regions are associated with significant epigenetic differences between cell types; by contrast, gene promoters show more uniform modifications (Heintzman et al., 2009; Visel et al., 2009a). This finding suggests that enhancers, which can be located at great distances from the genes they regulate, play a larger role in regulating tissue-specific gene expression than do gene promoters.

1.2.3. Regulation of gene expression through chromatin looping

Enhancers have been shown to function through chromatin looping. Expression of the β-globin genes is regulated by the locus control region (LCR) 50 kb upstream of the Hbb-b1 gene. The distal regulatory elements of the LCR were shown to be in physical proximity to the active β-globin promoters through looping (Carter et al., 2002; Tolhuis et al., 2002). This conformation is absent in brain (Tolhuis et al., 2002) and is different in erythroid progenitor cells, where Hbb-b1 is not expressed, suggesting an association between the conformation and the active state (Palstra et al., 2003).

Physical genomic interactions can be detected using the chromosome conformation capture (3C) technique (Dekker et al., 2002). While 3C can be used to identify chromatin loops formed between distal regulatory elements and gene promoters on a small scale, 4C and Hi-C techniques can determine larger scale interactions and entire genome conformations (Lieberman-Aiden et al., 2009; Schoenfelder et al., 2010; Simonis et al., 2006; Zhao et al., 2006).

Since enhancers can come into close contact with promoters of their regulated genes over long distances through chromatin loops, enhancer identification narrows down potential functional looping candidates, and further 3C techniques will help in annotating specific regulatory elements and support their role in the regulation of target genes.

1.3. Features predictive of enhancers

With the availability of high throughput sequencing data, several features in addition to ultra-conserved and extremely conserved sites (Pennacchio et al., 2006; Visel et al., 2008) have been used to identify putative enhancers for a tissue type. These include: p300 cofactor binding sites (Visel et al., 2009a), H3K4me1 enriched and H3K4me3 depleted sites (Heintzman et al., 2007), multiple transcription factor bound loci (Chen et al., 2008), extragenic RNAPII (De Santa et al., 2010) and a combination of several features (Kim et al., 2010).

1.3.1. Interaction of proteins and enhancers

These different approaches show variable success at predicting enhancers with in vivo activity. For example, 47% (246/528) of human genomic regions predicted by “ultra” and “extreme” sequence conservation were confirmed as enhancers in a transgenic mouse assay (Pennacchio et al., 2006; Visel et al., 2008). The success rate rose to 87% (75/86) when only p300 binding sites identified by chromatin immunoprecipitation sequencing (ChIP-Seq) in mouse forebrain, midbrain and limb cells were used (Visel et al., 2009a).

Taking the multiple transcription factor bound loci (MTL) approach with three major pluripotency TFs (OCT4, SOX2 and NANOG), Chen et al. generated enhancer candidates in mouse embryonic stem cells and tested 25 of these regions for enhancer activity (Chen et al., 2008). All 25 regions displayed ES cell specific enhancer activity, suggesting that the MTL approach is the most predictive of functional enhancer regions. However, depending on the binding preferences of TFs for different DNA functional elements, such as promoters or enhancers, MTLs may not represent enhancers. For example, all 8 MYC and MYCN containing MTLs were found to have little or no enhancer activity (Chen et al., 2008).

1.3.2. Histone modification states at enhancers

A motif independent model for identifying and distinguishing promoters and enhancers was applied to the human genome using histone modification profiles (Heintzman et al., 2007). Heintzman et al. observed H3K4me1 enrichment and H3K4me3 depletion at p300 binding sites, and used this signature to identify putative enhancers in 5 human cell lines. Of their enhancer predictions, 63.5% were supported by DNase I hypersensitivity, binding of p300, or binding of the mediator Med1 (TRAP220). Kim et al. used a set of criteria, including CBP binding, H3K4me1 enrichment and H3K4me3 depletion, to identify putative enhancers (Kim et al., 2010).


1.3.3. Active and poised enhancers in embryonic stem cells

Studies have reported epigenetic pre-marking of developmental enhancers through histone and DNA methylation marks in ES cells (Cui et al., 2009; Xu et al., 2009), and these pre-marked enhancers, termed poised enhancers, are active at later developmental stages. Although p300 and H3K4me1 have been used to identify enhancers, recent studies found that the enhancers marked by these two features in ES cells included both active and poised enhancer classes (Creyghton et al., 2010; Rada-Iglesias et al., 2011). The active and poised enhancers can be distinguished by H3K27ac and H3K27me3 marks, respectively. A third enhancer class (intermediate), marked by H3K4me1+ and H3K27ac-, differs in the capability to assume the active and poised state upon differentiation (Zentner et al., 2011). 5-hydroxymethylcytosine (5hmC), which has been suggested to be an intermediate in the demethylation of 5-methylcytosine (5mC) (Ficz et al., 2011), has been reported to associate significantly with enhancer, promoter and exonic regions in human ES cells (Stroud et al., 2011). More specifically, 5hmC is significantly enriched at regions carrying the active enhancer marks H3K4me1 and H3K27ac. In contrast, 5hmC is also reported to be enriched at TSSs of poised genes with bivalent marks (H3K4me3 and H3K27me3) (Pastor et al., 2011). In summary, enhancers overlapping 5hmC regions are associated with the active enhancer class, but TSSs overlapping 5hmC regions are associated with poised genes.
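The three enhancer classes summarized above can be written as a small decision rule. The Python sketch below is only an illustration of that summary: it assumes each mark has already been reduced to a present/absent call for a region, and the function name, the "unmarked" label and the precedence given to H3K27ac when both H3K27 marks are called are assumptions of this example rather than definitions from the cited studies.

```python
def classify_enhancer(h3k4me1, h3k27ac, h3k27me3):
    """Assign an enhancer class from boolean mark calls for one region."""
    if not h3k4me1:
        return "unmarked"        # no H3K4me1: not treated as an enhancer here
    if h3k27ac:
        return "active"          # H3K4me1+ and H3K27ac+
    if h3k27me3:
        return "poised"          # H3K4me1+, H3K27ac-, H3K27me3+
    return "intermediate"        # H3K4me1+ and H3K27ac-, no H3K27me3

# Hypothetical mark calls for four regions.
regions = {
    "region_a": (True, True, False),
    "region_b": (True, False, True),
    "region_c": (True, False, False),
    "region_d": (False, False, False),
}
classes = {name: classify_enhancer(*marks) for name, marks in regions.items()}
print(classes)
```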

1.4. Computational approaches relevant to enhancer identification

1.4.1. Position specific matrices and comparative genomics

Before the next generation sequencing era, researchers predicted enhancers with various algorithms using position-specific scoring matrices (PSSMs) from TRANSFAC (Matys et al., 2003) and JASPAR (Portales-Casamar et al., 2010), TF binding site clustering and comparative genomics (Aerts et al., 2007; Berman et al., 2004; Hallikas et al., 2006; Palin et al., 2006). PSSMs are matrices representing TF binding motifs and contain a score for each nucleotide at each position within a motif of a specific length. Issues with these approaches include a high false positive rate and predictions limited to well studied TFs. In addition, various biological aspects are overlooked, such as TF accessibility, TF expression and how the epigenetic state at the functional element contributes to tissue-specificity. Since chromatin states vary greatly between tissues, and enhancer regions are marked by tissue specific histone modifications, the link between cell-type and active enhancers is missing when using sequence based approaches.
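To make the PSSM idea concrete, the following Python sketch builds a log-odds matrix from per-position base counts and scores every window of a sequence against it. It is a generic illustration, not the algorithm of any of the tools cited above; the motif counts are invented, and the pseudocount and the uniform background frequency of 0.25 are assumptions of this example.

```python
import math

BASES = "ACGT"

def log_odds_pssm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores."""
    pssm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pssm.append({
            b: math.log2((column[b] + pseudocount) / total / background)
            for b in BASES
        })
    return pssm

def scan(sequence, pssm):
    """Score each window of motif length; return a list of (start, score)."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits

# Invented counts for a 4-bp GATA-like motif (10 aligned sites per position).
counts = [
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]
pssm = log_odds_pssm(counts)
hits = scan("CCGATACC", pssm)          # sequence must contain only A, C, G, T
best = max(hits, key=lambda h: h[1])   # highest-scoring window
print(best)
```

Because every window receives a score regardless of chromatin context, any threshold on these scores will call matches throughout the genome, which is one source of the high false positive rate noted above.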

1.4.2. Integrative modeling of ChIP-Seq data

Advances in sequencing technology have enabled genome-wide identification of mRNA presence, TF binding, histone modification, and DNA methylation. Sequencing techniques are not limited to specific genic regions, as is microarray technology (low resolution of the entire genome or higher resolution of a partial genome in tiling arrays). Recently, integrative models using multiple data sources, including histone modification ChIP-Seq data, have been shown to improve cell type-specific transcription factor binding site prediction (Ernst et al., 2010; Won et al., 2010). These studies focus mainly on identifying cell type-specific functional elements, and improving the prediction accuracy of tissue-specific transcription factor binding sites.

The Chromia method generates hidden Markov models (HMMs) for promoters and enhancers in comparison to that of background, using binned signals of 8 histone marks and PSSM scores as features (Won et al., 2010). Ernst et al. demonstrated high predictability of true TF binding locations in the human genome using motif information and a proposed general binding preference score, which is a combination of 29 binary and continuous features including different levels of conservation reported by PhastCons scores, CpG islands, G+C percentage, location relative to RefSeq genes, DNase I sensitivity, CTCF binding, high throughput sequence tags for a histone variant (H2A.Z) and RNA polymerase II, and the sum of tags in 20 histone methylation modifications (Ernst et al., 2010).

However, different histone modifications can be associated with active, repressive or poised chromatin in ES cells, and such marks are distributed differently in the genome in relation to annotated genes. Therefore, the use of the sum of tags from all histone modification forms in Ernst et al. 2010 focuses more on TF binding prediction than on functional annotation of the genomic regions. These two integrative methods demonstrated significant improvement in TF binding prediction using sequence features, histone modification and other chromatin related features. Furthermore, these methods use all available ChIP-Seq data in either human or mouse ES cells for enhancer or TF binding site predictions, without assessing how predictive each feature is. As a result, applying these approaches to other cell types will require generating all corresponding ChIP-Seq data in those cell types to reach optimal performance.


1.5. Thesis overview

Massively parallel sequencing technology has enabled the investigation of genome-wide TF binding sites and histone modifications at high resolution. The thesis aims to identify enhancer regions through utilization of genomic features, and available RNA-Seq and ChIP-Seq data sets. The presented work identifies enhancers in mouse erythroid cells and mouse ES cells through a biologically directed approach and an unbiased computational modeling approach, respectively. Enhancer-associated TFs which play regulatory roles and contribute to tissue specificity in the cell types are subsequently identified through motif enrichment analysis. Chapter 2 details the methods used in the subsequent two chapters.

In Chapter 3, the biologically directed approach was first taken to identify enhancers in mouse erythroid cells using enrichment of RNAPII-ser5 and the absence of nucRNA. A series of downstream analyses followed to investigate the functional importance of the putative enhancers. Since multiple signatures have been associated with enhancers, a systematic assessment of these signatures is necessary to accurately identify enhancers. In Chapter 4, given the wealth of data sets publicly available in mouse ES cells, chromatin related ChIP-Seq features and genomic features are assessed with computational modeling, and the key signatures identified are used for enhancer prediction. A detailed investigation of these putative enhancers further reveals their properties and their potential roles in transcriptional regulation in mouse ES cells. Finally, I discuss and summarize what has been learned from the different approaches in Chapter 5.

The workflow in Chapter 3 was presented in a poster session at BioC2010, and the results were incorporated into Dr. Jennifer A. Mitchell and Dr. Ieuan Clay's work on transcriptome characterization in erythroid cells (under review). Part of the work in Chapter 4 was presented in poster sessions at “Epigenetic Landscape in Development and Disease”, “Epigenetics eh!” and “Toronto regulatory codes workshop”. The manuscript for Chapter 4 is in preparation.


Chapter 2

Methods


Bioinformatics and statistical analyses were done in R (http://cran.r-project.org/) and Bioconductor 2.12.0 (Gentleman et al., 2004) unless otherwise stated. Visualizations of genomic regions were generated using the UCSC Genome Browser (Fujita et al., 2010).

2.1. Methods for enhancer identification in mouse erythroid cell

This section details the methods used in Chapter 3.

2.1.1. Mapping of datasets to the mouse genome

RNAPII-ser5 ChIP-Seq and nucRNA-Seq data were obtained from a publication in submission (Mitchell et al.). Raw ChIP-Seq data for Gata1, KLF1, Ldb1, Eto2, Tal1 and Mtgr1, listed in Table 2-1, were downloaded from GEO (Barrett et al., 2010) and the European Nucleotide Archive (ENA). Raw data were aligned to the NCBI m37 mouse assembly (mm9) using Bowtie (Langmead et al., 2009), which provides ultrafast, memory-efficient short read alignment by utilizing a Burrows-Wheeler index. The parameters were set to report only the single best alignment for each read, with a maximum of 2 mismatches within the 28-nucleotide seed at the high quality end. The aligned tags were then subjected to peak identification. Format conversion was done using the Vancouver Short Read Analysis Package (http://vancouvershortr.sourceforge.net/).
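The alignment settings described above could correspond to a Bowtie 1 invocation such as the one assembled in the Python sketch below. The exact command line used in this work is not recorded here, so the mapping of the description to `-n 2`, `-l 28`, `-k 1` and `--best` is an assumption (an `-m`-based setting would be another plausible reading), and the index name and file names are hypothetical placeholders.

```python
def bowtie_command(index, reads_fastq, out_sam):
    """Build an argument list for Bowtie 1 (assumed option mapping)."""
    return [
        "bowtie",
        "-n", "2",     # at most 2 mismatches in the seed
        "-l", "28",    # seed length of 28 nucleotides
        "-k", "1",     # report at most one alignment per read
        "--best",      # choose that alignment in best-first order
        "-S",          # write SAM output
        index,
        reads_fastq,
        out_sam,
    ]

# Hypothetical index and file names for illustration only.
cmd = bowtie_command("mm9", "reads.fastq", "aligned.sam")
print(" ".join(cmd))
# The list can be passed to subprocess.run(cmd, check=True) where Bowtie is installed.
```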

2.1.2. Enrichment of nucRNA-Seq and ChIP-Seq datasets

SeqMonk (bioinformatics.bbsrc.ac.uk/projects/seqmonk) was used to identify enrichment in nucRNA-Seq, allowing a maximum gap of 100 bp, a minimum size of 1000 bp and a depth of 0. The SISSRs algorithm (Jothi et al., 2008) identifies TF peaks in ChIP-Seq data as the transition point from aligned sense tags to anti-sense tags, requiring more than a specified number of tags on both sides and within the window frames. Fold enrichment is computed as the sum of tags in the target TF data over that in the control. SISSRs was used to identify peaks that were significantly enriched in the target data compared to the input data. Since SISSRs is designed for single end sequence data, the RNAPII paired end data were split into single end data. The average length and the length of the longest DNA fragment sequenced were set to 500 and 1000, respectively. The parameters for the corresponding TF data sets were set according to the original publications, and peak detection was done in comparison to the corresponding input data. Significant peaks were defined as those with p-values smaller than 0.001.
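The gap and size settings used for the nucRNA-Seq enrichment can be illustrated with a simple interval-merging routine. The Python sketch below is not SeqMonk's implementation; it only shows one way in which covered intervals separated by at most 100 bp could be joined and regions shorter than 1000 bp discarded. The function name, the half-open coordinate convention and the toy intervals are assumptions of this example.

```python
def merge_intervals(intervals, max_gap=100, min_size=1000):
    """Merge (start, end) intervals separated by <= max_gap bp and
    keep only merged regions at least min_size bp long."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged if e - s >= min_size]

# Toy covered intervals on one chromosome (hypothetical coordinates).
covered = [(0, 400), (450, 900), (980, 1300), (2000, 2300),
           (5000, 5600), (5650, 6100)]
regions = merge_intervals(covered)
print(regions)
```

In this toy input the first three intervals join into one 1300 bp region, the isolated 300 bp interval is dropped by the size filter, and the last two intervals join into a 1100 bp region.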

Table 2-1. A List of high throughput sequencing data sets used for enhancer identification in mouse erythroid cells.

Gata1* denotes Gata1 ChIP-Seq data from Soler et al. MEL: mouse erythroleukaemic cells. G1E-ER4: a subline of Gata-1− erythroblast line which stably expresses an estrogen-activated form of GATA1 (cultured in beta-estradiol for 24 hours).

| Seq data sets | Cell type | Accession | Reference |
|---|---|---|---|
| nucRNA, RNAPII-ser5 | Adult erythroid cells | ERP000702 (ENA) | (Mitchell et al.; under review) |
| GATA1 | G1E-ER4 cells | GSE18164 (GEO) | (Cheng et al., 2009) |
| KLF1 | E14.5 fetal livers | GSE20478 (GEO) | (Tallack et al., 2010) |
| LDB1, ETO2, TAL1, MTGR1, GATA1* | MEL cells | ERA000161 (ENA) | (Soler et al., 2010) |

2.1.3. Conservation, motif identification, and function annotation analyses

PhastCons identifies evolutionarily conserved elements, and its score ranges from 0 to 1, representing the probability that each base is in a conserved element (Siepel et al., 2005). Conservation was incorporated into the analysis as a filter to improve the accuracy of identifying regulatory elements. The Clover algorithm (Frith et al., 2004) screens a set of given DNA sequences against transcription factor PSSMs and assesses whether any motifs are over- or under-represented in the given sequences by comparison to random sequences drawn from mouse chromosome 19 and from 5kb upstream of TSSs. The PSSMs were obtained from the JASPAR mammalian database (Portales-Casamar et al., 2010) and a protein-binding microarray publication in mouse and human (Badis et al., 2009).

The DAVID web server (Huang da et al., 2009b) provides gene-annotation enrichment analysis and functional clustering for sets of genes. The functional annotations of the putative enhancers were identified from functional enrichment of their closest genes. Clusters containing GO biological processes and/or GO molecular functions, and any terms with a Benjamini-corrected p-value < 0.05, were reported. The average RNAPII enrichment was first calculated over the genes of each term within a cluster, and the RNAPII enrichment reported for the cluster is the maximum of these per-term averages.

2.1.4. Native ChIP-qPCR of H3K4me1

In mouse erythroid cells, native ChIP was carried out with micrococcal nuclease digestion using antibodies against the histone modification H3K4me1 and rabbit anti-goat (RAG) antibodies. RAG was used to check for non-specific binding to the mouse genome. The digestion of DNA to mainly mononucleosomes was confirmed by gel electrophoresis and optimized to a 6 minute incubation with 30U of micrococcal nuclease. 100µl of digested sample, named input, was also purified for normalization purposes to account for differences in digestion efficiency between open and closed chromatin. Quantitative PCR (q-PCR) was performed on known enhancers of Lmo2 and Hbb-b1 to confirm the validity of the experiment. In addition, primers for 6 putative enhancer candidates (Pim1, Ep300, Klf13, Slc25a37, Wdr26 and Aplp2), named after the closest transcribed genes, were designed using Primer3 (Rozen and Skaletsky, 2000) (Appendix 1), and the product sizes were confirmed using gel electrophoresis.

2.2. Methods for enhancer identification in mouse ES cells

This section details the methods used in Chapter 4.

2.2.1. Public datasets

Thirty public domain ChIP-Seq raw data sets in mouse ES cells were obtained from GEO (Barrett et al., 2010): 12 TFs (Chen et al., 2008), 8 histone modifications (Mikkelsen et al., 2007), 3 polymerase occupancy data sets (Rahl et al., 2010) and 7 chromatin associated proteins (Chen et al., 2008; Goren et al., 2010; Kagey et al., 2010) (Table 2-2). Although not listed in the table, five ChIP-Seq controls corresponding to the above features were also downloaded. RNA-Seq raw data and processed regions were obtained for expression analysis (Guttman et al., 2010). Other genomic features, including PhastCons most conserved regions from 30 vertebrate genomes (Miller et al., 2007; Siepel et al., 2005), CpG islands (Gardiner-Garden and Frommer, 1987), G+C content, SNPs and repeat regions, were downloaded from the UCSC genome annotation database (mm9 build) (Fujita et al., 2010).

2.2.2. Data pre-processing

As was done in section 2.1.1, reads from ChIP-Seq data were aligned to the NCBI m37 mouse assembly using Bowtie (Langmead et al., 2009), reporting only the single best alignment for each read and allowing a maximum of 2 mismatches within the 28-nucleotide seed at the high quality end. The genome was segmented into 1kb non-overlapping bins, and the target tag count in each bin was normalized by dividing by the control tag count plus a fudge factor of 3, which was used to reduce the effect of low input counts in generating extreme ratios. The same procedure was applied to all ChIP-Seq data sets to obtain a vector of values for each protein or TF. Multiple ChIP-Seq data files for each TF in the same experimental setting were combined, and exact duplicate reads were removed to avoid PCR amplification bias generated in sequencing library preparation. As the control ChIP-Seq data sets were not uniformly distributed throughout the genome, 180 bins with read counts in the top 0.5 percentile for all 5 controls were first excluded from the analysis. The genomic features were subsequently quantified in each bin using counts without normalization, e.g. the number of CpG islands within each bin. In enhancer candidate plots, significant TF binding peaks (p<0.001) predicted with the SISSRs algorithm (Jothi et al., 2008) were labeled in order to show binding peaks defined from another source.
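The bin-level normalization described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names and the representation of tags as (chromosome, position) pairs are assumptions made for the example, while the 1kb bin size and the fudge factor of 3 come from the text.

```python
from collections import Counter

BIN_SIZE = 1000  # 1kb non-overlapping bins, as described in the text
FUDGE = 3        # added to the control count, as described in the text


def bin_counts(tag_positions, bin_size=BIN_SIZE):
    """Count tags per bin; tags are hypothetical (chromosome, position) pairs."""
    return Counter((chrom, pos // bin_size) for chrom, pos in tag_positions)


def normalized_signal(target_tags, control_tags, bin_size=BIN_SIZE, fudge=FUDGE):
    """Return {bin: target_count / (control_count + fudge)} for bins with target tags."""
    target = bin_counts(target_tags, bin_size)
    control = bin_counts(control_tags, bin_size)
    # A Counter returns 0 for bins without control tags, so the fudge factor
    # keeps the denominator positive and damps ratios from low input counts.
    return {b: n / (control[b] + fudge) for b, n in target.items()}
```

With two target tags and one control tag in the same bin, the normalized value is 2 / (1 + 3) = 0.5; without the fudge factor the same bin would score 2.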

Table 2-2. Data sets and genomic features used for enhancer identification in mouse ES cells.

The RNA-Seq, ChIP-Seq and genomic feature data sets used in the study are listed and cited in the table. The usage of these data, the information on the mouse ES cells, and GEO accession numbers are reported in “Purpose”, “Cell line”, “Growth condition”, and “Accession” columns. MTL: multiple transcription factor bound loci.

| Data | Purpose | Cell line | Accession | Reference |
|---|---|---|---|---|
| RNA | Feature | V6.5 | GSE20851 | (Guttman et al., 2010) |
| Histone modifications (H3, H3K4me1, H3K4me2, H3K4me3, H3K36me3, H4K20me3, H3K27me3, H3K9me3) and RNAPII | Feature | V6.5 | GSE11172, GSE12241 | (Mikkelsen et al., 2007) |
| RNAPII-ser2, RNAPII-ser5 | Feature | V6.5 | GSE20530 | (Rahl et al., 2010) |
| SMC1A, SMC3, MED12, MED1, NIPBL | Feature | V6.5 | GSE22562 | (Kagey et al., 2010) |
| p300 | Feature | E14 | GSE11431 | (Chen et al., 2008) |
| CTCF | Feature | V6.5 | GSE18699 | (Goren et al., 2010) |
| CpG islands, GC content, SNP, repeat regions and PhastCons most conserved regions | Feature | -- | -- | (Fujita et al., 2010; Gardiner-Garden and Frommer, 1987; Miller et al., 2007; Siepel et al., 2005) |
| OCT4, SOX2, NANOG, c-Myc, N-Myc | Training sets | E14 | GSE11431 | (Chen et al., 2008) |
| KLF4, STAT3, SMAD1, E2F1, TCFCP2I1, ZFX, ESRRB | MTL analysis | E14 | GSE11431 | (Chen et al., 2008) |

2.2.3. Training data sets

The binding regions of each TF used as training data were defined as regions with the number of tags in the top 0.5 percentile. Random regions co-bound by the pluripotency transcription factors Oct4, Sox2 and Nanog (OSN) were confirmed to have enhancer activity in 25/25 cases, whereas regions bound by c-Myc and N-Myc (MYC) were shown to have very weak or no ES-cell-specific enhancer activity in luciferase assays (Chen et al., 2008). These co-bound regions were therefore used as enhancer positive (Enh) and promoter-like (PrL) training sets to classify enhancers using TF-independent features. More specifically, 1291 co-bound regions of OSN without either c-Myc or N-Myc binding were taken as the Enh training set, whereas 4465 co-bound regions of c-Myc/N-Myc (MYC) without any of OSN were taken as the PrL training set. Due to the promoter-like nature of the MYC cluster (Figure 4-4a), 5000 random regions were drawn from non-OSN and non-MYC regions to form the third category, called 'unknown'.
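The labeling rule for the Enh and PrL training sets can be expressed as simple set logic. The sketch below is illustrative only: the function name is hypothetical, and it assumes that "co-bound" means all listed factors are bound in the bin (all three of OSN for Enh, both c-Myc and N-Myc for PrL), which is one reading of the description above.

```python
OSN = {"Oct4", "Sox2", "Nanog"}
MYC = {"c-Myc", "N-Myc"}


def assign_training_label(bound_tfs):
    """Label a bin from the set of TFs bound in it (hypothetical helper).

    Returns "Enh" for OSN co-bound bins without any MYC factor, "PrL" for
    MYC co-bound bins without any OSN factor, and None otherwise.
    """
    bound = set(bound_tfs)
    if OSN <= bound and not (MYC & bound):
        return "Enh"
    if MYC <= bound and not (OSN & bound):
        return "PrL"
    return None
```

Bins returning None would not enter the Enh or PrL sets; the 'unknown' category is drawn separately at random from non-OSN and non-MYC regions.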

2.2.4. Feature combination assessment using Naive Bayes

Naive Bayes classifiers with various feature combinations were used to classify 1kb genome bins into the three categories. Due to the limited amount of training data, 10-fold cross validation was performed on the training data set, randomly holding out 10% of the data as a validation set. Classifier assessment was carried out using the mean of deviance, area under curve (AUC), precision, modified precision, and recall values computed from the validation set. Deviance, given by minus 2 times the log likelihood ratio, quantifies the goodness-of-fit of the classifier. AUC is the area under the receiver operating characteristic (ROC) curve, which graphs true positive versus false positive rates as the thresholds are varied. Precision, normally given by true positives over the sum of true positives and false positives, was modified here to report the percentage of OSN co-bound regions among the putative enhancers that were originally labeled Enh or PrL. This was done to avoid penalizing potential true enhancers in the unknown validation set. Recall, given by true positives over the sum of true positives and false negatives, measures the percentage of all OSN co-bound regions predicted to be enhancers. These indices capture different aspects of model assessment. The classifiers with different feature combinations were ranked by sorting according to the average ranking over all indices.
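To make the difference between precision and modified precision concrete, the following sketch computes both, together with recall, from true and predicted labels of a validation set. The function name and the label strings are assumptions for illustration; the definitions follow the paragraph above.

```python
def assess(true_labels, predicted_labels):
    """Precision, modified precision and recall for the Enh class.

    Labels are the strings "Enh", "PrL" and "unknown" (illustrative names).
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if p == "Enh" and t == "Enh")
    fp_all = sum(1 for t, p in pairs if p == "Enh" and t != "Enh")
    fp_prl = sum(1 for t, p in pairs if p == "Enh" and t == "PrL")
    fn = sum(1 for t, p in pairs if p != "Enh" and t == "Enh")

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "precision": ratio(tp, tp + fp_all),
        # Predicted enhancers from the 'unknown' set are left out of the
        # denominator, so possible true enhancers there are not penalized.
        "modified_precision": ratio(tp, tp + fp_prl),
        "recall": ratio(tp, tp + fn),
    }
```

In a toy validation set where four bins are predicted as enhancers (two Enh, one PrL, one unknown), precision is 2/4 while modified precision is 2/3.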

2.2.5. Feature extraction with lasso regularized multinomial logistic regression

All features were first standardized by subtracting the mean and dividing by the standard deviation in order to prevent biased shrinkage of feature weights. Taking into account the dependencies between features, multinomial logistic regression was applied to model sequence features, chromatin features, and associated proteins to predict the genome-wide location of enhancers. Cross entropy was used as the error function for multi-group classification. To assess the predictive value of features, lasso regularization was used, which adds a penalty on the L1 norm (the sum of absolute values) of the weight vector (Friedman et al., 2010; Tibshirani, 1996). Under lasso regularization, the weights of less significant features shrink to 0 as lambda increases. The value of log(lambda), -4, was selected by comparing the average multinomial deviances from 10-fold cross validation.
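The objective being minimized can be illustrated with a small pure-Python sketch: features are standardized, class probabilities come from a softmax over linear scores, and the loss is the mean cross entropy plus lambda times the L1 norm of the weights. This is only a sketch of the general technique under stated assumptions (population standard deviation, unpenalized intercepts, hypothetical function names); it is not the fitting procedure of the cited software and it does not perform the optimization itself.

```python
import math


def standardize(column):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / sd for x in column]


def softmax(scores):
    """Convert linear scores into class probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def penalized_loss(X, y, weights, intercepts, lam):
    """Mean cross entropy plus lam times the L1 norm of the weights.

    X: list of feature rows; y: class indices; weights[k][j] is the weight
    of feature j for class k. Intercepts are assumed to be unpenalized.
    """
    cross_entropy = 0.0
    for row, label in zip(X, y):
        scores = [b + sum(w * x for w, x in zip(wk, row))
                  for wk, b in zip(weights, intercepts)]
        cross_entropy -= math.log(softmax(scores)[label])
    l1_norm = sum(abs(w) for wk in weights for w in wk)
    return cross_entropy / len(X) + lam * l1_norm
```

Because the penalty grows with the absolute size of every weight, a larger lambda makes it cheaper to set weakly informative weights exactly to zero, which is the shrinkage behavior described above.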

2.2.6. Absolute gene expression in mouse ES cell

RPKM, the number of reads per kilobase of exon region per million mapped reads, derived from RNA-Seq data has been shown to be approximately proportional to the absolute abundance of mRNAs in the cell (Mortazavi et al., 2008). I obtained RPKM values previously computed by Ouyang et al. (Ouyang et al., 2009) from a mouse ES cell RNA-Seq data set (Cloonan et al., 2008). The Enh and PrL candidates were assigned to the Ensembl genes with the closest TSSs without taking the degree of transcription into account. The distributions of absolute expression (RPKM) were shown in box plots, and one-sided Kolmogorov-Smirnov tests were performed to assess the differences in empirical cumulative distributions of gene sets.
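For reference, the RPKM definition given above corresponds to the following calculation. The values used in this study were obtained precomputed, so the function below is only an illustration of the definition, with hypothetical parameter names.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    # Dividing by (length / 1e3) and by (total / 1e6) is the same as
    # multiplying the read count by 1e9 and dividing by both quantities.
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)
```

For example, 500 reads on a gene with 2000bp of exon sequence in a library of 10 million mapped reads give an RPKM of 25.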

The RPKM values of embryoid bodies, also obtained from Ouyang et al. (data originated from Cloonan et al.), were used to identify differentially expressed genes. The log2 ratio of RPKM in ES cells to RPKM in embryoid bodies was computed for each gene. A fudge factor of 0.5 was added before taking the ratio. Genes up-regulated more than 2 fold in either cell type were reported to be differentially expressed between ES cells and embryoid bodies.
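A minimal sketch of the differential expression call follows. It assumes that the fudge factor of 0.5 is added to both RPKM values before the ratio is taken and that "more than 2 fold" means an absolute log2 ratio strictly greater than 1; the function names and return strings are illustrative.

```python
import math


def log2_ratio(rpkm_es, rpkm_eb, fudge=0.5):
    """log2 of (ES RPKM + fudge) over (embryoid body RPKM + fudge)."""
    return math.log2((rpkm_es + fudge) / (rpkm_eb + fudge))


def differential_call(rpkm_es, rpkm_eb, fudge=0.5, fold=2.0):
    """Classify a gene as up in ES cells, up in embryoid bodies, or unchanged."""
    ratio = log2_ratio(rpkm_es, rpkm_eb, fudge)
    threshold = math.log2(fold)
    if ratio > threshold:
        return "up in ES"
    if ratio < -threshold:
        return "up in EB"
    return "unchanged"
```

The fudge factor keeps the ratio defined when a gene has an RPKM of 0 in one cell type, and it pulls the ratios of very lowly expressed genes toward 0.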

2.2.7. Gene Ontology functional enrichment analysis

Candidates were first assigned to genes by identifying the closest TSSs. In total, 469 Enh candidates (Enh prob≥0.9) and 2239 PrL candidates (PrL prob≥0.99) were used to carry out the analysis. The discrepancy in probability cut-offs between sets was due to the large number of high confidence PrL candidates. The DAVID functional annotation website was used to assess functional enrichment in both the Enh and PrL candidate sets (Huang da et al., 2009a, b). As genes can have more than one nearby candidate within each set, only the unique genes within each set were subjected to GO analysis. The total numbers of genes with associated GO annotations in Enh and PrL were 302 and 1913, respectively. Only the molecular functions and biological processes significantly enriched in each set were subsequently reported (FDR<0.1). For better visualization of the enriched functions, the Enrichment Map Cytoscape plug-in was used to cluster functions sharing the same genes (Merico et al., 2010; Smoot et al., 2011).

2.2.8. Association with multiple transcription factor bound loci

To determine whether the putative enhancer candidates are enriched in MTL, I separated the Enh, PrL and unknown candidates, as well as all genome bins, into 3 groups according to the number of TFs with peak enrichment within the bin: 0 TFs, 1 to 3 TFs, and 4 or more TFs (MTL). The MTL enrichment of Enh was assessed using Chi-squared statistics on the counts of the last two groups in comparison to those of the PrL, unknown and genomic bin sets.
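The grouping step can be sketched as below. The threshold at which a bin counts as MTL is exposed as a parameter; the default of 4 TFs is an assumption of this example, and the function names are hypothetical.

```python
from collections import Counter


def tf_group(n_tfs, mtl_min=4):
    """Assign a bin to a group from its number of enriched TFs."""
    if n_tfs == 0:
        return "0 TFs"
    if n_tfs < mtl_min:
        return f"1 to {mtl_min - 1} TFs"
    return "MTL"


def group_counts(tfs_per_bin, mtl_min=4):
    """Count bins per group for one candidate set (e.g. Enh or PrL)."""
    return Counter(tf_group(n, mtl_min) for n in tfs_per_bin)
```

The counts of the two non-zero groups from two candidate sets then form the 2x2 table on which a Chi-squared test is run.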

2.2.9. Supervised motif analysis

The Clover algorithm (Frith et al., 2004) was used to screen a set of given DNA sequences against transcription factor weight matrices and assess whether any motifs are over- or under-represented in the given sequences by comparison to random sequences drawn from chr19 (42.7% C+G) and 5kb upstream regions (45.7% C+G). I obtained human and mouse PSSMs from the JASPAR mammalian database (Portales-Casamar et al., 2010) and protein-binding microarray data in mouse (Badis et al., 2009). To narrow down the search, the motif analysis was restricted to 467 putative enhancer regions with enhancer probability greater than 0.9 (45% C+G). For a more stringent motif screening, the significance of enrichment in Enh regions compared to 2222 PrL regions was also tested (PrL prob>0.99; 61% C+G).

2.2.10. Comparison to other high throughput sequencing datasets

Significantly enriched regions of H3K4me1 and distal H3K27ac in ES, adult liver, progenitor B and neural progenitor cells were obtained (Creyghton et al., 2010), and the coordinates of these regions, originally aligned to the mm8 mouse build, were converted to mm9 using the UCSC genome browser liftOver tool (Fujita et al., 2010). CHD7 binding peaks from medium and high thresholds (10479 and 2916 peaks, respectively) were obtained (Schnetz et al., 2010), and coordinate conversion was performed to update the genome locations to the mm9 build. The 5hmC and 5mC enriched regions in murine V6.5 ES cells were obtained directly for mm9 (Pastor et al., 2011). The reported Chi-square p-values and log odds ratios were computed using 5hmC data obtained using the GLIB (glucosylation, periodate oxidation, biotinylation) approach. Venn diagrams were drawn on the basis of putative enhancers, allowing 500bp gaps between regions.
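Counting overlaps while allowing a gap between regions, as done for the Venn diagrams, can be sketched as follows. The tuple representation (chromosome, start, end) and the function names are assumptions; the exact coordinate convention (closed or half-open intervals) shifts the measured gap by one base and is not specified here.

```python
def overlaps_with_gap(a, b, max_gap=500):
    """True if regions a and b lie on the same chromosome and are
    separated by at most max_gap bases (overlapping regions included)."""
    if a[0] != b[0]:
        return False
    return a[1] <= b[2] + max_gap and b[1] <= a[2] + max_gap


def count_overlapping(candidates, regions, max_gap=500):
    """Number of candidates within max_gap of at least one region."""
    return sum(
        1 for c in candidates
        if any(overlaps_with_gap(c, r, max_gap) for r in regions)
    )
```

This quadratic scan is adequate for a few thousand regions; sorted interval sweeps or a dedicated interval tool would be preferable for larger sets.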

Chapter 3

Enhancer Identification in Erythroid Cells: A Biologically Directed Approach

3.1. Introduction

Erythroid cells are a well characterized model cell type for the study of gene regulation, in which the highly expressed α and β globin genes have been well studied. The high levels of tissue-specific β globin gene expression are regulated by the LCR, which consists of six DNase I hypersensitive sites (HS) (Forrester et al., 1990). Specifically, the second hypersensitive site in the LCR, HS2, was confirmed to have classical enhancer activity (Talbot and Grosveld, 1991).

With limited data available in mouse erythroid cells at the time of study, a directed approach was taken in which observations at the enhancers in the β globin LCR were used to define a marker for identifying enhancers. The presence of RNA Polymerase II is generally thought to correlate with active gene transcripts; however, the presence of the initiating form of RNAPII phosphorylated at serine 5 of the C-terminal domain (RNAPII-ser5) in the absence of associated transcription (nucRNA) was observed at the β globin LCR through comparison of RNAPII-ser5 ChIP-Seq and nucRNA-Seq data. The absence of nucRNA at the enhancers and the previously validated looping conformation between these enhancers and the β globin (Hbb) promoters imply that such RNAPII-ser5 enrichment may be contributed to by the RNAPII-ser5 occupying the Hbb promoter regions.

The hypothesis is that distal regulatory regions with RNAPII-ser5 occupancy and the absence of nucRNA are functional enhancers, which potentially interact with distant promoters through looping mechanisms. Regions with enrichment of RNAPII-ser5 and absence of nucRNA are referred to as RNAPII+/nucRNA-. Downstream bioinformatics analyses such as conservation and TF binding overlap are performed to assess the functionality of the putative enhancers. Identification of motif enrichments using these putative regions is also performed to confirm the motif enrichment of erythroid-related TFs, and to discover novel TFs active in erythroid cells. This chapter presents a pilot directed approach to identify enhancers.

3.2. Results

3.2.1. A closer look at the Hbb locus control region

The extensively studied Hbb LCR region in erythroid cells was used as a model to design the marker (Figure 3-1). The tag density and identified peak sites of RNAPII-ser5 aligned with 4 of the 6 previously known DNase I hypersensitive sites; however, nucRNA transcripts were not enriched at or around any of the HS. In addition, the RNAPII-ser5 peaks at HS1 to HS4 aligned with binding peaks of several erythroid-expressed TFs. Although TF peaks were not identified at HS5 and HS6 by the SISSRs algorithm, there was a relatively low level of GATA1 and other TF occupancy at these regions (Figure 3-1). The discrepancy is likely due to the stringency of the cutoff parameter set in SISSRs. In general, multiple TF binding sites are found in the first four enhancers. Note that even though the TF ChIP-Seq experiments were not performed on the same erythroid cells as the nucRNA-Seq and RNAPII-ser5 ChIP-Seq experiments, the significant overlap of peaks indicated that comparing the data sets is feasible.

Figure 3-1. Hbb LCR region overlapping multiple transcription factor binding sites.

The LCR region 50kb upstream of adult Hbb genes is shown on UCSC genome browser with separate tracks for six known DNase I hypersensitive sites, significantly enriched regions of nucRNA, RNAPII-ser5, and TFs identified using SeqMonk and SISSRs algorithm. Fold enrichments over the control (input ChIP-seq data) are reported beside the enriched regions. These tracks are followed by the coverage plots of RNAPII-ser5 and erythroid-expressed TFs as well as PhastCons mammalian conservation scores.

3.2.2. Identification of putative enhancers

The RNAPII+/nucRNA- filter identified 1598 putative enhancer candidates. Two enhancers closest to the transcribed Hbb-b2 gene were among the top 50 candidates located >10kb away from the TSSs of transcribed genes, which are shortlisted in Table 3-1 (full list provided in Appendix 1). Note that there may or may not be other non-transcribed genes located closer to the candidates; however, in light of linking enhancer regions to gene expression, the nearest transcribed transcripts were recorded in the table. Of the 1598 enhancer candidates, 206 are located in intergenic regions greater than 20kb away from the closest TSSs of transcribed genes; these are of particular interest as potential functional enhancers involved in long distance chromatin looping. The distribution of enhancers found by the RNAPII+/nucRNA- marker in the genome is illustrated in Figure 3-2. For several TFs, over 30% of putative enhancers identified by the marker are located in genic and intergenic regions, which are defined to be >10kb from the nearest TSS. Overall, these enhancers have a mean distance of 0.4 Mbps from the TSSs of the nearest transcripts.

| Region | Percentage | Count |
|---|---|---|
| downstream | 11.51 | 184 |
| genic | 32.60 | 521 |
| intergenic | 32.54 | 520 |
| overlapTSS | 9.82 | 157 |
| upstream | 13.52 | 216 |

Figure 3-2. Distribution of RNAPII+/nucRNA- throughout the genome.

The pie chart illustrates the percentage of putative enhancers that overlap with each of the following five categories: genic, overlapTSS, upstream, intergenic, and downstream regions. Genic regions are defined by strict overlap with Ensembl transcripts; overlapTSS regions are regions that overlap TSSs; upstream regions are within 10kbps of TSSs but do not overlap the TSS window; intergenic regions are more than 10kbps away from the boundaries of any transcript; downstream regions are within 10kbps of the 3' end of transcripts.

Table 3-1. List of top 50 distal enhancer candidates in erythroid cells identified from RNAPII+/nucRNA-.

The distal enhancers are defined as regions >10kbps away from transcribed genes. Columns represent the fold enrichment of RNAPII-ser5 over input (Fold), whether the region is conserved (conserved), the shortest distance in base pairs from the TSS of the closest expressed Ensembl transcript (dist2ET), the sum of RNA tags within the transcript (RNASum), and the average number of tags per kb (AvgPKb). GATA1* denotes the GATA1 data from Soler et al., 2010.

3.2.3. Overlap with transcription factors and conserved regions of the genome

Genomic regions, including intergenic regions, bound by multiple TFs are reported to have regulatory function (Table 1-1) (Chen et al., 2008). To investigate the regulatory potential of putative enhancers in erythroid cells, I examined the overlap between RNAPII+/nucRNA- regions and the set of erythroid cell-expressed TFs for which ChIP-Seq data were generated. I tested the association between RNAPII+/nucRNA- regions and TF peaks using Chi-square statistics. The analysis confirmed the strongest positive association between the marker and GATA1 peaks (p=1.24e-14; log-odds=2.92) (Table 3-2) in comparison to other TF binding sites. The significant positive log-odds value indicates that GATA1 and RNAPII+/nucRNA- co-occurred more frequently than GATA1 and the remaining combinations of RNAPII-ser5 and nucRNA, in support of the hypothesis. Overall, the RNAPII+/nucRNA- regions are significantly associated with all individual TFs except ETO2, and the most significant association is found with TF combined, which contains the binding regions for any TF (p< 2.2e-16). The numbers of overlaps between RNAPII-ser5 and TF combined binding sites as well as nucRNA presence are displayed in the Venn diagram (Figure 3-3). As expected, RNAPII-ser5 bound regions are found to overlap with transcribed regions (49%). Several TF bound regions also overlap with transcribed regions; however, as RNAPII+/nucRNA- regions are being investigated as an enhancer marker, I am not able to detect enhancers located within transcribed genes. One hundred and thirty-eight RNAPII-ser5 regions lack nuclear RNA and contain at least one erythroid TF binding site. Tests of association further confirmed that RNAPII+/nucRNA- regions, with or without overlapping GATA1 peaks, are both significantly and positively associated with intergenic regions (p=2e-16 and p=0.0064, respectively).
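The kind of association statistic reported here can be illustrated on a 2x2 table of overlap counts. The sketch below is a generic Pearson Chi-square test with one degree of freedom and a natural-log odds ratio; the absence of a continuity correction, the use of the natural logarithm, and the function name are assumptions of this example and may differ from the software settings used for the reported values.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson Chi-square test and log odds ratio for a 2x2 table.

    a, b: candidate regions with / without a TF peak
    c, d: comparison (e.g. random) regions with / without a TF peak
    Returns (chi2, p_value, log_odds); log_odds is None if any cell is 0.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table sums to zero")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the Chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    log_odds = math.log((a * d) / (b * c)) if min(a, b, c, d) > 0 else None
    return chi2, p_value, log_odds
```

A positive log odds ratio means the feature is more frequent among candidates than among the comparison regions. When the comparison set has no overlaps at all, the odds ratio is undefined, which is why the function returns None in that case.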

Table 3-2. Chi-square association of erythroid enhancers with TFs and enhancer candidates.

The table presents the numbers of overlaps between the 1598 enhancer candidates and TF peaks, as well as the p-values and log-odds values computed through Chi-square statistics in comparison to random fragments. TFcomb represents regions with any TF peaks. In the case of KLF1 and ETO2, the log-odds could not be computed (--) because the number of overlaps with random fragments was 0.

| TF | Number of overlaps | p-value | log-odds |
|---|---|---|---|
| GATA1 | 71 | 1.24E-014 | 2.919327 |
| KLF1 | 14 | 0.0004976 | -- |
| LDB1 | 71 | 4.48E-014 | 2.695556 |
| TAL1 | 40 | 3.25E-008 | 2.613738 |
| GATA1* | 68 | 2.12E-013 | 2.650421 |
| ETO2 | 2 | 0.4794 | -- |
| MTGR1 | 37 | 4.30E-008 | 2.939945 |
| TFcomb | 138 | < 2.2e-16 | 2.708708 |

Figure 3-3. Venn diagram of RNAPII-ser5, overlapped binding of all TFs, and nucRNA presence.

The diagram illustrates the number of RNAPII-ser5 peaks overlapping with TFcomb and/or nucRNA enriched regions, allowing 500bps gaps in between. TFcomb represents regions with any TF peaks.

3.2.4. Multiple transcription factor peaks in proximity to putative enhancers

As seen from the Hbb LCR example, multiple TF binding peaks coincide with individual hypersensitive sites. A validated enhancer region 75kb upstream of Lmo2, a highly expressed TF regulator in erythroid cells that regulates erythropoiesis (Brandt and Koury, 2009), was examined and revealed binding of multiple TFs as well as RNAPII-ser5 binding in the absence of nucRNA, further supporting the hypothesis that this marker identifies regulatory regions (Figure 3-4). Putative enhancer candidates around 50kbp upstream of Pim1 and >80kbp upstream of Klf13 also contained multiple TF ChIP-Seq peaks (Figure 3-5a,b). Similar to the Hbb LCR and the Lmo2 enhancer, these intergenic regions occupied by several TFs and associated with RNAPII-ser5 could be important regulatory regions for their nearby genes.

Interestingly, a large proportion of RNAPII+/nucRNA- putative enhancers >20kb away from annotated genes were associated with multiple GATA1 peaks between the RNAPII+/nucRNA- putative enhancer and the gene TSS. This trend agrees with the previous finding that GATA1 dimerization/self-association plays an important role in the regulation of erythroid-cell development (Shimizu et al., 2007). The self-association of GATA1 may also be involved in establishing contact between promoters and enhancers. The trend was observed in the two RNAPII+/nucRNA- enhancer candidates that overlapped with GATA1 binding peaks from different data sources, and in multiple other GATA1 peaks between the enhancers and the Pim1 promoter (Figure 3-5a). Furthermore, the GATA1 peaks were almost equally spaced throughout the 110kb upstream of the Klf13 TSS (Figure 3-5b).

Figure 3-4. A validated enhancer region 75kb upstream of Lmo2, one of the key regulators in erythroid cells.

The figure depicts Lmo2 transcripts and the genomic region upstream of its TSS including the validated enhancer region. Enrichment of nucRNA identified using SeqMonk and RNAPII-ser5 or transcription factor peaks identified using SISSRs algorithms are represented by rectangles. The fold enrichments over the corresponding ChIP-seq input are labeled beside each peak. The previously validated enhancer overlaps the putative enhancer with RNAPII-ser5 enrichment of 27.18 fold and the absence of nuclear RNA. The primers used for the ChIP-qPCR experiment are drawn below the gene track.

Figure 3-5. Graphs of enhancer candidates identified around Pim1 and Klf13.

Tracks shown at (a) Pim1 and (b) Klf13 are designed primers, significant nucRNA enrichments identified from SeqMonk, and RNAPII or erythroid TF peaks identified from SISSRs. The fold enrichment values over the input are reported beside the peaks.

3.2.5. H3K4me1 ChIP-qPCR results support putative enhancers

H3K4me1 has previously been found to be enriched at enhancers (Heintzman et al., 2007). To gather more evidence for the putative enhancers, I performed qPCR on H3K4me1 ChIP fragments for 6 candidates named after the closest expressed genes (Pim1, Ep300, Klf13, Slc25a37, Wdr26 and Aplp2). The primer sets are listed in Appendix 2, and the qPCR results are shown in Figure 3-6. I obtained a modest enrichment of H3K4me1 compared to the non-specific control antibody (RAG). The regions closest to Pim1, Ep300, Klf13 and Wdr26 (p-values = 1.2e-5, 5e-4, 2e-4 and 2e-4, respectively) showed a similar degree of enrichment to the known enhancers of Hbb-b1 and Lmo2 (p-values = 1e-4 and 1e-3, respectively), whereas the regions closest to Aplp2 and Slc25a37 had limited or no enrichment. Among the 6 enhancer candidates, the putative enhancers at Pim1 and Klf13 showed the largest H3K4me1 difference over RAG.

[Figure 3-6, panels (a) and (b): bar plots; legend entries in panel (a): RAG, H3K4me1.]

Figure 3-6. ChIP-qPCR results of H3K4me1 support erythroid enhancers identified.

(a) The plot reports per unit DNA quantity of fold enrichment versus Vh16 for both H3K4me1 and RAG antibodies at the noted regions. 'NG' represents non-genic regions, and 'pr' represents promoter regions. (b) The plot reports fold enrichment between DNA immunoprecipitated by H3K4me1 and RAG antibodies.

3.2.6. Conservation analysis of RNAPII+/nucRNA- regions

In order to probe the functional significance of the putative enhancers, I investigated the conservation of the sequences around these candidates. The scores computed from multiple alignments of 30 vertebrate genomes to the mouse genome were retrieved from the UCSC genome browser (Fujita et al., 2010). For each site, the maximum conservation score within the 1kb surrounding the peak midpoint was taken, and the proportion of sites with a maximum score greater than 0.8 was computed for different combinations of sites (Figure 3-7). The null distribution of PhastCons conservation scores was generated from 1000 sequences drawn randomly from throughout the genome. The RNAPII+/nucRNA- enhancers were significantly more conserved than random sequences (p=0.007612); furthermore, the enhancer regions overlapping TF peaks had higher conservation than the enhancer candidates alone (Table 3-3, Figure 3-7). This provided further evidence that the RNAPII+/nucRNA- marker was able to identify regions with regulatory potential. GATA1 peaks overlapping with RNAPII-ser5 and nucRNA (nucRNA+/RNAPII+/Gata1+) were also more conserved than random sequences, which is not surprising as these regions tend to overlap annotated genes that are highly conserved.
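The conservation summary can be sketched as two small steps: take the maximum PhastCons score in a window around each peak midpoint, then report the proportion of peaks whose maximum exceeds the threshold. The dictionary of per-base scores, the treatment of unscored windows as 0, and the function names are assumptions of this illustration; the 1kb window and the 0.8 threshold come from the text.

```python
def max_score_in_window(scores, midpoint, half_window=500):
    """Maximum score within half_window bases on either side of midpoint.

    scores: dict mapping genomic position to PhastCons score for one
    chromosome; positions without a score are skipped.
    """
    window = [scores[p]
              for p in range(midpoint - half_window, midpoint + half_window)
              if p in scores]
    return max(window) if window else 0.0


def proportion_conserved(max_scores, threshold=0.8):
    """Fraction of peaks whose maximum score is greater than threshold."""
    return sum(1 for s in max_scores if s > threshold) / len(max_scores)
```

Applying the same two steps to randomly drawn genomic locations gives the background proportion against which the candidate sets are compared.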

Table 3-3. Significant association with conservation.

The table presents the numbers of conserved regions in each group, as well as the p-values computed through Chi-square statistics and the log-odds values compared to those of random fragments. TFcomb represents regions with any TF peaks.

| Group | Conserved | Not-conserved | P-value | log-odds |
|---|---|---|---|---|
| TFcomb | 6194 | 3416 | < 2.2e-16 | 0.9056 |
| RNAPII+/nucRNA- | 751 | 847 | 0.007612 | 0.1926 |
| RNAPII+/nucRNA-/Tfcomb | 105 | 33 | 1.793e-08 | 1.4790 |

[Figure 3-7 plot, titled "Max Conservation of 1kb Around Peak Midpoint": density (y axis) against PhastCons scores (x axis, 0.0 to 1.0) for the categories Gata1+, nucRNA-/PolII+/Gata1+, nucRNA+/PolII+/Gata1+, nucRNA-/PolII+ and Random.]

Figure 3-7. Distribution of PhastCons scores.

The density plots of maximum PhastCons scores 1kbps around peak midpoints in each category. The random density is computed from random genomic locations.

3.2.7. Supervised motif enrichment analysis

In order to identify whether known TFs are binding to the putative enhancer regions, a supervised motif enrichment analysis was conducted using the Clover algorithm (Frith et al., 2004). Putative enhancer regions were compared to random background to identify enriched JASPAR binding matrices (Portales-Casamar et al., 2010). A motif enrichment test was performed on two conserved enhancer subsets, one with and one without overlapping identified TF peaks (Table 3-4). The purpose of separating out the enhancer subset without TF peaks was to evaluate the capability to identify additional TFs that have not been previously identified. The enrichment of TAL1/GATA1 motifs in conserved enhancers with TF peaks was consistent with the ChIP-Seq data and was in agreement with their functional roles (score=31.9, p<0.001). In addition to known TFs with ChIP-Seq data, these regions were significantly enriched in binding matrices for TFs known to regulate erythroid cells, including Hba-regulating KLF4 (Marini et al., 2010) and GATA1 interacting NFYA (Huang et al., 2004).

Investigation of conserved RNAPII+/nucRNA- regions devoid of TF peaks identified several ETS domain TFs (Figure 3-8). These included SPI1 (score=133.0, p < 0.001) and ETS1, which is associated with erythroid differentiation (Marziali et al., 2002) (score=26.8, p < 0.001). Interestingly, the results also showed a slight enrichment of ZFX, which is expressed at low levels in erythroid cells and is known to have a role in self-renewal of embryonic and hematopoietic stem cells (Galan-Caridad et al., 2007). The TF motif enrichment through supervised motif analysis demonstrates the potential in identifying erythroid regulating TFs in addition to existing known TFs from ChIP-Seq analysis, which gives a more complete picture of combinatorial TF binding patterns in the putative enhancers. Furthermore these results indicate that a TF independent approach for identifying putative enhancer regions not only provides functional annotation of the genome, but also identifies associated TFs which may be centrally involved in regulating gene expression in the cell type investigated.


Table 3-4. Motifs significantly enriched in subsets of conserved erythroid cell enhancers.

Motif enrichment analysis was performed on two sets of conserved enhancers: one set overlapping TF ChIP-Seq peaks and the other without any TF peaks. The corresponding raw scores and p-values, computed in comparison to random sequences drawn from chr19 and from 5kbp upstream of TSSs, are reported below.

Conserved candidates     Motif          Raw score   P-value (chr19)   P-value (5kbp)
Overlapping TF peaks     KLF4           40.8        0                 0
                         TAL1::GATA1    31.9        0                 0
                         GATA1          27.1        0                 0
                         NFYA           7.7         0.001             0.001
Without TF peaks         SPI1           133.0       0                 0
                         GABPA          115.0       0                 0
                         NFYA           78.0        0                 0
                         KLF4           73.2        0                 0.002
                         FEV            62.9        0                 0
                         SPIB           56.3        0                 0
                         ELF5           47.8        0.001             0
                         ELK4           28.1        0                 0
                         ETS1           26.8        0                 0
                         ELK1           18.7        0                 0
                         ZFX            10.1        0.003             0.003
                         SRY            8.73        0.001             0


Figure 3-8. Illustration of motifs significantly enriched in subsets of conserved erythroid cell enhancers.

The figure shows the motifs enriched in conserved candidates overlapping TF ChIP-Seq peaks on the left and motifs enriched in conserved candidates without TF peaks on the right. TFs in the blue box all contain an ETS domain and have similar motif patterns.


3.2.8. Gene Ontology analysis of putative erythroid enhancers

To probe the functionality of the enhancer candidates, Gene Ontology analysis of the nearest transcribed gene was performed on DAVID (Huang da et al., 2009b). The enriched functional clusters and the corresponding RNAPII-ser5 enrichment values are listed in Table 3-5. The results indicate that, through regulating genes related to essential functions such as nucleotide, RNA and protein-DNA binding and protein catabolic process, the enhancer candidates play an important role in transcriptional and translational regulation. In addition, genes associated with chromatin organization and protein-DNA binding are close to regions of strong RNAPII-ser5 enrichment. Genes associated with the enhancer candidates are also enriched for functions in cell cycle and stress response, and are potentially involved in the mechanism of RNAPII promoter-proximal pausing previously reported at cell cycle and stress response related genes, which allows rapid temporal and spatial changes in gene expression (Muse et al., 2007). As a proportion of the RNAPII+/nucRNA- regions do overlap gene TSSs, this subset may represent paused polymerase at the TSS of poised genes, accounting for the enrichment in stress response related genes. Tissue-expression enrichment analysis reveals that enhancer candidates are near genes up-regulated in thymus and bone marrow tissues, where erythroid cells are present.


Table 3-5. Gene Ontology enrichment of the nearest transcribed genes of putative enhancers in erythroid cells.

Clusters containing GO biological processes and/or GO molecular functions and any terms with Benjamini corrected p-value < 0.05 are reported. * The RNAPII-ser5 enrichment column reports the maximum of average RNAPII-ser5 enrichments for functions within each cluster.

Functional clusters                                Cluster enrichment   RNAPII-ser5 enrichment*
Cell cycle                                         7                    36.51
Protein catabolic process                          5.71                 24.39
RNA binding                                        5.03                 38.86
Chromatin organization and protein-DNA binding     4.71                 195.69
RNA splicing & processing                          4.68                 28.4
Translation                                        4.48                 33.14
Chromosome segregation                             3.55                 68.58
Stress response                                    3.33                 30.38
Nucleotide-binding                                 3.25                 35.84
Metal ion binding                                  2.91                 30.96


3.3. Discussion

In this chapter, I assessed the potential of a single-feature marker, RNAPII-ser5 enrichment in the absence of nucRNA, to predict enhancers in mouse erythroid cells. This marker was investigated due to the presence of RNAPII-ser5 peaks at the Hbb LCR, which was used as an example of a cluster of distal enhancers. Downstream analyses of putative enhancers revealed both significant association with conserved regions and significant enrichment of erythroid cell-expressed TF binding sites. As further validation, 4 out of 6 candidates showed enrichment for H3K4me1 by ChIP-qPCR. These analyses support the regulatory potential of the identified putative enhancers.

Enhancer candidates significantly overlapped TF peaks obtained from 7 erythroid cell-expressed TFs for which ChIP-Seq data was available in mature erythroid cells. A large proportion of putative enhancers did not overlap TF binding sites from the available data. This finding suggests that the current data available in erythroid cells does not cover all regulatory TFs involved in gene regulation in erythroid cells. Through motif enrichment analysis, I identified well known and potential erythroid cell-regulating TFs that have not yet been investigated by ChIP-Seq.

The multiple TF peaks at and around enhancer candidates are in agreement with combinatorial binding of TFs and self-association of specific TFs (GATA1 for example). Notably, multiple spaced-out TF binding sites or clusters are found in between the distal enhancers of Lmo2, Pim1 and Klf13, and their respective gene promoters. This regular binding of TFs may reflect a chromatin looping conformation that brings enhancers and promoters in proximity. The enhancer upstream of Lmo2 has been confirmed to contact the Lmo2 promoter by 3C (Bhattacharya et al. manuscript in preparation) which warrants further investigation of Pim1 and Klf13 enhancers through 3C techniques.

Intergenic transcripts have been reported around enhancers and at the human Hbb LCR (Ashe et al., 1997; Kim et al., 2010). In the nucRNA-Seq data used, no transcription above background was detected at the Hbb LCR, which informed the choice of RNAPII+/nucRNA- regions as an enhancer marker. The absence of nuclear RNA at enhancer candidates is perhaps due to the stability, size or abundance of enhancer RNAs, which makes them more difficult to detect during high-throughput sequencing library preparation or at the current sequencing depth.


Gene Ontology analysis revealed functional enrichment of transcriptional and translational processes. Genes associated with chromatin organization and protein-DNA binding have the highest RNAPII-ser5 enrichment at enhancers indicating the potential role of enhancers on epigenetic and transcriptional regulation of gene expression. Furthermore, gene expression enrichment in erythroid-associated tissues demonstrates tissue-specific regulation of the enhancer candidates.

This study provided insight into the feasibility of comparing data sets from different origins (Table 2-1) and experiments in erythroid cells. Due to the limitation of data availability, enhancers were identified using one out of a variety of possible enhancer signatures. Such an approach can be biased when the feature is an incomplete or non-exclusive representation of enhancers. For example, the functional enrichment of cell cycle and stress response, previously reported to associate with transcriptional pausing (Muse et al., 2007), is likely a result of RNAPII pausing at proximal promoters rather than enhancer activity. More enhancer-associated data sets in erythroid cells can be incorporated as they become available to deal with the issue. To systematically improve prediction accuracy and recall, I subsequently take an unbiased modeling approach to enhancer prediction by using the abundant public ChIP-Seq data sets available for mouse ES cells in Chapter 4.


Chapter 4

Modeling Approaches to Enhancer Identification in Mouse Embryonic Stem Cells


4.1. Introduction

Stem cells are well known for their pluripotency, defined as the potential to develop into all of the different cell types in the adult organism, and for self-renewal, the maintenance of the undifferentiated state through repeated cell divisions. The transcriptional regulatory circuitry of mouse ES cells is well established and involves several ES cell-expressed TFs including OCT4, SOX2, NANOG, KLF4, c-MYC, ESRRB, CHD7 and ZIC3 (Chen et al., 2008; Lim et al., 2007; Rao and Orkin, 2006; Schnetz et al., 2010). In fact, ectopic expression of four factors, Oct4/Sox2/Klf4/c-Myc, can induce differentiated cells to a state of pluripotency (Takahashi and Yamanaka, 2006). Epigenetic states can be reprogrammed through this process as pluripotency is established. At gene promoters, H3K4me3 is generally associated with transcribed genes while H3K27me3 is found at silent genes. In ES cells, bivalent domains containing both H3K4me3 and H3K27me3 have been identified at the promoters of genes poised for expression during development (Bernstein et al., 2006). Recent reports have also identified poised marks at enhancer regions (H3K4me1 and H3K27me3) and have proposed that these enhancers regulate gene expression during development rather than in ES cells (Creyghton et al., 2010; Rada-Iglesias et al., 2011). This chapter presents enhancer identification in mouse ES cells using the ChIP-Seq data available for this intensively studied cell type.

Previous enhancer identification approaches showed promising performance (introduced under section 1.2.4) and I identified known and novel putative enhancers in erythroid cells using the RNAPII+/nucRNA- marker in the previous chapter. However, a potential issue with non-integrated approaches is that each feature may be an incomplete or non-exclusive representation of enhancers. Using p300 binding sites, for example, limits the identification of enhancer regions to regions bound by p300-interacting TFs. Although sequence conservation is frequently used to identify regulatory elements, ultra-conservation has been reported to identify a small subset of developmental enhancers (Visel et al., 2008). In addition, a large population of validated heart enhancers are less deeply conserved in vertebrate evolution (Blow et al., 2010), providing evidence for variation in the degree of evolutionary conservation at enhancers. Furthermore, although the MTL approach using OSN ChIP-Seq co-bound regions obtained the highest precision (25/25 had enhancer activity), identifying enhancers in this way is confined by the knowledge of relevant regulatory TFs in specific cells. A method that integrates and assesses all these features is necessary to elucidate the importance of individual enhancer signatures and increase the accuracy of prediction.

In contrast to the biologically directed approach based on one marker in Chapter 3, I first demonstrate the importance of feature selection using Naive Bayes classifiers. The approach in this chapter identifies key signatures through integrative modeling, uses these signatures to predict functional enhancers, and subsequently identifies TFs with motif enrichment in these regions as potentially active TFs in mouse embryonic stem cells. OSN co-bound regions throughout the genome are used as the enhancer training set. Through probabilistic modeling, predicted enhancer candidates are not limited to OSN co-bound regions, and are expanded to regions with similar signature states. A series of downstream analyses follows to assess the enhancers identified, and additional comparisons to other relevant concepts are made to put the findings in biological context.

#This work was carried out under computational guidance from Dr. Quaid Morris in Department of Computer Science through collaboration.


4.2. Results

In this study, I used RNA sequencing data and 30 public domain high throughput ChIP sequencing data sets in mouse ES cells including: 12 transcription factors (TFs) (Chen et al., 2008), 8 histone modifications (Mikkelsen et al., 2007), 3 polymerase occupancy (Rahl et al., 2010) and 7 chromatin associated proteins (Chen et al., 2008; Kagey et al., 2010) (Table 2-2). In addition to sequencing data sets, 5 genomic features - CpG islands, SNP, repeat regions, G+C content and PhastCons most conserved regions - are also incorporated in the analysis.

To prevent cell type specificity in enhancer identification, only non-TF data sets were evaluated as enhancer markers. The TF data sets were used for either training or validation purposes. As 25/25 regions co-bound by the pluripotency transcription factors OCT4, SOX2 and NANOG (OSN) were shown to have enhancer activity and 8 out of 8 c-MYC and N-Myc (MYC) co-bound regions had limited or no enhancer activity (Chen et al., 2008), OSN co-bound regions and MYC co-bound regions are used as the enhancer positive and negative training sets, respectively. Furthermore, MYC co-bound regions tend to be close to TSSs, so 1kb genomic bins are classified into 3 categories: enhancer (Enh), promoter-like (PrL) and unknown.
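The assignment of training labels to 1kb bins can be sketched as follows. This is a minimal illustration on a single chromosome with hypothetical coordinates; the function names and the tie-breaking rule (a bin touched by both region types is labelled Enh) are assumptions, not details taken from the thesis.

```python
BIN_SIZE = 1000  # 1kb genomic bins, as described in the text


def bins_overlapping(start, end, bin_size=BIN_SIZE):
    """Indices of the bins touched by the half-open interval [start, end)."""
    return range(start // bin_size, (end - 1) // bin_size + 1)


def label_bins(n_bins, osn_regions, myc_regions):
    """Label each bin as 'Enh', 'PrL' or 'unknown' from co-bound regions."""
    labels = ["unknown"] * n_bins
    for start, end in myc_regions:
        for b in bins_overlapping(start, end):
            if b < n_bins:
                labels[b] = "PrL"
    # Assumption: OSN regions are applied last, so they win on conflicts.
    for start, end in osn_regions:
        for b in bins_overlapping(start, end):
            if b < n_bins:
                labels[b] = "Enh"
    return labels


# Hypothetical regions on one chromosome of six bins (0-6 kb).
labels = label_bins(6, osn_regions=[(1200, 1400)], myc_regions=[(3900, 4100)])
```

In this toy example the MYC region straddles a bin boundary, so two adjacent bins receive the PrL label.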

4.2.1. Feature extraction improves enhancer prediction

Multiple combinations of features have been used to predict enhancers; however, one-by-one filtering assumes all features are equally important, and can also introduce bias when features are highly correlated or redundant. To simulate filtering approaches with probability, Naive Bayes classification was performed using various combinations of features. Using 10-fold cross-validation on the training set data, assessment of the model was carried out with the following indices: deviance, area under ROC curve (AUC), precision, modified precision, and recall (detailed in Chapter 2) (Figure 4-1).
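The idea of comparing feature combinations with Naive Bayes and cross-validation can be sketched in a few lines. This is an illustrative toy only: the features are binarized, the data are synthetic, only two classes and two of the five indices (precision and recall) are shown, and all function names are hypothetical rather than the ones used in this study.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels, features):
    """Fit a Bernoulli Naive Bayes model on the selected binary features."""
    class_counts = Counter(labels)
    ones = defaultdict(lambda: [0] * len(features))
    for row, lab in zip(rows, labels):
        for j, f in enumerate(features):
            ones[lab][j] += row[f]
    model = {}
    for lab, n in class_counts.items():
        # Laplace smoothing keeps the probabilities away from 0 and 1.
        probs = [(ones[lab][j] + 1) / (n + 2) for j in range(len(features))]
        model[lab] = (math.log(n / len(labels)), probs)
    return model


def predict_nb(model, row, features):
    """Return the class with the highest log posterior (first class on ties)."""
    best_label, best_score = None, -math.inf
    for lab, (log_prior, probs) in model.items():
        score = log_prior
        for j, f in enumerate(features):
            score += math.log(probs[j] if row[f] else 1 - probs[j])
        if score > best_score:
            best_label, best_score = lab, score
    return best_label


def cross_validate(rows, labels, features, k=10, positive="Enh"):
    """k-fold cross-validated (precision, recall) for the positive class."""
    predicted = [None] * len(rows)
    for fold in range(k):
        train = [i for i in range(len(rows)) if i % k != fold]
        model = train_nb([rows[i] for i in train],
                         [labels[i] for i in train], features)
        for i in range(fold, len(rows), k):
            predicted[i] = predict_nb(model, rows[i], features)
    pairs = list(zip(labels, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Synthetic bins: 'p300' tracks the label exactly, 'noise' is unrelated to it.
labels = ["Enh" if i < 20 else "unknown" for i in range(40)]
rows = [{"p300": int(i < 20), "noise": i % 2} for i in range(40)]

with_p300 = cross_validate(rows, labels, ["p300"])
noise_only = cross_validate(rows, labels, ["noise"])
```

On this toy data the informative feature separates the classes, whereas the uninformative one cannot, which mirrors why the choice of feature combination matters.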

The results demonstrate the importance of feature extraction for enhancer prediction. Classification using all features (rank 19) does not guarantee the best classification, as it obtains the highest deviance, and lower precision and recall values compared to other combinations of features. Such classification bias can be caused by redundancy between features and by the introduction of noise from non-informative features.


Through adding an informative feature, for example, G+C percent, the AUC, precision and recall can be increased (rank 2 to 1). Although the indices do not always agree, deviance was also slightly improved when adding G+C content to the combination of p300, Med12, H3K4me1 and H3K4me3. Furthermore, the classifier ranked 7, which contains one additional feature, CpG island, has the highest recall but loses precision.

I further assessed the strengths and weaknesses of previously used enhancer markers. Although p300 is extensively used to identify enhancers and has the highest modified precision in combination with CpG islands (rank 21), the recall is very low, which is likely caused by insufficient coverage of p300 ChIP-Seq data or incomplete representation of enhancers by p300 (He et al., 2011). The combination of mono- and tri-methylation of H3K4, previously used to distinguish enhancers from promoters, performed poorly with low precision and recall (rank 27). The poor performance is likely due to noise introduced by the broad pattern in enrichment of histone modifications or the slight accumulation of H3K4me3 frequently observed in enhancer regions.

Although sequence conservation has been previously used to predict enhancers, adding most conserved regions into feature combinations can worsen the model prediction (rank 2 to 6; 13 to 20). An explanation for this observation is that some enhancer regions are highly conserved as are gene promoter regions and coding sequences which results in inaccurate categorization. Surprisingly, the classifier using most conserved region, RNAPII-ser5 and RNA ranked lowest in AUC, precision, and recall.

Taking the Naive Bayes classification approach, I have shown the importance of feature extraction in enhancer identification through assessing various marker combinations. Although this approach determined the most appropriate combination of features for optimal enhancer identification, each feature is still given equal weight in the model and assumed to be independent. As each feature may contribute to different degrees in the prediction, and some features are not independent of each other (e.g. histone modifications), I further used a lasso regularized multinomial logistic regression model to systematically obtain ranked contributions to enhancer classification.


Figure 4-1. Assessing feature combinations as enhancer signatures using Naive Bayes.

The first 11 columns depict the features used in each given row (Naive Bayes classifier) and the 12th (Others) column represents the rest of the features listed in Table 2-2. The capability of each classifier in categorizing OSN regions (Enh training set) from MYC regions (PrL training set) and unknown is assessed using 10-fold cross validation. The last five columns listing the average deviance (AvgDev), area under ROC curve (AvgAUC), precision, modified precision* and recall values are color-coded with red indicating good model prediction and blue indicating poor prediction. Naive Bayes classifiers with different feature combination are sorted by the average ranking of the 5 indices.


4.2.2. Systematic ranking of enhancer signatures

Having established the importance of feature selection, I next examined all features systematically. As some of the features used are dependent, specifically histone modifications, I carried out lasso regularized multinomial logistic regression to assess the features and obtain their ranked contributions to enhancer classification. Lasso regularization (Friedman et al., 2010; Tibshirani, 1996) introduces a lambda penalty factor to shrink feature weights, so relevant signatures are given appropriate weights. Feature weights corresponding to specific log lambda values for each category are shown in Figure 4-2a. By increasing the penalty parameter lambda, weights of less informative features for each category shrink to 0, whereas weights of informative features remain non-zero. The higher the absolute weight, the more influence the feature has on the current model prediction, while positive and negative weights indicate enrichment and absence in the category, respectively.
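How an increasing lambda drives weights to zero can be illustrated with soft-thresholding, the basic shrinkage step behind lasso solvers. The sketch below is a simplification under stated assumptions: the starting weights are hypothetical, each weight is shrunk independently, and no model is actually refit, whereas a real lasso path re-estimates all weights jointly at each lambda.

```python
def soft_threshold(weight, lam):
    """Shrink a weight toward zero by lam; weights within [-lam, lam] become 0."""
    if weight > lam:
        return weight - lam
    if weight < -lam:
        return weight + lam
    return 0.0


def surviving_features(weights, lam):
    """Features whose shrunken weight is still non-zero at penalty lam."""
    shrunk = {name: soft_threshold(w, lam) for name, w in weights.items()}
    return {name: w for name, w in shrunk.items() if w != 0.0}


# Hypothetical unpenalized weights for one category (illustration only).
unpenalized = {"p300": 2.0, "H3K4me1": 1.2, "Med12": 0.9,
               "SNP": 0.1, "repeat": -0.3}

# Features that remain in the model as the penalty grows.
path = {lam: sorted(surviving_features(unpenalized, lam))
        for lam in (0.05, 0.5, 1.0, 1.5)}
```

Features that keep a non-zero weight at the largest penalties are the ones ranked highest, which is the reading applied to the weight traces in Figure 4-2a.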

The most positively predictive features for Enh regions are p300 and H3K4me1, which have been used in previous methods for enhancer prediction (Heintzman et al., 2007; Visel et al., 2009a; Won et al., 2010). The component of the mediator complex, Med12, is ranked third and has been shown to associate with enhancers involved in chromatin looping to promoter regions (Kagey et al., 2010). As the PrL training regions are co-bound by c-Myc and N-Myc, which are frequently located around TSSs (Figure 4-2a), strong associations with CpG islands, RNAPII-ser5, H3K4me3 and G+C percent are observed, in agreement with promoter characteristics. The weights of p300 and CpG islands are much higher than those of other features, indicating the significance of these features in the Enh and PrL sets. A negative weight of p300 in the unknown group indicates a lack of p300 enrichment in that group. Through cross-validation evaluation, the ten features shown in Figure 4-2b were taken as classification signatures for the rest of the downstream analyses (Appendix 3; log lambda = -4).


Figure 4-2. Feature ranking determined using multinomial logistic regression with lasso regularization.

(a) Feature weights in each class with respect to logged lambda, which is a penalty factor to shrink feature weights. Weights of features less discriminative of the three categories are shrunk to 0 as lambda is increased. Top ranking features are those with non-zero weights at high lambda values: p300, H3K4me1 and Med12 for the Enh group; CpG islands, H3K4me3, G+C percent and RNAPII-ser5 for the PrL group. (b) Signatures used in the chosen model, log lambda = -4, are shown in each category with their degree of contribution.

4.2.3. Classified enhancer and promoter-like candidates

A total of 19200 1kb regions were predicted to be Enh, 67672 were predicted to be PrL, and 2567872 regions were predicted as unknown using the lasso regularized model (Appendix 4). As Enh candidates are ranked by probabilities, more stringent thresholds can be applied to obtain higher-confidence enhancer candidates. All luciferase-validated positive regions from Chen et al. (2008) are predicted as Enh (for higher stringency: 24 out of 25 regions with Enh prob > 0.8), whereas all luciferase-validated negative regions are predicted as PrL in the model (7 out of 8 regions with PrL prob > 0.8) (Appendix 5).

In addition, the probabilities of putative enhancer candidates are significantly correlated to an independent luciferase assay data set (Schnetz et al., 2010). The assay on 67 regions was conducted to validate enhancer candidates bound by CHD7, a transcription factor that targets active gene enhancer elements to modulate ES cell-specific gene expression. The log probability of enhancers from the model is significantly correlated to the logged relative luciferase (ρ=0.44, p value=0.0002).
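The correlation between model probabilities and luciferase activity can be sketched as below. The text does not name the coefficient behind ρ; this sketch assumes Spearman's rank correlation, and the paired values are hypothetical, not the 67 regions of the Schnetz et al. (2010) assay.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical Enh probabilities and relative luciferase values.
enh_prob = [0.95, 0.60, 0.85, 0.30, 0.99]
rel_luciferase = [8.0, 1.5, 3.0, 1.1, 20.0]

rho = spearman([math.log(p) for p in enh_prob],
               [math.log(v) for v in rel_luciferase])
```

Because a rank correlation only depends on ordering, the log transforms do not change its value; they would matter if a Pearson correlation were intended instead.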

A heatmap of the 50 top Enh and PrL candidates demonstrates the distinction of signature features at predicted candidates (Figure 4-3). The Enh and PrL signature features are well separated with hierarchical clustering except for H3K4me2, the lowest ranked feature in the PrL category. Although previously suggested to be tightly associated with TSSs of genes, as is H3K4me3, a subset of H3K4me2 sites devoid of H3K4me3 was found to be near or in the gene bodies of tissue-specifically expressed genes (Bernstein et al., 2005; Pekowska et al., 2010). This may have contributed to the ambiguity of H3K4me2 in the model. A probabilistic modeling approach, in contrast to an overlapping enrichment approach, allows variations in positively predictive features within Enh candidates; therefore, not all features need to be highly enriched in the same region for high probability prediction.

Figure 4-3. Heatmap of features used in lasso regression for top 50 enhancer and promoter-like candidates.

The dark red and blue side bar on the left denotes putative Enh and PrL 1kb genome bins, whereas the dark red and blue side bar on the top denotes indicative Enh and PrL feature sets. Feature values are scaled to exhibit the contrast between the Enh and PrL. The clustering of features revealed the ambiguity of H3K4me2, and indicated the advantage of probabilistic approach by allowing flexibility in feature enrichment to categorization.


4.2.4. Enhancer and promoter-like candidates coordinately regulate gene expression

I examined the distribution of Enh and PrL candidates in the genome and found that Enh and PrL candidates are differentially distributed relative to annotated genes (Figure 4-4a,b). As expected, more Enh candidates are located in intergenic regions than PrL candidates, which more frequently overlap TSSs. To examine the distribution of Enh and PrL candidates more specifically with respect to TSSs, I calculated the distance to the closest TSS for each candidate. Enh candidates tend to be further away from TSSs compared to PrL candidates (p<2.2e-16, Figure 4-4b). As the probability threshold for Enh and PrL candidates is increased, the median distance from TSSs increases for Enh candidates and decreases for PrL candidates.
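The distance calculation can be sketched with a binary search over sorted TSS positions. This is a minimal single-chromosome illustration with hypothetical coordinates; representing each candidate bin by its midpoint is an assumption made here for simplicity.

```python
from bisect import bisect_left
from statistics import median


def distance_to_closest_tss(position, sorted_tss):
    """Absolute distance from a position to the nearest TSS (sorted list)."""
    if not sorted_tss:
        raise ValueError("need at least one TSS")
    i = bisect_left(sorted_tss, position)
    distances = []
    if i < len(sorted_tss):
        distances.append(sorted_tss[i] - position)      # nearest TSS to the right
    if i > 0:
        distances.append(position - sorted_tss[i - 1])  # nearest TSS to the left
    return min(distances)


# Hypothetical TSS positions and candidate bin midpoints on one chromosome.
tss = [1000, 50000, 120000]
enh_midpoints = [30500, 90500]
prl_midpoints = [1500, 49500]

enh_median = median(distance_to_closest_tss(p, tss) for p in enh_midpoints)
prl_median = median(distance_to_closest_tss(p, tss) for p in prl_midpoints)
```

Repeating the median calculation on subsets filtered by probability threshold gives the trend described above.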

To assess the regulatory potential of the Enh and PrL candidates, I assigned candidates to the closest gene TSS, and compared the gene expression distribution among subsets of genes associated with both Enh and PrL (denoted as Enh&PrL), either Enh or PrL, and genes without an associated Enh or PrL candidate (denoted as “None”) (Figure 4-4c). Distributions of gene expression in categories containing an Enh or PrL are significantly higher than the None set. While PrL alone conferred significantly higher expression than Enh alone (p=1e-47), Enh&PrL set showed significantly higher expression than that of PrL-only genes (p=2e-18). These findings suggest that Enh and PrL candidates co-ordinately regulate transcription of a subset of target genes while other genes are regulated solely by PrL signatures in ES cells.
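The grouping of genes by the candidates assigned to them can be sketched as follows. Gene names and RPKM values are hypothetical, and the function name is illustrative; the category names follow the text.

```python
from statistics import median


def categorize_genes(all_genes, enh_nearest_genes, prl_nearest_genes):
    """Assign each gene to 'Enh&PrL', 'PrL', 'Enh' or 'None'.

    The two 'nearest' arguments list, for every Enh or PrL candidate,
    the gene whose TSS is closest to it (genes may repeat).
    """
    enh, prl = set(enh_nearest_genes), set(prl_nearest_genes)
    categories = {}
    for gene in all_genes:
        if gene in enh and gene in prl:
            categories[gene] = "Enh&PrL"
        elif gene in prl:
            categories[gene] = "PrL"
        elif gene in enh:
            categories[gene] = "Enh"
        else:
            categories[gene] = "None"
    return categories


# Hypothetical genes, candidate assignments and expression values (RPKM).
genes = ["geneA", "geneB", "geneC", "geneD"]
rpkm = {"geneA": 40.0, "geneB": 12.0, "geneC": 3.0, "geneD": 0.5}
categories = categorize_genes(genes,
                              enh_nearest_genes=["geneA", "geneA", "geneC"],
                              prl_nearest_genes=["geneA", "geneB"])

# Median expression per category, the quantity compared between sets.
by_category = {}
for gene, cat in categories.items():
    by_category.setdefault(cat, []).append(rpkm[gene])
median_rpkm = {cat: median(vals) for cat, vals in by_category.items()}
```

The expression distributions of the resulting sets can then be compared with a two-sample test of choice.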

As enhancers are associated with tissue specific histone modification, I further investigated whether the Enh&PrL set is associated with tissue-specific gene expression. Through comparison of ES cell gene expression to gene expression in mouse embryoid bodies generated in culture, I found that the percentages of differentially regulated genes within Enh&PrL and PrL-only sets are similar and are both higher than that of Enh-only and None sets. Interestingly, Enh&PrL set contains a higher proportion of ESC up-regulated genes in comparison to other sets; whereas PrL set contains a higher proportion of embryoid body up-regulated genes. The significant enrichment of ESC specific genes further elucidates the tissue specific regulatory potential of Enh and PrL sets.


Figure 4-4. Genomic distribution of candidates and their association with gene expression.

(a) Pie charts representing the genomic distributions of the OSN and MYC training sets, high confidence Enh and PrL candidates (prob ≥ 0.8), and all Enh and PrL candidates. Intergenic regions are defined as regions ≥ 10kbp away from the closest TSS or transcription end site, whereas upstream regions are regions within 10kbp upstream of TSSs. (b) Violin plots demonstrating the distances to the TSS of the closest transcript for each set. The plot for high confidence candidates is on the top (Enh and PrL: prob ≥ 0.8; unknown: top 30k genes) and the plot for the entire set is on the bottom. (c) Boxplots of ESC gene expression of genes closest to various subsets of Enh and PrL candidates. Enh&PrL denotes genes with at least one Enh and at least one PrL closest to their TSSs; the PrL set denotes genes with PrL but not Enh closest to them, and vice versa for the Enh set; the None set denotes genes without any Enh or PrL closest to them. The numbers in brackets below each set show the counts of genes within the category. The y-axis denotes gene expression represented by RPKM and is plotted in log scale. Distributions of gene expression are all significantly higher compared to sets on their right (*** p-values < 10^-7). (d) Percentages of differentially expressed genes between ESC and embryoid body in each category are plotted, with up-regulated genes of ES cells and embryoid bodies above and below the x-axis, respectively.

4.2.5. Top enhancer regions locate near genes encoding transcription factors

In order to probe potential functionalities of the Enh candidates, genes with TSSs closest to the candidates (prob>0.9) were tested for functional enrichment using DAVID (Huang da et al., 2009a, b) followed by clustering of functions with significant numbers of shared genes using Enrichment map (Merico et al., 2010). For comparison purposes, genes closest to PrL candidates with the same probability threshold were also tested for functional enrichment. Interestingly, the Enh set is highly enriched with DNA binding and transcription regulating/factor activities (Figure 4-5).

Twenty-one percent of the genes in the top Enh set are involved in DNA binding, including several TFs involved in ES cell transcriptional regulation: Pou5f1 (Oct4), c-Myc, N-Myc, Sox2, Esrrb, Phc1, Zic3 and Tead1. In contrast, the PrL candidates are enriched with a wide variety of molecular functions in addition to DNA binding, such as RNA binding and processing, translation, and chromatin organization, indicating that these genes are associated with more basal cellular functions. Furthermore, the top 2000 expressed genes in mouse ES cells are enriched in functions similar to those of the PrL set, including both translation and transcription factor activities.

Although a proportion of enhancers regulating genes from a long range distance are likely to be misrepresented, it is interesting that putative enhancers tend to locate around genes with TF functionalities in ES cells, which is likely due to the hierarchy of transcription factor network as OCT4, SOX2 and NANOG are the key regulators. The results indicate that enhancers tend to locate around genes encoding TFs specifically and suggest gene expression of these TFs can be regulated through enhancers in addition to promoters which add complexity and diversity to the gene expression level of TFs.


Figure 4-5. Gene Ontology analysis of the enhancer and promoter-like candidate sets.

Enriched functions of Enh and PrL identified from DAVID (False discovery rate<0.1) are plotted in (a) and (b) using Cytoscape Enrichment map plug-in. Functions are further circled and grouped into general categories labeled on the side. Line thickness between nodes is proportional to number of genes shared between nodes. Colors are used for the purpose of visualization contrast between functional groups. Enh candidates are more specifically associated with DNA binding and transcription regulation, whereas PrL candidates are also associated with other molecular functions such as translation activities, RNA binding, and chromatin organization.

4.2.6. Enhancer candidates are associated with multiple transcription factors

Of the 1277 Enh candidate regions (prob ≥ 0.8), 1065 overlapped with at least one binding site for one of the 7 TFs for which ChIP-Seq data was available in ES cells (Table 2-2, excluding Oct4, Sox2, Nanog, c-Myc and N-Myc data). The percentage overlaps of all categories with regions bound by these 7 TFs are plotted in Figure 4-6a. To check whether MTL (4+ TFs) are enriched in the Enh candidates, I further separated the 3 categories and genomic background into 3 groups: 0 TF peak enrichment, enrichment of 1 to 3 TFs, and enrichment of 4 or more TFs within the window frame. Enh candidates were most highly and significantly enriched with MTL compared to both the PrL and unknown categories (p<1e-50). Although the overlapping count between the PrL set and MTL is greater than that of the Enh set (due to greater numbers in the PrL category), the percent overlap in the Enh set with MTL is significantly higher.
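The three-way grouping by number of bound TFs can be sketched as follows. The TF names are placeholders rather than the 7 TFs of Table 2-2, and the function names are hypothetical.

```python
from collections import Counter

GROUPS = ("0 TFs", "1-3 TFs", "4+ TFs (MTL)")


def tf_group(n_tfs):
    """Group a bin by the number of distinct TFs with a peak in it."""
    if n_tfs == 0:
        return GROUPS[0]
    if n_tfs <= 3:
        return GROUPS[1]
    return GROUPS[2]


def percent_by_group(bin_tfs):
    """Percentage of bins in each group; bin_tfs is a list of TF-name sets."""
    if not bin_tfs:
        raise ValueError("need at least one bin")
    counts = Counter(tf_group(len(tfs)) for tfs in bin_tfs)
    return {g: 100.0 * counts[g] / len(bin_tfs) for g in GROUPS}


# Hypothetical candidate bins with the distinct TFs bound in each.
enh_bins = [
    set(),
    {"TF_A"},
    {"TF_A", "TF_B", "TF_C", "TF_D"},
    {"TF_A", "TF_B"},
]
enh_percent = percent_by_group(enh_bins)
```

Computing the same percentages for the PrL, unknown and whole-genome bins gives the bars compared in Figure 4-6a.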

It is noteworthy that although Enh candidates tend to be further away from TSSs compared to PrL candidates, the Enh set is still more enriched of MTL than the PrL set, indicating potential functionality of Enh candidates. This is in agreement with the observation that a significant proportion of individual TF bound regions are located within the intergenic regions of the genome (Table 1-1). Furthermore, increasing the stringency on Enh probability threshold increases the percentage of overlapping MTLs (Figure 4-6b).


Figure 4-6. Barplots of percent overlap with multiple transcription factor bound loci.

(a) Three regions are separated based on counts of unique TF enrichment using 7 TFs (excluding Oct4, Sox2, Nanog, c-Myc and N-Myc). The Enh candidates are significantly enriched with MTL in comparison to the PrL and unknown sets as well as whole genome bins with p-values <0.001. (b) The overlapping percentages with MTL with various enhancer probability thresholds. The numbers in brackets beside enhancer probability thresholds show the number of regions predicted to be Enh using the indicated threshold.


4.2.7. Mouse ES cell enhancer candidates

4.2.7.1. Previously validated enhancers

With predicted Enh and PrL probabilities, I next looked at characterized enhancer regions in mouse ES cells, which have previously been associated with mediator and cohesin proteins and shown to form chromatin loops with the nearby gene TSSs (Kagey et al., 2010) (Figure 4-7). The Enh set highlights the enhancer regions, while the PrL set highlights the promoters or the gene regions. These experimentally validated promoter-interacting enhancer regions for Pou5f1 (Oct4), Nanog, Phc1 and Lefty1 are all predicted as enhancers with probabilities greater than 0.8 (among a total of 1277 regions). Several significant TF binding peaks predicted from ChIP-Seq data using the SISSRs algorithm (Jothi et al., 2008) overlap with these enhancers, giving indications of functionality. Although the aligned TF peaks at the Oct4 enhancer are located at the boundary of two bins, the model predicted both sides with high probability (prob=0.9487 and 0.8363). At Phc1, I identified the enhancer 28kb upstream of Phc1 (prob=0.8973) as well as two additional contiguous enhancers around 2kb upstream of the Phc1 TSS (prob=0.9982 and 0.9975). Although the 3C experiment from the literature did not use these regions as an anchor, strong interaction with the originally characterized enhancer was detected (Kagey et al., 2010). Moreover, in addition to the known Lefty1 promoter-interacting enhancer, novel contiguous enhancer regions 3-5kb upstream of Lefty2 also show high potential in regulating Lefty2 (prob=0.9107 and 0.8999).

Figure 4-7. Four known mouse ES cell enhancers that interact with nearby promoters through looping mechanisms.

Plot showing previously validated enhancers around Pou5f1 (Oct4), Nanog, Phc1, and Lefty1. The Enh and PrL probabilities of 1kbp bins are shown in red and blue bars, respectively. Only probabilities greater than 0.8 are shown for higher stringency (n=1277 for Enh; n=21581 for PrL), and the y-axis scale is from 0.5 to 1. Additional transcription factor peaks identified using the SISSRs algorithm are illustrated as rectangular boxes to demonstrate overlaps of the enhancers with TFs.

4.2.7.2. Novel enhancer clusters overlapping multiple binding sites

Other interesting Enh candidates around and over 80kbp downstream of the Sox2 gene were also identified by the model (Figure 4-8a). In addition to the role SOX2 plays in regulating the transcriptional program in ES cells, Sox2 is also a key neurodevelopmental gene, and multiple Sox2 enhancers have been identified in various cell types ranging from ES to neural (precursor) and lens epithelial cells (Catena et al., 2004; Inoue et al., 2007; Sikorska et al., 2008; Takemoto et al., 2011; Tomioka et al., 2002; Uchikawa et al., 2003). A diverse array of Sox2 enhancers within a 50kbp region in chicken embryos was shown to have specific activities in various developmental stages and domains (Uchikawa et al., 2003); however, the role of the region 70 kb downstream has not been investigated to date. In agreement with the tissue specificity of enhancers, the N1 and N3 enhancers involved in neural and embryonic visual system development were not predicted to be enhancers by the model (Inoue et al., 2007; Takemoto et al., 2011). On the other hand, the evolutionarily conserved SRR1 enhancer (Catena et al., 2004; Sikorska et al., 2008; Tomioka et al., 2002) (a.k.a. HSI_0.4a) around 4kb upstream of Sox2 was found not only to drive Sox2 expression in neural stem cells, but also to enhance the expression of a reporter gene tenfold in ES cells (Catena et al., 2004; Sikorska et al., 2008; Tomioka et al., 2002). SRR1 and HSI_0.4a overlap entirely with a high confidence Enh candidate (prob=0.9946), whereas another enhancer 4kb downstream of Sox2, SRR2, overlaps partially with a lower confidence Enh (prob=0.621; not shown on the graph due to the 0.8 display cutoff).

In addition to these validated enhancers, I identified more high confidence Enh candidates downstream of Sox2, including an Enh candidate overlapping ESRRB, OCT4 and KLF4 peaks as well as a cluster of Enh candidates around 100kb downstream overlapping with multiple TF peaks. Notably, the furthest downstream enhancer region has an Enh probability ranked in the top 4 (prob=1). A zoomed-in view of the enhancer cluster shows almost no transcription at two GenBank mRNAs and a RefSeq gene, Gm3141, but a relatively low level of transcription is found beside the cluster at BC051220 (Figure 4-8b).

Figure 4-8. Novel putative enhancer regions around the Sox2 gene.

(a) Multiple enhancer candidates are identified up- and down-stream of Sox2. (b) A close-up of the downstream Sox2 enhancer cluster. The Enh and PrL probabilities of 1kbp bins are shown in red and blue bars, respectively. Only probabilities greater than 0.8 are shown for higher stringency (n=1277 for Enh; n=21581 for PrL), and the y-axis scale is from 0.5 to 1. Additional transcription factor peaks identified using the SISSRs algorithm are illustrated as rectangular boxes to demonstrate overlaps of the enhancers with TFs. Coverage plots of features are shown at the bottom.

The miR-290 cluster, though not sufficient to maintain pluripotency alone, is found to inhibit ES cell differentiation (Lichner et al., 2011; Zovoilis et al., 2009). Four out of seven high probability Enh regions (prob≥0.8) upstream of the miR-290 cluster overlap MTL, and the enhancer with the p300 peak also has an Enh probability of 1, ranking in the top 4 (Figure 4-9; Appendix 6a). These Enh probabilities generally reflect the number of TF peak overlaps. Although the left-most Enh candidate contains c-Myc and N-Myc binding peaks, it has an Enh probability of 0.888, indicating that its chromatin and genomic states still more closely match those of the Enh set.

In addition to the obvious Enhs with OSN co-bound sites or the Enhs nearby, I have also identified potential enhancers around Tet1/U6, Zic3, and C80913 that overlap with at most one of OSN. The seven contiguous enhancers identified around the U6 small nuclear RNA and upstream of Tet1 overlap with several TF peaks determined by SISSRs and with high confidence CHD7 binding peaks obtained from Schnetz et al. (Schnetz et al., 2010) (Figure 4-9; Appendix 6b). As the RNA-Seq experiment by Guttman et al. (Guttman et al., 2010) used poly-A selected total RNA, the transcription level of U6, which is known to be non-polyadenylated, remains unclear. The TET1 protein modifies DNA methylation status by converting 5-methylcytosine (5mC) to 5-hydroxymethylcytosine (5hmC), potentially promoting DNA demethylation in mouse ESCs (Tahiliani et al., 2009), and it functions to regulate the lineage potential of ESCs (Koh et al., 2011). Although multiple NANOG peaks overlap the enhancers, Nanog RNAi was found to have no effect on Tet1 gene expression (Koh et al., 2011). Further investigation using 3C will be necessary to elucidate the target of these enhancer candidates. Interestingly, in agreement with the finding that Oct4 RNAi significantly down-regulated Tet1 in mouse ES cells (Koh et al., 2011), the PrL region overlapping the mm9-aligned Homo sapiens TET1 gene contains an OCT4 binding peak. The strong enrichment of H3K4me3 and RNAPII-ser5 at the aligned TSS of Homo sapiens TET1, as opposed to the TSS of Mus musculus Tet1, suggests that I identified an upstream promoter for Tet1 that is active in mouse ES cells.

Figure 4-9. Other novel putative enhancer regions in mouse ES cells.

Plot showing novel putative enhancers around miR-290, Tet1/U6, Zic3, and C80913. The Enh and PrL probabilities of 1kbp bins are shown in red and blue bars, respectively. Only bins with probabilities greater than 0.8 are displayed for higher stringency, and the y-axis scale is from 0.5 to 1. Additional transcription factor peaks identified using the SISSRs algorithm are illustrated as rectangular boxes to demonstrate overlaps of the enhancers with TFs. mTet1: mouse Tet1 gene; hTET1: human TET1 gene aligned to the mouse mm9 genome.

4.2.7.3. Novel enhancers with fewer transcription factor binding peaks

Even though the modeling approach learns the states of histone modification and other chromatin features around OSN co-bound regions to predict enhancers, the prediction is not limited to regions co-bound by all three TFs. For example, the upstream enhancer candidate of Zic3 (prob=0.92), which overlaps only a NANOG bound region, has been validated with nine-fold up-regulation by luciferase assay compared to a minimal Oct4 promoter (Lim et al., 2007). In addition, deletion of the NANOG binding site within the enhancer showed four-fold down-regulation, indicating that Zic3 expression is directly regulated by NANOG. The other enhancer downstream of Zic3, with a higher Enh probability of 0.96, has not been validated (Figure 4-9; Appendix 6c). Note that this Enh candidate overlaps with only an E2F1 and a CHD7 binding peak. Furthermore, the putative enhancer 10kb upstream of C80913 overlaps only with a CHD7 peak (Figure 4-9; Appendix 6d). Although KLF4, E2F1, ESRRB and STAT3 are enriched in the 1kbp enhancer bin, the SISSRs algorithm did not find any TF peaks around the enhancer candidate.

4.2.8. Transcription factor motif enrichment at ES cell enhancer candidates

4.2.8.1. ES cell related transcription factors

Although the modeling approach was trained around OSN co-bound regions, the prediction is not limited to sites bound by these TFs. In fact, of the 1277 regions with enhancer probability ≥0.8, 522 overlap with OSN co-bound regions and 394 overlap with at least 4 of the other 7 TFs, 136 of which are not co-bound by OSN. Since over one fourth of the highly ranked MTL regions are non-OSN co-bound enhancer candidates, the approach does not exclusively identify high potential enhancers that are co-bound by OSN, even though such regions were used as the Enh training set. Overall, 281 regions are not associated with OCT4, SOX2 or NANOG, and 97 Enh candidates are not associated with any of the 12 TFs. As a number of Enh candidates do not overlap a region bound by any of the 12 TFs for which ChIP-Seq data is available in ES cells, I next investigated which TFs may be binding these regions.
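The proportions behind these statements can be checked with a short calculation. This is only an arithmetic sketch using the counts quoted in the paragraph above; the variable and function names are hypothetical.

```python
# Counts quoted in the text for Enh candidates with probability >= 0.8
TOP_ENH = 1277        # all top Enh candidates
MTL = 394             # overlap at least 4 of the 7 non-OSN TFs
MTL_NOT_OSN = 136     # MTL regions not co-bound by OSN
NO_OSN = 281          # not associated with OCT4, SOX2 or NANOG
NO_TF = 97            # not associated with any of the 12 TFs


def fraction(part, whole):
    """Return part/whole as a float, guarding against an empty denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


print(f"MTL regions not co-bound by OSN: {fraction(MTL_NOT_OSN, MTL):.1%}")
print(f"Top candidates without any OSN factor: {fraction(NO_OSN, TOP_ENH):.1%}")
print(f"Top candidates without any of the 12 TFs: {fraction(NO_TF, TOP_ENH):.1%}")
```

The first ratio (136 of 394) is the quantity referred to as "over one fourth" of the highly ranked MTL regions.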

To further explore the potential binding sites of novel mouse ES cell-related TFs in putative enhancers, I carried out a supervised motif analysis to confirm the ability to pick up motifs of important ES cell TFs and to look for novel TFs that are likely to be bound at these enhancer regions. The significance of motif enrichment is assessed through comparison with random sequences drawn from PrL candidates, chr19 and 5kb upstream of TSSs using the Clover algorithm (Frith et al., 2004). Significantly enriched motifs are listed in Table 4-1. The mouse ES cell-related TFs with ChIP-Seq data sets, KLF4, SOX2, POU5F1 (OCT4), ESRRB and STAT3, are ranked within the top 11. Furthermore, almost all of them have high gene expression in mouse ES cells and have at least one high probability Enh candidate closest to their TSSs. In fact, Klf4 and Nanog both have a putative enhancer nearby, but the links between the genes and enhancer regions are missing due to non-coding Ensembl transcripts in proximity to the candidates. Such results confirm the ability to identify active TFs in the target cell type, and indicate promising potential for identifying novel TFs regulating stem cell biology.

Table 4-1. Motif enrichment in mouse ES cell putative enhancers.

The motifs of the listed TFs are significantly enriched with p < 0.01 in putative Enh regions compared to random sequences drawn from the PrL set, chr19 and promoter 5kb regions. The "Max Enh prob" column lists the maximum enhancer probability if there exists at least one Enh closest to the TSS of the corresponding gene. The RPKM column reports the absolute gene expression of each TF in mouse ES cells. The PSSM column acknowledges the source of the TF binding matrices. † Several putative enhancers located 50kb downstream of Klf4 are assigned to another non-protein-coding transcript. The '#' sign denotes the TFs with ChIP-Seq data sets in mouse ES cells.

Motif          Raw score   Max Enh prob   RPKM                                       PSSM
SP1            339         0.8099         33.5                                       (Portales-Casamar et al., 2010)
Klf4#          154         †              126.07                                     (Portales-Casamar et al., 2010)
Sox2#          140         1              942.79                                     (Portales-Casamar et al., 2010)
Pou5f1#        128         0.9487         1318.38                                    (Portales-Casamar et al., 2010)
Sox4_1         89.6        --             37.06                                      (Badis et al., 2009)
Sox11_1        86.5        0.6035         2.86                                       (Badis et al., 2009)
Esrrb#         72.6        0.9999         162.78                                     (Portales-Casamar et al., 2010)
Klf7_1         61          --             4.94                                       (Badis et al., 2009)
Esrra_1        37          --             4.02                                       (Badis et al., 2009)
NR4A2          33.7        --             0.03                                       (Portales-Casamar et al., 2010)
Stat3#         31.6        0.5795         29.92                                      (Portales-Casamar et al., 2010)
Zic2_2         31.3        0.4291         13.76                                      (Badis et al., 2009)
Zic1_2         29.6        0.5901         0.06                                       (Badis et al., 2009)
Rara_1         28.3        0.5909         30.28                                      (Badis et al., 2009)
Ascl2_1*       25.1        --             0.72                                       (Badis et al., 2009)
Nr2f2_1        19.9        0.5537         0.07                                       (Badis et al., 2009)
Zic3_2         14          0.9550         15.27                                      (Badis et al., 2009)
RORA_1         7           0.5016         0.13                                       (Portales-Casamar et al., 2010)
NFE2L2         6.28        0.5947         68.09                                      (Portales-Casamar et al., 2010)
RXR::RAR_DR5   4.89        --             5.34 (Rxra); 17.55 (Rxrb); 30.28 (Rara)    (Portales-Casamar et al., 2010)
TEAD1          3.91        0.8987         66.24                                      (Portales-Casamar et al., 2010)

The top motif enrichment candidate, SP1, is a transcription factor essential for early embryonic development (Marin et al., 1997), and its binding sites have been reported to protect CpG islands from methylation (Brandeis et al., 1994; Macleod et al., 1994; Xu et al., 2009). Although not enhancer specific, there is evidence of SP1 and SP3, another member of the SP family, binding within 380bp and 76bp upstream of the Oct4 and Nanog TSSs, respectively (Pesce et al., 1999; Wu and Yao, 2006; Yang et al., 2005). In addition to the well studied TFs in ES cells, ZIC3, reportedly required for maintenance of pluripotency in mouse ES cells (Lim et al., 2007), has recently been identified to directly activate Nanog (Lim et al., 2010). Not only is the ZIC3 motif enriched in the motif analysis on enhancer candidates, but I have also reported two Enh regions around Zic3 (prob>0.9). The gene expression level of this pluripotency regulator ranks in the top 27% of absolute expression in mouse ES cells. NRF2, encoded by the Nfe2l2 gene, acts as a master regulator of the antioxidant response, and its deficiency results in embryonic lethality and severe oxidative stress (Leung et al., 2003). Nfe2l2 is also expressed in mouse ES cells, and can potentially be active in response to oxidative stress in ES cells. TEAD proteins, also known as the transcriptional enhancer factor family, are widely expressed in mammals and regulate various developmental processes despite redundant roles within the family (Sawada et al., 2008). These TFs identified from motif enrichment in Enh candidates represent potential regulators in mouse ES cells warranting follow-up investigations.

4.2.8.2. Differentiation and developmentally related transcription factors

Interestingly, motifs of TFs involved in development and differentiation are enriched in the putative enhancer regions, while the absolute expression levels of these TFs are minimal. Other members of the SOX family are also found to be enriched. Sox4 and Sox11, highly expressed in lymphocytes and brain, respectively, play a central regulatory role during neuronal maturation (Bergsland et al., 2006). ZIC family members are also crucial for neural development (Aruga, 2004). NR4A2 (Nurr1) is critical in neuroprotection and the development of dopamine neurons (Bensinger and Tontonoz, 2009). RAR and RXR are also reported to cooperatively regulate neuronal differentiation in P19 embryonic carcinoma cells (Horn et al., 1996). Ascl2, though not expressed in embryonic stem cells, plays an essential role in the maintenance of adult intestinal stem cells (van der Flier et al., 2009). NR2F2 has recently been reported to regulate human ES cell differentiation and activate neural genes during early differentiation (Rosa and Brivanlou, 2011). These results indicate either that a portion of the putative enhancers I identified may be poised, developmentally related enhancers, or that other family members of these TFs may be active in mouse ES cells.

4.2.9. Identified enhancers are mainly active and cell type-specific

4.2.9.1. Enhancer activities

H3K27ac enrichment has been reported to be a determining feature of active enhancers and is suggested to be an enhancer marker in addition to H3K4me1 (Creyghton et al., 2010). I investigated the overlap of top Enh candidates (prob≥0.8) in ES cells with distal H3K27ac marks, and with repressive H3K27me3 marks for contrast (Figure 4-10a). The distal H3K27ac regions obtained from Creyghton et al. (2010) do not overlap known TSSs or H3K4me3 enriched regions (+/- 1kbp). I found significant overlap between top Enh candidates (prob≥0.8) and the H3K27ac mark in ES cells (p<2.2e-16, logOdds=3.86). These Enh candidates were also significantly enriched in H3K27ac compared to the training set of OSN co-bound regions (p<2.2e-16, logOdds=0.77) (Appendix 7a). In addition, Enh candidates are significantly less associated with the repressive H3K27me3 mark compared to the OSN co-bound set (p=4e-12, logOdds=-2.1). The 679 non-overlapping enhancer candidates are likely to be intermediate enhancers (Zentner et al., 2011) or active enhancers in proximity to TSSs or promoter regions that were overlooked because of the curated promoter filter applied to the distal H3K27ac regions.
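How the reported logOdds and p-values were computed is not spelled out in this section. The following is a minimal sketch of one common way to obtain such statistics, assuming a 2x2 contingency table, a natural-log odds ratio, and a one-sided Fisher's exact test. The function names are hypothetical and the counts are made up for illustration; they are not the thesis data.

```python
import math


def log_odds_ratio(a, b, c, d):
    """Natural-log odds ratio for the 2x2 table [[a, b], [c, d]].

    a: set-1 regions with the mark    b: set-1 regions without the mark
    c: set-2 regions with the mark    d: set-2 regions without the mark
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all cells must be positive")
    return math.log((a * d) / (b * c))


def fisher_one_sided(a, b, c, d):
    """P(X >= a) under the hypergeometric null (enrichment in set 1)."""
    n_total = a + b + c + d
    row1 = a + b           # size of set 1
    col1 = a + c           # regions carrying the mark
    denom = math.comb(n_total, row1)
    upper = min(row1, col1)
    tail = sum(
        math.comb(col1, k) * math.comb(n_total - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return tail / denom


# Illustrative counts only: 50 of 100 candidates vs 10 of 100 background bins.
table = (50, 50, 10, 90)
print(f"logOdds = {log_odds_ratio(*table):.2f}")
print(f"p = {fisher_one_sided(*table):.2e}")
```

A positive log odds ratio indicates enrichment of the mark in the first set relative to the second, and a negative value, as reported for H3K27me3, indicates depletion.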

4.2.9.2. Tissue specificity of enhancers

To further investigate the cell specificity of top Enh candidates, the overlapping percentages with distal H3K27ac and H3K4me1 marks from various cell types were compared (Figure 4-10c). The percentages for the two histone marks in ES cells are much greater than the percentages in other cells, indicating cell type specificity of the enhancer candidates. Furthermore, with a higher Enh probability cut-off, I observed an increase in the overlap with H3K27ac and H3K4me1 in ES cells and a decrease in the overlap in the differentiated cell types (Figure 4-10c compared to Appendix 7c). Although several differentiation and developmentally related TF motifs were discovered from the candidate list, the above findings demonstrate that the proposed model identified a good proportion of cell type-specific and active enhancers, according to H3K4me1 and H3K27ac marks.

Figure 4-10. Active enhancers and cell specificity of top enhancers (prob≥0.8).

(a) Venn diagram depicting overlap of high confidence Enh candidates with distal (TSS +/- 1kb removed) H3K27ac marks (active) and H3K27me3 marks (repressive). (b) Venn diagram depicting overlap of high confidence Enh candidates with distal H3K27ac marks (active) and high confidence enhancer binding CHD7 peaks. (c) The stacked bar plot shows the percent overlap of high confidence Enh candidates with H3K27ac / H3K4me1 in various cell types. The overlaps presented here allow a 500 bp gap.
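The 500 bp gap allowance mentioned in the caption can be sketched as follows. This is one plausible reading, assuming half-open genomic intervals on the same chromosome and a gap measured between the nearest edges; the function name and coordinates are hypothetical.

```python
def overlaps_with_gap(a, b, max_gap=500):
    """Return True if intervals a and b overlap or lie within max_gap bp of each other.

    Each interval is a (start, end) tuple on the same chromosome, half-open.
    """
    a_start, a_end = a
    b_start, b_end = b
    # Negative when the intervals overlap, otherwise the distance between them.
    gap = max(a_start, b_start) - min(a_end, b_end)
    return gap <= max_gap


enh_bin = (1000, 2000)                             # hypothetical 1 kbp Enh bin
print(overlaps_with_gap(enh_bin, (1500, 1700)))    # peak inside the bin
print(overlaps_with_gap(enh_bin, (2400, 2600)))    # 400 bp away, within tolerance
print(overlaps_with_gap(enh_bin, (2600, 2800)))    # 600 bp away, outside tolerance
```

Allowing a gap makes the overlap counts less sensitive to where a 1 kbp bin boundary happens to fall relative to a called peak.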

4.2.10. Comparison to other enhancer associated factors

In order to understand how well the identified Enh set agrees with various enhancer-associated factors, a comparison of the Enh and PrL sets to these factors is illustrated in Figure 4-11.

Figure 4-11. Overlapping percentages with 5-hydroxymethylcytosine and other features.

The overlapping percentages of the Enh and PrL sets, as well as of the high confidence sets, are plotted with respect to enrichment regions of CHD7, H3K27ac, 5hmC using two different techniques (GLIB and CMS), 5mC and H3K27me3. Overlapping percentages with high confidence Enhs and PrLs (prob≥0.8) are plotted in solid red and blue colors, whereas the overall Enhs and PrLs are shaded in slanted red and blue lines. ※ Since the distal H3K27ac sites obtained from the literature exclude TSS and H3K4me3 regions (+/- 1kbp), the overlap with PrLs is not shown.

4.2.10.1. H3K27ac and CHD7

Although Enh candidates overlap significantly with the distal active H3K27ac mark, introducing H3K27ac as a model feature showed positive weighting for PrL, indicating higher enrichment of H3K27ac at promoter regions than at enhancers (Appendix 8). Furthermore, H3K27ac enrichment regions identified using the PeakRanger algorithm (Feng et al., 2011) without a promoter filter overlap with both promoters and enhancers, as seen by visual inspection of validated enhancer profiles and nearby genes. As the logistic regression model detects the overall trend, features enriched in both PrL and Enh, such as RNAPII-ser5 and H3K27ac, end up contributing to the prediction of PrL, in which the marks were most enriched.

To elucidate how well the enhancers overlap with other enhancer associated factors, I compared the overlap of top Enh candidates and the distal H3K27ac mark with binding sites of CHD7, which is found to both positively and negatively regulate gene expression through enhancer binding (Schnetz et al., 2010) (Figure 4-10b). Forty-four percent of top Enh candidates overlap with the distal active H3K27ac mark, and 78 percent overlap with distal H3K27ac or CHD7 peaks of medium stringency, supporting the conclusion that the identified top Enh candidates are active enhancers.

4.2.10.2. 5-hydroxymethylcytosine versus active enhancers

As over one fourth of top Enh candidates cannot be explained by CHD7 and the active enhancer mark, H3K27ac, I further investigated the overlap with 5-hydroxymethylcytosine (5hmC) sites. In agreement with the discovered association of 5hmC with enhancers, exons and TSS regions in human ES cells (Stroud et al., 2011), I found both Enh and PrL sets to be significantly more enriched in 5hmC than in 5mC (p<2.2e-16; logOdds=2.58 and 1.81), which also supports the association of 5hmC with active enhancers. However, in contrast to the increase in overlapping percentage of higher confidence Enh (prob≥0.8) with CHD7 and H3K27ac, the overlapping percentage with 5hmC regions decreased significantly (p<2.2e-16; logOdds=-0.69) (Figure 4-11). As 5hmC is suggested to be an intermediate stage between methylated and non-methylated cytosine (Ficz et al., 2011), the decrease in overlap with higher confidence enhancers and the limited overlap with 5mC indicate that the most active enhancers tend to be located at non-hydroxymethylated and non-methylated cytosine. Although 5hmC is significantly enriched at active enhancers, it is less associated with the most active ones.

4.3. Discussion

In this chapter, I present computational approaches to assess discriminative features for enhancer identification in mouse ES cells. I have established the importance of feature selection by comparing the cross-validation deviance, precision and recall of Naive Bayes classifiers built from various feature combinations. Subsequently, I used lasso regularized multinomial logistic regression to systematically shrink the weights of chromatin related ChIP-Seq and genomic features in identifying enhancer regions. I identified 10 key signatures for distinguishing enhancer regions from promoter-like regions and the rest of the genome. The top signatures identified for Enh, namely p300, H3K4me1, and Med12, are consistent with previous studies, whereas the top signatures identified for PrL, such as CpG islands and H3K4me3, agree with promoter properties.
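The way a lasso penalty shrinks feature weights, and sets weak ones exactly to zero, can be illustrated with the soft-thresholding operator that underlies many lasso solvers. This is a generic sketch and not the fitting procedure used in this work; the function name, the penalty value and the weights are hypothetical.

```python
def soft_threshold(weight, penalty):
    """Shrink a weight toward zero by `penalty`; weights within the penalty become exactly 0."""
    if penalty < 0:
        raise ValueError("penalty must be non-negative")
    if weight > penalty:
        return weight - penalty
    if weight < -penalty:
        return weight + penalty
    return 0.0


# Hypothetical unpenalized weights for three generic features.
weights = {"feature_a": 1.2, "feature_b": -0.5, "feature_c": 0.05}
penalty = 0.1

shrunk = {name: soft_threshold(w, penalty) for name, w in weights.items()}
kept = [name for name, w in shrunk.items() if w != 0.0]
print(shrunk)
print("features retained:", kept)
```

Features whose weights survive the shrinkage at a given penalty are the ones retained as signatures, which is the sense in which the lasso performs feature selection and weight estimation at the same time.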

The advantage of ranking the top non-TF key signatures for enhancer identification is that it retains the potential to apply the model to other cell types, as opposed to previous approaches which used all features available. The importance of data transformation and feature extraction of chromatin signatures has recently been established to achieve more accurate enhancer prediction in human ES cells (Firpi et al., 2010). Although feature extraction by Fisher discriminant analysis determines a linear combination of features that optimally separates two classes, data sets of all features used will be required in other cell types. Unsupervised approaches have also been used to characterize chromatin states and systematically annotate the human genome using all data sets (Ernst and Kellis, 2010; Hon et al., 2008). Another unsupervised approach used selected histone marks and the Ser2/5 phosphorylated forms of RNAPII to distinguish multiple classes of enhancers (Zentner et al., 2011). Differing from these methods, my approach not only identifies key signatures in enhancer identification, but also demonstrates the advantage of incorporating genomic sequence features to better distinguish enhancer and promoter regions.

I have confirmed that previously validated enhancers and novel enhancers around ES cell regulating TFs fall within the top 1277 high confidence enhancer candidates (prob≥0.8). Examples of these novel Enhs are around Lefty2, Sox2, miR-290, Tet1, Zic3, and C80913, with the majority playing important roles in ES cells, which warrant further experimental validation. Since less than 40% of top candidates are co-bound by OCT4, SOX2 and NANOG, enhancers identified from the modeling approach are not limited to the Enh training set, OSN co-bound regions.

Although Enh candidates are significantly further from TSSs than PrL candidates, the absolute expression distribution of genes with Enhs and PrLs is significantly higher than that of genes with only PrLs, indicating co-ordinated regulation of gene expression by Enh and PrL candidates. It is important to note that Enh candidates are expected to be more difficult to annotate to the correct gene(s), as they are often located in intergenic regions and may in fact not regulate the closest gene in a linear conformation. Regardless, the presence of an Enh candidate is associated with increased levels of expression in addition to the presence of a PrL candidate, and both sets regulate tissue-specific gene expression. Furthermore, the Enh regions identified are significantly enriched with MTLs compared to unknown, background and even the PrL regions. GO functional analysis revealed that top Enh candidates are specifically enriched in DNA binding and transcriptional regulation activities. These statistical analyses provide supporting evidence for the functionality of enhancers identified in mouse ES cells.

Motif enrichment analysis on enhancers further identified novel enhancer-bound TFs that may play major roles in mouse ES cells. Although motif enrichment analysis is limited by the position weight matrices available, I have successfully identified motif enrichment of several known ES regulating TFs, such as KLF4, SOX2, OCT4, ESRRB, STAT3 and ZIC3. In addition, SP1 was previously reported to regulate Oct4 and Nanog gene expression through binding to their promoters (Pesce et al., 1999; Wu and Yao, 2006). Interestingly, I have also discovered enhancer regions closest to the genes of most of these TFs, and TFs with high gene expression tend to have at least one high probability enhancer nearby. Furthermore, enrichment of several other neural development and differentiation related TFs may indicate that a subset of the enhancers identified are poised enhancers or that there is an active TF of the same family with a similar PSSM. This is consistent with the observations of epigenetic pre-marking of developmental enhancers through histone and DNA methylation marks from other studies (Cui et al., 2009; Xu et al., 2009). Through motif enrichment analysis, I have established the potential for identifying known and novel TFs regulating the target cell type.

Follow-up investigation using H3K27ac data in mouse ES cells can further highlight active enhancers. I have reported a higher overlapping percentage with the distal H3K27ac mark using top ranked enhancers, indicating a positive correlation between enhancer activity and Enh probability. As the distal H3K27ac regions obtained were processed by removing promoter regions, various validated enhancers closer to TSSs, such as those of Pou5f1 (Oct4), Nanog and Zic3 as well as the two enhancers around Sox2, cannot be identified. Alternatively, H3K27ac enriched regions identified using the PeakRanger algorithm (Feng et al., 2011) overlap with both promoters and enhancers. Such incomplete and non-exclusive representation of enhancers further necessitates the usage of multiple features, and highlights the advantage of the modeling approach.

In agreement with the tissue-specificity of enhancers reported previously (Heintzman et al., 2009; Visel et al., 2009a), I have shown that the putative enhancers are cell type-specific by comparing them with the distal active H3K27ac and H3K4me1 marks in various cell types. The limited overlap of the Enh candidates with H3K27me3 (Figure 4-10a; Appendix 7) also indicated that the Enh set contains very few poised enhancers marked by H3K27me3. Furthermore, I found a significant association of 5hmC, and no association of 5mC, with the overall Enh set, and decreased association of both 5hmC and 5mC with top Enh candidates, which suggests that the association with enhancer activity, from strong to very weak, follows the order non-methylated cytosine, 5hmC and 5mC. The subset of enhancer candidates without distal H3K27ac regions can potentially be the previously defined intermediate enhancers (Zentner et al., 2011). Overall, the majority of enhancers identified are active or intermediate enhancers in ES cells, and the proportions of the overall Enh set (n=19200) and top Enh candidates (n=1277) explained by distal H3K27ac, CHD7, or 5hmC are 61.9% and 83.4%, respectively.

4.3.1. Future work

The advantage of minimizing the number of key signatures required and using non-TF features is to retain the potential to apply the model to other cell types efficiently and cost effectively. For well studied cell types with multiple TF ChIP-Seq data sets, another interesting question is whether using ChIP-Seq data of crucial TFs as features will improve the prediction accuracy of enhancers. If key TFs increase the precision of functional enhancer identification, the model can then be modified as data accumulates to more accurately identify enhancers.

Other data sets, such as another coactivator (CREB-binding protein), can be incorporated to build a more robust model in predicting enhancers as more data sets become available. Quantitative measurements of ChIP-Seq data sets were used in feature selection, but binary measurements may improve predictions for noisy data sets, which could be tested in future modeling approaches. Various combinations of features may also be further assessed. Although the enhancers and signatures are successfully identified using the binning approach, non-binning approaches may improve the prediction resolution.

The Enh candidates identified can be validated using enhancer assays, and the target genes the enhancers regulate as well as the looping conformation can be elucidated by high throughput chromosome conformation capture techniques. Moreover, TFs identified by the motif analysis can be further investigated using ChIP experiments.

Chapter 5

General Discussion

5.1. Discussion

The advent of high throughput sequencing data has empowered functional annotation of genomes, especially of distal regulatory elements that have long been under-explored. Multiple sources of evidence have stressed the important role of these elements, specifically enhancers, in tissue-specificity and development. The growing focus on enhancer identification using various associated features calls for detailed investigation of their predictive power. Limited by data availability in mouse erythroid cells, I first identified enhancers using a biologically directed approach. Given the wealth of public domain RNA-Seq and ChIP-Seq data sets for mouse embryonic stem cells, I then took modeling approaches to assess features that distinguish enhancers from promoters and the rest of the genome.

Enhancers in erythroid cells were identified using RNAPII-ser5 enrichment and the absence of nucRNA, and putative enhancers predicted from the biologically directed approach were supported by downstream statistical analyses. Specifically, significant association with TF binding peaks and conserved regions as well as Gene Ontology functional enrichment analysis of putative enhancers indicated functionality of these elements. The motif enrichment study on enhancers also identified TFs potentially involved in erythroid development. However, the Naive Bayes pilot study in mouse ES cells revealed that combinations of RNA and RNAPII-ser5, with or without the conservation factor, were not as good predictors of enhancers as other feature combinations. Although total RNA instead of nucRNA was used in model assessment in mouse ES cells, the unbiased computational approach in Chapter 4 elucidated the importance of feature selection. Moreover, the overlapping percentage between enhancer candidates and tissue-specific TFs in ES cells is much higher than that in erythroid cells. Such a discrepancy is likely due to the improvement of model performance and/or the number of TF ChIP-Seq data sets available in each cell type.

In mouse ES cells, the lasso regularized logistic regression model provided a systematic assessment of chromatin-related and genomic features in enhancer prediction: the top signatures of Enh and PrL are p300/H3K4me1/Med12 and CpG island/H3K4me3/RNAPII-ser5/G+C percentage/CTCF, respectively. Putative enhancers from the model using the 10 signatures identified are significantly enriched with MTL compared to other sets and to overall genomic bins. In addition, known and novel enhancers are top ranked in the model; these include candidates closest to several ES cell regulating TFs that are found to be enriched in the supervised motif enrichment analysis of enhancers. These enhancers are also found to be cell type-specific, and top enhancer candidates show 44% overlap with the active H3K27ac enhancer mark. Moreover, 83.4% of top candidates are explained by H3K27ac, CHD7, and 5hmC regions. Unlike previous approaches that filter on features or model all histone modification data, my approach also incorporated CpG island, mediator, G+C percentage, and CTCF features and identified, with ranking, the minimal signatures necessary for enhancer identification, which allows more efficient prediction of enhancers in other cell types. Subsequent supervised motif analysis can also elucidate key TFs regulating the corresponding cell type.
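To illustrate how a lasso penalty selects a minimal set of signatures, the following Python sketch fits an L1-regularized logistic regression on toy data. It is an assumption-laden illustration rather than the model fitted in this thesis: it uses proximal gradient descent with soft thresholding instead of the coordinate descent of Friedman et al. (2010), and the function names, the two toy features, and the penalty value are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink a weight toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(X, y, lam, step=0.5, n_iter=2000):
    """Minimise mean logistic loss + lam * sum(|w|); the intercept is unpenalised."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, label in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - label
            grad_b += err
            for j in range(p):
                grad_w[j] += err * row[j]
        b -= step * grad_b / n
        # Gradient step on the loss, then soft thresholding for the L1 term.
        w = [soft_threshold(wj - step * gj / n, step * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


def toy_bins(n=40):
    """Hypothetical bins: feature 0 tracks the label, feature 1 alternates 0/1."""
    X = [[i / (n - 1), float(i % 2)] for i in range(n)]
    y = [1 if i >= n // 2 else 0 for i in range(n)]
    return X, y


if __name__ == "__main__":
    X, y = toy_bins()
    weights, intercept = fit_lasso_logistic(X, y, lam=0.05)
    print(weights, intercept)
```

In this toy setting the informative feature keeps a positive weight while the alternating feature is shrunk to zero, which mirrors, in miniature, how features with non-zero weights can be read off as signatures.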

Transcriptional activity has previously been reported in enhancer regions identified by H3K4me1 in mouse cortical neurons; these transcripts, termed eRNAs, are bidirectional, non-polyadenylated, and much shorter than transcripts at promoters, and tend not to extend beyond 4 kb (Kim et al., 2010). Somewhat differently from that study, I found nearly no transcripts at the hbb LCR enhancers in erythroid cells, and the transcripts beside the Sox2 and Tet1/U6 enhancer clusters span more than 4 kb (6 to 10 kb). The discrepancy could result from differences in the size and stability of these RNAs, in coverage, or in experimental procedures. Nevertheless, the enhancer cluster 100 kb downstream of Sox2 in mouse ES cells shares similarities with the hbb LCR in adult erythroid cells, including closely located high-confidence enhancers, overlaps with multiple TF binding peaks, and a long-range distance to the nearest transcribed gene. This resemblance suggests the possibility of chromatin looping for Sox2. Further investigations such as enhancer assays and 3C will therefore be necessary to validate their enhancer activities, reveal the conformation around them, and identify their target genes.


5.2. Summary

The contributions of the presented work to current research include highlighting the importance of selecting key signatures in enhancer prediction, and integrating multiple sources of relevant biological data to address a biological question. The data sets used here bring together a wide range of biological concepts, from expression, histone modification, TF occupancy, chromatin-associated proteins, and DNA methylation and hydroxymethylation to TF motifs and Gene Ontology. Data types included RNA-Seq expression data; histone modification, chromatin-associated protein, and TF ChIP-Seq data; genomic features; PSSMs; Gene Ontology categories; and 5hmC and 5mC-Seq data.

The significance of the study also lies in aiding the identification of potential novel TFs bound to putative enhancer elements and in narrowing down functional enhancer candidates that are likely to interact with promoters through long-range looping to regulate gene expression. Additionally, the use of transcription factor independent features in the model retains the flexibility of applying the model to other cell types using only the minimal signatures necessary.

The series of bioinformatics and statistical analyses uncovered several biological observations associated with enhancers that are consistent with previous literature, such as the top Enh and PrL signatures identified, the prediction of known enhancers, and the tissue specificity and MTL enrichment of enhancers. In addition, this work presented novel findings such as novel enhancers and looping candidates, potential regulatory TFs, the lack of overlap between the putative enhancers and poised histone marks, and an overall enrichment of 5hmC, although to a lesser degree for top Enh candidates. The bioinformatics study not only narrows down enhancer candidates that warrant experimental validation, but also provides inferences for biological understanding and interpretation.


References

Aerts, S., van Helden, J., Sand, O., and Hassan, B.A. (2007). Fine-tuning enhancer models to predict transcriptional targets across multiple genomes. PLoS One 2, e1115.

Aruga, J. (2004). The role of Zic genes in neural development. Mol Cell Neurosci 26, 205-221.

Ashe, H.L., Monks, J., Wijgerde, M., Fraser, P., and Proudfoot, N.J. (1997). Intergenic transcription and transinduction of the human beta-globin locus. Genes Dev 11, 2494-2509.

Badis, G., Berger, M.F., Philippakis, A.A., Talukder, S., Gehrke, A.R., Jaeger, S.A., Chan, E.T., Metzler, G., Vedenko, A., Chen, X., et al. (2009). Diversity and complexity in DNA recognition by transcription factors. Science 324, 1720-1723.

Barrett, T., Troup, D.B., Wilhite, S.E., Ledoux, P., Evangelista, C., Kim, I.F., Tomashevsky, M., Marshall, K.A., Phillippy, K.H., Sherman, P.M., et al. (2010). NCBI GEO: archive for functional genomics data sets--10 years on. Nucleic Acids Res 39, D1005-1010.

Bensinger, S.J., and Tontonoz, P. (2009). A Nurr1 pathway for neuroprotection. Cell 137, 26-28.

Bergsland, M., Werme, M., Malewicz, M., Perlmann, T., and Muhr, J. (2006). The establishment of neuronal properties is controlled by Sox4 and Sox11. Genes Dev 20, 3475-3486.

Berman, B.P., Pfeiffer, B.D., Laverty, T.R., Salzberg, S.L., Rubin, G.M., Eisen, M.B., and Celniker, S.E. (2004). Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biol 5, R61.

Bernstein, B.E., Kamal, M., Lindblad-Toh, K., Bekiranov, S., Bailey, D.K., Huebert, D.J., McMahon, S., Karlsson, E.K., Kulbokas, E.J., 3rd, Gingeras, T.R., et al. (2005). Genomic maps and comparative analysis of histone modifications in human and mouse. Cell 120, 169-181.

Bernstein, B.E., Mikkelsen, T.S., Xie, X., Kamal, M., Huebert, D.J., Cuff, J., Fry, B., Meissner, A., Wernig, M., Plath, K., et al. (2006). A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell 125, 315-326.

Blow, M.J., McCulley, D.J., Li, Z., Zhang, T., Akiyama, J.A., Holt, A., Plajzer-Frick, I., Shoukry, M., Wright, C., Chen, F., et al. (2010). ChIP-Seq identification of weakly conserved heart enhancers. Nat Genet 42, 806-810.

Brandeis, M., Frank, D., Keshet, I., Siegfried, Z., Mendelsohn, M., Nemes, A., Temper, V., Razin, A., and Cedar, H. (1994). Sp1 elements protect a CpG island from de novo methylation. Nature 371, 435-438.

Brandt, S.J., and Koury, M.J. (2009). Regulation of LMO2 mRNA and protein expression in erythroid differentiation. Haematologica 94, 447-448.


Carter, D., Chakalova, L., Osborne, C.S., Dai, Y.F., and Fraser, P. (2002). Long-range chromatin regulatory interactions in vivo. Nat Genet 32, 623-626.

Catena, R., Tiveron, C., Ronchi, A., Porta, S., Ferri, A., Tatangelo, L., Cavallaro, M., Favaro, R., Ottolenghi, S., Reinbold, R., et al. (2004). Conserved POU binding DNA sites in the Sox2 upstream enhancer regulate gene expression in embryonic and neural stem cells. J Biol Chem 279, 41846-41857.

Chen, X., Xu, H., Yuan, P., Fang, F., Huss, M., Vega, V.B., Wong, E., Orlov, Y.L., Zhang, W., Jiang, J., et al. (2008). Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 133, 1106-1117.

Cheng, Y., Wu, W., Kumar, S.A., Yu, D., Deng, W., Tripic, T., King, D.C., Chen, K.B., Zhang, Y., Drautz, D., et al. (2009). Erythroid GATA1 function revealed by genome-wide analysis of transcription factor occupancy, histone modifications, and mRNA expression. Genome Res 19, 2172-2184.

Cloonan, N., Forrest, A.R., Kolle, G., Gardiner, B.B., Faulkner, G.J., Brown, M.K., Taylor, D.F., Steptoe, A.L., Wani, S., Bethel, G., et al. (2008). Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods 5, 613-619.

Creyghton, M.P., Cheng, A.W., Welstead, G.G., Kooistra, T., Carey, B.W., Steine, E.J., Hanna, J., Lodato, M.A., Frampton, G.M., Sharp, P.A., et al. (2010). Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc Natl Acad Sci U S A 107, 21931-21936.

Cui, K., Zang, C., Roh, T.Y., Schones, D.E., Childs, R.W., Peng, W., and Zhao, K. (2009). Chromatin signatures in multipotent human hematopoietic stem cells indicate the fate of bivalent genes during differentiation. Cell Stem Cell 4, 80-93.

De Santa, F., Barozzi, I., Mietton, F., Ghisletti, S., Polletti, S., Tusi, B.K., Muller, H., Ragoussis, J., Wei, C.L., and Natoli, G. (2010). A large fraction of extragenic RNA pol II transcription sites overlap enhancers. PLoS Biol 8, e1000384.

Dekker, J., Rippe, K., Dekker, M., and Kleckner, N. (2002). Capturing chromosome conformation. Science 295, 1306-1311.

Ernst, J., and Kellis, M. (2010). Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat Biotechnol 28, 817-825.

Ernst, J., Plasterer, H.L., Simon, I., and Bar-Joseph, Z. (2010). Integrating multiple evidence sources to predict transcription factor binding in the human genome. Genome Res 20, 526-536.

Feng, X., Grossman, R., and Stein, L. (2011). PeakRanger: a cloud-enabled peak caller for ChIP-seq data. BMC Bioinformatics 12, 139.

Ficz, G., Branco, M.R., Seisenberger, S., Santos, F., Krueger, F., Hore, T.A., Marques, C.J., Andrews, S., and Reik, W. (2011). Dynamic regulation of 5-hydroxymethylcytosine in mouse ES cells and during differentiation. Nature 473, 398-402.


Firpi, H.A., Ucar, D., and Tan, K. (2010). Discover regulatory DNA elements using chromatin signatures and artificial neural network. Bioinformatics 26, 1579-1586.

Forrester, W.C., Epner, E., Driscoll, M.C., Enver, T., Brice, M., Papayannopoulou, T., and Groudine, M. (1990). A deletion of the human beta-globin locus activation region causes a major alteration in chromatin structure and replication across the entire beta-globin locus. Genes Dev 4, 1637-1649.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw 33, 1-22.

Frith, M.C., Fu, Y., Yu, L., Chen, J.F., Hansen, U., and Weng, Z. (2004). Detection of functional DNA motifs via statistical over-representation. Nucleic Acids Res 32, 1372-1381.

Fujita, P.A., Rhead, B., Zweig, A.S., Hinrichs, A.S., Karolchik, D., Cline, M.S., Goldman, M., Barber, G.P., Clawson, H., Coelho, A., et al. (2010). The UCSC Genome Browser database: update 2011. Nucleic Acids Res.

Fullwood, M.J., Liu, M.H., Pan, Y.F., Liu, J., Xu, H., Mohamed, Y.B., Orlov, Y.L., Velkov, S., Ho, A., Mei, P.H., et al. (2009). An oestrogen-receptor-alpha-bound human chromatin interactome. Nature 462, 58-64.

Galan-Caridad, J.M., Harel, S., Arenzana, T.L., Hou, Z.E., Doetsch, F.K., Mirny, L.A., and Reizis, B. (2007). Zfx controls the self-renewal of embryonic and hematopoietic stem cells. Cell 129, 345-357.

Gardiner-Garden, M., and Frommer, M. (1987). CpG islands in vertebrate genomes. J Mol Biol 196, 261-282.

Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., et al. (2004). Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5, R80.

Goren, A., Ozsolak, F., Shoresh, N., Ku, M., Adli, M., Hart, C., Gymrek, M., Zuk, O., Regev, A., Milos, P.M., et al. (2010). Chromatin profiling by directly sequencing small quantities of immunoprecipitated DNA. Nat Methods 7, 47-49.

Guttman, M., Garber, M., Levin, J.Z., Donaghey, J., Robinson, J., Adiconis, X., Fan, L., Koziol, M.J., Gnirke, A., Nusbaum, C., et al. (2010). Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol 28, 503-510.

Hallikas, O., Palin, K., Sinjushina, N., Rautiainen, R., Partanen, J., Ukkonen, E., and Taipale, J. (2006). Genome-wide prediction of mammalian enhancers based on analysis of transcription-factor binding affinity. Cell 124, 47-59.

He, A., Kong, S.W., Ma, Q., and Pu, W.T. (2011). Co-occupancy by multiple cardiac transcription factors identifies transcriptional enhancers active in heart. Proc Natl Acad Sci U S A 108, 5632-5637.


Heintzman, N.D., Hon, G.C., Hawkins, R.D., Kheradpour, P., Stark, A., Harp, L.F., Ye, Z., Lee, L.K., Stuart, R.K., Ching, C.W., et al. (2009). Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature 459, 108-112.

Heintzman, N.D., Stuart, R.K., Hon, G., Fu, Y., Ching, C.W., Hawkins, R.D., Barrera, L.O., Van Calcar, S., Qu, C., Ching, K.A., et al. (2007). Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet 39, 311-318.

Hon, G., Ren, B., and Wang, W. (2008). ChromaSig: a probabilistic approach to finding common chromatin signatures in the human genome. PLoS Comput Biol 4, e1000201.

Horn, V., Minucci, S., Ogryzko, V.V., Adamson, E.D., Howard, B.H., Levin, A.A., and Ozato, K. (1996). RAR and RXR selective ligands cooperatively induce apoptosis and neuronal differentiation in P19 embryonal carcinoma cells. FASEB J 10, 1071-1077.

Huang da, W., Sherman, B.T., and Lempicki, R.A. (2009a). Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37, 1-13.

Huang da, W., Sherman, B.T., and Lempicki, R.A. (2009b). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4, 44-57.

Huang, D.Y., Kuo, Y.Y., Lai, J.S., Suzuki, Y., Sugano, S., and Chang, Z.F. (2004). GATA-1 and NF-Y cooperate to mediate erythroid-specific transcription of Gfi-1B gene. Nucleic Acids Res 32, 3935-3946.

Inoue, M., Kamachi, Y., Matsunami, H., Imada, K., Uchikawa, M., and Kondoh, H. (2007). PAX6 and SOX2-dependent regulation of the Sox2 enhancer N-3 involved in embryonic visual system development. Genes Cells 12, 1049-1061.

Jothi, R., Cuddapah, S., Barski, A., Cui, K., and Zhao, K. (2008). Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res 36, 5221-5231.

Kagey, M.H., Newman, J.J., Bilodeau, S., Zhan, Y., Orlando, D.A., van Berkum, N.L., Ebmeier, C.C., Goossens, J., Rahl, P.B., Levine, S.S., et al. (2010). Mediator and cohesin connect gene expression and chromatin architecture. Nature.

Kharchenko, P.V., Tolstorukov, M.Y., and Park, P.J. (2008). Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol 26, 1351-1359.

Kim, T.K., Hemberg, M., Gray, J.M., Costa, A.M., Bear, D.M., Wu, J., Harmin, D.A., Laptewicz, M., Barbara-Haley, K., Kuersten, S., et al. (2010). Widespread transcription at neuronal activity-regulated enhancers. Nature.

Koh, K.P., Yabuuchi, A., Rao, S., Huang, Y., Cunniff, K., Nardone, J., Laiho, A., Tahiliani, M., Sommer, C.A., Mostoslavsky, G., et al. (2011). Tet1 and Tet2 regulate 5-hydroxymethylcytosine production and cell lineage specification in mouse embryonic stem cells. Cell Stem Cell 8, 200-213.


Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10, R25.

Leung, L., Kwong, M., Hou, S., Lee, C., and Chan, J.Y. (2003). Deficiency of the Nrf1 and Nrf2 transcription factors results in early embryonic lethality and severe oxidative stress. J Biol Chem 278, 48021-48029.

Lichner, Z., Pall, E., Kerekes, A., Pallinger, E., Maraghechi, P., Bosze, Z., and Gocza, E. (2011). The miR-290-295 cluster promotes pluripotency maintenance by regulating cell cycle phase distribution in mouse embryonic stem cells. Differentiation 81, 11-24.

Lieberman-Aiden, E., van Berkum, N.L., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., Amit, I., Lajoie, B.R., Sabo, P.J., Dorschner, M.O., et al. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289-293.

Lim, L.S., Hong, F.H., Kunarso, G., and Stanton, L.W. (2010). The pluripotency regulator Zic3 is a direct activator of the Nanog promoter in ESCs. Stem Cells 28, 1961-1969.

Lim, L.S., Loh, Y.H., Zhang, W., Li, Y., Chen, X., Wang, Y., Bakre, M., Ng, H.H., and Stanton, L.W. (2007). Zic3 is required for maintenance of pluripotency in embryonic stem cells. Mol Biol Cell 18, 1348-1358.

Lister, R., Pelizzola, M., Dowen, R.H., Hawkins, R.D., Hon, G., Tonti-Filippini, J., Nery, J.R., Lee, L., Ye, Z., Ngo, Q.M., et al. (2009). Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462, 315-322.

Macleod, D., Charlton, J., Mullins, J., and Bird, A.P. (1994). Sp1 sites in the mouse aprt gene promoter are required to prevent methylation of the CpG island. Genes Dev 8, 2282-2292.

Marin, M., Karis, A., Visser, P., Grosveld, F., and Philipsen, S. (1997). Transcription factor Sp1 is essential for early embryonic development but dispensable for cell growth and differentiation. Cell 89, 619-628.

Marini, M.G., Porcu, L., Asunis, I., Loi, M.G., Ristaldi, M.S., Porcu, S., Ikuta, T., Cao, A., and Moi, P. (2010). Regulation of the human HBA genes by KLF4 in erythroid cell lines. Br J Haematol 149, 748-758.

Marziali, G., Perrotti, E., Ilari, R., Lulli, V., Coccia, E.M., Moret, R., Kuhn, L.C., Testa, U., and Battistini, A. (2002). Role of Ets-1 in transcriptional regulation of transferrin receptor and erythroid differentiation. Oncogene 21, 7933-7944.

Matys, V., Fricke, E., Geffers, R., Gossling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A.E., Kel-Margoulis, O.V., et al. (2003). TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res 31, 374-378.

Merico, D., Isserlin, R., Stueker, O., Emili, A., and Bader, G.D. (2010). Enrichment map: a network-based method for gene-set enrichment visualization and interpretation. PLoS One 5, e13984.


Mikkelsen, T.S., Ku, M., Jaffe, D.B., Issac, B., Lieberman, E., Giannoukos, G., Alvarez, P., Brockman, W., Kim, T.K., Koche, R.P., et al. (2007). Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 448, 553-560.

Miller, W., Rosenbloom, K., Hardison, R.C., Hou, M., Taylor, J., Raney, B., Burhans, R., King, D.C., Baertsch, R., Blankenberg, D., et al. (2007). 28-way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome Res 17, 1797-1808.

Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5, 621-628.

Muse, G.W., Gilchrist, D.A., Nechaev, S., Shah, R., Parker, J.S., Grissom, S.F., Zeitlinger, J., and Adelman, K. (2007). RNA polymerase is poised for activation across the genome. Nat Genet 39, 1507-1511.

Ong, C.T., and Corces, V.G. (2011). Enhancer function: new insights into the regulation of tissue-specific gene expression. Nat Rev Genet 12, 283-293.

Ouyang, Z., Zhou, Q., and Wong, W.H. (2009). ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc Natl Acad Sci U S A 106, 21521-21526.

Palin, K., Taipale, J., and Ukkonen, E. (2006). Locating potential enhancer elements by comparative genomics using the EEL software. Nat Protoc 1, 368-374.

Palstra, R.J., Tolhuis, B., Splinter, E., Nijmeijer, R., Grosveld, F., and de Laat, W. (2003). The beta-globin nuclear compartment in development and erythroid differentiation. Nat Genet 35, 190-194.

Pastor, W.A., Pape, U.J., Huang, Y., Henderson, H.R., Lister, R., Ko, M., McLoughlin, E.M., Brudno, Y., Mahapatra, S., Kapranov, P., et al. (2011). Genome-wide mapping of 5-hydroxymethylcytosine in embryonic stem cells. Nature 473, 394-397.

Pekowska, A., Benoukraf, T., Ferrier, P., and Spicuglia, S. (2010). A unique H3K4me2 profile marks tissue-specific gene regulation. Genome Res 20, 1493-1502.

Pennacchio, L.A., Ahituv, N., Moses, A.M., Prabhakar, S., Nobrega, M.A., Shoukry, M., Minovitsky, S., Dubchak, I., Holt, A., Lewis, K.D., et al. (2006). In vivo enhancer analysis of human conserved non-coding sequences. Nature 444, 499-502.

Pesce, M., Marin Gomez, M., Philipsen, S., and Scholer, H.R. (1999). Binding of Sp1 and Sp3 transcription factors to the Oct-4 gene promoter. Cell Mol Biol (Noisy-le-grand) 45, 709-716.

Portales-Casamar, E., Thongjuea, S., Kwon, A.T., Arenillas, D., Zhao, X., Valen, E., Yusuf, D., Lenhard, B., Wasserman, W.W., and Sandelin, A. (2010). JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res 38, D105-110.


Rada-Iglesias, A., Bajpai, R., Swigut, T., Brugmann, S.A., Flynn, R.A., and Wysocka, J. (2011). A unique chromatin signature uncovers early developmental enhancers in humans. Nature 470, 279-283.

Rahl, P.B., Lin, C.Y., Seila, A.C., Flynn, R.A., McCuine, S., Burge, C.B., Sharp, P.A., and Young, R.A. (2010). c-Myc regulates transcriptional pause release. Cell 141, 432-445.

Rao, S., and Orkin, S.H. (2006). Unraveling the transcriptional network controlling ES cell pluripotency. Genome Biol 7, 230.

Rosa, A., and Brivanlou, A.H. (2011). A regulatory circuitry comprised of miR-302 and the transcription factors OCT4 and NR2F2 regulates human embryonic stem cell differentiation. EMBO J 30, 237-248.

Rozen, S., and Skaletsky, H. (2000). Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol 132, 365-386.

Sagai, T., Hosoya, M., Mizushina, Y., Tamura, M., and Shiroishi, T. (2005). Elimination of a long-range cis-regulatory module causes complete loss of limb-specific Shh expression and truncation of the mouse limb. Development 132, 797-803.

Sawada, A., Kiyonari, H., Ukita, K., Nishioka, N., Imuta, Y., and Sasaki, H. (2008). Redundant roles of Tead1 and Tead2 in notochord development and the regulation of cell proliferation and survival. Mol Cell Biol 28, 3177-3189.

Schnetz, M.P., Handoko, L., Akhtar-Zaidi, B., Bartels, C.F., Pereira, C.F., Fisher, A.G., Adams, D.J., Flicek, P., Crawford, G.E., Laframboise, T., et al. (2010). CHD7 targets active gene enhancer elements to modulate ES cell-specific gene expression. PLoS Genet 6, e1001023.

Schoenfelder, S., Sexton, T., Chakalova, L., Cope, N.F., Horton, A., Andrews, S., Kurukuti, S., Mitchell, J.A., Umlauf, D., Dimitrova, D.S., et al. (2010). Preferential associations between co-regulated genes reveal a transcriptional interactome in erythroid cells. Nat Genet 42, 53-61.

Shimizu, R., Trainor, C.D., Nishikawa, K., Kobayashi, M., Ohneda, K., and Yamamoto, M. (2007). GATA-1 self-association controls erythroid development in vivo. J Biol Chem 282, 15862-15871.

Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A.S., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W., Richards, S., et al. (2005). Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15, 1034-1050.

Sikorska, M., Sandhu, J.K., Deb-Rinker, P., Jezierski, A., Leblanc, J., Charlebois, C., Ribecco-Lutkiewicz, M., Bani-Yaghoub, M., and Walker, P.R. (2008). Epigenetic modifications of SOX2 enhancers, SRR1 and SRR2, correlate with in vitro neural differentiation. J Neurosci Res 86, 1680-1693.

Simonis, M., Klous, P., Splinter, E., Moshkin, Y., Willemsen, R., de Wit, E., van Steensel, B., and de Laat, W. (2006). Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture-on-chip (4C). Nat Genet 38, 1348-1354.


Smoot, M.E., Ono, K., Ruscheinski, J., Wang, P.L., and Ideker, T. (2011). Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 27, 431-432.

Soler, E., Andrieu-Soler, C., de Boer, E., Bryne, J.C., Thongjuea, S., Stadhouders, R., Palstra, R.J., Stevens, M., Kockx, C., van Ijcken, W., et al. (2010). The genome-wide dynamics of the binding of Ldb1 complexes during erythroid differentiation. Genes Dev 24, 277-289.

Stroud, H., Feng, S., Morey Kinney, S., Pradhan, S., and Jacobsen, S.E. (2011). 5-hydroxymethylcytosine is associated with enhancers and gene bodies in human embryonic stem cells. Genome Biol 12, R54.

Tahiliani, M., Koh, K.P., Shen, Y., Pastor, W.A., Bandukwala, H., Brudno, Y., Agarwal, S., Iyer, L.M., Liu, D.R., Aravind, L., et al. (2009). Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1. Science 324, 930-935.

Takahashi, K., and Yamanaka, S. (2006). Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell 126, 663-676.

Takemoto, T., Uchikawa, M., Yoshida, M., Bell, D.M., Lovell-Badge, R., Papaioannou, V.E., and Kondoh, H. (2011). Tbx6-dependent Sox2 regulation determines neural or mesodermal fate in axial stem cells. Nature 470, 394-398.

Talbot, D., and Grosveld, F. (1991). The 5'HS2 of the globin locus control region enhances transcription through the interaction of a multimeric complex binding at two functionally distinct NF-E2 binding sites. EMBO J 10, 1391-1398.

Tallack, M.R., Whitington, T., Yuen, W.S., Wainwright, E.N., Keys, J.R., Gardiner, B.B., Nourbakhsh, E., Cloonan, N., Grimmond, S.M., Bailey, T.L., et al. (2010). A global role for KLF1 in erythropoiesis revealed by ChIP-seq in primary erythroid cells. Genome Res.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J Roy Stat Soc B Met 58, 267-288.

Tolhuis, B., Palstra, R.J., Splinter, E., Grosveld, F., and de Laat, W. (2002). Looping and interaction between hypersensitive sites in the active beta-globin locus. Mol Cell 10, 1453-1465.

Tomioka, M., Nishimoto, M., Miyagi, S., Katayanagi, T., Fukui, N., Niwa, H., Muramatsu, M., and Okuda, A. (2002). Identification of Sox-2 regulatory region which is under the control of Oct-3/4-Sox-2 complex. Nucleic Acids Res 30, 3202-3213.

Uchikawa, M., Ishida, Y., Takemoto, T., Kamachi, Y., and Kondoh, H. (2003). Functional analysis of chicken Sox2 enhancers highlights an array of diverse regulatory elements that are conserved in mammals. Dev Cell 4, 509-519.

van der Flier, L.G., van Gijn, M.E., Hatzis, P., Kujala, P., Haegebarth, A., Stange, D.E., Begthel, H., van den Born, M., Guryev, V., Oving, I., et al. (2009). Transcription factor achaete scute-like 2 controls intestinal stem cell fate. Cell 136, 903-912.


Visel, A., Blow, M.J., Li, Z., Zhang, T., Akiyama, J.A., Holt, A., Plajzer-Frick, I., Shoukry, M., Wright, C., Chen, F., et al. (2009a). ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature 457, 854-858.

Visel, A., Prabhakar, S., Akiyama, J.A., Shoukry, M., Lewis, K.D., Holt, A., Plajzer-Frick, I., Afzal, V., Rubin, E.M., and Pennacchio, L.A. (2008). Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nat Genet 40, 158-160.

Visel, A., Rubin, E.M., and Pennacchio, L.A. (2009b). Genomic views of distant-acting enhancers. Nature 461, 199-205.

Won, K.J., Ren, B., and Wang, W. (2010). Genome-wide prediction of transcription factor binding sites using an integrated model. Genome Biol 11, R7.

Wu, D.Y., and Yao, Z. (2006). Functional analysis of two Sp1/Sp3 binding sites in murine Nanog gene promoter. Cell Res 16, 319-322.

Xu, J., Watts, J.A., Pope, S.D., Gadue, P., Kamps, M., Plath, K., Zaret, K.S., and Smale, S.T. (2009). Transcriptional competence and the active marking of tissue-specific enhancers by defined transcription factors in embryonic and induced pluripotent stem cells. Genes Dev 23, 2824-2838.

Yang, H.M., Do, H.J., Oh, J.H., Kim, J.H., Choi, S.Y., Cha, K.Y., and Chung, H.M. (2005). Characterization of putative cis-regulatory elements that control the transcriptional activity of the human Oct4 promoter. J Cell Biochem 96, 821-830.

Yu, M., Riva, L., Xie, H., Schindler, Y., Moran, T.B., Cheng, Y., Yu, D., Hardison, R., Weiss, M.J., Orkin, S.H., et al. (2009). Insights into GATA-1-mediated gene activation versus repression via genome-wide chromatin occupancy analysis. Mol Cell 36, 682-695.

Zentner, G.E., Tesar, P.J., and Scacheri, P.C. (2011). Epigenetic signatures distinguish multiple classes of enhancers with distinct cellular functions. Genome Res.

Zhao, Z., Tavoosidana, G., Sjolinder, M., Gondor, A., Mariano, P., Wang, S., Kanduri, C., Lezcano, M., Sandhu, K.S., Singh, U., et al. (2006). Circular chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated intra- and interchromosomal interactions. Nat Genet 38, 1341-1347.

Zovoilis, A., Smorag, L., Pantazi, A., and Engel, W. (2009). Members of the miR-290 cluster modulate in vitro differentiation of mouse embryonic stem cells. Differentiation 78, 69-78.


Appendices

Appendix 1. Enhancer candidate list of the mouse erythroid cells.

Columns Fold, conserved, dist2ET, RNASum, and AvgPKb represent the fold enrichment of RNAPII-ser5 over input, whether the region is conserved, the shortest distance in base pairs from the TSS of the closest expressed Ensembl transcript, the sum of RNA tags within the transcript, and the average number of tags per kb, respectively. GATA1* denotes the GATA1 data from Soler et al. (2010).

Marker: RNAPII+/nucRNA- marker Transcribed TFs overlapping with marker chr start end Fold conserved geneName OverlapTFs chr12 55797829 55797877 842.58 TRUE Eapp chr11 87262309 87262357 533.53 TRUE Ppm1e chr13 23677285 23677741 472.13 TRUE Hist1h3d GATA1* chr9 65049589 65049637 466.09 TRUE Parp16 GATA1* chr13 21845197 21846013 443.94 TRUE Hist1h4k GATA1* chr13 23636413 23636965 443.94 TRUE Hist1h2bh chr13 23843101 23844661 410.72 TRUE Hist1h4b chr13 21878533 21879229 404.68 TRUE Hist1h1b chr4 116862589 116862637 398.64 TRUE Kif2c chr9 65044381 65044429 363.41 TRUE Parp16 GATA1* chr13 23673565 23673805 359.88 TRUE Hist1h3d chr7 110959813 110959861 352.33 TRUE Hbb-b1 GATA1* chr11 87236365 87236413 315.09 TRUE Ppm1e chr13 21874765 21874981 293.44 TRUE Hist1h1b chr13 21827077 21827389 285.39 TRUE Hist1h4k chr13 23625637 23628517 219.96 TRUE Hist1h4h chr13 22134253 22134373 205.86 TRUE Hist1h4i chr1 72272869 72272917 205.36 TRUE Smarcal1 chr12 55820221 55820269 203.85 TRUE U1 chr12 55835629 55835677 201.84 TRUE U1 chr5 115939549 115939597 190.76 TRUE Pxn GATA1 chr13 23643061 23643253 187.74 TRUE Hist1h1d chr13 23775325 23775373 186.23 TRUE Hist1h2bc GATA1* chr13 21924925 21924973 169.72 TRUE RP23-38E20.2 chr1 173433061 173433253 169.12 TRUE Usf1

92 chr11 101503117 101503165 154.22 TRUE Rdm1 chr11 87284893 87284941 138.08 TRUE Ppm1e chr13 22135045 22135093 135.06 TRUE Hist1h4i chr11 87275965 87276013 133.38 TRUE Ppm1e chr5 115940269 115940365 128.85 TRUE Pxn GATA1 chr11 68886925 68886973 113.42 TRUE Vamp2 GATA1* chr13 21813973 21814021 107.31 TRUE Hist1h3h chr12 55815229 55815277 107.21 TRUE U1 chr3 96290461 96290509 98.15 TRUE U1 chr3 153515341 153515389 97.65 TRUE SNORD45 LDB1,GATA1*,MTGR1 chrY 2786533 2792029 96.64 FALSE Ddx3y GATA1* chr11 101507053 101507125 93.82 TRUE Rdm1 chr12 60141373 60141709 89.59 TRUE Trappc6b chr2 82883077 82883221 85.57 FALSE AL928616.1 chr8 11557933 11557981 84.56 TRUE Ing1 chr13 21812989 21813517 82.55 TRUE Hist1h3h chr7 27564973 27565021 81.54 FALSE AC157553.2 chr17 80606581 80606821 77.51 TRUE Sfrs7 chr2 129123133 129123277 77.51 TRUE Ckap2l chr10 80139277 80139613 75.5 FALSE Mknk2 chr13 23790013 23790061 75.1 TRUE AL592149.2 GATA1* chr12 101437741 101437789 71.47 TRUE Calm1 chr3 90017509 90017701 71.47 TRUE Jtb GATA1 chr8 19719013 19721845 71.47 TRUE AC148089.2 chr11 101488717 101488765 70.63 TRUE Rdm1 chr10 39978493 39978541 70.22 TRUE Gtf3c6 GATA1* chr5 37052101 37052533 68.45 FALSE D5Ertd579e chr7 110973661 110973781 68.45 FALSE Hbb-b2 chr11 58762933 58763461 67.45 TRUE Hist3h2bb chr1 194571781 194572045 66.44 TRUE Hhat chr11 58762405 58762453 66.19 TRUE Hist3h2bb chr11 101519605 101519653 66.15 TRUE Rdm1 chr11 87256933 87256981 64.87 TRUE Ppm1e chr2 155971237 155972461 64.43 TRUE Nfs1 chr11 84686533 84686725 63.42 TRUE Ggnbp2 chr3 151856725 151856797 63.42 TRUE Dnajb4 chr7 80686525 80686597 63.42 TRUE Chd2 chr17 29574997 29575261 61.41 TRUE Pim1 GATA1,KLF1,LDB1,TAL1,GATA1*,MTGR1 chr19 60866653 60866701 61.41 TRUE Eif3a chr7 111019381 111019573 61.41 TRUE Hbb-b2 LDB1,TAL1,GATA1*,MTGR1

chr7 3659557 3660973 60.4 FALSE Rps9
chr4 131826037 131826109 59.52 TRUE Taf12 GATA1,GATA1*
chr7 125273077 125273317 59.39 TRUE Arl6ip1 GATA1
chr7 118266181 118266253 58.39 TRUE Eif4g2
chr19 45520021 45520141 57.38 TRUE AC126454.1 TAL1,GATA1*
chr19 8962333 8963269 55.37 TRUE Ints5
chrX 121284829 121285333 55.37 FALSE AC114001.2
chr11 29447533 29448253 54.36 TRUE Mtif2
chr18 15686869 15686917 54.36 FALSE Chst9
chr14 76904101 76904461 53.35 TRUE Tsc22d1
chr5 93971821 93973021 53.35 FALSE AC134841.1
chr11 97637389 97639189 51.34 FALSE SNORA21
chr17 40948069 40948117 51.34 TRUE Rhag LDB1,GATA1*
chr18 57134581 57135301 51.34 TRUE C330018D20Rik
chr16 33061117 33062101 50.33 FALSE Rpl35a
chr18 49992253 49992421 50.33 TRUE Dmxl1
chrX 120761773 120762133 50.33 FALSE AC114001.2
chr13 21879733 21880045 49.58 TRUE Hist1h1b
chr2 51607381 51607429 49.49 TRUE Rif1
chr5 15143917 15144325 49.33 FALSE Speer4d
chrX 122652949 122653885 49.33 TRUE Mdm4-ps
chr9 108208261 108208501 48.32 TRUE Rhoa
chr11 32164957 32165389 47.31 TRUE Mare TAL1,GATA1*
chr3 96254485 96254773 47.31 TRUE U1
chrY 2864725 2868949 47.31 FALSE Ddx3y GATA1*
chr11 83084989 83085037 46.53 TRUE AL603711.1 GATA1*
chr14 51586309 51586573 46.31 FALSE Pnp2
chr15 81352789 81353725 46.31 TRUE Rbx1 GATA1
chr7 111009661 111010357 46.31 TRUE Hbb-b2 GATA1,KLF1,LDB1,TAL1,GATA1*,MTGR1
chrX 122711509 122711869 46.31 FALSE Mdm4-ps
chr11 95244109 95245285 45.3 TRUE Slc35b1
chr3 130479085 130479157 45.3 FALSE Rpl34
chr4 68091613 68091901 45.3 FALSE RP23-222K9.2
chr7 13507381 13507957 45.3 TRUE Rps5
chrY 2794069 2796949 45.3 TRUE Ddx3y
chr11 87240229 87240277 44.76 TRUE Ppm1e
chr13 120251269 120251485 44.29 FALSE 4833420G17Rik
chr5 124380565 124380733 44.29 TRUE Hip1r
chr8 32279053 32279365 44.29 TRUE BC019943
chrX 122004253 122006773 44.29 FALSE Mdm4-ps

chr10 39977797 39978037 43.29 TRUE Gtf3c6 GATA1*
chr11 69476125 69476341 43.29 TRUE Mpdu1
chr12 52946029 52946149 43.29 TRUE Gm5785
chr12 111928405 111928597 43.29 FALSE Hsp90aa1
chr13 63916189 63917005 43.29 TRUE Ptch1
chr15 100057525 100057957 43.29 TRUE Atf1 GATA1,GATA1*
chr17 33855469 33855589 43.29 TRUE Mar-02
chr7 130752061 130752373 43.29 FALSE Arhgap17
chr8 125629237 125630077 43.29 TRUE snoMBII-202
chr1 72290821 72290893 42.78 TRUE Smarcal1
chr15 82980085 82980133 42.78 TRUE Poldip3
chr12 55839397 55839445 42.75 TRUE U1
chr18 35114221 35114725 42.28 TRUE Hspa9
chr4 141102133 141102181 42.28 TRUE Spen
chr5 56120605 56121061 42.28 FALSE AC134463.2
chr9 88413925 88415797 42.28 FALSE SNORD50
chr4 64158925 64159069 41.94 FALSE 6330416G13Rik
chr1 72301597 72301669 41.27 TRUE Smarcal1
chr13 21900157 21900541 41.27 TRUE RP23-38E20.1
chr13 105018757 105019165 41.27 TRUE Erbb2ip
chr15 98760805 98761333 41.27 TRUE Tuba1b
chr3 137806285 137807125 41.27 TRUE Dapp1
chr4 132066253 132066637 41.27 TRUE Sesn2
chr6 56831077 56831725 41.27 FALSE Nt5c3
chr10 79630165 79631029 40.27 TRUE Cirbp
chr3 151873213 151873333 40.27 TRUE Dnajb4
chr4 116875165 116875213 40.27 TRUE Kif2c
chr5 121669909 121669957 40.27 TRUE AU042671
chrX 47320381 47320453 40.27 FALSE BX813325.1
chrX 121887253 121887757 40.27 FALSE Mdm4-ps
chr17 35975029 35975077 39.76 TRUE Tubb5
chr11 90537277 90537325 39.26 FALSE Tom1l1 LDB1,TAL1,GATA1*
chr14 55173157 55173493 39.26 TRUE Haus4
chr2 73150573 73150933 39.26 TRUE Cir1
chr10 62978965 62979229 38.25 FALSE Sirt1
chr14 63379909 63380341 38.25 TRUE Ints6
chr15 57724093 57724261 38.25 TRUE Derl1
chr16 38562805 38562997 38.25 TRUE 4930455C21Rik
chr5 22056229 22056349 38.25 TRUE AC123661.1
chr5 105838141 105838741 38.25 FALSE Lrrc8b GATA1,GATA1*

chr8 129117973 129119053 38.25 TRUE Irf2bp2
chr9 96160069 96160429 38.25 TRUE Tfdp2
chrX 121268317 121270141 38.25 TRUE AC114001.2
chrX 121886221 121886701 38.25 FALSE Mdm4-ps
chrX 155040325 155040373 38.25 FALSE Rps6ka3
chrY 2854213 2856805 38.25 TRUE Ddx3y GATA1*
chr1 100693285 100693381 37.25 FALSE D1Ertd622e
chr16 13671685 13672069 37.25 TRUE Bfar
chr16 23127757 23128069 37.25 TRUE Rfc4
chr2 83255749 83255797 37.25 FALSE AL928616.1
chr6 122760925 122761597 37.25 TRUE Foxj2 GATA1*
chr7 85022917 85022965 37.25 FALSE Aen
chr9 65972605 65973037 37.25 TRUE Snx1
chrX 121897333 121899397 37.25 FALSE Mdm4-ps
chr1 135491053 135492493 36.24 TRUE Zc3h11a
chr11 5662141 5662405 36.24 TRUE Urgcp
chr12 70464997 70465381 36.24 FALSE Arf6
chr13 76155661 76156429 36.24 FALSE Gpr150 GATA1*
chr16 19235221 19235845 36.24 FALSE Hira
chr17 40978981 40980613 36.24 FALSE Rhag
chr18 70689805 70689901 36.24 TRUE Poli
chr19 24353989 24354829 36.24 TRUE Fxn GATA1
chr5 14926405 14926813 36.24 FALSE Speer4d
chr5 23427997 23428309 36.24 TRUE Fam126a GATA1,LDB1,TAL1
chr5 140797573 140797717 36.24 TRUE Mad1l1
chr8 59996221 59996869 36.24 TRUE Hmgb2
chr9 65063749 65064205 35.74 TRUE Parp16 GATA1,KLF1,LDB1
chr1 55084165 55084549 35.23 TRUE Sf3b1
chr1 68516677 68516965 35.23 FALSE Acadl
chr11 84953941 84954133 35.23 TRUE Usp32
chr12 31594717 31595005 35.23 FALSE Acp1
chr15 103088101 103089469 35.23 TRUE Nfe2 GATA1,LDB1,TAL1,MTGR1
chr16 29522053 29522101 35.23 FALSE Opa1
chr4 11132341 11132797 35.23 FALSE Ccne2
chr4 96517837 96518173 35.23 FALSE Nfia
chr5 109590661 109590733 35.23 FALSE Crlf2
chr6 83456629 83457013 35.23 TRUE Dguok
chr6 119989909 119989981 35.23 TRUE Wnk1
chr7 19813141 19813237 35.23 TRUE Opa3
chr7 54175429 54175981 35.23 TRUE Tsg101

chr8 87595549 87595957 35.23 TRUE Dhps
chr9 36575389 36575773 35.23 TRUE Stt3a
chr9 64129117 64129477 35.23 FALSE Tipin
chrX 120988045 120988933 35.23 FALSE AC114001.2
chr4 134483965 134485477 34.73 FALSE Syf2
chr1 176101765 176102149 34.23 FALSE Spna1 GATA1,LDB1,TAL1,GATA1*,MTGR1
chr10 6372589 6372637 34.23 TRUE Mthfd1l
chr12 73271893 73271965 34.23 FALSE Daam1
chr16 14163205 14163565 34.23 TRUE Nde1
chr16 40816933 40817245 34.23 FALSE RP23-283M18.1
chr18 6764917 6765397 34.23 TRUE Rab18
chr2 103962157 103962781 34.23 TRUE Cd59a KLF1
chr3 28704493 28705213 34.23 TRUE Eif5a2
chr3 95659717 95659933 34.23 TRUE Prpf3
chr3 142428157 142428229 34.23 TRUE Pkn2
chr4 17148109 17148469 34.23 FALSE Nbn
chr4 112947421 112947469 34.23 FALSE AL807236.2
chr4 147003829 147004933 34.23 FALSE AL626778.2
chr5 94091941 94092613 34.23 FALSE AC134841.1
chr6 36311509 36311821 34.23 FALSE Mtpn
chr6 87790141 87791365 34.23 FALSE Isy1
chr7 111014461 111014773 34.23 TRUE Hbb-b2 LDB1,TAL1,GATA1*
chr8 19942381 19947829 34.23 FALSE 2610005L07Rik
chrX 122696101 122697421 34.23 FALSE Mdm4-ps
chr1 171215725 171216061 33.22 FALSE Nuf2
chr17 74666797 74667325 33.22 FALSE Memo1 LDB1,GATA1*,MTGR1
chr2 140930005 140930077 33.22 FALSE Snrpb2
chr4 135411181 135411517 33.22 TRUE Sfrs13a
chr5 115790029 115790245 33.22 TRUE Gatc
chr5 136410517 136410589 33.22 TRUE Ywhag
chr7 97591117 97591165 33.22 TRUE Crebzf
chr8 11555245 11555893 33.22 TRUE Ing1
chr8 44730085 44730133 33.22 FALSE Snx25
chr8 72240277 72240709 33.22 TRUE Zfp866
chr8 72298645 72299053 33.22 TRUE Zfp866
chrX 120764485 120765685 33.22 FALSE AC114001.2
chr13 81022693 81022741 32.72 TRUE Arrdc3
chr1 85090405 85090573 32.21 FALSE A530040E14Rik
chr1 132097861 132097957 32.21 FALSE AC111067.2
chr1 140071837 140072029 32.21 TRUE Ptprc

chr11 86396797 86397661 32.21 TRUE Tmem49
chr11 87868333 87868669 32.21 TRUE Sfrs1
chr11 102269413 102270085 32.21 TRUE Slc25a39
chr11 120177541 120177637 32.21 TRUE Actg1 LDB1
chr13 24902653 24902725 32.21 FALSE BC005537
chr15 98860285 98860789 32.21 TRUE Tuba1c
chr16 18876877 18876925 32.21 TRUE Hira
chr4 73702525 73702789 32.21 FALSE AL844607.1
chr4 86498341 86500141 32.21 FALSE Rps6
chr5 15137365 15140629 32.21 TRUE Speer4d
chr5 46363381 46363573 32.21 FALSE Lcorl
chr6 24624301 24624541 32.21 FALSE Wasl
chr6 136467397 136467973 32.21 TRUE Atf7ip
chr8 19747117 19750501 32.21 FALSE AC148089.1
chr8 26711653 26712277 32.21 TRUE Whsc1l1
chr9 78023485 78023701 32.21 TRUE Ick GATA1,GATA1*
chrX 32056117 32056405 32.21 FALSE Dock11
chrX 122845981 122848213 32.21 FALSE Mdm4-ps
chrX 131930077 131930485 32.21 FALSE Gprasp1
chr1 109722925 109723021 31.21 FALSE Vps4b
chr1 174665557 174665845 31.21 TRUE Tagln2
chr1 181733749 181734349 31.21 TRUE Ahctf1 GATA1,LDB1,MTGR1
chr12 3285517 3285589 31.21 TRUE Rab10
chr12 60703453 60703909 31.21 FALSE Fbxo33
chr12 71553565 71554069 31.21 TRUE Tmx1 LDB1,TAL1,GATA1*,MTGR1
chr14 56169277 56170597 31.21 TRUE Dcaf11
chr16 35809045 35809309 31.21 FALSE Hspbap1
chr2 129037861 129038053 31.21 TRUE Slc20a1
chr2 165709861 165710101 31.21 TRUE Zmynd8
chr2 167459869 167460205 31.21 TRUE Ube2v1 GATA1
chr3 68848885 68849173 31.21 FALSE Trim59
chr4 120624445 120624541 31.21 TRUE Zfp69 GATA1*
chr5 129530509 129530989 31.21 FALSE Stx2
chr5 130322461 130322869 31.21 TRUE Sumf2
chr6 129136933 129137269 31.21 FALSE Clec2d
chr7 81226741 81227077 31.21 FALSE Chd2
chr7 148633357 148634149 31.21 TRUE Lrdd
chr8 20034013 20034709 31.21 FALSE AC152164.1
chr8 80434045 80434165 31.21 FALSE Arhgap10
chr9 8480077 8480245 31.21 FALSE Birc3

chr9 20692453 20692789 31.21 TRUE Ppan
chr9 84966541 84966589 31.21 FALSE Ibtk
chrX 7960909 7961029 31.21 FALSE Slc38a5
chrX 122712397 122713213 31.21 FALSE Mdm4-ps
chr1 48471013 48471133 30.2 FALSE Slc39a10
chr1 88255741 88256101 30.2 TRUE Ncl
chr1 130457173 130457797 30.2 TRUE Cxcr4
chr1 169132093 169132165 30.2 FALSE AC142244.1
chr10 20065357 20065933 30.2 FALSE Fam54a
chr10 50945509 50945845 30.2 FALSE Gp49a
chr10 62845213 62845453 30.2 TRUE Sirt1
chr10 75227005 75227269 30.2 TRUE Cabin1 GATA1
chr11 29445013 29445397 30.2 TRUE Mtif2
chr15 58270933 58271197 30.2 FALSE D15Ertd621e
chr16 78892813 78892885 30.2 FALSE Usp25
chr17 29180389 29184397 30.2 TRUE Sfrs3
chr19 11461957 11462221 30.2 FALSE Ms4a4c
chr19 38894029 38894797 30.2 FALSE Tbc1d12 LDB1,TAL1,GATA1*,MTGR1
chr2 40621837 40622005 30.2 FALSE AL840626.1
chr2 125977621 125978197 30.2 TRUE Atp8b4
chr2 142870525 142870693 30.2 FALSE Snrpb2
chr3 66073645 66073909 30.2 FALSE Ccnl1
chr3 96263893 96264013 30.2 TRUE U1
chr4 126715309 126717157 30.2 TRUE AL606985.4
chr4 135426589 135426685 30.2 TRUE Pnrc2
chr4 138936229 138936493 30.2 TRUE Ubr4
chr4 145708549 145708789 30.2 FALSE 1700029I01Rik
chr4 146999293 147003157 30.2 TRUE AL626778.2
chr4 147011653 147013021 30.2 FALSE AL626778.2
chr5 74228749 74228797 30.2 FALSE AC164570.2
chr6 69741109 69741181 30.2 FALSE AC153612.9
chr6 71493589 71493997 30.2 TRUE Vps24
chr6 103264621 103264693 30.2 FALSE Ppp4r2
chr9 64640869 64640941 30.2 TRUE F730015K02Rik GATA1,LDB1,GATA1*
chr9 65974357 65974429 30.2 TRUE Snx1
chr9 79334077 79334245 30.2 FALSE Tmem30a
chr9 89105821 89106373 30.2 TRUE AC163666.1
chr9 93332053 93332317 30.2 FALSE Slc9a9
chrX 121927165 121927453 30.2 FALSE Mdm4-ps
chrX 122693077 122693917 30.2 FALSE Mdm4-ps

chrX 122705317 122706325 30.2 FALSE Mdm4-ps
chrY 2773909 2773957 30.2 FALSE Ddx3y
chr1 4847437 4847773 29.19 TRUE Tcea1
chr1 88240885 88243261 29.19 TRUE Ncl
chr10 57399325 57399613 29.19 FALSE Serinc1
chr10 63127885 63128173 29.19 FALSE Sirt1
chr11 101303605 101304781 29.19 TRUE Ifi35
chr12 40949773 40950181 29.19 TRUE Arl4a
chr12 70784653 70784725 29.19 TRUE Sos2
chr12 99061957 99062005 29.19 FALSE Zc3h14
chr14 47392381 47393821 29.19 TRUE Cdkn3
chr14 54873493 54873685 29.19 TRUE Dad1
chr16 58681021 58683373 29.19 TRUE Cpox
chr17 7183285 7183357 29.19 TRUE Rnaset2b
chr17 56084317 56084509 29.19 FALSE Ccdc94
chr19 7280845 7281133 29.19 TRUE Otub1
chr19 11672005 11672101 29.19 FALSE Ms4a6b
chr3 83101477 83101549 29.19 FALSE Plrg1
chr3 95848597 95848765 29.19 TRUE Vps45
chr4 7797637 7797685 29.19 FALSE Rab2a
chr4 18739669 18739789 29.19 FALSE Cnbd1
chr4 131865949 131867965 29.19 TRUE Rcc1 TAL1
chr4 146996341 146996557 29.19 TRUE AL626778.2
chr5 15151021 15151141 29.19 FALSE Speer4d
chr5 135394909 135395341 29.19 FALSE Cldn13 GATA1,LDB1,MTGR1
chr5 152170957 152171965 29.19 TRUE Rfc3
chr8 9556165 9556213 29.19 TRUE Arglu1
chr8 113783029 113783293 29.19 TRUE Glg1
chr9 31087573 31087621 29.19 TRUE Prdm10
chr9 59139061 59139421 29.19 TRUE Adpgk GATA1,LDB1
chrX 43897501 43897549 29.19 TRUE AL672035.1
chrX 78442981 78443053 29.19 FALSE AL845169.1
chrX 120775837 120776437 29.19 TRUE AC114001.2
chrX 120951397 120951637 29.19 FALSE AC114001.2
chrX 121784749 121784917 29.19 FALSE Mdm4-ps
chr16 49839421 49840429 28.69 TRUE Cd47 GATA1,KLF1,LDB1,TAL1
chr3 60276805 60277093 28.69 TRUE Mbnl1
chr3 86602909 86603317 28.69 TRUE Dclk2 GATA1,LDB1,MTGR1
chr1 47646973 47647045 28.19 FALSE Slc39a10
chr1 134921893 134922037 28.19 TRUE Mdm4

chr1 145291645 145291933 28.19 FALSE Cdc73
chr1 150183301 150183637 28.19 FALSE AL592450.1
chr1 161162029 161162197 28.19 TRUE Rfwd2
chr10 16860445 16860733 28.19 FALSE Cited2
chr10 62065141 62065237 28.19 TRUE Ddx21
chr10 115254421 115254613 28.19 TRUE Tspan8 GATA1*
chr11 17853877 17854189 28.19 TRUE Wdr92
chr11 50139157 50139445 28.19 TRUE Canx
chr11 85048693 85048765 28.19 TRUE Appbp2
chr11 106873837 106874005 28.19 TRUE Kpna2
chr12 120729205 120729397 28.19 FALSE AC140349.1
chr13 5860357 5860429 28.19 TRUE Pitrm1
chr14 52639981 52640461 28.19 TRUE Zfp219
chr14 62239045 62240125 28.19 TRUE AC154660.3
chr15 4816141 4816261 28.19 FALSE Card6
chr16 5049925 5050093 28.19 TRUE Glyr1
chr16 11254573 11254741 28.19 TRUE Gspt1
chr17 15114853 15115573 28.19 TRUE 9030025P20Rik
chr17 51318661 51318805 28.19 TRUE Tbc1d5
chr17 66234397 66234781 28.19 TRUE Ralbp1
chr18 85205125 85205293 28.19 FALSE Cyb5
chr2 54423973 54424021 28.19 FALSE AL929546.1
chr3 16319437 16319509 28.19 FALSE Ythdf3
chr3 120994453 120994741 28.19 TRUE Alg14
chr4 21510949 21510997 28.19 FALSE Ccnc
chr4 39749725 39749989 28.19 FALSE Aco1
chr4 51315205 51315493 28.19 FALSE AL732619.4
chr4 100351981 100352077 28.19 FALSE Cachd1
chr5 10453789 10453861 28.19 FALSE 4930420K17Rik
chr5 15149053 15150421 28.19 FALSE Speer4d
chr5 56865037 56865085 28.19 FALSE AC134463.2
chr5 90653029 90653389 28.19 TRUE AU017193
chr5 100461541 100461733 28.19 FALSE Hnrpdl
chr5 128097853 128097925 28.19 TRUE Slc15a4
chr5 149858269 149858677 28.19 TRUE Hmgb1
chr6 38826733 38826781 28.19 TRUE Hipk2
chr6 39153181 39153565 28.19 FALSE Jhdm1d GATA1,LDB1,GATA1*,MTGR1
chr6 47738101 47738461 28.19 TRUE mmu-mir-704 GATA1*
chr6 72295093 72295453 28.19 TRUE Usp39
chr6 86586517 86586709 28.19 FALSE AC158662.2

chr6 146525941 146526565 28.19 TRUE Fgfr1op2 LDB1
chr7 12233077 12233125 28.19 FALSE Zfp110
chr7 46773325 46773373 28.19 TRUE AC113955.2
chr7 80703421 80703997 28.19 TRUE Chd2
chr7 150735013 150735469 28.19 TRUE Nap1l4
chr8 19799917 19800157 28.19 TRUE AC148089.1
chr8 19935829 19936885 28.19 FALSE 2610005L07Rik
chr8 19995733 19996165 28.19 TRUE AC152164.1
chr9 11342893 11343133 28.19 FALSE Maml2
chr9 31060237 31060477 28.19 TRUE Prdm10 GATA1,KLF1,GATA1*
chrX 95307325 95307637 28.19 FALSE Yipf6
chrX 97956229 97956421 28.19 FALSE Kif4
chrX 120759901 120760189 28.19 FALSE AC114001.2
chrX 121665229 121665325 28.19 FALSE Mdm4-ps
chrX 121925125 121926493 28.19 FALSE Mdm4-ps
chrX 137488405 137488453 28.19 TRUE Atg4a GATA1*
chr1 9474253 9474301 27.18 FALSE Sntg1
chr1 43286413 43286893 27.18 FALSE Tgfbrap1
chr1 155611477 155611549 27.18 TRUE Rnasel
chr11 3166357 3166573 27.18 TRUE Drg1
chr11 101520253 101520301 27.18 TRUE Rdm1
chr11 102046717 102046861 27.18 TRUE Lsm12
chr11 102889741 102890317 27.18 TRUE Nmt1 GATA1
chr11 117515749 117515797 27.18 TRUE Tnrc6c
chr11 120434797 120435109 27.18 TRUE P4hb
chr11 120443173 120443245 27.18 TRUE Arhgdia
chr12 6021205 6021301 27.18 FALSE Atad2b
chr12 29313349 29313973 27.18 FALSE Rps7
chr12 82057165 82057933 27.18 TRUE Sfrs5
chr13 50787973 50788261 27.18 FALSE Cks2
chr13 99087013 99087085 27.18 TRUE Fcho2
chr14 6573493 6573541 27.18 FALSE AC129222.4
chr14 44204629 44204677 27.18 FALSE AC140256.1
chr14 74525461 74525629 27.18 FALSE Esd
chr14 92784037 92784085 27.18 FALSE Pcdh9
chr15 17199805 17200141 27.18 FALSE Rnasen
chr16 19241293 19241989 27.18 TRUE Hira
chr16 22253773 22253869 27.18 FALSE Tra2b
chr17 7032061 7032229 27.18 TRUE Ezr TAL1
chr17 24210349 24210709 27.18 TRUE Kctd5

chr17 35372773 35373061 27.18 TRUE Nfkbil1
chr18 70727749 70727797 27.18 TRUE Mbd2
chr18 84255565 84255829 27.18 TRUE Tshz1
chr18 85175029 85175293 27.18 FALSE Cyb5
chr19 36908461 36908509 27.18 TRUE Tnks2
chr2 29643085 29643349 27.18 TRUE Trub2
chr2 34627549 34627957 27.18 TRUE Hspa5
chr2 103734037 103734325 27.18 TRUE AL928544.1 GATA1,LDB1,TAL1,GATA1*,MTGR1
chr2 114000973 114001045 27.18 TRUE Aqr
chr2 148498165 148498621 27.18 TRUE Gzf1
chr2 180413317 180414205 27.18 TRUE Dido1
chr3 15309493 15309661 27.18 FALSE Car2
chr3 57505597 57505789 27.18 FALSE Tsc22d2
chr3 58657453 58657549 27.18 FALSE 2810407C02Rik
chr3 65748685 65749549 27.18 TRUE Ccnl1
chr3 140390845 140391109 27.18 FALSE AC121279.1
chr4 33073645 33073933 27.18 TRUE Pnrc1 KLF1,LDB1,TAL1,GATA1*,ETO2,MTGR1
chr4 68330941 68331133 27.18 FALSE RP23-222K9.2
chr4 116821717 116822749 27.18 TRUE SNORD38
chr4 145660069 145660381 27.18 TRUE RP23-384D6.1
chr4 147169525 147169645 27.18 FALSE RP23-282C23.1
chr5 30980461 30980701 27.18 FALSE Cenpa
chr5 34678789 34679173 27.18 TRUE Rnf4
chr5 41259109 41259469 27.18 FALSE AC110538.1
chr5 95830933 95831317 27.18 FALSE Bmp2k
chr6 14753293 14753557 27.18 FALSE RP23-440D19.1
chr6 71780941 71781349 27.18 TRUE Immt
chr6 72389269 72389989 27.18 TRUE Mat2a
chr7 11706445 11706733 27.18 FALSE Zfp110
chr7 31435405 31435453 27.18 TRUE Rbm42
chr8 25863181 25863517 27.18 TRUE Plekha2
chr8 73046749 73047277 27.18 TRUE Uba52
chr8 74474893 74475061 27.18 FALSE BC049349
chr8 112392061 112392397 27.18 TRUE Phlpp2
chr9 44387677 44387725 27.18 TRUE Ddx6
chr9 88334101 88334557 27.18 FALSE Snx14
chr9 93510397 93510517 27.18 FALSE Slc9a9
chr9 108419149 108419461 27.18 TRUE Qrich1
chrX 7534021 7534405 27.18 FALSE Gata1
chrX 54646141 54646261 27.18 TRUE Rbmx

chrX 116015845 116015917 27.18 FALSE AC114001.2
chrX 121776325 121776661 27.18 FALSE Mdm4-ps
chrX 122017597 122018485 27.18 FALSE Mdm4-ps
chr11 101466325 101466373 26.3 TRUE Rdm1
chr1 4555333 4555381 26.17 FALSE Lypla1
chr1 50273653 50273845 26.17 FALSE Obfc2a
chr1 55040533 55040797 26.17 FALSE Sf3b1
chr1 64168621 64168693 26.17 TRUE Creb1
chr1 119065189 119065501 26.17 FALSE Mki67ip
chr1 138163021 138163525 26.17 TRUE Camsap1l1
chr1 192125869 192126277 26.17 TRUE Rps6kc1
chr10 84038653 84038965 26.17 TRUE Tcp11l2 GATA1,LDB1,MTGR1
chr11 14112853 14113093 26.17 FALSE Sec61g
chr11 22890469 22890973 26.17 TRUE Commd1
chr11 76891837 76892173 26.17 TRUE Ccdc55
chr11 86571853 86572405 26.17 TRUE Cltc
chr11 103934245 103934317 26.17 FALSE Nsf
chr12 70256941 70258117 26.17 TRUE Ppil5
chr12 72075493 72075685 26.17 TRUE Psma3
chr13 23635765 23635813 26.17 TRUE Hist1h2bh
chr13 25112293 25112461 26.17 TRUE Mrs2
chr13 58490701 58491061 26.17 FALSE Rmi1
chr13 91064893 91066165 26.17 FALSE Rps23
chr14 19099693 19099981 26.17 TRUE Rpl15
chr14 55221997 55222477 26.17 TRUE mmu-mir-686 GATA1
chr14 76248565 76249669 26.17 FALSE Tpt1
chr15 54903829 54903877 26.17 TRUE Taf2
chr15 57247741 57247885 26.17 FALSE Zhx2
chr15 96304501 96304813 26.17 FALSE Sfrs2ip
chr16 19067221 19067293 26.17 TRUE Hira
chr16 58863133 58863205 26.17 FALSE Cldnd1
chr17 29244853 29245477 26.17 TRUE Cdkn1a
chr18 35719621 35720005 26.17 FALSE Matr3
chr18 45470629 45470941 26.17 FALSE Dcp2
chr18 67801069 67801141 26.17 TRUE Cep76 GATA1
chr18 68543509 68543605 26.17 FALSE Rnmt
chr19 4201357 4201861 26.17 TRUE Rad9
chr19 8967349 8967613 26.17 TRUE Ints5
chr19 15509245 15509293 26.17 FALSE Psat1
chr19 43827253 43827397 26.17 TRUE Slc25a28

chr2 31983205 31983997 26.17 TRUE Bat2l GATA1,LDB1,TAL1,MTGR1
chr2 98944333 98944405 26.17 FALSE AL837506.2
chr2 102913621 102914269 26.17 TRUE Pdhx
chr3 7912141 7912261 26.17 FALSE Mrps28
chr3 68822821 68823181 26.17 TRUE Smc4
chr3 70632997 70633045 26.17 FALSE Ppm1l
chr3 100242949 100243189 26.17 FALSE Fam46c GATA1,LDB1,TAL1,GATA1*
chr3 146163565 146164189 26.17 FALSE Rpf1
chr3 151902661 151902973 26.17 TRUE Fubp1
chr4 17675653 17676205 26.17 FALSE Cnbd1
chr4 46385437 46385677 26.17 TRUE 5830415F09Rik
chr4 66899221 66899269 26.17 FALSE RP23-222K9.2
chr4 126714517 126714805 26.17 TRUE AL606985.4
chr4 145574773 145574989 26.17 FALSE AL627077.5
chr4 145718965 145719133 26.17 FALSE 1700029I01Rik
chr4 147046621 147046741 26.17 FALSE RP23-282C23.1
chr5 14960869 14962045 26.17 TRUE Speer4d
chr5 28643461 28643509 26.17 TRUE Prr8
chr5 64049653 64049797 26.17 FALSE Rell1
chr5 93936325 93937741 26.17 FALSE AC134841.1
chr5 93959221 93959677 26.17 FALSE AC134841.1
chr5 94013029 94013077 26.17 FALSE AC134841.1
chr5 124777621 124777813 26.17 TRUE Mphosph9
chr5 130221397 130221469 26.17 FALSE Sumf2 GATA1
chr5 151325077 151325413 26.17 TRUE Brca2
chr6 21349213 21349525 26.17 FALSE Fam3c
chr6 84716053 84716365 26.17 FALSE Smyd5
chr7 16638013 16638181 26.17 TRUE Napa GATA1
chr7 19857253 19857301 26.17 TRUE Vasp
chr7 20742901 20743069 26.17 FALSE Ceacam16
chr7 24814477 24814765 26.17 FALSE Zfp180
chr7 77825749 77825965 26.17 FALSE AC136516.2
chr7 87438973 87439453 26.17 TRUE Prc1
chr7 116665693 116666029 26.17 FALSE SNORA3
chr8 19911685 19911733 26.17 FALSE 2610005L07Rik
chr8 19997245 19997557 26.17 TRUE AC152164.1
chr8 60881869 60882157 26.17 FALSE Galnt7
chr8 73674229 73674709 26.17 FALSE Haus8
chr9 22706149 22706197 26.17 FALSE Bbs9
chr9 50412061 50412205 26.17 TRUE Dlat

chr9 61794541 61794853 26.17 TRUE Kif23 GATA1*
chr9 64020541 64021549 26.17 TRUE Rpl4
chr9 120034813 120035005 26.17 TRUE Rpsa
chrX 32927293 32928061 26.17 FALSE Dock11
chrX 38450269 38450437 26.17 FALSE Thoc2
chrX 39501877 39502213 26.17 TRUE Stag2
chrX 121287805 121288837 26.17 FALSE AC114001.2
chrX 128530669 128530885 26.17 FALSE Trmt2b
chr12 55830685 55830733 26.01 TRUE U1
chr14 69922501 69923269 25.67 FALSE Slc25a37 GATA1,KLF1,LDB1,TAL1,GATA1*,MTGR1
chrX 121707541 121707829 25.67 FALSE Mdm4-ps
chr12 92828149 92828245 25.5 TRUE Gtf2a1
chr1 36364621 36364669 25.17 TRUE Arid5a
chr1 53370253 53370565 25.17 TRUE Pms1
chr1 162142933 162143389 25.17 TRUE Rabgap1l
chr1 171899629 171899749 25.17 TRUE Uap1
chr10 21714517 21714565 25.17 TRUE Raet1e
chr10 29340901 29340949 25.17 FALSE AC107669.1
chr10 78546301 78546685 25.17 FALSE Ilvbl
chr11 23983597 23983885 25.17 TRUE Bcl11a
chr11 51429229 51429397 25.17 TRUE Nhp2
chr11 58717357 58717693 25.17 FALSE Zfp39
chr11 85161157 85161229 25.17 FALSE Bcas3
chr12 21379333 21379645 25.17 TRUE Adam17
chr13 21271813 21271933 25.17 TRUE Trim27
chr13 68425309 68425693 25.17 FALSE Mtrr
chr13 87327829 87328165 25.17 FALSE Cox7c
chr14 15103981 15104221 25.17 FALSE Il3ra
chr15 76157365 76157917 25.17 TRUE Gpaa1
chr16 83451973 83452117 25.17 FALSE Atp5j
chr16 90220957 90221029 25.17 TRUE Sod1
chr17 37564837 37565053 25.17 FALSE Gabbr1
chr17 90208741 90209125 25.17 FALSE Klraq1
chr18 23124349 23124397 25.17 FALSE Mapre2
chr18 73975045 73975549 25.17 TRUE Me2
chr19 37297357 37297621 25.17 TRUE Mar-05
chr19 49146757 49146829 25.17 FALSE D19Ertd652e
chr2 29928061 29928469 25.17 TRUE Set
chr2 34661557 34661605 25.17 TRUE Fbxw2
chr2 50152189 50152381 25.17 TRUE Epc2

chr2 90215389 90215653 25.17 FALSE Ptprj
chr2 131087917 131088229 25.17 TRUE Pank2
chr2 154694581 154694941 25.17 FALSE Eif2s2
chr2 154718485 154718869 25.17 TRUE Eif2s2
chr2 179992093 179992525 25.17 TRUE Cables2
chr3 58405957 58406149 25.17 FALSE 2810407C02Rik
chr3 95428621 95428837 25.17 TRUE Ensa
chr3 135488653 135488725 25.17 TRUE Nfkb1 GATA1
chr4 14753893 14754013 25.17 TRUE Tmem64
chr4 23135557 23135605 25.17 FALSE AL772326.1
chr4 24225013 24225085 25.17 FALSE BX004998.1
chr4 34725517 34725565 25.17 TRUE Zfp292
chr4 93361477 93361693 25.17 FALSE AL772318.1
chr4 109349005 109349053 25.17 TRUE Faf1
chr4 129488173 129488221 25.17 TRUE Ptp4a2 LDB1
chr4 133518133 133518469 25.17 FALSE Hmgn2
chr4 145600693 145601749 25.17 FALSE AL627077.5
chr5 15148093 15148261 25.17 TRUE Speer4d
chr5 68204653 68204917 25.17 TRUE Atp8a1
chr5 74491021 74491333 25.17 FALSE 2700023E23Rik
chr5 79463197 79463245 25.17 FALSE Polr2b
chr5 148241413 148241653 25.17 TRUE Pan3
chr6 67241461 67242373 25.17 TRUE Serbp1
chr6 87788749 87788989 25.17 TRUE Isy1
chr6 131521165 131521549 25.17 FALSE Csda
chr7 29168077 29168125 25.17 TRUE Zfp36 GATA1
chr7 67923733 67923805 25.17 FALSE AC046145.1
chr7 97361101 97362061 25.17 FALSE Picalm
chr7 97604653 97604989 25.17 FALSE Crebzf
chr7 117203509 117205357 25.17 TRUE Zfp143 LDB1
chr8 18595117 18595213 25.17 FALSE Mcph1 LDB1
chr8 34495765 34495885 25.17 TRUE Wrn
chr8 38691445 38691613 25.17 FALSE Tusc3
chr8 43903501 43903549 25.17 FALSE Efha2
chr8 88740901 88741117 25.17 FALSE Itfg1
chr8 125588941 125589397 25.17 TRUE Spg7
chr9 7837189 7837501 25.17 FALSE Birc2
chr9 11653597 11653645 25.17 FALSE Maml2
chr9 16259725 16259821 25.17 FALSE BC017612
chr9 52914469 52914541 25.17 FALSE Npat

chr9 110207701 110208037 25.17 TRUE Scap
chrX 12819853 12820117 25.17 TRUE Ddx3x
chrX 39988621 39988933 25.17 FALSE Stag2
chrX 121206901 121206973 25.17 FALSE AC114001.2
chrX 121210861 121211221 25.17 FALSE AC114001.2
chrX 155339821 155340253 25.17 FALSE Rps6ka3
chr1 155334757 155334805 24.66 TRUE Dhx9
chr1 32724493 32724661 24.16 FALSE Prim2
chr1 42071509 42071605 24.16 FALSE Mrps9
chr1 44158789 44158837 24.16 TRUE Ercc5 GATA1
chr1 89927773 89928109 24.16 FALSE AC087801.1
chr1 95652853 95652901 24.16 FALSE Atg4b
chr1 98874085 98874181 24.16 FALSE AC102488.1
chr1 100845781 100845925 24.16 FALSE D1Ertd622e
chr1 102300685 102300997 24.16 FALSE D1Ertd622e
chr1 148294669 148294909 24.16 FALSE Fam5c
chr1 176179525 176179597 24.16 FALSE Spna1
chr1 176183485 176184445 24.16 TRUE Spna1
chr1 182186581 182186629 24.16 TRUE Psen2
chr10 7382893 7383109 24.16 TRUE Pcmt1
chr10 30521869 30522181 24.16 TRUE Ncoa7
chr10 78899629 78899893 24.16 FALSE Gm16517
chr10 88512493 88512757 24.16 FALSE Utp20
chr10 93286909 93286957 24.16 TRUE Metap2
chr11 74652325 74652397 24.16 TRUE Mnt KLF1
chr11 76660549 76660621 24.16 TRUE Cpd
chr11 117589525 117589861 24.16 FALSE Tnrc6c
chr12 34531669 34531813 24.16 TRUE Twistnb
chr12 93832309 93832477 24.16 FALSE Sel1l
chr12 102668989 102669133 24.16 FALSE Smek1
chr13 5065381 5065645 24.16 FALSE Net1
chr13 51298213 51298285 24.16 TRUE Cks2
chr13 80191381 80191429 24.16 FALSE Arrdc3
chr13 86773909 86773981 24.16 FALSE Cox7c
chr13 104968549 104968957 24.16 TRUE Erbb2ip
chr14 116971357 116971429 24.16 FALSE Gpc6
chr15 13043221 13043317 24.16 FALSE Rnasen
chr15 29582485 29582773 24.16 FALSE Dap
chr15 34366381 34366789 24.16 TRUE Rpl30
chr15 54386197 54386509 24.16 FALSE Taf2

chr16 3341029 3341293 24.16 FALSE Zfp263
chr16 19064485 19064533 24.16 FALSE Hira
chr16 53351581 53351677 24.16 FALSE Alcam
chr16 58309405 58309453 24.16 FALSE St3gal6
chr16 60776653 60776821 24.16 TRUE Gabrr3
chr16 82829413 82829725 24.16 FALSE Atp5j
chr16 86821093 86821357 24.16 TRUE Rnf160
chr17 65682541 65682829 24.16 FALSE Tmem232
chr17 73187077 73187341 24.16 TRUE Ypel5
chr17 80461717 80461861 24.16 TRUE Hnrpll
chr18 6516229 6516397 24.16 TRUE Epc1
chr19 20644165 20644285 24.16 FALSE Zfand5
chr19 29392069 29392213 24.16 FALSE 5033414D02Rik
chr19 53977285 53977453 24.16 TRUE Pdcd4
chr19 54008941 54009325 24.16 TRUE AC117612.1
chr19 56489797 56489941 24.16 FALSE Casp7
chr2 9970381 9970573 24.16 TRUE Taf3
chr2 29989189 29989237 24.16 TRUE Tbc1d13
chr2 67782229 67782541 24.16 FALSE Spc25
chr3 60980197 60980509 24.16 FALSE P2ry1
chr3 68858629 68858893 24.16 TRUE Trim59 KLF1,LDB1,MTGR1
chr3 107948941 107949253 24.16 TRUE Gnai3
chr3 117812413 117812461 24.16 FALSE Frrs1
chr3 137593285 137593621 24.16 FALSE Dapp1
chr3 158839765 158840029 24.16 FALSE Lrrc7
chr4 4953949 4954093 24.16 FALSE Rps20
chr4 25886605 25886797 24.16 FALSE Klhl32
chr4 35549029 35549341 24.16 FALSE Mobkl2b
chr4 54257965 54258181 24.16 FALSE Slc44a1
chr4 74323981 74324149 24.16 FALSE Kdm4c
chr4 116180125 116180293 24.16 TRUE Mast2
chr4 145600021 145600093 24.16 FALSE AL627077.5
chr4 145626157 145626757 24.16 FALSE RP23-384D6.1
chr4 146432629 146432677 24.16 TRUE AL929462.1
chr4 146487685 146487973 24.16 TRUE AL929462.1
chr4 146992357 146992741 24.16 FALSE AL626778.2
chr4 147063349 147063469 24.16 FALSE RP23-282C23.1
chr5 17416885 17417101 24.16 FALSE Cd36
chr5 28118197 28118581 24.16 TRUE Paxip1
chr5 59999605 59999821 24.16 FALSE Arap2

chr5 79158037 79158181 24.16 FALSE Polr2b
chr5 130732549 130732861 24.16 TRUE Tyw1
chr5 143682325 143682589 24.16 TRUE Actb
chr6 48626317 48626389 24.16 TRUE Gimap4
chr6 52663741 52663813 24.16 TRUE Tax1bp1
chr6 58649941 58650013 24.16 FALSE Abcg2
chr6 113646685 113646925 24.16 TRUE Tatdn2
chr6 125081029 125081821 24.16 TRUE Nop2
chr6 136776757 136777045 24.16 TRUE Wbp11
chr7 66383413 66383701 24.16 FALSE Ube3a
chr7 82975309 82975429 24.16 FALSE Klhl25 LDB1,TAL1,MTGR1
chr7 111436741 111436957 24.16 FALSE Trim5
chr7 114223405 114223453 24.16 FALSE Gm1966
chr7 119752261 119752381 24.16 FALSE Arntl
chr7 120656341 120656413 24.16 TRUE Btbd10
chr8 19726741 19727221 24.16 FALSE AC148089.2
chr8 74446501 74446573 24.16 TRUE BC049349
chr8 126514549 126514837 24.16 FALSE Abcb10 GATA1,GATA1*
chr9 20160901 20160949 24.16 FALSE Zfp26
chrX 7713277 7714357 24.16 FALSE Wdr13
chrX 36486517 36486685 24.16 FALSE Lamp2
chrX 41362765 41362957 24.16 FALSE Stag2
chrX 43016029 43016341 24.16 FALSE AL672035.1
chrX 72661813 72662077 24.16 TRUE Mtcp1
chrX 96016141 96016309 24.16 FALSE Yipf6
chrX 108078661 108078997 24.16 FALSE AL670648.1
chrX 121300501 121304629 24.16 FALSE AC114001.2
chrX 147022789 147022885 24.16 TRUE Apex2
chrY 1788229 1791061 24.16 TRUE Ddx3y GATA1*
chr16 38523013 38523085 23.66 TRUE 4930455C21Rik
chr9 24346117 24346597 23.6 FALSE Dpy19l1
chr1 9929893 9930445 23.15 TRUE SNORD87
chr1 25660789 25660837 23.15 FALSE Lmbrd1
chr1 75131125 75131221 23.15 FALSE 1810031K17Rik
chr1 83670445 83670781 23.15 FALSE Pid1
chr1 113858101 113858221 23.15 FALSE Vps4b
chr1 135957397 135957469 23.15 TRUE Btg2
chr1 174902437 174902581 23.15 FALSE Darc
chr10 22140973 22141261 23.15 FALSE H60b
chr10 62041237 62041357 23.15 TRUE 2510003E04Rik

chr10 62740069 62740141 23.15 FALSE Herc4 LDB1,TAL1,MTGR1
chr10 70126381 70126765 23.15 FALSE Ccdc6
chr10 80845309 80845477 23.15 TRUE Dohh GATA1,LDB1,TAL1,GATA1*,MTGR1
chr10 82480981 82481053 23.15 FALSE AC102114.2
chr10 101787325 101787397 23.15 FALSE Cep290
chr10 103331725 103331917 23.15 FALSE Ccdc59
chr11 12031693 12031789 23.15 FALSE Fignl1
chr11 40555477 40555645 23.15 FALSE Hmmr
chr11 77499445 77499781 23.15 TRUE Nufip2
chr11 107050789 107050885 23.15 TRUE Bptf
chr12 3426781 3427069 23.15 TRUE Asxl2
chr12 11243437 11244085 23.15 FALSE Gen1
chr12 32184373 32185213 23.15 TRUE Cbll1
chr12 43923949 43924213 23.15 FALSE Dnajb9
chr12 56499541 56499853 23.15 TRUE Nfkbia
chr12 59122909 59123269 23.15 FALSE Mipol1
chr12 59791477 59791525 23.15 FALSE Trappc6b
chr12 59889397 59889445 23.15 FALSE Trappc6b
chr12 69513205 69513253 23.15 FALSE Ppil5
chr12 87165709 87165781 23.15 TRUE Ttll5
chr12 92710237 92710285 23.15 FALSE 4930534B04Rik
chr13 3917485 3917581 23.15 TRUE Net1
chr14 17075677 17075989 23.15 FALSE Ngly1 GATA1
chr14 32994805 32995069 23.15 TRUE AC154532.1
chr14 42535261 42535573 23.15 FALSE 5730469M10Rik
chr14 45240445 45240853 23.15 FALSE AC164295.1
chr14 51572533 51573997 23.15 FALSE Pnp2
chr14 95765269 95765581 23.15 FALSE Klhl1
chr15 51792013 51792229 23.15 FALSE Rad21
chr15 75911149 75911317 23.15 TRUE Puf60
chr15 85846381 85846621 23.15 FALSE Celsr1
chr16 16799893 16800373 23.15 FALSE Top3b
chr16 19053181 19053229 23.15 TRUE Hira
chr17 29626861 29627149 23.15 TRUE Pim1 GATA1
chr17 32421781 32421829 23.15 TRUE Brd4
chr17 39008533 39008749 23.15 FALSE mmu-mir-715
chr18 26533357 26533525 23.15 TRUE AW554918
chr18 68358565 68358613 23.15 FALSE D18Ertd653e
chr19 9028285 9028333 23.15 TRUE Tut1
chr19 13765789 13765837 23.15 FALSE Tle4

chr19 15917461 15917509 23.15 FALSE Psat1
chr19 44182021 44182309 23.15 TRUE Chuk
chr19 54014005 54014053 23.15 FALSE AC117612.1
chr2 34634389 34635013 23.15 TRUE Hspa5
chr2 64155613 64155685 23.15 FALSE AL928703.1
chr2 90866821 90866893 23.15 FALSE Cugbp1
chr2 90867565 90867613 23.15 FALSE Cugbp1
chr2 103605493 103605757 23.15 TRUE Nat10
chr2 111808789 111808861 23.15 FALSE Lpcat4
chr2 114306997 114307213 23.15 FALSE Aqr
chr2 119434477 119434957 23.15 TRUE Nusap1
chr2 121841773 121841821 23.15 TRUE Eif3j
chr3 8923885 8924077 23.15 TRUE Mrps28
chr3 36133477 36133597 23.15 FALSE Mccc1
chr3 73107733 73107925 23.15 FALSE Pdcd10
chr3 105490477 105490549 23.15 TRUE Rap1a
chr3 110651917 110652037 23.15 FALSE Vav3
chr3 116365477 116365717 23.15 TRUE Hiat1
chr4 11083645 11083717 23.15 TRUE Trp53inp1
chr4 97955797 97955845 23.15 FALSE Inadl
chr4 108520381 108520501 23.15 TRUE Btf3l4
chr4 112303117 112303285 23.15 FALSE RP23-219O14.1
chr4 116793133 116793565 23.15 TRUE Plk3
chr4 126719389 126719581 23.15 FALSE AL606985.4
chr4 145587949 145588453 23.15 FALSE AL627077.5
chr4 155135317 155135413 23.15 TRUE Atad3a
chr5 11310229 11310421 23.15 FALSE 4930420K17Rik
chr5 14947669 14949133 23.15 FALSE Speer4d
chr5 23012197 23012245 23.15 TRUE Mll5
chr5 44592109 44592373 23.15 TRUE Tapt1
chr5 72506941 72507013 23.15 FALSE Atp10d
chr5 93951901 93953269 23.15 FALSE AC134841.1
chr5 139676269 139676413 23.15 TRUE Unc84a
chr6 44865277 44865349 23.15 FALSE Tpk1
chr6 48689101 48689581 23.15 TRUE Gimap5
chr6 51403981 51404293 23.15 TRUE Hnrnpa2b1
chr6 70693525 70693933 23.15 TRUE Igk-C
chr6 124915381 124915741 23.15 TRUE Cops7a
chr7 52964077 52964341 23.15 TRUE Dbp
chr7 72838117 72838285 23.15 TRUE Tarsl2

112 chr7 101540293 101540341 23.15 FALSE AC102906.2 chr7 115644661 115644877 23.15 FALSE Eif3f chr7 116745781 116746117 23.15 TRUE St5 chr8 19774669 19775845 23.15 FALSE AC148089.1 chr8 19961509 19961557 23.15 FALSE AC152164.1 chr8 19966069 19966549 23.15 FALSE AC152164.1 chr8 32159581 32160205 23.15 FALSE Rnf122 chr8 36630037 36630589 23.15 TRUE Mfhas1 LDB1,TAL1,MTGR1 chr8 51069373 51069709 23.15 FALSE Cdkn2aip chr8 63430453 63430501 23.15 TRUE Clcn3 chr8 72060133 72060781 23.15 FALSE Zfp866 chr8 72736837 72736885 23.15 TRUE Slc25a42 chr8 113521741 113521981 23.15 TRUE Ddx19a GATA1 chr8 131473957 131474293 23.15 FALSE Itgb1 chr9 15162373 15162421 23.15 TRUE 5830418K08Rik chr9 50619469 50619541 23.15 FALSE Alg9 GATA1,LDB1,TAL1,MTGR1 chr9 103254925 103255141 23.15 TRUE Cdv3 chrX 25351693 25351741 23.15 TRUE BX294196.1 chrX 27212389 27212485 23.15 TRUE BX294196.1 chrX 120113653 120114757 23.15 TRUE AC114001.2 chrX 120546637 120546853 23.15 FALSE AC114001.2 chrX 120641749 120641869 23.15 FALSE AC114001.2 chrX 121785685 121785829 23.15 FALSE Mdm4-ps chrY 458173 458365 23.15 FALSE Uty chr1 72283549 72283717 22.82 TRUE Smarcal1 chr6 120307045 120307405 22.65 TRUE Ccdc77 chr1 5465197 5465293 22.15 FALSE Atp6v1h chr1 55042717 55042789 22.15 TRUE Sf3b1 chr1 58562005 58562053 22.15 TRUE Orc2l chr1 58768477 58768909 22.15 FALSE Cflar GATA1* chr1 82312909 82313149 22.15 TRUE Rhbdd1 chr1 90102205 90102709 22.15 TRUE AC087780.1 chr1 100793653 100793725 22.15 FALSE D1Ertd622e chr1 131336173 131336221 22.15 FALSE Cxcr4 chr1 133423957 133424149 22.15 TRUE Srgap2 chr1 180253525 180253573 22.15 TRUE Fam36a chr1 180264061 180264109 22.15 TRUE Hnrnpu chr10 12728269 12728341 22.15 TRUE Stx11 chr10 28968109 28968421 22.15 FALSE AC107669.1 chr10 44628661 44628733 22.15 FALSE Prep

113 chr10 76721893 76721989 22.15 TRUE Pofut2 chr10 77761093 77761141 22.15 FALSE Agpat3 chr11 14816893 14817253 22.15 FALSE Sec61g chr11 54674317 54674557 22.15 TRUE Lyrm7 chr11 68718301 68720053 22.15 FALSE Rpl26 chr11 74711341 74711485 22.15 TRUE Tsr1 chr11 77846077 77846149 22.15 TRUE Dhrs13 chr11 82893349 82893445 22.15 FALSE Slfn2 chr11 84920725 84920773 22.15 TRUE AL669859.1 LDB1,TAL1,GATA1* chr11 85995181 85995325 22.15 FALSE Brip1 chr11 86504053 86504341 22.15 FALSE Ptrh2 chr12 36491557 36491749 22.15 FALSE Tspan13 chr12 45310981 45311053 22.15 TRUE Dnajb9 chr12 52843669 52843861 22.15 FALSE Strn3 chr12 67461781 67461877 22.15 FALSE C79407 chr12 70037005 70037893 22.15 FALSE Ppil5 chr12 104856397 104856469 22.15 FALSE Ifi27l1 chr13 27987685 27987733 22.15 FALSE Cdkal1 chr13 58227325 58227565 22.15 TRUE 5133401N09Rik chr13 79470997 79471333 22.15 FALSE Arrdc3 chr13 81727501 81727693 22.15 FALSE Cetn3 chr13 87714157 87714301 22.15 FALSE Cox7c chr14 6128485 6128677 22.15 FALSE AC129222.4 chr14 31362157 31362253 22.15 TRUE Tkt chr14 58423765 58424125 22.15 FALSE F630043A04Rik chr14 79790437 79790509 22.15 TRUE Naa16 chr14 97516981 97517245 22.15 FALSE Klhl1 chr14 101140813 101140933 22.15 FALSE Commd6 chr14 103116205 103116253 22.15 FALSE AC102815.1 chr14 115107565 115107805 22.15 FALSE Gpc5 chr14 124091221 124091581 22.15 FALSE A2ld1 chr14 124826149 124826221 22.15 FALSE A2ld1 chr15 19061869 19061941 22.15 FALSE Rnasen chr15 27526717 27526789 22.15 TRUE AC130671.1 chr15 29919661 29919757 22.15 FALSE Dap chr15 33764101 33764173 22.15 FALSE Laptm4b chr15 33801277 33801685 22.15 FALSE Laptm4b chr15 65433853 65434165 22.15 FALSE Tmem71 chr15 67497349 67497397 22.15 FALSE St3gal1 chr15 75672541 75672613 22.15 TRUE Zc3h3

114 chr16 21794077 21794509 22.15 FALSE AC130214.1 GATA1 chr16 51734125 51734245 22.15 FALSE Cblb chr16 61854877 61854925 22.15 FALSE RP24-200D20.1 chr16 71278357 71278549 22.15 FALSE Nrip1 chr16 78886621 78886741 22.15 FALSE Usp25 chr17 19399117 19399189 22.15 FALSE Fpr3 chr17 24986965 24987637 22.15 TRUE Hagh chr17 35302213 35302309 22.15 TRUE Bat2 chr17 37869277 37869517 22.15 FALSE Gabbr1 chr17 56176069 56176117 22.15 TRUE Sh3gl1 chr18 6696541 6696709 22.15 FALSE Rab18 chr18 73272733 73272925 22.15 FALSE Smad4 chr19 5388685 5388973 22.15 TRUE Sart1 chr19 8942989 8943277 22.15 FALSE Ints5 chr19 11114341 11114509 22.15 FALSE AC148321.1 chr19 37252837 37253077 22.15 TRUE Cpeb3 chr2 17408365 17408677 22.15 FALSE Mllt10 chr2 25187965 25188229 22.15 TRUE Man1b1 GATA1 chr2 36964909 36965005 22.15 FALSE AL845356.2 chr2 37278613 37278661 22.15 TRUE Rabgap1 chr2 81098197 81098341 22.15 FALSE RP24-364O14.1 chr2 88679677 88679773 22.15 FALSE AL928949.1 chr2 94493509 94493581 22.15 FALSE Ttc17 chr2 103602469 103602637 22.15 FALSE Nat10 chr2 105686677 105686821 22.15 FALSE Elp4 chr2 121285117 121285189 22.15 FALSE 2310003F16Rik chr2 154002229 154002397 22.15 FALSE AL732466.1 chr2 155974861 155975005 22.15 TRUE Nfs1 chr3 22864309 22864549 22.15 FALSE Tbl1xr1 chr3 46004845 46005013 22.15 FALSE AC113507.1 chr3 93731245 93731557 22.15 TRUE Mrpl9 chr3 121971781 121971853 22.15 TRUE Gclm chr3 130412245 130412293 22.15 TRUE Rpl34 chr3 133232485 133232557 22.15 FALSE Tet2 chr3 137330581 137330749 22.15 FALSE H2afz chr4 10423693 10423885 22.15 FALSE AL672160.2 chr4 11181397 11181469 22.15 TRUE Ints8 chr4 43137373 43137445 22.15 FALSE B230312A22Rik chr4 74345749 74346157 22.15 FALSE Kdm4c chr4 102947989 102948061 22.15 FALSE Slc35d1

115 chr4 112880437 112880701 22.15 FALSE AL807236.2 chr4 145580581 145581229 22.15 TRUE AL627077.5 chr4 146804773 146804821 22.15 FALSE AL954334.2 chr4 147146509 147147085 22.15 FALSE RP23-282C23.1 chr4 147180109 147180301 22.15 FALSE RP23-282C23.1 chr5 34915525 34916149 22.15 FALSE Add1 KLF1,LDB1,TAL1,MTGR1 chr5 65224261 65225317 22.15 FALSE Klf3 chr5 66089749 66089869 22.15 TRUE Pds5a chr5 72372997 72373261 22.15 FALSE Atp10d chr5 72935077 72935197 22.15 FALSE Txk chr5 94064101 94064461 22.15 FALSE AC134841.1 chr5 144675997 144676069 22.15 TRUE Pms2 chr6 24444709 24445021 22.15 FALSE Ndufa5 chr6 26914525 26914837 22.15 FALSE Pot1a chr6 38563381 38563621 22.15 FALSE Luc7l2 chr6 57530989 57531109 22.15 TRUE Herc5 chr6 96548917 96549061 22.15 FALSE Tmf1 chr6 101537869 101538085 22.15 FALSE Ppp4r2 chr6 107781805 107782285 22.15 FALSE Itpr1 chr6 131105749 131105869 22.15 FALSE Csda chr6 148160725 148161109 22.15 TRUE Ergic2 chr7 14990317 14990413 22.15 FALSE Lig1 chr7 25873093 25873405 22.15 TRUE Pou2f2 chr7 60025957 60026005 22.15 FALSE Svip chr7 65271877 65271949 22.15 FALSE AC158300.1 chr7 65360797 65360869 22.15 FALSE AC158300.1 chr7 67679413 67679485 22.15 FALSE AC046145.1 chr7 80685253 80685301 22.15 TRUE Chd2 chr7 87331237 87331573 22.15 TRUE Sema4b chr7 88907725 88907845 22.15 TRUE 3110040N11Rik chr7 133792885 133793245 22.15 TRUE Ccdc101 chr8 22713469 22713613 22.15 FALSE Atp7b chr8 26864989 26865037 22.15 TRUE Ddhd2 chr8 40014757 40014805 22.15 FALSE Tusc3 chr8 40633909 40634005 22.15 FALSE Tusc3 chr8 60869605 60869749 22.15 FALSE Galnt7 chr8 76787005 76787125 22.15 FALSE AC158385.1 chr8 86483773 86483821 22.15 TRUE Asf1b chr8 96560653 96560941 22.15 TRUE Nudt21 chr8 108404845 108404917 22.15 TRUE Edc4

116 chr8 114167749 114167917 22.15 TRUE Zfp1 LDB1,TAL1 chr9 21314605 21314725 22.15 TRUE Tmed1 GATA1 chr9 64585621 64585693 22.15 TRUE Megf11 chr9 82765765 82765861 22.15 TRUE Phip chr9 89993869 89993917 22.15 FALSE Morf4l1 chr9 95456893 95456965 22.15 FALSE Paqr9 LDB1,TAL1,GATA1* chr9 108249853 108249925 22.15 TRUE Usp4 chr9 111749773 111749821 22.15 FALSE Epm2aip1 chrX 20265373 20265469 22.15 TRUE Pctk1 chrX 26278285 26278429 22.15 FALSE BX294196.1 chrX 32069365 32069533 22.15 FALSE Dock11 chrX 54897349 54897421 22.15 FALSE Rbmx chrX 55729837 55730173 22.15 FALSE Rbmx chrX 62611261 62611501 22.15 FALSE AL662923.1 chrX 89213341 89213581 22.15 FALSE AL513468.1 chrX 99384349 99384709 22.15 TRUE Rgag4 chrX 119839861 119839909 22.15 FALSE AC114001.2 chrX 120319789 120320005 22.15 FALSE AC114001.2 chrX 121524997 121525045 22.15 FALSE AC114001.2 chrX 122672245 122672389 22.15 FALSE Mdm4-ps chrX 128346469 128346541 22.15 FALSE Trmt2b chrX 147170485 147170605 22.15 FALSE Apex2 chrX 150678013 150678301 22.15 FALSE RP23-174M24.1 chrX 153370309 153370381 22.15 FALSE RP23-174M24.1 chrX 158063389 158063605 22.15 FALSE Scml2 chr10 126726949 126727189 21.64 TRUE Ddit3 chr13 94158589 94159069 21.64 FALSE Jmy GATA1,KLF1 chr18 75163981 75164797 21.64 TRUE BC031181 chr6 112995973 112996141 21.64 TRUE Thumpd3 chr9 69871549 69871813 21.64 TRUE Gtf2a2 chr9 70505221 70505389 21.48 TRUE Fam63b chr1 9085213 9085453 21.14 FALSE Sntg1 chr1 45982453 45982597 21.14 TRUE Slc39a10 LDB1,GATA1* chr1 57434221 57434317 21.14 TRUE 9430016H08Rik chr1 80260597 80260669 21.14 FALSE Cul3 chr1 94830997 94831309 21.14 TRUE Capn10 chr1 135989677 135989725 21.14 TRUE Btg2 chr1 159126373 159126493 21.14 TRUE Ralgps2 chr1 180299077 180299413 21.14 TRUE Hnrnpu chr1 196957333 196957885 21.14 TRUE Cd46

117 chr10 35355325 35355397 21.14 FALSE Nt5dc1 chr10 63190357 63190429 21.14 FALSE Sirt1 chr10 79757893 79757965 21.14 TRUE Apc2 chr10 110725429 110725477 21.14 TRUE Bbs10 chr10 123233149 123233197 21.14 FALSE Usp15 chr11 5607853 5608021 21.14 TRUE Mrps24 chr11 16287301 16287349 21.14 FALSE Sec61g chr11 16850965 16851853 21.14 TRUE Fbxo48 chr11 49064053 49064101 21.14 TRUE Mgat1 chr11 51570037 51570253 21.14 TRUE Phf15 chr11 60745549 60745813 21.14 TRUE Map2k3 chr11 80220277 80220421 21.14 FALSE Zfp207 chr11 94148533 94148605 21.14 FALSE Luc7l3 chr11 107340781 107341189 21.14 TRUE Psmd12 chr12 40350277 40350445 21.14 FALSE Arl4a chr12 56256109 56256373 21.14 TRUE 1700047I17Rik2 chr12 59967421 59967925 21.14 FALSE Trappc6b chr12 87582829 87582901 21.14 TRUE 1700020O03Rik chr12 110030485 110030557 21.14 TRUE Yy1 chr13 3291973 3292141 21.14 FALSE AC125190.1 chr13 14866333 14866405 21.14 FALSE Arid4b chr13 19512733 19513093 21.14 FALSE Epdr1 chr13 23070493 23070733 21.14 FALSE Abt1 chr13 47203501 47204053 21.14 FALSE Dek LDB1 chr13 50602285 50602501 21.14 FALSE Iars chr13 102316117 102316429 21.14 FALSE AC107663.1 GATA1,LDB1,GATA1*,MTGR1 chr14 5816725 5816773 21.14 FALSE AC129222.4 chr14 37682245 37682293 21.14 FALSE Ghitm chr14 43393381 43393645 21.14 FALSE AC140256.1 chr14 44061181 44061277 21.14 FALSE AC140256.1 chr14 44879509 44879701 21.14 FALSE AC164295.1 chr14 45092269 45092365 21.14 FALSE AC164295.1 chr14 47451373 47451637 21.14 TRUE Cgrrf1 chr14 85650685 85650829 21.14 FALSE Diap3 chr14 124408597 124408813 21.14 FALSE A2ld1 chr14 125123941 125124109 21.14 FALSE A2ld1 chr15 3770845 3770941 21.14 FALSE Sepp1 chr15 36699085 36699421 21.14 FALSE Ywhaz chr15 62083957 62084053 21.14 TRUE H2afy3 chr15 70997893 70997965 21.14 FALSE Fam135b

118 chr15 88619005 88619053 21.14 TRUE Zbed4 chr15 88919221 88919437 21.14 TRUE 1300018J18Rik chr15 98661181 98661253 21.14 FALSE C430014K11Rik chr16 7758877 7759021 21.14 FALSE BC024814 chr16 8549293 8549341 21.14 FALSE BC024814 chr16 11232253 11232301 21.14 TRUE Gspt1 chr16 20498701 20499013 21.14 TRUE Eif2b5 chr16 38452885 38452957 21.14 TRUE AC154425.1 chr16 44921269 44921413 21.14 FALSE Btla chr16 74973733 74973829 21.14 FALSE Nrip1 chr17 15664285 15664429 21.14 TRUE Psmb1 chr17 24985741 24985909 21.14 TRUE Hagh GATA1,GATA1* chr17 27693421 27693469 21.14 TRUE Hmga1 chr17 28036645 28036813 21.14 TRUE Anks1 chr17 42205909 42206077 21.14 FALSE Cd2ap chr17 62088277 62088613 21.14 FALSE Fbxl17 chr17 70590469 70590565 21.14 FALSE Dlgap1 chr17 82815157 82815421 21.14 FALSE Eml4 chr18 23149981 23150029 21.14 FALSE Mapre2 chr18 45143341 45143533 21.14 FALSE Dcp2 chr18 51970885 51971125 21.14 FALSE Srfbp1 chr18 56926093 56926525 21.14 FALSE AC158763.1 chr18 63639061 63639373 21.14 FALSE Txnl1 chr18 64362445 64362493 21.14 FALSE Amd2 chr18 67959541 67959781 21.14 TRUE Cep192 chr19 29722813 29722861 21.14 TRUE Ermp1 chr19 43395037 43395085 21.14 FALSE Got1 chr19 43749781 43749829 21.14 TRUE Slc25a28 chr19 54987085 54987253 21.14 FALSE Acsl5 chr19 60835597 60835717 21.14 TRUE Eif3a chr2 9599005 9599221 21.14 FALSE Taf3 chr2 22000045 22000381 21.14 FALSE Myo3a chr2 31993285 31993525 21.14 FALSE Bat2l chr2 34660693 34660741 21.14 TRUE Fbxw2 LDB1 chr2 38921101 38921317 21.14 TRUE Golga1 chr2 39810325 39810397 21.14 FALSE AL840626.2 chr2 47451757 47451829 21.14 FALSE Orc4l chr2 58576117 58576165 21.14 FALSE Pkp4 chr2 59720701 59720749 21.14 TRUE Wdsub1 chr2 82624429 82624501 21.14 FALSE AL928616.1

119 chr2 84035917 84036037 21.14 FALSE AL928616.1 chr2 93998533 93998701 21.14 TRUE Hsd17b12 chr2 104464117 104464213 21.14 TRUE Hipk3 chr2 128961541 128961613 21.14 FALSE BX000699.2 chr2 149113813 149113861 21.14 FALSE Cst3 chr2 155978461 155978677 21.14 TRUE Nfs1 chr2 156138229 156138397 21.14 TRUE 4921517L17Rik GATA1 chr3 16896517 16896589 21.14 FALSE Ythdf3 chr3 17895037 17895085 21.14 FALSE Armc1 chr3 18249541 18249589 21.14 FALSE Armc1 chr3 25446877 25446949 21.14 TRUE Nlgn1 chr3 30525301 30525349 21.14 FALSE Mynn chr3 30664501 30664573 21.14 TRUE Sec62 GATA1,LDB1,TAL1,GATA1*,ETO2,MTGR1 chr3 33716749 33716869 21.14 FALSE Ttc14 chr3 35279437 35279725 21.14 FALSE Atp11b chr3 58329613 58329685 21.14 TRUE Serp1 chr3 79395037 79395109 21.14 TRUE Ppid chr3 80270533 80270605 21.14 FALSE 4930579G24Rik chr3 107104333 107104381 21.14 TRUE Slc16a4 chr3 127540645 127540765 21.14 TRUE Ap1ar chr3 128055661 128055709 21.14 FALSE 5730508B09Rik chr3 130722253 130722397 21.14 TRUE Lef1 chr3 138060685 138060805 21.14 FALSE Adh5 chr4 16041133 16041181 21.14 FALSE Nbn chr4 56453197 56453245 21.14 FALSE Ctnnal1 chr4 95873725 95873773 21.14 FALSE Jun chr4 105353317 105353413 21.14 FALSE Usp24 chr4 145291861 145291933 21.14 TRUE AL627077.5 chr4 145418053 145418125 21.14 TRUE AL627077.5 chr4 145571341 145571413 21.14 FALSE AL627077.5 chr4 145727029 145727845 21.14 FALSE 1700029I01Rik chr4 145766533 145766581 21.14 FALSE 1700029I01Rik chr5 8960677 8960749 21.14 FALSE Crot chr5 15205813 15207037 21.14 TRUE Speer4d chr5 17334901 17335237 21.14 TRUE Cd36 chr5 22585429 22585477 21.14 FALSE BC050254 chr5 93960133 93960205 21.14 FALSE AC134841.1 chr5 94659805 94660093 21.14 FALSE AC134841.1 chr6 40490125 40490269 21.14 FALSE Braf chr6 67223629 67223893 21.14 TRUE Serbp1

120 chr6 83075077 83075149 21.14 TRUE Wbp1 chr6 142335517 142335589 21.14 TRUE Recql chr7 11061157 11061445 21.14 FALSE Zfp110 chr7 13623445 13623709 21.14 TRUE Ube2m chr7 13910413 13910581 21.14 TRUE Lig1 chr7 31065349 31065517 21.14 TRUE Wdr62 chr7 70930045 70930213 21.14 FALSE AC148977.1 chr7 80638549 80638597 21.14 TRUE Chd2 chr7 91800109 91800157 21.14 TRUE Zfand6 GATA1,LDB1,TAL1,GATA1* chr7 101121493 101121757 21.14 FALSE AC102906.2 chr7 107006941 107006989 21.14 TRUE Neu3 chr7 111007837 111007909 21.14 TRUE Hbb-b2 GATA1,KLF1,LDB1,TAL1,GATA1*,MTGR1 chr7 129306829 129306901 21.14 FALSE Plk1 chr7 139768189 139768405 21.14 TRUE Lhpp chr7 147657973 147658093 21.14 FALSE Echs1 chr8 5611741 5611813 21.14 FALSE AC132451.1 chr8 19859101 19860109 21.14 FALSE 2610005L07Rik chr8 19934269 19934869 21.14 FALSE 2610005L07Rik chr8 20009917 20010109 21.14 FALSE AC152164.1 chr8 20036581 20037421 21.14 FALSE AC152164.1 chr8 39895021 39895261 21.14 FALSE Tusc3 chr8 59302693 59302837 21.14 FALSE Fbxo8 chr8 62555437 62555485 21.14 FALSE 2700029M09Rik chr8 68068981 68069653 21.14 FALSE Mar-01 chr8 73678837 73678909 21.14 FALSE Haus8 chr8 73735045 73735093 21.14 FALSE Haus8 chr8 114516133 114516421 21.14 TRUE Adat1 chr9 3148429 3148501 21.14 FALSE AC131780.9 chr9 4183837 4183981 21.14 FALSE Gria4 chr9 62659765 62659837 21.14 TRUE Fem1b chr9 64153429 64154197 21.14 TRUE Tipin chr9 70551637 70551709 21.14 TRUE Adam10 chr9 77642725 77642773 21.14 TRUE Gclc chr9 90749989 90750157 21.14 FALSE AC152825.1 chr9 93807565 93807661 21.14 FALSE Slc9a9 chr9 94118005 94118053 21.14 FALSE Slc9a9 chr9 123431893 123432181 21.14 FALSE Limd1 chrX 46713997 46714093 21.14 FALSE Rbmx2 chrX 105157741 105157789 21.14 FALSE P2ry10 chrX 118954933 118955245 21.14 FALSE AC114001.2

121 chrX 120071509 120071557 21.14 FALSE AC114001.2 chrX 121458397 121458493 21.14 FALSE AC114001.2 chrX 121900573 121900765 21.14 FALSE Mdm4-ps chrX 121901269 121901509 21.14 FALSE Mdm4-ps chrX 122664829 122665213 21.14 FALSE Mdm4-ps chrX 122694613 122695645 21.14 FALSE Mdm4-ps chrX 146099053 146099341 21.14 FALSE Alas2 chrY 2765773 2765821 21.14 FALSE Ddx3y chr10 46474357 46474429 20.64 FALSE Hace1 chr8 19753429 19756813 20.64 TRUE AC148089.1 chr8 19941301 19941829 20.47 FALSE 2610005L07Rik chr1 5984893 5985157 20.13 FALSE Rb1cc1 chr1 6204805 6204901 20.13 TRUE Rb1cc1 chr1 7181317 7181629 20.13 FALSE AC171328.1 chr1 9789109 9789181 20.13 TRUE Sgk3 chr1 11797141 11797261 20.13 FALSE Prex2 chr1 14378533 14378581 20.13 FALSE Lactb2 chr1 27812341 27812629 20.13 FALSE Phf3 chr1 30858781 30858853 20.13 TRUE Phf3 chr1 46777309 46777717 20.13 FALSE Slc39a10 chr1 52290157 52290349 20.13 TRUE Gls chr1 58643269 58643773 20.13 TRUE Fam126b chr1 82721629 82721701 20.13 TRUE Mff chr1 96048853 96048997 20.13 FALSE Gm6086 chr1 98338261 98338741 20.13 FALSE AC102488.1 chr1 135496693 135496741 20.13 FALSE Zc3h11a chr1 144771613 144771685 20.13 FALSE Cdc73 chr1 152084725 152084917 20.13 FALSE BC003331 chr1 158649373 158649445 20.13 TRUE Fam20b chr1 168524365 168524653 20.13 FALSE Pogk chr1 183100645 183100717 20.13 TRUE Cnih4 chr1 187186597 187186813 20.13 TRUE Eprs chr10 11043301 11043349 20.13 FALSE Fbxo30 GATA1,LDB1,GATA1* chr10 20055373 20055445 20.13 FALSE Fam54a chr10 33254773 33254869 20.13 FALSE Zufsp chr10 40004917 40004989 20.13 FALSE Amd2 chr10 42941149 42941221 20.13 TRUE Pdss2 chr10 65885269 65885317 20.13 FALSE Jmjd1c chr10 78908317 78908509 20.13 FALSE Gm16517 chr10 90644173 90644701 20.13 TRUE Tmpo

122 chr10 94786621 94786693 20.13 TRUE Cradd chr10 125637109 125637205 20.13 FALSE Ctdsp2 chr11 4452949 4453021 20.13 TRUE Mtmr3 chr11 5426365 5426413 20.13 TRUE Xbp1 chr11 6343453 6343525 20.13 FALSE H2afv chr11 8982565 8982901 20.13 FALSE RP23-104F14.1 chr11 11513389 11513605 20.13 FALSE Ikzf1 chr11 13249381 13249549 20.13 FALSE Fignl1 chr11 16399405 16399549 20.13 FALSE Sec61g chr11 26287021 26287357 20.13 TRUE Fancl chr11 58805605 58805653 20.13 FALSE Trim11 chr11 59653213 59653477 20.13 TRUE Nt5m chr11 68869573 68869645 20.13 FALSE Aurkb chr11 74060365 74060461 20.13 TRUE Rap1gap2 chr11 80566789 80566837 20.13 TRUE Myo1d chr11 86171701 86171773 20.13 TRUE Med13 chr11 86560813 86560885 20.13 TRUE Cltc chr11 97887877 97888237 20.13 TRUE Rpl19 chr11 106847221 106847533 20.13 TRUE Kpna2 chr11 109474765 109475293 20.13 TRUE Wipi1 GATA1,LDB1 chr11 115992325 115992541 20.13 TRUE Trim65 chr12 3261757 3261805 20.13 TRUE 1700012B15Rik chr12 4234453 4234597 20.13 TRUE Cenpo GATA1 chr12 8214949 8214997 20.13 TRUE 1110057K04Rik chr12 12077797 12077845 20.13 FALSE Fam49a chr12 24846445 24846493 20.13 FALSE AC125543.1 chr12 40754365 40754413 20.13 TRUE Arl4a GATA1* chr12 50921797 50921869 20.13 FALSE G2e3 chr12 71062981 71063077 20.13 TRUE Atl1 chr12 72069613 72069973 20.13 TRUE Psma3 chr12 81891277 81891541 20.13 TRUE 4933426M11Rik chr12 120526093 120526189 20.13 FALSE AC140349.1 chr13 12350389 12350461 20.13 TRUE Mtr chr13 12878581 12878677 20.13 FALSE CT030166.1 chr13 13286101 13286221 20.13 FALSE Lyst chr13 15345565 15345637 20.13 FALSE Arid4b chr13 17897317 17897461 20.13 TRUE Cdk13 chr13 19156909 19156957 20.13 FALSE Vps41 chr13 23642221 23642293 20.13 TRUE Hist1h1d chr13 46777501 46777573 20.13 TRUE Nup153

123 chr13 57088309 57088357 20.13 TRUE Tgfbi chr13 58512493 58512541 20.13 FALSE Hnrnpk chr13 71452309 71452453 20.13 FALSE BC018507 chr13 91067557 91067605 20.13 FALSE Rps23 chr13 99580093 99580213 20.13 TRUE Fcho2 LDB1,GATA1* chr13 119920093 119920141 20.13 FALSE 4833420G17Rik chr14 7159621 7159669 20.13 FALSE AC129222.4 chr14 32064301 32064349 20.13 TRUE Bap1 chr14 36090325 36090373 20.13 FALSE Wapal chr14 53249845 53249893 20.13 FALSE Sall2 chr14 66437029 66437749 20.13 TRUE Esco2 chr14 76251637 76251709 20.13 FALSE Tpt1 chr14 80321533 80321677 20.13 FALSE Sugt1 chr14 81731941 81732037 20.13 FALSE Sugt1 chr14 88318573 88318645 20.13 FALSE Tdrd3 chr14 120674917 120674965 20.13 TRUE Mbnl2 chr14 124073101 124073149 20.13 FALSE A2ld1 chr14 124188109 124188397 20.13 TRUE A2ld1 chr15 3232669 3232741 20.13 TRUE Sepp1 chr15 19570141 19570189 20.13 FALSE Myo10 chr15 34368445 34368661 20.13 TRUE Rpl30 chr15 56718349 56718493 20.13 TRUE Zhx2 chr15 68090773 68090941 20.13 TRUE Zfat chr15 72920845 72920893 20.13 TRUE Chrac1 chr15 75812485 75812557 20.13 TRUE Tsta3 chr15 89520733 89520781 20.13 FALSE Shank3 chr15 98765437 98765485 20.13 TRUE Tuba1b chr16 3247093 3247141 20.13 FALSE Zfp263 chr16 6253861 6253909 20.13 FALSE Fam86 chr16 19245949 19246021 20.13 FALSE Hira chr16 45515485 45515629 20.13 FALSE Btla chr16 52669141 52669453 20.13 FALSE Alcam chr16 56962357 56962405 20.13 FALSE Tomm70a chr16 58962637 58962685 20.13 FALSE Cldnd1 chr16 70820029 70820125 20.13 FALSE Nrip1 chr16 79010965 79011421 20.13 FALSE Usp25 chr16 86728213 86729077 20.13 FALSE Rnf160 chr16 91013677 91013749 20.13 TRUE Synj1 chr17 15147205 15147373 20.13 TRUE Gm3435 chr17 21342037 21342109 20.13 FALSE Zfp677

124 chr17 28659805 28659853 20.13 TRUE Fkbp5 chr17 42349477 42349837 20.13 FALSE Cd2ap chr17 53706109 53706181 20.13 TRUE Kat2b chr17 54073117 54073165 20.13 FALSE Sgol1 chr17 69766573 69766621 20.13 FALSE Zfp161 chr17 73292941 73293013 20.13 FALSE Lbh chr17 74677477 74677621 20.13 TRUE Memo1 chr17 79729285 79729549 20.13 FALSE Cdc42ep3 chr17 80244277 80244349 20.13 FALSE Atl2 chr17 80971093 80971189 20.13 TRUE Cdkl4 chr18 15495469 15495541 20.13 TRUE Kctd1 chr18 15497149 15497269 20.13 FALSE Kctd1 chr18 27338629 27338701 20.13 FALSE AC102562.1 chr18 35718229 35718301 20.13 TRUE SNORA74 chr18 36960973 36961645 20.13 TRUE Zmat2 chr18 37705981 37706101 20.13 FALSE Taf7 chr18 61720717 61720789 20.13 FALSE Csnk1a1 chr18 63797893 63798085 20.13 FALSE Txnl1 chr18 64073509 64073605 20.13 FALSE Wdr7 chr18 64213237 64213309 20.13 FALSE Wdr7 chr18 70673605 70673677 20.13 FALSE Poli chr18 85017565 85017637 20.13 FALSE Cyb5 chr18 89684413 89684509 20.13 FALSE Socs6 chr19 8160997 8161069 20.13 FALSE AB056442 chr19 9829717 9829789 20.13 TRUE Incenp chr19 53695213 53695453 20.13 TRUE Smc3 chr2 15253141 15253285 20.13 FALSE Stam chr2 36616621 36617029 20.13 FALSE AL845356.2 chr2 44735581 44735725 20.13 TRUE Gtdc1 chr2 48084229 48084325 20.13 FALSE Orc4l chr2 59226205 59226325 20.13 FALSE Tanc1 chr2 70664965 70665013 20.13 TRUE Tlk1 LDB1,GATA1* chr2 82915885 82916197 20.13 FALSE AL928616.1 chr2 83774581 83774677 20.13 FALSE AL928616.1 chr2 87244573 87244621 20.13 FALSE Tnks1bp1 chr2 104333557 104333605 20.13 TRUE Hipk3 chr2 106764133 106764493 20.13 FALSE 2700007P21Rik chr2 114324061 114324397 20.13 FALSE Aqr chr2 117671485 117671557 20.13 FALSE Thbs1 chr2 123322957 123323725 20.13 FALSE Spata5l1

125 chr2 125450725 125450797 20.13 TRUE Cep152 chr2 125492629 125492677 20.13 TRUE Eid1 GATA1* chr2 134469709 134469829 20.13 TRUE Plcb4 GATA1 chr2 143210317 143210365 20.13 FALSE Snrpb2 chr2 155506645 155506789 20.13 TRUE Trpc4ap chr2 169977133 169977181 20.13 TRUE Zfp217 LDB1,MTGR1 chr2 176059741 176059909 20.13 FALSE Zfp831 chr2 177505045 177505141 20.13 TRUE Cdh4 chr3 9287917 9288133 20.13 FALSE Tpd52 chr3 11053669 11053717 20.13 FALSE Impa1 chr3 13260781 13260853 20.13 FALSE AC123747.1 chr3 14775397 14775541 20.13 FALSE AC123747.1 chr3 15084229 15084325 20.13 FALSE Car2 chr3 16376341 16376389 20.13 FALSE Ythdf3 chr3 21094837 21095005 20.13 FALSE Tbl1xr1 chr3 24887053 24887317 20.13 TRUE Nlgn1 chr3 28706149 28706221 20.13 TRUE Eif5a2 chr3 42945397 42945445 20.13 FALSE Phf17 chr3 52363021 52363093 20.13 TRUE AC142167.1 chr3 64440973 64441141 20.13 TRUE Gmps chr3 73539829 73539925 20.13 FALSE Pdcd10 chr3 75121045 75121117 20.13 FALSE Pdcd10 chr3 90279733 90280189 20.13 TRUE Ilf2 chr3 134210605 134210821 20.13 FALSE Cenpe chr3 135133069 135133429 20.13 FALSE Manba chr3 137071405 137071453 20.13 FALSE H2afz chr3 149637469 149637781 20.13 FALSE Dnajb4 chr3 159002101 159002173 20.13 FALSE Lrrc7 chr4 24212725 24212869 20.13 FALSE BX004998.1 chr4 27817213 27817261 20.13 FALSE Klhl32 chr4 28674853 28674901 20.13 FALSE Map3k7 chr4 40033453 40033597 20.13 FALSE Aco1 chr4 40545373 40545565 20.13 FALSE Dnaja1 chr4 40895557 40895629 20.13 TRUE Chmp5 chr4 46417405 46417453 20.13 TRUE Hemgn GATA1,LDB1,TAL1 chr4 48055981 48056053 20.13 TRUE Stx17 chr4 61485421 61485469 20.13 FALSE Slc31a2 chr4 62053765 62054221 20.13 TRUE Cdc26 chr4 73897501 73897621 20.13 TRUE Kdm4c chr4 81448525 81448573 20.13 FALSE Ttc39b

126 chr4 81857221 81857293 20.13 FALSE Ttc39b chr4 94645741 94645789 20.13 TRUE AL772318.1 chr4 101320861 101322301 20.13 TRUE Leprot chr4 115511581 115511653 20.13 TRUE Mknk1 chr4 116267317 116267629 20.13 FALSE Ccdc17 chr4 117807973 117808021 20.13 TRUE St3gal3 chr4 145543573 145543981 20.13 FALSE AL627077.5 chr4 145613965 145614301 20.13 FALSE AL627077.5 chr4 146782933 146783293 20.13 FALSE AL954334.2 chr4 149383117 149383381 20.13 TRUE H6pd chr4 151100605 151100677 20.13 FALSE Camta1 KLF1 chr5 8264773 8265013 20.13 FALSE Dbf4 chr5 9815605 9815653 20.13 FALSE 4930420K17Rik chr5 20493877 20493997 20.13 FALSE Ptpn12 chr5 23320237 23320309 20.13 FALSE AC117663.1 chr5 24083173 24083509 20.13 TRUE Abcf2 chr5 43403917 43404109 20.13 FALSE AC115947.1 chr5 55167997 55168165 20.13 FALSE AC134463.2 chr5 70680805 70680877 20.13 FALSE Atp10d chr5 88979485 88979605 20.13 FALSE Mobkl1a chr5 93997477 93998125 20.13 FALSE AC134841.1 chr5 94046725 94047661 20.13 TRUE AC134841.1 chr5 94089781 94090861 20.13 FALSE AC134841.1 chr5 98438437 98438485 20.13 TRUE Antxr2 chr5 100947445 100947493 20.13 TRUE Cops4 chr5 123270973 123271309 20.13 TRUE Anapc5 chr5 123572965 123573421 20.13 FALSE Rhof chr5 124932925 124933549 20.13 TRUE Setd8 chr6 5248381 5248477 20.13 TRUE Pon2 chr6 7817941 7817989 20.13 TRUE C1galt1 chr6 8058613 8058661 20.13 FALSE AC162391.1 chr6 11631157 11631229 20.13 FALSE Phf14 chr6 22077181 22077325 20.13 FALSE Fam3c chr6 37950229 37950493 20.13 FALSE Trim24 chr6 93051277 93051325 20.13 FALSE Zfyve20 chr6 96874357 96874405 20.13 FALSE Tmf1 chr6 109723309 109723429 20.13 FALSE Edem1 chr6 113640733 113640877 20.13 TRUE Irak2 GATA1,LDB1,MTGR1 chr6 133691845 133692061 20.13 FALSE Etv6 chr7 3870805 3870973 20.13 FALSE Pira2

127 chr7 11379709 11379925 20.13 TRUE Zfp110 chr7 13114789 13114909 20.13 FALSE Zfp110 chr7 17709757 17709949 20.13 FALSE Ppp5c chr7 23720725 23720797 20.13 FALSE Zfp180 chr7 24480805 24480901 20.13 FALSE Zfp180 chr7 39265981 39266029 20.13 FALSE 1600014C10Rik chr7 47365165 47365621 20.13 FALSE AC113955.2 chr7 47491141 47491309 20.13 FALSE AC113955.2 chr7 63305485 63305773 20.13 TRUE Herc2 chr7 80763661 80764045 20.13 TRUE Chd2 chr7 82732741 82732789 20.13 TRUE Akap13 chr7 94135549 94135621 20.13 FALSE Zfand6 chr7 95054581 95054869 20.13 FALSE l7Rn6 chr7 97359541 97360645 20.13 FALSE Picalm chr7 97600597 97601029 20.13 TRUE Crebzf chr7 99793645 99793741 20.13 TRUE Pcf11 chr7 110427781 110427829 20.13 FALSE Hbb-b1 chr7 113417125 113417245 20.13 FALSE Gvin1 chr7 135054301 135054421 20.13 FALSE Myst1 chr7 135446389 135446725 20.13 TRUE BC017158 chr7 141293749 141293917 20.13 FALSE Uros chr8 31149181 31149421 20.13 FALSE Rnf122 chr8 34407709 34407877 20.13 FALSE Wrn chr8 38545213 38545357 20.13 FALSE Lonrf1 chr8 50235853 50236093 20.13 FALSE Cdkn2aip chr8 72149029 72149149 20.13 TRUE Zfp866 chr8 83041021 83041069 20.13 FALSE AC126539.1 chr8 99638125 99638293 20.13 FALSE AC146815.1 chr8 103075501 103075573 20.13 FALSE AC122514.2 chr8 113498293 113498845 20.13 TRUE AC132945.5 chr8 115009333 115009645 20.13 FALSE Terf2ip chr9 15122029 15122149 20.13 TRUE SNORA25 chr9 16673701 16673773 20.13 FALSE BC017612 chr9 17117341 17117413 20.13 FALSE Chordc1 chr9 44449429 44449501 20.13 FALSE Ddx6 chr9 50336749 50336797 20.13 TRUE Pts chr9 57369085 57369157 20.13 TRUE 2310046O06Rik chr9 113146645 113146861 20.13 FALSE Pdcd6ip chr9 115368973 115369069 20.13 FALSE Stt3b chrX 12858013 12858205 20.13 TRUE Ddx3x

128 chrX 47535253 47535301 20.13 FALSE BX813325.1 chrX 57043165 57043597 20.13 FALSE RP23-419J18.1 chrX 60149413 60149461 20.13 FALSE RP23-267D10.1 chrX 63332989 63333037 20.13 FALSE AL662923.1 chrX 71542477 71542549 20.13 TRUE Atp6ap1 chrX 72340885 72340957 20.13 FALSE Gab3 chrX 78137869 78138181 20.13 FALSE AL845169.1 chrX 84183565 84183613 20.13 FALSE AL954720.1 chrX 106795645 106795693 20.13 FALSE AL670648.1 chrX 108079501 108079981 20.13 FALSE AL670648.1 chrX 120405901 120405997 20.13 FALSE AC114001.2 chrX 120523765 120523813 20.13 TRUE AC114001.2 chrX 120777037 120781261 20.13 FALSE AC114001.2 chrX 121655293 121655365 20.13 FALSE Mdm4-ps chrX 121702045 121702429 20.13 FALSE Mdm4-ps chrX 121939933 121940437 20.13 FALSE Mdm4-ps chrX 122656501 122656621 20.13 FALSE Mdm4-ps chrX 122699869 122699917 20.13 TRUE Mdm4-ps chrX 125894077 125894317 20.13 FALSE Mdm4-ps chrX 131123125 131123197 20.13 TRUE Hnrnph2 chrX 139367605 139367821 20.13 FALSE Ammecr1 chrX 146555317 146555389 20.13 FALSE Alas2 chrX 162948397 162948493 20.13 TRUE Tmsb4x chrY 2021173 2021557 20.13 FALSE Ddx3y chr6 103599157 103599205 20.03 TRUE Ppp4r2 chr4 116229373 116229781 19.8 TRUE Gpbp1l1 chr10 60836557 60836821 19.38 TRUE Prf1 chr5 83697949 83698333 19.13 FALSE Lphn3 chr7 59060053 59060101 19.13 TRUE Svip chr7 99879997 99880165 19.13 TRUE Rab30 chr11 69572149 69572197 18.88 TRUE Polr2a chr3 102933421 102933589 18.79 TRUE Dennd2c LDB1 chr9 96159205 96159349 18.79 TRUE Tfdp2 chr13 91870669 91870717 18.62 FALSE 4833422C13Rik chr5 64981957 64982413 18.62 FALSE Klf3 GATA1,LDB1,TAL1,GATA1*,MTGR1 chr5 96639037 96639181 18.62 TRUE Bmp2k LDB1,MTGR1 chr7 96149845 96150013 18.62 FALSE l7Rn6 chr9 57758821 57758869 18.62 TRUE Arid3b chr3 96177949 96178021 18.25 TRUE Hist2h3b chr1 24350077 24350197 18.12 FALSE AC116997.1

129 chr4 12164125 12164773 18.12 TRUE RP23-425K3.1 chr5 150057349 150057637 18.12 TRUE Uspl1 GATA1 chr6 66485053 66485509 18.12 TRUE AC165355.1 chr7 74790349 74790661 18.12 TRUE Igf1r chr7 88046173 88046269 18.12 TRUE Wdr73 chr7 135111013 135111085 18.12 TRUE Fus chrX 47305165 47305357 18.12 FALSE BX813325.1 chrX 78230797 78231109 18.12 FALSE AL845169.1 chr6 124691293 124691341 18.05 TRUE Grcc10 chr4 131864605 131864869 17.87 FALSE Rcc1 GATA1,LDB1,MTGR1 chr11 115274125 115274197 17.62 TRUE Ict1 LDB1 chr13 97907749 97908253 17.62 TRUE Gfm2 chr14 60870517 60870949 17.62 TRUE Nupl1 chr17 46293733 46293781 17.62 FALSE Mad2l1bp chr5 6557413 6557605 17.62 FALSE AC133939.1 chr5 15012541 15012733 17.62 FALSE Speer4d chr5 94006381 94009501 17.62 FALSE AC134841.1 chrX 120308197 120308269 17.62 FALSE AC114001.2 chr2 20891029 20891413 17.45 TRUE Arhgap21 chr11 6525829 6526405 17.11 TRUE Tbrg4 chr12 66136861 66137221 17.11 TRUE Prpf39 chr4 116883397 116883613 17.11 TRUE Kif2c chr7 71193613 71193877 17.11 FALSE Klf13 GATA1,LDB1,TAL1,MTGR1 chr7 88965421 88965493 17.11 FALSE Btbd1 chr8 20029933 20030701 17.11 FALSE AC152164.1 chr9 20296957 20297221 17.11 TRUE Zfp426 GATA1 chr1 16647109 16647181 16.61 TRUE Ube2w chr13 21912421 21912469 16.61 TRUE Hist2h4 chr14 37948645 37948741 16.61 TRUE Ghitm chr14 52690453 52690909 16.61 TRUE AC157572.3 chr16 4003789 4004125 16.61 TRUE Btbd12 chr18 3707797 3707965 16.61 FALSE Cul2 chr19 3080941 3081013 16.61 FALSE Ighmbp2 chr19 41986093 41986141 16.61 TRUE Rrp12 chr3 138189973 138190093 16.61 TRUE Eif4e chr5 14954917 14955253 16.61 FALSE Speer4d chr6 67227853 67228045 16.61 TRUE Serbp1 chr6 79078597 79078813 16.61 FALSE AC131340.1 chr6 107092813 107092861 16.61 FALSE Crbn chr8 26707381 26708653 16.61 TRUE Whsc1l1

130 chrX 119980549 119980645 16.61 FALSE AC114001.2 chr12 92827045 92827093 16.31 TRUE Gtf2a1 chr10 100111837 100112173 16.11 FALSE Cep290 chr12 56181469 56181829 16.11 TRUE Srp54c chr13 23637637 23637685 16.11 TRUE Hist1h2bh chr13 51941773 51942229 16.11 TRUE Sema4d chr14 92073157 92073229 16.11 FALSE AC098722.1 chr16 17922037 17922253 16.11 FALSE Slc25a1 LDB1 chr17 53828485 53829205 16.11 TRUE Sgol1 chr19 5894461 5894749 16.11 TRUE Tigd3 chr2 157193245 157193317 16.11 TRUE Manbal chr3 90237469 90237973 16.11 TRUE Ints3 chr4 83405269 83405317 16.11 FALSE 4930473A06Rik chr5 77334421 77334661 16.11 FALSE 2310040G07Rik chr7 13534789 13535341 16.11 FALSE Zfp324 chr8 71942581 71942821 16.11 FALSE Slc18a1 chr9 103190557 103190629 16.11 TRUE 1300017J02Rik chrX 14152069 14152117 16.11 FALSE AL833805.2 chr11 58128733 58129189 15.6 TRUE Zfp672 chr11 100870813 100870861 15.6 TRUE Atp6v0a1 chr11 102757285 102757333 15.6 TRUE Gfap GATA1,LDB1,MTGR1 chr14 52111381 52111453 15.6 FALSE Mett11d1 chr16 49777333 49777597 15.6 TRUE AC164566.2 LDB1,TAL1,GATA1* chr16 55966501 55966693 15.6 TRUE Rpl24 chr3 14448877 14449165 15.6 FALSE AC123747.1 chr4 44181373 44181613 15.6 TRUE Rnf38 chr5 50131429 50131813 15.6 FALSE Dhx15 chr6 11116261 11116333 15.6 FALSE Phf14 chr6 57695317 57695725 15.6 TRUE Vopp1 GATA1,LDB1,TAL1,GATA1* chr6 66984637 66984853 15.6 TRUE AC165355.1 GATA1,LDB1,TAL1,GATA1*,MTGR1 chr7 53178733 53178997 15.6 TRUE Emp3 LDB1 chr9 123839701 123840061 15.6 TRUE Fyco1 chr1 88050949 88051237 15.1 TRUE Armc9 chr11 121118797 121120093 15.1 TRUE Foxk2 chr12 30065269 30065581 15.1 FALSE Tssc1 chr12 70258741 70260085 15.1 TRUE Ppil5 chr15 75266989 75267493 15.1 FALSE Zfp41 chr17 29580133 29580661 15.1 FALSE Pim1 GATA1,LDB1,TAL1,MTGR1 chr18 72738925 72738973 15.1 FALSE Smad4 chr18 89540653 89540941 15.1 FALSE Socs6

chr2 131036221 131036269 15.1 TRUE 2310035K24Rik LDB1,GATA1*
chr3 91379773 91380301 15.1 FALSE AC127036.1
chr3 95021941 95022037 15.1 TRUE Gabpb2
chr4 94582813 94582933 15.1 FALSE AL772318.1
chr4 146784589 146784973 15.1 FALSE AL954334.2
chr6 60264661 60264709 15.1 FALSE Herc3
chr7 67469701 67470061 15.1 FALSE AC046145.1
chr8 74327581 74327749 15.1 FALSE Fcho1
chr8 83017933 83018005 15.1 TRUE Gypa GATA1,LDB1,TAL1,GATA1*
chr9 9470677 9470845 15.1 FALSE Birc3
chr9 48303325 48303781 15.1 TRUE Rbm7
chrX 131120221 131120269 15.1 TRUE Hnrnph2
chrX 141639085 141639349 15.1 FALSE Ammecr1


Appendix 2. Primers designed and used in erythroid study.

Primers designed or obtained from the lab for the erythroid study. NG stands for non-genic primer and pr stands for promoter primer.

Name           Forward Primer             Reverse Primer

Designed:
NG-Pim1        TTGCAGGCAGAGTTGGTTTA       GTGTCGTTGTCGTTGGAATG
NG-Ep300       GCAAGGGAGAGTGTCTGACC       CTCTGCGGAAGTAGCCTGAG
NG-Klf13       ACTGTTCTGGGGACTGTTGG       ATGGCACCTTGAATGAGAGC
NG-Hbb-3       CCCTAGCAGTCCTGAGATGC       TAGCCCAGACCCCTAGTCCT
NG-Hbb-5       TTTTCACCTTCCCTGTGGAC       CCAGCATCTAGCTTGAGCAC
NG-Slc25a37    ACGGCTTGGTAGAACACACC       GTCAGTGCCACAAGGTTGAA
pr-Hbb-b1      GCAAATGTGAGGAGCAACTG       CGAAGCCTGATTCCGTAGAG
NG-Wdr26       GGGTTTGGGCTCTAGAGGAG       GGTTGTGCGAAGACCAGATT
NG-Aplp2       GCCTTGTTGCCATATGTGAT       TTAAAATCCACCGGAACAGG

Obtained from lab:
NGpr-Hbb-1     ACTGAGCTGTTGGCACAATG       TGCATTAGTGGGAGCATGAG
NGpr-Lmo2-4    TTCCCTGCTAACTGGGTGAG       GTGTGCCGTAGACTGTGGAA
HbaE1          TTCTGACAGACTCAGGAAGAAACCA  AGATTGGTGGCCATGGTGCT
Vh16           CAAGAGAGTTTAGTGGACCCTCC    GCATAGCCTTTTCCACTCTCATC


Appendix 3. Weights of features used in multinomial logistic regression for ES cells

Weights of features used in lasso regularized multinomial logistic regression model to predict enhancer regions in mouse ES cells.

Feature          PrL        Enh        Unknown
Intercept        -3.56247   -0.71075   4.27322
RNAP2S5          0.203108   0          0
RNAP2S2          0          0          0
RNAPII           0          0          0
H3               0          0          0
H3K4me1          0          0.092797   -0.18618
H3K4me2          0.019107   0          -0.07448
H3K4me3          0.033292   0          0
H3K36me3         0          0          0
H4K20me3         0          0          0
H3K27me3         0          0          0
H3K9me3          0          0          0
CTCF             0.070769   0          0
p300             0          0.684674   -0.4018
Smc1a            0          0          0
Smc3             0          0          0
Med12            0          0.063368   0
Med1             0          0          0
Nipbl            0          0.010572   -0.23242
RNA              0          0          0
Most conserved   0          0          0
snp128           0          0          0
CpG island       0.998654   0          0
G+C Percent      0.091902   0          0
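As an illustration of how weights of this kind translate into class probabilities, the sketch below applies a softmax over the three class scores (PrL, Enh, Unknown). It assumes the usual multinomial logistic form, in which each class score is its intercept plus the weighted sum of the feature values. The function name, the dictionary layout and the example feature values are hypothetical, and the scale on which each feature was measured for the fitted model is not stated in this appendix, so the numbers fed in here are placeholders only.

```python
import math

# Intercepts and non-zero weights transcribed from Appendix 3.
# Features with weight 0 for a class are omitted because they do not
# change that class's score.
INTERCEPTS = {"PrL": -3.56247, "Enh": -0.71075, "Unknown": 4.27322}
WEIGHTS = {
    "PrL": {
        "RNAP2S5": 0.203108,
        "H3K4me2": 0.019107,
        "H3K4me3": 0.033292,
        "CTCF": 0.070769,
        "CpG island": 0.998654,
        "G+C Percent": 0.091902,
    },
    "Enh": {
        "H3K4me1": 0.092797,
        "p300": 0.684674,
        "Med12": 0.063368,
        "Nipbl": 0.010572,
    },
    "Unknown": {
        "H3K4me1": -0.18618,
        "H3K4me2": -0.07448,
        "p300": -0.4018,
        "Nipbl": -0.23242,
    },
}


def class_probabilities(features):
    """Return softmax class probabilities for one genomic window.

    `features` maps feature names to values; missing features count as 0.
    Assumption: scores are linear in the features and combined by softmax.
    """
    scores = {
        cls: INTERCEPTS[cls]
        + sum(w * features.get(name, 0.0) for name, w in WEIGHTS[cls].items())
        for cls in INTERCEPTS
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {cls: math.exp(score - top) for cls, score in scores.items()}
    total = sum(exps.values())
    return {cls: value / total for cls, value in exps.items()}


if __name__ == "__main__":
    # Hypothetical feature values, chosen only to exercise the function.
    window = {"p300": 10.0, "H3K4me1": 5.0}
    for cls, prob in class_probabilities(window).items():
        print(f"{cls}: {prob:.4f}")
```

Under this reading, a window with no signal in any feature is driven by the intercepts alone and falls into the Unknown class, while strong p300 and H3K4me1 signal shifts probability toward Enh and a CpG island shifts it toward PrL.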


Appendix 4. Top enhancer candidate list of mouse embryonic stem cells.

The top 1277 enhancer candidates (prob≥0.8) in mouse ES cells are listed. Columns represent the chromosome location (Chr, start and end), enhancer probability (Enh.prob), gene symbol of the closest Ensembl transcript (geneName), the location relative to genes (location), the distance of the enhancer to the TSS of the transcript (distance).

Chr start end Enh.prob geneName location distance
chr3 34660001 34661000 1 Sox2 intergenic 108618
chr7 3206001 3207000 1 mmu-mir-290 intergenic 11627
chr8 91527001 91528000 1 Sall1 intergenic 23147
chr9 110850001 110851000 1 Tdgf1 upstream 1430
chr12 87842001 87843000 0.9999 Esrrb genic 19570
chr4 55488001 55489000 0.9999 SNORA17 upstream 5970
chr8 44406001 44407000 0.9999 Zfp42 intergenic 13711
chr9 58127001 58128000 0.9997 Loxl1 downstream 8409
chr3 34657001 34658000 0.9996 Sox2 intergenic 105618
chr3 34659001 34660000 0.9996 Sox2 intergenic 107618
chr7 114380001 114381000 0.9993 AC132462.2 upstream 5536
chr7 148145001 148146000 0.9993 Ifitm2 upstream 3141
chr1 37042001 37043000 0.9992 5S_rRNA intergenic 19284
chr3 135209001 135210000 0.9992 Manba genic 8530
chr11 12367001 12368000 0.9987 Cobl upstream 2139
chr3 95457001 95458000 0.9987 Mcl1 upstream 4779
chr6 100328001 100329000 0.9985 Rybp intergenic 90566
chr6 64974001 64975000 0.9983 Smarcad1 intergenic 17661
chr6 122291001 122292000 0.9982 Phc1 upstream 477
chr17 71178001 71179000 0.9979 Dlgap1 downstream 7248
chr2 71704001 71705000 0.9978 Pdk1 upstream 6281
chr17 84750001 84751000 0.9975 Thada genic 112772
chr4 55477001 55478000 0.9975 RP23-120N11.5 genic 686
chr6 122292001 122293000 0.9975 Phc1 upstream 1477
chr17 45744001 45745000 0.9973 Gm7325 upstream 4985
chr5 73293001 73294000 0.997 AC158939.1 intergenic 11535
chr10 56094001 56095000 0.9965 Gja1 upstream 2136
chr2 20576001 20577000 0.9965 Etl4 genic 89443
chr4 141121001 141122000 0.9965 Fblim1 downstream 9977
chr10 62358001 62359000 0.9961 U6 upstream 1739
chr12 12794001 12795000 0.9959 U6 intergenic 87050

chr14 106258001 106259000 0.9959 Spry2 intergenic 32732
chr5 110898001 110899000 0.9957 AC169384.1 intergenic 20479
chr11 44560001 44561000 0.9956 Ebf1 genic 129182
chr3 34545001 34546000 0.9956 Sox2 upstream 2927
chr11 97523001 97524000 0.9955 Mllt6 upstream 728
chr7 38814001 38815000 0.9955 C80913 upstream 9440
chr3 34655001 34656000 0.9953 Sox2 intergenic 103618
chr18 33995001 33996000 0.9951 Epb4.1l4a genic 39020
chr11 20131001 20132000 0.9943 Cep68 genic 3961
chr3 133187001 133188000 0.9942 Tet2 genic 19332
chr5 65256001 65257000 0.9941 Klf3 intergenic 34634
chr1 13075001 13076000 0.9936 AC118477.1 intergenic 22572
chr3 135210001 135211000 0.9935 Manba genic 7530
chr10 21708001 21709000 0.9934 Sgk1 genic 10699
chr4 55490001 55491000 0.9933 SNORA17 upstream 7970
chr1 12734001 12735000 0.9932 Sulf1 genic 25391
chr6 143090001 143091000 0.9929 Etnk1 intergenic 24750
chr7 3201001 3202000 0.9928 mmu-mir-290 intergenic 16627
chr14 99724001 99725000 0.9926 Klf5 intergenic 11377
chr16 84769001 84770000 0.9924 Jam2 upstream 4368
chr12 12810001 12811000 0.9923 U6 intergenic 103050
chr12 12950001 12951000 0.9923 Mycn upstream 1281
chr12 40724001 40725000 0.9921 Arl4a downstream 7034
chr10 62359001 62360000 0.992 U6 upstream 2739
chr2 20591001 20592000 0.992 Etl4 genic 74443
chr3 34567001 34568000 0.992 Sox2 intergenic 15618
chr7 56602001 56603000 0.992 Nav2 genic 105328
chr5 110656001 110657000 0.9915 Golga3 downstream 512
chr11 8469001 8470000 0.9913 Tns3 genic 94538
chr10 21466001 21467000 0.9911 Gm5420 intergenic 52217
chr2 21314001 21315000 0.991 Gpr158 genic 24832
chr5 92541001 92542000 0.991 5S_rRNA downstream 5203
chr15 50732001 50733000 0.9905 Trps1 intergenic 10414
chr18 30281001 30282000 0.9905 Pik3c3 intergenic 150401
chr11 107214001 107215000 0.9902 Pitpnc1 genic 117013
chr6 67061001 67062000 0.9899 RP23-129P10.1 intergenic 30353
chr17 37134001 37135000 0.9894 Zfp57 upstream 4487
chr2 154252001 154253000 0.9894 Cbfa2t2 upstream 9217
chr1 183960001 183961000 0.9891 Enah intergenic 10209
chr1 184093001 184094000 0.9891 Srp9 intergenic 30463

chr14 99739001 99740000 0.9885 U2 intergenic 11267
chr1 37044001 37045000 0.9879 5S_rRNA intergenic 17284
chr11 79945001 79946000 0.9877 Atad5 genic 3296
chr6 82125001 82126000 0.9877 Fam176a intergenic 81908
chr3 88376001 88377000 0.9875 Ubqln4 downstream 2354
chr15 97424001 97425000 0.9874 Rpap3 intergenic 80529
chr19 44340001 44341000 0.9873 Scd3 intergenic 21495
chr2 134898001 134899000 0.9871 Plcb1 genic 286106
chr3 53454001 53455000 0.9871 Frem2 genic 6277
chr1 91767001 91768000 0.9867 Agap1 genic 23852
chr1 136431001 136432000 0.9866 AC122771.1 downstream 5562
chr10 85002001 85003000 0.9862 Btbd11 genic 120037
chr6 149079001 149080000 0.9859 AC132412.3 genic 309
chr8 74835001 74836000 0.9859 AC113182.1 upstream 5921
chr8 74858001 74859000 0.9859 Eps15l1 downstream 5898
chrX 96752001 96753000 0.9858 Pja1 intergenic 85389
chr7 3212001 3213000 0.985 mmu-mir-290 upstream 5627
chr4 141126001 141127000 0.9849 Fblim1 downstream 4977
chr11 34552001 34553000 0.9847 Dock2 genic 44394
chr4 10859001 10860000 0.9843 RP23-130H14.1 intergenic 13755
chr2 40766001 40767000 0.9839 Lrp1b genic 315233
chr18 69941001 69942000 0.9837 Y_RNA upstream 3302
chr3 65461001 65462000 0.9831 Lekr1 upstream 8150
chr2 162857001 162858000 0.983 Ift52 genic 13877
chr10 60878001 60879000 0.9826 Nodal upstream 1720
chr6 64972001 64973000 0.9821 Smarcad1 intergenic 19661
chr4 141722001 141723000 0.9819 9030409G11Rik genic 43121
chrX 159270001 159271000 0.9815 4932441K18Rik upstream 2615
chr6 145224001 145225000 0.9812 RP24-359O2.4 downstream 8649
chr14 77919001 77920000 0.981 Enox1 genic 201214
chr18 17577001 17578000 0.981 AC125166.1 intergenic 280933
chr7 3213001 3214000 0.981 mmu-mir-290 upstream 4627
chr10 6238001 6239000 0.9809 Mthfd1l genic 58541
chr11 77487001 77488000 0.9809 Nufip2 intergenic 11641
chr19 20998001 20999000 0.9806 Tmc1 genic 29692
chr6 39396001 39397000 0.9805 Dennd2a intergenic 15377
chr1 133786001 133787000 0.9804 Rab7l1 intergenic 16537
chr2 65259001 65260000 0.98 Scn3a intergenic 36320
chr3 34656001 34657000 0.9798 Sox2 intergenic 104618
chr17 38127001 38128000 0.9788 Olfr761 intergenic 37034

chr6 128211001 128212000 0.9787 Tead4 genic 13595
chr13 95476001 95477000 0.9784 Tbca intergenic 81898
chr3 34634001 34635000 0.9783 Sox2 intergenic 82618
chr6 83840001 83841000 0.9777 Gm10481 intergenic 14554
chr4 84308001 84309000 0.9776 Bnc2 genic 11990
chr19 23169001 23170000 0.9774 AC132389.1 intergenic 18587
chr10 21565001 21566000 0.977 AC158614.1 upstream 344
chr4 133216001 133217000 0.977 Pigv genic 301
chr10 27714001 27715000 0.9767 AC160404.1 intergenic 41735
chr5 108506001 108507000 0.9766 Mtf2 genic 11308
chr1 182855001 182856000 0.9764 Lefty1 upstream 9153
chr2 91984001 91985000 0.9764 Phf21a genic 50727
chr6 52028001 52029000 0.9755 RP23-103L13.1 intergenic 23948
chr7 87275001 87276000 0.9755 Idh2 intergenic 14723
chrX 7585001 7586000 0.9741 Glod5 genic 3675
chr2 165794001 165795000 0.9734 RP23-108D12.6 downstream 7257
chr6 125383001 125384000 0.9731 Cd9 intergenic 26284
chr1 53010001 53011000 0.9729 1700019D03Rik genic 9104
chr1 133787001 133788000 0.9727 Rab7l1 intergenic 17537
chr17 46949001 46950000 0.9727 CT030702.2 genic 12170
chr8 87174001 87175000 0.9724 Cacna1a downstream 9853
chr4 132678001 132679000 0.9719 Wasf2 upstream 7420
chr4 33910001 33911000 0.9716 Cnr1 intergenic 100568
chr14 122815001 122816000 0.9711 AC168091.1 upstream 9036
chr11 8475001 8476000 0.9709 Tns3 genic 88538
chr11 30649001 30650000 0.9705 RP23-358M10.2 upstream 832
chr2 71705001 71706000 0.9704 Pdk1 upstream 5281
chr17 37217001 37218000 0.9703 Olfr90 downstream 4176
chr15 39560001 39561000 0.9698 Tm7sf4 intergenic 16478
chr19 28257001 28258000 0.9698 Glis3 intergenic 75341
chr7 134367001 134368000 0.9698 Sept1 upstream 4989
chr11 9016001 9017000 0.9697 Upp1 upstream 1106
chr18 75520001 75521000 0.9696 Smad7 upstream 6183
chr15 61924001 61925000 0.9694 Pvt1 genic 54399
chr3 88375001 88376000 0.9694 Ubqln4 downstream 1354
chr17 71177001 71178000 0.9687 Dlgap1 downstream 6248
chr19 45724001 45725000 0.9685 Fbxw4 genic 9802
chr10 82272001 82273000 0.9682 1700028I16Rik upstream 1869
chr9 44403001 44404000 0.9674 Ddx6 upstream 8979
chr4 116398001 116399000 0.967 RP23-109A3.1 genic 12971

chrX 100507001 100508000 0.967 4930519F16Rik overlapTSS 120
chr2 154253001 154254000 0.9667 Cbfa2t2 upstream 8217
chr4 85829001 85830000 0.9666 Adamtsl1 genic 68877
chr18 5012001 5013000 0.9665 Svil genic 91463
chr5 25060001 25061000 0.9662 Cct8l1 intergenic 36162
chr12 87845001 87846000 0.9661 Esrrb genic 16570
chr10 17296001 17297000 0.9659 Cited2 intergenic 146032
chr17 11546001 11547000 0.9659 Park2 genic 286010
chr1 155948001 155949000 0.9656 SNORA17 intergenic 20672
chr3 72662001 72663000 0.965 Sis intergenic 29482
chr11 117836001 117837000 0.9649 Socs3 upstream 4640
chr10 62361001 62362000 0.9648 U6 upstream 4739
chr3 121848001 121849000 0.9648 Abca4 genic 33979
chr14 30764001 30765000 0.9645 Gm10245 upstream 4998
chr7 3204001 3205000 0.9645 mmu-mir-290 intergenic 13627
chr17 10625001 10626000 0.9639 Pacrg genic 29123
chrX 100633001 100634000 0.9638 Tsix genic 6145
chr2 154249001 154250000 0.9636 Cbfa2t2 intergenic 12217
chr10 41187001 41188000 0.9632 Zbtb24 downstream 1624
chr19 5846001 5847000 0.963 AC142098.4 upstream 1508
chr6 53151001 53152000 0.9629 Creb5 intergenic 85289
chr12 12184001 12185000 0.9628 5S_rRNA intergenic 70328
chr11 69517001 69518000 0.9625 Tnfsf12-Tnfsf13 upstream 7401
chr1 137450001 137451000 0.9624 Nav1 genic 31286
chr1 168055001 168056000 0.9624 Dusp27 genic 2029
chr4 99308001 99309000 0.9616 RP23-35D18.1 intergenic 11298
chr12 42830001 42831000 0.9612 Immp2l genic 225575
chr19 25563001 25564000 0.961 Dmrt1 intergenic 16108
chr18 36575001 36576000 0.9608 Pfdn1 genic 11263
chr13 63631001 63632000 0.96 Ptch1 genic 21364
chr2 160482001 160483000 0.9598 Top1 genic 10377
chr1 138624001 138625000 0.9597 AC125186.3 downstream 1696
chr10 18928001 18929000 0.9597 Olig3 intergenic 147366
chr6 64973001 64974000 0.9597 Smarcad1 intergenic 18661
chr11 33433001 33434000 0.9595 Gabrp intergenic 16781
chr1 182854001 182855000 0.9592 Lefty1 intergenic 10153
chr1 155223001 155224000 0.9587 E330020D12Rik intergenic 27957
chr11 69521001 69522000 0.9584 Tnfsf12-Tnfsf13 intergenic 11401
chr4 57703001 57704000 0.958 Palm2 genic 20888
chr1 72275001 72276000 0.9579 U2 downstream 1997

chr6 37807001 37808000 0.9579 Trim24 intergenic 12811
chr10 94683001 94684000 0.9577 Cradd genic 45102
chr10 76662001 76663000 0.9575 Col18a1 intergenic 32708
chr10 115999001 116000000 0.9575 Cnot2 genic 18556
chr2 165982001 165983000 0.9574 Sulf2 upstream 838
chr10 84534001 84535000 0.9571 U7 downstream 7325
chr4 99302001 99303000 0.9563 RP23-35D18.1 intergenic 17298
chr8 91515001 91516000 0.9562 Sall1 intergenic 35147
chr12 87843001 87844000 0.9561 Esrrb genic 18570
chr1 59006001 59007000 0.956 Trak2 genic 23274
chr11 116943001 116944000 0.9558 RP23-41E14.4 downstream 2732
chr15 61842001 61843000 0.9554 Myc intergenic 20072
chr17 66829001 66830000 0.9553 Rab12 intergenic 13852
chr6 122652001 122653000 0.9553 RP23-180N22.1 upstream 2955
chrX 55293001 55294000 0.955 Zic3 genic 913
chr10 84373001 84374000 0.9548 Rfx4 downstream 3720
chr2 149809001 149810000 0.9548 Tmem90b genic 20128
chr10 51998001 51999000 0.9542 Dcbld1 genic 42182
chr11 33437001 33438000 0.9542 Gabrp intergenic 12781
chr2 162878001 162879000 0.9538 Mybl2 upstream 1423
chr5 33878001 33879000 0.9538 Fam53a intergenic 64000
chr10 20821001 20822000 0.9536 Ahi1 intergenic 20767
chr4 45784001 45785000 0.9531 RP23-358I3.1 downstream 4754
chr1 63184001 63185000 0.953 RP23-13C17.1 genic 19864
chr15 7066001 7067000 0.953 Gm10050 intergenic 12021
chr7 71093001 71094000 0.9524 Klf13 upstream 9258
chr2 146457001 146458000 0.9519 RP23-402L5.1 intergenic 85157
chr2 144541001 144542000 0.9516 Dtd1 genic 52494
chr12 87840001 87841000 0.9514 Esrrb genic 21570
chr7 97124001 97125000 0.9513 Eed genic 4493
chr7 4772001 4773000 0.9509 Shisa7 downstream 4154
chr14 78688001 78689000 0.9501 Tnfsf11 genic 10749
chr17 37132001 37133000 0.9501 H2-M5 upstream 5519
chr1 138571001 138572000 0.9499 AC125186.3 intergenic 20788
chr4 106532001 106533000 0.9499 RP24-423N4.1 intergenic 11476
chr1 72260001 72261000 0.9497 Mreg upstream 1120
chr3 90256001 90257000 0.9495 Npr1 genic 1488
chr4 109984001 109985000 0.9494 Elavl4 genic 39514
chr1 75801001 75802000 0.9493 AC133581.2 intergenic 167247
chr6 144310001 144311000 0.9491 Sox5 genic 419497

chr1 120544001 120545000 0.949 Tcfcp2l1 genic 19474
chr19 30020001 30021000 0.9488 Il33 genic 14194
chr17 35640001 35641000 0.9487 Pou5f1 upstream 2007
chr2 115549001 115550000 0.9485 BC052040 genic 54504
chr2 174086001 174087000 0.9485 mmu-mir-296 downstream 5548
chr18 25027001 25028000 0.9482 Fhod3 genic 159055
chr13 96304001 96305000 0.9481 F2rl1 upstream 8819
chr13 113252001 113253000 0.9481 Il6st upstream 1317
chr15 25699001 25700000 0.9481 Myo10 genic 43428
chr10 92705001 92706000 0.9474 Cdk17 downstream 1415
chr12 74187001 74188000 0.9473 Six4 intergenic 12943
chr12 112075001 112076000 0.9473 Stk30 genic 3149
chr3 95455001 95456000 0.9465 Mcl1 upstream 6779
chr13 98053001 98054000 0.9464 AC160109.1 upstream 7974
chr12 73750001 73751000 0.9463 Dhrs7 downstream 398
chr3 122137001 122138000 0.9463 Bcar3 genic 14285
chr8 114313001 114314000 0.9463 Cfdp1 genic 20628
chr1 166307001 166308000 0.9462 Nme7 genic 26805
chr1 88207001 88208000 0.9455 B3gnt7 downstream 3123
chr1 35289001 35290000 0.9451 SNORA17 intergenic 53072
chr11 88830001 88831000 0.945 Coil upstream 566
chr9 7634001 7635000 0.945 Mmp20 genic 5770
chr9 96130001 96131000 0.945 Tfdp2 genic 33234
chr14 77650001 77651000 0.9443 Enox1 genic 93378
chr13 110423001 110424000 0.9442 Pde4d genic 317682
chr15 98822001 98823000 0.9441 AC157610.1 downstream 2289
chr3 9008001 9009000 0.944 Tpd52 upstream 3278
chr11 33427001 33428000 0.9438 AL669814.1 intergenic 13113
chr10 69827001 69828000 0.9437 Fam13c intergenic 75454
chr19 29082001 29083000 0.9435 Cdc37l1 genic 9059
chrX 50100001 50101000 0.9435 Kis2 upstream 1864
chr4 55489001 55490000 0.9432 SNORA17 upstream 6970
chr16 88720001 88721000 0.943 Krtap13-1 upstream 8107
chr10 21549001 21550000 0.9426 AC158614.1 downstream 305
chr4 118744001 118745000 0.9422 RP23-344D2.1 upstream 2285
chr2 68886001 68887000 0.9419 Lass6 genic 65339
chr1 121295001 121296000 0.9413 Inhbb intergenic 16042
chr17 76511001 76512000 0.941 Y_RNA upstream 357
chr1 134097001 134098000 0.9409 Lemd1 genic 8993
chr11 97497001 97498000 0.9409 E130012A19Rik upstream 5985

chr19 14730001 14731000 0.9407 Tle4 intergenic 57528
chr14 49320001 49321000 0.9402 AC166574.2 genic 31038
chr4 140847001 140848000 0.94 Epha2 upstream 9155
chr13 34336001 34337000 0.939 Slc22a23 genic 64974
chr10 79514001 79515000 0.9389 Gpx4 genic 4090
chr7 13599001 13600000 0.9389 Zbtb45 upstream 3822
chr13 52657001 52658000 0.9385 Syk intergenic 20542
chr7 137395001 137396000 0.9385 Fgfr2 genic 89036
chr9 120589001 120590000 0.9381 5830454E08Rik intergenic 101811
chr3 54794001 54795000 0.9379 Ccna1 intergenic 54399
chr10 107724001 107725000 0.9377 Ppp1r12a downstream 9370
chr12 109950001 109951000 0.9371 Degs2 upstream 9485
chr7 26049001 26050000 0.9369 Cic upstream 5865
chr10 18142001 18143000 0.9364 Nhsl1 genic 31873
chr19 46910001 46911000 0.9355 Cnnm2 genic 42285
chr19 32329001 32330000 0.9354 Sgms1 genic 131784
chr17 71250001 71251000 0.9352 CT009715.1 genic 23336
chr4 57933001 57934000 0.9351 D630039A03Rik upstream 3832
chr11 63084001 63085000 0.9349 RP23-355P23.1 intergenic 29024
chr2 172370001 172371000 0.9347 Tcfap2c upstream 4093
chr1 158925001 158926000 0.9346 Gm10531 intergenic 47680
chr15 51844001 51845000 0.9341 Rad21 intergenic 20708
chr2 71490001 71491000 0.934 Gm1631 intergenic 66446
chr6 145225001 145226000 0.934 RP24-359O2.4 downstream 9649
chr6 71035001 71036000 0.9338 RP24-92L16.2 intergenic 38588
chr10 27734001 27735000 0.9337 Ptprk intergenic 59626
chr3 156984001 156985000 0.9335 Negr1 downstream 4592
chr2 26466001 26467000 0.9334 Agpat2 upstream 6064
chr18 3006001 3007000 0.9333 Crem intergenic 259046
chr14 76915001 76916000 0.9332 AC142266.1 genic 636
chr13 44114001 44115000 0.9327 RP23-469C2.1 intergenic 101536
chr7 148148001 148149000 0.9326 Ifitm1 upstream 4120
chr10 51993001 51994000 0.9324 Dcbld1 genic 39576
chr5 104153001 104154000 0.9324 Aff1 genic 31608
chr4 99309001 99310000 0.9316 RP23-35D18.1 intergenic 10298
chr2 150483001 150484000 0.9314 Acss1 genic 9994
chr15 103350001 103351000 0.931 Pde1b genic 9483
chr3 103011001 103012000 0.931 AC150894.1 upstream 8103
chr13 44800001 44801000 0.9309 Jarid2 intergenic 25640
chr4 141727001 141728000 0.9306 9030409G11Rik genic 38121

chr7 121884001 121885000 0.9302 Insc upstream 2208
chr6 72737001 72738000 0.9301 Tcf7l1 genic 1248
chr3 34644001 34645000 0.9299 Sox2 intergenic 92618
chr2 106591001 106592000 0.9298 Mpped2 genic 57575
chr1 142326001 142327000 0.9295 Kcnt2 genic 179656
chr12 87839001 87840000 0.9294 Esrrb genic 22570
chr4 53965001 53966000 0.9289 SNORA17 intergenic 46309
chr6 108208001 108209000 0.9279 Itpr1 genic 44889
chr1 59506001 59507000 0.9278 Fzd7 intergenic 32023
chr11 8486001 8487000 0.9276 Tns3 genic 77538
chr18 35203001 35204000 0.9276 Ctnna1 intergenic 74566
chr2 131708001 131709000 0.9275 Lamr1-ps1 intergenic 20822
chr10 44111001 44112000 0.9273 Atg5 intergenic 26904
chr5 104164001 104165000 0.9273 Aff1 genic 42608
chr4 126904001 126905000 0.927 Dlgap3 genic 9266
chr4 53802001 53803000 0.9265 Tal2 downstream 417
chr4 118699001 118700000 0.9263 RP23-228G24.4 intergenic 25911
chr10 107723001 107724000 0.926 Ppp1r12a downstream 8370
chr12 42285001 42286000 0.9251 Immp2l genic 534324
chr18 75744001 75745000 0.925 Gm672 genic 112208
chr5 120035001 120036000 0.925 AC140675.3 intergenic 25272
chr3 52058001 52059000 0.9248 Foxo1 intergenic 13259
chr11 22730001 22731000 0.9246 B3gnt2 genic 28544
chr5 139643001 139644000 0.9242 Heatr2 genic 16824
chr7 127301001 127302000 0.9241 Zp2 intergenic 12197
chr15 92794001 92795000 0.924 Pdzrn4 intergenic 192680
chr6 128246001 128247000 0.9238 Tead4 genic 3831
chr10 98923001 98924000 0.9236 B530045E10Rik intergenic 37320
chr2 10256001 10257000 0.9235 RP23-6M16.2 downstream 3910
chrX 12783001 12784000 0.9235 RP23-158L19.1 downstream 2095
chr12 21468001 21469000 0.9233 Ywhaq intergenic 44503
chr13 37528001 37529000 0.923 CT030182.1 upstream 983
chr4 6896001 6897000 0.9227 Tox genic 20946
chr18 65746001 65747000 0.9226 Zfp532 genic 6117
chr10 84017001 84018000 0.9225 AC140333.1 upstream 918
chr6 35186001 35187000 0.9224 Nup205 genic 10590
chr7 140306001 140307000 0.9224 Fgfr2 genic 8033
chr3 102950001 102951000 0.9221 Dennd2c genic 18522
chr4 84174001 84175000 0.9221 Bnc2 genic 145990
chr18 36413001 36414000 0.9218 Pura intergenic 26751

chr3 96479001 96480000 0.9218 Ankrd35 genic 4947
chr12 73751001 73752000 0.9217 Dhrs7 genic 397
chr10 84513001 84514000 0.9215 U7 intergenic 12620
chr15 7481001 7482000 0.9213 AC109195.1 intergenic 119620
chr2 159244001 159245000 0.9206 RP23-53L16.1 intergenic 36906
chr13 89801001 89802000 0.9204 Vcan genic 6086
chr1 11268001 11269000 0.9197 Prex2 genic 19582
chr13 48340001 48341000 0.9197 AC154314.1 genic 16992
chr12 111710001 111711000 0.9195 Ppp2r5c genic 24483
chr8 46064001 46065000 0.9194 Fat1 genic 28433
chr2 172342001 172343000 0.9187 RP23-228E2.8 upstream 6810
chr7 16950001 16951000 0.9185 Sae1 genic 22146
chr8 114137001 114138000 0.9185 Znrf1 genic 9671
chr15 88539001 88540000 0.9184 Brd1 genic 21271
chr12 72154001 72155000 0.9183 Arid4a genic 37047
chr13 44817001 44818000 0.9183 Jarid2 upstream 8640
chr5 122641001 122642000 0.9183 Hvcn1 intergenic 14813
chr13 47951001 47952000 0.9178 4931429P17Rik intergenic 103699
chr8 123195001 123196000 0.9176 Galnt2 genic 760260
chr2 166884001 166885000 0.9173 Znfx1 genic 3515
chr3 41226001 41227000 0.917 SNORD22 intergenic 18265
chr18 35917001 35918000 0.9167 Tmem173 intergenic 16894
chr6 37639001 37640000 0.9166 AC157101.1 intergenic 108574
chr10 44112001 44113000 0.9164 Atg5 intergenic 27904
chr4 127022001 127023000 0.9164 Gjb4 downstream 5330
chr2 8420001 8421000 0.916 7SK intergenic 25990
chr2 70116001 70117000 0.916 Myo3b genic 149290
chr10 17299001 17300000 0.9158 Cited2 intergenic 143032
chr2 17910001 17911000 0.9158 H2afb1 downstream 7044
chr4 154564001 154565000 0.9156 Ski genic 31644
chr11 8535001 8536000 0.9152 Tns3 genic 28538
chr3 95440001 95441000 0.915 Ensa downstream 3977
chr10 95058001 95059000 0.9149 7SK upstream 7426
chr17 90431001 90432000 0.9148 Nrxn1 downstream 3914
chr19 23162001 23163000 0.9147 AC132389.1 intergenic 11587
chr11 88829001 88830000 0.914 RP23-176J13.11 upstream 756
chr14 20680001 20681000 0.9139 Gng2 intergenic 10781
chrX 55267001 55268000 0.9138 Zic3 upstream 7877
chr15 10653001 10654000 0.9137 AC138765.1 genic 8410
chr6 31847001 31848000 0.9137 RP24-538P5.1 genic 24270

chr10 62360001 62361000 0.9135 U6 upstream 3739
chr18 20648001 20649000 0.9131 Dsg4 intergenic 17679
chr7 114492001 114493000 0.913 Syt9 intergenic 21242
chr8 37611001 37612000 0.9129 Dlc1 intergenic 18805
chr10 67600001 67601000 0.9127 Arid5b genic 41660
chr15 7067001 7068000 0.9127 Gm10050 intergenic 11021
chr19 3842001 3843000 0.9127 AC133523.2 genic 9419
chr2 75699001 75700000 0.9126 Agps genic 28767
chr11 25999001 26000000 0.9125 RP23-306I3.1 genic 110569
chr3 100967001 100968000 0.9125 RP23-434L7.1 intergenic 42782
chr17 79764001 79765000 0.9123 Cdc42ep3 upstream 9604
chr6 133937001 133938000 0.9117 Etv6 intergenic 47718
chr2 116805001 116806000 0.9113 U1 intergenic 11092
chr5 50514001 50515000 0.9112 Gpr125 intergenic 63766
chr14 122867001 122868000 0.911 AC154683.2 downstream 1506
chr1 182818001 182819000 0.9107 Lefty2 upstream 4239
chr13 98202001 98203000 0.9107 AC124581.1 intergenic 62301
chr11 59594001 59595000 0.9105 Mprip genic 355
chr11 17783001 17784000 0.9104 Etaa1 intergenic 54759
chr3 9009001 9010000 0.91 Tpd52 upstream 4278
chr8 128224001 128225000 0.91 4933403G14Rik intergenic 26745
chr6 34294001 34295000 0.9099 Akr1b8 upstream 9119
chr10 66360001 66361000 0.9098 AC153382.2 intergenic 14237
chr6 92445001 92446000 0.9096 Prickle2 genic 71360
chr6 55742001 55743000 0.9093 Ccdc129 intergenic 44012
chr3 133195001 133196000 0.9092 Tet2 genic 11332
chr16 8778001 8779000 0.9088 Usp7 genic 13401
chr12 80097001 80098000 0.9087 Tmem229b genic 10324
chr5 64891001 64892000 0.9087 Tbc1d1 intergenic 148276
chr6 33720001 33721000 0.9085 Exoc4 genic 202979
chr16 35967001 35968000 0.908 Parp9 genic 4689
chr16 42782001 42783000 0.908 RP23-283M18.2 genic 56186
chr6 92793001 92794000 0.9079 Adamts9 genic 70308
chr6 65746001 65747000 0.9077 Prdm5 genic 17045
chr7 52807001 52808000 0.9077 Plekha4 genic 1599
chr11 88929001 88930000 0.9076 RP23-176J13.9 genic 412
chr2 166970001 166971000 0.9076 Kcnb1 genic 41123
chr10 62355001 62356000 0.9069 U6 downstream 157
chr17 81200001 81201000 0.9069 AC169501.1 intergenic 72139
chr18 13871001 13872000 0.9068 Zfp521 genic 24276

chr1 34160001 34161000 0.9066 Dst genic 78881
chr7 90525001 90526000 0.9064 AC099701.1 intergenic 34772
chr15 81348001 81349000 0.906 Rbx1 intergenic 41202
chr4 140838001 140839000 0.9057 RP23-18M1.5 downstream 2292
chr7 3348001 3349000 0.9057 Cacng7 genic 14946
chr2 87321001 87322000 0.9056 Pramel7 downstream 7244
chr5 97647001 97648000 0.9055 AC122916.1 intergenic 45768
chr1 154856001 154857000 0.9053 Nmnat2 genic 53873
chr11 77493001 77494000 0.9052 Nufip2 upstream 5641
chr1 54541001 54542000 0.9051 Pgap1 genic 3451
chr3 121998001 121999000 0.9047 mmu-mir-760 upstream 1380
chr15 101064001 101065000 0.9045 Grasp downstream 815
chr8 4366001 4367000 0.9044 Ccl25 downstream 5981
chr14 106260001 106261000 0.9042 Spry2 intergenic 30732
chr19 21863001 21864000 0.9041 Tmem2 genic 10169
chr1 74871001 74872000 0.904 Wnt10a intergenic 20248
chr10 75344001 75345000 0.9035 Derl3 intergenic 11170
chr3 21924001 21925000 0.9034 2810416G20Rik intergenic 49611
chr6 30109001 30110000 0.9034 Nrf1 downstream 5543
chr6 39380001 39381000 0.9032 SNORA71 upstream 7583
chr13 52016001 52017000 0.9031 CT025557.1 intergenic 51583
chr6 91660001 91661000 0.9029 Slc6a6 genic 25912
chr2 181653001 181654000 0.9028 RP23-409P4.1 intergenic 19584
chr8 85684001 85685000 0.9028 Tbc1d9 upstream 4251
chr2 20418001 20419000 0.9025 Etl4 genic 247443
chr6 61381001 61382000 0.9023 Fam190a genic 250683
chr5 143565001 143566000 0.9021 Tnrc18 genic 13341
chr2 51927001 51928000 0.902 Rif1 upstream 352
chr5 54062001 54063000 0.9019 Rbpj intergenic 13317
chr15 39612001 39613000 0.9015 Dpys genic 11970
chr18 38308001 38309000 0.9014 Pcdh1 intergenic 36569
chr1 11492001 11493000 0.9008 A830018L16Rik genic 87815
chr13 44181001 44182000 0.9007 RP23-469C2.1 intergenic 34536
chr5 27018001 27019000 0.9004 Dpp6 intergenic 124743
chr3 34633001 34634000 0.9003 Sox2 intergenic 81618
chr19 38423001 38424000 0.9 Gm9886 intergenic 33257
chr9 120058001 120059000 0.9 Mobp overlapTSS 140
chr1 182819001 182820000 0.8999 Lefty2 upstream 3239
chr2 71489001 71490000 0.8999 Gm1631 intergenic 67446
chrX 57422001 57423000 0.8995 Mcf2 genic 9266

chr11 88551001 88552000 0.8994 Msi2 genic 27543
chr10 104961001 104962000 0.899 Tmtc2 genic 49491
chr10 28356001 28357000 0.8989 Themis intergenic 31166
chr7 119832001 119833000 0.8987 Tead1 genic 9130
chr1 71719001 71720000 0.8983 Gm15456 downstream 9584
chr16 82596001 82597000 0.898 U6 intergenic 52421
chr10 25033001 25034000 0.8976 Akap7 intergenic 14029
chr10 119608001 119609000 0.8975 Irak3 genic 29297
chr6 122316001 122317000 0.8973 1700063H04Rik intergenic 24397
chr6 142460001 142461000 0.8972 Ldhb upstream 3524
chr7 107848001 107849000 0.8972 Fam168a upstream 6165
chr3 30623001 30624000 0.897 Samd7 intergenic 21215
chrX 140772001 140773000 0.8968 Alg13 genic 19435
chr8 24672001 24673000 0.8967 Zmat4 intergenic 73491
chr17 71251001 71252000 0.8966 CT009715.1 genic 22336
chr1 62814001 62815000 0.8964 Nrp2 genic 50269
chr10 5023001 5024000 0.8964 Syne1 genic 227152
chr7 31248001 31249000 0.8964 Nphs1 genic 3257
chr11 88491001 88492000 0.8963 Msi2 genic 87543
chr12 111743001 111744000 0.8963 Ppp2r5c genic 57483
chr1 134367001 134368000 0.896 Dstyk downstream 3465
chr2 20577001 20578000 0.8959 Etl4 genic 88443
chr11 49619001 49620000 0.8957 Gfpt2 genic 1252
chr5 137522001 137523000 0.8957 Ap1s1 overlapTSS 4
chr8 55161001 55162000 0.8957 Vegfc upstream 960
chr4 130682001 130683000 0.8953 SNORA17 intergenic 112192
chr4 28894001 28895000 0.8952 Epha7 genic 351
chr17 29602001 29603000 0.8949 AC163629.1 genic 3160
chr9 10449001 10450000 0.8949 Cntn5 genic 454726
chr13 115069001 115070000 0.8944 Ndufs4 downstream 8859
chr10 8805001 8806000 0.8941 Sash1 intergenic 199133
chr10 88171001 88172000 0.8936 5S_rRNA downstream 753
chrX 101183001 101184000 0.8933 Rlim upstream 6378
chr14 49374001 49375000 0.8927 AC166574.2 genic 38023
chr8 89299001 89300000 0.8927 Gm10638 intergenic 28499
chr15 96075001 96076000 0.8926 AC109202.1 genic 38039
chr2 44380001 44381000 0.8925 Gtdc1 intergenic 38932
chr11 102193001 102194000 0.8924 Ubtf intergenic 12214
chr11 120525001 120526000 0.8921 RP23-84C12.14 downstream 1046
chr19 27979001 27980000 0.892 Rfx3 genic 105630

chr16 59545001 59546000 0.8918 Crybg3 genic 9578
chr17 87150001 87151000 0.8909 Epas1 upstream 2482
chr5 129776001 129777000 0.8907 Gpr133 intergenic 65527
chr2 119683001 119684000 0.8906 AL954662.1 intergenic 14480
chr10 6699001 6700000 0.8901 Iyd intergenic 91664
chr12 40615001 40616000 0.89 Arl4a intergenic 116034
chr8 114174001 114175000 0.89 Zfp1 genic 6567
chr1 188338001 188339000 0.8896 Tgfb2 intergenic 108529
chr18 23673001 23674000 0.8896 Dtna genic 99085
chr6 149080001 149081000 0.8896 AC132412.3 genic 676
chr3 153339001 153340000 0.8895 St6galnac3 genic 48026
chr12 12791001 12792000 0.8894 U6 intergenic 84050
chr3 78843001 78844000 0.8892 Rapgef2 intergenic 22438
chr7 6713001 6714000 0.8892 Usp29 genic 29707
chr2 134786001 134787000 0.889 Plcb1 genic 174106
chr3 157595001 157596000 0.8889 AC117211.2 downstream 4890
chr19 41088001 41089000 0.8887 Dntt intergenic 14765
chr13 98370001 98371000 0.8886 AC124581.1 intergenic 103998
chr4 141626001 141627000 0.8886 Tmem51 genic 13219
chr3 103012001 103013000 0.8885 AC150894.1 upstream 9103
chr4 132205001 132206000 0.8883 Eya3 genic 10099
chr10 30457001 30458000 0.8881 Ncoa7 genic 64744
chr19 29360001 29361000 0.8879 Jak2 genic 26570
chr14 122741001 122742000 0.8877 Clybl genic 59450
chr3 83985001 83986000 0.8877 Trim2 genic 20640
chrX 12591001 12592000 0.8877 Gm1549 intergenic 13226
chr17 31945001 31946000 0.8874 Sik1 intergenic 35193
chrX 101182001 101183000 0.8873 Rlim upstream 5378
chr8 48423001 48424000 0.887 Stox2 genic 13702
chr1 12735001 12736000 0.8867 Sulf1 genic 26391
chr4 123310001 123311000 0.8866 Macf1 genic 50603
chr5 129748001 129749000 0.8866 Gpr133 intergenic 37527
chr2 119615001 119616000 0.8865 Rpap1 upstream 1728
chr2 60035001 60036000 0.8864 Baz2b genic 11896
chr3 9538001 9539000 0.8864 Zfp704 genic 71085
chr9 30826001 30827000 0.8858 Zbtb44 intergenic 11229
chr15 9828001 9829000 0.8853 Spef2 intergenic 149378
chr3 63646001 63647000 0.8853 Plch1 genic 7913
chr13 58382001 58383000 0.885 AC154437.1 overlapTSS 456
chr1 4960001 4961000 0.8849 Rgs20 genic 60344

chr1 89123001 89124000 0.8849 Eif4e2 genic 12512
chr3 19607001 19608000 0.8849 Crh intergenic 11672
chr10 62357001 62358000 0.8847 U6 upstream 739
chr14 106300001 106301000 0.8847 Spry2 upstream 3965
chr12 100580001 100581000 0.8846 Foxn3 genic 51007
chr3 7579001 7580000 0.8845 Il7 genic 5819
chr1 59505001 59506000 0.8844 Fzd7 intergenic 33023
chr7 3193001 3194000 0.8844 AU018091 intergenic 23797
chr7 3538001 3539000 0.8843 Oscar intergenic 22415
chr18 75739001 75740000 0.884 Gm672 genic 117208
chr8 15844001 15845000 0.884 Csmd1 intergenic 47537
chr19 16566001 16567000 0.8837 Gna14 genic 55749
chr11 26070001 26071000 0.8835 RP23-306I3.1 genic 39569
chr8 28346001 28347000 0.8832 Adrb3 upstream 5941
chr4 123343001 123344000 0.8831 Macf1 genic 17603
chr10 59437001 59438000 0.8828 Anapc16 intergenic 12657
chr17 53701001 53702000 0.8828 Kat2b upstream 4640
chr10 12328001 12329000 0.8826 Utrn genic 226015
chr2 135440001 135441000 0.8824 RP24-364H2.1 intergenic 30681
chr12 106002001 106003000 0.8822 Clmn genic 677
chr7 127042001 127043000 0.8818 Lyrm1 genic 2626
chr13 54311001 54312000 0.8816 Hrh2 genic 5801
chr4 138350001 138351000 0.8808 Pla2g5 downstream 4159
chr9 18527001 18528000 0.8807 Olfr24 intergenic 31136
chr11 76298001 76299000 0.8806 Abr genic 23921
chr4 82093001 82094000 0.8806 Nfib genic 57674
chr17 88746001 88747000 0.8804 U6 upstream 4179
chr16 41634001 41635000 0.8803 Lsamp genic 101030
chr19 32330001 32331000 0.8803 Sgms1 genic 132002
chr11 62324001 62325000 0.8801 Pigl genic 4795
chr3 50932001 50933000 0.88 Ccrn4l intergenic 95369
chr6 31321001 31322000 0.8799 RP23-400L4.4 genic 369
chr12 74186001 74187000 0.8798 Six4 intergenic 13943
chr18 67657001 67658000 0.8798 Spire1 genic 9138
chr10 88147001 88148000 0.8794 Spic overlapTSS 240
chr11 60228001 60229000 0.8793 Atpaf2 genic 1553
chr1 188334001 188335000 0.8792 Tgfb2 intergenic 112529
chr1 154763001 154764000 0.8791 Smg7 intergenic 13249
chr11 12369001 12370000 0.879 Cobl upstream 4139
chr15 7118001 7119000 0.879 Lifr genic 13641

chr13 114560001 114561000 0.8787 Arl15 intergenic 23767
chr7 140287001 140288000 0.8786 Fgfr2 genic 27033
chr1 138574001 138575000 0.8785 AC125186.3 intergenic 17788
chr1 36111001 36112000 0.8784 Hs6st1 intergenic 13245
chr12 55203001 55204000 0.8784 AC139934.1 intergenic 30273
chr3 68369001 68370000 0.8784 Schip1 genic 60385
chr11 66825001 66826000 0.878 RP23-67J22.5 genic 126
chr10 62356001 62357000 0.8767 U6 genic 156
chr9 23965001 23966000 0.8767 Npsr1 genic 62561
chr13 34201001 34202000 0.8766 Tubb2b intergenic 17092
chr19 4849001 4850000 0.8766 Ctsf upstream 5129
chr2 58457001 58458000 0.8764 Upp2 genic 29692
chr16 48475001 48476000 0.8763 Morc1 genic 43651
chr2 116804001 116805000 0.8763 U1 intergenic 10092
chrX 7784001 7785000 0.8761 Porcn upstream 350
chr1 138591001 138592000 0.876 AC125186.3 upstream 788
chr17 34038001 34039000 0.8757 Daxx upstream 7463
chr10 27733001 27734000 0.8756 Ptprk intergenic 60626
chr14 76895001 76896000 0.8754 Tsc22d1 genic 11569
chr10 120892001 120893000 0.8753 Rassf3 genic 20333
chr17 63883001 63884000 0.8753 Fbxl17 intergenic 34352
chr6 92196001 92197000 0.8753 Trh upstream 1357
chr8 66408001 66409000 0.8753 SNORA17 intergenic 50878
chr8 109487001 109488000 0.8751 Sntb2 genic 27351
chr3 18524001 18525000 0.8748 AC162611.1 intergenic 89712
chrX 50112001 50113000 0.8746 RP23-426M5.1 intergenic 12519
chr8 86546001 86547000 0.8745 4432412L15Rik genic 631
chr6 67096001 67097000 0.8742 AC122296.1 intergenic 18197
chr12 120005001 120006000 0.8741 7SK intergenic 12776
chr4 140840001 140841000 0.8741 RP23-18M1.5 downstream 4292
chr7 140256001 140257000 0.874 Fgfr2 genic 58033
chr10 69822001 69823000 0.8737 AC122923.1 intergenic 72976
chr2 129624001 129625000 0.8737 Stk35 upstream 1297
chr12 103963001 103964000 0.8735 Moap1 intergenic 16284
chr5 18194001 18195000 0.8734 SNORA17 intergenic 175395
chr13 48255001 48256000 0.8733 AC154314.1 genic 101992
chr15 96533001 96534000 0.8733 Slc38a2 upstream 2872
chr6 136438001 136439000 0.8733 RP23-100O18.1 downstream 275
chr10 5853001 5854000 0.873 RP24-192F16.1 genic 10039
chr2 132096001 132097000 0.873 Cds2 genic 7117
chr9 88227001 88228000 0.873 Nt5e genic 4530
chr1 13074001 13075000 0.8727 AC118477.1 intergenic 21572
chr17 81185001 81186000 0.8727 AC169501.1 intergenic 57139
chr5 118886001 118887000 0.8726 AC132338.1 intergenic 65849
chr6 122720001 122721000 0.8724 Slc2a3 intergenic 27464
chr18 38049001 38050000 0.8723 Diap1 genic 45130
chr10 61392001 61393000 0.8722 Col13a1 genic 48856
chr16 30641001 30642000 0.8722 Fam43a intergenic 38160
chr6 122716001 122717000 0.8722 Slc2a3 intergenic 23464
chr1 20857001 20858000 0.8719 AC156980.3 intergenic 20731
chr16 17457001 17458000 0.8719 Crkl genic 4921
chr14 121905001 121906000 0.8717 Slc15a1 upstream 525
chr3 86649001 86650000 0.871 Dclk2 genic 58920
chr12 87820001 87821000 0.8709 Esrrb genic 41570
chr15 61895001 61896000 0.8709 Pvt1 genic 25399
chr2 59472001 59473000 0.8709 Tanc1 genic 21902
chr3 11214001 11215000 0.8709 AC158155.1 intergenic 198838
chr7 103393001 103394000 0.8708 Odz4 genic 73247
chr17 63882001 63883000 0.8707 Fbxl17 intergenic 33352
chr9 118447001 118448000 0.8706 Golga4 genic 31565
chr9 113102001 113103000 0.8705 7SK intergenic 209593
chr19 42210001 42211000 0.8701 Avpi1 upstream 6452
chr6 54805001 54806000 0.8701 Znrf2 genic 37494
chr7 13600001 13601000 0.8701 Zbtb45 upstream 4822
chr10 84018001 84019000 0.8696 AC140333.1 upstream 1918
chr12 57397001 57398000 0.8696 7SK intergenic 28810
chr9 77607001 77608000 0.8696 Gclc genic 4659
chr11 39864001 39865000 0.8695 RP23-31O3.1 genic 42305
chr9 45157001 45158000 0.8693 Tmprss13 downstream 1337
chr15 10572001 10573000 0.8692 Rai14 genic 70295
chr5 114047001 114048000 0.8691 AC121909.1 downstream 7322
chr1 60908001 60909000 0.8689 Ctla4 intergenic 34844
chr4 34506001 34507000 0.8689 Akirin2 genic 7157
chr18 37931001 37932000 0.8688 Pcdhga2 genic 69524
chr2 135482001 135483000 0.8686 Plcb4 upstream 1747
chr16 26962001 26963000 0.8685 Gm606 genic 4679
chr19 23207001 23208000 0.8685 AC121973.1 upstream 1390
chr5 53933001 53934000 0.8684 AC084054.1 downstream 6881
chr9 99320001 99321000 0.8684 Mras genic 16137
chr11 30650001 30651000 0.8682 RP23-358M10.2 overlapTSS 168
chr11 25374001 25375000 0.868 RP23-117H24.2 intergenic 106864
chr4 116454001 116455000 0.868 Tesk2 genic 21847
chr6 4969001 4970000 0.868 Ppp1r9a genic 115681
chr14 99738001 99739000 0.8679 U2 intergenic 12267
chr9 78212001 78213000 0.8677 Dppa5a downstream 1859
chr11 79949001 79950000 0.8674 Atad5 genic 295
chr1 139160001 139161000 0.8673 U6 intergenic 47777
chr15 102071001 102072000 0.8673 Rarg genic 5632
chr4 85887001 85888000 0.867 Adamtsl1 genic 10877
chr7 140304001 140305000 0.867 Fgfr2 genic 10033
chr2 112517001 112518000 0.8669 Ryr3 genic 45464
chr12 62711001 62712000 0.8667 Lrfn5 genic 60811
chr12 107414001 107415000 0.8664 SNORA17 intergenic 44285
chr3 28741001 28742000 0.8664 Rpl22l1 intergenic 34664
chr1 34130001 34131000 0.8662 Dst genic 108881
chr3 73238001 73239000 0.8658 Bche intergenic 200730
chr13 114816001 114817000 0.8656 Arl15 genic 130663
chr2 106401001 106402000 0.8653 RP23-253D15.2 intergenic 12948
chr10 28145001 28146000 0.8652 Ptprk genic 170046
chr9 77464001 77465000 0.865 Klhl31 intergenic 19539
chr17 80145001 80146000 0.8646 1700038P13Rik intergenic 29225
chr8 37613001 37614000 0.8646 Dlc1 intergenic 16805
chr11 101634001 101635000 0.8644 Etv4 genic 2940
chr1 162240001 162241000 0.8637 Rabgap1l genic 40702
chr1 193485001 193486000 0.8637 Ints7 intergenic 38450
chr7 102628001 102629000 0.8636 U6 intergenic 37034
chr13 63632001 63633000 0.8633 Ptch1 genic 22364
chr13 44773001 44774000 0.8631 Jarid2 intergenic 52640
chr2 74015001 74016000 0.8631 RP23-238A18.2 intergenic 26110
chr9 114493001 114494000 0.8629 Cnot10 downstream 996
chr2 101961001 101962000 0.8628 Ldlrad3 genic 64515
chr1 120972001 120973000 0.8625 Gli2 intergenic 21805
chr6 39395001 39396000 0.8625 Dennd2a intergenic 16377
chr9 45158001 45159000 0.8623 Tmprss13 downstream 2337
chr16 30940001 30941000 0.8621 AI480653 intergenic 14713
chr1 51593001 51594000 0.8618 Obfc2a intergenic 57758
chr8 49814001 49815000 0.8616 Odz3 genic 54431
chr18 85103001 85104000 0.8611 Fbxo15 upstream 397
chr16 38558001 38559000 0.861 Tmem39a upstream 3972
chr2 166019001 166020000 0.861 RP23-120P1.3 upstream 2639
chr19 7336001 7337000 0.8609 5S_rRNA downstream 1450
chr6 97052001 97053000 0.8608 A130022J15Rik downstream 7018
chr16 25050001 25051000 0.8605 AC133967.1 upstream 8725
chr4 107891001 107892000 0.8604 Zyg11a upstream 499
chr8 17576001 17577000 0.8604 Csmd1 intergenic 40415
chr10 57817001 57818000 0.8602 Lims1 genic 30787
chrX 49702001 49703000 0.8602 Gpc3 genic 76398
chr16 38472001 38473000 0.8601 Cd80 genic 12988
chr11 110904001 110905000 0.86 Kcnj16 intergenic 14719
chr19 14752001 14753000 0.86 U4 intergenic 39848
chrX 98677001 98678000 0.8599 U7 upstream 339
chr3 128920001 128921000 0.8598 Pitx2 genic 1504
chr5 117846001 117847000 0.8598 AC113299.3 downstream 2494
chr18 82062001 82063000 0.8595 SNORA17 intergenic 114808
chr2 116803001 116804000 0.8593 U1 upstream 9092
chr1 60907001 60908000 0.8592 Ctla4 intergenic 35844
chr18 34975001 34976000 0.8589 Kdm3b genic 21936
chr2 146125001 146126000 0.8589 Ralgapa2 genic 59386
chr1 137071001 137072000 0.8588 Gpr37l1 upstream 6743
chr10 36655001 36656000 0.8588 Hdac2 intergenic 38350
chr11 117837001 117838000 0.8586 Socs3 upstream 5640
chr12 88630001 88631000 0.8582 Ism2 genic 9655
chr2 65258001 65259000 0.8582 Scn3a intergenic 37320
chr13 112331001 112332000 0.8581 RP23-359G8.2 genic 8119
chr7 89934001 89935000 0.8581 Eftud1 downstream 7639
chr10 94918001 94919000 0.858 Mrpl42 intergenic 24440
chr4 33427001 33428000 0.8579 Rngtt genic 29715
chr9 65467001 65468000 0.8578 Rbpms2 intergenic 10366
chr1 62228001 62229000 0.8577 Pard3b genic 459858
chr5 103334001 103335000 0.8577 Mapk10 downstream 1967
chr1 36001001 36002000 0.8575 Hs6st1 intergenic 123245
chr1 159347001 159348000 0.8574 2810025M15Rik genic 2366
chr14 48192001 48193000 0.8573 AC123665.1 downstream 2549
chr9 18526001 18527000 0.857 Olfr24 intergenic 32136
chr19 29346001 29347000 0.8568 Jak2 genic 19683
chr5 118895001 118896000 0.8567 AC132338.1 intergenic 56849
chr3 103183001 103184000 0.8565 7SK genic 322
chr17 53700001 53701000 0.8564 Kat2b upstream 5640
chr17 7726001 7727000 0.8561 5830477G23Rik intergenic 115042
chr5 32740001 32741000 0.8561 Ppp1cb intergenic 20347
chr12 3662001 3663000 0.856 Dtnb genic 89478
chr6 97471001 97472000 0.856 Frmd4b genic 95535
chr11 5939001 5940000 0.8557 Camk2b genic 25751
chr13 98358001 98359000 0.8557 AC124581.1 intergenic 91998
chr15 16376001 16377000 0.8557 Cdh9 intergenic 330856
chr3 101994001 101995000 0.8557 Vangl1 genic 13616
chr11 6795001 6796000 0.8553 RP23-327K3.3 intergenic 32694
chr3 80906001 80907000 0.8553 Pdgfc genic 65663
chr10 44088001 44089000 0.8549 Atg5 downstream 3904
chr8 49837001 49838000 0.8549 Odz3 genic 77431
chr3 86650001 86651000 0.8548 Dclk2 genic 59920
chr7 25245001 25246000 0.8542 Plaur upstream 1519
chr10 51992001 51993000 0.8541 Dcbld1 genic 38576
chr12 27716001 27717000 0.8536 Gm9866 intergenic 108661
chr4 83135001 83136000 0.8536 Psip1 upstream 2632
chr7 104953001 104954000 0.8535 Pak1 intergenic 37445
chr13 37865001 37866000 0.8534 Rreb1 upstream 4269
chr3 17389001 17390000 0.8534 SNORA48 intergenic 237557
chr10 117898001 117899000 0.8532 Ifng intergenic 15053
chr14 77909001 77910000 0.8532 Enox1 genic 211214
chr16 30821001 30822000 0.853 U6 downstream 6647
chr4 123312001 123313000 0.8529 Macf1 genic 48603
chr9 110855001 110856000 0.8529 Lrrc2 genic 952
chr13 64070001 64071000 0.8528 0610007P08Rik intergenic 68391
chr18 38795001 38796000 0.8526 Arhgap26 genic 9576
chr3 27054001 27055000 0.8525 AC121099.1 genic 941
chr16 8839001 8840000 0.8524 1810013L24Rik genic 8807
chr3 95456001 95457000 0.8524 Mcl1 upstream 5779
chr17 14955001 14956000 0.8523 Wdr27 genic 321
chr18 67501001 67502000 0.8523 AC109280.1 genic 1757
chr7 125971001 125972000 0.8523 9030624J02Rik genic 14480
chr16 9284001 9285000 0.8521 Grin2a intergenic 292803
chr10 33950001 33951000 0.8517 Dse intergenic 22480
chr8 58697001 58698000 0.8517 Hpgd intergenic 75382
chr12 57399001 57400000 0.8516 Mbip intergenic 29334
chr4 61910001 61911000 0.8512 Slc31a2 intergenic 12596
chr5 118899001 118900000 0.8512 AC132338.1 intergenic 52849
chr12 42831001 42832000 0.8511 Immp2l genic 224575
chr16 72072001 72073000 0.8511 SNORA71 intergenic 407838
chr10 129731001 129732000 0.8507 Neurod4 intergenic 13705
chr13 28682001 28683000 0.8504 RP23-143O13.1 genic 129354
chr16 33844001 33845000 0.8504 Itgb5 genic 14250
chr17 28731001 28732000 0.8499 Srpk1 genic 6408
chr19 55631001 55632000 0.8498 Vti1a genic 66454
chr8 35790001 35791000 0.8496 Dusp4 intergenic 79351
chr11 78816001 78817000 0.8495 Ksr1 downstream 9942
chr15 62790001 62791000 0.8495 AC132141.1 intergenic 21479
chr19 14669001 14670000 0.8494 Tle4 genic 2473
chr4 61918001 61919000 0.8493 Slc31a2 upstream 4596
chr18 47652001 47653000 0.8492 Sema6a intergenic 123477
chr3 129247001 129248000 0.8492 Elovl6 genic 11697
chr12 4360001 4361000 0.8491 Ncoa1 genic 110546
chr17 37135001 37136000 0.8491 Zfp57 upstream 3487
chr5 104862001 104863000 0.8491 Spp1 upstream 1137
chr10 12292001 12293000 0.849 Utrn genic 190015
chr16 13748001 13749000 0.8489 RP23-331D17.2 intergenic 10328
chr13 23687001 23688000 0.8486 Hist1h2be genic 11390
chr3 22055001 22056000 0.8485 Tbl1xr1 genic 54455
chr11 20221001 20222000 0.8483 Slc1a4 genic 10716
chr7 138031001 138032000 0.8479 Fgfr2 genic 725036
chr17 71122001 71123000 0.8478 Dlgap1 genic 47753
chr9 41982001 41983000 0.8478 Sorl1 intergenic 49621
chr9 58126001 58127000 0.8478 Loxl1 downstream 9409
chr11 11842001 11843000 0.8477 Grb10 genic 11490
chr8 81017001 81018000 0.8477 Rbmxrt intergenic 11170
chr3 28740001 28741000 0.8475 Rpl22l1 intergenic 33664
chr4 82975001 82976000 0.8475 Ttc39b upstream 4842
chr6 37608001 37609000 0.8475 AC157101.1 intergenic 77574
chr17 45739001 45740000 0.8474 Gm7325 overlapTSS 15
chr2 128066001 128067000 0.8474 RP23-249P4.1 genic 44272
chr5 135417001 135418000 0.8473 Wbscr27 genic 507
chr5 65039001 65040000 0.8471 Klf3 intergenic 154628
chr13 5802001 5803000 0.847 Klf6 intergenic 57735
chr9 67056001 67057000 0.847 Tln2 downstream 8943
chr3 147626001 147627000 0.8469 SNORA17 intergenic 390051
chr11 47454001 47455000 0.8466 Sgcd genic 61694
chr16 23102001 23103000 0.8464 Eif4a2 upstream 4517
chr8 80592001 80593000 0.8462 Ttc29 intergenic 144196
chr11 79239001 79240000 0.846 Nf1 genic 85707
chr7 137744001 137745000 0.8459 Fgfr2 genic 438036
chr14 52686001 52687000 0.8457 SNORD58 upstream 1975
chr17 85542001 85543000 0.8457 1700106N22Rik genic 52055
chr9 8463001 8464000 0.8457 Trpc6 intergenic 80142
chr15 43361001 43362000 0.8453 Ttc35 downstream 1693
chr2 93767001 93768000 0.8453 RP23-375N21.1 intergenic 12456
chr3 148882001 148883000 0.8452 Gm10287 downstream 4420
chr12 71812001 71813000 0.8451 U6 intergenic 37268
chr1 36118001 36119000 0.8449 Hs6st1 upstream 6245
chr1 102774001 102775000 0.8449 AC111074.1 intergenic 15523
chr17 87871001 87872000 0.8449 SNORA17 intergenic 18305
chr19 33038001 33039000 0.8449 7SK downstream 5440
chr4 138001001 138002000 0.8448 Mul1 downstream 2821
chr4 132998001 132999000 0.8446 Slc9a1 intergenic 18384
chr7 147135001 147136000 0.8446 AC107822.5 upstream 295
chr1 84173001 84174000 0.8445 Pid1 genic 107166
chr19 53523001 53524000 0.8445 5830416P10Rik upstream 1197
chr8 12050001 12051000 0.8445 1700018L24Rik intergenic 194240
chr9 78219001 78220000 0.8444 Dppa5a upstream 3017
chr3 35169001 35170000 0.8442 AC110893.1 intergenic 142764
chr16 46500001 46501000 0.844 Pvrl3 upstream 1363
chr6 5069001 5070000 0.844 Ppp1r9a genic 45661
chr2 146126001 146127000 0.8435 Ralgapa2 genic 60386
chr1 59544001 59545000 0.8434 Fzd7 downstream 210
chr2 112516001 112517000 0.8433 Ryr3 genic 44464
chr6 80540001 80541000 0.8433 Lrrtm4 genic 219137
chr13 108451001 108452000 0.843 Zswim6 intergenic 63334
chr9 17679001 17680000 0.843 Chordc1 intergenic 416711
chr14 64120001 64121000 0.8428 Tdh genic 6593
chr4 13023001 13024000 0.8428 RP23-321K16.1 genic 47
chr10 95346001 95347000 0.8427 SNORA17 intergenic 14237
chr6 54047001 54048000 0.8425 Chn2 genic 57434
chr13 74052001 74053000 0.8423 Trip13 genic 2093
chr13 14519001 14520000 0.8422 Hecw1 genic 95495
chr11 23701001 23702000 0.842 U1 intergenic 23954
chr18 20907001 20908000 0.842 B4galt6 upstream 2223
chr10 33953001 33954000 0.8419 Dse intergenic 25480
chr13 16861001 16862000 0.8417 5033411D12Rik intergenic 87689
chr2 154254001 154255000 0.8417 Cbfa2t2 upstream 7217
chr4 53352001 53353000 0.8416 RP23-204P1.3 intergenic 30457
chr7 86356001 86357000 0.8416 Abhd2 intergenic 61087
chr11 23702001 23703000 0.8415 U1 intergenic 24954
chr15 56574001 56575000 0.8415 Has2 intergenic 47942
chr11 95173001 95174000 0.8414 RP23-265A5.4 genic 1258
chr6 64971001 64972000 0.8413 Smarcad1 intergenic 20661
chr1 77450001 77451000 0.8411 Epha4 genic 60663
chr1 88084001 88085000 0.8411 Armc9 genic 32646
chr16 65649001 65650000 0.8411 Chmp2b intergenic 86030
chr2 38637001 38638000 0.8411 Nr6a1 genic 58111
chr5 145673001 145674000 0.8411 Smurf1 genic 35637
chr1 120405001 120406000 0.841 Clasp1 genic 100009
chr7 87160001 87161000 0.8409 Zfp710 intergenic 10220
chr19 55617001 55618000 0.8408 Vti1a genic 80454
chr4 117086001 117087000 0.84 Rnf220 genic 82657
chr7 3210001 3211000 0.84 mmu-mir-290 upstream 7627
chr19 37475001 37476000 0.8398 Kif11 genic 20349
chr6 72877001 72878000 0.8396 Kcmf1 intergenic 27247
chr6 142910001 142911000 0.8396 St8sia1 genic 1972
chr13 40958001 40959000 0.8395 Gcnt2 genic 2500
chr5 135446001 135447000 0.8393 Cldn3 intergenic 15084
chr6 6090001 6091000 0.839 Slc25a13 genic 76118
chr17 33820001 33821000 0.8383 Hnrnpm genic 2805
chr17 43849001 43850000 0.8383 Cyp39a1 genic 37794
chr19 34730001 34731000 0.8382 Ifit1 downstream 5502
chr16 64885001 64886000 0.8379 Cggbp1 intergenic 25668
chr12 88658001 88659000 0.8378 Sptlc2 genic 7418
chr15 5941001 5942000 0.8378 SNORA17 intergenic 159217
chr1 138524001 138525000 0.8377 Zfp281 genic 1630
chr8 93428001 93429000 0.8376 Chd9 genic 75265
chr10 117154001 117155000 0.8375 Mdm2 upstream 6187
chr17 37111001 37112000 0.8373 2410137M14Rik downstream 2646
chr8 25681001 25682000 0.8373 Ido2 genic 4805
chr10 108005001 108006000 0.8372 Syt1 genic 70295
chr5 38980001 38981000 0.8372 AC084322.1 intergenic 13333
chr6 30222001 30223000 0.8372 Ube2h genic 31539
chr14 122667001 122668000 0.8371 Clybl genic 86058
chr4 141791001 141792000 0.837 9030409G11Rik genic 3316
chr6 67062001 67063000 0.837 RP23-129P10.1 intergenic 31353
chr4 107855001 107856000 0.8368 Zyg11a genic 462
chr11 33430001 33431000 0.8364 AL669814.1 intergenic 16113
chr17 35641001 35642000 0.8363 Pou5f1 upstream 1007
chr11 93908001 93909000 0.8358 Spag9 genic 50596
chr13 110630001 110631000 0.8358 Pde4d genic 110682
chr6 128232001 128233000 0.8358 Tead4 genic 17831
chr6 52027001 52028000 0.8355 RP23-103L13.1 intergenic 24948
chr12 40722001 40723000 0.8352 Arl4a downstream 9034
chr5 123608001 123609000 0.8352 Setd1b genic 5374
chr14 99909001 99910000 0.835 U1 intergenic 148955
chr19 14948001 14949000 0.835 AC116999.1 intergenic 85039
chr14 106320001 106321000 0.8349 Spry2 intergenic 23965
chr10 59628001 59629000 0.8346 Chst3 intergenic 15282
chr10 77510001 77511000 0.8342 Dnmt3l genic 4969
chr10 117243001 117244000 0.8342 Rap1b downstream 7602
chr1 183883001 183884000 0.8341 Enah genic 47357
chr10 66563001 66564000 0.834 Reep3 upstream 3346
chr13 44116001 44117000 0.8338 RP23-469C2.1 intergenic 99536
chr19 25648001 25649000 0.8338 Dmrt1 genic 29819
chr10 116000001 116001000 0.8337 Cnot2 genic 17556
chr7 87366001 87367000 0.8335 Sema4b genic 282
chr2 170597001 170598000 0.8331 Dok5 genic 39694
chr14 49273001 49274000 0.8328 U6 downstream 724
chr17 47708001 47709000 0.8328 Ccnd3 intergenic 21491
chr1 54542001 54543000 0.8327 Pgap1 genic 4451
chr12 103945001 103946000 0.8327 Itpk1 upstream 1922
chr18 38788001 38789000 0.8326 Arhgap26 genic 16576
chr2 157202001 157203000 0.8325 Manbal genic 8671
chr1 130756001 130757000 0.8323 U6 intergenic 25486
chr19 4329001 4330000 0.8323 Kdm2a genic 12831
chr13 55415001 55416000 0.8322 Nsd1 genic 3686
chr1 194278001 194279000 0.8321 Kcnh1 genic 55040
chr3 97344001 97345000 0.8317 Chd1l intergenic 19665
chr5 97632001 97633000 0.8316 AC122916.1 intergenic 60768
chr8 125547001 125548000 0.8316 Galnt2 genic 1321622
chr10 44792001 44793000 0.8315 Prep genic 4989
chr3 30900001 30901000 0.8315 Prkci genic 5332
chr5 118884001 118885000 0.8313 AC132338.1 intergenic 67849
chr6 129314001 129315000 0.8313 Clec12a genic 321
chr12 4904001 4905000 0.8312 Ubxn2a genic 9511
chr6 72758001 72759000 0.8312 Tcf7l1 intergenic 18753
chr11 62028001 62029000 0.8311 Cytsb genic 7515
chr13 91009001 91010000 0.8311 AC108947.1 intergenic 13156
chr18 77305001 77306000 0.8311 Pias2 genic 1054
chr11 101543001 101544000 0.831 Arl4d intergenic 13855
chr13 4770001 4771000 0.831 Gm5444 overlapTSS 105
chr2 162877001 162878000 0.8309 Mybl2 upstream 2423
chr6 148940001 148941000 0.8309 Dennd5b genic 3408
chr14 117268001 117269000 0.8308 Gpc6 intergenic 55519
chr2 160297001 160298000 0.8307 RP23-152H17.2 intergenic 96115
chr10 6221001 6222000 0.8305 Mthfd1l genic 41541
chr14 27860001 27861000 0.8305 Il17rd genic 7814
chr6 52056001 52057000 0.8305 RP23-103L13.1 genic 3053
chr8 109811001 109812000 0.8303 Nfat5 upstream 5370
chr3 137443001 137444000 0.8302 U6 intergenic 64273
chr6 8295001 8296000 0.8301 RP23-31E23.3 genic 85713
chr7 118481001 118482000 0.8301 Galntl4 intergenic 133175
chr2 152003001 152004000 0.83 RP23-452D3.2 downstream 7027
chr12 54575001 54576000 0.8298 Npas3 genic 225338
chr16 13749001 13750000 0.8298 RP23-331D17.2 intergenic 11328
chr17 31943001 31944000 0.8297 Sik1 intergenic 37193
chr9 20787001 20788000 0.8295 S1pr2 upstream 5834
chr7 4854001 4855000 0.8294 AC157563.2 downstream 6682
chr2 171104001 171105000 0.8293 RP23-117M21.1 downstream 7151
chr14 9236001 9237000 0.8285 Oit1 intergenic 24724
chr3 122641001 122642000 0.8285 Usp53 genic 4482
chr10 40083001 40084000 0.8284 Cdk19 genic 13887
chr3 133197001 133198000 0.8284 Tet2 genic 9332
chr6 77554001 77555000 0.8282 Ctnna2 genic 374693
chr2 149624001 149625000 0.8281 RP24-87G15.2 overlapTSS 436
chr7 31249001 31250000 0.828 Nphs1 genic 4257
chr7 125262001 125263000 0.828 Arl6ip1 genic 413
chr18 68317001 68318000 0.8279 D18Ertd653e genic 97203
chr15 37907001 37908000 0.8277 Ubr5 genic 9918
chr7 136564001 136565000 0.8275 Ppapdc1a intergenic 29079
chr2 146456001 146457000 0.8272 RP23-402L5.1 intergenic 84157
chr1 33997001 33998000 0.827 Dst genic 31876
chr17 31444001 31445000 0.827 Slc37a1 genic 10299
chr7 38813001 38814000 0.827 C80913 upstream 8440
chr11 60255001 60256000 0.8269 4933439F18Rik genic 8429
chr15 41714001 41715000 0.8269 Abra intergenic 12787
chr11 68207001 68208000 0.8268 Ntn1 genic 6325
chr14 28160001 28161000 0.8268 Arhgef3 genic 56090
chr19 57651001 57652000 0.8266 Atrnl1 intergenic 33524
chr18 82948001 82949000 0.8265 AC113104.1 intergenic 76286
chr12 87807001 87808000 0.8263 Esrrb genic 54570
chr10 47237001 47238000 0.8262 U1 intergenic 678688
chr9 58120001 58121000 0.8262 Stoml1 downstream 9676
chr9 74845001 74846000 0.8261 BC031353 genic 21083
chr10 56100001 56101000 0.826 Gja1 genic 2865
chr6 148941001 148942000 0.8258 Dennd5b genic 4408
chr11 88581001 88582000 0.8257 RP23-393B19.1 genic 1042
chr6 125454001 125455000 0.8257 AC153580.1 genic 4153
chr1 4844001 4845000 0.8256 Lypla1 genic 31851
chr8 128028001 128029000 0.8256 Sipa1l2 intergenic 11391
chr10 83845001 83846000 0.8255 Nuak1 genic 11351
chr2 146127001 146128000 0.8254 Ralgapa2 genic 61386
chr8 44376001 44377000 0.8254 Zfp42 downstream 3422
chr12 42113001 42114000 0.8252 Immp2l genic 362324
chr1 34167001 34168000 0.8249 Dst genic 71881
chr5 64322001 64323000 0.8249 Rell1 genic 21866
chr11 26029001 26030000 0.8248 RP23-306I3.1 genic 80569
chr2 154719001 154720000 0.8248 a genic 101863
chr4 104995001 104996000 0.8248 RP23-104G20.2 intergenic 51059
chr13 16921001 16922000 0.8246 5033411D12Rik intergenic 27689
chr1 183672001 183673000 0.8243 Dnahc14 genic 2882
chr4 135137001 135138000 0.8243 U6 upstream 1663
chr11 53111001 53112000 0.8242 Hspa4 genic 1959
chr2 181667001 181668000 0.8241 AL928734.1 intergenic 28997
chr3 30953001 30954000 0.8241 Prkci downstream 1338
chr17 36076001 36077000 0.824 CR974451.2 upstream 3106
chr16 73097001 73098000 0.8238 Robo1 intergenic 74661
chr1 77469001 77470000 0.8235 Epha4 genic 41663
chr12 36326001 36327000 0.8235 Ahr intergenic 106291
chr16 20372001 20373000 0.8235 Abcc5 genic 40625
chr5 107203001 107204000 0.8233 Hfm1 intergenic 65211
chr9 40601001 40602000 0.8233 Hspa8 upstream 7067
chr9 70665001 70666000 0.8233 Lipc genic 18942
chr4 82189001 82190000 0.8232 Nfib genic 44776
chr8 107757001 107758000 0.823 D230025D16Rik genic 7956
chr3 100969001 100970000 0.8229 RP23-434L7.1 intergenic 40782
chr10 68258001 68259000 0.8228 Tmem26 intergenic 14639
chr8 67732001 67733000 0.8228 BC030870 intergenic 78690
chr19 57441001 57442000 0.8227 Fam160b1 genic 5467
chr10 41550001 41551000 0.8224 Sesn1 genic 19621
chr2 93768001 93769000 0.8224 RP23-375N21.1 intergenic 11456
chr8 129081001 129082000 0.8223 AC118255.1 upstream 8870
chr4 59571001 59572000 0.8221 AL824704.1 downstream 8504
chr8 121320001 121321000 0.822 Cdh13 genic 512368
chr5 14002001 14003000 0.8219 Sema3e intergenic 22276
chr1 153194001 153195000 0.8217 Ivns1abp genic 2373
chr2 131709001 131710000 0.8216 Lamr1-ps1 intergenic 19822
chr7 87891001 87892000 0.8211 Iqgap1 genic 33328
chr11 7323001 7324000 0.8208 RP23-20C9.3 intergenic 132624
chr3 135778001 135779000 0.8208 Bank1 genic 61671
chr12 88286001 88287000 0.8204 2310044G17Rik upstream 1364
chr16 8832001 8833000 0.8204 1810013L24Rik genic 1807
chr10 20822001 20823000 0.8203 Myb intergenic 21736
chr11 117834001 117835000 0.8203 Socs3 upstream 2640
chr2 171056001 171057000 0.8203 RP23-117M21.1 genic 37232
chr11 107297001 107298000 0.82 Pitpnc1 genic 34013
chr5 76751001 76752000 0.8199 Pdcl2 genic 8158
chr7 103364001 103365000 0.8199 Odz4 genic 44247
chr8 91591001 91592000 0.8199 Sall1 intergenic 22940
chr4 55476001 55477000 0.8198 RP23-120N11.5 genic 1686
chr10 66380001 66381000 0.8197 AC153382.2 genic 2006
chr1 63185001 63186000 0.8196 RP23-13C17.1 genic 18864
chr19 21969001 21970000 0.8196 Tmem2 intergenic 36184
chr10 119607001 119608000 0.8195 Irak3 genic 28297
chr13 110467001 110468000 0.8192 Pde4d genic 273682
chr2 59460001 59461000 0.8192 Tanc1 genic 9902
chr8 119987001 119988000 0.8191 4933407C03Rik downstream 3690
chr6 143091001 143092000 0.8189 Etnk1 intergenic 23750
chr7 26088001 26089000 0.8188 Prr19 genic 152
chr19 23211001 23212000 0.8187 AC121973.1 downstream 794
chr5 110866001 110867000 0.8187 Fbrsl1 genic 10505
chr13 17936001 17937000 0.8186 U6 intergenic 34470
chr10 19611001 19612000 0.8185 Pex7 genic 15495
chr3 126152001 126153000 0.8185 Arsj downstream 8709
chr14 119113001 119114000 0.8184 Abcc4 upstream 7574
chr16 93463001 93464000 0.8184 mmu-mir-802 intergenic 92940
chrX 20452001 20453000 0.8184 Syn1 genic 14364
chr13 85286001 85287000 0.8183 Ccnh intergenic 42082
chr2 71189001 71190000 0.8183 Slc25a12 genic 15806
chr11 79951001 79952000 0.8182 1110002N22Rik genic 821
chr12 118286001 118287000 0.818 Ptprn2 genic 228266
chr1 134916001 134917000 0.8179 Mdm4 genic 4925
chr4 147974001 147975000 0.8179 Masp2 upstream 1649
chr12 105671001 105672000 0.8178 SNORA17 intergenic 13391
chr2 165806001 165807000 0.8178 RP23-108D12.6 genic 3744
chr4 128728001 128729000 0.8177 RP23-269B7.6 genic 660
chr10 92279001 92280000 0.8174 4930485B16Rik intergenic 80067
chr3 41300001 41301000 0.8174 SNORD22 intergenic 54612
chr1 36071001 36072000 0.8173 Hs6st1 intergenic 53245
chr14 86796001 86797000 0.8173 U1 intergenic 11329
chr17 37110001 37111000 0.8172 H2-M6-ps upstream 1660
chr9 20782001 20783000 0.8172 S1pr2 upstream 834
chr6 140554001 140555000 0.8171 Plekha5 downstream 9197
chr4 105133001 105134000 0.817 RP23-48A7.1 intergenic 49837
chr4 130195001 130196000 0.817 Pum1 intergenic 23236
chr10 107728001 107729000 0.8168 Ppp1r12a intergenic 13370
chr12 72561001 72562000 0.8168 RP23-50H15.2 intergenic 138354
chr11 66746001 66747000 0.8167 RP23-67J22.2 genic 13588
chr3 135208001 135209000 0.8167 Manba genic 9530
chr7 38815001 38816000 0.8167 C80913 intergenic 10440
chr1 188746001 188747000 0.8166 D1Pas1 intergenic 44295
chr1 6448001 6449000 0.8165 St18 intergenic 28312
chr9 31903001 31904000 0.8165 Arhgap32 intergenic 111785
chr2 154291001 154292000 0.8164 Cbfa2t2 genic 28784
chr2 26314001 26315000 0.8163 RP23-306D20.11 genic 101
chr13 34919001 34920000 0.8162 Prpf4b intergenic 47398
chr9 55060001 55061000 0.8161 Nrg4 genic 70394
chr15 81493001 81494000 0.816 L3mbtl2 upstream 386
chr16 16088001 16089000 0.816 2310008H04Rik genic 57944
chr8 114173001 114174000 0.816 Zfp1 genic 5567
chr10 21569001 21570000 0.8159 AC158614.1 upstream 4344
chr17 17578001 17579000 0.8159 Lix1 genic 17351
chr13 19661001 19662000 0.8157 Epdr1 intergenic 21587
chr15 83272001 83273000 0.8156 Pacsin2 genic 22000
chr3 130849001 130850000 0.8156 Lef1 genic 35612
chr16 13951001 13952000 0.8155 Mpv17l downstream 1289
chr3 83824001 83825000 0.8155 D930015E06Rik genic 19083
chr9 78207001 78208000 0.8154 Gsta2 upstream 3406
chr10 21421001 21422000 0.8152 Gm5420 downstream 7217
chr10 111837001 111838000 0.8151 Kcnc2 genic 65360
chr4 98405001 98406000 0.8151 L1td1 downstream 1587
chr17 44119001 44120000 0.815 Rcan2 genic 56465
chr7 73018001 73019000 0.815 Pcsk6 genic 10902
chr8 18480001 18481000 0.815 SNORA17 intergenic 31515
chr10 66546001 66547000 0.8147 Reep3 genic 12655
chr2 59471001 59472000 0.8147 Tanc1 genic 20902
chr4 82076001 82077000 0.8145 Nfib genic 74674
chr13 96302001 96303000 0.8144 F2rl1 upstream 6819
chr19 23206001 23207000 0.8144 AC121973.1 upstream 2390
chr15 79350001 79351000 0.8143 Kdelr3 genic 3146
chr4 137878001 137879000 0.8143 Pink1 genic 3222
chr1 166422001 166423000 0.8142 AC158943.1 intergenic 10438
chr14 50067001 50068000 0.8142 Slc35f4 genic 77540
chr9 20432001 20433000 0.8141 Fbxl12 genic 9439
chr7 146048001 146049000 0.8136 Ppp2r2d genic 10239
chr11 107213001 107214000 0.8135 Pitpnc1 genic 118013
chr15 38242001 38243000 0.8135 Klf10 intergenic 11540
chr4 141125001 141126000 0.8132 Fblim1 downstream 5977
chr2 26298001 26299000 0.8131 Sec16a genic 1734
chr10 66364001 66365000 0.813 AC153382.2 intergenic 10237
chr5 72856001 72857000 0.8129 Corin genic 38447
chr7 148144001 148145000 0.8129 Ifitm2 upstream 2141
chrX 154039001 154040000 0.8129 RP23-22J3.6 downstream 1965
chr12 40723001 40724000 0.8127 Arl4a downstream 8034
chr3 83784001 83785000 0.8127 D930015E06Rik genic 59083
chr5 113759001 113760000 0.8127 mmu-mir-469 intergenic 12106
chr2 20417001 20418000 0.8124 Etl4 genic 248443
chr2 165795001 165796000 0.8123 RP23-108D12.6 downstream 6257
chr11 97150001 97151000 0.8122 RP23-81D2.6 downstream 5792
chr2 65686001 65687000 0.8122 Csrnp3 genic 2177
chr10 58816001 58817000 0.8121 P4ha1 genic 19052
chr10 59021001 59022000 0.8121 Ccdc109a genic 57440
chr19 32269001 32270000 0.8119 Sgms1 genic 71784
chr6 4930001 4931000 0.8119 Ppp1r9a genic 76681
chr7 56704001 56705000 0.8118 Nav2 genic 3328
chr16 22413001 22414000 0.8117 Etv5 genic 25691
chr4 130187001 130188000 0.8117 Nkain1 intergenic 18609
chr6 83000001 83001000 0.8117 Loxl3 genic 1556
chr19 38138001 38139000 0.8116 Cep55 genic 8471
chr4 123301001 123302000 0.8116 Macf1 genic 59603
chr7 104578001 104579000 0.8114 Thrsp intergenic 11761
chr8 89412001 89413000 0.8113 N4bp1 upstream 2844
chr2 104658001 104659000 0.8112 Qser1 upstream 1148
chr13 51148001 51149000 0.8111 AC160122.1 upstream 8935
chr15 55709001 55710000 0.8111 Sntb1 genic 28504
chr3 144263001 144264000 0.8111 15-Sep downstream 2357
chr1 158892001 158893000 0.8109 Ralgps2 intergenic 22337
chr8 128226001 128227000 0.8108 2310079N02Rik intergenic 26863
chr10 18138001 18139000 0.8107 Nhsl1 genic 27873
chr11 22836001 22837000 0.8107 Commd1 genic 45090
chr2 29356001 29357000 0.8105 Med27 genic 23313
chr8 14015001 14016000 0.8105 Erich1 intergenic 11565
chr4 130196001 130197000 0.8104 Pum1 intergenic 22236
chr14 64119001 64120000 0.8102 Tdh genic 7593
chr2 70117001 70118000 0.8102 Myo3b genic 148290
chr3 129249001 129250000 0.8102 Elovl6 genic 13697
chr12 56487001 56488000 0.8101 Psma6 genic 1792
chr1 130289001 130290000 0.81 Dars genic 23902
chr4 12590001 12591000 0.81 5S_rRNA intergenic 148853
chr15 102232001 102233000 0.8099 Sp1 upstream 3747
chr5 104155001 104156000 0.8099 Aff1 genic 33608
chr6 72876001 72877000 0.8099 Kcmf1 intergenic 26247
chr9 56395001 56396000 0.8098 Hmg20a intergenic 50310
chr16 94823001 94824000 0.8097 Dyrk1a genic 31384
chr1 56754001 56755000 0.8096 Hsfy2 intergenic 59781
chr8 75860001 75861000 0.8096 Large genic 16439
chr2 109783001 109784000 0.8095 Lgr4 genic 25197
chr10 21568001 21569000 0.8094 AC158614.1 upstream 3344
chr4 118698001 118699000 0.809 RP23-228G24.4 intergenic 24911
chr16 14021001 14022000 0.8089 2900011O08Rik genic 34304
chr4 21974001 21975000 0.8089 RP23-416G5.3 intergenic 38665
chr15 39478001 39479000 0.8088 Rims2 genic 34486
chr2 129071001 129072000 0.8086 RP23-160G19.5 upstream 112
chr3 25026001 25027000 0.8086 AC124979.1 intergenic 82938
chr4 65991001 65992000 0.8085 Astn2 genic 73517
chr2 123971001 123972000 0.8084 Sema6d genic 55296
chr4 98514001 98515000 0.8084 RP23-291M22.4 upstream 7305
chr14 58720001 58721000 0.8083 Fgf9 genic 7956
chr15 37168001 37169000 0.8083 Grhl2 genic 5210
chr2 73961001 73962000 0.8083 RP23-238A18.2 intergenic 11814
chr14 12496001 12497000 0.8082 Ptprg genic 109955
chr14 100609001 100610000 0.8082 RP24-501J23.1 upstream 2361
chr17 80080001 80081000 0.8082 Fam82a1 genic 492
chr17 10559001 10560000 0.8081 Pacrg intergenic 35878
chr6 82653001 82654000 0.808 Pole4 genic 1359
chr2 156516001 156517000 0.8079 Dlgap4 genic 9513
chr11 63903001 63904000 0.8078 RP23-78H18.2 downstream 6240
chr4 33601001 33602000 0.8078 Rngtt intergenic 11412
chr1 141391001 141392000 0.8075 Aspm downstream 334
chr1 168054001 168055000 0.8075 Dusp27 genic 3029
chr10 42763001 42764000 0.8074 Sobp genic 40694
chr7 28432001 28433000 0.8073 Ttc9b upstream 5934
chr19 47525001 47526000 0.8072 Sh3pxd2a genic 12891
chr17 88558001 88559000 0.8071 Gm4832 intergenic 27655
chr18 46873001 46874000 0.8071 Cdo1 genic 154
chr4 33425001 33426000 0.8071 Rngtt genic 27715
chr8 119868001 119869000 0.8069 4933407C03Rik genic 86897
chr13 19720001 19721000 0.8068 Sfrp4 genic 3690
chr2 172346001 172347000 0.8065 RP23-228E2.8 intergenic 10810
chr3 148883001 148884000 0.8065 Gm10287 downstream 3420
chr6 23780001 23781000 0.8064 Cadps2 genic 8420
chr9 21572001 21573000 0.8064 Kank2 genic 768
chr16 42781001 42782000 0.8063 RP23-283M18.2 genic 55186
chr14 122654001 122655000 0.806 Clybl genic 73058
chr15 7065001 7066000 0.8059 Gm10050 intergenic 13021
chr10 94638001 94639000 0.8056 Cradd genic 102
chr8 119197001 119198000 0.8056 Cdyl2 genic 58891
chr18 77799001 77800000 0.8055 8030462N17Rik intergenic 72020
chr10 60778001 60779000 0.8054 X99384 downstream 3410
chr11 20491001 20492000 0.8054 Sertad2 genic 5023
chr6 136417001 136418000 0.8054 RP23-100O18.1 intergenic 19276
chr3 100928001 100929000 0.8051 Ptgfrn intergenic 13800
chr3 25027001 25028000 0.805 AC124979.1 intergenic 81938
chr9 31902001 31903000 0.805 Arhgap32 intergenic 112785
chr2 65260001 65261000 0.8048 Scn3a intergenic 35320
chr13 6641001 6642000 0.8047 Pfkp genic 6023
chr2 58906001 58907000 0.8046 Ccdc148 genic 91740
chr18 54879001 54880000 0.8044 AC129179.1 intergenic 20170
165 chr5 75196001 75197000 0.8044 AC120865.2 intergenic 65526 chr19 23542001 23543000 0.8042 Mamdc2 intergenic 19113 chr5 104863001 104864000 0.8041 Spp1 upstream 137 chr18 61793001 61794000 0.804 AC148011.4 upstream 5307 chr12 27749001 27750000 0.8039 Gm9866 intergenic 75661 chr2 31017001 31018000 0.8039 Gpr107 genic 9165 chr3 83983001 83984000 0.8039 Trim2 genic 18640 chr5 107202001 107203000 0.8039 Hfm1 intergenic 66211 chr9 49559001 49560000 0.8039 Ncam1 genic 46834 chr13 98350001 98351000 0.8038 AC124581.1 intergenic 83998 chr3 96378001 96379000 0.8035 AC122037.6 upstream 1907 chr3 37565001 37566000 0.8032 Spry1 intergenic 21481 chr14 122833001 122834000 0.803 Gm5089 downstream 76 chr2 50905001 50906000 0.8029 Rnd3 intergenic 79958 chr4 82976001 82977000 0.8029 Ttc39b upstream 5842 chr4 84810001 84811000 0.8029 RP23-158E10.1 intergenic 11431 chr9 92155001 92156000 0.8029 Plscr1 genic 10106 chr2 83473001 83474000 0.8027 Zc3h15 intergenic 10592 chr2 44381001 44382000 0.8025 Gtdc1 intergenic 37932 chr4 137153001 137154000 0.8025 Usp48 genic 3334 chr7 109348001 109349000 0.8023 Nup98 genic 9634 chr9 55446001 55447000 0.8023 Scaper genic 48311 chr12 59574001 59575000 0.8022 Clec14a intergenic 204180 chr1 129680001 129681000 0.8021 Ccnt2 genic 9260 chr9 42940001 42941000 0.8021 Pou2f3 genic 7979 chrX 12504001 12505000 0.8021 RP23-354L12.1 upstream 8998 chr16 22253001 22254000 0.8019 Tra2b genic 8379 chr2 129004001 129005000 0.8018 RP24-388N11.3 genic 1368 chr3 27053001 27054000 0.8018 AC121099.1 genic 53 chr2 135385001 135386000 0.8016 RP24-364H2.1 genic 23320 chr17 24932001 24933000 0.8015 4930528F23Rik intergenic 33014 chr6 7703001 7704000 0.8014 Asns intergenic 59747 chr5 63966001 63967000 0.8013 3110047P20Rik intergenic 73342 chr12 72398001 72399000 0.8011 Dact1 intergenic 11871 chr16 92216001 92217000 0.8011 Kcne2 intergenic 75634 chr1 142331001 142332000 0.8009 Kcnt2 genic 174656 chr4 10326001 10327000 0.8009 RP23-172G14.1 intergenic 38962 chr8 87997001 
87998000 0.8009 Gpt2 intergenic 18475 chr5 122719001 122720000 0.8008 Tctn1 upstream 4532 chr2 152618001 152619000 0.8006 Bcl2l1 genic 24949

166 chr10 79513001 79514000 0.8005 Gpx4 genic 3090 chr18 77381001 77382000 0.8005 Pias2 genic 10539 chr9 78222001 78223000 0.8004 Ooep downstream 916 chr1 138467001 138468000 0.8003 AC126606.1 intergenic 19147 chr14 12940001 12941000 0.8 Ptprg genic 133555 chr2 135388001 135389000 0.8 RP24-364H2.1 genic 20320


Appendix 5. Prediction of luciferase-validated enhancer positives and negatives.

The table lists regions previously validated by luciferase assay in Chen et al. 2008: 25 enhancer-positive regions co-bound by OCT4, SOX2 and NANOG; 5 enhancer-negative regions co-bound by N-Myc and c-Myc. The “mm9 location” column reports the locations lifted over from the mouse mm8 build using the liftOver tool on the UCSC Genome Browser. The Enh and PrL columns report whether each region is predicted to belong to the corresponding category. The “Enh≥0.8” and “PrL≥0.8” columns report whether the predicted probability for that category is at least 0.8; the probability is shown in parentheses when it is lower.

Cluster          Enhancer activity  mm9 location               Enh  Enh≥0.8     PrL  PrL≥0.8
Sox2-Oct4-Nanog  active             chr7:13599673-13600014     Y    Y           --   --
Sox2-Oct4-Nanog  active             chr5:104153453-104153778   Y    Y           --   --
Sox2-Oct4-Nanog  active             chr8:74858224-74858547     Y    Y           --   --
Sox2-Oct4-Nanog  active             chr3:18523982-18524304     Y    Y           --   --
Sox2-Oct4-Nanog  active             chr8:49837289-49837606     Y    Y           --   --
Sox2-Oct4-Nanog  active             chr10:21465925-21466245    Y    Y           --   --
Sox2-Oct4-Nanog  active             chr9:21572904-21573224     Y    Y           --   --
Sox2-Oct4-Nanog  active             chr3:53454181-53454502     Y    Y           --   --
Sox2-Oct4-Nanog  active             chr4:57703713-57704033     Y    Y           --   --
Sox2-Oct4-Nanog  active             chr14:86795849-86796170    Y    Y           --   --
Sox2-Oct4-Nanog  active             chr16:84769428-84769748    Y    Y           --   --
Sox2-Oct4-Nanog  active             chr15:97424211-97424611    Y    Y           --   --
Sox2-Oct4-Nanog  active             chr9:45157325-45157735     Y    Y           --   --
Sox2-Oct4-Nanog  active             chr4:55490226-55490626     Y    Y           --   --
Sox2-Oct4-Nanog  active             chr10:21708245-21708598    Y    Y           --   --
Sox2-Oct4-Nanog  active             chr8:91527403-91527812     Y    Y           --   --
Sox2-Oct4-Nanog  active             chr15:61924308-61924695    Y    Y           --   --
Sox2-Oct4-Nanog  active             chr6:64971806-64972205     Y    Y           --   --
Sox2-Oct4-Nanog  active             chr1:11267990-11268340     Y    Y           --   --
Sox2-Oct4-Nanog  active             chr8:74834955-74835344     Y    Y           --   --
Sox2-Oct4-Nanog  active             chr1:77450422-77450790     Y    Y           --   --
Sox2-Oct4-Nanog  active             chr12:72397784-72398167    Y    Y           --   --
Sox2-Oct4-Nanog  active             chr1:72275524-72275915     Y    Y           --   --
Sox2-Oct4-Nanog  active             chr7:6209894-6210215       Y    N (0.6839)  --   --
Sox2-Oct4-Nanog  active             chr1:88207558-88207953     Y    Y           --   --
Myc              inactive           chr11:62416232-62416589    --   --          Y    Y
Myc              inactive           chr5:74489120-74489516     --   --          Y    Y
Myc              inactive           chr4:133466459-133466799   --   --          Y    Y
Myc              inactive           chr11:116532861-116533190  --   --          Y    Y
Myc              inactive           chr2:158186982-158187301   --   --          Y    Y
Myc              inactive           chr7:150482240-150482559   --   --          Y    Y
Myc              inactive           chr18:33954471-33954790    --   --          Y    Y
Myc              inactive           chr3:30915398-30915717     --   --          Y    N (0.7331)
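The thresholding behind the “Enh≥0.8” and “PrL≥0.8” columns can be sketched as follows. This is a minimal illustration, not the code used in the thesis: the function name `threshold_label` is hypothetical, and the only probabilities used are the two sub-threshold values reported in the table.

```python
# Hypothetical helper illustrating the 0.8 probability cut-off used in the table.
THRESHOLD = 0.8


def threshold_label(probability, threshold=THRESHOLD):
    """Return 'Y' if the class probability reaches the threshold,
    otherwise 'N (p)' with the probability shown, as in the table."""
    if probability >= threshold:
        return "Y"
    return f"N ({probability:.4f})"


# The two sub-threshold probabilities reported in the table.
print(threshold_label(0.6839))  # Enh probability of chr7:6209894-6210215
print(threshold_label(0.7331))  # PrL probability of chr3:30915398-30915717
```

A region therefore keeps its predicted category (the Enh or PrL column) even when its probability falls below the cut-off; only the thresholded column changes.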


Appendix 6. Detailed plots for novel putative enhancer regions.

Detailed coverage plots of the novel enhancer regions identified, including a) multiple putative enhancers upstream of the miR-290 cluster, b) multiple contiguous enhancer regions upstream of Tet1 and around a non-coding small nuclear RNA, U6, c) two putative enhancers around Zic3, and d) the putative enhancer region located 10 kb upstream of C80913.


Appendix 7. Active enhancers and cell specificity of all enhancer candidates.

(a) Venn diagrams of the overlap between OSN co-bound sites from the training data set and the active H3K27ac and repressive H3K27me3 marks. (b) The percent overlap of all enhancers with H3K27ac and H3K27me3. (c) Stacked bar plot of the percent overlap of all enhancers with H3K27ac / H3K4me1 in various cell types. All overlaps presented here allow a gap of up to 500 bp.
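An overlap that tolerates a 500 bp gap can be computed as in the sketch below. It is an illustration under stated assumptions rather than the procedure used in the thesis: the function names, the inclusive coordinate convention, and the toy coordinates are all hypothetical.

```python
def within_gap(a_start, a_end, b_start, b_end, max_gap=500):
    """True if two intervals on the same chromosome overlap or are
    separated by at most max_gap bp (assumed convention)."""
    return a_start <= b_end + max_gap and b_start <= a_end + max_gap


def percent_overlap(regions, peaks, max_gap=500):
    """Percentage of regions lying within max_gap bp of at least one peak.

    regions, peaks: lists of (chromosome, start, end) tuples.
    """
    if not regions:
        return 0.0
    hits = 0
    for chrom, start, end in regions:
        if any(
            chrom == p_chrom and within_gap(start, end, p_start, p_end, max_gap)
            for p_chrom, p_start, p_end in peaks
        ):
            hits += 1
    return 100.0 * hits / len(regions)


# Toy coordinates, made up for illustration only.
toy_regions = [("chr1", 1001, 2000), ("chr1", 10001, 11000)]
toy_peaks = [("chr1", 2400, 2600), ("chr2", 10001, 11000)]
print(percent_overlap(toy_regions, toy_peaks))
```

In the toy data the first region lies 400 bp from a peak on the same chromosome and counts as overlapping, while the second matches only a peak on a different chromosome and does not.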


Appendix 8. Feature coefficients determined from lasso regularization (H3K27ac included).

The plot shows feature weights in each class as a function of log lambda, where lambda is the penalization parameter. Weights of features that are less discriminative among the three categories are shrunk to 0 as lambda increases. H3K27ac, a positive predictor of the PrL group, is highlighted in a blue box.
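The shrinkage of weights to exactly 0 can be illustrated with the soft-thresholding operator, which gives the lasso solution for a single coefficient in the special case of an orthonormal design (up to the scaling convention of the penalty). This is a sketch of the general mechanism, not the multi-class model fitted in the thesis; the feature names and unpenalized weights below are made up.

```python
import math


def soft_threshold(weight, lam):
    """Lasso shrinkage of one coefficient under an orthonormal design:
    move the weight toward zero by lam, and return exactly zero
    once |weight| <= lam."""
    if weight > lam:
        return weight - lam
    if weight < -lam:
        return weight + lam
    return 0.0


# Made-up unpenalized weights for three hypothetical features.
weights = {"feature_A": 1.5, "feature_B": -0.4, "feature_C": 0.1}

# Larger log(lambda) means a stronger penalty and more weights at zero.
for log_lam in (-3, -1, 0):
    lam = math.exp(log_lam)
    shrunk = {name: round(soft_threshold(w, lam), 3) for name, w in weights.items()}
    print(f"log(lambda) = {log_lam}: {shrunk}")
```

Features with small unpenalized weights reach zero first as lambda grows, which is why the less discriminative features drop out of the coefficient paths earliest.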