UNIVERSITY OF CALIFORNIA, SAN DIEGO

Identification of a conserved periodic promoter structure in metazoans

A dissertation submitted in partial satisfaction of the requirements for the degree

Doctor of Philosophy

in

Bioinformatics

by

Christopher William Benner

Committee in charge:

Professor Christopher K. Glass, Chair
Professor Shankar Subramaniam, Co-Chair
Professor Alexander Hoffmann
Professor James T. Kadonaga
Professor Michael G. Rosenfeld

2009

Copyright

Christopher William Benner, 2009

All rights reserved.

SIGNATURE PAGE

The Dissertation of Christopher William Benner is approved, and it is acceptable in quality and form for publication on microfilm and electronically:

______

______

______

______Co-Chair

______Chair

University of California, San Diego

2009

DEDICATION

I would like to dedicate this work to my grandmother Evelyn Benner, who passed away while I was in graduate school. Her unconditional love had a lasting impact on my life that resonates throughout this work. She will always be remembered.

TABLE OF CONTENTS

SIGNATURE PAGE
DEDICATION
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS
VITA
ABSTRACT OF THE DISSERTATION

Chapter 1: Introduction
Chapter 2: Identification of regulatory elements driving inflammatory gene expression
2.1 Abstract
2.2 Introduction
2.3 Methods
2.3.1 Description of HOMER
2.3.2 Background Correction
2.3.3 Analysis of ChIP-Chip data and evaluation of Algorithm performance
2.3.4 Electrophoretic mobility shift assay (EMSA)
2.3.5 Microarray Analysis
2.4 Results
2.4.1 Benchmarking HOMER as a de novo motif discovery algorithm
2.4.2 Temporal analysis of cis-regulatory elements under the control of TLR4 signaling in macrophages
2.4.3 Identification of novel response element in genes induced selectively by Type I Interferon
2.4.4 Identification of motifs enriched in disease
2.5 Discussion

Chapter 3: Advanced analysis techniques for ChIP-Seq
3.1 Abstract
3.2 Introduction
3.3 Methods
3.3.1 Identification of bound regions from ChIP-Seq data
3.3.2 Generation of UCSC Genome Browser Tracks
3.3.3 Genomic Motif Discovery Using HOMER
3.3.4 Centering of ChIP-Seq Peaks on Regulatory Elements
3.3.5 Functional Enrichment of Peaks
3.4 Results
3.4.1 Basic Motif Analysis of ChIP-Seq Defined Regions
3.4.2 Differential Motif Discovery
3.4.3 Normalizing for Transcriptional Initiation
3.4.4 Spatial organization ChIP-Tags and Motifs
3.4.5 The Co-occurrence and Spacing of motifs underlying RNA Poly II Recruitment
3.4.6 Expanding The CTCF Consensus Motif
3.4.7 Cis-regulatory code of pluripotent transcription factors
3.5 Discussion

Chapter 4: Nucleosome Positioning Sequences in the
4.1 Abstract
4.2 Introduction
4.3 Methods
4.3.1 Analysis of nucleosome positions
4.3.2 MNase and DNase I sequence bias and normalization
4.3.3 Fourier Analysis
4.4 Results
4.4.1 DNase I Defined NPS
4.4.2 MNase Defined NPS
4.4.3 Nucleosome Positioning in C. elegans
4.4.4 Frequency Spectrum Analysis
4.5 Conclusions

Chapter 5: Identification of a Conserved Periodic Promoter Structure in Metazoans
5.1 Abstract
5.2 Introduction
5.3 Methods
5.3.1 Determination of the TSS
5.3.2 Analysis of Mononucleosome Positions
5.3.3 Description of HOMER for de novo promoter motif discovery
5.3.4 ChIP-Seq for NFY and NRF1
5.3.5 Analysis of ChIP-Seq Datasets
5.3.6 Gene Expression Data
5.4 Results
5.4.1 Identification of periodic promoters
5.4.2 Discovery of promoter proximal motifs (PSSEs)
5.4.3 Conserved rotational position of downstream nucleosome
5.4.5 Genes under the control of periodic and focused promoters
5.5 Discussion

Chapter 6: Analysis of transcriptional initiation across eukaryotes
6.1 Abstract
6.2 Introduction
6.3 Methods
6.3.1 Determination of the TSS
6.3.2 Mapping promoters between species
6.4 Results
6.4.1 Nucleotide Frequency Patterns in Eukaryotic promoters
6.4.2 Proximal Motifs in Eukaryotic promoters
6.4.3 Promoter Conservation vs. Motif conservation
6.4.4 Novel and Unexpected Motifs
6.4.5 Position dependent motifs
6.5 Discussion
Chapter 7: Conclusions
Appendix A: Algorithm Pseudocode
Appendix B: Evolutionary Analysis of Proximal Promoters
References

LIST OF FIGURES

Figure 2.1: Overview of the HOMER algorithm for de novo motif discovery
Figure 2.2: Normalization of background sequences
Figure 2.3: Summary of results analyzing ChIP-Chip data for factors with a known binding site
Figure 2.4: Motifs identified in TLR4 signaling
Figure 2.5: Motifs enriched in sub-populations of TLR signaling
Figure 2.6: Electrophoretic mobility shift assay (EMSA) of T1ISRE
Figure 2.7: Confidently enriched motifs in human patients and murine disease models
Figure 2.8: Function and conservation of the T1ISRE
Figure 2.9: Generality of motif half-site spacing to control binding specificity

Figure 3.1: Enriched motifs found using de novo motif finding in Foxa2 ChIP-Seq peaks
Figure 3.2: Use of differential motif discovery elements specifically enriched in n-Myc vs. c-Myc ChIP-Seq peaks
Figure 3.3: Use of TSS-based normalization to identify meaningful co-factors in ChIP-Seq data
Figure 3.4: TSS-based normalization of RNA polymerase II ChIP-Seq
Figure 3.5: ChIP-Seq tag distribution relative to motifs
Figure 3.6: Co-occurrence between RNA polymerase II enriched motifs
Figure 3.7: Co-occurrence between RNA polymerase II enriched motifs at promoters
Figure 3.8: Analysis of CTCF motif in CTCF bound regions
Figure 3.9: Motifs for Oct4, Sox2, Nanog and Tcf3 in embryonic stem cells
Figure 3.10: Difference between Sox2 and Tcf3 binding patterns
Figure 3.11: Dinucleotide frequencies and ChIP-Seq tag distributions relative to Nanog motifs in Nanog bound regions
Figure 3.12: Distribution of SOX and OCT motifs relative to Nanog in Nanog bound or non-bound regions
Figure 3.13: Nucleosome Positioning patterns near Sox2 and Oct4 peaks
Figure 3.14: Nucleosome Positioning patterns near Nanog Peaks
Figure 3.15: Structural Model for Nanog binding

Figure 4.1: Sequence bias introduced by restriction enzymes
Figure 4.2: DNase I defined NPS
Figure 4.3: Variation of DNase I NPS with GC-content
Figure 4.4: MNase defined NPS
Figure 4.5: Mononucleosome NPS defined using MNase digested H3K4me3 positive nucleosomes
Figure 4.6: MNase defined NPS in C. elegans
Figure 4.7: Fourier analysis of H3K4me3 NPS
Figure 4.8: Model for GC-rich and AT-rich NPS

Figure 5.1: Schematic depicting promoter selection strategy
Figure 5.2: Global sequence features of focused and periodic promoters
Figure 5.3: Identification of false nucleosome positions
Figure 5.4: Position dependent motifs in human and Drosophila promoters
Figure 5.5: Identification of PSSEs
Figure 5.6: Distribution of PSSEs to the TSS
Figure 5.7: NPS relative to the TSS are dependent on GC content
Figure 5.8: Functional and evolutionary utilization of periodic promoters

Figure 6.1: Summary of promoter data in all organisms studied
Figure 6.2: Nucleosome positioning patterns revealed by the alignment of promoter DNA
Figure 6.3: Evolutionary conservation of promoters
Figure 6.4: Conservation of NFY spacing
Figure 6.5: General distribution of motifs from the TSS
Figure 6.6: Core promoter elements in eukaryotes

LIST OF TABLES

Table 3.1: Summary of sequencing data used in this study
Table 4.1: Summary of high throughput sequencing used in this study
Table 4.2: Spectral analysis of low GC content and high GC content nucleosomes
Table 5.1: Summary of Sequencing Data, TSS, and ChIP-Seq peaks used in this study
Table 6.1: Summary of 5’ RNA Sequencing used in the eukaryotic TSS database
Table 6.2: Species, genome versions, and download locations for each species used in the eukaryotic TSS database

ACKNOWLEDGEMENTS

I would like to thank my advisor, Chris Glass, for all of the attention and energy he has invested in making this document possible. He has been exemplary in making himself available for discussing my research, in addition to having enough confidence in me to chase down crazy, high-risk ideas. In similar fashion I would like to thank each of my committee members, Shankar Subramaniam, Jim Kadonaga, Geoff Rosenfeld, and Alex Hoffmann, all of whom have been very supportive while at the same time challenging me scientifically to make me the best scientist possible.

I would also like to thank every member of the Glass, Rosenfeld, and Subramaniam laboratories for their efforts toward making my project a success. Of this group I would specifically like to thank Sven Heinz for his guidance and friendship. The investment Sven made in many of my ideas helped drive this project forward. I would also like to thank Lynn Bautista and Jan Lennington for their administrative help.

Most of all, I would like to thank my friends and family for helping me through this long and arduous process. I owe them for bolstering my confidence and keeping my energy and passion for science high. I also owe them for addiction to caffeine and lower standards concerning quality of beer. Each one of them is irreplaceable. I would like to give a special thanks to my mother and father for their love and support. Finally, I would like to thank my girlfriend Michele for being there for me throughout the most difficult parts of assembling my work, including the careful proofreading of this entire document.

Chapter 2, in part, will be submitted for publication. Benner C., Heinz S., Subramaniam S., Glass C.K. A novel approach to motif discovery identifies a type I interferon specific response element. The dissertation author was the primary investigator and author of this paper.

Chapter 3, in part, will be submitted for publication. Benner C., Heinz S., Subramaniam S., Glass C.K. Advanced analysis of ChIP-Seq data reveals the stem cells specific factor Nanog as a pioneering transcription factor. The dissertation author was the primary investigator and author of this paper.

Chapters 4, 5, and 6, in part, will be submitted for publication. Benner C., Garcia-Bassets I., Heinz S., Kadonaga J.T., Rosenfeld M.G., Subramaniam S., Glass C.K. Identification of a conserved periodic promoter structure in metazoans. The dissertation author was the primary investigator and author of this paper.

VITA

1998 Mission San Jose High School, Fremont, CA

2002 B.S. Bioengineering, University of California San Diego, San Diego, CA.

2009 Ph.D. Bioinformatics, University of California San Diego, San Diego, CA.

Publications:

Barrera L, Benner C, Tao YC, Winzeler E, Zhou Y. Leveraging two-way probe-level block design for identifying differential gene expression with high-density oligonucleotide arrays. BMC Bioinformatics. 2004 Apr 20;5:42.

Ogawa S, Lozach J, Benner C, Pascual G, Tangirala RK, Westin S, Hoffmann A, Subramaniam S, David M, Rosenfeld MG, Glass CK. Molecular determinants of crosstalk between nuclear receptors and toll-like receptors. Cell. 2005 Sep 9;122(5):707-21.

Raetz CR, Garrett TA, Reynolds CM, Shaw WA, Moore JD, Smith DC Jr, Ribeiro AA, Murphy RC, Ulevitch RJ, Fearns C, Reichart D, Glass CK, Benner C, Subramaniam S, Harkewicz R, Bowers-Gentry RC, Buczynski MW, Cooper JA, Deems RA, Dennis EA. Kdo2-Lipid A of Escherichia coli, a defined endotoxin that activates macrophages via TLR-4. J Lipid Res. 2006 May;47(5):1097-111.

Kwon YS, Garcia-Bassets I, Hutt KR, Cheng CS, Jin M, Liu D, Benner C, Wang D, Ye Z, Bibikova M, Fan JB, Duan L, Glass CK, Rosenfeld MG, Fu XD. Sensitive ChIP-DSL technology reveals an extensive estrogen receptor alpha-binding program on human gene promoters. Proc Natl Acad Sci U S A. 2007 Mar 20;104(12):4852-7.

Hevener AL, Olefsky JM, Reichart D, Nguyen MT, Bandyopadyhay G, Leung HY, Watt MJ, Benner C, Febbraio MA, Nguyen AK, Folian B, Subramaniam S, Gonzalez FJ, Glass CK, Ricote M. Macrophage PPAR gamma is required for normal skeletal muscle and hepatic insulin sensitivity and full antidiabetic effects of thiazolidinediones. J Clin Invest. 2007 Jun;117(6):1658-69.

Maurya MR, Benner C, Pradervand S, Glass C, Subramaniam S. Adv Exp Med Biol. 2007;598:62-79.

Young JA, Johnson JR, Benner C, Yan SF, Chen K, Le Roch KG, Zhou Y, Winzeler EA. In silico discovery of transcription regulatory elements in Plasmodium falciparum. BMC Genomics. 2008 Feb 7;9:70.

Benner C, Garcia-Bassets I, Heinz S, Kadonaga JT, Rosenfeld MG, Subramaniam S, Glass CK. Identification of a conserved periodic promoter structure in metazoans. (In preparation)

Benner C, Heinz S, Subramaniam S, Glass CK. A novel approach to motif discovery identifies a type I interferon specific response element. (In preparation)

Benner C, Heinz S, Subramaniam S, Glass CK. Advanced analysis of ChIP-Seq data reveals the stem cells specific factor Nanog as a pioneering transcription factor. (In preparation)

ABSTRACT OF THE DISSERTATION

Identification of a conserved periodic promoter structure in metazoans

by

Christopher William Benner

Doctor of Philosophy in Bioinformatics

University of California, San Diego, 2009

Professor Christopher K. Glass, Chair
Professor Shankar Subramaniam, Co-Chair

The identification of sequence elements responsible for transcriptional activity remains a difficult challenge in post-genomic biology. Advances in microarray and next-generation sequencing technology have increased the accuracy and resolution for determining co-regulated genes, the genomic localization of transcription factors, and the locations of transcriptional initiation.

We have developed a computational framework, named HOMER, for the analysis of gene expression, ChIP-Seq, and RNA-Seq data that utilizes differential motif discovery to accurately identify transcription factor binding sites. We used HOMER to find the promoters of genes induced by different inflammatory stimuli, leading to the discovery of a novel cis-regulatory element, the type I interferon response element (T1ISRE), which resides in the promoters of genes specifically induced by type I interferons. We then applied these methods to embryonic stem cell factors and identified a nucleosome positioning pattern surrounding Nanog motifs that would prevent Oct4 and Sox2 binding if a nucleosome were present, suggesting a novel pioneering role for Nanog in dictating the accessibility of pluripotent enhancers.

Analysis of high-throughput 5’ RNA-Seq data indicated that only 20% of human and mouse and 40% of Drosophila promoters use a focused pattern of initiation and are enriched for position-specific core elements, such as the TATA box. Unexpectedly, nearly half of human and mouse promoters and 30% of Drosophila promoters appeared to contain nucleosome positioning sequences and rotationally constrained nucleosomes in phase with a selective group of upstream transcription factor binding sites. These features are associated with multiple sites of transcriptional initiation with a periodicity of approximately 10 bp and define a previously unrecognized class of ‘periodic promoters’. Comparison of nucleosome positioning in human and Drosophila periodic promoters revealed a GC-dependent shift in sequence patterns that is predicted to facilitate a corresponding placement of the first downstream nucleosome relative to the sites of transcriptional initiation in both species. In contrast to focused promoters, which preferentially direct developmental and high-magnitude, signal-dependent programs of gene expression, periodic promoters are configured to preferentially direct constitutive expression of genes that support general cellular functions. Features of periodic promoters are evident in a wide range of metazoan species, suggesting an ancient origin for this initiation strategy.

Chapter 1: Introduction

The DNA contained within the cell of every living organism specifies a complex set of instructions for the creation and maintenance of life. Transcription factors and other chromatin-associated proteins comprise the cellular machinery responsible for interpreting regulatory DNA and controlling the expression of genes throughout the genome. Coordinated changes in gene expression allow cells to divide, to respond to external stimuli, or to change from one cell type to another. Underlying these changes are networks of transcription factors that precisely target specific regions of DNA out of roughly 3 billion nucleotides. In most cases, an individual transcription factor is capable of recognizing only 6-12 bp of DNA, which requires it to work in conjunction with other transcription factors, as well as the local chromatin environment, to bind to highly specific regions of the genome.

The exact sequences that recruit specific transcription factors and the manner by which they regulate the general transcriptional machinery remain poorly understood. Completion of the human and mouse genomes1-3 along with extensive sequencing of 5’ RNA libraries4 has revealed accurate positions of transcriptional initiation, thereby providing a starting point from which to investigate the sequence elements that drive transcription in mammals. The advent of ChIP-Seq has greatly increased our ability to localize transcription factors to specific regions of DNA. ChIP-Seq couples chromatin immunoprecipitation (ChIP), which can isolate DNA bound to a specific transcription factor, with massively parallel sequencing to detect the isolated DNA fragments. Additional applications of high-throughput sequencing include resolution mapping of nucleosome positions (MNase-Seq), analysis of accessible DNA (DNase-Seq), and the discovery of different isoforms of mRNA.

In theory, the identification of regulatory motifs (the sequences that recruit transcription factors) should be simple and straightforward. Extensive biochemical data demonstrate that most mammalian transcription factors bind relatively short sequences of DNA (6-12 bp) with sequence specificity that is often fairly degenerate5. Unfortunately, the process of identifying motifs is compounded by the structural and functional content encoded in genomic sequence, particularly in promoter regions. Promoter sequences may contain several elements that modulate transcription in response to a variety of environmental and hormonal signals or elements that confer tissue specific expression in developmental programs. Core promoter regions also have sequence elements responsible for recruiting RNA polymerase II and other components of the general transcriptional machinery responsible for the production of mRNA6 (Chapter 5). Furthermore, over half of the promoters found in mammals occur in CpG islands, defined as regions of genomic DNA marked with unusually high “CG” dinucleotide content and a high incidence of C and G nucleotides in general7.

The initial aim of this work was to develop the methodology to analyze high-throughput gene expression and localization experiments in order to infer the function and identity of regulatory elements encoded in the genome. Much to our surprise, this work led us to the discovery of a novel promoter architecture that drives the expression of nearly half of all mammalian transcripts.

In the chapters that follow we will detail the discovery of known and novel regulatory elements, and discuss unprecedented relationships between motifs, transcription start sites, sequencing tags, and nucleosomes. We document widespread evidence for the periodic distribution of genomic features at a distance of approximately 10 bp, or one helical turn of DNA, which suggests the structural arrangement of transcriptional machinery on a consistent surface of the DNA.

Chapter 2: Identification of regulatory elements driving inflammatory gene expression

2.1 Abstract

The identification of sequence elements responsible for changes in gene expression remains a difficult challenge in post-genomic biology. Here, we report the development and application of a novel de novo motif discovery framework named HOMER (Hypergeometric Optimization of Motif EnRichment) that combines differential sequence analysis and high-quality transcription start sites (TSS) to accurately identify cis-regulatory motifs in co-regulated gene promoters. We demonstrated the utility of HOMER by using it to reconstruct the temporal cis-regulatory code activated by TLR4 signaling in macrophages. Analysis of different inflammatory stimuli led us to the discovery of a novel cis-regulatory element, the Type I interferon response element (T1ISRE), which resides in the promoters of genes specifically induced by Type I interferons. Finally, we analyzed expression data from a variety of inflammatory diseases, finding widespread evidence for activation of inflammatory elements. In conclusion, we find HOMER to be an effective and practical algorithm for the discovery of transcription factor binding sites.

2.2 Introduction

Gene-specific regulation of transcription is largely dependent on the binding of sequence-specific transcription factors to the promoter regions of genes. The exact sequences that recruit specific transcription factors and the manner in which they regulate the general transcriptional machinery remain poorly understood. Completion of the human and mouse genomes1-3 along with extensive sequencing of 5’ RNA libraries4 has revealed accurate positions of transcriptional initiation, providing a starting point from which to investigate the cis-regulatory code in mammals (Chapter 5). Advances in microarray technology allow the accurate quantification of mRNA levels at a genome-wide scale, which in turn allow us to identify transcripts that are reliably co-regulated. Computational analysis of the information contained within the promoter sequences of co-regulated transcripts could provide evidence for a specific DNA code that is mechanistically responsible for the recruitment of transcriptional regulators.

Extensive biochemical data demonstrate that most mammalian transcription factors bind relatively short sequences of DNA (6-12 bp), with sequence specificity that is often fairly degenerate5. The goal of motif finding is to successfully identify the short sequences within longer promoter regions that are responsible for the recruitment of the transcriptional regulators to the DNA. This may not always be possible given the large number of cellular mechanisms regulating transcription, including distal regulatory regions, mRNA stability, and RNA interference pathways, all of which contribute to mRNA transcript levels. To complicate matters, transcription factors have been shown to associate with DNA directly through protein-DNA interactions or indirectly by tethering to another primary DNA-binding transcription factor. Because of these mechanisms, together with biological variation and technical artifacts introduced by microarray technology, groups of co-regulated genes are typically noisy, heterogeneous sets that may rely on multiple mechanisms for their expression signature. This calls for robust motif discovery algorithms that are sensitive enough to find subtle motifs that may be present in only a fraction of the co-enriched genes.

The discovery of regulatory motifs is further compounded by the structural and functional content encoded in genomic sequence, particularly in promoter regions. Promoter sequences may contain several elements important for modulating transcription in response to a variety of environmental and hormonal signals or elements that confer tissue specific expression in developmental programs. Core promoter regions also have sequence elements responsible for recruiting RNA polymerase II and other components of the general transcriptional machinery responsible for the production of mRNA6 (Chapter 5). Furthermore, over half of the promoters found in mammals occur in CpG islands, defined as regions of genomic DNA marked with unusually high “CG” dinucleotide content and a high incidence of C and G nucleotides in general7. Any attempt to find accurate transcription factor binding sites must consider the effect these sequence features might have on the results.
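The two sequence features used above to characterize CpG islands, overall C and G content and "CG" dinucleotide content, can be illustrated with a short sketch. This is not part of HOMER; the function name `gc_and_cpg` is hypothetical, and the observed/expected CpG ratio used here is a commonly used convention that is assumed for illustration rather than taken from this chapter.

```python
def gc_and_cpg(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence.

    The expected CpG count assumes C and G occur independently:
    expected = (#C * #G) / length. This convention is an assumption
    made for illustration only.
    """
    seq = seq.upper()
    length = len(seq)
    if length == 0:
        return 0.0, 0.0
    c_count = seq.count("C")
    g_count = seq.count("G")
    # "CG" cannot overlap itself, so str.count gives the exact dinucleotide count.
    cpg_count = seq.count("CG")
    gc_fraction = (c_count + g_count) / length
    expected_cpg = c_count * g_count / length
    ratio = cpg_count / expected_cpg if expected_cpg > 0 else 0.0
    return gc_fraction, ratio


# Hypothetical usage: a CpG-rich toy sequence versus an AT-only one.
rich = gc_and_cpg("CGCGCG")
poor = gc_and_cpg("ATATAT")
```

Promoters that differ strongly in these two values have different background sequence composition, which is why a differential approach needs target and background sets with comparable composition.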

Existing motif discovery algorithms rely on a variety of scoring schemes and search strategies to identify motifs8,9. The goal is to find a short sequence motif that occurs in co-regulated promoters much more frequently than expected by chance. Traditionally, algorithms have used nucleotide frequencies or higher-order hidden Markov models (HMMs) to quickly estimate the expected number of occurrences of a motif in order to measure enrichment. Examples of widely used methods that utilize this approach include MEME10, AlignACE11, MDscan12, and Weeder13. More recent algorithms favor an empirical approach to calculating the expected motif count (14,15, DME16, Trawler17, Amadeus18) that simply counts motif occurrences in a set of background sequences. Another class of methods makes use of the phylogenetic relationships between closely related species to identify elements that are conserved19. In addition to different background estimation strategies, each method has developed its own scheme for scoring the quality of putative motifs, which is highly influenced by the strategy used to search for enriched motifs. For example, methods such as MEME10 and DME16 use differentiable scoring functions for efficient gradient-based local optimization, while others such as MDscan12, AlignACE11, and DME16 incorporate measures of entropy to avoid excessively degenerate motifs.

In this study, we developed a novel algorithm for the de novo discovery of regulatory motifs named HOMER (Hypergeometric Optimization of Motif EnRichment). HOMER works by scoring motifs based on differential enrichment between two sets of promoters, allowing it to cancel out common sequences and identify those most likely to be functionally relevant in the co-enriched set. Our strategy builds on the concept of scoring motif enrichment using the hypergeometric distribution14, much in the same way that the enrichment of biological processes is assessed using the Gene Ontology20. We also introduced a strategy for removing sequence bias introduced by CpG islands and proximal promoter regions that can be problematic when utilizing a differential motif finding approach. Using these strategies and the hypergeometric distribution as a fitness function, HOMER attempts to solve the global optimization problem of identifying motifs represented by nucleotide probability matrices with optimal enrichment in a set of target promoters. We showed that HOMER outperforms most other methods when recovering known motifs from co-regulated genes.

In order to demonstrate the efficacy of our algorithm we analyzed a wide range of microarray datasets to identify the cis-regulatory elements directing inflammatory gene expression. Macrophages are a key cell type in the immune system, serving as one of the first lines of defense and functioning as key mediators of the transition from innate to adaptive immunity21. Lipopolysaccharide (LPS), a common component in the cell wall of gram-negative bacteria, binds to Toll-like receptor 4 (Tlr4) on the surface of macrophages and initiates an intricate signal transduction cascade resulting in the activation of several transcription factors including IRFs and NFkB22. Downstream targets of TLR4 signaling include a host of secondary signaling peptides, such as tumor necrosis factor (TNF) and Type I Interferon, which initiate paracrine and autocrine responses within the cell’s local environment that help orchestrate the innate and adaptive immune response. The antimicrobial gene products of the innate immune system, while important for the clearance of foreign microbes, are inevitably toxic to the host system itself. It is believed that dysregulation or inadvertent activation of this system is partly responsible for many human diseases, where the failure to appropriately repress transcription of toxic gene products contributes to the pathology of disease23,24. In addition to providing a general system for the study of signal transduction and transcription, new discoveries in Tlr4 signaling bring us closer to understanding our immune system and may elucidate mechanisms in disease.

In this study we performed an unbiased analysis of inflammatory microarray data to identify regulatory elements that are critical for the induction of genes by inflammatory mediators. This analysis led to the discovery of a novel response element targeted by the ISGF3 transcriptional activation complex under the control of the type I interferon signaling pathway. Finally, we demonstrated the significance of these findings by showing strong enrichment for inflammatory response elements in the genes dysregulated by several common human diseases.

2.3 Methods

2.3.1 Description of HOMER

HOMER (Hypergeometric Optimization of Motif EnRichment) is an algorithm designed for the accurate identification of sequence motifs in large sequence sets. HOMER uses a differential motif finding approach that scores the significance of a motif by calculating its enrichment in a group of target sequences relative to a set of background sequences using the hypergeometric distribution. This approach allows us to effectively cancel out sequences common between the datasets and focus on those that are unique to the target data. In addition to the target and background sequences, HOMER requires only the length of the motif as an input, returning a list of motif probability matrices that discriminate between target and background sequences with the highest enrichment values.

The algorithm is composed of an exhaustive search step followed by the local optimization of promising motifs (Overview in Figure 2.1). Target and background sequences are initially parsed to produce an oligo table of length w, where w is the length of the motif. The number of occurrences in both target and background sequences is recorded for each oligo, providing a fast lookup for oligo enrichment later in the algorithm. A motif in this context is a collection of oligos that are considered bound by the same transcription factor, so it is natural and more efficient to analyze the data in terms of oligos rather than the original promoter sequences. The original sequences are discarded at this point in the motif finding algorithm and used only at the end for recovering actual instances of the discovered motifs.
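The oligo table described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not HOMER's actual implementation: the function name `build_oligo_table` is hypothetical, only the given strand is counted, and oligos containing characters other than A, C, G, and T are skipped; none of these choices is specified in the text.

```python
from collections import Counter


def build_oligo_table(target_seqs, background_seqs, w):
    """Map each oligo of length w to (target count, background count).

    Assumptions made for illustration: only the forward strand is scanned,
    and windows containing ambiguous bases (e.g. N) are ignored.
    """
    def count_oligos(seqs):
        counts = Counter()
        for seq in seqs:
            seq = seq.upper()
            # Slide a window of width w across the sequence.
            for start in range(len(seq) - w + 1):
                oligo = seq[start:start + w]
                if set(oligo) <= set("ACGT"):
                    counts[oligo] += 1
        return counts

    target_counts = count_oligos(target_seqs)
    background_counts = count_oligos(background_seqs)
    # A Counter returns 0 for oligos absent from one of the two sets.
    return {
        oligo: (target_counts[oligo], background_counts[oligo])
        for oligo in target_counts.keys() | background_counts.keys()
    }


# Hypothetical usage with toy sequences.
table = build_oligo_table(["ACGTAC"], ["ACGA"], w=3)
```

Once the table is built, enrichment of any collection of oligos can be evaluated from the two counts alone, which is why the original sequences are not needed again until motif instances are reported.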

Motif enrichment is calculated using a modified version of the cumulative hypergeometric distribution (or Fisher's exact test, referred to here as the hypergeometric). Given a total set of N promoters, the hypergeometric is used to calculate the probability that two independent groups of promoters of size n1 and n2 share n or more promoters by chance. In this setting, the two independent groups are represented by the group of target promoters and the group of promoters containing a putative motif. This method of scoring implies that the presence of a motif in a promoter is sufficient for the factor to bind, and does not try to integrate the collective probability of binding at each position over the entire promoter14,15. While an integrative binding model is an attractive assumption when designing a scoring metric that is differentiable for gradient ascent algorithms, it is poorly supported by experimental data, where deletion of the binding site usually eliminates binding of the protein to the DNA. The underlying assumptions of the hypergeometric distribution make it ideal for determining the optimal tradeoff between false positives and false negatives when counting motifs in groups of promoters given their simple classification as regulated or not. This formulation ties the calculation of enrichment to the number of co-enriched DNA fragments and implicitly scores motifs more favorably if the number of promoters with the motif is comparable to the number of promoters bound by the factor.
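The cumulative hypergeometric described above can be sketched as follows. This is the standard (unmodified) upper-tail calculation, written with exact integer arithmetic from the Python standard library; the function name `hypergeom_tail` is an assumption made for illustration and the sketch does not reproduce HOMER's modifications.

```python
from math import comb

def hypergeom_tail(N, n1, n2, n):
    """Probability that two random groups of size n1 and n2, drawn from N
    promoters, share n or more promoters."""
    lo = max(n, n1 + n2 - N, 0)   # smallest overlap that is possible and >= n
    hi = min(n1, n2)              # largest possible overlap
    hits = sum(comb(n1, k) * comb(N - n1, n2 - k) for k in range(lo, hi + 1))
    return hits / comb(N, n2)

# Toy example: 10 promoters, 5 targets, 5 promoters with the motif, all 5 shared.
p = hypergeom_tail(10, 5, 5, 5)
```

Exact integer binomials keep the sketch simple; a production implementation would more likely work with log-probabilities to handle very large promoter sets efficiently.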

Since HOMER uses an oligo table to record the number of times each oligo appears in the target or background groups, exact mapping between individual promoters and the oligos of a putative motif is not possible. Instead, HOMER can only calculate the total number of oligos corresponding to a given motif found in either the target or background sets. However, one can calculate the expected number of promoters likely to contain the motif by assuming each oligo occurrence is independently distributed in the set of promoters, using the recursive relationship $np_i = np_{i-1} + (n_{set} - np_{i-1})/n_{set}$ with $np_1 = 1$, where $np_i$ is the expected number of promoters given the presence of $i$ oligos in the set of $n_{set}$ total promoters. Since the value produced by the recursive relationship is not normally an integer, it is rounded to the nearest integer for the calculation of the hypergeometric.

While this scoring scheme differs from traditional applications of the hypergeometric, this approach can dramatically speed up implementation and drastically reduce memory requirements by removing associations between oligos and individual sequences.
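The recursive relationship for the expected number of promoters can be written out directly. The sketch below is a minimal illustration with a hypothetical function name; it also shows that the recursion agrees with the closed form nset * (1 - (1 - 1/nset)^i), which follows from unrolling the recurrence.

```python
def expected_promoters(i, nset):
    """Expected number of distinct promoters containing a motif, given i oligo
    occurrences placed independently among nset promoters."""
    if i <= 0:
        return 0.0
    np_i = 1.0  # np_1 = 1: the first occurrence always hits a new promoter
    for _ in range(2, i + 1):
        # Each further occurrence lands in a new promoter with
        # probability (nset - np_i) / nset.
        np_i = np_i + (nset - np_i) / nset
    return np_i

# 50 occurrences among 100 promoters are expected to cover about 39 promoters.
covered = round(expected_promoters(50, 100))
```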

HOMER searches for the most enriched motifs by first performing an exhaustive search of putative motifs centered on each entry in the oligo table. For each oligo, HOMER constructs a putative motif composed of the oligos with at most two mismatches from the given oligo. Each putative motif is scored by calculating the total number of oligos represented by the motif in both the target and background sets and then applying the modified hypergeometric described above. A sequence tree is used for fast lookup of oligos within a given number of mismatches from the consensus oligo. While allowing only mismatches is not a very descriptive motif model, optimal motifs at the end of the algorithm are typically represented by one or more highly significant putative mismatch motifs. More complicated schemes, such as searching the space of IUPAC degenerate symbol motifs, may be performed as well, but our experience has shown that the simple mismatch model offers a good tradeoff between running time and sensitivity for regulatory motifs.
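The mismatch model used in the exhaustive search can be illustrated with a brute-force scan. Note that this sketch deliberately swaps the sequence tree described above for a simple linear pass over the oligo table, which is slower but easier to follow; the function names and the table layout (oligo mapped to a pair of target and background counts) are assumptions made for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length oligos differ."""
    return sum(x != y for x, y in zip(a, b))

def mismatch_motif_counts(seed, oligo_table, max_mismatches=2):
    """Total target and background occurrences of all oligos within
    max_mismatches of the seed oligo."""
    t_total = b_total = 0
    for oligo, (t_count, b_count) in oligo_table.items():
        if hamming(seed, oligo) <= max_mismatches:
            t_total += t_count
            b_total += b_count
    return t_total, b_total

# Toy oligo table: oligo -> (target count, background count).
toy_table = {"GAAA": (5, 1), "GAAT": (2, 2), "CCCC": (0, 7)}
counts = mismatch_motif_counts("GAAA", toy_table, max_mismatches=1)
```

The two totals returned for each seed are the inputs to the enrichment calculation, so every entry in the oligo table can be scored as a candidate seed in turn.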

The most promising putative motifs identified by the exhaustive search are then refined using an iterative local optimization algorithm. HOMER uses promising putative motifs (seeds) to construct probability matrices and detection thresholds capable of specifying groups of oligos that are not only similar in sequence but also collectively highly enriched in the target promoter set and correspondingly likely to be bound by the same transcription factor. These probability matrices and thresholds will be the final output of the algorithm. Prior to local optimization each promising seed from the exhaustive part of the algorithm is converted into a probability matrix that is representative of the consensus oligo with small arbitrary probabilities assigned to the non-consensus nucleotides.
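Converting a seed into an initial probability matrix might look like the following sketch. The value 0.05 for the non-consensus probability is purely a placeholder, since the text specifies only that these probabilities are small and arbitrary, and the function name is hypothetical.

```python
def seed_to_matrix(consensus, minor=0.05):
    """Build a probability matrix (one dict per position) in which the consensus
    nucleotide receives most of the probability mass.

    `minor` is an assumed placeholder for the small arbitrary probability given
    to each non-consensus nucleotide."""
    major = 1.0 - 3 * minor
    return [{base: (major if base == c else minor) for base in "ACGT"}
            for c in consensus]

matrix = seed_to_matrix("GAAA")
```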

The local optimization phase of the algorithm can be divided into two parts. The first part finds the optimal detection threshold, which corresponds to the overall degeneracy of the motif and is analogous to the number of mismatches allowed. Given a probability matrix, each oligo in the oligo table is scored relative to the probability matrix by multiplying the probabilities of the nucleotides observed at each position of the oligo. The oligos are then sorted based on this score such that the oligos most similar to the probability matrix are listed first. The optimal detection threshold can then be found by adding each oligo in the list to the motif in turn, calculating the enrichment after each addition. With the addition of each oligo the effective threshold is lowered, and the threshold that results in the most significant enrichment is assigned as the optimal threshold.
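The threshold search can be sketched end to end under a few simplifying assumptions: the oligo table maps each oligo to a pair of target and background counts, oligo counts are converted to expected promoter counts with the closed form of the recursion given earlier, and the unmodified cumulative hypergeometric is used for enrichment. All function names are hypothetical.

```python
from math import comb, prod

def matrix_score(oligo, matrix):
    """Product of the per-position probabilities of the oligo's nucleotides."""
    return prod(col[base] for base, col in zip(oligo, matrix))

def expected_promoters(i, nset):
    """Closed form of np_i = np_{i-1} + (nset - np_{i-1}) / nset, np_1 = 1."""
    return nset * (1.0 - (1.0 - 1.0 / nset) ** i)

def tail_pvalue(N, n1, n2, n):
    """P(overlap >= n) for random groups of size n1 and n2 drawn from N items."""
    lo, hi = max(n, n1 + n2 - N, 0), min(n1, n2)
    return sum(comb(n1, k) * comb(N - n1, n2 - k)
               for k in range(lo, hi + 1)) / comb(N, n2)

def best_threshold(matrix, oligo_table, n_target, n_background):
    """Add oligos in order of decreasing matrix score and return the
    (threshold, p_value) pair giving the most significant enrichment."""
    ranked = sorted(oligo_table, key=lambda o: matrix_score(o, matrix), reverse=True)
    best_p, best_cut = 1.0, None
    t = b = 0
    for oligo in ranked:
        t += oligo_table[oligo][0]
        b += oligo_table[oligo][1]
        hits_t = round(expected_promoters(t, n_target))
        hits_b = round(expected_promoters(b, n_background))
        p = tail_pvalue(n_target + n_background, n_target, hits_t + hits_b, hits_t)
        if p < best_p:
            best_p, best_cut = p, matrix_score(oligo, matrix)
    return best_cut, best_p
```

The returned threshold is the score of the last oligo admitted to the motif, so any oligo scoring at or above it is treated as an instance of the motif.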

The second part of the local optimization phase refines the probability matrix to more accurately define the oligos described by the motif. This part is performed by assessing the relative enrichment contributed by each oligo when included in the motif. For each oligo with a similarity score greater than the optimal threshold calculated in part one, the enrichment of the motif (not including the oligo) is calculated and the log differential from the full motif is recorded. A new probability matrix is then formed by adding each oligo weighted by its relative contribution to the enrichment. Oligos whose removal increases the significance of the motif are excluded from this calculation. The resulting matrix is normalized and the local optimization process is repeated until the enrichment of the motif fails to improve. Since the probability matrix refinement step can become stuck in an unwanted local optimum, several new matrices corresponding to several thresholds below the optimal threshold are also used to effectively search for highly enriched oligos that miss the optimal threshold, giving them an opportunity to contribute to a new probability matrix. Each of the probability matrices is scored for its optimal enrichment, and only the most enriched matrix is used for subsequent refinement. Once the local optimization algorithm fails to improve the p-value of the motif, the motif is reported and the next seed is optimized. To avoid the tendency for seeds to converge on the same motif, oligos corresponding to the optimized motif are deleted from the oligo table, which is analogous to masking them from the primary sequence and repeating the procedure. Finally, motifs are aligned to known motifs by scoring possible matrix alignments11. Motifs are visualized using WebLogo25 (http://weblogo.berkeley.edu/). In an effort to maintain user interest, a database of Chuck Norris facts26 accompanies the software and is used to generate random nuggets of roundhouse-kick-induced wisdom that accompany the final motif enrichment report.
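The matrix refinement step can be illustrated with a small sketch that rebuilds a probability matrix from weighted oligos. It assumes each oligo's weight is the log differential in enrichment attributed to it, with a positive weight meaning the motif is more significant when the oligo is included; the function name and the `pseudo` parameter are hypothetical additions.

```python
def weighted_matrix(oligo_weights, w, pseudo=0.001):
    """Form a normalized probability matrix from oligos weighted by their
    contribution to motif enrichment.

    oligo_weights: dict mapping oligo -> weight (assumed log differential).
    pseudo: assumed small pseudocount that keeps every probability non-zero."""
    cols = [{base: pseudo for base in "ACGT"} for _ in range(w)]
    for oligo, weight in oligo_weights.items():
        if weight <= 0:
            # Oligos that do not improve the motif are left out.
            continue
        for pos, base in enumerate(oligo):
            cols[pos][base] += weight
    for col in cols:
        total = sum(col.values())
        for base in col:
            col[base] /= total
    return cols

# "CC" has a negative contribution and is ignored.
refined = weighted_matrix({"GA": 3.0, "GT": 1.0, "CC": -2.0}, w=2, pseudo=0.0)
```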

It is important to note that the p-value generated by the hypergeometric is only a relative enrichment score and does not reflect the likelihood that the motif is a true binding element for a transcription factor. Due to the many degrees of freedom inherent in a motif probability matrix, HOMER can find seemingly enriched motifs in random data, with hypergeometric p-values below 1e-10. In order to assess the reliability of HOMER results, the data labels must be scrambled and the algorithm repeated to provide a null distribution of motif enrichment in random data.
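A label-scrambling procedure of this kind might be organized as in the sketch below. Here `score_fn` is a hypothetical stand-in for a complete motif discovery run that returns the best enrichment p-value for a given split into target and background sequences; neither function is part of HOMER.

```python
import random

def null_scores(sequences, labels, score_fn, n_perm=100, seed=0):
    """Re-run score_fn on randomly relabeled data to build a null distribution.

    labels: truthy for target sequences, falsy for background sequences.
    score_fn(target, background): assumed to return the best motif p-value."""
    rng = random.Random(seed)
    labels = list(labels)
    scores = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # same number of targets, random membership
        target = [s for s, is_target in zip(sequences, shuffled) if is_target]
        background = [s for s, is_target in zip(sequences, shuffled) if not is_target]
        scores.append(score_fn(target, background))
    return scores

def empirical_pvalue(observed, null):
    """Fraction of scrambled runs scoring at least as well (as low) as observed."""
    return (1 + sum(s <= observed for s in null)) / (1 + len(null))
```

Comparing the observed enrichment against this null distribution gives an empirical estimate of how often an equally strong motif appears in data with no real signal.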

2.3.2 Background Correction

Due to the sensitivity of differential motif finding algorithms, it is not

uncommon for HOMER to return incorrect motifs if there exists a bias in

nucleotide content between target and background sequences. For example, if

target sequences are found frequently in CpG islands and the background

sequences are derived from the remaining promoters, HOMER may find motifs

with high CG content that are more reflective of target bias for CpG islands than

specificity for an actual binding site. This can have disastrous consequences

since many known transcription factor binding sites are CG-rich, resulting in the

frequent false positive identification of motifs bound by transcription factors such

as E2F (GCGCGAAA)27 and NRF1 (GCGCATGCGC)28.

Assuming the recruitment of a transcription factor to a site is independent of the general nucleotide content of CpG islands, accurate identification of motifs with HOMER can be achieved by normalizing the CG dinucleotide content between target and background sequences. The general idea is to adjust the relative number of background sequences to match the nucleotide content or functional annotation of target sequences to facilitate the comparison of “similar” sequences. First, both target and background sequences are grouped into sets based on their CG dinucleotide frequency, with sets corresponding to 0-1%, 1-2%, 2-3%, etc. Next, the relative number of background sequences in each CG frequency set must be adjusted such that it matches the distribution of target sequences across each set. HOMER accomplishes this by “weighting” each background sequence in a CG frequency set, although similar results are possible by simply randomly removing the appropriate number of background sequences in sets that are overrepresented. Weighting sequences has the advantage that it preserves information contained within each sequence. Weighted sequences are processed in HOMER by counting each occurrence of an oligo in a weighted sequence as the fraction determined by the weight, with the total number of background sequences being equal to the sum of their weights.
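The weighting scheme can be sketched as follows. The 1% bin width follows the text; the function names are hypothetical, and the choice to scale the weights so that they sum to the original number of background sequences is an assumption made for the example.

```python
from collections import Counter

def cg_bin(seq):
    """Index of the 1%-wide CG dinucleotide frequency set (0 for 0-1%, etc.)."""
    seq = seq.upper()
    n_dinucs = max(len(seq) - 1, 1)
    cg = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")
    return int(100 * cg / n_dinucs)

def background_weights(target_seqs, background_seqs, bin_fn=cg_bin):
    """Weight each background sequence so the weighted background matches the
    target distribution across CG frequency sets."""
    t_bins = Counter(bin_fn(s) for s in target_seqs)
    b_bins = Counter(bin_fn(s) for s in background_seqs)
    n_t, n_b = len(target_seqs), len(background_seqs)
    weights = []
    for s in background_seqs:
        k = bin_fn(s)
        # Target fraction in this set, shared equally among its background members.
        weights.append(n_b * (t_bins[k] / n_t) / b_bins[k])
    return weights

weights = background_weights(["CGCG", "CGCG", "AAAA"],
                             ["CGCG", "AAAA", "AAAA", "AAAA"])
```

In this toy example the single CG-rich background sequence is weighted up and the three CG-poor sequences are weighted down, so the weighted background has the same 2:1 ratio as the targets.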

In addition to CpG islands, other genomic features may be preferentially associated with target sequences in ways that can affect the results of motif discovery. In particular, many transcription factors are found highly associated with transcription start sites (TSS) in unbiased tiling ChIP-Chip experiments, where they are mechanistically linked with gene activation or repression (Chapters 3, 5). As a result, ChIP target sequences for an unrelated transcription factor may be substantially enriched for one or more of the TSS-specific motifs by virtue of their localization near the TSS. This bias can be corrected for by using a strategy similar to CpG island correction. Instead of grouping sequences based on their CG% frequency, sequences are grouped based on their distance from the TSS (e.g. sequences between -1000 and -500, -500 and +1, +1 and +500, etc.). Sequences in the background are then weighted such that the distribution of sequences in each set is the same for the target and background sequences. Background correction is a general strategy and may be used to correct for other features as well. TSS and CpG island correction can also be performed simultaneously by assigning sequences to sets based on both properties. Examples of the effect of background correction are shown in Figure 2.2.
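Assigning sequences to sets based on both properties amounts to using a composite key. In the sketch below, the bin edges are taken from the example ranges in the text, and the function names and the convention that each bin includes its lower edge are assumptions.

```python
from bisect import bisect_right

# Bin edges from the example ranges above (-1000, -500, +1, +500).
TSS_EDGES = [-1000, -500, 1, 500]

def tss_bin(distance):
    """Index of the TSS-distance set containing the given position."""
    return bisect_right(TSS_EDGES, distance)

def joint_bin(cg_percent, distance):
    """Composite key combining the CG frequency set and the TSS-distance set."""
    return (int(cg_percent), tss_bin(distance))

# A sequence with 2.4% CG dinucleotides located 200 bp upstream of the TSS.
key = joint_bin(2.4, -200)
```

Sequences sharing a composite key form one set, and background weights can then be computed per set exactly as for a single property.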

2.3.3 Analysis of ChIP-Chip data and evaluation of Algorithm performance

All genomic sequences queried by the microarray were extracted and annotated for distance from the nearest TSS and CpG content. Probes from Agilent arrays (with probes ~250-500 bp apart) were assigned to the genomic sequence in their vicinity. Sequences from Nimblegen arrays (or other arrays with probes < 200 bp apart) were arbitrarily broken into 500 bp pieces around the nearest TSS. Regions bound by the transcription factor were defined as stated by the original authors.

Each algorithm (MEME, MDscan, Weeder, DME, and HOMER) was used to find motifs of length 10 on the same target and background sequences with the appropriate parameters selected. Since we want to assess the ability of the algorithm to identify motifs when the correct answer is not known, only the top-scoring motif from each method was used for comparison. While the correct motif may be ranked lower in the list of results, this result would have diminished utility when analyzing data where the motif is not known. The top motif from each dataset and each method was compared to the expected motif from the TRANSFAC29 and JASPAR30 databases. Motif similarity was computed by comparing motif frequency values using a chi-squared test as described in Schones et al.31 Correct motifs were identified as those with a similarity score above the 5% false discovery threshold.

2.3.4 Electrophoretic mobility shift assay (EMSA)

EMSA was performed as previously described32. The T1ISRE from the murine Mx1 promoter was used as the prototype for the EMSA probe. The ISRE probe is identical except for the addition of a single nucleotide needed to space the GAAA repeats so that they resemble a canonical ISRE. The sequences of the probes are as follows: T1ISRE 5’ (GGCCCTGAGGGTGAGTTTCGTTTCTGACCTCC), T1ISRE 3’ (CGCGGAGGTCAGAAACGAAACTCACCCTCAGG), ISRE 5’ (GGCCCTGAGGGTGAGTTTCAGTTTCTGACCTCC), ISRE 3’ (CGCGGAGGTCAGAAACTGAAACTCACCCTCAGG).

2.3.5 Microarray Analysis

Single probe per gene microarrays (Agilent, Codelink) were normalized using quantile normalization, while gcRMA33 was used to normalize Affymetrix microarrays. Differentially expressed genes were defined as those exhibiting >3-fold change between control and stimulated conditions. Motif analysis was performed using promoters defined in Chapter 5 in the range of -300 to +50 bp with respect to the TSS.
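For readers unfamiliar with these steps, quantile normalization and a fold-change filter can be sketched in plain Python. This is an illustrative simplification and not the pipeline used in this work: ties are broken arbitrarily, intensities are assumed to be on a linear scale, and the filter shown selects induced genes only.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution by replacing each value with the
    mean of the values holding the same rank across all arrays.

    arrays: list of equal-length lists, one per microarray."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(col[r] for col in sorted_arrays) / len(arrays)
                  for r in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # gene indices by rank
        out = [0.0] * n
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]
        normalized.append(out)
    return normalized

def induced_genes(control, stimulated, fold=3.0):
    """Indices of genes whose stimulated/control ratio exceeds the threshold."""
    return [i for i, (c, s) in enumerate(zip(control, stimulated))
            if c > 0 and s / c > fold]

normalized = quantile_normalize([[1, 2, 3], [6, 4, 2]])
```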

Gene expression clustering was performed using the program Cluster34. Genes that were induced >3-fold by LPS were normalized (by gene vector) and clustered using hierarchical clustering (average linkage, un-centered correlation). The clustering result was visualized using Java TreeView35.
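The un-centered correlation used as the similarity measure is equivalent to the cosine of the angle between two expression vectors. A minimal sketch, with hypothetical function names and with vector normalization interpreted as scaling each gene to unit length, is:

```python
from math import sqrt

def normalize_vector(x):
    """Scale a gene's expression vector to unit length."""
    norm = sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm else list(x)

def uncentered_correlation(x, y):
    """Correlation computed without subtracting the means (cosine similarity)."""
    num = sum(a * b for a, b in zip(x, y))
    den = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return num / den if den else 0.0

# Two genes with proportional profiles are perfectly correlated.
similarity = uncentered_correlation([1, 2, 4], [2, 4, 8])
```

A clustering distance can then be taken as one minus this similarity.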

2.4 Results

2.4.1 Benchmarking HOMER as a de novo motif discovery algorithm

The accuracy of HOMER and its performance relative to existing motif discovery algorithms was assessed through the analysis of ChIP-Chip data.

Motif discovery programs are traditionally evaluated on their ability to recover implanted motifs from synthetic data, but algorithm performance can vary greatly due to the techniques used to implant the motif and generate the sequence. This can artificially inflate the performance of the algorithm since synthetic datasets are typically generated such that the performance of the algorithm is maximized.

To overcome this limitation we chose to use real promoter sequences and actual experimental data from ChIP-Chip experiments carried out in a number of previously published studies. Since proteins should predominantly bind to their known DNA binding motifs, ChIP-Chip data for proteins with known motifs can serve as a “gold standard” by which to judge algorithm performance. This approach has the advantage that no data needs to be fabricated and each of the datasets is a test of the algorithm’s real-world utility. ChIP-Chip datasets were collected (27 total), 9 of which use tiling promoter microarrays36,37, while the rest use traditional promoter microarrays with one probe per promoter28,38-41.

We compared HOMER to four previously published methods (MEME, MDscan, Weeder, and DME). MEME was selected based on its widespread use and maturity as a motif discovery program10. Weeder13 was included based on a recent review that showed Weeder outperforming 14 other algorithms in the field9. MDscan12 was chosen because it was the first program specifically designed for use with ChIP-Chip data and has been widely used for ChIP-Chip42. We also included DME16, an algorithm that uses the same concept of differential motif finding but uses a different measure of motif significance and optimization strategy.

The results from the analysis of each dataset with each algorithm are summarized in Figure 2.3. Running times varied greatly between algorithms, with MEME and Weeder taking hours per experiment, HOMER and DME taking roughly 10 minutes per experiment, and MDscan taking as little as one minute on a 3 GHz Intel processor. HOMER outperformed the other methods by correctly finding the known motif as the top ranked motif in 22 of the 27 samples (81%). The next best algorithm was DME, which identified 13 of 27 correct motifs (48%). HOMER dramatically outperformed existing algorithms on the tiling promoter array data, where background corrections for CG content and TSS bias have a large impact on the correct answer. If only the traditional ChIP-Chip experiments were considered (excluding tiling arrays), HOMER correctly identified 14 of 18 motifs (78%) while the runner-up, DME, identified 11 of 18 motifs (61%).

2.4.2 Temporal analysis of cis-regulatory elements under the control of TLR4 signaling in macrophages

Macrophages respond to treatment with TLR4 agonists by inducing a large number of transcripts, making them an ideal system in which to leverage motif discovery tools to aid in the identification of cis-regulatory elements. We applied HOMER to a large microarray dataset measuring the response of RAW264.7 macrophages to Kdo2 at several time points (www.lipidmaps.org). We applied HOMER independently at each time point, confidently identifying 10 motifs enriched in either induced (ISRE, modified ISRE, κB, AP-1, CRE, SRE, and GAS) or repressed (E2F, NFY, and CHR) promoters (Figure 2.4). Motifs found in induced genes, with the exception of the “modified-ISRE”, have been extensively characterized in inflammation through the study of individual genes43-45. Likewise, motifs associated with down-regulated genes have been implicated in cell-cycle progression46,47, which is consistent with the halt in proliferation of RAW264.7 cells following treatment with TLR4 agonists48.

The temporal enrichment patterns seen for each factor provide a high-resolution view of TLR4 signaling and are remarkably consistent with the progressive activation of signaling cascades in macrophages. TLR4 signaling primarily activates the NFkB and MAPK signaling cascades upon stimulation49. This is reflected in the robust enrichment for kB motifs as early as 30 minutes after activation along with CRE and SRE motifs (cAMP and Serum Response Elements), which are downstream of the MAPK pathway and commonly associated with a generic stress response50. Between 1.5 and 2 hours, a robust amplification of the NFkB signal occurs, accompanied by STAT (bound by STAT dimers) and then ISRE (interferon stimulated response element) motifs. These events occur after the primary response of the cell and are most likely the effects of autocrine/paracrine signaling by the inflammatory mediators TNF-alpha and IFN-beta, both of which are induced strongly at 1 hour.

2.4.3 Identification of a novel response element in genes induced selectively by Type I Interferon

In an effort to deconvolute the TLR4 signaling cascade, we repeated our analysis on a variety of data measuring the response of macrophages to other inflammatory mediators (TNF, IFN-beta, IFN-gamma) as well as the response of genetically modified cells to TLR activation (Irf3 KO, Ifnar1 KO). All microarrays were generated with RNA extracted 6 hours after treatment. TNF-alpha binds the TNF receptor, leading to robust activation of the NFkB pathway51 and the induction of approximately 100 genes. Interferon (IFN) beta and gamma bind the Type I (Ifnar1/2) and Type II (Ifngr1/2) Interferon receptors, respectively, resulting in the activation of the STAT signaling cascade52. IFN-beta induces approximately 300 genes, two thirds of which are also induced by IFN-gamma. Very few genes are induced by IFN-gamma alone, although at least one of these genes, CIITA, is a master regulator of MHC class II expression governing the presentation of phagocytosed antigens53.

Clustering of all expression profiles revealed the importance of IFN signaling in TLR activation (Figure 2.5a). A large portion of the genes induced by IFN-beta were not induced during TLR activation in macrophages from genetically modified mice lacking either the transcription factor Irf3 or the Type I interferon receptor (Ifnar1). Irf3 has been shown to be necessary for the induction of IFN-beta following TLR activation54, while Ifnar1 is necessary for the autocrine/paracrine activities of IFN-beta55. Removing either of these components cripples the IFN arm of the TLR4 pathway, limiting the response. Surprisingly, there were several genes induced by IFN-beta that were unaffected by the loss of Irf3 or Ifnar1. This same group of genes was also induced by TNF-alpha, suggesting they may share a common mechanism of induction that is not dependent on Type I IFN production.

Analysis of sequences driving TNF-alpha, IFN-beta, and IFN-gamma transcriptional induction using HOMER revealed the expected enrichment for kB elements in TNF-alpha responsive promoters and ISRE and STAT motifs in IFN responsive promoters (Figure 2.5b). One noticeable difference between IFN-beta and IFN-gamma motifs was the unique enrichment for a modified ISRE motif in the IFN-beta sequences, which resembles a compressed ISRE sequence (GAAAnGAAA vs. GAAAnnGAAA). This same motif was enriched in genes induced by TLR4 (Figure 2.4). Focused analysis of genes induced by IFN-beta and not IFN-gamma revealed exceptional enrichment for the modified ISRE, which we termed the T1ISRE (Type I Interferon Stimulated Response Element) (Figure 2.5b).

While examples of the T1ISRE have been studied extensively in the literature, previous studies have failed to appreciate the difference between the T1ISRE and ISRE and the differential utilization of the T1ISRE in Type I and Type II Interferon signaling. To verify that the T1ISRE and ISRE play distinct roles in recruiting transcriptional activators, we performed Electrophoretic Mobility Shift Assays (EMSA) to identify differential affinity for nuclear proteins, using probes containing either a T1ISRE or an ISRE. Incubation of both probes with macrophage nuclear extracts revealed similar banding patterns, with the exception of one band that was exceptionally strong with the T1ISRE probe (Figure 2.6a). This band showed strong induction in the LPS treated versus untreated nuclear extracts, and it corresponds to the ISGF3 complex containing the Irf9, Stat1, and Stat2 transcription factors. The ISGF3 transcriptional complex is formed following stimulation with IFN-beta specifically, and is known to have high affinity for the ISRE56. Incubation of T1ISRE and ISRE EMSA probes with the nuclear extracts of macrophages treated with LPS, IFN-beta, or IFN-gamma for 30 minutes and 2 hours revealed selective activation of the ISGF3 band (Figure 2.6b). The strongest binding occurred in the IFN-beta treated samples (both 30 minutes and 2 hours) and in the LPS treated samples at 2 hours. The lack of induction in the 30-minute LPS sample suggests that the signal in the 2-hour sample was likely a result of autocrine IFN-beta signaling, consistent with our findings in Figures 2.4 and 2.5. These findings imply the T1ISRE is a specialized ISRE that has high affinity for the ISGF3 complex, which explains its high specificity for Type I Interferon signaling.

2.4.4 Identification of motifs enriched in disease

Our success in uncovering inflammatory response elements in focused microarray data sets led us to consider the analysis of microarray data used to study human inflammatory diseases. We analyzed data from a range of diseases originating from both patients and mouse models, avoiding infectious diseases where the presence of a foreign pathogen is expected to give results identical to those described above. These included data from patients with chronic coronary occlusions57, atherosclerosis58, non-ischemic cardiomyopathy59, Parkinson’s disease60, systemic lupus erythematosus61, and type II diabetes62, as well as adipocytes from obese humans63 and liver from mice fed a high fat diet64. Surprisingly, we observed widespread enrichment for ISRE, and in some cases kB and PU.1 motifs. PU.1 is a cell-type restricted transcription factor, found mainly in macrophages and B cells, and may be an indicator of an increased percentage of myeloid cells in the sample (Figure 2.7). While these results are somewhat expected given our disease panel’s bias toward diseases known to have inflammatory components, the robust enrichment for ISRE coupled with the lack of evidence for other types of motifs suggests that inflammatory pathways, and the Interferon pathway in particular, play a significant role in the pathology of all these diseases and conditions.

2.5 Discussion

The development of accurate motif finding programs is an important step in providing researchers with a tool to mine associations between co-regulated genes and the promoter-based sequence elements that drive their transcription. We developed a novel motif discovery algorithm, named HOMER (Hypergeometric Optimization of Motif EnRichment), and used it to discover known response elements used by the innate immune system for the response to pathogens as well as their contribution to human disease. We presented evidence for a novel response element, the T1ISRE (Type I Interferon Stimulated Response Element), which is utilized specifically for the induction of Type I Interferon-specific genes. Not only was there strong enrichment for the T1ISRE in TLR4 and Type I interferon target genes, but specific instances of the T1ISRE are also highly conserved in activated promoters (Figure 2.8a). Although only 7 mammalian genomes align to the human Mx1 promoter, every vertebrate species for which the Mx1 promoter can be defined contains a T1ISRE, including mouse (where it was discovered) and zebrafish. Evidence for a T1ISRE in fish suggests that the T1ISRE is nearly as ancient as the Interferon system itself.

While IFN-beta (Type I) and TNF-alpha are directly induced by TLR activation, IFN-gamma (Type II) is primarily produced by Th1 (T-helper) cells and NK (natural killer) cells to “prime” macrophages, thus enhancing their response to TLR signaling and making them much more efficient at presenting antigen53. The high homology between Type I and II Interferon proteins, signaling cascades, and target genes underscores their similar function and evolutionary origin. However, the existence of the T1ISRE in Type I specific genes highlights an important difference between them. Promoters harboring T1ISREs are typically associated with aggressive antiviral functions and gene products that are toxic to the cell. Since the Type I specific activating complex (ISGF3) has been documented to activate the ISRE as well56, it is likely that the T1ISRE exists to prevent activation by Type II activating complexes. The condensed form of the motif may make it inaccessible to IRF-dimers, which are the primary activators of the ISRE in Type II interferon stimulated cells (Figure 2.8c). Furthermore, T1ISREs are found in the promoters of Stat2 and Irf9 (components of ISGF3), suggesting an auto-regulatory activity capable of amplifying the type I interferon response.

The difference of a single base pair between T1ISRE and ISRE elements may point to a more general concept when placed in the context of other response elements (Figure 2.9). Two well-known motifs identified in our analysis are CRE (TGACGTCA) and AP-1 (TGAnTCA) (Figure 2.4), two very similar elements that differ by the spacing of one nucleotide. Another example is NFkB. The actual motifs identified when benchmarking HOMER’s performance (Figure 2.3) revealed a subtle difference between motifs identified from ChIP-Chip data for different NFkB proteins (Figure 2.9a). p50 and p52, which lack a transactivation domain, prefer to bind a condensed form of the kB motif (GGGattCCC), while p65 (Rela) and c-Rel bind an expanded form (GGGaatttCCC), supported by in vitro binding data65. To complete the trend, a study has shown that changing the gap size in STAT homo/heterodimer elements affects activity66. Taken together, these observations make a compelling case that the spacing between half sites in response elements helps confer specificity and functionality.

Chapter 2, in part, will be submitted for publication. Benner, C., Heinz S., Subramaniam S., Glass C.K. A novel approach to motif discovery identifies a type I interferon specific response element. The dissertation author was the primary investigator and author of this paper.

Figure 2.1: Overview of the HOMER algorithm for de novo motif discovery. A detailed description of the method is in the methods section while the pseudocode is in Appendix A. The overall goal of the algorithm is to identify motif probability matrices that specify a motif with the highest possible hypergeometric enrichment.

Figure 2.2: Normalization of background sequences. (a) Distribution of CpG dinucleotide frequency in OCT4 bound regions relative to the background regions, and the top two motifs found when analyzing the OCT4 positive dataset with no background correction. (b) Illustration of adjusted background distribution when correcting the CG content in the OCT4 background sequences (yellow arrows represent scaling of background sequences) and the top two motifs found when analyzing the corrected dataset. Both are representative of the known OCT1 motif (TRANSFAC: M00135). (c) Distribution of distance to the TSS for HNF1 positive sequences and background sequences and top two motifs found when analyzing the HNF1 dataset with no background correction. (d) Illustration of adjusted background distribution when correcting for TSS distance in the HNF1 dataset and top two motifs found when analyzing the adjusted dataset. The top result is the known HNF1 motif (TRANSFAC: M00132).

Figure 2.3: Summary of results analyzing ChIP-Chip data for factors with a known binding site. (a) Each ChIP-Chip dataset is represented as a column and each row shows the results using each individual algorithm. A red square corresponds to a match between the top ranked motif found by the algorithm and the known binding site. (b) Summary of (a) showing the sum of correctly analyzed datasets. (c) Actual motif results from two datasets where HOMER failed to identify the correct answer while at least one of the other methods found the correct motif. (d) Examples of datasets where every method correctly identifies the known motif (NRF1), and where every method fails to identify the correct motif (HNF4alpha, islet cells). (e) Examples of datasets where only HOMER is capable of identifying the correct motif (CREB in HEK cells, E2F1 from Nimblegen tiling arrays).

Figure 2.4: Motifs identified in TLR4 signaling. Dark blue and dark green correspond to enrichment in up-regulated and down-regulated genes, respectively. Motifs on the right represent the most highly enriched version of the motif identified by HOMER over all the time points.

Figure 2.5: Motifs enriched in sub-populations of TLR signaling. (a) Clustering of gene expression data for genes regulated greater than 3-fold in LPS. PIPC is a ligand for TLR3 with a highly similar profile to LPS (TLR4) signaling. (b) De novo motif discovery results for specific subgroups of genes.

Figure 2.6: Electrophoretic mobility shift assay (EMSA) of T1ISRE. EMSA was performed using nuclear extracts prepared from the macrophage-like RAW264.7 cell line. Antibody shifts were not performed but protein complexes can be inferred from highly similar banding patterns seen with ISRE probes in the literature67,68.

Figure 2.7: Confidently enriched motifs in human patients and murine disease models. The top non-redundant motifs enriched in genes induced by each disease. (PBMC stands for peripheral blood mononuclear cells.)

Figure 2.8: Function and conservation of the T1ISRE. (a) Genomic alignment to the human Mx1 promoter. T1ISREs are highlighted with a black box. (b) Distribution of T1ISRE containing genes and their induction in IFN-beta and IFN-gamma datasets. (c) Schematic of Interferon signaling.


Figure 2.9: Generality of motif half-site spacing to control binding specificity. (a) Known NFkB motifs and NFkB motifs derived from ChIP-Chip data. (b) Models depicting motif spacing differences for inflammatory response elements. Residues that differ between examples are highlighted in red.

Chapter 3: Advanced analysis techniques for ChIP-Seq

3.1 Abstract

Advances in next-generation sequencing combined with chromatin immunoprecipitation (ChIP-Seq) have increased the accuracy and resolution for determining the genomic localization of proteins binding to DNA. Traditional approaches to de novo motif discovery have been very successful when using ChIP-Seq data to identify the primary consensus elements thought to mediate protein-DNA interaction. However, limited attention has been given both to additional motifs enriched in individual ChIP-Seq experiments and to motifs that may be specifically enriched in the sub-populations of binding sites defined by comparisons between multiple experiments. We have developed a computational framework for the analysis of ChIP-Seq data that utilizes differential motif discovery to find both the range of DNA elements enriched in a single experiment and context specific motifs found by comparing different ChIP-Seq experiments and genomic annotations. We apply these techniques to several published studies to decipher both general and cell-type specific regulatory networks controlling gene expression in a variety of cell types, differentiating between elements that play roles in either the proximal promoter or distal regulatory regions. Furthermore, we demonstrate the power of this methodology by decoupling the cis-regulatory elements of embryonic stem cell factors Oct4, Sox2, Nanog, and Tcf3. This analysis reveals a nucleosome positioning pattern surrounding Nanog motifs that would prevent Oct4 and Sox2 binding if a nucleosome is present, suggesting a novel pioneering role for Nanog in dictating the accessibility of pluripotent enhanceosomes. Taken together, we show how cross-referencing motifs with sequencing tags and genomic sequence reveals insights into higher order cis-regulatory and chromatin structure.

3.2 Introduction

The DNA contained within the cell of every living organism specifies a complex set of instructions for the creation and maintenance of life. Transcription factors and other chromatin-associated proteins comprise the cellular machinery responsible for interpreting regulatory DNA and controlling the expression of genes throughout the genome. Coordinated changes in gene expression allow cells to divide, respond to external stimulus, or allow them to change from one cell type to another. Underlying these changes are networks of transcription factors that precisely target specific regions of DNA out of roughly 3 billion nucleotides. In most cases an individual transcription factor is capable of recognizing only 6-12 bp of DNA, requiring most transcription factors to work in conjunction with other transcription factors and the local chromatin environment to bind to specific regions of the genome.

The advent of next-generation sequencing has greatly enhanced our ability to experimentally detect the localization of proteins across the genome. Chromatin immunoprecipitation (ChIP), a technique where DNA fragments specifically associated with a protein of interest are biochemically isolated in vivo, has served as the de facto standard for determining protein localization on DNA for nearly two decades. Earlier attempts to quantify ChIP-enriched DNA fragments at a genome-wide level used DNA microarrays (ChIP-Chip)41. Next generation sequencing platforms such as Illumina Genome Analyzer, ABI SOLiD, and Roche 454 have the ability to sequence millions of fragments of DNA from ChIP samples that can then be mapped back to the reference genome to reveal protein localization69,70. Quantifying DNA by directly sequencing the sample bypasses the complexity of DNA hybridization used by microarrays for DNA detection, producing cleaner and more reliable data71. In addition to lower false positive rates provided by sequencing, the actual region of protein-DNA contact can be resolved to less than 40 bp with ChIP-Seq data while ChIP-Chip resolution is limited to several hundred bases72.

The high quality of ChIP-Seq data sets the stage for a breakthrough in our understanding of the cis-regulatory code and its interpretation by the cell's transcriptional machinery. To date, most efforts have been focused on determining the correct consensus motifs bound by transcription factors69,73,74. To this end, most motif discovery methods have been very successful when applied to high quality ChIP-Seq data. Unfortunately, these degenerate 6-12 bp consensus motifs typically predict protein binding at frequencies orders of magnitude higher than the rate observed in the genome, implying that additional factors play an important role in determining the true in vivo binding of many proteins.

In this study, we present the methodology to identify and integrate additional cis-regulatory features underlying protein localization. The foundation of this analysis utilizes differential motif finding to identify context dependent protein binding with respect to other datasets and genome annotations. Using this approach, we confirm existing as well as propose novel interactions between transcription factors inscribed in genomic loci responsible for basal, developmental, and signaling response networks. In addition, many elements revealed preferred distances between other binding sites and genomic features, implying interactions along the DNA, and in some cases with nucleosomes. This analysis is carried out using a collection of software tools called HOMER developed for this study. All software is freely available (http://bennerserver.ucsd.edu).

3.3 Methods

3.3.1 Identification of bound regions from ChIP-Seq data

Sequencing results for each of the experiments were downloaded from GEO (NCBI Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/), SRA (NCBI Short Read Archive: http://www.ncbi.nlm.nih.gov/Traces/sra/), or the authors' website. Data was remapped to provide consistency when comparing experiments from different studies. Single-end sequences were trimmed to 25 bp and mapped to the genome using Eland (Illumina), keeping only tags that mapped uniquely. Since each ChIP-Seq tag represents the edge of a ChIP fragment, we shifted each tag 75 bp in the 3' direction, corresponding to half the recommended fragment length for Illumina sequencing. We considered one tag from each unique position to eliminate peaks resulting from clonal amplification of fragments during the ChIP-Seq protocol. Peaks (binding sites) were identified by searching for clusters of tags within a sliding 200 bp window, requiring adjacent clusters to be at least 500 bp away from each other. The threshold for the number of tags that defines a valid peak was determined by randomizing tag positions on a 2e9 bp genome and repeating the peak finding procedure, choosing a threshold corresponding to a false discovery rate of 0.001. We also required peaks to have at least 4-fold more tags (normalized to total count) than input or IgG control samples when available. In addition, we required 4-fold more tags relative to the local background region to avoid identifying duplicated genomic regions, which is particularly important for data generated in cell lines. A description of the published ChIP-Seq data, complete with alignment information and individual analysis details, is available in Table 1.
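The tag adjustment and window-based clustering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the HOMER implementation: the function names `shift_tags` and `find_peaks`, the greedy highest-count-first selection, and the placeholder `min_tags` value are hypothetical. In the procedure described above the tag threshold would come from the randomization at a false discovery rate of 0.001, and the 4-fold control and local background filters are omitted here.

```python
import bisect

def shift_tags(tags, shift=75):
    """tags: iterable of (position, strand) 5' ends, strand is '+' or '-'.
    Keeps one tag per unique (position, strand), then moves each tag
    `shift` bp in its own 3' direction (right for '+', left for '-')."""
    unique = set(tags)
    return sorted(pos + shift if strand == "+" else pos - shift
                  for pos, strand in unique)

def find_peaks(centers, window=200, min_tags=5, min_dist=500):
    """centers: shifted tag positions on one chromosome.
    Counts tags in a window starting at each tag, then accepts windows
    greedily from highest to lowest count, skipping any window whose
    midpoint lies closer than `min_dist` to an accepted peak.
    min_tags is a hypothetical placeholder threshold."""
    centers = sorted(centers)
    candidates = []
    for i, start in enumerate(centers):
        j = bisect.bisect_right(centers, start + window - 1)
        count = j - i
        if count >= min_tags:
            midpoint = (centers[i] + centers[j - 1]) // 2
            candidates.append((count, midpoint))
    candidates.sort(key=lambda c: (-c[0], c[1]))
    peaks = []
    for count, midpoint in candidates:
        if all(abs(midpoint - p) >= min_dist for p, _ in peaks):
            peaks.append((midpoint, count))
    return sorted(peaks)
```

Tags from opposite strands of the same fragment population converge toward the binding site after the shift, which is why the clustering is run on the shifted positions rather than on the raw 5' ends.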

3.3.2 Generation of UCSC Genome Browser Tracks

The UCSC Genome Browser (http://genome.ucsc.edu/) is an invaluable resource for visualizing and cross-referencing different types of genome-wide data. To create custom tracks for UCSC Genome Browser, we took the 5' end of each sequencing tag and extended it to match the fragment length size selected for sequencing (~150 bp), which approximates the original ChIP DNA fragment. The depth of coverage is reported for each position in the genome. To minimize the load on the UCSC Genome Browser, only regions exceeding a given threshold are shown.
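As a rough illustration of the coverage calculation described above, the sketch below extends each tag to the fragment length and accumulates per-base depth with a difference array. The names `coverage` and `regions_above`, the 0-based coordinates, and the list-based representation are assumptions made for the example; writing the result in a browser track format is left out.

```python
def coverage(tags, chrom_len, frag_len=150):
    """tags: iterable of (five_prime_pos, strand), 0-based positions.
    Each tag is extended `frag_len` bp in its 3' direction and the
    per-base depth of the extended fragments is returned as a list."""
    diff = [0] * (chrom_len + 1)
    for pos, strand in tags:
        if strand == "+":
            start, end = pos, pos + frag_len          # half-open [start, end)
        else:
            start, end = pos - frag_len + 1, pos + 1
        start, end = max(start, 0), min(end, chrom_len)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    depth, running = [], 0
    for d in diff[:chrom_len]:
        running += d
        depth.append(running)
    return depth

def regions_above(depth, threshold):
    """Return half-open (start, end) intervals where depth > threshold."""
    regions, start = [], None
    for i, d in enumerate(depth):
        if d > threshold and start is None:
            start = i
        elif d <= threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions
```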

3.3.3 Genomic Motif Discovery Using HOMER

A detailed description of the core algorithm for differential motif discovery can be found in Chapter 2. For motif discovery of ChIP-Seq bound regions, genomic background must be selected for comparison when using a differential motif discovery algorithm. We extracted sequences of length 200 bp from -50 kb to +50 kb relative to all TSS, removing redundant sequences from nearby genes. These regions were selected to avoid centromere or repeat-rich heterochromatin sequences in non-genic regions that could bias results. The CpG density for each sequence fragment was calculated and binned in intervals of 0-0.02%, 0.02-0.04%, etc. For each application of the algorithm, a random selection of 100k background sequences was made such that the relative composition of CpG bins is identical between target and background sequences. For example, if the target regions contained 9 regions with 0-0.02% CpG and 1 region with 0.02-0.04% CpG, then 90k 0-0.02% CpG and 10k 0.02-0.04% CpG background regions would be randomly selected as background. A similar approach was used when normalizing for TSS bound sequences. In this case, sequences were binned based on their distance from the TSS, with bin boundaries at -10 kb, -1 kb, -500 bp, -200 bp, +200 bp, +500 bp, +1 kb, and +10 kb from the TSS. Masked sequences were used in this study, although most results were identical when unmasked sequences were used, since repeats were generally cancelled out by the differential motif discovery algorithm.
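The bin-matched background selection described above can be sketched as follows. This is a hedged illustration rather than the HOMER code: `cpg_fraction`, `cpg_bin`, and `sample_matched_background` are hypothetical names, CpG density is treated as a fraction of dinucleotide positions, and the bin width of 0.02 is an assumption about the units used in the text. The same pattern would apply to TSS-distance bins by swapping the binning function.

```python
import random
from collections import Counter, defaultdict

def cpg_fraction(seq):
    """Fraction of dinucleotide positions in seq that are CpG."""
    seq = seq.upper()
    return seq.count("CG") / max(len(seq) - 1, 1)

def cpg_bin(seq, width=0.02):
    """Index of the CpG-density bin (assumed bin width) for seq."""
    return int(cpg_fraction(seq) / width)

def sample_matched_background(targets, candidates, n_background,
                              width=0.02, seed=0):
    """Draw background sequences so that each CpG bin makes up the same
    share of the background as it does of the target set."""
    rng = random.Random(seed)
    target_bins = Counter(cpg_bin(s, width) for s in targets)
    pool = defaultdict(list)
    for s in candidates:
        pool[cpg_bin(s, width)].append(s)
    background = []
    for b, n in target_bins.items():
        wanted = round(n_background * n / len(targets))
        available = pool.get(b, [])
        # If a bin has too few candidates, take all of them.
        background.extend(rng.sample(available, min(wanted, len(available))))
    return background
```

With 9 low-CpG targets and 1 higher-CpG target, a request for 50 background sequences yields 45 and 5 sequences from the two bins, mirroring the 90k/10k example in the text.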

3.3.4 Centering of ChIP-Seq Peaks on Regulatory Elements

The positions of ChIP-Seq peaks were re-centered over a specific motif to create several of the figures in this chapter. To do this, we used HOMER to identify instances of the motif in the region from -100 bp to +100 bp relative to the center of the peak. New peaks were generated such that they all aligned on the motif of interest. If no motif was identified, the peak was discarded from consideration. In addition, highly redundant peaks sharing more than a 40 bp region of identical sequence were removed to avoid results generated from duplications or repeats that may not be general.
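A simplified sketch of the re-centering step is given below. It assumes the motif can be expressed as a regular expression over a consensus sequence, which stands in for the matrix-based motif scanning that HOMER performs; the function names are hypothetical, and the removal of redundant peaks is not shown.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def recenter(seq, center, motif_regex, flank=100):
    """Search both strands within +/- flank bp of the peak center.
    Returns (new_center, strand) for the motif instance closest to the
    original center, or None if the motif is absent (peak discarded)."""
    lo, hi = max(center - flank, 0), min(center + flank, len(seq))
    region = seq[lo:hi].upper()
    pattern = re.compile(f"(?=({motif_regex}))")  # lookahead finds overlapping hits
    hits = []
    for m in pattern.finditer(region):
        hits.append((lo + (m.start(1) + m.end(1)) // 2, "+"))
    for m in pattern.finditer(revcomp(region)):
        # Convert reverse-complement coordinates back to the forward strand.
        fstart, fend = len(region) - m.end(1), len(region) - m.start(1)
        hits.append((lo + (fstart + fend) // 2, "-"))
    if not hits:
        return None
    return min(hits, key=lambda h: abs(h[0] - center))
```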

3.3.5 Functional Enrichment of Peaks

To identify biological functions that may be associated with genomic locations of peaks, we assigned gene products to each peak based on the nearest annotated TSS from RefSeq. Only peaks within 50 kb were considered. Functional annotations were derived from the Gene Ontology Consortium (http://www.geneontology.org/). Functional enrichment was calculated using the cumulative hypergeometric distribution.
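For reference, the cumulative hypergeometric enrichment mentioned above can be computed as in the sketch below. The function name and argument names are chosen for this example: N is the number of annotated genes considered, K the number carrying a given functional term, n the number of genes assigned to peaks, and k the number of those carrying the term.

```python
from math import comb

def hypergeom_enrichment(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N, K, n): the probability of
    seeing at least k term-annotated genes among n peak-associated genes
    drawn without replacement from N genes, K of which carry the term."""
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(max(k, 0), upper + 1))
    return tail / total
```

Exact integer arithmetic keeps the tail sum accurate for small p-values; for genome-scale gene counts a log-space implementation would be faster.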

3.4 Results

3.4.1 Basic Motif Analysis of ChIP-Seq Defined Regions

To gauge the baseline performance of HOMER, we analyzed all ChIP-Seq datasets of sequence-specific transcription factors available in the literature. These experiments included a variety of different proteins from many different cell types. Each ChIP-Seq data set was first analyzed to determine regions in the genome where each protein binds, referred to as "peaks". HOMER works by finding motifs that are preferentially enriched in one set of sequences versus a background set of sequences. Since ChIP-Seq has the ability to localize proteins to any non-duplicated region of the genome, we used a large random sample of genomic DNA as background. In addition, sequences were chosen such that the relative CpG content of background sequences matched that of those bound by the transcription factor to remove bias introduced by CpG islands. Consistent with previous attempts to identify overrepresented motifs in ChIP-Seq data, HOMER reliably identifies the known consensus motif in each of the studies analyzed (Figures 3.1, 3.2, 3.8, 3.9, and data not shown).

Nearly every analyzed experiment yielded several additional, non-redundant motifs. This is exemplified by our analysis of ChIP-Seq peaks for Foxa2 in murine liver75, where in addition to strong enrichment for the FOX motif, we saw strong and exclusive enrichment for the motifs of a number of liver-specific transcription factors including HNF4α, HNF1, and CEBP (Figure 3.1). Many of these transcription factors are known to physically interact and all have been shown to drive hepatocyte-specific gene expression76.

3.4.2 Differential Motif Discovery

The greatest advantage of using a differential motif discovery algorithm is the ability to change both the target and background sequences to answer very specific questions with different biological meaning. For example, the binding profiles of c-Myc and n-Myc73, which bind the same consensus motif (CACGTG)77, exhibit different affinities for a subset of their overall binding sites. Analysis of enriched motifs in regions that have more than twice as many n-Myc as c-Myc tags revealed that the consensus motif was still the strongest genomic sequence pattern in these regions, as it was in common peaks and c-Myc specific regions. To identify motifs that are specifically enriched in n-Myc specific regions, we can use the regions that are commonly bound by c-Myc and n-Myc as background sequences for differential motif finding. Motif finding using the new background set revealed a clear enrichment for the motif bound by CTCF. This result can be confirmed by considering the overlap of CTCF bound regions73 with those bound by c-Myc and n-Myc in the same cell type. CTCF is co-bound in 12.5% of the n-Myc specific regions, compared to 3% of the c-Myc specific regions (Figure 3.2b).

3.4.3 Normalizing for Transcriptional Initiation

Many of the co-enriched motifs found when analyzing the ChIP-Seq data were redundant across multiple experiments. These included NRF1, NFY, and ETS, among others, and are representative of motifs that are commonly found near the TSS (Chapters 5 and 6). Analysis of hypo-phosphorylated RNA polymerase II ChIP-Seq70, which typically marks the 5' end of transcripts78, revealed a set of enriched motifs composed exclusively of these motifs (Figure 3.4). Since many of the factors studied here are transcriptional regulators and biased to localize near the TSS, they are also likely to exhibit enrichment for basic TSS motifs. To remove this bias, we next re-sampled our genomic background used for motif finding such that we equalized the relative composition of promoter or distal DNA relative to regions bound by the transcription factor. This effectively allowed us to compare promoter targets to other promoters and distal targets to other distal regions. The usefulness of this normalization is demonstrated with c-Myc ChIP-Seq peaks, which are highly localized to TSS genome-wide. Analysis of c-Myc bound regions revealed strong enrichment for the Myc motif (E-box), a novel enrichment for AARE (amino acid response element)79, and TSS specific motifs YY1, ETS, and NRF1 as the top 5 results (Figure 3.3a). After normalizing for TSS localization, we recovered Myc, AARE, and p53 motifs as the top results followed by TSS specific motifs at lowered enrichment values (Figure 3.3b). Both c-Myc and p53 have been implicated together in pathways driving oncogenesis, with evidence for their involvement coming from a range of different cancer types80. c-Myc has also been shown to physically interact with p73α81, a close homologue of p53, providing the basis for c-Myc recruitment to p53 elements.

An additional application of TSS-based normalization is the identification of distal enhancer motifs. Although analysis of ChIP-Seq for RNA polymerase II revealed strong enrichment for TSS specific motifs such as ETS and NRF1, we can apply our TSS-normalization approach directly to RNA polymerase II data to identify distal motifs of recruitment (Figure 3.4). Recruitment of RNA polymerase II to distal motifs has been proposed as an important mechanism of action of many enhancers82. Analysis of RNA polymerase II from CD4+ T cells revealed several elements that were highly enriched in distal regions including a variant ETS site (ACAGGAAGT), RUNX, AP-1, and CTCF. With the exception of CTCF, transcription factors that bind each of these elements have been implicated in developmental programs in the hematopoietic system83,84. Distal RNA polymerase II sites containing these motifs are likely either cell-type specific enhancers or alternative promoters in CD4+ T cells.

3.4.4 Spatial Organization of ChIP-Seq Tags and Motifs

Our understanding of how cis-regulatory elements interact with each other and other genomic features can be substantially improved by cross-referencing motifs and sequencing tags to reveal spatially constrained patterns encoded in the DNA sequence. This analysis starts by re-centering peaks on their predicted or known binding motifs, which presumably locates the exact position of the protein on the DNA. Using this position as a starting point, we can map the distribution of various motifs and sequencing tags to shed light on higher order cis-interactions.

Analysis of ChIP-Seq tag distributions relative to motifs can be used to confirm the binding of factors to their motifs. During the ChIP-Seq experiment, nuclei are cross-linked with formaldehyde, presumably covalently attaching transcription factors to their target elements on DNA. The DNA is then fragmented into small pieces (~200 bp) and immunoprecipitated, yielding fragments of DNA still bound by the transcription factor of interest. High-throughput sequencing proceeds at either end of the fragment from 5' to 3', meaning that if the factor is cross-linked to its consensus motif, sequencing reads should be confined to the region upstream of the motif on the positive strand and the region downstream from the motif on the negative (opposite) strand of DNA (Figure 3.5a).
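The expectation above can be checked by tabulating tag 5' ends by strand relative to each motif instance. The sketch below is illustrative only: `tag_profile` and `fraction_expected_side` are hypothetical names, the nested loop is quadratic and meant for small examples, and offsets are oriented to the motif strand so that motifs on either strand can be pooled.

```python
from collections import Counter

def tag_profile(motif_sites, tags, span=200):
    """motif_sites, tags: lists of (position, strand) on one chromosome.
    Returns two Counters keyed by offset from the motif (negative =
    upstream in the motif's orientation): tags on the motif strand and
    tags on the opposite strand."""
    same, opposite = Counter(), Counter()
    for m_pos, m_strand in motif_sites:
        for t_pos, t_strand in tags:
            offset = t_pos - m_pos
            if abs(offset) > span:
                continue
            if m_strand == "-":
                offset = -offset  # flip so upstream is always negative
            if t_strand == m_strand:
                same[offset] += 1
            else:
                opposite[offset] += 1
    return same, opposite

def fraction_expected_side(same, opposite):
    """Share of tags found where direct cross-linking predicts them:
    upstream on the motif strand, downstream on the opposite strand."""
    expected = (sum(c for o, c in same.items() if o < 0)
                + sum(c for o, c in opposite.items() if o > 0))
    total = sum(same.values()) + sum(opposite.values())
    return expected / total if total else 0.0
```

A value of `fraction_expected_side` near 1 would correspond to the sharp drop-off in tags at the motif, while a value near 0.5 would correspond to a smooth profile with no drop-off.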

We reasoned that if a factor is cross-linked to its motif, and only to its motif, it should be impossible to observe sequencing reads downstream from the motif on the positive strand of DNA. To test this hypothesis, we aligned consensus motifs in each peak and calculated the distribution of ChIP-Seq tags of all peaks in the genome (Figure 3.5b). We observed that sequencing tags on the forward and reverse strand predominately mapped 5' and 3' to the motif, respectively. For most factors there is a dramatic drop in the number of mapped tags on each strand at the motif, exemplified by binding sites for Foxa2 mapped to FOX motifs (Figure 3.5b). This pattern of tags should be indicative of where the factor is cross-linked to DNA. In contrast, the alignment of Foxa2 peaks on highly co-enriched HNF4 motifs yields a smoother profile of tags with no drop-off in tag counts at the HNF4 motif. This implies that the HNF4 site is not likely to be significantly cross-linked to the Foxa2 transcription factor, suggesting that HNF4 and Foxa2 interact in cis, as opposed to in trans through a tethering-based mechanism.

3.4.5 The Co-occurrence and Spacing of Motifs Underlying RNA Polymerase II Recruitment

The large number of enriched motifs found associated with some ChIP-Seq experiments raises the possibility that these motifs may functionally or spatially interact with one another. In the case of peaks for RNA Polymerase II in CD4+ T cells, combinations of motifs may provide a genetic code for recruiting the general transcription machinery. To address these possibilities, we calculated the likelihood that each of the motifs enriched in RNA Polymerase II peaks would be found in the same DNA fragments by identifying sets of motifs that significantly co-occur (Figure 3.6a). This analysis yielded two major groups of motifs that tend to co-occur. The first group was predominately associated with promoters, and included ETS, NRF1, CRE, and YY1, among others (Chapter 5). The second group contained the motifs that were highly enriched when analyzing RNA Polymerase II peaks adjusted for TSS bias, including AP-1, RUNX, and a modified ETS motif (cAggaagt vs. cCggaagt).

While the concept of RUNX, AP-1, and modified ETS motifs working together to recruit RNA Polymerase II to enhancers is a plausible hypothesis, we were concerned that these regions could be cell-type specific promoters instead of enhancers. We addressed this possibility by cross-referencing each of these motifs to H3K4me3 ChIP-Seq data from CD4+ T cells. Tri-methylation of lysine 4 on histone H3 has been shown to ubiquitously mark sites of transcriptional initiation85, and should serve as a reasonable surrogate for transcription in the absence of 5' RNA sequences for CD4+ T cells. RNA polymerase II peaks were aligned on each individual motif and the distribution of H3K4me3 nucleosomes was calculated based on data from Barski et al.70 (Figure 3.6b). TSS-associated motifs, including ETS, NRF1, and CRE, showed strong evidence for H3K4me3 nucleosomes immediately upstream and downstream with a nucleosome-free region centered over the motif. In contrast, RUNX and AP-1 motifs were not associated with H3K4me3 nucleosomes, while the modified ETS motif showed limited enrichment, which may be linked to the fact that its binding site is similar to that of the TSS-associated ETS motif. This analysis suggests that distal RNA polymerase II associated motifs likely serve an enhancer role and are not functioning in the localization of transcriptional initiation.

Focused analysis of motifs common to the transcription start site revealed that some motifs were frequently found together in the same promoters. While nearly all TSS-specific elements were significantly co-enriched when considering all RNA polymerase II peaks, a restricted analysis using only promoter proximal peaks uncovered pairs of factors that preferentially co-occurred relative to others (Figure 3.7a, compared to Figure 3.6a). Co-occurring motifs frequently displayed preferential spacing between elements, as exemplified by SP1 and NFY (Figure 3.7b). Motifs that did not significantly co-occur, such as NRF1 and NFY, do not have any preferred spacing between elements (Figure 3.7c). Furthermore, motifs such as NFY and ETS are frequently found several times in promoters with very precise spacing preferences (Figure 3.7d-e). In each of these examples, preferential spacing is observed at intervals of 10 bp, or one helical turn of DNA. This implies that there is selective pressure to find co-occurring motifs on the same rotational surface of the DNA, where they or their co-factors can interact.
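One simple way to look for the spacing preferences described above is to histogram the distances between motif pairs in the same promoter and then collapse the distances by their phase within a 10 bp helical turn. The sketch below makes several assumptions for illustration: motif positions are given per promoter as dictionaries, the function names are hypothetical, and no statistical test for significance is included.

```python
from collections import Counter

def spacing_histogram(sites_a, sites_b, max_dist=100):
    """sites_a, sites_b: dict mapping promoter id -> list of motif
    positions. Returns a Counter of signed distances (b - a) between
    every pair of motif instances found in the same promoter."""
    hist = Counter()
    for promoter, a_positions in sites_a.items():
        for a in a_positions:
            for b in sites_b.get(promoter, []):
                d = b - a
                if d != 0 and abs(d) <= max_dist:
                    hist[d] += 1
    return hist

def phase_counts(hist, period=10):
    """Collapse a spacing histogram onto phases 0..period-1. Spacing
    constrained to one rotational face of the helix concentrates the
    counts in one or two adjacent phases."""
    phases = Counter()
    for d, count in hist.items():
        phases[abs(d) % period] += count
    return phases
```

Passing the same dictionary as both arguments gives the self-spacing of a motif that occurs several times in a promoter, as in the NFY and ETS examples.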

3.4.6 Expanding The CTCF Consensus Motif

During the search for motifs enriched in CTCF ChIP-Seq peaks70, we noticed several motifs that contained parts of the CTCF consensus but were offset relative to the known binding motif (Figure 3.8a). Analysis of the DNA sequence surrounding all bound CTCF peaks aligned by the CTCF motif86 (14,144 total) revealed strong bias in nucleotide frequency flanking the central CTCF motif (Figure 3.8b). This nucleotide frequency pattern matches that predicted by the secondary motifs identified in the motif analysis and implies that the secondary motifs function together with CTCF as a single composite genetic element. The auxiliary motifs did not match any known binding sites that have been previously reported. The amino acid sequence of CTCF revealed an array of 11 zinc finger domains that would be predicted to bind up to 30 bp of sequence, supporting the idea of an expanded consensus motif. Additional support for CTCF's expanded footprint on the DNA comes from the noticeable drop-off in 5' and 3' ChIP-Seq tags at the beginning of the composite element (Figure 3.8d). A drop in DNaseI hypersensitivity87 is also seen in this region, with the exception of a region between the central CTCF motif and the 5' auxiliary motif, which appears to be very hypersensitive in the 5' orientation.

It is possible that the auxiliary motifs found alongside the central CTCF motif play an important role in helping to recruit CTCF and configuring the chromatin environment. We scanned the repeat-masked genome for CTCF motifs, finding 29,814 CTCF motifs not bound by CTCF in CD4+ T cells. These sites showed no enrichment for the auxiliary motifs seen near motifs actually bound by CTCF (Figure 3.8c). In addition, analysis of nucleosome positions as determined by MNase-Seq in CD4+ T cells88 revealed an array of nucleosomes both upstream and downstream relative to bound CTCF motifs, which is consistent with previous observations89 (Figure 3.8e). No consistent nucleosome pattern was observed relative to CTCF motifs that were not bound by the CTCF protein.

3.4.7 Cis-regulatory code of pluripotent transcription factors

Oct4, Sox2, and Nanog are transcription factors that are collectively expressed in embryonic stem cells and are important for maintaining a pluripotent state. The high level of interest in these proteins and the regulatory networks they control has led to the creation of two high quality ChIP-Seq datasets in murine embryonic stem cells72,73. Results from these and other studies37,90 revealed an astonishingly high rate of co-localization of all three factors, particularly on the Oct4-Sox2 composite regulatory motif (Figure 3.9a). In addition to the three classic pluripotent factors, Tcf3, a transcription factor that is a downstream target of the Wnt pathway, has been reported to bind to many of the same loci72.

While the high similarities in binding locations and enriched motifs between Oct4, Sox2, Nanog, and Tcf3 have been established, we applied our differential motif discovery analysis to uncover differences between factors to learn more about the motifs that recruit them to DNA. We first searched for motifs in peaks that were specifically bound by each factor and compared them to peaks that were collectively bound. This analysis successfully eliminated enrichment for the Oct4-Sox2 composite and instead recovered motifs that were specifically recognized by each of the proteins (Figure 3.9b).

Two notable results emerged from this analysis. First, although Sox2 and Tcf3 both bind DNA through HMG-box domains, they preferentially bind different motifs. Sox2 (AACAATGG) and Tcf3 (ATCAAAGG) showed preferences for different nucleotides within the same general element (ANCAANGG). Analysis of enriched 18 bp motifs in Sox2 or Tcf3 specific regions revealed that the original Oct4-Sox2 composite motif was actually a blend of Oct4-Sox2 and Oct4-Tcf3 composite motifs (Figure 3.10a). Furthermore, functional enrichment of genes in the vicinity of either Sox2 or Tcf3 specific peaks revealed different developmental processes, suggesting Sox2 and Tcf3 play overlapping yet distinct roles in ES cells and subsequent development (Figure 3.10c).

Second, we report a novel motif (GGCCATTAAC) that is highly enriched in Nanog peaks and matches the general sequence preference reported by a recent structural study91. This element can also be identified by restricting the sequence searched for motifs from 200 bp, used to assist in the identification of co-enriched motifs, to only 40 bp around the center of the ChIP-Seq peak for maximum sensitivity. Nanog binds DNA through a single homeodomain, differing from Oct4, which contains both homeodomain and POU DNA binding domains. Although the Nanog motif was relatively degenerate, alignment of genomic DNA centered on the motif revealed strong periodic fluctuations in nucleotide frequency extending approximately 70 bp in either direction from the motif. This fluctuation was absent surrounding Nanog motifs from random genomic regions that were not bound in the Nanog ChIP-Seq experiment from embryonic stem cells (Figure 3.11). The pattern strongly resembles the nucleotide fluctuations seen in DNaseI treated chromatin (Chapter 4) and may help position a nucleosome in the vicinity of the binding site. This pattern appears to be unique to Nanog since none of the other motifs identified in any of the ChIP-Seq datasets studied so far exhibited this feature.
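The periodic fluctuation described above can be made concrete by computing, for each position in a set of motif-centered sequences, the fraction of sequences carrying an A/T dinucleotide, and then measuring how strongly that profile correlates with itself at a lag of 10 bp. The sketch below is a simplified illustration; the function names are hypothetical, and the choice of autocorrelation as the periodicity measure is an assumption of this example rather than the analysis used in the figures.

```python
AT_DINUCLEOTIDES = {"AA", "TT", "TA", "AT"}

def at_dinucleotide_profile(aligned_seqs):
    """aligned_seqs: equal-length sequences centered on the motif.
    Returns, for each dinucleotide start position, the fraction of
    sequences with AA, TT, TA, or AT at that position."""
    length = len(aligned_seqs[0])
    counts = [0] * (length - 1)
    for seq in aligned_seqs:
        seq = seq.upper()
        for i in range(length - 1):
            if seq[i:i + 2] in AT_DINUCLEOTIDES:
                counts[i] += 1
    return [c / len(aligned_seqs) for c in counts]

def autocorrelation(profile, lag):
    """Normalized autocorrelation of the profile at the given lag.
    A ~10 bp periodic signal gives a positive value at lag 10 and a
    negative value at lag 5."""
    n = len(profile)
    mean = sum(profile) / n
    variance = sum((x - mean) ** 2 for x in profile)
    if variance == 0:
        return 0.0
    cov = sum((profile[i] - mean) * (profile[i + lag] - mean)
              for i in range(n - lag))
    return cov / variance
```

Comparing the lag-10 autocorrelation for bound versus unbound motif instances would give a single number summarizing the contrast between the two profiles.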

In addition to a strong periodic nucleotide frequency fluctuation surrounding the Nanog motif, Oct4 and Sox2 motifs preferentially occupied several specific positions relative to the Nanog motif with a separation of approximately 10 bp, or one helical turn of DNA (Figure 3.12). In many cases there is an OCT motif adjacent to the Nanog motif, where the homeodomains of each protein would bind to the same region (ggccATTaaC for Nanog, ATTtgCata for Oct4). The 10 bp spacing intervals between Nanog and Sox2 or Nanog and Oct4 are in contrast to the fixed distance seen between Oct4 and Sox2 (i.e., the Oct4-Sox2 composite motif). The former relationship is indicative of transcription factors interacting along a consistent rotational surface of the DNA, which is equivalent at intervals of 10 bp due to the helical nature of DNA, while the Oct4-Sox2 motif is indicative of proteins forming a tight complex on the DNA92 (Figure 3.15).

Although nucleotide fluctuations surrounding bound Sox2 and Oct4 motifs were not as strong as those found near bound Nanog motifs, a slight periodic pattern was present. This is consistent with the observation that Nanog, Sox2, and Oct4 motifs exhibited preferential spacing from one another, transitively implying that these nucleotide fluctuations should be evident near Sox2 and Oct4 motifs since they are observed for Nanog motifs. We found that if we specifically analyzed bound Sox2 or Oct4 motifs that were highly enriched for Nanog ChIP-Seq tags, we observed enhanced nucleotide frequency periodicity (Figure 3.13). In contrast, Sox2 and Oct4 sites that were not bound by Nanog did not exhibit this property. It is worth noting that these sites were preferentially enriched near promoters and insulators (enriched for CTCF), while sites co-bound with Nanog were predominately distal from promoter regions. We also observed increased amplitude in the nucleotide frequency pattern as a function of binding strength (i.e. Nanog ChIP-Seq tags near Nanog motifs) (Figure 3.14).

3.5 Discussion

In this study we demonstrated the ability to integrate ChIP-Seq data and genomic sequence to shed light on the mechanisms of transcription factor binding. We showed that careful selection of background sequences used to assess motif enrichment can assist in the identification of elements that are context specific or more biologically meaningful. We also found that the spacing between ChIP-Seq tags and motifs can be used to verify cross-linked locations, or the positions of nucleosomes relative to motifs using tags from MNase digested samples. Finally, the spacing between motifs and patterns in the surrounding DNA sequence can offer additional insight describing the combinatorial interaction of transcription factors and nucleosomes on the DNA.

ChIP-Seq data offers specific advantages over techniques such as ChIP-Chip that have yet to be fully appreciated in the literature. For example, tag profiles found relative to motifs should be usable as proof of a protein binding to a consensus site. The dramatic drop-offs in tag counts observed in ChIP-Seq data when comparing regions upstream and downstream relative to the motif provide strong evidence that a factor is binding to the motif. This type of analysis would not have been possible using ChIP-Chip technology and represents one of the unexpected advantages of the new sequencing technology. The existence of some residual tags downstream from the motif, where they are theoretically not expected, may be a result of the protein cross-linking through additional DNA-protein contacts or through protein-protein interactions with nearby co-factors. In addition, it may be possible to accurately identify the exact nucleotide(s) where cross-linking occurs, although this is complicated by the limits of DNA fragmentation by sonication and by sequencing bias from ligation reactions (Chapter 4). These profiles can also serve as a starting point for peak finding methods that depend on parameterization of the distribution of tags around binding sites, leading to higher accuracy and specificity when identifying peaks.
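The tag profile analysis discussed above can be illustrated with a short Python sketch. It is not the implementation used in this work; the function name `tag_profile`, the tuple layouts for motifs and tags, and the 100 bp window are assumptions made for illustration. The sketch counts tag 5’ ends at each offset from a motif center, separately for tags on the same and on the opposite strand as the motif, which is the kind of profile in which a sharp drop-off at the motif would be visible.

```python
from collections import Counter

def tag_profile(motifs, tags, window=100):
    """Count tag 5' ends at each offset from motif centers, per relative strand.

    motifs: iterable of (chrom, center, motif_strand), strand is '+' or '-'
    tags:   iterable of (chrom, five_prime_pos, tag_strand)
    Returns {'same': Counter, 'opposite': Counter} keyed by offset, where
    'same' means the tag lies on the same strand as the motif.
    (Hypothetical data layout; a quadratic scan kept simple for clarity.)
    """
    by_chrom = {}
    for chrom, pos, strand in tags:
        by_chrom.setdefault(chrom, []).append((pos, strand))
    profile = {"same": Counter(), "opposite": Counter()}
    for chrom, center, m_strand in motifs:
        for pos, t_strand in by_chrom.get(chrom, []):
            offset = pos - center
            if m_strand == "-":
                offset = -offset  # express offsets in the motif's orientation
            if abs(offset) <= window:
                key = "same" if t_strand == m_strand else "opposite"
                profile[key][offset] += 1
    return profile
```

For genome-scale data an indexed or sorted lookup would replace the inner loop; the nested scan here only keeps the logic easy to follow.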

The existence of a nucleosome positioning pattern surrounding the Nanog motif could provide valuable insight into how proteins may be structurally arranged in Oct4/Sox2/Nanog bound regions. One possibility is that this sequence helps position a nucleosome with a fixed rotational orientation relative to the DNA. Extensive research has shown that A/T dinucleotides thermodynamically favor bending around the minor groove and, when spaced with a periodicity of 10 bp, allow the helical DNA to bend in a consistent direction, effectively minimizing the free energy needed to wrap around the histone octamer. Assuming the sequences surrounding the Nanog motif help position a nucleosome, we can use the fluctuations in A/T dinucleotides (AA/TT/TA/AT) to define where the minor groove is facing the interior of the nucleosome. In the case of Nanog, this pattern occurred at (-21, -10, +10, +22) relative to the center of the Nanog motif (Figure 3.14), implying the minor groove of the motif was facing toward the center of the nucleosome where the protein actually bound the DNA. This would allow Nanog, which binds to the major groove91, to bind DNA when a nucleosome is present. Similar analysis for Sox2 also implied that the minor groove of the Sox2 motif, when co-bound by Nanog, was facing toward the nucleosome. In contrast to homeodomains, HMG-box proteins like Sox2 bind the minor groove of the DNA92, making it impossible for Sox2 to bind in the presence of the nucleosome. Binding of Oct4 would similarly be obstructed since its POU domain binds the major groove 5 bp downstream from its homeodomain, placing the POU domain binding site (atttGCATA) on the inside of the bending nucleosomal DNA (Figure 3.15).
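As a rough illustration of how an A/T dinucleotide profile of this kind could be computed, the Python sketch below tallies the fraction of motif-centered sequences carrying AA, AT, TA or TT at each position. The function name `at_dinucleotide_frequency` and the assumption that all sequences are of equal length and already oriented on the motif strand are ours, not details taken from the analysis above.

```python
AT_DINUCS = {"AA", "AT", "TA", "TT"}

def at_dinucleotide_frequency(sequences):
    """Fraction of sequences with an A/T dinucleotide starting at each position.

    sequences: equal-length DNA strings, each centered on a motif and
    oriented so that the motif lies on the forward strand (an assumption
    of this sketch).
    """
    if not sequences:
        return []
    length = len(sequences[0])
    counts = [0] * (length - 1)
    for seq in sequences:
        seq = seq.upper()
        for i in range(length - 1):
            if seq[i:i + 2] in AT_DINUCS:
                counts[i] += 1
    return [c / len(sequences) for c in counts]
```

Local maxima in the returned profile that recur roughly every 10 bp would then mark candidate positions where the minor groove faces the histone octamer.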

Taken together, these observations suggest that Nanog may play a “pioneer role” in displacing the nucleosome, allowing the subsequent binding of Sox2 and/or Oct4. This model would help explain the specificity of the complex and suggests a 2-step mechanism for binding, where Sox2 and Oct4 binding sites are “hidden” in the presence of a nucleosome, requiring Nanog to bind first and displace the nucleosome. This model is also supported by the fact that Sox2 and Oct4 peaks that are not co-bound by Nanog are enriched for promoter and insulator regions, which are typically highly DNase-hypersensitive, a characteristic of accessible chromatin. It is also worth noting that while the CTCF motif is significantly enriched in Oct4 or Sox2 specific ChIP-Seq peaks, there is no enrichment for Oct4 or Sox2 motifs in CTCF bound compared to unbound CTCF motifs in embryonic stem cells (p-value > 0.5). This implies that the binding of CTCF is likely independent of Oct4 or Sox2 genetic elements, supporting a model in which Oct4 and Sox2 can bind CTCF bound regions simply because of the open chromatin configuration at these sites.

The interplay between Nanog and nucleosomes is not the only possibility that exists given the data shown here. For example, Nanog and its co-factors may create an enhanceosome that bends DNA, where Sox2 and Oct4 bind on the inside bend of the DNA-protein complex. The complex may not be stable unless the DNA is able to intrinsically bend in the correct direction to serve as a scaffold for the complex. An example would be a co-factor containing a histone-fold domain that may bind next to Sox2 and the Oct4 POU domains on the interior of the bending DNA.

Chapter 3, in part, will be submitted for publication. Benner, C., Heinz S., Subramaniam S., Glass C.K. Advanced analysis of ChIP-Seq data reveals the stem cells specific factor Nanog as a pioneering transcription factor. The dissertation author was the primary investigator and author of this paper.

Table 3.1: Summary of sequencing data used in this study.

Sample                               Mapping Details                                                    Peak Summary
c-Myc ChIP-Seq [ES cells]73          12,610,745 (mm8, 25 bp, ≤ 2 mismatch, unique alignment in genome)  6,138 peaks
n-Myc ChIP-Seq [ES cells]73          6,900,667 (mm8, 25 bp, ≤ 2 mismatch, unique alignment in genome)   10,675 peaks
CTCF ChIP-Seq [ES cells]73           7,694,241 (mm8, 25 bp, ≤ 2 mismatch, unique alignment in genome)   52,306 peaks
Oct4 ChIP-Seq [ES cells]72,73        14,211,918 (mm8, 25 bp, ≤ 2 mismatch, unique alignment in genome)  33,770 peaks
Sox2 ChIP-Seq [ES cells]73           9,734,581 (mm8, 25 bp, ≤ 2 mismatch, unique alignment in genome)   8,723 peaks
Nanog ChIP-Seq [ES cells]72,73       20,068,373 (mm8, 25 bp, ≤ 2 mismatch, unique alignment in genome)  33,754 peaks
Tcf3 ChIP-Seq [ES cells]72,73        8,141,967 (mm8, 25 bp, ≤ 2 mismatch, unique alignment in genome)   13,723 peaks
CTCF ChIP-Seq [CD4+ Tcells]70        2,947,043 (hg18, mapped by authors)                                26,447 peaks
RNA Pol II ChIP-Seq [CD4+ Tcells]70  4,150,378 (hg18, mapped by authors)                                15,664 peaks
Foxa2 ChIP-Seq [Liver]75             6,505,846 (mm8, 25 bp, ≤ 2 mismatch, unique alignment in genome)   11,472 peaks

Figure 3.1: Enriched motifs found using de novo motif finding in Foxa2 ChIP-Seq peaks. Known motifs correspond to the TRANSFAC identifiers found in column 1. FOX is the known binding site for forkhead transcription factors, while HNF4a, HNF1, and CEBP correspond to factors known to play a role in hepatocyte development.

Figure 3.2: Use of differential motif discovery to identify elements specifically enriched in n-Myc vs. c-Myc ChIP-Seq peaks. (a) Top motifs found by analyzing n-Myc specific peaks versus a random genomic background or by using peaks bound by both c-Myc and n-Myc as background. (b) Plot depicting the log2 tag counts in peaks from both n-Myc and c-Myc experiments. Myc peaks that were co-bound by CTCF (>20 CTCF ChIP-Seq tags) are shown in red. X,Y coordinates were locally randomized to avoid redundant data positions due to the digital nature of the tag counts.

Figure 3.3: Use of TSS-based normalization to identify meaningful co-factors in ChIP-Seq data. Motifs found enriched in c-Myc ChIP-Seq peaks relative to a randomly selected genomic background matched for CpG content (a) or matched for CpG content and distance from the TSS (b). Matching peaks for their distance to the TSS means that for experiments highly localized to the TSS (like c-Myc), non-bound TSS proximal sequences will be used as a background to eliminate the enrichment of potentially non-specific TSS proximal motifs like ETS, NRF1, YY1, etc.

Figure 3.4: TSS-based normalization of RNA polymerase II ChIP-Seq. TSS-based normalization was used to identify distal regulatory motifs that recruit RNA polymerase II to putative enhancers. Motifs enriched in RNA polymerase II ChIP-Seq from CD4+ Tcells when using CpG content matched background (a) or CpG and TSS matched background (b). Asterisks (*) denote motifs not observed without using a TSS matched background.

Figure 3.5: ChIP-Seq tag distribution relative to motifs. (a) Schematic demonstration of the relative positions of 5’ and 3’ sequencing reads relative to the location where the transcription factor is cross-linked to the DNA. While all sequencing tags are read from the 5’ end, a 3’ read implies that the tag would map to the reverse strand of DNA. (b) Tag distribution measured at Foxa2 ChIP-Seq peaks relative to either the FOX or HNF4 motif. The large drop-off in 5’ or 3’ tag counts near the FOX motif is indicative of the protein binding to that location.

Figure 3.6: Co-occurrence between RNA polymerase II enriched motifs. (a) Strongly co-occurring motifs and those in divergent peaks are shown as green and red respectively (bright green and red p-value < 10^-10). ETS1 corresponds to the modified ETS motif (CAGGAAGT) while ETS corresponds to the promoter proximal ETS motif (CCGGAAGT). (b) Distribution of H3K4me3 nucleosomes near RNA polymerase II peaks centered on different motifs. Distal elements, particularly RUNX1 and AP-1, do not show appreciable enrichment for H3K4me3 in their vicinity.

Figure 3.7: Co-occurrence between RNA polymerase II enriched motifs at promoters. (a) Co-occurrence between RNA polymerase II enriched motifs at promoters only (-500

corresponding to one helical turn of DNA, which places elements on a consistent “side” of the DNA molecule.

Figure 3.8: Analysis of CTCF motif in CTCF bound regions. (a) Alignment of auxiliary motifs identified when analyzing CTCF ChIP-Seq peaks from both murine embryonic stem cells and human CD4+ Tcells. Nucleotide frequencies relative to CTCF motifs in regions bound (b) or unbound (c) in CD4+ Tcells. (d) Distribution of CTCF ChIP-Seq and DNaseI-Seq tags from CD4+ Tcells. (e) Distribution of MNase-Seq tags around bound and unbound CTCF motifs in CD4+ Tcells.

Figure 3.9: Motifs for Oct4, Sox2, Nanog and Tcf3 in embryonic stem cells. (a) Visualization of ChIP-Seq peaks for Nanog, Oct4, Sox2, and Tcf3 at the Sox2 locus using the UCSC Genome browser. Binding profiles for each of the factors are highly similar but not identical. (b) Motifs found enriched in Oct4, Nanog, Sox2, and Tcf3 specific peaks. The Oct4-Sox2 composite motif was found in the analysis of total Oct4 peaks and is representative of previous reports from the literature.

Figure 3.10: Difference between Sox2 and Tcf3 binding patterns. (a) Composite motifs identified in Sox2 or Tcf3 specific peaks. (b) Plot of Sox2 and Tcf3 peaks showing their relative tag counts for Tcf3 and Sox2 (log2). X,Y coordinates were slightly randomized to avoid redundant data positions due to the digital nature of the data. (c) Biological processes that are functionally enriched in the sets of genes near Sox2 or Tcf3 specific peaks.

Figure 3.11: Dinucleotide frequencies and ChIP-Seq tag distributions relative to Nanog motifs in Nanog bound regions. A/T and C/G frequency correspond to the combined frequency of AA/AT/TA/TT or CC/CG/GC/GG dinucleotides. 10 bp fluctuations in A/T and counterphased C/G are indicative of nucleosome positioning sequences.

Figure 3.12: Distribution of SOX and OCT motifs relative to Nanog in Nanog bound or non-bound regions.


Figure 3.13: Nucleosome positioning patterns near Sox2 and Oct4 peaks. A/T dinucleotide frequencies (AA/AT/TA/TT) relative to SOX motifs in Sox2 bound peaks (a) or OCT motifs in Oct4 bound peaks (b). In each case, Sox2 or Oct4 bound peaks were subdivided based on the log2 ratio of either Sox2 or Oct4 ChIP-Seq tags to Nanog ChIP-Seq tags. Dark blue (negative log ratio) corresponds to regions with more Nanog ChIP-Seq tags than Sox2 or Oct4, while red corresponds to regions that are specific to Sox2 or Oct4 with a lack of Nanog co-binding.

Figure 3.14: Nucleosome positioning patterns near Nanog peaks. A/T dinucleotide frequencies (AA/AT/TA/TT) relative to Nanog motifs in Nanog peaks. Peaks were subdivided based on the strength of Nanog binding based on the log2 number of ChIP-Seq tags per peak. Blue corresponds to low confidence peaks and red to high confidence peaks.

Figure 3.15: Structural model for Nanog binding. Spikes in A/T dinucleotides reveal the positions where the minor groove of DNA faces the nucleosome relative to the Nanog motif (Figure 3.14), indicating that the homeodomain of Nanog can bind nucleosomal DNA. The preferred positions of SOX and OCT (POU half) motifs relative to the Nanog motif imply that a nucleosome would impair their binding if present. The nucleosome structure is from Luger et al.93 and the Oct1:Sox2 structure is from Williams et al.92

Chapter 4: Nucleosome Positioning Sequences in the Human Genome

4.1 Abstract

Nucleosomes are the basic unit of chromatin, impacting a wide variety of cellular processes including transcription, replication, and DNA repair. We analyzed high throughput sequencing tags from DNase I and MNase treated chromatin to define nucleosome-positioning sequence (NPS) patterns for the human genome. Human NPS are characterized by 10 bp periodic fluctuations in T/A and G/C nucleotides in sequences with low overall G/C content, resembling patterns that have been previously discovered for other organisms. This pattern smoothly shifts to predominately fluctuate between purines (A/G) and pyrimidines (C/T) in regions with increasing G/C content, which are typically found near promoters. Analysis of mononucleosome sequences from C. elegans reveals this GC dependent pattern is species independent and likely a general effect due to the biochemical properties of the DNA itself. In summary, we report a novel shift in NPS patterns as a function of G/C content, suggesting an emphasis on bending around the grooves of DNA at low G/C content and around the phosphate-deoxyribose backbone at high G/C content.


4.2 Introduction

All eukaryotic organisms package their genomic DNA as chromatin in the nucleus. The nucleosome is the most basic unit of chromatin, composed of two copies each of four highly conserved histone proteins (H2A, H2B, H3, and H4). Approximately 147 bp of DNA is tightly wrapped around each nucleosome, forming nearly two complete loops around the exterior of the histone octamer94. Nucleosomes are usually separated by 10 to 60 bp of linker DNA, with continuous stretches of DNA wrapped around nucleosomes forming 30 nm chromatin fibers, although the exact structure of these fibers and other higher order structures is still unknown95.

Some nucleosomes have high affinity for specific regions of DNA, implying the existence of nucleosome positioning sequences (NPS) that specifically place nucleosomes along the DNA96. The sequence of the DNA can play a role in meeting the thermodynamic requirements of the extreme curvature and kinking required of DNA that must wrap tightly around the histone octamer. A preference for A/T rich sequences at the inward facing minor groove at 10 bp intervals (one helical turn of DNA) has been shown to help the DNA fold around the nucleosome, decreasing the free energy of association93,97-99. Due to the repeating nature of this pattern, it is favorable for a nucleosome to assume positions at intervals of ~10 bp, largely conserving the nucleotide specific bending of DNA around the histone octamer in each conformation. For this reason, it is often useful to define both a translational position (location along the DNA) as well as a rotational position (helical orientation along the DNA) when describing a nucleosome’s position97.

Chromatin plays an essential role in regulating key cellular functions through the covalent modification of histones and the structural repositioning of nucleosomes and other DNA-associated proteins100. Extensive work has linked the methylation and acetylation of key residues on histone tails to transcriptional states101. Trimethylation of lysine 4 and acetylation of lysine 9 on histone H3 predominately occur in promoter regions and are associated with transcriptional activation102. In contrast, trimethylation of lysine 27 and lysine 9 are indicative of transcriptional repression and dense heterochromatin103,104. H2A.Z, H2A.X, and H3.3 are examples of variant histone proteins that substitute for standard histones to accomplish special functions in transcription, DNA repair, and replication, respectively105,106. These histone modifications and replacements serve as molecular beacons and scaffolds for a variety of specialized proteins, encoding additional information along the DNA strand and collectively forming the basis of the histone code hypothesis107.

Transcription is particularly sensitive to nucleosome position. Active promoters contain a nucleosome free region immediately upstream of the transcription start site (TSS), making it accessible to RNA polymerase II and other general and sequence specific transcription factors108. Silenced or repressed genes typically have nucleosomes along the length of the promoter, inhibiting access of the general transcriptional machinery. Functional cis-regulatory elements are frequently found in linker regions between nucleosomes or exposed along the nucleosome edge99. Many examples of gene activation require chromatin-remodeling enzymes that shift nucleosomes or remove them to make the chromatin accessible for transcription100,109. Understanding the role and extent to which the genetic material encodes nucleosome positions is central to understanding this process.

Development of massively parallel signature sequencing (MPSS) technology has enabled the first experiments measuring nucleosome location at base-pair resolution on a genome-wide level. Micrococcal nuclease (MNase) is used to degrade DNA not wrapped around the histone octamer, resulting in ~150 bp fragments that are then sequenced and mapped back to the reference genome70,99. Alternatively, DNase I cleaves DNA at accessible regions, and has been used extensively to identify functional DNA regions such as promoters, enhancers, and insulators87. While nucleosomes typically protect DNA from cleavage, DNase I can nick DNA on the nucleosome where the minor groove is exposed, providing a useful tool for defining the rotational position of a nucleosome.

In this study, we seek to define the genetic determinants that underlie nucleosome positions in the human genome. While previous studies have extensively characterized NPS in yeast, a comparable analysis in human has not been performed. We leveraged previously published large-scale experiments using MNase and DNase I treated chromatin from CD4+ T-cells70,87,88,110 to determine the composition of NPS in the human genome. We discovered a shift in NPS patterns as a function of GC content. We also observed this pattern in recently published MNase-Seq data from C. elegans111, suggesting that this pattern is species independent. In summary, this study outlines general sequence preferences for the positioning of nucleosomes in the human genome.

4.3 Methods

4.3.1 Analysis of nucleosome positions

A description of all high-throughput sequencing datasets, alignment information, and individual analysis details are available in Table 1. All sequences were mapped using Eland (Illumina). In each case, the 5’ end of the fragment was taken to represent the cleavage location by MNase or DNase I.
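As an illustration of how a cleavage position might be derived from an aligned tag, the following Python sketch returns the coordinate of the 5’ end of a read. It assumes the alignment is reported as a leftmost coordinate, a read length, and a strand; the function name `five_prime_position` is hypothetical and is not part of Eland or of the pipeline used here.

```python
def five_prime_position(start, length, strand):
    """Return the genomic coordinate of a read's 5' end.

    start:  leftmost aligned coordinate of the read (any consistent convention)
    length: aligned read length in bp
    strand: '+' or '-'
    """
    if strand == "+":
        return start                # forward reads begin at their leftmost base
    if strand == "-":
        return start + length - 1   # reverse reads begin at their rightmost base
    raise ValueError(f"unknown strand: {strand!r}")
```

For a 25 bp tag aligned at position 100, this gives 100 on the forward strand and 124 on the reverse strand.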

4.3.2 MNase and DNase I sequence bias and normalization

In addition to providing valuable insight into the genetic mechanisms underlying nucleosome positioning, high throughput sequencing data contains information describing the behavior of the enzymes used to carry out the experiment. While DNase I and MNase are generally considered non-sequence specific enzymes whose activity is primarily dependent on the chromatin structure of the DNA, affinity for specific sequences has been documented for both enzymes112-114. Since millions of tags for both enzymes were sequenced, a high-resolution analysis of binding preferences can be done to identify precise sequences that are favored by each enzyme. The average nucleotide frequencies for MNase and DNase I generated sequence tags in their genomic context are shown in Figure 4.1. DNase I tags demonstrate a high affinity for the consensus ACTGAT, while MNase tags prefer the consensus (A/T)3GGG.

This information has the potential to quantify the exact sequence preferences for each enzyme, but this approach is confounded by the complexity of the experimental procedures used to generate sequence tags. Treatment of DNA with DNase I produces fragments with 3’ overhangs of approximately 3 bp, which are then removed using the 3’-> 5’ exonuclease activity of T4 DNA polymerase. A linker is then ligated to the fragment, which is amplified and sequenced. Many of these steps have the potential to introduce sequence bias in the sample. The clearest evidence for this complex bias is the fact that the sequence preferences for DNase I are not palindromic. Cleavage by DNase I results in two exposed fragments of DNA, both of which have an equal likelihood of being included in the final sequencing reaction, implying the exonuclease or ligation steps may prefer specific sequences. Furthermore, sequences destined for DNase I Illumina sequencing were processed into ~20 bp tags using the Mme1 enzyme, which cleaves 20 bp 3’ from its recognition site. This in turn introduces its own sequence bias toward the end of each tag, resulting in a spike in the preference for G 19 bp from the start of the tag (Figure 4.1a). Comparison of the Solexa and 454 DNase I tags indicates the Mme1 tag creation bias resulted in the general selection of fragments with significantly higher GC content (49% vs. 43%), significantly biasing the entire sample (Figure 4.1a,b). A similar effect is seen with the 5’ RNA sequencing method known as CAGE4, which also uses Mme1 to create 20 bp tags (Figure 4.1d). This produces a similar biased effect when compared to EST data as a measure for transcriptional initiation (data not shown).

Another approach is to consider the collective bias of the experimental procedure and correct the observed tag counts based on their expected frequency, effectively normalizing tags based on sequence bias at the edge of each tag. We analyzed the bias in MNase generated tags by calculating the relative representation of 9 bp oligos found from –2 to +6 bp, representing the region with the most extreme bias. The most overrepresented oligo, TATTGCGCG, was found enriched over 30-fold relative to the surrounding sequence. We adjusted the tag count at each position by normalizing it by the ratio of observed to expected frequency of the oligo found at the cleavage site. Confident nucleosome positions were then taken as positions with more than one adjusted tag.
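The normalization described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the code used in this study: "normalizing" is interpreted as dividing the raw count by the observed/expected ratio, the expected frequency is estimated from oligos sampled in the surrounding sequence, and the function names `oligo_bias` and `adjust_tag_counts` are hypothetical.

```python
from collections import Counter

def oligo_bias(site_oligos, background_oligos):
    """Observed/expected frequency ratio for each oligo seen at cleavage sites.

    site_oligos:       oligos spanning each tag's cleavage site (one per tag)
    background_oligos: oligos sampled from the surrounding sequence
    """
    obs = Counter(site_oligos)
    exp = Counter(background_oligos)
    n_obs, n_exp = sum(obs.values()), sum(exp.values())
    ratios = {}
    for oligo, count in obs.items():
        if exp[oligo] > 0:  # skip oligos never seen in the background
            ratios[oligo] = (count / n_obs) / (exp[oligo] / n_exp)
    return ratios

def adjust_tag_counts(tag_counts, oligo_at, ratios):
    """Divide each position's raw tag count by the bias ratio of its oligo.

    tag_counts: {position: raw tag count}
    oligo_at:   {position: oligo spanning the cleavage site at that position}
    Oligos without an estimated ratio are left unadjusted (ratio 1.0).
    """
    return {pos: n / ratios.get(oligo_at[pos], 1.0)
            for pos, n in tag_counts.items()}
```

Under this interpretation, a position whose oligo is overrepresented 3-fold and that carries 6 raw tags would be assigned 2 adjusted tags.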

4.3.3 Fourier Analysis

To provide an unbiased estimate of the strength of periodic signals in various data, we performed spectral analysis using the Fourier transform. In each case the mean was subtracted from the data and the Discrete Fourier Transform was calculated for frequencies corresponding to wavelengths from 2 bp to 25 bp in increments of 0.01 bp. The resulting complex number was then multiplied by its complex conjugate to reveal the power (squared amplitude) of the signal at a given frequency.

$$F(\omega) = \frac{1}{b-a}\sum_{k=a}^{b} x_k\, e^{i 2\pi \omega k}, \qquad P(\omega) = \frac{\left|F(\omega)\right|^{2}}{2\pi}$$

In the equation above, F(ω) is the normalized Discrete Fourier Transform, P(ω) is the spectral density, ω is the signal frequency (cycles/bp), and x_k is the data at position k ranging from positions a to b.
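The spectral calculation defined above can be sketched in plain Python as shown below. The function name `power_spectrum` and its parameter names are assumptions introduced for illustration; the wavelength grid (2 bp to 25 bp in 0.01 bp steps), the mean subtraction, and the 1/(b − a) and 1/2π normalizations follow the description in the text.

```python
import cmath
import math

def power_spectrum(x, a=0, min_wavelength=2.0, max_wavelength=25.0, step=0.01):
    """Spectral density of a mean-subtracted signal over a wavelength grid.

    x: sequence of values for positions a, a+1, ..., b (at least two values)
    Returns a list of (wavelength_bp, power) pairs.
    """
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    span = n - 1  # equals b - a
    n_steps = int(round((max_wavelength - min_wavelength) / step))
    spectrum = []
    for j in range(n_steps + 1):
        wavelength = min_wavelength + j * step
        omega = 1.0 / wavelength  # frequency in cycles per bp
        f = sum(v * cmath.exp(2j * math.pi * omega * (a + k))
                for k, v in enumerate(centered)) / span
        power = (f * f.conjugate()).real / (2 * math.pi)
        spectrum.append((wavelength, power))
    return spectrum
```

The dominant period of a profile can then be read off as `max(spectrum, key=lambda p: p[1])[0]`. Evaluating the transform directly on a wavelength grid, rather than with an FFT, allows the fine 0.01 bp spacing described above at the cost of speed.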

4.4 Results

4.4.1 DNase I Defined NPS

We analyzed high throughput sequencing tags generated from DNase I treated chromatin87 to identify putative NPS in the human genome. Nucleosomes typically protect DNA from cleavage, but DNase I can nick DNA on the nucleosome where the minor groove is exposed, revealing the rotational position of the nucleosome relative to the DNA. This enzymatic activity was observed as an increased likelihood of tags spaced one helical turn of DNA away from one another87. To investigate the sequence characteristics surrounding DNase I cleavage sites, we extracted genomic sequence from –200 to +200 bp relative to each of the 15 million mapped tags produced by Illumina (a.k.a. Solexa) sequencing.

Alignment of all 15 million sequences produced striking 10 bp periodicities in the frequency of each nucleotide, which is indicative of NPS (Figure 4.2a). Although the amplitude of this oscillation was relatively modest, it is notable given that many DNase I cleavage sites occur in linker regions or nucleosome free regions that do not contain information about the rotational setting of a nucleosome. Additional extreme nucleotide fluctuations near the cleavage site and ~18 bp downstream were likely introduced by sequence specificity in the enzymatic cleavage and ligation reactions (Figure 4.1). The 10 bp periodicity observed for each nucleotide gradually decays over approximately 140 bp in each direction from the cleavage site. This is expected since DNase I nicks can occur at any position along the nucleosome. Similar results were seen when analyzing data produced from 454 Sequencing (data not shown).

The observed NPS varied subtly depending on the GC content (overall fraction of G+C) of the sequence. This difference is particularly important since a vast number of mammalian promoters occur in CpG islands where GC content is commonly greater than 60%. To investigate this difference, we calculated the GC content of each of the 15 million regions and segregated sequences by their overall GC content into bins of 0-10%, 10-20%, etc. Two representative bins, 30-40% and 60-70% GC, were similar to the genomic average and CpG islands, respectively (Figure 4.2b). A/T-rich sequences were characterized by the pairing of A and T nucleotides oscillating out of phase with C and G nucleotides. In GC-rich sequences this pairing shifts to A and G nucleotides oscillating out of phase with C and T nucleotides. Analysis of each nucleotide over the full range of GC bins (>1 million tags each) revealed that the phase of each nucleotide oscillation smoothly shifted 5’ (G, T) or 3’ (A, C) as the GC-content of the sequence increased (Figure 4.3a).
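The binning step described above can be illustrated with the following Python sketch. The function names `gc_bin` and `bin_sequences`, and the choice to ignore ambiguous bases and to place 100% GC sequences in the top bin, are assumptions made for illustration rather than details taken from the analysis.

```python
def gc_bin(seq, width_pct=10):
    """Return the lower edge (in percent) of the GC bin containing seq.

    Only A, C, G and T are counted; returns None if the sequence contains
    none of them. Integer arithmetic avoids floating-point edge effects at
    bin boundaries such as exactly 30% GC.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        return None
    pct = 100 * gc // total                 # integer floor of the GC percentage
    top = 100 - width_pct                   # 100% GC joins the top bin
    return min(pct // width_pct * width_pct, top)

def bin_sequences(sequences, width_pct=10):
    """Group sequences by GC bin: {bin_lower_edge_percent: [sequences]}."""
    bins = {}
    for seq in sequences:
        edge = gc_bin(seq, width_pct)
        if edge is not None:
            bins.setdefault(edge, []).append(seq)
    return bins
```

Nucleotide frequency profiles can then be computed separately within each bin, for example for the 30 and 60 keys corresponding to the 30-40% and 60-70% GC groups.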

These results suggest that the NPS found in human for sequences with GC-content near and below the genomic average (~45% GC) are similar to those described previously97-99. The hallmark features of these NPS are the ~10 bp periodicity of A/T dinucleotides in positions where the minor groove faces inward and the 5 bp offset of G/C dinucleotides where the minor groove faces outward. Both of these general observations are strongly supported by dinucleotide frequencies in the human DNase I data (Figure 4.3b). A/T dinucleotides generally form a tighter minor groove due to the less favorable base stacking energies in the major groove, producing the negative roll in the DNA structure that facilitates bending toward the minor groove. These structural properties are the opposite for G/C dinucleotides, creating a stable bend in a single direction when found staggered every 10 bp along the helical DNA.

In contrast to previous descriptions of NPS, human nucleosomal DNA with high GC content exhibits a 10 bp oscillation between purines (A/G) and pyrimidines (C/T) in positions where the minor/major grooves face orthogonal to the surface of the histone core. Our analysis demonstrated that purines (G/A) were preferred on the 5’ strand as the minor groove of the DNA transitioned from making contact with the histone octamer to being exposed. During this transition, the 5’ strand of DNA was on the inside of the superhelix with its phosphate backbone making contact with the histone core. Previous work analyzing free oligonucleotide structures showed that dinucleotide base-steps between purines generally preferred a negative tilt115, which facilitates the bending of DNA around the purine side of the double helix (as opposed to bending around the minor groove for A/T dinucleotides, otherwise referred to as roll). This is consistent with the type of conformation needed for the DNA to smoothly wrap around the underlying histones. If we continued to follow the 5’ strand approximately 5 bp from the preferred purine position, we observed a preference for pyrimidines where the 5’ strand of DNA was in the opposite conformation with the phosphate backbone facing outward. Base pairing dictates that purines will be on the opposite strand, reinforcing the thermodynamic preference for nucleotides that naturally prefer to bend DNA in the direction needed for nucleosome formation.

4.4.2 MNase Defined NPS

We next analyzed high throughput sequencing of MNase digested chromatin to independently validate our findings for DNase I treated chromatin. While DNase I is useful for determining the rotational position of a nucleosome, MNase has the potential to define both the translational and rotational positions of a nucleosome by defining the exact boundary between nucleosomal and linker DNA. We analyzed MNase-Seq data comprised of 300 million sequencing tags derived from CD4+ Tcell chromatin88, as well as MNase treated ChIP-Seq data for several specific histone modifications70,110.

Analysis of the nucleotide frequency surrounding MNase-Seq tags revealed a ~10 bp periodic fluctuation for each nucleotide starting at the tag and lasting approximately 150 bp (Figure 4.4a). Extreme nucleotide frequencies from –5 to +10 from the MNase cleavage site were not seen on the other side of the NPS (i.e. +140 to +155), suggesting these were artifacts of MNase or tag preparation. The pattern seen around MNase-Seq tags resembled that seen near DNase I tags, although in this case the pattern was confined to the region of sequence expected to wrap around the histone core. Separation of nucleosomal DNA based on GC-content revealed a shift in nucleotide frequency patterns similar to that observed in the DNase I data (Figure 4.4b).

Of the 39 different histone modification ChIP-Seq datasets70,110 examined, only 19 produced nucleotide frequency fluctuations downstream of tag positions representative of NPS. Unfortunately, it is unclear whether the lack of NPS reveals anything about how DNA may be packaged differently in nucleosomes harboring different histone modifications. Differing concentrations of MNase or incubation times can lead to under or over-digested chromatin, and without precise controls it is difficult to comment on the differences observed. NPS observed in many of the modified nucleosomes, exemplified by H3K4me3, are GC-rich and mirror the oscillation between purine and pyrimidine residues found in GC rich sequence using DNase I and MNase-Seq data (Figure 4.5). This might be expected given the strong enrichment for H3K4me3 and other modified histones in CpG Islands located at the 5’ end of genes.

An anomaly exists in the mononucleosomal NPS at approximately +/- 2 helical turns from the dyad (symmetrical center of the nucleosome ~74 bp). At this position, oscillations in either G/T (+55bp) or A/C (+95bp) unexpectedly flattened relative to nearby regions. Multiple studies analyzing the crystal structures of nucleosomes have identified extreme properties in the bending of nucleotides at these positions116,117. Most notably, Ong et al.117 have identified these regions as stretch points when DNA of different sizes is crystallized with the histone core. In addition, several studies have observed that while the curvature of DNA in a nucleosome is largely symmetric around the dyad, the largest differences between symmetric halves usually occur at +/- 2 helical turns from the dyad94.

4.4.3 Nucleosome Positioning in C. elegans

A general lack of evidence for alternative NPS patterns in large-scale yeast studies raised the possibility that NPS patterns in GC-rich sequences may be unique to mammals and potentially serve regulatory roles in CpG islands. A recent MNase-Seq study in C. elegans allows us to explore the possibility that this pattern may be present in species without CpG islands111. We analyzed over 43 million MNase-Seq tags to confidently identify nucleosomal DNA in the C. elegans genome. Nucleotide frequency patterns derived from nucleosomal DNA were similar to those derived from human MNase-Seq data (Figure 4.6a). Surprisingly, we identified a similar shift in nucleotide frequency patterns as a function of GC-content (Figure 4.6b), although very few sequences with GC-content higher than 60% were present in the C. elegans genome. This suggests the GC-dependent shift in NPS patterns is likely species independent.

4.4.4 Frequency Spectrum Analysis

We next employed unbiased Fourier analysis of the nucleotide frequency to quantify the periodic patterns observed in GC-rich DNA. Fourier analysis decomposes a signal into its sinusoidal frequency components and can be useful for identifying frequencies that resonate with fluctuations in the data. Using H3K4me3 modified nucleosomes as a model, we analyzed sequences from +20 to +150 bp relative to the edge of the nucleosome in order to avoid bias introduced by MNase cleavage. The power spectra for mononucleotides and dinucleotides are reported in Figure 4.7 and primarily show strong signals in C, G, CC, and GG, which is consistent with our earlier observations (Figure 4.4).

For purposes of comparison, we applied Fourier analysis to nucleosomal DNA from the 30-40% GC and 60-70% GC groups of nucleosomes. This analysis confirmed that the primary signals in 30-40% GC nucleosomal DNA were A/T dinucleotides at a period of 10.18 bp. The strongest signals in GC-rich nucleosomal DNA were A/G (purine) and C/T (pyrimidine) dinucleotides, oscillating counterphase to one another with a period of 10.13 bp (Table 4.2). These periods are similar to the previously reported value of 10.2 bp97; because we removed the first 20 bp of each sequence on account of MNase bias, no appreciable difference from that value can be concluded.
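The kind of spectral estimate described in this section can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes the position-specific frequency of one nucleotide (or dinucleotide group) is already available as a numeric array, and it evaluates the Fourier projection on a fine grid of candidate periods, since the standard FFT bins of a ~130 bp window are too coarse to separate periods such as 10.13 and 10.18 bp. The function name dominant_period, the grid, and the synthetic profile are all hypothetical.

```python
import numpy as np

def dominant_period(profile, periods):
    """Return (period, power, phase_bp) for the strongest candidate period."""
    x = np.asarray(profile, dtype=float)
    x = x - x.mean()                      # drop the constant (DC) component
    n = np.arange(x.size)
    periods = np.asarray(periods, dtype=float)
    # Project the profile onto a complex sinusoid at each candidate period.
    coeffs = np.array([np.sum(x * np.exp(-2j * np.pi * n / p)) for p in periods])
    power = np.abs(coeffs) ** 2 / x.size
    best = int(np.argmax(power))
    # Offset (bp) of the first cosine maximum; lies within +/- period/2.
    phase_bp = -np.angle(coeffs[best]) * periods[best] / (2 * np.pi)
    return float(periods[best]), float(power[best]), float(phase_bp)

# Synthetic 130 bp profile (standing in for +20..+150) with a 10.2 bp oscillation
# whose cosine maximum is offset by 3 bp. These values are illustrative only.
positions = np.arange(130)
profile = 0.25 + 0.02 * np.cos(2 * np.pi * (positions - 3.0) / 10.2)
period, power, phase = dominant_period(profile, np.arange(8.0, 13.0, 0.01))
print(round(period, 2), round(phase, 2))
```

The phase is returned in bp relative to a standard cosine wave, matching the convention of Table 4.2, so values near +period/2 and -period/2 describe the same signal.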

4.5 Conclusions

Using high throughput sequencing data from MNase and DNase I treated chromatin, we describe nucleosome positioning sequence (NPS) patterns in the human genome. We identified two independent signals in our analysis that both contain oscillating nucleotide patterns with a period of approximately 10 bp. The first resembles traditional NPS, consisting of A/T and C/G dinucleotides oscillating out of phase with each other, with A/T dinucleotides taking up positions where the minor groove faces toward the histone core. The second pattern appears in GC-rich DNA and consists of purines and pyrimidines oscillating out of phase with each other, with purines preferentially on the inside of the DNA helix as it bends around the histone core (Figure 4.8). Furthermore, a seamless transition between these patterns is seen as the GC-content of the DNA changes. Since the NPS patterns derived from different enzymes and experimental techniques are virtually the same, sequence bias introduced by either assay can be safely ruled out as the source of these results.

Our analysis suggests that the general make-up of these NPS patterns is in agreement with the properties of nucleotide bending. This is in contrast to the view that the specific placement of nucleotides is necessary to make structural contacts with the histone core. It also predicts that the DNA would naturally bend without the histone octamer present, albeit most likely to a lesser degree. The ideal supercoil of DNA would have properties favoring roll and tilt sinusoidally phased down the length of the DNA (5’-> 3’) such that negative roll (T/A), negative tilt (A/G), positive roll (G/C), then positive tilt (C/T) would be preferentially positioned every ~2.5 bp. Our analysis reveals that the preferential position of each nucleotide is in agreement with its measured effect on the roll and tilt of the DNA.

Existing crystal structures attribute a majority of the DNA curvature in nucleosomes to roll over tilt at a ratio of 2:1, but these structures are limited to DNA that is AT-rich93,116. This is generally supported in low GC-content sequences, where A/T and G/C are preferentially found where the minor and major grooves, respectively, are on the inside of the bending DNA. As the GC-content increases, nucleotide preferences shift to purines and pyrimidines in regions where the major and minor grooves are orthogonal to the histone core. This suggests that sequences with high GC content rely more heavily on tilt for their curvature. While these observations imply that A/T prefer to contribute roll and G/C prefer to contribute tilt to the DNA superhelix, the shift of A/T nucleotides to regions dictating tilt in sequences with high GC content (and vice versa) suggests that excessive tilt and roll are not tolerated in the same regions, forcing NPS to choose between tilt or roll to successfully wrap around the histone core.

The identification of phased NPS downstream from sites of transcriptional initiation provides a key link between chromatin and gene expression (Chapter 5). Taking into account that the first downstream nucleosome is also one of the most heavily modified by H3K4me3 and H3K9ac70,110, it is likely that the NPS downstream of the initiation site plays a key role in structurally constraining the position of the nucleosome, enabling it to form a stable complex with components of the transcriptional initiation machinery. This underscores the importance of knowing that the NPS varies with GC content: most human promoters reside in CpG islands, where this variation is extremely relevant for understanding chromatin structure and gene regulation.

It should be noted that the NPS presented here is based on a composite signal averaged over thousands (or even millions) of sequences. The obvious features that emerged from this analysis correlated best with the general properties of DNA bending that describe the roll and tilt of base-pair steps relative to one another. However, individual NPS may rely on a variety of structural properties to stably wrap around the histone core. Numerous crystal structures have demonstrated that DNA shifts, slides, and twists its way to an optimal conformation that may only partially conform to the requirements of roll and tilt needed to smoothly curve throughout the nucleosome94,116.

Although it has been suggested that histone modification may vary the configuration of DNA along the nucleosome, this is difficult to verify given current data. Varying concentrations and reaction times with MNase can digest chromatin to various degrees. Good examples of under- and over-digested chromatin are MNase-Seq in human and H2A.Z ChIP-Seq in Drosophila, respectively. While variation between different chromatin marks is seen, it is likely that most of these differences are due to different experimental conditions and the underlying GC-content of the DNA and not to the histone modification itself.

Chapters 4, 5, and 6, in part, will be submitted for publication. Benner, C., Garcia-Bassets I., Heinz S., Kadonaga J.T., Rosenfeld M.G., Subramaniam S., Glass C.K. Identification of a conserved periodic promoter structure in metazoans. The dissertation author was the primary investigator and author of this paper.

Table 4.1: Summary of high throughput sequencing used in this study

Data Set: Human H3K4me3 mononucleosome ChIP-Seq [CD4+ T cells]70
Total Mapped Tags: 16,845,478 (hg18, tags mapped by authors)
Mapping/Usage Details: Used for promoter nucleosome mapping in human due to high coverage in promoter regions
Bias-Correction Details: MNase bias normalized by considering overrepresented 8-bp oligos [-3,+4] relative to the MNase cleavage site

Data Set: Variety of Histone Marks
Total Mapped Tags: (mapped by authors)

Data Set: Human mononucleosome MNase-Seq [CD4+ T cells]88
Total Mapped Tags: 84,170,052 (hg18, 23 bp, ≤ 2 mismatch, unique alignment in genome)
Mapping/Usage Details: Only perfect matches were used due to adenine bias found in sequence tags (Supp Fig X). Only nucleosomes from the untreated CD4+ sample were used (similar results with TCR treatment)
Bias-Correction Details: MNase bias normalized by considering overrepresented 10-bp oligos [-4,+5] relative to the MNase cleavage site. A larger oligo size could be used due to the availability of more data.

Data Set: Human DNaseI-Seq [CD4+ T cells]87
Total Mapped Tags: 15,100,762 (hg18, 23 bp, ≤ 2 mismatch, unique alignment in genome)
Mapping/Usage Details: None
Bias-Correction Details: DNaseI bias normalized by considering overrepresented 8-bp oligos [-4,+3] relative to the DNaseI cleavage site.

Data Set: C. elegans MNase-Seq mononucleosome [whole worm]111
Total Mapped Tags: 43,991,541 (ce4 [WS170], tags mapped by authors)
Mapping/Usage Details: None
Bias-Correction Details: MNase bias normalized by considering overrepresented 8-bp oligos [-3,+4] relative to the MNase cleavage site

Table 4.2: Spectral analysis of low GC content and high GC content nucleosomes. For each nucleotide, dinucleotide, or group of dinucleotides, the period with the highest spectral power is reported. The phase, reported in bp, gives the offset of the signal relative to a standard cosine wave. When interpreting the phase, the period divided by 2 is the maximum possible phase, which is also equivalent to the minimum possible phase due to the repeating nature of the signal (i.e. if the period is 10, then a phase of 5 = -5, so a phase of –4.89 is roughly equivalent to a phase of 4.9). “A/T Dinuc” represents AA/AT/TA/TT, “C/G Dinuc” represents CC/CG/GC/GG, “A/G Dinuc” represents AA/AG/GA/GG, and “C/T Dinuc” represents CC/CT/TC/TT.

Figure 4.1: Sequence bias introduced by restriction enzymes. Genomic nucleotide frequencies near DNase I cleavage sites in samples prepared with 20 bp tags using an Mme1 enzyme (a). DNase I cleavage sites in a sample that used 454-sequencing for long reads (b). Genomic nucleotide frequencies near MNase cleavage sites (c). Genomic nucleotide frequencies near 5’ RNA (CAGE4) tags (d).

Figure 4.2: DNase I defined NPS. (a) Genomic nucleotide frequency distribution surrounding DNase I tags. (b) Frequency fluctuations in C and G nucleotides from DNase I tags in regions with 60-70% GC content and 30-40% GC content. Frequencies were normalized to the average frequency for each nucleotide in the region. The region from 70 to 120 bp from the DNase I site was used to avoid sequence fluctuations most likely introduced by sequence specificity of the enzyme.

Figure 4.3: Variation of DNase I NPS with GC-content. (a) Variation of genomic nucleotide frequency surrounding DNase I tags as a function of the GC-content of the sequence. Frequencies were normalized to the average frequency for each nucleotide in the region. (b) Traditional nucleosome positioning patterns in low GC (30-40%) or high GC (60-70%) sequences. A/T corresponds to AA/AT/TA/TT dinucleotides and C/G corresponds to CC/CG/GC/GG dinucleotides.

Figure 4.4: MNase defined NPS. (a) Genomic nucleotide frequency derived from nucleosomal DNA defined using MNase-Seq. (b) Variation of genomic nucleotide frequency from nucleosomal DNA as a function of the GC-content of the sequence. Frequencies were normalized to the average frequency for each nucleotide in the region.

Figure 4.5: Mononucleosome NPS defined using MNase digested H3K4me3 positive nucleosomes. The top 30k nucleosomes after MNase normalization were used. Arrows indicate positions where nucleotide frequency fluctuations are unexpectedly flat.

Figure 4.6: MNase defined NPS in C. elegans. (a) Genomic nucleotide frequency derived from nucleosomal DNA from C. elegans defined using MNase-Seq. (b) Variation of genomic nucleotide frequency from nucleosomal DNA as a function of the GC-content of the sequence. Frequencies were normalized to the average frequency for each nucleotide in the region.

Figure 4.7: Fourier analysis of H3K4me3 NPS. Fourier analysis of H3K4me3 nucleosomes of mononucleotides (a) and dinucleotides (b). In this analysis, the Fourier transform is used to decompose the nucleotide content of nucleosomal DNA into its primary frequency fluctuations.

Figure 4.8: Model for GC-rich and AT-rich NPS. Model depicting the preferred localization of nucleotides along the curving structure of nucleosomal DNA. The 5’ end of the red DNA strand is on the left, while the 5’ end of the blue strand is on the right.

Chapter 5: Identification of a Conserved Periodic Promoter Structure in Metazoans

5.1 Abstract

The organization of cis-regulatory elements and their relative roles in directing focused or dispersed patterns of transcriptional initiation remain poorly understood. Analysis of high-throughput sequencing data indicates that approximately 20% of human and mouse and 40% of Drosophila promoters use a focused pattern of initiation and are enriched for position-specific core elements such as the TATA box. Unexpectedly, nearly half of human and mouse promoters and 30% of Drosophila promoters appear to contain nucleosome positioning sequences and rotationally constrained nucleosomes in phase with a selective group of upstream transcription factor binding sites. These features are associated with multiple sites of transcriptional initiation with a periodicity of approximately 10 bp and define a previously unrecognized class of ‘periodic promoters’. Comparison of nucleosome positioning in human and Drosophila periodic promoters revealed a GC-dependent shift in sequence patterns that is predicted to facilitate a corresponding placement of the first downstream nucleosome relative to the sites of transcriptional initiation in both species. In contrast to focused promoters, which preferentially direct developmental and high-magnitude, signal-dependent programs of gene expression, periodic promoters are configured to preferentially direct constitutive expression of genes that support general cellular functions.

5.2 Introduction

Transcriptional initiation is a highly ordered process involving hundreds of proteins and originating from very specific regions in the genome. The information that specifies physiologic programs of gene expression is encoded by short DNA sequences that function as binding sites for general and sequence-specific transcription factors. Large but somewhat independent bodies of work have identified core promoter sequences and distal regulatory elements that operate in a combinatorial manner to recruit specific sets of general transcription factors required for transcription of individual genes6,118. In addition to general transcriptional machinery, sequence-specific transcription factors function to recruit nucleosome remodeling factors and histone modifying enzymes that are necessary to overcome the repressive effects of chromatin5,100.

Two very different general strategies for transcriptional initiation have been observed119. The best-characterized strategy is focused initiation, in which transcription initiates at a single nucleotide or within a narrow region of several nucleotides. This strategy is generally observed in promoters that contain conserved sequence motifs that have a strong position bias with respect to the transcriptional start site. Examples of these motifs include the TATA box120, the Inr121, the DPE122, and the MTE123. These sequences serve to bind and position general transcription factors that ultimately facilitate the recruitment of RNA polymerase II. The second strategy is dispersed initiation, in which transcription initiates at multiple sites over a broad region of approximately 50 to 100 bp. Dispersed transcription frequently occurs in CpG islands, which are GC-rich stretches of DNA (typically 0.5-2 kb in length) that are prevalent in vertebrate genomes and are typically associated with housekeeping genes4,124. Although it is possible that dispersed transcription is due to the presence of multiple weak core promoters, the mechanisms that specify this pattern of initiation remain to be clarified.

Here, we have taken advantage of genome-wide data and new computational approaches to define conserved sequence features of promoters with respect to well-defined genome-wide transcriptional start sites (TSS). We find that focused and dispersed promoters utilize different combinations of sequence elements to preferentially direct the transcription of different functional classes of genes. Unexpectedly, we identify a new subclass of dispersed promoters termed ‘periodic’ promoters defined by transcriptional initiation sites spaced approximately 10 bp apart that are coupled to nucleosome positioning sequences and phased transcription factor binding sites. Our analysis suggests that these binding sites and the first nucleosome downstream relative to the TSS are structurally constrained to play a role in the assembly of the general transcriptional machinery. Furthermore, we present evidence for extensive conservation of these initiation strategies over a range of metazoan organisms. This analysis thus reveals syntactical relationships between functional elements of promoters that underlie the distinct molecular strategies for focused and dispersed patterns of transcriptional initiation.

5.3 Methods

5.3.1 Determination of the TSS

A description of the 5’ RNA sequencing datasets, alignment information, and individual analysis details is available in Table 5.1. EST and mRNA data were downloaded from Genbank125. For each organism we used BLAT126 to align transcripts (5’ EST, mRNA) to the genome or used alignment information from the UCSC Genome Browser127 (http://genome.ucsc.edu/) when available. Only full length mRNA/EST libraries were used when available. Transcripts that did not align precisely at the 5’ end were discarded. Transcripts were then grouped into genes using Unigene (http://www.ncbi.nlm.nih.gov/UniGene/). 5’ RNA sequencing tags were assigned to a Unigene cluster if they were located within 100 bp of the 5’ end of an EST/mRNA on the same strand of DNA. EST and mRNA 5’ ends were then converted into tags, and the total distribution of tags (5’ RNA tags and EST/mRNA) was then analyzed for each Unigene cluster. The 100 bp region containing the highest number of tags for each gene was designated the primary promoter (Figure 5.1a). Additional non-overlapping 100 bp regions with high numbers of tags could be considered alternative promoters, but were discarded in this analysis. The specific position with the highest number of tags (or the most central in the case of a tie) in the primary promoter region was assigned as the TSS.
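The window and TSS selection described above can be sketched as follows. This is an illustrative reading of the procedure, not the code used in this work. It assumes the tags of one Unigene cluster are given as integer 5’ positions on a single strand, and it interprets "most central" as closest to the midpoint of the occupied positions in the window; the text does not define the tie-break more precisely. The function name assign_tss is hypothetical.

```python
from collections import Counter

def assign_tss(tag_positions, window=100):
    """Return (window_start, tss) for one gene's 5' tag positions."""
    counts = Counter(tag_positions)
    positions = sorted(counts)
    best_start, best_total = None, -1
    # The densest window can always be taken to start at an observed tag.
    for start in positions:
        total = sum(c for p, c in counts.items() if start <= p < start + window)
        if total > best_total:
            best_start, best_total = start, total
    in_window = [p for p in positions if best_start <= p < best_start + window]
    top = max(counts[p] for p in in_window)
    tied = [p for p in in_window if counts[p] == top]
    # Assumed tie-break: the tied position nearest the middle of the occupied span.
    midpoint = (in_window[0] + in_window[-1]) / 2
    tss = min(tied, key=lambda p: (abs(p - midpoint), p))
    return best_start, tss

# Toy cluster: a primary promoter near position 100 and a weaker one near 400.
tags = [100] * 3 + [105] * 5 + [110] * 5 + [120] * 5 + [400] * 2
primary_start, tss = assign_tss(tags)
```

In this toy example the tags near position 400 would correspond to an alternative promoter, which the analysis described above discards.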

5.3.2 Analysis of Mononucleosome Positions

A description of the MNase-Seq and DNaseI-Seq datasets, alignment information, and individual analysis details is available in Table 5.1. The composite view of tags relative to the TSS was generated by first adjusting the tags to the center of the nucleosome (i.e. +74 bp on the transcribed forward strand, -74 bp on the reverse strand, Figure 5.3) and then combining nucleosome profiles across all TSS.
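A minimal sketch of this composite calculation is given below. It is an illustration under stated assumptions, not the pipeline used in this work: tags and TSS are represented as hypothetical (chromosome, position, strand) tuples, the ±200 bp flank is an arbitrary choice, and profiles on the minus strand are mirrored so that positive offsets always point downstream of the TSS.

```python
import numpy as np
from collections import defaultdict

def nucleosome_centers(tags, shift=74):
    """Shift each tag 5' end to the presumed nucleosome center (+74 / -74 bp)."""
    centers = defaultdict(list)
    for chrom, pos, strand in tags:
        centers[chrom].append(pos + shift if strand == "+" else pos - shift)
    return {chrom: np.sort(np.array(v)) for chrom, v in centers.items()}

def composite_profile(tss_list, centers, flank=200):
    """Sum nucleosome-center counts at each offset from the TSS over all TSS."""
    profile = np.zeros(2 * flank + 1)
    for chrom, tss, strand in tss_list:
        arr = centers.get(chrom)
        if arr is None:
            continue
        lo = np.searchsorted(arr, tss - flank, side="left")
        hi = np.searchsorted(arr, tss + flank, side="right")
        rel = arr[lo:hi] - tss
        if strand == "-":
            rel = -rel                    # orient offsets in the direction of transcription
        np.add.at(profile, rel + flank, 1)
    return profile

# Toy data: two tags flanking one nucleosome, plus an unrelated distant tag.
tags = [("chr1", 1000, "+"), ("chr1", 1148, "-"), ("chr1", 5000, "+")]
centers = nucleosome_centers(tags)
profile = composite_profile([("chr1", 950, "+")], centers)
```

Index flank + d of the returned array holds the number of nucleosome centers found d bp downstream of the TSS.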

5.3.3 Description of HOMER for de novo promoter motif discovery

Motif discovery was performed using HOMER, a differential motif discovery algorithm described in Chapter 2. Position-specific motif finding was accomplished by examining proximal promoters (-50 to +50 bp from the TSS) one position at a time. Target and background sets were composed of single putative binding sites: the target set contained all 10 bp oligonucleotides located at a given position, and the background set was composed of the remaining oligonucleotides from other positions. This was repeated for each position within the core-promoter region.
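The construction of the position-specific target and background sets can be sketched as follows. The sketch only builds the two oligonucleotide sets and does not call HOMER, whose interface is not described in this section. It assumes each promoter is supplied as a string covering -50 to +50 bp, so that string index 0 corresponds to offset -50; the function name positional_oligo_sets is hypothetical.

```python
def positional_oligo_sets(promoters, position, k=10, upstream=50):
    """Split all k-mers into those starting at `position` (target) and the rest."""
    target, background = [], []
    for seq in promoters:
        for start in range(len(seq) - k + 1):
            oligo = seq[start:start + k]
            offset = start - upstream     # offset of the k-mer start relative to the TSS
            if offset == position:
                target.append(oligo)
            else:
                background.append(oligo)
    return target, background

# Toy promoter of 101 bp covering -50..+50; collect the 10-mers starting at -30.
promoters = ["ACGT" * 25 + "A"]
target, background = positional_oligo_sets(promoters, position=-30)
```

Calling the function once per offset in the core-promoter region reproduces the "repeat for each position" step.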

When looking for proximally enriched motifs, promoter sequences were fragmented into target and background sets, where the target set was composed of all promoter sequences from –150 to +50 bp and the background set contained the remaining 200 bp fragments in the range from –2 kb to +2 kb. This allows HOMER to identify motifs that are specific to TSS regions and not sequences that may function in enhancer roles or are unique to the genomic region surrounding the promoter. Sequence logos were generated using WebLOGO (http://weblogo.berkeley.edu/).
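One way to picture the fragmentation into proximal target and flanking background sets is the sketch below. It is an assumption-laden illustration rather than the procedure actually used: it assumes each promoter is a string spanning -2 kb to +2 kb, tiles 200 bp fragments in register with the -150 to +50 target, and drops the partial fragments left over at either end. The function name fragment_promoter is hypothetical.

```python
def fragment_promoter(seq, flank=2000, size=200, target_start=-150):
    """Return (target_fragment, background_fragments) for one promoter string."""
    def to_index(offset):
        return offset + flank             # string index of a TSS-relative offset

    target = seq[to_index(target_start):to_index(target_start) + size]
    background = []
    # First tile start that is in register with the target and inside the region.
    first = target_start - ((target_start + flank) // size) * size
    for start in range(first, flank - size + 1, size):
        if start != target_start:
            background.append(seq[to_index(start):to_index(start) + size])
    return target, background

# Toy 4 kb promoter region (-2000..+2000).
target, background = fragment_promoter("ACGT" * 1000)
```

With these defaults each promoter contributes one target fragment and 18 background fragments.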

5.3.4 ChIP-Seq for NFY and NRF1

Chromatin immunoprecipitation (ChIP) in MCF-7 cells was performed as described in Garcia-Bassets et al.128. ChIP was performed with 10^7 cells using either anti-NFYB (Santa Cruz Biotechnology, sc-13045) or anti-NRF1 (a generous gift from Dr. Danny Reines). Biological triplicates using each antibody were pooled for sequencing. Single-end sequencing of the ChIP samples using the Illumina Genome Analyzer and ABI SOLiD was carried out according to the manufacturers' specifications. Data generated by the two sequencing platforms were very similar and were pooled for further analysis.

5.3.5 Analysis of ChIP-Seq Datasets

A description of the novel and published ChIP-Seq data, complete with alignment information and individual analysis details, is available in Table 5.1. Peaks were found as described in Chapter 3.

5.3.6 Gene Expression Data

Gene expression data for the response of RAW264.7 macrophages to the TLR4 agonist Kdo2 lipid A were obtained from the LIPID MAPS web site (www.lipidmaps.org).

5.4 Results

5.4.1 Identification of periodic promoters

To elucidate the genetic determinants of focused and dispersed initiation, we assembled available evidence for human, mouse, and Drosophila transcripts using full length mRNA/EST databases and high throughput 5’ RNA tag (20-36 bp) sequencing4,129 to assign the most likely location of initiation and to define the profile of alternative initiation points within each individual promoter (Figure 5.1a). We first defined the transcriptional start site (TSS) as the nucleotide position within a 100 bp window of the 5’ end of an mRNA having the most tags (mRNA/EST and 5’ RNA tag data). We then characterized the tendency for promoters to use focused or dispersed strategies of initiation using a “peak ratio”, defined as the ratio of tags near the TSS (±5 bp) to the surrounding promoter region (±50 bp) (Figure 5.1b-e). Examples of human promoters exhibiting ‘focused’ (Peak Ratio > 75%, CXCL2) or ‘dispersed’ (Peak Ratio < 75%, SLC25A23) sites of transcriptional initiation are shown in Figures 5.2a and 5.2b, respectively. Among promoters for which sufficient transcript evidence was available to distinguish focused or dispersed patterns of initiation (≥15 tags/promoter), approximately 20% of human and mouse promoters and 40% of Drosophila promoters exhibited a focused pattern. DNA sequence alignment of focused human promoters at the TSS revealed strong nucleotide bias associated with the TATA box at –30 bp (Figure 5.2c). A corresponding alignment of focused Drosophila promoters revealed the TATA box and sequence biases associated with the MTE and DPE (Figure 5.2e).
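The peak ratio and the resulting focused/dispersed call can be written down in a few lines. The sketch below is one reading of the definition, with several labeled assumptions: tags are given as offsets relative to the TSS, the ±50 bp denominator includes the ±5 bp tags (so the ratio is a fraction between 0 and 1), the 15-tag minimum is counted within ±50 bp, and a ratio of exactly 75% is treated as dispersed. The function names are hypothetical.

```python
def peak_ratio(tag_offsets, near=5, region=50):
    """Fraction of tags within +/-region of the TSS that lie within +/-near."""
    in_region = [o for o in tag_offsets if abs(o) <= region]
    if not in_region:
        return None
    near_tss = sum(1 for o in in_region if abs(o) <= near)
    return near_tss / len(in_region)

def classify_promoter(tag_offsets, min_tags=15, threshold=0.75, near=5, region=50):
    """Label a promoter from its tag offsets; thresholds follow the text above."""
    n_tags = sum(1 for o in tag_offsets if abs(o) <= region)
    if n_tags < min_tags:
        return "insufficient evidence"
    ratio = peak_ratio(tag_offsets, near, region)
    return "focused" if ratio > threshold else "dispersed"

# Toy promoters: one sharply peaked at the TSS, one spread over several sites.
focused_tags = [0] * 12 + [1] * 2 + [20, -30]
dispersed_tags = [0] * 5 + [10] * 5 + [-20] * 5
labels = (classify_promoter(focused_tags), classify_promoter(dispersed_tags))
```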

Examination of the nucleotide frequencies relative to the TSS in dispersed promoters revealed a weak 10 bp periodicity in nucleotide content extending downstream from the TSS. This pattern coincides with the fine grain mapping of H3K4me3/H2A.Z modified nucleosomes70,130 centered in the interval from +80 bp to +130 bp, suggesting these sequences position nucleosomes with a constrained rotational orientation relative to the transcriptional initiation sites.

Examination of individual dispersed promoters revealed a number of cases in which clusters of secondary start sites are phased at approximately 10 bp intervals with respect to the TSS. For example, the SLC25A23 promoter exhibits major secondary start sites at –22 bp and +10 bp from the TSS (Figure 5.2b).

We therefore systematically searched for promoters containing periodic start sites, defined as promoters in which more than 50% of the start sites located more than 5 bp from the preferred TSS cluster at 10 bp intervals from the TSS. Nearly 45% of human and mouse and 30% of Drosophila promoters met this criterion. An underlying 10 bp periodicity in nucleotide frequency relative to the TSS in these promoters was independently verified by Fourier transformation analysis (data not shown). These promoters exhibit a marked reduction in TATA bias and a striking increase in the 10 bp phasing of nucleotides and translationally positioned nucleosomes (arrowheads) starting downstream of the TSS (Figure 5.2c,e), establishing a relationship between nucleosome position and periodic transcription. We therefore refer to these promoters as ‘periodic promoters’.
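The periodic start-site criterion can be illustrated with the following sketch. The text does not state how close to a multiple of 10 bp a start site must fall to count as clustered, nor the width of the region examined, so both are exposed as parameters here: the ±2 bp tolerance is a guess motivated by the SLC25A23 example (a secondary site at –22 bp), and the ±50 bp region is borrowed from the peak ratio definition. The function names are hypothetical.

```python
def periodic_fraction(tag_offsets, period=10, exclude=5, tolerance=2, region=50):
    """Fraction of secondary start-site tags lying near a multiple of `period`."""
    secondary = [o for o in tag_offsets if exclude < abs(o) <= region]
    if not secondary:
        return None
    def distance_to_multiple(offset):
        r = offset % period               # Python's % is non-negative for period > 0
        return min(r, period - r)
    phased = sum(1 for o in secondary if distance_to_multiple(o) <= tolerance)
    return phased / len(secondary)

def is_periodic(tag_offsets, cutoff=0.5, **kwargs):
    """Apply the >50% criterion to the phased fraction of secondary sites."""
    fraction = periodic_fraction(tag_offsets, **kwargs)
    return fraction is not None and fraction > cutoff

# Toy promoter: a main TSS with secondary sites at -22 and +10, plus two tags at +15.
offsets = [0] * 10 + [-22] * 4 + [10] * 4 + [15] * 2
fraction = periodic_fraction(offsets)
```

Because the tolerance determines how often unphased start sites pass by chance, it should be chosen together with the 50% cutoff rather than taken from this sketch.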

5.4.2 Discovery of promoter proximal motifs (PSSEs)

Two strategies of de novo motif discovery were employed to identify sequence elements correlating with either focused or periodic promoters. First, we searched for motifs exhibiting a fixed distance from the TSS, finding strong enrichment for the TATA box and Inr motifs in focused mammalian promoters and the TATA box (-31), Inr (+1), MTE, and DPE in focused Drosophila promoters. In contrast, periodic promoters for both species were devoid of enrichment for core promoter motifs with the exception of the Inr (Figure 5.4).

Our second motif discovery strategy involved searching the nucleosome-free region [-150, +50] for regulatory elements specifically enriched relative to the surrounding genomic sequence. In mammals, we confidently identified 10 previously reported, non-redundant proximal motifs131, 8 of which are known binding sites for representative transcription factors: SP1 (GC-box), NFY (CCAAT), NRF1, ETS, CREB (CRE), MYC (E-box), YY1, and TBP (TATA-box) (Figure 5.5). With the exception of the TATA box, these motifs were relatively enriched in, but not exclusive to, periodic promoters. Intriguingly, most of these motifs exhibited strong positional preferences with respect to the TSS, and almost all displayed 10 bp intervals between preferred locations (Figure 5.6). In the case of NFY motifs, sites found on the opposite strand are typically out of phase by 5 bp (Figure 5.6d), consistent with studies of the functional consequences of varying spacing and orientation between the NFY binding site and the TATA box in the rat Ibsp gene132. Comparison of focused and periodic promoters in mammals revealed that proximal motifs are more precisely phased in periodic promoters, exemplified by ETS and NRF1 (Figure 5.6a,b). Proximal elements with the highest enrichment in either periodic (ETS, NRF1) or focused (TATA, DPE) promoters are typically found in the core promoter region (-35 bp to +35 bp), suggesting that these elements play an important role in determining the initiation strategy.

The precise spatial relationships between proximal motifs and the TSS raised the question of whether specific transcription factors binding to these motifs might be dedicated to roles in transcriptional initiation. To test this, we performed ChIP-Seq experiments for NFY and NRF1 and analyzed published ChIP-Seq data for c-Myc73 and the ETS factor GABPα133. The genomic locations of these factors were compared with the locations of transcription factors that play roles in development and signal-dependent responses (e.g., STAT1134, CTCF70, Oct473, Esrrb73, and FoxA275). High confidence binding sites for NFY, NRF1, GABPα, and c-Myc were strongly associated with promoters relative to the other factors, resembling the binding pattern of RNA polymerase II70 (Figure 5.6e). This tight association with the TSS was observed in both focused and periodic promoters and is consistent with specialized roles of these proteins in transcriptional initiation. We therefore refer to the binding sites for these proteins as Promoter-Specific Sequence Elements (PSSEs).

5.4.3 Conserved rotational position of downstream nucleosome

Although the 10 bp fluctuation in nucleotide frequency downstream of mammalian periodic promoters was associated with translationally positioned nucleosomes (Figure 5.3d), the specific nucleotide frequency patterns differed from nucleosome positioning patterns in yeast98 that are consistent with the pattern observed in Drosophila periodic promoters (Figure 5.3f, Figure 5.7a). Suspecting that the difference in mammalian nucleotide fluctuation may be due to differences in GC content, we segregated human nucleosome locations derived from independent genome-wide DNaseI-Seq87 and MNase-Seq88 datasets (Chapter 4) based on the GC content of the nucleosomal DNA. Alignment of nucleosomal DNA with low GC content recapitulates the antiparallel 10 bp oscillations in A/T versus C/G mononucleotides representative of previously described nucleosome positioning patterns, which are defined by the oscillation of AA/AT/TA/TT versus CC/CG/GC/GG dinucleotides97,99. However, as the GC content of the nucleosomal sequences increases, the phases of the oscillations shift for each nucleotide, resulting in the co-localization of A and G (purines) versus C and T (pyrimidines) at high GC content, resembling alternative patterns reported in nucleosomal DNA135 (Figure 5.7b). Structurally, the GC-rich pattern consistently places purines on the inside of the bending DNA, where the phosphate-deoxyribose backbone must compress when wrapping around the histone core136. This structural difference from linear DNA is consistent with the natural tendency for DNA to exhibit ‘negative tilt’ between purine-purine base steps as observed in oligonucleotide crystal structures115. This pattern is also observed in GC-rich nucleosomes from C. elegans111 (Chapter 4). The nucleotide frequency fluctuations at positioned nucleosomes associated with low or high GC content are nearly identical to the patterns observed in periodic promoters of corresponding GC content that were derived purely by sequence alignment at the TSS, and are associated with rotationally positioned nucleosomes (Figure 5.7c). These results suggest that although the specific nucleotide biases are different, the 10 bp sequence variations downstream of Drosophila and mammalian periodic promoters are functionally analogous and may act in concert with upstream binding factors to facilitate the positioning of nucleosomes at periodic translational locations in phase with periodic start sites (Figure 5.3d).
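The GC-stratified alignment described above can be sketched as follows. This is an illustration only and not the code used in this work: it assumes the nucleosomal sequences are equal-length strings already aligned at the nucleosome edge, selects those whose GC fraction falls in a given bin, and reports the per-position frequency of each nucleotide divided by that nucleotide's average frequency in the region (the normalization stated in the Chapter 4 figure captions). The function names are hypothetical.

```python
import numpy as np

def gc_fraction(seq):
    """Fraction of G and C bases in an upper-case sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def normalized_profiles(seqs, gc_min, gc_max):
    """Per-position nucleotide frequencies, each scaled by its regional mean."""
    upper = [s.upper() for s in seqs]
    selected = [s for s in upper if gc_min <= gc_fraction(s) < gc_max]
    if not selected:
        return {}
    length = len(selected[0])             # assumes all sequences share one length
    profiles = {}
    for base in "ACGT":
        counts = np.array([sum(s[i] == base for s in selected) for i in range(length)],
                          dtype=float)
        freq = counts / len(selected)
        mean = freq.mean()
        profiles[base] = freq / mean if mean > 0 else freq
    return profiles

# Toy input: two GC-rich sequences and one AT-rich sequence of equal length.
seqs = ["GCGCAT", "GGCCTA", "ATATAT"]
high_gc = normalized_profiles(seqs, 0.6, 0.7)
```

Profiles computed for a low GC bin and a high GC bin can then be compared position by position to look for the phase shift described in the text.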

5.4.5 Genes under the control of periodic and focused promoters

The recognition that a large fraction of mammalian and Drosophila promoters utilize a periodic initiation strategy raised the question of whether they are associated with different functional roles than focused promoters. In contrast to human and Drosophila genes transcribed by focused promoters, which are enriched for functional annotations linked to development or stimulus response, genes under the control of periodic promoters are typically enriched for general cellular functions (Figure 5.8a). These predicted preferences are supported by analysis of transcriptional programs induced during the innate immune response of macrophages to the bacterial cell wall component Kdo2 lipid A. Both focused and periodic promoters are activated in response to Kdo2 lipid A, consistent with the presence of inflammatory response elements in both types of promoters. However, high magnitude changes in gene expression are preferentially mediated by focused promoters (Figure 5.8b). In addition, the kinetics of promoter activation are much faster for focused promoters.

5.5 Discussion

Although molecular biology textbooks present focused, TATA box-containing promoters as the general paradigm for understanding transcriptional initiation in eukaryotic organisms, the majority of metazoan promoters direct dispersed patterns of transcriptional initiation and most lack canonical core promoter motifs. Analysis of dispersed promoters led to the unexpected discovery of periodic promoters, characterized by secondary start sites that occur at intervals of approximately 10 bp with respect to the TSS. Periodic promoters exhibit phased nucleosome positioning sequences that co-occur with rotational positioning of the first downstream nucleosome and with more pronounced periodicity and precise phasing of upstream proximal motifs relative to the TSS (Figure 5.8c). While the numbers of available sequence tags make it difficult to determine the precise nucleosome positions at the level of individual genes, it is clear that nucleosomes take up different translational positions on individual periodic promoters, rather than occupying a single preferred translational position. We propose that interactions between proximal motif binding factors, general transcription factors, and the downstream nucleosome can occur effectively with the nucleosome positioned at several of the allowed, phased positions, giving rise to periodic sites of transcriptional initiation when the appropriate combination of proximal motifs and phased initiator elements are also present. This model of promoter architecture places structural constraints on nearly 300 bp of DNA from –100 bp to +200 bp, with sequence specific transcription factors binding to the first 150 bp and a rotationally positioned nucleosome occupying the last 150 bp, collectively cradling the RNA polymerase II machinery (Figure 5.8c). Focused promoters may also have sequences that function to translationally position nucleosomes, but these positions are predicted to differ from promoter to promoter, and are thus not evident when these promoters are aligned globally. In addition, most proximal motifs are noticeably absent from the core promoter region of focused promoters, where the TATA box is markedly enriched relative to periodic promoters.

A significant number of promoters did not meet the criteria for focused or periodic promoters and are simply referred to as dispersed promoters. While these promoters can be clearly distinguished from focused promoters, a substantial fraction have periodic start sites that are slightly out of phase with the TSS, exhibit altered phasing of proximal motifs upstream of the TSS, and are enriched for the same proximal motifs and functional annotations that are associated with periodic promoters and genes. Because a small error in the assignment of the CE-TSS would prevent detection of nucleosome positioning sequences and phased nucleosomes within this group of promoters, it is likely that many of the dispersed promoters in fact represent periodic promoters.

The distinguishing characteristics of focused and periodic promoters are consistent with their utilization for directing transcription of genes with distinct biological roles and patterns of expression. In the case of periodic promoters, the relationship of highly phased PSSEs and nucleosome positioning sequences is structurally optimized for self-assembly of constitutively active promoters. This is consistent with the finding that these promoters preferentially direct the expression of genes required for general cellular functions. Variation in promoter strength for these promoters can be achieved in this model by the particular sequence-specific transcription factors that bind to PSSEs and other cis-active elements and the specific coactivator complexes that they recruit. In contrast, focused promoters preferentially direct the transcription of genes that are expressed in a cell-type specific and/or signal dependent manner, particularly genes that exhibit high magnitude responses to stimuli. The conspicuous absence of nucleosome positioning sequences that are in phase with the TSS in focused promoters suggests that they would be counterproductive in these types of regulated gene expression, in which the repositioning of key nucleosomes may be a critical determinant of transcriptional activation or repression. In concert, the findings presented here identify syntactical elements that specify different strategies for transcriptional initiation from metazoan promoters.

Chapters 4, 5, and 6, in part, will be submitted for publication. Benner, C., Garcia-Bassets I., Heinz S., Kadonaga J.T., Rosenfeld M.G., Subramaniam S., Glass C.K. Identification of a conserved periodic promoter structure in metazoans. The dissertation author was the primary investigator and author of this paper.

Table 5.1: Summary of Sequencing Data, TSS, and ChIP-Seq peaks used in this study.

Data Set | Total Mapped Tags | Mapping/Usage Details | Bias-Correction Details | Number of TSS/Peaks

5’ RNA Sequencing
Human 5’ RNA (CAGE) [various tissues]4 | 13,124,701 (hg18, 20 bp, ≤ 2 mismatch, ≤ 5 matches to genome) | Leading “G” was removed from raw CAGE sequence. Multiple matches were considered due to short tag length (false positives mitigated by limiting analysis to tags near 5’ mRNA/ESTs) | None | 15,253 [≥ 15 tags per promoter, 19.9% focused, 44.8% periodic]
Mouse 5’ RNA (CAGE) [various tissues]4 | 14,467,522 (mm8, 20 bp, ≤ 2 mismatch, ≤ 5 matches to genome) | Same as Human 5’ RNA | None | 15,281 [≥ 15 tags per promoter, 18.5% focused, 45.2% periodic]
Drosophila 5’ RNA [various tissues]129 | 27,177,789 (dm3, 25 bp, ≤ 2 mismatch, unique alignment in genome) | Only tags within 200 bp of the 5’ of annotated RefSeq were used due to cleavage products in this dataset within gene bodies. | None | 7,676 [≥ 15 tags per promoter, 39.9% focused, 30.3% periodic]
Medaka 5’ RNA (5’ SAGE) [various tissues]137 | 1,882,154 (oryLat1, 19 bp, ≤ 2 mismatch, ≤ 5 matches to genome) | Same as Human 5’ RNA, except leading “G” was not removed | None | 4287 [≥ 5 tags per promoter, not enough tags to determine periodic/focused promoters]
Yeast 5’ RNA (5’ SAGE)138 | 365,342 (sacCer1 [Oct 2003], 15-17 bp, ≤ 2 mismatch, ≤ 5 matches to genome) | Same as Human 5’ RNA, except leading “G” was not removed | None | 3708 [≥ 3 tags per promoter, not enough tags to determine periodic/focused promoters]

Mononucleosome Sequencing
Human H3K4me3 mononucleosome ChIP-Seq [CD4+ Tcells]70 | 16,845,478 (hg18, tags mapped by authors) | Used for promoter nucleosome mapping in human due to high coverage in promoter regions | MNase bias normalized by considering overrepresented 8-bp oligos [-3,+4] relative to the MNase cleavage site |
Drosophila H2A.Z mononucleosome ChIP-Seq [embryos]130 | 1,076,384 (dm3, 50 bp, ≤ 5 mismatch, unique alignment in genome) | Only 5’ end of read was used. Sample was over-digested with MNase and reveals an extremely weak nucleotide positioning pattern. | MNase bias normalized by considering overrepresented 8-bp oligos [-3,+4] relative to the MNase cleavage site |

Chromatin Immunoprecipitation Sequencing
NFY ChIP-Seq [MCF7] (This study) | 17,780,745 (hg18, 25 bp, ≤ 2 mismatch, unique alignment in genome) | None | None | 3092 peaks
NRF1 ChIP-Seq [MCF7] (This study) | 22,323,804 (hg18, 25 bp, ≤ 2 mismatch, unique alignment in genome) | None | None | 2679 peaks
GABP ChIP-Seq [Jurkat]133 | 7,862,231 (hg18, 25 bp, ≤ 2 mismatch, unique alignment in genome) | None | None | 10,563 peaks
c-Myc ChIP-Seq [ES cells]73 | 12,610,745 (mm8, 25 bp, ≤ 2 mismatch, unique alignment in genome) | None | None | 6138 peaks
Esrrb ChIP-Seq [ES cells]73 | 13,563,975 (mm8, 25 bp, ≤ 2 mismatch, unique alignment in genome) | None | None | 68,988 peaks
Oct4 ChIP-Seq [ES cells]73 | 9,520,995 (mm8, 25 bp, ≤ 2 mismatch, unique alignment in genome) | None | None | 6,167 peaks
CTCF ChIP-Seq [CD4+ Tcells]70 | 2,947,043 (hg18, mapped by authors) | None | None | 26,447 peaks
RNA Pol II ChIP-Seq [CD4+ Tcells]70 | 4,150,378 (hg18, mapped by authors) | None | None | 15,664 peaks
Foxa2 ChIP-Seq [Liver]75 | 6,505,846 (mm8, 25 bp, ≤ 2 mismatch, unique alignment in genome) | None | None | 11,472 peaks
STAT1 ChIP-Seq [HelaS3+IFN-gamma]134 | 15,432,161 (hg18, 25 bp, ≤ 2 mismatch, unique alignment in genome) | None | None | 39,454 peaks

Figure 5.1: Schematic depicting promoter selection strategy. (a) Schematic depicting strategy for identifying the TSS from 5’ RNA tag data. Distribution of Peak Ratios for Drosophila (b) and mouse (c) promoters. Frequency of the most common nucleotides in Drosophila [A] (c) and mouse [G] (d) as a function of peak ratio with respect to the TSS.

Figure 5.2: Global sequence features of focused and periodic promoters. a, b. Representative examples of the distribution of transcriptional initiation sites in the human CXCL2 (a) and SLC25A23 (b) promoters. Underlying schematics indicate initiation patterns for focused and periodic promoters, respectively. c, d. Nucleosome position and nucleotide frequencies relative to the TSS in focused (c) and periodic (d) human promoters. Relative nucleosome occupancy determined by H3K4me3 tags was overlaid on nucleotide frequencies relative to the TSS. e, f. Nucleosome position and nucleotide frequencies of Drosophila focused (e) and periodic (f) promoters relative to the TSS with nucleosome occupancy determined by H2A.Z tags. Asterisks (*) indicate likely artifactual peaks introduced by MNase cutting preferences at the TATA box (Figure 5.3). Sequence biases associated with TATA, Inr, DPE and MPE elements are indicated. Arrowheads indicate the positions of translationally phased nucleosomes in periodic promoters.

Figure 5.3: Identification of false nucleosome positions. Distribution of 5’ and 3’ H3K4me3 ChIP-Seq reads from MNase digested nucleosomes in human focused (a) and periodic (b) promoters. Peaks represented by only 5’ or 3’ reads are likely artifacts of MNase digestion.

Figure 5.4: Position dependent motifs in human and Drosophila promoters. Results of de novo motif discovery from analyzing sequences with a fixed distance relative to the TSS. The -log p-value of the top 10 bp motif in focused and periodic promoters is presented as a function of distance to the TSS for human (a) and Drosophila (b). Representative 10 bp motifs identified in the analysis are shown in (c).

Figure 5.5: Identification of PSSEs. Proximal Sequence Specific Elements (PSSE) identified de novo using the motif discovery program, HOMER, in human. For each PSSE, the upper sequence logo represents the identified motif and the lower sequence logo represents the corresponding known motif derived from TRANSFAC. GFX and GFY (General Factors X and Y) do not match any previously determined binding motifs. Periodic vs. Genome and Focused vs. Genome columns show the LogP enrichment value of each motif in the proximal promoter (-150 bp to +50 bp) relative to the surrounding promoter sequence (-2 kb to +2 kb) by the Fisher Exact Test. Site frequency reports the fraction of promoters that contain each motif in its promoter from –150 bp to +50 bp. Promoter type enrichment reports the relative enrichment for each motif in either periodic or focused promoters (Fisher Exact Test).

Figure 5.6: Distribution of PSSEs to the TSS. Relative utilization and positions of promoter-specific sequence elements (PSSEs) and their binding proteins in focused and periodic promoters. a-c. Distribution of ETS (a), NRF1 (b), and SP1 (c) elements in human periodic and focused promoters. d. Distribution of NFY elements in human periodic promoters in both orientations, showing a shift of 5 bp between profiles depending on strand. e. Frequency of transcription factor binding in promoters. Binding sites for the indicated factors were ranked according to ChIP-seq tag count and partitioned into 10 quantiles. The plot illustrates the percentage of peaks of a given height (range) that are within 500 bp of a transcriptional start site (defined by at least 15 mRNA tags). The x-axis is ordered from lowest confidence binding sites (lowest tag count per peak) at left to highest confidence sites (highest tag count per peak) at right.

Figure 5.7: NPS relative to the TSS are dependent on GC content. Sequence determinants of nucleosome position vary depending on GC content. a. Fluctuations of C and G nucleotide frequencies +50 to +100 bp downstream of human and Drosophila periodic promoters identified by aligning promoters at the TSS. b. Fluctuations of nucleotide frequencies relative to confident DNase I cleavage sites depicting putative nucleosome positioning patterns. DNase I cleaves DNA when the minor groove is exposed along the surface of the nucleosome. Cleavage sites are segregated based on their surrounding GC % (±150 bp). The region from +70 to +120 bp is shown, which excludes sequence bias introduced by DNase I sequence preferences near the cleavage site (Chapter 4). c. Fluctuations in C and G nucleotide frequencies relative to confident DNase I cleavage sites in human nucleosomes associated with 60-70% or 30-40% GC content.

Figure 5.8: Functional and evolutionary utilization of periodic promoters. a. Biological process annotations from the Gene Ontology consortium enriched in human and Drosophila focused or periodic promoters. b. Utilization of promoter types by macrophages treated with the TLR4 agonist Kdo2 lipid A. The plot illustrates the percentage of total regulated promoters above a given fold induction that are focused (e.g., at 0.5 h nearly 100% of the promoters induced >12-fold are focused). The dotted line indicates the overall fraction of focused promoters. c. Models depicting the organization of sequence elements in periodic and focused promoters.

Chapter 6: Analysis of transcriptional initiation across eukaryotes

6.1 Abstract

RNA polymerase II and several general transcription factors such as TBP are highly conserved in nearly every eukaryotic species. Few transcription factor binding sites that recruit the transcriptional machinery, such as the TATA box, have been extensively studied in a wide range of organisms. However, given that a large number of metazoan promoters use periodic initiation strategies instead of traditional [TATA-driven] focused initiation, we sought to discover the extent to which different promoter types and motifs are conserved throughout eukaryotes. We analyzed promoter regions from nearly 70 organisms to catalogue nucleotide frequency patterns and enriched proximal motifs from species covering diverse corners of the eukaryotic tree of life. We find that closely related organisms typically have similar promoter properties, although these are highly divergent between major kingdoms of life. We observe sequence-constrained nucleosome positions with respect to the TSS in many metazoans and protists, which, along with extensive utilization of motifs such as NFY, NRF1, CRE, and Myc, suggest widespread usage of periodic initiation strategies. Positional analysis of novel and known proximal motifs reveals a general promoter architecture between proximal motifs, core elements, and surrounding nucleosomes that is conserved throughout eukaryotes, regardless of whether a focused or periodic initiation strategy is used.

6.2 Introduction

The creation of mRNA from a DNA template is a basic process common to every living organism. Decades of research on transcriptional initiation have produced a multitude of findings from a large number of different species120-122,139. Research in the field of transcription uses a variety of different species to take advantage of resources, techniques, expertise, and biological or disease significance that may be limited to specific organisms. Unlike reagents and experimental systems that may be limited to a single species, bioinformatics techniques are readily applied to a host of organisms provided the availability of primary sequencing data. The exponential explosion in sequencing data this past decade has led to the complete or partial sequencing of hundreds of genomes, in addition to extensive EST (expressed sequence tag) libraries detailing the RNA repertoires of each organism125. These datasets represent a valuable source of information for discovery as well as an opportunity to measure the extent to which known genetic elements and transcriptional mechanisms are conserved.

Most comparative studies focus on the alignment of genomes or proteomes to assess the conservation of various features between organisms. Alignment of closely related organisms has proven a valuable technique for the identification of regulatory elements in promoters, harnessing the assumption that functional DNA is under selective pressure19,131. Unfortunately, if the most recent common ancestor between two species occurred more than 100-200 million years ago, the percentage of homologous promoters that can be successfully aligned starts to rapidly decrease140. For this reason, studies focusing on a broader cross-section of organisms must analyze each organism separately (as performed in Chapter 5) and then compare the features identified in each species. The most extensive of these studies analyzed 13 fungal Ascomycete genomes for regulatory elements enriched upstream of ORFs, finding several elements conserved to varying degrees between species141. Studies of this type have been limited to yeast and compact genomes due to lack of accurate transcription start sites in larger genomes where the true promoter may be several kb away from the beginning of an open reading frame.

We took advantage of primary RNA sequencing data and techniques developed in previous chapters to create a massive eukaryotic promoter database detailing the locations of transcriptional initiation in a large number of different species. We sought to understand the extent to which different promoter features and transcriptional principles are conserved between metazoa, fungi, plants, and protists. In this study, we calculate nucleotide frequency patterns and find proximal motifs in each organism and use them to identify both common and divergent principles of promoters in an evolutionary context.

6.3 Methods

6.3.1 Determination of the TSS

A description of the 5’ RNA Sequencing datasets, alignment information, and individual analysis details is available in Table 6.1. EST and mRNA data for each organism were downloaded from Genbank125 and aligned using BLAT126. Alignments from the UCSC Genome Browser127 were used when possible (http://genome.ucsc.edu/). Genomes for each organism, and in some cases 5’ EST data not deposited in Genbank, were downloaded from websites run by each species’ respective sequencing consortium (Table 6.2). Roughly half of the data used in this study originated from the Joint Genome Institute (http://www.jgi.doe.gov/).

When possible, the TSS was found as described in Chapter 5. If Unigene clusters were unavailable for the organism, we searched for clusters of 5’ ESTs on the same strand within 100 bp, independent of gene assignments, excluding clusters within 1 kb of each other. The specific position with the highest number of tags (or most central in the case of a tie) in the primary promoter region was assigned as the TSS. Promoters in species without Unigene annotations were assigned to predicted proteins in their respective species based on the tblastp analysis of ESTs and predicted peptides.
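The fallback procedure for species without Unigene clusters can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this study: `call_tss` and its parameters are hypothetical names, "within 100 bp" is read as single-linkage grouping of neighboring 5’ ends, clusters closer than 1 kb to a neighbor are all dropped, and ties are broken toward the midpoint of the cluster.

```python
from collections import Counter, defaultdict


def call_tss(tags, max_gap=100, min_separation=1000):
    """Call one TSS per isolated cluster of 5' EST ends.

    tags: iterable of (chrom, strand, position) tuples, one per 5' EST end.
    max_gap and min_separation are assumed readings of "within 100 bp"
    and "within 1 kb of each other".
    """
    by_strand = defaultdict(list)
    for chrom, strand, pos in tags:
        by_strand[(chrom, strand)].append(pos)

    calls = []
    for (chrom, strand), positions in by_strand.items():
        positions.sort()
        # Single-linkage grouping: start a new cluster when the gap exceeds max_gap.
        clusters = [[positions[0]]]
        for pos in positions[1:]:
            if pos - clusters[-1][-1] <= max_gap:
                clusters[-1].append(pos)
            else:
                clusters.append([pos])

        for i, cluster in enumerate(clusters):
            # Assumption: a cluster is dropped if a neighbour lies within min_separation.
            near_prev = i > 0 and cluster[0] - clusters[i - 1][-1] < min_separation
            near_next = (i + 1 < len(clusters)
                         and clusters[i + 1][0] - cluster[-1] < min_separation)
            if near_prev or near_next:
                continue
            counts = Counter(cluster)
            best = max(counts.values())
            tied = sorted(p for p, n in counts.items() if n == best)
            centre = (cluster[0] + cluster[-1]) / 2
            # Highest tag count wins; ties go to the position nearest the cluster centre.
            tss = min(tied, key=lambda p: abs(p - centre))
            calls.append({"chrom": chrom, "strand": strand,
                          "tss": tss, "tags": len(cluster)})
    return calls


example_tags = [
    ("chr1", "+", 1000), ("chr1", "+", 1000), ("chr1", "+", 1010), ("chr1", "+", 1050),
    ("chr1", "+", 5000), ("chr1", "+", 5002), ("chr1", "+", 5004),
    ("chr1", "-", 200), ("chr1", "-", 700),
]
calls = call_tss(example_tags)
```

In this toy input the two minus-strand clusters lie 500 bp apart and are both excluded, while the second plus-strand cluster has a three-way tie that resolves to its central position.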

Every attempt was made to exclusively use full-length 5’ mRNA/EST sequences when available. EST libraries are created for a variety of reasons, such as for unbiased gene discovery and for use in assembling genomic scaffolds during the initial genomic sequencing process. While in many cases the resulting EST sequences represent truncated forms of the full-length mRNA, these sequences can still be useful for locating promoters to within 10 to 50 bp. If an adequate number of full-length 5’ mRNA/EST sequences or high throughput 5’ RNA sequencing data were available, all other sequences were discarded to assist in the accurate identification of the true TSS. The difference between full-length and non-full-length TSS can be illustrated by mouse and rat. Nucleotide frequencies near the human and mouse promoters are nearly identical and are a much closer match than those of mouse and rat, even though the two rodents have a much higher rate of conservation. This arises from the limited number of full-length 5’ mRNA/EST sequences available for the rat genome, resulting in obscured features such as the lack of a clear TATA box at –31 bp (Appendix B). However, results from motif discovery were nearly identical, implying that the general promoter regions have been correctly identified.

6.3.2 Mapping promoters between species

We mapped homologous genes between organisms using the Homologene database at NCBI (http://www.ncbi.nlm.nih.gov/homologene). For species not included in Homologene, the predicted proteome from NCBI or the genome sequencing consortium’s website was used with BLAST142 to compare against proteins represented in Homologene. In the event that multiple peptides from a single species mapped to the same Homologene ID we took the peptide with the highest BLAST score.
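The tie-breaking rule for peptides that map to the same Homologene ID amounts to keeping the best-scoring hit per species and group. A minimal sketch, assuming the BLAST results have already been parsed into tuples (the function name and tuple layout are hypothetical):

```python
def best_peptide_per_group(hits):
    """Keep the highest-scoring peptide for each (species, Homologene ID).

    hits: iterable of (species, peptide_id, homologene_id, blast_score).
    """
    best = {}
    for species, peptide, group, score in hits:
        key = (species, group)
        if key not in best or score > best[key][1]:
            best[key] = (peptide, score)
    return {key: peptide for key, (peptide, _score) in best.items()}


example_hits = [
    ("N. vectensis", "pepA", 1001, 250.0),
    ("N. vectensis", "pepB", 1001, 410.0),
    ("N. vectensis", "pepC", 2002, 95.0),
]
mapping = best_peptide_per_group(example_hits)
```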

6.4 Results

6.4.1 Nucleotide Frequency Patterns in Eukaryotic promoters

We determined the TSS for nearly 70 eukaryotic species from all three kingdoms and protists using primary 5’ RNA data. Species were selected based on two criteria. First, each species must have at least a draft genome containing genomic contigs or scaffolds with an average size greater than 10 kb. Assemblies comprised of smaller fragments restrict the ability to map ESTs and identify contiguous promoter regions with confidence. Second, a minimum of 10k 5’ ESTs must be available for mapping promoters that are ideally supported by at least two 5’ sequences. In addition to these two restrictions, several additional species were removed from consideration due to contamination of 5’ ESTs in the public databases. For example, C. elegans contains a large portion of ambiguously annotated ESTs that originate from a study designed to find coding regions143, compromising our ability to use bulk EST data from C. elegans for promoter discovery purposes. When possible we tried to fix author annotation errors, such as with the mosquito Aedes aegypti, where 5’ and 3’ ESTs were mislabeled. We also left out some highly related species, including several Drosophila, nematode, and vertebrate species that met our criteria but were likely to yield redundant results.

For each species, we compiled the nucleotide frequencies relative to the TSS (Appendix B). Related species generally exhibited similar patterns in nucleotide frequency. A variety of patterns were seen, including broad changes in nucleotide content both upstream and downstream of the TSS. Sharp fluctuations in nucleotide frequency (like the TATA box) were restricted to TSS derived from full-length 5’ RNA libraries, such as those found in human, mouse, Drosophila, and plants. There is also a correlation between the strength of the nucleotide frequency patterns in similar species and the number of 5’ RNA sequences used to construct the promoter set. This is to be expected, since more data increase the chance of identifying true promoter regions, which are the regions likely to exhibit these patterns.
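Compiling nucleotide frequencies relative to the TSS reduces to counting bases column by column over promoters aligned at the TSS. The sketch below assumes the promoter windows have already been extracted, oriented by strand, and trimmed to equal length; `nucleotide_frequencies` is a hypothetical helper, and ambiguous bases such as N are simply ignored.

```python
def nucleotide_frequencies(windows):
    """Per-position A/C/G/T frequencies for TSS-aligned, equal-length sequences."""
    length = len(windows[0])
    counts = [{base: 0 for base in "ACGT"} for _ in range(length)]
    for seq in windows:
        if len(seq) != length:
            raise ValueError("all windows must have the same length")
        for i, base in enumerate(seq.upper()):
            if base in counts[i]:  # skip N and other ambiguity codes
                counts[i][base] += 1
    freqs = []
    for column in counts:
        total = sum(column.values())
        freqs.append({b: (column[b] / total if total else 0.0) for b in "ACGT"})
    return freqs


profile = nucleotide_frequencies(["ACGT", "AAGT", "NCGT"])
```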

Unexpectedly, a large number of metazoan and protist promoters contain nucleotide frequency patterns that may position nucleosomes downstream from the TSS (Figure 6.1, 6.2, Appendix B). In the case of metazoans, these coarse-grain patterns coincide with the first downstream nucleosome described for human and Drosophila periodic promoters (Chapter 5). In addition to the human, mouse, and Drosophila promoters, high quality promoters for zebrafish and medaka (fish, Figure 6.2a) display fine-grain 10 bp fluctuations consistent with AT-rich nucleosome positioning sequences that are in phase with the TSS. Furthermore, the phasing of these sequences would constrain the nucleosome in the identical rotational angle relative to the TSS as described for human, mouse, and Drosophila, adding support for our earlier model of a conserved periodic promoter architecture (data not shown). These results suggest that the other organisms containing coarse-grain (150 bp) nucleosome positioning patterns are also likely to utilize periodic initiation strategies for a large number of their genes.
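One way to quantify fine-grain 10 bp fluctuations in a frequency profile is to measure the strength of a single Fourier component at that period. The text does not state how periodicity was scored, so the following is only an illustrative sketch: `periodicity_power` is a hypothetical helper, and the input is assumed to be a per-position frequency (for example A+T) downstream of the TSS.

```python
import cmath
import math


def periodicity_power(signal, period):
    """Power of the Fourier component with the given period (in bp)."""
    mean = sum(signal) / len(signal)
    component = sum((x - mean) * cmath.exp(-2j * math.pi * i / period)
                    for i, x in enumerate(signal))
    return abs(component) ** 2 / len(signal)


# Synthetic profile with an exact 10 bp oscillation, for illustration only.
synthetic = [0.5 + 0.1 * math.cos(2 * math.pi * i / 10) for i in range(100)]
power_10 = periodicity_power(synthetic, 10)
power_7 = periodicity_power(synthetic, 7)
```

Comparing the power at 10 bp with the power at neighboring periods gives a rough indication of whether a profile carries helical-repeat periodicity; the phase of the same component indicates how that periodicity is registered relative to the TSS.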

Algae represent the group of organisms with the strongest coarse-grain nucleosome positioning patterns in relation to the promoter, implying these organisms may make heavy use of periodic promoters. In contrast, high quality promoters for Arabidopsis and other plants generated with full-length 5’ RNA sequences exhibited no evidence for nucleosome positioning sequences aligned with the TSS. Instead, most plant genomes have strong enrichment for the TATA box at –31 bp from the TSS, suggesting plants may predominantly use focused initiation strategies for gene expression.

6.4.2 Proximal Motifs in Eukaryotic promoters.

We analyzed the proximal promoters from each species to identify enriched motifs (8-12 bp) that may play an important role in promoter function. Our analysis focused on regions from –150 bp to +50 bp for accurate TSS derived from full-length RNA, and from –220 bp to –20 bp for other promoter sets to avoid identifying restriction sites or other signals in the DNA introduced by EST cloning techniques that do not emphasize isolation of the full transcript. We used the surrounding sequence, from –2 kb to +2 kb, as background for motif finding to help isolate motifs that are specifically found proximal to the TSS. Enriched motifs for at least two different species are reported in Figure 6.1 (all results are listed in Appendix B).
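The idea of contrasting the proximal window with the surrounding sequence can be illustrated with a simple site-density comparison. This is not the motif discovery procedure itself (HOMER, described in Chapter 2); it is a minimal sketch in which the helper names are hypothetical, motifs are matched as exact strings on both strands, each promoter is assumed to span –2 kb to +2 kb with the TSS at index 2000, and the proximal window is excluded from the background.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_sites(seq, motif):
    """Count exact matches (overlaps allowed) on both strands."""
    seq = seq.upper()
    patterns = {motif, revcomp(motif)}  # a set, so palindromes such as CACGTG count once
    return sum(seq.startswith(p, i)
               for p in patterns
               for i in range(len(seq) - len(p) + 1))


def sites_per_kb(promoters, motif, tss_index=2000, start=-150, end=50):
    """Return (proximal, background) motif densities in sites per kb."""
    prox_sites = prox_bp = bg_sites = bg_bp = 0
    for seq in promoters:
        lo, hi = tss_index + start, tss_index + end
        proximal, flanks = seq[lo:hi], (seq[:lo], seq[hi:])
        prox_sites += count_sites(proximal, motif)
        prox_bp += len(proximal)
        bg_sites += sum(count_sites(f, motif) for f in flanks)
        bg_bp += sum(len(f) for f in flanks)
    if not prox_bp or not bg_bp:
        raise ValueError("promoter sequences are too short for the requested windows")
    return 1000 * prox_sites / prox_bp, 1000 * bg_sites / bg_bp


# Toy promoter: one CCAAT box 100 bp upstream of the assumed TSS index.
toy_promoter = "A" * 1900 + "CCAAT" + "A" * 2095
densities = sites_per_kb([toy_promoter], "CCAAT")
```

For promoter sets without full-length TSS, the same function could be called with `start=-220, end=-20` to mirror the shifted window described above.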

Our analysis revealed a large number of known and novel motifs from different species. Similar to our analysis of nucleotide frequency patterns, we observed that closely related species exhibited enrichment for the same motifs. We identified considerably more enriched motifs in vertebrate promoters than most other species, which is probably due to a combination of the greater number of genes encoded by their genomes as well as the greater amount of sequencing data available. The most common motif identified was for Myc (E-box, CACGTG), found enriched in proximal promoters for many animals, fungi, and plants. Many species, including many insects, whose promoters were not enriched for Myc were enriched for a similar E-box motif (CAGCTG). These elements are bound by a wide variety of Helix-Loop-Helix DNA-binding domain containing proteins that comprise a large group of transcription factors in human alone.

Major sub-groups of animals, fungi, plants, and protists have diverged to utilize different elements in their proximal promoters. In most cases, it is difficult to infer the proximal elements used by a common ancestor due to lack of data or poor coverage of species in those portions of the evolutionary tree. However, in the case of insects, we saw widespread evidence for accelerated evolution of proximal motifs relative to other metazoans. Distantly related metazoans, including vertebrates, snails, some worms, and some insects showed strong enrichment for NFY, NRF1, CRE, and Myc. Strong enrichment for these motifs in the promoters of the sea anemone N. vectensis highlights the ancient origin of these motifs and their likely role in recruiting general transcription factors and RNA polymerase II during roughly 700 million years of evolution143. Flies, including Drosophila, showed enrichment for completely different proximal motifs, which include the DRE and a novel element that is strongly enriched in mosquitoes (Figure 6.1). This was an odd finding considering the honeybee, which is morphologically and genetically more similar to flies than to humans, shared most of its proximal motifs with vertebrates. This suggests that ancestors of modern fly species may have been under considerable evolutionary pressure to change their proximal promoter motif repertoire, resulting in the complete restructuring of promoter motifs. This finding is reflected in the large number of genes (762) that are shared between the honeybee and deuterostomes that have been seemingly lost in Drosophila144.

6.4.3 Promoter Conservation vs. Motif conservation

Extensive evidence for proximal promoter enrichment of elements like NFY in several species led us to question whether or not promoters driving the expression of homologous genes in these species were also conserved. This question is trivial for closely related species, such as mammals or the Drosophila family of species, where the evolutionary distance between species is small enough to easily align promoter regions using sequence alignment software such as blastz145. In fact, most of the proximal elements identified here have been independently identified in studies that search for promoter elements that are strongly conserved in related species relative to the surrounding sequence19,131,146. In the case of mammals, the motifs identified in this study are also the top scoring motifs based on mammalian conservation (SP1, NFY, NRF1, etc.)131.

Unfortunately, approaches that use unbiased sequence alignment begin to lose their effectiveness at larger evolutionary distances when promoters can no longer be aligned. Figure 6.3a shows the NFYA locus in humans aligned to other vertebrates. As the evolutionary distance increases between species in the alignment, introns start failing to align, along with regulatory regions and exons encoding the 5’ and 3’ UTR. When considering zebrafish and human, most homologous genes share little more in common than their coding DNA. In fact, analysis of bulk promoter conservation reveals that 23% of chicken and only 9% of zebrafish genomes aligned to mouse promoter regions (Figure 6.3b). This could be due to two reasons. First, the promoters driving homologous coding regions could be from fundamentally different evolutionary origins and would not align in the first place. Alternatively, it is possible that the non-functional sequence surrounding conserved motifs becomes divergent to the point that the region is no longer scored favorably by alignment algorithms, effectively reaching the limit of the algorithm’s sensitivity.

Our initial hypothesis, given the extensive list of organisms with similar proximal motifs between them, was that many of the promoters driving homologous genes were likely to be of ancient origin. To investigate this possibility given the lack of sequence alignment, we independently analyzed promoters of homologous genes to determine if the same promoter motifs were used across different species. Searching for motifs instead of aligning sequences increased our sensitivity by ignoring the alignment of other sequences within the promoter, but biased the results by only considering specific motifs. We searched for instances of NFY, NRF1, and MYC in several metazoan species, comparing promoter regions of homologous genes between species. We calculated the significance of co-occurrence using the Fisher Exact Test, controlling for the total number of homologous promoters between pairs of species (Figure 6.3b). We found that motifs mapped between homologous promoters at rates comparable to that observed by sequence alignment. While the honeybee, sea anemone, and mice were all strongly enriched for NFY, NRF1, and MYC in their promoters, specific motifs in the promoter from one species were not reliably found in the promoter of the homologous gene in the other species.
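The co-occurrence test can be written out explicitly for a single motif and a single pair of species. The sketch below is an illustration rather than the code used here: `cooccurrence_p` is a hypothetical helper, each homologous promoter pair is reduced to two booleans (motif present in species A, motif present in species B), and the one-sided Fisher exact p-value is computed from the hypergeometric tail, with the number of homologous pairs serving as the population size.

```python
from math import comb


def cooccurrence_p(pairs):
    """One-sided Fisher exact test for excess motif co-occurrence.

    pairs: list of (in_species_a, in_species_b) booleans, one entry per
    homologous promoter pair. Returns (number with both, p-value).
    """
    n = len(pairs)
    a_total = sum(1 for a, _ in pairs if a)
    b_total = sum(1 for _, b in pairs if b)
    both = sum(1 for a, b in pairs if a and b)
    upper = min(a_total, b_total)
    # P(X >= both) when b_total promoters are drawn from n, a_total of which carry the motif.
    tail = sum(comb(a_total, k) * comb(n - a_total, b_total - k)
               for k in range(both, upper + 1))
    return both, tail / comb(n, b_total)


concordant = [(True, True)] * 3 + [(False, False)] * 3
mixed = [(True, True), (True, False), (False, True), (False, False)]
both_concordant, p_concordant = cooccurrence_p(concordant)
both_mixed, p_mixed = cooccurrence_p(mixed)
```

In the perfectly concordant toy set the p-value is 1/20, whereas the mixed set, where presence in one species says nothing about the other, gives 5/6.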

To interpret this result, we actively searched for the conserved gene whose orthologous promoters had the most consistent set of motifs across different species. This search returned cyclin B1 (CCNB1), an important cell cycle protein containing an NFY motif in the promoters of over 10 different metazoan species (Figure 6.4c). By comparison, we were only able to identify 20 conserved genes with over 5 orthologous promoters with a consistent motif. Alignment of promoter sequences from orthologous CCNB1 promoters revealed a conserved structure across species, with two NFY motifs separated by 31 or 32 bp (-87 and –55 bp) and an A/T rich motif resembling the TATA box at –30 bp (Figure 6.4c).

While this finding is fairly unique and fails to support a general model of ancient promoters evolving with their coding regions, it does however underscore two important concepts. The first is that the relative spacing between elements and rules of promoter architecture may be conserved. To examine this at a genome-wide level, we analyzed NFY motifs found in N. vectensis (sea anemone) promoters, examining the distance between adjacent NFY motifs. Amazingly, we found that the spatial distribution between NFY motifs in the sea anemone was a perfect match for the distribution in human promoters (Figure 6.4a,b). No similar pattern was observed in Drosophila or Arabidopsis, analyzed as a control (data not shown). Furthermore, the relative spacing of these motifs relative to the TATA-like element in each species strongly supports the concept that NFY motifs maintain strict spacing requirements with respect to the TSS in other species, in a manner similar to that shown for mammals (Chapter 5).
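The spacing analysis amounts to building a histogram of distances between neighboring NFY sites within each promoter. A minimal sketch, assuming the motif positions per promoter have already been determined (the function name, the position convention, and the 100 bp cap are illustrative assumptions):

```python
from collections import Counter


def adjacent_spacing_histogram(sites_by_promoter, max_distance=100):
    """Histogram of distances between neighbouring motif sites in each promoter.

    sites_by_promoter: list of lists of motif positions relative to the TSS.
    """
    histogram = Counter()
    for sites in sites_by_promoter:
        ordered = sorted(sites)
        for left, right in zip(ordered, ordered[1:]):
            distance = right - left
            if 0 < distance <= max_distance:
                histogram[distance] += 1
    return histogram


# Toy input; the first promoter mimics the -87/-55 arrangement described for CCNB1.
spacing = adjacent_spacing_histogram([[-87, -55], [-90, -58, 40], [-10]])
```

Histograms built this way for two species can then be normalized and compared position by position.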

6.4.4 Novel and Unexpected Motifs

Our survey identified a variety of novel motifs from a wide range of

species. Most of these motifs, despite being identified as enriched from –150 to

+50 bp from the TSS, are highly localized in the region from –100 to –50 bp, or

immediately upstream of the TATA region in the promoter (Figure 6.5). The

distribution of SP1 and NFY motifs relative to the TSS in mammalian promoters

are the quintessential examples of this pattern. Novel motifs found in N.

vectensis, Arabidopsis, and the protist Toxoplasma gondii are shown in essentially the same positions as SP1 in human promoters. CTCF, which was unexpectedly enriched in the promoters of various species of fish, displayed the same pattern. Proximal motifs in S. cerevisiae, exemplified by REB1, displayed a slight shift in their localization with maximal enrichment between –150 and

–100 bp from the TSS. We also observed that the TATA box and promoter proximal nucleosome positions exhibited the same shift (Figure 6.5c), implying that the relationship between proximal elements, the TATA box, and nearby nucleosomes is likely highly conserved in most species. In the case of yeast, transcriptional initiation has been shown to start further downstream from the TATA box than in mammalian systems, in a manner dependent on TFIIB147. These results suggest that the general layout of elements and nucleosomes in promoters is likely strongly conserved throughout eukaryotic evolution and that, in some cases, slight differences in the basal transcriptional machinery may affect the way RNA polymerase II scans the DNA and eventually initiates transcription.

6.4.5 Position dependent motifs

The large number of species with high quality TSS derived from full-length

RNA sequences provided the opportunity to repeat our search for core-promoter elements undertaken in Chapter 5. We searched the promoters of 11 species, including mammals, fish, insects, yeast, and plants, for position-specific motifs in the region from –50 bp to +50 bp. This analysis revealed a strong enrichment for the TATA box and the Inr in each of the species, with the exception of yeast (S. cerevisiae), which only showed enrichment for a poly-T tract in the region from –40 to –20 bp instead of the TATA box (Figure 6.6a). The Inr element was strongly conserved at the –1 and +1 positions (CA) and exhibited slight variation in the two nucleotides that followed (Figure 6.6b). Where variation occurs, a strict requirement for a pyrimidine (–1) followed by a purine (+1) is maintained at the Inr, consistent with previous studies4. The only species to show significant enrichment for core motifs downstream from the TSS was Drosophila, which was enriched for the MTE and DPE at the expected locations. Modest enrichment for the TATA, MTE, and DPE was observed in the honeybee, where the small number of full-length 5’ RNAs available reduced our chances to accurately map core elements in that species. Interestingly, the honeybee also showed modest enrichment for the mammalian motif YY1 in the same region as the MTE, which is consistently observed in vertebrates as well (data not shown).
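The pyrimidine/purine rule at the Inr lends itself to a simple check. The sketch below is illustrative rather than a description of the analysis behind Figure 6.6b: it assumes promoter sequences that are already aligned so that the +1 nucleotide sits at a known 0-based index (tss_index), and the function name inr_dinucleotide_summary is hypothetical.

```python
from collections import Counter

PYRIMIDINES = set("CT")
PURINES = set("AG")


def inr_dinucleotide_summary(promoters, tss_index):
    """Tally the (-1, +1) dinucleotide and the pyrimidine-purine fraction.

    tss_index is the 0-based index of the +1 nucleotide in every sequence
    (an assumption of this sketch: all promoters share the same alignment).
    """
    counts = Counter()
    for seq in promoters:
        seq = seq.upper()
        if tss_index < 1 or tss_index >= len(seq):
            continue  # sequence too short to contain both positions
        counts[seq[tss_index - 1:tss_index + 1]] += 1
    total = sum(counts.values())
    yr = sum(n for dinuc, n in counts.items()
             if dinuc[0] in PYRIMIDINES and dinuc[1] in PURINES)
    return counts, (yr / total if total else 0.0)
```

A fraction close to 1 would indicate that nearly all mapped start sites obey the pyrimidine (–1), purine (+1) rule, with the counts showing how often the canonical CA is used.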

6.5 Discussion

We constructed the inaugural eukaryote-wide database of transcription start sites, containing 69 species and over 300k promoter regions from all branches of the eukaryotic tree of life. Using primary 5’ RNA sequences instead of predicted ORFs, we were able to localize the TSS within the range needed to accurately identify promoter proximal motifs and nucleotide frequency patterns, thus enabling the discovery of conserved and divergent features of eukaryotic promoters. Our analysis revealed that while most of the specific DNA elements that recruit proximal transcription factors differed between distantly related organisms, the general layout of promoters is remarkably conserved (Figure 6.5).

This study highlights the robust conservation in proximal motif identity and syntax at the genome-wide level that greatly exceeds the rate of conservation seen in individual promoters. It is remarkable that NFY, NRF1, CRE, and MYC are highly enriched in several metazoan species considering nearly 90% turnover of promoters between mammals and fish and nearly 100% turnover between vertebrates and insects. Having said this, proximal motifs are the most conserved regions of genomic loci, with the exception of exons131, suggesting they are under enormous evolutionary pressure to preserve their identity since a mutation likely results in the elimination of expression from that locus. Not to be lost in this analysis is the alignment of cyclin B1 promoters from over 700 million years of sequence divergence, showcasing the extreme conservation of some transcriptional mechanisms.

These observations invite the question of how new promoter units are formed. Recent work demonstrating how long regulatory elements such as

CTCF can spontaneously mutate from repeats that contain pseudo-CTCF motifs encoded within the retroviral repeat itself suggests a mechanism by which conserved promoter architectures can spread throughout the genome148.

Alternatively, the relatively short length of most proximal motifs identified here suggests they can readily mutate over time to form new promoter regions. A careful analysis of related species, aided by extensive 5’ RNA sequencing, may be able to address these questions and will be left for future work.

While the lack of 5’ RNA data prevents us from discerning between focused and periodic promoter types in most species, widespread evidence of downstream nucleosome positioning patterns and mammalian proximal motifs strongly suggests periodic promoters are common in most, if not all, metazoans.

The prevalence of a subset of mammalian proximal motifs across the metazoan evolutionary tree also points to an ancestral motif signature comprised of NFY,

NRF, E-box, and CRE. Unfortunately, very few non-metazoan species have quality 5’ RNA information, severely limiting our ability to find clear evidence for focused or periodic promoters in more distantly related organisms. One exception is in plants, which exhibit a strong TATA box but no evidence for nucleosome positioning patterns. In contrast, all species of algae analyzed, including red algae, contain strong coarse-grain nucleosome positioning patterns downstream of their promoters and no clear evidence for TATA boxes. These two groups of organisms might represent polar opposites on the spectrum of periodic and focused promoter utilization. Widespread evidence for nucleosome positioning relative to transcriptional initiation in eukaryotes suggests the basic mechanisms underlying periodic promoter transcription may be as ancient as the nucleosomes themselves, providing the basis by which transcription factors and the general transcriptional machinery structurally interact with the chromatin environment.

Chapters 4, 5, and 6, in part, will be submitted for publication. Benner, C.,

Garcia-Bassets I., Heinz S., Kadonaga J.T., Rosenfeld M.G., Subramaniam S.,

Glass C.K. Identification of a conserved periodic promoter structure in metazoans. The dissertation author was the primary investigator and author of this paper.

Table 6.1: Summary of 5’ RNA sequencing used in the eukaryotic TSS database

Data Set: Human 5’ RNA (CAGE) [various tissues]4
Total Mapped Tags: 13,124,701 (hg18, 20 bp, ≤ 2 mismatch, ≤ 5 matches to genome)
Mapping/Usage Details: Leading “G” was removed from raw CAGE sequence. Multiple matches were considered due to short tag length (false positives mitigated by limiting analysis to tags near 5’ mRNA/ESTs)
Number of TSS/Peaks: 15,253 [≥ 15 tags per promoter, 19.9% focused, 44.8% periodic]

Data Set: Mouse 5’ RNA (CAGE) [various tissues]4
Total Mapped Tags: 14,467,522 (mm8, 20 bp, ≤ 2 mismatch, ≤ 5 matches to genome)
Mapping/Usage Details: Same as Human 5’ RNA
Number of TSS/Peaks: 15,281 [≥ 15 tags per promoter, 18.5% focused, 45.2% periodic]

Data Set: Drosophila 5’ RNA [various tissues]129
Total Mapped Tags: 27,177,789 (dm3, 25 bp, ≤ 2 mismatch, unique alignment in genome)
Mapping/Usage Details: Only tags within 200 bp of the 5’ end of annotated RefSeq were used due to cleavage products in this dataset within gene bodies.
Number of TSS/Peaks: 7,676 [≥ 15 tags per promoter, 39.9% focused, 30.3% periodic]

Data Set: Medaka 5’ RNA (5’ SAGE) [various tissues]137
Total Mapped Tags: 1,882,154 (oryLat1, 19 bp, ≤ 2 mismatch, ≤ 5 matches to genome)
Mapping/Usage Details: Same as Human 5’ RNA, except leading “G” was not removed
Number of TSS/Peaks: 4,287 [≥ 5 tags per promoter, not enough tags to determine periodic/focused promoters]

Data Set: Yeast 5’ RNA (5’ SAGE)138
Total Mapped Tags: 365,342 (sacCer1 [Oct 2003], 15-17 bp, ≤ 2 mismatch, ≤ 5 matches to genome)
Mapping/Usage Details: Same as Human 5’ RNA, except leading “G” was not removed
Number of TSS/Peaks: 3,708 [≥ 3 tags per promoter, not enough tags to determine periodic/focused promoters]

Table 6.2: Species, genome versions, and download locations for each species used in the eukaryotic TSS database

Figure 6.1: Summary of promoter data in all organisms studied. Colored rectangles correspond to motifs found by de novo motif discovery in the proximal promoter of each species. Dark green nucleosome positioning patterns correspond to species with strong evidence of fine or coarse-grain nucleosome positioning sequences found relative to the TSS. Light green patterns indicate species with weak patterns or species that are likely to have phased nucleosome-positioning patterns due to evidence in a closely related organism. The nucleotide frequency pattern column indicates general patterns shared by many organisms.

Figure 6.2: Nucleosome positioning patterns revealed by the alignment of promoter DNA. Nucleotide frequency from –600 to +600 bp in medaka (fish) (a), N. vectensis (sea anemone) (b), Arabidopsis (c), and C. merolae (red algae) (d). H3K4me3 nucleosome positions from human promoters were overlaid on metazoan promoters to show the overlap between nucleosome positions and coarse-grain frequency patterns.

Figure 6.3: Evolutionary conservation of promoters. (a) View of genomic alignments relative to human at the NFYA locus using the UCSC Genome Browser. The 5’ end of the NFYA gene (including promoters) is on the left. Aligned regions for each species are shown beneath the gene, with dark grey/black regions representing highly conserved regions and light grey representing weakly conserved regions. White regions indicate no alignment, and double lines represent unaligned regions between two syntenic regions (i.e. a gap in the alignment). (b) Conservation of metazoan promoters relative to mouse. Promoter alignment indicates the percentage of mouse promoters that align to each organism. Columns for NFY, NRF, and MYC indicate the significance and percentage of overlap between promoters containing each motif relative to promoters in other species driving homologous genes.

Figure 6.4: Conservation of NFY spacing. Distribution of spacing between NFY motifs in human (a) and N. vectensis (b) promoters. (c) Sequence alignment of promoter DNA driving cyclin B1 from 11 species. NFY (CCAAT) motifs are highlighted in yellow.

Figure 6.5: General distribution of motifs from the TSS. Distribution of top proximally enriched motif, TATA box, and nucleosomes for human (a), fish (b), S. cerevisiae (c), N. vectensis (d), Arabidopsis (e), and Toxoplasma gondii (f). (Note: TSS in (d) and (f) are NOT based on full-length RNA libraries.)

Figure 6.6: Core promoter elements in eukaryotes. (a) Results of de novo motif discovery from analyzing sequences with a fixed distance relative to the TSS. The negative log p-value of the top 10 bp motif is presented as a function of distance to the TSS. (b) Initiator motifs were identified for each organism where full-length 5’ RNA sequencing data was available.

Chapter 7: Conclusions

This study identified sequence elements that provide syntax for periodic promoter architecture. While both focused and periodic initiation strategies are associated with Inr elements and proximal sequence specific elements (PSSE), only focused promoters are enriched for position-specific core elements, such as

TATA and DPE elements, and lack nucleosome positioning sequences. Periodic promoters are characterized by periodic start sites that co-occur with rotationally positioned downstream nucleosomes and more pronounced periodicity and precise phasing of upstream PSSEs (Fig. 5.8). We propose that interactions between PSSEs, general transcription factors, and the downstream nucleosome can occur effectively with the nucleosome positioned at one of several energetically favored, phased positions, giving rise to periodic sites of transcriptional initiation when the appropriate combination of PSSEs and phased initiator elements are also present. In this model, it is likely that PSSE-binding factors in periodic promoters function in a manner analogous to factors that recognize position-specific elements in focused promoters, such as the TATA and DPE motifs. These factors are also known or expected to play a role in recruitment of the enzymatic machinery responsible for promoter-specific histone modifications. Overall, the configuration of periodic promoters appears to be ancient in origin and structurally optimized for self-assembly of constitutively active promoters that direct expression of genes required for general cellular functions. While the results from this study are biased toward the architectures of

metazoan promoters, future high-throughput sequencing of 5’ RNA from other species, coupled with the analysis strategies presented here, may yield new promoter architectures yet to be discovered.

In addition to the discovery of novel promoter architecture, we have described approaches to successfully infer the identities and spatial organization of DNA motifs from genome-wide expression and protein localization experiments. This led to the discovery and characterization of the T1ISRE, a previously unappreciated response element found in the promoters of Type I

Interferon induced genes. This motif differs from the canonical ISRE by a single nucleotide, which we document as a general strategy among families of response elements to confer specificity among similar transcription factors. This methodology also led to the identification of a novel motif for the embryonic stem cell factor Nanog. The Nanog motif is unique among the motifs identified in that it is in phase with nucleosome positioning sequences, leading to the prediction of its role as a pioneering factor for the assembly of Sox2 and Oct4 on embryonic stem cell enhancers.

Several general concepts emerged from this study. In terms of motif discovery, we learned that the most important step is the careful consideration of sequences used for analysis. Aside from correctly analyzing each experiment to identify co-regulated or co-enriched genes, intelligent selection of background sequences can lower the false discovery rate by removing CpG or TSS-associated sequence bias. Background sequences can also be leveraged to ask specific biological questions. It is unlikely that we would have been able to correctly and confidently identify the Nanog motif without specifically using Oct4-Sox2-Nanog co-bound regions as background to cancel out the Oct4-Sox2 composite motif from Nanog ChIP-Seq peaks (Chapter 3). Likewise, the identification of the T1ISRE and its likely significance would not have been possible without careful analysis of IFN-beta and IFN-gamma expression profiles (Chapter 2).

The second concept to emerge from this study was the importance of rotational positioning in protein-DNA complexes. The primary example of this concept is the periodic promoter, where we observe transcriptional activity at intervals of 10 bp, placing the initiating polymerase on a consistent “side” of the

DNA molecule (Chapter 5). This transcriptional periodicity is accompanied by

PSSEs and nucleosome positioning sequences that adhere to the same periodic patterns, placing each of these features in a consistent rotational position relative to one another. This concept reemerged when considering Nanog binding elements, where Sox2, Oct4, and putative nucleosome positioning sequences aligned with consistent rotational phasing relative to one another. Cues from the

DNA sequence can be used in these cases to infer the actual structural arrangement of the proteins, such as the preference of the minor groove of A/T-rich regions to face inward toward the core of a nucleosome. Taken together, these insights allowed us to reconstruct the structural configuration of protein complexes on the DNA, representing one of the first steps in assembling models of the molecular machines responsible for the regulation of transcription.
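One simple way to look for the roughly 10 bp periodicity discussed above is the autocorrelation of a per-position signal, such as 5’ tag counts along a promoter. The following is a minimal sketch under that assumption; it is not the statistical procedure used in Chapter 5, and the function name autocorrelation and its arguments are hypothetical.

```python
def autocorrelation(counts, max_lag):
    """Normalized autocorrelation of a per-position signal for lags 1..max_lag.

    counts holds one value per base pair (for example, 5' tag counts).
    A peak near lag 10 would be consistent with one turn of the DNA helix.
    """
    n = len(counts)
    mean = sum(counts) / n
    centered = [c - mean for c in counts]
    variance = sum(x * x for x in centered)
    result = {}
    for lag in range(1, max_lag + 1):
        covariance = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        result[lag] = covariance / variance if variance else 0.0
    return result
```

The same calculation could be applied to the positions of PSSEs or to nucleotide frequency profiles to ask whether they share the phase of the start sites.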

Future experimentation and analysis are needed to resolve ambiguities and provide more convincing support for the major findings presented in this work. Our model of the periodic promoter involves the interaction of PSSEs, the

Inr, and a downstream nucleosome, working together to initiate transcription.

The relative contribution of each element should be investigated using a panel of mutated promoter constructs. Likewise, the exciting possibility that Nanog functions as a true

“pioneering” factor that targets the removal of nucleosomes requires experimental follow-up. A template that is stably integrated or chromatinized in vitro can be used to test the rules governing the accessibility of Nanog, Sox2, and Oct4 motifs under a variety of genetic mutations. The rational design of templates that modify the rotational position of nucleosomes and motifs relative to one another may provide valuable insight into the natural design of distal enhancer regions in the genome. Several key experiments are missing to functionally validate the importance of the T1ISRE. The mutation of T1ISREs within their target promoters and in vivo confirmation of the binding of ISGF3 by ChIP have not been performed. In addition, we predict that by swapping the T1ISRE for an ISRE, or vice versa, the specificity for Type I Interferon signaling would swap as well.

An interesting observation that came from the analysis of inflammatory response elements is the remarkable similarity between the basic building blocks of each motif. If we disregard full response elements and instead focus on the half-sites that bind NFkB, IRF, STAT and even ETS proteins, we find that each of these factors recognizes the site “GGAA”. Generally speaking, there are three basic orientations this site can be placed in: 5’ first, 3’ second (GGAA.TTCC [kB]), 3’ first, 5’ second (TTCC.GGAA [STAT]), or 5’ in tandem (GGAA...GGAA [ISRE]). After considering these simple rules and the basic building block on which they are based, it becomes more sensible that different spacings or orientations may change the composition of transcription factors on a binding site. For example, the ISRE (GAAANNGAAA) classically binds IRF homo- and heterodimers, while the T1ISRE (GAAANGAAA) appears to have specific affinity for a STAT-IRF heterodimer. Another variation on this theme is the PU.1/IRF composite element (GGAAGNGAAA)149, which differs from the T1ISRE by a single base. It is likely that focused analysis of different spacing, orientation, and exact sequence composition may give rise to a wide variety of specific regulatory elements with novel function within the immune system.
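The three arrangements of the GGAA half-site can be enumerated directly from sequence. The sketch below is a hedged illustration of that idea: the function name half_site_pairs, the labels, and the max_gap limit on the spacer between half-sites are all assumptions made for this example and are not parameters used elsewhere in this work.

```python
import re

# Labels keyed by the strands of two neighboring half-sites
# ("+" = GGAA on the given strand, "-" = TTCC, its reverse complement).
ARRANGEMENTS = {
    "+-": "GGAA.TTCC (kB-like)",
    "-+": "TTCC.GGAA (STAT-like)",
    "++": "GGAA...GGAA (ISRE-like tandem)",
    "--": "TTCC...TTCC (tandem on the opposite strand)",
}


def half_site_pairs(seq, max_gap=4):
    """Classify neighboring GGAA/TTCC half-sites by orientation and spacer."""
    seq = seq.upper()
    sites = [(m.start(), "+") for m in re.finditer(r"(?=GGAA)", seq)]
    sites += [(m.start(), "-") for m in re.finditer(r"(?=TTCC)", seq)]
    sites.sort()
    pairs = []
    for (pos1, strand1), (pos2, strand2) in zip(sites, sites[1:]):
        gap = pos2 - (pos1 + 4)  # nucleotides between the two 4-bp half-sites
        if 0 <= gap <= max_gap:
            pairs.append((pos1, gap, ARRANGEMENTS[strand1 + strand2]))
    return pairs
```

Tabulating the reported arrangements and spacer lengths across a set of co-regulated promoters would be one way to begin the focused analysis of spacing and orientation suggested above.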

Our current approach to sequence analysis has been naïve in light of the complex interactions seen between genetic elements. The workflow has been to isolate sequences that hopefully contain the same transcription factor binding sites (i.e. co-enriched or co-regulated), identify motifs, and then try to reconstruct patterns between motifs and mine for associations with the DNA sequence and

ChIP-Seq tags. However, now that we have examples of how regulatory elements are arranged on DNA, it may be possible to leverage these relationships to create more sensitive computational tools. For example, the prevalence of elements with consistent rotational phasing may lead us to search for pairs of motifs that specifically exhibit this property, increasing our sensitivity for

these elements and leading to the discovery of motifs that we would have

otherwise missed. Another example would be to look for elements that are

consistently phased with nucleosome positioning sequences, aiding in the

discovery of a “pioneer class” of transcription factors and their motifs.

Aside from the many conceptual advances made in this work, a number of

valuable resources were produced. First, we created HOMER, a motif discovery

algorithm and ChIP-Seq analysis tool. Second, we have created a vast promoter

database that includes many highly accurate TSS based on a large number of

species (69 total). Differential motif discovery algorithms such as Amadeus18 are

built on principles similar to the motif discovery algorithm in HOMER. They

normally find very similar motifs if given the exact same input sequences for

analysis. However, the Amadeus program contains its own promoter library

based on TSS from Ensembl (http://www.ensembl.org/), and normally fails to find

any significant motifs if given several of the data sets from Chapter 2 without

changing the promoter library. The last major resource produced from this study

is the final list of motifs identified by HOMER. As ChIP-Seq data is analyzed, the

obvious consensus motifs discovered by HOMER are saved. In many ways,

ChIP-Seq resembles a massive in vivo SELEX experiment150, and the motifs it yields are arguably considerably more accurate than those contained within TRANSFAC29.

Advances in sequencing technology during the course of this work played

an invaluable role in this study. The ability to sequence millions of individual

molecules in a single experiment provided the resolution of 5’ RNA, nucleosome,

and transcription factor binding positions needed to discover periodic promoters.

This technology has already revolutionized the way ChIP experiments are measured, and within the near future all RNA will be measured by sequencing, thus making microarrays a relic of the past. It is also worth admiring the scientific community’s commitment to sharing data, since it was the experimental work of thousands of authors in well over two hundred studies, and not our own, that made this work possible.

Appendix A: Algorithm Pseudocode

Exhaustive Search Phase:

O : oligo table of all DNA sequences of length w
R : list of enrichments (p-value) for each oligo in O, initialized to 1.

For each oligo oi in O and each corresponding enrichment ri in R
    For each number of mismatches m (i.e. 0, 1, 2)
        putativeMotif = all oligos ox within m mismatches of oi
        rtemp = calculateEnrichment( putativeMotif )
        ri = rtemp if rtemp < ri
    end
end

Rsorted = sort R from most enriched to least.
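The exhaustive search phase can be sketched in Python as follows. This is an illustrative reading of the pseudocode above, not the HOMER source code: the function names (neighbors, exhaustive_search) are hypothetical, and the enrichment calculation is passed in as a callable so that the sketch stays independent of any particular statistic.

```python
from itertools import combinations, product


def neighbors(oligo, mismatches):
    """All DNA strings of the same length within `mismatches` substitutions."""
    found = {oligo}
    for k in range(1, mismatches + 1):
        for positions in combinations(range(len(oligo)), k):
            for letters in product("ACGT", repeat=k):
                candidate = list(oligo)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                found.add("".join(candidate))
    return found


def exhaustive_search(oligo_table, enrichment, max_mismatches=2):
    """Score every seed oligo by its best enrichment over 0..max_mismatches.

    oligo_table: iterable of oligos observed in the data.
    enrichment:  callable taking a set of oligos and returning a p-value
                 (smaller means more enriched); an assumption of this sketch.
    Returns (oligo, p-value) pairs sorted from most to least enriched.
    """
    table = set(oligo_table)
    results = {}
    for seed in table:
        best = 1.0
        for m in range(max_mismatches + 1):
            # Only oligos that actually occur in the table can contribute.
            putative_motif = neighbors(seed, m) & table
            best = min(best, enrichment(putative_motif))
        results[seed] = best
    return sorted(results.items(), key=lambda item: item[1])
```

Enumerating neighbors explicitly is only practical for short oligos and few mismatches, since the neighborhood grows combinatorially with both.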

Local Optimization:

M = optimized motifs composed of matrices (mi), thresholds (di), and enrichments (ri) for each seed oligo oi

For top x oligos in Rsorted
    mi = initialize probability matrix for oligo oi corresponding to Rsorted(i)
    (di,ri) = findOptimalThreshold(mi,O)
    rlast = 1
    While ri < rlast
        rlast = ri
        mi = refineMatrix(mi, di, O)**
        (di,ri) = findOptimalThreshold(mi,O)**
    End
    mi,di,ri represent optimized motif and enrichment for seed oligo i
    removeOligosInMotifFromWholeSet(O, mi, di)
End

Output Msorted = sort motifs in M based on enrichment ri

** additional values of di are typically used to avoid a local minimum

Sub calculateEnrichment (putativeMotif, Oligo Table O)

    # find total number of oligos represented by the motif
    ntarget = 0 : number of occurrences in target set
    nbackground = 0 : number of occurrences in background set
    Ns = number of total sequences
    Nstarget = number of target sequences
    Nsbackground = number of background sequences

    For each oligo oi in putativeMotif
        ntarget = ntarget + number of times oi appears in target set
        nbackground = nbackground + number of times oi appears in background set
    end

    # calculate the expected number of sequences having at least one oligo
    nstarget = 0 : number of sequences containing motif in target set
    nsbackground = 0 : number of sequences containing motif in background set
    for i = 1, i <= ntarget, increment by 1
        if i = 1, nstarget = 1
        else, nstarget = nstarget + (Nstarget - nstarget)/Nstarget
    end
    for i = 1, i <= nbackground, increment by 1
        if i = 1, nsbackground = 1
        else, nsbackground = nsbackground + (Nsbackground - nsbackground)/Nsbackground
    end

    p-value = hypergeometric(Ns, Nstarget, nstarget + nsbackground, nstarget)
    return p-value
End sub
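A runnable sketch of this enrichment calculation is given below. It makes several assumptions that the pseudocode leaves open: the hypergeometric score is taken to be the upper-tail probability of observing at least nstarget target sequences, the expected sequence counts are rounded to integers before the test, and each recursion uses the size of its own sequence set. All function names are hypothetical.

```python
from math import comb


def expected_sequences_hit(n_occurrences, n_sequences):
    """Expected number of distinct sequences hit by n randomly placed occurrences.

    Mirrors the recursion in the pseudocode: each new occurrence lands in a
    not-yet-hit sequence with probability (N - hit) / N.
    """
    hit = 0.0
    for _ in range(n_occurrences):
        hit += (n_sequences - hit) / n_sequences
    return hit


def hypergeometric_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing `draws` items without replacement."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favorable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favorable / total


def calculate_enrichment(n_target, n_background, n_seq_target, n_seq_background):
    """p-value for motif enrichment in target versus background sequences."""
    # Assumption: round the expected counts so the exact test can be applied.
    ns_target = round(expected_sequences_hit(n_target, n_seq_target))
    ns_background = round(expected_sequences_hit(n_background, n_seq_background))
    return hypergeometric_tail(
        n_seq_target + n_seq_background,  # Ns: total sequences
        n_seq_target,                     # Nstarget
        ns_target + ns_background,        # sequences containing the motif
        ns_target,                        # of which are target sequences
    )
```

The recursion has the closed form N(1 – (1 – 1/N)^n), which can be used instead of the loop when the number of occurrences is large.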

Sub findOptimalThreshold (matrix m, Oligo Table O)

    S : scores representing the similarity of oligos to probability matrix
    Osorted = sort oligos with most similar oligos first based on S

    putativeMotif = null
    dbest = 0 (optimal threshold)
    rbest = 1 (enrichment of optimal threshold)
    For each oligo oi in Osorted and corresponding similarity score si in S
        Add oi to putativeMotif
        ri = calculateEnrichment(putativeMotif)
        if (ri < rbest)
            rbest = ri
            dbest = si
        end
    end
    return (dbest , rbest)
End sub
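The threshold search can be sketched as a single pass over the oligos ranked by similarity to the matrix. As with the other sketches, this is an illustrative reading of the pseudocode rather than the HOMER implementation: the similarity scores are assumed to be precomputed, the enrichment statistic is supplied as a callable, and the name find_optimal_threshold is hypothetical.

```python
def find_optimal_threshold(scored_oligos, enrichment):
    """Return the (score threshold, p-value) giving the best enrichment.

    scored_oligos: list of (oligo, similarity score) pairs; higher scores
                   mean closer agreement with the probability matrix.
    enrichment:    callable taking a list of oligos and returning a p-value
                   (smaller means more enriched); an assumption of this sketch.
    """
    ranked = sorted(scored_oligos, key=lambda item: item[1], reverse=True)
    putative_motif = []
    best_threshold, best_p = 0.0, 1.0
    for oligo, score in ranked:
        # Grow the motif one oligo at a time, most similar first.
        putative_motif.append(oligo)
        p_value = enrichment(putative_motif)
        if p_value < best_p:
            best_p, best_threshold = p_value, score
    return best_threshold, best_p
```

Because each step re-evaluates the enrichment of the growing oligo set, an incremental enrichment calculation would be preferable for large oligo tables.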

Sub refineMatrix (matrix m, threshold d, Oligo Table O)

S : scores representing the similarity of oligos to probability matrix;

    mnew = empty probability matrix
    currentMotif = set of oligos with similarity score greater than d
    r = calculateEnrichment(currentMotif)

    For each oligo oi with similarity score greater than d (or oligos in currentMotif)
        testMotif = currentMotif excluding oi
        ri = calculateEnrichment(testMotif)
        wi = log(ri) – log(r)
        if wi > 0
            mnew = mnew + wi for the position of each nucleotide in oi
        end
    end
    normalize each position in mnew to sum to 1 (i.e. resemble probability matrix)
    return mnew
End sub

Appendix B: Evolutionary Analysis of Proximal Promoters

The following pages report the nucleotide frequency distribution relative to the

TSS and enriched proximal motifs in the interval [-150,+50] for 69 organisms.

Simple mono-, di-, or trinucleotide motifs were removed from the results if found and are likely due to nucleotide bias found in the promoter regions of some organisms.

References

1. McPherson, J.D. et al. A physical map of the human genome. Nature 409, 934-41 (2001).

2. Venter, J.C. et al. The sequence of the human genome. Science 291, 1304-51 (2001).

3. Waterston, R.H. et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520-62 (2002).

4. Carninci, P. et al. The transcriptional landscape of the mammalian genome. Science 309, 1559-63 (2005).

5. Kadonaga, J.T. Regulation of RNA polymerase II transcription by sequence-specific DNA binding factors. Cell 116, 247-57 (2004).

6. Smale, S.T. & Kadonaga, J.T. The RNA polymerase II core promoter. Annu Rev Biochem 72, 449-79 (2003).

7. Bird, A., Taggart, M., Frommer, M., Miller, O.J. & Macleod, D. A fraction of the mouse genome that is derived from islands of nonmethylated, CpG-rich DNA. Cell 40, 91-9 (1985).

8. Das, M.K. & Dai, H.K. A survey of DNA motif finding algorithms. BMC Bioinformatics 8 Suppl 7, S21 (2007).

9. Tompa, M. et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 23, 137-44 (2005).

10. Bailey, T.L. & Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. UCSD Technical Report CS94(1994).

11. Roth, F.P., Hughes, J.D., Estep, P.W. & Church, G.M. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol 16, 939-45 (1998).

12. Liu, X.S., Brutlag, D.L. & Liu, J.S. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat Biotechnol 20, 835-9 (2002).

13. Pavesi, G., Mauri, G. & Pesole, G. An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 17 Suppl 1, S207-14 (2001).


14. Barash, Y., Bejerano, G. & Friedman, N. A Simple Hyper-Geometric Approach for Discovering Putative Transcription Factor Binding Sites. Proc. First International Workshop on Algorithms in Bioinformatics (WABI) (2001).

15. Segal, E., Barash, Y., Simon, I., Friedman, N. & Koller, D. From promoter sequence to expression: a probabilistic framework. Proc. 6th Inter. Conf. on Research in Computational Molecular Biology (RECOMB), Washington, DC. (2002).

16. Smith, A.D., Sumazin, P. & Zhang, M.Q. Identifying tissue-selective transcription factor binding sites in vertebrate promoters. Proc Natl Acad Sci U S A 102, 1560-5 (2005).

17. Ettwiller, L., Paten, B., Ramialison, M., Birney, E. & Wittbrodt, J. Trawler: de novo regulatory motif discovery pipeline for chromatin immunoprecipitation. Nat Methods 4, 563-5 (2007).

18. Linhart, C., Halperin, Y. & Shamir, R. Transcription factor and microRNA motif discovery: the Amadeus platform and a compendium of metazoan target sets. Genome Res 18, 1180-9 (2008).

19. Kellis, M., Patterson, N., Endrizzi, M., Birren, B. & Lander, E.S. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423, 241-54 (2003).

20. Harris, M.A. et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32, D258-61 (2004).

21. Shibata, N. & Glass, C.K. Regulation of macrophage function in inflammation and atherosclerosis. J Lipid Res (2008).

22. Kaisho, T. & Akira, S. Toll-like receptors and their signaling mechanism in innate immunity. Acta Odontol Scand 59, 124-30 (2001).

23. Li, A.C. et al. Peroxisome proliferator-activated receptor gamma ligands inhibit development of atherosclerosis in LDL receptor-deficient mice. J Clin Invest 106, 523-31 (2000).

24. Hevener, A.L. et al. Macrophage PPAR gamma is required for normal skeletal muscle and hepatic insulin sensitivity and full antidiabetic effects of thiazolidinediones. J Clin Invest 117, 1658-69 (2007).

25. Crooks, G.E., Hon, G., Chandonia, J.M. & Brenner, S.E. WebLogo: a sequence logo generator. Genome Res 14, 1188-90 (2004).

26. Chuck Norris FAQ. (2007).

27. Lim, C.A. et al. Genome-wide mapping of RELA(p65) binding identifies E2F1 as a transcriptional activator recruited by NF-kappaB upon TLR4 activation. Mol Cell 27, 622-35 (2007).

28. Cam, H. et al. A common set of gene regulatory networks links metabolism and growth inhibition. Mol Cell 16, 399-411 (2004).

29. Knuppel, R., Dietze, P., Lehnberg, W., Frech, K. & Wingender, E. TRANSFAC retrieval program: a network model database of eukaryotic transcription regulating sequences and proteins. J Comput Biol 1, 191-8 (1994).

30. Sandelin, A., Alkema, W., Engstrom, P., Wasserman, W.W. & Lenhard, B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res 32, D91-4 (2004).

31. Schones, D.E., Sumazin, P. & Zhang, M.Q. Similarity of position frequency matrices for transcription factor binding sites. Bioinformatics 21, 307-13 (2005).

32. Heinz, S. et al. Species-specific regulation of Toll-like receptor 3 genes in men and mice. J Biol Chem 278, 21502-9 (2003).

33. Irizarry, R.A. et al. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 31, e15 (2003).

34. Eisen, M.B., Spellman, P.T., Brown, P.O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863-8 (1998).

35. Saldanha, A.J. Java Treeview--extensible visualization of microarray data. Bioinformatics 20, 3246-8 (2004).

36. Odom, D.T. et al. Core transcriptional regulatory circuitry in human hepatocytes. Mol Syst Biol 2, 2006.0017 (2006).

37. Boyer, L.A. et al. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell 122, 947-56 (2005).

38. Schreiber, J. et al. Coordinated binding of NF-kappaB family members in the response of human cells to lipopolysaccharide. Proc Natl Acad Sci U S A 103, 5899-904 (2006).

39. Odom, D.T. et al. Control of pancreas and liver gene expression by HNF transcription factors. Science 303, 1378-81 (2004).

40. Zhang, X. et al. Genome-wide analysis of cAMP-response element binding protein occupancy, phosphorylation, and target gene activation in human tissues. Proc Natl Acad Sci U S A 102, 4459-64 (2005).

41. Ren, B. et al. Genome-wide location and function of DNA binding proteins. Science 290, 2306-9 (2000).

42. Carroll, J.S. et al. Chromosome-wide mapping of estrogen receptor binding reveals long-range regulation requiring the forkhead protein FoxA1. Cell 122, 33-43 (2005).

43. Leung, T.H., Hoffmann, A. & Baltimore, D. One nucleotide in a kappaB site can determine cofactor specificity for NF-kappaB dimers. Cell 118, 453-64 (2004).

44. Pine, R. & Darnell, J.E., Jr. In vivo evidence of interaction between interferon-stimulated gene factors and the interferon-stimulated response element. Mol Cell Biol 9, 3533-7 (1989).

45. Decker, T., Lew, D.J., Mirkovitch, J. & Darnell, J.E., Jr. Cytoplasmic activation of GAF, an IFN-gamma-regulated DNA-binding factor. Embo J 10, 927-32 (1991).

46. Zwicker, J. et al. Cell cycle regulation of the cyclin A, cdc25C and cdc2 genes is based on a common mechanism of transcriptional repression. Embo J 14, 4514-22 (1995).

47. Mudryj, M., Hiebert, S.W. & Nevins, J.R. A role for the adenovirus inducible E2F transcription factor in a proliferation dependent signal transduction pathway. Embo J 9, 2179-84 (1990).

48. Zhuang, J.C. & Wogan, G.N. Growth and viability of macrophages continuously stimulated to produce nitric oxide. Proc Natl Acad Sci U S A 94, 11875-80 (1997).

49. Guha, M. & Mackman, N. LPS induction of gene expression in human monocytes. Cell Signal 13, 85-94 (2001).

50. Johnson, G.L. & Nakamura, K. The c-jun kinase/stress-activated pathway: regulation, function and role in human disease. Biochim Biophys Acta 1773, 1341-8 (2007).

51. Israel, A. et al. TNF stimulates expression of mouse MHC class I genes by inducing an NF kappa B-like enhancer binding activity which displaces constitutive factors. Embo J 8, 3793-800 (1989).

52. Pestka, S., Krause, C.D. & Walter, M.R. Interferons, interferon-like cytokines, and their receptors. Immunol Rev 202, 8-32 (2004).

53. Steimle, V., Siegrist, C.A., Mottet, A., Lisowska-Grospierre, B. & Mach, B. Regulation of MHC class II expression by interferon-gamma mediated by the transactivator gene CIITA. Science 265, 106-9 (1994).

54. Sato, M. et al. Distinct and essential roles of transcription factors IRF-3 and IRF-7 in response to viruses for IFN-alpha/beta gene induction. Immunity 13, 539-48 (2000).

55. Muller, U. et al. Functional role of type I and type II interferons in antiviral defense. Science 264, 1918-21 (1994).

56. Kessler, D.S., Veals, S.A., Fu, X.Y. & Levy, D.E. Interferon-alpha regulates nuclear translocation and DNA-binding affinity of ISGF3, a multimeric transcriptional activator. Genes Dev 4, 1753-65 (1990).

57. Schirmer, S.H. et al. Interferon-beta signaling is enhanced in patients with insufficient coronary collateral artery development and inhibits arteriogenesis in mice. Circ Res 102, 1286-94 (2008).

58. Habenicht, A. Aortae of 32 weeks old apoE mice (GSE2372). Gene Expression Omnibus NCBI (2005).

59. Napolitano, M. & Capogrossi, M. Expression data from heart failure vs control peripheral blood mononuclear cells (GSE9128). Gene Expression Omnibus NCBI (2007).

60. Moisoi, N. et al. Mitochondrial dysfunction triggered by loss of HtrA2 results in the activation of a brain-specific transcriptional stress response. Cell Death Differ (2008).

61. Davis, L. Expression data from human peripheral blood subsets (GSE10325). Gene Expression Omnibus NCBI (2008).

62. Lan, H. et al. Gene expression profiles of nondiabetic and diabetic obese mice suggest a role of hepatic lipogenic capacity in diabetes susceptibility. Diabetes 52, 688-700 (2003).

63. Lee, Y.H. et al. Microarray profiling of isolated abdominal subcutaneous adipocytes from obese vs non-obese Pima Indians: increased expression of inflammation-related genes. Diabetologia 48, 1776-83 (2005).

64. Baur, J.A. et al. Resveratrol improves health and survival of mice on a high-calorie diet. Nature 444, 337-42 (2006).

65. Kunsch, C., Ruben, S.M. & Rosen, C.A. Selection of optimal kappa B/Rel DNA-binding motifs: interaction of both subunits of NF-kappa B with DNA is required for transcriptional activation. Mol Cell Biol 12, 4412-21 (1992).

66. Kraus, J., Borner, C. & Hollt, V. Distinct palindromic extensions of the 5'-TTC...GAA-3' motif allow STAT6 binding in vivo. Faseb J 17, 304-6 (2003).

67. David, M., Romero, G., Zhang, Z.Y., Dixon, J.E. & Larner, A.C. In vitro activation of the transcription factor ISGF3 by interferon alpha involves a membrane-associated tyrosine phosphatase and tyrosine kinase. J Biol Chem 268, 6593-9 (1993).

68. Matsumoto, M. et al. Activation of the transcription factor ISGF3 by interferon-gamma. Biol Chem 380, 699-703 (1999).

69. Johnson, D.S., Mortazavi, A., Myers, R.M. & Wold, B. Genome-wide mapping of in vivo protein-DNA interactions. Science 316, 1497-502 (2007).

70. Barski, A. et al. High-resolution profiling of histone methylations in the human genome. Cell 129, 823-37 (2007).

71. Zhang, Y. et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol 9, R137 (2008).

72. Marson, A. et al. Connecting microRNA genes to the core transcriptional regulatory circuitry of embryonic stem cells. Cell 134, 521-33 (2008).

73. Chen, X. et al. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 133, 1106-17 (2008).

74. Kharchenko, P.V., Tolstorukov, M.Y. & Park, P.J. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol (2008).

75. Wederell, E.D. et al. Global analysis of in vivo Foxa2-binding sites in mouse adult liver using massively parallel sequencing. Nucleic Acids Res 36, 4549-64 (2008).

76. Qadri, I. et al. Interaction of hepatocyte nuclear factors in transcriptional regulation of tissue specific hormonal expression of human multidrug resistance-associated protein 2 (abcc2). Toxicol Appl Pharmacol (2008).

77. Blackwell, T.K., Kretzner, L., Blackwood, E.M., Eisenman, R.N. & Weintraub, H. Sequence-specific DNA binding by the c-Myc protein. Science 250, 1149-51 (1990).

78. Brodsky, A.S. et al. Genomic mapping of RNA polymerase II reveals sites of co-transcriptional regulation in human cells. Genome Biol 6, R64 (2005).

79. Guerrini, L., Gong, S.S., Mangasarian, K. & Basilico, C. Cis- and trans-acting elements involved in amino acid regulation of asparagine synthetase gene expression. Mol Cell Biol 13, 3202-12 (1993).

80. Arango, D., Corner, G.A., Wadler, S., Catalano, P.J. & Augenlicht, L.H. c-myc/p53 interaction determines sensitivity of human colon carcinoma cells to 5-fluorouracil in vitro and in vivo. Cancer Res 61, 4910-5 (2001).

81. Uramoto, H. et al. p73 Interacts with c-Myc to regulate Y-box-binding protein-1 expression. J Biol Chem 277, 31694-702 (2002).

82. Calhoun, V.C., Stathopoulos, A. & Levine, M. Promoter-proximal tethering elements regulate enhancer-promoter specificity in the Drosophila Antennapedia complex. Proc Natl Acad Sci U S A 99, 9243-7 (2002).

83. Michaud, J. et al. Integrative analysis of RUNX1 downstream pathways and target genes. BMC Genomics 9, 363 (2008).

84. Friedman, A.D. Transcriptional control of granulocyte and monocyte development. Oncogene 26, 6816-28 (2007).

85. Kim, T.H. et al. A high-resolution map of active promoters in the human genome. Nature 436, 876-80 (2005).

86. Kim, T.H. et al. Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell 128, 1231-45 (2007).

87. Boyle, A.P. et al. High-resolution mapping and characterization of open chromatin across the genome. Cell 132, 311-22 (2008).

88. Schones, D.E. et al. Dynamic regulation of nucleosome positioning in the human genome. Cell 132, 887-98 (2008).

89. Fu, Y., Sinha, M., Peterson, C.L. & Weng, Z. The insulator binding protein CTCF positions 20 nucleosomes around its binding sites across the human genome. PLoS Genet 4, e1000138 (2008).

90. Loh, Y.H. et al. The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells. Nat Genet 38, 431-40 (2006).

91. Jauch, R., Ng, C.K., Saikatendu, K.S., Stevens, R.C. & Kolatkar, P.R. Crystal structure and DNA binding of the homeodomain of the stem cell transcription factor Nanog. J Mol Biol 376, 758-70 (2008).

92. Williams, D.C., Jr., Cai, M. & Clore, G.M. Molecular basis for synergistic transcriptional activation by Oct1 and Sox2 revealed from the solution structure of the 42-kDa Oct1.Sox2.Hoxb1-DNA ternary transcription factor complex. J Biol Chem 279, 1449-57 (2004).

93. Luger, K., Mader, A.W., Richmond, R.K., Sargent, D.F. & Richmond, T.J. Crystal structure of the nucleosome core particle at 2.8 A resolution. Nature 389, 251-60 (1997).

94. Luger, K., Mäder, A.W., Richmond, R.K., Sargent, D.F. & Richmond, T.J. Crystal structure of the nucleosome core particle at 2.8 A resolution. Nature 389, 251-60 (1997).

95. Staynov, D.Z. The controversial 30 nm chromatin fibre. Bioessays 30, 1003-9 (2008).

96. Ramsay, N., Felsenfeld, G., Rushton, B.M. & McGhee, J.D. A 145-base pair DNA sequence that positions itself precisely and asymmetrically on the nucleosome core. Embo J 3, 2605-11 (1984).

97. Satchwell, S.C., Drew, H.R. & Travers, A.A. Sequence periodicities in chicken nucleosome core DNA. J Mol Biol 191, 659-75 (1986).

98. Segal, E. et al. A genomic code for nucleosome positioning. Nature 442, 772-8 (2006).

99. Albert, I. et al. Translational and rotational settings of H2A.Z nucleosomes across the Saccharomyces cerevisiae genome. Nature 446, 572-6 (2007).

100. Almer, A., Rudolph, H., Hinnen, A. & Horz, W. Removal of positioned nucleosomes from the yeast PHO5 promoter upon PHO5 induction releases additional upstream activating DNA elements. Embo J 5, 2689-96 (1986).

101. Allfrey, V.G., Faulkner, R. & Mirsky, A.E. Acetylation and Methylation of Histones and Their Possible Role in the Regulation of RNA Synthesis. Proc Natl Acad Sci U S A 51, 786-94 (1964).

102. Pokholok, D.K. et al. Genome-wide map of nucleosome acetylation and methylation in yeast. Cell 122, 517-27 (2005).

103. Cao, R. et al. Role of histone H3 lysine 27 methylation in Polycomb-group silencing. Science 298, 1039-43 (2002).

104. Vandel, L. et al. Transcriptional repression by the retinoblastoma protein through the recruitment of a histone methyltransferase. Mol Cell Biol 21, 6484-94 (2001).

105. Redon, C. et al. Histone H2A variants H2AX and H2AZ. Curr Opin Genet Dev 12, 162-9 (2002).

106. Ahmad, K. & Henikoff, S. The histone variant H3.3 marks active chromatin by replication-independent nucleosome assembly. Mol Cell 9, 1191-200 (2002).

107. Strahl, B.D. & Allis, C.D. The language of covalent histone modifications. Nature 403, 41-5 (2000).

108. Lee, C.K., Shibata, Y., Rao, B., Strahl, B.D. & Lieb, J.D. Evidence for nucleosome depletion at active regulatory regions genome-wide. Nat Genet 36, 900-5 (2004).

109. Archer, T.K., Lefebvre, P., Wolford, R.G. & Hager, G.L. Transcription factor loading on the MMTV promoter: a bimodal mechanism for promoter activation. Science 255, 1573-6 (1992).

110. Wang, Z. et al. Combinatorial patterns of histone acetylations and methylations in the human genome. Nat Genet 40, 897-903 (2008).

111. Valouev, A. et al. A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning. Genome Res 18, 1051-63 (2008).

112. Horz, W. & Altenburger, W. Sequence specific cleavage of DNA by micrococcal nuclease. Nucleic Acids Res 9, 2643-58 (1981).

113. Dingwall, C., Lomonossoff, G.P. & Laskey, R.A. High sequence specificity of micrococcal nuclease. Nucleic Acids Res 9, 2659-73 (1981).

114. Sutton, D.H., Conn, G.L., Brown, T. & Lane, A.N. The dependence of DNase I activity on the conformation of oligodeoxynucleotides. Biochem J 321 ( Pt 2), 481-6 (1997).

115. Gorin, A.A., Zhurkin, V.B. & Olson, W.K. B-DNA twisting correlates with base-pair morphology. J Mol Biol 247, 34-48 (1995).

116. Richmond, T.J. & Davey, C.A. The structure of DNA in the nucleosome core. Nature 423, 145-50 (2003).

117. Ong, M.S., Richmond, T.J. & Davey, C.A. DNA stretching and extreme kinking in the nucleosome core. J Mol Biol 368, 1067-74 (2007).

118. Thomas, M.C. & Chiang, C.M. The general transcription machinery and general cofactors. Crit Rev Biochem Mol Biol 41, 105-78 (2006).

119. Juven-Gershon, T., Hsu, J.Y. & Kadonaga, J.T. Perspectives on the RNA polymerase II core promoter. Biochem Soc Trans 34, 1047-50 (2006).

120. Buratowski, S., Hahn, S., Sharp, P.A. & Guarente, L. Function of a yeast TATA element-binding protein in a mammalian transcription system. Nature 334, 37-42 (1988).

121. Smale, S.T. & Baltimore, D. The "initiator" as a transcription control element. Cell 57, 103-13 (1989).

122. Burke, T.W. & Kadonaga, J.T. Drosophila TFIID binds to a conserved downstream basal promoter element that is present in many TATA-box-deficient promoters. Genes Dev 10, 711-24 (1996).

123. Lim, C.Y. et al. The MTE, a new core promoter element for transcription by RNA polymerase II. Genes Dev 18, 1606-17 (2004).

124. Carninci, P. et al. Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet 38, 626-35 (2006).

125. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J. & Sayers, E.W. GenBank. Nucleic Acids Res (2008).

126. Kent, W.J. BLAT--the BLAST-like alignment tool. Genome Res 12, 656-64 (2002).

127. Karolchik, D. et al. The UCSC Genome Browser Database. Nucleic Acids Res 31, 51-4 (2003).

128. Garcia-Bassets, I. et al. Histone methylation-dependent mechanisms impose ligand dependency for gene activation by nuclear receptors. Cell 128, 505-18 (2007).

129. Ahsan, B. et al. MachiBase: a Drosophila melanogaster 5'-end mRNA transcription database. Nucleic Acids Res (2008).

130. Mavrich, T.N. et al. Nucleosome organization in the Drosophila genome. Nature 453, 358-62 (2008).

131. Xie, X. et al. Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature 434, 338-45 (2005).

132. Su, M., Lee, D., Ganss, B. & Sodek, J. Stereochemical analysis of the functional significance of the conserved inverted CCAAT and TATA elements in the rat bone sialoprotein gene promoter. J Biol Chem 281, 9882-90 (2006).

133. Valouev, A. et al. Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat Methods (2008).

134. Robertson, G. et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods 4, 651-7 (2007).

135. Kato, M. et al. Dinucleosome DNA of human K562 cells: experimental and computational characterizations. J Mol Biol 332, 111-25 (2003).

136. Salih, F., Salih, B. & Trifonov, E.N. Sequence-directed mapping of nucleosome positions. J Biomol Struct Dyn 24, 489-93 (2007).

137. Kasahara, M. et al. The medaka draft genome and insights into vertebrate genome evolution. Nature 447, 714-9 (2007).

138. Zhang, Z. & Dietrich, F.S. Mapping of transcription start sites in Saccharomyces cerevisiae using 5' SAGE. Nucleic Acids Res 33, 2838-51 (2005).

139. Pribnow, D. Nucleotide sequence of an RNA polymerase binding site at an early T7 promoter. Proc Natl Acad Sci U S A 72, 784-8 (1975).

140. Prakash, A. & Tompa, M. Discovery of regulatory elements in vertebrates through comparative genomics. Nat Biotechnol 23, 1249-56 (2005).

141. Gasch, A.P. et al. Conservation and evolution of cis-regulatory systems in ascomycete fungi. PLoS Biol 2, e398 (2004).

142. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. Basic local alignment search tool. J Mol Biol 215, 403-10 (1990).

143. Li, S. et al. A map of the interactome network of the metazoan C. elegans. Science 303, 540-3 (2004).

144. Insights into social insects from the genome of the honeybee Apis mellifera. Nature 443, 931-49 (2006).

145. Blanchette, M. et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 14, 708-15 (2004).

146. Stark, A. et al. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature 450, 219-32 (2007).

147. Choi, W.S., Yan, M., Nusinow, D. & Gralla, J.D. In vitro transcription and start site selection in Schizosaccharomyces pombe. J Mol Biol 319, 1005-13 (2002).

148. Bourque, G. et al. Evolution of the mammalian transcription factor binding repertoire via transposable elements. Genome Res 18, 1752-62 (2008).

149. Kanno, Y., Levi, B.Z., Tamura, T. & Ozato, K. Immune cell-specific amplification of interferon signaling by the IRF-4/8-PU.1 complex. J Interferon Cytokine Res 25, 770-9 (2005).

150. Tuerk, C., MacDougal, S. & Gold, L. RNA pseudoknots that inhibit human immunodeficiency virus type 1 reverse transcriptase. Proc Natl Acad Sci U S A 89, 6988-92 (1992).