<<

Determinants of the Eukaryotic DNA Replication Program

by

Matthew L. Eaton

Program in Computational Biology and Bioinformatics Duke University

Date:

Approved:

David M. MacAlpine, Supervisor

Fred S. Dietrich

Terrence S. Furey

Alexander J. Hartemink

Laura N. Rusche

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Program in Computational Biology and Bioinformatics in the Graduate School of Duke University 2011 ABSTRACT (0715)

Chromatin Determinants of the Eukaryotic DNA Replication Program

by

Matthew L. Eaton

Department of Computational Biology and Bioinformatics Duke University

Date:

Approved:

David M. MacAlpine, Supervisor

Fred S. Dietrich

Terrence S. Furey

Alexander J. Hartemink

Laura N. Rusche An abstract of a dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Program in Computational Biology and Bioinformatics in the Graduate School of Duke University 2011 Copyright © 2011 by Matthew L. Eaton All rights reserved Abstract

The accurate and timely replication of eukaryotic DNA during S-phase is of critical importance for the cell and for the inheritance of genetic information. Missteps in the replication program can activate checkpoints or, worse, trigger the genomic instability and aneuploidy associated with diseases such as cancer. Eu- karyotic DNA replication initiates asynchronously from hundreds to tens of thou- sands of replication origins spread across the genome. The origins are acted upon independently, but patterns emerge in the form of large-scale replication timing domains. Each of these origins must be localized, and the activation time deter- mined by a system of signals that, though they have yet to be fully understood, are not dependent on the primary DNA sequence. This regulation of DNA replica- tion has been shown to be extremely plastic, changing to fit the needs of cells in development or effected by replication stress. We have investigated the role of chromatin in specifying the eukaryotic DNA replication program. Chromatin elements, including histone variants, histone mod- ifications and nucleosome positioning, are an attractive candidate for DNA replica- tion control, as they are not specified fully by sequence, and they can be modified to fit the unique needs of a cell without altering the DNA template. The origin recognition complex (ORC) specifies replication origin location by binding the DNA of origins. The S. cerevisiae ORC recognizes the ARS (autonomously replicating sequence) consensus sequence (ACS), but only a subset of potential

iv genomic sites are bound, suggesting other chromosomal features influence ORC binding. Using high-throughput sequencing to map ORC binding and nucleosome positioning, we show that yeast origins are characterized by an asymmetric pattern of positioned nucleosomes flanking the ACS. The origin sequences are sufficient to maintain a nucleosome-free origin; however, ORC is required for the precise positioning of nucleosomes flanking the origin. These findings identify local nu- cleosomes as an important determinant for origin selection and function. Next, we describe the D. melanogaster replication program in the context of the chro- matin and transcription landscape for multiple cell lines using data generated by the modENCODE consortium. We find that while the cell lines exhibit similar repli- cation programs, there are numerous cell line-specific differences that correlate with changes in the chromatin architecture. We identify chromatin features that are associated with replication timing, early origin usage, and ORC binding. Pri- mary sequence, activating chromatin marks, and DNA-binding (including chromatin remodelers) contribute in an additive manner to specify ORC-binding sites. We also generate accurate and predictive models from the chromatin data to describe origin usage and strength between cell lines. Multiple activating chro- matin modifications contribute to the function and relative strength of replication origins, suggesting that the chromatin environment does not regulate origins of replication as a simple binary switch, but rather acts as a tunable rheostat to regu- late replication initiation events. Taken together our data and analyses imply that the chromatin contains suffi- cient information to direct the DNA replication program.

v To my parents, Jonathan and Mariellen Eaton, who taught me how to think and in so doing taught me everything I know.

To my wife-to-be, Dr. Hilary Wade, my reverse complement.

vi Contents

Abstract iv

List of Tables xi

List of Figures xii

List of Abbreviations and Symbols xiv

Acknowledgments xvi

1 Introduction1

1.1 DNA replication and the eukaryotic DNA replication program....3

1.1.1 Overview of the eukaryotic DNA replication program.....4

1.1.2 Regulation of the eukaryotic DNA replication program.... 19

1.1.3 Genomic tools for analysis of replication...... 28

1.2 Chromatin...... 37

1.2.1 Overview of chromatin...... 38

1.2.2 Genomic tools for analysis of chromatin...... 43

1.3 Regulation of the replication program by chromatin...... 45

1.3.1 Nucleosome positioning...... 46

1.3.2 Chromatin features and origin localization...... 48

1.3.3 Chromatin features and origin initiation...... 50

1.4 Thesis roadmap...... 53

vii 2 Conserved Nucleosome Positioning Defines Replication Origins in S. cerevisiae 57

2.1 Introduction...... 57

2.2 Results...... 59

2.2.1 Identification of functional ACSs...... 59

2.2.2 Nucleosome positioning at origins of replication...... 61

2.2.3 Nucleosome depletion at origins is encoded by sequence... 65

2.2.4 ORC is necessary and sufficient for precise nucleosome posi- tioning...... 70

2.3 Discussion...... 74

2.3.1 Model for ORC specificity...... 76

2.3.2 NFR architecture...... 77

2.3.3 Connection to origin activity and extension to metazoans... 78

2.4 Methods...... 79

2.4.1 Data deposition...... 79

2.4.2 Yeast strains and growth conditions...... 79

2.4.3 Chromatin immunoprecipitation and mononucleosome prepa- ration...... 80

2.4.4 Reconstituted chromatin assembly...... 80

2.4.5 High-throughput sequencing...... 80

2.4.6 Mapping ORC binding sites by ChIP-seq...... 81

2.4.7 Comparison of ORC ChIP-seq peaks with prior ChIP-chip peaks 81

2.4.8 Identification of functional ACS matches...... 82

2.4.9 Mapping nucleosome positions...... 83

2.4.10 Visualization of nucleosomes...... 84

2.4.11 Nucleosome positioning at genomic features...... 85

2.4.12 Construction of the nr-ACS set...... 86

viii 2.4.13 Sequence analysis around genomic features...... 87

2.4.14 Purification of ORC and ABF1...... 88

3 Chromatin signatures of the Drosophila replication program 89

3.1 Introduction...... 89

3.2 Results...... 92

3.2.1 Characterization of the Drosophila replication program in three cell lines...... 92

3.2.2 ORC enrichment at genic elements...... 96

3.2.3 Diverse chromatin marks define the replication program... 97

3.2.4 Potential cis and trans-acting elements directing ORC associ- ation...... 103

3.2.5 Predicting early origin usage from the chromatin landscape. 110

3.3 AWG analysis...... 118

3.3.1 Chromatin state 3...... 118

3.3.2 HOT spots...... 121

3.4 Discussion...... 123

3.5 Methods...... 127

3.5.1 Cell growth...... 127

3.5.2 ChIP-seq...... 127

3.5.3 Read mapping and peak calling...... 127

3.5.4 Replication timing...... 128

3.5.5 Early origin mapping...... 128

3.5.6 DNA extraction...... 129

3.5.7 BrdU immunoprecipitation...... 129

3.5.8 Array hybridization and analysis...... 130

3.5.9 Meta-peaks...... 130

ix 3.5.10 ORC at genic loci...... 131

3.5.11 SVM analysis...... 131

3.5.12 Motif analysis...... 132

3.5.13 Chromatin enrichment heatmaps...... 132

3.5.14 Regression Analysis...... 133

3.5.15 Data accession...... 135

4 Conclusion 136

4.1 Summary and Discussion...... 136

4.2 Future Directions...... 142

5 Appendix A – Aneuploidy analysis 146

5.1 Results...... 148

5.1.1 Model...... 148

5.1.2 Interpreting calls...... 149

5.1.3 Initial results...... 150

5.1.4 Future directions...... 151

5.2 Conclusion...... 154

Bibliography 156

Biography 185

x List of Tables

3.1 Pearson correlation of whole-genome replication timing between cell lines...... 93

3.2 Top ORC binding SVM features by F-score...... 109

xi List of Figures

1.1 ARS1...... 7

1.2 The chorion locus...... 11

1.3 Origin licensing and activation...... 15

2.1 Precise localization of yeast origins...... 60

2.2 Venn diagram of the overlap between ORC-ACS and nim-ACS sites.. 61

2.3 Replication origins are nucleosome free...... 63

2.4 ORC-ACS and nr-ACS motif comparison...... 64

2.5 Comparison of nucleosome positioning at ORC-ACS and nimACS sites. 65

2.6 The NFR at origins is encoded by primary sequence...... 67

2.7 Sequence polarity at replication origins...... 69

2.8 A-rich islands correlate with the position of the downstream nucle- osome...... 70

2.9 Orc1-161 is defective for origin binding at the non-permissive tem- perature...... 71

2.10 ORC is necessary and sufficient for precise nucleosome positioning.. 72

2.11 Comparison of nucleosome dyad locations across mnase-seq experi- ments...... 73

2.12 Abf1p can position the downstream nucleosome at ARS1...... 75

2.13 ORC chromatin binding model...... 77

2.14 The origin NFR is the same in G1 and G2...... 78

3.1 The Drosophila replication program across three cell lines...... 94

xii 3.2 Expression of genes with and without TSS proximal ORC binding... 97

3.3 ORC enrichment at annotated genomic features...... 98

3.4 The chromatin landscape of the replication program...... 100

3.5 Chromatin signals at TSS and non-TSS ORC sites...... 102

3.6 Median array signal for all factors at promoters with ORC, and those without...... 104

3.7 ROC curve for motif classification of ORC binding sites...... 105

3.8 Sequence, chromatin and DNA binding proteins classify ORC bind- ing sites...... 107

3.9 SVM classification of ORC sites in Bg3 and Kc...... 108

3.10 Per-factor median signals in areas of differential ORC and early origins.111

3.11 Hierarchical clustering of chromatin factors at early origin meta-peaks.113

3.12 Subsets of factors are highly correlated at early origins...... 114

3.13 Chromatin factor clusters correlation with early origin strength.... 115

3.14 Chromatin signatures are predictive of early origin activity...... 117

3.15 Example of an intergenic state 3 segment...... 119

3.16 Correlation between chromatin state 3, cohesin and replication.... 120

3.17 Nucleosome dynamics for HOT regions at varying complexity levels. 122

3.18 ORC is enriched at TF HOT spots...... 123

5.1 Example CNV HMM topology...... 150

5.2 HMM results on chromosome arm 3L...... 151

5.3 Differential copy number leads to differential expression...... 152

5.4 The copy number profile of the S2 X chromosome...... 154

xiii List of Abbreviations and Symbols

Symbols

ARS1 Autonomously Replicating Sequence 1, S. cerevisiae.

ORC-ACS set Set of S. cerevisiae high-confidence ORC binding sites.

nr-ACS set Set of S. cerevisiae ACS motif matches that ORC does not bind.

Abbreviations

ACS ARS Consensus Sequence

APC Anaphase Promoting Complex

ARS Autonomously Replicating Sequence

bp base-pair

BrdU 5-bromo-2’-deoxyuridine

ChIP Chromatin Immunoprecipitation

ChIP-chip ChIP coupled with a tiling microarray

ChIP-seq ChIP coupled with high-throughput sequencing

DNase I Deoxyribonuclease I

GINS Go-Ichi-Ni-San (5-1-2-3, Japanese)

ISWI Imitation SWI

HU hydroxyurea

kb kilobases

Lexo λ-exonuclease

xiv Mb megabases

MCM complex Mini-Chromosome Maintenance complex

MNase Micrococcal Nuclease

NDR Nucleosome Depleted Region

NFR Nucleosome Free Region

nt nucleotide

ORC Origin Recognition Complex

pre-RC pre-Replicative Complex

SWI/SNF SWItch/Sucrose NonFermentable

TSS Transcription Start Site

xv Acknowledgments

First of all, I would like to thank my advisor, Dr. David MacAlpine. Dave has been the best mentor I could have hoped for, patiently putting up with a never-ending litany of questions, providing me with mountains of mineable data and testable hy- potheses, and teaching me how science works, from putting together presentations to navigating the vast expanse outside of Duke. If I succeed in science, it will have a lot to with Dave’s guidance. I gratefully acknowledge the members of my thesis committee for their advice, both at committee meetings as well as individually. David MacAlpine, Terry Furey, Alex Hartemink, Fred Dietrich and Laura Rusche, thank you all for your time and your wise words. Thank you perhaps most of all for agreeing to read this hefty tome. I would like to acknowledge the members of the MacAlpine lab, especially Leyna, Heather, Queying and Sara, who had to suffer through those early bio- chemistry-oriented journal clubs where I had no idea what was going on. Your patience will certainly come back around in the form of good karma. A special thank you also to Joey, who went through the over-engineered modENCODE data submission process and made it out unscathed. I would like to thank all of my collaborators from my time at Duke, a list that in- cludes but is not limited to Peter Kharchenko, Stephen P. Bell, Kiki Galani, Sukhyun Kang, Wendy Lam, Clara Chan, Zhiguo Zhang, Qing Li, Terry Orr-Weaver, Noa Sher,

xvi Catherine Fox, Donald McDonnell, Mary Dwyer, Hilary Wade and the modENCODE AWG, headed by Manolis Kellis. I am honored to be able to be associated with the names listed here, and I thank Dave for facilitating so many great collaborations. Leyna gets a second thank you for being an amazing friend since my first week at Duke, and for introducing me to both Dave, and my fianc´e Hilary. I would also like to thank the rest of the Durham crew, especially my occasional housemate Jenny, for the many game nights, good food and engrossing discussion. The Duke Graduate School has provided me with useful resources as well as travel funding during my time at Duke. The Computational Biology and Bioinfor- matics program has been similarly hospitable, and special thanks go to the two DGSAs Jennifer Avery and Liz Labriola for keeping things running smoothly. I would like to acknowledge my Wesleyan University research advisors, Dr. Michael Rice (Computer Science) and Dr. Michael Weir (Biology). Their men- torship introduced me to the role of computer science in biological research. If it had not been for Dr. Rice asking me to join them for a summer research project in 2002, I would almost certainly not be where I am today. Finally, I would like to thank my family. My sister Caitrin for never wearing matching socks and for spawning a sibling arms race to see who can be the bigger nerd. Also, my parents, Jonathan and Mariellen Eaton, who taught me to think for myself and to value fairness and compassion. Throughout my life they have provided me with the necessary tools and then let me find my own answers to life’s problems, and supported me regardless of the outcome. And last but certainly not least, my beautiful fianc´e Hilary, whose mental prowess will keep me humble, and whose confidence in me will continue to inspire me to achieve.

xvii 1

Introduction

It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material. –Watson and Crick[1953b]

The eukaryotic DNA replication program — contained completely within S- phase of the cell cycle — is an essential component in the process of cellular di- vision; it is the method whereby a cell exactly duplicates its DNA content during S-phase in preparation to divide later in [reviewed in DePamphilis, 2006]. Minor over- or under-replication of the genome has catastrophic consequences for the cell [reviewed in Arias and Walter, 2007] (see below, section 1.1.2). Although the biochemistry of the replication program has been well established [reviewed in Bell and Dutta, 2002; Sclafani and Holzen, 2007], the tools for investigating the genomic regulation of the eukaryotic DNA replication program have only been available for the last ten years, and with each passing year their genomic coverage and resolution increases. The genomic tools that we discuss here, including tiling arrays and ChIP-seq, were originally developed to map other genomic properties, such as chromatin configuration and DNA binding proteins. We have a limited

1 understanding of how the cell regulates the replication of its DNA in the context of an active chromatin template, not to mention how it responds to failures in that regulation, and thus there is an urgent need to generate and analyze genomic data with respect to DNA replication. Further, these data must be compared with other genomic data sets describing the nuclear environment, so as to better under- stand the relationship between the DNA replication program and the other nuclear processes which may both specify and react to the replication program. In this thesis, I will discuss the methods used for mapping features of the DNA replication program genome-wide, and compare them with other whole-genome data sets relating to, for example, and chromatin. I will discuss these methods primarily in the context of two investigations into the role of chromatin in specifying the DNA replication program. The first examines the role of the physical structure of the chromatin in specifying the localization of replication origins in S. cerevisiae, in work that was published in Genes and Development [Eaton et al., 2010]. The second describes the chromatin environment of replication origins in D. melanogaster, and shows that the chromatin contains sufficient information to predict both the localization and the S-phase activation of replication origins with a high level of accuracy. This work was recently published in Science [modENCODE Consortium et al., 2010] and Genome Research [Eaton et al., 2011]. Finally, to add context to our analysis of whole-genome DNA replication, I will describe a tech- nique for inferring the ploidy of individual genomic regions in the D. melanogaster cell lines where our experiments take place, and show that, much like cells with a misregulated DNA replication program, these cell lines are drastically aneuploid with many large scale structural variations, and these structural genomic changes have effects at the level of gene transcription. This represents a work in progress, to be submitted for peer review shortly. For details of my individual contributions to each of these projects see below, section 1.4. The overarching theme of this the-

2 sis is the integrative analysis of eukaryotic high-throughput biological experiments and how they relate to various DNA-templated processes, with a special emphasis on the regulation of the DNA replication program.

1.1 DNA replication and the eukaryotic DNA replication program

DNA replication is the process through which a cell can make a high-fidelity copy of its DNA in preparation for cellular division. The two strands of the DNA double helix are separated, and the base pairing potential of the strands is exploited to cre- ate reverse complement copies of the genome from the two template strands. This replication strategy known as semi-conservative replication, with each daughter double-stranded DNA molecule containing one of the parent’s original strands and one newly synthesized. Semi-conservative replication was first suggested by Wat- son and Crick as an elegant solution to the problem of exactly duplicating a cell’s genome in their second paper describing the structure of the DNA double-helix [Watson and Crick, 1953a]. Semi-conservative replication was proven by Mesel- son and Stahl [Meselson and Stahl, 1958] who showed that E. coli cells grown in a media containing 15N (a rare isotope of nitrogen) and then allowed to replicate in media containing 14N (a more abundant and lighter isotope), would yield cells that, after the first division, had an equal amount of 15N and 14N, and after the next doubling had a split population of cells, half with an equal amount of 15N and 14N, and the other half with only 14N. This showed that the parental cells grown in the 15N media would separate their two strands of DNA and one would be inherited by each of the two daughter cells, and then following the next cell divisions two of the four grand-daughter cells would receive one of the original strands of DNA originally synthesized in 15N media, and so on, thus confirming that DNA replicates in a semi-conservative manner.

3 The DNA replication “program” is the succession of events which leads to an organism duplicating its genome exactly, a process which includes but is not limited to DNA synthesis. Owing to the fact that DNA replication must take place in a living cell which must be able to interact with its DNA for nuclear processes such as transcription and epigenetic control even while the DNA is being copied, the process must be regulated. Thus, the DNA replication program encompasses the localization, initiation, error-control and termination required to synthesize new DNA from an active template in a timely fashion. This program is specified by the cis-acting elements that play host to origins of replication, the trans-acting factors that bind them, the cell-cycle controlled factors that activate them, and the replication forks that progress bidirectionally from them. In the following sections I will describe the replication program and go into detail on those elements that are relevant to the research which I will present in the following chapters. I will also briefly describe what happens in response to mis- steps in the regulation of DNA replication, and discuss the genomic tools which have been developed to assay the replication program. I will attempt to be as general as possible with my description of the replication program with respect to eukaryotic organisms, but in the case of specific examples I will in general default to describing S. cerevisiae and D. melanogaster, as these two organisms are the focus of the research in the following chapters.

1.1.1 Overview of the eukaryotic DNA replication program

In 1963, Jacob and colleagues developed a theory of replication control, termed the model, in which certain cis-acting DNA loci (“replicators”) would interact with trans-acting regulatory factors (“initiators”) to initiate replication in response to some cell cycle signal [Jacob et al., 1963]. In prokaryotic systems, such as the E. coli cells used by Meselson and Stahl, the replicon model was quickly validated;

4 DNA replication begins at a single locus on the circular chromosome, which inter- acts with the initiator (dnaA) [reviewed in Kornberg and Baker, 1992]. Replication then proceeds bidirectionally around the circular chromosome until the replication forks encounter the terminator element. Eukaryotic organisms, however, were found to be more complex. Not only are there multiple chromosomes to be replicated, but it was quickly discovered that each chromosome utilized multiple cis-acting replicators [a review of the initial pri- mary research can be found in the discussion of Huberman and Riggs, 1966]. For example, the first visualization of “replicons”1 in human cells showed that chromo- somes contain multiple origins of bidirectional replication, spaced 50-150 kb apart [Huberman and Riggs, 1968]. Taken together, this constituted the initial evidence that the eukaryotic replication program would have to regulate the interactions of multiple replicators per chromosome within the confines of S-phase. Initial mea- surements using short pulses of tritiated thymidine estimated a lower bound on the number of replicators per D. melanogaster chromosome to be 50 [Plaut and Nash, 1964]. These tritiated thymidine pulses followed by autoradiography are the ex- perimental ancestors of the more recent eukaryotic replication maps, and through a ChIP-seq experiment described in chapter 3, we now know the number of po- tential replicators, or “origins of replication” (which will be used interchangeably with “replication origins” and simply “origins”), as they are now more commonly known, to be twenty times greater than those initial estimates. The true number of replication origins in eukaryotic organisms varies with the size of the genome. For example, the budding yeast S. cerevisiae has approximately 350 functional origins (excluding rDNA), whereas the human HeLa cell line has an

1 A replicon is a region of DNA that replicates from a single initiation event. Given the profusion of origins and the apparent stochasticity of origin firing, it is not clear that this is prevalent in higher , so this thesis will employ the term replicon primarily when summarizing previous liter- ature.

5 estimated 3¢104 origins [Appendix II, DePamphilis, 2006]. In all eukaryotes, how- ever, the functionality of these origins is conserved. Origins of replication act as sites for the assembly of the pre-replicative complex (pre-RC), which comprises the origin recognition complex, two helper proteins Cdt1 and Cdc6, and the replica- tive the MCM complex [reviewed in Bell and Dutta, 2002]. The pre-RC is assembled in G1, and is complete once MCM, a heterohexameric com- plex, is clamped around the DNA. At this point the origin is referred to as being “licensed”. Upon entering S-phase, the helicase may be activated, at which point the origin DNA is unwound. Then, DNA and other proteins essential to active DNA replication are recruited to the nascent replication fork and the origin is initiated. As noted above, activation of any one licensed origin is not required for the suc- cessful completion of S-phase [Santocanale and Diffley, 1996]. Some origins may never fire and instead be passively replicated by replication forks originating from neighboring origins [Dubey et al., 1991; Vujcic et al., 1999]. Others may be weak origins that fire in only a subset of a population of replicating cells [Friedman et al., 1997; Yamashita et al., 1997]. Among those origins that do fire, however, there is a temporal order within S-phase to their time of initiation [Taylor, 1960; Balazs et al., 1974; Raghuraman et al., 2001;M ´endez, 2009], the mechanism of that tem- poral regulation is not yet fully understood, but it is clear that it is integrated with other DNA-templated processes such as transcription [MacAlpine et al., 2004].

Origins of replication

Origins of replication are defined by the cis-acting elements that play host to repli- cation initiation and the trans-acting proteins that bind them. In this section I will focus on the relevant features of replication origins, and how their cis-acting elements are specified in eukaryotes.

6 Eukaryotic origins of replication were initially discovered in the budding yeast S. cerevisiae by testing genomic regions for an ability to confer high-frequency transformation of plasmids [Stinchcomb et al., 1979; Beach et al., 1980] [reviewed in Newlon, 1996]. Such genomic regions were termed ARS elements, for Au- tonomously Replicating Sequences. Once candidate ARS elements were identified, they could be tested through mutation and analysis with 2D gel electrophoresis [Brewer and Fangman, 1987] at their endogenous locus to show that they func- tion as origins of replication in their natural setting. Many of the loci tested that were capable of maintaining a plasmid were found to be active replicators in their chromosomal setting as well. The first ARS element identified, ARS1 [Figure 1.1 Stinchcomb et al., 1979], has remained one of the most studied replication origins in eukaryotes.

ARS1 pre-RC ORC Abf1 ACS B1 B2 B3

130 bp

FIGURE 1.1: ARS1 schematic. Relevant sequence elements of the S. cerevisi- aeARS1 replication origin are shown in white. The Dnase I footprint of the various trans-acting factors are shown as grey boxes. Adapted from [Aldajem et al., 2006].

ARS elements were soon found to contain multiple short DNA sequences that, when mutated, reduced their replication efficiency, sometimes ablating it com- pletely. These elements were given an alphanumeric code, and comprise the A, B1-3 and C elements [reviewed in Newlon, 1996]. Importantly, though not identi- cal, the elements were found to have a strong sequence homology across ARS ele- ments. In a trend that would continue throughout the eukaryotes, most functional

7 origins were found to exist in intergenic regions [Wyrick et al., 2001; MacAlpine and Bell, 2005]. The A element, often referred to as the canonical ACS (Autonomously repli- cating sequence Consensus Sequence) [Broach et al., 1983], is an 11 bp sequence element generally considered to be of the form (A/T)TTTAYRTTT(A/T). This ele- ment is absolutely essential (though not sufficient) for ARS activity, and acts as the primary binding site for the origin recognition complex (ORC, see below), which serves as the primary initiator of origin function. The few bases directly 5’ and 3’ of the ACS may also have an effect on origin function, but are generally less conserved and function on a per-origin basis [Theis and Newlon, 1994; Rao et al., 1994]. Counting from the first base of the ACS, approximately 30 bases downstream is the B1 element. The B1 is generally a small cluster of Ts, and has only one base that is essential for origin activity [Huang and Kowalski, 1996]. Based on DNA protection assays, subunits of Orc1, Orc2 and Orc4 interact with the major groove of the DNA over the ACS, while Orc5 appears to contact the B1 element at a specific position [Lee and Bell, 1997]. Given that both the ACS and B1 element interact with ORC, they are often combined into a single 33 bp motif referred to as the EACS (Extended ACS) [Xu et al., 2006]. The B2 element is not conserved in sequence across all budding yeast origins, although substitution experiments have shown that some B2 elements may be in- terchanged without a strong impact on replication efficiency [Rao et al., 1994; Theis and Newlon, 1994; Huang and Kowalski, 1996; Lin and Kowalski, 1997]. It often appears as a weak match to the ACS [Wilmes and Bell, 2002; Bell, 2002], although that can be said of many T-rich elements. It appears a mix between its specific sequence and its helical stability is critical for origin efficiency [Wilmes and Bell, 2002]. In particular, the B2 might be particularly important as a DUE

8 (DNA Unwinding Element), allowing the DNA to be pulled apart to prime DNA replication and to load the helicase. In support of this hypothesis, mutating the B2 region in such a way as to increase its helical stability reduces origin efficiency [Na- tale et al., 1993; Huang and Kowalski, 2003], and such mutations further hamper origin activity if combined with lower temperatures which make the DNA concomi- tantly more difficult to unwind [Miller et al., 1999]. In general, mutations to the B2 element do not inactivate origins, but reduce their efficiency, with effects that interrupt pre-RC assembly after ORC association but at some point before the final step when the helicase is loaded onto the DNA [Zou and Stillman, 2000; Wilmes and Bell, 2002]. Importantly, the ARS1 B2 is protected from Dnase I digestion by the pre-RC but not by ORC alone [Figure 1.1 Diffley et al., 1994], implying that it interacts with origin proteins only after either the loading proteins Cdt1 and Cdc6 or the helicase has associated with the pre-RC. The B3 element is not a sequence motif so much as a chromatin feature. It is a transcription factor binding site which appears at a subset of the origins, and the identity of the transcription factor itself seems not to be important. At ARS1 (the best-studied yeast origin), for example, the transcription factor binding site used as the B3 element is that of Abf1, although other origins contain a binding site for Rap1. If the the Abf1 site in ARS1 is replaced with a sequence that binds Rap1, origin function is recovered [Marahrens and Stillman, 1992], implying that the transcription factor is contributing to origin efficiency simply by its occupancy of the DNA, which could influence the local chromatin organization, perhaps by blocking nucleosome occupancy. ARS121 also contains an Abf1 binding site, and while loss of the Abf1 site ablates origin function, moving it or reversing its orien- tation does not interfere with origin activity [Walker et al., 1990], again implying that merely its downstream presence is enough to enhance origin function. Although this large body of evidence would seem to imply that origin specifica-

9 tion is entirely sequence-based, a recent study presented an algorithm for finding S. cerevisiae replication origins based on the above sequence requirements alone and concluded that using only sequence information there were upwards of 10,000 potential origins of replication within the full S. cerevisiae genome [Breier et al., 2004]. Thus, taken together with the above results, these sequence elements seem to serve the dual role of serving as an ORC binding site and setting up certain chromatin properties (i.e. helical stability and open chromatin) that can interact favorably for origin assembly. It is known that nucleosome occupancy can inter- fere with origin function at ARS1 [Simpson, 1990], and in chapter 2 I will explore further the role of chromatin in specifying origin localization. Given this extensive body of research on the elements of S. cerevisiae replica- tion origins, and the conservation of these functional sequence elements between origins, it would be logical to expect that other eukaryotic organisms would ex- hibit similar replicator sequence architecture and specificity. However, this is not the case. Even within the unicellular eukaryotes, budding yeast is unique in terms of its extensive and specific replication origin architecture, as even the S. pombe genome lacks a well defined ORC binding motif analogous to the ACS [reviewed in Kelly and Brown, 2000]. This is not to say that most eukaryotes do not follow the replicon model of Jacob and colleagues. Indeed, S. pombe initiates replication reliably from certain genomic locations, but these sites are large (500 – 1000 bp) and relatively degenerate due to their A/T richness [Segurado et al., 2003]. In fact, the features of a DNA sequence which are most predictive of its replication origin potential in S. pombe are its length and A/T richness [Dai et al., 2005]. This correlates well with the observation that in S. pombe, ORC subunit 4 contains A/T hook motifs that are not present in S. cerevisiae [Chuang and Kelly, 1999; Gaczynska et al., 2004]. Further, replacing a known ARS in S. pombe with a se- quence of sufficient length and A/T richness will recover origin function [Chuang

10 and Kelly, 1999], implying that some interaction of A/T richness and chromatin environment is sufficient to specify an in S. pombe. This trend of diminishing origin sequence specificity increases with genomic complexity when considering metazoans.

chorion

Ori-α ACE3 Ori-β Ori-γ

S18 S15 S19 S16

10-15% 80% 10-15%

~7 kb

FIGURE 1.2: The chorion locus. Grey boxes indicate the extent of regions where initiation has been mapped, the numbers represent the approximate percentage of local initiation events the regions are responsible for. White rectangles represent genes, with their direction of transcription marked by an arrow. The ACE3 locus is indicated in blue, and green corresponds to areas of high A/T content. Adapted from [Aldajem et al., 2006].

In metazoan organisms, the ARS assay is non-functional. Exogenous plasmids introduced into metazoan cells will either not be maintained at all [Biamonti et al., 1985], or if given a nuclear retention signal [Krysan et al., 1993; Schaarschmidt et al., 2004] or introduced into the X. lavis embryo model replication system [M´echali and Kearsey, 1984], will replicate stably regardless of their sequence. To date, most research into metazoan origins of replication has focused on sites of replicative gene amplification, where a particular genomic locus will undergo mul- tiple rounds of replication outside of normal cell cycle controls, resulting in a much higher copy number for the locus in question. For example, the D. melanogaster chorion gene locus (Figure 1.2) is amplified 100-fold in the follicle cells surround- ing the developing oocyte, where its protein product is necessary in large quantities

11 to form the egg shell [reviewed in Claycomb and Orr-Weaver, 2005]. This is essen- tially a strategy for gene up-regulation, by drastically increasing the copy number of the gene and thus the number of templates available from which to transcribe mRNA. A nearby locus was identified as the replication origin responsible for the massive amplification [Heck and Spradling, 1990], and has since been used to in- terrogate the properties of replication origins in D. melanogaster. Coincident with its discovery, a 320 bp sequence upstream of the origin was shown to enhance ori- gin activity. This site, named ACE3 for Amplification Control Element of chromo- some 3, is essential for amplification of the chroion locus [de Cicco and Spradling, 1984; Orr-Weaver and Spradling, 1986; Orr-Weaver et al., 1989]. Although repli- cation can initiate infrequently anywhere throughout the chorion locus, the most frequent origin is an 867 bp stretch of DNA between two genes,termed Ori-β [Heck and Spradling, 1990; Claycomb and Orr-Weaver, 2005]. Whole-genome methods are just beginning to be developed to assay metazoan origin distribution and function. These will be the subject of section 1.1.3 and Chapter 3 of this thesis. Briefly, using whole-genome tiling arrays and an antibody specific for the initiator proteins, it has been confirmed that metazoan origins local- ize to specific replicator loci and these loci have been mapped at ChIP-chip resolu- tion [e.g. MacAlpine et al., 2004]. These studies indicate that although replicators can still be mapped to distinct loci, these loci are more diffuse and have a lower probability of being used in any given cell cycle than in S. cerevisiae. Furthermore, they do not share the same sorts of sequence structure described above, and indeed no sequence specificity has been determined for metazoan initiator proteins in vitro outside of a preference for A/T rich and negatively supercoiled DNA [Vashee et al., 2003; Remus et al., 2004].

12 Origin licensing

The process of assembling the pre-replicative complex (pre-RC) [Diffley et al., 1994] at origins of replication is referred to as the process of origin licensing [re- viewed in Bell and Dutta, 2002]. Origin licensing requires DNA, ORC, Cdc6 and Cdt1 to load the MCM complex (Figure 1.3). Once licensed, an origin is competent to initiate replication, or “fire”, in the subsequent S-phase. Although the regula- tion of origin licensing is not fully understood in the genomic sense, the proteins involved and the order of their involvement have been relatively well worked out [e.g. Kawasaki et al., 2006], and in this section I will briefly describe the process of origin licensing from a mechanistic standpoint. Importantly, in eukaryotic cells, licensing is confined to the G1 phase of the cell cycle, while initiation may only take place on licensed origins in S-phase. This is one of the mechanisms the cell uses to ensure that an origin may only fire once per cell cycle (see below for in- dividual proteins’ regulation, and section 1.1.2 for a more in-depth discussion and the consequences of failure. The first step in origin licensing is the binding of the origin recognition complex (ORC) to the DNA of the replicator in late M phase or early G1. ORC was initially discovered as the heterohexameric protein complex (comprising Orc1–6) which in- teracts with the replicator DNA in S. cerevisiae [Bell and Stillman, 1992], and has well conserved orthologs in all eukaryotes [reviewed in Bell, 2002; Duncker et al., 2009], including D. melanogaster [Gossen et al., 1995]. In both D. melanogaster and S. cerevisiae ORC forms a relatively immutable complex, whereas in verte- brates the association of Orc6 with the full complex is transient [Gillespie et al., 2001; Dhar and Dutta, 2000]. Interestingly, ScOrc6 [Li and Herskowitz, 1993] is dispensable for DNA binding, though it is essential for cell viability [Bell and Still- man, 1992; Lee and Bell, 1997], whereas DmORC requires all six subunits for DNA

13 binding and replication [Chesnokov et al., 2001]. The role of ORC is to identify and bind genomic locations suitable for hosting replication initiation (replicators), and then to act as a loading dock for the subsequent proteins of the pre-RC. ORC does have a number of roles outside of the replication program; for example ORC is involved in heterochromatic silencing at both telomeric and mating-type loci in budding yeast [Foss et al., 1993; Micklem et al., 1993; Rusche et al., 2003], as well as in maintenance of heterochromatin in higher eukaryotes [Pak et al., 1997], however here I will limit myself to describing ORC’s role in replication initiation. Although ORC’s role is to bind the DNA of cis-acting regulators, a clear DNA binding motif has only been identified in S. cerevisiae (described above), and even that is insufficient to fully specify ORC localization genome-wide [Breier et al., 2004]. In the fission yeast S. pombe, reliable cis-acting replicator sites can be iden- tified with an ARS assay, but there is considerably more degeneracy, as the A/T hook motifs found in SpOrc4 appear to confer the ability to bind any sufficiently A/T rich DNA [Gaczynska et al., 2004]. In metazoans, the DNA binding speci- ficity of ORC seems to have been even further reduced, and though purified D. melanogaster ORC shows a preference for A/T rich DNA, a stronger determinant appears to be the topological properties of the DNA [Remus et al., 2004]. However, this does not imply that ORC’s binding isn’t predictable, simply that the prediction will not come from DNA sequence alone. It has been known for many years that replication initiates from specific locations in metazoan cells [see above and e.g. Huberman and Riggs, 1968], each of which must be marked by ORC binding, thus it is reasonable to assume that ORC is able to detect its binding site efficiently and reliably. The bulk of the research presented in this thesis will focus on the identity of those signals, and the contribution of the chromatin environment of the genome to their specification. For a further introduction to this topic, see the sec- tion on chromatin’s regulation of replication in this chapter, and the introductions

14 of chapters 2 and 3.

G1 S

M G2

ORC MCMs Cdt1p/Dup1p Pre-Initiation complex Cdc6p Polymerase Machinery CDK Activity

FIGURE 1.3: Origin licensing and activation. In S. cerevisiae and D. melanogaster, ORC is bound to origins of replication throughout the cell cycle. In G1, the origin is licensed through the stepwise association of Cdt1, Cdc6 and the MCM complex. In S-phase the licensed origin is activated through the actions of the components of the pre-IC, including Cdc45 and other factors described below.

The next step in origin licensing is the association of Cdc6 with ORC. Cdc6 has a very close homology with Orc1 [Iyer et al., 2004], and was first character- ized as a necessary replication protein almost twenty years before ORC’s discovery [Hartwell, 1973, 1976]. In S. cerevisiae, through mutating the various domains of Cdc6 it was found that both its DNA binding and ATP hydrolysis capabilities are es- sential for origin function, specifically for loading the MCM complex [Perkins and Diffley, 1998; Weinreich et al., 1999]. Cdt1 associates with the complex around the same time. Although it is known that Cdt1 relies on ORC for chromatin binding [Maiorano et al., 2000], whether

15 Cdt1 depends on Cdc6 for binding is not clear, as they have been shown to interact in S. pombe and mammals [Nishitani et al., 2000; Cook et al., 2004], but such interactions are not required in X. lavis embryos for Cdt1 loading [Maiorano et al., 2000]. Cdt1 was originally identified in S. pombe as a gene critical to replication and later pre-RC assembly [Hofmann and Beach, 1994; Nishitani et al., 2000], and has orthologs in all eukaryotes including S. cerevisiae (Tah11/Sid2) [Devault et al., 2002], and D. melanogaster (Dup) [Whittaker et al., 2000]. It has been shown that in X. lavis egg extract, Cdt1 binding to the pre-RC before Cdc6 blocks helicase loading [Tsuyama et al., 2005], thus implying that while Cdc6 may be dispensable for Cdt1 chromatin association, it is required for Cdt1 to fulfill its functional role in the pre-RC. It has been suggested that Cdt1 and MCM are recruited to the origin as a complex [Tanaka and Diffley, 2002], and that this complex can bind Cdc6 [Cook et al., 2004]. The role of Cdt1 is thus most likely to promote MCM association with the pre-RC. The eukaryotic replicative helicase, the MCM2-7 complex (MCM), is the final element of the pre-RC. Long implicated as the replicative helicase, in vitro heli- case activity of the full complex was finally demonstrated by the Schwacha group [Bochman and Schwacha, 2008]. The MCM complex is an elongated, ring-shaped heterohexameric protein complex which is well conserved across all eukaryotes [Bell and Dutta, 2002]. All of the subunits are closely related to one another and share a central AAA+ ATPase domain, but despite this conservation they are all uniquely necessary, deletion of any of the six subunits in S. cerevisiae or S. pombe is lethal [reviewed in Dutta and Bell, 1997]. In its role as the eukaryotic replica- tive helicase, it is necessary that MCM be able to travel with the replication fork, which it has been shown to do [Aparicio et al., 1997]. Thus, in pre-RC assembly, the MCM complex must first associate with the other pre-RC proteins, and then be clamped around the DNA. It has recently been shown that MCM is present at

16 the pre-RC as a double hexamer, connected at the N-terminal ends (head-to-head) [Remus et al., 2009]. Furthermore, single heptamer complexes comprising Mcm2- 7 and Cdt1 are loaded cooperatively, and then the MCM hexamers are conjoined, encircling the dsDNA just downstream of ORC’s binding footprint. At this point MCM is inactive, and can slide freely on dsDNA without separating the two strands [Remus et al., 2009]. Furthermore, once the MCM is loaded, the rest of the pre-RC can be removed from the DNA without interfering with replication activity [Hua and Newport, 1998; Randell et al., 2006]. Thus it can be argued that the process of pre-RC assembly is the process of loading MCM onto the DNA at origins of repli- cation. With the MCM complex clamped around the DNA of the origin, the pre-RC has been assembled, and the MCM awaits activation in the subsequent S-phase. Importantly the pre-RC assembly process described above is cell-cycle depen- dent, such that the pre-RC can only be assembled in G1 [see below section 1.1.2; reviewed in Blow and Dutta, 2005; Sclafani and Holzen, 2007; Arias and Walter, 2007]. Briefly, in S. cerevisiae, cell-cycle control of DNA replication is controlled primarily by Cdk1 [Dahmann et al., 1995]; while CDK levels are low in late M through the end of G1, pre-RC assembly is permitted, but as the cell moves into S-phase and CDK levels rise, individual members of the pre-RC are inhibited while concurrently CDK activates the origins. This duality in the replication programs en- sures that origins are licensed only in G1 and activated (and thus de-licensed) only in S, meaning that an origin will fire at most once per cell cycle. In D. melanogaster, most of the work is done by inhibiting Cdt1 using both degredation of Cdt1 by an S- phase dependent ligase and a protein, geminin, which is prevalent from S-phase until its degraded during M-phase [McGarry and Kirschner, 1998], and which binds Cdt1, inactivating it [Wohlschlegel et al., 2000]. D. melanogaster also exhibits APC-mediated degradation of Orc1 in late M and G1 [Araki et al., 2003], although why the largest subunit of ORC would be degraded as the cell prepares

17 to license origins has yet to be fully understood.

Origin activation

With pre-RC assembly complete by the end of G1, any licensed origins (i.e. those that have assembled a pre-RC to the point that inactivated MCM is encircling the DNA) are now available to be activated. Activation consists of of the MCM complex by CDK and DDK and unwinding of the DNA at the origin, fol- lowed by recruitment of DNA and the departure of the replication fork [reviewed in Walter and Araki, 2006; Sclafani and Holzen, 2007; Labib, 2010]. In this section, I will briefly discuss the activation of origins to motivate the work in the following chapters. Importantly, the results in the following chapters (specif- ically chapter 3) focus not on the mechanism of origin of activation, but on how the process of activation is regulated — in other words, I have investigated when origins are selected relative to the duration of S-phase, but not the mechanism of that selection. Helicase activation is a complex process that involves two protein kinases (CDK and DDK) and the contributions of at least six additional proteins or complexes (Mcm10, Dpb11, Sld2, GINS, Cdc45 and Sld3) (the pre-initiation complex). The process begins with Mcm10, a Zn finger protein implicated in replication origin and fork function [Merchant et al., 1997], loading onto the origin at G1/S in an MCM complex dependent manner [Wohlschlegel et al., 2002], where its role ap- pears to be recruiting Cdc45. The protein kinase DDK (Dbf4-dependent kinase), which is a complex of Cdc7 kinase and its regulatory subunit Dbf4, then phospho- rylates subunits of the MCM complex [Zou and Stillman, 2000; Sclafani, 2000; Sheu and Stillman, 2006]. Without the DDK complex, the pre-RC and Mcm10 are able to associate with origins but Cdc45 loading is blocked [Jares and Blow, 2000]. CDK then associates with the complex and acts in a DDK-dependent manner

18 in metazoan cells [Jares and Blow, 2000], although in budding yeast cells it has been shown that DDK cannot act without active CDK, implying that their relative order or cooperatively may have changed over time [Nougar`ede et al., 2000]. A complex of Dpb11 and Sld2 is then recruited to origins, and again is important in the loading Cdc45 [Van Hatten et al., 2002; Hashimoto and Takisawa, 2003; Kamimura et al., 1998]. The GINS complex then associates with the origins — GINS is so named for the number of its four subunits, 5–1–2–3, or in Japanese, Go- Ichi-Ni-San (GINS comprises Sld5, Psf1, Psf2, and Psf3) [Takayama et al., 2003]. GINS functions interdependently with Cdc45 in loading the replication machin- ery in preparation for replication fork initiation, and is necessary for replication fork progression [Takayama et al., 2003]. Finally, Cdc45 [Zou et al., 1997] is re- cruited to origins in a process that depends on all of the previously mentioned factors [Wohlschlegel et al., 2002; Jares and Blow, 2000; Zou and Stillman, 2000; Takayama et al., 2003; Zou and Stillman, 1998]. Cdc45 stimulates the helicase ac- tivity of the MCM complex, and without it origin unwinding, RPA, and the recruit- ment of DNA polymerases are blocked [Mimura et al., 2000; Walter and Newport, 2000]. Cdc45, along with GINS and MCM, will travel along with the replication fork when fork movement is initiated [Pacek and Walter, 2004]. At this point the helicase has been activated, and the eukaryotic DNA polymerases (, α, and δ) are recruited in a manner dependent on the factors described above [reviewed in Walter and Araki, 2006], and the replication forks begin to travel away from the origins, creating nascent DNA strands.

1.1.2 Regulation of the eukaryotic DNA replication program

Beyond the management of DNA replication fidelity by DNA polymerase and the associated machinery, the cell goes to great lengths to ensure that the genome is duplicated exactly once and only once. At the same the cell faithfully maintains

19 a set of genomic loci as origins and a stable replication timing program. Thus, it is logical to assume that errors in any of these programs will have negative con- sequences for the cell. This is indeed the case, and multiple mechanisms exist to ensure that a cell with a genome that is not exactly duplicated will not be able to divide [reviewed in Arias and Walter, 2007; Hook et al., 2007].

Why regulate DNA replication?

The main requirement of the DNA replication program is that it faithfully duplicate the genome exactly once during S-phase. However, this is a relatively short-sighted view of the requirements of the replication program. A more complete directive is that the DNA replication program must faithfully duplicate the genome exactly once during S-phase from an actively used and highly configured chromatin tem- plate; it must pull the DNA from its chromatin context for a short time as the fork passes through a region and then have it packaged back into chromatin within sec- onds [Gasser et al., 1996]; and it must respond to the various contexts of the cell including DNA damage, nutrient depletion, widely varying S-phase durations and genomic chromatin states. Thus, the controls that ensure that no segment of the genome is replicated more or less than exactly once are critical, but there exists another level of regulation: origins are controlled both in space and in time. Origins are controlled in space by the replicon model first suggested by Jacob in 1963: ORC binds to specific loci throughout the genome in G1. The deter- minants of ORC binding are not yet fully understood, especially in a metazoan system [reviewed inM ´echali, 2010; Cayrou et al., 2010], but the very fact that ORC regularly chooses some sites over others implies that this localization is an important control within the replication program, and that it benefits the cell to maintain it. It is certainly true that ORC localizes predominantly to intergenic re- gions, which alleviates any potential conflict between poised replication machinery

20 and active transcription machinery [e.g. Haase et al., 1994; MacAlpine and Bell, 2005; Mesner and Hamlin, 2005; Mori and Shirahige, 2007;M ´echali, 2010]. It has been suggested that the underlying reasons include chromatin structures that facilitate unwinding [Pearson et al., 1996], the 3D structure of the genome inside the nucleus [reviewed in Gilbert et al., 2010], dovetailing with active transcrip- tion machinery, and the maintenance of chromatin state or epigenetic information [reviewed in Hiratani and Gilbert, 2009]. Origins are controlled in time by their activation during S-phase. Activation is triggered at origins asynchronously throughout the genome during S-phase. This timing appears to be programmed in G1 [Wu and Gilbert, 1996] although it can be plastic in the presence of replication stress [Branzei and Foiani, 2005]. There is ev- idence that activation in metazoans is stochastic within clusters of proximal origins but programmed at the level of timing domains (i.e. the activation of individual origins in a cluster may change but whether or not a cluster is activated does not) [Jackson and Pombo, 1998]. It has been shown that in general euchromatic regions replicate early and heterochromatic regions replicate late within S-phase, and that actively transcribed genes are associated with early timing domains [e.g. Schubeler¨ et al., 2002; Woodfine et al., 2004; MacAlpine et al., 2004]. Taken together, this and other evidence argues that programming the time of DNA replication within S-phase is actively regulated and linked to other chromatin-templated processes. It has been suggested that this tunes the mutation rate to be higher in less tran- scribed regions, as the later replicating regions may have a higher mutation rate [Stamatoyannopoulos et al., 2009]. It has also been shown that replication timing effects chromatin mark deposition [Zhang et al., 2002a].

21 Regulation and mis-regulation within the cell cycle

Eukaryotes have evolved an elegant mechanism for avoiding aberrant rereplication of the genome. Origins may only be licensed in G1 and only activated in S-phase, and only licensed origins may be activated [reviewed in Bell and Dutta, 2002]. Thus, by separating the mechanisms of origin licensing and origin activation, and by linking both to a distinct phase of the cell cycle, each origin can be licensed and fired only once during a normal cell cycle. Although this general method of regula- tion is conserved across all eukaryotes, its mechanism differs across species. In the budding yeast S. cerevisiae, the regulation of origin licensing is tied exclusively to CDK activity, and the availability of pre-RC factors in G1 and their CDK-related in- activation in S [Nguyen et al., 2001]. Cdc6 is targeted for ubiquitylation by the E3 ligase SCFCdc4 [Drury et al., 1997], its transcription is downregulated through CDK blocking the nuclear import of its transcription factor Swi5 [Moll et al., 1991], and its licensing activity is blocked by phosphorylation at N-terminal CDK sites [Mimura et al., 2004]. The MCM complex is exported from the nucleus after S-phase in a CDK phosphorylation-dependant manner, at which time Cdt1 is also exported due to its association with MCM [Tanaka and Diffley, 2002]. Finally, ORC is phospho- rylated in a CDK dependent manner [Nguyen et al., 2001] and the S-phase cyclin Clb5 binds directly to Orc6 after an RXL motif is uncovered when the origin fires [Wilmes et al., 2004]. In metazoans, the process of licensing regulation is similar but is not tied as directly to CDK activity. For example, in D. melanogaster, in which cyclin-E functions as the S-phase cyclin in mitosis and endocycles and cyclin-A and B are the mitotic cyclins [reviewed in Lee and Orr-Weaver, 2003], the major target of rereplication control is not Cdc6 but Cdt1 (Dup1 in Drosophila), which is bound and inactivated by Geminin (implicated in many metazoan rereplication control mechanisms). Geminin levels are depleted after M-phase via proteolysis by the

22 Anaphase Promoting Complex (APC), thus allowing origin licensing, but are ele- vated as the cell transitions into S-phase, thus blocking further licensing [Quinn et al., 2001]. Cdt1/Dup1 is also phosphorylated in a CDK dependent manner, and thus its destruction may be mediated by CDK activity [Thomer et al., 2004], and without the mitotic cyclins the cell can undergo multiple S-phases with no inter- vening M-phase, effectively triggering an endocycle [Mihaylov et al., 2002]. Thus eukaryotic cells exhibit multiple overlapping methods for ensuring the temporal separation of the licensing and activation phases of the replication program, which provides multiple methods for guarding against aberrant initiation of DNA replica- tion in G1 as well as re-initiation from previously fired origins in S-phase. Should these overlapping methods for protecting the cell against aberrant repli- cation fail, the cell faces serious consequences centered around genomic instability. For example, in metazoan cells, rereplication can be induced by over-expression of a subset of the origin licensing proteins; i.e. Cdt1 and Cdc6, as well as by depletion of the negative regulators of origin licensing such as Geminin. In the best case sce- nario the cell will detect over- or under-replication of the genome and trigger the G2/M checkpoint, activating chk1 and chk2 and down-regulating Cdk1, preventing entry into mitosis [reviewed in Arias and Walter, 2007]. If the cell should escape this checkpoint, which can happen in the case of very limited amount of rereplica- tion [Davidson et al., 2006] or the inhibition of Chk1, and can be exacerbated by the loss of functional p53 [Vaziri et al., 2003], the cell will enter mitosis, causing mitotic catastrophe and apoptosis [Melixetian et al., 2004; Zhu et al., 2004]. In- terestingly, the re-replication induced by, for example, inhibition of geminin in D. melanogaster, localizes preferentially to the heterochromatic regions of the genome [Ding and MacAlpine, 2010], implying that the mechanism of re-replication may be linked to a chromatin or epigenetic control. Should all of these safeguards fail, however, aberrant replication leads to genomic

23 instability including increased ploidy and tumorigenesis. For example, elevated levels of Cdt1 and Cdc6 can be detected in tumors and tumor-derived cell lines [Arentson et al., 2002; Gonzalez et al., 2006], and mutations in the well-studied tumor suppressor gene p53 may synergize with overexpression of Cdt1 and Cdc6 to cause genomic instabilities in some cancers [Karakaidos et al., 2004]. The evi- dence of genomic instability in cancer is also overwhelming. As one small example, frequently thymocytes from Cdt1 transgenic mice lacking p53 contain extra chro- mosomes, some as many as 18 more than the expected 40 [Seo et al., 2005]. Thus the proper regulation of the DNA replication program is essential to prevent against genomic instability and tumorigenesis.

Regulation and mis-regulation at the level of localization

Much of what specifies pre-RC location in metazoans remains unknown [see above; reviewed in Gilbert, 2004], and indeed even in S. cerevisiae sequence alone does not tell the complete story of origin localization [Breier et al., 2004]. This could be a consequence of increased plasticity in the replication program. For example, in D. melanogaster early embryos, origins are spaced very close together, and as the fly develops and zygotic transcription initiates, inter-origin distances increase and the replication program resembles that of differentiated tissue [Blumenthal et al., 1974]. If the replication localization program were fully specified by an immutable sequence signal rather than mutable chromatin, epigenetic, or transcriptional sig- nals, this level of plasticity might not be possible. Active transcription influences the metazoan replication program. Although there is no overlap between S. cerevisiae replication origins with TSSs, two thirds of D. melanogaster’s ORC binding sites are proximal to TSSs, and these TSSs tend to be very actively transcribed [MacAlpine et al., 2010]. This phenomenon has also been observed in human [Cadoret et al., 2008]. This could be due to a chromatin

24 structure at active promoters that is adventageous for origin function or DNA un- winding. For example, both promoters and D. melanogaster origins are sites of high histone turnover [MacAlpine et al., 2010] and through a new assay for identifying rapidly substituted histones called CATCH-IT (Covalent Attachment of Tags to Cap- ture Histones and Identify Turnover) the Henikoff group has identified both TSSs and replication origins as hot spots for histone swapping [Deal et al., 2010]. Ori- gins located in active promoters might also reduce the incidence of head-to-head collisions between elongating RNA transcription machinery and elongating DNA replication machinery, which could be deleterious [Lebofsky et al., 2006]. Further- more, it has been shown that active transcription will silence a replication origin (see above), one example is a gene in S. cerevisiae, MSH4, which is only transcrip- tionally active in early meiosis. MSH4 contains a replication origin, ARS605, which is active in the mitotic S-phase but silenced in the meiotic S-phase when MSH4 is active. This origin can be silenced by up-regulating MSH4 transcription in mitotic S-phase [Mori and Shirahige, 2007], showing that in a chromosomal context, ac- tive transcription can silence origins of replication. Chromatin has been shown to influence origin localization [reviewed inM ´echali, 2010]. This is the subject of a later section of this introduction. Briefly, generally activating marks have been shown to enhance pre-RC formation. Further, CpG islands in vertebrates and plants have been shown to frequently overlap with repli- cation origins [Delgado et al., 1998], and furthermore methylated CpG islands seem to replicate later in S-phase [Gomez´ and Brockdorff, 2004]. Far less is known about the consequences of mis-regulation at the level of replication origin localization than about the consequences of over- or under- replication. Since the genome could at least in theory be replicated by far fewer origins than are licensed each cell cycle, determining the reason for the profusion of pre-RCs and why they track with active transcription start sites should be a

25 fruitful avenue of future research. Clearly some are used as dormant origins that fire in times of replication stress [Branzei and Foiani, 2005], and it has been shown that reduction of the number of active origins can trigger a cell cycle checkpoint [Shimada et al., 2002].

Regulation and mis-regulation at the level of initiation within S-phase

Regulation of origin activation drives the temporal program of DNA replication. Although the biochemistry of the activation process is well on its way to being fully characterized [reviewed in Sclafani and Holzen, 2007], less is known about the establishment of the timing profile or its significance. It is clear, however, that whole-genome replication time is coordinated by replication origin cluster activa- tion times, and that replication time coordinates with transcriptional activity and chromatin features [Hiratani et al., 2008; Schwaiger et al., 2009]. For example, over the course of mouse embryonic cell differentiation, whole-genome replication timing reorganizes to reflect the reorganization of genomic transcription [Hiratani et al., 2008]. However, this relationship is not direct; genes may change replica- tion timing without changing transcriptional rate and vice-versa, the correlations between replication and transcription are only apparent when averaging over large domains. One mechanism for regulation of timing may be the nuclear localization of the origins themselves [reviewed in Gilbert et al., 2010; Gilbert, 2010a]. By im- munofluorescent staining, it’s clear that active replication localizes to discrete foci throughout the nucleus [Nakamura et al., 1986; Pasero et al., 1997]. The spa- tial distribution of these replication foci changes within the nucleus throughout S-phase [e.g. Panning and Gilbert, 2005], as they seem to work their way from the nuclear interior out to the periphery over the course of S-phase. A series of experiments over the last ten years has demonstrated an “origin

26 decision point” (ODP) in which during each cell cycle the cell’s replication tim- ing is reestablished while the chromatin is reorganized after mitosis [Raghuraman et al., 1997; Dimitrova and Gilbert, 1999]. This appears to be coincident with the organization of post-mitotic chromatin into nuclear compartments, in which the hetrochromatin is relegated to the nuclear periphery [reviewed in Chubb and Bick- more, 2003], linking replication to the chromatin architecture described above. Chromatin and initiation share an intrinsic link. This is the subject of a later section of this introduction. Briefly, activating marks have been shown to correlate with early replication, as well as the physical chromatin structure as defined by nucleosomes. Chromosomal context also appears to be very important, for example yeast telomeric origins generally fire very late, but can become early activating in a silencing-deficient cell [Stevenson and Gottschling, 1999]. Like the mechanism underlying the regulation of replication timing, we have only begun to appreciate what effects mis-regulation of replication timing may have. One of the clues comes from studies showing that as replication forks move into heterochromatic regions late in S-phase, they associate with different chro- matin remodeling factors [Collins et al., 2002; Bozhenok et al., 2002]. Howard Cedar’s group subsequently showed that the chromatin marks associated with a plasmid will change based solely on whether it is injected into the cell in early S-phase versus late S-phase, with early S-phase plasmids containing a higher de- gree of histone acetylation [Zhang et al., 2002a; Lande-Diner et al., 2009]. Thus maintenance of replication timing could equate to maintenance of chromatin state, providing a tantalizing hint as to the epigenetic inheritance of chromatin states. As chromatin and nuclear position seem to be linked, one could also imagine that if change in replication timing effects chromatin, it would also effect the 3D structure of the genome, or vice-versa. There is also recent evidence that later replicating regions have a higher mutation

27 rate [Stamatoyannopoulos et al., 2009; Pink and Hurst, 2010; Chen et al., 2010]. If this is true, and late replication is sufficient for an increased mutation rate, one could imagine that the regulation of replication timing has implications for both germline and somatic mutations. In this case, mis-regulation could be a novel oncogenic pathway.

1.1.3 Genomic tools for analysis of replication

The first investigations into the eukaryotic DNA replication program were per- formed using tritiated thymidine incorporation visualized on autogradiograms from meta-phase chromosome spreads [e.g. Hsu et al., 1964], and these gave us the first glimpses into the genomic organization of replication, albeit at low resolution. Great technological strides have been made in the years since these first experi- ments. In particular, the last ten years have seen the advent of tools which allow us to directly interrogate various aspects of the replication of an entire eukary- otic genome simultaneously and with near base-pair level resolution [reviewed in MacAlpine and Bell, 2005; Gilbert, 2010b]. In this section I will describe the tools available for studying the DNA replication program genome-wide.

Locating origins

Eukaryotic origins of replication — those sites bound by ORC and which host pre-RC assembly — can be difficult to pinpoint, even in budding yeast where ORC occupies very punctate sites. As the complexity of the genome increases there are fewer and fewer obvious determinants of ORC binding, and replication can origi- nate from specific loci or from “initiation zones” that can span many kb [reviewed in Hamlin et al., 2008]. For example, one of the first and most-studied mammalian origins is the chinese hamster dihydrofolate reductase (DHFR) origin, responsible for amplifying the DHFR locus as a mechanism for transcriptional up-regulation.

28 The DHFR locus has a zone of replication initiation that spans approximately 55 kb, and though there are some preferred replication loci contained therein, repli- cation can with a lower efficiency initiate from essentially any area within the zone [e.g. Leu and Hamlin, 1989; Dijkwel et al., 2002]. Thus, locating replication ori- gins is hampered by a number of obstacles, many related to the diffuse nature of the eukaryotic replication program. These obstacles are not insurmountable, how- ever, and the last few years have seen remarkable improvement in genomic origin mapping. An important point is that not all potential origins of replication will play host to replication initiation in any given cell cycle. There are dormant origins that only fire if needed in times of replication stress but are generally replicated passively by other origins [Santocanale et al., 1999], and there are inefficient origins that fire in only a subset of possible cell cycles [e.g. Friedman et al., 1997], and so mapping by replication initiation activity (the subject of the next section) will not provide a complete map of licensed origins. This leaves the ARS assay (which only works in unicellular organisms and removes the origin from its chromatin context, inviting false positives), and the genomic location of the proteins of the pre-RC as the guideposts of replication origin licensing. Thus chromatin immunoprecipita- tion (ChIP) is the technique of choice for locating potential origins of replication. Although single locus ChIP can help to confirm the existence of an origin, genome- scale techniques are far more time- and cost-effective for interrogating the DNA replication program [reviewed in MacAlpine and Bell, 2005]. Chromatin immunoprecipitation on a microarray (ChIP-on-chip, or more fre- quently ChIP-chip) [Ren et al., 2000] for ORC has been used to globally profile locations of pre-RC formation in both fission [Hayashi et al., 2007] and budding yeast [Wyrick et al., 2001; Xu et al., 2006], as well as D. melanogaster [MacAlpine et al., 2004, 2010]. Briefly, ChIP-chip involves cross-linking any chromatin-bound

29 proteins to the DNA with a cross-linking agent such as formaldehyde, followed by sonication of the DNA. This produces small DNA fragments, and the fragments still bound by the protein of interest can be enriched using an antibody specific to that protein. The cross-link is reversed and DNA is then purified from the sample. This purified DNA is labeled with a fluorescent dye and is hybridized to an array of short oligos with sequence identity which tiles the portions of the genome that can be mapped uniquely given the oligo length. A sample prepared in parallel but with a mock antibody or lacking a precipitation step and different wavelength dye is hybridized competitively to the same array. The ratio of fluorescence of the former (the IP) sample to the latter (the input, or IN) sample can then be measured by laser excitation and indicates the relative ratio of enriched IP fragments to back- ground IN fragments. This data can then be analyzed with ChIP-chip peak calling software such as MA2C [Song et al., 2007], which both smooths the genome-wide signal assuming a relationship between genome-proximal probes and also calls peaks by finding areas that are statistically significantly enriched in the IP over the IN, using “negative peaks” (areas where the IN signal is significantly greater than the IP) as a measure of false discovery rate (FDR). ChIP-chip for ORC in S. cerevisiae has yielded maps of around 400 replication origins [Wyrick et al., 2001; Xu et al., 2006]. ChIP-chip has also been performed for the MCM complex, and there is a considerable overlap between MCM peak calls and ORC peak calls [MacAlpine and Bell, 2005; Xu et al., 2006]. However, the resolution of ChIP-chip is affected by sonication fragment size, oligo length, and tiling density. Due to a combination of this resolution and the abundance of ACS motif matches (see above), even by combining ChIP-chip for ORC and MCM with the ACS motif into a single model it is impossible to distinguish ORC’s binding site down to the base pair in most cases [Xu et al., 2006]. ChIP-seq can provide greater resolution due to the fact that it does not depend on oligo length or tiling

30 density, and exploiting this resolution will be one of the main themes of Chapter 2. ChIP-chip for ORC and MCM in S. pombe revealed 460 origins [Hayashi et al., 2007], however the signal in both cases was more diffuse and less enriched over background, reflecting the larger replication initiation zones in S. pombe. Inter- estingly, about two thirds of these origins were found to be active in hydroxyurea (HU, see below), implying that many of these replication origins can fire before the early S block imposed by HU. This could indicate that, since only around 15% of origins activate per cell cycle (see below), the origins are firing stochastically and the initiation events being observed in the HU-arrested cells are occurring in only a subset of the cell population. ChIP-chip for members of the pre-RC in metazoans has proven to be more diffi- cult, due both to larger genome size making the genome more difficult to tile, and low IP enrichment due to more numerous and diffuse ORC binding sites. To date there are no vertebrate ChIP-chip data sets for any member of the pre-RC. In two previous publications, the MacAlpine group has interrogated the replication pro- gram of D. melanogaster with a number of techniques including ChIP-chip for ORC and MCM [MacAlpine et al., 2004, 2010; modENCODE Consortium et al., 2010]. They have found that ORC localizes to open chromatin, marked by a low density of nucleosomes and high histone turnover, and that it capitalizes on the open chro- matin at the TSSs of actively transcribed genes. An extension of these results is the subject of Chapter 3. Any ChIP strategy for members of the pre-RC, whether it is single loci ChIP, ChIP-chip or ChIP-seq, will suffer from the reality that it must be performed over a population of cells. If synchronicity of the cells is desired, synchronizing may have unexpected effects on the replication program [Anglana et al., 2003] as well as confounding results as cells slowly lose their synchronicity over the course of the experiment. Even in asynchronous cells, population effects can make a group of

31 closely spaced potential ORC binding sites appear to be an initiation zone; or if a particular origin is only bound by ORC in a subset of the cells, it may be missed. As the technology improves, one can imagine a single cell approach to locating chromatin-bound pre-RC proteins on a genome-wide scale, which would clear up these population effects. Until then, ChIP-chip and ChIP-seq remain powerful tools to interrogate origin location and licensing genome-wide.

Mapping replication initiation events

Given that some origins are dormant, while others fire inefficiently, the next step in assaying the eukaryotic replication program is measuring the active initiation of licensed origins. There are a number of replication initiation “fingerprints” that can be leveraged to map origin activity. However, mapping initiation remains chal- lenging, due to the speed with which these fingerprints are washed away, and con- founded by the fact that not all origins fire in every cell cycle [reviewed in Hamlin et al., 2008]; in fact, fission yeast origins seem to initiate in less than 10% of cell cycles [Heichinger et al., 2006; Cotobal et al., 2010], while in mammalian cell lines evidence indicates that most origins are also inefficient [e.g. Mesner et al., 2006]. This implies that each cell within a population will have a different pattern of ori- gin activation, and since activation assays require a large population of cells their output will be a convolution of the cells’ individual replication programs. Never- theless, taking the average replication initiation “potential” of each origin over a population of cells is very informative, and there are a number of different ways to go about it. All techniques for mapping replication initiation events leverage the previously mentioned replication initiation fingerprints [reviewed in Gilbert, 2010b]. These are generally derived from the unique nucleic acid structures that briefly arise at active origins. The most often used single-loci assay, for example, is the bubble

32 trapping gel [Bell and Byers, 1983; Brewer and Fangman, 1987; Friedman and Brewer, 1995], in which a restriction fragment is separated on a 2-D gel first by length and then by shape. The resulting characteristic gel pattern will vary based on the replication dynamics of the restriction fragment, such that a fragment that hosts an active replication origin (a fragment that contains a replication “bubble”) will look distinct from a fragment which is generally replicated from a nearby origin. While highly informative, the bubble trapping assay is time consuming and restricted to single loci. Very recently, one group reported success mapping initiation in humans by extending the bubble trapping assay to be genome-wide, with the results visualized on an ENCODE array [Mesner et al., 2011], although these results have yet to be vetted by the community, this represents an exciting new development in the mapping of replication initiation. One approach to mapping replication initiation genome-wide is to leverage the replication dynamics of a newly initiated origin, where the “leading strands” of the replication bubble will be nascent strands with an RNA primer at their 5’ end which will be coincident with the origin. By denaturing these nascent strands from actively dividing cells and size selecting them to be as small as possible without being confused with (which are between 50 and 350 bp), the researcher can paint an accurate picture of genomic activation by competitively hybridizing them against whole genomic DNA on a tiling array or by sequencing them. However, successfully enriching for these intermediates can be difficult, and often involves incorporating 5-bromodeoxyuridine (BrdU) (a nucleotide analog) and adding a ChIP step, or by digesting with λ-exonuclease (Lexo) which degrades all DNA without a 5’ RNA cap, effectively enriching for nascent strands. The Lexo and BrdU approaches can be combined for maximum effectiveness. Recent at- tempts to apply these techniques in human have led to some success [Cadoret et al., 2008; Lucas et al., 2007; Karnani et al., 2010], although troublingly the

33 results of the latter two studies agree on less than 14% of their calls despite using the same cell line. It should be noted that they used two different amplification methods and array platforms, see [Karnani et al., 2010] for a discussion on the apparent discrepancy. A third technique to mapping origin initiation is to arrest the cells just after they initiate replication. This can be done using, for example, hydroxyurea (HU), which forces the forks to stall by depleting the available nucleotide stores of the nucleus [reviewed in Nyberg et al., 2002]. After this early S-phase arrest, there are num- ber of ways to detect replication initiation. For example, the active origins can be detected by hybridizing genomic DNA to a tiling array and searching for elevated copy number [Yabuki et al., 2002]. Alternatively, BrdU could have been incorpo- rated into the cell before the arrest, meaning that the areas of BrdU incorporation would cover the area that a fork had replicated [Viggiani et al., 2010; MacAlpine et al., 2004]. In this case any peak-calling tiling array software (e.g. MA2C, see above) could be used to determine areas of significant BrdU enrichment. One po- tential drawback of this technique is that it only captures early S-phase initiation, although that adds an additional and useful layer of information to this assay, and it’s that early potential that we will exploit in the latter half of chapter 3. However, it has been suggested that by holding wildtype S. cerevisiae cells in HU for longer periods, eventually all origins will initiate [Alvino et al., 2007], meaning that this technique could illuminate all active origins, and that performing a time course to understand the kinetics of HU arrests and BrdU incorporation in metazoans could yield novel insights into the origin activation program.

Mapping replication time

Mapping initiation events can tell us where active origins are located, and with a properly administered HU arrest they can even tell us whether they fired early

34 within S-phase; however, determining the replication time of the genome (i.e. the time relative to the duration of S-phase in which any given base pair is replicated) allows us to visualize what is arguably the end result of all of the intermediate steps which result in duplicating the genome in a regulated fashion. Most methods for measuring replication timing genome-wide amount to partitioning cells based on their progress through S-phase and then interrogating the differences between the partitions. This can be achieved through synchronization in G1 with (in S. cerevisiae) an α-factor mating type arrest or the previously described early S HU block, but it can also be achieved in cells that are difficult to synchronize using fluorescence-activated cell sorting (FACS) [reviewed in Raghuraman and Brewer, 2010]. Regardless of the method used to partition the cell population by progress through S-phase, the strategy remains roughly the same. Any Meselson-Stahl style assay for differentiating newly synthesized DNA from pre-existing DNA will suf- fice [Raghuraman et al., 2001]. For example, by pulsing a population briefly with BrdU, partitioning by progress through S-phase, performing chromatin immuno- precipitation for the BrdU and then either competitively hybridizing them to a tiling microarray [e.g. Schubeler¨ et al., 2002; MacAlpine et al., 2004; Schwaiger et al., 2009; Hiratani et al., 2008], or sequencing them [Hansen et al., 2010]. The relative time of replication for each locus in the genome can then be determined by examining the relationship of the various fractions (e.g. log ratio of fluorescence in the case of a two channel tiling array). Until the technology exists for assaying replication time in a single cell, there will probably not be any further progress on this assay. Just as in all of the assays described above, a replication timing assay suffers from the fact that it must draw from a population of cells, each of which might be employing its own particular replication timing program, in which case the “replication timing” that the assay reports is an ensemble average. An ensemble

35 average is useful in its own right, but it is one step removed from actual replication timing. This assay’s resolution is further hampered regardless of the resolution of the genomic mapping technique, whether it is array based or sequencing based, by the reality of the underlying biology and the requirements of the experiment. The cells within each S-phase partition must be considered synchronous, but in reality they inhabit a continuum of S-phase progression between the two cutoffs of the partitioning method (cell sorting, for example). Furthermore the BrdU pulse, even if it only lasts for 10–15% of S-phase, will be relatively general, as the replication forks are moving very quickly through the chromatin and 10–15% of S-phase in D. melanogaster covers approximately 20 mb worth of replicated DNA. Thus the as- say tracks approximately 20 mb of BrdU incorporation in a partially asynchronous population of cells which could each be employing its own individual replication program; this means that the resolution limiting step is not the choice of tiling ar- ray vs. sequencing, but rather the intrinsic biological requirements of the system being studied. The end result of the replication timing assay is very broad (megabase-level) domains of replication timing. It has been shown using this technique that, for example, replicate late in S-phase [Raghuraman et al., 2001] and that in metazoans replication timing and active transcription are intrinsically linked [MacAlpine et al., 2004]. It’s important to point out that within these domains exist tens or hundreds of potential origins, and it would be an interesting computational experiment to use the three data types described in this chapter to attempt to deconvolve the contributions of each origin in particular to the replication timing profile.

36 1.2 Chromatin

Each 10.4 bp helical turn of our DNA spans approximately 3.4 nm, meaning that with chromosomes stretched end to end, the 3.3 gb human genome is over 1 meter long. Thus, given the rough estimate of 10 trillion cells in the human body, we have enough linear DNA in our bodies to stretch from the earth to the sun seventy times over, or to circle the earth’s equator a quarter of a million times. Given this massive amount of linear DNA, eukaryotic cells have evolved a strategy for com- pacting and storing the DNA in the 3D space of the nucleus. This strategy involves wrapping the DNA around octamers of proteins called histones to form a structure called the nucleosome. These nucleosomes then interact through their “tails”, their peptide chain N-termini which extrude from the complex, to set up arrays of higher order structure using the nucleosomes as the basic building blocks [reviewed in Ko- rnberg, 1977; Workman and Kingston, 1998; Rando and Chang, 2009; Woodcock and Ghosh, 2010; Margueron and Reinberg, 2010, e.g.]. The histone tails extrud- ing from the nucleosome can be modified covalently, and through this modification the level of compaction of the chromatin can be modified, and, potentially, certain signals can be embedded to be read by the various chromatin-templated processes at work in the nucleus [reviewed in Jenuwein and Allis, 2001; Zhou et al., 2011, e.g.]. In the following sections I will discuss the features of chromatin salient to the research presented throughout the rest of this thesis. Although DNA methylation is often grouped together with histone modifications in discussions of chromatin and its regulatory potential, this thesis focuses on S. cerevisiae and D. melanogaster, neither of which have prevalent DNA methylation. Although some evidence for DNA methylation has been reported in D. melanogaster [Lyko et al., 2000; Kunert et al., 2003], it is not very prevalent and its role in fruit fly biology has yet to be

37 determined; it may only function in early embryos [Lyko et al., 2000].

1.2.1 Overview of chromatin

Eukaryotic genomes are compacted into a structure known as chromatin [reviewed in Rando and Chang, 2009]. The basic unit of this compaction is the nucleosome, a complex of DNA with proteins known as histones [Kornberg, 1974]. The nu- cleosomes comprises (most frequently) 147 bp of DNA wound approximately 1.7 times around a disk-shaped octamer of histone proteins. This histone octamer in turn comprises a central tetramer of histones H3 and H4 ((H3/H4)2) in com- plex with two flanking (H2A/H2B) dimers [Luger et al., 1997], to form a protein complex which is symmetrical around a dyad axis. Nucleosomes occur serially along a chromosome, with the intervening DNA of variable length between nu- cleosomes referred to as “linker” DNA. The simple association of histone octamers with DNA represents the lowest level of chromatin compaction and results in a 10 nm fiber which, when viewed through electron microscopy, resembles “beads on a string” [Olins and Olins, 1974], where the beads are nucleosomes and the string is the linker DNA between them. Nucleosomes may then interact with one another through exposed N-terminal “tails” and the linker DNA may interact with another histone protein (H1) to form a higher order structure: the 30 nm fiber. The 30 nm fiber has a structure that remains only partially understood [Hansen, 2002; Wood- cock and Ghosh, 2010] but results in a linear DNA compaction of approximately 50x [Felsenfeld and Groudine, 2003]. The 30nm fiber may then be compacted further by scaffolding proteins into domains and loops which can then be further compacted during metaphase [Razin et al., 2007]. The effect of DNA’s nuclear compaction into chromatin is profound and cannot be understated. There is no genomic DNA free of chromatin, and so any process which interacts with the DNA must be able to compete with histones for access or

38 have a strategy for getting nucleosomes out of its way. The list of affected processes includes transcription, DNA repair, and of course replication. Chromatin may have evolved as a strategy to condense the eukaryotic genome and thus fit more DNA into the nucleus, but it also imparts a wide variety of addi- tional regulation. Although the structure described above seems relatively regular, there exist a host of chromatin modifications that can alter the configuration of the chromatin fiber, and thus influence any DNA-templated processes. One method of chromatin modification is histone variants — proteins that can substitute for one of the canonical histone proteins and thus alter the local chromatin environment. The non-canonical histone proteins include H2AX, which can replace histone H2A and is thought to mark sites of DNA damage, and H2AZ which is thought to alter nucleosome stability and tends to be enriched at transcription start sites [Redon et al., 2002; Zhou et al., 2011]. H3.3 is one of many H3 variants, and enrich- ments in H3.3 tend to mark sites of active histone turnover. CENP-A is a histone variant of H3 found at centromeres [Malik and Henikoff, 2003]. These are just four examples; for a recent list of histone variants and their function see [Talbert and Henikoff, 2010]. By swapping out canonical histones for variants, the chro- matin has a method to embed within its structure an additional set of signals on top of those already present in the DNA sequence — signals that can be read by chromatin-associating factors. In addition to histone variants, the chromatin holds yet another layer of infor- mation. The N-terminal tails of the core histones extrude from the nucleosome and their residues are available for chemical modification by chromatin modifying enzymes [reviewed in Jenuwein and Allis, 2001; Kouzarides, 2007; Zhou et al., 2011]. These modifications are referred to collectively as “histone marks”, and include acetylation, methylation, ubiquitinylation, sumoylation and ribosylation of the lysine residues and phosphorylation of the serine and threonine residues. The

39 complete specification of the modification state of the histone tails and the histone variant composition of local nucleosomes at a given locus composes the “histone code” at that locus. Nucleosomes are important in chromatin regulation not only for the modifi- cations on their histone tails or their histone variant composition, but also their precise positioning within the chromosome [reviewed in Jiang and Pugh, 2009]. Due to the nature of the histone octamer’s interaction with the DNA, nucleosomes can sterically occlude binding of other factors to a potential protein binding site and so when they are strongly positioned they can act as highly informative reg- ulatory guide for protein binding. For instance, nucleosomes are very well posi- tioned around the TSSs of transcriptionally active genes [Yuan et al., 2005], with a nucleosome free region (NFR) extending for 150 bp 5’ of the TSS, and a phased array of nucleosomes beginning at the TSS and extending down into the gene body. NFRs found at transcriptionally active genes will generally be the binding site for RNA Pol II and its co-factors. To take advantage of the information contained in nucleosome positioning, whole-genome Dnase I assays have been developed to map constitutively non-nucleosomal DNA [Crawford et al., 2006a] and the discov- ered hypersensitive sites are enriched for genetic regulatory elements [Boyle et al., 2008b]. Furthermore, protein binding motif finding algorithms which leverage the prior knowledge that proteins will prefer to bind in non-nucleosomal regions have been shown to be more accurate than an algorithm uniformed about the chromatin structure [Narlikar et al., 2007]. Thus, nucleosome positioning is an extremely im- portant method of specifying regulation through chromatin structure, and how nucleosome occupancy and positioning interacts with the replication program will be a large focus of chapters 2 and 3. A number of algorithms for predicting nucleosome position from sequence have been developed, relying on the observed frequencies of dinucleotides in the 147 bp

40 of DNA that wrap the histone core [e.g. Segal et al., 2006]. Importantly, the po- sition of nucleosomes within the chromosome can be influenced by sequence, but sequence does not seem to be the primary determinant in vivo [Stein et al., 2010], and so the accuracy of such predictions have been called into question [Zhang et al., 2009]. Superlative nucleosome positioning sequence elements such as poly- dA:dT, however, play a very active role in nucleosome positioning by resisting to wrap around the histone core, thus resisting nucleosome formation [Segal and Widom, 2009]. On a broad scale, chromatin can be grouped into two categories: euchro- matin and heterochromatin, initially detected by chromosomal stains in which heterochromatin, due to its dense nature, took up more dye [Heitz, 1928]. The euchromatin is less compact, generally following the structure of the 10 nm chro- matin fiber described above. Most genes are found in euchromatic regions, which, due euchromatin’s more relaxed nature, is more accessible to the transcriptional machinery [reviewed in Woodcock and Ghosh, 2010; Rando and Chang, 2009]. The cell’s heterochromatin generally follows the structure of the 30 nm fiber de- scribed above, rendering it far less accessible and thus generally transcriptionally silent. Heterochromatin can be further partitioned into “constitutive” and “fac- ultative” heterochromatin. Constitutive heterochromatin is persistently silent re- gardless of cellular context (classic examples include telomeres and centromeres), whereas facultative heterochromatin can undergo transitions between a tight clas- sically heterochromatic structure and an open more euchromatic structure. Fac- ultative heterochromatin is generally enriched in areas critical for cellular context such as developmental stage and differentiation [reviewed in Trojer and Reinberg, 2007]. Although useful, the division of chromatin into heterochromatin and euchro- matin glosses over some of the more interesting points of chromatin function.

41 There are clearly subdivisions within both heterochromatin and euchromatin that specify unique chromatin environments, advantageous for some chromatin-tem- plated processes and deleterious for others. For example, the TSSs of actively transcribed genes have a unique chromatin signature which includes a nucleosome free promoter (see above), promoter-proximal incorporation of histone 3 lysine 4 tri-methylation (H3K4me3), and an H2AZ containing nucleosome at the TSS. The potential number of combinations of marks is staggering, and each combina- tion brings its own regulatory potential, and so annotating chromatin organization computationally is a nascent and exciting field. There have recently been a num- ber of attempts to partition chromatin into a number of different states based on an automated parsing of genome-wide chromatin data by hidden Markov models (HMMs), sometimes paired with principle components analysis (PCA) [Ernst and Kellis, 2010; Kharchenko et al., 2010; modENCODE Consortium et al., 2010; Fil- ion et al., 2010]. The number of states uncovered has varied widely between the different reports, from 5 states in [Filion et al., 2010] to 51 in [Ernst and Kellis, 2010] (albeit in D. melanogaster vs. human), and depends for the most part on the balance between granularity on the one hand and ease of interpretation and avoiding overfitting on the other. Chromatin is the dynamic structure into which DNA is packaged, and it serves many purposes: compacting the genome into the nucleus, allowing access to eu- chromatic DNA for transcriptional machinery and other proteins, and specifying unique chromatin-bound processes with its histone code. Understanding chro- matin is of paramount importance, especially in metazoan cells where the mech- anism behind genetically identical cells having very different roles and behavior has yet to be fully determined. I have consciously avoided referring to the unique configurations of chromatin as epigenetic information, a decision which has to do with how chromatin might be inherited after cell divisions and whether it could

42 persist if the marks were perturbed. I will discuss the inheritance of chromatin information later in the introduction, and refer the reader to [Ptashne, 2007].

1.2.2 Genomic tools for analysis of chromatin

A number of whole-genome methods for assessing the chromatin organization of the eukaryotic genome have recently been developed [reviewed in Rando and Chang, 2009; Zhang and Pugh, 2011]. I will briefly describe them here as they relate to the research in following chapters.

Nucleosome positioning

Determining the positions of nucleosomes genome-wide is an important step in un- derstanding active chromatin regulation. Nucleosomes can occlude DNA binding proteins from the DNA, and this results in the majority of reliable non-histone- protein:DNA interactions occurring in nucleosome free or nucleosome depleted regions (NFR and NDR respectively). An NDR represents an area or a collection of areas (larger than a canonical linker region) within a genome which are infre- quently occupied by nucleosomes, kept out perhaps by a transient DNA binding protein or by some sequence preference. In contrast, an NFR represents an area or a collection of areas which are never bound by nucleosomes, and nucleosome en- croachment is countered with nucleosome remodelers or with boundary elements such as a consistently bound protein. Mapping nucleosomes genome-wide is a matter of mapping resolution and data analysis. One method for mapping nucleosomes will be discussed in great detail in chapter 2. Briefly, taking a population of cells and isolating the chromatin, the linker DNA is digested with micrococcal nuclease (MNase), leaving only DNA frag- ments protected by proteins (such as histones). These fragments are purified and then size-selected to be in the range of a single nucleosome (approximately 150

43 bp). At this point the fragments should be highly enriched for DNA that was incor- porated into nucleosomes. Then the DNA can be hybridized to a tiling microarray (MNase-chip) [e.g. Yuan et al., 2005; Lee et al., 2007] or sequenced using a second generation sequencer (MNase-seq) [Shivaswamy et al., 2008]. Calling nucleosome positions from this data is deceptively complicated, and groups have developed a number of calling algorithms leveraging for example HMMs [e.g. Yuan et al., 2005; Yassour et al., 2008], kernel density estimation [e.g. Shivaswamy et al., 2008], edge detection [Zhang et al., 2008a], and simple visualization of densities or read counts [Mavrich et al., 2008b].

Mapping open chromatin

The question of which genomic loci are constitutively free to be bound by fac- tors others than histones is an important one with implications for whole-genome and cell-type specific regulation [Boyle et al., 2008b]; it is also a question that can be answered without going to the extreme of mapping nucleosomes genome- wide. Methods have been developed, such as DNase-chip [Crawford et al., 2006a], DNase-seq [Crawford et al., 2006b; Boyle et al., 2008b; Song and Crawford, 2010], and FAIRE [Giresi et al., 2007], to map constitutively open chromatin genome-wide using sequencing. Briefly, in the DNase assays, chromatin is digested with DNase I, an endonuclease with no sequence preference. It will nick DNA preferentially at open locations, making DNA ends which are available to attach an Illumina se- quencing primer. These sequences are then mapped back to the genome and peaks are found using kernel density estimation (F-Seq [Boyle et al., 2008a]).

ChIP-chip and ChIP-seq

Multiple elements of the chromatin environment can be measured with ChIP. As described in the replication section above, this involves enriching for protein-bound

44 DNA fragments genome-wide with an antibody specific for your protein of choice. These fragments can then be compared to a background distribution of fragments by hybridized both samples to a genomic tiling array (ChIP-chip) [Ren et al., 2000] or by sequencing (ChIP-seq) [Robertson et al., 2007]. This approach can be used to profile chromatin-bound proteins, histone variants [Mavrich et al., 2008b], and even histone modifications [Barski et al., 2007]. Care needs to be taken in the analysis of this data, however, as the sonication step can lead to biases in the input signal for open chromatin [Auerbach et al., 2009], so accurate correction of the IP signal by an input signal is critical. Further, while some chromatin marks will appear in punctate peaks, such as H3K4me3, others, such as H3K9me3, will appear to be spread over large domains. In this sense, traditional peak calling software may work for punctate chromatin marks but run into problems on broad marks. Some software packages get around this by including both a narrow and a broad peak calling mode, for example SPP [Kharchenko et al., 2008]. Algorithms tailored to ChIP-seq for chromatin marks with broad domains are beginning to be released as well [Song and Smith, 2011].

1.3 Regulation of the replication program by chromatin

DNA replication and chromatin share an intrinsic connection. It has been known for many years that eukaryotes replicate their DNA in a defined order [Taylor, 1960], and that heterochromatic regions replicate late [Gilbert et al., 1962]. More- over, in no eukaryotic system is DNA sequence alone enough to fully specify the lo- calization of the replication program [Theis et al., 1999; Breier et al., 2004; Vashee et al., 2003; Remus et al., 2004], necessitating a second level of information at the cis-acting level. Thus, it is reasonable to conclude that chromatin could be helping to specify the euchromatic DNA replication program.

45 Although it will not be investigated in this work, it is worth mentioning that the link between replication and chromatin is a two-way street. Every time a cell di- vides, its DNA must be copied and chromatin reconstituted. Complicating matters is the fact that the DNA replication machinery must pull DNA from its chromatin context, but then reassemble histones on the nascent daughter strands within a kb of the replication fork passing [Gasser et al., 1996]. Thus, the connection between replication and chromatin is, in a sense, symmetric. Finding a link between eukary- otic DNA replication and chromatin replication is a fantastic research direction and would get at the heart of whether or not chromatin signatures are heritable [re- viewed in Margueron and Reinberg, 2010; Kaufman and Rando, 2010].

1.3.1 Nucleosome positioning

Nucleosomes have the potential to regulate DNA-templated processes through their positioning, as occluding DNA can force DNA binding factors away from some sites and direct them toward linker DNA [Liu et al., 2006]. Nucleosome positioning is a major factor in, for example, transcriptional regulation, where the TSS of active genes has a well-defined nucleosomal structure [reviewed in Radman-Livaja and Rando, 2010], featuring a 150 bp nucleosome free region (NFR) available to be bound by regulatory factors and the transcription machinery. Nucleosomes are also implicated in replication origin localization and function. Very well-positioned nucleosomes were found within S. cerevisiae ARS1 by MNase digestion of the TRP1ARS1 plasmid [Thoma et al., 1984]. After the discovery of the ACS, the Simpson group was quick to notice that the 11 bp core ACS occurred in a stable NFR within the plasmid. In an elegant series of experiments, by mutat- ing the plasmid such that the nucleosomes were occluding the ACS they showed that nucleosomal ACS occupancy drastically reduced maintenance of the plasmid across generations, presumably due to the interruption of initiator binding [Simp-

46 son, 1990]. The question remained if this nucleosomal regulation of pre-RC formation would persist in ARS1’s chromosomal context. In 2001, the Bell group proved that the reg- ulated nucleosome free region was present at ARS1in vivo, and in so doing came to a number of novel conclusions about origin function with respect to nucleosome positioning [Lipford and Bell, 2001]. While the endogenous ARS1 ACS was indeed nucleosome free, inhibiting ORC binding by ACS mutation shifted the nucleosomal array in toward the nucleosome free region. The group then moved back to an in vitro system to show that the shift in nucleosome positioning was due entirely to ORC’s absence, and thus ORC is a determinant of nucleosome position at origins of replication. Interestingly, moving the nucleosome with which ORC was interacting to be further distal to the ACS, which increased both the NFR and the distance from ORC to the nucleosome, had no effect on ORC binding, but did reduce plas- mid maintenance. This implies that although ORC can bind, the pre-RC either cannot form or cannot activate without a proximal nucleosome just 5’ upstream of the ORC binding site. Finally, they showed that MCM loading at the pre-RC was the limiting step in plasmid maintenance when ORC does not interact with its 5’ nucleosome. All of this implies that not only is an NFR important for S. cerevisiae ORC bind- ing, but that a precisely positioned 5’ nucleosome is critical for pre-RC formation. A further investigation into the specificity afforded by nucleosome positioning can be found in chapter 2. Interestingly, Ino80, a chromatin remodeling enzyme, has been shown to localize to origins and to stalled replication forks, implying that nu- cleosome positioning may be a universal feature of loci with replication initiation potential [Shimada et al., 2008; Vincent et al., 2008]. It is also worth noting that positioning may play a role in active elongation, as chromatin remodelers travel with the replication fork (e.g. Caf1 [Krude, 1995]), and other remodelers such

47 as ISWI are necessary later in S-phase during the replication of heterochromatin [Collins et al., 2002]. In higher eukaryotes replication origins also seem to localize to nucleosome depleted regions (NDRs) [Cadoret et al., 2008]. In D. melanogaster, approxi- mately two thirds of ORC binding sites are near the open chromatin of a promoter [MacAlpine et al., 2010]. All of the ORC binding sites, including those that aren’t at TSSs, share a local enrichment for nucleosome depleted regions and active histone turnover and exchange [MacAlpine et al., 2010; Deal et al., 2010]. This implies that without sequence determinants for metazoan ORC binding, ORC will prefer to interact with open, active chromatin, although much more work is required to prove that such a chromatin environment is a determinant rather than a side-effect.

1.3.2 Chromatin features and origin localization

The role of physical DNA accessibility due to nucleosomes was examined in the pre- vious section, and a further tunable parameter of DNA compaction is the chromatin state of a given locus. It would therefore be logical to suppose that chromatin state plays a role in origin localization. For example, since ORC localizes to open chro- matin, and especially to active promoter regions, it could be that ORC favors the stereotypical features associated with open chromatin (histone hyper-acetylation) or the typical promoter chromatin signature (e.g. H3K4me3, H3K9ac) [Kouzarides, 2007]. Investigations along these lines are the subject of Chapter 3 (though not com- prehensive, they represent tentative steps toward a chromatin signature for origin localization). The role of chromatin features in localization have been investi- gated in some single loci studies. For example, the D. melanogaster chorion locus is amplified 100x in follicle cells, in a process that depends on ORC and pre-RC for- mation. The chorion locus hosts a replication origin which is responsible for this

48 local amplification (discussed above). Perhaps due to heavy transcription in this area, the region is hyperacetylated [Aggarwal and Calvi, 2004; Hartl et al., 2007]. However, in the absence of Rpd3, a histone deacetylase, the rest of the genome accumulates histone acetylation marks and, in turn, replication begins to initiate non-specifically genome-wide rather than only at the chorion locus. This implies that local histone acetylation is a determinant of origin localization. Examples do exist of chromatin modifications regulating pre-RC formation di- rectly. In yeast, the Sir2p histone deacetylase has been shown to negatively regu- late origin assembly [Pappas et al., 2004]. In a cdc6-4 temperature sensitive mu- tant, a SIR2 deletion was found to rescue origin function both in vitro and in vivo. It has also been suggested that those origins which responded to the SIR2 dele- tion have a unique structure [Crampton et al., 2008], implying that they have intrinsic properties which make them tunable through acetylation. It has also been shown that HBO1, a histone acetyltransferases, is critical for origin local- ization in human cells, as it binds directly to Cdt1 and promotes MCM association [Miotto and Struhl, 2010], implying that MCM has a preferred acetylated histone substrate. This raises the possibility that human pre-RCs interact with proximal nucleosomes as they do in yeast [Lipford and Bell, 2001]. Additionally, it has re- cently been demonstrated that a cell–cycle regulated methylation mark may also play a role in pre-RC formation, as tethering the histone H4K20 methyltransferase PR-Set7 (which confers the mark H3K20me1) to a specific non-origin locus pro- motes pre-RC formation at that locus [Tardat et al., 2010], although more work is necessary to characterize this phenomena. Perhaps the best method currently available for assessing wholesale whether the chromatin signature surrounding areas of ORC binding are significant is to assay whether a computational segmentation of chromatin data with no prior informa- tion about the replication program will uncover a state that corresponds to ORC

49 localization. This sort of segmentation has recently been attempted by a number of groups [Kharchenko et al., 2010; modENCODE Consortium et al., 2010; Ernst and Kellis, 2010; Filion et al., 2010; Hon et al., 2008] using either chromatin marks or chromatin binding proteins. Three of these investigations were in D. melanogaster, affording a chance to compare state delineations to the ORC localization data pro- duced by the MacAlpine lab [MacAlpine et al., 2010]. All three methods had similar results with respect to replication localization. The states in which ORC tends to be enriched tend to be enriched for euchromatic stretches with active histone turnover and transcription start sites. Filion and colleagues also reported an enrichment of Caf-1 and GAGA factor, two chromatin remodelers, in the state which tends to be enriched for ORC [Filion et al., 2010]. That these automated segmentation algo- rithms found signatures that were enriched for ORC is promising, but the fact that none of them uncovered a state that was deterministic for ORC means that there is still more work to be done.

1.3.3 Chromatin features and origin initiation

Given the lack of sequence determinants specifying origin initiation times during S-phase, chromatin is a good candidate for a method to tune replication timing (for a discussion of other methods, see above, section 1.1.2). Replication and gene transcription are intrinsically linked, as, in general, euchromatin replicates early and heterochromatin replicates late [Gilbert et al., 1962] (less some origin- specific counter-examples), and the genes within early replication domains tend to be highly expressed [MacAlpine et al., 2004]. However, it has been suggested that rather than early replication timing being determined by transcriptional activity, it is connected instead to domains that are devoted to active transcription [Danis et al., 2004; Lunyak et al., 2002]. If replication timing is thus connected to domains permissive of active tran-

50 scription, then it would be logical to suppose that the chromatin marks associated with active transcription [reviewed in e.g. Kouzarides, 2007; Zhou et al., 2011] would correlate well with replication timing and that, indeed, there may be a func- tional link between the two. This correlation has been confirmed by members of the ENCODE consortium, who detected positive correlations between replication timing and the activating chromatin marks H3K4me2, H3K4me3 and H3/H4Ac, whereas facultative heterochromatic marks such H3K27me3 were anti-correlated with replication timing. These results have been echoed in a number of other meta- zoan studies [e.g. Hiratani et al., 2008; Schwaiger et al., 2009; Lee et al., 2010]. Even yeast, which don’t share the replication timing domain correlation with active transcription, exhibit a mechanistic connection between origin activation times and chromatin state. For example, the origins proximal to the heterochro- matic telomeres are typically late replicating or silenced by their chromosomal context [Ferguson and Fangman, 1992], specifically the chromatin environment [Stevenson and Gottschling, 1999]. However, by removing a late replication origin from its chromosomal context and inserting it into a more transcriptionally permis- sive segment of the genome, the origin could be made early replicating [Ferguson and Fangman, 1992]. This mechanistic association goes the other way as well, as taking canonically early replicating ARS1 and inserting it into an area known to replicate late, the replication activation time of ARS1 was significantly delayed [Friedman et al., 1996]. It has been shown that histone acetylation in particular can have an effect on origin activation time. Rpd3 is the primary histone deacetylase (HDAC) in yeast, and it has been demonstrated that through its deacetylation activities it effects the replication activation time of origins genome-wide. Deleting Rpd3 is sufficient to speed up S-phase, due to an earlier association of Cdc45 with the replication ma- chinery at the pre-RC [Vogelauer et al., 2002]. In the same study it was shown

51 that site-specific tethering of Gcn4, a histone acetyltransferase (HAT), could speed the firing times of canonically late origins. Taking a genome-wide approach, it was shown that deletion of Rpd3L, the larger of the two S. cerevisiae Rpd3 complexes, resulted in approximately 100 canonically late origins activating earlier in S-phase [Knott et al., 2009]. Thus there is a strong association between histone acety- lation state, where increased histone acetylation generally promotes more open chromatin, and origin firing times. Metazoans also exhibit a connection between replication initiation times and local histone acetylation. Using a plasmid in an X. lavis in vitro replication as- say, it was demonstrated that efficient replication resulted from the introduction of transcriptional activator. The activator did not initiate transcription but it did presumably set up a chromatin environment permissive of transcription, and that in turn promoted early replication [Danis et al., 2004]. The human β-globin locus can similarly have its replication timing modulated by local acetylation changes without effecting transcription [Goren et al., 2008]. Finally, in D. melanogaster, the replication initiation timing of the chorion locus (discussed above) shows sen- sitivity to changes in local acetylation brought about by tethering a HAT to the area [Aggarwal and Calvi, 2004]. Taken together these experiments imply a ubiquitous control of replication initiation times by local histone acetylation, albeit in in vitro or single loci studies. Finally, since chromatin needs to be reconstituted during S-phase, replication could provide a window of opportunity for the cell to reset or alter the chromatin environment. At least half of the nucleosomes incorporated into the daughter strands will be newly synthesized, and presumably all in an identical basally modi- fied state. In fact, recent data has suggested that histone demethylases work syner- gistically with the DNA replication program to wipe clean a mark that is no longer necessitated by the cell [Radman-Livaja et al., 2010]. Specifically, arresting yeast

52 cells in alpha-factor incorporates H3K4me3 at certain loci throughout the genome. When released from alpha-factor, these marks are no longer needed. Over the course of the subsequent S-phase, the marks were erased, and deletion of the hi- stone demethylase stalled but did not stop the erasure. This agrees with other results that imply that the cell uses S-phase and its wholesale insertion of new histone octamers to affect chromatin changes, or to reinforce existing chromatin environments. For example, in experiments described above in section 1.1.2, the Cedar group showed that microinjecting a plasmid into a cell during different parts of S-phase would alter the chromatin marks associated with the plasmid [Lande- Diner et al., 2009; Zhang et al., 2002a]. Thus, the connection between replication and chromatin may not be a one-way street, and could lead to a positive feedback loop with each subsequent S-phase being an opportunity for the cell to adjust the chromatin environment if need be.

1.4 Thesis roadmap

The following chapters of this thesis will center around the chromatin specification of the DNA replication programs of S. cerevisiae and D. melanogaster. Since all of these were collaborative works, I will use this section of the introduction to explain the division of labor, so as not to later construe the work of others as my own. Chapter 2 is an investigation into the specification of replication origin localiza- tion by the physical chromatin environment of S. cerevisiae. I initially discovered the pattern in nucleosome positioning at origins of replication using the nucleo- some data from [Lee et al., 2007] and the ACS calls from [Xu et al., 2006]. To further the investigation we then began a fruitful collaboration with the lab of Stephen Bell at MIT. The in vivo MNase-seq and ChIP-seq data was generated by Kiki Galani, and the in vitro experiments were performed by Sukhyun Kang. I co-

53 ordinated the project, built the bioinformatic pipeline to process and store the data generated, performed the analysis of the data, and wrote the primary draft of the paper. This work was published in Genes and Development in April 2010 [Eaton et al., 2010], and chapter 2 is a reproduction of that work with permission from Genes and Development. Chapter 3 is an investigation into the possible role of chromatin in the spec- ification of replication origin localization and replication origin activation. The replication data was generated in the MacAlpine lab by Heather MacAlpine. The chromatin data was generated by Gary Karpen’s group, the expression data by Sue Celniker’s group, and the nucleosome dynamics data by Steve Henikoff’s group, all of which is freely available through modENCODE. David MacAlpine and I built the bioinformatics pipeline to process and store the data from the various replication experiments. I performed the analyses in collaboration with Joseph Prinz. In par- ticular Joseph did the analysis associated with figure 1 and a number of the supple- mental figures, and I performed the analysis behind figures 2 and 4 and the rest of the supplemental figures. I supervised George Tretyakov in the analysis for figure 3. Peter Kharchenko helped me to build a pipeline to process the Karpen group’s data, and provided useful advice. Finally, I wrote the primary draft of the Genome Research paper. This work was published in Science in 2010 and Genome Research in 2011 [modENCODE Consortium et al., 2010; Eaton et al., 2011, respectively]. The Science paper was a publication of the modENCODE analysis working group (AWG), and contained vignettes of conclusions attained by integrating the data of the various data production groups together. My role in the AWG was to represent DNA replication in roundtable discussions, and to integrate DNA replication with the data produced by the other labs. I was responsible for figures 3b, 3c, and sup- plemental figures S5 and S9, as well as contributing to figure 5. I also contributed text to the chromatin section as well as the supporting methods. Chapter 3 is the

54 result of a combination of both papers cited above, with permission from Genome Research and Science. Appendix A details a work in progress on using a hidden Markov model to classify copy number throughout various Drosophila cell lines, and relate the large- scale copy number changes to the specific biology of the cell lines. The data was generated by Sara Powell in the MacAlpine lab, and I am performing all of the analysis with advice from Alex Hartemink. Although they will not be discussed in this work, the analysis in Chapter 2 has seen great interest from the replication community, and so has spawned a number of collaborations. My contributions to these works were interpretation of nucle- osome related data, as well as analysis of new data sets. With the Fox lab at the University of Wisconsin, Madison, we investigated nucleosome interaction with the S. cerevisiae BAH1 domain of Orc1, showing that some origins were sensitive to a mutation in this domain, and that this class of origins had a distinctive nucleo- some configuration [Muller¨ et al., 2010]. With the Bell lab we are working with high-throughput data describing ORC and MCM binding in S. cerevisiae to iden- tify the roles of the both nucleosomes and sequence at yeast origins in loading the MCM onto the DNA [Lam et al., 2011]. Finally, with the Zhang lab at the Mayo Clinic we are currently working to elucidate the role of a chromatin remodeler in specifying nucleosome positions genome-wide. I have also engaged in a number of non-nucleosome-specific projects. I have contributed a transcription factor binding HOT-spot calling methodology and analysis to a comprehensive description of the cis-regulatory landscape of D. melanogaster by the White lab at the University of Chicago, which was recently published in Nature [N`egre et al., 2011]. With the Orr-Weaver lab at the Whitehead Institute we are investigating the nature of the replication program in under-replicated regions of salivary gland D. melanogaster tissue, a project to which I have contributed analysis of ChIP-seq data as well as

55 integrative analyses [Sher et al., 2011]. Finally, I have aided the McDonnell lab of Duke University with two expression microarray related projects, with purely bioinformatic contributions [Wade et al., 2010; Dwyer et al., 2010].

56 2

Conserved Nucleosome Positioning Defines Replication Origins in S. cerevisiae

2.1 Introduction

Initiation of DNA replication occurs at multiple genomic loci termed origins of replication. In S. cerevisiae, replication origins were originally identified as short (150 bp) autonomously replicating sequence (ARS) elements that were sufficient for the maintenance of episomes. The origin recognition complex (ORC) binds the ARS consensus sequence (ACS), an 11 bp T-rich sequence that is necessary but not sufficient for origin activity. During G1, ORC coordinates the recruitment of several additional replication factors to load the replicative DNA helicase, the Mcm2-7 complex, onto origin DNA to form the pre-replicative complex (pre-RC) [reviewed in Sclafani and Holzen, 2007]. The S. cerevisiae genome contains, by various metrics, 6000–40,000 potential ACS sequence matches, of which only a few hundred are specifically bound by ORC. Although active transcription of a sequence has been shown to prevent ORC

This chapter was originally published as: Eaton ML, Galani K, Kang S, Bell SP, and MacAlpine DM (2010), “Conserved nucleosome positioning defines replication origins,” Genes Dev, 24(8), 748-53.

57 binding and pre-RC formation [Mori and Shirahige, 2007], the large majority of potential ACS matches are intergenic suggesting that additional chromosomal fea- tures are required to define the subset of these sites that are bound by ORC and act as replication origins. All cellular events involving genomic DNA must operate within their chromo- somal context. Nucleosomes are the most basic elements of chromatin structure. Nearly 80% of S. cerevisiae DNA is incorporated into stable nucleosomesand their position relative to regulatory elements is also a critical component of gene reg- ulation. Significant regions of the genome are not in complex with nucleosomes and are referred to as nucleosome free regions (NFRs). NFRs represent particularly accessible parts of the genome that are frequently the site of multi-protein assem- blies that regulate or perform key DNA-templated processes [reviewed in Rando and Chang, 2009]. The DNA replication program has been shown to be regulated by the local chro- matin environment. Although progress has been made in establishing that chro- matin modifications impact the activation of replication origins [Vogelauer et al., 2002; Knott et al., 2009], it has also been shown that nucleosome positioning is critical for origin function. Early nucleosome mapping experiments on a plasmid containing the ARS1 origin revealed that the ACS needed to be nucleosome free, presumably to facilitate ORC binding [Simpson, 1990]. Studies of nucleosome po- sitioning at the endogenous ARS1 locus confirmed that the ACS was nucleosome free and revealed that the position of the nucleosome adjacent to the ACS was im- portant for origin function; moving this nucleosome farther from the ACS did not interfere with ORC binding, but did inhibit pre-RC formation [Lipford and Bell, 2001]. Thus, at ARS1, the precise position of local nucleosomes facilitates ORC binding and helicase loading. Here, we have used high-throughput sequencing to assess the organization of nucleosomes at ORC binding sites throughout the yeast

58 genome. Our results indicate that sequence defined NFRs and ORC-dependent nu- cleosome positioning are critical determinants of replication origins in S. cerevisiae.

2.2 Results

2.2.1 Identification of functional ACSs

In the S. cerevisiae genome, there are approximately 250–350 ORC binding sites that function as origins of replication [Wyrick et al., 2001; Xu et al., 2006]. Mul- tiple genomic studies have utilized chromatin immunoprecipitation and genomic microarrays (ChIP-chip) to identify ORC binding sites and refine the ACS motif. De- spite these efforts, the degeneracy of the ACS [Breier et al., 2004] and the limited resolution of the genomic array data have made it difficult to identify the specific ACSs bound by ORC. For example, a recent study used ChIP-chip data for both the ORC and Mcm2-7 complex to identify 396 ORC binding sites, but assigned multiple potential ACSs to 86 of those peaks [Xu et al., 2006]. To identify functional ACS motif matches more precisely, we first used chro- matin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) to identify sites of ORC localization across the yeast genome (Fig. 2.1A). We iden- tified 267 peaks of ORC binding throughout the genome (blue triangles). Although this represents 129 fewer ORC binding sites than detected by a prior study [green triangles; Xu et al., 2006], 258 of the 267 areas of ORC enrichment we identi- fied overlapped with 241 previously identified ORC binding sites (Fig. 2.1B). The reduced number of ORC binding sites was due to the very stringent criteria we imposed to minimize the potential for false positives. We implemented a scoring system to identify the single most likely ACS associ- ated with each ORC ChIP-seq peak, which we termed the ORC-ACS. Our analysis identified 253 likely ORC binding locations at nucleotide resolution. The resulting

59 FIGURE 2.1: Precise localization of yeast origins. (A) Histogram of ORC enriched sequence tags for chromosome XIV from G2 arrested cells. ORC binding sites iden- tified in this study (blue) or a prior study [red; Xu et al., 2006] are indicated by triangles. (B) Venn diagram of the overlap between ORC peaks identified in this study (blue) and by Xu et al.[2006] (green). (C) The position-weight matrix of the ORC binding consensus built from the ORC-ACS set. Each nucleotide is repre- sented by a discrete color (red (T), orange (G), blue (C) and green (A)). consensus sequence (Fig. 2.1C) closely matches previously published ACS motifs [Breier et al., 2004; Xu et al., 2006]. The prior study from Xu and colleagues iden- tified 504 potential ACS matches at 387 ORC/Mcm2-7 binding sites. Comparison of the sites where both studies identified a common area of ORC enrichment (SFig. 1), revealed that in the majority (67%) of instances we agreed on the ACS call, while in the remaining instances we identified a novel ORC-ACS site.

60 396

155

241/258

9 267

FIGURE 2.2: Venn diagram of the overlap between ORC-ACS and nim-ACS [Xu et al., 2006] sites. Of the 258 ORC peaks identified in this study that overlapped with the ORC peaks from [Xu et al., 2006] (see Methods), 13 were not considered in this graph for the following reasons: 3 did not have an associated ACS with a log odds score greater than 4 (see Methods) and 10 were mapped to different ACSs between replicates.

2.2.2 Nucleosome positioning at origins of replication

The precise definition of ORC binding sites provided by ChIP-seq allowed us to investigate the position of nucleosomes flanking origins of replication. Previous genome-wide studies of nucleosome positioning have suggested that S. cerevisiae origins are depleted of bulk nucleosomes [Albert et al., 2007; Field et al., 2008; Mavrich et al., 2008a]; however, these studies focused on previously annotated ARS elements (of approx. 500 bp) rather than ORC’s strand-oriented binding site. This would be equivalent to analyzing the position of nucleosomes flanking pro- moters without considering the transcription start site (TSS) or the direction of transcription. Precise nucleosome positioning relative to an ORC binding site has only been established at a few origins by locus specific nucleosome mapping [Simp- son, 1990; Lipford and Bell, 2001]. Thus, we used the precisely defined ORC- binding sites described above and high-resolution mapping of nucleosomes across the yeast genome to comprehensively address the localization of nucleosomes rel-

61 ative to ORC binding at origins. We identified and mapped the location of 67,500 nucleosomes from an asyn- chronous yeast population using high-throughput sequencing of mononucleosomal fragments. Our nucleosome mapping recapitulated previous findings demonstrat- ing that nucleosomes are strongly positioned around transcription start sites (TSSs) citepYLD+05,AMT+07,WRD+07,LTB+07,SBZ+08. Nucleosome scores relative to the TSS for all genes were plotted as a heatmap (Fig. 2.3A; top). Positions fully occupied by a nucleosome are white while those completely lacking a nucleosome are black. A NFR is readily apparent immediately upstream of the TSS as well as linker sequence between the nucleosomes. A composite pattern of the average nucleosome signal was also generated (Fig. 2.3A; bottom). To examine the nucleosomal landscape at replication origins, we aligned the 219 members of the ORC-ACS set that were within non-repetitive regions and were not within 10 kb of a chromosome end. We aligned the origins using the T- rich strand of the ORC-ACS and will subsequently define sequences 51 of the T-rich strand of the ORC-ACS as ‘upstream’ and sequences 31 of the T-rich strand as ‘down- stream’. Similarly, we will refer to the first nucleosome upstream of the ORC-ACS as the -1 nucleosome and the first nucleosome downstream of the ORC-ACS as the +1 nucleosome. Strikingly, we detected a pattern of nucleosome occupancy surrounding the ORC-ACS that is very similar to that of the open promoters (Fig. 2.3B). The ACS is located asymmetrically within an approximately 125 bp NFR. The composite NFR is 90 bp larger than ORC’s DNase I footprint, suggesting that ORC binding to the ACS is not solely responsible for maintaining the NFR. As with the nucleosomes at promoters, both the -1 and +1 nucleosomes are stably positioned and set up an adjacent periodic pattern of nucleosome occupancy. Finally, the NFR is asymmetric with respect to the ACS (Figs. 2.3B, 2.7A), extending much farther downstream

62 FIGURE 2.3: Replication origins are associated with an asymmetric NFR flanked by well-positioned nucleosomes. (A) Heatmap of nucleosome occupancy and average nucleosome signal from asynchronous cells for transcription start sites (TSS). Inter- preted nucleosome positions are represented as ovals. (B) Nucleosome occupancy at 219 ORC-ACS sites. (C) Nucleosome occupancy at 238 nr-ACS sites. Vertical dashed lines represent the TSS or center of the 33 bp ACS motif match. All ACS matches are oriented relative to the T-rich strand and aligned at position 0. than upstream of the ACS. The T-rich nature of the ACS and its proximity to the -1 nucleosome suggested that its sequence might directly participate in the positioning of this nucleosome. To explore the potential role of the ACS and surrounding sequences in encoding an NFR, we examined the 238 highest-scoring ACS motifs that were not within genes, telomeres, rDNA, or proximal to any ORC binding sites identified in this

63 A

B

FIGURE 2.4: ORC-ACS and nr-ACS motif comparison. ORC-ACS and nr-ACS sites have nearly identical motifs. A comparison of the sequence motif derived from the (A) ORC-ACS and (B) nr-ACS sets. or prior studies [Xu et al., 2006]. The sequence composition of these 238 ACS motif matches, which we call non-replicative ACS (nr-ACS) sites, is nearly indis- tinguishable from the ORC-ACS set (Fig. 2.4). Despite their sequence similarity, the nr-ACSs show a markedly different nucleosomal organization (Fig. 2.3C). Al- though the nr-ACS motif match tends to be nucleosome free—not surprising for regions with such a high A/T composition—the nucleosome depletion is weaker and is centered on the ACS instead of extending downstream from the ACS as seen for the ORC-ACS set. Finally, the adjacent upstream and downstream nucleosomes lack the precise periodic positioning found at the ORC-ACS sites. There were a number of sites where the higher resolution of our ORC ChIP-seq experiments led us to identify a different ORC-ACS as the likely ORC binding site

64 FIGURE 2.5: Comparison of nucleosome positioning at ORC-ACS only sites (A: bottom, B: red), nim-ACS only sites (A: middle, B: blue), and those discovered using both methods (A: top, B: black). compared to a prior ChIP-chip study [Fig 2.5; Xu et al., 2006]. For the ACSs jointly identified in both studies, we consistently found a pattern of asymmetric, well-positioned nucleosomes (Fig. 2.5). For the sites in which our findings dis- agreed with the previous study, we found that the ORC-ACS sites identified in this study more closely resembled the nucleosome profile of the jointly identified ACSs. Assuming the pattern of positioned nucleosomes we identified are a hallmark of yeast replication origins, which is well justified based on the sites that both stud- ies agreed upon, these data further underscore the low false positive rate of our ORC-ACS identification.

2.2.3 Nucleosome depletion at origins is encoded by sequence

S. cerevisiae exhibits remarkable specificity in origin selection, despite the potential

65 for ORC to interact with many genomic ACS matches. This selectivity could be the result of a subset of potential ORC binding sites being obscured by nucleosomes while others remain accessible. If so, we might expect that the wide, asymmetric NFR will not depend on ORC, but rather be encoded in the origin DNA sequence. To explore the role of primary DNA sequence on specifying the NFR and potenti- ating ORC binding, we examined the distribution of in vitro nucleosomes surround- ing both ORC-ACS and nr-ACS sites in the absence of competing trans-acting fac- tors. To this end, we examined the distribution of nucleosomes assembled in vitro on purified S. cerevisiae DNA [Kaplan et al., 2009] at ORC-ACS sites. A distinct NFR was maintained at the ORC-ACS sites in the absence of ORC or any other trans-act- ing factors (Fig. 2.6A). Reminiscent of our in vivo nucleosome positioning data, we found that the NFR was asymmetrically positioned over the ORC-ACS, extend- ing downstream from the ACS. Similarly, the in vitro NFR at nr-ACS matches also recapitulated the in vivo nucleosome mapping results with a weak NFR centered on the nr-ACS (Fig. 2.6B). Despite maintaining an asymmetric NFR at ORC-ACS sites in the absence of trans-acting factors, the precise and periodic positioning of adjacent nucleosomes was not observed in the in vitro data. These findings imply that nucleosome excluding signals are encoded in the sequence of the ORC-ACS set that do not exist in the nr-ACS set but that additional trans-acting factors (e.g. ORC) are required for the precise and ordered nucleosome positioning found at origins. In yeast, the ACS and additional downstream modular sequence elements (B- elements) contribute to origin function (Fig. 2.7A). Unlike the ACS, the B-elements are poorly defined with minimal conservation of sequence or spacing between dif- ferent origins [reviewed in Bell, 1995]. A prior study found that the sequences downstream (50–100 bp) of the T-rich ACS were generally enriched for adeno- sine bases with no specific motifs identified [Breier et al., 2004]. We posited that

66 FIGURE 2.6: The NFR at origins is encoded by primary sequence. (A) Heatmap of in vitro nucleosome occupancy at ORC-ACS sites. [Kaplan et al., 2009] (B) In vitro nucleosome occupancy at nr-ACS sites. Vertical dashed lines represent center of the 33 bp ACS motif match. All ACS matches are oriented relative to the T-rich strand and aligned at position 0. the A-rich sequences downstream of the ACS may prevent the +1 nucleosome from encroaching on origin sequences. Consecutive 4–6 bp islands of A’s constitute a strong nucleosome excluding signal [reviewed in Segal and Widom, 2009], as poly-dA:dT islands resist bending around the histone octamer. We re-examined the sequence composition of replication origins to determine if A-islands contributed to the asymmetric nature of the ORC-ACS-associated NFR. To visualize differences in base composition, we color coded each genomic position; shades of red (T-islands) or green (A-islands) represent the bias in surrounding nucleotide composition. At both ORC-ACS (Fig. 2.7B) and nr-ACS (Fig. 2.7C) sites, the T-rich ACS match is visible as a red vertical line. Interestingly, we found a strong concentration of

67 green A-rich islands near the downstream edge of the NFR that was present only in the ORC-ACS set (Figs. 2.7B& 2.7C). We displayed these differences graphically by plotting a kernel density estimation of A- or T-islands as a function of distance from the ACS (Figs. 2.7E& 2.7F). The A-rich islands downstream of the ACS appear to be a determinant of ORC binding as they are specific to the ORC-ACS set. These A-rich sequence elements are well beyond the known ORC footprint [Bell and Stillman, 1992] and our ChIP-seq data do not support a robust interaction of ORC with these sites. Instead, we propose that these sequences maintain the large asymmetric NFR and thereby facilitate ORC binding. Consistent with this hypothesis, the density and position of these A-islands varies with the width of the NFR (Fig. 2.8). These findings strongly suggest that the large asymmetric NFR present at bona fide origins of replication is required for ORC association with origins in vivo and is not simply a consequence of ORC binding. It is interesting that start sites of bi-directional DNA replication have strongly polar sequence and chromatin compositions. If the role of the A-rich islands was simply to exclude nucleosomes downstream from the ACS, we would expect that T-rich islands could also serve a similar function. Perhaps the A-/T-rich polarity of the ACS and surrounding sequence elements is critical for transient long-range ORC interactions or possibly pre-RC assembly. We and others have noted that a similar polar bias in A- and T-rich sequence elements is found at the promoters of active genes residing in an open chromatin environment [Maicas and Friesen, 1990] (Fig. 2.7D& 2.7G). This sharp transition occurs in the NFR and does not appear to be related to the bias in AA or TT dinucleotides observed flanking the dyads of the +1 and -1 nucleosome [Mavrich et al., 2008a]. Instead, we propose that the polar transition from A- to T-rich sequences at both promoters and origins promotes the DNA unwinding that occurs at both sites.

68 FIGURE 2.7: Sequence polarity at replication origins. (A) Schematic of nucleosome positioning at the prototypical ARS1 replication origin. The position of the ACS and B-elements relative to the adjacent nucleosomes is indicated. (B) Heatmap of nu- cleosome occupancy from asynchronous cells at the ORC-ACS sites. Red and green dots represent the bias in nucleotide composition (A-rich or T-rich, respectively) at each sequence (see methods). (C) Nucleosome occupancy and sequence bias for the nr-ACS set. (D) Nucleosome occupancy and sequence bias at open chromatin promoters. (E, F and G) A kernel density estimation of windows with strong A and T enrichment around the ORC-ACS, nr-ACS and promoters, respectively.

69 Dyad < 154bp downstream A Dyad 154bp − 180bp downstream T Dyad > 180bp downstream 6 4 2 NT enrichment density 0 −400 −200 0 200 400 Position relative to ORC−ACS

FIGURE 2.8: A-rich islands correlate with the position of the downstream nucleo- some. A- and T-island nucleotide density (green and red, respectively) are shown by distance to the downstream nucleosome dyad in the ORC-ACS set. Origins were divided into three bins of roughly equal size, based on the distance from the identified ACS to the dyad of the first downstream nucleosome. A kernel density estimation was calculated by noting the position of all A- and T-islands (5 or 6 As or Ts in a 6bp window) relative to the first base of the ORC-ACS. The vertical line represents the center of the 33bp ACS motif.

2.2.4 ORC is necessary and sufficient for precise nucleosome positioning

Positioning of nucleosomes at ARS1 appeared to be dependent on ORC as muta- tions of the ACS permitted the -1 nucleosome to encroach into the NFR in vitro [Lipford and Bell, 2001]. Because sequence is a determinant of nucleosome occu- pancy [Segal et al., 2006] and the ACS sequence is predicted to be resistant to nu- cleosome occupation, we re-evaluated the role of ORC in nucleosome positioning at origins of replication across the genome. To this end, we repeated the genome- wide nucleosome positioning assay in both a wild-type strain and a congenic strain with a temperature sensitive ORC mutation, -161, at the restrictive tempera- ture. Consistent with previous ChIP studies [Aparicio et al., 1997], the temperature

70 sensitive mutation in Orc1p eliminated DNA binding at 37°C (Fig. 2.9). Biological replicates of each strain were arrested at the G2/M border and subsequently raised to the non-permissive temperature for 2 hours before isolating and sequencing the mononucleosomal fragments.

FIGURE 2.9: Orc1-161 is defective for origin binding at the non-permissive tem- perature. Locus specific ORC ChIP at ARS305 and URA3 was performed at the per- missive (23°C) and restrictive (37°C) temperature in both a wild-type and mutant orc1-161 strain. ORC dissociates from origin DNA at the restrictive temperature in the orc1-161 strain. Figure created by collaborator Kiki Galani.

A comparison of the nucleosomal distribution around the set of ORC-ACSs in wild-type and orc1-161 cells revealed a clear effect of ORC on the positioning of the -1 and +1 nucleosomes (Fig. 2.10). At the non-permissive temperature the precise positioning of the nucleosomes around the ORC-ACS was lost (Fig. 2.10B), although a smaller NFR was maintained (consistent with the in vitro nucleosome studies above). In contrast, the wild-type sample at the non-permissive tempera- ture (Fig. 2.10A) was nearly indistinguishable from the profile generated at 23°C (Figs. 2.3B; 2.11A). To more easily view the differences in nucleosome position- ing, we plotted the density of the wild-type (black) and orc1-161 (red) nucleosome dyads as a function of the distance from the ORC-ACS (Fig. 2.10C). In the absence of ORC binding, the nucleosomes on both sides of the ORC-ACS encroached into the NFR and the adjacent nucleosomes lost their precise positioning. Thus, despite

71 the substantial distance between the ACS and the +1 nucleosomes, ORC binding strongly influenced the position of this nucleosome and its near neighbors. The nu- cleosome pattern at TSSs showed no change between the wild-type and the mutant samples, indicating that loss of ORC function did not result in a global remodeling of nucleosomes (Fig. 2.11).

FIGURE 2.10: ORC is necessary and sufficient for precise nucleosome positioning. (A) Heatmap of nucleosome occupancy around ORC-ACS sites in wild-type cells arrested in G2 at 37°C. (B) Heatmap of nucleosome occupancy around ORC-ACS sites in orc1-161 cells arrested in G2 at 37°C. (C) Density of wild-type (black) and orc1-161 (red) nucleosome dyads surrounding ORC-ACS sites arrested in G2 at 37°C. (D) The density of nucleosome dyads assembled in vitro at ARS1 in the presence or absence of ORC.

We were able to recapitulate the ORC-dependent positioning of nucleosomes

72 A

WT Asynchronous 23C WT G2 37C Rep1 WT G2 37C Rep2 WT G2 37C Combined MT G2 37C Rep1 0.012 MT G2 37C Rep2 MT G2 37C Combined Kaplan et al. 0.008 Density 0.004 0.000 −400 −200 0 200 400 Position relative to ORC−ACS

B

WT Asynchronous 23C WT G2 37C Rep1 WT G2 37C Rep2 WT G2 37C Combined MT G2 37C Rep1 0.012 MT G2 37C Rep2 MT G2 37C Combined Kaplan et al. 0.008 Density 0.004 0.000 −400 −200 0 200 400 Position Relative to TSS

FIGURE 2.11: Comparison of nucleosome dyad locations across mnase-seq exper- iments. WT=wildtype and MT=orc1-161 mutant. Replicates (dotted lines) were combined (solid lines) due to their high concordance. Additionally, the wildtype asynchronous experiment at 23°C (purple) is indistinguishable from the wildtype G2 (nocodazole arrested) experiments at 37°C (green). (A) Nucleosome dyad den- sity around the ORC-ACS set. The density represents the likelihood of finding a nucleosome dyad in a given location relative to the ACS. Vertical line indicates the center of the 33bp ACS motif. (B) Nucleosome dyad density relative to the transcription start sites from [David et al., 2006]. adjacent to origin DNA using purified proteins. We assembled nucleosomes on a plasmid containing the ARS1 origin of replication using the four purified S. cere- visiae histones and the ISWI nucleosome remodeling complex [Vary et al., 2004]. Nucleosomes were then mapped by micrococcal nuclease digestion followed by high-throughput sequencing of the resulting mononucleosome associated DNA. When this reaction was performed in the absence of ORC, we observed no precise positioning of nucleosomes around the ACS. Importantly, when nucleosomes were

73 assembled on the plasmid in the presence of ORC, we observed clear positioning of the -1 and +1 nucleosomes as well as adjacent nucleosomes on each side (Fig. 2.10D). Thus, ORC alone (in the context of a nucleosome remodeling complex) is capable of positioning nucleosomes on either side of ARS1. Previous in vitro studies using chicken histones and a Drosophila embryo extract to assemble nucle- osomes found that Abf1p positioned the +1 nucleosome at ARS1 [Lipford and Bell, 2001]. Consistent with these findings, we found that the addition of Abf1p to the purified reaction shifted the position of the +1 nucleosome relative to ORC alone (Fig. 2.12A) almost perfectly re-capitulating the nucleosome positioning observed in vivo at ARS1 2.12B). Thus at ARS1, ORC is sufficient to position nucleosomes in a manner analogous to that seen at most origins but Abf1p acts to supersede this function at the +1 nucleosome. Importantly, most origins are not associated with a DNA binding factor at the downstream position suggesting that, consistent with the orc1-161 nucleosome mapping, ORC is primarily responsible for positioning nucleosomes on both sides of the origin.

2.3 Discussion

Our findings show that ORC plays a key role in positioning nucleosomes on both sides of the origin NFR. This has strong implications for ORC’s interaction with the origin. The close proximity of the ACS to the -1 nucleosome strongly suggests that ORC DNA binding positions this nucleosome by DNA exclusion or by a direct interaction of the -1 nucleosome with ORC [Zhang et al., 2002b] with ORC. In contrast, the median distance from the ORC binding site to the +1 nucleosome is 60 bp from the 31 edge of the known ORC footprint, yet ORC clearly influ- ences the positioning of this nucleosome. Because this positioning is also observed with the purified reconstituted reaction, ORC almost certainly acts directly on the

74 A

In vitro plasmid +ORC In vitro plasmid +ORC +ABF1 0.15 0.10 Dyad Density Dyad 0.05 0.00 −400 −200 0 200 400

Position relative to pUC19 ARS1 ACS B

0.12 In vivo ARS1 0.08 Dyad Density Dyad 0.04 0.00 −400 −200 0 200 400

Position relative to genomic ARS1 ACS

FIGURE 2.12: Abf1p can position the downstream nucleosome at ARS1. Abf1p supersedes ORC in positioning the downstream nucleosome at ARS1 in vitro.A comparison of nucleosome dyad densities on the plasmid (A) and in vivo at ARS1. (A) Reconstituted chromatin assembly in the presence of ORC or ORC and Abf1p. The addition of Abf1p recapitulates the positioning of the downstream nucleosome observed in vivo at ARS1. (B) In vivo nucleosome positions found at ARS1. The downstream nucleosome is positioned by Abf1p in vivo. nucleosomes. It is possible that this effect is mediated by ORC wrapping the ad- jacent A-rich DNA. Such a model is consistent with electron microscopy studies of ORC [Clarey et al., 2008] and DNAse I footprinting studies that show that ORC can induce a 10 bp periodicity of cleavage consistent with ORC wrapping DNA adjacent to the +1 nucleosome [Bell and Stillman, 1992; Speck et al., 2005]. Al-

75 ternatively, ORC could directly bind the +1 nucleosome without interacting with the intervening DNA. In either case, ORC’s interaction with the origin proximal DNA/nucleosomes is not as simple as previously thought and this more complex interaction may facilitate downstream events in DNA replication (e.g. by preparing the DNA for Mcm2-7 loading). Genome-wide location analysis of the chromatin re- modeler Ino80 revealed enrichment at a subset of origins [Shimada et al., 2008]. Although ISWI is sufficient to position nucleosomes in an ORC dependent manner in vitro, it is unclear if ISWI, Ino80 or another remodeling activity will facilitate the in vivo nucleosome positioning at origins of replication. The addition of Ino80 complex to the in vitro assembly reaction did not significantly change the pattern of nucleosomes (data not shown).

2.3.1 Model for ORC specificity

Our studies strongly support a model in which the characteristic nucleosome pat- tern surrounding origins of replication is generated in two steps (Fig. 2.13). First, the large and asymmetric NFR found at origins of replication is established in a DNA sequence-dependent manner and is associated with a subset of ACS matches that are bound by ORC. We propose that the asymmetry of the NFR with respect to the ACS provides ORC access to the ACS and the adjacent A-rich sequences. In contrast, ACS matches not occupied by ORC exhibit weak symmetric NFRs. To- gether, these data argue that the differential affinity for nucleosomes encoded by the primary sequence is a specificity factor for the initial recruitment of ORC to the origin. (Fig. 2.13A) In a second step, ORC bound at the origin acts to pre- cisely position the flanking upstream and downstream nucleosomes to establish the conserved nucleosomal signature evident at origins of replication (Fig. 2.13B).

76 ORC A

ACS AAAA

ACS

ACS B

ORC ACS AAAA

ACS

ACS

FIGURE 2.13: ORC chromatin binding model. At ORC-ACS sites, the nucleosomes are able to subsample the local area, but are generally excluded from the NFR by the A- and T-rich sequences found at the edges of the NFR. ORC preferentially binds these ACSs (A), and once bound, acts to position the -1 and +1 nucleosomes asymetrically with respsect to the ACS (B).

2.3.2 NFR architecture

The large size of the NFR at origins of replication is likely to provide space for the association of other replication factors in addition to ORC. The median origin NFR is approximately 90 bp larger than the ORC DNase I footprint, which based on the size of the archeal MCM complex [100 A;˚ Pape et al., 2003] would provide room for the assembly of 2-3 Mcm2-7 complexes. We considered the possibility that the size of the NFR at origins may be dynamic through out the cell cycle, possibly expanding upon loading of multiple Mcm2-7 complexes during G1 [Remus et al., 2009]. To test this hypothesis, we examined nucleosome positioning in G1 (pre-RCs present) and G2 (pre-RCs absent) cells (Fig. 2.14). We found no statistically significant differences in nucleosome organization at origins between G1 and G2 cell populations, suggesting that the size of the NFR is independent of pre-RC formation and Mcm2-7 loading. Instead, we hypothesize that the large NFR

77 that is predictive for ORC binding is also sufficient to allow subsequent Mcm2-7 loading. Consistent with this hypothesis, our analysis of the distribution of Mcm2-7 proteins relative to the ACS reveal that the bulk of the Mcm2-7 signal is distributed downstream of the ACS in the NFR (Lam et al., manuscript in preparation). It is interesting to consider that ORC may interact with the DNA in a manner that both positions nucleosomes and prepares the DNA for subsequent pre-RC assembly.

A

WT G2 37C Combined WT G1 37C 0.012 0.008 Density 0.004 0.000 −400 −200 0 200 400 Position relative to ORC−ACS

B

WT G2 37C Combined WT G1 37C 0.012 0.008 Density 0.004 0.000 −400 −200 0 200 400 Position Relative to TSS

FIGURE 2.14: The origin NFR is the same in G1 and G2. A comparison of nucleo- some dyad locations around (A) ORC-ACS sites and (B) TSSs in the combined G2 (nocodazole arrested) data set (green) and the G1 (alpha-factor arrested) experi- ment (blue). The center of the 33bp ACS motif (A) and the TSS (B) are indicated by the dashed vertical lines.

2.3.3 Connection to origin activity and extension to metazoans

The conserved positioning of nucleosomes at almost all ORC binding sites identi- fied by this study suggests that nucleosome organization is an essential and critical

78 determinant of origin function. Consistent with the hypothesis that precise nucle- osome positioning is essential for DNA replication, prior work at ARS1 has demon- strated that moving the -1 nucleosome further upstream from the ACS results in a loss of origin function and inhibition of Mcm2-7 loading [Lipford and Bell, 2001]. As almost all the replication origins identified in this study share a common nucleo- somal profile, it is not surprising that we found no correlation between nucleosome organization and origin efficiency or timing. Instead, we suggest that this nucleo- some organization at origins is a basal determinant required for any origin function and that other chromatin-associated features, such as histone modifications [Knott et al., 2009], control the efficiency and timing of replication initiation. Given that ORC exhibits little, if any, sequence specificity in higher eukaryotes [Vashee et al., 2003; Remus et al., 2004], we propose that nucleosome organization will also be a defining feature of ORC binding in metazoans [MacAlpine et al., 2010] and other organisms where ORC lacks the ability to recognize specific sequences.

2.4 Methods

2.4.1 Data deposition

All genomic data, ORC binding and ORC-ACS locations have been deposited at GEO with accession number GSE16926.

2.4.2 Yeast strains and growth conditions

Wild-type (W303) and mutant (orc1-161) yeast strains were grown in rich medium for all experiments. Cells were arrested at the G2/M transition by 10 µg/ml noco- dazole or at G1 with α-factor. For nucleosome positioning in the presence or ab- sence of ORC, wild-type and mutant strains were grown at the permissive tempera- ture (23°C), arrested at G2/M for 2 h and then shifted to the restrictive temperature (37°C) for 2 h.

79 2.4.3 Chromatin immunoprecipitation and mononucleosome preparation

Mononucleosomes were prepared from 1.5¢109 cells that were cross-linked with 0.1% formaldeyhde for 30 minutes and quenched with 125 mM glycine. After cross-linking, the cells were spheroplasted and treated with 80 units of micrococcal nuclease to generate mononucleosome fragments. ChIP extracts were prepared from 5¢108 cells arrested at the G2/M transition with 10 µg/ml nocodazole as previously described [Aparicio et al., 1997]. An anti- ORC polyclonal antibody was used for the ChIP as previously described [Tanny et al., 2006].

2.4.4 Reconstituted chromatin assembly

Chromatin was assembled on an ARS1 containing plasmid using purified recombi- nant S. cerevisiae histones expressed in E. coli and the ISWI complex [Vary et al., 2004] in the presence or absence of recombinant ORC and or Abf1p. Nucleosomes positions were determined by micrococcal nuclease digestion and high-throughput multiplex sequencing of the mononucleosomal fragments.

2.4.5 High-throughput sequencing

Libraries of mononucleosomal and ChIP fragments were prepared for sequencing using the genomic DNA sample preparation kit (Illumina) according to the man- ufacturer’s protocol. The library was sequenced using the Illumina 1G Genome Analyser. To map the genomic locations of Illumina reads, we used MAQ [Li et al., 2008] with a maximum mismatch of 3 bases and a minimum Phred quality score of 35.

80 2.4.6 Mapping ORC binding sites by ChIP-seq

We used an approach similar to that of MACS [Zhang et al., 2008b] to identify ORC binding sites in the sequencing data. Briefly, for each replicate we scanned the genome with a window of size wsize (set to twice the estimated sonication fragment size) for areas with a high enrichment of reads (areas with greater than 3x the read density than expected by average based on chromosome size and the number of reads that mapped to it). Consecutive windows were combined. The negative reads within these windows were then shifted upstream until the maxi- mum Pearson correlation between the positive and negative stranded reads within the window was found. The median optimal shift over all areas of enrichment was recorded as the expected ChIP-seq fragment size f. We then shifted the pos-

f itive stranded reads downstream by 2 and the negative stranded reads upstream f by 2 to create a read pile-up landscape. We then scanned the genome with a win- dow size of 2f, looking for areas that are significantly enriched above background. Background signal was estimated by sampling the distribution of reads within tran- scribed regions [David et al., 2006] and modeling that distribution as a negative binomial (the probability density function which we empirically determined to be the best fit for the tag distribution within ORFs). Areas with a p-value ¤ ¢10¡6 under the background distribution were identified as putative peaks of ORC en- richment.

2.4.7 Comparison of ORC ChIP-seq peaks with prior ChIP-chip peaks

To compare the ORC ChIP-seq peaks with prior ChIP-chip peaks [Xu et al., 2006], we first screened out any ORC peaks not identified in both replicates, leaving us with a total of 267 peaks. We then compared these enriched regions to the 396 regions identified in [Xu et al., 2006] as being ORC-only sites (47) or ORC-MCM

81 sites (349). Our criterion for a match was simply that the areas of enrichment overlap. We found that 241 of the sites from [Xu et al., 2006] overlapped with 258 sites identified in this study. The lower resolution of the tiling arrays (espe- cially in telomeric regions), contributed to the ChIP-chip study calling some very wide peaks, which accounts for the occasional single ChIP-chip peak overlapping multiple ChIP-seq peaks. Their analysis of ORC and MCM binding resulted in a set of predicted ARSs called the nimARS set. This set contains 529 areas of ORC and/or MCM enrichment. ACS calls (the nimACS set) were made at a subset of these nimARSs. Of the 396 ORC-enriched regions, 268 were included in their final set of 370 nimARSs for which an ACS call was made (the 102 other regions in the nimARS set for which an ACS was called were enriched only for MCM). Of this set of 268 ORC-enriched regions included in the nimARS set, 86 had more than one ACS call in their final nimACS set (the final set published in their supplemental materials had 504 unique ACS calls, corresponding to 370 nimARSs).

2.4.8 Identification of functional ACS matches

To locate the most likely ACS motif match bound by ORC for each ChIP-seq peak, we first searched for a common motif between the top 100 ChIP-seq peaks using cosmo [Bembom et al., 2007], an R [R Development Core Team, 2008] implemen- tation of MEME [Bailey and Elkan, 1994]. The motif we found is nearly identical to that shown in Figure 1C. We then scanned the genome, taking a log-odds ra- tio of the probability of matching the motif over the probability of matching yeast background sequence (represented by a 4th-order Markov chain trained on the en- tire yeast genome) at every position. We selected a minimum log-odds ratio cutoff of 4, which ruled out approximately 99.9% of genomic positions as potential ORC binding sites. We then estimated the parameters of a Gaussian distribution from the positions of the shifted reads for each ORC ChIP-seq peak. To select the most

82 likely ACS per peak, we took the log-space product of the log of the density of this Gaussian to the log-odds ratio of the remaining ORC motif matches under each peak. We took the local maximum of this score under each peak to be the most likely ACS bound by ORC. Any ORC peak which didn’t have an ORC motif match above the log-odds ratio cutoff was discarded. We found three examples of ORC peaks in the first replicate which did not have an ORC motif match. These were weak peaks which were not reproduced in the second replicate. ChIP-seq replicates were combined by accepting only those ACS calls which were called identically in both experiments. This set, the intersection of the ACS calls for our replicates, is the final ORC-ACS set.

2.4.9 Mapping nucleosome positions

To identify the location of stably positioned nucleosomes genome-wide, we first mapped the mononucleosomal micrococcal nuclease digested fragments back to the yeast genome using MAQ as described above. We then estimated an experi- mental mononucleosomal fragment size f by identifying the optimal shift between positive and negative reads which maximized the Pearson correlation between the two sets. To accurately locate each nucleosome, we generated two probability landscapes over the genome: one for positive reads and one for negative reads. To build these probability landscapes we used kernel density estimation (KDE) [Parzen, 1962] with a Gaussian kernel (σ = 20) [Boyle et al., 2008a; Shivaswamy et al., 2008]. The area between a positive peak followed by a negative peak approx- imately f bases downstream is interpreted as a potential nucleosome. To make sure that each positive peak is mapped to the most appropriate negative peak down- stream, we use the Needleman-Wunsch alignment algorithm (banded to within 50bp of f)[Needleman and Wunsch, 1970]. Working under the assumption that two ends of the same nucleosome should show a similar distribution of tags, the

83 product of the Euclidean distance and the Pearson correlation between the strands’ KDEs in a 40bp window around the peak was used as the score of an alignment, with a gap penalty of -2 (the median alignment score). This results in a mapping of 51 nucleosome edges to 31 nucleosome edges and allows for small false peaks that are a consequence of our high sequence depth to be discarded. The scores associated with each nucleosome are determined by summing the kernelized distance of each read to its associated edge. The kernelized distance of ¡ © ¡1 ¢p d q2 a read to a nucleosome edge is defined as kpd, σq  e 2 σ where d is the chro- mosomal distance of a read to the nucleosome’s edge and σ is the bandwidth used for KDE over the chromosome. These two sums (51 and 31) were added together to yield the full nucleosome score. To normalize each experiment we generated as many in silico reads (drawn from a uniform distribution) as the experiment had Illumina reads [Shivaswamy et al., 2008]. These reads were subjected to the same analysis as above, and the maximum nucleosome score was taken. This procedure was repeated twenty times and the average maximum was taken to be the scaling factor for the experiment. Each nucleosome score for the in vivo experiment was divided by this scaling factor. To further normalize between experiments, the distribution of nucleosome scores was quantile normalized. Since some experiments yielded a different average frag- ment size than others (estimated between 149 bp and 163 bp for the experiments described here) we represented each nucleosome as a dyad and then extended it 73 base pairs up- and downstream, yielding 147bp width nucleosomes.

2.4.10 Visualization of nucleosomes

When graphing the nucleosomes, in cases where two nucleosomes were found to overlap, we took the sum of the nucleosome scores in the area of overlap to be the score for that area, ignored any regions with a score 0.2, and capped scores

84 at 1 as in [Shivaswamy et al., 2008]. We also ignored areas in rDNA regions (as defined in SGD) and telomeric regions (within 10 kb of the beginning or end of a chromosome), as the highly repetitive sequence content of these regions often made nucleosome mapping impossible with single-end reads. In graphing the nucleosome dyad densities relative to a set of genomic features (Fig. 4C, SFig. 6, SFig. 9, SFig. 10, and SFig. 11), the nucleosome dyad positions (the center of each nucleosome) relative to the genomic features are collected. We then use Gaussian kernel density estimation with a bandwidth of 10 over the collec- tion of relative dyad locations, and normalize the resulting density by the number of dyads and the number of genomic features to generate the data graphed. Peaks in this density may be interpreted as the most likely position of a nucleosome dyad relative to the genomic feature. To generate the nucleosome dyad density over the plasmid relative to the ARS1 ACS (Figs 4D & SFig 10A), we first called nucleosome locations on the plasmid with a much more fine-grained bandwidth than we did for the genomic nucleosomes. The combination of this fine-grained bandwidth and the extremely high sequencing coverage of the plasmid (on average 1m reads covering a 3750 bp plasmid) allowed us to call many overlapping potential nucleosome con- formations. We weighted each of these potential conformations by the number of reads that contributed to it and then ran a similar kernel density estimation (with a bandwidth of 30) over these weighted potential nucleosome conformations. What results is a density map of potential nucleosome dyads relative to the ACS indicat- ing both position of nucleosome dyads and the proportion of reads that support each position (by peak height).

2.4.11 Nucleosome positioning at genomic features

To display nucleosome positioning around genomic features (TSSs, ORC-ACSs, etc), we first needed to align the features. We took TSS coordinates from the

85 transcript tiling array data of [David et al., 2006]. The TSS was taken as the 51 end of a segment which was categorized as overlapping a verified gene. The TSSs were then aligned such that transcription begins at base 0 and proceeds to the right. The first base of the transcript is annotated as position 0, the second base of the transcript is annotated as position 1, etc. The same was done for our ORC-ACS, although the decision of which way to align the set was made more arbitrary by the fact that an origin is a bi-directional feature. We thus aligned the ORC-ACS such that the first base of the T-rich match to the ACS is annotated as position 0, the second base is annotated as position 1, and the last base (the 33rd base), is annotated as position 32. Thus, the T-rich ACS match (Figure 1C) begins at posi- tion 0 and ends at position 32 regardless of whether it is present on the Watson or Crick strand. The same procedure was used for the non-replicative ACS set. The positions of the nucleosomes relative to the genomic features underwent the same transformation.

2.4.12 Construction of the nr-ACS set

The non-replicative ACS (nr-ACS) set was constructed by finding high-scoring matches to the ACS motif that did not show any signs of replication activity. Briefly, we scanned the genome for ACS motif matches in a manner identical to that de- scribed above. We then took all of the motif matches with a log-odds score higher than 9.15 (a number empirically determined to yield a set of motif matches of comparable size to the ORC-ACS set) and screened out any that were in genes or promoters (i.e. within 160 bp of a gene defined by SGD), within 3 kb of an hydroxyurea-resistant origin (Bell lab, unpublished data), within 3 kb of a member of the nimACS set from [Xu et al., 2006] or contained within a ChIP-seq peak of ORC enrichment. We then screened out rDNA and telomeric (within 10 kb of the beginning or end of a chromosome) loci. The set of motif matches that survived

86 this screening process composes the nr-ACS set.

2.4.13 Sequence analysis around genomic features

Given that 4–6 bp ‘islands’ of adenine bases (A’s) constitute a strong nucleosome excluding signal [reviewed in Segal and Widom, 2009], we scanned the sequence local to a given genomic feature with a 6 bp window and noted the position of any windows that were homogeneously composed of 5 or 6 identical bases. We scanned only the genomic DNA strand on which the feature was located, and al- lowed the strand of the feature to define which direction was “upstream” (negative relative coordinates) and which was ‘downstream’ (positive relative coordinates). For example, if a promoter was located on the negative strand at position x, then we would scan the negative genomic strand around x for nucleosome enrichment, and the genomic coordinates (x ¡ 400 .. x 400) would map to the local graph coordinates ( 400 .. ¡400). In the heatmap graphs (Figs. 3B, 3C & SFig. 7), these windows were given a color corresponding to their base composition identity (those with ¥ 5 T’s are colored green, while those with ¥ 5 A’s are colored red) and a transparency based on their homogeneity (those with 5 of the same base are half transparent, while those with 6 of the same base have no transparency). In the nucleotide enrichment density plots, all points identified in the heatmap graphs were combined, and used as input to a kernel density estimation algorithm, with a bandwidth of 20 bp. The resultant densities were then normalized both for the number of points and the number of genomic features considered, to make their y-axes quantitatively comparable.

87 2.4.14 Purification of ORC and ABF1

ORC was Flag-tagged at the N-terminus of ORC1 and purified from yeast G1 ex- tracts as previously described [Tsakraklides and Bell, 2010]. Abf1 was purified from Rosetta 2(DE3)pLysS (Novagen) harboring pET28a-N- His, C-2xFlag-Abf1. Abf1 expressed in this strain has an N-terminal His-Tag and a C-terminal 2xFlag-tag. Abf1 expression was induced by the addition of 1 mM IPTG at OD 0.4 for 4 hours. Cells were resuspended in Buffer A (50 mM HEPES-KOH [pH 7.4], 10% glycerol) containing 0.3M KCl (0.3K-BufferA) and frozen in liquid nitrogen. After thawing, cells were sonicated and then centrifuged for 30 min at 22,500 rpm. Abf1 was purified from the lysis supernatant by sequential use of TALON superflow metal affinity resin (Clontech), Anti-Flag M2 affinity gel (Sigma) and SP sepharose. First, lysis supernatant was mixed with TALON superflow metal affinity resin (Clontech) equilibrated with 0.3K-Buffer A. After a 1 hour incubation at 4¥C, resin was washed with 20 column volumes of 0.3K-Buffer A and eluted with 250 mM imidazole in 0.3K-Buffer A. The eluted fraction was mixed with Anti- Flag M2 affinity gel (Sigma) equilibrated with 0.3K-Buffer A and incubated for 16 hours at 4¥C. Resin was washed with 10 column volumes of 0.3K-Buffer A and 10 column volumes of 0.1K-Buffer A. Abf1 was eluted from the resin with 0.1 mg/ml 3xFlag peptide in 0.1K-Buffer A. The eluted fraction was loaded onto a SP sepharose column equilibrated with 0.1K-BufferA and subsequently eluted with 0.5K-buffer A. ISWI complex and recombinant S. cerevisiae histone were purified as previously described [Vary et al., 2004].

88 3

Chromatin signatures of the Drosophila replication program

3.1 Introduction

With every cell division, DNA replication must initiate in a coordinated manner from thousands of start sites or origins of DNA replication. Potential origins of DNA replication are selected by the heterohexameric origin recognition complex (ORC) which is conserved in all eukaryotes. The replicative helicase, MCM2-7 complex, is loaded at ORC binding sites in G1 to form the pre-replicative complex (pre-RC) [reviewed in Bell and Dutta, 2002]. In S. cerevisiae, ORC recognizes a degenerate consensus sequence, the ACS (ARS consensus sequence) [Marahrens and Stillman, 1992], which is necessary but not sufficient for ORC binding [Breier et al., 2004]. In higher eukaryotes, a defined consensus sequence has yet to emerge, suggesting that additional features such as chromatin organization and modification will be critical components for defining sequences which will function as origins of repli-

This chapter was originally published as: Eaton ML, Prinz JA, MacAlpine HK, Tretyakov G, Kharchenko PV, MacAlpine DM. (2011), “Chromatin signatures of the Drosophila replication pro- gram,” Genome Res, 21(2), 164-74.

89 cation. The assembly, organization and modification of histone octamers on the DNA regulates the accessibility of the DNA to trans-acting factors such as transcription factors and RNA Pol II [reviewed in Rando and Chang, 2009]. Promoter elements have a specific nucleosome organization with a nucleosome free region immedi- ately upstream of the transcription start site (TSS) and well-positioned nucleo- somes within the gene body [Lee et al., 2007]. Similarly, a number of epigenetic histone modifications modulate the recruitment of transcription factors to DNA and gene expression levels in what is often referred to as the histone code hypothesis [Jenuwein and Allis, 2001]. While much progress has been made in our under- standing of how chromatin structure and organization regulate gene expression, we know comparatively little about their contribution to the DNA replication pro- gram. In S. cerevisiae, the organization of nucleosomes is an important determinant of ORC localization. ORC localizes to nucleosome free regions and is required to pre- cisely position nucleosomes flanking the origin of replication [Eaton et al., 2010; Berbenetz et al., 2010]. This precise nucleosome organization is required for origin function as occluding the ACS with a nucleosome or moving the upstream flanking nucleosome abrogates origin function [Simpson, 1990; Lipford and Bell, 2001]. In Drosophila, nucleosome organization also appears to be a defining feature of ORC binding sites. Sites of ORC enrichment are depleted for bulk nucleosomes and enriched for the histone variant H3.3 [MacAlpine et al., 2010]. Recent ex- periments profiling Drosophila nucleosome turn-over in near real time by ‘covalent attachment of tags to capture histones and identify turnover (CATCH-IT), found that ORC associated sites undergo active nucleosome exchange [Deal et al., 2010]. Together, these data suggest that in both yeast and higher eukaryotes, a primary determinant of ORC localization is the accessibility of the DNA in higher order

90 chromatin. Histone acetylation has been positively correlated with replication origin activ- ity in a variety of experimental systems. In S. cerevisiae, deletion of the histone deacetylase, Rpd3, results in the earlier activation of a subset of late firing origins of replication [Vogelauer et al., 2002; Knott et al., 2009]. Similar experiments in Drosophila have shown that replication initiation during the developmentally pro- grammed amplification of the chorion locus is also sensitive to changes in local histone acetylation [Aggarwal and Calvi, 2004]. In mammalian cells, differences in local chromatin acetylation at the β-globin locus between erythroid and non- erythroid cells are reflected in the activity of the β-globin replication origin [Goren et al., 2008]. Finally, the loading of a transcriptional activator, GAL4-VP16, on a plasmid is sufficient to localize origin activity in Xenopus extracts [Danis et al., 2004]. This effect is not dependent on active transcription, but instead is correlated with the local acetylation of histones at the site of initiation. The role of chromatin modification in the selection of origins of replication is less clear, although recent studies at a select number of mammalian origins of replication indicate that HBO1, a histone H4 acetylase, interacts with Cdt1 and is required for the subsequent load- ing of the MCM complex to form the pre-RC [Miotto and Struhl, 2010]. To better understand how the DNA replication program is established and reg- ulated, we have examined multiple DNA replication datasets in the context of the diverse data types generated by the modENCODE (model organism encyclopedia of DNA elements) consortium [Celniker et al., 2009]. The goal of the modENCODE project is to identify all functional DNA elements in the genomes of the model or- ganisms D. melanogaster and C. elegans. Together, the consortium has generated nearly 1000 genomic data sets, consisting of array and sequencing based exper- iments, that describe the transcription program, map the chromatin landscape, identify transcription factor and insulator binding sites, and characterize the DNA

91 replication program across multiple cell lines and developmental stages [modEN- CODE Consortium et al., 2010]. Our analysis of the similarities and differences in the DNA replication program from three Drosophila cell lines has provided unique insights into the role of chro- matin organization and modifications in regulating the DNA replication program. Genome wide replication timing and early origins of replication are correlated with a variety of activating chromatin marks. Similarly, we find that ORC binding sites are enriched for a subset of these chromatin marks as well as additional DNA bind- ing proteins, suggesting that the selection and regulation of origins may be con- trolled by distinct factors. We were unable to identify a simple consensus sequence analogous to the yeast ACS from the high resolution ChIP-seq ORC binding data. Instead, we were able to use machine learning approaches to classify ORC bind- ing sites based on sequence features, chromatin modifications and DNA binding proteins. Finally, we generated accurate computational models to predict origin function based on the chromatin landscape.

3.2 Results

3.2.1 Characterization of the Drosophila replication program in three cell lines

As part of the modENCODE consortium, we have utilized multiple genomic ap- proaches to characterize the Drosophila DNA replication program in three differ- ent cell lines. Two of these cell lines are of embryonic origin (Kc167 (Kc) and S2-DRSC (S2)), and the third is derived from neuronal tissue (ML-DmBG3-c2 (Bg3)); together, they represent some of the most commonly utilized cell lines in the Drosophila research community. We have taken a top down approach, char- acterizing the replication program by utilizing increasingly high resolution and complementary assays (Figure 3.1A). Specifically, we used genomic tiling arrays to

92 determine the relative time of replication for all unique sequences in the Drosophila genome (Figure 1A; top), map early activating origins of replication (Figure 1A; middle) and identify, at near nucleotide resolution, the locations of ORC binding by using high-throughput sequencing (Figure 1A; bottom). Genomic tiling arrays were used to identify the relative time of DNA replication during S-phase for unique sequences in the Drosophila genome. Briefly, synchro- nized cells were pulse labeled with the nucleotide analog, 5-bromo-2-deoxyuridine (BrdU), during either early or late S-phase, resulting in the differential labeling of early and late replicating sequences [MacAlpine et al., 2004]. These fractions of early and late replicating sequences were then hybridized to genomic tiling arrays to determine the relative time of replication. We found that the relative time of DNA replication was correlated across the three cell lines (Table 3.1). The replica- tion timing of the X chromosome is one notable exception; it completes replication significantly earlier than the autosomes in the male cell lines [DeNapoli et al., manuscript in preparation; Schwaiger et al., 2009].

Table 3.1: Pearson correlation of whole-genome replication timing between cell lines.

r Bg3 Kc S2 Bg3 1 - - Kc 0.67 1 - S2 0.68 0.7 1

Early origins of replication were identified in the three cell lines by pulse la- beling cells with BrdU in the presence of hydroxyurea (HU), a potent inhibitor of nucleotide biosynthesis. HU treatment results in stalled replication forks and the activation of the intra-S-phase checkpoint [Santocanale and Diffley, 1998; Shi- rahige et al., 1998]. Thus, only those sequences immediately adjacent to early

93 A 2 Timing 0

-2 Early origin peaks 5 Early origins 0

ORC peaks 150

ORC ChIP-seq

0 CG34398 CG32982 tai Ggamma30A 2L Gene model 8,800,000 8,850,000 8,900,000 8,950,000 9,000,000 9,050,000 9,100,000 9,150,000 9,200,000 9,250,000

CG31708 B C KC S2 113 0 20 40 60 80 100 207 91 Kc S2 195 Bg3

72 23

122

Bg3 D E KC 706 S2 1373 763 0 20 40 60 80 100 Kc S2 2395 Bg3 639 305

1066

Bg3

FIGURE 3.1: The Drosophila replication program across three cell lines. (A) Repli- cation program in S2-DRSC cells. Genome browser track of whole genome S-phase replication timing profiles as the log2 ratio of early to late replicating sequences (red), early origin activity as the log2 ratio of BrdU enrichment to input DNA (blue), ORC binding sites as input subtracted ChIP-seq tag depth (orange), and gene models for a 500 kb region of chromosome 2L. (B) Overlap of early origins in three cell lines. The Venn diagram shows the overlap in total early origin peaks from each cell line. (C) Distribution of early meta peaks per cell line. The per- centage of early origin peaks found in three cell lines (light gray), two cell lines (medium gray) or one cell line (dark gray). (D) Overlap of ORC ChIP-seq peaks in three cell lines. As in (B), the Venn diagram depicts the overlap in ORC peaks for each cell line. (E). Distribution of ORC meta-peaks per cell line. Same as (C) for ORC ChIP-seq peaks. 94 activating origins of replication will incorporate BrdU, allowing the enrichment of early origin proximal sequences by immunoprecipitation with anti-BrdU antibodies [MacAlpine et al., 2010]. We identified 630, 457 and 433 early origins of replica- tion in Kc, S2 and Bg3 cells, respectively. To facilitate inter-cell line comparisons, we assembled a single set of genomic loci that had evidence for early origin activity in at least one of the three cell lines. We will refer to this as the set of early origin meta-peaks, and it represents locations in the Drosophila genome that are capable of functioning as a replication origin. Almost a quarter of these early origin meta- peaks (195) were found in all three cell lines and approximately half (403) were found in at least two cell lines (Figure 3.1B). For each of the three cell lines at least two-thirds of the early origins were also found in another cell line (Figure 3.1C). Importantly, 82% of the early origin meta-peaks contained an ORC associated se- quence (p 1 ¢ 10¡5, bootstrap R  100000, see methods). Recent advances in sequencing technology have allowed the use of high-through- put sequencing to directly sequence DNA from chromatin immunoprecipitation ex- periments (ChIP-seq) [Johnson et al., 2007; Robertson et al., 2007]. We have used ChIP-seq to analyze the genome-wide distribution of ORC. ChIP-seq of biological replicates from independent experiments were performed on each of the cell lines. We identified 5159, 4230 and 4477 distinct ORC binding sites in Kc, S2 and Bg3 cells, respectively. We then generated a set of ORC meta-peaks analogous to the set of early origin meta-peaks described above. We identified 7246 ORC meta-peaks of which more than a third (2395) have support in all three cell lines and more than half of these sites (4045) were found in at least two cell lines (Figure 3.1D). Examination of each of the individual cell lines revealed almost 75% of the ORC peaks identified in a particular cell line were also found in at least one other cell line (Figure 3.1E).

95 3.2.2 ORC enrichment at genic elements

In Drosophila, ORC is frequently located at the TSS of actively transcribed genes [Figure 3.2; MacAlpine et al., 2010]. In S. cerevisiae ORC binding at the MSH4 locus is disrupted by active transcription through the ACS [Mori and Shirahige, 2007]. A simple prediction is that ORC will be excluded from the gene bodies of

actively transcribed Drosophila genes. For each gene we determined the log2 mean enrichment of ORC in the 500 bp upstream of the gene, the first exon, first intron, internal exons, last intron, last exon and 500 bp downstream of the last exon rel- ative to random locations in the genome (not excluding annotated genes) (Figure 3.3A). Not surprisingly, we found ORC enriched immediately upstream of the gene and in the first exon which is likely a consequence of ORC binding at or very near the TSS. However, ORC was depleted within the interior exons, suggesting that active transcription through gene bodies is recalcitrant to ORC binding. We note that background levels of ORC enrichment in the first intron and last intron may be due to the presence of alternative transcripts. The increased resolution provided by the ORC ChIP-seq data allowed us to in- vestigate the relative position of ORC in relation to RNA polymerase II binding at promoters. We examined the distribution of both RNA Pol II and the ORC meta- peaks at the TSSs of annotated genes in all three cell lines. To facilitate comparison between the RNA Pol II data and the ORC data, which were derived from differ- ent whole-genome platforms (ChIP-chip and ChIP-seq respectively), we plotted the density of peak summits relative to the TSS of actively transcribed genes (Figure 3.3B). The density of peak summits for ORC (dark gray) and RNA Pol II (light gray) relative to TSSs represents the most likely locations to find bound ORC or RNA Pol II. We found that ORC was positioned almost precisely at the transcription start site and immediately upstream of RNA Pol II. The region occupied by ORC at the

96 * * * fpkm) Expression log2(1+

ORC No ORC ORC No ORC ORC No ORC Bg3 Kc S2

FIGURE 3.2: Expression of genes with and without TSS proximal ORC binding. Distribution of RNA-seq expression scores for genes with ORC found within 1kb up- or down-stream of their transcription start site compared to genes without ORC in the vicinity of the promoter. Expression values were taken from the cell- line specific modENCODE RNA-seq data Graveley et al. and were converted to log space by log2pfpkm 1q. Asterisks denote statistical significance by Wilcoxen unpaired test. In all cases, p ¤ 2 ¢ 10¡16.

transcription start site is also depleted for bulk nucleosomes and enriched for H3.3 and GAGA factor (GAF) [modENCODE Consortium et al., 2010]. These results par- allel prior studies in yeast where chromatin accessibility is a determinant of ORC binding [Eaton et al., 2010; Berbenetz et al., 2010].

3.2.3 Diverse chromatin marks define the replication program

The chromatin landscape clearly impacts both the expression and the replication of the genome. For example, the transcriptionally active euchromatin typically replicates prior to the repressed heterochromatic sequences [reviewed in Gilbert, 2010b]. Studies in yeast, Drosophila and mammalian systems have shown that

97 A 2 Bg3 Kc S2

1

0 Mean log2 enrichment of ORC

−1

5’ −2 500bp First First Internal Last Last 500bp Exon Intron Exon intron Exon

B ORC Peaks 0.00012 RNA Pol II Peaks 0.0005 Density of RNA Pol II peaks 0.00010 0.0004 0.00008 0.0003 Density of ORC peaks 0.00006 0.0002 0.0001 0.00004

−4000 −2000 0 2000 4000 Distance from TSS

FIGURE 3.3: ORC enrichment at annotated genomic features. (A) ORC is fre- quently associated with promoters but is excluded from interior coding features. The mean log2 enrichment of input subtracted ChIP-seq tag density over random locations for ORC was determined for each type of genic element (500 bp upstream of the transcription start site, the first exon, the first intron, internal exons, the last intron, the last exon and 500 bp downstream of the last exon) for each of the cell lines Bg3 (dark gray), Kc (gray) and S2 (light gray). (B) All TSSs were scanned for ORC peaks (dark gray) and RNA Pol II peaks (light gray). The densities of each are plotted relative to the transcription start site. 98 changes in histone acetylation [Knott et al., 2009; Aggarwal and Calvi, 2004; Goren et al., 2008] are associated with changes in the replication program. However, a comprehensive view of the replication program in the context of chromatin modi- fications and DNA binding proteins is lacking. The different modENCODE data types across multiple cell lines [modENCODE Consortium et al., 2010] allowed us to define the chromatin and transcription land- scape associated with features of the DNA replication program. For each replica- tion data type (replication timing, early origins and ORC binding), we generated a 43 by 3 matrix with each column representing a specific cell line, and each row representing the enrichment or correlations with chromatin marks, DNA binding proteins, nucleosome density, histone variants, nucleosome turnover (CATCH-IT) [Deal et al., 2010] and gene expression (RNA-seq) (Figure 3.4). For the replication timing profiles where we did not have discrete peak calls, we calculated Spearman’s correlation between each factor with the whole-genome replication timing profile (Figure 3.4A). For early origins of replication (Figure 3.4B) and ORC binding sites

(Figure 3.4C), we calculated the median log2 enrichment of each factor within all BrdU peaks and within 500 bp of ORC ChIP-seq peak centers respectively. The rightmost column of each matrix is a summary column depicting the average signal or correlation for each factor derived from the three independent cell lines. We found that the selection and regulation of DNA replication origins is associ- ated with distinct sets of chromatin marks and DNA binding proteins. Prior studies have associated early replication with active transcription and the presence of ‘ac- tivating’ chromatin modifications such as histone acetylation, whereas late repli- cation is associated with ‘repressive’ chromatin marks such as those found in the heterochromatin [reviewed in Gilbert, 2010b]. Indeed, we found that gene expres- sion is positively correlated with replication timing, as are generally euchromatic marks such as H3K4me1 and H3K18ac (Figure 3.4A). In contrast, heterochromatic

99 A B C ORC Binding Promoter Promoter Timing Early Origins Proximal Distal

H3K4me1 H3K4me1 H3K27ac H3K18ac H3K18ac H3K9ac H3K4me2 H3K27ac H3K4me2 H4K16ac H3K4me2 H3K4me3 H3K36me1 H3K79me2 H3K18ac H3K27ac H3K36me1 H3K4me1 H3K79me1 H3K79me1 H4K16ac H3K4me3 H4K16ac H3K79me2 H3K79me2 H3K9ac H2B-ubiq H3K9ac H3K23ac H3K79me1 H3K36me3 H3K4me3 H3K9me3 H2B-ubiq H2B-ubiq H3K9me2 H3K23ac H3K9me2 H3K36me1 H3K9me3 H3K9me3 H3K36me3 H3K27me3 H3K36me3 H3K27me3 H3K9me2 H3K27me3 H3K23ac GAF MRG15 ISWI MRG15 ISWI NURF301 JIL1 GAF RNA pol II ISWI RNA pol II GAF NURF301 HP1c Chro RNA pol II NURF301 HP1c Chro HP1 MRG15 dRING Su(var)3-7 CP190 Pc dRING dRING CP190 CTCF Su(var)3-7 HP1c Ez HP1 CTCF Pc CTCF Su(var)3-7 HP2 Ez mod2.2 CP190 PCL Ez Chro BEAF-70 BEAF-70 Su(var)3-9 JIL1 HP1 PCL Psc Psc JIL1 Pc PCL mod2.2 HP2 Su(var)3-9 BEAF-70 Su(var)3-9 HP2 Psc Su(hw) Su(hw) Su(hw) mod2.2 H3.3 H3.3 CATCH-IT 20’ CATCH-IT 20’ CATCH-IT 20’ H3.3 Nucleosome Density H2AV H2Av H2AV Nucleosome Density Nucleosome Density Expression Expression Expression 3 ce an Kc Kc Kc Kc S2 S2 S2 S2 BG3 BG3 BG3 BG en mean mean me mean er ff

Correlation log2 Enrichment Di Data not available

−0.5 0.5 −2 2 FIGURE 3.4: The chromatin landscape of the replication program. (A) Chromatin correlations with replication timing. The genome-wide replication timing profile of each cell line was paired with the genome-wide array scores for each chro- matin factor, and the pairwise correlation of the factor with replication timing was computed (Spearman’s ρ). The correlation ρ ranges from -.5 (blue) to +.5 (red). (B) The chromatin landscape of early origins. The log2 enrichment for each fac- tor within early origin peaks was determined for each cell line. The enrichment ranges from -2 (blue) to +2 (red). (C) The chromatin landscape of ORC binding sites. ORC associated sequences were divided into TSS proximal (overlapping a TSS) and TSS distal (not overlapping a TSS). The log2 enrichment for each factor within 500 bp of the ORC peak centers was determined for each cell line. The enrichment ranges from -2 (blue) to +2 (red). In all panels, gray boxes represents an experiment that has not yet been submitted to modENCODE. See methods for details. 100 marks such as H3K27me3 and H3K9me2, are negatively correlated with replication timing. The sequences surrounding early origins were also enriched for activating chromatin marks as well as specific DNA binding proteins, including chromatin remodeling factors (Figure 3.4B). Because many of the ORC binding sites co-localized with promoters of active genes [Figure 3.3B; MacAlpine et al., 2010], we separated the ORC binding sites into those that are TSS proximal (within 1 kb of a TSS) and those that were not at a TSS (distal). We were particularly interested in chromatin features that are shared between ORC binding sites both proximal and distal to promoters. Additionally, marks that are specific to ORC sites distal from a promoter will be of interest as these marks may be required for ORC binding or function in the absence of a promoter. ORC binding sites proximal to TSSs were enriched for chromatin remodelers such as the NURF complex (NURF301, ISWI) as well as other DNA binding pro- teins such as GAF, RNA Pol II and Chro (Figure 3.4C). These TSS associated ORC sites were also enriched for H3K9ac, H3K27ac, H3K4me2 and H3K4me3 – marks frequently found at promoters. Interestingly, those ORC sites that did not over- lap with a TSS (distal) were also enriched for chromatin remodelers ISWI and NURF301, as well as GAF which has also been implicated in chromatin remodeling [Petesch and Lis, 2008]. Consistent with the idea of ORC localizing to dynamic and active chromatin, we found an enrichment for CATCH-IT and H3.3 at ORC sites both proximal and distal to TSSs, as well as a reduction in bulk nucleosome occupancy [modENCODE Consortium et al., 2010]. ORC sites not located at pro- moters were enriched for many of the same histone marks as those at promoters, with a few notable exceptions. We found a decrease in H3K4me3 at ORC sites distal from a promoter, as well as an increase in H3K18ac and H3K4me1. Chromatin features specific to transcription start sites, such as RNA Pol II and

101 H2Av, were decreased at ORC binding sites distal to promoter elements. A small amount of RNA Pol II signal remained in the TSS distal ORC binding sites; however, in comparison to the local enrichment of ISWI and GAF, there was a clear reduc- tion in local signal (Figure 3.5). The remaining signal may be due unannotated transcription start sites.

BG3 RNA Pol II BG3 GAF Kc ISWI

3.5 TSS proximal 3.5 TSS proximal 3.5 TSS proximal TSS distal TSS distal TSS distal 3.0 3.0 3.0 2.5 2.5 2.5 2.0 2.0 2.0 tss.pred tss.pred tss.pred 1.5 1.5 1.5 1.0 1.0 1.0 0.5 0.5 0.5 0.0 0.0 0.0

−2000 0 1000 2000 −2000 0 1000 2000 −2000 0 1000 2000

x_ax x_ax x_ax

Kc RNA Pol II S2 GAF S2 ISWI

3.5 TSS proximal 3.5 TSS proximal 3.5 TSS proximal TSS distal TSS distal TSS distal 3.0 3.0 3.0 2.5 2.5 2.5 2.0 2.0 2.0 tss.pred tss.pred tss.pred 1.5 1.5 1.5 1.0 1.0 1.0 0.5 0.5 0.5 0.0 0.0 0.0

−2000 0 1000 2000 −2000 0 1000 2000 −2000 0 1000 2000

x_ax x_ax x_ax

S2 RNA Pol II BG3 ISWI S2 NURF301

3.5 TSS proximal 3.5 TSS proximal 3.5 TSS proximal TSS distal TSS distal TSS distal 3.0 3.0 3.0 2.5 2.5 2.5 2.0 2.0 2.0 tss.pred tss.pred tss.pred 1.5 1.5 1.5 1.0 1.0 1.0 0.5 0.5 0.5 0.0 0.0 0.0

−2000 0 1000 2000 −2000 0 1000 2000 −2000 0 1000 2000

x_ax x_ax x_ax

FIGURE 3.5: Local chromatin signals around TSS proximal (black) and TSS distal (green) ORC binding sites. Each ORC binding site was classified as TSS proximal or TSS distal on the basis of 1 kb proximity to a transcription start site annotated in mb8 or flybase 5.12. The strength and relative location of the array probes in a 4 kb window around the ORC binding site were pooled and loess smoothed. The graphs represent the smoothed signal of PolII, GAF, ISWI and NURF in the cell lines for which we have data on those factors.

Chromatin marks that are associated with active transcription through gene

102 bodies (e.g., H3K79me1, H3K36me1 and H3K36me3) were not found above back- ground levels at any ORC binding sites. However, H3K36me1 was found specifi- cally flanking those ORC binding that did not coincide with a TSS [modENCODE Consortium et al., 2010]. ORC has been shown to facilitate the formation of hete- rochromatin and HP1 binding [Pak et al., 1997]; however, we found that ORC sites were depleted for heterochromatic histone modifications such as H3K27me3 and H3K9me2/3 and were only slightly enriched for HP1. This may, in part, be due to the inability to map distinct ORC binding sites in repetitive sequences, a current limitation of high-throughput sequencing approaches. We also examined the chromatin signatures of promoter elements with and without ORC associated to determine if there were unique chromatin signatures specific for ORC associated promoters. Since those promoters with proximal ORC binding tend to be far more actively transcribed than those with ORC (Figure 3.2), we limited our comparison to active promoter elements only. We found that ORC associated promoters had modestly increased chromatin remodeling activities, de- creased nucleosome occupancy and greater evidence of nucleosome turn-over rel- ative to other active promoters not associated with ORC (Figure 3.6). In summary, these results indicate that dynamic chromatin environments may contribute to ORC localization and the subsequent activation of replication origins.

3.2.4 Potential cis and trans-acting elements directing ORC association

ORC purified from higher eukaryotes exhibits little if any sequence specificity in vitro [Vashee et al., 2003; Remus et al., 2004]. Despite the apparent lack of speci- ficity, we found that ORC localized to specific chromosomal locations in three dif- ferent cell lines. In our prior ORC ChIP-chip experiments [MacAlpine et al., 2010], we were unable to identify a single sequence motif that was predictive for ORC binding, but rather we were able to identify sequence words (k-mers) that in com-

103 Chro NURF301 ISWI RNA pol II CP190 MRG15 BEAF−70 JIL1 HP1c HP1 CTCF Ez GAF Su(var)3−7 dRING HP2 Su(var)3−9 PCL Psc mod2.2 Su(hw) Pc H3K4me3 H3K9ac H3K4me2 H3K79me2 H3K27ac H4K16ac H3K18ac H3K36me3 H2B−ubiq H3K4me1 H3K9me2 H3K9me3 H3K79me1 H3K36me1 H3K23ac H3K27me3 H2Av CATCH−IT 20' H3.3 Nucleosome Density ORC

log2 enrichment No_ORC

-3 0 3 FIGURE 3.6: Median array signal for all factors at promoters with ORC, and those without. To test the possibility that active promoters with and without ORC may have different chromatin environments, we took the top quartile of genes by ex- pression (3,662) in S2 cells and labeled them as having ORC overlapping the 1 kb 5’ of their TSS (ORC; 1149) or not (No ORC; 2513). We then displayed each fac- tor’s median enrichment as in figure 3.4. The most notable difference is the change in nucleosome dynamics which shows that bulk nucleosomes at ORC-occupied pro- moters are more depleted, enriched for H3.3 and experiencing rapid turnover. In addition, we see slight enrichment for NURF301 and ISWI. It is important to note that the scale of this figure is different from figure 3.4; since the top quartile of ac- tive promoters typically have stronger overall signals, the log2 enrichment is from -3 to +3.

104 bination were able to accurately classify sequences as ORC associated or not us- ing support vector machines (SVMs). We revisited the issue of specific sequences directing ORC localization with our ORC binding data derived from the higher resolution ChIP-seq datasets. First, we sought to identify potential cis-acting motif elements in those se- quences that were associated with ORC. We identified 3416 sequences representing the intersection of peaks in each of the three cell lines. The intersection of the ORC peaks was used rather than the previously described meta-peaks to increase the nucleotide-level resolution. The center of the intersection was used as the epicen- ter of the ORC binding sites and extended 250 bp up and downstream to yield a 500 bp fragment. Using the motif identification tool MEME [Bailey and Elkan, 1994], we were only able to identify simple repetitive elements (CA)n, (CT)n and (CG)n which were not particularly predictive of ORC binding (Figure 3.7). 1.0 0.8 0.6 e Rate (Sensitivity) 0.4 ositi v ue P r T 0.2 [CA]n auROC=0.6962 [CT]n auROC=0.6855 [CG]n auROC=0.6141 0.0

0.0 0.2 0.4 0.6 0.8 1.0 False Positive Rate (1−Specificity)

FIGURE 3.7: ROC curve for motif classification of ORC binding sites. Enriched repetitive motif elements identified by MEME (CA)n,(CG)n and (CT)n had little predictive value in discriminating between ORC associated sites and random neg- ative regions of the genome. Each ORC binding site was ranked by p-value for the presence of these motifs, and the ROC curve was generated by accepting ORC peaks with successively lower ranks.

105 We decided to build upon our earlier success using support vector machines (SVM) to classify ORC binding sites based on primary sequence [MacAlpine et al., 2010] by including additional information such as chromatin modifications and DNA binding proteins. An SVM is a classification algorithm that is trained on a set of labeled samples represented as feature vectors (e.g., ORC bound or unbound sequences) and after training is able to accurately distinguish between samples in a test data set. In accordance with this, we generated a feature vector describing each ORC binding site and an equal number of random loci using unique sequence k-mers (where k is 1 to 6, excluding reverse complements), 15 chromatin marks and 20 DNA binding proteins for a total of 2765 features from the S2 dataset. Given the input training set which consisted of an equal number of ORC bound and random loci (balanced for promoter occupancy) from chromosomes 2L, 3L and 3R, we generated an SVM model using 10-fold cross validation of the training dataset. The X and 4th chromosome possess unique chromatin environments and were excluded from the training and test datasets. Specifically, the single male X chromosome is hyperacetylated on H4K16 [Lucchesi et al., 2005] and the 1.2 Mb 4th chromosome has a high transposon density and a prevalence of repressive heterochromatin marks [Riddle et al., 2009]. After training the SVM, we identified those features with the most discrimina- tory power (F-score ¡ 0.075) and thus reduced the length of our feature vector from 2765 features to only 34 features. The model generated from this reduced feature set was then used to classify ORC bound sequences on chromosome 2R, which was deliberately left out of the training. The results of the testing are pre- sented as receiver operator characteristic (ROC) curves, with the y-axis represent- ing the sensitivity of the assay (true positive rate) as a function of 1-specificity (false positive rate). We found that we could accurately predict ORC binding sites using either sequence, chromatin modifications, or DNA binding proteins (Figure

106 3.8A). However, combining the different features resulted in a marked increase in accuracy of the predictions. For example, at a sensitivity of 83% the false positive rate was approximately 8%. Although there were fewer chromatin factors avail- able for the other two cell lines, the SVM was still able to accurately classify origins in both Kc and Bg3 (Figure 3.9).

A B NURF301 1.0 1.0 Random ORC Sequence Features ISWI Binding Protein Features Chromatin Mark Features 0.8 0.8 WDS er (F−Score) 0.6 0.6

w RNA_pol_IIH3K4me2 o y P

e Rate (Sensitivity) H3K9ac H3K4me3ChroCP190 0.4 0.4 GAF H3K27ac ositi v

iminato r HP1c ue P r CGC H3K18ac T 0.2 0.2 Disc r CGCG H4K16ac Sequence Features; auROC=0.8406 AGGAACGCTGCG ACCTCGCGCAGG dRING Chromatin Mark Features; auROC=0.8785 AGGAC ACGCCCTCTCAAGCGACGCTCCCTCGCAAGCGCCCGCGCH3K79me2 Binding Protein Features; auROC=0.9030 0.0

0.0 Sequence + Chromatin + BP Features; auROC=0.9292

0.0 0.2 0.4 0.6 0.8 1.0 −30 −20 −10 0 10 20 30 40 50

False Positive Rate (1−Specificity) Class Proximity (T−Value) FIGURE 3.8: Sequence, chromatin and DNA binding proteins classify ORC binding sites. (A) SVM performance was gauged by the ROC curve resulting from sepa- rately using sequence features, chromatin mark features, binding protein features or a combination of all three. In each case, the SVM was trained using 10-fold cross validation on three chromosome arms (2L, 3L and 3R) and tested on a fourth chromosome arm (2R). (B) The importance of individual features was determined by plotting the F-score as a function of class proximity represented here as as a t-statistic.

To better understand the specific features that contributed to the discriminatory accuracy of the SVM, we plotted the class proximity (defined as a t-statistic ob- tained from comparing the feature counts between positive and negative sets) for each feature against the F-score (Figure 3.8B). In each case, the t-statistic denoted the class to which each feature most correlated, and the discriminatory power indicated its overall importance in classifying ORC sites. Our analysis showed that binding proteins were by far the most discriminative feature type, followed closely by chromatin marks. Sequence features were neither very discriminative

107 nor strongly correlated to either class. This analysis was therefore able to identify those features most highly correlated with ORC binding sites across all cell lines (Figure S10), namely WDS (which interacts with complexes whose functions in- clude histone acetylation and enabling ISWI to remodel chromatin) [Suganuma et al., 2008], chromatin remodelers (NURF301, ISWI) and ‘activating’ chromatin modifications (H3K4me2, H3K4me3, H3K27ac, etc.).

A Bg3 B Bg3 0 0 . . Random ORC 1 1 Sequence Features ) ) e

y Binding Protein Features r t i

o Chromatin Mark Features v c i 8 8 t . . ISWI i S 0 0 s − n F e (

S r (

e e 6 6 w t . . o a 0 0 P R

y e r v i o t t i 4 4 a s . . H3K4me2 n o Chro i 0 0 GAF P m

i RNA_pol_II r e c u H3K4me3 r s i H3K27ac T 2 2 D . . CGC HP1c 0 0 Sequence Features; auROC=0.8122 AGCGCGCT CG CGCG CP190 Chromatin Mark Features; auROC=0.8464 AGGACACCGCAACGCCACACCTCTCCTCGCCGCTCCACGC Binding Protein Features; auROC=0.9014 0 0 Sequence + Chromatin + BP Features; auROC=0.9286 . . 0 0

0.0 0.2 0.4 0.6 0.8 1.0 −30 −20 −10 0 10 20 30 40 50

False Positive Rate (1−Specificity) Class Proximity (T−Value) C D Kc Kc 0 0 . . Random ORC 1 1 Sequence Features Binding Protein Features Chromatin Mark Features ) ) 8 8 e . . y r t 0 0 i o v c i t i S s − n F e 6 6 ( . .

S r ( 0 0

e e w t o a P R

4 4 y e . . r v i 0 0 o t t i a s n o i P m

i 2 2 r e . . ISWI c u 0 0 r

s CGC i RNA_pol_IIWDS T Sequence Features; auROC=0.8013 CGCG HP1c D AGGAAGCG H3K9acChro Chromatin Mark Features; auROC=0.5999 AGGCCTCGCGCTCGCA Binding Protein Features; auROC=0.8128 0 0 .

. Sequence + Chromatin + BP Features; auROC=0.8794 0 0

0.0 0.2 0.4 0.6 0.8 1.0 −30 −20 −10 0 10 20 30 40 50

False Positive Rate (1−Specificity) Class Proximity (T−Value) FIGURE 3.9: SVM classification of ORC sites in Bg3 and Kc. (A) Using those fea- tures from the reduced feature set from figure 3 (in table 3.2) that have been profiled in Bg3 cells, an SVM was trained using each individual feature class (se- quence, chromatin marks and binding proteins) and on the combination of the three. The area under the ROC curve characterizes the sensitivity and specificity of the classifier. (B) Each feature in BG3 cells was given an F-score and a t-statistic to gauge their relative contributions to the accuracy of the classifier. (C) As in A, but for Kc cells, for which fewer marks have been profiled. (D) Again, as in B, but for Kc cells.

108 Table 3.2: Top ORC binding SVM features by F-score. A table showing the features selected by feature selection in each of the three cell lines. Their F-score and t- statistic (as reported by libsvm) is also shown.

S2 Bg3 Kc Feature T-Score F-Score Feature T-Score F-Score Feature T-Score F-Score

e CG −0.14 0.11 CG −0.08 0.11 CG 0.01 0.08 c n

e AGG −0.17 0.10 CAC −0.03 0.08 AGG 0.04 0.09 u

q CCT −0.18 0.10 CGC 0.10 0.18 CCT 0.00 0.08 e

S CGC −0.04 0.20 ACGC 2.63 0.09 CGC 0.12 0.16 ACGC 2.10 0.09 AGCG 2.03 0.13 AGCG 2.13 0.11 AGCG 2.16 0.14 AGGA −3.44 0.10 AGGA −3.57 0.10 AGGA −3.98 0.12 CGCA 2.44 0.09 CGCA 2.32 0.08 CGCA 2.27 0.10 CGCG 7.03 0.12 CGCG 7.48 0.13 CGCG 7.15 0.15 CGCT 2.05 0.13 CGCT 2.17 0.09 CGCT 2.10 0.13 CACAC 8.02 0.08 AGCGA 8.04 0.08 CACGC 11.14 0.08 AGCGC 11.89 0.08 CGCTC 11.01 0.09 AGGAC −13.68 0.08 CTCGC 9.59 0.09 CACAC 6.85 0.08 CTCTC 8.19 0.09 CGCGC 14.69 0.08 CGCTC 10.12 0.08 CTCGC 10.71 0.10 CTCTC 7.31 0.08

s Chro 30.10 0.41 Chro 31.44 0.38 Chro 22.19 0.12 n i e t CP190 32.50 0.41 CP190 20.77 0.11 HP1c 22.82 0.12 o r

P dRING 20.50 0.10 GAF 32.06 0.36 ISWI 26.78 0.20 g

n GAF 30.85 0.36 HP1c 27.16 0.20 RNA_pol_II 23.79 0.14 i d

n HP1c 29.81 0.29 ISWI 38.29 0.80 WDS 25.65 0.15 i

B ISWI 37.26 0.90 RNA_pol_II 29.89 0.30 NURF301 36.97 0.99 RNA_pol_II 30.87 0.55 WDS 36.48 0.78

s H3K18ac 23.97 0.20 H3K27ac 27.47 0.23 H3K9ac 21.06 0.12 k r

a H3K27ac 28.92 0.34 H3K4me2 29.21 0.40 M H3K4me2 30.08 0.54 H3K4me3 27.35 0.27 n i t

a H3K4me3 29.84 0.40 m

o H3K79me2 16.82 0.09 r h H3K9ac 30.04 0.44 C H4K16ac 21.74 0.17

109 3.2.5 Predicting early origin usage from the chromatin landscape

Genome-wide and locus specific experiments have revealed that early replicating regions of the genome typically correlate with ‘activating’ chromatin marks and that changing the local acetylation pattern of surrounding sequences can modulate replication activity [ENCODE Project Consortium, 2007; Karnani et al., 2007; Ag- garwal and Calvi, 2004; Knott et al., 2009]. However, genome-wide experiments have simply revealed global correlations and the observations taken from locus specific examples may not apply to all origins on a genome-wide scale. Finally, it is difficult to discern specific and non-specific effects from altering global patterns of histone acetylation. For example, deletion of the global histone deacetylase Rpd3 in yeast results in the earlier activation of a subset of late firing origins [Knott et al., 2009]; however, the global reduction in histone H3 acetylation levels may also impact the transcriptional control of key replication initiation factors. The chromatin modENCODE datasets derived from multiple cell lines [Kharchenko et al., 2010] provided a unique opportunity to identify chromatin factors and DNA binding proteins that might distinguish differential origin usage or ORC occupancy between cell lines. Using S2 and Bg3 cells, we first identified ORC binding sites that were only found in Bg3 cells and not S2 cells. Comparison of the chromatin signatures between Bg3 and S2 cells for those locations that were only bound by ORC in Bg3 cells revealed only subtle changes in chromatin factors between the lines (Figure 3.10A). For example, very modest decreases were observed for ISWI and RNA Pol II in S2 cells at the ORC locations that were only utilized in Bg3 cells. Similar modest decreases were also observed between S2 and Bg3 cells when ORC locations specific to S2 cells were examined. We also examined the chromatin signatures of early origins of replication in a similar manner (Figure 3.10B). Again, early origins specific to Bg3 cells exhibited

110 A B

BG3 Marks BG3 Marks S2 Marks S2 Marks BG3 ORC BG3 Early origins S2 ORC S2 Early origins ISWI ISWI RNA pol II MRG15 Chro GAF GAF RNA pol II MRG15 HP1c HP1c HP1 CP190 dRING dRING CTCF CTCF Su(var)3−7 Su(var)3−7 Ez HP1 Chro Ez Pc BEAF−70 CP190 PCL PCL JIL1 Su(var)3−9 Psc HP2 Pc mod2.2 HP2 JIL1 Su(var)3−9 BEAF−70 mod2.2 Psc Su(hw) Su(hw) H3K27ac H3K4me1 H3K4me2 H3K18ac H3K4me3 H3K27ac H3K18ac H3K4me2 H3K4me1 H3K36me1 H3K79me2 H4K16ac H4K16ac H2B−ubiq H3K79me2 H3K79me1 H3K79me1 H3K9me2 H3K23ac H3K9me3 H3K4me3 H3K36me3 H2B−ubiq H3K36me1 H3K9me2 H3K23ac H3K9me3 H3K27me3 H3K36me3 H3K27me3 log2 enrichment ORC

-2 0 2

FIGURE 3.10: Per-factor median signals in areas of differential ORC and early origins. Each factor was assayed under (A) ORC peaks and (B) early origin peaks from the union of the S2 and Bg3 ORC and early origin peak sets respectively. (A) ORC peaks that appear only in Bg3 cells were profiled using marks from Bg3 (first column) and S2 (second column), peaks that appear only in S2 cells were profiled using marks from S2 (third column) and Bg3 (fourth column), and peaks that appear in both cell lines were profiled using marks from BG3 (fifth column) and S2 (sixth column). (B) The same column pattern as in A is displayed, showing the enrichment for the factors within early origin peaks. In both panels, The green squares at the top act as a key to the identity of the column. All enrichments are shown from -2 to 2 median log2 enrichment. Factor enrichments were collected as in figure 2.

111 only subtle difference in chromatin signatures between Bg3 and S2 cells. A slight increase in repressive heterochromatin marks (H3K9me2 and H3K9me3) was ob- served at those sites in S2 cells. More dramatic differences in chromatin factors were observed for early origins specific to S2 cells. Specifically, a loss of chromatin remodelers, RNA Pol II and activating marks were observed in Bg3 cells at those locations. Together these results suggest that early origins specific to S2 cells are likely defined by increased activating chromatin marks, RNA Pol II and chromatin remodeling activities. However, Bg3 specific origins of replication exhibited much more subtle differences in the local chromatin environment between cell lines, suggesting that even minor changes in chromatin environment may impact origin usage. The fact that there was not a single chromatin mark or DNA binding protein that was highly correlative with early origin activation by itself prompted us to test the hypothesis that perhaps a complex integration of diverse chromatin marks and DNA binding proteins was defining a local chromatin ‘terroir’ favorable to origin usage. Our goal was to develop a predictive model of origin function based on chromatin and DNA binding protein input. Specifically, given the set of early origin meta-peaks spanning all cell lines, could we accurately classify cell line specific early origins based on the chromatin landscape and furthermore, could we also predict the change in relative strength or activity of early origins between cell lines? To simplify our description of the chromatin environment at early origins, we sought to identify representative classes of chromatin modifications and DNA bind- ing proteins that were highly correlated within early activating origins of replica- tion. We first constructed a correlation matrix between each of the normalized chromatin factors and DNA binding proteins from Bg3 cells, and using hierarchi- cal clustering (Figure 3.11) we identified 5 clusters of highly correlated chromatin

112 o ahcutr pcfial,w olpe h etr etro ak o each for marks of vector feature the collapsed signal we integrated Specifically, the to cluster. proteins each binding for DNA and modifications com- 39 the reduced the we of 2001], plexity Allis, and [Jenuwein code’ ‘histone the of redundancy origins. early potential by covered genome the of subset the bind- at and marks factors chromatin ing These of co-occurrence 4). of trends (cluster general together the clustered describe clusters also were were elements 3 insulator and Finally, 1 pro- clusters teins. binding contrast, heterochromatin and In marks chromatin 5). repressive and of 2 composed largely (clusters II Pol RNA binding as DNA and such ISWI, proteins as such remodelers nucleosome predom- marks, contained ‘activating’ clusters inantly these of Two heatmap). red-green 3.12, (Figure marks indicate boxes red the beneath numbers the indicate label. while boxes cluster red 4A, the The figure of cells. in matrix Bg3 shown correlation in clusters the meta-peaks the origin of early metric at Ward factors the chromatin using 36 clustering hierarchical A peaks. F IGURE ie h orltosbtencrmtnmrswti ahcutradthe and cluster each within marks chromatin between correlations the Given

Clusters Height .1 irrhclcutrn fcrmtnfcosa al rgnmeta- origin early at factors chromatin of clustering Hierarchical 3.11:

0 2 4 6 8

H3K23ac

5 H3K4me1 H3K36me1 H3K18ac BEAF_70 H3K36me3 JIL1 GAF H3K27ac ISWI HP1c H3K79me1 2

H3K79me2 F actor correlationdendogram H2B_ubiq H4K16ac Chro 113

F RNA_pol_II actors H3K4me3 MRG15 H3K4me2 HP2 HP1

3 Su_var_3_7 H3K9me2 Su_var_3_9 H3K9me3 Ez H3K27me3

1 Psc PCL dRING Pc CTCF

4 mod2.2 CP190 Su_hw_ Correlation Bg3 Chromatin -.76 0 1 5 + − 4 3

2

1

log2 enrichment

-2 0 2

FIGURE 3.12: Subsets of factors are highly correlated at early origins. The pairwise correlation between every factor based on their mean signal at early origin meta- peaks in Bg3 cells was computed (where green indicates a negative correlation and red indicates a positive correlation, with values ranging from -.76 to 1). Five groups of correlated marks were identified by hierarchical clustering.

meta early origin down to their mean signal across each of the clusters. A sum- mary of the log2 enrichments for each cluster as a function of origin activity in Bg3 cells is presented in Figure 3.12 (red-blue heatmap). Of the individual clusters, the mean signal of 2 and 5 correlated highly with early origin strength in Bg3 cells (Pearson’s r=.43 and .56 resp., Figure 3.13) while the other clusters showed only weak correlations. To determine if there is enough information contained in the local chromatin environment of the early origin meta-peaks to specify which will be active in a

114 ● ● r = 0.05 ● r = 0.48 ● 3 p <= 0.27117 ● 3 p <= 0 ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ●●● ●● ● ●● ● ●●● ●● ● ● ● ● 2 ● 2 ● ●●● ● ●● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ● ● ● ● ● ● ●●●● ●● ● ● ● ●●● ●●●● ●● ● ●●●●●● ● ● ● ●● ● ● ●●●●● ●● ●● ●● ● ●●●●●●●●●●●● ● ●● ●● ● ●●●●● ● ●●●●● ● ● ●● ●●● ●● ● ● ●●●●●●● ● ● ● ● ● ● ● ●●●●● ●● ● ●●● ●●●● ● ● ● ●● ●●●●● ● ●● ● ● ● ●●●●●●●●●●●●●● ● ●●●●●●●● ●●●●●●●●●●● 1 ● ●●●●●●●●● 1 ● ●●● ● ●●●●● ● ●●●●●●●●●●●●●●●●●●● ● ●● ●● ●●●●●●●● ●● ●● ●● ● ●●●●●●●●●●●● ● ●●● ● ● ●●●● ●●●● ● ●●●●●●●●● ● ●● ● ●●●● ● ● ● ● ● ●●●●● ●●●●● ● ● ●● ●● ●●●●●●●●● ● ● ●●●●●●●●●● ● ● ● ●●●● ●●● ●●●●●●● ●●●●●●●●●●●●● ●● ● ●●●●●●●●●●●● ● ● ●● ●●●● ●● ●● ●●● ●●●●●●● ●● ●●● ● ● Early origin activity ● ●●● ●●●●●●●●●●●●● Early origin activity ● ● ●●●●●● ●●●●●● ● ● ●●● ●●●●●●●●●●● ● ●●● ●●●●●●●●● ●● ● ● ● ● ●●●●●●●●●●●●●● ● ●●●●●●●● ● ●●● ●●●●● ●● ● ●●●●●●●●●●●●● ● ● ● ●●●●●●● ●●●●● ● ●● ● ● ●●●● ●● ● ●●●●●●● ● ● ● ●● ●●●●●●●●●●●● ●●● ●● ● ●● ●●●●●●●●●●●●●●●●●●● ● ● ●●● ●●●●●●●●●●●● ● ● ● ● ●●●●●●●●●●●●● ●●●●●●● ●●●● ● 0 ●●●●●●● ● ● ● 0 ●● ●● ●● ● ● ● ● ●●●●●●●●●●●●● ● ● ● ● ●●●● ●●●●●●●●●●●●●●● ●●● ● ●●●●●●●●● ●●●●●●●● ● ● ●●●●●●●●●●●●●●●●●●●●●● ● ●● ●●●●●●●●●●●● ● ● ● ●● ●●●●●●●●●●●● ●● ●●●● ●● ● ●●●●●●●●●●●●●●●●● ● ●●●●●●●● ●●●●●● ● ● ● ● ●●●●●●● ● ● ●●●● ●●●● ● ● ● ● ● ●●● ●● ●● ● ● ● ●● ● ● ● ● ●● ● ●

−2 −1 0 1 2 3 4 −2 −1 0 1 2 3 4 cluster_1 cluster_2

● ● r = −0.03 ● r = −0.11 ● 3 p <= 0.40529 ● 3 p <= 0.00522 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●●●●● ● ● ● ● ● ● ●● ● ● ● 2 ● 2 ● ● ● ● ● ● ● ●● ● ● ●● ● ● ●●● ● ● ● ● ●● ● ●● ●● ● ● ● ● ●●●●●●●● ● ● ●● ● ●●● ● ● ● ● ●●● ●●● ●●●● ● ● ●● ●● ●● ●●●● ● ● ●●●●●● ● ● ● ●●●●●●●●● ● ● ●●●●●● ● ●●●●● ● ●● ● ● ●●●●●●●● ● ● ● ●● ●●● ●●●●●● ● ● ● ●●●●●●●●●●●●● ● ●● ● ●●●●●●●● ●●● ● ●●●●●●● ●● ● ●●●●●●●● ●●●●● ● 1 ●●●●●●●●● 1 ●●●●●●● ● ● ●● ●●●●●●●●●●●●●●●● ● ●●●●●●●●●● ●●●● ● ●●● ● ● ●●●●●●●● ●● ● ●●● ●●●●●●●● ● ●● ● ●●●●● ● ●● ● ● ● ●● ● ● ●●●●●● ● ●●●●● ●●●●●● ● ●●●●●●●●●●● ● ● ● ●●●●●●●●●●●●● ●● ●● ●●●●● ● ●● ● ● ●●●●●●●●●●●●● ● ● ●●● ●●●●●●●●●●● ● ● ● ●●●● ●●●●●●● ● ● ● ●●●●● ● ●●● ●● ● Early origin activity ●● ●●●● ●●●●●●● ● Early origin activity ● ● ● ● ●●●●●● ●● ● ● ● ● ● ●●●●●●●●●●●●● ● ● ●●● ●●●●●●● ● ● ● ● ● ●●●●●●●●●● ●● ● ● ● ● ●●●●●●●● ●●● ● ●● ● ● ● ● ●●●●●●●●●●● ● ● ● ●●●●● ●●●●● ● ●●● ● ● ● ●●● ●●●● ● ● ● ● ●●●●● ●●●● ●● ● ● ●●●●●●●●●●●●●●●●● ●● ● ● ● ● ● ●●●●●●●●●●●●●●●●●●● ●● ●●● ● ● ● ●●●●●●●●●●● ●●● ● ● ●● ● ●●●●●●●● ●●●●●●●● ● ●●●● 0 ●● ●●●●●● ● 0 ● ● ●●● ● ● ● ● ● ● ●●● ●●●●●●●●●●●●●●●●●●● ● ●● ●● ●●●●●●● ●●●●●●●●●●●● ●● ● ●●●●●●●●●●●●●●● ●●● ● ● ● ● ● ●●●●●●●●●●● ●●● ●● ●● ●● ● ●●●●●●●●●●●●● ●●●● ● ● ● ● ●● ●●●●●●●●●● ● ●● ●● ●●●●● ● ● ● ●●●●●●●●●●●●● ● ● ●●●●●●●●● ● ●● ●●●●●● ● ● ●●● ● ●●●● ●● ● ● ● ●●● ●●●●● ● ●● ●● ●● ● ● ● ●● ●● ● ●● ● ● ● ●● ● ● ● ● ●

−2 −1 0 1 2 3 4 −2 −1 0 1 2 3 4 cluster_3 cluster_4

● r = 0.52 ● 3 p <= 0 ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ●● ● 2 ● ●● ● ● ● ● ● ●● ● ●● ●● ● ●● ● ●● ●●●●● ● ●●● ● ● ●● ● ● ●● ● ●● ● ● ● ●● ●● ●● ● ● ●●● ● ● ● ● ●● ● ● ●● ●●● ●●●● ●● ● ●● ●●●●●●●●● ● ● ● ● ● ●● ● ●●● ●●●●●● ●● ● ● 1 ● ●●●● ●● ● ● ● ●●●●●●●● ●● ●●●●● ● ● ● ●●● ●● ● ● ● ●● ● ● ●● ● ●●● ● ●●● ● ●● ● ● ●●●●●●●●● ●●● ●●●● ● ● ●●●●● ●●●●●●● ● ●● ●●●●●●●●●●●●● ●● ●●●●● ●● ●● ●● ●● ● Early origin activity ● ●●● ●●● ● ●●●●● ● ● ● ● ● ●●●●●●●●●● ● ● ●●● ● ● ●● ●●●●● ●●●● ●● ● ●● ● ● ● ●●●●●●●●●●●● ● ● ● ● ●● ●●●●● ● ● ● ● ● ●● ●●●●●●●●●●●●●●●●● ● ● ●● ●●●●● ●●●●●●●●●● 0 ● ●●●●●●● ● ● ● ●●●●●●●●●●●●●● ●●● ●●● ●● ● ●●●●● ●●● ●● ● ●●● ● ● ●●●●● ●●●●●●●●●●●●●●●●● ● ● ● ●●●●●●●● ●●●●●●● ●● ● ● ● ●●●●●●● ● ● ● ● ● ●● ● ●● ●● ● ●● ●

−2 −1 0 1 2 3 4 cluster_5

FIGURE 3.13: Chromatin factor clusters correlation with early origin strength. Pearson correlation of each chromatin factor cluster’s strength with the signal from the early origin meta-peaks in Bg3. Pearson’s r and the associated p-value are noted in the top right corner.

115 given cell line, we built a logistic regression model based on the chromatin clusters to classify the meta-peaks into Bg3-active and Bg3-inactive early origins. To avoid the unique and distinct chromatin signatures of the fourth and the X chromosomes (heterochromatin and hyperacetylation of H4K16, repsectively) we restricted our analysis to chromosomes 2L, 2R, 3L and 3R. Of the 594 meta early origins on these chromosomes, only 255 were utilized in Bg3 cells (Bg3-active) and the remainder (339) were specific to S2 or Kc cells (Bg3-inactive). Logistic regression allowed the classification of early origin meta-peaks into Bg3-active and Bg3-inactive classes based on a set of predictor variables (the mean cluster strength per meta-peak). We trained our logistic regression model on chromosomes (2L, 2R, and 3L) while holding 3R back for testing. Initially, we trained our model using one chro- matin cluster at a time, and found that, unsurprisingly, while all of the clusters were able to predict early origin usage more accurately than a random classifier (Figure 5B, inset, dashed line) clusters 2 and 5 on their own had the highest ac- curacy in predicting which of the early origins would be utilized on 3R (Figure 5B, inset). However, considering all of the clusters in a single logistic regression yielded a classifier with the highest accuracy of all ( 78%, p ¤ 2¢10¡16). We plot- ted the mean early origin score against the predicted probability of an early origin being Bg3-active, with the actual Bg3-active classifications shown in blue (Figure 3.14A). We next investigated whether we could predict the changes in early origin strength, as determined by mean BrdU incorporation, at early origin meta-peaks between cell lines based solely on chromatin information. We collected chromatin information for the early origin meta-peaks as above for both Bg3 and S2. We then assigned each meta-peak a feature vector containing the change in mean mi- croarray signal for each chromatin cluster between Bg3 and S2. Each meta-peak was also assigned a response variable based on the change in early origin strength

116 fotefrigrno rdcin ae nardcini otma squared mean root in capable reduction was a own on its based on predictions cluster random outperforming each of that their found gauge We to input power. the as predictive singly individual cluster each using models initial built first We 3R. chromatin in change the by strength. strength cluster en- origin data early these in to change applied of was prediction model the regression abling linear A cells. S2 and Bg3 between The (hori- random indicated. over are (inset). (RMSE) (blue) cluster error each Bg3 squared for or of mean line) function (red) Root dashed as zontal S2 0.7. plotted in lines. is is cell active correlation S2 predict two origins Pearson and to the Early Bg3 able between between is change. meta-peaks strength actual S2 origin origin and early in the Bg3 strength change of between in Predicted strength change clusters the in five using change from regression the Predictive signal linear chromatin A (gray). the regression. not of linear are by that cells are those S2 that and and meta-peaks (inset). dashed (blue) cluster those horizontal 3R each accuracy .5 for chromosome 78% power the on with below Bg3 respectively) and five in false (above the used classify and of to true each able of as is scores line cells chromatin Bg3 average regression. the in logistic using clusters by meta-peaks model total regression the logistic from A usage origin early Bg3 of sification F A IGURE Predicted early origin probability si h oitcrgeso bv,w rie n2,2 n Lwietsigon testing while 3L and 2R 2L, on trained we above, regression logistic the in As Off inBg3 On inBg3 .4 hoai intrsaepeitv feryoii activity. origin early of predictive are signatures Chromatin 3.14: Early originactivity

Logistic regression accuracy 80% 20% 40% 60% 0% (B) 78 Combined 59 Cluster 1 72 Cluster 2 rdcigrltv rgnsrnt ewe Bg3 between strength origin relative Predicting 56 Cluster 3 60 Cluster 4 75 Cluster 5 117 B Predicted change in early origin strength Bot Bg3 onl S2 onl h Change inearlyoriginstrength(Bg3−S2) y y

0.0 0.2 0.4 RMSE0.6 0.8 1.0 1.2 (A) .7

Combined .9

Cluster 1 .72

Cluster 2 .93 Clas-

Cluster 3 .92

Cluster 4 .76 Cluster 5 error (RMSE). However, not surprisingly, we found that changes in clusters 2 and 5 were the most predictive of changes in early origin strength (Figure 3.14B, inset). The full linear regression model, taking into account all five of the cluster scores for each early origin, reported the greatest reduction in RMSE, reducing it to  0.69. The Pearson correlation between the actual and predicted change in early origin strength for the full model was r  0.7. The actual change in chromosome 3R early origin strength against was plotted against the predicted change with the cell line specific origins represented by the color of the points (red: active in S2 only, blue: active in Bg3 only) (Figure 3.14B). Thus, differences in the local chromatin environment between the two cell lines could account for changes in relative origin activity.

3.3 AWG analysis

As part of the modENCODE analysis working group we performed a number of integrated analyses using data and members from all data producing and analy- sis labs. In particular, we worked with Peter Kharchenko in the Park lab to link chromatin state to novel biology with respect to the replication program. We also worked with the White lab to draw correlations between transcription factor bind- ing HOT spots and the replication program.

3.3.1 Chromatin state 3

A Hidden Markov Model (HMM) using a multivariate normal emission distribution to model the genome-wide tiling array signal was employed to segment the chro- matin data. It was trained using the Baum-Welch algorithm [Baum et al., 1970]. A nine state model was used, and the resulting states were then mined for enrich-

Section 3.3 is excerpted from: modENCODE Consortium, Roy S*, Ernst J*, Kharchenko PV*, Kheradpour P*, Negre N*, Eaton ML*, et al., (2010), “Identification of functional elements and regulatory circuits by Drosophila modENCODE,” Science, 330(6012), 1787-97.

118 FIGURE 3.15: Example of an intergenic state 3 segment. Enrichment of multi- ple chromatin marks were used to identify putative large (¡10 kbp) intergenic H3K36me1/H3K18ac domains located outside of annotated genes. Although these marks generally correspond to long introns within transcripts, their intergenic do- mains were enriched for replication activity (fig. S5). In this example from BG3 cells, such a domain was found upstream of the bi locus and is associated with early replication, contains an early origin, is enriched for ORC binding, and is fur- ther supported by NippedB binding.

ments of various biological phenomena [Kharchenko et al., 2010]. One of the most striking connections we uncovered was with states characterized by an enrichment for H3K36me1 and H3K18ac (state 3 [Kharchenko et al., 2010]). This state was often found marking introns, but a subset of the segments were found in intergenic areas (as an example, see Figure 3.15). These intergenic areas were enriched for early origin activity and replicated early in S-phase (Figure 3.16 middle and right) as well as being enriched for ORC binding [MacAlpine et al., 2010]. Interestingly,

119 these intergenic intron-like state assignments were also enriched for the cohesin binding partner Nipped-B (Figure 3.16 left), and the cohesin subunit Smc1 (data not shown). The cohesin data shown in these figures was taken from [Misulovin et al., 2008].

Replication and cohesin activity at >10kb intergenic H3K36me1 regions S2 random median signal BG3 random median signal 4 6 Array probe values probe Array 0 2 −4 −2

NippedB ChIP−chip Early origin activity Replication timing Figure S5 - Early origin activity, replication timing and cohesin binding in H3K36me1 intergenic regions.

FIGURE 3.16: Correlation between chromatin state 3, cohesin and replication. The large (¡10 kb) intergenic H3K36me1 domains were analyzed in BG3 (133) and S2 (78) cells for their replication and cohesin activity. In the first panel, quan- tile normalized Nipped-B S2 and BG3 ChIP-chip probe values [Misulovin et al., 2008] from these regions were graphed in a notched boxplot in S2 (blue) and BG3 (green). A background level of enrichment was determined by taking the median signal of 100 samples of sets of segments with identical width, number and chro- mosome distribution to the set of intergenic H3K36me1 domains and evaluating the Nipped-B enrichment within these segments. A representative sample set is shown as the boxplots of lighter shade, and the median background level across all of the random sample sets is shown as the dashed lines. In the second panel, quan- tile normalized early origin probes from these regions were treated analogously to the first panel. The third panel is analogous to the second using array probe values from quantile normalized replication timing tiling arrays.

120 Previous connections have been made between TSS-distal ORC and cohesin complex loading in D. melanogaster [MacAlpine et al., 2010]. The results here sug- gest that cohesin and early origin activity also co-localize. Interestingly, H3K36me1 and H3K18ac are moderately stronger in the promoter distal set of ORC binding sites than in the promoter proximal set (Figure 3.4C), implying that this combina- tion of H3K36me1 and H3K18ac are important determinants of cohesin loading, ORC binding and origin activity in intergenic regions, and that H3K36me1 and H3K18ac have dual roles, both in marking long introns and as marking intergenic areas of replication and cohesin activity. This serves as just one example of the integrative analyses that can be performed using the data generated by the mod- ENCODE consortium.

3.3.2 HOT spots

Highly occupied target regions (or HOT spots) have been demonstrated previously in the D. melanogaster genome [Moorman et al., 2006; MacArthur et al., 2009]. These are loci with evidence for a large number of transcription factors binding simultaneously. Given the lack of ORC sequence specificity in D. melanogaster, we investigated the possibility that ORC was able to follow the same binding signals as transcription factors, and that HOT spots would specify some subset of its bind- ing sites. First, our lab developed a method for calling HOT spot locations from 32 transcription factors using kernel density estimation and peak finding [N`egre et al., 2011], a strategy similar to that of F-seq [Boyle et al., 2008a]. HOT spots in general occupy intergenic regions and contain a significantly dense collection of transcription factors. We first investigated the nucleosome dynamics of these regions to see if they resembled the nucleosome dynamics seen at ORC binding sites in general (for ORC’s general preferences, Figure 3.4; for nucleosome dynamics of HOT spots,

121 Nucleosome density CATCH−IT

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1.0 ● ● ● ● 1.5 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.5 ● ● ● ● 1.0 ● 0.0

−0.5 0.5

● ● ● ● ● ● ● −1.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Log ratio enrichmentLog ratio ● ● ● ● ● ● enrichmentLog ratio ● ● ● ● ● ● ● 0.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −1.5 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −2.0 ● ● ● ● ● ● ● ● ● ● −0.5 ● 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 10 11 12 13 14 15

Inferred HOT complexity lower bound Inferred HOT complexity lower bound

H2Av H3.3

3 ● ● ● ● ● ● ● ● ● ● ● ● 1.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.5 ● ● ● 2 ● ● ● ● ● ● 0.0 1 −0.5

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 −1.0 ● ● ● ● ● ● ● ● ● Log ratio enrichmentLog ratio ● ● ● ● ● ● ● ● ● ● ● enrichmentLog ratio ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −1.5 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −1 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −2.0 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 10 11 12 13 14 15

Inferred HOT complexity lower bound Inferred HOT complexity lower bound

FIGURE 3.17: Nucleosome dynamics for HOT regions at varying complexity levels. In each panel, the x-axis denotes the lower bound of HOT complexity considered. For each value, a 2 kb region centered around HOT spots of sufficient complex- ity was examined for these four markers of nucleosome dynamics. The mean was taken for each HOT region. The data was taken from the following GEO accessions: Nucleosome Density: GSM333835, GSM333840 and GSM333844; CATCH-IT (20 min): GSM494310 and GSM494311, H2Av: GSM333849; H3.3: GSM333860 [Henikoff et al., 2009; Deal et al., 2010].

Figure 3.17). Figure 3.17 shows the effect on chromatin of sequentially increasing the lower bound of transcription factor complexity (the number of distinct tran- scription factors detected in a region) on the nucleosome dynamics data available from the Henikoff lab [Henikoff et al., 2009; Deal et al., 2010]. We found that the HOT spots of higher complexity were enriched in the hallmarks of histone turnover (H3.3, CATCH-IT), had low nucleosome density, while showing no enrichment for

122 H2Av. Interestingly, this is a very similar nucleosome dynamics profile to that seen in promoter distal ORC binding sites (Figure 3.4C), with its neutral H2Av levels distinguishing it from the promoter proximal set.

TF Complexity B 1 5 10 14

16 14 12 10 8

C binding 6 R 4 O 2

FIGURE 3.18: ORC is enriched at TF HOT spots. The relation between HOT spot complexity (x axis) and enrichment in ORC binding (y axis). ORC enrichment was calculated as a log ratio of the number of ORC binding sites over random for each TF complexity level.

We then looked at the interplay of ORC and HOT spot complexity. We found that as the complexity of the HOT spot increases, the likelihood that ORC will be associated increases monotonically (Figure 3.18). It will be very interesting to see how far HOT spot localization goes to explaining promoter distal ORC binding, and if HOT spot localization represents non-specific ORC binding while a more specific signal (either sequence or chromatin) lies in the non-HOT spot ORC binding sites. Additionally, investigating the mechanism of HOT association should be a high pri- ority – whether it is simply opportunistic binding to open and available chromatin, or if one or more of the transcription factors is acting to recruit ORC to these spots.

3.4 Discussion

Although much progress has been made in understanding how the structure and organization of chromatin regulates the transcription program, we know very little about how start sites of DNA replication are selected and regulated in the context

123 of chromatin. Here, we have used multiple modENCODE datasets to characterize the Drosophila replication program in the context of the surrounding chromatin environment. The computational integration of these diverse data types across multiple cell lines has revealed new insights into how the chromatin landscape influences the selection and regulation of replication origins. Importantly, these insights have allowed us to generate accurate predictive models on the regulation of specific replication origins between cell types. Potential origins of replication are established in highly dynamic and accessi- ble regions of the genome. We have previously shown that Drosophila ORC bind- ing sites are surrounded by actively transcribed genes, enriched for the histone variant H3.3 and depleted for bulk nucleosomes [MacAlpine et al., 2010]. Un- derscoring the importance of nucleosome occupancy in the selection of replication origins, recent studies in S. cerevisiae have shown that nucleosome occupancy is a determinant for ORC binding and that the precise ORC-dependent positioning of nucleosomes flanking the origin is critical for origin function [Eaton et al., 2010; Berbenetz et al., 2010]. In both yeast and Drosophila, the nucleosomes surround- ing potential origins of replication marked by ORC are dynamic and undergo rapid nucleosome exchange [Kaplan et al., 2008; Deal et al., 2010]. This chromatin turn-over may involve specific chromatin remodeling activities. Consistent with the idea of chromatin remodelers altering chromatin organiza- tion at origins of replication, we found that Drosophila ORC binding sites are highly enriched for the ATP-dependent chromatin remodeler ISWI. ISWI can act as part of the NURF complex, specific subunits of which were also enriched (NURF301) [Langst¨ and Becker, 2001]. The chromatin binding proteins WDS and GAF were also enriched at ORC binding sites and have both been implicated in facilitating nu- cleosome dynamics [Petesch and Lis, 2008; Suganuma et al., 2008]. Finally, DNA binding proteins with chromatin remodeling activities or functions were among

124 the most discriminatory features for ORC binding in our SVM analysis. We propose that active chromatin dynamics facilitated by remodeling activities will be a con- served feature of replication origins in all eukaryotes. In support of this hypothesis, mammalian replication origins are also enriched in the vicinity of active promoters [Cadoret et al., 2008; Sequeira-Mendes et al., 2009; Karnani et al., 2010]. In S. cerevisiae, the precise nucleosome positioning observed at origins of repli- cation can be reconstituted in vitro with purified ORC, recombinant histones and ISWI [Eaton et al., 2010]. At this point, we do not know if ORC is directly in- teracting with remodeling enzymes and recruiting them to origins of replication or alternatively, if ORC is being recruited to dynamic and open chromatin facili- tated by the chromatin remodeling activity. Future experiments will address these questions. The enrichment of ORC at the promoters of actively transcribed genes is likely a consequence of the open chromatin environment that exists at transcription start sites. Indeed, those gene promoters that are coincident with ORC are much more actively expressed than their non ORC-interacting complement (Figure 3.2). The apparent exclusion of ORC from internal exons within the gene body suggests that active transcription is disruptive to ORC binding, consistent with prior yeast results [Mori and Shirahige, 2007]. Finally, it is tempting to speculate that the negative super-coiling generated upstream of actively transcribed genes may also contribute to ORC localization [Remus et al., 2004]. Despite the simplicity and mostly invariant nature of the underlying DNA code, the transcription program is able to respond to almost limitless developmental and environmental perturbations. It is the complex interactions between regulatory networks of chromatin modifications, transcription factors, nucleosome positioning and DNA accessibility that provide for the remarkable plasticity in gene expression derived from the genome. It is becoming increasingly clear that the replication pro-

125 gram responds to many of the same epigenetic cues that regulate the transcription program. For example, X chromosome dosage compensation in the fly results not only in the H4K16ac dependent transcriptional up regulation of X-specific genes in male cells [Laverty et al., 2010], but also a sex chromosome specific change in replication timing [Schwaiger et al., 2009]. Similarly, we find that ORC, early origins and early replicating regions of the genome are highly enriched for ac- tivating chromatin marks (H3K18ac, H3K27ac, etc). This coordination between the transcription and replication programs is likely critical for the expression and inheritance of genetic and epigenetic information. The integration of multiple modENCODE datasets across three different Dro- sophila cell lines has allowed us to generate predictive models of ORC binding, origin usage and origin strength. Sequence, chromatin modifications and DNA binding proteins all contribute in an additive manner to our ability to identify ORC binding sites in the genome. Additionally, our data suggest that the competency of a genomic location to replicate in the presence of HU (i.e. to act as an early origin of replication) is determined by the local chromatin landscape. By integrating the chromatin signals near early activating origins of replication, we were able to use logistic regression models to classify with a high degree of accuracy ( 78%) which potential early origins would be utilized within a cell line. Not only were we able to predict origin utilization, but the same chromatin marks were also used to build linear regression models that could predict the relative strength or activity of early origins between two cell lines. Our ability to predict the strength of origin usage between cell lines based on the surrounding chromatin environment suggests that the chromatin environment acts as a rheostat and not a binary switch. That is, the signals from multiple activating and repressive chromatin marks are assimilated to provide a relative index of the potential for a sequence to function as an origin of replication. Finally, although we describe the replication program as responding to

126 the local chromatin environment, there is increasing evidence that the replication program is also important for epigenetic memory [Zhang et al., 2002a; Lande- Diner et al., 2009].

3.5 Methods

3.5.1 Cell growth

Kc167 and S2-DRSC cells were cultured in 150-mm plates in Schneider’s Insect Cell Medium (Invitrogen) supplemented with 10% FBS and 1% penicillin/strepto- mycin/glutamine (Invitrogen). ML-DmBG3-c2 cells were cultured as above with 10ug/ml human insulin (Sigma). All cell growth was conducted at 25¥C. Cell cycle position was determined by flow cytometry.

3.5.2 ChIP-seq

Chromatin immunoprecipitations were performed as in MacAlpine et al.[2010] using a poly clonal ORC2 antibody [Austin et al., 1999] Sequencing libraries were generated using the ChIP-Seq sample prep kit and protocol: (Illumina, http://grcf.jhmi.edu/hts/protocols/11257047 ChIP Sample Prep.pdf). Libraries were sequenced on a GAII Illumina sequencer and processed using SCS2.6 software.

3.5.3 Read mapping and peak calling

MAQ [Li et al., 2008] was used to map the reads back to release 5 of the D. melanogaster genome. Reads with a quality score ¥ 35 were considered in the subsequent analysis to filter out the reads that could not be mapped uniquely. ORC ChIP-seq peaks were called using Peakseq [Rozowsky et al., 2009] with default parameters. Input sequencing libraries were used to control for sequencing spe-

127 cific biases. Replicates were combined by intersection; overlapping peak calls were considered verified and were reduced in a per-nucleotide union. Replicates passed quality control if 80% of the top 40% of peaks (by strength) in one experiment existed in the other and vice-versa. Whole genome background subtracted density tracks were produced by the R package SPP [Kharchenko et al., 2008].

3.5.4 Replication timing

Approximately 2.7¢108 Kc 167, S2-DRSC or ML-DmBg3-C2 cells were treated with HU to a final concentration of 1mM and allowed to incubate for 3h. BrdU was then added to 50 µg/ml final concentration to half the cells and both cell populations incubated for an additional 21h. Cells treated only with HU are the late timing samples, while cells treated with HU and BrdU are considered the early timing samples. All cells were then washed 1x with cold 1x PBS and replated. Early timing cells were treated with additional 50µg/ml BrdU for 1h and then harvested for DNA extraction. The late timing cells were incubated for 4h after washing followed by a 2h incubation with 50ug/ml BrdU before being harvested for genomic DNA extraction.

3.5.5 Early origin mapping

Approximately 3.5¢108 Kc 167, S2-DRSC or ML-DmBg3-C2 cells were treated with hydroxyurea (HU) (Sigma) to a final concentration of 1mM and incubated for 3h. 5-bromo-2-deoxyuridine (BrdU) (Roche) was then added to a final concentration of 50µg/ml for 18-20h. Cells treated solely with HU served as the control sample. Cells were harvested, centrifuged at 1,000g for 5min and washed once with cold 1x PBS, and the resulting pellets were used for genomic DNA extraction.

128 3.5.6 DNA extraction

For early origin experiments, pellets were resuspended in 2ml 10mM Tris pH 9.5. 2ml NDS (10mM Tris pH 9.5, 500mM EDTA pH9.5, 1% SDS); after addition of 0.5ml 20mg/ml Proteinase K were added, samples were inverted to mix and in- cubated at 37°C for 2h. Samples were phenol/chloroform extracted twice and al- lowed to precipitate overnight at 25°C. DNA was pelleted at 3,500rpm for 12min, washed in 70% ethanol and resuspended in 4ml 1x TE. 20µl 10mg/ml RNaseA was added to samples and allowed to incubate at 37°C for 1-2 h. DNA was precipitated with 3M NaOAc and cold ethanol and resuspended in 500ul of 1xTE. DNA was sheared to an average size of 1kb. For timing experiments, DNA was extracted as above, but in the following volumes: 0.7ml 10mM Tris pH 9.5, 0.7ml NDS, 0.1ml 20mg/ml Proteinase K. DNA was resupended in 250µl 1x TE.

3.5.7 BrdU immunoprecipitation

50mul Dynabeads M-280 sheep anti-mouse IgG beads (Invitrogen) per sample were washed 3x in 1ml cold 1x PBS/5 mg/ml BSA using a magnetic concentrator and resuspended in the same. 5 µl of 0.5mg/ml anti-BrdU antibody (Roche) was added and allowed to incubate with rotation overnight at 4¥C. Beads were washed as above and resuspended in 50ul of the same. 15µg of DNA was added to new tubes, incubated at 100¥C for 5min and cooled on ice. 450µl RIPA (50mM Hepes- KOH, pH 7.6, 500mM LiCl, 1mM EDTA, 1% NP-40, 0.7% Na-Deoxycholate) was then added to the DNA with the prepped beads and allowed to incubate overnight with rotation at 4¥C. Beads were collected using a magnetic concentrator and the supernatant saved as the INPUT. Beads were washed 4x in 1 ml cold RIPA and 1x ub 1 ml 1xTE with 5 min room temperature rotations between washes. Beads were resuspended in 150ul TE/1%SDS and incubated at 65¥C for 10min with 2-3

129 quick vortexes. Supernatants were saved to which 150µl of TE, 300ng/µl glycogen, 1µg/µl proteinase K were added and incubated for 1h at 37¥C. 12µl of 5M NaCl were added to samples and subsequently phenol/chloroform extracted. DNA was resuspended in 15µl dH2O.

3.5.8 Array hybridization and analysis

Labeling of DNA was performed as in MacAlpine et al.[2010]. All experiments were performed using biological triplicates. Labeled DNA was hybridized on cus- tom whole genome, 244K tiling microarrays (Agilent). Slides were hybridized and washed as per Agilent recommendations. The array data was processed and ana- lyzed as previously described [MacAlpine et al., 2010]. The p-value cited for the number of early origin peaks overlapping an ORC peak was derived by generating R  100, 000 sets of random segments that mirrored the early origin peaks in width, number and chromosome membership and counting how many of those segments overlapped an ORC peak. We then found n, the number of samples in this bootstrap distribution which had greater than or equal to the number of ORC peaks overlapped in the early origin meta-peak set. Then,  n 1 p R 1 .

3.5.9 Meta-peaks

To construct a set of regions in the genome with the potential to bind ORC or to host early origin activity we constructed a set of ORC meta-peaks and early origin meta-peaks respectively. These meta-peaks are contiguous nucleotides that were covered by a peak in at least one cell line. To conservatively account for the possibility of one peak in a particular cell line overlapping multiple peaks in another cell line, we built the Venn diagram using the set of ORC meta-peaks, such that multiple peaks in a meta-peak were scored as a single overlap.

130 3.5.10 ORC at genic loci

Each annotated MB8 gene model [Graveley et al., 2010] was binned into seven features: 500bp upstream of the promoter, the first exon, the first intron, the in- ternal (neither first nor last) introns, the internal exons, the last exon and 500bp downstream of the last exon. For genes with multiple transcripts, any possible exon was considered, and overlapping exons were merged. The log ratio of the mean background subtracted ChIP-seq tag densities [Kharchenko et al., 2008] to a paired a set of random genomic loci of the same width, cardinality, and chromo- some distribution was calculated for each bin.

3.5.11 SVM analysis

SVM analysis was performed on the set of annotated ORC binding sites for each cell line using the LIBSVM suite [Chang and Lin, 2001]. The SVM was trained and tested on a positive and negative sequence set. The former consisted of 500 nt se- quences centered on the ORC binding sites. The latter (random set) contained an equal number of 500 nt sequences selected randomly from the genome, with the condition that they exclude the positive set and that the chromosome frequency and proportion of promoter proximal instances match that of its positive coun- terpart. Three feature sets were used: sequence 1–6mers, chromatin marks and protein binding sites. The k-mer feature values were compiled as frequencies of each k-mer in each 500 nt sequence (counting each k-mer and its reverse comple- ment as the same feature), then scaled relative to the rest of the sequence set to the range 0 ¤ x ¤ 1. Both the chromatin marks and protein binding sites were given binary values 0,1 depending on whether they fell on each 500 nt sequence. The entire feature set was then split into a training set consisting of sequences on chro- mosomes 2L, 3L and 3R, and a testing set containing sequences on chromosome

131 2R. The SVM was then trained on the training set with 10-fold cross validation. Testing was performed on 2R and the performance of the classifier was evaluated based on the resulting ROC curve. This analysis was carried out using each of the three feature types individually, as well as all the features simultaneously. In addition, the discriminative power of each individual feature was determined by ranking them based on their F-score, and their predilection towards one class over another determined by the class proximity, defined here as t-statistic obtained from a Student’s t-test comparing feature counts in the positive and negative sets.

3.5.12 Motif analysis

Motifs were discovered on the 500 nt sequences containing the ORC binding sites via MEME [Bailey and Elkan, 1994], then ranked according to their p-value. The top 3 motifs were then recovered from the whole genome using MAST [Bailey and Gribskov, 1998]. The motif loci delivered by MAST were then ranked by p-value, and those that fell on the ORC binding site-containing sequences were marked. The presence of these motifs was correlated to ORC binding sites by generating a ROC curve denoting the false and true positive rate of their appearance in the ORC set over random genomic loci.

3.5.13 Chromatin enrichment heatmaps

The list of 39 factors from the Karpen group [Kharchenko et al., 2010] that we included was: Ez, MRG15, H3K36me3, Su(var)3-9, PCL, Chro, H3K27ac, Psc, H3K79me1, H3K9ac, RNA pol II, H3K27me3, H4K16ac, CTCF, ISWI, H3K4me2, JIL1, HP1c, HP1, WDS, H3K4me1, H3K36me1, HP2, H3K9me2, dRING, CP190, H3K4me3, H3K79me2, BEAF-70, H3K9me3, H3K18ac, GAF, H3K23ac, mod2.2, NURF301, Su(hw), Pc, H2B-ubiq, and Su(var)3-7. We also used cell-line specific RNA-seq data from the Celniker group (Graveley et al., Submitted), and the 20

132 minute CATCH-IT data, H2Av, H3.3 and nucleosome density data from the Henikoff group in S2 cells [Deal et al., 2010; Henikoff et al., 2009].

ORC and early origin heatmaps

Each factor available for comparison (nucleosome dynamics, RNAseq, chromatin binders and histone marks) was quantile normalized per-probe or per-transcript between cell lines. Each factor was already mean shifted and loess smoothed ap- propriately by its originating group. Each factor’s distribution was then scaled to unit variance, and the median of the scores in 1 kb windows centered on ORC peaks or directly under early origin peaks was calculated for the ORC matrix (3C) and the early origin matrix (3B) respectively.

Replication timing heatmap

To generate the replication timing correlations for the nucleosome dynamics, chro- matin mark and factor data, each replication timing probe was paired with a collec- tion of factor probes which it overlapped. These factor probes were averaged per timing probe, and a Spearman’s ρ correlation was taken between the timing probe values and the mean factor probe values. For the RNA-seq data, each transcript was assigned a timing score based on the median probe value of the timing probes which it overlapped, and a Spearman’s ρ was taken of these paired samples.

3.5.14 Regression Analysis

The early origin meta-peak set was created by taking a per-nucleotide union of the three cell lines’ early origin peak sets and then combining contiguous nucleotides into meta-peaks. This yielded a set of 823 ‘early origin meta-peaks’. These early origin meta-peaks were given a score for each cell line corresponding to the mean early origin microarray signal within the meta-peak. Each meta-peak was also

133 given a vector of scores corresponding to the microarray signal for each of the 36 factors produced by the Karpen group [Kharchenko et al., 2010]. In selecting factors from the Karpen group, we limited ourselves to those that were common between Bg3 and S2. The list of factors was as follows: Ez, MRG15, H3K36me3, Su-var(3-9), PCL, Chro, H3K27ac, Psc, RNA pol II, H3K79me1, H4K16ac, CTCF, ISWI, H3K4me2, JIL1, HP1c, HP1, H3K4me1, H3K36me1, HP2, H3K9me2, dRING, CP190, H3K27me3, H3K79me2, BEAF-70, H3K4me3, H3K9me3, H3K18ac, GAF, H3K23ac, mod2.2, Su(hw), Pc, H2B-ubiq, and Su(var)3-7. Every factor was quan- tile normalized between cell lines and then mean centered and divided by the standard deviation (z-score transformation). To produce the correlation heatmap in Fig 5A, a pairwise correlation matrix was constructed using Pearson’s correlation (r), and then clustered, using p1¡rrsq to populate the distance matrix, with Ward’s method for hierarchical clustering [Ward, 1963]. The enrichment heatmap in Fig- ure 3.12A represents the mean enrichment for all Bg3 factors in a cluster within Bg3 active (+) and Bg3 inactive (-) early origins.

Logistic regression

The early origin meta-peak set was split into a training set (chromosomes 2L, 2R, and 3L; 441 meta early origins) and a test set (3R; 153 meta early origins). Each early origin meta-peak was given a set of five values corresponding to the mean of the factors within each of the five clusters. Each meta-peak was also given a logical response variable which indicated early origin activity in Bg3 cells (i.e. true if the early origin meta-peak overlapped a called early origin peak from Bg3). We then used logistic regression to regress from the mean cluster scores to the Bg3- active response variable using the glm function of R [R Development Core Team, 2008] with a binomial logit link function. The model parameters were fit using 100-fold cross validation. This regression was then put through a step-wise model

134 selection process minimizing the Akaike information criterion (AIC), during which clusters 1, 2 and 5 were selected for the final model. Accuracy was gauged on 3R as (True Positives + True Negatives) / n, where n was the total number of early origin meta-peaks.

Linear regression

The early origin meta-peak set was split into training and test sets as above. The response variable in this case was the difference in mean microarray signal within each early origin meta-peak between Bg3 and S2. Likewise, the five predictor vari- ables took the form of the difference in mean signal strength between each of the five clusters between Bg3 and S2. We then used standard linear regression using all five clusters with parameters again fit by 100-fold cross validation to predict the difference in early origin strength via the difference in cluster strengths. Accuracy was judged by a reduction in root mean squared error (RMSE) and compared to a background RMSE determined by the mean RMSE of 1000 random permutations of the pairing between the predicted change in early origin strength and the actual change in early origin strength.

3.5.15 Data accession

All data has been deposited at GEO. GSE17281, ML-DmBG3-c2 Replication Timing; GSE17279, Kc167 Replication Timing; GSE17280, S2-DSRC Replication Timing; GSE17287, ML-DmBG3-c2 Replication Origins; GSE17285, Kc167 Replication Ori- gins; GSE17286 S2-DRSC, Replication Origins; GSE20888, ORC2 ML-DmBG3-c2 ChIP-Seq; GSE20889, ORC2 KC-167 ChIP-Seq; GSE20887, ORC2 S2-DSRC ChIP- Seq.

135 4

Conclusion

La rˆeve d’une bact´erie doit devenir deux bact´eries. (The dream of a bacterium is to become two bacteria.) –Franc¸ois Jacob

4.1 Summary and Discussion

In 1963, when Franc¸ois Jacob and Sydney Brenner, on vacation in the south of France, first drew the simple replicon model of replication control in the sand, they could not have envisioned the lengths to which evolution would need to go to adapt their model to eukaryotic organisms. Evolution has extended Jacob’s replicon model to be compatible with gigabase–scale genomes, developmental stages, and complex replication error control systems. As eukaryotic organisms have evolved, they have strayed further from a sequence-based specification of Jacob’s “repli- cator”, toward a mutable cis-acting signal that can respond to the ever-changing demands of the cell. In 1963, the characterization of the basic unit of chromatin, the nucleosome, was still ten years away, and there was no way to know that the contribution of chromatin to the replicon model would prove to be profoundly complex. Describing fully the the role that chromatin plays in the tightly orches-

136 trated process of the DNA replication program is critical to our understanding of eukaryotic DNA replication program and, perhaps more importantly, how errors in the regulation of this program lead to genomic instability. In the past ten years, with the realization that a sequence-based cis-acting regu- lator of origin localization may never be found [Gilbert, 2004], and with very little evidence for a sequence-based method for controlling replication timing in any or- ganism, the focus has shifted toward chromatin. Chromatin, with nucleosome posi- tioning, histone variants and modifications, and sub–nuclear localization, promises to be a rich palette with which to paint the eukaryotic DNA replication program. Its most attractive aspect is that it is mutable without affecting genomic sequence, meaning that it can respond to the variety of scenarios presented by the metazoan cell cycle while still allowing replication to make exact duplicates of the genome. It also holds a fascinating connection with replication, as the doubling of the genome (and thus the chromatin) gives the cell a window of opportunity to perturb the system. The basic unit of chromatin is the nucleosome, 147 bp of DNA wrapped around a symmetrical histone octamer core. Nucleosome positioning has vast regulatory potential, as it protects the DNA that it wraps from binding by trans-acting proteins. Conversely, through stable positioning on either side of an NFR, nucleosomes may specify sites with positive regulatory potential [Liu et al., 2006]. In the budding yeast S. cerevisiae, stably positioned nucleosomes had been shown to be important at the replication origin ARS1 [Simpson, 1990; Lipford and Bell, 2001]. Through a combination of MNase-seq, ChIP-seq, and sequence analysis, we have mapped approximately 250 origins of replication down to the base pair, and mapped nu- cleosome positions genome-wide. We have thus demonstrated that a stable 125 bp NFR is a ubiquitous feature of replication origins in S. cerevisiae. Further, we have shown that nucleosome depletion at these loci is promoted by sequence fea-

137 tures surrounding the origins, and that once bound ORC plays a critical role in stably positioning nucleosomes on both sides of the NFR. Taken together with the observation that interactions with nucleosomes are important for pre-RC licensing [Lipford and Bell, 2001], these data imply that nucleosomes play an active role in the S. cerevisiae replication program. This role is twofold; first, the nucleosomes work to specify ORC binding sites by exposing the ACS motif matches that ORC can bind, and second the nucleosomes, with ORC’s help, take up positions advan- tageous for pre-RC formation. These results indicate that with the combination of sequence and nucleosome positioning, the genome’s pre-RC locations could be fully specified. At first glance, this doesn’t seem to provide much flexibility for a changing genomic landscape. As a unicellular organism, S. cerevisiae’s replication program is not required to re- spond to developmental signals. However, it is clear that S. cerevisiae does exhibit some plasticity in its replication program. For example, the gene MSH4 is only transcribed early in meiosis. In the mitotic S-phase when MSH4 is silent, a pre-RC forms within the MSH4 gene body and hosts robust replication initiation. However, in the meiotic S-phase when MSH4 is up-regulated, pre-RC formation is abolished, and inducing MSH4 expression in the mitotic S-phase has the same effect [Mori and Shirahige, 2007]. Although the assumption that the transcriptional machinery and the replication machinery are incompatible may well , our work suggests a second hypothesis which was mentioned briefly in the discussion of [Mori and Shirahige, 2007]. Specifically, since active transcription strongly positions nucleo- somes over the gene body [Radman-Livaja and Rando, 2010], the nucleosomes of the origin may be repositioned in such a way that they can no longer conform to the pattern of origin nucleosome occupancy described above. In this way, nucleo- some positioning may have a built-in mechanism to regulate origin localization in response to changing requirements of the cell.

138 As part of the modENCODE consortium, we created datasets describing the DNA replication program in the three different cell lines. We found that we re- capitulated previous results in that the majority of ORC binding sites are close to transcription start sites, there is a strong correlation between replication timing and gene transcription, and that there is no clear DNA binding motif for ORC in Drosophila, even with ChIP-seq resolution. performing these experiments in three different cell lines afforded us an opportunity to see how unique the replication programs of different cell types might be. We found that there was considerable overlap between the replication programs, but that there were also significant cell type specific differences. On top of the DNA sequence of the genome sits another layer of information: the “histone code”. This is the combination of chemical modifications to histone tails and histone variants substituted into the octamer core that specifies the config- uration of each nucleosome and thus the “chromatin signature” of any given region of the genome. Recently, chromatin maps have been produced in the fruit fly that supply information on the bulk of the chromatin signature in a number of different cell lines and developmental stages [modENCODE Consortium et al., 2010]. These provided us with a chance to compare the chromatin properties of the genome with the cell’s strategy to replicate the genome in our three D. melanogaster cell lines. We profiled each chromatin mark, binding protein, and indicator of nucleo- some dynamics with respect to ORC binding, early origin activation and replication timing. We found that in general euchromatic marks were indicative of early repli- cation whereas heterochromatic marks are anticorrelated with replication timing. We also found that early origin activity tracked with generally euchromatic marks. Finally, we saw promoter and enhancer marks tracking well with ORC binding, but that nucleosome dynamics (low nucleosome density, high histone turnover) and chromatin remodelers such as ISWI were even more strongly associated. In

139 attempting to predict replication dynamics from chromatin information, we found that ORC’s binding sites were better specified by chromatin marks and chromatin binding proteins than by sequence, and using all three of these feature classes we built a support vector machine with a high level of accuracy in telling ORC binding sites apart from background loci. We were further able to predict with a high level of accuracy the potential of an origin to fire before an early S-phase arrest, as well as how much early origin activity would change between two cell lines, all based on the change in the chromatin environment. This provides the field with evidence that not only is there enough information embedded in the chromatin to specify origin binding sites, but there is also enough information to specify origin initiation times. Taken together, these results indicate that chromatin plays a primary role in specifying the eukaryotic DNA replication program. Many of the results that we attained in yeast may hold in fly and vice-versa. For example, our data indicate that nucleosome positioning is a major determinant of origin localization in yeast, and we found that nucleosome positioning proteins were the single most predictive chromatin property for ORC binding in fly. We also found in fly that origins are highly enriched for both a low nucleosome density and active histone turnover, and given the strong NFR at origins and the observation that high histone turnover is a common feature of most (if not all) NFRs [Dion et al., 2007], one could imagine that NFRs may be a ubiquitous feature of fly origins as well. Unfortunately, in linking these two studies we are faced with a certain data asymmetry. We have nucleosome positions and ORC binding sites down to the base pair in yeast but in the fly we are hindered by a lack of precise nucleosome data and the lack of a Drosophila ORC sequence motif respectively. Without a sequence motif we cannot determine ORC’s exact binding site within our ChIP-seq peaks. Ironically, in the much smaller yeast genome, we lack high-resolution chromatin

140 data for most of the histone marks, making ensemble prediction approaches such as those we employed in Drosophila impossible. If these data were to be generated, it would be fascinating to see how many of the conclusions we drew in the two studies can cross species. Our initial disappointment at being unable to find an ORC DNA binding motif in D. melanogaster analogous to that of S. cerevisiae has been tempered by the real- ization that the DNA replication program of higher eukaryotes has evolved a high degree of adaptability. Our results imply that sequence and nucleosomes specify the localization dimension of the replication program of S. cerevisiae, but altering origin localization by altering sequence is untenable, and altering nucleosome posi- tioning is possible (see above) but cumbersome and constrained by the positions of the surrounding nucleosomes. However, chromatin modifications provide a flexi- ble medium for specifying origin localization and function. Thus, as developmental requirements become more stringent, one could imagine that any sequence-based requirements would be eased in favor of a chromatin-based specification. The potential reliance of the metazoan replication program on the same chro- matin marks that underpin the transcription program, and the mounting evidence that replication time and DNA mutation rate correlate [Stamatoyannopoulos et al., 2009; Pink and Hurst, 2010; Chen et al., 2010], hint at an elegant method for controlling mutation rate. Over many generations, as a new gene evolves and becomes more transcribed, its chromatin state would change to reflect abundant transcription, and thus it may begin to replicate earlier, reducing its mutation rate. Then, if the gene should lose relevance over time, its transcription rate would fall and thus its replication time may get later. As it does, its mutation rate would in- crease, thereby increasing the chances that an advantageous mutation would bring the gene back into relevancy. In this way, replication timing may be critical to the process of evolution. It also raises the question of wholesale mis-regulation of

141 replication timing, and if such a change could bring about mutations on a short enough time-scale to bring about somatic mutations that could lead to genomic instability and cancer. Given that evolution is driven by genetic inheritance, a moment’s introspection will lead to the realization that arguably the first function ever evolved had to be a mechanism for duplication of genetic content. Thus, the replication program has been evolving along with life itself since the beginning. Until we understand DNA replication, we cannot fully understand the basis of genetic inheritance. I have added here a small contribution to our understanding of eukaryotic DNA replica- tion and its regulation program. There is a great deal more to learn, however, and research into the replication of DNA, and indeed the replication of chromatin, holds great promise and many exciting surprises.

4.2 Future Directions

There are many potential ways to expand upon this research, and I have listed some here in no particular order. The full specification of the chromatin environment surrounding yeast origins would allow us to determine if the chromatin environment of ORC binding sites is as fully specified in yeast as in the fly, or if, as we might hypothesize, the full specification of origin localization by sequence and nucleosomes reduces the repli- cation program’s reliance on chromatin marks. Further, description of the yeast chromatin environment would allow us to see the extent that chromatin plays a role in origin activation, for which there is little if any sequence determination in yeast. Along the same lines, a whole-genome map of precise nucleosome positions in D. melanogaster would allow us to see if ORC binding and pre-RC formation in the

142 fly requires a NFR as well specified as that of yeast, or if nucleosome depletion is sufficient. The data generated through the modENCODE project is at the level of tiling arrays showing enrichments for nucleosome occupancy, rather than precisely positioned nucleosomes based on sequencing data. Although some paired end nu- cleosome sequencing data has been reported for D. melanogaster [Mavrich et al., 2008b], the coverage is quite sparse. Pursuant to this end, it might be possible to build an ensemble model of nucleosome occupancy by building a predictive model (an HMM may be a good choice here) based on chromatin dynamics experiments, i.e. by incorporating data from Dnase I, nucleosome occupancy and primary se- quence. It might also be interesting to use an algorithm which aligns nucleosome profiles based on MNase tiling arrays [Lai and Buck, 2010] to align the regions un- der the Drosophila ORC peaks. If there is a distinctive nucleosome architecture in these regions, such as a well defined nucleosome free region, the algorithm should be able to both leverage that in its alignment and emphasize it as a feature. If there is a strong NFR in these regions, it would lend credence to the idea that nucleo- some positioning is as important at replication origins in D. melanogaster as it is in S. cerevisiae, as suggested by the large enrichment of chromatin remodelers under the ORC peaks. The number of chromatin marks being profiled by modENCODE is still increas- ing. The chromatin data included in Chapter 3 was what was available as of a consortium-wide data freeze in April of 2010 (to make sure we were all working on the same data sets for the papers). However, since that freeze more chromatin factors have been profiled. For example, H4K20me, a mark recently implicated in specifying origin location in human [Tardat et al., 2010] has been added. There are a number of further approaches that could help determine if ORC has an in vivo DNA binding motif in D. melanogaster. For example, we could investigate whether or not certain classes of ORC binding sites have different sequence pref-

143 erences. We have shown that ORC associates with transcription factor HOT spots, we could perform a motif search over those ORC peaks not associated with HOT spots, with the hypothesis being that ORC can bind both non-specifically to open chromatin or, as a secondary measure, to a defined sequence motif. Another way to partition the ORC binding sites for this sort of analysis is by those that associate with active promoters versus those that do not. It is interesting to note that many of the chromatin marks associated with ORC binding sites which are promoter-distal are often found to be enriched at enhancer elements [for a review of enhancer related marks, see e.g., Zhou et al., 2011]. Thus, it would be interesting to see if a significant fraction of the non-TSS ORC binding sites could be explained by an attraction of ORC to enhancer elements. Although we have uncovered the correlation of many marks with origin loca- tion and activation, many appear to be redundant. For example, in Figure 3.12 we show the correlation matrix of chromatin marks and binding proteins within meta early origins in the Bg3 cell line. There seem to be five clear clusters, but there is a certain level of variability within each. Identifying a driver chromatin mark or pro- tein with respect to early origins (if one should exist) would be an important step toward understanding early origin regulation. I suggest that learning a Bayesian network using an approach such as the BANJO algorithm [e.g., Smith et al., 2006], would identify the underlying structure in what appears to be a redundant series of signals, so as to find the main driver. There are further predictors that could be incorporated into the early origin pre- dictions of Chapter 3. For example, it has been suggested that, under a stochastic model for origin timing, the local density of ORC binding sites would be informa- tive for early origin activation. Further, incorporating the nucleosome dynamics data from the Henikoff group could aid in both early origin and ORC localization predictions.

144 In this work we have profiled the chromatin state of those D. melanogaster early origin meta-peaks that were specifically turned on in a given cell line. A related question which we have yet to ask is whether or not there is a chromatin signature related to the early origin meta-peaks that were specifically turned off in a given cell line. Just as in gene regulation and chromatin, where genes that are regulated in a tissue-specific manner have both a distinct active state and distinct repressed state [reviewed in Zhou et al., 2011], perhaps the early origins that are opportunistically turned on or off in a particular cell type exhibit similar chromatin signatures. One extremely exciting possibility for the long term future of the field is ex- tending beyond DNA replication to chromatin replication. In other words how, if at all, does the cell coordinate the stable inheritance of not just the DNA sequence but the chromatin state as well. As the question of whether or not histone marks and variants constitute an “epigenetic code” [Ptashne, 2007], begins to be asked with more fervor, proving or disproving a mechanism for chromatin inheritance will constitute a large intellectual step forward for biology. The MacAlpine lab, with specialities in both replication and chromatin, is uniquely positioned to ask these very difficult questions.

145 5

Appendix A – Aneuploidy analysis

All of our genomic tools require us to map our results back to a reference genome. That reference genome is often derived from a representative sample of the popu- lation. Whether the sample was one or a small number of members of the popu- lation, regardless of how representative it might be, the sample will be genetically unique. In general the differences between genomes hundreds of megabases long are small enough to have little impact on our genomic mapping techniques; in- deed the technical error rate of our technology is the accuracy limiting factor in most cases. However, the traditionally cultured Drosophila cell lines — e.g., Bg3, Kc and S2— are far enough removed from the reference fruit fly genome that ge- nomic tools can provide misleading results. They are the product of an extremely harsh immortilization process [Schneider, 1972], and their karyotypes show them to be strongly aneuploid. Aneuploidy, or having an abnormal number of chromosomes, correlates with severe developmental abnormalities and death in all organisms where it has been analyzed [reviewed in Torres et al., 2008]. It is the leading cause of mental retarda-

146 tion in humans and is present in 90% of human solid tumors. Despite the obvious significance of aneuploidy for human health, its mechanism and role in tumorige- nesis is still poorly understood [reviewed in Williams and Amon, 2009]. Although widespread or germline aneuploidy is generally lethal to an organism, individual aneuploid cells can be counterintuitively robust at outcompeting standard diploid cells [reviewed in Weaver and Cleveland, 2006]. In those cancers characterized by aneuploidy for example, the cancer cells are able to proliferate freely [reviewed in Weaver and Cleveland, 2006], while forcing aneuploidy on non-transformed cells generally has an anti-proliferative effect [reviewed in Weaver and Cleveland, 2008]. Thus, the study of aneuploidy and how it can result in both tumorigenic and anti-tumorigenic cells is of great importance. As a motivating example, consider the S2 cell line [Schneider, 1972]. The ex- tent of aneuploidy in S2 cells — the genomic rearrangements, amplifications and deletions that set them apart from the reference genome — has recently been char- acterized, and it is overwhelming [Zhang et al., 2010]. Long considered tetraploid [Schneider, 1972], this recent study has shown that while the majority of the cell is tetraploid, 40% deviates from tetraploidy, suggesting that the karyotype is a rear- ranged mix of the reference genome drosophila chromosomes. Some areas of the genome exist at copy number as high as 10x or 16x. It has also been shown that these copy number deviations have an impact on gene expression [Zhang et al., 2010]. We wish to further the study of aneuploidy by characterizing exactly the break- points of segmental copy number, and the segments’ integer copy number, in three Drosophila cell lines. We propose to accomplish this by sequencing whole-genome sonicated DNA from the cell lines. The number of reads mapping to a given chro- mosomal bin, in the context of the bins around it, should provide us with enough information to make a copy number call for that bin. This represents a work in

147 progress, not yet submitted for peer review.

5.1 Results

5.1.1 Model

We propose to use a Hidden Markov Model (HMM) with single valued emissions probability distributions. Specifically, we will partition the genome into m non- overlapping bins of size x  3.5 kb (experimentally determined to reduce bin vari- ance while maintaining a reasonable resolution). We will assign each of these bins i an value mi in Reads Per Kilobase per Million reads mapped (RPKM). The formula for mi using bins of size x, given the raw number of reads in a bin ni and the total number of reads in the experiment N:

n ¦ 109 m  i (5.1) i x ¦ N

The advantage of expressing the bins in terms of RPKM is that it scales auto- matically to both the bin size and the number of reads in the experiment, making cross-experiment and cross-condition comparisons much easier. While other methods have been developed to look at copy number variation, there are none that we are aware of that are tuned specifically to drastically aneu- ploid cells. In particular, many copy number algorithms assume that the basal state of a genome is as a normal diploid, and then they search for significant deviations from that assumption. In this case, copy number variations are not a rare event, they are the norm. Thus we require a methodology that makes no assumptions about basal state. An example of the HMM topology to be used is shown in Figure 5.1. For simplic- ity the example shows a model that considers copy numbers from 1 to 4, whereas the model we built considers copy numbers 1–8, 10, 16 and 32. Each of the states

148 represents the copy number of its label. Each state j’s emission probability dis-

tribution will be a Gaussian distribution fit to the distribution of mi such that bin i is of copy number j. To determine the distribution associated with each copy number, we will infer the most frequent copy number by finding the peak in the density estimation over all mi. Since we know that our cells are tetraploid, we can assume that this most frequent copy number is the copy number for state 4. We may then take advantage of the fact that sequencing is additive, and assume that, for example, state 2 will be half of state 4, whereas state 6 will be 1.5x state 4. In this way we can infer the mean mi for each copy number.

The model employs three transition probabilities, a0, as and at. Each state is equally likely to be the first state (the 5’ bin), as the a0 specifies an equal probability of transitioning from the initialization state q0 into each of the copy number states.

Each state is then characterized by a high as (probability of self-transition) and a low at (probably of transitioning to another state). These parameters, as well as the variance of the emissions Gaussians, have been tuned by hand. Later I will discuss how best to learn these parameters de novo.

5.1.2 Interpreting calls

Having built the model, we ran it on S2, Bg3 and Kc cells to compare their copy number levels. The results for Bg3 and Kc cells on chromosome arm 3L are shown in Figure 5.2. It appears that the initial 3 mb of 3L are at an increased ploidy over even the basal tetraploid state. Further, it appears that there are a number of smaller amplifications on which the model is unable to pick up, see for example the unbroken blue line around 19 mb. Although these amplifications aren’t at the level of large-scale rearrangements such as the 5 mb deletion of one of the copies of Kc’s 3L, they are certainly large enough to hold a few genes, and thus may have implications for the specific biology of the cell.

149 a s

a 0 1 2

a q0 t

3 4

FIGURE 5.1: Example CNV HMM topology. The model contains three transition probabilities. a0: initial state (blue). as: self-transition (green). at: non-self- transition. (black)

5.1.3 Initial results

We have used the predicted copy number differences between the cell lines to explore the impact of copy number on differential gene expression. Figure 5.3 shows the relationship between a fold change in copy number between two cell lines and a fold change in expression. There is a clear relationship between copy number change and expression change between cell lines, and it seems as though a substantial amount of cell-line specific expression can be explained by aneuploidy. This makes the study of these cell lines a fascinating inquiry into the complexity of aneuploidy biology.

150 FIGURE 5.2: HMM results on chromosome arm 3L for Bg3 and Kc cells. Genomic position is indicated along the x axis, while RPKM is indicated along the left-hand y axis. The right-hand y axis shows the inferred copy number. The dark lines represent the contiguous state calls.

5.1.4 Future directions

There are a number of areas in which the calling methodology could be improved. The HMM relies on an underlying Gaussian distribution to model the RPKM distribution for each copy number, but we have shown a sequencing bin distribu- tion to more closely resemble a negative binomial distribution [Eaton et al., 2010]. Thus, substituting a negative binomial distribution for the Gaussian distribution could improve the accuracy.

There is a great deal of variance in mi even in contiguous stretches of the genome that appear to be of identical copy number. Rather than sequence to a higher depth, we will attempt to reduce the variance computationally by correcting for GC bias and mappability. The GC bias correction is best performed simultane- ously on all bins of identical copy number. However, since the copy number of the bins is what we’re trying to learn, this process can seem somewhat circular. We are

151 FIGURE 5.3: Differential copy number leads to differential expression. For each pair of cell lines, the log ratio change in copy number of each gene is plotted along the x axis, and the log ratio change in gene expression measured by RNA-seq is plot- ted along the y. Darker blue dots indicate a high concentration of data points. The actual regression of copy number onto expression is shown in red, while for com- parison a hypothetical one-to-one correspondence between expression and copy number is shown in the dashed line.

exploring the possibility of an iterative process where we fit and then re-normalize for GC at each iteration. To further reduce the effects of the variance on the HMM calls, and thus reduce

the chance of rapid changes between copy number states without setting at pro- hibitively low, we are developing a new HMM state topology that forces the model to remain in one copy number for at least three bins. This is done by simply tripling the number of states for each copy number, making one of the three copies an in- coming edge state which other copy numbers can transition into, making another capable of transitioning to itself or any incoming edge state, and making the last state a linker state between the previous two states. With this model we should be able to tune down the variance of the underlying emissions distributions without causing unwanted oscillating between states in areas that are borderline between

152 two states. There is also a question of cell population heterogeneity. As an example, Figure 5.4 shows the read depth and inferred copy number along the S2 X chromosome. S2 is a generally considered to be a tetraploid male cell line, and so the X chro- mosome should exist at predominantly two copies. This seems to hold for the area from 1 mb to 12 mb. However, the area from approximately 14 mb until the 3’ end of the chromosome seems to exist in a basal state between copy number 2 and 3. In this case, the HMM was forced by low at and a high underlying Gaus- sian variability to choose one or the other, but it raises the question of whether or not a given S2 cell population may have an assortment of cells, half with two copies of the X and half with three, or perhaps three quarters of the population with two copies and one quarter with four copies. Determining the heterogeneity of the population would perhaps be best left to an experimental biologist. A clev- erly designed fluorescence in-situ hybridization (FISH) experiment could answer the question. For example, if a probe designed against the 5’ half of the X has two foci in every cell but a probe against a spot in the 3’ half has two foci in some and more in others, then we would have our answer. However, since the resources of a computational biologist are inexpensive, we could investigate this in silico by examining any frequent SNPs in the area to see if there is evidence for three alleles rather than two. Finally, allowing the HMM to leverage other sources of information could im- prove its accuracy with respect to breakpoint detection. For example, one could imagine that spreading marks such as H3K9me3 may show a sharp transition at copy number breakpoints. Building these into the model could therefore increase accuracy in those cell lines for which we have any spreading marks mapped. The next step after building an accurate HMM is to mine the genes that are found to be up-regulated due to copy number for similar GO terms or for subgraphs

153 FIGURE 5.4: The copy number profile of the S2 X chromosome. Genomic position is indicated along the x axis, while RPKM is shown along the left-hand y. The inferred copy number is indicated along the right-hand y. Each inferred copy number is given its own color.

within an inferred function gene network [Costello et al., 2009]. This could reveal a systems-level view of cell proliferation through aneuploidy.

5.2 Conclusion

Although we have only scratched the surface of cell-line aneuploid variability, this is a rich avenue of future research. Aneuploidy bears on a number of human dis- eases including cancer and mental retardation, and this is a unique chance to study the basic science of aneuploidy in three extremely well documented Drosophila cell lines. Further, it puts the high-throughput genomic experiments being done on these cell lines through modENCODE in context, and tells us that we’re often

154 looking not at a single locus with an array probe, but rather at a convolution of different loci scattered throughout a rearranged genome.

155 Bibliography

Aggarwal, B. D. and Calvi, B. R. (2004), “Chromatin regulates origin activity in Drosophila follicle cells.” Nature, 430, 372–6.

Albert, I., Mavrich, T. N., Tomsho, L. P., Qi, J., Zanton, S. J., Schuster, S. C., and Pugh, B. F. (2007), “Translational and rotational settings of H2A.Z nucleosomes across the genome.” Nature, 446, 572–6.

Aldajem, M., Falaschi, A., and Kowalski, D. (2006), DNA replication and human dis- ease, chap. Eukaryotic DNA Replication Origins, pp. 31–61, Cold Spring Harbor Laboratory Press, New York.

Alvino, G. M., Collingwood, D., Murphy, J. M., Delrow, J., Brewer, B. J., and Raghu- raman, M. K. (2007), “Replication in hydroxyurea: it’s a matter of time,” Mol Cell Biol, 27, 6396–406.

Anglana, M., Apiou, F., Bensimon, A., and Debatisse, M. (2003), “Dynamics of DNA replication in mammalian somatic cells: nucleotide pool modulates origin choice and interorigin spacing.” Cell, 114, 385–94.

Aparicio, O. M., Weinstein, D. M., and Bell, S. P. (1997), “Components and dynam- ics of DNA replication complexes in S. cerevisiae: redistribution of MCM proteins and Cdc45p during .” Cell, 91, 59–69.

Araki, M., Wharton, R. P., Tang, Z., Yu, H., and Asano, M. (2003), “Degradation of origin recognition complex large subunit by the anaphase-promoting complex in Drosophila,” EMBO J, 22, 6115–26.

Arentson, E., Faloon, P., Seo, J., Moon, E., Studts, J., Fremont, D., and Choi, K. (2002), “Oncogenic potential of the DNA replication licensing protein CDT1.” Oncogene, 21, 1150–8.

Arias, E. E. and Walter, J. C. (2007), “Strength in numbers: preventing rereplication via multiple mechanisms in eukaryotic cells,” Genes Dev, 21, 497–518.

Auerbach, R. K., Euskirchen, G., Rozowsky, J., Lamarre-Vincent, N., Moqtaderi, Z., Lefranc¸ois, P., Struhl, K., Gerstein, M., and Snyder, M. (2009), “Mapping

156 accessible chromatin regions using Sono-Seq,” Proc Natl Acad Sci U S A, 106, 14926–31.

Austin, R., Orr-Weaver, T., and Bell, S. (1999), “Drosophila ORC specifically binds to ACE3, an origin of DNA replication control element.” Genes Dev, 13, 2639–49.

Bailey, T. L. and Elkan, C. (1994), “Fitting a mixture model by expectation maxi- mization to discover motifs in biopolymers.” Proc Int Conf Intell Syst Mol Biol, 2, 28–36.

Bailey, T. L. and Gribskov, M. (1998), “Combining evidence using p-values: appli- cation to sequence homology searches,” Bioinformatics, 14, 48–54.

Balazs, I., Brown, E. H., and Schildkraut, C. L. (1974), “The temporal order of replication of some DNA cistrons,” Cold Spring Harb Symp Quant Biol, 38, 239– 45.

Barski, A., Cuddapah, S., Cui, K., Roh, T.-Y., Schones, D. E., Wang, Z., Wei, G., Che- pelev, I., and Zhao, K. (2007), “High-resolution profiling of histone methylations in the human genome,” Cell, 129, 823–37.

Baum, L., Petrie, T., soules, G., and Weiss, N. (1970), “A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains,” Ann. Math. Statist., 41, 164–171.

Beach, D., Piper, M., and Shall, S. (1980), “Isolation of chromosomal origins of replication in yeast,” Nature, 284, 185–7.

Bell, L. and Byers, B. (1983), “Separation of branched from linear DNA by two- dimensional gel electrophoresis,” Anal Biochem, 130, 527–35.

Bell, S. (1995), “Eukaryotic replicators and associated protein complexes.” Curr Opin Genet Dev, 5, 162–7.

Bell, S. (2002), “The origin recognition complex: from simple origins to complex functions.” Genes Dev, 16, 659–72.

Bell, S. and Dutta, A. (2002), “DNA replication in eukaryotic cells.” Annu Rev Biochem, 71, 333–74.

Bell, S. P. and Stillman, B. (1992), “ATP-dependent recognition of eukaryotic ori- gins of DNA replication by a multiprotein complex.” Nature, 357, 128–34.

Bembom, O., Keles, S., and van der Laan, M. J. (2007), “Supervised detection of conserved motifs in DNA sequences with cosmo.” Stat Appl Genet Mol Biol, 6, Article8.

157 Berbenetz, N. M., Nislow, C., and Brown, G. W. (2010), “Diversity of eukaryotic DNA replication origins revealed by genome-wide analysis of chromatin struc- ture,” PLoS Genet, 6.

Biamonti, G., Della Valle, G., Talarico, D., Cobianchi, F., Riva, S., and Falaschi, A. (1985), “Fate of exogenous recombinant plasmids introduced into mouse and human cells,” Nucleic Acids Res, 13, 5545–61.

Blow, J. J. and Dutta, A. (2005), “Preventing re-replication of chromosomal DNA,” Nat Rev Mol Cell Biol, 6, 476–86.

Blumenthal, A., Kriegstein, H., and Hogness, D. (1974), “The units of DNA replica- tion in Drosophila melanogaster chromosomes.” Cold Spring Harb Symp Quant Biol, 38, 205–23.

Bochman, M. L. and Schwacha, A. (2008), “The Mcm2-7 complex has in vitro helicase activity,” Mol Cell, 31, 287–93.

Boyle, A. P., Guinney, J., Crawford, G. E., and Furey, T. S. (2008a), “F-Seq: a feature density estimator for high-throughput sequence tags.” Bioinformatics, 24, 2537– 2538.

Boyle, A. P., Davis, S., Shulha, H. P., Meltzer, P., Margulies, E. H., Weng, Z., Furey, T. S., and Crawford, G. E. (2008b), “High-resolution mapping and characteriza- tion of open chromatin across the genome.” Cell, 132, 311–322.

Bozhenok, L., Wade, P. A., and Varga-Weisz, P. (2002), “WSTF-ISWI chromatin re- modeling complex targets heterochromatic replication foci,” EMBO J, 21, 2231– 41.

Branzei, D. and Foiani, M. (2005), “The DNA damage response during DNA repli- cation,” Curr Opin Cell Biol, 17, 568–75.

Breier, A., Chatterji, S., and Cozzarelli, N. (2004), “Prediction of Saccharomyces cerevisiae replication origins.” Genome Biol, 5, R22.

Brewer, B. J. and Fangman, W. L. (1987), “The localization of replication origins on ARS plasmids in S. cerevisiae,” Cell, 51, 463–71.

Broach, J. R., Li, Y. Y., Feldman, J., Jayaram, M., Abraham, J., Nasmyth, K. A., and Hicks, J. B. (1983), “Localization and sequence analysis of yeast origins of DNA replication,” Cold Spring Harb Symp Quant Biol, 47 Pt 2, 1165–73.

Cadoret, J. C., Meisch, F., Hassan-Zadeh, V., Luyten, I., Guillet, C., Duret, L., Ques- neville, H., and Prioleau, M. N. (2008), “Genome-wide studies highlight indirect links between human replication origins and gene regulation.” Proc Natl Acad Sci USA, 105, 15837–42.

158 Cayrou, C., Coulombe, P., and M´echali, M. (2010), “Programming DNA replication origins and chromosome organization,” Chromosome Res, 18, 137–45.

Celniker, S. E., Dillon, L. A., Gerstein, M. B., Gunsalus, K. C., Henikoff, S., Karpen, G. H., Kellis, M., Lai, E. C., Lieb, J. D., MacAlpine, D. M., Micklem, G., Piano, F., Snyder, M., Stein, L., White, K. P., and Waterston, R. H. (2009), “Unlocking the secrets of the genome.” Nature, 459, 927–30.

Chang, C.-C. and Lin, C.-J. (2001), LIBSVM: a library for support vector machines, Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm.

Chen, C.-L., Rappailles, A., Duquenne, L., Huvet, M., Guilbaud, G., Farinelli, L., Au- dit, B., d’Aubenton Carafa, Y., Arneodo, A., Hyrien, O., and Thermes, C. (2010), “Impact of replication timing on non-CpG and CpG substitution rates in mam- malian genomes,” Genome Res, 20, 447–57.

Chesnokov, I., Remus, D., and Botchan, M. (2001), “Functional analysis of mutant and wild-type Drosophila origin recognition complex.” Proc Natl Acad Sci U S A, 98, 11997–2002.

Chuang, R. Y. and Kelly, T. J. (1999), “The fission yeast homologue of Orc4p binds to replication origin DNA via multiple AT-hooks.” Proc Natl Acad Sci U S A, 96, 2656–61.

Chubb, J. and Bickmore, W. (2003), “Considering nuclear compartmentalization in the light of nuclear dynamics.” Cell, 112, 403–6.

Clarey, M. G., Botchan, M., and Nogales, E. (2008), “Single particle EM studies of the Drosophila melanogaster origin recognition complex and evidence for DNA wrapping.” J Struct Biol, 164, 241–9.

Claycomb, J. M. and Orr-Weaver, T. L. (2005), “Developmental gene amplification: insights into DNA replication and gene expression.” Trends Genet, 21, 149–62.

Collins, N., Poot, R. A., Kukimoto, I., Garc´ıa-Jim´enez, C., Dellaire, G., and Varga- Weisz, P. D. (2002), “An ACF1-ISWI chromatin-remodeling complex is required for DNA replication through heterochromatin,” Nat Genet, 32, 627–32.

Cook, J. G., Chasse, D. A. D., and Nevins, J. R. (2004), “The regulated association of Cdt1 with minichromosome maintenance proteins and Cdc6 in mammalian cells,” J Biol Chem, 279, 9625–33.

Costello, J. C., Dalkilic, M. M., Beason, S. M., Gehlhausen, J. R., Patwardhan, R., Middha, S., Eads, B. D., and Andrews, J. R. (2009), “Gene networks in Drosophila melanogaster: integrating experimental data to predict gene func- tion,” Genome Biol, 10, R97.

159 Cotobal, C., Segurado, M., and Antequera, F. (2010), “Structural diversity and dynamics of genomic replication origins in Schizosaccharomyces pombe,” EMBO J, 29, 934–42.

Crampton, A., Chang, F., Pappas, Jr, D. L., Frisch, R. L., and Weinreich, M. (2008), “An ARS element inhibits DNA replication through a SIR2-dependent mecha- nism,” Mol Cell, 30, 156–66.

Crawford, G. E., Davis, S., Scacheri, P. C., Renaud, G., Halawi, M. J., Erdos, M. R., Green, R., Meltzer, P. S., Wolfsberg, T. G., and Collins, F. S. (2006a), “DNase- chip: a high-resolution method to identify DNase I hypersensitive sites using tiled microarrays,” Nat Methods, 3, 503–9.

Crawford, G. E., Holt, I. E., Whittle, J., Webb, B. D., Tai, D., Davis, S., Margulies, E. H., Chen, Y., Bernat, J. A., Ginsburg, D., Zhou, D., Luo, S., Vasicek, T. J., Daly, M. J., Wolfsberg, T. G., and Collins, F. S. (2006b), “Genome-wide map- ping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS),” Genome Res, 16, 123–31.

Dahmann, C., Diffley, J. F., and Nasmyth, K. A. (1995), “S-phase-promoting cyclin- dependent kinases prevent re-replication by inhibiting the transition of replica- tion origins to a pre-replicative state,” Curr Biol, 5, 1257–69.

Dai, J., Chuang, R.-Y., and Kelly, T. J. (2005), “DNA replication origins in the Schizosaccharomyces pombe genome,” Proc Natl Acad Sci U S A, 102, 337–42.

Danis, E., Brodolin, K., Menut, S., Maiorano, D., Girard-Reydet, C., and M´echali, M. (2004), “Specification of a DNA replication origin by a transcription complex,” Nat Cell Biol, 6, 721–30.

David, L., Huber, W., Granovskaia, M., Toedling, J., Palm, C. J., Bofkin, L., Jones, T., Davis, R. W., and Steinmetz, L. M. (2006), “A high-resolution map of tran- scription in the yeast genome.” Proc Natl Acad Sci U S A, 103, 5320–5325.

Davidson, I. F., Li, A., and Blow, J. J. (2006), “Deregulated replication licensing causes DNA fragmentation consistent with head-to-tail fork collision,” Mol Cell, 24, 433–43. de Cicco, D. V. and Spradling, A. C. (1984), “Localization of a cis-acting element re- sponsible for the developmentally regulated amplification of Drosophila chorion genes,” Cell, 38, 45–54.

Deal, R. B., Henikoff, J. G., and Henikoff, S. (2010), “Genome-wide kinetics of nu- cleosome turnover determined by metabolic labeling of histones,” Science, 328, 1161–4.

160 Delgado, S., Gomez,´ M., Bird, A., and Antequera, F. (1998), “Initiation of DNA replication at CpG islands in mammalian chromosomes,” EMBO J, 17, 2426–35.

DePamphilis, M. L. (2006), DNA replication and human disease, vol. 46, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y.

Devault, A., Vallen, E. A., Yuan, T., Green, S., Bensimon, A., and Schwob, E. (2002), “Identification of Tah11/Sid2 as the ortholog of the replication Cdt1 in Saccharomyces cerevisiae,” Curr Biol, 12, 689–94.

Dhar, S. K. and Dutta, A. (2000), “Identification and characterization of the human ORC6 homolog,” J Biol Chem, 275, 34983–8.

Diffley, J. F., Cocker, J. H., Dowell, S. J., and Rowley, A. (1994), “Two steps in the assembly of complexes at yeast replication origins in vivo,” Cell, 78, 303–16.

Dijkwel, P. A., Wang, S., and Hamlin, J. L. (2002), “Initiation sites are distributed at frequent intervals in the Chinese hamster dihydrofolate reductase origin of replication but are used with very different efficiencies,” Mol Cell Biol, 22, 3053– 65.

Dimitrova, D. S. and Gilbert, D. M. (1999), “The spatial position and replication timing of chromosomal domains are both established in early G1 phase,” Mol Cell, 4, 983–93.

Ding, Q. and MacAlpine, D. M. (2010), “Preferential re-replication of Drosophila heterochromatin in the absence of geminin,” PLoS Genet, 6.

Dion, M. F., Kaplan, T., Kim, M., Buratowski, S., Friedman, N., and Rando, O. J. (2007), “Dynamics of replication-independent histone turnover in budding yeast,” Science, 315, 1405–8.

Drury, L. S., Perkins, G., and Diffley, J. F. (1997), “The Cdc4/34/53 pathway targets Cdc6p for proteolysis in budding yeast,” EMBO J, 16, 5966–76.

Dubey, D. D., Davis, L. R., Greenfeder, S. A., Ong, L. Y., Zhu, J. G., Broach, J. R., Newlon, C. S., and Huberman, J. A. (1991), “Evidence suggesting that the ARS elements associated with silencers of the yeast mating-type locus HML do not function as chromosomal DNA replication origins,” Mol Cell Biol, 11, 5346–55.

Duncker, B. P., Chesnokov, I. N., and McConkey, B. J. (2009), “The origin recogni- tion complex protein family,” Genome Biol, 10, 214.

Dutta, A. and Bell, S. P. (1997), “Initiation of DNA replication in eukaryotic cells,” Annu Rev Cell Dev Biol, 13, 293–332.

161 Dwyer, M. A., Joseph, J. D., Wade, H. E., Eaton, M. L., Kunder, R. S., Kazmin, D., Chang, C.-y., and McDonnell, D. P. (2010), “WNT11 expression is induced by estrogen-related receptor alpha and beta-catenin and acts in an autocrine manner to increase cancer cell migration,” Cancer Res, 70, 9298–308.

Eaton, M. L., Galani, K., Kang, S., Bell, S. P., and MacAlpine, D. M. (2010), “Con- served nucleosome positioning defines replication origins,” Genes Dev, 24, 748– 53.

Eaton, M. L., Prinz, J. A., Macalpine, H. K., Tretyakov, G., Kharchenko, P. V., and Macalpine, D. M. (2011), “Chromatin signatures of the Drosophila replication program,” Genome Res, 21, 164–74.

ENCODE Project Consortium, T. (2007), “Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.” Nature, 447, 799–816.

Ernst, J. and Kellis, M. (2010), “Discovery and characterization of chromatin states for systematic annotation of the human genome,” Nat Biotechnol, 28, 817–25.

Felsenfeld, G. and Groudine, M. (2003), “Controlling the double helix,” Nature, 421, 448–53.

Ferguson, B. M. and Fangman, W. L. (1992), “A position effect on the time of replication origin activation in yeast,” Cell, 68, 333–9.

Field, Y., Kaplan, N., Fondufe-Mittendorf, Y., Moore, I. K., Sharon, E., Lubling, Y., Widom, J., and Segal, E. (2008), “Distinct modes of regulation by chromatin en- coded through nucleosome positioning signals.” PLoS Comput Biol, 4, e1000216.

Filion, G. J., van Bemmel, J. G., Braunschweig, U., Talhout, W., Kind, J., Ward, L. D., Brugman, W., de Castro, I. J., Kerkhoven, R. M., Bussemaker, H. J., and van Steensel, B. (2010), “Systematic protein location mapping reveals five principal chromatin types in Drosophila cells,” Cell, 143, 212–24.

Foss, M., McNally, F. J., Laurenson, P., and Rine, J. (1993), “Origin recognition complex (ORC) in transcriptional silencing and DNA replication in S. cerevisiae,” Science, 262, 1838–44.

Friedman, K. L. and Brewer, B. J. (1995), “Analysis of replication intermediates by two-dimensional agarose gel electrophoresis,” Methods Enzymol, 262, 613–27.

Friedman, K. L., Diller, J. D., Ferguson, B. M., Nyland, S. V., Brewer, B. J., and Fangman, W. L. (1996), “Multiple determinants controlling activation of yeast replication origins late in S phase,” Genes Dev, 10, 1595–607.

162 Friedman, K. L., Brewer, B. J., and Fangman, W. L. (1997), “Replication profile of Saccharomyces cerevisiae chromosome VI,” Genes Cells, 2, 667–78.

Gaczynska, M., Osmulski, P. A., Jiang, Y., Lee, J.-K., Bermudez, V., and Hurwitz, J. (2004), “Atomic force microscopic analysis of the binding of the Schizosaccha- romyces pombe origin recognition complex and the spOrc4 protein with origin DNA,” Proc Natl Acad Sci U S A, 101, 17952–7.

Gasser, R., Koller, T., and Sogo, J. M. (1996), “The stability of nucleosomes at the replication fork,” J Mol Biol, 258, 224–39.

Gilbert, C. W., Muldal, S., Lajtha, L. G., and Rowley, J. (1962), “Time-sequence of human chromosome duplication,” Nature, 195, 869–73.

Gilbert, D. (2004), “Timeline: In search of the holy replicator.” Nat Rev Mol Cell Biol, 5, 848–55.

Gilbert, D. M. (2010a), “Cell fate transitions and the replication timing decision point,” J Cell Biol, 191, 899–903.

Gilbert, D. M. (2010b), “Evaluating genome-scale approaches to eukaryotic DNA replication.” Nat Rev Genet, 11, 673–84.

Gilbert, D. M., Takebayashi, S.-I., Ryba, T., Lu, J., Pope, B. D., Wilson, K. A., and Hiratani, I. (2010), “Space and Time in the Nucleus: Developmental Control of Replication Timing and Chromosome Architecture,” Cold Spring Harb Symp Quant Biol.

Gillespie, P. J., Li, A., and Blow, J. J. (2001), “Reconstitution of licensed replication origins on Xenopus sperm nuclei using purified proteins,” BMC Biochem, 2, 15.

Giresi, P. G., Kim, J., McDaniell, R. M., Iyer, V. R., and Lieb, J. D. (2007), “FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regula- tory elements from human chromatin,” Genome Res, 17, 877–85.

Gomez,´ M. and Brockdorff, N. (2004), “Heterochromatin on the inactive X chro- mosome delays replication timing without affecting origin usage,” Proc Natl Acad Sci U S A, 101, 6923–8.

Gonzalez, S., Klatt, P., Delgado, S., Conde, E., Lopez-Rios, F., Sanchez-Cespedes, M., Mendez, J., Antequera, F., and Serrano, M. (2006), “Oncogenic activity of Cdc6 through repression of the INK4/ARF locus.” Nature, 440, 702–6.

Goren, A., Tabib, A., Hecht, M., and Cedar, H. (2008), “DNA replication timing of the human beta-globin domain is controlled by histone modification at the origin.” Genes Dev, 22, 1319–24.

163 Gossen, M., Pak, D., Hansen, S., Acharya, J., and Botchan, M. (1995), “A Drosophila homolog of the yeast origin recognition complex.” Science, 270, 1674–7.

Graveley, B. R., Brooks, A. N., Carlson, J. W., Duff, M. O., Landolin, J., Yang, L., Artieri, C., van Baren, M. J., Booth, B. W., Brown, J. B., Cherbas, L., Davis, C. A., Dobin, A., Li, R., Lin, W., Mattiuzzo, N., Miller, D., Sturgill, D., Tuch, B., Zaleski, C., Zhang, D., Blanchette, M., Clough, E., Dudoit, S., Eads, B., Green, R. E., Hammonds, A., Kapranov, P., Langton, L., Malone, J., Perrimon, N., Sandler, J. E., Wan, K. H., Willingham, A., Zhang, Y., Zou, Y., Andrews, J., Brenner, S. E., Brent, M., Cherbas, P., Gingeras, T. R., Hoskins, R. A., Kaufman, T., Oliver, B., and Celniker, S. E. (2010), “The Developmental Transcriptome of Drosophila melanogaster,” Nature, Submitted.

Haase, S. B., Heinzel, S. S., and Calos, M. P. (1994), “Transcription inhibits the replication of autonomously replicating plasmids in human cells,” Mol Cell Biol, 14, 2516–24.

Hamlin, J. L., Mesner, L. D., Lar, O., Torres, R., Chodaparambil, S. V., and Wang, L. (2008), “A revisionist replicon model for higher eukaryotic genomes,” J Cell Biochem, 105, 321–9.

Hansen, J. C. (2002), “Conformational dynamics of the chromatin fiber in solution: determinants, mechanisms, and functions,” Annu Rev Biophys Biomol Struct, 31, 361–92.

Hansen, R. S., Thomas, S., Sandstrom, R., Canfield, T. K., Thurman, R. E., Weaver, M., Dorschner, M. O., Gartler, S. M., and Stamatoyannopoulos, J. A. (2010), “Sequencing newly replicated DNA reveals widespread plasticity in human repli- cation timing,” Proc Natl Acad Sci U S A, 107, 139–44.

Hartl, T., Boswell, C., Orr-Weaver, T. L., and Bosco, G. (2007), “Developmentally regulated histone modifications in Drosophila follicle cells: initiation of gene amplification is associated with histone H3 and H4 hyperacetylation and H1 phosphorylation,” Chromosoma, 116, 197–214.

Hartwell, L. H. (1973), “Three additional genes required for deoxyribonucleic acid synthesis in Saccharomyces cerevisiae,” J Bacteriol, 115, 966–74.

Hartwell, L. H. (1976), “Sequential function of gene products relative to DNA syn- thesis in the yeast cell cycle,” J Mol Biol, 104, 803–17.

Hashimoto, Y. and Takisawa, H. (2003), “Xenopus Cut5 is essential for a CDK- dependent process in the initiation of DNA replication,” EMBO J, 22, 2526–35.

164 Hayashi, M., Katou, Y., Itoh, T., Tazumi, A., Tazumi, M., Yamada, Y., Takahashi, T., Nakagawa, T., Shirahige, K., and Masukata, H. (2007), “Genome-wide local- ization of pre-RC sites and identification of replication origins in fission yeast,” EMBO J, 26, 1327–39.

Heck, M. and Spradling, A. (1990), “Multiple replication origins are used during Drosophila chorion gene amplification.” J Cell Biol, 110, 903–14.

Heichinger, C., Penkett, C. J., Bahler,¨ J., and Nurse, P. (2006), “Genome-wide char- acterization of fission yeast DNA replication origins,” EMBO J, 25, 5171–9.

Heitz, E. (1928), “Das Heterochromatin der Moose,” Jb. Wiss. Bot., 69, 728–818.

Henikoff, S., Henikoff, J. G., Sakai, A., Loeb, G. B., and Ahmad, K. (2009), “Genome-wide profiling of salt fractions maps physical properties of chromatin.” Genome Res, 19, 460–9.

Hiratani, I. and Gilbert, D. M. (2009), “Replication timing as an epigenetic mark.” Epigenetics, 4, 93–7.

Hiratani, I., Ryba, T., Itoh, M., Yokochi, T., Schwaiger, M., Chang, C. W., Lyou, Y., Townes, T. M., Schubeler, D., and Gilbert, D. M. (2008), “Global reorganization of replication domains during embryonic stem cell differentiation.” PLoS Biol, 6, e245.

Hofmann, J. F. and Beach, D. (1994), “cdt1 is an essential target of the Cdc10/Sct1 transcription factor: requirement for DNA replication and inhibition of mitosis,” EMBO J, 13, 425–34.

Hon, G., Ren, B., and Wang, W. (2008), “ChromaSig: a probabilistic approach to finding common chromatin signatures in the human genome,” PLoS Comput Biol, 4, e1000201.

Hook, S. S., Lin, J. J., and Dutta, A. (2007), “Mechanisms to control rereplication and implications for cancer.” Curr Opin Cell Biol, 19, 663–71.

Hsu, T., Schmid, W., and Stubblefield, E. (1964), The Role of Chromosomes in De- velopment, chap. DNA replication sequence in higher animals, p. 83, Academic Press, New York.

Hua, X. H. and Newport, J. (1998), “Identification of a preinitiation step in DNA replication that is independent of origin recognition complex and cdc6, but de- pendent on cdk2,” J Cell Biol, 140, 271–81.

Huang, R. Y. and Kowalski, D. (1996), “Multiple DNA elements in ARS305 deter- mine replication origin activity in a yeast chromosome,” Nucleic Acids Res, 24, 816–23.

165 Huang, Y. and Kowalski, D. (2003), “WEB-THERMODYN: Sequence analysis soft- ware for profiling DNA helical stability,” Nucleic Acids Res, 31, 3819–21.

Huberman, J. A. and Riggs, A. D. (1966), “Autoradiography of chromosomal DNA fibers from Chinese hamster cells,” Proc Natl Acad Sci U S A, 55, 599–606.

Huberman, J. A. and Riggs, A. D. (1968), “On the mechanism of DNA replication in mammalian chromosomes,” J Mol Biol, 32, 327–41.

Iyer, L. M., Leipe, D. D., Koonin, E. V., and Aravind, L. (2004), “Evolutionary history and higher order classification of AAA+ ATPases,” J Struct Biol, 146, 11–31.

Jackson, D. A. and Pombo, A. (1998), “Replicon clusters are stable units of chromo- some structure: evidence that nuclear organization contributes to the efficient activation and propagation of S phase in human cells,” J Cell Biol, 140, 1285–95.

Jacob, F., Brenner, S., and Cuzin, F. (1963), “On the Regulation of DNA Replication in Bacteria,” Cold Spring Harbor Symposia on Quantitative Biology, 28, 329–348.

Jares, P. and Blow, J. J. (2000), “Xenopus cdc7 function is dependent on licensing but not on XORC, XCdc6, or CDK activity and is required for XCdc45 loading,” Genes Dev, 14, 1528–40.

Jenuwein, T. and Allis, C. D. (2001), “Translating the histone code.” Science, 293, 1074–80.

Jiang, C. and Pugh, B. F. (2009), “Nucleosome positioning and gene regulation: advances through genomics,” Nat Rev Genet, 10, 161–72.

Johnson, D. S., Mortazavi, A., Myers, R. M., and Wold, B. (2007), “Genome-wide mapping of in vivo protein-DNA interactions.” Science, 316, 1497–502.

Kamimura, Y., Masumoto, H., Sugino, A., and Araki, H. (1998), “Sld2, which inter- acts with Dpb11 in Saccharomyces cerevisiae, is required for chromosomal DNA replication,” Mol Cell Biol, 18, 6102–9.

Kaplan, N., Moore, I., Fondufe-Mittendorf, Y., Gossett, A., Tillo, D., Field, Y., LeP- roust, E., Hughes, T., Lieb, J., Widom, J., and Segal, E. (2009), “The DNA- encoded nucleosome organization of a eukaryotic genome,” Nature, 458, 362– 366.

Kaplan, T., Liu, C. L., Erkmann, J. A., Holik, J., Grunstein, M., Kaufman, P. D., Friedman, N., and Rando, O. J. (2008), “Cell cycle- and chaperone-mediated regulation of H3K56ac incorporation in yeast.” PLoS Genet, 4, e1000270.

166 Karakaidos, P., Taraviras, S., Vassiliou, L. V., Zacharatos, P., Kastrinakis, N. G., Kougiou, D., Kouloukoussa, M., Nishitani, H., Papavassiliou, A. G., Lygerou, Z., and Gorgoulis, V. G. (2004), “Overexpression of the replication licensing reg- ulators hCdt1 and hCdc6 characterizes a subset of non-small-cell lung carcino- mas: synergistic effect with mutant p53 on tumor growth and chromosomal instability–evidence of -1 transcriptional control over hCdt1,” Am J Pathol, 165, 1351–65.

Karnani, N., Taylor, C., Malhotra, A., and Dutta, A. (2007), “Pan-S replication patterns and chromosomal domains defined by genome-tiling arrays of ENCODE genomic areas.” Genome Res, 17, 865–76.

Karnani, N., Taylor, C. M., Malhotra, A., and Dutta, A. (2010), “Genomic study of replication initiation in human chromosomes reveals the influence of transcrip- tion regulation and chromatin structure on origin selection.” Mol Biol Cell, 21, 393–404.

Kaufman, P. D. and Rando, O. J. (2010), “Chromatin as a potential carrier of heri- table information,” Curr Opin Cell Biol, 22, 284–90.

Kawasaki, Y., Kim, H.-D., Kojima, A., Seki, T., and Sugino, A. (2006), “Reconsti- tution of Saccharomyces cerevisiae prereplicative complex assembly in vitro,” Genes Cells, 11, 745–56.

Kelly, T. J. and Brown, G. W. (2000), “Regulation of chromosome replication,” Annu Rev Biochem, 69, 829–80.

Kharchenko, P. V., Tolstorukov, M. Y., and Park, P. J. (2008), “Design and analysis of ChIP-seq experiments for DNA-binding proteins.” Nat Biotechnol, 26, 1351–9.

Kharchenko, P. V., Alekseyenko, A. A., Schwartz, Y. B., Minoda, A., Riddle, N. C., Ernst, J., Sabo, P. J., Larschan, E., Gorchakov, A. A., Gu, T., Linder-Basso, D., Plachetka, A., Shanower, G., Tolstorukov, M. Y., Luquette, L. J., Xi, R., Jung, Y. L., Park, R. W., Bishop, E. P., Canfield, T. P., Sandstrom, R., Thurman, R. E., Macalpine, D. M., Stamatoyannopoulos, J. A., Kellis, M., Elgin, S. C. R., Kuroda, M. I., Pirrotta, V., Karpen, G. H., and Park, P. J. (2010), “Comprehensive analysis of the chromatin landscape in Drosophila melanogaster,” Nature.

Knott, S. R., Viggiani, C. J., Tavare, S., and Aparicio, O. M. (2009), “Genome-wide replication profiles indicate an expansive role for Rpd3L in regulating replica- tion initiation timing or efficiency, and reveal genomic loci of Rpd3 function in Saccharomyces cerevisiae.” Genes Dev, 23, 1077–90.

Kornberg, A. and Baker, T. A. (1992), DNA replication, W.H. Freeman, New York, 2nd ed. edn.

167 Kornberg, R. D. (1974), “Chromatin structure: a repeating unit of histones and DNA,” Science, 184, 868–71.

Kornberg, R. D. (1977), “Structure of chromatin,” Annu Rev Biochem, 46, 931–54.

Kouzarides, T. (2007), “Chromatin modifications and their function,” Cell, 128, 693–705.

Krude, T. (1995), “Chromatin assembly factor 1 (CAF-1) colocalizes with replica- tion foci in HeLa cell nuclei,” Exp Cell Res, 220, 304–11.

Krysan, P. J., Smith, J. G., and Calos, M. P. (1993), “Autonomous replication in human cells of multimers of specific human and bacterial DNA sequences,” Mol Cell Biol, 13, 2688–96.

Kunert, N., Marhold, J., Stanke, J., Stach, D., and Lyko, F. (2003), “A Dnmt2-like protein mediates DNA methylation in Drosophila,” Development, 130, 5083–90.

Labib, K. (2010), “How do Cdc7 and cyclin-dependent kinases trigger the initiation of chromosome replication in eukaryotic cells?” Genes Dev, 24, 1208–19.

Lai, W. K. and Buck, M. J. (2010), “ArchAlign: coordinate-free chromatin alignment reveals novel architectures,” Genome Biol, 11, R126.

Lam, W. M., Kang, S., Chan, C. S., Eaton, M. L., MacAlpine, D. M., and Bell, S. P. (2011), “Nucleosomal DNA alters the origin DNA sequence requirements for helicase loading,” submitted.

Lande-Diner, L., Zhang, J., and Cedar, H. (2009), “Shifts in replication timing ac- tively affect histone acetylation during nucleosome reassembly.” Mol Cell, 34, 767–74.

Langst,¨ G. and Becker, P. B. (2001), “Nucleosome mobilization and positioning by ISWI-containing chromatin-remodeling factors,” J Cell Sci, 114, 2561–8.

Laverty, C., Lucci, J., and Akhtar, A. (2010), “The MSL complex: X chromosome and beyond.” Curr Opin Genet Dev, 20, 171–8.

Lebofsky, R., Heilig, R., Sonnleitner, M., Weissenbach, J., and Bensimon, A. (2006), “DNA replication origin interference increases the spacing between initiation events in human cells,” Mol Biol Cell, 17, 5337–45.

Lee, D. G. and Bell, S. P. (1997), “Architecture of the yeast origin recognition com- plex bound to origins of DNA replication,” Mol Cell Biol, 17, 7159–68.

Lee, L. A. and Orr-Weaver, T. L. (2003), “Regulation of cell cycles in Drosophila development: intrinsic and extrinsic cues,” Annu Rev Genet, 37, 545–78.

168 Lee, T.-J., Pascuzzi, P. E., Settlage, S. B., Shultz, R. W., Tanurdzic, M., Rabinowicz, P. D., Menges, M., Zheng, P., Main, D., Murray, J. A. H., Sosinski, B., Allen, G. C., Martienssen, R. A., Hanley-Bowdoin, L., Vaughn, M. W., and Thompson, W. F. (2010), “Arabidopsis thaliana chromosome 4 replicates in two phases that correlate with chromatin state,” PLoS Genet, 6, e1000982.

Lee, W., Tillo, D., Bray, N., Morse, R. H., Davis, R. W., Hughes, T. R., and Nislow, C. (2007), “A high-resolution atlas of nucleosome occupancy in yeast.” Nat Genet, 39, 1235–44.

Leu, T. H. and Hamlin, J. L. (1989), “High-resolution mapping of replication fork movement through the amplified dihydrofolate reductase domain in CHO cells by in-gel renaturation analysis,” Mol Cell Biol, 9, 523–31.

Li, H., Ruan, J., and Durbin, R. (2008), “Mapping short DNA sequencing reads and calling variants using mapping quality scores,” Genome Res, 18, 1851–8.

Li, J. J. and Herskowitz, I. (1993), “Isolation of ORC6, a component of the yeast origin recognition complex by a one-hybrid system,” Science, 262, 1870–4.

Lin, S. and Kowalski, D. (1997), “Functional equivalency and diversity of cis-acting elements among yeast replication origins,” Mol Cell Biol, 17, 5473–84.

Lipford, J. and Bell, S. (2001), “Nucleosomes positioned by ORC facilitate the ini- tiation of DNA replication.” Mol Cell, 7, 21–30.

Liu, X., Lee, C.-K., Granek, J. A., Clarke, N. D., and Lieb, J. D. (2006), “Whole- genome comparison of Leu3 binding in vitro and in vivo reveals the importance of nucleosome occupancy in target site selection,” Genome Res, 16, 1517–28.

Lucas, I., Palakodeti, A., Jiang, Y., Young, D. J., Jiang, N., Fernald, A. A., and Le Beau, M. M. (2007), “High-throughput mapping of origins of replication in human cells.” EMBO Rep, 8, 770–7.

Lucchesi, J. C., Kelly, W. G., and Panning, B. (2005), “Chromatin remodeling in dosage compensation,” Annu Rev Genet, 39, 615–51.

Luger, K., Mader,¨ A. W., Richmond, R. K., Sargent, D. F., and Richmond, T. J. (1997), “Crystal structure of the nucleosome core particle at 2.8 A resolution,” Nature, 389, 251–60.

Lunyak, V. V., Ezrokhi, M., Smith, H. S., and Gerbi, S. A. (2002), “Developmental changes in the Sciara II/9A initiation zone for DNA replication,” Mol Cell Biol, 22, 8426–37.

Lyko, F., Ramsahoye, B. H., and Jaenisch, R. (2000), “DNA methylation in Drosophila melanogaster,” Nature, 408, 538–40.

169 MacAlpine, D. and Bell, S. (2005), “A genomic view of eukaryotic DNA replication.” Chromosome Res, 13, 309–26.

MacAlpine, D., Rodriguez, H., and Bell, S. (2004), “Coordination of replication and transcription along a Drosophila chromosome.” Genes Dev, 18, 3094–105.

MacAlpine, H. K., Gordan, R., Powell, S. K., Hartemink, A. J., and MacAlpine, D. M. (2010), “Drosophila ORC localizes to open chromatin and marks sites of cohesin complex loading.” Genome Res, 20, 201–211.

MacArthur, S., Li, X.-Y., Li, J., Brown, J. B., Chu, H. C., Zeng, L., Grondona, B. P., Hechmer, A., Simirenko, L., Keranen,¨ S. V. E., Knowles, D. W., Stapleton, M., Bickel, P., Biggin, M. D., and Eisen, M. B. (2009), “Developmental roles of 21 Drosophila transcription factors are determined by quantitative differences in binding to an overlapping set of thousands of genomic regions,” Genome Biol, 10, R80.

Maicas, E. and Friesen, J. D. (1990), “A sequence pattern that occurs at the tran- scription initiation region of yeast RNA polymerase II promoters.” Nucleic Acids Res, 18, 3387–93.

Maiorano, D., Moreau, J., and M´echali, M. (2000), “XCDT1 is required for the assembly of pre-replicative complexes in Xenopus laevis,” Nature, 404, 622–5.

Malik, H. S. and Henikoff, S. (2003), “Phylogenomics of the nucleosome,” Nat Struct Biol, 10, 882–91.

Marahrens, Y. and Stillman, B. (1992), “A yeast chromosomal origin of DNA repli- cation defined by multiple functional elements.” Science, 255, 817–23.

Margueron, R. and Reinberg, D. (2010), “Chromatin structure and the inheritance of epigenetic information,” Nat Rev Genet, 11, 285–96.

Mavrich, T. N., Ioshikhes, I. P., Venters, B. J., Jiang, C., Tomsho, L. P., Qi, J., Schuster, S. C., Albert, I., and Pugh, B. F. (2008a), “A barrier nucleosome model for statistical positioning of nucleosomes throughout the yeast genome.” Genome Res, 18, 1073–83.

Mavrich, T. N., Jiang, C., Ioshikhes, I. P., Li, X., Venters, B. J., Zanton, S. J., Tomsho, L. P., Qi, J., Glaser, R. L., Schuster, S. C., Gilmour, D. S., Albert, I., and Pugh, B. F. (2008b), “Nucleosome organization in the Drosophila genome,” Nature, 453, 358–62.

McGarry, T. J. and Kirschner, M. W. (1998), “Geminin, an inhibitor of DNA replica- tion, is degraded during mitosis,” Cell, 93, 1043–53.

170 M´echali, M. (2010), “Eukaryotic DNA replication origins: many choices for appro- priate answers,” Nat Rev Mol Cell Biol, 11, 728–38.

M´echali, M. and Kearsey, S. (1984), “Lack of specific sequence requirement for DNA replication in Xenopus eggs compared with high sequence specificity in yeast,” Cell, 38, 55–64.

Melixetian, M., Ballabeni, A., Masiero, L., Gasparini, P., Zamponi, R., Bartek, J., Lukas, J., and Helin, K. (2004), “Loss of Geminin induces rereplication in the presence of functional p53,” J Cell Biol, 165, 473–82.

M´endez, J. (2009), “Temporal regulation of DNA replication in mammalian cells,” Crit Rev Biochem Mol Biol, 44, 343–51.

Merchant, A. M., Kawasaki, Y., Chen, Y., Lei, M., and Tye, B. K. (1997), “A lesion in the DNA replication initiation factor Mcm10 induces pausing of elongation forks through chromosomal replication origins in Saccharomyces cerevisiae,” Mol Cell Biol, 17, 3261–71.

Meselson, M. and Stahl, F. W. (1958), “THE REPLICATION OF DNA IN ES- CHERICHIA COLI.” Proc Natl Acad Sci U S A, 44, 671–82.

Mesner, L. D. and Hamlin, J. L. (2005), “Specific signals at the 3’ end of the DHFR gene define one boundary of the downstream origin of replication,” Genes Dev, 19, 1053–66.

Mesner, L. D., Crawford, E. L., and Hamlin, J. L. (2006), “Isolating apparently pure libraries of replication origins from complex genomes.” Mol Cell, 21, 719–26.

Mesner, L. D., Valsakumar, V., Karnani, N., Dutta, A., Hamlin, J. L., and Bekiranov, S. (2011), “Bubble-chip analysis of human origin distributions demonstrates on a genomic scale significant clustering into zones and significant association with transcription,” Genome Res, 21, 377–89.

Micklem, G., Rowley, A., Harwood, J., Nasmyth, K., and Diffley, J. F. (1993), “Yeast origin recognition complex is involved in DNA replication and transcriptional silencing.” Nature, 366, 87–9.

Mihaylov, I., Kondo, T., Jones, L., Ryzhikov, S., Tanaka, J., Zheng, J., Higa, L., Minamino, N., Cooley, L., and Zhang, H. (2002), “Control of DNA replication and chromosome ploidy by geminin and cyclin A.” Mol Cell Biol, 22, 1868–80.

Miller, C. A., Umek, R. M., and Kowalski, D. (1999), “The inefficient replication origin from yeast ribosomal DNA is naturally impaired in the ARS consensus sequence and in DNA unwinding,” Nucleic Acids Res, 27, 3921–30.

171 Mimura, S., Masuda, T., Matsui, T., and Takisawa, H. (2000), “Central role for cdc45 in establishing an initiation complex of DNA replication in Xenopus egg extracts,” Genes Cells, 5, 439–52. Mimura, S., Seki, T., Tanaka, S., and Diffley, J. F. X. (2004), “Phosphorylation- dependent binding of mitotic cyclins to Cdc6 contributes to DNA replication control,” Nature, 431, 1118–23. Miotto, B. and Struhl, K. (2010), “HBO1 histone acetylase activity is essential for DNA replication licensing and inhibited by Geminin,” Mol Cell, 37, 57–66. Misulovin, Z., Schwartz, Y. B., Li, X. Y., Kahn, T. G., Gause, M., MacArthur, S., Fay, J. C., Eisen, M. B., Pirrotta, V., Biggin, M. D., and Dorsett, D. (2008), “Association of cohesin and Nipped-B with transcriptionally active regions of the Drosophila melanogaster genome.” Chromosoma, 117, 89–102. modENCODE Consortium, Roy, S., Ernst, J., Kharchenko, P. V., Kheradpour, P., Ne- gre, N., Eaton, M. L., Landolin, J. M., Bristow, C. A., Ma, L., Lin, M. F., Washietl, S., Arshinoff, B. I., Ay, F., Meyer, P. E., Robine, N., Washington, N. L., Di Stefano, L., Berezikov, E., Brown, C. D., Candeias, R., Carlson, J. W., Carr, A., Jungreis, I., Marbach, D., Sealfon, R., Tolstorukov, M. Y., Will, S., Alekseyenko, A. A., Artieri, C., Booth, B. W., Brooks, A. N., Dai, Q., Davis, C. A., Duff, M. O., Feng, X., Gorchakov, A. A., Gu, T., Henikoff, J. G., Kapranov, P., Li, R., MacAlpine, H. K., Malone, J., Minoda, A., Nordman, J., Okamura, K., Perry, M., Powell, S. K., Rid- dle, N. C., Sakai, A., Samsonova, A., Sandler, J. E., Schwartz, Y. B., Sher, N., Spokony, R., Sturgill, D., van Baren, M., Wan, K. H., Yang, L., Yu, C., Feingold, E., Good, P., Guyer, M., Lowdon, R., Ahmad, K., Andrews, J., Berger, B., Brenner, S. E., Brent, M. R., Cherbas, L., Elgin, S. C. R., Gingeras, T. R., Grossman, R., Hoskins, R. A., Kaufman, T. C., Kent, W., Kuroda, M. I., Orr-Weaver, T., Perri- mon, N., Pirrotta, V., Posakony, J. W., Ren, B., Russell, S., Cherbas, P., Graveley, B. R., Lewis, S., Micklem, G., Oliver, B., Park, P. J., Celniker, S. E., Henikoff, S., Karpen, G. H., Lai, E. C., MacAlpine, D. M., Stein, L. D., White, K. P., and Kel- lis, M. (2010), “Identification of functional elements and regulatory circuits by Drosophila modENCODE,” Science, 330, 1787–97. Moll, T., Tebb, G., Surana, U., Robitsch, H., and Nasmyth, K. (1991), “The role of phosphorylation and the CDC28 protein kinase in cell cycle-regulated nuclear import of the S. cerevisiae transcription factor SWI5,” Cell, 66, 743–58. Moorman, C., Sun, L. V., Wang, J., de Wit, E., Talhout, W., Ward, L. D., Greil, F., Lu, X.-J., White, K. P., Bussemaker, H. J., and van Steensel, B. (2006), “Hotspots of transcription factor colocalization in the genome of Drosophila melanogaster,” Proc Natl Acad Sci U S A, 103, 12027–32. Mori, S. and Shirahige, K. (2007), “Perturbation of the activity of replication origin by meiosis-specific transcription.” J Biol Chem, 282, 4447–52.

172 Muller,¨ P., Park, S., Shor, E., Huebert, D. J., Warren, C. L., Ansari, A. Z., Weinre- ich, M., Eaton, M. L., MacAlpine, D. M., and Fox, C. A. (2010), “The conserved bromo-adjacent homology domain of yeast Orc1 functions in the selection of DNA replication origins within chromatin,” Genes Dev, 24, 1418–33.

Nakamura, H., Morita, T., and Sato, C. (1986), “Structural organizations of repli- con domains during DNA synthetic phase in the mammalian nucleus,” Exp Cell Res, 165, 291–7.

Narlikar, L., Gordan,ˆ R., and Hartemink, A. (2007), “A nucleosome-guided map of transcription factor binding sites in yeast,” PLoS Computational Biology, 3, e215.

Natale, D. A., Umek, R. M., and Kowalski, D. (1993), “Ease of DNA unwinding is a conserved property of yeast replication origins,” Nucleic Acids Res, 21, 555–60.

Needleman, S. B. and Wunsch, C. D. (1970), “A general method applicable to the search for similarities in the amino acid sequence of two proteins.” J Mol Biol, 48, 443–453.

N`egre, N., Brown, C. D., Ma, L., Bristow, C. A., Miller, S. W., Wagner, U., Kherad- pour, P., Eaton, M. L., Loriaux, P., Sealfon, R., Li, Z., Ishii, H., Spokony, R. F., Chen, J., Hwang, L., Cheng, C., Auburn, R. P., Davis, M. B., Domanus, M., Shah, P. K., Morrison, C. A., Zieba, J., Suchy, S., Senderowicz, L., Victorsen, A., Bild, N. A., Grundstad, A. J., Hanley, D., MacAlpine, D. M., Mannervik, M., Venken, K., Bellen, H., White, R., Gerstein, M., Russell, S., Grossman, R. L., Ren, B., Posakony, J. W., Kellis, M., and White, K. P. (2011), “A cis-regulatory map of the Drosophila genome,” Nature, 471, 527–531.

Newlon, C. S. (1996), DNA replication in eukaryotic cells, chap. DNA replication in yeast, pp. 873–914, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y.

Nguyen, V. Q., Co, C., and Li, J. J. (2001), “Cyclin-dependent kinases prevent DNA re-replication through multiple mechanisms,” Nature, 411, 1068–73.

Nishitani, H., Lygerou, Z., Nishimoto, T., and Nurse, P. (2000), “The Cdt1 protein is required to license DNA for replication in fission yeast,” Nature, 404, 625–8.

Nougar`ede, R., Della Seta, F., Zarzov, P., and Schwob, E. (2000), “Hierarchy of S- phase-promoting factors: yeast Dbf4-Cdc7 kinase requires prior S-phase cyclin- dependent kinase activation,” Mol Cell Biol, 20, 3795–806.

Nyberg, K. A., Michelson, R. J., Putnam, C. W., and Weinert, T. A. (2002), “Toward maintaining the genome: DNA damage and replication checkpoints,” Annu Rev Genet, 36, 617–56.

173 Olins, A. and Olins, D. (1974), “Spheroid chromatin units (v bodies),” Science, 183, 330–2.

Orr-Weaver, T. L. and Spradling, A. C. (1986), “Drosophila chorion gene amplifica- tion requires an upstream region regulating s18 transcription,” Mol Cell Biol, 6, 4624–33.

Orr-Weaver, T. L., Johnston, C. G., and Spradling, A. C. (1989), “The role of ACE3 in Drosophila chorion gene amplification.” EMBO J, 8, 4153–62.

Pacek, M. and Walter, J. C. (2004), “A requirement for MCM7 and Cdc45 in chro- mosome unwinding during eukaryotic DNA replication,” EMBO J, 23, 3667–76.

Pak, D., Pflumm, M., Chesnokov, I., Huang, D., Kellum, R., Marr, J., Romanowski, P., and Botchan, M. (1997), “Association of the origin recognition complex with heterochromatin and HP1 in higher eukaryotes.” Cell, 91, 311–23.

Panning, M. M. and Gilbert, D. M. (2005), “Spatio-temporal organization of DNA replication in murine embryonic stem, primary, and immortalized cells,” J Cell Biochem, 95, 74–82.

Pape, T., Meka, H., Chen, S., Vicentini, G., van Heel, M., and Onesti, S. (2003), “Hexameric ring structure of the full-length archaeal MCM protein complex.” EMBO Rep, 4, 1079–1083.

Pappas, Jr, D. L., Frisch, R., and Weinreich, M. (2004), “The NAD(+)-dependent Sir2p histone deacetylase is a negative regulator of chromosomal DNA replica- tion,” Genes Dev, 18, 769–81.

Parzen, E. (1962), “On Estimation of a Probability Density Function and Mode,” The Annals of Mathematical Statistics, 33, 1065–1076.

Pasero, P., Braguglia, D., and Gasser, S. M. (1997), “ORC-dependent and origin- specific initiation of DNA replication at defined foci in isolated yeast nuclei,” Genes Dev, 11, 1504–18.

Pearson, C. E., Zorbas, H., Price, G. B., and Zannis-Hadjopoulos, M. (1996), “In- verted repeats, stem-loops, and cruciforms: significance for initiation of DNA replication,” J Cell Biochem, 63, 1–22.

Perkins, G. and Diffley, J. F. (1998), “Nucleotide-dependent prereplicative complex assembly by Cdc6p, a homolog of eukaryotic and prokaryotic clamp-loaders,” Mol Cell, 2, 23–32.

Petesch, S. J. and Lis, J. T. (2008), “Rapid, transcription-independent loss of nu- cleosomes over a large chromatin domain at Hsp70 loci,” Cell, 134, 74–84.

174 Pink, C. J. and Hurst, L. D. (2010), “Timing of replication is a determinant of neutral substitution rates but does not explain slow Y chromosome evolution in rodents,” Mol Biol Evol, 27, 1077–86. Plaut, W. and Nash, D. (1964), The Role of Chromosomes in Development, p. 113, Academic Press, New York. Ptashne, M. (2007), “On the use of the word ’epigenetic’,” Curr Biol, 17, R233–6. Quinn, L. M., Herr, A., McGarry, T. J., and Richardson, H. (2001), “The Drosophila Geminin homolog: roles for Geminin in limiting DNA replication, in anaphase and in neurogenesis,” Genes Dev, 15, 2741–54. R Development Core Team (2008), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3- 900051-07-0. Radman-Livaja, M. and Rando, O. J. (2010), “Nucleosome positioning: how is it established, and why does it matter?” Dev Biol, 339, 258–66. Radman-Livaja, M., Liu, C. L., Friedman, N., Schreiber, S. L., and Rando, O. J. (2010), “Replication and active demethylation represent partially overlap- ping mechanisms for erasure of H3K4me3 in budding yeast,” PLoS Genet, 6, e1000837. Raghuraman, M., Winzeler, E., Collingwood, D., Hunt, S., Wodicka, L., Conway, A., Lockhart, D., Davis, R., Brewer, B., and Fangman, W. (2001), “Replication dynamics of the yeast genome.” Science, 294, 115–21. Raghuraman, M. K. and Brewer, B. J. (2010), “Molecular analysis of the replication program in unicellular model organisms,” Chromosome Res, 18, 19–34. Raghuraman, M. K., Brewer, B. J., and Fangman, W. L. (1997), “Cell cycle- dependent establishment of a late replication program,” Science, 276, 806–9. Randell, J. C. W., Bowers, J. L., Rodr´ıguez, H. K., and Bell, S. P. (2006), “Sequential ATP hydrolysis by Cdc6 and ORC directs loading of the Mcm2-7 helicase,” Mol Cell, 21, 29–39. Rando, O. J. and Chang, H. Y. (2009), “Genome-wide views of chromatin struc- ture.” Annu Rev Biochem, 78, 245–271. Rao, H., Marahrens, Y., and Stillman, B. (1994), “Functional conservation of mul- tiple elements in yeast chromosomal replicators.” Mol Cell Biol, 14, 7643–51. Razin, S. V., Iarovaia, O. V., Sjakste, N., Sjakste, T., Bagdoniene, L., Rynditch, A. V., Eivazova, E. R., Lipinski, M., and Vassetzky, Y. S. (2007), “Chromatin domains and regulation of transcription,” J Mol Biol, 369, 597–607.

175 Redon, C., Pilch, D., Rogakou, E., Sedelnikova, O., Newrock, K., and Bonner, W. (2002), “Histone H2A variants H2AX and H2AZ,” Curr Opin Genet Dev, 12, 162– 9.

Remus, D., Beall, E., and Botchan, M. (2004), “DNA topology, not DNA sequence, is a critical determinant for Drosophila ORC-DNA binding.” EMBO J, 23, 897–907.

Remus, D., Beuron, F., Tolun, G., Griffith, J. D., Morris, E. P., and Diffley, J. F. X. (2009), “Concerted loading of Mcm2-7 double hexamers around DNA during DNA replication origin licensing.” Cell, 139, 719–730.

Ren, B., Robert, F., Wyrick, J., Aparicio, O., Jennings, E., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., Volkert, T., Wilson, C., Bell, S., and Young, R. (2000), “Genome-wide location and function of DNA binding proteins.” Sci- ence, 290, 2306–9.

Riddle, N. C., Shaffer, C. D., and Elgin, S. C. R. (2009), “A lot about a little dot - lessons learned from Drosophila melanogaster chromosome 4,” Biochem Cell Biol, 87, 229–41.

Robertson, G., Hirst, M., Bainbridge, M., Bilenky, M., Zhao, Y., Zeng, T., Eu- skirchen, G., Bernier, B., Varhol, R., Delaney, A., Thiessen, N., Griffith, O. L., He, A., Marra, M., Snyder, M., and Jones, S. (2007), “Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing.” Nat Methods, 4, 651–7.

Rozowsky, J., Euskirchen, G., Auerbach, R. K., Zhang, Z. D., Gibson, T., Bjornson, R., Carriero, N., Snyder, M., and Gerstein, M. B. (2009), “PeakSeq enables sys- tematic scoring of ChIP-seq experiments relative to controls.” Nat Biotechnol, 27, 66–75.

Rusche, L. N., Kirchmaier, A. L., and Rine, J. (2003), “The establishment, inher- itance, and function of silenced chromatin in Saccharomyces cerevisiae,” Annu Rev Biochem, 72, 481–516.

Santocanale, C. and Diffley, J. (1998), “A Mec1- and Rad53-dependent checkpoint controls late-firing origins of DNA replication.” Nature, 395, 615–8.

Santocanale, C. and Diffley, J. F. (1996), “ORC- and Cdc6-dependent complexes at active and inactive chromosomal replication origins in Saccharomyces cere- visiae,” EMBO J, 15, 6671–9.

Santocanale, C., Sharma, K., and Diffley, J. F. (1999), “Activation of dormant ori- gins of DNA replication in budding yeast,” Genes Dev, 13, 2360–4.

176 Schaarschmidt, D., Baltin, J., Stehle, I. M., Lipps, H. J., and Knippers, R. (2004), “An episomal mammalian replicon: sequence-independent binding of the origin recognition complex,” EMBO J, 23, 191–201.

Schneider, I. (1972), “Cell lines derived from late embryonic stages of Drosophila melanogaster,” J Embryol Exp Morphol, 27, 353–65.

Schubeler,¨ D., Scalzo, D., Kooperberg, C., van Steensel, B., Delrow, J., and Groudine, M. (2002), “Genome-wide DNA replication profile for Drosophila melanogaster: a link between transcription and replication timing.” Nat Genet, 32, 438–42.

Schwaiger, M., Stadler, M. B., Bell, O., Kohler, H., Oakeley, E. J., and Schubeler, D. (2009), “Chromatin state marks cell-type- and gender-specific replication of the Drosophila genome.” Genes Dev, 23, 589–601.

Sclafani, R. A. (2000), “Cdc7p-Dbf4p becomes famous in the cell cycle,” J Cell Sci, 113 ( Pt 12), 2111–7.

Sclafani, R. A. and Holzen, T. M. (2007), “Cell cycle regulation of DNA replication.” Annu Rev Genet, 41, 237–80.

Segal, E. and Widom, J. (2009), “Poly(dA:dT) tracts: major determinants of nucle- osome organization.” Curr Opin Struct Biol, 19, 65–71.

Segal, E., Fondufe-Mittendorf, Y., Chen, L., Thastrom, A., Field, Y., Moore, I. K., Wang, J.-P. Z., and Widom, J. (2006), “A genomic code for nucleosome position- ing.” Nature, 442, 772–778.

Segurado, M., de Luis, A., and Antequera, F. (2003), “Genome-wide distribution of DNA replication origins at A+T-rich islands in Schizosaccharomyces pombe,” EMBO Rep, 4, 1048–53.

Seo, J., Chung, Y. S., Sharma, G. G., Moon, E., Burack, W. R., Pandita, T. K., and Choi, K. (2005), “Cdt1 transgenic mice develop lymphoblastic lymphoma in the absence of p53,” Oncogene, 24, 8176–86.

Sequeira-Mendes, J., Diaz-Uriarte, R., Apedaile, A., Huntley, D., Brockdorff, N., and Gomez, M. (2009), “Transcription initiation activity sets replication origin efficiency in mammalian cells.” PLoS Genet, 5, e1000446.

Sher, N., Li, S., Bell, G., Eng, T., Nordman, J., Eaton, M. L., MacAlpine, D. M., and Orr-Weaver, T. L. (2011), “Developmental Control of Gene Copy Number by Repression of Replication Initiation and Fork Progression,” submitted.

177 Sheu, Y.-J. and Stillman, B. (2006), “Cdc7-Dbf4 phosphorylates MCM proteins via a docking site-mediated mechanism to promote S phase progression,” Mol Cell, 24, 101–13.

Shimada, K., Pasero, P., and Gasser, S. M. (2002), “ORC and the intra-S-phase checkpoint: a threshold regulates Rad53p activation in S phase,” Genes Dev, 16, 3236–52.

Shimada, K., Oma, Y., Schleker, T., Kugou, K., Ohta, K., Harata, M., and Gasser, S. M. (2008), “Ino80 chromatin remodeling complex promotes recovery of stalled replication forks.” Curr Biol, 18, 566–75.

Shirahige, K., Hori, Y., Shiraishi, K., Yamashita, M., Takahashi, K., Obuse, C., Tsu- rimoto, T., and Yoshikawa, H. (1998), “Regulation of DNA-replication origins during cell-cycle progression.” Nature, 395, 618–21.

Shivaswamy, S., Bhinge, A., Zhao, Y., Jones, S., Hirst, M., and Iyer, V. R. (2008), “Dynamic remodeling of individual nucleosomes across a eukaryotic genome in response to transcriptional perturbation.” PLoS Biol, 6, e65.

Simpson, R. T. (1990), “Nucleosome positioning can affect the function of a cis- acting DNA element in vivo.” Nature, 343, 387–389.

Smith, V. A., Yu, J., Smulders, T. V., Hartemink, A. J., and Jarvis, E. D. (2006), “Computational inference of neural information flow networks,” PLoS Comput Biol, 2, e161.

Song, J. S., Johnson, W. E., Zhu, X., Zhang, X., Li, W., Manrai, A. K., Liu, J. S., Chen, R., and Liu, X. S. (2007), “Model-based analysis of two-color arrays (MA2C).” Genome Biol, 8, R178.

Song, L. and Crawford, G. E. (2010), “DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells,” Cold Spring Harb Protoc, 2010, pdb.prot5384.

Song, Q. and Smith, A. D. (2011), “Identifying dispersed epigenomic domains from ChIP-Seq data,” Bioinformatics, 27, 870–1.

Speck, C., Chen, Z., Li, H., and Stillman, B. (2005), “ATPase-dependent cooperative binding of ORC and Cdc6 to origin DNA.” Nat Struct Mol Biol, 12, 965–71.

Stamatoyannopoulos, J. A., Adzhubei, I., Thurman, R. E., Kryukov, G. V., Mirkin, S. M., and Sunyaev, S. R. (2009), “Human mutation rate associated with DNA replication timing.” Nat Genet, 41, 393–5.

178 Stein, A., Takasuka, T. E., and Collings, C. K. (2010), “Are nucleosome positions in vivo primarily determined by histone-DNA sequence preferences?” Nucleic Acids Res, 38, 709–19.

Stevenson, J. B. and Gottschling, D. E. (1999), “Telomeric chromatin modulates replication timing near chromosome ends,” Genes Dev, 13, 146–51.

Stinchcomb, D. T., Struhl, K., and Davis, R. W. (1979), “Isolation and characterisa- tion of a yeast chromosomal replicator.” Nature, 282, 39–43.

Suganuma, T., Guti´errez, J. L., Li, B., Florens, L., Swanson, S. K., Washburn, M. P., Abmayr, S. M., and Workman, J. L. (2008), “ATAC is a double histone acetyl- transferase complex that stimulates nucleosome sliding,” Nat Struct Mol Biol, 15, 364–72.

Takayama, Y., Kamimura, Y., Okawa, M., Muramatsu, S., Sugino, A., and Araki, H. (2003), “GINS, a novel multiprotein complex required for chromosomal DNA replication in budding yeast,” Genes Dev, 17, 1153–65.

Talbert, P. B. and Henikoff, S. (2010), “Histone variants–ancient wrap artists of the epigenome,” Nat Rev Mol Cell Biol, 11, 264–75.

Tanaka, S. and Diffley, J. F. X. (2002), “Interdependent nuclear accumulation of budding yeast Cdt1 and Mcm2-7 during G1 phase,” Nat Cell Biol, 4, 198–207.

Tanny, R., MacAlpine, D., Blitzblau, H., and Bell, S. (2006), “Genome-wide analysis of re-replication reveals inhibitory controls that target multiple stages of replica- tion initiation.” Mol Biol Cell, 17, 2415–23.

Tardat, M., Brustel, J., Kirsh, O., Lefevbre, C., Callanan, M., Sardet, C., and Julien, E. (2010), “The histone H4 Lys 20 methyltransferase PR-Set7 regulates replica- tion origins in mammalian cells,” Nat Cell Biol, 12, 1086–93.

Taylor, J. H. (1960), “Asynchronous duplication of chromosomes in cultured cells of Chinese hamster,” J Biophys Biochem Cytol, 7, 455–64.

Theis, J. F. and Newlon, C. S. (1994), “Domain B of ARS307 contains two func- tional elements and contributes to chromosomal replication origin function,” Mol Cell Biol, 14, 7652–9.

Theis, J. F., Yang, C., Schaefer, C. B., and Newlon, C. S. (1999), “DNA sequence and functional analysis of homologous ARS elements of Saccharomyces cerevisiae and S. carlsbergensis,” Genetics, 152, 943–52.

Thoma, F., Bergman, L. W., and Simpson, R. T. (1984), “Nuclease digestion of circular TRP1ARS1 chromatin reveals positioned nucleosomes separated by nuclease-sensitive regions,” J Mol Biol, 177, 715–33.

179 Thomer, M., May, N. R., Aggarwal, B. D., Kwok, G., and Calvi, B. R. (2004), “Drosophila double-parked is sufficient to induce re-replication during develop- ment and is regulated by cyclin E/CDK2,” Development, 131, 4807–18.

Torres, E. M., Williams, B. R., and Amon, A. (2008), “Aneuploidy: cells losing their balance,” Genetics, 179, 737–46.

Trojer, P. and Reinberg, D. (2007), “Facultative heterochromatin: is there a distinc- tive molecular signature?” Mol Cell, 28, 1–13.

Tsakraklides, V. and Bell, S. P. (2010), “Dynamics of pre-replicative complex as- sembly,” J Biol Chem, in press.

Tsuyama, T., Tada, S., Watanabe, S., Seki, M., and Enomoto, T. (2005), “Licensing for DNA replication requires a strict sequential assembly of Cdc6 and Cdt1 onto chromatin in Xenopus egg extracts,” Nucleic Acids Res, 33, 765–75.

Van Hatten, R. A., Tutter, A. V., Holway, A. H., Khederian, A. M., Walter, J. C., and Michael, W. M. (2002), “The Xenopus Xmus101 protein is required for the recruitment of Cdc45 to origins of DNA replication,” J Cell Biol, 159, 541–7.

Vary, Jr, J. C., Fazzio, T. G., and Tsukiyama, T. (2004), “Assembly of yeast chromatin using ISWI complexes.” Methods Enzymol, 375, 88–102.

Vashee, S., Cvetic, C., Lu, W., Simancek, P., Kelly, T., and Walter, J. (2003), “Sequence-independent DNA binding and replication initiation by the human origin recognition complex.” Genes Dev, 17, 1894–908.

Vaziri, C., Saxena, S., Jeon, Y., Lee, C., Murata, K., Machida, Y., Wagle, N., Hwang, D., and Dutta, A. (2003), “A p53-dependent checkpoint pathway pre- vents rereplication.” Mol Cell, 11, 997–1008.

Viggiani, C. J., Knott, S. R. V., and Aparicio, O. M. (2010), “Genome-wide analysis of DNA synthesis by BrdU immunoprecipitation on tiling microarrays (BrdU-IP-chip) in Saccharomyces cerevisiae,” Cold Spring Harb Protoc, 2010, pdb.prot5385.

Vincent, J. A., Kwong, T. J., and Tsukiyama, T. (2008), “ATP-dependent chromatin remodeling shapes the DNA replication landscape,” Nat Struct Mol Biol, 15, 477– 84.

Vogelauer, M., Rubbi, L., Lucas, I., Brewer, B., and Grunstein, M. (2002), “Histone acetylation regulates the time of replication origin firing.” Mol Cell, 10, 1223–33.

Vujcic, M., Miller, C. A., and Kowalski, D. (1999), “Activation of silent replication origins at autonomously replicating sequence elements near the HML locus in budding yeast,” Mol Cell Biol, 19, 6098–109.

180 Wade, H. E., Kobayashi, S., Eaton, M. L., Jansen, M. S., Lobenhofer, E. K., Lupien, M., Geistlinger, T. R., Zhu, W., Nevins, J. R., Brown, M., Otteson, D. C., and McDonnell, D. P. (2010), “Multimodal regulation of E2F1 gene expression by progestins,” Mol Cell Biol, 30, 1866–77.

Walker, S. S., Francesconi, S. C., and Eisenberg, S. (1990), “A DNA replication enhancer in Saccharomyces cerevisiae,” Proc Natl Acad Sci U S A, 87, 4665–9.

Walter, J. and Newport, J. (2000), “Initiation of eukaryotic DNA replication: ori- gin unwinding and sequential chromatin association of Cdc45, RPA, and DNA polymerase alpha,” Mol Cell, 5, 617–27.

Walter, J. C. and Araki, H. (2006), DNA replication and human disease, chap. Acti- vation of pre-replication complexes, pp. 89–104, Cold Spring Harbor Laboratory Press, New York.

Ward, Joe H., J. (1963), “Hierarchical Grouping to Optimize an Objective Func- tion,” Journal of the American Statistical Association, 58, pp. 236–244.

Watson, J. D. and Crick, F. H. (1953a), “Genetical implications of the structure of deoxyribonucleic acid,” Nature, 171, 964–7.

Watson, J. D. and Crick, F. H. (1953b), “Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid.” Nature, 171, 737–8.

Weaver, B. A. and Cleveland, D. W. (2008), “The aneuploidy paradox in cell growth and tumorigenesis,” Cancer Cell, 14, 431–3.

Weaver, B. A. A. and Cleveland, D. W. (2006), “Does aneuploidy cause cancer?” Curr Opin Cell Biol, 18, 658–67.

Weinreich, M., Liang, C., and Stillman, B. (1999), “The Cdc6p nucleotide-binding motif is required for loading mcm proteins onto chromatin,” Proc Natl Acad Sci USA, 96, 441–6.

Whittaker, A. J., Royzman, I., and Orr-Weaver, T. L. (2000), “Drosophila double parked: a conserved, essential replication protein that colocalizes with the ori- gin recognition complex and links DNA replication with mitosis and the down- regulation of S phase transcripts,” Genes Dev, 14, 1765–76.

Williams, B. R. and Amon, A. (2009), “Aneuploidy: cancer’s fatal flaw?” Cancer Res, 69, 5289–91.

Wilmes, G. M. and Bell, S. P. (2002), “The B2 element of the Saccharomyces cere- visiae ARS1 origin of replication requires specific sequences to facilitate pre-RC formation.” Proc Natl Acad Sci U S A, 99, 101–6.

181 Wilmes, G. M., Archambault, V., Austin, R. J., Jacobson, M. D., Bell, S. P., and Cross, F. R. (2004), “Interaction of the S-phase cyclin Clb5 with an ”RXL” docking sequence in the initiator protein Orc6 provides an origin-localized replication control switch,” Genes Dev, 18, 981–91.

Wohlschlegel, J. A., Dwyer, B. T., Dhar, S. K., Cvetic, C., Walter, J. C., and Dutta, A. (2000), “Inhibition of eukaryotic DNA replication by geminin binding to Cdt1,” Science, 290, 2309–12.

Wohlschlegel, J. A., Dhar, S. K., Prokhorova, T. A., Dutta, A., and Walter, J. C. (2002), “Xenopus Mcm10 binds to origins of DNA replication after Mcm2-7 and stimulates origin binding of Cdc45.” Mol Cell, 9, 233–40.

Woodcock, C. L. and Ghosh, R. P. (2010), “Chromatin higher-order structure and dynamics,” Cold Spring Harb Perspect Biol, 2, a000596.

Woodfine, K., Fiegler, H., Beare, D., Collins, J., McCann, O., Young, B., Debernardi, S., Mott, R., Dunham, I., and Carter, N. (2004), “Replication timing of the human genome.” Hum Mol Genet, 13, 191–202.

Workman, J. L. and Kingston, R. E. (1998), “Alteration of nucleosome structure as a mechanism of transcriptional regulation,” Annu Rev Biochem, 67, 545–79.

Wu, J. R. and Gilbert, D. M. (1996), “A distinct G1 step required to specify the Chinese hamster DHFR replication origin,” Science, 271, 1270–2.

Wyrick, J., Aparicio, J., Chen, T., Barnett, J., Jennings, E., Young, R., Bell, S., and Aparicio, O. (2001), “Genome-wide distribution of ORC and MCM proteins in S. cerevisiae: high-resolution mapping of replication origins.” Science, 294, 2357–60.

Xu, W., Aparicio, J. G., Aparicio, O. M., and Tavare, S. (2006), “Genome-wide mapping of ORC and Mcm2p binding sites on tiling arrays and identification of essential ARS consensus sequences in S. cerevisiae.” BMC Genomics, 7, 276.

Yabuki, N., Terashima, H., and Kitada, K. (2002), “Mapping of early firing origins on a replication profile of budding yeast.” Genes Cells, 7, 781–9.

Yamashita, M., Hori, Y., Shinomiya, T., Obuse, C., Tsurimoto, T., Yoshikawa, H., and Shirahige, K. (1997), “The efficiency and timing of initiation of replication of multiple replicons of Saccharomyces cerevisiae chromosome VI,” Genes Cells, 2, 655–65.

Yassour, M., Kaplan, T., Jaimovich, A., and Friedman, N. (2008), “Nucleosome positioning from tiling microarray data,” Bioinformatics, 24, i139–46.

182 Yuan, G. C., Liu, Y. J., Dion, M. F., Slack, M. D., Wu, L. F., Altschuler, S. J., and Rando, O. J. (2005), “Genome-scale identification of nucleosome positions in S. cerevisiae.” Science, 309, 626–30.

Zhang, J., Xu, F., Hashimshony, T., Keshet, I., and Cedar, H. (2002a), “Establish- ment of transcriptional competence in early and late S phase.” Nature, 420, 198– 202.

Zhang, Y., Shin, H., Song, J. S., Lei, Y., and Liu, X. S. (2008a), “Identifying posi- tioned nucleosomes with epigenetic marks in human from ChIP-Seq,” BMC Ge- nomics, 9, 537.

Zhang, Y., Liu, T., Meyer, C. A., Eeckhoute, J., Johnson, D. S., Bernstein, B. E., Nussbaum, C., Myers, R. M., Brown, M., Li, W., and Liu, X. S. (2008b), “Model- based analysis of ChIP-Seq (MACS).” Genome Biol, 9, R137.

Zhang, Y., Moqtaderi, Z., Rattner, B. P., Euskirchen, G., Snyder, M., Kadonaga, J. T., Liu, X. S., and Struhl, K. (2009), “Intrinsic histone-DNA interactions are not the major determinant of nucleosome positions in vivo,” Nat Struct Mol Biol, 16, 847–52.

Zhang, Y., Malone, J. H., Powell, S. K., Periwal, V., Spana, E., Macalpine, D. M., and Oliver, B. (2010), “Expression in aneuploid Drosophila S2 cells,” PLoS Biol, 8, e1000320.

Zhang, Z. and Pugh, B. F. (2011), “High-resolution genome-wide mapping of the primary structure of chromatin,” Cell, 144, 175–86.

Zhang, Z., Hayashi, M. K., Merkel, O., Stillman, B., and Xu, R. M. (2002b), “Struc- ture and function of the BAH-containing domain of Orc1p in epigenetic silenc- ing.” EMBO J, 21, 4600–11.

Zhou, V. W., Goren, A., and Bernstein, B. E. (2011), “Charting histone modifica- tions and the functional organization of mammalian genomes,” Nat Rev Genet, 12, 7–18.

Zhu, W., Chen, Y., and Dutta, A. (2004), “Rereplication by depletion of geminin is seen regardless of p53 status and activates a G2/M checkpoint.” Mol Cell Biol, 24, 7140–50.

Zou, L. and Stillman, B. (1998), “Formation of a preinitiation complex by S-phase cyclin CDK-dependent loading of Cdc45p onto chromatin,” Science, 280, 593–6.

Zou, L. and Stillman, B. (2000), “Assembly of a complex containing Cdc45p, repli- cation protein A, and Mcm2p at replication origins controlled by S-phase cyclin- dependent kinases and Cdc7p-Dbf4p kinase.” Mol Cell Biol, 20, 3086–3096.

183 Zou, L., Mitchell, J., and Stillman, B. (1997), “CDC45, a novel yeast gene that functions with the origin recognition complex and Mcm proteins in initiation of DNA replication,” Mol Cell Biol, 17, 553–63.

184 Biography

Matthew Lucas Eaton was born March 3, 1982 in Rockport, Maine. He grew up in the nearby sleepy fishing village of Thomaston, where he attended school until 11th grade, when he enrolled at the Northfield Mount Hermon School in Western Massachusetts. He matriculated at Wesleyan University in 2000, and earned a BA in Philosophy and Computer Science with the senior prize in computer science in 2004. In 2006 he joined the 4th class of the Duke University Computational Biology and Bioinformatics Ph.D. program. In 2010 he became engaged to Dr. Hilary Wade, a postdoc in Myles Brown’s lab at the Dana Faber. He has accepted a position as a postdoctoral researcher in Manolis Kellis’ lab at MIT.

Publications (* denotes co-first author) 1.N ´egre N, . . . , Eaton ML, . . . , White KP (2011), “A cis-regulatory map of the Drosophila genome,” Nature, 471(7339), 527-31. 2. modENCODE Consortium, . . . , Eaton ML*, . . . , Kellis M (2010), “Identification of functional elements and regulatory circuits by Drosophila modENCODE,” Science, 330(6012), 1787-97. 3. Eaton ML, Prinz JA, MacAlpine HK, Tretyakov G, Kharchenko PV, MacAlpine DM. (2011), “Chromatin signatures of the Drosophila replication program,” Genome Res, 21(2), 164-74. 4. Dwyer MA, Joseph JD, Wade HE, Eaton ML, Kunder RS, Kazmin D, Chang CY, McDonnell DP (2010), “WNT11 expression is induced by estrogen-related receptor alpha and beta-catenin and acts in an autocrine manner to increase cancer cell migration,” Cancer Res, 70(22):9298- 308. 5. Eaton ML*, Galani K*, Kang S, Bell SP, and MacAlpine DM (2010), “Conserved nucleosome positioning defines replication origins,” Genes Dev, 24(8), 748-53. 6.M uller¨ P, Park S, Shor E, Huebert DJ, Warren CL, Ansari AZ, Weinreich M, Eaton ML, MacAlpine DM, Fox CA (2010), “The conserved bromo-adjacent homology domain of yeast Orc1 functions in the selection of DNA replication origins within chromatin,” Genes Dev, 24(13), 1418-33. 7. Wade HE, Kobayashi S, Eaton ML, Jansen MS, Lobenhofer EK, Lupien M, Geistlinger TR, Zhu W, Nevins JR, Brown M, Otteson DC, McDonnell DP (2010), “Multimodal regulation of E2F1 gene expression by progestins,” Mol Cell Biol, 30(8), 1866-77. 8. Weir M, Eaton M, Rice M (2006), “Challenging the spliceosome machine,” Genome Biol, 7(1), R3.

185