Conformation and Dynamics of the Mammalian

Lyndon N. Zhang Bachelor of Science in Biology Stanford University, 2012

Submitted to the Department of Biology in partial fulfillmentoftherequirementsforthe degree of Master of Science at the Massachusetts Institute of Technology. February 2016 ©2016MassachusettsInstituteofTechnology.Allrightsreserved. The author hereby grants to MIT permission to reproduce and todistributepubliclypaper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Lyndon N. Zhang Department of Biology October 19, 2015 Author Signature:

Ibrahim I. Cissé Assistant Professor of Physics Thesis Supervisor Certification Signature:

Amy E. Keating Professor of Biology and Biological Engineering Co-Director, Biology Graduate Program Acceptance Signature: Conformation and Dynamics of the Mammalian Chromosome Lyndon N. Zhang

Abstract The control of transcription represents a fundamental, initial mechanism by which the reg- ulation of expression is implemented. However, while much research has been done on the biochemistry and cellular function of transcription, comparatively little is known on the dynamics of transcriptional mechanisms, their impact on chromatin structure, and concomitant functional consequences. Employing chromatin immunoprecipitation measure- ments, we report progress towards this goal. We characterize the ensemble chromosome conformation in mouse embryonic stem cells, by measuring interaction, or contact, prob- abilities between distal genomic loci. We map and describe chromosome loops, consisting of two interacting CTCF sites co-bound by cohesin, that maintain the expression of known to promote cell identity, and restrict the expression of genes specifying repressed developmental lineages.

1 Contributions and Prior Publication The writing of this thesis is entirely the work of the author, with background and methods sourced from the indicated references. The data analysis methods and results, described in sections 3 and 4, respectively, were based on the work of Charles Y. Lin, David A. Orlando, and Brian J. Abraham, and performed in collaboration and under the supervision of Zi Peng Fan and Brian J. Abraham. As the research was performed collaboratively, the author has elected to use “we” in the description of the work. Experimental methods were performed in collaboration with and courtesy of Jill M. Dowen, Gang Ren, Denes Hnisz, Abraham S. Weintraub, Jurian Schuijers, and Alla Sigova. Direct intellectual oversight originated from Zi Peng Fan, with additional oversight originating from Tong Ihn Lee and Richard A. Young. Work was performed primarily in the lab of Richard A. Young, in collaboration with the lab of Keji Zhao. Phillip A. Sharp and members of the Sharp lab provided valuable discussion and insight. All research results were published in reference 25, reproduced here: Jill M. Dowen, Zi Peng Fan, Denes Hnisz, Gang Ren, et. al. “Control of Cell Identity Genes Occurs in Insulated Neighborhoods in Mammalian .” Cell. 2014; 159:374-87. The paper is published under the Cell open access policy, and listed in Pubmed Central (PMC4197132). All figures in section 4, with the exception of Figure 10 (which was adapted from reference 23), were adapted from figures in this paper. Research in section 6 was conducted in the lab of Ibrahim I. Cissé. The project was conceived and supervised by Ibrahim I. Cissé. Takuma Inoue implemented the optical microscope used. The cell line was generated by various members of the Cissé lab. Analysis was implemented according to details in the manuscript Izeddin, et. al., “Single-molecule tracking in live cells reveals distinct target-search strategies of transcription factors in the

1 nucleus.” eLife. 2014; 3:e02230, from the laboratories of Xavier Darzacq and Maxime Dahan, with optimization in the Cissé lab. Additional analytical steps were written in Mathematica by Ibrahim I. Cissé and optimized by the author. All data presented were acquired and analyzed by the author. Direct assistance was received from James Owen Andrews, Arjun Narayanan, and Takuma Inoue. Other members of the Cissé lab, including Jan-Hendrik Spille, Namrata Jayanth, and Won- ki Cho, provided valuable discussion and insight.

2 Introduction Transcription, the first step of gene expression, is the cellular mechanism by which genes are copied from the DNA template to the complementary mRNA molecule. Mediating this templated polymerization reaction is the RNA polymerase II holoenzyme, which catalyzes the formation of phosphodiester bonds to link nucleoside triphosphates in a hydrolysis reaction [1]. Eukaryotic transcription initiates with the binding of an activator to its cognate distal enhancer sequence to enable the sequential recruitment of the general transcription factors and RNA polymerase II to the cognate gene promoter [2]. Subsequently, proximal response elements upstream of the promoter, such as the TATA sequence, are recognized and bound by cognate , such as the TATA-binding (TBP). In this case, the TBP binds, as part of the TFIID complex, to the DNA, 35 base pairs upstream of the transcription start site. Binding of TFIID then promotes the recruitment of the remaining general transcription factors, TFIIB, F, E, H, and the Mediator complex, to form the transcriptional preinitiation complex [3]. However, by themselves, RNA polymerase II and the general transcription factors do not fully reconstitute the in vivo response to transcription regulators [4]. DNA is packaged with histone proteins to form chromatin, and the compaction this achieves enables the 2-meter DNA polymer to package inside eukaryotic nuclei with diameters on the order of microns. An additional host of enzymatic apparatus, including chromatin modifiers (de)acetylases, (de)methylases, (de)phosphorylases and ATP-dependent chromatin remodelers the SWI/SNF and the NuRD remodeling complexes modulates access to the DNA template, from the “open” 10-nm filament, to the 30-nm fiber, and to more compacted, higher order packings [5]. More recent studies suggest that a fraction of noncoding RNA, transcribed at enhancers, may collaborate with activating transcription factors, and stabilize chromatin loops, to further upregulate gene expression [6]. The regulation of transcription is essential for the specification of cell state [7]. Post- transcriptional gene regulatory programs and external factors are also known to be involved in the control of cell identity. For instance, the miR-290-295 microRNA family maintains embryonic stem cell identity by repressing Pax6 expression, and consequently ectoderm differention [8]; extrinsic signaling factors, acting through the cell’s molecular signaling network (e.g., in embryonic stem cells, the binding of the nodal signal to its cog- nate receptor, activin), promote the expression of transcription factors to further alter cell identity [9]. However, here we focus on the transcriptional regulatory circuitry required for the specification and maintenance of cell identity. To establish the transcriptional program required for maintaining cell identity, a set of “master” transcription factors those whose protein products bind to the enhancers of themselves and of each other activate their own expression and that of downstream effector genes [7]. This set of transcription factors forms an interconnected autoregulatory loop to maintain cell state. The identity and behavior of this set of factors, that form

2 the core regulatory circuitry, are best studied in embryonic stem cells (in both human and mouse), in which the transcription factors Oct4, Sox2, and Nanog (OSN) were identified as governing the pluripotent state [7, 10]. In murine embryonic stem cells (ESCs), Oct4 / forms a heterodimer with Sox2. While Nanog is not necessary for self-renewal (Nanog ESCs retain wild-type Oct4 expression and the capability to propagate in culture), Nanog- deleted embryonic stem cells are recruited to the germ line, and reach the soma, but do not migrate to the genital ridge beyond 11.5 days after embryogenesis [11]. Additionally, by ChIP-seq, Nanog colocalizes at the majority of Oct4 and Sox2 binding sites throughout the ESC genome, including regions proximal to the Sox2 promoter [12]. Collectively, these three transcription factors Oct4, Sox2, and Nanog operationally define the enhancers that transcriptionally activate genes that drive the embryonic cell state [7]. We examine the first step of transcriptional initiation: the binding of activating transcrip- tion factors at distal enhancers, and the formation of a physical loop between between the enhancer and promoter. To experimentally characterize this, recent molecular technolo- gies, based on modifications of the chromatin immunoprecipitation (ChIP) method, are used to capture a substantial fraction of pairwise interactions between distal genomic loci. In ChIA-PET (chromatin interaction analysis by paired-end tag sequencing), sonication and pulldown of paired DNA linkers colocalizing with the protein of interest are performed as in chromatin immunoprecipitation. Linkers are ligated to the tethered DNA fragments; asubsequentligationreaction,underdilutereagentconcentrations,joinslinkersligated to different fragments. Sequencing and analysis of the paired-end reads identifies local chromosomal structures across the genome [13]. Here, we focus on the chromosome structures associated with the cohesin complex. Cohesin is one of multiple structural maintenance of chromosome proteins, ring-shaped protein com- plexes that link two DNA molecules. (The SMC family also includes the condensin I and II complexes.) The cohesin complex is comprised of Smc1 and Smc3, which contain coiled- coiled arms, the heat repeat domain-containing STAG, and the Rad21 bridging protein. In eukaryotes, cohesin is responsible for maintaining cohesion of sister chromatids [14]. In yeast, an 80% knockdown of cohesin expression led to transcriptional defects, but sister chromatid cohesion and chromosome structure remained intact, suggesting that cohesin plays a functional role in transcription [15]. Additionally, chromatin immunoprecipita- tion assays show cohesin colocalized with actively transcribed promoters and enhancers. In mouse embryonic stem cells, colocalization of cohesin is observed, at enhancers, with Med1/Med12 (components of the Mediator complex) and master transcription factors that bind to enhancers and drive the expression of cell identity-specifying genes; at promoters, cohesin colocalizes with Mediator and Pol II [16]. The molecular interactions suggested by these genome-wide ChIP data were biochemically validated in western blot, through the co-purification of Smc3 with Med12 in nuclear cell extract [16]. Finally, reduction of Media- tor, cohesin, and Nipbl (a cohesin loading protein) achieved through shRNA knockdown resulted in lower Oct4 protein levels, aberrant ESC colony morphology, and reduced OSN expression levels. These show, genetically, the role of cohesin-associated proteins, and Mediator, in maintaining the transcriptional control of cell state. More recent studies identified a subset of OSN enhancers in stem cells as containing a disproportionately larger proportion of the transcriptional machinery. By aggregating en- hancers in linear proximity (within 12.5 Kb), we define, based on total Mediator signal, a small number of enhancer domains, termed “super-enhancers,” that dominate the transcrip- tional program specifying cell state. These super-enhancers contain high densities of not only Oct4, Sox2, and Nanog, but also Klf4 and Esrrb, and are enriched for the concominant

3 transcription factor binding motifs. Genes associated with super-enhancers (also by linear proximity) show higher expression; they also show increased sensitivity to Oct4 and Med12 levels, upon shRNA knockdown. Super-enhancers, cloned in luciferase assays driven by the Oct4 promoter, activate expression to higher levels than typical enhancers. Examining super-enhancers in multiple cell types suggests that genes associated with super-enhancers are more cell-type specific, and drive the developmental functions of their respective cell types (e.g., MyoD, which has been shown to reprogram fibroblasts to muscle cells, oc- cupies a super-enhancer that activates its own expression in skeletal muscle), than those associated with typical enhancers [17]. Thus, to study the chromosome structures respon- sible for controlling cell identity, we specialize to the cohesin-associated loops surrounding super-enhancers, and similar genomic regions, that specify cell lineage. To gain insight into previously identified chromosome structures, we examined previous chromatin interaction studies performed by a related experimental method, Hi-C. Hi-C modifies standard chromosome conformation capture, to comprehensively detect the set of chromatin interactions in the nucleus. Similar to ChIA-PET, chromatin is crosslinked, fragmented, ligated, and purified to create a library of interacting DNA loci. However, without immunoprecipitation, Hi-C measures all interactions, rather those mediated by aspecificfactor.Datafromthesestudiescapturethechromatinconformationensemble; contact density maps of sufficient quality require samples on the order of 106 cells. Recent Hi-C studies showed that contact probabilities between loci P (s) in condensed chromatin (seen in the mitotic phase) initially scale inversely with the square root of the genomic 0.5 7 distance apart s (P (s) s ), followed by a rapid fall-off at 10 base pairs, at which 1 / P (s) s [18]. Human chromatin organization within the 10 Mb cutoff exhibits scaling / 0.1 1 that falls between the fractal (P (s) s ) and equilibrium globule states (P (s) s / / for s 105 bp, as in the fractal globule, but plateaus with P (s) s0 for s 105 bp) [18]. / This contrasts> with human interphase chromatin organization, in which contact? probability scales inversely with genomic distance from 500 Kb to 7 Mb [19]. Application of the statistical mechanics of polymers informs the physical mechanisms of chromatin folding, showing that unlike proteins, which fold into a unique native conformation, chromatin fiber adopts unique conformations in individual cells, of which only the ensemble average is shown through these data [20] (barring more recent single-cell Hi-C methods [21]). The length to chain diameter ratio for the chromatin polymer in mammals is on the order of 103 to 104 higher than for single protein domains ( 105 106 for human chromosomes, but ⇠ 50 250 for proteins) [20]. Overall, the statistics of the long-range interactions indicate ⇠ the chromatin as organizing into a long-lived, nonequilibrium conformation that forms as aconsequenceofrapidcondensation[19,20]. Over smaller length scales, the Hi-C experiment identities local chromosome structures. Ex- k k panding the contact probability matrix T by eigenvector decomposition (Tij = k kEi Ej + k k T ) provides the interaction preference eigenvectors Ei ,Ej of genomic regionsP i, j , h i { } 1 { } weighted by eigenvalue k,oforderk [24]. The first eigenvector E provides a linear model of chromatin interaction preferences, and is used to separate the chromatin into two com- partments, “A” and “B”. Compartments preferentially self-interact and correspond, roughly, to active/euchromatic (“A”) and inactive/heterochromatic (“B”) chromatin, correlating with indicators such as DNA I hypersensitivity, gene density, GC content, and histone marks [22]. Compartments have a characteristic size of 5 Mb and are tissue-specific [19, 24]. ⇠ At finer resolutions, further Hi-C studies identified so-called topological associating domains (TADs) [18, 22, 23]. Schematics of compartments and topological activating domains, adapted from references 22 and 23, are shown in Figures 1-3. Topological associating do-

4 REVIEWS

Box 2 | Genome compartments Inter- and intrachromosomal interaction maps for mammalian genomes28,64,111 have revealed a pattern of interactions that can be approximated by two compartments — A and B — that alternate along chromosomes and have a characteristic size of ~5 Mb each (as shown by the compartment graph below top heat map in the figure). A compartments (shown in orange) preferentially interact with other A compartments throughout the genome. Similarly, B compartments (shown in blue) associate with other B compartments. Compartment signal can be quantified by eigenvector expansion of the REVIEWS interaction map64,111,112. The A or B compartment signal is not simply biphasic (representing just two states) but is mains have a characteristic size of 112 400 500 Kb, and can correspond to either active or continuous⇠ and correlates with indicators of transcriptional activity, such as DNA accessibility, gene density, replication inactive chromatin. In contrasttiming, GC to content compartments, and several histone TADs marks. areThese notindicators cell-type suggest that specific, A compartments nor do are largely euchromatic, adjacent TADs necessarily correspondtranscriptionally active to opposite regions. transcriptional activity [22]. DNA ele- BoxTopologically 2 | Genome associating compartments domains (TADs) are distinct from the larger A and B compartments. First, analysis of embryonic 58,59 ments at the boundary of TADsInterstem- cells, and include intrachromosomal brain tissue CTCF and fibroblasts interaction binding suggests maps sites, forthat mammalian themost, H3K4me3but not genomes all, TADs28 and,64 are,111 tissuehave H3K36me3 revealed-invariant a pattern, whereas of interactions A and B that histone marks, transcriptionalcancompartments be start approximated sites, are tissue by housekeeping -twospecific compartments domains of — genes,active A and and B — andinactive that alternate nascent chromatin along that mRNA chromosomes are correlated signal and with have cell -atype characteristic-specific gene 64 (as measured by global nuclearsizeexpression of ~5 run-on Mb patterns each sequencing)(as. shownSecond, by A the and compartment B [22, compartments 23]. graph Topological are below large (oftentop heat several domains map megabases)in the figure). are and con- A compartments form an alternating (shown pattern in orange)of active preferentially and inactive domains interact alongwith other chromosomes. A compartments By contrast, throughout TADs are the smaller genome. (median Similarly, size aroundB compartments 400–500 kb; (shown see served between syntenic regionsinzoomed blue) associate in in section human with of heat other andmap B compartments.in mousethe figure) and embryonic Compartment can be active signal stemor inactive, can cells. be and quantified adjacent Finally, by TADs eigenvector are in not necessarily expansion of of the contrast to the G1 state, TADsinteractionopposite (and chromatin map compartments)64,111 status.,112. The Thus, A or Bit compartmentseemswere that TADs signal observed are ishard not- wiredsimply to features bebiphasic lost of (representing chromosomes, in HeLa just and S3 two groups states) of butadjacent is TADs can organize112 in A and B compartments (see REF. 50 for a more extensive discussion). cells in the M phase [18]. Therefore,continuous and topological correlates with indicators associating of transcriptional domains activity, are such hypothesized as DNA accessibility, to gene density, replication timing,Shown GC in contentthe figure and are several data for histone human marks. chromosome These indicators 14 for IMR90 suggest cells that (data A compartmentstaken from REF. are 59 ).largely In the euchromatic,top panel, Hi- C function as an evolutionarily-conservedtranscriptionallydata were binned active at mechanism200 regions. kb resolution, tocorrected partition using iterative the correction genome and into eigenvector regions, decomposition (ICE), and containing long-distance interactionstheTopologically compartment betweenassociating graph was domains genescomputed (TADs) and as described are distal distinct in from regulatoryREF. 112the .larger The lower A elementsand panel B compartments. shows [23]. a blow First, up of analysis a 4 Mb fragmentof embryonic of stemchromosome cells, brain 14 tissue(specifically, and fibroblasts 74.4 Mb tosuggests 78.4 Mb) that binned most, at but 40 not kb. all, TADs are tissue-invariant58,59, whereas A and B compartments are tissueCompartments-specific domains of active and inactive chromatin that are correlated with cell-type-specific gene 64 expression patterns . Second, A and B compartments are large (often several megabases)A compartments and form an alternating pattern of active and inactive domains along chromosomes. By contrast, TADs are smaller (median size around 400–500 kb; see zoomed in section of heat map in the figure) and can be active or inactive, and adjacent TADs are not necessarily of opposite chromatin status. Thus, it seems that TADs are hard-wired features of chromosomes, and groups of adjacent TADs can organize in A and B compartments (see REF.20 Mb50 for a more extensive discussion). Shown in the figure are data for human chromosome 14 for IMR90 cells (data taken from REF. 59). In the top panel, Hi-C data were binned at 200 kb resolution, corrected using iterative correction and eigenvector decomposition (ICE), and the compartment graph was computed as described in REF. 112. The lower panel shows a blow up of a 4 Mb fragment of chromosome 14 (specifically, 74.4 Mb to 78.4 Mb) binned at 40 kb.

Compartments A compartments

20 Mb

B compartments

Interaction preference

TADs Figure 1. A schematic of chromosome compartments, adapted from [22]. Chromosome com- partments, with a characteristic size of order 5 Mb, are defined by eigenvector expansion ⇠ B compartments (also known as principal component analysis) of the contact probability map. Scores along the first eigenvector can beInteraction interpreted preference as a measurement2 Mb of “interaction preference,” the probability the region will interact with a gene-rich region of active transcription. Here, orange corresponds to active and blue to repressed chromatin regions.

TADs

2 Mb

Nature Reviews | Genetics 396 | JUNE 2013 | VOLUME 14 www.nature.com/reviews/genetics

5 Nature Reviews | Genetics 396 | JUNE 2013 | VOLUME 14 www.nature.com/reviews/genetics Figure 2. Schematic of topological activating domains (TADs), adapted from [22]. TADs, with a characteristic size of order 400 500 Kb, are identifiable by visual inspection, ⇠ when “zooming in” on the chromatin contact map. TADs fall into both active and repressive chromosome compartments, and adjacent TADs do not necessarily have opposite chromatin activity status.

Interactions downstream B

A Interactions upstream Putative boundary

AB Degree of bias Biased downstream

Biased upstream

Figure 3. Identification and proposed function of topological activating domains (TADs), adapted from [23]. TADs were identified by devising a simple “directionality index”: DI = B A (A E)2 (B E)2 BA E + E , where A is the number of reads that map upstream, B the num- | | ber⇣ of⌘⇣ reads that map downstream,⌘ and E, as the average of A and B, the expected value, based on the chi-squared statistic. Transitions from a negative to positive directionality index, or from an upstream to downstream interaction bias, are observed at the putative boundary between TADs. A Hidden Markov Model was applied to identify these biased states and infer the locations of the topological domains in the genome. This proposes a model in which TADs partition the genome into locally self-interacting regions. We begin the study by examining, with cohesin ChIA-PET data, for the presence of local chromosome structures, analogous to topological domains, that control the transcription of cell-identity specifying genes. Topological domains are part of a hierarchy of genome structures, so we sought to identify and understand domains of lower characteristic size that act specifically to turn on gene regulatory programs. Cohesin mediates gene looping, so SMC1 was selected for ChIA-PET in embryonic stem cells. ChIP-seq data identified where proteins co-bound with interaction sites. We identify super-enhancer domains (“SDs”) and polycomb domains (“PDs”). Super-enhancer domains, surrounding super-enhancers and their target genes, contain an exceptional amount of transcriptional apparatus, and the domain boundaries insulate transcriptional activation of nearby genes. Polycomb domains surround polycomb-bound genes, marked by H3K27me3, whose expression, corresponding to repressed developmental lineages, is suppressed. Evidence for these domains are seen in more than one cell type. Deletion of the CTCF boundaries (marked by ChIP-seq peaks), leads to disruptions in the regulatory control that the domains impose [25].

6 3 Methods 3.1 Cell Culture V6.5 cells mouse ESCs were grown on irradiated mouse embryonic fibroblasts (MEFs). Cells were grown on 0.2% gelatinized tissue culture plates (Sigma, G1890) in media for embryonic stem cells. ESC media contains the following: knockout DMEM (Invitrogen, 10829-018) with 15% fetal bovine serum (Hyclone, characterized SH3007103); 1000 U/mL LIF (ESGRO, ESG1106); 100 mM nonessential amino acids (Invitrogen, 11140-050); 2 mM L-glutamine (Invitrogen, 25030-081), 100 U/ml penicillin, 100 mg/ml streptomycin (Invit- rogen, 15140-122), and 8 nl/ml of 2-mercaptoethanol (Sigma, M7522).

3.2 ChIA-PET Library Preparation The protocol from [25] was used. Embryonic stem cells were grown to a minimum of 80% confluence (up to 1 108 cells) and were treated with 1% formaldehyde at room ⇥ temperature for 20 minutes; the formaldehyde was neutralized with 0.2M glycine. The crosslinked chromatin was fragmented by sonication to sizes of 300-700 base pairs. The anti-SMC antibody (Bethyl, A300-055A) was used to enrich for Smc1-bound chromatin fragments. Some ChIP DNA was eluted; this aliquot was used to quantify concentration, and to determine the degree of environment in the pulldown procedure. For construction of the ChIA-PET library, ChIP DNA fragments were end-repaired using the T4 DNA polymerase (NEB). The fragments were separated into two aliquots, with one of two linkers (“A” and “B”, see table below) ligated to barcode the sequence. After ligation, the samples were recombined and prepared for proximity-based ligation under dilute reaction conditions in a 20 mL volume; by relying on the higher probability of a self ligation, this minimizes ligations between different DNA-protein complexes. Proximity ligation was performed with T4 (bacteriophage) DNA ligase; incubation was performed, without rocking, at 22˚C for 20 hours. During proximity ligation, DNA fragments with the same linker sequence are ligated to form paired-end tags in a single chromatin complex, producing ligation products with a homodimer linker. However, heterodimer linker products can also occur, if ligations between different chromatin fragments occur to form chimeric products, with heterodimeric linker composition. DNA fragments with heterodimeric linker sequence were removed from the final pool, because they cannot represent the ligation of between pairs of a single chromatin complex. Samples were then treated with proteinase K to digest proteins; DNA was purified from the resulting complex.

Name Sequence Linker A 3’-GAC GAC AGGCTA T(biotin) AG CGC CGG-5’ 5’-CTG CTG TCCGAT ATC GC-3’ Linker B 3’-GAC GAC AGTATA T(biotin) AG CGC CGG-5’ 5’-CTG CTG TCATAT ATC GC-3’

A digestion with EcoP15I (NEB), a restriction enzyme that cuts a set distance (27 base pairs) away from its recognition site, was then performed at 37˚C for 17 hours, to lin- earize the ligated chromatin fragments. This generates the 27-base-pair paired end reads obtained in this dataset. The biotinylated chromatin fragments were then immobilized on Streptavidin beads (Dynabeads M280). The chromatin fragments were then end-repaired (Epicentre #ER81050); As were added to the ends with the Klennow polymerase fragment, with rotation at 37˚C for 35 minutes. Illumina paired-end sequencing adapters were ligated to the ends; 18 PCR cycles were performed to amplify the library. PCR purification was

7 performed to extract the paired-end constructs, and the prepared samples were sequenced in a 50 50 bp Illumina HiSeq 2000 instrument. Figure 4 and 5 show schematics of the ⇥ Chromosome conformation technologies ChIA-PET workflow, adapted from [27].

Chromosome conformation technologies

Figure 4. The initial part of the ChIA-PET workflow, adapted from [27]. Chromatin is crosslinked and fragmented, a factor of interest is enriched through immunoprecipitation, the paired ends are ligated, and the crosslinks are reversed. These steps are common to all chromosome conformation capture methods (3C, 4C, Hi-C, etc.)

Figure 1. Overview of 3C-derived methods. An overview of the 3C-derived methods that are discussed is given. The horizontal panel shows the cross-linking, digestion, and ligation steps common to all of the ‘‘C’’ methods. The vertical panels indicate the steps that are specific to separate methods.

suggesting that they may activate CFTR expression sulator sequences are bound by proteins such as CTCF in (Gheldof et al. 2010). mammals (Phillips and Corces 2009) or Su(Hw) in flies Enhancer activity on gene expression can be blocked by (Geyer and Corces 1992). 3C technology has been used insulator sequences (Wallace and Felsenfeld 2007). In- to demonstrate that the function of certain insulators is

Figure 5. The second, method-specific part of the ChIA-PET workflow, adapted from Figure 1. Overview of 3C-derived methods. An overview of the 3C-derived[27]. First, methods before crosslink that are reversal, discussed the library is given. is aliquoted,The horizontal and “A” panel and “B” adapters are shows the cross-linking, digestion, and ligation steps common to allligated. of the After‘‘C’’ methods. crosslinks The arevertical reversed, panels fragments indicate are digested the steps by that a site-specific are restriction specific to separate methods. endonuclease that cuts a distance away from the recognition site. Sequencing adapters are then ligated, the library is PCR-amplified, and the sample submitted for sequencing. suggesting that they may activate CFTR expression3.3 ChIPsulator Data sequences Analysis are bound by proteins such as CTCF in (Gheldof et al. 2010). ChIP-seqmammals datasets (Phillips were analyzed and Corces by Bowtie 2009) using or the Su(Hw) mm9 build in fliesof the mouse genome; Enhancer activity on gene expression can be blocked by the parameters(Geyer and -k Corces 1 -m 1 -n 1992). 2 were 3C used technology to report the has best been alignment used if and only one insulator sequences (Wallace and Felsenfeld 2007). In- to demonstrate that the function of certain insulators is 8

Figure 2. 3C data and the ACH. (A) Example of 3C data showing enhancer looping at the mouse b-globin locus (reproduced from Tolhuis et al. [2002] with permission from Elsevier, Ó 2002). The relative cross-linking frequency (Y-axis) is plotted for the b-major gene (the anchor point; thick black vertical line) with selected other restriction fragments (gray vertical lines) across the locus. The organization of the locus is shown at the top, with arrows pointing at regulatory sequences, the LCR being comprised of regulatory sequence 1–6, and the b-globin genes shown as triangles. In fetal livers (red line), where b-major is active and under the control of the LCR, loops are found between the gene and the LCR. In fetal brains (blue line), where the gene is silent, no such loops are observed. (B) Based on various 3C experiments in the b-globin locus, the ACH model was proposed. A schematic representation shows the conformation of the locus in its active conformation. Used with permission from Splinter and de Laat (2011).

GENES & DEVELOPMENT 13

Figure 2. 3C data and the ACH. (A) Example of 3C data showing enhancer looping at the mouse b-globin locus (reproduced from Tolhuis et al. [2002] with permission from Elsevier, Ó 2002). The relative cross-linking frequency (Y-axis) is plotted for the b-major gene (the anchor point; thick black vertical line) with selected other restriction fragments (gray vertical lines) across the locus. The organization of the locus is shown at the top, with arrows pointing at regulatory sequences, the LCR being comprised of regulatory sequence 1–6, and the b-globin genes shown as triangles. In fetal livers (red line), where b-major is active and under the control of the LCR, loops are found between the gene and the LCR. In fetal brains (blue line), where the gene is silent, no such loops are observed. (B) Based on various 3C experiments in the b-globin locus, the ACH model was proposed. A schematic representation shows the conformation of the locus in its active conformation. Used with permission from Splinter and de Laat (2011).

GENES & DEVELOPMENT 13 alignment was found, with a maximum of 2 mismatches in the seed (first 28 bases of the match, so the entire read for our dataset). MACS was used as a peak-finding algorithm to identify regions of ChIP-seq enrichment over input DNA control. A “p-value” threshhold 9 of 10 was used for calling enrichment for all ChIP-seq data sets; paired reads in ChIA- PET datasets can be decoupled to produce ChIP datasets. The “-nolambda” parameter is set to use a single background lambda (enriched regions, or “peaks”, are called against abackgroundPoissoniandistribution,andsometimesalocallambdacanbeusedduring peak calling); the “-nomodel” parameter is used to bypass shifting reads to their midpoint in the background model building process. Both parameters are additionally used when calling peaks with the H3K27me3 histone modification, which tends to exhibit broad signal over large regions of the genome.

3.4 Definition of Enhancers As described above, co-occupancy of Oct4, Sox2, and Nanog is used to operationally define the location of enhancers; Mediator is typically associated with these sites [16]. Reads from Oct4, Sox2, Nanog, Med1, and Med12 ChIP datasets (the latter two, components of the Mediator complex) were pooled, and MACS was run (ChIP Data Analysis) on the combined dataset to define the locations of putative enhancers. Regions of enriched ChIP-seq signal falling outside promoters (defined as more than 2.5 Kb away from RefSeq transcriptional start sites) were defined as possible enhancers.

3.5 Assignment of Interactions to Regulatory Elements In general, we only required a minimum overlap of 1 between a tag in a paired-end tag (PET) and a DNA element (enhancers, transcription start sites, promoters, ChIP and PET peaks) to assign the tag to the DNA element.

3.6 Enhancer-Gene Assignments Following previous studies [17, 23], enhancer-promoter assignments (for both super-enhancers and typical enhancers) were done by proximity. However, with ChIA-PET data, we ad- ditionally confirmed proximity assignments by high-confidence interactions. If a high- confidence interaction was not available, we looked for support in non-chimeric PETs with spans passing the self-ligation cutoff.

3.7 Super-enhancer Definition Super-enhancers are defined as in Whyte, et. al. (2013) [17], using ROSE (Rank Order- ing of Super-Enhancers). We describe the methdology of ROSE here: in embryonic stem cells, genomic regions cobound by Oct4, Sox2, and Nanog (by pooling the corresponding ChIP-seq reads and calling MACS peaks on the combined library) were defined as en- hancers. Visual inspection of the tracks on the UCSC Genome Browser [28] reveals that certain enhancer regions are closely spaced and have high signal density. To capture these closely spaced, high signal “constituent” enhancers as a single entity, we group constituent enhancers occurring within 12.5 Kb of each other into a single, larger enhancer domain. Then, the set of all enhancers (grouped or not) ranked in a cell type in order of increasing total background-subtracted Mediator signal (Med1 ChIP; background signal obtained from whole cell extract). A normalized, rank-ordered plot of enhancer Mediator signal (Figure 6) identities a coordinate where the first derivative exceeds unity. We define all enhancers beyond this coordinate as super-enhancers, and the remainder as typical enhancers.

9 A F Super- 27kb Enhancer: 10kb Enhancer: enhancer: 5kb 14 10 10 OSN Oct4 rpm/bp rpm/bp 4 14 14 Med1 Sox2 rpm/bp

rpm/bp 21 21 Gck Nanog rpm/bp B 3.5 3.5 Enhancer: 10kb Klf4

14 rpm/bp Other factors or marks were considered, such as OSN, DNase I, H3K27ac, and H3K4me1.15 15 OSN Med1 signal provided the clearest transition between the two populations, and so was chosen Esrrb rpm/bp

as the standard for annotation. In other cell types, the corresponding master transcriptionrpm/bp 20 factors were used to define enhancers: PU.1 for pro-B cells, MyoD for myotubes, T-bet for Med1 Gck miR-290-295 T helper cells, and C/EBPa for macrophages. If the identity of the master transcription rpm/bp factor for the cell type is not available, as in neural progenitor cells, then H3K27acG signal was used instead. miR-290-295 Oct4 Sox2 Nanog Klf4 Esrrb SE 1.2 5 2kb TE 5 C 3 3 1.0 4 4

Super-enhancers (231) 0.8 120,000 3

seq density 2 2 3

− 0.6 100,000 2 2 0.4 1 1 80,000 1 0.2 1

0 0 0 0.0 0 60,000 Mean ChIP Start End Start End Start End Start End Start End 40,000 H Oct4 Sox2 Nanog Klf4 Esrrb 20,000

Med1 signal at enhancers 6 4 2.5 0 5 8 3 2.0 0 2000 4000 6000 8000 3 4 6 1.5 Enhancers ranked by Med1 signal 2 2 3 4

seq density 1.0

− 2 D 1 Figure 6. The normalized, rank-ordered plot of Mediator signal at enhancers, adapted1 2 Typical enhancers Super-enhancers 1 0.5 from [17]. Total Med1 ChIP signals of enhancers, after stitching, were plotted in rankChIP 2kb 0 0 0 0.0 0 order. Super-enhancers were defined4 as the set of enhancers beyond the coordinate where TE SE TE SE TE SE TE SE TE SE the first derivative exceeds unity.3 ROSE can be found at https://bitbucket.org/young_computation/rose.2 I Transcription Motif P-value 1 factor

3.8 ChIA-PET Data AnalysisMean Med1 density 0 -64 ChIA-PET datasets were processedStart similarly End to theStart methodsEnd of [26]. Raw sequences were Oct4 9.19*10 Number: 8563 231 analyzed for barcode composition (linker “A” or “B”), partitioning the dataset into homod- -67 imeric (“AA” or “BB”)Median versus size: heterodimeric703 bp (“AB”) ligation8667 bp products, the latter of which Sox2 3.01*10 unambiguously results from non-specific,Total Density non-local at Total ligationDensity events. at Paired-end tags (PETs) -17 Nanog 9.46*10 were trimmed of the linker sequencesignal constituents once a completesignal match,constituents up until the identifying se- quence of the linkers (“CG” for linker A and “AT” for linker B), was found in other words, -6 Med1: 1x 1x 28x 8.1x Klf4 4.33*10 a perfect match of theH3K27ac: 10 base pairs1x of linker1x sequence.26x This produces4.8x 27 base pair reads per paired end, because EcoP151H3K4me1: cuts 1x 27 base1x pairs away10x from1.3x its recognition sequence. The -84 Esrrb 2.55*10 sequences were then mappedDNaseI: onto 1x the mm91x genome using8x Bowtie2.2x using the parameters -k 1 -m 1 -v 1 (where the -v parameter indicates that at most one mismatch can be found to the CTCF entire sequence, not only the seed). Read identifiers were used to pair the reads; further, to 0.45 remove the possibilityE of redundancy0.4 introduced by PCR artifacts, read pairs with identical c-Myc coordinates and strand sign were collapsed into a single PET. PETs with ends mapping 0.15 to different chromosomes (“interchromosomal PETs”) were additionally filtered from the 0.3 H3K4me1 J dataset. PET peaks, representing local sequence enrichment, were called by MACS, using Oct4/Sox2/Nanog Klf4/Esrrb “-p 1e-09 -nolambda -nomodel -keep-dup=2” as parameters,H3K27ac where “-keep-dup=2” restricts 8 8 to a maximum of 2 duplicate0.2 tags at the exact same location. 6 PETs that overlapped at each end, and for which each end fell into a PET peak, were 6 0.1 grouped into putative interactions. To set a cutoff for the minimum number of PETs each 4 4 OSN DNaseI

Normalized ChIP-seq signal 2 2 0.0 10 Med1 Number of motifs Number of motifs

6000 6500 7000 7500 8000 8500 0 0 Ranked enhancers TE SE TE SE

(legend on next page)

308 Cell 153, 307–319, April 11, 2013 ª2013 Elsevier Inc. putative interaction must have in order for an interaction between two genomic loci to be called with confidence, we examinined PETs with both chimeric and non-chimeric linker sequences. For PETs with heterodimer linkers, more than 99.8% of putative interactions are singletons; 0.1% consist of 2 PETs (Figure 7). Therefore, we set a cutoff of at least 3 PETs for an interaction to be called with high confidence. This is in contrast to PETs with homodimer linkers, where at least 10% of interactions had 3 or more PETs.

PET interactions called using non-chimeric PETs 100% PET interactions called using chimeric PETs

Cut-off for calling high-confidence interactions at 10% 3 PETs (False positive rate is < 0.01% based on interactions called using chimeric PETs).

1%

0.1%

< 0.01% 123456 7 8 9 10

Percent of total intrachromosomal interactions intrachromosomal total of Percent Number of PETs/interaction

Figure 7. The scaling of the proportion of intrachromomosomal interactions with the num- ber of PETs assigned to each interaction, adapted from [25]. Both PETs with chimeric and non-chimeric linker sequences are plotted. Based on the PET count distribution for chimeric linker-interactions, we require at least 3 non-chimeric linker PETs for an interaction to be called with high confidence. As part of the ChIA-PET library preparation, DNA can not only ligate with its crosslinked partner or with another chromatin fragment; it can also self-ligate. Self-ligations will con- tain homodimer linker sequences. However, due to the length of each DNA fragment (< 100 base pairs), we sought to remove PETs from our dataset that fell below a minimum length cutoff. Therefore, we studied the proportion of intrachromosomal interactions scaled with PET genomic span (or the distance between the paired tags), for PETs with both chimeric and non-chimeric linker sequences. We found that while chimeric PET frequency did not scale with genomic span, nonchimeric PET frequency does. As in previous studies [26], we imposed a minimum span cutoff of 4 Kb for non-chimeric interchromosomal PETs to be included in our high-confidence set of interactions.

Non-chimeric PET sequences Chimeric PET sequences 150K Self- Non-self- 150K ligating ligating

100K 100K

Cutoff for self-ligating and non-self-ligating 50K populations is at 4KB 50K Frequency of PETs of Frequency Frequency of PETs of Frequency

0 0 1K 10K 100K 1M 1K 10K 100K 1M Genomic span (bp) Genomic span (bp)

11 Figure 8. The scaling of the proportion of intrachromomosomal interactions with the ge- nomic span between paired ends, adapted from [25]. Both PETs with chimeric and non- chimeric linker sequences are plotted. We require PETs to span at least 4 Kb for an interaction to be included in the high-confidence set. We sought to further refine the set of interactions that formed our dataset. We removed PETs where each end did not fall into different PET peaks, as those may result from self- ligation events. To check the minimum PET cutoff set previously, we applied a statistical model based upon the hypergeometric distribution to calculate the expected number of PETs per interaction under the null model. Specifically, we counted the number of PETs originating from each PET peak. We then determined, given the number of PETs originat- ing any two peaks, the probability that we can see the observed number of PETs linking the two PET peaks. This method yielded a minimum of two independent PETs to call high-confidence interactions between pairs of interacting sites. To understand the limit of our data, and whether we can make statements on the upper limit of interaction probabilties between certain DNA elements (e.g., enhancers and promoters), we sought to determine whether the degree of saturation within our library. To this end, we modeled the number of sampled genomic positions as a function of sequencing depth using a phenomenological Michaelis-Menten, first-order kinetic model. Intrachromosomal PETs with distance spans above the self-ligation cutoff of 4 Kb were subsampled at various depths, and the number of interactions (or unique genomic positions occupied) plotted as a function of the sequencing depth. Fitting by non-linear least squares regression, under the assumptions of this phenomenological model, suggested a sampling of approximately 70% of the available interchromosomal PET space, or 2.2/3.2 million positions (Figure 9) [25].

Saturation of the SMC1 ChIAPET Library

3000K Estimated 100% Saturation

Fitted curve

2000K ~ 70% Saturation

1000K

Number of Unique Positions Unique of Number 0 0 1X 2X 3X 4X 5X Fraction of PETs Relative To Current Dataset Size

Figure 9. Saturation analysis of the SMC1 ChIA-PET data set, adapted from [25]. Subsam- pling of various fractions of PETs within the merged ChIA-PET data set was performed, and the number of unique genomic positions of intrachromosomal PETs beyond the self-ligation distance cutoff of 4 Kb was plotted. The solid line, depicting the non-linear least-squares regression fitting of data to a phenomenological Michaelis-Menten model, suggests a sam- pling of approximately 70% of the available intrachromosomal PETs beyond the 4 Kb in our library. The dotted line represents a putative 100% saturation. We also sought to understand the reproducibility of our data. In collaboration with Zhao and colleagues, we produced two replicates of the Smc1 ChIA-PET. Taking only PETs

12 passing the self-ligation distance cutoff (4 Kb), each chromosome was partitioned in 10 Kb bins, and 21 symmetric two-dimensional matrices were constructed were constructed for the two replicates. Element ai,j in the matrices represented the number of PETs with one end in bin i and the other in bin j. Normalization by the number of reads, as well as the bin size 1000 bp was performed, creating a metric similar to FPKM for the bins in both ⇥ matrices. The datasets reproduced well, as Pearson correlation yielded r =0.912.

3.9 Definition of Super-enhancer and Polycomb Domains Super-enhancer domains (SDs) were defined by proximity to the nearest transcriptional start site (TSS), and by confirmed ChIA-PET interactions. A region of 2.5 Kb around the transcriptional start site was taken, to form super-enhancer (SE)-gene units. On occasion, a super-enhancer could belong to more than one SD, if a super-enhancer interacted with more than one gene. CTCF-CTCF PET interactions that enclosed SE-gene units were then searched for; if nested CTCF-CTCF interactions enclosing the SE-gene unit were found, the smallest domain was taken. This leads SDs being defined as a cohesin/CTCF- cohesin/CTCF interaction surrounding, also colocalizing with cohesin, an SE-promoter interaction. All 231 super-enhancers were assigned to a total of 302 SE-gene units. In con- trast, only 4128/8563 (48%) of typical enhancers were contained in CTCF-CTCF structures co-bound by cohesin [25]. Genes regulating developmentally repressed lineages are frequently bound by extended re- gions of the Polycomb complex, originating from the transcription start site and spanning from 2-35 Kb [10]. H3K27me3 is a histone mark present at most inactive promoters, and modulated by the EZH2 methyltransferase, the catalytic subunit of the PRC2 Polycomb complex. Therefore, we sought genes showing H3K27me3 enrichment spanning more than 2 Kb. Similar to the definition of super-enhancers, H3K27me3 domains within 2 Kb were aggregated. 546 genes showed broad H3K27me3 signal around their transcription start sites. By examining for CTCF-CTCF structures, co-bound by cohesin, that surrounded these Polycomb-bound genes, we defined Polycomb domains (PDs). Only the smallest of multiple nested CTCF-CTCF strctures were selected for. By this criterion, 349 PDs, sur- rounding 380 Polycomb-bound genes (if a cohesin/CTCF loop surrounded more than one gene), were found [25].

4 Results 4.1 Functional Chromosome Structures The partitioning of the genome into a hierarchy of structures (Figure 10), including topo- logical associating domains at the 500 Kb length scale, and enhancer-promoter loops at the 10 Kb length scale (Figure 11), led us to ask whether we can identify chromosome structures, perhaps of intermediate scale, that function in the transcriptional regulation of cell state. The expression of genes within a topological domain is somewhat correlated [23], but the biological role of topological domains remains unclear. Therefore, we sought to understand, in detail, the structure and function of chromosomal loops that regulate the transcription of genes specifying cell identity.

13 SUPPLEMENTARY INFORMATION RESEARCH

a

a 300 Lamina Associated Domains Topological Domains 300 Lamina Associated Domains Topological Domains 250 300 250 200

Median ~ 454 kb 300 Median ~ 880 kb

200 Median ~ 454 kb Median ~ 880 kb 200 150 Frequency 200 150 Frequency 100 Frequency 100 Frequency 100 50 100 0 0 50

0 0

1,000,000 2,000,000 3,000,000 4,000,000 0 0 1,000,000 2,000,000 3,000,000 4,000,000 Domain Size (bp) Domain Size (bp) 0 0

1,000,000 2,000,000 3,000,000 4,000,000 1,000,000 2,000,000 3,000,000 4,000,000 Domain Size (bp) b Domain Size (bp) 100

b

100 Normalized Interacting Counts Interacting 0

chr12: 101000000 101500000 102000000 102500000 103000000 103500000 104000000 Normalized

Interacting Counts Interacting 50 - Directionality Index0

chr12: -50 _ 101000000 101500000 102000000 102500000 103000000 103500000 104000000 5 - Lamin B1 DamID log 0 - ( DamID50 - ) Directionality Index -5 _ Figure 10. ATtc8 hierarchy1700064M15Ri of domaink Tdp1 Gm10432 structures isGpr68 seen inCatsperb Hi-C data,Atxn3 adaptedRin3 fromGm20604 [23]. Also 2610021K21Rik Calm1 Rps6ka5 Smek1 Tc2n Cpsf2 Moap1 shown is-50 the _ “directionality4930474N09Rik index”Kcnk13 (DI, see Introduction)Ccdc88c at eachFbln5 genomicSlc24a4 coordinate.Chga Cox8c 5 - Foxn3 Psmc1 Mir1190 Trip11 Lgmn Ubr7 BC002230 Smek1 Golga5 Btbd7 Lamin B1 DamID Itpk1 log 0 - Gm10433 D130020L05Rik ( Ttc7b Kif4-ps Mir1936 Unc79 DamID ) AK010878 -5 _ 100,000 Ttc8 1700064M15Rik Tdp1 Gm10432 Gpr68 Catsperb Atxn3 Rin3 Gm20604 2610021K21Rik Calm1ChromosomesRps6ka5 Smek1 Tc2n Cpsf2 Moap1 4930474N09Rik Kcnk13 Ccdc88c Fbln5 Slc24a4 Chga Cox8c Foxn3 Psmc1 Mir1190 Trip11 Lgmn Ubr7 10,000BC002230 Smek1 Golga5 Btbd7 Supplementary FigureGm10433 12. ComparisonD130020L05Ri of Topologicalk Domains with Lamina Itpk1 Ttc7b Kif4-ps Mir1936 Unc79 Associated Domains (LADs). a, Histogram showing the size distribution of the AK010878 topological domains1000 and the LADs.Topological Generally, LADs are smaller in size than topological domains. b, Genome browser shotAssociating showing a region on chromosome 12 with multiple Domains topological domains,100 one of which appears to be entirely lamina-associated, with the Supplementarremainder yare Figure non(Kb) Scale -lamina 12. Comparisonassociated. of Topological Domains with Lamina Associated Domains (LADs). a,Gene Histogram loops showing the size distribution of the 10 topological domains and the LADs. Generally, LADs are smaller in size than topological domains. b, Genome browser shot showing a region on chromosome 12 with multiple topological domains, one 1of which appears to be entirely lamina-associated, with the Nucleosomes remainder are non-lamina associated.

Figure 11. The length scales of different chromosome structures, from nucleosomes at the sub-1 Kb scale, to gene loops (enhancer-promoter interactions) at the 10 Kb scale, to ⇠ topological associating domains at the 500 Kb length scale. Figure adapted from [25]. ⇠ WWW.NATURE.COM/NATURE | 31 Two cohesin Smc1 ChIA-PET datasets were generated; in combination, 400 million ⇠ reads were produced. Analysis of read densities showed high reproducibility between the replicates, with Pearson correlation r =0.912 (Methods). Following filtering and data analysis (Methods) to remove PETs with genomic spans falling within a 4 Kb cutoff which may represent self-ligation events, we identified 19 million unique paired-end tags (PETs) ⇠ that were used to define PET peaks, by similar methods as those used to identify ChIP-seq binding sites (Methods). The analysis produced 1 million cohesin-associated DNA loops; ⇠ the majority linked enhancers, promoters, and CTCF binding sites [25]. A high-confidence dataset was produced by grouping all PETs that overlapped with a PET peak (by at least one base pair); a minimum of three PETs between any given two PET peaks was required to call an interaction (Methods). Saturation analysis by fitting to a phenomenological model suggested that our dataset did not fully sample all interacting sites in the genome. We observe that the majority of the interactions fell within topological associating domains; this is evident through both direct visual inspection (Figure 12) and in meta-plots of inter- action frequencies near TAD boundaries (Figure 13). Meta-plots of binding site frequencies (ChIP or PET peaks) around TAD boundaries showed an enrichment of CTCF, Smc1, and cohesin PET peaks binding sites at TAD boundaries (Figure 14).

14 60 Normalized

Interacting Counts Interacting 0

Hi-C Interaction Frequencies

Topological Associating Domain

High-Confidence Interactions

Genes Armc9 Ncl Nmur1 Ptma Cops7b Dis3l2 Alppl2 Akp3

B3gnt7 C130036L24Rik Pde6d Nppc Alpi

1700019O17Rik

Figure 12. The majority of high-confidence Smc1 ChIA-PET (blue lines) interactions fall within the boundaries of previous-defined topological associating domains. Hi-C interaction frequencies from [23] are shown at top. Figure adapted from [25].

0.3 per TAD per

Mean interactions Mean 0.0

High-confidence PET interactions

TAD Normalized TAD TAD Boundary Boundary

Figure 13. A meta-plot of mean Hi-C interaction frequencies across a normalized TAD length, extending, on each side, 10% of the TAD length past TAD boundaries. Figure adapted from [25].

15 80 CTCF

SMC1 40 PET peaks

0 -500 KbTAD +500 Kb Boundary Regions per 10 Kb (per thousand) (per per Regions Kb 10

Figure 14. A meta-plot of binding site frequencies in a 500 Kb window around TAD boundaries. Bin sizes are 10 Kb each. Figure adapted from [25].

4.2 Super-enhancer Domains We sought to define chromosome structures that control the expression of genes that tran- scriptionally regulate cell state. Super-enhancers (SEs), first described in Whyte, et. al. (2013) [17], are clusters of enhancers with high Mediator signal densities that dominate cellular transcriptional programs. Followup studies identified the cohesin and condensin II complexes as occupying super-enhancers at high densities [29]. By visual inspection, we identified, with high frequency, cohesin/CTCF-cohesin/CTCF loops that enclose a super- enhancer and its associated gene; we termed these super-enhancer domains (SDs). We sought to characterize the basic properties of SDs (Figure 15). The SD has a character- istic length scale of 100 Kb; they typically contain one or two super-enhancer regulated ⇠ genes. Cohesin-associated interactions are often found within an SD; these are usually not associated with CTCF binding at either or both ends (Figure 16). Rather, they tend to loop different constituent enhancers within a super-enhancer, or between super-enhancers and their target genes. SDs exhibit high signal densities of Mediator, Pol II, cohesin, hi- stone marks correlated with active transcription, and the master transcription factors of the corresponding cell type (Figure 17). The occupancy of these factors do not typically exceed the CTCF boundaries of these domains (Figure 18).

16 Topological Associating Domain

56 kb Super-enhancer Domain Cohesin- CTCF-CTCF Associated Interactions Enhancer-Promoter 11 CTCF 3 SMC1 22 OSN Super-enhancer

Lefty1

Topological Associating Domain

CTCF CTCF

Super-enhancer Domain (197 SDs) OSN SE Gene Active Cell Identity Gene

Figure 15. A schematic of the basic properties of super-enhancer domains. On the top, a super-enhancer domain is defined by a CTCF-CTCF loop that simultaneously associates with the cohesin protein Smc1. It encloses a super-enhancer, defined by Oct4/Sox2/Nanog occupancy, that loops to a target promoter that activates a gene responsible for the speci- fication of a cell state. Note that the enhancer-promoter loop is not associated with CTCF binding on either or both ends. On the bottom, a graphical depiction of a super-enhancer domain. Figure adapted from [25].

Topologically Associating Domain Topologically Associating Domain

124 kb 84 kb Super-enhancer Domain Super-enhancer Domain CTCF-CTCF Cohesin- Cohesin- CTCF-CTCF Associated Enhancer-Promoter Associated Interactions Enhancer-Enhancer Enhancer-Enhancer Interactions 10 CTCF 10 CTCF 3 SMC1 5 SMC1 20 OSN 24 OSN Super-enhancer Super-enhancer Sox2 Prdm14

Topologically Associating Domain Topologically Associating Domain

105 kb 140 kb Super-enhancer Domain Super-enhancer Domain Cohesin- CTCF-CTCF Cohesin- CTCF-CTCF Associated Enhancer-Promoter Associated Enhancer-Promoter Interactions Enhancer-Enhancer Interactions Enhancer-Enhancer

8 CTCF 8 CTCF 2 SMC1 3 SMC1 24 OSN 21 OSN Super-enhancer Super-enhancer

Sall1 Klf9

Figure 16. Additional examples of super-enhancer domains. In general, we do not find

17 CTCF binding at either or both ends of Smc1 loops internal to the super-enhancer domain boundaries. Figure adapted from [25].

Topological Associating Domain

Super-enhancer Domain

CTCF CTCF SE Gene CTCF CTCF 12 CTCF 2.5 SMC1 3 OCT4

3 SOX2

3 NANOG

1.5 KLF4

4 ESRRB

5 Mediator 7 Pol2 7 H3K27ac 7 H3K4me3 2 H3K36me3 1.5 H3K27me3 6 SUZ12 6 EZH2 0 2 +22+23+3 Start End 3+3TSS End 2+2 2+2

Figure 17. Occupancy metaplots of (ChIP) signal at super-enhancer domains for factors with transcriptional repressing/insulating (CTCF, SUZ12, EZH2) and activating functions (Smc1, Oct4, Sox2, Nanog, Klf4, Esrrb, Mediator, Pol II), and of histone marks associated with gene activation (H3K27ac, H3K4me3) and repression (H3K36me3, H3K27me3). Signal units are in normalized rpm (reads per million base pairs), and metaplots of DNA elements are scaled to normalized lengths and aligned to the center of the element, with extensions on either side in the indicated number of kilobases. Figure adapted from [25].

18 Mediator 197 Super-enhancer Domains Super-enhancer 197

Pol2 197 Super-enhancer Domains Super-enhancer 197

H3K27ac

1.0 ChIP-seq ChIP-seq 197 Super-enhancer Domains Super-enhancer 197 Read Density Read

0.0 Normalized SD SD SD Boundary Boundary

Figure 18. A heatmap representation of Meditator, Pol II, and H3K27ac signal across a normalized SD length. Signal for these factors is largely contained within SD boundaries. Figure adapted from [25]. In collaboration with Hnisz, Weintraub, and Schuijers, we tested the function of super- enhancer domains. Previous literature has attributed CTCF with insulator activity [30]; therefore, we hypothesized that CTCF at SD boundaries acted to insulate nearby genes from ectopic activation. To test this, we studied the super-enhancer domain that enclosing miR-290-295, a microRNA family that has been shown to promote the embryonic stem cell in mice. We delete one of the CTCF boundaries (C1) and assay, by qRT-PCR, expression in the boundary-deleted mutant. We find that the primary miR-290-295 transcript (prior to downstream RNA processing) is reduced in expression by two-fold, while neighboring gene Nirp12, positioned, with respect to the concomitant enhancer, on the same side as miR- 290-295, is increased in expression by eight-fold. The expression of distal genes, AU018091 and Myadm, did not change. This leads to the model that CTCF boundaries are essential for the maintenance of super-enhancer domain integrity. Their deletion lead to a loss of insulating activity and ectopically activates the expression of nearby genes at the expense of the wild-type target promoter.

19 25 kb Super-enhancer Domain CTCF-CTCF

Cohesin- Enhancer-Promoter Associated Interactions Enhancer-Enhancer

10 CTCF 2 SMC1 16 OSN Super-enhancer

miR-290-295 Myadm AU018091 C1 Nlrp12

AU018091 Pri-miR-290-295 Nlrp12 Myadm 150 150 1000 150

800 100 100 100 600

400 50 50 50 200

Expression level type) Expression (%wild of 0 0 0 0 C1 C1 C1 C1

wild type wild type wild type wild type

Figure 19. The CRISPR-mediated deletion of the CTCF boundary of the miR-290-295 super-enhancer domain. Smc1 ChIA-PET interactions are depicted at top, followed by tracks displaying CTCF, Smc1, and OSN ChIP signal. Deletion of the downstream CTCF boundary (C1) led to the ectopic activation of neighboring gene Nirp12 at the cost of miR-290-205 expression. Transcript levels are assayed by qRT-PCR and normalized to that of housekeeping gene GAPDH. Figure adapted from [25].

4.3 Polycomb Domains Maintenance of cell state not only requires the expression of cell-lineage specifying genes, but additionally the repression of alternative cell lineage-specifying genes. Repressed devel- opmental genes are characterized by the H3K27me3 mark, deposited by the Polycomb pro- tein EZH2 [7]. Therefore, we sought to understand if there existed chromosome structures, similar to super-enhancer domains, that locally confine Polycomb-mediated repression. Analysis of looping data in the vicinity of H3K27me3-marked sites revealed the presence of similar Smc1/CTCF-Smc1/CTCF loops that surrounded Polycomb-repressed genes, which we term polycomb domains (PDs). Similar to super-enhancer domains, polycomb domains have a characteristic length scale of 100 Kb, and typically contain one or two H3K27me3 ⇠ marked sites (Figure 20). High signal densities of concomitant chromatin modifiers SUZ12 and EZH2 but not of master transcription factors Oct4, Sox2, Nanog; active enhancer histone mark H3K27ac; or Mediator and Pol II are observed (Figure 21). Furthermore, the boundaries of polycomb domains coincide with the restriction of H3K27me3, SUZ12, and EZH2 signal (Figure 22).

20 Topological Associating Domain

55 kb Cohesin- Polycomb Domain Associated Interaction CTCF-CTCF 10 CTCF 4 SMC1 8 H3K27me3

Gata2

Topological Associating Domain

CTCF CTCF

Polycomb Domain (349 PDs) Gene Repressed Developmental Lineage Gene

Figure 20. A schematic of the basic properties of polycomb domains. On the top, a polycomb domain is defined by a CTCF-CTCF loop that simultaneously associates with the cohesin protein Smc1. It encloses a polycomb-repressed gene that is marked by the repressive histone mark H3K27me3. On the bottom, a graphical depiction of a polycomb domain. Figure adapted from [25].

21 Topological Associating Domain

Polycomb Domain

CTCF CTCF Gene CTCF CTCF

12 CTCF

2.5 SMC1

3 OCT4

3 SOX2

3 NANOG

1.5 KLF4 4 ESRRB

5 Mediator

7 Pol2

7 H3K27ac

7 H3K4me3

2 H3K36me3

1.5 H3K27me3

6 SUZ12

6 EZH2 0 2+2 2+2 3+3TSS End 2+2 2+2

Figure 21. Occupancy metaplots of (ChIP) signal at polycomb domains for factors with transcriptional repressing/insulating (CTCF, SUZ12, EZH2) and activating functions (Smc1, Oct4, Sox2, Nanog, Klf4, Esrrb, Mediator, Pol II), and of histone marks associated with gene activation (H3K27ac, H3K4me3) and repression (H3K36me3, H3K27me3). Signal units are in normalized rpm (reads per million base pairs), and metaplots of DNA elements are scaled to normalized lengths and aligned to the center of the element, with extensions on either side in the indicated number of kilobases. Figure adapted from [25].

22 Left PD Right PD Boundary Boundary

CTCF 120 Polycomb Domains Polycomb 120

H3K27me3 120 Polycomb Domains Polycomb 120

SUZ12 120 Polycomb Domains Polycomb 120

2

0 EZH2

(Z-transformed) -2 ChIP-seq Signal ChIP-seq 120 Polycomb Domains Polycomb 120 0-10 +10 0-10 +10

Figure 22. A heatmap representation of CTCF, H3K27me3, SUZ12, and EZH2 signal in a 10 Kb window around both Polycomb domain boundaries, for 120 Polycomb domains. ± Heatmap intensities correspond to a Z-transformed ChIP-seq read density signal. Signal for these factors is largely contained within PD boundaries. Figure adapted from [25]. To test the possibility that the CTCF boundaries of polycomb domains confine Polycomb- mediated gene silencing, in collaboration with Hnisz, Weintraub, and Schuijers, we deleted the CTCF boundaries of the Tcfap2e polycomb domain. Deletion of the upstream boundary (C1) lead to a 1.7-fold increase in expression of Tcfap2e, while deletion of the downstream boundary (C2) led a 4-fold expression increase; no significant effect was seen on the expression of neighboring genes Psmb2 and Ncdn (Figure 23). Therefore, we similarly posit that CTCF boundaries are essential for the maintenance of polycomb domain integrity.

23 F 75 kb Cohesin- Polycomb Domain Associated CTCF-CTCF Interaction 10 CTCF 2 SMC1 6 H3K27me3

Psmb2 Tcfap2e Ncdn C1 C2

Psmb2 Tcfap2e Ncdn 200 200 200

150 150 150

100 100 100

50 50 50

0 0 0

Expression level (% of wild type) wild level of (% Expression C1 C1 C1

wild type wild type wild type

Psmb2 Tcfap2e Ncdn 150 500 150

400 100 100 300

200 50 50 100

0 0 0

Expression level type) Expression (%wild of C2 C2 C2

wild type wild type wild type

Figure 23. The CRISPR-mediated deletion of both CTCF boundaries for the Tcfap2e polycomb domain. Smc1 ChIA-PET interactions are depicted at top, followed by tracks displaying CTCF, Smc1, and OSN ChIP signal. Deletion of the upstream (C1) and downstream (C2) CTCF boundaries led to the ectopic activation of target gene Tcfap2e to expression levels 1.7 and 4 fold above wild-type, respectively. No significant effect was seen on the expression levels of neighboring genes Psmb2 and Ncdn. Transcript levels are assayed by qRT-PCR and normalized to that of housekeeping gene GAPDH. Figure adapted from [25].

4.4 Cell-type Specificity of Super-enhancer and Polycomb Domains In light of the cell-type dependence of chromosome compartments but the cell-type in- dependence of topological domains, we sought to understand whether the cell-type speci- ficity of super-enhancer and polycomb domains. Previous work, by Phillips-Cremins, et.

24 al. [30], had suggested, by 5C (a chromosome conformation capture method) and chro- matin IP, that cohesin/CTCF loops are more cell-type independent than cohesin/Mediator loops, which typically represent cell-type specific enhancer-promoter interactions. This pro- poses a model that between syntenic regions in different cell types, we expect to see con- served cohesin/CTCF-cohesin/CTCF interaction, regardless of super-enhancer presence in the vicinity of a cell-identity specifying gene. (Figure 24). A Cell Type A

CTCF SE Gene CTCF CTCF Gene CTCF

Cell Type B

CTCF Gene CTCF CTCF SE Gene CTCF

FigureB 24. A model depicting, between two cell types, conserved cohesin-mediated interac- tions between two CTCF sites, regardless of the presence of a super-enhancer in the vicinity 61 kbSMC1 ChIA-PET 98 kb SMC1 ChIA-PET of a cell-identity specifying gene. Figureinteraction adapted from [25]. interaction

To test this model, we used our ESCSuper-enhancer Smc1 ChIA-PET data in combination withSuper-enhancer mouse neuralESC precursor13 cell (NPC) 5C data fromCTCF Phillips-Cremins,8 et. al. [30]. Super-enhancersCTCF in NPCs were4 identified by the co-occupancySMC1 of NPC master4 transcription factors Sox2SMC1 and 17 OSN 17 OSN Brn2. At loci for which 5C and Smc1 ChIA-PET data were available for the respective cell

types (Nanog and Olig1/Olig2),65 kb we saw evidence of ChIA-PET and 104 kb 5C interactions at at both regions, but the cell-type specific presence5C interaction of a super-enhancer driving the expression5C interaction Super-enhancer Super-enhancer of the corresponding cell-identity specifying gene (Figure 25). 6 CTCF 2 CTCF NPC 8 SMC1 2 SMC1

9 61 Kb SMC1SOX2 ChIA-PET 9 98 Kb SMC1SOX2 ChIA-PET 5 InteractionBRN2 5 InteractionBRN2

Super-enhancer Super-enhancer Nanog Olig2 Olig1 ESC 13 CTCF 8 CTCF C 4 SMC1 4 SMC1 17 OSN 17 OSN

all CTCF peaks CTCF peaks at SD borders CTCF peaks at PD borders CTCF peaks at PET peaks 65 Kb 104 Kb observed in ESC 5C Interaction 5C Interaction 20 Super-enhancer Super-enhancer

6 CTCF 2 CTCF NPC 10 8 SMC1 2 SMC1 90 SOX2 9 SOX2 5 1 18 1 BRN2 18 5 1 18 1 BRN2 18 peaks identified in ESCs % of the CTCF ChIP-Seq CTCF the of % (specific) (constitutive) (specific) (constitutive) (specific) (constitutive) (specific) (constitutive)

NanogNumber of cell types in which the CTCFOlig2 ChIP-Seq peak Olig1 is observed

Figure 6. InsulatedFigure 25. Neighborhoods At the Nanog Are Preserved and Olig1/Olig2 in Multiple Cell loci, Types Smc1 ChIA-PET and 5C interactions are (A) Model depicting constitutive domain organization, mediated by interaction of two CTCF sites co-occupied by cohesin, in two cell types. (B) An exampledisplayed SD in ESCs for and athe domain mouse in NPCs. embryonic High-confidence stem interactions cell and from neuralthe SMC1 precursorChIA-PET data cell, set are respectively. depicted by blue lines, The and 5C interactions from Phillips-Creminsconstitutive et al. (2013) presenceare depicted of a by cohesin/CTCF black lines. Super-enhancers loop, are independent indicated by red of bars. the ChIP-seq cell-type binding specificity profiles (reads of per million per base pair) for CTCF,the cohesin corresponding (SMC1), OCT4, superenhancer SOX2, NANOG (OSN), (OSN SOX2, and for BRN2 ESCs, are shown Sox2/Brn2 at the Nanog forlocus NPCs), and the Olig1/Olig2 was observed.locus in ESCs and NPCs. (C) Occupancy of CTCF peaks across 18 cell types. The CTCF peaks used for the analysis are the CTCF peaks found in ESCs. The percentage of these peaks that are observedFigure in the indicated adapted number from of cell [25]. types is shown for four groups of CTCF sites: all CTCF peaks identified in ESCs, CTCF peaks at SD boundaries in ESCs, CTCFTo peaks determine at PD boundaries whether in ESCs, the and constitutive CTCF peaks at PET presence peaks (identified of the by cohesin/CTCF SMC1 ChIA-PET in ESCs). loop related to the See also Figurecell-type S6 and Table independence S3B. of CTCF binding, we profiled the cell-type specificity of ESC CTCF have also demonstrated that cohesin and CTCF are associated extends these observations by connecting such structures with with large loop substructures within TADs, whereas cohesin25 the transcriptional control of specific super-enhancer-driven and Mediator are associated with smaller loop structures that and polycomb-repressed cell identity genes and by showing sometimes form within the CTCF-bound loops (de Wit et al., that these structures can contribute to the control of genes 2013; Phillips-Cremins et al., 2013; Sofueva et al., 2013). both inside and outside of the insulated neighborhoods that CTCF-bound domains have been proposed to confine the activ- contain key pluripotency genes. ity of enhancers to specific target genes, thus yielding proper The organization of key cell identity genes into insulated tissue-specific expression of genes (DeMare et al., 2013; Han- neighborhoods may be a property common to all mammalian doko et al., 2011; Hawkins et al., 2011). Our genome-wide study cell types. Indeed, several recent studies have identified

384 Cell 159, 374–387, October 9, 2014 ª2014 Elsevier Inc. binding sites. Using ChIP-seq data from ENCODE across 18 mouse cell types, we char- acterized the proportion of ESC CTCF peaks that were found in more than one cell type. We found, in general, that CTCF sites that co-bound with super-enhancer domain and polycomb domain boundaries, and more generally with cohesin-cohesin gene loops, were more cell-type independent than the total population of ESC CTCF peaks (Figure 26).

All CTCF sites CTCF sites at SD boundaries CTCF sites at PD boundaries CTCF sites at all PET peaks observed in ESC 20

10

0 1 18 1 18 1 18 1 18 % of CTCF binding CTCF of %

sites identified in ESCs in identified sites (specific) (constitutive) (specific) (constitutive) (specific) (constitutive) (specific) (constitutive) Number of cell types in which CTCF binding site (ChIP peak) is observed

Figure 26. Cell-type specificity of embryonic stem cell CTCF binding sites, across 18 mouse cell types. CTCF peaks colocalizing with SD/PD boundaries, and more generally, with all Smc1 PET peaks, are more constitutive than the global set of CTCF peaks. Figure adapted from [25].

5 Discussion Here we identify and describe, using Smc1 cohesin ChIA-PET, chromosome structures, of intermediate length scale ( 100 Kb), that control the transcriptional of cell identity- ⇠ specifying genes. Evidence provided earlier of the role of CTCF and cohesin in enhancer- promoter looping, as well as their presence at larger chromosomal structures, such as chro- mosome compartments and topological associating domains, suggested that they may also play a role in restricting the regulatory activity of activating super-enhancers and repressive Polycomb-group proteins [4, 7, 10, 14, 16-24, 30]. The two chromosome structures that we have identified, super-enhancer (SDs) and poly- comb domains (PDs), surround super-enhancer/target gene units and H3K27me3 (Polycomb- repressed) developmentally silenced genes. They are surrounded by a cohesin/CTCF loop and, by the insulator activity of CTCF, restrict either the activating (SD) or repressive (PD) activity of super-enhancers or of EZH2, respectively. Deletion of either or both CTCF boundaries disrupt the insulating function of both domain types. We further find that both SDs and PDs are cell-type constitutive, most likely due to the constititutive co-localization of cohesin/CTCF gene loops. Further studies should extend this work by addressing the biochemistry of the factors that regulate chromosomal loop formation, to understand the molecular mechanisms that govern the relationship between the transcriptional regulation of cell identity and local chromatin structure.

6 Appendix The author discusses briefly the background and some attempts, using the methods of single molecule imaging, to measure the dynamics of protein clustering in the living cell. Arecentstudy[31],usinginvivosingle-moleculeanalysis,soughttoclarifythetemporal dynamics of RNA Pol II foci that form transiently in live cells. Thought to possibly represent transcriptional factories, when observed in fixed cells [32], measurement of the

26 lifetimes of Pol II clusters in live cells yielded an average life-time of ⌧on =5.1 0.4 seconds. ± This therefore suggested a transient interaction of Pol II at active gene loci. The author attempted to extend these studies by applying the previously developed sin- gle particle tracking, a complementary technique [33]. Key steps include detection, lo- calization, and linking detections between frames. Technical difficulties include particle merging/splitting, high particle density, and heterogeneity in particle motion [34]. Using the following experimental parameters: 5 ms exposure time; nucleus-restricted region of interest (ROI), leading to a 14.85-16.22 ms interval between frames; 50% AOTF setting for the 561 nm laser (corresponding to a 10.6 mW output), and maximum camera gain, the author captured single particle trajectories for the Rpb1 subunit of RNA polymerase II (Figure 27: screenshot from raw movie, Figure 28: preliminary data). An interesting question that one might consider asking is how various treatments, such as serum induction or flavopiridol (transcriptional inhibitor) addition, might alter the propor- tion of Rpb1 molecules that fall in bound/slow-/fast-diffusing populations.

Figure 27. Still frame from a raw movie (frame 381/1000) taken in the reproduction of single particle tracking. Scale bar spanning 100 pixels represents 1.6 mm; interval between consecutive time steps is 15 ms.

Translocation Distances

1.4

1.2 m) 1.0

0.8

0.6

0.4 Distance Translocated ( Translocated Distance 0.2

2 4 6 8 10 12 Time Step (×15 ms)

27 Figure 28. Preliminary single particle tracking data. On the left are example single molecule traces of a Rpb1-Dendra2 fluorophore fusion protein; the 52 longest trajectories in a single experiment are shown. On the right is a display of translocation distances for a single Rpb1 trajectory. Plotted are, for time steps t =[1, 12],thedistancetheparticletranslocatedfrom time step t-1 to t. Horizontal axis represents time step t (interval between consecutive time steps is 15 ms) and vertical axis, distance translocated in microns.

7 Funding Work described in this thesis is based upon work supported by the National Science Foun- dation Graduate Research Fellowship under Grant Number 1122374. Any opinion, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation. Further funding comes from National Institutes of Health grant HG002668 to Richard A. Young. Additional funding comes from the NIH Director’s New Innovator Award and National Cancer Institute grant 1DP2CA195769 to Ibrahim I. Cissé. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

8 Acknowledgements Graduate school has certainly been a very interesting process; I came in thinking that I would work on gene regulation, but have, over the course of a few years, realized that I am much more interested in working on something quite different. I thank the professors, classmates, co-workers, friends, and acquaintances who have helped me, in various capac- ities and at different times, in this long and sometimes quite challenging, but ultimately rewarding process. I suspect that there were a few others who tried to help me out at various points, but either your involvement was too subtle for me to detect, or I simply don’t know your name. I hope that the lessons that I have learned here, at MIT, will be of substantial benefit to me in future research. An alphabetical, non-exhaustive list of people to thank, is as follows, so please don’t take offense if I accidentally forgot to include you in the list. Professors: D. Bartel, S. Bell, C. Burge, J. Cao, I. Cissé, J. England, A. Fire, W. Gilbert, J. Gore, A. Grossman, M. Hemann, M. Kardar, A. Keating, P. Sharp, F. Solomon, T. Van Voorhis, A. Willard, C. Wu, R. Young. Classmates, Co-workers, Friends, and Acquaintances: N. Abdennur, B. Abraham, A. Adadey, V. Agarwal, J. Andrews, J. Arribere, J. Baek, J. Chang, A. Chen, S. Cheng, A. Chiu, W. Cho, V. Chou, W. Conway, L. Cote, D. Dadon, J. Dowen, V. Dwivedi, S. Elder, Z. Fan, G. Flower, P. Freese, L. Gao, T. Gingrich, G. Goon, L. Gracey, N. Hannett, N. Heer, D. Hnisz, W. Huang, T. Inoue, M. Jackson, N. Jayanth, X. Ji, Y. Jung, A. Kao, A. Kedaigle, G. Kim, R. Ko, M. Ku, D. Kwabi, N. Kwiatkowski, J. Lalanne, S. Lam, A. Lau, T. Lee, J. Lee, D. Li, F. Li, A. Lin, C. Lin, P. Loh, A. Lucas, M. Maetani, J. Maniar, M. Matus, J. McKnight, D. Miculescu, P. Miller, S. Naqvi, A. Narayanan, A. Pai, S. Perli, G. Pho, C. Picard, D. Qiu, J. Reddy, N. Ricke, V. Saint-Andre, J. Schuijers, X. Shen, P. Shi, J. Shu, A. Sigova, J. Spille, A. Subtelny, C. Suen, T. Tan, S. Tang, K. Tseng, A. Weintraub, M. Wong, D. Wu, V. Xue, F. Yaul, S. Ye, S. Zhao. And of course my parents, W. Zhang and J. Inagaki.

28 9 Bibliography

1. Bruce Alberts, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, and Peter Walter. Molecular Biology of the Cell.FifthEdition.GarlandScience,2008.

2. Jayasha Shandilya, Stefan G.E. Roberts. “The transcription cycle in eukaryotes: From produc- tive initiation to RNA polymerase II recycling.” Biochimica et Biophysica Acta (BBA) - Gene Regulatory Mechanisms. 2012; 1819:391-400.

3. D. B. Nikolov and S. K. Burley. “RNA polymerase II transcription initiation: A structural view.” Proc. Natl. Acad. Sci.1997;94:15-22.

4. Tong Ihn Lee and Richard A. Young. “Transcription of Eukaryotic Protein-Coding Genes.” Annu Rev Genet.2000;34:77-137.

5. Andrew Travers. “The 30-nm Fiber Redux.” Science.2014;344:370-2.

6. Michael T. Lam, Wenbo Li, Michael G. Rosenfeld, and Christopher K. Glass. “Enhancer RNAs and regulated transcriptional programs.” Trends Biochem Sci .2014;39:170-182.

7. Richard A. Young. “Control of the Embryonic Stem Cell State.” Cell.2011;144:940-954.

8. Haggai Kaspi, Elik Chapnik, Maayan Levy, Gilad Beck, Eran Hornstein, and Yoav Soen. “Brief report: miR-290-295 regulate embryonic stem cell differentiation propensities by repressing Pax6.” Stem Cells.2013;31:2266-72.

9. Martin F. Pera and Patrick P. L. Tam. “Extrinsic regulation of pluripotent stem cells.” Nature. 2010; 465:713-720.

10. Laurie A. Boyer, Tong Ihn Lee, et. al. “Core Transcriptional Regulatory Circuitry in Human Embryonic Stem Cells.” Cell.2005;122:947-56.

11. Ian Chambers, Jose Silva, et. al. “Nanog safeguards pluripotency and mediates germline devel- opment.” Nature.2007;450:1230-4.

12. Alexander Marson, Stuart Levine, et. al. “Connecting microRNA Genes to the Core Transcrip- tional Regulatory Circuitry of Embryonic Stem Cells.” Cell.2008;134:521-33.

13. Iouri Chepelev, Gang Wei, Dara Wangsa, Qingsong Tang, and Keji Zhao. “Characterization of genome-wide enhancer-promoter interactions reveals co-expression of interacting genes and modes of higher order chromatin organization.” Cell Research. 2012; 22:490–503.

14. Jill M. Dowen, Richard A. Young. “SMC complexes link gene expression and genome architec- ture.” Current Opinion in Genetics & Development.2014;25:131-137.

15. Jill M. Heidinger-Pauli, Ozlem Mert, Carol Davenport, Vincent Guacci, Douglas Koshland. “Systematic Reduction of Cohesin Differentially Affects Chromosome Segregation, Condensation, and DNA Repair.” Current Biology. 2010; 20:957-63.

16. Michael H. Kagey, Jamie J. Newman, Steve Bilodeau, et. al. “Mediator and cohesin connect gene expression and chromatin architecture.” Nature. 2010; 467:430-435.

17. Warren Whyte, David A. Orlando, Denes Hnisz, et. al. “Master Transcription Factors and Mediator Establish Super-Enhancers at Key Cell Identity Genes.” Cell.2013;153:307-319.

18. Natalia Naumova, Maxim Imakaev, Geoffrey Fudenberg, et. al. “Organization of the Mitotic Chromosome.” Science.2013;342:948-953.

29 19. Erez Lieberman-Aiden, Nynke L. van Berkum, et. al. “Comprehensive mapping of long-range interactions reveals folding principles of the .” Science. 2009; 326:289–293. 20. Marc A. Marti-Renom, Leonid A. Mirny. “Bridging the Resolution Gap in Structural Modeling of 3D Genome Organization.” PLOS Computational Biology. 2011; 7:e1002125. 21. Takashi Nagano, Yaniv Lubling, Tim J. Stevens, Stefan Schoenfelder, Eitan Yaffe, Wendy Dean, Ernest D. Laue, Amos Tanay, and Peter Fraser. “Single-cell Hi-C reveals cell-to-cell variability in chromosome structure.” Nature. 2013; 502:59-64. 22. Job Dekker, Marc A. Marti-Renom, and Leonid A. Mirny. “Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data.” Nature Reviews Genetics. 2013; 14:390-403. 23. Jesse R. Dixon, Siddarth Selvaraj, Feng Yue, Audrey Kim, Yan Li, Yin Shen, Ming Hu, Jun S. Liu, and Bing Ren. “Topological domains in mammalian genomes identified by analysis of chromatin interactions.” Nature. 2012; 485:376-80. 24. Maxim Imakaev, Geoffrey Fudenberg, Rachel Patton McCord, Natalia Naumova, Anton Goloborodko, Bryan R Lajoie, Job Dekker, and Leonid A Mirny. “Iterative correction of Hi-C data reveals hallmarks of chromosome organization.” Nature Methods. 2012; 9:999-1003. 25. Jill M. Dowen, Zi Peng Fan, Denes Hnisz, Gang Ren, et. al. “Control of Cell Identity Genes Occurs in Insulated Neighborhoods in Mammalian Chromosomes.” Cell. 2014; 159:374-87. 26. Melissa Fullwood, et. al. “An oestrogen-receptor-alpha-bound human chromatin interactome.” Nature. 2009; 462:58-64. 27. Elzo de Wit and Wouter de Laat. “A decade of 3C technologies: insights into nuclear organiza- tion.” Genes and Development. 2012; 26:11-24. 28. UCSC Genome Browser: W. James Kent, Charles W. Sugnet, Terrence S. Furey, Krishna M. Roskin, Tom H. Pringle, Alan M. Zahler, and and David Haussler. “The Human Genome Browser at UCSC.” Genome Research.2002;12:996-1006. 29. Jill M. Dowen, Steve Bilodeau, David A. Orlando, Michael R. Hübner, Brian J. Abraham, David L. Spector, Richard A. Young. “Multiple Structural Maintenance of Chromosome Complexes at Transcriptional Regulatory Elements.” Stem Cell Reports.2013;1:371-8. 30. Jennifer E. Phillips-Cremins, Michael E. G. Sauria, Amartya Sanyal, et. al. Architectural Protein Subclasses Shape 3D Organization of Genomes during Lineage Commitment.” Cell. 2013; 153:1281-1295. 31. Ibrahim I. Cissé, Ignacio Izeddin, et. al. “Real-Time Dynamics of RNA Polymerase II Clustering in Live Human Cells” Science.2013;341:664-667. 32. Peter R. Cook. “The Organization of Replication and Transcription.” Science.1999;284:1790- 1795. 33. Ignacio Izeddin, Vincent Recamier, Lana Bosanac, Ibrahim I. Cissé, et. al. “Single-molecule tracking in live cells reveals distinct target-search strategies of transcription factors in the nu- cleus.” eLife.2014;3:e02230. 34. Khuloud Jaqaman, Dinah Loerke, Marcel Mettlen, Hirotaka Kuwata, Sergio Grinstein, San- dra L Schmid, and Gaudenz Danuser. “Robust single-particle tracking in live-cell time-lapse sequences.” Nature Methods.2008;5:695-702.

30