THE GENOME IN ACTION: HIGH-THROUGHPUT SEQUENCING AND EPIGENOMICS

Epigenomics, IFT6299 H2014, UdeM, Miklós Csűrös

Regulation of expression

Transcription of a DNA region requires:

* protein-DNA binding (a transcription factor and the site it recognizes)
* chromatin accessibility

[Figure: a page of Wasserman & Sandelin (Reviews). Diagram labels: distal TFBS, co-activator complex, transcription initiation complex, transcription initiation, CRM, proximal TFBS.]

Figure 1 | Components of transcriptional regulation. Transcription factors (TFs) bind to specific sites (transcription-factor binding sites; TFBS) that are either proximal or distal to a transcription start site. Sets of TFs can operate in functional cis-regulatory modules (CRMs) to achieve specific regulatory properties. Interactions between bound TFs and cofactors stabilize the transcription-initiation machinery to enable gene expression. The regulation that is conferred by sequence-specific binding TFs is highly dependent on the three-dimensional structure of chromatin.

Text of the page:

[...] does not reveal the entire picture. There is only partial correlation between transcript and protein concentrations3. Nevertheless, the selective transcription of genes by RNA polymerase-II under specific conditions is crucially important in the regulation of many, if not most, genes, and the bioinformatics methods that address the initiation of transcription are sufficiently mature to influence the design of laboratory investigations.

Below, we introduce the mature algorithms and online resources that are used to identify regions that regulate transcription. To this end, underlying methods are introduced to provide the foundation for understanding the correct use and limitations of each approach. We focus on the analysis of cis-regulatory sequences in metazoan genes, with an emphasis on methods that use models that describe transcription-factor binding specificity. Methods for the analysis of regulatory sequences in sets of co-regulated genes will be addressed elsewhere. We use a case study of the human skeletal muscle troponin gene TNNC1 to demonstrate the specific execution of the described methods. A set of accompanying online exercises provides the means for researchers to independently explore some of the methods highlighted in this review (see online links box). Because the field is rapidly changing, emerging classes of software will be described in anticipation of the creation of accessible online analysis tools.

Identification of regions that control transcription

An initial step in the analysis of any gene is the identification of larger regions that might harbour regulatory control elements. Several advances have facilitated the prediction of such regions in the absence of knowledge about the specific characteristics of individual cis-regulatory elements. These tools broadly fall into two categories: promoter (transcription start site; TSS) and enhancer detection. The methods are influenced by sequence conservation between ORTHOLOGOUS genes (PHYLOGENETIC FOOTPRINTING), nucleotide composition and the assessment of available transcript data.

Functional regulatory regions that control transcription rates tend to be proximal to the initiation site(s) of transcription. Although there is some circularity in the data-collection process (regulatory sequences are sought near TSSs and are therefore found most often in these regions), the current set of laboratory-annotated regulatory sequences indicates that sequences near a TSS are more likely to contain functionally important regulatory controls than those that are more distal. However, specification of the position of a TSS can be difficult. This is further complicated by the growing number of genes that selectively use alternative start sites in certain contexts.

Underlying most algorithms for promoter prediction is a reference collection known as the 'Eukaryotic Promoter Database' (EPD)4. Early bioinformatics algorithms that were used to pinpoint exact locations for TSSs were plagued by false predictions5. These TSS-detection tools were frequently based on the identification of TATA-box sequences, which are often located ~30 bp upstream of a TSS. The leading TATA-box prediction method6, reflecting the promiscuous binding characteristics of the TATA-binding protein, predicts TATA-like sequences nearly every 250 bp in long genome sequences.

A new generation of algorithms has shifted the emphasis to the prediction of promoters, that is, regions that contain one or more TSS(s). Given that many genes have multiple start sites, this change in focus is biochemically justified.

The dominant characteristic of promoter sequences in the human genome is the abundance of CpG dinucleotides. Methylation plays a key role in the regulation of gene activity. Within regulatory sequences, CpGs remain unmethylated, whereas up to 80% of CpGs in other regions are methylated on a cytosine. Methylated cytosines are mutated to adenosines at a high rate, resulting in a 20% reduction of CpG frequency in sequences without a regulatory function as compared with the statistically predicted CpG concentration7. Computationally, the CG dinucleotide imbalance can be a powerful tool for finding regions in genes that are likely to contain promoters8.

Numerous methods have been developed that directly or indirectly detect promoters on the basis of the CG dinucleotide imbalance. Although complex computational MACHINE-LEARNING algorithms have been directed towards the identification of promoters, simple methods that are strictly based on the frequency of CpG dinucleotides perform remarkably well at correctly predicting regions that are proximal to or that contain the [...]

Glossary (page margin):

ORTHOLOGY: Two sequences are orthologous if they share a common ancestor and are separated by speciation.

PHYLOGENETIC FOOTPRINTING: An approach that seeks to identify conserved regulatory elements by comparing genomic sequences between related species.

MACHINE LEARNING: The ability of a program to learn from experience, that is, to modify its execution on the basis of newly acquired information. In bioinformatics, neural networks and Monte Carlo Markov Chains are well-known examples.

Wasserman & Sandelin, Nat Rev Genet 5:276 (2004)
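The CpG imbalance mentioned in the excerpt can be made concrete with the usual observed/expected ratio. The sketch below is only an illustration: the function names are hypothetical, and the window size and thresholds (200 bp, ratio 0.6, GC fraction 0.5) are commonly used conventions rather than values taken from the slide or the article.

```python
def cpg_observed_expected(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if c == 0 or g == 0:
        return 0.0
    return cg * n / (c * g)


def cpg_rich_windows(seq, window=200, step=50, min_oe=0.6, min_gc=0.5):
    """Yield (start, o/e ratio, GC fraction) for windows passing both cutoffs.

    The default cutoffs are assumptions (conventional values), not
    parameters from the source.
    """
    seq = seq.upper()
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        gc = (w.count("C") + w.count("G")) / window
        oe = cpg_observed_expected(w)
        if oe >= min_oe and gc >= min_gc:
            yield start, oe, gc
```

A ratio near 1 means CpG occurs about as often as base composition predicts; most of the genome falls well below that, which is why windows with a high ratio are candidate promoter regions.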

NATURE REVIEWS | GENETICS VOLUME 5 | APRIL 2004 | 277

Annotation of genomic activity

We want to annotate:

* transcription factor binding sites
* "open" chromatin
* DNA methylation
* histone modifications
* DNA-DNA interactions

Hawkins, Hon & Ren, Nat Rev Genet 11:476 (2010)

Everything can be done by sequencing...

1. filtering / enrichment of regions of interest
2. sequencing
3. alignment, to determine where the fragments come from

Wold & Myers, Nature Methods 5:19 (2008)
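Step 3 and what follows it (turning aligned reads into a signal along the genome) can be sketched in a few lines. This is a toy under stated assumptions: exact string matching stands in for a real read aligner, reads that match zero or several places are simply skipped, and the function names are hypothetical.

```python
def align_exact(read: str, reference: str) -> list[int]:
    """Return every 0-based position where the read matches exactly."""
    hits = []
    pos = reference.find(read)
    while pos != -1:
        hits.append(pos)
        pos = reference.find(read, pos + 1)
    return hits


def coverage(reads, reference):
    """Per-base read depth, using uniquely mapping reads only."""
    depth = [0] * len(reference)
    for read in reads:
        hits = align_exact(read, reference)
        if len(hits) != 1:
            continue  # unmapped or ambiguous read: skipped in this sketch
        for i in range(hits[0], hits[0] + len(read)):
            depth[i] += 1
    return depth
```

Regions enriched in step 1 show up as stretches of high depth; the assays on the following slides differ mainly in how that enrichment is produced.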

MNase-seq: nucleosome positions

MNase: micrococcal nuclease

[Figure: Supplementary Figure 1 of Valouev & al (Supplementary Information, doi:10.1038/nature10002). In vivo mapping: gradient-based and IG-bead cell sorting into CD4+ T lymphocytes, CD8+ T lymphocytes and granulocytes; lyse the cells; micrococcal nuclease; isolate and sequence mononucleosome cores.]

Supplementary Figure 1. Schematic depiction of the in vivo nucleosome mapping experiment. Blood cells were isolated from the blood of a human donor and sorted into populations representing CD4+ T-cells, CD8+ T-cells and granulocytes. Nuclear chromatin was released by crushing the cells, followed by micrococcal nuclease treatment. The mononucleosome fraction was isolated by gel electrophoresis and sequenced to high depth using the SOLiD platform.

Valouev & al, Nature 474:516 (2011)
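Once mononucleosome reads are aligned, a common way to turn them into nucleosome positions is to shift each read's 5' end to the presumed centre (dyad) of the protected fragment and pile up the shifted positions. The sketch below assumes a core of about 147 bp (so a 73 bp shift), single-end reads given as (5' position, strand) pairs, and a simple moving average; none of these choices is stated on the slide, and the names are hypothetical.

```python
HALF_CORE = 73  # assumption: ~147 bp nucleosome core, dyad ~73 bp from each end


def dyad_position(five_prime: int, strand: str) -> int:
    """Estimate the dyad from the 5' end of a mononucleosome read."""
    if strand == "+":
        return five_prime + HALF_CORE
    if strand == "-":
        return five_prime - HALF_CORE
    raise ValueError(f"unknown strand: {strand!r}")


def dyad_density(reads, chrom_length, smooth=30):
    """Moving-average density of estimated dyads along one chromosome.

    reads: iterable of (five_prime_position, strand) pairs.
    smooth: approximate window width in bp (an illustrative default).
    """
    counts = [0] * chrom_length
    for pos, strand in reads:
        d = dyad_position(pos, strand)
        if 0 <= d < chrom_length:
            counts[d] += 1
    half = smooth // 2
    density = []
    for i in range(chrom_length):
        lo, hi = max(0, i - half), min(chrom_length, i + half + 1)
        density.append(sum(counts[lo:hi]) / (hi - lo))
    return density
```

Reads from the two ends of the same nucleosome land on the same estimated dyad, so well-positioned nucleosomes appear as sharp peaks in the density.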


Positioning by sequence

Placement is determined by the genomic sequence (dinucleotides)

[Figure: a page of Kaplan & al (Letters, Nature Vol 458, 19 March 2009), shown with a page of Gaffney & al, "Nucleosome Positioning in the Human Genome".]

Text of the Kaplan & al page:

[...] in vitro yeast-based model and that of in vivo nucleosome occupancy in C. elegans19 (Fig. 3f). Moreover, our model classifies nucleosome-enriched regions from nucleosome-depleted regions in C. elegans with high accuracy (Supplementary Fig. 4), and the 5-base-pair sequence preferences of the C. elegans in vivo map agree well with those of the yeast in vitro map (Fig. 3g). The poorer classification performance in comparison with yeast may indicate that factors other than the DNA sequence preferences make a greater contribution to nucleosome organization in more complex eukaryotes. Alternatively, the poorer performance may indicate that distinct sequence types are present in C. elegans for which our yeast in vitro data do not provide statistics. Nonetheless, our model is significantly correlated with the in vivo nucleosome organization across C. elegans.

We next compared the DNA-encoded nucleosome organization of the in vitro map with nucleosome organization under growth conditions that cause substantial transcriptional changes relative to log-phase growth in rich medium (that is glucose). In addition to our map obtained from yeast cells grown in rich medium, we also measured the nucleosome organization of yeast cells grown separately in galactose, and in ethanol, and found that the overall nucleosome occupancy is very similar between all three in vivo maps, although localized differences are apparent (Fig. 1 and Supplementary Fig. 5). All three in vivo maps are highly correlated with the in vitro map and show the sequence characteristics seen in vitro (Supplementary Fig. 6). These results imply that intrinsic sequence preferences of nucleosomes have a dominant role in determining nucleosome organization in several growth conditions, with local, condition-specific changes superimposed.

To address concerns regarding biases that may be caused by the sequence specificity of micrococcal nuclease20 and possible biases in parallel sequencing, we performed a different kind of in vitro experiment that measures the relative nucleosome affinity of ~40,000 double-stranded 150-bp oligonucleotides without the use of micrococcal nuclease or parallel sequencing. The resulting 5-base-pair nucleosome sequence preferences are in excellent agreement with those discovered in the genome-wide in vitro reconstitution (correlation of 0.83), and there is a good correlation (0.51) between the measured oligonucleotide affinities and those predicted by the model constructed from the genome-wide in vitro map (Supplementary Fig. 7). These results are wholly independent of either micrococcal nuclease or parallel sequencing, and thus confirm that the sequence specificities derived from our previous experiments were caused by intrinsic nucleosome preferences, rather than being an artefact of our experimental approach.

Previous studies identified nucleosome depletion around transcription start and stop sites5–7,9–11. However, because these studies were based on in vivo data, it was not possible to determine which mechanism accounted for the observed patterns. The in vitro and in vivo maps show highly similar stereotypic nucleosome depletion at translation end sites, indicating that this depletion is largely encoded by nucleosome sequence preferences (Fig. 4b and Supplementary Fig. 8). The two maps also show stereotypic nucleosome depletion [...]

... but not always exactly

[Figure: Figure 3 of Kaplan & al, panels a to g. a: average normalized nucleosome occupancy on 5-mers, in vitro against in vivo (YPD), R = 0.98. b (in vitro) and c (YPD): AA/AT/TA/TT and CC/CG/GC/GG dinucleotide frequency against distance from the nucleosome dyad (bp). d to g: predicted against measured normalized nucleosome occupancy (in vitro, in vivo YPD, in vivo C. elegans) and average occupancy on 5-mers (yeast in vitro against C. elegans in vivo), with R = 0.89, 0.75, 0.60 and 0.65.]

Figure 3 | The in vitro sequence preferences of nucleosomes are highly similar to those of nucleosome-bound sequences in vivo and are predictive of nucleosome occupancy in C. elegans. a, Comparison of genome-wide relative nucleosome occupancy of nucleosomes over sequences of length 5. For the in vitro and in vivo maps of nucleosome occupancy, we separately computed the average normalized nucleosome occupancy of each of the 1,024 sequences of length 5, across all of its instances in the genome. Shown is a comparison between the distributions of these 5-base-pair sequences in both maps. Also shown is the Pearson correlation between these distributions. b, Position-dependent sequence preferences of nucleosomes in the in vitro map. We aligned the individual nucleosome reads in the in vitro nucleosome collection. Shown is the fraction (3-bp moving average) of the AA/AT/TT/TA and CC/CG/GC/GG dinucleotides at each position of the alignment. c, Same as b, for the in vivo map. d, Shown is a density dot plot comparison between the normalized nucleosome occupancy per base pair in the in vitro map (x axis) and the normalized nucleosome occupancy per base pair predicted by our cross-validated computational model of nucleosome sequence preferences (y axis). Values above zero indicate nucleosome enrichment relative to the genome-wide average. The colour of each point represents the number of base pairs that map to that point in the graph. The Pearson correlation between the maps is indicated. e, Same as d, comparing our model predictions to the in vivo map. f, In vitro nucleosome sequence preferences on yeast genomic DNA are predictive of the in vivo nucleosome organization in C. elegans. Same as d, comparing our model predictions and the in vivo nucleosome occupancy map of C. elegans on chromosome 2 (ref. 19). g, Comparison of yeast nucleosome sequence preferences in vitro and those of C. elegans in vivo. For each of the maps we separately computed the average normalized nucleosome occupancy of every possible sequence of length 5. For C. elegans, we performed these computations on chromosome 2. Shown is a comparison of these 5-base-pair sequence distributions between the yeast in vitro map and the in vivo map of C. elegans, along with the Pearson correlation between these distributions.

Figure 3 (Gaffney & al). Examples of nucleosome arrays. A. MNase midpoint density (smoothed using a 30 bp sliding window) across a 76 kb region near the chromosome 12 centromere. This region contains an array of ~400 nucleosomes with regular, consistent positioning. B. A small 10 kb subsection of the larger nucleosome array. Predicted nucleosome occupancy from the in vitro sequence model of Kaplan et al. [33] corresponds very well with MNase midpoint density. Kaplan scores predict the affinity of nucleosomes for the sequence but, unlike predicted occupancies, do not incorporate steric exclusion. DNase I nick density (smoothed with a 10 bp sliding window) indicates the location of DNase I sensitive regions (there are none in this region). The density of simulated MNase midpoints and Yoruba DNA sequencing read depth (aggregated across individuals from the 1000 genomes project) are not strongly correlated with MNase midpoint density, which shows that the array is not an artifact of sequencing or mapping bias. C. MNase midpoint density around the gene NPM3. In this region there is consistent, regular spacing of nucleosomes, but their positions are not well predicted by the Kaplan model, particularly in the DNase I hypersensitive sites, which are depleted of nucleosomes. doi:10.1371/journal.pgen.1003036.g003

Text of the Gaffney & al page:

[...] the translational positioning of most human nucleosomes is weak, but we also find that most nucleosomes are significantly more positioned than expected by chance. Additionally, a substantial fraction of nucleosomes have moderate or strong positioning.

At a fine scale, nucleosomes are often found at alternate "minor" translational positions that are multiples of 10 bp away from their most frequent "major" position. These alternate positions preserve the rotational positioning of the nucleosome on the DNA and are likely to be energetically favored because they retain phase with the periodic nucleosome sequence preferences. Similar offsets in nucleosome positions have been observed in 5S rDNA in vitro [43,44] and are consistent with a weak 10 bp periodicity in MNase-seq reads from C. elegans [45]. Recently, this finding has been confirmed by chemical mapping of nucleosomes in yeast, which demonstrates that it is not an artifact of digestion by MNase [46].

At a broad scale, nucleosomes are often found in consistently positioned, regularly spaced arrays, which are enriched in insulators, [...]

Kaplan & al, Nature 458:362 (2009); Gaffney & al, PLoS Genetics 8:e1003036 (2012)

PLOS Genetics | www.plosgenetics.org | November 2012 | Volume 8 | Issue 11 | e1003036
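The 5-mer comparison in panels a and g of the Kaplan figure (average occupancy per 5-mer in two maps, then a Pearson correlation) can be sketched as follows. The caption does not say how the occupancy of one 5-mer instance is defined; averaging the per-base occupancy over its 5 bases is an assumption made here, and all function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def kmer_mean_occupancy(seq, occupancy, k=5):
    """Average occupancy of each k-mer over all of its instances in seq.

    Assumption: the occupancy of one instance is the mean of the
    per-base occupancy values over its k positions.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        totals[kmer] += sum(occupancy[i:i + k]) / k
        counts[kmer] += 1
    return {kmer: totals[kmer] / counts[kmer] for kmer in totals}


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def compare_maps(seq, occ_a, occ_b, k=5):
    """Correlate the per-k-mer average occupancies of two maps of seq."""
    a = kmer_mean_occupancy(seq, occ_a, k)
    b = kmer_mean_occupancy(seq, occ_b, k)
    shared = sorted(a.keys() & b.keys())
    return pearson([a[m] for m in shared], [b[m] for m in shared])
```

With `occ_a` taken from an in vitro map and `occ_b` from an in vivo map of the same genome, `compare_maps` gives the kind of number reported as R in panel a; comparing maps from two different genomes (panel g) would call `kmer_mean_occupancy` once per genome instead.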

Spacing between nucleosomes

closer together in active regions

sequence + other factors determine placement

[Figures from Valouev & al. Figure 1: granulocyte and in vitro distograms (distance counts, in millions, against distance in bp; 1-pile and 3-pile subsets) and granulocyte, blood cell and in vitro phasograms (phase counts, in millions, against phase in bp; 1-pile, 3-pile and 5-pile subsets; phase = 193 bp in granulocytes, 203 bp in CD4+ T cells). Figure 2: a, nucleosome spacing (bp) within genes against gene RPKM bins (0–0.1, 0.1–1, 1–10, 10–30, 30–50, >50) for CD4+ T cells and granulocytes; b, nucleosome spacing within epigenetic domains in CD4+ T cells (heterochromatin, gene-body euchromatin, euchromatin at active promoters and enhancers). Supplementary Figure 3: distogram and phasogram calculation.]

Figure 2 | Transcription and chromatin modification-dependent nucleosome spacing. a, Nucleosome spacing as a function of transcriptional activity. x-axis represents gene expression values binned according to RPKM values. Internucleosome spacing is plotted along the y-axis. Dashed lines represent genome-wide average spacing for each cell type. b, Nucleosome spacing within genomic regions marked by specific histone marks in CD4+ T cells. Bar height plots estimated nucleosome spacing for each histone modification. Bar colours differentiate chromatin types (euchromatin vs heterochromatin).

Supplementary Figure 3. Distograms and phasograms. (A) Schematic depiction of the distogram calculation. Blue arcs represent recorded distances between nucleosome reads that map on opposite strands. Distance frequencies are represented as a histogram (distogram, see Fig. 1A-B of the main text). Distograms are used to reveal the existence of consistently positioned nucleosomes in the main data. (B) Schematic depiction of the phasogram calculation. Blue arcs represent recorded phases between the nucleosome reads mapping on the same strand of the reference genome. Phase frequencies are represented as a histogram (phasogram, see Fig. 1C-D). Phasograms are used to reveal the existence of consistently spaced nucleosomes forming regular nucleosome arrays.

Text of the page (Valouev & al, p. 517):

[...] chromatin modifications might be associated with specific spacing patterns. Using previously published ChIP-seq data, we identified regions of enrichment15 for histone modifications that are found within heterochromatin (H3K27me3, H3K9me3)16, gene-body euchromatin (H4K20me1, H3K27me1)16, or euchromatin associated with promoters and enhancers (H3K4me1, H3K27ac, H3K36ac)17, and estimated spacing of nucleosomes for each of these epigenetic domains. We found that active promoter-associated domains contained the shortest spacing of 178–187 bp, followed by a larger spacing of 190–195 bp within the body of active genes, whereas heterochromatin spacing was largest at 205 bp (Fig. 2b). These results reveal striking heterogeneity in nucleosome organization across the genome that depends on global cellular identity, metabolic state, regional regulatory state, and local gene activity.

Valouev & al, Nature 474:516 (2011)

Figure 1 | Global parameters of cell-specific nucleosome phasing and positioning in human. a, In vivo granulocyte distogram (calculation explained in Supplementary Fig. 3a). x-axis represents the range of recorded distances. y-axis represents frequencies of observed distances within 1-pile (blue) and 3-pile (red) subsets. 1-pile subset represents the entire data set, 3-pile subset represents a subset of sites containing three or more coincident read starts. b, Distogram of the in vitro reconstituted nucleosomes showing 1-pile and 3-pile subsets as in (a). c, In vivo granulocyte phasogram (calculation explained in Supplementary Fig. 3b). x-axis shows the range of recorded phases. y-axis shows frequencies of corresponding phases. Phasograms of 1-pile, 3-pile and 5-pile subsets are plotted. Inset, linear fit to the positions of the phase peaks within 3-pile subsets (slope = 193 bp). d, Phasograms of blood cell types. Inset, linear fits in CD4+ T cells (203 bp) and granulocytes (193 bp). e, Phasograms of 1-pile, 3-pile and 5-pile subsets in the in vitro data.

To characterize DNA signals responsible for consistent positioning of nucleosomes, we identified 0.3 million sites occupied in vitro by nucleosomes at high stringency (>0.5; Methods). The region occupied by the centre of the nucleosome (dyad) exhibits a significant increase in G/C usage (Poisson P-value < 10^-100; Fig. 3a). Flanking regions increase in A/T usage as the positioning strength increases (Fig. 3b). A subset of in vitro positioned nucleosomes (stringency > 0.5) which are also strongly positioned in vivo (stringency > 0.4) revealed increased A/T usage within the flanks (Fig. 3c) compared to in vitro-only positioning sites (Fig. 3a), which underscores the importance of flanking repelling elements for positioning in vivo. We term such elements with strong G/C cores and A/T flanks 'container sites' to emphasize the proposed positioning mechanism (Fig. 3d). This positioning signal is different from a 10-bp dinucleotide periodicity observed in populations of nucleosome core segments isolated from a variety of species18,19 and proposed to contribute to precise positioning and/or rotational setting of DNA on nucleosomes19 on a fine scale (Supplementary Fig. 7). G/C-rich signals are known to promote nucleosome occupancy20,21, whereas AA-rich sequences repel nucleosomes4, and our data demonstrate that precise arrangement of a core-length attractive segment flanked by repelling sequences can produce a strongly positioned nucleosome (Fig. 3d).

23 JUNE 2011 | VOL 474 | NATURE | 517
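The distogram and phasogram described in the captions are histograms of pairwise distances between read starts, and the spacing estimate is the slope of a line fitted through the phasogram peak positions. The sketch below follows that description under simplifying assumptions: reads are reduced to integer start coordinates per strand, the distogram only pairs a plus-strand start with minus-strand starts downstream of it, peak positions are supplied by the caller, and the function names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def phasogram(starts, max_phase=3000):
    """Histogram of distances between read starts on the same strand."""
    pos = sorted(starts)
    hist = Counter()
    for i, a in enumerate(pos):
        j = i + 1
        while j < len(pos) and pos[j] - a <= max_phase:
            hist[pos[j] - a] += 1
            j += 1
    return hist


def distogram(plus_starts, minus_starts, max_dist=250):
    """Histogram of distances from plus-strand starts to downstream
    minus-strand starts (opposite-strand read pairs)."""
    minus_sorted = sorted(minus_starts)
    hist = Counter()
    for a in plus_starts:
        j = bisect_left(minus_sorted, a)
        while j < len(minus_sorted) and minus_sorted[j] - a <= max_dist:
            hist[minus_sorted[j] - a] += 1
            j += 1
    return hist


def spacing_from_peaks(peak_positions):
    """Least-squares slope of peak position against peak index 1, 2, 3, ..."""
    n = len(peak_positions)
    xs = range(1, n + 1)
    mx = sum(xs) / n
    my = sum(peak_positions) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, peak_positions))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

For a perfectly regular array, phasogram peaks sit at multiples of the repeat length, so the fitted slope recovers the spacing; a distogram peak near the core length indicates that both ends of the same nucleosome are being sampled consistently.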


DNase-seq: hypersensitive regions

We want to detect "open" chromatin, a possible binding site

The genome shows its sensitive side

Anil Raj & Graham McVicker

New methods for measuring the sensitivity of chromatin to DNase digestion and Tn5 transposition help us map and interpret the genome's regulatory sequences.

Many of the traits that make individuals unique are encoded by genetic differences in their genomes. Recent evidence suggests that many of these genetic differences do not affect genes directly but instead alter regulatory sequences that control when they are switched on and off. Two papers published in this issue of Nature Methods1,2 and one paper published in the December issue3 describe technical advances, and highlight previously unknown challenges, for the mapping of regulatory sequences in the genome.

Regulatory sequences, when active, are bound by transcription factors (TFs), which are proteins that recognize specific DNA sequences. Once bound, TFs recruit other proteins that transcribe, or 'turn on', nearby genes. A complete description of all of the regulatory sequences that are active in a given cell type is therefore fundamentally important for understanding how our genome functions.

Direct measurement of TF-bound sequences, such as by chromatin immunoprecipitation, provides information about only one TF at a time, even though hundreds of TFs may be active in a single cell. Another approach is to look for the indirect effects of TFs on chromatin. At its most basic level, chromatin is made up of a repeating series of nucleosomes (complexes of histone proteins) encircled by DNA. When TFs bind to the genome, they displace nucleosomes, thereby exposing the DNA and making it more sensitive to cleavage by enzymes. The methods described in this issue exploit the increased sensitivity of nucleosome-depleted chromatin to identify active regulatory sequences.

He et al. and Vierstra et al. base their work on DNase-seq, a method that has already proven very successful at identifying active regulatory regions in the genome4–7. First, an enzyme known as DNase I is used to preferentially cleave nucleosome-depleted DNA sequences. Pairs of DNase cuts generate short fragments that are then sequenced and mapped back to the genome to identify sensitive 'open chromatin' regions. The number of fragments that map to a sequence is a measure of regulatory activity; moreover, sites bound by some TFs show highly specific patterns of DNase I cleavage. These cut patterns, called 'DNase footprints', have been used to identify the binding of specific TFs in several studies5,6,8,9.

Vierstra et al.2 extend the DNase-seq protocol in an approach they term DNase-FLASH (DNase I–released fragment-length analysis of hypersensitivity). They sequence both ends of the DNA fragments that are released by DNase cleavage1,2, which allows the fragment lengths to be determined after the fragment ends are mapped to [...]

[Figure labels: millions of cells (DNase I) or 500–50,000 cells (Tn5); cleavage of sensitive sites around bound TFs; long DNA fragments from nucleosomes, short DNA fragments from the sensitive region; sequence ends of fragments and map back to genome; infer locations of sensitive sites and nucleosomes; TF footprint?]

Figure 1 | Mapping regulatory regions with paired-end DNase-seq or ATAC-seq. Short fragments come from nucleosome-depleted sequences, whereas long fragments originate from flanking nucleosomes.

Anil Raj and Graham McVicker are in the Department of Genetics, Stanford University, Stanford, California, USA; and Graham McVicker is in the Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts, USA.

Raj & McVicker, Nature Methods 11:39 (2014)

NATURE METHODS | VOL.11 NO.1 | JANUARY 2014 | 39
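The two ideas in the excerpt, fragment counts as a measure of activity and fragment length as a hint about origin, can be combined in a small sketch. The length cutoffs below (under 100 bp for "short", 147 bp and above for "long") and the 200 bp bin size are illustrative assumptions rather than values from the article; fragments are taken as half-open (start, end) intervals obtained from mapped read pairs, and the function names are hypothetical.

```python
from collections import Counter

SHORT_MAX = 100  # assumption: shorter fragments treated as nucleosome-free
LONG_MIN = 147   # assumption: fragments at least one nucleosome core long


def classify_fragment(start: int, end: int) -> str:
    """Label a paired-end fragment by its length (end is exclusive)."""
    length = end - start
    if length <= 0:
        raise ValueError("fragment end must be greater than start")
    if length < SHORT_MAX:
        return "short"
    if length >= LONG_MIN:
        return "long"
    return "intermediate"


def bin_counts(fragments, bin_size=200):
    """Count short and long fragments per genomic bin, by fragment midpoint."""
    counts = {"short": Counter(), "long": Counter()}
    for start, end in fragments:
        kind = classify_fragment(start, end)
        if kind in counts:
            counts[kind][((start + end) // 2) // bin_size] += 1
    return counts
```

Bins with many short fragments are candidate open regions, while long-fragment counts in neighbouring bins suggest where the flanking nucleosomes sit.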


FAIRE-seq: open chromatin (Giresi et al.)

greater distances from the initiation of transcription, there are more repetitive and heterochromatic regions, and the baseline state of chromatin is more compact and repressive (Alberts et al. 2002). Therefore, it is reasonable to expect that a much smaller fraction of the genome will be in the "open" conformation representing regions of active chromatin. Moreover, it is not clear a priori whether the same physical properties of yeast chromatin that allow isolation of open regions by FAIRE can be successfully exploited for isolation of regulatory regions in human chromatin.

Here, we performed FAIRE in a human foreskin fibroblast cell line and assayed its performance within the genomic regions selected by the ENCODE Project Consortium (2004). Regions enriched by FAIRE were compared with functional genomic elements such as DNaseI hypersensitive sites, transcriptional start sites (TSSs), and active promoters. The results indicate that FAIRE is a simple genomic method for the isolation and identification of human functional regulatory elements, with broad utility for mammalian genomes.

Results

DNA isolated by FAIRE in human cells corresponds to regions of active chromatin

Fibroblasts were grown in culture, and formaldehyde was added directly to actively dividing cells to a final concentration of 1% (see Methods). The cells were then disrupted with glass beads. The resulting extract was sonicated to yield 0.5- to 1-kb chromatin fragments, and subjected to phenol-chloroform extraction (Fig. 1). The DNA fragments recovered in the aqueous phase were fluorescently labeled and hybridized to high-density oligonucleotide microarrays tiling the ENCODE regions at 38-bp resolution. The ENCODE regions represent 1% of the human genome (30 Mb), consisting of manually selected regions of particular interest and randomly selected regions of varying gene density and evolutionary conservation (The ENCODE Project Consortium 2004). As a reference, DNA prepared in parallel from uncrosslinked cells was labeled with a different fluor and simultaneously hybridized to the arrays.

We compared the genomic regions enriched by FAIRE to hallmarks of active chromatin, including localization of the general transcriptional machinery (Kim et al. 2005a,b), and H4 and methylation (Koch et al. 2007), DNaseI hypersensitivity (Crawford et al. 2006; Sabo et al. 2006), and direct assays of promoter activity (Trinklein et al. 2003; Cooper et al. 2006). Genomic regions enriched by FAIRE correspond well with each of these indicators of active regulatory elements (Fig. 2, Table 1).

Active promoters are enriched by FAIRE

Earlier experiments performed in yeast had revealed that the regulatory regions of highly transcribed genes are preferentially isolated by FAIRE (Nagy et al. 2003). To determine whether this relationship holds in human cells, we compared FAIRE signal to measurements of promoter strength. Predicted promoters in the ENCODE regions have been analyzed for regulatory activity by cloning them upstream of reporters and measuring the resulting activity of the reporter gene in different cell types (Trinklein et al. 2003; Cooper et al. 2006). We assigned each probe on the microarray that mapped to a predicted promoter to one of four classes, based on the average activity of the corresponding promoter. Analysis revealed that probes mapping to the most active promoters have a higher FAIRE signal than those that do not map to a promoter or that map to a promoter of lower activity (Fig. 3A, P < 10^-100). Therefore, more active promoters are more strongly enriched by FAIRE in human cells.

FAIRE isolates DNA encompassing TSSs

Yeast experiments had also revealed that FAIRE isolated the nucleosome-free region located at yeast TSSs (Nagy et al. 2003; Yuan et al. 2005; Hogan et al. 2006). Alignment of DNase-chip signal (Crawford et al. 2006), FAIRE signal, and gene annotations suggested that a similar feature was enriched by FAIRE in human cells (Fig. 2). To assess the extent to which this was generally true, we aligned all TSSs for all annotated genes within the ENCODE regions and calculated the average FAIRE signal over a region spanning 1.5 kb upstream to 1.5 kb downstream of the TSS (Fig. 3B, solid line). This analysis revealed that, on average, the peak of enrichment by FAIRE occurs at the TSS. DNase hypersensitive sites are an indicator of DNA accessibility and a well-established characteristic of TSSs and regulatory DNA. We performed the same analysis using DNase-chip data (Crawford et al. 2006) and found that the pattern of DNA enrichment at TSSs was very similar to that generated by FAIRE (Fig. 3B, broken line).

Global comparison of FAIRE peaks to other annotated features

We also analyzed the overall concordance between the genomic regions enriched by FAIRE and other selected hallmarks of active chromatin (Fig. 3C; TSS [Ashurst et al. 2005; Harrow et al. 2006],

Figure 1. FAIRE in human cells is illustrated on the left, while preparation of the reference is illustrated on the right. For FAIRE, formaldehyde is added directly to cultured cells. The crosslinked chromatin is then sheared by sonication and phenol-chloroform extracted. Crosslinking between histones and DNA (or between one histone and another) is likely to dominate the chromatin crosslinking profile (Brutlag et al. 1969; Solomon and Varshavsky 1985; Polach and Widom 1995). Covalently linked protein–DNA complexes are sequestered to the organic phase, leaving only protein-free DNA fragments in the aqueous phase. For the hybridization reference, the same procedure is performed on a portion of the cells that had not been fixed with formaldehyde, a procedure identical to a traditional phenol-chloroform extraction. DNA resulting from each procedure is then labeled with a fluorescent dye, mixed, and comparatively hybridized to DNA microarrays. In this case, we used high-density oligonucleotide arrays that tile across the ENCODE regions of the human genome (30 Mb).

Giresi et al., Genome Res 17:877 (2007)
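The TSS-centred averaging described in this excerpt (mean signal over a window of fixed width around every annotated TSS) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name is hypothetical, the signal is assumed to be a simple per-position array, and gene strand is ignored.

```python
from typing import List, Sequence


def average_profile(signal: Sequence[float], tss_positions: List[int], flank: int) -> List[float]:
    """Mean signal at each offset from -flank to +flank around the given TSSs.

    `signal` is indexed by position (0-based). TSSs whose window would run
    past either end of the array are skipped. Strand is ignored (assumption).
    """
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for tss in tss_positions:
        start, end = tss - flank, tss + flank
        if start < 0 or end >= len(signal):
            continue  # window does not fit inside the signal array
        for offset in range(width):
            totals[offset] += signal[start + offset]
        used += 1
    if used == 0:
        return []
    return [total / used for total in totals]


# Toy signal with two "TSSs" at positions 2 and 7 and a 1-position flank.
toy_signal = [0, 1, 4, 1, 0, 0, 2, 6, 2, 0]
profile = average_profile(toy_signal, [2, 7], flank=1)
```

For the analysis in the text the flank would correspond to 1.5 kb on each side, and minus-strand genes would need their windows reversed before averaging.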

FAIRE and DNaseI

Both find open chromatin (⇐ binding of transcription factors).

Human open chromatin defined by DNaseI and FAIRE

Together, DNase-seq and FAIRE-seq identify most of the sites bound by regulatory factors

DNase-seq and FAIRE-seq data were compared to ChIP-seq data generated from the same cell lines using antibodies to CTCF, MYC, and Pol II (see Methods) (Supplemental Fig. S4). Over 96% of the strongest CTCF and MYC ChIP sites were identified by one or both assays (Fig. 2C,D). About 30% of CTCF and 15% of MYC sites were captured by DNase-only or FAIRE-only sites. At any given ChIP-seq peak cut-off, ChIP-seq signal intensity was the strongest for peaks detected by both DNaseI and FAIRE, was weaker in sites detected by only one assay, and the weakest for sites that overlapped neither assay (Supplemental Fig. S5).

We examined the correspondence of published ChIP-seq data in matching cell types (Fujiwara et al. 2009; Motallebipour et al. 2009; Frietze et al. 2010; Kouwenhoven et al. 2010; Raha et al. 2010) with our open chromatin data. DNase-seq and FAIRE-seq captured >80% of sites (>90% of the strongest sites) for TP63 in NHEK, FOXA1, and FOXA3 in HepG2, and GATA1 in K562 (Supplemental Fig. S6), and ~70% of the ZNF263 sites in K562. We note that FOXA1, FOXA3, and GATA1 were better identified by FAIRE-seq, while ZNF263 was found more often by DNase-seq.

We next evaluated our Pol II ChIP-seq data in conjunction with RNA expression data generated from the same cells. For each gene with RNA data in each cell line, we determined whether there was a significant signal for Pol II binding and/or open chromatin in the region 1000 bases upstream of and 500 bases downstream from an annotated transcription start site. We found that 81% of all TSSs harbored accessible chromatin, consistent with previous estimates that 70%–80% of all genes are either active or poised (Guenther et al. 2007). We divided genes into highly expressed (46% of genes, log2 RNA > 7; Methods), moderately expressed (29%, log2 RNA between 5 and 7), and lowly or not expressed (25%, log2 RNA < 5). We found that nearly all highly expressed genes had Pol II binding and open chromatin at their TSS (Fig. 2E). About 60% of the moderately expressed genes showed Pol II and open chromatin signals, while an additional 30% showed just open chromatin signal. About half of the lowly or nonexpressed genes showed evidence of Pol II or open chromatin, while the remaining half had no evidence of either signal. In all, open chromatin identifies the TSS of nearly all of expressed genes and indicates that a large fraction of the remaining genes may be poised for transcription.

A combined open chromatin atlas reveals chromatin similarities between functionally related cell types

To take advantage of the strengths of each assay, we created a combined annotation for each of the seven cell lines by integrating data from DNaseI and FAIRE (see Methods). Our open chromatin atlas contains sites strongly identified by both assays, high confidence peaks present in only one assay, and lower confidence peaks supported by both assays (Table 1). The number of combined significant open chromatin sites ranged from 100,000 to 125,000 (P < 0.05; Methods) for each cell line. Between any two cell types, ~30%–40% of open chromatin sites are shared (Supplemental Table S3).

Using open chromatin sites, we performed hierarchical clustering of the cell lines (see Methods) (Supplemental Fig. S7A). The clustering appears to reflect functional and lineage similarities in cell types and almost perfectly matches cell-line clustering based on gene expression data (Supplemental Fig. S7B). For example, we find that the two cell types of hematopoietic lineage, GM12878 (lymphoblastoid cell line) and K562 (chronic myeloid leukemia), clustered together using either expression or chromatin data. Embryonic stem cells do not have a considerably different number of open chromatin sites and do not contain a superset of open chromatin sites found in other more differentiated cell types. However, embryonic stem cell open chromatin sites tended to be larger and covered a greater fraction of the genome than other cell types (Table 1).

The discovery of human regulatory elements by open chromatin mapping is far from saturation

We created union sets for every possible combination of 2, 3, 4, 5, 6, and 7 cell types and plotted the rate at which new sites appeared (Fig. 3A). Regardless of the threshold used to call the sites, the number of new sites identified does not abate as the number of cell lines analyzed increases. In contrast, performing the same analysis

Figure 1. Identification of open chromatin in seven human cell lines. (A) A schematic representation of the experiment and analysis design. (B) DNaseI (y-axis fixed at Parzen signal value 0.15) and FAIRE (y-axis fixed at 0.04) data from seven cell lines surrounding the HNF4A locus (145 kb; UCSC Genome Browser) shows both ubiquitous and cell-type selective open sites that are especially prevalent in HepG2 cells. Pol II, CTCF, and MYC ChIP-seq peaks that overlap open chromatin are highlighted.

Song et al., Genome Res 21:1757 (2011)

Where are the nucleosomes?

DNA is transferred (on a Yeast Artificial Chromosome) from one species to another: what happens to the nucleosomes?

Molecular Cell: Unifying Model for Nucleosome Positioning

Figure 1. Functional Evolutionary Dissection of Chromatin-Establishment Mechanisms (A) Schematic of experimental design. Yeast artificial chromosomes are constructed carrying sequence from species such as K. lactis, and introduced into S. cerevisiae. Comparison of nucleosome-mapping data between the same sequence in two different environments (its endogenous genome, and in S. cerevisiae) can be used to disentangle DNA-driven from trans-mediated aspects of chromatin organization. (B) Chromosomal complement of parental S. cerevisiae (AB1380) and three different YAC-bearing strains. Pulsed-field gel electrophoresis of YAC-bearing strains, as indicated. (C and D) Examples of nucleosome-mapping data from two genes. Blue line indicates nucleosome-mapping data from wild-type K. lactis (Tsankov et al., 2010), red line shows data from the same sequence carried on a YAC in S. cerevisiae. (E and F) Data for all K. lactis genes on all three YACs. (E) shows data for all genes from wild-type K. lactis, with genes sorted by NDR width, while (F) shows data from these genes on YACs, sorted identically. Black indicates no sequencing reads, yellow intensity indicates number of sequencing reads. C and D indicate the example genes shown above.

at promoters (Tirosh et al., 2010). However, S. cerevisiae and S. paradoxus differ very little in bulk aspects of chromatin architecture. In contrast, chromatin structure exhibits far greater differences between more divergent species: for example, average nucleosome spacing differs by 15–20 bp between S. cerevisiae and K. lactis (last common ancestor ~150 million years ago) (Heus et al., 1993; Tsankov et al., 2010).

Here, we describe a functional evolutionary approach to systematically dissect the contributions of DNA sequence and the nuclear environment to nucleosome positioning in vivo. This approach relies on the finding that there are species-specific differences in parameters of nucleosome positioning in a variety of yeast species, even though the general pattern is highly conserved (Tsankov et al., 2010). Specifically, we compare nucleosome maps of artificial chromosomes (YACs) containing large, heterologous genomic regions from different yeast species in S. cerevisiae with maps of the same regions in their native organism (Figure 1A). In principle, features that change in the context of S. cerevisiae are determined by protein factors that are functionally distinct in the two species, whereas features that are retained when the foreign yeast DNA is present in S. cerevisiae are due either to intrinsic DNA sequence or to conserved trans-acting regulators. For example, when the S. cerevisiae HIS3-PET56 region is introduced into S. pombe, it retains the nucleosome-depleted promoter region, but not the positions of nucleosomes in the coding region (Sekinger et al., 2005). In addition, the generation of fortuitous functional elements arising from heterologous genomic sequences makes it possible to address mechanistic issues that are presumably free of evolutionary constraints. Here, we show that nucleosome spacing is established in trans, and that promoter nucleosome

Hughes et al., Mol Cell 48:5 (2012)

They move...

Molecular Cell: Unifying Model for Nucleosome Positioning

It is not simply the sequence that determines nucleosome placement.

Figure 4. +1 Nucleosome Shifts Associated with Transcriptional Changes (A and B) Nucleosome data and RNA-Seq data are shown for K. lactis and D. hansenii genes in wild-type and YACs, as indicated. RNA-Seq data for YAC-derived transcripts are normalized independently from S. cerevisiae transcripts here; see Figures S4B and S4C for data normalized genome-wide. (C–E) Examples of +1 nucleosome shifts associated with changes in transcription. (C) shows a moderate upstream shift in a +1 nucleosome with a similar change in transcript length, while (D) and (E) show large-scale NDR gain/loss with associated changes in transcription. Schematic interpretation of the nucleosome positioning for the endogenous gene is shown in blue above the rectangle, nucleosome positioning in the YAC is shown in red below the rectangle. Arrows indicate inferred TSSs (note that RNA-sequencing data are not strand specific, but TFIIB-mapping data support our inferred TSSs); the furthest 5′ RNA in (E), for example, derives from the upstream gene as opposed to a divergent promoter.

these shifts were biased toward upstream shifts (Figure 3A). Thus, our observations demonstrate that pronucleosomal sequences do not "program" the position of the +1 nucleosome in vivo.

The strong correspondence between +1 nucleosome positioning and transcriptional start sites in many species (Jiang and Pugh, 2009) led us to consider the hypothesis that changes in transcriptional activity might underlie the repositioning of the +1 nucleosomes (Zhang et al., 2009). We therefore carried out deep sequencing of RNA isolated from D. hansenii, K. lactis, and the S. cerevisiae YAC strains in this study, and carried out ChIP-Seq for TFIIB localization in the YAC-containing strains (a full analysis of these data will be published separately). Alignment of RNA-Seq data from wild-type strains with nucleosome-mapping data confirmed prior predictions that the positioning of +1 nucleosomes with respect to a gene's transcription start site (TSS) varies between these species (Tirosh et al., 2007; Tsankov et al., 2010): transcription begins further inside the +1 nucleosome in K. lactis than in D. hansenii (Figure S4A).

Comparing endogenous to YAC-based gene expression, we found on average that genes on YACs were expressed at lower levels than host genes (average sequencing reads per kilobase of coding sequence for YACs were ~40% of the average value for endogenous RNAs), consistent with extensive promoter sequence divergence between species resulting in widespread misinterpretation of exogenous regulatory information by the S. cerevisiae transcriptional machinery (Figures S4B, S4C, and S4E). In general, we found a good correlation between expression levels for genes in their endogenous genome versus expression from the YACs (Figure S4D): genes expressed at high levels in K. lactis remained the most highly expressed genes when carried on YACs, but were expressed at lower levels relative to S. cerevisiae genes. In D. hansenii, we also observed increased expression of intergenic regions in the YACs (Figure S4C and see below), again indicating evolutionary divergence in transcriptional control sequences (e.g., loss of transcriptional termination signals and/or gain of cryptic promoters).

Consistent with a relationship between +1 nucleosome positioning and TSSs, we found that the 5′ ends of RNAs in YACs shifted on average toward a S. cerevisiae-like location relative to the +1 nucleosome (Figures 4A and 4B): K. lactis RNAs started farther upstream in the YAC, whereas D. hansenii RNAs

Hughes et al., Mol Cell 48:5 (2012)
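The expression comparison in this excerpt is stated in sequencing reads per kilobase of coding sequence. A minimal sketch of that normalization follows; the function names are hypothetical and the read counts and gene lengths are invented toy values chosen only to make the arithmetic easy to follow, not data from the paper.

```python
from typing import List, Tuple

Gene = Tuple[int, int]  # (read count, coding-sequence length in bp)


def reads_per_kb(read_count: int, cds_length_bp: int) -> float:
    """Read count normalized by coding-sequence length in kilobases."""
    return read_count / (cds_length_bp / 1000.0)


def mean_reads_per_kb(genes: List[Gene]) -> float:
    """Average reads-per-kilobase over a set of genes."""
    values = [reads_per_kb(reads, length) for reads, length in genes]
    return sum(values) / len(values)


# Invented toy values (not from the paper).
yac_genes = [(200, 1000), (300, 1500)]    # 200 and 200 reads per kb
host_genes = [(1000, 2000), (500, 1000)]  # 500 and 500 reads per kb
ratio = mean_reads_per_kb(yac_genes) / mean_reads_per_kb(host_genes)
```

Length normalization matters here because a long gene collects more reads than a short one at the same expression level.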

Remodeling

Molecular Cell: Unifying Model for Nucleosome Positioning

Zhang et al. compared nucleosome positioning generated by ATP-dependent extracts with the nucleosome positions measured from yeast lysed without crosslinking and allowed to redistribute prior to crosslinking. Indeed, we find mediocre correspondence between the "native" nucleosome positions from Zhang et al. and true in vivo nucleosome positions generated from crosslinked yeast (Figure S7), so the ability of whole-cell extracts to recover these "native" positions in the absence of transcription does not have any bearing on the question of whether in vivo positioning is influenced by transcription prior to lysis of cells. Although nucleosome remodelers can generate somewhat positioned nucleosomes flanking the NDR and unquestionably perform far better than salt dialysis, they apparently are insufficient to generate the precise in vivo nucleosome positions, particularly for the +1 nucleosome (Figure S7).

Here, the strong, and species-specific, spacing relationship between the +1 nucleosome and mRNA start site that is observed both in the native and YAC strains indicates that there is a mechanistic connection between transcriptional initiation and the location of the +1 nucleosome. Given the strong in vivo positioning of both the preinitiation complex and the +1 nucleosome, a spacing relationship between these two entities requires that at least one of these is anchored to a specific location, thereby permitting a defined location for the second entity. As discussed above, nucleosome remodeling complexes alone are insufficient to generate proper positioning of the +1 nucleosome, and hence sequence and nucleosome remodelers are insufficient to provide an anchor. In contrast, preinitiation complexes bound at core promoters are clearly sufficient to provide an anchor, with the location of the TBP bound to the TATA element or TATA-related sequence being the major determinant of the anchor point. From these considerations, and our finding that the TSS to +1 distance in YACs shifts to the S. cerevisiae spacing (Figure 5 and Figure S5), we suggest that the preinitiation complex plays a role in fine-tuning the position of the +1 nucleosome.

In the third step, positioning of downstream nucleosomes, with progressively less positioned nucleosomes downstream within the gene, depends on transcriptional elongation, and hence recruitment of nucleosome-remodeling activities and histone chaperones by the elongating RNA polymerase II machinery. This elongation-dependent step explains why nucleosome-remodeling complexes, though capable of weakly positioning nucleosomes flanking the NDR, are unable to position more downstream nucleosomes (Zhang et al., 2011). Conversely, yeast mutant strains lacking nucleosome-remodeling complexes (Chd1 and Isw1) that are recruited to coding regions by elongating RNA polymerase show drastically reduced positioning of downstream nucleosomes but relatively normal positioning of the +1 and +2 nucleosomes (Gkikopoulos et al., 2011). Finally, a transcription-based step nicely helps to explain why nucleosome arrays occur largely in the transcribed direction even though highly positioned nucleosomes can occur both at the +1 and −1 position, as well as the curious observation that the decay of nucleosome positioning toward the center of genes displays a 5′/3′ asymmetry (Vaillant et al., 2010); both of these observations are inconsistent with a pure packing-based model.

The above model can explain why the general pattern of nucleosome positioning is highly conserved among eukaryotes yet shows species-specific differences in various aspects of chromatin structure. These species-specific differences reflect the relative utilization of poly(dA:dT) sequences and hence intrinsic histone-DNA interactions, as well as differences in the enzymatic and recruitment properties of the nucleosome remodelers.

EXPERIMENTAL PROCEDURES

Growth Conditions

All cultures were grown in medium containing the following: SC –tryptophan –uracil (Sunrise Sciences) (0.2%), yeast extract (1.5%), peptone (1%), dextrose (2%), and adenine (0.01%), as previously described (Tsankov et al., 2010).

Preparation of YACs

Yeast chromosomal DNA was prepared in InCert agarose blocks (LONZA), with a final cell concentration of 2 × 10^9 cells/ml. Agarose blocks with intact chromosomal DNA were subjected to EcoRI partial digestion with a titrated Mg2+ concentration, followed by size fractionation using pulsed field gel electrophoresis (PFGE). Partially digested DNA fragments (~100–200 kb) were excised from the gel. YAC vector pYAC4 was purified by successive CsCl gradient ultracentrifugation and digested with BamHI and EcoRI, followed by calf intestine alkaline phosphatase treatment. Digested pYAC4 and partially digested yeast chromosomal fragments were ligated by T4

Figure 6. Three-Step Model for Establishment of Nucleosome Positioning In Vivo. A unifying three-step model for how nucleosome positioning pattern is generated in eukaryotic organisms. The first step is the generation of an NDR, either by poly(dA:dT) elements and/or by transcription factors and their recruited nucleosome remodeling complexes. In the second step, nucleosome-remodeling complexes recognize the NDRs and generate highly positioned nucleosomes flanking the NDR; and the RNA polymerase II preinitiation complex fine-tunes the position of the +1 nucleosome. In the final step, positioning of the more downstream nucleosomes depends on transcriptional elongation and the recruitment of nucleosome-remodeling activities and histone chaperones by the elongating RNA polymerase II machinery.

Hughes et al., Mol Cell 48:5 (2012)

ChIP-seq

1. ChIP = chromatin immunoprecipitation: fix the DNA-protein bonds (transcription factor or histone)
2. fragmentation and filtering (to keep the pieces carrying DNA + protein)
3. sequencing of the pieces

ChIP–seq the genome coverage is not limited by the repertoire of probe sequences fixed on the array. This is particularly important for the analysis of repetitive regions of the genome, which are typically masked out on arrays. Studies involving heterochromatin or microsatellites, for instance, can be done much more effectively by ChIP–seq. Sequence variations within repeat elements can be captured by sequencing and used to map reads to the genome; unique sequences that flank repeats are also helpful in aligning the reads to the genome. For example, only 48% of the human genome is non-repetitive, but 80% is mappable with 30 bp reads and 89% is mappable with 70 bp reads38.

All profiling technologies produce unwanted artefacts, and ChIP–seq is no exception. Although sequencing errors have been reduced substantially as the technology has improved, they are still present, especially towards the end of each read. This problem can be ameliorated by improvements in alignment algorithms (see below) and computational analysis. There is also bias towards GC-rich content in fragment selection, both in library preparation and in amplification before and during sequencing14,39, although notable improvements have been made recently. In addition, when an insufficient number of reads is generated, there is loss of sensitivity or specificity in detection of enriched regions. There are also technical issues in performing the experiment, such as loading the correct amount of sample: too little sample will result in too few tags; too much sample will result in fluorescent labels that are too close to one another, and therefore lower quality data.

However, the main disadvantage with ChIP–seq is its current cost and availability. Several groups have successfully developed and applied their own protocols for library construction, which has lowered that cost substantially. But the overall cost of ChIP–seq, which includes machine depreciation and reagent cost, will have to be lowered further for it to be comparable with the cost of ChIP–chip in every case. For high-resolution profiling of an entire large genome, ChIP–seq is already less expensive than ChIP–chip, but depending on the genome size and the depth of sequencing needed, a ChIP–chip experiment on carefully selected regions using a customized microarray may yield as much biological understanding. The recent decrease in sequencing cost per base pair has not affected ChIP–seq as substantially as other applications, as the decrease has come as much from increased read lengths as from the number of sequenced fragments. The gain in the fraction of reads that can be uniquely aligned to the genome decreases noticeably after ~25–35 bp and is marginal beyond 70–100 nucleotides40. However, as the cost of sequencing continues to decline and institutional support for sequencing platforms continues to grow, ChIP–seq is likely to become the method of choice for nearly all ChIP experiments in the near future.

Issues in experimental design

Antibody quality. The value of any ChIP data, including ChIP–seq data, depends crucially on the quality of the antibody used. A sensitive and specific antibody will give a high level of enrichment compared with the background, which makes it easier to detect binding events. Many antibodies are commercially available, and some are noted as ChIP grade, but the quality of different antibodies is highly variable and can also vary among batches

Heterochromatin: A region of highly compact chromatin. Constitutive heterochromatin is largely composed of repetitive DNA.

Figure 1 | Overview of a ChIP–seq experiment. Using chromatin immunoprecipitation (ChIP) followed by massively parallel sequencing, the specific DNA sites that interact with transcription factors or other chromatin-associated proteins (non-histone ChIP) and sites that correspond to modified nucleosomes (histone ChIP) can be profiled. The ChIP process enriches the crosslinked proteins or modified nucleosomes of interest using an antibody specific to the protein or the histone modification. Purified DNA can be sequenced on any of the next-generation platforms12. The basic concepts are similar for different platforms: common adaptors are ligated to the ChIP DNA and clonally clustered amplicons are generated. The sequencing step involves the enzyme-driven extension of all templates in parallel. After each extension, the fluorescent labels that have been incorporated are detected through high-resolution imaging. On the Illumina Solexa Genome Analyzer (bottom left), clusters of clonal sequences are generated by bridge PCR, and sequencing is performed by sequencing-by-synthesis. On the Roche 454 and Applied Biosystems (ABI) SOLiD platforms (bottom middle), clonal sequencing features are generated by emulsion PCR and amplicons are captured on the surface of micrometre-scale beads. Beads with amplicons are then recovered and immobilized to a planar substrate to be sequenced by pyrosequencing (for the 454 platform) or by DNA ligase-driven synthesis (for the SOLiD platform). On single-molecule sequencing platforms such as the HeliScope by Helicos (bottom right), fluorescent nucleotides incorporated into templates can be imaged at the level of single molecules, which makes clonal amplification unnecessary. (Panel labels: sample fragmentation; immunoprecipitation; non-histone ChIP, histone ChIP; DNA purification; end repair and adaptor ligation; polyA tailing; cluster generation (bridge PCR); amplification on beads (emulsion PCR); Illumina: sequencing with reversible terminators; Roche: pyrosequencing; ABI: sequencing by ligation; Helicos: single-molecule sequencing with reversible terminators; sequence reads.)

Park, Nat Rev Genet 10:669 (2009)
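The mappability figures quoted in this excerpt (a larger share of the genome becomes mappable as reads get longer) can be illustrated on a toy sequence by counting which read-length windows occur exactly once. This is a deliberately simplified sketch: it ignores the reverse strand, mismatches, and real aligner behaviour, and the function name and toy sequence are hypothetical.

```python
from collections import Counter


def unique_kmer_fraction(genome: str, k: int) -> float:
    """Fraction of start positions whose k-mer occurs exactly once in `genome`.

    Simplification: only the forward strand and exact matches are considered.
    """
    if k <= 0 or k > len(genome):
        raise ValueError("k must be between 1 and the genome length")
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)


# Toy sequence containing a short repeat (ACGT occurs twice).
toy_genome = "ACGTACGTTTGA"
short_read_fraction = unique_kmer_fraction(toy_genome, 2)
long_read_fraction = unique_kmer_fraction(toy_genome, 6)
```

On this toy sequence the longer window resolves the repeat, which mirrors the trend described in the text without reproducing its percentages.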

ChIP-exo: 1 bp resolution


(λ exonuclease removes the 5′ end precisely)

Figure 1. Single Base-Pair Resolution of ChIP-exo (A) Illustration of the ChIP-exo method. ChIP DNA is treated with 5′ to 3′ exonuclease while still present within the immunoprecipitate. The 5′ ends of the digested DNA are concentrated at a fixed distance from the sites of crosslinking and are detected by deep sequencing (see also Figure S1). (B) Comparison of ChIP-exo to ChIP-chip and ChIP-seq for Reb1 at specific loci. The gray, green, and magenta filled plots, respectively, show the distribution of raw signals, measured by ChIP-chip using Affymetrix microarrays having 5 bp probe spacing (Venters and Pugh, 2009), ChIP-seq, and ChIP-exo. Sequencing tags on each strand were shifted toward the 3′ direction by 14 bp so as to maximize opposite-strand overlap. (C) Aggregated raw Reb1 signal distribution around all 791 instances of TTACCCG in the yeast genome. The ChIP-seq and ChIP-exo datasets included 2,938,677 and 2,920,571 uniquely aligned tags, respectively. See also Figure S1 and Table S1.

Rhee & Pugh Cell 147 :1408 (2012)

RESULTS

ChIP-exo Design
We considered the possibility that a protein covalently cross-linked to DNA would block strand-specific 5′-3′ degradation by lambda (λ) exonuclease (Figure 1A), thereby creating a homogeneous 5′ border at a fixed distance from the bound protein. DNA sequences 3′ to the exonuclease block remain intact and are sufficiently long to uniquely map to a reference genome, after identification by deep sequencing (Figure S1A available online). Uncrosslinked nonspecific DNA is largely eliminated by exonuclease treatment, as evidenced by the repeated failure to generate a ChIP-exo library from a negative control BY4741 strain.

ChIP-exo Improves Genome-wide Mapping Accuracy and Sensitivity
We initially focused on the yeast Reb1 protein, which has a clear DNA recognition site (TTACCCG) that can be used for independent validation (Badis et al., 2008; Harbison et al., 2004). Reb1 is involved in many aspects of transcriptional regulation by all three yeast RNA polymerases and promotes formation of nucleosome-free regions (NFRs) (Hartley and Madhani, 2009; Raisner et al., 2005). It is also found at telomeres. We compared ChIP-exo to ChIP-chip and standard sonication-based ChIP-seq. The unfiltered ChIP-exo signal was highly focused across the genome at TTACCCG sequences (Figures 1B and 1C). ChIP-chip and ChIP-seq displayed broader signals. When converted to peak-pair calls (described below), ChIP-exo displayed a standard deviation (SD) of 0.3 bp (Figure S1B), which indicates that ChIP-exo of Reb1 has single-base accuracy. In comparison, ChIP-seq displayed more than 90-fold greater mapping variability (SD = 24 bp). ChIP-exo also displayed lower raw background. The raw signal-to-noise ranged from 300- to 2800-fold (Table S1). Subsequent employment of noise filters produced a comprehensive set of bound locations. In contrast, ChIP-chip and ChIP-seq had 7- and 80-fold raw signal-to-noise, respectively. ChIP-exo retained its quantitative properties, in that occupancy levels correlated with those from ChIP-seq (Figure S1C), and peak-pair intensities correlated (Figure 2A).

Reb1 Has Multiple Highly Organized Secondary Interactions at Promoters
The 5′ ends of ChIP-exo tags (as well as peaks) located on one strand were largely at a fixed distance (~27 bp) from another tag or peak on the other strand, corresponding to the two exonuclease barriers formed by Reb1 (Figures 2A, S2A, and S2B). A total of 1,776 Reb1 peak pairs were identified (Data S1). Importantly, these peak pairs were not preselected based upon the presence of any DNA sequence motif, although a motif was present in nearly all cases.
Of the peak pairs, 60% (1,058/1,776) were classified as primary locations, and 40% (718/1,776) as secondary. Secondary locations were defined as less-occupied locations within 100 bp of a more-occupied location. Thus, most Reb1 locations were found in clusters. Nearly all (92%) primary locations contained the TTACCCG Reb1 recognition site or a single-nucleotide variant centered between its borders (Figures 2A, 2B, and S2C). Increased deviations from TTACCCG

Cell 147, 1408–1419, December 9, 2011 © 2011 Elsevier Inc. 1409
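The peak-pair idea in the excerpt above (a 5′ border on one strand matched with a border on the opposite strand at a roughly fixed distance) can be illustrated with a small sketch. This is not the authors' pipeline: the function name `pair_peaks`, the `max_gap` cutoff and the convention that the forward-strand border lies upstream of the reverse-strand border are all assumptions made for illustration.

```python
from bisect import bisect_left

def pair_peaks(forward_peaks, reverse_peaks, max_gap=100):
    """Pair each forward-strand peak with the nearest reverse-strand peak
    located at or downstream of it, within max_gap bp (hypothetical cutoff).

    Returns a list of (forward_pos, reverse_pos, midpoint) tuples; the
    midpoint is a simple estimate of the center of the bound site.
    """
    reverse_sorted = sorted(reverse_peaks)
    pairs = []
    for f in sorted(forward_peaks):
        # Index of the first reverse-strand peak at position >= f.
        i = bisect_left(reverse_sorted, f)
        if i < len(reverse_sorted) and reverse_sorted[i] - f <= max_gap:
            r = reverse_sorted[i]
            pairs.append((f, r, (f + r) / 2))
    return pairs
```

With borders about 27 bp apart, as reported for Reb1, a call such as `pair_peaks([100, 500], [127, 900])` pairs 100 with 127 and leaves 500 unpaired because its nearest partner is too far away.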

Data analysis

Poisson model
A probability distribution that is often used to model the number of random events in a fixed interval. Given an average number of events in the interval, the probability of a given number of occurrences can be calculated.

alignment, as all subsequent results are based on the aligned reads. Owing to the large number of reads, the use of conventional alignment algorithms can take hundreds or thousands of processor hours; therefore, a new generation of aligners has been developed [57], and more are expected soon. Every aligner is a balance between accuracy, speed, memory and flexibility, and no aligner can be best suited for all applications. Alignment for ChIP–seq should allow for a small number of mismatches due to sequencing errors, SNPs and indels or the difference between the genome of interest and the reference genome. This is simpler than in RNA–seq, for example, in which large gaps corresponding to introns must be considered. Popular aligners include: Eland, an efficient and fast aligner for short reads that was developed by Illumina and is the default aligner on that platform; Mapping and Assembly with Qualities (MAQ) [58], a widely used aligner with a more exhaustive algorithm and excellent capabilities for detecting SNPs; and Bowtie [59], an extremely fast mapper that is based on an algorithm that was originally developed for file compression. These methods use the quality score that accompanies each base call to indicate its reliability. For the SOLiD di-base sequencing technology, in which two consecutive bases are read at a time, modified aligners have been developed [60,61]. Many current analysis pipelines discard non-unique tags, but studies involving the repetitive regions of the genome [27,62–64] require careful handling of these non-unique tags.

Identification of enriched regions. After sequenced reads are aligned to the genome, the next step is to identify regions that are enriched in the ChIP sample relative to the control with statistical significance. Several ‘peak callers’ that scan along the genome to identify the enriched regions are currently available [24,26,38,48,65–70]. In early algorithms, regions were scored by the number of tags in a window of a given size and then assessed by a set of criteria based on factors such as enrichment over the control and minimum tag density. Subsequent algorithms take advantage of the directionality of the reads [71]. As shown in FIG. 5, the fragments are sequenced at the 5′ end, and the locations of mapped reads should form two distributions, one on the positive strand and the other on the negative strand, with a consistent distance between the peaks of the distributions. In these methods, a smoothed profile of each strand is constructed [65,72] and the combined profile is calculated either by shifting each distribution towards the centre or by extending each mapped position into an appropriately oriented ‘fragment’ and then adding the fragments together. The latter approach should result in a more accurate profile with respect to the width of the binding, but it requires an estimate of the fragment size as well as the assumption that fragment size is uniform.
Given a combined profile, peaks can be scored in several ways. A simple fold ratio of the signal for the ChIP sample relative to that of the control sample around the peak (FIG. 3B) provides important information, but it is not adequate. A fold ratio of 5 estimated from 50 and 10 tags (from the ChIP and control experiments, respectively) has a different statistical significance to the same ratio estimated from, for example, 500 and 100 tags. A Poisson model for the tag distribution is an effective approach that accounts for the ratio as well as the absolute tag numbers [27], and it can also be modified to account for regional bias in tag density due to the chromatin structure, copy number variation or amplification bias [67].

Figure 5 | Strand-specific profiles at enriched sites. DNA fragments from a chromatin immunoprecipitation experiment are sequenced from the 5′ end. Therefore, the alignment of these tags to the genome results in two peaks (one on each strand) that flank the binding location of the protein or nucleosome of interest. This strand-specific pattern can be used for the optimal detection of enriched regions. To create an approximate distribution of all fragments, each tag location can be extended by an estimated fragment size in the appropriate orientation and the number of fragments can be counted at each position.
[Figure 5 panel labels: protein or nucleosome of interest; positive strand; negative strand; 5′ ends of fragments are sequenced; short reads are aligned to the reference genome; distribution of tags is computed; profile is generated from combined tags (for example, each mapped location is extended with a fragment of estimated size); fragments are added.]

threshold number of tags covering them to define an enriched sequence region. Although this can be effective for highly defined point source factors with strong ChIP enrichment, it is not satisfactory overall because of inherent complexities of the signals as well as experimental noise and/or artifacts. Additional information present in the data is now used to help discriminate true positive signals from various artifacts. For example, the strand-specific structure of the tag distribution is useful to discriminate the punctate class of binding events from a variety of artifacts [9]. Because immunoprecipitated DNA fragments are typically sequenced as single-ended reads, that is, from one of the two strands in the 5′ to 3′ direction, the tags are expected to come on average equally frequently from each strand, thus giving rise to two related distributions of stranded reads. The corresponding individual strand distributions will occur upstream and downstream, shifted from the source point (‘summit’) by half the average sequenced fragment length, which is typically referred to as the ‘shift’ (Fig. 4a). Note that the average observed fragment length can differ considerably from the ‘expected’ fragment length derived from agarose gel cuts made during Illumina library preparation; short fragments are further favored by Illumina’s solid-state PCR. For this reason, the shift is now mainly determined computationally from the data rather than imposed from the molecular biology protocol. The shift will be smaller and the two strand distributions will come closer together in experiments in which the fragment length, read-length and recognition site length converge.

Building a signal profile. The signal profile is a smoothing of the tag counts to allow reliable region identification and better summit resolution. The simplest way to define a signal profile is to slide a window of fixed width across the genome, replacing the tag count at each site with the summed value within the window centered at the site. Consecutive windows exceeding a threshold value are merged. This is what cisGenome [10] does. SiSSRs [11] and spp [12] count tags within a window in a strand-specific fashion. Other programs also use sliding window scans but compute various modified signal values. The program MACS [13] performs a window scan but only after shifting the tag data in a strand-specific fashion to account for the fragment length. F-Seq [14] performs kernel density estimation with a Gaussian kernel. QuEST [9] creates separate kernel density estimation profiles for the two strands. SICER [15] computes probability scores in nonoverlapping windows, then aggregates windows into ‘islands’ of subthreshold windows separated by gaps in order to capture broad enrichment regions. An alternate approach is to extend the ChIP-seq tags along their strand direction (called an ‘XSET’) and to count overlaps above a threshold as peak regions [16]. Tag extension before signal calculation serves the dual purpose of correcting for the assumed fragment length and also smoothing over gaps that were not tagged because of low sampling or read mappability

Figure 3 | ChIP-seq peak calling subtasks. A signal profile of aligned reads that takes on a value at each base pair is formed via a census algorithm, for example, counting the number of reads overlapping each base pair along the genome (upper left plot; ‘+’ strand reads in blue, ‘–’ strand reads in red, combined distribution after shifting the ‘+’ and ‘–’ reads toward the center by the read shift value in purple). If experimental control data are available (brown), the same processing steps are applied to form a background profile (top right); otherwise, a random genomic background may be assumed. The signal and background profiles are compared in order to define regions of enrichment. Finally, peaks are filtered to reduce false positives and ranked according to relative strength or statistical significance. Bottom left, P(s), probability of observing a location with s reads covering it. The bars represent the control data distribution. A hypothetical Poisson distribution fit is shown with s_thresh indicating a cutoff above which a ChIP-seq peak might be considered significant. Bottom right, schematic representation of two types of artifactual peaks: single strand peaks and peaks formed by multiple occurrences of only one or a few reads.
[Figure 3 panel labels: generate signal profile along each chromosome (ChIP data, tag shift); define background (model or data; control data); identify peaks in ChIP signal (peak region, enrichment relative to background; peak identification can be performed on either profile); assess significance; filter artifacts. Axes: tag count versus position (bp); P(s) versus tags s.]

Park, Nat Rev Genet 10:669 (2009); Pepke et al., Nat Methods 6:S22 (2009)
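The fragment-extension way of building a combined profile, described in the Figure 5 caption and in the tag-extension passage, can be sketched as follows. This is a minimal illustration, not the method of any named package: the function name `extended_profile`, the 0-based coordinates and the uniform `fragment_size` are assumptions.

```python
def extended_profile(tags, fragment_size, length):
    """Build a per-base coverage profile by extending each tag into a fragment.

    tags: iterable of (position, strand), where position is the 0-based
          5' end of the read and strand is '+' or '-'.
    fragment_size: assumed uniform fragment length in bp.
    length: length of the chromosome (or region) in bp.
    """
    # Difference array: +1 where a fragment starts, -1 just past its end.
    diff = [0] * (length + 1)
    for pos, strand in tags:
        if strand == '+':
            start, end = pos, pos + fragment_size          # extend downstream
        else:
            start, end = pos - fragment_size + 1, pos + 1  # extend upstream
        start, end = max(start, 0), min(end, length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    # A running sum of the difference array gives fragments covering each base.
    profile, running = [], 0
    for i in range(length):
        running += diff[i]
        profile.append(running)
    return profile
```

A '+' tag at position 10 and a '–' tag at position 29, each extended by 20 bp, both cover bases 10 to 29, so the combined profile peaks between the two strand-specific tag positions, which is where the binding site is expected.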

NATURE METHODS SUPPLEMENT | VOL.6 NO.11s | NOVEMBER 2009 | S25
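The remark that a fold ratio of 5 means something different for 50 versus 10 tags than for 500 versus 100 tags can be checked numerically with a Poisson upper tail. The sketch below is a simplification under stated assumptions: the control tag count is used directly as the Poisson mean (no scaling for sequencing depth or regional bias), and the function name `log10_poisson_upper_tail` is hypothetical.

```python
import math

def log10_poisson_upper_tail(k, lam, max_terms=10000):
    """Return log10 of P(X >= k) for X ~ Poisson(lam), summed in log space."""
    log_terms = []
    for i in range(k, k + max_terms):
        # log of the Poisson probability mass at i
        log_terms.append(-lam + i * math.log(lam) - math.lgamma(i + 1))
        # Past the mode the terms shrink; stop once they are negligible.
        if i > lam and log_terms[-1] < log_terms[0] - 50:
            break
    m = max(log_terms)
    log_sum = m + math.log(sum(math.exp(t - m) for t in log_terms))
    return log_sum / math.log(10)

# Same fold ratio of 5, very different significance:
small = log10_poisson_upper_tail(50, 10)     # 50 ChIP tags vs 10 control tags
large = log10_poisson_upper_tail(500, 100)   # 500 ChIP tags vs 100 control tags
```

Under this simplified model the second p-value is smaller than the first by many orders of magnitude, which is why absolute tag numbers matter in addition to the ratio.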

also informative, as this ratio corresponds to the fraction of nucleosomes with the particular modification at that location, averaged over all the cells assayed.
One of the difficulties in conducting a ChIP–seq control experiment is the large amount of sequencing that may be necessary. For input DNA and bulk nucleosomes, many of the sequenced tags are spread evenly across the genome. To obtain accurate estimates throughout the genome, sufficient numbers of tags are needed at each point; otherwise fold enrichment at the peaks will result in large errors due to sampling bias. Therefore, the total number of tags to be sequenced is potentially very large. Alternatively, it is possible to avoid sequencing a control sample if one is only interested in differential binding patterns between conditions or time points and if the variation in chromatin preparations is small.

Depth of sequencing. One crucial difference between ChIP–chip and ChIP–seq is that the number of tiling arrays that is used in a ChIP–chip experiment is fixed regardless of the protein or modification of interest, whereas the number of fragments that is sequenced in a ChIP–seq experiment is determined by the investigator. In published ChIP–seq experiments, a single lane of the Illumina Genome Analyzer was the basic unit of sequencing. When it was introduced, a single lane generated 4–6 million reads before alignment but, owing to improvements in the system, a single lane now generates 8–15 million reads or more. Given the cost of each experiment, many early data sets contained reads from a single lane regardless of what the specific experiment was. Intuitively, one expects that when a large number of binding sites are present in the genome for a DNA-binding protein or when a histone modification covers a large fraction of the genome, a correspondingly large number of tags will be needed to cover each bound region at the same tag density. One reasonable criterion for determining sufficient sequencing depth would be that the results of a given analysis do not change when more reads are obtained. In terms of the number of binding sites, this criterion translates to the presence of a ‘saturation point’ after which no further binding sites are discovered with additional reads.
The issue of saturation points has been examined in a recent paper through simulation studies [48]. In three example data sets, a reference set of sites was generated based on the full set of sequencing reads in each case. Then, a wide range of different read counts was sampled from the complete data set, with multiple random selections for each sample size. Binding sites were determined for each sample with a threshold probability (p value), and the results for each sample size were averaged. The fraction of the reference set that was recovered as a function of the number of reads is shown in FIG. 3A. If there was a saturation point, the number of sites found would increase up to a certain point and then plateau, which would indicate that the rate at which new sites were being discovered had slowed down to the point where any further increase in the number of reads would be inefficient at yielding new sites. When the simulation was performed, however, the results indicated that

Tracks for TFs and modified histones
CTCF (an insulator element): sharp peaks
RNA polymerase: sharp peak + region of enrichment
histone H3 modification (associated with gene elongation): broadened peaks
binding sites at the start of transcription; histone modification indicates specific activity

Figure 2 | ChIP profiles. a | Examples of the profiles generated by chromatin immunoprecipitation followed by sequencing (ChIP–seq) or by microarray (ChIP–chip). Shown is a section of the binding profiles of the chromodomain protein Chromator, as measured by ChIP–chip (unlogged intensity ratio; blue) and ChIP–seq (tag density; red) in the Drosophila melanogaster S2 cell line. The tag density profile obtained by ChIP–seq reveals specific positions of Chromator binding with higher spatial resolution and sensitivity. The ChIP–seq input DNA (control experiment) tag density is shown in grey for comparison. b | Examples of different types of ChIP–seq tag density profiles in human T cells. Profiles for different types of proteins and histone marks can have different types of features, such as: sharp binding sites, as shown for the insulator binding protein CTCF (CCCTC-binding factor; red); a mixture of shapes, as shown for RNA polymerase II (orange), which has a sharp peak followed by a broad region of enrichment; medium size broad peaks, as shown for histone H3 trimethylated at lysine 36 (H3K36me3; green), which is associated with transcription elongation over the gene; or large domains, as shown for histone H3 trimethylated at lysine 27 (H3K27me3; blue), which is a repressive mark that is indicative of Polycomb-mediated silencing. BPIL2, bactericidal/permeability-increasing protein-like 2; FBXO7, F box only 7; NPC1, Niemann-Pick disease, type C1; Pros35, proteasome 35 kDa subunit; SYN3, synapsin III. Data for part b are from REF. 25.
[Figure 2 panel labels: a, tracks ChIP–chip, ChIP–seq and ChIP–seq input DNA over genes Pros35, CG4908, eEF1, NPC1, CG5708 and CG5694 (positions 10,220,000 to 10,230,000); b, tracks CTCF, RNA polymerase II, H3K36me3 and H3K27me3 over genes FBXO7, BPIL2 and SYN3 (positions 31,200,000 to 31,260,000).]

Park, Nat Rev Genet 10:669 (2009)

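The saturation experiment described in the depth-of-sequencing passage (call a reference set on all reads, subsample at several depths, average the fraction of reference sites recovered) can be sketched as below. The site caller here is a deliberately crude stand-in, a fixed tag-count threshold per bin rather than the p-value threshold used in the cited study; `call_sites`, `saturation_curve`, `bin_size` and `min_tags` are hypothetical names and parameters.

```python
import random
from collections import Counter

def call_sites(reads, bin_size=100, min_tags=5):
    """Toy site caller: a bin is a 'site' if it holds at least min_tags reads."""
    counts = Counter(pos // bin_size for pos in reads)
    return {b for b, n in counts.items() if n >= min_tags}

def saturation_curve(reads, sample_sizes, repeats=5, seed=0):
    """Average fraction of reference sites recovered at each subsample size."""
    rng = random.Random(seed)
    reference = call_sites(reads)          # sites called on the full data set
    curve = []
    for n in sample_sizes:
        fractions = []
        for _ in range(repeats):
            sample = rng.sample(reads, n)  # random selection without replacement
            found = call_sites(sample)
            recovered = len(found & reference) / len(reference) if reference else 0.0
            fractions.append(recovered)
        curve.append((n, sum(fractions) / repeats))
    return curve
```

A curve that flattens before the full read count is reached would suggest a saturation point under the chosen threshold; a curve that keeps rising would suggest that additional reads are still yielding new sites.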

Tracks
sharp peaks (50–100 bp): specific binding site
peak + region of enrichment (1000s of bp)
broad regions

Figure 2 | ChIP-seq peak types from various experiments. (a–c) Data shown are from remapping of a previously published human ChIP-seq dataset [7]. Proteins that bind DNA in a site-specific fashion, such as CTCF, form narrow peaks hundreds of base pairs wide (a). The difference of plus and minus read counts is generally expected to cross zero near the signal source, the source in this example being the CTCF motif indicated in red. Signal from enzymes such as RNA polymerase II may show enrichment over regions up to a few kilobases in length (b). Experiments that probe larger-scale chromatin structure such as the repressive mark for H3K27me3 may yield very broad ‘above’-background regions spanning several hundred kilobases (c). Signals are plotted on a normalized read per million (RPM) basis.
[Figure 2 panel labels: a, CTCF motif, Watson (+) reads minus Crick (−) reads (RPM) and total reads (RPM), scale bar 50 bp; b, RNA polymerase II, same two tracks, scale bar 500 bp, RefSeq gene ZFP36; c, H3K36me3 and H3K27me3, total reads (RPM), scale bar 100,000 bp, RefSeq genes OLIG2 and OLIG1. Horizontal axis: position (bp).]

the workup). The current algorithms have each been designed to ignore a variety of false positive read-tag aggregations that are judged unlikely to be due to immuno-enriched factor binding, but they are not identical, and users should expect different packages and different parameters to eliminate some overlapping and some novel tag patterns as background.

Classes of ChIP-seq signals. Consistent with previous ChIP-chip results, ChIP-seq tag enrichments or ‘peaks’ generated by typical experimental protocols can be classified into three major categories: punctate regions covering a few hundred base pairs or less; localized but broader regions of up to a few kilobases; and broad regions up to several hundred kilobases. Punctate enrichment is a signature of a classic sequence-specific transcription factor such as NRSF or CTCF binding to its cognate DNA sequence motif (Fig. 2a). A mixture of punctate and broader signals is associated with proteins such as RNA polymerase II that bind strongly to specific transcription start sites in active and stalled promoters (in a punctate fashion), but RNA polymerase II signals can also be detected more diffusely over the body of actively transcribed genes [5,6] (Fig. 2b). ChIP-seq signals that come from most histone marks and other chromatin domain signatures are not point sources as described above but range from nucleosome-sized domains to very broad enriched regions that lack a single source entirely such as histone H3 Lys27 trimethylation (H3K27me3) in repressed areas [7,8] (Fig. 2c).
These different categories of ChIP enrichment have distinct characteristics that algorithms can use to predict true signals optimally. Punctate events offer the greatest amount of discriminatory detail to model the source point down to the nucleotide level. To date, most algorithms have been developed and tuned for this class of binding, though specific packages can work reasonably well for mixed binding, typically requiring the use of nondefault parameters.

Peak-finders, regions, summits and sources. The first step in analyzing ChIP-seq data is to identify regions of increased sequence read tag density along the chromosome relative to measured or estimated background. After these ‘regions’ are identified, processing ensues to identify the most likely source point(s) of cross-linking and inferred binding (called ‘sources’). The source is related, but not identical, to the ‘summit’, which is the local maximum read density in each region. When there is no single point source of cross-linking, as for some dispersed chromatin marks, the region-aggregation step is appropriate but the ‘summit-finding’ step is not. Software packages for ChIP-seq are generically and somewhat vaguely called ‘peak finders’. They can be conceptually subdivided into the following basic components: (i) a signal profile definition along each chromosome, (ii) a background model, (iii) peak call criteria, (iv) post-call filtering of artifactual peaks and (v) significance ranking of called peaks (Fig. 3). Components of 12 published software packages are summarized in Table 1.
The simplest approach for calling enriched regions in ChIP-seq data is to take a direct census of mapped tag sites along the genome and allow every contiguous set of base pairs with more than a

Pepke et al., Nat Methods 6:S22 (2009)

S24 | VOL.6 NO.11s | NOVEMBER 2009 | NATURE METHODS SUPPLEMENT
judged unlikely to be due to immuno-enriched factor binding, but They can be conceptually subdivided into the following basic com- they are not identical, and users should expect different packages ponents: (i) a signal profile definition along each chromosome, (ii) and different parameters to eliminate some overlapping and some a background model, (iii) peak call criteria, (iv) post-call filtering of novel tag patterns as background. artifactual peaks and (v) significance ranking of called peaks (Fig. 3). Components of 12 published software packages are summarized in Classes of ChIP-seq signals. Consistent with previous ChIP-chip Table 1. results, ChIP-seq tag enrichments or ‘peaks’ generated by typical The simplest approach for calling enriched regions in ChIP-seq experimental protocols can be classified into three major categories: data is to take a direct census of mapped tag sites along the genome punctate regions covering a few hundred base pairs or less; localized and allow every contiguous set of base pairs with more than a
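The distinction between regions, summits and sources can be made concrete with a small sketch. The Python below is only an illustration, not the method of any of the published peak finders: the function names, the fixed `threshold` and the toy per-position counts are all assumptions chosen for the example. It scales counts to RPM, calls regions as contiguous runs above the threshold, takes the summit as the position of maximum density, and estimates the source as the point where the plus-minus strand difference crosses zero.

```python
def to_rpm(counts, total_mapped_reads):
    """Scale raw per-position tag counts to reads per million (RPM)."""
    scale = 1_000_000 / total_mapped_reads
    return [c * scale for c in counts]


def call_regions(signal, threshold):
    """Return half-open (start, end) runs where signal > threshold.

    The fixed threshold is a hypothetical stand-in for a background model.
    """
    regions = []
    start = None
    for pos, value in enumerate(signal):
        if value > threshold and start is None:
            start = pos                      # a run begins
        elif value <= threshold and start is not None:
            regions.append((start, pos))     # a run ends
            start = None
    if start is not None:
        regions.append((start, len(signal)))
    return regions


def summit(signal, region):
    """Position of maximum signal inside the region (first one on ties)."""
    start, end = region
    return max(range(start, end), key=lambda p: signal[p])


def source_estimate(plus, minus, region):
    """First position where (plus - minus) goes from positive to <= 0."""
    start, end = region
    prev = plus[start] - minus[start]
    for pos in range(start + 1, end):
        diff = plus[pos] - minus[pos]
        if prev > 0 and diff <= 0:
            return pos
        prev = diff
    return None  # no zero crossing, e.g. a dispersed chromatin mark


# Toy data (invented): plus-strand tags pile up left of the site,
# minus-strand tags pile up right of it.
plus = [0, 0, 1, 3, 5, 3, 1, 0, 0, 0, 0, 0]
minus = [0, 0, 0, 0, 1, 3, 5, 3, 1, 0, 0, 0]
total = [p + m for p, m in zip(plus, minus)]

regions = call_regions(total, threshold=2)
for region in regions:
    print(region, summit(total, region), source_estimate(plus, minus, region))
```

On this toy input the summit and the source estimate fall at neighbouring but different positions, which matches the remark that the two are related but not identical.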


Signals of activity and silence

The epigenome of the pancreatic islet

Figure 3. Marks of active and repressed genes in islets. (A) Human islet chromatin enriched for H3K4me1, H3K4me2, H3K4me3, H3K27me3, and control input DNA from four samples was processed and sequenced. Reads were pooled and aligned to the NCBI Genome Build 36.1-hg18, to determine regions that were enriched for binding by modified histones. Note the strong double peak surrounding the transcriptional start site (left end) of ATF4, and the weaker peaks for H3K4me1 and H3K4me2. In B, the HOXB cluster contains significantly enriched regions for H3K27me3 as well as H3K4me3 at several transcription start sites.

[...] MAFB, which are heavily occupied by H3K4me3 despite their lower expression levels (Supplemental Figs. 6, 7A,B). Our initial hypothesis was that this lack of activating histone marks on the hormone-encoding gene promoters might be due to a complete absence of histones, perhaps due to the high transcription rate of these genes. To test this notion, we performed ChIP for total H3 histone at the promoters of JUN and GAPDH, which are occupied by H3K4me3, and insulin and glucagon, which are not (Supplemental Fig. 8). We found similar levels of histone H3 at all of these promoters; thus, the lack of H3K4me3 at the insulin and glucagon promoters is not due to a lack of histones. Therefore, alternative histone modifications or regulatory mechanisms must be responsible for the activation of these hormone-encoding genes in human islets.

Recent studies have demonstrated that the insulin and nearby genes in an extended 80-kb region are a part of a large, human islet-specific, open chromatin domain, and share a common control mechanism (Mutskov and Felsenfeld 2009). The presence of intergenic transcription in this region has been proposed to play a role in the maintenance of open chromatin structure, suggesting that a locus-specific control mechanism might be responsible for constitutive insulin gene expression in humans (Mutskov and Felsenfeld 2009). Our data also indicate a region (chr1:2,100,000–2,200,000 mm8) of high levels of H3K4me1, a mark associated with regulatory regions covering the insulin gene locus.

We also considered the possibility that the low levels of H3K4me3 modification at the insulin gene promoter might be due to a paucity of beta-cells in our samples. However, our ChIP-seq data suggest that this is not the case, as both PDX1 and MAFA (genes with expression restricted to beta-cells) are highly occupied by modified histones (Supplemental Fig. 7A; data not shown). Similarly, the promoter of MAFB (a gene expressed only in alpha-cells in the adult islet) is also occupied by H3K4me3 (Supplemental Fig. 7B). Furthermore, by qRT-PCR analysis, the expression of insulin and other pancreatic cell-specific genes is much higher than that of amylase, a marker for exocrine contamination (Supplemental Fig. 9). While we observed some variability in the levels of insulin mRNA expression between the individual donors, there was no significant difference in the H3K4me3 pattern at the insulin promoter between samples (Supplemental Fig. 10).

Tissue specificity of histone marks at promoters in human pancreatic islets

Previous efforts (Heintzman et al. 2009) have reported that chromatin structure at promoters is largely consistent between cell types and that the variation occurs at enhancers. To test the promoter part of this observation, we compared levels of H3K4me3 in islets with CD4+ T-cells (Barski et al. 2007). We found that while the levels of H3K4me3 on a majority of promoters were well correlated between islets and T-cells, there was a set of promoters that were differentially modified in the two tissues (Fig. 4). These genes include the small number of CpG− genes that showed higher H3K4me3 modification levels with higher expression levels in Figure 1. Thus, there are tissue-specific variations at the promoters of genes in terms of H3K4me3 [...]

Figure 4. Comparison of H3K4me3 at gene promoters in human islets and CD4+ T-cells. The square root of the summary input-normalized levels of H3K4me3 for CpG island-containing (yellow) and CpG island-less (green) genes are plotted. The CpG island-less genes in the magenta (395 genes) and blue (93 genes) boxes are modified only in one tissue or the other, indicating that this class of genes exhibits tissue-specific promoter modification.

Bhandare & al Genome Res 20:428 (2010)

HMM for modified histones

emissions: modifications (binary vector, independent Bernoulli);
states: created as needed (evaluate model complexity)

[...] additional ones. Regardless of whether these chromatin states are causal in directing regulatory processes, or simply reinforcing independent regulatory decisions, these annotations should provide a resource for interpreting biological and medical data sets, such as genome-wide association studies for diverse phenotypes and could potentially help to identify new classes of functional elements.

RESULTS
Chromatin states model and comparison to previous work
Previous analyses have largely focused on characterizing the marks predictive of specific classes of genomic elements defined a priori such as transcribed regions, promoters or putative enhancers, and using the characterization to identify new instances of these classes (refs. 5–12).
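The emission model named on the slide (a binary vector of marks, each an independent Bernoulli variable given the state) can be written out in a few lines. The Python sketch below is an illustration under assumed toy parameters, not the implementation of Ernst & Kellis: the two states, the two marks and all probabilities are invented, and every probability is assumed to lie strictly between 0 and 1 so that the logarithms are defined. It computes the emission log-probability of a mark vector and decodes the most probable state path with the Viterbi algorithm.

```python
import math


def log_emission(marks, probs):
    """log P(marks | state) for a binary vector of independent Bernoulli marks.

    probs[i] is the probability that mark i is present in this state.
    """
    total = 0.0
    for present, p in zip(marks, probs):
        total += math.log(p if present else 1.0 - p)
    return total


def viterbi(observations, init, trans, emit):
    """Most probable state path for a sequence of binary mark vectors."""
    n_states = len(init)
    # score[k]: best log-probability of any path ending in state k
    score = [math.log(init[k]) + log_emission(observations[0], emit[k])
             for k in range(n_states)]
    back = []  # back[t][k]: best predecessor of state k at step t + 1
    for obs in observations[1:]:
        new_score, pointers = [], []
        for k in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda j: score[j] + math.log(trans[j][k]))
            pointers.append(best_prev)
            new_score.append(score[best_prev]
                             + math.log(trans[best_prev][k])
                             + log_emission(obs, emit[k]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy model (all numbers invented): state 0 favours mark 0, state 1 favours mark 1.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.9, 0.1], [0.1, 0.9]]
observations = [(1, 0), (1, 0), (0, 1), (0, 1)]

print(viterbi(observations, init, trans, emit))
```

Because the marks are independent given the state, a state with m marks needs only m emission parameters, which keeps the parameter count linear in the number of marks when states are added.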

[Figure 1 tracks: human chromosome 7, 116,260 kb to 116,360 kb, around CAPZA2; scale bar 50 kb.
Chromatin states shown: promoter states (3, 5, 7, 8, 10, 11); transcribed states (13, 15, 16, 17, 18, 19, 24, 25, 26); active intergenic (36, 37, 38, 39); repressed (43, 44); repetitive (51).
Chromatin marks shown: H3K14ac, H3K23ac, H4K12ac, H2AK9ac, H4K16ac, H2AK5ac, H4K91ac, H3K4ac, H2BK20ac, H3K18ac, H2BK120ac, H3K27ac, H2BK5ac, H2BK12ac, H3K36ac, H4K5ac, H4K8ac, H3K9ac; PolII, CTCF, H2AZ; H3K4me3, H3K4me2, H3K4me1, H3K9me1, H3K79me3, H3K79me2, H3K79me1, H3K27me1, H2BK5me1, H4K20me1, H3K36me3, H3K36me1, H3R2me1, H3R2me2, H3K27me2, H3K27me3, H3K9me2, H3K9me3, H4K20me3.]

Figure 1 Example of chromatin state annotation. Input chromatin mark information and resulting chromatin state annotation for a 120-kb region of human chromosome 7 surrounding the CAPZA2 gene. For each 200-bp interval, the input ChIP-Seq sequence tag count (black bars) is processed into a binary presence and/or absence call for each of 18 acetylation marks (light blue), 20 methylation marks (pink) and CTCF/Pol2/H2AZ (brown). The precise combination of these marks in each interval in their spatial context is used to infer the most probable chromatin state assignment (colored boxes). Although chromatin states were learned independently of any prior genome annotation, they correlate strongly with upstream and downstream promoters (red), 5′-proximal and distal transcribed regions (purple), active intergenic regions (yellow), repressed (gray) and repetitive (blue) regions (state descriptions shown in Supplementary Table 1). This example illustrates that even when the signal coming from chromatin marks is noisy, the resulting chromatin state annotation is very robust, directly interpretable and shows a strong correspondence with the gene annotation. Several spatially coherent transitions are seen from large-scale repressed to active intergenic regions near active genes, from upstream to downstream promoter states surrounding the TSS and from 5′-proximal to distal transcribed regions along the body of the gene. The frequent transitions to state 16 correlate with annotated Alu elements (57% overlap versus 4% and 25% for states 13 and 15, respectively). Transitions to state 13 are likely due to enhancer elements in the first intron of CAPZA2, a region where regulatory elements are commonly found and correlate with several enhancer marks. The maximum-probability state assignments are shown here, and the full posterior probability for each state in this region is shown in Supplementary Figure 1.

Ernst & Kellis Nat Biotechnol 28:817 (2010)
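The caption says that tag counts in 200-bp intervals are turned into binary presence or absence calls, without giving the rule. The Python sketch below shows one plausible way to do this and should be read as an assumption, not as the published procedure: the Poisson background model, the `background_rate` value, the `p_cutoff` value and the function names are all hypothetical. It bins tag positions into 200-bp intervals and calls a mark present when the count is improbably high under the background.

```python
import math

BIN_SIZE = 200  # interval width in bp, as in the caption


def bin_counts(tag_positions, region_start, region_end, bin_size=BIN_SIZE):
    """Count tags per fixed-width interval of [region_start, region_end)."""
    n_bins = (region_end - region_start + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in tag_positions:
        if region_start <= pos < region_end:
            counts[(pos - region_start) // bin_size] += 1
    return counts


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by summing the lower tail."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


def binarize(counts, background_rate, p_cutoff=1e-4):
    """1 where the count is unlikely under the assumed Poisson background."""
    return [1 if poisson_tail(c, background_rate) < p_cutoff else 0
            for c in counts]


# Toy data (invented): eight tags fall in the first interval, two elsewhere.
tags = [10, 50, 120, 150, 199, 180, 30, 90, 450, 810]
counts = bin_counts(tags, 0, 1000)
print(counts)
print(binarize(counts, background_rate=1.0))
```

One row of such 0/1 calls per mark, stacked per interval, gives the binary vectors that a Bernoulli-emission HMM takes as input.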

Zentner and Henikoff Genome Biology 2012, 13:250
http://genomebiology.com/2012/13/10/250

Epigenomics methods

[...] methods. ChIP-exo provides a method to precisely map the genomic binding of proteins in systems where ChIP reagents are readily available. MNase-seq allows for mapping of nucleosomes and non-histone proteins within a single sample and like DNase-seq is easily adapted to any system with a sequenced genome. In combination with ChIP-seq, MNase-seq and DNase-seq provide powerful methods for base-pair resolution identification of protein binding sites. These techniques are summarized schematically in Figure 2.

While epigenomic profiling is relatively straightforward in single-cell systems, it is more challenging in multicellular organisms, where different cell types are tightly interwoven in complex tissues. Indeed, ChIP-exo, MNase-seq, and DNase-seq have generally been performed either in yeast, which are unicellular, or cultured cells from other organisms, which are not necessarily reflective of the in vivo situation in the organism from which they were derived. To profile cell type-specific epigenomes at base-pair resolution, it will be necessary to combine the above technologies with methods for the isolation of specific cell types from a complex milieu. One such method is fluorescence-activated cell sorting (FACS), involving purification of fluorescently labeled cells or nuclei. FACS has been used to isolate specific cell populations from mouse and human brain and mouse embryonic mesoderm for chromatin analysis [40,41]. Another technique, isolation of nuclei tagged in specific cell types (INTACT), has been used to isolate nuclei from individual cell types in Arabidopsis, Caenorhabditis elegans, and Drosophila for expression and preliminary chromatin profiling [42,43]. Combining these techniques with the various methods of base-pair resolution epigenome analysis detailed above should provide striking insights into the regulatory networks underlying specific cell identities.

As base-pair resolution epigenomic techniques are further developed and the cost of sequencing continues to decrease, genome-wide profiling of cell type-specific chromatin landscapes will become increasingly routine. The precise mapping of TFs, of nucleosomal features (positioning, occupancy, composition, and modification), and of ATP-dependent chromatin remodelers may provide the epigenomic equivalent of genome sequencing [...]

[Figure 2 schematic. Panel labels: ChIP-exo (crosslink, sonicate, immunoprecipitation, exonuclease digestion, DNA purification, high-throughput sequencing); MNase-seq (isolate nuclei, MNase digestion, affinity purification, DNA purification, high-throughput sequencing); DNase-seq (isolate nuclei, DNase I digestion, DNA purification, linker ligation, affinity purification, high-throughput sequencing). Other labels: DNase hypersensitive sites (* DNase HS site), chromatin landscape.]

Figure 2. Summary of techniques for base-pair resolution epigenome mapping. Schematic representations of ChIP-exo, MNase-seq, and DNase-seq. In ChIP-exo, chromatin is sonicated and specific fragments are isolated with an antibody to a protein of interest. ChIP DNA is trimmed using λ exonuclease, purified, and sequenced. In MNase-seq, nuclei are isolated and treated with MNase to fragment chromatin. Chromatin is then subjected to DNA purification with or without prior affinity purification and MNase-protected DNA is sequenced. In DNase-seq, nuclei are isolated and treated with DNase I to digest chromatin. DNase-hypersensitive DNA is then ligated to linkers, affinity purified, and sequenced. HS, hypersensitive.

Zentner & Henikoff Genome Biol 13 :250 (2012)
