Learning Chroma\N States from Chip-‐Seq Data
Total Page:16
File Type:pdf, Size:1020Kb
Learning Chroman States from ChIP-seq data Luca Pinello GC Yuan Lab Outline • Chroman structure, histone modificaons and combinatorial paerns • How to segment the genome in chroman states • How to use ChromHMM step by step • Further references 2 Epigene-cs and chroman structure • All (almost) the cells of our body share the same genome but have very different gene expression programs…. 3 h?p://jpkc.scu.edu.cn/ywwy/zbsw(E)/edetail12.htm The code over the code • The chroman structure and the accessibility are mainly controlled by: 1. Nucleosome posioning, 2. DNA methylaon, 3. Histone modificaons. 4 Histone Modificaons Specific histone modificaons or combinaons of modificaons confer unique biological func-ons to the region of the genome associated with them: • H3K4me3: promoters, gene acva.on • H3K27me3: promoters, poised enhancers, gene silencing • H2AZ: promoters • H3K4me1: enhancers • H3K36me3: transcribed regions • H3K9me3: gene silencing • H3k27ac: acve enhancers 5 Examples of *-Seq Measuring the genome genome fragmentation assembler DNA DNA reads “genome” ChIP-seq to measure histone data fragments Measuring the regulome (e.g., protein-binding of the genome) Chromatin Immunopreciptation genomic (ChIP) + intervals fragmentation Protein - peak caller bound by DNA bound DNA reads proteins REVIEWS fragments a also informative, as this ratio corresponds to the fraction ChIP–chip of nucleosomes with the particular modification at that location, averaged over all the cells assayed. One of the difficulties in conducting a ChIP–seq con- trol experiment is the large amount of sequencing that ChIP–seq may be necessary. For input DNA and bulk nucleosomes, many of the sequenced tags are spread evenly across the genome. To obtain accurate estimates throughout the genome, sufficient numbers of tags are needed at each point; otherwise fold enrichment at the peaks will ChIP–seq input DNA result in large errors due to sampling bias. Therefore, the total number of tags to be sequenced is potentially very large. Alternatively, it is possible to avoid sequencing a control sample if one is only interested in differential binding patterns between conditions or time points and Pros35 CG4908 eEF1 if the variation in chromatin preparations is small. NPC1 CG5708 CG5694 Depth of sequencing. One crucial difference between ChIP–chip and ChIP–seq is that the number of tiling 10,220,000 10,225,000 10,230,000 arrays that is used in a ChIP–chip experiment is fixed b regardless of the protein or modification of interest, CTCF 6 Adapted from Dewey lecture and Peter Park Nature Gene-cs Review whereas the number of fragments that is sequenced in a ChIP–seq experiment is determined by the investiga- tor. In published ChIP–seq experiments, a single lane of the Illumina Genome Analyzer was the basic unit of RNA polymerase II sequencing. When it was introduced, a single lane gen- erated 4–6 million reads before alignment but, owing to improvements in the system, a single lane now gener- ates 8–15 million reads or more. Given the cost of each experiment, many early data sets contained reads from H3K36me3 a single lane regardless of what the specific experiment was. Intuitively, one expects that when a large number of binding sites are present in the genome for a DNA- binding protein or when a histone modification covers a large fraction of the genome, a correspondingly large number of tags will be needed to cover each bound region at the same tag density. One reasonable crite- rion for determining sufficient sequencing depth would H3K27me3 be that the results of a given analysis do not change FBXO7 when more reads are obtained. In terms of the number of binding sites, this criterion translates to the presence of a ‘saturation point’ after which no further binding sites BPIL2 SYN3 are discovered with additional reads. 31,200,000 31,220,000 31,240,000 31,260,000 The issue of saturation points has been examined in a recent paper through simulation studies48. In three Figure 2 | ChIP profiles. a | Examples of the profiles generated byNa chromatinture Revie immunoprews | Genetics- cipitation followed by sequencing (ChIP–seq) or by microarray (ChIP–chip). Shown is a example data sets, a reference set of sites was generated section of the binding profiles of the chromodomain protein Chromator, as measured based on the full set of sequencing reads in each case. by ChIP–chip (unlogged intensity ratio; blue) and ChIP–seq (tag density; red) in the Then, a wide range of different read counts was sampled Drosophila melanogaster S2 cell line. The tag density profile obtained by ChIP–seq from the complete data set, with multiple random selec- reveals specific positions of Chromator binding with higher spatial resolution and tions for each sample size. Binding sites were determined sensitivity. The ChIP–seq input DNA (control experiment) tag density is shown in grey for for each sample with a threshold probability (p value), comparison. b | Examples of different types of ChIP–seq tag density profiles in human T and the results for each sample size were averaged. The cells. Profiles for different types of proteins and histone marks can have different types of fraction of the reference set that was recovered as a func- features, such as: sharp binding sites, as shown for the insulator binding protein CTCF tion of the number of reads is shown in FIG. 3A. If there (CCCTC-binding factor; red); a mixture of shapes, as shown for RNA polymerase II was a saturation point, the number of sites found would (orange), which has a sharp peak followed by a broad region of enrichment; medium size broad peaks, as shown for histone H3 trimethylated at lysine 36 (H3K36me3; green), increase up to a certain point and then plateau, which which is associated with transcription elongation over the gene; or large domains, as would indicate that the rate at which new sites were shown for histone H3 trimethylated at lysine 27 (H3K27me3; blue), which is a repressive being discovered had slowed down to the point where mark that is indicative of Polycomb-mediated silencing. BPIL2, bactericidal/permeability- any further increase in the number of reads would be increasing protein-like 2; FBXO7, F box only 7; NPC1, Niemann-Pick disease, type C1; inefficient at yielding new sites. When the simulation Pros35, proteasome 35 kDa subunit; SYN3, synapsin III. Data for part b are from REF. 25. was performed, however, the results indicated that NATURE REVIEWS | GENETICS VOLUME 10 | OCTOBER 2009 | 673 )''0DXZd`ccXeGlYc`j_\ijC`d`k\[%8cci`^_kji\j\im\[ Figure 1. Generation of Comprehensive A MappedMapped reareadsds TotalTotal bases OP9 coculture -> colony-forming ((billion)billion) ((billion)billion) Epigenome Reference Maps for hESCs and -> colony subculture Mesenchymal Stem Cell (MSC) Four hESC-Derived Lineages 19-22 days CChIP-seqhIP-seq 5.85.8 210 (A) Schematic of hESC differentiation procedures Noggin + SB431542 Neural Progenitor and a summary of the epigenomic data sets pro- hESC (H1) Cell (NPC) 7 days duced in this study. MethylC-SeqMethylC-Seq 5.3 457 BMP4 Trophoblast-like Cell (B) A snapshot of the UCSC genome browser 5 days (TBL) shows the DNA methylation level (mCG/CG), RNA- RRNA-SeqNA-Seq 3.83.8 380 seq reads (+, Watson strand; , Crick strand), FGF2 + BMP4 Mesendoderm À and ChIP-seq reads (RPKM) of 24 chromatin 2 days (ME) marks in H1. We can “call peaks” but… See also Figure S1. B Epigenomic landscape in human embryonic stem cells (H1) chr6:30,614,231-31,337,674,, ,, H2AK5ac H2BK5ac H2BK12ac H2BK15ac H2BK20ac sively at the ERV1 subfamily HERV-H H2BK120ac H3K4ac and its flanking LTR elements LTR7 (Fig- H3K9ac Histone H3K14ac ure 2D, bottom). Together, HERV-H and acetylation H3K18ac H3K23ac LTR7 account for more than 43% of H3K27ac H3K56ac LTRs that are present at H1- and ME-spe- H4K5ac cific lncRNA gene promoters. A gene H4K91ac H3K4me1 ontology analysis of coding genes near H3K4me2 H3K4me3 H1-specific HERV-H/LTR7 sites revealed Histone H3K27me3 an enrichment of POU5F1-targeted methylation H3K9me3 15 H3K36me3 genes (p value = 4 3 10À ), which is H4K20me1 H3K79me1 consistent with a previous study showing H3K79me2 mCG/CG that NANOG and POU5F1 preferentially RNA (+) bind to repetitive elements (Kunarso RNA (-) PPP1R10 DHX16 MDC1 IER3 DDR1 SFTA2 MUC21 HCG22 C6orf15 TCF19 HCG27 et al., 2010). We did not find significant PRR3 MRPS18B PPP1R18 TUBB DDR1 DPCR1 MUC22 PSORS1C1 ABCF1 ATAT1 NRM FLOT1 MIR4640 CDSN TCF19 MIR877 C6orf136 NRM GTF2H4 PSORS1C2 PPP1R10 DHX16 VARS2 CCHCR1 enrichment of LTR subclasses for other C6orf136 NRM POU5F1 PPP1R18 POU5F1 lineage-restricted lncRNA genes. Repeti- tive elements are known to be regulated by DNA methylation and H3K9me3 in Idea: We need a way to summarize the combinatorial paerns of NANOG, POU5F1, and a reduced but significant level of SOX2. ESCs (Leung and Lorincz, 2012). We do not find significant We then investigated a cohort of long noncoding RNA (lncRNA) enrichment of H3K9me3 around most HERV-H elements (data mul-ple histone marks genes and detected significant levels of transcripts for 2,175 not shown). By contrast, a7 subset of the H1-specific HERV-H known and 281 unannotated lncRNA genes in at least one cell elements (n = 70) show hypomethylation in H1 and ME but type (Figures 2A and S2A). Using the same entropy-based gain DNA methylation in other H1-derived cells (Figures 2B and approach, we found 930 lncRNA genes defined as lineage 2E). Notably, the overall low level of DNA methylation in IMR90 restricted (Figure S2C), which constitute 37.9% of total ex- reflects its globally hypomethylated genome, likely due to the pressed lncRNA genes. By contrast, only 16.5% of expressed presence of partially methylated domains (PMDs) (Figures S2E coding genes are characterized as lineage restricted (Fig- and S2F) (Lister et al., 2009).