A Histone Arginine Methylation Localizes to Nucleosomes in Satellite II and III DNA Sequences in the Human Genome
Total Page:16
File Type:pdf, Size:1020Kb
UCSF UC San Francisco Previously Published Works Title A histone arginine methylation localizes to nucleosomes in satellite II and III DNA sequences in the human genome Permalink https://escholarship.org/uc/item/3xm5d6fh Journal BMC Genomics, 13(1) ISSN 1471-2164 Authors Capurso, Daniel Xiong, Hao Segal, Mark R Publication Date 2012-11-15 DOI http://dx.doi.org/10.1186/1471-2164-13-630 Peer reviewed eScholarship.org Powered by the California Digital Library University of California Capurso et al. BMC Genomics 2012, 13:630 http://www.biomedcentral.com/1471-2164/13/630 RESEARCH ARTICLE Open Access A histone arginine methylation localizes to nucleosomes in satellite II and III DNA sequences in the human genome Daniel Capurso1, Hao Xiong2 and Mark R Segal2* Abstract Background: Applying supervised learning/classification techniques to epigenomic data may reveal properties that differentiate histone modifications. Previous analyses sought to classify nucleosomes containing histone H2A/H4 arginine 3 symmetric dimethylation (H2A/H4R3me2s) or H2A.Z using human CD4+ T-cell chromatin immunoprecipitation sequencing (ChIP-Seq) data. However, these efforts only achieved modest accuracy with limited biological interpretation. Here, we investigate the impact of using appropriate data pre-processing — deduplication, normalization, and position- (peak-) finding to identify stable nucleosome positions — in conjunction with advanced classification algorithms, notably discriminatory motif feature selection and random forests. Performance assessments are based on accuracy and interpretative yield. Results: We achieved dramatically improved accuracy using histone modification features (99.0%; previous attempts, 68.3%) and DNA sequence features (94.1%; previous attempts, <60%). Furthermore, the algorithms elicited interpretable features that withstand permutation testing, including: the histone modifications H4K20me3 and H3K9me3, which are components of heterochromatin; and the motif TCCATT, which is part of the consensus sequence of satellite II and III DNA. Downstream analysis demonstrates that satellite II and III DNA in the human genome is occupied by stable nucleosomes containing H2A/H4R3me2s, H4K20me3, and/or H3K9me3, but not 18 other histone methylations. These results are consistent with the recent biochemical finding that H4R3me2s provides a binding site for the DNA methyltransferase (Dnmt3a) that methylates satellite II and III DNA. Conclusions: Classification algorithms applied to appropriately pre-processed ChIP-Seq data can accurately discriminate between histone modifications. Algorithms that facilitate interpretation, such as discriminatory motif feature selection, have the added potential to impart information about underlying biological mechanism. Keywords: Epigenomics, Histone modifications, ChIP-Seq, Data pre-processing, Classification Background However, it is currently unknown why so many modifica- Chromatin compaction is one of the critical factors tions — on at least 60 histone residues [3] — are neces- regulating gene expression. The basic unit of chromatin, sary [3,4]. One possibility is that individual modifications the nucleosome, consists of 147 base pairs (bp) of DNA have specialized properties, such as “indexing” classes of wrapped around an octamer of histone proteins (H2A, genomic elements [5]. Nevertheless, such discriminating H2B, H3, H4). Many histone post-translational modifica- properties remain largely unknown, as redundancy and tions contribute to establishing compacted, transcrip- enzyme promiscuity for non-histone targets have limited tionally repressed heterochromatin (e.g., histone H3 the amenability of histone modifications to genetic lysine 9 trimethylation (H3K9me3)) or open, transcrip- experimentation [6]. tionally poised euchromatin (e.g., H3K4me3) [1,2]. A potential solution to this problem is to apply supervised learning/classification techniques to high- * Correspondence: [email protected] throughput epigenomic data, such as chromatin immu- 2Department of Epidemiology and Biostatistics, University of California, San Francisco, CA, USA noprecipitation sequencing (ChIP-Seq) data, for histone Full list of author information is available at the end of the article modificatons. Encouragingly, these approaches have had © 2012 Capurso et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Capurso et al. BMC Genomics 2012, 13:630 Page 2 of 10 http://www.biomedcentral.com/1471-2164/13/630 success in the related task of predicting the nucleosome position across a population of cells and can therefore occupancy of DNA sequences: they have elicited pre- generate a signal peak when ChIP-Seq reads are aligned. dictive features with biological (e.g., Rap1 transcription Such nucleosomes are also referred to as being relatively factor binding sites [7,8]) and biophysical (e.g., GC con- well positioned or phased, and there is evidence for their tent, DNA propeller twist [7,9,10]) interpretations. regulatory importance [1,16]. While using stable nucleo- Nevertheless, attempts to apply classification techniques some positions might limit the analysis to a subset of to histone modifications have been less forthcoming. nucleosomes (and thus influence interpretation), we still This is, in part, because such analyses require less read- believe this approach is preferable to using raw, aligned ily available datasets, which correspond to many ChIP- reads — of which only a small minority were likely even Seq experiments in the same cell type. As notable bound to the nucleosomes of interest. This approach of exceptions, Barski et al. [11] have generated a ChIP-Seq using stable nucleosomes was also utilized in a recent dataset for 20 histone methylations and the histone study [19]. variant H2A.Z in human CD4+ T cells, and Wang et al. Aside from the handling of signal and background, the [12], of the same research group, have generated a approach of Gervais and Gaudreau [13] might not ad- similar dataset for 18 histone acetylations. A recent equately control for systematic biases present in ChIP-Seq study by Gervais and Gaudreau [13] applied classifica- data. First, because of PCR amplification bias it may be ad- tion techniques to histone modifications using these visabletocollapseduplicatereads prior to analysis [15,18]. datasets. This is especially the case for datasets such as Barski et al. In particular, Gervais and Gaudreau [13] attempted to [11] and Wang et al. [12] where sequencing depth is rela- predict whether a nucleosome contains histone H2A.Z tively low, such that there is a lower likelihood of sequen- or H2A/H4 arginine 3 symmetric dimethylation (H2A/ cing independently-precipitated fragments with the same H4R3me2s; the authors refer to this as just “H2A”, start site (as future datasets begin to have much higher though it is a methylated form [14]). Importantly, these sequencing depth, more refined alternatives to read dedu- two classes are likely mutually exclusive: H2A.Z lacks plication will be valuable). Indeed, even for stable nucleo- the R3 methylation site and localizes near active tran- somes, the positioning is often blurry, with nucleosomes scription start sites [1], while H2A/H4R3me2s localizes not having precisely the same start site across cells [20]. with repressed heterochromatin [3]. The authors [13] In addition, because coverage and the ability to detect first performed classification with histone modification peaks vary with sequencing depth, ChIP-Seq experiments features (co-localization with 37 other modifications need to be normalized for the number of reads [18]. from ChIP-Seq) and, then, with DNA sequence features Refined normalization approaches are emerging [21] for (frequency of 6-mers in 147 bp nucleosome-bound DNA ChIP-Seq datasets that contain a mock immunoprecipita- sequences). However, these analyses only achieved mod- tion (IP) sample; however, for otherwise rich ChIP-Seq est prediction accuracies of 68.3% and <60%, respectively datasets that lack such a mock IP, including [11] and [12], (here, a trivial classifier would have an accuracy of 50%) we believe data should still be normalized for the number [13]. Furthermore, there was limited biological interpret- of reads, in the absence of a more delicate approach for ation for histone modification features and no interpret- this type of data (see Discussion). ation for DNA sequence features [13]. Here, we employ appropriate ChIP-Seq data pre- A partial explanation for this modest performance processing and sequence-customized, or otherwise may be insufficient data pre-processing. Gervais and advanced, algorithms to investigate their impact on the Gaudreau [13] used raw, aligned (25 bp) ChIP-Seq accuracy and interpretability of classifying nucleosomes reads, and simply extended these to 147 bp to generate containing H2A/H4R3me2s or H2A.Z. For data pre-pro- what they consider to be nucleosome-bound DNA cessing, we perform deduplication, normalization, and sequences. However, this approach is problematic. Be- position-finding. Further, for DNA sequence-based clas- cause ChIP-Seq is only a slight enrichment (not a purifi- sification, we utilize the recently developed Discrimin- cation) for sequences bound to the protein of interest atory Motif Feature Selection (DMFS) [22], which,