Comparative analysis of commonly used peak calling programs for ChIP-Seq analysis

Hyeongrin Jeon1, Hyunji Lee1, Byunghee Kang1, Insoon Jang1, Tae-Young Roh1,2,3*

1Department of Life Sciences, Pohang University of Science and Technology (POSTECH), Pohang 37673, Korea
2Division of Integrative Biosciences and Biotechnology, Pohang University of Science and Technology (POSTECH), Pohang 37673, Korea
3SysGenLab Inc., Pohang 37613, Korea

Original article
eISSN 2234-0742
Genomics Inform 2020;18(4):e42
https://doi.org/10.5808/GI.2020.18.4.e42

Received: October 6, 2020
Revised: October 26, 2020
Accepted: November 22, 2020

*Corresponding author: E-mail: [email protected]

Chromatin immunoprecipitation coupled with high-throughput DNA sequencing (ChIP-Seq) is a powerful technology to profile the location of proteins of interest on a whole-genome scale. To identify the enrichment location of proteins, many programs and algorithms have been proposed. However, none of the commonly used peak calling programs could accurately explain the binding features of target proteins detected by ChIP-Seq. Here, publicly available data on 12 histone modifications, including H3K4ac/me1/me2/me3, H3K9ac/me3, H3K27ac/me3, H3K36me3, H3K56ac, and H3K79me1/me2, generated from a human embryonic stem cell line (H1), were profiled with five peak callers (CisGenome, MACS1, MACS2, PeakSeq, and SISSRs). The performance of the peak calling programs was compared in terms of reproducibility between replicates, examination of enriched regions to variable sequencing depths, the specificity-to-noise signal, and sensitivity of peak prediction. There were no major differences among peak callers when analyzing point source histone modifications. The peak calling results from histone modifications with low fidelity, such as H3K4ac, H3K56ac, and H3K79me1/me2, showed low performance in all parameters, which indicates that their peak positions might not be located accurately.
Our comparative results could provide a helpful guide to choose a suitable peak calling program for specific histone modifications.

Keywords: ChIP-Seq, histone modification, human embryonic stem cell, peak calling program

2020, Korea Genome Organization
This is an open-access article distributed under the terms of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

Protein-binding regions in the context of chromatin have been detected by the chromatin immunoprecipitation (ChIP) method. Since the first ChIP coupled with high-throughput DNA sequencing (ChIP-Seq) technology for histone modification mapping was introduced with the combination of ChIP and next-generation sequencing, a large amount of ChIP-Seq data has been produced at the genome level, and the development of data analysis tools should thus be emphasized [1-3].

The basic building block of chromatin, the nucleosome, consists of 146 base pairs (bp) of DNA and a histone octamer composed of four core histones: H2A, H2B, H3, and H4. Post-translational modifications of histone tails play an important role in the epigenetic regulation of genome activity. These modifications include acetylation, methylation, phosphorylation, and ubiquitination. Depending on the types of histone modifications and binding sites, different enrichment patterns and related biological effects are expected. For example, acetylated histones provide a chromatin environment easily accessible to the transcriptional machinery by changing the chromatin conformation. Some histone methylations, such as H3K4me2 and H3K4me3, are mostly located on promoters, whereas H3K36me3 is predominantly found on the gene bodies of transcriptionally active genes [4,5].

The Encyclopedia of DNA Elements (ENCODE) Consortium, aiming at the identification of all functional elements in the human genome, proposed a guideline for categorizing protein-bound regions occupied by point source factors, broad source factors, and mixed source factors [6].

The distribution patterns of ChIP-Seq data on the genome have been analyzed using many different software programs with specific algorithms, which use different strategies for searching potential binding regions, judging the peaks, and calculating significance [7-10]. Most previous studies have focused on detecting the enriched peaks, and several groups have already evaluated peak calling programs [11-16]. Although most previous studies compared the performance of each program for analyzing transcription factor binding patterns, some tested histone modifications, including H3K4me3, H3K9me3, H3K27me3, and H3K36me3 [11,12,14]. However, the performance evaluation of ChIP-Seq analysis programs needs to be more extensively examined to understand the nature of enrichment of various types of histone modifications. Herein, we tested ChIP-Seq data from 12 histone modifications covering three source types with five peak calling programs (CisGenome, MACS1, MACS2, PeakSeq, and SISSRs).

Methods

Data filtering and cross-correlation analysis

The ChIP-Seq datasets of 12 histone modification types, input, and RNA-sequencing of human embryonic stem cell line (H1) were downloaded from the NIH Roadmap Epigenomics Project Gene Expression Omnibus (GEO) repository (http://www.ncbi.nlm.nih.gov/geo/roadmap/epigenomics/) (Supplementary Table 1). The downloaded SRA format files were converted to the FASTQ format via fastq-dump in SRA Toolkit (version 2.4.5). Raw sequencing reads were filtered by fastq_quality_filter (FASTX-Toolkit version 0.0.13.2) with the following options (-p 80, -q 20, and -Q33). High-quality reads were mapped to the human genome (hg19) using Bowtie (version 1.1.1) with the default options (-n 2, -e 70, -l 28, -I 0, -X 250, and -maxbts 250) [17].

To evaluate the signal-to-noise ratio of a ChIP-Seq experiment, strand cross-correlation analysis was performed using the SPP program with the default options (-s -100:5:600, and -x 10), considering two metrics: (1) the normalized strand coefficient, which quantifies the fragment length cross-correlation over the background cross-correlation rate, and (2) the relative strand correlation, which calculates the ratio of cross-correlation observed at the predicted fragment size against the artifactual cross-correlation observed at the read length [18].

Identification of regions enriched with specific histone modifications

To detect peaks, CisGenome (version 2.0), MACS1 (version 1.4.2), MACS2 (version 2.1.0), PeakSeq (version 1.31), and SISSRs (version 1.4) were used with the default options and recommended parameters for a direct comparison without any optimization (Supplementary Table 2). For CisGenome, the Bowtie-format output files were converted into the aln format and the seqpeak command was used. For MACS1, the options of -p 1e-5, -m 10:30, and --keep-dup 1 were used and for MACS2, the default options (-q 0.01, -m 5:50, and --keep-dup 1) were applied. In MACS2, the broad options (-q 0.1, -m 5:50, and --keep-dup 1) were also used for the broad source peaks. The signal map was prepared from the Bowtie output using the PeakSeq -preprocess command. During the step of PeakSeq -peak_selection, the default options were used, such as Enrichment_mapped_fragment_length 200, target_FDR 0.05, N_Simulations 10, Minimum_interpeak_distance 200, and max_Qvalue 0.05. SISSRs detected peaks with the recommended options (-F 0.001, -e 10, -p 0.001, -m 0.8, -w 20, -E 2, and -L 500). All peaks in each set were ranked by the following guidelines: CisGenome and PeakSeq, pre-sorted peak lists; MACS1 and MACS2, sorted by the significance level (10 × −log10(p-value)) and then by the fold enrichment; SISSRs, ranked by the fold enrichment and by the significance level (p-value). Frequently detected false positive peaks, regardless of cell line or experiment (called the ENCODE blacklist) were removed for quality control of peaks [19,20].

Comparison of peak calling performance

The coincidence of peak positions obtained by the individual programs was examined using the intersectBed and multiIntersectBed functions (BEDTools version 2.23.0) with a minimum overlapping size of 1 bp [21]. Pearson correlation coefficients based on peak ranks between overlapped peaks were calculated, because the peak rank represents the order of importance according to algorithm characteristics. For the multiple comparison analyses of each histone mark, we used multiIntersectBed in BEDTools. The multiIntersectBed function provided a comparison among the multiple files.

The Jaccard similarity coefficients (or index J) were calculated for the measurement of variability: J(A, B) = |A ∩ B| / |A ∪ B|, where A and B are sets of enriched regions in base pairs identified by peak calling programs. Irreproducible discovery rate (IDR) analysis with all replicates was performed using the recommended parameters (peak.half.width -1, min.overlap.ratio 0, is.broadpeak F, and ranking.measure p.value for MACS1 and MACS2; q.value for CisGenome and PeakSeq; signal.value for SISSRs) [22]. For the specificity test, the control sequence reads were mixed with the original ChIP-Seq data and then the performance was computed. At a different sequencing read depth, the genomic coverage of the

the shortest peaks. The concordance or co-occupancy of peak regions identified from two different callers was calculated at the same genomic loci. The peaks from H3K4me2, H3K4me3, H3K9ac, H3K27me3, and H3K36me3 varied in length. As a representative example, the number of peaks enriched with H3K4me3, a typical narrow source mark, ranged from 24,000 to 37,000 and its enrichment profile was very similar at promoters of actively transcribed genes with all peak callers (Fig. 2A). The peak positional variability was highly dependent on the histone mark type. Histone marks such as H3K4me2, H3K4me3, H3K27ac, and H3K9ac, which are associated with transcriptional activation, showed a high level of concordance.
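The Jaccard similarity coefficient defined in the Methods, J(A, B) = |A ∩ B| / |A ∪ B| measured in base pairs, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (merge_intervals, covered_bp, intersection_bp, jaccard_bp) and the toy peak coordinates are hypothetical, and the sketch assumes half-open (start, end) intervals on a single chromosome, whereas a real analysis would compute the measure per chromosome, for example with BEDTools.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of opening a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bp(intervals):
    """Number of base pairs covered by a set of intervals (overlaps counted once)."""
    return sum(end - start for start, end in merge_intervals(intervals))


def intersection_bp(a, b):
    """Number of base pairs covered by both interval sets."""
    a, b = merge_intervals(a), merge_intervals(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def jaccard_bp(a, b):
    """Jaccard index J(A, B) = |A ∩ B| / |A ∪ B| in base pairs."""
    inter = intersection_bp(a, b)
    union = covered_bp(a) + covered_bp(b) - inter
    return inter / union if union else 0.0


# Hypothetical peak calls from two programs on one chromosome.
peaks_a = [(100, 200), (300, 400)]
peaks_b = [(150, 250), (300, 350)]
print(jaccard_bp(peaks_a, peaks_b))  # 100 shared bp / 250 bp in the union
```

Because the index is computed over base pairs rather than peak counts, a caller that reports a few long peaks and one that reports many short peaks over the same regions can still score as highly similar.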