Briefings in Bioinformatics, 18(3), 2017, 441–450
doi: 10.1093/bib/bbw035
Advance Access Publication Date: 11 May 2016
Paper

Features that define the best ChIP-seq peak calling algorithms

Reuben Thomas, Sean Thomas, Alisha K. Holloway and Katherine S. Pollard

Corresponding author: Katherine S. Pollard, Gladstone Institutes, San Francisco, CA 94158, USA. Tel.: 415-734-2711; Fax: 415-355-0141; E-mail: [email protected]

Abstract

Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is an important tool for studying gene regulatory proteins, such as transcription factors and histones. Peak calling is one of the first steps in the analysis of these data. Peak calling consists of two sub-problems: identifying candidate peaks and testing candidate peaks for statistical significance. We surveyed 30 methods and identified 12 features of the two sub-problems that distinguish methods from each other. We picked six methods [GEM, MACS2, MUSIC, BCP, Threshold-based method (TM) and ZINBA] that span this feature space and used a combination of 300 simulated ChIP-seq data sets, 3 real data sets and mathematical analyses to identify features of methods that allow some to perform better than the others. We prove that methods that explicitly combine the signals from ChIP and input samples are less powerful than methods that do not. Methods that use windows of different sizes are more powerful than the ones that do not. For statistical testing of candidate peaks, methods that use a Poisson test to rank their candidate peaks are more powerful than those that use a Binomial test. BCP and MACS2 have the best operating characteristics on simulated transcription factor binding data. GEM has the highest fraction of the top 500 peaks containing the binding motif of the immunoprecipitated factor, with 50% of its peaks within 10 base pairs of a motif. BCP and MUSIC perform best on histone data. These findings provide guidance and rationale for selecting the best peak caller for a given application.

Key words: ChIP-seq; peak caller; benchmark

Reuben Thomas is a Staff Research Scientist in the Bioinformatics Core at Gladstone Institutes.
Sean Thomas is a Staff Research Scientist in the Bioinformatics Core at Gladstone Institutes.
Alisha K. Holloway is the Director of Bioinformatics at Phylos Biosciences, a visiting scientist at Gladstone Institutes and an Adjunct Assistant Professor in Biostatistics at the University of California, San Francisco.
Katherine S. Pollard is a Senior Investigator at Gladstone Institutes and Professor of Biostatistics at the University of California, San Francisco.

Submitted: 20 January 2016; Received (in revised form): 1 March 2016

© The Author 2016. Published by Oxford University Press.

Introduction

Regulation of gene expression is one of the fundamental means by which cells adapt to internal and external environments. Many regulatory mechanisms rely on modifying or ‘marking’ the DNA in particular ways, either through covalent modification or by intermolecular interactions. Chromatin immunoprecipitation followed by sequencing (ChIP-seq) data are generated to provide readouts of these modifications, such as the location and frequency of binding of a transcription factor or the distribution of histone modifications that are used by the cell to establish or maintain specialized chromatin domains.

The data for ChIP-seq peak calling are stacks of aligned reads across a genome. Some of these stacks correspond to the signal of interest (e.g. binding of a transcription factor or a modified histone). Many other stacks are regarded as molecular or experimental noise, or as reflecting systematically greater accessibility to measurement at that particular genomic location. This manuscript deals with the problem of separating signal from noise in the stacks of reads to estimate where the immunoprecipitated protein is bound to the DNA.

Many methods target this problem. The newer methods make claims of superiority over a subset of the existing ones by displaying performance on certain metrics. However, the generalizability of these performance results is unclear given the limited number of data sets (typically 3–5), methods, parameter settings (often only default settings) and performance metrics used.
There have been efforts to benchmark peak calling methods [1–5]. These have the advantage of being independent evaluations, but they have also been limited in scope and sometimes provide conflicting advice owing to differences in the numbers and widths of peaks evaluated [6]. For example, Harmanci et al. [7], Koohy et al. [5] and Micsinai et al. [4] disagree about the ranking of the Spatial clustering approach for the Identification of ChIP-Enriched Regions (SICER) [8], F-seq [9] and Model-based Analysis for ChIP-Seq (MACS) [10]. Consequently, the field lacks systematic recommendations for calling peaks in different scenarios.

We address these criticisms of benchmarking efforts by first abstracting the peak calling problem into two sub-problems: identifying candidate peaks and testing candidate peaks for significance. We identify four features of the algorithms that differentiate them in terms of how they address the first sub-problem and eight features that differentiate the algorithms on the second sub-problem. We pick six methods that span most of the possible feature values for the two sub-problems. We then simulate ChIP-seq data for transcription factor binding to generate 100 independent ChIP and input sample pairs at each of three different noise levels, and we evaluate the operating characteristics of the six methods by varying their respective thresholds for determining peak significance. We also test the methods on data from one transcription factor and two histone mark experiments. Combining results across these data with mathematical analyses, we determine which features of peak calling methods optimize performance.

Methods

Choice of peak calling methods

We surveyed 30 methods [4, 7, 8, 9–31] in the literature and annotated each of them in terms of how they solve sub-problems 1 and 2 (Table 1 and Supplementary Table S1). We chose Model-based Analysis for ChIP-Seq version 2 (MACS2) [10], MUltiScale enrIchment Calling for ChIP-Seq (MUSIC) [7], Genome wide Event finding and Motif discovery (GEM) [13], Zero-Inflated Negative Binomial Algorithm (ZINBA) [11], Bayesian Change Point (BCP) [14] and TM for further analysis, balancing the need to cover most of the feature space of the different methods against that of directly testing a feasible number of the more recent methods.

Simulations

ChIP-seq data representing transcription factor binding and corresponding input samples are simulated using functions adapted from the ChIPsim [32] Bioconductor [33] package in R [34], which are in turn based on [35]. The data are simulated for pairs of ChIP and corresponding input samples under three noise settings: high noise, medium noise and low noise (details in Supplementary Text).

To ensure that our simulated ChIP and input data resemble real ChIP-seq experiments, we compared them with data from the first 10 million base pairs (bp) of chromosome 1 in a ChIP-seq experiment on the transcription factor Tbx5 in mouse cardiomyocytes [36]. Peaks were identified using MACS2, which is one of the best performing methods on the simulated data. Input was quantified as reads per 1000 bp window. Quantiles of simulated ChIP and input data match the real data well (Figure 1), and browser tracks resemble the real data visually (Supplementary Figure S1). The only difference detected is a more extreme number of reads in about 1% of simulated regions, both ChIP and input, compared with the real data.
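The published simulations use functions adapted from ChIPsim; as a minimal self-contained illustration of the setup described above, the base-R sketch below simulates window-level read counts for one ChIP/input pair. The window count, peak count, background rates and enrichment are illustrative assumptions, not the settings from the Supplementary Text.

## Minimal sketch (not the authors' ChIPsim pipeline): simulate window-level
## read counts for one ChIP/input pair. All numeric settings are assumptions.
set.seed(1)

n_windows  <- 10000  # 1 kb windows across a toy genome
n_peaks    <- 50     # number of true binding features
noise_mean <- c(low = 2, medium = 5, high = 10)  # assumed background rates

simulate_pair <- function(noise = c("low", "medium", "high"),
                          enrichment = 20) {
  noise <- match.arg(noise)
  bg <- noise_mean[[noise]]
  # Input sample: background noise only
  input <- rpois(n_windows, lambda = bg)
  # ChIP sample: the same background plus extra reads at true peaks
  chip <- rpois(n_windows, lambda = bg)
  peak_pos <- sample(n_windows, n_peaks)
  chip[peak_pos] <- chip[peak_pos] + rpois(n_peaks, lambda = enrichment)
  list(chip = chip, input = input, truth = peak_pos)
}

sim <- simulate_pair("medium")
# Compare count quantiles of ChIP and input, in the spirit of Figure 1
quantile(sim$chip, c(0.5, 0.9, 0.99))
quantile(sim$input, c(0.5, 0.9, 0.99))

Drawing both samples from the same background distribution, with extra reads added only at true peaks, mirrors the framing above: input measures noise and accessibility, while ChIP adds the binding signal on top.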
Evaluation metrics for simulated data

All peak calling methods were run with the lowest significance threshold possible (P-value or q-value equal to 1, or any fold enrichment) to generate the complete list of peaks that could be evaluated for their operating characteristics on each of the simulated data sets. Note that GEM reports only exact genomic locations of binding, so a 200 bp window around each identified binding location was used to define peaks for comparison with the other methods. For each method, we varied the P-value or q-value threshold to produce a nested set of peaks for evaluating performance across a range of significance levels.

We evaluated performance using several complementary metrics. First, each set of peaks was compared with the locations of the binding features in the ChIP sample using the findOverlappingPeaks and annotatePeakInBatch functions in the ChIPpeakAnno [37] Bioconductor [33] package in R [34]. The fraction of the true binding features that overlap with the significant peaks is defined as the ‘sensitivity’, the fraction of the significant peaks that overlap with the true binding features is defined as the ‘precision’ and the harmonic mean of the sensitivity and the precision is defined as the ‘F-score’ for each method at the particular significance threshold on the given simulated data set. We also computed the distance from the center of each significant peak to the center of the closest binding feature.
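To make the overlap-based definitions concrete, here is a small sketch that assumes called peaks and true binding features are held as GRanges objects with toy coordinates; GenomicRanges overlap counting is used as a stand-in for the ChIPpeakAnno calls named above.

## Minimal sketch of sensitivity, precision and F-score for one peak set,
## assuming toy coordinates; not the paper's exact ChIPpeakAnno workflow.
suppressPackageStartupMessages(library(GenomicRanges))

# Toy true binding features and called peaks on one chromosome
truth <- GRanges("chr1", IRanges(start = c(1000, 5000, 9000), width = 200))
peaks <- GRanges("chr1", IRanges(start = c(1050, 5100, 12000), width = 300))

# GEM-style point calls (hypothetical object) would first be expanded to
# 200 bp windows: peaks <- resize(point_calls, width = 200, fix = "center")

sensitivity <- mean(countOverlaps(truth, peaks) > 0)  # true features recovered
precision   <- mean(countOverlaps(peaks, truth) > 0)  # called peaks that are real
f_score     <- 2 * sensitivity * precision / (sensitivity + precision)

# Distance from each significant peak center to the closest feature center
d <- distanceToNearest(resize(peaks, width = 1, fix = "center"),
                       resize(truth, width = 1, fix = "center"))
mcols(d)$distance

With these toy coordinates, two of the three true features are recovered and two of the three called peaks are real, so sensitivity, precision and F-score all equal 2/3; varying the significance threshold over the nested peak sets traces out the operating characteristics described above.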