WACS: Improving ChIP-seq Peak Calling by Optimally Weighting Controls

Aseel Awdeh1,2*, Marcel Turcotte1 and Theodore J. Perkins1,2,3

bioRxiv preprint doi: https://doi.org/10.1101/582650; this version posted November 15, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Abstract

Motivation: Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq), initially introduced more than a decade ago, is widely used by the scientific community to detect protein/DNA binding and histone modifications across the genome. Every experiment is prone to noise and bias, and ChIP-seq experiments are no exception. To alleviate bias, the incorporation of control datasets in ChIP-seq analysis is an essential step. The controls are used to account for the background signal, while the remainder of the ChIP-seq signal captures true binding or histone modification. However, a recurrent issue is different types of bias in different ChIP-seq experiments. Depending on which controls are used, different aspects of ChIP-seq bias are better or worse accounted for, and peak calling can produce different results for the same ChIP-seq experiment. Consequently, generating "smart" controls, which model the non-signal effect for a specific ChIP-seq experiment, could enhance contrast and increase the reliability and reproducibility of the results.

Results: We propose a peak calling algorithm, Weighted Analysis of ChIP-seq (WACS), which is an extension of the well-known peak caller MACS2. There are two main steps in WACS: First, weights are estimated for each control using non-negative least squares regression. The goal is to customize controls to model the noise distribution for each ChIP-seq experiment. This is then followed by peak calling.
We demonstrate that WACS significantly outperforms MACS2 and AIControl, another recent algorithm for generating smart controls, in the detection of enriched regions along the genome, in terms of motif enrichment and reproducibility analyses.

Conclusion: This ultimately improves our understanding of ChIP-seq controls and their biases, and shows that WACS results in a better approximation of the noise distribution in controls.

Keywords: ChIP-seq; Controls; Bias

*Correspondence: [email protected]
1School of Electrical Engineering and Computer Science, University of Ottawa, K1N6N5, Ottawa, Canada
2Regenerative Medicine Program, Ottawa Hospital Research Institute, K1H8L6 Ottawa, Canada
Full list of author information is available at the end of the article

Background

High throughput sequencing technologies help in uncovering the mechanisms of gene regulation and cell adaptation to external and internal environments [1, 2]. One widely used technology is chromatin immunoprecipitation followed by next generation sequencing (ChIP-seq). It allows the genome-wide investigation of the structural and functional elements encoded in a genomic sequence, such as transcriptional regulatory elements. The main goal of a ChIP-seq experiment is the detection of protein-DNA binding sites and histone modifications genome-wide in various cell lines and tissues. Many peak calling methods have been proposed for the identification of regions of enrichment (putative binding sites) in ChIP-seq data [3, 4, 5, 6, 7].

Every experiment is prone to noise and bias, and ChIP-seq experiments are no exception. While some read pileups correspond to regions of true enrichment, others may be a result of the distortion of the ChIP-seq signal. Biased or noisy datasets (with a high number of false negative or false positive peaks) negatively impact downstream biological and computational analyses [8]. Thus, accounting for both noise and bias is important. Existing peak callers generally account for noise by assessing statistical significance under some statistical model. Bias is a more complicated subject and is usually addressed explicitly only via some control data to which the ChIP-seq is compared. We return to the issue of controls shortly.

There are many sources of bias in a ChIP-seq experiment. In the experimental design, for example, the quality of the experiment is predetermined by antibody and immunoprecipitation specificity. Low sensitivity, resulting from poor affinity to the target protein of interest, or low specificity, from cross reactivity with other unrelated proteins, degrades the quality of a ChIP-seq experiment [9]. The fragmentation step may also introduce bias [10]. Prior to immunoprecipitation, the DNA-protein complexes undergo fragmentation. However, due to the non-uniform nature of the chromatin structure (DNA), some regions are more densely packed (heterochromatin) than others and are thus more resistant to fragmentation. Less densely packed regions (euchromatin) will undergo more fragmentation. Another source of bias is mappability, which is the extent to which reads are uniquely mapped to regions along the genome [10, 11]. In an ideal situation, long enough reads are used such that there is higher coverage and uniformity in coverage. However, in practice, read length is short and there are "ambiguous" reads that map to multiple regions. Such reads can either be multiply mapped (creating false positive ChIP-seq signal) or discarded (creating empty, unmappable regions), with either choice creating a different sort of bias. GC content bias [12, 13], introduced by PCR amplification or sequencing, also results in imbalanced coverage of reads along the genome. For example, in PCR amplification, GC rich fragments are targeted more than the GC poor fragments. These variations in coverage can have a significant impact on the results obtained.

Systematic and experimental biases hinder the full potential of ChIP-seq analysis. Thus, the quality of the input samples is important, especially in large scale analysis where low quality datasets have greater effects [8, 14]. Consequently, more than a decade after ChIP-seq was introduced, the ENCODE and modENCODE consortia developed a set of ChIP-seq quality control metrics and guidelines to produce high quality reproducible data [9]. The protocols address all the stages of a ChIP-seq experiment, as bias and noise may be introduced at various stages, such as experimental design, execution, evaluation and storage methods [10].

One essential step for the alleviation of bias is the incorporation of control datasets in ChIP-seq analysis. It assists in the selection of true enrichment binding sites from false positives. Controls, such as input DNA and IgG, attempt to minimize the effects of immunoprecipitation, antibody imprecision, PCR-amplification, mappability bias, etc., and thereby increase the reliability of the results. In the input DNA, using the same conditions as the original ChIP-seq experiment, the DNA undergoes cross linkage and fragmentation. However, neither antibody nor immunoprecipitation is used [9]. For the IgG control, sometimes referred to as a "mock" ChIP-seq experiment, all the same steps and conditions as the original ChIP-seq experiment are applied. However, a control antibody (not specific to the protein of interest) is adopted to interact with non-relevant genomic positions [9]. DNase-seq and ATAC-seq are used to tackle open chromatin regions. According to ENCODE [9], the input DNA and IgG controls should have a sequencing depth greater than or equal to the original ChIP-seq experiment. Higher sequencing depth is recommended since input DNA signals represent broader genomic chromatin regions than ChIP-seq [9, 10]. Other crucial factors addressed by the protocols include, but are not limited to, biological/technical replicates and library complexity.

Many existing peak calling algorithms allow testing enrichment compared to a control [7, 15, 16, 17, 18, 19, 20]. Whether biases in controls and ChIP-seq data are the same is not known, however. None of these methods selects a control or estimates background signals. Depending on which controls are selected and their nature, peak callers can produce different results (i.e., binding site positions) for the same ChIP-seq experiment. The BIDCHIPS [21], CloudControl [22] and AIControl [23] studies have shown that different ChIP-seq datasets can be biased in different ways. They address different biases in different ChIP-seq datasets via the integration of multiple control datasets through regression to improve enrichment analysis. There are some limitations to these studies, however.

For example, BIDCHIPS [21] has the ability to reprioritize peaks already identified by another peak calling method. However, only five notions of control are accounted for and there are no mechanisms for de novo peak calling based on the combined control [21]. The Hiranuma et al. [22, 23] studies prove the advantage of using more controls to model the background signal. In CloudControl [22], the controls are subsampled in their regression fit proportional to their weights.

[Figure 1: Flowcharts for WACS and MACS2. Both methods take controls and a treatment as input. WACS: learn background noise distribution by estimating weights per control; compute weighted pileup for controls; peak detection. MACS2: pool controls together; compute pileup for pooled control; peak detection.]

Algorithm 1: Derive Weights
Input: Control samples (BAM) and ChIP-seq sample (BAM)
Output: Weights per control
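As a rough illustration of the weight-estimation step named in Algorithm 1, the sketch below fits non-negative weights for a set of controls against a treatment signal and compares the resulting weighted background with a simple pooled one. It is a minimal sketch under stated assumptions, not the WACS implementation: it works on synthetic binned read counts rather than BAM files, the function names (estimate_control_weights, pooled_background) are hypothetical, and rescaling the pooled control to the treatment's total count is a simplification of what a pooling peak caller does.

```python
import numpy as np
from scipy.optimize import nnls


def estimate_control_weights(controls, treatment):
    """Fit non-negative weights w minimizing ||controls.T @ w - treatment||_2.

    controls:  array of shape (n_controls, n_bins), binned read counts (assumed input).
    treatment: array of shape (n_bins,), binned ChIP-seq read counts.
    Returns (weights, residual_norm).
    """
    design = np.asarray(controls, dtype=float).T  # one column per control
    target = np.asarray(treatment, dtype=float)
    weights, residual_norm = nnls(design, target)
    return weights, residual_norm


def pooled_background(controls, treatment):
    """Sum all controls and rescale to the treatment's total count (simplifying assumption)."""
    pooled = np.asarray(controls, dtype=float).sum(axis=0)
    return pooled * (np.sum(treatment) / pooled.sum())


# Synthetic data: three controls with unrelated coverage profiles.
rng = np.random.default_rng(0)
n_bins = 2000
rates = rng.gamma(shape=2.0, scale=5.0, size=(3, n_bins))
controls = rng.poisson(rates).astype(float)

# Hypothetical ground truth: the treatment background resembles controls 0 and 2 only.
true_weights = np.array([0.7, 0.0, 0.3])
treatment = rng.poisson(true_weights @ controls).astype(float)

weights, weighted_resid = estimate_control_weights(controls, treatment)
weighted_bg = weights @ controls  # weighted control pileup, one value per bin
pooled_resid = np.linalg.norm(pooled_background(controls, treatment) - treatment)

print("estimated weights:", np.round(weights, 3))
print("residual (weighted):", round(float(weighted_resid), 2))
print("residual (pooled):  ", round(float(pooled_resid), 2))
```

Because an equally weighted, rescaled pool is itself one admissible non-negative weighting, the fitted weights cannot give a larger least-squares residual than the pooled control on the same bins. Whether that closer fit translates into better peak calls is a separate question, which the paper addresses through motif enrichment and reproducibility analyses.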
