bioRxiv preprint doi: https://doi.org/10.1101/582650; this version posted November 15, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Awdeh et al.

RESEARCH

WACS: Improving ChIP-seq Peak Calling by Optimally Weighting Controls

Aseel Awdeh1,2*, Marcel Turcotte1 and Theodore J. Perkins1,2,3

Abstract

Motivation: Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq), initially introduced more than a decade ago, is widely used by the scientific community to detect protein/DNA binding and histone modifications across the genome. Every experiment is prone to noise and bias, and ChIP-seq experiments are no exception. To alleviate bias, the incorporation of control datasets in ChIP-seq analysis is an essential step. The controls are used to account for the background signal, while the remainder of the ChIP-seq signal captures true binding or histone modification. However, a recurrent issue is that different ChIP-seq experiments exhibit different types of bias. Depending on which controls are used, different aspects of ChIP-seq bias are better or worse accounted for, and peak calling can produce different results for the same ChIP-seq experiment. Consequently, generating “smart” controls, which model the non-signal effect for a specific ChIP-seq experiment, could enhance contrast and increase the reliability and reproducibility of the results.

Results: We propose a peak calling algorithm, Weighted Analysis of ChIP-seq (WACS), which is an extension of the well-known peak caller MACS2. There are two main steps in WACS. First, weights are estimated for each control using non-negative least squares regression. The goal is to customize controls to model the noise distribution for each ChIP-seq experiment. This is then followed by peak calling. We demonstrate that WACS significantly outperforms MACS2 and AIControl, another recent algorithm for generating smart controls, in the detection of enriched regions along the genome, in terms of motif enrichment and reproducibility analyses.

Conclusion: This ultimately improves our understanding of ChIP-seq controls and their biases, and shows that WACS results in a better approximation of the noise distribution in controls.

Keywords: ChIP-seq; Controls; Bias

*Correspondence: [email protected]
1 School of Electrical Engineering and Computer Science, University of Ottawa, K1N6N5, Ottawa, Canada
2 Regenerative Medicine Program, Ottawa Hospital Research Institute, K1H8L6 Ottawa, Canada
Full list of author information is available at the end of the article

Background

High throughput sequencing technologies help in uncovering the mechanisms of gene regulation and cell adaptation to external and internal environments [1, 2]. One widely used technology is chromatin immunoprecipitation followed by next generation sequencing (ChIP-seq). It allows the genome-wide investigation of the structural and functional elements encoded in a genomic sequence, such as transcriptional regulatory elements. The main goal of a ChIP-seq experiment is the detection of protein-DNA binding sites and histone modifications genome-wide in various cell lines and tissues. Many peak calling methods have been proposed for the identification of regions of enrichment (putative binding sites) in ChIP-seq data [3, 4, 5, 6, 7].

Every experiment is prone to noise and bias, and ChIP-seq experiments are no exception. While some read pileups correspond to regions of true enrichment, others may be a result of the distortion of the ChIP-seq signal. Biased or noisy datasets (with a high number of false negative or false positive peaks) negatively impact downstream biological and computational analyses [8]. Thus, accounting for both noise and bias is important. Existing peak callers generally account for noise by assessing statistical significance under some statistical model. Bias is a more complicated subject and is usually addressed explicitly only via some control data to which the ChIP-seq is compared. We return to the issue of controls shortly.

There are many sources of bias in a ChIP-seq experiment. In the experimental design, for example, the quality of the experiment is predetermined by antibody and immunoprecipitation specificity. Low sensitivity, resulting from poor affinity to the target protein of interest, or low specificity, from cross reactivity with other unrelated proteins, degrades the quality of a ChIP-seq experiment [9]. The fragmentation step may also introduce bias [10]. Prior to immunoprecipitation, the DNA-protein complexes undergo fragmentation. However, due to the non-uniform nature of the chromatin structure (DNA), some regions are more densely packed (heterochromatin) than others and are thus more resistant to fragmentation. Less densely packed regions (euchromatin) will undergo more fragmentation. Another source of bias is mappability, which is the extent to which reads are uniquely mapped to regions along the genome [10, 11]. In an ideal situation, long enough reads are used such that there is higher coverage and uniformity in coverage. However, in practice, read length is short and there are “ambiguous” reads that map to multiple regions. Such reads can either be multiply mapped (creating false positive ChIP-seq signal) or discarded (creating empty, unmappable regions), with either choice creating a different sort of bias. GC content bias [12, 13], introduced by PCR amplification or sequencing, also results in imbalanced coverage of reads along the genome. For example, in PCR amplification, GC-rich fragments are targeted more than GC-poor fragments. These variations in coverage can have a significant impact on the results obtained.

[Figure 1: Flowcharts for WACS and MACS2. Both methods take controls and a treatment as input. WACS: learn the background noise distribution by estimating weights per control; compute the weighted pileup for the controls; peak detection. MACS2: pool the controls together; compute the pileup for the pooled control; peak detection.]

Systematic and experimental biases hinder the full potential of ChIP-seq analysis. Thus, the quality of the input samples is important, especially in large scale analysis where low quality datasets have greater effects [8, 14]. Consequently, more than a decade after ChIP-seq was introduced, the ENCODE and modENCODE consortia developed a set of ChIP-seq quality control metrics and guidelines to produce high quality reproducible data [9]. The protocols address all the stages of a ChIP-seq experiment, as bias and noise may be introduced at various stages, such as experimental design, execution, evaluation and storage methods [10].

One essential step for the alleviation of bias is the incorporation of control datasets in ChIP-seq analysis. Controls assist in distinguishing true binding sites from false positives. Controls, such as input DNA and IgG, attempt to minimize the effects of immunoprecipitation, antibody imprecision, PCR amplification, mappability bias, etc., and thereby increase the reliability of the results. For the input DNA control, the DNA undergoes cross-linking and fragmentation under the same conditions as the original ChIP-seq experiment. However, neither antibody nor immunoprecipitation is used [9]. For the IgG control, sometimes referred to as a “mock” ChIP-seq experiment, all the same steps and conditions as the original ChIP-seq experiment are applied. However, a control antibody (not specific to the protein of interest) is adopted to interact with non-relevant genomic positions [9]. DNase-seq and ATAC-seq are used to tackle open chromatin regions. According to ENCODE [9], the input DNA and IgG controls should have a sequencing depth greater than or equal to that of the original ChIP-seq experiment. Higher sequencing depth is recommended since input DNA signals represent broader genomic chromatin regions than ChIP-seq [9, 10]. Other crucial factors addressed by the protocols include, but are not limited to, biological/technical replicates and library complexity.

Many existing peak calling algorithms allow testing enrichment compared to a control [7, 15, 16, 17, 18, 19, 20]. Whether biases in controls and ChIP-seq data are the same is not known, however. None of these methods selects a control or estimates background signals. Depending on which controls are selected and their nature, peak callers can produce different results (i.e., binding site positions) for the same ChIP-seq experiment. The BIDCHIPS [21], CloudControl [22] and AIControl [23] studies have shown that different ChIP-seq datasets can be biased in different ways. They address different biases in different ChIP-seq datasets via the integration of multiple control datasets through regression to improve enrichment analysis. There are some limitations to these studies, however.

For example, BIDCHIPS [21] has the ability to reprioritize peaks already identified by another peak calling method. However, only five notions of control are accounted for and there are no mechanisms for de novo peak calling based on the combined control [21]. The Hiranuma et al. [22, 23] studies demonstrate the advantage of

using more controls to model the background signal. In CloudControl [22], the controls are subsampled in proportion to their weights in the regression fit. This then allows the single customized control to be used as input to any peak calling method. However, the downsampling of the combined controls may introduce noise into the control signal.

AIControl [23], a peak calling framework, is an extension of CloudControl [22]. It integrates a group of publicly available control datasets and uses ridge regression to model the background signal. This eliminates the need for the user to input controls. However, some users may want to provide their own controls, and this is not accommodated. Additionally, the number of datasets in ENCODE increases with time, so allowing controls as input in a weighted peak caller is important to represent the newly available datasets and newly explored cell lines.

In this work, we introduce a peak calling algorithm, Weighted Analysis of ChIP-Seq (WACS), which utilizes “smart” controls to model the non-signal effect for a specific ChIP-seq experiment. WACS first estimates the weights for each input control, without requiring the fine-tuning of any parameters. Using the weighted controls, WACS then proceeds to detect regions of enrichment along the genome. WACS is an extension of MACS2.1.1 (Model-based Analysis for ChIP-Seq) [18], the most highly cited open source peak caller. Our development of WACS based on MACS2 allows researchers to use the weighted approach within a peak calling method with which they are familiar, and which has many refined features. Fragment length estimation/detection, read shifting, candidate peak identification, and peak assessment remain the same, while the construction of the control via the weighted combination of datasets is different. To allow for potentially large numbers of controls, we restructure the code internally, in a way that is invisible to the user, to reduce its memory footprint. We also correct a hashing bug in the pileup-computing code of MACS2, which becomes especially important when we have high read depth and/or many controls. (This bug has subsequently been corrected in the main MACS2 distribution as well.)

We evaluate WACS on a large collection of 438 ChIP-seq datasets and 147 control datasets from the K562 cell line in the ENCODE database [24]. To establish generalizability and study performance in a less expansive setting, we also investigate WACS on 20 ChIP-seq datasets for each of the A549, GM12878 and HepG2 cell lines. (The terms ChIP-seq and treatment are used interchangeably throughout the paper.) The results demonstrate the importance of smart bias removal methods and the use of customized control datasets for each ChIP-seq experiment, as the amount of bias varies across different ChIP-seq experiments. In the investigation of downstream genomic analysis, such as motif enrichment and reproducibility, the use of weighted controls in WACS shows a significant improvement in peak detection in comparison with the pooled unweighted controls in MACS2 and weighted controls in AIControl.

Methods

WACS: Weighted Analysis of ChIP-seq

Our approach, WACS, estimates a background distribution by weighting controls, and ultimately identifies regions of enrichment along the genome (Figure 1 and Supplementary Figure 1). Below we describe the five major steps of the WACS algorithm. To implement WACS, we modified a well-known open source algorithm, MACS2. Because there is limited written description of how MACS2 works, we describe some parts of MACS2 to fully describe WACS. The WACS algorithm is summarized into two parts: Derive Weights (Algorithm 1) and Peak Detection (Algorithm 2).

Algorithm 1: Derive Weights

The control and treatment samples (in BAM format) are first preprocessed, as seen in Algorithm 1. Using SAMtools [25], we index, sort and optionally filter (remove duplicates from) the BAM files (line 2 in Algorithm 1). We then use BEDtools [26] to convert the BAM files of mapped reads into read counts per 200 base pair (bp) windows along the genome with 50 bp increments (line 3 in Algorithm 1).

Algorithm 1 Derive Weights
Input: Control samples (BAM) and ChIP-seq sample (BAM)
Output: Weights per control
1: procedure DeriveWeights
2:     Index, sort and filter the BAM files.
3:     Produce read counts per 200bp windows along genome.
4:     Normalize control and ChIP-seq counts by reads per billion.
5:     Compute control weights using non-negative linear least-squares regression.

Next, WACS normalizes the mapped reads per window for the preprocessed control and treatment samples. This ensures that the control and treatment samples are on the same scale. WACS applies reads per billion normalization to both the control and ChIP-seq samples (line 4 in Algorithm 1). For each sample $m$ and window $i$:

$$n_{mi} \leftarrow r_{mi} \times 10^{9} \div \mathrm{TotalReadCount}_m$$

where $r_{mi}$ is the read count in the window, $n_{mi}$ is the normalized read count, and $\mathrm{TotalReadCount}_m$ is the total number of reads in sample $m$. This effectively reproduces the normalization in MACS2, which linearly scales the control sample to the ChIP-seq sample. In what follows, we assume $k$ total controls comprise samples 1 to $k$, and sample $k+1$ is the ChIP-seq data.

WACS then calculates the weights per input control (line 5 in Algorithm 1). WACS performs non-negative least squares (NNLS) to model the treatment dataset as a function of the controls. The overall objective of the regression is to find the values of the parameters (weights) that minimize the sum of squared differences between predictions and target values, with an additional constraint that allows only non-negative weights. Given $n$ instances (windows), target values $y_i = n_{k+1,i}$ (one per window), feature vectors $x_i = (n_{1i}, \ldots, n_{ki})$ (one vector per window), a vector $\Theta$ of coefficient weights and a constant offset $\Theta_0$, NNLS's objective function is:

$$\min_{\Theta,\,\Theta_0} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \Theta \cdot x_i - \Theta_0 \right)^2 \quad \text{subject to } \Theta \ge 0 \text{ and } \Theta_0 \ge 0$$

To solve the NNLS regression we rely on the nnls function from scipy.optimize, part of the scipy [27] package in Python. This produces a weighted control model for the treatment, with weights that indicate the relative importance of each control in modelling the treatment background signal. Zero weights are given to controls not required for modelling the treatment experiment. If there is one control, WACS and MACS2 produce the same output, as by default, the control in WACS gets a weight of exactly 1. The controls can also be weighted by the user, instead of using NNLS to compute the weights of the controls.

Algorithm 2: Peak Detection

WACS is identical to MACS2 in its initial processing of the treatment sample, including: loading the mapped reads (line 2); estimation/calculation of fragment length d, which differs depending on whether the ChIP-seq reads are sequenced single-end or paired-end (line 3); and construction of the treatment pileup, which also differs for single-end or paired-end reads (line 4). Because these details have been described elsewhere, we do not repeat them here [18, 28, 29].

Algorithm 2 Peak Detection
Input: Control samples (BAM), control weights, ChIP-seq (BAM)
Output: Detected peaks
1: procedure Peak Detection
       ▷ Compute ChIP-seq pileup and associated values
2:     Read treatment sample.
3:     Estimate/compute fragment length d based on treatment.
4:     treatmentPileup ← Compute treatment pileup.
5:     λBG ← treatReadcount × d ÷ genomeSize
6:     LengthScales ← [d, 1kb, 10kb]
       ▷ Initialize control pileup to zero at each length scale.
7:     for j = 1 to 3 do
8:         ControlPileup[j] ← 0
       ▷ Read and accumulate each control.
9:     for each control i do
10:        Read control i into FixedWidthTrack_i.
           ▷ Loop over length scales.
11:        for j = 1 to 3 do
               ▷ Compute scale factor.
12:            sf ← (d ÷ LengthScales[j]) × (treatReadcount ÷ controlReadcount_i)
               ▷ Compute weighted pileup at this scale.
13:            currentPileup ← sf × controlWeight_i × BidirectionalExtendReads(FixedWidthTrack_i, LengthScales[j])
               ▷ Add into growing control pileup.
14:            ControlPileup[j] ← ControlPileup[j] + currentPileup
       ▷ Find maximum control pileup.
15:    λlocal ← maximum of λBG and pointwise maximum of ControlPileup at three different length scales
       ▷ Call peaks.
16:    CallPeaks(treatmentPileup, λlocal)

Where WACS differs substantially from MACS2 is how it reads in, processes, and combines the control samples. WACS reads the controls into memory one at a time, accumulating them into overall (weighted) control pileups at three different length scales: d, 1kb and 10kb. The length scale is essentially the diameter of a Parzen-windows density estimator used to smooth the control reads. As each control is read in, it is smoothed, scaled so that its total reads are commensurate with the treatment, and further scaled by the control weight computed in Algorithm 1 (unless the user opts for unweighted controls). The function BidirectionalExtendReads performs the actual smoothing, extending the read starts into intervals with diameter equal to the length scale. The smoothed and scaled control is added to the growing overall control at that length scale. In contrast, MACS2 reads all the control data in before beginning smoothing, which can create an unmanageable memory footprint when very many controls are being combined. WACS then (as does MACS2) creates an overall control pileup by taking the pointwise maximum of the “background” read density λBG and the control pileups computed at each length scale.

Finally, WACS calls peaks using the same mechanism as MACS2, which involves identifying candidate peaks and comparing the pileup heights at their summits with the control track. In the case of unweighted controls, WACS produces an identical control track to MACS2 and identical peak calls. However, when control samples are weighted differently, a different control track is produced and different peaks may be called. Each peak is associated with a p-value and a q-value, the latter accounting for multiple comparisons across the entire genome.
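The weight derivation of Algorithm 1 (lines 4 and 5) can be illustrated with a minimal sketch. It assumes the per-window read counts have already been produced; the function names `reads_per_billion` and `fit_control_weights` are hypothetical and are not taken from the WACS code. The column of ones used to fit the offset $\Theta_0$ under the same non-negativity constraint is also an assumption of this sketch, since the text does not say how the offset is passed to `scipy.optimize.nnls`.

```python
import numpy as np
from scipy.optimize import nnls


def reads_per_billion(window_counts, total_reads):
    """Scale per-window read counts: n_mi = r_mi * 1e9 / TotalReadCount_m."""
    return np.asarray(window_counts, dtype=float) * 1e9 / total_reads


def fit_control_weights(controls, treatment):
    """Fit treatment ~ controls @ weights + offset with weights, offset >= 0.

    controls:  (n_windows, k) array of normalized control counts
    treatment: (n_windows,) array of normalized treatment counts
    """
    controls = np.asarray(controls, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    # Assumption: the offset is fitted by appending a column of ones, so that
    # nnls applies the same non-negativity constraint to it.
    design = np.column_stack([controls, np.ones(len(treatment))])
    coefficients, _residual_norm = nnls(design, treatment)
    return coefficients[:-1], coefficients[-1]
```

The 1/(2n) factor in the objective does not change the minimizer, so it is omitted here. A control that contributes nothing to the fit simply receives a weight of zero.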

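The control accumulation in Algorithm 2 (lines 5 to 15) can be sketched on a toy genome represented as per-base read-start counts. This is an illustration under stated assumptions, not the WACS implementation: `extend_reads` is a hypothetical stand-in for BidirectionalExtendReads that smooths with a box of width equal to the length scale, `local_lambda` is a hypothetical name, and all controls are held in one array for brevity, whereas WACS reads them from disk one at a time.

```python
import numpy as np


def extend_reads(read_starts, length):
    """Smooth read-start counts with a box of width `length` (roughly centred).

    Assumes `length` does not exceed the genome length.
    """
    return np.convolve(read_starts, np.ones(length), mode="same")


def local_lambda(control_starts, weights, treat_reads, d, length_scales):
    """Return the pointwise maximum of lambda_BG and the weighted control pileups.

    control_starts: (n_controls, genome_size) array of read-start counts
    weights:        one non-negative weight per control
    """
    control_starts = np.asarray(control_starts, dtype=float)
    genome_size = control_starts.shape[1]
    lambda_bg = treat_reads * d / genome_size
    pileups = np.zeros((len(length_scales), genome_size))
    # Accumulate one control at a time, as in lines 9 to 14 of Algorithm 2.
    for starts, weight in zip(control_starts, weights):
        control_reads = starts.sum()
        for j, scale in enumerate(length_scales):
            sf = (d / scale) * (treat_reads / control_reads)
            pileups[j] += sf * weight * extend_reads(starts, scale)
    return np.maximum(lambda_bg, pileups.max(axis=0))
```

With a single control of weight 1, the result reduces to the unweighted case; a control of weight 0 leaves only the genome-wide background λBG.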
Duplicate removal.
Duplicate reads (multiple reads mapped to the same position on the genome) are often due to the over-amplification of DNA fragments by PCR, which leads to the repeated sequencing of a DNA fragment. For WACS and MACS2, duplicate removal is optional. To produce more reliable peak calls, MACS2/WACS remove redundant reads at each genomic locus for both the treatment and control datasets [18]. The default number per genomic locus is determined by the sequencing depth. However, when dealing with multiple controls, MACS2 performs duplicate removal after pooling reads. WACS does the same thing when used in unweighted mode, for the sake of consistency with MACS2. In this case, apparent “duplicates” arising from different sequencing runs may be removed incorrectly, artificially flattening the control read distribution in high density areas. This phenomenon can be particularly prominent when hundreds of controls are being pooled. Thus, we recommend that users who want to perform de-duplication do so prior to feeding the mapped read files to MACS2 or WACS.

Experimental Evaluation.
We evaluated WACS, MACS2.1.1 (https://github.com/taoliu/MACS) and AIControl (https://github.com/hiranumn/AIControl.jl/) on 438 ChIP-seq (treatment) and 147 control samples from the K562 cell line; and 20 ChIP-seq and 20 control samples for each of the A549, GM12878, and HepG2 cell lines. (See Supplementary Tables 1, 2, 4, 5, and 6 for the accession codes of samples.) As seen in Figure 1 (and Supplementary Figure 1), MACS2 pools the controls together for each ChIP-seq sample, whereas WACS estimates a weight for each control and computes a unique weighted control pileup for each ChIP-seq sample. AIControl uses a predefined set of publicly available controls [23]. For each ChIP-seq sample, we generated peaks under four conditions: (1) MACS2 with all the controls from the same cell line (All MACS2), (2) MACS2 with the matched ENCODE controls (Matched MACS2), (3) WACS with all the controls from the same cell line (WACS) and (4) AIControl with its predefined controls (AIControl).

Next, we used two methods to compare the quality of the peaks generated by WACS, MACS2 and AIControl. One method considers all the original peaks output by each algorithm (called All Peaks). However, different peak callers can produce peaks in different locations based on the same data, and they can also produce different numbers of peaks. Thus, for additional comparison, we adopted the standardization procedure proposed by Hiranuma et al. [23], where the peak width and number of peaks are normalized for each treatment sample. First, the peak width is normalized by binning the peaks in 1000 base pair windows. For example, a peak at chromosome 1 from 14520 to 15420 is counted as two peaks covering bins 14000 to 15000 and 15000 to 16000. Next, the number of peaks under all four conditions for the same dataset is normalized. The top n most statistically significant peaks are selected for each of the peak-width normalized datasets, such that n is the minimum number of peaks under the four conditions.

Results

Peaks identified by WACS are more enriched for known sequence motifs.
The purpose of ChIP-seq analysis is the identification of regions of enrichment, such as transcription factor (TF) binding sites, along the genome. To evaluate the performance of our method in comparison to MACS2 and AIControl, we performed motif enrichment analysis. In this and the following two subsections, we focus on the K562 results, with results in additional cell lines reported further below. Adopting a similar motif analysis method as in [23], we first used JASPAR to obtain position weight matrices (PWMs) for each unique TF [30]. (See Supplementary Table 3 for the PWM IDs per TF.) Using PWMs as input, we then used FIMO (Find Individual Motif Occurrences) [31] in the MEME suite [32] to scan the genome and identify motif hits genome wide. We found motifs for 125 treatment samples. Motif hits tend to be enriched in genuine binding sites. In our analysis, peaks with a motif are considered as true positives, while those lacking a motif hit are considered false positives.

Table 1: Motif Enrichment Summary.

Type            WACS    All MACS2    Matched MACS2    AIControl
All Peaks       107     8            10               0
Standardized    90      8            16               11

Table 2: Reproducibility Analysis Summary.

Type            WACS    All MACS2    Matched MACS2    AIControl
All Peaks       134     19           41               3
Standardized    115     36           37               9

Fig 2a and 2b display the motif enrichment for each of these ChIP-seq datasets for all peaks and standardized peaks respectively, when using WACS (blue line), All MACS2 (red line), Matched MACS2 (green line) and AIControl (purple line). Using all the peaks, WACS outperforms All MACS2, Matched MACS2 and

[Figure 2 plots, panels (a) and (b): y-axis, Motif Enrichment; x-axis, ChIP samples (labelled by TF and ENCODE accession); legend: WACS, All MACS2, Matched MACS2, AIControl, No Control.]

Figure 2: (a) Motif enrichment of the treatment samples, for each of the four peak calling methods. Motif enrichment (precision) is defined as the fraction of all peaks that contain at least one motif occurrence for the TF in question. (b) Motif enrichment for the standardized peaks.
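The motif enrichment (precision) measure defined in the caption of Figure 2 can be written down as a short sketch. It assumes peaks and motif hits are given as (chromosome, start, end) tuples, and that a peak "contains" a motif occurrence when the hit lies entirely inside the peak; that containment rule and the function name `motif_enrichment` are assumptions of this illustration, not details stated in the text.

```python
from collections import defaultdict


def motif_enrichment(peaks, motif_hits):
    """Fraction of peaks that contain at least one motif hit.

    peaks, motif_hits: iterables of (chrom, start, end) tuples.
    Assumption: a hit counts only if it lies entirely within the peak.
    """
    peaks = list(peaks)
    if not peaks:
        return 0.0
    hits_by_chrom = defaultdict(list)
    for chrom, start, end in motif_hits:
        hits_by_chrom[chrom].append((start, end))
    peaks_with_motif = 0
    for chrom, peak_start, peak_end in peaks:
        hits = hits_by_chrom.get(chrom, ())
        # Peaks with a hit are treated as true positives, the rest as false positives.
        if any(peak_start <= s and e <= peak_end for s, e in hits):
            peaks_with_motif += 1
    return peaks_with_motif / len(peaks)
```

The linear scan over hits keeps the sketch short; for genome-wide FIMO output, sorted hit lists or an interval index would be the more practical choice.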

[Figure plots, two panels: y-axis, Percentage Overlap; x-axis, pairs of ENCODE sample accessions; legend: WACS, All MACS2, Matched MACS2, AIControl.]
ENCFF154YTN.ENCFF314OQP ENCFF331TRC.ENCFF535PLW ENCFF715AOC.ENCFF516TOF ENCFF463LMJ.ENCFF403WNR ENCFF463LMJ.ENCFF403WNR ENCFF191CWT.ENCFF536TPK ENCFF191CWT.ENCFF536TPK ENCFF891GFG.ENCFF094FZQ ENCFF546ZLD.ENCFF562ATM ENCFF904OXK.ENCFF077NYA ENCFF010TVY.ENCFF057SMG ENCFF261BXO.ENCFF035GBL ENCFF417FHN.ENCFF003FNG ENCFF891GFG.ENCFF094FZQ ENCFF546ZLD.ENCFF562ATM ENCFF904OXK.ENCFF077NYA ENCFF261BXO.ENCFF035GBL ENCFF010TVY.ENCFF057SMG ENCFF417FHN.ENCFF003FNG ENCFF168KTM.ENCFF299HHL ENCFF168KTM.ENCFF299HHL ENCFF723QZO.ENCFF019PLG ENCFF723QZO.ENCFF019PLG ENCFF323OVV.ENCFF050VHP ENCFF393ZAK.ENCFF680COV ENCFF323OVV.ENCFF050VHP ENCFF393ZAK.ENCFF680COV ENCFF168TBA.ENCFF024NZQ ENCFF361MTW.ENCFF503JQK ENCFF548AHK.ENCFF995UTD ENCFF168TBA.ENCFF024NZQ ENCFF361MTW.ENCFF503JQK ENCFF548AHK.ENCFF995UTD ENCFF439NNR.ENCFF483RLD ENCFF439NNR.ENCFF483RLD ENCFF581AZU.ENCFF500HFO ENCFF056HVT.ENCFF900QPQ ENCFF785NOU.ENCFF135YTU ENCFF691BQZ.ENCFF391HFU ENCFF581AZU.ENCFF500HFO ENCFF056HVT.ENCFF900QPQ ENCFF785NOU.ENCFF135YTU ENCFF691BQZ.ENCFF391HFU ENCFF597WPX.ENCFF856EXS ENCFF597WPX.ENCFF856EXS ENCFF471TDC.ENCFF360QAB ENCFF719XQY.ENCFF544DDY ENCFF471TDC.ENCFF360QAB ENCFF719XQY.ENCFF544DDY ENCFF477DKP.ENCFF495MEU ENCFF382DBE.ENCFF888AOR ENCFF477DKP.ENCFF495MEU ENCFF382DBE.ENCFF888AOR ENCFF400BSN.ENCFF321ZQU ENCFF400BSN.ENCFF321ZQU ENCFF056XVW.ENCFF540SXT ENCFF056XVW.ENCFF540SXT ENCFF253HOC.ENCFF773DBT ENCFF008CUX.ENCFF549MBS ENCFF253HOC.ENCFF773DBT ENCFF008CUX.ENCFF549MBS ENCFF239WGU.ENCFF836IMK ENCFF239WGU.ENCFF836IMK ENCFF408BKE.ENCFF497GOO ENCFF408BKE.ENCFF497GOO ENCFF019RWB.ENCFF353EFN ENCFF857YYV.ENCFF320WXN ENCFF019RWB.ENCFF353EFN ENCFF857YYV.ENCFF320WXN ENCFF161VHM.ENCFF156OFZ ENCFF954HWS.ENCFF124RUL ENCFF161VHM.ENCFF156OFZ ENCFF954HWS.ENCFF124RUL ENCFF979BFW.ENCFF102ZOS ENCFF433SUG.ENCFF461OAN ENCFF979BFW.ENCFF102ZOS ENCFF433SUG.ENCFF461OAN ENCFF837ZOY.ENCFF183MSQ ENCFF140ZMN.ENCFF171VBC ENCFF438DHS.ENCFF308RGN ENCFF837ZOY.ENCFF183MSQ ENCFF140ZMN.ENCFF171VBC ENCFF438DHS.ENCFF308RGN 
ENCFF641HZR.ENCFF006GQZ ENCFF641HZR.ENCFF006GQZ ENCFF771NSF.ENCFF078BWN ENCFF953ZZL.ENCFF729WGC ENCFF953ZZL.ENCFF729WGC ENCFF771NSF.ENCFF078BWN ENCFF726UVN.ENCFF388QNA ENCFF726UVN.ENCFF388QNA ENCFF036IOO.ENCFF358HOW ENCFF036IOO.ENCFF358HOW ENCFF068XVC.ENCFF976DWY ENCFF769DDY.ENCFF221CPW ENCFF068XVC.ENCFF976DWY ENCFF769DDY.ENCFF221CPW ENCFF530UEA.ENCFF437UWB ENCFF530UEA.ENCFF437UWB ENCFF846CYU.ENCFF226KOW ENCFF555QMG.ENCFF273NVY ENCFF846CYU.ENCFF226KOW ENCFF555QMG.ENCFF273NVY ENCFF071KOM.ENCFF236ZUN ENCFF071KOM.ENCFF236ZUN ENCFF964COC.ENCFF946HWT ENCFF964COC.ENCFF946HWT ENCFF910XWG.ENCFF338MSY ENCFF157WFG.ENCFF372NDR ENCFF910XWG.ENCFF338MSY ENCFF157WFG.ENCFF372NDR ENCFF384DMV.ENCFF730MAX ENCFF384DMV.ENCFF730MAX ENCFF329LWM.ENCFF334QAB ENCFF329LWM.ENCFF334QAB ENCFF400MMA.ENCFF593XTM ENCFF400MMA.ENCFF593XTM ENCFF081HVQ.ENCFF494VZW ENCFF794ABP.ENCFF587WWS ENCFF081HVQ.ENCFF494VZW ENCFF794ABP.ENCFF587WWS ENCFF337GMY.ENCFF821HMU ENCFF357MHM.ENCFF491EUN ENCFF337GMY.ENCFF821HMU ENCFF357MHM.ENCFF491EUN ENCFF363YKW.ENCFF473NMO ENCFF384WCY.ENCFF417HNM ENCFF363YKW.ENCFF473NMO ENCFF384WCY.ENCFF417HNM ENCFF454MMY.ENCFF310MOR ENCFF454MMY.ENCFF310MOR ENCFF931HWG.ENCFF562PEW ENCFF931HWG.ENCFF562PEW ENCFF615CWW.ENCFF940QAP ENCFF615CWW.ENCFF940QAP ENCFF014DRH.ENCFF553WMU ENCFF014DRH.ENCFF553WMU ENCFF203MUO.ENCFF618MHM ENCFF203MUO.ENCFF618MHM ENCFF145ZDW.ENCFF831ODM ENCFF145ZDW.ENCFF831ODM ENCFF205VDU.ENCFF109OWW ENCFF205VDU.ENCFF109OWW ENCFF958WMW.ENCFF339RWU ENCFF958WMW.ENCFF339RWU ChIP Replicates ChIP Replicates (a) (b) Figure 3: (a) Percentage overlap in all peaks between ENCODE replicates, for each of the four peak calling methods. (b) Percentage overlap in standardized peaks between ENCODE replicates, for each of the four peak calling methods
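The reproducibility comparison summarized in Figure 3 rests on two quantities described in the text: the percentage overlap between replicate peak sets, and a paired sign test across experiments. The following is a minimal Python sketch of both, not the WACS implementation. It assumes peaks are (chrom, start, end) tuples with half-open coordinates, treats a peak as overlapping when it shares at least one base with a peak in the other replicate, and assumes ties are dropped before the sign test. The function names are hypothetical.

```python
from math import comb


def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def percentage_overlap(rep1, rep2):
    """Percent of peaks, pooled over both replicates, that overlap a peak
    in the other replicate (one reading of the definition in the text)."""
    total = len(rep1) + len(rep2)
    if total == 0:
        return 0.0
    # Quadratic scan: fine for a sketch, too slow for genome-scale peak sets.
    hits1 = sum(any(overlaps(p, q) for q in rep2) for p in rep1)
    hits2 = sum(any(overlaps(q, p) for p in rep1) for q in rep2)
    return 100.0 * (hits1 + hits2) / total


def sign_test_p(wins, losses):
    """Two-sided exact sign test p-value under a fair-coin null.
    Assumes tied experiments were removed before counting."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy replicates: one overlapping pair, two unmatched peaks.
rep1 = [("chr1", 100, 200), ("chr1", 500, 600)]
rep2 = [("chr1", 150, 250), ("chr2", 500, 600)]
print(percentage_overlap(rep1, rep2))  # 50.0
print(sign_test_p(10, 0))              # 0.001953125
```

In practice, one would compute `percentage_overlap` once per experiment and per method, count the experiments on which one method beats another, and pass those counts to `sign_test_p`.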

AIControl on 107 treatment samples in total. WACS This suggests that these treatment samples may need also outperforms All MACS2, Matched MACS2 and to be more thoroughly investigated. Possibly the sam- AIControl on the majority of the treatment samples, ple are of poor quality, or the motifs in JASPAR are when peaks are standardized, as shown in Table 1. not representative of true binding preferences. Using a paired sign test with a two-tailed null hypoth- Another method for evaluating motif enrichment is esis to compare the precision of WACS with MACS2 the area under the precision-recall curve (AUPRC) and AIControl, we get a p-value of less than 10−5. [23]. The AURPC is designed to compare algorithms The two versions of MACS2 (with all or matched on the same set of instances. Each algorithm, how- controls) perform similarly to each other. Matched ever, generates a different set of peaks for a specific MACS2 slightly outperforms All MACS2, suggesting ChIP-seq dataset. Thus, we believe precision is a more that the ENCODE selected controls are well-matched appropriate evaluation metric than AUPRC for this to the treatment sample. Across the treatment sam- comparison. Nevertheless, for the purpose of compari- ples, AIControl has the highest fluctuations in the pre- son with AIControl [23] which uses the AUPRC met- cision values. The four methods perform similarly on ric, we performed the AUPRC analysis as well. Sup- treatment samples when the precision is less than 0.1. plementary Figure 2 shows an example precision-recall bioRxiv preprint doi: https://doi.org/10.1101/582650; this version posted November 15, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Awdeh et al. Page 7 of 11

ENCFF923LMN ENCFF938FQC ENCFF327XSI ENCFF241ZAF ENCFF825VBC ENCFF079SPF ENCFF497GOO ENCFF551IFV ENCFF157HYJ ENCFF408BKE ENCFF016KRJ ENCFF606LUD ENCFF725WJS ENCFF273NVY ENCFF490ICL ENCFF821HMU ENCFF564CXM ENCFF337GMY ENCFF722LJA ENCFF601CVJ ENCFF438GBD ENCFF559BJS ENCFF772XIR ENCFF079TBV ENCFF695TZQ ENCFF525CPA ENCFF072KEJ ENCFF033JFM ENCFF715AOC ENCFF516TOF ENCFF226IAA curve for the ChIP-seq dataset ENCFF109OWW with ENCFF880SVB ENCFF632WSK ENCFF041ZVD ENCFF362TFV ENCFF007ALP ENCFF908CYX ENCFF267EEY ENCFF131XTL ENCFF019RWB ENCFF353EFN ENCFF502NAY ENCFF703CKR ENCFF178YFO ENCFF700WSI ENCFF046ACV ENCFF933HFV ENCFF490XTX ENCFF258DFY ENCFF405YCF ENCFF686ZIH ENCFF168TBA ENCFF500YYZ ENCFF353XMB ENCFF546ZLD ENCFF883MZR ENCFF373AFZ ENCFF290MEZ ENCFF846CYU ENCFF226KOW ENCFF795OAZ ENCFF221CPW ENCFF769DDY ENCFF702WHX ENCFF750EKT TF ZNF24, and Supplementary Figure 3 shows the the ENCFF717OYW ENCFF109OWW ENCFF563DGZ ENCFF205VDU ENCFF931HHQ ENCFF562ATM ENCFF855NIZ ENCFF754GYS ENCFF744TCZ ENCFF182YDN ENCFF167NVK ENCFF254IUH ENCFF758LME ENCFF331BYU ENCFF487SUP ENCFF427CAA ENCFF314PZG ENCFF809OVE ENCFF857YYV ENCFF320WXN ENCFF358SSM ENCFF719FYJ ENCFF024NZQ ENCFF339RWU ENCFF958WMW ENCFF530UEA ENCFF563FZR ENCFF437UWB ENCFF856TBD ENCFF397LNB ENCFF284XYU ENCFF035CJB ENCFF350FCE AUPRC for each of these ChIP-seq datasets when us- ENCFF990RGE ENCFF550IUC ENCFF013MXK ENCFF752FIR ENCFF205JWI ENCFF677JTW ENCFF921IAQ ENCFF149LTH ENCFF873BRB ENCFF636IGK ENCFF799FNR ENCFF494OJC ENCFF823GCX ENCFF691BQZ ENCFF391HFU ENCFF535PLW ENCFF827SLL ENCFF331TRC ENCFF285CII ENCFF096KBP ENCFF597WPX ENCFF856EXS ENCFF892LAS ENCFF774RUP ENCFF360QBV ENCFF635HIS ENCFF641HZR ENCFF321ZQU ENCFF400BSN ENCFF115BME ENCFF695GTB ENCFF891GFG ENCFF094FZQ ENCFF156NIH ing standardized peaks. 
Using AUPRC, WACS outper- ENCFF613CDR ENCFF011OSI ENCFF201BQU ENCFF955AMI ENCFF185YRK ENCFF236CJR ENCFF869PBP ENCFF763ZGN ENCFF159SIY ENCFF630JRN ENCFF594OID ENCFF393ZAK ENCFF778PDI ENCFF569DZM ENCFF651HPM ENCFF140UJB ENCFF025QKV ENCFF919HAZ ENCFF670SFK ENCFF656BKZ ENCFF161VHM ENCFF156OFZ ENCFF553WMU ENCFF536TPK ENCFF191CWT ENCFF593XTM ENCFF400MMA ENCFF914ABF ENCFF836IMK ENCFF548NQU ENCFF412DRO ENCFF473NMO ENCFF363YKW forms All MACS2, Matched MACS2 and AIControl on ENCFF207NLX ENCFF564KCD ENCFF033UID ENCFF066TDG ENCFF555QMG ENCFF871KNP ENCFF323OVV ENCFF706XFF ENCFF008CUX ENCFF432NAV ENCFF008IZS ENCFF042LKP ENCFF420DGT ENCFF129RYE ENCFF066JMV ENCFF803TYM ENCFF837ZOY ENCFF698VVG ENCFF225IQS ENCFF077NYA ENCFF904OXK ENCFF036IOO ENCFF183MSQ ENCFF358HOW ENCFF459MNI ENCFF050VHP ENCFF549MBS ENCFF655OEX ENCFF544DDY ENCFF888AOR ENCFF979BFW ENCFF102ZOS ENCFF407GAB 114, 99 and 100 of the 125 treatment samples respec- ENCFF137QTZ ENCFF251EVS ENCFF619SRD ENCFF680KUB ENCFF757TIT ENCFF976DWY ENCFF068XVC ENCFF006GQZ ENCFF212PQL ENCFF625AJO ENCFF433SUG ENCFF461OAN ENCFF284MLL ENCFF465ZZG ENCFF581AZU ENCFF500HFO ENCFF904PBR ENCFF834YLT ENCFF180ZQP ENCFF690YWW ENCFF042SJF ENCFF623YPH ENCFF106BHY ENCFF384DMV ENCFF730MAX ENCFF047IWR ENCFF662HRL ENCFF391BBO ENCFF396AXY ENCFF701KAF ENCFF002PNZ ENCFF346SNT ENCFF786GEU ENCFF908PSF tively. These differences are statistically significant by ENCFF294DBF ENCFF334OKF ENCFF502VUK ENCFF719XQY ENCFF749LRJ ENCFF468LGZ ENCFF677BBS ENCFF512YJL ENCFF983RUX ENCFF900QPQ ENCFF056HVT ENCFF540SXT ENCFF408TSG ENCFF908MXL ENCFF876GQO ENCFF071KOM ENCFF171VBC ENCFF140ZMN ENCFF221XVM ENCFF811CBJ ENCFF014UUB −5 ENCFF924CYX ENCFF784MLU ENCFF729WGC ENCFF953ZZL ENCFF703YNU ENCFF527CTN ENCFF168KTM ENCFF299HHL ENCFF773DBT ENCFF253HOC ENCFF945TUI ENCFF581SOT a two-tailed sign test with p-value less than 10 . 
ENCFF589IXE ENCFF495MEU ENCFF477DKP ENCFF749RRI ENCFF050LIC ENCFF082TOF ENCFF202BLM ENCFF507UJN ENCFF903JXA ENCFF780GMX ENCFF812KIP ENCFF382DBE ENCFF092TPD ENCFF330ZVF ENCFF035GBL ENCFF261BXO ENCFF283ZIL ENCFF011XRF ENCFF684CFB ENCFF022IWG ENCFF179DZP ENCFF665EXC ENCFF790FNF ENCFF568RDF ENCFF516UTV ENCFF314OQP ENCFF793KZB ENCFF397ZVS ENCFF154YTN ENCFF904CPI ENCFF206YYH ENCFF479JUJ ENCFF951ITB ENCFF940QAP ENCFF615CWW ENCFF483RLD ENCFF463LMJ ChIP-seq Data ENCFF491EUN ENCFF263PLG ENCFF494VZW ENCFF081HVQ ENCFF561ILM ENCFF779CRV ENCFF797FVU ENCFF357MHM ENCFF439NNR ENCFF239WGU ENCFF014DRH ENCFF489YJG ENCFF728WOA ENCFF544ZPF ENCFF975ZXB ENCFF682XGC ENCFF602UAC ENCFF365BCR ENCFF355PQK ENCFF554KEG ENCFF438DHS ENCFF339YBR ENCFF492NUF ENCFF126SXE ENCFF310MOR ENCFF308RGN ENCFF784SCN Peaks identified by WACS are more reproducible. ENCFF401KIO ENCFF454MMY ENCFF334QAB ENCFF680COV ENCFF398SXO ENCFF044TAL ENCFF794ABP ENCFF204TUQ ENCFF587WWS ENCFF546IMN ENCFF329LWM ENCFF518PIS ENCFF151GOI ENCFF097GYE ENCFF946HWT ENCFF964COC ENCFF090SFR ENCFF880ZXO ENCFF786TLS ENCFF659KOL ENCFF614RBF ENCFF986KSB ENCFF425MJK ENCFF723QZO ENCFF880ODJ ENCFF600TNR ENCFF257QQX ENCFF384WCY ENCFF003FNG ENCFF417HNM ENCFF576EDK ENCFF925CYF ENCFF910XWG Ideally, a ChIP-seq peak calling algorithm is able to ENCFF060WJH ENCFF413VAR ENCFF057KBN ENCFF726AYE ENCFF385PRZ ENCFF236ZUN ENCFF315DZX ENCFF089XMI ENCFF403WNR ENCFF643UCP ENCFF814CHV ENCFF417FHN ENCFF110FDC ENCFF019PLG ENCFF178EJY ENCFF918CSI ENCFF799RPS ENCFF056XVW ENCFF589UVV ENCFF338MSY ENCFF388QNA ENCFF726UVN ENCFF865ISL ENCFF540DXB ENCFF554BLC ENCFF280AEF ENCFF949ENV ENCFF810IRF ENCFF760JNE ENCFF448SJG ENCFF786KLU ENCFF829XLC ENCFF771KEQ ENCFF103HTZ reproducibly identify true regions of enrichment along ENCFF606TJM ENCFF303ICH ENCFF980TTL ENCFF229VMU ENCFF802CIM ENCFF378UVV ENCFF201MXQ ENCFF049RJI ENCFF573OTF ENCFF738QLQ ENCFF247FGI ENCFF488BQX ENCFF785NOU ENCFF135YTU ENCFF318SPV ENCFF124RUL ENCFF323IVG ENCFF862XFS ENCFF715AMR ENCFF195WOK ENCFF833ZOK 
ENCFF954HWS ENCFF364SJK ENCFF611PXX ENCFF304KSV ENCFF373YTD ENCFF587HZM ENCFF015UIX ENCFF245YPG ENCFF125HKL ENCFF046VNU ENCFF668FCG ENCFF145NON the genome with no false positives. Reproducibility is ENCFF995UTD ENCFF548AHK ENCFF498EFE ENCFF302OZO ENCFF831ODM ENCFF145ZDW ENCFF400SKV ENCFF498KCA ENCFF649ORS ENCFF434YXL ENCFF922JWO ENCFF634KQT ENCFF623MQU ENCFF165STC ENCFF694KHP ENCFF557QLG ENCFF158BYF ENCFF997BDJ ENCFF285FGV ENCFF562PEW ENCFF931HWG ENCFF350NUJ ENCFF548STV ENCFF515YFJ ENCFF072SKI ENCFF030XLK ENCFF780DBH ENCFF728IYF ENCFF525JGC ENCFF031YZL ENCFF078BWN ENCFF650RFL ENCFF771NSF most commonly measured by computing the percent- ENCFF975JFV ENCFF989ORU ENCFF788ICY ENCFF340JWK ENCFF642RHD ENCFF874YYW ENCFF372NDR ENCFF157WFG ENCFF408GAN ENCFF641OBF ENCFF401PIJ ENCFF762TAN ENCFF569XLF ENCFF824KAR ENCFF360QAB ENCFF471TDC ENCFF895DLK ENCFF367OYT ENCFF361MTW ENCFF503JQK ENCFF010TVY ENCFF057SMG ENCFF203MUO age overlap of peaks between replicates [4, 5]. Accord- ENCFF618MHM ENCFF465FJI ENCFF945IIQ ENCFF276OII ENCFF388TPI ENCFF796JTX ENCFF332SVJ ENCFF906SAI ENCFF715JBE ENCFF227IZS ENCFF204GLJ ENCFF910IKB ENCFF350IRK ENCFF448DJP ENCFF392XRJ ENCFF240RBJ ENCFF816BIC ENCFF423TQJ ENCFF172LMI ENCFF505ZUJ ENCFF596JNR ENCFF247YFL ENCFF772PJM ENCFF244GCJ ENCFF198JCQ ENCFF767FSP ENCFF967YTY ENCFF394JQH ENCFF942FFX ENCFF037PYE ENCFF100PTE ENCFF120YSP ENCFF036ZLP ENCFF824YEY ENCFF115YFR ENCFF332BFE ENCFF156EWJ ENCFF243UPF ENCFF159FKZ ENCFF322EZT ENCFF611KVY ENCFF256EBS ENCFF102RVF ENCFF209VFR ENCFF266FHE ENCFF652CFB ENCFF156FED ENCFF721LZU ENCFF873PSH ENCFF982BHL ENCFF679QYP ENCFF382XSA ENCFF874TOY ENCFF438FAN ENCFF584ERB ENCFF635ZZS ENCFF355SGP ENCFF756KAA ENCFF482LDC ENCFF383YOE ENCFF936BLQ ENCFF067XXK ENCFF493VPN ENCFF872FDX ENCFF790TAN ENCFF177OCL ENCFF699KKU ENCFF829HZY ENCFF123YHB ENCFF709XAA ENCFF993ZUS ENCFF868RUE ENCFF752BZZ ENCFF968UFH ENCFF156NSV ENCFF913HVS ENCFF720AUK ENCFF791OAY ENCFF390ATQ ENCFF647OZY ENCFF464ZQT ENCFF059OPC ENCFF875SXG ENCFF306LVM ENCFF577FNG 
ENCFF647SZO ENCFF358ZBU ENCFF533FQH ENCFF063BAN ENCFF332CUX ENCFF304AZH ENCFF441HVC ENCFF121DUE ENCFF936VXD ENCFF836GUS ENCFF230BZG ENCFF886VAQ ENCFF102ZZG ENCFF213VVQ ENCFF696ZGZ ENCFF984QXA ENCFF274BOZ ENCFF680XQR ENCFF234NVU ENCFF661ZRO ENCFF510BRO ENCFF321YGO ENCFF343UNB ENCFF458FYW ENCFF898KHD ENCFF795UVG ENCFF836OEO ENCFF092OEQ ENCFF092OLM ENCFF700BKM ENCFF554LOM ENCFF375ZQH ENCFF308PSW ENCFF623HUN ENCFF248LWZ ENCFF926NUN ENCFF345ODV ENCFF783MVR ENCFF604MCA ENCFF895QZG ENCFF711KPW ENCFF789BCM ENCFF568OMP ENCFF819KWY ENCFF092PMQ ENCFF162ZOO ENCFF228COG ENCFF020CFW ENCFF023NGN ENCFF779PRW ENCFF093MXU ENCFF683DQU ENCFF002WSA ENCFF285EWB ENCFF758WOL ENCFF299GMC ENCFF547QCM ENCFF712WXB ENCFF812TGW ENCFF937WDE ENCFF330GSW ENCFF399WOX ing to the ENCODE guidelines, each ChIP-seq exper- ENCFF780MVW Control Data iment is associated with at least two ChIP-seq bio- Figure 4: Comparison of controls used by WACS and logical replicate samples [24]. Here, we computed the ENCODE. The rows and columns correspond to the percentage overlap between the non-standardized and ChIP-seq and control experiments respectively. For standardized peaks separately for WACS, All MACS2, each ChIP-seq dataset, the controls are given a blue Matched MACS2 and AIControl across 197 ChIP-seq color if they are used by WACS only, a maroon color if experiments with exactly two replicates. (The percent- they are ENCODE matched controls only, and a ma- age overlap is equal to the overlap between the repli- genta color if they are used by both ENCODE and cates divided by the total number of peaks from both WACS. replicates.) (See Table 2 in Supplementary Data for details.) Figure 3c and 3d show the percentage overlap with all peaks and standardized peaks respectively for each each ChIP-seq experiment, or to match them on the of the ChIP-seq experiments when using WACS (blue basis of experimental details, such as cell/tissue type, line), All MACS2 (red line), Matched MACS2 (green read length and sequencer. 
If smart controls are to be line) and AIControl (purple line). Using all peaks, used, it is unclear how many controls should be consid- WACS has higher reproducibility than All MACS2, ered, and how many will end up in the smart control. Matched MACS2 and AIControl on 134 of the 197 It is unclear whether ENCODE matched controls are, ChIP-seq experiments. WACS also outperforms All in fact, the best choices or even among the controls MACS2, Matched MACS2 and AIControl on the ma- selected by a smart control procedure. jority of the ChIP-seq experiments with standardized Here, we aim to increase our understanding of the peaks (Table 2). These differences are statistically sig- smart controls used to model the background signal. nificant using a paired sign test and at a p-value of less Figure 4 displays a matrix where the rows and columns than 10−5. represent the ChIP-seq and control datasets respec- AIControl has the highest variability, in comparison tively. The blue color in the matrix represents the con- to MACS2 and WACS, with the percentage overlap trols selected by WACS to fit each ChIP-seq dataset, fluctuating between 5% and 34%. This shows poor the maroon color represents the ENCODE matched reproducibility between replicates for AIControl. All controls [24] and the magenta color represents the con- MACS2 and Matched MACS2 have similar perfor- trols selected by both ENCODE and WACS. mance in terms of percentage overlap. The difference Let us first consider the WACS selected controls per in percentage overlap between WACS and AIControl ChIP-seq dataset (blue in Figure 4). Different subsets is noticeably much larger than the difference between of the 147 controls are required by WACS for each WACS and MACS2 across the ChIP-seq replicates. ChIP-seq dataset, but these form several coherent clus- ters, where groups of ChIP-seq datasets use relatively Controls used per treatment sample. 
the same controls for modeling the background sig- Our results (and other results [21, 22, 23]) for motif nal. For example, the 10 or so controls most towards enrichment and reproducibility analysis suggest that the left of the diagram are used in modeling nearly all smart controls offer superior background subtraction the ChIP-seq, excepts those towards the bottom. The and peak-calling for ChIP-seq data. However, the stan- next 10 controls are widely used, though less so, and dard practice remains to generate controls alongside are distinct in be used for some of the ChIP-seqs to- bioRxiv preprint doi: https://doi.org/10.1101/582650; this version posted November 15, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Awdeh et al. Page 8 of 11

wards the bottom. Conversely, there is a set of ChIP- Figures 5a and 5d, WACS and WACS All Ctrls dis- seq datasets about 1/3 of the way from the bottom play the highest motif enrichment and have very sim- they rely on a large number of controls for modeling ilar performance. WACS and WACS All Ctrls outper- their background, whereas ChIP-seqs in the upper half form Matched MACS2, All MACS2 and AIControl on rely almost solely on the leftmost controls. Although 14 treatment samples in total, as shown in Table 3. each ChIP-seq’s background is modeled by a unique The differences for A549 with standardized peaks are combination of controls, a clear trend is that many statistically significant using a paired sign test and at controls are combined—approximately 31 on average, a p-value of less than 10−5. An equivalent trend is ob- and up to 100 for some samples. Supplementary Figure served for the GM12878 cell line (Figures 5b and 5e). 4 shows a histogram of the overall number of controls However, when using all peaks, WACS has the high- used by the ChIP-seq datasets using WACS. est motif enrichment; WACS outperforms WACS All For the ENCODE matched controls, we observe a Ctrls, Matched MACS2, All MACS2 and AIControl range of 1 to 4 ENCODE matched controls per ChIP- on 15 treatment samples in total, as shown in Table 3. seq dataset (maroon color in Figure 4). For 164 of the The differences for GM12878, with both standard and 438 ChIP-seq datasets (37%), none of the matched EN- all peaks, are statistically significant using a paired −5 CODE controls are used to model the background sig- sign test and at a p-value of less than 10 . Addition- nal in comparison to those used by WACS (rows with ally, for standardized peaks, for cell lines A549 and no magenta color in Figure 4). For example, 15 con- GM12878, we notice almost equivalent motif enrich- trols are used to model the background signal for the ment when using All MACS2 and Matched MACS2. 
ChIP-seq dataset ENCFF025QKV in Figure 4, none For HepG2 with all peaks (Figure 5c), on the other of which are the matched ENCODE controls. For the hand, Matched MACS2 outperforms WACS, WACS remaining 63% of the ChIP-seq datasets, some of the All Ctrls, All MACS2 and AIControl on 11 treatment ENCODE matched controls are also those selected by samples in total. For HepG2 with standardized peaks WACS, as seen in Figure 2 (magenta color). There are (Figure 5f), all methods display similar performance. For HepG2, the differences between All MACS2 and 207 ChIP-seq datasets (out of 438) that use all their WACS are statistically significant using a paired sign matched ENCODE controls (in addition to other con- test and at a p-value of less than 10−5. trols samples), and 32 ChIP-seq datasets that use half Next, we explore the reproducibility of peaks in of their matched control samples to model the back- ChIP-seq replicates for each cell line. There are a to- ground signal. tal of 10 ChIP-seq experiments for each cell line, each with two replicates. Figure 6 show the percentage over- Validation on additional cell lines lap with all and standardized peaks for each of the Here, we further evaluate WACS, MACS2 and AICon- ChIP-seq experiments, when using WACS (blue line), trol on three other cell lines: A549, HepG2 and WACS AllCtrls (yellow line), All MACS2 (red line), GM12878. We specifically explored 20 ChIP-seq and 18 Matched MACS2 (green line) and AIControl (purple control datasets for each cell line. (See Supplementary line). WACS All Ctrls outperforms WACS, Matched Tables 4, 5 and 6 for accession codes of the samples.) MACS2, All MACS2 and AIControl on all of the ChIP- We evaluated MACS2 with the ENCODE matched seq datasets for all the three cell lines, A549, GM12878 controls (Matched MACS2), MACS2 with the cell line and HepG2 for all and standardized peaks, as show in specific controls (All MACS2), WACS with the cell line Table 4. 
Again, AIControl displays the lowest percent- specific controls (WACS), WACS with the all controls age overlap for A549, GM12878 and HepG2 for all and across the three different cell lines (WACS AllCtrls), standardized peaks. and AIControl with its predefined set of controls on ChIP-seq datasets (AIControl). Discussion To evaluate the quality of the peaks generated by In this paper, we provide a method for improved peak- each method for each cell line, we first investigate mo- calling and increase our understanding of ChIP-seq tif enrichment. Figure 5 displays the motif enrichment data, controls and their biases. Firstly, we showed that for all and standardized peaks for each of the ChIP- the controls selected by WACS are not necessarily the experiments corresponding to each cell line, when us- matched ENCODE controls. Additionally, for most of ing WACS (blue line), WACS AllCtrls (yellow line), the ChIP-seq datasets, many more than two controls All MACS2 (red line), Matched MACS2 (green line) are selected to model the background signal. This sug- and AIControl (purple line). AIControl across all cell gests that the ENCODE guidelines may need to be lines, for all and standardized peaks, has the lowest modified to include more than two “matched” con- motif enrichment. For the cell line A549, as seen in trols per ChIP-seq dataset. We also notice that for a bioRxiv preprint doi: https://doi.org/10.1101/582650; this version posted November 15, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Awdeh et al. Page 9 of 11

0.35 WACS WACS WACS 0.5 All MACS2 All MACS2 All MACS2 0.30 Matched MACS2 Matched MACS2 0.5 Matched MACS2 AIControl AIControl AIControl 0.4 WACS All Ctrls WACS All Ctrls 0.25 WACS All Ctrls 0.4

0.20 0.3 0.3

0.15 0.2 Motif Enrichment

Motif Enrichment 0.2 Motif Enrichment 0.10

0.1 0.1 0.05

0.00 0.0 0.0 IRF4.ENCFF888PAI SRF.ENCFF263NOT IRF4.ENCFF240MQI ELF1.ENCFF444PPF ELF1.ENCFF735DGJ REST.ENCFF894EID JUN.ENCFF074GYD JUN.ENCFF122BQB SRF.ENCFF731ZNW ELF1.ENCFF739SRY ELF1.ENCFF028TNY ELK1.ENCFF211VKF JUN.ENCFF761UEZ JUN.ENCFF476XBN ELK1.ENCFF784XUE USF1.ENCFF737VAT USF1.ENCFF074OYP JUNB.ENCFF599JTK ELF1.ENCFF930EXY PBX3.ENCFF791EPM REST.ENCFF569QEN PBX3.ENCFF845MYC ELF1.ENCFF418EVV HES2.ENCFF595EIS CTCF.ENCFF178SXE RUNX3.ENCFF884LEJ JUNB.ENCFF330XFU REST.ENCFF195HUS CTCF.ENCFF435DKZ REST.ENCFF291BBG JUNB.ENCFF389OFH JUNB.ENCFF179XAQ ELK1.ENCFF757GXN REST.ENCFF808ADX HES2.ENCFF504YVD ELK1.ENCFF826GGN CREB1.ENCFF320SCI FOXA1.ENCFF332SRJ RUNX3.ENCFF579PRC CEBPB.ENCFF217RBI FOSL2.ENCFF585INN FOSL2.ENCFF953PCA NFATC1.ENCFF983YCI CEBPB.ENCFF514FUP CEBPB.ENCFF677PSB CEBPB.ENCFF090KCF FOSL2.ENCFF185DVY REST.ENCFF364NWO FOSL2.ENCFF125MJO CEBPB.ENCFF280ZFT FOXA1.ENCFF401YVR CEBPB.ENCFF417GPF FOXA1.ENCFF396NXZ CREB1.ENCFF011HOS CEBPB.ENCFF791DRP FOXA1.ENCFF988UCQ CEBPB.ENCFF073MBT NFATC1.ENCFF207QTV CEBPB.ENCFF499BWX CEBPB.ENCFF347MNU ChIP samples ChIP samples ChIP samples (a) A549 All Peaks (b) GM12878 All Peaks (c) HepG2 All Peaks

0.5 0.35 WACS WACS WACS All MACS2 All MACS2 All MACS2 Matched MACS2 0.5 Matched MACS2 0.30 Matched MACS2 0.4 AIControl AIControl AIControl WACS All Ctrls WACS All Ctrls WACS All Ctrls 0.25 0.4

0.3 0.20 0.3

0.15 0.2 Motif Enrichment Motif Enrichment Motif Enrichment 0.2 0.10

0.1 0.05 0.1

0.00 0.0 IRF4.ENCFF888PAI SRF.ENCFF263NOT IRF4.ENCFF240MQI ELF1.ENCFF444PPF ELF1.ENCFF735DGJ REST.ENCFF894EID JUN.ENCFF074GYD JUN.ENCFF122BQB SRF.ENCFF731ZNW ELF1.ENCFF739SRY ELF1.ENCFF028TNY ELK1.ENCFF211VKF JUN.ENCFF761UEZ JUN.ENCFF476XBN ELK1.ENCFF784XUE USF1.ENCFF737VAT USF1.ENCFF074OYP JUNB.ENCFF599JTK ELF1.ENCFF930EXY PBX3.ENCFF791EPM REST.ENCFF569QEN PBX3.ENCFF845MYC ELF1.ENCFF418EVV HES2.ENCFF595EIS CTCF.ENCFF178SXE RUNX3.ENCFF884LEJ JUNB.ENCFF330XFU REST.ENCFF195HUS CTCF.ENCFF435DKZ REST.ENCFF291BBG JUNB.ENCFF389OFH JUNB.ENCFF179XAQ ELK1.ENCFF757GXN REST.ENCFF808ADX HES2.ENCFF504YVD ELK1.ENCFF826GGN CREB1.ENCFF320SCI FOXA1.ENCFF332SRJ RUNX3.ENCFF579PRC CEBPB.ENCFF217RBI FOSL2.ENCFF585INN FOSL2.ENCFF953PCA NFATC1.ENCFF983YCI CEBPB.ENCFF514FUP CEBPB.ENCFF677PSB CEBPB.ENCFF090KCF FOSL2.ENCFF185DVY REST.ENCFF364NWO FOSL2.ENCFF125MJO CEBPB.ENCFF280ZFT FOXA1.ENCFF401YVR CEBPB.ENCFF417GPF FOXA1.ENCFF396NXZ CREB1.ENCFF011HOS CEBPB.ENCFF791DRP FOXA1.ENCFF988UCQ CEBPB.ENCFF073MBT NFATC1.ENCFF207QTV CEBPB.ENCFF499BWX CEBPB.ENCFF347MNU ChIP samples ChIP samples ChIP samples (d) A549 Standardized (e) GM12878 Standardized (f) HepG2 Standardized

Figure 5: Motif enrichment of the treatment samples for each of the five peak calling methods for each of the 3 cell lines: A549 (a and d), GM12878 (b and e) and HepG2 (c and f).

small number of controls, Matched MACS2 and WACS perform similarly. The more controls used, the better WACS performs. We showed that a form of intelligent control selection is beneficial for the combination (selection) of controls, as it better models the estimated background signal, where different controls represent different types of biases. As noted by Hiranuma et al. [23], this will also allow researchers to use other controls that are not specific to their ChIP-seq experiment to model the noise distribution. This will decrease the cost, time and resources required to perform ChIP-seq experiments. WACS not only provides weights per control for a more efficient background model, it also builds on an existing and precise peak calling method, MACS2.

Hiranuma et al. [23] claim that AIControl is better at removing background noise than MACS2. However, our results suggest the contrary (see Figure 5). This may be due to a number of reasons. First, Hiranuma et al. [23] use a different and nonstandard evaluation method for reproducibility analysis. Whereas we adopted the widely used approach of looking at peak overlaps between biological replicates [4, 5], Hiranuma et al. showed that AIControl had higher irreproducibility than MACS2 when applied to unrelated datasets. Furthermore, Hiranuma et al. applied MACS2 using only one matched control, while for our analysis, we used either all the ENCODE matched controls for a treatment sample or simply all controls from the same K562 cell line. In either case, the provision of multiple controls may have improved MACS2's performance.

In this manuscript, we described using NNLS to fit a model of ChIP-seq background to control densities, but other formulations are possible. For example, we experimented with an instance-weighted NNLS formulation, to account for differing variances on the regression targets y_i (the ChIP-seq read counts per window). We did not find any improvement in performance. However, results may depend on how one estimates target variances. Relatedly, performing regression on log-transformed read counts may be worth exploring. RNA-seq analysis tools such as DESeq2 [33] use log-linear models for read counts and comparisons between conditions. It would also make sense to explore L1-penalized regression formulations, to explore trade-offs between the number of controls used to model background and the accuracy of the background model.

Future work will deal with a more thorough analysis of the weighted controls approach on other high throughput sequencing data, such as RNA-seq, and
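The NNLS fit and its instance-weighted variant discussed in this section can be illustrated with a minimal sketch. It assumes that read counts have already been binned into fixed windows, and it uses SciPy's `nnls` solver. The function names `fit_control_weights` and `weighted_background`, the synthetic data, and the Poisson-like variance estimate (`chip + 1`) are illustrative assumptions, not part of the WACS software.

```python
import numpy as np
from scipy.optimize import nnls


def fit_control_weights(control_counts, chip_counts, target_var=None):
    """Fit non-negative weights w so that control_counts @ w approximates chip_counts.

    control_counts: (n_windows, n_controls) read counts per window, one column per control.
    chip_counts:    (n_windows,) ChIP-seq read counts per window (the targets y_i).
    target_var:     optional (n_windows,) target variances; when given, each window is
                    scaled by 1/sqrt(variance), i.e. an instance-weighted NNLS fit.
    """
    X = np.asarray(control_counts, dtype=float)
    y = np.asarray(chip_counts, dtype=float)
    if target_var is not None:
        scale = 1.0 / np.sqrt(np.asarray(target_var, dtype=float))
        X = X * scale[:, None]
        y = y * scale
    weights, _residual_norm = nnls(X, y)
    return weights


def weighted_background(control_counts, weights):
    """Combine the controls into one background signal per window."""
    return np.asarray(control_counts, dtype=float) @ np.asarray(weights, dtype=float)


# Synthetic, noise-free example (hypothetical data): 200 windows, 3 controls.
rng = np.random.default_rng(0)
controls = rng.poisson(lam=5.0, size=(200, 3)).astype(float)
true_weights = np.array([0.5, 0.25, 2.0])
chip = controls @ true_weights

w_plain = fit_control_weights(controls, chip)
# Assumed variance model for illustration only: variance grows with the count.
w_inst = fit_control_weights(controls, chip, target_var=chip + 1.0)
background = weighted_background(controls, w_plain)
```

On noise-free data both fits recover the generating weights. On real counts the two can differ, and the outcome depends on how the target variances are estimated. With equal weights that sum to 1, `weighted_background` reduces to the plain per-window average of the controls.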

[Figure 6, panels (a)-(f): (a) A549 All Peaks, (b) GM12878 All Peaks, (c) HepG2 All Peaks, (d) A549 Standardized, (e) GM12878 Standardized, (f) HepG2 Standardized. x-axis: ChIP Replicates; y-axis: Percentage Overlap. Legend: WACS, All MACS2, Matched MACS2, AIControl, WACS AllCtrls.]

Figure 6: Percentage overlap in peaks between ENCODE replicates, for each of the five peak calling methods for each of the 3 cell lines: A549 (a and d), GM12878 (b and e) and HepG2 (c and f).

other cell lines. The weighted approach will be used to study the biases in RNA-seq data across different platforms, labs, cell types, tissues, etc. For example, RNA-seq is used to measure the difference in gene expression between tissues, where a tissue consists of a mixture of cell types. To generate a realistic control tissue, the weighted approach can be used to weight the cell types in the tissue to model the background signal. Also, in this analysis, we focused on sharp peaks, which are more generally found at protein-DNA binding sites. Thus, an analysis of broader peaks, for example, will also be conducted. Ultimately, the overall aim is to further our understanding of the significance of the increasing amount of publicly available data in peak calling analysis to obtain more efficient results.

Conclusion
We developed a peak calling method, WACS, which allows a mixture of weighted controls as input. The user inputs the controls. These controls can either be weighted by the user, or the weights can be computed by our regression approach. The latter systematically estimates the weights of the input controls to model the background signal for that ChIP-seq experiment. In the special case of equal weights that sum to 1, the peaks output from WACS and MACS2 are identical. If different weights are allowed, the two algorithms have different outputs. WACS allows only positive weights for better interpretability of results. Negative weights are biologically difficult to interpret, as they do not add to the background signal. WACS proceeds to use this devised background signal to identify regions of enrichment along the genome. WACS is an extension of the most highly cited peak calling algorithm, MACS2 [18]. We conducted a comparison between WACS, MACS2 and AIControl to evaluate our method and the significance of the weighted controls. WACS significantly outperforms both MACS2 and AIControl in motif enrichment analysis and reproducibility analysis.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Availability of data and materials
ChIP-seq data used to develop and evaluate this method can be found online on the ENCODE website https://www.encodeproject.org. The WACS software can be found on the following website: https://www.perkinslab.ca/software.

Funding
This work was supported in part by a Queen Elizabeth II Graduate Scholarship in Science and Technology (QEII-GSST) to AA, and by an NSERC Discovery Grant to TJP.

Author's contributions
AA and TJP conceived and designed the analysis. AA developed the tool, performed analysis/computations and wrote the manuscript with input from TJP. TJP edited the manuscript. TJP and MT supervised the project. All authors provided critical feedback and helped shape the research, analysis and manuscript.

Acknowledgments
We thank Compute Canada for granting us access to their cluster to store data and run our computational analyses. We also thank members of the Perkins lab for their feedback.

Author details
1School of Electrical Engineering and Computer Science, University of Ottawa, K1N6N5, Ottawa, Canada. 2Regenerative Medicine Program, Ottawa Hospital Research Institute, K1H8L6 Ottawa, Canada. 3Department of Biochemistry, Microbiology and Immunology, University of Ottawa, K1H8M5 Ottawa, Canada.

Additional Files
Additional file 1: WACSSupp
Includes Supplementary Tables 1, 2, 3, 4, 5 and 6. Supplementary Figures 1, 2, 3 and 4.

Table 3: Motif Enrichment Summary.

Cell Line      WACS   AIControl   WACS AllCtrls   Matched MACS2   All MACS2
All Peaks
A549              7           7               2               4           0
GM12878          15           2               2               1           0
HepG2             0           9              11               0           0
Standardized
A549             10           8               1               1           0
GM12878          11           5               3               0           1
HepG2             3           6               5               0           6

Table 4: Reproducibility Analysis Summary.

Cell Line      WACS   AIControl   WACS AllCtrls   Matched MACS2   All MACS2
All Peaks
A549              1           9               2               0           1
GM12878           3           4               1               1           1
HepG2             1           7               2               0           0
Standardized
A549              2           9               2               0           0
GM12878           2           4               3               1           0
HepG2             1           8               1               0           0

References
1. Johnson, D.S., Mortazavi, A., Myers, R.M., Wold, B. Science 316(5830), 1497–1502 (2007)
2. Barski, A., Cuddapah, S., Cui, K., Roh, T., Schones, D.E., Wang, Z., Wei, G., Chepelev, I., Zhao, K. Cell 129(4), 823–837 (2007)
3. Pepke, S., Wold, B., Mortazavi, A. Nat. Methods 6(11s), 22 (2009)
4. Laajala, T.D., Raghav, S., Tuomela, S., Lahesmaa, R., Aittokallio, T., Elo, L.L. BMC Genomics 10(1), 618 (2009)
5. Bardet, A.F., He, Q., Zeitlinger, J., Stark, A. Nat. Protoc. 7(1), 45 (2012)
6. Wilbanks, E.G., Facciotti, M.T. PLoS ONE 5(7), 11471 (2010)
7. Thomas, R., Thomas, S., Holloway, A.K., Pollard, K.S. Brief. Bioinform. 18(3), 441–450 (2016)
8. Marinov, G.K., Kundaje, A., Park, P.J., Wold, B.J. G3 4(2), 209–223 (2014)
9. Landt, S.G., Marinov, G.K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou, S., Bernstein, B.E., Bickel, P., Brown, J.B., Cayting, P., et al. Genome Res. 22(9), 1813–1831 (2012)
10. Meyer, C.A., Liu, X.S. Nat. Rev. Genet. 15(11), 709–721 (2014)
11. Karimzadeh, M., Ernst, C., Kundaje, A., Hoffman, M.M. bioRxiv, 095463 (2016)
12. Benjamini, Y., Speed, T.P. Nucleic Acids Res., 001 (2012)
13. Teng, M., Irizarry, R.A. bioRxiv, 090704 (2016)
14. Nakato, R., Shirahige, K. Brief. Bioinform., 023 (2016)
15. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., Wold, B. Nat. Methods 5(7), 621 (2008)
16. Fejes, A.P., Robertson, G., Bilenky, M., Varhol, R., Bainbridge, M., Jones, S.J. Bioinformatics 24(15), 1729–1730 (2008)
17. Zang, C., Schones, D.E., Zeng, C., Cui, K., Zhao, K., Peng, W. Bioinformatics 25(15), 1952–1958 (2009)
18. Zhang, Y., Liu, T., Meyer, C.A., Eeckhoute, J., Johnson, D.S., Bernstein, B.E., Nusbaum, C., Myers, R.M., Brown, M., Li, W., et al. Genome Biol. 9(9), 137 (2008)
19. Harmanci, A., Rozowsky, J., Gerstein, M. Genome Biol. 15(10), 474 (2014)
20. Rozowsky, J., Euskirchen, G., Auerbach, R.K., Zhang, Z.D., Gibson, T., Bjornson, R., Carriero, N., Snyder, M., Gerstein, M.B. Nat. Biotechnol. 27(1), 66 (2009)
21. Ramachandran, P., Palidwor, G.A., Perkins, T.J. Epigenetics & Chromatin 8(1), 33 (2015)
22. Hiranuma, N., Lundberg, S., Lee, S. In: Proceedings of the 7th ACM BCB, pp. 191–199 (2016). ACM
23. Hiranuma, N., Lundberg, S.M., Lee, S.-I. AIControl: Replacing matched control experiments with machine learning improves ChIP-seq peak identification. Nucleic Acids Res. 47(10), 58–58 (2019)
24. Consortium, E.P., et al. Nature 489(7414), 57 (2012)
25. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R. Bioinformatics 25(16), 2078–2079 (2009)
26. Quinlan, A.R., Hall, I.M. Bioinformatics 26(6), 841–842 (2010)
27. Jones, E., Oliphant, T., Peterson, P. (2014)
28. Feng, J., Liu, T., Zhang, Y. Current Protocols in Bioinformatics 34(1), 2–14 (2011)
29. Feng, J., Liu, T., Qin, B., Zhang, Y., Liu, X.S. Nat. Protoc. 7(9), 1728 (2012)
30. Khan, A., Fornes, O., Stigliani, A., Gheorghe, M., Castro-Mondragon, J.A., van der Lee, R., Bessy, A., Cheneby, J., Kulkarni, S., Tan, G., et al. Nucleic Acids Res. 46(D1), 260–266 (2017)
31. Grant, C.E., Bailey, T.L., Noble, W.S. Bioinformatics 27(7), 1017–1018 (2011)
32. Bailey, T.L., Boden, M., Buske, F.A., Frith, M., Grant, C.E., Clementi, L., Ren, J., Li, W.W., Noble, W.S. Nucleic Acids Res. 37(suppl 2), 202–208 (2009)
33. Love, M.I., Huber, W., Anders, S. Genome Biol. 15(12), 550 (2014)