BaalChIP: Bayesian analysis of allelic imbalance in ChIP-seq data corrects for copy-number variation de Santiago and Liu et al., 2017

University of Cambridge, Cancer Research UK, Robinson Way, CB2 0RE. Cambridge, UK. ^ Equal Contribution authors. * Corresponding authors

Additional File 1

Regulatory variant in SNP TF a diploid region: C A

A Allele 1: C Allele 2: ChIP-seq read out:

C C C Aneuploid region: C A A A A A A counts Read A A C A Allele 1 A A (amplicon) A A SNP A Allele 2: C

Figure S1: Example illustration of a ChIP-seq read out at a DNA-binding site when a true regulatory difference exists between two alleles (A allele depicted in orange and C allele depicted in blue), and when there is no regulatory difference between the two alleles. In a diploid sample, the (yellow circle) binds preferentially to the A allele. Here the regulatory effect is observed as an imbalance in the allelic ratios obtained from ChIP-seq read counts, with a higher number of reads carrying the A allele. In the presence of allele-specific copy number aberrations such as an amplicon affecting the A allele, the direct ChIP-seq readout may reflect the relative presence of the alleles, rather than the regulatory effect of the single nucleotide variant. Consequently, in the presence of copy number changes, ChIP-seq allelic ratios are not sufficient to uncover true cis-regulatory effects.

1 µ

⌘ ⇢

⇤ an dn N

Figure S2: Graphical model of BaalChIP. The gray circles represent given variables and white circles are unknown variables. The direction of arrows indicates statistical dependency in the model. The observed reference allele read count is an, total read counts is dn. n is the index of given factor (e.g. ) binding to the region of the SNP of interest. The rectangular plane and the capital N indicates that there are N factors binding to the SNP. The primary variable of interest is η representing allelic balance ratio. Λ is the precision parameter. The reference allele frequency at the SNP is ρ. dn, η, ρ, Λ are the parameters of a beta-binomial distribution that governs the uncertainty in an. Finally, the reference mapping bias µ and λ are the mean and variance of a beta distribution which serves as a prior over η.

2 (a) RAF = 0.5 & Reads per TF = 15 (b) RAF = 0.5 & TF = 1 1.0 1.0 0.8 0.8 0.6 0.6 0.4 0.4 TF = 1 reads = 3 True Positive Rate Positive True TF = 5 Rate Positive True reads = 30

0.2 TF = 10 0.2 reads = 250 TF = 20 reads = 520 TF = 30 reads = 1000 0.0 0.0

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

False Positive Rate False Positive Rate

Figure S3: BaalChIP simulation performance with varying number of transcription factors (a) and read depth (b).

3 ealddsrpinicuigacsinnmescnb on nAdtoa ie3 al 2and S2 Table 3: File Additional in More found respectively. blue, be S3. and Table can pink 4: numbers in File coloured accession Additional are binding including cell-lines allele-specific for non-cancer description study and this detailed cancer in used for samples Marks ENCODE the analyses. summarizing matrix Data S4: Figure GM12892 GM12891 GM12878 H1hESC SKNSH MCF10 HepG2 IMR90 MCF7 T47D HL60 HeLa A549 K562

ATF3 BCLAF1 CBX3 CEBPB CEBPD CREB1 CTCF CTCFL E2F6 EGR1 ELF1 ETS1 FOSL1 GABPA GATA2 MAX MEF2A NR2F2 NRSF PML POL2 PU1 RAD21 SIN3AK20 SIX5 SP1 SP2 SRF STAT5A TAF1 TAF7 TEAD4 THAP1 TRIM28 USF1 YY1 ZBTB33 ZBTB7A BAF155 BAF170 BRCA1 DNA−binding cJUN CHD2

4 ELK1 GCN5 GTF2F1 HCFC1 INI1 IRF3 JUND MAFK MAZ MXI1 NRF1 P300 PRDM1 COREST RFX5 SMC3 SPT20 STAT3 TBP USF2 ZKSCAN1 ZNF143 BCL3 FOSL2 FOXA2 TCF12 FOXM1 GATA3 HDAC2 FOXA1 BHLHE40 HNF4A HNF4G MBD4 MYBL2 NFIC RXRA ZEB1 PBX3 cFOS cMYC ATF2 BCL11A NANOG POU5F1 SP4 BATF EBF1 IRF4 MEF2C MTA3 NFATC1 PAX5 POU2F2 RUNX3 TCF3 CHD1 a c blacklist 95 highcov mappability intbias 90 pass replicates_merged

85

Only1Allele

ih callsRight (%) 80

75

30 35 40 45 50 b read length

1.00 key blacklist 0.75 highcov mappability 0.50 intbias replicates_merged 0.25 Only1Allele itrdTtl(%) Filtered/Total 0.00 pass K562 A549 HeLa HL60 T47D MCF7 IMR90 HepG2 MCF10 SKNSH H1hESC GM12878 GM12891 GM12892 cell line

Figure S5: BaalChIP SNP filtering. (a) Percentage of SNPs with the correct number of aligned reads based on simulations of reads of different read lengths (28 to 50 mer). The percentage of correct calls increased with read length. (b) Proportion of SNPs that were filtered out in each filtering step. (b) The average proportion of SNPs that were filtered out in each filtering step. Average was calculated accross all 14 ENCODE cell lines.

5 a

cancer, A549 cancer, HeLa cancer, HepG2 cancer, HL60 0.8

0.6

0.4

0.2 cancer, K562 cancer, MCF7 cancer, SKNSH cancer, T47D 0.8

0.6

0.4

0.2 non−cancer, GM12878 non−cancer, GM12891 non−cancer, GM12892 non−cancer, H1hESC RAF 0.8

0.6

0.4

0.2 non−cancer, IMR90 non−cancer, MCF10 0.8

0.6

0.4

0.2 0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00 AR (ChIP-seq) b celltype 0.6 cancer non−cancer

0.4

0.2 rprinof sites inProportion CNA regions

0.0

0.0 0.2 0.4 0.6 Correlation ChIP−seq AR vs. RAF scores

Figure S6: Correlation between the BAF values and the ChIP-seq allelic ratios (AR). (a) Scatter plots showing the correlation between RAF and AR at each SNP, for all considered cell lines of the ENCODE dataset. RAF corresponds to the BAF value with respect to the reference allele (RAF is equal to BAF if the reference allele corresponds to the B allele; RAF is equal to 1-BAF if the reference allele corresponds to the A allele). The blue line shows the fitted linear regression line. (b) Positive relationship between the Spearman correlation coefficient (obtained from the correlation of AR with RAF scores) and the proportion of sites in copy-number altered (CNA) regions. Each dot in the scatter plot corresponds to a cell line. Sites in copy number changed regions correspond to heterozygous SNPs with BAF score < 0.4 or BAF > 0.6.

6 a b Without RAF correction Cancer cell line K562 HeLa A549 With RAF correction Non-cancer cell line

3 3 3 0.125 2 2 2 ●

1 1 1 0 0 0 0.100 ● 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 ● MCF7 T47D HepG2 ● ● 3 4 3 0.075 3 2 2 ● 2 1 1 1 ● ● 0.050 ● 0 0 0

0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Proportion of ASB sites ●● SKNSH HL60 MCF10 ● ● 0.025 ● ● 3 3 3 ●

2 2 2

1 1 1 density 0.0 0.2 0.4 0.6 0 0 0 C 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.20 ● H1hESC GM12878 GM12891 4 4 3 3 3 ● 2 2 2 0.15 1 1 1 ● ● 0 0 0 ●

0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 GM12892 IMR90 0.10 ● 4 ● 3 ● 3 ChIP-seq AR (not corrected) ● ● 2 ● 2 ● ● ● Corrected AR Proportion of ASB sites 1 1 0.05 ● 0 0

0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Allelic Ratio (AR) 0.0 0.2 0.4 0.6 Proportion of tested sites in CNA regions

Figure S7: Effects of BaalChIP adjustment of the allelic ratios (AR) after correcting for the background Reference Allele Frequency (RAF) bias. (a) Density plots showing the distribution of allelic ratios before (green) and after (orange) BaalChIP correction. The ARs before correction were calculated directly from the ChIP-seq data and refer to the number of reads in the reference allele divided by the total number of reads. The corrected AR values were estimated by BaalChIP model after taking into account the RAF bias. The adjustment of AR is more dramatic in cancer cell lines. (b) Proportion of ASB sites detected with (orange) and without (green) RAF correction as a function of the proportion of sites within copy-number altered (CNA) regions. Each dot refers to a cell line. Sites in regions of copy number change refer to any SNP with a BAF score higher than 0.6 or lower than 0.4. Cancer cell lines have a higher proportion of SNPs in CNA regions and benefit more from BaalChIP correction. (c) Same as (b) but after selecting SNPs with 30X-40X coverage.

7 With RAF correction Without RAF correction

0.12

0.08 cell type cancer non-cancer 0.04

0.00 Proportion of sites of ASB Proportion

20 30 40 50 20 30 40 50 Average coverage at tested sites

Figure S8: Proportion of identified ASB SNPs per cell line. Correlation between the proportion of identified ASB SNPs and the average coverage at all assayed heterozygous SNPs. The general correlation is expected due to the higher power to detect ASB when coverage is high.

8 GM12878 0.4

0.2

0.0

0.2 GM12892

0.1

0.0 0.04

0.03 IMR90 0.02 0.01 0.00 cancer 0.15 non−cancer MCF10 0.10 0.05

rprinof ASBProportion sites 0.00 0.3 MCF7 0.2 0.1 0.0 0.3 SKNSH 0.2 0.1 0.0 chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22

Figure S9: Higher rates of ASB on X of female cell lines than in autosomal .

9 A549, FOSL2 A549, FOXA2 A549, GABPA A549, J UND A549, NRSF A549, P300 A549, POL2 A549, SIX5 A549, TCF12 GM12878, ATF2 GM12878, BATF GM12878, BCL11A GM12878, BCL3 GM12878, CREB1 1

0.5

0 GM12878, EBF1 GM12878, EGR1 GM12878, ELF1 GM12878, FOXM1 GM12878, GABPA GM12878, IRF4 GM12878, MEF2A GM12878, MTA3 GM12878, NFIC GM12878, NRSF GM12878, PAX5 GM12878, PBX3 GM12878, POL2 GM12878, POU2F2 1

0.5

0 GM12878, PU1 GM12878, RAD21 GM12878, RUNX3 GM12878, SIX5 GM12878, SP1 GM12878, SRF GM12878, STAT5A GM12878, TAF1 GM12878, TCF12 GM12878, TCF3 GM12878, YY1 GM12891, POL2 GM12891, POU2F2 GM12891, PU1 1

0.5

0 GM12891, TAF1 GM12891, YY1 GM12892, POL2 GM12892, YY1 H1hESC, CREB1 H1hESC, CTCF H1hESC, E2F6 H1hESC, EGR1 H1hESC, MAX H1hESC, NRSF H1hESC, POL2 H1hESC, RAD21 H1hESC, SIN3AK20 H1hESC, SP1 1

0.5

0 H1hESC, SRF H1hESC, TAF1H1hESC, TCF12 H1hESC, TEAD4 H1hESC, USF1 H1hESC, YY1 HeLa, BRCA1 HeLa, CEBPB HeLa, CHD2 HeLa, cJ UN HeLa, GTF2F1 HeLa, HCFC1 HeLa, J UND HeLa, MAFK 1

0.5

0 HeLa, MAX HeLa, MXI1 HeLa, P300 HeLa, POL2 HeLa, PRDM1 HeLa, RAD21 HeLa, RFX5 HeLa, SMC3 HeLa, STAT3 HeLa, TBP HeLa, USF2 HepG2, CEBPB HepG2, CREB1 HepG2, CTCF 1

0.5

0 HepG2, ELF1 HepG2, FOSL2 HepG2, FOXA1 HepG2, FOXA2 HepG2, GABPA HepG2, HNF4A HepG2, HNF4G HepG2, J UND HepG2, MAX HepG2, NR2F2 HepG2, P300 HepG2, POL2 HepG2, RAD21 HepG2, RXRA 1

0.5

0 HepG2, SRF HepG2, TAF1 HepG2, USF1 HepG2, YY1 HL60, GABPA HL60, NRSF HL60, POL2 HL60, PU1 IMR90, CEBPB IMR90, CTCF IMR90, MAFK IMR90, MXI1 IMR90, POL2 IMR90, RAD21 1

0.5 Allelic Ratio (Replicate 2) (Replicate Ratio Allelic 0 K562, ATF3 K562, CEBPB K562, CEBPD K562, CREB1 K562, CTCF K562, E2F6 K562, EGR1 K562, ELF1 K562, ETS1 K562, FOSL1 K562, GABPA K562, GATA2 K562, MAX K562, MEF2A 1

0.5

0 K562, NR2F2 K562, NRSF K562, PML K562, POL2 K562, PU1 K562, RAD21 K562, SP1 K562, SRF K562, STAT5A K562, TAF1 K562, TEAD4 K562, TRIM28 K562, USF1 K562, ZBTB7A 1

0.5

0 MCF10, cFOS MCF10, cMYC MCF10, POL2 MCF10, STAT3 MCF7, CEBPB MCF7, CTCF MCF7, EGR1 MCF7, ELF1 MCF7, FOSL2 MCF7, FOXM1 MCF7, GABPA MCF7, GATA3 MCF7, HDAC2 MCF7, J UND 1

0.5

0 MCF7, MAX MCF7, NR2F2 MCF7, NRSF MCF7, P300 MCF7, PML MCF7, RAD21 MCF7, SIN3AK20 MCF7, SRF MCF7, TAF1 MCF7, TCF12 MCF7, TEAD4 SKNSH, ELF1 SKNSH, FOSL2 SKNSH, FOXM1 1

0.5

0 SKNSH, GABPA SKNSH, GATA3 SKNSH, J UND SKNSH, MAX SKNSH, NFIC SKNSH, NRSF SKNSH, P300 SKNSH, PBX3 SKNSH, POL2 SKNSH, RXRA SKNSH, TCF12 SKNSH, TEAD4 SKNSH, USF1 SKNSH, YY1 1

0.5

0 SKNSH, ZBTB33 T47D, CTCF T47D, FOXA1 T47D, GATA3 T47D, P300 1

0.5 0

0 0.5 1 0 0.5 1 0 0.5 1 0 0.5 1 0 0.5 1 Allelic Ratio (Replicate 1)

Figure S10: Consistency of ASB across biological replicates. Each dot in the scatter plot represents the allelic ratio at a single SNP obtained for two replicated samples from the same ChIP-seq experiment.

10 K562 HeLa A549 MCF7 1.0

0.5

0.0 T47D HepG2 SKNSH HL60 1.0

0.5

0.0 MCF10 H1hESC GM12878 GM12891 1.0

Allelic Ratio (Protein Y) 0.5

0.0 GM12892 IMR90 1.0

0.5

0.0 0.0 0.5 1.00.0 0.5 1.0 Allelic Ratio (Protein X)

Figure S11: Consistency of ASB across pairs of different assayed proteins (Protein "X" and "Y"), within the same cell line. Each dot in the scatter plot represents the allelic ratio at a single SNP obtained for distinct ChIP-seq experiments in the same cell line (i.e distinct ChIPed proteins bound at the same site).

11 a b

(124)

100 0.8

50 0.6 ubrof ASB shared Number sites pamncreaincoefficient correlation Spearman (18)

(5) (1) (1) 0.4 0

Rep sameCell 2 3 4 5 6 Number of cell lines

Figure S12: (a) Distribution of Spearman correlation coefficients obtained for the correlation of allelic ratios obtained for pairs of biological replicates (Rep), and between distinct proteins bound at the same site (sameCell). The allelic ratios are highly correlated in both scenarios. (b) Number of shared ASB SNPs across cell lines. Numbers in parenthesis show the number of ASB SNPs shared between 2 to 6 cell lines. 149 ASB SNPs (124+18+5+1+1) were concomitantly detected in two or more cell lines.

12 50 ASB Assayed 40

30

20

Percentage of of Overlap Percentage 10

0

coding5' UTR intron 3' UTR intergenic promoter

Figure S13: Percentage of ASB SNPs overlapping annotations (5’ UTR, 3’ UTR, coding, inter- genic, intron or promoter). The largest proportion of ASB SNPs was found in introns and intergenic regions. A similar distribution pattern is observed when considering all assayed heterozygous SNPs.

13 a

A549 GM12878 H1hESC HeLa HepG2 K562

75

50

25 ecnaeof overlap Percentage 0 E E E E E WE WE WE WE WE K27ac K27ac K27ac K27ac K27ac K27ac k4me1 Cell−specific chromatin features b observed expected

75 *

50

25

ecnaeof overlap Percentage 0 K562 A549 HeLa HepG2 H1hESC GM12878 Cell−specific chromatin features (merged ranges)

Figure S14: Percentage of ASB SNPs overlapping putative enhancer regions from the matched cell line. (a) Putative enhancer regions were retrieved from publicly available ENCODE datasets on H3K27ac and H3K4me1 modified regions, and previously predicted weak enhancers (WE) and strong enhancer (E) sites obtained from Segway chromatin state annotations of the ENCODE data. (b) Comparison of the observed and expected number of ASB SNPs that map to enhancer regions. A significant difference (*) was only detected in HeLa cells.

14 A549 GM12878 GM12891 GM12892 25 20 p=0.99 p=0.91 p=0.51 p=0.23 15 10 5 0 H1hESC HeLa HepG2 HL60 25 20 p=0.99 p=0.98 p=0.91 p=1 15 10 5 0 group ASB IMR90 K562 MCF10 MCF7 25 Assayed 20 p=0.21 p=0.92 p=0.89 p=0.28 15 10

abs(LOD(alt)−LOD(ref)) 5 0 ASB Assayed ASB Assayed SKNSH T47D 25 20 p=0.98 p=0.81 15 10 5 0 ASB Assayed ASB Assayed

A549 GM12878 GM12891 GM12892 25 20 p=0.63 p=0.84 p=0.94 p=0.6 15 10 5 0 H1hESC HeLa HepG2 HL60 25 20 p=0.81 p=0.99 p=0.89 p=0.98 15 10 5 0 group ASB IMR90 K562 MCF10 MCF7 25 Assayed 20 p=0.56 p=0.97 p=0.81 p=0.03 15 10

abs(LOD(alt)−LOD(ref)) 5 0 ASB Assayed ASB Assayed SKNSH T47D 25 20 p=0.85 p=0.63 15 10 5 0 ASB Assayed ASB Assayed

Figure S15: Distribution of absolute DELTA scores (deltaLOD) of motif-disrupting SNPs with (a) deltaLOD > 0 and (b) deltaLOD > 5. The DELTA score corresponds to the change in the PWM score between SNP alleles (DELTA = LOD(ref) - LOD(alt)). P-values correspond to the Kolmogorov- Smirnov test when comparing the distribution of DELTA scores between motif-disrupting ASB SNPs with all motif-disrupting SNPs.

15 a b ENCODE (ChIP−seq) FAIRE−seq gDNA Depth Depth 10 100 1000 10 100 1000 K562 A549 HeLa HL60 T47D MCF7 IMR90 T47D T47D HepG2 MCF10 SKNSH H1hESC MDA134 MDA134 GM12878 GM12891 GM12892

Figure S16: Distribution of depth of coverage at tested heterozygous SNPs in (a) ENCODE ChIP- seq datasets (pooled data across all ChIP-seq samples in each cell line) and (b) targeted sequencing FAIRE-seq datasets (pooled data for three replicates).

16