Bayesian Analysis of Allelic Imbalance in Chip-Seq Data Corrects for Copy-Number Variation De Santiago and Liu Et Al., 2017

Bayesian Analysis of Allelic Imbalance in Chip-Seq Data Corrects for Copy-Number Variation De Santiago and Liu Et Al., 2017

BaalChIP: Bayesian analysis of allelic imbalance in ChIP-seq data corrects for copy-number variation de Santiago and Liu et al., 2017 University of Cambridge, Cancer Research UK, Robinson Way, CB2 0RE. Cambridge, UK. ^ Equal Contribution authors. * Corresponding authors Additional File 1 Regulatory variant in SNP TF a diploid region: C A A Allele 1: C Allele 2: ChIP-seq read out: C C C Aneuploid region: C A A A A A A counts Read A A C A Allele 1 A A (amplicon) A A SNP A Allele 2: C Figure S1: Example illustration of a ChIP-seq read out at a DNA-binding site when a true regulatory difference exists between two alleles (A allele depicted in orange and C allele depicted in blue), and when there is no regulatory difference between the two alleles. In a diploid sample, the protein (yellow circle) binds preferentially to the A allele. Here the regulatory effect is observed as an imbalance in the allelic ratios obtained from ChIP-seq read counts, with a higher number of reads carrying the A allele. In the presence of allele-specific copy number aberrations such as an amplicon affecting the A allele, the direct ChIP-seq readout may reflect the relative presence of the alleles, rather than the regulatory effect of the single nucleotide variant. Consequently, in the presence of copy number changes, ChIP-seq allelic ratios are not sufficient to uncover true cis-regulatory effects. 1 µ λ ⌘ ⇢ ⇤ an dn N Figure S2: Graphical model of BaalChIP. The gray circles represent given variables and white circles are unknown variables. The direction of arrows indicates statistical dependency in the model. The observed reference allele read count is an, total read counts is dn. n is the index of given factor (e.g. transcription factor) binding to the region of the SNP of interest. The rectangular plane and the capital N indicates that there are N factors binding to the SNP. The primary variable of interest is η representing allelic balance ratio. Λ is the precision parameter. The reference allele frequency at the SNP is ρ. dn, η, ρ, Λ are the parameters of a beta-binomial distribution that governs the uncertainty in an. Finally, the reference mapping bias µ and λ are the mean and variance of a beta distribution which serves as a prior over η. 2 (a) RAF = 0.5 & Reads per TF = 15 (b) RAF = 0.5 & TF = 1 1.0 1.0 0.8 0.8 0.6 0.6 0.4 0.4 TF = 1 reads = 3 True Positive Rate Positive True TF = 5 Rate Positive True reads = 30 0.2 TF = 10 0.2 reads = 250 TF = 20 reads = 520 TF = 30 reads = 1000 0.0 0.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 False Positive Rate False Positive Rate Figure S3: BaalChIP simulation performance with varying number of transcription factors (a) and read depth (b). 3 DNA−binding proteins ATF3 BCLAF1 CBX3 CEBPB CEBPD CREB1 CTCF CTCFL E2F6 EGR1 ELF1 ETS1 FOSL1 GABPA GATA2 MAX MEF2A NR2F2 NRSF PML POL2 PU1 RAD21 SIN3AK20 SIX5 SP1 SP2 SRF STAT5A TAF1 TAF7 TEAD4 THAP1 TRIM28 USF1 YY1 ZBTB33 ZBTB7A BAF155 BAF170 BRCA1 cJUN CHD2 ELK1 GCN5 GTF2F1 HCFC1 INI1 IRF3 JUND MAFK MAZ MXI1 NRF1 P300 PRDM1 COREST RFX5 SMC3 SPT20 STAT3 TBP USF2 ZKSCAN1 ZNF143 BCL3 FOSL2 FOXA2 TCF12 FOXM1 GATA3 HDAC2 FOXA1 BHLHE40 HNF4A HNF4G MBD4 MYBL2 NFIC RXRA ZEB1 PBX3 cFOS cMYC ATF2 BCL11A NANOG POU5F1 SP4 BATF EBF1 IRF4 MEF2C MTA3 NFATC1 PAX5 POU2F2 RUNX3 TCF3 CHD1 K562 HeLa A549 MCF7 T47D HepG2 SKNSH HL60 MCF10 H1hESC GM12878 GM12891 Additional File 4: Tabledetailed S3. description includinganalyses. accession Marks numbers for can cancerFigure and S4: be non-cancer Data found cell-lines matrix summarizing are in the coloured ENCODE Additional in samples used File pink in and this 3: study blue, for respectively. allele-specific Table More binding S2 and GM12892 IMR90 4 a c blacklist 95 highcov mappability intbias 90 pass replicates_merged 85 Only1Allele ih callsRight (%) 80 75 30 35 40 45 50 b read length 1.00 key blacklist 0.75 highcov mappability 0.50 intbias replicates_merged 0.25 Only1Allele itrdTtl(%) Filtered/Total 0.00 pass K562 A549 HeLa HL60 T47D MCF7 IMR90 HepG2 MCF10 SKNSH H1hESC GM12878 GM12891 GM12892 cell line Figure S5: BaalChIP SNP filtering. (a) Percentage of SNPs with the correct number of aligned reads based on simulations of reads of different read lengths (28 to 50 mer). The percentage of correct calls increased with read length. (b) Proportion of SNPs that were filtered out in each filtering step. (b) The average proportion of SNPs that were filtered out in each filtering step. Average was calculated accross all 14 ENCODE cell lines. 5 a cancer, A549 cancer, HeLa cancer, HepG2 cancer, HL60 0.8 0.6 0.4 0.2 cancer, K562 cancer, MCF7 cancer, SKNSH cancer, T47D 0.8 0.6 0.4 0.2 non−cancer, GM12878 non−cancer, GM12891 non−cancer, GM12892 non−cancer, H1hESC RAF 0.8 0.6 0.4 0.2 non−cancer, IMR90 non−cancer, MCF10 0.8 0.6 0.4 0.2 0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00 AR (ChIP-seq) b celltype 0.6 cancer non−cancer 0.4 0.2 rprinof sites inProportion CNA regions 0.0 0.0 0.2 0.4 0.6 Correlation ChIP−seq AR vs. RAF scores Figure S6: Correlation between the BAF values and the ChIP-seq allelic ratios (AR). (a) Scatter plots showing the correlation between RAF and AR at each SNP, for all considered cell lines of the ENCODE dataset. RAF corresponds to the BAF value with respect to the reference allele (RAF is equal to BAF if the reference allele corresponds to the B allele; RAF is equal to 1-BAF if the reference allele corresponds to the A allele). The blue line shows the fitted linear regression line. (b) Positive relationship between the Spearman correlation coefficient (obtained from the correlation of AR with RAF scores) and the proportion of sites in copy-number altered (CNA) regions. Each dot in the scatter plot corresponds to a cell line. Sites in copy number changed regions correspond to heterozygous SNPs with BAF score < 0.4 or BAF > 0.6. 6 a b Without RAF correction Cancer cell line K562 HeLa A549 With RAF correction Non-cancer cell line 3 3 3 0.125 2 2 2 ● 1 1 1 0 0 0 0.100 ● 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 ● MCF7 T47D HepG2 ● ● 3 4 3 0.075 3 2 2 ● 2 1 1 1 ● ● 0.050 ● 0 0 0 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Proportion of ASB sites ●● SKNSH HL60 MCF10 ● ● 0.025 ● ● 3 3 3 ● 2 2 2 1 1 1 density 0.0 0.2 0.4 0.6 0 0 0 C 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.20 ● H1hESC GM12878 GM12891 4 4 3 3 3 ● 2 2 2 0.15 1 1 1 ● ● 0 0 0 ● 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 GM12892 IMR90 0.10 ● 4 ● 3 ● 3 ChIP-seq AR (not corrected) ● ● 2 ● 2 ● ● ● Corrected AR Proportion of ASB sites 1 1 0.05 ● 0 0 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Allelic Ratio (AR) 0.0 0.2 0.4 0.6 Proportion of tested sites in CNA regions Figure S7: Effects of BaalChIP adjustment of the allelic ratios (AR) after correcting for the background Reference Allele Frequency (RAF) bias. (a) Density plots showing the distribution of allelic ratios before (green) and after (orange) BaalChIP correction. The ARs before correction were calculated directly from the ChIP-seq data and refer to the number of reads in the reference allele divided by the total number of reads. The corrected AR values were estimated by BaalChIP model after taking into account the RAF bias. The adjustment of AR is more dramatic in cancer cell lines. (b) Proportion of ASB sites detected with (orange) and without (green) RAF correction as a function of the proportion of sites within copy-number altered (CNA) regions. Each dot refers to a cell line. Sites in regions of copy number change refer to any SNP with a BAF score higher than 0.6 or lower than 0.4. Cancer cell lines have a higher proportion of SNPs in CNA regions and benefit more from BaalChIP correction. (c) Same as (b) but after selecting SNPs with 30X-40X coverage. 7 With RAF correction Without RAF correction 0.12 0.08 cell type cancer non-cancer 0.04 0.00 Proportion of sites of ASB Proportion 20 30 40 50 20 30 40 50 Average coverage at tested sites Figure S8: Proportion of identified ASB SNPs per cell line. Correlation between the proportion of identified ASB SNPs and the average coverage at all assayed heterozygous SNPs.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    16 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us