Supplementary Figures and Tables

[Figure: flowchart of the first three pipeline stages. Alignment: Hi-C reads with a ligation site, End 1 and End 2 aligned separately; chimeric reads (>25 bps), uni-reads, multi-reads. Read ends pairing: uniquely mapping read pairs, multi-mapping read pairs, unmapped reads, singleton reads, low-quality multi-reads (>99). Valid fragment filtering: distances d1 and d2; thresholds shown are 50 bps < d1 + d2 < 800 bps, d1 + d2 > 800 bps, d1 + d2 < 50 bps, <25k bps, and >25k bps; outcomes shown are valid read pairs, invalid alignments, short-range contacts, dangling end, self circle, and religation.]

Supplementary Figure 1 mHi-C pipeline (Alignment - Read ends pairing - Valid fragment filtering).
1. Read ends are aligned to the reference genome separately, allowing multi-reads, and chimeric reads are rescued.
2. Read ends are paired by their read query names. Multi-reads form more than one read pair with the same read query name. Read ends that fail to align form either unmapped reads or singleton reads and are discarded. Multi-reads with ends aligning to more than 99 positions are regarded as low-quality multi-reads and are excluded from the downstream analysis.
3. Validation checking filters out short-range contacts and alignments far away from restriction enzyme recognition sites. Contacts residing within the same restriction fragment (i.e., dangling end or self circle) or in adjacent fragments (religation) are discarded.
The above three processing steps are applied to each read independently, enabling parallel implementation.
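The validation check in step 3 can be sketched as a small classifier over one read pair. This is only an illustration under stated assumptions, not the mHi-C implementation: the `ReadEnd` record, the `classify_pair` function, the order of the checks, the handling of values exactly at a threshold, and the restriction of the 25k bps distance rule to same-chromosome pairs are all hypothetical choices. The numeric thresholds are the ones shown in the figure. The source does not say how dangling ends are told apart from self circles, so both are reported together as `same_fragment`.

```python
from dataclasses import dataclass

# Thresholds taken from the labels in Supplementary Figure 1.
MIN_D_SUM = 50          # d1 + d2 must exceed 50 bps
MAX_D_SUM = 800         # d1 + d2 must stay below 800 bps
MIN_DISTANCE = 25_000   # pairs closer than 25k bps are short-range contacts


@dataclass(frozen=True)
class ReadEnd:
    """One aligned read end (hypothetical record layout)."""
    chrom: str
    pos: int            # alignment position on the chromosome
    fragment: int       # index of the restriction fragment on that chromosome
    dist_to_site: int   # d1 or d2: distance to the restriction site


def classify_pair(a: ReadEnd, b: ReadEnd) -> str:
    """Return a label for a read pair; only "valid" pairs are kept."""
    if a.chrom == b.chrom:
        if a.fragment == b.fragment:
            # Dangling end or self circle (not distinguished in this sketch).
            return "same_fragment"
        if abs(a.fragment - b.fragment) == 1:
            return "religation"
        if abs(a.pos - b.pos) < MIN_DISTANCE:
            return "short_range"
    d_sum = a.dist_to_site + b.dist_to_site
    if not (MIN_D_SUM < d_sum < MAX_D_SUM):
        # Alignments too far from (or implausibly close to) restriction sites.
        return "invalid_alignment"
    return "valid"


if __name__ == "__main__":
    end1 = ReadEnd("chr1", 1_000, 3, 100)
    end2 = ReadEnd("chr1", 500_000, 120, 200)
    print(classify_pair(end1, end2))
```

Because each pair is classified without reference to any other pair, a function of this shape can be mapped over reads in parallel, which is the property the caption points out.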

[Figure: flowchart of the remaining pipeline stages. Valid read pairs (50 bps < d1 + d2 < 800 bps, >25k bps) enter duplicate removal; panels A and B show uni-reads and multi-reads, with alignments labelled "1 mismatch" and "2 mismatches". Genome binning: 40Kb bins; uni-bin pairs, multi-bin pairs, and multi-reads reduced to uni-bin pairs. mHi-C: example allocation probabilities 0.9 and 0.1; multi-reads reduced to multi-bin pairs. Contact matrix: Bin k by Bin j, with an example entry of 3 contact counts.]

Supplementary Figure 2 mHi-C pipeline (Duplicate removal - Genome binning - mHi-C).
4. PCR duplicates are removed such that, when a uni-read and a multi-read have the same alignment position and strand direction, the uni-read is kept. When multi-reads overlap with other multi-reads, the ones with alphabetically larger IDs are removed.
5. The genome is split into fixed-size non-overlapping intervals (i.e., bins) or into intervals of a fixed number of restriction fragments; as a result, read alignment position pairs are reduced to bin pairs. Multi-reads whose candidate alignment positions all fall into the same bin are reduced to uni-bin pairs.
6. The mHi-C model estimates an allocation probability for each potential contact and enables filtering of contacts by thresholding this allocation probability.
7. Uni-reads and thresholded multi-reads are used to construct the contact matrix.
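Steps 4, 5, and 7 can be sketched in a few lines. This is only an illustration under stated assumptions: the `Alignment` record and all function names are hypothetical, the 40 Kb bin size is the one drawn in the figure, the 0.5 default threshold is an arbitrary illustrative value, and the allocation probabilities are taken as given inputs rather than estimated (the mHi-C model of step 6 is not reproduced here). Breaking ties among duplicate uni-reads by read ID, and keeping only the most probable candidate of a multi-read when it exceeds the threshold, are also assumptions.

```python
from collections import Counter
from typing import NamedTuple

BIN_SIZE = 40_000  # bin width drawn in Supplementary Figure 2


class Alignment(NamedTuple):
    """One aligned read pair (hypothetical record layout)."""
    read_id: str
    is_multi: bool
    key: tuple  # alignment positions and strands of both ends


def remove_duplicates(alignments):
    """Keep one alignment per key: uni-reads beat multi-reads,
    then the alphabetically smaller read ID wins."""
    best = {}
    for aln in alignments:
        kept = best.get(aln.key)
        if kept is None or (aln.is_multi, aln.read_id) < (kept.is_multi, kept.read_id):
            best[aln.key] = aln
    return list(best.values())


def to_bin_pair(chrom1, pos1, chrom2, pos2, bin_size=BIN_SIZE):
    """Map a pair of alignment positions to an order-independent bin pair."""
    a = (chrom1, pos1 // bin_size)
    b = (chrom2, pos2 // bin_size)
    return (a, b) if a <= b else (b, a)


def reduce_to_bin_pairs(candidates, bin_size=BIN_SIZE):
    """Collapse a read's candidate position pairs to distinct bin pairs.
    A result of size one means the multi-read became a uni-bin pair."""
    return {to_bin_pair(*c, bin_size=bin_size) for c in candidates}


def build_contact_counts(uni_bin_pairs, multi_allocations, threshold=0.5):
    """Count uni-bin pairs plus multi-reads whose most probable
    candidate bin pair exceeds the threshold."""
    counts = Counter(uni_bin_pairs)
    for probs in multi_allocations:
        best_pair, best_prob = max(probs.items(), key=lambda kv: kv[1])
        if best_prob > threshold:
            counts[best_pair] += 1
    return counts


if __name__ == "__main__":
    kj = to_bin_pair("chr1", 10_000, "chr1", 90_000)
    other = to_bin_pair("chr1", 10_000, "chr1", 250_000)
    counts = build_contact_counts([kj, kj], [{kj: 0.9, other: 0.1}])
    print(counts[kj])
```

Sorting the two bins inside `to_bin_pair` keeps the contact counts symmetric, so a pair observed as (k, j) or (j, k) increments the same entry.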

[Figure: two bar charts over replicates rep1 to rep6. Panel a y-axis: # of Reads. Panel b y-axis: Percentage of Multi-reads, with bar-top labels ranging from 9.6% to 12.24%. Legend: read end 1 w/o chimeric reads, read end 1 chimeric reads, read end 2 w/o chimeric reads, read end 2 chimeric reads.]

Supplementary Figure 3 Multi-reads due to chimeric reads (IMR90). Both chimeric reads and multi-reads require extra processing to rescue. a. Numbers of read ends with and without chimeric rescue are displayed along with what percentage of these sets are multi-reads. Darker shades on the bars represent multi-reads. Multi-reads constitute a larger percentage of the usable reads compared to chimeric reads. b. Same information as (a) but displayed in terms of percentages. The actual percentages of multi-reads (y-axis) for each category are also displayed on top of each bar. As expected, chimeric reads lead to larger percentages of multi-reads.

[Figure: five panels for the P. falciparum samples Rings, Trophozoites Rep1, Trophozoites Rep2, and Schizonts. Panels a, b: bar charts of # of Reads and Percentage of Multi-mapping Reads per sample and lane, with the same read end 1/2, with/without chimeric reads legend as Supplementary Figure 3. Panel c: Uni&Multi-mapping Bin-pair Contact Count against Uni-mapping Bin-pair Contact Count, coloured by Count. Panel d: Change in the Number of Significant Contacts (*100%) against Multi-mapping Reads Posterior Probability Thresholding (0.5 to 0.9), at FDR levels 0.001, 0.01, and 0.05; legend Gain, Loss. Panel e: Uni-setting and Uni&Multi-setting contact sets; legend Uni (FDR 1%), Uni.Specific (Uni&Multi FDR 1%), Uni.Specific (Uni&Multi FDR 10%), Uni&Multi (FDR 1%), Uni&Multi.Specific (Uni FDR 1%), Uni&Multi.Specific (Uni FDR 10%).]

Supplementary Figure 4 Multi-reads due to chimeric reads and improvement in the number of significant contacts due to multi-reads (P. falciparum). a, b. Same as Supplementary Figure 3a, b, but for P. falciparum. c. mHi-C leads to improved bin coverage by the Uni&Multi-setting compared to the Uni-setting across all the P. falciparum samples. Dashed line is y = x. d. Percentage change in the numbers of significant contacts: red and blue depict gain and loss of the Uni&Multi-setting compared to the Uni-setting, respectively. e. Recovery of significant contacts identified at FDR 1% by analysis at FDR 10%. Uni&Multi.Specific (Uni FDR 10%) is the set of significant contacts that are identified at 1% FDR by the Uni&Multi-setting but remain unrecoverable by the Uni-setting even with a liberal FDR of 10%. A more detailed explanation is provided in Supplementary Figure 10.

[Figure: contact matrix heatmaps comparing the Uni-setting and the Uni&Multi-setting. Panel a: Chromosome 1, 74.50 MB to 93.00 MB. Panel b: Chromosome 3, 65.20 MB to 83.70 MB.]

Supplementary Figure 5 Gaps in contact matrices are filled in after incorporating multi-reads (IMR90).
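
The setting-specific sets named in Supplementary Figure 4e can be read as set differences between contacts called significant under the two settings. The sketch below illustrates that reading only: the function names, the q-value dictionaries, and the use of `q <= fdr` as the significance rule are assumptions, and the example values are made up for the demonstration.

```python
def significant(qvalues, fdr):
    """Contacts whose q-value passes the given FDR level (assumed rule: q <= fdr)."""
    return {contact for contact, q in qvalues.items() if q <= fdr}


def specific_contacts(q_this, q_other, strict=0.01, liberal=0.10):
    """Contacts significant in one setting at the strict FDR that the other
    setting does not recover even at the liberal FDR."""
    return significant(q_this, strict) - significant(q_other, liberal)


if __name__ == "__main__":
    # Made-up q-values keyed by bin pair, for illustration only.
    q_uni = {("b1", "b2"): 0.005, ("b1", "b3"): 0.08, ("b2", "b4"): 0.40}
    q_uni_multi = {("b1", "b2"): 0.002, ("b1", "b3"): 0.004, ("b2", "b4"): 0.009}

    # Analogue of "Uni&Multi.Specific (Uni FDR 10%)".
    print(specific_contacts(q_uni_multi, q_uni))
```

A contact missing from the other setting's dictionary is treated as not significant there, so it stays in the specific set.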