A Statistical Test on Single-Cell Data Reveals Widespread Recurrent Mutations in Tumor Evolution”

Supplementary material for \A statistical test on single-cell data reveals widespread recurrent mutations in tumor evolution" Jack Kuipers1;2;?, Katharina Jahn1;2;∗, Benjamin J. Raphael3, and Niko Beerenwinkel1;2 1 Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland 2 SIB Swiss Institute of Bioinformatics, Basel, Switzerland 3 Department of Computer Science, Princeton University, Princeton, NJ, USA Supplementary Figures x x x x x x x x x loss of x heterozygosity sampled cells non-sampled cells observed back mutation, samples with and without loss sampled unobserved incidence of a parallel mutation observed unique mutation unobserved back mutation of observed mutation observed parallel mutation, both incidences sampled unobserved mutations due to subsampling observed incidence of a parallel mutation x unobserved mutations, extinct at sampling time observed mutation with unobserved back mutation x end of extinct lineage Supp. Fig. 1: Left: A cell lineage tree illustrating somatic cell evolution. Due to subsampling and extinct lineages only part of the mutation history is reconstructable (black edges). Therefore some recurrent mutations may be mistaken as being unique (pink pentagon), or their back mutation may be overlooked (yellow diamond). Right: Reconstructable part of the mutation history showing that loss of heterozygosity is the cause of the back mutation indicated by the red star: the chromosome segment containing the SNV is lost leaving only the normal allele on the matching chromosome behind. ? Equal contributors 120 120 100 100 80 80 60 60 40 40 Number of samples Number of samples 20 20 −4 −3 −2 −1 0 1 2 −4 −3 −2 −1 0 1 2 (a) Log10 Bayes factor (b) Log10 Bayes factor Supp. Fig. 2: The range of Bayes factors estimated from simulated data with no recurrent mutations for trees with 20 mutations and up to 120 sampled cells. The false negative rate is 10% in (a) and 20% in (b). 120 120 100 100 80 80 60 60 40 40 Number of samples Number of samples 20 20 −0.2 0.0 0.2 0.4 0.6 −0.2 0.0 0.2 0.4 0.6 (a) Log10 Bayes factor per sample (b) Log10 Bayes factor per sample Supp. Fig. 3: The range of Bayes factors estimated from simulated data with a single recurrent mutations for trees with 20 mutations and up to 120 sampled cells. To compare the Bayes factors for different numbers of samples, their log values are divided by the number of samples. The false negative rate is 10% in (a) and 20% in (b). 120 120 80 80 Number of samples Number of samples 40 40 −4 −3 −2 −1 0 1 2 −4 −3 −2 −1 0 1 2 (a) Log10 Bayes factor (b) Log10 Bayes factor Supp. Fig. 4: The range of Bayes factors estimated from simulated data with no recurrent mutations for trees with 40 mutations and up to 120 sampled cells. The false negative rate is 10% in (a) and 20% in (b). 120 120 80 80 Number of samples Number of samples 40 40 −0.2 0.0 0.2 0.4 0.6 −0.2 0.0 0.2 0.4 0.6 (a) Log10 Bayes factor per sample (b) Log10 Bayes factor per sample Supp. Fig. 5: The range of Bayes factors estimated from simulated data with a single recurrent mutation for trees with 40 mutations and up to 120 sampled cells. To compare the Bayes factors for different numbers of samples, their log values are divided by the number of samples. The false negative rate is 10% in (a) and 20% in (b). 5 5 4 4 3 3 2 2 1 1 Number of doublets Number of doublets 0 0 −4 −2 0 2 4 6 −4 −2 0 2 4 6 (a) Log10 Bayes factor (b) Log10 Bayes factor Supp. Fig. 6: The range of Bayes factors estimated from simulated data with 20 mutations and no recurrent mutation for an increasing number of doublets out of 50 sampled cells. Doublet modeling is not employed in (a) but its use in (b) removes the spurious signals from sequencing two cells at once. 50 50 40 40 30 30 20 20 10 10 Number of doublets Number of doublets 5 5 0 0 0 5 10 15 −4 −3 −2 −1 0 1 2 (a) Log10 Bayes factor (b) Log10 Bayes factor Supp. Fig. 7: The range of Bayes factors estimated from simulated data with 20 mutations and no recurrent mutation for an increasing number of doublets out of 50 sampled cells. The BFs in (a) are calculated without modeling the presence of doublets, while they are accounted for in (b). 50 5 40 4 3 30 2 20 1 10 Number of doublets Number of doublets 5 0 0 0 5 10 0 5 10 15 (a) Log10 Bayes factor (b) Log10 Bayes factor Supp. Fig. 8: The range of Bayes factors estimated from simulated data with 20 mutations and a single recurrent mutation for an increasing number of doublets out of 50 sampled cells. The doublet rate increases to 10% in (a) and 100% in (b). 1.0 1.0 ρ = 0.95 ρ = 0.95 0.8 0.8 0.6 0.6 0.4 0.4 Inferred doublet rate doublet Inferred rate doublet Inferred 0.2 0.2 0.0 0.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 (a) Real doublet rate (b) Real doublet rate Supp. Fig. 9: The doublet rates estimated from simulated data with 20 mutations and 50 sampled cells as the fraction of doublets increases. The simulations do not include any recurrent mutations in (a) but permit a single recurrence in (b). 1.0 1.0 ρ = 0.99 ρ = 0.98 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 Recurrent mutation doublet rate doublet Recurrent mutation rate doublet Recurrent mutation 0.0 0.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 (a) Infinite sites doublet rate (b) Infinite sites doublet rate Supp. Fig. 10: The doublet rates estimated from simulated data with 20 and 50 sampled cells for a range of doublet rates. For each simulated tree, the doublet rate is learnt under both the infinite sites hypothesis and allowing for a recurrent mutation. The simulations do not include any recurrent mutations in (a) and possess a single recurrence in (b). ABCB5 ABCB5 DNAJC17 DNAJC17 SESN2 PABPC1 DMXL1 PABPC1 PDE4DIP NTRK1 DLEC1 SESN2 NTRK1 DLEC1 DMXL1 TOP1MT TOP1MT ST13 ST13 PDE4DIP ANAPC1 ANAPC1 USP32 USP32 ARHGAP5 ARHGAP5 FAM115C FAM115C FRG1 ASNS FRG1 ASNS RETSAT MLL3 MLL3 RETSAT RETSAT (a) (b) Supp. Fig. 11: The best scoring trees learnt for the data on 18 selected mutations from the whole exome sequencing of 58 cells from a JAK2 -negative myeloproliferative neoplasm [Hou et al., 2012]. The tree in (a) respects the infinite sites hypothesis while (b) allows for the highest scoring recurrent mutation which occurs in the gene RETSAT . The presence or absence of the point mutation in PDE4DIP is unknown for over 60% of the single cells, resulting in high uncertainty and variability in its placement. PIP4K2A PIP4K2A COL5A3 SLC11A2 SLC11A2 THRAP1 ZNF462 ZNF462 NOS1 XPNPEP1 CCKBR NOS1 C8orf45 C17orf27 KIAA0226 C8orf45 FLJ38288 FGFR4 FGFR4 THRAP1 COL5A3 CCKBR C17orf27 KIAA0226 XPNPEP1 FLJ38288 DEPDC2 F2RL2 SCYL2 ADAMTS10 PTPRF SCYL2 ADAMTS10 GPR158 ZBTB2 DEPDC2 KIF6 RPL8 SH3GL1 KIF6 C1orf107 KIAA1718 KIAA1718 SLC36A2 ZBTB2 RPL8 C1orf107 F2RL2 PTPRF MCCC2 SH3GL1 FCRLM1 SLC36A2 PTPRT ZNF71 GPR158 LRRK2 ZNF71 FCRLM1 LRRK2 MCCC2 FLJ10379 PTPRT FLJ10379 PTPRT CDON CDON CUZD1 CUZD1 EPHB4 EPHB4 (a) (b) Supp. Fig. 12: The best scoring trees learnt for the whole exome sequencing of 17 cells from a clear cell renal cell carcinoma [Xu et al., 2012]. The tree in (a) respects the infinite sites hypothesis while (b) allows for the highest scoring recurrent mutation which occurs in the gene PTPRT . BTLA PANK3 DCAF8L1 DCAF8L1 PIK3CA ITGAD ITGAD LSG1 DNM3 DNM3 LSG1 FCHSD2 FCHSD2 PIK3CA BTLA PANK3 CASP3 TRIM58 FUBP3 CALD1 H1ENT FUBP3 CALD1 TRIM58 CASP3 MARCH11 PANK3 TCP11 DUSP12 DUSP12 TCP11 PITRM1 MARCH11 FBN2 FBN2 PITRM1 PPP2RE PPP2RE ROPN1B GPR64 ROPN1B GPR64 PRDM9 ZEHX4 ZNE318 MUTYH PLXNA2 PRDM9 ZEHX4 ZNE318 MUTYH PLXNA2 CABP2 DKEZ WDR16 C1orf223 SEC11A CABP2 H1ENT DKEZ WDR16 C1orf223 SEC11A TRIB2 C15orf23 GLCE KIAA1539 RABGAP1L TRIB2 C15orf23 GLCE KIAA1539 RABGAP1L CNDP1 FGFR2 CNDP1 FGFR2 CXXC1 TECTA TECTA CXXC1 (a) (b) Supp. Fig. 13: The best scoring trees learnt for the whole exome sequencing of 47 cells from a estrogen- receptor positive (ER+) breast cancer tumor [Wang et al., 2014]. The tree in (a) respects the infinite sites hypothesis while (b) allows for the highest scoring recurrent mutation which occurs in the gene PANK3 . MAL2_chr8_120255800 MAL2_chr8_120255800 RYR3_chr15_33952506 RYR3_chr15_33952506 TUFT1_chr1_151512803 PTPRQ_chr12_81013241 TUFT1_chr1_151512803 PTPRQ_chr12_81013241 BBS4_chr15_73021943 OAZ3_chr1_151735456 BBS4_chr15_73021943 OAZ3_chr1_151735456 HIST1H2AG_chr6_27101163 HIPK4_chr19_40895668 HIST1H2AG_chr6_27101163 HIPK4_chr19_40895668 EPHA10_chr1_38226084 SERPINF2_chr17_1657484 EPHA10_chr1_38226084 SERPINF2_chr17_1657484 DOCK3_chr3_51263127 TTN_chr2_179615995 DOCK3_chr3_51263127 TTN_chr2_179615995 RGS11_chr16_320709 INTS8_chr8_95877709 RGS11_chr16_320709 INTS8_chr8_95877709 CAMSAP1_chr9_138702709 PPIG_chr2_170493981 CAMSAP1_chr9_138702709 PPIG_chr2_170493981 SMOC1_chr14_70418996 MYOM3_chr1_24390601 SMOC1_chr14_70418996 MYOM3_chr1_24390601 EYA4_chr6_133850054 ZNF540_chr19_38103617 EYA4_chr6_133850054 ZNF540_chr19_38103617 MAL2_chr8_120255800 (a) (b) Supp. Fig. 14: The best scoring trees learnt for the mutation data for patient 1 from the leukemia dataset of Gawad et al. [2014]. The tree in (a) respects the infinite sites hypothesis while (b) allows for the highest scoring recurrent mutation on chromosome 8 at position 120255800 (hg19) in the gene MAL2 . RIMS2_chr8_105025789 RIMS2_chr8_105025789 SIGLEC10_chr19_51917709 SIGLEC10_chr19_51917709 PLEC_chr8_144995042 PLEC_chr8_144995042 FGD4_chr12_32793457 LINC00052_chr15_88122589 ZC3H3_chr8_144550709 FGD4_chr12_32793457 LINC00052_chr15_88122589 ZC3H3_chr8_144550709 BDNF-AS_chr11_27698664 XPO7_chr8_21827088 BDNF-AS_chr11_27698664 XPO7_chr8_21827088 RRP8_chr11_6622747 BRD7P3_chr6_118822648 RRP8_chr11_6622747 BRD7P3_chr6_118822648 PCDH7_chr4_31148165 PCDH7_chr4_31148165 RIMS2_chr8_105025789 FAM105A_chr5_14612901 FAM105A_chr5_14612901 ATRNL1_chr10_117154148 INHA_chr2_220440226 ATRNL1_chr10_117154148 INHA_chr2_220440226 TRRAP_chr7_98575882 CMTM8_chr3_32280281 TRRAP_chr7_98575882 CMTM8_chr3_32280281 (a) (b) Supp.

A Statistical Test on Single-Cell Data Reveals Widespread Recurrent Mutations in Tumor Evolution”

Circular RNA Hsa Circ 0005114‑Mir‑142‑3P/Mir‑590‑5P‑ Adenomatous

Genetic and Genomic Analysis of Hyperlipidemia, Obesity and Diabetes Using (C57BL/6J × TALLYHO/Jngj) F2 Mice

Gwas Meta-Analysis of Intelligence – Supplementary Note 1

Whole Exome Sequencing in Families at High Risk for Hodgkin Lymphoma: Identification of a Predisposing Mutation in the KDR Gene

A Meta-Analysis of Gene Expression Data Highlights Synaptic Dysfunction

Data Sheet 140

Loss of Twist1 in the Mesenchymal Compartment Promotes Increased

Role of PDZ-Binding Motif from West Nile Virus NS5 Protein on Viral

The Molecular Machinery of Neurotransmitter Release Nobel Lecture, 7 December 2013

Rabbit Anti-Epac2/FITC Conjugated Antibody

GSE50161, (C) GSE66354, (D) GSE74195 and (E) GSE86574

Analysis and Functional Evaluation of the Hair-Cell Transcriptome