Comprehensive evaluation of protein coding mononucleotide microsatellites in microsatellite-unstable colorectal cancer

Supplementary material

Johanna Kondelin1,2, Alexandra E. Gylfe1,2, Sofie Lundgren1,2, Tomas Tanskanen1,2, Jiri Hamberg1,2, Mervi Aavikko1,2, Kimmo Palin1,2, Heikki Ristolainen1,2, Riku Katainen1,2, Eevi Kaasinen1,2, Minna Taipale3,4, Jussi Taipale1,2,3,4, Laura Renkonen-Sinisalo5, Heikki Järvinen5, Jan Böhm6, Jukka-Pekka Mecklin7, Pia Vahteristo1,2, Sari Tuupanen1,2, Lauri A. Aaltonen1,2,3, Esa Pitkänen1,2

1 Department of Medical and Clinical Genetics, Medicum, University of Helsinki, Helsinki, Finland
2 Genome-Scale Biology Research Program, Research Programs Unit, University of Helsinki, Helsinki, Finland
3 Department of Biosciences and Nutrition, Karolinska Institutet, Solna, Sweden
4 Science for Life Center, Huddinge, Sweden
5 Department of Surgery, Helsinki University Central Hospital, Hospital District of Helsinki and Uusimaa, Helsinki, Finland
6 Department of Pathology, Jyväskylä Central Hospital, Jyväskylä, Finland
7 Department of Surgery, Jyväskylä Central Hospital, University of Eastern Finland, Jyväskylä, Finland

Supplementary Figures
Supplementary Figure 1
Supplementary Figure 2
Supplementary Figure 3
Supplementary Figure 4
Supplementary Figure 5
Supplementary Figure 6
Supplementary Figure 7
Supplementary Figure 8
Supplementary Figure 9
Supplementary Figure 10
Supplementary Figure 11
Supplementary Figure 12
Supplementary Figure 13
Supplementary Figure 14
Supplementary Figure 15
Supplementary Figure 16
Supplementary Figure 17
Supplementary Figure 18
Supplementary Figure 19
Materials and Methods
Extended literature evaluation of the candidate genes
References

Supplementary Figures

Supplementary Figure 1. The length distribution of short somatic insertions and deletions in the protein coding region of 24 MSI CRCs. Positive values denote insertions, negative values deletions.

Supplementary Figure 2.
The frequency of mononucleotide microsatellite sites with coverage ≥5 in both the MSI tumor and the respective normal sample (averages over 24 tumors). Microsatellites of length >40 are not shown (n=3). Blue, A/T repeats; green, C/G repeats.

Supplementary Figure 3. Mutation frequencies at A/T mononucleotide microsatellites by repeat length in 24 MSI CRCs. Error bars indicate Jeffreys binomial proportion 95% confidence intervals.

Supplementary Figure 4. Mutation frequencies at C/G mononucleotide microsatellites by repeat length in 24 MSI CRCs. Error bars indicate Jeffreys binomial proportion 95% confidence intervals.

Supplementary Figure 5. The number of somatic deletions in 24 MSI CRCs with respect to the mononucleotide microsatellite length and deletion length. Each dot corresponds to deletions of a specific number of nucleotides (Y-axis) that have occurred at a microsatellite of a specific length (X-axis), with higher deletion counts shown in brighter colors. Note that deletions longer than the microsatellite length are also shown here (e.g., CAAACG>CCG).

Supplementary Figure 6. The number of somatic insertions in 24 MSI CRCs with respect to the mononucleotide microsatellite context and insertion length. Each dot corresponds to insertions of a specific number of nucleotides (Y-axis) that have occurred at a microsatellite of a specific length (X-axis), with higher insertion counts shown in brighter colors.

Supplementary Figure 7. The frequency of somatic indels in 24 MSI CRCs with respect to frameshift status and mononucleotide microsatellite sequence context. A total of 70% of frameshift indels occurred at mononucleotide microsatellites with a minimum length of three nucleotides. In contrast, 16% of the inframe indels occurred at mononucleotide microsatellites.

Supplementary Figure 8. Left: a histogram of model p values for exome sequencing data of 24 MSI CRCs. Right: a Q-Q plot of observed and expected -log10(p) values.

Supplementary Figure 9.
Mutation frequency of the top 53 genes in Set A in exome sequencing of 24 MSI CRCs.

Supplementary Figure 10. Mutation frequency and significance (-log10(q)) in the exome sequencing data shown for the set of 24 MSI CRCs. For each gene, the smallest q-value of the repeat sites within the gene is shown. Genes with -log10(q)>14 or more than 14 mutated tumors are labelled. Regression line with 95% confidence intervals is shown.

Supplementary Figure 11. Venn diagram of the genes selected for further validation according to the three criteria: 1) Model, significantly mutated in our exome sequencing data (Set A); 2) WXS, frequently mutated in our exome sequencing data (Set B); 3) Kim et al., frequently mutated in the data of Kim et al. (Set C).

Supplementary Figure 12. Mapped base pairs per tumor (n=93) in the MiSeq experiment. Median indicated by a vertical line.

Supplementary Figure 13. Medians of sequencing coverage at coding indel variants in MiSeq data of 93 MSI CRCs. Median of medians indicated by a vertical line.

Supplementary Figure 14. Mutation frequency of the genes in Set B in exome sequencing of 24 MSI CRCs and MiSeq sequencing of 93 MSI CRCs. One of the genes of Set B, OR7E24, did not amplify by PCR and is therefore not found in the plot.

Supplementary Figure 15. Mutation frequencies for 17 microsatellites in nine genes observed in exome (WXS), MiSeq and Sanger sequencing data, as well as in the data of Kim et al. Spearman correlation rho and p-value, and regression line with 95% confidence intervals shown.

Supplementary Figure 16. Mutation frequency and mean normalized allelic fraction (mNAF) in the MiSeq data shown for the 71 genes selected for screening (Sets A, B and C) in the extended set. Color indicates the significance (q-value) of the gene, with darker shades corresponding to lower q-values. Genes with a normalized allelic fraction >60% or the frequency of mutated tumors >80%, as well as FAM111B and MTF1, are labelled.
Regression line with 95% confidence intervals is shown (100,000 bootstrap samples).

Supplementary Figure 17. Indel allelic fractions (green) and normalized allelic fractions (blue) at TGFBR2, CASP5, ACVR2A, KCNMA1, CDH26 and MIS18BP1 in the MiSeq data of 93 MSI CRCs. TGFBR2, CASP5, ACVR2A and KCNMA1 show a right-skewed distribution of normalized allelic fractions indicative of a clonal nature of indels. In contrast, CDH26 and MIS18BP1 show a more uniform distribution. The number of indels is indicated as (n).

Supplementary Figure 18. Mutation frequency of the 18 genes of Set C in Sanger sequencing vs. the data by Kim et al. One of the genes of Set C, OR7E24, did not amplify by PCR and is therefore not shown in the plot.

Supplementary Figure 19. Mutation frequency of the 18 genes of Set C in the 24 exomes vs. the data by Kim et al. Please note that SLC35G2 is not a previously known MSI target gene, whereas CLOCK, which is plotted at the same position, has been shown to play a role in MSI CRC tumorigenesis.

Materials and Methods

Patient Material

All tumor DNAs were extracted from fresh-frozen tissue specimens. A sample set consisting of 93 MSI CRCs, of which 12 were from patients with Lynch syndrome, was available for validation (Supplementary Table 1). The MSI status of the tumors was determined previously(1-2). All tumors fulfilled the MSI-high criteria(3).

Exome Sequencing, Read Mapping and Variant Calling

The coding regions of the genome were enriched with the Agilent SureSelect Human All Exon Kit v1 (Agilent, Santa Clara, CA) according to the manufacturer’s instructions. Paired-end short-read sequencing with a read length of 56-76 base pairs was performed on the Illumina Genome Analyzer II platform (Illumina, Inc, San Diego, CA) at Karolinska Institute (Huddinge, Sweden), and the Finnish Institute for Molecular Medicine (FIMM) Genome and Technology Center, Finland. The read mapping and variant calling were conducted as in our previous studies(4-6).
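The error bars in Supplementary Figures 3 and 4 are described as Jeffreys binomial proportion 95% confidence intervals. As a minimal sketch of how such an interval can be computed, the Python below takes the quantiles of a Beta(x + 1/2, n - x + 1/2) distribution for x mutated sites out of n covered sites. The supplement does not state which software or boundary convention was used, so the function names (`beta_cdf`, `beta_quantile`, `jeffreys_interval`), the common convention of setting the lower limit to 0 when x = 0 and the upper limit to 1 when x = n, and the example counts are all assumptions made for illustration.

```python
import math


def _beta_cf(a, b, x, max_iter=500, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300

    def step(num, c, d):
        # One Lentz update for a continued-fraction term with numerator `num`.
        d = 1.0 + num * d
        if abs(d) < tiny:
            d = tiny
        c = 1.0 + num / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        return c, d, c * d

    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    if abs(d) < tiny:
        d = tiny
    d = 1.0 / d
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        even = m * (b - m) * x / ((a + m2 - 1.0) * (a + m2))
        c, d, delta = step(even, c, d)
        h *= delta
        odd = -(a + m) * (a + b + m) * x / ((a + m2) * (a + m2 + 1.0))
        c, d, delta = step(odd, c, d)
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_cdf(x, a, b):
    """Regularized incomplete beta function I_x(a, b), i.e. the Beta(a, b) CDF."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    # Use the symmetry I_x(a, b) = 1 - I_{1-x}(b, a) where the fraction converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def beta_quantile(p, a, b, iterations=200):
    """Invert the Beta(a, b) CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def jeffreys_interval(mutated, total, level=0.95):
    """Jeffreys interval for a binomial proportion mutated / total."""
    if total <= 0 or not 0 <= mutated <= total:
        raise ValueError("need 0 <= mutated <= total and total > 0")
    alpha = 1.0 - level
    a = mutated + 0.5
    b = total - mutated + 0.5
    # Assumed boundary convention: clamp the limits at the extremes.
    lower = 0.0 if mutated == 0 else beta_quantile(alpha / 2.0, a, b)
    upper = 1.0 if mutated == total else beta_quantile(1.0 - alpha / 2.0, a, b)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical counts: 7 mutated sites among 24 covered sites.
    print(jeffreys_interval(7, 24))
```

The sketch uses only the standard library; where SciPy is available, the two quantiles could instead be taken from `scipy.stats.beta.ppf` with the same shape parameters.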
Variant Analysis in Exome Sequencing Data

A variant in the sequencing data was called a mutation when the following criteria were fulfilled: (1) coverage at the variant position was five or higher, and (2) the mutated allelic fraction was at least 20%. From this list, indels targeting mononucleotide microsatellites in the protein coding region of the genome that resulted in a frameshift were identified. The resulting data were therefore a list of somatic insertions and deletions targeting coding region mononucleotide microsatellites and predicted to result in a frameshift.

MiSeq Sequencing

Sequencing libraries were prepared with the TruSeq Custom Amplicon Index Kit (Illumina) and the TruSeq Custom Amplicon Kit v1.5 (Illumina) at the Functional Genomics Unit (FuGU), Biomedicum, Helsinki, Finland. Paired-end sequencing of 150 bp was performed on the Illumina MiSeq Sequencing System at FuGU. Sequence files were produced with MiSeq Control Software 2.4.1.3. A total of 129,670 bp of coding region DNA was targeted with the MiSeq sequencing.

Read Mapping and Variant Calling of the MiSeq Data

Paired-end MiSeq reads were mapped against the human reference genome (specifically, the 1000 Genomes Project flavor hs37d5) with BWA MEM (version 0.7.12)(7). Overlapping read pair mates were clipped with the bamUtil clipOverlap tool. Regions with suspected indels were realigned with GATK IndelRealigner (GATK version 2.3-9)(8). Base quality scores were then normalized with GATK to produce the BAM files used in subsequent variant calling and analysis. Finally, variants were called by first running GATK HaplotypeCaller with default parameters and then GATK GenotypeGVCFs with default parameters except for the minimum confidence threshold for emitting variants, which