Supplemental Note
Historical account of the benchmark

The ICGC Verification group was tasked with providing guidelines for sequencing and somatic mutation calling. The task was split into two parts: a comparison of how sequencing is done at different sequencing centers, and a comparison of how data analysis is done. Here we present the outcome of the benchmark of somatic mutation calling, which was carried out in two sequential phases: BM1.1 on chronic lymphocytic leukaemia (CLL) and BM1.2 on medulloblastoma (MB). The experience gained from the organization of BM1.1 was used to improve BM1.2.

For BM1.1, sequence reads in FASTQ format corresponding to 40x coverage of a CLL tumor sample and the corresponding normal sample were made available to members of the ICGC. Participants were asked to return VCF v4.1 files for simple somatic mutations (SSMs) and somatic indel mutations (SIMs), as well as a list of structural variations. We specified that submitters should separate calls in which they had high confidence from calls that might be present but in which they had less confidence; the threshold between the two confidence classes was left up to the participants.

We initially received 11 submissions (December 2012). Several issues were noted, such as incomplete or malformed VCF headers, lack of sample genotypes, differences in reference chromosome names, heterogeneity in the annotation of presumably identical indels (left/right alignment of indels), the representation of multiple alternative alleles, and normal alleles different from the reference. These small issues were resolved and a first analysis was presented to the participants. In the initial submissions we observed an alarmingly low degree of agreement among the eleven call sets (Supplemental Figure 1). For example, the number of high-confidence SSMs ranged from 500 to 15,000 (after correction), while the number of SSMs that all call sets agreed on was only 77. For SIMs the picture was even bleaker.
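One of the harmonization issues noted above, left versus right alignment of indels, can be illustrated with a toy sketch. The helper below is a hypothetical illustration, not the harmonization code actually used in the benchmark; production pipelines normalize indels with tools such as `bcftools norm` or `vt normalize`.

```python
def left_align_indel(seq, pos, ref, alt):
    """Shift a VCF-style indel to its leftmost equivalent position.

    seq     -- reference sequence as a string (0-based indexing)
    pos     -- 0-based position of the anchor base
    ref/alt -- alleles sharing a leading anchor base, as in VCF

    Toy illustration of why the same indel can be reported at
    different positions by different callers.
    """
    while (len(ref) > 1 or len(alt) > 1) and ref[-1] == alt[-1] and pos > 0:
        # Both alleles end in the same base: drop it and extend both
        # alleles to the left with the preceding reference base.
        prev = seq[pos - 1]
        ref, alt = prev + ref[:-1], prev + alt[:-1]
        pos -= 1
    return pos, ref, alt
```

For example, in the reference context `GGGCACACAGGG`, a deletion of one `CA` unit can be reported at position 4 as `ACA>A` or at position 2 as `GCA>G`; both describe the same edited sequence, and left alignment maps them to the same representation, which is why such normalization was needed before call sets could be compared.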
The number of SIMs called ranged from 50 to 700, and all call sets agreed on only five SIMs. As the goal of the exercise was not to compete but to cooperate in order to improve the state of affairs, submitters were allowed to revise their submissions several times, each time using feedback on the level of concordance among all submitted call sets, but with no information about specific mutations. The overall results of the resubmissions and additional new submissions were not much improved, except that submissions with very high numbers of SSM/SIM calls revised their call rates to be more in line with the rest of the pack.

During the first phase of the somatic mutation calling benchmark, we initiated a second benchmark on sequencing (cross-reference back-to-back BM2 manuscript), as we realized that, while the quality of the reads supplied for BM1.1 was reasonable for the time at which they had been generated, the quality of sequencing itself had since improved substantially. As the ability to call somatic mutations ultimately depends on the quality of the sequences generated, we initiated BM1.2 and supplied high-quality reads generated by one of the submitters to BM2 (a medulloblastoma tumor-normal pair) to the ICGC somatic mutation calling community for the second part of this benchmark. This time no confidence threshold was prescribed. A similar submission and revision process was set up for BM1.2, but previous experience plus automatic validation and feedback from the submission server reduced the number of validation errors, resulting in more consistent call rates and higher concordance. We received 18 submissions for BM1.2, which were analysed in the same way as the submissions for BM1.1. Not all submitters to BM1.1 participated in BM1.2, and there were new submitters who had not participated in BM1.1 (MB.H, MB.J, MB.M, MB.Q).
Verification and generation of Gold Sets

It is important to note that the genomes and sequence data used for this benchmark are real data and not simulated; the truth was not withheld from participants – it was in fact unknown. At this point, our analysis relied on comparisons among call sets, with the assumption that mutations on which all or almost all submitters agreed (which we will call the high-concordance set) were more likely to be correct. In the absence of absolute truth this can serve as a guide for the improvement of pipelines, as noted above, but subsequent verification of the calls was deemed an important goal if we were to arrive at a closer approximation of the truth.

For verification of BM1.1 calls, we carried out targeted sequencing with two independent Haloplex panels (Agilent). Using all somatic mutations (SSM and SIM) called by at least two submitters, plus up to 500 private SSM calls per submitter, from the set of 14 submissions received to date (see Supplemental Table 1), targets were designed and the experiments run at two different sites (University of Oviedo, Spain and OICR, Canada) on an Ion Torrent PGM and an Illumina MiSeq, respectively, using the same input DNA that was used for the generation of the WGS data. To establish a Gold Set of mutations for CLL, the two Haloplex enrichment datasets were merged with the original Illumina sequencing data. Candidate mutations were classified as outlined in the Online Methods: briefly, mutation calls in agreement among submissions and between verification sets, and carrying no filter flags, were accepted without further review; all others were subjected to additional filters, including depth, snape-cmp-counts scores, mappability and duplicated regions, and, most importantly, to visual inspection in IGV of alignments produced with two different aligners, BWA and GEM.
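The acceptance logic described above amounts to a small decision procedure. The sketch below is a hedged illustration only: the field names, the dict schema, and the depth threshold are assumptions made for the example, not the actual pipeline's data model (the real filters and thresholds are given in the Online Methods).

```python
def triage_call(call):
    """Triage a candidate mutation per the scheme sketched above.

    `call` is a dict; all field names and the depth cutoff (10) are
    illustrative assumptions, not the benchmark's actual schema.
    """
    concordant = call["supporting_submissions"] >= 2
    verified = call["in_both_haloplex_sets"]
    clean = not call["filter_flags"]
    if concordant and verified and clean:
        return "accept"       # concordant, verified, unflagged: no review
    if call["depth"] < 10 or call["low_mappability"]:
        return "reject"       # fails hard filters (threshold assumed)
    return "igv_review"       # remaining calls go to manual IGV inspection
```

The point of such a tiered procedure is to spend manual review effort (IGV inspection with two aligners) only on the ambiguous middle of the call distribution, while clearly concordant and clearly artifactual calls are resolved automatically.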
We established a classifier for all SSM calls (Table 1) that assigns each call a class based on the potential issue it might cause in establishing a correct call. For evaluation, we combined classes into a set of nested tiers (see Online Methods). SIM calls were not subjected to classification. The CLL Gold Set had a total of 1507 bona fide mutation calls across all tiers (see Material and Methods), with 1143, 1239, 1313, and 1320 SSMs in tiers 1, 2, 3 and 4, respectively, and 118 and 134 SIMs in tiers 1 and 4, respectively. This corresponds to a mutation frequency of ~0.5 mutations per Mb. For medulloblastoma mutation calls (BM1.2), we used a 300x dataset from BM2 to generate a Gold Set that had a total of 1620 bona fide mutation calls across all tiers (see Material and Methods), with 962, 1101, 1255, and 1267 SSMs in tiers 1, 2, 3 and 4, respectively, and 337 and 353 SIMs in tiers 1 and 4, respectively. The mutation frequency of this tumor was also close to 0.5 mutations per Mb.

Comparison of selected pipeline components

In addition to testing the effect of the choice of mapper on SSM calling (presented in the main paper), we investigated the effect of genome reference build version on SSM calling. Novoalign2 was used for mapping against the three selected human genome reference builds ("hg19r" being a subset of "b37", which in turn is a subset of "b37+decoy"). Supplemental Table 8 shows that using a larger reference build leads to higher mapping rates and shorter mapping times. The advantage of the decoy sequences lies not in the additional alignments themselves (which fall in likely uninteresting parts of the extended reference) but in the prevention of misalignments, i.e. in providing an appropriate reference target for reads that would otherwise map incorrectly to chromosomal regions.
SSM calling tests showed that smaller reference builds lead to significantly larger unfiltered SSM call sets (removal of the decoy sequences from the reference accounted for a 25–31% increase in SSM call set size in our experiment, depending on the caller; Supplemental Figure 26). The fraction of these additional calls that overlap the Gold Set is, however, negligible (0.00).

qProfiler (http://sourceforge.net/p/adamajava/wiki/qProfiler/) was run on each mapper's alignment files in order to investigate systematic mapping differences that could potentially influence subsequent mutation calling. Of specific interest were the distributions of values in the SAM fields "RNAME" (alignment reference sequence ID), "MAPQ" (alignment mapping quality score), "TLEN" (observed insert size), "CIGAR" (alignment-level indel details together with "soft clippings" – mapper-induced read trimming) and "MD" (alignment mismatch details). The same alignment files were used for mutation calling and the qProfiler analysis. However, since mutation calling results were limited to chromosomes 1-22, X and Y, the alignment files serving as qProfiler input were first filtered to contain only alignments to those chromosomes, so that the statistics would remain comparable (contig coverage statistics being the only exception, since mapping to the decoy contig appears to be important). The qProfiler results indicate that relative mapping rates for the individual chromosomes are generally similar for all the mappers, with minor exceptions for chromosome Y and the decoy contig (Supplemental Figure 27). Distributions of insert sizes are similarly in good overall agreement, especially in the ranges expected to contain the bulk of proper pairs (Supplemental Figure 28).
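The pre-filtering of qProfiler inputs described above can be sketched as follows. This is a minimal pure-Python stand-in operating on SAM text lines, assuming GRCh37-style contig names; in practice such filtering would be done with `samtools view` on indexed BAM files.

```python
# Keep only alignments to the primary chromosomes (1-22, X, Y), as was
# done for the qProfiler inputs; header lines are retained.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y"}
PRIMARY |= {"chr" + name for name in PRIMARY}   # tolerate "chr"-prefixed names

def keep_record(sam_line):
    """True if a SAM text line should survive the chromosome filter."""
    if sam_line.startswith("@"):        # SAM header line
        return True
    rname = sam_line.split("\t")[2]     # SAM column 3: RNAME
    return rname in PRIMARY
```

Reads mapped to decoy contigs (e.g. `hs37d5` in b37+decoy) are dropped by this filter, which is exactly why contig coverage statistics were exempted: the decoy mapping rate itself is informative.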
The clearest inter-mapper differences were in the assignment of mapping quality values (Supplemental Figures 29a and 29b), with GEM tending to assign lower mapping qualities than the other mappers and Novoalign2 tending to do the opposite. Whether distinct mapping quality assignments alone can impact SSM calling depends on a given SSM caller's method of utilizing mapping quality information. One of the main reasons for the abnormally large SSM call set produced by MuTect on Novoalign2 alignments might be MuTect's sensitivity to Novoalign2's mapping quality assignments – a very low fraction of reads with mapping quality 0 together with a noticeably higher fraction of reads with mapping quality 20 (when compared to the other mappers; Supplemental Figure 29c).
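The MAPQ comparison above rests on per-mapper distributions of SAM column 5. A toy stand-in for that summary, operating on SAM text lines rather than BAM records (the MAPQ values of interest, 0 and 20, follow the discussion above; qProfiler itself computes the full histogram):

```python
from collections import Counter

def mapq_fractions(sam_lines, mapqs=(0, 20)):
    """Fraction of alignments at selected MAPQ values (SAM column 5).

    A minimal sketch of the kind of per-mapper summary compared above;
    not the qProfiler implementation.
    """
    hist = Counter()
    for line in sam_lines:
        if line.startswith("@"):            # skip SAM header lines
            continue
        hist[int(line.split("\t")[4])] += 1
    total = sum(hist.values())
    return {q: hist[q] / total for q in mapqs}
```

Comparing such fractions across mappers makes sensitivity arguments concrete: a caller that discards MAPQ-0 reads but trusts MAPQ-20 reads will see different effective coverage from Novoalign2 than from GEM, even on identical input reads.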