Contamination Detection in Sequencing Studies Using the Mitochondrial Phylogeny
Total Page:16
File Type:pdf, Size:1020Kb
Downloaded from genome.cshlp.org on September 27, 2021 - Published by Cold Spring Harbor Laboratory Press Method Contamination detection in sequencing studies using the mitochondrial phylogeny Hansi Weissensteiner,1 Lukas Forer,1 Liane Fendt,1 Azin Kheirkhah,1 Antonio Salas,2 Florian Kronenberg,1 and Sebastian Schoenherr1 1Institute of Genetic Epidemiology, Department of Genetics and Pharmacology, Medical University of Innsbruck, 6020 Innsbruck, Austria; 2Unidade de Xenética, Instituto de Ciencias Forenses (INCIFOR), Facultade de Medicina, Universidade de Santiago de Compostela, and GenPoB Research Group, Instituto de Sanitarias (IDIS), Hospital Clínico Universitario de Santiago (SERGAS), 15782, Galicia, Spain Within-species contamination is a major issue in sequencing studies, especially for mitochondrial studies. Contamination can be detected by analyzing the nuclear genome or by inspecting polymorphic sites in the mitochondrial genome (mtDNA). Existing methods using the nuclear genome are computationally expensive, and no appropriate tool for detecting sample contamination in large-scale mtDNA data sets is available. Here we present haplocheck, a tool that requires only the mtDNA to detect contamination in both targeted mitochondrial and whole-genome sequencing studies. Our in silico simulations and amplicon mixture experiments indicate that haplocheck detects mtDNA contamination accurately and is independent of the phylogenetic distance within a sample mixture. By applying haplocheck to The 1000 Genomes Project Consortium data, we further evaluate the application of haplocheck as a fast proxy tool for nDNA-based contamination detection using the mtDNA and identify the mitochondrial copy number within a mixture as a critical component for the overall accuracy. The haplocheck tool is available both as a command-line tool and as a cloud web service producing interactive reports that facilitates the navigation through the phylogeny of contaminated samples. [Supplemental material is available for this article.] The human mitochondrial DNA (mtDNA) is an extranuclear DNA Several approaches exist to detect contamination in mtDNA molecule of ∼16.6 kb in length (Andrews et al. 1999). It is inherited sequencing studies. We and others previously showed that a con- exclusively through the maternal line, facilitating the reconstruc- tamination approach based on the coexistence of phylogenetically tion of the human maternal phylogeny and female (pre-)historical incompatible mitochondrial haplotypes observable as polymor- demographic patterns worldwide. The strict maternal inheritance phic sites is feasible (Li et al. 2010, 2015; Avital et al. 2012; of mtDNA results in a natural grouping of haplotypes into mono- Weissensteiner et al. 2016a). The method of Dickins et al. (2014) phyletic clusters, referred to as haplogroups (Kivisild et al. 2006; facilitates the check for contamination by building neighbor join- Kloss-Brandstätter et al. 2011). Furthermore, second-generation se- ing trees. Mixemt (Vohr et al. 2017) incorporates the mitochondri- quencing enables the detection of heteroplasmy over the complete al phylogeny and estimates the most probable haplogroup for each mitochondrial genome. Heteroplasmy is the occurrence of at least sequence read. The implemented algorithm reveals advantages for two different haplotypes of mtDNA in the investigated biological contamination detection by detecting several haplotypes within samples (e.g., cells or tissues). Depending on the sequencing cov- one sample and is independent of variant frequencies. However, erage, heteroplasmic positions are reliably detectable down to it is too computationally expensive when applied to thousands the 1% variant level (Ye et al. 2014; Weissensteiner et al. 2016a). of samples. For ancient DNA studies, schmutzi (Renaud et al. It has been shown that external or cross-contamination (Yao 2015) uses sequence deamination patterns and fragment-length et al. 2007; Just et al. 2014, 2015; Yin et al. 2019; Brandhagen et al. distributions to estimate contamination. Additionally, specific lab- 2020), artificial recombination (Bandelt et al. 2004), or index hop- oratory protocols were designed for eliminating contamination, ping (Van Der Valk et al. 2019) can generate polymorphic sites that for example, double-barcode sequencing approaches (Yin et al. can be erroneously interpreted as heteroplasmic sites (He et al. 2019). 2010; Bandelt and Salas 2012; Just et al. 2014, 2015; Ye et al. 2014). For contamination detection in mitochondrial studies, most- Sample contamination is still a major issue in both nuclear ly DNA cross-contamination is investigated (Ding et al. 2015; Wei DNA (nDNA) and mtDNA sequencing studies that must be pre- et al. 2019; Yuan et al. 2020) by applying VerifyBamID (Jun et al. vented to avoid mistakes as they occurred with Sanger sequencing 2012; Zhang et al. 2020). Nevertheless, it becomes apparent that studies in the past (Salas et al. 2005). Because of the accuracy and a tool for mitochondrial studies that rapidly and accurately detects sensitivity of second-generation sequencing combined with the contamination in thousands of samples is still missing. Because availability of improved computational models, within-species mtDNA is also present hundredfold to several thousandfold per contamination is traceable down to the 1% level in whole-genome cell depending on the cell type, also WGS data sets specifically sequencing (WGS) studies (Jun et al. 2012). © 2021 Weissensteiner et al. This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publi- cation date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six Corresponding author: [email protected] months, it is available under a Creative Commons License (Attribution- Article published online before print. Article, supplemental material, and publi- NonCommercial 4.0 International), as described at http://creativecommons. cation date are at http://www.genome.org/cgi/doi/10.1101/gr.256545.119. org/licenses/by-nc/4.0/. 31:309–316 Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/21; www.genome.org Genome Research 309 www.genome.org Downloaded from genome.cshlp.org on September 27, 2021 - Published by Cold Spring Harbor Laboratory Press Weissensteiner et al. targeting the autosomal genome result in a high coverage over the For homoplasmic positions, the final genotype GT ∈ {A,C,G, mitochondrial genome. T} is detected using all input reads (reads) and calculating the geno- In this study, we systematically evaluate the approach of us- type probability P using Bayes’ theorem P(GT|reads) = P(reads|GT) ing the mtDNA phylogeny for contamination detection and pre- × P(GT)/P(reads). To calculate the prior probability P(GT), we sent haplocheck, a tool to report contamination in mtDNA used The 1000 Genomes Project Consortium Phase 3 VCF file targeted sequencing and WGS studies. In general, haplocheck (The 1000 Genomes Project Consortium 2015) and calculated works by identifying polymorphic sites down to 1% within an in- the frequencies for all sites using VCFtools (Danecek et al. 2011). To compute P(reads|GT), we calculated the sequence error rate (ei put sample. By grouping polymorphic sites into haplotypes, hap- − =10 Qi/10) for each base i of a read, where Q is the reported quality locheck identifies contamination using the mitochondrial value. For each genotype GT (GT ∈ {A,C,G,T}) of a read, we deter- phylogeny and the concept of haplogroups. Overall, this work mined the genotype likelihood by multiplying 1 − e in case the should show the merits of the mitochondrial genome as an instru- i base of the read r = GT and e /3 otherwise over all reads (Ding ment for additional quality control in sequencing studies. It addi- i i et al. 2015). The denominator P(reads) is the sum of all four P tionally presents haplocheck, a fast and accurate tool that takes (reads|GT). advantage of a solid well-known mitochondrial phylogeny for de- tecting contamination. Contamination detection model Methods The contamination model within haplocheck includes steps for (1) splitting homoplasmic and polymorphic sites into two Haplocheck takes as input BAM or VCF files. For BAM files, an ini- haplotype profiles, (2) haplogroup classification for each haplo- tial variant calling step based on a maximum likelihood (ML) func- type profile, and (3) filtering based on quality-control criteria. tion (Ye et al. 2014) is performed. Detected polymorphic variants Homozygous genotypes for the alternate alleles (ALT; i.e., homo- are then reported in VCF format and split by their variant allele fre- plasmic sites) are added to both haplotypes, whereas heterozygous quency (AF) into a major and minor haplotype profile. A haplo- genotypes are split using the AF tag. Because mutserve always re- type profile consists of all detected homoplasmic variants and ports the AF of the nonreference allele, the split method applies the corresponding allele of each polymorphic variant. Alleles the following rule: In case a GT 0/1 (e.g., Ref: G, ALT: C) with an ≥ with an AF 50% are added to the major haplotype profile; other- AF of 0.20 is included, the split method defines C as the minor al- wise, they are added to the minor haplotype profile. A haplogroup lele, 0.2 as the minor level, and 0.8 as the major level. Conversely, for each haplotype is then determined using HaploGrep 2 when a GT 0/1 (e.g., Ref: G, ALT: C) with an AF of 0.80 is included, (Weissensteiner