<<

Received: 27 November 2017 | Revised: 12 February 2018 | Accepted: 1 March 2018 DOI: 10.1111/1755-0998.12895

FROM THE COVER

Minimizing polymerase biases in

Ruth V. Nichols1 | Christopher Vollmers2 | Lee A. Newsom3 | Yue Wang4 | Peter D. Heintzman1,5 | McKenna Leighton2 | Richard E. Green2 | Beth Shapiro1

1Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Abstract Santa Cruz, California DNA metabarcoding is an increasingly popular method to characterize and quantify 2Department of Biomolecular Engineering, biodiversity in environmental samples. Metabarcoding approaches simultaneously University of California Santa Cruz, Santa Cruz, California amplify a short, variable genomic region, or “barcode,” from a broad taxonomic 3Department of Social Sciences, Flagler group via the polymerase chain reaction (PCR), using universal primers that anneal College, St. Augustine, Florida to flanking conserved regions. Results of these experiments are reported as occur- 4Department of Geography, University of Wisconsin-Madison, Madison, Wisconsin rence data, which provide a list of taxa amplified from the sample, or relative abun- 5Tromsø University Museum, UiT – The dance data, which measure the relative contribution of each taxon to the overall Arctic University of Norway, Tromsø, Norway composition of amplified product. The accuracy of both occurrence and relative abundance estimates can be affected by a variety of biological and technical biases. Correspondence Ruth V. Nichols and Beth Shapiro, For example, taxa with larger biomass may be better represented in environmental Department of Ecology and Evolutionary samples than those with smaller biomass. Here, we explore how polymerase choice, Biology, University of California Santa Cruz, Santa Cruz, CA. a potential source of technical bias, might influence results in metabarcoding experi- E-mails: [email protected]; ments. We compared potential biases of six commercially available polymerases [email protected] using a combination of mixtures of amplifiable synthetic sequences and real sedi- Funding information mentary DNA extracts. We find that polymerase choice can affect both occurrence University of California Office of the President, Grant/Award Number: and relative abundance estimates and that the main source of this bias appears to 20160713SC; National Science Foundation, be polymerase preference for sequences with specific GC contents. We further Grant/Award Number: ARC-1203990; Gordon and Betty Moore Foundation, recommend an experimental approach for metabarcoding based on results of our Grant/Award Number: GBMF-3804 synthetic experiments.

KEYWORDS bias, eDNA, environmental DNA, metabarcoding, soil, trnL P6 loop

1 | INTRODUCTION (Haile et al., 2009; Jerde, Mahon, Chadderton, & Lodge, 2011; Ped- ersen et al., 2016; Rees, Baker, Gardner, Maddison, & Gough, 2017). Metabarcoding, which is erroneously described as barcoding or Additionally, metabarcoding has been used to calculate differences in in some literature, is the technique in which a univer- haplotype or allele frequency between populations of the same spe- sal primer pair is used to amplify multiple templates from a mixture cies (Sigsgaard et al., 2016) and to link changes in community com- of many different taxa or haplotypes. Metabarcoding is often used in position over time to climatic shifts (Haile et al., 2007; Willerslev conjunction with environmental DNA (eDNA), or DNA that is col- et al., 2003, 2007, 2014). These latter examples analyse both the lected from environmental sources such as water, sediment, air and occurrence and relative abundance of each unique sequence in the faeces (Deiner et al., 2017). Metabarcoding is an increasingly popular amplification product, where abundance is estimated as the propor- tool in ecological and palaeoecological research, mainly due to its tion of the total number of sequences generated matching each simplicity and low cost. eDNA can be used, for example, to charac- taxon or haplotype. terize biodiversity of a particular taxonomic group (Ushio et al., While metabarcoding is a promising approach to characterize 2017) or to estimate the ranges of rare, extinct or cryptic species biodiversity both quickly and inexpensively, few studies have

| Mol Ecol Resour. 2018;1–13. wileyonlinelibrary.com/journal/men © 2018 John Wiley & Sons Ltd 1 2 | NICHOLS ET AL. validated the method experimentally by, for example, testing the number of target loci may vary between taxa and tissue type. extent to which the true community or population is reconstructed. Chloroplast DNA, for example, is a common target for metabarcod- It is generally accepted that taxon occurrence can be inferred via ing, but can differ in copy number between taxa, individuals and cell metabarcoding, provided that a sufficient number of PCR replicates tissue types within the same plant (Morley & Nielsen, 2016). Tapho- —amplifying DNA multiple times from the same soil extract using nomic factors may also influence DNA preservation, for example by the same amplification conditions—are performed (Pinol,~ Mir, affecting the rate of degradation. Lignified structures in plants may Gomez-Polo, & Agustı, 2015; Shaw et al., 2016) and false positives slow the rate of DNA degradation (Yoccoz et al., 2012), as may have been accounted for (Lahoz-Monfort, Guillera-Arroita, & Tingley, anoxic environments (Corinaldesi, Barucca, Luna, & Dell’Anno, 2011). 2016). The first eDNA metabarcoding studies used replication In some environments, soil leaching and postdepositional mixing may (Cooper & Poinar, 2000), where DNA extraction and amplification move DNA up or down sediment columns or horizontally over space were both replicated, to help confirm their results (Willerslev et al., (Andersen et al., 2012; Anderson-Carpenter et al., 2011; Pedersen 2003), but many subsequent studies did not replicate experiments et al., 2015; Rawlence et al., 2014). (Soininen et al., 2009; Sønstebø et al., 2010; Valentini et al., 2009). Technical biases can be introduced during DNA extraction and After a detailed exploration of the utility of replication in metabar- PCR amplification. DNA extraction protocols can be more or less coding (Darling & Mahon, 2011), the use of replication increased, optimized for soil chemistry, which can influence the extent to which but the number of replicates performed per experiment varied DNA is recovered (Zielinska et al., 2017). Soils rich in clays or humic widely. Most studies used between two and five PCR replicates per acids may bind DNA, for example, reducing DNA recovery (Direito, sample (Andersen et al., 2012; De Barba et al., 2014; Jørgensen Marees, & Roling,€ 2012). PCR is a highly stochastic process, which is et al., 2012; Willerslev et al., 2014) and some as many as eight further complicated by the presence of variable templates, with (Giguet-Covex et al., 2014). Recently, the use of site occupancy many opportunities for the introduction of bias (Aird et al., 2011; models has been proposed as a tool to estimate how many replicates Pinto & Raskin, 2012; Polz & Cavanaugh, 1998; Suzuki & Giovan- are needed; with most recommendations ranging from six to 12 noni, 1996). Although the universal primers used in metabarcoding replicates per sample (Ficetola et al., 2015; Lahoz-Monfort et al., are designed to anneal to conserved genomic regions, slight variation 2016; Schmidt, Kery, Ursenbacher, Hyman, & Collins, 2013), depend- in binding site sequences may affect primer binding efficiency, ing on the number and abundance of rare taxa. Another approach to resulting in bias (Elbrecht & Leese, 2015; Pinol~ et al. 2015). For estimate the amount of replication required is rarefaction, whereby example, Fahner, Shokralla, Baird, and Hajibabaei (2016) used four the number of new taxa identified per replicate PCR is used to esti- plant-specific primers to infer community composition from the same mate the probability that most rare taxa have been recovered (Hsieh, soil samples and found that each primer pair produced a different Ma, & Chao, 2016; Sanders, 1968). result. This result may also be related to amplicon length whereby Whether relative abundance can be estimated accurately from shorter amplicons amplify more readily than longer amplicons. Tem- metabarcoding data is a more contentious issue. Some researchers plate secondary structures can also bias PCR when molecules with routinely interpret the relative abundance of sequences post-PCR as secondary structures bind to themselves and inhibit their own ampli- indicative of real relative biomass estimates (Kowalczyk, Taberlet, fication. In addition, templates with suboptimal GC contents can be Kaminski, & Wojcik, 2011; Niemeyer, Epp, Stoof- Leichsenring, Pes- disfavoured during amplification, although some polymerases are tryakova, & Herzschuh, 2017; Willerslev et al., 2014). Others argue known to have reduced GC bias and additives such as dimethyl against this approach, citing challenges that include differential DNA sulphoxide (DMSO) for GC-rich templates or betaine for AT-rich degradation, different primer binding efficiencies and sequencing templates can reduce this bias (Baskaran et al., 1996; van Dijk, errors as confounding factors that might influence the utility of rela- Jaszczyszyn, & Thermes, 2014; Kozarewa et al., 2009). Finally, the tive abundance data collected from metabarcoding loci (Deagle, Tho- number of PCR cycles has also been shown to influence results: mas, Shaffer, Trites, & Jarman, 2013; Deagle et al., 2007; Marcelino while a higher number of PCR cycles might increase the likelihood & Verbruggen, 2016; Pawluczyk et al., 2015; Pinol~ et al., 2015). that rare molecules are observed, it could also skew abundance esti- Biases that might influence the likelihood of a taxon being mates by amplifying the biases described above (Casbon, Osborne, detected during metabarcoding can be both biological and technical Brenner, & Lichtenstein, 2011; Weyrich et al., 2017), but this can in origin. Biological differences include organism size, seasonal pres- vary (Krehenwinkel et al., 2017; Vierna, Dona, Vizcaino, Serrano, & ence and senescence, preservation and dispersal strategy, among Jovani, 2017). others. Larger taxa, taxa that are present year-round or taxa whose Here, we explore the potential of polymerase choice to influence DNA is readily transported across long distances by wind or water, the results of metabarcoding analyses, with particular reference to may be more likely to be observed in environmental samples than polymerase GC bias. We selected the trnL g/h primer set (Taberlet smaller, seasonal and sedentary taxa (Andersen et al., 2012; Barnes et al., 2007) as our universal barcoding primers for this evaluation, & Turner, 2016; Buxton, Groombridge, Zakaria, & Griffiths, 2017; as the target trnL (P6 loop) locus of the chloroplast is com- Dunn, Priestley, Herraiz, Arnold, & Savolainen, 2017; Hemery, Poli- monly used for plant metabarcoding studies (Pornon et al., 2016; tano, & Henkel, 2017; Rees et al., 2017). Even when the same num- Sønstebø et al., 2010; Valentini et al., 2009). In addition, amplicons ber of cells is present in an environmental sample, the starting copy derived from this primer set are within the range of 50 and 150 base NICHOLS ET AL. | 3 pairs (bp), which is suitable for degraded environmental DNA and 2.2 | Data generation also fully sequenceable using short-read sequencing technologies. We performed metabarcoding on DNA extracted from soil collected 2.2.1 | Environmental DNA from St Paul Island, from St. Paul Island, Alaska, and on mixtures of synthetic oligonu- Alaska cleotides whose inserts varied by GC content, using six polymerases, including those commonly used in metabarcoding. Using these We collected soil samples from St. Paul Island, Alaska. This small experiments, we asked three questions: (i) Does polymerase GC (~114 km2), isolated island is situated ~450 km west of the coast of preference affect relative abundance estimates in metabarcoding Alaska in the Bering Sea (~50.2°N, 170.2°W). St. Paul is the largest data? (ii) Are some polymerases more appropriate for metabarcod- and most northerly island of the Pribilof Islands (Mungoven, 2005), ing-derived estimates of relative abundance than others? And (iii) has a low diversity of plants and terrestrial mammals (Colinvaux, Does GC bias affect occurrence estimates in metabarcoding experi- 1981; Preble & McAtee, 1923) and completely lacks trees. We ments? selected nine sampling sites that were spatially separate from each other, geologically distinct and appeared to be colonized by different vegetative communities. At each site, a 1 9 1 m quadrat was cho- 2 | MATERIALS AND METHODS sen. We removed a ~15 9 15 9 10 cm (L 9 W 9 D) volume of sur- face soil from the centre of each quadrat using a knife and trowel 2.1 | Experimental design overview that we cleaned with ethanol between uses. We transferred We designed our experiment to ask three questions. First, Does ~10–20 g of soil to a sterile 50-ml falcon tube for eDNA analyses. polymerase GC preference affect relative abundance estimates in In addition to collecting sediment, we performed surveys of metabarcoding data? To answer this, we performed metabarcoding above-ground vegetation. We photographed the surface vegetation analyses of sedimentary DNA samples collected from St. Paul Island, in each quadrat and performed a census of each taxon growing Alaska. We performed two separate tests. First, we performed trnL within the unit. We counted stems from each representative of each (P6 loop) metabarcoding from nine samples and compared DNA- plant taxon and tallied the total for each unit (no counts exceeded derived biodiversity estimates and biodiversity estimates based on 50). For very widespread and ubiquitous taxa, including spreading above-ground survey data from the same sites. Next, for four of mat-forming types (e.g., mosses growing at the ground surface) and these nine sedimentary DNA samples, we explored whether relative oversized plants with wide crowns, we estimated relative abundance abundance changed during the course of PCR amplification, follow- based on percentage coverage within the unit. We identified the ing the design depicted in Figure 1. In both of these tests, we majority of common taxa in the field by comparison with a local col- found that polymerase GC preference did affect relative abundance lection curated at the St. Paul Public School and verified taxonomic estimated. Our second question was therefore Are some polymerases assignments using Hulten ’s floras (Hulten, 1960, 1968). We collected more appropriate for metabarcoding-derived estimates of relative abun- representative samples of distinct or unknown taxa for later taxo- dance than others? To answer this question, we amplified pools of nomic verification, which we carried out using the relevant published synthetic oligonucleotides with a range of GC contents using six dif- floras along with online keys and floristics data (Hulten, 1960; Mun- ferent polymerases and measured the precision with which each goven, 2005; Stotler & Crandall-Stotler, 2005; Talbot & Talbot, polymerase reconstructed the starting concentrations of each 1994; Walker et al., 2005). We converted the count data and the oligonucleotide pool. Our third question was Does GC bias also proportion of ground covered as a rank order (1 = 1–20% cover or affect occurrence estimates in metabarcoding experiments? To answer <10 count; 2 = 21–40% or 10–24 count; 3 = 41–60% or 25–50 this question, we again used the sedimentary DNA samples from St. count; 4 = 61–80%; 5 = 81–100%) as a proxy for plant abundance Paul Island, Alaska, but this time performed metabarcoding using at each sampling location. the polymerase identified in Question 2 as the least biased. We We extracted environmental DNA from all nine soil samples estimated the reproducibility of occurrence data using rarefaction using the MoBio PowerSoil DNA Isolation Kit (now called Qiagen analysis of ten replicate PCRs per sample. DNeasy PowerSoil Kit), following the manufacturer’s instructions. To

FIGURE 1 Schematic of the amplicon competition experiment. We chose four eDNA extracts and ran each in a PCR with trnL g/h and Platinum Taq using the recipe in Graham et al. (2016). Starting at cycle 10 and every five cycles up to cycle 60, we cooled the reaction to 20°C and removed 1 ll. We converted each 1 ll of PCR product into a sequenceable library individually. After sequencing and processing the reads, we plotted each amplicon as a function of PCR cycle and relative abundance 4 | NICHOLS ET AL. avoid contamination, we performed all steps in a clean laboratory products. We purified amplification products using a SPRI bead that is physically isolated from other molecular biology research, protocol (Rohland & Reich, 2012). while wearing sterile suits, face masks and gloves for DNA extrac- We transformed PCR amplicons into sequenceable libraries using tions and PCR set-up. To monitor cross-contamination, we extracted two different approaches. Initially (for questions one and two), we and processed the samples alongside two negative extraction used a lengthy protocol described by Meyer and Kircher (2010) (MK) controls, but did not use a positive control. that involves blunt-end repair, phosphorylation, adapter ligation and fill-in, and indexing PCR. To answer question three, we compared the MK protocol to a shorter and less expensive approach that 2.2.2 | Synthetic oligonucleotide pools amplifies DNA using trnL g/h primers with 50 overhangs containing We designed and synthesized 12 oligonucleotides with inserts of 47 the Illumina TruSeq adapter sequences. This made it possible to pro- base pairs (bp) flanked by the trnL g/h primer binding sites with no ceed directly to indexing PCR following the initial metabarcoding mismatches (total length: 83 bp; Supplementary Table S1). This set PCR, allowing library preparation to be completed in two steps (two included two oligonucleotides with 13% average GC content, two PCR set-ups). To assess whether the two-step protocol performed with 26% average GC content, two with 51% average GC content, differently from the MK protocol, we performed a comparative two with 63% average GC content and four oligonucleotides with experiment in which we amplified DNA and sequenced libraries gen- 38% average GC content. We then created six mixtures of these 12 erated from a common master mix of Qiagen Multiplex Master Mix, oligonucleotides in which each oligonucleotide was included at dif- water and template (consisting of an equimolar mixture of synthetic ferent, but known, concentrations. We then diluted each mixture to oligonucleotides). After sequencing, we found there was no signifi- 10 fM, which qPCR indicated was similar to the concentrations in cant difference between the two methods (standard least squares our eDNA extracts. To verify pooling accuracy, we amplified each test: whole model F ratio = 0.55, p = 0.58, Supplementary Fig- mixture using an approach that adds unique molecular identifiers ure S2). While we find no difference between these two library (MIDs) to each starting molecule (Cole, Volden, Dharmadhikari, preparation approaches, additional comparative analyses of prepared Scelfo-Dalbey, & Vollmers, 2016; Hoshino & Inagaki, 2017). Briefly, libraries for example using different GC content binning strategies, we first performed two cycles of PCR using modified versions of the will be necessary to explore fully whether one library preparation trnL g/h primers that contained a 50 molecular identifier (which com- approach is superior by all metrics to another. prised five random nucleotides, followed by AT, followed by another For all experiments, we sequenced libraries on the Illumina three random nucleotides: NNNNNATNNN) and the Nextera adapter MiSeq platform using 2 9 75 v3 chemistry, targeting 150,000 reads sequence (Supplementary Figure S1). This two-cycle PCR, which is per sample. We used rarefaction to confirm that sequencing depth performed using the permissive Phusion polymerase (New England was sufficient to recover all amplified molecules (Hsieh et al., 2016). Biosystems), adds to each starting molecule a uniquely identifying After sequencing, we processed each data set using an in-house barcode that can be used to reconstruct bioinformatically the true bioinformatics pipeline. Briefly, we removed adapters and merged starting relative abundance of molecules. After a clean-up step, we overlapping reads using SEQPREP version 2 (https://github.com/ then amplified the product of this two-cycle PCR for an additional jstjohn/SeqPrep), with the following flags: minimum length of reads 30 cycles with standard Nextera indexing primers and the higher (-L) 37 (combined length of the primer sequences plus one), overlap fidelity polymerase in Kapa HiFi ReadyMix (Kapa Biosystems). After required to merge read1 and read2 (-o) 10, minimum length of adap- sequencing, we counted the number of unique MIDs for each ampli- ter to consider trimming (-O) 8 and quality threshold (-q) 15. We fil- con to verify the starting relative abundance of molecules in the tered the merged reads and retained sequences containing either an pool. exact match to the forward primer and the reverse complement of the reverse primer (correct orientation) or an exact match to the 2.2.3 | PCR amplification, library preparation, TABLE 1 The six polymerases used in this study. *Platinum HiFi sequencing and bioinformatics is a blend of two polymerases (one proofreading, one not)

We performed PCR using the trnL g/h primers and six different Hot polymerases (Table 1). We performed gradient PCR as necessary Polymerase/mix Manufacturer Proofreading start to determine optimal annealing temperatures for each of the dif- AmpliTaq Gold, Buffer II Applied Biosystems N Y ferent polymerases. For Platinum HiFi Taq, AmpliTaq Gold and Kapa HiFi ReadyMix Kapa Biosystems Y Y Phusion, we used reagent mixes that are described in previous Phusion New England YN publications (Cole et al., 2016; De Barba et al., 2014; Graham BioLabs et al., 2016). All final recipes and cycling conditions are provided Platinum HiFi Invitrogen Y* Y in the supplement (Supplementary Table S2). We confirmed that Q5 29 Master Mix New England YY amplification products were in the expected size range (50– BioLabs 150 bp) via gel electrophoresis, which also confirmed that all Qiagen Multiplex Qiagen N Y Master Mix extraction and PCR-negative controls lacked visible amplification NICHOLS ET AL. | 5 reverse primer and the reverse complement of the forward primer collection step to avoid evaporation. Large numbers of cycles are (incorrect orientation). We then reverse-complemented the data in often used in metabarcoding experiments because the target loci are the incorrect orientation using the FASTX toolkit version 0.0.13 at very low abundances relative to the total amount of extracted (http://hannonlab.cshl.edu/fastx_toolkit/) and concatenated these DNA and eDNA extracts often have PCR inhibitors (Kennedy, Calla- data with those in the correct orientation. We trimmed any remain- han, & Carlson, 2013). We used 60 cycles to be sure that all PCRs ing adapter and PCR primer sequences from the ends of the filtered had reached the plateau phase. Each aliquot was made into an Illu- reads and removed any reads that retained any primer sequences or mina sequencing library individually using a library preparation proto- that were shorter than six base pairs using PRINSEQ-lite version 0.20.4 col based on Meyer and Kircher (2010) (as detailed above). We (Schmieder & Edwards, 2011). We created a single file with all called this our amplicon competition experiment (Figure 1). unmerged reads, so that read1 and read2 were on the same line and processed this file as described above. We then split this file back into 2.3.2 | Question 2: Are some polymerases more read1 and read2 files. We did not remove sequences that contained a appropriate for metabarcoding-derived estimates of mismatch to known synthetic oligonucleotide insert sequences (see relative abundance than others? below). For the sequence data derived from the St. Paul soil samples, all amplicons were short enough that the sequences could be merged. We assessed whether six polymerases (Table 1) could individually

We used the OBITOOLS software defaults (Boyer et al., 2016) to group maintain the starting ratio (relative abundance) of oligonucleotides in identical sequences (obiuniq), remove singletons and PCR artefacts mixtures after 35 cycles of PCR. For each polymerase, we performed (obiclean –H) and compare the sequences to the arctic, boreal and six experiments in which synthetic oligonucleotides were combined at embl reference libraries (Sønstebø et al., 2010; Willerslev et al., 2014) different ratios based on sequence GC content. The oligonucleotides to identify the reads to their best-associated plant taxa. Because we were combined (1) in equimolar ratios (two experiments), (2) by used three reference libraries, three separate result files were created increasing proportion with GC content, (3) by decreasing proportion for each sample (one for each reference library). We parsed the three with GC content, (4) with extreme GC contents being most abundant files using a script that compared the results in each file and extracted and (5) with extreme GC contents being least abundant. For each only the entries with the highest percentage identity and lowest taxo- experiment, we performed metabarcoding PCRs in triplicate using the nomic rank. If two species of the same genus were seen, that sequence trnL g/h primers. After obtaining relative abundance estimates for each was classified to the genus level. We set a cut-off value of 98% iden- oligonucleotide in each pool, we plotted expected abundances (rela- tity and removed reads at proportions less than 0.001. The number of tive abundance prior to amplification) versus observed abundances raw and merged reads and number of identified taxa per sample are (relative abundance after amplification) for each polymerase. We then listed in the Supplementary Materials (Supplementary Table S3). calculated the Pearson correlation coefficient between observed and For the synthetic oligonucleotide pools, we used grep to pull out expected abundance values for each enzyme. the known sequences and their reverse complements and count how many times they occurred within each FASTA file. As the OBITOOLS and 2.3.3 | Question 3: Does GC bias affect occurrence GREP methods both provided count data, we converted these counts estimates in metabarcoding experiments? to relative abundances. We again performed metabarcoding on the nine St. Paul soil eDNA extracts as described for Question 1, but used the Qiagen Multiplex 2.3 | Data analysis Master Mix (Qiagen), which our results indicated is the least biased of the six polymerases tested (see below). As with the experiment 2.3.1 | Question 1: Does polymerase GC preference described in Question 1 using Platinum HiFi Taq (Invitrogen), we affect relative abundance estimates in metabarcoding performed ten replicate PCRs for each sample. We assigned ampli- data? cons to taxa as described above. We then performed rarefaction for For the nine St. Paul samples, we performed ten replicate PCRs per sam- each replicate set from both polymerases using iNEXT (Hsieh et al., ple using Platinum HiFi Taq polymerase (Invitrogen) following the proto- 2016) in R version 3.4.2 (http://www.R-project.org/). col found in Graham et al. (2016). After sequencing and read processing as detailed above, we used standard least squares to test the effects of 3 | RESULTS above-ground vegetation abundance and amplicon average GC content on DNA relative abundance, both separately and interactively. 3.1 | Question 1: Does polymerase preference for To test the effect of PCR cycle number on the relative abun- certain GC contents affect relative abundance dances of different plant taxa, we chose four St. Paul soil eDNA estimates in metabarcoding data? extracts and two PCR controls, scaled up the PCR to 100 lL and collected 1 lL aliquots at five-cycle intervals from cycles 10–60 (Fig- For this question, we used the above-ground vegetation abundance ure 1). We used a large reaction volume to minimize the impact of data, which were collected prior to the DNA work, and the Platinum aliquot removal and cooled the reaction to 20 C during each HiFi Taq-amplified metabarcoding data. Both data sets were 6 | NICHOLS ET AL. generated from the same nine localities on St. Paul. Using both of While this pattern observed in Figure 2 supports the hypothesis these, we plotted all plant taxa that were identified using both that sequences with certain GC contents are preferentially amplified above-ground and eDNA at all locations on the same plot but split via PCR, it does not exclude the possibility that biological factors, into GC content bins (Figure 2). The x-values are above-ground such as differences in above- versus below-ground biomass, are ranked abundances, and the y-values are mean eDNA abundance influencing the results. We therefore performed an additional experi- across replicates. When we compared relative abundance estimates ment in which we measured changes in DNA-based relative abun- from the metabarcoding experiments to the relative abundance dance estimates directly during the course of PCR for four St. Paul inferred from above-ground biomass, we found that whether or not eDNA extracts (Figure 1). Figure 3a shows the changes in relative these two estimates agreed depended on average GC content of the abundance of the twelve most abundant taxa in each of the four plant’s trnL (P6 loop) locus (standard least squares, whole model: samples during cycles 20 through 60 of the PCR. Libraries from F = 34.25, p < 0.0001; effect tests: average GC, t = 1.54, p = 0.124, cycles 10 and 15 had no sequenceable amplicon molecules. Expo- above-ground abundance, t = 12.27, p < 0.0001, average GC*above- nential amplification appears to start at cycle 30 for all samples, and ground abundance, t = 4.39, p < 0.0001). Figure 2 shows that this was confirmed by qPCR (Supplementary Figure S3). We calcu- above-ground and eDNA-based estimates of abundance are corre- lated the fold change from cycle 30–60 and used this to quantify lated most strongly in middle GC content bins, but this relationship the increase or decrease in the relative abundance of each amplicon. decreases or disappears completely in the more extreme GC content We then recorded the number of primer mismatches and barcode bins. This pattern is consistent with the previously reported optimal length for each amplicon. We found that neither primer mismatches GC content of 34–38% for Platinum HiFi Taq polymerase (Dabney & nor amplicon length explained the increase or decrease in rela- Meyer, 2012). tive abundance (primer mismatches, R2: 0.011; sequence length,

FIGURE 2 DNA abundance and above-ground abundance across average GC content bins. After collecting the DNA data, we took all data on plant taxa that were identified at all locations, put them in the same plot and split them into GC content bins. Each point is a plant taxon where x is its ranked above-ground abundance and y is its mean DNA abundance across replicates. Some taxa identified in the above-ground were not found in the DNA data and some found in the DNA were not found in the above-ground data. For above-ground abundance, 5 is the highest rank, meaning the most abundant, whereas 0 indicates absence. Lines are linear best fits with p-values >0.3 for all bins except the middle bin where the p = 0.03

FIGURE 3 Changes in relative abundance over the duration of a 60-cycle PCR for four St. Paul Island sediment samples. (a) Plots showing relative abundance measured at 5-cycle intervals between cycles 20 and 60. Coloured lines show relative abundance estimates for the 10 most abundant plant taxa in these samples. (b) Plot describing the fold change for each taxon in each experiment between cycle 30 and cycle 60, with a linear line of best fit (p = 0.002), showing that change in relative abundance correlates with GC content. The y-axis is plotted on a log scale; therefore, values above 1 indicate that the amplicon is increasing in abundance from cycle 30–60 and values below 1 indicate that the amplicon is decreasing in abundance NICHOLS ET AL. | 7

R2: 0.095). However, we found a positive correlation with average However, when we compared this to the data generated using Plat- GC content and fold change from cycle 30–60 (R2: 0.474, Linear fit inum HiFi Taq, this sample had not yet reached a rarefaction plateau. p = 0.002; Figure 3b). Given the small sample size, it is not possible to know whether this difference is due to polymerase choice or to chance. 3.2 | Question 2: Are some polymerases more appropriate for metabarcoding-derived estimates of relative abundance than others?

Results from Question 1 suggest that Platinum HiFi Taq polymerase preferentially amplifies sequences with 34–38% GC. To identify polymerases that might be more appropriate for metabarcoding than Platinum HiFi Taq, we performed metabarcoding on mixtures of syn- thetic oligonucleotides with different GC contents using six com- monly used polymerases (Table 1). We found that the correlation between observed and expected oligonucleotide proportions differed between enzymes (Figure 4). Among the polymerases tested, the Qiagen Multiplex Master Mix polymerase most accurately recon- structed the known starting relative abundances (Figure 4a, and var- ied the least in accuracy by GC content (Figure 4b). However, the Qiagen Multiplex Master Mix polymerase also had the highest pro- portion of sequences with at least one error (Figure 4c). Figure 5 shows the differences between observed and expected relative abundance using the most quantitatively accurate (Qiagen Multiplex Master Mix polymerase) and least quantitatively accurate (Phusion polymerase) enzymes. Detailed plots for the other four enzymes are provided in the Supplementary Materials (Supplementary Figures S4–S7).

3.3 | Question 3: Does GC bias affect occurrence data?

The results above show that polymerase biases can influence eDNA- based estimates of relative abundance. To test whether polymerase bias may also influence the accuracy of occurrence estimates, we performed an additional experiment in which we PCR-amplified the trnL (P6 loop) locus from the same nine St. Paul eDNA extracts that were amplified for Question 1, however, this time using the best- performing enzyme as identified by the synthetic oligonucleotide experiment above, Qiagen Multiplex Master Mix. As with Platinum HiFi Taq polymerase, we performed 10 replicate PCRs for each of the nine eDNA samples, and used rarefaction to confirm that sequencing depth of each PCR library was sufficient to recover all amplified molecules (Hsieh et al., 2016). We then performed addi- tional rarefaction analyses, this time asking whether additional PCR replicates were contributing significantly towards biodiversity esti- mates. We found that after 10 replicates, mean sample coverage (the probability that all rare taxa have been recovered) was not sig- FIGURE 4 Testing polymerases using pools of synthetic nificantly different when using the Qiagen Multiplex Master Mix oligonucleotides. a, b and c combine data from six pools of synthetic compared to Platinum HiFi Taq (t = 0.66, df = 15.76, p = 0.52; Fig- oligonucleotides amplified using six polymerases. (a) Observed proportions plotted against expected proportions for six ure 6). In addition, despite the fact that St. Paul has low plant diver- polymerases. Each panel contains data for all six pools of oligos. (b) sity (Colinvaux, 1981; Preble & McAtee, 1923), only one site appears Difference from expected proportions plotted against average GC to have reached a rarefaction plateau, which would suggest that the content. Here, we only used data from the equimolar pools. (c) majority of species present have been sequenced after 10 replicates. Proportion of reads with at least one error for each enzyme/mix 8 | NICHOLS ET AL.

4 | DISCUSSION compared to all extracted DNA. Because polymerases vary in the degree to which they are biased towards GC content, another Our results show polymerase GC bias can dramatically alter the rel- approach is to simply choose the least biased polymerase. Of the ative abundance of molecules during PCR. It is important, therefore, six polymerases evaluated here, our data show that the Qiagen to use an experimental approach in metabarcoding that limits the Multiplex Master Mix is the least biased and effectively retains influence of polymerase GC bias. Molecular identifier (MID), also abundance ratios throughout the PCR (R2: 0.95). Qiagen Multiplex called unique molecular identifier (UMI), methods (Cole et al., 2016) Master Mix (but not the enzyme, HotStarTaq, itself) was originally offer a possible solution, as they allow each starting molecule to be engineered for experiments that targeted multiple templates simul- disambiguated bioinformatically after PCR. In this way, GC bias that taneously, which may explain why it performs well here (Qiagen manifests during PCR can be effectively ignored. However, these 2013). methods are not yet optimized for the mixed, low concentration If a biased polymerase is used in metabarcoding, the DNA results samples that are most often available for metabarcoding. While we may not reflect the true relative abundance of target taxa. For the successfully tested a UMI approach for the analysis of synthetic plant trnL (P6 loop) locus, for example, GC content varies consider- mixtures of oligonucleotides, the approach often failed to produce ably among major plant growth forms (Figure 7). The GC content of sequencing libraries when analysing actual eDNA samples. This may forbs, or low-lying herbaceous flowering plants, falls mainly within be due to inhibitors and/or very low concentrations of target DNA the range preferred by most polymerases (Dabney & Meyer, 2012).

FIGURE 5 Expected (black lines) and observed abundances of the six synthetic oligonucleotide mixtures using Phusion (green open circles) and Qiagen Multiplex Master Mix (purple open circles) plotted as proportional data. Each oligonucleotide was pooled at 10 lM, and then, each pool was diluted to 10 fM. Each 10 fM pool underwent PCR using the six different polymerases. The results for the best and the worst polymerases are plotted here

FIGURE 6 Rarefaction curves resulting from metabarcoding experiments for nine sites on St Paul Island, Alaska, using Platinum HiFi Taq as described in Graham et al. (2016) and the Qiagen Multiplex Master Mix following manufacturer’s instructions. For each extract and polymerase, we performed 10 replicate PCRs. Rarefaction plots describe the number of unique taxa added per replicate. Solid lines are results from the 10 experiments, and dashed lines are predicted values calculated using iNEXT (Hsieh et al., 2016) in R version 3.4.2 NICHOLS ET AL. | 9

Our DNA-based relative abundance estimates of plants from St. Paul as pollen and identification of macroscopic remains (Birks & Birks, (Figure 8) and those previously published from Siberia and Alaska 2015). While site occupancy models offer a potential solution to (Supplementary Figure S8) (Willerslev et al., 2014) were both gener- estimate the number of replicates required to identify rare taxa (Dor- ated using Platinum HiFi Taq polymerase targeting the trnL P6-loop azio & Erickson, 2017; Schmidt et al., 2013), these are constructed locus and showed that graminoids (grasses and sedges) were less for single species and would not be practical for experiments that abundant than forbs. Because this pattern falls within the biases of aim to describe an entire community. We note, however, that the Platinum HiFi Taq polymerase, these results may simply reflect poly- most abundant taxa were recovered in all PCR replicates for all sites merase bias rather than true biological signal. and both polymerases, suggesting that DNA metabarcoding is a rea- Although our results indicate that GC bias can confound sonable approach to identify at least the most abundant taxa in an metabarcoding-based relative abundance estimates, other potential environment, even if only a single replicate PCR is performed (Leray sources of bias may also influence amplicon competition during PCR. & Knowlton, 2017). For example, differences in the number of mismatches between the While our current work has identified an experimental approach sequence and the primer at the primer binding site and differences to reduce the influence of GC content on relative abundance esti- in template length will also affect the efficiency with which an ampli- mates in metabarcoding, it is important also to consider other con is copied (Stadhouders et al., 2010). While we did not find that sources of potential biases and error when interpreting results. For the number of primer mismatches affected the efficiency of replica- example, errors such as tag switching, where sample-specific bar- tion, few taxa have mismatches to the trnL g/h primers (Taberlet codes are associated with the incorrect sample during either library et al., 2007). Primer mismatches have been shown, however, to influence relative abundance for other metabarcoding loci (Pinol~ et al., 2015). In addition, shorter molecules tend to amplify more readily than longer molecules during PCR (Shagin, Lukyanov, Vagner, & Matz, 1999), and while most sequences amplified by the trnL g/h primers in this study tended to be around the same length, other metabarcoding loci vary considerably in barcode length between amplified taxa. Another source of bias during PCR is homopolymer repeats (Kieleczawa, 2006). In our amplicon competition experiment using Platinum HiFi Taq, the plant taxa Anthemideae and Pedicularis decreased in abundance in all four samples despite having optimal (Anthemideae has a GC content of 36%) and close to optimal (Pedic- ularis is 31%) GC contents, which may be because these barcodes FIGURE 7 Average GC content across different plant growth contain 8- and 9-bp-long homopolymer runs, respectively. In com- forms. The data come from the current study and Willerslev et al. parison with Platinum HiFi Taq, we noted that Anthemideae and (2014). Both studies used Platinum HiFi Taq polymerase. Ferns Pedicularis had increased abundances when using Qiagen Multiplex include horsetails Master Mix (Supplementary Figures S9–S12), suggesting that Qiagen Multiplex Master Mix was not deterred by the homopolymer repeats. Finally, polymerase error rates are a potential source of error in metabarcoding experiments, and our results showed that HotStarTaq in the Qiagen Multiplex Master Mix had the highest error rate of the six polymerases used (Figure 4c). Polymerase error has the potential to produce false-positive results when barcoding loci differ by one or a few base pairs, although this may be amelio- rated by bioinformatic pipelines capable of identifying potential sequencing errors. Our results suggest that occurrence data, which have been believed to be largely reliable from metabarcoding experiments, can also be challenging to interpret. While it is understood that rare taxa may be more difficult to identify than common taxa, recommenda- tions within the field have been to perform replicate PCRs, with little FIGURE 8 St Paul data generated using eDNA and separated guidance as to how many PCRs are necessary. Our experiments from into plant growth forms. Ferns include horsetails. All plant taxa from all samples are plotted both using Platinum HiFi Taq and Qiagen St. Paul suggest, however, that more than 10 replicate PCRs would Multiplex Master Mix. Each data point is the relative abundance of a be necessary to sample the breadth of taxa within our extracts, plant taxon from a particular location grouped into its growth form regardless of polymerase GC bias. In many instances, it may be more and shaded based on its average trnL p6-loop GC content. The practical to combine DNA-based surveys with other data types, such darker the point is, the lower the average GC content 10 | NICHOLS ET AL. preparation (Schnell, Bohmann, & Gilbert, 2015) or sequencing two reviewers whose comments helped improve this manuscript. (Kircher, Sawyer, & Meyer, 2012), may influence both occurrence This work was funded by awards from the Gordon and Betty Moore and relative abundance data. Fortunately, the latter problem can be Foundation (GBMF-3804), the National Science Foundation (ARC- mitigated by adding indices to both ends of the molecule (Kircher 1203990) and the University of California Office of the President et al., 2012). The choice of bioinformatic pipeline can also influence (20160713SC). results. For example, in a recent analysis of the metagenome of fresh basil, three of four pipelines identified Salmonella but, because Sal- monella was not identified via qPCR, the authors concluded that the DATA ACCESSIBILITY bioinformatic results were erroneous (Ceuppens, De Coninck, Bottle- The processed sequence data and above-ground survey data are doorn, Van Nieuwerburgh, & Uyttendaele, 2017). While public data- available in the Dryad repository (https://doi.org/10.5061/dryad. bases containing metabarcoding loci continue to expand in k129c). Raw amplicon sequence data have been made available on taxonomic depth (Bell, Loeffler, & Brosi, 2017), some lineages are the NCBI Short Read Archive (BioProject: PRJNA433185). more poorly represented. Finally, biological differences between spe- cies, including variation per cell or tissue type in the number of amplifiable loci (Morley & Nielsen, 2016), differences in organism AUTHOR CONTRIBUTIONS size, seasonal senescence and behaviour, may all influence the prob- B.S., P.D.H., L.A.N. and Y.W. designed and executed the collection of ability that an organism will be represented in a particular environ- sediment samples. L.A.N. and Y.W. identified the plants. R.V.N., mental sample. Although work remains to be performed to better R.E.G., P.D.H. and B.S. designed the amplicon competition experi- understand the consequences of these various types of bias and ment and the synthetic oligonucleotides. C.V. and R.V.N. designed error, metabarcoding remains a powerful approach to quickly and the experiments testing different polymerases. R.V.N. and M.L. per- inexpensively characterize communities. formed the laboratory work. R.V.N. and C.V. analysed the data. R.V.N., B.S., C.V. and P.D.H. wrote the manuscript, with critical input from all remaining authors. 5 | CONCLUSION

REFERENCES Despite the rapid growth of metabarcoding as a technique for char- acterizing communities from eDNA samples, relatively little attention Aird, D., Ross, M. G., Chen, W.-S., Danielsson, M., Fennell, T., Russ, C., has been given to validating the methodology and understanding its ... Gnirke, A. (2011). Analyzing and minimizing PCR amplification bias limitations. Polymerase GC bias is a known challenge for applications in Illumina sequencing libraries. Genome Biology, 12, R18. Andersen, K., Bird, K. L., Rasmussen, M., Haile, J., Breuning-Madsen, H., that rely on PCR (Aird et al., 2011; Dabney & Meyer, 2012; Kozar- Kjaer, K. H., ... Willerslev, E. (2012). Meta-barcoding of “dirt” DNA ewa et al., 2009). With the advent of next-generation sequencing from soil reflects vertebrate biodiversity. Molecular Ecology, 21, approaches, PCR-free methods have been developed to convert 1966–1979. extracted DNA into sequenceable molecules (Kozarewa et al., 2009). Anderson-Carpenter, L. L., McLachlan, J. S., Jackson, S. T., Kuch, M., Lumibao, C. Y., & Poinar, H. N. (2011). Ancient DNA from lake sedi- PCR remains the most useful approach to catalogue diversity in envi- ments: Bridging the gap between paleoecology and genetics. BMC ronmental samples, however, as the number of target molecules is Evolutionary Biology, 11, 30. small relative to the total extracted DNA. For this reason, it is impor- Barnes, M. A., & Turner, C. R. (2016). The ecology of environmental tant to understand the influence of GC bias in metabarcoding DNA and implications for conservation genetics. Conservation Genet- ics, 17(1), 1–17. approaches and, if possible, mitigate these biases. Here, we showed Baskaran, N., Kandpal, R. P., Bhargava, A. K., Glynn, M. W., Bale, A., & that many commonly used PCR protocols are not appropriate for Weissman, S. M. (1996). Uniform amplification of a mixture of generating reliable estimates of relative abundance. In these cases, deoxyribonucleic acids with varying GC content. Genome Research, 6, our results show that the relative abundance of amplified sequences 633–638. Bell, K. L., Loeffler, V. M., & Brosi, B. J. (2017). An rbcL reference library changes during PCR cycling and that these changes are related to to aid in the identification of plant species mixtures by DNA the GC content of the target. Of the six polymerases and mixtures metabarcoding. Applications in plant sciences, 5, pii: apps.1600110. tested, Qiagen Multiplex Master Mix provided the most accurate Birks, H. J. B., & Birks, H. H. (2015). How have studies of ancient DNA estimates of relative abundance, but also generated the highest error from sediments contributed to the reconstruction of Quaternary flo- – rate. However, we found no evidence that occurrence data were ras? New Phytologist, 209, 499 506. Boyer, F., Mercier, C., Bonin, A., Le Bras, Y., Taberlet, P., & Coissac, E. influenced by polymerase bias. (2016). OBITOOLS: A unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16, 176–182. Buxton, A. S., Groombridge, J. J., Zakaria, N. B., & Griffiths, R. A. (2017). ACKNOWLEDGEMENTS Seasonal variation in environmental DNA in relation to population size and environmental factors. Scientific Reports, 7, 46294. We thank Soumaya Belmecheri for fieldwork assistance, Brendan Casbon, J. A., Osborne, R. J., Brenner, S., & Lichtenstein, C. P. (2011). A O’Connell and Joshua Kapp for assistance with laboratory work and method for counting PCR template molecules with application to for writing scripts, and Duane Froese for discussion. We thank the next-generation sequencing. Nucleic Acids Research, 39, e81. NICHOLS ET AL. | 11

Ceuppens, S., De Coninck, D., Bottledoorn, N., Van Nieuwerburgh, F., & Giguet-Covex, C., Pansu, J., Arnaud, F., Rey, P. J., Griggo, C., Gielly, L., ... Uyttendaele, M. (2017). Microbial community profiling of fresh basil Taberlet, P. (2014). Long livestock farming history and human land- and pitfalls in taxonomic assignment of enterobacterial pathogenic scape shaping revealed by lake sediment DNA. Nature Communica- species based upon 16S rRNA amplicon sequencing. International tions, 5, 3211. Journal of Food Microbiology, 257, 148–156. Graham, R. W., Belmecheri, S., Choy, K., Culleton, B. J., Davies, L. J., Fro- Cole, C., Volden, R., Dharmadhikari, S., Scelfo-Dalbey, C., & Vollmers, C. ese, D., ... Wooller, M. J. (2016). Timing and causes of mid-Holocene (2016). Highly accurate sequencing of full-length immune repertoire mammoth extinction on St. Paul Island, Alaska. Proceedings of the amplicons using Tn5-enabled and molecular identifier-guided ampli- National Academy of Sciences of the United States of America, 113, con assembly. Journal of Immunology, 196, 2902–2907. 9310–9314. Colinvaux, P. (1981). Historical ecology in beringia: The south land bridge Haile, J., Froese, D. G., Macphee, R. D. E., Roberts, R. G., Arnold, L. J., coast at St. Paul Island. Quaternary Research, 16,18–36. Reyes, A. V., ... Willerslev, E. (2009). Ancient DNA reveals late sur- Cooper, A., & Poinar, H. N. (2000). Ancient DNA: Do it right or not at all. vival of mammoth and horse in interior Alaska. Proceedings of the Science, 289, 1139. National Academy of Sciences of the United States of America, 106, Corinaldesi, C., Barucca, M., Luna, G. M., & Dell’Anno, A. (2011). Preser- 22352–22357. vation, origin and genetic imprint of extracellular DNA in permanently Haile, J., Holdaway, R., Oliver, K., Bunce, M., Gilbert, M. T., Nielsen, R., anoxic deep-sea sediments. Molecular Ecology, 20, 642–654. ... Willerslev, E. (2007). Ancient DNA chronology within sediment Dabney, J., & Meyer, M. (2012). Length and GC-biases during sequencing deposits: Are paleobiological reconstructions possible and is DNA library amplification: a comparison of various polymerase-buffer sys- leaching a factor? Molecular Biology and Evolution, 24, 982–989. tems with ancient and modern DNA sequencing libraries. BioTech- Hemery, L. G., Politano, K. K., & Henkel, S. K. (2017). Assessing differ- niques, 52,87–94. ences in macrofaunal assemblages as a factor of sieve mesh size, dis- Darling, J. A., & Mahon, A. R. (2011). From molecules to management: tance between samples, and time of sampling. Environmental Adopting DNA-based methods for monitoring biological invasions in Monitoring and Assessment, 189, 413. aquatic environments. Environmental Research, 111, 978–988. Hoshino, T., & Inagaki, F. (2017). Application of stochastic labeling with De Barba, M., Miquel, C., Boyer, F., Mercier, C., Rioux, D., Coissac, E., & random-sequence barcodes for simultaneous quantification and Taberlet, P. (2014). DNA metabarcoding multiplexing and validation sequencing of environmental 16S rRNA genes. PLoS ONE, 12(1), of data accuracy for diet assessment: Application to omnivorous diet. e0169431. Molecular ecology Resources, 14, 306–323. Hsieh, T. C., Ma, K. H., & Chao, A. (2016). iNEXT: An R package for rar- Deagle, B. E., Gales, N. J., Evans, K., Jarman, S. N., Robinson, S., Trebilco, efaction and extrapolation of species diversity (Hill numbers). Meth- R., & Hindell, M. A. (2007). Studying seabird diet through genetic ods in Ecology and Evolution/British Ecological Society, 7, 1451–1456. analysis of faeces: A case study on macaroni penguins (Eudyptes Hulten, E. (1960). Flora of the Aleutian Islands and westernmost Alaska chrysolophus). PLoS ONE, 2, e831. Peninsula: With notes on the Flora of commander Islands. New York, Deagle, B. E., Thomas, A. C., Shaffer, A. K., Trites, A. W., & Jarman, S. N. NY: Hafner Pub. Co.. (2013). Quantifying sequence proportions in a DNA-based diet study Hulten, E. (1968). Flora of alaska and neighboring territories: A manual of using Ion Torrent amplicon sequencing: Which counts count? Molecu- the vascular plants. Stanford, CA: Stanford University Press. lar Ecology Resources, 13, 620–633. Jerde, C. L., Mahon, A. R., Chadderton, W. L., & Lodge, D. M. (2011). Deiner, K., Bik, H. M., M€achler, E., Seymour, M., Lacoursiere-Roussel, A., “Sight-unseen” detection of rare aquatic species using environmental Altermatt, F., ... Bernatchez, L. (2017). Environmental DNA metabar- DNA. Conservation Letters, 4, 150–157. coding: Transforming how we survey animal and plant communities. Jørgensen, T., Kjaer, K. H., Haile, J., Rasmussen, M., Boessenkool, S., Molecular Ecology, 26, 5872–5895. Andersen, K., ... Willerslev, E. (2012). Islands in the ice: Detecting van Dijk, E. L., Jaszczyszyn, Y., & Thermes, C. (2014). Library preparation past vegetation on Greenlandic nunataks using historical records and methods for next-generation sequencing: Tone down the bias. Experi- sedimentary ancient DNA meta-barcoding. Molecular Ecology, 21, mental Cell Research, 322,12–20. 1980–1988. Direito, S. O. L., Marees, A., & Roling,€ W. F. M. (2012). Sensitive life Kennedy, S., Callahan, H., & Carlson, M. (2013). Tips and tricks for isola- detection strategies for low-biomass environments: Optimizing tion of DNA & RNA from challenging samples. Carlsbad, CA: MO BIO extraction of nucleic acids adsorbing to terrestrial and Mars analogue Laboratories Inc. minerals. FEMS Microbiology Ecology, 81, 111–123. Kieleczawa, J. (2006). Fundamentals of sequencing of difficult templates– Dorazio, R. M., & Erickson, R. A. (2017). eDNA occupancy: An R package an overview. Journal of Biomolecular Techniques, 17, 207–217. for multi-scale occupancy modeling of environmental DNA data. Kircher, M., Sawyer, S., & Meyer, M. (2012). Double indexing overcomes Molecular Ecology Resources, 18, 368–380. inaccuracies in multiplex sequencing on the Illumina platform. Nucleic Dunn, N., Priestley, V., Herraiz, A., Arnold, R., & Savolainen, V. Acids Research, 40, e3. (2017). Behavior and season affect crayfish detection and density Kowalczyk, R., Taberlet, P., Kaminski, T., & Wojcik, J. (2011). Influence of inference using environmental DNA. Ecology and Evolution, 7, 7777– management practices on large herbivore diet–Case of European 7785. bison in Bialowieza Primeval Forest (Poland). Forest Ecology and Man- Elbrecht, V., & Leese, F. (2015). Can DNA-based ecosystem assessments agement, 261, 821–828. quantify species abundance? Testing primer bias and biomass—se- Kozarewa, I., Ning, Z., Quail, M. A., Sanders, M. J., Berriman, M., & quence relationships with an innovative metabarcoding protocol. Turner, D. J. (2009). Amplification-free Illumina sequencing-library PLoS ONE, 10, 7. e0130324-16 preparation facilitates improved mapping and assembly of (G+C)- Fahner, N. A., Shokralla, S., Baird, D. J., & Hajibabaei, M. (2016). Large- biased . Nature Methods, 6, 291–295. scale monitoring of plants through environmental DNA metabarcod- Krehenwinkel, H., Wolf, M., Lim, J. Y., Rominger, A. J., Simison, W. B., & ing of soil: recovery, resolution, and annotation of four DNA markers. Gillespie, R. (2017). Estimating and mitigating amplification bias in PLoS ONE, 11, e0157505. qualitative and quantitative arthropod metabarcoding. Scientific Ficetola, G. F., Pansu, J., Bonin, A., Coissac, E., Giguet-Covex, C., De Reports, 7,1–12. Barba, M., ... Taberlet, P. (2015). Replication levels, false presences Lahoz-Monfort, J. J., Guillera-Arroita, G., & Tingley, R. (2016). Statistical and the estimation of the presence/absence from eDNA metabarcod- approaches to account for false-positive errors in environmental ing data. Molecular Ecology Resources, 15, 543–556. DNA samples. Molecular Ecology Resources, 16, 673–685. 12 | NICHOLS ET AL.

Leray, M., & Knowlton, N. (2017). Random sampling causes the low Schmieder, R., & Edwards, R. (2011). Quality control and preprocessing reproducibility of rare eukaryotic OTUs in Illumina COI metabarcod- of metagenomic datasets. Bioinformatics, 27(6), 863–864. ing. PeerJ, 5, e3006–e3027. Schnell, I. B., Bohmann, K., & Gilbert, M. T. P. (2015). Tag jumps illumi- Marcelino, V. R., & Verbruggen, H. (2016). Multi-marker metabarcoding nated–reducing sequence-to-sample misidentifications in metabarcod- of coral skeletons reveals a rich microbiome and diverse evolutionary ing studies. Molecular Ecology Resources, 15, 1289–1303. origins of endolithic algae. Scientific Reports, 6, 31508. Shagin, D. A., Lukyanov, K. A., Vagner, L. L., & Matz, M. V. (1999). Regu- Meyer, M., & Kircher, M. (2010). Illumina sequencing library preparation lation of average length of complex PCR product. Nucleic Acids for highly multiplexed target capture and sequencing. Cold Spring Har- Research, 27, e23. bor Protocols, 6, pdb.prot5448. Shaw, J. L. A., Clarke, L. J., Wedderburn, S. D., Barnes, T. C., Weyrich, L. Morley, S. A., & Nielsen, B. L. (2016). Chloroplast DNA copy number S., & Cooper, A. (2016). Comparison of environmental DNA metabar- changes during plant development in organelle DNA polymerase coding and conventional fish survey methods in a river system. Bio- mutants. Frontiers in Plant Science, 7, 57. logical Conservation, 197, 131–138. Mungoven, M. (2005) Soil Survey of Saint Paul Island Area, Alaska. Sigsgaard, E. E., Nielsen, I. B., Bach, S. S., Lorenzen, E. D., Robinson, D. National Resources Conservation Service, United States Department P., Knudsen, S. W., ... Thomsen, P. F. (2016). Population characteris- of Agriculture. tics of a large whale shark aggregation inferred from seawater envi- Niemeyer, B., Epp, L. S., Stoof- Leichsenring, K. R., Pestryakova, L. A., & ronmental DNA. Nature Ecology & Evolution, 1, 0004. Herzschuh, U. (2017). A comparison of sedimentary DNA and pollen Soininen, E. M., Valentini, A., Coissac, E., Miquel, C., Gielly, L., Broch- from lake sediments in recording vegetation composition at the mann, C., ... Taberlet, P. (2009). Analysing diet of small herbivores: Siberian treeline. Molecular Ecology Resources, 17, e46–e62. The efficiency of DNA barcoding coupled with high-throughput Pawluczyk, M., Weiss, J., Links, M. G., Egana~ Aranguren, M., Wilkinson, pyrosequencing for deciphering the composition of complex plant M. D., & Egea-Cortines, M. (2015). Quantitative evaluation of bias in mixtures. Frontiers in Zoology, 6, 16. PCR amplification and next-generation sequencing derived from Sønstebø, J. H., Gielly, L., Brysting, A. K., Elven, R., Edwards, M., Haile, J., metabarcoding samples. Analytical and Bioanalytical Chemistry, 407, ... Brochmann, C. (2010). Using next-generation sequencing for 1841–1848. molecular reconstruction of past Arctic vegetation and climate. Pedersen, M. W., Overballe-Petersen, S., Ermini, L., Der Sarkissian, C., Molecular Ecology Resources, 10, 1009–1018. Haile, J., Hellstrom, M., ... Schnell, I. B. (2015). Ancient and modern Stadhouders, R., Pas, S. D., Anber, J., Voermans, J., Mes, T. H., & Schut- environmental DNA. Philosophical Transactions of the Royal Society of ten, M. (2010). The effect of primer-template mismatches on the London. Series B, Biological Sciences, 370, 20130383. detection and quantification of nucleic acids using the 50 nuclease Pedersen, M. W., Ruter, A., Schweger, C., Friebe, H., Staff, R. A., Kjeldsen, assay. Journal of Molecular Diagnostics, 12, 109–117. K. K., ... Willerslev, E. (2016). Postglacial viability and colonization in Stotler, R. E., & Crandall-Stotler, B. J. (2005). Bryophytes. Department of North America’s ice-free corridor. Nature, 537,45–49. Plant Biology, Southern Illinois University. Pinol,~ J., Mir, G., Gomez-Polo, P., & Agustı, N. (2015). Universal and Suzuki, M. T., & Giovannoni, S. J. (1996). Bias caused by template anneal- blocking primer mismatches limit the use of high-throughput DNA ing in the amplification of mixtures of 16S rRNA genes by PCR. sequencing for the quantitative metabarcoding of arthropods. Molecu- Applied and Environmental Microbiology, 62, 625–630. lar Ecology Resources, 15, 819–830. Taberlet, P., Coissac, E., Pompanon, F., Gielly, L., Miquel, C., Valentini, A., Pinto, A. J., & Raskin, L. (2012). PCR biases distort bacterial and archaeal ... Willerslev, E. (2007). Power and limitations of the chloroplast trnL community structure in pyrosequencing datasets. PLoS ONE, 7, e43093. (UAA) intron for plant DNA barcoding. Nucleic Acids Research, 35, e14. Polz, M. F., & Cavanaugh, C. M. (1998). Bias in template-to-product Talbot, S. S., & Talbot, S. L. (1994). Numerical classification of the coastal ratios in multitemplate PCR. Applied and Environmental Microbiology, vegetation of Attu Island, Aleutian Islands, Alaska. Journal of Vegeta- 64, 3724–3730. tion Science, 5, 867–876. Pornon, A., Escaravage, N., Burrus, M., Holota, H., Khimoun, A., Mariette, Ushio, M., Fukuda, H., Inoue, T., Makoto, K., Kishida, O., Sato, K., ... J., ... Andalo, C. (2016). Using metabarcoding to reveal and quantify Miya, M. (2017). Environmental DNA enables detection of terrestrial plant-pollinator interactions. Scientific Reports, 6, 27282. mammals from forest pond water. Molecular Ecology Resources, 17, Preble, E. A., & McAtee, W. L. (1923). A Biological Survey of the Pribilof e63–e75. Islands, Alaska, North American Fauna, No. 46. Department of Agricul- Valentini, A., Miquel, C., Nawaz, M. A., Bellemain, E., Coissac, E., Pom- ture, Bureau of Biological Survey, Government Printing Office, Wash- panon, F., ... Taberlet, P. (2009). New perspectives in diet analysis ington DC, USA. based on DNA barcoding and parallel pyrosequencing: The trnL Qiagen (2013). Qiagen Multiplex PCR Kit: Product Details. approach. Molecular Ecology Resources, 9,51–60. Rawlence, N. J., Lowe, D. J., Wood, J. R., Young, J. M., Jock Churchman, Vierna, J., Dona, J., Vizcaino, A., Serrano, D., & Jovani, R. (2017). PCR G., Huang, Y. T., & Cooper, A. (2014). Using palaeoenvironmental cycles above routine numbers do not compromise high-throughput DNA to reconstruct past environments: Progress and prospects. Jour- DNA barcoding results. Genome, 60(10), 868–873. nal of Quaternary Science, 29, 610–626. Walker, D. A., Raynolds, M. K., Daniels,€ F. J. A., Einarsson, E., Elvebakk, Rees, H. C., Baker, C. A., Gardner, D. S., Maddison, B. C., & Gough, K. C. A., Gould, W. A., ... Yurtsev, A. (2005). The circumpolar Arctic vege- (2017). The detection of great crested newts year round via environ- tation map. Vegetation Science, 5, 757–764. mental DNA analysis. BMC Research Notes, 10, 327. Weyrich, L. S., Duchene, S., Soubrier, J., Arriola, L., Llamas, B., Breen, J., Rohland, N., & Reich, D. (2012). Cost-effective, high-throughput DNA ... Cooper, A. (2017). Neanderthal behaviour, diet, and dis- sequencing libraries for multiplexed target capture. Genome Research, ease inferred from ancient DNA in dental calculus. Nature, 544, 22, 939–946. 357–361. Sanders, H. L. (1968). Marine benthic diversity: A comparative study. Willerslev, E., Cappellini, E., Boomsma, W., Nielsen, R., Hebsgaard, M. B., American Naturalist, 102, 243–282. Brand, T. B., ... Collins, M. J. (2007). Ancient biomolecules from deep Schmidt, B. R., Kery, M., Ursenbacher, S., Hyman, O. J., & Collins, J. P. ice cores reveal a forested southern Greenland. Science, 317, (2013). Site occupancy models in the analysis of environmental DNA 111–114. presence/absence surveys: A case study of an emerging amphibian Willerslev, E., Davison, J., Moora, M., Zobel, M., Coissac, E., Edwards, M. pathogen. Methods in Ecology and Evolution/British Ecological Society, E., ... Taberlet, P. (2014). Fifty thousand years of Arctic vegetation 4, 646–653. and megafaunal diet. Nature, 506,47–51. NICHOLS ET AL. | 13

Willerslev, E., Hansen, A. J., Binladen, J., Brand, T. B., Gilbert, M. T., Sha- SUPPORTING INFORMATION piro, B., ... Cooper, A. (2003). Diverse plant and animal genetic records from Holocene and Pleistocene sediments. Science, 300, Additional supporting information may be found online in the 791–795. Supporting Information section at the end of the article. Yoccoz, N. G., Brathen, K. A., Gielly, L., Haile, J., Edwards, M. E., Goslar, T., ... Taberlet, P. (2012). DNA from soil mirrors plant taxonomic and growth form diversity. Molecular Ecology, 21, 3647– 3655. How to cite this article: Nichols RV, Vollmers C, Newsom LA, Zielinska, S., Radkowski, P., Blendowska, A., Ludwig-Gałezzowska, A., Łos, et al. Minimizing polymerase biases in metabarcoding. Mol J. M., & Łos, M. (2017). The choice of the DNA extraction method Ecol Resour. 2018;00:1–13. https://doi.org/10.1111/1755- may influence the outcome of the soil microbial community structure analysis. Microbiology Open, 6, https://doi.org/10.1002/mbo3.453. 0998.12895