
Molecular analysis of honey bee foraging ecology

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

in the Graduate School of The Ohio State University

By

Rodney Trey Richardson

Graduate Program in Entomology

The Ohio State University

2018

Dissertation Committee

Professor Reed Johnson, Advisor

Professor Mary Gardiner

Professor John Christman

Professor Roman Lanno


Copyrighted by

Rodney Trey Richardson

2018


Abstract

While numerous factors currently impact the health of honey bees and other pollinating Hymenoptera, poor floral resource availability due to habitat loss and land conversion is thought to be important. This issue is particularly salient in the upper Midwest, a location which harbors approximately 60 percent of the US honey bee colonies each summer for honey production. This region has experienced a dramatic expansion in the area devoted to crop production over the past decade. Consequently, understanding how changes to landscape composition affect the diversity, quality and quantity of available floral resources has become an important research goal.

Here, I developed molecular methods for the identification of bee-collected pollen by adapting and improving upon the existing amplicon sequencing infrastructure used for microbial community ecology. In thoroughly benchmarking these procedures, I show that a simple and cost-effective three-step PCR-based library preparation protocol in combination with Metaxa2-based hierarchical classification yields an accurate and highly quantitative approach to pollen analysis when applied across multiple markers.

In Chapter 1, I conducted one of the first ever proof-of-concept studies applying amplicon sequencing, or metabarcoding, to the identification of bee-collected pollen. In this work, we used rudimentary laboratory and bioinformatic methods to apply the method to a single nuclear marker, ITS2. In doing so, we found the method to be highly inaccurate with respect to quantitative inference of the relative abundances of different plant taxa represented within our sample. Thus, in Chapter 2 I used the same methods and turned my attention to two alternative chloroplast markers, matK and rbcL, in addition to ITS2. In this study, I found that the chloroplast markers were more useful for quantification of pollen abundance relative to ITS2.

With an improved understanding of the behavior of different plant markers, I began optimizing the bioinformatic and laboratory methods used for pollen metabarcoding. In Chapter 3, I conducted in silico cross-validation analyses using three prominent hierarchical amplicon sequence classifiers. Testing the classifiers on data from all five commonly used plant barcoding markers, I found wide variance in the accuracy and sensitivity of the classifiers evaluated, suggesting that the choice of classifier and the optimization of classification procedures is an important area for future methods development. In Chapter 4, I expand on evaluating hierarchical sequence classifiers with finer granularity and apply cross-validation analysis to the Metaxa2 hierarchical DNA sequence classifier. Further, I discuss and implement my perspective upon best practices for reference database curation. These curation procedures are designed for the purposes of hierarchical classification specifically.

In Chapter 5, I apply pollen metabarcoding in combination with waggle dance interpretation to investigate the spatial and taxonomic foraging patterns of honey bees in central Ohio agroecosystems. After modifying existing PCR-based library preparation protocols, we applied our methods to target four plant barcode markers, trnL, trnH, rbcL and ITS2, for 32 samples collected across the month of May 2015 from four corn and soybean-dominated agroecosystems. Our results indicated that the vast majority of colony nutrition provided by pollen during this time was provided by three major plant taxa: woody Rosaceae trees, Salix and Trifolium. Inference of spatial foraging patterns through waggle dance analysis revealed a significant preference for wood lots and tree lines relative to herbaceous, residential and crop landcover types.

With molecular methods for pollen analysis now optimized and validated, investigations into how floral resource availability mediates honey bee health can be conducted more feasibly at large scales. It is my hope that this approach proves useful for quantifying and maximizing the pollinator floral resource value of managed lands.


Dedication

To my family - past, present and future.


Acknowledgments

Upon applying to graduate school, I was an unexceptional student and I largely owe my opportunities to an advisor who took a chance on me when others would not.

Thank you, Reed.

John Christman, Johan Bengtsson-Palme, Douglas Sponsler, Chia-Hua Lin, Mary Gardiner, Roman Lanno, Karen Goodell and Megan Ballinger provided valuable guidance and collaborative engagement throughout my doctoral research.

For friendship and lively discussions, scientific or otherwise, I thank Drew Spacht, Brent Nowinski, Ben Green, Chris Riley, Natalia Riusech, Molly Dieterich Mabin, Alice Vossbrink and Yvan Delgado.

For the privilege of working alongside them through the long and often seemingly futile scientific process, I thank Juan Pillajo, Hailey Curtis, Emma Matcham, Garret Cherry, Tyler Eaton, Katie Turo, Alyssa Wheeler, Luke Hearon, Karissa Smith and Sreelakshmi Suresh. I hope that your experience in the lab will be a valuable asset to your future efforts.

This work was funded by numerous sources, foremost among them being the Costco-Project Apis m. Honey Bee Biology Fellowship. Funding was also provided by the North American Pollinator Protection Campaign, Pollinator Partnership, OARDC SEEDS grant program and The Ohio State Beekeepers Association.


To all my family, I am grateful for your support throughout the years. Since my interest in science is largely derived from my love of nature, I am deeply indebted to my parents, Jo and Rodney, and my step parents, Phyllis and Bob, for frequently exposing me to the harsh outdoors from an early age. To my sister, Casey, thanks for encouraging a questioning mind from an early age.

Last but certainly not least, thanks to my wife, Sandra, who has provided an endless supply of smiles, hugs and cheer after many a failed experiment. This work is very much a product of her support.


Vita

2013...... B.S. Biochemistry, Indiana University

2013 to present ...... Graduate Research Associate, Department of Entomology, The Ohio State University

Publications

Richardson, RT, HR Curtis, EG Matcham, C-H Lin, S Suresh, DB Sponsler, L Hearon & RM Johnson. In Press. Quantitative multi-locus metabarcoding and waggle dance interpretation reveal honey bee spring foraging patterns in Midwest agroecosystems. Molecular Ecology (bioRxiv preprint: http://dx.doi.org/10.1101/418590)

Bengtsson-Palme, J, RT Richardson, M Meola, et al. 2018. Metaxa2 Database Builder: enabling taxonomic identification from metagenomic or metabarcoding data using any genetic marker. Bioinformatics, bty482

Richardson, RT, MN Ballinger, F Qian, JW Christman & RM Johnson. 2018. Morphological and functional characterization of hemocyte communities spanning the honey bee, Apis mellifera, lifecycle. Apidologie, 49: 397-410

Richardson, RT, J Bengtsson-Palme & RM Johnson. 2017. Evaluating and optimizing the performance of software commonly used for the taxonomic classification of DNA metabarcoding sequence data. Molecular Ecology Resources, 17: 760-769

Bell, KL, N de Vere, A Keller, RT Richardson, A Gous, KS Burgess & BJ Brosi. 2016. Pollen DNA barcoding: Current applications and future prospects. Genome, 59: 1-12

Richardson, RT, C-H Lin, JO Quijia, NS Riusech, K Goodell & RM Johnson. 2015. Rank-based characterization of pollen assemblages collected by honey bees using a multi-locus metabarcoding approach. Applications in Plant Sciences, 3(11): 1500043

Richardson, RT, C-H Lin, JO Quijia, DB Sponsler, K Goodell & RM Johnson. 2015. Application of ITS2 metabarcoding to determine the provenance of pollen collected by honey bees in a field-crop dominated agroecosystem. Applications in Plant Sciences, 3(1): 1400066

Fields of Study

Major Field: Entomology


Table of Contents

Abstract

Dedication

Acknowledgments

Vita

Table of Contents

List of Tables

List of Figures

Chapter 1. Application of ITS2 Metabarcoding to Determine the Provenance of Pollen Collected by Honey Bees in an Agroecosystem

1.1 Abstract

1.2 Introduction

1.3 Methods

1.4 Results

1.5 Discussion

1.6 References

1.7 Tables

1.8 Figures

1.9 Acknowledgements

Chapter 2. Rank-based characterization of pollen assemblages collected by honey bees using a multi-locus metabarcoding approach

2.1 Abstract

2.2 Introduction

2.3 Methods and Results

2.4 Conclusions

2.5 References

2.6 Tables

2.7 Figures

2.8 Acknowledgements

Chapter 3. Evaluating and optimizing the performance of software commonly used for the taxonomic classification of DNA metabarcoding sequence data

3.1 Abstract

3.2 Introduction

3.3 Methods

3.4 Results

3.5 Discussion

3.6 References

3.7 Tables

3.8 Figures

3.9 Acknowledgements

Chapter 4. A reference Cytochrome C Oxidase Subunit I database curated for hierarchical classification of arthropod metabarcoding data

4.1 Abstract

4.2 Introduction

4.3 Methods

4.4 Results

4.5 Discussion

4.6 Conclusions

4.7 References

4.8 Tables

4.9 Figures

Chapter 5. Quantitative multi-locus metabarcoding and waggle dance interpretation reveal honey bee spring foraging patterns in Midwest agroecosystems

5.1 Abstract

5.2 Introduction

5.3 Methods

5.4 Results

5.5 Discussion

5.6 References

5.7 Tables

5.8 Figures

5.9 Acknowledgements

Complete List of References Cited

Appendix A. Reference pollen collections and published references for microscopic pollen identification

Appendix B. Proportional weight of pollen identified by microscopy

Appendix C. List of plant families and genera identified by ITS2 metabarcoding

Appendix D. Latitude and longitude of sampling sites

Appendix E. Results of microscopic palynological analysis by family and sampling location

Appendix F. Venn diagram of the completeness of each reference library

Appendix G. Complete metabarcoding results (not consensus filtered) by family and sampling location

Appendix H. Metabarcoding consensus lists (consensus filtered) by family and sampling location


List of Tables

Table 1.1 Comparison of percent weight by microscopy to percent paired end read alignments per taxa for each sampling date

Table 1.2 Spearman’s Rank-Based Correlation test between the number of paired end alignments and the number of grams per plant taxon

Table 1.3 The average R-coefficients and standard deviations per taxon across all samples

Table 2.1 Spearman’s rank-based correlation between the total number of grains per plant taxon as determined by microscopy and the number of mate-paired aligned reads per plant taxon as determined by metabarcoding

Table 2.2 Average R-coefficients for taxa present in at least five of the six samples

Table 3.1 Summary of commonly used sequence classification approaches

Table 3.2 Summary of tests of relationships between classification accuracy, sensitivity and database completeness

Table 4.1 Summary of taxonomic annotations made for references which had undefined ranks at midpoints in their respective taxonomic lineages

Table 4.2 Summary of taxonomic representation across all arthropod classes and associated sister groups

Table 4.3 Summary of insect taxa included in the arthropod COI database following curation

Table 5.1 Summary of plant taxonomic groups represented at each rank in the reference sequence databases

Table 5.2 Mean, standard error and range of the number of Viridiplantae sequences per sample obtained for each marker

Table 5.3 Mean and standard error of proportion of sequences classified to each rank for each marker

List of Figures

Figure 1.1 Pollen origins identified by microscopy during the sampling period

Figure 2.1 Proportional taxonomic abundances of each sample as estimated by microscopy

Figure 2.2 Rank-transformed taxonomic abundance as estimated by the mean number of rbcL and matK metabarcoding reads (Y-axis) and the number of grains estimated microscopically (X-axis)

Figure 3.1 Venn diagrams showing the genera known to be present in Ohio as well as the number of genera represented in our testing sets and training sets

Figure 3.2 Mean and standard error values of classifier sensitivity and error rate for classification of all test sequences

Figure 3.3 Mean and standard error values of genus-level error rate, family-level sensitivity and family-level error for test sequences belonging to genera not represented in the corresponding training set

Figure 3.4 Genus-level classifier error rate, sensitivity and errors per assignment regressed against classification confidence threshold for rbcL testing sequences classified using both UTAX and RDP

Figure 3.5 Linear models of genus-level classifier sensitivity and error rate regressed against the log-scaled number of reference sequences used for classification

Figure 4.1 A percent density histogram of the number of sequences per species shows the distribution of redundancy within the NCBI Nucleotide entries used

Figure 4.2 A logistic regression analysis of case-by-case classification accuracy, ‘1’ indicating a false-positive identification and ‘0’ indicating a true-positive identification, regressed against classification reliability score for half length and full length test sequence cases

Figure 4.3 Mean proportion and standard error of true positives (TP), true negatives (TN), false negatives (FN) and false positives (FP) for the classification of all testing sequences, conducted on both the full-length and half-length sequences

Figure 4.4 Proportional species level overclassification rate and genus level classification performance for test sequence cases from species not represented in the corresponding training data

Figure 4.5 Mean and standard error of the proportion of sequences assigned and false discovery rate, as measured in errors per assignment, measured at the family, genus and species levels for sequences belonging to each order

Figure 5.1 Cross-validation results of classifier performance evaluation on test reference sequences cropped to 150 bp in length

Figure 5.2 Metabarcoding results regressed against microscopy results for the metabarcoding median of all loci as well as each locus individually

Figure 5.3 Time series plot of the metabarcoding median estimate of the proportional abundance of each plant family across the four sampling sites

Figure 5.4 Honey bee spatial foraging patterns

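Chapter 1. Application of ITS2 Metabarcoding to Determine the Provenance of Pollen Collected by Honey Bees in an Agroecosystem

(originally published under the same title in Applications in Plant Sciences 3:1400066)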

1.1 Abstract

Melissopalynology, the identification of bee-collected pollen, provides insight into the flowers exploited by foraging bees. Information provided by melissopalynology could guide floral enrichment efforts aimed at supporting pollinators, but it has rarely been used because traditional methods of pollen identification are laborious and require expert knowledge. We approach melissopalynology in a novel way, employing a molecular method to study the pollen foraging of honey bees (Apis mellifera) in a field-crop dominated landscape and compare these results to those obtained by microscopic melissopalynology. Pollen was collected from honey bee colonies in Madison County, Ohio, USA over two weeks in mid-spring and identified using microscopic methods and ITS2 metabarcoding with high-throughput amplicon sequencing.

Metabarcoding identified 19 plant families and exhibited greater sensitivity for identifying the taxa present in large and diverse pollen samples than microscopy, which identified eight families. The bulk of pollen collected by honey bees was from trees (Sapindaceae, Oleaceae, and Rosaceae), though dandelion (Taraxacum officinale) and mustard (Brassicaceae) pollen was also abundant. For quantitative analysis of pollen, the use of both metabarcoding and microscopic identification is superior to either individual method. For qualitative analysis, ITS2 metabarcoding is superior, providing genus-level resolution.

1.2 Introduction

Melissopalynology, the identification of pollen collected by bees, has proven to be an indispensable tool in fields such as pollination biology (Kearns and Inouye, 1993; Cusser and Goodell, 2013), pollinator foraging behavior (Louveaux, 1959; Wilson et al., 2010; Baum et al., 2011), sourcing and authentication of apicultural products (Louveaux et al., 1978; Jones and Bryant, Jr., 1992; Dimou et al., 2007), and honey bee (Apis mellifera L.) nutritional biology (Severson and Parry, 1981; Forcone et al., 2011; Girard et al., 2012). The lattermost has taken on particular urgency in recent years due to alarming and largely unexplained regional declines in honey bee populations (vanEngelsdorp et al., 2009). Pollen is the main source of amino acids, lipids, sterols, vitamins, and minerals in the honey bee diet (Winston, 1987), and malnutrition, caused by the displacement of natural floral communities by agricultural development, may contribute to recent patterns of honey bee decline (Naug, 2009; vanEngelsdorp and Meixner, 2010; Huang, 2012). Melissopalynology allows for direct observation of the honey bee pollen diet, providing a unique means for identifying nutritional deficits and informing potential efforts to improve foraging habitat.

Traditional melissopalynology involves the careful preparation of pollen samples for microscopic inspection followed by the identification of individual pollen grains by comparison to a similarly prepared reference collection of local pollen taxa (Erdtman, 1943; Kearns and Inouye, 1993). When pollen collected by honey bees is used, the steps above are normally preceded by the laborious sorting of bulk pollen samples into groups of pellets having similar color and texture (Dimou and Thrasyvoulou, 2007). To the extent that this sorting succeeds in discriminating true taxonomic groups, it provides a preliminary estimate of sample diversity and streamlines the process of microscopic examination for large volumes of pollen. These methods can provide reliable data when properly executed, but they suffer from being (1) highly dependent on human expertise and vulnerable to human error, (2) limited in taxonomic precision, as many taxa are only identifiable to family, and (3) prohibitively time-consuming for large-scale studies. A reliable and efficient alternative would be of immense value for many melissopalynological applications, especially those that require the analysis of large and taxonomically diverse pollen samples, such as are commonly collected by honey bees (e.g. Wilson et al., 2010).

We present a novel approach to melissopalynology using DNA metabarcoding. We used previously validated primers specific for the second internal transcribed spacer (ITS2) in the plant ribosomal sequence (Chen et al., 2010), and performed amplicon sequencing on DNA isolated from four samples of pollen collected by honey bees over a two week period in spring 2013. We chose the nuclear ITS2 region because prior investigation of ITS2 as a barcoding locus supports suitability for differentiating plant taxa at the genus and, in some cases, the species level (Chen et al., 2010; Han et al., 2013; Pang et al., 2013; Tripathi et al., 2013). Wilson et al. (2010) successfully used ITS2, as well as other ribosomal sequences, as a genetic barcode to identify monospecific pollen collected by a specialist bee. Our primary goal was to obtain taxonomic identification at the genus level, which made the ITS2 region suitable for this barcoding project.

Here, we apply our method to determine the pollen taxa collected by honey bee colonies located in an agroecosystem devoted to corn and soybean cultivation, a context in which nutritional stress due to agricultural development would be plausible. We then compare the results of our metabarcoding approach to those obtained by traditional microscopic analysis of the same pollen samples and discuss the relevance of our findings to both melissopalynological methodology and honey bee foraging in agricultural landscapes.

1.3 Methods

Sample collection

Samples for this study were obtained from an apiary managed by The Ohio State University located in Madison County, Ohio (39.96095, -83.43215) between April 23 and May 6, 2013. The apiary is located in an agricultural landscape dominated by corn and soybean fields (>60% land cover), with a small amount of forest, residential area, and uncultivated fields. Four healthy, actively foraging colonies in 8-frame Langstroth hives were fitted with Sundance I bottom-mounted pollen traps (Ross Rounds Inc., Albany, New York, USA) to collect pollen samples. Pollen traps were emptied and turned on and off on a twice weekly cycle, alternating between colonies, so that pollen was always sampled from two colonies while the remaining colonies were allowed pollen for sustenance. Samples were pooled across colonies for each collection date for all analyses.

Pollen identification by microscopy

To identify the pollen collected by honey bees, we sorted pollen pellets from 10% (15 – 68 g) of the total sample for each collection date into distinct color categories and weighed each category. A 10% subsample from each color category, or up to 10 pellets if a category contained fewer than 100 pellets, was then blended in a few drops of water to achieve a spreadable consistency and four drops of the homogenized suspension were mounted separately in basic fuchsin jelly (Kearns and Inouye, 1993) on glass slides for microscopic examination at 400x – 1000x magnification.


Pollen was identified to the lowest possible taxonomic level by comparison with reference pollen collected from fresh flowers found in the study area during the sampling period. Published pollen guides and image databases were used to identify pollens that did not match any of the reference collections. Plant taxa included in the reference collections and published references are listed in Appendix A. Representative pollen grains on the reference slides were photographed at 400x – 1000x magnification. Reference slides and digital images of the pollens are stored at the Ohio State University Bee Lab (Wooster, Ohio, USA).

The sum of weight for color categories belonging to the same taxon was recorded. If a color category was found to contain pollen of multiple taxa, we examined 1000 pollen grains on slides prepared for that color category and estimated the proportional abundance by counting pollen grains for each taxon. We chose to exclude any plant taxa constituting less than 1% of the pollen sample. We then measured the grain size for each taxon and estimated the proportional abundance using the volumetric method described in O’Rourke and Buchmann (1991). The weight of pollen from each taxon was summed to calculate the final proportion in the total sample for each collection date.
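The weight bookkeeping described above can be sketched in Python. This is only an illustration under stated assumptions: the function name, the input layout and the example numbers are hypothetical, the per-category taxon fractions are assumed to be volume-corrected already, and the 1% exclusion is applied to each taxon's share of the whole sample.

```python
from collections import defaultdict


def taxon_proportions(categories, min_fraction=0.01):
    """Return each taxon's share of the total sample weight.

    categories: list of (weight_g, {taxon: fraction}) pairs, one per color
    category. Fractions within a category are assumed to sum to 1 and to be
    volume-corrected already (an assumption of this sketch).
    """
    weights = defaultdict(float)
    for weight_g, fractions in categories:
        for taxon, fraction in fractions.items():
            # Split the category weight among the taxa found in it.
            weights[taxon] += weight_g * fraction
    total = sum(weights.values())
    shares = {taxon: w / total for taxon, w in weights.items()}
    # Drop taxa below the exclusion threshold (1% in the text).
    return {taxon: s for taxon, s in shares.items() if s >= min_fraction}


# Hypothetical color categories for one collection date (not study data).
example = [
    (6.0, {"Acer": 1.0}),
    (3.0, {"Fraxinus": 0.5, "Taraxacum": 0.5}),
    (1.0, {"Rosaceae": 1.0}),
]
proportions = taxon_proportions(example)
```

Note that the shares of the retained taxa are not renormalized after the exclusion step; whether to renormalize is a choice this sketch leaves open.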

Pollens from the family Rosaceae often share similar pellet colors and morphological characteristics, making them difficult to distinguish at the generic level by visual sorting and microscopy (Moore et al., 1991). Genera within Brassicaceae were also difficult to identify using light microscopy. Improving the accuracy of pollen identification to the genus level in these families would require scanning electron microscopy (SEM) and was beyond the scope of this study. Therefore, the abundance of Rosaceae and Brassicaceae estimated by microscopy was only presented at the family level.

Pollen identification by ITS2 metabarcoding


For metabarcoding analysis, 5% (8 – 34 g) of the original pooled samples from each sample date was homogenized to break the pollen balls up into a powder. Seventy-five percent ethanol was added to the mixture at roughly a 4:1 solvent to sample volume ratio. The pollen suspensions were then stirred on a magnetic stir plate for 25 minutes at room temperature. Additional solvent was added as necessary to ensure homogeneous sample mixing. Buchner funnel vacuum filtration was performed, and the resulting sample was transferred to a weigh boat to air dry in a flow hood.

We used a bead-beater pulverization method adapted from Simel et al. (1997) to free the genomic DNA from within the protective exine coat of the non-germinated pollen. Fifty milligrams of dry, homogenized pollen from each sample was placed in a 2.0 mL microcentrifuge tube with 600 µL Qiagen DNeasy Plant Minikit lysis buffer (Qiagen, Venlo, Limburg, Netherlands). Zirconium/silica beads (0.5 mm diameter) were added until the total contents of each tube reached 1.5 mL. The pollen was then pulverized in a bead-beater (Mini-Beadbeater-1, BioSpec Products, Bartlesville, Oklahoma, USA) for two minutes. After pulverization, 300 µL of DI water was transferred to each tube and mixed with the contents. A 300 µL portion of the resulting lysate was then transferred from each tube to a sterile 1.5 mL microcentrifuge tube. Pollen DNA was then isolated using the Qiagen DNeasy Plant Minikit.

After DNA isolation, the ITS2 region was amplified via PCR. We used two primers (forward: 5’-ATGCGATACTTGGTGTGAAT-3’; reverse: 5’-GACGCTTCTCCAGACTACAAT-3’) designed and validated by Chen et al. (2010), but we used alternative PCR conditions to accommodate our use of the Phusion® High-Fidelity PCR Kit (New England Biolabs, Ipswich, Massachusetts, USA). Parameters of the thermocycler (Mastercycler® ep gradient, Eppendorf AG, Hamburg, Germany) were set as follows: initial denaturation at 98°C for 30 seconds, 30 cycles of denaturation at 98°C for 10 seconds, annealing at 59°C for 30 seconds and extension at 72°C for 30 seconds, and a final extension step at 72°C for 10 minutes. These PCR conditions were calculated with the use of the NEB Tm Calculator (http://tmcalculator.neb.com/, New England Biolabs, Ipswich, Massachusetts, USA).
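For readers who wish to check the cycling program, the parameters stated above can be written down as data. The sketch below is illustrative only: the variable and function names are hypothetical, and the total counts programmed hold times alone, ignoring the time the instrument spends ramping between temperatures.

```python
# Thermocycler program from the text, as (step, temperature in °C, seconds).
INITIAL = [("initial denaturation", 98, 30)]
CYCLE = [("denaturation", 98, 10), ("annealing", 59, 30), ("extension", 72, 30)]
FINAL = [("final extension", 72, 10 * 60)]
N_CYCLES = 30


def hold_seconds(steps):
    """Sum the hold times of a list of (step, temperature, seconds) tuples."""
    return sum(seconds for _, _, seconds in steps)


def programmed_seconds(initial, cycle, final, n_cycles):
    """Total programmed hold time; ramping between temperatures is ignored."""
    return hold_seconds(initial) + n_cycles * hold_seconds(cycle) + hold_seconds(final)


total_s = programmed_seconds(INITIAL, CYCLE, FINAL, N_CYCLES)
print(f"{total_s} s ({total_s / 60:.1f} min) of programmed hold time")
```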

Following PCR, ITS2 amplicons were purified using the PureLink® PCR Purification kit (Life Technologies, Carlsbad, California, USA). At this point, 500 ng of purified PCR product from each respective sample was indexed independently using the NEBNext® Ultra™ DNA Library Prep Kit for Illumina® and NEBNext® Multiplex Oligos for Illumina® (New England Biolabs, Ipswich, Massachusetts, USA). After multiplexing, samples were subjected to a final Agencourt AMPure XP purification step before being pooled (Beckman Coulter catalog number A63880 [Brea, California, USA]). A final 9 cycle library amplification PCR step was performed and samples were then analyzed on a Qubit® 2.0 Fluorometer (Life Technologies, Carlsbad, California, USA), as well as an Agilent 2100 Bioanalyzer (DNA 1000 kit; Agilent Technologies, Santa Clara, California, USA) to ensure sample quality before sequencing. Paired end sequencing was performed with the Illumina MiSeq platform using the TruSeq LT assay. All sequence information has been deposited in the NCBI-Sequence Read Archive, accession code SRP044703.

To analyze the sequencing data for our samples, we adapted previously described metabarcoding analysis pipelines (Huson et al., 2011; Mitra et al., 2011). Illumina MiSeq reads were demultiplexed using CASAVA (v1.9; Illumina, Inc., San Diego, California, USA) and preprocessed using the artifacts filtering tool from the FASTX-Toolkit (v0.0.13; http://hannonlab.cshl.edu/fastx_toolkit/). Paired end read merging was then accomplished using PEAR (v0.9.1), a software package which calculates highly accurate paired end mergers of forward and reverse reads (Zhang et al., 2014). PEAR was used with default parameter settings with the exception of the Phred scale quality score trimming threshold, which was set to 20. The resulting fastq files were dereplicated and converted to FASTA format using the FASTX-Toolkit Collapser tool. After generation of the FASTA files, the resulting sequences were compared against a reference library of 2,628 plant ribosomal sequences downloaded from GenBank on March 11, 2014. This library represents approximately half of the 4,918 plant species potentially present in Ohio and surrounding states (USDA PLANTS database; http://plants.usda.gov/). The blastn algorithm (blast-2.2.17) (Altschul et al., 1997) was used to align reads with the GenBank-derived sequence library using the following settings: E-value cutoff 1e-125, number of alignments 1, output format 0, number of descriptions 1, percent identity threshold 95%. Alignment outputs were then annotated in MEGAN (v5.1.5) (Huson et al., 2011) using the lowest common ancestor (LCA) algorithm with the following parameter settings: min support 1, min support percent 0.0 (off), min score 50.0, max expected 1E-125, top percent 100.0, min complexity 0.00.
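The logic of the alignment filtering and lowest common ancestor assignment can be illustrated with a minimal Python sketch. It is not the MEGAN implementation: the `Hit` record, the function names and the example hits are hypothetical, alignments are assumed to have been parsed into records already, and lineages are simplified to (family, genus) tuples. Only the E-value and percent identity cutoffs are taken from the settings quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    read_id: str
    lineage: tuple  # simplified lineage, e.g. ("Rosaceae", "Malus")
    evalue: float
    pct_identity: float


def passes_thresholds(hit, max_evalue=1e-125, min_identity=95.0):
    """Apply the E-value and percent identity cutoffs quoted in the text."""
    return hit.evalue <= max_evalue and hit.pct_identity >= min_identity


def lowest_common_ancestor(lineages):
    """Return the longest rank-by-rank prefix shared by all lineages."""
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return tuple(shared)


def assign_reads(hits):
    """Map each read to the LCA of its hits that pass the thresholds."""
    by_read = {}
    for hit in hits:
        if passes_thresholds(hit):
            by_read.setdefault(hit.read_id, []).append(hit.lineage)
    return {read: lowest_common_ancestor(lins) for read, lins in by_read.items()}


# Hypothetical hits (not study data).
hits = [
    Hit("read1", ("Rosaceae", "Malus"), 1e-150, 99.0),
    Hit("read1", ("Rosaceae", "Prunus"), 1e-140, 97.5),
    Hit("read2", ("Sapindaceae", "Acer"), 1e-160, 99.5),
    Hit("read3", ("Poaceae", "Triticum"), 1e-60, 91.0),  # fails both cutoffs
]
assignments = assign_reads(hits)
```

In this example, a read with passing hits to two genera of the same family is assigned only to that family, while a read with a single passing hit keeps its genus-level assignment. When only one alignment per read is retained, as in the settings above, the LCA step reduces to reporting the lineage of that single hit.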

Analysis of Results

We used two statistical tests to assess whether the metabarcoding approach could produce reliable quantitative measures of relative abundance for the plant taxa within our samples. The Spearman’s rho statistic, a rank-based measurement, was employed to test the association between the number of paired end reads aligned and the amount of pollen in grams per plant taxon identified by microscopy. Additionally, average R-coefficients, analogous to those described by Bryant and Jones (2001) in relation to honey authentication, characterizing the ratio of the number of aligned paired-end reads relative to the abundance of pollen determined by microscopy for taxa found in at least three of the four samples, were calculated using the equation below. This coefficient can provide information as to whether certain taxa are overrepresented or underrepresented in the metabarcoding results relative to the microscopic results, as well as the variance in this over- and underrepresentation across samples. With further development, the R-coefficient may allow the quantitative estimation of the taxa present within pollen samples by enabling the calculation of taxon-specific correction coefficients.

R-coefficient = percent mapped reads / percent weight by microscopy
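Both statistics are simple to compute, as the following Python sketch shows. It is an illustration rather than the analysis code used in this study: the function names and input values are hypothetical, Spearman's rho is computed as the Pearson correlation of average ranks, and no p-value is calculated.

```python
from statistics import mean, stdev


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho; undefined (division by zero) if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def r_coefficient(pct_mapped_reads, pct_weight):
    """R-coefficient = percent mapped reads / percent weight by microscopy."""
    return pct_mapped_reads / pct_weight


# Hypothetical values for one taxon across three samples (not study data).
pct_reads = [40.0, 30.0, 20.0]
pct_weight = [20.0, 20.0, 10.0]
r_values = [r_coefficient(r, w) for r, w in zip(pct_reads, pct_weight)]
r_mean, r_sd = mean(r_values), stdev(r_values)
```

An R-coefficient above 1 indicates a taxon that is overrepresented in the reads relative to its weight by microscopy, and a value below 1 indicates underrepresentation.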

1.4 Results

Pollen identification by microscopy

During the sampling period (April 23 – May 6, 2013), a large proportion of bee-collected pollen originated from mass-flowering, woody species, ranging from 57% to 95% of the total sample weight per collection date (Figure 1.1; identified taxa are listed in Appendix B). Pollens of trees in the genera Acer L. and Fraxinus L., both of which have anemophilous flowers, were the most abundant pollens on the three earlier sampling dates. On May 6, over 90% of the total sample originated from entomophilous woody species, corresponding with the full bloom of rosaceous trees (e.g. Malus Mill., Prunus L., Crataegus L., and Amelanchier Medik.) and honeysuckle (Lonicera morrowii A. Gray and L. maackii [Rupr.] Maxim.) that were common in the surrounding landscape in early May. Pollen from herbaceous weeds, predominately dandelion (Taraxacum officinale F. H. Wigg.) followed by crucifers (Brassicaceae), comprised 43% of the total sample on April 23 and declined progressively to 5% on May 6.

Pollen identification by ITS2 metabarcoding

After Illumina MiSeq sequencing and merging of the forward and reverse reads, we obtained between 593,478 and 1,062,208 paired-end reads from the samples. The mean paired-end read lengths for all sequencing runs were between 461 bp and 469 bp. Upon annotating the BLAST output files with MEGAN 5, between 71,125 and 340,606 reads were taxonomically assigned. In total, we detected 42 distinct plant genera across the four samples.

However, upon cross-validating the list of genera against a floral phenology calendar (http://www.oardc.ohio-state.edu/gdd/) and Ohio State University herbarium records (https://herbarium.osu.edu/), we found alignments to three plant genera (Lactuca, Glycine, and Triticum) that were unlikely to be flowering during our sampling period. For a complete list of the plant genera detected and their respective paired-end alignment coverage, see Appendix C.

Comparison of Metabarcoding and Microscopic Techniques

To compare the results of ITS2 metabarcoding with those of traditional microscopy, we calculated the percentage of mapped reads per plant taxon and compared it with the percentage by weight of each plant taxon estimated using microscopic methods (Table 1.1). Using the metabarcoding approach, we were able to detect the majority of the plant taxa observed using the microscopic approach. Pollen from Salicaceae, Lamiaceae, and Lonicera was detected by microscopy but not by metabarcoding analysis.

To test the ability of the metabarcoding method to accurately determine the relative abundance of different pollen types, we compared paired-end alignment coverage with grams of pollen collected, as determined by color sorting and microscopic analysis (Spearman’s correlation test; p > 0.05 for all samples; Table 1.2). We found no significant associations between the relative abundances of pollen types identified using microscopic and metabarcoding-based methods. Furthermore, the R-coefficients showed that specific taxa were consistently over-represented or under-represented in the metabarcoding results relative to the microscopic results, though the degree of misrepresentation varied widely, with relative standard deviations between 39 and 84% (Table 1.3). These results indicate that ITS2 sequencing data alone are not sufficient for quantitative measures of pollen abundance across plant taxa.
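The rank-based comparison described above can be sketched in a few lines of Python. The example below computes Spearman's rho and the sum of squared rank differences (S) for the April 23 sample, using the percentages in Table 1.1 in place of grams and read counts; within a single sample the ranks are identical either way. Excluding the "other" category is an assumption made here, and the P-value is not computed. Function names are illustrative only.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Return (rho, S): rho is the Pearson correlation of the ranks and
    S is the sum of squared rank differences."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    s_stat = sum((a - b) ** 2 for a, b in zip(rx, ry))
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5, s_stat


# April 23 sample (Table 1.1): Taraxacum, Brassicaceae, Acer, Salicaceae.
percent_weight = [38.81, 3.39, 45.08, 10.26]
percent_reads = [12.77, 83.90, 2.86, 0.0]

rho, s_stat = spearman(percent_weight, percent_reads)
print(round(rho, 2), s_stat)  # prints: -0.4 14.0
```

These values agree with the April 23 row of Table 1.2.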

1.5 Discussion

Metabarcoding proved useful for identifying the diversity of pollen collected by honey bees, and it produced a longer list of plant genera than microscopy. This may reflect the genuinely large diversity of plant taxa visited by many thousands of foraging honey bees over a large foraging range. Some genera identified through metabarcoding, however, appear implausible given known floral phenology. The identification of these plant genera may be explained by (1) spurious false-positive BLAST alignments; (2) bees regurgitating honey stomach contents, which could contain stored honey or particles from bee bread collected and stored in the colony for months (Vasquez and Olofsson, 2009); or (3) contact between pollen foragers and stored bee bread within the hive prior to foraging. In the case of spurious false-positive alignments, it is important to note that the GenBank-derived BLAST library used in this study lacked ITS2 reference sequences for approximately half of the 4,918 plant species potentially present in Ohio and surrounding states (USDA PLANTS database; http://plants.usda.gov/). In the absence of a true reference sequence, the blastn algorithm could align a read to an incorrect taxon, provided quality control cutoffs were satisfied. Lastly, some unexpected grasses identified through metabarcoding, such as Poa and Anthoxanthum (Poaceae), may reflect non-foraging contact with pollen or pollen grains that were present in extremely low numbers and not detected by microscopy.

The metabarcoding approach does not provide quantitative estimates of the relative abundance of pollen in mixed samples. The nature of ribosomal and plant genetics skews quantitative calculations in a species-specific or even intra-species-specific manner. The number of ribosomal DNA cassette copies varies widely within and between species (Long and Dawid, 1980) and likely has a great influence on the number of reads generated for each taxon. Furthermore, variation in DNA extraction efficiency, primer annealing and genome copy number between plant species may also affect the number of taxon-specific reads.

Metabarcoding failed to detect pollen from plants in the genus Lonicera and the families Lamiaceae and Salicaceae, despite the fact that pollens from these taxa were identified using microscopy. This may be explained in several ways, such as sequence divergence in the PCR priming site or current sequencing length limitations for the Illumina MiSeq platform used (~600 bp), given that the length of the ITS2 region ranges from 100 to 700 bp for dicotyledons and monocotyledons (Yao et al., 2010). The failure of metabarcoding to detect these taxa suggests that improvements to this method are necessary to expand the scope of detection. This may be accomplished using different loci, multiple loci, or through the development of ITS2 primers specific to problematic plant clades. Recent work by Galimberti et al. (2014) applied rbcL and trnH-psbA amplicon cloning in conjunction with capillary sequencing to characterize the taxonomic composition of bee-collected pollen samples from Italian Alpine habitats. With the use of these loci, the investigators were able to detect Lonicera and Lamiaceae; however, given the absence of morphological validation of the amplicon clone-based method, it is difficult to infer whether the primers used for these loci were broader or narrower in taxonomic scope than the primers used in this study.

Both methods of pollen identification were in agreement with the conclusion that honey bees collect pollen from a wide variety of flowers in the agricultural landscape and that the taxa visited varied over the weeks of our study. What was apparent from our microscopic results, but less clear through metabarcoding, was a dramatic shift in the pollen assemblage collected by colonies (Figure 1.1). Early samples (April 23 and 29) were dominated by non-rosaceous woody taxa (Acer, Fraxinus) and dandelion (Taraxacum officinale). The later samples (May 2 and 6), however, saw a sudden increase in the prevalence of rosaceous woody taxa and a corresponding decline in Acer, Fraxinus, and T. officinale. This shift in principal pollen sources represents not only a phenological-taxonomic transition but also a change in spatial foraging patterns between different floral habitats. Acer and Fraxinus were restricted to small tracts of forest and scattered residential areas within the field-crop matrix, while T. officinale occurred abundantly in residential areas, along roadsides, and in agricultural fields that had not been recently tilled or treated with herbicide. The rosaceous woody taxa that became important in the latter half of our samples consisted of a combination of cultivated trees (e.g. Malus) from residential properties and wild taxa such as Crataegus and Prunus that occurred mainly in forest edge and understory habitat.

Our data underscore the dependence of honey bees upon landscape features that are systematically marginalized by agricultural intensification, namely field edges, forest and residential habitats. Furthermore, the prevalence of herbaceous weeds, such as T. officinale, is highly subject to weed control practices. Thus, we conclude that intensive field-crop agroecosystems are not necessarily nutritional deserts for honey bees in the springtime, but nutrition could become limiting if aggressive land conversion and weed control practices progress unchecked.

In addition to using this methodological pipeline for apicultural applications, our approach may provide a useful tool for many other areas of scientific research. Honey bees collect pollen from a diverse array of plant taxa and possess a near-global geographic distribution, and a single colony can forage over an area of more than 100 km² (Beekman and Ratnieks, 2000; Moritz et al., 2005). As such, honey bees may be seen as a large-scale sampling tool. This trait of honey bees has already been explored for monitoring industrial pollution, airborne bacteria and viruses, and even explosives (Bromenshenk et al., 1985; Lighthart et al., 2000; Lighthart et al., 2005). Furthermore, given a reliable, high-throughput metabarcoding approach to melissopalynology, honey bees could be employed to sample a regional flora rapidly, inexpensively, and with an intensity unapproachable by human investigators. Moreover, honey bees can easily sample environments that are not amenable to conventional floral sampling, such as dense forests or urban centers. This unique utility of honey bees could be broadly applicable in such fields as biogeography, biodiversity assessment, population genetics, floral phenology tracking, and invasive species monitoring.

In this study, we applied high-throughput ITS2 metabarcoding in conjunction with traditional microscopic melissopalynology to determine the pollen foraging habits of honey bees in a field-crop dominated agroecosystem. We demonstrate that the metabarcoding approach exhibits strengths in terms of ease of implementation and sensitivity, and weakness due to the non-quantitative nature of the method. As such, we suggest this method is useful for qualitative applications such as large-scale melissopalynology-based floral surveying and pollinator habitat preservation. Melissopalynology has previously been difficult to apply in such applications due to the time, expense, and expertise required for microscopic pollen identification. The development and further refinement of molecular melissopalynological techniques may be a suitable research avenue for overcoming these difficulties.


1.6 References

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25, 3389-3402.

Baum KA, Rubink WL, Coulson RN, Bryant VM (2011) Diurnal patterns of pollen collection by feral honey bee colonies in southern Texas, USA. Palynology, 35, 85-93.

Beekman M, Ratnieks FLW (2000) Long-range foraging by the honey-bee, Apis mellifera L. Functional Ecology, 14, 490-496.

Bryant VM, Jones GD (2001) The r-values of honey: pollen coefficients. Palynology, 25, 11-28.

Bromenshenk JJ, Carlson SR, Simpson JC, Thomas JM (1985) Pollution monitoring of Puget Sound with honey bees. Science, 227, 632-634.

Chen S, Yao H, Han J, Liu C, Song J, Shi L, Zhu Y (2010) Validation of the ITS2 region as a novel DNA barcode for identifying medicinal plant species. PLoS ONE, 5, e8613.

Cusser S, Goodell K (2013) Diversity and distribution of floral resources influence the restoration of plant–pollinator networks on a reclaimed strip mine. Restoration Ecology, 21, 713-721.

Dimou MG, Thrasyvoulou A (2007) A comparison of three methods for assessing the relative abundance of pollen resources collected by honey bee colonies. Journal of Apicultural Research and Bee World, 46, 144-148.

Dimou M, Goras G, Thrasyvoulou A (2007) Pollen analysis as a means to determine the geographical origin of royal jelly. Grana, 46, 118-122.

Erdtman G (1943) An introduction to pollen analysis. Chronica Botanica Company, Waltham, Massachusetts, USA.

Forcone A, Aloisi PV, Ruppel S, Muñoz M (2011) Botanical composition and protein content of pollen collected by Apis mellifera L. in the north-west of Santa Cruz (Argentinean Patagonia). Grana, 50, 30-39.

Galimberti A, De Mattia F, Bruni I, Scaccabarozzi D, Sandionigi A, Barbuto M, Casiraghi M, Labra M (2014) A DNA barcoding approach to characterize pollen collected by honeybees. PLoS ONE, 9, e109363.

Girard M, Chagnon M, Fournier V (2012) Pollen diversity collected by honey bees in the vicinity of Vaccinium spp. crops and its importance for colony development. Botany, 90, 545-555.

Han J, Zhu Y, Chen X, Liao B, Yao H, Song J, Chen S, Meng F (2013) The short ITS2 sequence serves as an efficient taxonomic sequence tag in comparison with the full-length ITS. Biomed Research International, 741476.

Huang Z (2012) Pollen nutrition affects honey bee stress resistance. Terrestrial Arthropod Reviews, 5, 175-189.

Huson DH, Mitra S, Ruscheweyh H-J, Weber N, Schuster SC (2011) Integrative analysis of environmental sequences using MEGAN 4. Genome Research, 21, 1552-1560.

Jones GD, Bryant VM (1992) Melissopalynology in the United States: a review and critique. Palynology, 16, 63-71.

Kearns CA, Inouye DW (1993) Techniques for pollination biologists. University Press of Colorado, Boulder, Colorado, USA.

Long EO, Dawid IB (1980) Repeated genes in eukaryotes. Annual Review of Biochemistry, 49, 727-764.

Lighthart B, Prier KRS, Loper GM, Bromenshenk JJ (2000) Bees scavenge airborne bacteria. Microbial Ecology, 39, 314-321.

Lighthart B, Prier KRS, Bromenshenk JJ (2005) Flying honey bees adsorb airborne viruses. Aerobiologia, 21, 147-149.

Louveaux J (1959) Recherches sur la récolte du pollen par les abeilles (Apis mellifica L) (Fin). Annales de l’Abeille, 2, 13-111.

Louveaux J, Maurizio A, Vorwohl G (1978) Methods of melissopalynology. Bee World, 59, 139-153.

Mitra S, Staerk M, Huson DH (2011) Analysis of 16S rRNA environmental sequences using MEGAN. BMC Genomics, 12, S17.

Moritz RFA, Härtel S, Neumann P (2005) Global invasions of the western honeybee (Apis mellifera) and the consequences for biodiversity. Ecoscience, 12, 289-301.

Moore PD, Webb JA, Collinson ME (1991) Pollen Analysis, Second Edition. Blackwell Scientific Publications, Oxford, UK.

Naug D (2009) Nutritional stress due to habitat loss may explain recent honeybee colony collapses. Biological Conservation, 142, 2369-2372.

O’Rourke MK, Buchmann SL (1991) Standardized analytical techniques for bee-collected pollen. Environmental Entomology, 20, 507-513.

Pang X, Shi L, Song J, Chen X, Chen S (2013) Use of the potential DNA barcode ITS2 to identify herbal materials. Journal of Natural Medicines, 67, 571-575.

Severson DW, Parry JE (1981) A chronology of pollen collection by honeybees. Journal of Apicultural Research, 20, 97-103.

Simel EJ, Saidak LR, Tuskan GA (1997) Method of extracting genomic DNA from non-germinated gymnosperm and angiosperm pollen. BioTechniques, 22, 390-392, 394.

Tripathi AM, Tyagi A, Kumar A, Singh A, Singh S, Chaudhary LB, Roy S (2013) The internal transcribed spacer (ITS) region and trnH-psbA are suitable candidate loci for DNA barcoding of tropical tree species of India. PLoS ONE, 8, e57934.

vanEngelsdorp D, Evans JD, Saegerman C, Mullin C, Haubruge E, Nguyen BK, Frazier M (2009) Colony collapse disorder: a descriptive study. PLoS ONE, 4, e6481.

vanEngelsdorp D, Meixner MD (2010) A historical review of managed honey bee populations in Europe and the United States and the factors that may affect them. Journal of Invertebrate Pathology, 103, S80-S95.

Vasquez A, Olofsson TC (2009) The lactic acid bacteria involved in the production of bee pollen and bee bread. Journal of Apicultural Research, 48, 189-195.

Wilson EE, Sidhu CS, LeVan KE, Holway DA (2010) Pollen foraging behaviour of solitary Hawaiian bees revealed through molecular pollen analysis. Molecular Ecology, 19, 4823-4829.

Winston ML (1987) The biology of the honey bee. Harvard University Press, Cambridge, Massachusetts, USA.

Yao H, Song J, Liu C, Luo K, Han J, Li Y, Pang X, Xu H, Zhu Y, Xiao P, Chen S (2010) Use of ITS2 region as the universal barcode for plants and animals. PLoS ONE, 5, e13102.

Zhang J, Kobert K, Flouri T, Stamatakis A (2014) PEAR: A fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics, 30, 614-620.


1.7 Tables

Table 1.1 Comparison of percent weight by microscopy to percent paired end read alignments per taxa for each sampling date.

Date      Taxa                              Percent weight by   Percent reads by
                                            microscopy (%)      sequencing (%)
April 23  Taraxacum officinale F.H. Wigg.   38.81               12.77
          Brassicaceae Burnett              3.39                83.90
          Acer L.                           45.08               2.86
          Salicaceae Mirb.                  10.26               0
          other                             2.45                0.47
April 29  Taraxacum officinale              24.97               16.89
          Acer                              21.96               1.25
          Fraxinus L.                       50.84               0.18
          other                             2.23                81.68
May 2     Taraxacum                         10.75               21.70
          Brassicaceae                      3.70                71.21
          Lamiaceae Martinov                2.89                0
          Acer                              41.09               6.61
          Fraxinus                          13.94               0.01
          Rosaceae Juss.                    26.98               0.04
          other                             0.65                0.43
May 6     Taraxacum                         2.44                7.69
          Brassicaceae                      2.26                91.22
          Fraxinus                          1.54                0.01
          Rosaceae                          83.54               0.01
          Lonicera                          6.51                0
          other                             3.72                1.07


Table 1.2 Spearman’s Rank-Based Correlation test between the number of paired end alignments and the number of grams per plant taxon.

Sample    rho score   S    P-value
April 23  -0.4        14   0.75
April 29  0.5         6    1
May 2     0.05        53   0.91
May 6     -0.2        24   0.75

Table 1.3 The average R-coefficients and standard deviations per taxon across all samples.

Taxa          Average ± SD for high depth sequencing
Brassicaceae  28.119 ± 10.954
Acer          0.094 ± 0.058
Fraxinus      0.003 ± 0.002
Taraxacum     1.544 ± 1.296


1.8 Figures

Figure 1.1 Pollen origins identified by microscopy during the sampling period. Shades of yellow represent pollen of herbaceous plants whereas shades of turquoise represent pollen of woody plants.

1.9 Acknowledgements

The authors thank J. Wenger, E. E. Wilson, C. S. Sidhu, J. Wallace, B. Klips, A. D. Wolfe, and anonymous reviewers for discussion, N. Douridas for site access, G. Cobb for help with sample processing, M. E. Hernandez-Gonzalez and the OARDC MCIC staff for technical support, and the Ohio Supercomputer Center. This study was funded by the Pollinator Partnership’s Corn Dust Research Consortium grant to RMJ and an OSU-Newark Scholarly Activity Grant to KG.


(originally published under the same title in Applications in Plant Sciences 3:1500043)

2.1 Abstract

Difficulties inherent in microscopic pollen identification have resulted in limited implementation for large-scale studies. Metabarcoding, a relatively novel approach, could make pollen analysis less onerous; however, improved understanding of the quantitative capacity of various plant metabarcode regions and primer sets is needed to ensure that such applications are accurate and precise. We applied metabarcoding, targeting the ITS2, matK and rbcL loci, to characterize six samples of pollen collected by honey bees, Apis mellifera. In addition, samples were analyzed by light microscopy. We found significant rank-based associations between the relative abundance of pollen types within our samples as inferred by the two methods. Our findings suggest metabarcoding data from plastid loci, as opposed to the ribosomal locus, are more reliable for quantitative characterization of pollen assemblages. Additionally, multi-locus metabarcoding of pollen may be more reliable than single-locus analyses, underscoring the need for discovering novel barcodes and barcode combinations optimized for molecular palynology.


2.2 Introduction

Quantitative identification of pollen by taxonomic origin is important for applications in pollination biology and conservation (Kearns and Inouye, 1993; Wilson et al., 2010; Forcone et al., 2011; Girard et al., 2012; Cusser and Goodell, 2013), authentication of apicultural products (Louveaux et al., 1978; Jones and Bryant, Jr., 1992; Dimou and Thrasyvoulou, 2007), and allergy-related airborne pollen monitoring (Longhi et al., 2009; Kraaijeveld et al., 2015).

Traditionally, pollen analysis has been accomplished using microscopic palynology, a technique involving the discrimination of pollen types by morphology (Erdtman, 1943). Due to the expertise required and difficulties associated with accurately distinguishing and identifying pollen from morphologically similar taxa, this technique has been difficult to implement at large scale. Thus, the development and improvement of novel techniques for pollen analysis is an area of current interest (Keller et al., 2015; Kraaijeveld et al., 2015; Richardson et al., 2015).

The application of DNA barcoding to pollen analysis displays promise as an efficient and reliable approach. Similar to comparing morphological features of unknown pollen to those of reference pollen from voucher specimens, DNA sequences of unknown origin can be compared to sequences from voucher specimens. Initial applications of this approach to pollen analysis have utilized capillary sequencing technology (Longhi et al., 2009; Wilson et al., 2010; Galimberti et al., 2014); however, advances in the accuracy and length of next-generation sequencing provide researchers a practical, high-throughput alternative (Keller et al., 2015; Kraaijeveld et al., 2015; Richardson et al., 2015).

Recently, next-generation sequencing was used to characterize the botanical origins of bee-collected pollen using the ribosomal intergenic ITS2 locus (Richardson et al., 2015). This target locus was chosen because previous studies suggested that plastids are rarely incorporated into pollen (Reboud and Zeyl, 1994; Mogensen, 1996; Azhagiri and Maliga, 2007). However, evidence from more recent studies suggests that pollen plastids may be common (Tang et al., 2009), enabling pollen metabarcoding of plastid loci (Galimberti et al., 2014; Kraaijeveld et al., 2015). Though the approach using ITS2 was successful in identifying pollen (Richardson et al., 2015), it suffered from two limitations: 1) the method failed to detect certain prominent taxa identified microscopically and 2) while the method generated a useful taxonomic list, the relative abundance of different pollen types could not be inferred from the sequence data. Here, we present an improvement in pollen metabarcoding by targeting the plastid loci matK and rbcL, in addition to the ribosomal ITS2 locus, to characterize polyfloral samples of pollen collected by honey bees. In addition, we compare our metabarcoding results with results from microscopic analysis to evaluate the range of taxa detected and the capacity for quantitative inference of rank order pollen type abundance using a multi-locus metabarcoding approach.

2.3 Methods and Results

Sample collection and homogenization

During spring 2014, bee-collected pollen samples were collected at six apiaries, all greater than 15 km apart, in west-central Ohio. The latitude and longitude of each apiary are provided in Appendix D, and apiaries are herein denoted as A, B, C, D, E and F. Using Sundance I bottom-mounted pollen traps (Ross Rounds Inc, Albany, New York, USA), we collected four samples from each site from May 5 to May 11, sampling every other day. Following collection, samples were pooled by site before homogenization. A 10% subsample (by weight) was taken from each pooled sample, mixed in 50% ethanol, and stirred for 25 min using a magnetic stir plate. Using Buchner funnel vacuum filtration and a paper filter (Whatman grade 1; Sigma-Aldrich, Saint Louis, MO, USA), we separated the homogenized pollen grains from the solvent and transferred them to a flow hood to air dry at room temperature.

Pollen identification and quantification by microscopy

We mixed 100 mg of the dried, homogenized pollen sample from each site in 0.5 ml of water and mounted five separate smears onto slides in basic fuchsin jelly (Kearns and Inouye, 1993). We then counted and identified approximately 1,000 pollen grains per slide for each pooled sample under a compound microscope at 400 – 1000x magnification. The voucher specimens used for pollen identification are listed in Richardson et al. (2015). A total of approximately 5,000 grains were analyzed per sample. Due to the difficulty in distinguishing some related plant taxa [e.g. within the Rosaceae Juss. (Moore et al., 1991)], we chose to limit microscopic identification to the family level. The total number of grains of pollen from each plant family, summed from each of the five slides, is available in Appendix E.

Pollen identification by metabarcoding

After drying our homogenized samples, we freed DNA from 50 mg of pollen per sample using bead-beater pulverization (Mini-BeadBeater-1; BioSpec Products, Bartlesville, Oklahoma, USA) (Simel et al., 1997). Each sample was placed in a 2.0 mL microcentrifuge tube with 600 µL lysis buffer from the Qiagen DNeasy Plant Minikit (Qiagen, Venlo, Limburg, Netherlands). Zirconia/silica beads (0.5 mm diameter) were added until the total contents of each tube reached 1.5 mL, and the sample was pulverized for 2 minutes. Then, 300 µL of DI water was added to each tube and mixed with the contents, and a 300 µL portion of the resulting lysate mix was transferred to a sterile 1.5 mL microcentrifuge tube. DNA was extracted using the Qiagen DNeasy Plant Minikit, and the ribosomal ITS2 and plastid matK and rbcL loci were amplified in separate PCR reactions. Amplification was conducted using previously published primer sets (Fay et al., 1997; Cuénoud et al., 2001; Chen et al., 2010) and the Phusion High-Fidelity PCR Kit (New England Biolabs, Ipswich, Massachusetts, USA) in a Mastercycler ep gradient PCR machine (Eppendorf AG, Hamburg, Germany). The ITS2, matK and rbcL amplicons were subsequently purified using the PureLink PCR Purification kit (Life Technologies, Carlsbad, California, USA). At this point, 500 ng of purified PCR product for each locus was indexed independently using the NEBNext Ultra DNA Library Prep Kit for Illumina and NEBNext Multiplex Oligos for Illumina (New England Biolabs, Ipswich, Massachusetts, USA). Multiplexed samples were purified before being pooled (Agencourt AMPure XP; Beckman Coulter, Brea, California, USA). A final 9-cycle library amplification step was performed, and samples were analyzed on a Qubit 2.0 fluorometer (Life Technologies, Carlsbad, California, USA) and an Agilent 2100 Bioanalyzer (DNA 1000 kit; Agilent Technologies, Santa Clara, California, USA) to ensure sample quality before sequencing. Paired-end sequencing was performed with the Illumina MiSeq platform using the TruSeq LT assay, 600 cycles. Sequence data are available from the NCBI Sequence Read Archive, accession code SRP055937.

Sequences were analyzed using an alignment-based approach. All computation was performed at the Ohio Supercomputer Center on a 12-core HP Intel Xeon x5650 machine with 48 GB of RAM. Reads were first trimmed by quality using Trimmomatic (v0.32; Bolger et al., 2014) with Phred+33-scaled quality thresholds of 20 for both the 5’ and 3’ ends of each read. Reads less than 50 base pairs in length were discarded. Reads were then dereplicated to minimize PCR amplification bias and converted to FASTA format using the FASTX-Toolkit (v0.0.13; http://hannonlab.cshl.edu/fastx_toolkit/). Next, reads were aligned against reference ITS2, matK, and rbcL plant sequences downloaded from NCBI GenBank on September 23, 2014. Reference libraries were constrained to only include plant species known to be present in Ohio and surrounding states based on the USDA PLANTS Database (http://plants.usda.gov/). A Venn diagram showing the completeness of each of the reference libraries is presented in Appendix F.

Alignment was performed using the blastn algorithm (v2.2.29+; Altschul et al., 1997). Alignment quality control thresholds were set as follows: E-value cutoff 1e-150, number of alignments 1, output format 0, number of descriptions 1. An additional setting, the percent identity threshold, differed between loci. For ITS2 we used a percent identity threshold of 95 percent, as in Richardson et al. (2015). However, given the relatively low sequence divergence between species at the matK and rbcL loci, we used a more stringent setting of 99 percent identity. Following BLAST alignment, we used MEGAN 5 (v5.1.5; Huson et al., 2011) to taxonomically summarize our results with the following settings: min support 1, min score 50.0, max expected 1e-150, top percent 100.0, min complexity 0.00, min support percent 0.0 (off), paired end mode. Complete family-level metabarcoding results are summarized in Appendix G.
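The alignment settings listed above can be collected into a single command per locus. The sketch below only assembles the blastn argument list and does not run it; the mapping of each stated threshold onto a BLAST+ command-line option is an assumption about how the settings were supplied, and the file and database names are hypothetical placeholders.

```python
import shlex

# Percent identity thresholds per locus, as stated in the text.
PERCENT_IDENTITY = {"ITS2": 95, "matK": 99, "rbcL": 99}


def build_blastn_command(locus, query_fasta, reference_db, output_path):
    """Assemble a blastn argument list for one locus (nothing is executed)."""
    if locus not in PERCENT_IDENTITY:
        raise ValueError(f"unknown locus: {locus}")
    return [
        "blastn",
        "-query", query_fasta,
        "-db", reference_db,
        "-out", output_path,
        "-evalue", "1e-150",
        "-perc_identity", str(PERCENT_IDENTITY[locus]),
        "-num_alignments", "1",
        "-num_descriptions", "1",
        "-outfmt", "0",
    ]


# Hypothetical file and database names, for illustration only.
command = build_blastn_command(
    "rbcL", "siteA_rbcL.fasta", "rbcL_reference", "siteA_rbcL.blast.txt"
)
print(shlex.join(command))
```

Keeping the per-locus threshold in one lookup table makes the only difference between the three runs explicit.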

Analysis of Results

After sequencing, we obtained between 78,975 and 224,428 forward reads and between 134,133 and 557,713 reverse reads across all 18 amplicon libraries. The median number of reads per locus was 258,987, 194,856 and 134,183 for ITS2, rbcL and matK, respectively. In total, these reads had best hits to plant species from 49 families. To limit the potential for false identification, we used a consensus-based approach, counting only families found in more than one of the three amplicon libraries for each sample. Consensus lists of the families detected and their relative abundance in each sample are provided in Appendix H. Using this approach, we confidently detected 25 plant families across the six sites. Using microscopy, 25 plant families were identified, six of which (Asparagaceae Juss., Elaeagnaceae Juss., Hamamelidaceae R. Br., Lamiaceae Martinov, Magnoliaceae Juss. and Poaceae Barnhart) were not identified by the metabarcoding consensus analysis. Though these families were detected microscopically, they were present at very low abundance, never constituting more than 0.5% of the 5,000 counted grains in any sample.
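The consensus rule described above, retaining only families found in more than one of the three amplicon libraries for a sample, can be sketched as a simple set operation. The detections below are hypothetical and serve only to show the filtering logic; the function name is likewise an illustrative assumption.

```python
from collections import Counter


def consensus_families(hits_by_locus, min_loci=2):
    """Return the families detected in at least `min_loci` amplicon libraries."""
    counts = Counter()
    for families in hits_by_locus.values():
        counts.update(set(families))  # count each family once per locus
    return {family for family, n in counts.items() if n >= min_loci}


# Hypothetical family-level detections for one sample.
sample = {
    "ITS2": {"Rosaceae", "Brassicaceae", "Asteraceae"},
    "rbcL": {"Rosaceae", "Salicaceae", "Asteraceae", "Fabaceae"},
    "matK": {"Rosaceae", "Salicaceae"},
}

print(sorted(consensus_families(sample)))
# prints: ['Asteraceae', 'Rosaceae', 'Salicaceae']
```

In this made-up sample, Brassicaceae and Fabaceae are dropped because each appears in only one library.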

To test the ability to infer the rank order abundance of different pollen types from the metabarcoding data, we conducted Spearman’s rank-based correlation between the number of paired-end read alignments and the number of pollen grains per plant family for each locus individually as well as for the mean of the rbcL and matK loci, excluding the ITS2 data. We chose to exclude ITS2 because data from this locus exhibited poor quantitative capacity in a prior study (Richardson et al., 2015). Lastly, we calculated R-coefficients for families detected across at least five of the six samples to determine which families were over- or under-represented in the metabarcoding analysis relative to microscopic analysis. The R-coefficient is used in authenticating honey provenance (Bryant, Jr. and Jones, 2001). In the context of this paper, the R-coefficient is the quotient, for a particular taxon, of the relative abundance as inferred by metabarcoding and the relative abundance as inferred by microscopy. We conducted this analysis on rbcL data because this locus exhibited a broad scope of detection and was the only single locus to produce significant rank-based correlations when compared to the microscopy data.
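One way to combine two loci before ranking is sketched below. It assumes that the "mean of the rbcL and matK loci" is taken over per-family proportions rather than raw read counts, which the text does not specify; the read counts are hypothetical and the function names are illustrative.

```python
def relative_abundance(read_counts):
    """Convert per-family read counts into proportions that sum to 1."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("library has no assigned reads")
    return {family: count / total for family, count in read_counts.items()}


def mean_across_loci(*libraries):
    """Average each family's proportional abundance across loci.
    A family absent from a locus contributes 0 for that locus."""
    families = set().union(*libraries)
    return {
        family: sum(lib.get(family, 0.0) for lib in libraries) / len(libraries)
        for family in sorted(families)
    }


# Hypothetical read counts for one sample.
rbcl = relative_abundance({"Rosaceae": 600, "Salicaceae": 300, "Asteraceae": 100})
matk = relative_abundance({"Rosaceae": 700, "Salicaceae": 100, "Asteraceae": 200})

combined = mean_across_loci(rbcl, matk)
ranked = sorted(combined, key=combined.get, reverse=True)
print(ranked)  # prints: ['Rosaceae', 'Salicaceae', 'Asteraceae']
```

The resulting rank order could then be compared with the microscopy counts using a rank-based correlation.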

Pollen from the families Rosaceae (commonly species of Malus Mill., Crataegus L., Amelanchier Medik., Prunus L. and other cultivated relatives) and Salicaceae Mirb. (predominantly Salix L. spp.) comprised over 65% of our samples (Figure 2.1). Pollen from plants in the Asteraceae (Taraxacum F. H. Wigg. spp.) and Oleaceae (Fraxinus L. spp.) was also abundant. Using Spearman’s rank-based correlation, we found moderate to strong associations between the rank order abundance of pollen types within our samples as inferred by the molecular and microscopic approaches. For the rbcL locus, rho values ranged from 0.536 to 0.939, and the associations were significant for five out of six samples (Table 2.1). For the mean of rbcL and matK, the associations were significant across all samples and rho values ranged from 0.570 to 0.939 (Figure 2.2). When matK and ITS2 were analyzed separately, associations between the molecular and microscopic relative abundances were not significant for any sample (Table 2.1). In our analysis of average R-coefficients, we found that certain families were consistently over- or under-represented in the molecular results relative to the microscopic results (Table 2.2). In particular, Brassicaceae Burnett, Caprifoliaceae Juss. and Salicaceae were under-represented in the molecular data by greater than threefold relative to the microscopy data, while Fabaceae Lindl. and Fagaceae Dumort. were over-represented by greater than threefold (Table 2.2).

2.4 Conclusions

We employed multi-locus metabarcoding alongside traditional microscopic palynology, with the latter being considered the current standard of practice. Using a consensus-based approach, we found significant rank-based correlations between rbcL sequence abundance and microscopically examined pollen grain abundance for five of six samples. However, using the mean of rbcL and matK sequence abundance, we found significant associations between metabarcoding and microscopic results across all sites. This suggests that while the rbcL locus may be quantitatively useful, the simultaneous use of multiple loci may improve quantitative measurement of pollen abundance.

Our multi-locus, consensus-based method exhibits promise as a powerful approach to pollen identification using metabarcoding. While no significant associations were found between matK sequence abundance and microscopy data, significant associations were found across all samples when matK sequence abundance was averaged with rbcL abundance. The poor performance of matK when used individually may be a result of incomplete universality displayed by the matK primer set (Chen et al., 2010). Despite its discriminatory power as a rapidly evolving plastidial coding region (Hilu and Liang, 1997), our data suggest the matK primer set used here may not be ideal for characterizing diverse pollen samples and may only be useful for supplementing data from other loci through average- or median-based analyses. Performing such analyses could enable researchers to both broaden the scope of detectable taxa and increase the quantitative capacity of metabarcoding efforts. Using one primer set to co-amplify a genetic region across taxonomically diverse samples can be problematic, as priming site sequence divergence may hinder or prevent amplification for some taxa, potentially leading to under-representation or even non-detection in the metabarcoding sequence data. Employing a suite of primers enables researchers to overcome this limitation.
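A minimal sketch of such an average- or median-based summary across loci is given here in Python. It assumes that a family not detected at a locus contributes a count of zero, which is a choice made for this illustration rather than a documented detail of our analysis; the read counts are hypothetical.

```python
from statistics import mean, median

def combine_loci(read_counts_by_locus, summary=mean):
    """Summarize per-family read counts across loci.

    Families not detected at a locus are counted as 0 (an assumption of this sketch).
    """
    families = sorted(set().union(*read_counts_by_locus.values()))
    return {
        family: summary([counts.get(family, 0) for counts in read_counts_by_locus.values()])
        for family in families
    }

# Hypothetical read counts per family at two loci.
counts = {
    "rbcL": {"Rosaceae": 4000, "Salicaceae": 600, "Fabaceae": 900},
    "matK": {"Rosaceae": 3000, "Fabaceae": 1500},
}
combined_mean = combine_loci(counts)
combined_median = combine_loci(counts, summary=median)
```

Here Salicaceae is retained in the combined result even though it was detected at only one of the two loci, which illustrates how a second locus can broaden the scope of detection.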

An additional metabarcoding issue involves minimizing the potential for false positive identifications. Across a diverse sample, it can be expected that some closely related taxa exhibit little sequence divergence at a particular locus. Employing multiple loci in conjunction with consensus-based analysis limits the potential for false positive identifications as the probability of the same false positive identification occurring across multiple independent loci is decreased relative to the probability for a single locus. Lastly, the completeness of the reference database is crucial for the successful application of pollen metabarcoding. While none of the libraries used here were entirely complete with respect to Ohio taxa, a large majority of the known species were represented (Appendix F).

Future research into different bioinformatic analyses, such as classifier-based analysis as opposed to the alignment-based analysis used here, is warranted. The current alignment-based approach does not provide confidence estimates for individual sequence-to-taxon assignments. Classifier-based approaches are commonly used in microbial ecology, where they have been designed for the analysis of ribosomal amplicon libraries (Wang et al., 2007). Keller et al. (2015) successfully applied a classifier-based approach to ribosomal amplicons originating from pollen DNA, but to our knowledge, this approach has never been applied to non-ribosomal loci, such as matK or rbcL. Successful application of this approach may enable researchers to better understand the confidence of taxonomic assignments on a read-by-read basis as well as across taxonomic ranks.

Though significant associations were found between the microscopic and molecular methods, the presence of outliers cannot be overlooked (Figure 2.2). Our analysis of family-specific R-coefficients shows that some families were consistently over- or under-represented in the molecular results when compared to the microscopic results (Table 2.2), suggesting that, in addition to stochastic sampling error, some systemic mechanism may bias results. Such systemic biases could be attributable to aspects of pollen plastid biology, such as taxon-specific rates of plastid incorporation or relationships between average pollen grain volume and plastid abundance. To our knowledge, no studies have directly addressed such basic questions of plastid biology within pollen tissue. Alternatively, these biases may be the result of decreased amplification efficiency for certain plant families, resulting in non-detection or underestimation of abundance.

Unless validated, pollen metabarcoding data should be questioned in terms of its capacity for quantitative inference. Such validation requires comparison between the results of novel molecular approaches and the standard method of microscopic palynology. Contemporary studies have applied this approach, using statistical tests including Pearson’s product-moment correlation, Spearman’s rank-based correlation, and generalized linear modeling to determine the quantitative capacity of metabarcoding techniques (Keller et al., 2015; Kraaijeveld et al., 2015; Richardson et al., 2015). One conclusion consistent with the analysis presented here, and that presented by Kraaijeveld et al. (2015), is that low copy number plastid loci provide generally quantitative results. However, studies disagree on the quantitative capacity of the repetitive ITS2 locus for metabarcoding (Keller et al., 2015; Richardson et al., 2015).

We employed multi-locus metabarcoding to characterize the taxonomic composition of polyfloral honey bee-collected pollen. Requiring only minute quantities of pollen, our approach can easily be applied to studying the foraging habits of individual honey bees, as well as other ecologically and economically important pollinators, such as solitary bees and bumble bees. Our results suggest that sequencing plastid loci produces semi-quantitative results. Furthermore, our results support the use of a multi-locus, consensus-based approach over single-locus barcoding. Further research is needed to validate these trends across a larger sample of plant taxa and additional barcode loci.

2.5 References

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25, 3389-3402.
Azhagiri AK, Maliga P (2007) Exceptional paternal inheritance of plastids in Arabidopsis suggests that low-frequency leakage of plastids via pollen may be universal in plants. The Plant Journal, 52, 817-823.
Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics, btu170.
Bryant VM, Jones GD (2001) The r-values of honey: Pollen coefficients. Palynology, 25, 11-28.
Chen S, Yao H, Han J, Liu C, Song J, Shi L, Zhu Y (2010) Validation of the ITS2 region as a novel DNA barcode for identifying medicinal plant species. PLoS ONE, 5, e8613.
Cuenoud P, Savolainen V, Chatrou LW, Powell M, Grayer RJ, Chase MW (2001) Molecular phylogenetics of Caryophyllales based on nuclear 18S rDNA and plastid rbcL, atpB, and matK DNA sequences. American Journal of Botany, 89, 132-144.
Cusser S, Goodell K (2013) Diversity and distribution of floral resources influence the restoration of plant–pollinator networks on a reclaimed strip mine. Restoration Ecology, 21, 713-721.
Dimou MG, Thrasyvoulou A (2007) A comparison of three methods for assessing the relative abundance of pollen resources collected by honey bee colonies. Journal of Apicultural Research and Bee World, 46, 144-148.
Erdtman G (1943) An introduction to pollen analysis. Chronica Botanica Company, Waltham, Massachusetts, USA.
Fay MF, Swenson SM, Chase MW (1997) Taxonomic affinities of Medusagyne oppositifolia (Medusagynaceae). Kew Bulletin, 52, 111-120.
Forcone A, Aloisi PV, Ruppel S, M. Muñoz M (2011) Botanical composition and protein content of pollen collected by Apis mellifera L. in the north-west of Santa Cruz (Argentinean Patagonia). Grana, 50, 30-39.
Galimberti A, De Mattia F, Bruni I, Scaccabarozzi D, Sandionigi A, Barbuto M, Casiraghi M, Labra M (2014) A DNA barcoding approach to characterize pollen collected by honeybees. PLoS ONE, 9, e109363.
Girard M, Chagnon M, Fournier V (2012) Pollen diversity collected by honey bees in the vicinity of Vaccinium spp. crops and its importance for colony development. Botany, 90, 545-555.
Hilu K, Liang H (1997) The matK gene: sequence variation and application in plant systematics. American Journal of Botany, 84, 830.
Huson DH, Mitra S, Ruscheweyh H-J, Weber N, Schuster SC (2011) Integrative analysis of environmental sequences using MEGAN 4. Genome Research, 21, 1552-1560.
Jones GD, Bryant VM (1992) Melissopalynology in the United States: A review and critique. Palynology, 16, 63-71.
Kearns CA, Inouye DW (1993) Techniques for pollination biologists. University Press of Colorado, Boulder, Colorado, USA.
Keller A, Danner N, Grimmer G, Ankenbrand M, von der Ohe K, von der Ohe W, Rost S, Härtel S, Steffan-Dewenter I (2015) Evaluating multiplexed next-generation sequencing as a method in palynology for mixed pollen samples. Plant Biology, 17, 558-566.
Kraaijeveld K, de Weger LA, García MV, Buermans H, Frank J, Hiemstra PS, Den Dunnen JT (2015) Efficient and sensitive identification and quantification of airborne pollen using next-generation DNA sequencing. Molecular Ecology Resources, 15, 8-16.
Longhi S, Cristofori A, Gatto P, Cristofolini F, Grando MS, Gottardini E (2009) Biomolecular identification of allergenic pollen: a new perspective for aerobiological monitoring? Annals of Allergy, Asthma and Immunology, 103, 508-514.
Louveaux J, Maurizio A, Vorwohl G (1978) Methods of melissopalynology. Bee World, 59, 139-153.
Mogensen HL (1996) The hows and whys of cytoplasmic inheritance in seed plants. American Journal of Botany, 83, 383-404.
Moore PD, Webb JA, Collinson ME (1991) Pollen analysis, second edition. Blackwell Scientific Publications, Oxford, UK.
Reboud X, Zeyl C (1994) Organelle inheritance in plants. Heredity, 72, 132-140.
Richardson RT, Lin CH, Sponsler DB, Quijia JO, Goodell K, Johnson RM (2015) Application of ITS2 metabarcoding to determine the provenance of pollen collected by honey bees in an agroecosystem. Applications in Plant Sciences, 3, 1400066.
Simel EJ, Saidak LR, Tuskan GA (1997) Method of extracting genomic DNA from non-germinated gymnosperm and angiosperm pollen. BioTechniques, 22, 390-92, 394.
Tang LY, Nagata N, Matsushima R, Chen Y, Yoshioka Y, Sakamoto W (2009) Visualization of plastids in pollen grains: Involvement of FtsZ1 in pollen plastid division. Plant Cell Physiology, 50, 904-908.
Wang Q, Garrity GM, Tiedje JM, Cole JR (2007) Naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73, 5261-5267.
Wilson EE, Sidhu CS, LeVan KE, Holway DA (2010) Pollen foraging behaviour of solitary Hawaiian bees revealed through molecular pollen analysis. Molecular Ecology, 19, 4823-4829.

2.6 Tables

Table 2.1 Spearman’s rank-based correlation between the total number of grains per plant taxon as determined by microscopy and the number of mate-paired aligned reads per plant taxon as determined by metabarcoding. Numbers indicate Spearman’s rho and p-value for each test (rho(p-value)).

Sample   ITS2            matK           rbcL
A        0.381(0.360)    0.469(0.203)   0.587(0.049)
B        -0.130(0.658)   0.515(0.072)   0.764(0.001)
C        0.238(0.582)    0.750(0.066)   0.939(< 0.001)
D        0.204(0.504)    0.262(0.536)   0.575(0.028)
E        0.067(0.854)    0.483(0.194)   0.536(0.073)
F        -0.005(0.989)   0.617(0.086)   0.762(0.002)

Table 2.2 Average R-coefficients for taxa present in at least five of the six samples.

Family                          Average R   SD
Rosaceae                        1.507       0.868
Asteraceae Bercht. & J. Presl   2.922       2.744
Brassicaceae                    0.190       0.116
Fagaceae                        5.972       4.734
Oleaceae Hoffmans. & Link       2.674       5.186
Caprifoliaceae                  0.091       0.017
Fabaceae                        7.509       9.631
Salicaceae                      0.097       0.083
Aceraceae Juss.                 0.618       0.893
Moraceae Gaudich.               0.960       1.650

2.7 Figures

Figure 2.1 Proportional taxonomic abundances of each sample as estimated by microscopy. Unidentified pollen grains and taxa present at less than or equal to 1% of the sample are grouped as “Other.”

Figure 2.2 Rank-transformed taxonomic abundance as estimated by the mean number of rbcL and matK metabarcoding reads (Y-axis) and the number of grains estimated microscopically (X-axis). Spearman’s rho and p-values are provided. Letters indicate apiary associated with each sample.

2.8 Acknowledgements

The authors thank the Ohio Agricultural Research and Development Center Molecular and Cellular Imaging Center staff for technical support; and D. B. Sponsler and R. A. Klips for helpful manuscript review and botanical advice, respectively. This study was funded by a Pollinator Partnership Corn Dust Research Consortium grant to R.M.J. and an Ohio State University–Newark Scholarly Activity Grant to K.G.

(originally published under the same title in Molecular Ecology Resources 17:760-769)

3.1 Abstract

The taxonomic classification of DNA sequences has become a critical component of numerous ecological research applications; however, few studies have evaluated the strengths and weaknesses of commonly used sequence classification approaches. Further, the methods and software available for sequence classification are diverse, creating an environment in which it may be difficult to determine the best course of action and the tradeoffs made using different classification approaches. Here, we provide an in silico evaluation of three DNA sequence classifiers, the RDP Naïve Bayesian Classifier, RTAX and UTAX. Further, we discuss the results, merits and limitations of both the classifiers and our method of classifier evaluation. Our methods of comparison are simple, yet robust, and will provide researchers a methodological and conceptual foundation for making such evaluations in a variety of research situations. Generally, we found a considerable trade-off between accuracy and sensitivity for the classifiers tested, indicating a need for further improvement of sequence classification tools.

3.2 Introduction

The sequencing of taxonomically informative genetic loci, termed DNA barcoding or metabarcoding depending upon the scale, has quickly emerged as a useful tool for the characterization of obscure biological specimens such as cryptic species, mixed-species tissue samples or even environmental DNA (Hebert et al. 2004; Schnell et al. 2012; Thomsen et al. 2012). The conceptual framework for this is straightforward and was first proposed by Hebert et al. (2003). As with traditional morphometric delimitation of biological taxa, where the physical features of a specimen are used for taxonomic categorization, so too can the genetic makeup of an organism be used for classification. In DNA barcoding, short genetic regions which exhibit high inter-specific and low intra-specific variability are utilized, allowing taxonomic inference by comparison between sequences from specimens of unknown identity and reference sequences of known taxonomic origin. Since the conception of DNA barcoding, the production of reference sequences has expanded rapidly, as witnessed in digital repositories such as BOLD (Ratnasingham and Hebert 2007) and NCBI (Benson et al. 2005), which respectively contain 60,320 and 211,132 entries for the plant rbcL locus, for example (accessed 06-13-2016). Further, with the advent of high-throughput sequencing, the capacity for researchers to probe diverse biotic assemblages, such as bee-collected pollen, soil microbial communities and environmental DNA, has expanded dramatically (Taberlet et al. 2012; Craine et al. 2015; Guardiola et al. 2015; Keller et al. 2015; Richardson et al. 2015a and b). Despite this progress, major obstacles to accurate and reliable DNA barcoding approaches still exist. Particularly with respect to DNA metabarcoding, analysis of the data produced for this application requires a breadth of computational expertise. With modern sequencing platforms, researchers must routinely deal with millions of DNA sequences representing diverse taxonomic groups. With respect to the taxonomic characterization of these sequences, a central methodology has yet to be converged upon. This has resulted in an environment in which it is difficult to assess the strengths, weaknesses and overall validity of results yielded by different classification schemes.

In general, DNA sequence classification methods can be delineated into three major categories (Table 3.1): alignment-based methods, composition-based methods and model-based methods (Xing et al. 2010; Bazinet and Cummings 2012). Alignment-based methods include best-hit alignment searches between query and reference sequences, often performed in conjunction with lowest common ancestor calculation or similar approaches, such as in MEGAN5, OBItools and RTAX (Huson et al. 2011; Soergel et al. 2012; Boyer et al. 2016). Compositional or feature vector-based classification is performed by reducing reference sequences into n-dimensional feature space vectors and formulating probability models of taxonomic inclusion for any sequence with a given set of features matching those of a reference or set of references. Such approaches include the k-gram style RDP Naïve Bayesian Classifier (RDP) and the UTAX classifier, both of which assign sequences based on the k-mer oligonucleotide words shared between queries and references (Wang et al. 2007; Edgar 2015c). Model-based classification is performed by modelling sequence relatedness, using phylogenetic models, Markov models or hidden Markov models (Krause et al. 2008; Munch et al. 2008; Brady and Salzberg 2009). It is worthwhile to note that not all classification software conforms to these categorical delineations and most are best described as utilizing multiple approaches. For example, k-gram style classifiers are generally described as compositional but mathematical models are used to infer the probability of a given query sequence belonging within the same taxon as a reference sequence based on the features shared between the two sequences. On an abstract level, these approaches can be viewed as attempts, either implicitly or explicitly, to model the variance in inter- and intra-taxonomic sequence divergence across taxa in order to delineate boundaries in a more rigorous fashion than can be achieved with a single sequence similarity threshold.
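To make the notion of shared k-mer words concrete, the following Python sketch scores a query against each reference by the fraction of the query's k-mers that also occur in that reference. It is a deliberately simplified illustration and is not the RDP or UTAX algorithm, both of which build probability models on top of word composition; the sequences, taxon labels and function names are hypothetical.

```python
def kmers(seq, k=8):
    """Set of all overlapping k-length words in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_word_fraction(query, reference, k=8):
    """Fraction of the query's k-mers that also occur in the reference."""
    query_words = kmers(query, k)
    if not query_words:
        return 0.0
    return len(query_words & kmers(reference, k)) / len(query_words)

def best_reference(query, references, k=8):
    """Return (taxon, score) for the reference sharing the most words with the query."""
    scores = {taxon: shared_word_fraction(query, seq, k) for taxon, seq in references.items()}
    taxon = max(scores, key=scores.get)
    return taxon, scores[taxon]

# Toy sequences; a short word length is used only because the sequences are short.
references = {"taxon_A": "ACGTACGTTTGG", "taxon_B": "TTTTGGGGCCCC"}
query = "ACGTACGTTT"
best = best_reference(query, references, k=4)
```

A raw shared-word score of this kind carries no measure of confidence, which is the gap that the probability models of compositional classifiers are intended to fill.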

The simplest, and often most easily implemented, of these classification approaches are those which rely primarily on sequence alignment. Such analyses generally treat all sequence divergence equally, regardless of how the sequences diverge with respect to sequence location or composition. Such classification approaches are popular in current literature (Cornman et al. 2015; Hawkins et al. 2015; Kraaijeveld et al. 2015; Richardson et al. 2015a and b; Vesterinen et al. 2016). While these alignment-based methods represent an attractively simple approach, their outcomes depend heavily on appropriately selecting sequence identity thresholds sufficient for taxonomic inference. This can be problematic, as the sequence identity thresholds which are sufficient for such analyses generally vary across the breadth of taxa analyzed and the DNA barcoding loci used. As such, well-validated, easily standardized and automated classification methods are desirable for their potential to limit confusion within the field and provide a clear and objective standard of practice.

Here, we examine the performance of three DNA metabarcoding sequence classifiers in terms of their accuracy, sensitivity and robustness to reference database incompleteness. To our knowledge, only a limited number of previous studies have attempted to characterize the strengths and weaknesses of different classification approaches, particularly with respect to bacterial 16S rRNA sequencing (Bazinet and Cummings 2012; Porter and Golding 2012; Lanzén et al. 2012; Bengtsson-Palme et al. 2015; Peabody et al. 2015). We argue that an ideal sequence classification tool exhibits a low rate of classification error, regardless of reference database completeness, and a capacity to assign as many sequences as possible to the highest resolution permitted by the reference data. Our analytical approach provides a useful test of these performance criteria and can be used to identify the strengths and weaknesses of different sequence classifiers.

We evaluated the performance of sequence classifiers using five loci from vascular plants, matK, rbcL, trnL, trnH and ITS2, all of which are commonly used for barcoding and for which there is good representation in online reference databases. To perform various tests of classifier performance, available reference sequences were split into classifier training sequences and testing sequences. With this design, the training sequences were used as reference data for the classification of the testing sequences. Further, the identity of each testing sequence was known and could be used to assess the validity of taxonomic assignments made during classification. Lastly, knowing the taxonomic identities of both the training and testing sequences enabled in-depth analysis of testing sequence cases belonging to taxa not represented in the training sequence database. Thus, the propensity for taxonomic mis-assignment as well as the sensitivity at the next lowest taxonomic level could be calculated for these cases in particular, revealing the merits of each classifier when dealing with sequences from taxa which are poorly represented in the reference database.

3.3 Methods

Data Preparation and Classifier Training

All NCBI Nucleotide entries for vascular plant ITS2, matK, rbcL, trnL and trnH sequences were downloaded on March 4th, 2016. Entries were filtered by length to remove sequences considerably longer or shorter than the barcode region of each locus. Entries with more than two sequential uncalled nucleotides were removed. Entries were then curated to only include sequences from plants known to be present in Ohio and surrounding states and provinces based on the USDA Plants Database (http://plants.usda.gov). Restricting the database in this way decreased the computational complexity of the analyses and allowed calculation of training set completeness such that the results can be viewed in the context of the completeness of the reference databases. While the effects of such geographic filtering on classification accuracy and sensitivity are currently unknown, this practice is common in current literature (Quéméré et al. 2013; Hawkins et al. 2015; Richardson et al. 2015a and b; McFrederick and Rehan 2016; Valentini et al. 2016). From the resulting sequences, 15 percent of the entries were randomly sampled from each locus to serve as experimental testing sets. The remaining 85 percent of entries from each locus were curated and prepared according to the best practices set forth for the training of the RDP classifier (v2.11; Wang et al. 2007). Briefly, duplicate and partial sequences were removed using the Java scripts provided with the RDP classifier and the Linnaean lineage of each sequence was retrieved using the NCBI Taxonomy Module (Sayers et al. 2011) and a Perl script provided in Sickel et al. (2015). This lineage consisted of the kingdom, phylum, class, order, family, genus and species identifications for each sequence. Sequences which were taxonomically undefined at any rank or unidentified at the species level were removed and the resulting lineages and sequences were used to train the RDP (v2.11; Wang et al. 2007), UTAX (v8.1; Edgar 2010) and RTAX (v0.984; Soergel et al. 2012) classifiers.
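The filtering and partitioning steps described above can be sketched as follows. This is a minimal illustration in Python, not the code used for the analysis: the length window, the random seed and the function names are hypothetical, and the geographic curation and lineage retrieval steps are omitted.

```python
import random
import re

# More than two sequential uncalled bases, i.e. a run of three or more N.
UNCALLED_RUN = re.compile(r"N{3,}", re.IGNORECASE)

def filter_entries(entries, min_len, max_len):
    """Keep entries inside the length window and without long runs of N."""
    return [
        (name, seq) for name, seq in entries
        if min_len <= len(seq) <= max_len and not UNCALLED_RUN.search(seq)
    ]

def split_train_test(entries, test_fraction=0.15, seed=0):
    """Randomly assign a fraction of entries to testing and the rest to training."""
    rng = random.Random(seed)
    shuffled = list(entries)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy entries: 20 acceptable sequences, one too short, one with a run of N.
entries = [(f"seq{i}", "ACGT" * 125) for i in range(20)]
entries.append(("too_short", "ACGT"))
entries.append(("uncalled", "ACGT" * 60 + "NNN" + "ACGT" * 60))

kept = filter_entries(entries, min_len=400, max_len=600)  # hypothetical window
training, testing = split_train_test(kept)
```

Fixing the seed makes the partition reproducible, so that each classifier can be trained and tested on identical sequence sets.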

Accuracy and Sensitivity Testing

To evaluate classifier accuracy and sensitivity among the previously mentioned testing sequence sets, the testing sequences from each of the five loci were classified using each classifier. For the RTAX and UTAX software, classifications were made using the default settings. For the RDP classifier, we used a bootstrap confidence threshold of 0.90, which is more stringent than the default threshold of 0.80 (http://rdp.cme.msu.edu/classifier/class_help.jsp). We chose an RDP bootstrap confidence threshold which exceeded the default confidence threshold as this is common in the literature (Price et al. 2009; Salazar et al. 2015; Tang et al. 2015). The classifications of each sequence were then compared to the NCBI-derived taxonomic identity to assess assignment and error rates using R (v3.2.3; R Development Core Team, 2014). In addition to testing error and sensitivity among all the test sequences, for each locus we also isolated sequence cases belonging to genera not represented in the corresponding training set. For these cases, we evaluated classifier performance at the genus and family levels. All computation for this work was performed on the Oakley cluster of the Ohio Supercomputer Center using a 12-core HP Intel Xeon X5650 machine with 48 GB of RAM (Ohio Supercomputer Center 1987). While this magnitude of computer power was useful for some database training steps, using a single core with 4 GB of RAM was sufficient for the majority of computation performed, including all of the sequence classification. For a detailed description of the commands used in this analysis and the complete training and testing sequences, see supplementary information available at http://github.com/johnson5005/evaluating-DNA-metabarcoding. Additionally, for the raw data from this analysis, see Appendix S1.
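The comparison of classifier output against known identity can be sketched as follows. This Python illustration assumes that both sensitivity (the proportion of sequences assigned) and error rate (the proportion assigned incorrectly) are expressed relative to the total number of test sequences; the original scoring was performed in R, and the genus names and calls here are hypothetical.

```python
def score_rank(true_taxa, assigned_taxa):
    """Return (sensitivity, error_rate) for one taxonomic rank.

    None in assigned_taxa marks a sequence left unassigned at this rank.
    Both proportions use the total number of test sequences as the
    denominator (an assumption of this sketch).
    """
    n = len(true_taxa)
    assigned = sum(1 for a in assigned_taxa if a is not None)
    wrong = sum(1 for t, a in zip(true_taxa, assigned_taxa) if a is not None and a != t)
    return assigned / n, wrong / n

def unrepresented_indices(test_genera, training_genera):
    """Indices of test sequences whose genus is absent from the training set."""
    known = set(training_genera)
    return [i for i, genus in enumerate(test_genera) if genus not in known]

# Hypothetical genus-level truth and classifier output for four test sequences.
truth = ["Salix", "Malus", "Acer", "Quercus"]
calls = ["Salix", None, "Prunus", "Quercus"]
sensitivity, error_rate = score_rank(truth, calls)

# Test sequences from genera that the training set does not contain.
novel = unrepresented_indices(truth, ["Salix", "Malus", "Quercus"])
```

For a sequence flagged by `unrepresented_indices`, any genus-level assignment is necessarily an error, so the desirable behavior is to leave it unassigned at the genus level while still assigning the correct family.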

Evaluating the relationship between classification confidence threshold and classifier performance

The previously mentioned training and testing sequences for the rbcL locus were used to investigate relationships between classification confidence threshold, classification accuracy and classification sensitivity at the genus level. Using the RDP and UTAX classifiers, the rbcL testing sequences were classified using 12 different confidence threshold cutoff values, spanning from 0.40 to 0.95 in increments of 0.05. The RTAX classifier was not used for this analysis as this software does not provide confidence estimates for each classification. In addition to analyzing the relationship between classifier confidence, accuracy and sensitivity, we calculated the ratio of errors per classification by dividing the total number of mis-classified cases by the number of cases classified. When viewed alongside the accuracy and sensitivity data, this statistic facilitates determination of the optimum confidence threshold. In performing this analysis, we observed a non-linear trend between UTAX classification confidence and errors per assignment. As a result, we further analyzed all five markers using UTAX only and a confidence threshold range of 0.45 to 0.95, in increments of 0.1.
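The threshold analysis can be illustrated with a short Python sketch that re-scores one set of classifications at each cutoff. It assumes that an assignment is retained when its confidence is greater than or equal to the cutoff, which may differ from the behavior of a given classifier, and the predictions shown are hypothetical.

```python
def sweep_thresholds(true_taxa, predictions, thresholds):
    """Score (taxon, confidence) predictions at each confidence cutoff.

    Returns rows of (threshold, sensitivity, error_rate, errors_per_assignment).
    """
    n = len(true_taxa)
    rows = []
    for cutoff in thresholds:
        assigned = wrong = 0
        for truth, (taxon, confidence) in zip(true_taxa, predictions):
            if confidence >= cutoff:  # assumption: ties are retained
                assigned += 1
                wrong += taxon != truth
        per_assignment = wrong / assigned if assigned else float("nan")
        rows.append((cutoff, assigned / n, wrong / n, per_assignment))
    return rows

# The 12 cutoffs used for rbcL: 0.40, 0.45, ..., 0.95.
thresholds = [round(0.40 + 0.05 * i, 2) for i in range(12)]

# Hypothetical genus-level truth and (call, confidence) pairs.
truth = ["a", "b", "c", "d"]
predictions = [("a", 0.99), ("x", 0.70), ("c", 0.50), ("y", 0.45)]
rows = sweep_thresholds(truth, predictions, thresholds)
```

Because errors per assignment divides by the number of sequences still classified, it can rise at stringent cutoffs if correct assignments are discarded faster than incorrect ones.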

Evaluating Classifier Robustness to Reference Database Completeness

The previously mentioned training and testing sequences for the rbcL locus were used to investigate relationships between reference database completeness, sensitivity and error. To obtain training sets of varying completeness, the training sequences were randomly sampled to obtain five samples of training data containing 25, 35, 50, 75 and 85 percent of the curated NCBI-derived entries, respectively. Training sets were then used as reference sequences for the classification of the original testing set (the 15% of sequences sampled from the total curated NCBI sequences) using each of the three classifiers. For this analysis, RTAX classification was performed using the default settings, RDP classification was performed using a 0.90 bootstrap confidence threshold and UTAX classification was performed using both a 0.90 and 0.65 confidence threshold. We chose to investigate UTAX performance at two confidence thresholds based on our previous analysis of the relationship between classification confidence threshold and errors per assignment.
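One way to draw such training sets is sketched here in Python. The sketch reads the stated percentages as shares of all curated entries, so that the 85 percent level corresponds to the full training set, and it draws each subset independently; both points are interpretations made for illustration rather than documented details of the analysis.

```python
import random

def subsample_training(training, fractions_of_total=(0.25, 0.35, 0.50, 0.75, 0.85),
                       training_share=0.85, seed=0):
    """Draw training subsets sized as fractions of all curated entries.

    training_share is the share of curated entries held in the full training
    set; each subset is drawn independently (an assumption of this sketch).
    """
    rng = random.Random(seed)
    total_curated = len(training) / training_share
    subsets = {}
    for fraction in fractions_of_total:
        size = min(len(training), round(total_curated * fraction))
        subsets[fraction] = rng.sample(training, size)
    return subsets

# Placeholder identifiers standing in for 850 training sequences.
training = list(range(850))
subsets = subsample_training(training)
sizes = {fraction: len(subset) for fraction, subset in subsets.items()}
```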

Statistical Analysis

To evaluate pairwise statistical differences in error rate and sensitivity between classifiers, we applied two-tailed chi-square tests. For chi-square tests on both classifications of the entire testing sequence sets as well as classifications of test sequences which lacked genus-level reference representation, we compared the proportion of sequences classified incorrectly across the classifiers to test for differences in error rate, while the counts of assigned and unassigned sequences were used to test for differences in sensitivity across classifiers. To investigate whether the locus analyzed affected sensitivity and accuracy results for each classifier, we used a chi-square test for equality of proportions on the genus level results. For our evaluation of classifier robustness to database incompleteness, linear models were used to regress classifier sensitivity and error rate against the log transformed number of reference sequences used in classifier training. To test for pairwise differences in the slope of the relationship between database completeness, error rate and sensitivity across classifiers, multiple linear regression analyses were performed in pairwise fashion for each classifier combination. For this analysis, we used ANOVA to test for statistical differences between two linear models of the data, one in which the classifiers were included as a covariate and an alternative which included classifier labels as interacting factors. To test for pairwise differences in the y-intercept of the relationship between database completeness, error rate and sensitivity across classifiers, ANCOVA was employed by regressing error and sensitivity against database completeness with classifier labels included as a covariate. For our analysis of the relationships between the classifier confidence threshold used and genus level classification sensitivity and error, our results were analyzed using linear regression. Using R, model selection was performed by producing best fit linear and 2nd degree polynomial models for each trend series. One-way ANOVA was then used to determine which model produced the best fit.
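As a worked illustration of the pairwise chi-square comparison, the following Python sketch computes the Pearson statistic and its p-value for a two-by-two table of counts. It omits the continuity correction, whereas statistical packages may apply one by default for two-by-two tables, so values from the original R analysis could differ slightly; the counts are hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value) with one degree of freedom.
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: misclassified vs. all other test sequences for two classifiers.
errors_x, others_x = 30, 270
errors_y, others_y = 60, 240
statistic, p_value = chi_square_2x2(errors_x, others_x, errors_y, others_y)
```

The same function applies to the sensitivity comparison by supplying counts of assigned and unassigned sequences in place of the error counts.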

3.4 Results

Upon curating the NCBI-derived reference sequences and dividing them into testing and training sets, we obtained between 1267 and 2003 training sequences and between 247 and 391 testing sequences across the five barcoding loci. The sum total of the testing and training sequences for each locus represented between 52 and 72 percent of the plant genera known to occur in Ohio and surrounding states and provinces. With respect to training set coverage of the testing sequences, the training sequence databases lacked generic representatives for between 28 and 36 percent of the testing sequence genera across the five loci (Figure 3.1).

Overall, our comparison of classifiers indicated that mean classification sensitivity tended to increase and mean error rate tended to decrease with decreasing taxonomic resolution (Figure 3.2). Further, though not significant at every taxonomic rank, there was a consistent trend in which UTAX displayed the lowest error rate while RTAX displayed the greatest error, with RDP displaying intermediate error rates. With respect to sensitivity, UTAX consistently displayed significantly lower sensitivity across all taxonomic ranks while RDP and RTAX displayed similar sensitivity at the order and family levels with RTAX displaying significantly greater genus-level sensitivity. Using a two-tailed chi-square test for equality of proportions, we found that the locus analyzed had no effect on classification accuracy for RDP and UTAX (P > 0.1 for both tests), while a significant effect was found for RTAX (P < 0.0001). With respect to classification sensitivity, the locus analyzed had a significant effect on the proportion of reads assigned for all three classifiers (P < 0.01 for all tests).

In our analysis of cases for which the genus of the test sequence was not represented in the corresponding training reference sequences, numerous significant differences were found (Figure 3.3). UTAX displayed significantly lower rates of genus-level classification, and thus misclassification, for these sequences. Further, the rate of genus-level misclassifications using RDP was significantly lower than using RTAX, the latter of which classified around 75 percent of all sequences to the wrong genus. With respect to family-level classification sensitivity of these sequences, UTAX displayed markedly lower sensitivity, while the RDP classifier displayed significantly higher rates of classification relative to both RTAX and UTAX. However, when considering family-level classification error for these sequences, RDP and UTAX were statistically indistinguishable and both displayed significantly lower error rates relative to RTAX.

In our investigation of the relationships between classifier confidence threshold and classification accuracy and sensitivity, we found strongly explanatory regression models for each relationship (median adjusted r-squared: 0.961; minimum adjusted r-squared: 0.494; Figure 3.4).

Interestingly, while accuracy, sensitivity and errors per assignment were negatively associated with increasing classification confidence threshold for RDP, only accuracy and sensitivity exhibited the same behavior for UTAX. While the relationship between UTAX confidence threshold and errors per assignment was negative from 0.4 to 0.65 confidence, a positive trend was observed from 0.65 to 0.95. When the relationship between UTAX confidence threshold and errors per assignment was investigated further using all five loci, similar trends were observed for matK and ITS2, with the lowest error per assignment occurring at intermediate confidence thresholds, ranging from 0.65 to 0.85, and increased error per assignment occurring at higher confidence thresholds. This trend, however, was not observed for trnL and trnH, which both exhibited the lowest error per assignment rate at the highest confidence threshold analyzed.
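To make the distinction between these three measures concrete, the following Python sketch computes them across a sweep of confidence thresholds. The definitions are assumptions made for illustration: sensitivity is taken as the proportion of test sequences assigned at or above the threshold, error rate as the proportion of all test sequences assigned incorrectly, and errors per assignment as incorrect assignments divided by total assignments. The function name and the example predictions are hypothetical.

```python
def threshold_metrics(predictions, threshold):
    """Summarize classifier output at one confidence threshold.

    predictions: list of (confidence, is_correct) pairs, one per test sequence.
    """
    assigned = [is_correct for confidence, is_correct in predictions if confidence >= threshold]
    n_total = len(predictions)
    n_assigned = len(assigned)
    n_errors = sum(1 for is_correct in assigned if not is_correct)
    return {
        "sensitivity": n_assigned / n_total,
        "error_rate": n_errors / n_total,
        # Defined as 0.0 when nothing is assigned, to avoid dividing by zero.
        "errors_per_assignment": n_errors / n_assigned if n_assigned else 0.0,
    }


# Hypothetical classifier output for eight test sequences.
example = [
    (0.98, True), (0.97, False), (0.90, True), (0.80, True),
    (0.70, True), (0.60, False), (0.50, False), (0.45, True),
]

for threshold in (0.40, 0.65, 0.95):
    print(threshold, threshold_metrics(example, threshold))
```

In this toy example, sensitivity and error rate never increase as the threshold rises, but errors per assignment is lowest at the intermediate threshold, mirroring the pattern described for UTAX.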

We found significant positive linear relationships between reference database size and sensitivity for RDP and UTAX (90 percent confidence) (P < 0.05 for both classifiers, two-tailed t-test), while significant negative linear relationships between reference database size and error rate were observed for RDP and RTAX (P < 0.01 for both classifiers, two-tailed t-test) (Table 3.2). Using multiple linear regression analysis, significant differences in the slopes of the relationships between classification error rate and reference database size were observed for five of the pairwise comparisons, with the comparison between the two confidence thresholds used for UTAX revealing no significant difference. With respect to classifier sensitivity, significant differences in slope were found only with the pairwise comparisons of RTAX and RDP, RTAX and UTAX (90 percent confidence) and RTAX and UTAX (65 percent confidence). In evaluating differences in y-intercept, all pairwise comparisons of the linear relationships exhibited significant differences for both accuracy and sensitivity (Figure 3.5).

3.5 Discussion

With increasing reliance on molecular techniques in many areas of ecological research, sequence classification is becoming a crucial component of many biological analyses. It is important that biologists using these techniques have an understanding of the relative performance of different classification tools so that appropriate inferences from molecular data can be made. Performing such evaluations can inform biologists in terms of what approaches display optimal classification accuracy, sensitivity and robustness to reference database incompleteness. While some studies have attempted to evaluate the validity of metabarcoding results through comparisons with other methods of empirical community analysis as well as the use of statistical approaches such as consensus filtering and site-occupancy detection models (Keller et al. 2015; Richardson et al. 2015b; Ficetola et al. 2016; Lahoz-Monfort et al. 2016; Valentini et al. 2016), our study directly approaches the problem of metabarcoding error at the stage of sequence classification.


Our analysis of the average error and assignment rates among all testing sequences lends insight into the relative merits of the classifiers tested. Further, the results should provide a rough estimation of real-world classification error and sensitivity given that the following assumptions are satisfied: 1) reference database completeness is similar to this study, 2) experimental sequences are of a similar length to those analyzed here, 3) nucleotide error rates are similar in magnitude and distribution to the sequence errors present in this study and 4) the reference training data and experimental data are independent and randomly distributed with respect to taxonomic representation. In this scenario, it is clear that meaningful differences in performance exist across these classifiers. For example, while UTAX appears superior with respect to error rate, under default settings it is clearly designed to be exceedingly stringent and thereby suffers from a low assignment rate, even at the family and order levels. Conversely, the RDP Naïve Bayesian Classifier, one of the most widely used sequence classifiers to date, displays high sensitivity and low error at the family and order levels; however, it exhibited a mean genus-level error rate of 9.6 percent when using a bootstrap confidence threshold of 0.9. Further, while RTAX displayed high sensitivity at all taxonomic ranks, it also produced the highest rates of error at all levels. Lastly, the significant effects of locus choice on classification accuracy for RTAX and classification sensitivity for RTAX, RDP and UTAX suggest that some inherent property or properties of the locus being analyzed affected classification performance in these cases. Though we did not investigate this effect further, such properties may, for example, include locus length, average divergence or the amount of data used in classifier training.

In addition to the general error and sensitivity analysis performed across all test sequences, our analysis of test sequences from genera not represented in the reference training data provided further differentiation of classifier performance. The sequences in this analysis represent cases in which genus-level assignment is impossible due to a lack of genus-level representation in the reference training data. Thus, these cases are ideal for assessing both how susceptible a classifier is to overclassification and how well a classifier can scale the taxonomic tree to find the correct family-level assignment (Edgar 2015a). Further, the tests used in this analysis are similar to those proposed by Edgar (2015b). Despite a lack of genus-level reference representatives for these cases, the RDP and RTAX classifiers mis-assigned an average of 31.4 and 73.8 percent of these sequences to the genus level, compared to 5.1 percent average overclassification for UTAX. However, in terms of family-level sensitivity, the RDP and RTAX classifiers displayed significantly greater sensitivity than UTAX, assigning an average of 77.5 and 67.7 percent of these sequences to the family level, compared to 29.8 percent for UTAX. Further, the RDP classifier displayed this greater family-level sensitivity with only marginally greater, yet statistically indistinguishable, family-level error relative to UTAX.
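The overclassification analysis described above can be sketched as follows. This Python example is a minimal illustration, not the pipeline used in this study: the record layout, function name and example taxa are hypothetical, an unassigned rank is represented by None, and family-level error is assumed to be expressed as a proportion of all novel-genus test sequences.

```python
def novel_genus_metrics(records, training_genera):
    """Evaluate classifier behavior on test sequences whose genus is absent from training data.

    records: list of dicts with keys true_genus, true_family, pred_genus, pred_family.
    training_genera: set of genus names represented in the training set.
    """
    novel = [r for r in records if r["true_genus"] not in training_genera]
    if not novel:
        raise ValueError("no test sequences from genera absent in the training set")
    n = len(novel)
    # Any genus-level assignment here is necessarily wrong (overclassification).
    genus_assigned = sum(1 for r in novel if r["pred_genus"] is not None)
    family_assigned = [r for r in novel if r["pred_family"] is not None]
    family_wrong = sum(1 for r in family_assigned if r["pred_family"] != r["true_family"])
    return {
        "genus_overclassification": genus_assigned / n,
        "family_sensitivity": len(family_assigned) / n,
        "family_error": family_wrong / n,
    }


# Hypothetical example: two genera in the training set, five test sequences.
training = {"Trifolium", "Solidago"}
test_records = [
    {"true_genus": "Medicago", "true_family": "Fabaceae",
     "pred_genus": "Trifolium", "pred_family": "Fabaceae"},
    {"true_genus": "Melilotus", "true_family": "Fabaceae",
     "pred_genus": None, "pred_family": "Fabaceae"},
    {"true_genus": "Symphyotrichum", "true_family": "Asteraceae",
     "pred_genus": None, "pred_family": None},
    {"true_genus": "Rubus", "true_family": "Rosaceae",
     "pred_genus": "Solidago", "pred_family": "Asteraceae"},
    {"true_genus": "Trifolium", "true_family": "Fabaceae",
     "pred_genus": "Trifolium", "pred_family": "Fabaceae"},
]
print(novel_genus_metrics(test_records, training))
```

The fifth record is excluded from the calculation because its genus is represented in the training set. The second record shows the desired behavior, in which the classifier declines a genus-level call but still recovers the correct family.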

Our investigation of the relationship between classification confidence threshold and classification error rate, sensitivity and errors per assignment revealed an interesting difference between the UTAX and RDP classifiers. When analyzing rbcL test sequences with the RDP classifier, the relationships of classification error, sensitivity and errors per assignment with classification confidence threshold were linear or approximately linear, and each measure always decreased with increasing confidence threshold. However, with UTAX, these relationships more closely resembled a quadratic relationship and, while error and sensitivity always decreased with increasing confidence threshold, errors per assignment were lowest when using a confidence threshold between 0.65 and 0.75. When we investigated this trend further using UTAX on the testing data from all five loci, a similar trend was observed for matK and ITS2, in which the lowest error per assignment rate was observed at a 0.85 confidence threshold, as opposed to the more stringent 0.95 confidence threshold, for these loci. This trend did not hold for trnL and trnH, however, where the lowest error per assignment rate was observed at the highest confidence threshold analyzed, 0.95. The variation in this relationship suggests that, when using UTAX for classification, different loci exhibit different optimal confidence thresholds and therefore increasing the UTAX confidence threshold will not always result in increased classification confidence.

With respect to sequence analysis, it is assumed that the database completeness dictates the quality of taxonomic assignment in terms of both sensitivity and accuracy (Taberlet et al. 2012; Sickel et al. 2015; Bell et al. 2016). While our analysis provides strong evidence for the relationships between database completeness, classification sensitivity and classification accuracy, we argue that such tradeoffs do not necessarily occur with all classification approaches. While it is intuitive that increased reference database completeness should improve sensitivity, we show that one classifier, RTAX, displays no such relationship. This finding, along with the high error rates displayed by RTAX, suggests that the design of this software leads to classification overconfidence, and ultimately, classifications for which there is insufficient support. Further, while the error rates of RDP and RTAX were significantly related to reference database size, UTAX, when executed using both a 65 and 90 percent confidence threshold, displayed low error rates independent of database size, suggesting that software programs which appropriately model taxonomic boundaries and assignment confidence without a great dependence on reference database completeness can be designed. Lastly, we show that, for certain loci, UTAX sensitivity can be increased with only minimal corresponding increases in error by optimizing the UTAX confidence threshold.


Here, we provide a comprehensive in silico analysis of classifier accuracy, sensitivity and robustness to reference database incompleteness. While we argue that our approach is useful, it is not without limitations. With respect to taxonomic authority, any analyses similar to those presented here are subject to some degree of analytical error due to taxonomic inconsistencies. However, our reliance on a single taxonomic database, NCBI Taxonomy, minimizes the potential for such artifacts. Further, given that results are most useful for relative comparisons of classifier performance due to potentially confounding factors such as reference database completeness and locus discriminatory power, useful inferences can be made even if such artifacts exist. Our choice of classifiers for testing reflects a preference for classification tools which are standardized and objective in their implementation. While classification software such as Megan5, OBItools and more customized alignment searches may exhibit merit for certain applications, these approaches suffer from potentially subjective search and annotation parameter settings.

Our findings highlight an apparent tradeoff between sensitivity and error rate, such that higher sensitivity also implies a tendency toward erroneous assignments. For some tools used in the taxonomic classification of microbial rRNA sequences, this tendency is not as pronounced (Bengtsson-Palme et al. 2015), and an obvious future goal for metabarcoding software would be to minimize this relationship using better algorithms or better use of a priori knowledge. In addition to focusing on classification software, improving the comprehensiveness and quality of reference sequence data is another approach to increasing the accuracy and sensitivity of metabarcoding efforts (Nilsson et al. 2014; Abarenkov et al. 2016). Thus, additional research efforts targeting reference database gaps and improved database curation techniques are also warranted.


Metabarcoding is an approach to taxonomic classification of biological samples that is still new but is rapidly finding applications in many areas of research. Identifying methods of sequence categorization that provide the desired level of taxonomic resolution and tolerable rates of error for a particular set of circumstances will be an ongoing challenge. Clear comparisons of classifier methods and analysis pipelines will aid the widespread adoption of and confidence in this developing field.

3.6 References

Abarenkov K, Adams RI, Laszlo I et al. (2016) Annotating public fungal ITS sequences from the built environment according to the MIxS-Built Environment standard – a report from a May 23-24, 2016 workshop (Gothenburg, Sweden). MycoKeys, 16, 1.
Bazinet AL, Cummings MP (2012) A comparative evaluation of sequence classification programs. BMC Bioinformatics, 13, 1-13.
Bell KL, Brosi BJ, de Vere N, Keller A, Richardson RT, Gous A, Burgess KS (2016) Pollen DNA barcoding: Current applications and future prospects. Genome, 59, 1-12.
Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH (2015) Metaxa2: improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Molecular Ecology Resources, 15, 1403-1414.
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2005) GenBank. Nucleic Acids Research, 33, D34-D38.
Boyer F, Mercier C, Bonin A, Le Bras Y, Taberlet P, Coissac E (2016) Obitools: A unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16, 176-182.
Brady A, Salzberg SL (2009) Phymm and PhymmBL: Metagenomic phylogenetic classification with interpolated Markov models. Nature Methods, 6, 673-676.
Cornman RS, Otto CRV, Iwanowicz D, Pettis JS (2015) Taxonomic characterization of honey bee (Apis mellifera) pollen foraging based on non-overlapping paired-end sequencing of nuclear ribosomal loci. PLoS ONE, 10, e0145365.
Craine JM, Towne EG, Miller M, Fierer N (2015) Climatic warming and the future of bison as grazers. Scientific Reports, 5, 16738.
Edgar RC (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460-2461.
Edgar RC (2015a) Taxonomy overclassification and underclassification errors. Accessed 06-01-2016. www.drive5.com/usearch/manual/tax_overclass.html


Edgar RC (2015b) Validating taxonomy classifiers. Accessed 06-01-2016. www.drive5.com/usearch/manual/taxonomy_validation.html
Edgar RC (2015c) UTAX algorithm. Accessed 06-01-2016. http://www.drive5.com/usearch/manual/utax_algo.html
Ficetola GF, Taberlet P, Coissac E (2016) How to limit false positives in environmental DNA and metabarcoding? Molecular Ecology Resources, 16, 604-607.
Guardiola M, Uriz MJ, Taberlet P, Coissac E, Wangensteen OW, Turon X (2015) Deep-sea, deep-sequencing: Metabarcoding extracellular DNA from sediments of marine canyons. PLoS ONE, 10, e0139633.
Hawkins J, de Vere N, Griffith A, Ford CR, Allainguillaume J, Hegarty MJ, Baillie L, Adams-Groom B (2015) Using DNA metabarcoding to identify the floral composition of honey: A new tool for investigating honey bee foraging preferences. PLoS ONE, 10, e0134735.
Hebert PDN, Cywinska A, Ball SL, deWaard JR (2003) Biological identifications through DNA barcodes. Proceedings of the Royal Society of London B: Biological Sciences, 270, 313-321.
Hebert PDN, Penton EH, Burns JM, Janzen DH, Hallwachs W (2004) Ten species in one: DNA barcoding reveals cryptic species in the neotropical skipper butterfly Astraptes fulgerator. Proceedings of the National Academy of Sciences, 101, 14812-14817.
Huson DH, Mitra S, Ruscheweyh H-J, Weber N, Schuster SC (2011) Integrative analysis of environmental sequences using MEGAN 4. Genome Research, 21, 1552-1560.
Keller A, Danner N, Grimmer G, Ankenbrand M, Von Der Ohe K, Von Der Ohe W, Rost S, Hartel S, Steffan-Dewenter I (2015) Evaluating multiplexed next generation sequencing as a method in palynology for mixed pollen samples. Plant Biology, 17, 558-566.
Kraaijeveld K, De Weger LA, Garcia MV, Buermans H, Frank J, Hiempstrah PS, Den Dunnen JT (2015) Efficient and sensitive identification and quantification of airborne pollen using next generation DNA sequencing. Molecular Ecology Resources, 15, 8-16.
Krause L, Diaz NN, Goesmann A, Kelley S, Nattkemper TW, Rohwer F, Edwards RA, Stoye J (2008) Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Research, 36, 2230-2239.
Lahoz-Monfort JJ, Guillera-Arroita G, Tingley R (2016) Statistical approaches to account for false-positive errors in environmental DNA samples. Molecular Ecology Resources, 16, 673-685.
Lanzén A, Jørgensen SL, Huson DH, Gorfer M, Grindhaug SH, Jonassen I, Øvreås L, Urich T (2012) CREST – classification resources for environmental sequence tags. PLoS ONE, 7, e49334.
McFrederick QS, Rehan SM (2016) Characterization of pollen and bacterial community composition in brood provisions of a small carpenter bee. Molecular Ecology, 25, 2302-2311.
Munch K, Boomsma W, Huelsenbeck JP, Willerslev E, Nielsen R (2008) Statistical assignment of DNA sequences using Bayesian phylogenetics. Systematic Biology, 57, 750-757.


Nilsson RH, Hyde KD, Pawłowska J et al. (2014) Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity, 67, 11-19.
Ohio Supercomputer Center (1987) Citation. Columbus, Ohio, USA. http://osc.edu/ark:/19495/f5s1ph73.
Peabody MA, Van Rossum T, Lo R, Brinkman FSL (2015) Evaluation of shotgun sequence classification methods using in silico and in vitro simulated communities. BMC Bioinformatics, 16, 363.
Porter TM, Golding GB (2012) Factors that affect large subunit ribosomal DNA amplicon sequencing studies of fungal communities: Classification method, primer choice, and error. PLoS ONE, 7, e35749.
Price LB, Liu CM, Melendez JH, Frankel YM, Engelthaler D, Aziz M, Bowers J, Rattray R, Ravel J, Kingsley C, Keim PS, Lazarus GS, Zenilman JM (2009) Community analysis of chronic wound bacteria using 16S rRNA gene-based pyrosequencing: Impact of diabetes and antibiotics on chronic wound microbiota. PLoS ONE, 4, e6462.
Quéméré E, Hibert F, Miquel C, Lhuillier E, Rasolondraibe E, Champeau J, Rabarivola C, Nusbaumer L, Chatelain C, Gautier L, Ranirison P, Crouau-Roy B, Taberlet P, Chikhi L (2013) A DNA metabarcoding study of a primate dietary diversity and plasticity across its entire fragmented range. PLoS ONE, 8, e58971.
R Core Team (2014) R: A language and environment for statistical computing. Vienna, Austria. http://www.R-project.org/
Ratnasingham S, Hebert PDN (2007) Bold: The barcode of life data system (http://www.barcodinglife.org). Molecular Ecology Notes, 7, 355-364.
Richardson RT, Lin C-H, Quijia JQ, Riusech NS, Goodell K, Johnson RM (2015a) Rank-based characterization of pollen assemblages collected by honey bees using a multi-locus metabarcoding approach. Applications in Plant Sciences, 3, 1500043.
Richardson RT, Lin C-H, Quijia JQ, Sponsler BD, Goodell K, Johnson RM (2015b) Application of ITS2 metabarcoding to determine the provenance of pollen collected by honey bees in a field-crop dominated agroecosystem. Applications in Plant Sciences, 3, 1400066.
Salazar G, Cornejo-Castillo FM, Borrull E, Diez-Vives C, Lara E, Vaque D, Arrieta JM, Duarte CM, Gasol JM, Acinas SG (2015) Particle-association lifestyle is a phylogenetically conserved trait in bathypelagic prokaryotes. Molecular Ecology, 24, 5692-5706.
Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K et al. (2013) Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 41, D8-D20.
Schnell IB, Thomsen PF, Wilkinson N, Rasmussen M, Jensen LRD, Willerslev E, Bertelsen MF, Gilbert MTP (2012) Screening mammal biodiversity using DNA from Leeches. Current Biology, 22, 262-263.
Sickel W, Ankenbrand MJ, Grimmer G, Holzschuh A, Härtel S, Lanzen J, Steffan-Dewenter I, Keller A (2015) Increased efficiency in identifying mixed pollen samples by meta-barcoding with a dual-indexing approach. BMC Ecology, 15, 1-9.


Soergel D, Neelendu Dey AW, Knight R, Brenner SE (2012) Selection of primers for optimal taxonomic classification of environmental 16S rRNA gene sequences. The ISME Journal, 6, 1440-44.
Taberlet P, Coissac E, Pompanon F, Brochmann C, Willerslev E (2012) Towards next-generation biodiversity assessment using DNA metabarcoding. Molecular Ecology, 21, 2045-2050.
Tang M, Hardman CJ, Ji Y, Meng G, Liu S, Tan M, Yang S, Moss ED, Wang J, Yang C, Bruce C, Nevard T, Potts SG, Zhou X, Yu DW (2015) High-throughput monitoring of wild bee diversity and abundance via mitogenomics. Methods in Ecology and Evolution, 6, 1034-1043.
Thomsen PF, Kielgast J, Iversen LL, Møller PR, Morten Rasmussen M, Willerslev E (2012) Detection of a diverse marine fish fauna using environmental DNA from seawater samples. PLoS ONE, 7, e41732.
Valentini A, Taberlet P, Miaud C, et al. (2016) Next-generation monitoring of aquatic biodiversity using environmental DNA metabarcoding. Molecular Ecology, 25, 929-942.
Vesterinen EJ, Ruokolainen L, Wahlberg N, Peña C, Roslin T, Laine VN, Vasko V, Sääksjärvi IE, Norrdahl K, Lilley TM (2016) What you need is what you eat? Prey selection by the bat Myotis daubentonii. Molecular Ecology, 25, 1581-1594.
Wang Q, Garrity GM, Tiedje JM, Cole JR (2007) Naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73, 5261-5267.
Xing Z, Pei J, Keogh E (2010) A brief survey on sequence classification. ACM SIGKDD Explorations, 12, 40-48.


3.7 Tables

Table 3.1 Summary of commonly used sequence classification approaches


| Classification approach | Alignment-based | Composition-based | Model-based |
|---|---|---|---|
| Method of reference comparison | Sequence alignment | Shared feature vectors (e.g. 8-mers) | (Multiple) sequence alignment |
| Data used for classification database | Reference sequences | Shared feature vectors and probabilities of taxonomic inclusion | Reference sequences and models of sequence divergence |
| Parameters used for taxonomic inference | Identity thresholds (often implemented with LCA calculation) | Quantity or proportion of feature vectors shared between reference and query sequence | Phylogenetically or taxonomically informative sequence divergences |
| Examples | Megan5, OBItools, RTAX | UTAX, RDP | SAP |

Table 3.2 Summary of tests of relationships between classification accuracy, sensitivity and database completeness. Asterisks indicate a significant relationship, as inferred by linear regression. “N.S.” indicates no significant relationship. The P-values from each test are shown in parentheses.

| Classifier | Accuracy significantly related to database completeness | Sensitivity significantly related to database completeness |
|---|---|---|
| RDP | * (0.006) | * (0.018) |
| RTAX | * (0.001) | N.S. (0.484) |
| UTAX (0.90 confidence) | N.S. (0.120) | * (0.033) |
| UTAX (0.65 confidence) | N.S. (0.055) | N.S. (0.056) |


3.8 Figures

Figure 3.1 Venn diagrams showing the genera known to be present in Ohio as well as the number of genera represented in our testing sets and training sets. Percentages next to diagram labels represent the total percent coverage of Ohio genera within both the training and testing sets, cumulatively. Percentage values in the test set represent the percentage of genera in the testing sequences which lacked a generic representative in the training set. Sets and unions are not drawn to scale.


Figure 3.2 Mean and standard error values of classifier sensitivity (A) and error rate (B) for classification of all test sequences. P-value matrices (C) show significant differences for pairwise comparisons between classifiers, inferred using two-tailed chi-square tests. Differences in sensitivity are shown in the upper-triangular (light orange) while differences in error are shown in the lower-triangular (light blue).


Figure 3.3 Mean and standard error values of genus-level error rate (A), family-level sensitivity (B) and family-level error (C) for test sequences belonging to genera not represented in the corresponding training set. P-value matrices (D) show significant differences for pairwise comparisons between classifiers, inferred using two-tailed chi-square tests. Differences in sensitivity are shown in the upper-triangular (light orange) while differences in error are shown in the lower-triangular (light blue).


Figure 3.4 (A-C) Genus-level classifier error rate, sensitivity and errors per assignment regressed against classification confidence threshold for rbcL testing sequences classified using both UTAX and RDP. (D-F) Genus-level error rate, sensitivity and errors per assignment regressed against classification confidence threshold for rbcL, matK, trnL, trnH and ITS2 testing sequences classified using UTAX.


Figure 3.5 Linear models of genus-level classifier sensitivity (A) and error rate (B) regressed against the log-scaled number of reference sequences used for classification. A P-value matrix (C) shows significant differences for pairwise comparisons of the slope and y-intercept of each linear model. Differences with respect to sensitivity are shown in the upper-triangular (light orange) while differences with respect to error are shown in the lower-triangular (light blue).

3.9 Acknowledgements

RTR was supported by the Project Apis m. - Costco Honey Bee Biology Fellowship. The authors thank Markus Ankenbrand for helpful comments regarding the bioinformatic analyses. This work was supported by an allocation of computing time from the Ohio Supercomputer Center and state and federal appropriations to the Ohio Agricultural Research and Development Center.

The authors thank Ken Kraaijeveld and anonymous reviewers for providing helpful comments on the manuscript.


(originally published under the same title in PeerJ 6:e5126)

4.1 Abstract

Metabarcoding is a popular application which warrants continued methods optimization. To maximize barcoding inferences, hierarchy-based sequence classification methods are increasingly common. We present methods for the construction and curation of a database designed for hierarchical classification of a 157 bp barcoding region of the arthropod cytochrome c oxidase subunit I (COI) locus. We produced a comprehensive arthropod COI amplicon dataset including annotated arthropod COI sequences and COI sequences extracted from arthropod whole mitochondrion genomes, the latter of which provided the only source of representation for Zoraptera, Callipodida and Holothyrida. The database contains extracted sequences of the target amplicon from all major arthropod clades, including all insect orders, all arthropod classes and Onychophora, Tardigrada and Mollusca outgroups. During curation, we extracted the COI region of interest from approximately 81 percent of the input sequences, corresponding to 73 percent of the genus-level diversity found in the input data. Further, our analysis revealed a high degree of sequence redundancy within the NCBI nucleotide database, with a mean of approximately 11 sequence entries per species in the input data. The curated, low-redundancy database is included in the Metaxa2 sequence classification software (http://microbiology.se/software/metaxa2/).


Using this database with the Metaxa2 classifier, we performed a cross-validation analysis to characterize the relationship between the Metaxa2 reliability score, an estimate of classification confidence, and classification error probability. We used this analysis to select a reliability score threshold which minimized error. We then estimated classification sensitivity, false discovery rate and overclassification, the propensity to classify sequences from taxa not represented in the reference database. Our work will help researchers design and evaluate classification databases and conduct metabarcoding on arthropods and alternate taxa.

4.2 Introduction

With the increasing availability of high-throughput DNA sequencing, scientists with a wide diversity of backgrounds and interests are increasingly utilizing this technology to achieve a variety of goals. One growing area of interest involves the use of metabarcoding, or amplicon sequencing, for biomonitoring, biodiversity assessment and community composition inference (Yu et al. 2012; Guardiola et al. 2015; Richardson et al. 2015). Using universal primers designed to amplify conserved genomic regions across a broad diversity of taxonomic groups of interest, researchers are afforded the opportunity to survey biological communities at unprecedented scales. While such advancements hold great promise for improving our knowledge of the biological world, they also represent new challenges to the scientific community.

Given that bioinformatic methods for taxonomic inference of metabarcoding sequence data are relatively new, the development, validation and refinement of appropriate analytical methods is ongoing. Relatively few studies have characterized the strengths and weaknesses of different bioinformatic sequence classification protocols (Porter et al. 2012; Bengtsson-Palme et al. 2015; Peabody et al. 2015; Somervuo et al. 2016; Richardson et al. 2017). Further, researchers continue to utilize a diversity of methods to draw taxonomic inferences from amplicon sequence data. Relative to alignment-based nearest-neighbor and lowest common ancestor-type classification approaches, methods involving hierarchical classification of DNA sequences are popular as they are often designed to estimate the probabilistic confidence of taxonomic inferences at each taxonomic rank. However, studies explicitly examining the accuracy of classification confidence estimates are rare (Somervuo et al. 2016).

When performing hierarchical classification, the construction, curation and uniform taxonomic annotation of the reference sequence database is an important methodological consideration. Database quality can affect classification performance in numerous ways. For example, artifacts within the taxonomic identifiers of a reference database can represent artificial diversity, and sequence data adjacent to the exact barcoding locus of interest, if included, likely display sequence composition that is unrepresentative of the barcoding locus. Lastly, sequence redundancy within reference databases increases computational resource use and is particularly problematic for classification software programs that classify sequences based on a set number of top alignments. In general, such database artifacts have the potential to bias model selection and confidence estimation both with k-mer style classifiers such as UTAX, SINTAX and the RDP Naïve Bayesian Classifier (Wang et al. 2007; Edgar 2015; Edgar 2016) and alignment-based classification approaches such as Metaxa2 and Megan (Huson et al. 2011; Bengtsson-Palme et al. 2015). Thus, it is important to identify and manage reference sequence database artifacts during curation for optimal downstream classification performance.

The use of molecular barcoding and metabarcoding in arthropod community assessment and gut content analysis has gained popularity in recent years (Corse et al. 2010; Yu et al. 2012; Mollot et al. 2014; Elbrecht and Leese 2017). However, as with other non-microbial taxonomic groups of interest, few researchers have developed hierarchical DNA sequence classification techniques for arthropods (Porter et al. 2014; Tang et al. 2015; Somervuo et al. 2017). Here, we detail the construction, curation and evaluation of a database designed for hierarchical classification of amplicon sequences belonging to a 157 bp COI locus commonly used for arthropod metabarcoding (Zeale et al. 2011). This work will serve as both a resource for those conducting experiments using arthropod metabarcoding and as a template for future work curating and evaluating hierarchical sequence classification databases.

4.3 Methods

Data collection and curation

To produce a comprehensive reference set, all COI annotated sequences from Arthropoda as well as three sister phyla, Mollusca, Onychophora and Tardigrada, between 250 and 2500 bp in length were downloaded from the NCBI Nucleotide repository on October 21st, 2016 using the search term ‘Arthropoda cytochrome oxidase subunit I’. To supplement this collection, all arthropod whole mitochondrion genomes were downloaded from NCBI Nucleotide on March 3rd, 2017 using the search term ‘Arthropoda mitochondrion genome’. For metagenetic analysis, the inclusion of close outgroup sequences is useful for estimating the sequence space boundaries between arthropods and alternate phyla. The Perl script provided in Sickel et al. (2015) was then used along with the NCBI Taxonomy module (NCBI Resource Coordinators 2018) to retrieve the taxonomic identity of each sequence across each of the major Linnaean ranks, from kingdom to species.

After obtaining the available sequences and rank annotations, we created an intermediate database to obtain extracted barcode amplicons of interest from the reference data using the Metaxa2 database builder tool (v1.0, beta 4; http://microbiology.se/software/metaxa2/). This tool creates the hidden Markov models (HMMs) and BLAST reference databases underpinning the Metaxa2 classification procedures. Prior to extraction, we randomly selected a reference sequence, trimmed it to the exact 157 bp barcode amplicon of interest and designated it as the archetypical reference during database building using the ‘-r’ argument. The section of the arthropod COI gene we trimmed this sequence to is the amplicon product of the commonly used primers of Zeale et al. (2011). The reference sequence is used in the database builder tool to define the range of the barcoding region of interest, and the software then trims the remainder of the input sequences to this region using the Metaxa2 extractor (Bengtsson-Palme et al. 2015). To increase the accuracy of multiple sequence alignment during this process, we split the original input sequences on the basis of length prior to running the database builder for amplicon extraction, creating four files with sequences of 250-500 bp, 501-600 bp, 601-2500 bp and whole mitochondrion genomes. Following sequence extraction, the database builder tool aligns trimmed sequences using MAFFT (Katoh and Standley 2013) and from this alignment the conservation of each residue in the sequence is determined. The most conserved regions are selected for building HMMs using the HMMER package (Eddy 2011). Input sequences that cover most of the barcoding region and are taxonomically annotated are used to build a BLAST (Altschul et al. 1997) database for sequence classification. Finally, the sequences in the BLAST database are aligned using MAFFT, and the intra- and inter-taxonomic sequence identities are calculated to derive meaningful sequence identity cutoffs at each taxonomic level. This entire process is described in more detail in the Metaxa2 2.2 manual (http://microbiology.se/software/metaxa2/) and in Bengtsson-Palme et al. (2018).
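
The length-based partitioning of input sequences described above can be illustrated with a short Python sketch. This is not the script used in this work: the function name `bin_by_length` and the bin labels are hypothetical, sequences are assumed to be already loaded as (identifier, sequence) pairs, and the whole mitochondrion genomes, which were downloaded separately, are assumed to be kept in their own file rather than binned by length.

```python
from collections import defaultdict

# Inclusive length bins (bp) taken from the text; the labels are hypothetical.
LENGTH_BINS = [
    ("250-500", 250, 500),
    ("501-600", 501, 600),
    ("601-2500", 601, 2500),
]


def bin_by_length(records):
    """Group (identifier, sequence) pairs by the length bin they fall into.

    Sequences outside every bin are silently dropped.
    """
    binned = defaultdict(list)
    for seq_id, seq in records:
        for label, low, high in LENGTH_BINS:
            if low <= len(seq) <= high:
                binned[label].append((seq_id, seq))
                break
    return dict(binned)
```

Each resulting group would then be written to its own FASTA file and passed to the database builder separately, so that each multiple sequence alignment only involves sequences of comparable length.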

After extraction, sequences were then curated by removal of duplicate sequences using the Java code provided with the RDP classifier (v2.11; Wang et al. 2007), which removes identical sequences or any sequence contained within another sequence. At this point, we conducted extensive curation of the available lineage data for the reference sequence database. For references lacking complete annotation at midpoints within the Linnaean lineage, we used Perl regular expression-based substitution to complete the annotation according to established taxonomic authorities, including MilliBase (Sierwald 2017), the Integrated Taxonomic Information System (http://www.itis.gov) and the phylogenomic analysis of Regier et al. (2010). Table 4.1 shows the substitutions made. Further, we removed ranks containing annotations reflective of open nomenclature, such as sp., cf. and Incertae sedis, as well as ranks annotated as ‘undef.’ Lastly, we removed entries containing more than two consecutive uncalled base pairs.
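
The two sequence-level filters described above, dereplication and removal of entries with runs of uncalled bases, can be sketched in Python as follows. This is an illustrative stand-in and not the RDP Java code used in this work; the function names are hypothetical, uncalled bases are assumed to be encoded as ‘N’, and the containment check is a simple quadratic scan that would be too slow for a dataset of this size without further indexing.

```python
import re

# "More than two consecutive uncalled base pairs" is read here as a run of
# three or more N characters (assumption: uncalled bases are encoded as N).
UNCALLED_RUN = re.compile(r"N{3,}", re.IGNORECASE)


def has_uncalled_run(seq):
    """Return True if the sequence contains three or more consecutive Ns."""
    return UNCALLED_RUN.search(seq) is not None


def dereplicate(records):
    """Drop sequences identical to, or contained within, a retained sequence.

    Returns (identifier, uppercased sequence) pairs, longest first.
    """
    kept = []
    # Longest first, so every sequence is compared against all sequences
    # that could contain it before it is retained.
    for seq_id, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        upper = seq.upper()
        if any(upper in kept_seq for _, kept_seq in kept):
            continue
        kept.append((seq_id, upper))
    return kept
```

Because the sort is stable, the first of several identical sequences in the input is the one retained.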

Upon analyzing the representativeness of this initial database across arthropod classes and insect orders, we found that amplicon sequences from two insect orders, Strepsiptera and Embioptera, were not present in the curated database, likely due to their poor sequence similarity to the reference sequence used to designate the amplicon barcode region of interest. To add Strepsiptera and Embioptera COI amplicons, all NCBI COI sequences belonging to these orders were downloaded on October 10th, 2017, curated and added to the Metaxa2 COI database. To improve recovery of amplicons from these insect orders during curation, a representative sequence from both Embioptera and Strepsiptera, representing the 157 bp COI amplicon of interest, was used when building the Metaxa2 database. This retrospective addition of sequences belonging to Strepsiptera and Embioptera contributed 102 and 3 non-redundant reference sequences to the database, respectively. After this final sequence addition step, a Metaxa2 database was built to include all curated sequences and this database is available through the Metaxa2 software package (http://microbiology.se/software/metaxa2/).


To assess the degree to which our amplicon sequence extraction, dereplication and curation procedures worked, we took inventory of the number of sequences per species in the initial input data as well as the number of sequences and genera present in the data at three points during curation: 1) in the initial input data, 2) following Metaxa2 database builder-based amplicon sequence extraction and 3) in the final database following dereplication and taxonomic curation.

Classifier performance evaluation

For performance evaluations, the methods used were highly similar to those of Richardson et al. (2017). For three repeated samplings, we randomly selected 10 percent of the curated sequences to obtain testing data, using the remaining 90 percent of sequences to train the Metaxa2 classifier for performance evaluations. To assess the effect of sequence length on classifier performance, we used a Python script to crop the test case sequences to 80 bp in length, approximately half the median length of the original reference sequence dataset. Evaluating classification performance on these short sequences provides a test of the classifier's robustness to sequence length variation and enables estimation of the potential for classifying sequences from short-read, high-throughput technologies, such as 100 cycle single-end Illumina HiSeq sequencing. We then performed the following analyses on both the full-length (157 bp) and half-length (80 bp) test case sequences, separately.
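
The repeated 90/10 partitioning and the cropping of test sequences can be sketched in Python as below. This is a minimal sketch rather than the script used in this work: the function names, the use of fixed random seeds and the choice to crop from the start of each sequence are all assumptions, since the text does not specify which end of the amplicon was retained.

```python
import random


def split_train_test(records, test_fraction=0.10, seed=0):
    """Randomly partition records into (training, testing) lists."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def crop(records, length=80):
    """Keep the first `length` bases of each (identifier, sequence) pair.

    Cropping from the start of the sequence is an assumption.
    """
    return [(seq_id, seq[:length]) for seq_id, seq in records]


def make_replicates(records, n_replicates=3):
    """Build independent train/test splits, with full and cropped test sets."""
    replicates = []
    for seed in range(n_replicates):
        train, test = split_train_test(records, seed=seed)
        replicates.append({"train": train, "test_full": test, "test_80bp": crop(test)})
    return replicates
```

Each replicate is drawn independently from the full curated set, so a given sequence can appear in the testing data of more than one replicate.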

To characterize the relationship between the Metaxa2 reliability score, an estimate of classification confidence, and the probability of classification error, we used the COI trained classifier to classify the testing datasets, requiring the software to classify to the family rank regardless of the reliability score of the assignment. After comparing the known taxonomic identity of each reference test case to the Metaxa2 predicted taxonomic identity, we regressed 5,000 randomly chosen binary classification outcomes, ‘1’ representing an incorrect classification and ‘0’ representing a correct classification, against the Metaxa2 reliability score using local polynomial logistic regression in R (v3.3.1; R Core Team 2014) with the span set to a value of 0.5.
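
The idea behind this regression can be illustrated with a simplified Python sketch. It is not the R analysis used in this work: it fits only a degree-zero (local constant) model, for which the local maximum-likelihood estimate of an intercept-only logistic model reduces to the kernel-weighted mean of the binary outcomes. The tricube kernel, the nearest-neighbour bandwidth and the function name `local_error_rate` are assumptions made for illustration; only the span of 0.5 comes from the text.

```python
def local_error_rate(scores, errors, x0, span=0.5):
    """Estimate P(error | reliability score = x0) by a local weighted mean.

    scores: reliability scores of the test cases
    errors: matching outcomes, 1 for an incorrect and 0 for a correct call
    span:   fraction of the data used in each local neighbourhood
    """
    n = len(scores)
    k = min(n, max(2, round(span * n)))
    # Bandwidth is the distance to the k-th nearest score.
    dists = sorted(abs(s - x0) for s in scores)
    bandwidth = dists[k - 1] or 1e-9  # guard against a zero bandwidth
    num = den = 0.0
    for s, e in zip(scores, errors):
        u = abs(s - x0) / bandwidth
        if u < 1.0:
            w = (1.0 - u ** 3) ** 3  # tricube weight
            num += w * e
            den += w
    return num / den if den else float("nan")
```

Evaluating this function over a grid of reliability scores gives a curve of estimated error probability against reliability score, from which a threshold matching a target error rate can be read off.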

For each of the three testing and training datasets, we classified the testing sequences using Metaxa2 with a reliability score threshold (-R) of 68. With the resulting classifications, we compared the known taxonomic identity of each reference test case to its Metaxa2 classification, from kingdom to species, to assess the proportion of true positive, true negative, false negative and false positive predictions. We also calculated false discovery rate, as measured in errors per assignment for each rank.
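
The outcome categories and the false discovery rate can be made concrete with a short Python sketch. The definitions used here are assumptions, as they are not spelled out in the text: an unassigned sequence is counted as a true negative when its taxon is absent from the training data and as a false negative otherwise, and the false discovery rate is taken as false positives divided by all assignments. The function names are hypothetical and this is not the script distributed with this work.

```python
from collections import Counter


def classify_outcome(true_taxon, predicted_taxon, training_taxa):
    """Label one test case at one rank as TP, FP, TN or FN.

    predicted_taxon is None when the classifier made no assignment.
    """
    if predicted_taxon is None:
        return "FN" if true_taxon in training_taxa else "TN"
    return "TP" if predicted_taxon == true_taxon else "FP"


def summarize(cases, training_taxa):
    """Summarize (true_taxon, predicted_taxon) pairs for a single rank."""
    counts = Counter(classify_outcome(t, p, training_taxa) for t, p in cases)
    total = sum(counts.values())
    assigned = counts["TP"] + counts["FP"]
    return {
        "proportions": {k: counts[k] / total for k in ("TP", "TN", "FN", "FP")},
        "proportion_assigned": assigned / total,
        # Errors per assignment.
        "false_discovery_rate": counts["FP"] / assigned if assigned else 0.0,
    }
```

Running `summarize` once per rank, from kingdom to species, would give the rank-by-rank summaries described above.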

To assess the rate of taxonomic overclassification at the genus and species levels, we searched the testing dataset for sequence cases belonging to arthropod genera and species not represented in the training database. For each of the two ranks, we then determined the proportion of these sequences which were classified. Since the actual identity of such a sequence case is not represented in the training data, any such classification represents a particular type of misclassification known as an overclassification or overprediction. Lastly, for these sequence cases, we looked at how the classifier performed at the preceding rank (e.g. for the species-level cases, we analyzed classifier performance at the genus-level). Such analysis provides a measure of how well the software is able to perform at the next higher rank for these worst-case-scenario input sequences.
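
As a minimal illustration of this calculation, the Python sketch below selects the test cases whose true taxon at a given rank is absent from the training data and reports the proportion that received any assignment at that rank. The function name and input layout are hypothetical.

```python
def overclassification_rate(cases, training_taxa):
    """Return (number of novel test cases, proportion of them classified).

    cases: (true_taxon, predicted_taxon) pairs at one rank, with
           predicted_taxon set to None when no assignment was made
    training_taxa: set of taxa at that rank present in the training data
    """
    novel = [(t, p) for t, p in cases if t not in training_taxa]
    if not novel:
        return 0, float("nan")
    # Any assignment for a taxon missing from the training data is an error.
    overclassified = sum(1 for _, p in novel if p is not None)
    return len(novel), overclassified / len(novel)
```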

For each order in the testing data, we estimated the family, genus and species-level proportion of sequences assigned and false discovery rate to estimate the degree of variance in performance across major arthropod lineages. For this analysis, the false discovery rate was again defined as the number of errors per assignment. After conducting this analysis, we limited our interpretation of the results to orders with at least 100 test sequence cases at all of the ranks analyzed: family, genus and species. A Python script which takes the testing sequence taxonomies, training sequence taxonomies and Metaxa2 predicted taxonomies as input and provides the summaries of classification performance described above is provided with the GitHub repository associated with this work, which is detailed in the Supplemental Information section.
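
The per-order summary can be sketched in Python as follows. This is an illustration only and not the script in the repository; the function name and input layout are hypothetical. The sketch applies the minimum of 100 test cases to a single rank, so restricting interpretation to orders that pass at all three ranks would require intersecting the results across ranks.

```python
from collections import defaultdict


def per_order_summary(cases, min_cases=100):
    """Summarize performance by order for a single rank.

    cases: (order, true_taxon, predicted_taxon) triples, with
           predicted_taxon set to None when no assignment was made
    """
    grouped = defaultdict(list)
    for order, true_taxon, predicted in cases:
        grouped[order].append((true_taxon, predicted))
    summary = {}
    for order, rows in grouped.items():
        if len(rows) < min_cases:
            continue  # too few test cases to interpret
        assigned = [(t, p) for t, p in rows if p is not None]
        errors = sum(1 for t, p in assigned if t != p)
        summary[order] = {
            "n": len(rows),
            "proportion_assigned": len(assigned) / len(rows),
            # Errors per assignment.
            "false_discovery_rate": errors / len(assigned) if assigned else 0.0,
        }
    return summary
```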

4.4 Results

Following curation and extraction, we obtained 199,206 reference amplicon sequences belonging to 51,416 arthropod species. Over 90 percent of the references were between 142 and 149 bp in length, with a minimum reference sequence length of 94 bp. For the final database creation and classifier training procedure, many reference amplicons were shorter than the 157 bp region of interest due to the incompleteness of some reference sequences and the trimming of taxonomically uninformative ends during Metaxa2 training. Prior to this step, 82 percent of the sequences were between 150 and 157 bp in length following the original extraction and these longer sequences can be found at the GitHub repository associated with this work. The taxonomic representativeness of the database across different arthropod classes and insect orders, including the number of families, genera and species in each, is presented in Tables 4.2 and 4.3.

Analyzing the number of sequences per species in the input reference sequence data, we observed a heavily right-skewed distribution, with a median of 2 and a mean of 11.1 sequences per species (Figure 4.1A). Further, 32.0 percent of species were represented by 5 or more sequences and 40 species, including Bemisia tabaci and Delia platura, were represented by between 1,000 and 9,736 entries. After conducting amplicon sequence extraction using the Metaxa2 database builder tool, we were able to extract the COI region of interest from 80.8 percent of the input sequences, which corresponded to 73.4 percent of the genus-level diversity found in the original input data. Following sequence dereplication, removal of sequences with three or more ambiguous base calls and taxonomic lineage curation, our final database contained approximately 13 percent of the input extracted sequences, which represented 98.2 percent of the genus-level richness of the input extracted reference amplicon sequences (Figure 4.1B and 4.1C).

Regressing classification outcome against the Metaxa2 reliability score yielded a similar best fit model for both the 80 bp and full length test sequence datasets (Figure 4.2). For both regressions, the probability of sequence mis-assignment was below 10 percent for reliability scores above 70. Thus, for our evaluations, we chose a reliability score of 68, which corresponded to family-level error probabilities of approximately 11.3 percent and 9.5 percent for 80 bp and full length sequences, respectively.

In evaluating the performance of our classification database when analyzed with Metaxa2 at a reliability score cutoff of 68, we found consistently low proportions of false positives across all ranks, though the proportions of true positives, true negatives and false negatives varied more considerably from kingdom to species (Figure 4.3A). Further, while the Metaxa2 false discovery rate increased with higher resolution ranks (Figure 4.3B), it was generally low, never exceeding 5 percent at the genus level. Interestingly, Metaxa2 displayed low variance in the proportion of sequences assigned when classifying 80 bp sequences relative to full length sequences of 147 bp in median length. Overall, the proportion of sequences assigned was greater than 90 percent through the order level for both full length and half length sequences. Beyond the order level, this statistic decreased to 53 and 56 percent at the species level for half length and full length sequences, respectively. Conversely, the proportion of false positives varied more strongly by sequence length and was greatest at higher-resolution taxonomic levels. At the species level, 1.97 percent of 80 bp sequences were misclassified, compared to only 1.13 percent for full length sequences. At the order level, the percent of sequences misclassified was 0.59 and 0.65 for 80 bp and full length sequences, respectively. As measured in errors per assignment, the classification false discovery rate was similarly highest at the species level, with 7.3 and 6.3 percent of assignments being incorrect for 80 bp and full length sequences, respectively. False discovery rates decreased to 1.2 and 0.7 percent of assignments being incorrect at the order level for 80 bp and full length sequences, respectively.

During our evaluation of taxonomic overclassification, we found between 3,141 and 3,202 sequence test cases belonging to species not represented in the corresponding training data and between 612 and 630 sequence test cases belonging to genera not represented in the corresponding training data across the three iterations of training and testing data. At the species level, the proportion of these cases which were overclassified was roughly equal for full length and half length sequences, with 5.4 and 5.2 percent being overclassified, respectively (Figure 4.4A). With respect to genus level classification performance on these species overclassification test cases, approximately 31 and 37 percent of test cases were classified correctly as true positives or true negatives for half length and full length sequences, respectively. Genus level false positive proportions for these sequence cases were 6.2 and 5.1 percent (Figure 4.4B). For the genus level overclassification cases, the difference in overclassification rates by sequence length was slightly larger, with approximately 8.9 and 10.9 percent of full length and half length test cases being overclassified, respectively (Figure 4.4C). Family level performance on genus level overclassification cases was slightly lower, with the proportion of true positive and true negative identifications summing to 25 and 27 percent for half length and full length sequences, with corresponding false positive proportions of 8.8 and 8.0 percent, respectively.

Evaluating classification performance for each order resulted in unsurprising outcomes across taxonomic ranks, from family to species (Figure 4.5). Generally, the proportion of sequences assigned decreased and false discovery rate increased with increasing taxonomic resolution. Between orders, there was noteworthy variation in performance. For example, the proportion of sequences assigned was highest among trichopteran sequences and lowest among lepidopteran sequences, with approximately 94 and 50 percent of sequences being assigned to the family rank for each order, respectively. Further, while the proportion of sequences assigned was similarly lowest for lepidopteran sequences at the genus and species levels, wherein 39 and 21 percent of sequences were assigned for each respective rank, the highest proportion of sequences assigned at these ranks was not observed with trichopteran sequences. Instead, genus and species level proportion assigned was highest among Sessilia sequences, with 86 and 78 percent of sequences being assigned to each respective rank.

4.5 Discussion

While species-specific PCR and immunohistochemistry-based methods have been useful in documenting arthropod food webs (Stuart and Greenstone 1990; Symondson 2002; Weber et al. 2006; Blubaugh et al. 2016), the narrow species-by-species nature of such approaches has limited their utility for answering large-scale or open-ended ecological questions. With the increasing availability of high-throughput sequencing, arthropod metabarcoding will continue to become more broadly applicable to scientific questions spanning a diversity of research areas. The development of improved methods for drawing maximal inferences from sequence data is an important area for further methodological research. In creating a highly curated COI reference amplicon sequence database and evaluating its performance when used with the Metaxa2 taxonomic classifier, we have developed a new method to aid researchers in the analysis of arthropod metabarcoding data.

Though predictions vary greatly, researchers have estimated the species richness of arthropods to be between 2.5 and 3.7 million (Hamilton et al. 2010). Further, according to the literature review of Porter et al. (2014), 72,618 insect genera have been described to date. Thus, the 51,416 species and 17,039 genera represented in our database account for only a small fraction of arthropod biodiversity. The limited representativeness of currently available, high quality reference sequence amplicons for the COI region highlights the need for continued efforts to catalogue arthropod biodiversity with molecular techniques. Despite this current limitation, the combination of molecular gut content analysis with high-throughput sequencing is a promising path toward investigating arthropod trophic ecology and biodiversity monitoring with greater sensitivity and accuracy relative to alternate approaches.

The results of our inventory of sequences per species and genus level richness at various stages in the database curation process revealed that our amplicon extraction procedure was highly sensitive, extracting and trimming approximately 81 percent of the input sequences down to the 157 bp region of interest. Further, approximately 87 percent of these extracted sequences represented sequence redundancies and were removed during dereplication. As mentioned previously, the trimming of sequence residues adjacent to the barcode of interest and removal of redundant sequences not only makes computational analysis less resource intensive, it can also improve classification performance. For k-mer style classifiers, extraneous sequence residues can bias model selection during classifier training, while abundant sequence duplicates can result in an overwhelming number of identical top hit alignments for alignment-based classifiers.


Overall, the best fit local regression models summarizing the relationship between the Metaxa2 reliability score and the probability of classification error were useful in that the likelihood of misclassification was always less than what would be expected based on the reliability score. For example, a reliability score of 90 corresponded to only a 3.3 percent probability of family-level misclassification for full length sequences. We selected a reliability score of 68 for subsequent analysis as this provided a balanced trade-off between sensitivity and accuracy. Using this reliability score we observed minimal false positive rates and overall proportions of misclassification when comparing our results to those of similar studies (Porter et al. 2014; Bengtsson-Palme et al. 2015; Edgar 2016; Richardson et al. 2017). Given that the family level probability of error was only 9.5 percent at a reliability score of 68, a lower reliability score threshold may be justifiable for certain research situations. However, further testing should be conducted to ensure that the relationship between reliability score and classification confidence is similar across taxonomic ranks and between different DNA barcodes.

With respect to sensitivity using a reliability score of 68, our results were highly dependent upon the rank being analyzed, with sensitivities, as measured by the total proportion of sequences classified, above 60 percent only being achieved at the family and order ranks. To some degree, these sensitivity estimates reflect the large degree of database incompleteness at the genus and species ranks, wherein approximately 3.6 and 25.5 percent of unclassified sequences were true negatives. However, to our knowledge, no other studies have reported classification sensitivity data for this COI amplicon locus. This makes it difficult to ascertain if Metaxa2, with an average of approximately 44.7 percent of false negative assignments at the genus and species ranks, exhibits relatively low sensitivity or if this locus is limited in discriminatory power. The short length of the 157 bp COI amplicon region relative to other barcoding regions such as the ITS and 18S rRNA regions (Hugerth et al. 2014; Wang et al. 2015) could be a cause of such limited discriminatory power.

As expected, analyzing cases of overclassification in our data revealed that sequences from taxa lacking representation in the database are far more likely to be misclassified relative to sequences from well-represented taxa. This is supported by an approximately 10 percent probability of genus level overclassification for sequences from unrepresented genera relative to a 1 to 2 percent probability, depending on sequence length, for all sequence test cases.

Interestingly, the genus level overclassification rate was approximately double that observed at the species level. This seems counterintuitive but is expected in light of the discussion put forth by Edgar (2018), wherein the percent identity difference from the closest reference sequence match is considered one of the most important predictors of classification performance. For a species-level overclassification case, the closest reference sequence match in the corresponding database is a sequence from a congeneric species in the best case scenario. Since the closest reference sequence match for a genus level overclassification case is a sequence from a confamilial species at best, the average percent identity difference from the closest reference is greater for a genus level overclassification case than for a species level overclassification case. Thus, for overclassification cases in particular, higher levels of error are expected at more inclusive taxonomic ranks.

While the genus level overclassification estimates we observed are not desirable, they are considerably lower than similar estimates for the RDP classifier, which range from 21.3 percent to 67.8 percent depending on the database, locus analyzed and cross-validation approach used (Edgar 2016; Richardson et al. 2017). Further, the observed degree of genus level overclassification using Metaxa2 with our COI database was similar to or less than that of the recently developed SINTAX classifier (Edgar 2016; Edgar 2018). Interestingly, the recent analysis of Edgar (2018) resulted in a similar estimate of the Metaxa2 genus level overclassification rate while revealing a considerably higher overclassification rate for the majority of other classifiers tested. Though this analysis found Metaxa2 to be relatively less sensitive than other classifiers, a weakness of the work was that it only tested Metaxa2 using the default reliability score of 80, while testing multiple confidence thresholds for alternate classifiers, such as SINTAX and RDP. Given that our analysis has revealed the Metaxa2 default reliability score to be too conservative, at least for the COI locus, such results are difficult to interpret. In general, such comparisons across studies should be approached with caution as multiple factors complicate the interpretation of classification performance, such as locus discriminatory power, database completeness and the choice of evaluation metrics used. Ultimately, direct comparisons of classification methods using standardized loci and databases are needed to more rigorously compare performance.

With respect to Metaxa2 classification of full length relative to half length amplicon sequences, we observed surprisingly small differences in performance. Consistently, the proportion of misclassified sequences was greater for half length sequences. When considering error and sensitivity together, the false discovery rate or errors per assignment for full length sequences was consistently less than or equal to that achieved during the classification of half length sequences. Lastly, when considering the relationship between the Metaxa2 reliability score and the probability of classification error at the family level, we noted highly similar local polynomial regression models of error probability for both full length and half length sequences.


4.6 Conclusions

Here, we assembled a highly curated database of arthropod COI reference amplicon sequences, trained a recently developed hierarchical DNA sequence classifier using the database and conducted extensive in silico performance evaluations on the resulting classification pipeline. Overall, we found a high degree of sequence redundancy within the initial, uncurated dataset, highlighting the importance of effective sequence dereplication during the creation of databases designed for metabarcoding analysis. Further, the limited representativeness of the database with respect to arthropod biodiversity indicates that additional sequencing effort is needed to further improve the performance of arthropod metabarcoding techniques. Though the performance evaluations presented in this work were conducted on a large corpus of available biological data, the results are not necessarily directly transferable to all experimental settings. For example, variations in sequence error profiles and taxonomic distributions among datasets are potential confounding factors. Despite these limitations, this work provides researchers with a new resource for arthropod COI sequence analysis and novel data for gauging the strengths and limitations of different approaches to arthropod metabarcoding.

4.7 References

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25, 3389-3402.

Bazinet AL, Cummings MP (2012) A comparative evaluation of sequence classification programs. BMC Bioinformatics, 13, 1-13.

Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH (2015) Metaxa2: Improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Molecular Ecology Resources, 15, 1403-1414.

Bengtsson-Palme J, Richardson RT, Meola M, et al. (2018) Metaxa2 Database Builder: Enabling taxonomic identification from metagenomic or metabarcoding data using any genetic marker. Bioinformatics, bty482.

Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2005) GenBank. Nucleic Acids Research, 33, D34-D38.

Blubaugh CK, Hagler JR, Machtley SA, Kaplan I (2016) Cover crops increase foraging activity of omnivorous predators in seed patches and facilitate weed biological control. Agriculture, Ecosystems and Environment, 231, 264-270.

Corse E, Costedoat C, Chappaz R, Pech N, Martin J-F, Gilles A (2010) A PCR-based method for diet analysis in freshwater organisms using 18S rDNA barcoding on faeces. Molecular Ecology Resources, 10, 96-108.

Eddy SR (2011) Accelerated profile HMM searches. PLoS Computational Biology, 7, e1002195.

Edgar RC (2015) UTAX algorithm. Available at: http://www.drive5.com/usearch/manual/utax_algo.html (accessed 6 January 2016).

Edgar RC (2016) SINTAX: A simple non-Bayesian taxonomy classifier for 16S and ITS sequences. Biorxiv. https://doi.org/10.1101/074161.

Edgar RC (2018) Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences. PeerJ, 6, e4652.

Elbrecht V, Leese F (2017) Validation and development of COI metabarcoding primers for freshwater macroinvertebrate bioassessment. Frontiers in Environmental Science, 5, 11.

Guardiola M, Uriz MJ, Taberlet P, Coissac E, Wangensteen OW, Turon X (2015) Deep-sea, deep-sequencing: Metabarcoding extracellular DNA from sediments of marine canyons. PLoS ONE, 10, e0139633.

Hamilton AJ, Basset Y, Benke KK, Grimbacher PS, Miller SE, Novotny V, Samuelson A, Stork NE, Weiblen GD, Yen JDL (2010) Quantifying uncertainty in estimation of tropical arthropod species richness. The American Naturalist, 175, 90-95.

Hugerth LW, Muller EEL, Hu YOO, Lebrun LAM, Roume H, Lundin D, Wilmes P, Andersson AF (2014) Systematic design of 18S rRNA gene primers for determining eukaryotic diversity in microbial consortia. PLoS ONE, 9, e95567.

Huson DH, Mitra S, Ruscheweyh H-J, Weber N, Schuster SC (2011) Integrative analysis of environmental sequences using MEGAN 4. Genome Research, 21, 1552-1560.

Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30, 772-780.

Keller A, Schleicher T, Schultz J, Müller T, Dandekar T, Wolf M (2009) 5.8S-28S rRNA interaction and HMM-based ITS2 annotation. Gene, 430, 50-57.

Mollot G, Duyck P-F, Lefeuvre P, Lescourret F, Martin J-F, Piry S, Canard A, Tixier P (2014) Cover cropping alters the diet of arthropods in a banana plantation: A metabarcoding approach. PLoS ONE, 9, e93740.

NCBI Resource Coordinators (2018) Database Resources of the National Center for Biotechnology Information. Nucleic Acids Research, 46, D8-D13.

Ohio Supercomputer Center (1987) Citation. Columbus, Ohio, USA. http://osc.edu/ark:/19495/f5s1ph73.

Peabody MA, Van Rossum T, Lo R, Brinkman FSL (2015) Evaluation of shotgun metagenomics sequence classification methods using in silico and in vitro simulated communities. BMC Bioinformatics, 16, 363.

Porter TM, Golding GB (2012) Factors that affect large subunit ribosomal DNA amplicon sequencing studies of fungal communities: Classification method, primer choice, and error. PLoS ONE, 7, e35749.

Porter TM, Gibson JF, Shokralla S, Baird DJ, Golding GB, Hajibabaei M (2014) Rapid and accurate taxonomic classification of insect (class Insecta) cytochrome c oxidase subunit 1 (COI) DNA barcode sequences using a naive Bayesian classifier. Molecular Ecology Resources, 14, 929-942.

R Core Team (2014) R: A language and environment for statistical computing. Vienna, Austria. http://www.R-project.org/

Regier JC, Shultz JW, Zwick A, Hussey A, Ball B, Wetzer R, Martin JW, Cunningham CW (2010) Arthropod relationships revealed by phylogenomic analysis of nuclear protein-coding sequences. Nature, 463, 1079-1083.

Richardson RT, Bengtsson-Palme J, Johnson RM (2017) Evaluating and optimizing the performance of software commonly used for the taxonomic classification of DNA metabarcoding sequence data. Molecular Ecology Resources, 17, 760-769.

Richardson RT, Lin C-H, Quijia JQ, Riusech NS, Goodell K, Johnson RM (2015) Rank-based characterization of pollen assemblages collected by honey bees using a multi-locus metabarcoding approach. Applications in Plant Sciences, 3, 1500043.

Sickel W, Ankenbrand MJ, Grimmer G, Holzschuh A, Härtel S, Lanzen J, Steffan-Dewenter I, Keller A (2015) Increased efficiency in identifying mixed pollen samples by meta-barcoding with a dual-indexing approach. BMC Ecology, 15, 1-9.

Sierwald P (2017) MilliBase. Accessed at http://www.millibase.org on 2017-04-12.

Somervuo P, Koskela S, Pennanen J, Nilsson RH, Ovaskainen O (2016) Unbiased probabilistic taxonomic classification for DNA barcoding. Bioinformatics, 32, 2920-2927.

Somervuo P, Yu DW, Xu CCY, Ji Y, Hultman J, Wirta H, Ovaskainen O (2017) Quantifying uncertainty of taxonomic placement in DNA barcoding and metabarcoding. Methods in Ecology and Evolution, 8, 398-407.

Stuart MK, Greenstone MH (1990) Beyond ELISA: A rapid, sensitive, specific immunodot assay for identification of predator stomach contents. Annals of the Entomological Society of America, 83, 1101-1107.

Symondson WOC (2002) Molecular identification of prey in predator diets. Molecular Ecology, 11, 627-641.

Tang M, Hardman CJ, Ji Y, Meng G, Liu S, Tan M, Yang S, Moss ED, Wang J, Yang C, Bruce C, Nevard T, Potts SG, Zhou X, Yu DW (2015) High-throughput monitoring of wild bee diversity and abundance via mitogenomics. Methods in Ecology and Evolution, 6, 1034-1043.

Wang Q, Garrity GM, Tiedje JM, Cole JR (2007) Naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73, 5261-5267.

Wang X-C, Liu C, Huang L, Bengtsson-Palme J, Chen H, Zhang J-H, Cai D, Li J-Q (2015) ITS1: A DNA barcode better than ITS2 in eukaryotes? Molecular Ecology Resources, 15, 573-586.

Weber DC, Rowley DL, Greenstone MH, Athanas MM (2006) Prey preference and host suitability of the predatory and parasitoid carabid beetle, Lebia grandis, for several species of Leptinotarsa beetles. Journal of Insect Science, 6, 1-14.

Yu DW, Ji Y, Emerson BC, Wang X, Ye C, Yang C, Ding Z (2012) Biodiversity soup: Metabarcoding of arthropods for rapid biodiversity assessment and biomonitoring. Methods in Ecology and Evolution, 3, 613-623.

Zeale MRK, Butlin RK, Barker GLA, Lees DC, Jones G (2011) Taxon-specific PCR for DNA barcoding arthropod prey in bat faeces. Molecular Ecology Resources, 11, 236-244.

83

4.8 Tables

Table 4.1 Summary of taxonomic annotations made for references which had undefined ranks at midpoints in their respective taxonomic lineages.

Undefined Rank | Higher Resolution Assignment | Assignment Made | Authority Used
Order | Family Sphaerotheriidae | Order Sphaerotheriida | MilliBase
Order | Family Zephroniidae | Order Sphaerotheriida |
Order | Family Lepidotrichidae | Order Zygentoma | ITIS
Order | Family Lepismatidae | Order Zygentoma |
Order | Family Nicoletiidae | Order Zygentoma |
Class | Order Pauropoda | Class Myriapoda |
Genus | Genus Pseudocellus | Family Ricinididae |
Genus | Genus Chanbria | Family Eremobatidae |
Genus | Species Tanypodinae spp. | Genus Tanypodinae |
Genus | Species Ennominae spp. | Genus Ennominae |
Genus | Genus Dichelesthiidae | Family Dichelesthiidae |
Genus | Genus Phallocryptus | Family Thamnocephalidae |
Class | Order Symphyla | Class Myriapoda | Regier et al. 2010
Order | Family Peripatidae | Order Onychophora |
Class | Family Peripatidae | Class Onychophora |
Order | Family Peripatopsidae | Order Onychophora |
Class | Family Peripatopsidae | Class Onychophora |
Genus | Genus Lasionectes | Family Speleonectidae | WoRMS
Family | Family Speleonectidae | Order Nectiopoda |
Genus | Genus Prionodiaptomus | Family Diaptomidae |
Family | Family Diaptomidae | Order Calanoida |
Order | Order Calanoida | Class Maxillopoda |

Table 4.2 Summary of taxonomic representation across all arthropod classes and associated sister groups. Numbers may include sub and super groupings.

Order | Number of Families | Number of Genera | Number of Species
Archaeognatha | 2 | 15 | 18
Zygentoma | 3 | 3 | 3
Odonata | 34 | 243 | 488
Ephemeroptera | 26 | 108 | 378
Zoraptera | 1 | 1 | 1
Dermaptera | 4 | 6 | 7
Plecoptera | 15 | 84 | 199
Orthoptera | 30 | 299 | 603
Mantophasmatodea | 1 | 1 | 1
Grylloblattodea | 1 | 1 | 0
Embioptera | 1 | 2 | 2
Phasmatodea | 6 | 20 | 28
Mantodea | 11 | 74 | 76
Blattodea | 8 | 82 | 95
Isoptera | 6 | 91 | 186
Thysanoptera | 3 | 14 | 30
Hemiptera | 101 | 1245 | 2730
Psocoptera | 12 | 17 | 19
Hymenoptera | 74 | 1500 | 4418
Raphidioptera | 2 | 9 | 11
Megaloptera | 2 | 10 | 22
Neuroptera | 14 | 65 | 153
Strepsiptera | 3 | 10 | 36
Coleoptera | 124 | 2287 | 7452
Trichoptera | 43 | 320 | 1269
Lepidoptera | 134 | 6554 | 21626
Siphonaptera | 6 | 14 | 20
Mecoptera | 4 | 5 | 10
Diptera | 118 | 1574 | 5460
Total | 789 | 14654 | 45341

Table 4.3 Summary of insect taxa included in the arthropod COI database following curation. Numbers may include sub and super groupings.

Class | Number of Orders | Number of Families | Number of Genera | Number of Species
Heterotardigrada | 1 | 2 | 2 | 1
Eutardigrada | 1 | 3 | 12 | 20
Onychophora | 1 | 2 | 17 | 42
Pycnogonida | 1 | 10 | 27 | 89
Cephalopoda | 1 | 1 | 1 | 1
Merostomata | 1 | 1 | 3 | 4
Arachnida | 17 | 226 | 740 | 1804
Myriapoda | 2 | 4 | 7 | 9
Chilopoda | 5 | 16 | 53 | 172
Diplopoda | 11 | 33 | 95 | 181
Ostracoda | 2 | 6 | 19 | 40
Branchiopoda | 3 | 25 | 76 | 254
Malacostraca | 13 | 256 | 969 | 2654
Maxillopoda | 11 | 85 | 240 | 568
Cephalocarida | 1 | 1 | 2 | 1
Remipedia | 1 | 2 | 5 | 8
Protura | 1 | 4 | 12 | 13
Diplura | 1 | 5 | 7 | 11
Collembola | 4 | 18 | 98 | 203
Insecta | 29 | 789 | 14654 | 45341
Total | 107 | 1489 | 17039 | 51416

4.9 Figures

Figure 4.1 A percent density histogram of the number of sequences per species (A) shows the distribution of redundancy within the NCBI Nucleotide entries used. The dashed blue line and solid red line indicate the median and mean number of sequences per species, respectively.

Inventories of the number of sequences (A) and genera (B) input into the curation process, following Metaxa2 extraction and following dereplication of redundant sequences and curation of taxonomic lineages.

Figure 4.2 A logistic regression analysis of case-by-case classification accuracy, ‘1’ indicating a false-positive identification and ‘0’ indicating a true-positive identification, regressed against classification reliability score for half length (A) and full length (B) test sequence cases. A best fit local polynomial regression line (solid blue with 95 percent confidence interval) was used to estimate the relationship between reliability score and the probability of mis-classification.

Dashed red lines illustrate the hypothetically ideal 1 to 1 relationship between error probability and the Metaxa2 reliability score, an estimate of classification confidence. Solid black lines highlight the 10 percent error probability.

Figure 4.3 Mean proportion and standard error of true positives (TP), true negatives (TN), false negatives (FN) and false positives (FP) for the classification of all testing sequences, conducted on both the full-length and half-length sequences (A). Mean and standard error of the false discovery rate for both full length and half length sequences, as measured in errors per assignment, during classification of all testing sequences (B).

Figure 4.4 Proportional species level overclassification rate (A) and genus level classification performance (B) for test sequence cases from species not represented in the corresponding training data. Proportional genus level overclassification rate (C), and family level classification performance (D) for test sequence cases from genera not represented in the corresponding training data.

Figure 4.5 Mean and standard error of the proportion of sequences assigned and of the false discovery rate, as measured in errors per assignment, at the family, genus and species levels for sequences belonging to each order. These results represent classification performance by arthropod order for the full-length sequences only; orders with fewer than 100 test sequence cases at any of the ranks analyzed were excluded from analysis.

5.1 Abstract

We explored the pollen foraging behavior of honey bee colonies situated in the corn- and soybean-dominated agroecosystems of central Ohio over a month-long period using both pollen metabarcoding and waggle dance inference of spatial foraging patterns. For molecular pollen analysis, we developed simple and cost-effective laboratory and bioinformatics methods. Targeting four plant barcode loci (ITS2, rbcL, trnL and trnH), we implemented metabarcoding library preparation and dual-indexing protocols designed to minimize amplification biases and index mis-tagging events. We constructed comprehensive, curated reference databases for hierarchical taxonomic classification of metabarcoding data and used these databases to train the Metaxa2 DNA sequence classifier. Comparisons between morphological and molecular palynology provide strong support for the quantitative potential of multi-locus metabarcoding. Results revealed consistent foraging habits between locations and showed clear trends in the phenological progression of honey bee spring foraging in these agricultural areas. Our data suggest that three key taxa, woody Rosaceae such as pome fruits and hawthorns, Salix, and Trifolium, provided the majority of pollen nutrition during the study. Spatially, these foraging patterns were associated with a significant preference for forests and tree lines relative to crop fields and herbaceous land cover.

5.2 Introduction

Understanding the floral resource usage patterns and preferences of pollinators, such as honey bees, remains an important research goal with implications for pollinator health (Di Pasquale et al. 2013; Vaudo et al. 2016). Such questions have typically been investigated using plant-pollinator network analysis and analysis of pollen provisions (Severson and Parry 1981; Memmott 1999). In the case of honey bees, however, waggle dance inference of spatial foraging patterns has emerged as an additional tool for investigating the relative attractiveness of different landscape features as forage and inferring associations between landscape composition and foraging outcomes (Couvillon and Ratnieks 2015). In this study, we combined molecular pollen analysis methods with waggle dance inference to observe the taxonomic composition of honey bee-collected pollen while simultaneously inferring where bees were foraging in the surrounding landscape.

Since the first proof-of-concept articles documenting the applicability of plant metabarcoding to pollen analysis (Valentini et al. 2010; Hawkins et al. 2015; Keller et al. 2015; Kraaijeveld et al. 2015; Richardson et al. 2015), the field of molecular pollen analysis has expanded rapidly (Cornman et al. 2016; McFrederick and Rehan 2016; Smart et al. 2017; Bell et al. 2018). Undoubtedly, high-throughput sequencing exhibits great promise in facilitating future discoveries in the fields of plant-pollinator interaction biology, palynological forensics, food authentication and airborne pollen monitoring (Bell et al. 2016). Despite this promise, important questions remain regarding the selection of appropriate library preparation protocols and bioinformatic analysis methods. Further, the ability to draw quantitative inferences from pollen metabarcoding studies remains unclear, with considerable disagreement between research groups (Keller et al. 2015; Richardson et al. 2015; Bell et al. 2018).

For any researcher wishing to employ pollen metabarcoding, the selection of library preparation protocols is a critical methodological decision. The selection of which loci to target, which universal primers to use for amplification and which library construction methods to implement will ultimately affect the strengths or weaknesses of the study, regardless of the bioinformatic techniques employed after sequencing. With respect to locus and primer choice, a number of studies have documented the taxonomic biases of different primer sets used to amplify the same locus (Deagle et al. 2014; Elbrecht and Leese 2015; Piñol et al. 2015; Krehenwinkel et al. 2017), as well as the biases of individual loci (Cowart et al. 2015; Richardson et al. 2015a; Elbrecht et al. 2016). Similarly, it has recently been demonstrated that the use of ‘barcoded’ or ‘fusion’ primers during the initial amplification of mixed-species samples results in considerable amplification bias and decreased replicability (Berry et al. 2011; O’Donnell et al. 2016). While such primers are used to attach oligonucleotides necessary for indexing and next-generation sequencing, both studies demonstrated that this could be performed with greater replicability and precision by first performing a traditional PCR with no 5’ appendages on the primer before using a second set of fusion primers carrying nucleotide sequences required for sequencing. It is likely that biases resulting from sub-optimal primer and locus selection combined with the use of fusion primers for initial community amplification will result in sequencing data that does not quantitatively represent the diversity of pollen being analyzed.

Given the above issues, the use of multi-locus metabarcoding has been proposed as one approach to improving the quantitative capacity of molecular pollen analysis. Since different loci and primer sets display different biases with respect to the taxonomic scope of detection and quantitative bias, employing multiple markers and analyzing the median or mean of all loci may improve the accuracy of quantitative inferences (Richardson et al. 2015a). This approach has the added advantage of enabling researchers to exclude taxa identified using only one locus and focus on consensus taxa identified by multiple markers, increasing the confidence of detections.

Following sequencing, the bioinformatic characterization of the resulting data is another important consideration. Arguably, metabarcoding studies should include rigorous tests of DNA sequence classification methods to enable reviewers and readers to properly gauge the plausibility of research findings (Edgar 2018). This requires tests of both the accuracy and sensitivity of bioinformatics methods and leads researchers toward methods that can be benchmarked against alternative approaches. While this requires extra effort, it enables researchers to be more objective in selecting classification methods and to rely less on post hoc determination of classification parameters while working through preliminary analyses of their data.

To classify our pollen metabarcoding data, we employed a recently designed DNA sequence classifier, Metaxa2 (Bengtsson-Palme et al. 2015). The Metaxa2 classifier is capable of extracting sequences belonging to a specific locus of interest from multi-locus or metagenomic data using Hidden Markov Models produced by HMMER (Eddy 2011). Since Metaxa2 had not previously been trained on plant barcode loci, we produced curated plant reference databases for each of our loci of interest and performed a cross-validation analysis as in Richardson et al. (2017). Using logistic regression, we characterized the relationship between the Metaxa2 reliability score and the probability of false classification. We then used this regression model to select a classification reliability score threshold optimized for our data and reference databases. Finally, we examined the accuracy and sensitivity of Metaxa2 implemented with our chosen reliability score threshold using previously documented methods (Richardson et al. 2018).

An overarching goal of this work was to develop complementary laboratory and bioinformatics approaches optimized to minimize quantitative biases in a cost-effective manner. Using a three-step PCR approach, we circumvent the issue of using fusion primers in the initial sample amplification, similar to the approach suggested in Berry et al. (2011) and O’Donnell et al. (2016). Further, due to the presence of critical mis-tag events in next-generation sequencing data (Schnell et al. 2015), we performed our experiment using a 50 percent unsaturated Latin Square Design, as described in Esling et al. (2015). To accomplish this efficiently, we utilized the gene annotation capacity of Metaxa2 to minimize the number of dual index pairs required for our study. This allowed us to produce multiple libraries per sample using the same dual index pair and computationally separate sequences from each locus after sequencing. Sequencing multiple loci per sample on the same Illumina flow cell has the added advantage of increasing sequence diversity during initial base calling, decreasing the amount of PhiX required in the final amplicon pool and increasing the number of samples which can be analyzed per sequencing run.

While methods will continue to be optimized, especially with respect to locus choice, primer choice and bioinformatic classification, our work represents a simplified and cost-effective approach to pollen metabarcoding which yields quantitatively useful data. Further, we demonstrate the applicability of our methods by applying them to explore the foraging ecology of honey bee (Apis mellifera) colonies situated across four apiaries in the corn and soybean agroecosystems of west-central Ohio.

In conducting a waggle dance analysis study in tandem with our pollen metabarcoding approach, we were able to relate the taxonomic composition of our samples to observed spatial patterns of honey bee foraging, as in Sponsler et al. (2017). Waggle dance inference was useful for determining the relative importance of different landclass types as honey bee foraging locations for our study. However, the relatively high degree of imprecision inherent in this type of analysis made fine-scale interpretation of foraging patterns unfeasible and resulted in low statistical power with respect to inferring differences in foraging preference across landscape classes despite considerable sampling effort.

5.3 Methods

Pollen sampling and waggle dance recording

In early spring of 2015, apiaries were set up at four sites in rural central Ohio. Each apiary consisted of 12 to 18 actively foraging colonies in 8- or 10-frame Langstroth hives. Two of the Langstroth hives were fitted with Sundance I bottom-mounted pollen traps (Ross Rounds, Albany, NY, USA). Pollen was trapped continuously from May 2nd to May 27th. The traps were emptied and samples were collected at three- to five-day intervals. Artificial pollen substitute (Ultra Bee, Mann Lake, Hackensack, MN, USA) was placed in the pollen-trapping hives to mitigate the effects of the resulting pollen nutritional deficit. To video record waggle dancing behavior, one 3-frame observation hive (Bonterra TableView, Addison, ME, USA) was installed at each apiary, sheltered in a plastic storage shed (Suncast #BMS4700, 179x112x132 cm, Batavia, IL, USA). Approximately one hour of video of the bottom frame was recorded using an HD video camera (Canon Vixia HF G20) situated on a 1 m tripod with light provided by a small opening in the door. Recordings were made on 16 days from May 4th to May 29th between the hours of 10:30 and 17:10. In total, video recordings were taken on at least seven sampling dates per apiary throughout the study.

Metabarcoding sample processing

For each sample, 10 percent by mass or up to 20 g of pollen (wet mass) was combined with distilled water to a concentration of 0.1 g/mL of pollen and homogenized using a blender (Hamilton Beach #54225, Southern Pines, NC, USA) for 2.5 minutes. After blending, each sample was gently mixed immediately prior to the collection of 1.4 mL of pollen homogenate into a 2.0 mL bead beater tube (Fisherbrand Free-Standing Microcentrifuge Tubes; Fisher Scientific, Hampton, NH, USA). Bead beater tubes were then centrifuged for 2 minutes at 10,000 g, the supernatant was removed from the pollen pellet and 1.25 mL of buffer AP1 from the Qiagen DNeasy Plant Minikit (QIAGEN, Venlo, The Netherlands) was added along with 3,355 mg of 0.7 mm zirconia beads (Fisher Scientific, Hampton, NH, USA). Pollen was then mechanically disrupted in a bead beater (Mini-BeadBeater-1; BioSpec Products, Bartlesville, OK, USA) for 5 minutes at the highest setting. Samples were then vortexed briefly before 400 µL of lysate was removed for DNA extraction using the Qiagen DNeasy Plant Minikit according to the manufacturer’s instructions.

Following DNA extraction, a 3-step PCR-based protocol, compatible with the Illumina Nextera sequencing protocol, was used for amplicon library preparation. For each sample, rbcL, trnL, trnH and ITS2 libraries were prepared separately using previously published universal primer sets (White et al. 1990; Fay et al. 1997; Sang et al. 1997; Tate and Simpson 2003; Taberlet et al. 2007; Chen et al. 2010). For the initial PCR reaction, universal primers with no 5-prime fusion oligos were used to generate a pool of amplicons. Subsequently, 1 µL of PCR product from the initial reaction was used as template for a second PCR reaction. Lastly, 1 µL of PCR product from the second reaction was used as template for a third PCR reaction. The second and third reactions were used to append template priming, sample indexing and lane hybridization oligonucleotides to each amplicon for downstream compatibility with the Illumina Nextera protocol and MiSeq sequencing. Supplementary Table S1 contains the primer sequences, complete PCR conditions and sample dual-indexing design used in this study. All PCR reactions were conducted at a 20 µL scale with 4 µL High Fidelity Phusion Buffer, 0.2 mM dNTPs and 0.02 U/µL Phusion Polymerase. Initial PCR reactions were conducted with 100 to 150 ng of DNA template. Following library preparation, the final PCR products were purified and normalized using the SequalPrep Normalization Plate Kit (Thermo Fisher Scientific, Waltham, MA, USA), pooled equimolarly and sequenced using the Illumina MiSeq Micro Kit (2 x 150 cycles).

Hierarchical classification database construction and curation

To use Metaxa2 (v2.2 4th beta; Bengtsson-Palme et al. 2015), software originally designed to classify bacterial and fungal sequences, for plant sequence classification, we first had to produce reference databases for each marker of interest. To gather reference data, we downloaded all available trnL, trnH, rbcL and plant whole chloroplast genome sequences from NCBI GenBank on April 20th, 2017. Additionally, we downloaded all available Viridiplantae ITS2 sequences from the ITS2 Database (Ankenbrand et al. 2015) on May 5th, 2017. We then used the NCBI Taxonomy Module (NCBI Resource Coordinators, 2018) along with the Perl scripts described in Sickel et al. (2015) to obtain the seven-ranked Linnaean lineage, from kingdom to species, for each reference entry.

To aid in both the estimation of marker conservation parameters during hierarchical training and the removal of duplicate reference sequences, we extracted the amplicon of interest from the available reference sequences where possible, including from plant whole chloroplast genomes for the rbcL and trnL markers. For this, we first removed exceptionally long or short sequence entries as well as any entries containing three or more consecutive uncalled base pairs from the locus-specific data sets. For the rbcL, trnL and whole chloroplast genome data sets, we then isolated archetypical trnL and rbcL reference sequences for each locus using the primers employed during pollen metabarcoding and used these sequences in combination with the HMM-based Metaxa2 Database Builder tool (v1.0 4th beta; Bengtsson-Palme et al. 2018) to extract the amplicon of interest. However, trnH sequences were too divergent for this approach. Instead, we removed any entries longer than 1,500 bp and retained only the references annotated with ‘trnH’ and ‘psbA’, to remove as many extraneous sequences as possible. While ITS2 is also a highly divergent marker, this level of curation for ITS2 references was unnecessary due to the in silico secondary structure analysis employed during ITS2 Database curation (Keller et al. 2009).
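The initial length and uncalled-base filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the script used in this work: the function names and the length bounds `MIN_LEN` and `MAX_LEN` are hypothetical, since the cutoffs for "exceptionally long or short" entries are not stated here. Only the rule of rejecting three or more consecutive uncalled bases is taken from the text.

```python
import re

# Hypothetical length bounds; the actual cutoffs are not given in the text.
MIN_LEN = 100
MAX_LEN = 2000

# Three or more consecutive uncalled bases (N), matched case-insensitively.
UNCALLED_RUN = re.compile(r"N{3,}", re.IGNORECASE)


def keep_reference(seq, min_len=MIN_LEN, max_len=MAX_LEN):
    """Return True if a sequence passes the length and uncalled-base filters."""
    if not (min_len <= len(seq) <= max_len):
        return False
    return UNCALLED_RUN.search(seq) is None


def filter_references(records, min_len=MIN_LEN, max_len=MAX_LEN):
    """Filter a dict of {accession: sequence}, keeping only passing entries."""
    return {
        acc: seq
        for acc, seq in records.items()
        if keep_reference(seq, min_len, max_len)
    }
```

Keeping the bounds as parameters lets the same filter be reused with locus-specific cutoffs, such as the 1,500 bp upper limit applied to trnH.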

Next, we performed extensive curation of the taxonomic lineage metadata associated with each entry. Using Perl substitution, we removed the undefined ranks from the end of any lineage unidentified at the highest resolution ranks, typically genus and species. Leaving undefined tags in the lineages is problematic for hierarchical classification, as the classifier has no way to distinguish undefined annotations from bona fide taxonomic annotations, resulting in multiple sequences from different taxa receiving the same annotation. To account for lineages which are currently unresolved at intermediate ranks, we developed a Python script which substitutes these undefined intermediate rank annotations with an annotation containing the identity of the lowest resolution rank containing an identification and a ‘urs’ tag which indicates ‘unresolved.’ In this way, we were able to salvage important lineages of plants, such as Magnoliales, Ranunculales and Caryophyllales, while annotating these entries with a tag that distinguishes them from other taxa that are also unresolved at the same rank. For a more detailed description of this approach, see Richardson et al. (2018). Finally, we used Perl substitution to further clean the lineages and remove ranks annotated with artifactual alphanumeric tags or open nomenclature, which were common at the genus, species and family ranks. Reference sequence databases were then dereplicated using the Java script provided with the RDP Naïve Bayesian Classifier (v2.11; Wang et al. 2007).
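The two lineage-curation steps described above, trimming undefined ranks from the high-resolution end and tagging undefined intermediate ranks, might look roughly like the following Python sketch. It is not the script used in this work. The placeholder string `undef`, the `_urs` suffix format and the function name are all hypothetical, and the sketch assumes that a gap borrows the name of the nearest defined rank beneath it, which is only one possible reading of the substitution rule.

```python
UNDEFINED = "undef"  # hypothetical placeholder marking a missing rank
URS_TAG = "_urs"     # hypothetical formatting of the 'urs' (unresolved) tag


def curate_lineage(ranks):
    """Trim trailing undefined ranks and tag undefined intermediate ranks.

    `ranks` is a list of rank annotations ordered from kingdom to species.
    """
    curated = list(ranks)
    # Drop undefined ranks from the high-resolution end of the lineage.
    while curated and curated[-1] == UNDEFINED:
        curated.pop()
    # Walk from the finest remaining rank upward so that each gap borrows
    # the name of the nearest defined rank beneath it (an assumption).
    nearest_defined = None
    for i in range(len(curated) - 1, -1, -1):
        if curated[i] == UNDEFINED:
            curated[i] = nearest_defined + URS_TAG
        else:
            nearest_defined = curated[i]
    return curated
```

Because the tag carries a taxon name, two lineages that are both unresolved at the same rank receive different annotations unless they share the borrowed taxon.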

Following final curation of the reference sequence and taxonomic lineage data, Metaxa2 was trained on each of the four markers using the Metaxa2 Database Builder Tool. For rbcL and trnL, training was performed in default mode and an archetypical sequence was used to designate the precise barcode region of interest. For trnH and ITS2, the divergent mode was used due to the low degree of sequence conservation across these markers.

In addition to training Metaxa2 on the complete reference databases for each marker, we also performed a cross-validation performance evaluation by randomly sampling 10 percent of the sequences for each marker to serve as test sequences, training Metaxa2 with the remaining 90 percent and then classifying the test sequences. In order to make the evaluation conservative, we cropped the test sequences for each marker to 150 bp in length using a custom Python script. To select the most appropriate Metaxa2 reliability score, an estimate of classification confidence, we evaluated the relationship between the Metaxa2 reliability score and classification error probability using local polynomial logistic regression. For this evaluation, we randomly subsampled 1,000 reference sequence classification cases from each locus and regressed the outcome of each family-level classification case, ‘0’ indicating correct classification and ‘1’ indicating misclassification, against the Metaxa2 reliability score of the assignment using the loess function in R (R Core Team 2014). We then estimated the sensitivity and accuracy of the classifier using the methods of Richardson et al. (2017).
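The 90/10 split and cropping step of the cross-validation can be illustrated with a short Python sketch. This is not the custom script referred to above. The function name, the fixed random seed and the choice to crop from the start of each sequence are assumptions made for illustration; the text does not specify which 150 bp region of each test sequence was retained.

```python
import random


def split_and_crop(records, test_fraction=0.10, crop_len=150, seed=0):
    """Hold out a random fraction of references and crop them to crop_len bp.

    `records` maps accession -> sequence. Returns (train, test) dicts.
    Cropping keeps the first `crop_len` bases (an assumption).
    """
    rng = random.Random(seed)
    ids = sorted(records)  # sort for reproducibility across runs
    n_test = round(len(ids) * test_fraction)
    test_ids = set(rng.sample(ids, n_test))
    train = {i: records[i] for i in ids if i not in test_ids}
    test = {i: records[i][:crop_len] for i in ids if i in test_ids}
    return train, test
```

The training set keeps full-length sequences while only the held-out test sequences are shortened, which matches the intent of making the evaluation conservative.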

Pollen metabarcoding bioinformatics and statistics

Given the built-in quality filtering and mate-pair awareness of Metaxa2, we proceeded to classify the sequences of the raw forward and reverse fastq files without prior quality processing for the ITS2, trnH and rbcL libraries. Since amplicons of the trnL marker were short enough for the paired end reads to be merged into a single contiguous sequence, we used PEAR (v0.9.1; Zhang et al. 2014) to merge forward and reverse read pairs and improve base calling accuracy toward the middle of the trnL amplicon. For read pairing, a minimum merged read length of 100 bp was used along with a Phred scale 33 quality threshold of 20. Assembled trnL sequences were then subjected to taxonomic classification using Metaxa2. For all taxonomic classification, Metaxa2 was implemented using default quality filtering and a reliability score threshold of 50 on the Owens cluster of the Ohio Supercomputer Center (Ohio Supercomputer Center, 1987).

Following sequence classification, custom Python scripts were used to summarize the data using the consensus-filtered, median-based approach discussed in Richardson et al. (2015a). Briefly, for a given sample and taxonomic rank, the proportion of sequences belonging to each taxon was calculated for each marker. At this point, the taxa were consensus-filtered by discarding any taxonomic group which was discovered using only one of the four markers. Additionally, taxonomic groups represented by less than 0.01 percent of the data were discarded. The median proportional abundance of each taxonomic group was then calculated. After obtaining the median proportions of each taxon, median values were then normalized to a sum of 1.0 for each sample. The commands and Python scripts used for all analyses presented in this work are available at https://github.com/RTRichar/QuantitativePollenMetabarcoding.
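The consensus-filtered, median-based summary described above can be sketched for a single sample and rank as follows. This is a minimal illustration rather than the code in the repository. The function name and parameters are hypothetical, and two details are assumptions because the text leaves them open: markers that did not detect a taxon contribute a proportion of zero to the median, and the 0.01 percent cutoff is applied to the mean proportion across markers.

```python
from statistics import median


def summarize_sample(props, min_markers=2, min_prop=0.0001):
    """Consensus-filter, take the median across markers, and normalize.

    `props` maps marker -> {taxon: proportion of that marker's sequences}.
    """
    markers = list(props)
    taxa = {taxon for m in markers for taxon in props[m]}
    medians = {}
    for taxon in taxa:
        # Assumption: a marker that did not detect the taxon contributes 0.
        values = [props[m].get(taxon, 0.0) for m in markers]
        # Consensus filter: drop taxa detected by fewer than min_markers.
        if sum(v > 0 for v in values) < min_markers:
            continue
        # Assumption: rarity cutoff applied to the mean across markers.
        if sum(values) / len(values) < min_prop:
            continue
        medians[taxon] = median(values)
    total = sum(medians.values())
    if total == 0:
        return {}
    # Normalize the retained medians to sum to 1.0 for the sample.
    return {taxon: value / total for taxon, value in medians.items()}
```

The final normalization is needed because medians taken independently for each taxon do not generally sum to one, particularly after taxa have been discarded.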

Microscopic palynology and quantitative inference

To explore the utility of metabarcoding data for drawing quantitative inferences of the proportions of different taxa within each sample, we used microscopic palynology as a standard to characterize the components of 12 out of the 32 total pollen samples. We then performed linear regression analysis, regressing the metabarcoding data against microscopic inferences of the abundance of each taxon for each sample analyzed. For these analyses, all regressions were performed using data summarized to the family rank. For microscopic characterizations, we utilized the methods of Richardson et al. (2015b) wherein corbicular pollen pellets were first sorted by color prior to mounting, basic fuchsin staining and microscopic identification. Following the taxonomic characterization of each color fraction, the sum proportion of each taxonomic group was calculated according to the volumetric methods of O’Rourke and Buchmann (1991). The pollen reference collections used for identification were those detailed in Richardson et al. (2015a and b).
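The regression of metabarcoding proportions against microscopy proportions can be illustrated with an ordinary least-squares fit written in plain Python. This sketch is illustrative only: the function name is hypothetical, it treats the microscopy estimate as the predictor and the metabarcoding estimate as the response, and it is not meant to reproduce the statistical software or settings used for the analyses reported here.

```python
def linear_regression(x, y):
    """Ordinary least-squares fit of y on x.

    Returns (slope, intercept, r_squared). Here x would hold microscopy
    proportions and y the matching metabarcoding proportions, one pair
    per family per sample.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared
```

A slope near one with an intercept near zero would indicate that the sequence-based proportions track the microscopic proportions closely.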

Waggle dance analysis and statistics

Waggle dance analysis was conducted using methods similar to Sponsler et al. (2017). Briefly, each video was subsampled by extracting one-minute segments separated by four-minute intervals. Individual dance vectors (distance and direction) were then estimated according to Couvillon et al. (2012) using ImageJ (Schneider et al. 2012) video analysis with the MTrackJ plugin (Meijering et al. 2012). Using QGIS (v2.18.20; QGIS Development Team 2018), we digitized the landscape within a 2 km radius of each apiary using the USDA-NASS Cropland Data Layer (USDA National Agricultural Statistics Service Cropland Data Layer 2018), OpenLayers aerial imagery (Map data provided by Google; Sourcepole 2018), and manual ground-truthing. Landscape features were classified into three categories: crop field, forest (forest and tree lines) and herbaceous habitat (residential and pasture lands). Dance vectors were then mapped upon the digitized landscape using the Bayesian probabilistic methods of Schürch et al. (2013). For each landcover class, we calculated a preference index, defined as the empirical visitation rate on a given landcover class (i.e. the sum of the foraging probability falling within a given landcover class) divided by the proportional abundance of that landcover class in the total landscape. Conceptually, this is a measure of whether the landcover class visitation rate deviates from what would be expected assuming random foraging across the landscape. After applying a log transformation to this statistic, values above zero indicate preference for the landcover class, while values below zero indicate aversion. For statistical analysis, we applied a one-way ANOVA to our log-transformed preference index values to test whether significant differences in preference existed across the three landcover types. Additionally, we used two-tailed t-tests to test whether the preference index of any of the three landcover types was significantly different from zero. Lastly, since honey bees prioritize floral resource use based on distance from the hive, we used a one-way ANOVA to test for significant differences in mean distance from the apiary across landcover types.
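The log-transformed preference index defined above can be written out as a short Python sketch. The function and argument names are hypothetical, and the use of the natural logarithm is an assumption, since the base of the log transformation is not stated. The inputs are the summed foraging probability falling in each landcover class and the area of each class within the mapped landscape.

```python
import math


def log_preference_index(visit_prob, land_area):
    """Log of (visitation rate / proportional abundance) per landcover class.

    `visit_prob` maps class -> summed foraging probability in that class.
    `land_area` maps class -> area of that class in the mapped landscape.
    """
    total_visits = sum(visit_prob.values())
    total_area = sum(land_area.values())
    index = {}
    for cls, area in land_area.items():
        visitation = visit_prob.get(cls, 0.0) / total_visits
        availability = area / total_area
        if visitation == 0:
            # A class that was never visited has an undefined log ratio;
            # it is reported as negative infinity in this sketch.
            index[cls] = float("-inf")
        else:
            # Natural log is an assumption; any base preserves the sign.
            index[cls] = math.log(visitation / availability)
    return index
```

With this definition, a class visited exactly in proportion to its availability scores zero, so the sign of the index separates preference from aversion.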

Additional supplemental laboratory and analytical details for this work can be found at https://github.com/RTRichar/QuantitativePollenMetabarcoding. This repository contains the primer sequences, PCR conditions and complete sample dual-indexing design used in this study. Further, all command line arguments, Python code and Metaxa2 trained databases used during data processing and analysis are included.

5.4 Results

Construction and evaluation of hierarchical classification databases

Construction and curation of reference sequence databases yielded 21,902, 22,663, 46,488 and 121,168 sequences for rbcL, trnL, trnH and ITS2, respectively. These sequences corresponded to between 16,994 and 65,052 species per database and a total of 86,525 species across all four databases (Table 5.1). With respect to classification performance, local polynomial logistic regression between reference test sequence classification outcome and the “reliability score” calculated by Metaxa2 revealed a non-linear relationship when classifying 150 bp plant reference sequences (Figure 5.1A). The probability of classification error was below 0.1 for reliability scores of 54 or greater. To maximize sensitivity, we chose a slightly more permissive reliability score threshold of 50 for analysis of all four loci. Using this threshold in our accuracy and sensitivity assessments, we found Metaxa2 to exhibit a low degree of error, misidentifying an average of 5.1, 2.0 and 1.2 percent of 150 bp plant reference test sequences at the level of genus, family and order, respectively. Further, we found high degrees of sensitivity in classifying 150 bp plant reference test sequences, with an average genus-level sensitivity of 40.4 percent and family and order-level sensitivities of 89.8 and 94.4 percent, respectively (Figure 5.1B).
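The trade-off between error and sensitivity at a given reliability score threshold can be sketched as follows. This is an illustration only: the test cases are invented, and the outcome definitions (a correct assignment is a true positive, an incorrect assignment is a false positive, an unassigned sequence is a false negative if its taxon is in the reference and a true negative otherwise), as well as the sensitivity and false discovery rate formulas, are conventional assumptions that may differ in detail from those used in this chapter.

```python
from collections import Counter


def evaluate(cases, threshold=50):
    """Tally classification outcomes at one rank for a reliability threshold.

    Each case is (score, predicted, truth, in_reference), where predicted is
    None when the classifier made no assignment at this rank.
    """
    counts = Counter()
    for score, predicted, truth, in_reference in cases:
        assigned = predicted is not None and score >= threshold
        if assigned:
            counts["TP" if predicted == truth else "FP"] += 1
        else:
            counts["FN" if in_reference else "TN"] += 1
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    fdr = fp / (tp + fp) if tp + fp else float("nan")
    return counts, sensitivity, fdr


# Hypothetical genus-level test cases (not data from this study).
cases = [
    (92, "Salix", "Salix", True),
    (71, "Prunus", "Prunus", True),
    (55, "Malus", "Pyrus", True),
    (48, "Rubus", "Rubus", True),
    (30, None, "Cercis", True),
    (20, None, "Acer", False),
    (64, "Trifolium", "Trifolium", True),
    (58, "Quercus", "Fagus", False),
]

counts_50, sens_50, fdr_50 = evaluate(cases, threshold=50)
counts_60, sens_60, fdr_60 = evaluate(cases, threshold=60)
```

With these invented cases, raising the threshold from 50 to 60 removes both false positives but converts one of them into a false negative, lowering sensitivity. This is the kind of trade-off that motivates choosing a threshold near the point where the estimated error probability becomes acceptable.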

Sequencing, demultiplexing and classification performance

After sequencing, we obtained 4,380,260 mate-paired reads. Of these, 3,141,670 mate-pairs were classified as Viridiplantae by Metaxa2 using HMM-based sequence annotation and extraction. Sequencing results produced in this work have been deposited in the NCBI Sequence Read Archive under BioProject PRJNA489437. Following extraction of the sequences from each locus, we obtained a mean and standard error of 25,572 ± 1,416 sequences per sample per locus. An ANOVA followed by a Tukey’s HSD test revealed significant differences in the average number of sequences per sample across the four loci used (ANOVA: P < 0.0001; P < 0.0001 for all pairwise comparisons except ITS2 - trnH, P < 0.05, and ITS2 - trnL, P > 0.05). Overall, the minimum number of Viridiplantae sequences found in a single library was 1,897. Table 5.2 shows the mean, standard error and range of sequences per sample for each locus. Following sequence classification with Metaxa2, we generally achieved a high rate of classification from phylum to family, beyond which steep decreases in sensitivity were observed at the genus and species ranks. However, one marker, ITS2, exhibited relatively high sensitivity at the genus level. This was expected given the increased discriminatory power of ITS2 relative to other plant barcodes (Chen et al. 2010). For sequences belonging to Viridiplantae, Table 5.3 shows the mean proportion of sequences classified and standard error for each marker at each taxonomic rank.

Quantitative median-based multi-locus metabarcoding

With respect to the quantitative validity of metabarcoding data, linear regression modeling revealed extreme variation among loci in the degree to which metabarcoding results were related to the microscopic results (Figure 5.2). Prior to these analyses, the microscopy and molecular datasets were square-root transformed to improve homogeneity of variance, which is negatively affected by the tendency of honey bees to collect small quantities of numerous plant taxa. When using our multi-locus approach, we found the metabarcoding median of each consensus-filtered family to be strongly and significantly related to the microscopy results (P < 0.0001; R² = 0.60). Analyzing individual loci, the results from rbcL and trnL were strongly correlated with the microscopy results (P < 0.0001 and R² > 0.53 for both loci). Further, while the trnH results were significantly correlated with the microscopy results, this relationship was relatively weak (P < 0.0001; R² = 0.31). Lastly, the data from the ITS2 locus were not significantly related to the microscopy results (P > 0.05; R² = -0.001).
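The median-based multi-locus approach and its comparison against microscopy can be sketched as below. All proportions are invented for illustration. The consensus rule used here (keep a family only if at least two loci detect it) is an assumption standing in for the consensus filter, whose exact criterion is defined in the methods rather than in this passage, and the function names are hypothetical.

```python
import math
from statistics import median


def consensus_median(locus_props, min_loci=2):
    """Median proportion per family across loci.

    Families detected by fewer than min_loci loci are dropped (assumed
    consensus rule); a locus that misses a family contributes 0.0.
    """
    families = set()
    for props in locus_props.values():
        families.update(props)
    result = {}
    for family in sorted(families):
        values = [props.get(family, 0.0) for props in locus_props.values()]
        detections = sum(1 for v in values if v > 0)
        if detections >= min_loci:
            result[family] = median(values)
    return result


def r_squared(x, y):
    """Coefficient of determination of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


# Hypothetical family-level proportions for one sample.
locus_props = {
    "ITS2": {"Rosaceae": 0.70, "Salicaceae": 0.10, "Fabaceae": 0.15,
             "Brassicaceae": 0.05},
    "rbcL": {"Rosaceae": 0.40, "Salicaceae": 0.35, "Fabaceae": 0.25},
    "trnL": {"Rosaceae": 0.45, "Salicaceae": 0.30, "Fabaceae": 0.25},
    "trnH": {"Rosaceae": 0.50, "Salicaceae": 0.30, "Fabaceae": 0.20},
}
microscopy = {"Rosaceae": 0.50, "Salicaceae": 0.28, "Fabaceae": 0.22}

medians = consensus_median(locus_props)
shared = sorted(medians)
# Square-root transform both datasets before regression.
x = [math.sqrt(microscopy.get(f, 0.0)) for f in shared]
y = [math.sqrt(medians[f]) for f in shared]
r2 = r_squared(x, y)
```

In this toy sample the ITS2 estimate for Rosaceae is an outlier, but the median across loci is barely affected by it, and the family detected by a single locus is removed by the consensus filter.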

Pollen foraging patterns

Our data indicate that three plant families, Rosaceae, Salicaceae and Fabaceae, comprised the majority of any given sample, accounting for a mean of 68.1 ± 3.0 (SE) percent of pollen abundance across all 32 samples (Figure 5.3). Results from the ITS2 locus, which displays greater resolution at lower taxonomic levels relative to other plant barcoding loci, led us to conclude that these family-level inferences likely represented Prunus, Malus, Rubus, Salix, Trifolium and Cercis.


Waggle dance inferences

In analyzing the log-transformed preference index for each landcover type across all four sites (Figure 5.4), two-tailed t-tests suggested an overall preference for forested areas (mean log-transformed preference index: 0.4614; P = 0.0512) and an aversion to crop fields (mean log-transformed preference index: -0.1384; P = 0.0505). Though the mean preference index for non-crop herbaceous lands was positive, we did not observe a significant preference for this landcover class (mean log-transformed preference index: 0.2732; P = 0.1377). With respect to relative preferences, we found a significant difference in preference across landcover classes (one-way ANOVA: P = 0.01613). Specifically, forest areas were significantly preferred over crop fields (Tukey’s HSD test: P = 0.01444). Further, this effect did not appear to be driven by variation in the average distance of each landcover type from the hive, as we found no significant differences in this measurement (one-way ANOVA: P = 0.8943).
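For readers unfamiliar with the one-way ANOVA comparisons reported above, the F statistic can be computed by hand as in the following sketch. The values are hypothetical log-transformed preference indices, not the study data, and the function name is invented for this example.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups: list of lists, one list of observations per landcover class.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical log-transformed preference indices, one value per site.
crop = [-0.2, -0.1, -0.3, 0.0]
forest = [0.5, 0.4, 0.6, 0.3]
herbaceous = [0.3, 0.1, 0.4, 0.2]

f_stat, df_between, df_within = one_way_anova_f([crop, forest, herbaceous])
```

The P-value follows from the F distribution with `df_between` and `df_within` degrees of freedom. A post hoc procedure such as Tukey's HSD is then needed to identify which pairs of landcover classes differ.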

5.5 Discussion

With our modified library preparation methods, we had three major goals: 1) obtain enough sequences per locus to accurately document the diversity of each sample, 2) obtain an even distribution of sequences per library, and 3) infer the taxonomic composition of our samples in a quantitatively representative manner. Considering past evaluations of the minimum number of analyzed pollen grains (Lau et al. 2018) and the sequencing depth needed to characterize the diversity of a typical sample of honey bee-collected pollen (Keller et al. 2015; Cornman et al. 2015), we are confident that our methods provided sufficient sequencing depth. Across all four loci, the minimum number of high-quality Viridiplantae sequences generated for a sample was 38,486 and only 2 of 128 libraries contained fewer than 2,000 Viridiplantae sequences.


Despite adequate sequencing coverage, it is clear that our methods can be further optimized to yield a more even distribution of sequences per locus for each sample. This was an interesting outcome, considering that we mixed our marker libraries on an equimolar basis before sequencing, and may be explained by variation in amplicon clustering efficiency across loci on the Illumina MiSeq flow cell. Such sequence clustering variation is known to occur on the basis of template length (Illumina Inc. 2014). Given the significant differences in the number of sequences obtained per locus, future studies implementing these methods would benefit from the addition of fewer ITS2, trnL and trnH products and more rbcL products during the final pooling of libraries. Additionally, investing in longer sequencing length would likely improve the taxonomic resolution achieved with the rbcL and trnH markers.
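As a rough illustration of how the final pooling could be rebalanced, the sketch below derives adjusted pooling fractions from the mean sequences per sample reported in Table 5.2. It assumes that read yield for each locus scales linearly with the molar amount pooled and that per-locus clustering efficiency stays constant between runs. Neither assumption was tested here, so the output should be read as a starting point for optimization rather than a validated protocol. The function name is hypothetical.

```python
# Mean Viridiplantae sequences per sample for each locus (Table 5.2),
# obtained after equimolar pooling.
observed = {"ITS2": 33258, "rbcL": 4700, "trnL": 38359, "trnH": 25970}


def pooling_fractions(mean_reads):
    """Pool fractions proportional to the inverse of the observed yield.

    Under the linear-yield assumption, loci pooled in these proportions
    would be expected to return roughly equal numbers of reads.
    """
    inverse = {locus: 1.0 / reads for locus, reads in mean_reads.items()}
    total = sum(inverse.values())
    return {locus: value / total for locus, value in inverse.items()}


fractions = pooling_fractions(observed)
equimolar = 1.0 / len(observed)
for locus, fraction in fractions.items():
    direction = "more" if fraction > equimolar else "less"
    print(f"{locus}: {fraction:.3f} of pool ({direction} than equimolar)")
```

Under these assumptions only rbcL would be pooled above the equimolar share of one quarter, which agrees with the qualitative recommendation above.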

With respect to the quantitative capacities of pollen metabarcoding, numerous conflicting conclusions exist within the literature. While some authors conclude that molecular pollen identification methods can be relatively quantitative if interpreted appropriately (Hawkins et al. 2015; Kraaijeveld et al. 2015; Richardson et al. 2015a), others maintain that pollen metabarcoding data yield poor quality quantitative results (Bell et al. 2017; Bell et al. 2018). Our data indicate that, while all metabarcoding loci have some degree of bias, plastid loci produce data that are more quantitative, at least at the family rank and for the taxonomic groups assessed here. Further, the use of four metabarcoding loci and a median-based approach enables the estimation of pollen type abundance with reasonable quantitative accuracy when compared to microscopic analysis. As discussed in Richardson et al. (2015a), the use of multiple loci along with a consensus-filtered, median or average-based approach exhibits promise in terms of limiting false discoveries while increasing the scope of detectable taxa and increasing the quantitative utility of the resulting data. We contend that estimating the median, as opposed to the mean, is ideal for this approach in order to reduce the influence of statistical outliers.

Alternatively, when considering the results from each locus individually, studies relying on a single marker and primer set to characterize diverse pollen samples almost certainly exhibit deficiencies with respect to taxonomic scope of detection and relative quantification, especially when a ribosomal locus is targeted. Poor results are expected for such loci considering that angiosperms are known to exhibit variations as large as 19-fold and 173-fold in ploidy and ribosomal copy number, respectively (Prokopowich et al. 2003; Murray et al. 2005). While several research groups contend that individual ribosomal loci alone are sufficient for quantitative inference (Keller et al. 2015; Pornon et al. 2016; Smart et al. 2017), a clear consensus of evidence challenges this assumption.
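The scale of bias that copy number variation alone could introduce can be illustrated with a short calculation. The sketch assumes that the expected read share of a taxon is proportional to its pollen share multiplied by its marker copy number per grain, with equal DNA extraction and amplification efficiency across taxa. The two taxa are hypothetical, and the 173-fold difference is the extreme of the range cited above rather than a typical value.

```python
def expected_read_fractions(pollen_fraction, copies_per_grain):
    """Expected read shares when reads scale with template copy number."""
    weighted = {taxon: pollen_fraction[taxon] * copies_per_grain[taxon]
                for taxon in pollen_fraction}
    total = sum(weighted.values())
    return {taxon: value / total for taxon, value in weighted.items()}


# Two hypothetical taxa contributing equal amounts of pollen, with
# relative ribosomal copy numbers differing 173-fold.
pollen = {"taxon_A": 0.5, "taxon_B": 0.5}
copies = {"taxon_A": 1.0, "taxon_B": 173.0}

reads = expected_read_fractions(pollen, copies)
print({taxon: round(share, 4) for taxon, share in reads.items()})
```

Under these assumptions the high-copy taxon would account for 173/174 of the reads, or more than 99 percent, despite contributing half of the pollen.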

While the trnL locus used here appeared most quantitatively useful, the severe limitations of this fragment and primer set make it impractical for a single-locus approach. Even though this short locus, approximately 160 bp in length, was the only locus to be sequenced in its entirety and efficiently mate-paired in this study, it exhibited poor resolution and could rarely be used for identification beyond the family level (23 percent of sequences identified to genus with an estimated false discovery rate of 17.8 percent). While additional sequencing length may result in fewer false discoveries and greater resolution for longer markers like rbcL and the trnL (UAA) fragment (Taberlet et al. 2007), it would not improve the results obtained with this section of trnL. It is also important to note that trnL libraries were more deeply sequenced by a large margin relative to rbcL, which may partially account for the increased quantitative performance of trnL (Smith and Peay 2014). Further, the relatively low proportion of sequences assigned to family for rbcL and ITS2, 78 and 70 percent, may have negatively affected the regression statistics of these markers relative to trnL and trnH, for which 97 and 99 percent of sequences were assigned to family. With respect to trnL and all loci analyzed, it should be considered that the quantitative evaluations presented here, as well as those of Richardson et al. (2015a), are limited to the taxa of early spring honey bee foraging in central Ohio, USA. Considering that different loci and primer sets exhibit variable, taxon-specific biases, trnL may perform differently in terms of quantitative reliability on alternate groups of plant taxa.

Dance analysis indicated a foraging preference for forested areas, particularly relative to crop fields. This is consistent with many studies that have observed a major role of forest and forest edge in provisioning honey bees, particularly in agricultural landscapes (Sande et al. 2009; Odoux et al. 2012; Donkersley et al. 2014; Richardson et al. 2015a; Requier et al. 2015). While previous work has found a negative correlation between forest land cover and honey bee productivity in Ohio (Sponsler and Johnson 2015), this apparent inconsistency could be explained by a positive effect of forest edge within an agricultural matrix and a negative effect of unbroken canopy in a forested matrix. This interpretation is supported by the predominance in our samples of Salix and rosaceous trees, which are characteristically forest edge, forest understory, and forested waterway flora. Importantly, though, considering that honey bees forage for nectar in addition to pollen, we are unable to precisely infer the degree to which observed spatial foraging patterns reflect pollen foraging. Thus, future studies of this nature would benefit from simultaneously conducting honey pollen analysis in addition to corbicular pollen analysis and waggle dance inference. Lastly, the present study was carried out in spring, when the majority of trees flower; if the dance analysis were repeated later in the year, outside the flowering period of major tree species, we would predict an aversion to forested areas and a preference for weedy herbaceous plants (Sponsler et al. 2017).


5.6 References

Ankenbrand MJ, Keller A, Wolf M, Schultz J, Förster F (2015) ITS2 database V: Twice as much. Molecular Biology and Evolution, 32, 3030-3032.

Bell KL, Brosi BJ, de Vere N, Keller A, Richardson RT, Gous A, Burgess KS (2016) Pollen DNA barcoding: Current applications and future prospects. Genome, 59, 1-12.

Bell KL, Fowler J, Burgess KS, Dobbs EK, Gruenewald D, Lawley B, Morozumi C, Brosi BJ (2017) Applying pollen DNA metabarcoding to the study of plant-pollinator interactions. Applications in Plant Sciences, 5, 1600124.

Bell KL, Burgess KS, Botsch JC, Dobbs EK, Read TD, Brosi BJ (2018) Quantitative and qualitative assessment of pollen DNA metabarcoding using constructed species mixtures. Molecular Ecology, https://doi.org/10.1111/mec.14840.

Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH (2015) Metaxa2: improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Molecular Ecology Resources, 15, 1403-1414.

Bengtsson-Palme J, Richardson RT, Meola M, et al. (2018) Metaxa2 Database Builder: Enabling taxonomic identification from metagenomic or metabarcoding data using any genetic marker. Bioinformatics, bty482.

Berry D, Mahfoudh KB, Wagner M, Loy A (2011) Barcoded primers used in multiplex amplicon pyrosequencing bias amplification. Applied and Environmental Microbiology, 77, 7846-7849.

Chen S, Yao H, Han J, Liu C, Song J, et al. (2010) Validation of the ITS2 region as a novel DNA barcode for identifying medicinal plant species. PLoS ONE, 5, e8613.

Cornman RS, Otto CRV, Iwanowicz D, Pettis JS (2015) Taxonomic characterization of honey bee (Apis mellifera) pollen foraging based on non-overlapping paired-end sequencing of nuclear ribosomal loci. PLoS ONE, 10, e0145365.

Couvillon MJ, Ratnieks FLW (2015) Environmental consultancy: dancing bee bioindicators to evaluate landscape “health”. Frontiers in Ecology and Evolution, 3, 44.

Couvillon MJ, Pearce FCR, Harris-Jones EL, et al. (2012) Intra-dance variation among waggle runs and the design of efficient protocols for honey bee dance decoding. Biology Open, 1, 467-472.

Cowart DA, Pinheiro M, Mouchel O, Maguer M, Grall J, Miné J, et al. (2015) Metabarcoding is powerful yet still blind: A comparative analysis of morphological and molecular surveys of seagrass communities. PLoS ONE, 10, e0117562.

Deagle BE, Jarman SN, Coissac E, Pompanon F, Taberlet P (2014) DNA metabarcoding and the cytochrome c oxidase subunit I marker: Not a perfect match. Biology Letters, 10, 20140562.

Di Pasquale G, Salignon M, Le Conte Y, et al. (2013) Influence of pollen nutrition on honey bee health: Do pollen quality and diversity matter? PLoS ONE, 8, e72016.

Donkersley P, Rhodes G, Pickup RW, Jones KC, Wilson K (2014) Honeybee nutrition is linked to landscape composition. Ecology and Evolution, 4, 4195-4206.

Eddy SR (2011) Accelerated profile HMM searches. PLoS Computational Biology, 7, e1002195.

Edgar RC (2018) Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences. PeerJ, 6, e4652.

Elbrecht V, Leese F (2015) Can DNA-based ecosystem assessments quantify species abundance? Testing primer bias and biomass-sequence relationships with an innovative metabarcoding protocol. PLoS ONE, 10, e0130324.

Elbrecht V, Taberlet P, Dejean T, Valentini A, Usseglio-Polatera P, Beisel J, Coissac E, Boyer F, Leese F (2016) Testing the potential of a ribosomal 16S marker for DNA metabarcoding of insects. PeerJ, 4, e1966.

Esling P, Lejzerowicz F, Pawlowski J (2015) Accurate multiplexing and filtering for high-throughput amplicon-sequencing. Nucleic Acids Research, 43, 2513-2524.

Fay MF, Swensen SM, Chase MW (1997) Taxonomic affinities of Medusagyne oppositifolia (Medusagynaceae). Kew Bulletin, 52, 111-120.

Garbuzov M, et al. (2015) Honey bee dance decoding and pollen load analysis show limited foraging on spring-flowering oilseed rape, a potential source of neonicotinoid contamination. Agriculture, Ecosystems and Environment, 203, 62-68.

Hawkins J, de Vere N, Griffith A, Ford CR, Allainguillaume J, Hegarty MJ, Baillie L, Adams-Groom B (2015) Using DNA metabarcoding to identify the floral composition of honey: A new tool for investigating honey bee foraging preferences. PLoS ONE, 10, e0134735.

Illumina Inc. (2014) Nextera library validation and cluster density optimization. Available at: https://www.illumina.com/documents/products/technotes/technote_nextera_library_validation.pdf (accessed 11 August 2017).

Keller A, Schleicher T, Schultz J, Müller T, Dandekar T, Wolf M (2009) 5.8S-28S rRNA interaction and HMM-based ITS2 annotation. Gene, 430, 50-57.

Keller A, Danner N, Grimmer G, Ankenbrand M, Von Der Ohe K, Von Der Ohe W, Rost S, Hartel S, Steffan-Dewenter I (2015) Evaluating multiplexed next generation sequencing as a method in palynology for mixed pollen samples. Plant Biology, 17, 558-566.

Kraaijeveld K, De Weger LA, Garcia MV, Buermans H, Frank J, Hiemstra PS, Den Dunnen JT (2015) Efficient and sensitive identification and quantification of airborne pollen using next generation DNA sequencing. Molecular Ecology Resources, 15, 8-16.

Krehenwinkel H, Wolf M, Lim JY, Rominger AJ, Simison WB, Gillespie RG (2017) Estimating and mitigating amplification bias in qualitative and quantitative arthropod metabarcoding. Scientific Reports, 7, 17668.

Lau P, Bryant V, Rangel J (2018) Determining the minimum number of pollen grains needed for accurate honey bee (Apis mellifera) colony pollen pellet analysis. Palynology, 42, 36-42.

McFrederick QS, Rehan SM (2016) Characterization of pollen and bacterial community composition in brood provisions of a small carpenter bee. Molecular Ecology, 25, 2302-2311.

Meijering E, Dzyubachyk O, Smal I (2012) Chapter nine - Methods for cell and particle tracking. In P. M. Conn (Ed.), Methods in Enzymology, 504, 183-200. Academic Press.


Memmott J (1999) The structure of a plant-pollinator food web. Ecology Letters, 2, 276-280.

NCBI Resource Coordinators (2018) Database resources of the national center for biotechnology information. Nucleic Acids Research, 46, D8-D13.

O’Donnell JL, Kelly RP, Lowell NC, Port JA (2016) Indexed PCR primers induce template-specific bias in large-scale DNA sequencing studies. PLoS ONE, 11, e0148698.

O’Rourke MK, Buchmann SL (1991) Standardized analytical techniques for bee-collected pollen. Environmental Entomology, 20, 507-513.

Odoux J-F, Feuillet D, Aupinel P et al. (2012) Territorial biodiversity and consequences on physico-chemical characteristics of pollen collected by honey bee colonies. Apidologie, 43, 561-575.

Ohio Supercomputer Center (1987) Citation. Columbus, Ohio, USA. http://osc.edu/ark:/19495/f5s1ph73.

Piñol J, Mir G, Gomez-Polo P, Agustí N (2014) Universal and blocking primer mismatches limit the use of high-throughput DNA sequencing for the quantitative metabarcoding of arthropods. Molecular Ecology Resources, 15, 819-830.

Pornon A, Escaravage N, Burrus M, et al. (2016) Using metabarcoding to reveal and quantify plant-pollinator interactions. Scientific Reports, 6, 27282.

Prokopowich CD, Gregory TR, Crease TJ (2003) The correlation between rDNA copy number and genome size in eukaryotes. Genome, 46, 48-50.

QGIS Development Team (2018) QGIS Geographic Information System. Open Source Geospatial Foundation Project. http://qgis.osgeo.org (accessed 11 June 2018).

R Core Team (2014) R: A language and environment for statistical computing. Vienna, Austria. http://www.R-project.org/ (accessed 21 June 2016).

Requier F, Odoux J-F, Tamic T, Moreau N, Henry M, Decourtye A, Bretagnolle V (2015) Honey bee diet in intensive farmland habitats reveals an unexpectedly high flower richness and a major role of weeds. Ecological Applications, 25, 881-890.

Richardson RT, Bengtsson-Palme J, Johnson RM (2017) Evaluating and optimizing the performance of software commonly used for the taxonomic classification of DNA metabarcoding sequence data. Molecular Ecology Resources, 17, 760-769.

Richardson RT, Lin C-H, Quijia JQ, Riusech NS, Goodell K, Johnson RM (2015a) Rank-based characterization of pollen assemblages collected by honey bees using a multi-locus metabarcoding approach. Applications in Plant Sciences, 3, 1500043.

Richardson RT, Lin C-H, Quijia JQ, Sponsler BD, Goodell K, Johnson RM (2015b) Application of ITS2 metabarcoding to determine the provenance of pollen collected by honey bees in a field-crop dominated agroecosystem. Applications in Plant Sciences, 3, 1400066.

Richardson RT, Bengtsson-Palme J, Gardiner MM, Johnson RM (2018) A reference cytochrome c oxidase subunit I database curated for hierarchical classification of arthropod metabarcoding data. PeerJ, 6, e5126.


Sande SO, Crewe RM, Raina SK, Nicolson SW, Gordon I (2009) Proximity to a forest leads to higher honey yield: Another reason to conserve. Biological Conservation, 142, 2703-2709.

Sang T, Crawford DJ, Stuessy TF (1997) Chloroplast DNA phylogeny, reticulate evolution and biogeography of Paeonia (Paeoniaceae). American Journal of Botany, 84, 1120-1136.

Schnell IB, Bohmann K, Gilbert TP (2015) Tag jumps illuminated – reducing sequence-to-sample misidentifications in metabarcoding studies. Molecular Ecology Resources, 15, 1289-1303.

Schürch R, Couvillon MJ, Burns DDR, Tasman K, Waxman D, Ratnieks FLW (2013) Incorporating variability in honey bee waggle dance decoding improves the mapping of communicated resource locations. Journal of Comparative Physiology A, 199, 1143-1152.

Severson DW, Parry JE (1981) A chronology of pollen collection by honeybees. Journal of Apicultural Research, 20, 97-103.

Sickel W, Ankenbrand MJ, Grimmer G, Holzschuh A, Härtel S, Lanzen J, Steffan-Dewenter I, Keller A (2015) Increased efficiency in identifying mixed pollen samples by meta-barcoding with a dual-indexing approach. BMC Ecology, 15, 1-9.

Smith DP, Peay KG (2014) Sequence depth, not PCR replication, improves ecological inference from next generation DNA sequencing. PLoS ONE, 9, e90234.

Sourcepole (2018) OpenLayers Plugin. QGIS Python Plugins Repository. https://plugins.qgis.org/plugins/openlayers_plugin/ (accessed 11 June 2018).

Taberlet P, Coissac E, Pompanon F et al. (2007) Power and limitations of the chloroplast trnL (UAA) intron for plant DNA barcoding. Nucleic Acids Research, 35, e14.

Tate JA, Simpson BB (2003) Paraphyly of Tarasa (Malvaceae) and diverse origins of the polyploid species. Systematic Botany, 28, 723-737.

USDA National Agricultural Statistics Service Cropland Data Layer (2018) Published crop-specific data layer. https://nassgeodata.gmu.edu/CropScape/ (accessed 11 June 2018).

Valentini A, Miquel C, Taberlet P (2010) DNA barcoding for honey biodiversity. Diversity, 2, 610-617.

Vaudo AD, Patch HM, Mortensen DA, Tooker JF, Grozinger CM (2016) Macronutrient ratios in pollen shape bumble bee (Bombus impatiens) foraging strategies and floral preferences. Proceedings of the National Academy of Sciences, 113, E4035-4042.

Wang Q, Garrity GM, Tiedje JM, Cole JR (2007) Naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73, 5261-5267.

White TJ, Bruns T, Lee S, Taylor JW (1990) Amplification and direct sequencing of fungal ribosomal RNA genes for phylogenetics. In: Innis MA, Gelfand DH, Sninsky JJ, White TJ (eds) PCR protocols: a guide to methods and applications. Academic Press, New York, pp 315-322.


Zhang J, Kobert K, Flouri T, Stamatakis A (2014) PEAR: A fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics, 30, 614-620.

5.7 Tables

Table 5.1 Summary of plant taxonomic groups represented at each rank in the reference sequence databases. Estimates for intermediate nodes of the taxonomic hierarchy, predominantly class and order, are possibly artificially inflated due to some unresolved lineages from the same taxon being given their own independent annotations for hierarchical classification purposes.

Marker  Phyla  Classes  Orders  Families  Genera  Species
ITS2        2       53     171       612    9524    65052
rbcL        2       63     196       776    7000    16994
trnL        1       30      84       374    5943    18169
trnH        1       42     126       517    5341    26288
Total       2       69     229       879   12952    86525

Table 5.2 Mean, standard error and range of the number of Viridiplantae sequences per sample obtained for each marker.

Marker  Average sequences per sample  Range of sequences per sample
ITS2    33,258 ± 1,777                9,963 – 58,031
rbcL    4,700 ± 400                   1,897 – 10,857
trnL    38,359 ± 1,981                13,496 – 59,420
trnH    25,970 ± 2,087                7,697 – 51,609


Table 5.3 Mean and standard error of proportion of sequences classified to each rank for each marker.

Rank     ITS2         rbcL         trnL         trnH
Kingdom  0.89 ± 0.02  1.00 ± 0.00  1.00 ± 0.00  0.99 ± 0.00
Phylum   0.89 ± 0.02  1.00 ± 0.00  1.00 ± 0.00  0.99 ± 0.00
Class    0.83 ± 0.03  0.96 ± 0.01  0.99 ± 0.00  0.99 ± 0.00
Order    0.75 ± 0.03  0.87 ± 0.01  0.98 ± 0.00  0.99 ± 0.00
Family   0.70 ± 0.03  0.78 ± 0.03  0.97 ± 0.00  0.99 ± 0.00
Genus    0.27 ± 0.03  0.15 ± 0.02  0.23 ± 0.03  0.47 ± 0.04
Species  0.03 ± 0.01  0.03 ± 0.01  0.10 ± 0.02  0.11 ± 0.03


5.8 Figures

Figure 5.1 Cross-validation results of classifier performance evaluation on test reference sequences cropped to 150 bp in length. Local polynomial logistic regression of test case classification outcomes, ‘1’ indicating an incorrect classification and ‘0’ indicating a correct classification, regressed against Metaxa2 reliability score (A). A dashed black line illustrates the hypothetically ideal relationship between error probability and the reliability score. The best fit local polynomial model for the data is shown with a solid red line and a dashed grey line indicates an error probability of 0.1. Mean and standard error of the proportion of true positive (TP), true negative (TN), false negative (FN) and false positive (FP) classifications as well as the classification false discovery rate (FDR) across each taxonomic rank for all four plant markers (B).


Figure 5.2 Metabarcoding results regressed against microscopy results for the metabarcoding median of all loci as well as each locus individually. All proportional results are summarized to the family level and proportions are square-root transformed. Plant families occurring in any sample at greater than 5 percent in the metabarcoding median results are shown with distinct colors and point types and both the molecular and microscopic results were filtered to remove detections of less than 1 percent of the untransformed data.


Figure 5.3 Time series plot of the metabarcoding median estimate of the proportional abundance of each plant family across the four sampling sites. Families occurring at lower than 10 percent abundance are not differentiated.


Figure 5.4 Honey bee spatial foraging patterns from May 2nd to May 8th (A), May 11th to May 19th (B) and May 23rd to May 27th (C) at one of the four sites. These sampling partitions represent the three major foraging periods observed in our data, dominated by Rosaceae, Salicaceae and Fabaceae, respectively. Boxplots of the log-transformed preference index across each of the three landcover types for all sites and all sampling dates (D). In total, 640 dances were analyzed for this work, with the sample size across sites ranging from 124 to 222 dances.


5.9 Acknowledgements

The authors thank J. Bengtsson-Palme, D. Denlinger and K. Goodell for laboratory space and advice; I. Barnes and P. Young for access to bees and apiary sites; N. Douridas for assistance at the Molly Caren Agricultural Center; and A. Sankey, N. Riusech, and M. Blackson for field assistance. This work was supported by a Project Apis m. - Costco Honey Bee Biology Fellowship to RTR, a Pollinator Partnership Corn Dust Research Consortium grant to RMJ and support provided by state and federal funds appropriated to The Ohio State University, Ohio Agricultural Research and Development Center (OHO01277).


Complete List of References Cited

Abarenkov K, Adams RI, Laszlo I et al. (2016) Annotating public fungal ITS sequences from the built environment according to the MIxS-Built Environment standard – a report from a May 23-24, 2016 workshop (Gothenburg, Sweden). MycoKeys, 16, 1.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25, 3389-3402.

Ankenbrand MJ, Keller A, Wolf M, Schultz J, Förster F (2015) ITS2 database V: Twice as much. Molecular Biology and Evolution, 32, 3030-3032.

Azhagiri AK, Maliga P (2007) Exceptional paternal inheritance of plastids in Arabidopsis suggests that low-frequency leakage of plastids via pollen may be universal in plants. The Plant Journal, 52, 817-823.

Baum KA, Rubink WL, Coulson RN, Bryant VM (2011) Diurnal patterns of pollen collection by feral honey bee colonies in southern Texas, USA. Palynology, 35, 85-93.

Bazinet AL, Cummings MP (2012) A comparative evaluation of sequence classification programs. BMC Bioinformatics, 13, 1-13.

Beekman M, Ratnieks FLW (2000) Long-range foraging by the honey-bee, Apis mellifera L. Functional Ecology, 14, 490-496.

Bell KL, Brosi BJ, de Vere N, Keller A, Richardson RT, Gous A, Burgess KS (2016) Pollen DNA barcoding: Current applications and future prospects. Genome, 59, 1-12.

Bell KL, Burgess KS, Botsch JC, Dobbs EK, Read TD, Brosi BJ (2018) Quantitative and qualitative assessment of pollen DNA metabarcoding using constructed species mixtures. Molecular Ecology, https://doi.org/10.1111/mec.14840.

Bell KL, Fowler J, Burgess KS, Dobbs EK, Gruenewald D, Lawley B, Morozumi C, Brosi BJ (2017a) Applying pollen DNA metabarcoding to the study of plant-pollinator interactions. Applications in Plant Sciences, 5, 1600124.

Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH (2015) Metaxa2: improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Molecular Ecology Resources, 15, 1403-1414.

Bengtsson-Palme J, Richardson RT, Meola M, et al. (2018) Metaxa2 Database Builder: Enabling taxonomic identification from metagenomic or metabarcoding data using any genetic marker. Bioinformatics, bty482.


Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2005) GenBank. Nucleic Acids Research, 33, D34-D38.

Berry D, Mahfoudh KB, Wagner M, Loy A (2011) Barcoded primers used in multiplex amplicon pyrosequencing bias amplification. Applied and Environmental Microbiology, 77, 7846-7849.

Blubaugh CK, Hagler JR, Machtley SA, Kaplan I (2016) Cover crops increase foraging activity of omnivorous predators in seed patches and facilitate weed biological control. Agriculture, Ecosystems and Environment, 231, 264-270.

Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics, btu170.

Boyer F, Mercier C, Bonin A, Le Bras Y, Taberlet P, Coissac E (2016) Obitools: A unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16, 176-182.

Brady A, Salzberg SL (2009) Phymm and PhymmBL: Metagenomic phylogenetic classification with interpolated Markov models. Nature Methods, 6, 673-676.

Bromenshenk JJ, Carlson SR, Simpson JC, Thomas JM (1985) Pollution monitoring of Puget Sound with honey bees. Science, 227, 632-634.

Bryant VM, Jones GD (2001) The r-values of honey: pollen coefficients. Palynology, 25, 11-28.

Chen S, Yao H, Han J, Liu C, Song J, Shi L, Zhu Y (2010) Validation of the ITS2 region as a novel DNA barcode for identifying medicinal plant species. PLoS ONE, 5, e8613.

Cornman RS, Otto CRV, Iwanowicz D, Pettis JS (2015) Taxonomic characterization of honey bee (Apis mellifera) pollen foraging based on non-overlapping paired-end sequencing of nuclear ribosomal loci. PLoS ONE, 10, e0145365.

Corse E, Costedoat C, Chappaz R, Pech N, Martin J-F, Gilles A (2010) A PCR-based method for diet analysis in freshwater organisms using 18S rDNA barcoding on faeces. Molecular Ecology Resources, 10, 96-108.

Couvillon MJ, Pearce FCR, Harris-Jones EL, et al. (2012) Intra-dance variation among waggle runs and the design of efficient protocols for honey bee dance decoding. Biology Open, 1, 467-472.

Couvillon MJ, Ratnieks FLW (2015) Environmental consultancy: dancing bee bioindicators to evaluate landscape “health”. Frontiers in Ecology and Evolution, 3, 44.

Cowart DA, Pinheiro M, Mouchel O, Maguer M, Grall J, Miné J, et al. (2015) Metabarcoding is powerful yet still blind: A comparative analysis of morphological and molecular surveys of seagrass communities. PLoS ONE, 10, e0117562.

Craine JM, Towne EG, Miller M, Fierer N (2015) Climatic warming and the future of bison as grazers. Scientific Reports, 5, 16738.

Cuenoud P, Savolainen V, Chatrou LW, Powell M, Grayer RJ, Chase MW (2001) Molecular phylogenetics of Caryophyllales based on nuclear 18S rDNA and plastid rbcL, atpB, and matK DNA sequences. American Journal of Botany, 89, 132-144.

123

Cusser S, Goodell K (2013) Diversity and distribution of floral resources influence the restoration of plant–pollinator networks on a reclaimed strip mine. Restoration Ecology, 21, 713-721. Deagle BE, Jarman SN, Coissac E, Pompanon F, Taberlet P (2014) DNA metabarcoding and the cytochrome c oxidase subunit I marker: Not a perfect match. Biology Letters, 10, 20140562. Di Pasquale G, Salignon M, Le Conte Y, et al. (2013) Influence of pollen nutrition on honey bee health: Do pollen quality and diversity matter? PLoS ONE, 8, e72016. Dimou M, Goras G, Thrasyvoulou A (2007) Pollen analysis as a means to determine the geographical origin of royal jelly. Grana, 46, 118-122. Dimou MG, Thrasyvoulou A (2007) A comparison of three methods for assessing the relative abundance of pollen resources collected by honey bee colonies. Journal of Apicultural Research and Bee World, 46, 144-148. Donkersley P, Rhodes G, Pickup RW, Jones KC, Wilson K (2014) Honeybee nutrition is linked to landscape composition. Ecology and evolution, 4, 4195-4206. Eddy SR (2011) Accelerated profile HMM searches. PLoS Computational Biology, 7, e1002195. Edgar RC (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460-2461. Edgar RC (2015a) Taxonomy overclassification and underclassification errors. Accessed 06-01- 2016. www.drive5.com/usearch/manual/tax_overclass.html Edgar RC (2015b) Validating taxonomy classifiers. Accessed 06-01-2016. www.drive5.com/usearch/manual/taxonomy_validation.html Edgar RC (2015c) UTAX algorithm. Accessed 06-01-2016. http://www.drive5.com/usearch/manual/utax_algo.html Edgar RC (2016) SINTAX: A simple non-Bayesian taxonomy classifier for 16S and ITS sequences. Biorxiv. https://doi.org/10.1101/074161. Edgar RC (2018) Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences. PeerJ, 6, e4652. Elbrecht V and Leese F (2017) Validation and development of COI metabarcoding primers for freshwater macroinvertebrate bioassessment. 
Frontiers in Environmental Science, 5, 11. Elbrecht V, Leese F (2015) Can DNA-based ecosystem assessments quantify species abundance? Testing primer bias and biomass—sequence relationships with an innovative metabarcoding protocol. PLoS ONE, 10, e0130324. Elbrecht V, Taberlet P, Dejean T, Valentini A, Usseglio-Polatera P, Beisel J, Coissac E, Boyer F, Leese F (2016) Testing the potential of a ribosomal 16S marker for DNA metabarcoding of insects. PeerJ, 4, e1966. Erdtman G (1943) An introduction to pollen analysis. Chronica Botanica Company, Waltham, Massachusetts, USA.

124

Esling P, Lejzerowicz F, Pawlowski J (2015) Accurate multiplexing and filtering for high- throughput amplicon-sequencing. Nucleic Acids Research, 43, 2513-2524. Fay MF, Swenson SM, Chase MW (1997) Taxonomic affinities of Medusagyne oppositifolia (Medusagynaceae). Kew Bulletin, 52, 111-120. Ficetola GF, Taberlet P, Coissac E (2016) How to limit false positives in environmental DNA and metabarcoding? Molecular Ecology Resources, 16, 604-607. Forcone A, Aloisi PV, Ruppel S, Muñoz M (2011) Botanical composition and protein content of pollen collected by Apis mellifera L. in the north-west of Santa Cruz (Argentinean Patagonia). Grana, 50, 30-39. Galimberti A, De Mattia F, Bruni I, Scaccabarozzi D, Sandionigi A, Barbuto M, Casiraghi M, Labra M (2014) A DNA barcoding approach to characterize pollen collected by honeybees. PLoS ONE, 9, e109363. Garbuzov M, et al. (2015) Honey bee dance decoding and pollen load analysis show limited foraging on spring-flowering oilseed fape, a potential source of neonicotinoid contamination. Agriculture, Ecosystems and Environment, 203, 62-68. Girard M, Chagnon M, Fournier V (2012) Pollen diversity collected by honey bees in the vicinity of Vaccinium spp. crops and its importance for colony development. Botany, 90, 545-555. Guardiola M, Uriz MJ, Taberlet P, Coissac E, Wangensteen OW, Turon X (2015) Deep-sea, deep-sequencing: Metabarcoding extracellular DNA from sediments of marine canyons. PLoS ONE, 10, e0139633. Hamilton AJ, Basset Y, Benke KK, Grimbacher PS, Miller SE, Novotny V, Samuelson A, Stork NE, Weiblen GD, Yen JDL (2010) Quantifying uncertainty in estimation of tropical arthropod species richness. The American Naturalist, 175, 90-95. Han J, Zhu Y, Chen X, Liao B, Yao H, Song J, Chen S, Meng F (2013) The short ITS2 sequence serves as an efficient taxonomic sequence tag in comparison with the full-length ITS. Biomed Research International, 741476. 
Hawkins J, de Vere N, Griffith A, Ford CR, Allainguillaume J, Hegarty MJ, Baillie L, Adams- Groom B (2015) Using DNA metabarcoding to identify the floral composition of honey: A new tool for investigating honey bee foraging preferences. PLoS ONE, 10, e0134735. Hebert PDN, Cywinska A, Ball SL, deWaard JR (2003) Biological identifications through DNA barcodes. Proceedings of the Royal Society of London B: Biological Sciences, 270, 313- 321. Hebert PDN, Penton EH, Burns JM, Janzen DH, Hallwachs W (2004) Ten species in one: DNA barcoding reveals cryptic species in the neotropical skipper butterfly Astraptes fulgerator. Proceedings of the National Academy of Sciences, 101, 14812-14817. Hilu K, Liang H 1997. The matK gene: sequence variation and application in plant systematics. American Journal of Botany, 84, 830. Huang Z (2012) Pollen nutrition affects honey bee stress resistance. Terrestrial Arthropod Reviews, 5, 175-189.

125

Hugerth LW, Muller EEL, Hu YOO, Lebrun LAM, Roume H, Lundin D, Wilmes P, Andersson AF (2014) Systematic design of 18S rRNA gene primers for determining eukaryotic diversity in microbial consortia. PLoS ONE, 9, e95567. Huson DH, Mitra S, Ruscheweyh H-J, Weber N, Schuster SC (2011) Integrative analysis of environmental sequences using MEGAN 4. Genome Research, 21, 1552-1560. Illumina Inc. (2014) Nextera library validation and cluster density optimization. Available at: https://www.illumina.com/documents/products/technotes/technote_nextera_library_valid ation.pdf (accessed 11 August 2017). Jones GD, Bryant VM (1992) Melissopalynology in the United States: A review and critique. Palynology, 16, 63-71. Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30, 772- 780. Kearns CA, Inouye DW (1993) Techniques for pollination biologists. University Press of Colorado, Boulder, Colorado, USA. Keller A, Danner N, Grimmer G, Ankenbrand M, von der Ohe K, von der Ohe W, Rost S, Härtel S, Steffan-Dewenter I (2015) Evaluating multiplexed next-generation sequencing as a method in palynology for mixed pollen samples. Plant Biology, 17, 558-556. Keller A, Schleicher T, Schultz J, Müller T, Dandekar T, Wolf M (2009) 5.8S-28S rRNA interaction and HMM-based ITS2 annotation. Gene, 430, 50-57. Kraaijeveld K, de Weger LA, García MV, Buermans H, Frank J, Hiemstra PS, Den Dunnen JT (2015) Efficient and sensitive identification and quantification of airborne pollen using next-generation DNA sequencing. Molecular Ecology Resources, 15, 8-16. Krause L, Diaz NN, Goesmann A, Kelley S, Nattkemper TW, Rohwer F, Edwards RA, Stoye J (2008) Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Research, 36, 2230-2239. Krehenwinkel H, Wolf M, Lim JY, Rominger AJ, Simison WB, Gillespie RG (2017) Using metabarcoding to reveal and quantify plant-pollinator interactions. 
Scientific Reports, 7, 17668. Lahoz-Monfort JJ, Guillera-Arroita G, Tingley R (2016) Statistical approaches to account for false-positive errors in environmental DNA samples. Molecular Ecology Resources, 16, 673-685. Lanzén A, Jørgensen SL, Huson DH, Gorfer M, Grindhaug SH, Jonassen I, Øvreås L, Urich T (2012) CREST – classification resources for environmental sequence tags. PLoS ONE, 7, e49334. Lau P, Bryant V, Rangel J (2018) Determining the minimum number of pollen grains needed for accurate honey bee (Apis mellifera) colony pollen pellet analysis. Palynology, 42, 36-42. Lighthart B, Prier KRS, Bromenshenk JJ (2005) Flying honey bees adsorb airborne viruses. Aerobiologia, 21, 147-149.

126

Lighthart B, Prier KRS, Loper GM, Bromenshenk JJ (2000) Bees scavenge airborne bacteria. Microbial Ecology, 39, 314-321. Long EO, Dawid IB (1980) Repeated genes in eukaryotes. Annual Review of Biochemistry, 49, 727-764. Longhi S, Cristofori A, Gatto P, Cristofolini F, Grando MS, Gottardini E (2009) Biomolecular identification of allergenic pollen: a new perspective for aerobiological monitoring? Annals of Allergy, Asthma and Immunology, 103, 508-514. Louveaux J (1959) Recherches sur la récolte du pollen par les abeilles (Apis mellifica L) (Fin). Annales de L’Abeillle, 2, 13-111. Louveaux J, Maurizio A, Vorwohl G (1978) Methods of melissopalynology. Bee World, 59, 139- 153. McFrederick QS, Rehan SM (2016) Characterization of pollen and bacterial community composition in brood provisions of a small carpenter bee. Molecular Ecology, 25, 2302- 2311. Meijering E, Dzyubachyk O, Smal I (2012). Chapter nine - Methods for cell and particle tracking. In P. M. Conn (Ed.), Methods in Enzymology, 504, 183-200. Academic Press. Memmott J (1999) The structure of a plant-pollinator food web. Ecology Letters, 2, 276-280. Mitra S, Staerk M, Huson DH (2011) Analysis of 16S rRNA environmental sequences using MEGAN. BMC Genomics, 12, S17. Mogensen HL (1996) The hows and whys of cytoplasmic inheritance in seed plants. American Journal of Botany, 83, 383-404. Mollot G, Duyck P-F, Lefeuvre P, Lescourret F, Martin J-F, Piry S, Canard A, Tixier P (2014) Cover cropping alters the diet of arthropods in a banana plantation: A metabarcoding approach. PLoS ONE, 9, e93740. Moore PD, Webb JA, Collinson ME (1991) Pollen analysis, second edition. Blackwell Scientific Publications, Oxford, UK. Moritz RFA, Härtel S, Neumann P (2005) Global invasions of the western honeybee (Apis mellifera) and the consequences for biodiversity. Ecoscience, 12, 289-301. Munch K, Boomsma W, Huelsenbeck JP, Willerslev E, Nielsen R (2008) Statistical assignment of DNA sequences using Bayesian phylogenetics. 
Systematic Biology, 57, 750-757. Naug D (2009) Nutritional stress due to habitat loss may explain recent honeybee colony collapses. Biological Conservation, 142, 2369-2372. NCBI Resource Coordinators (2018) Database Resources of the National Center for Biotechnology Information. Nucleic Acids Research, 46, D8-D13. Nilsson RH, Hyde KD, Pawłowska J et al. (2014) Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity, 67, 11-19. O’Donnell JL, Kelly RP, Lowell NC, Port JA (2016) Indexed PCR primers induce template- specific bias in large-scale DNA sequencing studies. PLoS ONE, 11, e0148698.

127

O’Rourke MK, Buchmann SL (1991) Standardized analytical techniques for bee-collected pollen. Environmental Entomology, 20, 507-513. Odoux J-F, Feuillet D, Aupinel P et al. (2012) Territorial biodiversity and consequences on physico-chemical characteristics of pollen collected by honey bee colonies. Apidologie, 43, 561-575. Ohio Supercomputer Center (1987) Citation. Columbus, Ohio, USA. http://osc.edu/ark:/19495/f5s1ph73. Pang X, Shi L, Song J, Chen X, Chen S (2013) Use of the Potential DNA barcode ITS2 to identify herbal materials. Journal of Natural Medicines, 67, 571-575. Peabody MA, Van Rossum T, Lo R, Brinkman FSL (2015) Evaluation of shotgun metagenomics sequence classification methods using in silico and in vitro simulated communities. BMC Bioinformatics, 16, 363. Piñol J, Mir G, Gomez-Polo P, Agustí N (2014) Universal and blocking primer mismatches limit the use of high-throughput DNA sequencing for the quantitative metabarcoding of arthropods. Molecular Ecology Resources, 15, 819-830. Pornon A, Escaravage N, Burrus M, et al. (2016) Using metabarcoding to reveal and quantify plant-pollinator interactions. Scientific Reports, 6, 27282. Porter TM, Gibson JF, Shokralla S, Baird DJ, Golding GB, Hajibabaei M (2014) Rapid and accurate taxonomic classification of insect (class Insecta) cytochrome c oxidase subunit 1 (COI) DNA barcode sequences using a naive Bayesian classifier. Molecular Ecology Resources 14, 929-942. Porter TM, Golding GB (2012) Factors that affect large subunit ribosomal DNA amplicon sequencing studies of fungal communities: Classification method, primer choice, and error. PLoS ONE, 7, e35749. Price LB, Liu CM, Melendez JH, Frankel YM, Engelthaler D, Aziz M, Bowers J, Rattray R, Ravel J, Kingsley C, Keim PS, Lazarus GS, Zenilman JM (2009) Community analysis of chronic wound bacteria using 16S rRNA gene-based pyrosequencing: Impact of diabetes and antibiotics on chronic wound microbiota. PLoS ONE, 4, e6462. 
Prokopowich CD, Gregory TR, Crease TJ (2003) The correlation between rDNA copy number and genome size in eukaryotes. Genome, 46, 48-50. QGIS Development Team (2018). QGIS Geographic Information System. Open Source Geospatial Foundation Project. http://qgis.osgeo.org (accessed 11 June 2018). Quéméré E, Hibert F, Miquel C, Lhuillier E, Rasolondraibe E, Champeau J, Rabarivola C, Nusbaumer L, Chatelain C, Gautier L, Ranirison P, Crouau-Roy B, Taberlet P, Chikhi L (2013) A DNA metabarcoding study of a primate dietary diversity and plasticity across its entire fragmented range. PLoS ONE, 8, e58971. R Core Team (2014) R: A language and environment for statistical computing. Vienna, Austria. http://www.R-project.org/ (accessed 21 June 2016).

128

Ratnasingham S, Hebert PDN (2007) Bold: The barcode of life data system (http://www.barcodinglife.org). Molecular Ecology Notes, 7, 355-364. Reboud X, Zeyl C (1994) Organelle inheritance in plants. Heredity, 72, 132-140. Regier JC, Shultz JW, Zwick A, Hussey A, Ball B, Wetzer R, Martin JW, Cunningham CW (2010) Arthropod relationships revealed by phylogenomic analysis of nuclear protein- coding sequences. Nature, 463, 1079-1083. Requier F, Odoux J-F, Tamic T, Moreau N, Henry M, Decourtye A, Bretagnolle V (2015) Honey bee diet in intensive farmland habitats reveals an unexpectedly high flower richness and a major role of weeds. Ecological Applications, 25, 881-890. Richardson RT, Bengtsson-Palme J, Gardiner MM, Johnson RM (2018) A reference cytochrome c oxidase subunit I database curated for hierarchical classification of arthropod metabarcoding data. PeerJ, 6, e5126. Richardson RT, Bengtsson-Palme J, Johnson RM (2017) Evaluating and optimizing the performance of software commonly used for the taxonomic classification of DNA metabarcoding sequence data. Molecular Ecology Resources 17, 760-769. Richardson RT, Lin C-H, Quijia JQ, Riusech NS, Goodell K, Johnson RM (2015a) Rank-based characterization of pollen assemblages collected by honey bees using a multi-locus metabarcoding approach. Applications in Plant Sciences, 3, 1500043. Richardson RT, Lin C-H, Quijia JQ, Sponsler BD, Goodell K, Johnson RM (2015b) Application of ITS2 metabarcoding to determine the provenance of pollen collected by honey bees in a field-crop dominated agroecosystem. Applications in Plant Sciences, 3, 1400066. Salazar G, Cornejo-Castillo FM, Borrull E, Diez-Vives C, Lara E, Vaque D, Arrieta JM, Duarte CM, Gasol JM, Acinas SG (2015) Particle-association lifestyle is a phylogenetically conserved trait in bathypelagic prokaryotes. Molecular Ecology, 24, 5692-5706. 
Sande SO, Crewe RM, Raina SK, Nicolson SW, Gordon I (2009) Proximity to a forest leads to higher honey yield: Another reason to conserve. Biological conservation, 142, 2703- 2709. Sang T, Crawford DJ, Stuessy TF (1997) Chloroplast DNA phylogeny, reticulate evolution and biogeography of Paeonia (Paeoniaceae). American Journal of Botany, 84, 1120-1136. Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K et al (2013) Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 41, D8-D20. Schnell IB, Bohmann K, Gilbert TP (2015) Tag jumps illuminated – reducing sequence-to- sample misidentifications in metabarcoding studies. Molecular Ecology Resources, 15, 1289-1303. Schnell IB, Thomsen PF, Wilkinson N, Rasmussen M, Jensen LRD, Willerslev E, Bertelsen MF, Gilbert MTP (2012) Screening mammal biodiversity using DNA from Leeches. Current Biology, 22, 262-263.

129

Schürch R, Couvillon MJ, Burns DDR, Tasman K, Waxman D, Ratnieks FLW (2013) Incorporating variability in honey bee waggle dance decoding improves the mapping of communicated resource locations. Journal of Comparative Physiology A, 199, 1143- 1152. Severson DW, Parry JE (1981) A chronology of pollen collection by honeybees. Journal of Apicultural Research, 20, 97-103. Sickel W, Ankenbrand MJ, Grimmer G, Holzschuh A, Härtel S, Lanzen J, Steffan-Dewenter I, Keller A (2015) Increased efficiency in identifying mixed pollen samples by meta- barcoding with a dual-indexing approach. BMC Ecology, 15, 1-9. Sierwald P (2017) MilliBase. Accessed at http://www.millibase.org on 2017-04-12 Simel EJ, Saidak LR, Tuskan GA (1997) Method of extracting genomic DNA from non- germinated gymnosperm and angiosperm pollen. BioTechniques, 22, 390-392, 394. Smith DP, Peay KG (2014) Sequence depth, not PCR replication, improves ecological inference from next generation DNA sequencing. PLoS ONE, 9, e90234. Soergel D, Neelendu Dey AW, Knight R, Brenner SE (2012) Selection of primers for optimal taxonomic classification of environmental 16S rRNA gene sequences. The ISME Journal, 6, 1440-44. Somervuo P, Koskela S, Pennanen J, Nilsson RH, Ovaskainen O (2016) Unbiased probabilistic taxonomic classification for DNA barcoding. Bioinformatics, 32, 2920-2927. Somervuo P, Yu DW, Xu CCY, Ji Y, Hultman J, Wirta H, Ovaskainen O (2017) Quantifying uncertainty of taxonomic placement in DNA barcoding and metabarcoding. Methods in Ecology and Evolution, 8, 398-407. Sourcepole (2018) OpenLayers Plugin. QGIS Python Plugins Repository. https://plugins.qgis.org/plugins/openlayers_plugin/ (accessed 11 June 2018). Stuart MK, Greenstone MH (1990) Beyond ELISA: A rapid, sensitive, specific immunodot assay for identification of predator stomach contents. Annals of the Entomological Society of America, 83, 1101-1107. Symondson WOC (2002) Molecular identification of prey in predator diets. Molecular Ecology, 11, 627-641. 
Taberlet P, Coissac E, Pompanon F et al. (2007) Power and limitations of the chloroplast trnL (UAA) intron for plant DNA barcoding. Nucleic Acids Research, 35, e14. Taberlet P, Coissac E, Pompanon F, Brochmann C, Willerslev E (2012) Towards next-generation biodiversity assessment using DNA metabarcoding. Molecular Ecology, 21, 2045-2050. Tang LY, Nagata N, Matsushima R, Chen Y, Yoshioka Y, Sakamoto W (2009) Visualization of plastids in pollen grains: Involvement of FtsZ1 in pollen plastid division. Plant Cell Physiology, 50, 904-908. Tang M, Hardman CJ, Ji Y, Meng G, Liu S, Tan M, Yang S, Moss ED, Wang J, Yang C, Bruce C, Nevard T, Potts SG, Zhou X, Yu DW (2015) High-throughput monitoring of wild bee diversity and abundance via mitogenomics. Methods in Ecology and Evolution, 6, 1034- 1043.

130

Tate JA, Simpson BB (2003) Paraphyly of Tarasa (Malvaceae) and diverse origins of the polyploid species. Systematic Botany, 28, 723-737. Thomsen PF, Kielgast J, Iversen LL, Møller PR, Morten Rasmussen M, Willerslev E (2012) Detection of a diverse marine fish fauna using environmental DNA from seawater samples. PLoS ONE, 7, e41732. Tripathi AM, Tyagi A, Kumar A, Singh A, Singh S, Chaudhary LB, Roy S (2013) The internal transcribed spacer (ITS) region and trnhH-psbA are suitable candidate loci for DNA barcoding of tropical tree species of India. PLoS ONE, 8, e57934. USDA National Agricultural Statistics Service Cropland Data Layer (2018) Published crop- specific data layer. https://nassgeodata.gmu.edu/CropScape/ (accessed 11 June 2018). Valentini A, Miquel C, Taberlet P (2010) DNA barcoding for honey biodiversity. Diversity, 2, 610-617. Valentini A, Taberlet P, Miaud C, et al. (2016) Next-generation monitoring of aquatic biodiversity using environmental DNA metabarcoding. Molecular Ecology, 25, 929-942. vanEngelsdorp D, Evans JD, Saegerman C, Mullin C, Haubruge E, Nguyen BK, Frazier M (2009) Colony collapse disorder: a descriptive study. PLoS ONE, 4, e6481. vanEngelsdorp D, Meixner MD (2010) A historical review of managed honey bee populations in Europe and the United States and the factors that may affect them. Journal of Invertebrate Pathology, 103, S80-S95. Vasquez A, Olofsson TC (2009) The lactic acid bacteria involved in the production of bee pollen and bee bread. Journal of Apicultural Research, 48, 189-195. Vaudo AD, Patch HM, Mortensen DA, Tooker JF, Grozinger CM (2016) Macronutrient ratios in pollen shape bumble bee (Bombus impatiens) foraging strategies and floral preferences. Proceedings of the National Academy of Sciences, 113, E4035-4042. Vesterinen EJ, Ruokolainen L, Wahlberg N, Peña C, Roslin T, Laine VN, Vasko V, Sääksjärvi IE, Norrdahl K, Lilley TM (2016) What you need is what you eat? Prey selection by the bat Myotis daubentonii. 
Molecular Ecology, 25, 1581-1594. Wang Q, Garrity GM, Tiedje JM, Cole JR (2007) Naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73, 5261-5267. Wang X-C, Liu C, Huang L, Bengtsson-Palme J, Chen H, Zhang J-H, Cai D, Li J-Q (2015) ITS1: A DNA barcode better than ITS2 in eukaryotes? Molecular Ecology Resources, 15, 573-586. Weber DC, Rowley DL, Greenstone MH, Athanas MM (2006) Prey preference and host suitability of the predatory and parasitoid carabid beetle, Lebia grandis, for several species of Leptinotarsa beetles. Journal of Insect Science, 6, 1-14. White TJ, Bruns T, Lee S, Taylor JW (1990) Amplification and direct sequencing of fungal ribosomal RNA genes for phylogenetics. In: Innis MA, Gelfand DH, Sninsky JJ, White TJ (eds) PCR protocols: a guide to methods and applications. Academic Press, New York, pp 315-322.

131

Wilson EE, Sidhu CS, LeVan KE, Holway DA (2010) Pollen foraging behaviour of solitary Hawaiian bees revealed through molecular pollen analysis. Molecular Ecology, 19, 4823- 4829. Winston ML (1987) The biology of the honey bee. Harvard University Press, Cambridge, Massechusetts, USA. Xing Z, Pei J, Keogh E (2010) A brief survey on sequence classification. ACM SIGKDD Explorations, 12, 40-48. Yao H, Song J, Liu C, Luo K, Han J, Li Y, Pang X, Xu H, Zhu Y, Xiao P, Chen S (2010) Use of ITS2 region as the universal barcode for plants and animals PLoS ONE, 5, e13102. Yu DW, Ji Y, Emerson BC, Wang X, Ye C, Yang C, Ding Z (2012) Biodiversity soup: Metabarcoding of arthropods for rapid biodiversity assessment and biomonitoring. Methods in Ecology and Evolution, 3, 613-623. Zeale MRK, Butlin RK, Barker GLA, Lees DC, Jones G (2011) Taxon-specific PCR for DNA barcoding arthropod prey in bat faeces. Molecular Ecology Resources, 11, 236-244. Zhang J, Kobert K, Flouri T, Stamatakis A (2014) PEAR: A fast and accurate Illumina Paired- End reAd mergeR. Bioinformatics, 30, 614-620.

132

Printed references:
Hebda, R. J., C. C. Chinnappa, and B. M. Smith. 1988. Pollen morphology of the Rosaceae of western Canada. I. Agrimonia to Crataegus. Grana 27: 95–113.
Hebda, R. J., C. C. Chinnappa, and B. M. Smith. 1991. Pollen morphology of the Rosaceae of western Canada. IV. Luetkea, Oemleria, Physocarpus, Prunus. Canadian Journal of Botany 69: 2583–2596.
Hodges, D. 1974. The pollen loads of the honeybee. Bee Research Association, London.
Kapp, R. O., O. K. Davis, and J. E. King. 2000. Pollen and spores, 2nd ed. American Association of Stratigraphic Palynologists.
Lewis, W. H., P. Vinay, and V. E. Zenger. 1983. Airborne and allergenic pollen of North America. The Johns Hopkins University Press, Baltimore, Maryland, USA.
Moore, P. D., J. A. Webb, and M. E. Collinson. 1991. Pollen analysis, 2nd ed. Blackwell Scientific Publications, Oxford, United Kingdom.
Wodehouse, R. P. 1935. Pollen grains: Their structure, identification, and significance in science and medicine. Haffner Publishing Co., New York, New York, USA.

Image databases:
Buchner, R., and M. Weber. 2000 onwards. PalDat—a palynological database: Descriptions, illustrations, identification, and information retrieval. Website http://www.paldat.org/ [accessed 5 December 2014].
Trigg, H., S. Jacobucci, M. Henkel, and J. Steinberg. 2013. Human impacts pollen database, an illustrated key. Andrew Fiske Center for Archaeological Research, University of Massachusetts, Boston, Massachusetts, USA. Website http://www.fiskecenter.umb.edu/Research/Pollen_Database.html [accessed 5 December 2014].
USDA. 2001. Pollen as indicators of source areas and foraging resources. Website http://pollen.usda.gov [accessed 5 December 2014].

Date      Taxon                             Weight (g)  Proportion (%)
April 23  Woody                             14.15       56.84
          Acer L. spp.                      11.23       45.08
          Populus L. spp.                   2.56        10.26
          Rosaceae Juss.                    0.24        0.96
          Cercis canadensis L.              0.11        0.44
          Fraxinus L. spp.                  0.02        0.07
          Quercus L. spp.                   0.007       0.03
          Herbaceous                        10.75       43.15
          Taraxacum officinale F. H. Wigg.  9.67        38.81
          Brassicaceae Burnett              0.84        3.39
          Lamium L. spp.                    0.21        0.85
          Ranunculus L. spp.                0.01        0.04
          Stellaria L. spp.                 0.01        0.04
          Claytonia virginica L.            0.004       0.02
          Other                             0.003       0.01
          Total                             24.90       100

April 29  Woody                             50.67       74.17
          Fraxinus spp.                     34.86       51.03
          Acer spp.                         15.06       22.05
          Rosaceae                          0.51        0.74
          Salix L. spp.                     0.20        0.29
          Populus spp.                      0.04        0.06
          Herbaceous                        17.64       25.81
          Taraxacum officinale              17.12       25.06
          Brassicaceae                      0.39        0.58
          Lamium spp.                       0.12        0.18
          Other                             0.013       0.02
          Total                             68.32       100

May 2     Woody                             12.60       82.46
          Acer spp.                         6.25        41.09
          Rosaceae                          4.11        26.98
          Fraxinus spp.                     2.12        13.94
          Quercus spp.                      0.07        0.46
          Magnolia L. spp.                  0.006       0.04
          Salix spp.                        0.0007      0.005
          Herbaceous                        2.64        17.35
          Taraxacum officinale              1.64        10.75
          Brassicaceae                      0.56        3.7
          Glechoma hederacea L.             0.33        2.15
          Lamium spp.                       0.11        0.75
          Other                             0.02        0.15
          Total                             15.22       100

May 6     Woody                             38.16       94.88
          Rosaceae                          33.76       83.95
          Lonicera spp.                     2.62        6.51
          Fraxinus spp.                     0.62        1.54
          Salix spp.                        0.40        1.00
          Cercis canadensis                 0.30        0.76
          Acer spp.                         0.22        0.55
          Quercus spp.                      0.22        0.54
          Viburnum spp.                     0.01        0.03
          Morus spp.                        0.0001      <0.001
          Herbaceous                        1.90        4.73
          Taraxacum officinale              0.98        2.44
          Brassicaceae                      0.91        2.26
          Glechoma hederacea                0.01        0.03
          Other                             0.16        0.40
          Total                             40.22       100
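
The proportion column in the weight tables above appears to be each entry's weight expressed as a percentage of that date's total. The following minimal Python sketch illustrates that arithmetic under this assumption, using only the April 23 woody and herbaceous weights; the function name `proportions` and the dictionary layout are illustrative choices, not part of the original analysis. Because the printed weights are rounded, the computed percentages may differ from the tabulated ones in the last decimal place.

```python
# Weights (g) of the two April 23 groups, copied from the table above.
april_23_weights_g = {"Woody": 14.15, "Herbaceous": 10.75}


def proportions(weights_g):
    """Return each entry's share of the summed weight, in percent.

    Assumption: proportion (%) = 100 * weight / total weight for the date.
    """
    total = sum(weights_g.values())
    return {taxon: 100.0 * w / total for taxon, w in weights_g.items()}


shares = proportions(april_23_weights_g)
for taxon, pct in shares.items():
    print(f"{taxon}: {pct:.2f}%")
```

The same function could be applied to the individual taxa for a given date, provided the dictionary holds every entry that contributes to that date's total.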

April 23, 2013
Family           No. of reads
Brassicaceae     285,767
Asteraceae       43,511
Sapindaceae      9728
Rosaceae         880
Caryophyllaceae  289
Cannabaceae      189
Ranunculaceae    147
Liliaceae        33
Boraginaceae     28
Moraceae         12
Betulaceae       7
Poaceae          4
Plantaginaceae   4
Fabaceae         3
Oleaceae         2
Limnanthaceae    1
Juglandaceae     1

April 29, 2013
Family           No. of reads
Brassicaceae     151,904
Asteraceae       31,638
Sapindaceae      2347
Plantaginaceae   547
Oleaceae         342
Rosaceae         262
Boraginaceae     148
Liliaceae        88
Caryophyllaceae  3
Moraceae         2
Cannabaceae      1

May 2, 2013
Family           No. of reads
Brassicaceae     50,649
Asteraceae       15,438
Sapindaceae      4698
Caryophyllaceae  187
Boraginaceae     71
Moraceae         44
Rosaceae         26
Oleaceae         4
Ranunculaceae    3
Poaceae          3
Betulaceae       1
Juglandaceae     1

May 6, 2013
Family           No. of reads
Brassicaceae     178,172
Asteraceae       15,018
Moraceae         1730
Boraginaceae     148
Oleaceae         74
Juglandaceae     60
Sapindaceae      34
Caryophyllaceae  28
Rosaceae         27
Poaceae          10
Betulaceae       8
Polemoniaceae    1
Fagaceae         1
Ranunculaceae    1

Sampling location  Latitude (°N)  Longitude (°W)
A                  40.09          83.39
B                  40.05          84.15
C                  39.96          83.43
D                  39.99          83.59
E                  39.91          84.00
F                  39.86          83.66

Family A B C D E F
Adoxaceae 45
Asteraceae 276 435 274 770 122 374
Boraginaceae 86
Brassicaceae 63 96 160 104 136 161
Caprifoliaceae 14 69 14 77 5
Caryophyllaceae 1 1
Cornaceae 6 3 2
Elaeagnaceae 1
Fabaceae 8 4 5 204 24 39
Fagaceae 38 294 255 81 218 138
Hamamelidaceae 3
Hippocastanaceae 171 389 14 25 13
Hyacinthaceae 3
Juglandaceae 8 7 2
Lamiaceae 13 1 6
Liliaceae 2 2
Magnoliaceae 5
Moraceae 347 15 196 47
Oleaceae 408 34 206 28 499 544
Pinaceae 1 1
Poaceae 1 1 1
Polemoniaceae 30
Ranunculaceae 22 2 4
Rosaceae 3578 1674 3943 2412 1062 1906
Salicaceae 358 1956 2 1545 3024 1883
Sapindaceae 35 21 8 41 13 27
Trilliaceae 2
Unknown 98 58 78 117 50 47

Complete ITS2 metabarcoding results.

Family A B C D E F
Aceraceae 10 5
Adoxaceae 665 5 269 7
Apiaceae 1 2 15 1 2
Asteraceae 23,492 49,244 17,845 29,440 18,766 25,603
Betulaceae 13 5 8
Boraginaceae 14,739 7 20 43
Brassicaceae 7816 44 3682 20,595 88 8710
Caryophyllaceae 38 79 4 69 28 27
Cupressaceae 4 2 1
Fabaceae 361 75 38 1205 269 400
Fagaceae 98 1 3
Grossulariaceae 3
Hypericaceae 1
Juglandaceae 29 406 33 16 45 12
Liliaceae 16 3
Limnanthaceae 2 2
Malvaceae 8 2
Melanthiaceae 27 82
Moraceae 186 15,596 95 763 13,710 2527
Oleaceae 1314 2 4 22 46
Papaveraceae 23
Pinaceae 39 556 61 153 94 44
Plantaginaceae 2 1 2
Poaceae 18 4 15 7 10
Polemoniaceae 5 3784 4
Ranunculaceae 1 36 8 1283 9 129
Rosaceae 118 7 149 40 5 1
Salicaceae 2
Aceraceae 6 46 6 4

Complete matK metabarcoding results.

Family A B C D E F
Adoxaceae 3 581 1 61 413 14
Altingiaceae 13 4
Amaranthaceae 1
Asteraceae 59 172 65 64 10 68
Boraginaceae 11
Brassicaceae 2
Caprifoliaceae 15 61 5 2 3
Cornaceae 1 6
Fabaceae 11 1 2
Fagaceae 28 1199 192 24 134 91
Hippocastanaceae 15 84 6 1
Juglandaceae 12
Loasaceae 2 2 1
Moraceae 16 7 2
Oleaceae 2011 223 624 53 1457 1770
Orchidaceae 1
Papaveraceae 16
Platanaceae 8 2 4 1
Rhamnaceae 1
Rosaceae 26,595 18,125 19,683 18,243 12,440 19,218
Salicaceae 2 15 10 13 8
Aceraceae 1
Staphyleaceae 69
Vitaceae 1

Complete rbcL metabarcoding results.

Family A B C D E F
Aceraceae 192 475 4 20
Adoxaceae 4 736 3 1125 85 37
Asteraceae 7106 8576 3460 2274 7464 6279
Betulaceae 1 2 1
Boraginaceae 223
Brassicaceae 56 124 340 262 83 191
Caprifoliaceae 11 45 7 2 3
Caryophyllaceae 1
Cornaceae 1 2 9
Cyperaceae 2 1
Elaeagnaceae 5
Fabaceae 1405 40 3 575 3986 1156
Fagaceae 1857 19,798 5217 8302 1898 4936
Hippocastanaceae 180 469 2 179 14
Hyacinthaceae 32 1
Juglandaceae 3 63 5 5 2
Lamiaceae 4 1
Lauraceae 4
Liliaceae 6 7
Magnoliaceae 2 23 2
Malvaceae 1 1
Melanthiaceae 1019 6 1 658
Menyanthaceae 2 1 1 5 2
Moraceae 5 656 6 359 20 61
Nelumbonaceae 4 2 4 5 1
Oleaceae 2582 347 732 2585 85 2215
Papaveraceae 28
Pinaceae 5 19 6 4
Plantaginaceae 3 2
Polemoniaceae 333
Ranunculaceae 3 82 8
Rosaceae 38,926 18,504 25,397 20,391 25,274 19,027
Rutaceae 1
Salicaceae 895 1257 15 1188 597 556
Aceraceae 5 163
Violaceae 6 3 73 28

Consensus lists for ITS2 metabarcoding.

Family A B C D E F
Adoxaceae 665 5 269 7
Asteraceae 23,492 49,244 17,845 29,440 18,766 25,603
Betulaceae 5
Boraginaceae 14,739
Brassicaceae 7816 44 3682 20,595 88 8710
Caryophyllaceae 79
Fabaceae 361 75 38 1205 269 400
Fagaceae 98 1 3
Juglandaceae 29 406 33 16 12
Liliaceae 16 3
Malvaceae 8
Melanthiaceae 27 82
Moraceae 186 15,596 95 763 13,710 2527
Oleaceae 1314 2 4 22 46
Papaveraceae 23
Pinaceae 39 556 153 44
Plantaginaceae 2
Polemoniaceae 3784
Ranunculaceae 36 9 129
Rosaceae 118 7 149 40 5 1
Salicaceae 2
Aceraceae 6 46 6 4 10 5

Consensus lists for matK metabarcoding.

Family A B C D E F
Adoxaceae 3 581 1 61 413 14
Asteraceae 59 172 65 64 10 68
Boraginaceae 11
Brassicaceae 2
Caprifoliaceae 15 61 5 2 3
Cornaceae 1
Fabaceae 11 1 2
Fagaceae 28 1199 192 24 134 91
Hippocastanaceae 15 84 1
Juglandaceae 12
Moraceae 16 7 2
Oleaceae 2011 223 624 53 1457 1770
Papaveraceae 16
Rosaceae 26,595 18,125 19,683 18,243 12,440 19,218
Salicaceae 2 15 10 13 8
Aceraceae 1

Consensus lists for rbcL metabarcoding.

Family A B C D E F
Adoxaceae 4 736 3 1125 85 37
Asteraceae 7106 8576 3460 2274 7464 6279
Betulaceae 2
Boraginaceae 223
Brassicaceae 56 124 340 262 83 191
Caprifoliaceae 11 45 7 2 3
Caryophyllaceae 1
Cornaceae 2
Fabaceae 1405 40 3 575 3986 1156
Fagaceae 1857 19,798 5217 8302 1898 4936
Hippocastanaceae 180 469 14
Juglandaceae 3 63 5 5 2
Liliaceae 6 7
Malvaceae 1
Melanthiaceae 1019 1
Moraceae 5 656 6 359 20 61
Oleaceae 2582 347 732 2585 85 2215
Pinaceae 5 19 6 4
Plantaginaceae 3
Polemoniaceae 333
Ranunculaceae 3 82 8
Rosaceae 38,926 18,504 25,397 20,391 25,274 19,027
Salicaceae 895 1257 1188 597 556
Aceraceae 192 475 5 163 4 20
