
Using environmental DNA (eDNA) metabarcoding to assess aquatic plant communities

A Thesis Submitted to the Committee on Graduate Studies in Partial Fulfillment of the Requirements for the Degree of Master of Science in the Faculty of Arts and Science

TRENT UNIVERSITY Peterborough, Ontario, Canada © Copyright by Stephanie A. Coghlan 2018 Environmental and Life Sciences M.Sc. Graduate Program September 2018

Abstract

Using eDNA metabarcoding to assess aquatic plant communities

Stephanie A. Coghlan

Environmental DNA (eDNA) metabarcoding targets sequences with interspecific variation that can be amplified using universal primers, allowing simultaneous detection of multiple species from environmental samples. I developed novel primers for three barcodes commonly used to identify plant species, and compared amplification success for aquatic plant DNA against pre-existing primers. Control eDNA samples of 45 plant species showed that species-level identification was highest for novel matK and pre-existing ITS2 primers (42% each); remaining primers each identified between 24% and 33% of species. Novel matK, rbcL, and pre-existing ITS2 primers combined identified 88% of aquatic species. The novel matK primers identified the largest number of species from eDNA collected from the Black River, Ontario; 21 aquatic plant species were identified using all primers. This study showed that eDNA metabarcoding allows for simultaneous detection of aquatic plants, including invasive species and species-at-risk, thereby providing a biodiversity assessment tool with a variety of applications.

Keywords: aquatic plants, biodiversity, environmental DNA (eDNA), invasive species, species-at-risk, metabarcoding, high throughput sequencing, Illumina, bioinformatics


Acknowledgements

First and foremost, I would like to thank my phenomenal co-supervisors Drs. Joanna Freeland and Aaron Shafer for their positive attitudes and helpful guidance; my project and thesis were much improved thanks to your knowledge and edits throughout.

Joanna, thank you for your patience during the project; your words of encouragement and pep-talks kept me motivated while writing this thesis. Aaron, thank you for welcoming me into the Shafer lab, for helping me work through too many bioinformatics roadblocks to count, and for reminding me to take breaks. Thank you to my committee member, Sabine McConnell, for providing bioinformatics expertise and helpful feedback.

Thank you to Melanie Shapiera and Wil Wegman for taking me out in the field with their team to collect water samples on the Black River, and for providing a list of reference plants common to the area. Thank you to Maria O’Sullivan for collecting plants for my project and to Susan Chow for helping with visual plant identifications.

Thank you to my Shafer and Freeland lab mates for listening to many practice talks and helping me troubleshoot with lab and bioinformatics errors. In particular, I would like to thank Martin for helping me get familiar with many lab procedures, and Charise Currier for helping me troubleshoot primer design issues and training me on multiple eDNA protocols. Thank you to Lindsay Bond for helping me get lab work finished on multiple occasions and for providing friendship and laughter in the lab.

I would like to thank my family and friends who motivated me to pursue this degree. My big, wonderful family has supported me through all my years of school and continues to push me to reach my goals. I am fortunate to have the greatest friends, who inspire me daily with their positive outlooks and who are always there when I need a quick escape from schoolwork and a good laugh.

Lastly, thank you Vince for being my rock and helping me through the highs and lows of this project. I am extremely grateful for our shared passion for learning, and for your support and encouragement; I would not have been able to do any of this without you.


Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
List of Appendices
Chapter 1: General Introduction
Chapter 2: Using eDNA metabarcoding to assess aquatic plant communities
    2.1 Introduction
        Environmental DNA (eDNA)
        Barcoding and metabarcoding
        Illumina sequencing & bioinformatics
        Study rationale
    2.2 Methods
        Rationale
        (a) eDNA marker development and in silico analysis of database sequences
        (b) Environmental DNA
        (c) Library preparation
        (d) Bioinformatics
    2.3 Results
        (a) eDNA marker development and in silico analysis of database sequences
        (b) Control and wild sample metabarcoding and bioinformatics
        Single- versus multi-locus approach
    2.4 Discussion
        In silico primer design and validation on single species
        Control sample results
        Positive identifications
        Missing species identifications
        Erroneous identifications
        Metabarcoding limitations
        Wild sample results
        Wild versus control results
        ‘All-or-nothing’ phenomenon
        Conclusion
Chapter 3: General Discussion
References
Appendices
    Appendix A – Primer design
    Appendix B – Sample collection, filtration, and extraction
    Appendix C – Primer testing and optimization, and single-species amplification
    Appendix D – eDNA sample library preparation
    Appendix E – Bioinformatics pipeline information
        1) ecoPrimers and ecoPCR
        2) Quality-checking
        3) Metabarcoding

List of Figures

Figure 2.2.1. Map showing the eDNA and plant collection sites in Ontario. 1) Black River eDNA collection site. 2) Buckhorn Lake plant collection site (Lakehurst, ON). 3) Kawartha Highlands Provincial Park plant collection site (Peterborough County, ON). 4) Brownhill Tract – York Regional Forest plant collection site (East Gwillimbury, ON).

Figure 2.3.1. Phylogenetic tree modified from APG III (2009) representing the percentage of genera found in the control eDNA sample. Intensity of the squares is proportional to the percentage of genera that were correctly identified following the assay. Incorrect identifications arose when sequences were incorrectly assigned to a confamilial. Numbers in parentheses represent the number of species of each taxon that were placed in the control sample.

Figure B.1. Genomic DNA extracted from plant tissue and visualized on an agarose gel to confirm extraction success.

Figure C.1. Image of an agarose gel used to analyze amplification success of orbcLa primers on individual aquatic and terrestrial species, with amplicons expected at 550bp.

Figure C.2. Image of DNA amplified by dgFW-ITSn primers from individual aquatic and terrestrial plant species. The expected amplicon length targeted by these primers was 115-119bp.

Figure D.1. Bioanalyzer results from TCAG (Toronto ON, Canada) representing the size of DNA fragments, and their respective concentrations, observed in the sample submitted for sequencing on the Illumina HiSeq 2500. Target amplicons in this sample ranged from 115-550bp excluding primers.

List of Tables

Table 2.2.1. Target markers, amplicon lengths, primer directions, primer names, primer sequences, and melting temperatures for extended primers. Modifications to primers from other studies are underlined.

Table 2.3.1. Amplification results from all 5 final, optimized primer pairs. Grey squares represent positive amplification, inferred from a band observed at expected amplicon size on agarose gel. Question marks (?) denote possible primer-dimers. The expected amplicon size of oITSn is 115bp-119bp, and the forward and reverse primers are 27bp and 16bp, respectively. Two out of seven terrestrial species produced bands at the expected amplicon length for one or more primer pairs, and 17 out of 18 aquatic or wetland species produced bands at the expected amplicon length for one or more primer pairs.

Table 2.3.2. List of all species placed in the control sample. Black means the target species was successfully detected; dark grey means one or more congeneric species were detected; and light grey means that one or more confamilial species were detected.

Table 2.3.3. List of aquatic plants identified from the Black River eDNA sample using all five primer pairs.

Table A.1. Number of target species that had sequence data for each of the genetic markers, and therefore the number of sequences that were used in the custom databases for novel primer development.

Table A.2. Inventory of all plants included in in silico testing. Species are either from the control sample, or from a list of common plants found in the Kawartha Lakes, Trent River, and/or Rideau River. In the final column, ‘Y’ = sequence was included in the primer design database, ‘N’ = sequence not included in the primer design database, and ‘-’ = sequence data not available from GenBank.

Table B.1. Sampling location and date for all field-collected species, and list of DNA samples from past projects for aquatic and terrestrial plants.

Table B.2. List of all plant species included in the control sample.

Table B.3. Aquatic and terrestrial plant species extractions for single-species amplifications.

Table C.1. Extended primer names, sequences, melting temperatures, and expected amplicon sizes. Bold sequence = original primer sequence from ecoPrimers, red sequence = original primer sequence that was removed and not included in the optimized version.

Table C.2. PCR cycling conditions for all extended primers on individual species.

Table C.3. Amplification results from individual species amplification with extended primers with and without degenerate bases. ‘FW’ means freshwater, as the primers were designed to target freshwater plant species found in southern Ontario, and ‘dg’ is an abbreviation for degenerate. After this stage, the degenerate primers were used to amplify eDNA samples since they amplified more individual species than the non-degenerate versions; primer names were changed to omatK2, orbcL2, and oITSn.

Table C.4. Amplification results from original primers from ecoPrimers with and without degenerate bases compared to the optimized versions. Y = positive PCR amplification, N = no PCR amplification, and grey = not tested.

Table D.1. Primer sequences with overhang required for the adapter ligation in sample library preparation. Underlined sequence represents the overhang and the bold sequence represents the primer.

Table D.2. P5 and P7 sequences containing indexing barcodes that are added to amplicons in the adapter ligation step. Underlined sequence represents the portion of the sequence that overlaps with the overhang sequence in the primers.

Table D.3. Barcodes used for sample identification.

Table D.4. Cycling conditions with overhang adapters – first PCR step for library preparation.

Table D.5. Mismatch count for each forward and reverse primer for species that were not identified by any of the primer pairs. Grey boxes represent cases where sequence data either was not available from GenBank or did not contain the primer sequence.

Table D.6. Number of species identified in varying primer combinations. ‘A’ = aquatic species, ‘T’ = terrestrial species, and ‘C’ = combined aquatic + terrestrial species.

Table D.7. List of species identified in the control sample using the omatK2 primers. Bolded family or genus names represent taxa that were represented in the control sample.

Table D.8. List of species identified in the control sample using the orbcL2 primers. Bolded family or genus names represent taxa that were represented in the control sample.

Table D.9. List of species identified in the control sample using the orbcLa primers. Bolded family or genus names represent taxa that were represented in the control sample.

Table D.10. List of species identified in the control sample using the oITS2 primers. Bolded family or genus names represent taxa that were represented in the control sample.

Table D.11. List of species identified in the control sample using the oITSn primers. Bolded family or genus names represent taxa that were represented in the control sample.

List of Appendices

Appendix A: Primer design

Appendix B: Sample collection, filtration, and extraction

Appendix C: Primer testing and optimization, and single-species amplification

Appendix D: eDNA sample library preparation

Appendix E: Bioinformatics pipeline information

Chapter 1: General Introduction

Aquatic ecosystems are among the most vulnerable to anthropogenic and climatic change (Grimm et al. 1997). Freshwater ecosystems cover only 0.8% of the Earth’s surface yet harbour a disproportionate amount (6%) of the world's species, making them complex natural systems that require a variety of tools to monitor biodiversity and ecosystem functions (Woodward, Perkins, & Brown, 2010). Aquatic plants play a vital role in the ecosystem as they provide shelter, food, oxygen, and nutrients to other inhabitants (Caraco, Cole, Findlay, & Wigand, 2006). Changes in aquatic macrophyte biomass and density, whether positive or negative, tend to occur due to various environmental and human-mediated factors; for example, loss of macrophytes may occur due to shoreline washout, changes in water levels, or changes that render the water body uninhabitable (Chambers, DeWreede, Irlandi, & Vandermeulen, 1999).

Changes in climate or environmental conditions can lead to optimal conditions for algal species, which can reduce light penetration to submerged plant species and change the balance of nutrients such as phosphorus and nitrogen (Chambers et al., 1999). Poor nutrient levels in freshwater bodies lead to competitive interactions for nutrients and ultimately do not allow all plant species to grow. Conversely, high levels of nutrients can also be detrimental to aquatic plant populations (Yang, Wu, Hao, & He, 2008).

Eutrophication is the response of a water body to an influx of minerals and nutrients into the watershed, which can result in excessive plant and algal growth; it is often caused by the release of domestic waste into the environment or runoff from agricultural and urban development (Yang, Wu, Hao, & He, 2008). For example, nutrient levels in the Bay of Quinte, Lake Ontario, increased in the 1950s and 1960s, leading to increases in submerged plant populations (Chambers et al., 1999). The increased nutrients then led to dense phytoplankton blooms in the late 1960s, which shaded out the submerged aquatic plants and ultimately decreased their numbers. Aquatic plant populations are also impacted by herbivores. In 1977, aquatic moth larvae (Acentria spp.) and weevil larvae (Litodactylus spp.) damaged Eurasian watermilfoil (Myriophyllum spicatum) in Chemong Lake, Ontario, leading to a large decline for the species (Chambers et al., 1999). However, M. spicatum is invasive in many parts of North America and can negatively impact biodiversity of native species (Dextrase & Mandrak, 2006; Snyder, Francis, & Darbyshire, 2016; Tamayo & Olden, 2014).

Among the largest threats to native freshwater species are alien species with the potential to invade water bodies. Alien aquatic plant species can be aggressive and competitive, often forming dense submerged or floating mats of vegetation (Chambers, Lacoul, Murphy, & Thomaz, 2008). Invasive freshwater species are a concern due to the numerous pathways of potential introduction and spread in lakes and rivers, such as commercial and recreational boating, and intentional or accidental aquarium and ornamental releases (Tamayo & Olden, 2014). Non-native species introduced into these environments can also lead to severe changes in the chemical composition of the water (Caraco et al., 2006; Caraco & Cole, 2002); even slight changes in chemical composition can make an aquatic system uninhabitable to native mammals, fish, invertebrates, plants, and microorganisms (Aloo, Ojwang, Omondi, Njiru, & Oyugi, 2013; Chamier et al., 2012). Invasive aquatic plant species can also impact other organisms living in their ecosystems. For example, in North America, Eurasian watermilfoil (M. spicatum) and reed-canary grass (Phalaris arundinacea) were identified as alien species threatening freshwater fishes and molluscs due to habitat alteration (Dextrase & Mandrak, 2006). Invasive aquatic plant species are a threat not only to the aquatic ecosystem, but to humans as well. One example is the Eurasian-native water soldier (Stratiotes aloides), which is invasive in Ontario and whose sharp, serrated leaf edges are a nuisance to swimmers (Toma, 2006). The presence of even a single invasive species can disrupt the entire function of an ecosystem, its diversity, and recreational activities (Snyder et al. 2016).

Beyond the biological and recreational impacts of invasive aquatic plant species, there are significant economic losses associated with invasions. In the United States, the Office of Technology Assessment (OTA) reported that in excess of $100 million is spent every year on controlling aquatic plants (Lovell & Stone, 2006). For invasive hydrilla (Hydrilla verticillata) alone, the US spends millions of dollars every year on chemical control agents, biological control agents, physical removal of hydrilla communities, and physical environmental manipulation (Chambers et al., 2008). In the past 40 years, aquatic macrophyte control programs have been implemented in Ontario to combat effects of M. spicatum invasion by removing weed beds (Chambers et al., 1999). Although there are various management strategies for invasive species in place throughout Ontario and, more broadly, North America, water bodies of interest might harbour multiple invasive species at any given time (Tamayo & Olden, 2014).

Early detection of invasive species can be key in managing their numbers before they become problematic, and the goal of this thesis was to identify genetic markers that would allow for the simultaneous detection of multiple aquatic plant species from a water sample, thereby allowing early detection of invasive species. Applications of this species detection tool extend beyond the detection of invasive species to potentially locating elusive species-at-risk. The assay can also describe aquatic plant communities, for example in response to alien species’ invasions or during community recovery following eradication of problematic invasive species.

Chapter 2: Using eDNA metabarcoding to assess aquatic plant communities

2.1 Introduction

Environmental DNA (eDNA)

Locating aquatic plants using observational studies can be difficult if not impossible in many environments. Visual surveys are limited by constraints such as visibility, access, and identification based on morphological traits (Ficetola et al. 2015). These constraints provide a challenge, especially in locating rare or cryptic species, and species living in murky or hard-to-access areas. Although visual surveys have been the most common method of documenting community biodiversity, there is often a misrepresentation of rare or newly introduced species, and such monitoring techniques can be expensive and time-consuming (Stein et al., 2014; Tyre, Tenhumberg, Field, Niejalke, & Possingham, 2003).

An alternative to locating and visualizing or collecting individuals is environmental DNA (eDNA), which is increasingly used as a tool for screening environmental samples for species detection (Ficetola et al. 2015). The basic principle underlying eDNA sampling stems from the fact that most organisms release DNA into their surrounding environment through urine, feces, or any cellular material shed by either a living or dead organism. eDNA assays are based on obtaining DNA without having to first isolate a target organism, allowing for a non-invasive alternative to traditional methods, often with higher resolution of taxonomic identification than observational studies (Kelly et al. 2014). Currently, the only methods used for identifying invasive aquatic plant species are observational surveys and eDNA assays that target one or a small number of species (Fujiwara, Matsuhashi, Doi, Yamamoto, & Minamoto, 2016; Marinich, 2017). Standardized DNA sequences have improved taxonomic assignments for many organisms (Hajibabaei, 2012). For example, past studies have: extracted DNA from soil samples to obtain sequence data of multiple terrestrial plants (Fahner et al. 2016); studied plant-pollinator networks using DNA from pollen samples (Bell et al. 2016); and extracted DNA from water samples to detect presence or absence of Asian carp in the Great Lakes and surrounding waterways (Evans et al. 2016). These studies demonstrate the versatility of eDNA assays, both in the media from which eDNA is extracted and in the variety of study systems that can be analyzed.

While advancements have been made to improve aquatic eDNA-based studies, there are various site-specific environmental conditions and organism-specific factors that impact the amount of DNA found in, and later extracted from, an environmental sample (Ficetola, Miaud, Pompanon, & Taberlet, 2008; Stoeckle et al., 2017). Organism-specific limitations include density of individuals in the community, size of the organism, and volume and rate of DNA secretions. Environmental conditions affecting aquatic eDNA concentrations include volume of the water body where the sampling is taking place, temperature of the environment, ultra-violet (UV) radiation exposure, and flow or stagnancy of the aquatic environment (Barnes et al., 2014). There is still uncertainty as to how these factors directly influence eDNA degradation, and by extension, eDNA recovery. Rates of DNA degradation in aquatic environments could dictate whether a positive species detection is the result of that organism currently or recently occupying the sampling site, or whether that DNA is from an organism that occupied that space further in the past (Barnes et al., 2014). Matsui et al. (2001) found that a 400bp DNA fragment can persist for up to seven days in 18 °C lake water, collected at a depth of 2.5 m, with 0.7 µg L⁻¹ of soluble reactive phosphorus and a bacterial abundance of 2.6 × 10⁶ cells mL⁻¹. However, if DNA is still protected inside the cell, or is bound to other particles, it may persist longer (Ficetola et al., 2008). Dejean et al. (2011) set up aquarium experiments to observe DNA detection from bullfrog tadpoles and Siberian sturgeon. The bullfrog experiment showed that DNA detectability decreased over time after removing the tadpoles, with DNA being detected until day 25, and that DNA detection increased with tadpole density. With the Siberian sturgeon, detectability decreased over time, with DNA being detected until day 14 (Dejean et al., 2011). This discrepancy between the duration of DNA detection for the two species was attributed to two factors: (1) organisms secrete DNA into the environment at different rates, and (2) the tadpole experiments were conducted in aquariums whereas the sturgeon experiments were conducted in artificial ponds, and were thus exposed to a variety of environmental factors (Dejean et al., 2011). This research demonstrates the complex nature of environments and how they influence the amount and quality of DNA that can be retrieved from eDNA samples.

Barcoding and metabarcoding

A DNA barcode is a short, unique, and standardized DNA sequence or genetic marker that can be used to make species-level identifications (Krishnamurthy & Francis, 2012; Van De Wiel, Van Der Schoot, Van Valkenburg, Duistermaat, & Smulders, 2009). A barcode is designed by finding a region of DNA that has greater interspecific than intraspecific variation; in other words, a region of DNA that is more similar when compared between conspecific individuals than between closely related heterospecific individuals. Upon its discovery, DNA barcoding was applied primarily to metazoans using mitochondrial markers, but has since been applied to most living organisms including plants and fungi (Casiraghi, Labra, Ferri, Galimberti, & Mattia, 2010). Barcoding involves coupling molecular laboratory techniques with bioinformatics analyses to identify individual species or groups of species (Casiraghi et al., 2010; Krishnamurthy & Francis, 2012). To date, the International Barcode of Life Project has developed a barcode for over 580,000 species (iBOL; http://biodiversitygenomics.net/projects/ibol/). DNA barcoding has been used to determine species identity directly from tissue samples, or to describe newly discovered species. When an organism is identified in situ, species-level discrimination might still be difficult due to species being cryptic, or due to morphological variation observed at different life stages. DNA barcoding provides a higher level of confidence in declaring an organism’s identity; however, it can be susceptible to error, for example when interspecific sequences are insufficiently divergent or if reference databases are incomplete (Elliott & Davies, 2014; Ghorbani, Saeedi, & De Boer, 2017; Joly et al., 2014).
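The criterion described above, greater interspecific than intraspecific variation, can be made concrete with a small calculation. The following is a minimal Python sketch and not the procedure used in this thesis: it assumes sequences that are already aligned to equal length, uses a simple proportion-of-differing-sites distance, and the function names (`p_distance`, `barcoding_gap`) and toy sequences are hypothetical illustrations rather than real barcode data.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def barcoding_gap(records):
    """Return (max intraspecific distance, min interspecific distance).

    records is a list of (species_label, aligned_sequence) tuples.
    """
    intra, inter = [], []
    for (sp_a, seq_a), (sp_b, seq_b) in combinations(records, 2):
        distance = p_distance(seq_a, seq_b)
        (intra if sp_a == sp_b else inter).append(distance)
    if not intra or not inter:
        raise ValueError("need within-species and between-species pairs")
    return max(intra), min(inter)


# Hypothetical toy alignment (invented sequences, for illustration only).
toy_records = [
    ("species_A", "ACGTACGTAC"),
    ("species_A", "ACGTACGTAT"),
    ("species_B", "ACGAACCTAC"),
    ("species_B", "ACGAACCTAC"),
]

max_intra, min_inter = barcoding_gap(toy_records)
# A candidate region behaves like a barcode for these taxa when every
# between-species distance exceeds every within-species distance.
has_gap = min_inter > max_intra
```

In this toy case the largest within-species distance is smaller than the smallest between-species distance, so the region would separate the two species. Real assessments would work from curated reference sequences and typically use model-based distances.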

DNA barcodes have many potential applications, such as discriminating between invasive species and their non-invasive relatives (Van De Wiel et al., 2009), or inferring presence or absence of a species from an environmental sample (Ficetola et al., 2008). A commonly used barcode for animals is the cytochrome c oxidase I (COI) gene, as intraspecific variation is usually less than interspecific variation (Feng, Li, Kong, & Zheng, 2011; Hebert, Ratnasingham, & de Waard, 2003; Zou & Li, 2016). Specifically, a 648bp fragment of the COI gene is the primary barcode sequence for the animal kingdom according to the Barcode of Life Data System (BOLD) (Hebert et al., 2003; Ratnasingham & Hebert, 2007). For plants, there are several commonly used DNA barcode loci, either individually or in combination, including plastid rbcL, matK, and trnH-psbA, and the nuclear internal transcribed spacer (ITS) (Krishnamurthy & Francis, 2012). These regions are generally considered less reliable for species discrimination than the COI DNA barcode used for the animal kingdom, and it is unclear which barcodes are most appropriate for species identification in the plant kingdom. For land plants, rbcL and matK are the primary barcodes, and trnH-psbA and ITS are considered secondary barcodes (Krishnamurthy & Francis, 2012). One study of invasive aquatic plants found that matK was a more useful marker than either rbcL or trnH-psbA for detecting aquatic plant eDNA in terms of the ability to design putatively species-specific primers for the largest number of target species (Scriver, Marinich, Wilson, & Freeland, 2015).

eDNA barcoding involves using species-specific markers to identify species from environmental samples and has many potential applications, such as inferring the presence or absence of invasive species or species-at-risk (Stoeckle et al., 2017); however, this targeted approach does not allow for studying the composition of an entire community at a given site. Using DNA barcoding to infer community composition from eDNA at a particular site would require running multiple amplification reactions using species-specific primers for every target species and thus requires pre-existing knowledge on which species exist in the environment. eDNA barcoding has been used for early detection of alien species (Dougherty et al., 2016), in ongoing monitoring efforts following eradication attempts for invasive species (Dunker et al., 2016), or to screen environmental samples for rare or cryptic species of interest (Port et al., 2016). eDNA approaches are often preferable over more traditional monitoring techniques as they require minimal direct intervention or disturbance of the ecosystem (Matsuhashi, Doi, Fujiwara, Watanabe, & Minamoto, 2016; Port et al., 2016), although some studies have found equal or greater success in visual observations of organisms (Fujiwara et al., 2016; Shaw et al., 2016).

The use of eDNA provides a challenge because it is commonly degraded and its quantities are heterogeneous throughout the environment due to stochastic conditions like moving water, temperature and sunlight (Dejean et al., 2011); therefore selecting a shorter barcoding marker may result in better recovery and amplification of target sequences, as shorter fragments can be found at higher concentrations in the environment (Keskin, Unal, & Atar, 2016; Seppey et al., 2016). While ‘classic’ barcoding regions are frequently used in eDNA assays, they may be insufficiently variable for differentiating among congeneric species, and in many cases amplify fragments that are too long for reliable detection from degraded environmental samples; therefore, other markers may be preferable (Freeland, 2017). It is essential to consider multiple potential markers when selecting target regions for eDNA barcoding.

A metabarcode refers to a region of a genome that can be used to identify a range of taxa when there are multiple target species, and uses high-throughput sequencing (HTS) to identify unique DNA sequences from a mixed sample. Metabarcoding assays should be based on a gene region that is variable enough to discriminate multiple species, but these variable regions must be flanked by highly conserved regions that allow a primer pair to anneal across multiple taxa. Unlike targeted species detection, metabarcodes are normally generated using primers that have some level of universality, meaning they should amplify DNA from a predetermined taxonomic group such as order, family, genus, or species (Hawkins et al. 2015). High-throughput sequencing permits the simultaneous characterization of a community of organisms from mixed samples (Fahner, Shokralla, Baird, & Hajibabaei, 2016).
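The requirement that a variable region be flanked by conserved primer-binding sites can be illustrated with a simple in silico amplification. The sketch below is a hedged, plain-Python illustration and is not ecoPCR or any other published tool: it assumes a single-stranded template written 5'→3', counts mismatches uniformly across the primer, and uses invented primers and templates. All names (`find_site`, `in_silico_amplicon`, the toy sequences) are hypothetical.

```python
# IUPAC nucleotide codes: each primer symbol maps to the bases it accepts.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a sequence that may contain IUPAC codes."""
    return seq.translate(COMPLEMENT)[::-1]


def find_site(primer, template, max_mismatches=0, start=0):
    """Return the first index >= start where the primer matches, or -1."""
    for i in range(start, len(template) - len(primer) + 1):
        window = template[i:i + len(primer)]
        mismatches = sum(1 for p, t in zip(primer, window) if t not in IUPAC[p])
        if mismatches <= max_mismatches:
            return i
    return -1


def in_silico_amplicon(forward, reverse, template, max_mismatches=0):
    """Return the region between the two primer sites (primers excluded), or None."""
    fwd_at = find_site(forward, template, max_mismatches)
    if fwd_at < 0:
        return None
    insert_start = fwd_at + len(forward)
    # The reverse primer anneals to the opposite strand, so search the
    # template for its reverse complement downstream of the forward site.
    rev_at = find_site(reverse_complement(reverse), template,
                       max_mismatches, start=insert_start)
    if rev_at < 0:
        return None
    return template[insert_start:rev_at]


# Hypothetical degenerate forward primer (Y = C or T) and reverse primer.
FORWARD, REVERSE = "ACGYTA", "GATCCA"

# Invented templates: conserved flanks around a short variable region.
templates = {
    "species_x": "GGACGCTATTTTCCCCTGGATCGG",
    "species_y": "GGACGTTATTTACCCCTGGATCGG",
    "species_z": "GGACGATATTTTCCCCTGGATCGG",  # forward site carries a mismatch
}

amplicons = {name: in_silico_amplicon(FORWARD, REVERSE, seq)
             for name, seq in templates.items()}
```

Here the degenerate base lets one primer pair anneal to both `species_x` and `species_y`, whose amplified regions differ and could therefore be told apart, while `species_z` is not amplified unless a mismatch is tolerated (for example `max_mismatches=1`).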

Metabarcoding can be used for biodiversity studies of plant and animal taxa, and other ecological surveys like analyzing diets of herbivores and carnivores from fecal samples. Valentini et al. (2009) used the P6 loop of the chloroplast marker trnL for metabarcoding plants from fecal samples to study the diets of bears, marmots, capercaillies, grasshoppers, and molluscs. Another diet study analyzed monkey feces with a variety of genetic markers to determine the composition of plant, arthropod, protist, and nematode species (Srivathsan, Ang, Vogler, & Meier, 2016). Specifically relating to plant metabarcoding studies, efforts have been focused on terrestrial species, targeting plant DNA from soil and pollen samples (Fahner et al., 2016; Richardson et al., 2015).

With the variety of genetic markers that have been studied for plant barcoding and metabarcoding, there has been little agreement as to which one would best serve as a universal marker. Fahner et al. (2016) evaluated rbcL, matK, ITS2, and the trnL P6 loop as potential markers for metabarcoding from soil samples, and found that a combination of rbcL and ITS2 provided the best taxonomic resolution and had the most complete reference databases. Another study found through in silico analysis of eight candidate markers that trnL had the most efficient polymerase chain reaction (PCR) amplification and resolution for their target species (Yang, Zhan, Cao, Meng, & Xu, 2016). Though multiple studies have investigated markers for metabarcoding vascular plants with varying conclusions on which markers are ideal for metabarcoding (Bell et al., 2016; Fahner et al., 2016; Hawkins et al., 2015; Srivathsan et al., 2016), none have focused specifically on aquatic plants.

Introducing metabarcoding to studies of aquatic plant species has the potential to provide early detection of invasive species and target the community as a whole to screen for multiple invasive species. In southern Ontario, there are many established invasive alien species, plus additional alien species that have the potential to become invasive and outcompete their native counterparts. Ontario has an abundance of connecting waterways, and there is a need to monitor and control populations of aquatic invasive plants before they make their way into the Great Lakes. The Great Lakes experience heavy volumes of boat traffic, including international commercial ships. Furthermore, observational and molecular methods of detecting plant DNA would be more challenging in the Great Lakes compared to smaller lakes and rivers simply due to the volume of observations or water samples that would be necessary to monitor the entire water body.

One current invasive species of interest in southern Ontario is water soldier (S. aloides). There are five known locations where water soldier has established in North America, all of which are found in southern Ontario: (1) the Trent River, County of Northumberland; (2) a pond for cattle in the Municipality of Trent Hills; (3) a pond in Blackstock, Township of Scugog; (4) an artificial pond near Bayfield, Huron County; and (5) the Black River, near Sutton, Regional Municipality of York (Snyder, Francis, & Darbyshire, 2016; M. Shapiera, OMNRF, pers. comm.). Current monitoring techniques include observational surveys leading to physical removal of any observed plants, raking for submerged colonies in areas where the plants have been located, and using quantitative PCR (qPCR) to assay eDNA extracted from water samples using species-specific primers with the goal of determining whether eradication efforts have been sufficient or to find otherwise unobservable plants. However, targeted detection of S. aloides has generated results that are not reliable, with qPCR results being inconsistent with observational surveys (C. Currier & J. Freeland, pers. comm.); a metabarcoding method may be a more effective screening tool that would not only allow for the detection of S. aloides, but also the simultaneous identification of other invasive species of concern.

By coupling metabarcoding with aquatic plant primers and HTS, species-at-risk could potentially be identified in areas where they are present in low quantities, or in areas where they are not easily visible; in addition, early detection of invasive species may be possible. Using metabarcoding with aquatic plant-specific primers will ideally generate an inventory of all plants that are found in the area from which a water sample is taken. This could permit organizations to intervene and take measures to conserve certain species, for example by protecting their habitat, which in time could lead to lifting their “at risk” status. Metabarcoding of aquatic plants from eDNA isolated from freshwater could provide a method for describing aquatic plant communities, including invasive species, species-at-risk, and native species, although this has not yet been investigated.

Illumina sequencing & bioinformatics

The development of HTS technologies has greatly increased the scope and volume of data being generated in genetic studies. Compared to traditional sequencing, HTS allows researchers to obtain a larger volume of sequence data and high sequence coverage leading to reliable and robust, standardized data (Buermans & den Dunnen, 2014).

Multiple HTS platforms have emerged over the years with a common goal of generating millions of sequencing reads in a single run (Vincent, Derome, Boyle, Culley, & Charette, 2017). Arguably the most popular HTS platform due to its low cost and high throughput when compared to other HTS platforms is Illumina (Glenn, 2011).

Prior to sequencing on the Illumina platform, samples must be prepared into libraries. Metabarcoding studies target one gene region at a time, and the first step is therefore to design or select primers that will amplify the target gene region from multiple taxa of interest. Once amplified, a second PCR amplification step is required to ligate Illumina-recommended adapters to either end of the amplicon. These adapters must be added to the targeted amplicons to make them complementary to oligos on the sequencer’s flow cell, and contain indexed barcodes that are later used to assign sequences to a sample (Buermans & den Dunnen, 2014). Library preparation may also include PCR purification steps to remove any PCR by-products or steps to select only DNA fragments of desired amplicon lengths (Feng, Liu, Chen, Liang, & Zhang, 2016). Once libraries have been prepared, samples are visualized on a high-resolution gel to ensure the libraries are of the expected length in base-pairs, and quantified to ensure they are of a suitable DNA concentration.

The Illumina sequencing platform involves clonal amplification of target DNA fragments on a flow cell. Here, Illumina uses a sequencing-by-synthesis (SBS) approach, and clonal amplification generates a signal large enough to be detected over the threshold of background noise such that base incorporation occurs at a measurable level. Illumina may be desirable over other technologies such as Ion Torrent or Roche Pyrosequencing because of its ability to sequence paired-end libraries, the high number of reads per run (up to 3 billion), the output per run, and the relatively low cost per Mb (Ahn, 2011; Buermans & den Dunnen, 2014). Data output varies between Illumina platforms and there are several sequencing options for each, with the MiSeq capable of generating 2x300bp paired-end reads, 20 million reads per run, resulting in 12Gb output of data, and the HiSeq capable of generating 2x125bp paired-end reads, 150 million reads per lane, 16 lanes per run, and resulting in 600Gb of data (Vincent et al., 2017).
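The output figures quoted above can be checked with simple arithmetic. The worked example below assumes that each counted "read" is a read pair (one cluster sequenced from both ends), which is an interpretation and is not stated in the cited source.

```latex
\begin{align*}
\text{MiSeq:}\quad & 2 \times 300\,\text{bp} \times 20 \times 10^{6}\ \text{read pairs}
  = 1.2 \times 10^{10}\,\text{bp} = 12\,\text{Gb} \\
\text{HiSeq:}\quad & 2 \times 125\,\text{bp} \times 150 \times 10^{6}\ \text{read pairs per lane} \times 16\ \text{lanes}
  = 6.0 \times 10^{11}\,\text{bp} = 600\,\text{Gb}
\end{align*}
```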

With the introduction of HTS technologies came the development of computational programs to aid in analysis and interpretation of the resulting data, which typically comprise millions of sequences and billions of base-pairs. Bioinformatics refers to the analysis of high throughput genetic data and statistical modeling through computing protocols and scripts (Ahn, 2011). Compared to Sanger sequencing, which involves determining the amplified DNA sequence from a single target organism, HTS platforms and bioinformatics analyses can provide a massively paralleled system for higher throughput and coverage and the ability to separate mixed DNA sequence results.

With respect to Illumina sequencing results, this involves separating pooled samples by each indexed barcode and quantifying sequence abundance and quality. The barcodes are incorporated during library preparation into the adapter in the adapter ligation amplification step, in both the forward and reverse reads (Illumina, 2013). These barcodes are unique to individual samples since there are often many samples being analyzed simultaneously on the flow cell, and the unique barcode combinations allow samples to be sorted after base-calling is complete. Paired-end reads can be merged, sorted, and further manipulated according to individual applications (Feng et al., 2016).
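The sorting of pooled reads by index pair can be illustrated with a minimal Python sketch. The index sequences, sample names, and the `demultiplex` function are hypothetical and serve only to show the principle; production demultiplexing is normally performed by the sequencing facility's software and also tolerates sequencing errors in the indices, which this sketch does not.

```python
from collections import defaultdict

def demultiplex(reads, index_to_sample):
    """Group reads by their (index1, index2) pair.

    reads: iterable of (index1, index2, sequence) tuples.
    index_to_sample: dict mapping (index1, index2) -> sample name.
    Reads whose index pair is not listed are binned as 'undetermined'.
    """
    by_sample = defaultdict(list)
    for index1, index2, sequence in reads:
        sample = index_to_sample.get((index1, index2), "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)

# Hypothetical 8bp indices and sample names, for illustration only.
sample_sheet = {
    ("TAAGGCGA", "CTCTCTAT"): "control",
    ("CGTACTAG", "TATCCTCT"): "black_river",
}
reads = [
    ("TAAGGCGA", "CTCTCTAT", "ACGTACGT"),
    ("CGTACTAG", "TATCCTCT", "TTGACCGA"),
    ("TAAGGCGA", "CTCTCTAT", "ACGTTCGT"),
    ("GGGGGGGG", "CTCTCTAT", "NNNNNNNN"),  # unknown index pair
]
binned = demultiplex(reads, sample_sheet)
```

Exact matching on both indices is the simplest possible rule; it was chosen here because it makes clear why each sample needs a unique index combination.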

With metabarcoding, bioinformatics tools and analyses can be used to establish an inventory of organisms that contributed DNA to a sample in a semi-quantitative manner, providing insight into the complexity of an entire ecosystem (Feng et al., 2016). Quality control is generally one of the first steps in data analysis to evaluate the quality of each sequence read (Vincent et al., 2017). Various computational protocols have been developed to sort metabarcoding sequences into readable, applicable data. These include UPARSE (Edgar, 2013), PRINSEQ (Schmieder & Edwards, 2011), and OBITOOLS (Boyer et al., 2015), which allow users to merge paired-end reads, assign reads to samples by barcode or primer sequence, filter erroneous sequences, dereplicate sequences, and assign taxonomy to each sequence.

Study rationale

The threats facing aquatic plants highlight the pressing need to detect both invasive species and species-at-risk to maintain and restore aquatic ecosystem biodiversity. Southern Ontario is an ideal location for investigations of aquatic plant biodiversity as it is subject to extensive anthropogenic change, which is linked to the growing number of alien aquatic plants that are being introduced and becoming established. Furthermore, it is just north of the range limit of a number of highly problematic species such as water and water hyacinth that are expected to establish in the near future as a result of climate change and (Adebayo et al., 2011). Southern Ontario is also an important region for studying the effects of invasive aquatic species due to the large amount of ship traffic acting as vectors for alien species’ introduction in the Great Lakes (Lovell & Stone, 2006). Over 145 nonnative aquatic species have become established in the Great Lakes area, 42% of which are plants (Lovell & Stone, 2006). Reliable species-detection protocols are required to monitor community responses to invasions in order to conserve biodiversity.

In some cases, invasive aquatic plant species disrupt biodiversity and ecosystem functioning, but community effects following invasions, or following eradication of invasive species, have not been widely studied. Challenges associated with surveying aquatic sites become evident when trying to study these effects, and therefore highlight the need to develop a biodiversity assessment tool to improve the characterization of aquatic plant communities. I developed a method involving metabarcoding and HTS that allows researchers to screen eDNA water samples for aquatic plant species.

There were two main goals of this study:

1. To investigate the feasibility of using metabarcoding to characterize aquatic plant communities from eDNA samples. This included determining what marker(s) were suitable for multi-species detection of aquatic plants. I designed novel metabarcoding primers to specifically target aquatic plant DNA, and tested whether terrestrial plant metabarcoding markers within rbcL, matK, and ITS2 gene regions are sufficiently conserved to amplify DNA from multiple taxa, but also sufficiently variable to discriminate among aquatic plant species that are found in southern Ontario.

2. To use metabarcoding to characterize an aquatic plant community from one region in southern Ontario that has been invaded by water soldier: the Black River.

If genetic regions that have been used to design metabarcoding primers for terrestrial plants are sufficiently conserved for amplification across taxa, and diverse enough for taxonomic discrimination in aquatic plant species, this would allow me to describe aquatic plant communities from metabarcoding sequences. I therefore explored various methods of primer design and aimed to optimize the process in a way that can be applied to any future metabarcode design projects to address the second goal. Multiple validation steps were used to determine whether the selected or newly designed primer pairs are useful for metabarcoding a wide range of freshwater aquatic plants.

Beyond this study, there is potential to address a wide range of hypotheses regarding aquatic ecosystem plant biodiversity. With ecosystems being consistently disrupted by anthropogenic and natural changes, such as climate change and human disturbance, this study provided a reliable tool for assessing aquatic plant community biodiversity and community responses to these changes.


2.2 Methods

Rationale

Research on plant detection using eDNA has been limited in the number of species that can be simultaneously detected, often focused on identifying one or two species at a time (Fujiwara et al., 2016; Marinich, 2017; Scriver et al., 2015). I considered multiple potential loci using a high-throughput metabarcoding framework, investigating whether this would give us a cost-effective approach to detect numerous species. The selected markers and associated primers must have a highly conserved primer-binding region and the amplified sequence of each species must be sufficiently divergent to identify species. Selecting appropriate metabarcoding markers also requires a database of reference sequences representing potential target species. Consequently, the main goals of this study were to develop novel markers appropriate for metabarcoding aquatic plants, compare these to pre-existing vascular plant markers in terms of their ability to amplify and discriminate between aquatic plant species, and develop a bioinformatics pipeline that was then used to compare the taxonomic resolution of multiple genetic markers.

(a) eDNA marker development and in silico analysis of database sequences

(i) Primer design

Chloroplast genes matK and rbcL, and the internal transcribed spacer (ITS2) of ribosomal DNA were selected as candidate markers for this aquatic plant metabarcoding study. These markers have been used in many terrestrial plant eDNA barcoding and metabarcoding studies (Alsos et al., 2018; Fahner et al., 2016; Hawkins et al., 2015; Richardson et al., 2015), and in aquatic plant eDNA barcoding (Marinich, 2017; Scriver et al., 2015), partly because they have moderate to substantial coverage in the GenBank database (www.ncbi.nlm.nih.gov/genbank). Primer design and in silico testing were based on publicly available sequences of aquatic species common to various water bodies in southern Ontario, as well as aquatic, wetland, and terrestrial species that were collected for this study (Appendix A, Table A.2). In preliminary in silico analysis of potential target species, one or two sequences were downloaded from GenBank, with the exception of species that do not have GenBank data at certain gene regions, and aligned in MEGA version 7 (Kumar, Stecher, & Tamura, 2016) with the ClustalW algorithm (Larkin et al., 2007) to observe inter- and intraspecific variation and infer suitability of the gene regions for metabarcoding. For matK, eight species had one sequence and 30 species had two sequences, where 24 species showed zero intraspecific variation and the maximum intraspecific variation was four mutations. For rbcL, 17 species had one sequence and 26 species had two sequences, where 23 species showed zero intraspecific variation and the maximum intraspecific variation was eight mutations. For ITS2, 16 species had one sequence and 21 species had two sequences, where five species showed zero intraspecific variation and the maximum intraspecific variation was 42 mutations. As little to no intraspecific variation was observed in MEGA alignments across most species, one representative sequence for each of the target species was used in the creation of each database. Only the aquatic and wetland plant species from Table A.2 (Appendix A), henceforth classed together as ‘aquatic’ due to the likelihood of both contributing DNA in aquatic environments, were included in the custom databases curated for primer development, with approximately 50 aquatic species for each marker represented in each database (Appendix A, Table A.2).
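The pairwise comparison of two aligned sequences from the same species can be illustrated with a minimal Python sketch. This is not MEGA's implementation; the function name and the example sequences are hypothetical, and the sketch assumes that only substitutions are counted, with alignment gaps and uncalled bases (N) skipped.

```python
def count_differences(aligned_a, aligned_b):
    """Count substitution differences between two aligned sequences.

    Assumption: positions with a gap ('-') or an uncalled base ('N')
    in either sequence are skipped rather than counted as mutations.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    differences = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a in "-N" or base_b in "-N":
            continue
        if base_a != base_b:
            differences += 1
    return differences

# Two hypothetical aligned sequences from one species.
seq_1 = "ACGTTGCA-T"
seq_2 = "ACGATGCAGT"
n_diff = count_differences(seq_1, seq_2)
```

If indels were to be treated as mutations as well, the gap check would be removed; which convention applies to the counts reported above is not specified in the text.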

Multiple primer pairs were designed using the ecoPrimers bioinformatics pipeline (Riaz et al., 2011) based on the databases for matK, rbcL, and ITS2. In brief, the ecoPrimers program scans provided sequences, in this case a custom database for each gene region, and identifies conserved regions suitable for primer annealing in regions that flank variable sites (Riaz et al., 2011). Parameters were set to target amplicons ranging in size from 50bp-500bp excluding primers and allowed up to three mismatches in each of the primer-binding sequences, but ensured that the two 3’ nucleotides on the primers were complementary to database sequences (i.e. they did not require a degenerate nucleotide). In the case that not all of these parameters can be met for all species, the program designs primers capable of amplifying and discriminating between as many of the potential target sequences as possible. The designed primers were compared to the MEGA alignments of all species of interest to verify their suitability in the primer-binding regions by looking for the number of mismatches, and specifically how variable the primer-binding region was for the sequences that did not conform to the preset parameters. Although ecoPrimers quantifies how many of the target sequences fit within these parameters, it does not reveal how many mismatches each target sequence contains.
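The per-sequence mismatch check described above can be sketched in Python. This is an illustrative reimplementation of the stated criteria (at most three mismatches, none in the two 3’ nucleotides), not code from ecoPrimers; the function names and the binding-site sequences are hypothetical. The sketch assumes the binding site is given in the same 5’ to 3’ orientation as the primer and contains only unambiguous bases.

```python
# IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def mismatch_positions(primer, site):
    """Return 0-based positions where the site base is not allowed by the primer."""
    if len(primer) != len(site):
        raise ValueError("primer and binding site must have equal length")
    return [
        i
        for i, (p, s) in enumerate(zip(primer.upper(), site.upper()))
        if s not in IUPAC[p]
    ]

def primer_accepts(primer, site, max_mismatches=3, strict_3prime=2):
    """Apply the two criteria: no mismatch in the last `strict_3prime`
    bases, and no more than `max_mismatches` mismatches overall."""
    positions = mismatch_positions(primer, site)
    three_prime_start = len(primer) - strict_3prime
    if any(pos >= three_prime_start for pos in positions):
        return False
    return len(positions) <= max_mismatches

primer = "CCCAVGCAGRCDTGCCC"           # degenerate primer written 5' to 3'
perfect_site = "CCCACGCAGACATGCCC"     # hypothetical site, no mismatches
five_prime_miss = "TCCACGCAGACATGCCC"  # hypothetical, mismatch at the 5' end
three_prime_miss = "CCCACGCAGACATGCCA" # hypothetical, mismatch at the 3' end
```

Returning the mismatch positions, rather than only a pass or fail result, addresses the limitation noted above: it shows how many mismatches each target sequence contains and where they fall.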

A second set of primers was designed for each locus by modifying the first set of primers to incorporate degenerate bases in sequence positions where there was variability among species. The virtual amplification pipeline ecoPCR (Ficetola et al. 2010) was used to test the anticipated success of the designed primer pairs, determined by amplicon length and discriminatory power, for each marker against its respective database. Scripts are available in Appendix E.
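Discriminatory power, as used above, can be illustrated by counting how many species yield a predicted amplicon that no other species in the database shares. The Python sketch below is not part of ecoPCR; the function name, species labels, and sequences are hypothetical, and it assumes that two species are indistinguishable only when their predicted amplicons are identical.

```python
from collections import defaultdict

def discriminatory_power(amplicons):
    """Summarize virtual amplification results.

    amplicons: dict mapping species name -> predicted amplicon sequence,
    or None when no amplicon was predicted for that species.
    """
    species_by_sequence = defaultdict(list)
    for species, sequence in amplicons.items():
        if sequence is not None:
            species_by_sequence[sequence.upper()].append(species)
    amplified = sum(len(group) for group in species_by_sequence.values())
    # A species is resolved when its amplicon is unique in the database.
    resolved = sorted(
        group[0] for group in species_by_sequence.values() if len(group) == 1
    )
    return {
        "amplified": amplified,
        "resolved": resolved,
        "fraction_resolved": len(resolved) / len(amplicons) if amplicons else 0.0,
    }

# Hypothetical database of four species.
example = {
    "species_A": "ACGTAC",
    "species_B": "ACGTAC",  # identical to species_A, so neither is resolved
    "species_C": "ACGTTC",
    "species_D": None,      # not amplified in silico
}
summary = discriminatory_power(example)
```

Keeping the amplified count separate from the resolved fraction makes it possible to tell a primer pair that fails to amplify from one that amplifies but cannot discriminate.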

Previously designed primers for terrestrial plant metabarcoding (Fahner et al., 2016) were also tested in silico to determine their complementarity for annealing to primer-binding sites across a variety of aquatic plant taxa; the expected amplicon length; and the anticipated discriminatory power among aquatic species based on interspecific sequence variation. Other studies have found that gene regions matK, rbcL, ITS2, and trnL include a suitable combination of conserved regions for primer-binding and discriminatory regions for identifying species, and that these regions have extensive reference databases (Alsos et al., 2018; Fahner et al., 2016; Han et al., 2013; Richardson et al., 2015; Srivathsan et al., 2016). The trnL region was excluded as a candidate for aquatic plant metabarcoding as in silico analyses showed poor taxonomic discrimination, both from the previously developed primers and from those designed by ecoPrimers for this region.

(ii) Primer testing and optimization

Aquatic and terrestrial plant samples were collected in September and October 2016 from three sites: Buckhorn Lake (aquatic and wetland species; wetland representing plants that grow in marshes or partly submerged in water and are therefore likely to contribute DNA in aquatic eDNA samples), Kawartha Highlands Provincial Park (terrestrial species), and the Black River near Sutton (aquatic and wetland species) (Figure 2.2.1). All 29 plant species (15 aquatic/wetland, 14 terrestrial; Appendix B, Table B.1) were identified in the field using visual keys, then brought back to the lab where they were rinsed with deionized water, bagged separately, and stored at -20˚C. An additional 18 samples of extracted DNA from previous research (14 aquatic/wetland, 4 terrestrial) were added to the collection (Appendix B, Table B.1). The total collection included 29 aquatic or wetland species and 19 terrestrial species, representing 28 families.

Figure 2.2.1. Map showing the eDNA and plant collection sites in Ontario. 1) Black River eDNA collection site. 2) Buckhorn Lake plant collection site (Lakehurst, ON). 3) Kawartha Highlands Provincial Park plant collection site (Peterborough County, ON). 4) Brownhill Tract – York Regional Forest plant collection site (East Gwillimbury, ON).

Frozen plant tissue was ground using liquid nitrogen and DNA was extracted using an E.Z.N.A Plant DNA Kit (OMEGA bio-tek) D3485-02 following the Plant DNA Mini - Fresh/Frozen Sample Protocol. The manufacturer’s protocol requires 200mg of ground tissue, and when that was unavailable individual weights for each species were recorded and reagent volumes for the protocol were scaled accordingly (Appendix B, Table B.3). Extracted DNA was visualized on a 1% agarose gel to confirm the presence of genomic DNA (Appendix B, Figure B.1).

Individual primer pairs designed by ecoPrimers were tested on four DNA samples from the aquatic plant genus Typha using a gradient PCR to determine optimal annealing temperatures. Primer names, sequences, melting temperatures, and amplicon lengths are listed in Appendix C, Table C.1. Each PCR cocktail included 1x DreamTaq Buffer (Thermo Fisher Scientific), 0.5U/μL DreamTaq Polymerase (Thermo Fisher Scientific), 0.2mM dNTP (Thermo Fisher Scientific), 0.2μM forward and reverse primers, 15.3μL PCR grade water and 1μL of template DNA for a total volume of 20μL per well. A negative control was included in each test. The gradient PCR cycling conditions included an initial denaturation step at 95°C for 30 seconds, followed by 28 cycles of denaturation at 95°C for 30 seconds, a one minute annealing step with a 10°C gradient temperature range starting approximately 10°C below the melting temperatures, and extension at 68°C for one minute, followed by a final extension step at 68°C for five minutes, and a hold at 4°C. Bromophenol blue (Thermo Fisher Scientific) was added to 10µL of PCR product, which was then loaded on a 1% agarose gel made with 1X TBE buffer and stained with SYBR-Safe (Thermo Fisher Scientific) to allow the DNA to fluoresce. A 100bp DNA ladder (FroggaBio) with fragments ranging from 100-1500bp was run on the same gel to estimate the length of the amplicons. Gel electrophoresis was run for 35 minutes at 90V and 0.83 amps. PCR products were visualized using a UV camera to observe amplified bands of DNA and the ladder.

When tested on 20 individual plant species using the same conditions described above, but with a single annealing temperature (Appendix C, Table C.2), the original and degenerate ecoPrimers primer sequences amplified DNA from 0 to 54% of species (Appendix C, Table C.4). All primer pairs were then modified with the goal of improving the range of taxa from which DNA was amplified by re-visiting the aligned primer-binding sequences in MEGA and extending the length of the primers to: (a) increase the overall melting temperature and therefore annealing temperature of the primers; and (b) ensure that melting temperatures (Tm) for the forward and reverse primers for each pair were within 1-2°C of each other. Gradient PCRs and gel electrophoresis analysis using the modified primers were completed following the same steps as above. Once an optimal annealing temperature for Typha DNA was selected for each primer pair, amplification success was tested against a diversity of plant taxa by individually amplifying DNA from 25 species with each extended primer pair, with and without degenerate bases (Appendix C, Table C.3). Cycling conditions varied for each primer pair (Appendix C, Table C.2). The degenerate primer pairs were chosen for each gene region for downstream applications depending on which version amplified product from the greatest number of species. The selected primers, and the primers from previous soil metabarcoding studies, were altered once more by analyzing aligned sequences and searching for opportunities to increase their length while maintaining complementarity to as many species as possible to reach melting temperatures between 60°C-65°C according to Illumina recommendations. This final round of primer design included adding the overhang sequence required for adapter ligation in downstream sequencing steps with Illumina HiSeq 2500 (Appendix D, Table D.1).


(b) Environmental DNA

An eDNA control sample was created in order to have a composite sample comprising eDNA from multiple known species. A total of 45 species from either the collected or previously extracted samples (Appendix B, Table B.2) representing 30 families were included in the control sample (25 aquatic or wetland and 20 terrestrial species). Of the field-collected plants, approximately one inch of leaf clipping from each plant was placed into a sterile mason jar that was filled with deionized water, and then a 10µL sample of the pre-extracted DNA samples was added, and the sample was left to incubate for three hours (one hour at room temperature and two hours at ~4˚C). Following incubation, the samples were filtered in preparation for DNA extraction from the water.

In order to obtain eDNA from natural , a freshwater sample was collected from the Black River, Ontario, Canada in September 2016 for a collaborative project between Trent University and the Ontario Ministry of Natural Resources and Forestry

(OMNRF). Three separate samples were collected at the site in sterile 1L plastic bottles and kept in a cooler until filtration. A sterile 1L plastic bottle was filled with ddH2O in the lab the morning of sampling and was included in the cooler as a negative control to detect possible contamination in the cooler.

A three-funnel filter manifold (VWR CA28150-434) driven by an electric pump (Millipore EZ Stream, Item: EZSTREAM1) was used to filter the water samples which isolated DNA on 1.2µm pore size glass microfiber filter membranes (Whatman™ 47mm GF/C). The control samples were filtered immediately following the 3-hour incubation period, and the wild samples were filtered within 24 hours from collection, to isolate the eDNA from all organisms present from the water. Filtration equipment and lab benches were cleaned using a 10% bleach solution before filtration began and between the filtration of each set of three samples or controls. Flame-sterilized forceps were used to manipulate filter membranes.

All 1L samples and cooler controls were shaken thoroughly and poured into the filter funnels (VWR CA28150-496) 500mL at a time. If particulate matter clogged the membrane following the first 500mL, a second filter membrane was used to process the second 500mL of the sample. Otherwise, the entire 1L sample was processed through a single filter membrane. Filter membranes were placed in labelled 5mL Eppendorf tubes and stored at -20˚C immediately after filtration. If two filter membranes were required for filtration, they were placed into the same 5mL Eppendorf tube.

Pre- and post-filtration controls were included with the filtration of the wild samples by passing 1L of ddH2O through a filter membrane in each of the filter funnels on the manifold before and after the control and the wild samples to detect any contamination between samples or from the equipment.

DNA from all eDNA water samples was extracted from the filters using MoBio PowerWater® DNA Isolation kits (MoBio Technologies, Inc.) following a modified manufacturer’s protocol (Wilson, Bronnenhuber, Boothroyd, Smith, & Wozney, 2014), and eluted to a final volume of 100μL. When two filter membranes were needed to filter a 1L sample, DNA was extracted from both filters in one reaction. Extracted DNA was stored at -20˚C.


(c) Library preparation

The control and wild eDNA samples were prepared for sequencing following the Illumina 16S Metagenomic Sequencing Library Preparation (Illumina, 2013) guidelines with slight modifications. In the amplicon PCR step, 3μL of template DNA was used instead of the recommended 2.5μL, producing a total volume of 25.5μL per well instead of 25μL. Customized cycling conditions were used for the Amplicon PCR step (Appendix D, Table D.4). PCR Clean-Up was done using QIAquick PCR Purification Kit (Qiagen), and the final Index PCR products were extracted from an agarose gel using the Wizard® SV Gel and PCR Clean-Up System (Promega).

Five primer pairs, omatK2, orbcL2, oITSn, orbcLa, and oITS2, were selected for the metabarcoding protocol due to their success in amplifying DNA from multiple aquatic plant taxa (Table 2.2.1). Two of these were pre-existing primers designed for terrestrial plant metabarcoding (orbcLa and oITS2; Fahner et al., 2016) and three were novel primers designed for this study (omatK2, orbcL2, oITSn). All primers were modified for Illumina sequencing by extending the sequences so that melting temperatures were between 60-65˚C. DNA samples from both the control sample and the Black River were amplified with all 5 primer pairs. Each PCR cocktail included 12.5μL 2x KAPA HiFi HotStart ReadyMix, 5μL each of 1μM forward and reverse primer, and 3μL of the extracted eDNA for a total volume of 25.5μL per well. Primers used in this stage had the overhang adapters attached (Appendix D, Table D.1). Each sample was prepared in quadruplicate, so that there were four wells and a total of 102μL of combined PCR product for downstream applications. Cycling conditions are listed in Appendix D, Table D.4. 10μL of amplified product was run on a 1% agarose gel to ensure the amplicons were approximately the expected lengths. All four replicates of each amplified sample were pooled together. Excess primers, nucleotides, salts, and other impurities were removed from amplified DNA product using QIAquick PCR Purification Kit (Qiagen). 10μL of purified PCR product was run on a 1% agarose gel to ensure that the bands were still the expected sizes once impurities and by-products had been removed.


Table 2.2.1. Target markers, amplicon lengths, primer directions, primer names, primer sequences, melting temperatures for extended primers. Modifications to primers from other studies are underlined.

Marker  Amplicon length (bp)  F/R  Name      Sequence (5’ – 3’)                         Melting temperature (ºC)
matK    308                   F    matK2-F   AAGGATCCTTTCATGCATTATRTTMGRTATCAAGGAA      60.9
matK    308                   R    matK2-R   NGYCCAAAYNGGYTTACTAATRGGATRYCC             62.2
rbcL    221                   F    rbcL2-F   YGATGGACTTACNAGTCTTGATCGTTACAAAGG          61.4
rbcL    221                   R    rbcL2-R   GNCCATAYTTRTTCAATTTATCTCTTTCAACTTGGATNCC   60.2
ITS2    115-119               F    ITSn-F    AYGACTCTCGGCAACGGATATCTTGG                 61.1
ITS2    115-119               R    ITSn-R    CCCAVGCAGRCDTGCCC                          61.3
rbcL*   550                   F    rbcLa-F   ATGTCACCACAAACAGAGACTAAAGCAAGTG            60.4
rbcL*   550                   R    rbcLa-R   TCATCYTTGGTAAAATCAAGTCCACCRCG              61.9
ITS2*   300-460               F    ITS-S2F*  GCGAAATGCGATACTTGGTGTGAATTGC               62.0
ITS2*   300-460               R    ITS4*     CCTTGTAAGTTTCTTTTCCTCCGCTTATTGATATGC       61.5

*Modified from Fahner et al. (2016).
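Several primers in Table 2.2.1 contain IUPAC ambiguity codes (for example R, Y, M, V, D, N), each of which stands for a mixture of bases at that position. As an illustration, the Python sketch below computes the fold degeneracy of a primer, that is, the number of distinct oligonucleotides it represents. The function names are hypothetical and the calculation is not taken from the thesis pipeline; the two primer sequences are the ITSn-R and matK2-R entries from the table.

```python
from itertools import product

# IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def degeneracy(primer):
    """Number of distinct sequences represented by a degenerate primer."""
    fold = 1
    for base in primer.upper():
        fold *= len(IUPAC[base])
    return fold

def expand(primer):
    """List every non-degenerate sequence a degenerate primer represents.

    Only practical for modest degeneracy, since the list grows
    multiplicatively with each ambiguity code.
    """
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in primer.upper()))]

itsn_r = "CCCAVGCAGRCDTGCCC"
matk2_r = "NGYCCAAAYNGGYTTACTAATRGGATRYCC"

itsn_r_fold = degeneracy(itsn_r)    # V (3) x R (2) x D (3)
matk2_r_fold = degeneracy(matk2_r)  # two N, four Y, two R
itsn_r_variants = expand(itsn_r)
```

Higher degeneracy widens the range of templates a primer can match, at the cost of diluting each individual oligonucleotide in the reaction; the sketch only quantifies the first half of that trade-off.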

A second round of PCR was performed to attach the Illumina sequencing adapters to the purified, amplified product. The adapters contain an 8bp index within their sequence that is required to demultiplex pooled samples in bioinformatics applications, so a unique set of indices was used for the control and wild samples. This PCR cocktail included 25μL 2x KAPA HiFi HotStart ReadyMix, 5μL Index 1 primer (N7XX), 5μL Index 2 primer (S5XX), 10μL PCR grade water, and 5μL of product from the first PCR amplification. The cycling conditions included an initial denaturation step at 95ºC for three minutes, followed by eight cycles of 95ºC for 30 seconds, 55ºC for 30 seconds, and 72ºC for 30 seconds, with a final extension step at 72ºC for five minutes, and a hold at 4ºC.

Each amplified, purified PCR product (50μL) was run on a 1% agarose gel at 70V for 1 hour, and DNA bands were visualized using a UV light and excised. DNA was extracted from excised gel slices using the Wizard® SV Gel and PCR Clean-Up System (Promega). Extracted DNA was quantified using a Qubit dsDNA HS Assay Kit (Life Technologies). The final amplified and purified control sample concentrations were 1.47 ng/µL, 5.8 ng/µL, 4.2 ng/µL, 7.08 ng/µL, and 9.3 ng/µL for omatK2, orbcL2, oITSn, orbcLa, and oITS2, respectively. All samples were pooled in equal concentrations before being submitted for sequencing at The Centre for Applied Genomics (TCAG; Toronto, ON, Canada). Amplified eDNA was sequenced as a spike-in, meaning that the sample was loaded onto ¼ or ½ of a lane, at TCAG on the Illumina HiSeq 2500 to generate 125bp paired-end reads. Bioanalyzer results from TCAG are available in Appendix D, Figure D.1.
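The pooling step can be illustrated with a short calculation. The sketch below assumes that pooling "in equal concentrations" means contributing an equal mass of DNA from each library; the 10 ng target is a hypothetical value chosen for illustration, since the amount pooled is not stated here. Equimolar pooling would additionally require each amplicon length.

```python
# Concentrations reported in the text (ng/µL), keyed by primer pair.
conc_ng_per_ul = {
    "omatK2": 1.47,
    "orbcL2": 5.8,
    "oITSn": 4.2,
    "orbcLa": 7.08,
    "oITS2": 9.3,
}

# Hypothetical mass of DNA taken from each library (assumption, not from the thesis).
TARGET_NG = 10.0


def pooling_volumes(conc, target_ng):
    """Return the volume (µL) of each library that contains target_ng of DNA."""
    return {name: target_ng / c for name, c in conc.items()}


volumes = pooling_volumes(conc_ng_per_ul, TARGET_NG)
for name, vol in volumes.items():
    print(f"{name}: {vol:.2f} µL")
```

Under this assumption the most dilute library (omatK2) needs the largest volume, which in practice limits how much DNA each library can contribute to the pool.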


(d) Bioinformatics

Quality of the sequencing reads was determined using FastQC (Andrews 2010). The OBITools bioinformatics package was used to further analyze the sequences to determine which plant species could be assigned to each sample (see Appendix E for pipeline). The forward and reverse reads were assembled and assigned to a corresponding primer combination, creating a separate file for each primer pair. These sequences were then dereplicated into unique reads, counted and filtered by specified sequence counts and amplicon lengths, and cleaned to remove low count sequences that are likely results of PCR or sequencing error. The final cleaned sequencing data was then assigned taxonomic units by conducting a local BLAST search, with the highest GenBank match being reported regardless of the percent match. The resulting taxonomic, sequence, and count data were converted into a tab-delimited text file allowing for downstream manipulation.
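The dereplication and filtering steps described above can be sketched in a few lines. This is plain Python written for illustration, not the OBITools commands used in the pipeline (Appendix E); the function name, thresholds, and example reads are all hypothetical.

```python
from collections import Counter


def dereplicate_and_filter(reads, min_count, min_len, max_len):
    """Collapse identical reads, then keep unique sequences that are
    abundant enough and fall within the expected amplicon length range."""
    counts = Counter(reads)  # unique sequence -> number of identical reads
    return {
        seq: n
        for seq, n in counts.items()
        if n >= min_count and min_len <= len(seq) <= max_len
    }


# Toy assembled reads (hypothetical): one abundant sequence, one singleton
# that could be a PCR or sequencing error, and one fragment that is too short.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTACGA", "ACG"]
kept = dereplicate_and_filter(reads, min_count=2, min_len=5, max_len=10)
print(kept)  # {'ACGTACGT': 3}
```

Filtering on count removes singletons that most likely arise from error, while the length window removes primer-dimers and chimeric products that fall outside the expected amplicon size.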


2.3 Results

(a) eDNA marker development and in silico analysis of database sequences

Primer design

For the final novel matK primers (omatK2), the pair with degenerate bases amplified 20 percentage points more species (17/25, 68%) than the non-degenerate version (12/25, 48%). For novel rbcL primers (orbcL2), both pairs amplified the same number of species (17/25); the version with degenerate bases was selected for downstream use under the assumption that it would anneal to DNA from a higher number of species, which is desirable in a metabarcoding context. For novel ITS2 primers (oITSn), the pair with degenerate bases also amplified 20 percentage points more species (19/25, 76%) than the non-degenerate version (14/25, 56%).

Six species were not amplified by any of the primer pairs: vinifera, Bidens beckii, Potamogeton strictifolius, Nepeta cataria, S. canadensis, and Cirsium pitcheri (Table 2.3.1). Apart from P. strictifolius and B. beckii, all of these species are terrestrial plants. In silico inspection of the primer-binding sequences of the species that did not amplify revealed no more mismatches than were present in the species that amplified successfully, with two exceptions: S. canadensis has four mismatches in the forward omatK2 primer sequence and seven mismatches in the reverse omatK2 primer sequence, and N. cataria has eight mismatches in both the forward and reverse primer-binding regions for omatK2.
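Counting primer-template mismatches in silico has to account for degenerate bases, since a degenerate position matches any of the bases it encodes. The sketch below uses the standard IUPAC nucleotide codes and the oITSn reverse primer sequence from the primer table; the two binding sites are hypothetical and are written 5'-3' in the same orientation as the primer. The function names and the two-base 3' window are illustrative assumptions.

```python
# IUPAC nucleotide codes mapped to the bases each one permits.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mismatch_positions(primer, site):
    """0-based positions where the site base is not permitted by the primer code."""
    if len(primer) != len(site):
        raise ValueError("primer and binding site must be the same length")
    return [
        i
        for i, (p, s) in enumerate(zip(primer.upper(), site.upper()))
        if s not in IUPAC[p]
    ]


def three_prime_mismatches(primer, site, window=2):
    """Mismatches that fall within the last `window` bases at the 3' end."""
    cutoff = len(primer) - window
    return [i for i in mismatch_positions(primer, site) if i >= cutoff]


primer = "CCCAVGCAGRCDTGCCC"    # oITSn reverse primer as listed in the primer table
site_ok = "CCCAAGCAGACGTGCCC"   # hypothetical site compatible with every position
site_bad = "CCCATGCAGACGTGCCT"  # hypothetical site with two incompatible bases

print(mismatch_positions(primer, site_ok))       # []
print(mismatch_positions(primer, site_bad))      # [4, 16]
print(three_prime_mismatches(primer, site_bad))  # [16]
```

Reporting 3' mismatches separately reflects the point made in the Discussion that mismatches near the 3' end are the most detrimental to amplification.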

The pre-existing rbcL and ITS2 terrestrial plant metabarcoding primers (orbcLa and oITS2) amplified DNA from 48% and 52% of the species, respectively (Table 2.3.1). The novel primer pair oITSn appeared to amplify DNA from more species than any of the plastid markers, producing bands of the expected length for 19 out of 25 plant species (Table 2.3.1); however, the short amplicon lengths (115-119bp) were difficult to distinguish from potential primer-dimers (Appendix C, Figure C.2). The remaining primers orbcL2, omatK2, oITS2, and orbcLa amplified 17, 17, 13, and 12 species, respectively (Table 2.3.1).


Table 2.3.1. Amplification results from all 5 final, optimized primer pairs. Grey squares represent positive amplification, inferred from a band observed at expected amplicon size on agarose gel. Question marks (?) denote possible primer-dimers. The expected amplicon size of oITSn is 115bp-119bp, and the forward and reverse primers are 27bp and 16bp, respectively. Two out of seven terrestrial species produced bands at the expected amplicon length for one or more primer pairs, and 17 out of 18 aquatic or wetland species produced bands at the expected amplicon length for one or more primer pairs.

Species; Aquatic (A)/Wetland (W)/Terrestrial (T); Amplification Results: omatK2, orbcL2, orbcLa¹, oITSn, oITS2¹
T Typha latifolia A/W Iris pseudacorus A ? A Stratiotes aloides A A Potamogeton strictifolius A Elodea canadensis A Vallisneria americana A A Hydrocharis morsus-ranae A Phragmites australis W Isoetes engelmannii A ? Schoenoplectus acutus W ? Ceratophyllum demersum A ? Nepeta cataria T Pontederia cordata A ? T Chara vulgaris A acuminata T ? Asclepias syriaca T ? Nymphaea odorata A ? Eichhornia crassipes A ? Cabomba caroliniana A T
Total species amplified: omatK2 17, orbcL2 17, orbcLa 12, oITSn 19, oITS2 13
¹Modified from Fahner et al., 2016.


(b) Control and wild sample metabarcoding and bioinformatics

Control sample bioinformatics

A total of 46,158,260 paired-end reads from Illumina HiSeq 2500 were filtered and demultiplexed using a modified OBITools pipeline. Of these reads, 13,042,135 (28.25%) were assigned to a primer pair, meaning that they contained both the forward and reverse primer for one of the primer pairs. Of these 13,042,135 assigned sequences, 67.1% contained the omatK2 forward primer, 9.9% contained the orbcL2 forward primer, 2.85% contained the orbcLa forward primer, 2.95% contained the oITSn forward primer, and 17.2% contained the oITS2 forward primer. BLAST searches on GenBank showed that a large proportion of sequences attributed to the omatK2 primers were a result of bacterial sequences contaminating the control sample that happened to have the omatK2 forward primer sequence repeated throughout their genome.

The omatK2 primers yielded 371 unique sequences, orbcL2 primers yielded 1191 unique sequences, orbcLa primers yielded 436 unique sequences, oITSn primers yielded 191 unique sequences, and oITS2 primers yielded 5391 unique sequences. The resulting sequence data was searched against the GenBank database using the modified OBITools pipeline to record which plants were identified at the family, genus, and species level.

Total correct taxonomic classification at the species level was highest for omatK2 and oITS2 at 19/45 species (42%), followed by orbcLa at 15/45 species (33%), orbcL2 at 14/45 species (31%), and oITSn at 11/45 species (24%). By combining the results from omatK2 and oITS2, the two with the largest number of plant sequences, 58% of species were identified. Using all five primers increases species identification to 62%. DNA from Verbascum thapsus, Shepherdia canadensis, , and Linaria vulgaris was not identified in the control sample by any primer pair. Some of the species detected were not placed in the original control sample (Appendix D, Tables D.7-11); however, in some cases they belong to the same genus or family as species that were placed in the control sample. The orbcL2 primers identified the highest number of species that were not placed in the control sample (n=128), and omatK2 primers identified the fewest species that were not placed in the control sample (n=23) (Appendix D, Tables D.7-11).
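The species-level rates follow directly from the per-primer totals in Table 2.3.2 (aquatic species out of 25, terrestrial species out of 20). A minimal sketch of that arithmetic, with variable and function names chosen for illustration:

```python
TOTAL_AQUATIC = 25
TOTAL_TERRESTRIAL = 20

# (aquatic identified, terrestrial identified) per primer pair, from Table 2.3.2.
identified = {
    "omatK2": (16, 3),
    "orbcL2": (13, 1),
    "orbcLa": (13, 2),
    "oITSn": (11, 0),
    "oITS2": (15, 4),
}


def species_level_rate(aquatic, terrestrial):
    """Percentage of all control-sample species identified to species level."""
    total = TOTAL_AQUATIC + TOTAL_TERRESTRIAL
    return 100 * (aquatic + terrestrial) / total


rates = {p: species_level_rate(a, t) for p, (a, t) in identified.items()}
for primer, (a, t) in identified.items():
    print(f"{primer}: {a + t}/45 = {rates[primer]:.1f}%")
```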


Table 2.3.2. List of all species placed in the control sample. Black means the target species was successfully detected; dark grey means one or more congeneric species were detected; and light grey means that one or more confamilial species were detected.

Species A/T/W omatK2 orbcL2 orbcLa oITSn oITS2
calamus W Anemone canadensis T Asclepias incarnata T Asclepias syriaca T T Cabomba caroliniana A Carex richardsonii T Cirsium pitcheri T Ceratophyllum demersum A Eichhornia crassipes A Elodea canadensis A dioicus T Hydrocharis morsus-ranae A Impatiens capensis T Iris pseudacorus A Linaria vulgaris T Lycopus uniflorus T Magnolia acuminata T beckii (Bidens beckii) A Myriophyllum aquaticum A A Myriophyllum spicatum A Nepeta cataria T Nuphar variegata A Nymphaea odorata A peltata A Phragmites australis W Pistia stratiotes A Pontederia cordata A Potamogeton crispus A Potamogeton strictifolius A idaeus T Schoenoplectus acutus W Shepherdia canadensis T canadensis T


Solanum dulcamara T Sporobolus T Stratiotes aloides A Trapa natans A T Typha latifolia A Typha minima A Vallisneria americana A Verbascum thapsus T Vincetoxicum rossicum T
Total identified species: omatK2 A: 16/25, T: 3/20; orbcL2 A: 13/25, T: 1/20; orbcLa A: 13/25, T: 2/20; oITSn A: 11/25, T: 0/20; oITS2 A: 15/25, T: 4/20


[Figure 2.3.1 image: shaded squares for each plant order (rows) under each primer pair (columns: omatK2, orbcL2, orbcLa, oITSn, oITS2), on a scale from 0% to 100%. Row labels recovered from the figure include Magnoliales (1), Poales (6), Asparagales (1), Acorales (1), Ceratophyllales (1), Ranunculales (1), Rosales (2), Fabales (2), Saxifragales (3), Ericales (1), Gentianales (3), and Lamiales (4).]

Figure 2.3.1. Phylogeny modified from APG III (2009) representing the percentage of genera from the control eDNA sample that were identified from HTS. Intensity of the squares is proportional to the percentage of genera that were correctly identified following the assay. Numbers in parentheses represent the number of species of each order that were placed in the control sample.


Single- versus multi-locus approach

When only a single locus was used to identify species, omatK2 correctly identified the largest number of aquatic species in the control sample (Table 2.3.2). Considering both aquatic and terrestrial species, a combination of oITS2 + omatK2, oITS2 + orbcL2, or oITS2 + orbcLa provided the largest number of species-level identifications (Appendix D, Table D.6). Focusing on aquatic species-level detections, omatK2 + orbcL2, omatK2 + oITS2, oITS2 + orbcL2, and oITS2 + orbcLa combinations performed similarly to one another (Appendix D, Table D.6). Combining omatK2, orbcL2, and oITS2 increases aquatic species-level detections to 22/25 species. Although this is only one extra species identified compared to the oITS2 + orbcL2 combination, it provides a more complete community survey and further confidence in the species that were identified by multiple primer pairs. Adding a second and third target region to a metabarcoding assay does not require significant additional time or resources in terms of library preparation, but provides a significant increase in taxonomic resolution and confidence in results.
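Comparing single- and multi-locus approaches amounts to taking the union of the species detected by each primer pair in a combination. The sketch below shows that calculation on toy data: the species labels (sp1 to sp5) and the detection sets are hypothetical placeholders, not the per-species results in Appendix D, Table D.6.

```python
from itertools import combinations


def best_combinations(detections, k):
    """Return the k-primer combinations that detect the most species,
    together with the number of species they detect."""
    best, best_n = [], -1
    for combo in combinations(sorted(detections), k):
        n = len(set().union(*(detections[p] for p in combo)))
        if n > best_n:
            best, best_n = [combo], n
        elif n == best_n:
            best.append(combo)
    return best, best_n


# Hypothetical detections per primer pair (placeholders, not thesis data).
toy = {
    "omatK2": {"sp1", "sp2", "sp3"},
    "orbcL2": {"sp2", "sp4"},
    "oITS2": {"sp3", "sp4", "sp5"},
}

print(best_combinations(toy, 1))
print(best_combinations(toy, 2))  # ([('oITS2', 'omatK2')], 5)
```

Because ties are kept rather than discarded, the function reports every combination that performs equally well, which matches the observation that several two-locus combinations gave similar numbers of detections.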

Bioinformatics – wild samples

A total of 23,204,881 paired-end reads were filtered and demultiplexed using a modified OBITools pipeline. In total, 8,777,622 sequences (37.8%) were attributed to samples according to primers, meaning that the forward and reverse primer sequences of a given primer pair were found in the assembled forward and reverse read. Of these reads, 4.2% were assigned to omatK2 primers, 1.6% were assigned to orbcLa primers, 23.6% were assigned to orbcL2 primers, 30.9% were assigned to oITSn primers, and 39.7% were assigned to oITS2 primers. The resulting trimmed sequence data was searched against the GenBank database using the modified OBITools pipeline to record which plants were identified at the family, genus, and species level. The oITSn primer results did not return any plant species.

A preliminary filtering of BLAST results was done by removing terrestrial plants, algae, fungi, bacteria, and other non-target taxa to obtain a list of only aquatic plant species. Black River terrestrial plant identifications were omitted from results as there are no data to support which terrestrial plants are found in this area, and the purpose of this study was to characterize aquatic plant communities. A total of 21 aquatic plant species were identified from the Black River sequences, with 13 identified by omatK2 primers, five identified by oITS2 primers, four identified by orbcLa primers, and 12 identified by orbcL2 primers (Table 2.3.3). Eight species were identified by two or more primer pairs. All of the aquatic plant species that were detected, with the exception of O. acuminata, have been recorded in Ontario and are known to be currently or recently present in the Black River (M. Shapiera and W. Wegman, OMNRF, pers. comm.).


Table 2.3.3. List of aquatic plants identified from the Black River eDNA sample using all five primer pairs.

Latin name    Common name    Primers (omatK2, oITS2¹, orbcLa¹, orbcL2)
Myriophyllum spicatum    Eurasian watermilfoil
Myriophyllum sibiricum    Northern watermilfoil
Acorus calamus    Sweet flag
Stratiotes aloides    Water soldier
Najas spp.*    Water-nymphs
Elodea canadensis    Canadian waterweed
Elodea bifoliata    Twoleaf waterweed
Butomus umbellatus    Flowering rush
Pistia stratiotes    Water lettuce
Iris spp.*    Yellow iris
Nuphar spp.*    Water-lily
Nymphaea odorata    White waterlily
Potamogeton crispus    Curly-leaf pondweed
Schoenoplectus acutus    Hardstem bulrush
Typha spp.*    Cattail
Nymphoides peltata    Yellow floating heart
Ceratophyllum spp.*    Spineless hornwort
Vallisneria americana    Eelgrass
Sagittaria latifolia    Broadleaf arrowhead
Sparganium angustifolium    Floating bur-reed
Ottelia acuminata    -
*spp. was used where multiple species of the same genus had the same level of identification in a GenBank BLAST search.
¹Modified from Fahner et al. (2016).


2.4 Discussion

This thesis was largely method development and proof of concept and, as such, any deviation from expected results warrants discussion. In this study, an eDNA metabarcoding assay was developed with the goal of characterizing the species within aquatic plant communities. Three novel primer pairs were designed and tested along with two primer pairs that had been previously designed for terrestrial plant metabarcoding (Fahner et al., 2016). All primer pairs successfully amplified DNA from aquatic plant species, but varied in the number of species they were able to identify either because of failed amplification or misidentification. This study revealed that it is possible to identify aquatic plants in Ontario by metabarcoding eDNA, and that a multi-locus approach is optimal for identifying the highest number of species. This study contributes to aquatic eDNA research in a direction that has not yet been explored: using metabarcoding to simultaneously amplify plant DNA from multiple aquatic plant species, thereby determining community composition.

In silico primer design and validation on single species

One goal when designing primers for metabarcoding is to target a conserved gene region in order to amplify DNA from a large number of species. However, even highly conserved regions will show some variability when targeting multiple taxa, and a completely conserved primer-binding sequence is not possible for all species (Schmalenberger, Schwieger, & Tebbe, 2001). For this reason, it is important to target multiple gene regions. Determining in silico whether primers are complementary to target species requires reference sequences in which mismatches in the primer-binding sequence can be observed; without them, it is impossible to conclude whether the primer pairs should be complementary to the target species. Novel primer pairs were designed to amplify three gene regions specifically targeting aquatic plant species. All three amplified and identified species in a control eDNA sample with varying success, and two successfully amplified and identified DNA sequences from a natural environmental sample.

Designing primers for plant metabarcoding often proves difficult compared to animal metabarcoding studies (Fazekas et al., 2009). Three main issues are associated with selecting plant metabarcoding markers: high levels of intraspecific variation in the primer-binding region or in the amplicon, leading to potential challenges with amplification or taxonomic assignment (Goulet, Roda, & Hopkins, 2017; Razafimandimbison, Kellogg, & Bremer, 2004); the prevalence of hybrids that cannot be identified using markers from a single genome (Goulet et al., 2017); and a lack of interspecific variation (Hollingsworth et al., 2009). Many plant markers display intraspecific sequence polymorphisms across a range of taxa, and if all conspecific DNA sequences are not represented in a reference database, proper taxonomic assignment may not be possible. Many aquatic plant species occur in multiple geographic regions due to introduction of alien species globally or due to naturally wide ranges, further increasing the chance of observing conspecifics with varying DNA sequences across multiple loci. There are approximately 412 genera of vascular macrophytes containing aquatic species, of which only about 39% are endemic to a single biogeographic region (Chambers et al., 2008). Because of potentially high variation, metabarcoding studies typically group their sequences into discrete clusters called ‘bins’ based on a level of expected similarity, and these bins can be assigned to taxonomic groups (Vincent et al., 2017). In this study, alignments of control sample species at all target regions displayed considerably greater interspecific than intraspecific variation. Moreover, amplicon lengths for omatK2, orbcL2, orbcLa, and oITSn primers did not vary by more than ~5bp across species included in the database, so any observed intraspecific variation was represented by mutations rather than large insertions or deletions. The target amplicon for oITS2 primers ranged from 300-460bp (Fahner et al., 2016); however, alignments did not display high levels of intraspecific variation. A limitation in this regard was that there was only one sequence included in the database for each of the species involved in primer development, or in some cases there was no sequence data available for species in the control sample at certain gene regions. Lacking reference sequences at the primer design and in silico testing stages, or having too few reference sequences, may lead to an underrepresentation of the level of intraspecific variation found at given gene regions.

Primers with degenerate bases amplified a larger number of species (up to 20 percentage points more) than the non-degenerate versions in single-species amplification reactions. With the exception of P. strictifolius, all of the species that did not amplify with any of the primer pairs were terrestrial plants; however, the available alignments did not show higher levels of mismatched primer-template nucleotides for terrestrial species compared to aquatic species, and terrestrial species were no more likely to have mismatches in the 3’ end. Studies have quantified the effects of different types of mismatches and found that mismatches at the 3’ end are most detrimental to PCR amplification success (Huang, Arnheim, & Goodman, 1992; Stadhouders et al., 2010). The effects that mismatches on the 5’ end or middle of primers have on amplification are challenging to quantify, as primers vary in length across studies and primers with degenerate bases have potentially variable optimal annealing temperatures (Green, Venkatramanan, & Naqib, 2015). Primer-template mismatches could also have varying impacts on amplification success due to the placement of the mismatched nucleotides in the primer-binding sequence (Piñol, Mir, Gomez-Polo, & Agustí, 2015); e.g., four mismatched nucleotides spread evenly throughout the primer-binding region (with the exception of the 2 nucleotides at the 3’ end) may have different impacts compared to four consecutive mismatched nucleotides in the center of the primer-binding region.

When comparing the amplification results of the first set of de novo primers to those of the corresponding degenerate versions, there appears to be an ‘all or nothing’ pattern in all primers with the exception of FW-matK2 (Appendix C, Table C.3), meaning that most species that had DNA amplify were amplified by both the original and degenerate primers. Single-species amplifications in this study used extracted DNA as a template, with some extractions being several years old. Of the six species that did not amplify with any of the primer pairs, P. strictifolius and C. pitcheri DNA was retrieved from previous studies. DNA degrades over time even when frozen, so long-term storage of DNA samples leads to a decrease in DNA concentration, and DNA degrades when exposed to multiple freeze-thaw cycles due to shearing from ice crystal formation (Rossmanith, Röder, Frühwirth, Vogl, & Wagner, 2011). For the other samples that did not amplify with any of the primer pairs but were more recently extracted, there could have been an error in the DNA extraction process; for example, a low starting amount of tissue might not generate enough template DNA for primers to bind to, a possibility supported by visualizing total DNA on an agarose gel (Appendix B, Figure B.1).


Control sample results

Positive identifications

A control eDNA sample was created in the lab by combining clippings of locally collected plant tissue and pre-extracted DNA from various terrestrial and aquatic plant species. The phylogeny of the species included in the control sample (Figure 2.3.1) showed that amplification and taxonomic assignment were possible across a range of taxonomic groups. For omatK2 (novel), orbcL2 (novel), and orbcLa (pre-existing), there is a relatively high number of identifications for the monocots compared to the eudicots, whereas species identified by both ITS primers are more evenly spread out across angiosperm groups (Figure 2.3.1). The aquatic and terrestrial species found in the control sample are spread throughout the phylogenetic tree, demonstrating that aquatic plants are not necessarily more closely related to one another than they are to terrestrial species (Bremer et al., 2009). This suggests that if the novel primers designed for this study perform better in terms of the number of species that are amplified and identified compared to primers used in other plant metabarcoding projects, these novel primers may also perform better in other projects; however, there was a relatively lower proportion of positive identifications of terrestrial species compared to aquatic species from the control sample.

For all of the novel and pre-existing primers, up to 50% more aquatic species were identified when compared to terrestrial species. This is comparable to the individual species amplifications where, on average, 76% of aquatic species produced a band in gel electrophoresis at the expected amplicon size across all five primer pairs, whereas an average of 14% of terrestrial species produced bands across all five primer pairs. Four species in the control sample could not be identified by any of the primer pairs to a conspecific, congeneric, or confamilial, and it is unclear whether this was due to lack of reference sequences or the fact that they did not amplify. Furthermore, seven terrestrial species were identified only to the family-level, meaning that a confamilial was identified from the metabarcoding sequences. Of these seven species, three had other confamilials identified to the species level, and the remaining four species did not have any confamilial species identified (Table 2.3.2; Appendix A, Table A.2; Appendix D, Tables D.7-11). With the exception of A. incarnata and C. pitcheri, which do not have GenBank coverage for ITS2 and matK, respectively, the other terrestrial species that were identified only to the family-level had database coverage for all three gene regions, meaning a species-level identification was theoretically possible. In comparison, a single aquatic species, Hydrocharis morsus-ranae, was only identified by confamilials. This species is a member of the Hydrocharitaceae family that had three other species present in the control sample (Appendix A, Table A.2), each of which was identified to the species-level by between three and five of the amplified gene regions. The H. morsus-ranae DNA was likely not amplified because the DNA added to the control sample came from a previous study and might have been highly degraded.

Considering the relatively low number of terrestrial species that were positively identified in the control sample relative to the number of terrestrial species that were included in the control sample, terrestrial plants might have shed less DNA than aquatic plants in the water. Organisms shed DNA at different rates (Barnes et al., 2014), and terrestrial species likely decompose in water at different rates than aquatic species, therefore shedding DNA into water at a different rate. Similarly, DNA persistence in the environment differs across taxa (Dejean et al., 2011).

Amplification success of the novel primers designed specifically for aquatic plants is comparable to that of pre-existing primers that had been designed for vascular plant metabarcoding in terms of their ability to amplify and discriminate between aquatic plants (Table 2.3.2). The most successful metabarcoding primer pair with respect to the aquatic plants included in this study, omatK2, generates an amplicon of ~308bp. Previous researchers have investigated the recovery success of different amplicon lengths from environmental samples, with amplicons up to 840bp being successfully recovered (Fahner et al., 2016), although many studies aim to design metabarcoding primers that target a fragment under 300bp (Deiner et al., 2017; Freeland, 2017). The section of the matK region targeted by omatK2 primers is a short fragment that, according to results from other metabarcoding studies (Op De Beeck et al., 2014), is likely to be recovered from an environmental sample. The omatK2 amplicon has higher amplification and discriminatory power than orbcL2, orbcLa, and oITSn, determined by the number of correct taxonomic assignments. In past studies, matK markers have performed more poorly than markers like rbcL and ITS2 in terms of the number of reads retained following raw sequence filtering (Fahner et al., 2016) and the number of sequences matching to the reference database (CBOL Plant Working Group et al., 2009). Although the goal with metabarcoding is to target theoretically all species of the focal group of organisms, a higher number of species identifications does not directly correlate with a high number of correct identifications. In this case, omatK2 can be considered the most efficient marker, having the lowest number of unique sequences and therefore lowest database matches, but also the highest number of correct matches. This highlights the utility of a control sample with a known number of DNA contributors when designing and testing novel primers, or using pre-existing primers in new metabarcoding applications.

With respect to aquatic species, one primer combination allowed for 21 out of 25 (84%) species-level identifications: oITS2 and orbcL2. A three-locus approach combining results from omatK2, orbcL2, and oITS2 increased aquatic species identifications to 22 out of 25 (88%). Using all five primers still identified only 23 out of 25 aquatic species. As a comparison, even the best primer combinations for terrestrial species identifications only identify 5 out of 20 (25%) species. Overall, these results do support the findings from Fahner et al.'s (2016) soil metabarcoding study that the combination of rbcL and ITS2 offers high levels of recovery and taxonomic resolution. However, Fahner et al. (2016) found that their matK marker did not detect any unique taxa, whereas omatK2 primers in this study performed best individually against all other primers, identifying the largest number of individual aquatic species. Furthermore, the region of rbcL targeted by the novel orbcL2 primers designed for aquatic species (221bp) performed better than orbcLa primers designed for soil metabarcoding (550bp), with respect to aquatic eDNA metabarcoding. This may be due to higher recovery rate of shorter fragments from aquatic samples, or a higher level of specificity to aquatic plants.

In eDNA amplification, there may be preferential amplification of shorter fragments if DNA recovered from environmental samples is highly degraded (Dejean et al., 2011; Wei, Nakajima, & Tobino, 2018). The orbcLa primers targeted the longest amplicon at ~550bp. Assuming this region is sufficiently divergent among species, a longer fragment can provide greater taxonomic resolution (Wei et al., 2018). Of all positively identified aquatic species, 71% of species amplified at fragments between ~115-221bp, and 92% of species amplified at fragments ranging from ~300-550bp. The shortest amplicon length was from the oITSn primer pair generating a ~115bp fragment. These are relatively short markers in a barcoding context (Kress, Wurdack, Zimmer, Weigt, & Janzen, 2005), but standard lengths for eDNA metabarcoding assays (Fahner et al., 2016; Richardson et al., 2015; Yang et al., 2016). Compared to metabarcoding markers that have been used in other studies, the target regions from this study should all theoretically amplify from eDNA samples; however, there may still be preferential amplification of shorter fragments, especially if environmental samples are extensively degraded. Fewer species were identified by orbcLa primers compared to other primer pairs, likely because the target amplicon is longer, and if DNA is degraded, longer fragments of DNA are less likely to be recovered from an environmental sample (Jo et al., 2017; Wei et al., 2018). Ma et al. (2016) investigated the amplification success and specificity of 12 primer pairs, with amplicon lengths ranging from 76-249bp, to detect the endangered Yangtze finless porpoise from eDNA and found that there was no significant difference in recovery of these amplicons. Another study compared the recovery of two COI markers from sediment samples, with lengths of 126bp and 358bp, and found that the 126bp marker generated more positive replicates in PCR amplification (Wei et al., 2018). In a terrestrial plant metabarcoding study conducted by Fahner et al. (2016), four plant markers were targeted with amplicons ranging from 10-840bp, and there were no statistical differences in raw read counts. However, a high proportion of high quality reads assigned to the longer amplicon, matK, may have been sequencing or PCR artifacts, as these sequences returned fewer matches with the reference database. Although a wide range of amplicon lengths can be recovered from potentially degraded environmental samples, many studies have found that shorter fragments generally result in superior recovery.

Missing species identifications

There are potential challenges at the library preparation and bioinformatics stages that might prevent species identifications for members of a mock community. Sigut et al. (2017) tested the ability of metabarcoding to identify insect larvae and their known parasitoids from mock communities, and compared the number of correctly identified taxa against classic barcoding and morphological techniques. Approximately 40% of parasitoids and 90% of insects used in the mock communities were assigned correct species-level identification using current reference libraries, suggesting that a more complete reference database may lead to more species identifications; alternatively, the remaining taxa may not have amplified due to primer-template mismatches. Geisen et al. (2015) created a mock community of free-living soil protists and found that metabarcoding resulted in an overrepresentation of ciliates, and other protists were either underrepresented or absent from the sequence data. This indicates possible differences in binding affinity of primers across taxa, and when designing primers for metabarcoding such a large assortment of taxa, it is impossible to know how many species will be amplified. Reference databases may include erroneous sequences if source organisms are incorrectly identified, or may lack sequence data for target regions; both scenarios would prevent correct species identifications (MacDonald & Sarre, 2017). These factors can potentially affect amplification success in metabarcoding applications, and highlight the limitations associated with trying to design highly universal primers.

Potamogeton strictifolius is the only unidentified species that does not have rbcL sequence coverage in GenBank. This species was not identified by either of the rbcL primers, but the Potamogeton genus was detected in the control sample by both orbcL2 and orbcLa; however, this does not provide any indication whether P. strictifolius DNA is being amplified as there are other congeneric species in the sample. For ITS2, A. incarnata, C. pitcheri, E. crassipes, M. acuminata, P. stratiotes, and P. cordata do not have sequence coverage in GenBank, therefore none of these species could be identified by oITSn or oITS2 (Table 2.3.2). These discrepancies reinforce the importance of a complete reference database, and without one it is not possible to obtain identifications for every species in an environmental sample.

Erroneous identifications

Results from the control sample showed that the omatK2 primers had the fewest incorrect species identifications, meaning that other primers identified more species that had not actually been in the control sample, henceforth non-target species (Appendix D, Tables D.7-11). Thirteen non-target aquatic or wetland species, and 10 non-target terrestrial species were identified following a local BLAST search of the omatK2 metabarcoding sequences. Of the 13 non-target aquatic plant species, eight were congeneric with one or more species that were placed in the control sample. Of the 10 non-target terrestrial plant species, only two were congeneric with other species that were placed in the control sample. The bioinformatics pipeline assigns the top match from GenBank to each sequence regardless of the match percentage, but all species identified, whether they were target or non-target species, had a match of 95% or higher to the reported GenBank match. Of the control sample species, four did not have matK coverage in GenBank, indicating that erroneous species identifications may be due to missing sequences in the reference database, in which case the next best match would be reported in a BLAST search, or due to errors in the database which may lead to a species erroneously being labelled as a congeneric.
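The top-match assignment described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this thesis: it assumes BLAST results in the standard 12-column tabular format (percent identity in column 3, bit score in column 12), and the function name `top_hits`, the `min_identity` option, and the example reads and subject labels are all hypothetical. Unlike a first-listed top hit, the sketch keeps every subject tied for the best bit score, so that equally good congeneric matches remain visible.

```python
def top_hits(blast_lines, min_identity=0.0):
    """Collect the best-scoring subject(s) per query from BLAST tabular lines.

    Assumes the standard 12-column tabular layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    Subjects tied for the highest bit score are all retained.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        pident, bitscore = float(fields[2]), float(fields[11])
        if pident < min_identity:
            continue  # optional identity cut-off (hypothetical, not in the thesis pipeline)
        current = best.get(query)
        if current is None or bitscore > current["bitscore"]:
            best[query] = {"subjects": [subject], "bitscore": bitscore}
        elif bitscore == current["bitscore"]:
            current["subjects"].append(subject)
    return best


# Hypothetical BLAST output: read1 matches two congenerics equally well.
EXAMPLE_LINES = [
    "read1\tgenusX_sp1\t99.0\t188\t2\t0\t1\t188\t1\t188\t1e-90\t340.0",
    "read1\tgenusX_sp2\t99.0\t188\t2\t0\t1\t188\t1\t188\t1e-90\t340.0",
    "read1\tgenusX_sp3\t96.0\t188\t7\t0\t1\t188\t1\t188\t1e-80\t310.0",
    "read2\tgenusY_sp1\t93.0\t188\t13\t0\t1\t188\t1\t188\t1e-70\t280.0",
]

if __name__ == "__main__":
    for query, hit in top_hits(EXAMPLE_LINES).items():
        print(query, hit["subjects"], hit["bitscore"])
```

Reporting ties in this way would flag sequences for which the marker cannot separate two candidate species, at the cost of fewer unambiguous species-level calls.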

The orbcL2 primers identified the largest number of non-target species. The high number of non-target species identifications is consistent with the relatively low levels of interspecific variability in the rbcL gene found in other studies (De Mattia et al., 2012; De Vere et al., 2017), but could also be due to errors in the reference database, for example misidentified species. A total of 23 aquatic or wetland species and a disproportionate 103 terrestrial species were incorrectly identified. Eight of the 23 erroneously identified aquatic species were congeneric with target species; in comparison, only three of the 103 erroneously identified terrestrial species were congeneric with control sample species. This may indicate that the orbcL2 primers are better at discriminating between aquatic species than between terrestrial species, or that terrestrial species have more identification errors or lower representation in GenBank.

The incorrectly identified species could also be a result of contamination, and not due to discrimination issues. With the exception of the species that had extracted DNA added to the control sample, all aquatic and terrestrial plants were collected in the field, then rinsed with deionized water in the lab prior to placing a clipping in the control sample. After collection in the field, aquatic plants were held overnight in a cooler full of lake or river water, which could have held eDNA from the field. There is a possibility that this rinse was not sufficient to remove DNA or plant material from other species that were found adjacent to the collected specimens in the field, meaning that some of these non-target species identified from the control sample could have contributed some eDNA.

Overall, the control sample results address the first goal of my thesis: they confirm that genetic regions used for terrestrial plant metabarcoding are suitable for aquatic plant metabarcoding, and show that a combination of pre-existing terrestrial metabarcoding primers (oITS2) and novel primers designed for aquatic plants (omatK2 and orbcL2) provides the highest number of species-level identifications.

Metabarcoding limitations

Metabarcoding involves targeting DNA from multiple taxa in a mixed sample. With respect to primer annealing, even when a relatively high number of mismatches leads to positive amplification, there may be preferential amplification of the sequences that have fewer mismatches and therefore a higher annealing affinity (Suzuki & Giovannoni, 1996). In tests comparing the annealing temperatures across different PCR runs, Piñol et al. (2015) found that, as expected, lower annealing temperatures led to lower specificity and more species amplified, and that more PCR cycles meant that there was a greater discrepancy between the extent to which different species amplified. Green et al. (2015) developed an amplification strategy called Polymerase-exonuclease (PEX) PCR which separates primer-amplicon interactions from primer-template interactions with the goal of determining which primers were annealing preferentially to the target DNA. The utilisation of PEX PCR substantially improved the consistency of sequence recovery from a mock community, and revealed that at lower temperatures, primers with four or fewer mismatches to the template DNA can contribute to amplified sequences (Green et al., 2015).
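As a simple illustration of primer-template mismatches, the following Python sketch counts mismatches between a (possibly degenerate) primer and a binding site of equal length, treating IUPAC ambiguity codes in the primer as matching any base they encode. The function name and the example sequences are hypothetical and are not the primers designed in this thesis; the binding site is assumed to be supplied in the same orientation as the primer and to contain only A, C, G, or T.

```python
# IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(primer, site):
    """Count positions where the site base is not allowed by the primer code."""
    if len(primer) != len(site):
        raise ValueError("primer and binding site must be the same length")
    return sum(
        1
        for p, s in zip(primer.upper(), site.upper())
        if s not in IUPAC[p]
    )


if __name__ == "__main__":
    primer = "ACGTRY"  # hypothetical degenerate primer (R = A/G, Y = C/T)
    for site in ("ACGTAC", "ACGTCC", "TCGTCA"):
        print(site, count_mismatches(primer, site))
```

A count of this kind only summarizes sequence complementarity; it does not model mismatch position or annealing temperature, both of which influence amplification.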

Technically, more species-level identifications could exist because congeneric species may be of equal match in GenBank to the metabarcoding sequences, but the pipeline only identifies the first match listed; although if two congenerics have the same sequence, the marker is not species-specific. Elbrecht et al. (2018) used metabarcoding to establish haplotypes of a species of freshwater macroinvertebrate in a mock community with 15 expected haplotypes, and ended up with an additional 480 unexpected haplotypes. It can be difficult to discern whether these discrepancies are due to mutations, intraspecific variation across haplotypes, gaps in bioinformatics sequence filtering stringency, or database errors (Coissac, Riaz, & Puillandre, 2012). Increasing the stringency of their bioinformatics filtering steps resulted in a trade-off in which only nine of the 15 expected haplotypes were identified, but only six unexpected haplotypes were identified (Elbrecht et al., 2018). Leray and Knowlton (2017) targeted a fragment of the COI gene in a mock community of 34 marine invertebrates and generated four times more operational taxonomic units (OTUs) than expected; however, the non-target OTUs only accounted for 0.3% of the MiSeq sequencing reads. Of the 86 non-target OTUs, 25.8% were >97% similar to target OTUs, 34.2% were assigned to higher taxonomic levels, and 11.7% were unidentified. The sequences that were assigned to non-target OTUs showing high similarity to target OTUs likely represented taxa that were present in the mock community in trace amounts but were undetected when it was created, and unidentified sequences were presumed to belong to single-celled eukaryotes that lack representation in the reference databases (Leray & Knowlton, 2017). These studies demonstrate that there is a trade-off in increasing stringency of bioinformatics filtering steps to retain as many target sequences and as few non-target sequences as possible, indicating there is a delicate balance that may differ across metabarcoding studies.
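The stringency trade-off can be sketched with a toy example in Python. The filter used here, a minimum read count per sequence, is only one possible filtering step and is not the method of the studies cited above; the sequence labels, read counts, and thresholds are hypothetical and chosen solely to show the pattern.

```python
def filter_by_min_reads(read_counts, min_reads):
    """Keep only sequences supported by at least `min_reads` reads."""
    return {seq for seq, n in read_counts.items() if n >= min_reads}


def summarize(kept, expected):
    """Count how many retained sequences were expected versus unexpected."""
    return {
        "expected_kept": len(kept & expected),
        "unexpected_kept": len(kept - expected),
    }


# Hypothetical mock community: three expected sequences plus three artefacts.
READ_COUNTS = {
    "exp1": 500, "exp2": 120, "exp3": 8,
    "noise1": 40, "noise2": 5, "noise3": 2,
}
EXPECTED = {"exp1", "exp2", "exp3"}

if __name__ == "__main__":
    for threshold in (1, 10, 100):
        kept = filter_by_min_reads(READ_COUNTS, threshold)
        print(threshold, summarize(kept, EXPECTED))
```

In this toy data set, raising the threshold removes the unexpected sequences but also discards a rare expected sequence, which mirrors the balance described above.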

Another reason sequences may be assigned to the wrong species, or why we may be unable to discriminate between two species, is that the entire target amplicon was not recovered. With the exception of the oITSn primer amplicon, amplicons did not have full coverage with the 2x125bp reads available on the Illumina HiSeq. Of the 308bp target amplicon for the omatK2 primers, ~188bp (61%) was recovered following the bioinformatics pipeline. Some of the species found in the bioinformatics results were not placed in the original control sample; however, they may belong to the same family or genus as species that actually were placed in the control sample, indicating a potential issue with resolution between certain species. The non-target species identified using these primers that were congeneric with species found in the control sample may represent species that would be correctly identified if this sequencing gap was filled. For example, N. odorata was not identified to the species-level using omatK2 primers, but congeneric N. alba and N. nouchali were both identified as top matches to two separate metabarcoding sequences. It is possible that N. odorata had an equal or similar match using blastn parameters, but was not reported as the singular top hit. Some genera may not be divergent enough to use a 188bp fragment of matK to discriminate to the species-level, and perhaps using the full 308bp amplicon would reveal N. odorata as a contributor to eDNA from the control sample. The same concept may be applied to other species, but perhaps is not true for terrestrial species. The higher number of generic and even familial misidentifications of terrestrial species may indicate that they are even less divergent at this marker. Relatively short reads obtained by sequencing on the Illumina HiSeq 2500 are a limitation of the platform, but paired-end reads allow for partial to full reconstruction of amplicons (Boyer et al., 2016; Feng et al., 2016), and in metabarcoding studies the high sequence coverage offered by Illumina HiSeq 2500 potentially allows for recovery of a high number of species in an environmental sample, even across multiple pooled samples (Deiner et al., 2017; Glenn, 2011).
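The coverage figures above can be checked with a short Python sketch. It uses only the numbers reported in the text (a 308bp amplicon, 2x125bp reads, and ~188bp recovered); the function names are hypothetical, and the calculation ignores primer, adapter, and quality trimming, so it does not attempt to explain why the recovered length is below the 250bp theoretical maximum.

```python
def paired_end_coverage(amplicon_len, read_len):
    """Return (bases covered, read overlap, uncovered gap) for paired-end reads.

    Assumes one forward and one reverse read of equal length starting at
    opposite ends of the amplicon, with no trimming.
    """
    total = 2 * read_len
    covered = min(amplicon_len, total)
    overlap = max(0, total - amplicon_len)
    gap = max(0, amplicon_len - total)
    return covered, overlap, gap


def recovered_fraction(recovered_bp, amplicon_len):
    """Fraction of the target amplicon that was recovered."""
    return recovered_bp / amplicon_len


if __name__ == "__main__":
    covered, overlap, gap = paired_end_coverage(amplicon_len=308, read_len=125)
    print(f"covered={covered} bp, overlap={overlap} bp, gap={gap} bp")
    print(f"recovered: {recovered_fraction(188, 308):.0%}")
```

Because the two reads cannot overlap on an amplicon longer than 250bp, the forward and reverse reads cannot be merged by overlap for this marker, which is consistent with the sequencing gap discussed above.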

The purpose of the control sample is to assess how many of the species placed into the environmental sample are identified when the metabarcoding sequences are compared to a reference database. Failure to identify a species that was known to be in the sample could be a result of failure to amplify as addressed above, in which case there would be no representative sequence for that given species in the metabarcoding data, or it could be due to issues in the bioinformatics pipeline or insufficient taxonomic resolution of target markers. Having a control sample is imperative in developing metabarcoding assays in order to pinpoint potential sources of error, to confirm or deny specificity of primer-binding sites across taxa, and to assess sequence variability across taxa.


Wild sample results

This study presents the first attempt to characterize a natural aquatic plant community using eDNA metabarcoding. The metabarcoding results from the Black River sample identified the most aquatic plant species using the omatK2 primers (n=13), followed by orbcL2 (n=12), oITS2 (n=5), and orbcLa (n=4); however, a combination of primers leads to the highest number of identified species. No aquatic plants were identified using oITSn primers; most sequences were identified as fungi, bacteria, or algae, which is plausible because ITS2 is a commonly used barcode across kingdoms (Han et al., 2013) and there is DNA from other organisms in natural aquatic samples. Common aquatic plants found in the Black River include Potamogeton crispus, Myriophyllum spicatum, Myriophyllum sibiricum, Ceratophyllum demersum, Nuphar variegata, Nymphaea odorata, Vallisneria americana, and Typha spp. (M. Shapiera & W. Wegman, pers. comm.), and all eight of these taxa were identified by at least one of the primer pairs (Table 2.3.3). Beyond these common species, with the exception of Ottelia acuminata, all identified species from each primer pair have either been repeatedly observed in the Black River (M. Shapiera & W. Wegman, pers. comm.) or are known to reside in southern Ontario. Ottelia acuminata is a member of the Hydrocharitaceae family that is endemic to southwest China, and its taxonomic status is still under debate due to the level of genetic diversity observed among varieties (Chen, Du, Long, Gichira, & Wang, 2017). There are at least five other Hydrocharitaceae species identified in the wild sample that are likely to be found in the Black River; the genetic diversity observed in O. acuminata varieties, coupled with the number of closely related species identified, indicates that this sequence was likely misidentified in GenBank.


As with the control sample results, orbcLa primers did not identify any species that were not also identified by another primer pair. Six species were identified by two or three of omatK2, orbcL2, and oITS2. The remaining 14 out of 21 species identifications were only identified by one of these three primer pairs, with each primer pair identifying unique taxa. These results reinforce the earlier conclusion that a combination of data from at least two gene regions results in the largest number of species identifications.
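Combining identifications across primer pairs amounts to taking the union of the per-primer species lists and recording which primer pairs support each species. The Python sketch below shows this under stated assumptions: the function name is hypothetical, and the primer and species labels are placeholders rather than the results in Table 2.3.3.

```python
def combine_primer_results(results):
    """Combine per-primer species sets.

    Returns the union of all species, the primer pairs supporting each
    species, and the species detected by exactly one primer pair.
    """
    all_species = set().union(*results.values())
    support = {
        species: sorted(p for p, found in results.items() if species in found)
        for species in all_species
    }
    unique = {species for species, primers in support.items() if len(primers) == 1}
    return all_species, support, unique


# Placeholder data: three primer pairs with partly overlapping detections.
RESULTS = {
    "primerA": {"sp1", "sp2", "sp3"},
    "primerB": {"sp2", "sp4"},
    "primerC": {"sp3", "sp5"},
}

if __name__ == "__main__":
    all_species, support, unique = combine_primer_results(RESULTS)
    print(len(all_species), "species in total")
    for species in sorted(all_species):
        print(species, support[species])
    print("detected by one primer pair only:", sorted(unique))
```

Species supported by several primer pairs carry more confidence, while species in the single-primer set are those that would be missed if that primer pair were dropped.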

The only other unexpected result from these species identifications is P. stratiotes, which is an invasive species in Ontario (Adebayo et al., 2011). P. stratiotes is not known to currently grow in the Black River; however, there was an occurrence in 2011 in the same district, and the shoreline where this species was observed was combed, resulting in the removal of about a half dozen single plants or small patches of plants (W. Wegman, pers. comm.). This particular species is not known to overwinter in southern Ontario climates, but it is a common garden plant and it is feasible that plants or their DNA could be accidentally or intentionally put into the river.

Invasive S. aloides has been a concern in the Black River for many years. In 2015, there were several hundred S. aloides plants that had become established in the Black River (Snyder et al., 2016) and the region has since been surveyed multiple times every year, with multiple eradication efforts (Marinich, 2017). Water samples have been collected to screen for S. aloides DNA using a species-specific qPCR approach. Results have been stochastic, sometimes revealing low levels of detection even when visual surveys indicated plants were present, and relatively higher levels of detection when no plants were observed (C. Currier & J. Freeland, pers. comm.). Targeted qPCR assays are generally considered to be more sensitive compared to metabarcoding assays (Lacoursière-Roussel, Dubois, Normandeau, Bernatchez, & Adamowicz, 2016), so we were interested to see whether S. aloides would be detected in this study. In this particular Black River water sample, qPCR results detected S. aloides DNA in two out of three replicates (C. Currier & J. Freeland, pers. comm.). Not only was S. aloides identified from the metabarcoding assay presented in this study, but it was identified by omatK2, orbcL2, and oITS2 primers, increasing the confidence in this finding. Environmental DNA from an invasive species currently being monitored (S. aloides) was detected by multiple primer pairs, indicating that there are potentially hidden plants somewhere near to or upstream of the sampling site, or that eDNA is persisting in the water from plants that had been physically removed in eradication efforts. Another invasive species, P. stratiotes, was detected in an area where it is not believed to currently reside, again indicating potential persistence of eDNA in freshwater systems, or highlighting the utility of an eDNA metabarcoding assay in early detection of invasive species. Myriophyllum spicatum is another invasive species that has become established across Canada (Snyder et al., 2016) and is known to inhabit the Black River (M. Shapiera & W. Wegman, pers. comm.). This assay also positively identified multiple non-native species in the control sample which, in addition to the three detected from the Black River sample, include M. aquaticum, T. natans, and E. crassipes, further validating the utility of this assay in screening water samples for early detection of potentially problematic alien species.

Wild versus control results

The results from the wild sample seem more reliable, meaning that there were fewer unexpected aquatic species identifications, compared to the results from the control sample. This may be attributed largely to the terrestrial species that were included in the control sample, as the majority of erroneous species identifications were terrestrial plants.

There are also some discrepancies between the control and wild sample results, meaning that the primer pairs that identified a given species were not always the same in the two samples. For example, A. calamus was only identified by omatK2 primers in the wild results but was identified by all five primer pairs from the control sample. This would be expected if there was a higher concentration of A. calamus DNA in the control sample, but it is not possible to reliably quantify concentrations in wild samples. In some species we see the opposite; for example, M. spicatum was identified by omatK2, orbcL2, and orbcLa primers in the wild sample, but only by the oITS2 primers in the control sample. This pattern continues for all species that were identified in both samples, illustrating that it is necessary to target multiple regions in eDNA metabarcoding assays.

There is insufficient literature on the differential degradation of DNA across the plant genome to infer specific regions that might degrade more rapidly in an environmental sample. However, we do know that certain DNA types such as nuclear, chloroplast, or mitochondrial DNA may be present in varying quantities at different life stages of the plants; for example, there is a net increase in chloroplast DNA in meristematic cells during seedling development, and a decrease as mature ( & Bendich, 2009). Additionally, there are different environmental considerations across environment types and DNA may degrade at different rates, or there may be other organisms that assist decomposition, in aquatic versus terrestrial environments (Dejean et al., 2011; Pietramellara et al., 2009). This offers multiple plausible explanations as to why we might observe species identifications from different gene regions across samples.


Overall, both the control and wild eDNA samples provided insight into aquatic plant community composition and biodiversity, and the metabarcoding assay detected multiple native and alien species using a combination of genetic markers.

‘All-or-nothing’ phenomenon

The ‘all-or-nothing’ pattern discussed in the single species amplification section was also observed in the control sample results with respect to amplification and identification. In the single species amplifications, the majority of species that were amplified were amplified by both degenerate and non-degenerate versions of the primers, with only slightly higher positive amplifications with the degenerate primers. This becomes less evident in the control sample findings as Table 2.3.2 reports species, congeneric, and confamilial identifications. However, of the 29 species that were positively identified, 17 (58.6%) were identified by three or more primer pairs, four (13.8%) were identified by two primer pairs, and eight (27.6%) were only identified by one primer pair. A total of 13/45 (28.9%) species were either not identified, or were identified only to the family level by a confamilial species. Thus, the majority of species that were identified were identified by multiple primer pairs, which may point to an issue with the template DNA, such as differential degradation of DNA across the genome in environmental samples.
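The proportions reported above follow directly from the counts in the text, as the short Python check below shows. The helper name `share` is hypothetical; the counts (17, 4, and 8 of 29 identified species, and 13 of 45 control species) are those given in the paragraph.

```python
def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


IDENTIFIED = 29       # species positively identified in the control sample
CONTROL_SPECIES = 45  # species placed in the control sample

if __name__ == "__main__":
    print("three or more primer pairs:", share(17, IDENTIFIED))
    print("two primer pairs:", share(4, IDENTIFIED))
    print("one primer pair:", share(8, IDENTIFIED))
    print("not identified or family level only:", share(13, CONTROL_SPECIES))
```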


Conclusion

Few studies have focused on metabarcoding plants from environmental samples; even fewer have compared the success of multiple markers for their ability to amplify and discriminate between potential target plant species, and none have specifically focused on eDNA metabarcoding to target aquatic plants. In this study, we conducted in silico tests with reference database sequences to design and test potential aquatic plant metabarcoding markers, and to evaluate the ability of these primers to annotate and discriminate between species. A control sample with eDNA from a known number of species was used to further evaluate the success of these primers, and to compare their performance in terms of amplification and identification. Finally, in a pilot project to test the ability of these primers to characterize natural aquatic plant communities, eDNA metabarcoding of a water sample collected from the Black River (Sutton, ON) successfully identified multiple common aquatic plant species from that area.

The results from this study suggest that the omatK2 primers are the best standalone primers for correctly identifying the largest number of aquatic plant species and the fewest non-target species, although this conclusion may vary between aquatic communities from different geographic regions. Furthermore, metabarcoding the same eDNA sample with multiple primer pairs does not require substantially more time or resources than a single primer amplification, so I recommend using omatK2, orbcL2, and oITS2 primers for the most complete representation of aquatic plant communities and the highest confidence in species identifications. The orbcL2 primers did result in the largest number of incorrect species identifications from the control sample; however, in the wild sample they provided unique plausible identifications and further confidence in species that were also identified by the other two recommended primer pairs.

The metabarcoding results from the Black River, Ontario, demonstrated that this is an effective method to detect both known (S. aloides) and unidentified (P. stratiotes) invasive species even at low densities. The metabarcoding assay developed in this study can be applied to ongoing studies monitoring biodiversity and the effects invasive species have on native species communities. For example, as eradication efforts continue for S. aloides throughout Ontario, eDNA samples could be analyzed over time to monitor how aquatic plant communities recover, and to what extent native biodiversity changes, in areas of significant infestation. Results from the control sample demonstrate the importance of having a community of known species composition when developing an eDNA metabarcoding assay, and the pilot application of this assay on a natural aquatic sample presents promising applications in aquatic plant community characterization, ongoing monitoring efforts of invasive species, and early detection of invasive or potentially problematic alien species.


Chapter 3: General Discussion

DNA barcoding often provides a higher level of confidence in declaring an organism's identity than morphological characteristics, or than visual surveys when target organisms are difficult to locate (Stein et al., 2014). However, not all intra- and interspecific comparisons reveal a barcoding gap; in other words, if intraspecific sequence variation is comparable to interspecific sequence variation, DNA barcoding is not an effective tool for identifying species. In comparison to barcoding, metabarcoding studies provide an opportunity to survey entire communities to establish taxonomic composition (Valentini et al., 2009). Metabarcoding studies to date have been left with a fairly large number of unidentified taxa, or often a species will be assigned to a sequence even if it is not a high match to the reference database, and it remains unclear whether this is because of incomplete reference databases, lack of marker discrimination, or, likely, a combination of both (Fahner et al., 2016; Fazekas et al., 2009).
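The barcoding gap can be made concrete with a minimal Python sketch that compares the largest intraspecific distance with the smallest interspecific distance. It assumes pre-aligned sequences of equal length and uses uncorrected p-distance (the proportion of differing sites); the function names, species labels, and sequences are hypothetical and serve only as an illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def barcoding_gap(records):
    """Return (max intraspecific, min interspecific, gap present?).

    `records` is a list of (species, aligned_sequence) pairs and must
    contain at least two species.
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (intra if sp1 == sp2 else inter).append(p_distance(s1, s2))
    if not inter:
        raise ValueError("at least two species are required")
    max_intra = max(intra) if intra else 0.0
    min_inter = min(inter)
    return max_intra, min_inter, min_inter > max_intra


# Hypothetical aligned barcodes for two species with two sequences each.
RECORDS = [
    ("spA", "ACGTACGTAC"),
    ("spA", "ACGTACGTAT"),
    ("spB", "ACGAACGAAC"),
    ("spB", "ACGAACGAAC"),
]

if __name__ == "__main__":
    print(barcoding_gap(RECORDS))
```

When the smallest interspecific distance does not exceed the largest intraspecific distance, the marker cannot reliably separate the species in question, which is the situation described above.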

With a growing body of research focusing on molecular identification techniques, many researchers advocate for combining traditional morphological and molecular taxonomy, and some argue that molecular approaches will replace traditional morphological taxonomy; however, molecular and traditional approaches need to work hand in hand to develop robust, reliable databases. Using DNA sequences to identify species requires a reference database, and errors with misidentification of species whose DNA sequences are entered into the databases will ultimately lead to further misidentifications as this public information is used in downstream studies.

Metabarcoding studies do not provide reliable abundance estimates due to the multiple library preparation and bioinformatics steps that ultimately affect the number of reads attributed to each primer pair and species. However, a presence/absence approach identifies multiple potential species of interest, for example species-at-risk or invasive species. The information gathered from metabarcoding studies can lead to species-specific approaches if there are particular taxa of interest in determining abundance, although the idea of estimating abundance even from targeted species eDNA assays remains controversial and has not been well substantiated. This is the first study to attempt to characterize aquatic plant communities as a whole using eDNA metabarcoding. This tool has broad implications in biodiversity assessment contexts, as it has the potential to detect invasive species or species at risk, and can be used to compare aquatic plant biodiversity across space and time.

The field of metabarcoding is virtually untouched with respect to aquatic plants. Currently, although there are genetic markers widely used and well-documented for plants, there is still little agreement among researchers as to which markers or combinations of markers work best for metabarcoding plants. Furthermore, the nature of molecular evolution and mutations causing inter- and intraspecific variation may make it difficult, if not impossible, to design truly universal primers capable of amplifying DNA from all species of aquatic plants. However, the positive amplification results presented in this thesis indicate that some of the tested markers are sufficiently conserved for primer binding, and sufficiently divergent to discriminate between a number of species native to North America and species native to other continents in the control sample. In future studies, in silico tests can help determine which previously established plant metabarcoding markers, whether they are the aquatic plant markers designed in this study or ones from other plant metabarcoding studies, are suitable for the potential species of interest. Further testing of the markers presented in this study in which the entire target amplicon is sequenced may improve discriminatory power of the sequences for which we currently have a sequencing gap between the forward and reverse reads. This study presents a framework that future studies targeting vastly different taxa may use to design their own metabarcoding primers, and a tool for studies to apply to natural ecosystems to infer species composition.


References

Adebayo, A. A., Briski, E., Kalaci, O., Hernandez, M., Ghabooli, S., Beric, B., … MacIsaac, H. J. (2011). Water hyacinth (Eichhornia crassipes) and water lettuce (Pistia stratiotes) in the Great Lakes: Playing with fire? Aquatic Invasions, 6(1), 91–96. http://doi.org/10.3391/ai.2011.6.1.11

Ahn, S. (2011). Introduction to bioinformatics: sequencing technology. Asia Pacific Association of Allergy, Asthma and Clinical Immunology, 1, 93–97. http://doi.org/10.5415/apallergy.2011.1.2.93

Aloo, P., Ojwang, W., Omondi, R., Njiru, J. M., & Oyugi, D. (2013). A review of the impacts of invasive aquatic weeds on the biodiversity of some tropical water bodies with special reference to Lake (Kenya). Biodiversity Journal, 4(4), 471–482.

Alsos, I. G., Lammers, Y., Yoccoz, N. G., Jørgensen, T., Sjögren, P., Gielly, L., & Edwards, M. E. (2018). Plant DNA metabarcoding of lake sediments: How does it represent the contemporary vegetation. PLoS ONE, 13(4), 1–23. http://doi.org/10.1371/journal.pone.0195403

Barnes, M. A., Turner, C. R., Jerde, C. L., Renshaw, M. A., Chadderton, W. L., & Lodge, D. M. (2014). Environmental conditions influence eDNA persistence in aquatic systems. Environmental Science and Technology, 48(3), 1819–1827. http://doi.org/10.1021/es404734p

Bell, K. L., de Vere, N., Keller, A., Richardson, R. T., Gous, A., Burgess, K. S., & Brosia, B. J. (2016). Pollen DNA barcoding: current applications and future prospects. Genome, 59(April), 1–12. http://doi.org/10.1139/gen-2015-0200

Boyer, F., Mercier, C., Bonin, A., Le Bras, Y., Taberlet, P., & Coissac, E. (2015). OBITOOLS: a UNIX-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16(1), 176–182. http://doi.org/10.1111/1755-0998.12428

Boyer, F., Mercier, C., Bonin, A., Le Bras, Y., Taberlet, P., & Coissac, E. (2016). obitools: A unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16(1), 176–182. http://doi.org/10.1111/1755-0998.12428


Bremer, B., Bremer, K., Chase, M. W., Fay, M. F., Reveal, J. L., Soltis, D. E., … Stevens, P. F. (2009). An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG III. Botanical Journal of the Linnean Society, 161, 105–121. http://doi.org/10.11646/phytotaxa.19.1.4

Buermans, H. P. J., & den Dunnen, J. T. (2014). Next generation sequencing technology: Advances and applications. Biochimica et Biophysica Acta - Molecular Basis of Disease, 1842(10), 1932–1941. http://doi.org/10.1016/j.bbadis.2014.06.015

Caraco, N., Cole, J., Findlay, S., & Wigand, C. (2006). Vascular Plants as Engineers of Oxygen in Aquatic Systems. BioScience, 56(3), 219. http://doi.org/10.1641/0006-3568(2006)056[0219:VPAEOO]2.0.CO;2

Caraco, N. F., & Cole, J. J. (2002). Contrasting impacts of a native and alien macrophyte on dissolved oxygen in a large river. Ecological Applications, 12(5), 1496–1509. http://doi.org/10.1890/1051-0761(2002)012[1496:CIOANA]2.0.CO;2

Casiraghi, M., Labra, M., Ferri, E., Galimberti, A., & Mattia, F. De. (2010). DNA barcoding: a six-question tour to improve users’ awareness about the method, 11(4), 440–453. http://doi.org/10.1093/bib/bbq003

CBOL Plant Working Group, Hollingsworth, P. M., Forrest, L. L., Spouge, J. L., Hajibabaei, M., Ratnasingham, S., … Little, D. P. (2009). A DNA barcode for land plants. Proceedings of the National Academy of Sciences of the United States of America, 106(31), 12794–7. http://doi.org/10.1073/pnas.0905845106

Chambers, P. A., DeWreede, R. E., Irlandi, E. A., & Vandermeulen, H. (1999). Management issues in aquatic macrophyte ecology: a Canadian perspective. Canadian Journal of Botany, 77(4), 471–487. http://doi.org/10.1139/cjb-77-4-471

Chambers, P. A., Lacoul, P., Murphy, K. J., & Thomaz, S. M. (2008). Global diversity of aquatic macrophytes in freshwater. Hydrobiologia, 595(1), 9–26. http://doi.org/10.1007/s10750-007-9154-6

Chamier, J., Schachtschneider, K., Maitre, D. C., Ashton, P. J., Wilgen, B. W. Van, Le Maitre, D., … Van Wilgen, B. (2012). Impacts of invasive alien plants on water quality, with particular emphasis on South Africa. Water SA, 38(2), 345–356. http://doi.org/10.4314/wsa.v38i2.19

Chen, J. M., Du, Z. Y., Long, Z. C., Gichira, A. W., & Wang, Q. F. (2017). Molecular divergence among varieties of Ottelia acuminata (Hydrocharitaceae) in the Yunnan-Guizhou Plateau. Aquatic Botany, 140(February 2015), 62–68. http://doi.org/10.1016/j.aquabot.2017.03.001

Coissac, E., Riaz, T., & Puillandre, N. (2012). Bioinformatic challenges for DNA metabarcoding of plants and animals. Molecular Ecology, 21(8), 1834–1847. http://doi.org/10.1111/j.1365-294X.2012.05550.x

De Mattia, F., Gentili, R., Bruni, I., Galimberti, A., Sgorbati, S., Casiraghi, M., & Labra, M. (2012). A multi-marker DNA barcoding approach to save time and resources in vegetation surveys. Botanical Journal of the Linnean Society, 169(3), 518–529. http://doi.org/10.1111/j.1095-8339.2012.01251.x

De Vere, N., Jones, L. E., Gilmore, T., Moscrop, J., Lowe, A., Smith, D., … Ford, C. R. (2017). Using DNA metabarcoding to investigate honey bee foraging reveals limited use despite high floral availability. Scientific Reports, 7(February), 1–10. http://doi.org/10.1038/srep42838

Deiner, K., Bik, H. M., Mächler, E., Seymour, M., Lacoursière-Roussel, A., Altermatt, F., … Bernatchez, L. (2017). Environmental DNA metabarcoding: Transforming how we survey animal and plant communities. Molecular Ecology, 26(21), 5872–5895. http://doi.org/10.1111/mec.14350

Dejean, T., Valentini, A., Duparc, A., Pellier-Cuit, S., Pompanon, F., Taberlet, P., & Miaud, C. (2011). Persistence of environmental DNA in freshwater ecosystems. PLoS ONE, 6(8), 8–11. http://doi.org/10.1371/journal.pone.0023398

Dextrase, A. J., & Mandrak, N. E. (2006). Impacts of alien invasive species on freshwater fauna at risk in Canada. Biological Invasions, 8(1), 13–24. http://doi.org/10.1007/s10530-005-0232-2

Dougherty, M. M., Larson, E. R., Renshaw, M. A., Gantz, C. A., Egan, S. P., Erickson, D. M., & Lodge, D. M. (2016). Environmental DNA (eDNA) detects the invasive rusty crayfish Orconectes rusticus at low abundances. Journal of Applied Ecology, 53(3), 722–732. http://doi.org/10.1111/1365-2664.12621

Dunker, K. K. J., Sepulveda, A. J. A. A. J. A., Massengill, R. L. R., Olsen, J. J. B. J., Russ, O. L., Wenburg, J. J. K. J., … Shea, C. (2016). Potential of Environmental DNA to Evaluate Northern Pike (Esox lucius) Eradication Efforts: An Experimental Test and Case Study. PLoS ONE, 11(9), 1–21. http://doi.org/10.5061/dryad.16m53.Funding

Edgar, R. C. (2013). UPARSE: Highly accurate OTU sequences from microbial amplicon reads. Nature Methods, 10(10), 996–998. http://doi.org/10.1038/nmeth.2604

Elbrecht, V., Vamos, E. E., Steinke, D., & Leese, F. (2018). Estimating intraspecific genetic diversity from community DNA metabarcoding data. PeerJ, 6, e4644. http://doi.org/10.7717/peerj.4644

Elliott, T. L., & Jonathan Davies, T. (2014). Challenges to barcoding an entire flora. Molecular Ecology Resources, 14(5), 883–891. http://doi.org/10.1111/1755-0998.12277

Fahner, N. A., Shokralla, S., Baird, D. J., & Hajibabaei, M. (2016). Large-scale monitoring of plants through environmental DNA metabarcoding of soil: Recovery, resolution, and annotation of four DNA markers. PLoS ONE, 11(6), 1–16. http://doi.org/10.1371/journal.pone.0157505

Fazekas, A. J., Kesanakurti, P. R., Burgess, K. S., Percy, D. M., Graham, S. W., Barrett, S. C. H., … Husband, B. C. (2009). Are plant species inherently harder to discriminate than animal species using DNA barcoding markers? Molecular Ecology Resources, 9(SUPPL. 1), 130–139. http://doi.org/10.1111/j.1755-0998.2009.02652.x

Feng, Y. J., Liu, Q. F., Chen, M. Y., Liang, D., & Zhang, P. (2016). Parallel tagged amplicon sequencing of relatively long PCR products using the Illumina HiSeq platform and transcriptome assembly. Molecular Ecology Resources, 16(1), 91–102. http://doi.org/10.1111/1755-0998.12429

Feng, Y., Li, Q., Kong, L., & Zheng, X. (2011). DNA barcoding and phylogenetic analysis of Pectinidae (Mollusca: Bivalvia) based on mitochondrial COI and 16S rRNA genes. Molecular Biology Reports, 38(1), 291–299. http://doi.org/10.1007/s11033-010-0107-1

Ficetola, G. F., Miaud, C., Pompanon, F., & Taberlet, P. (2008). Species detection using environmental DNA from water samples. Biology Letters, 4(4), 423–425. http://doi.org/10.1098/rsbl.2008.0118

Freeland, J. R. (2017). 
The importance of molecular markers and primer design when characterizing biodiversity from environmental DNA. Genome, 60(4), 358–374.

73 74

http://doi.org/10.1139/gen-2016-0100 Fujiwara, A., Matsuhashi, S., Doi, H., Yamamoto, S., & Minamoto, T. (2016). Use of environmental DNA to survey the distribution of an invasive submerged plant in ponds. Freshwater Science, 35(2), 748–754. http://doi.org/10.1086/685882 Geisen, S., Laros, I., Vizcaíno, A., Bonkowski, M., & De Groot, G. A. (2015). Not all are free-living: High-throughput DNA metabarcoding reveals a diverse community of protists parasitizing soil metazoa. Molecular Ecology, 24(17), 4556–4569. http://doi.org/10.1111/mec.13238 Ghorbani, A., Saeedi, Y., & De Boer, H. J. (2017). Unidentifiable by morphology: DNA barcoding of plant material in local markets in Iran. PLoS ONE, 12(4). http://doi.org/10.1371/journal.pone.0175722 Glenn, T. C. (2011). Field guide to next-generation DNA sequencers. Molecular Ecology Resources, 11(5), 759–769. http://doi.org/10.1111/j.1755-0998.2011.03024.x Goulet, B. E., Roda, F., & Hopkins, R. (2017). Hybridization in Plants: Old Ideas, New Techniques. Plant Physiology, 173(1), 65–78. http://doi.org/10.1104/pp.16.01340 Green, S. J., Venkatramanan, R., & Naqib, A. (2015). Deconstructing the polymerase chain reaction: Understanding and correcting bias associated with primer degeneracies and primer-template mismatches. PLoS ONE, 10(5), 1–21. http://doi.org/10.1371/journal.pone.0128122 Hajibabaei, M. (2012). The golden age of DNA metasystematics. Trends in Genetics, 28(11), 535–537. http://doi.org/10.1016/j.tig.2012.08.001 Han, J., Zhu, Y., Chen, X., Liao, B., Yao, H., Song, J., … Meng, F. (2013). The short ITS2 sequence serves as an efficient taxonomic sequence tag in comparison with the full-length ITS. BioMed Research International, 2013, 3–10. http://doi.org/10.1155/2013/741476 Hawkins, J., De Vere, N., Griffith, A., Ford, C. R., Allainguillaume, J., Hegarty, M. J., … Adams-Groom, B. (2015). Using DNA metabarcoding to identify the floral composition of honey: A new tool for investigating honey bee foraging preferences. 
PLoS ONE, 10(8), 1–20. http://doi.org/10.1371/journal.pone.0134735 Hebert, P. D. N., Ratnasingham, S., & de Waard, J. R. (2003). Barcoding animal life: cytochrome c oxidase subunit 1 divergences among closely related species.

74 75

Proceedings of the Royal Society B: Biological Sciences, 270(Suppl_1), S96–S99. http://doi.org/10.1098/rsbl.2003.0025 Hollingsworth, M. L., Andra Clark, A., Forrest, L. L., Richardson, J., Pennington, R. T., Long, D. G., … Hollingsworth, P. M. (2009). Selecting barcoding loci for plants: Evaluation of seven candidate loci with species-level sampling in three divergent groups of land plants. Molecular Ecology Resources, 9(2), 439–457. http://doi.org/10.1111/j.1755-0998.2008.02439.x Huang, M. mei, Arnheim, N., & Goodman, M. F. (1992). Extension of base mispairs by Taq DNA polymerase: Implications for single nucleotide discrimination in PCR. Nucleic Acids Research, 20(17), 4567–4573. http://doi.org/10.1093/nar/20.17.4567 Illumina. (2013). 16S Metagenomic Sequencing Library Preparation. Illumina.Com, (B), 1–28. Retrieved from http://support.illumina.com/content/dam/illumina- support/documents/documentation/chemistry_documentation/16s/16s-metagenomic- library-prep-guide-15044223-b.pdf Jo, T., Murakami, H., Masuda, R., Sakata, M. K., Yamamoto, S., & Minamoto, T. (2017). Rapid degradation of longer DNA fragments enables the improved estimation of distribution and biomass using environmental DNA. Molecular Ecology Resources, 17(6), e25–e33. http://doi.org/10.1111/1755-0998.12685 Joly, S., Davies, T. J., Archambault, A., Bruneau, A., Kembel, S. W., Peres-neto, P., & Vamosi, J. (2014). Ecology in the age of DNA barcoding : the resource , the promise and the challenges ahead, 221–232. http://doi.org/10.1111/1755-0998.12173 Keskin, E., Unal, E. M., & Atar, H. H. (2016). Detection of rare and invasive freshwater fish species using eDNA pyrosequencing: Lake Iznik ichthyofauna revised. Biochemical and Ecology, 67, 29–36. http://doi.org/10.1016/j.bse.2016.05.020 Kress, W. J., Wurdack, K. J., Zimmer, E. A., Weigt, L. A., & Janzen, D. H. (2005). Use of DNA barcodes to identify flowering plants. 
Proceedings of the National Academy of Sciences of the United States of America, 102(23), 8369–8374. http://doi.org/10.1073/pnas.0503123102 Krishnamurthy, K., & Francis, R. A. (2012). A critical review on the utility of DNA barcoding in biodiversity conservation. Biodiversity and Conservation, 21(8), 1901–

75 76

1919. http://doi.org/10.1007/s10531-012-0306-2 Kumar, S., Stecher, G., & Tamura, K. (2016). MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets. Molecular Biology and Evolution, 33(7), 1870–1874. http://doi.org/10.1093/molbev/msw054 Lacoursière-Roussel, A., Dubois, Y., Normandeau, E., Bernatchez, L., & Adamowicz, S. (2016). Improving herpetological surveys in eastern North America using the environmental DNA method 1. Genome, 59(11), 991–1007. http://doi.org/10.1139/gen-2015-0218 Larkin, M. A., Blackshields, G., Brown, N. P., Chenna, R., Mcgettigan, P. A., McWilliam, H., … Higgins, D. G. (2007). Clustal W and Clustal X version 2.0. Bioinformatics, 23(21), 2947–2948. http://doi.org/10.1093/bioinformatics/btm404 Leray, M., & Knowlton, N. (2017). Random sampling causes the low reproducibility of rare eukaryotic OTUs in Illumina COI metabarcoding. PeerJ, 5, e3006. http://doi.org/10.7717/peerj.3006 Lovell, S. J., & Stone, S. F. (2006). The economic impacts of aquatic invasive species: a review of the literature. Agricultural and Resource Economics Review (Vol. 35). Retrieved from https://www.epa.gov/sites/production/files/2014- 12/documents/economic_impacts_of_aquatic_invasive_species.pdf Ma, H., Stewart, K., Lougheed, S., Zheng, J., Wang, Y., & Zhao, J. (2016). Characterization, optimization, and validation of environmental DNA (eDNA) markers to detect an endangered aquatic mammal. Conservation Genetics Resources, 8(4), 561–568. http://doi.org/10.1007/s12686-016-0597-9 MacDonald, A. J., & Sarre, S. D. (2017). A framework for developing and validating taxon-specific primers for specimen identification from environmental DNA. Molecular Ecology Resources, 17(4), 708–720. http://doi.org/10.1111/1755- 0998.12618 Marinich, A. K. (2017). Evaluating environmental DNA (eDNA) detection of invasive water soldier (Stratiotes aloides), (May). Matsuhashi, S., Doi, H., Fujiwara, A., Watanabe, S., & Minamoto, T. (2016). 
Evaluation of the environmental DNA method for estimating distribution and biomass of submerged aquatic plants. PLoS ONE, 11(6), 1–14.

76 77

http://doi.org/10.1371/journal.pone.0156217 Op De Beeck, M., Lievens, B., Busschaert, P., Declerck, S., Vangronsveld, J., & Colpaert, J. V. (2014). Comparison and validation of some ITS primer pairs useful for fungal metabarcoding studies. PLoS ONE, 9(6). http://doi.org/10.1371/journal.pone.0097629 Pietramellara, G., Ascher, J., Borgogni, F., Ceccherini, M. T., Guerri, G., & Nannipieri, P. (2009). Extracellular DNA in soil and sediment: Fate and ecological relevance. Biology and Fertility of Soils, 45(3), 219–235. http://doi.org/10.1007/s00374-008- 0345-8 Piñol, J., Mir, G., Gomez-Polo, P., & Agustí, N. (2015). Universal and blocking primer mismatches limit the use of high-throughput DNA sequencing for the quantitative metabarcoding of arthropods. Molecular Ecology Resources, 15(4), 819–830. http://doi.org/10.1111/1755-0998.12355 Port, J. A., O’Donnell, J. L., Romero-Maraccini, O. C., Leary, P. R., Litvin, S. Y., Nickols, K. J., … Kelly, R. P. (2016). Assessing vertebrate biodiversity in a kelp forest ecosystem using environmental DNA. Molecular Ecology, 25(2), 527–541. http://doi.org/10.1111/mec.13481 Prediger, E. (n.d.). Designing PCR primers and probes. Retrieved July 15, 2018, from https://www.idtdna.com/pages/education/decoded/article/designing-pcr-primers-and- probes Ratnasingham, S., & Hebert, P. D. N. (2007). BARCODING, BOLD : The Barcode of Life Data System (www.barcodinglife.org). Molecular Ecology Notes, 7(April 2016), 355–364. http://doi.org/10.1111/j.1471-8286.2006.01678.x Razafimandimbison, S. G., Kellogg, E. A., & Bremer, B. (2004). Recent origin and phylogenetic utility of divergent ITS putative pseudogenes: A case study from naucleeae (Rubiaceae). Systematic Biology, 53(2), 177–192. http://doi.org/10.1080/10635150490423278 Richardson, R. T., Lin, C.-H., Sponsler, D. B., Quijia, J. O., Goodell, K., & Johnson, R. M. (2015). Application of ITS2 Metabarcoding to Determine the Provenance of Pollen Collected by Honey Bees in an Agroecosystem. 
Applications in Plant Sciences, 3(1), 1400066. http://doi.org/10.3732/apps.1400066

77 78

Rossmanith, P., Röder, B., Frühwirth, K., Vogl, C., & Wagner, M. (2011). Mechanisms of degradation of DNA standards for calibration function during storage. Applied Microbiology and Biotechnology, 89(2), 407–417. http://doi.org/10.1007/s00253-010-2943-2

Rowan, B. A., & Bendich, A. J. (2009). The loss of DNA from chloroplasts as leaves mature: Fact or artefact? Journal of Experimental Botany, 60(11), 3005–3010. http://doi.org/10.1093/jxb/erp158

Schmalenberger, A., Schwieger, F., & Tebbe, C. C. (2001). Effect of Primers Hybridizing to Different Evolutionarily Conserved Regions of the Small-Subunit rRNA Gene in PCR-Based Microbial Community Analyses and Genetic Profiling. Applied and Environmental Microbiology, 67(8), 3557–3563. http://doi.org/10.1128/AEM.67.8.3557-3563.2001

Schmieder, R., & Edwards, R. (2011). Quality control and preprocessing of metagenomic datasets. Bioinformatics, 27(6), 863–864. http://doi.org/10.1093/bioinformatics/btr026

Scriver, M., Marinich, A., Wilson, C., & Freeland, J. (2015). Development of species-specific environmental DNA (eDNA) markers for invasive aquatic plants. Aquatic Botany, 122, 27–31. http://doi.org/10.1016/j.aquabot.2015.01.003

Seppey, C. V. W., Fournier, B., Szelecz, I., Singer, D., Mitchell, E. A. D., & Lara, E. (2016). Response of forest soil euglyphid testate amoebae (Rhizaria: Cercozoa) to pig cadavers assessed by high-throughput sequencing. International Journal of Legal Medicine, 130(2), 551–562. http://doi.org/10.1007/s00414-015-1149-7

Shaw, J. L. A., Clarke, L. J., Wedderburn, S. D., Barnes, T. C., Weyrich, L. S., & Cooper, A. (2016). Comparison of environmental DNA metabarcoding and conventional fish survey methods in a river system. Biological Conservation, 197, 131–138. http://doi.org/10.1016/j.biocon.2016.03.010

Sigut, M., Kostovćik, M., Sigutova, H., Hulcr, J., Drozd, P., & Hrcek, J. (2017). Performance of DNA metabarcoding, standard barcoding, and morphological approach in the identification of host-parasitoid interactions. PLoS ONE, 12(12), 1–18. http://doi.org/10.1371/journal.pone.0187803

Snyder, E., Francis, A., & Darbyshire, S. J. (2016). Biology of invasive alien plants in Canada XX. Canadian Journal of Plant Science, 96(April), 225–242.

Srivathsan, A., Ang, A., Vogler, A. P., & Meier, R. (2016). Fecal metagenomics for the simultaneous assessment of diet, parasites, and population genetics of an understudied primate. Frontiers in Zoology, 13(1), 1–13. http://doi.org/10.1186/s12983-016-0150-4

Stadhouders, R., Pas, S. D., Anber, J., Voermans, J., Mes, T. H. M., & Schutten, M. (2010). The effect of primer-template mismatches on the detection and quantification of nucleic acids using the 5′ nuclease assay. Journal of Molecular Diagnostics, 12(1), 109–117. http://doi.org/10.2353/jmoldx.2010.090035

Stein, E. D., White, B. P., Mazor, R. D., Jackson, J. K., Battle, J. M., Miller, P. E., … Sweeney, B. W. (2014). Does DNA barcoding improve performance of traditional stream bioassessment metrics? Molecular Approaches in Freshwater Ecology, 33, 302–311. http://doi.org/10.1086/674782

Stoeckle, B. C., Beggel, S., Cerwenka, A. F., Motivans, E., Kuehn, R., & Geist, J. (2017). A systematic approach to evaluate the influence of environmental conditions on eDNA detection success in aquatic ecosystems. PLoS ONE, 12(12), 1–16. http://doi.org/10.1371/journal.pone.0189119

Suzuki, M. T., & Giovannoni, S. J. (1996). Bias caused by template annealing in the amplification of mixtures of 16S rRNA genes by PCR. Applied and Environmental Microbiology, 62(2), 2–8.

Tamayo, M., & Olden, J. D. (2014). Forecasting the Vulnerability of Lakes to Aquatic Plant Invasions. Invasive Plant Science and Management, 7(01), 32–45. http://doi.org/10.1614/IPSM-D-13-00036.1

Toma, C. (2006). Distribution and comparison of two morphological forms of water soldier (Stratiotes aloides L.): a case study on Lake Słosineckie Wielkie (Northwest Poland). Biodiv. Res. Conserv., 3(4), 251–257.

Tyre, A. J., Tenhumberg, B., Field, S. A., Niejalke, D., & Possingham, H. P. (2003). Improving Precision and Reducing Bias in Biological Surveys: Estimating False-Negative Error Rates. Ecological Applications, 13(6), 1790–1801. http://doi.org/10.1890/02-5078

Valentini, A., Miquel, C., Nawaz, M. A., Bellemain, E., Coissac, E., Pompanon, F., … Taberlet, P. (2009). New perspectives in diet analysis based on DNA barcoding and parallel pyrosequencing: The trnL approach. Molecular Ecology Resources, 9(1), 51–60. http://doi.org/10.1111/j.1755-0998.2008.02352.x

Van De Wiel, C. C. M., Van Der Schoot, J., Van Valkenburg, J. L. C. H., Duistermaat, H., & Smulders, M. J. M. (2009). DNA barcoding discriminates the noxious invasive plant species, floating pennywort (Hydrocotyle ranunculoides L.f.), from non-invasive relatives. Molecular Ecology Resources, 9(4), 1086–1091. http://doi.org/10.1111/j.1755-0998.2009.02547.x

Vincent, A. T., Derome, N., Boyle, B., Culley, A. I., & Charette, S. J. (2017). Next-generation sequencing (NGS) in the microbiological world: How to make the most of your money. Journal of Microbiological Methods, 138, 60–71. http://doi.org/10.1016/j.mimet.2016.02.016

Wei, N., Nakajima, F., & Tobino, T. (2018). Effects of treated sample weight and DNA marker length on sediment eDNA based detection of a benthic invertebrate. Ecological Indicators, 93(April), 267–273. http://doi.org/10.1016/j.ecolind.2018.04.063

Wilson, C., Bronnenhuber, J., Boothroyd, M., Smith, C., & Wozney, K. (2014). Environmental DNA (eDNA) monitoring and surveillance: field and laboratory standard operating procedures. http://doi.org/10.13140/RG.2.1.4724.5041

Woodward, G., Perkins, D. M., & Brown, L. E. (2010). Climate change and freshwater ecosystems: impacts across multiple levels of organization. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 365(1549), 2093–2106. http://doi.org/10.1098/rstb.2010.0055

Yang, X., Wu, X., Hao, H., & He, Z. (2008). Mechanisms and assessment of water eutrophication. Journal of Zhejiang University SCIENCE B, 9(3), 197–209. http://doi.org/10.1631/jzus.B0710626

Yang, Y., Zhan, A., Cao, L., Meng, F., & Xu, W. (2016). Selection of a marker gene to construct a reference library for wetland plants, and the application of metabarcoding to analyze the diet of wintering herbivorous waterbirds. PeerJ, 4, e2345. http://doi.org/10.7717/peerj.2345

Zou, S., & Li, Q. (2016). Pay Attention to the Overlooked Cryptic Diversity in Existing Barcoding Data: the Case of Mollusca with Character-Based DNA Barcoding. Marine Biotechnology, 18(3), 327–335. http://doi.org/10.1007/s10126-016-9692-x

Appendices

Appendix A – Primer design

We used ecoPrimers to design primer pairs for the matK, rbcL, and ITS2 markers based on the custom databases of target sequences. Before testing the primers in PCR, we aligned the primer sequences with the DNA sequences of all species in the custom databases to confirm that they were complementary to the target species. A second primer pair was created for each of the three markers by substituting degenerate bases at variable sites in the primer-binding region: three bases in the forward primer and four in the reverse primer for matK, and two bases in the forward primer and three in the reverse primer for both rbcL and ITS2. In total, six primer pairs were initially tested in PCR with DNA from individual plant species. We assessed species-specific amplification by testing the six primer pairs on 24 DNA samples extracted from different aquatic and terrestrial plant species. Both the primers designed by ecoPrimers and the versions with degenerate bases amplified few of the individual plant species, with amplification success ranging from 0 to 54%. At this point the primers were redesigned to improve amplification success.
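The effect of adding degenerate bases can be illustrated with a minimal sketch. The primer and binding-site sequences below are hypothetical and are not the primers designed in this study; the sketch only shows how a primer written with IUPAC ambiguity codes stands for several concrete sequences, and how it can be checked against a binding site written in the same 5′ to 3′ orientation as the primer.

```python
from itertools import product

# IUPAC nucleotide codes: each symbol stands for a set of possible bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_degenerate(primer):
    """Return every concrete sequence encoded by a degenerate primer."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]

def matches_site(primer, site):
    """True if every base of the site is permitted by the primer at that position."""
    if len(primer) != len(site):
        return False
    return all(s in IUPAC[p] for p, s in zip(primer.upper(), site.upper()))

# Hypothetical 18-base primer with two degenerate positions (R and Y).
primer = "ACGTRCCATGYTAGGCTA"
variants = expand_degenerate(primer)

print(len(variants))                                # 2 * 2 = 4 concrete sequences
print(matches_site(primer, "ACGTGCCATGTTAGGCTA"))   # G is allowed by R, T by Y
print(matches_site(primer, "ACGTCCCATGTTAGGCTA"))   # C is not allowed by R
```

Each degenerate position multiplies the number of oligonucleotides in the primer mix, so a pair with three and four degenerate positions covers more sequence variants at the cost of a more dilute pool of each exact sequence.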

The original primers for matK had melting temperatures of 37.9˚C (forward) and 46.4˚C (reverse), rbcL primers had melting temperatures of 49.7˚C (forward) and 40˚C (reverse), and ITS2 primers had melting temperatures of 57.8˚C (forward) and 63.4˚C (reverse). Thus, the forward and reverse primers of each pair differed by at least 5.6˚C and by up to 9.7˚C. Annealing temperatures for PCR are generally set approximately 5˚C below the melting temperature, meaning that the annealing temperatures for this set of primers would have ranged from about 33˚C to 58˚C. According to Integrated DNA Technologies (IDT; http://www.idtdna.com), primers should be between 18 and 30 bases long, ideal melting temperatures are between 60 and 64˚C to allow PCR enzymes to function optimally during the annealing step, and the forward and reverse primers should have melting temperatures within 2˚C of one another to increase specificity and allow both primers to anneal simultaneously to the template DNA (Prediger, n.d.). The original primers designed by ecoPrimers were 18 bp in length but did not conform to the other parameters recommended by IDT.
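The arithmetic behind these comparisons can be reproduced with a short sketch. The melting temperatures and guideline thresholds are those stated above; the variable and function names are illustrative only.

```python
# Melting temperatures (degrees C) of the original primers, as reported in the text.
original_tm = {
    "matK": {"forward": 37.9, "reverse": 46.4},
    "rbcL": {"forward": 49.7, "reverse": 40.0},
    "ITS2": {"forward": 57.8, "reverse": 63.4},
}

ANNEAL_OFFSET = 5.0          # annealing temperature is set about 5 degrees below Tm
IDT_TM_RANGE = (60.0, 64.0)  # ideal Tm range recommended by IDT
IDT_MAX_PAIR_DIFF = 2.0      # forward and reverse Tm should be within 2 degrees

def pair_difference(tm):
    """Absolute Tm difference between the forward and reverse primer."""
    return round(abs(tm["forward"] - tm["reverse"]), 1)

def meets_idt_guidelines(tm):
    """True if both primers fall in the ideal Tm range and are closely matched."""
    low, high = IDT_TM_RANGE
    in_range = all(low <= t <= high for t in tm.values())
    return in_range and pair_difference(tm) <= IDT_MAX_PAIR_DIFF

differences = {marker: pair_difference(tm) for marker, tm in original_tm.items()}
all_tms = [t for tm in original_tm.values() for t in tm.values()]
anneal_range = (
    round(min(all_tms) - ANNEAL_OFFSET, 1),
    round(max(all_tms) - ANNEAL_OFFSET, 1),
)
conforming = {marker: meets_idt_guidelines(tm) for marker, tm in original_tm.items()}

print(differences)   # per-pair Tm differences
print(anneal_range)  # lowest and highest implied annealing temperature
print(conforming)    # whether each pair satisfies both Tm guidelines
```

None of the three original pairs satisfies both Tm guidelines, which is consistent with the decision to redesign them.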

The primers were therefore redesigned by increasing their length, which raised the melting temperatures of all forward and reverse primers and brought the two primers of each pair to within 2˚C of one another. The extended versions of the primers amplified a significantly higher number of species than the original versions.
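Why lengthening a primer raises its melting temperature can be sketched with a common rule-of-thumb estimate, Tm = 64.9 + 41 × (G + C − 16.4) / N, where N is the primer length. This is an assumption made for illustration: it is not the calculation used by ecoPrimers or IDT, its absolute values will differ from the temperatures reported above, and the template sequence and the `extend_until` helper are hypothetical.

```python
def basic_tm(seq):
    """Rule-of-thumb Tm (degrees C) for oligos longer than 13 bases:
    Tm = 64.9 + 41 * (G + C - 16.4) / N."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)

def extend_until(template, start, min_len, max_len, target_tm):
    """Lengthen a primer along the template until its estimated Tm reaches target_tm.

    Returns the shortest qualifying primer, or None if none is found."""
    for length in range(min_len, max_len + 1):
        primer = template[start:start + length]
        if len(primer) < length:
            break  # ran off the end of the template
        if basic_tm(primer) >= target_tm:
            return primer
    return None

# Hypothetical template region; the first 18 bases play the role of an original primer.
template = "ATGGCTTACCGATTGCAAGGCTCCGGATACGTTAGC"
original = template[:18]
extended = extend_until(template, start=0, min_len=18, max_len=30, target_tm=60.0)

print(len(original), round(basic_tm(original), 1))
print(len(extended), round(basic_tm(extended), 1))
```

In practice both primers of a pair would be extended and then compared, so that the final forward and reverse melting temperatures fall within 2˚C of one another.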

Table A.1. Number of target species with sequence data available for each genetic marker, and therefore the number of sequences used in the custom databases for novel primer development.

| DNA region | Species (n=91) |
| --- | --- |
| matK | 67 |
| rbcL | 81 |
| ITS2 | 73 |

Table A.2. Inventory of all plants included in in silico testing. Species are either from the control sample or from a list of common plants found in the Kawartha Lakes, Trent River, and/or Rideau River. In the final column, ‘Y’ = sequence was included in the primer design database, ‘N’ = sequence was not included in the primer design database, and ‘-’ = sequence data were not available from GenBank. The wetland classification refers to plant species found in marshes or at the shorelines of lakes and rivers, whose DNA can therefore be expected in a naturally occurring water sample.

| Family | Genus / species | Common name | Source | Terrestrial (T) / Aquatic (A) / Wetland (W) | Included in database for primer design? |
| --- | --- | --- | --- | --- | --- |
| Acoraceae | Acorus calamus | Sweet flag | Control sample | W | matK: Y; rbcL: Y; ITS2: Y |
| Alismataceae | Alisma gramineum | Northern water plantain | Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Ranunculaceae | Anemone canadensis | Canada anemone | Control sample | T | matK: N; rbcL: N; ITS2: N |
| Apocynaceae | Asclepias incarnata | Swamp milkweed | Control sample | T | matK: N; rbcL: N; ITS2: - |
| Apocynaceae | Asclepias syriaca | Common milkweed | Control sample | T | matK: N; rbcL: N; ITS2: N |
| Salviniaceae | Azolla caroliniana | Eastern mosquito-fern | Rideau River | A | matK: -; rbcL: Y; ITS2: Y |
| Asteraceae | Bidens discoidea | Swamp Beggar’s ticks | Control sample | T | matK: N; rbcL: N; ITS2: N |
| Butomaceae | Butomus umbellatus | Flowering rush | Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Cabombaceae | Cabomba caroliniana | Fanwort | Control sample | A | matK: Y; rbcL: Y; ITS2: Y |
| Cyperaceae | Carex richardsonii | Richardson’s sedge | Control sample | T | matK: -; rbcL: N; ITS2: - |
| Asteraceae | Cirsium pitcheri | Pitcher’s thistle | Control sample | T | matK: -; rbcL: N; ITS2: |
| Ceratophyllaceae | Ceratophyllum demersum | Coontail | Control sample; Kawartha Lakes; Trent River; Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Characeae | Chara vulgaris | Common stonewort | Kawartha Lakes | A | matK: Y; rbcL: Y; ITS2: Y |
| Campyliaceae | Drepanocladus exannulatus | Curved branch moss | Rideau River | A | matK: -; rbcL: -; ITS2: - |
| Pontederiaceae | Eichhornia crassipes | Common water hyacinth | Control sample | A | matK: Y; rbcL: Y; ITS2: - |
| Hydrocharitaceae | Elodea canadensis | Canada waterweed | Control sample; Trent River; Rideau River; Kawartha Lakes | A | matK: Y; rbcL: Y; ITS2: Y |
| Equisetaceae | Equisetum fluviatile | Water horsetail | Rideau River | A | matK: -; rbcL: Y; ITS2: - |
| Fontinalaceae | Fontinalis hypnoides | Water moss | Rideau River | A | matK: -; rbcL: -; ITS2: Y |
| Fabaceae | Gymnocladus dioicus | | Control sample | T | matK: N; rbcL: N; ITS2: - |
| Hydrocharitaceae | Hydrocharis morsus-ranae | European frogbit | Control sample; Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Balsaminaceae | Impatiens capensis | Touch-me-not | Control sample | T | matK: N; rbcL: N; ITS2: N |
| Iridaceae | Iris pseudacorus | Yellow iris | Control sample | A | matK: Y; rbcL: Y; ITS2: Y |
| Lemnaceae | Lemna minor | Lesser duckweed | Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Lemnaceae | Lemna trisulca | Star duckweed | Kawartha Lakes; Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Plantaginaceae | Linaria vulgaris | Toadflax | Control sample | T | matK: N; rbcL: N; ITS2: N |
| Lamiaceae | Lycopus uniflorus | Northern bugleweed | Control sample | W | matK: N; rbcL: N; ITS2: N |
| Magnoliaceae | Magnolia acuminata | tree | Control sample | T | matK: N; rbcL: N; ITS2: - |
| Asteraceae | Megalodonta beckii (Bidens beckii) | Beck’s water-marigold | Control sample; Kawartha Lakes | A | matK: -; rbcL: -; ITS2: - |
| Haloragaceae | Myriophyllum aquaticum | Parrot’s feather | Control sample | A | matK: Y; rbcL: Y; ITS2: Y |
| Haloragaceae | Myriophyllum heterophyllum | Variable water-milfoil | Kawartha Lakes | A | matK: Y; rbcL: N; ITS2: - |
| Haloragaceae | Myriophyllum sibiricum | Northern water-milfoil | Control sample; Kawartha Lakes; Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Haloragaceae | Myriophyllum spicatum | Eurasian water-milfoil | Control sample; Kawartha Lakes; Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Haloragaceae | Myriophyllum verticillatum | Bracted water milfoil | Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Hydrocharitaceae | Najas flexilis | Water nymph | Kawartha Lakes; Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Lamiaceae | Nepeta cataria | Catnip | Control sample | T | matK: N; rbcL: N; ITS2: N |
| Nymphaeaceae | Nuphar microphylla | Small yellow pond lily | Rideau River | A | matK: Y; rbcL: -; ITS2: Y |
| Nymphaeaceae | Nuphar variegata | Yellow pond lily | Control sample; Kawartha Lakes; Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Nymphaeaceae | Nymphaea odorata | Fragrant water lily | Control sample; Kawartha Lakes; Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Menyanthaceae | Nymphoides peltata | Yellow floating heart | Control sample | A | matK: Y; rbcL: Y; ITS2: Y |
| Poaceae | Phragmites australis | Common reed | Control sample | W | matK: Y; rbcL: Y; ITS2: Y |
| Araceae | Pistia stratiotes | Water lettuce | Control sample | A | matK: Y; rbcL: Y; ITS2: - |
| Polygonaceae | Polygonum amphibium | Water smartweed | Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Pontederiaceae | Pontederia cordata | Pickerelweed | Control sample; Rideau River | A | matK: Y; rbcL: Y; ITS2: - |
| Potamogetonaceae | Potamogeton amplifolius | Large-leaved pondweed or Bass weed | Kawartha Lakes; Rideau River | A | matK: -; rbcL: Y; ITS2: Y |
| Potamogetonaceae | Potamogeton crispus | Curly-leaved pondweed | Control sample; Kawartha Lakes; Rideau River; Trent River | A | matK: Y; rbcL: Y; ITS2: Y |
| Potamogetonaceae | Potamogeton epihydrus | Ribbon-leaved pondweed | Rideau River | A | matK: -; rbcL: -; ITS2: - |
| Potamogetonaceae | Potamogeton foliosus | Leafy pondweed | Rideau River | A | matK: -; rbcL: Y; ITS2: Y |
| Potamogetonaceae | Potamogeton friesii | Fries’ pondweed | Rideau River | A | matK: Y; rbcL: Y; ITS2: - |
| Potamogetonaceae | Potamogeton illinoensis | Illinois pondweed | Rideau River | A | matK: -; rbcL: -; ITS2: - |
| Potamogetonaceae | Potamogeton natans | Floating-leaved pondweed | Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Potamogetonaceae | Potamogeton nodosus | Knotted pondweed | Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Potamogetonaceae | Potamogeton pectinatus | Sago pondweed | Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Potamogetonaceae | Potamogeton pusillus | Slender pondweed | Kawartha Lakes; Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Potamogetonaceae | Potamogeton richardsonii | Richardson’s pondweed | Kawartha Lakes | A | matK: Y; rbcL: Y; ITS2: - |
| Potamogetonaceae | Potamogeton robbinsii | Robbin’s pondweed | Kawartha Lakes | A | matK: -; rbcL: Y; ITS2: Y |
| Potamogetonaceae | Potamogeton strictifolius | Narrowleaf pondweed | Control sample | A | matK: -; rbcL: -; ITS2: Y |
| Potamogetonaceae | Potamogeton zosteriformis | Flat-stemmed pondweed | Kawartha Lakes; Rideau River; Trent River | A | matK: Y; rbcL: Y; ITS2: Y |
| Ranunculaceae | Ranunculus aquatilis | White water-crowfoot | Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Ranunculaceae | Ranunculus longirostris | Water buttercup or Crowfoot | Kawartha Lakes | A | matK: -; rbcL: -; ITS2: - |
| Rosaceae | Rubus idaeus | Wild red raspberry | Control sample | T | matK: N; rbcL: N; ITS2: N |
| Alismataceae | Sagittaria cuneata | Floating arrowhead | Rideau River | A | matK: -; rbcL: -; ITS2: - |
| Alismataceae | Sagittaria latifolia | Broad-leaved arrowhead | Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Alismataceae | Sagittaria rigida | Stiff arrowhead | Rideau River | A | matK: -; rbcL: -; ITS2: - |
| Cyperaceae | Schoenoplectus acutus | Hardstem bulrush | Control sample | W | matK: N; rbcL: Y; ITS2: Y |
| Cyperaceae | Scirpus fluviatilis | River bulrush | Rideau River | A | matK: -; rbcL: -; ITS2: N |
| Cyperaceae | Scirpus pungens | Threesquare bulrush | Rideau River | A | matK: -; rbcL: -; ITS2: - |
| Cyperaceae | Scirpus tabernaemontani | Soft-stem bulrush | Rideau River | A | matK: N; rbcL: N; ITS2: N |
| Elaeagnaceae | Shepherdia canadensis | Buffaloberry | Control sample | T | matK: N; rbcL: N; ITS2: N |
| Asteraceae | Solidago canadensis | Canada goldenrod | Control sample | T | matK: N; rbcL: N; ITS2: N |
| Solanaceae | Solanum dulcamara | Bittersweet/Climbing nightshade | Control sample | T | matK: N; rbcL: N; ITS2: N |
| Typhaceae | Sparganium chlorocarpum | Green-fruited burreed | Rideau River | A | matK: -; rbcL: -; ITS2: - |
| Typhaceae | Sparganium eurycarpum | Large-fruited burreed | Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Araceae | Spirodela polyrhiza | Greater duckweed | Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Poaceae | Sporobolus heterolepis | Prairie dropseed | Control sample | T | matK: N; rbcL: N; ITS2: N |
| Hydrocharitaceae | Stratiotes aloides | Water soldier | Control sample | A | matK: Y; rbcL: Y; ITS2: Y |
| Lythraceae | Trapa natans | European water | Control sample | A | matK: Y; rbcL: Y; ITS2: Y |
| Fabaceae | Trifolium repens | | Control sample | T | matK: N; rbcL: N; ITS2: N |
| Typhaceae | Typha latifolia | Cattail | Control sample | A/W | matK: Y; rbcL: Y; ITS2: Y |
| Typhaceae | Typha minima | Miniature cattail | Control sample | A/W | matK: Y; rbcL: Y; ITS2: Y |
| Lentibulariaceae | Utricularia vulgaris | Common bladderwort | Kawartha Lakes; Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Hydrocharitaceae | Vallisneria americana | Eelgrass | Control sample; Kawartha Lakes; Rideau River; Trent River | A | matK: Y; rbcL: Y; ITS2: Y |
| Scrophulariaceae | Verbascum thapsus | Common mullein | Control sample | T | matK: N; rbcL: N; ITS2: N |
| Apocynaceae | Vincetoxicum rossicum | Dog-strangling vine | Control sample | T | matK: N; rbcL: N; ITS2: N |
| Araceae | Wolffia borealis | Dotted watermeal | Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Araceae | Wolffia columbiana | Columbia watermeal | Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Potamogetonaceae | Zannichellia palustris | Horned pondweed | Rideau River | A | matK: Y; rbcL: Y; ITS2: Y |
| Poaceae | Zizania palustris | Wild rice | Rideau River | A | matK: Y; rbcL: -; ITS2: Y |
| Pontederiaceae | Zosterella dubia | Water star-grass or Mud plantain | Kawartha Lakes; Rideau River | A | matK: -; rbcL: Y; ITS2: - |



Appendix B – Sample collection, filtration, and extraction

Table B.1. Sampling location and date for all field-collected species, and list of DNA samples from past projects for aquatic and terrestrial plants.

Plants collected:

| Species | Location | Date collected |
| --- | --- | --- |
| Nymphaea odorata | Black River, Sutton ON | 09/2016 |
| Chara vulgaris | Black River, Sutton ON | 09/2016 |
| Acorus calamus | Black River, Sutton ON | 09/2016 |
| Pontederia cordata | Black River, Sutton ON | 09/2016 |
| Elodea canadensis | Black River, Sutton ON | 09/2016 |
| Typha latifolia | Black River, Sutton ON | 09/2016 |
| Iris pseudacorus | Black River, Sutton ON | 09/2016 |
| Schoenoplectus acutus | Black River, Sutton ON | 09/2016 |
| Vallisneria americana | Black River, Sutton ON | 09/2016 |
| Potamogeton strictifolius | Black River, Sutton ON | 09/2016 |
| Potamogeton crispus | Black River, Sutton ON | 09/2016 |
| Ceratophyllum demersum | Black River, Sutton ON | 09/2016 |
| Nuphar variegata | Black River, Sutton ON | 09/2016 |
| Anemone canadensis | Buckhorn Lake, Lakehurst ON | 10/2016 |
| Nepeta cataria | Buckhorn Lake, Lakehurst ON | 10/2016 |
| Asclepias incarnata | Buckhorn Lake, Lakehurst ON | 10/2016 |
| Lycopus uniflorus | Buckhorn Lake, Lakehurst ON | 10/2016 |
| Myriophyllum sibiricum | Buckhorn Lake, Lakehurst ON | 10/2016 |
| Myriophyllum spicatum | Buckhorn Lake, Lakehurst ON | 10/2016 |
| Bidens discoidea | Buckhorn Lake, Lakehurst ON | 10/2016 |
| Impatiens capensis | Buckhorn Lake, Lakehurst ON | 10/2016 |
| Shepherdia canadensis | Buckhorn Lake, Lakehurst ON | 10/2016 |
| Linaria vulgaris | Buckhorn Lake, Lakehurst ON | 10/2016 |
| Trifolium repens | Kawartha Highlands Provincial Park, ON | 09/2016 |
| Asclepias syriaca | East Gwillimbury, ON | 10/2016 |
| Verbascum thapsus | East Gwillimbury, ON | 10/2016 |
| Rubus idaeus | East Gwillimbury, ON | 10/2016 |
| Vincetoxicum rossicum | East Gwillimbury, ON | 10/2016 |
| Solidago canadensis | East Gwillimbury, ON | 10/2016 |

Previously extracted DNA samples:

| Species |
| --- |
| Sporobolus heterolepis |
| Phragmites australis |
| Cirsium pitcheri |
| Magnolia acuminata |
| Carex richardsonii |
| Gymnocladus dioicus |
| Isoetes engelmannii |
| Nymphoides peltata |
| Myriophyllum aquaticum |
| Typha minima |
| Eichhornia crassipes |
| Salvinia oblongifolia |
| Salvinia rotundifolia |
| Pistia stratiotes |
| Cabomba caroliniana |
| Hydrocharis morsus-ranae |
| Trapa natans |
| Stratiotes aloides |


Table B.2. List of all plant species included in the control sample.

Species                             Source             Aquatic/Terrestrial/Wetland
Acorus calamus                      Leaf clipping      W
Anemone canadensis                  Leaf clipping      T
Asclepias incarnata                 Leaf clipping      T
Asclepias syriaca                   Leaf clipping      T
Bidens discoidea                    Leaf clipping      T
Cabomba caroliniana                 Pre-extracted DNA  A
Carex richardsonii                  Pre-extracted DNA  T
Cirsium pitcheri                    Pre-extracted DNA  T
Ceratophyllum demersum              Leaf clipping      A
Eichhornia crassipes                Pre-extracted DNA  A
Elodea canadensis                   Leaf clipping      A
Gymnocladus dioicus                 Pre-extracted DNA  T
Hydrocharis morsus-ranae            Pre-extracted DNA  A
Impatiens capensis                  Leaf clipping      T
Iris pseudacorus                    Leaf clipping      A
Linaria vulgaris                    Leaf clipping      T
Lycopus uniflorus                   Leaf clipping      W
Magnolia acuminata                  Pre-extracted DNA  T
Megalodonta beckii (Bidens beckii)  Leaf clipping      A
Myriophyllum aquaticum              Pre-extracted DNA  A
Myriophyllum sibiricum              Leaf clipping      A
Myriophyllum spicatum               Leaf clipping      A
Nepeta cataria                      Leaf clipping      T
Nuphar variegata                    Leaf clipping      A
Nymphaea odorata                    Leaf clipping      A
Nymphoides peltata                  Pre-extracted DNA  A
Phragmites australis                Pre-extracted DNA  W
Pistia stratiotes                   Pre-extracted DNA  A
Pontederia cordata                  Leaf clipping      A
Potamogeton crispus                 Leaf clipping      A
Potamogeton strictifolius           Leaf clipping      A
Rubus idaeus                        Leaf clipping      T
Schoenoplectus acutus               Leaf clipping      W
Shepherdia canadensis               Leaf clipping      T
Solidago canadensis                 Leaf clipping      T
Solanum dulcamara                   Leaf clipping      T
Sporobolus heterolepis              Pre-extracted DNA  T
Stratiotes aloides                  Pre-extracted DNA  A
Trapa natans                        Pre-extracted DNA  A
Trifolium repens                    Leaf clipping      T
Typha latifolia                     Leaf clipping      A/W
Typha minima                        Pre-extracted DNA  A/W
Vallisneria americana               Leaf clipping      A
Verbascum thapsus                   Leaf clipping      T
Vincetoxicum rossicum               Leaf clipping      T

Table B.3. Aquatic and terrestrial plant species extractions for single-species amplifications.

Species                             Weight of freeze-dried tissue used (mg)
Asclepias syriaca                   150
Ceratophyllum demersum              150
Elodea canadensis                   130
Hydrocharis morsus-ranae*           -
Iris pseudacorus                    150
Isoetes engelmannii*                -
Magnolia acuminata*                 -
Megalodonta beckii (Bidens beckii)  80
Myriophyllum aquaticum              120
Myriophyllum spicatum               80
Nepeta cataria                      150
Nuphar variegata                    150
Nymphaea odorata                    150
Pistia stratiotes*                  -
Pontederia cordata                  140
Potamogeton strictifolius           140
Schoenoplectus acutus               120
Shepherdia canadensis               150
Sporobolus heterolepis              -
Stratiotes aloides                  150
Typha sp.                           150
Vallisneria americana               100
Vitis vinifera                      110

*Previously extracted DNA


Figure B.1. Genomic DNA extracted from plant tissue and visualized on an agarose gel to confirm extraction success. Abbreviations: L = ladder; Ec = Elodea canadensis; Pc = Pontederia cordata; Cv = Chara vulgaris; Ps = Pontederia cordata; Va = Vallisneria americana; Cd = Ceratophyllum demersum; No = Nymphaea odorata; WL = Pistia stratiotes; Ct = Magnolia acuminata; I = Isoetes engelmannii; FB = Hydrocharis morsus-ranae; Sp = Sporobolus heterolepis.


Appendix C – Primer testing and optimization, and single-species amplification

Table C.1. Extended primer names, sequences, melting temperatures, and expected amplicon sizes. Bold sequence = original primer sequence from ecoPrimers, red bases = original primer sequence that was removed and not included in optimized version.

Marker  Primer name    F/R  Sequence                     Melting temperature (ºC)
matK    dgFW-matK2-F   F    CATTATRTTMGRTATCAAGGAA       46.5
matK    dgFW-matK2-R   R    GYTTACTAATRGGATRYCC          46.9
matK    FW-matK2-F     F    CATTATATTAGATATCAAGGAA       42.6
matK    FW-matK2-R     R    GCTTACTAATAGGATGTCC          46.7
rbcL    dgFW-rbcL2-F   F    GYCTNGATCGTTACAAAGG          50.5
rbcL    dgFW-rbcL-R    R    CAAYTTATCTCTYTCAACYTGGA      51.4
rbcL    FW-rbcL2-F     F    GTCTTGATCGTTACAAAGG          48.2
rbcL    FW-rbcL2-R     R    CAATTTATCTCTTTCAACTTGGA      49.1
ITS2    dgFW-ITSn-F    F    NAYGACTCTCGGCAACGGATATCTTGG  61.1
ITS2    dgFW-ITSn-R    R    CCCAVGCAGRCDTGCCCT           58.7
ITS2    FW-ITSn-F      F    AACGACTCTCGGCAACGGATATCTTGG  61.9
ITS2    FW-ITSn-R      R    CCCAGGCAGGCGTGCCCT           64.7
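The "dg" primers in Table C.1 contain IUPAC degenerate bases, so each name stands for a pool of related sequences rather than a single oligo. As a rough illustration only (this was not part of the original workflow, and the function name `degeneracy` is hypothetical), the bash sketch below counts how many distinct sequences a degenerate primer represents by multiplying the number of bases allowed at each position.

```shell
#!/usr/bin/env bash
# Count the distinct sequences represented by an IUPAC-degenerate primer.
# Illustrative helper; the name "degeneracy" is an assumption, not from the thesis.
degeneracy() {
    local primer total=1 i base
    primer=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    for (( i = 0; i < ${#primer}; i++ )); do
        base=${primer:i:1}
        case $base in
            A|C|G|T)     ;;                         # unambiguous: 1 option
            R|Y|M|K|S|W) total=$(( total * 2 )) ;;  # two-base codes
            B|D|H|V)     total=$(( total * 3 )) ;;  # three-base codes
            N)           total=$(( total * 4 )) ;;  # any base
            *) echo "unknown base: $base" >&2; return 1 ;;
        esac
    done
    echo "$total"
}

degeneracy CATTATRTTMGRTATCAAGGAA   # dgFW-matK2-F
degeneracy CATTATATTAGATATCAAGGAA   # FW-matK2-F (no degenerate bases)
```

Under these assumptions the degenerate matK forward primer expands to eight sequences, whereas its non-degenerate counterpart is a single sequence.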


Table C.2. PCR Cycling conditions for all extended primers on individual species.

Cycling conditions      dgFW-matK2 & FW-matK2   dgFW-rbcL2 & FW-rbcL2   dgFW-ITSn & FW-ITSn
Initial                 95ºC 30s                95ºC 30s                95ºC 30s
x28 cycles: Denature    95ºC 30s                95ºC 30s                95ºC 30s
x28 cycles: Anneal      38ºC 1min               45ºC 1min               58ºC 1min
x28 cycles: Extend      68ºC 1min               68ºC 1min               68ºC 1min
Final extension         68ºC 5min               68ºC 5min               68ºC 5min
Hold                    4ºC HOLD                4ºC HOLD                4ºC HOLD


Table C.3. Amplification results from individual species amplification with extended primers with and without degenerate bases. ‘FW’ means freshwater, as the primers were designed to target freshwater plant species found in southern Ontario, and ‘dg’ is an abbreviation for degenerate. After this stage, the degenerate primers were used to amplify eDNA samples because they amplified more individual species than the non-degenerate versions; the primer names were then changed to omatK2, orbcL2, and oITSn.

Species                     FW-matK2  dgFW-matK2  FW-rbcL2  dgFW-rbcL2  FW-ITSn  dgFW-ITSn
Vitis vinifera
Typha latifolia
Iris pseudacorus
Myriophyllum aquaticum
Stratiotes aloides
Bidens beckii
Potamogeton strictifolius
Elodea canadensis
Vallisneria americana
Myriophyllum spicatum
Hydrocharis morsus-ranae
Phragmites australis
Isoetes engelmannii
Schoenoplectus acutus
Ceratophyllum demersum
Nepeta cataria
Pontederia cordata
Shepherdia canadensis
Chara vulgaris
Magnolia acuminata
Asclepias syriaca
Nymphaea odorata
Eichhornia crassipes
Cabomba caroliniana
Cirsium pitcheri
TOTAL                       12/25     17/25       17/25     17/25       14/25    19/25


Table C.4. Amplification results from original primers from ecoPrimers with and without degenerate bases compared to the optimized versions. Y = positive PCR amplification, N = no PCR amplification, and - = not tested (grey cells in the original table). Orig = original primers, Opt = optimized primers.

                            dgFW-matK   FW-matK     dgFW-rbcL   FW-rbcL     dgFW-ITS2   FW-ITS2
Species                     Orig  Opt   Orig  Opt   Orig  Opt   Orig  Opt   Orig  Opt   Orig  Opt
Vitis vinifera              N     N     N     N     N     N     N     N     N     N     N     N
Typha latifolia             N     Y     N     Y     Y     Y     Y     Y     Y     Y     Y     Y
Iris pseudacorus            N     Y     N     Y     Y     Y     Y     Y     Y     Y     Y     Y
Myriophyllum aquaticum      N     Y     N     Y     Y     Y     Y     Y     Y     Y     Y     Y
Stratiotes aloides          N     Y     N     Y     N     Y     N     Y     Y     Y     Y     Y
Bidens beckii               N     N     N     N     N     N     N     N     N     N     N     N
Potamogeton strictifolius   -     N     -     N     -     N     -     N     -     N     -     N
Elodea canadensis           N     Y     N     Y     N     Y     N     Y     Y     Y     Y     Y
Vallisneria americana       N     N     N     Y     N     Y     N     Y     N     Y     Y     Y
Myriophyllum spicatum       N     Y     N     Y     N     Y     N     Y     Y     Y     Y     Y
Hydrocharis morsus-ranae    N     Y     N     Y     N     Y     N     Y     Y     Y     Y     Y
Phragmites australis        -     Y     -     Y     -     Y     -     Y     -     Y     -     Y
Isoetes engelmannii         N     N     N     Y     N     Y     N     Y     N     N     N     Y
Schoenoplectus acutus       N     N     N     Y     N     Y     N     Y     N     Y     Y     Y
Ceratophyllum demersum      N     N     N     Y     N     Y     N     Y     N     Y     N     Y
Nepeta cataria              N     N     N     N     N     N     N     N     N     N     N     N
Pontederia cordata          N     Y     N     Y     N     Y     N     Y     N     N     Y     Y
Shepherdia canadensis       N     N     N     N     N     N     N     N     N     N     N     N
Chara vulgaris              N     Y     N     Y     N     Y     N     Y     Y     Y     Y     Y
Magnolia acuminata          N     N     N     N     N     N     N     N     N     N     N     Y
Asclepias syriaca           N     N     N     Y     N     Y     N     Y     N     N     Y     Y
Nymphaea odorata            N     N     N     N     N     N     N     N     N     Y     N     Y
Eichhornia crassipes        -     Y     -     Y     -     Y     -     Y     -     N     -     Y
Cabomba caroliniana         -     Y     -     Y     -     Y     -     Y     -     Y     -     Y
Cirsium pitcheri            -     N     -     N     -     N     -     N     -     N     -     N
Total amplified (/25)       0     17    0     12    3     17    3     17    12    18    8     14


Figure C.1. Image of an agarose gel used to analyze amplification success of orbcLa primers on individual aquatic and terrestrial species, with amplicons expected at 550bp. Abbreviations (top left-right, bottom left-right): L = ladder; Vv = Vitis vinifera; Tl = Typha latifolia; Ip = Iris pseudacorus; Ma = Myriophyllum aquaticum; Sa = Stratiotes aloides; Mb = Megalodonta beckii (Bidens beckii); Ps = Potamogeton strictifolius; Ec = Elodea canadensis; Va = Vallisneria americana; Ms = Myriophyllum spicatum; Hmr = Hydrocharis morsus-ranae; Pa = Phragmites australis; Ie = Isoetes engelmannii; Sa = Schoenoplectus acutus; Cd = Ceratophyllum demersum; L = ladder; Nc = Nepeta cataria; Pc = Pontederia cordata; Shc = Shepherdia canadensis; Cv = Chara vulgaris; Ma = Magnolia acuminata; As = Asclepias syriaca; No = Nymphaea odorata; Ec = Eichhornia crassipes; Cc = Cabomba caroliniana; Cp = Cirsium pitcheri; (-) = negative control; Vv = Vitis vinifera; Tl = Typha latifolia.


Figure C.2. Image of DNA amplified by dgFW-ITSn primers from individual aquatic and terrestrial plant species. The expected amplicon length targeted by these primers was 115-119bp. Abbreviations (top left-right, bottom left-right): L = ladder; Vv = Vitis vinifera; Tl = Typha latifolia; Ip = Iris pseudacorus; Ma = Myriophyllum aquaticum; Sa = Stratiotes aloides; Mb = Megalodonta beckii (Bidens beckii); Ps = Potamogeton strictifolius; Ec = Elodea canadensis; Va = Vallisneria americana; Ms = Myriophyllum spicatum; Hmr = Hydrocharis morsus-ranae; Pa = Phragmites australis; Ie = Isoetes engelmannii; Sa = Schoenoplectus acutus; Cd = Ceratophyllum demersum; L = ladder; Nc = Nepeta cataria; Pc = Pontederia cordata; Shc = Shepherdia canadensis; Cv = Chara vulgaris; Ma = Magnolia acuminata; As = Asclepias syriaca; No = Nymphaea odorata; Ec = Eichhornia crassipes; Cc = Cabomba caroliniana; Cp = Cirsium pitcheri; Gd = Gymnocladus dioicus; Ta = Typha angustifolia; (-) = negative control.


Appendix D – eDNA sample library preparation

Table D.1. Primer sequences with overhang required for the adapter ligation in sample library preparation. Underlined sequence represents the overhang and the bold sequence represents the primer.

Marker   Primer      Sequence (5’ – 3’)
omatK2   omatK2-F    TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGAAGGATCCTTTCATGCATTATRTTMGRTATCAAGGAA
omatK2   omatK2-R    GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGNGYCCAAAYNGGYTTACTAATRGGATRYCC
orbcL2   orbcL2-F    TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGYGATGGACTTACNAGTCTTGATCGTTACAAAGG
orbcL2   orbcL2-R    GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGNCCATAYTTRTTCAATTTATCTCTTTCAACTTGGATNCC
oITSn    oITSn-F     TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGAYGACTCTCGGCAACGGATATCTTGG
oITSn    oITSn-R     GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGCCCAVGCAGRCDTGCCC
orbcLa*  orbcLa-F    TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGATGTCACCACAAACAGAGACTAAAGCAAGTG
orbcLa*  orbcLa-R    GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGTCATCYTTGGTAAAATCAAGTCCACCRCG
oITS2*   oITS2-S2F   TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGCGAAATGCGATACTTGGTGTGAATTGC
oITS2*   oITS4       GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGCCTTGTAAGTTTCTTTTCCTCCGCTTATTGATATGC

*Modified from Fahner et al. (2016).
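Each oligo in Table D.1 is an overhang followed by the locus-specific primer. The bash sketch below is an illustration under one assumption: that the overhangs are the two prefixes shared by all forward and all reverse oligos in the table. It strips the overhang to recover the primer portion, which can then be compared with the primer sequences listed in the ngsfilter file in Appendix E. The function and variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Overhangs assumed to be the shared prefixes of the forward and reverse
# oligos in Table D.1 (an inference from the table, not stated explicitly).
FWD_OVERHANG=TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
REV_OVERHANG=GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG

# Print the primer portion of an overhang-tagged oligo.
strip_overhang() {
    local oligo=$1 overhang
    for overhang in "$FWD_OVERHANG" "$REV_OVERHANG"; do
        if [[ $oligo == "$overhang"* ]]; then
            printf '%s\n' "${oligo#"$overhang"}"
            return 0
        fi
    done
    echo "no known overhang found" >&2
    return 1
}

# oITSn-F and oITSn-R from Table D.1
strip_overhang TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGAYGACTCTCGGCAACGGATATCTTGG
strip_overhang GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGCCCAVGCAGRCDTGCCC
```

For the oITSn pair this prints AYGACTCTCGGCAACGGATATCTTGG and CCCAVGCAGRCDTGCCC, the same primer sequences that appear in the ngsfilter text file.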

Table D.2. P5 and P7 sequences containing indexing barcodes that are added to amplicons in the adapter ligation step. Underlined sequence represents the portion of the sequence that overlaps with the overhang sequence in the primers.

Sequence (5’ – 3’)

P5 AATGATACGGCGACCACCGAGATCTACAC[i5]TCGTCGGCAGCGTC

P7 CAAGCAGAAGACGGCATACGAGAT[i7]GTCTCGTGGGCTCGG


Table D.3. Barcodes used for sample identification.

Index 1 (i7) Sequence (5’ – 3’) Index 2 (i5) Sequence (5’ – 3’)

N701 TAAGGCGA S501 TAGATCGC

N702 CGTACTAG S502 CTCTCTAT

N703 AGGCAGAA S505 GTAAGGAG

N704 TCCTGAGC

N705 GGACTCCT

Table D.4. Cycling conditions with overhang adapters – first PCR step for library preparation.

Cycling conditions      omatK2      orbcL2*     oITSn       orbcLa*      oITS2
Initial                 95ºC 30s    95ºC 5min   95ºC 30s    94ºC 4min    94ºC 5min
x30 cycles: Denature    95ºC 30s    95ºC 30s    95ºC 30s    94ºC 30s     94ºC 30s
x30 cycles: Anneal      48ºC 1min   55ºC 1min   55ºC 1min   60ºC 30s     55ºC 30s   30s
x30 cycles: Extend      68ºC 1min   72ºC 30s    68ºC 1min   72ºC 1min    72ºC 45s
Final extension         68ºC 5min   60ºC 30s    68ºC 5min   72ºC 10min   72ºC 10min
Hold                    4ºC HOLD    4ºC HOLD    4ºC HOLD    4ºC HOLD     4ºC HOLD

*For wild samples, orbcL2 and orbcLa amplifications required 40 cycles.


Figure D.1. Bioanalyzer results from TCAG (Toronto ON, Canada) representing the size of DNA fragments, and their respective concentrations, observed in the sample submitted for sequencing on the Illumina HiSeq 2500. Target amplicons in this sample ranged from 115-550bp excluding primers.


Table D.5. Mismatch count for each forward and reverse primer for species that were not identified by any of the primer pairs. Grey boxes represent data that either wasn’t available from GenBank, or sequence data obtained from GenBank did not contain the primer sequence.

Species, followed by mismatch counts in the column order omatK2 (F, R), orbcL2 (F, R), orbcLa (F, R), oITSn (F, R), oITS2 (F, R), with grey boxes omitted:
S. canadensis 3 4 2 1 0
M. acuminata 1 1 1 0
V. thapsus 3 0 1 1 0 1 0
L. vulgaris 6 1 0 1 2 1 1 0 0
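How the mismatch counts in Table D.5 were obtained is not described here. As one possible approach, offered only as a sketch, the bash functions below count mismatches between a degenerate primer and an equal-length binding site, treating a position as matched when the site base belongs to the set allowed by the primer's IUPAC code. The function names and the example binding sites are hypothetical.

```shell
#!/usr/bin/env bash
# Print the bases allowed by a single IUPAC code.
iupac_set() {
    case $1 in
        A) echo A ;;   C) echo C ;;   G) echo G ;;   T) echo T ;;
        R) echo AG ;;  Y) echo CT ;;  M) echo AC ;;  K) echo GT ;;
        S) echo CG ;;  W) echo AT ;;
        B) echo CGT ;; D) echo AGT ;; H) echo ACT ;; V) echo ACG ;;
        N) echo ACGT ;;
        *) echo "unknown IUPAC code: $1" >&2; return 1 ;;
    esac
}

# Count positions where the site base is not allowed by the primer.
# Both arguments must be upper case and the same length.
count_mismatches() {
    local primer=$1 site=$2 n=0 i allowed
    if [ "${#primer}" -ne "${#site}" ]; then
        echo "primer and site differ in length" >&2
        return 1
    fi
    for (( i = 0; i < ${#primer}; i++ )); do
        allowed=$(iupac_set "${primer:i:1}") || return 1
        [[ $allowed == *"${site:i:1}"* ]] || n=$(( n + 1 ))
    done
    echo "$n"
}

# oITSn reverse primer against two made-up binding sites
count_mismatches CCCAVGCAGRCDTGCCC CCCAGGCAGGCGTGCCC
count_mismatches CCCAVGCAGRCDTGCCC CCCATGCAGGCGTGCCC
```

The first call reports no mismatches because every site base falls within the primer's degenerate sets; the second reports one, since T is not covered by V (A, C, or G).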

Table D.6. Number of species identified in varying primer combinations. ‘A’ = aquatic species, ‘T’ = terrestrial species, and ‘C’ = combined aquatic + terrestrial species.

         omatK2                orbcL2                orbcLa                oITSn                 oITS2
omatK2
orbcL2   A: 20, T: 3, C: 23
orbcLa   A: 17, T: 3, C: 20    A: 16, T: 2, C: 18
oITSn    A: 14, T: 3, C: 17    A: 17, T: 1, C: 18    A: 18, T: 2, C: 20
oITS2    A: 20, T: 5, C: 25    A: 21, T: 4, C: 25    A: 20, T: 5, C: 25    A: 17, T: 4, C: 21


Table D.7. List of plant species identified in the control sample using the omatK2 primers. Bolded family or genus names represent taxa that were represented in the control sample.

omatK2
Family Genus Species Aquatic (A)/Wetland (W)/Terrestrial (T)
Nymphaeaceae Nymphaea alba A
nouchali A
Euryale ferox A
Balsaminaceae Impatiens piufanensis T
Aponogetonaceae Aponogeton crispus A
Fabaceae floribunda T
Proteaceae Protea witzenbergiana T
Typhaceae Typha angustifolia A/W
Cyperaceae Schoenoplectus tabernaemontani W
Poaceae Agrostis mertensii T
Asteraceae W
Musaceae Musa itinerans T
Poaceae Poa pratensis T
Potamogetonaceae Potamogeton wrightii A
Ranunculaceae Ranunculus acris T
Poaceae Setaria helvola T
Lythraceae Trapa maximowiczii A
Asteraceae spinosum T
Haloragaceae Myriophyllum alpinum A
Butomaceae Butomus umbellatus A
Iridaceae Iris versicolor W
Alismataceae Sagittaria platyphylla A
Lamiaceae Lycopus americanus T

Table D.8. List of plant species identified in the control sample using the orbcL2 primers. Bolded family or genus names represent taxa that were represented in the control sample.

orbcL2
Family Genus Species Aquatic (A)/Wetland (W)/Terrestrial (T)
Cyperaceae Cladium jamaicense W
Trichophorum subcapitatum T/W
Bulbostylis barbata T/W
Dulichium arundinaceum A
Carex hirta T
swanii T
Asteraceae Coreopsis tinctoria T
abrotanum T
indicum T
gracile T
vulgaris T
Decodon verticillatus W
sp. T
Haloragaceae Myriophyllum ussuriense A
hippuroides A
palustris A
serra T
Poaceae Triticum macha T
Bromus floribunda T
Molinia caerulea T
Balsaminaceae Impatiens ecornuta T
noli-tangere T
piufanensis T
Hydrocharitaceae Hydrilla verticillata A
Elodea bifoliata A
Limnobium spongia A
delagoensis T
kirilowii T
Cabombaceae Brasenia schreberi A
Pontederiaceae Heteranthera dubia A
Potamogetonaceae Potamogeton diversifolius A
sp. T
Bredemeyera floribunda T
hybrida T
sinaica T
T
Primulaceae Androsace bulleyana T
Anisophyllaceae purpurascens T
Aristolochia serpentaria T
Fabaceae agrestis T
sp. T
nidulans T
max T
stuevei T
obtusifolia T
lanceolata T
grandis T
kingiana T
Commelinaceae Commelina paludosa T
Palisota ambigua T
Stylidiaceae novae-zelandiae T
Drimys granadensis T
Boraginaceae Ehretia microphylla T
prunifolioides T
Polygonaceae Emex spinosa T
Koenigia alaskana T
Oxygonum sinuatum T
Sidotheca trilobata T
Rapateaceae Epidryos guayanensis T
sessilis T
Icacinaceae Hosiea japonica T
Idiospermum australiense T
sp. T
T
ramosior T
caseolaris T
Plumbaginaceae Limonium boirae T
Ceratostigma plumbaginoides T
Plumbago sp. T
Proteaceae Macadamia ternifolia T
Iridaceae Moraea longifloria T
Nemastylis geminiflora T
Iris bloudowii T
virginica T
Nymphaeaceae Nymphaea ampla A
leibergii A
Nuphar advena A
Apocynaceae Periploca sepium T
balsamifera T
alpina T
Lamiaceae Prunella grandiflora T
T
Micromeria biflora T
Physostegia virginiana T
azurea T
Vitex glabrata T
Origanum vulgare T
Mentha asiatica T
Nepeta septemcrenata T
Grossulariaceae lacustre T
rhombea T
T
sp. T
Basellaceae Basella alba T
Ranunculaceae Ceratocephala testiculata T
Ranunculus pedatifidus T
Chaetosphaeridiaceae Chaetosphaeridium pringsheimii T
Gentianaceae Chorisepalum psychotrioides T
Gentiana puberulenta T
Cornaceae Cornus T
sanctum T
Dasypogonaceae Kingia australis T
Araceae Lemna minor A
valdiviana A
Heteropsis tenuispadix T
Wolffia borealis A
trifoliata T
exaltata A
charantia T
cerifera T
Nelumbonaceae Nelumbo lutea A
multifida T
sp. T
Salicaceae Salix paraplesia T
bronchialis T
Typhaceae Sparganium angustifolium A
Lythraceae Trapa maximowiczii A
Actinidiaceae Actinidia valvata T
Myrsinaceae Ardisia verbascifolia T
suffruticosa T
Warburgia salutaris T
Ceratophyllum echinatum A
Adenarake muriculata T
Pinaceae Pinus strobus T
sylvestris T
Butomaceae Butomus umbellatus A
Solanaceae Solanum lycopersicum T
Juncaceae Juncus falcatus T/W

Table D.9. List of plant species identified in the control sample using the orbcLa primers. Bolded family or genus names represent taxa that were represented in the control sample.

orbcLa
Family Genus Species Aquatic (A)/Wetland (W)/Terrestrial (T)
Cyperaceae Fimbristylis dichotoma W
Schoenoplectus triqueter W
Carex siderostica T
Lamiaceae Marmoritis complanata T
Salvia deserta T/W
przewalskii T/W
virgata T
rosmarinus T
Anisochilus pallidus T
Teucrium heterophyllum T
coronopifolia T
Vitex trifolia T
Ceratophyllaceae Ceratophyllum echinatum A
Nymphaeaceae Nymphaea alba A
nouchali A
Nuphar advena A
Balsaminaceae Impatiens ecornuta T
pallida T
barbata T
piufanensis T
Typhaceae Typha angustata A/W
domingensis A/W
Lythraceae Decodon verticillatus W
T
Begoniaceae T
Commelinaceae Coleotrype natalensis T
Cyanotis nyctitropa T
ehretioides T
Fabaceae purpureus T
listeriana T
Hydrocharitaceae Lagarosiphon major A
Hydrilla verticillata A
Vallisneria spiralis A
Egeria najas A
Styracaceae Alniphyllum fortunei T
Asteraceae T
trichosperma T
cernua T
tinctoria T
Papaveraceae Chelidonium majus T
Apiaceae Daucus carota T
Orchidaceae Dendrobium wenshanense T
Ebenaceae Diospyros fasciculosa T
Pontederiaceae Heteranthera dubia A
Monochoria hastata T
Bignoniaceae Mansoa kerere T
Asparagaceae Polygonatum roseum T
Ranunculaceae Ranunculus pensylvanicus T
repens T
Menyanthaceae Villarsia cambodiana A
Haloragaceae Myriophyllum verticillatum A
ussuriense A
tetracantha T
Theaceae Pyrenaria spectabilis T
Zosteraceae Zostera A
Acoraceae Acorus gramineus W
T
Cabombaceae Brasenia schreberi A

Table D.10. List of species identified in the control sample using the oITS2 primers. Bolded family or genus names represent taxa that were represented in the control sample.

oITS2
Family Genus Species Aquatic (A)/Wetland (W)/Terrestrial (T)
Cornaceae Cornus alternifolia T
Poaceae Deyeuxia scabrescens T
Polypogon elongatus T
Psilurus incurvus T
Hydrocharitaceae Enhalus acoroides A
Maidenia (Vallisneria) rubra A
Hydrilla verticillata A
Rubiaceae Galium perpusillum T
Amaranthaceae Halopeplis perfoliata T
Alismataceae Hydrocleys nymphoides A
Nymphaeaceae Nymphaea tetragona A
bicuspidata T
Piper infossibaccatum T
chinensis T
Rubiaceae Tresanthera condamineoides T
Asteraceae T
Polemoniaceae Collomia soehrensii T
Balsaminaceae Impatiens tortisepala T
Convolvulaceae Ipomoea batatas T
Menyanthaceae Nymphoides walshiae A
Nyctaginaceae Pisonia sandwicensis T
Potamogetonaceae Potamogeton amplifolius A
Verbenaceae Verbena californica T
Solanaceae Withania somnifera T
Haloragaceae Myriophyllum alterniflorum A
Cyperaceae Schoenoplectus subterminalis A


Table D.11. List of plant species identified in the control sample using the oITSn primers. Bolded family or genus names represent taxa that were represented in the control sample.

oITSn
Family Genus Species Aquatic (A)/Wetland (W)/Terrestrial (T)
Hydrocharitaceae Vallisneria rubra A
Enhalus acoroides A
Balsaminaceae Impatiens tortisepala T
Haloragaceae Myriophyllum alterniflorum A
Cyperaceae Schoenoplectus subterminalis A
Menyanthaceae Nymphoides walshiae A
Asteraceae Bidens frondosa T
Potamogetonaceae Potamogeton amplifolius A
Nymphaeaceae Nymphaea tetragona A
Convolvulaceae Ipomoea batatas T
Alismataceae Hydrocleys nymphoides A
Polemoniaceae Collomia soehrensii T
Araliaceae Astropanax procumbens T
Verbenaceae Verbena californica T
Boraginaceae Plagiobothrys nothofulvus T
Poaceae Polypogon elongatus T
Deyeuxia scabrescens T
Psilurus incurvus T
Passifloraceae Passiflora bicuspidata T
Amaranthaceae Halopeplis perfoliata T
Rubiaceae Tresanthera condamineoides T
Galium perpusillum T
Cornaceae Cornus alternifolia T
Piperaceae Piper infossibaccatum T
Celastraceae T
Lecythidaceae Eschweilera caudiculata T
Salviniaceae Salvinia molesta A
natans A


Appendix E – Bioinformatics pipeline information

1) ecoPrimers and ecoPCR

#Convert GenBank sequence files into an ecoPCR-format database
$ obiconvert --genbank -t ./TAXO --ecopcrdb-output=my_ecopcr_ITS1_db ./ITS1_SEQS/*.gbk

#Generate metabarcoding primers between 50 and 500bp, allowing up to three mismatches in each of the forward and reverse primers, and requiring a perfect match in the last two nucleotides at the 3’ ends; using a custom database in .fasta format
$ ./ecoPrimers -d my_ecopcr_ITS1_db -e 3 -l 50 -L 500 -3 2 > ITS1_barcodes

#Run in silico PCR to determine how many sequences would be amplified by the chosen primer pairs, and how many sequences are sufficiently variable to distinguish species
$ ./ecoPCR -d my_ecopcr_matK_db -e 4 -l 5 -L 700 ATATTAGATATCAAGGAA CTTACTAATAGGATGTCC > mymatKsequences.ecopcr

2) Quality-checking

#Compare md5sum value with *.md5
$ cat SC8PW_P_R1.fastq.gz.md5
$ md5sum SC8PW_P_R1.fastq.gz
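The checksum comparison above is done by eye. A small bash helper can make the comparison automatic; this is a sketch that assumes the `.md5` file stores the checksum as its first whitespace-delimited field (the layout of the sequencing facility's file is not shown here), and the function name `verify_md5` is hypothetical.

```shell
#!/usr/bin/env bash
# Compare the checksum stored in FILE.md5 with the checksum of FILE.
# Assumption: the checksum is the first whitespace-delimited field of FILE.md5.
verify_md5() {
    local file=$1 expected actual
    expected=$(awk '{ print $1; exit }' "$file.md5")
    actual=$(md5sum "$file" | awk '{ print $1 }')
    if [ "$expected" = "$actual" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file" >&2
        return 1
    fi
}

# Usage (file name from this appendix):
#   verify_md5 SC8PW_P_R1.fastq.gz
```

If the `.md5` file already follows the `checksum  filename` layout written by `md5sum`, running `md5sum -c` on it performs the same check.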

#Check number of lines in forward and reverse reads
$ zcat SC8PW_P_R1.fastq.gz | wc -l
$ zcat SC8PW_P_R2.fastq.gz | wc -l
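Because each FASTQ record occupies four lines, the line counts above translate directly into read counts, and the forward and reverse files should contain the same number of reads. The following bash sketch wraps that check; the function names are hypothetical, and `gzip -dc` is used as a portable equivalent of `zcat`.

```shell
#!/usr/bin/env bash
# Number of reads in a gzipped FASTQ file (four lines per record).
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Confirm that forward and reverse files hold the same number of reads.
check_pairs() {
    local r1 r2
    r1=$(count_reads "$1")
    r2=$(count_reads "$2")
    if [ "$r1" -eq "$r2" ]; then
        echo "paired: $r1 reads"
    else
        echo "unpaired: R1=$r1 R2=$r2" >&2
        return 1
    fi
}

# Usage (file names from this appendix):
#   check_pairs SC8PW_P_R1.fastq.gz SC8PW_P_R2.fastq.gz
```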

#Search primer sequences in forward and reverse reads
$ zcat SC8PW_P_R1.fastq.gz | grep 'ATGTCACCACAAA' | wc -l
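The search above uses a non-degenerate stretch of the orbcLa forward primer. Primers that contain IUPAC degenerate bases (for example omatK2-F) cannot be matched literally with `grep`; one option, sketched below under the assumption that reads contain only A, C, G, and T at the primer site, is to translate the primer into a regular expression first. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Translate an upper-case IUPAC primer into an extended regular expression.
# Replacement text contains only A, C, G, T and brackets, so the
# substitutions cannot cascade into one another.
iupac_to_regex() {
    printf '%s\n' "$1" | sed \
        -e 's/R/[AG]/g'  -e 's/Y/[CT]/g'  -e 's/M/[AC]/g'  -e 's/K/[GT]/g' \
        -e 's/S/[CG]/g'  -e 's/W/[AT]/g' \
        -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' \
        -e 's/N/[ACGT]/g'
}

# Count lines of a gzipped FASTQ file that contain the primer.
count_primer_hits() {
    local regex
    regex=$(iupac_to_regex "$1")
    gzip -dc "$2" | grep -c -E "$regex" || true
}

iupac_to_regex AYGACTCTCGGCAACGGATATCTTGG   # oITSn forward primer

# Usage (file name from this appendix):
#   count_primer_hits AAGGATCCTTTCATGCATTATRTTMGRTATCAAGGAA SC8PW_P_R1.fastq.gz
```

Like the original `grep | wc -l`, this counts matching lines rather than reads, so it is a quick sanity check and not a substitute for the ngsfilter step.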

#FASTQC to check quality of sequences
$ ./fastqc ../SC8PW_P_R1.fastq.gz

[Figure: FASTQC html stats for SC8PW_P_R1.fastq.gz, per base sequence quality]


#Unzip *.fastq.gz files
$ gunzip -k SC8PW_P_R1.fastq.gz

3) Metabarcoding

#Merge forward and reverse reads with illuminapairedend
$ illuminapairedend --score-min=30 -r SC8PW_P_R2.fastq SC8PW_P_R1.fastq > SC8PW_P.fastq

#Merge reads again with fuse.sh instead of illuminapairedend: the forward and reverse reads should not overlap, but repetitive sequences in this region cause illuminapairedend to overlap them
$ module load bbmap
$ fuse.sh in1=./SC8PW_P_R1.fastq in2=./SC8PW_P_R2.fastq pad=0 out=fused.fastq fusepairs

#Sort sequences by sample using ngsfilter
$ ngsfilter --DEBUG -e 6 -t ngsfiltertxt_SC8PW_P -u SC8PW_P_unidentified.fastq SC8PW_P.fastq > SC8PW_P_ngs.fastq

#ngsfilter text file
#used -:- instead of barcode sequences because they had already been removed by sequencing facility
plant_eDNA SC8PW_P_omatK2 -:- AAGGATCCTTTCATGCATTATRTTMGRTATCAAGGAA NGYCCAAAYNGGYTTACTAATRGGATRYCC F @
plant_eDNA SC8PW_P_orbcL2 -:- YGATGGACTTACNAGTCTTGATCGTTACAAAGG GNCCATAYTTRTTCAATTTATCTCTTTCAACTTGGATNCC F @
plant_eDNA SC8PW_P_oITSn -:- AYGACTCTCGGCAACGGATATCTTGG CCCAVGCAGRCDTGCCC F @
plant_eDNA SC8PW_P_orbcLa -:- ATGTCACCACAAACAGAGACTAAAGCAAGTG TCATCYTTGGTAAAATCAAGTCCACCRCG F @
plant_eDNA SC8PW_P_oITS2 -:- GCGAAATGCGATACTTGGTGTGAATTGC CCTTGTAAGTTTCTTTTCCTCCGCTTATTGATATGC F @

#Obigrep all sequences with error “Cannot assign sequence to a sample”; this error is a result of the '-:-' in the ngsfilter text file
$ obigrep -a 'error:Cannot assign sequence to a sample' SC8PW_P_unidentified.fastq > SC8PW_P_good.fastq

#Obigrep primer sequences to separate by primers
$ obigrep -a 'forward_primer:gcgaaa' SC8PW_P_good.fastq > SC8PW_P_goodITS2.fastq

#Obiannotate to add sample ID to each sequence
$ obiannotate -S sample:SC8PW_P_ITS2 SC8PW_P_goodITS2.fastq > SC8PW_P_annotatedITS2.fastq

#Obiuniq to dereplicate reads into unique sequences
$ obiuniq -m sample SC8PW_P_annotatedITS2.fastq > SC8PW_P_annotated.uniq.ITS2.fasta

#Obiannotate to keep only ‘count’ and ‘merged_sample’ attributes; i.e. keep information on how many times each unique sequence was found in each sample
$ obiannotate -k count -k merged_sample SC8PW_P_annotated.uniq.ITS2.fasta > $$ ; mv $$ SC8PW_P_annotated.uniq.ITS2.fasta

#Obistat to get count statistics on the ‘count’ attribute
$ obistat -c count SC8PW_P_annotated.uniq.ITS2.fasta | sort -nk1 | head -20

#Obigrep to keep only the sequences with a length of at least 100bp and a count of at least 10
$ obigrep -l 100 -p 'count>=10' SC8PW_P_annotated.uniq.ITS2.fasta > SC8PW_P_annotated.uniq.ITS2.c10.l100.fasta

#Obiclean to keep only head sequences, which are sequences with no variants with a count greater than 5% of their own count
$ obiclean -s merged_sample -r 0.05 -H SC8PW_P_annotated.uniq.ITS2.c10.l100.fasta > SC8PW_P_annotated.uniq.ITS2.c10.l100.clean.fasta

#ecoPCR to retrieve EMBL sequences that contain the primer pair sequences to create a custom database for each primer pair
$ ecoPCR -d embl_last -e 8 -l 50 -L 500 AAGGATCCTTTCATGCATTA NGYCCAAAYNGGYTT > matK.ecopcr


#Clean the database
$ obigrep -d embl_last --require-rank=species --require-rank=genus --require-rank=family matK.ecopcr > matK_clean.fasta
$ obiuniq -d embl_last matK_clean.fasta > matK_clean_uniq.fasta
$ obigrep -d embl_last --require-rank=family matK_clean_uniq.fasta > matK_clean_uniq_clean.fasta
$ obiannotate --uniq-id matK_clean_uniq_clean.fasta > db_matK.fasta

#Taxonomic assignment of sequences using ecotag
$ ecotag -d embl_last -R db_matK.fasta SC8PW_P_annotated.uniq.matK.c10.l100.clean.fasta > SC8PW_P_annotated.uniq.matK.c10.l100.clean.tag.fasta

#Obiannotate to clean up the sequence headers by removing any unnecessary information
$ obiannotate --delete-tag=scientific_name_by_db --delete-tag=obiclean_samplecount --delete-tag=obiclean_count --delete-tag=obiclean_singletoncount --delete-tag=obiclean_cluster --delete-tag=obiclean_internalcount --delete-tag=obiclean_head --delete-tag=taxid_by_db --delete-tag=obiclean_headcount --delete-tag=order SC8PW_P_annotated.uniq.ITS2.c10.l100.clean.tag.fasta > SC8PW_P_annotated.uniq.ITS2.c10.l100.clean.tag.ann.fasta

#Sort sequences in decreasing order of count
$ obisort -k count -r SC8PW_P_annotated.uniq.ITS2.c10.l100.clean.tag.ann.fasta > SC8PW_P_annotated.uniq.ITS2.c10.l100.clean.tag.ann.sort.fasta

#Generate a tab-delimited file that can be opened by R or Excel
$ obitab -o SC8PW_P_annotated.uniq.ITS2.c10.l100.clean.tag.ann.sort.fasta > SC8PW_P_annotated.uniq.ITS2.c10.l100.clean.tag.ann.sort.tab

#Grab columns with names and sequence, insert >, move sequence to next line, delete first two lines (that contain column heads)
$ cut -f1,22 test.tab | sed -e 's/^/>/' | awk 'BEGIN { OFS = "\n" } { $1=$1; print }' | tail -n +3 > new_test.tab
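To make the effect of this one-liner easier to follow, the bash sketch below wraps the same `cut | sed | awk | tail` chain in a function and runs it on a toy table. The function name, the toy data, and the use of column 3 in place of column 22 are all illustrative assumptions. The header row becomes two lines once `awk` splits each record, which is why `tail -n +3` removes it.

```shell
#!/usr/bin/env bash
# Convert an ID column and a sequence column of a tab-delimited table to FASTA.
# Assumes one header row and no spaces inside IDs or sequences
# (awk splits on any whitespace by default).
tab_to_fasta() {
    local id_col=$1 seq_col=$2 tab=$3
    cut -f"${id_col},${seq_col}" "$tab" \
        | sed -e 's/^/>/' \
        | awk 'BEGIN { OFS = "\n" } { $1 = $1; print }' \
        | tail -n +3
}

# Toy three-column table standing in for the obitab output (made-up data).
toy=$(mktemp)
printf 'id\tcount\tsequence\nseq1\t12\tACGT\nseq2\t10\tGGCC\n' > "$toy"
tab_to_fasta 1 3 "$toy"
rm -f "$toy"
```

On the toy table this prints two FASTA records, `>seq1` followed by `ACGT` and `>seq2` followed by `GGCC`.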

#Run remote BLAST search
$ ./ncbi-blast-2.7.1+/bin/blastn -query new_test.tab -remote -db nr -outfmt '6 qseqid stitle pident evalue' -num_alignments 1 -out new_test2.blast

#Sort and count the output
$ cut -f2 new_test2.blast | sort | uniq -c | sort | tail -n 100
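The counting step can be wrapped in a small function. The sketch below is a variation on the command above, not the original: it sorts the `uniq -c` output numerically on the count field so that the ordering does not depend on how the counts are padded. The function name and toy data are hypothetical.

```shell
#!/usr/bin/env bash
# List the N most frequent subject titles (column 2) in a tabular BLAST
# output, least frequent first, each line prefixed by its count.
top_hits() {
    local blast=$1 n=${2:-100}
    cut -f2 "$blast" | sort | uniq -c | sort -k1,1n | tail -n "$n"
}

# Toy BLAST table with made-up titles and scores.
toy=$(mktemp)
printf 'q1\tSpecies A\t99.0\t1e-50\nq2\tSpecies A\t98.5\t1e-48\nq3\tSpecies B\t97.0\t1e-40\n' > "$toy"
top_hits "$toy" 2
rm -f "$toy"
```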
