University of Calgary PRISM: University of Calgary's Digital Repository

Graduate Studies The Vault: Electronic Theses and Dissertations

2019-01-07 Conservation genomics of the endangered Banff Springs Snail (Physella johnsoni) using Pool-seq

Stanford, Brenna

Stanford, B. (2019). Conservation genomics of the endangered Banff Springs Snail (Physella johnsoni) using Pool-seq (Unpublished master's thesis). University of Calgary, Calgary, AB. http://hdl.handle.net/1880/109445 master thesis

University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission. Downloaded from PRISM: https://prism.ucalgary.ca

UNIVERSITY OF CALGARY

Conservation genomics of the endangered

Banff Springs Snail (Physella johnsoni) using Pool-seq

by

Brenna C.M. Stanford

A THESIS

SUBMITTED TO THE FACULTY OF GRADUATE STUDIES

IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE

DEGREE OF MASTER OF SCIENCE

GRADUATE PROGRAM IN BIOLOGICAL SCIENCES

CALGARY, ALBERTA

JANUARY 2019

© Brenna C.M. Stanford 2019

Abstract

Understanding how organisms persist and adapt to local habitats is a fundamental question for species of conservation concern. Located in Banff National Park, the endangered snail, Physella johnsoni, inhabits seven highly specialized thermal springs. P. johnsoni undergoes yearly population bottlenecks with minimal to no dispersal among springs. The consequences of these processes for genetic population structure are unknown. To investigate the effects of habitat and life history on the P. johnsoni genome and to test the hypothesis of a single panmictic population, I collected 20 to 40 snails per population for P. johnsoni and for a closely related snail, P. gyrina, found in adjacent, non-thermal water. Using whole-genome pooled sequencing, millions of single nucleotide polymorphisms were captured. These genetic variants resolved significant genetic divergence between P. johnsoni and P. gyrina. In addition, I detected distinct genetic clusters and reduced nucleotide diversity within each spring, indicative of strong micro-geographical population structure and suggestive of a role for genetic drift. These results suggest that P. johnsoni from each spring represent a distinct genetic unit. This has conservation implications for the designation of designatable unit status under COSEWIC, and suggests that mixing of snails may reduce the consequences of genetic drift.

Acknowledgments

To my fantastic supervisor, Sean, I cannot begin to thank you enough for your guidance, patience, and humour. I am so excited for the opportunity to keep working with you. I am incredibly grateful for the wonderful group of people you have brought into this lab and for the environment that you foster. To Danielle, James, Jessy, Jori, Sara, Tegan, and Teresa, you are truly amazing as scientists, leaders, and people, and I am so fortunate to call you friends. Thank you for putting up with my distractions, tears and countless questions. Thank you for the many laughs, deep conversations, hugs and coffee. This work would not be anywhere close to this point without all of your scientific knowledge and support.

To my committee members, Dwayne Lepitzki and Jana Vamosi – a huge thank you for everything you’ve done to support me in this work! A special thank you, Dwayne, for braving the cold, the snow and the heat and mosquitos, to make sure I not only lived through but thoroughly enjoyed my field season.

To Parks Canada, specifically Mark Taylor, thank you so much for bringing me onto this project. It has been an absolute pleasure working with you.

To my family, where do I even begin? Thank you for always believing in me, knowing when to step in, and when to let me find my own way. You challenge me, support me and make me laugh so hard. I will never be able to tell you how grateful I am for everything you’ve done and continue to do for me. And last but not least, thank you, Peter. You have been so incredibly understanding, supportive and I love you so very much.

Table of Contents

ACKNOWLEDGMENTS

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

CHAPTER 1 GENERAL INTRODUCTION

1.1 INTRODUCTION
1.2 STUDY SYSTEM
1.3 OBJECTIVES

CHAPTER 2 CONSERVATION GENOMICS IN THE BANFF SPRINGS SNAIL

2.1 INTRODUCTION
2.2 METHODS
2.2.1 Sampling
2.2.2 DNA extraction
2.2.3 DNA quantification and quality check
2.2.4 Constructing DNA pools for Pool-seq
2.2.5 DNA sequencing
2.2.6 Genomic analysis
2.2.7 Pairwise FST
2.2.8 Nucleotide diversity
2.3 RESULTS
2.3.1 DNA extraction, quantification and quality
2.3.2 DNA sequencing and pre-processing
2.3.3 Pairwise FST
2.3.4 Nucleotide diversity
2.4 DISCUSSION
2.4.1 Population structure and nucleotide diversity between P. johnsoni and P. gyrina populations
2.4.2 Population structure and nucleotide diversity within P. johnsoni and P. gyrina populations
2.4.3 Broader implications and conservation recommendations
2.4.4 The utility of Pool-seq in conservation
2.4.5 Caveats
2.4.6 Conclusions

CHAPTER 3 GENERAL CONCLUSIONS

REFERENCES

APPENDIX A: GENOMIC ANALYSIS PIPELINE

APPENDIX B: DNA AND SEQUENCING QUALITY

APPENDIX C: POPOOLATION2 PAIRWISE FST ESTIMATES

List of Tables

Table 2.1 Number of SNPs within each population and used in pairwise comparisons between populations determined by Poolfstat with the ‘Anova’ method for a minimum read count of one, minimum coverage of 20, a maximum coverage of 200, and a minor allele frequency of 0.05. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

Table 2.2 Pairwise FST between all populations determined using Poolfstat with the ‘Anova’ method for a minimum read count of one, minimum coverage of 20, a maximum coverage of 200, and a minor allele frequency of 0.05. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

Table C.1 Pairwise FST between all populations calculated by PoPoolation2. Pairwise FST was calculated for 250 bp side-by-side windows, a minor allele count of 8, minimum coverage of 15 and maximum coverage of 200, where the entire window was required to meet coverage specifications. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

List of Figures

Figure 1.1 Schematic illustrating where genomic data are required in conservation management plans. Namely for resolving taxonomic ambiguity, assigning DUs/ESUs and for characterizing the population genomic consequences of increasing threats.

Figure 1.2 Schematic of a population bottleneck. The different colours represent genetic variation, where some is randomly lost with the reduction in population size. This decrease in genetic variation is observed even when population numbers increase.

Figure 1.3 Schematic of Pool-seq preparation and sequencing. Equal amounts of DNA (ng) from each individual of the population are combined and the individual is “lost”. The same adaptor is ligated to all of the DNA from that population to distinguish it from other populations in sequencing and in analysis.

Figure 2.1 Range map and sample populations for Physella johnsoni and sample populations for Physella gyrina, Banff National Park, Alberta, Canada. P. johnsoni - Cave Spring (J1), Basin Spring (J2), Lower Cave & Basin Spring (J3), Upper Cave & Basin Spring (J4), Lower Middle Spring (J5), Upper Middle Spring (J6) (not used in this study) and Kidney Spring (J7) (not used in this study). P. gyrina - Cave & Basin Marsh (G1), Five Mile Pond (G2) and Muleshoe Pond (G3).

Figure 2.2 Total number of P. johnsoni from January 1996 to September 2017. Population counts were taken once every three weeks until August 2000 and then once every four weeks until September 2016, when population counts were ended. From April to September 2017 and September 2018 the counts were resumed. Original springs include J1, J2, J3, J4 and J5. The re-established springs are J6 and J7. Modified from COSEWIC 2018 by Dr. Dwayne Lepitzki.

Figure 2.3 Principal coordinate analysis for all pairwise FST between P. johnsoni and P. gyrina populations calculated by Poolfstat with the ‘Anova’ method for a minimum read count of one, minimum coverage of 20, a maximum coverage of 200, and a minor allele frequency of 0.05. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

Figure 2.4 Averaged nucleotide diversity for all P. johnsoni and P. gyrina populations calculated by PoPoolation2 over 250 bp side-by-side windows, a minor allele count of 4, minimum coverage of 20 and maximum coverage of 200, where 60% of the window was required to meet coverage specifications. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

Figure B.1 Pooled DNA (5 µL) for each population pre-dilution for sequencing preparation run through 1% agarose gel with 3 µL of NEB 1 kb DNA ladder. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

Figure B.2 Basic statistics and per base sequence quality component of FASTQC report for J3 (Lower Cave & Basin Spring) raw, reverse sequences (pre-Trimmomatic).

Figure B.3 Basic statistics and per base sequence quality component of FASTQC report for J3 (Lower Cave & Basin Spring) filtered and trimmed reverse sequences (post-Trimmomatic).

CHAPTER 1 GENERAL INTRODUCTION

1.1 INTRODUCTION

As habitats continue to change, largely due to anthropogenic impacts, there is an associated massive loss of biodiversity and an increasing number of threatened and endangered species (Frankham 2005; Butchart et al. 2010). Habitat fragmentation, habitat loss, introduction of invasive species, and over-exploitation can leave populations vulnerable to natural disasters, demographic stochasticity, and environmental change (Shaffer 1981; Frankham 2005). This biodiversity loss, at the genetic, species and ecosystem levels, has incredibly harmful impacts on human society (Cardinale et al. 2012; Hooper et al. 2012). Decreases in biodiversity lower the productivity and services of ecosystems (e.g., wood production, carbon sequestration, soil mineralization), and biodiversity loss can have detrimental impacts on ecosystem function similar to those of other forms of environmental change (e.g., climate warming, acidification) (Cardinale et al. 2012; Hooper et al. 2012). These factors highlight the need for conservation management to devise and implement effective and timely management plans. However, with limited time and resources, the choice of habitats and species to conserve remains a significant factor (Martin et al. 2012; Carwardine et al. 2018). Limited conservation resources may need to be directed towards prioritized species and populations based on factors such as ecological function and/or evolutionary significance (e.g., Joseph et al. 2009; Funk et al. 2012; Carwardine et al. 2018). However, determining the priority of a species or population can be extremely difficult due to a variety of factors, including the difficulty of assessing extinction risk. Altogether, conservation biology is faced with increasing biodiversity losses combined with intensifying data deficiencies.

Data on species and their environment are needed for effective conservation. Data deficiencies render fundamental questions about conservation status difficult to answer. The International Union for Conservation of Nature (IUCN) considers a species data deficient if there is insufficient information on its taxonomic status, the threats to or status of its populations, and/or its distribution (Bland et al. 2015; Parsons 2016). Priority management and funding are largely allocated to species of conservation concern when these data are present, whereas data deficient species typically receive lower priority (Morais et al. 2013; Howard & Bickford 2014; Parsons 2016). Consequently, there can be a taxonomic bias in terms of data availability, as rare, cryptic or non-charismatic organisms (e.g., invertebrates, which make up the majority of global biodiversity) are largely data deficient (Howard & Bickford 2014; Régnier et al. 2015; Cowie et al. 2017). In addition, over 60% of data deficient species are likely threatened by extinction (Howard & Bickford 2014; Bland et al. 2015). If data deficient species are considered, up to 7% of described species may have already been lost since 1550, compared to the 0.04% listed by the IUCN (Régnier et al. 2015). Overall, such losses exemplify the need to develop more rapid and appropriate tools towards more effective conservation management plans.

Genomics is one such tool that can be used in conjunction with other means to address these challenges (Figure 1.1). It has become increasingly clear that genetic diversity is essential for the long-term viability of species and that failure to protect it will undermine actions to protect biodiversity at the ecosystem and species levels (Frankel 1974; Laikre 2010). In response to environmental change, it is in part genetic variation, whether de novo mutations (a slow response), standing genetic variation (i.e., variation present in a population or species but not under selection until the environment changes), or a combination of both, that may facilitate population persistence (e.g., Frankel 1974; Barrett et al. 2008; Morris et al. 2014). For example, a recent study on Tasmanian devils (Sarcophilus harrisii) demonstrated rapid selection on standing genetic variation in response to facial tumour disease (Margres et al. 2018). The disease has caused declines upwards of 80% and is almost always fatal, resulting in the species being listed as endangered (McCallum 2008; Storfer et al. 2018). However, the observed rapid adaptation has conferred greatly improved survival, where a few loci explain over 80% of the variation in female survival (Margres et al. 2018). Yet most policies and management plans do not prioritize genetic diversity or base decisions on it (Laikre 2010). Maintaining and promoting genetic diversity requires characterizing the biological processes that threaten this variation and integrating techniques that facilitate regular monitoring. Altogether, testing ecological and evolutionary predictions about genetic diversity in association with conservation objectives will help inform policy.

One of the primary aims of conservation management is to maintain or increase population numbers. However, hidden evolutionary genetic threats may undermine management plans and precipitate population collapse (Figure 1.1). For example, in the iconic Isle Royale system, moose (Alces alces) and wolf (Canis lupus) populations had been presumably stable for ~30 years before the wolf population suffered a major crash (Peterson et al. 1998). Though the original crash was predicted to be due to lack of food, the population never recovered due to disease and severely decreased genetic diversity (Peterson et al. 1998). In fact, the wolf population had such low fitness that a single out-bred male immigrant in 1997 resulted in all individuals born having ~50% of their ancestry from him in a little over a generation, compared to the ~14% that would have arisen under equal fitness (Adams et al. 2011). Even with this influx of genetic material, the wolf population continued to decrease (Hedrick et al. 2014) to just two wolves by 2018: a father and daughter, who are half siblings. This is a clear example of decreased genetic diversity playing a role in preventing wild systems from buffering environmental change. It is an important reminder that as population numbers diminish, increased mating between related pairs will decrease the fitness of offspring, a phenomenon known as inbreeding depression (Edmands 2007). However, the threshold at which inbreeding occurs, its impacts on the health of a population or species, and the ability of that population or species to avoid or recover from inbreeding are not universal (Keller & Waller 2002) and may be managed. Such management plans require thorough pedigrees or genetic estimates of relatedness to mitigate the effects (Vilà et al. 2003). Additionally, even if inbreeding is not occurring, the same level of decreased fitness can be reached due to, or in conjunction with, other forces (Hedrick & Kalinowski 2000). Small populations are particularly vulnerable to genetic drift, the random loss of alleles that causes fixation (Bouzat 2010). With fewer individuals, the probability of losing minor alleles increases (Bouzat 2010). Populations with small effective population sizes (Ne) have a limited number of individuals contributing genetically to the next generation, resulting in decreased genetic variability (de Oliveira et al. 2006). Environmental, biological and human-driven population declines may cause the random loss and subsequent reduction of genetic diversity (Weber et al. 2004; Bouzat 2010). Through concerted conservation efforts, population numbers and Ne may recover, as seen with the northern elephant seal (Hoelzel et al. 1993). Hunted almost to extinction in the 19th century, the northern elephant seal (Mirounga angustirostris) has seen tremendous recovery in its population numbers; however, its genetic diversity has remained extremely low due to the genetic “bottleneck” it underwent (Hoelzel et al. 1993; Bouzat 2010) (Figure 1.2). Detecting and monitoring losses of genetic diversity is essential in mitigating their detrimental effects, and robust genomic data are integral to achieving this.
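The persistence of low diversity after a bottleneck can be illustrated with a minimal sketch. It assumes the standard expectation for an idealized diploid population, in which heterozygosity is multiplied by (1 - 1/(2Ne)) each generation and mutation is ignored. The function name, the starting heterozygosity and the Ne values are hypothetical and are not taken from this thesis.

```python
def heterozygosity_trajectory(h0, ne_per_generation):
    """Expected heterozygosity after each generation of drift.

    Each generation multiplies heterozygosity by (1 - 1/(2*Ne)), the
    expectation for an idealized diploid population without mutation.
    """
    trajectory = [h0]
    for ne in ne_per_generation:
        if ne <= 0:
            raise ValueError("effective population size must be positive")
        trajectory.append(trajectory[-1] * (1.0 - 1.0 / (2.0 * ne)))
    return trajectory


# Hypothetical scenario: 5 generations at Ne = 1000, a 5-generation
# bottleneck at Ne = 10, then 5 generations of recovery at Ne = 1000.
ne_values = [1000] * 5 + [10] * 5 + [1000] * 5
trajectory = heterozygosity_trajectory(0.30, ne_values)

before = trajectory[5]
after_bottleneck = trajectory[10]
after_recovery = trajectory[15]
print(f"before bottleneck:  {before:.4f}")
print(f"after bottleneck:   {after_bottleneck:.4f}")
print(f"after 'recovery':   {after_recovery:.4f}")
```

Under these assumptions the expected heterozygosity never rebounds when Ne increases again; it only declines more slowly, which is the pattern the bottleneck schematic in Figure 1.2 describes.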

Genetic rescue, the translocation of individuals from one population into another (Ingvarsson 2001), has been proposed to circumvent the impacts of low genetic diversity. The introduction of new genetic material via immigrants has been shown to rescue declining populations and increase composite fitness (Vilà et al. 2003; Edmands 2007). However, complications with genetic rescue can include divergence of isolated populations that are locally adapted to their specific ecological habitat (discussed below). In these cases, attempts to increase genetic diversity may disrupt locally adapted genes or gene complexes, decreasing fitness (outbreeding depression) (Edmands 2007). Without the integration of genetic and genomic data into conservation planning, management decisions may unintentionally be detrimental to species recovery.

Another aspect of conservation in which genomics is helpful as a tool is the assignment and investigation of population and species structure (Figure 1.1). Policy decisions on how best to allocate funds and time are based in part on the distinctiveness or taxonomic standing of a species (Isaac et al. 2004; Mace 2004; Joseph et al. 2009), and protection for certain populations is influenced by showing that they are distinct from others (Kell et al. 2009). Management plans must endeavour to incorporate methods that account for genetically or ecologically unique populations that warrant specialized management (Funk et al. 2012). Evolutionarily significant units (ESUs) have a variety of definitions but can be described broadly as “a population or group of populations that warrant separate management or priority for conservation because of high genetic and ecological distinctiveness” (Funk et al. 2012). The Committee on the Status of Endangered Wildlife in Canada (COSEWIC) includes ESUs in its designation of designatable units (DUs), where a population or group of populations must meet one or more of COSEWIC’s criteria for “discreteness” and for “significance” (COSEWIC 2015). Two criteria used by COSEWIC to determine discreteness (please see COSEWIC 2015 for full criteria) are whether a population or group of populations has clear genetic distinctiveness and/or local adaptive differences (COSEWIC 2015). Once a population or group of populations is determined to be discrete, two criteria that may be used to deem it significant are whether genetic markers illustrate a clear phylogenetic divergence and/or whether it exists in an “ecological setting unusual or unique to the species, such that it is likely or known to have given rise to local adaptations” (COSEWIC 2015). DUs and ESUs aim to protect sub-species variability that is often missed by traditional taxonomy (Mee et al. 2015). Losing these vital populations can increase the total risk of extinction, as DUs and ESUs may harbor the genetic variability necessary to evolve with environmental change (Ceballos & Ehrlich 2002; Funk et al. 2012; Mee et al. 2015).

However, delineation of populations, DUs and/or ESUs can be difficult over time and space, confounding accurate data for conservation. For example, newly colonized habitats may be inhabited by populations that only recently diverged; alternatively, some populations occupying the same habitat may look the same but be genetically cryptic for demographic reasons (Bull et al. 2013). Genetic estimates of population differentiation can help elucidate the underlying genetic differences among populations. Wright (1950) developed an index of genetic differentiation based on levels of heterozygosity (FST). Populations that are not genetically differentiated will have similar allele frequencies and comparable expected levels of heterozygosity and, therefore, a correspondingly small FST, while populations that are differentiated will have dissimilar frequencies and increasing disequilibrium of heterozygosity and, therefore, a larger FST (Holsinger & Weir 2009). This measure can be useful in distinguishing cryptic population structure and determining isolation between population pairs, especially when based on genome-wide estimates of FST from numerous genetic loci or markers. For example, genome-wide markers have been used to resolve phylogenetic structure in populations long considered as panmictic from mosquitos (Wyeomyia smithii) (Emerson et al. 2010) to marine threespine stickleback (Gasterosteus aculeatus) (Morris et al. 2018). Altogether, establishing patterns of genetic differentiation (measured by FST) is a conventional first step to understanding the nature of how organisms are distributed in time and space.
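The heterozygosity-based logic behind FST can be made concrete with a small sketch. It uses the textbook form FST = (HT - HS) / HT for a single biallelic locus in two equally weighted populations, where HS is the mean expected heterozygosity within populations and HT is the expected heterozygosity of the pooled allele frequency. This is a simplified illustration only; it is not the Poolfstat estimator named in the List of Tables, and the function name and allele frequencies are hypothetical.

```python
def fst_two_populations(p1, p2):
    """Textbook FST = (HT - HS) / HT for one biallelic locus.

    p1, p2: frequency of the same allele in two equally weighted populations.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    # Mean expected heterozygosity within populations
    hs = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    # Expected heterozygosity of the combined population
    p_bar = (p1 + p2) / 2.0
    ht = 2.0 * p_bar * (1.0 - p_bar)
    if ht == 0.0:
        return 0.0  # locus is monomorphic overall, so no differentiation to measure
    return (ht - hs) / ht


# Hypothetical allele frequencies for illustration
print(fst_two_populations(0.5, 0.5))  # identical frequencies
print(fst_two_populations(0.2, 0.8))  # dissimilar frequencies
print(fst_two_populations(1.0, 0.0))  # fixed for alternative alleles
```

Populations with similar allele frequencies give values near zero, and populations fixed for alternative alleles give a value of one, matching the verbal description above.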

High FST values for certain genomic regions compared to the overall genomic background can be used to uncover putatively adaptive differences among populations. The hypothesis is that locus-specific divergent or directional selection will maintain genetic differentiation among populations at the selected locus relative to other non-selected loci (Holsinger & Weir 2009; Whitlock & Lotterhos 2015). However, identifying candidate genes in this way must be treated as a first step towards testing predictions about their potential adaptive role, because genetic drift, especially in populations with small Ne, can also cause the fixation of alleles and produce the same genetic pattern (Holsinger & Weir 2009). The influence of population structure and demography, and the possibility of missing or incorrect environmental data, must also be considered (Hoban et al. 2016). Measures of pairwise FST alone are unable to distinguish the nature of the selective force causing variation in allele frequency (Lotterhos & Whitlock 2015). It is necessary to consider any genetic variation uncovered in the context of what is known for the species to develop a mechanistic understanding of the impact of the variation at all biological levels (Dalziel et al. 2009). With these factors in mind, it is possible to design experiments to capture the genetic basis behind clear phenotypic and adaptive differences (Rogers & Bernatchez 2007; Dennenmoser et al. 2017). The relevance for conservation management is that such genomic approaches can highlight the presence of adaptive variation that may be contributing to local adaptation of certain at-risk populations and help characterize their evolutionary significance (Funk et al. 2016).
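The comparison of regional FST against the genomic background can be sketched as a simple empirical ranking. The example below flags the windows in the top fraction of the genome-wide FST distribution; the function name, the 5% cut-off and the window values are hypothetical. Consistent with the caveats above, a flagged window is only a candidate for follow-up, since drift in small populations can produce the same pattern.

```python
def flag_fst_outliers(window_fst, top_fraction=0.05):
    """Return the names of windows in the top fraction of the FST distribution.

    window_fst: mapping of window name -> FST estimate for that window.
    """
    if not 0.0 < top_fraction <= 1.0:
        raise ValueError("top_fraction must be in (0, 1]")
    # Rank windows from highest to lowest FST
    ranked = sorted(window_fst.items(), key=lambda item: item[1], reverse=True)
    # Always flag at least one window so small inputs still return a candidate
    n_flagged = max(1, round(len(ranked) * top_fraction))
    return [name for name, _ in ranked[:n_flagged]]


# Hypothetical data: 19 background windows and one strongly differentiated window
window_fst = {f"window_{i}": 0.10 + 0.001 * i for i in range(19)}
window_fst["window_19"] = 0.85

candidates = flag_fst_outliers(window_fst)
print(candidates)
```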

Though there have been many papers reviewing and promoting the integration of genomics in conservation (Allendorf et al. 2010; Ouborg et al. 2010; Funk et al. 2012), it has been largely confined to academia with very few concrete examples of it impacting management decisions (Shafer et al. 2015). This may largely be due to the cost and level of expertise necessary to generate and process genomic datasets (Shafer et al. 2015), but reflection on the integration of genomics in conservation is appropriate, especially on how to make genomic tools more accessible.

One of the main considerations in estimating genetic variation is the choice of genetic marker. The genetic markers initially used in conservation were, and remain, limited because they are few in number and are not distributed throughout the genome (Ouborg et al. 2010). With the decreasing cost of Next-Generation sequencing, capturing thousands to millions of genome-wide single nucleotide polymorphisms (SNPs) is now possible. The sheer number of polymorphic loci has facilitated more accurate estimates of genetic variation and the detection of finer-scale population structure (Emerson et al. 2010; Shafer et al. 2015; Morris et al. 2018). Because they are distributed throughout the genome, SNPs can be used to detect both neutral and potentially adaptive genetic variation, helping resolve taxonomic ambiguity and assign ESUs (Funk et al. 2012; Shafer et al. 2015). For genomics to be integrated efficiently and applied to real conservation dilemmas, conservation management practitioners must push for its use and collaborate with genomic researchers (Shafer et al. 2015), with continuous communication between the two (Lundmark et al. 2019).

Next-Generation sequencing encompasses many high-throughput sequencing methods for capturing polymorphisms. Some commonly used methods include the reduced-representation approaches RADseq (Baird et al. 2008) and ddRADseq (Peterson et al. 2012), where the DNA is treated with one or two restriction enzymes that cleave the DNA before sequencing, returning small regions throughout the genome. SNP-Chips comprise another method, whereby small sequences are physically bound to a glass slide and hybridize with DNA to capture known SNPs (Lien et al. 2011). However, while these methods still generate thousands of markers, they do not capture the entire genome. In my project, I used a technique called Pool-sequencing (Pool-seq) (Futschik & Schlötterer 2010). Though whole genome sequencing of individuals has decreased substantially in price, it is not financially viable to sequence the number of individuals necessary to answer population-level questions (Futschik & Schlötterer 2010). Pool-seq involves pooling equal amounts of DNA from multiple individuals per population and sequencing them as if they were one “individual” (Figure 1.3) (Futschik & Schlötterer 2010). With a sufficiently high number of individuals and careful data pre-processing, it has been shown to estimate allele frequencies more accurately than individual sequencing (Futschik & Schlötterer 2010; Kofler et al. 2011a; Gautier et al. 2013). Attention needs to be given when pooling individuals so that each is represented equally (Schlötterer et al. 2014); however, newly developed models for estimating FST have been shown to be robust to unequal representation, pool size and coverage depth (Hivert et al. 2018). Due to the loss of the individual, estimates of Ne and migration are not possible; however, the number of genome-wide SNPs captured makes Pool-seq very powerful in detecting nucleotide diversity, population structure and potentially adaptive differences (though these should be interpreted with caution (Anderson et al. 2014)). Overall, very few studies have applied this method in a conservation context, so further research on this cost-effective method for conservation is needed.
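To show why both pool size and read depth matter for Pool-seq allele frequencies, the sketch below treats the estimate as the result of two sampling steps: 2n chromosomes drawn from the population, then c reads drawn from the pool. Assuming equal contribution from each individual, independent (binomial) sampling of reads and no sequencing error, the variance of the estimated frequency is p(1 - p)[1/(2n) + 1/c - 1/(2nc)]. The function names and the read counts, pool size and depths are hypothetical and are not values from this thesis.

```python
def pool_allele_frequency(alt_reads, total_reads):
    """Estimate an allele frequency as the fraction of reads carrying the allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def pool_frequency_variance(p, n_diploid, depth):
    """Sampling variance of a Pool-seq allele frequency estimate.

    Two-stage sampling: 2 * n_diploid chromosomes from the population,
    then `depth` reads from the pool (assumes equal individual contribution
    and binomial read sampling).
    """
    chromosomes = 2 * n_diploid
    return p * (1.0 - p) * (
        1.0 / chromosomes + 1.0 / depth - 1.0 / (chromosomes * depth)
    )


# Hypothetical site: 12 of 40 reads carry the alternate allele, pool of 30 snails
p_hat = pool_allele_frequency(12, 40)
for depth in (20, 40, 200):
    variance = pool_frequency_variance(p_hat, n_diploid=30, depth=depth)
    print(f"depth {depth:>3}: variance = {variance:.5f}")
```

Under these assumptions the variance falls as depth increases but cannot drop below p(1 - p)/(2n), the contribution from sampling a finite number of individuals, which is one reason the number of individuals pooled matters as much as sequencing effort.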

One group of organisms that is in desperate need of effective and timely management plans is molluscs. Though deemed to be one of the most imperiled taxa (Régnier et al. 2009; Johnson et al. 2013; Cowie et al. 2017), only 7,276 of an estimated 70,000 to 76,000 species (Rosenberg 2014) had been assessed by the International Union for Conservation of Nature (IUCN) as of 2016 (Cowie et al. 2017). Of these, 34% (2,463 species) were deemed too data deficient for a formal assessment (Cowie et al. 2017). These numbers actually place molluscs at a high proportion of assessed species compared to all invertebrates, of which only 1.2% have been assessed overall (Cowie et al. 2017). In comparison, however, all mammal and bird species recognized by the IUCN have been assessed and only ~5% were deemed data deficient (Cowie et al. 2017). And yet, even though molluscs are deeply under-represented in terms of conservation assessment by the IUCN, they still made up roughly 40% of the species listed as extinct in the third issue of the 2016 Red List (Cowie et al. 2017). Their decline is heavily impacted by habitat loss and degradation, with very little known about their response to toxins or chemicals in aquatic systems (Régnier et al. 2009; Johnson et al. 2013). However, as the primary grazer in many habitats and an important food source for many species, their loss has the potential to cause drastic shifts in the many ecosystems they inhabit (Régnier et al. 2009; Johnson et al. 2013). Loss of integral components at the bottom of the ecosystem has large bottom-up effects, which impact all members of the ecosystem. Not surprisingly, genomic information for molluscs is exceedingly lacking. A clear representation of this is that there are only 23 available reference genomes (https://www.ncbi.nlm.nih.gov/genome, txid6447[ORGN]) compared to the 522 for vertebrates hosted on NCBI (https://www.ncbi.nlm.nih.gov/genome, txid7742[ORGN]).

1.2 STUDY SYSTEM

The endangered Banff Springs Snail, Physella johnsoni, is found only in seven thermal springs, characterized by high water temperature and hydrogen sulphide and low dissolved oxygen and pH, in Banff National Park, Alberta, Canada (COSEWIC 2008). It is listed as endangered under Canada’s Species At Risk Act (SARA), in part due to an incredibly small distribution. P. johnsoni’s habitat encompasses less than 600 m² located in a total range and area of occupancy of just 8 km², well under the 5,000 km² and 500 km², respectively, necessary to meet endangered status (Criterion B) (COSEWIC 2014, 2018). Additionally, the P. johnsoni population in each thermal spring undergoes large fluctuations in the number of mature individuals, with yearly population bottlenecks causing changes of up to two orders of magnitude (COSEWIC 2014, 2018). In conjunction with predictions that they are hermaphroditic and have low amounts of gene flow between the thermal springs, there is concern about a lack of genetic diversity (Lepitzki & Pacas 2010). They are currently managed as a unique species, with genetic analysis of a few markers showing differentiation between them and a much more common snail, Physella gyrina (Lepitzki 1998; Remigio et al. 2001). However, some research using the same markers suggests that P. johnsoni and P. gyrina are synonymous with each other (Wethington & Guralnick 2004). This taxonomic ambiguity hinders the proper management of P. johnsoni (Daugherty et al. 1990; Mace 2004). If P. johnsoni were synonymized with P. gyrina, it would likely be re-assessed for DU status (COSEWIC 2008), where evidence of discreteness and significance would need to be provided. Limited numbers of genetic markers are unable to provide the resolution necessary to distinguish P. johnsoni from P. gyrina and to resolve patterns of population structure.

1.3 OBJECTIVES

My objective in this thesis was to produce a genomic dataset for use in the conservation management of Physella johnsoni. The genomic data and resulting analyses will provide higher resolution of taxonomic designation and population structure than previously achieved through morphology and single-marker sequencing. They will be integrated into Parks Canada’s management plan and used to advise how best to manage the species to help ensure its continued persistence.

Figure 1.1 Schematic illustrating where genomic data are required in conservation management plans, namely for resolving taxonomic ambiguity, assigning DUs/ESUs and characterizing the population genomic consequences of increasing threats.

Figure 1.2 Schematic of a population bottleneck. The different colours represent genetic variation, some of which is randomly lost with the reduction in population size. This decrease in genetic variation persists even when population numbers subsequently increase.

Figure 1.3 Schematic of Pool-seq preparation and sequencing. Equal amounts of DNA (ng) from each individual in the population are combined, so that individual identity is “lost”. The same adaptor is ligated to all of the DNA from that population to distinguish it from other populations during sequencing and analysis.

CHAPTER 2 CONSERVATION GENOMICS IN THE BANFF SPRINGS SNAIL

2.1 INTRODUCTION

We are currently in the midst of massive biodiversity loss in association with anthropogenic impact, habitat fragmentation, habitat loss, invasive species, over-exploitation and environmental change (Shaffer 1981; Butchart et al. 2010). These factors have contributed to an alarming increase in numbers and rates of threatened and endangered species (Régnier et al. 2009, 2015). There is a critical need for effective conservation management plans, where conservation practitioners must have the tools to rapidly elucidate and assess population structure and distribution, towards determining whether populations and/or a species meet criteria to be considered priorities for conservation and to help characterize the threats faced by species at risk (Figure 1.1) (Funk et al. 2012; Guisan et al. 2013; Shafer et al. 2015). With the integration of genomics, practitioners have an unprecedented ability to resolve patterns of genetic diversity within and between populations and species and to inform on these vital aspects of conservation management (Figure 1.1) (Shafer et al. 2015). However, there are still limited examples where conservation genomics has been shown to actually impact policy or management decisions (Shafer et al. 2015).

Found only in Banff National Park, Alberta, Canada, with a global range of just 594.4 m2 (COSEWIC 2008), the Banff Springs Snail (Physella johnsoni) (Clench 1926) embodies the challenges faced by conservation biologists. It became the first living mollusc to be listed by the Committee on the Status of Endangered Wildlife in Canada (COSEWIC) as threatened in 1997, and in 2000 was re-assessed as endangered (COSEWIC 2008). Globally, molluscs have been determined to be one of the most imperiled taxa (Régnier et al. 2009; Johnson et al. 2013; Cowie et al. 2017); however, the majority remain unassessed for conservation (Cowie et al. 2017). P. johnsoni belongs to the family Physidae, a family of about 80 species of freshwater, air-breathing snails found throughout the Holarctic region and into Central and South America (Taylor 2003; Wethington & Lydeard 2007). Currently, ~55% of North American Physidae are at risk, alongside the vast majority of other freshwater snails, partially because of rapid habitat changes or loss due to human interference and/or environmental changes (Johnson et al. 2013).

Several factors contribute to the conservation risks faced by P. johnsoni. Their entire global range consists of seven thermal springs characterized by high water temperature and hydrogen sulphide content, and low dissolved oxygen content and pH (Grasby & Lepitzki 2002; COSEWIC 2008) (Figure 2.1). The thermal springs are located along the Sulphur Mountain Thrust fault, in three elevation groups (Grasby & Lepitzki 2002). The lowest elevation group (~1400 m) consists of four thermal springs located within a few hundred metres of each other: the Cave (isolated from the others except for a small hole in the top), the Basin, and the Lower and Upper Cave and Basin Springs (Figure 2.1) (Grasby & Lepitzki 2002). The middle elevation group (~1500 m) is located about 1 km up Sulphur Mountain and consists of Lower and Upper Middle Springs (Figure 2.1), West Cave and Gord’s Spring, though it is uncertain whether the physids currently residing in West Cave or Gord’s Spring are P. johnsoni (Grasby & Lepitzki 2002; COSEWIC 2008, 2018). The highest elevation group consists of Kidney Spring (1588 m) (Figure 2.1) and the Upper Hot Spring (1584 m), where the snail is extirpated (Grasby & Lepitzki 2002). P. johnsoni were originally found in 11 thermal springs; however, they ceased to exist in six due to water stoppages or human interference (COSEWIC 2008). After water flow resumed in Kidney and Upper Middle Springs, snails were successfully re-established in 2002 and 2003, respectively, resulting in the seven currently inhabited thermal springs (COSEWIC 2008).

The taxonomic designation of many physids, including P. johnsoni, is strongly debated. Wethington & Lydeard (2007) state that physid species in North America are over-represented by more than 50%. This is in part due to classification being heavily based on morphological traits (e.g., shell morphology and penial structure). Though P. johnsoni was found to be significantly more globose and to have a longer spire than P. gyrina (Lepitzki 1998), recent evidence has shown that shell morphology is very plastic in physids. One study found that two morphologically distinct but geographically adjacent populations of physids converged in shell shape within one generation in the laboratory (Gustafson et al. 2014). Moore et al. (2014) found that two populations of physids predicted to be the same species due to the same atypical morphology were more genetically divergent from each other than from a morphologically typical snail, Physella gyrina. While P. johnsoni is currently designated as a species, alternative hypotheses suggest that up to seven different taxa, including P. johnsoni and another endangered, endemic thermal spring physid, P. wrighti (Hotwater Physa), are synonyms of a much more common snail, P. gyrina, and that the observed morphological differences are the result of habitat influence (Wethington & Guralnick 2004; Wethington & Lydeard 2007).

P. johnsoni individuals are seemingly restricted to the area around the origin of each spring (30 to 36°C) (Grasby & Lepitzki 2002). Though the cause of their distribution is unknown, higher densities are correlated with higher temperature and hydrogen sulphide and lower dissolved oxygen and pH (Lepitzki & Pacas 2010). This distribution may be influenced by the concentration of their food sources, algae and bacteria (Grasby & Lepitzki 2002). P. johnsoni are presumed to be hermaphrodites, preferring to out-cross when environmental conditions are favourable (Jarne et al. 2000; COSEWIC 2008). P. johnsoni’s restricted habitat and life history patterns (discussed below) have led to concern about decreased genetic diversity. Of highest concern is that each year the populations (each thermal spring is defined here as a “population”, but as a “sub-population” under COSEWIC) fluctuate by about two orders of magnitude (Lepitzki & Pacas 2010; COSEWIC 2018). Some populations decrease to fewer than 50 individuals in the summer months and reach highs in the thousands in the winter and spring (Figure 2.2) (COSEWIC 2008, 2018). The cause of these population rises and declines has not been determined, but is speculated to be associated with seasonal changes in water chemistry (Grasby & Lepitzki 2002). Whether these changes affect the snails directly, act through the abundance of algae and bacteria, or both, is unknown (Grasby & Lepitzki 2002). As each population decreases to so few individuals (Figure 2.2) (COSEWIC 2018), genetic variation is likely reduced to that contained within the surviving individuals. Even as population numbers increase, the offspring will only contain that genetic variation, resulting in a genetic bottleneck (Bouzat 2010). In small and restricted populations such as those of P. johnsoni, low frequency alleles can be randomly lost, increasing the homozygosity of the population (Bouzat 2010). This causes a reduction of genetic diversity and random fixation of potentially detrimental alleles by a process called genetic drift (Bouzat 2010). It has been well documented that even low amounts of gene flow between populations can mitigate a loss of genetic diversity (Ingvarsson 2001; Vilà et al. 2003). A limited opportunity for genetic mixing likely exists in years of exceedingly high spring run-off, when individuals may be transported from the Upper and Lower Cave and Basin Springs into the Basin Spring (Figure 2.1) (Lepitzki & Pacas 2010). Though never confirmed in P. johnsoni, snails in other freshwater systems have been documented to be transported by birds (Santamaría & Klaassen 2002) and large mammals, causing significantly decreased genetic differentiation between certain populations (Van Leeuwen et al. 2013). Marmots (Marmota caligata) and bears (Ursus arctos) have been observed via surveillance cameras to frequent some of the thermal springs (pers. comm., Dr. Dwayne Lepitzki); however, overall, it is predicted that there is very little dispersal, and likely very little gene flow, among the thermal springs (Lepitzki & Pacas 2010). Due to the intensity of the population bottlenecks and because genetic mixing has yet to be confirmed among populations, decreased genetic diversity is also a strong conservation concern.

Previous studies have attempted to resolve the taxonomic ambiguity and determine the level of genetic differentiation between P. johnsoni and P. gyrina. However, analyses of protein variants (allozymes) (Lepitzki 1998) and sequencing of the COI and 16S mitochondrial genes (Remigio et al. 2001; Wethington & Guralnick 2004) failed to reach a consensus. In these studies, P. johnsoni was compared to geographically close populations of P. gyrina, including the three used in the present study: the Cave and Basin Marsh, Five Mile Pond and Muleshoe Pond (Figure 2.1). The Cave and Basin Marsh is located downstream of the Cave and Basin Spring cluster, contains diluted thermal water and does not freeze (pers. obs.). Five Mile Pond and Muleshoe Pond are lake populations, located several kilometres upstream on the Bow River (Figure 2.1). P. johnsoni and P. gyrina were found to be genetically distinguishable at only three of the 12 protein loci tested, with low levels of intraspecific variation restricted to a single locus (Lepitzki 1998). Likewise, no consensus on genetic relatedness has been reached based on COI and 16S mitochondrial gene sequences (Remigio et al. 2001; Wethington & Guralnick 2004; Pip & Franck 2008). P. johnsoni and P. gyrina may be genetically close, as not all analyses reveal monophyletic groups (Wethington & Guralnick 2004). This has been hypothesized to be in part due to the young age of the species, with P. johnsoni predicted to have diverged only 3200 to 5200 years ago, when the thermal springs were formed (Grasby et al. 2003; COSEWIC 2008). The limited genetic tools available have so far precluded an effective test of this hypothesis.

These evolutionary factors highlight the need for genome-wide markers to resolve whether P. johnsoni and P. gyrina warrant separate management, to resolve the micro-geographic genetic population structure of P. johnsoni and to detect potential underlying genetic threats. It should be noted that I do not attempt to resolve what constitutes a species in this study, and rather focus on inter-species and intra-species patterns of genetic differentiation. As illustrated above, limited numbers of genetic markers have been unable to resolve patterns of genetic differentiation. To address these factors hindering conservation management, I used Pool-sequencing (Figure 1.3) (Futschik & Schlötterer 2010). This sequencing method involves pooling DNA from multiple individuals per population to provide high-confidence allele frequency estimates across the entire genome (Futschik & Schlötterer 2010; Kofler et al. 2011a; Gautier et al. 2013).

In this chapter I used genome-wide single nucleotide polymorphisms (SNPs) captured by Pool-seq to address two conservation objectives for P. johnsoni. The first objective was to determine whether P. johnsoni is genetically distinct from P. gyrina. I hypothesized that the taxonomic units previously assigned to P. johnsoni and P. gyrina on the basis of defining and/or derived traits would be valid if the observed patterns of genomic differentiation supported their distinct status. Whether P. johnsoni represents a thermal ecotype of P. gyrina or a distinct genetic unit has direct bearing on its conservation status (COSEWIC 2018) and the resources allocated to its conservation. For effective management, Parks Canada must be informed whether P. johnsoni and P. gyrina are genetically distinct, as taxonomic designation is an essential component of conservation biology. Improper classification can lead to the extinction of a species (Daugherty et al. 1990; Mace 2004). Secondly, I used this same SNP dataset to test predictions of micro-geographical population structure and within-population genetic diversity of P. johnsoni. Although the distribution is limited to a small geographic space, gene flow is predicted to be limited (Lepitzki & Pacas 2010), and extensive annual bottlenecks within each of the populations (COSEWIC 2018) are predicted to amplify the effect of genetic drift, resulting in increased population divergence. The combination of these two evolutionary processes leads to the prediction that genetic structure may be pronounced. Alternatively, P. johnsoni may represent a single panmictic population. The genomic data produced here will facilitate management decisions in association with habitat threats, including whether the thermal springs should be managed as a single unit or whether each warrants separate management. Overall, an analysis of genomic divergence of these snails is required to test these hypotheses.

2.2 METHODS

2.2.1 SAMPLING

P. johnsoni were collected from five thermal springs between January and March of 2017 in the Banff Thermal Springs of Banff National Park in Alberta, Canada: 1) Cave Spring (J1), 2) Basin Spring (J2), 3) Lower Cave & Basin Spring (J3), 4) Upper Cave & Basin Spring (J4) and 5) Lower Middle Spring (J5) (Figure 2.1). Individuals were also collected from Upper Middle Spring (J6) and Kidney Spring (J7), but these were not included in this study (Figure 2.1). Before collecting P. johnsoni, census population sizes were estimated to ensure that the number of snails sampled (n=40) did not exceed 0.5 to 3% of the spring’s current population. This condition was met except for J1, where only 20 P. johnsoni could be collected. In addition, a second species, P. gyrina (n=40 per location), was collected from three locations: 1) Cave & Basin Marsh (G1) (March 2017), 2) Five Mile Pond (G2) and 3) Muleshoe Pond (G3) (July 2017) (Figure 2.1).

Snails were collected by hand at all P. johnsoni locations (J1 to J5) and at P. gyrina location G1. Snails were collected haphazardly from eight locations within each site, with five snails collected at each location. Water temperature was recorded at each location. A D-dipnet was used to collect at the remaining locations (G2 and G3), where samples were collected in eight batches of five snails from different locations within each lake. Water temperature was taken once from a representative location.

All snails were anesthetized in the field in batches of five by placing them into tubes of 5% laboratory grade ethanol (EtOH) (Gilbertson & Wyatt 2016). The tubes were left to incubate immersed in the water source so as to remain close to the source temperature and minimize stress. The snails remained in 5% EtOH until movement ceased and they released from the tube’s surface (observed to take 5 to 15 minutes). They were then removed from the 5% EtOH and tested for responsiveness by scraping a hypodermic needle across the foot (Gilbertson & Wyatt 2016). If unresponsive, they were placed on a dish made of aluminum foil and euthanized by rapid cooling with electrical component freezing spray applied from under the dish (Craze & Barr 2002).

Tissue was then removed from the shells with a dissecting needle or forceps, taking care to minimize damage to the shell. The shells were stored individually in 95% EtOH. Ten tissue samples from each of J1 to J5 and G1 were stored individually in RNAlater®, which would have allowed for future gene expression analysis. However, it was decided that these samples would be better used for DNA analysis, and DNA was therefore extracted from them as explained below. The remaining tissue samples were stored individually in 95% EtOH. For G2 and G3, all 40 tissue samples were stored individually in 95% EtOH.

All samples were transported in a cooler with ice packs. Once in the laboratory they were stored at -20°C until extraction. All sampling procedures and research ethics were approved by the Life and Environmental Science Animal Care Committee under protocol #LESACC AC16-0267.

2.2.2 DNA EXTRACTION

DNA was extracted from whole body tissue following a modified OMEGA bio-tek E.Z.N.A.® Mollusc DNA Kit protocol that included drying and dicing the tissue, overnight incubation at 56 °C, three washes of the HiBind® column, and a 50 µL elution. Once DNA extraction was complete, 8 to 10 µL was aliquoted for quantification and quality checks. Both aliquot and stock were stored at -20°C until further use.

2.2.3 DNA QUANTIFICATION AND QUALITY CHECK

Aliquoted DNA was quantified a minimum of twice on either a Qubit® Fluorometer 2.0 or 3.0 using the Qubit™ dsDNA BR Assay Kit as per protocol. Samples were vortexed briefly and mixed by pipetting up and down before 2 µL was mixed into 198 µL of working solution for quantification. A subset of samples was run on a 1% agarose gel to visualize the level of shearing that had occurred. A subset of samples was tested for purity on a NanoDrop® Spectrophotometer ND-1000 (260/230 and 260/280 ratios).

2.2.4 CONSTRUCTING DNA POOLS FOR POOL-SEQ

DNA pools for each population were constructed using equal amounts of DNA from each individual sample. The DNA quantity (ng) for each pool was chosen so that at least 1 µL of solution was pipetted from each individual. Individual samples were briefly vortexed and pipetted up and down 20 times before the required volume was added to the pool. A total of 10 µL from each pool was aliquoted for further quantification and gel electrophoresis.
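The equal-mass pooling rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pool_volumes` and the example concentrations are hypothetical, and the mass per individual is assumed to be set by the most concentrated sample so that even that sample contributes at least 1 µL.

```python
def pool_volumes(conc_ng_per_ul, min_volume_ul=1.0):
    """Return (mass_ng, volumes_ul) for equal-mass pooling.

    conc_ng_per_ul: measured DNA concentration of each individual (ng/µL).
    The mass contributed per individual is set by the most concentrated
    sample, so that even that sample is pipetted at >= min_volume_ul.
    """
    if not conc_ng_per_ul or min(conc_ng_per_ul) <= 0:
        raise ValueError("concentrations must be positive")
    mass_ng = max(conc_ng_per_ul) * min_volume_ul
    volumes_ul = [mass_ng / c for c in conc_ng_per_ul]
    return mass_ng, volumes_ul


# Hypothetical concentrations (ng/µL) for three individuals.
mass, volumes = pool_volumes([10.0, 50.0, 200.0])
```

With these made-up inputs every individual contributes the same mass of DNA, while the most dilute sample requires the largest volume.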

Pooled DNA samples were quantified using the same method as the individual samples (described above). 5 µL of each pool was run on a 1% agarose gel to test whether handling had increased shearing. To prepare for sequencing, each pool was diluted to a final concentration of 3 to 6 ng/µL. The diluted pools were quantified as above.
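The dilution step follows the standard relationship C1·V1 = C2·V2. The sketch below is illustrative only; the function name and the example stock concentration are assumptions, not values from this study.

```python
def diluent_volume_ul(stock_conc, stock_volume_ul, target_conc):
    """Volume of diluent (µL) needed to bring a stock to target_conc.

    Uses C1 * V1 = C2 * V2; concentrations are in ng/µL.
    """
    if target_conc <= 0 or target_conc > stock_conc:
        raise ValueError("target must be positive and no higher than the stock")
    final_volume_ul = stock_conc * stock_volume_ul / target_conc
    return final_volume_ul - stock_volume_ul


# Hypothetical pool: 10 µL at 45 ng/µL diluted to 4.5 ng/µL.
extra_ul = diluent_volume_ul(45.0, 10.0, 4.5)
```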

2.2.5 DNA SEQUENCING

All pools passed concentration and quality control. Libraries were prepared using a shotgun approach with PCR and Illumina TruSeq LT adaptors. All libraries passed quality control. Pooled DNA libraries were sequenced on the Illumina HiSeq X™ Sequencer using paired-end reads of 150 base pairs (bp). Each pool was sequenced over two lanes (e.g. four pools on one lane and then the same four pools on the second lane) for a total of four lanes at the Génome Québec Innovation Centre, Montréal, Québec, Canada.

2.2.6 GENOMIC ANALYSIS

The full annotated genomic analysis pipeline can be found in Appendix A.

Base call (BCL) files were converted to FASTQ files, allowing no barcode mismatches, using bcl2fastq2 v.2.20 for downstream processing and analyses. Sequences from the two lanes for each pool were concatenated into one file per pool per read direction. FastQC v.0.11.5 (Andrews 2010) was used to check and visualize the quality of the sequences.

Trimmomatic v.0.36 (Bolger et al. 2014) was used to remove adaptors and filter low-quality sequences (ILLUMINACLIP 2:30:10 CROP:135 LEADING:5 TRAILING:5 SLIDINGWINDOW:5:20 MINLEN:100). Sequences were hard cropped at 135 bp due to k-mer overrepresentation in the last 15 bp in a small number of sequences. Trimmed sequences were checked for quality using FastQC v.0.11.5.
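To make the effect of the quality-trimming settings concrete, the following Python sketch applies simplified versions of the CROP, LEADING, TRAILING, SLIDINGWINDOW and MINLEN steps, in that order, to the Phred scores of a single read. It is an illustration only, not Trimmomatic's implementation: adaptor clipping is omitted, the function name `trim_read` is hypothetical, and the handling of bases at the sliding-window cut point is simplified.

```python
def trim_read(quals, crop=135, leading=5, trailing=5,
              window=5, window_q=20, minlen=100):
    """Return (start, end) of the retained slice, or None if the read is dropped.

    quals: list of Phred quality scores for one read.
    """
    end = min(len(quals), crop)                        # CROP: hard cut at fixed length
    start = 0
    while start < end and quals[start] < leading:      # LEADING: drop low-quality 5' bases
        start += 1
    while end > start and quals[end - 1] < trailing:   # TRAILING: drop low-quality 3' bases
        end -= 1
    # SLIDINGWINDOW: cut where the window mean first falls below window_q
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) / window < window_q:
            end = i
            break
    if end - start < minlen:                           # MINLEN: discard short reads
        return None
    return start, end


# Hypothetical 150 bp read: three poor bases, then high quality, then a poor tail.
kept = trim_read([2] * 3 + [38] * 140 + [2] * 7)
```

A read whose quality collapses early, for example `[38] * 50 + [10] * 100`, is cut at the first failing window and then discarded because fewer than 100 bases remain.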

Contamination from foreign (i.e. non-snail) sequences was removed from the data with DeconSeq v.0.4.3 (Schmieder & Edwards 2011a). Databases of potential contamination sources were generated for Archaea and green algae (Chlorophyta, Cryptophyta, Charophyceae, Eustigmatophyceae, and Klebsormidiophyceae) by downloading the nt database from NCBI (ftp://ftp.ncbi.nlm.nih.gov/blast/db, accessed 24-08-2018). Using the GenInfo Identifier (GI) list (https://www.ncbi.nlm.nih.gov/nuccore, accessed 24-08-2018) for each of the above, blastdb_aliastool was used to create a file that masked the database so that only the organisms of interest were available. This masked database was then converted to a FASTA file using blastdbcmd. The threespine stickleback (Gasterosteus aculeatus) genome was accessed from https://datadryad.org/resource/doi:10.5061/dryad.h7h32 (Peichel et al. 2017) and the human (Homo sapiens) genome was accessed from ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh38.p12_ch . The bacterial database was constructed from the NCBI Assembly database (https://www.ncbi.nlm.nih.gov/assembly/?term=bacteria, accessed 06-09-2018), selecting only “Complete Genomes”, except for genera known to occur in the thermal springs (Aphanothece, Brevundimonas, Chloroflexus, Lyngbya, Microcoleus, Oscillatoria, Phormidium, Porphyrobacter, Rhodobacter, Rhodopseudomonas, Rubrivivax, Spirulina, Synechocystis, Thermonanas, and Thiothrix) (Bilyj 2011), for which all assembly levels (chromosome, scaffold, and contig) were included. The genomes were concatenated in preparation for use by DeconSeq. All databases were prepared following the DeconSeq manual. Briefly, any Ns, any sequences less than 200 bp and sequence duplicates were removed (Schmieder & Edwards 2011b). The bacterial database was split into ~2.7 Gb chunks (FASTA Splitter v.0.2.6; http://kirill-kryukov.com/study/tools/fasta-splitter/) so that it could be indexed, and all databases were then indexed with the Burrows-Wheeler Aligner (BWA) (Li & Durbin 2009) provided in the DeconSeq download. Trimmed sequences were split into ~1.1 Gb segments (FASTQ Splitter v.0.1.2; http://kirill-kryukov.com/study/tools/fastq-splitter/) and compared against the databases, removing any sequence that matched with an identity (-i) of at least 94% and a coverage (-c) of at least 90%. The output files for each population were merged, and the paired read files were compared using fastq-pair (Edwards 2017), removing any sequences found in only one of the files.
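The logic of the final two steps (flagging a read as contamination from its identity and coverage, then keeping only reads still present in both files of a pair) can be sketched as follows. This is a minimal illustration, not DeconSeq or fastq-pair code: the function names are hypothetical, and treating the 94% and 90% thresholds as inclusive is an assumption.

```python
def is_contaminant(identity_pct, coverage_pct,
                   min_identity=94.0, min_coverage=90.0):
    """Flag a read whose best database hit meets both thresholds (assumed inclusive)."""
    return identity_pct >= min_identity and coverage_pct >= min_coverage


def keep_paired(forward_ids, reverse_ids):
    """Return the read IDs present in both files, in forward-file order."""
    shared = set(forward_ids) & set(reverse_ids)
    return [read_id for read_id in forward_ids if read_id in shared]


# Hypothetical read IDs after decontamination of each file separately.
paired = keep_paired(["r1", "r2", "r4"], ["r1", "r3", "r4"])
```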

A genome reference was subsequently assembled from the pool with the highest number of sequences (J3) using DISCOVAR de novo (v.52488) (https://software.broadinstitute.org/software/discovar/blog/?page_id=98) after using Clumpify to remove PCR duplicates (Bushnell 2014). DISCOVAR de novo was run to assemble the de-duplicated sequences into flattened lines. An assembly was also attempted using all pools, but the J3 assembly was judged to be “better”. DISCOVAR de novo was chosen because, unlike most assemblers, it does not require libraries with two different insert sizes. However, it is recommended that the sequences come from a single, PCR-free library preparation with 250 bp paired-end reads. The consequences of violating these recommendations are stated in Results and discussed in Caveats.

Burrows-Wheeler Aligner (BWA)-MEM (v. 0.7.17) (Li & Durbin 2009) was used to align sequences to the assembled genome reference with the -M option, which prevents multiple primary alignments being reported for different regions of the query sequence (shorter split hits are marked as secondary), for compatibility with downstream packages. Aligned sequences were converted from SAM to BAM files and sorted by chromosome number using SAMtools (Li et al. 2009). SAMtools was also used to remove any sequences that fell below a mapping quality of 20 and any sequences whose mate did not map. Duplicates were removed with Picard Tools (v. 2.17.3) (http://broadinstitute.github.io/picard/). SAMtools’ flagstat function was used to determine summary statistics of the resulting BAM files, while Picard Tools (ValidateSamFile) was used to validate that the files were not corrupted in any way. SAMtools was used to create individual mpileup files for each pool and an mpileup file containing all pools, specifying the -B option, which disables base alignment quality (BAQ) re-alignment, as required for downstream analysis.
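The alignment filter described here (mapping quality of at least 20, and both the read and its mate mapped) can be expressed directly on the FLAG and MAPQ columns of a SAM record. The sketch below is an illustration of that logic in plain Python, not the SAMtools commands used in the pipeline (those are given in Appendix A); the function name and example records are hypothetical.

```python
READ_UNMAPPED = 0x4   # SAM FLAG bit: this read is unmapped
MATE_UNMAPPED = 0x8   # SAM FLAG bit: the mate is unmapped


def keep_alignment(sam_line, min_mapq=20):
    """Return True if a SAM record passes the mapping-quality and mate filters."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    if flag & READ_UNMAPPED or flag & MATE_UNMAPPED:
        return False
    return mapq >= min_mapq


# Hypothetical record: properly paired, first in pair, MAPQ 60.
record = "r1\t99\tscaf1\t100\t60\t150M\t=\t300\t350\t*\t*"
passes = keep_alignment(record)
```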

2.2.7 PAIRWISE FST

Pairwise FST between all pools was determined using the R package Poolfstat (Hivert et al. 2018). The mpileup file containing all of the pools was first converted to a sync file using PoPoolation2 (Kofler et al. 2011b), filtering for a minimum base quality of 20. It is important to note that to specify the Phred33 encoding, --fastq-type must be set to “sanger” (rather than “illumina” for Phred64). Using Poolfstat, the sync file was converted to a popsync object in RStudio (v. 1.1.383) (RStudio Team 2016) with a minimum read count of one, minimum coverage of 20, maximum coverage of 200 and minimum allele frequency of 0.05, and with indels removed. Pairwise FST was calculated using the “Anova” method and the same parameters used to import the sync file. A Euclidean distance matrix was created from the pairwise FST matrix using the dist function in R. The pco function in the package LabDSV (v.1.8-0) (Roberts 2012) was used to determine the eigenvalues, from which the percentage of variation explained by eigenvectors one and two was calculated. Pairwise FST and the percentages explained by the eigenvectors were visualized using ggplot2 (Wickham 2016).
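As an illustration of what the coverage and allele-frequency thresholds do, the sketch below applies them to one line of a sync file, in which each pool column holds colon-separated counts in the order A:T:C:G:N:deletion. This is a simplified stand-in, not Poolfstat's filtering code: the function names are hypothetical, and the treatment of deletions, of the bi-allelic check and of the minor allele frequency (computed over all pools combined) are assumptions.

```python
def parse_sync_column(column):
    """Split one 'A:T:C:G:N:del' sync column into base counts and a deletion count."""
    a, t, c, g, _n, deletions = (int(x) for x in column.split(":"))
    return {"A": a, "T": t, "C": c, "G": g}, deletions


def passes_filters(sync_line, min_cov=20, max_cov=200, min_maf=0.05):
    """Return True if a position is a bi-allelic SNP within the coverage limits."""
    fields = sync_line.split()
    totals = {"A": 0, "T": 0, "C": 0, "G": 0}
    for column in fields[3:]:                  # columns 4+ hold one pool each
        counts, deletions = parse_sync_column(column)
        if deletions > 0:                      # assumed: treat as an indel position
            return False
        coverage = sum(counts.values())
        if not min_cov <= coverage <= max_cov:
            return False
        for base, count in counts.items():
            totals[base] += count
    ranked = sorted(totals.values(), reverse=True)
    if ranked[2] > 0:                          # more than two alleles observed
        return False
    return ranked[1] / sum(ranked) >= min_maf  # minor allele frequency over all pools


# Hypothetical two-pool position with alleles A and T.
ok = passes_filters("scaf1\t10\tA\t30:10:0:0:0:0\t25:15:0:0:0:0")
```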

2.2.8 NUCLEOTIDE DIVERSITY

Individual mpileups were analyzed in PoPoolation1 (Kofler et al. 2011a) to determine within-population nucleotide diversity. Nucleotide diversity was determined for SNPs with a minimum count of two, a minimum coverage of 20 and a maximum coverage of 100, for adjacent, non-overlapping windows of 250 bp and for a pool size equal to the diploid number of individuals in the pool. As above, the fastq-type had to be set to “sanger”. Mean nucleotide diversity was calculated in RStudio (v. 1.1.383) using all windows, including those that did not contain a SNP for the individuals of that pool, as these represent a diversity of zero. The nucleotide diversity values were visualized using the package ggplot2.
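The choice to average over all windows, counting SNP-free windows as zero, lowers the mean relative to averaging only over windows that contain a SNP. A minimal sketch of that calculation follows; the function name and the example values are hypothetical, and it assumes every window counted was adequately covered.

```python
def mean_window_diversity(window_pi, n_windows):
    """Mean nucleotide diversity over all windows.

    window_pi: dict mapping window index -> diversity for windows with a SNP.
    n_windows: total number of windows; windows absent from window_pi count as zero.
    """
    if n_windows <= 0 or len(window_pi) > n_windows:
        raise ValueError("n_windows must cover every window in window_pi")
    return sum(window_pi.values()) / n_windows


# Hypothetical pool: four windows, two of which contain SNPs.
mean_pi = mean_window_diversity({0: 0.004, 3: 0.002}, n_windows=4)
```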

2.3 RESULTS

2.3.1 DNA EXTRACTION, QUANTIFICATION AND QUALITY

DNA yield was variable among samples, even with tissues of similar size (range from 10 ng/µL up to greater than 200 ng/µL). Extracted DNA was determined to be free of organic compounds and protein contamination (mean ± SD) (260/230 ratio of 2.05 ± 0.26 and 260/280 ratio of 1.95 ± 0.085). Samples exhibited a high molecular weight and limited shearing (individual samples not shown). Pooled samples did not appear to have increased shearing from handling throughout the process with the majority of DNA above 10 Kbp (Figure B.1).

2.3.2 DNA SEQUENCING AND PRE-PROCESSING

A total of 3,675,153,756 sequences were assigned to a population across the eight populations, with a quality score of 38 for all populations, with the exception of one of the J3 lane positions, which had a quality score of 37 (J1: 446,326,456 sequences; J2: 478,291,654 sequences; J3: 501,946,794 sequences; J4: 430,999,172 sequences; J5: 412,287,368 sequences; G1: 451,624,748 sequences; G2: 453,963,908 sequences; G3: 499,713,656 sequences). No populations were flagged or failed for per base quality scores, though quality decreased as the read progressed (Figure B.2). Trimmomatic filtering removed a total of 843,631,584 sequences (22.96%) (J1: 22.65%; J2: 21.92%; J3: 22.10%; J4: 22.04%; J5: 22.43%; G1: 22.45%; G2: 25.85%; G3: 24.13%). For each population, per base quality improved such that quality was above 30 (Figure B.3). A total of 19,198,864 sequences were removed from the trimmed sequences as non-snail contamination (J1: 0.66%; J2: 0.70%; J3: 0.81%; J4: 0.73%; J5: 0.65%; G1: 0.73%; G2: 0.58%; G3: 0.55%). The genome reference assembly produced from J3 had an N50 of 3,931 bp (~450 Mbp in 1 kb+ scaffolds and ~79 Mbp in 10 kb+ scaffolds), a mean length of first read in pair up to first error (MPL1) of 90 and an estimated chimera rate of 0.55%. The genome reference assembly produced using all pools had an N50 of 2,739 bp (~540 Mbp in 1 kb+ scaffolds and ~48 Mbp in 10 kb+ scaffolds), an MPL1 of 77 and an estimated chimera rate of 1.08%. Of the decontaminated sequences, and after filtering (for unpaired reads, duplicates and a minimum mapping quality of 20), I mapped an overall 2,034,681,286 sequences (72.35%, or 55.36% of initial sequences) to the J3 assembly (J1: 251,520,504 sequences (73.34%); J2: 268,182,754 sequences (72.31%); J3: 284,998,462 sequences (73.49%); J4: 244,620,378 sequences (73.34%); J5: 235,247,357 sequences (74.04%); G1: 246,765,601 sequences (70.98%); G2: 237,549,944 sequences (70.98%); G3: 265,796,286 sequences (70.50%)). All BAM files passed validation.
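The overall percentages in this section can be recomputed from the reported counts. The short Python check below uses only numbers stated above; the one assumption is that the 72.35% figure is taken relative to the sequences remaining after trimming and decontamination.

```python
raw = {"J1": 446_326_456, "J2": 478_291_654, "J3": 501_946_794,
       "J4": 430_999_172, "J5": 412_287_368, "G1": 451_624_748,
       "G2": 453_963_908, "G3": 499_713_656}
mapped = {"J1": 251_520_504, "J2": 268_182_754, "J3": 284_998_462,
          "J4": 244_620_378, "J5": 235_247_357, "G1": 246_765_601,
          "G2": 237_549_944, "G3": 265_796_286}
trimmed_out = 843_631_584      # sequences removed by Trimmomatic
contaminants = 19_198_864      # sequences removed as non-snail contamination

total_raw = sum(raw.values())
total_mapped = sum(mapped.values())
retained = total_raw - trimmed_out - contaminants  # assumed denominator for 72.35%

pct_trimmed = 100 * trimmed_out / total_raw
pct_mapped_of_retained = 100 * total_mapped / retained
pct_mapped_of_raw = 100 * total_mapped / total_raw

print(f"{total_raw:,} raw; {pct_trimmed:.2f}% trimmed; "
      f"{pct_mapped_of_retained:.2f}% of retained and "
      f"{pct_mapped_of_raw:.2f}% of raw sequences mapped")
```

Under that assumption, the per-population counts sum to the reported totals and the three overall percentages agree with the values given in the text.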

2.3.3 PAIRWISE FST

The number of within-population bi-allelic positions captured was 921,339 to 1,300,995 per P. johnsoni pooled population and 3,053,291 to 3,736,834 per P. gyrina pooled population (Table 2.1). Pairwise FST between P. johnsoni populations ranged from 0.106 (J2 vs. J4) to 0.367 (J1 vs. J4) (Table 2.2). Between P. johnsoni and P. gyrina populations, pairwise FST ranged from 0.519 (J5 vs. G1) to 0.709 (J4 vs. G2) (Table 2.2). For P. gyrina populations, pairwise FST ranged from 0.359 (G2 vs. G3) to 0.498 (G1 vs. G2). The PCoA plot of the pairwise FST separated P. johnsoni and P. gyrina along the first axis, which explained 76.85% of the allelic variation shaping this genetic structure (Figure 2.3). J2, J3, and J4 clustered together (Figure 2.3). J1 and J5 also clustered together, but this clustering did not reflect a small genetic distance between them (pairwise FST = 0.335) (Table 2.2); rather, both were similarly distant from all other populations (Figure 2.3). G2 and G3 were loosely clustered, while G1 fell out towards the P. johnsoni populations (Figure 2.3).

2.3.4 NUCLEOTIDE DIVERSITY

Nucleotide diversity was lower in P. johnsoni populations (J1 to J5) than in P. gyrina populations (G1 to G3) (mean across all 250 bp windows: J1: 0.00133, J2: 0.00115, J3: 0.00113, J4: 0.00106, J5: 0.00156, G1: 0.00421, G2: 0.00475, G3: 0.00536) (Figure 2.4).
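To put the listed means on a common scale, each P. gyrina value can be expressed as a fold difference relative to the most and least diverse P. johnsoni populations. The sketch below uses only the means reported above; the function name `fold_range` is hypothetical.

```python
pi = {"J1": 0.00133, "J2": 0.00115, "J3": 0.00113, "J4": 0.00106,
      "J5": 0.00156, "G1": 0.00421, "G2": 0.00475, "G3": 0.00536}

johnsoni = [value for pop, value in pi.items() if pop.startswith("J")]


def fold_range(gyrina_pop):
    """Return (smallest, largest) fold difference versus the P. johnsoni pools."""
    return pi[gyrina_pop] / max(johnsoni), pi[gyrina_pop] / min(johnsoni)


for pop in ("G1", "G2", "G3"):
    low, high = fold_range(pop)
    print(f"{pop}: {low:.2f}x to {high:.2f}x the P. johnsoni means")
```

Even against the most diverse P. johnsoni pool (J5), G1 is more than twice as diverse and G2 and G3 are more than three times as diverse.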

2.4 DISCUSSION

As instances of habitat loss and fragmentation increase and contribute to species decline (Shaffer 1981), the effect of these events on genetic diversity is of increasing importance to conservation management (Frankham 2005). Population size fluctuations (bottlenecks), inbreeding, reduced gene flow and genetic drift can decrease the fitness of a population or species (Bouzat 2010). Freshwater snails present an excellent system in which to study these impacts, as they often naturally exist in discrete populations with limited dispersal (Viard et al. 1997). In these snails, genetic drift has also been shown to have a rapid and large influence due to founder effects, frequent bottlenecks, and low immigration rates, causing exceedingly low genetic diversity (Viard et al. 1997; Bousset et al. 2004). These species, which additionally have the ability to colonize a wide range of habitats, provide a unique opportunity to study the genomic underpinnings of population and species differentiation (Mavárez et al. 2002c). Though there are instances in certain species and ecosystems where there are no patterns of genetic differentiation between distant populations (Gu et al. 2015; Lounnas et al. 2018), strong genetic structure is common within small micro-geographical ranges (Mavárez et al. 2002a; b; Bousset et al. 2004; Djuikwo-Teukeng et al. 2014). The endangered P. johnsoni exemplifies the characteristics that can make freshwater snails ideal study systems. Its global habitat is restricted to just seven geographically close thermal springs, whose populations undergo severe yearly bottlenecks, with minimal gene flow predicted between the thermal springs (Lepitzki & Pacas 2010). While it is currently managed as a species, studies have questioned the validity of this designation and have proposed that P. johnsoni represents a thermal ecotype of a much more common snail, P. gyrina (Wethington & Guralnick 2004; Wethington & Lydeard 2007). The objectives of this study were therefore to test whether these putative species represent distinct phylogenetic units, and to test whether the yearly cyclic bottlenecks have contributed to population divergence and decreased nucleotide diversity among P. johnsoni from different thermal springs.

2.4.1 POPULATION STRUCTURE AND NUCLEOTIDE DIVERSITY BETWEEN P. JOHNSONI AND P. GYRINA POPULATIONS

I discovered strong genetic divergence between pooled samples of P. johnsoni and P. gyrina. Using just under one million to over four million SNPs sequenced across the genome, a pairwise FST of 0.636 ± 0.0605 (mean ± SD) was found between P. johnsoni and P. gyrina populations on a geographical scale of less than a few kilometres. Though the relationship between FST and the number of migrants is tenuous at best in most systems, this value would indicate one migrant every five to nine generations (Whitlock & McCauley 1999). This value should be interpreted with extreme caution, as these populations break many of the assumptions of this relationship (e.g., evolutionary equilibrium). For example, there should be no genetic isolation by geographical distance, with all populations contributing equally to the pool of migrants (Whitlock & McCauley 1999). This is not reflected by the patterns of pairwise FST (e.g., J5 is the thermal spring geographically farthest from G1, but genetically the closest). As well, populations are assumed to have a constant number of individuals (clearly not the case in this system), to experience no selection and no mutation, and to have reached equilibrium between migration and genetic drift (Whitlock & McCauley 1999). With this in mind, the patterns of genetic differentiation should be considered to indicate a maximized FST between these species with virtually no gene flow, rather than a quantitative estimate of migrants. As well, there was a striking difference in the nucleotide diversity between P. johnsoni and P. gyrina. P. gyrina populations were found to have over double (G1) or triple (G2 and G3) the nucleotide diversity observed in P. johnsoni populations. This relationship was also reflected in the number of SNPs captured within each population, with P. gyrina having roughly three to four times the number of within-population SNPs. Because individuals were pooled before sequencing, it was not possible to measure the variability in nucleotide diversity between individuals and thereby establish statistical significance.
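As a rough check on the migrant figure quoted above, the following sketch inverts Wright's island-model approximation, FST ≈ 1/(4Nm + 1), for the reported mean FST and for the mean plus or minus one SD. The function name is illustrative only, and the calculation inherits every island-model assumption that these populations violate, so the output is a heuristic and not an estimate of gene flow.

```python
def migrants_per_generation(fst: float) -> float:
    """Invert the island-model approximation FST = 1 / (4*Nm + 1) to obtain Nm."""
    if not 0.0 < fst < 1.0:
        raise ValueError("FST must lie strictly between 0 and 1")
    return (1.0 - fst) / (4.0 * fst)


# Mean and SD of interspecific pairwise FST as reported in the text
mean_fst, sd_fst = 0.636, 0.0605

for label, fst in [
    ("mean - SD", mean_fst - sd_fst),
    ("mean", mean_fst),
    ("mean + SD", mean_fst + sd_fst),
]:
    nm = migrants_per_generation(fst)
    print(f"{label:>9}: FST = {fst:.4f}, Nm = {nm:.3f}, "
          f"one migrant every {1.0 / nm:.1f} generations")
```

Under these assumptions the three values correspond to one migrant every 5.4, 7.0 and 9.2 generations, which matches the range of five to nine generations given in the text.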

There are a few plausible explanations for the observed population structure between P. johnsoni and P. gyrina, and the reduced diversity in P. johnsoni. One possibility is that P. johnsoni is adapting to the thermal spring environment and divergent selection is causing the fixation of critical alleles, increasing divergence and decreasing nucleotide diversity. Another, non-mutually exclusive possibility is the influence of the repeated bottlenecks. Over roughly 20 generations (assuming generation time of one year), Lepitzki (COSEWIC 2018) documented shifts where certain P. johnsoni populations’ minimum reached under 0.010% of their maximum value. Such patterns are predicted to result in genetic drift, whereby the probability of random fixation or extinction of alleles is inversely proportional to effective population size (Hedrick & Kalinowski 2000) and a reduction in genetic diversity is the predicted outcome of the process. Though P. gyrina have been shown to have seasonal patterns of large increase and decrease in other Albertan lakes (Sankurathri & Holmes 1976), higher population numbers and less constrained habitat may decrease the influence of the bottlenecks on genetic diversity as compared to P. johnsoni (Bouzat 2010). There could also be influence from multiple evolutionary forces. Indications of selection have been found in populations that undergo extensive bottlenecks; however, the selective force for an allele must be strong enough to overcome random drift (e.g., Koskinen et al. 2002; Funk et al. 2016). Though there are clear patterns of genetic differentiation and decreased nucleotide diversity which highlight the conservation concern for P. johnsoni, more data will be necessary to determine the relative roles of genetic drift and selection in this system.
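To make the expected effect of repeated bottlenecks concrete, the sketch below applies the standard drift expectation that heterozygosity declines by a factor of 1 - 1/(2N) each generation, and summarises a fluctuating population by the harmonic mean of its sizes. The yearly sizes are hypothetical and chosen only for illustration; they are not the counts from COSEWIC (2018), and census size is treated as effective size, which is a simplifying assumption.

```python
from statistics import harmonic_mean


def heterozygosity_retained(sizes):
    """Fraction of initial expected heterozygosity left after one
    generation of drift at each population size in `sizes`."""
    retained = 1.0
    for n in sizes:
        retained *= 1.0 - 1.0 / (2.0 * n)
    return retained


# Hypothetical sizes over 20 generations (NOT observed counts):
# a population cycling between a peak and a trough, and a stable one.
cyclic = [5000, 40] * 10
stable = [5000] * 20

for name, sizes in [("cyclic", cyclic), ("stable", stable)]:
    print(f"{name}: harmonic mean size = {harmonic_mean(sizes):.1f}, "
          f"heterozygosity retained = {heterozygosity_retained(sizes):.3f}")
```

The harmonic mean is dominated by the troughs, which is why a population that regularly reaches large peaks can still lose diversity at a rate set by its yearly minimum.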

2.4.2 POPULATION STRUCTURE AND NUCLEOTIDE DIVERSITY WITHIN P. JOHNSONI AND P. GYRINA POPULATIONS

The SNPs captured in this study support the hypothesis of multiple genetic populations for P. johnsoni and P. gyrina. Previous genetic work in P. gyrina which included G1, G2 and G3 and an additional two populations located in Banff National Park and Montana, was unable to find support for monophyly or distinguish between three of the five populations (Remigio et al. 2001). Using two (arguably non-neutral) genetic markers, they found that G1 and G2 grouped together away from G3 (Remigio et al. 2001). In this present study, I found that there was strong population structure, with pairwise FST of 0.359 between G2 and G3 and of 0.498 and 0.455 between G1 and G2 and G1 and G3, respectively. This population structure suggests more differentiation between the marsh population containing thermal water run-off (G1) and the lake populations (G2 and G3) than between the two lake populations. Re-addressing population structure with genome-wide SNPs may be more effective to resolve population structure when present (e.g. Emerson et al. 2010). G1 was seen to have decreased nucleotide diversity compared to both G2 and G3, though more SNPs were captured in this population than in G2. Whether these patterns are due to adaptation to different habitat types, connectivity (discussed below) or a combination is unknown.

Between the five populations of P. johnsoni included in this study, which are spread over just one kilometre, I found pairwise FST ranging from 0.106 to 0.367 (factors likely impacting this structure are discussed below). Within P. johnsoni populations, no general trends could be seen between the severity of the population’s minimum and maximum and the amount of nucleotide diversity or the number of within-population SNPs. In fact, J5, which has had some of the lowest population minimums (30 to 40 individuals between 1996 and 2017) (COSEWIC 2018), was found to have the highest nucleotide diversity and number of within-population SNPs. This illustrates that census data should be used in conjunction with genomic data for conservation management (Keller & Waller 2002). Altogether, these analyses reveal that population genetic factors are influencing the evolutionary trajectories of snails within these thermal springs at a remarkably microgeographic scale.

There are several ecological genetic factors possibly influencing the observed patterns of genetic differentiation between populations of the same species. One possibility is micro-habitat local adaptation; however, further data generation and analysis would be necessary to address this (discussed in Chapter 3 General conclusions). Another possible influencing force is, again, genetic drift. In addition to decreasing genetic diversity, drift is predicted to increase genetic differentiation between populations. To some extent all of the populations included in this study have likely undergone bottlenecks, so it is possible that these events are contributing to the genetic differentiation and FST estimates measured. Ease of dispersal may also be playing a role in the patterns of differentiation. For instance, between J2, J3, and J4 there is decreased pairwise
FST as compared to J1 (protected by a cave) and J5 (up Sulphur Mountain by about one km). In certain conditions J4 water will run above ground to J3, and snails have been observed in the J4 outlet (Lepitzki 2002). Though water has been observed to flow from J3 into J1, there is no pattern of decreased population structure between them in comparison to J1 and J2 or J1 and J4. Dispersal among P. johnsoni populations may have been impacted between the early to mid 1900s and 1997, as the thermal springs were piped together and bathing occurred between J1 and J2 until 1997 (COSEWIC 2008). Without prior sequencing to compare to, the effects of this will remain unknown. Between P. gyrina populations, the two populations connected by the Bow River (G2 and G3) have decreased (though still substantial) population structure compared to G1, which is isolated from the river. Though birds (Santamaría & Klaassen 2002) and mammals (Van Leeuwen et al. 2013) have been shown to transport snails and shape the patterns of genetic diversity (Van Leeuwen et al. 2013), water connectedness can frequently drive population structure, including in other aquatic species (e.g., Kremer et al. 2017). Previous work has shown that populations connected by waterways, which allow the transport of snails and eggs on mats, can show less genetic differentiation over much greater distances than even very close pond populations (Mavárez et al. 2002a; Bousset et al. 2004; Djuikwo-Teukeng et al. 2014). Other than the flooding between certain P. johnsoni populations and the possibility of previous connectivity through pipes, P. johnsoni populations have very little water connection as the thermal water largely runs underground (Grasby & Lepitzki 2002). Though there is increased nucleotide diversity in the two connected lake populations of P. gyrina, it is difficult to disentangle whether this is due to higher population numbers and their habitat or to connectivity. Interestingly, the P. johnsoni populations of J2, J3 and J4, which have the lowest genetic distance and presumably the highest probability of genetic mixing, also have the lowest nucleotide diversity. Together with the strong population structure measured between geographically close populations, the decreased nucleotide diversity and number of polymorphic sites in P. johnsoni populations, coupled with the known life history, provide compelling evidence that genetic drift may be driving minor allele loss in P. johnsoni populations.

2.4.3 BROADER IMPLICATIONS AND CONSERVATION RECOMMENDATIONS

Though species designation has a broad spectrum of definitions, this study has shown there to be clear genetic differentiation between P. johnsoni and P. gyrina and between populations of what is argued to be the same species. This level of genetic differentiation between populations of the same species is not restricted to this study and has been found in other species (Mavárez et al. 2002a; b; Bousset et al. 2004; Djuikwo-Teukeng et al. 2014). This raises concern about the potential loss of freshwater snail biodiversity that may contain ecologically and evolutionarily significant genetic diversity (Funk et al. 2012; Mee et al. 2015). On the whole, molluscs are data deficient with respect to conservation. Though only 10% of known species of molluscs had been assessed by the International Union for Conservation of Nature (IUCN) as of 2016, they still represent 40% of the documented extinctions (Cowie et al. 2017). With the genetic structure observed on such a short geographical scale, there is a high probability that we are in fact losing, if not “species”, genetically diverse populations which are essential for persistence with environmental change (Ceballos & Ehrlich 2002) at a much faster rate than even predicted (Régnier et al. 2015; Cowie et al. 2017). P. johnsoni is fortunate that it exists in such a visibly unique habitat in a national park, where COSEWIC agreed that even if it actually represented a thermal ecotype of P. gyrina, it would likely have been re-designated as a designatable unit (DU) (COSEWIC 2008). This ensured the allocation of resources due to its proposed ecological or evolutionary significance (Joseph et al. 2009; Funk et al. 2012; Carwardine et al. 2018). Actions included census counts done every three to four weeks from 1996 until they were terminated in 2017, motion-triggered alarms to prevent people soaking in the Middle Springs, the closing of swimming at the Cave and Basin Springs, and previous funding to test the evolutionary significance of the species (Lepitzki 1998; Remigio et al. 2001; COSEWIC 2008, 2018; Lepitzki & Pacas 2010). Without a recognition of taxonomic, ecological or evolutionary uniqueness, this level of resources will not be allocated to species (Isaac et al. 2004; Joseph et al. 2009). Unfortunately, there is a taxonomic bias in the primary research necessary to assess these measures of distinctiveness (Howard & Bickford 2014; Régnier et al. 2015; Cowie et al. 2017). Genomics provides a relatively cost-effective method (the integration of Pool-seq into conservation is discussed below) for determining population structure and characterizing the genetic health of species or populations.

As illustrated in the P. johnsoni populations compared to P. gyrina, bottlenecks in small, isolated populations can cause genetic drift to fix alleles (decreasing nucleotide diversity) and promote genetic differentiation. This loss of alleles can cause the fixation of detrimental alleles (Bouzat 2010) and a decrease in the standing genetic variation necessary to rapidly respond to environmental change (Morris et al. 2014). For P. johnsoni this means that each population represents an incredibly important reservoir for the limited nucleotide diversity found across the species. In light of this, I would recommend that the population counts be re-instated so that deviations from 20-year trends can be detected quickly and, ideally, coupled with concurrent genomic estimates of genetic diversity to directly test predictions associated with genetic drift. The routine sequencing of the populations every few years would be an incredibly valuable component of P. johnsoni’s management plan, as predicted with other management plans (e.g., De Barba et al. 2010; Hendricks et al. 2017). Temporal differences in the same population’s nucleotide diversity and the extent of differentiation would be a powerful way to investigate the effect of genetic drift (Bousset et al. 2004). If a population starts declining in numbers (a significant decline in observed maxima has already occurred; COSEWIC 2018) and/or there is increased fixation of alleles, translocation of individuals from another population may be warranted as genetic rescue (Ingvarsson 2001; Edmands 2007). Under such scenarios, there could be concern regarding the potential for outbreeding depression if local adaptation to each thermal spring was disrupted by the influx of new individuals (Edmands 2007). However, the genetic differentiation shown here likely indicates either current or recent gene flow. It is possible that gene flow between these thermal springs has decreased below what would be natural for the system, as each of the thermal springs has been impacted by humans (COSEWIC 2008), presumably decreasing frequentation by animals that could act as vectors for these snails. Additionally, if adaptive differences are occurring at certain alleles even in the face of population bottlenecks and the corresponding impact of genetic drift, the selective force would be incredibly strong and therefore unlikely to be disrupted by a few migrants (Funk et al. 2016). Without semi-frequent monitoring of genetic variation, it will be impossible to establish a baseline of what is considered normal and stable for the system, and genetic threats will remain undetectable. As well, further monitoring would provide the parameters necessary to elucidate the roles of selection and drift in this system. This could be used to characterize the potential risk of outbreeding depression if translocation occurred to mitigate the impact of low genetic diversity and/or inbreeding depression. P. johnsoni represents a fantastic and unique opportunity to conduct research on how the genome of a species existing in small, isolated populations with minimal gene flow and recurrent bottlenecks is impacted. In the face of the biodiversity crisis, where critically important genetic diversity is so often overlooked (Frankel 1974; Laikre 2010), characterizing and understanding genetic drift is vital.

2.4.4 THE UTILITY OF POOL-SEQ IN CONSERVATION

Pool-seq provides a low-cost method for capturing genome-wide polymorphisms. In conservation management, Pool-seq can decrease sequencing costs without reducing the number of individuals sampled (Ferretti et al. 2013). However, there are some purposes where Pool-seq excels and others where it is limited. Firstly, Pool-seq is particularly useful in cases where there are unknown amounts of polymorphism, such as this study. With RAD (Baird et al. 2008) and ddRAD (Peterson et al. 2012) sequencing, only a small proportion of the genome is captured, with one snail study capturing less than three thousand markers (Kess et al. 2016). Because these methods involve the use of restriction enzymes that cut at specific patterns of DNA, it is hard to predict the number of DNA fragments of appropriate size that will be generated (Liu et al. 2013), and pilot studies can be necessary to determine this (Kess et al. 2016). Barcoding individuals, even with reduced-representation sequencing, can still represent a large financial investment for a smaller number of SNPs captured (Gautier et al. 2013). However, because individual identity is lost in Pool-seq, it is not possible to accurately estimate migration rate, effective population size or inbreeding coefficient using Pool-seq (Andrews et al. 2016). Additionally, assignment of individuals to populations is not possible (Andrews et al. 2016). This must be taken into account when sampling, especially if the species does not exist in discrete populations. If using Pool-seq, specific parameters and filtering must be used to decrease bias in allele frequency estimates and subsequent calculations. I will discuss these in the context of this study. At the sampling level, a minimum of 40 individuals is recommended per population for the most accurate population allele frequency estimates (Schlötterer et al. 2014). Though Hivert et al. (2018) argue that their estimator for pairwise FST is unbiased by pool size or coverage, pool size remains a consideration for the measure of nucleotide diversity in this study (Kofler et al. 2011a). Small pools are a known limitation of using Pool-seq in endangered species (Schlötterer et al. 2014): the intention was to sample 40 individuals per population; however, J2 had low population numbers. To mitigate this I used windows in calculating nucleotide diversity, as recommended for low sample sizes (Kofler et al. 2011a; Schlötterer et al. 2014). Care was taken to ensure equal representation of each individual per pool in terms of DNA amount (Gautier et al. 2013; Schlötterer et al. 2014). For filtering, pre-processing and calculations, I followed recommended best practices to mitigate the effects of sequencing errors being called as SNPs (Kofler et al. 2011b; a; Schlötterer et al. 2014; Hivert et al. 2018). Further considerations are discussed in Caveats.
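The windowed approach to nucleotide diversity can be sketched as follows. This is a deliberately naive illustration and not the PoPoolation estimator, which additionally corrects for pool size and sequencing error. The default thresholds mirror those given in the caption of Figure 2.4 (250 bp windows, coverage between 20 and 200, a minor allele count of 4, and 60% of the window meeting the coverage limits); the function names and the input layout are assumptions made for this example.

```python
def site_pi(counts):
    """Naive per-site diversity from the read counts of each allele."""
    n = sum(counts)
    if n < 2:
        return 0.0
    homozygosity = sum((c / n) ** 2 for c in counts)
    return (n / (n - 1)) * (1.0 - homozygosity)


def window_pi(sites, window=250, min_cov=20, max_cov=200,
              min_fraction=0.6, min_minor=4):
    """Average naive pi over one window.

    sites: dict mapping position in the window -> tuple of allele read counts.
    Returns None when too few positions meet the coverage limits.
    """
    usable = {pos: c for pos, c in sites.items()
              if min_cov <= sum(c) <= max_cov}
    if len(usable) < min_fraction * window:
        return None
    total = 0.0
    for counts in usable.values():
        ordered = sorted(counts, reverse=True)
        minor = ordered[1] if len(ordered) > 1 else 0
        if minor >= min_minor:  # treat rarer variants as sequencing error
            total += site_pi(counts)
    return total / len(usable)


# Toy window: 200 covered positions, two of them polymorphic
toy = {pos: (50, 0) for pos in range(200)}
toy[10] = (40, 10)
toy[120] = (40, 10)
print(window_pi(toy))
```

Dividing by the number of usable positions, not only the polymorphic ones, keeps windows comparable when they differ in how much of their length is adequately covered.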

As we strive to include genomics into conservation with increasing frequency, careful validation and reflection on software used must be conducted (Shafer et al. 2015). In this study, pairwise FST was originally calculated using the established PoPoolation2 (Kofler et al. 2011b). It was then calculated using the newly developed Poolfstat (Hivert et al. 2018) as a confirmation. The packages use different methods for calculating estimates of allele counts, with Hivert et al. (2018) illustrating that the PoPoolation2 estimate is biased (not converging on expected values and impacted by coverage and sample size). The differences between the two packages should not have been extreme (Hivert et al. 2018); however, I found up to a fivefold difference between the two packages in the pairwise FST calculated. PoPoolation2 (Kofler et al. 2011b) found pairwise FST of 0.044 to 0.076 between populations of P. johnsoni, 0.21 to 0.35 between P. johnsoni and P. gyrina and 0.167 to 0.237 between P. gyrina populations (Table C.1), compared to 0.106 to 0.367, 0.519 to 0.709 and 0.359 to 0.498, respectively, found by Poolfstat (Table 2.2) (Hivert et al. 2018). The population structure reported by PoPoolation2 would have indicated that P. johnsoni and P. gyrina may not be genetically distinct and that gene flow was likely occurring between them. However, it was determined that when calculating pairwise FST, PoPoolation2 (Kofler et al. 2011b) considers all base positions that are polymorphic in one or more populations, regardless of whether the position is polymorphic in either of the populations in the present pairwise comparison. Thus, any position that was polymorphic in some population but fixed in the two populations being compared generated a pairwise FST of zero, effectively dampening the population structure. This was exacerbated by the difference in nucleotide diversity between the P. johnsoni and P. gyrina populations. This is a clear illustration that genomic results must be considered in the context of known ecological information for the species and must be examined closely before conservation recommendations are made.
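The dampening effect described above can be reproduced with a toy calculation. The sketch below is not the PoPoolation2 or Poolfstat estimator: it uses a simple heterozygosity-based per-site FST, averages it across sites, and compares the average with and without positions at which both populations in the pair are fixed for the same allele. All allele frequencies are hypothetical.

```python
def site_fst(p1, p2):
    """Per-site FST = (HT - HS) / HT for two populations, biallelic site."""
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # both populations fixed for the same allele
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


def mean_fst(sites, skip_fixed_in_pair):
    """Average per-site FST, optionally dropping sites fixed in this pair."""
    values = []
    for p1, p2 in sites:
        fixed_same = p1 == p2 and p1 in (0.0, 1.0)
        if skip_fixed_in_pair and fixed_same:
            continue
        values.append(site_fst(p1, p2))
    return sum(values) / len(values)


# Hypothetical allele frequencies for one pair of populations
informative = [(0.9, 0.1), (1.0, 0.2), (0.7, 0.0)]
# Sites variable only in some third population, fixed in this pair
fixed_in_pair = [(1.0, 1.0)] * 6
sites = informative + fixed_in_pair

print(f"all sites:             {mean_fst(sites, skip_fixed_in_pair=False):.3f}")
print(f"sites variable in pair: {mean_fst(sites, skip_fixed_in_pair=True):.3f}")
```

With three informative sites out of nine, the average over all sites is exactly one third of the average over the informative sites. The more sites that vary only in other populations, as happens when a diverse species is analysed alongside a depauperate one, the stronger the dampening.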

Development and re-use of well-developed sequencing methods and pipelines are at the core of genomics being integrated efficiently into conservation (Shafer et al. 2015). If this pipeline was applied to a different project, once the samples were collected, it would likely take less than a month to go from extracted DNA to having population structure results. In this context, using Pool-seq provides an incredibly cost-effective method for conservation management to assess population structure and investigate the genetic health of populations. This method can be complemented by restriction enzyme sequencing of a subset of individuals to provide estimates of effective population number, inbreeding coefficient and migrant rates (Andrews et al. 2016) for more complete conservation management plans.

2.4.5 CAVEATS

An unavoidable consequence of pooling individuals for Pool-seq is that there is no way to distinguish between two sequences that were sequenced twice from the same individual or from two individuals (Schlötterer et al. 2014). Downstream applications assume that they were from different individuals, which may bias the population estimates for allele frequency. However, the estimates that I provide in this study are based on averaging over millions of positions, so the bias should be decreased. The reference genome I created used pooled, 135 bp (post Trimmomatic) paired-end short read sequences from one P. johnsoni population. DISCOVAR de novo was designed for paired-end reads of 250 bp from a single PCR-free library, though the creators do state that PCR-amplified libraries can “in principle be used”, and that 150 bp reads “may work” (https://software.broadinstitute.org/software/discovar/blog/?page_id=23). Increased quality was seen when using one pool (J3) rather than all pools to construct the genome assembly, reflected in an increased N50 value and MPL1 and a decreased estimated chimera rate. The creators of DISCOVAR de novo state that the MPL1 should be 175 bp to 225 bp for 250 bp paired-end reads (compared to 90 bp for the J3 reference genome), though this value did not result in DISCOVAR de novo flagging the assembly as problematic, nor did any of the other values generated. However, due to these factors, the constructed contigs were short, which increases sequence mapping error. Additionally, I assumed that P. gyrina would successfully map to a reference genome constructed from P. johnsoni sequences. When calling SNPs, it was required that all populations have a minimum of 20x coverage over a position, which should decrease the impact of certain populations not mapping to divergent regions. Copy number variants and repetitive regions of the genome may collapse to the same position (Schlötterer et al. 2014). This is not an issue unique to Pool-seq and is an unfortunate limitation of using short sequencing reads. I attempted to mitigate this by setting the upper limit of sequencing coverage to 200. Biological differences between P. johnsoni and P. gyrina, such as selfing rates, could influence the amount of fixation occurring. I am unable to determine what evolutionary force is generating the genetic differentiation between and within P. johnsoni and P. gyrina populations. While I can predict that genetic drift plays a large role given the decreased genetic diversity observed and the known bottlenecks, further work investigating potentially adaptive differences between P. johnsoni and P. gyrina is necessary.
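The coverage limits described above, together with the minor allele frequency and minimum read count thresholds listed in the captions of Tables 2.1 and 2.2, can be expressed as a simple per-position filter. The sketch below is an assumption about how such thresholds might be applied to biallelic read counts; it is not the Poolfstat implementation, and the function name and input layout are hypothetical.

```python
def passes_filters(pool_counts, min_cov=20, max_cov=200,
                   min_maf=0.05, min_read_count=1):
    """Decide whether one biallelic position is kept as a SNP.

    pool_counts: list of (ref_reads, alt_reads), one tuple per pool.
    Every pool must fall inside the coverage limits; the minor allele
    frequency is computed from read counts summed over all pools.
    """
    for ref, alt in pool_counts:
        if not min_cov <= ref + alt <= max_cov:
            return False
    total_ref = sum(ref for ref, _ in pool_counts)
    total_alt = sum(alt for _, alt in pool_counts)
    minor = min(total_ref, total_alt)
    if minor < min_read_count:
        return False  # monomorphic across all pools
    return minor / (total_ref + total_alt) >= min_maf


# Hypothetical read counts for two pools at four positions
print(passes_filters([(30, 10), (25, 0)]))    # within limits, common variant
print(passes_filters([(30, 10), (10, 5)]))    # second pool below 20x
print(passes_filters([(150, 100), (30, 10)])) # first pool above 200x
print(passes_filters([(99, 1), (100, 0)]))    # minor allele too rare
```

Requiring every pool to meet the lower limit is what restricts the analysis to regions where both species map to the P. johnsoni reference, while the upper limit removes positions where collapsed repeats inflate coverage.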

2.4.6 CONCLUSIONS

Without the integration of genomics, current conservation management plans will remain incomplete. Here I used Pool-seq, a cost-effective sequencing method, to capture millions of SNPs in the globally restricted P. johnsoni and the more common P. gyrina. Analyses using these SNPs were able to resolve genetic structure that had remained ambiguous between the two species with the use of a few genetic markers (Remigio et al. 2001; Wethington & Guralnick 2004). These results indicated that P. johnsoni and P. gyrina were genetically distinct from each other. Additionally, I found extensive population structure between populations of the same species. Coupled with the decreased nucleotide diversity in P. johnsoni populations, which undergo massive bottlenecks, these results indicate that there may be a large impact of genetic drift. The findings of this study will be integrated into P. johnsoni’s management plan and will help make it more complete. Without the use of genomics, differentiation between P. johnsoni and P. gyrina would not have been determined and the impact of bottlenecks on decreasing genetic diversity would have remained a predicted but uncharacterized threat. As illustrated in this system, there is an incredible place and need for the use of genomics in conservation. The partnership between Parks Canada and University of Calgary researchers represents the type of collaboration that is necessary for genomics to be used in real world policy applications.

Table 2.1 Number of SNPs within each population and used in pairwise comparisons between populations determined by Poolfstat with the ‘Anova’ method for a minimum read count of one, minimum coverage of 20, a maximum coverage of 200, and a minor allele frequency of 0.05. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

Population         J1         J2         J3         J4         J5         G1         G2         G3
J1          1,174,228  1,362,417  1,369,156  1,349,817  1,575,031  3,491,977  4,416,793  4,607,239
J2                  -  1,000,751  1,127,267  1,049,468  1,468,419  3,527,269  4,397,121  4,601,359
J3                  -          -  1,061,899  1,090,399  1,454,054  3,460,571  4,358,911  4,566,735
J4                  -          -          -    921,339  1,090,399  3,510,866  4,372,598  4,580,040
J5                  -          -          -          -  1,300,995  3,531,009  4,466,620  4,673,149
G1                  -          -          -          -          -  3,248,634  4,708,928  4,817,892
G2                  -          -          -          -          -          -  3,053,291  4,366,631
G3                  -          -          -          -          -          -          -  3,736,834

Table 2.2 Pairwise FST between all populations determined using Poolfstat with the ‘Anova’ method for a minimum read count of one, minimum coverage of 20, a maximum coverage of 200, and a minor allele frequency of 0.05. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

Population     J1     J2     J3     J4     J5     G1     G2     G3
J1              0  0.315  0.335  0.367  0.322  0.550  0.692  0.650
J2              -      0  0.136  0.106  0.312  0.569  0.699  0.656
J3              -      -      0  0.209  0.302  0.574  0.707  0.667
J4              -      -      -      0  0.351  0.586  0.709  0.666
J5              -      -      -      -      0  0.519  0.671  0.630
G1              -      -      -      -      -      0  0.498  0.455
G2              -      -      -      -      -      -      0  0.359
G3              -      -      -      -      -      -      -      0

Figure 2.1 Range map and sample populations for Physella johnsoni and sample populations for Physella gyrina, Banff National Park, Alberta, Canada. P. johnsoni - Cave Spring (J1), Basin Spring (J2), Lower Cave & Basin Spring (J3), Upper Cave & Basin Spring (J4), Lower Middle Spring (J5), Upper Middle Spring (J6) (not used in this study) and Kidney Spring (J7) (not used in this study). P. gyrina - Cave & Basin Marsh (G1), Five Mile Pond (G2) and Muleshoe Pond (G3).

Figure 2.2 Total number of P. johnsoni from January 1996 to September 2017. Population counts were taken once every three weeks till August 2000 and then once every four weeks till September 2016 when population counts were ended. From April to September 2017 and September 2018 the counts were resumed. Original springs include J1, J2, J3, J4 and J5. The re- established springs are J6 and J7. Modified from COSEWIC 2018 by Dr. Dwayne Lepitzki.

Figure 2.3 Principal coordinate analysis for all pairwise FST between P. johnsoni and P. gyrina populations calculated by Poolfstat with the ‘Anova’ method for a minimum read count of one, minimum coverage of 20, a maximum coverage of 200, and a minor allele frequency of 0.05. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

Figure 2.4 Averaged nucleotide diversity for all P. johnsoni and P. gyrina populations calculated by PoPoolation2 over 250 bp side-by-side windows, with a minor allele count of 4, minimum coverage of 20 and maximum coverage of 200, where 60% of the window was required to meet coverage specifications. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

CHAPTER 3 GENERAL CONCLUSIONS

In this study, I aimed to provide clarity to previously unresolved taxonomic designations between the Banff Springs Snail (Physella johnsoni) and Physella gyrina. Additionally, I provided new data that characterized the genetic diversity and micro-geographical population structure of P. johnsoni. Using Pool-seq, just under a million to over four million SNPs were captured per population, allowing me to uncover strong and defined population structure between P. johnsoni and P. gyrina. This leads me to believe that they represent genetically distinct units and warrant continued separate management. Even between populations of the same species there was extensive population structure, leading me to have concern about the prospect of “lumping” species of snails (Wethington & Guralnick 2004), as I and others (Mavárez et al. 2002a; b; Bousset et al. 2004; Djuikwo-Teukeng et al. 2014) have found strong population structure between populations on small geographic scales. In terms of management, I believe the data deficiency for molluscs (Régnier et al. 2009; Cowie et al. 2017), specifically in genomic data, will result in the loss of possibly genetically unique and interesting species and sub-species and threaten the continued persistence of many molluscs.

Pool-seq is a fantastic tool to address population structure and nucleotide diversity. Complications can arise when using it to address adaptive differences, as without an annotated genome, regions of divergence lack biological relevance. However, this issue is not restricted to Pool-seq and is shared by all sequencing methods. Unlike RAD-seq, Pool-seq captures the majority of the genome, although without a reference genome it feels under-utilized. Fortunately, the cost of sequencing genomes is continually decreasing, and the number of available genome references is increasing. In terms of conservation management, the current toolset for analyzing Pool-seq data is limited in some respects. There are packages and scripts developed to determine pairwise FST, nucleotide diversity (Tajima's Pi), Watterson’s Theta or Tajima's D, but due to the loss of individual identity, Pool-seq data cannot be used to determine levels of inbreeding or effective population size. However, if these do not need to be explicitly determined for the species or population, Pool-seq does provide an impressive amount of data and resolution for the parameters it can determine, at a very attractive price.

In future steps, I would like to investigate further the population structure between P. johnsoni and P. gyrina and to take the first steps in determining if there are possibly adaptive differences between them. In the pursuit of this goal, I think that generating a reference genome would be of great benefit. This would allow us to start investigating if there are shared regions of the genome that show evidence of selection between P. johnsoni and P. gyrina and if these regions lie near or in potential gene coding regions.

In conclusion, the data I have generated and presented here provide the resolution necessary to determine that P. johnsoni and P. gyrina are genetically distinct. Additionally, I have shown that there is strong micro-geographical population structure between the P. johnsoni thermal springs and decreased within-population nucleotide diversity. I recommend a modified version of the current recovery strategy and action plan for P. johnsoni as the appropriate action plan. Considering the decreased nucleotide diversity shown in P. johnsoni, each population plays a vital role in the evolutionary robustness of the species beyond just total numbers. I recommend re-instating population counts focused on capturing the yearly minimum and maximum for each population. I propose that population counts be done every four weeks for the three months, or some duration and frequency that captures previously recorded population minimums and maximums (COSEWIC 2018). This could provide evidence of deviations from the 20-year norms and therefore provide the first warning signs of population collapse, especially when coupled with genomic data. As such, I recommend that semi-regular sequencing be incorporated into P. johnsoni’s management plan to establish a baseline for the impact of genetic drift on population divergence and nucleotide diversity. Additional monitoring of genetic variation levels could determine potentially adaptive versus non-adaptive loci, effective population sizes and inbreeding coefficients. Decreasing nucleotide diversity, effective population size and/or population numbers and/or increasing inbreeding may warrant translocation of individuals from a population with different polymorphisms. By characterizing these factors, management would be able to weigh the potential risks of outbreeding depression versus inbreeding depression. As demonstrated in this system, the use of genomics in conservation is a vital component of creating effective and efficient management plans.

References

Adams JR, Vucetich LM, Hedrick PW, Peterson RO, Vucetich JA (2011) Genomic sweep and potential genetic rescue during limiting environmental conditions in an isolated wolf population. Proceedings of the Royal Society B: Biological Sciences, 278, 3336–3344.

Allendorf FW, Hohenlohe PA, Luikart G (2010) Genomics and the future of conservation genetics. Nature Reviews Genetics, 11, 697–709.

Anderson E, Skaug HJ, Barshis DJ (2014) Next-generation sequencing for molecular ecology: a caveat regarding pooled samples. Molecular Ecology, 23, 502–512.

Andrews S (2010) FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc

Andrews KR, Good JM, Miller MR, Luikart G, Hohenlohe PA (2016) Harnessing the power of RADseq for ecological and evolutionary genomics. Nature Reviews Genetics, 17, 81–92.

Baird NA, Etter PD, Atwood TS et al. (2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS ONE, 3, 1–7.

De Barba M, Waits LP, Garton EO et al. (2010) The power of genetic monitoring for studying demography, ecology and genetics of a reintroduced brown bear population. Molecular Ecology, 3938–3951.

Barrett RDH, Rogers SM, Schluter D (2008) Natural selection on a major armor gene in threespine stickleback. Science, 322, 255–257.

Bilyj M (2011) A study on the phototrophic microbial mat communities of Sulphur Mountain Thermal Springs and their association with the endangered, endemic snail Physella johnsoni. University of Manitoba.

Bland LM, Collen B, David C, Orme L, Bielby J (2015) Predicting the conservation status of Data Deficient species. Conservation Biology, 53, 1792–1803.

Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics, 30, 2114–2120.

Bousset L, Henry PY, Sourrouille P, Jarne P (2004) Population biology of the invasive freshwater snail Physa acuta approached through genetic markers, ecological characterization and demography. Molecular Ecology, 13, 2023–2036.

Bouzat JL (2010) Conservation genetics of population bottlenecks: The role of chance, selection, and history. Conservation Genetics, 11, 463–478.

Bull JK, Sands C, Garrick RC et al. (2013) Environmental complexity and biodiversity: The multi-layered evolutionary history of a log-dwelling velvet worm in montane temperate Australia. PLoS ONE, 8, 1–15.

Bushnell B (2014) BBMap: A Fast, Accurate, Splice-Aware Aligner. Lawrence Berkeley National Laboratory. LBNL Report #: LBNL-7065E. Retrieved from https://escholarship.org/uc/item/1h3515gn

Butchart SHM, Walpole M, Collen B et al. (2010) Global biodiversity: Indicators of recent declines. Science, 328, 1164–1168.

Cardinale BJ, Duffy JE, Gonzalez A et al. (2012) Biodiversity loss and its impact on humanity. Nature, 486, 59–67.

Carwardine J, Martin TG, Firn J et al. (2018) Priority threat management for biodiversity conservation: A handbook. Journal of Applied Ecology, 0–2.

Ceballos G, Ehrlich PR (2002) Mammal population losses and the extinction crisis. Science, 296, 904–907.

Clench WJ (1926) Three new species of Physa. Occasional Papers of the Museum of Zoology, 168, 1–8.

COSEWIC (2008) COSEWIC assessment and update status report on the Banff Springs Snail Physella johnsoni in Canada. Committee on the Status of Endangered Wildlife in Canada. Ottawa., vii + 53 pp.

COSEWIC (2014) COSEWIC wildlife species assessment: quantitative criteria and guidelines. Committee on the Status of Endangered Wildlife in Canada [cited 2018]. https://www.canada.ca/en/environment-climate-change/services/committee-status-endangered-wildlife/wildlife-species-assessment-process-categories-guidelines/quantitative-criteria.html (accessed on 18 December 2018).

COSEWIC (2015) Guidelines for recognizing designatable units. Committee on the Status of Endangered Wildlife in Canada [cited 2018]. https://www.canada.ca/en/environment-climate-change/services/committee-status-endangered-wildlife/guidelines-recognizing-designatable-units.html (accessed on 29 October 2018).

COSEWIC (2018) COSEWIC status appraisal summary on the Banff Springs Snail Physella johnsoni in Canada. Committee on the Status of Endangered Wildlife in Canada. Ottawa., xxvi pp.

Cowie RH, Regnier C, Fontaine B, Bouchet P (2017) Measuring the sixth extinction: what do mollusks tell us? Nautilus, 131, 3–41.

Craze PG, Barr AG (2002) The use of electrical-component freezing spray as a method of killing and preparing snails. Journal of Molluscan Studies, 68, 191–192.

Dalziel AC, Rogers SM, Schulte PM (2009) Linking genotypes to phenotypes and fitness: How mechanistic biology can inform molecular ecology. Molecular Ecology, 18, 4997–5017.

Daugherty CH, Cree A, Hay JM, Thompson MB (1990) Neglected taxonomy and continuing extinctions of tuatara (Sphenodon). Letters to Nature, 374, 177–179.

Dennenmoser S, Vamosi SM, Nolte AW, Rogers SM (2017) Adaptive genomic divergence under high gene flow between freshwater and brackish-water ecotypes of prickly sculpin (Cottus asper) revealed by Pool-Seq. Molecular Ecology, 26, 25–42.

Djuikwo-Teukeng FF, Da Silva A, Njiokou F et al. (2014) Significant population genetic structure of the Cameroonian fresh water snail, Bulinus globosus, (Gastropoda: Planorbidae) revealed by nuclear microsatellite loci analysis. Acta Tropica, 137, 111–117.

Edmands S (2007) Between a rock and a hard place: Evaluating the relative risks of inbreeding and outbreeding for conservation and management. Molecular Ecology, 16, 463–475.

Emerson KJ, Merz CR, Catchen JM et al. (2010) Resolving postglacial phylogeography using high-throughput sequencing. Proceedings of the National Academy of Sciences of the United States of America, 107, 16196–16200.

Ferretti L, Ramos-Onsins SE, Pérez-Enciso M (2013) Population genomics from pool sequencing. Molecular Ecology, 22, 5561–5576.

Frankel OH (1974) Genetic conservation: our evolutionary responsibility. Genetics, 78, 53–65.

Frankham R (2005) Genetics and extinction. Biological Conservation, 126, 131–140.

Funk WC, Lovich RE, Hohenlohe PA et al. (2016) Adaptive divergence despite strong genetic drift: Genomic analysis of the evolutionary mechanisms causing genetic differentiation in the island fox (Urocyon littoralis). Molecular Ecology, 25, 2176–2194.

Funk WC, McKay JK, Hohenlohe PA, Allendorf FW (2012) Harnessing genomics for delineating conservation units. Trends in Ecology and Evolution, 27, 489–496.

Futschik A, Schlötterer C (2010) The next generation of molecular markers from massively parallel sequencing of pooled DNA samples. Genetics, 186, 207–218.

Gautier M, Foucaud J, Gharbi K et al. (2013) Estimation of population allele frequencies from next-generation sequencing data: Pool-versus individual-based genotyping. Molecular Ecology, 22, 3766–3779.

Gilbertson CR, Wyatt JD (2016) Evaluation of euthanasia techniques for an invertebrate species, land snails (Succinea putris). Journal of the American Association for Laboratory Animal Science, 55, 1–5.

Grasby SE, van Everdingen RO, Bednarski J, Lepitzki DA (2003) Travertine mounds of the Cave and Basin National Historic Site, Banff National Park. Canadian Journal of Earth Sciences, 40, 1501–1513.

Grasby SE, Lepitzki DAW (2002) Physical and chemical properties of the Sulphur Mountain thermal springs, Banff National Park, and implications for endangered snails. Canadian Journal of Earth Sciences, 39, 1349–1361.

Gu QH, Zhou CJ, Cheng QQ et al. (2015) The perplexing population genetic structure of Bellamya purificata (Gastropoda: Viviparidae): low genetic differentiation despite low dispersal ability. Journal of Molluscan Studies, 81, 466–475.

Guisan A, Tingley R, Baumgartner JB et al. (2013) Predicting species distributions for conservation decisions. Ecology Letters, 16, 1424–1435.

Gustafson KD, Kensinger BJ, Bolek MG, Luttbeg B (2014) Distinct snail (Physa) morphotypes from different habitats converge in shell shape and size under common garden conditions. Evolutionary Ecology Research, 16, 77–89.

Hedrick PW, Kalinowski ST. (2000) Inbreeding Depression in Conservation Biology. Annual Review of Ecology and Systematics, 31, 139–162.

Hedrick PW, Peterson RO, Vucetich LM, Adams JR, Vucetich JA (2014) Genetic rescue in Isle Royale wolves: genetic analysis and the collapse of the population. Conservation Genetics, 15, 1111–1121.

Hendricks S, Epstein B, Schönfeld B et al. (2017) Conservation implications of limited genetic diversity and population structure in Tasmanian devils (Sarcophilus harrisii). Conservation Genetics, 18, 977–982.

Hivert V, Leblois R, Petit EJ, Gautier M, Vitalis R (2018) Measuring genetic differentiation from pool-seq data. Genetics, 210, 315–330.

Hoban S, Kelley JL, Lotterhos KE et al. (2016) Finding the genomic basis of local adaptation: Pitfalls, practical solutions, and future directions. The American Naturalist, 188, 379–397.

Hoelzel AR, Halley J, O’Brien SJ et al. (1993) Elephant seal genetic variation and the use of simulation models to investigate historical population bottlenecks. Journal of Heredity, 84, 443–449.

Holsinger KE, Weir BS (2009) Genetics in geographically structured populations: defining, estimating and interpreting FST. Nature reviews. Genetics, 10, 639–650.

Hooper DU, Adair EC, Cardinale BJ et al. (2012) A global synthesis reveals biodiversity loss as a major driver of ecosystem change. Nature, 486, 105–108.

Howard SD, Bickford DP (2014) Amphibians over the edge: Silent extinction risk of Data Deficient species. Diversity and Distributions, 20, 837–846.

Ingvarsson PK (2001) Restoration of genetic variation lost - The genetic rescue hypothesis. Trends in Ecology and Evolution, 16, 62–63.

Isaac NJB, Mallet J, Mace GM (2004) Taxonomic inflation: Its influence on macroecology and conservation. Trends in Ecology and Evolution, 19, 464–469.

Jarne P, Perdieu MA, Pernot AF, Delay B, David P (2000) The influence of self-fertilization and grouping on fitness attributes in the freshwater snail Physa acuta: Population and individual inbreeding depression. Journal of Evolutionary Biology, 13, 645–655.

Johnson PD, Bogan AE, Brown KM et al. (2013) Conservation status of freshwater gastropods of Canada and the United States. Fisheries, 38, 247–282.

Joseph LN, Maloney RF, Possingham HP (2009) Optimal allocation of resources among threatened species: a project prioritization protocol. Conservation Biology, 23, 328–338.

Kell LT, Dickey-Collas M, Hintzen NT et al. (2009) Lumpers or splitters? Evaluating recovery and management plans for metapopulations of herring. ICES Journal of Marine Science, 66, 1776–1783.

Keller LF, Waller DM (2002) Inbreeding effects in wild populations. Trends in Ecology and Evolution, 17, 230–241.

Kess T, Gross J, Harper F, Boulding EG (2016) Low-cost ddRAD method of SNP discovery and genotyping applied to the periwinkle Littorina saxatilis. Journal of Molluscan Studies, 82, 104–109.

Kofler R, Orozco-terWengel P, de Maio N et al. (2011a) PoPoolation: A toolbox for population genetic analysis of next generation sequencing data from pooled individuals. PLoS ONE, 6.

Kofler R, Pandey RV, Schlötterer C (2011b) PoPoolation2: Identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics, 27, 3435–3436.

Koskinen MT, Haugen TO, Primmer CR (2002) Contemporary fisherian life-history evolution in small salmonid populations. Nature, 419, 826–830.

Kremer CS, Vamosi SM, Rogers SM (2017) Watershed characteristics shape the landscape genetics of brook stickleback (Culaea inconstans) in shallow prairie lakes. Ecology and Evolution, 7, 3067–3079.

Laikre L (2010) Genetic diversity is overlooked in international conservation policy implementation. Conservation Genetics, 11, 349–354.

Van Leeuwen CHA, Huig N, Van Der Velde G et al. (2013) How did this snail get here? Several dispersal vectors inferred for an aquatic invasive species. Freshwater Biology, 58, 88–99.

Lepitzki DAW (1998) The ecology of Physella johnsoni, the threatened Banff Springs Snail. Heritage Resource Conservation - Aquatics, i-146.

Lepitzki DAW (2002) Status of the Banff Springs Snail (Physella johnsoni) in Alberta. Alberta Sustainable Resource Development, Fish and Wildlife Division, and Alberta Conservation Association, Wildlife Status Report No. 40, Edmonton, AB., 29 pp.

Lepitzki DAW, Pacas C (2010) Recovery Strategy and Action Plan for the Banff Springs Snail (Physella johnsoni) in Canada. Species at Risk Act Recovery Strategy Series. Parks Canada Agency, Ottawa, vii + 63 pp.

Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760.

Li H, Handsaker B, Wysoker A et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078–2079.

Lien S, Gidskehaug L, Moen T et al. (2011) A dense SNP-based linkage map for Atlantic salmon (Salmo salar) reveals extended chromosome homeologies and striking differences in sex-specific recombination patterns. BMC Genomics, 12, 1–10.

Liu MM, Davey JW, Banerjee R et al. (2013) Fine mapping of the pond snail left-right asymmetry (chirality) locus using RAD-Seq and Fibre-FISH. PLoS ONE, 8, 2–8.

Lotterhos KE, Whitlock MC (2015) The relative power of genome scans to detect local adaptation depends on sampling design and statistical method. Molecular Ecology, 24, 1031–1046.

Lounnas M, Correa AC, Alda P et al. (2018) Population structure and genetic diversity in the invasive freshwater snail Galba schirazensis (Lymnaeidae). Canadian Journal of Zoology, 96, 425–435.

Lundmark C, Sandström A, Andersson K, Laikre L (2019) Monitoring the effects of knowledge communication on conservation managers’ perception of genetic biodiversity – A case study from the Baltic Sea. Marine Policy, 99, 223–229.

Mace GM (2004) The role of taxonomy in species conservation. Philosophical transactions of the Royal Society of London. Series B, Biological Sciences, 359, 711–9.

Margres MJ, Jones ME, Epstein B et al. (2018) Large-effect loci affect survival in Tasmanian devils (Sarcophilus harrisii) infected with a transmissible cancer. Molecular Ecology, 27, 4189–4199.

Martin TG, Nally S, Burbidge AA et al. (2012) Acting fast helps avoid extinction. Conservation Letters, 5, 274–280.

Mavárez J, Amarista M, Pointier JP, Jarne P (2002a) Fine-scale population structure and dispersal in Biomphalaria glabrata, the intermediate snail host of Schistosoma mansoni, in Venezuela. Molecular Ecology, 11, 879–889.

Mavárez J, Pointier JP, David P, Delay B, Jarne P (2002b) Genetic differentiation, dispersal and mating system in the schistosome-transmitting freshwater snail Biomphalaria glabrata. Heredity, 89, 258–265.

Mavárez J, Steiner C, Pointier J-P, Jarne P (2002c) Evolutionary history and phylogeography of the schistosome-vector freshwater snail Biomphalaria glabrata based on nuclear and mitochondrial DNA sequences. Heredity, 89, 266–272.

McCallum H (2008) Tasmanian devil facial tumour disease: lessons for conservation biology. Trends in Ecology and Evolution, 23, 631–637.

Mee JA, Bernatchez L, Reist JD, Rogers SM, Taylor EB (2015) Identifying designatable units for intraspecific conservation prioritization: A hierarchical approach applied to the lake whitefish species complex (Coregonus spp.). Evolutionary Applications, 8, 423–441.

Moore AC, Burch JB, Duda TF (2014) Recognition of a highly restricted freshwater snail lineage (Physidae: Physella) in southeastern Oregon: convergent evolution, historical context, and conservation considerations. Conservation Genetics, 16, 113–123.

Morais AR, Siqueira MN, Lemes P et al. (2013) Unraveling the conservation status of data deficient species. Biological Conservation, 166, 98–102.

Morris MRJ, Bowles E, Allen BE, Jamniczky HA, Rogers SM (2018) Contemporary ancestor? Adaptive divergence from standing genetic variation in Pacific marine threespine stickleback. BMC Evolutionary Biology, 18, 1–21.

Morris MRJ, Richard R, Leder EH et al. (2014) Gene expression plasticity evolves in response to colonization of freshwater lakes in threespine stickleback. Molecular Ecology, 23, 3226–3240.

de Oliveira LR, Arias-Schreiber M, Meyer D, Morgante JS (2006) Effective population size in a bottlenecked fur seal population. Biological Conservation, 131, 505–509.

Ouborg NJ, Pertoldi C, Loeschcke V, Bijlsma RK, Hedrick PW (2010) Conservation genetics in transition to conservation genomics. Trends in Genetics, 26, 177–187.

Parsons ECM (2016) Why IUCN should replace “data deficient” conservation status with a precautionary “assume threatened” status—A cetacean case study. Frontiers in Marine Science, 3, 2015–2017.

Peichel CL, Sullivan ST, Liachko I, White MA (2017) Improvement of the threespine stickleback genome using a Hi-C-based proximity-guided assembly. Journal of Heredity, 108, 693–700.

Peterson RO, Thomas NJ, Thurber JM, Vucetich JA, Waite TA (1998) Population limitation and the wolves of Isle Royale. Journal of Mammalogy, 79, 828.

Peterson BK, Weber JN, Kay EH, Fisher HS, Hoekstra HE (2012) Double digest RADseq: An inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS ONE, 7, 1–11.

Pip E, Franck JPC (2008) Molecular phylogenetics of central Canadian Physidae (Pulmonata: Basommatophora). Canadian Journal of Zoology, 16, 10–16.

Régnier C, Achaz G, Lambert A et al. (2015) Mass extinction in poorly known taxa. Proceedings of the National Academy of Sciences, 112, 7761–7766.

Régnier C, Fontaine B, Bouchet P (2009) Not knowing, not recording, not listing: Numerous unnoticed mollusk extinctions. Conservation Biology, 23, 1214–1221.

Remigio EA, Lepitzki DAW, Lee JS, Hebert PDN (2001) Molecular systematic relationships and evidence for a recent origin of the thermal spring endemic snails Physella johnsoni and Physella wrighti (Pulmonata: Physidae). Canadian Journal of Zoology, 79, 1941–1950.

Roberts DW (2012) Package ‘labdsv’

Rogers SM, Bernatchez L (2007) The genetic architecture of ecological speciation and the association with signatures of selection in natural lake whitefish (Coregonus sp. Salmonidae) species pairs. Molecular Biology and Evolution, 24, 1423–1438.

Rosenberg G (2014) A new critical estimate of named species-level diversity of the recent Mollusca. American Malacological Bulletin, 32, 308–322.

RStudio Team (2016) RStudio: Integrated Development for R. RStudio, Inc., Boston, MA http://www.rstudio.com/.

Sankurathri CS, Holmes JC (1976) Effects of thermal effluents on the population dynamics of Physa gyrina Say (Mollusca: Gastropoda) at Lake Wabamun, Alberta. Canadian Journal of Zoology, 54, 582–590.

Santamaría L, Klaassen M (2002) Waterbird-mediated dispersal of aquatic organisms: An introduction. Acta Oecologica, 23, 115–119.

Schlötterer C, Tobler R, Kofler R, Nolte V (2014) Sequencing pools of individuals — mining genome-wide polymorphism data without big funding. Nature Publishing Group, 15, 749–763.

Schmieder R, Edwards R (2011a) Fast identification and removal of sequence contamination from genomic and metagenomic datasets. PLoS ONE, 6.

Schmieder R, Edwards R (2011b) Quality control and preprocessing of metagenomic datasets. Bioinformatics, 21, 863–864.

Shafer ABA, Wolf JBW, Alves PC et al. (2015) Genomics and the challenging translation into conservation practice. Trends in Ecology and Evolution, 30, 78–87.

Shaffer ML (1981) Minimum population sizes for species conservation. Bioscience, 31, 131–134.

Storfer A, Hohenlohe PA, Margres MJ et al. (2018) The devil is in the details: Genomics of transmissible cancers in Tasmanian devils. PLoS Pathogens, 14, 1–7.

Taylor DW (2003) Introduction to Physidae (Gastropoda: Hygrophila); biogeography, classification, morphology. Revista de Biologia Tropical, 51, 1–287.

Viard F, Justy F, Jarne P (1997) The influence of self-fertilization and population dynamics on the genetic structure of subdivided populations: a case study using microsatellite markers in the freshwater snail Bulinus truncatus. Evolution, 51, 1518–1528.

Vilà C, Sundqvist A-K, Flagstad Ø et al. (2003) Rescue of a severely bottlenecked wolf (Canis lupus) population by a single immigrant. Royal Society, 270, 91–97.

Weber DS, Stewart BS, Lehman N (2004) Genetic consequences of a severe population bottleneck in the Guadalupe fur seal (Arctocephalus townsendi). Journal of Heredity, 95, 144–153.

Wethington AR, Guralnick R (2004) Are populations of physids from different hot springs distinctive lineages? American Malacological Bulletin, 19, 135–144.

Wethington AR, Lydeard C (2007) A molecular phylogeny of Physidae (Gastropoda: Basommatophora) based on mitochondrial DNA sequences. Journal of Molluscan Studies, 73, 241–257.

Whitlock MC, Lotterhos KE (2015) Reliable detection of loci responsible for local adaptation: inference of a null model through trimming the distribution of FST. The American Naturalist, 186, S24–S36.

Whitlock M, McCauley D (1999) Indirect measures of gene flow and migration: FST ≠ 1/(4Nm + 1). Heredity, 82, 117–125.

Wickham H (2016) ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, http://ggplot2.org

Wright S (1950) The genetical structure of populations. The Annals of Human Genetics, 324–354.

Appendix A: Genomic analysis pipeline

General notes about the project

Below is the pipeline I used to analyze the Pool-seq data. I have generalized each step so that future students can adjust it to their projects. For my project, I analyzed five sites of Physella johnsoni and three sites of Physella gyrina, with 20 to 40 individuals per site. These were sequenced by Genome Quebec on the Illumina HiSeq X, aiming for about 40x coverage (the realized coverage was almost double that). Each site consisted of half a lane worth of data (total of four lanes). The pipeline is not linear, in that some analyses branch off at certain points. Most of this was run on Cedar, so the way modules are loaded reflects that, although some of it was run on ARC. I included the SLURMs because they give an idea of how long each step took to run for my files. Remember that my files were half a lane each; a colleague of mine sequenced six pools over one lane and her analysis took a fraction of the time listed below.

Getting started in Cedar

Launch Terminal on a Mac. On Windows, an SSH client such as PuTTY can be used instead.

Logging into Cedar

ssh [email protected] #ex. [email protected]
#You will be prompted for your password
#When you type it in nothing will appear but just hit "enter" when you've typed it in
#and it should log you right in!

Navigating around Cedar and creating a project directory

ls #will show you all the directories in your home folder

#How to make symbolic link to our project folder in def-srogers

rm project #removes the current project directory

ln -s projects/def-srogers/your_user_name project #links "project" to your folder in def-srogers

cd project #changes the directory you are in to project

pwd #will give you the path to the directory you are in

mkdir project_name #will make a directory in the project directory named "project_name".

cd project_name

mkdir 00_nameofstep1 #this is a good way to keep your steps in order

cd 00*tab* #by using "tab" on your keyboard it will auto-complete the name

It is worth spending some time learning Unix commands. I wish I had spent more time doing this, instead of trying to figure it out as I went.

How to make executable codes

You will need to choose a text editor to use on Cedar or Graham. We chose GNU nano because it is beginner friendly. To make executable code, do the following:

nano codename #will open a new, blank, nano with the name "codename"

type code

^X #this will close the code and you can select to save it.
#nano has functions listed at the bottom of the file - "^" is "control"

ls #will list all the files in the directory you are in - should see your code here!

chmod +x codename #this changes the code so that it is now executable, important!
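The same steps can be sketched without opening an editor, which can be handy for checking that the execute permission really was set. This is only an illustration: the temporary directory and the two-line script body are assumptions made for the example, and `codename` is the placeholder name used above.

```shell
#!/bin/bash
# Work in a throwaway directory so nothing in your project is touched (assumption).
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Write a two-line script called "codename" (stands in for typing it in nano).
cat > codename <<'EOF'
#!/bin/bash
echo "hello from codename"
EOF

chmod +x codename   # add the execute permission
ls -l codename      # the permission string should now contain "x"
./codename          # run the code by giving its path
```

If the `chmod` step is skipped, the last line would typically fail with a "Permission denied" message.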

How to submit jobs to run on Cedar

To run code on Cedar, you need to submit it as a “job”. The way this was explained to me is that you submit your request to the “secretary” of Cedar, who sends your job to the appropriate place. This is done through a scheduler called “SLURM”.

Create a SLURM

nano SLURM_name
#copy and paste below and change as necessary
^X #close and save SLURM

Example SLURM

#!/bin/bash
#The line above must be the first line of the file
# ------
# Place you can leave yourself a descriptor of this SLURM
# ------

#SBATCH --job-name=Nameofyourjob #Make this descriptive but short!
#SBATCH --account=def-srogers #the account this is under
#SBATCH --cpus-per-task=XX #how many threads you want
#SBATCH --time=0-00:00 #time you want - format is days-hours:minutes
#SBATCH --mem=XXG #amount of memory you want

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------

/path/to/thecodeyouwanttorun

# ------
echo "Job finished with exit code $? at: `date`"
# ------

Submit SLURM and check on status

sbatch SLURM_name #will submit job to queue

squeue -u your_user_name #will give you the job's status

scancel job_ID_number #will cancel the job

Once the job starts running, a SLURM out-file will be created in the format of: slurm-job_ID_number.out. This out-file gives you information on what step your job is on and whether or not it completed successfully. If successful, it will contain “Job finished with exit code 0”. If it doesn’t say that, it may give you a helpful error message or a non-helpful one.

Getting information on the job after it has run

sacct -j job_ID_number --format=JobID,JobName,MaxRSS,Elapsed
#Gives the JobID (kind of redundant), the name of the job, the peak memory used and the time it took.
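When many jobs have been submitted, the success line in each out-file can also be checked from the command line rather than by opening every file. The sketch below assumes the out-file contains the "Job finished with exit code" line written by the example SLURM. `check_slurm_out` is a hypothetical helper name, not a SLURM command, and the demonstration file is made up.

```shell
#!/bin/bash
# check_slurm_out is a hypothetical helper (not part of SLURM).
# It succeeds only if the out-file contains the success line from the SLURM template.
check_slurm_out() {
    grep -q "Job finished with exit code 0 " "$1"
}

# Made-up out-file for demonstration; real ones are named slurm-job_ID_number.out.
demo_out="$(mktemp)"
printf '%s\n' "Starting run at: (start time)" \
              "Job finished with exit code 0 at: (end time)" > "$demo_out"

if check_slurm_out "$demo_out"; then
    echo "OK: $demo_out"
else
    echo "FAILED or still running: $demo_out"
fi
```

Note that the exit code reported is that of the last command in the SLURM before the final echo, so an earlier step could still have failed if several commands are run in one job.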

Convert BCL to Fastq - not allowing barcode mismatches

If working with BCL files, you must first convert them to fastq format. Genome Quebec will give you your files in fastq format. However, they allow one barcode mismatch. When you need full confidence that each sequence is assigned to the right population, you will need to allow no mismatches. You may be able to ask Genome Quebec to do this on their end, but I didn’t know this until after they had done the conversion, so I got the BCL files from them.

Manual: https://support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/bcl2fastq/bcl2fastq2_guide_15051736_v2.pdf

tiles: these are the tiles that your data is on (because there will likely be other sequence info)
sample-sheet: need to get this from Genome Quebec. Just info about your sequences
-r 32 -p 32 -w 32: reading, processing and writing thread count. I set them all to 32.
use-bases-mask Y151,I6n2,Y151: this is specific to the sequencing run - the Y151 is because it is PE 150 reads

Ex. bcl2fastq.1

#The run folder name below is an identifier - change!
#Change --tiles to the tile you are converting - I had s_1 to s_4
#Change the sample sheet depending on tile - there is a SampleSheet for each tile
module load bcl2fastq2/2.20 && \
bcl2fastq \
--runfolder-dir /path/to/BCLfiles/171207_E00434_0072_AHGNNHCCXY_4467HS23A \
--output-dir /path/to/00_raw_data \
--tiles s_1 \
--sample-sheet /path/to/BCLfiles/171207_E00434_0072_AHGNNHCCXY_4467HS23A/SampleSheet.1.csv \
--create-fastq-for-index-reads \
-r 32 -p 32 -w 32 \
--barcode-mismatches 0 \
--use-bases-mask Y151,I6n2,Y151

#!/bin/bash
# ------
# Slurm for bcl2fastq for L001
# ------

#SBATCH --job-name=bcl2fastq.1
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=32
#SBATCH --time=0-00:45
#SBATCH --mem=20G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------

/path/to/bcl2fastq.1

# ------
echo "Job finished with exit code $? at: `date`"
# ------
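Because only the tile and the sample sheet change between lanes, the four conversion commands can be written out with a loop before any of them is run. The sketch below is a dry run that only prints the commands. It assumes the sample sheets are named SampleSheet.1.csv to SampleSheet.4.csv, matching tiles s_1 to s_4, and `print_bcl2fastq_cmds` is a hypothetical helper name. Only options from the example above are used.

```shell
#!/bin/bash
# print_bcl2fastq_cmds is a hypothetical helper for a dry run:
# it prints one command per lane and runs nothing.
print_bcl2fastq_cmds() {
    # Placeholder run folder taken from the example; change it to your own.
    run_dir="/path/to/BCLfiles/171207_E00434_0072_AHGNNHCCXY_4467HS23A"
    for lane in 1 2 3 4; do
        echo "bcl2fastq --runfolder-dir $run_dir" \
             "--output-dir /path/to/00_raw_data" \
             "--tiles s_$lane" \
             "--sample-sheet $run_dir/SampleSheet.$lane.csv" \
             "--create-fastq-for-index-reads -r 32 -p 32 -w 32" \
             "--barcode-mismatches 0 --use-bases-mask Y151,I6n2,Y151"
    done
}

print_bcl2fastq_cmds
```

Each printed line could then be pasted into its own code file (bcl2fastq.1, bcl2fastq.2 and so on) and submitted with a matching SLURM.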

Concatenate files together - if sites were run on two or more lanes of the flow cell

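My sites were split over two lanes on the flow cells, so when I converted them from BCL to fastq I had four files for each site (two lanes, each with R1 and R2). Concatenate the lanes for each read direction, for example for R1:

cat XXPool_L001_R1.fq.gz XXPool_L002_R1.fq.gz > XXPool_R1.fq.gz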
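A minimal sketch of doing this for both read directions at once is given below. It relies on the fact that gzip files can be joined with `cat` without decompressing them first. The lanes must be given in the same order for R1 and R2 so that read pairs stay in step. The tiny FASTQ files and the temporary directory are made up for the demonstration.

```shell
#!/bin/bash
# Demonstration in a temporary directory with tiny made-up FASTQ files (assumption).
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

for lane in L001 L002; do
    for read in R1 R2; do
        printf '@%s_%s\nACGT\n+\nIIII\n' "$lane" "$read" | gzip > "XXPool_${lane}_${read}.fq.gz"
    done
done

# Join the two lanes separately for forward (R1) and reverse (R2) reads,
# always listing L001 before L002.
for read in R1 R2; do
    cat "XXPool_L001_${read}.fq.gz" "XXPool_L002_${read}.fq.gz" > "XXPool_${read}.fq.gz"
done

# Each combined file should now hold the reads of both lanes (2 reads = 8 lines here).
gzip -dc XXPool_R1.fq.gz | wc -l
gzip -dc XXPool_R2.fq.gz | wc -l
```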

FastQC - check the quality of your sequencing reads

FastQC will give you a report on the quality of your sequencing reads. https://dnacore.missouri.edu/PDF/FastQC_Manual.pdf is a pretty good tutorial on how to interpret the reports. The code below will loop through every .fq.gz file in the directory you tell it to look in and make a report for each.

#!/bin/bash
# ------
# FastQC for pool sites
# ------

#SBATCH --job-name=fastqc
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=1
#SBATCH --time=0-15:00
#SBATCH --mem=5G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------

module load fastqc/0.11.5

for i in /path/to/fastqfiles/*fq.gz
do
fastqc -o /path/to/where/you/want/reports "$i"
done

# ------
echo "Job finished with exit code $? at: `date`"
# ------

Trimmomatic - cleaning and filtering low-quality reads and removing adaptors

The available adaptor file doesn’t have all of the adaptors in it. Please contact me if you would like the file we created. We included all of the adaptors used in our library preparation.

phred: can be 64. Depends on your sequencing.
threads: change to the number of threads you have available on your cluster. We used 16, half of a node, but I think we could have used more. Just make sure to match this to what you are asking for in your SLURM.
ILLUMINACLIP: path to the adaptor file you made, followed by three settings:
2: seed mismatches
30: palindrome clip threshold, i.e. how accurate the match between the two adaptor-ligated reads must be
10: simple clip threshold (I have a doc that goes into more detail if interested)
CROP: we were having an issue in some of our reads where Trimmomatic just wasn’t detecting the repetitive sequences in the last 15 nucleotides. Hence the hard crop to 135 bp, which trims the last 15 nucleotides off each read.
LEADING: trim the leading nucleotides if they fall under Q5
TRAILING: trim the trailing nucleotides if they fall under Q5
SLIDINGWINDOW: 5:20 - look at 5 bases at a time and trim if the average Q is less than 20
MINLEN: only keep reads that are at least 100 bp long

java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.36.jar PE -phred33 -threads 16 -trimlog logfile \
/path/to/fastqfiles/XXPool_R1.fq.gz /path/to/fastqfiles/XXPool_R2.fq.gz \
XX_R1_P_qtrim.fq XX_R1_U_qtrim.fq XX_R2_P_qtrim.fq XX_R2_U_qtrim.fq \
ILLUMINACLIP:/path/to/Adaptors/TruSeq3-PE-all.fa:2:30:10 CROP:135 LEADING:5 TRAILING:5 SLIDINGWINDOW:5:20 MINLEN:100

#!/bin/bash
# ------
# Trim cat files
# ------

#SBATCH --job-name=XX_trim
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-20:00
#SBATCH --mem=15G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------

module load trimmomatic/0.36

/path/to/trimcode/trimcode

# ------
echo "Job finished with exit code $? at: `date`"
# ------
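One way to keep the -threads value in step with the SLURM request is to read it from the environment instead of typing it twice. The sketch below assumes the code is run inside a SLURM job that requested --cpus-per-task, in which case SLURM sets the SLURM_CPUS_PER_TASK variable. `pick_threads` is a hypothetical helper name, and the command is only printed (and shortened with "..."), not run.

```shell
#!/bin/bash
# pick_threads is a hypothetical helper: it returns the number of CPUs SLURM
# gave the job, or 1 when the variable is unset (e.g. on a login node).
pick_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

threads="$(pick_threads)"

# Print the start of the Trimmomatic command so the value can be checked first.
echo "java -jar \$EBROOTTRIMMOMATIC/trimmomatic-0.36.jar PE -phred33 -threads $threads ..."
```

With this approach, changing --cpus-per-task in the SLURM is the only edit needed when more or fewer threads are requested.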

Post-trim FastQC - check the quality of your sequencing reads

Same as above but for the post-trim files. You should see the quality go up and fewer sequences remaining. Look for over-represented k-mers; there shouldn’t be any!
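The drop in the number of sequences can also be checked directly by counting reads before and after trimming. The sketch below assumes standard four-line FASTQ records, so the read count is the line count divided by four. `count_reads` is a hypothetical helper name and the two tiny files are made up for the demonstration.

```shell
#!/bin/bash
# count_reads is a hypothetical helper: a FASTQ record is four lines,
# so the number of reads is the line count divided by four.
count_reads() {
    case "$1" in
        *.gz) lines="$(gzip -dc "$1" | wc -l)" ;;
        *)    lines="$(wc -l < "$1")" ;;
    esac
    echo $((lines / 4))
}

# Tiny made-up files standing in for the pre- and post-trim data (assumption).
workdir="$(mktemp -d)"
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nAC\n+\nII\n' | gzip > "$workdir/XXPool_R1.fq.gz"
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n' > "$workdir/XX_R1_P_qtrim.fq"

before="$(count_reads "$workdir/XXPool_R1.fq.gz")"
after="$(count_reads "$workdir/XX_R1_P_qtrim.fq")"
echo "reads before trimming: $before, after: $after"
```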

DeconSeq - removing non-snail contaminants from the sequences

Full disclaimer: I made the databases in a variety of ways, as I found out that one way wouldn’t work for every organism.

Step 1.1: How I created the databases for Archaea and Algae (Charophyceae, Chlorophyta, Cryptophyta, Eustigmatophyceae and Klebsormidiophyceae)

Create a database from NCBI of the sequences you would like to remove. You need to get the GI list from NCBI (as per http://johnstantongeddes.org/aptranscriptome/2013/12/31/notes.html or https://www.biostars.org/p/6528/). You will need to install the newest version of NCBI BLAST+ (https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download). I have heard that NCBI will be switching to taxids rather than GI lists to create databases, but at the time of writing GI lists were still being used. In this example, I am using Archaea. The key thing is that the nt database and your GI list must exist in the same directory. It will be messy, which is why I put an “X” in front of the files I was making so that all of them were at the bottom.

/path/to/ncbi_directory/ncbi-blast-2.7.1+/bin/blastdb_aliastool -db nt -gilist Archaea.gi -dbtype nucl -out X_nt_archaea -title "database for Archaea"

I just ran this on Cedar without a SLURM script. It takes ~30 seconds to a minute. It creates a .nal file which, if you give “X_nt_archaea” as your database, masks everything in the database except those sequences. Once you’ve created the .nal file, you need to convert the database to a .fasta file. It’s small, so I ran it in a SLURM script.
#!/bin/bash
# ------
# NCBI to fasta
# ------

#SBATCH --job-name=Charo_fasta
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=1
#SBATCH --time=0-05:00
#SBATCH --mem=1G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------

module load perl/5.22.4

/path/to/ncbi/ncbi-blast-2.7.1+/bin/blastdbcmd -entry all -db X_nt_archaea -out Archaea.fasta

# ------
echo "Job finished with exit code $? at: `date`"
# ------

Step 1.2: Creating the database for Threespine stickleback and Human

Threespine stickleback downloaded from: https://datadryad.org/resource/doi:10.5061/dryad.h7h32 (Peichel et al. 2017)
Human genome was downloaded from: ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh38.p12_ch. I used the tutorial provided by DeconSeq to access the human genome.
#Download sequence data
for i in {1..22} X Y MT; do wget ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh38.p12_chr$i.fa.gz; done

#Extracting and joining data
for i in {1..22} X Y MT; do gzip -dvc hs_ref_GRCh38.p12_chr$i.fa.gz >>hs_ref_GRCh38_p12.fa; rm hs_ref_GRCh38.p12_chr$i.fa.gz; done

Step 1.3: Creation of the Bacterial database

I downloaded all of the full bacterial genomes from NCBI (I think there were about 10,000 of them?), plus all assembly levels for bacteria that have been found in the thermal springs, onto a computer (many Gb). I then used Globus to put them onto Cedar. https://www.ncbi.nlm.nih.gov/assembly/?term=bacteria
You will need to unzip any files that are zipped (including your query sequences) because DeconSeq can’t use zipped files. You can use:
for file in *.gz #loop through all files with this file extension
do
gunzip "$file" #unzip them
done #when it's finished, stop.

This had to be done for all of the bacterial genomes. It took 10+ hours, so I would consider submitting it as a SLURM job. Then I concatenated the bacterial genomes together, using:

#!/bin/bash
# ------
# cat FG NCBI database
# ------

#SBATCH --job-name=cat_bac
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-10:00
#SBATCH --mem=1G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------

find . -name '*.fna' -print0 | xargs -0 cat > bacteria_genomes.fa

# ------
echo "Job finished with exit code $? at: `date`"
# ------

Step 2: Splitting sequences by long repeats of ambiguous base N (this is from the DeconSeq manual)

Below is all for the human genome because that is what is listed in the manual, but I did this for all of the databases.
cat hs_ref_GRCh38_p12.fa | perl -p -e 's/N\n/N/' | perl -p -e 's/^N+//;s/N+$//;s/N{200,}/\n>split\n/' >hs_ref_GRCh38_p12_split.fa; rm hs_ref_GRCh38_p12.fa

Step 3: Filtering databases

Need to download and install PRINSEQ - can be found at https://sourceforge.net/projects/prinseq/files/
This step needs a SLURM because it will run out of memory otherwise.
#!/bin/bash
# ------
# Filtering sequences - PRINSEQ
# ------

#SBATCH --job-name=human_prinseq
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=1
#SBATCH --time=0-0:10
#SBATCH --mem=10G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------

module load perl/5.22.4

perl /path/to/prinseq-lite-0.20.4/prinseq-lite.pl -log -verbose -fasta hs_ref_GRCh38_p12_split.fa -min_len 200 -ns_max_p 10 -derep 12345 -out_good hs_ref_GRCh38_p12_split_prinseq -seq_id hs_ref_GRCh38_p12_ -rm_header -out_bad null

# ------
echo "Job finished with exit code $? at: `date`"
# ------

Step 4: Index the databases

For the bacterial database (because it is over 200Gb) you will need to first split it into manageable chunks before BWA can use them. You can use fasta splitter (http://kirill-kryukov.com/study/tools/fasta-splitter/). The files need to be under 3Gb each, so split it into as many chunks as you need. I did 100.
#!/bin/bash
# ------
# FASTA splitter
# ------

#SBATCH --job-name=fasta_split_bac
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-10:00
#SBATCH --mem=100G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------

module load perl/5.22.4

perl /path/to/fasta-splitter.pl --n-parts 100 Bacteria_FG_split_prinseq.fasta

# ------
echo "Job finished with exit code $? at: `date`"
# ------

The next step is to index the databases. You must use the BWA provided in the DeconSeq package! The newer BWA reads the files incorrectly for this and will only produce 5 of the 8 outfiles necessary. Cedar is a 64-bit Linux system, so use bwa64. I know that the top examples have been using the human database (sorry for the lack of consistency), but the bacterial one needed some extra things to make it run. I didn’t want to run 100 SLURMs, so this is a batch job! Modified from a script kindly provided by Dr. Stefan Dennenmoser.
#!/bin/bash
# ------
# BWA
# ------

#SBATCH --job-name=index_bac
#SBATCH --ntasks=1
#SBATCH --account=def-srogers
#SBATCH --time=0-10:00
#SBATCH --mem-per-cpu=20G
#SBATCH --array=1-100

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------

cd /path/to/Bacterial_Database

filename=`ls -1 *.fasta* | tail -n +${SLURM_ARRAY_TASK_ID} | head -1`

filename2=${filename::-6} #the filename without the .fasta part (-6 letters)

/path/to/deconseq-standalone-0.4.3/bwa64 index -p $filename2 -a bwtsw $filename >bwa.log 2>&1

# ------
echo "Job finished with exit code $? at: `date`"
# ------
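The file-picking logic in the array job above can be tried out on its own, without SLURM or BWA. The sketch below is a stand-in and not the original script: nth_fasta is a hypothetical helper, the file names are invented, and the index is passed by hand where the real job would use SLURM_ARRAY_TASK_ID. It also strips the extension with ${filename%.fasta}, which avoids counting characters and works in shells where ${filename::-6} does not.

```shell
# Set up three made-up chunk files to stand in for the split database.
workdir="$(mktemp -d)"
touch "$workdir/db.part-001.fasta" "$workdir/db.part-002.fasta" "$workdir/db.part-003.fasta"

# Hypothetical helper: print the Nth .fasta file (1-based) in a directory.
nth_fasta() {
    ( cd "$1" && ls -1 *.fasta | sed -n "${2}p" )
}

# In the real array job the index would be ${SLURM_ARRAY_TASK_ID}.
filename="$(nth_fasta "$workdir" 2)"

# Strip the .fasta suffix to get the prefix passed to "bwa64 index -p".
prefix="${filename%.fasta}"
echo "$filename -> $prefix"
```

Each array task then indexes exactly one chunk, so --array=1-100 covers all 100 files.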

Step 5: Configure the DeconSeq file

You will have to go into the installed DeconSeq directory and set up the configuration file DeconSeqConfig.pm. You only need to change the database directory, the output directory, and which files are accessed in the “use constant DBS” section.

Step 6: Split the query sequences

DeconSeq is unable to handle big datasets (e.g. the 50+ Gb files that I was using). You will need to use fastq splitter (http://kirill-kryukov.com/study/tools/fastq-splitter/). I don’t think I ran this in a SLURM; I just used the console. If it fails... put it in a SLURM.
perl /path/to/fastq-splitter.pl --n-parts 50 --check XX_R1.fq

Step 7: ACTUALLY RUNNING DECONSEQ

For DeconSeq you need to choose the identity (-i), which is the percent match between your query sequence and the database, and the coverage (-c), which is the percentage of the query sequence that aligns. I went with the parameters they used in their paper and what I had seen other people do, which was 94% identity (-i 94) and 90% to 95% coverage (-c 90 or -c 95). Submit this as a batch job.
#!/bin/bash
# ------
# Deconseq
# ------

#SBATCH --job-name=XX_decon
#SBATCH --ntasks=1
#SBATCH --account=def-srogers
#SBATCH --time=3-00:00
#SBATCH --mem-per-cpu=7500MB
#SBATCH --array=1-50

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------

module load perl/5.22.4

#put all of the split files from one pop and read direction in its own directory

cd /path/to/trimmed_seq/XX_R1_split

filename=`ls -1 *.fq* | tail -n +${SLURM_ARRAY_TASK_ID} | head -1`

perl /path/to/deconseq-standalone-0.4.3/deconseq.pl -i 94 -c 90 -f /path/to/trimmed_seq/XX_R1_split/$filename -dbs hsref -out_dir /path/to/deconseq_out/XX_R1 -id $filename

# ------
echo "Job finished with exit code $? at: `date`"
# ------

Step 8: Creating paired files again

DeconSeq was never designed for paired-end sequencing. It therefore processes each read direction separately, which causes sequences to be removed in one direction and not the other. First, concatenate the 50 clean files (or however many you split your input files into). I also did this for the .cont files because I wanted to keep them and see how many sequences were removed. Then you can use the script found at: https://github.com/linsalrob/fastq-pair
Step 1: Clone or download - copy the URL
Step 2: On Cedar or ARC or whatever cluster, type git clone followed by the URL you copied - you should see a directory called “fastq-pair”
Step 3: gcc fastq-pair/*.c -o fastq_pair
Step 4: You should see an executable called “fastq_pair”
Running it is super simple.
path/to/fastq_pair -t VALUE path/to/file_R1.fastq path/to/file_R2.fastq

Where VALUE is roughly the number of sequences in your file. This sets the hash table size. For reasons I don’t understand, I couldn’t get this to run as a SLURM job. It would make the outfiles (if I gave it over 50Gb), but it would run for far longer than it does in the terminal and never finish. I ended up running it in ARC’s console because Cedar doesn’t allocate enough resources to its console.

It makes four output files: R1 paired and unpaired, and R2 paired and unpaired. Then you will need to zip the files back up!
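A rough value for -t can be worked out from the line count, since each FASTQ record takes four lines. The sketch below assumes an uncompressed FASTQ with exactly four lines per read; count_reads is a hypothetical helper and the toy file is made up. It prints the fastq_pair command and does not run it.

```shell
# Hypothetical helper: number of reads = number of lines / 4
# (assumes an uncompressed FASTQ with four lines per record).
count_reads() {
    lines="$(wc -l < "$1")"
    echo $(( lines / 4 ))
}

# Made-up two-read FASTQ for illustration.
toy="$(mktemp)"
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$toy"

n_reads="$(count_reads "$toy")"

# Print the command with the estimated hash table size filled in.
printf '%s\n' "path/to/fastq_pair -t ${n_reads} path/to/file_R1.fastq path/to/file_R2.fastq"
```

On a real file the count only needs to be approximate, because VALUE is described above as roughly the number of sequences.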

DISCOVAR de novo - assembling a reference genome

DISCOVAR de novo is very easy to use, but there are downstream challenges of not having a reference genome. As well, it was designed for paired-end sequences of 250 bp sequenced from one individual prepared PCR-free. In my thesis you can see the impact of breaking these assumptions on the quality of the genome produced. If it is at all possible to run an individual with at least two insert sizes (ex. 1 kb and 5,000 kb), you will likely be able to generate a much more robust assembly (ex. Schell et al. 2017). You can download DISCOVAR de novo from here: https://software.broadinstitute.org/software/discovar/blog/?page_id=98. There are two ways (probably lots more, but these are the ones I use) of getting info from the web onto Cedar.
Option 1
Go to the link you want to download. Right click and “Copy Link Address”. Go to Cedar and run the below.
wget ftp://ftp.broadinstitute.org/pub/crd/DiscovarDeNovo/latest_source_code/LATEST_VERSION.tar.gz

Option 2
You can download to your personal computer (if the file isn’t too big) and then use Globus to transfer it from your computer to Cedar. Globus is supported by Compute Canada. You will have to log in, download Globus to your computer and make your computer an endpoint. Once you download DISCOVAR de novo, you will have to unzip it.
A little aside...
Clumpify (belongs to the BBMap/BBTools package)
Before I assembled the sequences, I used Clumpify to remove PCR duplicates (unlike Picard it doesn’t need a reference genome; however, I don’t think it is as robust) because the library prep is intended to be PCR-free. This is seen to improve the quality of the assembly. I also tried BBNORM to normalize the sequencing depth at around 60x coverage. It decreased the N50 but increased the MPL1 and decreased the estimated chimera rate. In hindsight, I think I should have investigated this assembly more and maybe used it.
dedupe: the command to remove duplicates
subs=2: two sequences that differ by up to two substitutions are still considered duplicates
clumpify.sh in1=/home/youraccount/path/to/trimmed.fq.gz/XX_R1_P.fq.gz in2=/path/to/trimmed.fq.gz/XX_R2_P.fq.gz out1=/path/to/Clumpify/XX_R1_P_nodup.fq.gz out2=/path/to/Clumpify/XX_R2_P_nodup.fq.gz dedupe subs=2

#!/bin/bash
# ------
# Removing duplicates from XX allowing two subs
# ------

#SBATCH --job-name=nodup_XX
#SBATCH --account=def-srogers
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --time=0-00:30
#SBATCH --mem=150G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------

module load nixpkgs/16.09
module load intel/2016.4
module load bbmap/37.36

/home/youraccount/Clumpify/code/XX_clump

# ------
echo "Job finished with exit code $? at: `date`"
# ------

Now that we have files with duplicates removed, back to DISCOVAR de novo.
DISCOVAR de novo takes A LOT of memory.
DiscovarDeNovo READS=/path/to/fastq/files/you/want/to/use/*fq.gz OUT_DIR=/path/to/referencegenome MAX_MEM_GB=1450 NUM_THREADS=32

#!/bin/bash
# ------
# Assembly of genome (using 1 site)
# ------

#SBATCH --job-name=XX_ref
#SBATCH --account=def-srogers
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --time=0-30:00
#SBATCH --mem=1400G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------

path/to/code/DDN_code

# ------
echo "Job finished with exit code $? at: `date`"
# ------

BWA (Burrows-Wheeler Aligner) - aligning sequences to the reference genome

Step 1: Index the reference

This step will create a set of index files that use “reference_genome” as their prefix.

bwa index -p reference_genome /path/to/reference/reference_genome.fa

#!/bin/bash
# ------
# BWA
# ------

#SBATCH --job-name=ref_index
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-04:00
#SBATCH --mem=5G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------
module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4
module load intel/2017.1
module load bwa/0.7.17

/path/to/code/bwa_index_code

# ------
echo "Job finished with exit code $? at: `date`"
# ------

Step 2: Align to reference

Details about parameters can be found here: http://bio-bwa.sourceforge.net/bwa.shtml
-M: mark shorter split hits as secondary (necessary for using the file in Picard downstream)
-t 16: number of threads. Match with SLURM
-R: complete read group header line
XX: need to change XX to whatever site you are currently working on.
The reference genome needs to be put in without the “.fa” because it will be using all of the indexes that are in the directory too.
bwa mem -M -t 16 -R '@RG\tID:XX\tLB:XX\tSM:XX\tPL:ILLUMINA' /path/to/ref/reference_genome path/to/XX_R1.fq.gz path/to/XX_R2.fq.gz > XX_out.sam

#!/bin/bash
# ------
# sam
# ------

#SBATCH --job-name=XX_align
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-05:00
#SBATCH --mem=10G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------

module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4
module load intel/2017.1

module load bwa/0.7.17

/path/to/code/align_code

# ------
echo "Job finished with exit code $? at: `date`"
# ------
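Because XX has to be swapped in four places for every site, the read group string is easy to get wrong by hand. The sketch below shows one way to generate it; make_rg is a hypothetical helper, and J1 and J2 are used only as example site labels. It prints the bwa mem commands and does not run them. The \t sequences are kept as literal backslash-t so that bwa can expand them itself.

```shell
# Hypothetical helper: build the -R read group string for one site label.
# The doubled backslashes make printf emit a literal "\t" for bwa to expand.
make_rg() {
    printf '@RG\\tID:%s\\tLB:%s\\tSM:%s\\tPL:ILLUMINA\n' "$1" "$1" "$1"
}

# Print (not run) the alignment command for two example sites.
for site in J1 J2; do
    printf '%s\n' "bwa mem -M -t 16 -R '$(make_rg "$site")' /path/to/ref/reference_genome path/to/${site}_R1.fq.gz path/to/${site}_R2.fq.gz > ${site}_out.sam"
done
```

The same function can be reused in each site's code file so that the ID, LB and SM fields always agree.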

Samtools - sort the .sam file into a .bam by chromosome number

There are two options for sorting, but the file needs to be sorted by chromosome for downstream applications.
-@: number of threads
-T: prefix for the temporary files - change to whichever site you are currently working on
-o: write to this outfile
samtools sort -@ 16 -T XX -o XX.bam XX_out.sam

#!/bin/bash
# ------
# Samtools Sort
# ------

#SBATCH --job-name=XX.bam
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-04:00
#SBATCH --mem=5G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------
module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4
module load samtools/1.5

/path/to/code/samsort_code

# ------
echo "Job finished with exit code $? at: `date`"
# ------

Samtools - filter low quality reads and where only one read mapped

Details can be found at: http://www.htslib.org/doc/samtools.html
You need to remove reads where only one mate mapped, duplicates, and reads with low alignment quality (Q20). I did this in three steps: I removed reads with only one mate mapped first, then removed duplicates, and then filtered for Q20.
-@: number of threads
-f 2: only keep reads that are flagged as properly paired
-o: write to this outfile
samtools has good documentation online. Be advised that they updated the package in April 2018, so make sure you are looking at the documentation for the right version.
samtools view -@ 16 -f 2 -o XX_rm1mate.bam /path/to/XX.bam

#!/bin/bash
# ------
# Samtools Sort
# ------

#SBATCH --job-name=XX_rm_Q20
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-02:00
#SBATCH --mem=25G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------
module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4
module load samtools/1.5

/path/to/code/rm1mate_code

# ------
echo "Job finished with exit code $? at: `date`"
# ------
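The -f 2 filter works on the FLAG column of the alignment: a read is kept only when bit 0x2 (properly paired) is set. The sketch below checks that bit for a few common FLAG values using shell arithmetic; is_proper_pair is a hypothetical helper and no samtools call is made.

```shell
# Hypothetical helper: succeed when bit 0x2 ("properly paired") is set
# in a SAM FLAG value.
is_proper_pair() {
    [ $(( $1 & 2 )) -ne 0 ]
}

# 99 and 147 are a typical properly paired read and its mate;
# 73 and 133 are a mapped read whose mate did not map, and that mate.
for flag in 99 147 73 133; do
    if is_proper_pair "$flag"; then
        echo "$flag kept by -f 2"
    else
        echo "$flag dropped by -f 2"
    fi
done
```

This is why -f 2 removes the cases where only one read of a pair mapped: neither read in such a pair carries the properly paired bit.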

Picard - removing duplicates

Picard will remove duplicates using the MarkDuplicates function. You need to add REMOVE_DUPLICATES=TRUE to remove them. GATK suggests that you keep the marked duplicates, but PoPoolation1 and 2 need them removed.
java -jar $EBROOTPICARD/picard.jar MarkDuplicates INPUT=/path/to/XX_rm1mate.bam OUTPUT=XX_rm1mate_nodup.bam METRICS_FILE=XX_rm1mate_Q20_nodup.txt REMOVE_DUPLICATES=TRUE

#!/bin/bash
# ------
# Picard Dup removal
# ------

#SBATCH --job-name=XX_nodup_rm1mate_Q20
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-02:00
#SBATCH --mem=40G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------
module load nixpkgs/16.09
module load picard/2.17.3

/path/to/code/removeduplicates_code

# ------
echo "Job finished with exit code $? at: `date`"
# ------

Flagstat - get stats on your alignment

Flagstat will give you stats on the alignment - like percent mapped.
#!/bin/bash
# ------
# Flagstat
# ------

#SBATCH --job-name=XX_flagstat
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-00:20
#SBATCH --mem=1G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------

module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4

module load samtools/1.5

samtools flagstat path/to/XX_rm1mate_nodup.bam

# ------
echo "Job finished with exit code $? at: `date`"
# ------
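If you save the flagstat output for every site, the percent mapped can be pulled out with awk so it does not have to be copied by hand. The sketch below is an illustration only: percent_mapped is a hypothetical helper, the numbers in the toy file are invented, and it assumes the "mapped (NN.NN% : N/A)" line layout that flagstat printed for us.

```shell
# Hypothetical helper: print the percentage from the "mapped (" line
# of a saved flagstat report.
percent_mapped() {
    awk -F'[(%]' '/ mapped \(/ { print $2; exit }' "$1"
}

# Made-up flagstat excerpt for illustration.
toy="$(mktemp)"
cat > "$toy" <<'EOF'
10000 + 0 in total (QC-passed reads + QC-failed reads)
9950 + 0 mapped (99.50% : N/A)
9900 + 0 properly paired (99.00% : N/A)
EOF

percent_mapped "$toy"
```

Looping this over the saved reports for each site gives a quick table of mapping rates.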

Picard - Validate the .bam file before moving forward

Check that the .bam file is not broken.
java -jar $EBROOTPICARD/picard.jar ValidateSamFile I=/path/to/XX_rm1mate_nodup.bam MODE=SUMMARY

#!/bin/bash
# ------
# Validate Bam Test
# ------

#SBATCH --job-name=Vali
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-01:00
#SBATCH --mem=15G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------
module load nixpkgs/16.09
module load picard/2.17.3

/path/to/code/validate_code

# ------
echo "Job finished with exit code $? at: `date`"
# ------

Samtools - Q20 filter

-q 20: remove any read whose mapping quality is less than 20 (“Minimum mapping quality for an alignment to be used”)
samtools view -@ 16 -q 20 -o XX_rm1mate_Q20_nodup.bam /path/to/XX_rm1mate_nodup.bam

#!/bin/bash
# ------
# Samtools Sort
# ------

#SBATCH --job-name=XX_Q20
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-02:00
#SBATCH --mem=25G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------

module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4

module load samtools/1.5

/path/to/code/Q20_code

# ------
echo "Job finished with exit code $? at: `date`"
# ------

Samtools - mpileup

Information about parameters can be found at: http://www.htslib.org/doc/samtools.html
The mpileup is a file type that contains base-pair information at each chromosomal position. In the code below you can add the argument -f and the path to the reference genome. This will give you the reference genome info in the mpileup file. I didn’t use this because I don’t really care what my reference is. In this step all of the .bam files are combined into one mpileup (in my case, sites 1 through 8). For PoPoolation1, you need to make an mpileup for each site, which is outlined later in this pipeline.
-B: disables BAQ (base alignment quality) computation. Necessary for PoPoolation2.
-o: write to this outfile
This example has three sites (XX, YY and ZZ) being combined into one mpileup.
samtools mpileup -B XX_rm1mate_Q20_nodup.bam YY_rm1mate_Q20_nodup.bam ZZ_rm1mate_Q20_nodup.bam -o ref_allpools.mpileup

#!/bin/bash
# ------
# Samtools mpileup
# ------

#SBATCH --job-name=tut_allpools_mpileup
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=16
#SBATCH --time=0-06:00
#SBATCH --mem=1G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------

module load nixpkgs/16.09
module load gcc/5.4.0
module load intel/2016.4

module load samtools/1.5

/path/to/code/mpileup_code

# ------
echo "Job finished with exit code $? at: `date`"
# ------

PoPoolation2 - convert mpileup to msync

The msync format is the file type that PoPoolation2 needs to do all of its analysis. It is also one of the formats that Poolfstat accepts as its input. You will need to download PoPoolation2 into your home directory (https://sourceforge.net/p/popoolation2/wiki/Main/).
--min-qual 20: filter anything that has a base quality of less than 20
--threads 8: the number of threads you will be using. I have this faint memory (why didn’t I type these notes as I went?) that this step goes a little squirrelly if threaded higher than 8
java -jar /home/youraccount/popoolation2_1201/mpileup2sync.jar --input path/to/ref_allpools.mpileup --output ref_allpools.sync --fastq-type sanger --min-qual 20 --threads 8

#!/bin/bash
# ------
# mpileup2sync
# ------

#SBATCH --job-name=mpileup2sync
#SBATCH --account=def-srogers
#SBATCH --cpus-per-task=8
#SBATCH --time=0-2:00
#SBATCH --mem=15G

# ------
echo "Current working directory: `pwd`"
echo "Starting run at: `date`"
# ------

module load nixpkgs/16.09
module load java/1.8.0_121

/path/to/code/mpileup2sync_code

# ------
echo "Job finished with exit code $? at: `date`"
# ------

In the slurm.out it will give you exit code 1. Check that the last chromosome in the sync file matches the last chromosome in the mpileup file; if it does, you’re ok.
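The check on the last chromosome can be scripted by comparing the first column of the last line of each file. The sketch below assumes both files are tab-separated with the chromosome (or contig) name in column 1; last_contig is a hypothetical helper and the two toy files are made up.

```shell
# Hypothetical helper: name in column 1 of the last line of a file.
last_contig() {
    tail -n 1 "$1" | cut -f 1
}

# Made-up two-line mpileup and sync files for illustration.
tmpdir="$(mktemp -d)"
printf 'contig_1\t10\tN\t5\tAAAAA\tIIIII\ncontig_9\t42\tN\t4\tCCCC\tIIII\n' > "$tmpdir/toy.mpileup"
printf 'contig_1\t10\tN\t5:0:0:0:0:0\ncontig_9\t42\tN\t0:0:4:0:0:0\n' > "$tmpdir/toy.sync"

# If the names match, the conversion reached the end of the mpileup.
if [ "$(last_contig "$tmpdir/toy.mpileup")" = "$(last_contig "$tmpdir/toy.sync")" ]; then
    sync_status="complete"
else
    sync_status="truncated"
fi
echo "$sync_status"
```

On the real files, point last_contig at ref_allpools.mpileup and ref_allpools.sync.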

Poolfstat - pairwise Fst

I originally used PoPoolation2 to calculate pairwise FST but, for reasons given in my thesis, I think it is biased. So, I went with Poolfstat. It is an R package and it’s so fast. It took about 4 hours to bring the data into R and then under ten minutes to run the calculation.
library(poolfstat)

#From their paper, which was re-analyzing Dennenmoser et al. 2017
#- where he had four populations of n=44 of prickly sculpin

pool.data = popsync2pooldata(sync.file="./file", poolsizes=c(44,44,44,44), poolnames=c("FE","CR","PI","HZ"), min.rc=1, min.cov.per.pool=10, max.cov.per.pool=300, min.maf=0.01, noindel=TRUE)

#I used: min.rc=1 min.cov.per.pool=20 max.cov.per.pool=200 min.maf=0.05 noindel=TRUE

Once the data is in, you are set to run pairwise Fst.
PW_fst <- computePairwiseFSTmatrix(pool.data, method = "Anova", min.cov.per.pool=20, max.cov.per.pool=200, min.maf=0.05, output.snp.values = TRUE)

I saved the $PairwiseFSTmatrix and $NbOfSNPs components of the outfile as their own files because I wanted to keep them. The $PairwiseFSTmatrix file is necessary to do the below.

Visualizing pairwise Fst

To visualize the pairwise FST distance between each population, I used a Principal Coordinate Analysis. There is a really good blog post describing the difference between PCoA and PCA here: http://occamstypewriter.org/boboh/2012/01/17/pca_and_pcoa_explained/.
#read in the pairwise FST generated by Poolfstat
pcoa <- read.csv("PW_fst_matrix.csv", header=FALSE)

#make it a matrix
pcoa.matrix <- data.matrix(pcoa)

#calculate the euclidean distance between the pairwise FST
euc.matrix <- dist(pcoa.matrix,'euclidean')

library(labdsv) #package used for PCoA

pco <- pco(euc.matrix, k =2) #calculate the pco for euc distance

#sum the eigenvalues (8 populations)
sumeigen=sum(pco$eig[1:8])

eig1=pco$eig[1]
eig2=pco$eig[2]
perc_eig1=eig1/sumeigen #percent explained by eigenvalue 1
perc_eig2=eig2/sumeigen #percent explained by eigenvalue 2

plot(pco) #will give you a not pretty figure

#the data is stored as character so this and below deals with that.
pco.ggplot<-data.frame(cbind(c("Pop1", "Pop2",..."Pop8"), as.numeric(pco$points[,1]), as.numeric(pco$points[,2])))

pco.ggplot$X2<-as.numeric(as.character(pco.ggplot$X2))
pco.ggplot$X3<-as.numeric(as.character(pco.ggplot$X3))

colnames(pco.ggplot) <- c("site", "PCoA1", "PCoA2")
library(ggrepel)
library(RColorBrewer)
library(ggplot2)
tiff('PCoA', units="in", width=10, height=5, res=300)
ggplot(pco.ggplot, aes(x=PCoA1, y=PCoA2)) +
  geom_point(colour="chartreuse4") +
  geom_point(data=pco.ggplot[c(3, 4, 7), ], aes(x=PCoA1, y=PCoA2), colour="purple4") +
  geom_label_repel(aes(label = site), size = 3, hjust = 0, nudge_x = 0.003, nudge_y = - 0.00, colour="chartreuse4", show.legend = FALSE) +
  theme(legend.position="none") +
  geom_label_repel(data=pco.ggplot[c(3, 4, 7), ], aes(label = site, x=PCoA1, y=PCoA2), colour="purple4", size = 3, hjust = 0, nudge_x = 0.003, nudge_y = - 0.00, show.legend = FALSE) +
  theme_bw() +
  theme(axis.text=element_text(size=13), axis.title=element_text(size=15,face="bold"), panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line = element_line(colour = "black")) +
  labs(x = "PCoA 1 (XX.XX%)", y = "PCoA 2 (XX.XX%)") +
  scale_x_continuous(breaks=c(-0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1.0), labels=waiver()) +
  coord_fixed()
dev.off()

#I am going to be honest here with this ggplot code. It works.
#It ain't pretty and I just kept adding to it until it did a thing and now I am afraid to touch it.

PoPoolation1 - Nucleotide Diversity

The first step here is to make mpileups for each of your sites, instead of one with all of them. For example, in the above, you would just specify site XX.
--fastq-type sanger: even though our data is Illumina, we had to set it as sanger. This is because the quality encoding is Phred 33, not 64
--pool-size: this should equal the number of individuals x 2 (for diploids) in the site. I have also read that it is ok to put the number of individuals, and that it makes so little difference they are considering taking that parameter out.
--min-count: minor allele count for that site
--min-coverage: the minimum coverage of a SNP for it to be used in analysis
--max-coverage: the maximum coverage of a SNP for it to be used in analysis
--window-size: size of window
perl /home/youraccount/popoolation_1.2.2/Variance-sliding.pl --measure pi --input /path/to/XX_rm1mate_Q20_nodup.mpileup --output XX_pi.file --snp-output XX_pi.snps --fastq-type sanger --pool-size 40 --min-count 4 --min-coverage 20 --max-coverage 200 --window-size 250 --step-size 250

This will give you a .file and a .snps file as the outfiles. The .file can be loaded into R and used to calculate mean nucleotide diversity and standard deviation.
XX_w250_pi <- read.delim("./XX_w250_pi.file", header = FALSE, sep = "\t", dec = ".")
colnames(XX_w250_pi) <- c("chr", "position", "Num.of.SNPs", "frac.of.cov", "pi")

#pi is read in as character because of the "na" windows, so convert it
XX_w250_pi$pi<-as.numeric(as.character(XX_w250_pi$pi))
mean(XX_w250_pi$pi, na.rm=TRUE)
pi_matrix <- matrix(c("Pop1", "Pop2",..."Pop8", pi_1, pi_2, ...pi_8), nrow = 8, ncol = 2)

colnames(pi_matrix) <- c("Site", "NucleotideDiversity")

pi_DF <- as.data.frame(pi_matrix)

pi_DF$NucleotideDiversity <- as.numeric(as.character(pi_DF$NucleotideDiversity))

library(ggplot2)

tiff('pi_figure.tiff', units="in", width=5, height=5, res=300)
ggplot(data=pi_DF, aes(x=Site, y=NucleotideDiversity)) +
  geom_point(colour="purple4", size = 3) +
  labs(y= "Nucleotide Diversity") +
  geom_point(data=pi_DF[c(1, 2, 3, 4, 5), ], aes(x=Site, y=NucleotideDiversity), colour="chartreuse4", size = 3) +
  theme_bw() +
  ylim(0, 0.006) +
  theme(axis.text=element_text(size=13), axis.title=element_text(size=15,face="bold"), panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"))
dev.off()

#Same as above regarding the ggplot.
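For a quick look at mean nucleotide diversity without opening R, the .file output can also be summarised with awk. The sketch below assumes the five tab-separated columns named in the R code above (pi in column 5) and that windows without a usable estimate are written as "na"; mean_pi is a hypothetical helper and the toy rows are invented.

```shell
# Hypothetical helper: mean of column 5 (pi), skipping "na" windows.
mean_pi() {
    awk -F'\t' '$5 != "na" { sum += $5; n++ } END { if (n > 0) printf "%.4f\n", sum / n }' "$1"
}

# Made-up windows: chr, position, number of SNPs, fraction covered, pi.
toy="$(mktemp)"
printf 'contig_1\t125\t3\t0.900\t0.002\ncontig_1\t375\t0\t0.100\tna\ncontig_2\t125\t5\t1.000\t0.004\n' > "$toy"

mean_pi "$toy"
```

This should agree with mean(..., na.rm=TRUE) in R on the same file, and it is handy for checking each site before building the combined figure.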

Appendix B: DNA and sequencing quality

Figure B.1 Pooled DNA (5 µL) for each population pre-dilution for sequencing preparation run through 1% agarose gel with 3 µL of NEB 1 kb DNA ladder. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.


Figure B.2 Basic statistics and per base sequence quality component of FASTQC report for J3 (Lower Cave & Basin Spring) raw, reverse sequences (pre-Trimmomatic).


Figure B.3 Basic statistics and per base sequence quality component of FASTQC report for J3 (Lower Cave & Basin Spring) filtered and trimmed reverse sequences (post- Trimmomatic).


Appendix C: PoPoolation2 pairwise FST estimates

Table C.1 Pairwise FST between all populations calculated by PoPoolation2. Pairwise FST was calculated for 250 bp side-by-side windows, with a minor allele count of 8, a minimum coverage of 15 and a maximum coverage of 200, where the entire window was required to meet coverage specifications. P. johnsoni - J1, J2, J3, J4 and J5. P. gyrina - G1, G2 and G3.

Population  J1     J2     J3     J4     J5     G1     G2     G3
J1          0      0.070  0.065  0.075  0.076  0.211  0.335  0.318
J2                 0      0.044  0.033  0.074  0.237  0.355  0.340
J3                        0      0.044  0.061  0.213  0.336  0.320
J4                               0      0.074  0.233  0.351  0.333
J5                                      0      0.205  0.333  0.318
G1                                             0      0.237  0.217
G2                                                    0      0.167
G3                                                           0
