Utah State University DigitalCommons@USU

All Graduate Theses and Dissertations Graduate Studies

12-2020

Applied Species Delimitation in Microbial Taxa and

Austin C. Koontz State University

Follow this and additional works at: https://digitalcommons.usu.edu/etd

Part of the Ecology and Evolutionary Biology Commons

Recommended Citation Koontz, Austin C., "Applied Species Delimitation in Microbial Taxa and Plants" (2020). All Graduate Theses and Dissertations. 7960. https://digitalcommons.usu.edu/etd/7960

This Thesis is brought to you for free and open access by the Graduate Studies at DigitalCommons@USU. It has been accepted for inclusion in All Graduate Theses and Dissertations by an authorized administrator of DigitalCommons@USU. For more information, please contact [email protected]. APPLIED SPECIES DELIMITATION IN MICROBIAL TAXA AND PLANTS

by

Austin C. Koontz

A thesis submitted in partial fulfillment of the requirements for the degree

of

MASTER OF SCIENCE

in

Ecology

Approved:

William D. Pearse, Ph.D. Paul Wolf, Ph.D. Major Professor Committee Member

Bonnie Waring, Ph.D. Karen Mock, Ph.D. Committee Member Committee Member

D. Richard Cutler, Ph.D. Interim Vice Provost of Graduate Studies

UTAH STATE UNIVERSITY Logan, Utah

2020 ii

Copyright © Austin Koontz 2020

All Rights Reserved iii

ABSTRACT

Applied Species Delimitation

in Microbial Taxa and Plants

by

Austin Koontz, Master of Science

Utah State University, 2020

Major Professor: Dr. William D. Pearse Department: Biology

Species are a fundamental concept in most branches of biology, but despite this centrality, numerous species concepts exist, and this lack of a consensus and the discrep- ancies that arise from it is termed “the species problem”. The species problem leads to issues that are theoretical as well as practical: the inability to define species can have major implications for how taxa are described, discussed, and conserved. As developing sequencing technologies continue to allow researchers to broadly survey populations in nature, more examples of genetically distinct groups within species appear, emphasizing the pertinence of the species problem. The research presented here explores two largely different scenarios that nevertheless share a distinguishing trait common throughout much of biology: ambiguous species boundaries.

The first of these is detailed in Chapter 2, which discusses the usage of operational taxonomic units (OTUs) in microbial ecology, and how phylogenetic techniques can in- form the range in which OTU boundaries alter levels of community diversity. Using simulations and empirical data, I demonstrate the ability of a classic family of branch length transformations (Pagel’s transformations) to indicate meaningful species bound- aries in microbial communities, as well as communities of macrotaxa. In Chapter 3, I discuss the results of a genetic survey of varieties of a species complex of endemic iv to the Great Basin. By analyzing the population genomics of the cusickiana species complex, I provide support for updated classifications of unique populations characterized by similar morphologies, discrete ranges, and nascent speciation due to pronounced isolation by distance. Through these two examples, my aim is to demon- strate techniques and practices that can assist future research in navigating the inherent difficulties of species delimitation.

(68 pages) v

PUBLIC ABSTRACT

Applied Species Delimitation

in Microbial Taxa and Plants

Austin Koontz

Species are a fundamental concept in biology, and many subdisciplines in biology utilize species in aspects of theory and in the communication of results. Given the cen- trality of species in biological science, it can seem surprising that there is no universal definition amongst biologists of what, strictly speaking, a species is. In fact, there are, by some estimates, over 20 different “species concepts”, and this lack of a consensus is termed “the species problem”. This problem has theoretical underpinnings, but has become more relevant as advances in sequencing technologies over the past two decades have allowed researchers to probe the genetics of populations, and in doing so, uncover instances of genetically distinct populations within a species. This thesis explores is- sues in species delimitation in two broadly different scenarios. The first, in Chapter

2, involves grouping individuals in microbial communities, and using phylogenetics to inform the effects of different species boundary thresholds. The second, in Chapter 3, explores the genetic differences between varieties of a species complex of plant endemic to the Great Basin region of the western United States, including a variety endemic to

Logan Canyon (Primula cusickiana variety maguirei, or Maguire’s primrose). While the contexts of these chapters are largely different, they nevertheless share a distinguishing trait, common throughout much of biology: ambiguous species boundaries. vi

Acknowledgments

First, I’m compelled to acknowledge that any success of my thesis research is the result primarily of my outstanding committee. I’m incredibly fortunate to be guided by them. Bonnie taught the Microbial Metagenomics course that hugely expanded my understanding of OTUs, and helped frame the technique described in my first chapter in a way that allowed me to directly tie branch length transformations to the effects of OTU thresholds. Karen’s background is what originally motivated me to apply to

USU, and her questions have helped me understand the results of my Primula research in the context of conservation genetics. She’s also been invaluable in my career pursuits, and for that I cannot thank her enough. Paul, since before I began at USU, seemed to believe in my capacity as a student when I did not, and he is an excellent mentor.

He understands exactly how much guidance to provide and how much to hold back so that I can develop my own understanding. I’m also grateful to Carol Rowe, who has throughout my Primula analysis given me direction when I was left staring blankly out software output. Finally, I contribute most of my success as a student to Will, who has guided me through the research process and to a point of understanding that I couldn’t have reached without his instruction. He has pushed me to take on more than

I thought I could handle, while also being patient with me as my comprehension on many different topics would increase, then backslide, then zigzag. He’s a wizard, and

I’ve learned more through working in his lab than I have in the entire 4 years since completing my undergraduate degree.

Secondly, the funding I’ve received has allowed me to focus almost solely on my re- search during my graduate degree: without this, I wouldn’t have been able to accomplish what I have in the time that I have. I’m thankful to the National Science Foundation, the USU Ecology Program, Dr. Ivan J. Palmblad and the USU Department of Biology,

Dr. Lawrence H. Piette and the USU College of Science, and Native Plant Soci- ety for their contributions. I also want to thank all of the Biology department staff for helping me with the paperwork and degree requirements during my time here–thanks to vii

Metta, Mike, Ashlea, and Nancy Kay. Thanks especially to Shaun Heller, who has been an extremely valuable GPC and has assisted me in more ways than I’m likely aware of.

Thirdly, I want to express my appreciation for everyone who helped me with my

Primula fieldwork over the summer of 2019, which was largely successful due to many different sources of expertise (and dumb luck). Thanks to Dr. Barbara Ertter and espe- cially Dr. Don Mansfield at the College of Idaho, who helped me locate several important populations of variety cusickiana and who tipped me off about the uniqueness contained in this variety. Thanks to Dr. Steve Novick, who let me crash at his place and listen to Stephane Grapelli records during my fieldwork in Boise. Thanks to Todd Stefanic for helping me coordinate my collections in Craters of the Moon National Monument, and to Gretchen Baker, who helped me with my collections from Great Basin National

Park. Gretchen also connected me to Jean Howerton, who tipped me off about the populations of variety nevadensis in the Grant Range, which ended up being a unique component of the genetic survey. Thanks to Jean for this knowledge, letting me stay at her ranch over the summer, and for writing up a very interesting history of variety nevadensis. Thanks to Trish Winn and Jennifer Lewinsohn in the USFS for helping with my permits for collection of variety maguirei in Logan Canyon. Thanks also to all of the herbarium directors who allowed me to explore and sample specimens as part of this research: Michael Piep, Elizabeth Makings at the Arizona State University, and especially Jerry Tiehm at the University of Nevada Reno. Thanks to Buddy Smith, who helped in my attempts to collect P. capillaris in the Ruby Mountains, and to Dr. Leila

Shultz, who helped me to identify the Owyhee cusickiana population. I want to also express my gratitude and appreciation to the incomparable Noel and Pat Holmgren, for connecting me to many of the names listed above and for being inspiring human beings. Finally, I’d like to acknowledge that all of my research collections were made on the ancestral lands of the Western Shoshone, Eastern Shoshone, Shoshone-Bannock,

Southern Paiute, Goshute, and Nez Perce Native American tribes.

Thanks to all of my labmates in both the Pearse and Wolf labs, who have not only helped me with my research but have also been around to support me during the viii most difficult parts of my degree. Michael, Elizabeth, Sylvia, Amanda, Katie, Marley, and Tanner: you are some of my best friends, and each of you are amazing, one-of-a- kind individuals. Thanks especially to Sylvia, who helped me not only with Primula collection but with so much of the data analysis—I owe a lot of my success to the work you’ve already done. Thanks also to my roommate Tim, who was always willing to vent with me on Friday nights after a long week of work. And, importantly, thank you to my partner Kate, who has cooked with me, climbed with me, hiked with me, proofread me, kept me in check, and also kept me afloat.

Finally, thanks to everyone in my family, who has rooted for me from afar since I began my degree: my mom, my stepdad, my dad, Jillian, Tim, Chelsea, and my brother

Elliot, who also helped me with Primula collection.

Austin Koontz ix

Contents

Page

Abstract...... iii

Public Abstract...... v

Acknowledgments...... vi

List of Figures...... xi

List of Tables...... xiii

1 Introduction......

2 A Phylogenetically-based Solution to Ambiguous Species Delimitation in Studies of Assemblage Structure...... 3 2.1 Introduction...... 3 2.2 Methods...... 6 2.2.1 Overview...... 6 2.2.2 Simulations...... 9 2.2.2.1 Function and Parameter Space...... 9 2.2.2.2 Response Metrics...... 9 2.2.2.3 Linear Models...... 11 2.2.3 Empirical Analyses...... 11 2.3 Results...... 13 2.3.1 Simulations...... 13 2.3.2 Empirical Analyses...... 15 2.4 Discussion...... 17 2.4.1 Simulations...... 17 2.4.2 Empirical Analyses...... 18 2.4.3 Methodological Extensions...... 19 2.4.4 Conclusion...... 19

3 Genomic Comparisons among Varieties of a Primrose Species Complex in the Great Basin...... 21 3.1 Introduction...... 21 3.2 Methods...... 25 3.2.1 Sampling...... 25 3.2.2 DNA Extraction...... 25 3.2.3 Library Prep and Sequencing...... 26 3.2.4 Data Processing...... 26 x

3.2.4.1 Complex-Wide Genomic Survey...... 26 3.2.4.2 Variety Specific Clustering...... 27 3.2.5 Population Analyses...... 28 3.2.5.1 STRUCTURE...... 28 3.2.5.2 Discriminant Analysis of Principal Components..... 28 3.3 Results...... 29 3.3.1 P. cusickiana species complex relations...... 29 3.3.2 P. cusickiana var. maguirei ...... 35 3.3.3 P. cusickiana var. cusickiana ...... 38 3.4 Discussion...... 40 3.4.1 Primula cusickiana var. maguirei ...... 40 3.4.2 Genetically distinct populations within var. cusickiana ...... 40 3.4.3 Admixture in var. nevadensis ...... 41 3.4.4 Primula capillaris ...... 41 3.4.5 Conclusion...... 42

4 Summary and Conclusions...... 43

Bibliography...... 45

Appendix...... 51 xi

List of Figures

2.1 Graphical demonstration of Pagel’s δ transformation, a power transfor- mation of the summed branch lengths of the tree. As δ increases, the proportion of unshared evolutionary history also increases. The origi- nal phylogeny is defined as having a δ of 1.0, in which summed branch lengths remain unaltered. Nodes are colored to illustrate the changes in node depth due to transformation...... 7 2.2 An example of a simulated phylogeny with branches appended to it, mod- eling varying levels of differences between individuals. Black represents interspecific differences, red intraspecific differences, and blue random sequencing error. The relative depth that branches are added to the phylogeny reflects a central assumption of our technique, namely that in- terspecific differences must be greater than intraspecific differences, which in turn must be greater than differences arising from random sequencing error...... 8 2.3 Diversity asymptotically increases in response to δ transforma- tions. Mean MPD values across simulation instances for each δ value are shown. MPD values are centered by subtracting from each mean value the same mean value in the original, untransformed phylogeny, which is why negative MPD values are possible. Low δ values (less than 1) decrease MPD across communities, while at δ > 1, MPD continues to increase, but at a lower rate of change. The effects of interspecific differences on MPD approach a maximum at higher values of δ...... 13 2.4 δ predicts changes to MPD in a range of values, while error rates have no regular effect. Graphs of median response metrics (left: MPD correlations; right: shifts in site rankings) for model terms (top: δ; middle: intraspecific diversification; bottom: sequencing error diversification). Colors match Figure 2.2. Increasing error rates have no regular effect on MPD; low δ values reliably cause MPD values of transformed communities to shift from original MPD values...... 14 2.5 δ transformations increase MPD, but trends depend on ultra- metricity. MPD values for the fungal community dataset from Costa Rica are shown on top; values for the macroinvertebrate dataset from the Rio Laja are shown on bottom. Each line represents a site within each community. Note the differences in MPD magnitudes between the upper and lower graphs. Delta transformations asymptotically increase MPD for Rio Laja communities, but fungal communities begin to decrease in diversity around values of δ = 1.5, a result of the non-ultrametric fungal phylogeny...... 16 xii

3.1 The four members (varieties) of the Primula cusickiana species complex: (A) maguirei, in Right Hand Fork of Logan Canyon, Utah; (B) cusickiana, near Cougar Point in Jarbidge, Nevada; (C) domensis, at Notch Peak in the House Range, Utah; (D) nevadensis, on Mount Washington in Great Basin National Park, Nevada...... 23 3.2 Distribution of Jaccard similarity values between samples analyzed. Jac- card similarity values for replicates are shown as purple lines. Similarity coefficient values were calculated based off of the SNP matrix of samples generated by our initial ipyrad run...... 27 3.3 Complex-wide analysis STRUCTURE plots for values of K = 2 through K = 6...... 30 3.4 Complex-wide analysis STRUCTURE plots for values of K = 7 through K = 11...... 31 3.5 Complex-wide analysis STRUCTURE plots for values of K = 12 through K = 16...... 32 3.6 Complex-wide analysis STRUCTURE plot at K = 7, with labels repre- senting the populations from which samples were sourced...... 33 3.7 Discriminant Analysis of Principal Components (DAPC) illustrating or- dination of 7 population clusters. Coloration matches Figures 3.6 and 3.8. Populations of domensis, nevadensis, and cusickiana populations from Nevada and Oregon are shown to group together using this method, while maguirei remains distinct...... 34 3.8 Map of sample locations, with pie charts used to represent sample mem- bership to STRUCTURE clusters at K = 7. Coloration matches Figures 3.6 and 3.7...... 35 3.9 STRUCTURE plot for only maguirei samples at K = 3...... 37 3.10 STRUCTURE plot for only cusickiana samples sourced from the Snake River Plain in Idaho, at K = 3...... 39 xiii

List of Tables

2.1 Parameter descriptions used for simulations. Parameters above the double line indicate arguments for the sim.meta.phy.comm function call; param- eters below the double line indicate values used for simulating intraspecific differences and sequencing error...... 10 2.2 Linear model output using a response metric of MPD correlations. Mul- tiple R2 value of model is 0.3708. δ terms are logged transformed in order to account for δ being a power transformation of branch lengths...... 15 2.3 Linear model output using a response metric of shifts in site rankings. Multiple R2 value of model is 0.186. Negative effects of model terms are caused by the relationship to the response metric: as δ increases from 0.1, MPD increases, leading to average shifts of site rankings to decrease from maximum values at δ = 1.0 (see Figure 2.4)...... 15

4.1 Output from cloc software (Count Lines Of Code), showing number of files and code lines of different languages making up the SymbiotaR2 package...... 54 Chapter 1

Introduction

Despite obvious and well-acknowledged limitations of the species concept (Hey et al., 2003), its prevalence and perseverance in most biological subdisciplines demon- strates its utility as a model. For instance, in systematics, species are largely understood as the “fundamental” taxonomic unit; in evolutionary biology, the seminal text of the field is entitled “On the Origin of Species”. Our understanding of species is important not just theoretically, but also practically. Biodiversity metrics used in conservation often treat species as an indivisible unit, and changing species classifications can sub- sequently effect conservation practice. For instance, Frankham et al.(2012) describe how different species concepts, which emphasize different properties of species, lead to different conservation prioritizations. This lack of a consensus, and these type of dis- crepancies that arise from it, is termed “the species problem” (de Queiroz, 2005). Most species concepts agree with the characterization of a species as an interbreeding popu- lation of individuals that share a unique evolutionary history. The discrepancies arise in asking how much evolutionary divergence needs to occur before that population can be considered a distinct species.

Debate surrounding the species problem has gone on for over a century, and ar- guably since the very beginning of the concept itself, but advances in the field of genomics over the past several decades have highlighted the persistence of the problem, as more researchers uncover instances of incoherent species boundaries in nature. This is par- tially demonstrated by the increase in recent years of research describing cryptic species: multiple species that are morphologically similar but are classified as one species (Bick- ford et al., 2007). While the discovery of cryptic species is often justified by findings of large genetic distances within a species, significant genetic divergence of populations, on its own, is not considered sufficient for delimiting new species: most species are not genetically uniform, and speciation is known to be a protracted process (Coates et al., 2

2018). Diverged populations also may not persist for a long enough time to become distinct species (Huang, 2020). And even in such cases, there are examples of highly di- verged lineages being able to produce fertile offspring, thereby violating the requirement of reproductive isolation which characterizes the biological species concept (Huang and Knowles, 2016). Given these and other complexities, many have emphasized the need to be conservative with regards to declaring new species, and to combine evidence from multiple methods, not just genomic surveys (Carstens et al., 2013).

This thesis describes the results of two different research projects investigating ambiguous species boundaries. In Chapter 2, I outline how a family of classic phyloge- netic transformations–Pagel’s transformations (Pagel, 1999)–can be used to inform when species boundaries do (and do not) matter for analyses of the communities from which those species derive. The context motivating this method is the usage of Operational Taxonomic Units (OTUs) in microbial community ecology. OTUs are a widely used tool today, but there is disagreement in the field regarding the sequence similarity thresh- olds used to determine OTU boundaries (?). I describe this phylogenetic technique, and demonstrate its application to both simulated data and empirical examples of both microbial communities and macrotaxa.

In Chapter 3, I explore a species complex of plants native to the Great Basin region of the United States, including a variety endemic to Logan Canyon, Utah: Maguire’s primrose (Primula cusickiana var. maguirei). This species complex has undergone taxonomic revision since the initial discovery of P. cusickiana var. maguirei and the other complex members, with each being originally classified as separate species before being subsumed to varieties of P. cusickiana (Holmgren and Kelso, 2001). Additionally, past genetic work by Kelso et al.(2009) was inconclusive in resolving the relations within this species complex, making it an interesting opportunity for testing the validity of existing species boundaries. Using a restriction-site associated sequencing (RADseq) approach, I describe the genomic relations between populations of all of the members of this species complex, and contextualize past research of the populations genetics of P. cusickiana var. maguirei (Bjerregaard and Wolf, 2008, Wolf and Sinclair, 1997). 3

Chapter 2

A Phylogenetically-based Solution to Ambiguous Species Delimitation in

Studies of Assemblage Structure

2.1 Introduction

Of the subdisciplines in biology that operate with distinct species concepts—including ecology, population genetics, phylogenetics, and others—few are more hampered by the species problem than microbial community ecology. The complications of species delimi- tation in microbial communities are twofold. First, there are technical barriers: microbes cannot be classified phenotypically, and because all culturable microbes are estimated to represent just less than 1% of all microbial species, traditional taxonomic techniques are inadequate (Hugenholtz, 2002). Secondly, there are more fundamental barriers, stemming from the genetically “plastic” nature of microbes. While mutation rates are lower in prokaryotes than in eukaryotes (Lynch, 2010), the prevalence of horizontal gene transfer in prokaryotes confounds traditional classification techniques assuming a descent through vertical transmission model. Despite these obstacles, microbial community ecol- ogists require a working species definition to calculate diversity metrics and objectively compare results. While methods used to address microbial species delimitation have tracked the advent of genetic sequencing technology, they are largely based around the fundamental concept of the operational taxonomic unit, or OTU (Sneath and Sokal, 1973). By classifying organisms in a quantitative manner based on shared traits, the OTU is meant to be a flexible phylogenetic metric capturing “the lowest ranking taxa in a given study”.

In microbial community ecology today, OTUs consist of grouped sequences from genomic marker regions, clustered based on a threshold DNA similarity value. Typically, the 16S rRNA region is used for classification of bacterial taxa, due to its universal and highly conserved nature (Stackebrandt and Goebel, 1994), while the internal transcribed 4 spacer (ITS) marker region is used for fungi (Schoch et al., 2012). A historic threshold of 97% similarity between sequences of the same OTU was based on a corresponding 70% likelihood of DNA strand reassociation between different bacterial strains, which had been established as definitive for the phylogenetic species concept in bacteria (Wayne et al., 1987). 97% became the standard threshold for all microbial OTUs, fungal or bac- terial, but has recently been criticized for being too low, with researchers recommending increased values of 98.7% (Stackebrandt, 2006) to 99% and, ultimately, 100% (Edgar, 2018). 100% similar sequences (which go by many names, but will be referred to here as amplicon sequence variants, or ASVs) have been credited as being more reliable than OTUs (Callahan et al., 2017). OTU clustering implies a loss of genetic information that ASVs retain, and whereas different OTU clustering algorithms can group datasets in dif- ferent ways, ASVs, representing each unique sequence, seem to allow for researchers to directly compare their results across studies seamlessly. However, ASVs are not immune to the ambiguities of OTUs: inherent sequencing errors can lead to spurious ASVs, and inflated estimates of microbial diversity (Kunin et al., 2010). And while several denoising tools have been developed to quite accurately detect false sequences prior to clustering (Nearing et al., 2018), these tools are similarly based on thresholds of sequence similarity distinguishing correct sequences from error, and therefore subject ASVs to the original OTU problem.

Focusing on the technicalities of sequence similarity thresholds, rather than the underlying nature of microbial communities, has led to a conceptual gap separating OTUs and the species they are meant to convey. In the absence of any real biological underpinning, OTUs remain simple groupings based on contended statistical thresholds. But this ambiguity is not inevitable: OTUs can be understood in a way that aligns with their original intent by re-interpreting them in their native field of phylogenetics, which offers the means of modeling the evolutionary history of individuals in a community. A growing number of pipelines utilize phylogeny to improve the accuracy of OTU tax- onomic assignment, after the clustering process (Wu et al., 2008). Others go further by incorporating tree topology (e.g. HmmUFOtu, Zheng et al.(2018); Ondˇrej(2018)) or phylogenetic distance metrics (e.g. PhylOTU, Sharpton et al.(2011); TreeCluster, Balaban et al.(2019)) to cluster sequences into OTUs, rather than solely referring to se- quence similarity. These developments are promising, given that phylogenetic branches are model-based representations of sequence distances, and may capture evolutionary 5 divergence more reliably than sequence similarity alone (Balaban et al., 2019, Nguyen et al., 2016). However, existing methods still largely use phylogenetic metrics as inputs for increasing the speed and/or accuracy of otherwise standard OTU clustering pipelines. None utilize phylogeny–either the evolutionary processes it represents, or the extensive quantitative techniques and theory that are associated with it–to frame the biology of species delimitation.

Here, we outline the use of Pagel’s transformations (Pagel, 1999), a classic fam- ily of phylogenetic tree transformations, to bound within a range of values uncertainty and error in species assignment. Because Pagel’s transformations can be used to model changes to the standard evolutionary process, they allow us to represent varying degrees of interspecific difference, and thus the fundamental biology underlying speciation. We outline how to use such transformations to model what effect those interspecific differ- ences have on our understanding of the community of interest. In doing so, we leverage certain fundamental assumptions regarding species delimitation: namely, that inter- specific differences must be greater than intraspecific differences, and that intraspecific differences must be greater than differences generated by random sequencing error. By adding branches at different depths, we are able to model each of these differences sepa- rating species, and account for errors that regularly confound OTU boundaries. Uniting phylogenetic techniques with a broader understanding of a biological community, we illustrate how a range of species boundaries impacts the diversity of the communities those species are drawn from. We demonstrate our method using simulations as well as empirical datasets drawn from microbial communities and communities of macrotaxa. 6

2.2 Methods

We begin by providing a broad overview of the technique used to examine species bound- aries. Then, we outline the simulations used to test this technique. Finally, we discuss the usage of this technique on two different empirical examples: microbial communities from Costa Rican soils exposed to different moisture treatments (Waring and Hawkes, 2015), and a community of macroinvertebrates sampled from the Rio Laja of Mex- ico (Pearse et al., 2015). All R (R Core Team, 2020) code is included on GitHub at https://github.com/akoontz11/OTU_Phylogeny; R packages are listed in italics, and functions and parameters are written in coded text.

2.2.1 Overview Rather than a clustering technique using percentage thresholds, our method is characterized by understanding changes to species’ boundaries through their shared evolutionary history–their phylogeny. We alter the degree of interspecific differences between taxa in a phylogeny to determine when variation in OTU assignment does (and does not) matter to the community those OTUs are drawn from.

To do this, we use Pagel’s transformations, a family of branch length transforma- tions used for detecting phylogenetic signal, or the extent to which differences between related species is the result of phylogeny (Pagel, 1999). Specifically, we utilize Pagel’s δ transformation, which changes the tempo of trait evolution at different points in evo- lutionary history. A δ of 1 represents the original phylogenetic tree; δ > 1 transforms the tree to account for faster trait evolution in the recent (among close relatives), and δ < 1 faster evolution deeper in the evolutionary past (see Figure 2.1). This branch transformation corresponds to varying degrees of interspecific difference, and therefore, we use δ to explore the effect of varying species boundaries. 7

δ = 0.5

δ = 1.0

δ = 2.5

Figure 2.1: Graphical demonstration of Pagel’s δ transformation, a power transfor- mation of the summed branch lengths of the tree. As δ increases, the proportion of unshared evolutionary history also increases. The original phylogeny is defined as hav- ing a δ of 1.0, in which summed branch lengths remain unaltered. Nodes are colored to illustrate the changes in node depth due to transformation.

Additionally, we model distinctions on top of interspecific differences to capture how transformations affect not just species boundaries, but the sources of error that often confound those boundaries. We do this by appending two sets of branches onto simulated phylogenies: one representing intraspecific differences, and one representing differences due to random sequencing error (see Figure 2.2). Because OTUs can sepa- rate individuals in a community based on these distinctions, in addition to legitimate interspecific differences, we explicitly model them to account for the effects of error and to make sure our technique is robust. By including these distinctions, we make assump- tions regarding their respective magnitudes: that interspecific differences are greater than intraspecific differences, and that intraspecific differences are greater than random error due to sequencing. We represent these assumptions in our method by altering the depth at which the branches that represent these distinctions are added onto an existing phylogeny. 8

Figure 2.2: An example of a simulated phylogeny with branches appended to it, modeling varying levels of differences between individuals. Black represents interspe- cific differences, red intraspecific differences, and blue random sequencing error. The relative depth that branches are added to the phylogeny reflects a central assumption of our technique, namely that interspecific differences must be greater than intraspe- cific differences, which in turn must be greater than differences arising from random sequencing error.

After appending branches representing error, we perform branch length transfor- mations and calculate the abundance-weighted mean phylogenetic distance (MPD) of the resulting phylogeny. MPD, which is an average of all pairwise distances across a phy- logeny, is a widely used “divergence” metric, capturing relatedness among taxa (Tucker et al., 2017). We use MPD to see how transformations alter the taxonomic diversity in a community. By making this metric abundance-weighted, evolutionary distances are enhanced by population levels, allowing us to emphasize the ecological context of the evolutionary boundaries being measured.

Since δ transforms branch lengths, and MPD is a measure of phylogenetic dis- tance (i.e. branch length) between pairs of individuals, there is an inherent relationship between these two measures. Because transformations are used as a meaningful predic- tor of species boundaries, we expect changes to δ to reliably affect MPD. Similarly, if our delta transformation technique is not compromised by the presence of random in- traspecific differences and sequencing error, we expect MPD to be minimally affected by increases in these terms. Evidence otherwise would imply that the impacts of intraspe- cific differences and differences due to sequencing error are more important to measures 9 of community diversity than the extent of interspecific differences as represented by δ. We use simulations to test the robustness of our technique to these sources of error, and as comparisons with our empirical results.

2.2.2 Simulations

2.2.2.1 Function and Parameter Space To determine whether our technique is able to detect meaningful shifts in species boundaries, we simulate thousands of communities along with their respective phyloge- nies using the sim.meta.phy.comm function in the R package pez (Pearse et al., 2015). This function generates a species-site abundance matrix and a phylogeny of species in a meta-community based on several parameters: number of sites, number of species, timesteps, and values specifying migration, carrying capacity, and stochasticity. Pa- rameter descriptions, and their values, are given in Table 2.1. Parameter values were recommended by the package author (pers. comms) to ensure a reasonable numbers of lineages and phylogenetic structure, and most values were kept constant in all simulation iterations.

To test how varying numbers of species effects our technique, we specify n.spp to use three different values. Importantly, we also vary the birth and death rates used for simulating intraspecific and sequencing error (intra.birth, intra.death, seq.birth, seq.death). By changing intraspecific and sequencing error through their diversification rates (defined as birth rate/death rate), we capture how these error sources impact MPD as compared to changes in interspecific differences using our δ transformation technique. We remove instances in which death rates were greater than or equal to birth rates for both sources of error, as we want to ensure the presence of these appended branches in our final results. These parameter specifications lead to 675 different iterations used to generate our simulation data. Finally, for each of these iterations, we use a range of δ values from 0.1 to 3.0 in increments of 0.1 (i.e. 0.1, 0.2, 0.3, ..., 2.8, 2.9, 3.0) for δ transformations.

2.2.2.2 Response Metrics Final MPD values are calculated following each δ transformation, with abundances drawn from the species-site matrices corresponding to phylogenies. In addition to mea- suring how raw MPD values shift in response to increasing δ values, we capture changes to MPD using two response metrics: (1) correlation between the MPD of the original, 10 function Value 10 5, 10, 15 40 0.02 10 4 1 0.06 0.1, 0.2, 0.3, 0.4,0.1, 0.5 0.2, 0.3, 0.4,3 0.5 0., 0.2, 0.3, 0.4,0.1, 0.5 0.2, 0.3, 0.4,3 0.5 sim.meta.phy.comm Description size (length and width)number of of meta-community species grid innumber initial of community timesteps forprobability community simulations for any given speciesvalue determining to species migrate carrying fromvalue capacity one determining for cell initial a to species cell another value abundances reflecting stochasticity of abundanceprobability calculations of at speciation each for timestep eachbirth species rate at (speciation each rate) timestep death used rate for (extinction intraspecific rate) branches number used of for generations intraspecific used branches birth to rate model (speciation appended rate) intraspecificdeath used differences rate for (extinction sequencing rate) errornumber used branches of for generations sequencing used error to branches model appended sequencing error call; parameters below the double line indicate values used for simulating intraspecific differences and sequencing error. Parameter descriptions used for simulations. Parameters above the double line indicate arguments for the Table 2.1: Parameter size n.spp timesteps p.migrate env.lam abund.lam stoch.lam p.speciate intra.birth intra.death intra.steps seq.birth seq.death seq.steps 11 untransformed phylogeny and the same phylogeny with branches added onto it; and (2) the mean relative shifts in diversity of site rankings, in which community sites are ranked by MPD with the original, untransformed phylogeny and the same phylogeny with ap- pended branches, and the absolute difference in these two rankings are averaged. We use both of these metrics because they emphasize different aspects of shifts in commu- nity diversity: the first directly compares MPD values of resulting phylogenies to their originals, while the second captures the same comparison but in more of a community context (i.e. site diversity).

Because each simulation instance varies from the next due to random birth/death processes, we account for that variation by centering MPD measurements. We do this to capture just the change in MPD between simulated phylogenies at each δ value and the original, untransformed phylogenies of the same communities. For raw MPD values, we center the mean MPD across sites in a community by subtracting the mean MPD across sites in the same community prior to branch additions and transformations. We center our response metrics by capturing the value of each response metric at δ = 1.0 and subtracting that value from response metric values across all δ values for a given simulation instance.

2.2.2.3 Linear Models To examine how transformations change MPD, versus how intraspecifc differences and sequencing error change MPD, we use linear models to capture the effects of our simulation terms on both of the response metrics listed above. All model terms are standardized, in order to compare their effect sizes (Gelman et al., 2013), and because δ is a power transformation of branch lengths, we log-transform our δ term during modeling.

2.2.3 Empirical Analyses To demonstrate our technique being used for real-world scenarios, and as a compar- ison with our simulation results, we use two empirical datasets: fungal communities from tropical soils in Costa Rica (Waring and Hawkes, 2015), and macroinvertebrate commu- nities from the Rio Laja in Mexico, included as part of the R package pez (Pearse et al., 2015). We use both a microbial dataset and a dataset of macrotaxa to demonstrate the generality of our technique to different species delimitation scenarios. 12

For our microbial example, we first constructed a phylogeny of fungal sequences originating from soil communities at the La Selva Biological Station in Costa Rica; abundances for soil communities were pulled from the same data. Fungal sequences were aligned using the program CLUSTAL version 2.1 (Jeanmougin et al., 1998), and a phylogeny was constructed using RAxML version 8.2.11 (Stamatakis, 2014). We did not denoise or cluster sequences prior to phylogeny construction, as we wanted to examine our technique’s ability to deal with these sources of error. For our macrotaxa example, an ultrametric phylogeny, as well as abundance data, was included as part of the laja dataset. In both empirical examples, phylogenies are transformed using the same values used in the simulations (δ ranging from 0.1 to 3.0, in increments of 0.1), and abundance- weighted MPD is calculated. We then plot MPD versus δ, to determine where significant shifts in community diversity are occurring. 13

2.3 Results

2.3.1 Simulations Consistent with our prediction that transformations meaningfully predict species boundaries, MPD values increase asymptotically with increasing δ values (Figure 2.3). Beyond δ values of 1.0, further transformations only minimally affect MPD. Similarly, trends in our response metrics demonstrate that branch transformations predict changes to MPD in a particular range of δ values. Past values of δ = 1.0, transformations do not cause MPD to deviate significantly from its original value. Furthermore, our technique is robust to changes in the diversification rates of intraspecific and sequencing error: we see no regular trend in either response metric with increasing rates of either of these error sources (Figure 2.4; Table 2.2 and Table 2.3). 20 10 0 Mean MPD Values −10 −20

0.0 0.5 1.0 1.5 2.0 2.5 3.0

Delta

Figure 2.3: Diversity asymptotically increases in response to δ transfor- mations. Mean MPD values across simulation instances for each δ value are shown. MPD values are centered by subtracting from each mean value the same mean value in the original, untransformed phylogeny, which is why negative MPD values are pos- sible. Low δ values (less than 1) decrease MPD across communities, while at δ > 1, MPD continues to increase, but at a lower rate of change. The effects of interspecific differences on MPD approach a maximum at higher values of δ.

Because δ is only predictive of MPD in a limited range of values, and because intraspecific and sequencing error minimally affect MPD, our model terms are weakly 14 5 5 3.0 2.5 4 4 2.0 Graphs of median response 3 3 1.5 Delta values reliably cause MPD values of δ 1.0 2 2 Intraspecific diversification rate diversification Intraspecific Sequencing error diversification rate Sequencing error diversification 0.5 1 1

; middle: intraspecific diversification; bottom: sequencing

0.0 4 2 0 −2 4 2 0 −2 −4 4 2 0 −2 −4 −4 δ Median site ranking shifts ranking site Median 5 5 3.0 2.5 4 4 2.0 transformed communities to shift from original MPD values. 3 3 1.5 Delta 1.0 2 2 Intraspecific diversification rate diversification Intraspecific Sequencing error diversification rate Sequencing error diversification 0.5 predicts changes to MPD in a range of values, while error rates have no regular effect. δ 1 1

0.0 0.2 0.0 −0.2 0.2 0.0 −0.2 0.2 0.0 −0.2

Figure 2.4: metrics (left: MPDerror correlations; diversification). right: Colors shifts match in Figure site2.2 . rankings) Increasing for error model rates terms have (top: no regular effect on MPD; low Median MPD correlations MPD Median 15 explanatory to our response variables. For both response metrics, effects of δ were slightly lower than those of intraspecific diversification (intra.div) but greater than those of sequencing error diversification (seq.div). Our model using MPD correlations has a multiple R2 value of 0.3708, and our model using shifts in site rankings has a multiple R2 value of 0.186.

Estimate Std. Error t value Pr(>|t|) (Intercept) 0.6484 0.0009 696.37 0.0000 z.transform(log10(delta)) 0.0393 0.0009 42.22 0.0000 z.transform(intra.div) 0.0414 0.0009 44.44 0.0000 z.transform(seq.div) 0.0024 0.0009 2.56 0.0103 z.transform(comm.spp) 0.0842 0.0009 90.38 0.0000

Table 2.2: Linear model output using a response metric of MPD correlations. Multiple R2 value of model is 0.3708. δ terms are logged transformed in order to account for δ being a power transformation of branch lengths.

Estimate Std. Error t value Pr(>|t|) (Intercept) 18.9133 0.0352 536.91 0.0000 z.transform(log10(delta)) -0.5603 0.0352 -15.91 0.0000 z.transform(intra.div) -2.3083 0.0352 -65.53 0.0000 z.transform(seq.div) -0.3063 0.0352 -8.70 0.0000 z.transform(comm.spp) -0.0761 0.0352 -2.16 0.0307

Table 2.3: Linear model output using a response metric of shifts in site rankings. Multiple R2 value of model is 0.186. Negative effects of model terms are caused by the relationship to the response metric: as δ increases from 0.1, MPD increases, leading to average shifts of site rankings to decrease from maximum values at δ = 1.0 (see Figure 2.4).

2.3.2 Empirical Analyses In our analysis of fungal communities from Costa Rica, we find an asymptotic behavior similar to that found in our simulation results: MPD values initially increase and then level off as δ increases (see Figure 2.5). In contrast to the simulation results, however, there is a global maximum in MPD values: around a δ of 1.5, MPD values begin to decrease. This behavior is caused by the non-ultrametric fungal phylogeny constructed and used for transformations: whereas all tips are extended at high transformation values (i.e. δ > 1) for an ultrametric phylogeny, only the longest tips behave this way for a non-ultramteric phylogeny. Therefore, MPD values decrease as rates of trait evolution accelerate in recently evolved branches. 16

For the Rio Laja macroinvertebrate dataset, community sites show similar trends to those seen in simulations, with MPD values increasing asymptotically. Even at relatively high values of δ (i.e. δ > 2), sites shift in their relative diversity, showing sensitivity to a greater range of δ transformations than that seen in simulated communities. 0.35 0.30 0.25 0.20 0.15 (fung.test[1, 1:length(deltas)]) 0.10 0.05

deltas 1200 MPD Values 1000 800 600 (laja.test[1, 1:length(deltas)]) 400 200 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Deltadeltas

Figure 2.5: δ transformations increase MPD, but trends depend on ultra- metricity. MPD values for the fungal community dataset from Costa Rica are shown on top; values for the macroinvertebrate dataset from the Rio Laja are shown on bot- tom. Each line represents a site within each community. Note the differences in MPD magnitudes between the upper and lower graphs. Delta transformations asymptotically increase MPD for Rio Laja communities, but fungal communities begin to decrease in diversity around values of δ = 1.5, a result of the non-ultrametric fungal phylogeny. 17

2.4 Discussion Using our Pagel’s δ-based technique, we have demonstrated ranges of meaning- ful species boundaries, based on how those boundaries change the mean phylogenetic distance of members in the community from which species are drawn. In both our simulations and empirical examples, altering δ causes changes in the relative diversity of sites in a community, indicating where the effects of interspecific differences matter ecologically. We maintain that our technique illustrates how understanding species in their broader community context can inform the evolutionary boundaries that determine them.

2.4.1 Simulations Diversity of simulated communities, captured using MPD, shows an asymptotic response to increasing δ values, with MPD values increasing sharply at δ values less than 1 and then flattening at δ values greater than 1. That MPD values increase with an increasing δ is the mathematical result the of the branch lengths underlying these two measures. For any given tree topology, there is a maximum MPD value which is represented by the phylogeny with all nodes extended as far back as the root of the tree–resembling what is known as the star phylogeny. δ transformations push nodes further back in evolutionary time, and therefore cause MPD to approach this maximum value.

What is more noteworthy is where MPD values increase more sharply. Recall that δ < 1 accelerates evolution (and, correspondingly, lengthens branches) deeper in the evolutionary past, and δ > 1 does the same in recent evolutionary history (among close relatives; see Figure 2.1). This pattern, in addition to the greatest rate of increase in MPD occurring before δ = 1, demonstrates the relative contributions of recent and an- cient speciation events to a community’s diversity. As δ increases to 1, MPD values rise more rapidly, demonstrating that a significant portion of diversity is tied to diver- sifications in the (relatively) ancient evolutionary history of that community. Likewise, that MPD does not rise sharply past δ = 1 demonstrates the limited impact of recent diversifications on the overall diversity of the community. Therefore, our technique il- lustrates how species boundaries which emphasize recent diversification events do not impact our understanding of community diversity as significantly as more ancient evo- lutionary events. This result is also underlined by our explicit modeling of intraspecific 18 and sequencing error, which we model as recent diversifications in simulated phylogenies.

Figure 2.4 illustrates that our two MPD metrics respond regularly to δ transforma- tions, but show no clear pattern for diversification rates for intraspecific or sequencing error. This finding not only validates the reliability of our method, but carries implica- tions for how these sources of error are handled when determining species boundaries. Because our technique suggests that the effects of these error sources are limited in de- termining community diversity, we expect that methods which account for these errors would not differ greatly in their findings from methods which do not. Indeed, previous studies exploring the effects of OTU thresholds follows these expectations. Research by Glassman and Martiny(2018) found that, for microbial communities sampled across several diverse environments, α and β diversity trends were largely correlated with sam- ples, regardless of clustering at a 97% or 100% (i.e. ASV) threshold. Similarly, Botnen et al.(2018) used thresholds ranging from 87% to 99% to cluster data from nine different microbial studies into OTUs, and found that patterns in community structure remained relatively unchanged across threshold values. By modeling variable interspecific bound- aries and intraspecific and sequencing error phylogenetically, we explicate these findings from first principles. In doing so, our technique highlights the value that phylogenetics, and its associated analytical techniques, brings to studies of species delimitation.

2.4.2 Empirical Analyses Broadly, results from our empirical analyses follow our simulation findings, and demonstrate a similar behavior of diversity in response to increased δ values. In our analysis of fungal communities from soils in Costa Rica, we see a similar increase in the MPD for all communities at low δ values. In contrast to simulated data, however, MPD values decrease as δ increases past values greater than 1.5. This trend results from the phylogeny used for fungal sequences being non-ultrametric (as opposed to the ultrametric phylogenies used for all simulated data, and for the Rio Laja dataset). As unshared branch length proportions of a non-ultrametric phylogeny increase, the longest branches are extended while all other internal branches collapse. Because these internal branches far outnumber other lineages, this leads to overall phylogenetic diversity decreasing.

We chose to include the Rio Laja invertebrate macrotaxa dataset to reflect how our technique can be applied to any assemblage of species, not just to microbial communities. As with our fungal and simulated communities, the relative diversity of sites within this 19 macroinvertebrate community changes as δ increases. In contrast to our fungal dataset and simulation findings, however, our Rio Laja dataset shows that changes in the relative diversity of community sites occur at δ values greater than 1: for instance, lines in the bottom graph of Figure 2.5 cross one another around δ ≈ 2.5. This finding suggests that recent diversification plays a greater role in community studies of macrotaxa than in microbial taxa, but further research with this technique using other examples is required to determine whether this is a general trend. This result may interact with the magnitudes of MPD values in the Rio Laja dataset (compare the vertical axis values between our Rio Laja, fungal, and simulated MPD versus δ plots).

Overall, our Rio Laja macroinvertebrate example demonstrates how our technique can be used to understand the effects of different species boundaries in a variety of different communities, not just microbial ones. OTUs are increasingly used in studies of many different types of communities (for instance, see Ershova et al.(2019)), and our technique offers a means of determining the community effects of those OTU boundaries.

2.4.3 Methodological Extensions While we maintain that our technique illustrates a range of meaningful species boundaries, we refrain from directly assigning OTU thresholds to corresponding δ trans- formation values based on how each effect the MPD of their given communities. However, further research could directly associate δ transformations to OTU thresholds by com- paring the MPD values of the resulting communities. For instance, the MPD of denoised microbial communities, versus those that have been denoised but not clustered, versus those that have been denoised and clustered at different OTU thresholds, could be com- pared to the MPD values generated using different δ transformation values. However, rather than focusing on a more optimal OTU threshold value, we instead emphasize the limitations of understanding species boundaries through the lense of sequence similarities alone.

2.4.4 Conclusion The rapid advances in sequencing technologies and methods for genomic analy- sis in recent decades have caused a proliferation of species delimitation techniques and pipelines. This is especially true in microbial ecology, which requires genetic barcod- ing methods to survey the vast and largely undescribed communities it involves. While 20

OTUs have become an invaluable tool in this endeavor, they are still largely used as sim- ple statistical groupings, ignoring evolutionary theory– a conceptual gap that has been recognized for years (Vogler and Monaghan, 2007). Rather than motivating OTU group- ings using evolutionary theory, the discussion has instead tended towards ideal threshold values. Our δ transformation method demonstrates that variable species thresholds ac- counting for evolutionarily recent diversification have a limited impact on overall commu- nity diversity. In using phylogeny to capture the range in which interspecific differences matter, we unite the evolutionary boundaries of species with their ecological context. 21

Chapter 3

Genomic Comparisons among Varieties of a Primrose Species Complex in

the Great Basin

3.1 Introduction Populations, and the genetic differences between them, are the causes of, and pre- cursors to, speciation. Understanding the structure and origins of these differences is essential for making sense of species’ evolutionary history, and is of even greater im- portance for species of conservation concern. In a conservation context, population differences can carry implications for genetic robustness, establishment of source pop- ulations, and other management decisions. The focus of this chapter is to make such a population comparison between a plant endemic to Utah, Maguire primrose, and its sister varieties.

Maguire primrose (Primula cusickiana var. maguirei) is an herbaceous, perennial plant located exclusively in Logan Canyon, Utah, USA. The plant was listed as threat- ened on August 21st, 1985 (Fish and Wildlife Service, U.S. Department of Interior, 1985). Its status as a rare species with a limited range motivated the need for an increased un- derstanding of its genetic variation, given the greater potential for loss of that variation in small populations. Research in 1997 comparing allozyme marker genes revealed a significant degree of genetic structure between the relatively proximate (about 10 km) lower and upper Logan canyon populations of P. cusickiana var. maguirei (Wolf and Sinclair, 1997), with the two population groups being nearly fixed for different alleles. This structure was confirmed in 2008 through analysis of the same populations using 165 amplified fragment length polymorphism (AFLP) loci (Bjerregaard and Wolf, 2008); additionally, similar levels of polymorphism were found in most upper and lower canyon populations, suggesting this structure is not the result of a past bottleneck event. 22

The same 2008 study compared crosses between and within upper and lower canyon populations, and found that interpopulation crosses generated a higher seed set than crosses within a population (although resulting seeds were not tested for viability). This finding of possible inbreeding depression suggests environmental factors may be driving the population divergence in maguirei. Evidence of a fairly small degree of phenologi- cal overlap in the blooming period between upper and lower canyon populations (likely caused by significant temperature differences) supports this idea (Bjerregaard and Wolf, 2008, Davidson and Wolf, 2011). The population dynamics of maguirei are further af- fected by heterostyly, in which plants express one of two floral morphologies: “pins”, with extended styles and anthers near the base of the corolla, and “thrums”, with re- cessed styles and anthers near the mouth of the corolla. This self-incompatibility system promotes outcrossing between pin and thrum morphs (termed legitimate xenogamy) via insect pollination (Darwin, 1897). Because maguirei populations have been shown to be largely distylous, having a pin:thrum morphology ratio of about 1:1 (Davidson and Wolf, 2011), any individual can mate successfully with only half of the entire population, in scenarios of legitimate xenogamy. These reproductive limitations, along with maguirei’s patchy population distributions, likely serve as barriers to otherwise advantageous out- crossing.

The taxonomic classification of Maguire’s primrose has shifted over time. First described as a distinct species in 1936 (Williams, 1936), P. cusickiana var. maguirei lies within the Parryi section of the Primula genus, along with the other three complex members of what is now the P. cusickiana complex: varieties cusickiana, domensis, and nevadensis. Varieties cusickiana, maguirei, and nevadensis were initially differentiated into species based on their different ecological traits (occupying different soil habitats) and geographic ranges, rather than their morphology (which is largely similar; Holmgren and Kelso(2001); see Figure 3.1). The discovery and publication of P. domensis in 1985 (Kass and Welsh, 1985), along with the continued collection of the other varieties, began to cast doubt on the species distinction for each complex member. In 2001, a review of species within the section Parryi concluded that the morphological differences between P. maguirei, P. cusickiana, P. domensis, and P. nevadensis were insufficient for distinguishing each as its own species, and established that these groups instead be distinguished as different varieties of the same species (Holmgren and Kelso, 2001). However, at the time, no genetic data was available to justify this taxonomic shift. 23

Figure 3.1: The four members (varieties) of the Primula cusickiana species complex: (A) maguirei, in Right Hand Fork of Logan Canyon, Utah; (B) cusickiana, near Cougar Point in Jarbidge, Nevada; (C) domensis, at Notch Peak in the House Range, Utah; (D) nevadensis, on Mount Washington in Great Basin National Park, Nevada.

A 2009 analysis using AFLPs and chloroplast DNA marker regions of section Par- ryi of the Primula genus showed the P. cusickiana complex being more recently de- rived compared to other members of the Parryi group, and supported the complex as monophyletic (Kelso et al., 2009). Relationships within the complex were incongruent, however, with only weak support of a clade containing nevadensis and domensis being sister to a clade made up of maguirei and cusickiana. The position of Primula capillaris, the Ruby Mountain primrose, was similarly variable, being equally supported as nested within the complex and sister to it. In order to increase resolution and better determine the of complex members, the authors suggested a wider analysis utilizing more populations. The restriction-site associated sequencing (RADseq) technologies available today, with their ability to generate reads over many sequence regions of closely related individuals, are well-suited to provide the data required for such an analysis.

We sought to clarify the relatedness of P. cusickiana complex members by geno- typing all four varieties located at distinct populations scattered throughout the Inter- mountain West using a RADseq approach. In addition to taxonomic clarification, this analysis may help to explain the level of genetic structure existing between the upper 24 and lower Logan Canyon populations of P. cusickiana var. maguirei. We hypothesized that one of the maguirei populations is more closely related to a population of another variety within the species complex than it is with the neighboring Logan Canyon pop- ulation, given the genetic distance between the two. Comparing the genetic distances between the maguirei populations with those separating the populations of the other va- rieties has the potential to provide insights into the biogeographic history of this species complex, and could have important implications for conservation and management. 25

3.2 Methods

3.2.1 Sampling All Primula cusickiana species complex samples were gathered in the field. Primula parryi samples were also collected in the field, to use as an outgroup in genetic analyses. Populations and their respective flowering times were determined using herbarium spec- imens, and collection sites were selected to maximize the geographic distribution of each variety. At each population location, an individual plant was removed as completely as possible as a voucher specimen. For DNA samples, two leaves from each of five plants were removed and placed in labeled paper envelopes, which were stored on silica crystals to keep samples dry. Vouchers were deposited at the Intermountain Herbarium (UTC); P. cusickiana var. nevadensis voucher specimens collected from Mt. Washington were additionally deposited at the Great Basin National Park herbarium.

Because past research has shown variable relations between P. capillaris and the P. cusickiana species complex (Kelso et al., 2009), we also tried to collect P. capillaris in the field. However, we were unable to locate any P. capillaris individuals in the Ruby Mountains: at one location suggested by past herbaria data, a population of P. parryi was found instead. To compensate, two P. capillaris samples were sourced from herbaria: one from the Intermountain Herbarium (catalog number UTC00138833), and one from the Arizona State University Herbarium (catalog number ASU0020421).

Leaf tissue from 96 samples–87 field collections, 2 herbarium specimens of P. cap- illaris, and 7 replicates–were placed into 96 QIGAEN Collection Microtubes (catalog number 19560) and sent to University of Wisconsin-Madison Biotechnology Center, for DNA extraction, library prep, and DNA sequencing. Replicate samples were used to assess the quality of sequencing results, and were distributed across all P. cusickiana varieties, as well as P. parryi.

3.2.2 DNA Extraction DNA was extracted using the QIAGEN DNeasy mericon 96 QIAcube HT Kit. DNA was quantified using the Quant-iT TM PicoGreenR© dsDNA kit (Life Technologies, Grand Island, NY). 26

3.2.3 Library Prep and Sequencing Libraries were prepared following Elshire et al.(2011). ApekI (New England Bi- olabs, Ipswich, MA) was used to digest 100 ng of DNA. Following digestion, Illumina adapter barcodes were ligated onto DNA fragments using T4 ligase (New England Bio- labs, Ipswich, MA). Size selection was run on a PippinHT (Sage Science, Inc., Beverly, MA) to subset samples down to 300 – 450 bp fragments, after which samples were purified using a SPRI bead cleanup. To generate quantities required for sequencing, adapter- ligated samples were pooled and then amplified, and a post-amplification SPRI bead cleanup step was run to remove adapter dimers. Final library qualities were assessed using the Agilent 2100 Bioanalyzer and High Sensitivity Chip (Agilent Technologies, Inc., Santa Clara, CA), and concentrations were determined using the Qubit© dsDNA HS Assay Kit (Life Technologies, Grand Island, NY). Libraries were sequenced on an Illumina NovaSeq 6000 2x150 S2.

3.2.4 Data Processing Raw FASTQ data files were demultiplexed and processed using ipyrad version 0.9.31 (http://ipyrad.readthedocs.io/; Eaton and Overcast(2020)). For all downstream STRUCTURE analyses, single nucleotide polymorphisms (SNPs) recognized by ipyrad were used as the basis for variation between individuals. All ipyrad and STRUCTURE parameter files, as well as R scripts used for analysis and data visualization, are provided at https://github.com/akoontz11/Primula/tree/master/R.

3.2.4.1 Complex-Wide Genomic Survey For our complex-wide genomic survey, we ran ipyrad twice: we used the results from our initial run to confirm sequencing consistency for replicate samples, and to identify samples with low coverage (generally, samples with less than 30 loci in the final assembly). For both our initial and secondary runs, demultiplexed sequences were paired and merged, and low quality bases, adapters, and primers were filtered prior to SNP calling.

For our initial run, reads were clustered (clust_threshold parameter) at the de- fault 85% threshold. We specified a minimum sequencing depth (mindepth_statistical) of 6, and a minimum number of samples per locus (min_samples_locus) of 35. Using the results from this run, we used the script vcf2Jaccard.py (https://github.com/ 27 carol-rowe666/vcf2Jaccard) to compare samples with replicates by calculating the mean Jaccard similarity coefficients between all samples. We found that all replicates matched highly with their corresponding samples: see Figure 3.2. 800 600 400 Number of pairwise comparisons 200 0

0.0 0.2 0.4 0.6 0.8 1.0

Jaccard distance value

Figure 3.2: Distribution of Jaccard similarity values between samples analyzed. Jac- card similarity values for replicates are shown as purple lines. Similarity coefficient values were calculated based off of the SNP matrix of samples generated by our initial ipyrad run.

After merging replicates and removing low coverage samples from the dataset, 82 samples remained for our complex-wide analysis. We reran ipyrad using these 82 samples to select for loci specific to this subset. For this second run, reads were again clustered at the default 85% threshold; we used a mindepth_statistical parameter of 6 and a min_samples_locus parameter of 32.

3.2.4.2 Variety Specific Clustering In addition to our complex-wide survey, we were interested in exploring within certain varieties, in order to resolve relations at too fine a scale to be captured at the complex level. We first analyzed only variety maguirei by running ipyrad on just the 18 maguirei samples used in our complex-wide survey. Reads were again clustered at the default 85% threshold; we used a mindepth_statistical parameter of 6, and a min_samples_locus parameter of 5. 28

We were also interested in relationships in variety cusickiana, which has the great- est geographic range of all the species complex members. We ran ipyrad on the 24 cusickiana samples sourced from 4 populations across the Snake River Plain in Idaho (cusickiana samples from Nevada and Oregon were not included). For this run, we used a clust_threshold value of 85%, a mindepth_statistical parameter of 6, and a min_samples_locus parameter of 7.

3.2.5 Population Analyses

3.2.5.1 STRUCTURE To visualize relations between complex members across their geographic range, we used the program STRUCTURE version 2.3 (Pritchard et al., 2000). STRUCTURE uses Bayesian clustering analysis to probabilistically assign individuals to one or more of K populations, where the loci within each population are assumed to be at Hardy- Weinberg equilibrium and linkage equilibrium. For all STRUCTURE runs, we used a burnin length of 50,000, and 100,000 MCMC reps after burnin. For our complex-wide survey, we ran STRUCTURE for K values of 2 – 16, with 50 replicates per K value. For both our maguirei and Snake River Plain cusickiana analyses, we ran STRUCTURE for K values of 2 – 6, with 50 replicates per K value. We used the CLUMPAK server (http://clumpak.tau.ac.il/; Kopelman et al.(2015)) to summarize results across replicates for each K value, and for building STRUCTURE plots.

3.2.5.2 Discriminant Analysis of Principal Components In addition to STRUCTURE, we analyzed the results of our complex-wide survey using Discriminant Analysis of Principal Components (DAPC; Jombart et al.(2010)). DAPC is a statistical technique designed to accommodate the size of genomic datasets and capable of differentiating within-group variation from between-group variation. SNP data is first transformed using a principal components analysis (PCA), and then k-means clustering is run to generate models and likelihoods corresponding to each number of population clusters. Bayesian information criterion (BIC) is then used to determine the best supported number of population clusters. We compared our STRUCTURE results with our results using DAPC, and used our DAPC output to visualize population clusters in a PCA format. 29

3.3 Results

3.3.1 P. cusickiana species complex relations We retrieved, on average, 2.04 x 106 reads per sample, and our complex-wide ipyrad run identified 1,277 loci that were used in our STRUCTURE analysis. Using the Evanno method for finding the optimal K value (Evanno et al., 2005) found K = 5 to yield the greatest ∆K value; using the method described in the STRUCTURE manual (Pritchard et al., 2000), which identifies the K value with the greatest likelihood, generated an optimal K value of K = 14. We visualized the STRUCTURE results for K values ranging from K = 2 - 16 (Figures 3.3, 3.4, and 3.5). Based on the output of these STRUCTURE plots, we determined K = 7 to be the most biologically relevant number of clusters (Figure 3.6). At this K value, varieties domensis and maguirei are clearly delineated, variety nevadensis shows distinctions between its two populations, and variety cusickiana is split into three distinct groups made up of populations from Idaho, Nevada, and Oregon. Higher values of K further emphasized these divisions, and generally did not reveal any more information regarding overall relations within the complex.

Our DAPC reported that the number of clusters with the lowest BIC value (i.e. the greatest supported number of clusters) was 11 (data not shown). However, several of these clusters were quite small (consisting of only one or two samples), and groupings at 11 clusters were incoherent with the distribution of complex populations. Therefore, we ran our DAPC with a specification of 7 clusters, to examine the distances between groups at this level and compare results to our STRUCTURE output. DAPC results for 7 populations clusters are given in Figure 3.7, and illustrate the distinctiveness of varieties cusickiana and maguirei and the similarities between varieties nevadensis and domensis, along with cusickiana populations outside of the Snake River Plain. Figure 3.8 provides an image of sampled populations in their geographic contexts, with pie charts used to illustrate membership to the 7 different clusters according to our STRUCTURE analysis at K = 7.

30

parryi

nv_Troy

nv_GRBA

domensis maguirei

K = 2 K = 3 K = 4 K = 5 K = 6

ck_Owyhee ck_Jarbidge

Complex-wide analysis STRUCTURE plots for values of K = 2 through K = 6. ck_Idaho Figure 3.3: 1.00 0.75 0.50 0.25 0.00 1.00 0.75 0.50 0.25 0.00 1.00 0.75 0.50 0.25 0.00 1.00 0.75 0.50 0.25 0.00 1.00 0.75 0.50 0.25 0.00

31

parryi

nv_Troy

nv_GRBA

domensis maguirei

K = 7 K = 8 K = 9

K = 10 K = 11

ck_Owyhee ck_Jarbidge

Complex-wide analysis STRUCTURE plots for values of K = 7 through K = 11. ck_Idaho Figure 3.4: 1.00 0.75 0.50 0.25 0.00 1.00 0.75 0.50 0.25 0.00 1.00 0.75 0.50 0.25 0.00 1.00 0.75 0.50 0.25 0.00 1.00 0.75 0.50 0.25 0.00

32

parryi

nv_Troy

nv_GRBA

domensis maguirei

K = 12 K = 13 K = 14 K = 15 K = 16

ck_Owyhee ck_Jarbidge

Complex-wide analysis STRUCTURE plots for values of K = 12 through K = 16. ck_Idaho Figure 3.5: 1.00 0.75 0.50 0.25 0.00 1.00 0.75 0.50 0.25 0.00 1.00 0.75 0.50 0.25 0.00 1.00 0.75 0.50 0.25 0.00 1.00 0.75 0.50 0.25 0.00 parryi

33

nevadensis_Troy

nevadensis_GRBA domensis

K=7

maguirei

cusickiana_Owyhee

cusickiana_Jarbidge cusickiana_Idaho Complex-wide analysis STRUCTURE plot at K = 7, with labels representing the populations from which samples were sourced 0 Figure 3.6: 1.00 0.75 0.50 0.25 34 remains distinct. populations from Nevada and Oregon are shown to group together using this maguirei cusickiana method, while maguirei cusickiana_Owyhee domensis parryi nevadensis cusickiana_Idaho cusickiana_Jarbidge , and nevadensis , domensis Discriminant Analysis of Principal Components (DAPC) illustrating ordination of 7 population clusters. Coloration matches Figures Figure 3.7: 3.6 and 3.8 . Populations of 35

48°N

46°N

ID Clusters 44°N OR cusickiana_Idaho cusickiana_Jarbidge cusickiana_owyhee ° Y 42 N maguirei domensis nevadensis_Troy 40°N parryi

NV UT

38°N

36°N

125°W 120°W 115°W 110°W X Figure 3.8: Map of sample locations, with pie charts used to represent sample mem- bership to STRUCTURE clusters at K = 7. Coloration matches Figures 3.6 and 3.7.

Overall, our complex-wide analyses found support for several distinct groupings within variety cusickiana, notably for populations located in the Owyhee High Desert in Oregon and in Jarbidge, Nevada. Even at very low values of K (i.e. K = 2 - 3), both of these populations emerge as distinct from the remaining cusickiana populations located primarily along the Snake River Plain in Idaho. This motivated our analysis into solely Snake River Plain populations (see Section 3.3.3 below). Additionally, we found support for the closeness of varieties domensis and nevadensis, with the latter characterized by differing levels of admixture between the sampled populations. Finally, all of our analyses point to the uniqueness of variety maguirei within the species complex.

3.3.2 P. cusickiana var. maguirei In our complex-wide analysis, variety maguirei was grouped as a single population cluster, distinct from all other populations of all other varieties. Even at values of K 36

= 16, the upper and lower Logan Canyon populations of maguirei were not resolved. Thus, we reject our hypothesis that either Logan Canyon maguirei population is more closely related to another population of a different variety.

However, reducing our sample set to only maguirei samples allowed us to retain loci informative to this variety but unshared with other complex member populations. Our maguirei-only ipyrad run generated a STRUCTURE file with 68,492 loci, indicating a large number of loci specific to maguirei and not shared with the wider species complex. To speed up processing times, we ran STRUCTURE on a 17,988 loci subset of maguirei- specific markers. Using the CLUMPAK server, we found optimal K values of K = 4 (using the Evanno method) and K = 3 (using the likelihood method described in the STRUCTURE manual). Figure 3.9 shows the STRUCTURE plot at K = 3, which resolves similar groupings of maguirei populations supported in Bjerregaard and Wolf (2008), and the distinctions between upper and lower canyon populations.

37 Seed Source Seed

samples at K = 3. Right Hand Fork Hand Right maguirei K=3 STRUCTURE plot for only

Figure 3.9: Lower Logan Canyon Logan Lower 0 1.00 0.75 0.50 0.25 38

3.3.3 P. cusickiana var. cusickiana Populations of variety cusickiana from Nevada and Oregon clustered separately from cusickiana populations along the Snake River Plain in Idaho at relatively low values of K (i.e. K = 4; see Figure 3.3). At higher values of K, distinctions began to appear between the Snake River Plain cusickiana individuals. We ran ipyrad and STRUCTUCRE solely using these cusickiana populations to better resolve relations in this group. Our ipyrad run reported 38,751 informative loci between Snake River Plain cusickiana individuals, again indicating a sizable number of loci not shared with other members of the species complex. We ran STRUCTURE on a subset of 4,410 loci for K values ranging from 2 to 6; K = 4 was best supported using the Evanno method, while K = 3 was best supported using the likelihood technique described in the STRUCTURE manual. We visualized K = 3 STRUCTURE plots for these populations (see Figure 3.10) as we felt this level of clustering best represented the biology in this group.

39 Camas Prairie Camas K=3

samples sourced from the Snake River Plain in Idaho, at K = 3. Boise cusickiana

STRUCTURE plot for only Bear Figure 3.10: 0 1.00 0.75 0.50 0.25 40

3.4 Discussion

3.4.1 Primula cusickiana var. maguirei Results from our complex-wide DAPC and STRUCTURE analyses strongly sup- port variety maguirei as its own unique variety, separate from others within the species complex. Simultaneously, our maguirei-only analysis supports previous research reveal- ing notable genetic distance between the upper and lower Logan Canyon populations of this threatened endemic plant. Taken together, these findings reflect the overall bio- geographic trends of this plant. Members of the P. cusickiana species complex are generally adapted to the cool, moist habitats prevalent during the Pleistocene. Due to warming across the Great Basin throughout the Holocene, ranges have become restricted to narrow, disjunct pockets, largely in high elevations and on limestone substrates.

Research into the population structure of P. cusickiana var. maguirei was origi- nally motivated by its threatened status, and identifying potential source populations. Our results suggest that it will be important for future management of maguirei to protect both upper and lower canyon populations, given the genetic diversity between them. This is further emphasized by past research showing that inter-population crosses generated a significantly higher seed set than intra-population crosses (Bjerregaard and Wolf(2008))—but more work needs to be done on maguirei’s breeding dynamics to confirm these results.

3.4.2 Genetically distinct populations within var. cusickiana Previous morphological analysis of 76 individuals from four different populations from Southwest Idaho has posited support for dividing then species cusickiana within Idaho into three unique species: P. cusickiana, P. wilcoxiana, and P. broadheadae (Mans- field(1993). Prior analyses using restriction fragment length polymorphisms (RFLPs) of chloroplast DNA from different cusickiana morphotypes from Owyhee and southwestern Idaho populations found insignificant correlation between genetic and geographic dis- tances between morphotypes, but the authors note that this was possibly a product of the genetic markers used (Owen et al.(2003)). Our results support the differentiation of Owyhee populations of cusickiana, which has been variably classified in the past as P. wilcoxiana by Mansfield and others. Jarbidge populations of cusickiana were first discovered in 1995 (New York Botanical Gardens, catalog #801526), and to our knowl- edge, ours is the first study making any type of genomic comparison including these 41

Jarbidge populations. At all values above K = 5, our results separate these two pop- ulations from the rest of cusickiana and from each other, although our DAPC results suggest close relations. Future morphological analyses and breeding surveys, along with more precise genetic data and estimates of divergence times, will help to clarify whether these populations stand to be classified as separate varieties or species from the rest of the complex.

3.4.3 Admixture in var. nevadensis Varieties nevadensis and domensis are the most recent additions to the P. cu- sickiana species complex, being described in 1967 (Holmgren(1967)) and 1985 (Kass and Welsh(1985)), respectively. The 2009 phylogenetic analysis by Kelso et al. of the Parryi section of Primula found support for grouping these two varieties together, suggesting they may even be considered a single taxon. Our analyses support the close- ness of these two varieties, revealing nevadensis populations from Mt. Washington, in the Snake Range, to be hybrids of domensis, in the House Range to the east, and nevadensis populations in the Grant Range, to the south. At K = 6, Grant Range pop- ulations of nevadensis are also characterized by hybridization, with segments coming from domensis, Jarbidge populations of cusickiana, and Owyhee populations of cusick- iana, whereas at K = 7, Grant Range populations are less admixed. Minor clusters at each of these K values (as seen in CLUMPAK; not shown) similarly suggest possi- ble admixture with Jarbidge and Owyhee cusickiana populations for both nevadensis populations. Clearly, more detail is needed to determine the extent and nature of the admixture within nevadensis populations.

3.4.4 Primula capillaris We originally considered P. capillaris as an outgroup species candidate with which to compare P. cusickiana complex members. However, Kelso et al.’s 2009 review de- scribed the relation of P. capillaris to the complex members as “notably dissonant”, with P. capillaris variably being shown as nested within and sister to the complex vari- eties. This motivated the inclusion of P. parryi in our genomic survey as an outgroup. Unfortunately, our study was unable to clarify the relation of P. capillaris with the rest of the species complex: we were unable to locate any wild populations within the Ruby Mountains, and the two obtained herbarium specimens showed highly variable relations, 42 possibly due to their age (both were sourced from 1965). Future research is needed to de- termine how P. capillaris fits in with populations of the P. cusickiana complex broadly, and to determine potential threats to this narrow endemic.

3.4.5 Conclusion Our results underline the biogeographic history of the P. cusickiana complex and the Parryi section of the Primula genus generally. The vicariance exhibited by these populations is almost certainly a product of their range contraction since the end of the Pleistocene about 2.5 million years ago, justifying the relatively large genetic distances separating them. Such processes continue to this day, and several of the populations considered in this survey are likely threatened by a continued loss of suitable habitat given future climatic warming. Narrow range endemics—including P. cusickiana var. maguirei, but also nevadensis, domensis, and P. capillaris—warrant concern of extinc- tion, and more work needs to be done to better understand the breeding limitations faced by each of these taxa and what can be done to ensure their survival in an increasingly arid Great Basin. 43

Chapter 4

Summary and Conclusions

The topics outlined in this thesis are motivated by the question of how ambiguous species boundaries can be handled in different scenarios of biological study. Despite the unique contexts, both scenarios wrestle with similar sources of error confounding species boundaries–for instance, intraspecific differences (populations) that can be interpreted (or misinterpreted) as unique species. While by no means comprehensive, the results of these chapters demonstrate some methods that can be used to account for intraspecific differences and other sources of error in species delimitation.

In Chapter 2, I describe a phylogenetic technique based on Pagel’s δ transformation that can model the effects of different species boundaries on a community of interest. Using simulations, I demonstrate that, holding to assumptions regarding the relative effects of error on species boundaries, this technique is robust to intraspecific differ- ences and to differences arising from random sequencing error. I use this technique to demonstrate the range in which changing species boundaries have the greatest impact on measures of community diversity. Finally, I highlight two empirical examples to illustrate the application of branch length transformations. While motivated from the controversy surrounding usage of OTUs versus ASVs in microbial ecology, our technique can equally be applied to other types of communities, as demonstrated in the example of macroinvertebrate communities from the Rio Laja in Mexico. OTUs are increasingly used in community studies of protists and other microbial eukaryotes, and my results suggest that the boundaries on those OTUs can be determined in a conscious and me- thodical way. My hope is that researchers involved with species delimitation will utilize this method as a way to understand how the species boundaries they define impact the communities that include those species. Broadly, this technique can be seen as part of a broader current in biology to better unite the fields of evolutionary theory and macroecology (McGill et al., 2019). 44

In Chapter 3, I clarify the genomic relations between members of a species complex of plants endemic to the Great Basin and characterized by similar morphologies and discrete ranges. Using a RADseq approach, I sequenced members of several populations and all varieties of the Primula cusickiana species complex, and successfully illustrated how closely these varieties relate to one another. Specifically, I found that populations of variety maguirei cluster together in the complex-wide analysis, despite finding sup- port for previous research detailing notable genetic distance between these populations. Additionally, my genomic survey reveals genetic distinctions in isolated populations of variety cusickiana and, in contrast, similarities between varieties domensis and nevaden- sis, with the latter characterized by admixture. The relations in this species complex exemplify several of the phenomena that nuance speciation in plants and other taxa–for instance, geographic isolation by distance, cryptic speciation, and hybridization–and this research illustrates the power that RADseq methods provide for untangling the signals of these phenomena. My results motivate further research into the P. cusickiana com- plex, particularly how P. capillaris relates to the complex members, and whether current varietal and species labels should be reassigned to better reflect the relations between them. Finally, the evolutionary vicariance demonstrated by this research highlights the threatened conservation status of not just P. cusickiana var. maguirei, but of nevdensis, domensis, and the Nevada and Oregon populations of cusickiana. These small, isolated populations are precarious and susceptible to disturbances to their limited habitat, and these risks are amplified by climate change, which endangers these populations further by restricting their already diminutive ranges. I hope my findings can serve as a valu- able resource for researchers seeking to better understand this endemic plant, as well as conservationists seeking to protect it. 45

Bibliography

M. Balaban, N. Moshiri, U. Mai, X. Jia, and S. Mirarab. TreeCluster: Clustering biological sequences using phylogenetic trees. PLoS One, 14(8):e0221068, Aug. 2019.

D. Bickford, D. J. Lohman, N. S. Sodhi, P. K. L. Ng, R. Meier, K. Winker, K. K. Ingram, and I. Das. Cryptic species as a window on diversity and conservation. Trends Ecol. Evol., 22(3):148–155, Mar. 2007.

L. Bjerregaard and P. G. Wolf. Strong genetic differentiation among neighboring popu- lations of a locally endemic primrose. West. N. Am. Nat., 68(1):66–75, 2008.

S. S. Botnen, M. L. Davey, R. Halvorsen, and H. Kauserud. Sequence clustering threshold has little effect on the recovery of microbial community structure. Mol. Ecol. Resour., 18(5):1064–1076, 2018.

B. J. Callahan, P. J. Mcmurdie, and S. P. Holmes. Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. ISME J., 11:2639–2643, 2017.

B. C. Carstens, T. A. Pelletier, N. M. Reid, and J. D. Satler. How to fail at species delimitation. Mol. Ecol., 22(17):4369–4383, Sept. 2013.

D. J. Coates, M. Byrne, and C. Moritz. Genetic diversity and conservation units: Dealing with the species-population continuum in the age of genomics. Frontiers in Ecology and Evolution, 6(October):165, 2018.

C. Darwin. The Different Forms of Flowers on Plants of the Same Species. D. Appleton and Company, 1897.

J. B. Davidson and P. G. Wolf. Natural history of maguire primrose, Primula cusickiana var. maguirei (). West. N. Am. Nat., 71(3):327–337, 2011. 46

K. de Queiroz. Ernst mayr and the modern concept of species. Proc. Natl. Acad. Sci. U. S. A., 102 Suppl 1:6600–6607, May 2005.

D. A. R. Eaton and I. Overcast. ipyrad: Interactive assembly and analysis of RADseq datasets. Bioinformatics, Jan. 2020.

R. C. Edgar. Updating the 97% identity threshold for 16S ribosomal RNA OTUs. Bioinformatics, 34(14):2371–2375, July 2018.

R. J. Elshire, J. C. Glaubitz, Q. Sun, J. A. Pol, K. Kawamoto, E. S. Buckler, and E. Sharon. A robust, simple genotyping-by-sequencing (GBS) approach for high di- versity species. PloS One, 6(e19379), 2011.

E. A. Ershova, R. Descoteaux, O. S. Wangensteen, K. Iken, R. R. Hopcroft, C. Smoot, J. M. Grebmeier, and B. A. Bluhm. Diversity and distribution of meroplanktonic larvae in the Pacific Arctic and connectivity with adult benthic invertebrate commu- nities. Frontiers in Marine Science, 6:490, 2019.

G. Evanno, S. Regnaut, and J. Goudet. Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol. Ecol., 14(8):2611–2620, July 2005.

Fish and Wildlife Service, U.S. Department of Interior. Endangered and threatened wildlife and plants; final rule to determine Primula maguirei (Maguire Primrose) to be a threatened species. Technical report, 1985.

R. Frankham, J. D. Ballou, M. R. Dudash, M. D. B. Eldridge, C. B. Fenster, R. C. Lacy, J. R. Mendelson, I. J. Porton, K. Ralls, and O. A. Ryder. Implications of different species concepts for conserving biodiversity. Biol. Conserv., 153:25–31, Sept. 2012.

A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis, Third Edition. CRC Press, Nov. 2013.

S. I. Glassman and J. B. H. Martiny. Broadscale ecological patterns are robust to use of exact sequence variants versus operational taxonomic units. Abstr. Gen. Meet. Am. Soc. Microbiol., 3(4):e00148–18, 2018.

C. Gries, E. E. Gilbert, and N. M. Franz. Symbiota - a virtual platform for creating voucher-based biodiversity information communities. Biodivers Data J, (2):e1114, June 2014. 47

J. Hey, R. S. Waples, M. L. Arnold, R. K. Butlin, and R. G. Harrison. Understanding and confronting species uncertainty in biology and conservation. Trends Ecol. Evol., 18(11):597–603, Nov. 2003.

N. H. Holmgren. A new species of primrose from Nevada. Madro˜no, 19(1):27–29, 1967.

N. H. Holmgren and S. Kelso. Primula cusickiana (Primulaceae) and its varieties. Brittonia, 53(1):154–156, 2001.

J. Huang. Is population subdivision different from speciation? from phylogeography to species delimitation. Ecol. Evol., 36:303, June 2020.

J.-P. Huang and L. L. Knowles. The species versus subspecies conundrum: Quantitative delimitation from integrating multiple data types within a single bayesian approach in hercules beetles. Syst. Biol., 65(4):685–699, July 2016.

P. Hugenholtz. Exploring prokaryotic diversity in the genomic era. Genome Biol., 3(2): 3, Jan. 2002.

F. Jeanmougin, J. D. Thompson, M. Gouy, D. G. Higgins, and T. J. Gibson. Multiple sequence alignment with clustal X. Trends Biochem. Sci., 23(10):403–405, Oct. 1998.

T. Jombart, S. Devillard, and F. Balloux. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet., 11:94, Oct. 2010.

R. Kass and S. Welsh. New species of Primula (Primulaceae) from Utah. Great Basin Nat., 45(3):548–550, 1985.

S. Kelso, P. M. Beardsley, and K. Weitemier. Phylogeny and biogeography of Primula sect. Parryi (Primulaceae). Int. J. Plant Sci., 170(1):93–106, 2009.

N. M. Kopelman, J. Mayzel, M. Jakobsson, N. A. Rosenberg, and I. Mayrose. Clumpak: a program for identifying clustering modes and packaging population structure infer- ences across K. Mol. Ecol. Resour., 15(5):1179–1191, Sept. 2015.

V. Kunin, A. Engelbrektson, H. Ochman, and P. Hugenholtz. Wrinkles in the rare biosphere: Pyrosequencing errors can lead to artificial inflation of diversity estimates. Environ. Microbiol., 12(1):118–123, 2010.

M. Lynch. Evolution of the mutation rate. Trends Genet., 26(8):345–352, Aug. 2010. 48

D. Mansfield. The status of “Primula wilcoxiana” and the Primula cusickiana complex in Southwestern Idaho. Technical report, Bureau of Land Management, U.S. Department of the Interior, June 1993.

B. J. McGill, J. M. Chase, J. Hortal, I. Overcast, A. J. Rominger, J. Rosindell, P. A. V. Borges, B. C. Emerson, R. Etienne, M. J. Hickerson, D. L. Mahler, F. Massol, A. Mc- Gaughran, P. Neves, C. Parent, J. Pati˜no,M. Ruffley, C. E. Wagner, and R. Gillespie. Unifying macroecology and macroevolution to answer fundamental questions about biodiversity. Glob. Ecol. Biogeogr., 28(12):1925–1936, Dec. 2019.

J. T. Nearing, G. M. Douglas, A. M. Comeau, and M. G. I. Langille. Denoising the denoisers: an independent evaluation of microbiome sequence error-correction ap- proaches. PeerJ, 6:e5364, 2018.

N. P. Nguyen, T. Warnow, M. Pop, and B. White. A perspective on 16S rRNA opera- tional taxonomic unit clustering using sequence similarity. NPJ Biofilms Microbiomes, 2:16004, Apr. 2016.

M. Ondˇrej.Cutting tree branches to pick OTUs: a novel method of provisional species delimitation. Sept. 2018.

J. Owen, S. Dunham, and D. H. Mansfield. Preliminary genetic analyses show no dif- ferentiation of two primrose (Primula cusickiana) morphotypes. J. Idaho Acad. Sci., 39(1):39, June 2003.

M. Pagel. Inferring the historical patterns of biological evolution. Nature, 401:877–884, 1999.

W. D. Pearse, M. W. Cadotte, J. Cavender-Bares, A. R. Ives, C. M. Tucker, S. C. Walker, and M. R. Helmus. pez: Phylogenetics for the environmental sciences. Bioinformatics, 31(17):2888–2890, 2015.

J. K. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. Genetics, 2000.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020. URL https://www.R-project. org/. 49

C. L. Schoch, K. A. Seifert, S. Huhndorf, V. Robert, J. L. Spouge, C. A. Levesque, W. Chen, Fungal Barcoding Consortium, and Fungal Barcoding Consortium Author List. Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for fungi. Proc. Natl. Acad. Sci. U. S. A., 109(16):6241–6246, Apr. 2012.

T. J. Sharpton, S. J. Riesenfeld, S. W. Kembel, J. Ladau, J. P. O’Dwyer, J. L. Green, J. A. Eisen, and K. S. Pollard. PhyLOTU: A high-throughput procedure quantifies microbial community diversity and resolves novel taxa from metagenomic data. PLoS Comput. Biol., 7(1), 2011.

P. H. A. Sneath and R. R. Sokal. Numerical taxonomy: the principles and practice of numerical classification. W. H. Freeman, 1973.

E. Stackebrandt. Taxonomic parameters revisited: tarnished gold standards. Microbiol. Today, 33:152–155, 2006.

E. Stackebrandt and B. M. Goebel. Taxonomic note: A place for DNA-DNA reassocia- tion and 16S rRNA sequence analysis in the present species definition in bacteriology. Int. J. Syst. Evol. Microbiol., 44(4):846–849, 1994.

A. Stamatakis. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30(9):1312–1313, 2014.

C. M. Tucker, M. W. Cadotte, S. B. Carvalho, T. J. Davies, S. Ferrier, S. A. Fritz, R. Grenyer, M. R. Helmus, L. S. Jin, A. O. Mooers, S. Pavoine, O. Purschke, D. W. Redding, D. F. Rosauer, M. Winter, and F. Mazel. A guide to phylogenetic metrics for conservation, community ecology and macroecology. Biol. Rev. Camb. Philos. Soc., 92:698–715, 2017.

A. P. Vogler and M. T. Monaghan. Recent advances in DNA taxonomy. J Zoological System, 45(1):1–10, Feb. 2007.

B. G. Waring and C. V. Hawkes. Short-term precipitation exclusion alters microbial responses to soil moisture in a wet tropical forest. Microb. Ecol., 69(4):843–854, May 2015.

L. G. Wayne, D. J. Brenner, R. R. Colwell, P. A. D. Grimont, O. Kandler, M. I. Krichevsky, L. H. Moore, W. E. C. Moore, R. G. E. Murray, E. Stackebrandt, M. P. 50

Starr, and H. G. Truper. Report of the ad hoc committee on reconciliation of ap- proaches to bacterial systematics. Int. J. Syst. Evol. Microbiol., 37(4):463–464, 1987.

L. O. Williams. Revision of the western . Am. Midl. Nat., 17(4):741–748, 1936.

P. G. Wolf and R. B. Sinclair. Highly differentiated populations of the narrow endemic plant Maguire Primrose (Primula maguirei). Conserv. Biol., 11(2):375–381, 1997.

D. Wu, A. Hartman, N. Ward, and J. A. Eisen. An automated phylogenetic tree-based small subunit rRNA taxonomy and alignment pipeline (STAP). PLoS One, 3(7), 2008.

Q. Zheng, C. Bartow-McKenney, J. S. Meisel, and E. A. Grice. HmmUFOtu: An HMM and phylogenetic placement based ultra-fast taxonomic assignment and OTU picking tool for microbiome amplicon sequencing studies. Genome Biol., 19(1):82, June 2018. 51

APPENDIX 52

SymbiotaR2 : An R Package for Accessing Symbiota2 Data

Background

Symbiota (http://symbiota.org/docs/) is a virtual platform designed to pro- mote the collaboration of specimen- and voucher-based biodiversity data (Gries et al., 2014). As an open-source repository of occurrence data, Symbiota is a means for museums, herbaria, and other institutions to share the information tied to their bi- ological collections. This includes associated collection data (coordinates, dates, site descriptions, etc.), images, DNA sequence data, taxonomic information, and more. Symbiota operates as a network of portals, where each portal consists of an institu- tion or a group of institutions collecting data associated with a geographic area or a specific taxonomic group. Currently, Symbiota portals exist for the SEINet Net- work of North American Plant Portals, the Consortium of Small Vertebrate Collec- tions, the Consortium of Pacific Herbaria, the Great Lakes Invasives Network, and many others. A more complete list of current Symbiota portals can be found at http: //symbiota.org/docs/symbiota-introduction/active-symbiota-projects/.

In 2018, the Symbiota2 project was started in an effort to rewrite the code un- derlying Symbiota and make it more modular and accessible to different software plat- forms. Lead by a team of specialists from Utah State University and Northern Arizona State University, the impetus for Symbiota2 is driven by feedback from current users of Symbiota looking for expanded functionality. Development for Symbiota2 is ongo- ing, and code for the project can be found online at https://github.com/Symbiota2/ Symbiota2.

As part of the Symbiota2 development effort, I created an R (R Core Team, 2020) package, called SymbiotaR2, that allows users in an R environment to download and ma- nipulate data from a specified Symbiota2 portal. Below, I outline the general structure and size of the SymbiotaR2 package and its commands, and I give a brief description of a potential application of this package. All SymbiotaR2 code can be found online at https://github.com/pearselab/SymbiotaR2. This package is currently undergoing 53 software review through rOpenSci, an online community of R users and developers, so code may be updated in the future as part of that peer review process.

Code Outline and Example

The SymbiotaR2 R package is made up of commands (grouped into R scripts) which each specify a resource to pull from a specified Symbiota2 portal. Once a user is able to obtain access to an existing Symbiota2 portal, or setup a portal of their own (see https: //symbiota2.github.io/Symbiota2/, for documentation), the URL to that portal can be stored in the R environment or a user’s .Rprofile file. That URL is subsequently referenced in a function call from the SymbiotaR2 package, which concatenates onto the URL the necessary path needed to access the specified resource in the portal’s API. Besides the URL, the only arguments given to the command are id, which specifies downloading a resource of a particular ID, or page, which specifies downloading an entire page of resources. Once a resource is downloaded from the Symbiota2 portal, it is converted from its native JSON format into an R object (usually a data.frame) and returned into the global R environment.

For instance, the Coordinates function (in the Checklists.R script) retrieves the geographic coordinates from a digital resource in a given Symbiota2 portal that is tied to a specimen collection. The code for the Coordinates function is given below:

Coordinates <- function(id=NULL, page=NULL, url=NULL){

# Argument handling url <- .get.url(url) RObject <- .api.scaffold(.check.api.entry("checklist/coordinates"), id, page, url)

# ID download if(!is.null(id)){ RObject <- c(RObject$decimalLatitude,RObject$decimalLongitude) return(RObject) } 54

# Page (specified or default) download if(!is.null(page)){ RObject <- RObject$‘hydra:member‘ RObject <- lapply(RObject, function(x) c(x$decimalLatitude,x$decimalLongitude)) output <- as.data.frame(do.call(rbind, RObject)) names(output) <- c("latitude","longitude") return(output) } }

The .check.api.entry function in this code block checks whether there is a re- source that can be retrieved at the URL passed to the function—in this case, the specified portal URL followed by checklist/coordinates. If a resource can be found at that URL, the function proceeds onto either the ID or page sections of the argument; if not, the user is warned that the specified URL cannot be reached.

The core of the SymbiotaR2 package currently consists of 1,520 lines of R code, with files in YAML and R Markdown for documentation purposes (see Table 4.1). After finishing peer review on rOpenSci, the R package will be available with vignettes, help files for all library commands, and built-in tests for ensuring library commands are functional for end uses.

Language Files Blank Comment Code YAML 103 0 0 8133 R 30 217 501 1520 HTML 1 60 5 211 Markdown 1 14 0 43 SUM: 135 291 506 9907

Table 4.1: Output from cloc software (Count Lines Of Code), showing number of files and code lines of different languages making up the SymbiotaR2 package.

Applications and Conclusion

For anyone interested in accessing specimen- or voucher-based biodiversity data, SymbiotaR2 readily allows access to that data from a variety of herbaria and similar 55 institutions within an R environment. As an example, if someone were interested in determining what counties in the United States were represented by an herbarium’s collections, they could send the following command to that collection’s Symbiota2 portal API within R:

queryCounties <- LookupCounties(url="dummy.Symbiota2Portal.url", page=1)

My hope in developing this R package is to grant ecologists, taxonomists, and anyone else who uses R and is interested in biological data a convenient means of accessing the powerful tools that Symbiota2 has to offer.