
TOOLS FOR BIODIVERSITY ANALYSES USING NATURAL HISTORY COLLECTIONS AND REPOSITORIES: DATA MINING, MACHINE LEARNING AND PHYLODIVERSITY

By

CHANDRA EARL

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2020


© 2020 Chandra Earl


ACKNOWLEDGMENTS

I thank my co-chairs and members of my supervisory committee for their mentoring and generous support, my collaborators and colleagues for their input and support and my parents and siblings for their loving encouragement and interest.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

2 GENEDUMPER: A TOOL TO BUILD MEGAPHYLOGENIES FROM GENBANK DATA

Materials and Methods
Implementation
Case Studies
Discussion

3 MACHINE LEARNING DISTINGUISHES BETWEEN AND DISCOVERS PATTERNS OF BIODIVERSITY IN BUMBLEBEES

Materials and Methods
Results
Discussion

4 PHYLOGENETIC ANALYSIS OF NORTH AMERICAN

Materials and Methods
Results
Discussion

5 PHYLODIVERSITY OF NORTH AMERICAN BUTTERFLIES

Materials and Methods
Results
Discussion

6 CONCLUSIONS

LIST OF REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

Table

2-1 The overall results from each of the three GeneDumper runs.

3-1 The three datasets utilized showing the ML implementation, build type used, and number of images and classes.

4-1 The length of each locus used as the seed sequence for GeneDumper input.

4-2 The total number of sequences after quality filtering and duplicate removal and alignment length in total number of base pairs for each locus.

4-3 The distribution of the number of species with sequence information and the percentage of species with distributions in Mexico, but not in the United States or Canada.

4-4 Ages of all families in this study compared to other studies and their 95% confidence intervals.

5-1 Summary of commonly used phylodiversity metrics.

5-2 Summary of the top and null linear models for PD, RPD, and PE.

5-3 Summary of the top and null binomial regression models for randomizations of PD, RPD, and PE.


LIST OF FIGURES

Figure

2-1 A flowchart depicting the two steps of the GeneDumper pipeline (GeneDump and GeneClean) along with their inputs and outputs.

2-2 A flowchart depicting GeneDump and its two major steps: the initial BLAST and species name resolution.

2-3 A flowchart depicting the GeneClean decision-making process to clean and validate sequences.

2-4 GeneDumper phylogeny colored by family.

2-5 GeneDumper Nitrogen-Fixing phylogeny colored by order.

2-6 GeneDumper frog and toad phylogeny colored by superfamily.

3-1 Examples of how bumblebee images were processed before training.

3-2 Architectures of the resulting multi-layer models.

3-3 Confusion matrices for the two smaller neural networks.

4-1 A plot showing the growth of sequences across the 14 loci of interest for North American species over time.

4-2 A plot showing the growth of species across the 14 loci of interest for species found in both Mexico and America and Canada over time.

4-3 A time-calibrated phylogeny of North American butterflies with bootstrap support shown for the deeper nodes of the tree.

5-1 Maps depicting the observed phylogenetic diversity and relative phylogenetic diversity values for North American butterflies.

5-2 Observed phylogenetic endemism values for North American butterflies.

5-3 Maps depicting significant phylogenetic diversity patterns and significant relative phylogenetic diversity patterns for butterflies.

5-4 CANAPE results showing randomization of phylogenetic endemism.

5-5 Significant relative phylogenetic diversity patterns.

5-6 Significant phylogenetic endemism patterns.

5-7 Phylogenetic beta diversity values for North American butterflies.

5-8 Regionalization results and comparisons for North American butterflies.

5-9 Subregion classification of the Eastern US bioregions.


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

TOOLS FOR BIODIVERSITY ANALYSES USING NATURAL HISTORY COLLECTIONS AND REPOSITORIES: DATA MINING, MACHINE LEARNING AND PHYLODIVERSITY

By

Chandra Earl

August 2020

Cochair: Robert P. Guralnick
Cochair: Akito Y. Kawahara
Major: and Genomics

Natural history collections house massive amounts of data for potential use in biodiversity studies. With such large amounts of specimen, genetic and image data available, computational tools specific to these data are becoming more commonplace. This dissertation explores biodiversity tools and pipelines that use large amounts of natural history data. Specifically, I investigate, develop and use pipelines for building megaphylogenies, machine learning models for species classification and data mining techniques for spatial phylodiversity analysis.

These pipelines provide the means to begin testing hypotheses about biodiversity and its organization at scales and extents that have not been achievable before, and in doing so showcase novel findings that cannot be achieved without such approaches. Large-scale informatics tools such as these place natural history museums at the forefront of biodiversity research and on the cusp of big data science.


CHAPTER 1 INTRODUCTION

Enormous growth in both data resources and analytical tools is facilitating a new understanding of biodiversity patterns and their drivers at multiple spatial, temporal and evolutionary scales. For hundreds of years, biologists have mapped species occurrences using inventories and specimen collecting, ultimately building the world’s natural history collections, which are estimated to house 3–4 billion specimens (Ariño, 2010). As these collections grow, both in terms of physical specimens and scope of materials, efforts to digitize and mobilize related data into collections databases have expanded greatly, as have supporting infrastructures. Specimen digitization and data sharing have led to the growth of data aggregators such as the Global Biodiversity Information Facility (GBIF), VertNet, and iDigBio. Large-scale volunteer efforts, such as iNaturalist, have recently extended natural history reporting through digitally recorded observations of species occurrence. Genetic resources for gene sequences, such as GenBank and Barcode Of Life Datasystems (BOLD; Ratnasingham & Hebert, 2007), are creating opportunities to aggregate and analyze specimen data across populations, taxa, and regions while reducing cost, effort, and delays.

As data repositories and resources have grown, the computational tools necessary to work with these rapidly growing databases have been developed simultaneously. Specimen image data and their transcriptions are being used in data intensive analyses, such as machine learning. Genetic data are applied to evaluate and construct larger and more comprehensive evolutionary trees. Approaches for mapping and modelling species distributions are now scaling up to take advantage of the massive growth of species distribution data. Here, I utilize and build tools to take advantage of these data to further biodiversity research.

First, I describe a novel program, GeneDumper, to search and clean GenBank data for a user’s taxon and genes of interest. GenBank data can be mined to produce megaphylogenies, but there are currently few, if any, standardized pipelines or best practice approaches for data mining and cleaning. Attempts to access and integrate data are often plagued with issues stemming from database schema, file formats, data redundancy, numerous independent identifiers and differing curation standards and practices. There have been many individual case studies that produce megaphylogenies; however, the methods used vary widely and are rarely described or recorded in enough detail to allow for replication. A platform in which data can be easily obtained, cleaned, curated, and reported for downstream use is much needed.

Second, I use machine learning on collection image data to classify bumblebee species. Machine learning, specifically neural networks, allows for data extraction and classification using digitized images. One of the most promising uses of properly identified, geolocated, imaged specimens is as a resource for enabling better automated identification in field, lab, and collections. Machine learning techniques, if successful for commercially relevant taxa such as bees, can have immediate applied value for scientists, managers at state and national agencies, agricultural initiatives and for citizen scientists.

Third, I collected genetic data from a variety of resources to build and date a phylogeny of North American butterflies. Recent work has strongly advanced understanding of butterfly phylogenetics and diversification, with a strong focus on resolving deeper relationships in the butterfly tree. The sampling effort to assemble a global-scale phylogeny of butterflies at the species level remains uneven, hampering broader understanding of how butterflies have diversified and of the long-term drivers of that diversity. However, documenting both completeness and which regions and are most in need of further sampling is still rarely performed, even in areas that are presumed to be well sampled, such as North America. Large amounts of online molecular sequence data and the low costs of novel phylogenetic locus sampling sets mean that butterflies are a prime candidate for a continent-wide phylogenetic study.

Last, I examine spatial phylodiversity of North American butterflies using species distribution data and the phylogeny described in Chapter 4. This work required extensive effort to assemble not only high-quality expert-assessed range maps but also a novel approach to range estimation for species whose boundaries extend outside North America. Once these two large datasets were cleaned and names harmonized, I applied a growing set of phylodiversity metrics, including now commonplace measures such as phylogenetic diversity but also relative phylodiversity, phyloendemism, phylobetadiversity and associated phylogenetically-informed bioregionalizations. I utilize these metrics to test a key set of drivers and associations, including a unique and direct quantitative comparison between butterflies and plants. The analyses not only uncover surprising patterns of phylogenetic diversity and endemism but also suggest conservation prioritizations in areas where evolutionarily unique, range-restricted lineages are located.


CHAPTER 2 GENEDUMPER: A TOOL TO BUILD MEGAPHYLOGENIES FROM GENBANK DATA

Phylogenetic sequence data are quickly accumulating from the global effort to build a Tree of Life. GenBank (Clark et al., 2016) is the premier example of a publicly accessible database and set of curation and analysis tools that have catalyzed a revolution in the life sciences. In the area of comparative biology, GenBank has become a critical data resource for systematists, who routinely both contribute to it and utilize its existing resources for new analyses. GenBank stores 216,531,829 sequences across 469,754 species (as of April 2020), and these data are finally reaching a level of quantity and quality where they are being used routinely in larger-scale, global analyses across thousands or more taxa (Jetz et al., 2012; Miraldo et al., 2016). Analyses that solely or primarily re-use GenBank sequence data are now common.

Approaches that use existing sequences and taxonomic hierarchies from databases to create large and densely-sampled supermatrices are often called megaphylogeny approaches (Smith et al., 2009). Supermatrix-built megaphylogenies have many advantages compared to approaches: 1) megaphylogenies include branch lengths necessary to understand rates of evolutionary change and 2) they can account for phylogenetic uncertainty. When megaphylogenies sample the species in a clade densely enough (Sun et al., 2020), it is possible to estimate diversification rates, with attendant challenges (Louca & Pennell, 2019). When connected to other data resources, such as traits and distributions, megaphylogeny approaches promise to also provide long-sought answers to questions about the drivers of the spatial and temporal cadence of . The trend to create ever greater taxon-rich trees is likely to continue with affordable sequencing costs, improvements in resources for sequence assembly, and novel computational approaches on still-improving hardware to scale up phylogenetic analyses. Already, synthesis efforts from the Open Tree of Life (Hinchliff et al., 2015) provide critical infrastructure to synthesize all such trees and produce a coherent, single resource.

Megaphylogenies can infer past evolutionary events and provide more information about large-scale evolutionary processes for future events (Roquet et al., 2013). Large phylogenetic frameworks are useful for classification, forensics, conservation, and studies of infectious disease. While most recognized species do not yet have published sequences, sequence data in GenBank account for 20 to 35% of estimated named species (updated from Federhen, 2012), and this number continues to grow.

Several GenBank mining tools, including pyPHLAWD (Smith & Walker, 2019), PHLAWD (Smith et al., 2009), phyloGenerator (Pearse & Purvis, 2013), and PhyLoTA (Sanderson et al., 2008), are already available to hasten production of megaphylogenies. However, these tools often offer only rudimentary curation features to help users through the often complex workflow steps of megaphylogeny production, and even the key first data mining steps could be further improved. For example, none of the above programs handle taxonomic name issues; they instead rely strictly on GenBank’s names, which are frequently out of date and incorrect. In sum, there have been many individual case studies that produce megaphylogenies (Bininda-Emonds et al., 2007; Kimball et al., 2019; Varga et al., 2019; Zanne et al., 2014; Zheng & Wiens, 2016), but the methods used vary widely and are rarely described or recorded in enough detail to allow for replication. Tools to simplify needed curation steps and enhance reproducibility are still needed.

Here I present a pipeline, GeneDumper, a two-step toolkit to search and clean GenBank data for a user’s taxon and genes of interest. GeneDumper’s main innovation is a reproducible and detailed pipeline to produce a database of cleaned, usable sequences for downstream analyses such as megaphylogeny construction.

Materials and Methods

GeneDumper is built using SQLite v.3.8 (Hipp et al., 2015), using Python v.3.6 (Python Core Team, 2015) as a wrapper for the user interface. SQLite is a self-contained database engine with libraries conveniently built into Python and is used for structuring and querying a database of GenBank sequences. GeneDumper uses self-contained libraries in Biopython v.1.75 (Cock et al., 2009) for interacting with GenBank, performing BLAST (Altschul et al., 1990) searches and aligning sequences.

The GeneDumper pipeline includes two steps: a data gathering step, GeneDump, and a data cleaning step, GeneClean. GeneDump generates a SQLite database of sequences on GenBank, retrieving species information for each sequence. Each species is resolved to a user-inputted taxonomy. GeneDump also produces reports about recovery rates for loci of interest per species. GeneClean takes the database from GeneDump and chooses the best sequences based on length, composition and taxonomic relationships (Figure 2-1).

GeneDump: Two inputs are needed for GeneDump: a FASTA file containing target genes (one sequence per target gene) (for example: https://github.com/sunray1/GeneDumper/blob/master/example_files/Butterfly/D_plexippus_probes.fas) and a CSV file describing a vetted taxonomy for the group of interest (for example: https://github.com/sunray1/GeneDumper/blob/master/example_files/Butterfly/Butterflynet2tax_syns.csv). The sequence(s) contained in the FASTA file are used as probes to search the entirety of the GenBank database. These probe sequences should not be fragmented and should not contain long stretches of ambiguous nucleotides.
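The requirement that probes lack long ambiguous stretches can be checked before a run. The following is a minimal sketch rather than part of GeneDumper: the function names and the cutoff of 10 consecutive ambiguous characters are assumptions chosen for illustration, since the text does not state a specific limit.

```python
import re

def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}

def longest_ambiguous_run(seq):
    """Length of the longest run of characters other than A, C, G or T."""
    runs = re.findall(r"[^ACGT]+", seq)
    return max((len(run) for run in runs), default=0)

def check_probes(fasta_text, max_run=10):
    """Map each probe name to True if it passes the (assumed) ambiguity cutoff."""
    return {name: longest_ambiguous_run(seq) <= max_run
            for name, seq in parse_fasta(fasta_text).items()}

# Toy probes: the second contains a run of 15 Ns and is flagged.
example = ">COI\nATGGCACGTTGA\n>CAD\nATG" + "N" * 15 + "GCT\n"
print(check_probes(example))
```

A probe flagged here would be a candidate for replacement with a cleaner reference sequence before searching.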

GeneDump also requires a CSV file containing the taxonomy for the researcher’s taxon of interest. Taxonomies in GenBank are known to occasionally contain errors due to misidentification (Vilgalys, 2003), and therefore an expert-curated taxonomy is recommended. An input taxonomy that contains taxonomic synonyms is also recommended as GeneDump can utilize this information to determine additional sequences for the same taxon in GenBank. In cases where a taxonomy is unavailable, R scripts can be used to create an NCBI-generated taxonomy from either a list of species or a root taxon (R scripts can be found here: https://github.com/sunray1/GeneDumper/blob/master/taxonomy_files/). While GeneDump is likely to most often apply to a clade of interest (e.g., all species across a family), it can also be used to search across a list of unrelated taxa (e.g., multiple species from a specific geographic region).

Using the FASTA file containing input sequences and CSV taxonomy inputs, GeneDump uses a similarity-based approach (BLAST via Biopython) to search for all sequences of interest on GenBank using the sequence information. The results are then stored in an SQLite database (Initial BLAST; Figure 2-2). The NCBI-described species for each hit is extracted and resolved to the given input taxonomy, attempting to account for misspelled names and potential synonyms (Species Name Resolution; Figure 2-2). From this database, a presence/absence matrix can be generated.
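To illustrate how a presence/absence matrix can be derived from a database of hits, the sketch below uses Python's built-in sqlite3 module. The table name, column names and accession labels are hypothetical and do not reflect the actual GeneDumper schema.

```python
import sqlite3

def presence_absence(conn):
    """Return (loci, matrix) where matrix maps species to a 0/1 list per locus.

    Assumes a hypothetical table hits(accession, species, locus).
    """
    loci = [row[0] for row in
            conn.execute("SELECT DISTINCT locus FROM hits ORDER BY locus")]
    matrix = {}
    query = ("SELECT DISTINCT species, locus FROM hits "
             "WHERE species IS NOT NULL")
    for species, locus in conn.execute(query):
        row = matrix.setdefault(species, [0] * len(loci))
        row[loci.index(locus)] = 1
    return loci, matrix

# Toy in-memory database with made-up accession labels.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (accession TEXT, species TEXT, locus TEXT)")
conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", [
    ("ACC1", "Danaus plexippus", "COI"),
    ("ACC2", "Danaus plexippus", "CAD"),
    ("ACC3", "Papilio glaucus", "COI"),
    ("ACC4", "Papilio glaucus", "COI"),
])
loci, matrix = presence_absence(conn)
print(loci)
for species in sorted(matrix):
    print(species, matrix[species])
```

Each row of the resulting matrix records, for one resolved species, which loci have at least one hit, which is the information needed to report recovery rates per locus.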

The search approach utilized in GeneDump uses BLAST to search GenBank’s nucleotide database based on likelihood-matching criteria of the target sequences themselves (Figure 2-2, Box 1.1). Sequences that closely match the user-inputted probe sequences are found, regardless of what they are named. This similarity approach allows GeneDump to use likelihood scores to search for orthologs, whereas simple name search approaches would miss sequences with variant spelling or other name issues. Using a filter for the group(s) of interest and a word size lowered from 11 to 7 in order to get more hits, each search returns all possible hits for that probe. After initial collation of hits, predicted hits above an e-value threshold of 0.001 are filtered out (Figure 2-2, Box 1.2) and a final number of nucleotide mismatches and amount of query coverage are calculated for each hit.
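The filtering and summary step can be sketched as follows. Only the e-value threshold of 0.001 comes from the text; the dictionary keys, the function names and the simplification of counting every non-identical alignment column (including gaps) as a mismatch are assumptions made for illustration.

```python
E_VALUE_MAX = 0.001  # threshold stated in the text

def summarize_hit(hit, query_length):
    """Return a copy of the hit with mismatch count and query coverage added."""
    aligned_query_span = hit["query_end"] - hit["query_start"] + 1
    return {
        **hit,
        # Simplification: non-identical columns, so gaps count as mismatches.
        "mismatches": hit["align_length"] - hit["identities"],
        "query_coverage": aligned_query_span / query_length,
    }

def filter_hits(hits, query_length):
    """Drop hits above the e-value threshold and summarize the rest."""
    return [summarize_hit(hit, query_length)
            for hit in hits if hit["evalue"] <= E_VALUE_MAX]

# Two made-up hits against a 1,200 bp probe.
hits = [
    {"id": "hit_a", "evalue": 1e-50, "query_start": 1, "query_end": 600,
     "align_length": 600, "identities": 570},
    {"id": "hit_b", "evalue": 0.5, "query_start": 10, "query_end": 40,
     "align_length": 31, "identities": 25},
]
kept = filter_hits(hits, query_length=1200)
print([(h["id"], h["mismatches"], h["query_coverage"]) for h in kept])
```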

Once the initial BLAST step of GeneDump is finished, the sequences move to the species resolution step. Species names are extracted from sequence descriptions for each hit (Figure 2-2, Box 2.1), and species names from filtered hits are resolved to the taxonomy inputted by the user (Figure 2-2, Box 2.2). Ambiguous names (e.g., ‘sp.’ or ‘gen.’) are labeled as ‘Unknown species’. GeneDump also checks what NCBI has described as the resolved species name and attempts to check the names for misspellings. The program also considers differences in spellings due to Latin suffixes – such as ‘-us’ and ‘-a’; ‘-i’ and ‘-ii’. If an unresolved species has not been updated to reflect a change from a species to a subspecies (e.g., lumping) or vice versa (splitting), GeneDump searches for names in the taxonomy that would match these changes and attempts to update the unresolved species. Manual changes can also be inputted by the user if the program cannot resolve certain species names by itself. All these calculations and changes are inputted or documented in the SQLite file and can be viewed using a SQLite file viewer (e.g., https://sqlitebrowser.org/).
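A simplified sketch of this name resolution logic is given below. It covers only ambiguous names, the Latin suffix pairs named above, and synonym lookup; handling of lumping and splitting is omitted. The function names, the data structures and the placeholder taxon names are hypothetical and are not taken from GeneDumper.

```python
AMBIGUOUS_TOKENS = {"sp.", "gen."}        # examples given in the text
SUFFIX_PAIRS = [("us", "a"), ("i", "ii")]  # suffix pairs given in the text

def suffix_variants(epithet):
    """Alternative spellings produced by swapping paired Latin suffixes."""
    variants = set()
    for first, second in SUFFIX_PAIRS:
        for old, new in ((first, second), (second, first)):
            if epithet.endswith(old):
                variants.add(epithet[: len(epithet) - len(old)] + new)
    variants.discard(epithet)
    return variants

def resolve_name(name, valid_names, synonyms):
    """Resolve a GenBank name to the input taxonomy.

    Returns the valid name, 'Unknown species' for ambiguous names,
    or None when the name is left for manual resolution.
    """
    tokens = name.split()
    if len(tokens) < 2 or any(tok in AMBIGUOUS_TOKENS for tok in tokens):
        return "Unknown species"
    genus, epithet = tokens[0], tokens[1]
    candidates = [f"{genus} {epithet}"]
    candidates += sorted(f"{genus} {v}" for v in suffix_variants(epithet))
    for candidate in candidates:
        if candidate in valid_names:
            return candidate
        if candidate in synonyms:
            return synonyms[candidate]
    return None

# Placeholder taxonomy with invented names.
valid_names = {"Aus albus", "Bus smithii"}
synonyms = {"Cus niger": "Aus albus"}
for query in ["Aus alba", "Bus smithi", "Cus niger", "Aus sp.", "Dus novus"]:
    print(query, "->", resolve_name(query, valid_names, synonyms))
```

Names that return None correspond to the cases described above in which the user supplies a manual change.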

GeneClean: The second major step of the GeneDumper pipeline validates and cleans resolved sequences using alignment and phylogeny-based approaches to remove paralogs, uninformative and mislabeled sequences. GeneClean outputs the best sequences for each gene/species pair, which researchers can export into a FASTA file for use in their research. A flowchart of the decision-making steps of GeneClean is shown in Figure 2-3.

Sequences are separated into those that have multiple sequences per species and those that do not (Figure 2-3, Box 1). For those with duplicates, GeneClean chooses the longest sequence with the most unambiguous DNA content (e.g., lacking “N”s or IUPAC ambiguity codes) (Figure 2-3, Box 2). Sequences that cannot be resolved by sequence length and data quality are compared using a cluster analysis (Figure 2-3, Box 3). If there are only two sequences, the cluster analysis chooses randomly. If there are three or more sequences, the cluster analysis creates a 50% consensus sequence based on the sequences, replacing bases with Ns if there are ambiguous sites. The program then takes each sequence, aligns it to the consensus and calculates identity. The sequence(s) with the highest identity are chosen. If the sequences can be resolved by length or information, the program performs a self-BLAST of each chosen sequence against the original database to make sure the most similar hit is within the same species (Figure 2-3, Box 4). If the chosen sequence does not BLAST back to the same species, then this sequence may be misidentified or the original candidates may cover different sections of the gene. To test for misidentifications, all original hits are first aligned for that species using MUSCLE v.3.8 (Edgar, 2004) and percent identity is calculated (Figure 2-3, Box 5). If the identity is above 95%, the original choice is picked. If the identity is below 95%, each hit is aligned to the full-length gene sequence to see if the hits align to separate parts of the gene. If they do, then GeneClean chooses the fewest sequences needed to cover as much of the gene as possible (tiling analysis) (Figure 2-3, Box 6). If not, GeneClean checks to see if one hit is creating problems for the whole alignment by blasting each sequence within the alignment against the original database and checking to see which hit has results that are most closely related to the species of the hit, also removing misidentified sequences (Figure 2-3, Box 7).
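The first two decision steps (choice by length and unambiguous content, then comparison to a 50% consensus) can be sketched as below. This is an illustration under stated assumptions, not the GeneClean implementation: sequences are assumed to be already aligned and of equal length, ranking is simplified to the count of unambiguous bases, and all function names are hypothetical.

```python
from collections import Counter

UNAMBIGUOUS = set("ACGT")

def unambiguous_length(seq):
    """Number of A, C, G or T characters in a sequence."""
    return sum(base in UNAMBIGUOUS for base in seq.upper())

def best_by_length(seqs):
    """Sequences tied for the most unambiguous bases."""
    top = max(unambiguous_length(seq) for seq in seqs)
    return [seq for seq in seqs if unambiguous_length(seq) == top]

def majority_consensus(aligned, threshold=0.5):
    """Consensus of aligned sequences; sites without a clear majority become N."""
    consensus = []
    for column in zip(*aligned):
        base, count = Counter(column).most_common(1)[0]
        is_clear = count / len(column) > threshold and base in UNAMBIGUOUS
        consensus.append(base if is_clear else "N")
    return "".join(consensus)

def identity(seq, reference):
    """Fraction of matching positions, ignoring N sites in the reference."""
    pairs = [(a, b) for a, b in zip(seq, reference) if b != "N"]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)

def closest_to_consensus(aligned):
    """Sequence(s) with the highest identity to the majority consensus."""
    consensus = majority_consensus(aligned)
    scores = [identity(seq, consensus) for seq in aligned]
    best = max(scores)
    return [seq for seq, score in zip(aligned, scores) if score == best]

print(best_by_length(["ACGTNN", "ACGTAC", "ACG"]))
candidates = ["ACGTACGT", "ACGTACGA", "ACGTTCGT"]
print(majority_consensus(candidates))
print(closest_to_consensus(candidates))
```

When several sequences tie on identity to the consensus, all are returned here, mirroring the statement that the sequence(s) with the highest identity are chosen.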

Cleaned and validated sequences for each locus can then be downloaded and aligned using LONGREF_ALIGN.py (https://github.com/sunray1/working_scripts/blob/master/LONGREF_ALIGN.py) or using the researcher’s aligner of choice. The iterative approach implemented in LONGREF_ALIGN.py accounts for alignment issues when aligning fragmented sequences to longer, full-length sequences. The sequences that are being aligned are first split into two groups – those that are longer than the median length of all sequences and those that are shorter. This assumes the longer sequences are the full-length sequences and the shorter are fragments. Using MAFFT v.7 (Katoh & Standley, 2014) to produce a global alignment, full-length sequences are aligned, then each fragment is aligned to this full-length global alignment. During this process, reverse-complemented sequences are also checked. This script also checks the number of full-length and fragmented sequences. If there are more than 50 of either group, the script splits these into chunks. Chunks are iteratively aligned to increase alignment accuracy.
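The grouping logic described above can be sketched in a few lines. This is not the LONGREF_ALIGN.py code: the function names are hypothetical, sequences exactly at the median length are assigned to the fragment group by assumption, and a chunk size of 50 is assumed because the text gives only the threshold at which splitting begins. The alignment itself (MAFFT) is not shown.

```python
from statistics import median

def split_by_median(seqs):
    """Split {name: sequence} into (presumed full-length, presumed fragments)."""
    cutoff = median(len(seq) for seq in seqs.values())
    full = {name: seq for name, seq in seqs.items() if len(seq) > cutoff}
    fragments = {name: seq for name, seq in seqs.items() if len(seq) <= cutoff}
    return full, fragments

def chunk(names, size=50):
    """Split a list of sequence names into consecutive chunks of at most size."""
    return [names[i:i + size] for i in range(0, len(names), size)]

# Toy sequences of differing lengths.
seqs = {"a": "A" * 1500, "b": "A" * 1480, "c": "A" * 600,
        "d": "A" * 450, "e": "A" * 300}
full, fragments = split_by_median(seqs)
print(sorted(full), sorted(fragments))
print([len(part) for part in chunk(list(range(120)))])
```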

Implementation

The above pipeline is implemented in GeneDumper, which is open source and available, along with documentation and walkthroughs, at https://github.com/sunray1/GeneDumper. Requirements include Python v.3.6+, MUSCLE v.3.8 and BLAST Command Line Applications toolkit v.2.9+ (https://www.ncbi.nlm.nih.gov/books/NBK279690), along with two Python packages: Biopython v.1.75+ and SQLite v.3.8 (included in the Python standard library). No installation or compilation is necessary. MUSCLE and BLAST toolkit executables must be located in the user’s PATH. GeneDumper is licensed under the MIT License.

The two steps described above are implemented in two scripts – GeneDump.py and GeneClean.py, which should be run sequentially. Both scripts contain a series of steps that can be specified (if not, all steps will run), which correspond to the different steps in the pipeline described above. Specific details can be found in the documentation and walkthrough pages.

Case Studies

We test GeneDumper on three taxa with different numbers of species and available sequences. The three taxa are: butterflies (Animalia: Insecta: Papilionoidea), frogs and toads (Animalia: Amphibia: Anura), and nitrogen-fixing plants (Plantae: Fabids: Fabales, Rosales, Cucurbitales, and Fagales). GeneDump was used to run the initial BLAST search for each group, resolve species names from each hit to the taxonomy and output presence/absence matrices for genes at the species and genus level. GeneClean was used to clean the sequences and output a final FASTA file.


For each taxonomic group, loci were aligned and renamed using provided scripts (LONGREF_ALIGN.py and rename_seqs.py) and examined using AliView v.1.26 (Larsson, 2014). Sequences were translated, flanking regions and gaps were removed by hand, making sure to keep each sequence within reading frame, and checked for stop codons. For each aligned locus, FastTree v.2.1 (Price et al., 2009) was used to create individual gene trees, which were then visualized using FigTree v2.0 (Rambaut, 2010). Unusually long branches and misplaced species were pruned from the locus alignment. Cleaned, aligned sequences were then concatenated into a supermatrix using FASconCAT v.1.0 (Kück & Longo, 2014) and a tree constructed in FastTree. Branches were colored based on classification and rooted following relationships of ingroup taxa to outgroups in recent phylogenetic studies.

Butterflies

Butterflies (: Papilionoidea) play an important role in research on , community , , climate change, and plant-insect interactions (Roe et al., 2009). Thanks to the efforts of centuries of collectors and enthusiasts, more is known about their morphology, species distributions, behavior, and larval host plants than any other insect group. Butterflies have also been the subject of many genetic studies, resulting in a database containing more than 800,000 sequences across more than 18,000 species, making it a prime candidate for studies reusing genetic data (Janzen et al., 2011).

Fifteen full-length, common insect markers from the monarch butterfly, Danaus plexippus, genome were used as input into GeneDumper. These included 13 nuclear genes: arginine kinase (ArgKin), ribosomal proteins S2 and S5 (RpS2/RpS5), carbamoyl phosphate synthetase (CAD), catalase (CAT), glyceraldehyde-3-phosphate dehydrogenase (GAPDH), elongation factor 1 alpha (Ef1a), acetyl-CoA acetyl thiolase (CoA), dopa decarboxylase (DDC), malate dehydrogenase (MDH), hairy cell leukemia (HCL), isocitrate dehydrogenase (IDH), and wingless (Wgl). Two mitochondrial genes were also included: cytochrome oxidase subunit 1 (COI) and 2 (COII). The taxonomy used was an update from the list produced by Lamas (2004) as part of the larger ButterflyNet project. This list contains current valid names that reflect recent literature updates, along with synonyms. We were able to find 68,768 sequences, of which 24,797 (36%) were clean and used for the tree. These sequences represent 6,072 of 19,219 (32%) of the species in the taxonomy. Results are depicted in Table 2-1. Figure 2-4 shows the phylogeny colored by family and rooted to the family Papilionidae (Breinholt et al., 2018; Kawahara & Breinholt, 2014; Mutanen et al., 2010; Regier et al., 2013).

Nitrogen-Fixing Plants

Nitrogen is essential in building basic organic compounds, such as DNA, RNA and amino acids. However, most organisms cannot access atmospheric nitrogen. Symbiotic relationships between certain bacteria and plants allow atmospheric nitrogen to be converted into nitrogen that is available for metabolism by most organisms. This process, called nitrogen “fixation” (rendering it bioavailable), is essential for life as a critical step in the nitrogen cycle (Vicente & Dean, 2017). The importance of nitrogen-fixing plants for both basic and applied research has led to significant phylogenetic research, spanning hundreds of loci across thousands of species, on this highly diverse group of flowering plants, which is contained in the clade made up of the orders Cucurbitales, Fabales, Fagales and Rosales (Ané et al., 2018; Werner et al., 2014).

Four loci that are frequently used in plant phylogenetic studies were utilized for GeneDumper. These include three chloroplast loci: ATP synthase subunit beta (atpb), maturase kinase (matK), and ribulose-bisphosphate carboxylase (rbcL) and one mitochondrial locus: maturase (matR). The taxonomy for all species within each of the four orders of the nitrogen-fixing clade was downloaded from The Plant List (The Plant List, 2013), which included synonyms. We were able to find 38,563 sequences, of which 8,407 (22%) were clean and used for the tree. These sequences represent 5,331 of 38,563 (14%) of the species in the taxonomy. Results for this run are depicted in Table 2-1. Figure 2-5 shows the phylogeny colored by order and rooted to the order Fabales (Stevens, 2001).

Frogs and Toads

The Anura (frogs and toads) are part of one of the most diverse radiations of terrestrial vertebrates (Feng et al., 2017; AmphibiaWeb, 2020), with the number of described species increasing rapidly. Due to the group's large diversity (~7,000 species), there are frequent reclassifications and controversies regarding the taxonomic and phylogenetic relationships of these species (Hedges & Kumar, 2009; Zhang et al., 2013). Numerous studies have generated large amounts of genetic data, which are available on GenBank (Pyron & Wiens, 2011; Feng et al., 2017; Zhang et al., 2013).

The African Clawed Frog, Xenopus laevis, is a popular model organism for biomedical research. Its genome and many of this species' genes have now been sequenced (Karimi et al., 2018; Session et al., 2016). Thirteen loci that have been used for both deep and shallow taxonomic levels of Anura were chosen using the reference sequence of X. laevis in GeneDumper. The loci were: C-X-C chemokine receptor type 4 (CXCR4), histone 3a (H3A), sodium–calcium exchanger (NCX1), recombination-activating gene 1 (RAG1), rhodopsin (RHOD), seventh-in-absentia (SIA), solute-carrier family 8 (SLC8A3), and tyrosinase (TYR). Four mitochondrial genes were also included: cytochrome b (cyt-b), the large and small subunits of the mitochondrial ribosome genes (12S and 16S) and cytochrome oxidase subunit 1 (COI).

The taxonomy of Anura was downloaded from GenBank using an R script, taxo_of_taxa.R (https://github.com/sunray1/GeneDumper/blob/master/taxonomy_files/taxo_of_taxa_ex.R). Ambiguous species without names were excluded (e.g., Anura sp.). We were able to find 66,677 sequences, of which 19,386 (29%) were clean and used for the tree. These sequences represent 4,463 of 5,059 (88%) of the species in the taxonomy. Results for this run are depicted in Table 2-1. The phylogeny was colored by superfamily and was rooted to the superfamily Leiopelmatoidea (AmphibiaWeb, 2020; Worthy, 1987). It is shown in Figure 2-6.
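The percentages reported for the three case studies can be recomputed directly from the counts given in the text. The short sketch below does only that arithmetic; the dictionary layout and function name are illustrative and are not part of GeneDumper.

```python
# Counts as reported in the three case studies.
runs = {
    "Butterflies": {"found": 68768, "clean": 24797,
                    "species": 6072, "taxonomy": 19219},
    "Nitrogen-fixing plants": {"found": 38563, "clean": 8407,
                               "species": 5331, "taxonomy": 38563},
    "Frogs and toads": {"found": 66677, "clean": 19386,
                        "species": 4463, "taxonomy": 5059},
}

def percent(part, whole):
    """Percentage of whole represented by part, rounded to the nearest integer."""
    return round(100 * part / whole)

summary = {
    name: (percent(r["clean"], r["found"]), percent(r["species"], r["taxonomy"]))
    for name, r in runs.items()
}
for name, (clean_pct, species_pct) in summary.items():
    print(f"{name}: {clean_pct}% of sequences clean, "
          f"{species_pct}% of species represented")
```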

Discussion

GeneDumper will be useful across a diversity of biological disciplines, including comparative biology, macroecology, spatial phylodiversity and others that utilize large-scale genetic data. The three examples presented here demonstrate the utility of GeneDumper for querying GenBank for sequences of interest. Input taxonomies can be built from third-party databases, as in the butterfly and plant examples, or can be downloaded from GenBank, as in the anuran example. Similar datasets can also be created to understand gaps in sequencing efforts or to calculate rates of species or sequence increase over time.

Several existing programs, such as pyPHLAWD (Smith & Walker, 2019), PHLAWD (Smith et al., 2009), phyloGenerator (Pearse & Purvis, 2013), and PhyLoTA (Sanderson et al., 2008), query GenBank with the goal of building phylogenies, whereas GeneDumper serves to create a database of sequences that are chosen based on a priori decisions. GeneDumper uses orthologous sequence matching, allowing objective, accurate matches while attempting to circumvent requirements to use GenBank taxonomy if desired, unlike phyloGenerator, which uses sequence label matching.

Sequence matching also provides a means to search GenBank for full-length sequences within nuclear, mitochondrial and chloroplast genomes. This means that input sequences do not necessarily have to be named loci; they can be unnamed regions of larger genomes. GeneDumper also allows for the acquisition of sequences for specific species of interest rather than for all species contained under one root taxon. This is useful in regional phylodiversity analyses, where researchers may only be interested in species located within a certain country or region.

The two particularly novel characteristics of GeneDumper are the ability for a user to input their own taxonomy and the step-wise decision making for determining the sequences best suited to phylogenetic analyses. None of the above programs can use novel taxonomies and their synonymies in order to capture the most sequence data and support the simplest possible name management process for end users. Instead of relying on GenBank’s names and synonyms, which are frequently out of date and incorrect, a user can input their own. No downstream reconciliation of species and their synonyms or name cleaning is required, as this is all performed by GeneDumper.

The step-wise decision-making process implemented in GeneClean is another novel characteristic of GeneDumper. phyloGenerator implements a similar approach to choosing sequences, but it is less detailed than the one presented here. Instead, it chooses between multiple sequences either at random, according to the median, maximum or minimum length of sequences on GenBank, or with reference to a target gene length. The approach implemented in PHLAWD and pyPHLAWD uses either a baited or a cluster analysis, with a modified orthologous search method that searches an NCBI database for a locus whose sequence diversity is represented by a group of seed sequences. Sequences with the same species x locus pairing are chosen based on coverage and identity. The multistep testing approach here also uses coverage and identity as criteria for some decisions, but it implements further analyses to refine those decisions, such as the self-BLAST and tiling analyses.
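To make the idea of choosing among sequences with the same species x locus pairing concrete, the following Python sketch keeps one hit per pairing, preferring higher coverage and breaking ties on identity. This ordering rule, the `Hit` record and the accession strings are assumptions for illustration only; it reproduces neither PHLAWD's nor GeneClean's actual decision rules.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    accession: str   # hypothetical identifier
    species: str
    locus: str
    coverage: float  # fraction of the seed sequence covered (0-1)
    identity: float  # fraction of identical positions (0-1)

def best_hits(hits):
    """Keep one hit per (species, locus): highest coverage, then identity."""
    best = {}
    for h in hits:
        key = (h.species, h.locus)
        current = best.get(key)
        if current is None or (h.coverage, h.identity) > (
            current.coverage,
            current.identity,
        ):
            best[key] = h
    return best

hits = [
    Hit("ACC001", "Xenopus laevis", "RAG1", 0.80, 0.99),
    Hit("ACC002", "Xenopus laevis", "RAG1", 0.95, 0.97),
    Hit("ACC003", "Xenopus laevis", "COI", 0.90, 0.98),
    Hit("ACC004", "Xenopus laevis", "COI", 0.90, 0.99),
]
chosen = best_hits(hits)
for (species, locus), h in sorted(chosen.items()):
    print(species, locus, h.accession)
```

A single ranking like this cannot detect chimeric or mislabeled sequences, which is the motivation for the additional self-BLAST and tiling steps described above.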

PHLAWD and pyPHLAWD also require the separate download and creation of the NCBI nucleotide databases for the user’s taxa of interest. While these databases are useful for multiple downstream searches with different parameters (a locally stored database is faster to query), they are large and take time to download. To update the available sequences for any subsequent analysis, these databases would have to be downloaded again each time. Since GeneDumper uses NCBI’s BLAST command-line toolkit to query the databases online (not locally), there is much less need for BLAST database upkeep. The only database kept locally in GeneDumper is the one containing BLAST search matches, not the entire nucleotide database (or even a subset).

As with any pipeline, and no matter how good the workflow for cleaning and choosing high-quality sequences, there is still a need for manual curation. Here, we have attempted to reduce the manual curation steps by providing a suite of cleaning tools. However, some vetting is still needed to improve the quality of the sequence output files. Manual curation includes checking for poorly aligned sequences or badly misspelled species names.

While GeneDumper uses a high e-value in the initial BLAST search in order to be as inclusive as possible, sequences that are very divergent from the input seed may still be missed. Implementing an iterative BLAST search, similar to PSI-BLAST (Altschul et al., 1997), may improve outputs and could be a next step. PSI-BLAST builds a profile from protein sequence alignments, uses it to search the database for new matches, and updates it for subsequent iterations. This method could help find evolutionarily distant sequences in the database.

In summary, GeneDumper is quick to download and easy to run, requiring no installation and very few, easy-to-install dependencies. A user-supplied taxonomy and cleaning toolkit allow for improved results, reducing hand curation and increasing reproducibility. GeneDumper is a next-generation tool that can be used to mine the growing data resources in GenBank and to support the curation and cleaning needed to effectively build megaphylogenies.

Figure 2-1. The two steps of the GeneDumper pipeline (GeneDump and GeneClean) along with their inputs and outputs.

Figure 2-2. GeneDump and its two major steps: initial BLAST and species name resolution. The two input files (red), subsequent steps (yellow), and the output, taxonomically resolved sequences (green). Boxes are labeled with the respective step numbers as described in the Methods.

Figure 2-3. GeneClean decision making process to clean and validate sequences. Input data (purple), questions that are being asked by the pipeline (yellow), results from those questions (blue) and output to sequences that are chosen (green). Boxes are labeled with the respective step numbers as described in the Methods.

Figure 2-4. GeneDumper butterfly tree, based on 15 loci and 6,336 taxa; branches colored by family. Butterfly thumbnails were obtained from digitized McGuire Center for Lepidoptera & Biodiversity (MGCL) images with permission.

Figure 2-5. GeneDumper Nitrogen-Fixing clade tree based on 4 loci and 5,335 taxa; branches colored by order. Thumbnails were taken from the University of Florida Herbarium (FLAS), with permission.

Figure 2-6. GeneDumper frog and toad tree based on 13 loci and 4,318 taxa; branches colored by superfamily. Thumbnails were images licensed under Noncommercial Creative Commons (backgrounds removed) taken from iNaturalist.

Table 2-1. Results from three GeneDumper runs.

                                            Butterflies       Nitrogen-fixing   Frogs and toads
                                            (Papilionoidea)   plants            (Anura)
Species in taxonomy                         19,219            38,563            5,059
Number of sequences (before GeneClean)      68,768            33,543            66,677
Number of sequences (after GeneClean)       24,797            8,407             19,386
Species with ≥ 1 locus (before GeneClean)   6,085             8,035             4,468
Species with ≥ 1 locus (after GeneClean)    6,082             5,909             4,463
Species with ≥ 3 loci (before GeneClean)    3,172             930               2,470
Species with ≥ 3 loci (after GeneClean)     3,167             721               2,451
Locus with most sequences                   COI               matK              12S
Number of species in tree                   6,072             5,331             4,318

CHAPTER 3 MACHINE LEARNING DISTINGUISHES BETWEEN SPECIES AND DISCOVERS PATTERNS OF BIODIVERSITY IN BUMBLEBEES

Museum specimens have enormous potential for use in a broad range of societally relevant questions (Funk, 2018), but their data have historically been accessible only to researchers who can physically visit collections. Recent collection digitization efforts, such as GBIF (GBIF, 2018) and others (Atlas of Living Australia, 2020; IDigBio, 2011; SpeciesLink, 2020; Constable et al., 2010 (VertNet)), provide not only digital specimen records but often specimen images and associated metadata (Page et al., 2015; Vollmar et al., 2010; Wen et al., 2015). Specimen imaging is not a new endeavor, but a combination of decreasing costs, standards development, and better infrastructure support, along with the growth of 3D techniques, has expanded the media-creation enterprise significantly in the last decade. Industry-scale imaging of specimens is thus a new frontier, and it links to broader field-based citizen science efforts that are also rapidly expanding (Barve et al., 2020).

One of the most promising uses of properly identified, geolocated, imaged specimens is to serve as a resource to aid automated identification. Automated approaches to image identification have shown great promise for plants and mammals (Lee et al., 2017; Miao et al., 2019; Wäldchen et al., 2018; Weinstein, 2018) and are just beginning to be applied to the most megadiverse clade of macrobiota on the planet, the insects (e.g., Hansen et al., 2020; Wen & Guyer, 2012; Yang et al., 2015). Machine learning techniques, if successful for commercially relevant insect taxa such as bees, can have immediate applied value for biodiversity researchers, state and national land managers, agricultural initiatives, and citizen scientists.

Machine learning provides a means to classify objects such as images based on multi-layered models. These neural networks eliminate the need for much preprocessing of input data, which simplifies tool development and lowers barriers to broad use of these approaches. Convolutional neural networks (CNNs) are a type of supervised, multi-layer model that allows users to extract features from a set of images, which are then assigned to known classifications. CNNs have shown remarkable success in classification accuracy, provided that training datasets are themselves well described and large enough. Once a model is trained, it can use features extracted during that step to make predictions about new data objects. CNNs are especially well suited to image recognition, classification and detection and have been used in applications such as facial recognition, product recommendations, and natural language processing (Lecun et al., 2015; Zeng et al., 2018). CNNs that attempt to determine species identity using trained models based on existing collection digitization efforts can significantly speed that process (Wäldchen & Mäder, 2018).

Here, we trained CNNs to distinguish among subgenera and species of bumblebees (Bombus spp.). Bumblebees are an important group of insect pollinators that provide dominant pollination services for both agricultural systems and biological communities (Colla et al., 2012; Hegland & Totland, 2008; Soroye et al., 2020). Specimens are easily captured or recorded because they are large, colorful, conspicuous and diurnal. These same features, including distinctive coloration, make them particularly amenable to species identification from images. Phylogenetic relationships between species are strongly supported and the currently accepted subgeneric groupings are monophyletic (Cameron et al., 2007). Subgenera are largely categorized by morphological diagnosability (excluding geography) (Williams et al., 2008), making the subgenus a distinguishable classification level ideal for computer vision. Here we test the ability of CNN approaches to detect species based on a relatively limited training dataset drawn from museum specimen data and discuss next steps for improving accuracy and for scaling up this prototype work.

Materials and Methods

Specimen digitization: Bombus specimens from the United States National Museum of Natural History (USNM), in Washington, D.C., USA, were imaged using a traditional light box digitization setup. The collection of bumblebees at the USNM consists of 45,000 specimens and is one of the world’s largest bumblebee collections, with representative specimens of all 15 subgenera and 178 of 250 described species (Smithsonian Collections, 2020). We were able to use images from all subgenera and species groups as inputs into our models.

Imaging specimens does not solve the issue of obtaining usable digital data from specimen labels, which contain critical metadata, especially identification to species. Here, label transcriptions were completed by citizen volunteers through the Smithsonian Transcription Center (https://transcription.si.edu/). These images and transcriptions are available online through the Smithsonian Collections (https://www.si.edu/collections).

Image classification and augmentation: Species names provided on labels from imaged specimens were verified by comparing those names against the phylogeny of Bombus (Williams et al., 2008). These identifications were updated when necessary, as some older names are no longer valid given results in Williams et al. (2008). A final dataset of species names associated with images was used in model training as described below.

Specimen images were cropped, resized and transformed using PythonMagick (https://github.com/ImageMagick/PythonMagick) in an executable script so images could be edited in batches (https://github.com/sunray1/Machine_learning/blob/master/image_editor.py). High-resolution JPEG images (2820 x 1680 pixels) were cropped close to the bee to decrease the amount of background noise. These images were then resized to allow faster computational processing. Resizing factors (i.e., the amount of resizing) were determined by how many groups were being tested: more groups required more computational power, which in turn required smaller images.

Some species had a limited number of specimen images. We therefore used image augmentation as a means to increase dataset size when training the model (Shorten & Khoshgoftaar, 2019). We implemented augmentation in the following way: each image was flipped horizontally, flipped vertically, flipped both horizontally and vertically, rotated by 45 degrees, and rotated by 315 degrees. Once images were cropped, resized, augmented and relabeled according to their taxonomic group, the resulting data objects were used for training the neural network.
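The five augmentations listed above can be sketched in plain Python on an image stored as a list of pixel rows. This is an illustrative stand-in, not the PythonMagick script used here: the function names are hypothetical, the rotation uses simple nearest-neighbor sampling on a fixed canvas, and the rotation direction convention is arbitrary.

```python
import math

def flip_horizontal(img):
    """Mirror left-to-right."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Mirror top-to-bottom."""
    return [row[:] for row in img[::-1]]

def rotate(img, degrees, fill=0):
    """Rotate about the image center (nearest neighbor, same canvas size)."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    t = math.radians(degrees)
    cos_t, sin_t = math.cos(t), math.sin(t)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Map each output pixel back to its source coordinates.
            dx, dy = x - cx, y - cy
            sx = round(cx + cos_t * dx - sin_t * dy)
            sy = round(cy + sin_t * dx + cos_t * dy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
    return out

def augment(img):
    """Return the five augmented variants described in the text."""
    return [
        flip_horizontal(img),
        flip_vertical(img),
        flip_vertical(flip_horizontal(img)),
        rotate(img, 45),
        rotate(img, 315),
    ]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
variants = augment(image)
print(len(variants))
```

If the original image is kept alongside its five variants, each specimen image contributes six images to the dataset. Corners that rotate off the canvas are lost and empty corners are filled with a constant, which is one reason cropping close to the specimen before rotating is helpful.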

Dataset creation: We created three datasets with different numbers of classes using different types of groupings (subgeneric vs species). Each of the three datasets was randomly partitioned into 3 non-overlapping sets – 70% were used for training the model, 20% for model validation while training and 10% for final testing (Table 3-1).
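A random, non-overlapping 70/20/10 partition of this kind can be sketched as follows. The function name, the fixed seed and the rounding rule (round the first two partitions, give the remainder to the test set) are assumptions for illustration, not a description of the code used in this study.

```python
import random

def split_dataset(items, fractions=(0.7, 0.2, 0.1), seed=0):
    """Shuffle items and cut them into train/validation/test partitions."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n = len(items)
    n_train = round(n * fractions[0])
    n_val = round(n * fractions[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder goes to the test set
    return train, val, test

train, val, test = split_dataset(range(200_274))
print(len(train), len(val), len(test))
```

Under this rounding rule, 200,274 items are divided into partitions of 140,192, 40,055 and 20,027, which are the sizes listed for Datasets 2 and 3 in Table 3-1.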

Dataset 1 consists of our coarsest grouping into two subgenera. Alpinobombus and Psithyrus were chosen because of their distinguishing morphological differences, which we expected to give CNNs the best chance of performing well. Cuckoo bumblebees (subgenus Psithyrus) are specialized social parasites that, unlike Alpinobombus, do not have pollen baskets on the hind legs (Williams & Williams, 1998). Alpinobombus is a small monophyletic subgenus particularly adapted and restricted to the coldest areas of the overall spatial distribution of bumblebees, such as Alaska, Fennoscandia, Siberia, or Greenland (Williams et al., 2019). Therefore, at least in the far north, they co-occur with few or no other bumblebee species, which simplifies the system for analysis if mimicry is involved (Williams et al., 2015). The color patterns of species within Alpinobombus have been described as both variable within species and closely similar among species (Williams et al., 2016).

Datasets 2 and 3 consist of all 15 Bombus subgenera and 178 species, respectively. We expected the simplest classification challenge, between two subgenera, to be more successful than the harder challenge of classification to all 15 subgenera or the even more difficult challenge of recognition to species.

Neural network implementation: Neural networks were either built from scratch in Mathematica 11.3 (Wolfram Research, 2018) or built on top of pre-trained models (transfer learning) in fast.ai (Howard, 2018) on a Boxx workstation with an NVIDIA GP100 GPU card. The GPU allowed for parallelization of importation and training, significantly speeding up computation time.

Highly complex models, such as deep neural networks, contain many parameters and can be easy to overfit (Bühlmann & Van de Geer, 2011). To avoid overfitting, we used a mix of regularization techniques, including early stopping, weight decay, and dropout (Girosi et al., 1995). Early stopping uses checkpointing while training, essentially stopping the model before validation error begins to increase, which is the point at which the model starts to overfit (Yao et al., 2007). Weight decay is a method that keeps weights in a model from growing too large. By multiplying the weights by a factor slightly less than one after each training iteration, larger weights are “penalized” (Nakamura & Hong, 2019). Lastly, dropout is a technique by which randomly selected weights within each layer are ignored during an iteration. This strengthens the remaining weights and increases independence of the individual weights themselves (Srivastava et al., 2014). All models tested used combinations of these techniques to avoid overfitting.
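Early stopping and weight decay, as described above, can be sketched in a few lines of Python. The `patience` parameter (how many non-improving epochs to tolerate), the decay value and the validation errors are illustrative assumptions; they are not the settings used for the models in this chapter.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return the epoch whose checkpoint is kept.

    Training stops once validation error has failed to improve for
    `patience` consecutive epochs; the best earlier epoch is returned.
    """
    best_epoch, best_err, waited = 0, float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err, waited = epoch, err, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

def apply_weight_decay(weights, decay=0.01):
    """Shrink every weight by a factor slightly less than one."""
    return [w * (1.0 - decay) for w in weights]

# Invented validation errors: improvement stalls after epoch 2.
errors = [0.90, 0.60, 0.50, 0.55, 0.60, 0.40]
kept_epoch = early_stopping_epoch(errors)
decayed = apply_weight_decay([1.0, -2.0], decay=0.1)
print(kept_epoch, decayed)
```

In this toy run the checkpoint from epoch 2 is kept, and training never reaches the later, lower error at epoch 5. That is the usual trade-off of early stopping: a larger patience spends more compute in exchange for a lower chance of stopping too soon.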

For Dataset 1, we built and trained models completely from scratch. Different architectures (number and size of layers) were tested, along with the learning rate parameter. In general, each model was run until its accuracy on the validation set started to decrease, which meant the model was overfitting the data. Early stopping was used to obtain model weights just before overfitting occurred, as this is the point at which the model has the best accuracy. Generally, added layers were convolution/pooling layer pairs. Each of these layer pairs serves to define features in each image, with earlier layers describing simpler features (vertical or horizontal lines) and later layers attempting to describe more complex features (wing shape, abdomen patterns, etc.). The size of the layers changes the number of parameters in the network, making it more or less complex.

We started with the simplest approach (i.e., only a few layers in the neural network) and then increased complexity (i.e., added layers) until overfitting was guaranteed within five epochs of training (five full passes through the training data), which meant the model was too complex. Accuracy for each model was determined by evaluation against the final testing dataset. Each model was tested three times to obtain an average accuracy for that model and to reduce the influence of outliers. The model with the best final accuracy was chosen for further testing.

For the two datasets that contained more than two classes (Datasets 2 and 3), transfer learning was used, which allowed for reuse of a different, pre-trained model’s layers. Instead of training from scratch, we can transfer layers and weights from a larger, more complex neural network and refine them for our specific task. We reused the lower layers of ResNet-101 (He et al., 2016). This particular network was chosen because it was trained to classify images from ImageNet (Fei-Fei, 2010). ImageNet is a dataset of more than 15 million labeled images in around 22,000 categories. ResNet-101 used a subset of 1000 images in each of 1000 categories and was able to achieve an accuracy of 78.25%. One of these categories was bees, which means ResNet-101 has pre-defined features that distinguish bees from other objects. We refined the top layers to distinguish between the features that differ across bee taxonomic groups.

All lower layers (except the top two) from ResNet-101 were first frozen, to ensure the pre-trained weights were not damaged during training. The softmax layer, which is used to add image probability prediction to the model, was also modified to output our specific number of classes (15 for Dataset 2 and 178 for Dataset 3). The model was trained until overfitting occurred (early stopping was used as above), after which all layers were unfrozen. The model was then trained again using all layer weights until overfitting occurred (again using early stopping) and evaluated on the test dataset.
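The softmax layer mentioned above converts the network's raw output scores into class probabilities that sum to one, with one output per class. A minimal, framework-independent Python sketch is given below; the scores are invented, and the function is a generic numerically stable softmax rather than fast.ai's implementation.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to one."""
    m = max(scores)                           # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One invented score per class; 15 outputs as for the subgenus model.
scores = [0.1 * i for i in range(15)]
probs = softmax(scores)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(predicted_class, round(sum(probs), 6))
```

Changing the number of classes only changes the length of this output vector, which is why the pre-trained layers beneath the softmax layer can be reused unchanged.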

Model validation: Confusion matrices are used to evaluate the accuracy of a model’s classification and are useful to see if the model is confusing two classes (mislabeling one as another). Each row of the matrix represents the instances in a predicted class while each column represents the instances in the actual class. They give us insight not only into the number of errors being made by a classifier but, more importantly, into the types of errors being made. Confusion matrices were built for all final models for each dataset.
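A confusion matrix with the orientation described above (rows are predicted classes, columns are actual classes) can be built as in the following Python sketch. The labels are a toy example invented for illustration and are not results from this study.

```python
def confusion_matrix(actual, predicted, classes):
    """Rows index the predicted class; columns index the actual class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[p]][index[a]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

classes = ["Alpinobombus", "Psithyrus"]
actual = ["Alpinobombus", "Alpinobombus", "Psithyrus", "Psithyrus", "Psithyrus"]
predicted = ["Alpinobombus", "Psithyrus", "Psithyrus", "Psithyrus", "Psithyrus"]

cm = confusion_matrix(actual, predicted, classes)
print(cm, accuracy(cm))
```

In this toy case the single off-diagonal count sits in the Psithyrus row and the Alpinobombus column, that is, one Alpinobombus image mislabeled as Psithyrus. Overall accuracy alone would not reveal the direction of that error.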

Results

Image processing and dataset creation: All images were used in training the models. After reclassification of transcriptions, there were 15 subgeneric classes with 8 to 17,000 images each and 178 species classes containing up to 4,000 images each. The number of images in each dataset after augmentation, along with the number of images in each partition (training, validation and testing), is shown in Table 3-1. An example image and how it was processed and augmented is depicted in Figure 3-1.

Neural network implementation: The Mathematica/Jupyter notebooks used to build our models are located on GitHub (https://github.com/sunray1/Machine_learning).

Architectures, learning rates and number of epochs for the best networks are depicted in Figure 3-2. Our most successful neural networks distinguished between 2 subgenera with 96% accuracy; between all 15 subgenera with 94% accuracy; and all 178 species with 93% accuracy.

Figure 3-3 shows confusion matrices for Datasets 1 and 2. The confusion matrix for Dataset 3 is located on GitHub (https://github.com/sunray1/Machine_learning/blob/master/178species.png) due to the large file size. The confusion matrix for Dataset 1 shows that only 67 images were mislabeled, with the most errors (40 images) being Alpinobombus mislabeled as Psithyrus. The highest error in Dataset 2 occurs when predicting Sibiricobombus. All instances of Sibiricobombus were predicted to be Pyrobombus, but no instances of Pyrobombus were predicted to be Sibiricobombus. The second highest error was found when predicting Mendacibombus and the third highest error was found when predicting Kallobombus. These subgenera were predicted to be a variety of other subgenera. These top three errors occurred in the subgenera with the three lowest numbers of images per class (Sibiricobombus: 48 images; Mendacibombus: 192 images; Kallobombus: 258 images), so it is highly likely that the error rate is due to the lack of images available for training the model properly. The errors for Dataset 3 behave the same way. There were many species (84) which had very few (<100) images used for training, for which the model was unable to place the images in the correct classifications. However, this is expected with low numbers of training images, which we discuss in more detail below.

Discussion

Identification of specimens in museum collections with machine learning is promising, as it can yield provisional identifications at rates that far exceed what can be accomplished via manual curation by taxonomists. With burgeoning growth of specimen digitization, and challenges with updating out of date taxonomies on labels, machine learning approaches may be a particularly useful means to augment what is done most successfully by trained taxonomists. We stress the value of machine learning to provide provisional identifications. In some cases, the trained models may be so good that taxonomists can trust the results, freeing up time to focus on more difficult identifications.

Our results are particularly promising regarding the accuracy of machine learning if the training dataset is large enough. Of the 200,274 images, 186,255 (93%) were classified to the correct species and 188,257 (94%) to the correct subgenus. While only a few studies have begun to utilize machine learning for specimens, those that have done so generally report lower classification accuracy, often among a small number of species. For example, Hansen et al. (2020) examined ground beetle specimen identification, yielding 51.9% identification accuracy among 361 beetle species. This lower accuracy is expected, however, given the large number of species being classified. Wen & Guyer (2012) examined orchard pest classification and achieved a best classification rate of 86.6%. There were only five species to classify in their work, and only a total of 641 images used for training. This accuracy is fairly high considering the small image pool, but could improve with more data input. Yang et al. (2015) used an approach that focuses specifically on wing shape features to classify seven species of owlflies and achieved 95% accuracy.

While the results presented here are compelling, the colors and patterns of bumblebees used for classification are more distinct than in most insects (with the possible exception of butterflies and moths). Many insect species from the same genus can be especially difficult to differentiate using a single “whole body” image (typically a top view), as even experts sometimes need external information, such as behavior or geographic location, to classify to the species level (Hassan et al., 2014). However, machine learning approaches for automatic classification have high potential despite this ‘taxonomic impediment’ (Gaston & O’Neill, 2004). It is known that there are architectural and processing differences between human vision and neural networks, and that neural networks may be able to determine features indistinguishable to human vision (Geirhos et al., 2018). An obvious frontier is determining not only the success of machine learning in cases where morphological differences are subtle, but also how to align features extracted from machine learning tools with the characters that taxonomists use for diagnosis.

Machine learning approaches offer significant potential for collections, as presented here, but models are limited by the images that are used as input. The models built here are specifically designed to classify digitized bumblebees within museum collections. In order to generalize this approach and increase real-world applicability, live bumblebee images against natural backgrounds, such as flowers, could be used to train models. While this task is more difficult and may lead to lower accuracy due to higher image complexity, other image classification methods could be used as a bulwark against such challenges with image background, such as object detection (to determine the location of the bee within the image) or image segmentation (to detect individual parts of an object). More experiments combining museum and field-based photographs, focused on determining best practices, are a next step that should be pursued.

Automatic species identification can have use in field studies in addition to pure image classification. Using machine learning to classify images from live camera traps enables cameras to be utilized as observers and data gatherers in ecological studies without the need for continuous human interaction. Classified, continuous image data combined with metadata such as where and when the images were taken can provide detailed knowledge of species habitat preferences, activity levels, and species interactions (Tabak et al., 2019). It could also provide the ability to historically document and forecast abundances and activities of species. However, all of the potential uses will only become achievable with considerable improvements of accuracy between species, such as the ones presented here. Proper testing and validation in applied contexts and in a broader range of taxa and habitats are crucial to achieve this type of applied usage.

Figure 3-1. Examples of how bumblebee images were processed before training. A) Original digitized image, with specimen labels. B) Image cropped to bee without labels. C) Example of image augmentation to “create” new images; this one is reflected vertically. Photo courtesy of Smithsonian Collections.

Figure 3-2. Architectures of the two multi-layer models. Arrows indicate directionality of each model. Each colored box describes the layer type (convolution, pooling, fully connected layer or softmax) and the size of each layer. Softmax layers were set to the number of classes (2, 15 or 178). A) Architecture of the model built from scratch using Dataset 1, trained for 20 epochs at a learning rate of 1e-5. B) ResNet-101 architecture used for transfer learning with Datasets 2 and 3, trained for 30 epochs with a learning rate between 1e-5 and 3e-4. Batch size = 416 images.

Figure 3-3. Confusion matrices for the two smaller neural networks showing actual class by predicted class. A) Dataset 1. B) Dataset 2.

Table 3-1. Three datasets utilized to create neural networks, showing ML implementation, build type, and number of images and classes.

Dataset                                 1              2                   3
No. of classes                          2              15                  178
Type of classes                         Subgenera      Subgenera           Species
Input image size (pixels)               400 x 400      256 x 256           256 x 256
No. of images post-augmentation         15,660         200,274             200,274
No. of images in training set (70%)     10,960         140,192             140,192
No. of images in validation set (20%)   3,130          40,055              40,055
No. of images in test set (10%)         1,570          20,027              20,027
ML implementation                       Mathematica    fast.ai             fast.ai
Build type                              From scratch   Transfer learning   Transfer learning

CHAPTER 4 PHYLOGENETIC ANALYSIS OF NORTH AMERICAN BUTTERFLIES

Butterflies (Papilionoidea) have fascinated researchers, collectors and enthusiasts for centuries and are among the best-studied of insects. There are approximately 19,000 known butterfly species distributed across the world excluding Antarctica (~15% of the order Lepidoptera; Grimaldi & Engel, 2005). Recent work has greatly advanced our understanding of butterfly phylogenetics, with a focus on resolving deep relationships in the butterfly tree (Espeland et al., 2018). It is well understood that butterflies are monophyletic and comprise seven families, including Hesperiidae (skippers) and Hedylidae (American moth-butterflies) (Espeland et al., 2018; Kawahara et al., 2019). However, sampling effort to assemble a global-scale phylogeny of butterflies at the species level remains uneven, hampering broader understanding of how butterflies have diversified and of the long-term drivers of their diversity (Heikkilä et al., 2012; Wahlberg et al., 2013).

Butterflies are central to the development of key evolutionary concepts, such as the theory of coevolution (Ehrlich & Raven, 1964). However, questions such as whether butterflies co-evolved with their larval host plants remain largely untested (Braga et al., 2018). Similar knowledge gaps exist for species' geographic distributions, and together these issues limit a broad understanding of the processes that determine how butterfly communities have assembled and how niches are partitioned, despite classical questions related to food resource use and the evolution of mimicry complexes (Joshi et al., 2017).

These same gaps also place critical limits in applied areas, such as the ability to make the best possible conservation decisions for butterfly species under threat from change. This is especially important given recent documentation of precipitous insect declines in the face of accelerating global change (Basset & Lamarre, 2019; Wagner, 2020). For example, we have not yet quantified measures of butterfly diversity hotspots, i.e., places where lineage diversity and endemism are highest, that can be used to make decisions about conservation prioritizations. Of particular importance is combining spatial and phylogenetic information in order to best integrate information about evolutionary distinctness and spatial patterns of endemism.

In sum, few broad areas of the planet have comprehensive phylogenetic hypotheses to underpin other diversity analyses. A possible exception is North America (defined here as Canada, the U.S.A. and Mexico), where phylogenetic knowledge has rapidly grown with barcoding efforts (ButterflyNet, 2020; Hebert et al., 2004; Hebert & Gregory, 2005; Hill et al., 2018; Pfeiler et al., 2012). We focus here on producing a near-complete phylogeny for North American butterflies. This focus is warranted given the potential for broad-scale biodiversity synthesis with other diversity facets, while also complementing similar efforts in Europe (Wiemers et al., 2019) in order to hasten efforts in developing a Holarctic view of butterfly diversity. North America is home to nearly 10% of known global butterfly species with especially high diversity in Mexico, where key transitions from Neotropical to temperate butterfly communities occur (Lotts & Naberhaus, 2017). Mexico is one of the world’s most megadiverse countries in terms of overall and butterfly-specific diversity, yet it still remains undersampled due to difficulty in obtaining permits and lack of sampling accessibility for some of the most diverse areas.


Large amounts of online molecular sequence data and the low costs of novel phylogenetic locus sampling sets mean that butterflies are a prime candidate for a continent-wide phylogenetic study. Such studies, now commonplace for many groups of organisms, e.g., plants and vertebrates (Campbell et al., 2005; Hawkins et al., 2014; Licona-Vera & Ornelas, 2017; Ma et al., 2016; Williams et al., 2013), are still uncommon in insect groups. Using sequence data from GenBank (Clark et al., 2016), Barcode of Life Data Systems (BOLD; Ratnasingham & Hebert, 2007) and dried pinned museum specimens, we constructed a 14-locus, time-calibrated phylogeny that represents 93% of USA/Canadian species, 63% of Mexican species and 90% of species found in both regions.

Our goal was not only to develop the best possible supermatrix phylogeny for North America, but also to better understand sampling growth and gaps as a means to predict rates towards completing efforts for North American taxa. We hypothesized that the number of species in sequencing repositories sampled for at least one gene is growing linearly. We also hypothesized that this growth was biased, with overall less sampling of species in southern portions of the continent. In order to augment sampling based purely on supermatrices, we also added 140 previously unsequenced Mexican taxa, which were sequenced from pinned museum specimens, adding critical molecular data from an area with a disproportionate amount of sequence knowledge gaps. Finally, we make this tree easily available for subsetting and use in other downstream analyses.


Materials and Methods

Species List

In order to determine names of butterfly species found in North America, we utilized field guide resources to generate a continental list, including guides that cover the USA and Canada (e.g., Brock & Kaufman, 2003) and those focused more on Mexico (e.g., Glassberg, 2018). This led to two separate lists (Northern and Southern North America) that both needed to be vetted for quality and ultimately merged. Our list of species per region explicitly excluded strays or other accidentals, and we also removed species found only on islands, e.g., the Caribbean. As part of the larger ButterflyNet project, a global names master list has been assembled that is an update of the list produced by Lamas (2004). This list contains current valid names that reflect recent literature updates, along with synonyms. In order to properly assemble a coherent, single list, we checked each name from the two lists against this ButterflyNet master list and reconciled all names. We then merged all name duplicates to assemble a single, final, vetted list, along with synonyms, for use in downstream assembly of sequence data.
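The reconciliation step described above can be illustrated with a minimal Python sketch. It is not the code used for this chapter: the function names, the dictionary layout of the master list, and the toy names are assumptions made purely for illustration. Each name from the two regional lists is looked up among valid names and synonyms, mapped to its valid name, and duplicates are collapsed.

```python
def build_lookup(master):
    """Map every valid name and every synonym (lowercased) to its valid name."""
    lookup = {}
    for valid, synonyms in master.items():
        lookup[valid.lower()] = valid
        for syn in synonyms:
            lookup[syn.lower()] = valid
    return lookup


def reconcile(names, lookup):
    """Return (sorted unique valid names, names with no match in the master list)."""
    resolved, unmatched = set(), []
    for name in names:
        key = " ".join(name.split()).lower()  # normalize whitespace and case
        valid = lookup.get(key)
        if valid is None:
            unmatched.append(name)
        else:
            resolved.add(valid)
    return sorted(resolved), unmatched


# Toy data for illustration only (hypothetical layout, not the ButterflyNet list).
master = {
    "Danaus plexippus": ["Papilio plexippus"],
    "Vanessa cardui": ["Cynthia cardui"],
}
northern = ["Danaus plexippus", "Cynthia cardui"]
southern = ["Papilio plexippus", "Vanessa  cardui", "Unknown species"]

merged, unmatched = reconcile(northern + southern, build_lookup(master))
print(merged)     # valid names shared across both regional lists, de-duplicated
print(unmatched)  # names that need manual vetting
```

Names that fail to match are returned separately rather than dropped, since in practice such names would need manual checking.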

Sequence Database Acquisition

GenBank: GenBank was searched for 13 commonly sequenced butterfly loci across all North American species using a novel toolkit that takes an input taxonomy list (see above), a list of known synonyms, and a list of loci for which to search, and returns matches. Its matching approach utilizes a probe sequence for each locus of interest. In cases where there are multiple returns for a species x locus combination, the toolkit chooses the best sequence using a set of well-defined rules, often simplifying to picking the longest sequence with the most unambiguous DNA content (see Chapter 2 and https://github.com/sunray1/GeneDumper). Of the 13 loci used for this work, 12 were nuclear genes: arginine kinase (ArgKin), ribosomal proteins S5 and S2 (RpS5/RpS2), carbamoyl phosphate synthetase (CAD), catalase (CAT), glyceraldehyde-3-phosphate dehydrogenase (GAPDH), elongation factor 1 alpha (Ef1a), dopa decarboxylase (DDC), malate dehydrogenase (MDH), hairy cell leukemia (HCL), isocitrate dehydrogenase (IDH), and wingless (Wgl). Three common mitochondrial genes were also included, but were treated as one to avoid duplicate sequences: cytochrome oxidase subunits 1 (COI) and 2 (COII) and the trnL intron located between them. The length of each locus is reported in Table 4-1.
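The simplified selection rule mentioned above (the longest sequence with the most unambiguous DNA content) can be sketched as follows. This is only an illustration of that simplification, not GeneDumper's actual rule set, which is described in Chapter 2; the function names, the tie-breaking order and the toy records are assumptions.

```python
def unambiguous_count(seq):
    """Number of unambiguous bases (A, C, G, T) in a sequence."""
    return sum(1 for base in seq.upper() if base in "ACGT")


def pick_best(candidates):
    """Pick the (id, sequence) pair with the most unambiguous bases.

    Ties are broken by total length (an assumption made for this sketch).
    """
    return max(candidates, key=lambda rec: (unambiguous_count(rec[1]), len(rec[1])))


# Hypothetical candidates for one species x locus combination.
candidates = [
    ("acc1", "ATGCNNNNNNATGC"),  # longest, but only 8 unambiguous bases
    ("acc2", "ATGCATGCATG"),     # 11 unambiguous bases
    ("acc3", "ATGCA"),           # 5 unambiguous bases
]
best_id, best_seq = pick_best(candidates)
print(best_id)
```

Ranking by unambiguous content first means that a long sequence padded with N characters does not displace a shorter but more informative one.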

Once sequences were downloaded from GenBank, they were auto-cleaned using a set of criteria built into the GeneDumper toolkit. This approach separates full-length sequences and sequence fragments. Full-length sequences are first aligned, then chunks of sequence fragments are iteratively added to this alignment using the --addfragments and --adjustdirection options of MAFFT (Katoh & Standley, 2014). Since species x locus pairs frequently only have partial fragments of sequences that align to different parts of the whole locus, this approach decreases the number of misaligned fragments. Alignments were visualized using AliView v.1.26 (Larsson, 2014) and ultimately manually curated to remove flanking regions and to check that loci maintained the correct reading frame. GeneDumper uses a tiling approach for choosing sequences (in order to get the greatest amount of data across a locus), allowing some species x locus pairs to have slightly (0-20 bp) overlapping fragments generated from different sources. These fragments were merged together to create a 50% consensus sequence that contained data across the whole locus, where possible.


BOLD: Barcode of Life sequences were assembled for all species in the species list using a novel Python script to query BOLD’s API (http://www.boldsystems.org/index.php/resources/api) for locus information. Most BOLD data are provisioned to GenBank, but some genetic data remain unique to BOLD. If there were multiple sequences available for a species, all sequences were downloaded and aligned, and a consensus sequence was created by choosing the most common nucleotide for each site. Only sequences that matched the 14 loci listed above were kept, and the vast majority of BOLD sequences are the barcode marker, cytochrome oxidase I (COI). Sequences were aligned using the same iterative approach described above and manually curated to remove inserts and flanking regions. We de-duplicated sequences present in both BOLD and GenBank, and note that BOLD had a surprising number of unique species x locus combinations that are not yet in GenBank.
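The per-site consensus step described above (choosing the most common nucleotide at each site) can be sketched in a few lines of Python. This is an illustrative sketch rather than the script used here: the function name, the treatment of gaps and N as missing data, and the tie-breaking behavior are assumptions.

```python
from collections import Counter


def majority_consensus(aligned, missing="-?N"):
    """Most common nucleotide per column across equal-length aligned sequences.

    Characters in `missing` are ignored; a column with no data yields '-'.
    Ties go to the base seen first in that column (an assumption of this sketch).
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*aligned):
        counts = Counter(base for base in (c.upper() for c in column) if base not in missing)
        consensus.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(consensus)


# Hypothetical aligned barcode fragments for a single species.
aligned = ["ATGC-A", "ATGCTA", "ACGCTA"]
consensus = majority_consensus(aligned)
print(consensus)
```

Ignoring gap and N characters before counting keeps a well-sequenced site from being overridden by missing data in the other records.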

ButterflyNet: Where available, sequences were taken from data produced by Kawahara et al. (2020), which has global coverage and includes unique exemplar species not yet available in sequencing repositories. The majority of sequences were obtained from specimens directly using Anchored Hybrid Enrichment (AHE; Lemmon et al., 2012) with the Butterfly2.0 target capture set of Espeland et al. (2018). This capture set included the 14 genes of interest and typically produced full-length, high-quality sequences.

Novel sequencing of museum specimens: Collections in the McGuire Center for Lepidoptera & Biodiversity (MGCL), Florida Museum of Natural History, Gainesville, FL, USA, were searched for specimens of species without sequence data on public data portals. These specimens were pinned and dried, dried and papered, or stored in ethanol. One leg from each specimen was removed and transferred to a 96-well PCR microplate containing 30 µL of 95% EtOH. Each well and specimen was labeled with a sample code before DNA extraction. For species with multiple museum specimens, the specimen with the most recent collection date was used to collect a tissue sample. Plates were shipped to the Canadian Centre for DNA Barcoding (CCDB) for COI barcoding (Sanger).

Digital voucher images of museum specimens were taken with their associated data whenever possible. For pinned specimens, we used a 12-megapixel rear-facing camera with quad-LED lights. For ethanol and papered specimens, we chose representative photos of species from Butterflies of America (Lotts & Naberhaus, 2017). Photographers were contacted for reuse permission and it was noted in the submission that the photos were not of the original specimen. COI barcodes were received and aligned together with all other data sources using MAFFT.

Plots of growth of species-level data in repositories: To determine growth of species with sequencing coverage in GenBank, we utilized date information, especially year of deposition, to plot the growth in the number of species that have at least one marker available. We did not include the novel sequence data that came from the ButterflyNet project or those that were sequenced specifically for this project. We further examined sampling rates separately for United States and Canadian species, for species in Mexico and for species found in both regions (USA/Canada and Mexico), in order to determine if these rates are similar. Rates of sampling for each of these areas were statistically tested using one-proportion z-tests in R v.3.6.3 (R Core Team, 2019) to determine if these rates were higher or lower than expected by chance.
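For readers unfamiliar with the test, a one-proportion z-test can be sketched in Python as below. The analyses in this chapter were run in R; this sketch uses a plain normal approximation without continuity correction, so its p-values may differ slightly from R output. The function name, the counts and the null proportion are hypothetical and are not the values used in this study.

```python
import math


def one_proportion_ztest(successes, n, p0):
    """Two-sided one-proportion z-test (normal approximation, no continuity correction)."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical example: 60 of 100 species sampled, against an expected rate of 0.75.
z, p = one_proportion_ztest(successes=60, n=100, p0=0.75)
print(f"z = {z:.3f}, p = {p:.5f}")
```

A negative z indicates a sampling rate below the expected proportion, and a positive z a rate above it.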


Phylogeny Reconstruction

Cleaned and aligned sequences for each locus from each source were concatenated together and realigned. Duplicate sequences were removed and those with 95% or more similarity across overlapping regions were merged into a 50% consensus sequence using a Python script. Sequences that could not be merged were submitted to BLAST (Altschul et al., 1990) to determine similarity to other sequences. The sequence with the most closely related hits to the reference sequence was chosen. Once there was one sequence per species per locus, all loci were concatenated across species into a supermatrix using FASconCAT-G v.1.02 (Kück & Longo, 2014).
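The merging criterion described above (95% or more similarity across overlapping regions) can be illustrated with a short Python sketch. The script used in this study is not reproduced here; the function names, the set of characters treated as missing, and the use of N where two sequences disagree are assumptions of this sketch.

```python
MISSING = set("-?N")


def overlap_identity(a, b):
    """Fraction of identical bases over columns where both aligned sequences have data.

    Returns None when the two sequences do not overlap at all.
    """
    shared = [(x, y) for x, y in zip(a.upper(), b.upper())
              if x not in MISSING and y not in MISSING]
    if not shared:
        return None
    return sum(x == y for x, y in shared) / len(shared)


def merge_pair(a, b, ambiguous="N"):
    """Merge two aligned sequences column by column.

    Agreed bases and bases present in only one sequence are kept;
    disagreements are written as `ambiguous` (an assumption of this sketch).
    """
    merged = []
    for x, y in zip(a.upper(), b.upper()):
        if x in MISSING:
            merged.append(y)
        elif y in MISSING or x == y:
            merged.append(x)
        else:
            merged.append(ambiguous)
    return "".join(merged)


# Hypothetical fragments of one locus: 20 overlapping columns, one mismatch.
a = "ATGC" * 6 + "----"
b = "----" + "ATGC" * 4 + "ATGT" + "AAGG"

identity = overlap_identity(a, b)
merged = merge_pair(a, b) if identity is not None and identity >= 0.95 else None
print(identity, merged)
```

Pairs below the threshold are left unmerged in this sketch, which corresponds to the sequences that were instead passed to BLAST.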

Phylogenetic trees were built using maximum-likelihood phylogenetic analyses in RAxML v.8.2.10 (Stamatakis, 2014). A preliminary tree was built using an unpartitioned GTR+Γ model of nucleotide evolution with a constraint tree based on family-level relationships following Espeland et al. (2018). That analysis recovered seven monophyletic families, including Hesperiidae and Hedylidae. Long branches (greater than 99.5% longer than average branch lengths and thus significant outliers) from this preliminary tree were removed from the alignment and ten more trees were built using the same options as the preliminary tree. The tree with the log likelihood score closest to 0 was examined and singleton GenBank and BOLD sequences with species identifiers that did not group within the correct subfamily were removed under the assumption that these represented labeling or other errors during deposition. Only 17 sequences were removed based on these criteria. Three more trees were built and the tree with the log likelihood score closest to 0 was chosen as the final tree. We determined tree support by running 200 bootstraps via RAxML under the GTR+Γ model. Bipartitions were drawn on the final maximum-likelihood tree using two methods: Felsenstein’s binary bootstrap method and a gradual “transfer” distance method implemented in BOOSTER (Lemoine et al., 2018). The distance method accounts for the fact that datasets with large numbers of taxa may contain rogue taxa with unstable phylogenetic positions. Instead of removing these taxa, this method quantifies these errors as instability scores and uses these values to calculate bootstrap (BS) scores. Phylogenies were visualized with FigTree v2.0 (Rambaut, 2010) and phytools (Revell, 2012).

Divergence times were estimated using treePL v.1.0 (Sanderson, 2002; Smith & O’Meara, 2012) using a congruification approach. We used dates from Espeland et al. (2018) as constraints on analogous nodes on the best-supported maximum likelihood tree, focusing on deeper branches of the tree. Espeland et al. (2018) describe two dated phylogenies; both phylogenies were built using nine fossil calibration points, the F48+Γ substitution model and a lognormal, birth-death tree prior, but used differing values of the median age of Angiosperms for root calibration. We used the maximum overlap between date ranges of the six families as input into treePL. Analysis options were first optimized using the ‘prime’ command. A smoothing value of 0.00001 was chosen following five cross-validation analyses, all of which determined this value to have the lowest chi-squared value.

A Jupyter notebook describing the above analysis is available on GitHub (https://github.com/sunray1/NA_butterflies/blob/master/README.md.ipynb).

Results

Data Acquisition

There were 1,597 unique species names gathered from field guides in Mexico and 600 unique species from North American resources once strays and accidentals were filtered. After cross-referencing species names from each reference to the ButterflyNet master list, and removing duplicate names that spanned across Northern and Southern North America, a total of 1,927 valid names remained. We also used the ButterflyNet master list to gather all synonyms for those names, and were able to recover 6,741 synonyms. This brought the final number of taxon names (valid + synonym) used to search for DNA sequence data to 8,668.

Using GeneDumper, 21,831 sequences were recovered from GenBank across all 14 loci of interest. After sequences were filtered and cleaned and the best sequence(s) for each species/locus pair were chosen, 3,997 sequences across 1,019 (53%) species remained. From the BOLD sequence database, 1,696 sequences were harvested across 10 of the loci of interest, representing 1,218 (63%) species. The addition of BOLD data increased the number of unique species with at least one locus by 142. Of the ButterflyNet sequences, 2,735 sequences across 12 of the loci of interest were found, representing 225 (11%) species and adding 11 unique species.

Of the 1,927 valid species names, we gathered usable sequence information for 1,299 (67.4%) from GenBank, BOLD and ButterflyNet. Of the 628 species without usable sequence information, 346 were found in the MGCL and submitted for new sequencing. Of these, 206 failed due to amplification failure or extraction issues. The remaining 140 species were sequenced successfully and added to this study, resulting in 1,437 (74.6%) species in total. These 140 species added 7% more coverage in terms of total species. Of these, 96 species are not distributed in the USA or Canada, but are found only in Mexico.

Sequences from each source were added together to create one FASTA file per locus. Once sequences were cleaned to remove duplicate species, there were a total of 5,908 sequences remaining across all 14 loci. The number of sequences per gene and alignment lengths are reported in Table 4-2. Once all loci were concatenated together in an alignment, it contained 1,437 species with a length of 12,361 bp and 73% missing data.

A plot of the growth of sequences in GenBank across all species and loci over time is shown in Figure 4-1. Plots for Mexican species, for USA+Canada species and for species found in both areas are shown in Figure 4-2. We tested whether overall species sampling rates were higher or lower than expected in Mexico versus the USA and Canada via a one-proportion z-test and found that the rate for Mexican species (0.628) was significantly lower than expected (p-value: 2.2e-16). Conversely, the rates for species found only in the USA and Canada (0.933) and for species found across both areas (0.904) were both significantly higher than expected (USA and Canada p-value: 1.041e-14; both areas p-value: 1.079e-14). While the growth in overall sequence numbers over time better fit an exponential growth model than a linear one (results not shown), there are time periods where rates are especially high and appear to coincide with BOLD sequence depositions into GenBank (see Discussion). Growth in numbers of species with sequences shows an even more pronounced pattern of periods of striking increases followed by periods where growth appears to be more linear, especially for USA and Canadian species.
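One simple way to compare a linear and an exponential growth model is sketched below in Python. The model comparison used in this study is not shown in the text, so this is only an assumed approach: both models are fit by ordinary least squares (the exponential model on log-transformed counts) and compared by residual sum of squares on the original scale. The function names and the yearly counts are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


def rss(observed, predicted):
    """Residual sum of squares."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def compare_growth(years, counts):
    """Return (linear RSS, exponential RSS), both on the original count scale."""
    t = [year - years[0] for year in years]
    a, b = fit_line(t, counts)
    linear_rss = rss(counts, [a + b * x for x in t])
    # Exponential model: log(count) = log_a + log_b * t, fit on the log scale.
    log_a, log_b = fit_line(t, [math.log(c) for c in counts])
    exponential_rss = rss(counts, [math.exp(log_a + log_b * x) for x in t])
    return linear_rss, exponential_rss


# Hypothetical cumulative sequence counts growing by roughly 50% per year.
years = list(range(2000, 2010))
counts = [round(50 * 1.5 ** i) for i in range(10)]

linear_rss, exponential_rss = compare_growth(years, counts)
print(linear_rss > exponential_rss)  # True when the exponential model fits better
```

Because both models here have two parameters, comparing residual sums of squares directly is reasonable; an information criterion would be the more usual choice when the numbers of parameters differ.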

Coverage across each of the seven families is shown in Table 4-3. Sampling across all families is high, with Riodinidae being the lowest at 62.08%. Most North American riodinids are distributed within the Neotropics (84.06%) and are thus more difficult to obtain samples for.


Phylogenetic Analyses and Relationships

Although the backbone of the tree was constrained at the family level, subfamily and tribe-level relationships within each family generally agree with those of Espeland et al. (2018). The median bootstrap value when calculated according to Felsenstein’s bootstrap proportion (FBP) of this tree is 56, while the median value is 86 when calculated with the transfer bootstrap expectation (TBE) (Lemoine et al., 2018). Here, we present TBE bootstrap values as they better represent node support for our phylogeny (See Discussion). The tree is shown in Figure 4-3.

Papilionidae is thought to be the sister group to the remaining butterflies and has traditionally been divided into three extant subfamilies: Baroniinae, Parnassiinae, and Papilioninae (Allio et al., 2020; Condamine et al., 2018). The sole member of the monospecific subfamily Baroniinae, Baronia brevicornis, is typically considered the sister to the other two subfamilies, but has also been grouped with Parnassiinae, with Papilioninae as sister (Nazari et al., 2007). All three subfamilies are generally accepted as monophyletic, but Papilioninae has been hypothesized to be polyphyletic (Espeland et al., 2018). Our results recover Baroniinae as sister to the remainder of the family (BS = 99.5), a clade containing Parnassiinae + Papilioninae, rendering all subfamilies monophyletic (BS = 99.8 and 94.3, respectively).

Family Hesperiidae is one of the largest and most diverse families of butterflies. Although the monophyly of this group is well defined, there is disagreement on the relationships at subfamily and tribal levels (Yuan et al., 2015; Brockmann et al., 2019). Only four of the eight described subfamilies were sampled in our phylogeny. Three were recovered as monophyletic (Hesperiinae, Eudaminae and Heteropterinae), with Pyrginae as paraphyletic. Within Pyrginae, tribes Carcharodini, Achlyodini, Erynnini, and Pyrgini formed a single clade (BS = 97) sister to tribes Pyrrhopygini and Celaenorrhinini (BS = 89), followed by the other three subfamilies (Eudaminae + (Hesperiinae + Heteropterinae)) (BS = 97, 99 and 95, respectively).

Pieridae is currently divided into four subfamilies: Pseudopontiinae, Dismorphiinae, Coliadinae and Pierinae (Braby et al., 2006). Pseudopontiinae is monotypic, containing a single genus found in Africa, and was thus excluded from this study. Representatives from the other three subfamilies had the following relationships: (Dismorphiinae + (Coliadinae + Pierinae)), consistent with recent studies (Cao et al., 2016; Ding & Zhang, 2017) (BS = 100, 98 and 99). Subfamily Pierinae is usually divided into two tribes: Pierini and Anthocharidini. Our analysis supports the monophyly of these two tribes (BS = 100 and 98, respectively).

Relationships within Lycaenidae are currently unresolved and are the least well supported in our phylogeny (median BS = 68), but still generally follow the arrangement presented by Eliot (1974). Eliot classified Lycaenidae into seven subfamilies, of which we have representatives from four (Miletinae, Theclinae, Polyommatinae and Lycaeninae). Like Espeland et al. (2018), we find Polyommatinae nested within Theclinae. We also find Lycaeninae nested within Theclinae. The classification presented by Corbet et al. (1992) reduced Theclinae, Polyommatinae, and Lycaeninae to an inclusive “Lycaeninae”. Our phylogeny agrees with this lumping (BS = 99). Miletinae, although represented by only one species, was placed sister to these three subfamilies (BS = 100).

Riodinidae are mostly found in the Neotropics and are divided into three subfamilies: Riodininae, Euselasiinae and Nemeobiinae (Seraphim et al., 2018). Nemeobiinae are strictly Old World riodinids and were therefore excluded from this study. The two remaining subfamilies, Riodininae and Euselasiinae, were both monophyletic with 100 and 98 BS support, respectively. Tribal relationships within Riodininae were still mixed and unsupported in all studies. This may be due to rapid radiations within this subfamily (Espeland et al., 2015). Although there was some between tribes, we agree with the suggestion made by Espeland et al. (2018) that an Emesis-Apodemia tribal group might need to be erected (BS = 94).

There is no phylogenetic consensus within Nymphalidae, the most speciose butterfly family, although recent studies have shed some light on its structure (Chazot et al., 2019; Wahlberg et al., 2009). Due to the sheer number of species (>6,000) and high amounts of diversity, Nymphalidae has been split into anywhere from 9 to 12 subfamilies (Brower, 2000; Freitas & Brown, 2004; Wahlberg et al., 2003). All subfamilies in our analyses were monophyletic. We found Libytheinae to be sister to Danainae and Ithomiinae (lumped into Danainae by some) (BS = 98), which is in turn sister to the rest of the family (BS = 96). This is also one of the topologies suggested by Espeland et al. (2018). Placement of the remaining nymphalids was largely concordant with the relationships in Espeland et al. (2018), with the exception of the placement of Apaturinae. Espeland et al. (2018) found Apaturinae to be sister to Biblidinae, whereas we found Apaturinae sister to the clade containing Cyrestinae and Nymphalinae. This may be due to the lack of representation of subfamily Pseudergolinae in this study, since it is found in Asia, not North America.

Dating Analyses

Age estimation of the tree resulted in a dated phylogeny that largely corresponds to recent butterfly-wide dating analyses (see Table 4-4). We find that butterflies originated around 120 mya in the lower Cretaceous, which agrees with many previous studies, although the hypothesized age has varied widely. We, too, find that many of the within-family divergences leading to extant subfamily lineages occurred after the K-Pg boundary, suggested by Wahlberg et al. (2013) and Espeland et al. (2018) to be a significant event in butterfly evolutionary history. A comparison of subfamily radiation dates showed similar results to those of Espeland et al. (2018) and other studies (Table 4-4).

Discussion

The importance of a coherent sampling strategy: Our assembled supermatrix represents the most comprehensive species coverage yet compiled for butterflies of North America. However, we are still missing 25% of species, most of which (86%) have distributions located south of Canada and the USA. This differential in sampling is stark; species located only in the southern parts of the continent are much less represented in GenBank compared to USA and Canadian species. Overall diversity of butterflies in Mexico is much higher than in Canada or the USA, whether measured in absolute terms or in terms of butterflies per unit area (Heppner, 1992; Martínez et al., 2002; Pozo et al., 2003). Further, there are a number of logistical challenges to sampling Mexican species in the field or from museums. Those challenges on the field side include difficulty in obtaining collecting permits, international specimen shipping, and lack of accessibility (Llorente-Bousquets et al., 1997). On the collections side, the infrastructure for publicly available insect specimens is much sparser in Mexico than in the USA or Canada, which also limits availability of and access to rarer Mexican species. It may also be the case that rates of endemism and rarity are higher in Mexico than in the USA or Canada (a topic we turn to in Chapter 5), simply based on the enormous diversity in a much smaller area than the rest of North America, which also makes sampling more challenging. Missing species located in the USA and Canada were almost all endemic or rare species that we were subsequently unable to find in collections.

Our approach towards assembling a comprehensive phylogeny of North American butterflies relied on supermatrix approaches (Driskell et al., 2004) that utilize commonly sampled barcode markers and post-hoc assembly of these genes from publicly available sequence data repositories, along with de novo sequencing of primarily Mexican species to fill gaps, which effectively helps to illuminate those areas not well understood. This approach required us to first develop a comprehensive understanding of which species have or have not been sampled and then to strategically determine how to close those gaps.

New sequences were generated for 140 species to cover geographic areas of North American butterfly diversity that are least understood. All of these sequences were generated from museum specimens, highlighting the importance of maintaining historical collections. Without the extensive butterfly collection housed within the MGCL, which has geographic focal strengths of North America, especially Mexico and Central America, getting specimens for sequencing would have been much more difficult, if not impossible. This study highlights the importance of curation, organization and digitization of collections. Being able to quickly and efficiently find needed specimens is paramount to working with collections. Missing species may very well be hidden within unidentified, mislabeled and misplaced specimens. We also note that hundreds of specimens that were obtained from MGCL did not work on the first try at BOLD facilities, possibly due to an insufficient amount of sample tissue. Future efforts to process older museum samples may yield high-quality sequence outputs. The work at BOLD focused on Sanger approaches for COI markers, and it is entirely possible that next-generation sequencing approaches using target-capture methods optimized for historical samples will yield significant improvements.

Our approach contrasts with recent efforts to gather comprehensive genomic data for USA and Canadian species (Zhang et al., 2019). That work sampled nearly all North American species excluding Mexico and sequenced draft genomes for those species. While technically impressive, and of undoubted value for ultimately developing genomic resources broadly for lepidopterans, we argue there are trade-offs in terms of time and energy if the goal is to gather a fuller understanding of butterfly diversity patterns. Our efforts leveraged high-quality phylogenies developed at the level of “tribes”, and with global scope, in order to constrain our topology and to help generate a dated North American phylogeny via congruification.

Cadence of sequence growth: We plotted growth of sequencing efforts in GenBank, given its role as a key global repository, as a means to understand if we can forecast when sampling for North American butterflies might reach near completion. Overall North American butterfly sequencing has grown exponentially, which is no surprise given the continued lowering of sequencing costs. This growth is not smooth and continuous but shows spikes over the last two decades. One of the most significant of these spikes is the release of 9,077 sequences from Janzen et al. (2011), which represented the largest ever barcode-marker based biodiversity inventory of Lepidoptera in an area of northwest Costa Rica. Although located in Central America, many of the species’ distributions reach northward into Mexico.


The growth in the number of species is relatively linear, but punctuated, which contrasts with the roughly exponential growth of sequences. Besides the Janzen et al. (2011) barcode-based inventory described above, there were two other rapid increases in 2000 and 2006, each corresponding to early, significant phylogenetic studies (Brower, 2000; Hajibabaei et al., 2006) where species coverage, rather than resequencing effort, was prioritized. With sampling gaps closing especially in the USA and Canada, and with the likelihood that remaining gaps are for endemic and rare species in Mexico, it is possible that continued increases in species with at least a single marker may eventually slow from their linear rate, as logistics of sampling rather than sequencing effort itself become the main bottleneck in future efforts. We note that USA and Canada species tend to show a more pronounced linear rate of growth with a higher slope, while species in southern parts of the continent have a much more pronounced, episodic increase associated with periodic research efforts. We note that the 140 Mexican species for which we have generated COI will represent a pulse of new sampling of similar magnitude to the other three major pulses in 2000, 2006 and 2010.

Assessment of quality and usability of the supermatrix: The topology of our phylogeny is largely congruent with previous work. The least well supported relationships within our tree are within the subfamily Hesperiinae (Family: Hesperiidae) (median BS = 78), and within family Lycaenidae (median BS = 68), which have long had controversial internal relationships (Cong et al., 2019; Warren et al., 2009).

We chose to use a supermatrix approach even though this gave us a relatively high percentage of missing data (73%) because the signal present in other genes provides at least some estimate of the genealogy and can help make up for the missing data (Lemmon et al., 2009; Thomson & Shaffer, 2010; Wiens & Tiu, 2012). We also decided to use a maximum likelihood framework to help counteract the signal from missing data because Bayesian frameworks tend to have more pronounced topological bias due to interactions between missing data and priors (Lemmon et al., 2009). However, there are still concerns over missing data with ML analyses, which is why we decided to constrain this tree at the family level, although initial, unconstrained analyses resulted in the monophyly of all seven families. Wiens (2006) showed that incomplete taxa could still be placed with strong statistical confidence while incorrectly improving accuracy for complete taxa due to long-branch attraction. We argue that this long-branch attraction would be a problem in our dataset, as only 57 species (4%) have all 14 genes present; constraining the tree would therefore help to counteract this potential error.

It has been shown that bootstrap values decline as the number of sampled taxa increases. This is because monophyly is rejected if only one taxon drops out of a clade and, as sampled taxa increase, there are increasingly many ways for taxa to fall outside of a clade (Sanderson & Wojciechowski, 2000). Constraining the tree, especially at deep levels, helps to counteract this. Bootstrap support values therefore may not be the best method for assessing confidence in large phylogenies like ours; rather, support calculated using gradual distance (the TBE values presented here) is more appropriate (Lemoine et al., 2018).

Congruification methods for dating trees are becoming mainstream as larger phylogenies become commonplace. Exploiting information contained in existing time-calibrated trees can provide reasonable time scalings. It should be noted that congruification approaches are not far removed from indirect calibration, wherein estimates of divergence times are used in secondary analyses as calibration points (Eastman et al., 2013). It should also be noted that issues can arise when relationships at the analogous nodes between the target tree and the reference tree differ. Because our phylogeny (the target tree) was constrained at the analogous nodes of the reference tree (Espeland et al., 2018), these issues can be safely ignored here.

Key uses of North American phylogenetic resources: Our intent was to provide the most robust and useful assessment of true continental-level phylogenetic diversity yet developed, recognizing the limitations with super-matrix approaches given sparse matrices for many loci used here (Wiens, 2006). Despite those limitations, a key goal is providing this tree broadly for downstream uses. Those uses include using a continental-scale phylogeny in conjunction with geographic distributions to answer key questions about insect biodiversity and conservation that could not be answered at smaller scales.

We also see high value in using dated tree-based approaches to understand geographic patterns of diversification. For example, plant phylodiversity analyses show striking east-west differences in diversification rates across North America, and such questions can now be approached if tied to species geographic delimitations.

We can also begin to examine classic biogeographic questions regarding ages of clades and communities primarily found in the Neotropical regions of Mexico up to the highly seasonal communities in boreal and polar regions. To further enable such uses, we have sought to make this tree available in multiple formats.

Figure 4-1. Plot showing the increase in the total number of North American butterfly sequences over time. Dark line shows the total number of sequences for the 14 loci. The growth of the three most abundant loci is also plotted, showing that the shape of the curve is influenced most significantly by the mitochondrial genes COI/COII.

Figure 4-2: Increase in the number of sequences for butterflies over time across three range types – those found only in Mexico, those found only in the USA/Canada and those found in both regions.

Figure 4-3: A time-calibrated tree of 1,437 North American butterflies with bootstrap support shown for 39 of the deepest nodes (before the K-Pg boundary).

Table 4-1: The length of the 14 loci used as the starting sequence for input into GeneDumper.

Locus Name       Locus Length
ArgKin           596
CAD              2,211
CAT              1,293
COI_trnL_COII    2,271 (COI: 1,531; trnL: 67; COII: 673)
DDC              957
EF1a             1,240
GAPDH            609
HCL              633
IDH              709
MDH              733
RpS2             474
RpS5             614
Wgl              453

Table 4-2: Total number of sequences and alignment length. Sequence numbers were calculated after quality filtering and duplicate sequence removal.

Locus Name       Number of Sequences    Alignment Length (bp)
ArgKin           166                    579
CAD              421                    2068
CAT              209                    1467
COI_trnL_COII    1445                   2178
DDC              241                    702
Ef-1a            672                    1533
GAPDH            442                    855
HCL              214                    813
IDH              346                    945
MDH              363                    996
RpS2             317                    636
RpS5             467                    588
Wgl              605                    534

Table 4-3: The number of species with sequence data by butterfly family, together with the total number of species, the percentage of North American (NA) species with sequence data, and the percentage of each family found outside of USA/Canada. Hedylidae is not listed as this family does not occur in North America.

Family          Species with      Total      Percentage of    Percentage of family
                Sequence Info     Species    NA Species       outside USA/Canada
Hesperiidae     552               784        70.41%           29.60%
Lycaenidae      209               307        68.08%           53.42%
Nymphalidae     418               490        85.31%           56.32%
Papilionidae    47                62         75.81%           41.93%
Pieridae        98                102        96.08%           28.43%
Riodinidae      113               182        62.09%           84.06%

Table 4-4: Ages of all families (in millions of years) in this study and others with 95% confidence intervals.

Clade            Wahlberg et al.    Heikkilä et al.    Espeland et al.    Condamine et al.    Chazot et al.    This study
                 (2009)             (2012)             (2018) *           (2018)              (2019)
Papilionoidea    104 (93-116)       110 (92-128)       119 (91-143)       98 (66-189)         108 (89-129)     119.50
Papilionidae     63 (93-116)        75 (62-88)         84 (63-109)        86 (56-164)         68 (53-84)       79.67
Hesperiidae      N/A                65 (54-79)         79 (60-99)         76 (55-143)         65 (56-78)       72.49
Pieridae         73 (57-86)         80 (67-97)         87 (67-108)        71 (45-135)         77 (63-92)       92.29
Lycaenidae       75 (63-86)         73 (58-84)         78 (60-96)         62 (39-119)         71 (57-85)       57.40
Riodinidae       65 (56-76)         72 (57-83)         73 (56-92)         60 (36-116)         73 (60-88)       72.49
Nymphalidae      94 (84-104)        87 (74-101)        91 (71-112)        80 (52-151)         82 (68-98)       87.43

* Dates from Espeland et al. (2018) were used to constrain the tree in the present study.

CHAPTER 5 PHYLODIVERSITY OF NORTH AMERICAN BUTTERFLIES

Our understanding of broad-scale patterns of insect biodiversity is extremely limited across spatial, phylogenetic, and phenomic scales. In most cases, we lack information about where insect species are located (e.g., the Wallacean shortfall (Hortal et al., 2015; Lomolino, 2004; Whittaker et al., 2005)), their evolutionary relationships and their community functions. This is in stark contrast to our understanding of North American vertebrates and plants, where rapid efforts to close phylogenetic and spatial information gaps (Allen et al., 2019; Davies & Buckley, 2011; Thornhill et al., 2017) have allowed exciting insights into the processes governing, and changes associated with, multiple facets of diversity (Kling et al., 2019). The best hope for quickly expanding the knowledge base of insect biodiversity lies in looking at species where existing biodiversity data are already dense, but not yet fully integrated. Assembling these key facets can reveal not only broad-scale patterns of biodiversity, but also key processes and drivers that shape these patterns.

Butterflies (Papilionoidea) are popular insects that serve as an ideal study group for researchers and naturalists due to their diurnal activity, often vibrant and showy colors, and specialized larval host plant associations (Brock & Kaufman, 2003; Grimaldi & Engel, 2005). Not only are they the most collected and photographed insects (Scoble, 1995), but many species and clades of butterflies have become model groups for studying diverse ecological and evolutionary processes, such as Batesian and Müllerian mimicry (e.g., butterflies in the genus Heliconius, (Brower, 1996; Kronforst & Papa, 2015; Lewis et al., 2019; Palmer et al., 2018)) and adaptation to agricultural systems (e.g., the common and widespread cabbage white, Pieris rapae, (Shen et al., 2016)).

Butterflies also serve as key pollinators and bioindicators of change, and are one of the few insect groups for which conservation agencies have made at least initial assessments of vulnerable and endangered status. For example, butterflies have served as indicators of environmental degradation in Ghana’s Tarkwa Gold Mines (Kyerematen et al., 2018) and Papua New Guinea’s rainforest and oil palm plantation habitats (Miller et al., 2011).

Butterflies such as Plebejus argus and Cupido minimus are also useful indicators of climate change (Macgregor et al., 2019).

Due to the interest of both professionals and amateurs, North American butterfly natural history is relatively well known with rich distribution and genetic data resources freely available. Given the strong data basis, butterflies are among the best positioned of any insect group to ask broad-scale questions about the structure and drivers of diversity. These strong data sources enable moving beyond simple taxic summaries of diversity, such as species richness, and towards a more comprehensive, process-oriented understanding of how community diversity is structured at the continental scale. Despite this potential, there have been no synthetic, broad-scale phylodiversity analyses of butterflies (or any other insect group). Even North American-wide summaries of butterfly species diversity have been limited (Kocher & Williams, 2000; Martínez et al., 2002; Ricketts et al., 1999).

North America is particularly well suited for a first analysis of butterfly phylodiversity given the excellent data resources available, its wide range of ecosystems, dynamic geological history, and significant insect diversity (Danks, 1994; Godfray et al., 2000). There are approximately 1,900 species of butterflies in North America across 14 broad ecoregions, ranging from the Eastern Temperate Forests to tundra and taiga in northern Canada, tropical wet forests in southern Mexico and the warm and cold deserts of the Southwest (Lotts & Naberhaus, 2017). North American landscapes have dramatically changed during the Quaternary, especially in the West, due to long-term aridification and orogeny leading to formation of the Sierra Mountains, and across northern North America through cyclic patterns of glaciations (Bintanja & Van de Wal, 2008). A fundamental question is how current climate and historical changes in climate and landscape across the continent have shaped butterfly phylogenetic diversity and endemism. Additionally, recent efforts to document North American plant phylodiversity provide a remarkable opportunity to directly examine whether plant diversity and insect diversity show similar patterns. For example, do butterflies and plants show concordant patterns of historical community assembly given butterflies often have strong, specific plant host associations?

Here we examine spatial phylodiversity across North America, including Mexico. Phylodiversity approaches have two key advantages compared to traditional taxic approaches. First, phylodiversity reduces reliance on species definitions; rather, input phylogenetic trees are used to calculate diversity metrics. Second, phylodiversity approaches allow hypothesis testing, in particular whether communities are more distantly or closely related to each other than expected by chance. Our approach focuses on assembling highly complete datasets for both North American phylogenetics and species’ distributions, although available distribution data are still relatively coarse-grained (100 km x 100 km grids).

We also apply a growing set of phylodiversity metrics, including now commonplace measures such as phylogenetic diversity (PD) (Faith, 1992) but also relative phylodiversity (RPD) (Mishler et al., 2014), phyloendemism (PE) (Rosauer et al., 2009) and phylobetadiversity (PBeta) (Whittaker, 1972) and associated phylogenetically-informed bioregionalizations. Because phylogenetic diversity approaches require range information, they are also useful for assessing endemism, and use phylogenies to differentiate types of endemic lineages: neoendemics, which result from recent radiations, and paleoendemics, which are relictual, low-diversity, range-restricted groups that were once more widespread. While these metrics are now commonly applied, we provide a short summary of the ones used here in Table 5-1.

We utilize metrics of phylogenetic diversity and endemism to test a key set of drivers and associations, including a unique, direct empirical comparison between butterflies and plants. In particular, and based on a recent analysis of North American plant phylodiversity, we make the following predictions:

1. Regions that have long-term stable climates, and that are warmer, will have higher phylodiversity (Mittelbach et al., 2007; Rohde, 1992). Areas of strong climate instability, which were also those most impacted by glacial-interglacial cycles, will have lower than expected diversity.

2. Relative phylogenetic diversity is highest in areas that have been most stable, accumulating long-surviving lineages, while Western North America and areas that were recently glaciated will have much lower RPD.

3. Endemism in general is highest in areas that are more climatically stable, especially paleoendemism or mixed endemism. Neoendemics may be more common in less stable, heterogeneous areas e.g., mountain regions.

4. Plant phylodiversity and butterfly phylodiversity are highly congruent at broad scales of the analyses, due to co-evolutionary dynamics and specialisms between the two groups, and similarities in underlying landscape and climate drivers.

5. Phylogenetic beta diversity, based on a range-weighted metric, and defined clusters or bioregions based on that turnover, will mirror regions that were based primarily on floristic or habitat analyses, since most butterflies are specialist herbivores.

Materials and Methods

Species Name Assembly

We consolidated a list of currently valid names and all known synonymies. The valid names were derived from a global checklist (Lamas, 2004), and the list was augmented by assembling synonymies from multiple online sources, including Funet (Savela, 2020) and Wikipedia (Wikipedia, 2020). These name resources were used to normalize names from field guides that contained range maps to this master list. In particular, we assembled names from the Kaufman Field Guide to Butterflies of North America (Brock & Kaufman, 2003) and from A Swift Guide to Butterflies of Mexico and Central America (Glassberg, 2018), from which we also digitized range maps (see below). For this study, we excluded species that were endemic to islands near the North American landmass e.g., the Caribbean. Name normalization was performed using the R package taxotools (Barve, 2020) and involved a minimal amount of expert curation. Once all names were normalized to accepted names that were consistent between the two field guide sources, we used those names and associated synonyms to perform two key tasks. First, we re-assigned normalized names to those digitized species range maps where normalization was required. Second, we used normalized names and synonyms to search GenBank for matching genetic resources, as described in more detail below.
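
The logic of normalizing names against a master list can be illustrated with a minimal Python sketch. This is not the taxotools workflow used in this study; the synonym table, the function names and the matching rule (case-insensitive, whitespace-collapsed exact matching) are assumptions made purely for illustration.

```python
# Illustrative master list: accepted name -> known synonyms.
# The entries are examples only, not the checklist used in the study.
SYNONYMS = {
    "Pieris rapae": ["Artogeia rapae", "Papilio rapae"],
}


def build_lookup(synonyms):
    """Map every accepted name and synonym (lower-cased) to its accepted name."""
    lookup = {}
    for accepted, alternatives in synonyms.items():
        lookup[accepted.lower()] = accepted
        for alt in alternatives:
            lookup[alt.lower()] = accepted
    return lookup


def normalize(name, lookup):
    """Return the accepted name, or None if the name needs expert curation."""
    key = " ".join(name.split()).lower()  # collapse whitespace, ignore case
    return lookup.get(key)


LOOKUP = build_lookup(SYNONYMS)
print(normalize("  artogeia   RAPAE ", LOOKUP))
print(normalize("Unlisted name", LOOKUP))
```

Names that return None in such a scheme would correspond to the cases that required expert curation.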

Range Maps and Digitization

Range maps were digitized for each species included in our species list. For Mexico range maps, we used the range maps from A Swift Guide to Butterflies of Mexico and Central America, which are digitally available on Map of Life’s website. For the United States and Canada range maps, we digitized the range maps in the Kaufman Field Guide to Butterflies of North America. To do so, we scanned the entire book and saved the individual range maps as TIFF files. Next, we georeferenced the TIFF files to a map of the United States and Canada, so that we could overlay the butterfly range maps at the correct spatial location. We then traced the range maps of all butterflies manually in QGIS version 3.2 (QGIS Development Team, 2020) to create spatial polygons representing the range maps for every species in our species list. Very few (less than 1%) of the species in field guides lacked an associated range map, and in those cases we used occurrence records and descriptions in field guides to produce range approximations.

All range maps from the two books were joined into a single shapefile consisting of many spatial polygons, which were clipped to only terrestrial areas within North America. We captured key attribute data for all range products, including reported broad on-wing phenology. We did not digitize information about stray distributions for this work, since our key interest was the range of source populations. We therefore also excluded species that only stray into North America. In order to produce highly credible range maps, we set up a rigorous review process in which a subset of the most difficult maps to digitize were reviewed by the authors of the study. This rigor was essential because many maps included very fine-scale stippling of range locations that often required the use of a hand lens to verify.

Some of the wider-ranging species in our analysis have ranges beyond the borders of North America. Even though our focus is on North American phylodiversity, calculations that involve range-weighting, such as phyloendemism or phylobetadiversity, still rely on overall range estimates. We determined a coarse estimate of range extents by generating country-level range maps for every species in our list where this was needed. We generated these country-level range maps utilizing three separate resources:

1. Country-level ranges from the Lepidoptera and other life forms database (Funet; (Savela, 2020)).

2. GBIF data assembled for all relevant species, from which country-level data were extracted.

3. Field-guide data from a trait database that is being assembled as part of the ButterflyNet project (ButterflyNet, 2020).

These maps were individually curated and validated as follows:

1. Country-level maps were compared to reference sources such as Wikipedia distribution descriptions and Butterflies and Moths of North America (BAMONA; (Lotts & Naberhaus, 2017)).

2. Maps with significant disjunctions were all checked by hand.

In some cases, disjunctions represented systemic issues e.g., records in French Guiana that were reported as coming from France. In other cases, disjunctions may represent either real patterns or knowledge gaps across a range. Careful checking of multiple resources (e.g., Butterflies of America, (Warren et al., 2016)) was used to make final validations. These country-level maps are coarse for regions outside North America, but provide a usable approximation of the range size needed for this work.

Once we completed all range assembly steps, we created a 100 km by 100 km resolution grid at the global scale projected to Lambert azimuthal equal area centered at 39 degrees north and 95 degrees west. If a species’ range map intersected the centroid of a grid cell, the species was considered present in that cell. While the map was generated globally, our analysis region for phylodiversity metrics was constrained to North American grid cells (based on our sensu stricto definition of North America above).
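
The centroid rule can be sketched in a few lines of Python. The sketch assumes range polygons are already expressed in projected coordinates in kilometres and uses a simple ray-casting point-in-polygon test; the function names and the toy rectangle are hypothetical, and this is not the GIS workflow used to build the grid in this study.

```python
import math


def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices in km."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def occupied_cells(polygon, cell_size=100.0):
    """Return (column, row) indices of cells whose centroid lies in the range."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    cols = range(math.floor(min(xs) / cell_size), math.floor(max(xs) / cell_size) + 1)
    rows = range(math.floor(min(ys) / cell_size), math.floor(max(ys) / cell_size) + 1)
    return {
        (col, row)
        for col in cols
        for row in rows
        if point_in_polygon((col + 0.5) * cell_size, (row + 0.5) * cell_size, polygon)
    }


# Toy rectangular range, 260 km wide and 120 km tall.
TOY_RANGE = [(0, 0), (260, 0), (260, 120), (0, 120)]
print(sorted(occupied_cells(TOY_RANGE)))
```

In this toy case the range overlaps the upper row of cells, but none of their centroids fall inside it, so those cells are scored as absences. This is the main consequence of a centroid rule for narrow ranges.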

Sequence Alignment and Phylogeny

A phylogeny was constructed by combining data from three sources: GenBank (Clark et al., 2016), Barcode of Life Data System (BOLD) (Ratnasingham & Hebert, 2007), and existing sequences from the ButterflyNet project (ButterflyNet, 2020). In addition, specimens from the McGuire Center for Lepidoptera & Biodiversity (MGCL) were sequenced specifically for this project to add data for many of the species lacking sequence data. All assembled sequence data conformed to currently valid names based on our master taxonomy list. In sum, sequence data for at least one locus were available for >75% of the species that are found in North America.

We provide a brief description here; the methods and data analyses used for constructing this tree are described in full in Chapter 4. The phylogeny was built from scratch using a supermatrix, maximum likelihood (ML) approach, where sequences were aligned using MAFFT v.7 (Katoh & Standley, 2014) and concatenated using FASconCAT-G v.1.02 (Kück & Longo, 2014). ML analyses were conducted using RAxML v.8.2.10 (Stamatakis, 2014) under the GTR+Γ molecular model. Once the tree was constructed, a final check was made to ensure that tip names were consistent with range products.

Phylodiversity metrics can be interpreted differently depending on the phylogeny used: they measure the amount of “feature diversity” contained in a region when using a phylogram, and the amount of “evolutionary history” in a region when using a chronogram. We were interested in both measures, and therefore produced a time-calibrated tree usable for such analyses. Divergence times were calculated using penalized likelihood in treePL (Sanderson, 2002; Smith & O’Meara, 2012) following a congruification approach (Eastman et al., 2013). In particular, we obtained node calibrations from Espeland et al. (2018), extracting date ranges from each of the six family nodes that were concordant with our phylogeny. All phylodiversity metrics were analyzed twice, once using a phylogram and once using a chronogram.

Observed Patterns of Diversity and Endemism

The spatial data set and the phylogeny described above were imported into Biodiverse v.3.0 (Laffan et al., 2010). Tips on the phylogeny were then mapped to species in the spatial data set. We used this information to calculate phylogenetic diversity (PD) and phylogenetic endemism (PE) metrics for equal-area square grid cells (100 x 100 km) across North America. PE metrics take into account overall range sizes of species, including their extents outside of the continent. We also calculated relative phylogenetic diversity (alpha-diversity) (RPD) and relative phylogenetic endemism (RPE). These are ratios of PD and PE compared to a phylogeny with the same topology but with equal branch lengths. These metrics can provide useful information about the relative overall age or feature diversity of a community. For example, younger communities that have more recently radiated are expected to have lower chronogram-derived RPD values. All cell-based values for PD, PE and RPD were exported from Biodiverse and mapped (see Results).
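
To make these definitions concrete, the following Python sketch computes PD, PE and RPD for a toy tree and three toy grid cells. It is an illustration under simplifying assumptions, not the Biodiverse implementation: the tree, the cell contents and the function names are invented, branch ranges are counted only in grid cells, and RPD is computed as proportional PD on the original tree divided by proportional PD on an equal-branch-length tree.

```python
# Toy tree: each node maps to (parent, branch length); "root" has no entry.
TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 1.0),
    "n1": ("root", 2.0),
    "C": ("root", 3.0),
}

# Toy community data: grid cell -> set of tips (species) present.
CELLS = {"c1": {"A", "B"}, "c2": {"A", "C"}, "c3": {"C"}}


def branch_set(tips, tree):
    """All branches (named by their child node) linking the tips to the root."""
    branches = set()
    for tip in tips:
        node = tip
        while node in tree:
            branches.add(node)
            node = tree[node][0]
    return branches


def pd(tips, tree):
    """Phylogenetic diversity: summed length of the branches spanned by tips."""
    return sum(tree[b][1] for b in branch_set(tips, tree))


def rpd(tips, tree):
    """Proportional PD on the tree over proportional PD with equal branches."""
    branches = branch_set(tips, tree)
    total_length = sum(length for _, length in tree.values())
    pd_prop = sum(tree[b][1] for b in branches) / total_length
    pd_prop_equal = len(branches) / len(tree)
    return pd_prop / pd_prop_equal


def pe(cell, cells, tree):
    """Phylogenetic endemism: branch length divided by branch range (number
    of cells in which the branch occurs), summed over branches in the cell."""
    branch_range = {}
    for tips in cells.values():
        for b in branch_set(tips, tree):
            branch_range[b] = branch_range.get(b, 0) + 1
    return sum(tree[b][1] / branch_range[b] for b in branch_set(cells[cell], tree))


print(pd(CELLS["c1"], TREE), pe("c1", CELLS, TREE), rpd(CELLS["c1"], TREE))
```

For cell c1 the spanned branches are A, B and n1, so PD is 4.0. PE is 2.5 because A and n1 each occur in two cells while B occurs in one. RPD is below 1 because the cell holds relatively short branches.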

Statistical Tests of Phylogenetic Clustering and Overdispersion

Phylogenetic diversity and endemism measurements are often highly correlated with taxic diversity (species richness), since each taxon added to a community must also add to the overall PD. A key value of phylodiversity approaches is that these values can be compared to null models in order to determine where phylodiversity values are higher or lower than expected compared to random communities. This is done using a resampling technique, where species present within each grid cell are randomly resampled (keeping the same number of original species). Values for PD, PE, RPD and RPE are then calculated for each randomization iteration, creating a null distribution for each grid cell. A two-tailed test was then applied to PD, PE and RPD randomizations to determine whether or not actual values were significantly high or low when compared to the null distributions.
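
A minimal sketch of such a two-tailed randomization test is given below. It assumes the simplest possible null model, drawing the same number of species at random from a species pool, which is not the randomization algorithm implemented in Biodiverse. The function name, the toy metric and the default number of iterations are assumptions for illustration only.

```python
import random


def randomization_test(observed, pool, metric, n_iter=999, alpha=0.05, seed=1):
    """Rank an observed metric value against a null distribution built by
    redrawing the same number of species from the pool (hypothetical null)."""
    rng = random.Random(seed)
    candidates = sorted(pool)
    obs_value = metric(observed)
    null = [
        metric(set(rng.sample(candidates, len(observed))))
        for _ in range(n_iter)
    ]
    below = sum(v < obs_value for v in null) / n_iter
    above = sum(v > obs_value for v in null) / n_iter
    if below >= 1 - alpha / 2:
        return "high"  # observed exceeds almost all null values
    if above >= 1 - alpha / 2:
        return "low"   # observed is below almost all null values
    return "not significant"


# Toy metric standing in for PD: each species carries an arbitrary weight.
WEIGHTS = {f"sp{i}": float(i) for i in range(20)}


def toy_metric(species):
    return sum(WEIGHTS[s] for s in species)


print(randomization_test({"sp17", "sp18", "sp19"}, WEIGHTS, toy_metric))
print(randomization_test({"sp0", "sp1", "sp2"}, WEIGHTS, toy_metric))
```

In a real analysis the metric would be one of the phylodiversity measures above, evaluated on the full phylogeny for every grid cell.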

RPE randomizations provide a means to categorize different types of phyloendemism. This method, called Categorical Analysis of Neo- And Paleo-Endemism (CANAPE), distinguishes different types of centers of endemism, specifically neo-endemism, paleo-endemism and mixed endemism. As above, randomizations are applied to RPE values per cell (Mishler et al., 2014). CANAPE then further determines if there are higher than expected concentrations of range-restricted shorter (i.e. neo-endemics) or longer (i.e. paleo-endemics) branches, or a mixture of both. Endemism measures, including randomizations, were calculated in Biodiverse and the categorization method for CANAPE was run in R v.3.6.3 (R Core Team, 2019) to determine per-grid-cell phyloendemism types, and plot those results spatially.
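
The two-step CANAPE decision rule can be sketched as follows. The sketch follows our reading of Mishler et al. (2014): a cell must first be significantly high (one-tailed) in PE on either the original tree or the equal-branch-length comparison tree, and is then classified by a two-tailed test on RPE. The function name, argument names and thresholds are assumptions for illustration and do not reproduce the R script used in this study.

```python
def canape_category(p_pe_orig, p_pe_alt, p_rpe, alpha=0.05):
    """Classify one grid cell.

    Each argument is the fraction of null values lying below the observed
    value: PE on the original tree, PE on the equal-branch-length comparison
    tree, and their ratio (RPE).
    """
    one_tailed = 1 - alpha
    # Step 1: the cell must be significantly high in at least one PE measure.
    if p_pe_orig <= one_tailed and p_pe_alt <= one_tailed:
        return "not significant"
    # Step 2: two-tailed test on RPE separates the endemism types.
    if p_rpe > 1 - alpha / 2:
        return "paleo"  # over-representation of long, range-restricted branches
    if p_rpe < alpha / 2:
        return "neo"    # over-representation of short, range-restricted branches
    return "mixed"


print(canape_category(0.99, 0.50, 0.99))
print(canape_category(0.50, 0.99, 0.01))
print(canape_category(0.99, 0.99, 0.50))
```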

We calculated 100 randomizations per grid cell in parallel using a 40 core Dell Xeon PowerEdge standalone server. This allowed us to speed up computation time by at least 10x. Randomizations were merged together using a Perl script and final calculations of significance were performed in Biodiverse. Results were exported as grids and re-imported in R for further plotting and analysis.

Drivers of Phylodiversity and Endemism

Assembly of explanatory variables for diversity and endemism patterns:

We used six variables to analyze the observed phylogenetic diversity and endemism patterns. These included four bioclimatic variables (annual mean temperature, annual precipitation, temperature seasonality [standard deviation * 100], and precipitation seasonality [coefficient of variation]; (Fick & Hijmans, 2017)) and two climate stability variables (temperature stability and precipitation stability; (Owens & Guralnick, 2019)). The climate stability variables represent the inverse of the mean standard deviation between time slices over the past 21,000 years. The climate stability measures at least capture the geographic distribution of climate instability likely during Pleistocene glacial-interglacial cycles. All variables were resampled to 100 km grid cells and projected to Lambert azimuthal equal area projection centered at 39 degrees north and 95 degrees west.

Modelling of diversity patterns: Six diversity metrics (PD, RPD, PE, along with randomization tests for all 3 measures) were modelled to test which explanatory variables were most predictive of diversity and diversity significance. Phylogenetic significance analyses were derived from the randomized metrics and divided into binomial datasets, where significantly high values were scored with the value 1 and significantly low values were scored with the value of 0. Cells with non-significant values were removed from the binomial regressions.

We fitted all models as generalized linear models. For the PD, RPD, and PE analyses the Gaussian distribution was used. For the models examining significance, the binomial distribution was used. We used the dredge function from the package MuMIn (Barton, 2009) in R version 3.6.2 to examine all possible models, including all univariate models and a model including only the intercept. We used an information-theoretic approach, ranking models using the corrected Akaike’s Information Criterion (AICc) (Burnham et al., 2002). Models that were a subset of another model examined were not considered to be competitive if within delta AICc ≤ 2. We examined the collinearity of variables of the models by calculating variance-inflation factors (VIF) using the car package, and models with VIF ≥ 5 were also not considered competitive. If perfect separation occurred in our binomial models, indicating model overfitting, we selected the top ranked model that did not lead to perfect separation, and perfectly separating models were not considered competitive. We used delta AICc values and Akaike weights (wi) to rank competing models.
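
The ranking quantities can be written out explicitly. With k estimated parameters, n observations and maximized log-likelihood lnL, AICc = 2k - 2 lnL + 2k(k + 1)/(n - k - 1); delta AICc is the difference from the lowest AICc, and the Akaike weight of model i is exp(-delta_i/2) divided by the sum of that quantity over all models. The Python sketch below applies these formulas to invented log-likelihoods. The model names and values are hypothetical and are not results from this study, and the sketch does not reproduce the MuMIn workflow.

```python
import math


def aicc(log_lik, k, n):
    """Small-sample corrected AIC for k parameters and n observations."""
    return 2 * k - 2 * log_lik + (2 * k * (k + 1)) / (n - k - 1)


def rank_models(models, n):
    """models maps name -> (log-likelihood, k). Returns a list of
    (name, delta AICc, Akaike weight) sorted from best to worst."""
    scores = {name: aicc(ll, k, n) for name, (ll, k) in models.items()}
    best = min(scores.values())
    deltas = {name: score - best for name, score in scores.items()}
    rel_lik = {name: math.exp(-0.5 * d) for name, d in deltas.items()}
    total = sum(rel_lik.values())
    ranked = [(name, deltas[name], rel_lik[name] / total) for name in scores]
    return sorted(ranked, key=lambda row: row[1])


# Hypothetical candidate set: (log-likelihood, number of parameters).
CANDIDATES = {
    "intercept only": (-120.0, 1),
    "temperature": (-100.0, 2),
    "temperature + precipitation": (-98.0, 3),
}

for name, delta, weight in rank_models(CANDIDATES, n=50):
    print(f"{name}: delta AICc = {delta:.2f}, weight = {weight:.3f}")
```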

Similarities Between Butterfly and Plant Phylodiversity

We examined associations between butterfly and plant phylodiversity metrics using Pearson’s correlations and visual comparisons. We gathered the same phylodiversity metrics used here from chronogram-based analyses recently published by Mishler et al. (2020). We resampled from the original resolution of 50 km x 50 km to the 100 km x 100 km used here. We then ran Pearson’s correlations to quantify how similar measures of PD, RPD, and PE are between plants and butterflies. We display the figures of the RPD and PE randomizations for both plants and butterflies in the same projection and resolution to visually compare the randomization results, rather than attempt to quantify those similarities.
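
The comparison can be sketched in Python as a two-step procedure: aggregate the finer grid to the coarser one, then correlate paired cell values. The aggregation rule used here (the mean of each 2 x 2 block, ignoring empty cells) is an assumption, since the resampling method is not specified above. The function names are hypothetical.

```python
import math


def aggregate_2x2(grid):
    """Average each 2 x 2 block of a grid (list of rows); None marks no data."""
    out = []
    for r in range(0, len(grid) - 1, 2):
        row = []
        for c in range(0, len(grid[r]) - 1, 2):
            block = [grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]]
            values = [v for v in block if v is not None]
            row.append(sum(values) / len(values) if values else None)
        out.append(row)
    return out


def pearson(xs, ys):
    """Pearson correlation over cell pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


print(aggregate_2x2([[1, 3, 5, 7], [1, 3, 5, 7]]))
print(pearson([1, 2, 3], [2, 4, 6]))
```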

Phylogenetic Beta Diversity and Regionalization

Phylogenetic beta diversity uses phylogenetic range-weighted turnover (PhyloRWT, (Laffan et al., 2016)) to compare branch lengths between locations. In a PBeta analysis, each grid cell is compared to neighbors to determine which branches are shared across those comparisons and which are unique. Low PBeta values indicate a high proportion of shared branches (low dissimilarity), and high values indicate the converse. We used a range-weighted metric for PBeta because it down-weights wide ranging species and thus has steeper rates of turnover, without associated saturation, across space. Thus, it can also better delineate where biotic transitions happen. PBeta values are then used as inputs into an agglomerative clustering analysis (WPGMA), in order to produce a dendrogram usable to define clusters based on node depth. We utilized Biodiverse 3.0 for these analyses, which uses a corrected measure of weighted endemism to choose tie-breakers in the clustering approach. This weighs endemics more strongly in determining regionalizations. We examined output clustering and determined regions based on well-defined groupings of contiguous sets of grid cells that would be comparable with defined level 1 ecoregions for North America based primarily on plant habitats (Olson et al., 2001). We expected that such a regionalization would show similar groupings to plant regionalizations given the strong, shared co-evolutionary histories between the groups.
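
The clustering step can be illustrated with a minimal WPGMA sketch in Python. At each step the two closest clusters are merged, and the distance from the new cluster to every other cluster is the simple average of the two merged clusters' distances. The toy dissimilarities and function name are invented, and ties are broken by order of appearance rather than by the weighted-endemism tie-breaker used in Biodiverse.

```python
def wpgma(labels, dist):
    """Agglomerative WPGMA clustering.

    dist maps frozenset({a, b}) -> dissimilarity for every pair of labels.
    Returns the merges as (cluster_a, cluster_b, height) in merge order.
    """
    clusters = list(labels)
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: d[frozenset(pair)],
        )
        height = d[frozenset((a, b))]
        merged = (a, b)
        # WPGMA update: unweighted mean of the two merged clusters' distances.
        for other in clusters:
            if other != a and other != b:
                d[frozenset((merged, other))] = (
                    d[frozenset((a, other))] + d[frozenset((b, other))]
                ) / 2
        clusters = [c for c in clusters if c != a and c != b] + [merged]
        merges.append((a, b, height))
    return merges


# Toy turnover values between three grid cells.
TOY_DIST = {
    frozenset(("c1", "c2")): 0.2,
    frozenset(("c1", "c3")): 0.6,
    frozenset(("c2", "c3")): 0.8,
}

for merge in wpgma(["c1", "c2", "c3"], TOY_DIST):
    print(merge)
```

Cutting the resulting dendrogram at a chosen height yields the clusters of grid cells that would be interpreted as candidate bioregions.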

Results

Patterns of Diversity and Endemism

The map of phylogenetic diversity (Figure 5-1a) shows a clear latitudinal pattern across North America, with highest values primarily concentrated in the tropical dry and wet forests in Mexico and the lowest values across the arctic of Canada. While this latitudinal trend should come as no surprise, phylodiversity is more heterogeneous in the western portions of North America, and especially in areas with the highest topographic heterogeneity. Relative phylogenetic diversity (RPD) also peaks in wet and dry tropical forests in Mexico, and remains uniformly high across the eastern temperate forest, Great Plains and southern deserts (Figure 5-1b). By comparison, overall RPD is lower across the temperate West, including the cold deserts, west coast forests, the Mediterranean regions of California, and towards the North into northern ecosystems such as boreal forests and taiga. Phylogenetic endemism (Figure 5-2) shows the same general latitudinal gradient as PD and RPD, but shows areas of higher phylogenetic endemism along temperate sierra mountain ranges in Mexico, the ecotone between plains and desert, and in coast ranges in the Pacific. The highest PE values are located in the tropical wet forests of Mexico, but endemism is significantly higher in the temperate west than east and especially associated with transition zones in the Rockies and Sierras. Still, a key limitation with the analyses here is the coarse scale, which limits assessing where along gradients these patterns peak (Daru et al., 2020).

Statistical Tests of Phylogenetic Clustering

We uncovered highly regionalized patterns of overdispersion and clustering based on PD randomizations (Figure 5-3a). All boreal, taiga and tundra regions showed lower than expected PD, indicative of phylogenetic clustering. Perhaps more surprising, most of the temperate regions in the West, including diverse ecoregions in cold deserts, west coast forests and Mediterranean portions of California, also displayed clustering. Tropical wet and dry forests, already the most phylodiverse areas in North America, also show higher than expected phylodiversity, or overdispersion, when compared to null models. We also note that portions of the south-central semi-arid prairies, especially towards the ecotone with the eastern edge of the Rockies, also show higher than expected phylodiversity. The southern deserts and eastern temperate forests do not show significantly high or low PD. Relative phylogenetic diversity randomization indicates that southern portions of North America have communities with longer branches than expected under null models (Figure 5-3b). This includes not only tropical regions but also semi-arid highlands, and southern deserts into semi-arid plains and prairie. Shorter than expected branch lengths are found in much of the basin range and mountainous regions in the Sierras and Rockies. RPD was non-significant in the eastern temperate forest, northern Great Plains, and northernmost portions of North America.

CANAPE

We present the overall results for phylogenetic endemism randomizations in comparison to plants in Figure 5-6. That figure shows a strong latitudinal signal for PE significance, with higher latitudes containing less than expected phyloendemism, and lower latitudes significantly higher than expected PE. CANAPE provides a means to further dissect these areas of higher than expected endemism in order to detect neoendemism, paleoendemism and mixed areas. Regions of neoendemism are located especially in western forests including the coast ranges and Sierra Nevada (Figure 5-4). Mixed patterns of endemism with both paleo- and neoendemics are found in predominantly drier and more seasonal environments in the West, especially deserts and Mediterranean regions. We also document mixed endemism in the southeastern coastal plains and southern, tropical portions of Florida. While pure paleoendemism is rarer, there is an indication of some such areas in tropical portions of Mexico.

Drivers of Phylodiversity and Endemism

The effects of environmental variables on PD, RPD, and PE were generally consistent across models, but the relative importance of the variables differed (Table 5-2). Annual mean temperature was the most important environmental variable in predicting PD, followed by mean annual precipitation and precipitation seasonality. The environmental variable that was most important in predicting RPD was temperature seasonality, followed by precipitation seasonality. Annual precipitation and precipitation seasonality were the variables that were most important in predicting PE. In general, warmer and wetter areas with high precipitation seasonality and low temperature seasonality had higher phylodiversity and endemism values (Table 5-1). Areas with higher elevation had larger PD and PE values but lower RPD values. Temperature stability and precipitation stability were less explanatory than current climate variables, but temperature stability was significantly associated with higher PD and PE. Precipitation stability was significantly associated with higher RPD in the top model, but was significantly associated with lower PE (Table 5-2).

Temperature was the most important variable in predicting randomizations of PD and RPD for North American butterflies (Table 5-3). Overall, cells with significantly higher than expected values of PD and RPD are found in areas that are warmer and wetter and have high precipitation stability and low temperature stability, with the opposite true for areas that are more clustered. Elevation was an important variable in predicting areas with significant RPD and PE values: higher elevation was positively associated with significantly high RPD and PE.

Similarities Between Butterfly and Plant Phylodiversity

Butterflies and plants of North America displayed a similar pattern of phylogenetic diversity (r = 0.57) when values were scaled to 0-1 and compared, with greater phylogenetic diversity occurring in southern Mexico for both groups. Surprisingly, butterflies and plants did not have similar patterns of RPD (r = -0.09): butterflies again had the greatest RPD values in southern Mexico, while plants displayed a slight trend towards the greatest RPD values in eastern Canada. Butterflies and plants both showed high PE values in southern Mexico, but the similarity was weak (r = 0.35), as plants also had high PE, especially in the northeastern United States, where butterfly endemism is absent.
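The comparison described above, in which per-cell values are scaled to 0-1 and then correlated, can be sketched as follows. This is a minimal illustration that assumes Pearson's r (the correlation coefficient is not named in the text) and uses invented per-cell values; the function names are hypothetical.

```python
from math import sqrt


def min_max_scale(values):
    """Rescale a list of numbers linearly so the minimum is 0 and the maximum is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant series")
    return [(v - lo) / (hi - lo) for v in values]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented PD values for the same four grid cells in two groups.
butterfly_pd = [12.0, 30.0, 45.0, 80.0]
plant_pd = [100.0, 260.0, 240.0, 500.0]

r_scaled = pearson_r(min_max_scale(butterfly_pd), min_max_scale(plant_pd))
r_raw = pearson_r(butterfly_pd, plant_pd)
```

Because min-max scaling is a linear transformation, Pearson's r is unchanged by it (`r_scaled` and `r_raw` agree up to rounding); under this assumption the scaling mainly serves to put both groups on a common scale for mapping and visual comparison.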

Of particular interest is whether deviations of diversity from expected patterns differ for plants and butterflies. The expectation is that drivers of plant phylodiversity should directly relate to butterflies given the tight ecological associations between the two groups. Initial analyses of plant PD showed significantly lower than expected values across the continent (Mishler et al., 2020), which may be an artifact of sampling completeness, so here we focus on comparisons of RPD. Plants showed a strong pattern of significantly lower than expected RPD in western North America, and significantly higher than expected RPD in southern Mexico and eastern North America (Figure 5-5a). Butterflies also have significantly high RPD in tropical wet and dry forests in southern Mexico but, unlike plants, also in southern California and the American Southwest. Also unlike plants, butterflies do not show significantly higher RPD in eastern temperate forests. Both groups show lower than expected RPD in much of western North America (Figure 5-5b). Regions of significant endemism in North America were broadly similar for both groups, with higher than expected values of PE in lower latitudes and lower than expected values of PE in higher latitudes (Figure 5-6). While broadly similar, latitudinal patterns of PE differ somewhat between the groups, with less significantly lower than expected endemism in mid-latitudes for butterflies than for plants.

Phyloregionalization of Butterfly Diversity

PBeta diversity maps (Figure 5-7) do not suggest any latitudinal gradient; rather, topographical barriers and habitat transitions may be most important in areas where lineage turnover is strongest. Areas of noticeably high PBeta are especially clustered towards mountainous regions, including the Sierra Madres, Sierras and Rockies. These may represent sharp, broad-scale ecotones. We also note that PBeta values are especially high in regions of the highest arctic, potentially representing unique circumpolar lineages.

Agglomerative clustering of PBeta values, and selection of contiguous spatial clusters, led to recognition of nine regions (Figure 5-8a). This regionalization of butterfly diversity is strikingly similar in broad aspects to the 14 level 1 terrestrial ecoregions defined by the World Wildlife Fund (Figure 5-8b). Biomes such as the tundra, taiga and western temperate conifer forests largely overlap across the regionalizations.

Other butterfly phyloregions, such as temperate areas in the eastern USA (temperate broadleaf, temperate conifer and temperate grasslands), are aggregated into one region, as are subtropical areas in southern Mexico (subtropical moist and dry broadleaf and conifer forests). Arid regions split into warmer deserts and highlands (in light purple) and colder deserts (in orange). More forested areas in the Northwest make a large region, and to the West there is a narrow but well defined Mediterranean phyloregion.
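The clustering step behind this regionalization can be sketched in a few lines. The example below is a minimal agglomerative clustering of a pairwise turnover matrix that assumes average linkage; the linkage rule, the toy matrix and the function name are illustrative assumptions rather than the settings used here, and the selection of spatially contiguous clusters is not implemented.

```python
def agglomerate(dist, n_clusters):
    """Merge grid cells bottom-up until n_clusters groups remain.

    dist is a symmetric square matrix (list of lists) of pairwise
    phylogenetic turnover; groups are merged by average linkage.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Mean turnover between all cell pairs of the two groups.
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical turnover among four cells: 0 and 1 share most lineages,
# as do 2 and 3, while turnover between the two pairs is high.
turnover = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.9, 0.8],
    [0.8, 0.9, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
regions = agglomerate(turnover, n_clusters=2)
```

Stopping the merging at a chosen number of groups corresponds to cutting the dendrogram at one height; the order of the remaining merges gives the relationships among regions.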

Discussion

We present the first continental-scale phylodiversity analyses for butterflies across North America. This analysis is notable for being relatively complete, with coarse-scale distribution data for all species, and a phylogeny with >75% sampling of North American species. This level of completeness provides, for the first time, a well-resolved, continental-scale view of phylogenetic diversity for a whole insect suborder. We also extend beyond completeness just in North America by gathering very coarse country-level range maps for the ~25% of species that have ranges outside of our defined North America boundaries. We argue this approach is better than simply truncating ranges for any analysis relying on a range-weighted metric, such as phyloendemism or range-weighted turnover. Here, not including full ranges would have led to many neotropical butterflies having much smaller ranges that truncate at the border of Mexico rather than properly extending into Central and South America. In turn, when calculating range-weighting for endemism analysis, species with these truncated ranges are likely to obscure actual patterns of neo- and paleoendemism.
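The effect of range truncation on a range-weighted metric can be made concrete with a small arithmetic sketch. It assumes equal-area grid cells, so that the weight a branch receives in any one cell is one over the number of cells in its range; the cell counts and the function name are invented for illustration.

```python
def range_weight(cells_in_range):
    """Weight of a branch in one grid cell: 1 / (number of cells in its range)."""
    if cells_in_range < 1:
        raise ValueError("a range must cover at least one cell")
    return 1.0 / cells_in_range


# Hypothetical neotropical species: 5 cells inside the study region and
# 45 further cells in Central and South America.
cells_inside = 5
cells_outside = 45

truncated_weight = range_weight(cells_inside)               # range cut at the border
full_weight = range_weight(cells_inside + cells_outside)    # full range retained
inflation = truncated_weight / full_weight
```

In this invented case the truncated range makes the species appear roughly ten times more range-restricted in each cell it occupies than it really is, which is the kind of artificial endemism signal that retaining full ranges is meant to avoid.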

Below we discuss how our results address key predictions regarding patterns of butterfly diversity across a continent with an enormous breadth of habitats, from the hot and dry deserts in the Southwest to the wet forests in the Yucatan and the cold, arid environments of the taiga and tundra. In particular, we focus on processes that are likely to have shaped this diversity, based both on examination of climatic and topographic drivers and on comparison with a recently published North American plant phylodiversity study (Mishler et al., 2020). We explicitly expected concordance of patterns and processes because of the strong associations and known or presumed co-diversification dynamics between butterflies and their plant hosts.

Patterns and Drivers of Phylogenetic Diversity and Endemism Across North America

Butterfly phylodiversity unsurprisingly shows a latitudinal trend towards decreasing diversity in the North, but this oversimplifies a complex pattern, especially in Western North America. Phylodiversity peaks in tropical wet and dry forests in the southernmost portion of the continent, representing neotropical butterfly communities not found elsewhere in North America (Martínez et al., 2002). However, diversity also remains high in the temperate sierras, including the Madrean pine-oak woodlands in the Southwest, a known plant biodiversity hotspot (Bowers & McLaughlin, 1996). In other portions of the West, PD variation appears more longitudinal, with higher PD in areas with more mountainous terrain. The lowest PD values are in the farthest Northern portions of North America, as might be expected given both strong filtering for cold-adapted species and massive perturbations of the landscape caused by formation of ice sheets during glacial periods (Rowe et al., 2004). While butterfly phylodiversity is low in boreal, taiga and tundra regions, portions of the Canadian cordillera are more diverse.

These results are further borne out when examining climatic drivers, with PD highest in the warmest, wettest areas, where temperature stability over time is also high. RPD and phyloendemism results point not to temperature but to the importance of seasonality and precipitation as drivers. RPD was highest in regions with more equitable year-round temperatures but highly seasonal rainfall patterns, perhaps representing conditions where highly divergent lineages could co-exist and where plant diversity may also be unusually high. However, despite our predictions that temperature stability would be a key predictor of PD and RPD significance, it was never a top predictor in any model. Rather, temperature is always the dominant predictor of PD, RPD and PE significance. Below we summarize these results more thoroughly by major regions across North America, focusing on synthesis across geographic distance and environmental gradients.

Northern North America: We focus on a set of surprising takeaways when considering overall patterns, starting first in Northern North America. The most Northern portions of the continent show low PD but not low RPD, suggesting the importance of environmental filtering due to cold and seasonal conditions, and possibly disequilibrium from -recolonization dynamics across glacial cycles. We argue environmental filtering is more likely than disequilibrium dynamics in a volant group with high reproductive rates (Pellissier et al., 2013) such as butterflies, in comparison to clades where dispersal can often lag behind changing conditions (Alexander et al., 2018). This is further supported by lower than expected phyloendemism and non-significant RPD in this area (Figure 5-6a; Figure 5-3a), suggesting mostly wide-ranging species that are not showing recent, rapid diversification across North.

Western United States: The Western United States, north of the hot deserts, shows very strong patterns of both reduced PD and RPD, unlike the areas in Canada. This suggests a region where butterflies have undergone recent radiations, along with potential for environmental filtering given steep elevational and climatic gradients that themselves were in flux during glacial-interglacial cycling during the Pleistocene (Thackray, 2008). While potential for recent radiations would also suggest higher phyloendemism, especially neo-endemism, across the region, this is only the case in the Mediterranean portion of California, the Sierra Nevada and Pacific coastal regions. Less seasonal, coastal Pacific areas are also the only ones that generally do not have signals of lower than expected PD.

Southern North America: The Southern portion of North America shows particularly surprising results, especially in the deserts. PD and RPD are both higher than expected in tropical regions of North America, consistent with the tropics as a museum for butterfly diversity (Farrera et al., 1999; Hostetler et al., 1999). The small portion of tropical wet forest in the Yucatan region is also an area of lower than expected phyloendemism, the only place in North America besides the far North where we find such a pattern. Yet dry forests and much of the rest of the southern parts of North America have higher than expected phyloendemism, perhaps due to the patchwork of environments in this area (Olson et al., 2001).

In the temperate sierras, and desert regions of the South, we find higher than expected RPD but no indication of clustered or overdispersed PD. This surprising result suggests phylogenetically old communities of butterflies in deserts, a region that formed recently, during the mid-. It has long been known that plants found in this region are derived from related lineages in thornscrub and arid highlands that are phylogenetically much older (Axelrod, 1959). Butterflies in current desert areas may be connected to lineages that persisted in subtropical, yet still seasonal habitats, in the South, and less closely related to species in the Northwest. We turn to comparisons with plants and phylogenetic beta-diversity in more detail below.

Eastern United States: The Eastern US, including the Great Plains and Eastern Temperate Forests, is most remarkable for being unremarkable, which contrasts with spatial phylodiversity findings for plants (see below). In particular, the Great Plains, which became grassland-dominated in the Miocene and Pliocene, do not show an indication of younger lineages based on RPD, as shown in other groups (Mishler et al., 2020). We also did not find that tropical regions of Florida are older or more diverse than expected. While neither PD nor RPD was significant, the Southern Coastal Plains do show significantly high levels of mixed phyloendemism, which aligns well with known plant biodiversity hotspots (Myers et al., 2000).

Comparisons of Spatial Phylodiversity Patterns Between Plants and Butterflies

The present study provides the first quantitative comparison between plant and butterfly diversification patterns at a continental scale. As expected, given strong ecological associations that have played out over evolutionary time-frames, there are broad similarities between butterfly and plant patterns (Kumar et al., 2009). However, to our surprise, there are also striking differences, which may reflect a mix of methodological differences, sampling issues, and biological signal. We used consistent PD metrics across both studies, allowing for direct comparisons of outputs. However, sampling completeness varies dramatically across studies. The study on plant phylodiversity in North America, encompassing at least an order of magnitude more diversity than butterflies, is still mostly incomplete in terms of both phylogenetic and spatial distribution information. These issues with completeness make strong assessments of patterns more challenging. For example, Mishler et al. (2020) recovered a pattern of lower than expected phylodiversity across the continent, which likely represents sampling issues, due to high bias between well sampled and poorly sampled areas. This bias prevents more concrete comparisons between plant PD and butterfly PD. Still, for the most resolved plant phylodiversity results, initial comparisons are useful and illuminating.

While it has long been known that Western North America has been dramatically reshaped by regional tectonism, orogeny and climatic changes, the full magnitude of those impacts on flora and fauna besides vertebrates (Badgley, 2010) is just now starting to be understood (Pellissier et al., 2018). Plant phylodiversity work has confirmed this in spectacular fashion, with eastern temperate forests showing older than expected plant lineages and the plains and western portions of North America showing more recent diversifications (Mishler et al., 2020). We expected to find congruent results when examining RPD in butterflies. Our results strongly confirm that Western North America, especially in temperate regions, has lower than expected RPD for both plants and butterflies. However, we did not recover higher than expected butterfly RPD in the East. We argue that more stable areas such as the eastern portion of North America, and especially the Southeast, may show moderate discordance across groups, but in a consistent manner. Butterflies must have diversified in the shadow of a persistent, highly diverse angiosperm-dominated forest in eastern temperate North America. Given both the earlier origination of forests and the massive inequality in numbers of butterfly and plant species in the region, providing ample opportunity for evolving new host relationships, the overall effect is potential for later diversifications, i.e., lower RPD, for butterflies compared to plants. Further examination, in other groups, is warranted to see if such ordering effects may be more general. To be clear, we do expect both plants and their consumers to share similar patterns of phylodiversity in some regions, and this is indeed the case in the tropics, where RPD is high for both butterflies and plants. However, our hypothesis based on these data is that when there is discordance, the direction should be towards older lineages of plants.

Phylogenetic endemism patterns for plants and butterflies are highly concordant, with higher than expected values of PE across the south and lower than expected values in the north. While plant PE in Mishler et al. (2020) is lower than expected across much of North America, butterflies show a more stratified pattern, with lower than expected PE only in the farther North. One explanation for these differences is methodological. Mishler et al. (2020) truncated ranges at the southern edges of the region of interest, which leads to artificial range restrictions, and in turn can impact randomization results across other regions. We interpret their results with caution. Still, the congruence between plants and butterflies for areas with higher than expected phyloendemism is notable, as is the alignment with known biodiversity hotspots. Plants and butterflies both have centers of diversity and endemism in the Mediterranean regions of California, the Madrean Oak-Pine Woodlands, Mesoamerica, and the Southern Coastal Plains. This congruence is discussed more in the section on phylogenetic regionalizations and conservation prioritization.

Phylogenetic Turnover and Regionalization

Agglomerative weighted clustering of range-weighted pBeta results showed spatial regions that strongly align with WWF-defined North American terrestrial biomes (Olson et al., 2001). Kemp et al. (2017) show that insect species turnover is significantly related to plant community structure, and alignments between these regionalizations are supported at the broadest scale. We note that the resolution of our study, due to the coarse grain of range maps, cannot distinguish some of the finer-scale patterns seen in the much higher resolution results of more traditional eco-regionalization approaches. This makes finer discrimination of regions such as the Marine West Coast Forests (Figure 5-8, red) and Tropical Humid Forests in south Florida (Figure 5-8, pink) more challenging. Finer classifications, especially the extensive regions covering much of the Eastern Forests and Great Plains, are discussed below. While coarse, the classifications found here provide a compelling match with other regionalizations, generally aligning with WWF ecoregions in the Northern portion of the continent, the California Mediterranean, hot desert ecoregions, and .

While finer-scale details are obscured by the necessarily coarse scale of our analysis, we were particularly surprised not to find a distinctive bioregion for butterflies that aligns with the Great Plains (Figure 5-8, purple). The Great Plains ecoregion is a floristically distinct area that has persisted since aridification of temperate regions during the Miocene (Ray & Adams, 2001). Clustering analysis indicates that this area and the Eastern Temperate Forest do not form distinctive phylogroups, and even further subclassifications did not retrieve a cohesive plains bioregion (Figure 5-9). Subclassifying the Eastern Temperate Forest does, however, separate the southern coastal plains butterfly communities from core Eastern Temperate Forests, mixed woods to the North, and northern prairies to the Northwest. The habitat-defined south-central arid prairies here form a phylogenetic cluster with the warm deserts. While the Mississippi River is a barrier that appears to strongly structure plant species and community structure (Soltis et al., 2006), we do not find the same patterns for the more mobile butterflies. Rather, based on the raw range-weighted turnover results, it may be that major topographic and climatic barriers are more important for defining range boundaries.

Clustering also provides key insights into relationships among bioregions, and in particular we call attention to the close relationship between warm desert bioregions (Figure 5-8, orange) and the bioregion encompassing the Mexican arid highlands and temperate sierras (Figure 5-8, yellow). This grouping is clearly separated from another large grouping that stretches across the continent at mid-latitudes, from the California Mediterranean, to mountains and cold deserts, and the vast eastern ecoregions. This unexpected result of warm desert butterfly phylodiversity having stronger affinities to the south, composed of species from older lineages, rather than to the cold deserts to the North, composed of species that more recently radiated, may help explain the otherwise confounding result of higher than expected relative phylogenetic diversity of butterflies in a region that is known to have formed relatively recently, in the Pliocene (Axelrod, 1959; Daubenmire, 1978). While it is possible that butterfly historical biogeographic patterns are simply different from what is seen in plants, another possibility is that a more complete analysis of North American plant phylodiversity will uncover a similar pattern, especially given the paleobotanical literature, which has long documented a subtropical to warm temperate origin for desert flora (Axelrod, 1959).

Importance for Conservation and Conclusions

Butterflies are under threat, perhaps represented most iconically by Monarchs and their declines, but the whole fauna may be imperiled (Wepprich et al., 2019). Results here provide a needed step for better prioritizing areas of highest conservation need. In particular, we document areas high in mixed phyloendemism, PD and RPD, which collectively harbour more accumulated evolutionary history (Redding et al., 2008) and also contain the most range-restricted species. These areas are likely to be foci for conservation (Davies & Cadotte, 2011).

Within North America, there are four key biodiversity hotspots: the California Floristic Province, North American Coastal Plain, Madrean Pine-Oak Woodlands and Mesoamerica, with perhaps more that are unrecognized (Noss et al., 2015). These areas were recognized due to their high plant diversity and endemism and threat of extinction (Myers et al., 2000). Butterfly PD and PE are also higher than expected within all four of these hotspots, strengthening arguments for protecting habitat in them. However, our results also uncovered significant endemism, old lineages (based on higher than expected RPD) and relatively high phylogenetic diversity in the warm deserts of North America. While the Sonoran, Chihuahuan and Mojave deserts may not at first appear to be hotspots for butterflies, our results make a strong case for habitat conservation across these regions in particular, especially since they are not already documented biodiversity hotspots.

The present study examined butterfly phylodiversity across North America and tested correlations of PD and RPD metrics between butterflies and plants. While the majority of phylodiversity studies have been on vertebrates and plants, insects are the most diverse group on Earth and are among the groups best suited to addressing continental-scale questions about the structure and drivers of diversity. While North America has a relatively well-known fauna and flora, significant phylogenetic and spatial gaps in knowledge exist in most regions of the world. We hope that this study sets the stage for future global projects on phylodiversity beyond butterflies.

Figure 5-1. Maps depicting observed phylogenetic diversity and relative phylogenetic diversity values for North American butterflies. A) Phylogenetic Diversity (PD). B) Relative Phylogenetic Diversity (RPD).


Figure 5-2. Observed phylogenetic endemism values for North American butterflies.

Figure 5-3. Maps depicting significant phylogenetic diversity patterns and significant relative phylogenetic diversity patterns for butterflies. A) Significant phylogenetic diversity (PD) patterns for butterflies. Areas with significantly high values have taxa that are less closely related than expected by chance, while areas with significantly low values have taxa that are more closely related than expected by chance. B) Significant relative phylogenetic diversity (RPD) patterns for butterflies. Areas in pink have significantly longer branches than expected; areas in turquoise have significantly shorter branches than expected.


Figure 5-4. CANAPE results showing randomization of phylogenetic endemism. All cells that are colored have significantly high PE. Red cells have concentrations of rare short branches (neoendemism); blue cells have concentrations of rare long branches (paleoendemism), and purple cells have mixtures of neo- and paleoendemism.

Figure 5-5: Significant relative phylogenetic diversity patterns for A) plants and B) butterflies. Butterfly results were reprojected and scaled to have the same extent as the plant results.


Figure 5-6: Significant phylogenetic endemism (PE) patterns for A) plants and B) butterflies. Butterfly results were reprojected and scaled to have the same extent as the plant results.

Figure 5-7: Phylogenetic beta diversity values for North American butterflies.


Figure 5-8: Regionalization results and comparisons for North American butterflies. A) Clustering results for butterflies compared to B) WWF ecoregions. C) Dendrogram depicting the relationships between clusters. Clusters are colored based on similarity between the two maps.


Figure 5-9: Subregion classification of the Eastern US bioregions showing clusters aligning with the southeastern coastal plains, core temperate forests, northern mixed hardwoods and northern prairies.

Table 5-1: Summary of commonly used phylodiversity metrics.

Diversity Metrics

Phylogenetic Diversity (PD): Measured as the sum of branch lengths connecting the terminal taxa present in each location (usually to the root of the tree).

Phylogenetic Endemism (PE): Like PD, but measured on a tree where the branches are weighted by the fraction of their geographic range found in that location (a Range-Weighted Tree).

Relative Phylogenetic Diversity (RPD): Ratio of PD measured on the original tree to PD measured using a comparison tree with the same topology, but where each branch is adjusted to be of equal length.

Categorical Analysis of Neo- and Paleo-Endemism (CANAPE): Geographic centers of endemism are identified using PE and then classified as paleo, neo or mixed based on RPE.

Randomization Tests: These metrics are tested for statistical significance using a spatially structured randomization that re-assigns terminal taxon occurrences on the map, subject to two constraints: the range size of each taxon and the richness of each location are held constant.

Phylo-turnover (Phylobetadiversity): Measured by comparing the lengths of branches of the overarching tree shared and unshared among pairs of locations.

Range-weighted phylo-turnover: Values of phylo-turnover, but branches are weighted by the fraction of their geographic range found in that location (a Range-Weighted Tree).
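The definitions of PD, PE and RPD in Table 5-1 can be illustrated on a toy example. The sketch below uses an invented three-tip tree and three grid cells; it assumes equal-area cells (so a branch's weight in a cell is one over the number of cells in its range) and assumes that, for RPD, PD on each tree is expressed as a proportion of that tree's total length. All names and values are hypothetical.

```python
# Toy tree ((A,B),C): each branch is (length, set of tips descending from it).
BRANCHES = [
    (1.0, {"A"}),
    (1.0, {"B"}),
    (2.0, {"A", "B"}),  # internal branch subtending A and B
    (3.0, {"C"}),
]

# Invented occurrences: grid cell -> tips present in that cell.
CELLS = {"cell1": {"A", "B"}, "cell2": {"B", "C"}, "cell3": {"C"}}


def branch_range(tips):
    """Cells in which at least one descendant tip of a branch occurs."""
    return {cell for cell, present in CELLS.items() if present & tips}


def phylo_diversity(cell, branches=BRANCHES):
    """PD: summed length of branches with a descendant present in the cell."""
    return sum(length for length, tips in branches if CELLS[cell] & tips)


def phylo_endemism(cell, branches=BRANCHES):
    """PE: as PD, but each branch length is divided by its range size."""
    return sum(
        length / len(branch_range(tips))
        for length, tips in branches
        if CELLS[cell] & tips
    )


def relative_pd(cell):
    """RPD: proportional PD on the original tree over proportional PD
    on a comparison tree with the same topology and equal branch lengths."""
    equal = [(1.0, tips) for _, tips in BRANCHES]
    total = sum(length for length, _ in BRANCHES)
    total_equal = sum(length for length, _ in equal)
    observed = phylo_diversity(cell) / total
    comparison = phylo_diversity(cell, equal) / total_equal
    return observed / comparison
```

In this toy case `relative_pd("cell3")` exceeds 1 because the only branch present there is the long branch leading to C, while `relative_pd("cell1")` falls below 1 because cell1 holds mostly short branches. Summing `phylo_endemism` over all cells recovers the total tree length, since each branch is apportioned across the cells of its range.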


Table 5-2. Summary of the top and null linear models for PD, RPD, and PE. Numbers in the columns indicate changes in PD, RPD, and PE values when variable values between locations increased by one standard deviation. Models are ranked based on the difference from the top model in Akaike’s Information Criterion (∆ AIC) and competing models with an Akaike weight (wi) of at least 0.05 are shown in the table. Models that were a subset of another model and within ∆ AIC of 2 were not considered to be competitive. Additionally, models with variance-inflation factors (VIF) 5 were not considered competitive. All variables contributed significantly (P < 0.05) to the models.

Model, Temp, Precip, Precip Season, Temp Season, Elevation, Precip Stability, Temp Stability, ∆ AIC, wi

PD - Top 0.043 0.027 0.022 1.02E-5 0.013 0 1.000
PD - Null 1859 0.000
RPD - Top 0.011 0.021 -0.047 -0.018 0.004 0.008 0 0.924
RPD - 2nd 0.010 0.020 -0.046 -0.017 0.010 5.645 0.055
RPD - Null 1768 0.000
PE - Top 9.54E-5 1.12E-4 1.05E-4 3.19E-5 -2.61E-5 8.41E-5 0 0.995
PE - Null 1300 0
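The ∆ AIC and Akaike weight (wi) columns follow from the AIC values of the full candidate model set. A minimal sketch of the standard calculation is given below; the AIC values are invented for illustration and are not those underlying Table 5-2, and the function name is hypothetical.

```python
from math import exp


def akaike_weights(aic_values):
    """Return (delta_aic, weights) for a list of model AIC values.

    delta_i = AIC_i - min(AIC); w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2).
    """
    best = min(aic_values)
    deltas = [aic - best for aic in aic_values]
    relative_likelihoods = [exp(-0.5 * d) for d in deltas]
    total = sum(relative_likelihoods)
    return deltas, [rl / total for rl in relative_likelihoods]


# Invented AIC values for three candidate models (top, second, null).
deltas, weights = akaike_weights([1000.0, 1004.0, 2500.0])
```

Because the weights are normalized over every model in the candidate set, weights computed from only the rows displayed in a table need not match the reported values when additional models with small weights were fitted but not shown.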


Table 5-3. Summary of the top and null binomial regression models for randomizations of PD, RPD, and PE. Numbers in the columns indicate changes in randomizations of PD, RPD, and PE values when variable values between locations increased by one standard deviation. Models are ranked based on the difference from the top model in Akaike’s Information Criterion (∆ AIC) and competing models with an Akaike weight (wi) of at least 0.05 are shown in the table. Models that were a subset of another model and within ∆ AIC of 2 were not considered to be competitive. Additionally, models with variance-inflation factors (VIF) 5 or models with perfect separation were not considered competitive. All variables contributed significantly (P < 0.05) to the models.

Model, Temp, Precip, Precip Season, Temp Season, Elev, Precip Stability, Temp Stability, ∆ AIC, wi

0.79 PD Rand - Top 8.22 0.76 1.04 -1.17 0.00
PD Rand - 2nd 7.97 0.79 -0.97 4.84 0.07
PD Rand - Null 713.18 0.00 0.00
RPD Rand - Top 8.06 1.53 0.70 1.01 0.74 -1.54 0 0.79
RPD Rand - 2nd 8.83 1.44 1.17 -1.56 2.64 0.21
RPD Rand - Null 636.57 0 0
PE Rand - Top 4.31 1.52 0.00 1.00
PE Rand - Null 716.55 0.00


CHAPTER 6 CONCLUSIONS

The increasing pace of digital specimen data mobilization, coupled with the rapid development of tools and protocols for the novel use of these data, has placed natural history museums and herbaria at the forefront of biodiversity research. Rapid technological advances have unlocked multiple avenues for using these data at large spatial and temporal scales. Tools to build megaphylogenies provide clear snapshots of biodiversity and make it possible to infer diversity at large scales from an evolutionary standpoint. The use of machine learning has transformative potential in biodiversity research and can be applied at large scales for species identification or monitoring. Large-scale informatics tools such as these place natural history museums at the forefront of biodiversity research and on the cusp of big data science.


LIST OF REFERENCES

Alexander, J. M., Chalmandrier, L., Lenoir, J., Burgess, T. I., Essl, F., Haider, S., Kueffer, C., McDougall, K., Milbau, A., Nuñez, M. A., Pauchard, A., Rabitsch, W., Rew, L. J., Sanders, N. J., & Pellissier, L. (2018). Lags in the response of mountain plant communities to climate change. Global Change Biology, 24(2), 563–579. https://doi.org/10.1111/gcb.13976

Allen, J. M., Germain-Aubrey, C. C., Barve, N., Neubig, K. M., Majure, L. C., Laffan, S. W., Mishler, B. D., Owens, H. L., Smith, S. A., Whitten, W. M., Abbott, J. R., Soltis, D. E., Guralnick, R., & Soltis, P. S. (2019). Spatial Phylogenetics of Florida Vascular Plants: The Effects of Calibration and Uncertainty on Diversity Estimates. IScience, 11, 57–70. https://doi.org/10.1016/j.isci.2018.12.002

Allio, R., Scornavacca, C., Nabholz, B., Clamens, A.-L., Sperling, F. A., & Condamine, F. L. (2020). Whole Genome Shotgun Phylogenomics Resolves the Pattern and Timing of Swallowtail Butterfly Evolution. Systematic Biology, 69(1), 38–60. https://doi.org/10.1093/sysbio/syz030

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403–410. https://doi.org/10.1016/S0022-2836(05)80360-2

Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25(17), 3389–3402. https://doi.org/10.1093/nar/25.17.3389

AmphibiaWeb. (2020). AmphibiaWeb: Information on amphibian biology and conservation. https://amphibiaweb.org/

Ané, J.-M., Folk, R., Guralnick, R., Kirst, M., Roy, S., Soltis, P., & Soltis, D. (2018). NITFIX Project. https://nitfix.org/

Ariño, A. H. (2010). Approaches to estimating the universe of natural history collections data. Biodiversity Informatics, 7(2). https://doi.org/10.17161/bi.v7i2.3991

Atlas of Living Australia. (2020). https://www.ala.org.au/

Axelrod, D. I. (1959). Late Cenozoic Evolution of the Sierran Bigtree Forest. Evolution, 13(1), 9. https://doi.org/10.2307/2405942

Badgley, C. (2010). Tectonics, topography, and mammalian diversity. Ecography, 33(2), 220–231. https://doi.org/10.1111/j.1600-0587.2010.06282.x

Barton, K. (2009). MuMIn: Multi-model inference. R package version 0.12.2/r18.

Barve, V. (2020). taxotools: Tools to handle taxonomic data (R package version 0.0.43). Retrieved from https://cran.r-project.org/web/packages/taxotools/index.html

Basset, Y., & Lamarre, G. P. A. (2019). Toward a world that values insects. Science, 364(6447), 1230–1231. https://doi.org/10.1126/science.aaw7071

Bininda-Emonds, O. R. P., Cardillo, M., Jones, K. E., MacPhee, R. D. E., Beck, R. M. D., Grenyer, R., Price, S. A., Vos, R. A., Gittleman, J. L., & Purvis, A. (2007). The delayed rise of present-day mammals. Nature, 446(7135), 507–512. https://doi.org/10.1038/nature05634

Bintanja, R., & van de Wal, R. S. W. (2008). North American ice-sheet dynamics and the onset of 100,000-year glacial cycles. Nature, 454(7206), 869–872. https://doi.org/10.1038/nature07158

Bowers, J. E., & McLaughlin, S. P. (1996). Flora of the Huachuca Mountains, a botanically rich and historically significant sky island in Cochise County. J. Ariz.-Nev. Acad. Sci., 66–107.

Braby, M. F., Vila, R., & Pierce, N. E. (2006). Molecular phylogeny and systematics of the Pieridae (Lepidoptera: Papilionoidea): Higher classification and biogeography. Zoological Journal of the Linnean Society, 147(2), 239–275. https://doi.org/10.1111/j.1096-3642.2006.00218.x

Braga, M. P., Guimarães, P. R., Wheat, C. W., Nylin, S., & Janz, N. (2018). Unifying host-associated diversification processes using butterfly–plant networks. Nature Communications, 9(1), 1–10. https://doi.org/10.1038/s41467-018-07677-x

Breinholt, J. W., Earl, C., Lemmon, A. R., Lemmon, E. M., Xiao, L., & Kawahara, A. Y. (2018). Resolving Relationships among the Megadiverse Butterflies and Moths with a Novel Pipeline for Anchored Phylogenomics. Systematic Biology, 67(1), 78–93. https://doi.org/10.1093/sysbio/syx048

Brock, J., & Kaufman, K. (2003). Field Guide to Butterflies of North America. Houghton Mifflin.

Brower, A. V. Z. (1996). Parallel Race Formation and the Evolution of Mimicry in Heliconius Butterflies: A Phylogenetic Hypothesis from Mitochondrial DNA Sequences. Evolution, 50(1), 195–221. https://doi.org/10.1111/j.1558-5646.1996.tb04486.x

Brower, A. V. Z. (2000). Phylogenetic relationships among the Nymphalidae (Lepidoptera) inferred from partial sequences of the wingless gene. Proceedings of the Royal Society of London. Series B: Biological Sciences, 267(1449), 1201–1211. https://doi.org/10.1098/rspb.2000.1129

Bühlmann, P., & van de Geer, S. (2011). Statistics for High-Dimensional Data. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-20192-9

Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach. Springer-Verlag.

ButterflyNet. (2020). https://www.butterflynet.org/

Cameron, S. A., Hines, H. M., & Williams, P. H. (2007). A comprehensive phylogeny of the bumble bees (Bombus). Biological Journal of the Linnean Society, 91(1), 161–188. https://doi.org/10.1111/j.1095-8312.2007.00784.x

Campbell, D. C., Serb, J. M., Buhay, J. E., Roe, K. J., Minton, R. L., & Lydeard, C. (2005). Phylogeny of North American amblemines (Bivalvia, Unionoida): Prodigious polyphyly proves pervasive across genera. Invertebrate Biology, 124(2), 131–164. https://doi.org/10.1111/j.1744-7410.2005.00015.x

Cao, Y., Hao, J. S., Sun, X. Y., Zheng, B., & Yang, Q. (2016). Molecular phylogenetic and dating analysis of pierid butterfly species using complete mitochondrial genomes. Genetics and Molecular Research, 15(4). https://doi.org/10.4238/gmr15049196

Chazot, N., Wahlberg, N., Freitas, A. V. L., Mitter, C., Labandeira, C., Sohn, J.-C., Sahoo, R. K., Seraphim, N., de Jong, R., & Heikkilä, M. (2019). Priors and Posteriors in Bayesian Timing of Divergence Analyses: The Age of Butterflies Revisited. Systematic Biology, 68(5), 797–813. https://doi.org/10.1093/sysbio/syz002

Clark, G., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., & Sayers, E. W. (2016). GenBank. Nucleic Acids Research, 44(D1), D67–D72. https://doi.org/10.1093/nar/gkv1276

Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., & de Hoon, M. J. L. (2009). Biopython: freely available Python tools for compu... Bioinformatics, 25(11), 1422–1423. https://doi.org/10.1093/bioinformatics/btp163

Colla, S. R., Gadallah, F., Richardson, L., Wagner, D., & Gall, L. (2012). Assessing declines of North American bumble bees (Bombus spp.) using museum specimens. Biodiversity and Conservation, 21(14), 3585–3595. https://doi.org/10.1007/s10531-012-0383-2

Condamine, F. L., Nabholz, B., Clamens, A. L., Dupuis, J. R., & Sperling, F. A. H. (2018). Mitochondrial phylogenomics, the origin of swallowtail butterflies, and the impact of the number of clocks in Bayesian molecular dating. Systematic Entomology, 43(3), 460–480. https://doi.org/10.1111/syen.12284

Cong, Q., Zhang, J., Shen, J., & Grishin, N. (2019). Fifty new genera of Hesperiidae (Lepidoptera). Insecta Mundi. https://digitalcommons.unl.edu/insectamundi/1233

Constable, H., Guralnick, R., Wieczorek, J., Spencer, C., & Peterson, A. T. (2010). VertNet: A New Model for Biodiversity Data Sharing. PLoS Biology, 8(2), e1000309. https://doi.org/10.1371/journal.pbio.1000309

Corbet, A. S., Pendlebury, H. M., & Eliot, J. N. (1992). The butterflies of the Malay Peninsula. Malayan Nature Society.

Danks, H. V. (1994). Regional Diversity of Insects in North America. American Entomologist, 40(1), 50–55. https://doi.org/10.1093/ae/40.1.50

Daru, B. H., Farooq, H., Antonelli, A., & Faurby, S. (2020). Endemism patterns are scale dependent. Nature Communications, 11(1), 2115. https://doi.org/10.1038/s41467-020-15921-6

Daubenmire, R. (1978). Plant Geography. Elsevier Science.

Davies, T. J., & Buckley, L. B. (2011). Phylogenetic diversity as a window into the evolutionary and biogeographic histories of present-day richness gradients for mammals. Philosophical Transactions of the Royal Society B: Biological Sciences, 366(1576), 2414–2425. https://doi.org/10.1098/rstb.2011.0058

Davies, T. J., & Cadotte, M. W. (2011). Quantifying Biodiversity: Does It Matter What We Measure? In Biodiversity Hotspots (pp. 43–60). https://doi.org/10.1007/978-3-642-20992-5_3

Ding, C., & Zhang, Y. (2017). Phylogenetic relationships of Pieridae (Lepidoptera: Papilionoidea) in China based on seven gene fragments. Entomological Science, 20(1), 15–23. https://doi.org/10.1111/ens.12214

Driskell, A. C., Ané, C., Burleigh, J. G., McMahon, M. M., O’Meara, B. C., & Sanderson, M. J. (2004). Prospects for building the tree of life from large sequence databases. Science, 306(5699), 1172–1174. https://doi.org/10.1126/science.1102036

Eastman, J. M., Harmon, L. J., & Tank, D. C. (2013). Congruification: support for time scaling large phylogenetic trees. Methods in Ecology and Evolution, 4(7), 688–691. https://doi.org/10.1111/2041-210X.12051

Edgar, R. C. (2004). MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5), 1792–1797. https://doi.org/10.1093/nar/gkh340

Ehrlich, P. R., & Raven, P. H. (1964). Butterflies and Plants: A Study in Coevolution. Evolution, 18(4), 586–608.

Eliot, J. N. (1974). The higher classification of the Lycaenidae (Lepidoptera): a tentative arrangement. Bull Br Mus Nat Hist Entomol, 28, 371–505.

Espeland, M., Breinholt, J., Willmott, K. R., Warren, A. D., Vila, R., Toussaint, E. F. A., Maunsell, S. C., Aduse-Poku, K., Talavera, G., Eastwood, R., Jarzyna, M. A., Guralnick, R., Lohman, D. J., Pierce, N. E., & Kawahara, A. Y. (2018). A Comprehensive and Dated Phylogenomic Analysis of Butterflies. Current Biology, 28(5), 770-778.e5. https://doi.org/10.1016/j.cub.2018.01.061

Espeland, M., Hall, J. P. W., DeVries, P. J., Lees, D. C., Cornwall, M., Hsu, Y.-F., Wu, L.-W., Campbell, D. L., Talavera, G., Vila, R., Salzman, S., Ruehr, S., Lohman, D. J., & Pierce, N. E. (2015). Ancient Neotropical origin and recent recolonisation: Phylogeny, biogeography and diversification of the Riodinidae (Lepidoptera: Papilionoidea). Molecular Phylogenetics and Evolution, 93, 296–306. https://doi.org/10.1016/j.ympev.2015.08.006

Faith, D. P. (1992). Conservation evaluation and phylogenetic diversity. Biological Conservation, 61(1), 1–10. https://doi.org/10.1016/0006-3207(92)91201-3

Farrera, I., Harrison, S. P., Prentice, I. C., Ramstein, G., Guiot, J., Bartlein, P. J., Bonnefille, R., Bush, M., Cramer, W., von Grafenstein, U., Holmgren, K., Hooghiemstra, H., Hope, G., Jolly, D., Lauritzen, S. E., Ono, Y., Pinot, S., Stute, M., & Yu, G. (1999). Tropical climates at the Last Glacial Maximum: A new synthesis of terrestrial palaeoclimate data. I. Vegetation, lake-levels and geochemistry. Climate Dynamics, 15(11), 823–856. https://doi.org/10.1007/s003820050317

Federhen, S. (2012). The NCBI Taxonomy database. Nucleic Acids Research, 40(D1), D136–D143. https://doi.org/10.1093/nar/gkr1178

Fei-Fei, L. (2010). ImageNet: crowdsourcing, benchmarking & other cool things. CMU VASC Seminar. http://image-net.org/index

Feng, Y. J., Blackburn, D. C., Liang, D., Hillis, D. M., Wake, D. B., Cannatella, D. C., & Zhang, P. (2017). Phylogenomics reveals rapid, simultaneous diversification of three major clades of Gondwanan frogs at the Cretaceous–Paleogene boundary. Proceedings of the National Academy of Sciences of the United States of America, 114(29), E5864–E5870. https://doi.org/10.1073/pnas.1704632114

Fick, S. E., & Hijmans, R. J. (2017). WorldClim 2: new 1-km spatial resolution climate surfaces for global land areas. International Journal of Climatology, 37(12), 4302–4315. https://doi.org/10.1002/joc.5086

Freitas, A. V. L., & Brown, K. S. (2004). Phylogeny of the Nymphalidae (Lepidoptera). Systematic Biology, 53(3), 363–383. https://doi.org/10.1080/10635150490445670

Funk, V. A. (2018). Collections-based science in the 21st Century. Journal of Systematics and Evolution, 56(3), 175–193. https://doi.org/10.1111/jse.12315

GBIF. (2018). What is GBIF? https://www.gbif.org/what-is-gbif

Geirhos, R., Temme, C. R. M., Rauber, J., Schütt, H. H., Bethge, M., & Wichmann, F. A. (2018). Generalisation in humans and deep neural networks. Advances in Neural Information Processing Systems, 2018-December, 7538–7550. http://arxiv.org/abs/1808.08750

Girosi, F., Jones, M., & Poggio, T. (1995). Regularization Theory and Neural Networks Architectures. Neural Computation, 7(2), 219–269. https://doi.org/10.1162/neco.1995.7.2.219

Glassberg, J. (2018). A Swift Guide to Butterflies of Mexico and Central America (2nd ed.). Princeton University Press.

Godfray, H. C. J., Lewis, O. T., & Memmott, J. (2000). Studying insect diversity in the tropics. In Changes and Disturbance in Tropical Rainforest in South-East Asia (pp. 87–100). https://doi.org/10.1142/9781848160125_0008

Grimaldi, D., & Engel, M. S. (2005). Evolution of the Insects. Cambridge University Press.

Hajibabaei, M., Janzen, D. H., Burns, J. M., Hallwachs, W., & Hebert, P. D. N. (2006). DNA barcodes distinguish species of tropical Lepidoptera. Proceedings of the National Academy of Sciences, 103(4), 968–971. https://doi.org/10.1073/pnas.0510466103

Hansen, O. L. P., Svenning, J. C., Olsen, K., Dupont, S., Garner, B. H., Iosifidis, A., Price, B. W., & Høye, T. T. (2020). Species-level image classification with convolutional neural network enables insect identification from habitus images. Ecology and Evolution, 10(2), 737–747. https://doi.org/10.1002/ece3.5921

Hassan, S. N., A Rahman, N. S., Zaw Htike, Z., & Lei Win, S. (2014). Advances in Automatic Insect Classification. Electrical and Electronics Engineering: An International Journal (ELELIJ), 3(2). https://doi.org/10.14810/elelij.2014.3204

Hawkins, B. A., Rueda, M., Rangel, T. F., Field, R., & Diniz-Filho, J. A. F. (2014). Community phylogenetics at the biogeographical scale: Cold tolerance, niche conservatism and the structure of North American forests. Journal of Biogeography, 41(1), 23–38. https://doi.org/10.1111/jbi.12171

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016-December, 770–778. https://doi.org/10.1109/CVPR.2016.90

Hebert, P. D. N., & Gregory, T. R. (2005). The Promise of DNA Barcoding for Taxonomy. Systematic Biology, 54(5), 852–859. https://doi.org/10.1080/10635150500354886

Hebert, P. D. N., Penton, E. H., Burns, J. M., Janzen, D. H., & Hallwachs, W. (2004). Ten species in one: DNA barcoding reveals cryptic species in the neotropical skipper butterfly Astraptes fulgerator. Proceedings of the National Academy of Sciences of the United States of America, 101(41), 14812–14817. https://doi.org/10.1073/pnas.0406166101

Hedges, S. B., & Kumar, S. (2009). The Timetree of Life. Oxford University Press.

Hegland, S., & Totland, Ø. (2008). Is the magnitude of pollen limitation in a plant community affected by pollinator visitation and plant species specialisation levels? Oikos, 117(6), 883–891. https://doi.org/10.1111/j.0030-1299.2008.16561.x

Heikkilä, M., Kaila, L., Mutanen, M., Peña, C., & Wahlberg, N. (2012). Cretaceous origin and repeated tertiary diversification of the redefined butterflies. Proceedings of the Royal Society B: Biological Sciences, 279(1731), 1093–1099. https://doi.org/10.1098/rspb.2011.1430

Heppner, J. B. (1992). Faunal Regions and Diversity of Lepidoptera. Tropical Biology, vol. 2, suppl. 1. American Entomologist, 38(4), 252–252. https://doi.org/10.1093/ae/38.4.252

Hill, R., Ganeshan, M., Wourms, L., Kronforst, M., Mullen, S., & Savage, W. (2018). Effectiveness of DNA Barcoding in Speyeria Butterflies at Small Geographic Scales. Diversity, 10(4), 130. https://doi.org/10.3390/d10040130

Hinchliff, C. E., Smith, S. A., Allman, J. F., Burleigh, J. G., Chaudhary, R., Coghill, L. M., Crandall, K. A., Deng, J., Drew, B. T., Gazis, R., Gude, K., Hibbett, D. S., Katz, L. A., Dail Laughinghouse, H., McTavish, E. J., Midford, P. E., Owen, C. L., Ree, R. H., Rees, J. A., … Cranston, K. A. (2015). Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proceedings of the National Academy of Sciences of the United States of America, 112(41), 12764–12769. https://doi.org/10.1073/pnas.1423041112

Hipp, R. (2015). SQLite (Version 3.8) [Computer software]. SQLite Development Team.

Hortal, J., de Bello, F., Diniz-Filho, J. A. F., Lewinsohn, T. M., Lobo, J. M., & Ladle, R. J. (2015). Seven Shortfalls that Beset Large-Scale Knowledge of Biodiversity. Annual Review of Ecology, Evolution, and Systematics, 46(1), 523–549. https://doi.org/10.1146/annurev-ecolsys-112414-054400

Hostetler, S. W., Clark, P. U., Bartlein, P. J., Mix, A. C., & Pisias, N. J. (1999). Atmospheric transmission of North Atlantic Heinrich events. Journal of Geophysical Research: Atmospheres, 104(D4), 3947–3952. https://doi.org/10.1029/1998JD200067

Howard, J. (2018). fastai. GitHub. https://github.com/fastai/fastai

iDigBio. (2011). https://www.idigbio.org/

Janzen, D. H., Hallwachs, W., Burns, J. M., Hajibabaei, M., Bertrand, C., & Hebert, P. D. N. (2011). Reading the complex skipper butterfly fauna of one tropical place. PLoS ONE, 6(8). https://doi.org/10.1371/journal.pone.0019874

Jetz, W., Thomas, G. H., Joy, J. B., Hartmann, K., & Mooers, A. O. (2012). The global diversity of birds in space and time. Nature, 491(7424), 444–448. https://doi.org/10.1038/nature11631

Joshi, J., Prakash, A., & Kunte, K. (2017). Evolutionary Assembly of Communities in Butterfly Mimicry Rings. The American Naturalist, 189(4), E58–E76. https://doi.org/10.1086/690907

Karimi, K., Fortriede, J. D., Lotay, V. S., Burns, K. A., Wang, D. Z., Fisher, M. E., Pells, T. J., James-Zorn, C., Wang, Y., Ponferrada, V. G., Chu, S., Chaturvedi, P., Zorn, A. M., & Vize, P. D. (2018). Xenbase: a genomic, epigenomic and transcriptomic model organism database. Nucleic Acids Research, 46(D1), D861–D868. https://doi.org/10.1093/nar/gkx936

Katoh, K., & Standley, D. M. (2014). MAFFT: Iterative refinement and additional methods. Methods in Molecular Biology, 1079, 131–146. https://doi.org/10.1007/978-1-62703-646-7_8

Kawahara, A. Y., & Breinholt, J. W. (2014). Phylogenomics provides strong evidence for relationships of butterflies and moths. Proceedings of the Royal Society B: Biological Sciences, 281(1788), 20140970. https://doi.org/10.1098/rspb.2014.0970

Kemp, J. E., Linder, H. P., & Ellis, A. G. (2017). Beta diversity of herbivorous insects is coupled to high species and phylogenetic turnover of plant communities across short spatial scales in the Cape Floristic Region. Journal of Biogeography, 44(8), 1813–1823. https://doi.org/10.1111/jbi.13030

Kim, M. I., Wan, X., Kim, M. J., Jeong, H. C., Ahn, N.-H., Kim, K.-G., Han, Y. S., & Kim, I. (2010). Phylogenetic relationships of true butterflies (Lepidoptera: Papilionoidea) inferred from COI, 16S rRNA and EF-1α sequences. Molecules and Cells, 30(5), 409–425. https://doi.org/10.1007/s10059-010-0141-9

Kimball, R. T., Oliveros, C. H., Wang, N., White, N. D., Barker, F. K., Field, D. J., Ksepka, D. T., Chesser, R. T., Moyle, R. G., Braun, M. J., Brumfield, R. T., Faircloth, B. C., Smith, B. T., & Braun, E. L. (2019). A Phylogenomic Supertree of Birds. Diversity, 11(7), 109. https://doi.org/10.3390/d11070109

Kling, M. M., Mishler, B. D., Thornhill, A. H., Baldwin, B. G., & Ackerly, D. D. (2019). Facets of phylodiversity: Evolutionary diversification, divergence and survival as conservation targets. Philosophical Transactions of the Royal Society B: Biological Sciences, 374(1763). https://doi.org/10.1098/rstb.2017.0397

Kocher, S. D., & Williams, E. H. (2000). The diversity and abundance of North American butterflies vary with habitat disturbance and geography. Journal of Biogeography, 27(4), 785–794. https://doi.org/10.1046/j.1365-2699.2000.00454.x

Kronforst, M. R., & Papa, R. (2015). The functional basis of wing patterning in Heliconius butterflies: The molecules behind mimicry. Genetics, 200(1), 1–19. https://doi.org/10.1534/genetics.114.172387

Kück, P., & Longo, G. C. (2014). FASconCAT-G: Extensive functions for multiple sequence alignment preparations concerning phylogenetic studies. Frontiers in Zoology, 11(1). https://doi.org/10.1186/s12983-014-0081-x

Kumar, S., Simonson, S. E., & Stohlgren, T. J. (2009). Effects of spatial heterogeneity on butterfly species richness in Rocky Mountain National Park, CO, USA. Biodiversity and Conservation, 18(3), 739–763. https://doi.org/10.1007/s10531-008-9536-8

Kyerematen, R., Adu-Acheampong, S., Acquah-Lamptey, D., Anderson, R. S., Owusu, E. H., & Mantey, J. (2018). Butterfly Diversity: An Indicator for Environmental Health within Tarkwa Gold Mine, Ghana. Environment and Natural Resources Research, 8(3). https://doi.org/10.5539/enrr.v8n3p69

Laffan, S. W., Lubarsky, E., & Rosauer, D. F. (2010). Biodiverse, a tool for the spatial analysis of biological and related diversity. Ecography, 33(4), 643–647. https://doi.org/10.1111/j.1600-0587.2010.06237.x

Laffan, S. W., Rosauer, D. F., di Virgilio, G., Miller, J. T., González-Orozco, C. E., Knerr, N., Thornhill, A. H., & Mishler, B. D. (2016). Range-weighted metrics of species and phylogenetic turnover can better resolve biogeographic transition zones. Methods in Ecology and Evolution, 7(5), 580–588. https://doi.org/10.1111/2041-210X.12513

Lamas, G. (2004). Atlas of Neotropical Lepidoptera: Checklist Pt. 4a Hesperioidea-Papilionoidea. In J. B. Heppner (Ed.), Atlas of Neotropical Lepidoptera (p. 439). Scientific Pub.

Larsson, A. (2014). AliView: a fast and lightweight alignment viewer and editor for large datasets. Bioinformatics, 30(22), 3276–3278. https://doi.org/10.1093/bioinformatics/btu531

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. https://doi.org/10.1038/nature14539

Lee, S. H., Chan, C. S., Mayo, S. J., & Remagnino, P. (2017). How deep learning extracts and learns leaf features for plant classification. Pattern Recognition, 71, 1–13. https://doi.org/10.1016/j.patcog.2017.05.015

Lemmon, A. R., Brown, J. M., Stanger-Hall, K., & Lemmon, E. M. (2009). The Effect of Ambiguous Data on Phylogenetic Estimates Obtained by Maximum Likelihood and Bayesian Inference. Systematic Biology, 58(1), 130–145. https://doi.org/10.1093/sysbio/syp017

Lemmon, A. R., Emme, S. A., & Lemmon, E. M. (2012). Anchored Hybrid Enrichment for Massively High-Throughput Phylogenomics. Systematic Biology, 61(5), 727–744. https://doi.org/10.1093/sysbio/sys049

Lemoine, F., Domelevo Entfellner, J.-B., Wilkinson, E., Correia, D., Dávila Felipe, M., de Oliveira, T., & Gascuel, O. (2018). Renewing Felsenstein’s phylogenetic bootstrap in the era of big data. Nature, 556(7702), 452–456. https://doi.org/10.1038/s41586-018-0043-0

Lewis, J. J., Geltman, R. C., Pollak, P. C., Rondem, K. E., van Belleghem, S. M., Hubisz, M. J., Munn, P. R., Zhang, L., Benson, C., Mazo-Vargas, A., Danko, C. G., Counterman, B. A., Papa, R., & Reed, R. D. (2019). Parallel evolution of ancient, pleiotropic enhancers underlies butterfly wing pattern mimicry. Proceedings of the National Academy of Sciences, 116(48), 24174–24183. https://doi.org/10.1073/pnas.1907068116

Licona-Vera, Y., & Ornelas, J. F. (2017). The conquering of North America: dated phylogenetic and biogeographic inference of migratory behavior in bee hummingbirds. BMC Evolutionary Biology, 17(1). https://doi.org/10.1186/s12862-017-0980-5

Llorente Bousquets, J., Garcia, A. N., & Soriano, E. G. (1997). Biodiversidad, Taxonomia y Biogeografia de Artropodos de Mexico. The Quarterly Review of Biology, 72(1), 79–80. https://doi.org/10.1086/419693

Lomolino, M. (2004). Frontiers of Biogeography: New Directions in the Geography of Nature. Sinauer.

Lotts, K., & Naberhaus, T. (2017). Butterflies and Moths of North America. http://www.butterfliesandmoths.org/

Louca, S., & Pennell, M. W. (2019). Phylogenies of extant species are consistent with an infinite array of diversification histories. BioRxiv, 14, 719435. https://doi.org/10.1101/719435

Ma, Z., Sandel, B., & Svenning, J. C. (2016). Phylogenetic assemblage structure of North American trees is more strongly shaped by glacial-interglacial climate variability in gymnosperms than in angiosperms. Ecology and Evolution, 6(10), 3092–3106. https://doi.org/10.1002/ece3.2100

Macgregor, C. J., Thomas, C. D., Roy, D. B., Beaumont, M. A., Bell, J. R., Brereton, T., Bridle, J. R., Dytham, C., Fox, R., Gotthard, K., Hoffmann, A. A., Martin, G., Middlebrook, I., Nylin, S., Platts, P. J., Rasteiro, R., Saccheri, I. J., Villoutreix, R., Wheat, C. W., & Hill, J. K. (2019). Climate-induced phenology shifts linked to range expansions in species with multiple reproductive cycles per year. Nature Communications, 10(1), 1–10. https://doi.org/10.1038/s41467-019-12479-w

Martínez, A. L., Llorente, J., Fernández, I. V., & Warren, A. D. (2002). Biodiversity and biogeography of Mexican butterflies (Lepidoptera: Papilionoidea and Hesperioidea). Proceedings of the Entomological Society of Washington, 105(1), 209–224.

Miao, Z., Gaynor, K. M., Wang, J., Liu, Z., Muellerklein, O., Norouzzadeh, M. S., McInturff, A., Bowie, R. C. K., Nathan, R., Yu, S. X., & Getz, W. M. (2019). Insights and approaches using deep learning to classify wildlife. Scientific Reports, 9(1). https://doi.org/10.1038/s41598-019-44565-w

Miller, D. G., Lane, J., & Senock, R. (2011). Butterflies as potential bioindicators of primary rainforest and Oil Palm plantation habitats on New Britain, Papua New Guinea. Pacific Conservation Biology, 17(2), 149–159. https://doi.org/10.1071/pc110149

Milliron, H. (1961). Revised classification of the bumblebees - a synopsis (Hymenoptera: Apidae). J. Kans. Entomol. Soc., 34, 49–61.

Minet, J. (1991). Tentative reconstruction of the ditrysian phylogeny (Lepidoptera: Glossata). Insect Systematics & Evolution, 22(1), 69–95. https://doi.org/10.1163/187631291X00327

Miraldo, A., Li, S., Borregaard, M. K., Flórez-Rodríguez, A., Gopalakrishnan, S., Rizvanovic, M., Wang, Z., Rahbek, C., Marske, K. A., & Nogués-Bravo, D. (2016). An Anthropocene map of genetic diversity. Science, 353(6307), 1532–1535. https://doi.org/10.1126/science.aaf4381

Mishler, B. D., Guralnick, R., Soltis, P. S., Smith, S. A., Soltis, D. E., Barve, N., Allen, J. M., & Laffan, S. W. (2020). Spatial Phylogenetics of the North American Flora. Journal of Systematics and Evolution. https://doi.org/10.1111/jse.12590

Mishler, B. D., Knerr, N., González-Orozco, C. E., Thornhill, A. H., Laffan, S. W., & Miller, J. T. (2014). Phylogenetic measures of biodiversity and neo- and paleo-endemism in Australian Acacia. Nature Communications, 5(1), 1–10. https://doi.org/10.1038/ncomms5473

Mittelbach, G. G., Schemske, D. W., Cornell, H. V., Allen, A. P., Brown, J. M., Bush, M. B., Harrison, S. P., Hurlbert, A. H., Knowlton, N., Lessios, H. A., McCain, C. M., McCune, A. R., McDade, L. A., McPeek, M. A., Near, T. J., Price, T. D., Ricklefs, R. E., Roy, K., Sax, D. F., … Turelli, M. (2007). Evolution and the latitudinal diversity gradient: speciation, extinction and biogeography. Ecology Letters, 10(4), 315–331. https://doi.org/10.1111/j.1461-0248.2007.01020.x

Mutanen, M., Wahlberg, N., & Kaila, L. (2010). Comprehensive gene and taxon coverage elucidates radiation patterns in moths and butterflies. Proceedings of the Royal Society B: Biological Sciences, 277(1695), 2839–2848. https://doi.org/10.1098/rspb.2010.0392

Myers, N., Mittermeier, R. A., Mittermeier, C. G., da Fonseca, G. A. B., & Kent, J. (2000). Biodiversity hotspots for conservation priorities. Nature, 403(6772), 853–858. https://doi.org/10.1038/35002501

Nadeem, M. S. A., Zucker, J.-D., & Hanczar, B. (2010). Accuracy-Rejection Curves (ARCs) for Comparing Classification Methods with a Reject Option. In S. Džeroski, P. Geurts, & J. Rousu (Eds.) (Vol. 8).

Nakamura, K., & Hong, B.-W. (2019). Adaptive Weight Decay for Deep Neural Networks. IEEE Access, 7, 118857–118865. http://arxiv.org/abs/1907.08931

Nazari, V., Zakharov, E. V., & Sperling, F. A. H. (2007). Phylogeny, historical biogeography, and taxonomic ranking of Parnassiinae (Lepidoptera, Papilionidae) based on morphology and seven genes. Molecular Phylogenetics and Evolution, 42(1), 131–156. https://doi.org/10.1016/j.ympev.2006.06.022

Noss, R. F., Platt, W. J., Sorrie, B. A., Weakley, A. S., Means, D. B., Costanza, J., & Peet, R. K. (2015). How global biodiversity hotspots may go unrecognized: lessons from the North American Coastal Plain. Diversity and Distributions, 21(2), 236–244. https://doi.org/10.1111/ddi.12278

Olson, D. M., Dinerstein, E., Wikramanayake, E. D., Burgess, N. D., Powell, G. V. N., Underwood, E. C., D’Amico, J. A., Itoua, I., Strand, H. E., Morrison, J. C., Loucks, C. J., Allnutt, T. F., Ricketts, T. H., Kura, Y., Lamoreux, J. F., Wettengel, W. W., Hedao, P., & Kassem, K. R. (2001). Terrestrial Ecoregions of the World: A New Map of Life on Earth: A new global map of terrestrial ecoregions provides an innovative tool for conserving biodiversity. BioScience, 51(11), 933–938. https://doi.org/10.1641/0006-3568(2001)051[0933:teotwa]2.0.co;2

Owens, H. L., & Guralnick, R. (2019). climateStability: An R package to estimate climate stability from time-slice climatologies. Biodiversity Informatics, 14, 8–13. https://doi.org/10.17161/bi.v14i0.9786

Page, L. M., MacFadden, B. J., Fortes, J. A., Soltis, P. S., & Riccardi, G. (2015). Digitization of Biodiversity Collections Reveals Biggest Data on Biodiversity. BioScience, 65(9), 841–842. https://doi.org/10.1093/biosci/biv104

Palmer, D. H., Tan, Y. Q., Finkbeiner, S. D., Briscoe, A. D., Monteiro, A., & Kronforst, M. R. (2018). Experimental field tests of Batesian mimicry in the swallowtail butterfly Papilio polytes. Ecology and Evolution, 8(15), 7657–7666. https://doi.org/10.1002/ece3.4207

Pearse, W. D., & Purvis, A. (2013). phyloGenerator: an automated phylogeny generation tool for ecologists. Methods in Ecology and Evolution, 4(7), 692–698. https://doi.org/10.1111/2041-210X.12055

Pellissier, L., Alvarez, N., Espíndola, A., Pottier, J., Dubuis, A., Pradervand, J.-N., & Guisan, A. (2013). Phylogenetic alpha and beta diversities of butterfly communities correlate with climate in the western Swiss Alps. Ecography, 36(5), 541–550. https://doi.org/10.1111/j.1600-0587.2012.07716.x

Pellissier, L., Heine, C., Rosauer, D. F., & Albouy, C. (2018). Are global hotspots of endemic richness shaped by plate tectonics? Biological Journal of the Linnean Society, 123(1), 247–261. https://doi.org/10.1093/biolinnean/blx125

Pfeiler, E., Johnson, S., & Markow, T. A. (2012). DNA Barcodes and Insights into the Relationships and Systematics of Buckeye Butterflies (Nymphalidae: Nymphalinae: Junonia) from the Americas. Journal of the Lepidopterists’ Society, 66(4), 185–198. https://doi.org/10.18473/lepi.v66i4.a1

Pozo, C., Martínez, A. L., Uc Tescum, S., Salas Suárez, N., & Martínez, A. M. (2003). Butterflies (Papilionoidea and Hesperioidea) of Calakmul, Campeche, Mexico. The Southwestern Naturalist, 48(4), 505–525. https://doi.org/10.1894/0038-4909(2003)048<0505:BPAHOC>2.0.CO;2

Price, M. N., Dehal, P. S., & Arkin, A. P. (2009). FastTree: Computing Large Trees with Profiles instead of a Distance Matrix. Molecular Biology and Evolution, 26(7), 1641–1650. https://doi.org/10.1093/molbev/msp077

Pyron, R. A., & Wiens, J. J. (2011). A large-scale phylogeny of Amphibia including over 2800 species, and a revised classification of extant frogs, salamanders, and caecilians. Molecular Phylogenetics and Evolution, 61(2), 543–583. https://doi.org/10.1016/j.ympev.2011.06.012

Python Core Team. (2015). Python: A dynamic, open source programming language. Python Software Foundation.

QGIS Development Team. (2020). QGIS Geographic Information System. Open Source Geospatial Foundation Project.

R Core Team. (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing.

Rambaut, A. (2010). FigTree v2.0. Institute of Evolutionary Biology, University of Edinburgh, Edinburgh.

Ratnasingham, S., & Hebert, P. D. N. (2007). BOLD: The Barcode of Life Data System: Barcoding. Molecular Ecology Notes, 7(3), 355–364. https://doi.org/10.1111/j.1471-8286.2007.01678.x

Ray, N., & Adams, J. M. (2001). A GIS-based Vegetation Map of the World at the Last Glacial Maximum (25,000-15,000 BP). Internet Archaeology, 11. https://doi.org/10.11141/ia.11.2

Redding, D. W., Hartmann, K., Mimoto, A., Bokal, D., DeVos, M., & Mooers, A. Ø. (2008). Evolutionarily distinctive species often capture more phylogenetic diversity than expected. Journal of Theoretical Biology, 251(4), 606–615. https://doi.org/10.1016/j.jtbi.2007.12.006

Regier, J. C., Mitter, C., Zwick, A., Bazinet, A. L., Cummings, M. P., Kawahara, A. Y., Sohn, J. C., Zwickl, D. J., Cho, S., Davis, D. R., Baixeras, J., Brown, J., Parr, C., Weller, S., Lees, D. C., & Mitter, K. T. (2013). A Large-Scale, Higher-Level, Molecular Phylogenetic Study of the Insect Order Lepidoptera (Moths and Butterflies). PLoS ONE, 8(3). https://doi.org/10.1371/journal.pone.0058568

Revell, L. J. (2012). phytools: an R package for phylogenetic comparative biology (and other things). Methods in Ecology and Evolution, 3(2), 217–223. https://doi.org/10.1111/j.2041-210X.2011.00169.x

Ricketts, T. H., Dinerstein, E., Olson, D. M., & Loucks, C. (1999). Who’s Where in North America? BioScience, 49(5), 369–381. https://doi.org/10.2307/1313630

Roe, A., Weller, S., Baixeras, J., Brown, J., Cummings, M., Davis, D., Kawahara, A., Parr, C., Regier, J., Rubinoff, D., Simonsen, T., Wahlberg, N., & Zwick, A. (2009). Evolutionary Framework for Lepidoptera Model Systems. https://doi.org/10.1201/9781420060201-c1

Rohde, K. (1992). Latitudinal Gradients in Species Diversity: The Search for the Primary Cause. Oikos, 65(3), 514. https://doi.org/10.2307/3545569

Roquet, C., Thuiller, W., & Lavergne, S. (2013). Building megaphylogenies for macroecology: Taking up the challenge. Ecography, 36(1), 13–26. https://doi.org/10.1111/j.1600-0587.2012.07773.x

Rosauer, D., Laffan, S. W., Crisp, M. D., Donnellan, S. C., & Cook, L. G. (2009). Phylogenetic endemism: a new approach for identifying geographical concentrations of evolutionary history. Molecular Ecology, 18(19), 4061–4072. https://doi.org/10.1111/j.1365-294X.2009.04311.x

Rowe, K. C., Heske, E. J., Brown, P. W., & Paige, K. N. (2004). Surviving the ice: Northern refugia and postglacial colonization. Proceedings of the National Academy of Sciences, 101(28), 10355–10359. https://doi.org/10.1073/pnas.0401338101

Sanderson, M. J. (2002). Estimating Absolute Rates of Molecular Evolution and Divergence Times: A Penalized Likelihood Approach. Molecular Biology and Evolution, 19(1), 101–109. https://doi.org/10.1093/oxfordjournals.molbev.a003974

Sanderson, M. J., Boss, D., Chen, D., Cranston, K. A., & Wehe, A. (2008). The PhyLoTA Browser: Processing GenBank for molecular phylogenetics research. Systematic Biology, 57(3), 335–346. https://doi.org/10.1080/10635150802158688

Sanderson, M. J., & Wojciechowski, M. F. (2000). Improved bootstrap confidence limits in large-scale phylogenies, with an example from Neo-Astragalus (Leguminosae). Systematic Biology, 49(4), 671–685. https://doi.org/10.1080/106351500750049761

Savela, M. (2020). Lepidoptera and some other life forms. https://www.nic.funet.fi/pub/sci/bio/life/intro.html

Scoble, M. J. (1995). The Lepidoptera: Form, Function, and Diversity. The Natural History Museum in association with Oxford University Press.

Seraphim, N., Kaminski, L. A., DeVries, P. J., Penz, C., Callaghan, C., Wahlberg, N., Silva-Brandão, K. L., & Freitas, A. V. L. (2018). Molecular phylogeny and higher systematics of the metalmark butterflies (Lepidoptera: Riodinidae). Systematic Entomology, 43(2), 407–425. https://doi.org/10.1111/syen.12282

Session, A. M., Uno, Y., Kwon, T., Chapman, J. A., Toyoda, A., Takahashi, S., Fukui, A., Hikosaka, A., Suzuki, A., Kondo, M., van Heeringen, S. J., Quigley, I., Heinz, S., Ogino, H., Ochi, H., Hellsten, U., Lyons, J. B., Simakov, O., Putnam, N., … Rokhsar, D. S. (2016). Genome evolution in the allotetraploid frog Xenopus laevis. Nature, 538(7625), 336–343. https://doi.org/10.1038/nature19840

Shen, J., Cong, Q., Kinch, L. N., Borek, D., Otwinowski, Z., & Grishin, N. V. (2016). Complete genome of Pieris rapae, a resilient alien, a cabbage pest, and a source of anti-cancer proteins. F1000Research, 5, 2631. https://doi.org/10.12688/f1000research.9765.1

Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on Image Data Augmentation for Deep Learning. Journal of Big Data, 6(1). https://doi.org/10.1186/s40537-019-0197-0

Smith, S. A., Beaulieu, J. M., & Donoghue, M. J. (2009). Mega-phylogeny approach for comparative biology: An alternative to supertree and supermatrix approaches. BMC Evolutionary Biology, 9(1), 37. https://doi.org/10.1186/1471-2148-9-37

Smith, S. A., & O’Meara, B. C. (2012). treePL: divergence time estimation using penalized likelihood for large phylogenies. Bioinformatics, 28(20), 2689–2690. https://doi.org/10.1093/bioinformatics/bts492

Smith, S. A., & Walker, J. F. (2019). PyPHLAWD: A python tool for phylogenetic dataset construction. Methods in Ecology and Evolution, 10(1), 104–108. https://doi.org/10.1111/2041-210X.13096

Smithsonian Collections. (2020). https://www.si.edu/collections

Soltis, D. E., Morris, A. B., McLachlan, J. S., Manos, P. S., & Soltis, P. S. (2006). Comparative phylogeography of unglaciated eastern North America. In Molecular Ecology (Vol. 15, Issue 14, pp. 4261–4293). https://doi.org/10.1111/j.1365-294X.2006.03061.x

Soroye, P., Newbold, T., & Kerr, J. (2020). Climate change contributes to widespread declines among bumble bees across continents. Science, 367(6478), 685–688. https://doi.org/10.1126/science.aax8591

speciesLink. (2020). http://splink.cria.org.br/description?criaLANG=en

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.

Stamatakis, A. (2014). RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30(9), 1312–1313. https://doi.org/10.1093/bioinformatics/btu033

Stevens, P. F. (2001). Angiosperm Phylogeny Website. http://www.mobot.org/MOBOT/research/APweb/

Tabak, M. A., Norouzzadeh, M. S., Wolfson, D. W., Sweeney, S. J., Vercauteren, K. C., Snow, N. P., Halseth, J. M., di Salvo, P. A., Lewis, J. S., White, M. D., Teton, B., Beasley, J. C., Schlichting, P. E., Boughton, R. K., Wight, B., Newkirk, E. S., Ivan, J. S., Odell, E. A., Brook, R. K., … Miller, R. S. (2019). Machine learning to classify animal species in camera trap images: Applications in ecology. Methods in Ecology and Evolution, 10(4), 585–590. https://doi.org/10.1111/2041-210X.13120

Thackray, G. D. (2008). Varied climatic and topographic influences on Late Pleistocene mountain glaciation in the western United States. Journal of Quaternary Science, 23, 671–681. https://doi.org/10.1002/jqs.1210

The Plant List. (2013). http://www.theplantlist.org/

Thomson, R. C., & Shaffer, H. B. (2010). Sparse Supermatrices for Phylogenetic Inference: Taxonomy, Alignment, Rogue Taxa, and the Phylogeny of Living Turtles. Systematic Biology, 59(1), 42–58. https://doi.org/10.1093/sysbio/syp075

Thornhill, A. H., Baldwin, B. G., Freyman, W. A., Nosratinia, S., Kling, M. M., Morueta-Holme, N., Madsen, T. P., Ackerly, D. D., & Mishler, B. D. (2017). Spatial phylogenetics of the native California flora. BMC Biology, 15(1). https://doi.org/10.1186/s12915-017-0435-x

Varga, T., Krizsán, K., Földi, C., Dima, B., Sánchez-García, M., Sánchez-Ramírez, S., Szöllősi, G. J., Szarkándi, J. G., Papp, V., Albert, L., Andreopoulos, W., Angelini, C., Antonín, V., Barry, K. W., Bougher, N. L., Buchanan, P., Buyck, B., Bense, V., Catcheside, P., … Nagy, L. G. (2019). Megaphylogeny resolves global patterns of mushroom evolution. Nature Ecology and Evolution, 3(4), 668–678. https://doi.org/10.1038/s41559-019-0834-1

Vicente, E. J., & Dean, D. R. (2017). Keeping the nitrogen-fixation dream alive. In Proceedings of the National Academy of Sciences of the United States of America (Vol. 114, Issue 12, pp. 3009–3011). National Academy of Sciences. https://doi.org/10.1073/pnas.1701560114

Vilgalys, R. (2003). Taxonomic misidentification in public DNA databases. In New Phytologist (Vol. 160, Issue 1, pp. 4–5). https://doi.org/10.1046/j.1469-8137.2003.00894.x

Vollmar, A., Macklin, J. A., & Ford, L. (2010). Natural History Specimen Digitization: Challenges and Concerns. Biodiversity Informatics, 7(2). https://doi.org/10.17161/bi.v7i2.3992

Wahlberg, N., Leneveu, J., Kodandaramaiah, U., Peña, C., Nylin, S., Freitas, A. V. L., & Brower, A. V. Z. (2009). Nymphalid butterflies diversify following near demise at the Cretaceous/Tertiary boundary. Proceedings of the Royal Society B: Biological Sciences, 276(1677), 4295–4302. https://doi.org/10.1098/rspb.2009.1303

Wahlberg, N., Weingartner, E., & Nylin, S. (2003). Towards a better understanding of the higher systematics of Nymphalidae (Lepidoptera: Papilionoidea). Molecular Phylogenetics and Evolution, 28(3), 473–484. https://doi.org/10.1016/S1055-7903(03)00052-6

Wahlberg, N., Wheat, C. W., & Peña, C. (2013). Timing and Patterns in the Taxonomic Diversification of Lepidoptera (Butterflies and Moths). PLoS ONE, 8(11), e80875. https://doi.org/10.1371/journal.pone.0080875

Wäldchen, J., & Mäder, P. (2018). Machine learning for image based species identification. In Methods in Ecology and Evolution (Vol. 9, Issue 11, pp. 2216–2225). British Ecological Society. https://doi.org/10.1111/2041-210X.13075

Wäldchen, J., Rzanny, M., Seeland, M., & Mäder, P. (2018). Automated plant species identification—Trends and future directions. In PLoS Computational Biology (Vol. 14, Issue 4). Public Library of Science. https://doi.org/10.1371/journal.pcbi.1005993

Warren, A. D., Davis, K. J., Stangeland, E. M., Pelham, J. P., Willmott, K. R., & Grishin, N. V. (2016). Illustrated Lists of American Butterflies. http://www.butterfliesofamerica.com/

Warren, A. D., Ogawa, J. R., & Brower, A. V. Z. (2009). Revised classification of the family Hesperiidae (Lepidoptera: Hesperioidea) based on combined molecular and morphological data. Systematic Entomology, 34(3), 467–523. https://doi.org/10.1111/j.1365-3113.2008.00463.x

Weinstein, B. G. (2018). A computer vision for animal ecology. In Journal of Animal Ecology (Vol. 87, Issue 3, pp. 533–545). Blackwell Publishing Ltd. https://doi.org/10.1111/1365-2656.12780

Wen, C., & Guyer, D. (2012). Image-based orchard insect automated identification and classification method. Computers and Electronics in Agriculture, 89, 110–115. https://doi.org/10.1016/j.compag.2012.08.008

Wen, J., Ickert-Bond, S. M., Appelhans, M. S., Dorr, L. J., & Funk, V. A. (2015). Collections-based systematics: Opportunities and outlook for 2050. In Journal of Systematics and Evolution (Vol. 53, Issue 6, pp. 477–488). Wiley-Liss Inc. https://doi.org/10.1111/jse.12181

Wepprich, T., Adrion, J. R., Ries, L., Wiedmann, J., & Haddad, N. M. (2019). Butterfly abundance declines over 20 years of systematic monitoring in Ohio, USA. PLOS ONE, 14(7), e0216270. https://doi.org/10.1371/journal.pone.0216270

Werner, G. D. A., Cornwell, W. K., Sprent, J. I., Kattge, J., & Kiers, E. T. (2014). A single evolutionary innovation drives the deep evolution of symbiotic N2-fixation in angiosperms. Nature Communications, 5. https://doi.org/10.1038/ncomms5087

Whittaker, R. H. (1972). Evolution and Measurement of Species Diversity. TAXON, 21(2–3), 213–251. https://doi.org/10.2307/1218190

Whittaker, R. J., Araújo, M. B., Jepson, P., Ladle, R. J., Watson, J. E. M., & Willis, K. J. (2005). Conservation Biogeography: assessment and prospect. Diversity and Distributions, 11(1), 3–23. https://doi.org/10.1111/j.1366-9516.2005.00143.x

Wiemers, M., Chazot, N., Wheat, C. W., Schweiger, O., & Wahlberg, N. (2019). A complete time-calibrated multi-gene phylogeny of the European butterflies. BioRxiv, November. https://doi.org/10.1101/844175

Wiens, J. J. (2006). Missing data and the design of phylogenetic analyses. Journal of Biomedical Informatics, 39(1), 34–42. https://doi.org/10.1016/j.jbi.2005.04.001

Wiens, J. J., & Tiu, J. (2012). Highly Incomplete Taxa Can Rescue Phylogenetic Analyses from the Negative Impacts of Limited Taxon Sampling. PLoS ONE, 7(8), e42925. https://doi.org/10.1371/journal.pone.0042925

Wikipedia. (2020). https://www.wikipedia.org/

Williams, B. W., Gelder, S. R., Proctor, H. C., & Coltman, D. W. (2013). Molecular phylogeny of North American Branchiobdellida (Annelida: Clitellata). Molecular Phylogenetics and Evolution, 66(1), 30–42. https://doi.org/10.1016/j.ympev.2012.09.002

Williams, P. H., Berezin, M. V., Cannings, S. G., Cederberg, B., Ødegaard, F., Rasmussen, C., Richardson, L. L., Rykken, J., Sheffield, C. S., Thanoosing, C., & Byvaltsev, A. M. (2019). The arctic and alpine bumblebees of the subgenus Alpinobombus revised from integrative assessment of species’ gene coalescents and morphology (Hymenoptera, Apidae, Bombus). Zootaxa, 4625(1), 1–68. https://doi.org/10.11646/zootaxa.4625.1.1

Williams, P. H., Byvaltsev, A. M., Cederberg, B., Berezin, M. V., Ødegaard, F., Rasmussen, C., Richardson, L. L., Huang, J., Sheffield, C. S., & Williams, S. T. (2015). Genes suggest ancestral colour polymorphisms are shared across morphologically cryptic species in arctic bumblebees. PLoS ONE, 10(12). https://doi.org/10.1371/journal.pone.0144544

Williams, P. H., Cameron, S. A., Hines, H. M., Cederberg, B., & Rasmont, P. (2008). A simplified subgeneric classification of the bumblebees (genus Bombus). Apidologie, 39(1), 46–74. https://doi.org/10.1051/apido:2007052

Williams, P. H., Cannings, S. G., & Sheffield, C. S. (2016). Cryptic subarctic diversity: a new bumblebee species from the Yukon and Alaska (Hymenoptera: Apidae). Journal of Natural History, 50(45–46), 2881–2893. https://doi.org/10.1080/00222933.2016.1214294

Williams, P. H. (1998). An annotated checklist of bumble bees with an analysis of patterns of description (Hymenoptera: Apidae, Bombini). Bulletin of the Natural History Museum, 67, 79–152. https://www.biodiversitylibrary.org/part/76466

Wolfram Research, Inc. (2018). System Modeler, Version 11.3. Champaign, IL.

Worthy, T. H. (1987). Palaeoecological information concerning members of the frog genus Leiopelma: Leiopelmatidae in New Zealand. Journal of the Royal Society of New Zealand, 17(4), 409–420. https://doi.org/10.1080/03036758.1987.10426482

Yang, H.-P., Ma, C.-S., Wen, H., Zhan, Q.-B., & Wang, X.-L. (2015). A tool for developing an automatic insect identification system based on wing outlines. Scientific Reports, 5. https://doi.org/10.1038/srep12786

Yao, Y., Rosasco, L., & Caponnetto, A. (2007). On early stopping in gradient descent learning. Constructive Approximation, 26(2), 289–315. https://doi.org/10.1007/s00365-006-0663-2

Yuan, X., Gao, K., Yuan, F., Wang, P., & Zhang, Y. (2015). Phylogenetic relationships of subfamilies in the family Hesperiidae (Lepidoptera: Hesperioidea) from China. Scientific Reports, 5. https://doi.org/10.1038/srep11140

Zanne, A. E., Tank, D. C., Cornwell, W. K., Eastman, J. M., Smith, S. A., Fitzjohn, R. G., McGlinn, D. J., O’Meara, B. C., Moles, A. T., Reich, P. B., Royer, D. L., Soltis, D. E., Stevens, P. F., Westoby, M., Wright, I. J., Aarssen, L., Bertin, R. I., Calaminus, A., Govaerts, R., … Beaulieu, J. M. (2014). Three keys to the radiation of angiosperms into freezing environments. Nature, 506(7486), 89–92. https://doi.org/10.1038/nature12872

Zeng, J., Zhao, X., Gan, J., Mai, C., Zhai, Y., & Wang, F. (2018). Deep Convolutional Neural Network Used in Single Sample per Person Face Recognition. Computational Intelligence and Neuroscience, 2018, 1–11. https://doi.org/10.1155/2018/3803627

Zhang, J., Cong, Q., Shen, J., Brockmann, E., & Grishin, N. V. (2019). Genomes reveal drastic and recurrent phenotypic divergence in firetip skipper butterflies (Hesperiidae: Pyrrhopyginae). Proceedings of the Royal Society B: Biological Sciences, 286(1903), 20190609. https://doi.org/10.1098/rspb.2019.0609

Zhang, J., Cong, Q., Shen, J., Opler, P. A., & Grishin, N. V. (2019). Genomics of a complete butterfly continent. BioRxiv, 845, 829887. https://doi.org/10.1101/829887

Zhang, P., Liang, D., Mao, R.-L., Hillis, D. M., Wake, D. B., & Cannatella, D. C. (2013). Efficient Sequencing of Anuran mtDNAs and a Mitogenomic Exploration of the Phylogeny and Evolution of Frogs. Molecular Biology and Evolution, 30(8), 1899–1915. https://doi.org/10.1093/molbev/mst091

Zheng, Y., & Wiens, J. J. (2016). Combining phylogenomic and supermatrix approaches, and a time-calibrated phylogeny for squamate reptiles (lizards and snakes) based on 52 genes and 4162 species. Molecular Phylogenetics and Evolution, 94, 537–547. https://doi.org/10.1016/j.ympev.2015.10.009

BIOGRAPHICAL SKETCH

Chandra Earl attended the University of Florida for both undergraduate and graduate school. She received a Bachelor of Science in 2015 with a major in biology and a minor in bioinformatics. While working for the Florida Museum of Natural History, she was awarded a Doctor of Philosophy in genetics and genomics.
