Integrating Life History Traits Into Predictive Phylogeography
Total Page:16
File Type:pdf, Size:1020Kb
Received: 20 August 2018 | Revised: 4 January 2019 | Accepted: 16 January 2019 DOI: 10.1111/mec.15029 ORIGINAL ARTICLE Integrating life history traits into predictive phylogeography Jack Sullivan1,2* | Megan L. Smith3* | Anahí Espíndola1,4 | Megan Ruffley1,2 | Andrew Rankin1,2 | David Tank1,2 | Bryan Carstens3 1Department of Biological Sciences, University of Idaho, Moscow, Abstract Idaho Predictive phylogeography seeks to aggregate genetic, environmental and taxonomic 2 Institute for Bioinformatics and data from multiple species in order to make predictions about unsampled taxa using Evolutionary Studies, University of Idaho, Moscow, Idaho machine‐learning techniques such as Random Forests. To date, organismal trait data 3Department of Ecology, Evolution and have infrequently been incorporated into predictive frameworks due to difficulties Organismal Biology, The Ohio State University, Columbus, Ohio inherent to the scoring of trait data across a taxonomically broad set of taxa. We re‐ 4Department of Entomology, University of fine predictive frameworks from two North American systems, the inland temperate Maryland, College Park, Maryland rainforests of the Pacific Northwest and the Southwestern Arid Lands (SWAL), by Correspondence incorporating a number of organismal trait variables. Our results indicate that incor‐ Jack Sullivan, Department of Biological porating life history traits as predictor variables improves the performance of the Sciences, University of Idaho, Moscow, ID. Email: [email protected] supervised machine‐learning approach to predictive phylogeography, especially for Funding information the SWAL system, in which predictions made from only taxonomic and climate vari‐ National Science Foundation, Grant/Award ables meets only moderate success. In particular, traits related to reproduction (e.g., Numbers: DEB 14575199, DEB 1457726, DG‐1343012; NSF GRFP; Ohio State reproductive mode; clutch size) and trophic level appear to be particularly informa‐ University; Institute for Bioinformatics and tive to the predictive framework. Predictive frameworks offer an important mecha‐ Evolutionary Studies (IBEST), University of Idaho, Grant/Award Number: NIH nism for integration of organismal trait, environmental data, and genetic data in NCRR 1P20RR016454‐01, NIH NCRR phylogeographic studies. 1P20RR016448‐01, NSF EPS‐809935. KEYWORDS comparative phylogeography, cryptic diversity, machine learning, random forest 1 | INTRODUCTION & Knowles, 2016; Riddle, 2016; Zamudio, Bell, & Mason, 2016), echoing earlier studies that integrated traits into phylogeography The role of geography in the generation of biological diversity has (e.g., Paz, Ibáñez, Lips, & Crawford, 2015). been central to the development of evolutionary theory, including Concurrent with this colloquium, Espíndola et al. (2016) in‐ Darwin's focus on the Galapagos Islands and Wallace's studies of troduced a supervised machine‐learning approach using random the Sahul and Sunda Shelves. Comparative phylogeography (e.g., forests (RF) (Breiman, 2001) that integrates comparative phyloge‐ Bermingham & Avise, 1986) developed as a modern manifestation ography with distributional and climate data housed in publicly avail‐ of this interest and was the focus of a recent colloquium (Avise, able databases such as GBIF and WorldClim (Fick & Hijmans, 2017). Bowen, & Ayala, 2016) that addressed the current limitations, and The method of Espíndola et al. (2016) uses these geospatial data as future prospects of the discipline. Several papers that emerged from predictor variables to develop a classifier that permits predictions to that colloquium issued a call for the development of trait‐based ap‐ be made about the nature of genetic variation (e.g., the presence or proaches for phylogeography to move forward (e.g., Papadopoulou absence of cryptic diversity, as indicated by deep genetic divergence or phylogeographic breaks) in taxa that have little to no genetic data *Co‐first authors on this paper yet available. Specifically, an existing comparative phylogeographic 2062 | © 2019 John Wiley & Sons Ltd wileyonlinelibrary.com/journal/mec Molecular Ecology. 2019;28:2062–2073. SULLIVAN ET al. | 2063 data set is used as the training data set to build the classifier, and, of 2 | MATERIALS AND METHODS the various supervised machine‐learning options available, RF per‐ mit an assessment of the importance of predictor variables in the 2.1 | Random forest classifier predictive function. Of the many supervised machine‐learning approaches available, In its original version, the predictive phylogeographic approach we focus on the ensemble classifier inherent in RF (Breiman, 2001), of Espíndola et al. (2016) used a machine learning framework to which uses collections of decision trees to classify observations predict whether or not unsampled species were likely to harbour based on a series of predictor variables. Details of our use of this cryptic diversity, which was defined originally as pre‐Pleistocene method are provided in Table 1. First, we use existing phylogeo‐ divergence across a geographic barrier. Since the Espíndola et al. graphic studies (see below) as a training data set (i.e., reference taxa). (2016) Pacific Northwest training data set included two woody Here, each observation (georeferenced occurrence records for the plants, a mammal, three amphibians, and two terrestrial inverte‐ reference taxa) was classified based on whether it is from a taxon brates, life history traits that could be coded across all included that exhibits or does not exhibit a particular deep phylogeographic species were difficult to identify. To address this shortcoming, tax‐ break between disjunct populations (i.e., cryptic diversity; step 1, onomic categories served as a proxy for life history traits and were Table 1). That is, the presence or absence of cryptic diversity is the used with occurrence records for species endemic to a particular response variable (i.e., the feature we seek to predict). Inferences region or ecosystem as observations to be classified. Predictive as to whether or not taxa harboured cryptic diversity were based accuracy was reasonably good in Espíndola et al. (2016) and the either on results from Approximate Bayesian Computation (ABC) predictive phylogeographic framework remains ideally suited to and BEAST analyses, or on results from previous studies (details in incorporation of trait data as predictor variables, particularly for Espíndola et al., 2016). comparative phylogeographic investigations that are more taxo‐ In step two (Table 1), we use RF in an attempt to use all the nomically focused. information in the data and we grow multiple classifiers (i.e., mul‐ Although many current comparative phylogeographic data sets tiple decision trees) using a bootstrap aggregating (bagging) ap‐ are taxonomically restricted, for example to a regional avifauna (e.g., proach. Two thirds of observations are randomly assigned to be Kirchman & Franklin, 2007), or to regional anurans (e.g., Freilich et al., used as input data to construct decision trees; the remaining third 2016), cross kingdom studies (e.g., Carstens, Brunsfeld, Demboski, is called the out‐of‐bag (OoB) sample and is used as an internal val‐ Good, & Sullivan, 2005; Hewitt, 1996; Sérsic et al., 2011) and syn‐ idation to assess accuracy of the emergent classifier. At each split theses (e.g., Médail & Diadema, 2009; Soltis, Morris, McLachlan, (i.e., node) in each decision tree, a random subset of the predictor Manos, & Soltis, 2006) are beginning to accumulate. The broad variables is chosen. From among this set of variables, the one that taxonomic sampling of such studies leads to inherent difficulties best partitions the observations between classes is selected. This related to the comparison of genetic variation, but model selection process is repeated and nodes are added to decision trees until all and species‐specific simulations represent powerful tools to test observations are classified (step 2C, Table 1). Whereas individual ecosystem‐wide hypotheses (e.g., Carstens et al., 2005; Carstens & decision trees may provide poor predictions (i.e., high OoB error Richards, 2007) and make species‐specific predictions (e.g., Ruffley rates), the ensemble classifier obtained from a collection (i.e., a et al., 2018; Smith, Ruffley, Tank, Sullivan, & Carstens, 2017), en‐ forest) of random trees has been shown to be very accurate in abling such cross‐kingdom comparisons. Along with simulations and many applications. Furthermore, the proportion of decision trees model comparison, the use of the predictive approach provides an in the forest that classify an observation into a particular category invaluable strategy for the integration of trait data into comparative (i.e., the proportion that classify an occurrence record as being phylogeography because such data are often available even in taxa from a taxon that harbours cryptic diversity) provides a measure for which no genetic data have yet been collected. of strength of inference. For more information, an excellent review In the current paper, we expand the approach of Espíndola et of random forest classification is provided by Liaw and Wiener al. (2016) and incorporate functional traits as predictor variables in (2002). order to integrate life history explicitly into predictive phylogeogra‐ phy. The challenge in doing so for a taxonomically broad compara‐ tive data set is to identify traits that may influence the geographic 2.2 | Study systems and predictor variables structure of