Modelling establishment risk for non-indigenous species using aquarium as a case study

Lidia Della Venezia

Department of Biology

McGill University

Montreal, Quebec,

June 2019

A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of Doctor of Philosophy in Biology

© Lidia Della Venezia, 2019

Table of contents

Dedication
Acknowledgements
Thesis abstract
Résumé
List of tables
List of figures
Preface
    Thesis format and style
    Contributions of co-authors
    Original contributions to knowledge
General introduction
    Introduction
    Risk assessment and risk management
    Non-indigenous species establishment
    Missing data in ecological datasets
    Methodological approach
    Thesis outline
    References
Chapter 1: The rich get richer: invasion risk across North America from the aquarium pathway under climate change
    1.1 Abstract
    1.2 Introduction
    1.3 Methods
        1.3.1 Model formulation
        1.3.2 Data and variable choice
        1.3.3 Model fitting
    1.4 Results
    1.5 Discussion
    1.6 Acknowledgements
    References
Connecting statement
Chapter 2: Guiding rapid response to non-indigenous aquarium fish: identifying risk factors for persistent versus casual establishment
    2.1 Abstract
    2.2 Introduction
    2.3 Methods
        2.3.1 Variable choice
        2.3.2 Model fitting
        2.3.3 Multiplicative risk factors
    2.4 Results
        2.4.1 Multiplicative risk factors
        2.4.2 Re-evaluating "risky" species in terms of persistence
        2.4.3 Comparing establishment sub-stages: casual versus persistent
    2.5 Discussion
    2.6 Acknowledgements
    References
Connecting statement
Chapter 3: Filling in FishBase: a more powerful approach to the imputation of missing trait data
    3.1 Abstract
    3.2 Introduction
    3.3 Methods
        3.3.1 Trait data
        3.3.2 Comparison of existing imputation methods
        3.3.3 Novel imputation protocol: CRU
        3.3.4 Model averaging and gap-filling
        3.3.5 Validation procedure
    3.4 Results
        3.4.1 Performance of the imputation models
        3.4.2 Validation of the uncertainty estimates
        3.4.3 Filling in FishBase
    3.5 Discussion
        3.5.1 Model comparison and ensemble imputation
        3.5.2 Uncertainty estimation
        3.5.3 Caveats
    3.6 Conclusions
    3.7 Acknowledgements
    References
General conclusion
    References
Appendices
    Appendix A: Supplementary material for Chapter 1
    Appendix B: Supplementary material for Chapter 2
        Appendix B.1
        Appendix B.2
        Appendix B.3
        Appendix B.4
        Appendix B.5
    Appendix C: Supplementary material for Chapter 3

Dedication

I dedicate this thesis to my entire family, but especially to my parents, Ketty and Livio, you are everything I could ever ask for. To my sister Claudia, my brother Giulio, my nephews Chiara and Lorenzo, and my dear nonna Noemi, you constantly fill my heart with love and joy. To my best friends Elisabetta, Silvia, Mery and Marina, you always support me and are pure happiness. And to Paolo, your smile brings the best out of me.


Acknowledgements

I would like to thank my supervisor Prof. Brian Leung, without whom my work would not have been possible and who has allowed me to grow both academically and personally. I also thank Prof. Gregor Fussmann and Prof. Frédéric Guichard for serving on my supervisory committee and for their guidance, particularly during major direction changes in my research. I am very grateful to my co-author Jason Samson, for providing insightful discussions and strong encouragement. Many thanks are due to my lab mates, Johanna Bradie, Corey Chivers, Kristina Enciso, Alyssa Gervais, Emma Hudgins, Dat Nguyen, Victoria Reed, Natalie Richards, Anthony Sardain, Dylan Schneider, and Shriram Varadarajan, for their consistent support and help.

Further, I acknowledge the Canadian Aquatic Invasive Species Network, the Fonds vert of the Quebec government, McGill University, the Natural Sciences and Engineering Research Council, and the Quebec Centre for Biodiversity Science for providing the necessary funding to pursue my PhD research.

I am especially grateful to my closest friends in Montreal, who have seen me through some of the happiest and toughest times of my life. In chronological order, I thank Chiara, Jay, Enrico, Veronica, Franco, Luca, Estelle, Gianni, Fabrizio, Nunzio, Alessandra, Simone, Salvatore, Raul, Andrea and Anna. Thank you to my favourite francophones, Céline, Andréa, Mathilde and Mattia.

Finally, my appreciation goes to Mundo Lingo for having been a second home and having allowed me to meet some of the most interesting and charming people I know, the Jarry pool for being my aquatic oasis during the summer, YouTube for always providing the perfect work soundtrack, and my kickboxing instructors and partners for having prevented several breakdowns during the most challenging months of my PhD.


Thesis abstract

Invasive species cause substantial ecological and economic damage. While major modelling improvements have been made in recent decades, predictions are primarily based on single-species analyses, look at a single factor at a time (i.e. environmental conditions, species traits, and propagule pressure, individually), and consider quite broad stages (e.g., establishment), which may be usefully resolved into smaller sub-stages. Further, although models arguably provide the most coherent, sophisticated predictions, most quantitative models remain unused in policy-oriented invasive species risk assessments, which largely rely on expert opinion and simple summation across individual factors believed to influence invasions (i.e. scoring-based approaches). Of course, for quantitative analyses, data limitations typically exist. To allow quantitative methods to be even more powerful and broadly usable, approaches are needed to alleviate those limitations and optimally use the available information. In this thesis, I advance the field of invasion biology, contributing to each of the three issues identified above.

In Chapter 1, I consider the three main predictors of biological invasions: environment, propagule pressure and species traits, and I integrate these into a coherent multispecies, geographically explicit model. I show the importance of their combination, and forecast that, for the aquarium fish invasion pathway, "the rich get richer" in that the most vulnerable current locations are likely to suffer the greatest increase in new invasions in the future. By employing an integrative approach and a multispecies perspective, this work provides support to decision making for resource managers and policy makers, and a better understanding of non-indigenous species establishment.

In Chapter 2, I recognize that while prevention may be ideal, it is not always achievable, and prioritizing rapid responses is necessary for the effective management of potentially harmful non-indigenous species. To address issues of rapid response, we need to more finely resolve the establishment phase of biological invasions, and determine what happens after a new species has been detected (i.e. whether long-term persistence occurs). In this chapter, I have three objectives: 1) I separate casual (i.e. temporary) establishment and persistence (i.e. lack of subsequent extirpation), using the framework developed in Chapter 1 to generate a multispecies, geographically explicit model to prioritize rapid response efforts; 2) recognizing that mathematical models often remain unused in policy, I convert the model parameters into simple multiplicative risk factors, to facilitate broader adoption outside the modelling field; 3) from a fundamental perspective, I show that propagule pressure is important for casual establishment, but does not help predict subsequent persistence. Species traits are the most important group of predictors for casual establishment, while the environment is most relevant for persistence.

In Chapter 3, I develop an approach to more powerfully use available data (focusing on species traits). From Chapters 1 and 2, it became apparent that about 80% of trait values were missing in the trait database used (FishBase). Thus, I develop a novel, fast, simple method for imputation, based on trait Correlation, taxonomic Relatedness and Uncertainty minimization (CRU). I compare it against existing cutting-edge approaches, and show that CRU is the most accurate. Additionally, I demonstrate that including CRU in an ensemble model combining existing techniques (Phylopars and missForest) yields even greater accuracy. This work fills in a substantial subset of FishBase, but also provides an approach for the imputation of other trait databases. Overall, this thesis advances understanding of ecological processes and informs environmental management while using the best available information.


Résumé

Invasive species cause substantial ecological and economic damage. Although major improvements have been made to modelling over recent decades, predictions rely primarily on single-species analyses, examine one factor at a time (environmental conditions, species traits, and propagule pressure, individually), and consider fairly broad stages (for example, establishment), which can usefully be broken down into smaller sub-stages. Moreover, although models arguably provide the most coherent and sophisticated predictions, most quantitative models remain unused in action-oriented invasive species risk assessments, which rely largely on expert opinion and a simple summation of the individual factors likely to influence invasions (scoring-based approaches). Of course, data limitations typically exist for quantitative analyses. For quantitative approaches to be even more powerful and broadly usable, methods are needed to alleviate these limitations and make optimal use of the available information. In this thesis, I advance the field of invasion biology, contributing to each of the three issues identified above.

In Chapter 1, I consider the three main predictors of biological invasions (environment, propagule pressure and species traits) and integrate them into a coherent, multispecies, geographically explicit model. I show the importance of their combination and forecast that, for the aquarium fish invasion pathway, "the rich get richer", with the most vulnerable locations likely to experience the greatest increase in new invasions in the future. By using an integrative approach and a multispecies perspective, this work supports decision making by resource managers and policy makers, and provides a better understanding of non-indigenous species establishment.

In Chapter 2, I recognize that while prevention may be ideal, it is not always achievable, and that prioritizing rapid interventions is necessary for the effective management of potentially harmful non-indigenous species. To address issues of rapid response, we need to resolve the establishment phase of biological invasions more finely and determine what happens after a new species has been detected (that is, whether long-term persistence occurs). In this chapter, I have three objectives: 1) I separate casual (that is, temporary) establishment and persistence (that is, the absence of subsequent extirpation), using the framework developed in Chapter 1 to generate a multispecies, geographically explicit model to prioritize rapid response efforts; 2) recognizing that mathematical models often remain unused in policy, I convert the model parameters into simple multiplicative risk factors, to facilitate broader adoption outside the modelling field; 3) from a fundamental perspective, I show that propagule pressure is important for casual establishment, but does not help predict subsequent persistence. Species traits are the most important group of predictors for casual establishment, while the environment is most relevant for persistence.

In Chapter 3, I develop an approach to use the available data more effectively (focusing on species traits). In Chapters 1 and 2, it became apparent that about 80% of trait values were missing in the trait database used (FishBase). Thus, I develop a new, simple and fast imputation method, based on trait correlation, taxonomic relatedness and uncertainty minimization (CRU). I compare this method against existing state-of-the-art approaches and show that CRU is the most accurate. In addition, I demonstrate that including CRU in an ensemble model combining existing techniques (Phylopars and missForest) yields even greater accuracy. This work fills in a substantial subset of FishBase but also provides an approach for the imputation of other trait databases. Overall, this thesis improves the understanding of ecological processes and informs environmental management while using the best available information.


List of tables

Table 1.1. Results of the forward selection for the joint model (PET) including environmental variables (type BIO), species traits (type TR) and environment-trait interactions (type INT). X̅ indicates the mean, s² represents the variance, and 1st and 2nd denote first and second order terms. The terms included in the final PET model and the corresponding standardized parameter values are indicated with an asterisk, while AIC improvements greater than 2 units in the model selection are denoted by daggers. The last column shows the order of inclusion in the model.

Table 1.2. Top 10 species with the highest likelihood of establishment in the United States currently and in 2050, as predicted by our joint model, along with the state where they pose the highest risk, their propagule pressure, and the values of the traits retained in the PET model as important determinants of establishment, i.e. maximum temperature tolerance (Max T., °C) and maximum length (Max L., cm).

Table 2.1. Species status categories for the aquarium species in our dataset, as described on the USGS Non-indigenous Aquatic Species website and as categorized in our study (CS/PS).

Table 2.2. AIC, AIC difference from the best model (ΔAIC), AUC, fitted ĉ parameter and percentage of deviance explained (%dev.exp) by the full casual establishment and persistence models, and their respective submodels. The best model for each dataset is indicated in bold.


Table 2.3. Parameter values for the predictors retained in each sub-stage best model after variable selection. Rank indicates the entry order of each important variable in the corresponding model, either as a first order term only (1st) or including an additional second order term (2nd).

Table 2.4. Selected rules of thumb to quickly quantify risk of casual and persistent establishment. The first column reports the average value of each relevant predictor in the equivalent model, while the other columns identify the variable values corresponding to an OR change of +1000%, +100%, +50%, +25%, -25% and -50%. The means of variables that were significant for both models differ, because the persistence model dataset represents a subset of the casual establishment model dataset (e.g., maximum length).

Table 3.1. Species functional and life-history traits used for the analysis, obtained from FishBase. The last column indicates the percentage of missing values for each trait in the dataset.

Table 3.2. Measures of variation included in the hierarchical model for uncertainty estimation.

Table 3.3. Average R²_MSE across traits and approaches, obtained by cross-validation. Each row is either a single methodology or an ensemble of methods.

Table 3.4. Cross-validation predictive performance (R²_MSE) of MICE, missForest, Phylopars, CRU and the best ensemble (missForest-CRU), by trait. For single approaches, the values in bold correspond to the best method for a specific trait, while under the ensemble model, the bold values indicate where averaging the predictions from CRU and missForest improved the accuracy of the imputation.


List of figures

Figure 1.1. Distributions of environmental variables (a,b), traits (c,d), interaction term (e) and propagule pressure (f) as included in the model. The plots for the environmental variables and the interaction term also depict the curves representing the distributions of values under current conditions and as forecasted for 2050 in the USA (solid and dashed lines) and in Quebec (dotted and dash-dotted lines). The black dots show the establishments that occurred in the USA (see Table A.2 in Appendix A), and their size is proportional to the number of corresponding species, except for plots (e) and (f), where each dot represents a species-location combination. The interaction term is standardized for simplicity of representation, while propagule pressure values are on a log scale.

Figure 1.2. Average establishment risk across species by state (USA) or administrative region (Quebec) as predicted by the PET model under current (a) and future (b) climatic conditions. Darker shades indicate a higher average risk of establishment. Very low probabilities are displayed using scientific notation, e.g. 1E-5 corresponds to one multiplied by 0.00001.

Figure 1.3. Expected numbers of establishments for the US states at highest risk, as predicted by the PET model. Light grey areas indicate the number of species expected to be established under current conditions, while dark grey denotes the additional species forecasted to establish by the year 2050. Only the states with the highest expected numbers of establishments are shown, while the remaining states are pooled in a single bar (Rem.).


Figure 1.4. Distribution of the estimated establishment risk in the USA and Quebec. The black dots represent the species that have established in the USA. The solid and the dashed lines correspond to the distributions of establishment probabilities predicted for the USA presently and for 2050 respectively, while the dotted and the dash-dotted lines represent those predicted for Quebec for the same years. The probability of establishment values are reported on a log scale.

Figure 2.1. Illustrative example of rules of thumb, i.e. odds ratios (OR), derived for two species traits. Each dark dot represents the reference point, i.e. the average trait value across species in the dataset, while each triangular dot corresponds to the species s of interest.

Figure 2.2. Effect of each significant predictor on the likelihood of persistently establishing versus failing, expressed as odds ratio (OR), when gradually varying each predictor. OR equals 1 (dashed line) at each variable's mean value, reported by the corresponding point. The average values for the interaction plot (e) correspond to those of the respective main terms (b,d). The triangles indicate the OR of P. managuensis (traits) and Hawaii (environmental conditions), and their interaction (e). Very high values in (e) coincide with areas of extremely low absolute probability values, so that small increases in probability produce very high OR.

Figure 3.1. Possible estimates obtainable for each missing datum using CRU, depending on the amount of information used.

Figure 3.2. P-P plots evaluating each method's performance in estimating uncertainty. The plots compare the observed and the theoretical residuals' percentiles, given the uncertainty estimates provided by (a) CRU using the HUE algorithm, (b) RMSD, (c) MICE and (d) Phylopars. missForest was excluded as it did not provide an estimate of uncertainty for each imputed value. The 1:1 line defines the expectation for a perfect match between theoretical and observed percentiles.

Figure 3.3. Average uncertainty by number of species within (a) and (b) family.


Preface

Thesis format and style

This thesis is presented in a manuscript-based format, and consists of three papers. Each chapter focuses on making better use of limited ecological information, with the general scope of improving non-indigenous species establishment risk assessment, using an integrative and more accessible multispecies perspective.

Chapter 1

Della Venezia, L., Samson, J., & Leung, B. (2018). The rich get richer: Invasion risk across North America from the aquarium pathway under climate change. Diversity and Distributions, 24, 285-296.

Chapter 2

Della Venezia, L., & Leung, B. (under review at Biological Invasions). Guiding rapid response to non-indigenous aquarium fish: identifying risk factors for persistent versus casual establishment.

Chapter 3

Della Venezia, L., & Leung, B. (under review at Ecography). Filling in FishBase: a more powerful approach to the imputation of missing trait data.


Contribution of co-authors

This thesis is composed of my original work, and I am the primary author for all chapters. My work has been conducted in close cooperation with my supervisor, Prof. Brian Leung. Additionally, one chapter has seen the insightful contribution of an additional co-author, who is mentioned below.

Chapter 1: Lidia Della Venezia, Brian Leung and Jason Samson conceived the project. Lidia Della Venezia and Brian Leung formulated the model, and Lidia Della Venezia led the programming and the analysis of the data, in consultation with Brian Leung. Lidia Della Venezia wrote the first draft, and all authors revised the manuscript together, providing corrections and discussing ideas.

Chapter 2: Lidia Della Venezia and Brian Leung conceived the project. Lidia Della Venezia built and analysed the models, and derived the multiplicative risk factors for rapid response, assisted by helpful discussions with Brian Leung. Lidia Della Venezia led the preparation of the manuscript, with input from Brian Leung.

Chapter 3: Lidia Della Venezia and Brian Leung conceived the methodology. Lidia Della Venezia performed all the statistical analyses, assisted by discussions with Brian Leung. Lidia Della Venezia wrote the manuscript, with insightful contribution from Brian Leung.


Original contributions to knowledge

In Chapter 1, I present an integrated approach to non-indigenous species establishment risk assessment and I show how incorporating species traits, propagule pressure and local environmental conditions in a pathway-level risk assessment framework considerably improves predictive power. This approach represents the first attempt to incorporate all three categories of predictors in a multispecies framework, allowing both species-specific and geographically explicit predictions. Further, I use this integrative approach to predict non-indigenous species establishments under a climate change scenario for the year 2050. In contrast with other findings in the literature, I demonstrate that climate change might more profoundly affect areas that are already particularly susceptible to non-indigenous species establishment. This methodology is relevant from a management perspective and is applicable to virtually any introduction pathway.

Chapter 2 presents a multispecies approach to prioritize species and locations, and optimize management decisions in the context of rapid response. I model casual (i.e. temporary) establishment and persistence separately, to identify species that will likely become extirpated after early establishment and those that will manage to persist and potentially cause harm. I then derive practical "rules of thumb" for rapid assessment that can be used to prioritize instances of high risk. I thereby advance invasion ecology in two ways. Firstly, I deepen our understanding of the factors associated with successful establishment and their relative importance. I identify species traits and propagule pressure as most relevant for casual establishment, with the environment being more important for persistence, and propagule pressure having no effect on the latter. Secondly, I provide a quick and effective tool to prioritize the investment of resources for rapid response, while also addressing the need to improve the usability of technical knowledge for policy makers.
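To illustrate what a multiplicative risk factor means in practice, the following sketch assumes a plain logistic model with first and second order terms for a single predictor. The function names, coefficients and trait values are hypothetical and are not the fitted values reported in Chapter 2, whose model may take a different form. The sketch computes the odds ratio (OR) of a species relative to the reference point (the mean predictor value), so that OR equals 1 at the mean and an OR of 2 corresponds to a +100% change in the odds.

```python
import math


def odds_ratio(x, x_ref, beta1, beta2=0.0):
    """Odds ratio of predictor value x relative to the reference x_ref.

    Assumes log-odds = beta0 + beta1 * x + beta2 * x**2, so the intercept
    cancels when two predictor values are compared.
    """
    delta_log_odds = beta1 * (x - x_ref) + beta2 * (x ** 2 - x_ref ** 2)
    return math.exp(delta_log_odds)


def percent_change(or_value):
    """Express an odds ratio as a percentage change in the odds."""
    return (or_value - 1.0) * 100.0


# Hypothetical numbers, for illustration only.
mean_length = 20.0                 # reference point: average maximum length (cm)
beta_length = math.log(2) / 10.0   # odds double for every additional 10 cm

or_species = odds_ratio(30.0, mean_length, beta_length)
print(round(or_species, 3), round(percent_change(or_species), 1))
```

Under this assumed form, and in the absence of interaction terms, the odds ratios of separate predictors combine by multiplication, which is what makes them usable as quick rules of thumb.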

Finally, in Chapter 3 I focus on the imputation of missing information in species traits datasets, where missing data are the norm. Although imputation methods exist, I demonstrate that a novel, relatively simple approach based on trait Correlations, taxonomic Relatedness and Uncertainty estimation (CRU) can be more effective and provide more reliable error estimates for every single imputed datum, compared to alternative sophisticated tools. Additionally, I explore for the first time the effectiveness of using an ensemble modelling approach to the imputation of missing species traits, and I demonstrate that averaging predictions from CRU and other well-performing methods further improves accuracy. Finally, employing the best multi-model ensemble, I fill in and provide a complete version of a substantial subset of the fish trait database FishBase. This methodology can be applied to other trait datasets, with applications in a variety of ecological fields, such as functional and community ecology, and conservation.
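The ensemble idea can be sketched as follows, under the assumption that the ensemble is an unweighted mean of the predictions from two imputation methods. The arrays are toy values rather than results from this thesis, the function names are hypothetical, and RMSE stands in here as a generic accuracy measure. The example shows how averaging can lower cross-validation error when the errors of the two methods partly cancel.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def ensemble_mean(*prediction_sets):
    """Unweighted average of several methods' predictions, value by value."""
    return [sum(values) / len(values) for values in zip(*prediction_sets)]


# Toy held-out trait values and the predictions of two hypothetical methods.
observed = [1.0, 2.0, 3.0, 4.0]
method_a = [1.5, 2.5, 3.5, 4.5]   # overestimates every value by 0.5
method_b = [0.5, 1.5, 2.5, 3.5]   # underestimates every value by 0.5

combined = ensemble_mean(method_a, method_b)
print(rmse(method_a, observed), rmse(method_b, observed), rmse(combined, observed))
```

In this deliberately extreme case the errors cancel exactly; with real data the gain depends on how correlated the errors of the individual methods are.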


General introduction

Introduction

Non-indigenous species are species that are encountered outside of the range defined by their natural barriers and dispersal capacities (Mack et al., 2000). Due to increasing economic globalization, over the past few decades the number of non-indigenous species has steadily increased worldwide (Ricciardi, 2007; Seebens et al., 2017), resulting in what has been called "biotic homogenization" (McKinney & Lockwood, 1999; Olden, 2006). In fact, the process by which certain species manage to overcome their natural biogeographical barriers and reach non-native locations is often mediated by human-related activities, and these species continue to be transported around the world both intentionally and unintentionally (Hulme et al., 2008).

While not all non-indigenous species are able to establish self-sustaining populations and most of them do not cause harm (García-Berthou et al., 2005; Ricciardi & Cohen, 2007; Williamson & Fitter, 1996), some can undergo explosions in population density (Elton, 1958). Consequently, these species achieve widespread distributions and begin to cause substantial impacts, becoming essentially invasive (Colautti & MacIsaac, 2004). In new habitats, invasive species can interact with the native community and act as competitors or parasites (Lockwood et al., 2013), causing biodiversity loss (Doherty et al., 2016; Millennium Ecosystem Assessment, 2005; Sala et al., 2000) and altering ecosystem dynamics (Ehrenfeld, 2010; Pyšek et al., 2012). Freshwater ecosystems are particularly impacted by non-indigenous species (Lockwood et al., 2013), which represent one of the major causes of biodiversity reduction, second only to habitat loss (Dextrase & Mandrak, 2006). In fact, freshwater species appear to be five times more prone to extinction than their terrestrial counterparts in North America (Ricciardi & Rasmussen, 1999).


In addition, invasive species can cause economic losses of hundreds of billions of dollars, by damaging agriculture, driving the loss of natural resources, affecting recreational activities and threatening human health (Pimentel et al., 2005).

The considerable damages caused by invasive species underlie much of the research that has been devoted to identifying the factors associated with biological invasions, and to understanding how they relate to invasion success, in order to prevent, control, and reduce the impact of harmful species (Byers et al., 2002; Keller et al., 2007; Ricciardi & MacIsaac, 2008).

Risk assessment and risk management

Most of the research currently conducted in the field of invasion ecology can be related to invasive species risk assessments (Lodge et al., 2006; Ricciardi & Rasmussen, 1998; Stohlgren & Schnase, 2006). Developing tools to predict and evaluate the impact of invasive species is essential to identify effective management strategies and, most importantly, to inform instances where these are needed (Andersen et al., 2004a; Chadès et al., 2011; Kerr et al., 2016). While potentially harmful species should ideally be excluded from intentional introductions in non-native locations, this is hardly feasible without imposing unsustainable restrictions on certain industries (e.g., aquaculture; Naylor et al., 2001). Additionally, many currently invasive species originated from their unintentional movement around the globe, often as accidental hitchhikers on traded goods (Hulme, 2009; Westphal et al., 2008).

Hence, an increasing number of studies have focused on providing sound scientific background to inform management strategies aimed at preventing and reducing impact (Buckley, 2008; Simberloff, 2003). Among the variety of possible management actions, prevention and rapid response have been widely recommended as the most efficient and most cost-effective (Alvarez & Solis, 2019; Lodge et al., 2006). While prevention is usually more feasible and thus prioritized (Finnoff et al., 2007; Leung et al., 2002), it does not necessarily guarantee success (Vander Zanden et al., 2010), so that rapid response remains critical (Wittenberg & Cock, 2001). Moreover, even if many governments worldwide have implemented agendas for the prevention and control of invasive species (McGeoch et al., 2010), the resources invested are often insufficient, if not incorrectly spent (Finnoff et al., 2007; Leung et al., 2002). Pinpointing high-risk species and vulnerable ecosystems as priorities for intervention would allow funds to be more efficiently used (Lohr et al., 2017; Papeş et al., 2011) and remains a challenging task of invasion ecology (Stewart-Koster et al., 2015).

One of the main obstacles to the effective management of non-indigenous species is time. Efficient prevention and control strongly rely on timely responses to new introductions and detections of potentially invasive species (Mehta et al., 2007; Simberloff et al., 2013). However, risk assessments are often performed on a species-by-species basis (Leung et al., 2012), which makes the comparison of several taxa introduced at the same time a daunting task, particularly given the increasing number of organisms moved around the world. This has led to a call for multispecies risk assessments, where tens or even hundreds of species can be assessed simultaneously. Recently, a few multispecies frameworks have been developed successfully (e.g., Chapman et al., 2017; Singh et al., 2015), often focusing on species belonging to the same invasion pathway (e.g., Bradie & Leung, 2015). For instance, a pathway that has received increasing attention is the aquarium trade, responsible for the introduction and the establishment of several species around the globe (Duggan, 2010; Rixon et al., 2005), including a third of the 100 worst invasive fish (Padilla & Williams, 2004). Pathway-based tools allow risk to be estimated simultaneously for species belonging to the same pathway of introduction, shortening the time necessary for preliminary assessments.

Another essential factor for efficient risk assessments is the need to evaluate risk in a spatially explicit fashion. In fact, even non-indigenous species that can eventually cause damage are usually able to establish and become harmful only in certain suitable locations (Ricciardi et al., 2013; Williamson & Fitter, 1996). Risk assessment tools that help prioritize not only species but also geographical areas would improve the accuracy of predictions (Andersen et al., 2004b).

Although attempts have already been made over the past decades (e.g., Giljohann et al., 2011; Pitt et al., 2009; Rouget et al., 2002), spatially explicit frameworks should be used more extensively.

Finally, while quantitative risk assessments would be more rigorous than qualitative and scoring approaches, the latter are often the frameworks used by policy makers (Cook et al., 2010). There are two main reasons for this. Firstly, quantitative tools for risk assessment have started to be developed only in recent years, especially for multiple species, so that expert opinion continues to be preferred (Leung et al., 2012). Secondly, even when sophisticated quantitative tools exist, they often require substantial technical skills. Therefore, researchers should ensure that technical knowledge is also usable by non-specialists (Cassey et al., 2018a).

Non-indigenous species establishment

Biological invasions can be described as a series of steps, each representing a potential target for management (e.g., transport, introduction, establishment, spread; Blackburn et al., 2011).

However, most studies have focused on the establishment phase (Leung et al., 2012), i.e. the process by which a non-indigenous species founds a self-sustaining population in a novel location (Lockwood et al., 2013). In fact, the earlier stages of invasions are considered critical, since they generally offer a higher likelihood of successful management at lower costs (Hulme, 2006; Puth & Post, 2005; Rejmánek & Pitcairn, 2002).

Among the determinants of successful establishment, the environmental conditions of the receiving location, and climate matching in particular, are essential (e.g., Duncan et al., 2014; Hayes & Barry, 2008; Mahoney et al., 2015), and relate to the concept of niche, by which a species' spatial distribution is limited by a multivariate set of environmental variables within which the species can maintain a self-sustaining population (Jiménez-Valverde et al., 2011). This concept underlies species distribution models (SDMs), which have been increasingly used to predict where a species can establish based on local environmental features (e.g., Ficetola et al., 2007; Gallien et al., 2010; Peterson, 2003). Another critical factor influencing establishment success is propagule pressure, i.e. the number of introduced individuals (Lockwood et al., 2005; 2009), which has been widely recognized as a factor of primary importance (Cassey et al., 2018b; Colautti et al., 2006; Simberloff, 2009) and has been incorporated in SDMs (e.g., Leung & Mandrak, 2007) to improve predictions. Similarly, species traits have been identified as important determinants of successful establishment, including, among others, temperature tolerances, size and trophic level (Kolar & Lodge, 2002; Pyšek et al., 2009; Van Kleunen et al., 2010). Given their relevance, integrating these different classes of predictors into a unified risk assessment model should ideally increase the accuracy of predictions compared with studies that analyze each component individually, while avoiding redundancy and improving efficiency (Leung et al., 2012). In addition, it might shed light on the relative importance of each factor in the successful establishment of non-indigenous species and provide insight into the processes underlying one of the earlier stages of invasions. In fact, although these predictors are known to be important, their role during establishment is not yet fully understood (Dawson et al., 2009; Essl et al., 2015; Milbau & Stout, 2008). Finally, incorporating information from each of these classes of variables would have the further advantage of allowing predictions that are both species-specific and location-specific and could potentially be used to tailor management actions, for example by reducing the number of individuals displaced (i.e. propagule pressure), or restricting species based on certain traits or only in specific locations.

Ideally, integrating information about environmental conditions would additionally make it possible to account for the effect of other drivers of global change, which have only recently begun to be studied in conjunction with invasions (Brook et al., 2008; Didham et al., 2007). Climate change in particular appears to have opened up several novel opportunities for invaders around the world (Walther et al., 2009), for example by making temperate areas, normally considered too cold, accessible to sub-tropical and tropical species (Hellmann et al., 2008; Rahel & Olden, 2008), potentially exacerbating the impact of invasions.

Unfortunately, the characterization of non-indigenous species establishment is often hampered by the limited availability of adequate information. Data can be missing for any predictor variable, and proxies might be required as surrogates when more detailed information is lacking (e.g., Eschtruth & Battles, 2011; Cook et al., 2019). Even species occurrences are often difficult to categorize. Simple detections of a species in the wild are frequently treated as actual establishment. However, the majority of non-indigenous species detected in the wild later become extirpated without human intervention (Blackburn et al., 2011; Williamson & Fitter, 1996). Overlooking these failed establishments can in turn lead to overestimating risk and misidentifying which species pose a real threat (Zenni & Nuñez, 2013), with important consequences for management. Arguably, separating establishment into sub-stages and distinguishing the species that manage to establish only temporarily from those that manage to persist in the long term would be advantageous from both a theoretical and an applied point of view, improving our understanding of the factors determining successful invasions and avoiding misspending resources on species that would likely become extinct irrespective of management actions.

Missing data in ecological datasets

Lack of information is an important impediment to the study of invasions, as it is to other branches of ecology (Nakagawa & Freckleton, 2008). This has become even more apparent with the growing availability of large-scale datasets. The use of data-intensive approaches represents an extraordinary opportunity to address ecological questions that would have been unthinkable a few decades ago (Kelling et al., 2009; Luo et al., 2011), including the aforementioned pathway-level risk assessments for invasive species. Nonetheless, extensive databases are virtually always incomplete (Allison, 2002; Horton & Kleinman, 2007) and consequently part of the information they contain is often considered unusable (Nakagawa & Freckleton, 2008). For example, many regression-based methods provided in statistical programming languages automatically discard incomplete rows of data and restrict the analysis to the so-called "complete cases" (e.g., R; R Core Team, 2018). Alternatively, in-depth examination is required to evaluate how best to use patchy data, and decisions often depend on the specific case (Jones, 1996).
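The cost of restricting an analysis to complete cases can be illustrated with a minimal Python sketch. The trait table, species labels and values below are hypothetical and invented purely for illustration; they are not taken from FishBase or from the analyses in this thesis.

```python
# Hypothetical trait table: all values are invented for illustration only.
# None marks a missing entry.
traits = [
    {"species": "sp_A", "max_length_cm": 12.0, "trophic_level": 3.1, "min_temp_c": 18.0},
    {"species": "sp_B", "max_length_cm": None, "trophic_level": 2.8, "min_temp_c": 20.0},
    {"species": "sp_C", "max_length_cm": 30.5, "trophic_level": None, "min_temp_c": 15.0},
    {"species": "sp_D", "max_length_cm": 8.2, "trophic_level": 3.4, "min_temp_c": None},
    {"species": "sp_E", "max_length_cm": 45.0, "trophic_level": 4.0, "min_temp_c": 10.0},
]


def complete_cases(rows):
    """Keep only rows in which no field is missing (listwise deletion)."""
    return [row for row in rows if all(v is not None for v in row.values())]


def missing_fraction(rows):
    """Fraction of missing cells among the trait (non-identifier) fields."""
    cells = [v for row in rows for k, v in row.items() if k != "species"]
    return sum(v is None for v in cells) / len(cells)


kept = complete_cases(traits)
cell_missingness = missing_fraction(traits)  # 3 of 15 trait cells are missing
row_loss = 1 - len(kept) / len(traits)       # 3 of 5 rows are discarded
```

In this toy table, one fifth of the trait cells are missing, yet three of the five species are dropped, because each gap falls in a different row. The same pattern is one reason why sparse but scattered missingness can make a large part of a database unusable under complete-case analysis.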

Classic examples of databases with very high levels of missingness are global species trait repositories, such as FishBase for fish (Froese & Pauly, 2018) and TRY for plants (Kattge et al., 2011). Complete versions of these datasets would allow trait information to be used more efficiently. As a matter of fact, species traits represent versatile tools serving several purposes, including estimating and monitoring biodiversity (Tilman, 2001; Vandewalle et al., 2010), understanding the mechanisms behind ecosystem services (Lavorel et al., 2011), assessing risk of invasions and extinctions (Liu et al., 2017) and informing conservation (Cadotte et al., 2011).

Nevertheless, trait-based metrics tend to be quite sensitive to missing data (Májeková et al., 2016; Pakeman, 2014).

Until complete information is collected, a solution to deal with missing data is imputation, i.e. the replacement of missing values with best estimates. Reasonable replacement data can be obtained based on different mechanisms (Penone et al., 2014), from simple options like using the average or median of existing values (e.g., Nakagawa et al., 2001), to choices based on ecological hypotheses (Taugourdeau et al., 2014), to more sophisticated methods based on regression or phylogenetics (e.g., Goolsby et al., 2017; Stekhoven & Bühlmann, 2011; van Buuren & Groothuis-Oudshoorn, 2011). Although imputation tends to outperform complete-case analyses (Nakagawa & Freckleton, 2008; van der Heijden et al., 2006), it is used relatively rarely in ecology (Nakagawa, 2015). Notably, when missing values in trait databases are replaced with plausible ones, alternative methods tend to perform differently depending on the specific trait and on the missingness rate (e.g., Poyatos et al., 2018), and not all existing methods produce a measure of the reliability (i.e. uncertainty) of their predictions (Penone et al., 2014). Thus, the choice of the appropriate method remains an open question.
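The simplest of these options, replacement with the average or median of the observed values, can be sketched in a few lines of Python. This is only an illustration of the general idea under stated assumptions: the function name `impute` and the body-length values are hypothetical, and the sketch is not the imputation method developed in this thesis.

```python
from statistics import mean, median


def impute(values, how="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a variable with no observed values")
    fill = mean(observed) if how == "mean" else median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical body lengths (cm); one large species skews the mean.
lengths = [10.0, 12.0, None, 14.0, 100.0, None]

mean_filled = impute(lengths, how="mean")      # gaps filled with 34.0
median_filled = impute(lengths, how="median")  # gaps filled with 13.0
```

The two variants fill the same gaps with very different values when the trait distribution is skewed, which hints at why performance can depend on the specific trait. Note also that neither variant attaches any measure of uncertainty to the values it fills in.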

Cases in which alternative methods perform differently have sometimes been tackled using a so-called ensemble modelling approach (Bates & Granger, 1969; Clemen, 1989; Palm & Zellner, 1992; Winkler, 1989). Essentially, multiple models with desirable characteristics are combined to obtain improved predictions, based on the assumption that the noise from each model would "cancel out", while a reliable signal would emerge more clearly (Bates & Granger, 1969). Ecologists have successfully used ensemble models (e.g., SDMs; Araújo & New, 2007; Guo et al., 2015). Considering that each imputation methodology presents advantages and disadvantages, averaging predictions from multiple approaches in an ensemble modelling perspective might be an avenue worth exploring to improve predictions of missing values. This would also entail the development of additional imputation algorithms that further improve predictions, while accounting for uncertainty.
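The intuition that errors from different methods can partly cancel out when predictions are averaged can be shown with a small Python sketch. The "true" values and the two sets of predictions are hypothetical and were constructed so that the errors of the two methods are only partly aligned; the unweighted mean used here is one simple way of combining forecasts, assumed for illustration rather than taken from the chapters of this thesis.

```python
from math import sqrt


def rmse(pred, truth):
    """Root-mean-square error between predictions and true values."""
    return sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def ensemble_mean(*predictions):
    """Unweighted average of several prediction vectors, element by element."""
    return [sum(vals) / len(vals) for vals in zip(*predictions)]


# Hypothetical held-out trait values and predictions from two methods.
truth = [1.0, 2.0, 3.0, 4.0]
method_a = [1.5, 1.5, 3.5, 3.5]  # errors: +0.5, -0.5, +0.5, -0.5
method_b = [0.5, 2.5, 3.5, 3.5]  # errors: -0.5, +0.5, +0.5, -0.5

combined = ensemble_mean(method_a, method_b)

rmse_a = rmse(method_a, truth)
rmse_b = rmse(method_b, truth)
rmse_ensemble = rmse(combined, truth)  # lower than either method alone here
```

In this constructed case the errors of the two methods have opposite signs for the first two values and the same sign for the last two, so averaging removes the former and leaves the latter unchanged. When the errors of the combined methods are strongly correlated, averaging would be expected to bring little improvement.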

Methodological approach

In this thesis, I essentially adopt an ecological modelling approach. Statistical models are extensively used in ecology, where they mainly serve two purposes: inference and prediction.

Here, I predominantly focus on the predictive side of models. Nonetheless, in Chapters 1 and 2, ecological modelling and model comparison are also used to derive insight into the relevance of certain categories of biological predictors to the study of non-indigenous species establishment, and into the extent to which such predictors contribute to success or failure.

More specifically, I adapt existing modelling approaches to novel needs and modify them to incorporate additional predictors, to improve the accuracy of forecasts. From an ecological perspective, I concentrate on generating frameworks that can be applied to several species simultaneously and that can translate different sources of information into species-specific and location-specific predictions. To this end, integration is essential to this thesis on several levels: it includes combining different modelling approaches, diverse types of predictors, alternative drivers of environmental change, and predictions from different algorithms.

While ecological modelling represents a powerful tool for research, I recognize the importance of making sure that the rationale of the statistical approaches used is supported by ecological hypotheses and that the results are analyzed considering the underlying biological mechanisms. Therefore, the assumptions of each model are acknowledged throughout this work, and the limitations are addressed whenever possible.

Thesis outline

In this thesis, I address the limitations identified above. Specifically, I investigate modelling approaches to improve the understanding and management of non-indigenous species establishment, and examine the optimal use of limited ecological information.

In Chapter 1, I extend an existing modelling framework for pathway-level risk assessment of non-indigenous species establishment (Bradie & Leung, 2015) which included propagule pressure and species traits data. Focusing on freshwater fish species introduced in the USA via the aquarium pathway, I additionally incorporate information on environmental conditions and trait-environment interactions, to obtain geographically explicit, species-specific predictions.

Including the interaction terms was also necessary to tackle the so-called fourth-corner problem (Legendre et al., 1997), which aims at understanding the role played by traits in the way a species interacts with the surrounding environment. Through the inclusion of environmental information, I evaluate the effect of a climate change scenario forecast for the year 2050 on the establishment likelihood of aquarium fish, which reveals the need to prioritize management actions in the southernmost regions of the country for this introduction pathway.

In Chapter 2, I provide tools for rapid response to non-indigenous species detections.

Given that most non-indigenous species encountered in the wild later become extinct without intervention, I model casual (i.e. temporary) and persistent establishment using the framework developed in Chapter 1, to serve two objectives. Firstly, I derive simple "rules of thumb" for rapid assessment that will help efficiently define management priorities, by differentiating instances in which action is needed from those in which a non-indigenous species will likely go extinct without intervention. Secondly, I better characterize the establishment phase by identifying the relevant predictors of each sub-stage and their relative importance, showing that while species traits and propagule pressure are most influential for achieving initial, casual establishment, the environment is critical for persistence, on which instead propagule pressure has no effect.

Finally, I address the issue of the high amount of missing data in species trait databases, which was encountered in Chapters 1 and 2. Concretely, in Chapter 3 I develop a novel imputation technique that uses information from trait Correlations, taxonomic Relatedness and Uncertainty minimization (CRU). Uncertainty is estimated using a novel algorithm that provides error measures for each imputed datum. Using a validation approach on data available for 20 functional traits across more than 30,000 species from the FishBase database (Froese & Pauly, 2018), I demonstrate that, despite its relative simplicity, CRU performs better than more sophisticated approaches. Further, I show that the algorithm for uncertainty estimation predicts the error (i.e. the deviation between true and predicted values) better than the methods provided by other imputation approaches. Finally, I investigate the use of an ensemble modelling approach to missing data imputation, and I demonstrate that averaging predictions from alternative methods that perform differently depending on the specific trait increases the overall accuracy of imputation. Then, I use the best ensemble of imputation models to generate a complete version of a substantial subset of the FishBase dataset.

Overall, this thesis contributes a better understanding of one of the fundamental phases of biological invasions, along with tools to facilitate management practices, to prioritize resources, and to integrate information to make the best use of the available data.

References

Allison, P. D. (2002). Missing data. Thousand Oaks, CA: Sage Publications.

Alvarez, S., & Solis, D. (2019). Rapid Response Lowers Eradication Costs of Invasive Species: Evidence from . Choices, 33, 1.

Andersen, M. C., Adams, H., Hope, B., & Powell, M. (2004a). Risk assessment for invasive species. Risk Analysis: An International Journal, 24, 787-793.

Andersen, M. C., Adams, H., Hope, B., & Powell, M. (2004b). Risk analysis for invasive species: general framework and research needs. Risk Analysis: An International Journal, 24, 893-900.

Araújo, M. B., & New, M. (2007). Ensemble forecasting of species distributions. Trends in ecology & evolution, 22, 42-47.

Bates, J. M., & Granger, C. W. (1969). The combination of forecasts. Journal of the Operational Research Society, 20, 451-468.

Blackburn, T. M., Pyšek, P., Bacher, S., Carlton, J. T., Duncan, R. P., Jarošík, V., ... & Richardson, D. M. (2011). A proposed unified framework for biological invasions. Trends in ecology & evolution, 26, 333-339.

Bradie, J., & Leung, B. (2015). Pathway‐level models to predict non‐indigenous species establishment using propagule pressure, environmental tolerance and trait data. Journal of applied ecology, 52, 100-109.

Brook, B. W., Sodhi, N. S., & Bradshaw, C. J. (2008). Synergies among extinction drivers under global change. Trends in ecology & evolution, 23, 453-460.

Buckley, Y. M. (2008). The role of research for integrated management of invasive species, invaded landscapes and communities. Journal of Applied Ecology, 45, 397-402.

Byers, J. E., Reichard, S., Randall, J. M., Parker, I. M., Smith, C. S., Lonsdale, W. M., ... & Hayes, D. (2002). Directing research to reduce the impacts of nonindigenous species. Conservation Biology, 16, 630-640.

Cadotte, M. W., Carscadden, K., & Mirotchnick, N. (2011). Beyond species: functional diversity and the maintenance of ecological processes and services. Journal of applied ecology, 48, 1079-1087.

Cassey, P., Delean, S., Lockwood, J. L., Sadowski, J., & Blackburn, T. M. (2018b). Dissecting the null model for biological invasions: A meta-analysis of the propagule pressure effect. PLoS biology, 16, e2005987.

Cassey, P., García-Díaz, P., Lockwood, J. L., & Blackburn, T. M. (2018a). Invasion Biology: Searching for Predictions and Prevention, and Avoiding Lost Causes. Invasion Biology: Hypotheses and Evidence, 1.

Chadès, I., Martin, T. G., Nicol, S., Burgman, M. A., Possingham, H. P., & Buckley, Y. M. (2011). General rules for managing and surveying networks of pests, diseases, and endangered species. Proceedings of the National Academy of Sciences, 108, 8323-8328.

Chapman, D., Purse, B. V., Roy, H. E., & Bullock, J. M. (2017). Global trade networks determine the distribution of invasive non‐native species. Global Ecology and Biogeography, 26, 907-917.

Clemen, R. T. (1989). Combining forecasts: A review and annotated bibliography. International journal of forecasting, 5, 559-583.

Colautti, R. I., Grigorovich, I. A., & MacIsaac, H. J. (2006). Propagule pressure: a null model for biological invasions. Biological Invasions, 8, 1023-1037.

Colautti, R. I., & MacIsaac, H. J. (2004). A neutral terminology to define ‘invasive’ species. Diversity and distributions, 10, 135-141.

Cook, C. N., Hockings, M., & Carter, R. B. (2010). Conservation in the dark? The information used to support management decisions. Frontiers in Ecology and the Environment, 8, 181-186.

Cook, G., Jarnevich, C., Warden, M., Downing, M., Withrow, J., & Leinwand, I. (2019). Iterative Models for Early Detection of Invasive Species across Spread Pathways. Forests, 10, 108.

Dawson, W., Burslem, D. F., & Hulme, P. E. (2009). Factors explaining alien plant invasion success in a tropical ecosystem differ at each stage of invasion. Journal of Ecology, 97, 657-665.

Dextrase, A. J., & Mandrak, N. E. (2006). Impacts of alien invasive species on freshwater fauna at risk in Canada. Biological Invasions, 8, 13-24.

Didham, R. K., Tylianakis, J. M., Gemmell, N. J., Rand, T. A., & Ewers, R. M. (2007). Interactive effects of habitat modification and species invasion on native species decline. Trends in ecology & evolution, 22, 489-496.

Doherty, T. S., Glen, A. S., Nimmo, D. G., Ritchie, E. G., & Dickman, C. R. (2016). Invasive predators and global biodiversity loss. Proceedings of the National Academy of Sciences, 113, 11261-11265.

Dormann, C. F., Calabrese, J. M., Guillera‐Arroita, G., Matechou, E., Bahn, V., Bartoń, K., ... & Guelat, J. (2018). Model averaging in ecology: a review of Bayesian, information‐theoretic, and tactical approaches for predictive inference. Ecological Monographs.

Duggan, I. C. (2010). The freshwater aquarium trade as a vector for incidental invertebrate fauna. Biological invasions, 12, 3757-3770.

Elton, C.S. (1958). The ecology of invasions by and plants. Methuen, , United Kingdom.

Duncan, R. P., Blackburn, T. M., Rossinelli, S., & Bacher, S. (2014). Quantifying invasion risk: the relationship between establishment probability and founding population size. Methods in Ecology and Evolution, 5, 1255-1263.

Ehrenfeld, J. G. (2010). Ecosystem consequences of biological invasions. Annual review of ecology, evolution, and systematics, 41, 59-80.

Eschtruth, A. K., & Battles, J. J. (2011). The importance of quantifying propagule pressure to understand invasion: an examination of riparian forest invasibility. Ecology, 92, 1314-1322.

Essl, F., Dullinger, S., Moser, D., Steinbauer, K., & Mang, T. (2015). Macroecology of global bryophyte invasions at different invasion stages. Ecography, 38, 488-498.

Ficetola, G. F., Thuiller, W., & Miaud, C. (2007). Prediction and validation of the potential global distribution of a problematic alien invasive species—the American bullfrog. Diversity and distributions, 13, 476-485.

Finnoff, D., Shogren, J. F., Leung, B., & Lodge, D. (2007). Take a risk: preferring prevention over control of biological invaders. Ecological Economics, 62, 216-222.

Froese, R., & Pauly, D. (Eds.). (2018). FishBase. World Wide Web electronic publication: www.fishbase.org.

Gallien, L., Münkemüller, T., Albert, C. H., Boulangeat, I., & Thuiller, W. (2010). Predicting potential distributions of invasive species: where to go from here?. Diversity and Distributions, 16, 331-342.

García-Berthou, E., Alcaraz, C., Pou-Rovira, Q., Zamora, L., Coenders, G., & Feo, C. (2005). Introduction pathways and establishment rates of invasive aquatic species in Europe. Canadian Journal of Fisheries and Aquatic Sciences, 62, 453-463.

Giljohann, K. M., Hauser, C. E., Williams, N. S., & Moore, J. L. (2011). Optimizing invasive species control across space: willow invasion management in the Australian Alps. Journal of Applied Ecology, 48, 1286-1294.

Goolsby, E. W., Bruggeman, J., & Ané, C. (2017). Rphylopars: fast multivariate phylogenetic comparative methods for missing data and within‐species variation. Methods in Ecology and Evolution, 8, 22-27.

Guo, C., Lek, S., Ye, S., Li, W., Liu, J., & Li, Z. (2015). Uncertainty in ensemble modelling of large-scale species distribution: effects from species characteristics and model techniques. Ecological modelling, 306, 67-75.

Hayes, K. R., & Barry, S. C. (2008). Are there any consistent predictors of invasion success?. Biological invasions, 10, 483-506.

Hellmann, J. J., Byers, J. E., Bierwagen, B. G., & Dukes, J. S. (2008). Five potential consequences of climate change for invasive species. Conservation biology, 22, 534-543.

Horton, N. J., & Kleinman, K. P. (2007). Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. The American Statistician, 61, 79-90.

Hulme, P. E. (2006). Beyond control: wider implications for the management of biological invasions. Journal of Applied Ecology, 43, 835-847.

Hulme, P. E. (2009). Trade, transport and trouble: managing invasive species pathways in an era of globalization. Journal of applied ecology, 46, 10-18.

Hulme, P. E., Bacher, S., Kenis, M., Klotz, S., Kühn, I., Minchin, D., ... & Pyšek, P. (2008). Grasping at the routes of biological invasions: a framework for integrating pathways into policy. Journal of Applied Ecology, 45, 403-414.

Jiménez-Valverde, A., Peterson, A. T., Soberón, J., Overton, J. M., Aragón, P., & Lobo, J. M. (2011). Use of niche models in invasive species risk assessments. Biological invasions, 13, 2785-2797.

Jones, M. P. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American statistical association, 91, 222-230.

Kattge, J., Diaz, S., Lavorel, S., Prentice, I. C., Leadley, P., Bönisch, G., ... & Cornelissen, J. H. C. (2011). TRY–a global database of plant traits. Global change biology, 17, 2905-2935.

Keller, R. P., Lodge, D. M., & Finnoff, D. C. (2007). Risk assessment for invasive species produces net bioeconomic benefits. Proceedings of the National Academy of Sciences, 104, 203-207.

Kelling, S., Hochachka, W. M., Fink, D., Riedewald, M., Caruana, R., Ballard, G., & Hooker, G. (2009). Data-intensive science: a new paradigm for biodiversity studies. BioScience, 59, 613-620.

Kerr, N. Z., Baxter, P. W., Salguero‐Gómez, R., Wardle, G. M., & Buckley, Y. M. (2016). Prioritizing management actions for invasive populations using cost, efficacy, demography and expert opinion for 14 plant species world‐wide. Journal of applied ecology, 53, 305-316.

Kolar, C. S., & Lodge, D. M. (2002). Ecological predictions and risk assessment for alien in North America. Science, 298, 1233-1236.

Lavorel, S., Grigulis, K., Lamarque, P., Colace, M. P., Garden, D., Girel, J., ... & Douzet, R. (2011). Using plant functional traits to understand the landscape distribution of multiple ecosystem services. Journal of Ecology, 99, 135-147.

Legendre, P., Galzin, R., & Harmelin-Vivien, M. L. (1997). Relating behavior to habitat: solutions to the fourth-corner problem. Ecology, 78, 547-562.

Leung, B., Lodge, D. M., Finnoff, D., Shogren, J. F., Lewis, M. A., & Lamberti, G. (2002). An ounce of prevention or a pound of cure: bioeconomic risk analysis of invasive species. Proceedings of the Royal Society of London B: Biological Sciences, 269, 2407-2413.

Leung, B., & Mandrak, N. E. (2007). The risk of establishment of aquatic invasive species: joining invasibility and propagule pressure. Proceedings of the Royal Society of London B: Biological Sciences, 274, 2603-2609.

Leung, B., Roura‐Pascual, N., Bacher, S., Heikkilä, J., Brotons, L., Burgman, M. A., ... & Sol, D. (2012). TEASIng apart alien species risk assessments: a framework for best practices. Ecology Letters, 15, 1475-1493.

Liu, C., Comte, L., & Olden, J. D. (2017). Heads you win, tails you lose: Life‐history traits predict invasion and extinction risk of the world's freshwater fishes. Aquatic Conservation: Marine and Freshwater Ecosystems, 27, 773-779.

Lockwood, J. L., Cassey, P., & Blackburn, T. (2005). The role of propagule pressure in explaining species invasions. Trends in Ecology & Evolution, 20, 223-228.

Lockwood, J. L., Cassey, P., & Blackburn, T. M. (2009). The more you introduce the more you get: the role of colonization pressure and propagule pressure in invasion ecology. Diversity and Distributions, 15, 904-910.

Lockwood, J. L., Hoopes, M. F., & Marchetti, M. P. (2013). Invasion ecology. John Wiley & Sons.

Lodge, D. M., Williams, S., MacIsaac, H. J., Hayes, K. R., Leung, B., Reichard, S., ... & Carlton, J. T. (2006). Biological invasions: recommendations for US policy and management. Ecological applications, 16, 2035-2054.

Lohr, C. A., Hone, J., Bode, M., Dickman, C. R., Wenger, A., & Pressey, R. L. (2017). Modeling dynamics of native and invasive species to guide prioritization of management actions. Ecosphere, 8.

Luo, Y., Ogle, K., Tucker, C., Fei, S., Gao, C., LaDeau, S., ... & Schimel, D. S. (2011). Ecological forecasting and data assimilation in a data‐rich era. Ecological Applications, 21, 1429-1442.

Mack, R. N., Simberloff, D., Mark Lonsdale, W., Evans, H., Clout, M., & Bazzaz, F. A. (2000). Biotic invasions: causes, epidemiology, global consequences, and control. Ecological applications, 10, 689-710.

Mahoney, P. J., Beard, K. H., Durso, A. M., Tallian, A. G., Long, A. L., Kindermann, R. J., ... & Mohn, H. E. (2015). Introduction effort, climate matching and species traits as predictors of global establishment success in non‐native reptiles. Diversity and Distributions, 21, 64-74.

Májeková, M., Paal, T., Plowman, N. S., Bryndová, M., Kasari, L., Norberg, A., ... & Le Bagousse-Pinguet, Y. (2016). Evaluating functional diversity: missing trait data and the importance of species abundance structure and data transformation. PloS one, 11, e0149270.

McGeoch, M. A., Butchart, S. H., Spear, D., Marais, E., Kleynhans, E. J., Symes, A., ... & Hoffmann, M. (2010). Global indicators of biological invasion: species numbers, biodiversity impact and policy responses. Diversity and Distributions, 16, 95-108.

McKinney, M. L., & Lockwood, J. L. (1999). Biotic homogenization: a few winners replacing many losers in the next mass extinction. Trends in ecology & evolution, 14, 450-453.

Mehta, S. V., Haight, R. G., Homans, F. R., Polasky, S., & Venette, R. C. (2007). Optimal detection and control strategies for invasive species management. Ecological Economics, 61, 237-245.

Milbau, A., & Stout, J. C. (2008). Factors associated with alien plants transitioning from casual, to naturalized, to invasive. Conservation Biology, 22, 308-317.

Millennium Ecosystem Assessment (MA) (2005). Ecosystems and Human Well-being: Current State and Trends. Volume 1, Island Press, Washington, DC.

Nakagawa, S. (2015). Missing data: mechanisms, methods and messages. Ecological statistics: Contemporary theory and application, 81-105.

Nakagawa, S., & Freckleton, R. P. (2008). Missing inaction: the dangers of ignoring missing data. Trends in Ecology & Evolution, 23, 592-596.

Nakagawa, S., Waas, J. R., & Miyazaki, M. (2001). Heart rate changes reveal that little blue penguin chicks (Eudyptula minor) can use vocal signatures to discriminate familiar from unfamiliar chicks. Behavioral Ecology and Sociobiology, 50, 180-188.

Naylor, R. L., Williams, S. L., & Strong, D. R. (2001). Aquaculture - a gateway for exotic species. Science, 294, 1655-1656.

Olden, J. D. (2006). Biotic homogenization: a new research agenda for conservation biogeography. Journal of Biogeography, 33, 2027-2039.

Padilla, D. K., & Williams, S. L. (2004). Beyond ballast water: aquarium and ornamental trades as sources of invasive species in aquatic ecosystems. Frontiers in Ecology and the Environment, 2, 131-138.

Pakeman, R. J. (2014). Functional trait metrics are sensitive to the completeness of the species' trait data?. Methods in Ecology and Evolution, 5, 9-15.

Palm, F. C., & Zellner, A. (1992). To combine or not to combine? Issues of combining forecasts. Journal of Forecasting, 11, 687-701.

Papeş, M., Sällström, M., Asplund, T. R., & Vander Zanden, M. J. (2011). Invasive species research to meet the needs of resource management and planning. Conservation Biology, 25, 867-872.

Penone, C., Davidson, A. D., Shoemaker, K. T., Di Marco, M., Rondinini, C., Brooks, T. M., ... & Costa, G. C. (2014). Imputation of missing data in life‐history trait datasets: which approach performs the best?. Methods in Ecology and Evolution, 5, 961-970.

Peterson, A. T. (2003). Predicting the geography of species’ invasions via ecological niche modeling. The quarterly review of biology, 78, 419-433.

Pimentel, D., Zuniga, R., & Morrison, D. (2005). Update on the environmental and economic costs associated with alien-invasive species in the United States. Ecological economics, 52, 273-288.

Pitt, J. P., Worner, S. P., & Suarez, A. V. (2009). Predicting Argentine ant spread over the heterogeneous landscape using a spatially explicit stochastic model. Ecological Applications, 19, 1176-1186.

Poyatos, R., Sus, O., Badiella, L., Mencuccini, M., & Martínez-Vilalta, J. (2018). Gap-filling a spatially explicit plant trait database: comparing imputation methods and different levels of environmental information. Biogeosciences, 15, 2601-2617.

Puth, L. M., & Post, D. M. (2005). Studying invasion: have we missed the boat? Ecology Letters, 8, 715-721.

Pyšek, P., Jarošík, V., Hulme, P. E., Pergl, J., Hejda, M., Schaffner, U., & Vilà, M. (2012). A global assessment of invasive plant impacts on resident species, communities and ecosystems: the interaction of impact measures, invading species' traits and environment. Global Change Biology, 18, 1725-1737.


Pyšek, P., Jarošík, V., Pergl, J., Randall, R., Chytrý, M., Kühn, I., ... & Sádlo, J. (2009). The global invasion success of Central European plants is related to distribution characteristics in their native range and species traits. Diversity and Distributions, 15, 891-903.

R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Rahel, F. J., & Olden, J. D. (2008). Assessing the effects of climate change on aquatic invasive species. Conservation Biology, 22, 521-533.

Rejmánek, M., & Pitcairn, M. J. (2002). When is eradication of exotic pest plants a realistic goal. Turning the tide: the eradication of invasive species, 249-253.

Ricciardi, A. (2007). Are modern biological invasions an unprecedented form of global change? Conservation Biology, 21, 329-336.

Ricciardi, A., & Cohen, J. (2007). The invasiveness of an introduced species does not predict its impact. Biological Invasions, 9, 309-315.

Ricciardi, A., Hoopes, M. F., Marchetti, M. P., & Lockwood, J. L. (2013). Progress toward understanding the ecological impacts of nonnative species. Ecological Monographs, 83, 263-282.

Ricciardi, A., & MacIsaac, H. J. (2008). In Retrospect: The book that began invasion ecology. Nature, 452, 34.

Ricciardi, A., & Rasmussen, J. B. (1998). Predicting the identity and impact of future biological invaders: a priority for aquatic resource management. Canadian Journal of Fisheries and Aquatic Sciences, 55, 1759-1765.

Ricciardi, A., & Rasmussen, J. B. (1999). Extinction rates of North American freshwater fauna. Conservation Biology, 13, 1220-1222.

Rixon, C. A., Duggan, I. C., Bergeron, N. M., Ricciardi, A., & MacIsaac, H. J. (2005). Invasion risks posed by the aquarium trade and live fish markets on the Laurentian Great Lakes. Biodiversity & Conservation, 14, 1365-1381.

Rouget, M., Richardson, D. M., Nel, J. L., & Van Wilgen, B. W. (2002). Commercially important trees as invasive aliens–towards spatially explicit risk assessment at a national scale. Biological Invasions, 4, 397-412.

Sala, O. E., Chapin, F. S., Armesto, J. J., Berlow, E., Bloomfield, J., Dirzo, R., ... & Leemans, R. (2000). Global biodiversity scenarios for the year 2100. Science, 287, 1770-1774.

Seebens, H., Blackburn, T. M., Dyer, E. E., Genovesi, P., Hulme, P. E., Jeschke, J. M., ... & Bacher, S. (2017). No saturation in the accumulation of alien species worldwide. Nature Communications, 8, 14435.

Simberloff, D. (2003). How much information on population biology is needed to manage introduced species? Conservation Biology, 17, 83-92.

Simberloff, D. (2009). The role of propagule pressure in biological invasions. Annual Review of Ecology, Evolution, and Systematics, 40, 81-102.

Simberloff, D., Martin, J. L., Genovesi, P., Maris, V., Wardle, D. A., Aronson, J., ... & Pyšek, P. (2013). Impacts of biological invasions: what's what and the way forward. Trends in Ecology & Evolution, 28, 58-66.

Singh, S. K., Ash, G. J., & Hodda, M. (2015). Keeping ‘one step ahead’ of invasive species: using an integrated framework to screen and target species for detailed biosecurity risk assessment. Biological Invasions, 17, 1069-1086.

Stekhoven, D. J., & Bühlmann, P. (2011). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28, 112-118.

Stewart‐Koster, B., Olden, J. D., & Johnson, P. T. (2015). Integrating landscape connectivity and habitat suitability to guide offensive and defensive invasive species management. Journal of Applied Ecology, 52, 366-378.

Stohlgren, T. J., & Schnase, J. L. (2006). Risk analysis for biological hazards: what we need to know about invasive species. Risk Analysis: An International Journal, 26, 163-173.

Taugourdeau, S., Villerd, J., Plantureux, S., Huguenin‐Elie, O., & Amiaud, B. (2014). Filling the gap in functional trait databases: use of ecological hypotheses to replace missing data. Ecology and Evolution, 4, 944-958.

Thuiller, W., Richardson, D. M., Pyšek, P., Midgley, G. F., Hughes, G. O., & Rouget, M. (2005). Niche‐based modelling as a tool for predicting the risk of alien plant invasions at a global scale. Global Change Biology, 11, 2234-2250.

Tilman, D. (2001). Functional diversity. Encyclopedia of biodiversity. Academic Press, San Diego, 109-120.

van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45.

van der Heijden, G. J., Donders, A. R. T., Stijnen, T., & Moons, K. G. (2006). Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. Journal of Clinical Epidemiology, 59, 1102-1109.

Van Kleunen, M., Weber, E., & Fischer, M. (2010). A meta‐analysis of trait differences between invasive and non‐invasive plant species. Ecology Letters, 13, 235-245.

Vander Zanden, M. J., Hansen, G. J., Higgins, S. N., & Kornis, M. S. (2010). A pound of prevention, plus a pound of cure: early detection and eradication of invasive species in the Laurentian Great Lakes. Journal of Great Lakes Research, 36, 199-205.

Vandewalle, M., De Bello, F., Berg, M. P., Bolger, T., Doledec, S., Dubs, F., ... & Da Silva, P. M. (2010). Functional traits as indicators of biodiversity response to land use changes across ecosystems and organisms. Biodiversity and Conservation, 19, 2921-2947.

Walther, G. R., Roques, A., Hulme, P. E., Sykes, M. T., Pyšek, P., Kühn, I., ... & Czucz, B. (2009). Alien species in a warmer world: risks and opportunities. Trends in Ecology & Evolution, 24, 686-693.

Westphal, M. I., Browne, M., MacKinnon, K., & Noble, I. (2008). The link between international trade and the global distribution of invasive alien species. Biological Invasions, 10, 391-398.

Williamson, M., & Fitter, A. (1996). The varying success of invaders. Ecology, 77, 1661-1666.

Winkler, R. L. (1989). Combining forecasts: A philosophical basis and some current issues. International Journal of Forecasting, 5, 605-609.

Wittenberg, R., & Cock, M. J. (Eds.). (2001). Invasive alien species: a toolkit of best prevention and management practices. CABI.

Zenni, R. D., & Nuñez, M. A. (2013). The elephant in the room: the role of failed invasions in understanding invasion biology. Oikos, 122, 801-815.


Chapter 1

The rich get richer: invasion risk across North America from the aquarium pathway under climate change.

Authors: Lidia Della Venezia, Jason Samson and Brian Leung

A version of this chapter has been published in the journal Diversity and Distributions 24(3), 285-296. It is reprinted here with permission from John Wiley & Sons, Inc.

1.1 Abstract

Aim: To evaluate how the establishment risk of freshwater fish species from the aquarium trade will change under a climate change scenario forecast for the year 2050.

Location: North America

Methods: In order to estimate changes in the magnitude of risk across geography and across different species in the aquarium pathway, we considered an integrated approach to modelling the probability of establishment, which simultaneously included proxies of propagule pressure, environmental variables, species traits and interactions between environment and traits. We then used the parameters of our model to predict how the risk of establishment will change under a scenario of climate change forecast for the year 2050.

Results: Our joint model performed better than submodels, suggesting that combining all components is worthwhile. The most predictive factors were precipitation, maximum temperature tolerance, maximum fish length and minimum temperature. Our joint model forecast a 40% increase in the average risk of establishment by 2050 in the United States. In contrast to our expectations, the absolute establishment risk associated with this pathway remained very low for the entire suite of species in the aquarium trade in northern regions, such as Quebec, Canada. Instead, Florida, which has one of the highest current risks of establishment, was also forecasted to have the greatest absolute risk increase.

Main conclusions: Our methodology for risk assessment allows invasive species management strategies to consider entire suites of species at a time, and to forecast establishment risk for each species and location. While the aquarium pathway is likely to become more important for the USA, the Quebec government should prioritize other pathways of introduction in its exotic invasive species strategy. Our approach can be extended to be applied to different sets of species pertaining to the same or different pathways.

1.2 Introduction

Over the past few decades, invasive species have received increasing attention from the scientific community due to their potential consequences, environmentally (e.g. biodiversity loss, Doherty et al., 2016; Mack et al., 2000; Sala et al., 2000) and economically (Pimentel et al., 2001). In new habitats, invasive species can act as competitors, predators or parasites of species belonging to the native community (Lockwood et al., 2013), and can cause huge economic losses by interfering with agriculture, livestock, and human health (Pimentel et al., 2005). Specifically, freshwater fauna appears to be very sensitive to invasions, being characterized by extinction rates five times higher than for terrestrial fauna in North America (Ricciardi & Rasmussen, 1999; Strayer & Dudgeon, 2010).

Many invasive species are introduced via trade, leading to the release and dispersal of potentially detrimental non-indigenous organisms (Hulme, 2009). Specifically, the aquarium trade has been identified as an important source of propagule pressure (number of individuals introduced), and has led to the invasion and spread of various aquatic organisms in the United States, whereas fewer have established in Canada (Duggan et al., 2006; Gertzen et al., 2008), potentially because of harsh climate conditions. The high propagule pressure from the aquarium trade can be explained by the popularity of the aquarium hobby in North America, with over 10% of households possessing some ornamental fish. Freshwater species represent 96% of the volume of fish imported (Chapman et al., 1997; Ramsey, 1985). The aquarium trade has thus been identified as a significant source of potentially invasive species both in the United States and Canada (Rixon et al., 2005) and it is important to analyze it explicitly (Gertzen et al., 2008).

While strides have been made predicting different phases of the invasion process, these have only recently been examined in conjunction with other drivers such as climate change (Hobbs & Mooney, 2005; Holzapfel & Vinebrooke, 2005; Muhlfeld et al., 2014). Climate change can affect invasive species and their impact at different stages of the invasion process (initial introduction, establishment, impact) and through different mechanisms (Hulme, 2016; Walther et al., 2009). For example, milder winters might mean a higher chance of establishment for species adapted to warm environments, such as sub-tropical species in temperate areas (Hellmann et al., 2008). Additionally, established non-indigenous species could expand their existing range given novel environmental conditions, potentially having an impact on a larger scale (e.g. Morrison et al., 2005). A recent study has shown the potential for several species to move from the Atlantic to the Pacific Ocean and vice versa under climate change, with potentially important consequences for entire communities (Wisz et al., 2015). For these reasons, climate change should be explicitly addressed to inform proper invasive species management (Mainka & Howard, 2010). This is especially true given that synergistic interactions between biological invaders and climate change have been observed and classified as an important threat to biodiversity (Brook et al., 2008). The influence of climate change on aquatic species invasiveness and impacts is difficult to predict but it will likely exacerbate the issue for freshwater aquarium species, which tend to be mainly tropical (Rahel & Olden, 2008). For example, a reduction in ice cover and an increase in oxygen conditions during the winter season, in addition to warmer water in summertime, could increase the chances of survival and reproduction for freshwater species (Rahel & Olden, 2008). Normally hostile environments for tropical fish like those in much of Canada, where harsh winters are generally a limiting factor for the survival of aquarium species, might become more suitable for establishment in a warming climate. As such, there is interest in assessing changing establishment risk levels for aquarium species in northern regions, making pathway level analysis of the aquarium trade under climate change of direct and immediate policy relevance.

Notably, environmental suitability is not the only determinant of invasive species establishment. Species functional traits and their native range, propagule pressure, reproductive behaviours, genetics and the time elapsed since a species was released have been shown to be important predictors of establishment and impact (Heger & Trepl, 2003; Van Kleunen et al., 2010; Williamson & Fitter, 1996). Including species traits (Syphard & Franklin, 2010) and propagule pressure in species distribution models (SDMs) (Leung & Mandrak, 2007) has proven effective. By combining these three components, we can model entire pathways (e.g. Bradie et al., 2013), while also considering both the geographical context and the effects of changing environmental conditions such as climate change. Additionally, focusing on pathways of invasion can be advantageous when single species models cannot be built due to absence or scarcity of data.

Here, we hypothesized that the establishment risk posed by aquarium fish species across North America would increase over time due to climate change. We also hypothesized that aquarium-mediated invasions will become an emerging threat in parts of Canada with climate change, even though these regions have not historically suffered from aquarium-mediated invasions. To estimate geographically-dependent, pathway-level risk of establishment, we used a combination of propagule pressure proxies, environment, and species traits (PET) of freshwater fish species from the aquarium trade pathway in the United States. We then projected the potential future establishment risk in the USA and in the Canadian province of Quebec under a climate change scenario forecasted for the year 2050, thus extrapolating over different geographical and temporal scenarios. We chose to extrapolate predictions to Quebec because our study originated from a project in collaboration with the Quebec Ministry of Forest, Wildlife and Parks (Ministère des Forêts, de la Faune et des Parcs (MFFP); http://mffp.gouv.qc.ca). The MFFP was interested in assessing the probability of new invasive aquatic species because the normally harsh winter conditions in Quebec prevent the establishment of aquarium species, but this barrier may no longer be effective in the future as climate change may open up opportunities for these species to establish, especially in the southernmost regions of the province.

1.3 Methods

1.3.1 Model formulation

The probability of establishment P(E) of a species from the aquarium pathway, as modelled by Bradie et al. (2013), was determined using the following equation:

$$P(E) = 1 - q^{N^{c}} \tag{1.1}$$

where q defines the probability of a single propagule not establishing, N represents a measure of propagule pressure, and c is a shape parameter that allows for the presence of density-dependent effects (i.e. Allee effects). In order to include the effect of species traits and environmental conditions on the cumulative probability of establishment, the parameter q has been modelled by Bradie & Leung (2015) as a function of predictors (V):

$$P(E_s) = 1 - q(V_s)^{N_s^{c}} \tag{1.2}$$

where V is a vector of species-specific predictors for which we have information, and the subscript s defines the species of interest. To consider geographically-dependent probabilities, we can adapt $q_{sl}$, a logistic function of the predictors:

$$q_{sl} = \frac{1}{1 + e^{-z_{sl}}} \tag{1.3}$$

where, for each species-location combination, the linear predictor $z_{sl}$ is characterized as follows:

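$$z_{sl} = b_0 + \sum_{w=1}^{W}\left(b_{w1}X_{ws} + b_{w2}X_{ws}^{2}\right) + \sum_{m=1}^{M}\left(b_{m1}E_{ml} + b_{m2}E_{ml}^{2}\right) + I_{sl} \tag{1.4}$$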

Each $X_{ws}$ defines a trait for species s, where W is the total number of traits considered, while each $E_{ml}$ is an environmental variable for location l, where M is the total number of environmental variables. $I_{sl}$ denotes the interaction terms for each species-location combination, as explained in the next paragraph. For both species traits and environmental variables, first- and second-order functions of each predictor were considered, to allow non-monotonic relationships showing optimal ranges. For traits, $b_{w1}$ and $b_{w2}$ represent the fitted parameter values for trait w, and they are common for all species, while for environmental variables, $b_{m1}$ and $b_{m2}$ are fitted parameter values for variable m that are common for all locations. Finally, following Brown et al. (2014), we added interaction terms between traits and environmental conditions (denoted by $I_{sl}$ in equation 1.4), to account for the "fourth-corner problem" (Legendre et al., 1997), wherein the effect of environmental variables is moderated by species traits (and vice versa).

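$$I_{sl} = \sum_{m=1}^{M}\sum_{w=1}^{W}\left(b_{mw1}X_{ws}E_{ml} + b_{mw2}X_{ws}^{2}E_{ml} + b_{mw3}X_{ws}E_{ml}^{2}\right) \tag{1.5}$$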

Above is a description of the interaction terms for each species-location combination. $b_{mw1}$, $b_{mw2}$ and $b_{mw3}$ are fitted parameters that describe the interactions between trait w and environmental variable m. The interaction terms where both environmental and trait variables were squared have been excluded.
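The way equations 1.1 to 1.5 fit together can be illustrated with a minimal Python sketch. It is not the code used for the analyses (which were run in R). The function names, the dictionary layout and the sign convention of the logistic function are assumptions made here for illustration only.

```python
import math

def linear_predictor(b0, traits, trait_coefs, env, env_coefs, interaction_coefs):
    """Linear predictor z_sl (equations 1.4 and 1.5) for one species-location pair.

    traits / env map variable names to values; trait_coefs / env_coefs map names to
    (first-order, second-order) coefficients; interaction_coefs maps
    (env_name, trait_name) to (b_mw1, b_mw2, b_mw3). All names are hypothetical.
    """
    z = b0
    for w, x in traits.items():
        b1, b2 = trait_coefs.get(w, (0.0, 0.0))
        z += b1 * x + b2 * x ** 2
    for m, e in env.items():
        b1, b2 = env_coefs.get(m, (0.0, 0.0))
        z += b1 * e + b2 * e ** 2
    for (m, w), (b1, b2, b3) in interaction_coefs.items():
        x, e = traits[w], env[m]
        # The X^2 * E^2 term is excluded, as stated in the text.
        z += b1 * x * e + b2 * x ** 2 * e + b3 * x * e ** 2
    return z

def prob_not_establishing(z):
    """Per-propagule probability q of not establishing (assumed standard logistic form)."""
    return 1.0 / (1.0 + math.exp(-z))

def prob_establishment(q, n, c):
    """Cumulative establishment probability P(E) = 1 - q^(N^c)."""
    return 1.0 - q ** (n ** c)
```

With c = 1 every propagule acts independently; other values of c change how quickly the cumulative probability responds to N, which is how the shape parameter accommodates density dependence.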

1.3.2 Data and variable choice

We used establishment records from the US Geological Survey Non-indigenous Aquatic Species database (US Geological Survey 2017), which included both species information and state-level geographical locations of historical establishments. Stocked species and strong outliers (Carassius auratus, i.e. goldfish) were removed from the analysis because their behaviour might differ from that of the other species belonging to the pathway. We also removed species that established before 1971, as historical data on fish imports are available from that year, allowing us to verify the consistency of propagule pressure estimates and species’ popularity over time (Bradie et al., 2013).

Following Bradie et al. (2013), we used Canadian import data for freshwater aquarium fishes obtained from Fisheries and Oceans Canada (B. Cudmore & N. Mandrak, unpublished data) as a proxy for propagule pressure for North America. Bradie et al. (2013) previously showed that Canadian import volumes could be accurately extrapolated to the US by scaling by population size; hence we also scaled the import data by the population size of each state to obtain our geographically separated proxy of propagule pressure. Since the popularity of imported aquarium species seems to remain fairly consistent over time, as analyzed in Bradie et al. (2013) and in accordance with Chapman et al. (1997), we followed similar logic to forecast changes in propagule pressure to 2050, using projected population size increases. Both current population size estimates and forecasts for 2050 were obtained from the United States Census Bureau (https://www.census.gov/en.html).
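The population scaling described above can be sketched as follows. The exact scaling formula is not given in this section, so the per-capita form below, the function name and the example figures are assumptions for illustration.

```python
def scale_imports(national_imports, source_population, target_populations):
    """Scale an import volume to each target region in proportion to population.

    Assumes a simple per-capita rule: imports per person in the source country
    multiplied by the population of each target region.
    """
    per_capita = national_imports / source_population
    return {region: per_capita * pop for region, pop in target_populations.items()}

# Hypothetical numbers, not data from the study.
proxy = scale_imports(1000.0, 100.0, {"state_a": 50.0, "state_b": 200.0})
```

Passing projected 2050 populations instead of current ones gives the forecast proxy under the same per-capita assumption.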

To extrapolate to the Canadian province of Quebec, we obtained current population estimates for each of the 17 administrative regions in Quebec, through the Institut de la statistique du Québec (http://www.stat.gouv.qc.ca/). The institute provides forecasts for each administrative region of Quebec until 2036, while only a single estimate for the entire population of Quebec for the year 2051 is available. We assumed that the proportional density remained the same across regions and derived regional population forecasts for our year of interest from the single estimate for 2051. Quebec population estimates were once again used to scale imports at the regional level.
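A minimal sketch of this downscaling step is given below. It assumes that each region keeps the share of the provincial population it holds in a reference set of regional figures. Which year supplies those reference figures is not specified here, and the region names and numbers are hypothetical.

```python
def downscale_total(total_forecast, regional_reference):
    """Split a province-wide forecast across regions, keeping reference proportions fixed."""
    reference_total = sum(regional_reference.values())
    return {
        region: total_forecast * pop / reference_total
        for region, pop in regional_reference.items()
    }

# Hypothetical figures: two regions holding 30% and 70% of the reference population.
regional_2051 = downscale_total(200.0, {"region_x": 30.0, "region_y": 70.0})
```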


Species traits were obtained from FishBase (Froese & Pauly, 2016) and included maximum and minimum temperature tolerance, northernmost latitude, trophic level, K (the rate at which asymptotic length is reached) and maximum total length. However, the majority of our aquarium species lacked data for at least one trait. The use of complete cases only (i.e. species with information for all traits), as well as the replacement of missing data with the average trait value across species, are among the most common methods to deal with missing information, but they can introduce bias in the estimation of the parameter values of the model (Greenland & Finkle, 1995). Thus, we decided to impute the missing data. The original dataset included additional traits that were removed from the analysis because they lacked information for more than 50% of the species or because they were highly correlated with the ones retained for the analysis (Bradie & Leung, 2015). We imputed missing values for each species using a 2-step procedure. First, we performed a linear regression for each trait using all the other available traits as predictors, as we argued that this would give a better estimate than simply using a mean across all species. Second, we adjusted our predictions for each species missing traits based on their expectations given their genus, as we would expect some phylogenetic similarity. To do so, we ran equivalent linear models in which we instead regressed the average trait values within genera. We then adjusted our predictions from the first regression by the residual value of the corresponding genus for each species, using it as an "offset". When not enough information was available across species in a genus, the same procedure was applied to the corresponding family. We finally removed K because not enough information was available for imputation. Although more sophisticated approaches exist for missing data imputation (see Horton & Kleinman, 2007), this method performed well based on a jackknife validation, without imposing any additional computational burden.
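The two-step logic can be illustrated with a simplified Python sketch. It uses a single predictor trait instead of all available traits, applies a zero offset when a genus has no observed values (the family-level fallback is omitted), and all names are hypothetical. It is one reading of the procedure, not the original implementation.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def impute_with_genus_offset(records, predictor, target):
    """records: list of dicts with 'genus', a predictor trait, and a target trait (None if missing)."""
    known = [r for r in records if r[target] is not None]
    # Step 1: species-level regression of the target trait on the predictor trait.
    a, b = fit_line([r[predictor] for r in known], [r[target] for r in known])
    # Step 2: the same regression on within-genus means; each genus residual is the offset.
    genera = {}
    for r in known:
        genera.setdefault(r["genus"], []).append(r)
    gx = {g: mean(r[predictor] for r in rs) for g, rs in genera.items()}
    gy = {g: mean(r[target] for r in rs) for g, rs in genera.items()}
    ga, gb = fit_line(list(gx.values()), list(gy.values()))
    offset = {g: gy[g] - (ga + gb * gx[g]) for g in genera}
    imputed = []
    for r in records:
        if r[target] is None:
            imputed.append(a + b * r[predictor] + offset.get(r["genus"], 0.0))
        else:
            imputed.append(r[target])
    return imputed
```

A jackknife check in the same spirit would hide one known value at a time, impute it, and compare the prediction with the observed value.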


The environmental predictors were based on the 19 bioclimatic (Bioclim) variables from www.worldclim.org (Hijmans et al., 2005), for which current data and future projections for the year 2050 are available. They were used to fit the model and to predict invasions under climate change, respectively. Species distribution models have often made use of these data (e.g. Bálint et al., 2011; Huang, 2014; Stanton et al., 2012), whose scale was relatively fine for each environmental variable (10-minute resolution, corresponding to cells of 18.6 x 18.6 km at the equator). However, given the resolution of our establishment data, we decided to generate state-wide estimates of each Bioclim variable, by taking the mean and variance of all the values associated with cells falling within each state boundary, hypothesizing that variance would be a coarse measure of variability across each region. We then reduced the number of environmental predictors by removing the most highly correlated variables to avoid multicollinearity.
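The state-level aggregation and the correlation screen can be sketched as below. The use of the population variance, the greedy filtering order and the 0.9 threshold are assumptions made for this illustration. The section does not state which variance estimator or correlation cut-off was used.

```python
from statistics import mean, pvariance

def summarise_by_state(cell_values):
    """cell_values: {state: [values of one Bioclim variable for the cells in that state]}."""
    return {state: (mean(v), pvariance(v)) for state, v in cell_values.items()}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def drop_correlated(columns, threshold=0.9):
    """Greedy screen: keep a variable only if |r| with every kept variable is below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

Because the screen is greedy, the retained set depends on the order in which variables are supplied.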

To obtain forecasts of environmental variables, we used projections from WorldClim. WorldClim offers several future scenarios, including the four different Representative Concentration Pathways (RCPs; Moss et al., 2008) for greenhouse gas concentration trajectories adopted by the IPCC in its fifth Assessment Report (AR5) in 2014 (http://www.ipcc.ch/report/ar5/). For each climate scenario, future predictions are provided from a series of General Circulation Models (GCMs; Phillips, 1956), which are numerical models that represent physical processes in the atmosphere, ocean, cryosphere and land surface, and simulate the response of the global climate to increasing greenhouse gas concentrations. Among the potential RCPs, the worst case scenario was chosen (i.e. RCP8.5), and among the several available GCMs, the GISS-E2-R one developed by NASA (Nazarenko et al., 2015) was selected, but our model can easily be applied to any other possible scenario. Our choice of the worst case scenario was driven by a preference for conservative risk estimates, given the large impact invasive species can cause (Lovell & Stone, 2005).

1.3.3 Model fitting

Our model was fit on the species establishments known in the USA, as in Bradie & Leung (2015). We used maximum likelihood estimation to find the best parameter values by maximizing the log-likelihood calculated as follows:

$$\log(L) = \sum_{l}\left[\sum_{i}\log\left(1 - q_{il}^{N_{il}^{c}}\right) + \sum_{u}\log\left(q_{ul}^{N_{ul}^{c}}\right)\right] \tag{1.6}$$

where i denotes the set of species that have successfully invaded a particular state, and u denotes the species that did not establish. This sum was iterated for each US state (l in the equation), as propagule pressure and establishment status change geographically.
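As an illustration of equation 1.6, the sketch below accumulates the log-likelihood over species-state records. The record layout and function name are assumptions, and the per-record values of q would come from the logistic function of the predictors.

```python
import math

def log_likelihood(records, c):
    """Log-likelihood of equation 1.6.

    records: iterable of (q, n, established) tuples, one per species-state
    combination, where q is the per-propagule probability of not establishing,
    n is the propagule pressure proxy and established is a bool.
    """
    total = 0.0
    for q, n, established in records:
        p_fail = q ** (n ** c)  # probability that no propagule establishes
        if established:
            total += math.log(1.0 - p_fail)
        else:
            total += math.log(p_fail)
    return total

# Hypothetical records: one established and one non-established combination.
example = log_likelihood([(0.5, 1.0, True), (0.5, 1.0, False)], c=1.0)
```

In practice this quantity would be handed to a numerical optimiser that searches over c and the coefficients of the linear predictor.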

Given the high number of variables and interaction terms possible for the model, and the fact that we encountered problems of complete separation given the nature of our data for logistic regression (Albert & Anderson, 1984), we used a forward selection approach to identify the most important variables (Johnson & Omland, 2004), using the Akaike Information Criterion (AIC; Akaike, 1974) as the selection criterion. At each step of the selection procedure, the variable that most improved the AIC was retained in the model, until the AIC improvement was less than 2 units (Anderson & Burnham, 2002). We always included the first order term when a quadratic term was retained in the model.
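The selection loop can be sketched generically as follows. The `fit` callback, which is assumed to return the maximised log-likelihood for a given list of terms, and the way a missing first-order term is added together with its quadratic term are illustrative choices rather than a description of the original code.

```python
def aic(log_lik, n_params):
    """Akaike Information Criterion: 2k - 2 log(L)."""
    return 2 * n_params - 2 * log_lik

def forward_select(candidates, fit, parents=None, min_gain=2.0):
    """Forward selection on AIC.

    candidates: list of term names; fit(terms) -> maximised log-likelihood;
    parents: {quadratic_term: first_order_term}. Parameters shared by all
    models (e.g. the intercept) are not counted, which leaves AIC differences unchanged.
    """
    parents = parents or {}
    selected = []
    best = aic(fit(selected), len(selected))
    while True:
        trials = []
        for term in candidates:
            if term in selected:
                continue
            extra = [term]
            parent = parents.get(term)
            if parent is not None and parent not in selected:
                extra.insert(0, parent)  # keep the first-order term with its quadratic
            model = selected + extra
            trials.append((aic(fit(model), len(model)), model))
        if not trials:
            break
        score, model = min(trials, key=lambda t: t[0])
        if best - score < min_gain:  # stop when AIC improves by less than 2 units
            break
        selected, best = model, score
    return selected, best
```

The 2-unit stopping rule means a term has to raise the log-likelihood by at least 2 units per added parameter before it is kept.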

Relatively simple models have proven effective in explaining patterns of invasions, for example in forest pests (Hudgins et al., 2017). To verify that the additional complexity of a joint model, with propagule pressure, environment, and species traits (henceforth termed PET model) was worthwhile for the aquarium pathway, we compared the PET model to six simpler alternatives, four of which respectively included species traits only, environmental conditions only, both traits and environment, and propagule pressure only, and two combining propagule pressure (PP) either with the environment or with species traits. We included models which do not incorporate propagule pressure for consistency, in spite of its widely recognized importance as a predictor of establishment success (e.g. Copp et al., 2007; Duggan et al., 2006; Lockwood et al., 2009). We then selected the best model using AIC (Akaike, 1974) as a measure of goodness of fit.

Finally, we predicted the probability of establishment of our aquarium fish species under current conditions and for 2050 in the United States and in the Canadian province of Quebec, and we looked at the relative change in establishment risk associated with the aquarium pathway, identifying which geographical areas were currently most at risk, which areas were projected to increase the most, and which species were the most likely to establish.

All the analyses and the modelling were performed using the R statistical programming environment (R Development Core Team, 2015).

1.4 Results

The PET model had the lowest AIC value, indicating a substantially improved fit. Specifically, the AIC of the PET model was 232.98, while the propagule pressure (PP) model had an AIC of 361.92, the traits-PP model had an AIC of 322.27, and the environment-PP model an AIC of 277.42. The models without propagule pressure performed consistently worse than their counterparts including it: the traits-environment model had an AIC of 240.31, the traits only model had an AIC of 340.92, and the environment only model an AIC of 286.53. Thus, our results suggest that combining traits and environmental information with propagule pressure yielded the best model.
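The comparison can be restated as AIC differences relative to the best model. The short Python snippet below only rearranges the values reported above, and the dictionary labels are shorthand introduced here.

```python
# AIC values as reported in the text for the seven candidate models.
aic_values = {
    "PET": 232.98,
    "traits + environment": 240.31,
    "environment + PP": 277.42,
    "environment": 286.53,
    "traits + PP": 322.27,
    "traits": 340.92,
    "PP": 361.92,
}

best_model = min(aic_values, key=aic_values.get)
# Difference in AIC between each model and the best one, rounded to two decimals.
delta_aic = {m: round(v - aic_values[best_model], 2) for m, v in aic_values.items()}
```

The closest competitor, the traits-environment model, sits about 7.3 AIC units above the PET model, while the PP-only model is almost 129 units above it.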


Of the 30 variables we tested (including second order terms and interactions; see Table A.1 in Appendix A), based on a threshold AIC improvement of 2 units, we found that only precipitation and minimum temperature were environmental predictors of establishment. For traits, maximum temperature tolerance and maximum total fish length were retained in the model, along with the interaction term between precipitation and maximum length (Table 1.1). Thus, while combining environment and traits with propagule pressure represented the best model, only a small number of predictors were needed to discriminate between established and unestablished species across regions.

Higher minimum temperatures during the coldest month of the year led to a higher number of establishments (Fig. 1.1a), as only the first order term was significant, thus indicating a monotonic relationship between minimum temperature and establishment risk (Table 1.1). In contrast, first and second order terms were included in the model for precipitation during the wettest month (Table 1.1), indicating that regions with both high and relatively low precipitation appeared to be at higher risk of establishment, consistent with the observation that aquarium species have established both in high precipitation states such as Hawaii and Florida but also in arid states like Utah, New Mexico and Nevada (Fig. 1.1b). A similar pattern was observed for species with high or relatively low tolerance to high temperatures (Table 1.1), reflecting the establishments that occurred in the southernmost regions of the USA, as well as in colder states like Nevada (Fig. 1.1c). Intermediate-sized species appeared to be at an advantage for establishment (Fig. 1.1d). In addition to the abovementioned main factors, the interaction between maximum length and precipitation was retained in the model (Table 1.1; Fig. 1.1e). We noted that the inclusion of this term appeared to be driven by one particularly large species established in Florida, the clown featherback (Chitala ornata). We thus repeated the analysis after removing it from our dataset, and found that the interaction term no longer improved the fit of the model. Thus, the interaction should be viewed with caution, as it is uncertain whether it reflects a real phenomenon (albeit with a rare combination of traits and environment), or whether the establishment of C. ornata reflects other idiosyncratic, unknown factors. For propagule pressure, only those species-location combinations above the median established (Fig. 1.1f).

Overall, the PET model estimated an increased risk of invasion from the aquarium pathway by the year 2050. In the USA specifically, the states predicted to have the highest absolute establishment risk were Hawaii, Florida and Nevada under current climate conditions, while the riskiest under the climate change forecast for 2050 were Florida, Hawaii and Louisiana (Fig. 1.2). The first two are indeed the states where the highest numbers of invasions in our dataset have occurred. However, the estimated risk increase in some regions, mainly Louisiana and Florida, was largely driven by more intense precipitation during the wettest month. Since the values of the environmental variables forecasted for 2050 for these regions fell outside the fitting range of our model, we conducted a further analysis in which those values were truncated to the maximum observed in the fitting range. Encouragingly, the regions forecasted to have the highest average probability of establishment remained the same, although the relative increase was reduced considerably for Florida. Even under this more conservative scenario, the PET model predicted Hawaii and Florida to be susceptible to the highest number of establishments by the year 2050 (Fig. 1.3). Nevada, Louisiana and Arizona, along with most of the southernmost states of the USA, also appeared to be increasingly at risk of invasions. In contrast to these southern states, the aquarium pathway did not seem to represent a major threat for northern regions (Figs. 1.2 & 1.3). Comparable trends were observed for the three alternative RCP scenarios, although with lower average risk estimates (see Fig. A.1 in Appendix A).
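The truncation step used in this more conservative analysis can be illustrated with a minimal Python sketch. The function name and the precipitation values below are hypothetical and serve only to show the operation; they are not the data or code used in the thesis.

```python
def truncate_to_fitted_maximum(forecast_values, fitted_values):
    """Cap forecast covariate values at the maximum observed when fitting the model."""
    upper = max(fitted_values)
    return [min(value, upper) for value in forecast_values]


# Hypothetical wettest-month precipitation values (mm), for illustration only.
observed_precip = [12.0, 85.0, 140.0, 230.0]
forecast_precip_2050 = [10.0, 90.0, 260.0, 310.0]

truncated_precip = truncate_to_fitted_maximum(forecast_precip_2050, observed_precip)
print(truncated_precip)  # [10.0, 90.0, 230.0, 230.0]
```

Values inside the fitting range are left unchanged, so only predictions that would otherwise rely on extrapolation are affected.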


Overall, the PET model forecasted approximately a 40% increase in the average invasion risk across the United States by 2050, and a two-fold increase in Quebec over the same time frame. Although establishment risk was projected to increase in North America, values remained extremely low in Quebec across species and administrative regions (Fig. 1.4). The riskiest regions were Montreal, Montérégie and Laval both presently and in 2050, but the average risk, even if doubled, continued to be extremely low. These three regions are in the southernmost part of the province of Quebec. The minimum temperature of the coldest month was a limiting factor for establishment, and although it is forecasted to increase over time, it will not likely be high enough by 2050 to make Quebec suitable to potential invasions from fish currently traded in the aquarium pathway.

The inclusion of species traits in our joint model allowed us to forecast the species-location combinations that will pose the highest establishment risk. Under current environmental conditions, the top ten species in the USA were expected to pose their highest risk in Florida or Hawaii, while all the riskiest species for 2050 were forecasted to have their highest probability of establishment in Florida (Table 1.2). For some of these species, propagule pressure was particularly high (e.g. the White Cloud Mountain minnow, Tanichthys albonubes), thus representing an important determinant of establishment risk, while for species imported at more moderate levels, traits such as size had a stronger influence on their likelihood of establishment (e.g., the porthole shovelnose catfish, Hemisorubim platyrhynchos). Another example of the importance of traits is Manuel's piranha (Serrasalmus manueli), a relatively large species whose high maximum temperature tolerance could make it more resistant to peak summer temperatures in Florida, potentially allowing the species to establish despite the low number of propagules imported. In Quebec, none of the aquarium fish species analyzed posed a high risk: the species predicted to pose the highest establishment risk in 2050 was the White Cloud Mountain minnow (Tanichthys albonubes, P = 0.0004) in Montreal, with a probability of establishment three orders of magnitude lower than that of the top-risk species forecasted for the USA.

1.5 Discussion

Climate change can have wide-ranging impacts on ecosystems, and there has been interest in potential interactions between climate change and other global drivers of ecosystem change, such as biological invasions (Hellmann et al., 2008). In this regard, there has been specific concern about regions at high latitudes across the globe, which would be expected to experience an increasing number of invasions as temperatures increase and species ranges shift northward (e.g. Cheung et al., 2009; Nehring, 1998; Sharma et al., 2007; Stachowicz et al., 2002). In contrast, our model suggested that for species in the aquarium trade pathway, it will be the southernmost states of the USA that experience the highest absolute risk, with the greatest projected increase in the number of freshwater fish invasions by the year 2050. This reflects both the already favourable habitats in southern climates and the novel climate conditions (i.e. combinations of temperature and precipitation) that are expected in tropical and subtropical regions (Williams et al., 2007). By comparison, the aquarium fish pathway was predicted to pose only a minimal absolute risk of establishment in northern regions, such as the province of Quebec. We note, however, that this assumes that the traits of traded fish remain similar to those used to fit the model, which seems reasonable given stable preferences for specific species of fish over the last few decades (Bradie et al., 2013). Thus, from a management perspective, it may be reasonable for the Quebec government to focus its efforts on pathways with higher establishment risk. To identify and quantify those higher risk pathways, analyses such as those presented here could be used.


The PET model offers an integrated approach to estimating risk in a quantitative way. By comparison, most existing risk assessment studies consider only one class of predictors (e.g. environmental variables in species distribution models), and they are often species-specific and conducted in a qualitative way (Leung et al., 2012). In contrast, here we combined species functional traits, environmental conditions, their interaction, and a proxy for propagule pressure to predict risk under current and forecasted climate scenarios. The inclusion of these three types of predictors improved the fit of our model compared to simpler alternatives, and also allowed us to estimate the risk posed by an entire suite of species belonging to a specific pathway, accounting for the environmental and geographical context and for projected future environments. In the USA, for example, the state forecasted to be at highest risk is Florida, which has already experienced a considerable number of establishments of non-indigenous fish species (US Geological Survey 2017).

Among the predicted riskiest species, the Jaguar Guapote (Parachromis managuensis) has been established in Florida for a long time (Shafland, 1996), while the White Cloud Mountain minnow (Tanichthys albonubes) has already been collected in Georgia, although its status is unknown (US Geological Survey 2017). The black sharkminnow (Labeo chrysophekadion) was reported in Florida at the beginning of the 1980s (Courtenay & Stauffer, 1990), although the establishment is believed to have failed (US Geological Survey 2017). The native range of the black sharkminnow is in South-East Asia, a region slightly warmer than Florida, so this species might become of growing concern with the ongoing changes in climate.

The functional traits that were most important in our model were maximum temperature tolerance and maximum size. Both traits were also relevant predictors of successful fish establishment in California (Marchetti et al., 2004), while maximum adult size was deemed important for fish in the Iberian Peninsula (Ribeiro et al., 2008). Physiological tolerances were found to be important for fish species introduced through a combination of different pathways, for example in the Great Lakes (Kolar & Lodge, 2002), while maximum length, as observed for other freshwater species (e.g. Vila-Gispert et al., 2005), seems to be favourable for establishment up to a certain point. Although previous studies found that species that grow to considerable sizes might become unwanted pets and be released more often (Gertzen et al., 2008), extremely large individuals might be disadvantaged in finding a suitable habitat for invasion, and might generally be imported less frequently. Finally, average to high propagule pressures were needed to pose a risk of invasion, consistent with arguments in the literature (Lockwood et al., 2005, 2009).

The only environmental conditions included in our model were minimum temperature and precipitation, and both were strong predictors of establishment. Temperature and precipitation have already been identified as being among the most important variables in species distribution models in the literature (Bradie & Leung, 2016; Thuiller, 2007). In our specific case, low temperatures likely represent a limiting factor for the aquarium pathway, which is mostly characterized by tropical or subtropical species. At the same time, some of the warmest regions of the United States are also the driest (e.g. Nevada and Utah; Bioclim, Hijmans et al., 2005). Even where precipitation is very low, aquarium fish establishments have occurred in these states, for example the Jaguar Guapote (Parachromis managuensis) in Nevada and the zebra danio (Danio rerio) in New Mexico (US Geological Survey 2017; see Table A.2 in Appendix A). Interestingly, such locations are also characterized by lower variability in their annual temperature range, potentially resulting in more suitable environments for those tropical species whose native ranges present relatively stable temperature conditions. Nonetheless, more generally, heavier precipitation is expected to increase the establishment risk associated with this pathway across the country. In fact, the relatively strong forecasted increase in establishment probabilities in some of the southernmost areas of the USA is linked to a considerable forecasted increase in precipitation by 2050, particularly in Louisiana, Alabama and Mississippi, which are predicted to experience the highest relative increase in establishment risk. Louisiana, in particular, was forecasted to be the third riskiest state based on average establishment probabilities, and it might thus need to be prioritized as an important region for potentially invasive aquarium fish.

Although our joint model performed better than simpler alternatives, it is important to recognize its limitations. First, given the nature of the propagule pressure and environmental data, the forecasted establishment probabilities provided here represent the expectation over the time frame considered to fit the model and to predict future risk (Bradie et al., 2013). Despite the fact that import data were a snapshot for the time period during which species establishments occurred (namely, from 1971) and the assumption of stable species' popularity in the aquarium trade (Bradie et al., 2013), here we were mostly interested in estimating how establishment risk will change in North America under novel climatic conditions.

Secondly, our model assumed that the distribution of functional traits of traded species will not change over time. In support of this assumption, the most popular species included in the analysis have remained similar for decades (Bradie et al., 2013), and their traits have likely been already selected as the best for the aquarium trade. Nonetheless, novel species with novel traits and tolerances could emerge on the aquarium fish market.

Finally, while SDMs in general, and integrated approaches like our joint model, can prove effective in predicting establishments, biotic interactions were not accounted for in the PET model. Although previous work showed that biotic factors are implicitly captured in SDMs (Leung & Bradie, 2017), such factors may become important, especially if climate change alters the composition of the native community or its interactions (e.g., Walther, 2000).

Despite the limitations encountered, we believe that the PET model presented here represents a useful framework to predict the establishment risk associated with the aquarium fish trade under current and projected future climate. It suggested that climate change will increase establishment risk most in southern states, rather than northern ones, for aquarium freshwater fish species. It offered insight into which traits and environmental conditions are important in determining the establishment of potential freshwater fish invaders in North America. In addition, it is generalizable: the same procedures can be applied to estimate the establishment risk associated with different suites of species from the same (e.g., invertebrates; Duggan, 2010) or alternative pathways (e.g., wood packaging materials; Haack, 2006).

Lastly, the PET approach provides a means to inform management measures that can be tailored for entire pathways or for single species. Specifically, the predictions from our model can be used to direct efforts to prevent invasions, which is usually preferable to control and eradication (Leung et al., 2002). For example, given that our framework provides establishment risk estimates at the pathway level for each location, it can be employed to geographically prioritize pathways of invasion, which was information requested by Quebec’s Ministry of Natural Resources in Canada. In Quebec, as well as in states like Alaska and Minnesota that do not appear at risk, priority should be given to alternative pathways of introduction, for which resources would be better spent in order to prevent new invaders. This is particularly important when the application of an unsuitable policy could cause economic damage to the aquarium market, which accounts for billions of dollars worldwide every year (Padilla & Williams, 2004).


In addition, the ability to make predictions for each species in a geographically explicit way would allow stakeholders to intervene on propagule pressure in a custom-made fashion, for example by banning the species which are more likely to establish based on their functional traits and on the characteristics of the receiving location, or by simply reducing the number of individuals imported to each state to maintain the risk of establishment below a certain threshold, for the same economic reasons mentioned above. Finally, another advantage for managers lies in the potential for the PET model to accommodate additional predictors of successful establishment (Bradie and Leung, 2015), including both species-specific characteristics and alternative environmental factors, depending on the availability of data and on the pathway considered, and the possibility to predict risk for any additional species before it is introduced.
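As an illustration of how an import limit could be derived from a risk threshold, the sketch below inverts a dose-response curve relating propagule pressure to establishment probability. It assumes, purely for illustration, a Weibull-type curve P = 1 - exp(-(alpha * N)^c); the model formulation in Section 1.3.1 is the authoritative one and may differ. The value of alpha is hypothetical, the function names are ours, and c is set to the shape parameter reported in Table 1.1.

```python
import math


def establishment_probability(n_imported, alpha, c):
    """Assumed dose-response curve: P = 1 - exp(-(alpha * N)^c)."""
    return 1.0 - math.exp(-((alpha * n_imported) ** c))


def max_imports_below_threshold(p_threshold, alpha, c):
    """Invert the assumed curve to find the import level N at which P equals the threshold."""
    if not 0.0 < p_threshold < 1.0:
        raise ValueError("p_threshold must lie strictly between 0 and 1")
    return (-math.log(1.0 - p_threshold)) ** (1.0 / c) / alpha


# Hypothetical per-propagule suitability for one species-location combination.
alpha = 1e-6
# Shape parameter as reported in Table 1.1.
c = 0.4080

n_limit = max_imports_below_threshold(0.01, alpha, c)
print(n_limit, establishment_probability(n_limit, alpha, c))
```

Because the assumed curve increases monotonically with the number of individuals imported, any import level below the returned limit keeps the predicted establishment probability under the chosen threshold.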

1.6 Acknowledgements

LDV would like to thank J. Bradie, E. Hudgins, V. Reed, A. Sardain, D. Nguyen and N. Richards for insightful discussions. The authors thank the Wildlife and Habitat direction of the MFFP for valuable insights on invasive exotic species strategy in Quebec. This research was supported by the Fonds verts of the Quebec government under their climate change adaptation plan and by an NSERC Discovery grant to BL.


References

Akaike, H. (1974). A new look at the statistical model identification. IEEE transactions on automatic control, 19, 716-723.

Albert, A., & Anderson, J. A. (1984). On the existence of maximum likelihood estimates in logistic regression models. Biometrika, 71, 1-10.

Anderson, D. R., & Burnham, K. P. (2002). Avoiding pitfalls when using information-theoretic methods. The Journal of Wildlife Management, 912-918.

Bálint, M., Domisch, S., Engelhardt, C. H. M., Haase, P., Lehrian, S., Sauer, J., Theissinger, K., Pauls, S.U. & Nowak, C. (2011). Cryptic biodiversity loss linked to global climate change. Nature Climate Change, 1, 313-318.

Bradie, J., & Leung, B. (2015). Pathway-level models to predict non-indigenous species establishment using propagule pressure, environmental tolerance and trait data. Journal of Applied Ecology, 52, 100-109.

Bradie, J., & Leung, B. (2016). A quantitative synthesis of the importance of variables used in MaxEnt species distribution models. Journal of Biogeography. Advance online publication. doi: 10.1111/jbi.12894

Bradie, J., Chivers, C., & Leung, B. (2013). Importing risk: quantifying the propagule pressure–establishment relationship at the pathway level. Diversity and Distributions, 19, 1020-1030.

Brook, B. W., Sodhi, N. S., & Bradshaw, C. J. (2008). Synergies among extinction drivers under global change. Trends in ecology & evolution, 23, 453-460.

Brown, A. M., Warton, D. I., Andrew, N. R., Binns, M., Cassis, G., & Gibb, H. (2014). The fourth‐corner solution–using predictive models to understand how species traits interact with the environment. Methods in Ecology and Evolution, 5, 344-352.

Chapman, F. A., Fitz‐Coy, S. A., Thunberg, E. M., & Adams, C. M. (1997). United States of America trade in ornamental fish. Journal of the World Aquaculture Society, 28, 1-10.

Cheung, W. W., Lam, V. W., Sarmiento, J. L., Kearney, K., Watson, R., & Pauly, D. (2009). Projecting global marine biodiversity impacts under climate change scenarios. Fish and fisheries, 10, 235-251.

Copp, G. H., Templeton, M., & Gozlan, R. E. (2007). Propagule pressure and the invasion risks of non‐native freshwater fishes: a case study in England. Journal of Fish Biology, 71, 148-159.

Courtenay, W. R., & Stauffer, J. R. (1990). The introduced fish problem and the aquarium fish industry. Journal of the World Aquaculture Society, 21, 145-159.


Doherty, T. S., Glen, A. S., Nimmo, D. G., Ritchie, E. G., & Dickman, C. R. (2016). Invasive predators and global biodiversity loss. Proceedings of the National Academy of Sciences, 113, 11261-11265.

Duggan, I. C. (2010). The freshwater aquarium trade as a vector for incidental invertebrate fauna. Biological invasions, 12, 3757-3770.

Duggan, I. C., Rixon, C. A., & MacIsaac, H. J. (2006). Popularity and propagule pressure: determinants of introduction and establishment of aquarium fish. Biological invasions, 8, 377-382.

Froese, R., & Pauly, D. (Eds.) (2016). FishBase. World Wide Web electronic publication. URL www.fishbase.org

Gertzen, E., Familiar, O., & Leung, B. (2008). Quantifying invasion pathways: fish introductions from the aquarium trade. Canadian Journal of Fisheries and Aquatic Sciences, 65, 1265-1273.

Greenland, S., & Finkle, W. D. (1995). A critical look at methods for handling missing covariates in epidemiologic regression analyses. American journal of epidemiology, 142, 1255-1264.

Haack, R. A. (2006). Exotic bark- and wood-boring Coleoptera in the United States: recent establishments and interceptions. Canadian Journal of Forest Research, 36, 269-288.

Heger, T., & Trepl, L. (2003). Predicting biological invasions. Biological Invasions, 5, 313-321.

Hellmann, J. J., Byers, J. E., Bierwagen, B. G., & Dukes, J. S. (2008). Five potential consequences of climate change for invasive species. Conservation Biology, 22, 534-543.

Hijmans, R. J., Cameron, S. E., Parra, J. L., Jones, P. G., & Jarvis, A. (2005). Very high resolution interpolated climate surfaces for global land areas. International journal of climatology, 25, 1965-1978. URL http://www.worldclim.org/

Hobbs, R. J., & Mooney, H. A. (2005). Invasive species in a changing world: the interactions between global change and invasives. SCOPE (Scientific Committee on Problems of the Environment, International Council of Scientific Unions), 63, 310.

Holzapfel, A. M., & Vinebrooke, R. D. (2005). Environmental warming increases invasion potential of alpine lake communities by imported species. Global Change Biology, 11, 2009-2015.

Horton, N. J., & Kleinman, K. P. (2007). Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. The American Statistician, 61, 79-90.


Huang, J. P. (2014). Modeling the effects of anthropogenic exploitation and climate change on an endemic stag beetle, Lucanus miwai (Lucanidae), of Taiwan. Journal of Asia-Pacific Entomology, 17, 423-429.

Hudgins, E. J., Liebhold, A. M., & Leung, B. (2017). Predicting the spread of all invasive forest pests in the United States. Ecology Letters, 20, 426-435.

Hulme, P. E. (2009). Trade, transport and trouble: managing invasive species pathways in an era of globalization. Journal of Applied Ecology, 46, 10-18.

Hulme, P. E. (2016). Climate change and biological invasions: evidence, expectations, and response options. Biological Reviews. Advance online publication. doi: 10.1111/brv.12282

Institut de la statistique du Québec (2014). Perspectives démographiques du Québec et des regions, 2011-2061. URL http://www.stat.gouv.qc.ca/

IPCC Fifth Assessment Report. URL http://www.ipcc.ch/report/ar5/

Johnson, J. B., & Omland, K. S. (2004). Model selection in ecology and evolution. Trends in ecology & evolution, 19, 101-108.

Kolar, C. S., & Lodge, D. M. (2002). Ecological predictions and risk assessment for alien fishes in North America. Science, 298, 1233-1236.

Legendre, P., Galzin, R., & Harmelin-Vivien, M. L. (1997). Relating behavior to habitat: solutions to the fourth-corner problem. Ecology, 78, 547-562.

Leung, B., & Bradie, J. (2017). Estimating non‐indigenous species establishment and their impact on biodiversity, using the Relative Suitability Richness model. Journal of Applied Ecology. Advance online publication. doi:10.1111/1365-2664.12862

Leung, B., Lodge, D. M., Finnoff, D., Shogren, J. F., Lewis, M. A., & Lamberti, G. (2002). An ounce of prevention or a pound of cure: bioeconomic risk analysis of invasive species. Proceedings of the Royal Society of London B: Biological Sciences, 269, 2407-2413.

Leung, B., & Mandrak, N. E. (2007). The risk of establishment of aquatic invasive species: joining invasibility and propagule pressure. Proceedings of the Royal Society of London B: Biological Sciences, 274, 2603-2609.

Leung, B., Roura‐Pascual, N., Bacher, S., Heikkilä, J., Brotons, L., Burgman, M. A., ... & Sol, D. (2012). TEASIng apart alien species risk assessments: a framework for best practices. Ecology Letters, 15, 1475-1493.

Lockwood, J. L., Cassey, P., & Blackburn, T. (2005). The role of propagule pressure in explaining species invasions. Trends in Ecology & Evolution, 20, 223-228.


Lockwood, J. L., Cassey, P., & Blackburn, T. M. (2009). The more you introduce the more you get: the role of colonization pressure and propagule pressure in invasion ecology. Diversity and Distributions, 15, 904-910.

Lockwood, J. L., Hoopes, M. F., & Marchetti, M. P. (2013). Invasion ecology. John Wiley & Sons.

Lovell, S. J., & Stone, S. (2005). The Economic Impacts of Aquatic Invasive Species: A Review of the Literature (No. 200502). National Center for Environmental Economics, US Environmental Protection Agency.

Mack, R. N., Simberloff, D., Mark Lonsdale, W., Evans, H., Clout, M., & Bazzaz, F. A. (2000). Biotic invasions: causes, epidemiology, global consequences, and control. Ecological applications, 10, 689-710.

Mainka, S. A., & Howard, G. W. (2010). Climate change and invasive species: double jeopardy. Integrative Zoology, 5, 102-111.

Marchetti, M. P., Moyle, P. B., & Levine, R. (2004). Alien fishes in California watersheds: characteristics of successful and failed invaders. Ecological Applications, 14, 587-596.

Ministère des Forêts, de la Faune et des Parcs (MFFP). URL http://mffp.gouv.qc.ca/

Morrison, L. W., Korzukhin, M. D., & Porter, S. D. (2005). Predicted range expansion of the invasive fire ant, Solenopsis invicta, in the eastern United States based on the VEMAP global warming scenario. Diversity and Distributions, 11, 199-204.

Moss, R. H., Babiker, M., Brinkman, S., Calvo, E., Carter, T., Edmonds, J. A., Elgizouli, I., Emori, S., Erda, L., Hibbard, K. & Jones, R. (2008). Towards new scenarios for analysis of emissions, climate change, impacts, and response strategies (No. PNNL-SA-63186). Pacific Northwest National Laboratory (PNNL), Richland, WA (US).

Muhlfeld, C. C., Kovach, R. P., Jones, L. A., Al-Chokhachy, R., Boyer, M. C., Leary, R. F., ... & Allendorf, F. W. (2014). Invasive hybridization in a threatened species is accelerated by climate change. Nature Climate Change, 4, 620.

Nazarenko, L., Schmidt, G. A., Miller, R. L., Tausnev, N., Kelley, M., Ruedy, R., Russell, G.L., Aleinov, I., Bauer, M., Bauer, S. & Bleck, R. (2015). Future climate change under RCP emission scenarios with GISS ModelE2. Journal of Advances in Modeling Earth Systems, 7, 244-267.

Nehring, S. (1998). Establishment of thermophilic phytoplankton species in the North Sea: biological indicators of climatic changes? ICES Journal of Marine Science: Journal du Conseil, 55, 818-823.


Padilla, D. K., & Williams, S. L. (2004). Beyond ballast water: aquarium and ornamental trades as sources of invasive species in aquatic ecosystems. Frontiers in Ecology and the Environment, 2, 131-138.

Phillips, N. A. (1956). The general circulation of the atmosphere: A numerical experiment. Quarterly Journal of the Royal Meteorological Society, 82, 535-539.

Pimentel, D., McNair, S., Janecka, J., Wightman, J., Simmonds, C., O’connell, C., Wong, E., Russel, L., Zern, J., Aquino, T. & Tsomondo, T. (2001). Economic and environmental threats of alien plant, animal, and microbe invasions. Agriculture, Ecosystems & Environment, 84, 1-20.

Pimentel, D., Zuniga, R., & Morrison, D. (2005). Update on the environmental and economic costs associated with alien-invasive species in the United States. Ecological economics, 52, 273-288.

R Development Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Rahel, F. J., & Olden, J. D. (2008). Assessing the effects of climate change on aquatic invasive species. Conservation Biology, 22, 521–533.

Ramsey, J. S. (1985). Sampling aquarium fishes imported by the United States. Journal of the Alabama Academy of Science, 56, 220-245.

Ribeiro, F., Elvira, B., Collares-Pereira, M. J., & Moyle, P. B. (2008). Life-history traits of non-native fishes in Iberian watersheds across several invasion stages: a first approach. Biological Invasions, 10, 89-102.

Ricciardi, A., & Rasmussen, J. B. (1999). Extinction rates of North American freshwater fauna. Conservation Biology, 13, 1220-1222.

Rixon, C. A., Duggan, I. C., Bergeron, N. M., Ricciardi, A., & Macisaac, H. J. (2005). Invasion risks posed by the aquarium trade and live fish markets on the Laurentian Great Lakes. Biodiversity & Conservation, 14, 1365-1381.

Sala, O. E., & 18 others (2000). Biodiversity scenarios for the year 2100. Science, 287, 1770-1774.

Shafland, P. L. (1996). Exotic fishes of Florida—1994. Reviews in Fisheries Science, 4, 101-122.

Sharma, S., Jackson, D. A., Minns, C. K., & Shuter, B. J. (2007). Will northern fish populations be in hot water because of climate change?. Global Change Biology, 13, 2052-2064.


Stachowicz, J. J., Terwin, J. R., Whitlatch, R. B., & Osman, R. W. (2002). Linking climate change and biological invasions: ocean warming facilitates nonindigenous species invasions. Proceedings of the National Academy of Sciences, 99, 15497-15500.

Stanton, J. C., Pearson, R. G., Horning, N., Ersts, P., & Reşit Akçakaya, H. (2012). Combining static and dynamic variables in species distribution models under climate change. Methods in Ecology and Evolution, 3, 349-357.

Strayer, D. L., & Dudgeon, D. (2010). Freshwater biodiversity conservation: recent progress and future challenges. Journal of the North American Benthological Society, 29, 344-358.

Syphard, A. D., & Franklin, J. (2010). Species traits affect the performance of species distribution models for plants in southern California. Journal of Vegetation Science, 21, 177-189.

Thuiller, W. (2007). Biodiversity: climate change and the ecologist. Nature, 448, 550-552.

U.S. Geological Survey (2017). Nonindigenous Aquatic Species Database. Available at: http://nas.er.usgs.gov Gainesville, FL. (accessed 3 April 2017).

United States Census Bureau. URL https://www.census.gov/en.html

Van Kleunen, M., Weber, E., & Fischer, M. (2010). A meta‐analysis of trait differences between invasive and non‐invasive plant species. Ecology letters, 13, 235-245.

Vila-Gispert, A., Alcaraz, C., & García-Berthou, E. (2005). Life-history traits of invasive fish in small Mediterranean streams. Biological Invasions, 7, 107.

Walther, G. R. (2000). Climatic forcing on the dispersal of exotic species. Phytocoenologia, 30, 409-430.

Walther, G. R., Roques, A., Hulme, P. E., Sykes, M. T., Pyšek, P., Kühn, I., Zobel, M., Botta-Dukát, Z., Bugmann, H., & Czucz, B. (2009). Alien species in a warmer world: risks and opportunities. Trends in ecology & evolution, 24, 686-693.

Williams, J. W., Jackson, S. T., & Kutzbach, J. E. (2007). Projected distributions of novel and disappearing climates by 2100 AD. Proceedings of the National Academy of Sciences, 104, 5738-5742.

Williamson, M. H., & Fitter, A. (1996). The characters of successful invaders. Biological conservation, 78, 163-170.


Table 1.1. Results of the forward selection for the joint model (PET) including environmental variables (type BIO), species traits (type TR) and environment-traits interactions (type INT). x̅ indicates the mean, s² represents the variance, 1st & 2nd denote first and second order terms. The terms included in the final PET model and the corresponding standardized parameter values are indicated with an asterisk, while AIC improvements that were bigger than 2 units in the model selection are denoted by daggers. The last column shows the order of inclusion in the model.
________________________________________________________________________________________________________________________
Variable                                                          Type   β1st       β2nd       AIC 1st     AIC 2nd    Order
________________________________________________________________________________________________________________________
c (shape parameter)*                                              NA     0.4080*    NA         NA          NA         NA
Intercept*                                                        NA     13.1335*   NA         361.92†     NA         1
Precipitation wettest month (x̅)*                                  BIO    0.7359*    -0.7797*   319.70      288.40†    2
Maximum temperature tolerance*                                    TR     0.3084*    -0.5126*   277.63      262.39†    3
Maximum length*                                                   TR     -0.6577*   7.4772*    262.97      246.69†    4
Minimum temperature coldest month (x̅)*                            BIO    -1.7598*   -0.0644    235.10†     237.07     5
Max. length (1st) & precip. wettest month (x̅; 2nd)*               INT    -0.6706*   NA         232.98†     NA         6
Minimum temperature tolerance                                     TR     -0.0347    0.3410     237.08      238.61     NA
Northernmost latitude                                             TR     -0.1952    -0.1761    236.30      237.53     NA
Trophic level                                                     TR     0.2325     0.0138     235.57      237.56     NA
Mean temperature warmest quarter (x̅)                              BIO    -1.9045    -0.1869    237.09      239.05     NA
Mean diurnal range (s²)                                           BIO    -0.4008    -0.4110    235.62      237.13     NA
Minimum temperature coldest month (s²)                            BIO    -0.2289    30.3139    237.05      233.26     NA
Max. temp. tolerance (1st) & min. temp. coldest month (x̅; 1st)    INT    0.1006     NA         235.85      NA         NA
Maximum length (1st) & min. temp. coldest month (x̅; 1st)          INT    0.0968     NA         236.93      NA         NA
Max. temp. tolerance (2nd) & min. temp. coldest month (x̅; 1st)    INT    1.2049     NA         13663.79    NA         NA
Max. length (2nd) & min. temp. coldest month (x̅; 1st)             INT    0.2950     NA         237.02      NA         NA
Max. temp. tolerance (1st) & precip. wettest month (x̅; 1st)       INT    0.0606     NA         236.07      NA         NA
Max. length (1st) & precip. wettest month (x̅; 1st)                INT    -0.0426    NA         237.04      NA         NA
Max. temp. tolerance (2nd) & precip. wettest month (x̅; 1st)       INT    0.0424     NA         234.94      NA         NA
Max. length (2nd) & precip. wettest month (x̅; 1st)                INT    -0.4240    NA         236.91      NA         NA
Max. temp. tolerance (1st) & precip. wettest month (x̅; 2nd)       INT    -0.1480    NA         234.45      NA         NA
________________________________________________________________________________________________________________________
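The forward selection summarized in Table 1.1 can be illustrated with a generic Python sketch. This is not the code used for the thesis: the function names, the candidate terms and the toy AIC values are hypothetical, each candidate is treated as a single term (whereas Table 1.1 reports first and second order terms side by side), and the use of a 2-unit AIC improvement as the stopping rule is an assumption based on the table caption.

```python
def forward_select(candidates, aic_of, min_improvement=2.0):
    """Greedy forward selection on AIC.

    At each step, add the candidate term that yields the lowest AIC, and
    stop when the best improvement does not exceed min_improvement.
    """
    selected = []
    remaining = list(candidates)
    current_aic = aic_of(selected)
    while remaining:
        scored = [(aic_of(selected + [term]), term) for term in remaining]
        best_aic, best_term = min(scored)
        if current_aic - best_aic <= min_improvement:
            break
        selected.append(best_term)
        remaining.remove(best_term)
        current_aic = best_aic
    return selected, current_aic


# Toy scoring function standing in for refitting the model (hypothetical values).
toy_gains = {"precipitation": 30.0, "max_temp_tolerance": 15.0, "trophic_level": 0.5}


def toy_aic(terms):
    """Pretend AIC: a baseline of 100 minus a fixed gain for each included term."""
    return 100.0 - sum(toy_gains[term] for term in terms)


terms, final_aic = forward_select(list(toy_gains), toy_aic)
print(terms, final_aic)  # ['precipitation', 'max_temp_tolerance'] 55.0
```

In a real application, `aic_of` would refit the model with the given set of terms and return its AIC, so the order in which terms enter corresponds to the last column of the table.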


Table 1.2. Top 10 species with the highest likelihood of establishment in the United States currently and in 2050, as predicted by our joint model, along with the state where they pose the highest risk, their propagule pressure (PP), and the values of the traits retained in the PET model as important determinants of establishment, i.e. maximum temperature tolerance (Max T., °C) and maximum length (Max L., cm).
_______________________________________________________________________________________
USA current                         State at highest risk    PP        Max T.    Max L.
_______________________________________________________________________________________
1. Parachromis managuensis          Florida                  499       36        55
2. Oxyeleotris marmorata            Florida                  18825     24        79
3. Archocentrus multispinosus       Hawaii                   12        36        11
4. Serrasalmus manueli              Florida                  4         35        44
5. Balantiocheilos melanopterus     Hawaii                   5484      28        43
6. Tanichthys albonubes             Hawaii                   30174     22        4
7. Gyrinocheilus aymonieri          Hawaii                   8733      28        34
8. Labeo chrysophekadion            Florida                  10520     27        90
9. Hemisorubim platyrhynchos        Florida                  16        22        64
10. Mogurnda adspersa               Hawaii                   8         20        14
_______________________________________________________________________________________
USA 2050                            State at highest risk    PP        Max T.    Max L.
_______________________________________________________________________________________
1. Parachromis managuensis          Florida                  592       36        55
2. Oxyeleotris marmorata            Florida                  22356     24        79
3. Serrasalmus manueli              Florida                  4         35        44
4. Balantiocheilos melanopterus     Florida                  93958     28        43
5. Archocentrus multispinosus       Florida                  190       36        11
6. Labeo chrysophekadion            Florida                  12439     27        90
7. Gyrinocheilus aymonieri          Florida                  149641    28        34
8. Tanichthys albonubes             Florida                  517037    22        4
9. Hemisorubim platyrhynchos        Florida                  19        22        64
10. Pterygoplichthys gibbiceps      Florida                  10097     27        50
_______________________________________________________________________________________

Figure 1.1. Distributions of environmental variables (a, b), traits (c, d), interaction term (e) and propagule pressure (f) as included in the model. The plots for the environmental variables and the interaction term also depict the curves representing the distributions of values under current conditions and as forecasted for 2050 in the USA (solid and dashed lines) and in Quebec (dotted and dash-dotted lines). The black dots show the establishments that occurred in the USA (see Table A.2 in Appendix A), and their size is proportional to the number of corresponding species, except for plots (e) and (f), where each dot represents a species-location combination. The interaction term is standardized for simplicity of representation, while propagule pressure values are on a log scale.

Figure 1.2. Average establishment risk across species by state (USA) or administrative region (Quebec) as predicted by the PET model under current (a) and future (b) climatic conditions. Darker shades indicate a higher average risk of establishment. Very low probabilities are displayed using scientific notation, e.g. 1E-5 corresponds to one multiplied by 0.00001.

Figure 1.3. Expected numbers of establishments for the US states at highest risk, as predicted by the PET model. Light grey areas indicate the number of species expected to be established under current conditions, while dark grey denotes the additional species forecasted to establish by the year 2050. Only the states with higher expected numbers of establishments are shown, while the remaining states are pooled in a single bar (Rem.).

Figure 1.4. Distribution of the estimated establishment risk in the USA and Quebec. The black dots represent the species that have established in the USA. The solid and the dashed lines correspond to the distributions of establishment probabilities predicted for the USA presently and for 2050 respectively, while the dotted and the dash-dotted lines represent those predicted for Quebec for the same years. The probability of establishment values are reported on a log scale.

Connecting statement

In the previous chapter, I incorporated geographically explicit environmental information in a pathway-level risk assessment framework, which allowed me to predict the riskiest species in the aquarium trade pathway in the USA, both under current and future climatic conditions.

Identifying the species that are most likely to establish, and in which locations specifically, is important from a management perspective, since it allows the prioritization of instances where prevention is key. Despite prevention being essential to the management of invasions, its effectiveness is not always guaranteed, and measures for rapid response, in particular eradication, should be taken when needed. While in Chapter 1 I looked at established species versus those that were never detected in the wild, in the next chapter I compare them with species that establish only temporarily, but later become extinct without anthropic intervention.

By modelling the two sub-stages of casual and persistent establishment separately, I obtain "rules of thumb" that can be used to rapidly assess detections and to separate species that will likely persist from those that will go extinct regardless of interventions, thus informing better prioritization of eradications. Further, I identify the factors that allow a species to successfully overcome each sub-stage, and better characterize the role of environment, traits and propagule pressure at each phase.

Chapter 2

Guiding rapid response to non-indigenous aquarium fish: identifying risk factors for persistent versus casual establishment

Authors: Lidia Della Venezia and Brian Leung

A version of this chapter is under review at Biological Invasions.

2.1 Abstract

1. While prevention is ideal, it is not always achievable, and rapid response strategies are necessary to effectively manage potentially harmful non-indigenous species. After new species are detected, many will become extirpated without intervention; a few will persist and potentially cause harm.

2. To help prioritize limited resources, we employed a multispecies, geographically explicit approach, focusing on the establishment of non-indigenous aquarium fish in the USA. We modeled casual (i.e. temporary) establishment and persistence separately, to identify which species should be prioritized after detection. Further, to facilitate the usability of quantitative models by policy-makers, we converted our results into simple graphical response curves and "rules of thumb", wherein each factor's contribution can be viewed as a multiplier (e.g., increasing trait X by Y% increases the odds of establishment by Z%). Finally, from a fundamental perspective, analyzing casual and persistent establishment separately improved our understanding of the earlier stages of invasions.

3. Maximum temperature tolerance and maximum fish length were the most important predictors of both sub-stages, and while precipitation affected casual establishment, minimum temperature variability had the strongest effect on persistence. We identified Herichthys carpintis and Pygocentrus nattereri as ranking highest for rapid response, if detected in California, New Mexico and Texas. These states, along with Florida and Hawaii, should take precedence in management funding, being those that currently host more persistent species and where more new establishments are expected.

4. As expected, we found that the important factors differed considerably between sub-stages, with species traits and propagule pressure being most relevant for casual establishment, and the environment being more predictive of persistence. Notably, propagule pressure had no effect on persistence, suggesting that it would not help target eradication efforts for aquarium fish.

Synthesis and applications. Our model allows comparisons of persistence for >1000 species across locations to target rapid responses after detection, and can provide guidance for species not currently traded. We identify five factors to help predict persistence. Our analysis re-evaluates "risky" species in terms of persistence, suggesting that many species which were flagged in the literature actually pose low risk. Conversely, we identify species that, if detected, warrant rapid response.

2.2 Introduction

Invasive species currently represent one of the biggest threats to the environment, and can cause enormous economic losses (e.g., Ehrenfeld, 2010; Olson, 2006; Mack et al., 2000; Pyšek et al., 2012). Hence, in recent years, researchers have investigated ways to prevent and reduce the impact that such species might cause (Buckley, 2008; Simberloff, 2003). Specifically, prevention and rapid response have been widely recognized as the most successful management approaches to hamper non-indigenous species in a cost-effective way (Alvarez & Solis, 2019; Hobbs & Humphries, 1995). Prevention in particular is usually preferable and represents a priority among management policies (Finnoff et al., 2007; Leung et al., 2002; Lodge et al., 2006). However, it does not necessarily guarantee success (Vander Zanden et al., 2010), and rapid response remains a critical strategy for management (Wittenberg & Cock, 2001), especially when the invader's density is low and measures like eradication are more feasible and cost-effective (Simberloff et al., 2013; Westbrooks, 2004). The need for well-targeted intervention has prompted discussion on how resources should be spent (e.g., Myers et al., 2000; Leung et al., 2005; Vander Zanden et al., 2010). Prioritizing instances where management would be most needed is essential to direct limited resources (Lohr et al., 2017; Papeş et al., 2011). In particular, promptly identifying species and locations of highest concern would allow stakeholders to respond to new introductions and detections more quickly and effectively (e.g., Vander Zanden & Olden, 2008). Yet this remains a major challenge in invasion ecology (Stewart-Koster et al., 2015).

In this study, we provide insight into rapid response and eradication. We focused on the establishment phase, i.e. the process by which a non-indigenous species in a novel location founds a self-sustaining population, with individuals surviving and successfully reproducing (Colautti & MacIsaac, 2004; Lockwood et al., 2013). However, even populations that temporarily survive and reproduce can subsequently die off without any anthropic intervention (Blackburn et al., 2011; Williamson & Fitter, 1996). We thus considered that distinguishing between an initial sub-stage wherein a species is found in a novel location temporarily (henceforth termed casual establishment) and the subsequent step wherein the species exists as an enduring self-sustaining population over time (henceforth termed persistent establishment) would be of practical importance for management. In fact, simple detections of a species in the wild are often treated as establishments or as a proxy for them, but they arguably relate more closely to casual establishment, and thus overestimate the number of species that persist and potentially cause harm. Therefore, when non-indigenous organisms are detected in the wild, distinguishing species that would likely go extinct without human mediation from those that pose a real threat could help determine instances where eradication effort is needed, and provide critical information for prioritizing rapid response resources.

While mathematical models provide rigorous, quantitative means to prioritize management (e.g., Chadès et al., 2011; Kerr et al., 2016; Kumschick et al., 2012), we recognize that decisions in this field often do not make use of this quantitative evidence, relying instead on expert opinion, and less technical scoring-based approaches (Cook et al., 2010; Leung et al., 2012). Arguably, science should facilitate the usability of technical knowledge for stakeholders (Cassey et al., 2018a), for instance by converting model parameters into rules of thumb. Ideally, such rules of thumb should be easy to apply and should provide useful insights. Similar concepts are often used in medicine to express changes in the odds of an outcome following exposure to a certain factor (e.g., the risk of low weight at birth is 4 times higher in neonates from women exposed to tobacco; Mumbare et al., 2012). Simultaneously, they should still be based on a solid statistical foundation that derives reliable predictions from the empirical data. In this manuscript, we also express our models as rules of thumb, in addition to parameter values.

Finally, we consider the fundamental insights from studying the two phases of casual and persistent establishment separately. Currently, few studies have looked into identifying the factors involved in failures at different stages of invasions (Dawson et al., 2009; Marchetti et al., 2004), and very few specifically concentrated on establishment (e.g., Ficetola et al., 2009; Milbau & Stout, 2008). Here, we analyzed casual and persistent establishment, simultaneously considering propagule pressure, environmental suitability, and species traits (i.e. the main predictors of establishment; Leung et al., 2012), to assess their relative contributions during each sub-stage. Specifically, although propagule pressure has been widely recognized as a strong determinant of invasion success (Cassey et al., 2018b; Colautti et al., 2006; Simberloff, 2009), it is unclear whether it primarily contributes to casual establishment (e.g., Leung et al., 2012) or remains important for persistence (e.g., Ficetola et al., 2008). Similarly, environmental conditions, and most importantly climate matching, were shown to be consistently relevant factors influencing establishment across taxa (e.g., Duncan et al., 2014; Hayes & Barry, 2008; Mahoney et al., 2015). Nonetheless, local conditions might have a stronger effect on the survival of the propagules reaching a new location (casual establishment; Essl et al., 2015) or could affect long-term population persistence through vulnerable phases like reproduction (e.g., Ficetola et al., 2009). Analogously, studies focusing on the role of species traits in plants found them to be important predictors of establishment (Pyšek et al., 2009), either by allowing individuals to cope with novel conditions (Blackburn et al., 2009), or by having a stronger effect on persistence, for example favoring future reproduction in spite of early population growth (Sol et al., 2012).

In brief, in this study, we focused on fish species introduced through the aquarium trade pathway as our study system to address three objectives: 1) to develop a model to help prioritize rapid response, by separating the sub-stages of establishment (casual versus persistent); 2) to provide "rules of thumb", wherein model parameters are converted into a series of simple multiplicative risk factors; 3) to increase fundamental understanding of the invasion process, by elucidating the relative contributions of propagule pressure, environment, and species traits for each of the two sub-stages of establishment.

2.3 Methods

In this study, we focused on aquarium freshwater fish species commonly traded in the United States for which import data were available. The aquarium trade is responsible for importing thousands of individual fish annually (Smith et al., 2008) and it is a significant source of non-indigenous species (Howeth et al., 2016; Rixon et al., 2005), some of which have already established in North America (Lockwood et al., in press). To separate the predictors associated with the early establishment and persistence of non-indigenous aquarium fish in the United States, we examined spatially referenced records of non-indigenous aquatic species in the USA over the past 50 years (United States Geological Survey, 2017). We selected all the detection records for non-indigenous freshwater fish classified as coming from the "aquarium release" pathway and compiled them by state. The USGS further categorized each record by status, based on successful reproduction, persistence and eradication (Table 2.1). We grouped the species detected in the wild into casually established species (CS) and persistently established species (PS). The CS included all the species by state that were detected at some point in time after 1971 regardless of their present status (59 species and 151 species-location combinations). We chose 1971 as a threshold since previous work from Chapman et al. (1997) and Bradie et al. (2013) had shown that species' popularity in the aquarium fish trade remained relatively consistent after that year. Two occurrences (Devario malabaricus, NV, and Serrasalmus rhombeus, FL) which were classified as "Eradicated" (Table 2.1) might have persisted and successfully established in the absence of human intervention. We found that results were robust to their inclusion and parameter values did not change, so we kept these occurrences in the CS group. Since a species that goes through the sub-stage of casual establishment can either go extinct later or persist, CS included both casual and persistent species. In contrast, the PS group included only those species which were able to avoid extirpation, i.e. species that successfully reproduced, that could overwinter, and for which multiple life stages were identified in the wild (21 species and 28 species-location combinations, identified as persistently "Established"; Table 2.1). All other aquarium fish species imported into the USA that have never been detected in the wild were considered unestablished (neither casually nor persistently). The complete list of occurrences for both CS and PS, along with the corresponding state, propagule pressure and trait values can be found in Appendix B.1.

We used the PET modelling framework (Della Venezia et al., 2018) to account for species traits, environment, propagule pressure and density-dependent effects simultaneously. The model defined the probability of a certain species establishing as follows:

$$P(E) = 1 - (1 - p)^{N^{c}} \tag{2.1}$$

where $p$ was the probability of a single propagule establishing, $N$ was the number of propagules introduced (or a proxy for it), and $c$ was a shape parameter that allowed for density-dependent effects. In the context of this framework, $p$ was modeled as a logistic function of species-specific and location-specific predictors:

$$p = \frac{1}{1 + e^{-z_{sl}}} \tag{2.2}$$

where $z_{sl}$ was defined as:

$$z_{sl} = b_0 + \sum_{w=1}^{W} \left( b_{1w} X_{ws} + b_{2w} X_{ws}^{2} \right) + \sum_{m=1}^{M} \left( b_{1m} E_{ml} + b_{2m} E_{ml}^{2} \right) + \sum_{m=1}^{M} \sum_{w=1}^{W} \left( b_{mw} X_{ws} E_{ml} \right) \tag{2.3}$$

Each $X_{ws}$ was a trait of species $s$, for a total of $W$ traits, while each $E_{ml}$ was an environmental condition for location $l$, for $M$ environmental variables. Both first and second order terms for each of these predictors were included to allow a non-monotonic relation with establishment probability. $b$ denoted the coefficients of the model, for traits ($w$) and environmental conditions ($m$), and traits-environment interactions, each described by parameter value $b_{mw}$, common across all species/location combinations.
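The structure of eq. 2.1 to 2.3 can be illustrated with a minimal sketch. The analyses in this thesis were run in R; the Python code below is only an illustration, and the function names, argument layout and coefficient containers are hypothetical choices made for this example rather than the implementation used here.

```python
import math

def linear_predictor(b0, traits, envs, b_trait, b_env, b_int):
    """Eq. 2.3 for one species/location pair (illustrative layout).

    traits, envs : standardized trait and environmental values
    b_trait[w]   : (first-order, second-order) coefficients for trait w
    b_env[m]     : (first-order, second-order) coefficients for variable m
    b_int[m][w]  : interaction coefficient for variable m and trait w
    """
    z = b0
    for x, (b1, b2) in zip(traits, b_trait):
        z += b1 * x + b2 * x ** 2
    for e, (b1, b2) in zip(envs, b_env):
        z += b1 * e + b2 * e ** 2
    for m, e in enumerate(envs):
        for w, x in enumerate(traits):
            z += b_int[m][w] * x * e
    return z

def logistic(z):
    """Eq. 2.2: probability that a single propagule establishes."""
    return 1.0 / (1.0 + math.exp(-z))

def establishment_probability(p, n_propagules, c):
    """Eq. 2.1: P(E) = 1 - (1 - p)^(N^c)."""
    return 1.0 - (1.0 - p) ** (n_propagules ** c)
```

With c = 0 the exponent N^c equals 1, so the establishment probability reduces to p and propagule pressure has no influence.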

We used the PET framework in two different ways. Firstly, we identified the factors that predicted casual establishment (i.e. CS versus species that were never detected in the wild). We referred to this as the "casual establishment model". Secondly, to create rules of thumb for rapid response, we identified the factors that predicted PS versus extirpated species (the fraction of CS that later went extinct). We referred to this as the "persistence model".

2.3.1 Variable choice

The PET establishment model incorporated propagule pressure (P), environmental variables (E) and species traits (T) as predictors of establishment (Appendix B.2). Building from Della Venezia et al. (2018), species traits originated from the online database FishBase (Froese & Pauly, 2018) and were minimum and maximum temperature tolerance, northernmost latitude, trophic level and maximum length. Additional traits were removed to avoid multicollinearity, while missing data were imputed using the methodology described in Della Venezia et al. (2018) based on trait correlations and taxonomic proximity. The environmental variables were obtained from the Bioclim database on www.worldclim.com (Hijmans et al., 2005). Given that the establishment data were at the level of each US state, we estimated the mean (x̄) and variance (s²) of each variable for each state. The final set of environmental variables, after removing the highly correlated ones, included variability of the diurnal range (BIO2s²), average and variability of the minimum temperature of the coldest month (BIO6x̄ and BIO6s²), average temperature of the warmest quarter (BIO10x̄) and average precipitation of the wettest month (BIO13x̄). All variables were standardized before fitting the models, to allow the comparison of the relative importance of the predictors by directly contrasting the magnitude of their coefficients, which represented a measure of their effect size (Schielzeth, 2010).

Since fish releases from aquarists were virtually impossible to track, as a proxy for aquarium fish propagule pressure we used Canadian import data from Fisheries and Oceans Canada (B. Cudmore & N. Mandrak, unpublished data). Bradie and colleagues (2013) showed that United States aquarium fish imports could accurately be obtained by scaling Canadian imports by population size. In addition, they observed consistency in patterns of species' popularity in the aquarium trade market (see also Chapman et al., 1997) particularly after the year 1971, which was then chosen as a threshold year for establishment data. The same approach was used to derive geographically explicit estimates of propagule pressure based on population by state, assuming that propagule pressure would scale with population density (Duggan et al., 2006; Pyšek et al., 2010). Population size data at the national and state level were available from the United States Census Bureau (https://www.census.gov/en.html).
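As a rough illustration of the population-based scaling described above, the sketch below rescales a Canadian import count to a target region (the whole USA or a single state). It assumes a simple proportional relationship; the function name and the numbers in the usage note are hypothetical and are not taken from the import data.

```python
def scale_imports(canadian_imports, canadian_population, target_population):
    """Scale imports recorded for Canada to a target region by population ratio.

    Assumes imports are proportional to human population size, which is the
    simplifying assumption illustrated here.
    """
    return canadian_imports * target_population / canadian_population
```

For instance, with made-up values, 1,000 individuals imported into a country of 10 million people would correspond to 9,000 individuals for a region of 90 million people.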

2.3.2 Model fitting

The two alternative models described above were fit using maximum likelihood estimation, which allowed the best fitting parameters to be found, given the observed data. The log-likelihood function (Della Venezia et al., 2018) was defined as follows:

$$\log(L) = \sum_{l=1}^{L} \left[ \sum_{i} \log\left( 1 - (1 - p_{il})^{N_{il}^{c}} \right) + \sum_{u} \log\left( (1 - p_{ul})^{N_{ul}^{c}} \right) \right] \tag{2.4}$$

where i represented a successful species, i.e. CS or PS depending on the model, and u was a species which failed to establish either temporarily or persistently, for each location l. The sum was iterated over all L states.
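The log-likelihood in eq. 2.4 can be sketched as follows. This is an illustrative Python version, not the R code used for the analyses; the record layout (one tuple per species/location combination) is an assumption made for the example.

```python
import math

def pet_log_likelihood(records, c):
    """Eq. 2.4 summed over species/location combinations.

    records : iterable of (p, n_propagules, established) tuples, where p is
              the per-propagule probability from eq. 2.2 and `established`
              is True for successful species and False for failures.
    c       : shape parameter of eq. 2.1.
    Note: p values of exactly 0 or 1 would make a log term undefined and
    are not handled in this sketch.
    """
    log_l = 0.0
    for p, n, established in records:
        p_fail = (1.0 - p) ** (n ** c)  # probability of no establishment
        if established:
            log_l += math.log(1.0 - p_fail)
        else:
            log_l += math.log(p_fail)
    return log_l
```

Maximum likelihood estimates would then be obtained by passing the negative of this function to a numerical optimizer over the coefficients and c.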

We applied a forward selection procedure to identify the most predictive variables (Johnson & Omland, 2004), to reduce the relatively high number of traits, environmental conditions and interactions included in the model, and to avoid problems of complete separation which may occur when using logistic functions (Albert & Anderson, 1984). We selected the Akaike Information Criterion (AIC; Akaike 1974) as our metric of choice, and each variable was retained in the model when its inclusion decreased the AIC value by at least 2 units (Anderson & Burnham, 2002). The model performance was estimated using the area under the curve (AUC) of the receiver operating characteristic (ROC; Hanley & McNeil, 1982).
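A generic greedy forward selection with the 2-unit AIC threshold could look like the sketch below. It is a simplified illustration: the callable `aic_of`, which would fit a model with the given terms and return its AIC, is hypothetical, and the actual procedure also distinguished first and second order terms (Table 2.3).

```python
def forward_selection(candidates, aic_of, min_improvement=2.0):
    """Greedy forward selection on AIC.

    candidates      : list of candidate term names
    aic_of          : callable mapping a list of selected terms to the AIC
                      of the corresponding fitted model (assumed, not shown)
    min_improvement : a term is kept only if it lowers AIC by at least this much
    """
    selected = []
    best_aic = aic_of(selected)
    remaining = list(candidates)
    while remaining:
        # Try each remaining term and keep the one giving the lowest AIC.
        new_aic, best_term = min((aic_of(selected + [t]), t) for t in remaining)
        if best_aic - new_aic < min_improvement:
            break
        selected.append(best_term)
        remaining.remove(best_term)
        best_aic = new_aic
    return selected, best_aic
```

The loop stops as soon as the best remaining term improves AIC by less than the threshold, so terms are returned in their order of inclusion.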

Finally, to improve our fundamental understanding of the establishment process, we characterized the importance of the three categories of predictors under analysis, and compared the full PET framework to each submodel (i.e. excluding propagule pressure, environment or species traits). We ranked them based on their associated AIC values and the percentage of deviance they explained, to evaluate the relative importance of species traits, environment and propagule pressure during each sub-stage of establishment.
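For reference, the two ranking metrics can be computed from log-likelihoods as sketched below. The AIC formula is standard; the deviance calculation assumes binary outcomes (so that the saturated log-likelihood is zero) and a null model as baseline, which is one common convention and may differ in detail from the calculation used for Table 2.2.

```python
def aic(log_likelihood, n_parameters):
    """Akaike Information Criterion: 2k - 2 log(L)."""
    return 2.0 * n_parameters - 2.0 * log_likelihood

def deviance_explained(log_l_model, log_l_null):
    """Fraction of null deviance explained by the model.

    Assumes a saturated log-likelihood of 0, so deviance = -2 log(L).
    """
    return 1.0 - log_l_model / log_l_null
```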

2.3.3 Multiplicative risk factors

Once the best model for each sub-stage was identified, we derived rules of thumb that might be more intuitive for a broader audience to easily compare species and prioritize resources. We determined the effect of each important predictor variable on the likelihood of a species successfully establishing, either casually or persistently, versus failing. To do so, we converted the parameters of the fitted models (eq. 2.2 and 2.3) and expressed each factor as a multiplier, increasing or decreasing risk of establishment. This was accomplished by calculating the odds ratios (OR), i.e. the ratio between the odds (probability of successfully establishing versus failing) for varying values of each significant predictor and the odds at a reference value (Appendix B.3):

$$OR_k = \frac{Odds_k}{Odds_{ref}} = \frac{p_k / (1 - p_k)}{p_{ref} / (1 - p_{ref})} \tag{2.5}$$

For each significant predictor k, its average value across species or locations was chosen as reference. Odds ratios are the simplest way of interpreting the results of a logistic model and have been used extensively in epidemiology and medicine (e.g., Bland & Altman, 2000; Cummings, 2009; Walter, 2000). Each OR_k then represented a multiplicative risk factor, i.e. a measure of the relative change in risk of establishing versus failing, relative to the odds of average species and locations. Based on logistic regression, odds ratios had the advantage of being estimated independently for each variable, so that the relative contribution of each predictor could be assessed separately. Also, each OR_k mathematically corresponded to a multiplier, so that the cumulative effect of all predictors (OR_c) was the product across all OR_k.

$$OR_c = \prod_{k} OR_k \tag{2.6}$$

For instance, if a species s had a value for trait k = 1 corresponding to OR_1 = 2 (Fig. 2.1a), the likelihood of successfully persisting versus failing for species s would be twice as high as that of a species with an average value for trait 1. If species s also had OR_2 = 1.25 for trait k = 2 (Fig. 2.1b), then the overall OR_c = 2 × 1.25 = 2.5, and, altogether, species s would be 2.5 times more likely to establish versus not establishing than the average species. Odds ratios could be used to compare different species as well, where the ratio OR_{s=1}/OR_{s=2} represents the relative odds of success of species 1 over species 2.
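Because the odds in a logistic model equal e^z, the odds ratio for a single predictor follows directly from its first and second order coefficients in eq. 2.3, with all other terms cancelling. The sketch below illustrates this for a main effect; it assumes standardized predictors (so the mean reference is 0) and leaves aside interaction terms, which receive their own multiplier. Function names are illustrative.

```python
import math

def odds_ratio(x, b1, b2, x_ref=0.0):
    """OR for one predictor relative to its reference value.

    Follows from odds = exp(z): the terms of eq. 2.3 that do not involve
    this predictor cancel in the ratio.
    """
    return math.exp(b1 * (x - x_ref) + b2 * (x ** 2 - x_ref ** 2))

def cumulative_odds_ratio(odds_ratios):
    """Eq. 2.6: product of the individual multipliers."""
    return math.prod(odds_ratios)
```

Using the example in the text, `cumulative_odds_ratio([2, 1.25])` gives 2.5.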

Based on logistic regression, probabilities can be obtained from the odds ($p = \frac{Odds}{1 + Odds}$). Knowing OR_c and Odds_ref, i.e. the odds when all predictors were set to the reference (i.e. mean) value, actual probabilities for each species/location combination would correspond to:

$$p = \frac{OR_c \times Odds_{ref}}{1 + OR_c \times Odds_{ref}} \tag{2.7}$$
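The conversion in eq. 2.7 amounts to two short steps, sketched here in Python for illustration (function names are hypothetical).

```python
def probability_from_odds(odds):
    """p = Odds / (1 + Odds)."""
    return odds / (1.0 + odds)

def probability_from_or(or_c, odds_ref):
    """Eq. 2.7: probability for a species/location with cumulative OR or_c."""
    return probability_from_odds(or_c * odds_ref)
```

With OR_c = 1 and the reference odds of 0.002184 reported in the Results, this returns approximately 0.00218, in line with the reference probability given there. Multipliers read off the response curves are rounded, so probabilities computed from them are approximate.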

All data manipulations, model fitting and analyses were conducted in the R statistical programming environment (R Core Team, 2018).

2.4 Results

Five factors distinguished species likely to become naturally extirpated from those that persisted (AUC = 0.946, ~52% deviance explained; Table 2.2). The minimum temperature of the coldest month (BIO6) was among the strongest predictors of persistence, likely determining whether the species could overwinter in the USA (Table 2.3). In particular, its variability (BIO6s²) was the best determinant for persistence (Table 2.3), presumably because strong fluctuations could drive species to extirpation, even those that were able to initially survive the low minima. Persistence also appeared to be favoured by quite high mean temperatures of the warmest quarter (BIO10x̄) and for intermediate-sized species with high maximum temperature tolerances (Table 2.3). Finally, the probability of persistence appeared to change with interacting minimum temperature of the coldest month and maximum length, with big species being favored in more stable cold environments (Table 2.3).

2.4.1 Multiplicative risk factors

Using the parameters from the fitted models (Table 2.3), we calculated the odds for varying values of the predictors (Appendix B.3), and we compared them against the mean across locations and species as our reference value (Odds_ref = 0.002184, corresponding to p_ref = 0.002180) to obtain our rules of thumb (Fig. 2.2; Table 2.4). While persistence was determined by five predictors, some were substantially more important. For example, maximum temperature tolerance appeared much more influential than maximum length for persistence, as shown by the range of the corresponding OR (Fig. 2.2a,b). Temperature tolerance sharply increased persistence, with species with tolerances higher than 35°C being at least 50 times more likely to successfully persist than to fail (Fig. 2.2a). On the other hand, persistent establishment appeared likely for a relatively narrow range of lengths, with species about 60 cm in length being 4 times more likely to persist than to fail compared to the average, and very big species being extremely unlikely to cause concern in the long term (Fig. 2.2b). Similarly, the effect of minimum temperature variability (BIO6s²; Fig. 2.2d) appeared substantially stronger than the average warmest temperature (BIO10x̄; Fig. 2.2c), increasing the odds up to ~200 times with respect to average conditions.

The odds ratios allowed us to combine the contribution of different predictors by simple multiplication, to obtain their overall expected effect on persistence risk. To illustrate, the species expected to have the highest likelihood of persistent establishment in the USA was the jaguar guapote (Parachromis managuensis; for a list of traits, see Appendix B.1). The estimate of OR was ~150 for maximum temperature tolerance and ~4 for size (triangular dots in Fig. 2.2a,b), making this species about (150 × 4 =) 600 times more likely to establish than the average fish in our persistence dataset. Similarly, based on local environmental conditions, the state where persistent establishment was most likely to occur was Hawaii, with OR of about 1 and 55 for mean temperature of the warmest quarter and minimum temperature variance, respectively (triangular dots in Fig. 2.2c,d), making the insular state ~55 times more suitable to persistent species than the average. For the jaguar guapote, the calculated overall likelihood of successfully persisting versus failing in Hawaii was more than 60 thousand times higher than the average species/location, including the contribution of the interaction terms (OR interaction = ~2, triangular dot in Fig. 2.2e; overall OR = 150 × 4 × 1 × 55 × 2). Thus, while the probability of persistence was low across all species and locations on average (p_ref = 0.002180), for the jaguar guapote in Hawaii it was 0.9964. Similar estimates can be easily obtained and compared for any species/location combination.

2.4.2 Re-evaluating "risky" species in terms of persistence

Additionally, we looked at species within our dataset which had been flagged as potentially invasive in the literature. These included the Wels catfish (Silurus glanis), the European weatherfish (Misgurnus fossilis), the spined loach (Cobitis taenia), the white cloud mountain minnow (Tanichthys albonubes), the clown loach (Chromobotia macracanthus), the silver arowana (Osteoglossum bicirrhosum) and the glass catfish (Kryptopterus bicirrhis), particularly in the Great Lakes area. We found that the only species of concern for persistence in the states surrounding the Great Lakes was the Wels catfish, specifically in Illinois. Other sensitive states for this species were New Mexico and North Carolina, but the likelihood of early establishment across all these states remained very low (Appendix B.4). None of the other species listed appeared to be likely to persist in the Great Lakes region, although some of them would be troubling in other states, being tens to hundreds of times more likely to succeed than average. These were the glass catfish in Hawaii, and the white cloud mountain minnow and the clown loach in both Florida and Hawaii (Appendix B.4). By comparison, the silver arowana was never able to persist, nor was it predicted to pose a substantial threat across the USA, despite having been casually established in 9 states.

On the other hand, among the species/location combinations for which casual establishment had already occurred, our model predictions for persistence suggested different species of highest concern. These included the red-bellied piranha (Pygocentrus nattereri), the lowland cichlid (Herichthys carpintis), the Rio pearlfish (Nematolebias whitei), the banded leporinus (Leporinus fasciatus), and the climbing perch (Anabas testudineus), particularly in California, Florida, Hawaii, New Mexico and Texas. All had a likelihood of successfully persisting versus failing tens to hundreds of times higher than average species/locations. If these species were detected again in their casual occurrence sites, they should be prioritized for rapid response.

2.4.3 Comparing establishment sub-stages: casual versus persistent

The combination of propagule pressure, environment, and traits (i.e. the PET model) performed better than any of the submodels for casual establishment, both in terms of goodness of fit (AIC) and in prediction accuracy (AUC_PET = 0.957, ~36% deviance explained; Table 2.2; Appendix B.5). In contrast, for persistence, the best model included only species traits and environmental conditions (AUC_ET = 0.946, ~52% deviance explained; Table 2.2). For casual establishment, species traits were more explanatory than environment (either with or without propagule pressure; Table 2.2), and indeed all traits examined were important (Table 2.3; Appendix B.5). More specifically, we found that casual establishment risk was inversely related to trophic level, seemingly favouring herbivorous species. Casual establishment risk also increased with latitude for species whose northernmost distribution limit ranged up to ~23°N, and then decreased at higher latitudes (Table 2.3; Appendix B.5). Some traits were also predictive of persistent species, with maximum length and, to a lesser extent, maximum temperature tolerance being important for both sub-stages. Interestingly, risk was unimodally related to maximum length, with intermediate to big species having an advantage for casual establishment, and relatively smaller ones being favored for persistence. Optimal physiological ranges of temperature were significant both for casual establishment and for persistence, with minimum temperature tolerance losing relevance at the latter sub-stage (Table 2.3; Appendix B.5).

In contrast, environmental conditions appeared to be more relevant for persistence, alone explaining about 35% of deviance (Table 2.2). While precipitation and average minimum temperatures were determinants of casual establishment (Table 2.2; Appendix B.5), persistence was predicted by mean temperature of the warmest quarter and variability in minimum temperature of the coldest month (Table 2.2). Nonetheless, the submodel combining species traits and environment was the best at distinguishing persistent species from those that were subsequently extirpated, which represented more than 80% of the observed casual establishments, while propagule pressure was generally not predictive of persistence for aquarium fish (Table 2.2). Propagule pressure appeared to favour casual establishment, and its inclusion in the model added considerably to the final percentage of explained deviance (Table 2.2), with values of propagule pressure higher than the median increasing risk (Appendix B.2). However, propagule pressure was not important for persistence of aquarium fish species (Table 2.2; Appendix B.2). This was further corroborated by the fitted ĉ value being consistently very close to zero for persistence models (at zero, propagule pressure would have no effect; Table 2.2).

2.5 Discussion

Strides have been made in the field of invasion ecology toward identifying priorities for management and guiding strategies for prevention and early response. However, prioritization remains a challenging task, and geographically explicit, multispecies risk assessment frameworks could help make the process more efficient. Here, we focused on establishment and rapid response. Among the casual fish species considered in this study, more than 80% were later extirpated without any human intervention; spending funds on their eradication would have been an unnecessary allocation of valuable resources. Instead, by separating establishment into two sub-stages and pinpointing the factors associated with successful casual and persistent establishment, we have provided analyses to target species likely to persist after they have been detected.

Although we recognize that non-indigenous species could generate local impact even during their casual establishment, being able to prioritize species that pose a lasting threat and to redirect (often scarce) resources toward instances where their investment is necessary would help maximize the efficacy of management strategies (Jenkins, 2013; Keller & Perrings, 2011). When detections take place early, the five predictive factors identified and the multiplicative risk factors derived from the persistence model would allow a quick simultaneous assessment of multiple species and locations. As an illustrative case, the pirapitinga (Piaractus brachypomus) has managed to casually establish in as many as 44 states, but it has never become persistent due to unfavorable local conditions. However, its likelihood of persistent establishment is non-negligible in some southern states like Florida and Nevada, where measures should be taken in case of detection. Further, we have listed Pygocentrus nattereri, Herichthys carpintis, Nematolebias whitei, Leporinus fasciatus and Anabas testudineus as the most likely to persist among species that have already casually established. These species are ranked as highest concern and should be eradicated if detected again, especially in Texas, New Mexico, California, Florida or Hawaii. In contrast, when we looked at species that had already been classified as potential threats in specific areas (e.g., Great Lakes; Howeth et al., 2016; Kolar & Lodge, 2002; Rixon et al., 2005), our results suggested that even if they were detected, they would have a high likelihood of extirpation without intervention, except for the Wels catfish (Silurus glanis). Across all other species considered, although casual establishment might occur for some (e.g., Chromobotia macracanthus and Osteoglossum bicirrhosum), persistence was predicted to be very unlikely.

On the other hand, species traits and local environmental conditions that are advantageous in the earliest phases of establishment might also represent useful filtering criteria to drive restrictions in the aquarium market and to define targets for the investment of resources for early detection (Mehta et al., 2007). Although we focused primarily on informing rapid response after detection, preventing casual establishment could be necessary in specific cases, e.g., for species that can cause substantial temporary impact, or for which eradication would hardly be feasible (e.g., Dogliotti et al., 2018; Simberloff, 2003). For example, even if the Wels catfish appeared to have very low chances of overcoming the casual establishment phase, it has the potential to exert substantial impacts (Copp et al., 2009). In such cases, a reduction in the number of commercialized individuals should be considered, as it might be sufficient to make establishment risk virtually nonexistent. However, targeting propagule pressure after detection would not reduce persistence risk for freshwater aquarium fish. Overall, our model and the associated multiplicative risk factors provide quantitative support for decision making that would also help reduce the costs associated with control (Rejmánek & Pitcairn, 2002).

In addition to allowing us to derive guidance for prioritization of management practices, separating establishment into sub-stages provided additional fundamental knowledge about this phase of invasions. Generally, our results reflect the importance of distinct predictors during separate phases of the invasion process (Dawson et al., 2009; Essl et al., 2015; Kuppinger et al., 2010; Marchetti et al., 2004; Milbau & Stout, 2008). For instance, propagule pressure has been recognized as the most consistent predictor of establishment success across taxa (Cassey et al., 2018b; Lockwood et al., 2013). Looking at sub-stages, our results suggested that propagule pressure was very important for casual establishment, but not for persistence, confirming previous observations on freshwater fish in California (Marchetti et al., 2004). However, studies on vascular plants have found that a continual input of propagules could enhance establishment success at later stages (e.g., Essl et al., 2015), suggesting that establishment dynamics vary across taxa.

In contrast to propagule pressure, both species-specific and location-specific characteristics played an important role during both sub-stages of establishment. However, species traits were more important for casual establishment, while location-specific variables were more important for persistence. As expected, the relevant predictors differed between stages. For example, trophic level was retained as a predictor of casual establishment in line with previous studies (Purvis et al., 2000; Ruesink, 2005), but it did not have an effect on persistence. Even for traits that were relevant for both stages, their relationship with the likelihood of success changed. For example, maximum length was consistently important across stages, with the largest species being generally disadvantaged, in agreement with findings in the literature (Ribeiro et al., 2008; Ruesink, 2005). While mid-range species were favored for early establishment, potentially due to an initial advantage in terms of survival, only relatively small ones appeared to find a suitable environment for persistence. Such species were often detected in the wild across northern states (e.g., red-bellied piranha, Pygocentrus nattereri; Appendix B.1), but coming from tropical or subtropical regions, they could only overwinter in the mild climate of the southernmost USA (Bennett et al., 1997).

The environment, on the other hand, seemed to play a greater role for persistence in the United States, similarly to what has been observed for naturalized and casual bryophytes (Essl et al., 2015). Our results supported observations from previous studies about the importance of climatic variables at different stages of invasion in fish species (Bomford et al., 2010; Howeth et al., 2016), as well as in other vertebrates (Duncan et al., 2001; Forsyth et al., 2004; Mahoney et al., 2015). As expected, the minimum temperature of the coldest month was one of the strongest environmental predictors of casual establishment and persistence, both as average and as variability. After overcoming low temperatures during the earlier phase, aquarium species seemed to favor locations that are relatively steady in winter, in accordance with previous studies (Bradie & Leung, 2017; Drake & Lodge, 2004).

The approach used here can be applied to other suites of organisms across different pathways of introduction, to derive geographically explicit, pathway-specific risk factors. In the context of invasions by aquarium fish, this work suggests that a small number of predictors can differentiate species and locations likely to establish and persist. Moreover, our model identifies different species of most concern from the perspective of persistence, and thus different targets for rapid response once a species has been detected in the wild.

2.6 Acknowledgements

The authors would like to thank E. Hudgins, D. Nguyen, N. Richards and S. Varadarajan for insightful discussions. This research was supported by an NSERC Discovery grant to BL.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE transactions on automatic control, 19, 716-723.

Alvarez, S., & Solis, D. (2019). Rapid Response Lowers Eradication Costs of Invasive Species: Evidence from Florida. Choices, 33 (316-2019-039), 1.

Albert, A., & Anderson, J. A. (1984). On the existence of maximum likelihood estimates in logistic regression models. Biometrika, 71, 1-10.

Anderson, D. R., & Burnham, K. P. (2002). Avoiding pitfalls when using information-theoretic methods. The Journal of Wildlife Management, 912-918.

Bennett, W. A., Currie, R. J., Wagner, P. F., & Beitinger, T. L. (1997). Cold tolerance and potential overwintering of the red-bellied piranha Pygocentrus nattereri in the United States. Transactions of the American Fisheries Society, 126, 841-849.

Blackburn, T. M., Cassey, P., & Lockwood, J. L. (2009). The role of species traits in the establishment success of exotic birds. Global Change Biology, 15, 2852-2860.

Blackburn, T. M., Pyšek, P., Bacher, S., Carlton, J. T., Duncan, R. P., Jarošík, V., ... & Richardson, D. M. (2011). A proposed unified framework for biological invasions. Trends in ecology & evolution, 26, 333-339.

Bland, J. M., & Altman, D. G. (2000). The odds ratio. Bmj, 320, 1468.

Bomford, M., Barry, S. C., & Lawrence, E. (2010). Predicting establishment success for introduced freshwater fishes: a role for climate matching. Biological Invasions, 12, 2559-2571.

Bradie, J., Chivers, C., & Leung, B. (2013). Importing risk: quantifying the propagule pressure–establishment relationship at the pathway level. Diversity and Distributions, 19, 1020-1030.

Bradie, J., & Leung, B. (2017). A quantitative synthesis of the importance of variables used in MaxEnt species distribution models. Journal of Biogeography, 44, 1344-1361.

Buckley, Y. M. (2008). The role of research for integrated management of invasive species, invaded landscapes and communities. Journal of Applied Ecology, 45, 397-402.

Cassey, P., Delean, S., Lockwood, J. L., Sadowski, J., & Blackburn, T. M. (2018b). Dissecting the null model for biological invasions: A meta-analysis of the propagule pressure effect. PLoS biology, 16, e2005987.

Cassey, P., García-Díaz, P., Lockwood, J. L., & Blackburn, T. M. (2018a). Invasion Biology: Searching for Predictions and Prevention, and Avoiding Lost Causes. Invasion Biology: Hypotheses and Evidence, 1.

Chadès, I., Martin, T. G., Nicol, S., Burgman, M. A., Possingham, H. P., & Buckley, Y. M. (2011). General rules for managing and surveying networks of pests, diseases, and endangered species. Proceedings of the National Academy of Sciences, 108, 8323-8328.

Chapman, F. A., Fitz‐Coy, S. A., Thunberg, E. M., & Adams, C. M. (1997). United States of America trade in ornamental fish. Journal of the World Aquaculture Society, 28, 1-10.

Colautti, R. I., Grigorovich, I. A., & MacIsaac, H. J. (2006). Propagule pressure: a null model for biological invasions. Biological Invasions, 8, 1023-1037.

Colautti, R. I., & MacIsaac, H. J. (2004). A neutral terminology to define ‘invasive’ species. Diversity and distributions, 10, 135-141.

Cook, C. N., Hockings, M., & Carter, R. B. (2010). Conservation in the dark? The information used to support management decisions. Frontiers in Ecology and the Environment, 8, 181-186.

Copp, G. H., Robert Britton, J., Cucherousset, J., García‐Berthou, E., Kirk, R., Peeler, E., & Stakėnas, S. (2009). Voracious invader or benign feline? A review of the environmental biology of European catfish Silurus glanis in its native and introduced ranges. Fish and fisheries, 10, 252-282.

Cummings, P. (2009). The relative merits of risk ratios and odds ratios. Archives of pediatrics & adolescent medicine, 163, 438-445.

Dawson, W., Burslem, D. F., & Hulme, P. E. (2009). Factors explaining alien plant invasion success in a tropical ecosystem differ at each stage of invasion. Journal of Ecology, 97, 657-665.

Della Venezia, L., Samson, J., & Leung, B. (2018). The rich get richer: Invasion risk across North America from the aquarium pathway under climate change. Diversity and Distributions, 24, 285-296.

Dogliotti, A., Gossn, J., Vanhellemont, Q., & Ruddick, K. (2018). Detecting and quantifying a massive invasion of floating aquatic plants in the Río de la Plata turbid waters using high spatial resolution ocean color imagery. Remote Sensing, 10, 1140.

Drake, J. M., & Lodge, D. M. (2004). Effects of environmental variation on extinction and establishment. Ecology Letters, 7, 26-30.

Duggan, I. C., Rixon, C. A., & MacIsaac, H. J. (2006). Popularity and propagule pressure: determinants of introduction and establishment of aquarium fish. Biological invasions, 8, 377-382.

Duncan, R. P., Blackburn, T. M., Rossinelli, S., & Bacher, S. (2014). Quantifying invasion risk: the relationship between establishment probability and founding population size. Methods in Ecology and Evolution, 5, 1255-1263.

Duncan, R. P., Bomford, M., Forsyth, D. M., & Conibear, L. (2001). High predictability in introduction outcomes and the geographical range size of introduced Australian birds: a role for climate. Journal of Animal Ecology, 70, 621-632.

Ehrenfeld, J. G. (2010). Ecosystem consequences of biological invasions. Annual review of ecology, evolution, and systematics, 41, 59-80.

Essl, F., Dullinger, S., Moser, D., Steinbauer, K., & Mang, T. (2015). Macroecology of global bryophyte invasions at different invasion stages. Ecography, 38, 488-498.

Ficetola, G. F., Bonin, A., & Miaud, C. (2008). Population genetics reveals origin and number of founders in a biological invasion. Molecular Ecology, 17, 773-782.

Ficetola, G. F., Thuiller, W., & Padoa‐Schioppa, E. (2009). From introduction to the establishment of alien species: bioclimatic differences between presence and reproduction localities in the slider turtle. Diversity and Distributions, 15, 108-116.

Finnoff, D., Shogren, J. F., Leung, B., & Lodge, D. (2007). Take a risk: preferring prevention over control of biological invaders. Ecological Economics, 62, 216-222.

Forsyth, D. M., Duncan, R. P., Bomford, M., & Moore, G. (2004). Climatic suitability, life‐history traits, introduction effort, and the establishment and spread of introduced mammals in Australia. Conservation Biology, 18, 557-569.

Froese, R. and D. Pauly. Editors. 2018. FishBase. World Wide Web electronic publication. www.fishbase.org, version (02/2018).

Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29-36.

Hayes, K. R., & Barry, S. C. (2008). Are there any consistent predictors of invasion success? Biological invasions, 10, 483-506.

Heger, T., & Trepl, L. (2003). Predicting biological invasions. Biological Invasions, 5, 313-321.

Hijmans, R. J., Cameron, S. E., Parra, J. L., Jones, P. G., & Jarvis, A. (2005). Very high resolution interpolated climate surfaces for global land areas. International journal of climatology, 25, 1965-1978. URL http://www.worldclim.org/

Hobbs, R. J., & Humphries, S. E. (1995). An integrated approach to the ecology and management of plant invasions. Conservation biology, 9, 761-770.

Howeth, J. G., Gantz, C. A., Angermeier, P. L., Frimpong, E. A., Hoff, M. H., Keller, R. P., ... & Lodge, D. M. (2016). Predicting invasiveness of species in trade: climate match, trophic guild and fecundity influence establishment and impact of non‐native freshwater fishes. Diversity and Distributions, 22, 148-160.

Jenkins, P. T. (2013). Invasive animals and wildlife pathogens in the United States: the economic case for more risk assessments and regulation. Biological invasions, 15, 243-248.

Johnson, J. B., & Omland, K. S. (2004). Model selection in ecology and evolution. Trends in ecology & evolution, 19, 101-108.

Keller, R. P., & Perrings, C. (2011). International policy options for reducing the environmental impacts of invasive species. BioScience, 61, 1005-1012.

Kerr, N. Z., Baxter, P. W., Salguero‐Gómez, R., Wardle, G. M., & Buckley, Y. M. (2016). Prioritizing management actions for invasive populations using cost, efficacy, demography and expert opinion for 14 plant species world‐wide. Journal of applied ecology, 53, 305-316.

Kolar, C. S., & Lodge, D. M. (2002). Ecological predictions and risk assessment for alien fishes in North America. Science, 298, 1233-1236.

Kumschick, S., Bacher, S., Dawson, W., Heikkilä, J., Sendek, A., Pluess, T., ... & Kühn, I. (2012). A conceptual framework for prioritization of invasive alien species for management according to their impact. NeoBiota, 15, 69-100.

Kuppinger, D. M., Jenkins, M. A., & White, P. S. (2010). Predicting the post-fire establishment and persistence of an invasive tree species across a complex landscape. Biological Invasions, 12, 3473-3484.

Leung, B., Finnoff, D., Shogren, J. F., & Lodge, D. (2005). Managing invasive species: rules of thumb for rapid assessment. Ecological Economics, 55, 24-36.

Leung, B., Lodge, D. M., Finnoff, D., Shogren, J. F., Lewis, M. A., & Lamberti, G. (2002). An ounce of prevention or a pound of cure: bioeconomic risk analysis of invasive species. Proceedings of the Royal Society of London B: Biological Sciences, 269, 2407-2413.

Leung, B., Roura‐Pascual, N., Bacher, S., Heikkilä, J., Brotons, L., Burgman, M. A., ... & Sol, D. (2012). TEASIng apart alien species risk assessments: a framework for best practices. Ecology Letters, 15, 1475-1493.

Lockwood, J. L., Hoopes, M. F., & Marchetti, M. P. (2013). Invasion ecology. John Wiley & Sons.

Lockwood, J.L., D. Welbourne, C. Romagosa, P. Cassey, N. Mandrak, A. Strecker, B. Leung, O. Stringham, B. Udell, D. Episcopio-Sturgeon, M. Tlusty, J. Sinclair, M. Springborn, E. Pienaar, A. Rhyne, and R. Keller. In Press. When pets become pests: the role of the exotic pet trade in producing invasive vertebrate animals. Frontiers in Ecology and the Environment.

Lodge, D. M., Williams, S., MacIsaac, H. J., Hayes, K. R., Leung, B., Reichard, S., ... & Carlton, J. T. (2006). Biological invasions: recommendations for US policy and management. Ecological applications, 16, 2035-2054.

Lohr, C. A., Hone, J., Bode, M., Dickman, C. R., Wenger, A., & Pressey, R. L. (2017). Modeling dynamics of native and invasive species to guide prioritization of management actions. Ecosphere, 8.

Mack, R. N., Simberloff, D., Mark Lonsdale, W., Evans, H., Clout, M., & Bazzaz, F. A. (2000). Biotic invasions: causes, epidemiology, global consequences, and control. Ecological applications, 10, 689-710.

Mahoney, P. J., Beard, K. H., Durso, A. M., Tallian, A. G., Long, A. L., Kindermann, R. J., ... & Mohn, H. E. (2015). Introduction effort, climate matching and species traits as predictors of global establishment success in non‐native reptiles. Diversity and Distributions, 21, 64-74.

Marchetti, M. P., Moyle, P. B., & Levine, R. (2004). Invasive species profiling? Exploring the characteristics of non‐native fishes across invasion stages in California. Freshwater biology, 49, 646-661.

Mehta, S. V., Haight, R. G., Homans, F. R., Polasky, S., & Venette, R. C. (2007). Optimal detection and control strategies for invasive species management. Ecological Economics, 61, 237-245.

Milbau, A., & Stout, J. C. (2008). Factors associated with alien plants transitioning from casual, to naturalized, to invasive. Conservation Biology, 22, 308-317.

Mumbare, S. S., Maindarkar, G., Darade, R., Yenge, S., Tolani, M. K., & Patole, K. (2012). Maternal risk factors associated with term low birth weight neonates: a matched-pair case control study. Indian pediatrics, 49, 25-28.

Myers, J. H., Simberloff, D., Kuris, A. M., & Carey, J. R. (2000). Eradication revisited: dealing with exotic species. Trends in ecology & evolution, 15, 316-320.

Olson, L. J. (2006). The economics of terrestrial invasive species: a review of the literature. Agricultural and Resource Economics Review, 35, 178-194.

Papeş, M., Sällström, M., Asplund, T. R., & Vander Zanden, M. J. (2011). Invasive species research to meet the needs of resource management and planning. Conservation Biology, 25, 867-872.

Purvis, A., Gittleman, J. L., Cowlishaw, G., & Mace, G. M. (2000). Predicting extinction risk in declining species. Proceedings of the Royal Society of London B: Biological Sciences, 267, 1947-1952.

Pyšek, P., Jarošík, V., Pergl, J., Randall, R., Chytrý, M., Kühn, I., ... & Sádlo, J. (2009). The global invasion success of Central European plants is related to distribution characteristics in their native range and species traits. Diversity and Distributions, 15, 891-903.

Pyšek, P., Jarošík, V., Hulme, P. E., Kühn, I., Wild, J., Arianoutsou, M., ... & Genovesi, P. (2010). Disentangling the role of environmental and human pressures on biological invasions across Europe. Proceedings of the National Academy of Sciences, 107, 12157-12162.

Pyšek, P., Jarošík, V., Hulme, P. E., Pergl, J., Hejda, M., Schaffner, U., & Vilà, M. (2012). A global assessment of invasive plant impacts on resident species, communities and ecosystems: the interaction of impact measures, invading species' traits and environment. Global Change Biology, 18, 1725-1737.

R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Rejmánek, M., & Pitcairn, M. J. (2002). When is eradication of exotic pest plants a realistic goal? Turning the tide: the eradication of invasive species, 249-253.

Ribeiro, F., Elvira, B., Collares-Pereira, M. J., & Moyle, P. B. (2008). Life-history traits of non-native fishes in Iberian watersheds across several invasion stages: a first approach. Biological Invasions, 10, 89-102.

Rixon, C. A., Duggan, I. C., Bergeron, N. M., Ricciardi, A., & Macisaac, H. J. (2005). Invasion risks posed by the aquarium trade and live fish markets on the Laurentian Great Lakes. Biodiversity & Conservation, 14, 1365-1381.

Ruesink, J. L. (2005). Global analysis of factors affecting the outcome of freshwater fish introductions. Conservation Biology, 19, 1883-1893.

Schielzeth, H. (2010). Simple means to improve the interpretability of regression coefficients. Methods in Ecology and Evolution, 1, 103-113.

Simberloff, D. (2003). How much information on population biology is needed to manage introduced species? Conservation Biology, 17, 83-92.

Simberloff, D. (2009). The role of propagule pressure in biological invasions. Annual Review of Ecology, Evolution, and Systematics, 40, 81-102.

Simberloff, D., Martin, J. L., Genovesi, P., Maris, V., Wardle, D. A., Aronson, J., ... & Pyšek, P. (2013). Impacts of biological invasions: what's what and the way forward. Trends in ecology & evolution, 28, 58-66.

Smith, K. F., Behrens, M. D., Max, L. M., & Daszak, P. (2008). US drowning in unidentified fishes: scope, implications, and regulation of live fish import. Conservation Letters, 1, 103-109.

Sol, D., Maspons, J., Vall-Llosera, M., Bartomeus, I., García-Peña, G. E., Piñol, J., & Freckleton, R. P. (2012). Unraveling the life history of successful invaders. Science, 337, 580-583.

Stewart‐Koster, B., Olden, J. D., & Johnson, P. T. (2015). Integrating landscape connectivity and habitat suitability to guide offensive and defensive invasive species management. Journal of Applied Ecology, 52, 366-378.

United States Census Bureau. URL https://www.census.gov/en.html

U.S. Geological Survey (2017). Nonindigenous Aquatic Species Database. Available at: http://nas.er.usgs.gov Gainesville, FL.

Vander Zanden, M. J., Hansen, G. J., Higgins, S. N., & Kornis, M. S. (2010). A pound of prevention, plus a pound of cure: early detection and eradication of invasive species in the Laurentian Great Lakes. Journal of Great Lakes Research, 36, 199-205.

Vander Zanden, M. J., & Olden, J. D. (2008). A management framework for preventing the secondary spread of aquatic invasive species. Canadian Journal of Fisheries and Aquatic Sciences, 65, 1512-1522.

Walter, S. D. (2000). Choice of effect measure for epidemiological data. Journal of clinical epidemiology, 53, 931-939.

Westbrooks, R. G. (2004). New approaches for early detection and rapid response to invasive plants in the United States. Weed Technology, 18, 1468-1471.

Williamson, M., & Fitter, A. (1996). The varying success of invaders. Ecology, 77, 1661-1666.

Wittenberg, R., & Cock, M. J. (Eds.). (2001). Invasive alien species: a toolkit of best prevention and management practices. CABI.

Table 2.1. Species status categories for the aquarium species in our dataset, as described on the USGS Non-indigenous Aquatic Species website and as categorized in our study (CS/PS).

________________________________________________________________________________________________
Status       Description                                                                  CS/PS
________________________________________________________________________________________________
Collected    Species was collected or observed from the site; reproduction is not known   CS
Established  Population is reproducing and overwintering, currently established           CS, PS
Eradicated   Population was eliminated by human activity                                  CS
Extirpated   Population died out on its own, without human interference                   CS
Failed       Population was introduced but died out; failed to reproduce                  CS
Unknown      Used when all other categories do not fit                                    CS
________________________________________________________________________________________________
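As a minimal sketch of how the categorization in Table 2.1 could be encoded, each USGS status can be mapped to the sub-stage datasets it contributes to. The names `STATUS_TO_SUBSTAGES`, `is_persistent` and `persistent_fraction` are hypothetical and are not taken from the original analysis; how non-persistent records were treated in the fitted persistence model is not specified by the table.

```python
# Hypothetical encoding of Table 2.1: USGS status -> sub-stage datasets.
# CS = casual establishment, PS = persistent establishment.
STATUS_TO_SUBSTAGES = {
    "Collected": {"CS"},
    "Established": {"CS", "PS"},
    "Eradicated": {"CS"},
    "Extirpated": {"CS"},
    "Failed": {"CS"},
    "Unknown": {"CS"},
}


def is_persistent(status: str) -> bool:
    """True when a record with this status counts as persistent (PS)."""
    return "PS" in STATUS_TO_SUBSTAGES[status]


def persistent_fraction(statuses):
    """Share of casually established records that are also persistent."""
    casual = [s for s in statuses if "CS" in STATUS_TO_SUBSTAGES[s]]
    if not casual:
        return 0.0
    return sum(is_persistent(s) for s in casual) / len(casual)


# Illustrative records only (not data from the study).
example = ["Established", "Failed", "Extirpated", "Collected", "Unknown"]
print(persistent_fraction(example))
```

Keeping the mapping in one dictionary makes the rule explicit: only "Established" records enter the persistent category, while every listed status counts as a casual establishment.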

Table 2.2. AIC, AIC difference from the best model (ΔAIC), AUC, fitted ĉ parameter and percentage of deviance explained (%dev.exp) by the full casual establishment and persistence model, and their respective submodels. The best model for each dataset is marked with an asterisk.

_____________________________________________________________________________________
Model                         AIC       ΔAIC     AUC     ĉ             %dev.exp
_____________________________________________________________________________________
Casual establishment model
PET *                         1359.89   0        0.957   0.546         35.59
sp. traits - pp               1504.97   145.08   0.926   0.542         28.07
environment - pp              1753.45   393.56   0.838   0.360         13.93
sp. traits - environment      1585.52   225.63   0.897   -             23.32
propagule pressure            1889.50   529.61   0.795   0.378         8.67
sp. traits                    1773.85   413.96   0.806   -             14.95
environment                   1896.87   536.98   0.719   -             6.84
null model                    2066.59   706.70   0.500   -             0

Persistence model
PET                           85.91     2        0.946   1 × 10^-8     51.60
sp. traits - pp               117.20    33.29    0.841   1 × 10^-12    26.87
environment - pp              106.34    22.43    0.840   1 × 10^-8     34.85
sp. traits - environment *    83.91     0        0.946   -             51.52
propagule pressure            148.82    64.91    0.439   4 × 10^-9     0
sp. traits                    115.20    31.29    0.840   -             26.87
environment                   104.34    20.43    0.862   -             34.86
null model                    146.82    62.91    0.500   -             0
_____________________________________________________________________________________
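The ΔAIC column can be recomputed directly from the AIC column. The short sketch below does this for the persistence models, using the AIC values reported in Table 2.2; the dictionary and function names are illustrative and are not part of the original analysis.

```python
# AIC values copied from Table 2.2 (persistence model and submodels).
persistence_aic = {
    "PET": 85.91,
    "sp. traits - pp": 117.20,
    "environment - pp": 106.34,
    "sp. traits - environment": 83.91,
    "propagule pressure": 148.82,
    "sp. traits": 115.20,
    "environment": 104.34,
    "null model": 146.82,
}


def delta_aic(aic_by_model):
    """Return each model's AIC difference from the lowest-AIC model."""
    best = min(aic_by_model.values())
    return {name: round(aic - best, 2) for name, aic in aic_by_model.items()}


deltas = delta_aic(persistence_aic)
best_model = min(persistence_aic, key=persistence_aic.get)

print(best_model)  # sp. traits - environment
for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name:26s} {delta:6.2f}")
```

Sorting by ΔAIC shows that the full PET model sits only 2 AIC units above the traits-environment submodel, which is what would be expected if the additional propagule pressure parameter adds essentially nothing to the fit.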

Table 2.3. Parameter values for the predictors retained in each sub-stage best model after variable selection. Rank indicates the entry order of each important variable in the corresponding model, either as a first order term only (1st) or including an additional second order term (2nd).

_____________________________________________________________________________________________________
                                                    Casual establishment model   Persistence model
Parameter                                           Rank   1st      2nd          Rank   1st      2nd
_____________________________________________________________________________________________________
Species traits
Maximum temperature tolerance (°C)                  4      0.411    -            3      0.360    0.834
Minimum temperature tolerance (°C)                  5      -0.512   -0.404       NA     -        -
Northernmost latitude                               3      0.917    -0.589       NA     -        -
Trophic level                                       6      -0.166   -            NA     -        -
Maximum length (cm)                                 1      1.795    -1.844       2      1.382    -7.679

Environmental conditions
Minimum temperature coldest month (°C; BIO6x̄)       7      0.112    0.212        NA     -        -
Mean temperature warmest quarter (°C; BIO10x̄)       NA     -        -            4      0.504    -2.670
Precipitation wettest month (mm; BIO13x̄)            2      0.053    0.479        NA     -        -
Minimum temperature coldest month (°C; BIO6s²)      NA     -        -            1      4.971    -4.114

Interactions
Min. temp. coldest month (BIO6x̄) & Max. length      8      -0.129   -            NA     -        -
Min. temp. coldest month (BIO6s²) & Max. length     NA     -        -            5      -3.172   -
_____________________________________________________________________________________________________

Table 2.4. Selected multiplicative risk factors to quickly quantify risk of casual and persistent establishment. The first column reports the average value of each relevant predictor in the equivalent model, while the other columns identify the variable values corresponding to an OR change of +1000%, +100%, +50%, +25%, -25% and -50%. The means of variables that were significant for both models differ, because the persistence model dataset represents a subset of the casual establishment model dataset (e.g., maximum length).

______________________________________________________________________________________________________________________________________
                                           Average   +1000%        +100%         +50%          +25%          -25%          -50%
______________________________________________________________________________________________________________________________________
Casual establishment
Max. temperature tolerance (°C)            26.7      39            30.3          28.8          27.8          25.2          23.2
Min. temperature tolerance (°C)            22.5      -             -             13.2; 20.2    12; 21.4      9.7; 23.7     8.3; 25
Northernmost latitude                      3.4       -             -             12.4; 33.5    7.7; 38.3     -1; 47        -6.1; 52.1
Trophic level                              3.1       -             -             -             2.4           4             -
Max. length (cm)                           21.6      61.7; 225.3   31.7; 255.3   27.4; 259.6   24.8; 262.2   17.7; 269.3   12.3; 274.7
Min. temp. coldest month (°C; BIO6x̄)       -7.5      -             6.3           -23.7; 2.5    -20.5; -0.7   -             -
Precipit. wettest month (mm; BIO13x̄)       111.1     -             55.9; 160.3   68.1; 148.1   78.4; 137.8   -             -

Persistence
Max. temperature tolerance (°C)            27.5      33.1          23.3; 30.3    24.1; 29.5    24.7; 28.8    -             -
Max. length (cm)                           32.2      -             40.8; 84.5    36.9; 88.4    34.7; 90.6    29.3; 96      25.7; 99.7
Mean temp. warmest quarter (°C; BIO10x̄)    21.7      -             22.7; 24.4    22.2; 24.9    21.9; 25.1    21.4; 25.7    21.1; 26
Min. temp. coldest month (°C; BIO6s²)      6.2       7.8; 16.6     6.6; 17.8     6.5; 18       6.4; 18.1     6.1; 18.3     5.9; 18.6
______________________________________________________________________________________________________________________________________

Figure 2.1. Illustrative example of rules of thumb, i.e. odds ratios (OR), derived for two species traits. Each dark dot represents the reference point, i.e. the average trait value across species in the dataset, while each triangular dot corresponds to the species s of interest.

Figure 2.2. Effect of each significant predictor on the likelihood of persistently establishing versus failing, expressed as odds ratio (OR), when gradually varying each predictor. OR equals 1 (dashed line) at each variable's mean value, marked by the corresponding point. The average values for the interaction plot (e) correspond to those of the respective main terms (b,d). The triangles indicate the OR of P. managuensis (traits) and Hawaii (environmental conditions), and their interaction (e). Very high values in (e) coincide with areas of extremely low absolute probability, so that small increases in probability produce very high OR.
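The logic behind such odds-ratio curves, and behind the paired or missing entries in a table of multiplicative risk factors, can be sketched as follows. The sketch assumes a logistic model in which a single predictor enters through a first and an optional second order term, expressed as a deviation z from its mean (so that OR = 1 at the mean), with interactions ignored. The coefficients are hypothetical placeholders in arbitrary standardized units; they are not the fitted values, and the scaling used in the original analysis is not given here.

```python
import math


def odds_ratio(z, b1, b2=0.0):
    """OR relative to the reference point z = 0 (the predictor mean)."""
    return math.exp(b1 * z + b2 * z * z)


def values_for_or_change(pct_change, b1, b2=0.0):
    """Return the z values at which OR changes by pct_change percent.

    Solves b2*z^2 + b1*z = ln(1 + pct_change/100). An empty list means
    the requested change is never reached for these coefficients.
    """
    if pct_change <= -100:
        raise ValueError("OR cannot decrease by 100% or more")
    target = math.log(1.0 + pct_change / 100.0)
    if b2 == 0.0:
        return [target / b1] if b1 != 0.0 else []
    disc = b1 * b1 + 4.0 * b2 * target
    if disc < 0.0:
        return []
    root = math.sqrt(disc)
    return sorted([(-b1 + root) / (2.0 * b2), (-b1 - root) / (2.0 * b2)])


# Hypothetical coefficients for a unimodal (hump-shaped) predictor.
b1, b2 = 1.0, -0.5
print(values_for_or_change(25, b1, b2))    # two z values bracketing the optimum
print(values_for_or_change(-50, b1, b2))   # two z values, one on each side
print(values_for_or_change(1000, b1, b2))  # []
```

Under these assumptions, a second order term yields two predictor values for most OR changes, and a large increase such as +1000% can be unreachable when the peak of the curve is too low. This would be one way to read cells that list two values or a dash.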

Connecting statement

In Chapters 1 and 2, I made extensive use of heterogeneous information, i.e. species traits, environmental conditions and propagule pressure. Virtually all sources of data in ecology present some sort of limitation, including, for example, the spatial resolution of climatic variables and establishment data, the use of proxies for propagule pressure as a measure of the number of individuals introduced, and the high prevalence of missing information in species traits databases. In Chapter 3, I address the latter. Using a subset of 20 species characteristics from the fish database FishBase, I validate the use of existing imputation frameworks and explore the use of an ensemble approach where predictions from alternative algorithms are averaged, demonstrating that it improves predictions. Further, I develop a novel methodology for imputation, based on trait Correlations, taxonomic Relatedness and Uncertainty minimization (CRU). Despite its relative simplicity, I show that CRU works better than more sophisticated approaches, and that its inclusion in model ensembles further improves accuracy. At the same time, the novel algorithm for the estimation of uncertainty included in CRU predicts the deviation between imputed and true values better than the methods associated with other imputation algorithms. Finally, I use the best model ensemble to fill in all missing information and provide a complete version of a substantial subset of FishBase, which includes more than 30,000 species.


Chapter 3

Filling in FishBase: a more powerful approach to the imputation of missing trait data.

Authors: Lidia Della Venezia and Brian Leung

A version of this chapter is under review at Ecography.


3.1 Abstract

Aim: Species traits are central in predictive ecology, being used for functional diversity indexes, species distribution modelling and as indicators of ecosystem services. Yet, large trait databases typically have high prevalence of missing values and the existing information is often suboptimally used. Continued development of imputation methods to derive better estimates of missing values would allow trait databases to be more powerfully utilized. Further, since imputed values may have different levels of reliability, estimation of uncertainty will provide crucial information to make informed decisions about imputations.

Innovation: We impute 20 continuous traits from FishBase, a popular database containing more than 33,000 fish species, with a high prevalence of missing data (~79%). We consider cutting-edge approaches, namely MICE, missForest and Phylopars. Additionally, we derive a novel, simple algorithm based on trait correlations, residual adjustments based on taxonomic proximity and uncertainty minimization (Correlation-Relatedness-Uncertainty; CRU). Next, we explore the value of an ensemble approach, based on model averaging. Finally, we introduce a hierarchical approach to estimate uncertainty for each imputed value associated with CRU, and compare its performance against existing methods.

Main conclusions: Using cross-validation of FishBase, we show that, despite its simplicity, CRU performed slightly better than missForest ($R^2 \approx 0.6$), while Phylopars yielded weaker average but heterogeneous results, and MICE performed consistently worse. CRU and missForest performed differently depending on the trait, so that an ensemble model integrating CRU and missForest provided the strongest accuracy ($R^2 \approx 0.7$). For uncertainty, CRU's hierarchical procedure accurately estimated the distribution of deviations for imputed values in the validation set, while Phylopars substantially misestimated uncertainty, and missForest provided only a single global estimate of uncertainty. Thus, we provide a more powerful general imputation procedure, a relatively unbiased uncertainty estimation method, and we make available the imputed values, filling in all missing data across 20 continuous traits of FishBase.


3.2 Introduction

The availability of large-scale and global databases is growing quickly, such that the amount of biological and environmental data currently accessible allows ecological researchers to answer questions that would have been impossible to address a decade ago. This development is particularly relevant to researchers who aim to explain broad-scale patterns and to understand global processes. Scientists who tackle major environmental issues in complex systems now have the possibility to employ data-intensive approaches to inform predictive models (Kelling et al., 2009; Luo et al., 2011). However, extensive databases are usually characterized by high levels of missing information, such that the choice of how best to use the available data remains an open question and is often determined ad hoc.

A common example is provided by global trait databases like FishBase (Froese & Pauly, 2017) and TRY (Kattge et al., 2011), which store dozens of characteristics across thousands of species, for various taxonomic groups. Traits constitute the result of adaptation, and dictate how a species responds to the environment and how it affects the entire ecosystem by interacting with other species (McGill et al., 2006). They represent the basis for the estimation of functional diversity (Tilman, 2001), which in turn can be used to study community assembly and function (Petchey & Gaston, 2002), and to inform conservation (Cadotte et al., 2011). Lately, species traits have been used to monitor biodiversity (Vandewalle et al., 2010) and have been employed in niche-based modelling to describe and predict species' abundances and distributions (Cavalli et al., 2014; Kraft et al., 2008). In addition, traits have been linked to the provision of ecosystem services (Lavorel et al., 2011), and have been employed to forecast shifts and assemblages under conditions of environmental change (Estrada et al., 2016; Gayraud et al., 2003) and in traits-based risk assessment models for invasions and extinctions (Liu et al., 2017; Pyšek et al., 2012; Van Kleunen et al., 2010).

Given the frequency of missing information in trait databases and the known effects of missing data on the accuracy of measures of functional diversity and other functional trait metrics (Májeková et al., 2016; Pakeman, 2014), optimally filling in empty values will allow these databases to be used more efficiently, until more and higher quality information is gathered. Arguably the most common approach to missing trait data consists of performing a "complete case analysis", removing each row of a dataset that contains missing values. Yet, this may exclude a great number of species from an analysis and may also introduce bias, for example when estimating the parameters of a model (Hadfield, 2008; Nakagawa & Freckleton, 2011; van Buuren, 2012). Alternatively, imputation, i.e. the replacement of missing data with plausible values, is generally encouraged as it tends to yield better results than complete case analysis (e.g. Nakagawa & Freckleton, 2008; van der Heijden et al., 2006), and it can be performed in different ways. The most common imputation strategy in ecology consists of replacing a variable's missing value with the sample mean or median (e.g., Nakagawa et al., 2001).

Much more sophisticated imputing strategies exist (Penone et al., 2014; Taugourdeau et al., 2014), relying for example on using predictions from regression analyses (e.g., Rubin, 1987; Stekhoven & Bühlmann, 2011; van Buuren & Groothuis-Oudshoorn, 2011), or exploiting phylogenetic information to facilitate imputation (e.g., Goolsby et al., 2017; Guénard et al., 2013; Swenson, 2014). However, they remain relatively underused in ecology (Nakagawa & Freckleton, 2008; Nakagawa, 2015), especially for big datasets. Alternative techniques perform differently depending on the trait considered and on the percentage of missing data (e.g., Ellington et al., 2015; Penone et al., 2014; Poyatos et al., 2018; Taugourdeau et al., 2014). Even within single databases, characteristics of each imputed value could vary, such that alternative techniques could yield different estimates. As such, it would be worth assessing whether using these methods in combination in an ensemble framework (Bates & Granger, 1969; Winkler, 1989) might yield a generally more reliable estimate. Model averaging in particular might represent a promising avenue to improve the performance of imputation strategies in trait-based ecology. It consists of combining the forecasts obtained from different approaches by averaging them, and has proven to be a simple and effective way to incorporate predictions from alternative models, in order to obtain more accurate results and partially remove the noise associated with each method (Araújo & New, 2007; Dormann et al., 2018; Guo et al., 2015). Model averaging has been successfully employed by ecologists in a variety of fields (e.g. Dormann et al., 2008a; Garcia et al., 2012; Le Lay et al., 2010; Meller et al., 2014; Symonds & Moussalli, 2011; Thuiller, 2004). While the logic underlying ensemble models should apply to imputation as well, to our knowledge, it has not been used to average predictions from alternative approaches to the imputation of species trait datasets.

If ensemble models yield predictive improvements, it could also be worthwhile to derive additional approaches to imputation, even though quite sophisticated methods already exist. In this context, even simple approaches could be worthwhile, if their consideration yields even greater improvements when combined into ensembles (or if such new approaches fit better than existing ones when applied in isolation). Simple approaches would have the additional advantage of conceptual ease.

Beyond the imputation itself, some imputed values may be more reliable than others. Estimating the degree of uncertainty could provide crucial information about whether to keep (or exclude) certain imputed values, how uncertainty is propagated through subsequent statistical analyses, or how to weight or select amongst alternative imputed values. Thus, imputation approaches should ideally provide a measure of confidence for each predicted missing value. Specifically, the estimation of uncertainty should relate probabilistically to the magnitude of deviation between the imputed and true value.

In this manuscript, we focus on a set of continuous fish traits from the FishBase database (Froese & Pauly, 2017). FishBase contains information about tens of thousands of species, is visited about 700,000 times per month, and is arguably one of the most used global trait datasets, having been cited more than 8,000 times (Google Scholar). Thus, filling in FishBase is itself important, and will serve the scientific community by improving an existing popular database. At the same time, FishBase provides a clear example of the high missing rate typical of trait databases (~79% of the subset used for this study is lacking). Thus FishBase represents a concrete, highly relevant case study to analyze approaches for imputation. First, we explore the utility of ensemble models in the context of trait imputation, using three existing sophisticated approaches in the literature, namely multivariate imputation by chained equations (MICE; Rubin, 1987; van Buuren & Groothuis-Oudshoorn, 2011), missForest (Stekhoven & Bühlmann, 2011) and Phylopars (Bruggeman et al., 2009). We chose them for their good performance, computational efficiency, and capacity to fill in datasets with high missing data rates. Second, we derive a novel imputation methodology based on information from simple trait correlations, taxonomic averages, and uncertainty minimization (Correlation-Relatedness-Uncertainty; CRU). We integrate CRU with the existing approaches into an ensemble model to examine improvements in imputation, and also compare each approach individually in the context of imputing missing values in FishBase, using k-fold cross-validation. Third, we present a novel approach for the estimation of the uncertainty associated with each imputed value predicted by CRU, and compare our approach with those of the other three imputation approaches (i.e. how well does each one probabilistically predict the magnitude of deviation between imputed and true trait values?). Finally, we use the best approach identified in our analysis to fill in all the missing data in a FishBase subset, and provide a complete version of 20 continuous traits for more than 30,000 species.

3.3 Methods

3.3.1 Trait data

The global fish database FishBase (Froese & Pauly, 2017) contains trait information for more than 33,000 fish species globally, including species' morphology, range of occurrence, environmental tolerances and trophic level. FishBase represents a good example of the high prevalence of lacking information in trait datasets (e.g., Dimarchopoulou et al., 2017 on marine fish in the Mediterranean): of the traits analyzed in this study across all species in the database, about 79% of the entries were missing. Specifically, we extracted 20 primary continuous functional and life-history traits (Table 3.1) from each species' main web page, covering environmental tolerances, size, latitudinal and longitudinal ranges, weight, age and trophic level. We excluded derived traits as they could be estimated from the selected traits, and thus were redundant and collinear. We also excluded categorical variables, limiting our scope to continuous traits only. Some traits (e.g., fecundity) were excluded from the analysis since the estimates reported in FishBase could not be converted to a common measurement unit for all species.


3.3.2 Comparison of existing imputation methods

We applied three approaches that exploit existing relationships between traits to impute missing data. The first technique was multivariate imputation by chained equations (MICE; Rubin, 1987), a multiple imputation methodology that has been implemented in the R package "mice" (van Buuren & Groothuis-Oudshoorn, 2011). We selected predictive mean matching (Little, 1988) as our imputation method, since it has been used most frequently for missing traits (e.g., Baraloto et al., 2010; Di Marco et al., 2012; Fisher et al., 2003). MICE generates multiple imputed datasets to recapture the stochasticity in the relationship between traits. We imputed the FishBase dataset 5 times and retained the mean of each missing value's imputations as our final estimate of the missing entry, and their standard deviation as the associated uncertainty. Although in some circumstances more repetitions might be necessary (Graham et al., 2007), we limited the number of imputations to 5 because few imputations are considered sufficient for good predictions (Rubin, 1987; Schafer & Olsen, 1998) and because of computational time.
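The pooling step described above can be illustrated with a minimal Python sketch. It assumes the five imputations of a single missing entry are already available as a list; the function name, the numbers, and the use of the sample (rather than population) standard deviation are assumptions made for illustration and are not taken from the original analysis, which was run in R.

```python
from statistics import mean, stdev

def pool_imputations(draws):
    """Pool repeated imputations of one missing entry.

    draws: imputed values for the same entry, one per imputed dataset
           (hypothetical input).
    Returns (point estimate, uncertainty) as (mean, sample standard deviation).
    """
    if len(draws) < 2:
        raise ValueError("need at least two imputations to compute a spread")
    return mean(draws), stdev(draws)

# Five hypothetical imputations of one missing trait value (illustrative only)
draws = [12.0, 14.0, 13.0, 15.0, 11.0]
estimate, uncertainty = pool_imputations(draws)
```

Entries on which the imputed datasets agree closely receive a small standard deviation, which is what makes this spread usable as a per-value uncertainty.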

The second method was missForest, an iterative non-parametric imputation approach based on random forests (Breiman, 2001), which is effective with every type of data and remains computationally feasible on databases as large as the one considered here. We used the R package "missForest" (Stekhoven & Bühlmann, 2011) to obtain predictions for every missing value. However, this method only provides an overall estimate of uncertainty for the entire imputed dataset, in the form of the normalized root mean squared error.

The third approach we tested was Phylopars (Bruggeman et al., 2009), a method that combines information from existing data and phylogenetic information to obtain predictions for missing values. Phylopars has been implemented in the R package "Rphylopars" (Goolsby et al., 2017) and it requires the user to provide a phylogenetic tree describing the evolutionary relationships between species. This requirement was particularly limiting in our case, since phylogenetic information is available only for about 5% of species in FishBase (e.g., Betancur-R. et al., 2017). However, a taxonomic tree can be used as an alternative. Unlike missForest, Phylopars provides uncertainty estimates for each imputed value.

All predictions from the alternative models were truncated to the observed trait range across all species, for consistency with our novel imputation method (see below).

3.3.3 Novel imputation protocol: CRU

To build a novel imputation approach, we considered that 1) traits are often correlated with one another, 2) taxonomically and phylogenetically related species should be more similar, and 3) the strength of trait correlations and taxonomic similarities may differ across species, and we could use measures of uncertainty to choose between models for imputation. Therefore, we developed a protocol to predict missing values based on trait correlations, species relatedness, and uncertainty predicted for each imputed value (Correlation-Relatedness-Uncertainty; CRU).

Finding Correlations

The first step was to predict each missing value from the other traits using multiple regression, as follows:

$$t_{i,j} = \beta_0 + \sum_{k \neq j} \beta_k\, t_{i,k} + \varepsilon \qquad (3.1)$$

For each missing trait $t_{i,j}$ of species i, we examined the relation with all the other traits $t_{i,1}, t_{i,2}, \ldots, t_{i,K}$ with available data for species i. This was done by regressing values for all other species with non-missing values for both $t_j$ and $t_k$ (complete cases), using multiple regression. Subsequently, using a backward selection approach (Halinski & Feldt, 1970), the variable with the largest p-value was removed and the regression performed again, using all other species with non-missing values (i.e. as more traits were removed, more species would be complete cases for the remaining predictor traits). The process was repeated until all the predictors included were significant (p < 0.05). In eqn. 3.1, $\beta_0$ was the intercept, $\beta_k$ was the slope of the k-th predictor, and $\varepsilon$ was the unexplained error.

Combining Taxonomy

We considered that species belonging to the same taxon could be on average more similar to one another than to species pertaining to other taxa. After standardizing genus and family for each species by retrieving information from the Integrated Taxonomic Information System (ITIS; http://www.itis.gov/info.html) database through the R package "taxize" (Chamberlain & Szöcs, 2013), the simplest way of accounting for within-group similarities among related species would be to use dummy variables for taxonomic groups. However, this procedure was exceedingly slow and excluded many groups given the high rate of missing data, making dummy variables less powerful.

One could instead apply a simple "taxonomic adjustment" for the group of interest. Specifically, for each missing value $t_{i,j}$, we first predicted the expected average missing trait value $\bar{t}^{\,\mathrm{exp}}_{g,j}$ across species within the same taxonomic group g as follows:

$$\bar{t}^{\,\mathrm{exp}}_{g,j} = b_0 + \sum_{k} b_k\, \bar{t}_{g,k} \qquad (3.2)$$

where $b_0$ and $b_k$ were the fitted parameters from the multiple linear regression (eqn. 3.1), and $\bar{t}_{g,k} = \sum_i t_{i,k} / N_{g,k}$ was the average within-group observed trait $t_k$. Note that $\bar{t}_{g,k}$ was calculated across all $N_{g,k}$ species belonging to taxonomic group g for which trait $t_k$ was known, and thus included incomplete case species not included in the regression. The taxonomic adjustment corrected each missing value prediction from the regression by an amount corresponding to the difference between the observed mean trait $\bar{t}_{g,j} = \sum_i t_{i,j} / N_{g,j}$ and the expected average value $\bar{t}^{\,\mathrm{exp}}_{g,j}$, for taxon g, as follows:

$$\mathit{adjustment} = \bar{t}_{g,j} - \bar{t}^{\,\mathrm{exp}}_{g,j} \qquad (3.3)$$

$$\hat{t}^{\,\mathrm{adj}}_{i,j} = \hat{t}_{i,j} + \mathit{adjustment} \qquad (3.4)$$

where, for species i, $\hat{t}_{i,j}$ was the expected missing trait based on the regression and $\hat{t}^{\,\mathrm{adj}}_{i,j}$ was the taxonomically-adjusted expected trait value. In theory, this adjusting approach could be applied to phylogenetic differences too (Grafen, 1989; Martins & Hansen, 1997). Thus, for generality, we formulated a phylogenetic adjustment model, based on inverse weighting to phylogenetic distance, but we do not present it further as phylogenetic data were only available for ~5% of the FishBase species (Betancur-R. et al., 2017) and did not appreciably improve predictions (see Appendix C).
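As a worked illustration of the taxonomic adjustment (eqns 3.2 to 3.4), the following Python sketch computes the expected and observed group means and shifts a regression prediction accordingly. The function names, the trait names ("length", "age"), the coefficients and the data are all hypothetical and chosen only so the arithmetic is easy to follow; this is a sketch of the idea, not the code used in the study.

```python
def group_mean(values):
    """Mean of the known (non-None) values of one trait within a taxonomic group."""
    known = [v for v in values if v is not None]
    return sum(known) / len(known)

def taxonomic_adjustment(b0, slopes, group_traits, target):
    """Observed group mean minus regression-expected group mean (eqns 3.2 and 3.3).

    b0, slopes   : fitted intercept and slopes {predictor name: b_k} from eqn 3.1
    group_traits : {trait name: values for each species in group g, None if missing}
    target       : name of the trait being imputed
    """
    # Eqn 3.2: plug the group means of the predictors into the fitted regression
    expected = b0 + sum(b * group_mean(group_traits[k]) for k, b in slopes.items())
    observed = group_mean(group_traits[target])
    return observed - expected  # eqn 3.3

# Hypothetical fitted model and genus-level data (illustrative numbers only)
b0, slopes = 1.0, {"length": 0.5}
genus = {"length": [10.0, 20.0, None], "age": [7.0, None, 12.0]}

adj = taxonomic_adjustment(b0, slopes, genus, "age")
regression_prediction = b0 + slopes["length"] * 16.0   # species i has length 16
adjusted_prediction = regression_prediction + adj      # eqn 3.4
```

In this toy genus the regression expects a mean age of 8.5 while the observed mean is 9.5, so every regression prediction for the genus is shifted upward by 1.0. Note that each group mean uses all species with a known value for that trait, including incomplete cases.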

When no other species within taxonomic group g had information for our trait of interest, the adjustment in equation 3.3 could not be calculated and trait correlations only, i.e. the expected $\hat{t}_{i,j}$ from the multiple regression (eqn. 3.1; Fig. 3.1), were used. Alternatively, when no other traits were known for species i, our predictions simplified to the average trait of interest $\bar{t}_{g,j}$ across the other species in group g (either genus or family), thus using only taxonomic information (i.e. Relatedness; Fig. 3.1). Finally, when no information from trait correlations and taxonomy was available, we used the overall mean for that trait. All predictions were truncated to the range observed across all species in the database.
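The fallback logic in the paragraph above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: unavailable information is represented by None, the genus and family ranks are collapsed into a single "group", and the function and label names are hypothetical. It does not include the uncertainty-based selection between candidate predictions.

```python
def clamp(x, lo, hi):
    """Truncate x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))

def impute(regression_pred, adjustment, group_mean, overall_mean, trait_range):
    """Return (imputed value, information source) for one missing entry.

    regression_pred : prediction from trait correlations, or None if no other
                      trait is known for the species
    adjustment      : taxonomic adjustment, or None if no other species in the
                      group has the trait of interest
    group_mean      : observed mean of the trait within the group, or None
    overall_mean    : mean of the trait across all species
    trait_range     : (min, max) observed across all species
    """
    if regression_pred is not None and adjustment is not None:
        value, source = regression_pred + adjustment, "correlation+relatedness"
    elif regression_pred is not None:
        value, source = regression_pred, "correlation"
    elif group_mean is not None:
        value, source = group_mean, "relatedness"
    else:
        value, source = overall_mean, "overall mean"
    lo, hi = trait_range
    return clamp(value, lo, hi), source
```

For example, `impute(9.0, 1.0, 9.5, 8.0, (0.0, 9.5))` combines both sources to get 10.0 and then truncates it to the observed maximum of 9.5.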

Estimating Uncertainty

Each imputed value was based on different regressions, amounts of data, and adjustments. Therefore, uncertainty should differ for different traits and species and we estimated it for every predicted missing value.

As a measure of the uncertainty associated with each prediction, the simplest option was to use the root-mean-square deviation (RMSD; Table 3.2), i.e. the standard deviation of the residuals of the relationship between predicted and observed values. However, RMSD did not account for the fact that some imputations for some groups might be more reliable than others, and the residual variation might differ between taxonomic groups. While parameter estimation from multiple regression is robust to inequality of variance (Johnston, 1963; White, 1980), there could be additional benefit in more finely resolving the degree of uncertainty for each imputed value. We therefore defined $\varepsilon_g$ as the within-group standard deviation (i.e. the standard deviation of the residuals of observed versus predicted trait values for group g), and constructed a hierarchical model of uncertainty (Table 3.2). Specifically, we assumed that the $\varepsilon_g$ values came from a lognormal distribution, with parameters $\mu$ and $\sigma$:

$$f(\varepsilon_g \mid \mu, \sigma) = \frac{1}{\varepsilon_g\, \sigma \sqrt{2\pi}} \exp\!\left(-\frac{(\ln \varepsilon_g - \mu)^2}{2\sigma^2}\right), \quad \varepsilon_g > 0 \qquad (3.5)$$

We considered alternatives, namely a normal distribution truncated at zero and a gamma distribution, but these fitted worse in the majority of cases (>80%) based on the model AICs (Akaike Information Criterion; Akaike, 1974), and therefore were not considered further.

$\mu$ was defined as (Mood et al., 1974):

$$\mu = \ln(m) - \frac{\sigma^2}{2} \qquad (3.6)$$

where $m$ represents the mean of the non-logarithmized distribution, i.e. the average residual variation across taxonomic groups (the mean of a lognormal distribution is $\exp(\mu + \sigma^2/2)$, which rearranges to eqn. 3.6). We used the RMSD to approximate m, given that it represented the average residual variation across all species included in the analysis, while we estimated $\sigma$ using maximum likelihood estimation (MLE). The likelihood function for $\sigma$ was:

$$L(\sigma \mid m, r) = \prod_g \int_0^{\infty} p(\varepsilon_g \mid \sigma, m) \prod_i p(r_{i,g} \mid \varepsilon_g)\, d\varepsilon_g \qquad (3.7)$$

We define $\hat{\sigma}$ as the maximum likelihood estimate of $\sigma$ (Table 3.2), i.e. the value that maximized the joint likelihood of observing the within-group residuals $r_{i,g}$ across all species i and groups g, given the within-group residual standard deviation $\varepsilon_g$, multiplied by the probability of $\varepsilon_g$ given parameters $\sigma$ and m. The residuals $r_{i,g}$ were modelled as normally distributed with a mean of zero and standard deviation $\varepsilon_g$. The integral across $\varepsilon_g$ values was approximated numerically (eqn. 3.7).

Once $\hat{\sigma}$ was determined via MLE, we calculated $\hat{\varepsilon}_g$, the estimated value of $\varepsilon_g$ for each taxonomic group g, as the expected value weighted by the likelihood of $\varepsilon_g$:

$$\hat{\varepsilon}_g = \frac{1}{a} \int_0^{\infty} \varepsilon_g\, p(\varepsilon_g \mid \hat{\sigma}, m) \prod_i p(r_{i,g} \mid \varepsilon_g)\, d\varepsilon_g \qquad (3.8)$$

where $\hat{\varepsilon}_g$ was the $\varepsilon_g$ estimate for group g given the observed within-group residuals $r_{i,g}$, the hierarchical parameter estimate $\hat{\sigma}$, and the normalizing constant a:

$$a = \int_0^{\infty} p(\varepsilon_g \mid \hat{\sigma}, m) \prod_i p(r_{i,g} \mid \varepsilon_g)\, d\varepsilon_g \qquad (3.9)$$

Henceforth, we term this the Hierarchical Uncertainty Estimation (HUE) model. When two or more species within taxonomic group g had information for our missing trait of interest $t_j$, HUE allowed us to estimate the uncertainty for each imputed value. When only a single species in group g was present, there would not be enough information for HUE to calculate the uncertainty. In these cases, RMSD was the best measure we could obtain (i.e. the mean uncertainty across all groups). Finally, when no other trait or taxonomic information was available to predict a missing value, uncertainty was measured as the total trait variability across species.
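To make the hierarchical estimation more concrete, the sketch below implements a lognormal prior on the within-group standard deviation, a normal likelihood for the residuals, a numerically integrated marginal likelihood for σ, and the likelihood-weighted expected value per group. Several choices are assumptions made purely for illustration and are not taken from the original analysis: the trapezoid rule on a fixed grid, the grid search over candidate σ values instead of a proper optimizer, the grid bounds, and the two toy groups of residuals.

```python
import math

def lognormal_pdf(eps, mu, sigma):
    """Lognormal density of the within-group standard deviation."""
    return math.exp(-(math.log(eps) - mu) ** 2 / (2 * sigma ** 2)) / (
        eps * sigma * math.sqrt(2 * math.pi))

def normal_pdf(r, sd):
    """Density of a zero-mean normal residual with standard deviation sd."""
    return math.exp(-r ** 2 / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def integrand(residuals, sigma, m, grid):
    """p(eps | sigma, m) * prod_i p(r_i | eps), evaluated on a grid of eps."""
    mu = math.log(m) - sigma ** 2 / 2  # ties the lognormal mean to m
    out = []
    for eps in grid:
        value = lognormal_pdf(eps, mu, sigma)
        for r in residuals:
            value *= normal_pdf(r, eps)
        out.append(value)
    return out

def trapezoid(ys, xs):
    """Trapezoid-rule approximation of the integral of ys over xs."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(xs) - 1))

def log_likelihood(sigma, m, groups, grid):
    """Sum over groups of the log marginal likelihood of their residuals."""
    return sum(math.log(trapezoid(integrand(r, sigma, m, grid), grid))
               for r in groups.values())

def eps_hat(residuals, sigma, m, grid):
    """Likelihood-weighted expected value of eps for one group."""
    w = integrand(residuals, sigma, m, grid)
    a = trapezoid(w, grid)  # normalizing constant
    return trapezoid([e * wi for e, wi in zip(grid, w)], grid) / a

# Hypothetical within-group residuals (illustrative only)
groups = {"tight_genus": [0.1, -0.2, 0.15], "loose_genus": [1.5, -2.0, 1.8]}
pooled = [r for rs in groups.values() for r in rs]
m = math.sqrt(sum(r * r for r in pooled) / len(pooled))  # RMSD as the mean
grid = [0.01 * k for k in range(1, 1001)]                # eps from 0.01 to 10
candidates = [0.1 * k for k in range(1, 31)]             # sigma from 0.1 to 3
sigma_hat = max(candidates, key=lambda s: log_likelihood(s, m, groups, grid))
estimates = {g: eps_hat(r, sigma_hat, m, grid) for g, r in groups.items()}
```

With only a few residuals per group, each estimate is pulled toward the overall mean m by the shared lognormal distribution, while still being larger for the group with the more scattered residuals. This shrinkage is the practical difference from computing a separate standard deviation per group.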

Selection based on uncertainty

Given that CRU could be calculated with correlations and taxonomy, separately or in combination, at the genus or family rank (Fig. 3.1), that taxonomic groups could differ in their uncertainty, and that the amount of data could differ for each approach, we reasoned that the alternative ways of calculating CRU could also differ in uncertainty. Thus, to obtain our final imputed CRU dataset, we used the uncertainty estimates from HUE as a selection criterion, choosing the predictions with the lowest estimated uncertainty. We tested whether using uncertainty predictively would yield an improvement in the general accuracy of imputed values, using a k-fold validation procedure (see below). We compared CRU against an alternative approach, where the missing data were filled in a step-wise fashion using the most information available, based on the assumption that more and better resolved information would result in better predictions. In practice, this meant using the estimates obtained from both correlations and taxonomy at the genus level as a first step, followed by correlations-taxonomy at the family rank, correlations only, and taxonomy only, at the genus and family level (Fig. 3.1). This alternative approach excluded the uncertainty-based selection step and was thus denoted as CR (Correlation-Relatedness).
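The difference between the two selection rules can be shown with a short Python sketch. The candidate labels, values and uncertainties below are hypothetical; the ordering list simply encodes the step-wise sequence described in the text.

```python
# Step-wise information order used by the CR alternative (labels are hypothetical)
CR_ORDER = ["corr+genus", "corr+family", "corr", "genus", "family"]

def select_cru(candidates):
    """Pick the candidate with the lowest estimated uncertainty.

    candidates: list of (label, predicted value, estimated uncertainty).
    """
    return min(candidates, key=lambda c: c[2])

def select_cr(candidates, order=CR_ORDER):
    """Pick the first available candidate in a fixed information order."""
    by_label = {c[0]: c for c in candidates}
    for label in order:
        if label in by_label:
            return by_label[label]
    raise ValueError("no candidate prediction available")

# Hypothetical candidate predictions for one missing entry
candidates = [
    ("corr+genus", 4.2, 0.9),   # small genus, so a noisy adjustment
    ("corr+family", 4.6, 0.5),
    ("corr", 5.0, 0.7),
]
cru_choice = select_cru(candidates)
cr_choice = select_cr(candidates)
```

Here the two rules disagree: CR takes the genus-level prediction because it uses the most finely resolved information, whereas CRU prefers the family-level prediction because its estimated uncertainty is lower.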

3.3.4 Model averaging and gap-filling

We explored the utility of an ensemble approach based on averaging model predictions to evaluate whether it would improve the accuracy of the imputations. We calculated the average predictions of all the possible combinations of MICE, missForest, Phylopars, and our novel CRU algorithm. We evaluated accuracy using k-fold cross-validation (see below). We used the best ensemble predictions to fill in the 79% missing data in FishBase.
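A minimal Python sketch of the averaging step follows, assuming each method's predictions for the same missing entries are stored in equal-length lists. The prediction values are invented for illustration, and the unweighted mean over combinations of two or more methods is an assumption about how "all the possible combinations" were formed.

```python
from itertools import combinations

def ensemble_predictions(preds):
    """Average predictions over every combination of two or more methods.

    preds: {method name: [prediction per missing entry]}, equal-length lists.
    Returns {tuple of method names: [unweighted mean prediction per entry]}.
    """
    names = sorted(preds)
    ensembles = {}
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            columns = zip(*(preds[name] for name in combo))
            ensembles[combo] = [sum(values) / size for values in columns]
    return ensembles

# Hypothetical predictions for two missing entries (illustrative only)
preds = {
    "CRU": [2.0, 4.0],
    "MICE": [3.0, 5.0],
    "Phylopars": [1.0, 9.0],
    "missForest": [4.0, 6.0],
}
ensembles = ensemble_predictions(preds)
```

With four methods this yields 11 candidate ensembles (six pairs, four triplets and one four-way average), each of which can then be scored by cross-validation alongside the single models.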

3.3.5 Validation procedure

We considered both the accuracy of the imputed values of the alternative imputation approaches as well as the validity of the uncertainty estimates.


First, for each imputation, we tested performance using a k-fold cross-validation approach, where at each step we removed and predicted 0.5% of data, to retain the majority of the data for prediction (given the high missing rate of FishBase), while remaining computationally feasible (as opposed to jackknifing each imputed value). For each method, the resulting sets of predictions were pooled to represent the entirety of our dataset. We evaluated the performance of each method by calculating the $R^2_{\mathrm{MSE}}$ (MSE: Mean Squared Error) of the linear relationship between predicted ($\hat{y}$) and true values ($y$), when this relationship is forced to the 1:1 line. In this way, we obtained a measure of their ability to correctly predict missing data. The $R^2_{\mathrm{MSE}}$ was calculated as follows:

$$R^2_{\mathrm{MSE}} = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} \qquad (3.10)$$

and it was always equal to or smaller than the linear regression $R^2$.
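The following Python sketch illustrates the two ingredients of this validation step: splitting the observed values into small hold-out folds, and scoring pooled predictions against the 1:1 line. The function names, the random fold assignment and the toy numbers are assumptions for illustration; in particular, holding out each observed value exactly once (which gives about 200 folds at 0.5% per fold) is an assumption of this sketch.

```python
import random

def make_folds(n_values, fraction=0.005, seed=0):
    """Randomly partition indices 0..n_values-1 into folds of about `fraction` each."""
    idx = list(range(n_values))
    random.Random(seed).shuffle(idx)
    size = max(1, round(n_values * fraction))
    return [idx[i:i + size] for i in range(0, n_values, size)]

def r2_mse(y_true, y_pred):
    """R2 relative to the 1:1 line: 1 - SS_residual / SS_total (eqn 3.10)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

def r2_regression(y_true, y_pred):
    """Ordinary regression R2, i.e. the squared Pearson correlation."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((y - mt) * (p - mp) for y, p in zip(y_true, y_pred))
    var_t = sum((y - mt) ** 2 for y in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)

folds = make_folds(1000)                      # 200 folds of 5 values each
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
biased = [2.0, 3.0, 4.0, 5.0, 6.0]            # perfectly correlated, but offset by +1
reversed_pred = [5.0, 4.0, 3.0, 2.0, 1.0]     # systematically wrong predictions
```

The biased predictions show why this measure is stricter than the regression $R^2$: they correlate perfectly with the true values (regression $R^2$ = 1) but score only 0.5 against the 1:1 line. The reversed predictions score -3, illustrating that the measure can be negative when predictions are worse than simply using the mean.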

Next, we assessed whether HUE provided a reasonable measure of uncertainty, and whether it estimated uncertainty more accurately than RMSD alone (i.e. if the added complexity was worthwhile). We used P-P plots to contrast the empirical residuals' percentiles and the theoretical percentiles, given the specific error estimate ($\hat{\varepsilon}_g$ or RMSD) for each residual. Theoretically, we expected that 1% of the residuals would fall below the first percentile of the cumulative distribution function described by the uncertainty estimate, 2% would fall below the second percentile, and so on. If our estimates correctly recaptured the variability in the within-group residuals, the plots would follow the 1:1 line between expected percentile and the proportion of times the observed residuals fall into each of these percentiles. We performed this analysis for HUE, as well as for MICE and Phylopars, to evaluate if they provided accurate estimates of uncertainty for each imputed value. missForest was excluded since it only supplied an overall measure of uncertainty.
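The calibration check behind such a P-P plot can be sketched in Python as follows. The sketch assumes each residual is compared against a zero-mean normal distribution whose standard deviation is that residual's own uncertainty estimate; the simulated residuals, the seed and the nine percentile levels are illustrative choices, not values from the study.

```python
import random
from statistics import NormalDist

def pp_curve(residuals, sds, levels):
    """Observed share of residuals falling below each nominal percentile.

    Each residual is converted to its theoretical percentile under N(0, sd_i),
    where sd_i is the uncertainty estimate attached to that residual.
    """
    probs = [NormalDist(0.0, sd).cdf(r) for r, sd in zip(residuals, sds)]
    n = len(probs)
    return [sum(p <= q for p in probs) / n for q in levels]

# Simulated residuals whose true spread differs between imputed values
rng = random.Random(1)
true_sds = [rng.uniform(0.5, 2.0) for _ in range(2000)]
residuals = [rng.gauss(0.0, sd) for sd in true_sds]
levels = [k / 10 for k in range(1, 10)]

# Well-calibrated uncertainty: the observed shares track the nominal levels
calibrated = pp_curve(residuals, true_sds, levels)
# Underestimated uncertainty: too many residuals land in the tails
overconfident = pp_curve(residuals, [sd / 2 for sd in true_sds], levels)
```

Plotting `levels` against `calibrated` gives points close to the 1:1 line, while `overconfident` lies above the line at low percentiles and below it at high percentiles, which is the signature of underestimated uncertainty.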


All the analyses and data manipulation were performed in the R statistical and programming environment (R Core Team, 2017).

3.4 Results

3.4.1 Performance of the imputation models

The cross-validation showed that the full CRU algorithm provided advantages in terms of predictive power (average $R^2_{\mathrm{MSE}}$ = 0.65) with respect to CR (average $R^2_{\mathrm{MSE}}$ = 0.61), meaning that selecting the value with the least estimated uncertainty as the best estimate for imputation yielded a noticeable improvement. CRU also performed better than the other methods analyzed individually (Table 3.3).

The second best single model was missForest, with average $R^2_{\mathrm{MSE}}$ = 0.62, followed by MICE (average $R^2_{\mathrm{MSE}}$ = 0.49) and Phylopars (average $R^2_{\mathrm{MSE}}$ = 0.24; Table 3.3). However, the performance of each method depended strongly on the specific trait (see Table 3.4 for a detailed breakdown of the $R^2_{\mathrm{MSE}}$ values by trait). For example, despite being the two best models overall, the $R^2_{\mathrm{MSE}}$ by trait ranged from 0.34 to 0.85 for CRU, and from 0.15 to 0.88 for missForest (Table 3.4). This discrepancy between traits was particularly evident for Phylopars, which predicted missing values better than the alternative models for certain traits (maximum $R^2_{\mathrm{MSE}}$ = 0.81), but performed very poorly for others (minimum $R^2_{\mathrm{MSE}}$ = -0.69), thus decreasing the overall performance of the method (Table 3.4).

The use of an ensemble approach, by contrast, proved beneficial for imputation, generally performing better than the single components applied separately (Table 3.3). More specifically, by averaging the predictions from missForest and CRU, i.e. the two best single models, we obtained a further 5% increase in overall accuracy ($R^2_{\mathrm{MSE}}$ = 0.70; Table 3.3). The missForest-CRU ensemble approach was thus chosen as the best candidate to fill in all the missing data in our FishBase subset. Interestingly, we did not observe a correlation between accuracy (i.e. $R^2_{\mathrm{MSE}}$) and the amount of missing data by trait. We found both cases in which the imputation of a trait with very little information performed relatively poorly, and others in which the imputed values were accurate even with high missingness. For example, both maximum age and length at first maturity had about 96% of information missing, with corresponding $R^2_{\mathrm{MSE}}$ values of 0.42 and 0.9, respectively (Tables 3.1 and 3.4). Using the missForest-CRU ensemble improved predictions by up to 7% for the majority of species traits. The approach performed slightly worse only in the very few cases in which one of the two models did not predict well (e.g., longitudes, trophic level; Table 3.4).

3.4.2 Validation of the uncertainty estimates

Incorporating a random effect based on taxonomy (i.e. the HUE algorithm) estimated uncertainty better than just using the RMSD (Fig. 3.2a,b). Percentiles departed more strongly from the 1:1 line (where the observed magnitudes of deviation matched the expected distributions) when the RMSD was used, while they were close to the expected values when we used HUE, showing that the HUE model largely predicted the distribution of residuals. Thus, we found measurable and substantial benefits in using the hierarchical model in estimating uncertainty. We note, though, that we necessarily limited our estimation of uncertainty to taxonomic groups for which we had more than one residual. When taxonomic information was lacking, the RMSD was used as our measure of uncertainty. This was necessary for about 2% of the imputed data.

Like HUE, the estimates of uncertainty derived for MICE were also able to approximately recapture the observed distribution of residuals (Fig. 3.2c), though slightly worse than HUE. On the other hand, Phylopars was less accurate, deviating consistently from the 1:1 line and showing a tendency to frequently underestimate uncertainty (i.e. a high proportion of observed deviations fell in the tails of the distribution, Fig. 3.2d). Therefore, HUE was the most accurate approach for the estimation of the uncertainty associated with each imputed value among the methods considered.

3.4.3 Filling in FishBase

CRU was able to fill in 99.8% of the 20 continuous FishBase traits analyzed using either trait correlations, taxonomy or a combination of the two, while only about 0.2% of the missing data had to be filled using the overall mean by trait. 99.6% of the species had data for at least one trait, 89.7% for at least two traits and 61.8% reported three traits or more, while no species had information for all the selected traits, and only 130 species had no information at all.

Taxonomically, about 50% of the dataset at the genus rank and around 86% at the family rank had measures of trait information for more than one species, which could be used to impute values.

For uncertainty, we found that the average uncertainty within taxonomic groups decreased sharply as the number of species in each group increased, at both the genus and the family rank (Fig. 3.3). Additionally, across traits, the most uncertain species were those classified on FishBase as "polar", "boreal" and "deep-water". For example, the families with the highest average uncertainty included both those comprising deep-sea organisms (e.g., Parabembridae, Regalecidae) and those represented by only one or very few species (e.g., Bathysauroididae, Geotriidae, Kryptoglanidae). Similarly, at the genus rank, some of the highest uncertainty values were observed in genera distributed at great depths (e.g., Ataxolepis, Lophotus), genera including only one species (e.g., Austroglanis, Gymnarchus), or both (e.g., Leucobrotula, Matsuichthys). Finally, missing trait predictions tended to have a smaller associated uncertainty for marine species than for freshwater and brackish species.

3.5 Discussion

3.5.1 Model comparison and ensemble imputation

Missing data affect virtually all trait databases. FishBase (and other trait databases) could be substantially more powerful if this limitation were addressed. We do so here by providing a new imputation approach (CRU), comparing it to existing ones, and showing how a simple ensemble imputation strategy improved predictive performance on validation data when applied to the popular database FishBase (overall R²_MSE = 0.7).

For FishBase, the imputation approaches did not perform equally. CRU performed better than the alternative approaches, followed by missForest. MICE, on the other hand, consistently performed worse than either CRU or missForest. This might be due to its sensitivity to high percentages of missing data (Penone et al., 2014; Schrodt et al., 2015) and to multicollinearity among traits (Van Buuren & Groothuis-Oudshoorn, 2011). While Phylopars also performed worse, there appeared to be elements worth future exploration: specifically, while it performed poorly for certain traits, for others Phylopars predicted better than the other approaches. The predictiveness of Phylopars might have been affected by the strength of the phylogenetic (taxonomic) signal in each trait, which could influence the efficacy of the imputation (e.g., Kamilar & Cooper, 2013; Molina-Venegas et al., 2018). However, at this time, we have no recommendation on how to determine a priori for which traits or under which conditions Phylopars will perform poorly, and therefore we did not use Phylopars to fill in FishBase. Nonetheless, the results suggest that predictive improvements using Phylopars could be possible. Full phylogenetic information was not available for FishBase, but might improve Phylopars predictions for other trait datasets.

The fact that different methods predicted better for different traits justified our choice to investigate an ensemble approach based on model averaging, which improved predictions in our cross-validation process. The improvement with respect to each approach used independently was evident for almost all the ensembles. However, the best ensemble included only the two best performing single models (see also Dormann et al., 2018 on model averaging), namely CRU and missForest in our case. More generally, ensemble imputation could be further improved by the inclusion of more alternative well-behaved models. For example, we had considered two other approaches to the imputation of our dataset, namely BHPMF (Bayesian hierarchical probabilistic matrix factorization; Schrodt et al., 2015) and PEM (phylogenetic eigenvector maps; Guénard et al., 2013). These methods were not included in the analysis because of their excessive computational and memory burden when applied to our dataset, especially given the high number of iterations in our validation. However, they could be valuable candidates for the ensemble imputation of other trait datasets.
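As a rough illustration of unweighted model averaging, the Python sketch below averages the predictions of two hypothetical models and scores them on held-out values. The definition of R²_MSE used here (one minus the mean squared error divided by the variance of the observed values) is an assumption made for the example, as are the function names and the toy numbers.

```python
from statistics import fmean

def r2_mse(observed, predicted):
    """1 - MSE / variance of the observed values (assumed definition)."""
    mean_obs = fmean(observed)
    mse = fmean((o - p) ** 2 for o, p in zip(observed, predicted))
    var = fmean((o - mean_obs) ** 2 for o in observed)
    return 1.0 - mse / var

def ensemble_mean(*predictions):
    """Unweighted average of the predictions from several models."""
    return [fmean(vals) for vals in zip(*predictions)]

# Hypothetical held-out trait values and predictions from two models
# whose errors partly cancel out.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
model_a = [1.4, 1.7, 3.5, 3.6, 5.3]
model_b = [0.7, 2.4, 2.6, 4.5, 4.6]

combined = ensemble_mean(model_a, model_b)
scores = {
    "model_a": r2_mse(observed, model_a),
    "model_b": r2_mse(observed, model_b),
    "ensemble": r2_mse(observed, combined),
}
```

Averaging helps here because the two sets of errors point in opposite directions. Under the assumed definition, a score below zero indicates predictions that are worse than simply using the observed mean.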

3.5.2 Uncertainty estimation

Beyond predicting missing values, understanding the level of uncertainty of each estimate is also important. Uncertainty is pervasive in ecology, and critical to the interpretation of ecological predictions (Ascough II et al., 2008; Clark et al., 2001; Regan et al., 2002). As part of the CRU approach proposed here, we derived HUE, an uncertainty estimation procedure, and we demonstrated that it predicted the magnitude of deviation between imputed and true values better than the alternative methods analysed. HUE could serve several purposes. In addition to providing a measure of reliability for each imputed datum, HUE's uncertainty estimates offer an appropriate measure to select data for further studies. For instance, it may be appropriate to drop imputed values with high levels of uncertainty. Data characterized by strong uncertainty could affect the quality of a model, making predictions unreliable (e.g., Dormann et al., 2008b), although the threshold for inclusion remains an open question. Alternatively, all imputed data could be retained, and the associated uncertainty propagated in further analyses (e.g., Brown, 2010; Hastings et al., 2010). Additionally, high levels of uncertainty could help prioritize future data collection. Finally, the uncertainty estimates themselves could help the imputation process where there are multiple candidate models under consideration.
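The two options mentioned above, dropping highly uncertain imputed values or propagating their uncertainty, could look as follows in a minimal Python sketch. The imputed values, the cut-off `MAX_SD` and the downstream statistic (a simple mean) are all hypothetical, and the Monte Carlo step assumes independent, normally distributed errors.

```python
import random
from statistics import fmean, stdev

# Hypothetical imputed values for one trait: (estimate, standard deviation).
imputed = [(12.0, 0.5), (8.5, 4.0), (15.2, 1.0), (9.8, 0.8)]

# Option 1: keep only values below an arbitrary uncertainty cut-off.
MAX_SD = 2.0  # illustrative only; the appropriate threshold is an open question
retained = [(value, sd) for value, sd in imputed if sd <= MAX_SD]

# Option 2: keep everything and propagate the uncertainty by Monte Carlo,
# redrawing each imputed value from Normal(estimate, sd) and recomputing
# the downstream statistic for every draw.
random.seed(0)
draws = [
    fmean(random.gauss(value, sd) for value, sd in imputed)
    for _ in range(5000)
]
mc_mean = fmean(draws)   # centre of the propagated distribution
mc_sd = stdev(draws)     # spread induced by the imputation uncertainty
```

The spread `mc_sd` summarizes how much the downstream result could move because of imputation uncertainty alone, which is information that is lost when the imputed values are treated as if they were observed.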

We demonstrated that by making uncertainty part of the CRU procedure, we improved predictive performance, without any additional data requirements. This is analogous to model selection procedures (Johnson & Omland, 2004), but applied to each individual data point.

Unlike canonical model selection methods, here each model had different sets of missing species and traits, and thus varying amounts of information, which made the models not directly comparable using standard metrics like AIC (Akaike, 1974). However, HUE allowed us to select the best CRU values to impute based on the lowest associated amount of uncertainty.
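The per-datum selection rule can be sketched in a few lines of Python. The candidate labels follow the sources of information named in the text (trait correlations, taxonomy, or both), while the numerical values and variable names are hypothetical.

```python
# Hypothetical candidate estimates for a single missing datum:
# source of information -> (imputed value, estimated uncertainty).
candidates = {
    "trait correlations": (12.4, 3.1),
    "taxonomy": (10.9, 1.8),
    "correlations + taxonomy": (11.3, 2.2),
}

# Keep the candidate with the lowest associated uncertainty.
best_source, (best_value, best_sd) = min(
    candidates.items(), key=lambda item: item[1][1]
)
```

Because the choice is made separately for every missing datum, different data points can end up being filled by different candidate models.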

The same logic could be applied to alternative predictions from different imputation models, although here this procedure could not be employed since missForest only provided an overall uncertainty measure for the entire dataset and Phylopars tended to misestimate uncertainties. However, if accurate estimates of uncertainty were derived for each imputed datum and for each imputation methodology in the ensemble, such uncertainties could theoretically be used as a selection criterion towards improved imputation. For instance, the uncertainty estimates could be used as a weighting factor during model averaging, assuming that the smaller the uncertainty, the more reliable the prediction from a specific model. For the time being, in the context of an ensemble approach to imputation, we used unweighted averages, which worked well for our dataset and in other fields of ecology (Marmion et al., 2009; Dormann et al., 2018; Rapacciuolo et al., 2012).
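One possible form of such a weighting, which was not used in this chapter, is inverse-variance weighting. The Python sketch below is only an illustration under stated assumptions: each model supplies a reliable standard deviation for the datum, the errors of the models are independent, and the function name and numbers are hypothetical.

```python
def inverse_variance_average(estimates, sds):
    """Weight each model's estimate by 1 / sd**2, so that more certain
    predictions contribute more to the combined value."""
    weights = [1.0 / (sd * sd) for sd in sds]
    total = sum(weights)
    mean = sum(w * e for w, e in zip(weights, estimates)) / total
    # Standard deviation of the weighted mean; only valid if the
    # errors of the models are independent.
    combined_sd = (1.0 / total) ** 0.5
    return mean, combined_sd

# Hypothetical predictions for one missing datum from two models.
estimates = [10.0, 14.0]
sds = [1.0, 2.0]

weighted, weighted_sd = inverse_variance_average(estimates, sds)
unweighted = sum(estimates) / len(estimates)
```

Compared with the unweighted average, the weighted value is pulled towards the prediction with the smaller uncertainty. Whether this is an improvement depends entirely on the quality of the uncertainty estimates.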

3.5.3 Caveats

Although our approach to imputation and uncertainty estimation performed well, certain caveats of our trait data imputation need to be addressed. First of all, our analysis did not explore how the performance of CRU would change depending on the mechanism behind missingness, i.e. whether the information that is lacking is missing completely at random (MCAR), missing at random (i.e. the probability of missingness in a variable may be related to other observed variables; MAR) or missing not at random (i.e. the likelihood of missingness in a variable is related to the variable itself; MNAR; Nakagawa & Freckleton, 2008). However, real data are often assumed to be missing at random (Nakagawa & Freckleton, 2011; Penone et al., 2014), which means that their values could be inferred from other variables in the database, making our imputation methodology, largely based on regression, a desirable alternative to data deletion.
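The three mechanisms can be made concrete with a small simulation. The Python sketch below is a toy example with hypothetical variables (`x` fully observed, `y` partly missing) and arbitrary missingness probabilities. It uses a simple linear regression rather than CRU, and is not part of the analysis reported here.

```python
import random
from statistics import fmean

random.seed(42)
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]    # fully observed trait
y = [xi + random.gauss(0, 1) for xi in x]     # trait with gaps; true mean is 0

# True = value of y is observed, False = value of y is missing.
# MCAR: every value has the same chance of being missing.
keep_mcar = [random.random() < 0.5 for _ in range(n)]
# MAR: missingness in y depends only on the observed variable x.
keep_mar = [random.random() < (0.8 if xi < 0 else 0.2) for xi in x]
# MNAR: missingness in y depends on the value of y itself.
keep_mnar = [random.random() < (0.8 if yi < 0 else 0.2) for yi in y]

def observed_mean(values, keep):
    """Mean after deleting the missing cases."""
    return fmean(v for v, k in zip(values, keep) if k)

def regression_filled_mean(x, y, keep):
    """Fit y ~ x on the observed cases, fill the gaps, return the mean."""
    xs = [xi for xi, k in zip(x, keep) if k]
    ys = [yi for yi, k in zip(y, keep) if k]
    mx, my = fmean(xs), fmean(ys)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    intercept = my - slope * mx
    filled = [yi if k else intercept + slope * xi
              for xi, yi, k in zip(x, y, keep)]
    return fmean(filled)

deleted = {name: observed_mean(y, keep) for name, keep in
           [("MCAR", keep_mcar), ("MAR", keep_mar), ("MNAR", keep_mnar)]}
filled_mar = regression_filled_mean(x, y, keep_mar)
filled_mnar = regression_filled_mean(x, y, keep_mnar)
```

In this toy setting, deleting incomplete cases leaves the mean roughly unbiased only under MCAR. Regression on the observed variable largely removes the bias under MAR, whereas under MNAR the filled mean remains biased.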

Secondly, and most importantly, due to the sparsity of phylogenetic information for the species included in FishBase, we chose taxonomy as a metric of relatedness. Even though phylogenetic methods would be the logical answer to missing data in trait databases and have been shown to perform well under different circumstances (Kim et al., 2018; Swenson et al., 2014; Taugourdeau et al., 2014), full phylogenetic trees are very rarely available for big datasets, and other imputation studies have also opted for simplified surrogates (e.g., taxonomic trees; Schrodt et al., 2015). Although phylogenetic information was not considered for FishBase, given that it was available for only about 5% of species, we acknowledge that taxonomies do not completely represent true phylogenies (Ereshefsky, 2000) and that better resolved relationships between species should be incorporated if they become available. To this effect, we derived a modification of CRU that allows phylogenetic information to be accommodated as well, which could be important for other databases (Appendix C). We note, however, that the formulation incorporating phylogeny in CRU remains untested.

As for the model averaging approach, although we provided reliable uncertainty values for each datum imputed by CRU, it is unclear how to combine them with the uncertainty associated with missForest in order to obtain final uncertainty measures for each value imputed by the best ensemble model. The reasons are twofold: missForest only offered an overall measure of uncertainty for the whole dataset, and methods for combining uncertainties from different models often rest on strict assumptions, so that how to do this properly remains an open question in the context of model averaging (Dormann et al., 2018).

Finally, although the proposed imputation methodology performed well, we acknowledge that the collection of new field data should in no way be discouraged, as it remains the most desirable approach to filling in missing information, especially for rare or elusive species and for small taxonomic groups.

3.6 Conclusions

Here, we addressed some of the limitations associated with missing data imputations by providing a novel method that outperformed alternative techniques in trait-based ecology, and by proposing an imputation ensemble approach that improved prediction accuracy. Moreover, we derived and demonstrated a reliable algorithm for the estimation of uncertainty for trait databases. Our approach to filling in missing values appeared robust to high missingness, relatively straightforward, and less computationally demanding than alternative methods.

Further, we have provided a complete version of a substantial subset of FishBase (web location to be determined with journal). Given the plethora of uses of trait databases, and given that all of them contain substantial degrees of missingness, improving imputation approaches to make them more robust, efficient, and straightforward will have wide appeal and utility, with broad ramifications for diverse ecological and evolutionary questions.

3.7 Acknowledgements

The authors would like to thank E. Hudgins, D. Nguyen, V. Reed, A. Sardain, N. Richards and S. Varadarajan for insightful discussions, and D. Nguyen in particular for retrieving the data from FishBase. This research was supported by an NSERC Discovery grant to BL.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE transactions on automatic control, 19, 716-723.

Araújo, M. B., & New, M. (2007). Ensemble forecasting of species distributions. Trends in ecology & evolution, 22, 42-47.

Ascough II, J. C., Maier, H. R., Ravalico, J. K., & Strudley, M. W. (2008). Future research challenges for incorporation of uncertainty in environmental and ecological decision-making. Ecological modelling, 219, 383-399.

Baraloto, C., Paine, C. T., Poorter, L., Beauchene, J., Bonal, D., Domenach, A. M., ...& Chave, J. (2010). Decoupled leaf and stem economics in rain forest trees. Ecology letters, 13, 1338-1347.

Bates, J. M., & Granger, C. W. (1969). The combination of forecasts. Journal of the Operational Research Society, 20, 451-468.

Betancur-R, R., Wiley, E. O., Arratia, G., Acero, A., Bailly, N., Miya, M., ...& Orti, G. (2017). Phylogenetic classification of bony fishes. BMC evolutionary biology, 17, 162.

Breiman, L. (2001). Random forests. Machine learning, 45, 5-32.

Brown, J. D. (2010). Prospects for the open treatment of uncertainty in environmental research. Progress in Physical Geography, 34, 75-100.

Bruggeman, J., Heringa, J., & Brandt, B. W. (2009). PhyloPars: estimation of missing parameter values using phylogeny. Nucleic acids research, 37, W179-W184.

Cadotte, M. W., Carscadden, K., & Mirotchnick, N. (2011). Beyond species: functional diversity and the maintenance of ecological processes and services. Journal of applied ecology, 48, 1079-1087.

Cavalli, G., Baattrup-Pedersen, A., & Riis, T. (2014). The role of species functional traits for distributional patterns in lowland stream vegetation. Freshwater Science, 33, 1074-1085.

Chamberlain, S. A., & Szöcs, E. (2013). taxize: taxonomic search and retrieval in R. F1000Research, 2:191. URL: http://f1000research.com/articles/2-191/v2.

Clark, J. S., Carpenter, S. R., Barber, M., Collins, S., Dobson, A., Foley, J. A., ...& Pringle, C. (2001). Ecological forecasts: an emerging imperative. Science, 293, 657-660.

Di Marco, M., Cardillo, M., Possingham, H. P., Wilson, K. A., Blomberg, S. P., Boitani, L., & Rondinini, C. (2012). A novel approach for global mammal extinction risk reduction. Conservation Letters, 5, 134-141.

Dimarchopoulou, D., Stergiou, K. I., & Tsikliras, A. C. (2017). Gap analysis on the biology of Mediterranean marine fishes. PloS one, 12, e0175949.

Dormann, C. F., Calabrese, J. M., Guillera‐Arroita, G., Matechou, E., Bahn, V., Bartoń, K., ...& Guelat, J. (2018). Model averaging in ecology: a review of Bayesian, information‐theoretic, and tactical approaches for predictive inference. Ecological Monographs.

Dormann, C. F., Purschke, O., Márquez, J. R. G., Lautenbach, S., & Schröder, B. (2008b). Components of uncertainty in species distribution analysis: a case study of the great grey shrike. Ecology, 89, 3371-3386.

Dormann, C. F., Schweiger, O., Arens, P., Augenstein, I., Aviron, S. T., Bailey, D., ...& Burel, F. (2008a). Prediction uncertainty of environmental change effects on temperate European biodiversity. Ecology letters, 11, 235-244.

Ellington, E. H., Bastille‐Rousseau, G., Austin, C., Landolt, K. N., Pond, B. A., Rees, E. E., ... & Murray, D. L. (2015). Using multiple imputation to estimate missing data in meta‐regression. Methods in Ecology and Evolution, 6, 153-163.

Ereshefsky, M. (2000). The poverty of the Linnaean hierarchy: A philosophical study of biological taxonomy. Cambridge University Press.

Estrada, A., Morales-Castilla, I., Caplat, P., & Early, R. (2016). Usefulness of species traits in predicting range shifts. Trends in ecology & evolution, 31, 190-203.

Fisher, D. O., Blomberg, S. P., & Owens, I. P. (2003). Extrinsic versus intrinsic factors in the decline and extinction of Australian marsupials. Proceedings of the Royal Society of London B: Biological Sciences, 270, 1801-1808.

Froese, R., & Pauly, D. (Eds.). (2017). FishBase. World Wide Web electronic publication: www.fishbase.org.

Garcia, R. A., Burgess, N. D., Cabeza, M., Rahbek, C., & Araújo, M. B. (2012). Exploring consensus in 21st century projections of climatically suitable areas for African vertebrates. Global Change Biology, 18, 1253-1269.

Gayraud, S., Statzner, B., Bady, P., Haybachp, A., Schöll, F., Usseglio‐Polatera, P., & Bacchi, M. (2003). Invertebrate traits for the biomonitoring of large European rivers: an initial assessment of alternative metrics. Freshwater Biology, 48, 2045-2064.

Goolsby, E. W., Bruggeman, J., & Ané, C. (2017). Rphylopars: fast multivariate phylogenetic comparative methods for missing data and within‐species variation. Methods in Ecology and Evolution, 8, 22-27.

Grafen, A. (1989). The phylogenetic regression. Phil. Trans. R. Soc. Lond. B, 326, 119-157.

Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention science, 8, 206-213.

Guénard, G., Legendre, P., & Peres‐Neto, P. (2013). Phylogenetic eigenvector maps: a framework to model and predict species traits. Methods in Ecology and Evolution, 4, 1120-1131.

Guo, C., Lek, S., Ye, S., Li, W., Liu, J., & Li, Z. (2015). Uncertainty in ensemble modelling of large-scale species distribution: effects from species characteristics and model techniques. Ecological modelling, 306, 67-75.

Hadfield, J. D. (2008). Estimating evolutionary parameters when viability selection is operating. Proceedings of the Royal Society of London B: Biological Sciences, 275, 723-734.

Halinski, R. S., & Feldt, L. S. (1970). The selection of variables in multiple regression analysis. Journal of Educational Measurement, 7, 151-157.

Hastings, A. F., Wattenbach, M., Eugster, W., Li, C., Buchmann, N., & Smith, P. (2010). Uncertainty propagation in soil greenhouse gas emission models: an experiment using the DNDC model and at the Oensingen cropland site. Agriculture, ecosystems & environment, 136, 97-110.

Integrated Taxonomic Information System on-line database, http://www.itis.gov. (accessed 29 May 2017).

Johnson, J. B., & Omland, K. S. (2004). Model selection in ecology and evolution. Trends in ecology & evolution, 19, 101-108.

Johnston, J. (1963). Econometric Methods. New York: McGraw Hill.

Kamilar, J. M., & Cooper, N. (2013). Phylogenetic signal in primate behaviour, ecology and life history. Phil. Trans. R. Soc. B, 368, 20120341.

Kattge, J., Diaz, S., Lavorel, S., Prentice, I. C., Leadley, P., Bönisch, G., ... & Cornelissen, J. H. C. (2011). TRY–a global database of plant traits. Global change biology, 17, 2905-2935.

Kelling, S., Hochachka, W. M., Fink, D., Riedewald, M., Caruana, R., Ballard, G., & Hooker, G. (2009). Data-intensive science: a new paradigm for biodiversity studies. BioScience, 59, 613-620.

Kim, S. W., Blomberg, S. P., Pandolfi, J. M. & Chase, J. (2018). Transcending data gaps: a framework to reduce inferential errors in ecological analyses. Ecology Letters, 21, 1200-1210.

Kraft, N. J., Valencia, R., & Ackerly, D. D. (2008). Functional traits and niche-based tree community assembly in an Amazonian forest. Science, 322, 580-582.

Lavorel, S., Grigulis, K., Lamarque, P., Colace, M. P., Garden, D., Girel, J., ...& Douzet, R. (2011). Using plant functional traits to understand the landscape distribution of multiple ecosystem services. Journal of Ecology, 99, 135-147.

Le Lay, G., Engler, R., Franc, E., & Guisan, A. (2010). Prospective sampling based on model ensembles improves the detection of rare species. Ecography, 33, 1015-1027.

Little, R. J. (1988). Missing-data adjustments in large surveys. Journal of Business & Economic Statistics, 6, 287-296.

Liu, C., Comte, L., & Olden, J. D. (2017). Heads you win, tails you lose: Life‐history traits predict invasion and extinction risk of the world's freshwater fishes. Aquatic Conservation: Marine and Freshwater Ecosystems, 27, 773-779.

Luo, Y., Ogle, K., Tucker, C., Fei, S., Gao, C., LaDeau, S., ...& Schimel, D. S. (2011). Ecological forecasting and data assimilation in a data‐rich era. Ecological Applications, 21, 1429-1442.

Májeková, M., Paal, T., Plowman, N. S., Bryndová, M., Kasari, L., Norberg, A., ...& Le Bagousse-Pinguet, Y. (2016). Evaluating functional diversity: missing trait data and the importance of species abundance structure and data transformation. PloS one, 11, e0149270.

Marmion, M., Hjort, J., Thuiller, W., & Luoto, M. (2009). Statistical consensus methods for improving predictive geomorphology maps. Computers & Geosciences, 35, 615-625.

Martins, E. P., & Hansen, T. F. (1997). Phylogenies and the comparative method: a general approach to incorporating phylogenetic information into the analysis of interspecific data. The American Naturalist, 149, 646-667.

McGill, B. J., Enquist, B. J., Weiher, E., & Westoby, M. (2006). Rebuilding community ecology from functional traits. Trends in ecology & evolution, 21, 178-185.

Meller, L., Cabeza, M., Pironon, S., Barbet‐Massin, M., Maiorano, L., Georges, D., & Thuiller, W. (2014). Ensemble distribution models in conservation prioritization: from consensus predictions to consensus reserve networks. Diversity and distributions, 20, 309-321.

Molina‐Venegas, R., Moreno‐Saiz, J. C., Castro Parga, I. , Davies, T. J., Peres‐Neto, P. R. & Rodríguez, M. Á. (2018). Assessing among‐lineage variability in phylogenetic imputation of functional trait datasets. Ecography, 41, 1740-1749.

Mood, A. M., Graybill, F. A., & Boes, D. C. (1974). Introduction to the Theory of Statistics (3rd ed., pp. 540-541). New York: McGraw-Hill.

Nakagawa, S. (2015). Missing data: mechanisms, methods and messages. Ecological Statistics: Contemporary Theory and Application, edited by: Fox, G., Negrete-Yankelevich, S., and Sosa, VJ, Oxford University Press, Oxford, UK, 81-105.

Nakagawa, S., & Freckleton, R. P. (2008). Missing inaction: the dangers of ignoring missing data. Trends in Ecology & Evolution, 23, 592-596.

Nakagawa, S., & Freckleton, R. P. (2011). Model averaging, missing data and multiple imputation: a case study for behavioural ecology. Behavioral Ecology and Sociobiology, 65, 103-116.

Nakagawa, S., Waas, J. R., & Miyazaki, M. (2001). Heart rate changes reveal that little blue penguin chicks (Eudyptula minor) can use vocal signatures to discriminate familiar from unfamiliar chicks. Behavioral Ecology and Sociobiology, 50, 180-188.

Pakeman, R. J. (2014). Functional trait metrics are sensitive to the completeness of the species' trait data?. Methods in Ecology and Evolution, 5, 9-15.

Penone, C., Davidson, A. D., Shoemaker, K. T., Di Marco, M., Rondinini, C., Brooks, T. M., ...& Costa, G. C. (2014). Imputation of missing data in life‐history trait datasets: which approach performs the best?. Methods in Ecology and Evolution, 5, 961-970.

Petchey, O. L., & Gaston, K. J. (2002). Functional diversity (FD), species richness and community composition. Ecology letters, 5, 402-411.

Poyatos, R., Sus, O., Badiella, L., Mencuccini, M., & Martínez-Vilalta, J. (2018). Gap-filling a spatially explicit plant trait database: comparing imputation methods and different levels of environmental information. Biogeosciences, 15, 2601-2617.

Pyšek, P., Jarošík, V., Hulme, P. E., Pergl, J., Hejda, M., Schaffner, U., & Vilà, M. (2012). A global assessment of invasive plant impacts on resident species, communities and ecosystems: the interaction of impact measures, invading species' traits and environment. Global Change Biology, 18, 1725-1737.

R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Rapacciuolo, G., Roy, D. B., Gillings, S., Fox, R., Walker, K., & Purvis, A. (2012). Climatic associations of British species distributions show good transferability in time but low predictive accuracy for range change. PLoS One, 7, e40212.

Regan, H. M., Colyvan, M., & Burgman, M. A. (2002). A taxonomy and treatment of uncertainty for ecology and conservation biology. Ecological applications, 12, 618-628.

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.

Schafer, J. L., & Olsen, M. K. (1998). Multiple imputation for multivariate missing-data problems: A data analyst's perspective. Multivariate behavioral research, 33, 545-571.

Schrodt, F., Kattge, J., Shan, H., Fazayeli, F., Joswig, J., Banerjee, A., ...& Gillison, A. (2015). BHPMF–a hierarchical Bayesian approach to gap‐filling and trait prediction for macroecology and functional biogeography. Global Ecology and Biogeography, 24, 1510-1521.

Stekhoven, D. J., & Bühlmann, P. (2011). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28, 112-118.

Swenson, N. G. (2014). Phylogenetic imputation of plant functional trait databases. Ecography, 37, 105-110.

Symonds, M. R., & Moussalli, A. (2011). A brief guide to model selection, multimodel inference and model averaging in behavioural ecology using Akaike’s information criterion. Behavioral Ecology and Sociobiology, 65, 13-21.

Taugourdeau, S., Villerd, J., Plantureux, S., Huguenin‐Elie, O., & Amiaud, B. (2014). Filling the gap in functional trait databases: use of ecological hypotheses to replace missing data. Ecology and evolution, 4, 944-958.

Thuiller, W. (2004). Patterns and uncertainties of species' range shifts under climate change. Global Change Biology, 10, 2020-2027.

Tilman, D. (2001). Functional diversity. In: Encyclopedia of Biodiversity (ed. Levin, S.A.). Academic Press, San Diego, CA, pp. 109–120.

van Buuren, S. (2012). Flexible Imputation of Missing Data. CRC Press, Boca Raton, Florida, USA.

van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45.

van der Heijden, G. J., Donders, A. R. T., Stijnen, T., & Moons, K. G. (2006). Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. Journal of clinical epidemiology, 59, 1102-1109.

Van Kleunen, M., Weber, E., & Fischer, M. (2010). A meta‐analysis of trait differences between invasive and non‐invasive plant species. Ecology letters, 13, 235-245.

Vandewalle, M., De Bello, F., Berg, M. P., Bolger, T., Dolédec, S., Dubs, F., ... & Da Silva, P. M. (2010). Functional traits as indicators of biodiversity response to land use changes across ecosystems and organisms. Biodiversity and Conservation, 19, 2921-2947.

White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica: Journal of the Econometric Society, 817-838.

Winkler, R. L. (1989). Combining forecasts: A philosophical basis and some current issues. International Journal of Forecasting, 5, 605-609.

Table 3.1. Species functional and life-history traits used for the analysis, obtained from FishBase. The last column indicates the percentage of missing values for each trait in the dataset.
_________________________________________________________________________________________________________
Name                      Description                                                           % missing
_________________________________________________________________________________________________________
Minimum depth             Smallest depth (m) reported for juveniles and adults.                 62.61
Maximum depth             Greatest depth (m) reported for juveniles and adults.                 58.03
Minimum temperature       Minimum temperature tolerance (°C).                                   88.43
Maximum temperature       Maximum temperature tolerance (°C).                                   88.11
Maximum total length      Total length (cm) of the largest specimen ever caught.                59.70
Maximum standard length   Standard length (cm) of the largest specimen ever caught.             53.26
Latitude North            Maximum latitude at which the species is found.                       70.77
Latitude South            Minimum latitude at which the species is found.                       71.31
Minimum common depth      Smallest depth (m) where juveniles/adults are most often found.       94.61
Maximum common depth      Greatest depth (m) where juveniles/adults are most often found.       94.47
Longitude East            Maximum longitude at which the species is found.                      87.56
Longitude West            Maximum longitude at which the species is found.                      87.61
Length at first maturity  Average length (cm) at which fish mature for the first time.          96.48
Minimum maturity length   Length (cm) of smallest mature fish.                                  94.94
Maximum maturity length   Length (cm) of largest mature fish.                                   97.46
Maximum weight            Total weight (kg) of the largest specimen ever caught.                94.58
Maximum age               Age (years) of the oldest specimen ever found in the wild/captivity.  96.08
Common total length       Total length (cm) at which specimens are commonly caught.             91.81
Common standard length    Standard length (cm) at which specimens are commonly caught.          97.40
Trophic level             Species position within the food web.                                 4.41
Overall                                                                                         79.48
_________________________________________________________________________________________________________

Table 3.2. Measures of variation included in the hierarchical model for uncertainty estimation.
_________________________________________________________________________________________________________
Measure of variability    Description
_________________________________________________________________________________________________________
RMSD                      For each approach, the overall residual mean variation, i.e. standard deviation of the residuals of the relationship between predicted and observed trait values
σ                         Standard deviation of the normally distributed logarithm of ε
σ̂                         Maximum likelihood estimate of σ based on the observed residuals
ε                         Within taxonomic group variation, i.e. standard deviation of the within-group residuals when comparing predicted and observed trait values
ε̂                         Estimate of ε based on the observed residuals and σ
_________________________________________________________________________________________________________

Table 3.3. Average R²_MSE across traits and approaches, obtained by cross-validation. Each row is either a single methodology or an ensemble of methods.
________________________________________________
Method                                R²_MSE
________________________________________________
MICE                                  0.486
missForest                            0.621
Phylopars                             0.239
CRU                                   0.646
MICE - missForest                     0.617
MICE - Phylopars                      0.542
MICE - CRU                            0.647
missForest - Phylopars                0.560
missForest - CRU                      0.703
Phylopars - CRU                       0.583
MICE - missForest - Phylopars         0.632
MICE - missForest - CRU               0.679
MICE - Phylopars - CRU                0.637
missForest - Phylopars - CRU          0.673
MICE - missForest - Phylopars - CRU   0.675
________________________________________________

Table 3.4. Cross-validation predictive performance (R²_MSE) of MICE, missForest, Phylopars, CRU and the best ensemble (missForest-CRU), by trait. For single approaches, the values in bold correspond to the best method for a specific trait, while under the ensemble model, the bold values indicate where averaging the predictions from CRU and missForest improved the accuracy of the imputation.
_______________________________________________________________________________
Trait                      MICE    missForest  Phylopars  CRU     missForest-CRU
_______________________________________________________________________________
Minimum depth              0.41    0.48        0.51       0.52    0.58
Maximum depth              0.34    0.48        0.65       0.58    0.60
Minimum temperature        0.53    0.68        0.61       0.67    0.74
Maximum temperature        0.39    0.44        0.49       0.57    0.61
Maximum total length       0.57    0.71        0.73       0.72    0.79
Maximum standard length   -0.01    0.33        0.45       0.47    0.53
Latitude North             0.35    0.64        0.54       0.59    0.70
Latitude South             0.37    0.65        0.62       0.64    0.72
Minimum common depth       0.81    0.81        0.47       0.83    0.85
Maximum common depth       0.82    0.81        0.11       0.84    0.85
Longitude West             0.44    0.68        0.29       0.45    0.64
Longitude East             0.32    0.66        0.20       0.38    0.61
Length at first maturity   0.84    0.88       -0.39       0.85    0.90
Minimum maturity length    0.77    0.85       -0.17       0.85    0.88
Maximum maturity length    0.80    0.83       -0.69       0.75    0.82
Maximum weight             0.59    0.59       -0.28       0.61    0.64
Maximum age                0.08    0.35        0.17       0.34    0.42
Common total length        0.73    0.76        0.27       0.73    0.78
Common standard length     0.61    0.65       -0.61       0.73    0.76
Trophic level             -0.03    0.15        0.81       0.78    0.64
_______________________________________________________________________________

Figure 3.1. Possible estimates obtainable for each missing datum using CRU, depending on the amount of information used.

Figure 3.2. P-P plots evaluating each method's performance in estimating uncertainty. The plots compare the observed and the theoretical residuals' percentiles, given the uncertainty estimates provided by (a) CRU using the HUE algorithm, (b) RMSD, (c) MICE and (d) Phylopars. missForest was excluded as it did not provide an estimate of uncertainty for each imputed value. The 1:1 line defines the expectation for a perfect match between theoretical and observed percentiles.

Figure 3.3. Average uncertainty by number of species within (a) genus and (b) family.

General conclusion

The past few decades have seen an explosion in the number of potentially harmful non-indigenous species transported around the world, mostly due to ongoing economic globalization and the growth of international trade (Hulme, 2009; Westphal et al., 2008), prompting the rapid development of the field of invasion ecology (Ricciardi & MacIsaac, 2008).

Given the impossibility of impeding the movement of potentially damaging species without serious repercussions on economies worldwide, advancing current understanding of the processes and factors underlying invasions and using this knowledge to recommend well-informed management strategies are among the main roles of invasion ecologists (Simberloff et al., 2013). In particular, prevention and rapid response to detections of non-indigenous species in the wild are the most effective instruments against possibly harmful species (Pyšek & Richardson, 2010). However, as most non-indigenous species introduced in non-native locations do not manage to establish or cause impact (Williamson & Fitter, 1996), knowing where and when interventions should be prioritized is of primary importance to guarantee success in a cost-effective way (Pyšek & Richardson, 2010).

Another major challenge encountered by invasion ecologists is the widespread lack of information, which often limits the ability to make predictions and hampers the development of risk assessment frameworks (Nakagawa & Freckleton, 2008). However, while more data are being collected and global databases compiled, it is critical that researchers keep integrating different sources of information and develop methods to make the best use of the available data (Leung et al., 2012). Therefore, the main aim of this thesis has been to advance the ability to inform proper management strategies using a multispecies and geographically explicit perspective, and integrating heterogeneous sources of limited information.

In Chapter 1, I extended an existing multispecies risk assessment framework and integrated the main predictors of successful non-indigenous species establishment, to obtain risk predictions that were both species-specific and spatially explicit. The inclusion of climatic variables, in addition to propagule pressure and species traits, allowed me to identify priorities where prevention would be desirable, and to forecast how establishment risk should vary in the face of future climatic changes. Specifically, results showed the southernmost regions to be the most vulnerable to non-indigenous aquarium fish in the USA, both currently and under future conditions, contrary to other findings in the literature. In Chapter 2, the framework was applied separately to casual and persistent establishment, to prioritize rapid response to detections of non-indigenous species in the wild. I derived simple "rules of thumb" to quickly estimate risk, and to allow policy makers to decide about the appropriateness of eradication in a timely and effective way. These general rules made the results easily accessible to non-specialists, while remaining based on sound statistical frameworks and reliable predictions. In addition to providing guidance to non-indigenous species managers, both Chapters 1 and 2 contributed to a better characterization of the factors that promote or hinder establishment, overall and specifically for each sub-stage, thus advancing our understanding of one of the most important phases of the invasion process.

Among the data challenges encountered in Chapters 1 and 2 were the need for proxies of propagule pressure, the coarse resolution of environmental and establishment data, and missing information for species traits. In Chapter 3, I focused on the last of these, and addressed the problem of incomplete information in global trait datasets. I proposed for the first time the use of an ensemble perspective for the imputation of missing trait data, and demonstrated that better predictions can be obtained by averaging well-behaved models, and that relatively simple novel methods can both outperform more sophisticated ones and further improve predictions in ensembles. Finally, I provided a novel method for uncertainty estimation that better captured imputation errors. While this methodology was developed and validated using a subset of the FishBase database, it is transferable to other trait datasets that can be used in various ecological applications not only related to invasion ecology, such as informing conservation, managing ecosystem services and protecting biodiversity.

Together, this thesis has contributed tools for prevention and rapid response to non-indigenous species, while simultaneously advancing our fundamental understanding of a critical phase of invasions and accounting for additional drivers of global change. Further, it has improved the usability of scientific information and has advanced the relatively novel use of imputation in ecology, improving predictions and accounting for uncertainty.


References

Hulme, P. E. (2009). Trade, transport and trouble: managing invasive species pathways in an era of globalization. Journal of Applied Ecology, 46, 10-18.

Nakagawa, S., & Freckleton, R. P. (2008). Missing inaction: the dangers of ignoring missing data. Trends in Ecology & Evolution, 23, 592-596.

Pyšek, P., & Richardson, D. M. (2010). Invasive species, environmental change and management, and health. Annual Review of Environment and Resources, 35, 25-55.

Ricciardi, A., & MacIsaac, H. J. (2008). In Retrospect: The book that began invasion ecology. Nature, 452, 34.

Simberloff, D., Martin, J. L., Genovesi, P., Maris, V., Wardle, D. A., Aronson, J., ... & Pyšek, P. (2013). Impacts of biological invasions: what's what and the way forward. Trends in Ecology & Evolution, 28, 58-66.

Westphal, M. I., Browne, M., MacKinnon, K., & Noble, I. (2008). The link between international trade and the global distribution of invasive alien species. Biological Invasions, 10, 391-398.

Williamson, M., & Fitter, A. (1996). The varying success of invaders. Ecology, 77, 1661-1666.


Appendices


Appendix A

Supplementary material for Chapter 1


Table A.1. List of the predictors included in the model selection process. The main terms represented 5 traits, 5 environmental variables and their squared terms (eqn. 2.4). The interaction terms include all the possible pair-wise combinations of the main terms selected to be retained in the model (marked with an asterisk), except when both terms are squared (eqn. 2.5). x̅ indicates the mean, s² represents the variance, and 1st and 2nd denote first and second order terms.
______________________________________________________________________________________________________________
Name                                                     Description (main terms)
______________________________________________________________________________________________________________
Main terms (1st and 2nd order)
Maximum temperature *                                    Maximum temperature tolerance (°C)
Minimum temperature                                      Minimum temperature tolerance (°C)
Northernmost latitude                                    Maximum latitude at which the species is found
Trophic level                                            Food web rank, estimated as 1 + mean trophs of food items;
                                                         mean troph is weighted by the contribution of the various food items
Maximum length *                                         Total length (cm) of the largest specimen ever caught
Minimum temperature of coldest month (x̅) * (1st only)    Minimum monthly temperature occurrence over a given year
Mean temperature of warmest quarter (x̅)                  Mean temperature that prevails during the warmest quarter
Precipitation of wettest month (x̅) *                     Total precipitation that prevails during the wettest month
Annual mean diurnal range (s²)                           Mean of the monthly temperature ranges
                                                         (monthly maximum minus monthly minimum)
Minimum temperature of coldest month (s²)                Minimum monthly temperature occurrence over a given year

Interaction terms
Minimum temperature of coldest month (x̅; 1st) & Maximum temperature tolerance (1st)
Minimum temperature of coldest month (x̅; 1st) & Maximum length (1st)
Minimum temperature of coldest month (x̅; 1st) & Maximum temperature tolerance (2nd)
Minimum temperature of coldest month (x̅; 1st) & Maximum length (2nd)
Precipitation of wettest month (x̅; 1st) & Maximum temperature tolerance (1st)
Precipitation of wettest month (x̅; 1st) & Maximum temperature tolerance (2nd)
Precipitation of wettest month (x̅; 2nd) & Maximum temperature tolerance (1st)
Precipitation of wettest month (x̅; 1st) & Maximum length (1st)
Precipitation of wettest month (x̅; 1st) & Maximum length (2nd)
Precipitation of wettest month (x̅; 2nd) & Maximum length (1st)
______________________________________________________________________________________________________________


Table A.2. List of freshwater aquarium fish species currently established in the United States, along with the invaded state, their current propagule pressure value (PP), their maximum temperature tolerance (Max T., °C) and maximum length (Max L., cm).
____________________________________________________________________
Species                         State       PP      Max T.  Max L.
____________________________________________________________________
Ancistrus temminckii            Hawaii      28      27      12
Chitala ornata                  Florida     74      28      122
Cichlasoma salvini              Florida     964     32      27
Corydoras aeneus                Hawaii      4783    25      9
Danio rerio                     New Mexico  33461   24      6
Hoplosternum littorale          Florida     124     26      19
Leporinus fasciatus             Hawaii      110     26      30
Macrognathus siamensis          Florida     2130    27      37
Melanochromis johannii          Hawaii      227     25      10
Parachromis managuensis         Florida     498     36      55
Parachromis managuensis         Hawaii      35      36      55
Parachromis managuensis         Nevada      71      36      55
Parachromis managuensis         Utah        74      36      55
Pelvicachromis pulcher          Hawaii      534     25      10
Poeciliopsis gracilis           California  9       28      5
Pterygoplichthys anisitsi       Florida     2221    24      42
Pterygoplichthys anisitsi       Texas       3003    24      42
Pterygoplichthys multiradiatus  Florida     422     27      50
Pterygoplichthys multiradiatus  Hawaii      29      27      50
Puntius filamentosus            Hawaii      15      24      18
Tilapia buttikoferi             Florida     887     25      38
____________________________________________________________________


Figure A.1. Relative increase in the average risk by state posed by the aquarium pathway under the four RCP scenarios, based on the risk predicted by the PET model under current and 2050 conditions in the USA.


Appendix B

Supplementary material for Chapter 2


Appendix B.1

Table B.1.1. List of persistent species (PS) along with their state of occurrence, propagule pressure (PP) and traits.
______________________________________________________________________________________________________________
Species                         State  PP       Max. temp.  Min. temp.  N-most lat.  Troph. level  Max. length
______________________________________________________________________________________________________________
Amphilophus citrinellus         FL     3186     33          23          15           3.2           22
Amphilophus citrinellus         HI     221      33          23          15           3.2           22
Amphilophus labiatus            HI     1268     33          28          13           3.5           22
Chitala ornata                  FL     61       28          24          38           3.7           90
Cichlasoma salvini              FL     796      32          22          -3           3.7           25
Corydoras aeneus                HI     3950     28          25          -28          3             26
Danio rerio                     NM     27637    24          18          33           3.1           8
Dawkinsia filamentosa           HI     12       24          20          21           2.6           18
Hoplosternum littorale          FL     102      26          18          11           2.7           26
Hypsophrys nicaraguensis        HI     5        36          23          3            2.7           20
Labeotropheus fuelleborni       FL     905      25          22          -9           2.8           18
Leporinus fasciatus             HI     91       26          22          10           3             37
Macrognathus siamensis          FL     1759     27          23          5            3.3           37
Parachromis managuensis         FL     412      36          25          37           4             55
Parachromis managuensis         HI     29       36          25          37           4             55
Parachromis managuensis         NV     59       36          25          37           4             55
Parachromis managuensis         UT     61       36          25          37           4             55
Pelvicachromis pulcher          HI     441      25          24          10           3.3           11
Petenia splendida               FL     4        30          26          -3           4.5           18
Platydoras costatus             TX     2375     30          24          9            3             15
Poeciliopsis gracilis           CA     7        28          24          23           2             9
Pterygoplichthys anisitsi       FL     1835     24          21          12           2             55
Pterygoplichthys anisitsi       TX     2480     24          21          12           2             55
Pterygoplichthys multiradiatus  FL     348      27          23          10           2.2           50
Pterygoplichthys multiradiatus  HI     24       27          23          10           2.2           50
Pterygoplichthys multiradiatus  TX     471      27          23          10           2.2           50
Pygocentrus nattereri           FL     7541     27          23          9            3.7           28
Trichopsis vittata              FL     1314     28          22          17           3.4           8.15
______________________________________________________________________________________________________________


Table B.1.2. List of casual species (CS) that were unable to successfully persist, along with their state of occurrence, propagule pressure (PP) and traits.
______________________________________________________________________________________________________________
Species                         State  PP       Max. temp.  Min. temp.  N-most lat.  Troph. level  Max. length
______________________________________________________________________________________________________________
Agamyxis pectinifrons           TX     1112     26          20          -2           2.8           47
Ameca splendens                 NV     8        32          26          23           2             12
Amphilophus citrinellus         MA     1053     33          23          15           3.2           22
Amphilophus citrinellus         MN     853      33          23          15           3.2           22
Anabas testudineus              FL     51       30          22          28           3             25
Aphyocharax anisitsi            FL     20592    28          18          -6           3.2           6
Balantiocheilos melanopterus    IN     21028    28          22          20           3             25
Barbonymus schwanenfeldii       FL     4774     25          22          16           3             29
Barbonymus schwanenfeldii       IN     1536     25          22          16           3             29
Betta splendens                 CT     52891    30          24          22           3.3           7
Betta splendens                 FL     304830   30          24          22           3.3           7
Chitala ornata                  MO     18       28          24          38           3.7           90
Chitala ornata                  NY     58       28          24          38           3.7           90
Chitala ornata                  NC     30       28          24          38           3.7           90
Chitala ornata                  VT     2        28          24          38           3.7           90
Cichlasoma trimaculatum         FL     72       30          21          -3           3.4           25
Cichlasoma trimaculatum         NV     10       30          21          -3           3.4           25
Danio rerio                     CA     521253   24          18          33           3.1           8
Danio rerio                     CT     47496    24          18          33           3.1           8
Danio rerio                     FL     273740   24          18          33           3.1           8
Devario malabaricus             FL     14674    25          18          22           3.2           12
Devario malabaricus             NV     2093     25          18          22           3.2           12
Gymnocorymbus ternetzi          CO     24410    26          20          -11          3.1           8
Gymnocorymbus ternetzi          FL     90813    26          20          -11          3.1           8
Gymnocorymbus ternetzi          LA     20626    26          20          -11          3.1           8
Helostoma temminkii             FL     22430    28          22          16           2.8           30
Hemigrammus ocellifer           CO     9455     26          22          -9           3.2           4
Herichthys carpintis            FL     1236     33          23          -3           3.2           20
Hoplosternum littorale          GA     51       26          18          11           2.7           26
Hyphessobrycon eques            FL     54597    26          22          1            3.1           5
Labeo chrysophekadion           FL     8689     27          24          2            2             90
Leiarius marmoratus             FL     19       26          24          8            4.5           100
Leporinus fasciatus             FL     1306     26          22          10           3             37
Macropodus opercularis          FL     8642     26          16          30           3.8           8
Macropodus opercularis          LA     1963     26          16          30           3.8           8
Melanochromis auratus           FL     9100     26          22          -12          2             11
Melanochromis auratus           NV     1298     26          22          -12          2             11
Moenkhausia sanctaefilomenae    FL     18680    26          22          -1           3             8
Myloplus rubripinnis            MA     282      27          23          2            2             39
Nematolebias whitei             CA     24       23          20          -15          3.2           8
Osteoglossum bicirrhosum        CA     4158     30          24          -6           3.4           90
Osteoglossum bicirrhosum        CO     587      30          24          -6           3.4           90
Osteoglossum bicirrhosum        FL     2184     30          24          -6           3.4           90
Osteoglossum bicirrhosum        HI     151      30          24          -6           3.4           90
Osteoglossum bicirrhosum        IL     1356     30          24          -6           3.4           90
Osteoglossum bicirrhosum        IN     703      30          24          -6           3.4           90
Osteoglossum bicirrhosum        KS     308      30          24          -6           3.4           90
Osteoglossum bicirrhosum        NV     311      30          24          -6           3.4           90
Osteoglossum bicirrhosum        PA     1354     30          24          -6           3.4           90
Oxydoras niger                  FL     30       24          21          -2           2.8           73
Pangasianodon hypophthalmus     FL     60557    26          22          19           3.1           300
Pangio kuhlii                   FL     19297    30          24          30           3.6           12
Paracheirodon innesi            CO     224962   26          20          -6           2.9           8
Parachromis managuensis         LA     94       36          25          37           4             55
Perrunichthys perruno           TX     64       26          22          8            4.5           60
Pethia gelius                   FL     1199     22          18          17           3.3           5
Piaractus brachypomus           AL     132      28          23          23           2.5           88
Piaractus brachypomus           AZ     189      28          23          23           2.5           88
Piaractus brachypomus           AR     81       28          23          23           2.5           88
Piaractus brachypomus           CA     1069     28          23          23           2.5           88
Piaractus brachypomus           CO     151      28          23          23           2.5           88
Piaractus brachypomus           CT     97       28          23          23           2.5           88
Piaractus brachypomus           FL     561      28          23          23           2.5           88
Piaractus brachypomus           GA     281      28          23          23           2.5           88
Piaractus brachypomus           HI     39       28          23          23           2.5           88
Piaractus brachypomus           ID     46       28          23          23           2.5           88
Piaractus brachypomus           IL     349      28          23          23           2.5           88
Piaractus brachypomus           IN     181      28          23          23           2.5           88
Piaractus brachypomus           IA     85       28          23          23           2.5           88
Piaractus brachypomus           KY     121      28          23          23           2.5           88
Piaractus brachypomus           LA     127      28          23          23           2.5           88
Piaractus brachypomus           ME     36       28          23          23           2.5           88
Piaractus brachypomus           MD     164      28          23          23           2.5           88
Piaractus brachypomus           MA     185      28          23          23           2.5           88
Piaractus brachypomus           MI     270      28          23          23           2.5           88
Piaractus brachypomus           MN     150      28          23          23           2.5           88
Piaractus brachypomus           MS     81       28          23          23           2.5           88
Piaractus brachypomus           MO     166      28          23          23           2.5           88
Piaractus brachypomus           MT     28       28          23          23           2.5           88
Piaractus brachypomus           NE     52       28          23          23           2.5           88
Piaractus brachypomus           NV     80       28          23          23           2.5           88
Piaractus brachypomus           NH     36       28          23          23           2.5           88
Piaractus brachypomus           NJ     244      28          23          23           2.5           88
Piaractus brachypomus           NY     538      28          23          23           2.5           88
Piaractus brachypomus           NC     276      28          23          23           2.5           88
Piaractus brachypomus           ND     21       28          23          23           2.5           88
Piaractus brachypomus           OH     316      28          23          23           2.5           88
Piaractus brachypomus           OK     107      28          23          23           2.5           88
Piaractus brachypomus           OR     111      28          23          23           2.5           88
Piaractus brachypomus           PA     348      28          23          23           2.5           88
Piaractus brachypomus           SC     135      28          23          23           2.5           88
Piaractus brachypomus           SD     24       28          23          23           2.5           88
Piaractus brachypomus           TN     181      28          23          23           2.5           88
Piaractus brachypomus           TX     759      28          23          23           2.5           88
Piaractus brachypomus           UT     83       28          23          23           2.5           88
Piaractus brachypomus           VT     17       28          23          23           2.5           88
Piaractus brachypomus           VA     229      28          23          23           2.5           88
Piaractus brachypomus           WA     198      28          23          23           2.5           88
Piaractus brachypomus           WV     50       28          23          23           2.5           88
Piaractus brachypomus           WI     157      28          23          23           2.5           88
Platydoras costatus             FL     1757     30          24          9            3             15
Pseudotropheus socolofi         FL     3574     26          24          -11          2.7           11
Pygocentrus nattereri           CA     14360    27          23          9            3.7           28
Pygocentrus nattereri           HI     523      27          23          9            3.7           28
Pygocentrus nattereri           KS     1064     27          23          9            3.7           28
Pygocentrus nattereri           MA     2492     27          23          9            3.7           28
Pygocentrus nattereri           MI     3632     27          23          9            3.7           28
Pygocentrus nattereri           MN     2019     27          23          9            3.7           28
Pygocentrus nattereri           NE     698      27          23          9            3.7           28
Pygocentrus nattereri           OH     4249     27          23          9            3.7           28
Pygocentrus nattereri           OK     1435     27          23          9            3.7           28
Pygocentrus nattereri           PA     4677     27          23          9            3.7           28
Pygocentrus nattereri           TX     10194    27          23          9            3.7           28
Pygocentrus nattereri           VA     3077     27          23          9            3.7           28
Serrasalmus rhombeus            FL     21       27          23          3            4             29
Symphysodon discus              CO     1370     30          26          -1           2.9           18
Synodontis ocellifer            MN     12       27          23          5            3.1           49
Tanichthys albonubes            GA     179868   22          18          24           2.7           4
Telmatochromis bifrenatus       FL     38       26          24          -3           2             9
Trichogaster fasciata           FL     2195     28          22          19           3.1           13
Trichogaster fasciata           PA     1362     28          22          19           3.1           13
Trichogaster labiosa            FL     2848     28          22          19           3.1           9
Xiphophorus xiphidium           FL     9        25          18          25           3.1           6
______________________________________________________________________________________________________________


Appendix B.2

Figure B.2.1. Distributions of traits (a-e), propagule pressure (f), environmental variables (g-h) and interactions (i), as included in the casual establishment model. The interaction terms are standardized for simplicity of interpretation, while propagule pressure is plotted on a log scale. The black dots represent casual establishments, and their size is proportional to the square root of the number of corresponding species.


Figure B.2.2. Distributions of traits (a-b), propagule pressure (c), environmental variables (d-e) and interactions (f), as included in the persistence model. The interaction terms are standardized for simplicity of interpretation, while propagule pressure is plotted on a log scale. The black dots represent persistent establishments, and their size is proportional to the square root of the number of corresponding species.


Appendix B.3

Given the parameters of the fitted models (eqs. 2.2 and 2.3 in the main text), we could calculate the odds as follows:

$$Odds = \exp\left(\hat{b}_{1} X_{ws} + \hat{b}_{2} X_{ws}^{2}\right) \tag{B.3.1}$$

where $X_{ws}$ represented a value of the predictor in analysis (e.g., a trait here) for species s, and $\hat{b}_{1}$ and $\hat{b}_{2}$ were the fitted parameters for the first and second order terms. The same logic applied for the interaction terms:

$$Odds = \exp\left(\hat{b}\, X_{ws} E_{ml}\right) \tag{B.3.2}$$

where $X_{ws}$ and $E_{ml}$ were the interacting trait for species s and environmental condition for location l, and $\hat{b}$ was the corresponding fitted parameter.

Probabilities of successful establishment can be calculated from the odds as follows:

$$p_{sl} = \frac{Odds_{sl}}{1 + Odds_{sl}} = \frac{\exp(z_{sl})}{1 + \exp(z_{sl})} \tag{B.3.3}$$

where $Odds_{sl}$ are the odds after accounting for all relevant predictors, and $z_{sl}$ is the linear predictor described in equation 2.3 in the main text. Because of the standardization procedure employed before fitting the model, the parameters provided in table 2.3 of the main text are also standardized, and raw variable values should be adjusted as follows:

$$X' = \frac{X - \bar{X}}{s_{X}} \tag{B.3.4}$$

$$X'' = \frac{X'^{2} - \overline{X'^{2}}}{s_{X'^{2}}} \tag{B.3.5}$$

$$XY' = \frac{X'Y' - \overline{X'Y'}}{s_{X'Y'}} \tag{B.3.6}$$

For each predictor X, $X'$ is the standardized value of the first order term, $X''$ is the standardized value of the second order term, $s_{X}$ is the standard deviation of X, and $s_{X'^{2}}$ is the standard deviation of the squared $X'$. For each interaction term XY, $XY'$ is the corresponding standardized value and $s_{X'Y'}$ is the standard deviation of the product of the standardized predictors X' and Y'. Overbars denote the corresponding means.

The corresponding numerical values to be used are presented in table B.3.1.


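Table B.3.1. Reference values for proper standardization to calculate probabilities of casual and persistent establishment for aquarium non-indigenous fish species from the fitted parameters.
__________________________________________________________________________________________
Predictor                                       X̄         s_X      mean of X'²   s_X'²
__________________________________________________________________________________________
Casual establishment model
Maximum temperature tolerance                   26.711    2.105    0.999         2.572
Minimum temperature tolerance                   22.479    2.324    0.999         3.929
Northernmost latitude                           3.410     15.714   0.999         1.600
Trophic level                                   3.101     0.515    0.999         1.535
Maximum length                                  21.588    25.072   0.999         9.990
Minimum temperature coldest month (BIO6x̄)       -7.464    7.104    0.999         1.669
Precipitation wettest month (BIO13x̄)            111.070   34.975   0.999         1.536

Persistence model
Maximum temperature tolerance                   27.940    2.626    0.993         2.013
Maximum length                                  51.412    39.781   0.993         3.140
Mean temperature warmest quarter (BIO10x̄)       23.251    3.532    0.993         0.834
Minimum temperature coldest month (BIO6s²)      8.964     5.148    0.993         1.047
__________________________________________________________________________________________

__________________________________________________________________________________________
Interactions                                         mean of X'Y'    s_X'Y'
__________________________________________________________________________________________
Min. temp. coldest month (BIO6x̄) & Max. length       -1.46∙10^-19    0.999
Min. temp. coldest month (BIO6s²) & Max. length      6.93∙10^-19     0.999
__________________________________________________________________________________________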
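To make the use of these reference values concrete, the sketch below standardizes a raw trait value and converts it to an odds multiplier and a probability. It assumes that standardization takes the usual form (value minus mean, divided by standard deviation) for both the first order term and the squared standardized term. The coefficients `B1` and `B2` are hypothetical placeholders rather than the fitted parameters of table 2.3 in the main text, and all function names are illustrative.

```python
import math


def standardize_first_order(x, mean_x, sd_x):
    """ASSUMED form of the first order standardization: (X - mean) / SD."""
    return (x - mean_x) / sd_x


def standardize_second_order(x_std, mean_sq, sd_sq):
    """ASSUMED form of the second order standardization, applied to X' squared."""
    return (x_std ** 2 - mean_sq) / sd_sq


def odds_multiplier(x1, x2, b1, b2):
    """Odds contribution of one predictor from its first and second order terms."""
    return math.exp(b1 * x1 + b2 * x2)


def probability_from_linear_predictor(z):
    """Logistic transform, equivalent to exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))


# Reference values for maximum temperature tolerance, casual model (Table B.3.1).
MEAN_X, SD_X, MEAN_SQ, SD_SQ = 26.711, 2.105, 0.999, 2.572
# HYPOTHETICAL coefficients, for illustration only (not the fitted values).
B1, B2 = 0.5, -0.1


def odds_ratio_vs_mean(x_raw):
    """Odds for a raw value relative to the odds at the predictor's mean."""
    x1 = standardize_first_order(x_raw, MEAN_X, SD_X)
    x2 = standardize_second_order(x1, MEAN_SQ, SD_SQ)
    m1 = 0.0  # the mean standardizes to zero
    m2 = standardize_second_order(m1, MEAN_SQ, SD_SQ)
    return odds_multiplier(x1, x2, B1, B2) / odds_multiplier(m1, m2, B1, B2)


or_at_mean = odds_ratio_vs_mean(MEAN_X)
or_at_30 = odds_ratio_vs_mean(30.0)
```

Dividing by the odds at the mean reproduces the convention of an odds ratio equal to 1 at each variable's mean value. With real coefficients, the same functions would be applied to every retained predictor and the resulting terms summed in the linear predictor.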


Appendix B.4

Table B.4.1. Odds ratios (OR) of species previously reported as matters of concern in the literature, in their riskiest locations.
______________________________________________________________
Species / location            OR casual     OR persistence
______________________________________________________________
Chromobotia macracanthus
  Florida                     34            171
  Hawaii                      53            3109
Kryptopterus bicirrhis
  Hawaii                      40            1698
Silurus glanis
  Illinois                    0.2           17
  New Mexico                  0.3           35
  North Carolina              0.2           16
Tanichthys albonubes
  Florida                     8             32
  Hawaii                      15            3138
______________________________________________________________


Appendix B.5

For casual establishment, we found Oddsref = 0.000052, based on a single propagule. The corresponding probability of casual establishment across species and locations on average was also 0.000052 (the likelihood of success roughly corresponds to the odds when prevalence is very low: 151 casual establishments out of 51,800 species/location combinations). The predictor with the strongest effect was maximum length. Our multiplicative risk factors for casual establishment showed that size affected the earlier sub-stage more strongly than persistence, with the risk of casually establishing versus failing being about 80 times greater for fish reaching slightly more than a meter in length (Fig. B.5.1e), which would instead be disfavoured for persistence. By contrast, while the risk of casual establishment increased unimodally with maximum temperature tolerance, this trait had a relatively weaker effect than on persistence. Nevertheless, species with maximum temperature tolerances higher than ~30°C would have at least twice the chance of casually establishing versus failing than the average species (Fig. B.5.1a; Table 2.4 in main text).

While all traits considered contributed to casual establishment, among environmental conditions, precipitation (BIO13x̄) had the strongest effect, with values departing from the average increasing the odds by up to more than 7 times (Fig. B.5.1g). While minimum temperature of the coldest month was also retained as a relevant predictor, for casual establishment it was its mean that affected risk, and less strongly than its variability influenced persistence. Specifically, areas with average minimum temperatures of the coldest month higher than ~6°C would see their risk of casually establishing versus failing at least doubled across all species (Fig. B.5.1f).

Finally, risk appeared to change with the interaction between minimum temperature of the coldest month and maximum length for the casual establishment sub-stage as well, with big species being favored at low temperatures (Fig. B.5.1h).
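The near-equality of odds and probability at very low prevalence, noted at the start of this appendix, can be checked numerically. The sketch below uses the reference odds of 0.000052 reported above; the function name and the choice of an 80-fold multiplier as an illustration are assumptions for this example, not part of the original analysis.

```python
def probability_from_odds(odds):
    """Convert odds to a probability: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)


# Reference odds for casual establishment, as reported in the text.
odds_ref = 0.000052
p_ref = probability_from_odds(odds_ref)

# Relative gap between odds and probability; it equals p itself,
# so it is negligible whenever the event is rare.
relative_gap = (odds_ref - p_ref) / odds_ref

# Illustration only: odds scale multiplicatively with a risk factor,
# and the probability follows closely while the odds stay small.
odds_large_fish = 80 * odds_ref
p_large_fish = probability_from_odds(odds_large_fish)
```

For odds this small, the probability differs from the odds only beyond the fourth significant digit, which is why the two can be used interchangeably here. The approximation degrades as odds approach 1, where the probability is only 0.5.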


Figure B.5.1. Effect of each significant predictor on the likelihood of casually establishing versus failing, expressed as odds ratio (OR), when gradually varying each predictor. OR equals 1 (dashed line) at each variable's mean value, reported by the corresponding point. The average values for the interaction plot (h) correspond to those of the respective main terms (e,f).


Appendix C

Supplementary material for Chapter 3


C.1 Phylogenetic adjustment

The phylogenetic adjustment was similar to the taxonomic adjustment, but each species' contribution (eqn. C.1) was weighted based on its distance from species i:

$$adjustment = \frac{\sum_{n} \left(w_{n,i}\, t_{n,j}\right)}{\sum_{n} w_{n,i}} - \bar{t}_{g,j} \tag{C.1}$$

where weights $w_{n,i}$ were defined as an inverse function of the phylogenetic distance between each within-group species n and species i, $t_{n,j}$ was the observed value for trait $t_{j}$ for each species n, and $\bar{t}_{g,j}$ was the expected within-group average of trait $t_{j}$ based on the multiple regression model (eqn. 3.2 in main text). Species for which we did not have phylogenetic information were assigned the average known distance between species i and all other within-group species in the phylogeny. This formulation corresponded to the taxonomic adjustment (eqn. 3.3 in main text) when phylogeny was not available and all the weights were the same.
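A minimal sketch of the weighted adjustment described above follows. The text specifies only that the weights are an inverse function of phylogenetic distance, so the reciprocal `1 / distance` used here is an assumption, as are the function name, the use of `None` for missing distances, and the example numbers.

```python
def phylogenetic_adjustment(trait_values, distances, expected_group_mean):
    """Weighted within-group mean of observed trait values minus the expected mean.

    trait_values: observed trait values of the within-group species.
    distances: phylogenetic distance of each species to the focal species
        (must be positive), or None when unknown.
    expected_group_mean: expected within-group average from the regression model.
    """
    known = [d for d in distances if d is not None]
    if known:
        # Species lacking phylogenetic information get the average known distance.
        mean_known = sum(known) / len(known)
        filled = [mean_known if d is None else d for d in distances]
        # ASSUMPTION: weight = 1 / distance (one possible inverse function).
        weights = [1.0 / d for d in filled]
    else:
        # No phylogeny at all: equal weights, i.e. the unweighted (taxonomic) case.
        weights = [1.0] * len(trait_values)
    weighted_mean = sum(w * t for w, t in zip(weights, trait_values)) / sum(weights)
    return weighted_mean - expected_group_mean


# Hypothetical example: three congeners, one without phylogenetic information.
adj_phylo = phylogenetic_adjustment([10.0, 20.0, 30.0], [1.0, 2.0, None], 15.0)
adj_equal = phylogenetic_adjustment([10.0, 20.0, 30.0], [None, None, None], 15.0)
```

With equal weights the function returns the plain within-group mean minus the expected mean, which matches the statement that the formulation reduces to the taxonomic adjustment when all weights are the same. Closer relatives pull the adjustment toward their own trait values.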

Using a leave-one-out cross-validation approach, we found that the inclusion of phylogenetic information, which was available for only a fraction of the species in the dataset (~5.4%), improved predictions very marginally for the subset of species for which we had phylogenetic data, producing R²_MSE values virtually identical to the ones obtained with the taxonomic adjustment (Fig. C.1). The results were likely due to the small number of species with phylogenetic data and the high amount of missing traits, which allowed us to predict only 3% of the validation set. For these reasons, the phylogenetic adjustment was not employed to impute FishBase.


Figure C.1. Scatter plots of predicted versus observed trait values, using correlations and the taxonomic (a) or the phylogenetic (b) adjustment at the genus rank, on the subset of species for which we had phylogenetic data. Each plot reports the corresponding R²_MSE value in the top right corner.
