Department of Physical Geography

Exploratory GIS Data Analysis and Regional and Transferred Maxent Modelling of the Round Goby Neogobius Melanostomus and Chinese Mitten Crab Eriocheir Sinensis in and County Coastal Areas

Devon Reid

Master’s thesis NKA 153 Physical Geography and Quaternary Geology, 30 Credits 2016

Preface

This Master’s thesis is Devon Reid’s degree project in Physical Geography and Quaternary Geology at the Department of Physical Geography, Stockholm University. The Master’s thesis comprises 30 credits (one term of full-time studies).

Supervisors have been Alistair Auffret and Ian Brown at the Department of Physical Geography, Stockholm University. Examiner has been Göran Alm at the Department of Physical Geography, Stockholm University.

The author is responsible for the contents of this thesis.

Stockholm, 17 June 2016

Steffen Holzkämper Director of studies

Abstract

This study takes a multidisciplinary approach to Species Distribution Modelling (SDM): predictive models have been developed for the current distribution and potential spread of two invasive species found in Baltic Sea waters. Invasive species in the Baltic have long been an ecological and economic problem, and the two species studied are well known worldwide for their adaptability in colonization and their detrimental effects on local ecology. First, the Round Goby (Neogobius melanostomus) has been steadily colonizing the Swedish Baltic coastline since 2008, and its impact on local ecosystems is not fully understood. Second, the Chinese Mitten Crab (Eriocheir sinensis), found in Swedish waters since the 1930s, is known to be a robust invader of ecosystems, but its presence in the Baltic is still not well explained. Four high spatial resolution models have been developed, three for Round Goby and one for Mitten Crab. Two models are specific to the Blekinge/Hanöbukten region of the Swedish Baltic Sea coast and show the predicted current distribution of Round Goby. The other two are predictions of Round Goby and Mitten Crab transferred, or projected, to other regions, using different approaches to setting model parameters and choosing variables, and show current and potential distribution. This study features exploratory data analysis and filtering using GIS tools, highly discriminant environmental variable selection and rejection, and several different approaches to modelling in Maxent using custom and default settings. Predictive maps have been developed showing current distribution and potential spread, together with explanatory tabular data outlining direct and indirect drivers of species presence. Maxent has proven to be a powerful predictive tool on a regional basis, and proximity to introduction locations plays a major role. Maxent, used in combination with spatial data modelling, exploration and filtering techniques, has yielded a valid explanatory model as well. Transferring predictions to other regions is quite sensitive, however, and can depend heavily on species, sampling strategy and similarity of habitat type. Round Goby predictions were successfully created regionally and transferred to Stockholm, but Mitten Crab predictions were not successfully transferred to Blekinge.


Table of Contents

1) Introduction and Background
1.1) Neogobius melanostomus (Round Goby)
1.2) Eriocheir sinensis (Chinese Mitten Crab)
1.3) Study Description and Goals
2) Methods
2.1) Study Area
2.2) Observation Data
2.3) Observed Absence Data
2.4 – 2.6) Processing, Exploration, Selection and Models
2.4a) Exploratory Data Analysis
2.4b) Grouping Analysis
2.5a) Environmental (Raster) Variable Processing
2.5b) Reasons/rules for variable rejection
2.5c) Requirements for Inclusion
2.6a) Models
2.6b) Model Method 1
2.6c) Model Method 2
3) Results
3.1) Overview of Maps and Output Tables
3.2) Validation and Thresholds
3.3) Predictions
3.3a) Models A and B
3.3b) Model C
3.3c) Model D
4) Discussion
4.1) Discussion of Model A Results and Conclusions
4.2) Discussion of Model B Results and Conclusions
4.3) Discussion of Model C Results and Conclusions
4.4) Comparison of Models C and D Results and Model D Conclusions
4.5) Assessment of Validation Results
4.5) GIS Contributions
4.5) General Conclusions
5) References
6) Appendices


1) Introduction and Background

Invasion of non-indigenous species (NIS) is acknowledged as one of the most important external drivers affecting the structure and functions of marine ecosystems globally (Ojaveer and Kotta, 2014). NIS in the Baltic Sea may substantially alter local biodiversity and modify not only the structure and functions of the ecosystems they live in, but also the services these systems provide (Simberloff, 2011; Strayer, 2012). Following an examination of different exotic/invasive species found along the Swedish Baltic Sea coastline, two subjects were selected which occupy similar near-shore (brackish) habitats: the Round Goby and the Chinese Mitten Crab. These two invasive species are of particular concern to the Swedish EPA (Naturvårdsverket), which in a 2011 report lists them among the biggest threats (along with two species of aquatic plant, Signal Crayfish and Zebra Mussel) to Lake Mälaren in Stockholm, which connects with the Baltic via locks that are considered quite permeable. Mitten Crab has already been found in Mälaren and numbers appear to be increasing, whereas Round Goby has not yet been found there, although the habitat is considered favorable (Naturvårdsverket Rapport 6375, see link (1)). These species, occupying a similar habitat type (or potentially so), are best modelled with high resolution raster variable data that I have collected or made, which show detailed nearshore physical and ecological gradients. In a previous study (Reid, 2015) my focus was Round Goby, for which I developed a modelling system for this near-shore species in Blekinge County. In this study, among other goals, I wanted to continue and improve upon modelling methods by addressing certain known issues prevalent in species distribution models, e.g. spatial autocorrelation, multicollinearity in predictor variables, filtering/subsampling and methods of model validation.
Round Goby has become even more interesting lately to resource managers working locally in (Florin, 2012) and on a Baltic-wide scale (Kotta et al., 2016), as more discoveries have been made in the last year. This is not necessarily a bad thing for modelling, as more data generally mean a more robust model. Mitten Crab data come from The Swedish Natural History Museum. Although most observations are found in freshwater, I determined a subsample of marine and brackish water occupants/data may be modelled to show predictive distribution within that type of habitat.

1.1) Neogobius melanostomus (Round Goby)

The Round Goby, Neogobius melanostomus, is considered to be among the top tier of invasive species in terms of its impact on colonized areas (Reid, 2015; Ojaveer and Kotta, 2014). It can also be very difficult to eradicate once established in a region (Sapota, 2012). It is still not known exactly what effect Round Goby will have on the Swedish Baltic Sea coastline because it is not yet established; however, in areas that have been colonized, the species has been found to have a detrimental effect on local fauna by competing for food and habitat, feeding on roe of native species (Balshine et al. 2005; Lauer et al. 2005; French III and Jude 2001; Steinhart et al. 2004; Fitsimmons et al. 2006; Roseman et al. 2006) and “causing decline of native benthic fish populations due to spawning interferences” (Almqvist, 2008; Janssen and Jude, 2001).

Native range of the Round Goby: “N. melanostomus occurs in all shallow water regions of the Black Sea, the Caspian Sea, the Marmara Sea and in all areas of the Sea of Azov” (Berg, 1949). The primary vector of introduction of the Round Goby in Sweden is most probably shipping traffic from the southern coasts of the Baltic, particularly the Gulf of Gdansk in Poland (NRC 1995; Nehring 2005): “A patchy distribution and long distances between native regions of occurrence and newly settled areas is characteristic for the current world distribution of N. melanostomus. This points to the existence of an effective transport mechanism for moving round gobies long distances. Most probably, transport occurs in ballast waters.” (Sapota, 2012) (Figure 1). Kotta et al. (2016) found on a Baltic-wide scale that Round Goby is strongly negatively correlated with wave exposure and positively correlated with shipping lanes, which was also found in my previous study (Reid, 2015).

“Scale”, meaning both region (extent) (Elith and Leathwick, 2009) and grain size or spatial resolution (Wisz et al., 2008; Roura-Pascual, 2006), is very important to consider in Species Distribution Modelling. For the purposes of detailed ecological understanding (Fleishman et al. 2001; Ferrier et al. 2002), a goal of this study, local (regional) approaches are preferable to global ones. This is especially the case with Round Goby, which has been shown to behave differently under varying environmental conditions on a regional basis in introduced ecosystems (Hirsch et al., 2016). In addition, fine scale (10 m) resolution grids have proven to preserve natural hydro-geologic features of the coastline, which are important for detailed analysis of this nearshore organism. In Reid (2015), this approach yielded results indicating correlation with fine gradient changes in protection from wave exposure, certain types of substrate, other modelled species and habitat (for food or similar niche), preferred slope angle and, to a smaller degree, water chemistry and depth; these are further articulated in this study.

1.2) Eriocheir sinensis (Chinese Mitten Crab)

The Chinese Mitten Crab (Eriocheir sinensis) is known worldwide for its invasion prowess (Drotz et al., 2010). It is listed among the 100 “world’s worst” invaders by the Global Invasive Species Database (Lowe et al., 2000). Mitten Crab tolerates a wide range of environmental conditions, as evidenced by the fact that it has been found in estuaries, lakes, rivers, wetlands and marine water. In colonized areas it has been known to burrow into and damage dikes, erode riverbanks, feed heavily on native species such as macroalgae, invertebrates and fish, and compete for food and habitat. Mitten Crab is also considered the second intermediate host of the human lung fluke parasite in Asia, and humans can become infected with the parasite through ingestion.
The fluke settles in the lungs and other parts of the body, and can cause significant bronchial or, in cases where it migrates into the brain and/or muscles, neurological illnesses (Gollasch, 2011, NOBANIS). The native range of Mitten Crab extends from Vladivostok in Russia in the north to southern China, and is centred on the Yellow Sea in China (frammandearter, link (2)). It has had a permanent presence in European waterways since its first discovery in 1912 and its subsequent spread to the German Baltic coast in the 1920s. It has been discovered in Estonia, Finland and Poland, and has been found somewhat regularly in Swedish waterways since the 1930s. Vectors of introduction are, again, shipping (ballast tanks and hull fouling of vessels) (figure 1), and imports of live animals for aquaria or for human consumption (Marquard 1926; Peters 1933). Mitten Crab is a catadromous species, meaning that it lives in fresh water and enters water of a certain salinity (generally >15‰) (Anger, 1991) to spawn. Reproduction of the Mitten Crab in the central, northern and eastern Baltic region is considered unlikely due to low salinity, and the individuals caught are assumed to have actively migrated into the region from the main European distribution area, the southeastern North Sea. This can be a distance of up to 1500 km, and it is likely that the dynamics of the North Sea population are regulating the occurrence of the Mitten Crab in the Baltic Sea area (Ojaveer et al., 2007). It is not considered established as of now, which is attributed to the low salinity within the Baltic Sea in combination with low surface water temperatures, and ovigerous females are rarely seen (Drotz et al., 2010, Anger 1991, 2003). Drotz et al. (2010) make a case for a possible breeding population in the region around the Göta Älv River estuary, in the vicinity of the North Sea and Baltic Sea confluence in Kattegat. Ojaveer et al. (2007) argue that breeding is impossible in the Baltic.
Both papers agree that observations have increased steadily since the 1990s. In addition to salinity requirements, other requirements for Mitten Crab reproduction are known, including an ideal water temperature of 15 °C and above. Different climatic models in Sweden, presented by the IPCC (Intergovernmental Panel on Climate Change), predict an increase in mean temperature of 4-6 °C in the coastal region of (SMHI, link (3)) by the year 2100. Increases in temperature could make the crab less sensitive to the lower salinity (Anger, 1991). In such a case there is a high possibility that a reproducing Mitten Crab population will become established in the Göta Älv River estuary in the future (Drotz et al., 2010). What that means for the Baltic is unclear. For this reason, in this study I do not try to show or predict any preferred “niche” but rather attempt to show and predict where the species can be found in two regions of the Swedish Baltic coast, and I draw no conclusions as to why.

1.3) Study Description and Goals

In this study, Round Goby and Mitten Crab observation locations have been modelled using a suite of data cleaning and filtering tools, high resolution habitat data and the machine learning tool Maxent (Maximum Entropy), as developed by Phillips (2006) and Phillips and Dudik (2008), to generate predictions of the distribution of each species. Four models were created, three for Round Goby and one for Mitten Crab. Two of the Round Goby models were confined to the extent of Blekinge County and Hanöbukten (study area, figure 2). Model A used a suite of predictor variables including those representing vectors of introduction (a distance-to-ports raster and marine traffic transponder data), yielding a prediction of current distribution by all correlative factors. Model B used a subset of data, to remove populations likely to be a direct result of introduction, and a subset of variables excluding vectors of introduction, in an effort to develop an explanatory model that shows predicted current distribution by habitat preference. Models A and B could be described as “realized niche” models; however, “A striking characteristic of SDMs is their reliance on the niche concept” (Guisan & Zimmermann 2000), which seems in many cases to be poorly understood (Guisan & Thuiller, 2005), so I chose current distribution as a descriptive term for these predictions.

Model C, the third Round Goby model, was built in Maxent by a different method. The Akaike Information Criterion (AIC) was used to identify the best possible model by adjusting the beta multiplier and selecting a subset of the highest contributing variables, in order to develop a model in Blekinge County and then transfer predictions to and the inner archipelago. This method is employed in an effort to show Potential Distribution, and it does seem to qualify within the niche concept as Potential Niche. Peterson & Soberon (2012) describe the Potential Niche as follows: areas that fulfill both the abiotic and biotic requisites of the species constitute the potential geographic distribution of the species. I nevertheless call it Potential Distribution. Model D, for Mitten Crab, was developed in the same way as Model C, by AIC, using data collected from citizen reports recorded by Naturhistoriska Riksmuseet (Drotz et al., 2010). Unlike Model C, Model D was developed with Stockholm environmental variables and observations and transferred to the Blekinge County study area; it is intended to show the current (Stockholm) and potential (Blekinge) distribution of Mitten Crab. All models are of local/regional scale and of high spatial resolution, in order to derive detailed ecological understanding of these species within the study areas.

The questions to be addressed by this project are: What are the differences in Round Goby distribution between this thesis and Reid (2015)? Are vectors of introduction still an important explanatory variable? What are the differences between the predicted distribution and explanatory models (models A and B), and what can be determined from output grids and variable contributions? Where are the areas of highest risk of colonization by Round Goby in Blekinge and Stockholm? Is Lake Mälaren at risk? Where is Mitten Crab most likely to be found in both areas? I also establish that spatial data modelling and exploration using GIS tools can be highly complementary to Species Distribution Models (SDM). Finally, I illustrate that transferability is highly dependent on sampling scheme, species, and similarities in habitat gradient between transferred areas.
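As a rough illustration of the AIC-based selection mentioned for Models C and D, the sketch below ranks candidate Maxent runs by information criterion. It is not the workflow used in the thesis: the small-sample correction (AICc), the function names and the example numbers are all assumptions, and the log-likelihood and parameter count of each run are taken as given inputs.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def aicc(log_likelihood, n_params, n_obs):
    """AIC with a small-sample correction (assumed here, not stated in the text)."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("AICc is undefined when n_obs <= n_params + 1")
    correction = (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
    return aic(log_likelihood, n_params) + correction


def best_by_aicc(candidates, n_obs):
    """candidates maps a beta multiplier to (log_likelihood, n_params).

    Returns the beta multiplier with the lowest AICc and all scores.
    """
    scores = {beta: aicc(ll, k, n_obs) for beta, (ll, k) in candidates.items()}
    return min(scores, key=scores.get), scores


# Hypothetical runs: a complex model (beta = 1) and a smoother one (beta = 2).
runs = {1.0: (-100.0, 20), 2.0: (-105.0, 8)}
best_beta, all_scores = best_by_aicc(runs, n_obs=71)
```

With few presence points, the penalty on the number of parameters can favour a smoother model even when its likelihood is slightly lower, which is the trade-off a beta multiplier search is meant to expose.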


Figure 1 Cross-section of ships showing ballast transportation of exotics. See link (4)

2) Methods

2.1) Study area

Both coastal study areas shown in figure 2 were selected because of high international marine traffic, similarity of habitat for model transferability, and availability of observation and habitat data in both locations. Round Goby in , first found in 2008 (Florin, 2012), has been steadily spreading throughout the Blekinge region. One observation lies on the boundary of the neighbouring Hanöbukten (Hanö Bight) region in the southwest of the study area, which was included because of its proximity, its similarity and the availability of quality habitat data. I also included it to explore the possibility of westward migration of Round Goby and for the transferability of Mitten Crab (study area, figure 2). Study area boundaries in Blekinge/Hanö match those of a collection of “basins and sub-basins” found in HOME nearshore data surveys (Marmefelt et al. 2007), described in 2.5a (Variable Processing), 2.5b (Rules for rejection) and table 2. The transferred Round Goby model is projected to the Stockholm study area. The Stockholm study area boundaries are based on a subset of measured variables created during the Marin Modellering i Stockholm project (Nyström-Sandman et al., 2013). The subset area was based on the actual distribution of Mitten Crab and is a cross-section of nearshore habitat including the inner Skärgården (archipelago). Mitten Crab potential distribution predictions are transferred to the Blekinge study area. This extent/scale within the study area should be a fair representation of favorable and unfavorable habitat for both species, the values of which allow the modeling software to discriminate, predict and extrapolate.


Figure 2 The Blekinge study area extends from 6136550 in the south to 6242900 in the north and from 447230 in the west to 576090 in the east (SWEREF99 TM, meters). The Stockholm study area extends from 6516168 in the south to 6651598 in the north and from 642944 in the west to 734844 in the east (SWEREF99 TM, meters). Mitten Crab observation points in Stockholm are shown in green. Round Goby observation points in Blekinge are shown in red.

2.2) Observation data

Round Goby observation data came from a variety of sources, including yearly gillnet monitoring programs undertaken by Sveriges Lantbruksuniversitet (SLU) c/o Ann-Britt Florin, the GBIF biological database, which serves invasive species data globally and locally (many sources), and Artportalen, another online biological database where members of the public can report non-indigenous species discoveries. Data collection spanned seven years, from the first discovery by a sport fisherman in our study region in 2008 to the latest survey results in 2015. Collection methods differ greatly between sport fishing reports and gillnet monitoring: gillnets are stationary and often in place for several days, often resulting in catches of multiple individuals per point location, while sport fishing reports record single individuals at a single time. The final count of Round Goby locations within the Blekinge study area was n = 71.

Mitten Crab observation data come from Stockholm’s Naturhistoriska Riksmuseet (Drotz et al., 2010). The observation data range in time from the first discovery in the 1930s to the present. All samples recorded before 1980 (and some after) are from museum records. The Baltic Sea has changed much in the last 86 years and the environmental variables are from recent studies, so points recorded before 1980 were removed. In addition, comments accompanying the coordinates for individuals address confidence in the accuracy of the coordinates, and all uncertain coordinates were removed from the dataset. The final count of Mitten Crab data points within the Stockholm study area was n = 76.

“It is important to cross-check coordinates by visual and other means” (Hijmans and Elith, 2011). In both cases data were cleaned. Observation locations within 25 m of habitat data (the shoreline) but outside the gridded area were moved to the closest instance of shoreline, while those further than 25 m away were removed from the dataset. Round Goby locational accuracy was verified with SLU, while Mitten Crab data custodians were not available to verify locations. For both species, in their respective study areas, I aggregated multiple individual occurrences into single points, with no additional weight given, to prevent bias in modelling due to autocorrelation.

2.3) Observed absence data

As part of the yearly gillnet monitoring program for Round Goby, absences at established monitoring locations are also recorded. Although the models presented here are presence-only, presence/absence modelling systems were considered. Since Round Goby is an exotic species in Swedish waters, it is not yet in equilibrium with its environment, and I decided that modelling with absences might introduce a bias, because absences may reflect the scale of the study area and the rate of spread rather than environmental conditions (Elith and Leathwick, 2009). In other words, modelling with absences may prevent the model from predicting a more representative niche, or correct distribution, regardless of (temporal) mechanisms of recruitment. These mechanisms are not well understood, so presence-only modelling was selected. In addition, using modelling methods similar to those in Reid (2015) allowed for comparison of the ongoing colonization of this species in Blekinge. The modelling system used in Reid (2015) and here, Maxent, is largely a presence-only system and has been shown to perform well with smaller sample sizes such as mine (Ng and Jordan 2001; Phillips and Dudik, 2008). Absences were used in the creation of a sample bias grid and in external validation (both described below). For external validation, absence data were filtered to cover only the same vicinity within the study area as the presence locations (n = 50), in an effort to eliminate the bias described above.
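The cleaning rule from section 2.2 (snap points within 25 m of the shoreline, drop points further away, then collapse repeated occurrences) can be sketched in a few lines. This is an illustrative simplification, not the GIS procedure actually used: the shoreline is assumed to be a list of grid cell centres, `inside` is a hypothetical test for whether a point already falls in the gridded area, and aggregation by 10 m cell is an assumption.

```python
import math


def clean_points(points, shoreline, inside, max_dist=25.0):
    """Keep points inside the grid, snap near misses to the shoreline, drop the rest."""
    kept = []
    for x, y in points:
        if inside(x, y):
            kept.append((x, y))
            continue
        # Nearest shoreline cell centre by Euclidean distance.
        nearest = min(shoreline, key=lambda s: math.hypot(s[0] - x, s[1] - y))
        if math.hypot(nearest[0] - x, nearest[1] - y) <= max_dist:
            kept.append(nearest)
        # Points further than max_dist from the shoreline are discarded.
    return kept


def aggregate(points, cell=10.0):
    """Collapse multiple occurrences in one grid cell to a single unweighted point."""
    seen = {}
    for x, y in points:
        key = (math.floor(x / cell), math.floor(y / cell))
        seen.setdefault(key, (x, y))
    return list(seen.values())


# Toy example: the gridded area is everything at or below y = 0.
shore = [(0.0, 0.0), (10.0, 0.0)]
raw = [(3.0, -5.0), (2.0, 20.0), (9.0, 30.0), (3.0, -5.0)]
cleaned = clean_points(raw, shore, inside=lambda x, y: y <= 0)
unique = aggregate(cleaned)
```

In the toy example the point 20 m inland is snapped to the shoreline, the point 30 m inland is removed, and the duplicated record is reduced to one.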
2.4a) Observation Data Exploratory Analysis

Spatial autocorrelation is known to introduce bias into modelling results, in that it can cause “overfitting” and artificially inflated AUC values in some species distribution models (Veloz, 2009; Guisan & Thuiller, 2005; Elith & Leathwick 2009; Hijmans, 2012; Bahn & McGill, 2013; Boakes et al., 2010). In this study, autocorrelation of observation points has been analysed using several methods. The Multi-Distance Spatial Cluster Analysis (Ripley’s K) function (ArcMap 10.2.2) was run on both Round Goby in Blekinge and Mitten Crab in Stockholm (appendices, figure A1), where confidence envelopes are developed from the distribution of random points (n = 99), over thirty iterations, calculating average distances within the study area. By this test, Mitten Crab is significantly clustered, while Round Goby is significantly clustered in part of the study area and dispersed in another. A common solution to problems created by correlated point data is to aggregate observations (by point or by polygon); in the case of Round Goby, however, data are already aggregated by number of observations in gillnet traps. In comparing methods for dealing with problems created by autocorrelation, Betts et al. (2006) argue that there has been no consistent and well-researched means of implementing controls for autocorrelation in modelling. Phillips et al. (2004) argue that “Although a species’ realized distribution may exhibit some spatial correlation, the potential distribution does not, so considering spatial correlation is not necessarily desirable during species distribution modeling”. With this reasoning in mind, three out of four models do not employ altered/aggregated data to compensate for the effects of autocorrelation.
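To make the logic of the Ripley’s K test concrete, the following is a minimal sketch of the L(d) transformation of K with a simulation envelope. It is not the ArcMap implementation: it applies no edge correction, assumes a rectangular study window, and the function names, seed and toy data are illustrative assumptions.

```python
import numpy as np


def ripley_l(points, distances, area):
    """L(d) = sqrt(K(d) / pi), with K estimated from pair counts (no edge correction)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    diff = pts[:, None, :] - pts[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    pair_d = d[np.triu_indices(n, k=1)]  # each unordered pair once
    # Count ordered pairs within each distance (hence the factor 2).
    counts = np.array([2 * (pair_d <= r).sum() for r in distances])
    k = area * counts / (n * (n - 1))
    return np.sqrt(k / np.pi)


def csr_envelope(n, bounds, distances, n_sim=99, seed=0):
    """Min/max L(d) over n_sim random (uniform) patterns of n points in a rectangle."""
    xmin, ymin, xmax, ymax = bounds
    area = (xmax - xmin) * (ymax - ymin)
    rng = np.random.default_rng(seed)
    sims = np.empty((n_sim, len(distances)))
    for i in range(n_sim):
        pts = np.column_stack(
            [rng.uniform(xmin, xmax, n), rng.uniform(ymin, ymax, n)]
        )
        sims[i] = ripley_l(pts, distances, area)
    return sims.min(axis=0), sims.max(axis=0)


# Toy data: 50 points tightly clustered in a 100 x 100 window.
rng = np.random.default_rng(1)
clustered = rng.normal(50, 1, size=(50, 2))
dist = np.array([5.0, 10.0])
observed = ripley_l(clustered, dist, area=100 * 100)
low, high = csr_envelope(50, (0, 0, 100, 100), dist)
```

Observed L(d) values above the upper envelope indicate clustering at that distance, and values below the lower envelope indicate dispersion.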
Several further points are important to consider in using unaltered data: observation records cover a broad temporal range; data points are relatively few (n ≤ 76); several grids of high resolution habitat data may show fluctuation of habitat occupancy on a very fine scale; and the reasons for correlation of data points are themselves important for, and could be explained by, the models. As an example, N. melanostomus is known to be negatively correlated with wave exposure (Kotta et al., 2014, Reid, 2015). In the study area of Blekinge there are only so many places that satisfy preference criteria in this regard. Conversely, the full spectrum of wave exposure values from high to low is required to determine this negative correlation within the study area. This is also an argument against altering habitat data (for an example, see Kramer-Schadt et al., 2013). Clustered data are filtered in model B, as I will describe below.

Another visualization of clustering can be seen in the results of the Optimized Hotspot Analysis tool in ArcMap 10.2.2, which identifies significant spatial clusters of high (hot) and low (cold) spots by aggregating the observation locations into weighted features. Using the distribution of the weighted features, it identifies an appropriate scale of analysis and creates fishnet polygons with dimensions based on this. The statistical significance reported in the output polygon features is automatically adjusted for multiple testing and spatial dependence using the False Discovery Rate (FDR) correction method (ArcGIS 10.2.2) (figure 3).
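The idea behind a False Discovery Rate correction can be illustrated with the standard Benjamini-Hochberg step-up procedure. This is only a sketch under the assumption that a procedure of this kind is applied to the per-polygon p-values; the exact ArcGIS implementation may differ, and the function name and example values are hypothetical.

```python
def fdr_significant(p_values, alpha=0.05):
    """Benjamini-Hochberg: flag which tests remain significant at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p-value falls under its rank-scaled threshold.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is declared significant.
    flags = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            flags[i] = True
    return flags


# Hypothetical p-values for four fishnet polygons.
strict = fdr_significant([0.001, 0.04, 0.03, 0.2])
step_up = fdr_significant([0.02, 0.03, 0.035, 0.9])
```

Compared with testing every polygon at the same unadjusted threshold, this reduces the number of polygons reported as hot or cold spots purely by chance.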

Figure 3 Getis-Ord Gi* optimized hotspot analysis of Mitten Crab in Stockholm (left) and Round Goby in Blekinge (right). Significant clustering is found in both study areas.

Another relatively simple data exploration tool, Mean Center (ArcGIS 10.2.2), calculates the geographic center of concentration for a set of features. In the case of Round Goby, which is colonizing the coastline, the mean center is very close to where the first discovery was made. In addition, when the mean center is overlaid on marine traffic transponder data (described in 2.5a Environmental Variable Processing), the location shows an average traffic rate of 600 boats per month, which probably points to the vector of introduction itself, ship ballast (Sapota 2012). This location could be where large ships empty ballast before entering shallow waters to dock (figure 4). Mitten Crab observations include many freshwater locations for which no habitat data are available. Since the reason for presence in marine versus fresh water remains unclear, it makes little sense to use a small subset of the data to determine the mean center of occurrence in marine water. No further alteration has been applied to Mitten Crab observation data.
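The mean center itself is simply the average of the x and y coordinates of the observations, optionally weighted. A minimal sketch, with a hypothetical function name and toy coordinates rather than the thesis data:

```python
def mean_center(points, weights=None):
    """Return the (optionally weighted) mean x and y of a set of points."""
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    x = sum(w * p[0] for p, w in zip(points, weights)) / total
    y = sum(w * p[1] for p, w in zip(points, weights)) / total
    return x, y


# Unweighted center of four corner points, then a weighted two-point example.
center = mean_center([(0, 0), (10, 0), (10, 10), (0, 10)])
weighted = mean_center([(0, 0), (10, 0)], weights=[1, 3])
```

Weights could, for example, represent the number of individuals caught at each location, which would pull the center toward heavily populated sites.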

Figure 4 Mean center of N. melanostomus observations (Euclidean). An average of 600 ships per month travel through this location.


2.4b) Grouping analysis and Round Goby data subset

Grouping and subsequent filtering of Round Goby data for model B (subset of data, subset of variables) was based on the results of variogram tests and Grouping Analysis tests in ArcMap 10.2.2. Variograms were run on Round Goby data to identify outliers, both spatially and with regard to the value of each environmental variable at point locations. Grouping Analysis tests were run to look for clusters of Round Goby data based only on commonalities and differences in variable values at data points. Both variogram analysis and Grouping Analysis identified the same group of presences as data outliers. Grouping Analysis results, including the separate groups identified, can be seen in a graph with summary statistics (figure 5), and in map form (figure 6).

Figure 5 Top: Two distinct groups are shown by comparison of the summary of standardized values of environmental variables at point locations. Standardization of the attribute values involves a z-transform, where the mean of all values is subtracted from each value and the result is divided by the standard deviation of all values. Standardization puts all of the attributes on the same scale even when they are represented by very different types of numbers. Bottom: the global Mean, Standard Deviation (Std.Dev.), Minimum, Maximum, and R² values for all data in each analysis field. The R² value (coefficient of determination) indicates how well a particular variable was able to discriminate among features; the variable with the highest R² divides the observations into groups most effectively. (ArcGIS 10.2.2)
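The z-transform and the per-variable R² described in the caption can be sketched as follows. The R² here is computed as the share of a variable's total variation that lies between groups, which is an assumption about how the statistic is defined; the use of the population standard deviation, the function names and the toy values are likewise illustrative.

```python
import numpy as np


def standardize(values):
    """Z-transform each column: subtract the column mean, divide by its std."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean(axis=0)) / v.std(axis=0)


def group_r2(values, groups):
    """Per-variable R2 = (total SS - within-group SS) / total SS."""
    v = np.asarray(values, dtype=float)
    g = np.asarray(groups)
    total_ss = ((v - v.mean(axis=0)) ** 2).sum(axis=0)
    within_ss = np.zeros(v.shape[1])
    for label in np.unique(g):
        sub = v[g == label]
        within_ss += ((sub - sub.mean(axis=0)) ** 2).sum(axis=0)
    return (total_ss - within_ss) / total_ss


# Toy data: column 0 separates the two groups, column 1 does not.
data = [[1, 10], [2, 20], [9, 10], [10, 20]]
labels = [1, 1, 2, 2]
z = standardize(data)
r2 = group_r2(data, labels)
```

In the toy data the first variable has an R² close to 1 and the second an R² of 0, mirroring how the figure ranks variables by their ability to separate the groups.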


Figure 6 Map of N. melanostomus observation locations in Blekinge, grouped by differences in values of environmental variables at point locations. An argument can be made that group 1 locations are the result of a reproducing population and preferred habitat, while group 2 locations are most likely the direct result of introduction.

2.5a) Environmental (Raster) Variable Processing

Predictor variables used in these models represent physical features/processes, habitat dynamics and geochemical properties that should (or may) be important habitat requirements of each species. The variables, in raster format, come from previous projects undertaken by the aquatic consulting agency Aquabiota AB (Fyhr et al. 2015; Nyström-Sandman et al., 2013; Sundblad et al., 2014a), SGU (Sveriges Geologiska Undersökning), IMO (International Maritime Organization) and SMHI (Swedish Meteorological and Hydrological Institute); some were compiled for previous models (Reid 2015), and some were made or compiled for this project. For a complete list of variables used in the respective models, see table 2, which includes processing and source details.

Secchi depth, which is a measure of eutrophication by clarity of water to a certain depth, was only used in Model D. Stockholm Secchi depth grids were created by Floren et al. (2012), where 100 m resolution MERIS FR data are combined with Landsat TM (thematic mapper) and ETM+ (enhanced thematic mapper) data, making the best use of the respective sensors in determining water clarity by depth (m). I resampled this grid from 100 m to 10 m for the Stockholm study area to match the other predictor grids. The resampling method was nearest neighbor, as this method (although normally best used with discrete data) does not change the values of cells and instead splits each pixel into a 10 × 10 block of cells of equal value. Blekinge Secchi values were derived from satellite data: Landsat TM with an EO-sensor with a resolution of approximately 30 m. Regression analysis between satellite images and a large number of field measurements was used to create a high-resolution map of Blekinge and Hanöbukten (Philipson et al. 2013), which I resampled to 10 m using nearest neighbor.
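Nearest neighbor resampling to a finer grid by an integer factor amounts to repeating each cell value. A minimal sketch, assuming the coarse and fine grids are aligned and the factor is a whole number (10 for 100 m to 10 m); the function name and toy grid are hypothetical:

```python
import numpy as np


def upsample_nearest(grid, factor):
    """Repeat every cell factor x factor times; cell values are not altered."""
    g = np.asarray(grid)
    return np.repeat(np.repeat(g, factor, axis=0), factor, axis=1)


# A 2 x 2 grid of 100 m cells becomes a 20 x 20 grid of 10 m cells.
coarse = np.array([[1, 2], [3, 4]])
fine = upsample_nearest(coarse, 10)
```

Because no new values are interpolated, the finer grid carries exactly the information of the coarse one, which is the false accuracy that comes with this kind of resampling.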
Resampling to a finer resolution does introduce false accuracy; however, most predictors are at 10 m resolution, which preserves characteristic features of the coastline, and this is very important in modelling these nearshore species.

Bathymetry in Blekinge: A continuous depth grid for Skåne and Blekinge County and the entire Hanöbukten study area was created from point format depth data. The depth data are based on hydrographic surveys carried out at different times and with different methodologies within the study area. Examples of the methods used to collect digital depth data are single and multibeam echosounding and digitized depth curves. To convert point data into a continuous depth grid at 10 meter resolution, a semi-variogram model for interpolation was used. During interpolation, the search used at most ten points (at least two) in eight directions. Root mean square error, average standard error, standard error, and standardized root-mean-square error were recorded for each square (Fyhr et al., 2015).

I created a mosaicked bathymetry grid in Stockholm from a combination of three different grids. Two come from Nyström-Sandman et al. (2013). The first is a nearshore grid covering depths of 0-6 m with a spatial resolution of 10 m. The second has a far coarser 200 m spatial resolution and covers deeper areas (considered secret information by the Swedish Maritime Administration). While the fine resolution grid covered all nearshore areas within the Stockholm study area, the 200 m grid did not, so a third, 50 m grid created by Svealandskusförbund (in cooperation with Stockholm University Department of Systems Ecology) was used (RT38 2.5 gon V reprojected to SWEREF99 TM). The 50 m grid was created from digitized chart information, with results interpolated by spline, providing a continuous differentiable surface. Metadata list limitations of the interpolated surface in military areas where chart information is not comprehensive, and on steep gradients where the “interpolated surface is not reliable” (svealandskustenförbund, link (5)). Another reason for including the third (50 m) grid was that a sharp artificial gradient was discovered between the 10 m and 200 m Nyström-Sandman et al. (2013) data grids, and both modelled species are most often found in nearshore areas. For Round Goby, average depth was recorded as -7.3 m; for Mitten Crab, -9 m. The 50 m grid showed far more plausible depth values at the 10 m grid boundary, but less plausible values for deeper water areas. All three were mosaicked together using the MEAN operator (ArcGIS 10.2.2) and extracted by mask using the common study area boundary. Averaging values introduces error in the nearshore, but creates a smoother gradient at the boundaries between grids, where both species are most often found.

I aggregated a substrate grid in Blekinge from three vector features at three different spatial scales.
First, a 10 meter scale substrate feature was generated by Fyhr et al. (2015) for the Blekinge region, validated and edited from SGU data with information from drop camera surveys. In addition, SGU is in the process of updating seafloor substrate data piece by piece; the resulting scale is 1:100,000 (although actual spatial resolution varies throughout the study areas). The substrate is determined from bottom composition measurements made by side-scan sonar. Normally there are 1000 meters between these measurement lines, and resolution between points is greatly affected by stratification in the water, which is due to vertical differences in water temperature and salinity. Striations in the seabed, along with shallow depths, cause geological uncertainty in sonar data that increases with distance from the measurement lines (SGU, 1:100,000 link (6)). Finally, older SGU data covering the offshore area outside of the Marmoni project study area, which have not yet been updated, were generated at a scale of 1:500,000. Normally there are 13000 meters between survey lines; however, a geological interpretation made from existing charts has been used to fill gaps between survey lines. Again, spatial resolution varies throughout the study area (SGU, 1:500,000 link (7)). I combined the three features by substrate type, using Merge (ArcGIS10.2.2), to ensure coverage over the study area. I then converted the aggregated polygon feature to a raster grid at 10m spatial resolution, which matched the average spatial resolution of the Fyhr et al. (2015) manipulated survey within the area of study populated by Round Goby, and masked it by study area. For scale boundaries see (appendices, figure A2). For a description of substrate types see (table 2). Classifications are based on marine geological map databases and on bottom surface observations classified according to the EUNIS system (now called the HUB system) (Hallberg et al. 2010).
SGU Substrate feature accuracy in the Stockholm study area was determined by Nyström-Sandman et al. (2012) to be of poor quality, and project parameters did not include drop-camera surveys/updates to improve these data. Test modelling runs using these data showed considerable error, so I also did not include substrate data in Stockholm as a predictor variable (discussed in section 4.6).

I created a Distance to Ports grid by digitizing point locations based on the following information. First, a list of industrial port coordinates (GCS WGS84, decimal degrees) in Blekinge and Stockholm was gathered from United Nations Trade and Transport Locations (link (8)). The resolution of the coordinates was rather poor, so they were used only as an approximate guide to actual locations. In addition, AIS transponder (marine traffic) data were visually interpreted to locate actual lanes of frequent travel and destination points. For the Stockholm area, Polygon data provided by
Swedish Sjöfartsverket for one month of travel were compiled; values were added together and converted to raster format with 10m grid pixels to produce a gradient of traffic from low to high, which was used to determine regular or infrequent usage of port destinations. For the Blekinge area, a year's worth (2011) of AIS data, averaged by month and collected in raster format, was used for visual interpretation. Also, Bathymetry grids were visually inspected to determine whether certain waterways are indeed passable by large ships (a depth greater than 4m was determined to be acceptable for large ships to land). Google Earth imagery spanning several years was analyzed to determine the scale of each port, its type of use (industrial etc.) and whether larger ships could be identified. For the Stockholm area, polygon data showing “harbors, industry and built-up areas along the shoreline in Swedish part of study area 3” along with “Map of Marinas along the shoreline in Swedish part of study area 3” polygons, digitized through interpretation of aerial photos and derived from Corine landcover by Metria Miljöanalys AB for the Swedish Environmental Protection Agency (available through BALANCE/HELCOM, link(9)), were also used as guides. All factors listed were considered in the placement of points representing destination/departure locations. To convert these points to a raster grid, the Euclidean Distance tool (ArcMAP10.2.2) was used, masked by the respective study areas, resulting in 10m raster grids showing the distance in meters from each cell to the nearest port point in each study area.

Pike (Esox lucius) and mud snail (Hydrobiidae) modelled probability of presence grids did not extend all the way to the shoreline in a few instances. In these cases, I created a constant raster grid with the lowest possible value (for each), in agreement with the original data creators Fyhr et al. (2015), and added it to the existing grids using Raster Calculator (ArcGIS10.2.2), ensuring full coverage within the study area, which is required for the modelling methods employed here.

SMHI HOME data (Temperature, Salinity, and Total Nitrogen) (link (10)) were derived from the Probe model (Svensson 1998; Sahlberg, 2009 (link (11))). Probe has a high vertical resolution, with a vertical grid cell size of 0.5m in the top 4m. The grid cell size then increases with depth: in the depth interval 4-70m the cell size is 1.0m, and from 70-100m it is 2m. These data were interpolated by basin and sub-basin extents as described below (SMHI Coastal Zone Model 2009). These data were resampled to 10m for use in the Marmoni project (Fyhr et al., 2015 link (12)).

Marine traffic raw data were compiled by the Danish Maritime Authority (DMA) for the International Maritime Organization (IMO). The pixel value of the raster data set is the monthly density average for the entire year 2011 in the Blekinge study area as recorded by the Automatic Identification System (AIS). AIS transponder data in the Stockholm study area were processed as described (distance to ports), but not used in models.

Wave Exposure is based on “fetch”, i.e. the distance of open water over which the wind can act upon the sea surface and waves can develop. Fetch is calculated for every sea grid cell of the map. This is done by starting at the map edge of the incident-wind direction and increasing the grid cell values by the size of one cell (in meters) for each sea grid cell in the propagation direction, until land is reached. Units are gradients from high to low exposure based on this calculation (SWM, Issaeus, 2004).

All additional predictor variables are described in (table 2), with a short description of genesis and source citations.
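The fetch accumulation described above can be sketched for a single wind direction. This is a deliberately simplified illustration, not the SWM implementation: it assumes wind blowing from the west (left map edge) only, and the function name `fetch_from_west` and the small land/sea mask are hypothetical.

```python
def fetch_from_west(is_sea, cell_size_m):
    """Accumulate open-water distance row by row, starting at the west map edge.

    is_sea: list of rows, True for sea cells and False for land cells.
    Returns a grid of fetch values in meters (0.0 on land).
    """
    fetch = []
    for row in is_sea:
        running = 0.0
        out_row = []
        for sea in row:
            if sea:
                running += cell_size_m   # one more sea cell in the wind direction
                out_row.append(running)
            else:
                running = 0.0            # land interrupts the fetch
                out_row.append(0.0)
        fetch.append(out_row)
    return fetch


# Hypothetical mask: an island in the first row, a headland in the second.
mask = [[True, True, False, True],
        [False, True, True, True]]

for row in fetch_from_west(mask, 10.0):
    print(row)
```

Cells directly behind land restart at one cell size, which is how sheltered water ends up with low exposure values. A full wave exposure model would combine many wind directions, which this sketch does not attempt.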
I created a weighted bias raster grid, for use in models A, B and C, showing locations within the study area that have been sampled for Round Goby in the past but where no observations were made. “Presence only data… commonly suffer from large, unknown biases due to their typically haphazard collection schemes” (Fithian et al. 2014). The purpose of the grid was to establish that there have been (other) sampling efforts in the areas outlined. A constant raster with a value of 0.1 was created for the entire study area, and values of 1.0 were assigned to sampled areas. The bias grid acts as a weighting system for coordinates within the study area (indicating biased sample collection) in calculations of predictive probability of presence in models. High bias locations were created with a diameter of 100m because gillnet survey methods are known to sample a larger area due to their longer duration in place (Ann-Britt Florin, personal communication 2015).

2.5b) Reasons/rules for variable rejection

Firstly, a number of (partially measured, partially modelled) variables were rejected based on modifiable areal unit problem (MAUP) edge/shape effects of interpolated grids (Openshaw, 1983). The SMHI HOME grid data (Marmefelt et al. 2007) on hydrographic/chemical conditions, described previously and in (table 2), use basins and sub-basins which follow the classification of water bodies according to Swedish waters archives (SVAR) (Fyhr et al., 2015). On a fine scale these boundaries introduce somewhat abrupt edges that are not seen in the 10m spatial resolution bathymetry or validated Substrate features. This is likely an artefact of aggregating rasters from separate sub-basins (using the blend mosaic method in Fyhr et al., 2015). Some error introduced by these boundaries was accepted due to the potential mechanistic importance of the variables Temperature, Salinity and Total Nitrogen; however, several other variables were removed to eliminate propagation of this error through iterations of Maxent models. See (figure 7) for an example of these edge effects in modelling.

Figure 7 Left: Maxent model output using several HOME data grids, center: 10m bathymetry showing no sharp physical boundary, right: Maxent model output using a limited number of HOME sub-basin scale data grids.

Multicollinearity, although not proven to affect predictive ability (Dormann et al., 2013; Harrell 2001), may cause incorrect estimates of the importance of single predictors as modelled in Maxent (Phillips et al., 2004; Shan et al., 2006; Elith et al., 2011). To explain: the Maxent algorithm may rely on two correlated predictors to produce a predictive model, but it cannot necessarily determine which one better explains organism presence. To compensate for this, highly correlated variables that contribute less to models are removed before modelling. Correlation among environmental variable grids was determined using Band Collection Statistics (ArcMAP10.2.2). Pearson’s r correlation values were calculated by comparing the values of each pair of grids cell by cell within the study area. This (partial) method for variable rejection was used in models A and B, as described in Models. An example of correlation matrix output listing r values is shown in (appendices, table A1), where lower contributing and correlated variables (>.60) are identified.

Multicollinearity test results were used in conjunction with an iterative/step-wise approach to variable selection: models were run, and the jackknife measure of variable importance provided in the output was used to determine which of the correlated variables contribute most to models and whether or not removing them would improve model performance. Jackknife is a setting in Maxent that evaluates the relative strength of each predictor variable. Training and test gain (of training and test data during cross-validation, explained in Models) is calculated for each variable alone, and the decrease in training and test gain when the variable is omitted from the full model helps to determine which variables are most important (Yost et al., 2008; Phillips tutorial on Maxent (link (13)), 2009; Baldwin, 2009).
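The correlation screening described above can be sketched as follows. This is an illustration only: the layer names and cell values are hypothetical, each grid is represented as a flat list of cell values, and applying the 0.60 threshold to the absolute value of r is an assumption (the text only states ">.60").

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of cell values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlated_pairs(layers, threshold=0.60):
    """Return (name_a, name_b, r) for every pair of layers with |r| > threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(layers.items(), 2):
        r = pearson_r(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical cell values extracted from three co-registered grids.
layers = {
    "depth":    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "slope":    [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "salinity": [7.0, 6.0, 7.0, 6.0, 7.0, 6.0],
}

print(correlated_pairs(layers))
```

Pairs flagged this way are only candidates for removal; which member of a pair is dropped is decided from the jackknife results, as described in the text.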
Variable selection methods differed for models C) Round Goby in Blekinge transferred to Stockholm and D) Mitten Crab in Stockholm transferred to Blekinge. One difference is that only 9 of the 13 variables were common to Blekinge and Stockholm, so only this subset of 9 was tested using the following methods, resulting in a model designed to be neither overly simple nor overly complex (Warren and Seifert 2011); final variables are listed in (table 2). “Niche models have been criticized for low performance in identifying the most important range-limiting environmental variables and in transferring habitat suitability to different environmental conditions” (Warren and Seifert, 2011; Warren et al., 2014). The Maxent Variable Selection package (R), developed by Jueterbock et al. (2016) and based in part on code from the ENMtools package (R) (Warren et al., 2008; Glor and Warren 2011; Nakazato et al., 2010; Warren and Seifert 2011; Warren et al., 2010), identifies the most important set of variables, with model performance assessed by the Akaike information criterion, AICc (Akaike, 1974). The program first tests for Pearson correlation (as before); then, through an iterative process, it tests different regularization values together with the correlation and contribution of variables to determine the best possible set of variables by sample size corrected AIC (AICc), instead of by ROC/AUC as used in Maxent internal validation. This is described further in section 2.6c Model Method 2. (Table 1) shows results of these tests for both areas.

Table 1 “minimization of AICc values generally favors models that recognize the fundamental niche of a species and that are better transferable to other scenarios” (Warren and Seifert, 2011). Betamultiplier = regularization #, variables = number of environmental variables in best model, samples = number of observations used to “train” model, loglikelihood = function of a parameter for a given outcome, AIC (Akaike Information Criterion), AICc = sample size corrected, BIC = Bayesian information criterion, AUCtest = area under curve of Test data, AUCtrain = area under curve of training data, AUCdiff = difference between testing and training data results.

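| Model | betamultiplier | variables | samples | loglikelihood | AIC | AICc | BIC | AUC.Test | AUC.Train | AUC.Diff |
|---|---|---|---|---|---|---|---|---|---|---|
| C | 1 | 4 | 71 | -1001 | 2033 | 2042 | 2067 | 0.9635 | 0.9655 | 0.002 |
| D | 3.5 | 6 | 76 | -1060 | 2135 | 2137 | 2151 | 0.908 | 0.9193 | 0.0113 |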
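The information criteria reported in Table 1 follow standard definitions, which the sketch below reproduces for illustration. The input values are hypothetical and are not taken from Table 1. In particular, the parameter count k is an assumption: it is the number of fitted model parameters, which is not the same as the "variables" column of the table.

```python
import math


def information_criteria(log_likelihood, n_params, n_samples):
    """Return (AIC, AICc, BIC) for a fitted model.

    AIC  = -2 ln(L) + 2k
    AICc = AIC + 2k(k + 1) / (n - k - 1)   (small-sample correction)
    BIC  = -2 ln(L) + k ln(n)
    """
    aic = -2.0 * log_likelihood + 2.0 * n_params
    aicc = aic + (2.0 * n_params * (n_params + 1)) / (n_samples - n_params - 1)
    bic = -2.0 * log_likelihood + n_params * math.log(n_samples)
    return aic, aicc, bic


# Hypothetical model: log-likelihood -1000, 10 parameters, 71 occurrence records.
aic, aicc, bic = information_criteria(-1000.0, 10, 71)
print(round(aic, 1), round(aicc, 1), round(bic, 1))
```

Because the correction term grows as k approaches n, AICc penalizes complex models more heavily than AIC when there are few occurrence records, which is the situation in both study areas.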

2.5c) Requirements for Inclusion

At the start of this project there were 30 variables to select from; this list was pared down using several methods for rejection and inclusion to create the most parsimonious model (Merow et al., 2014; Warren and Seifert, 2011). At most, 13 variables were used, in model A (all Round Goby observations in Blekinge with all environmental variables), and at least 4, in model C (Round Goby transferred to Stockholm). Before variables were selected by the methods mentioned above, the first round of selection was based on a priori mechanistic assumptions, e.g. Round Goby is largely a nearshore species (Sapota, 2012; Kornis et al., 2012) and would be affected by nearshore gradient variables. Also, some variables were included to shed light on specific questions. To address “Will proximity to potential predator species habitat affect distribution?”, predictive probability of Pike presence was included; to address “which of known possible food sources (Almqvist, 2008) may affect distribution”, Hediste diversicolor (invasive Ragworm), Hydrobiidae (invasive mud snail), Macoma balthica (indigenous clam) and Mytilus edulis (indigenous Blue Mussel) were included. Variables were kept even if they contributed little, provided this was at no detriment to the predictive power of the model (Reid, 2015; Phillips and Dudik 2008). Also, a priori knowledge was used to select features based on the methods used to create the original data and on whether these features met criteria for compatibility with others. For example, many of the variables were created for the Aquabiota AB Marmoni project (Fyhr et al., 2015 (link (12))), where the methods used in genesis (scale, field data and modelling methods, data collection standards, error) are clearly documented and similar, and the results have been field validated, which makes them useful in comparisons with new environmental data.
Variables in Stockholm were created using similar methods, which should provide standardization for model transferability between both regions.


Table 2 Final list of predictor variables used in Maxent modelling. Hydrographic, physical and modelled variables are separated here by continuous, ordinal and categorical raster data formats. Description is a summary of genesis, Source is accompanying published documentation or data stewardship entity. Models used: described in models section of this report. Stockholm (S) Blekinge (B) shows where variables were used in each respective study area.

| Data format | Type | Predictor variable | Description | Source | Models Used | Stockholm (S) Blekinge (B) |
|---|---|---|---|---|---|---|
| Continuous | Hydrographic | Wave exposure | Fetch based surface wave model (SWM) incorporating wind direction and strength (models water surface). | Issaeus, 2004; Sundblad et al 2014a | ABCD | SB |
| Continuous | Hydrographic | Total Nitrogen (mg/m3) | HOME Water model system, connecting several models for land, lakes, rivers and coastal waters. Created and implemented at SMHI | Marmefelt et al 2007 | AB | B |
| Continuous | Hydrographic | Bottom Salinity (ppt) | HOME Water model system | Marmefelt et al 2007 | ABD | SB |
| Continuous | Hydrographic | Bottom Temperature (c) | HOME Water model system | Marmefelt et al 2007 | CD | SB |
| Continuous | Hydrographic | Secchi (m) | Empirically derived Secchi disk depth based on field observations and satellite images | Philipson et al 2013 | D | SB |
| Continuous | Physical | Ship traffic | Shipping traffic, relative intensity estimated in the Baltic Sea recorded by Automatic Identification System (AIS) transponders aboard ships, collected by the International and Danish Maritime Organizations (IMO)(DMO) | Naturvårdsverket 2010 | A | B |
| Continuous | Physical | Depth (m) | Blekinge: A continuous depth grid in meters was created from point format depth. Stockholm: Mosaicked coarse and fine spatial resolution from three different products / two sources | Fyhr et al 2015, Sjöfartsverket, Hansen 2012, Svealands Kustvattenförbund | ABD | SB |
| Continuous | Physical | Slope (degrees) | Calculated by taking the difference in depth from one raster cell to another, and is given in degrees where zero degrees describes a completely horizontal surface and 90 degrees a vertical surface | Fyhr et al 2015 | ABCD | SB |
| Continuous | Physical | Distance to Ports (m) | Euclidean distance of areas within the study area to large ship ports | Devon Reid | AC | SB |
| Continuous | Modelled | Pike (predicted probability) | Predicted probability produced using Random Forest, based on field data from 402 stations collected by small underwater detonations, AUC 0.96 | Fyhr et al 2015 | AB | B |
| Ordinal | Modelled | Hediste diversicolor | Zoobenthos were modelled using Random Forest based on field data collected with benthic Van-Veen grabs, AUC 0.80 | Fyhr et al 2015 | AB | B |
| Ordinal | Modelled | Mytilus edulis | Modelled using Random Forest based on field data collected via dropvideo, snorkelling and diving. AUC 0.91 | Fyhr et al 2015 | AB | B |
| Ordinal | Modelled | Macoma balthica | Same collection and modelling method as Hediste diversicolor, AUC 0.94 | Fyhr et al 2015 | AB | B |
| Ordinal | Modelled | Hydrobiidae | Same collection and modelling method as Hediste diversicolor, AUC 0.80 | Fyhr et al 2015 | AB | B |
| Categorical | | Substrate | Regional seafloor survey updates, (SGU) drop camera surveys (Fyhr, et al., 2015). Substrate types are: 1 – bedrock, 2 – mixed coarse substrate, 3 – mixed medium coarse, 5 – sand, 8 – mud, 9 – combination artificial substrate (SGU). | SGU, Fyhr et al 2015 | AB | B |


2.6) Models

Maxent is a general-purpose method for characterizing probability distributions from incomplete information. In estimating the probability distribution that defines a species’ distribution across a study area, Maxent calculates an estimated distribution which must agree with everything that is known (or inferred from the environmental conditions where the species has been observed) but should avoid making any assumptions that are not supported by the data (Phillips and Dudik, 2008; Pearson, 2007). Maxent is a deterministic machine learning method that predicts the distribution of an organism by finding the probability distribution of “maximum entropy”, i.e. the closest to uniform (in extrapolation), that respects a set of constraints derived from sample locations. This method of modelling has proven to be powerful in predicting on a regional scale (Elith et al., 2006; Elith and Graham, 2009; Vaclavik and Meentemeyer, 2009), in predicting invasive species on a regional scale (Vaclavik et al., 2010; Gormley et al., 2010) and in transferring predictions of invasive species to new (similar) regions (Duque-Lazo, 2016; Elith et al., 2010, although approaches are somewhat different). Species Distribution Modelling of invasive species in general has its detractors as well; see (De Marco et al., 2008; Dormann 2007; Midgley et al., 2006) for critiques such as “outcomes will be influenced by genetic variability, phenotypic plasticity and evolutionary changes; dispersal pathways are difficult to predict”.

In cross-validation (selected for these models), the occurrence data are randomly split into a number of equal-size groups called “folds”, and models are created leaving out each fold in turn. The model is first “trained” on the remaining folds (training data), and the left-out fold is then used for evaluation (test data).
Cross-validation has one big advantage over using a single training/test split: it uses all of the data for validation, thus making better use of small data sets (Phillips tutorial on Maxent, 2009). Training data, iteratively checked against test data, are used to calculate probability.

Constraints in modelling are represented by functions of environmental predictor variables (features), with their means required to be close to the empirical average of values extracted at occurrence sites in order to calculate probabilistic predictions (Phillips et al., 2006; Phillips and Dudik, 2008; Vaclavik et al 2010). The Maxent distribution is shown to take the form of a Gibbs distribution, while the regularization parameter, which penalizes the use of large values of model parameters, can be interpreted as the use of a Bayesian prior (Phillips, 2006; Williams, 1995). Maxent trains itself through an iterative process that minimizes the regularized loss (or equivalently, maximizes the regularized gain); this corresponds to maximizing the entropy of the predicted distribution subject to feature expectations being only close to, rather than exactly equal to, feature averages over sample locations (Phillips, 2006) (output grids shown in Results, figures 8, 9 and 10).

Regularization is designed to reduce model overfitting by a) ensuring a smoothing of empirical constraints (which are not always precise) and b) penalizing the model by the magnitude of its coefficients, which allows the algorithm to reject certain variables from contributing to the final result (Tibshirani 1996; Merow 2013). The default regularization multiplier (‘1’ in Maxent) was set by Phillips and Dudik (2008) and was shown to perform well when tested with several different taxonomic groups and environmental variables. This setting was used in models A and B. In models C and D, described below, different methods are employed to determine proper regularization settings.
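The relationships described above can be written compactly. The notation below follows the cited Maxent literature (Phillips et al., 2006; Phillips and Dudik, 2008) rather than this thesis, so every symbol is an assumption introduced for illustration: X is the set of pixels, f_j are the features, λ_j their weights, β_j the regularization parameters, and x_1, ..., x_m the occurrence locations.

```latex
% Gibbs form of the Maxent distribution over the pixels x in X
q_\lambda(x) = \frac{\exp\left(\sum_j \lambda_j f_j(x)\right)}{Z_\lambda},
\qquad
Z_\lambda = \sum_{x' \in X} \exp\left(\sum_j \lambda_j f_j(x')\right)

% Relaxed constraints: feature expectations stay close to the empirical averages
\left| \mathbb{E}_{q_\lambda}[f_j] - \tilde{f}_j \right| \le \beta_j,
\qquad
\tilde{f}_j = \frac{1}{m} \sum_{i=1}^{m} f_j(x_i)

% Equivalent training objective: regularized log loss
\min_{\lambda} \; -\frac{1}{m} \sum_{i=1}^{m} \ln q_\lambda(x_i) + \sum_j \beta_j \left| \lambda_j \right|
```

Raising the regularization multiplier scales every β_j upward, which loosens the constraints and pushes more weights λ_j to zero; this is the sense in which regularization lets the algorithm drop variables.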
Predicted distribution is calculated for the same area of coverage as the environmental variables (raster grids). Because of the computational complexity of sampling every pixel in grids larger than 10,000 pixels, a random sample of 10,000 “background data” pixels is used to represent the variety of environmental conditions present in the data. The Maxent distribution is then computed over the union of the background pixels and the samples for the species being modeled. Feature types, in tandem with regularization constants, influence the predictive performance of the model. It is recommended in the literature that observation datasets of fewer than 80 records use linear, quadratic and hinge features, which constrain means, variances, and covariances of the respective variables to match their empirical values (Phillips et al. 2006). I used these settings in each model.
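The background sampling step can be illustrated with a short sketch. It is not Maxent's own code: the function names, the fixed random seed and the toy grid are hypothetical, and pixels are represented simply as (row, column) tuples.

```python
import random


def sample_background(valid_cells, n_background=10_000, seed=0):
    """Draw background pixels at random, without replacement.

    If the grid has no more than n_background valid cells, all of them are used.
    """
    if len(valid_cells) <= n_background:
        return list(valid_cells)
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return rng.sample(valid_cells, n_background)


def modelling_domain(background, presences):
    """Union of background pixels and species occurrence pixels."""
    return sorted(set(background) | set(presences))


# Hypothetical 200 x 200 study area (40,000 valid cells) and two occurrence cells.
cells = [(row, col) for row in range(200) for col in range(200)]
presences = [(0, 0), (5, 7)]

background = sample_background(cells)
domain = modelling_domain(background, presences)
print(len(background), len(domain))
```

Taking the union guarantees that every occurrence pixel is part of the domain over which the distribution is computed, even when it was not drawn as background.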


For a description of the math and history behind Maxent, see (Phillips 2004; 2006; Phillips and Dudik, 2008). For an explanation of the algorithms from a statistics perspective as opposed to machine learning, see (Elith et al., 2011). The current (as of writing) Maxent software, Version 3.3.3k, was used. Software as well as tutorials and documentation can be found at http://www.cs.princeton.edu/~schapire/maxent/.

Models A, B and C are run with a Bias Grid as previously described. When there is much bias, presence-only models approximate the biased sampling distribution as much as they approximate the species distribution. This can be avoided by having the background sample reflect the same bias as the presence data (Zaniewski et al., 2002; Dudik et al., 2005; Phillips and Dudik 2008).

2.6b) Model Method 1

Two models were created using the common “step-wise” or “leave-one-out” approach of running Maxent models and selecting variables based on their respective contribution or detriment to the predictive power of results as internally validated by AUC, with model contribution measured by the “jackknife measure of variable importance” as described in (Van Gils et al., 2014; Pearson, 2007; Kumar and Stohlgren 2009; Pearson, et al., 2007). The AUC is a single measure of “predictive performance across a full range of possible thresholds” (Pearson, 2007). It is derived from the Receiver Operating Characteristic (ROC) curve, which plots errors of commission against errors of omission: “sensitivity”, the percentage of actual presences successfully predicted, is graphed against “1 - specificity”, where specificity is the proportion of absences correctly predicted. In this case, because Maxent has no true absences, specificity is evaluated on low-prediction background data, which can also be called pseudo-absences (Fielding and Bell 1997).

Model A) This is a model run with a full set of variables and all observation locations.
It is (somewhat) a continuation of models I ran in Reid (2015), with more observation points gathered by SLU than in 2015 (n = 71 compared with n = 41). A carefully selected set of predictors was included as shown in (table 2), including a newly created feature, Distance to Ports, which may be a better representation of presence in relation to location of introduction (Kotta et al., 2016) than AIS transponder data (Marine Traffic). Model A shows a likely current distribution of Round Goby in Blekinge using all pertinent environmental data available; its goal is prediction, with explanation by variable contribution being a secondary goal. This model can also be called “model based interpolation/extrapolation to unsampled sites” (Guisan & Thuiller 2005). There have been critics of this approach in establishing the “niche” of invasive species, as mentioned, and also of using AUC/ROC to validate such models (Jimenez-Valverde 2012, Jueterbock et al., 2015), so the goal for model A is to define “current distribution” or “range” (Elith and Leathwick, 2009).

Model B) This model is run with a subset of Round Goby observations and a subset of variables. It includes a filtered portion of observation data points, separated into groups delineated by variogram, cluster and grouping analysis as described in exploratory analysis above. Filtering of data (spatially) should promote better model calibration and evaluation (Radosavljevic and Anderson 2014; Veloz, 2009; Anderson, 2012; Hijmans, 2012). Observation data are filtered out in areas where presence can most probably be explained by the vector of introduction (ballast), i.e. where Round Goby, a largely sessile organism (Almqvist 2008), probably never migrated away in search of more preferable habitat. In addition, variables representing vectors of introduction are removed from the model.
Model B is a novel attempt by the author to isolate populations in relation to environmental data by preference of habitat, excluding dynamics of introduction, using exploratory GIS techniques. It is an effort to explain direct and indirect drivers of established Round Goby distribution (Elith and Leathwick 2009), and prediction is a secondary goal, although internal and external validation shows predictive power is nearly the same as model A.
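The AUC used for internal validation of models A and B can be illustrated with a small rank-based calculation: it is the probability that a randomly chosen presence location receives a higher predicted value than a randomly chosen background (pseudo-absence) location, with ties counted as half. The scores below are hypothetical and the function is a sketch, not Maxent's implementation.

```python
def auc(presence_scores, background_scores):
    """Rank-based AUC: share of presence/background pairs ranked correctly."""
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0      # presence ranked above background
            elif p == b:
                wins += 0.5      # ties count as half
    return wins / (len(presence_scores) * len(background_scores))


# Hypothetical predicted values at three presence and four background locations.
presence = [0.9, 0.8, 0.4]
background = [0.7, 0.3, 0.2, 0.1]

print(round(auc(presence, background), 4))
```

A value of 0.5 corresponds to a model that ranks presences no better than random background, and 1.0 to perfect separation.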


2.6c) Model Method 2

Two models were created which transfer predictions from one location to another. Transferring predictions to a different region has been called “extrapolation or forecasting” (Araujo & New 2007; Miller et al. 2004). There are several factors to consider when transferring predictions “across space and time” (Elith and Leathwick, 2009). Similarity of habitat is known to be important in determining transferability (Duque-Lazo et al., 2016; Vaclavik et al., 2010); analysis of the minimum, mean, maximum and standard deviation of habitat grid values in both regions (table 4) showed that conditions were similar enough to transfer predictions (brackish water, low wave exposure, protected coastline, similar depth gradient). According to Elith and Leathwick (2009), “It is inherently risky to transfer predictions because no observations of species occurrence are available from the training data to directly support the predictions”. I address this problem (in Blekinge) by external validation of predictions. As mentioned, AUC as an evaluation of SDMs (especially for invasive species) has its critics, and several have argued that this criticism is particularly applicable to transferability/forecasting (Duque-Lazo 2016). Jimenez-Valverde (2011) argues that the AUC system is a flawed approach (described in 4.5 Assessment of Validation Results). For this reason, variable selection and regularization settings for Maxent models C and D were based on different criteria than the standard step-wise approach used for models A and B. The AIC Maxent method was developed in part by (Akaike, 1974; Glor and Warren, 2011) and coded in ENMtools for R; Maxent Variable Selection was developed by Jueterbock (2016), as mentioned in (2.5b Rules for Rejection). The package, run in R, creates a separate Maxent model which uses SWD (samples with data) input.
10,000 random points were created within the study area to represent background data; these were combined with the observation data, and values for both were extracted from the environmental variables using the Extract Multi Values to Points tool (ArcMAP10.2.2). Then, all variables with a relative contribution score below the contribution threshold (3%) are excluded. Next, variables that correlate with the variable of highest contribution are removed (correlation threshold set to 0.60). The remaining set of variables is then used to compile a new Maxent model. Variables with low contribution scores are again removed, and remaining variables that are correlated with the variable of second-highest contribution are discarded. This process is repeated until a set of uncorrelated variables remains that all have a model contribution above 3%. Performance of each model is assessed with the sample-size-adjusted Akaike information criterion (AICc), which is estimated from single models that include all occurrence sites and background data. AIC is calculated as AIC = -2 ln(maximum likelihood) + 2(number of independently adjusted parameters within the model) (Akaike, 1974, see table 1). The AIC method differs from AUC in that it looks for the simplest model that best satisfies criteria, a sort of “Occam’s Razor” approach. Instead of high AUC values, low AIC values are best. “Minimization of AICc values generally favors models that recognize the fundamental niche of a species and that are better transferable to other scenarios” (Warren and Seifert, 2011). Using AIC as opposed to AUC is more of a statistics-based approach to using a machine learning tool. (Table 1) above shows final results of Maxent Variable Selection with both AICc and AUC values for models C and D, which are described below.

Model C) Round Goby in Blekinge Transferred to Stockholm includes a streamlined list of predictors (shown in tables 2 & 3).
Only the four highest contributing, uncorrelated variables were included. A 10-fold cross-validation Maxent model with a regularization (beta) multiplier of 1 is used. With this model I intend to show the potential distribution of Round Goby in the Blekinge and Stockholm areas. (Jimenez-Valverde, 2012) encourages the use of AIC methods for this scenario (explained in detail in Discussion). Besides being the best of the tested series by AICc, this was also the best model by AUC (with a smaller variable set common to both study areas), which was not often the case in previous tests with the Maxent Variable Selection tool (table 1). The purpose of model C is prediction.
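The selection loop and the information criterion described above can be sketched in Python as follows. This is a minimal illustration and not the Maxent Variable Selection package itself: `fit_contributions` and `correlation` are hypothetical callables standing in for a Maxent run and a correlation lookup, and the log-likelihood and parameter count passed to `aicc` are assumed to be available from the fitted model. The small-sample correction is the standard 2k(k+1)/(n-k-1) term.

```python
import math
from typing import Callable, Dict, List, Sequence


def aic(log_likelihood: float, n_params: int) -> float:
    """AIC = -2 log(maximum likelihood) + 2 * number of parameters."""
    return -2.0 * log_likelihood + 2.0 * n_params


def aicc(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Sample-size-adjusted AIC; returned as infinity when it is undefined."""
    if n_samples <= n_params + 1:
        return math.inf
    correction = (2.0 * n_params * (n_params + 1)) / (n_samples - n_params - 1)
    return aic(log_likelihood, n_params) + correction


def select_variables(
    variables: Sequence[str],
    fit_contributions: Callable[[List[str]], Dict[str, float]],
    correlation: Callable[[str, str], float],
    min_contribution: float = 3.0,   # percent, as in the text
    max_correlation: float = 0.6,    # 60%, as in the text
) -> List[str]:
    """Iteratively drop low-contribution and correlated variables.

    fit_contributions (hypothetical) refits a model on the given variables
    and returns percent contribution per variable; correlation (hypothetical)
    returns the correlation coefficient between two variables.
    """
    remaining = list(variables)
    references = set()  # variables already used as the "highest contributor"
    while remaining:
        contributions = fit_contributions(remaining)
        kept = [v for v in remaining if contributions[v] >= min_contribution]
        ranked = sorted(kept, key=lambda v: contributions[v], reverse=True)
        unused = [v for v in ranked if v not in references]
        if not unused:
            if kept == remaining:
                return kept          # stable: uncorrelated and all above threshold
            remaining = kept         # something was dropped, so refit once more
            continue
        reference = unused[0]        # highest, then second-highest, and so on
        references.add(reference)
        remaining = [
            v for v in kept
            if v == reference or abs(correlation(reference, v)) <= max_correlation
        ]
    return remaining
```

Candidate models produced this way (for example over a range of regularization multipliers) would then be ranked by `aicc`, with the lowest value preferred.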


Model D) Mitten Crab in Stockholm transferred to Blekinge is again a 10-fold cross-validation model, with 6 environmental variables and a regularization multiplier of 3.5 as established by the Variable Selection tool (table 1). The purpose of Model D is prediction of current (Stockholm) and potential (Blekinge) distribution, with settings as prescribed by the AIC method model tests. Workflow for all models is shown in (appendices, figure A3).

3) Results

3.1) Overview of Maps and Output Tables

Model output maps (figures 8, 9 and 10) show current and potential distribution of both species in the study areas. Note that mapped distributions in models A, B and C, and in model D in the Stockholm area, show a broad range of occupied or potential habitat, while the transferred model D shows almost no prediction in the Blekinge study area (discussed in section 4.4). Maxent output includes tables showing estimates of the relative contribution of environmental variables to the Maxent model (table 3). For a graphic showing positive and negative correlation (and categorical) of observation data to variables, see (appendices, figure A4). As can be seen in (table 3), wave exposure is a strong contributor to models A (57.2%), B (79.3%) and C (61.2%), reinforcing findings from (Kotta et al., 2016) and my previous study (Reid, 2015). Wave exposure also contributes very highly to the prediction of Mitten Crab in Stockholm (model D, 54%). Distance to ports was the second most important predictor in models A (26.4%) and C (33.8%). Substrate in models A (3.7%) and B (3.2%) shows a correlation with either soft bottom (mud) or rock (appendices, figure A4), although it is not as strong a contributor as in the 2015 model (11.1%). With the variables representing vectors of introduction (marine traffic and distance to ports) that are included in model A left out, the environmental variables in model B contribute much more, thus explaining direct and indirect drivers of nearshore habitat.
Additional contributions of note in model D are: Secchi depth (9.8%), salinity (23.3%), slope (7.9%) and bathymetry (3.2%). (Table 4) shows min, mean, max and standard deviation of variable values, both for the grids and at observation locations. Internal and external validation scores of finished models are shown in (table 5).

Table 3 Maxent variable contributions. Models A and B are Round Goby in Blekinge, C is Round Goby in Blekinge transferred to Stockholm, and D is Mitten Crab in Stockholm transferred to Blekinge.

Maxent Percent Contribution

Variables            Model A   Model B   Model C   Model D
Wave Exposure        57.2      79.3      61.2      54
Distance to Ports    26.4      ---       33.8      ---
Substrate            3.7       3.2       ---       ---
H. Diversicolor      2.6       3.2       ---       ---
Secchi               ---       ---       ---       9.8
Salinity             2.5       0.4       ---       23.3
Temperature          ---       ---       2.1       1.8
Slope                1.8       3.5       2.9       7.9
Total Nitrogen       1.8       4.9       ---       ---
Bathymetry           1.3       2.4       ---       3.2
M. Edulis            0.8       0         ---       ---
Pike                 0.7       1.2       ---       ---
Marine Traffic       0.7       ---       ---       ---
Hydrobiidae          0.3       1.2       ---       ---
M. Balthica          0.2       0.5       ---       ---

Statistical values of variable grids and at observation sites are shown in (table 4).


Table 4 Summary of predictor variables in background raster data and at the occurrence sampling points. Variables are summarised by min, max, mean and std. deviation values. For ordinal data, 1 is the lowest probability and 5 is the highest: Hydrobiidae (modelled abundance): 0 = 0, 1 = 0-13, 2 = 13-120, 3 = 120-160, 4 = 160-222, 5 = 222-1253. Macoma Balthica (modelled abundance): 0 = 0, 1 = 0-20, 3 = 50-92, 4 = 92-177, 5 = 177-1326. Hediste Diversicolor, Mytilus Edulis: 0 = low probability of presence, 1 = high probability of presence, 2 = very high probability of presence. Additional variables have been described in table 2 and in the methods section.

                        Blekinge                                  Round Goby
Layer                   MIN       MAX        MEAN       STD        MIN       MAX        MEAN      STD
Marine Traffic          0.0       1200.0     128.9      139.9      0.0       980.0      156.4     215.3
Bathymetry (m)          1.8       -51.3      -17.9      11.5       -22.4     -0.4       -7.3      5.1
Salinity (ppt)          5.9       8.2        7.4        0.2        6.3       7.2        7.1       0.2
Temperature (c)         5.2       9.9        8.3        1.2        8.0       9.8        9.0       0.4
Distance to port (m)    0.0       32774.6    14210.8    7484.3     616.8     11778.5    3196.9    2334.2
Hediste Diversicolor    0.0       2.0        0.2        0.6        0.0       2.0        0.8       0.8
Hydrobiidae             1.0       5.0        1.3        0.8        1.0       5.0        1.8       1.3
Macoma balthica         0.0       5.0        2.4        1.6        0.0       5.0        3.3       1.7
Mytilus Edulis          0.0       2.0        0.6        0.8        0.0       1.0        0.1       0.3
Pike                    0.0       1.0        0.1        0.2        0.0       0.7        0.2       0.3
Secchi (m)              1.3       8.9        7.3        1.0        ---       ---        ---       ---
Slope (angle degree)    0.0       42.2       1.2        1.5        0.0       7.6        1.9       1.5
Substrate               1.0       9.0        ---        ---        1.0       8.0        ---       ---
Nitrogen (mg/m^3)       283.9     451.3      294.8      10.6       284.2     404.7      312.7     22.6
Wave exposure           0.0       881091.4   475312.0   214493.7   0.0       417194.0   49570.8   88828.1

                        Stockholm                                 Mitten Crab
Layer                   MIN       MAX        MEAN       STD        MIN       MAX        MEAN      STD
Bathymetry (m)          -110.8    0.4        -23.3      16.9       -1.2      -33.7      -9.0      8.1
Salinity (ppt)          0.7       7.9        6.2        0.6        2.3       6.6        4.8       1.4
Temperature (c)         1.9       8.0        6.1        1.3        1.9       7.9        6.8       1.0
Distance to port (m)    0.0       21002.6    5407.6     3261.8     116.6     11399.0    3484.0    0.7
Secchi (m)              0.9       6.3        4.3        1.0        1.5       5.1        3.0       0.7
Slope (angle degree)    0.0       54.6       4.4        3.3        0.7       21.1       6.6       4.4
Wave exposure           79.0      754019.8   106647.7   185171.0   0.0       131857.0   7184.8    15159.4
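The kind of summary reported in table 4 can be illustrated with a short Python sketch. It assumes the grid cell values and the values at observation points are already available as arrays; the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which form was used.

```python
import numpy as np


def summarise(values, nodata=None):
    """Min, max, mean and standard deviation of valid cell values."""
    arr = np.asarray(values, dtype=float).ravel()
    if nodata is not None:
        arr = arr[arr != nodata]     # drop an explicit NoData value, if any
    arr = arr[~np.isnan(arr)]        # drop NaN cells
    if arr.size == 0:
        raise ValueError("no valid values to summarise")
    return {
        "min": float(arr.min()),
        "max": float(arr.max()),
        "mean": float(arr.mean()),
        "std": float(arr.std()),     # population form (ddof=0), an assumption
    }


def range_within(observed, background):
    """True if the observed range lies inside the background range."""
    return (background["min"] <= observed["min"]
            and observed["max"] <= background["max"])
```

A check such as `range_within(summarise(obs_values), summarise(target_grid))` gives a first, coarse indication of whether conditions at observation sites in one region also occur in the grid of the other region.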

3.2) Validation and Thresholds

As mentioned, absence location data were available for Round Goby in the Blekinge study area. These data were used to validate Maxent predictions externally using the True Skill Statistic (TSS). TSS, also known as the Hanssen–Kuipers discriminant, is traditionally used for assessing the accuracy of weather forecasts. It compares the number of correct forecasts, minus those attributable to random guessing, to that of a hypothetical set of perfect forecasts (Allouche et al. 2006). Here, sensitivity is compared with specificity using external data, as opposed to the “1 – specificity” used in Maxent cross-validation, which is a measure of performance against (model created) background data. The proportion of correctly predicted presences (sensitivity) is compared against the proportion of correctly predicted absences (specificity), minus the factor of random guessing. Comparing the two gives a cut-off value, the point at which sensitivity meets specificity, whose location on the sensitivity/specificity graph shows the percent correctly predicted (appendices, figure A5). The cutoff value is used as a binary threshold in predictive maps created for models A, B and C. Evaluation data should be spatially independent from the calibration data and not contain any environmental bias found in them (Radosavljevic and Anderson, 2014), and “spatially independent evaluations should be used to identify models that avoid overfitting” (Bahn & McGill, 2013). For Model C in Stockholm, an attempt was made to validate locations of highest prediction by sampling; however, no Round Goby were discovered. For model D, there were no external data to validate presences or absences in the Stockholm study area, but there were presence data in the Blekinge study area. These data were used to validate the transferred Mitten Crab predictive map. Since there were no data to externally validate the Stockholm prediction, the minimum training presence binomial probability, as internally calculated by Maxent, was used as a threshold. According to (Warren et al., 2010), “The simplest approach … is to assign a threshold value to the Maxent predicted suitability output scores that corresponds with the minimum training presence (i.e., the lowest predicted suitability score that corresponds with a known occurrence)”. This threshold was also applied to the transferred prediction map in Blekinge, and the simple percentage of observation locations with prediction values above this threshold was calculated. Discussion over which threshold to use and why continues, including conceptual critique of the minimum training presence method (Warren et al. 2008), and there seems to be no one solution that fits all needs. See (table 5) for internal and external validation scores.
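The threshold logic described in this section can be sketched in Python as below. The sketch assumes that predicted suitability scores have already been extracted at external presence and absence locations, and that a score at or above the threshold counts as a predicted presence; both the function names and that convention are assumptions for illustration.

```python
import numpy as np


def sensitivity_specificity(presence_scores, absence_scores, threshold):
    """Share of presences and of absences that are classified correctly."""
    presences = np.asarray(presence_scores, dtype=float)
    absences = np.asarray(absence_scores, dtype=float)
    sensitivity = float(np.mean(presences >= threshold))
    specificity = float(np.mean(absences < threshold))
    return sensitivity, specificity


def true_skill_statistic(presence_scores, absence_scores, threshold):
    """TSS = sensitivity + specificity - 1."""
    sens, spec = sensitivity_specificity(presence_scores, absence_scores, threshold)
    return sens + spec - 1.0


def equal_sens_spec_cutoff(presence_scores, absence_scores):
    """Candidate threshold at which sensitivity and specificity are closest."""
    candidates = np.unique(np.concatenate([
        np.asarray(presence_scores, dtype=float),
        np.asarray(absence_scores, dtype=float),
    ]))
    gaps = []
    for t in candidates:
        sens, spec = sensitivity_specificity(presence_scores, absence_scores, t)
        gaps.append(abs(sens - spec))
    return float(candidates[int(np.argmin(gaps))])


def minimum_training_presence(training_presence_scores):
    """Lowest predicted score found at any training occurrence."""
    return float(np.min(np.asarray(training_presence_scores, dtype=float)))


def percent_above(scores, threshold):
    """Percentage of locations with a score at or above the threshold."""
    return 100.0 * float(np.mean(np.asarray(scores, dtype=float) >= threshold))
```

For a transferred model without absences, `percent_above` applied to the validation presences with the `minimum_training_presence` threshold corresponds to the simple percentage described above.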

Table 5 Internal validation is given as AUC scores for each model, calculated by Maxent over 10-fold cross-validation. Sensitivity is the proportion of presences correctly predicted and specificity is the proportion of (actual) absences correctly predicted; for an example of the sens/spec graph see (appendices, figure A5). PPP = positive predictive performance and NPP = negative predictive performance, measures of user accuracy generated from the confusion matrix (Fielding and Bell, 1997). User accuracy is important from a management perspective, as a measure of how accurate the map is. PPP is the proportion of predicted presences at which the species was actually found (which is also influenced by how common a species is). NPP is the proportion of predicted absences at which the species was in fact not found. Cutoff represents where sensitivity meets specificity; this value is used as the primary threshold in predictive maps (except for model D). AUC in the last column is validated externally.

         Internal               External                                    User Accuracy
Model    AUC      Std. d.       Correctly      Sensitivity   Specificity    PPP     NPP     Cutoff   AUC
                  (10 folds)    Predicted %
A        0.970    0.014         78             0.77          0.8            0.85    0.71    0.452    0.88
B        0.967    0.009         78             0.765         0.78           0.82    0.72    0.427    0.87
C        0.957    0.026         73             0.69          0.74           0.79    0.62    0.501    0.8
D        0.913    0.073         0% correct in transferred model (Blekinge)
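The external measures in table 5 all derive from a confusion matrix of predicted versus observed presences and absences. A minimal Python sketch is given below; the function name is hypothetical, and treating “Correctly Predicted %” as overall accuracy is an assumption.

```python
def user_accuracy(tp, fp, tn, fn):
    """Validation measures from confusion-matrix counts.

    tp: presences predicted present     fn: presences predicted absent
    tn: absences predicted absent       fp: absences predicted present
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    def ratio(numerator, denominator):
        # Undefined ratios (empty row or column) are returned as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppp": ratio(tp, tp + fp),   # positive predictive performance
        "npp": ratio(tn, tn + fn),   # negative predictive performance
        "correct_percent": 100.0 * (tp + tn) / total,
    }
```

Because PPP and NPP are computed down the columns of the matrix rather than across the rows, they change with the ratio of presences to absences in the validation data, which is why PPP is influenced by how common the species is.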

3.3) Predictions

Models A and B, covering only the Blekinge study area, are shown in (figure 8), where area deemed unsuitable is outlined in dark blue. Area established as suitable by Minimum Training Presence (Warren et al., 2009) is shown in blue-green. From the externally validated cutoff value for each model, a gradient then runs from dark green to red, indicating highest prediction.


3.3a) Models A and B

Figure 8 Two cutoff values are employed for both models (A & B): Minimum Training Presence (MTP) (Warren et al., 2009) and the cutoff established by external validation of presence and absence locations (sensitivity and specificity; Allouche et al. 2006). Absences used in this calculation are shown as a crosshatch.

(Table 6) lists the resulting area of each model predicted to be occupied or potentially occupied, as compared to the entire study area.

Table 6 Predicted distribution area (above the cutoff established by external validation) and study area, calculated in km^2. ** The cutoff for the Mitten Crab model, in Stockholm and as transferred to Blekinge, is established by minimum training presence.

                        Predicted distribution area km^2   Study area km^2   % over cutoff
Model A                 4032                               177224            2.3
Model B                 4741                               177224            2.7
Model C Blekinge        4981                               177224            2.8
Model C Stockholm       100801                             254324            39
Model D Stockholm **    68793                              254324            27
Model D Blekinge **     1084                               177224            0.60
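The area figures in table 6 amount to counting grid cells at or above a cutoff. The following Python sketch shows one way this could be done, assuming the prediction grid is available as an array with NaN for cells outside the study area and a known, constant cell area; the function name and these conventions are assumptions, and the thesis figures themselves were produced with GIS tools.

```python
import numpy as np


def area_above_cutoff(grid, cutoff, cell_area_km2):
    """Return (predicted area, study area, percent over cutoff).

    Cells with NaN are treated as outside the study area.
    """
    arr = np.asarray(grid, dtype=float)
    valid = ~np.isnan(arr)
    n_valid = int(valid.sum())
    if n_valid == 0:
        raise ValueError("grid contains no valid cells")
    n_above = int((arr[valid] >= cutoff).sum())
    predicted_area = n_above * cell_area_km2
    study_area = n_valid * cell_area_km2
    return predicted_area, study_area, 100.0 * n_above / n_valid
```

The same function can be applied with either the externally validated cutoff or the minimum training presence value as `cutoff`.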


3.3b) Model C

Model C, showing Round Goby predictions in Blekinge County transferred to Stockholm County, is shown in (figure 9). Again, Minimum Training Presence is shown as the lowest predictive probability supported by training data. The externally validated cutoff value for both study areas then begins at .5 and continues on a gradient from dark green to red, indicating highest predictive probability.

Figure 9 Model C is Round Goby as modelled in Blekinge, transferred to the Stockholm study area. Dark blue indicates predicted unsuitable habitat, and blue-green shows predicted distribution by Minimum Training Presence (MTP). Dark green represents the cutoff established by external validation, continuing on a color gradient from green to red that outlines cells of highest predicted probability.


3.3c) Model D

Model D shows Mitten Crab modelled in Stockholm and transferred to Blekinge. Since there were no external data to validate the Stockholm model, thresholds were set by minimum training presence (figure 10).

Figure 10 Model D, predictive probability of Mitten Crab modelled in Stockholm and transferred to Blekinge, is shown with blue representing area below minimum training presence, followed by a gradient from green to red showing highest probability. E. Sinensis observations (n = 22), shown in red in the lower map, are the observation locations used to validate the transferred model.


4) Discussion

4.1) Discussion of Model A Results and Conclusions

In revisiting the Blekinge area from the 2015 study (Reid, 2015) with several new observations, a new opportunity arose to analyze the spread of Round Goby over time. Model A is shown in comparison to the 2015 results in (appendices, figure A6). Although the predicted distribution is similar, it seems to be more centralized in the current study. The 2015 model does not include the Distance to Ports variable, which may act in Model A as a better explanatory “introduction” variable and as a constraint on prediction that could not be seen before. This would indicate that the two models and associated predictive grids are not directly comparable. In model A, the strong contribution of Distance to Ports (with all data points) may indicate that colonization by swimming and reproduction is slow and that recruitment relies heavily on re-introduction, or even that Round Goby is quite adaptable to most conditions where it is “dropped” once. Model A validation results are more favorable than those of the 2015 study (AUC .97 vs. AUC .96), which would reinforce such an argument. Furthermore, as evidenced by the variable contributions in (table 3) (positive correlation shown in appendices, figure A4), Distance to Ports in models A (26.4%) and C (33.8%) is a much higher contributor than marine traffic (model A, .7%). This could be due to limitations of AIS transponder data in determining which A-class ships that use ballast are landing (Danish Maritime Authority, link (14)).
In comparison to the 2015 study, the prediction by Minimum Training Presence in model A seems to agree more with a-priori knowledge, in that Round Goby should not be found in the open sea south of the southern archipelago due to depth and lack of suitable substrate (Sapota, 2012). This further indicates a more comprehensive analysis in Model A, with more robust predictions and a more effective explanatory variable. In essence, Model A seems to be modelling colonization by direct introduction with environmental constraints, which could only partly be determined in the 2015 study.

4.2) Discussion of Model B Results and Conclusions

A good way to compare vectors of introduction with environmental conditions is to compare model A, the highest performing predictive model by all validation measures, with model B (table 5). The differences in all validation scores are very small. In this case, it would seem that direct and indirect environmental predictors are nearly as important in explaining presence as vectors of introduction. This indicates that there is a preferred habitat, but that the sedentary nature and robust adaptability of Round Goby, or constant (modelled) reintroduction, is keeping us from seeing what it is in models A and C. Model B is meant to show this, by eliminating the introduction vector variables and filtering out the subset of data explained by them. The threshold of minimum training presence in model B (figure 8) shows what we would expect of Round Goby from known behavior, in that they generally favor near-shore, shallow water habitat (Sapota 2012; Kornis, 2012; Florin, 2012; Reid, 2015). Predicted presence is shown with an eastward migration, where there are in fact observation records in Öland and (but certainly not westward, excluding Gothenburg), possibly toward areas of lower salinity and, by the same reasoning, toward the shoreline for probable colonization of freshwater areas.
By these results I would theorize that we have captured current distribution by habitat preference. Whether or not Round Goby has colonized these areas also depends on rate of spread (swimming), which raises the question of whether the entire Karlskrona area is preferable and our model is artificially constraining distribution. The only way to answer this question is through planned external (stratified random) validation in areas of high probability in Model B, and possibly DNA surveys, which is beyond the scope of this study.

Regarding variable contributions in Model B, wave exposure was the highest contributor (79.3%) (table 3), showing negative correlation (appendices, figure A4). As the wave exposure grid is at the water surface (Sundblad et al 2014a) and Round Goby is generally a shallow water, nearshore fish, it is not surprising that they would be affected by this. This relationship can be seen in tandem with another variable, bathymetry (2.4%), where shallower depths show lower wave exposure at observation points, indicating a preference for both nearshore and protected areas. Round Goby presence was positively correlated (appendices, figure A4) with higher nitrogen (4.9%), suggesting that eutrophication at local scale is not a limiting factor in Round Goby distribution, as also found by (Kotta et al., 2016). However, another measure of eutrophication, Secchi depth, performed no better than random in model pre-tests, so this correlation may be an indirect association with physical features of nearshore habitat. For substrate (3.2%), Round Goby was found to be highly correlated with rock and mud substrates, reinforcing the two different reproductive strategies for exposed and sheltered areas described in (Almqvist 2008), although it is known to need hard material to lay eggs on (Sapota 2012; Kornis 2012). Preferred substrate is definitely not sand (appendices, figure A4), which reinforces previous knowledge from a biological perspective (Sapota 2012; Florin, 2012). Hediste Diversicolor (ragworm) (3.2%), Hydrobiidae (mud snail) (1.7%) and Macoma Balthica (native clam) (.5%) could be potential food sources, and Round Goby could be feeding on any combination of these, but it is also possible that they occupy the same type of habitat as Round Goby for other reasons. Mytilus Edulis was not shown to correlate with Round Goby (appendices, figure A4) and did not contribute to model B (0%). This is not to say it is not a potential food source; rather, it is a matter of regional habitat location, as M. Edulis occupies habitat deeper than where Round Goby is found within the study area. M. Edulis is a known food source of Round Goby on the Polish Baltic coast (Almqvist 2008). This fact also lends legitimacy to the theory that regional modelling approaches are best for this species. Salinity (bottom) contributed only .4% to model B, and it is known that the range of salinity within the study area (table 4) is well within Round Goby requirements (Kornis, 2012).
However, as seen in (appendices, figure A4), a negative correlation with high salinity is exhibited, which is also shown in model A (2.5%). The true magnitude of this result is difficult to determine, but it would be congruent with a pattern shown in the St. Lawrence River and Great Lakes, USA (Kornis, 2012). Slope (3.5%) (appendices, figure A4) shows a positive correlation with >5 degrees, gradually descending to negative correlation with increased slope angle. With 1.2% contribution, Pike presence is shown to be positively correlated with Round Goby. Although Pike are voracious predators, they can also be prey: Round Goby is known to feed on roe of other species (Balshine et al. 2005, Lauer et al. 2005, French III and Jude 2001, Steinhart et al. 2004, Fitsimmons et al. 2006, Roseman et al. 2006), and the original Pike data was a model of juveniles (although adults generally stay close to spawning habitat (Göran Sundblad, personal communication)).

4.3) Discussion of Model C Results and Conclusions

Model C, selected by AICc performance using a regularization multiplier of 1, did not perform poorly by any validation metric even with a streamlined set of variables (table 5), although it did not perform as well as A and B. Comparing Model A (figure 8) and Model C (figure 9, Blekinge), the predicted distribution surfaces (considered favourable) are quite similar in area, with a difference of only 949 km^2 (table 6). While the threshold “cutoff” established by external validation is appropriate for the Blekinge study area, it is somewhat arbitrary for the transferred Stockholm study area, as there is no way to validate it with presences. A threshold of .5 is generally quite conservative though, and the areas of highest prediction would be the same regardless of threshold.
It has been suggested that a gradient from lowest to highest is the best representation of the predictive surface; however, minimum training presence was calculated, so it was used to establish a (visual) predictive baseline. “The advantages of ROC plots are considered to be their independence of a particular threshold and that they provide a single measure of model and prediction accuracy” (Sundblad et al., 2009; Pearce and Ferrier, 2000). Additionally, since the model was developed by AICc instead of AUC, and the predictions are transferred to a new study area, it is tricky to set thresholds. Using the fairly conservative (.5) threshold, it should be of great interest to land managers that 39% of the Stockholm study area (as opposed to 2.8% in Blekinge), shown in (table 6), is predicted as potentially favourable for Round Goby. In addition, Lake Mälaren is shown to be at risk, as some of the areas of highest probability of colonization are at connection points to the Baltic (figure 9), both in the Södertälje region (south portion of study area) and Stockholm city (center-east of study area). I am conducting a pilot study in field sampling and precursory validation of Round Goby predictions in the Stockholm study area; however, no Round Goby have been caught as of publication.

4.4) Comparison of Models C and D Results and Model D Conclusions

Transferability performance of SDMs between two disjunct areas is still not as well understood as region-based performance, which makes it hard to obtain the most reliable predictions (Duque-Lazo et al., 2016). To compensate for this, only the most parsimonious models were developed, through selection of variables for models C and D by AICc. This (not overly complex and not overly simple) kind of model is designed to address a known overfitting issue in SDMs (Thuiller et al., 2008). Evidence of good model fit for models C and D can be seen by AUC standards in low AUCdiff values (table 1), showing the difference in AUC between training and test data, and in low AICc. This should be an indication of good transferability (Jueterbock et al., 2016). Transferability of predictions is very sensitive and can depend on species (Randin, et al., 2006) and heavily on region similarity (Duque-Lazo et al., 2016). Although the transfer is not externally validated, possibly for mechanistic reasons of the rate of colonization of Round Goby, high internal AUC scores and low AICc scores, as well as high external validation scores in the Blekinge region, point to a successful model transfer to the Stockholm area in Model C. This would suggest environmental similarity between regions (further evidenced by table 4), whereas Mitten Crab in Stockholm (Model D) was not successfully transferred to the Blekinge study region. This would indicate that the Stockholm environment is of a similar “region” as Blekinge for Round Goby, whereas Blekinge is not of a similar region as Stockholm for Mitten Crab; however, as I will describe, model results may be skewed by sample selection bias.
To illustrate the problems in transferred prediction, figure 10 (bottom) shows the observation locations used for (transferred) validation, which occupy the lowest possible value on the predictive map (0% correctly predicted). In fact, only .6% of the study area values were over the lowest possible cutoff value, minimum training presence. The Stockholm regional model scored well with an internal AUC of .91, but seems not to be transferable. Model D contribution results (table 3) show a surprising negative correlation with higher salinity within the Stockholm study area (appendices, figure A4), and salinity was a strong contributor to model D (23.3%). This is surprising because, if the Mitten Crab is searching for favorable spawning habitat, I would expect the opposite of this catadromous species. Also in Model D, a negative correlation with Secchi depth (water clarity) (9.8% contribution) is shown (appendices, figure A4), indicating correlation with, or indifference to, eutrophic conditions. Slope was also an important explanatory variable (7.9%), showing that Mitten Crab is correlated with higher slope angles within the Stockholm study area. Poor performance in Model D (transferred) is most likely not due to habitat dissimilarity, although in future studies Mahalanobis distances (Mahalanobis, 1936) can be used to test similarities between environmental conditions, and distances between these conditions, in the study areas. More likely, the problems encountered in transferring the Mitten Crab prediction are the result of sample selection bias (Baldwin, 2009) and (Philips, 2008) in response to (Peterson, 2007). A weighted bias grid was used to account for sample selection bias in the Round Goby models, but was not used in the Mitten Crab Model D. This bias should be expected with this dataset, as all observations come from public submissions.
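As an illustration of the Mahalanobis comparison suggested above for future studies, the following Python sketch measures how far environmental conditions in a target region lie from the conditions sampled in a reference region. It was not part of this study; the function name, the array layout (one row per location, one column per variable) and the use of a pseudo-inverse for the covariance matrix are assumptions.

```python
import numpy as np


def mahalanobis_to_reference(reference, targets):
    """Mahalanobis distance of each target row to the reference conditions.

    reference: array of shape (n_locations, n_variables), e.g. values at
               training sites or cells in the region where the model was built.
    targets:   array of shape (m_locations, n_variables) in the other region.
    """
    ref = np.asarray(reference, dtype=float)
    tgt = np.asarray(targets, dtype=float)
    mean = ref.mean(axis=0)
    # atleast_2d keeps the single-variable case as a 1 x 1 matrix.
    cov = np.atleast_2d(np.cov(ref, rowvar=False))
    cov_inv = np.linalg.pinv(cov)    # pseudo-inverse tolerates collinear variables
    diff = tgt - mean
    squared = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(squared, 0.0))
```

Large distances for many target cells would indicate that the target region contains conditions poorly represented in the reference region, which is one way to quantify habitat dissimilarity before transferring a model.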
Members of the public sample at public fishing sites or road access points, which is the very definition of a haphazard collection scheme, referenced by (Fithian et al. 2014) as something to be avoided (or compensated for) in predictive modelling. The poor results of transferring the predictive grid from Stockholm to Blekinge are likely symptomatic of this bias, which also casts doubt on the validity of the predictive grid created for Mitten Crab in the Stockholm study area. Another theory, posited by (Elith et al., 2010), is that with invasive species “all models can be wrong in the same way, for example, because the species is not in equilibrium”. This may be true of Mitten Crab, as it has no place to reproduce within at least 1500 km of the Stockholm area. My study goal was to model, as an exploration, where they could be found. Failure to do so, at least in terms of transferability, may also be because I lack the right explanatory variable in my model (Elith and Leathwick 2009; Leathwick & Whitehead 2001), because possibly none exists to explain presence other than introduction, or because there are many limitations in this approach, as there is no defined “niche” to qualify in marine water. Future approaches should test a broader suite of predictors, include an analysis of scale of travel, and include a planned (stratified random) data collection field survey.

4.5) Assessment of Validation Results

Internal AUC results calculated for models A, B and C show much higher values than the externally validated sensitivity/specificity AUC, as shown in (table 5). What is surprising is that, in external validation, errors of commission (false positives, predicting presences incorrectly) and omission (false negatives, predicting absences incorrectly) seem to be almost equal for each model, as shown by sensitivity and specificity, even though I did not run Maxent with actual absences. Perhaps the higher internal AUC rating is a result of the “flawed” weighting scheme inherent in ROC/AUC as described in Jimenez-Valverde (2011), who states “AUC is not an appropriate performance measure because the weight of commission errors is much lower than that of omission errors”. As commission errors in internal validation are judged against pseudo-absences (points from an algorithm which represents where an absence should be, not actual absences) and are measured as 1 – specificity (as opposed to specificity), it seems odd to add higher weight to this heuristic in validation. The trade-off, it seems, is inflation of predictive ability in general (table 5 comparison), although models A, B and C performed quite well by all validation metrics.

4.6) GIS Contributions

Many approaches to Species Distribution Modelling have focused on characterizing conditions in “environmental space” that are suitable for the subject, and subsequently identifying where suitable environments are distributed in geographical space.
Modern SDM approaches represent the convergence of site-based ecology and advances in GIS and spatial data technologies (Elith and Leathwick, 2009). Increasingly in recent years, SDMs use GIS tools in the beginning phases of modelling. This study is a geographical approach to SDM, starting essentially from the data. Some benefits of starting a modelling regime from a geographic perspective are: understanding the source projects and processes of spatial data generation, both for species locations and for environmental variable grids; using explorative GIS tools to determine what type of distribution the observation data follow and whether transformation is required; and overlaying and analyzing variable grids visually for errors or possible trends (figure 4). Proper knowledge of the limitations and pitfalls of geographic data to be used in modelling is important in all stages, including understanding variable errors and how that error propagates in predictive and (especially) explanatory models. An example of this is the effect of the quality of the Substrate feature, which was suspect in Stockholm but verified in Blekinge, where it showed a preference of Round Goby for two different substrate types, for two possibly different reproductive strategies (discussed in section 4.2). As substrate is an important factor in Blekinge, a predictive model transferred to Stockholm would only be as good as the lowest common denominator, which is the poor quality of Stockholm Substrate (section 2.5a). In this study, GIS data exploration tools have been highly complementary in analyzing Round Goby data, including the novel use of Grouping Analysis (figures 5 and 6) as a tool for filtering data by similarities in environmental data in Model B. Variograms and Optimized Hot Spot Analysis (ArcMap 10.2.2) (figure 3) served as useful tools in identifying spatial autocorrelation and data outliers, and Mean Center by distribution provided a reinforcing argument regarding the vector of introduction (ballast) (figure 4).
GIS tools were also valuable in raster analysis: in the creation of Stockholm Bathymetry and of Distance to Ports by Euclidean Distance, the processing and aggregation of several Substrate products, the creation of the Bias Grid, and in analyzing multicollinearity via the Band Collection Statistics tool. GIS tools are also invaluable for visualizing model results (Pearson, 2007) and for carrying out additional processing of model output, such as calculating suitable habitat area (table 6), compensating for MAUP effects (figure 7), analyzing natural barriers in colonization of the coastline, and presenting modelled data in the form of maps (figures 8, 9 and 10). Another benefit of the approaches used in this study and in the source material studies is that GIS tools, when used in tandem with statistical validation tools and biological/ecological a-priori knowledge, represent a comprehensive, multidisciplinary approach to modelling, with several avenues for verifying results.

4.7) General Conclusions

High internal and external validation results in Round Goby Models A, B and C give confidence in outlining where the areas of current distribution and of highest risk of colonization are. Model A performs better in validation than the Reid (2015) study, and the resulting predictive grid shows what would most likely be vectors of introduction (or reintroduction) with environmental constraints. Model B performs nearly as well as Model A in validation (table 5), and shows that environmental variables, modelled with filtered data, create a good explanatory model of the colonizing Round Goby population distribution. Model C, modelled using AIC methods to set Maxent parameters, performed well in validation in the Blekinge region; however, performance of the model transferred to Stockholm must be validated through a large-scale (stratified random) sample survey. The prediction of 39% favorable habitat within the Stockholm study area and the multi-pathway risk to Lake Mälaren should certainly be of interest to land/resource managers, and hopefully these results will motivate such a study. Model D, although performing favorably in Stockholm, completely failed to transfer to the Blekinge region, likely due to sample selection bias from public Mitten Crab reports, which also calls into question the validity of the Stockholm regional Mitten Crab model. As it stands, data collected for other project purposes or from public reports, with their associated biases, are often all that is available, and models are really just a “best guess” based on limited information.


5) References

Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19(6): 716–723.

Allouche, O., Tsoar, A. and Kadmon, R. 2006. Assessing the accuracy of species distribution models: prevalence, kappa and the true skill statistic (TSS). Journal of Applied Ecology 43:1223-1232

Almqvist, G. 2008. Round goby Neogobius melanostomus in the Baltic Sea – Invasion Biology in practice, PhD Thesis, Dept. of Systems Ecology, Stockholm University, Sweden. pp 154

Anger, K. 1991. Effects of temperature and salinity on the larval development of the Chinese mitten crab Eriocheir sinensis (Decapoda: Grapsidae). Marine Ecology Progress Series 72: 103–110

Anger, K. 2003. Salinity as a key parameter in the larval biology of decapod crustaceans. Invertebrate Reproduction and Development 43: 29–45

Bahn, V., McGill, B.J. 2013. Testing the predictive performance of distribution models. Oikos, 122, 321–331.

Baldwin, R.A. 2009. Use of Maximum Entropy Modeling in Wildlife Research. Entropy 2009, 11, 854-866

Balshine, S., Verma, A., Chant, V. and Theysmeyer, T. 2005. Competitive Interactions between round gobies and logperch. Journal of Great Lakes Research 31:68-77

Berg, L.S. 1949. Freshwater fishes of the USSR and adjacent countries. Acad. Sci. USSR Zool. Inst. 850 pp

Betts, M.G., Diamond, A.W., Forbes, G.J., Villard, M.-A. and Gunn, J.S. 2006. The importance of spatial autocorrelation, extent and resolution in predicting forest bird occurrence. Ecol Model 191:197–224

Boakes, E.H., McGowan, P.J.K., Fuller, R.A., Ding, C.-Q., Clark, N.E., O’Connor, K. and Mace, G.M. 2010. Distorted views of biodiversity: spatial and temporal bias in species occurrence data. PLoS Biology, 8, e1000385

De Marco, P., Diniz-Filho, J.A.F., Bini, L.M. 2008. Spatial analysis improves species distribution modelling during range expansion. Biol. Lett. 4:577–80

Dormann, C.F. 2007. Promising the future? Global change projections of species distributions. Basic Appl. Ecol. 8:387–97

Dormann, C.F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., García Marquéz, J.R., Gruber, B., Lafourcade, B., Leitão, P.J., Münkemüller, T., McClean, C., Osborne, P.E., Reineking, B., Schröder, B., Skidmore, A.K., Zurell, D. and Lautenbach, S. 2013. Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography 36: 27–46

Drotz, M. K., Berggren, M., Lundberg, S., Lundin, K., Von Proschwitz, T., 2010. Invasion routes, current and historical distribution of the Chinese mitten crab (Eriocheir sinensis H. Milne Edwards, 1853) in Sweden. Aquatic Invasions Volume 5, Issue 4: 387–396

Dudik, M., Phillips, S. J. and Schapire, R. E. 2005. Correcting sample selection bias in maximum entropy density estimation. Pages 323–330 in Advances in neural information processing systems 18. MIT Press, Cambridge, Massachusetts, USA

Duque-Lazo, J., Van-Gils, H., Groen, T.A., Navarro-Cerrillo, R.M. 2016. Transferability of species distribution models: The case of Phytophthora cinnamomi in Southwest Spain and Southwest Australia. Ecological Modelling 320 (2016) 62–70

Elith, J., Kearney, M., Phillips, S. 2010. The art of modelling range-shifting species. Methods in Ecology & Evolution 2010, 1, 330–342

Elith, J., Graham, C.H. 2009. Do they? How do they? WHY do they differ? On finding reasons for differing performances of species distribution models. Ecography 32, 66–77

Elith, J., Graham, C.H., Anderson, R.P., Dudik, M., Ferrier, S., Guisan, A., Hijmans, R.J., Huettmann, F., Leathwick, J.R., Lehmann, A., Li, J., Lohmann, L.G., Loiselle, B.A., Manion, G., Moritz, C., Nakamura, M., Nakazawa, Y., Overton, J.McC., Peterson, A.T., Phillips, S.J., Richardson, K., Scachetti-Pereira, R., Schapire, R.E., Soberon, J., Williams, S., Wisz, M.S., Zimmermann, N.E. 2006. Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29, 129–151

Elith, J. and Leathwick, J.R. 2009a. Species distribution models: ecological explanation and prediction across space and time. Annual Review of Ecology, Evolution and Systematics, 40, 677–697.

Elith, J., Phillips, S., Hastie, T., Dudik, M., En Chee, Y., Yates, C. 2011. A statistical explanation of MaxEnt for ecologists. Diversity and Distributions 17, 43–57

ESRI 2016. ArcGIS/ArcMap. 10.2.2 ed. Redlands, California: ESRI (Environmental Systems Research Institute).

Ferrier, S., Watson, G., Pearce, J., Drielsma, M. 2002. Extended statistical approaches to modelling spatial pattern in biodiversity: the north-east New South Wales experience. I. Species-level modelling. Biodivers. Conserv. 11:2275–307

Fielding, A. H., and Bell, J. F. 1997. A review of methods for the assessment of prediction errors in conservation presence/absence models. Environmental Conservation 24:38–49

Fithian, W., Elith, J., Hastie, T. and Keith, D. 2014. Bias correction in species distribution models: pooling survey and collection data for multiple species. Methods in Ecology and Evolution. doi: 10.1111/2041-210X.12242

Fitzsimmons, J., B. Williston, G. Williston, G. Bravener, J. L. Jonas, R. M. Claramunt, J. E. Marsden, and B. J. Ellrott. 2006. Laboratory estimates of salmonine egg predation by round gobies (Neogobius melanostomus), sculpins (Cottus cognatus and C. bairdi), and Crayfish (Orconectes propinquus). Journal of Great Lakes Research 32:227–241.

Fleishman, E., MacNally, R., Fay, J.P., Murphy, D.D. 2001. Modeling and predicting species occurrence using broad-scale environmental variables: an example with butterflies of the Great Basin. Conserv. Biol. 15:1674–85

Florén, K., Philipson, P., Strömbeck, N., Nyström-Sandman, A., Isæus, M. and Wijkmark, N. 2012. Satellite-Derived Secchi Depth for Improvement of Habitat Modelling in Coastal Areas. AquaBiota Report 2012:02. 56 pp.

Florin, A.-B. 2011. Svartmunnad smörbult-risk eller resurs? HAVET 2011: 49-51. (in Swedish)

French, J.R.P. III, and D. J. Jude. 2001. Diets and Diet Overlap of Nonindigenous Gobies and Small Benthic Native Fishes Co-inhabiting the St. Clair River, Michigan. Journal of Great Lakes Research. 27:300–311

Fyhr, F., Wijkmark, N., Isæus, M., Enhus, C., Lindahl, U., Ogonowski, M., Nikolopoulos, A., Nilsson, L., Nystrom-Sandman, A., Naslund, J., Didrikas, T., Sundblad, G., Wikstrom, S., Hogfors, H., Floren, K., Slagbrand, P., Hallberg, O., and Philipson, P. 2015. Marine mapping and management scenarios in the Hanö Bight, Sweden. AquaBiota Report 2015:01. ISBN: 978-91-85975-38-9.

Glor, R.E., and Warren, D.L. 2011. Testing the ecological basis of biogeographic boundaries. Evolution 65:673–683.

Gollasch, S. 2011. NOBANIS – Invasive Alien Species Fact Sheet – Eriocheir sinensis. – From: Online Database of the European Network on Invasive Alien Species – NOBANIS www.nobanis.org

Gormley, A.M., Forsyth, D.M., Griffioen, P., Woodford, L., Lindeman, M., Scroggie, M.P. and Ramsey, D.S.L. 2011. Using presence-only and presence–absence data to estimate the current and potential distributions of established invasive species. Journal of Applied Ecology 48, 25–34

Guisan, A. and Zimmermann, N.E. 2000. Predictive habitat distribution models in ecology. Ecol. Model., 135, 147–186

Guisan, A. and Thuiller, W. 2005. Predicting species distribution: offering more than simple habitat models. Ecology Letters, 8, 993–1009

Harrell, F. E. Jr. 2001. Regression modeling strategies – with applications to linear models, logistic regression, and survival analysis. – Springer.


HELCOM 2014. HELCOM Guide to Alien Species and Ballast Water Management in the Baltic Sea. 40 pp.

Hijmans R. J., and J. Elith. 2011. Species Distribution Modeling with R. Reference and Guide

Hijmans, R.J. 2012. Cross-validation of species distribution models: removing spatial sorting bias and calibration with a null model. Ecology, 93, 679–88.

Hirsch, P.E., N’Guyen, A., Adrian-Kalchhauser, I., Burkhardt-Holm, P. 2016. What do we really know about the impacts of one of the 100 worst invaders in Europe? A reality check. Ambio 45, 267–279. Online publication date: 1-Apr-2016.

Isæus, M. 2004. Factors structuring Fucus communities at open and complex coastlines in the Baltic Sea, PhD Thesis, Dept. of Botany, Stockholm University, Sweden. 40+p.

Jimenez-Valverde, A. 2012. Insights into the area under the receiver operating characteristic curve (AUC) as a discrimination measure in species distribution modelling. Global Ecology and Biogeography 21:498–507.

Jimenez-Valverde, A., Peterson, A.T., Soberon, J., Overton, J.M., Aragon, P., Lobo, J.M. 2011. Use of niche models in invasive species risk assessments. Biol Invasions (2011) 13:2785–2797

Jueterbock, A., Smolina, I., Coyer, J.A. and Hoarau, G. 2016. The fate of the Arctic seaweed Fucus distichus under climate change: an ecological niche modelling approach. Ecology and Evolution 6(6), 1712–1724

Kornis, M.S., Mercado-Silva, N. and Vander Zanden, M. J. 2012. Twenty years of invasion: a review of round goby Neogobius melanostomus biology, spread, and ecological implications. Journal of Fish Biology 80:235– 285. doi: 10.1111/j.1095-8649.2011.03157.x

Kotta, J., Nurkse, K., Puntila, R., Ojaveer, H., 2016. Shipping and natural environmental conditions determine the distribution of the invasive non-indigenous round goby Neogobius melanostomus in a regional sea. Estuarine, Coastal and Shelf Science 169 (2016) p.15-24

Kramer-Schadt, S., Niedballa, J., Pilgrim, J.D., Schröder, B., Lindenborn, J., Reinfelder, V., Stillfried, M., Heckman, I., Scharf, A.K., Augeri, D.M., Cheyne, S.M., Hearn, A.J., Ross, J., Macdonald, D., Mathai, J., Eaton, J., Marshall, A.J., Semiadi, G., Rustam, R., Bernard, H., Alfred, R., Samejima, H., Duckworth, J.W., Breitenmoser-Wuersten, C., Belant, J.L., Hofer, H., Wilting, A. 2013. The importance of correcting for sampling bias in MaxEnt species distribution models (M. Robertson, Ed.). Divers. Distrib. 19:1366–1379

Kumar, S., Stohlgren, T. J. 2009. Maxent modeling for predicting suitable habitat for threatened and endangered tree Canacomyrica monticola in New Caledonia. Journal of Ecology and Natural Environment Vol. 1(4), pp. 094-098

Lauer, T.E., and Truemper, H.A. 2005. Gape limitation and piscine prey size-selection by yellow perch in the extreme southern area of Lake Michigan, with emphasis on two exotic prey items. Journal of Fish Biology 66:135-149

Leathwick JR, Whitehead D. 2001. Soil and atmospheric water deficits and the distribution of New Zealand’s indigenous tree species. Funct. Ecol. 15:233–42

Lowe S., Browne M., Boudjelas S., De Poorter, M. 2000. 100 of the World’s Worst Invasive Alien Species A selection from the Global Invasive Species Database. Published by The Invasive Species Specialist Group (ISSG) a specialist group of the Species Survival Commission (SSC) of the World Conservation Union (IUCN), 12pp. First published as special lift-out in Aliens 12, December 2000. Updated and reprinted version: November 2004.

Mahalanobis, P.C. 1936. On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India 2(1): 49–55.

Marmefelt, E., Sahlberg, J. and Bergstrand, M. 2007. Home Vatten i södra Östersjöns vattendistrikt. Integrerat modellsystem för vattenkvalitetsberäkningar. SMHI Oceanografi 87 [In Swedish]

Marquard, O. 1926. Die Chinesische Wollhandkrabbe, Eriocheir sinensis MILNE-EDWARDS, ein neuer Bewohner deutscher Flüsse. Fischerei, 24: 417-433 pp


Merow, C., Smith, M.J., Edwards Jr., T.C., Guisan, A., McMahon, S.M., Normand, S., Thuiller, W., Wüest, R.O., Zimmermann, N.E. and Elith, J. 2014. What do we gain from simplicity versus complexity in species distribution models? Ecography 37: 1267–1281

Midgley, G.F., Hughes, G.O., Thuiller, W., Rebelo, A.G. 2006. Migration rate limitations on climate change-induced range shifts in Cape Proteaceae. Divers. Distrib. 12:555

Nakazato, T., Warren, D.L. and Moyle, L.C. 2010. Ecological and geographic modes of species divergence in wild tomatoes. American Journal of Botany 97:680-693.

National Research Council. 1995. Understanding Marine Biodiversity. Washington (DC): National Academy Press

Nehring, S. 2005. International shipping - a risk for aquatic biodiversity in Germany. In: Nentwig, W. et al. (Eds.): Biological Invasions - From Ecology to Control. NEOBIOTA 6:125-143.

Ng, A. Y. and Jordan, M. I. 2001. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. Adv. Neural Inform. Process. Syst. 14: 605-610

Nyström-Sandman, A., Didrikas T., Enhus, C., Florén, K., Isaeus, M., Nordemar, I., Nikolopoulos, A., Sundblad, G., Svanberg, K. & Wijkmark, N., 2013: Marin Modellering i Stockholms län, AquaBiota Report 2013:10

Ojaveer, H., Kotta, J. 2014. Ecosystem impacts of the widespread non-indigenous species in the Baltic Sea: literature survey evidences major limitations in knowledge. Hydrobiologia:1-15.

Ojaveer H., Gollasch S., Jaanus A., Kotta J., Laine, A.O., Minde, A., Normant, M., Panov V.E. (2007) Chinese mitten crab Eriocheir sinensis in the Baltic Sea – a supply-side invader? Biological Invasions 9: 409–418

Openshaw, S. 1983. The modifiable areal unit problem. Norwich, Norfolk: Geo Books. Concepts and Techniques in Modern Geography, no. 38.

Pearson, R. G. 2007. Species’ Distribution Modeling for Conservation Educators and Practitioners. Synthesis. New York: Am. Mus. Natl. Hist. http://ncep.amnh.org

Pearson, R.G., Raxworthy, C.J., Nakamura, M., Peterson, A.T. 2007. Predicting species distributions from small numbers of occurrence records: a test case using cryptic geckos in Madagascar. J. Biogeo. 34: 102–117

Peters, N., Panning, A., Thiel, H., Werner, H. and Schmalfuß, H. 1936. Die chinesische Wollhandkrabbe in Europa. Der Fischmarkt, 4/5: pp 1-19

Peterson, A.T., Soberón, J., 2012. Species Distribution Modeling and Ecological Niche Modeling: Getting the Concepts Right. Natureza & Conservação (Brazilian Journal of Nature Conservation)10(2):102-107

Peterson, A.T., Papes, M. and Eaton, M. 2007. Transferability and model evaluation in ecological niche modeling: a comparison of GARP and Maxent. Ecography, 30, 550–560

Phillips, S.J., Dudik, M. and Schapire, R.E. 2004. A maximum entropy approach to species distribution modeling. In Proceedings of the 21st international conference on machine learning, pp. 655-662. ACM Press, New York.

Phillips, S.J., Anderson, R.P. and Schapire, R.E. 2006. Maximum entropy modeling of species geographic distributions. Ecological Modelling 190, 231-259

Phillips, S.J. (2008) Transferability, sample selection bias and background data in presence-only modelling: a response to Peterson et al. 2007. Ecography, 31, 272–278

Phillips, S.J. and Dudik, M. 2008. Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography, 31, 161–175

Radosavljevic, A., Anderson, R.P. 2014. Making better MAXENT models of species distributions: complexity, overfitting and evaluation. Journal of Biogeography (J. Biogeogr.) (2014) 41, 629–643

Randin, C.F., Dirnbock, T., Dullinger, S., Zimmermann, N.E., Zappa, M. & Guisan, A. 2006. Are niche-based species distribution models transferable in space? Journal of Biogeography, 33, 1689–1703


Reid, D. 2015. Habitat Suitability and Species Distribution Modeling of the Round Goby in the Blekinge/Hanobukten region of the Swedish Baltic Sea Coastline. AquaBiota Report 2015:07. 34 pp. (submitted for publication, request copy)

Roseman, E. F., W. W. Taylor, D. B. Hayes, A. L. Jones, and J. T. Francis. 2006. Predation on Walleye eggs by fish on reefs in Western Lake Erie. Journal of Great Lakes Research 32:415–423.

Roura-Pascual, N., Suarez, A.V., Gomez, C., Pons, P., Touyama, Y., Wild, A.L. 2006. Niche differentiation and fine-scale projections for Argentine ants based on remotely-sensed data. Ecological Applications, 16, 1832–1841.

Sapota, M.R. 2012. NOBANIS – Invasive Alien Species Fact Sheet, Neogobius melanostomus. Nobanis.org

Shan, Y., Paull, D., McKay, R.I. 2006. Machine learning of poorly predictable ecological data. – Ecol. Model. 195: 129 – 138.

Simberloff, D. 2011. How common are invasion-induced ecosystem impacts? Biological Invasions 13: 1255–1268

Steinhart, G. B., E. A. Marschall, and R. A. Stein. 2004 (a). Round goby predation on smallmouth bass offspring in nests during simulated catch-and-release angling. Transactions of the American Fisheries Society 133:121–131.

Steinhart, G. B., R. A. Stein, and E. A. Marschall. 2004 (b). High growth rate of young-of-year smallmouth bass in Lake Erie: a result of the round goby invasion? Journal of Great Lakes Research 30:381–389

Strayer, D. L. 2012. Eight questions about invasions and ecosystem functioning. Ecology Letters 15: 1199–1210.

Sundblad, G., Bekkby, T., Isæus, M., Nikolopoulos, A., Norderhaug, K. M. and Rinde, E. 2014a. Comparing the ecological relevance of four wave exposure models. Estuarine, Coastal and Shelf Science 140:7–13

Sundblad, G., Bergström, U., Sandström, A. and Eklöv, P. 2014b. Nursery habitat availability limits adult stock sizes of predatory coastal fish. ICES Journal of Marine Science 71:672-680

Tibshirani, R. 1996. Bias, variance and prediction error for classification rules. Technical report, Univ. of Toronto.

Thuiller, W., Albert, C., Araujo, M.B., Berry, P.M., Cabeza, M., Guisan, A., Hickler, T., Midgley, G.F., Paterson, J., Schurr, F.M., Sykes, M.T., Zimmermann, N.E. 2008. Predicting global change impacts on plant species’ distributions: future challenges. Perspect. Plant Ecol. Evol. Syst. 9, 137–152

Wisz, M.S., Hijmans, R., Li, J., Peterson, A.T., Graham, C.H., Guisan, A. 2008. Effects of sample size on the performance of species distribution models. Diversity and Distributions 14, 763–773

Václavík, T., Meentemeyer, R.K. 2009. Invasive species distribution modeling (iSDM): are absence data and dispersal constraints needed to predict actual distributions? Ecol. Model. 220, 3248–3258

Václavík, T., Kanaskie, A., Hansen, E.M., Ohmann, J.L., Meentemeyer, R.K. 2010. Predicting potential and actual distribution of sudden oak death in Oregon: Prioritizing landscape contexts for early detection and eradication of disease outbreaks. Forest Ecology and Management 260, 1026–1035

Van Gils, H., Westinga, E., Carafa, M., Antonucci, A., Ciaschetti, G. 2014. Where the bears roam in Majella National Park, Italy. J. Nat. Conserv. 22, 276–287

Veloz, S. 2009. Spatially autocorrelated sampling falsely inflates measures of accuracy for presence-only niche models. Journal of Biogeography 36:2290–2299.

Warren, D. and Seifert, S. 2011. Ecological niche modeling in MaxEnt: the importance of model complexity and the performance of model selection criteria. – Ecol. Appl. 21: 335–342

Warren, D.L., Glor, R.E. and M. Turelli. 2010. ENMTools: a toolbox for comparative studies of environmental niche models. Ecography 33:607-611.


Warren, D.L., Wright, A.N., Seifert, S. N., and Shaffer, H. B. 2014. “Incorporating Model Complexity and Spatial Sampling Bias into Ecological Niche Models of Climate Change Risks Faced by 90 California Vertebrate Species of Concern.” Diversity and Distributions 20 (3): 334–343. doi:10.1111/ddi.12160.

Warren, D.L., Glor, R. E. and M. Turelli. 2008. Environmental niche equivalency versus conservatism: quantitative approaches to niche evolution. Evolution 62:2868-2883.

Williams, P. M. 1995. Bayesian regularization and pruning using a Laplace prior. Neural Comput. 7: 117-143

Yost, A.C., Petersen, S.L., Gregg, M., and Miller, R. 2008. Predictive modeling and mapping sage grouse (Centrocercus urophasianus) nesting habitat using maximum entropy and a long-term dataset from southern Oregon. Ecological Informatics 3(6):375–386.

Zaniewski, A.E., Lehmann, A. & Overton, J.M. 2002. Predicting species spatial distributions using presence-only data: a case study of native New Zealand ferns. Ecological Modelling, 157, 261–280

Links

1) Naturvårdsverketrapport, Mitten Crab and risk of Round Goby in Lake Mälaren, (Swedish) http://www.naturvardsverket.se/Documents/publikationer6400/978-91-620-6375.pdf

2) frammandearter: Mitten Crab Facts (translated) http://www.frammandearter.se/0/2english/pdf/Eriocheir_sinensis.pdf

3) SMHI - Swedish Meteorological and Hydrological Institute (2010) Klimatdata. http://www.skvvf.se/ http://www.smhi.se/klimatdata

4) HELCOM Guide to Alien Species and Ballast Water Management and http://globallast.imo.org

5) Svealandskustförbund (bathymetry of Stockholm) http://www.kustdata.su.se/skvvf/giskartor.html

SGU Substrate reports

6) http://resource.sgu.se/dokument/produkter/maringeologi-500000-beskrivning.pdf

7) http://resource.sgu.se/dokument/produkter/maringeologi-100000-beskrivning.pdf

8) United Nations Code for Trade and Transport Locations (UN/LOCODE) http://www.unece.org/fileadmin/DAM/cefact/locode/se.htm

9) Ports and harbors Balance/helcom http://helcom.fi/baltic-sea-trends/data-maps/biodiversity/balance

10) smhi home data http://www.smhi.se/sgn0106/if/biblioteket/rapporter_pdf/Oceanografi_98.pdf

11) SMHI Coastal Zone Model and Probe Model http://www.smhi.se/sgn0106/if/biblioteket/rapporter_pdf/Oceanografi_98.pdf

12) The Marmoni project referenced above (Fyhr et al., 2015) is part of a multi-nation, Baltic-wide data gathering project, found here: http://marmoni.balticseaportal.net/wp/

13) Phillips and Schapire, A Brief Tutorial on Maxent https://www.cs.princeton.edu/~schapire/maxent/tutorial/tutorial.doc

14) Danish Maritime authority A and B class ships http://www.dma.dk/AIS/WORTHKNOWINGABOUTAIS/Sider/AISclassAandB.aspx


6) Appendices

Figure A1 Ripley's K function for Round Goby (top) and Mitten Crab (bottom). Created in ArcMap, these graphs show statistically significant clustering or dispersion of observation locations within the respective study areas. The confidence envelope is constructed by distributing random points 99 times for ten iterations and calculating distances for each distribution. Round Goby (top) is considered to be significantly clustered in parts of the Blekinge study area and significantly dispersed in others. Expected K values represent the expected dispersion at larger distances. The lower confidence envelope shown (top) is a result of the size and irregular shape of the study area. Mitten Crab (bottom) is considered to be significantly clustered within the Stockholm study area. The X axis is distance; the Y axis is a transformation of the K function.
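For readers unfamiliar with the statistic, the construction described in the caption can be sketched as follows. This is a minimal Python illustration and not the ArcMap implementation: it assumes a rectangular study area, applies no edge correction, uses the common transformation L(d) = sqrt(K(d) / pi), and reads "ten iterations" as ten distance bands. The function names `l_function` and `csr_envelope` and the synthetic points are hypothetical.

```python
import math
import random


def l_function(points, distances, area):
    """Transformed Ripley's K, L(d) = sqrt(K(d) / pi), without edge correction."""
    n = len(points)
    # All unordered pairwise distances between observation locations.
    pair_dists = [
        math.dist(points[i], points[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    values = []
    for d in distances:
        close_pairs = sum(1 for dist in pair_dists if dist <= d)
        # Each unordered pair counts twice in the ordered-pair sum of K(d).
        k = area * 2 * close_pairs / (n * (n - 1))
        values.append(math.sqrt(k / math.pi))
    return values


def csr_envelope(n, width, height, distances, n_sim=99, seed=0):
    """Lowest and highest L(d) over n_sim random point patterns."""
    rng = random.Random(seed)
    lower = [math.inf] * len(distances)
    upper = [-math.inf] * len(distances)
    for _ in range(n_sim):
        pts = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]
        sim = l_function(pts, distances, width * height)
        lower = [min(a, b) for a, b in zip(lower, sim)]
        upper = [max(a, b) for a, b in zip(upper, sim)]
    return lower, upper


if __name__ == "__main__":
    width = height = 100.0
    distances = [5.0 * step for step in range(1, 11)]  # ten distance bands
    rng = random.Random(1)
    # Synthetic, tightly clustered pattern standing in for observation points.
    clustered = [(rng.uniform(40, 45), rng.uniform(40, 45)) for _ in range(30)]
    observed = l_function(clustered, distances, width * height)
    lower, upper = csr_envelope(len(clustered), width, height, distances)
    for d, obs, lo, hi in zip(distances, observed, lower, upper):
        label = "clustered" if obs > hi else "dispersed" if obs < lo else "random"
        print(f"d={d:5.1f}  L={obs:6.2f}  envelope=[{lo:5.2f}, {hi:5.2f}]  {label}")
```

Observed values above the upper envelope indicate clustering at that distance, and values below the lower envelope indicate dispersion. Without edge correction the envelope itself bends downward at larger distances, which is consistent with the caption's remark about study area size and shape.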


Figure A2 Study area with joined features from: fine scale (Fyhr et al. 2015) in yellow, coarser scale SGU 1:100,000 in red and coarsest scale SGU 1:500,000 in green.

Table A1 Correlation matrix of Pearson’s r values for pixels in the same location. Red highlights indicate variables removed on the basis of these results and the jackknife measure of variable importance (with the exception of temperature in model C and secchi in model D). Orange highlights mark cells with values over the 0.60 threshold selected for this model. Gray highlights indicate variables removed for low contribution as measured by the jackknife measure of variable importance in Maxent.

Layer          traffic      bathy   salinity       temp  dist2port    hediste  highplant  hydrobiid   landform     macoma    mytilus      perch       pike     secchi      slope  substrate   nitrogen phosphorus       wave    zostera
traffic           1.00      -0.28       0.00      -0.27      -0.09      -0.10      -0.11      -0.08      -0.05       0.20       0.01      -0.20      -0.16       0.14      -0.01       0.10      -0.04       0.20       0.07      -0.10
bathy                        1.00      -0.22       0.92      -0.34       0.41       0.30       0.34       0.12      -0.58       0.09       0.47       0.48      -0.58       0.18      -0.02       0.13      -0.87      -0.58       0.28
salinity                                1.00      -0.09      -0.24      -0.19      -0.20      -0.18      -0.02      -0.13       0.00      -0.37      -0.29      -0.13      -0.10      -0.07      -0.12       0.30       0.27      -0.11
temp                                               1.00      -0.30       0.34       0.21       0.26       0.10      -0.63       0.15       0.33       0.36      -0.58       0.16      -0.07       0.04      -0.92      -0.46       0.21
dist2port                                                     1.00      -0.32      -0.14      -0.21       0.04       0.06       0.14      -0.23      -0.24       0.62      -0.11      -0.22      -0.44       0.19       0.55      -0.11
hediste                                                                  1.00       0.44       0.53      -0.10       0.13      -0.25       0.42       0.49      -0.44      -0.07       0.23       0.36      -0.34      -0.56       0.42
highplant                                                                           1.00       0.50      -0.02       0.05      -0.17       0.52       0.63      -0.24      -0.05       0.25       0.34      -0.23      -0.42       0.46
hydrobiid                                                                                      1.00      -0.03       0.13      -0.18       0.39       0.42      -0.32      -0.07       0.26       0.33      -0.29      -0.46       0.37
landform                                                                                                  1.00      -0.28       0.20       0.06       0.05       0.04       0.16      -0.12      -0.04      -0.09       0.07      -0.03
macoma                                                                                                               1.00      -0.43      -0.08      -0.04       0.18      -0.25       0.33       0.27       0.60      -0.06       0.06
mytilus                                                                                                                         1.00      -0.25      -0.25       0.21       0.27      -0.34      -0.34      -0.21       0.34      -0.17
perch                                                                                                                                      1.00       0.80      -0.45       0.02       0.28       0.53      -0.34      -0.58       0.26
pike                                                                                                                                                  1.00      -0.37       0.04       0.21       0.43      -0.35      -0.57       0.48
secchi                                                                                                                                                           1.00      -0.05      -0.29      -0.54       0.45       0.68      -0.17
slope                                                                                                                                                                       1.00      -0.14      -0.05      -0.16      -0.13      -0.01
substrate                                                                                                                                                                              1.00       0.42       0.09      -0.41       0.08
nitrogen                                                                                                                                                                                          1.00       0.04      -0.57       0.12
phosphorus                                                                                                                                                                                                   1.00       0.40      -0.20
wave                                                                                                                                                                                                                    1.00      -0.34
zostera                                                                                                                                                                                                                            1.00
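The screening that Table A1 summarises can be sketched in a few lines of Python. The sketch is illustrative only: it uses a hand-copied subset of five layers from the table, assumes the 0.60 threshold is applied to the absolute value of the coefficient, and the function name `flag_collinear_pairs` is hypothetical. In this study the coefficients came from the Band Collection Statistics tool, and the choice of which variable in a pair to drop also drew on Maxent jackknife results, which the sketch does not attempt to reproduce.

```python
# Pairwise Pearson correlations copied from Table A1 (subset of five layers).
CORRELATIONS = {
    ("bathy", "temp"): 0.92,
    ("bathy", "perch"): 0.47,
    ("bathy", "pike"): 0.48,
    ("bathy", "phosphorus"): -0.87,
    ("temp", "perch"): 0.33,
    ("temp", "pike"): 0.36,
    ("temp", "phosphorus"): -0.92,
    ("perch", "pike"): 0.80,
    ("perch", "phosphorus"): -0.34,
    ("pike", "phosphorus"): -0.35,
}


def flag_collinear_pairs(correlations, threshold=0.60):
    """Return (layer_a, layer_b, r) for pairs with |r| above the threshold,
    strongest first. Using |r| is an assumption of this sketch."""
    flagged = [
        (a, b, r) for (a, b), r in correlations.items() if abs(r) > threshold
    ]
    return sorted(flagged, key=lambda item: abs(item[2]), reverse=True)


if __name__ == "__main__":
    for a, b, r in flag_collinear_pairs(CORRELATIONS):
        print(f"{a} ~ {b}: r = {r:+.2f}")
```

A flagged pair only signals redundancy; deciding which layer to keep still needs ecological judgement or a measure of variable importance.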


Figure A3 Workflow for this study from data collection and processing through to finished grids

Figure A4 The following curves are results from Maxent models created using only the corresponding variable, for models A, B and C (Round Goby, top) and model D (Mitten Crab, bottom). These plots reflect the dependence of predicted suitability both on the selected variable and on dependencies induced by correlations between the selected variable and other variables. From these plots it can be determined whether the presence of the respective species is positively or negatively correlated with each individual variable. In this figure, AiS_traf is Marine Traffic, bathy is bathymetry, bottom_sal is salinity, bottom temp is temperature, dist2port is distance to ports, Hediste is Hediste diversicolor, Hydrobiid is Hydrobiidae, macoma is Macoma balthica, Mytilus is Mytilus edulis, Secchi is Secchi disk depth and wave is Wave Exposure. The only categorical variable (substrate) shows correlation with the substrate types defined in table 4.


Figure A5 Example of the comparison of sensitivity and specificity used to validate externally and set thresholds for predictive grids (Model A), calculated in R.
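A comparison of this kind can be sketched as below. The example is written in Python rather than R, uses made-up suitability scores, and assumes the threshold is chosen where sensitivity plus specificity is highest (equivalently, where the true skill statistic of Allouche et al. 2006 peaks). Whether this exact rule was applied to Model A is not stated here, and the function names are hypothetical.

```python
def sensitivity_specificity(presence_scores, absence_scores, threshold):
    """Sensitivity and specificity when sites scoring at or above the
    threshold are predicted as suitable."""
    true_pos = sum(1 for s in presence_scores if s >= threshold)
    true_neg = sum(1 for s in absence_scores if s < threshold)
    return true_pos / len(presence_scores), true_neg / len(absence_scores)


def best_threshold(presence_scores, absence_scores, candidates):
    """Candidate threshold with the highest TSS (sensitivity + specificity - 1).

    Returns (threshold, tss, sensitivity, specificity); ties keep the lowest
    candidate because only strictly better values replace the current best.
    """
    best = None
    for t in candidates:
        sens, spec = sensitivity_specificity(presence_scores, absence_scores, t)
        tss = sens + spec - 1.0
        if best is None or tss > best[1]:
            best = (t, tss, sens, spec)
    return best


if __name__ == "__main__":
    # Hypothetical model output at independent survey sites.
    presences = [0.91, 0.78, 0.66, 0.59, 0.42]
    absences = [0.55, 0.38, 0.31, 0.20, 0.12, 0.05]
    candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    t, tss, sens, spec = best_threshold(presences, absences, candidates)
    print(f"threshold={t:.2f} TSS={tss:.2f} "
          f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Other rules, such as equal sensitivity and specificity or a fixed minimum sensitivity, would give different thresholds from the same scores, so the rule used should be reported alongside the resulting grid.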

Figure A6 Comparison of current study results (top) with previous 2015 study results (bottom)
