A new tool for predicting distribution patterns of African Odonata in space and time: sensitivity analyses of model parameters and environmental variables

A thesis submitted in partial fulfilment of the requirements of the degree of Doctor of Natural Sciences (Dr. rer. nat.) of the Faculty of Environment and Natural Resources, Albert-Ludwigs-Universität Freiburg im Breisgau, Germany

Submitted by Nirmal Ojha from Nepal

Freiburg im Breisgau 2014

Dean: Prof. Dr. Barbara Koch

Supervisor: Prof. Dr. Axel Drescher

2nd Reviewer: Prof. Dr. Carsten F. Dormann

Date of thesis defence: 19 Nov 2013

Acknowledgements

Firstly, I would like to thank the Faculty of Environment and Natural Resources for accepting my application to pursue a doctoral degree. Many thanks go to Prof Dr Axel Drescher for supervising my research work. My warm thanks go to Prof Dr Carsten F Dormann for agreeing to review my work. Special thanks are due to Prof Dr Gertrud Schaab from the Hochschule Karlsruhe, who supervised my work and offered her guidance throughout the entire period. I am grateful to the office of the Chancellor of the Hochschule Karlsruhe and the helpful team of the Institute of Applied Research (IAF) for providing me with work space and allowing me to use the facilities. Acknowledgement is also due to Christian Stern, who organised the necessary ArcGIS developer’s licence each year in a timely manner. I am thankful to Dr Viola Clausnitzer, Dr Frank Suhling, Dr K. D-B Dijkstra and Jens Kipping for providing me with the records of dragonfly locations. Their discussions and feedback on the initial predicted distributions of some species helped to improve the modelling tool and, thereby, the prediction results. My appreciation also goes to the former members of the working group G(V)ISЯ at the IAF, who were part of the BIOTA Africa project, for their friendship, moral support and cordial working environment, which made this part of the journey enjoyable. Thanks also to Dorothea Heim, who helped translate the extended summary into German. I am indebted to Prof Dr Prajwal Lal Pradhan from the Institute of Engineering, Pulchowk Campus, Nepal. He has been a motivational figure, and his advice and moral support have led me to this position today. Finally, I would like to express my sincere gratitude to my parents for their support.



Abstract

In the last few decades, Africa has been a dynamic continent with regard to changes in landscape, population and climate. Species distribution modelling (SDM) can help to identify the effects of changing environmental conditions on biodiversity, and it has been used in a wide array of ecological applications such as determining hotspots, planning reserves, designing surveys for biodiversity inventories, or assessing the impacts of environmental change on biodiversity. Odonata, which require both terrestrial and aquatic ecosystems to complete their life cycle, are suitable flagship species for many ecological studies. Here, a new logistic regression based SDM tool, the ‘SpeeDi Tool’, is presented, focusing on modelling the distribution of African Odonata species using the Odonata Database of Africa. The use of a geographic information system (GIS) in pre- and post-processing is an integral part of the SDM workflow, and GIS and statistical modelling are integrated in the SpeeDi Tool. The user-centred approach taken in developing the SpeeDi Tool offers usability and lets the user achieve the goal (i.e. predicting the distribution range) with ease. Pseudagrion kersteni, a species widespread in sub-Saharan Africa, is taken as the species of interest to demonstrate the use and ability of the SpeeDi Tool. An expert-drawn, watershed based range map from the IUCN is used for visual comparison with the modelled spatial distribution and thus for evaluating the predicted range. The SpeeDi Tool has several modelling parameters, some of which are new in the SDM field: the elastic-net factor, which has not been applied to SDM using background samples until now; the soft buffer threshold (SBT), a new concept introduced here; and weights for samples. In addition to the use of background samples, the tool introduces modelling with presence samples together with absence and/or background samples; the combination of presence, absence and background samples is a new option not yet found in existing SDM tools.

To gain confidence in using the SpeeDi Tool, several sensitivity analyses are performed with P. kersteni samples, covering different modelling approaches, different modelling parameters and different environmental geodatasets. These sensitivity analyses are intended to determine the optimum values of the regression parameters that maximise the model’s performance, and to identify the important environmental variables and their effects on the prediction of distribution ranges. A concept similar to that of a virtual species is used to evaluate the general applicability of the SpeeDi Tool. The sensitivity analyses of modelling parameters showed that a) elastic-net regularisation is superior to L1 or L2 regularisation, b) the uncertainty in population prevalence in background samples can be reduced by applying the SBT, c) weights can be effective in reducing the effects of sampling bias, d) model fitting is sensitive to the number of background samples, and e) product interactions of variables are necessary for better prediction of the distribution range. The sensitivity analyses of environmental datasets showed that a) monthly climate datasets should be preferred over synthesised bioclimatic datasets, and b) distributions predicted using land-cover datasets with different classification schemes differ little, but the contributions of land-cover classes in the different datasets indicate that the ecological significance of these classes can be misinterpreted. Further, the results for the modelling of A. minuscula showed little difference in distribution range between spatial resolutions of 1 km and 8 km. The results also indicated that the modelling extent should not extend too far beyond the species’ native region.



Table of Contents

Acknowledgements
Abstract
List of Figures
List of Tables
List of Boxes
List of acronyms
1. Introduction
    Background
    The thesis’ aims and limits
    Outline of the thesis
2. Background on species distribution modelling
    Three types of species distribution models
    Uses of species distribution modelling
    Empirical methods for species distribution modelling
    Use of presence, absence and background data in species distribution modelling
    General characteristics and assumptions of statistical species distribution models
    Commonly used data in statistical species distribution models
    Incorporating statistics and GIS for species distribution modelling
    Summary and main considerations
3. Quality-in-use for development of species distribution modelling tool
    Context of use
    Quality measures
        3.2.1. Functionality
        3.2.2. Reliability
        3.2.3. Usability
        3.2.4. Efficiency
        3.2.5. Maintainability
        3.2.6. Portability
    User-centred design
    User experience
        3.4.1. Cognition
        3.4.2. Metaphors
        3.4.3. Emotions
4. Developing a robust and easy to use species distribution modelling tool
    Work flow concept for geodata processing and statistical modelling for SDM in SpeeDi Tool
        4.1.1. Geodata preparation
        4.1.2. Statistical modelling
        4.1.3. Post-processing
    User-centred design and user profile for the SpeeDi Tool
    Architecture for the modelling tool
    GUI design
    Logistic regression with presence, absence and background data
        4.5.1. Formulating binary logistic regression model
        4.5.2. Control mechanism to counter over-fitting in regression model
    Functions offered by the SpeeDi Tool
        4.6.1. Pre-processing in the SpeeDi Tool
        4.6.2. Statistical modelling using logistic regression in the SpeeDi Tool
        4.6.3. Post-processing functions in the SpeeDi Tool
5. Predicted spatial distribution of Pseudagrion kersteni in Africa: an example of applying the logistic regression based modelling tool
    Odonata and database of African Odonata
    Pre-processing of location data and environmental geodata: setting the modelling scenario
        5.2.1. Geodata pre-processing with the SpeeDi Tool for predicting the spatial distribution of P. kersteni
        5.2.2. Logistic regression modelling for predicting the presences of P. kersteni
    Result of the modelling
        5.3.1. Intermediary output
        5.3.2. Post processing the intermediary output
    Visual assessment of output of modelling the distribution of P. kersteni
6. Sensitivity analyses of the LR-based SpeeDi Tool: a comprehensive analyses of model tuning parameters
    Sensitivity analysis of model tuning parameters for P. kersteni
        6.1.1. Elastic-net factor
        6.1.2. Initial population prevalence
        6.1.3. Soft buffer threshold for background
    Sample data and model definition/formulation
        6.2.1. Size of background samples
        6.2.2. Algorithm of random number generator for creating background samples
        6.2.3. Polynomial degree and interaction term for continuous variables
        6.2.4. Effect of sample density of presences
    Different modelling approach for predicting the distribution of P. kersteni
        6.3.1. Modelling with presences and absences derived from known distribution range
        6.3.2. Modelling with presences from watershed based range map and random background samples for all of Africa
        6.3.3. Modelling with actual field samples and absences sampled from the watershed based range map
        6.3.4. Feeding the presence-background model with auxiliary absence data
7. Effect of spatial environmental data on modelling species distribution: sensitivity analyses regarding climate and land-cover datasets, spatial extent and resolution
    Bioclimatic data and its influence on modelling the prediction of Pseudagrion kersteni
        7.1.1. Using bioclimatic variables related to precipitation and temperature
        7.1.2. Using six selected bioclimatic variables related to precipitation and temperature based on ecological relevance for P. kersteni
    Supplementing ‘selected six bioclimatic variables’ with x-y coordinates for predicting the distribution range of P. kersteni
    Using monthly temperature and precipitation data as main climate variables for predicting the distribution of P. kersteni
    Role of land-cover data in modelling P. kersteni – effect of classification schemes
    Predicting the past and the future distribution of P. kersteni with scenarios for land-cover and climate
        7.5.1. Developing the land-cover scenario for the year 1940
        7.5.2. Developing the land-cover scenario for the year 2050
        7.5.3. Comparison of past and future land-cover scenarios with current situation
        7.5.4. Predicting the distribution of P. kersteni with land-cover and climate scenarios for the year 1940 and 2050
    Role of modelling extent and spatial resolution of geodata in predicting the distribution of A. minuscula
8. Discussion and outlook
    Predicting the spatial distribution of Pseudagrion kersteni by means of the SpeeDi Tool and sensitivity of modelling parameters
        8.1.1. Predicted distribution of P. kersteni and effect of samples
        8.1.2. Sensitivity regarding regression control parameters for modelling of P. kersteni
    Role of environmental data in the prediction of species distribution with the SpeeDi Tool
        8.2.1. Climate data and geographic trend surface for predicting the distribution of P. kersteni
        8.2.2. Effects of land-cover and climate datasets on the predicted distribution range of P. kersteni
        8.2.3. Effect of scale and modelling extent on predicting the distribution range of A. minuscula
    Ranking of different parameters for predicting the distribution of P. kersteni using the SpeeDi Tool
    Suitability of the SpeeDi Tool for modelling spatial distribution of African Odonata
    The SpeeDi Tool from a user’s perspective for modelling species distribution
9. Extended summary
Zusammenfassung
References
Appendices


List of Figures

Figure 2-1: Fitting the probability (logistic regression) of true- and pseudo-absence data shown with one predictor variable ‘Elevation’. The probability of absence data (a and b: true; c and d: pseudo) is zero and this value is used throughout the iteration process. Although pseudo-absence data are likely to contain a mixture of true absences (prob. = 0, circles) and non-absences (i.e. prob. > 0, plus signs; their actual values indicated by squares), a probability value of zero is assumed (idea taken from Ward et al., 2009, figure 3)
Figure 2-2: Fitting the probability (logistic regression) of background data (z = 0) in an iterative process shown with one predictor variable ‘Elevation’. z = 1 represents presence samples; z = 0 represents background samples; z = 0, y = 1 represents samples at favourable environment (presences); and z = 0, y = 0 represents samples at unfavourable environments (absences). At each iteration step, the value of y changes for z = 0
Figure 2-3: Loose GUI coupling of GIS and statistical SDM involving data converter and bridged via GUI (adapted from Jankowski (1995), Karimi and Houston (1996), and Brandmeyer and Karimi (2000))
Figure 2-4: Tight coupling of GIS and Statistical SDM with the APIs as core component in the centre of the different systems, database and the GUI (adapted from Jankowski (1995))
Figure 2-5: Integrated coupling of GIS and SDM with systems and database interacting as a single unit (adapted from Brandmeyer and Karimi (2000))
Figure 4-1: Conceptual work flow for modelling species distribution in SpeeDi Tool with three steps: pre-processing, modelling and post-processing
Figure 4-2: Different components in the main GUI of the SpeeDi Tool
Figure 4-3: Common layout of dialog-boxes (top left) for pre- and post-processing functions in SpeeDi Tool; an example of dialog-box for running local function (top right) and displaying the help associated with the function when the ‘Help’ button is clicked (bottom)
Figure 4-4: Setting default preferences in the tool, accessible via menubar; left: for logistic regression modelling most of them are related to the output graphs, and right: for modelling task related mainly to spatial properties
Figure 4-5: Profiles of SBT adjustment for background samples for assumed pop. prev. (pi) = 0.3. The x-axis represents the original value and the adjusted value is shown on the y-axis. The legend ‘y_0.n’ represents the average probability value of all samples
Figure 5-1: Photo of Pseudagrion kersteni
Figure 5-2: Distribution of Pseudagrion kersteni sample locations (ODA: Kipping et al., 2009) and the distribution range (Clausnitzer et al., 2012)
Figure 5-3: Illustration of the distance dataset of hydrographical features; top-left: how the pixel values are calculated for linear features based on the cell-distance; top-right: distance raster for rivers; bottom-left: distance raster for areal features (lakes, ponds, wetlands); bottom-right: minimum distance value raster combined from linear and areal datasets
Figure 5-4: ROC-curve (left) and ‘sensitivity and specificity’ graphs (right) of the distribution model for P. kersteni
Figure 5-5: Individual logistic response of the 5 most contributing environmental variables in the model predicting the presence of P. kersteni (note: the base reference in y-axis in figure ‘a’ is 0.5 and not 0)
Figure 5-6: Probability of environmental suitability for occurrence of P. kersteni across Africa (left) and predicted range classified into presence, probable presence and absence (right), modelled using SpeeDi Tool. The black line represents the 5 different regions aggregated from countries’ boundaries
Figure 5-7: Distributional range of P. kersteni as predicted by IUCN redlist assessment in 2010 (Clausnitzer et al., 2010)
Figure 5-8: Land cover classes of the continent of Africa (Mayaux et al., 2004) as used for the modelling of P. kersteni; coded values are enclosed in brackets
Figure 6-1: Effect of elastic-net factor on the final numbers of variables entered into the model with least regularisation (10th), and total number of iterations required for fitting each path models consisting of 10 models
Figure 6-2: Effect of initial population prevalence (π) on the probability value, calculated for number of background samples = 5000, and number of presence samples = 496
Figure 6-3: Sensitivity and specificity curves of different background samples for the modelled probability distribution of P. kersteni, the suffixed numbers represent the number of background samples; the box highlights the closeness of range in threshold for obtaining maximum accuracy for optimal binary classification considering sensitivity and specificity values
Figure 6-4: Predicted distribution of P. kersteni with five different combinations of interaction terms: a) linear and product terms, b) linear and quadratic terms, c) linear, quadratic and cubic terms, d) linear, quadratic and product terms, and e) linear, quadratic, cubic and product terms (same as Figure 5-6 right)
Figure 6-5: Response curves of elevation (upper five) and maximum temperature of August (lower five) for the predicted distribution of P. kersteni with five different combinations of interaction terms: a) linear and product terms, b) linear and quadratic terms, c) linear, quadratic and cubic terms, d) linear, quadratic and product terms, and e) linear, quadratic, cubic and product terms
Figure 6-6: Histogram of elevation (m), precipitation (mm) in March and minimum temperature (°Celsius times 10) in February inferred from the predicted presence range of P. kersteni
Figure 6-7: Models using weights for presence samples density, samples with weight of 1 throughout (top left), manually assigned and adjusted weights per cluster (top right, same as Figure 5-6 right), and weights based on the global average distance to other samples (bottom left)
Figure 6-8: Modelling with presence and absence samples generated from the assumed watershed based range of P. kersteni (Clausnitzer et al., 2012); left: light green areas as presence range and light blue as absence range; randomly generated presence samples used for training (red dots) and evaluation (green dots) of the model; right: predicted binary classified (presence-absence) range of P. kersteni
Figure 6-9: Model output using different approaches, using randomly generated presences from the watershed based range map (Clausnitzer et al., 2012) and background samples from all over Africa (left), and using field collected presence samples and randomly generated absence samples from absence regions of watershed based range map (right)
Figure 6-10: Example of improving the model predictions, prediction result with field collected presence samples with background samples (left, same as Figure 5-6 right) and improved prediction reducing the false presences at the Okavango Delta (see circle) by using auxiliary information of knowledge based absences locations at the Okavango Delta
Figure 7-1: Modelled distribution range of P. kersteni using different climatic data: a) with the 19 bioclimatic variables (left, chapter 7.1.1), and b) with 6 selected bioclimatic variables (right, chapter 7.1.2)
Figure 7-2: Predicted distribution of P. kersteni based on climate data as 6 selected bioclimatic variables supplemented by x-y geographic coordinates
Figure 7-3: Response curves of five of the predictor variables (elevation and four temperature related bioclim variables) for the predicted distribution of P. kersteni based on climate data of six selected bioclimatic variables and with additional x-y geographic coordinates
Figure 7-4: Predicted distribution range of P. kersteni based on datasets using different land-cover classification schemes: GLC2000 with FAO scheme (upper left, same as Figure 5-6 right) and GLCC with USGS modified level 2 scheme (upper right); and differing classes in the belt from west to east Africa in the two datasets, GLC2000 (lower-left) and GLCC-USGS (lower-right)
Figure 7-5: Land-cover scenario hindcast for the year 1940 for modelling the past distribution of Odonata species
Figure 7-6: Land-cover scenario projected for the year 2050 for modelling the future distribution of Odonata species
Figure 7-7: Area and proportions covered by different land-cover classes and their scenarios for Africa in three time steps; 1940, 2000 and 2050 (colours and land-cover classes match those of maps in Figures 5-8, 7-5 and 7-6)
Figure 7-8: Modelled distribution range of P. kersteni for the year 1940 (upper left), 2000 (upper right), 2050 (lower left), and change in distribution range (lower right). The model is trained with the environmental data from 2000 (see Table 5-1, except minimum NDVI) and projected for the scenarios of land-cover and climate in the years 1940 and 2050
Figure 7-9: Photo of Aeshna minuscula
Figure 7-10: A. minuscula sample locations (ODA: Kipping et al., 2009) and expected presence range based on watersheds (Clausnitzer et al., 2012)
Figure 7-11: Predicted presence range for A. minuscula based on backgrounds sampled over the entire continent and different spatial resolutions of environmental geodatasets, 1 km (top, left) and 8 km (top, right) and the difference in predicted range in southern Africa due to change in resolution (bottom)
Figure 8-1: P. kersteni samples (ODA: Kipping et al., 2009) measured as kernel-density within 80 km radius; also indicated is the expected habitat range corresponding to the watershed based distribution map (Clausnitzer et al., 2012)
Figure 8-2: Distribution of sample location records in ODA, predicted range (see Figure 5-6, right) and expected range based on watershed range of P. kersteni over different land-cover classes; the colours and numbers of the land-cover classes correspond to those of Figure 5-8

List of Tables

Table 2-1: Three different categories of species distribution models and their basic characteristics primarily based on Guisan and Zimmermann (2000)
Table 2-2: Some of the approaches used in studies for generating pseudo-absences for presence-absence models
Table 3-1: Some examples of SDM users, contexts of modelling and the typical consequences of over- and under-prediction
Table 3-2: Basic functionality matrix overview of selected SDM tools
Table 3-3: Methods to improve reliability regarding noise, over-fitting and correlated variables in selected SDM tools
Table 3-4: User-centred features of some selected SDM tools
Table 4-1: Assumed user profile for using the SpeeDi Tool
Table 4-2: Functions offered by the SpeeDi Tool for species distribution modelling
Table 5-1: List of climatic and environmental geodata used for modelling of P. kersteni and their sources
Table 5-2: Total and unique sample locations of P. kersteni in the ODA
Table 5-3: Five most contributing variables of the model for predicting the distribution of P. kersteni
Table 6-1: Effect of elastic-net factor on model (1) performance measured with AUC for binary classification of distribution of P. kersteni
Table 6-2: Comparative values of model performance regarding the initial population prevalence for modelling the distribution of P. kersteni
Table 6-3: Initial population prevalence of background samples and the correlation matrix of the probability values for the distribution of P. kersteni
Table 6-4: Comparative values of model performance regarding the initial population prevalence and applying ‘soft-buffer-threshold’ for modelling the distribution of P. kersteni
Table 6-5: Initial population prevalence of background samples and the correlation matrix of the probability values calculated applying ‘soft-buffer-threshold’ for the distribution of P. kersteni
Table 6-6: Model performance for different background sample sizes for modelling the distribution of P. kersteni
Table 6-7: Model performance for predicting the distribution of P. kersteni based on AUC value and accuracy when using different algorithm of pseudo random number generator for background data generation
Table 6-8: Correlation of probability values among different model outputs when using different algorithms for generating background samples
Table 6-9: Performance of models of different complexities measured by AUC value and accuracy for predicting the distribution of P. kersteni (see chapter 6.2.3 for explanations on model abbreviations)
Table 6-10: Ranking of environmental variables for 5 different model complexities based on the explained deviance in contributing to the calculated probability for modelling distribution of P. kersteni (see chapter 6.2.3 for explanations on model abbreviations and Table 5-1 for naming of variables)
Table 6-11: Accuracy assessment of the model performance in predicting a pre-defined arbitrary range of a species based on training and evaluation sample data
Table 7-1: Contribution of environmental variables in predicting the distribution of P. kersteni using bioclimatic data as climatic variables
Table 7-2: Contribution of environmental variables in predicting the distribution of P. kersteni using six-selected bioclimatic data as climatic variables
Table 7-3: Contribution of environmental variables in predicting the distribution of P. kersteni using six-selected bioclimatic data as climatic variables and supplemented by geographic coordinates
Table 7-4: Contribution of environmental variables in predicting the distribution of P. kersteni using monthly temperature and precipitation data as climatic variables
Table 7-5: Geodatasets used for developing scenarios for 1940 and 2050, with sources and descriptive information
Table 7-6: Different modelling extents and spatial resolutions applied for modelling A. minuscula
Table 8-1: Ranking of factors affecting predictions of species distribution based on potential impact on output of SpeeDi Tool for modelling P. kersteni
Table 8-2: The features of SpeeDi Tool in comparison to other species distribution modelling tools (similar features in other tools are italicised)


List of Boxes

Box 3-1: Basic questions to summarise a project in determining the context of use of a system (based on Maguire et al., 1998)
Box 3-2: Typical questions for development of a species distribution modelling tool regarding user profile (based on Johnson, 2010; Lewis and Reiman, 1994; Maguire et al., 1998)

List of acronyms

AIC          Akaike Information Criterion
ALFG         Additive Lagged Fibonacci Generator
API          Application Programming Interface
AUC          Area Under the Curve (of an ROC)
BIC          Bayesian Information Criterion
BRT          Boosted Regression Tree
CART         Classification And Regression Tree
CCAFS        Climate Change, Agriculture and Food Security
CGIAR        Consultative Group on International Agricultural Research
CLI          Command Line Interface
CRU          Climatic Research Unit
CRU-TS       CRU - Time Series
EM           Expectation-Maximisation
ENFA         Ecological Niche Factor Analysis
FAO          Food and Agriculture Organisation
FEWS-ADDC    Famine Early Warning System - Africa Data Dissemination Centre
GAM          Generalised Additive Model
GARP         Genetic Algorithm and Rule-set Prediction
GBIF         Global Biodiversity Information Facility
GIS          Geographic Information System
GLC2000      Global Land Cover of the year 2000
GLCC         Global Land Cover Characterisation
GLM          Generalised Linear Model
GRASP        Generalised Regression Analysis and Spatial Prediction
GUI          Graphical User Interface
HYDE         History Database of the Global Environment
IPCC         Intergovernmental Panel on Climate Change
IPP          Initial Population Prevalence
IUCN         International Union for Conservation of Nature
LASSO        Least Absolute Shrinkage and Selection Operator
LUCS         Land Use land Cover System
LUS          Land Use System
MARS         Multivariate Adaptive Regression Splines
MaxEnt       Maximum Entropy
MDT          Minimised Difference Threshold
MST          Maximised Sum Threshold
NDVI         Normalised Difference Vegetation Index
ODA          Odonata Database of Africa
P-RNG        Pseudo-Random Number Generator
ROC          Receiver Operating Characteristic
SBT          Soft Buffer Threshold
SDM          Species Distribution Modelling
SOAP         Simple Object Access Protocol
SpeeDi Tool  SPEciEs DIstribution modelling Tool
SRNG         Subtractive Random Number Generator
TOC          Table Of Contents
USGS         United States Geological Survey
XML          eXtensible Mark-up Language


1. Introduction

Background

Africa is the continent with rapidly changing landscape with changes in population structure, land-cover and climate (UNEP, 2008). It is also a continent with some of the global hotspots of biodiversity (Myers et al ., 2000). Biodiversity inventory plays an important role in identifying hotspots and helps in planning relating to conservation activities. The location database of global records of faunal species is being dominated by vertebrates: mammals, reptiles, amphibians and birds (e.g. Rodrigues et al ., 2004). Such records are lacking for invertebrates, including which are the most diverse group. In this regard, Odonata (commonly known as dragonflies and ) is among the front runner, at least, in Africa due to the Odonata Database of Africa (ODA) (Clausnitzer et al ., 2012). The database can, thus, add to conservation planning of biodiversity (e.g. Simaika et al ., 2013). The usefulness of the database is partly shown by the IUCN redlist assessment of African Odonata species where the location records are used for estimating and mapping species range at a spatial resolution provided by watershed boundaries (Clausnitzer et al ., 2012, Dijkstra et al ., 2011). Naturally, the resolution of such maps depends on the size of each watershed and, thus, the resolution is heterogeneous. The delineated ranges can, however, be used as ‘first level filters’ for assessing species conservation status as well as planning actions (Clausnitzer et al ., 2012). However, distribution ranges based on watershed also include large areas which do not represent the true habitable areas (Simaika et al ., 2013). Fine and homogeneous level of details (better than the levels offered by watersheds) of species ranges can be achieved through regularly- sized grid based species distribution modelling (SDM) methods which are generally founded on the correlation of recorded species’ location and the surrounding environment (Guisan and Zimmermann, 2000). 
As such, SDM have been used as part of a broader collection of chained tools in many ecological applications (Guisan and Thuiller, 2005). Here, the availability of spatio-temporal environmental datasets, especially climatic variables have contributed in rapid adoption of SDM (Buisson et al ., 2010). Most of the SDM tools are based and focused on statistical methods. However, a complete species distribution modelling task includes much work in processing the required spatial data. A geographic information system (GIS) is more suited for handling spatial data. The ability of a SDM tool to handle spatial data would not only facilitate a smooth workflow but also enhance usability and overall user experience. However, most of the currently available SDM tools lack even basic GIS capabilities. The wish is not only to have GIS capabilities integrated in a SDM tool but to have a user interface which is easy to handle and offers support for a smooth flow of all relevant tasks. Further, many SDM tools make use of statistical methods but a straight forward adoption of any new development in the field of statistical computation cannot often be easily adopted due to the nature of data (spatial vs. non-spatial) involved and the data requirements (presences and/or absences) of the statistical methods. The fact that a single SDM tool, which can satisfactorily predict distribution of each species, does not exist (e.g. Elith et al ., 2006; Farber and Kadmon, 2003; Guisan et al ., 1999; Guisan et al ., 2007; Terribile et al ., 2010) is one of the reasons for the development and availability of

Page| 1 1. Introduction

several SDM tools employing different methods. Having several tools to choose from can be a dilemma for modellers, but it can also be an advantage. Several studies have found that one method may be better suited to some species and other methods to other species (e.g. Brotons et al., 2004; Elith et al., 2006; Guisan et al., 2007). Thus, a pool of different methods, as well as similar methods applying different techniques (e.g. regularisation and/or iterative methods for fitting regression coefficients), offers a wider choice with which better results can be achieved in a more straightforward way (Araújo and Guisan, 2006).

The thesis’ aims and limits

As the title suggests, the main aim of the thesis is to develop a new modelling tool for predicting the distribution of African dragonflies (Odonata). This includes the conceptualisation and implementation of a method based on binary logistic regression. As the Odonata database holds only records of presence locations, applying the concept of expectation-maximisation is to be considered. The thesis also aims to introduce the recent development of elastic-net regularisation in regression to the presence-only modelling scenario. It further aims to offer insights into the effects of different model parameters on predicted species distributions through comprehensive sensitivity analyses. Investigating the level of influence of environmental geodatasets is another aim. The sensitivity analyses of model parameters and of different environmental geodatasets will be useful when later modelling a large pool of Odonata species. As GIS functionality is important, both GIS and statistical modelling should be available within the tool. Finally, the tool should make GIS tasks easy to handle for so-called 'non-GIS experts'.

Therefore, the tasks of the thesis are summarised as:

• Collect, analyse and process (harmonise) geodata for modelling the spatial distribution of African dragonfly species
• Develop and implement a concept for binary logistic regression modelling using presence-only data
• Incorporate recent developments (e.g. elastic-net regularisation) in regression based modelling
• Use a coupled mechanism for integrating logistic regression and GIS in a harmonised graphical user interface
• Offer a smooth workflow regarding basic SDM tasks and required functionalities
• Make use of the ODA to demonstrate the ability of the new tool in modelling species distribution
• Perform sensitivity analyses of model (regression) parameters
• Perform sensitivity analyses of environmental geodata
• Simulate the historical and future distribution of Kersten’s Sprite (Pseudagrion kersteni), a widespread Odonata species in sub-Saharan Africa, using past and future climate and land-cover scenarios

Although the tasks are thus defined, several options exist for achieving some of them. ArcGIS Engine is selected as the main GIS engine, and the ArcObjects application programming interface (API) is used as the GIS API. The Microsoft .NET Framework is selected as the programming environment. However, the thesis will neither list other options nor elaborate on the advantages and disadvantages of the


selected option. Freely available environmental geodata form the basis for the environmental layers required in predicting the species’ distribution range. The ODA is the primary source of the dragonflies’ location data. The thesis neither attempts to identify outliers in the Odonata database nor tries to rectify errors in the coordinate information, so any such errors and outliers are ignored. Another limitation is that the thesis presents the distribution of only two species and focuses instead on sensitivity analyses. The information obtained from these analyses will, however, be useful in modelling the distributions of other species. Finally, no attempt has been made to interpret the results ecologically.

Outline of the thesis

Chapter 2 looks into different aspects of species distribution models. It provides an overview on different modelling methods and techniques used in predicting spatial distribution of species. Further, the elaborated characteristics and assumptions of SDM offer insights on the limitations of species distribution models. Since GIS is an integral part in the overall SDM process, the chapter also looks into different ways of incorporating GIS and statistical modelling for SDM. Chapter 3 on the quality-in-use investigates the users’ need while using an SDM tool. It, therefore, looks into the context of using such an SDM tool. The context of use is relevant for different quality aspects such as functionality, reliability, usability which are essential from a user’s perspective. It also reviews the processes necessary for incorporating users’ requirements including the experiential aspect.

The development of the new tool is described in chapter 4, which builds on the reviews of chapters 2 and 3. It presents the concept of the working mechanism and handling of the new tool. Moreover, the basic requirements of a complete SDM task are analysed and their implementation is described. Furthermore, the chapter provides the concept of logistic regression modelling as implemented in the tool, and lists the functions of the tool that have been selected as necessary for predicting species distribution ranges. The new tool’s abilities are demonstrated in chapter 5. Here, P. kersteni is selected as the target species to be modelled. The chapter lists the environmental geodatasets, explains the steps performed in predicting the distribution range of P. kersteni in Africa with the new tool, including the processing of the geodatasets, and illustrates and explains the various outputs. It also covers the visual assessment of the predicted distribution range of P. kersteni.

Adopting the results of a new tool for predicting species distributions requires confidence, and one way of gaining it is to perform sensitivity analyses. Chapter 6 focuses on the sensitivity of various parameters related to the model. These analyses show the effects on the output (prediction) of model parameters such as the elastic-net factor, the population prevalence in background samples, the number of background samples, the ‘soft-buffer-threshold’, the distribution pattern of background samples and the sampling bias of presence locations, which can provide clues for choosing values that give the best results. The second part of the sensitivity analyses, chapter 7, looks into the role and effects of environmental geodata. Climate-related projections of past and future distributions are among the most sought-after applications of species distribution modelling, so the sensitivity to different types of climate datasets and different schemes of land-cover datasets is the focus when modelling the


distribution range of P. kersteni in Africa. Other aspects are geographical extent (or landscape) and the spatial resolution of the datasets.

Several aspects of modelling spatial distributions with the new tool, mainly for P. kersteni, are discussed in chapter 8. The discussion starts with the impacts of the characteristics of the sample data, followed by the influences of the regression parameters and the effects of the environmental datasets. The chapter also discusses the new and special features the tool offers, including in the context of usability, and compares the new tool with other SDM tools. Finally, chapter 9 provides an overview of the thesis and summarises its findings. As any work can be improved, some tasks that could be taken up from here are presented as an outlook.

Page| 4 2. Background on species distribution modelling

2. Background on species distribution modelling

Three types of species distribution models

A predicted spatial distribution of the occurrence of a species, based on statistical techniques and GIS technology, is termed a species distribution model (Guisan and Zimmermann, 2000). The data used in these spatially explicit models are the location records of presences, or of presences and absences, of species and different environment-related measurements (e.g. bioclimatic variables) (Guisan and Thuiller, 2005). The variables chosen are normally those which are ecologically meaningful and relate to the species niche (Hirzel et al., 2006). Guisan and Zimmermann (2000) classified species distribution models into three general classes, analytical, mechanistic and empirical models (see Table 2-1), based on the three characteristics of 'generality', 'reality' and 'precision'. It has to be noted, however, that it is often difficult to differentiate exactly between the three classes, especially between analytical and mechanistic models (ibid.). The analytical method is designed for predicting the response of the species-environment relationship within given conditions. Thus, the properties of precision and generality (of the response curves) are in focus. The mechanistic method is primarily based on fundamental physiological processes and offers reality and generality. The performance of such a model is measured by “the theoretical correctness of the predicted response” (Guisan and Zimmermann, 2000 p150). Key to formulating such a model is primary knowledge of the interactions among process variables (i.e. behaviours), which define the model structure (Guisan and Thuiller, 2005). Hence, the data requirements for this type of model relate to the processes and their interactions. Empirical models are statistically derived models which focus on precision and reality. They are based on observing environmental covariates at the species presence (and absence) locations and on formulating the empirical relationship between the species and the covariates (e.g. regression based methods). These models are not necessarily expected to describe the ecological functions, mechanisms and causative relationships between the model parameters and the predicted responses (Guisan and Zimmermann, 2000). The choice among the three methods is determined by the desired properties, data availability, the complexity of the model, and the spatial scale and extent. It also depends on the purpose of modelling, such as understanding ecological dynamics, interactions among communities, or relationships with the biophysical environment (Austin, 2007; Guisan and Zimmermann, 2000). Empirical or statistical models are commonly used as they are relatively easy to formulate compared with establishing the causal relationships required by the analytical and mechanistic methods (Hartley et al., 2010). Using geocoding techniques, large numbers of natural history and herbarium collections have been, and are being, georeferenced (Yesson and Culham, 2006) (e.g. GBIF 1). The georeferenced species data have facilitated the predictive modelling of species distributions, and several empirical modelling methods have been developed (Elith et al., 2006; Guisan and Thuiller, 2005). Although statistical models are easy to formulate, the combinatorial use of an increasing number of potential predictor variables (see chapter 2.6) has been pushing the

1 http://data.gbif.org/welcome.htm (17-Apr-2012)


limits on computation due to the complexity as well as the iterative nature of the numerical algorithms. Further, the inclusion of interaction terms increases the number of predictor variables exponentially, especially in GLM and its derivatives (Guisan and Thuiller, 2005), and thus increases the computational load. However, the development of new modelling methods is also helped by increasing computing power (Buisson et al., 2010; Jeffers, 1999), which makes iterative and complex numerical algorithms easier to handle.

Table 2-1: Three different categories of species distribution models and their basic characteristics, primarily based on Guisan and Zimmermann (2000)

Characteristic | Analytical | Mechanistic | Empirical
Focus on | precision & generality | reality & generality | precision & reality
Basis | theoretical (mathematical) | physiological process | phenomenological
Nature | dynamic, easily transferable from one landscape to other | dynamic, easily transferable from one landscape to other | static; not (or not easily) transferable from one landscape to other (often based on indirect gradients)
Purpose | predict response in simplified reality (e.g. logistic growth equation) | describe cause and effects from response (e.g. competition, dispersal) | describe the state (e.g. present/past/future distribution)
Model formulation | medium to difficult, based on simplification | difficult | easier than mechanistic
Ecological interpretation | easy because of being directly based on theory | fairly easy, based on ecological processes | may not conform to ecological theory, based on stochastic events

Uses of species distribution modelling

Species distribution modelling (SDM) has a wide range of uses. Locating hotspots for conservation, designing cost-effective surveys, predicting the impacts of environmental change, predicting potential species invasions and studying the relationship between phylogeny and climate are common examples, which are briefly reviewed below.

Determining biodiversity hotspots and reserve planning – Araújo et al. (2002) applied species distribution models to 78 breeding passerine bird species in Great Britain, using the recorded locations of presence and extinction, in order to prioritise conservation areas. They reported a negative correlation between the local probability of extinction and the probability of occurrence. They argue that areas with a high probability of occurrence are suitable for reserve selection, as the probability of retaining the species there in future is high, and showed that selecting such areas leads to a reduced rate of local extinction of the species. García (2006) used SDM of 301 species to derive the patterns of herpetofauna biodiversity in Mexico. By mapping the biodiversity, the author found several hotspots of species richness. As a large proportion of the mapped species are endemic and endangered and form the biodiversity hotspots, the author concludes that several areas should be prioritised for conservation.


Design surveys and/or facilitate (re-)discovery of species – Some species are rare and difficult to find. The data for such species can also be used for effective conservation planning. Here, species distribution models, even with limited data, can facilitate quick field sampling (Le Lay et al., 2010). Based on a predictive distribution map, Engler et al. (2004) reported the discovery of four new sites of the highly endangered species Eryngium alpinum in Switzerland. Guisan et al. (2006) also used SDM iteratively in a stratified sampling design for the rare species E. alpinum. With the improved records of several newly discovered populations, the authors suggested that the Swiss Red list status of the species could be downgraded from vulnerable to endangered. The process not only helped in discovering new populations but also offered an economical and time-efficient way of assessing the species’ threat and conservation status. A model-based sampling strategy was applied by Le Lay et al. (2010) to three rare plant species (Cypripedium calceolus, Eryngium alpinum and Scorzonera laciniata) and five common plant species (Anthyllis vulneraria, Astrantia major, Briza media, Heracleum sphondylium, Pulsatilla alpina) in the western Swiss Alps. Here too, the authors concluded that the model played an important role in increasing knowledge of the distribution of the rare species and led to the discovery of new sites.

Impacts of environmental change on biodiversity – Statistical SDM are commonly applied to predict the current distribution of species and to project it into the future via climate scenarios. Projecting into the future reveals range shifts and losses or gains in suitable habitat (Thuiller, 2003). This can help, for example, in designing migration corridors or in formulating better conservation strategies (Thuiller et al., 2008; McClean et al., 2005). But are static models able to predict future species distributions? Hijmans and Graham (2006) compared the outputs of static models with those of a mechanistic model to evaluate the ability of static models to predict the shifts, shrinkage and expansion of distribution ranges induced by climate change. Modelling 100 plant species, they concluded that static models are indeed able to predict the change, but they advise some caution and suggest approaches which can improve the predictions further.

Species invasion – Competition occurs naturally, both among individuals of the same species (intra-specific) and between different species (inter-specific). The dominance of non-indigenous species over native (indigenous) species is termed biological invasion (Lonsdale, 2002). Invasive species may have severe effects on the ecosystem and are thus a risk to biodiversity, especially to endangered species. In economic terms, the impact can be positive or negative, depending on the way (intentional or accidental), purpose and management (control) of the introduction (Wonham, 2006). Although several studies have examined the nature and characteristics of invasions and a few key factors have been discovered, these findings are not sufficient to predict whether the introduction of a new species will lead to invasion (Peterson, 2003; Wonham, 2006). One preliminary option for determining the potential range of invasion is to find out whether the environment in the landscape is habitable for the new species, and species distribution modelling offers a way of assessing this habitat suitability. Peterson (2003) discussed the use of a species’ potential distribution for predicting the potential of invasion by an introduced species. The application was demonstrated by modelling the aquatic plant Hydrilla verticillata, a species native to Asia-Pacific and invasive in North America, based on the environmental niche in the species’ native landscape and projecting these ecological characteristics (or parameters) onto the new geographic space. The predicted invasion was then compared with data collected in the North American region.


Phylogenetic and phyloclimatic relationships – One use of species distribution models based on climate variables is the study of biogeography, either to test a hypothesis or to understand biogeographic processes. Biogeographic studies explain why a species occurs at certain geographic locations, what the limiting factors are (e.g. climate, altitude), how it migrates over time, and so on (Cox and Moore, 2005). Phylogenetic hypotheses in combination with geographic range maps are commonly used to suggest theories of speciation. To identify important speciation mechanisms in dendrobatid frogs in South America, Graham et al. (2004) used phylogenetic information and environmental niche models. They were able to show the lineages of speciation in environmental space and thus added knowledge regarding phylogeny and species distribution. Yesson and Culham (2006) showed for the plant genus Drosera that bioclimatic models reveal phylogenetic patterns. They pointed out that although the central regions of Australia presently have a suitable climate for D. macrantha, the species is not observed there because the past climate (palaeoclimate) was not suitable and thus acted as a historical limiting factor. Such studies can help to explain why a species is not observed in areas with presently suitable environmental conditions and thus contribute to understanding the historical biogeography of the species.

Empirical methods for species distribution modelling

Several studies have compared the strength of predicting species distributions with different statistical methods, and a few of these studies have included several tools (e.g. Elith et al., 2006; Guisan et al., 2007; Wisz et al., 2008). The methods include profile based models such as the hyper-rectilinear envelope (BIOCLIM Model, Busby, 1991), multi-dimensional convex sub-envelopes (HABITAT, Walker and Cocks, 1991), the distance based envelope (DOMAIN, Carpenter et al., 1993) and the hyper-ellipsoid envelope (ENFA-Biomapper, Hirzel et al., 2002b); parametric linear models (GLM, Hosmer and Lemeshow, 2000); non-parametric additive linear models (e.g. GAM and MARS, Hastie et al., 2009); regression trees (e.g. CART and BRT, ibid.); and several other machine learning methods such as neural networks (ibid.), GARP (Stockwell and Peters, 1999) and MaxEnt (Phillips et al., 2006). The choice of method also depends on the type of species data available, i.e. presence-absence or presence-only. Some methods are briefly discussed below 2.

Profile based – The bioclimatic envelope model (BIOCLIM, Busby, 1991) was one of the first methods to support modelling with presence-only data. A bioclimatic profile of the climate variables is created within a rectilinear envelope. The suitability of the climatic habitat is then classified into the categories ‘suitable’, ‘marginal’ and ‘unsuitable’. Walker and Cocks (1991) introduced the HABITAT tool to enhance the BIOCLIM concept by using a simple classification and regression tree (CART) procedure to form sub-envelopes of the original rectilinear envelope. The sub-envelopes reduced the prediction error by forming a compact environmental envelope, i.e. condensed envelopes (ibid.). However, methods based on rectilinear envelopes (and sub-envelopes) impose some constraints. The distance based DOMAIN modelling method was developed to reduce the problems caused by the tightly constrained convex envelopes of the HABITAT procedure and to provide further options for modellers. It uses the environmental distance (a point-to-point similarity function) to predict species occurrence (Carpenter et al., 1993). All three envelope based methods, BIOCLIM, HABITAT and DOMAIN, were first developed at CSIRO, Australia.
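The rectilinear envelope idea can be illustrated with a minimal sketch. It is not the BIOCLIM implementation: the choice of the 5th and 95th percentiles as the core envelope, the function names and the presence records are all assumptions made for illustration. A cell is taken as 'suitable' if every variable lies within the core range of the presence values, 'marginal' if every variable lies within the observed minimum and maximum, and 'unsuitable' otherwise.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile of an already sorted list (q in 0..100)."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def fit_envelope(presences, core=(5, 95)):
    """Per-variable limits (min, core_low, core_high, max) from presence rows.
    The core percentiles are an assumed setting, not taken from the thesis."""
    limits = []
    for j in range(len(presences[0])):
        column = sorted(row[j] for row in presences)
        limits.append((column[0], percentile(column, core[0]),
                       percentile(column, core[1]), column[-1]))
    return limits


def classify(cell, limits):
    """Return 'suitable', 'marginal' or 'unsuitable' for one grid cell."""
    inside_core = True
    for value, (low, core_low, core_high, high) in zip(cell, limits):
        if value < low or value > high:
            return "unsuitable"          # outside the observed range
        if value < core_low or value > core_high:
            inside_core = False          # inside the range, outside the core
    return "suitable" if inside_core else "marginal"


# Hypothetical presence records: (annual mean temperature, annual rainfall in mm)
presences = [(18.0, 900), (19.5, 1100), (21.0, 1000), (22.5, 1300), (24.0, 1200),
             (20.0, 950), (23.0, 1250), (21.5, 1050), (19.0, 1150), (22.0, 1000)]
limits = fit_envelope(presences)

for cell in [(21.0, 1100), (18.1, 1000), (30.0, 1000)]:
    print(cell, classify(cell, limits))
```

Because each variable is tested independently, the envelope is a box in environmental space, which is the constraint that the sub-envelope and distance based methods try to relax.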

2 arranged based on nature of species sample data requirements


The ecological niche factor analysis (ENFA) method is implemented in the Biomapper software and uses an elliptical envelope formed by factorising the variables into principal components. The first component represents the species’ marginality (difference in a measure of central tendency, e.g. mean or median) and the second component represents the specialisation (comparison of variances) (Hirzel et al., 2002b). The method thus combines an envelope in elliptic form (created by factor analysis) with distances (mean or median) in the space of the environmental variables. One drawback of the envelope based BIOCLIM, DOMAIN and ENFA methods is that none of them can use categorical data as input. Further, the ENFA method assumes normality of the predictor variables, which may not hold in many cases (Engler et al., 2004), although transformations such as Box-Cox have been suggested as a workaround (Hirzel et al., 2002b; Wisz and Guisan, 2009).
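The two ENFA quantities can be illustrated for a single variable. The sketch below is a univariate simplification and not the factor analysis performed by Biomapper: it assumes marginality as the absolute difference between the global mean and the species mean, scaled by 1.96 global standard deviations, and specialisation as the ratio of the global to the species standard deviation. These formulas, the function names and the elevation values are assumptions made for illustration.

```python
import statistics


def marginality(global_values, species_values):
    """Distance between the species mean and the global mean,
    scaled by 1.96 global standard deviations (assumed univariate form)."""
    mean_global = statistics.mean(global_values)
    mean_species = statistics.mean(species_values)
    return abs(mean_global - mean_species) / (1.96 * statistics.stdev(global_values))


def specialisation(global_values, species_values):
    """Ratio of global to species standard deviation; values above 1
    indicate that the species uses a narrower range than is available."""
    return statistics.stdev(global_values) / statistics.stdev(species_values)


# Hypothetical elevations (m): all cells of the study area vs. presence cells
global_elevation = [0, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
species_elevation = [1400, 1500, 1600, 1700, 1800]

m = marginality(global_elevation, species_elevation)
s = specialisation(global_elevation, species_elevation)
print(round(m, 3), round(s, 3))
```

Both quantities rely on means and standard deviations, which is one way to see why strongly non-normal predictors are a concern for the method.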

Linear models and derivatives – The generalised linear model (GLM) is a family of parametric linear models in which a model is created to find the systematic effects in a set of data (McCullagh and Nelder, 1989). GLM based models can use both continuous and categorical data as input, with the categorical data transformed into ‘dummy’ variables. Binary logistic regression has been a popular choice for calculating probabilities because it confines the result to the range between 0 and 1 (Hosmer and Lemeshow, 2000). However, the need for absence data has troubled many modellers. When absence data are lacking, presence-absence models use randomly generated pseudo-absence data (Wisz and Guisan, 2009). The generalised additive model (GAM) is a non-parametric extension of GLM. The model is fitted by applying smoothing functions for non-linear relationships, often with a scatterplot smoother (Hastie et al., 2009). One drawback of GAM based methods is that they are not designed for large datasets because they are computationally expensive (Hastie et al., 2009). Nevertheless, GAM based models are successfully used in SDM. GRASP (Lehmann et al., 2002) is such a GAM based tool used in the R and S-PLUS environments. For a small region, GRASP can be applied directly, but the use of a look-up table is suggested for modelling with a larger number of cells (ibid.). With the inclusion of the BRUTO procedure, the computational efficiency of the model fitting process in GAM is increased several-fold (Leathwick et al., 2006). Multivariate adaptive regression splines (MARS) is an adaptive regression method which fits the response with piece-wise linear functions and includes a recursive partitioning (tree) method to improve the regression fit (Friedman, 1991). Since GAM and MARS use logistic regression for species distribution modelling, these methods also require absence data, and pseudo-absences are commonly used.
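How a binary logistic regression combines a continuous covariate with a dummy-coded categorical covariate can be sketched as follows. This is a minimal illustration only: the records, the land-cover classes, the helper names and the use of plain gradient ascent (rather than the iteratively reweighted least squares usually found in GLM software) are assumptions of the sketch.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function; the output lies between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dummy_code(category, levels):
    """One 0/1 column per level except the first, which is the reference level."""
    return [1.0 if category == level else 0.0 for level in levels[1:]]


def design_row(elevation_km, land_cover, levels):
    """Intercept, one continuous covariate and the dummy-coded land-cover class."""
    return [1.0, elevation_km] + dummy_code(land_cover, levels)


def fit_logistic(rows, labels, steps=5000, rate=0.5):
    """Maximise the Bernoulli log-likelihood by plain gradient ascent."""
    weights = [0.0] * len(rows[0])
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for row, y in zip(rows, labels):
            error = y - sigmoid(sum(w * x for w, x in zip(weights, row)))
            for j, x in enumerate(row):
                gradient[j] += error * x
        weights = [w + rate * g / len(rows) for w, g in zip(weights, gradient)]
    return weights


def predict(weights, row):
    return sigmoid(sum(w * x for w, x in zip(weights, row)))


# Hypothetical records: (elevation in km, land-cover class, 1 = presence, 0 = pseudo-absence)
levels = ["forest", "grassland", "cropland"]
records = [(0.4, "forest", 1), (0.6, "forest", 1), (0.5, "grassland", 1),
           (0.8, "forest", 1), (1.6, "forest", 1), (0.7, "cropland", 1),
           (1.5, "cropland", 0), (1.8, "grassland", 0), (2.0, "cropland", 0),
           (1.2, "grassland", 0), (0.5, "cropland", 0), (1.9, "forest", 0)]
rows = [design_row(elev, cover, levels) for elev, cover, _ in records]
labels = [label for _, _, label in records]
weights = fit_logistic(rows, labels)

p_low_forest = predict(weights, design_row(0.5, "forest", levels))
p_high_cropland = predict(weights, design_row(2.0, "cropland", levels))
print(round(p_low_forest, 3), round(p_high_cropland, 3))
```

The three land-cover classes become two dummy columns, with 'forest' acting as the reference level absorbed by the intercept.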

Other methods – The genetic algorithm for rule-set prediction (GARP) is based on a genetic algorithm and a few rules for predicting species occurrence. The rules include the bioclimatic profile rules, a simple logistic regression rule, an atomic rule and a GARP rule 3 (Stockwell and Peters, 1999). GARP uses presence and (pseudo-)absence data. It was a widely used method for studies with presence-only data at the time MaxEnt was developed (Phillips et al., 2006). MaxEnt is a parametric machine learning method which predicts the species distribution by maximum entropy estimation, i.e. it seeks the distribution that is closest to uniform subject to the constraint that the expected value of each predictor variable is close to its empirical average (Phillips et al., 2006). The maximum entropy method is equivalent to minimising the negative log-likelihood in logistic regression but differs in how it is formulated. MaxEnt is formulated from the ‘Bayesian’ perspective while logistic regression is

3 Refer to Stockwell and Peters (1999) for details on how the rules are defined


formulated from the ‘Frequentist’ perspective (He, 2010). Further, in MaxEnt the likelihood is used for performance measurement and not for parameter estimation (Dudik, 2007). Thus, the MaxEnt method is regarded as generative whereas logistic regression is a discriminative method (Phillips et al., 2006). The MaxEnt method for species distribution modelling uses presence and background data (see chapter 2.4 for background data), and hence absence data are not necessary. If background samples are not provided, the MaxEnt tool generates them before modelling. Boosting is a machine learning method for improving the fit of a model. The boosted regression tree (BRT) is a gradient boosted model which uses several regression trees; the final model is calculated from these trees in a weighted fashion (Hastie et al., 2009). The gbm 4 package of R requires presence and (pseudo-)absence data. For species distribution modelling, a special package (ecogbm 5) is derived from the original gbm package, and it requires presence and background data.

Use of presence, absence and background data in species distribution modelling

Species sample data requirements vary with the modelling method. Since the introduction of background samples, the literature (e.g. Elith et al., 2006; Elith and Leathwick, 2009; Phillips et al., 2009; Ward et al., 2009) has used the term ‘presence-only’ modelling for both types of methods: a) those requiring only presence samples (e.g. BIOCLIM model, ENFA), and b) those requiring presence and background samples (e.g. MaxEnt, ecogbm). In this thesis, the term ‘presence-only’ refers to methods requiring only presence samples and not to those requiring ‘presence-background’ samples. The uses of the three types of sample data are briefly described here, together with an overview of a method for measuring the performance of models using these data.

Presences – Presence samples are the locations where species occurrences have been recorded. Profile (envelope) based modelling needs only presence locations and thus offers a good opportunity to build models from georeferenced museum records. However, there have been concerns about the predicted areas, as many non-habitable areas are predicted as suitable (Elith and Burgman, 2002). Although the ENFA approach improved the envelope technique considerably, presence-absence models such as GLM remain the suggested method where possible (Hirzel et al., 2002b).

True-absences and pseudo-absences – Absence samples are the locations where the species has not been observed on several visits. Records of such instances are true-absences. Typically, these are not recorded, and hence true-absences are lacking for many species (Barry and Elith, 2006; Hirzel et al., 2002b). To obtain absence locations, pseudo-absence samples are therefore generated randomly, based on some knowledge of where the species may not be present (Wisz and Guisan, 2009). In this way presence-absence models are formed, i.e. pseudo-absences are used as pure/true absences (see Figure 2-1). One difficulty of the pseudo-absence approach is how to generate reliable absences. Apart from sampling random points on a landscape, several strategies for generating pseudo-absences (see Table 2-2) have been devised and investigated for building binary logistic regression models. The primary purpose of devising these strategies is to get indirectly

4 http://cran.r-project.org/web/packages/gbm/index.html, accessed 28.12.2011
5 ecogbm is not available at official CRAN distribution site, a beta version is available at: http://www.stanford.edu/~hastie/Papers/Ecology/ecogbm_1.01.tar.gz, accessed 28.12.2011


as many reliable absence samples as possible. However, no single approach has been established as the recommended one.

[Figure 2-1: four panels, (a) true-absence (initial), (b) true-absence (final), (c) pseudo-absence (initial), (d) pseudo-absence (final); vertical axis: Probability (0.0 to 1.0); horizontal axis: Elevation; legend entries: pr, ab. true, ab. assumed, actual]

Figure 2-1: Fitting the probability (logistic regression) of true- and pseudo-absence data, shown with one predictor variable ‘Elevation’. The probability of absence data (a and b: true; c and d: pseudo) is zero, and this value is used throughout the iteration process. Although pseudo-absence data are likely to contain a mixture of true-absences (prob. = 0, circles) and non-absences (i.e. prob. > 0, plus signs; their actual values indicated by squares), a probability value of zero is assumed (idea taken from Ward et al., 2009, figure 3).
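The simplest strategy, random points constrained by some knowledge of where the species may not occur, can be sketched as rejection sampling. The sketch below assumes that this knowledge takes the form of a minimum distance to any recorded presence; the coordinates, the extent, the distance threshold and the function name are hypothetical, and distances are computed in plain coordinate units rather than on the ellipsoid.

```python
import math
import random


def generate_pseudo_absences(presences, extent, count, min_distance, seed=0):
    """Draw random points inside `extent` (x_min, y_min, x_max, y_max) and keep
    only those at least `min_distance` away from every presence location."""
    rng = random.Random(seed)            # fixed seed for a reproducible draw
    x_min, y_min, x_max, y_max = extent
    points = []
    attempts = 0
    while len(points) < count and attempts < 100 * count:
        attempts += 1
        x, y = rng.uniform(x_min, x_max), rng.uniform(y_min, y_max)
        if all(math.hypot(x - px, y - py) >= min_distance for px, py in presences):
            points.append((x, y))
    return points


# Hypothetical presence coordinates (longitude, latitude) and study extent
presences = [(36.8, -1.3), (37.1, -0.9), (34.8, 0.3)]
extent = (30.0, -5.0, 42.0, 5.0)
pseudo_absences = generate_pseudo_absences(presences, extent, count=50, min_distance=0.5)
print(len(pseudo_absences))
```

Every point produced in this way is still only assumed to be an absence, which is the situation depicted in panels (c) and (d) of Figure 2-1.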

Backgrounds – Phillips et al. (2006) introduced the MaxEnt method, which uses background samples for predicting species distributions in a regression model. When using background samples, the covariates of the presence samples are compared with the covariates of the background samples across the landscape (Ward et al., 2009). In the concept of Phillips et al. (2006), the background samples enter the model fitting process itself and are not used merely for comparing covariate values as in envelope based methods such as the ENFA approach in Biomapper. The method is similar to the use of pseudo-absences, but it assigns the random sample locations a prior population prevalence value (MaxEnt initialises with a probability value of 0.5, Elith et al., 2011) and then adjusts the values iteratively while fitting the model parameters. A similar concept of background samples for GLM, based on the Expectation-Maximisation (EM) concept (Dempster et al., 1977), was explored by Ward et al. (2009) and implemented in the ecogbm package of the R statistical software, showing that GLM-based models with EM applied can perform very well. The fitting process begins by assigning the background samples a prior prevalence value. The probability values of the background samples are then reassigned iteratively (see Figure 2-2). Thus, the probability values of background samples with a suitable environment for a species increase, whereas those of samples with an unsuitable environment decrease. The iteration runs until the maximum number of iterations is reached or other statistical criteria are satisfied.
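The iterative reassignment of background probabilities can be sketched with one covariate. This is a simplified illustration and neither the ecogbm implementation nor the method of the new tool. The intercept correction log((n_p + π n_u) / (π n_u)) is an assumption of this sketch, modelled on the case-control adjustment discussed by Ward et al. (2009), and should be checked against that source. The data, the prevalence value and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, start, steps=300, rate=1.0):
    """Gradient ascent for a one-covariate logistic model; the responses ys
    may be fractional (probabilities) as well as 0 or 1."""
    b0, b1 = start
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            error = y - sigmoid(b0 + b1 * x)
            g0 += error
            g1 += error * x
        b0 += rate * g0 / len(xs)
        b1 += rate * g1 / len(xs)
    return b0, b1


def em_presence_background(presence_x, background_x, prevalence, iterations=20):
    n_p, n_u = len(presence_x), len(background_x)
    # Assumed intercept correction for the over-representation of presences
    offset = math.log((n_p + prevalence * n_u) / (prevalence * n_u))
    xs = presence_x + background_x
    background_y = [prevalence] * n_u            # start from the prior prevalence
    coefficients = (0.0, 0.0)
    for _ in range(iterations):
        # Maximisation: presences fixed at 1, background at its current estimate
        ys = [1.0] * n_p + background_y
        coefficients = fit_logistic(xs, ys, coefficients)
        # Expectation: re-estimate the background probabilities
        b0, b1 = coefficients
        background_y = [sigmoid(b0 - offset + b1 * x) for x in background_x]
    return (coefficients[0] - offset, coefficients[1]), background_y


# Hypothetical scaled covariate: presences cluster at high values
presence_x = [0.70, 0.75, 0.80, 0.80, 0.85, 0.90, 0.90, 0.95]
background_x = [i / 20 for i in range(21)]       # evenly spread over 0..1
model, background_y = em_presence_background(presence_x, background_x, prevalence=0.3)
print(round(background_y[0], 3), round(background_y[-1], 3))
```

In this toy setting the background samples whose covariate values resemble those of the presences end with higher probabilities than those that do not, which is the behaviour described above. A fixed number of iterations stands in for the statistical stopping criteria.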

Model evaluation – Model performance is evaluated by comparing the numbers of correct predictions of presences and absences via a binary classification matrix. A widely used evaluation measure is the Area Under the Curve (AUC) value of the Receiver Operating Characteristics (ROC) curve. The ROC curve is threshold independent and plots the sensitivity (true-positive rate) on the vertical axis against 1 − specificity (false-positive rate) on the horizontal axis of the classified model output (Krzanowski and Hand, 2009). An evaluation of a model built from presence and true-absence data provides the real performance of the model. For a model built with pseudo-absence data, the evaluation result may only be real if the assumptions made for generating the pseudo-absences are as valid as for true absences. Models using background samples (presence-only) cannot directly provide a real performance metric for determining presence and absence locations (Hirzel et al., 2006); they can only measure true presences and false absences. For a proper evaluation, absence data (either true or pseudo) are required. When absence data are lacking, the background data are generally used as (pseudo-)absences for generating the ROC curve (Phillips et al., 2006).
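As an illustration of the AUC measure, the following sketch computes it directly from its rank interpretation: the probability that a randomly chosen presence receives a higher predicted value than a randomly chosen absence (or background sample used as pseudo-absence), with ties counted as one half. The function name and the example scores are hypothetical.

```python
import numpy as np

def auc_presence_vs_background(scores_presence, scores_background):
    """AUC as the share of (presence, background) pairs ranked correctly."""
    pres = np.asarray(scores_presence, dtype=float)[:, None]
    back = np.asarray(scores_background, dtype=float)[None, :]
    wins = (pres > back).sum()     # presence scored higher than background
    ties = (pres == back).sum()    # ties count as half a correct ranking
    return (wins + 0.5 * ties) / (pres.size * back.size)

# Hypothetical predicted values for three presences and three background samples.
example_auc = auc_presence_vs_background([0.9, 0.8, 0.4], [0.1, 0.4, 0.3])
```

When background samples stand in for absences, some of them lie in suitable habitat, so an AUC computed this way should be read with the caveats stated in the paragraph above.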

Table 2-2: Some of the approaches used in studies for generating pseudo-absences for presence-absence models

| Approach | Devised for | Employed by |
|---|---|---|
| random distribution and weighted-random distribution | 43 fern species | Zaniewski et al. (2002) |
| ENFA-weighted pseudo-absences | Alpine herbaceous plant (Eryngium alpinum) | Engler et al. (2004) |
| 4 different approaches based on 12 species and museum records | three butterfly species (Melitaea didyma, Coenonympha tullia and Maculinea teleius) | Lütolf et al. (2006) |
| habitat envelope via logical (spatial) function of 1 and 2 eco-variables | Northern Goshawk (Accipiter gentilis atricapillus) nests | Zarnetske et al. (2007) |
| combined ENFA and distance weighted | Root vole (Microtus oeconomus) and white-tailed eagle (Haliaeetus albicilla) nests | Hengl et al. (2009) |

General characteristics and assumptions of statistical species distribution models

When modelling species distributions, assumptions are made, and the output is valid only within the assumed conditions, which gives the models certain characteristics. Some general assumptions and associated characteristics are presented here.

Static and nature at equilibrium – Statistical models are static (not dynamic) and deterministic (event based). They do not include physiological processes (see Table 2-1) (Franklin, 1995). Biotic interactions such as inter- and intra-specific competition cannot be modelled, and these models also lack micro-habitat requirements, although such interactions are vital processes determining the species range (Araújo and Luoto, 2007; Holt and Barfield, 2009; Wisz et al., 2013). Further factors missing in static models are dispersal and evolutionary change, such as climate-induced range shifts (Pearson and Dawson, 2003). It is assumed that the species has colonised the maximum possible range and is in an optimum state (no further migration). Thus, the distribution is assumed to be in a state of (pseudo-)equilibrium with the environmental conditions in its native range (Guisan and Thuiller, 2005; Guisan and Zimmermann, 2000). This is often not the case (Araújo and Peterson, 2012) due to several factors, e.g. biogeographic history (Yesson and Culham, 2006), geographic (altitudinal and ‘latitudinal’) constraints limiting dispersal (Munguía et al., 2008), or an invasion that is not fully established, i.e. continuing expansion (Guisan and Thuiller, 2005; Peterson, 2003). Further, the presence samples are considered to represent the total population in its native range.

[Figure 2-2, panels: (a) EM iteration 1; (b) EM iteration 2; (c) EM iteration 3; (d) EM final iteration. Axes: Elevation (x), Probability (y, 0.0 to 1.0). Legend: pr: z=1; bg (pr): z=0, y=1; bg (ab): z=0, y=0.]

Figure 2-2: Fitting the probability (logistic regression) of background data (z = 0) in an iterative process shown with one predictor variable ‘Elevation’. z = 1 represents presence samples; z = 0 represents background samples; z = 0, y = 1 represents samples at favourable environment (presences); and z = 0, y = 0 represents samples at unfavourable environments (absences). At each iteration step, the value of y changes for z = 0 (source: Ward et al., 2009, figure 3; redrawn and legend slightly altered)

Simplified form – As statistical models do not include all possible factors that affect the distribution pattern, the real shapes of the species' responses to environmental covariates are unknown and the modelled responses are limited (Austin, 2007). Thus, these models represent a simplified form of a more complex system (Barry and Elith, 2006). Furthermore, different modelling methods offer different strategies for incorporating environmental variables. The choice of modelling method, model variables and variable interactions can affect the model's accuracy and generality (Guisan and Zimmermann, 2000).

Niche – The dynamics of a species' evolution, behaviour and adaptation to the climatic environment are not, and cannot be, directly included in a static model. Because of the static nature, and with the assumption that the samples represent the full population, the output of the models represents more of a ‘fundamental niche’, which relates mainly to climate and does not consider biotic interactions, than a ‘realised niche’, which covers a smaller range due to several biotic and abiotic limiting factors (Huggett, 2004). Currently, most models also include variables which partially and indirectly represent some of the biotic interactions (Lehmann et al., 2002), and hence the outputs lie somewhere between the fundamental and the realised niche (Farber and Kadmon, 2003). In reality, however, the samples are often incomplete and do not represent the full population, nor does the model output represent a complete fundamental niche.

Presence or absence – A site can be considered an absence location after more than 28 visits with confirmed non-sightings, at the 95% confidence level (Stauffer et al., 2002), but a single sighting is considered a presence location even if the species is present for an unknown reason and the site is, in reality, not a real habitat or habitable location. Ideally, such locations should not be used for fitting a model (Guisan and Thuiller, 2005). The inclusion of presence records from unsuitable habitats introduces false presences, but this is often not considered. Moreover, for many species absence locations (within the 95% confidence level) are not recorded. Pulliam (2000) discusses and provides several references where species have been found in unsuitable habitat and where suitable habitat is not occupied by the species. However, such records are often included in the modelling because information on unsuitable habitat is often not available. Discriminative methods such as GLM can utilise the information provided by absence data efficiently (Hirzel et al., 2002b), but when true absence locations are not recorded, important information about the species-environment relationship for determining the species range is withheld.
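The text does not state how the figure of 28 visits is derived. One common way of reasoning about such a number, offered here purely as an assumption and not as the derivation of Stauffer et al. (2002), is to treat visits as independent with a constant per-visit detection probability p: the chance of missing a present species on all n visits is (1 − p)^n, and n is chosen so that this falls to 1 − confidence or below. The function name is hypothetical.

```python
import math

def visits_needed(p_detect, confidence=0.95):
    """Smallest n with (1 - p_detect)**n <= 1 - confidence.

    Assumes independent visits and a constant detection probability,
    which is an illustrative simplification.
    """
    if not 0.0 < p_detect < 1.0:
        raise ValueError("p_detect must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_detect))

# Under these assumptions, a detection probability of about 0.10 per visit
# requires more than 28 visits at the 95% level.
n_visits = visits_needed(0.10)
```

Under this simplification the required number of visits rises steeply as the detection probability falls, which is one reason why reliable absence records are rarely available.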

Sample bias – Bias in sampling is often not known except in the case of well-designed surveys (Guisan et al., 2007). It has to be noted that one of the aims of SDM is to reduce expensive surveys. Collections of species' locations are often opportunistic in nature (Barry and Elith, 2006; Boitani et al., 2011; Elith and Leathwick, 2009). This leads to some areas having very high sample densities whereas other areas remain under-sampled. Such differences in sampling density may lead to false predictions. Often such biases are assumed not to occur and are ignored, i.e. they are not considered when modelling the species distribution range. Phillips et al. (2009) improved on the results of Elith et al. (2006) by introducing a bias in the background samples similar to that in the presence samples. They report that, by partially incorporating the sampling bias, the predictions of different modelling tools improved for the same 226 species (and data) used by Elith et al. (2006).
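The idea of giving the background samples a bias similar to that of the presence samples can be sketched as weighted random sampling of grid cells. The `effort` surface below is a hypothetical stand-in for any available measure of sampling effort per cell (for example, counts of records of related species); how such a surface is obtained is outside this sketch and is not taken from Phillips et al. (2009).

```python
import numpy as np

def sample_biased_background(effort, n_samples, seed=0):
    """Draw background cells with probability proportional to sampling effort.

    Returns flat cell indices; cells with zero effort are never drawn.
    """
    effort = np.asarray(effort, dtype=float).ravel()
    prob = effort / effort.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(effort.size, size=n_samples, replace=False, p=prob)

# Hypothetical effort values for eight grid cells.
effort = [0, 0, 5, 5, 1, 1, 0, 10]
background_cells = sample_biased_background(effort, n_samples=3)
```

Heavily surveyed cells are then over-represented in the background to roughly the same degree as in the presence records, so that the model contrasts environments rather than survey intensity.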

Presence-background sampling ratio – Bias in presence-background modelling is introduced via the ratio of the number of background samples to presences and via the extent used for the background information, both of which affect the evaluation of the model's predictive performance (Lobo et al., 2008; VanDerWal et al., 2009). Although there have been suggestions for using weights to balance this proportional bias, only a few studies (e.g. Maggini et al., 2006; Guisan et al., 2007) have actually included weights. Maggini et al. (2006) applied weights to absence samples to obtain an overall prevalence value of 0.5 and found that the models were more stable and explained deviance better than models with non-weighted absences. However, weights balancing the sampling ratio will have very little influence if the cut-off value for the binary representation is calculated from the sensitivity and specificity values of the prediction instead of the classical 50% mark (Jiménez-Valverde et al., 2009).
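A weighting that yields an overall prevalence of 0.5 can be sketched as follows. The scheme shown (presences keep weight 1, absences share a total weight equal to the number of presences) is one simple way of achieving this and is an assumption; it is not necessarily the exact scheme used by Maggini et al. (2006).

```python
import numpy as np

def prevalence_weights(y):
    """Case weights giving presences and absences the same total weight."""
    y = np.asarray(y)
    n_pres = int((y == 1).sum())
    n_abs = int((y == 0).sum())
    w = np.ones(len(y), dtype=float)
    w[y == 0] = n_pres / n_abs      # down- or up-weight the absences
    return w

# Hypothetical sample: 20 presences and 80 (pseudo-)absences.
y = np.array([1] * 20 + [0] * 80)
w = prevalence_weights(y)
weighted_prevalence = np.sum(w * y) / np.sum(w)
```

Such weights would typically be passed to the case-weight argument of the fitting routine of the chosen regression method.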

Variable importance – The importance or ranking of environmental variables given by statistical measures may differ from ecological theory (reality but not generality, see chapter 2.1, Table 2-1 on types of models). This is because the statistical model is not based on a cause-and-effect relationship between the variables and the responses (Guisan and Zimmermann, 2000). In addition, the final model uses only a subset of the variables from the initial pool. The subset is often chosen either by a ‘goodness-of-fit’ test for individual variables or by selection via ‘information criteria’ on the log-likelihood such as AIC, BIC, deviance or similar metrics (Heikkinen et al., 2006; Austin, 2007; Elith and Graham, 2009; Wisz and Guisan, 2009).
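Selection via information criteria can be illustrated with the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of fitted parameters, n the number of samples and ln L the maximised log-likelihood. The candidate variable subsets and their log-likelihood values below are invented for illustration only.

```python
import math

def aic(loglik, k):
    """Akaike information criterion for a model with k parameters."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion for k parameters and n samples."""
    return k * math.log(n) - 2 * loglik

# Hypothetical candidate models: variable subset -> (log-likelihood, parameters).
candidates = {
    "elevation": (-120.0, 2),
    "elevation + temperature": (-110.0, 3),
    "elevation + temperature + ndvi": (-109.5, 4),
}
best_subset = min(candidates, key=lambda name: aic(*candidates[name]))
```

In this invented example the third variable improves the log-likelihood too little to offset the penalty for the extra parameter, so the two-variable subset is preferred.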

Scale – Static SDMs are generally valid at larger grain sizes and for broad-scale patterns. The assumption that the environmental conditions remain constant is valid only at coarser grain sizes (Austin, 2007). Biotic interactions play a greater role in determining species' distributions at finer resolutions (Austin, 2007; Guisan and Thuiller, 2005), but such detailed data (measurements) are often not available, especially in a spatially explicit form, because data acquisition is expensive (Hartley et al., 2010).

Independence of data – The sample locations used for modelling the species distribution are assumed to be independent of each other. However, the samples are often not truly spatially independent; sample locations closer to each other have more similar values (here environmental attribute values) than those farther apart (Dormann et al., 2007; Guisan and Zimmermann, 2000). Furthermore, combining samples from several years of field surveys and from collections of natural history museums and herbaria, which is often the case for SDM, makes the data temporally dependent. Another characteristic or limitation is that once the model is fitted (trained), the parameters are stationary and not dynamic, i.e. constant throughout space (everywhere) as well as over time, whether the data stem from the past or the present.

Commonly used data in statistical species distribution models

Climatic variables are the core predictors for many SDM tasks, but they lack some important explanatory variables, mainly those related to dynamic processes. Although dynamic processes cannot be included in statistical (phenomenon-based) models, indirect variables have been used to account for some of the processes and to improve predictions (Austin, 2007; Guisan and Thuiller, 2005). The view behind this is that indirect variables relating to ecological processes or needs (e.g. resources) and to causal relationships lead to an ecologically more interpretable model and can thus yield more realistic and more general output (Guisan et al., 2007; Guisan and Zimmermann, 2000). The following predictor variables (some of which may belong to more than one class, depending on how they are defined) are often used:

Elevation and derivatives – Elevation is one of the most frequently used predictors. It is also correlated (although the degree varies) with most climate variables (Austin, 2002a); for example, higher altitudes have lower temperatures. Slope and aspect are examples of variables derived from elevation that affect species richness and abundance (Åström et al., 2007). South-facing slopes are preferred by many floral species in the northern hemisphere and north-facing slopes in the southern hemisphere. The slope may be important up to a certain grain size (Chapman et al., 2005), but in the equatorial region aspect may play only a minor role. Similarly, aspect may be of no use when modelling areas that include both sides of the equator (e.g. Africa). Topographic positions such as ridges, peaks and valleys (Guisan et al., 1999) can be important predictors at finer grain sizes; at larger grain sizes such details are often lost.
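How slope and aspect can be derived from an elevation grid is sketched below using finite differences. The sketch assumes a north-up raster (row index increasing southwards, column index increasing eastwards) with square cells in the same unit as the elevation; GIS packages use their own, often slightly different, neighbourhood formulas, so the function is illustrative only.

```python
import numpy as np

def slope_aspect(dem, cell_size):
    """Slope (degrees) and aspect (degrees clockwise from north, downslope)."""
    # np.gradient returns the derivatives along rows (axis 0) and columns (axis 1).
    dz_drow, dz_dcol = np.gradient(np.asarray(dem, dtype=float), cell_size)
    dz_dx = dz_dcol        # change of elevation towards the east
    dz_dy = -dz_drow       # change of elevation towards the north (rows run south)
    slope = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
    # Aspect is the compass bearing of the steepest descent; it is not
    # meaningful for flat cells, where the gradient is zero.
    aspect = np.degrees(np.arctan2(-dz_dx, -dz_dy)) % 360.0
    return slope, aspect

# Hypothetical 4 x 4 elevation grid rising towards the north, 10 m cells.
rows = np.arange(4)[:, None]
dem_north = (3 - rows) * 10.0 * np.ones((1, 4))
slope_n, aspect_n = slope_aspect(dem_north, cell_size=10.0)
```

A terrain rising towards the north faces south, so every cell of this test surface has a slope of 45 degrees and an aspect of 180 degrees.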

Climate – Climate-related variables are used to relate to the physiological needs of species (Guisan et al., 2007). The bioclimatic indices conceptualised by H. A. Nix (Busby, 1986) are among the most used data in species distribution models. Although 19 variables are synthesised from temperature-related (11 variables) and precipitation-related (8) data (Hijmans et al., 2005; www.worldclim.org), not all of them are useful, and the selection of variables is generally determined by the species (Beaumont et al., 2005). Among the 19 variables, the annual mean temperature and the annual precipitation are very common (ibid.), whereas temperature range, temperature seasonality and precipitation seasonality are other data that are often included. However, it has to be noted that many of the 19 bioclimatic variables are highly correlated with each other (Leaché et al., 2009).
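One simple way of dealing with highly correlated predictors is to screen them pairwise before modelling. The sketch below keeps variables in their given order and drops any variable whose absolute Pearson correlation with an already retained variable exceeds a threshold. The threshold of 0.7 is an assumed rule of thumb, the data are synthetic, and the variable names are used only as labels.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.7):
    """Greedy filter: keep a variable only if |r| <= threshold with all kept ones."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(len(names)):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Synthetic data: the second column is almost a copy of the first.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = 2.0 * a + rng.normal(scale=0.1, size=200)
c = rng.normal(size=200)
selected = drop_correlated(np.column_stack([a, b, c]), ["bio1", "bio5", "bio12"])
```

Because the filter is order dependent, the variables considered ecologically most meaningful for the species should be listed first.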

Other indirect variables (surface strata) – Several variables characterising the land surface are used to incorporate some biotic factors indirectly. Land cover/use has often been used as a surrogate for habitat type when modelling faunal species (Franklin, 1995). Although the combined effect of temperature and precipitation is correlated with phenology, remotely sensed data such as NDVI (an indirect index of above-ground green biomass) have also proved useful as a surrogate for resource availability (Chapman et al., 2005). Further, the combined use of NDVI, temperature and precipitation indirectly includes the time lag between precipitation and plant growth (Dall’Olmo and Karnieli, 2002). Factors such as soil moisture content, organic content and pH also play an important role for plant species (Franklin, 1995), but detailed geospatial data on them are often not available. Soil type and geology can be included as proxies for nutrients when modelling floral species (Engler et al., 2004). Water is one of the essential elements for any living organism. For species such as Odonata, which depend on aquatic and terrestrial ecosystems, inland water bodies can be an important predictor variable (Kalkman et al., 2008). Hence, hydrographical features should be useful predictors for such mobile species.
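For reference, NDVI is computed per pixel from the near-infrared (NIR) and red reflectance as (NIR − Red) / (NIR + Red). A minimal sketch, with a hypothetical function name and invented reflectance values, is given below; pixels where both bands are zero are returned as missing values.

```python
import numpy as np

def ndvi(nir, red):
    """Normalised Difference Vegetation Index; NaN where NIR + Red equals 0."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    out = np.full(denom.shape, np.nan)
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out

# Invented reflectance values: dense vegetation, bare soil, no signal.
values = ndvi([0.5, 0.3, 0.0], [0.1, 0.3, 0.0])
```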

Incorporating statistics and GIS for species distribution modelling

Chapter 2.3 showed that various statistical modelling methods are in use for predicting the spatial distribution of species. Despite constraints such as general assumptions (chapter 2.5) and methodological challenges still to be overcome (chapter 2.4), examples of use (chapter 2.2) have revealed the success of these methods in various application domains. However, spatial data handling is a separate application domain, which ideally needs to be integrated with the statistical modelling framework for a smooth workflow of the overall SDM process (see chapter 4.1). Several bio-(and) physical environmental variables of ecological significance, in geospatial data format, have been in use (chapter 2.6). Some of them need pre-processing so that they can be used for modelling, while others may need pre-processing so that they can be assigned an ecological meaning. Often eco-variables are pre-processed to form proxy variables, i.e. derivatives of the original variable, which then sometimes represent a causal relationship providing partial mechanistic effects, or reflect an indirect relationship with the species presence and/or absence data (Guisan and Zimmermann, 2000) in statistical SDM. These pre-processing steps may require chaining several spatial functions/operations before the final variable is derived (e.g. see chapter 5.2.1). Geo-referencing museum and herbarium records as well as harmonising all geo-datasets to a common spatial frame and grain also belong to the pre-processing tasks. GIS is ideally suited for such purposes. Although GIS offers basic statistical functions, the available functions are not sufficient for modelling species distributions. Therefore, one option for integrating thorough statistics and geospatial processing is to pre-process the data in GIS, export the data in a format readable by the statistical software, run the modelling, export the results back to GIS, and present the results in GIS. Another option is to couple the GIS and the statistical software, providing a common interface for both. In total, three ways of integrating GIS and statistical software are possible: a) loose coupling, b) tight coupling, and c) integrated coupling (single software) providing GIS and statistical functions.

Loose-coupling – In this type of integration, each software works independently, and neither knows about the data structures and processes of the other. An intermediary interface (usually the transfer of data files) acts as a bridge (connector) between the two software packages for communicating input and output (Figure 2-3). Thus, input and output are usually exchanged in an exchange file format through a Graphical User Interface (GUI) and/or a Command Line Interface (CLI), provided that a common exchange format exists (Jankowski, 1995). In the absence of a common exchange format, a converter can be plugged in to facilitate the exchange of files. Implementing such an interface is comparatively easy and provides full flexibility. However, care has to be taken when designing the intermediary interface so that it does not become too complex or difficult to use. From a usability aspect (chapter 3.2.3), GUI coupling is better suited: a well-designed, user-friendly GUI acting as a bridge hides the internal coupling mechanism, and such GUIs are often used in modelling applications (Brandmeyer and Karimi, 2000). GRASP (Lehmann et al., 2002) is an interface providing GAM (statistics) of S-plus or R in ArcView GIS (version 3.x). Because of the exchange-file format, an upgrade of the statistical software will most likely not affect the functioning of the processing.

Figure 2-3: Loose GUI coupling of GIS and statistical SDM involving data converter and bridged via GUI (adapted from Jankowski (1995), Karimi and Houston (1996), and Brandmeyer and Karimi (2000))

Tight-coupling – In this type of integration, the systems also work independently of each other. But, unlike in loose-coupling, one software has information about the data structure of the other. Application Programming Interfaces (APIs) are at the centre of such systems, which are linked together via a GUI (Figure 2-4), and many software packages publish APIs to facilitate tight-coupling of applications. Data can therefore be shared in memory, and exchange formats are often not involved. A tightly coupled system can thus benefit from the analytical capabilities of the coupled components (Karimi and Houston, 1996). Such integration is more difficult to implement and less flexible than loosely coupled software. Further, upgrading one software may affect the functioning of the system if the APIs change in the upgraded version. GeoEco (Roberts et al., 2010) is such an example, where statistical modelling capabilities (GLM and GAM) are coupled within the ArcGIS Desktop environment.

Figure 2-4: Tight coupling of GIS and Statistical SDM with the APIs as core component in the centre of the different systems, database and the GUI (adapted from Jankowski (1995))

Integrated coupling – This is a full integration of all systems or components, which share a common GUI and data access mechanism (Figure 2-5). Since the underlying data as well as the memory are shared across processes, users of the system will not notice any difference in handling the different types of functionality offered by the system. From the users' perspective, the system provides one common user interface and thus a harmonised user experience (see chapter 3.4). For the developers, full integration involves a complex implementation (Brandmeyer and Karimi, 2000).

Figure 2-5: Integrated coupling of GIS and SDM with systems and database interacting as a single unit (adapted from Brandmeyer and Karimi (2000))

Summary and main considerations

SDMs fall into three classes: analytical models based on theory and mathematics, mechanistic models based on processes and behaviour, and empirical models based on stochastic events or phenomena. The analytical and the mechanistic models and modelling methods are generally specific to a certain species. Because they build on specific processes and behaviours, the model output can be generalised across different landscapes and the model parameters can be transferred. The empirical modelling methods, relying only on observations of stochastic events of presence or absence, are not specific to a certain species, but transferring model parameters across different landscapes may often not be valid, since a phenomenon-based result cannot be generalised and the model output hence loses generality. However, empirical models offer ‘realistic’ output based on stochastic events and present ‘precise’ information based on the observations with respect to environmental variables. With the aid of this ‘precision’ and ‘reality’, they are used to test and explain various ecological hypotheses and processes, finding useful applications such as unearthing hot-spots, designing strategic surveys, facilitating biodiversity assessments, determining shifts in habitat range due to environmental change, etc. The increasing use of statistical SDM in various ecological domains has inspired the development of several methods adapted to different requirements, such as experiment or hypothesis setting, simplicity for explaining and/or relating to other phenomena, or complexity for capturing as much reality as possible in current and future environmental scenarios. The methods are based on observations of either presence or presence-absence locations, some utilising the notion of pseudo-absences where true absences are not available. Several variants of generating pseudo-absences have been conceived and used. The concept of background information instead of pseudo-absences has opened a new dimension in species distribution modelling, offering a conceptually better model formulation than the use of uncertain pseudo-absence data. Presence-background models are, therefore, receiving more attention. Statistics is the core of empirical SDM methods, but equally important is the role of GIS in shaping the data for modelling as well as in further processing for inference in various management applications. Although having GIS and statistical modelling functionalities in one tool may be desirable, other possibilities such as coupling in a loose or tight framework can nevertheless serve the purpose. Loosely coupled systems offer flexibility in the choice of GIS and statistical methods. One of the benefits is that a component of a loosely coupled system can be exchanged with little trouble. In a tightly coupled system, this is less feasible without changing the programming code, because the APIs tie the different components together.

3. Quality-in-use for development of species distribution modelling tool

‘Quality’ is defined as “the standard of something as measured against other things of a similar kind; the degree of excellence of something” (Oxford Dictionaries, 2010). Quality-in-use is the measurement of the effectiveness, efficiency and satisfaction achieved by a user while performing a certain task in a specified environment (Bevan, 1999). Including processes that ensure a better user experience through quality checks during the development of a product can determine whether the product (here a model or modelling tool) is accepted or rejected by users (Raza, 2011). Quality is measured over several parameters within the ‘context of use’ (Bevan, 2001b). Employing user-centred design during the design stage (ibid.) and considering experience as one of the components can improve usability and user experience (Petrie and Bevan, 2009). For species distribution modelling (SDM), the quality-in-use measures the attributes in the context of efficiently modelling the species distribution with all the related tasks.

Context of use

Species distribution models have been used in several contexts (see chapter 2.2). Different purposes generally have different requirements. Thus, the quality assessment differs with the use-case. Further, the species distribution modelling task involves, besides applying ecological theory, the use of GIS and statistics (see chapter 2.7). Here, the SDM task can be divided into two contexts: a) a scientific modelling context, where species prediction is the aim; and b) an applied model-output use-context, where the output is used in different application domains. The necessary expertise is additionally related to the basic functionalities required for the task (see chapter 3.2.1). Although the outcome of a modelling tool will be the same for a given data set and method, distinct contexts and requirements, model evaluation methods and consequences (e.g. Table 3-1) will result in different output quality (Guisan and Zimmermann, 2000). The outcome of modelling a species distribution can contain over- or under-predictions, and their effect varies according to the context for which the modelling is done. For example, in a scientific context either over- or under-prediction can lead to a wrong hypothesis being assumed. Likewise, a biodiversity inventory may not be efficient if either too many or too few resources are allocated. For a conservation planner trying to determine the possible extent of invasion by a non-native species, under-prediction might have severe after-effects on the ecosystem and on the conservation of native species because of the failure to make a proper assessment, thus degrading the quality-in-use metric of the modelling tool. With over-predictions, conservation might benefit (positive for species conservation), but at the cost of extra resources (negative for e.g. economic input): larger areas may be allocated for conservation, but the result may not be as efficient in terms of optimal use of resources. Students can be made aware of the different consequences by using different modelling tools and techniques, which enables them to learn the strengths and weaknesses of the tools and makes them aware of potential pitfalls.

Hence, confined in the boundary of a specified context of use, the overall ‘quality-in-use’ is assessed using various quality measures.

Table 3-1: Some examples of SDM users, contexts of modelling and the typical consequences of over- and under-prediction

| User | Context (task)* | Prediction | Typical consequences* |
|---|---|---|---|
| scientist / academics | present and historical biogeography | over / under | wrong hypothesis on biogeographic distributions and ranges |
| scientist / academics | evolutionary (phylogenetic) distribution | over / under | wrong hypothesis on speciation or evolution and origin |
| biodiversity inventory official | survey design | over | (-) increased economic input; (+) detection |
| biodiversity inventory official | survey design | under | (+) lower economic resource; (-) likely detection missed |
| conservation planner / manager | monitoring / evaluation / prioritisation | over | (-) higher resources needed |
| conservation planner / manager | monitoring / evaluation / prioritisation | under | (-) probable conservation areas not being allocated |
| conservation planner / manager | species extinction / conservation status | over | (-) higher resources; (+) more conservation areas allocation; (-) false ‘downgrade’ of threatened red-list status** |
| conservation planner / manager | species extinction / conservation status | under | (-) less conservation areas allocation; (-) false ‘upgrade’ of threatened red-list status** |
| conservation planner / manager | species invasion | over | (+) increased effort on native species conservation; (-) higher resource allocation |
| conservation planner / manager | species invasion | under | (-) low effort on native species conservation; (-) resource for not proper assessment |
| conservation planner / manager | future climate change | over / under | differing management plans and policies |
| student | understanding strengths and weakness of various modelling methods, learning different contexts and applications of SDM | over or under | learn about potential pitfalls of over and under prediction, develop or simulate ideas for improving model output |

* non-exclusive list
** several other criteria are also needed for a full assessment
(+) positive effect
(-) negative effect

Quality measures

Quality of software is measured based on functionality, reliability, usability, efficiency, maintainability, and portability (Bevan, 1999). Among the six measures, end-users are concerned with the first four metrics, whereas the last two metrics are of interest to the support-user (Bevan, 2001a).

3.2.1. Functionality

Task analysis is central to deciding on the functionalities to be offered by a tool. While insufficient functionality can frustrate users, so that the tool may not become their choice, too much functionality can lead to under-utilisation, because extra functions can have negative effects on usability measures (see 3.2.3) such as learnability, memorability, or error rate (Shneiderman and Plaisant, 2005). The SDM task needs a series of steps, i.e. several functionalities (see Table 3-2 for functionalities in some SDM tools), with statistical modelling obviously being the core, together with some form of evaluation of the model output. Before a statistical model is created, the data needed for the (environmental) modelling have to be pre-processed, e.g. to ensure that all the GIS data layers are properly harmonised (Ghisla et al., 2012; Graef et al., 2005). Some of the SDM tools without a GIS interface or GIS functions (e.g. GARP, GRASP) package customised scripts for a certain GIS software for this basic data preparation.

Table 3-2: Basic functionality matrix overview of selected SDM tools

| Functionalities | BIOCLIM (Diva-GIS) [a] | ENFA (Biomapper) [b] | GLM/GAM/BRT (R/S-plus) [c] | MaxEnt [d] | GARP (DesktopGarp) [e] |
|---|---|---|---|---|---|
| Basic GIS pre-processing [1] | yes | yes | additional packages | no | ArcView 3.x scripts included |
| Statistical modelling | yes | yes | yes | yes | yes |
| Statistical evaluation [2] | yes | yes | yes | yes | yes |
| GIS post-processing [1] | yes | yes | additional packages | no | no |
| Simple visualisation | yes | yes | additional packages | yes | yes |
| Interactive visualisation | yes | yes | – | no | no |

[a] Hijmans et al. (2012)
[b] Hirzel et al. (2002a)
[c] Hijmans and Elith (2011)
[d] Phillips et al. (2006)
[e] Scachetti-Pereira (2002)
[1] some (basic) GIS functions
[2] some sort of evaluation of predicted result

The probability distribution maps show gradients and thus provide useful information, but they can also be misleading, and reclassifying them into a few classes may actually provide a better summary; the minimum number of classes is two, i.e. binary presence-absence based on one of several threshold criteria (Hirzel et al., 2006). This conversion is the basic GIS post-processing step that facilitates the assessment of the prediction accuracy. Interactive visualisation of the distribution as a map is often sought, but even a quick overview can offer a preliminary visual evaluation of the output.
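The reclassification described above can be illustrated with a minimal sketch. It assumes that the probability raster is held as a NumPy array, that no-data cells carry a sentinel value, and that a threshold has already been chosen; the function name `to_presence_absence`, the sentinel `-9999.0` and the threshold `0.5` are hypothetical and are not taken from any of the SDM tools discussed here.

```python
import numpy as np

def to_presence_absence(prob, threshold, nodata=-9999.0):
    """Reclassify a probability raster into 1 (presence) and 0 (absence).

    Cells equal to `nodata` are kept apart and marked with -1.
    """
    prob = np.asarray(prob, dtype=float)
    valid = prob != nodata
    binary = np.full(prob.shape, -1, dtype=np.int8)   # -1 marks no-data cells
    binary[valid] = (prob[valid] >= threshold).astype(np.int8)
    return binary

# Toy 2 x 3 probability raster with one no-data cell (illustrative values only).
prob = np.array([[0.10, 0.45, 0.80],
                 [0.55, -9999.0, 0.30]])
binary = to_presence_absence(prob, threshold=0.5)
print(binary)
```

The threshold is passed in explicitly because the text leaves the choice among the several threshold criteria open; any of them could supply the value used here.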

3.2.2. Reliability

Reliability is a quality attribute related to performance; it measures the likelihood of potential failure, how failures are handled, and what measures are in place to recover from any failure (Bevan, 1999). For SDM methods, reliability includes the ability to calculate parameters fitting the covariates. Parameter fitting is an iterative process which depends on the convergence threshold (Nocedal and Wright, 2006; Vanderbei, 2008). A smaller threshold value generally requires a higher number of iterations. Although the implemented iterative methods are often guaranteed to converge (e.g. Dudík, 2007; Friedman et al., 2010), they may fail at times due to e.g. singularity of predictor variables (Press et al., 2007). Moreover, convergence does not necessarily mean that the model is stable. SDM methods offer an option to specify the maximum number of iterations, which in the case of non-convergence ensures that the program does not run forever, although this measure is only a workaround and not a perfect solution. Further, regression with exponential forms (e.g. Poisson regression, maximum entropy) may often predict results outside the desired bounds (Phillips et al., 2006; Ward, 2007). The out-of-bounds results can be corrected by transforming the results, e.g. via the logistic transformation as implemented in MaxEnt (Phillips and Dudík, 2008), thus providing a workaround for the undesired result. Another reliability factor in regression modelling is the presence of correlated variables and noise in predictor datasets; noise tends to influence the result, sometimes also contributing to instability of the model, e.g. high standard deviations. There are methods for controlling over-fitting, removing noise in predictor variables and handling correlated variables, and some of the tools have options for applying them or have incorporated them as a standard procedure (see Table 3-3). L1-regularisation is part of MaxEnt for controlling over-fitting as well as for noise removal. Users can choose L1, L2 or elastic-net regularisation against over-fitting when using regression-based models in R; however, the use of elastic-net in SDM with continuous and categorical data has not been published.
Generally, regularisation techniques are not often employed in SDM (Phillips and Dudík, 2008). No options are available for noise removal in BIOCLIM and ENFA; they are based on envelopes. MaxEnt does not handle correlated data well. The exact method for noise removal, control of over-fitting and handling of correlated data in DesktopGarp is not known here. ENFA uses a factorisation technique for modelling, and thus correlated data are, by nature, handled, while regression models in R can use L2-regularisation. L1 and L2 regularisations cannot be used at the same time; instead, elastic-net facilitates both. A further important consideration in modelling is the repeatability of results (Golafshani, 2003). If results cannot be reproduced consistently with the same model settings, the reliability of the model output decreases because the results cannot be cross-checked. The results can be reproduced in all tools except GARP. For GARP, the results may differ to some extent, which is related to the stochastic (non-deterministic) nature of the genetic algorithm (Stockwell and Peters, 1999). Furthermore, for large datasets subsampling is necessary, leading to different results across different model runs for the same dataset (Hastie et al., 2009). These differences may be reduced, but not eliminated, by the iterative simultaneous testing with several partitioned (subsets or sub-samples of) training and test datasets which is part of the modelling (Stockwell and Peters, 1999).
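The interplay between convergence threshold and iteration cap discussed in this section can be illustrated with a small sketch. It is not the fitting routine of any of the tools compared here: it assumes a plain gradient-ascent fit of an unregularised logistic regression, and the names `fit_logistic`, `tol`, `max_iter` and `step` as well as the toy data are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, tol=1e-6, max_iter=10_000, step=0.5):
    """Gradient ascent on the logistic log-likelihood.

    Returns (weights, iterations used, converged flag).
    """
    w = np.zeros(X.shape[1])
    for i in range(1, max_iter + 1):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # logistic transform keeps p in (0, 1)
        grad = X.T @ (y - p) / len(y)          # gradient of the mean log-likelihood
        w_new = w + step * grad
        if np.max(np.abs(w_new - w)) < tol:    # convergence threshold reached
            return w_new, i, True
        w = w_new
    return w, max_iter, False                   # iteration cap hit without convergence

# Toy, non-separable presence/absence data along one covariate (illustrative only).
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])      # intercept column plus covariate

w_tight, n_tight, ok_tight = fit_logistic(X, y, tol=1e-6)
w_loose, n_loose, ok_loose = fit_logistic(X, y, tol=1e-3)
w_cap, n_cap, ok_cap = fit_logistic(X, y, tol=1e-6, max_iter=5)
print(n_loose, n_tight, ok_cap)
```

On the same data, the looser threshold cannot need more iterations than the tighter one, and the run capped at five iterations returns with the converged flag unset. This mirrors the point that the cap only prevents the program from running forever and does not solve the non-convergence.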

3.2.3. Usability

Usability issues are the most sought-after quality attributes (Carroll, 2004). The ‘easy-to-use’ criterion directly influences other usability metrics and receives the most attention in user (usability) testing. Usability is the subjective measure of the ease of use, or of the user’s interaction, while performing a specified task effectively to attain a specified level of achievement (Cooper et al., 2007; Galitz, 2007). Usability testing can reveal areas that need corrections in design to improve the user experience, efficiency and overall ‘quality in use’ of a product (Cooper et al., 2007). Usability also depends on other quality measures, e.g. functionalities, but these are dealt with separately. Typical attributes to describe usability are a) learnability, b) operability/performance, c) memorability, d) error rate, and e) satisfaction (Galitz, 2007; Nielsen, 2010; Shneiderman and Plaisant, 2005). The preference is to achieve a high combined score across all these metrics, but trade-offs are common because of various factors (e.g. rate of errors vs. speed of performance, learnability vs. operability) and subjectivity (e.g. nature of task and needs, users’ expertise, cultural background) related to the context of use (Shneiderman and Plaisant, 2005).

Table 3-3: Methods to improve reliability regarding noise in predictor variables, over-fitting and correlated variables in selected SDM tools

| Tool | BIOCLIM (Diva-GIS) | ENFA (Biomapper) | GLM/GAM/BRT (in R) | MaxEnt | GARP (DesktopGarp) |
|---|---|---|---|---|---|
| Noise removal and over-fitting | n/a | n/a | L1-regularisation, elastic-net | L1-regularisation | yes a |
| Handling correlated variables | n/a | factorisation | L2-regularisation, elastic-net | n/a | yes a |
| Repeatable results | yes | yes | yes | yes | not necessarily b |
| References | Hijmans et al. (2012) | Hirzel et al. (2002b) | Hastie et al. (2009); Zou and Hastie (2005) | Phillips et al. (2006) | Stockwell and Peters (1999) |

a exact method not known; b difficult to achieve due to nature of the genetic algorithm

Learnability is a measure of the ease of use when a user encounters the design for the first time while performing some tasks. The time required for getting used to the software and the amount of effort required to use it proficiently measure the learnability (Tullis and Albert, 2008). The familiarity of elements in the user interface and their expected behaviour may increase learnability, whereas a different behaviour of controls can confuse the user. Consistency in design and appearance of the dialog windows increases the users’ ability to handle the system. Further, a simpler interface enhances the user experience (Apple, 2009; Microsoft, 2010; Shneiderman and Plaisant, 2005).

Operability/performance concerns the user’s efficiency (time taken) in using the system while performing a specified task (Shneiderman and Plaisant, 2005). It depends directly on learnability: the easier a system is to learn, the higher is the efficiency in use. As with any system, experience in using the system has a positive effect on the operability; experienced users can operate more efficiently whereas a novice user will take time to attain a certain level of efficiency. Here, the user interface plays a vital role in the interaction between the user and the system, thus effective interface design is crucial for efficient use (ibid.).

Memorability refers to the ability of a user to attain the same level of efficiency after not having used the system for a long period of time. Good learnability helps in quickly recalling simple tasks. A frequent user of the software will have fewer problems remembering how to perform a certain task, but for a casual user a well-designed interface offering recognisable controls aids in remembering the sequences and commands within a short span of time (Nielsen, 2010).

Error rate, as measured in usability, is different from errors measured in reliability (see chapter 3.2.2). In usability testing, it is the number of errors a user makes during execution of a specified task. Usability testing measures how often users make errors, whether and how users can recover from them, and whether the system informs users about the errors. Well-formulated error messages can reduce future error rates, and a reduced number of errors can increase productivity or efficiency of use (Shneiderman and Plaisant, 2005). “If an error is possible, someone will make it. The designer must assume that all possible errors will occur and design so as to minimize the chance of the error in the first place” (Norman, 1990, p36). Thus, measuring errors (by users) during usability testing can help in understanding the design failures which induced incorrect actions (Tullis and Albert, 2008).

Satisfaction refers to the level of acceptance of the achieved goal as well as the attractiveness of the user interface, measuring the user’s affection for various aspects of the interaction interface (Shneiderman and Plaisant, 2005; Tullis and Albert, 2008). Command-line interfaces and graphical user interfaces are the two broad categories of interfaces providing interaction between users and computers. The choice of an appropriate interface depends on knowing the user and the task and is better facilitated through user-centred design (see chapter 3.3).

3.2.4. Efficiency

Generally, efficiency is measured as a ratio of input to output. For quality metrics, the input and output can differ depending on what is being measured. Petrie and Bevan (2009, p20-1) define efficiency as “the resources expended in relation to the accuracy and completeness with which users achieve goals”. Time is one of the factors often used for measuring efficiency, e.g. the amount of time spent on completing a task (Tullis and Albert, 2008). However, past experience, expertise in the task and the nature of the task can influence the completion time. Thus, measuring relative efficiency can provide a better metric, where the average time taken by users is compared directly with the average time taken by experts within the same context and environment in achieving the stipulated goal (Bevan, 2006). Including the designer in the pool of experts for measuring the time can highlight the potential gap between the designer’s ideas and concepts and the users’ perception. Another way of expressing efficiency is to evaluate the amount of effort a user requires for completing a task. The effort can be physical or cognitive in nature. The physical effort can be the number of steps to be performed via mouse clicks and key input. The cognitive effort is to find out where to click (Tullis and Albert, 2008). Efficiency can also be expressed in terms of effectiveness, where effectiveness is described as a function of the quality and quantity of the completed task. The user’s efficiency is then the ratio of effectiveness to the task time (Bevan, 1995; Bevan and Macleod, 1994). Resources can also be computing resources such as the CPU time or memory required to run the task. For processing large amounts of data, there is often a trade-off between the memory required and the CPU time, depending on which optimisation strategy has been employed.
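The time-based measures above can be made concrete with a short sketch. The text only states that effectiveness is a function of quantity and quality; the product form used below is an assumption, as is the direction of the relative-efficiency ratio (expert time divided by user time, so that 1.0 means users are as fast as experts). All function names and the example values are hypothetical.

```python
from statistics import mean

def effectiveness(quantity, quality):
    """Effectiveness from quantity and quality, both as fractions in [0, 1].

    The product form is an assumption made for this sketch.
    """
    return quantity * quality

def user_efficiency(quantity, quality, task_time):
    """Ratio of effectiveness to the time spent on the task."""
    return effectiveness(quantity, quality) / task_time

def relative_efficiency(user_times, expert_times):
    """Average expert time divided by average user time (assumed direction)."""
    return mean(expert_times) / mean(user_times)

# Illustrative task-completion times in minutes, not measured data.
user_times = [12.0, 15.0, 18.0]
expert_times = [5.0, 7.0]

rel = relative_efficiency(user_times, expert_times)
eff = user_efficiency(quantity=0.9, quality=0.8, task_time=12.0)
print(round(rel, 3), round(eff, 3))
```

Adding the designer's own completion time to `expert_times` would be one way of exposing the gap between the designer's concept and the users' perception mentioned above.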

3.2.5. Maintainability

One of the two quality criteria that are directly related to a support-user is maintainability. Maintainability of software is the ability to modify the software for correction, improvement, adaptation to a different environment (e.g. operating system) or changes in the functional requirements and specifications (Bevan, 1999). In the ‘quality model’ defined in ISO/IEC FDIS 9126-1, maintainability is characterised by analysability, changeability, stability and testability (Bevan, 2001a, p540 figure 2). Two distinct groups can be formed based on their closeness: a) analysability and changeability, and b) stability and testability.

3.2.6. Portability

Software portability is the ability to easily move and use a program between different operating systems/platforms (and architectures) with reasonable cost and effort. For most end-users, portability means minimal changes to a program when moving to a different system, no or little (re-)training in handling the program, and the ability to work with either local or remote systems (Garen, 2007). Further, portability also refers to the mechanism which allows converting, sharing and using data in multiple software and/or hardware environments (Shneiderman and Plaisant, 2005). ISO/IEC FDIS 9126-1 characterises portability by adaptability, installability, co-existence and replaceability (Bevan, 2001a, p540 figure 2).

3.3. User-centred design

Although the task is important, how users are enabled to perform the task plays an equally important role. Thus, focusing on the user in the design process offers a better or more pleasant experience to the end user (Lewis and Reiman, 1994). Further, the user-centred design process can ensure a stable system and reduce the risk of failure. The user-centred design process is an iterative process consisting of context definition, requirement analysis, design and test (Lewis and Reiman, 1994; Maguire et al., 1998). Maguire et al. (1998, p20) suggest twelve basic questions to summarise a project (i.e. the overall context) from the users’ viewpoint; some may not be applicable in every case, and some may have a common answer (see Box 3-1). The questions in Box 3-1 help not only to decide how a system should work but also give an idea of the next step: the analysis of requirements, which includes user characteristics, working environment, and user goals and tasks (see Box 3-2, adjusted to SDM tools). Based on the targeted users’ profile and the requirement analysis, other necessary strategies can be planned, such as the sequence of execution (see Table 3-2) or additional functions and features (e.g. see Table 3-3 for features) to complement the main functions (Shneiderman and Plaisant, 2005). The technical working environment is at the core of the user-centred design process when creating a specification. Hardware such as memory, processor capacity, architecture, and input and output devices, the software platform (and additional dependent components) and the networking infrastructure are key components which describe the environment for interface design (Thomas and Bevan, 1996). If an interface is designed considering the users’ profile/characteristics, it is more likely that the user will learn to use the tool faster, use it efficiently and make fewer errors. Thereby a better user experience is offered and higher confidence in its use is stimulated.
The user interface plays an important role, not only with the user in mind but also in relation to the nature of the task. Shneiderman and Plaisant (2005) discuss the advantages and disadvantages of five common software user interface styles: direct manipulation, menu selection, form filling, command line and natural language. The first three styles are some form of graphical user interface (GUI) and the latter two styles are forms of command line interface (CLI). A GUI offers interactive manipulation of input and output parameters, provides easy learning and exploration with high subjective satisfaction, and allows easy memorability. The GUI has been the state of the art for interacting with software or systems. A CLI can offer flexible use, perform complex command sequences in a batch, and make users feel that they have full control of what is happening. However, the CLI comes with a slow and difficult learning curve and poor memorability, and increased complexity introduces errors (e.g. Roy, 1992). Although natural language commands reduce the burden of learning command syntax, different cultures and languages are hindrances for effective implementations.

Box 3-1: Basic questions to summarise a project in determining the context of use of a system (based on Maguire et al ., 1998)

• What is the system or service?
• What functions or services is the system intended to provide?
• What are the aims of the project (product)?
• Who is the system intended for (i.e. target market)?
• Who will use the system?
• Why is the system needed?
• Where will the system be used?
• How will the system be used?
• How will the user obtain the system?
• How will the user learn to use the system?
• How will the system be installed?
• How will the system be maintained?

Use of software begins with installation, or at the least with acquiring it. The SDM tools listed in Table 3-4 can be obtained easily and freely from the Internet. Although installation is easy for all the tools, it may not be straightforward for R, where dependencies may have to be installed separately, or for DesktopGarp, where the installer does not include all required files. In both cases, however, an Internet connection facilitates downloading of the required files. R informs clearly about missing dependencies and will download them if an Internet connection is available, whereas DesktopGarp gives a cryptic message which is not understandable. The next issue a user faces is the interface for interacting with the software and the availability of support for its use. Biomapper and DesktopGarp are GUI-based, whereas Diva-GIS and MaxEnt can be used either via GUI or via CLI. R is available as a command line interface. A user-support system is crucial for any interactive product. Documentation such as a user’s manual, quick reference and other self-help materials is part of user-centred design (Thomas and Bevan, 1996). The availability of documents describing the logic (how the software computes results) offers insights into the computation, allowing a better judgement of the appropriateness or relevance of the software for a particular task. Equally, a brief guide explaining ‘how to’ can facilitate an easy learning experience and allow users to start immediately. The process of modelling species distribution requires several steps in sequence, e.g. data preparation (as pre-processing), statistical modelling as the main task, and post-processing, the latter depending on the field of use. Depending upon users’ experience and the data requirements of the SDM software, the pre-processing steps can vary. So, clear guidance on these inter-related steps can be an important asset (e.g. Biomapper 3 user’s manual by Hirzel, 2004). Another important aspect is the operating system on which the software runs.
In comparing some of the SDM tools, Biomapper is clearly a well-designed user-centred tool offering easy installation, a GUI for interaction and the necessary documentation (see Table 3-4). Although all tools in the table provide some sort of documentation, the fundamental theory or concept behind the modelling method is not included for DIVA-GIS, R (regression based models), MaxEnt and DesktopGarp.

Box 3-2: Typical questions for development of a species distribution modelling tool regarding user profile (based on Johnson, 2010; Lewis and Reiman, 1994; Maguire et al ., 1998)

• What is the intended users’ qualification (expertise/field)?
• What special skills do users possess (e.g. regarding GIS, statistics)?
• Do users have experience with similar tools (e.g. one or many other SDM tools)?
• How much IT experience do the users have with e.g. command-line and graphical user interfaces, operating systems?
• What do users know about the details of tasks (e.g. basic and advanced pre- and post-processing in GIS)?
• Do users have previous training (in performing a similar task or using a similar system)?
• What is the frequency of use?
• What are the common terminologies used by the users (in GIS, statistics)?
• What are the factors influencing the users’ motivation or discretion to use a certain tool?

3.4. User experience

Bad usability can cause otherwise good software to fail, but good usability may not guarantee a pleasant user experience (Kuniavsky, 2010). It is the quality of experience that leads users to acceptance or rejection (Buxton, 2007). User experience (UX) is directly related to usability but takes a different perspective. UX is an emotional consequence of good or bad usability design. It is often subjective and very personal, and differs from one user to another (Bevan, 2009; Hasenzahl, 2003). UX is developing as a core concept in the perception of usability (Carroll, 2004; McCarthy and Wright, 2004). “‘Experience’ is an elusive concept that resists specification and finalisation” (Wright et al., 2003, p44). Experience is constructed through the repetitive use of the product by creating mental models. It cannot be disassembled into discrete key elements. Hence, designing an experience is difficult, and may not even be possible, “but with a sensitive and skilled way of understanding our users, we can design for experience” (ibid., p52). For software developers and designers, system software providers like Microsoft, Apple and others provide style guidelines (e.g. Apple, 2009; Microsoft, 2010), e.g. for designing GUI elements and interactions based on their APIs. The user-interface design rules and guidelines are mostly based on research on human-computer interaction involving psychology and the cognitive system (Johnson, 2010). One of the objectives of these guidelines is to provide a ‘look and feel’ which is consistent within the system environment and offers a pleasant experience; ‘look’ stands for the visual perception of the design, while the emotional or experiential part is accounted for by ‘feel’ (Nielsen, 2003). However, and interestingly, Microsoft and Apple violate their own published guidelines (Cooper, 2004).

Table 3-4: User-centred features of some selected SDM tools

| Feature | BIOCLIM (Diva-GIS) | ENFA (Biomapper) | GLM/GAM/BRT (R/S-plus) | MaxEnt | GARP (DesktopGarp) |
|---|---|---|---|---|---|
| Installation | easy | easy | easy a | easy | not as easy b |
| Software interface | GUI/CLI | GUI | CLI | GUI/CLI | GUI |
| Documentation: fundamental theory | yes c | yes | yes c,d | yes c | yes c |
| Documentation: how to (manual) | yes | yes | yes d | yes | yes |
| Documentation: help integration in UI | – | yes | yes d | yes e | yes f |
| Operating system | Windows | Windows | Windows and UNIX based systems | Windows and UNIX based systems* | Windows up to XP |

a full package or separate individual download (careful about dependencies if not connected to Internet)
b missing installer component (error message not understandable)
c literature available in Internet (single source may not be sufficient to understand)
d individual package specific (package dependent, i.e. if the developer provided)
e not easily understandable (a bit technical)
f single file, same as ‘how to’ (manual), although installed, tricky to get it displayed or work
* requires Java virtual machine (Java interpreter)

3.4.1. Cognition

Experience is often related to cognition, i.e. what and how users perceive. Although technologies currently change rapidly, the fundamentals of people’s perception and thinking do not change at the same speed. Hence, knowledge of cognitive responses and behaviour can help in better designing the interaction (Johnson, 2010). People learn fast when the operation is task-focused, simple and consistent. Users have to translate the task mentally into the operations offered by a tool. This cognitive process forces users to focus on the requirements of the tool first and then to re-focus on the task. A simpler conceptual model will provide an easy translation of tasks to functions or operations and vice versa (ibid.). The smaller the difference between the actual task and the function, the faster the user can re-focus on the task, and hence learnability increases. The difference between task and function can be reduced if the vocabulary or terminology used for functions is mapped closely and consistently to the task (ibid.). As such, metaphors play an important role in the human (users’) cognition system. Tapping the key concepts and presenting them, when relevant, as suitable ‘cognitive’ 6 metaphors (Blackwell, 2006) offers a better experience (Kuniavsky, 2010).

6 generally metaphors are mostly considered as linguistic-metaphors; here the word is used to refer to the mental processes (as explained by Blackwell 2006, p 494).

3.4.2. Metaphors

Since the invention of the ‘Desktop metaphor’ (Kim, 2004) and the predominant use of GUIs in most interactive software, icons have been one of the most used elements to express metaphors (Cooper et al., 2007). Well-designed icons offer effective metaphors for the functions on menus and on buttons in toolbars. Toolbar buttons may also contain a short text explaining the button’s function, displayed either as a tool-tip or as text like in menu items; however, pictorial representations are grasped more quickly by users. Although icons have served well in describing or giving hints on the function, they, as cognitive metaphors, are also contextual (Passini et al., 2008; Salman et al., 2012), similar to linguistic metaphors (e.g. the refresh button in Web browsers looks similar to the undo/redo button in Office software). Such similar-looking icons have different functionalities, yet users perceive them differently according to the context. Human perception is biased by the past (the user’s experience), the present (the current context in which something is being used) and the immediate future (the goal which users want to achieve) (Johnson, 2010). As the user would use only one software at a given time, there will be only one valid context.

3.4.3. Emotions

The first few uses of a software create a cognitive model about the software which is a basic exploratory phase. After that, the process transforms to an experiential or behavioural phase where emotions are created (Ma et al ., 2009). Emotions can be positive (e.g. fun, excitement) or negative (e.g. frustration, disappointment) associated with and resulting from needs within the context of use (McCarthy and Wright, 2004). Positive emotions are the consequences of e.g. fulfilment of goals or satisfaction in a (challenging) situation, whereas difficulty in achieving the goal induces negative emotions (Hasenzahl, 2003).

Fun – Incorporating fun, as one of the emotional aspects complementing usability, can act as an attraction for making people use a system (Carroll, 2004). Shneiderman (2004) discusses three goals when designing software: a) offer the right functions to achieve goals, b) provide better usability and reliability to avert frustration, and c) engage users with pleasant features. The first goal is related to functionality (see 3.2.1). For the second goal, Shneiderman and Plaisant (2005) suggest eight ‘golden rules’ which are similar to the aspects covered in sections 3.2.2 and 3.2.3. The third goal is related to GUI design, where metaphors (e.g. through icons and animations) play a vital role and can present delights and surprises. Surprises, when occurring in a positive way, can be fun. Distractions can also surprise, but only in the short term; they are actually annoying after a couple of instances (Carroll, 2004). “Things are fun when they present challenges or puzzles [… ,] when they transparently suggest what can be done, provide guidance in the doing, and then instantaneous adequate feedback and task closure” (ibid., p39).

Frustration – There is hardly anyone who has not experienced, at times, frustration while using computers or, more generally, interactive systems (Cooper, 2004). The frustration can be due to several reasons such as software crashes, insufficient information, vague error messages, non-responsiveness, etc. These are mainly due to bad or ill-conceived design (Preece et al., 2002). Scheirer et al. (2002) performed a ‘user-testing’ experiment to find out which events are likely to stimulate frustration. Their purpose was to measure physiological and behavioural data so “a system could 'get to know' an individual's patterns of frustration (and other emotion-related responses)” (p115). Although the experiment was designed to deliberately frustrate users, the authors also showed that such experiments can offer designers the opportunity to gather information on which of the system’s functions are likely sources of user frustration. An interesting observation in that experiment was that the level of frustration was lower for those participants who suspected that the intent of the experiment told to them was not true. This suggests that if users are aware of the situation or the system’s response, the degree of frustration felt can be reduced.

Quality-in-use of software offers valuable insights into why some software is preferred by users although better alternatives, in terms of getting better results, may be available. Quality measures are at the core of the ‘context of use’. The measurements are designed to cover several characteristics describing the factors which affect the acceptability of software to users. Usability, with most focus lying on ‘ease of use’, is among the most discussed qualities, while measures such as functionality and reliability are important in determining whether goals can be achieved. Most of the SDM tools lack complete functionality in the sense that they focus only on the statistical modelling part but lack the basic GIS functionalities needed. Software having a high usability score but lacking the necessary functionality or being less reliable cannot be self-fulfilling and can stimulate a negative user experience. Usability and user experience, although different, are close to each other and have complementary effects. Although usability testing may not be possible for every newly developed tool, following established guidelines as well as applying previously gained experience can help in attaining certain usability levels. In the context of SDM tools, more focus is required on functionality, reliability and usability. Since these scores are dependent on the context of use, ‘user-centred design’ is the pivotal process for offering good quality. The user-centred design process not only helps in profiling the targeted user group but also makes sure that the technical requirements as well as the quality-in-use criteria are met at the time of design. While most of the characteristics are covered by the quality aspect, especially usability, the emotional part of using software also plays a role in its being accepted or used widely. User experience, being subjective, cannot be designed. Nonetheless, efforts can be made to design for experience.

4. Developing a robust and easy to use species distribution modelling tool

Species distribution modelling (SDM) has been used for several ecological applications (e.g. see chapter 2.2). However, a universal model that fits for modelling each and every type of species is not available, as shown by e.g. Brotons et al. (2004) and Elith et al. (2006): for data of different species, different models performed differently. With the intent of modelling the Odonata species of Africa, a new modelling tool is being conceived and developed, focusing on usability for organisations like IUCN which can, in parts, include SDM in their workflow of assessing the threat status. This chapter investigates the requirements of modelling tasks, conceives a modelling workflow and presents a modelling tool. While there are models using presence-absence and presence-background sample data (see chapter 2.4 for different sample data types) for discriminatory and deterministic statistical models, to date not a single model offers both presence-absence and presence-background formulations. The tool is called SPEciEs DIstribution modelling (SpeeDi) Tool.

4.1. Work flow concept for geodata processing and statistical modelling for SDM in SpeeDi Tool

One of the important elements to consider during concept development is the type of input and output data. An SDM task mainly involves raster data, one of the two widely used basic forms of geospatial data (vector and raster) (Chapman et al., 2005). However, not all data may be available in raster format, in which case vector data have to be converted into raster format. The input data are mainly: a) species location coordinates, and b) environmental predictors in geospatial format. The primary output of the task is the prediction of species presence or absence within the modelling extent. With the inputs (various environmental geodatasets) and outputs (a basic probability distribution raster and a presence-absence raster) defined, a workflow can be established (Figure 4-1). The process, guided through a graphical user interface (GUI) incorporating a thorough help system, is divided into three steps: a) geodata preparation, b) statistical modelling, and c) post-processing. The pre- and post-processing are done in a GIS. The necessary functions of each step are discussed in chapter 4.6. The modelling step consists of the statistical modelling, which is independent of the GIS, and the creation of the probability distribution raster, which depends on GIS functions. The input and output are in the form of a geo-database. In Figure 4-1, the two sides (left and right) represent the presentation layer (GUI) and their respective APIs, the central part is the logic layer, and the data layer is shown at the top. The gaps between the main GUI and the statistical modelling represent the loose coupling of the different components. Light blue (cyan) is used for the logic and presentation layers that use the DotNET API only, whereas light green (olive) is used for the layers that use both the ArcObjects (ArcGIS Engine⁷) and DotNET APIs. Light orange (beige) represents the GUI.

7 http://www.esri.com/software/arcgis/arcgisengine/ (02-Jul-2013)


Figure 4-1: Conceptual work flow for modelling species distribution in SpeeDi Tool with three steps: pre- processing, modelling and post-processing

4.1.1. Geodata preparation

Since input datasets may come from different sources, in different formats and with different spatial reference systems, they first have to be harmonised so that all datasets are confined to a gridded geospatial region defined by a common spatial reference system and a common grid size. Harmonisation further ensures that the pixel positions match across the stack of geodatasets, which is a basic prerequisite for any GIS analysis (Bernhardsen, 1999). Species sample data may come in a format that is not a standard GIS file format (e.g. plain text files), and these sample locations have to be imported into a GIS. Moreover, when modelling with presence-background samples (see chapter 4.5), background samples have to be generated. Other pre-processing tasks include deleting duplicate records and assigning and updating sample weights.
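The idea of a common grid can be illustrated with a minimal sketch. It is not the harmonisation routine of the SpeeDi Tool (which relies on ArcObjects); the function names, the grid origin at (0, 0) and the convention that row 0 is the northernmost row are assumptions made for illustration only.

```python
import math


def snap_extent(xmin, ymin, xmax, ymax, cell, x0=0.0, y0=0.0):
    """Expand an extent outward so that its edges fall on a grid with
    origin (x0, y0) and cell size `cell` (hypothetical helper)."""
    sxmin = x0 + math.floor((xmin - x0) / cell) * cell
    symin = y0 + math.floor((ymin - y0) / cell) * cell
    sxmax = x0 + math.ceil((xmax - x0) / cell) * cell
    symax = y0 + math.ceil((ymax - y0) / cell) * cell
    ncols = round((sxmax - sxmin) / cell)
    nrows = round((symax - symin) / cell)
    return (sxmin, symin, sxmax, symax), nrows, ncols


def cell_index(x, y, extent, cell):
    """Row and column of the cell containing (x, y).
    Assumption: row 0 is the top (northern) row of the grid."""
    xmin, _ymin, _xmax, ymax = extent
    col = int((x - xmin) // cell)
    row = int((ymax - y) // cell)
    return row, col
```

Once every dataset is resampled to the extent and cell size returned by `snap_extent`, a location maps to the same row and column in every layer of the stack, which is the property the harmonisation step is meant to guarantee.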

4.1.2. Statistical modelling

This stage determines the output of the task. The species data are sampled over the environmental layers (geodata), and the terms for the environmental variables (e.g. the polynomial degree of the regression and products among variables) are selected. The model is trained by estimating the regression coefficients of the environmental variables. After training, the fitted model parameters are used to create a probability distribution raster.
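How polynomial and product terms enter the model can be sketched as follows. This is an illustrative sketch rather than the tool's implementation; the function names and the term naming scheme (`temp^2`, `temp*rain`) are hypothetical.

```python
import math
from itertools import combinations


def expand_terms(row, names, degree=2, products=True):
    """Expand one sample into main effects, powers up to `degree`
    and pairwise products of the environmental variables."""
    out_names, out_vals = [], []
    for name, value in zip(names, row):
        for d in range(1, degree + 1):
            out_names.append(name if d == 1 else f"{name}^{d}")
            out_vals.append(value ** d)
    if products:
        for (n1, v1), (n2, v2) in combinations(list(zip(names, row)), 2):
            out_names.append(f"{n1}*{n2}")
            out_vals.append(v1 * v2)
    return out_names, out_vals


def predict(beta0, betas, values):
    """Logistic response for one expanded sample (or one raster cell)."""
    eta = beta0 + sum(b * v for b, v in zip(betas, values))
    return 1.0 / (1.0 + math.exp(-eta))
```

Applying `predict` to the expanded terms of every cell in the harmonised stack yields the probability distribution raster described above.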


4.1.3. Post-processing

This part can be divided into two different steps: data analysis and presentation.

a) Data analysis – The predicted result is evaluated with e.g. the ROC curve (model’s strength) and the specificity and sensitivity of the prediction (model’s performance in discriminating presence from absence), and is turned into binary presence and absence by applying an optimised threshold value. This is the primary output, which can then be analysed further using spatial analysis techniques.

b) Presentation – Cartographic visualisation is an efficient and intuitive method of communicating geospatial data. The result of the data analysis is presented in the form of a map.

4.2. User-centred design and user profile for the SpeeDi Tool

The conceptualisation of a new SDM tool and its workflow depends partly on the users’ knowledge of similar methods and tools. With the intention of offering both GIS-related functions and statistical modelling, the handling of these two different techniques has to be considered carefully: a lack of familiarity with one of them should not hinder the use of the new tool. The tool is targeted at ecological conservation planners and managers who model species distributions and who may not frequently use either of the technologies involved; however, this target group is only one example, and several other user groups should find it useful too. In order to address different user bases, a basic user profile is assumed (Table 4-1) so that a basic level of user support, e.g. a help system, can be provided. Qualification is an important criterion because it gives an overview of a user’s domain of expertise; the aim is not to find out which academic degree a user holds. Familiarity with different IT systems, such as GUI-based and command line interface (CLI) based modelling tools on different operating systems, can influence the user experience. The tool offers a GUI for the Microsoft Windows operating system, and users are expected to be familiar with the GUI components of that operating system. User profiling also helps to determine how much skill a user commands, for example in GIS and modelling. Since a major part of an SDM task involves working in a GIS, basic knowledge of GIS and of basic GIS operations (e.g. GIS data types, creation of geo-datasets, data type conversion, spatial reference systems) is necessary, at least as concepts if not as practical experience.
Familiarity with advanced GIS operations such as spatial analyses (with raster and vector datasets) can be beneficial, as they are often useful for transforming datasets so that they become ecologically interpretable. Furthermore, GIS has its own terminology, and the same term may be expressed differently in ecology. Previous experience of SDM tasks may help a user to adapt workflows and to handle the new tool easily. Even if a user has no modelling experience, previous training in GIS and species distribution modelling is beneficial. Likewise, the frequency of use of GIS and SDM is central to designing a new tool, especially when considering how to expose the functions and their input and output parameters so that users can make the most of the functions offered by the tool in innovative ways. This is indirectly associated with experience in GIS, as more experienced users can exploit the potential offered by the tool. Moreover, profiling the user helps in preparing support documents effectively (Robinson, 2009) and thus in reducing knowledge gaps in skills relating to GIS and modelling.

Table 4-1: Assumed user profile for using the SpeeDi Tool

| Attribute | Sought |
|---|---|
| Qualification/expertise | basic GIS and SDM knowledge |
| IT experience: CLI and GUI | GUI helpful |
| IT experience: operating system | MS Windows |
| GIS experience: basic operations | required |
| GIS experience: advanced spatial analysis | not essential |
| Knowledge on GIS specific terminologies | helpful |
| Experience/knowledge with SDM | knowledge required, experience not essential |
| Previous training on GIS / SDM | beneficial (not essential) |
| Frequency of use of GIS / SDM | low (to high) |

4.3. Architecture for the modelling tool

A layered architecture is selected with three main layers: presentation layer, logic layer and data layer (see Figure 4-1). The layered approach makes it easier to update only the required part if and when necessary (Microsoft, 2009). It allows functionality to be programmed in the traditional, task-oriented way without compromising the current focus on usability and user experience, and thus integrates the task-centred and user-centred approaches. For example, changes in the code of the user interface will not affect the working of the logic and vice versa. This helps when changes made for an improved user experience can be confined to the presentation layer, or when fixes for previously unnoticed bugs can be confined to the logic layer (ibid.).

Presentation layer – The presentation layer (Figure 4-1, beige coloured left and right boxes) provides the visual layout of the tool, offering user interaction and visualisation of data. It is the only layer visible to the user and is responsible for most of the user experience while interacting with the application. The visual elements are used to display data and information as well as to allow interaction by users. For displaying geodata, two different views (see Figure 4-2) are offered by GIS components (via the ArcObjects API): a general data view in the map coordinate system and a map layout view in the paper coordinate system. Since geodata contain attribute data, an attribute view is also necessary; it is provided through a presentation logic component. The presentation logic queries the data, binds it to an entity (e.g. a data table) and displays it via a presentation component, for example as a table.


Logic layer – The logic layer (Figure 4-1, cyan and olive coloured central part) is the core for retrieving, processing and managing the data. The presentation layer collects user inputs and passes them to the application. When a complex set of actions involving several functions is needed (e.g. for the harmonisation of data, chapter 4.1.1), the layer executes the necessary functions in the required sequence. The logic layer is also responsible for internal (tight) and external (loose) coupling. In the tool, GIS-related functions are tightly integrated, whereas the communication between the GIS component and the statistical modelling component is loosely coupled. Data communication in the loosely coupled environment follows two strategies: a) data serialisation (Weisfeld, 2009) for complex objects, and b) data piping (Ritchie, 1980) for simple objects. The output of the statistical modelling is a complex object⁸ containing all the model parameters (coefficients) as well as model-specific tuning parameters (e.g. regularisation controllers), for which data serialisation is an effective means. The response curves of the environmental variables are plotted using the DotNET API; however, surface plots, which show the effects of variable interactions, cannot be created with the same API, and Gnuplot⁹ is used instead. Creating surface plots (e.g. Figure 5-5) needs only part of the model parameters, and data piping (supported by Gnuplot) is the easier option. Data piping, also known as redirection, is a technique of feeding (redirecting) the output of one application directly as input to another application (Ritchie, 1980). GUI coupling is used when input and output parameters have to be set interactively, and a transparent loose coupling (not visible to the user; see Figure 4-1) can also be used.
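The piping strategy can be sketched in a few lines. The sketch below is not the tool's DotNET code; it only illustrates the principle in Python, and the function names, the two-variable model with a single product term and the plotting range of 0 to 1 are assumptions made for illustration.

```python
import subprocess


def surface_script(b0, b1, b2, b12, png_path):
    """Build gnuplot commands for the response surface of a logistic
    model with two variables and one product term (illustrative only)."""
    lines = [
        "set terminal png",
        f"set output '{png_path}'",
        f"eta(x,y) = {b0} + {b1}*x + {b2}*y + {b12}*x*y",
        "p(x,y) = 1.0/(1.0 + exp(-eta(x,y)))",
        "set isosamples 30",
        "splot [0:1][0:1] p(x,y)",
    ]
    return "\n".join(lines) + "\n"


def pipe_to(command, text):
    """Start `command`, feed `text` to its standard input (the pipe)
    and return whatever it writes to standard output."""
    done = subprocess.run(command, input=text, capture_output=True,
                          text=True, check=True)
    return done.stdout
```

Assuming a `gnuplot` executable is available on the search path, a call such as `pipe_to(["gnuplot"], surface_script(-1.0, 2.0, 0.5, 1.5, "surface.png"))` would hand over only the few coefficients needed for the plot, without writing an intermediate file.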

Data layer – This layer is responsible for handling the different types of data (Figure 4-1, top cylindrical shapes). Attribute data also have to be written and read for modelling (modelling data in Figure 4-1). This includes how the attributes are arranged, how interactions are defined and how the original geodata are referred to in the interactions. The ArcObjects API provides the logic for handling the different types of geospatial data. Attribute data are stored in plain text form. To store the fitted model parameters of the statistical modelling part, a new internal data format is created which allows easy retrieval of the parameters. The new data format is saved in a binary encoded¹⁰ file for use in the SpeeDi Tool. To allow interoperability with other software, the same data are also saved in SOAP-XML format.

4.4. GUI design

The main application is designed to embed GIS functionality using ArcGIS Engine, whereas the statistical modelling is loosely coupled and accessed from the main GUI. The statistical modelling framework has no linkage to or dependency on ArcGIS Engine, so that it can also function where ArcGIS Engine is not available. However, creating the prediction raster still needs ArcGIS Engine. By using the loose coupling mechanism, the statistical modelling (binary logistic regression) is integrated into the GIS-based GUI. In order to achieve a harmonised user experience, the GUI of the statistical modelling part is designed with a ‘look-and-feel’ similar to the main GUI, so that the user is unaware of any coupling mechanism. The main GUI is divided into five main visible components (Figure 4-2): a) menu, b) toolbars, c) geodata TOC (table of contents), d) geodata and map layout view, and e) pop-up menus.

8 Most of the data are in the form of Matrix and Vector; the Matrix uses program code from http://www.codeproject.com/Articles/5835/DotNetMatrix-Simple-Matrix-Library-for-NET (26-Jan-2013)
9 http://www.gnuplot.info (13-Jan-2012)
10 Binary encoded: readable by machine (computer) but not meaningful for human reading

Figure 4-2: Different components in the main GUI of the SpeeDi tool.

The menu offers functions for saving and opening an ArcGIS map document as well as loading, editing and saving general options for SDM tasks (e.g. pre-defined cell-size, spatial reference system). It also facilitates some advanced pre- and post-processing functions as well as printing utilities. Two distinct toolbars are present, one for general navigation of geodata displayed in the geospatial view and another with a set of functions related to SDM task. The functions for SDM task are grouped and arranged in a sequence similar to the modelling work flow of pre-processing and modelling. The geodata TOC shows which geodata are loaded, their cartographic representations which are visible (displayed) in the geospatial view, and in which order they are arranged or displayed. The geospatial view offers the mechanism to view the geodata (defined by some cartographic representation) facilitated in two tabs: a) geodata view - for viewing the data in detail, and b) print or map layout view - for viewing the data as if it is a printed version in a specified paper size and format, i.e. page layout. Pop-up menus are offered on the geodata TOC providing specific functions related to geodata. These menus are context sensitive, i.e. the functions in menu differ based on the type of data (vector or raster).

[Figure 4-3 labels: ‘Different visual elements for specifying input and output options’; buttons ‘OK’, ‘Cancel’, ‘Help’]

Figure 4-3: Common layout of dialog-boxes (top left) for pre- and post-processing functions in SpeeDi Tool; an example of dialog-box for running local function (top right) and displaying the help associated with the function when the ‘Help’ button is clicked (bottom)

Apart from the main GUI, the pre- and post-processing functions are also designed with an interactive GUI. Although the input and output parameters differ between functions, the general layout is kept similar (Figure 4-3) to offer a consistent appearance, which increases learnability (see chapter 3.2.3). A large part of the window (dialog-box) holds the inputs for the different parameters, and three buttons are usually arranged at the lower right corner to accept (OK), dismiss (Cancel) and seek help (Help). On clicking the help button, the integrated help shows the description of the function related to the current context (task). The availability of context-related help offers a better user experience (see chapters 3.2.3 and 3.4), as the detailed documentation is only a ‘click of a mouse button’ away. Further, feedback is provided for the input parameters by offering only those options that are valid given the other selected options. This reduces the user error rate and increases user performance (see chapter 3.2.3). Tool-tips are among the elements that enhance usability (see chapter 3.2.3) by assisting learnability, operability and memorability, and they also help to increase users’ efficiency (see chapter 3.2.4).


Some parameters do not change within a particular task (e.g. grain size, spatial reference system), and having to set them every time a function is used before the task can be continued is annoying and induces frustration. With the focus on simplicity and ‘ease-of-use’, global settings for such parameters are saved; they are loaded at start-up and can be modified at any time. Whenever possible, a set of default values (Figure 4-4) is offered that is likely to work well most of the time, thus providing centralised settings as a core experience (see chapter 3.4). Users can nevertheless easily change these default values.

Figure 4-4: Setting default preferences in the tool, accessible via menubar; left: for logistic regression modelling most of them are related to the output graphs, and right: for modelling task related mainly to spatial properties

4.5. Logistic regression with presence, absence and background data

Binary logistic regression is one of the popular statistical models for calculating a probability based on recorded events. Its predicted output value is bounded between zero and one, which makes it an attractive solution (Hosmer and Lemeshow, 2000). More importantly, no assumption has to be made about the statistical distribution of the predictor variables (ibid.).

4.5.1. Formulating binary logistic regression model

As the name suggests, the binary logistic regression follows binomial distribution and is calculated by equation 1 (Hosmer and Lemeshow, 2000). For SDM, it is the probability of having a favourable environmental condition at a given location for finding a species.


$$\mathbb{P}(y = 1 \mid x) = \frac{e^{\eta(x)}}{1 + e^{\eta(x)}} \;\Longrightarrow\; \operatorname{logit}\,\mathbb{P}(y = 1 \mid x) = \eta(x) \tag{1}$$

where $\eta(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$, $x = (x_1, \dots, x_k)$ are the environmental predictor variables and $\beta_0, \dots, \beta_k$ are the regression coefficients.

However, one of the challenges in using binary logistic regression for predicting a species distribution is, in most cases, the lack of absence data. Innovative methods such as using pseudo-absences (see chapter 2.4 and Table 2-2) are used for formulating models. Ward et al. (2009) explored the concept of expectation-maximisation and introduced a novel method of using background data instead of pseudo-absences. By using the case-control condition (Hosmer and Lemeshow, 2000), the logit of the presence-background data is calculated by equation 2 (from Ward et al., 2009, p557, equation 3):

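$$\operatorname{logit}\,\mathbb{P}(y = 1 \mid x, s = 1) = \eta(x) + \log\!\left(\frac{n_p + \pi\, n_u}{\pi\, n_u}\right) \tag{2}$$

where $n_p$ is the number of presence samples, $n_u$ the number of background samples, $\pi$ the population prevalence, and $s = 1$ indicates that a location is part of the sample.

Equation 2 is a special form of the case-control condition in McCullagh and Nelder (1989, chapter 4.3.3, p111) in which the number of absences is zero. By using the same concept used to derive equation 2 and including the number of absences, the logit of presence-absence-background can be expressed (equation 3) as: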

$$\operatorname{logit}\,\mathbb{P}(y = 1 \mid x, s = 1) = \eta(x) + \log\!\left(\frac{(n_p + \pi\, n_u)\,(1 - \pi)}{\pi\,\bigl(n_a + (1 - \pi)\, n_u\bigr)}\right) \tag{3}$$

where $n_p$, $n_a$ and $n_u$ are the numbers of presence, absence and background samples and $\pi$ is the population prevalence; for $n_a = 0$, equation 3 reduces to equation 2.

It is to be noted that in equations 2 and 3 the logit (log-odds) is simply shifted by a computable value which does not change during the iterative steps. The SpeeDi Tool has one additional optional parameter, the ‘soft-buffer-threshold’ (SBT). Since it is not possible to quantify the exact population prevalence of a species in the pool of background samples (Ward et al., 2009), the SBT (equation 4) offers a heuristic mechanism for discriminating presences in the background. If the SBT option is set, an additional update of the probability value is performed for each background sample during the iterative process. If the probability value of a sample is greater than the assumed population prevalence (π), then the maximum of the probability value ($p$) and the mean of π and the average probability value ($\bar{p}$) of all samples is assigned. Here, $\bar{p}$ is the prevalence value of all the samples in the iteration. Figure 4-5 shows how the probability values change based on the initially assumed value of π when the SBT is applied (the axes represent the population prevalence). The diagonal line, representing the unadjusted case (SBT not applied), shows the probability value calculated from equation 2, whereas the other four lines (SBT applied) show the SBT-adjusted values for the respective assumed population prevalence (π). These lines show two regions deviating from the main diagonal: the upper buffer region for background samples favoured as presences and the lower buffer region for background samples favoured as absences (see also equation 4).
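Because the shift in equations 2 and 3 is constant, it can be computed once before the iterations start. The following sketch assumes that the shift has the case-control form log[((n_p + π n_u)/π) / ((n_a + (1 − π) n_u)/(1 − π))], which equals the presence-background shift of Ward et al. (2009) when no absences are available; the function name and argument names are hypothetical.

```python
import math


def background_offset(n_p, n_u, pi, n_a=0):
    """Constant added to eta(x) under the assumed case-control form.
    With n_a == 0 this is log((n_p + pi*n_u) / (pi*n_u))."""
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must lie strictly between 0 and 1")
    cases = (n_p + pi * n_u) / pi                      # presences per unit prevalence
    controls = (n_a + (1.0 - pi) * n_u) / (1.0 - pi)   # absences per unit (1 - prevalence)
    return math.log(cases / controls)
```

For example, with 100 presences, 1000 background samples and an assumed prevalence of 0.5, the shift is log(600/500) = log(1.2).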

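$$p' = \begin{cases} \max\!\left(p,\ \dfrac{\pi + \bar{p}}{2}\right), & p > \pi \\[2ex] \min\!\left(p,\ \dfrac{\pi + \bar{p}}{2}\right), & \text{otherwise} \end{cases} \tag{4}$$

where $p$ = probability value of a background sample, $\bar{p}$ = average probability value of all samples, $\pi$ = assumed population prevalence, and $p'$ = SBT-adjusted probability value.

Figure 4-5: Profiles of SBT adjustment for background samples for assumed population prevalence (π) = 0.3. The x-axis represents the original value and the y-axis the adjusted value. The legend ‘y_0.n’ represents the average probability value of all samples.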
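The SBT update described in the text can be written as a small function. This is a sketch of the rule as stated above, not the tool's code: the function name is hypothetical, and the treatment of a sample whose probability equals π exactly (handled here by the lower branch) is an assumption.

```python
def sbt_adjust(p, pi, p_mean):
    """Soft-buffer-threshold update for one background sample.
    p      : current probability value of the sample
    pi     : assumed population prevalence
    p_mean : average probability value of all samples in the iteration"""
    mid = (pi + p_mean) / 2.0
    if p > pi:
        return max(p, mid)   # upper buffer: favoured as presence
    return min(p, mid)       # lower buffer: favoured as absence (assumed for p <= pi)
```

With π = 0.3 and an average probability of 0.5, a sample at 0.35 is lifted to 0.4, whereas samples at 0.6 or 0.2 are left unchanged.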

4.5.2. Control mechanism to counter over-fitting in regression model

One of the concerns when using regression is over-fitting of the model. Over-fitting can be handled effectively and reduced by applying regularisation, also known as a penalty or norm, to the estimated coefficients. The commonly applied penalty term is the L2-norm (equation 5); a popular choice is ridge regression, which shrinks the coefficients by using the sum of squares of the coefficients as the penalty (Hastie et al., 2009). Another penalty term used in regression is the L1-norm (equation 6), often applied in the form of the least absolute shrinkage and selection operator (LASSO; Tibshirani, 1996). The L1-norm shrinks the coefficients in a slightly different way from the L2-norm: here the sum of the absolute values of the coefficients is the penalty. LASSO performs very well even when the number of predictor variables exceeds the number of observations (Hastie et al., 2009). The effort to combine the L1- and L2-norms led to the development of the elastic-net penalty (Zou and Hastie, 2005), which uses the elastic-net factor (α) as a controlling parameter to set the strength of its L1 and L2 characteristics (equation 7). This makes the elastic-net penalty an interesting, effective and efficient hybrid L1-L2 penalty option for controlling over-fitting of a model. Thus, the elastic-net is adopted as the regularisation technique for the SpeeDi Tool so that any of the L1, L2 and elastic-net penalties can be applied.

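$$\hat{\beta}^{\,\mathrm{ridge}} = \underset{\beta}{\operatorname{argmin}} \left\{ \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{k} x_{ij}\,\beta_j \Bigr)^{2} + \lambda \sum_{j=1}^{k} \beta_j^{2} \right\} \tag{5}$$

where $N$ = number of observations, $k$ = number of predictor variables, $y_i$ = response of observation $i$, $x_{ij}$ = value of predictor $j$ for observation $i$, $\beta_0$ = intercept, $\beta_j$ = coefficient of predictor $j$, and $\lambda \ge 0$ = penalty (regularisation) factor.

$$\hat{\beta}^{\,\mathrm{lasso}} = \underset{\beta}{\operatorname{argmin}} \left\{ \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{k} x_{ij}\,\beta_j \Bigr)^{2} + \lambda \sum_{j=1}^{k} \lvert \beta_j \rvert \right\} \tag{6}$$

$$\hat{\beta}^{\,\mathrm{enet}} = \underset{\beta}{\operatorname{argmin}} \left\{ \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{k} x_{ij}\,\beta_j \Bigr)^{2} + \lambda \sum_{j=1}^{k} \bigl( (1 - \alpha)\,\beta_j^{2} + \alpha\,\lvert \beta_j \rvert \bigr) \right\} \tag{7}$$

where $\alpha$ = elastic-net factor ($\alpha = 0$ gives the L2 penalty of equation 5 and $\alpha = 1$ the L1 penalty of equation 6).

The tool uses the regularisation path algorithm for binomial logistic regression (LOGNET) implemented in GLMNET by Friedman et al. (2010), and a minimum of five models in the path is set. The regularisation path starts with the highest penalty value (λ), which is determined automatically based on the total number of samples, and decreases along a log-linear path. One advantage of path-based regularisation is that model coefficients can be interpolated for penalty values lying between the computed regularisation factors (ibid.). The most useful advantage, however, is that the path stops when a model cannot improve on the previous one, i.e. the number of models will then be smaller than the specified number and the last model is the best-fit model. This helps in finding the optimal regularisation within the specified number of models. A higher number of models can therefore be useful but requires more time to run.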
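The penalty term and the log-linear sequence of penalty values can be illustrated as follows. This is a sketch under stated assumptions, not the GLMNET code: the penalty is written as λ Σ((1 − α) β² + α |β|), whereas GLMNET itself scales the L2 part by one half, and the parameter `ratio` (smallest penalty as a fraction of the largest) as well as both function names are hypothetical.

```python
import math


def elastic_net_penalty(betas, lam, alpha):
    """Penalty value for coefficients `betas` (intercept excluded).
    alpha = 0 gives the L2 (ridge) penalty, alpha = 1 the L1 (LASSO) penalty."""
    l1 = sum(abs(b) for b in betas)
    l2 = sum(b * b for b in betas)
    return lam * ((1.0 - alpha) * l2 + alpha * l1)


def lambda_path(lam_max, n_models=5, ratio=0.001):
    """Penalty values decreasing from lam_max to ratio * lam_max,
    equally spaced on the logarithmic scale."""
    if n_models < 2:
        raise ValueError("at least two models are needed for a path")
    step = math.log(ratio) / (n_models - 1)
    return [lam_max * math.exp(step * i) for i in range(n_models)]
```

A path of five models with `ratio=0.0001` starting at λ = 1 gives approximately 1, 0.1, 0.01, 0.001 and 0.0001; a fitting routine would walk through these values and could stop early once a model no longer improves.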

4.6. Functions offered by the SpeeDi Tool

Apart from statistical modelling, most of the tasks are pre- and post-processing. The tool is designed to offer ‘ease’ in handling geodata for SDM even for non-regular users of GIS. The functions with brief description and categorised as per the modelling steps are listed in Table 4-2.

4.6.1. Pre-processing in the SpeeDi Tool

The pre-processing step comprises several functions for preparing the datasets before statistical modelling is performed. Importing species locations into a GIS is a basic task. The function imports point data from a text file with tab, comma, colon or semi-colon as column delimiter. Additionally, the decimal separator can be selected (e.g. the English system uses a dot (“.”) whereas the German and French systems use a comma (“,”) as decimal separator). Another input required from the user is the spatial reference system of the recorded coordinates. Environmental geodatasets may come from several sources, in different formats (raster or vector), at different resolutions or scales and in different spatial reference systems. A projected coordinate reference system is a must for analyses involving area and density calculations. Thus, harmonising these data is the very first step. Since SDM primarily works with the raster format, vector datasets have to be converted to raster datasets. Two separate harmonisation functions are provided, one for raster and one for vector datasets. The conversion from vector to raster can be performed with different options, e.g. attribute values can be used for vector land-cover datasets, while a distance-based approach such as a proximity measure (Heywood et al., 2006) can be used for linear features such as rivers. The inputs for the functions are a) the geographical extent, selected from the available geodatasets, b) the target spatial reference system (also from the available geodatasets), c) a common cell-size for the harmonised datasets, d) the destination folder (directory) for storing the harmonised datasets, and e) the datasets to be harmonised. An additional input for vector datasets is the choice of one of three options (based on attribute, buffer or distance) for the conversion to raster. In order to account for unknown spatial variance over a large geospatial extent, geographic coordinates (x and y) can be used as a trend surface (Legendre and Legendre, 1998). The tool provides an option to create such a coordinate dataset, so that a trend surface can be included in any modelling tool that supports polynomial interactions of predictor variables.
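Importing point records with a configurable column delimiter and decimal separator can be sketched as follows. The function name, the column positions and the sample coordinates are hypothetical, and the sketch does not handle the spatial reference system, which the tool asks for separately.

```python
import csv
import io


def read_points(text, delimiter="\t", decimal=".", x_col=0, y_col=1,
                has_header=True):
    """Parse point coordinates from delimited text.
    `decimal` is the decimal separator used in the file ("." or ",")."""
    if delimiter == decimal:
        raise ValueError("delimiter and decimal separator must differ")
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if has_header:
        rows = rows[1:]
    points = []
    for row in rows:
        if not row:
            continue  # skip empty lines
        x = float(row[x_col].strip().replace(decimal, "."))
        y = float(row[y_col].strip().replace(decimal, "."))
        points.append((x, y))
    return points
```

A file written with German regional settings, for instance, would be read with `delimiter=";"` and `decimal=","`.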
In SDM, records can become duplicates for various reasons: the same location is sampled at different times (year, month), records fall in the same grid-cell although they have different coordinates, male and female individuals of a species are recorded separately, etc. Duplicate location records can give false statistical measures and have to be removed. The function in the tool removes duplicate points that lie within a given distance. For the presence-background modelling scenario, random sample points are generated from the landscape as background samples. The tool offers options such as the minimum separating distance between any two points, the number of points and the area (landscape) in which the points are sampled, as well as the choice of pseudo-random number generator and seed value. The seed value is an important parameter when generating random numbers because it ensures the repeatability of the results. Assigning weights to the samples can be useful, e.g. as an alternative to sub-sampling, for reducing or increasing the effect of individual samples. The logistic regression method used by the tool can use weights for the observations. Samples with high uncertainty in the recorded location can be assigned lower weights, and samples in areas with a very low density of points can be assigned higher weights. Further, samples that are very close to one another can have undesired effects on the final result. Such samples can be given lower weights based on, and up to, a certain distance. This option can also serve as an alternative to sub-sampling of densely sampled areas (Araújo and Guisan, 2006). Another use of sample weights is that abundance data can be included in the modelling (Friedman et al., 2010).
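Duplicate removal and seeded background sampling can be sketched as below. Both functions are illustrative only: they work on planar coordinates in a rectangular extent, use a simple quadratic search and rejection sampling, and their names and the `max_tries` safeguard are hypothetical. The tool itself samples from the landscape and offers a choice of generators.

```python
import math
import random


def remove_duplicates(points, min_dist):
    """Keep a point only if no already kept point lies within min_dist."""
    kept = []
    for p in points:
        if all(math.dist(p, q) > min_dist for q in kept):
            kept.append(p)
    return kept


def background_samples(n, extent, min_dist, seed, max_tries=100000):
    """Draw up to n random points inside extent = (xmin, ymin, xmax, ymax)
    that are at least min_dist apart. The same seed gives the same points."""
    xmin, ymin, xmax, ymax = extent
    rng = random.Random(seed)
    points, tries = [], 0
    while len(points) < n and tries < max_tries:
        tries += 1
        cand = (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
        if all(math.dist(cand, q) >= min_dist for q in points):
            points.append(cand)
    return points
```

Calling `background_samples` twice with the same seed returns identical points, which is the repeatability the seed value is meant to provide.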

The last step in the pre-processing is to export the environmental attributes associated with the samples which is also the first step for the statistical modelling.


4.6.2. Statistical modelling using logistic regression in the SpeeDi Tool

Exporting the environmental attribute values of the samples is the first step in statistical modelling. This function covers two steps, as the model terms are also defined here (e.g. number of polynomial terms, interactions among variables). The logistic regression model has a few tuning parameters (see chapter 4.5), namely the elastic-net factor and the population prevalence. A minimum of five models is formed in the regularisation path, but the number can be increased. As the tool uses the GLMNET algorithm (Friedman et al., 2010), the regularisation starts with the maximum penalty for each variable and decreases along the path. Apart from fitting the parameters, the tool generates several outputs related to the sampling data and the selected options (see Figure 4-4, left), e.g. the logistic responses of the fitted parameters, surface-plots (via Gnuplot) for interacting (product) variables, 2-dimensional scatter plots of all pairs of variables, histograms of the main effects (primary environmental variables), the Receiver Operating Characteristic (ROC) curve and the sensitivity-specificity graph (examples are presented in chapter 5.3.1). The generation of scatter plots, histograms and surface-plots is optional. Scatter plots help in analysing the relationships between the variables for presences and backgrounds as well as in visualising outliers, if any, in the data. The histogram of presences shows the distribution of the data over the range of an environmental variable. A surface-plot between two variables is useful for visualising the effect or response of the interaction term of the two variables. The area under the ROC curve is a measure of a model’s discrimination ability (performance) (Hosmer and Lemeshow, 2000).
The sensitivity curve shows the accuracy of predicting the presences at different threshold levels, and the specificity curve shows the accuracy of predicting the absences at the corresponding thresholds. Thus, the intersection of the sensitivity and specificity curves provides the optimum cut-off value, which corresponds to the common maximum accuracy for the binary classification of both classes (ibid.). Based on the standardised coefficients of the environmental variables, a ranking of the variables is made (Grömping, 2006) and stored in a text file. The ranking is done for each term in the model, with separate ranks for the main effects and for each interaction term. An alternative ranking is also provided in which the main effect is combined with its polynomial terms. The model coefficients are saved in two different file formats: as an object in binary encoded format, and as SOAP-XML. The SOAP-XML format serves interoperability: XML has become a standard file format for exchanging data, and the SOAP format makes it easy to reconstruct the underlying object from an XML file (Bosworth et al., 2003). The binary encoded object is used by the tool to read the model parameters (e.g. elastic-net factor, assumed population prevalence, number of presence, absence and background samples, number of regularisation path models, and coefficients in general and standardised form). Since neither the binary encoded format nor the SOAP-XML format is readily human readable, a function to view the model parameters in a list is also included in the tool. The next step in modelling is to project the model with the fitted coefficients onto a raster dataset, i.e. to create a raster dataset from the selected fitted logistic regression model.
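Finding the cut-off at the intersection of the sensitivity and specificity curves can be sketched as a simple scan over candidate thresholds. The function names and the fixed number of steps are assumptions for illustration; the sketch also assumes that both classes are present in `labels` and that a sample is predicted as present when its probability is at least the threshold.

```python
def sensitivity_specificity(probs, labels, threshold):
    """Sensitivity and specificity of the binary classification obtained
    by calling a sample 'present' when its probability >= threshold."""
    pairs = list(zip(probs, labels))
    tp = sum(1 for p, y in pairs if y == 1 and p >= threshold)
    fn = sum(1 for p, y in pairs if y == 1 and p < threshold)
    tn = sum(1 for p, y in pairs if y == 0 and p < threshold)
    fp = sum(1 for p, y in pairs if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def crossing_threshold(probs, labels, steps=100):
    """First threshold on a regular grid in [0, 1] at which the gap
    between sensitivity and specificity is smallest."""
    best_t, best_gap = 0.0, float("inf")
    for i in range(steps + 1):
        t = i / steps
        sens, spec = sensitivity_specificity(probs, labels, t)
        gap = abs(sens - spec)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t
```

The returned value can then be used as the threshold for converting the probability raster into a presence-absence raster.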

4.6.3. Post-processing functions in the SpeeDi Tool

The post-processing functions are accessed from the main menu. The first part of post-processing is to convert the raster dataset of probability values into a binary presence-absence raster dataset by applying a threshold, for which the tool provides a function.


The appropriate threshold value can be obtained from the sensitivity-specificity graph and, for the users’ convenience, the tool shows the value in the graph together with the accuracy of classification. The classified raster dataset can then be used for several spatial analyses. One of them is applying sum (a local function) to the datasets of several species to create a dataset showing species-rich areas (hotspots). However, caution is advised when analysing such hotspots, as Calabrese et al. (2014) argued that a different approach is a better method for determining species hotspots from probability distribution maps.

Several strategies can be used to evaluate the predicted range or to set up hypotheses. For example, if a species is known to occur only within a certain elevation range, the areas falling within this range can be extracted by using ‘filter-out values’: first retaining all the pixels above the lower limit and then retaining the remaining pixels below the upper limit. By overlaying the predicted presences on the filtered elevation range, it can be visually assessed whether the predicted range falls within this bound. Histograms can be useful in analysing the distribution of a species within the range of the environmental variables. A function is therefore available to generate histograms which sample all the environmental values from the raster datasets within the predicted range. Since the tool also generates histograms of the presence samples against the environmental variables, it is possible to visually compare the histograms from the sample data with the histograms from the predicted range.

Another part of post-processing is the creation of maps as a means of communicating results. The tool facilitates creating simple maps via functions for cartographic representation (symbology) and page layout (print view). Functions for adding and interactively editing texts and a scale bar are also included. Layout maps can be printed directly or exported as EPS, JPEG, PDF, PNG and TIFF files. A function is also included to export each layer (dataset) as an image file, the advantage being that maps of several species can be exported in a batch.
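The threshold, local sum and ‘filter-out values’ operations described in this section can be illustrated on small arrays. The following is a sketch under stated assumptions, not the tool's ArcObjects-based implementation: the function names, the cut-off of 0.25 and the tiny rasters are hypothetical, and the two-step elevation filter is collapsed into a single call.

```python
import numpy as np

def apply_threshold(prob, cutoff):
    """Binary presence (1) / absence (0) raster from a probability raster."""
    return (np.asarray(prob) >= cutoff).astype(np.uint8)

def filter_range(raster, lower, upper):
    """Retain only cells with values inside [lower, upper]; others become NaN."""
    raster = np.asarray(raster)
    out = raster.astype(float).copy()
    out[(raster < lower) | (raster > upper)] = np.nan
    return out

# Hypothetical 2x3 probability rasters for two species.
prob_a = np.array([[0.10, 0.30, 0.80], [0.05, 0.60, 0.20]])
prob_b = np.array([[0.40, 0.10, 0.90], [0.70, 0.50, 0.10]])

# Local function "sum" over the binary rasters gives a species richness raster.
richness = apply_threshold(prob_a, 0.25) + apply_threshold(prob_b, 0.25)

# Elevation filter: keep only cells between 0 m and 1800 m.
elev = np.array([[200, 1500, 2500], [900, 1900, 100]])
in_range = filter_range(elev, 0, 1800)
print(richness)
print(in_range)
```

Cells with a high richness value would be candidate hotspots; the caveat by Calabrese et al. (2014) on summing thresholded maps still applies.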


Table 4-2: Functions offered by the SpeeDi Tool for species distribution modelling

| Category | Task | Function* | Description |
|---|---|---|---|
| Pre-processing | Importing species location data | Import points | Creating point dataset from x-y coordinates in a text file |
| | Harmonising geodata | Harmonise raster | Harmonise raster to a common spatial reference system, extent and cell size |
| | | Harmonise vector | Convert to raster using one of three different options (distance, buffer, attribute) and harmonise the raster to a common spatial reference system, extent and cell size |
| | Creating distance raster | Create distance raster | Distance in terms of number of cells, or inverse of number of cells {as X or 1/(1+X)} |
| | Creating coordinate raster | Create raster from coordinates | Create raster data as coordinates as specified in preference [a) map coordinate in X or Y direction, b) pixel row or column] |
| | Removing duplicate points (samples) | Remove duplicate points | Keep only points whose distance to the nearest point is farther than the specified distance |
| | Generating random sample points | Generate random points | Generate the specified number of random points within the extent of the selected raster or vector dataset, with points separated by at least the specified minimum distance and a previously set algorithm (from four available) for the pseudo-random number generator (see footnote 11) |
| | Adding attribute field | Add weight field | Adds a field for storing the assigned weight for each point |
| | Updating attribute value | Update weight | Update weight value based on the distance to the nearest point |
| | Exporting attribute data for logistic regression modelling | Export attribute + | Sample environmental values for the selected species presence, absence and background data with different interaction options (polynomial order of one, two and three; and product of two) for logistic regression modelling |
| | Importing ASCII grid raster | Import ASCII grid | Creates raster dataset(s) from ArcInfo ASCII grid file(s) |
| | Exporting ASCII grid raster | Export ASCII grid | Creates ArcInfo ASCII grid file(s) from raster dataset(s) together with spatial reference information file(s) |
| Statistical modelling # | Fitting logistic regression model | Run Logistic regression | Train the model for computing the parameters with specified options of control parameters and variables |
| | Creating raster from logistic regression model | Create raster | Generate the prediction as raster dataset from the selected fitted logistic regression models |
| | Viewing logistic regression coefficients | View models | View the fitted parameters of LR models in tabular form |
| (Pre- &) post-processing | Applying logical function | Apply binary threshold | Apply threshold criteria for binary classification from the selected filter option |
| | | Filter out values | Retain only pixels with values fulfilling the specified criteria, such as values above or below a specified value |
| | Running mathematical functions on stack of layers (data) per cell | Run local function | Run simple statistical functions on the selected raster datasets with the following functions: i) max (maximum value), ii) min (minimum value), iii) mean (mean value), iv) std (standard deviation), v) sum (sum of values), and vi) combine (combinatorial value) |
| | | Run math function | Run simple arithmetical function on two raster datasets or one raster and a constant value; create raster with the following functions: i) plus (addition), ii) minus (subtraction), iii) multiply (times), and iv) divide (division) |
| | Creating histograms | Histogram from prediction | Create histograms of environmental variables raster datasets based on the raster dataset of predicted distribution range |
| Presentation | Laying out map (page) | Add text | Add texts in the map |
| | | Add scalebar | Add a scale bar in the map |
| | | Add legend | Add a legend in the map |
| | | Export each layer as map | Export each data layer as a separate map to image file (tiff, jpeg, png) |
| | | Page setup | Select paper size for map layout and printing |
| | | Print preview | Preview the map layout |
| | | Print | Print the map layout |
| | | Export map | Export the map layout in one of the following image formats: TIF, JPG, PNG, EPS, PDF |

* functions with icons are accessed via toolbar, without icons are accessed via menu
+ this step belongs to pre-processing and statistical modelling
# all three functions are loosely coupled

11 Uses program codes from http://www.codeproject.com/Articles/15102/NET-random-number-generators-and-distributions (26-Jan-2013)


The absence of a universal method that offers satisfactory results for modelling species distributions is one of the reasons why several tools have been developed. However, many of them focus on the statistical modelling part only, so users have to rely on a separate GIS package for most of the remaining tasks. The users (modellers) are often not accustomed to a GIS environment, and hence this can become cumbersome. The development of the ‘SpeeDi Tool’, which incorporates basic GIS functionalities and statistical modelling, is a step towards simplifying the task for regular as well as non-regular users of GIS. The tool utilises ESRI’s ArcGIS Engine (via the ArcObjects API) for handling geospatial data. It incorporates binary logistic regression modelling with elastic-net regularisation, implemented in a standalone program, via a loose coupling mechanism. The choice of elastic-net regularisation brings the modelling on a par with recent technological developments. The logistic regression modelling is based on the concept of Ward et al. (2009); however, the tool extends it with an improved regularisation: elastic-net instead of the L1-penalty. Apart from the improved and flexible regularisation method, the new tool also focuses on offering a robust method for using presence samples with absence and/or background samples, which opens new possibilities for SDM practice and exploration.

A user-centred approach is used to define, identify and prioritise the basic functionalities, e.g. the GIS pre- and post-processing functions focus mainly on streamlining the workflow. Some complex GIS pre-processing tasks are made to look simpler for novice modellers. Some settings (e.g. cell size, reference system) are sometimes unknowingly overlooked; the tool incorporates them within the processing steps and thus ensures that basic GIS operations are always performed at the correct time. The tasks related to data processing are performed within the interface provided by the tool. Presented in an easy-to-use graphical user interface, the tool is designed to provide various options that users can set or choose, e.g. how the background data are sampled, various options for data pre-processing before modelling and spatial analyses after modelling. The user-centred approach is also beneficial in designing the user-support system, notably in offering an effective integrated help system (focused on offering guidance on capabilities and stimulating ideas for advanced usage of the functions). Although the skeleton of the help system is implemented, its contents (descriptions) are not yet complete. The help system uses an external file, and updating this file will be sufficient for full functionality.

The use of a loose coupling mechanism for integrating GIS and statistical modelling may have some disadvantages, e.g. more input parameters have to be filled in. However, GUI coupling with a harmonised user experience has advantages, e.g. there is no need to switch between the interfaces of a GIS and a statistics program. The use of a standalone logistic regression modelling application not only offers a coupling mechanism; because this application does not rely on ArcObjects, it can also be incorporated easily into other tools as well as run independently. Hence, users modelling several species have the option of using one computer (which has ArcGIS Engine available) for pre- and post-processing and, if desired, several computers (which have no ArcGIS Engine available) for running the statistical modelling.


5. Predicted spatial distribution of Pseudagrion kersteni in Africa: an example of applying the logistic regression based modelling tool

5.1. Odonata and database of African Odonata

Odonata belong to the class Insecta, one of the most diverse groups in the animal kingdom. They are among the better-known groups. The taxonomy is relatively straightforward; they are easy to study and well studied (Dijkstra et al., 2011). Odonata have a quite different dispersal capacity compared to other insects. The dispersal range is negligible in the larval stage, whereas it can be intercontinental for a fully matured adult (Corbet, 1962). Odonata have two different habitat requirements, namely a fresh-water aquatic habitat during the larval stage (nymphs) and a semi-terrestrial and fresh-water habitat during the adult (imago) stage of the life cycle (Clausnitzer et al., 2012). The life-span of Odonata larvae can be up to six years depending on altitude and latitude, but far shorter times are observed for tropical species (Corbet, 1962). Adult odonates prey upon small insects such as ants, mosquitoes and honey-bees as well as on smaller Odonata. Odonata larvae feed on small fish, tadpoles and annelids (e.g. leeches, worms) (ibid.). In turn, they are preyed upon by birds, lizards, frogs and spiders as well as by larger Odonata (ibid.). They are sensitive to habitat morphology and respond strongly to any change in environmental quality above and below the water surface (Dijkstra et al., 2011). Moreover, studies have shown that the presence of Odonata species is positively correlated with overall biodiversity (species richness) (Holland et al., 2011; Sahlén and Ekestubbe, 2001). Hence, Odonata are regarded as suitable species for evaluating natural and anthropogenic change and for evaluating habitat connectivity in the long and short term (Dijkstra et al., 2011). There are 5680 species of Odonata in the world, and more than 700 species are found on the continent of Africa (Clausnitzer et al., 2012; Darwal et al., 2011). The Odonata Database of Africa (ODA; Kipping et al., 2009) has over 80000 records sampled at more than 9000 locations.
Most of the older records are based on literature surveys, field notes and museum collections (Clausnitzer et al., 2012). The last two decades have been the most active period for collecting records, during which more than half of the data have been recorded (V. Clausnitzer, personal communication). Pseudagrion kersteni (Figure 5-1) is widespread in sub-Saharan Africa and has its range on both sides of the equator (Figure 5-2). It occurs in “shady open and half open streams and rivers” and below an elevation of 1800 m (Clausnitzer et al., 2010). Furthermore, P. kersteni is one of the eleven species proposed as an indicator species for monitoring habitat degradation by Suhling et al. (2006) in .

Figure 5-1: Photo of Pseudagrion kersteni (Source: G. Lasley, http://www.inaturalist.org/photos/463938)

5.2. Pre-processing of location data and environmental geodata: setting the modelling scenario

A total of 1026 presence locations of P. kersteni were obtained from the ODA (Figure 5-2), of which 596 samples were collected between 1990 and 2010. Among these 596 samples, 73 records overlap in their point localities (i.e. they are exact duplicates in coordinates).

Figure 5-2: Distribution of Pseudagrion kersteni sample locations (ODA: Kipping et al., 2009) and the distribution range (Clausnitzer et al., 2012)

The extent of continental Africa is considered for modelling. The density of the samples is very high in the southern and south-eastern regions, whereas there are very few samples in the central and south-western regions. There are a few sparsely distributed samples in western Africa. The species is absent north of the Sahel (see Figure 5-2). However, it has to be noted that there are some known data gaps, mainly in and in the east/south-east part of the continent, in southern part, (V. Clausnitzer, personal communication; see also Kipping et al., 2009; Dijkstra et al., 2011) and South in central part. For modelling purposes, 36 climate-related variables, elevation, NDVI, hydrographical features and land-cover are used (see Table 5-1 for details). The NDVI and hydrographical datasets act as proxy variables for available resources, whereas land-cover and elevation represent a basic form of habitat structure variables.

5.2.1. Geodata pre-processing with the SpeeDi Tool for predicting the spatial distribution of P. kersteni

Since the geodata came from distinct sources with different spatial attributes (e.g. cell size, spatial reference system and extent), it is necessary to harmonise them into a common spatial frame. The Lambert azimuthal equal-area projection is selected as the spatial reference system. This system is recommended for maps of continents and regions which extend equally in all directions from the centre, and its equal-area property preserves areas across longitudes and latitudes (Bugayevskiy and Snyder, 1995; Snyder, 1997). A grid cell size of eight kilometres is chosen based on the lowest spatial resolution among the selected environmental variables. Part of the harmonisation process involves converting vector geodata into raster format. The hydrographical features comprise major rivers (as a polyline vector dataset) as well as lakes and wetlands (as raster and/or vector datasets). In the first step the vector datasets are converted to harmonised raster datasets. Similarly, the raster datasets are harmonised to have a common a) spatial reference system, b) spatial extent, and c) grid cell size. In order to derive an ecological meaning from these hydrographical datasets as proxy variables for proximity to resources, distance raster datasets are created.

Table 5-1: List of climatic and environmental geodata used for modelling of P. kersteni and their sources

| geodataset 1 | source | scale | format |
|---|---|---|---|
| Climate 2 | | | |
| average minimum monthly temperature (tmin_xx) | Hijmans et al. (2005) | 30 arc seconds | raster |
| average maximum monthly temperature (tmax_xx) | Hijmans et al. (2005) | 30 arc seconds | raster |
| average mean monthly precipitation (prec_xx) | Hijmans et al. (2005) | 30 arc seconds | raster |
| Habitat | | | |
| land-cover (glc2k) | Mayaux et al. (2004) | 30 arc seconds | raster |
| elevation (elev) | Jarvis et al. (2008) | 3 arc seconds | raster |
| Resources surrogates | | | |
| minimum NDVI (af_ndvi_min) | Tucker et al. (2005) | 8 km | raster |
| hydrographical features (dist2water) | Jenness et al. (2007) | 1:5 million | vector |
| hydrographical features (dist2water) | Lehner and Döll (2004) | 30 arc seconds | raster |

1 variable name is denoted inside brackets
2 xx = 01 for January to 12 for December

Figure 5-3 (top-left) illustrates the cell-distance values used for creating the proximity distance raster (Heywood et al., 2006). The distance is calculated from cell centre to cell centre. The cell in which the feature falls gets the value 0 and a neighbouring (non-diagonal) cell gets the value 1. The raster dataset for the minimum distance to hydrographical features (bottom-right) is constructed by taking the minimum pixel value of the distance to linear features (top-right) and the distance to areal features (bottom-left). The resulting raster thus corresponds to the minimum distance to water resources.
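The construction of the minimum-distance raster can be sketched as follows. This is an illustrative example and not the tool's implementation: it assumes Euclidean centre-to-centre distances measured in cells (so a diagonal neighbour gets about 1.41), uses a brute-force search that only suits small grids, and the masks `rivers` and `lakes` are hypothetical.

```python
import numpy as np

def cell_distance(feature_mask):
    """Centre-to-centre distance (in cells) from each cell to the nearest feature cell.

    Assumes at least one feature cell is present in the mask.
    """
    mask = np.asarray(feature_mask, dtype=bool)
    rows, cols = np.indices(mask.shape)
    feat_r, feat_c = np.nonzero(mask)
    # Distance from every cell to every feature cell, then keep the minimum.
    d = np.sqrt((rows[..., None] - feat_r) ** 2 + (cols[..., None] - feat_c) ** 2)
    return d.min(axis=-1)

# Hypothetical 4x4 grids: a river along the top row and a lake in one corner.
rivers = np.zeros((4, 4), dtype=bool)
rivers[0, :] = True
lakes = np.zeros((4, 4), dtype=bool)
lakes[3, 3] = True

# Per-cell minimum of the two distance rasters gives the distance to water.
dist2water = np.minimum(cell_distance(rivers), cell_distance(lakes))
print(np.round(dist2water, 2))
```

Multiplying the result by the cell size (8 km in this study) would convert the cell distances into map distances.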

Figure 5-3: Illustration of the distance dataset of hydrographical features; top-left: how the pixel values are calculated for linear features based on the cell-distance; top-right: distance raster for rivers; bottom-left: distance raster for areal features (lakes, ponds, wetlands); bottom-right: minimum distance value raster combined from the linear and areal datasets

The location database has records distinguishing the sex of the specimens as well as the year of observation. Thus, samples of a male and a female of the same species at one location, as well as records from different years at the same or a nearby location, can form duplicates. Accounting for these reduced the number of unique sample locations to 604, of which 444 locations have only one sample, i.e. no duplication (see Figure 5-2 and Table 5-2). Since the modelling is done at a spatial grain size of 8 km, some of the samples can fall into the same grid cell, effectively forming further duplicate records. After removing these duplicate records (falling within a radius of 6 km), 496 records remained.

Because sites in southern Africa are intensely sampled and western Africa is poorly sampled, the samples in southern Africa are denser than in any other region (regional density bias). In order to adjust for equal representation of samples throughout, samples are given higher weights in regions of Africa where the sampling intensity is low. This weight is termed here the regional density adjusted weight. As locally clustered samples are still apparent (local density bias), a distance-based factor is applied to the regionally adjusted weights of samples within a certain distance (40 km, i.e. 5x cell size) in order to reduce this effect. The readjustment of the weight is linear, from a factor of 0.5 for samples within 1-cell distance (8 km) to 1.0 for samples exceeding 5-cell distance (40 km). This is similar to the concept mentioned by Araujo and Guisan (2006) as an alternative approach to sub-sampling intensely sampled regions. Here, a distance-based approach is used to assign the weights. As spatial dependence reduces with distance, this approach partially reduces the spatial dependence of locations but does not completely eliminate it.

As the data are available for presences only, 5000 uniformly distributed, randomly generated background samples are created. The ALFG P-RNG is used as the random number generating algorithm. Further, in order not to have any two background samples falling in the same grid cell, a minimum distance of 12 000 m (1.5x cell-distance) is given as a parameter. All the background samples are given a weight of 1.0 (i.e. no preference of one sample over others). As a final step before modelling, the model is defined to contain linear, quadratic and cubic polynomial terms, and products of two variables (interactions) are considered for the continuous variables. The categorical variable (land cover) is used without any interactions. From the available 40 variables, considering each interaction as an independent variable, the total number of variables in the model becomes 859 (858 continuous and 1 categorical with 20 classes). The SpeeDi Tool facilitated all the necessary pre-processing with ease (see the list of functions in Table 4-2).
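The local density adjustment described above can be written as a small function. The sketch below assumes that the factor is interpolated linearly on the distance to the nearest neighbouring sample (as suggested by the ‘Update weight’ function in Table 4-2); the function names and default values are illustrative, not taken from the tool.

```python
def local_density_factor(nearest_km, d_min=8.0, d_max=40.0, f_min=0.5, f_max=1.0):
    """Linear factor: f_min at or below d_min, f_max at or beyond d_max."""
    if nearest_km <= d_min:
        return f_min
    if nearest_km >= d_max:
        return f_max
    share = (nearest_km - d_min) / (d_max - d_min)
    return f_min + (f_max - f_min) * share

def adjusted_weight(regional_weight, nearest_km):
    """Regional density adjusted weight scaled by the local density factor."""
    return regional_weight * local_density_factor(nearest_km)

# Hypothetical sample with regional weight 2.0 and a neighbour 24 km away.
print(local_density_factor(24.0))   # halfway between 8 km and 40 km
print(adjusted_weight(2.0, 24.0))
```

A sample with no neighbour within 40 km keeps its regional weight unchanged, while tightly clustered samples are down-weighted by up to one half.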

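Table 5-2: Total and unique sample locations of P. kersteni in the ODA

| samples | overlapping samples per location | unique sample locations |
|---|---|---|
| 444 | 1 | 444 |
| 180 | 2 | 90 |
| 84 | 3 | 28 |
| 64 | 4 | 16 |
| 45 | 5 | 9 |
| 18 | 6 | 3 |
| 21 | 7 | 3 |
| 24 | 8 | 3 |
| 18 | 9 | 2 |
| 10 | 10 | 1 |
| 12 | 12 | 1 |
| 13 | 13 | 1 |
| 32 | 16 | 2 |
| 45 | 45 | 1 |
| 16* | | |
| Total 1026 | | 604 |

* wrong or no coordinates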
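The number of model terms stated in the model definition of this section (859) can be reproduced with a short calculation. It assumes that the 40 variables consist of 39 continuous variables (36 climate variables, elevation, NDVI and distance to water) plus the categorical land-cover variable, and that every pair of continuous variables forms one product term; the variable names in the code are illustrative.

```python
from math import comb

n_continuous = 39  # 36 climate + elevation + NDVI + distance to water (assumed split)

polynomial_terms = 3 * n_continuous         # linear, quadratic and cubic terms
interaction_terms = comb(n_continuous, 2)   # products of two distinct variables
continuous_terms = polynomial_terms + interaction_terms
total_terms = continuous_terms + 1          # plus the categorical land-cover variable

print(polynomial_terms, interaction_terms, continuous_terms, total_terms)
```

Under these assumptions the calculation gives 117 polynomial terms and 741 interaction terms, which matches the 858 continuous terms and the total of 859 given in the text.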

5.2.2. Logistic regression modelling for predicting the presences of P. kersteni

As conceptualised in chapter 4, the modelling of species distribution needs a few tuning parameters. For the elastic-net factor, a value of 0.2 is assigned. A value closer to zero makes correlated variables contribute equally. Since 0.2 is still not very close to zero, the LASSO characteristic is also retained (0 = no LASSO); thus a sparse output is obtained, filtering out possible noise. Since the tool offers a path-based regularised model (see chapter 4.5), ten models are set to be formed. The use of background samples requires the population prevalence in the background samples as input. The population prevalence is generally unknown, and a value of 0.5 is used for the ‘initial population prevalence’ (i.e. 50 percent of the background samples are assumed to be presences). Further, the option of applying SBT (see chapter 4.5.1) is selected to account for the uncertainty in the prevalence value (see chapter 6.1.2).
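The role of the elastic-net factor can be made explicit with the general form of the penalised objective used by GLMNET (Friedman et al., 2010), on which the tool's algorithm is based. The notation below (sample weights w_i, log-likelihood contribution ℓ, intercept β_0, coefficients β) follows the GLMNET formulation and is not taken from this thesis, so it should be read as a sketch of the general form rather than the tool's exact implementation:

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\sum_{i=1}^{N} w_i\,\ell\!\left(y_i,\; \beta_0 + x_i^{\top}\beta\right)
  \;+\; \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
  \;+\; \alpha\,\lVert\beta\rVert_1\right]
```

With α = 0.2, 80 percent of the mixing weight falls on the ridge term (which shares weight among correlated variables) and 20 percent on the LASSO term (which sets coefficients exactly to zero), while λ controls the overall strength of the penalty along the regularisation path.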

5.3. Result of the modelling

5.3.1. Intermediary output

The primary output of the modelling is a collection of 10 statistical models. The settings (Figure 4-4) are adjusted to generate scatter plots of presences and backgrounds and histograms of presences for the environmental variables. Similarly, the option for generating surface plots with contour lines is selected. Model number 10 (the least regularised among the 10 models) is selected as the final statistical model, and the output mentioned hereafter refers to this specific model. Among other outputs, a ROC-curve and a ‘sensitivity and specificity’ graph (Figure 5-4) with their tables are created, together with files containing the predicted values of both presence and background samples and the ranking of the variables. The AUC value of 0.9056 indicates that the model performance is very good (Hosmer and Lemeshow, 2000), i.e. the output can be relied on. The sensitivity and specificity curves cross at a cut-off value of 23.7326 percent, providing 81.98 percent accuracy; this point corresponds to the optimal cut-point for binary classification, maximising both sensitivity and specificity (ibid.). The graphs include the values of the elastic-net factor (α) and the initial population prevalence (π) given as input for the model run and also show the actual regularisation factor (λ) applied to the model.
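The AUC reported above can be understood as the probability that a randomly chosen presence sample receives a higher predicted value than a randomly chosen background sample. A minimal sketch of this rank-based calculation is given below; the function name and the toy scores are hypothetical, and this is not the routine used by the tool.

```python
import numpy as np

def auc_rank(presence_scores, background_scores):
    """AUC as the share of presence/background pairs ranked correctly (ties count half)."""
    pres = np.asarray(presence_scores, dtype=float)
    back = np.asarray(background_scores, dtype=float)
    # Compare every presence score with every background score.
    diff = pres[:, None] - back[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (pres.size * back.size)

# Hypothetical predicted probabilities.
print(auc_rank([0.9, 0.8], [0.1, 0.2]))   # perfect separation
print(auc_rank([0.6, 0.3], [0.4, 0.5]))   # no better than chance
```

The pairwise comparison needs memory proportional to the number of presences times the number of backgrounds, so for large samples a sorting-based formulation would be preferable.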

Figure 5-4: ROC-curve (left) and ‘sensitivity and specificity’ graphs (right) of the distribution model for P. kersteni

Out of 859 variables, 190 entered the model. When the linear, quadratic and cubic polynomial terms of a variable are combined to count as one (instead of three separate variables), the total number reduces to 178. The five most contributing variables are land-cover (glc2k), minimum temperature of February (tmin_02), maximum temperature of August (tmax_08), the interaction between mean precipitation of March (prec_03) and June (prec_06), and the interaction between precipitation of March and average minimum NDVI (af_ndvi_min) (Table 5-3); their responses are shown in Figure 5-5. The standardised coefficient of a logistic regression offers an easier interpretation of the response. For the model output, with all other environmental conditions remaining at their corresponding mean values, an increase of 1 standard deviation unit in the minimum temperature of February would increase the odds for presence by 0.5603 (i.e. 1 in 1.785). Similarly, an increase of 1 standard deviation unit in the maximum temperature of August would decrease the odds for presence by 0.6414 (i.e. 1 in 1.559). For the land-cover classes, the value ranges from -0.5368 (class 1) to 0.7033 (class 18) (see Figure 5-8 for the class numbers). So, broadleaved evergreen tree cover (class 1) has a probability of around 37 % for finding the species, whereas the mosaic of cropland, shrubs and grass cover (class 18) has a probability of around 66 % (see Figure 5-5a).
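The probabilities quoted for the land-cover classes follow from applying the inverse logit to the standardised coefficients. The sketch below assumes that all other terms, including the intercept, contribute zero, which corresponds to the base reference of 0.5 used in Figure 5-5a; the function name is illustrative.

```python
import math

def inverse_logit(x):
    """Convert a value on the logit (log-odds) scale into a probability."""
    return 1.0 / (1.0 + math.exp(-x))

p_class1 = inverse_logit(-0.5368)   # broadleaved evergreen tree cover
p_class18 = inverse_logit(0.7033)   # mosaic of cropland, shrubs and grass cover

print(f"class 1:  {p_class1:.3f}")
print(f"class 18: {p_class18:.3f}")
```

The results are close to 0.37 and 0.67, in line with the approximate percentages given in the text, and a coefficient of zero maps to the neutral probability of 0.5.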

Table 5-3: Five most contributing variables of the model for predicting the distribution of P. kersteni

| ranking | environmental variable | standardised coefficient* | contribution |
|---|---|---|---|
| 1 | glc2k | 0.7033 (-0.5368) | 16.3 % (9.47 %) |
| 2 | tmin_02 | 0.5603 | 5.32 % |
| 3 | tmax_08 | -0.6414 | 4.73 % |
| 4 | prec_03 * prec_06 | 0.3744 | 4.16 % |
| 5 | prec_03 * af_ndvi_min | -0.3203 | 3.37 % |

* truncated to only 4 digits after the decimal point

The land-cover classes 2, 16, 17, 18, 20 and 22 have positive responses (maximum contribution by class 18), and classes 1, 9, 12, 14 and 19 have negative responses (maximum contribution by class 1) (Figure 5-5 a). When the minimum temperature of February (b) increases above 15 degrees, the probability of presence increases (taking a probability of 0.5 as neutral), and when the maximum temperature in August (c) increases above 30 degrees, the probability of presence decreases. The usual line graph (d) shows that an increase in the value of the interaction of precipitation in March and June increases the probability, but it does not show the overall effect of the interaction (mutual effect). The surface-plot (e), however, reveals the interacting relationship in detail and offers useful information, i.e. an increase in both variables increases the probability. The interaction term between average minimum NDVI and precipitation in March (f and g) shows that a high value of the average minimum NDVI together with a high value of precipitation in March (mutual effect) decreases the probability, whereas the probability increases if either the precipitation in March or the average minimum NDVI remains low. Thus, the interaction terms act as a control: not every location with high precipitation in March has a higher probability of presence, i.e. precipitation in June and average minimum NDVI control the overall effect of precipitation in March.

5.3.2. Post processing the intermediary output

While the primary output of the statistical modelling is the set of regression coefficients, the model still has to be projected into a spatial format. The last model (here model 10) in the collection is projected over the geographic space to obtain the probability distribution map (Figure 5-6, left). The probability value alone is not meaningful, and hence thresholds are applied to classify presence and absence areas (Figure 5-6, right). Two threshold values based on the ‘sensitivity and specificity’ graph (Figure 5-4, right) are selected for classifying into three classes: presence, probably-presence and absence. A value of 23.73 % or greater is classified as presence, which corresponds to approximately 82 % accuracy for predicting presences. At this cut-off value, the accuracy of predicting absences is also approximately 82 %. This threshold satisfies both criteria, MDT (minimised difference threshold) and MST (maximised sum threshold), explored by Jiménez-Valverde and Lobo (2007). However, without any records of absences, the accuracy of predicting absences cannot be certain. Hence, another cut-off value of 12.5 % is taken, below which any value represents absence. This value corresponds to 90 % accuracy for predicting presences. The probability values from 12.5 % to 23.73 % are classified as probably-presence.
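The three-class scheme with the two cut-off values given above can be sketched with NumPy. The function name, the class codes and the example values are illustrative assumptions; the tool itself performs this step on raster datasets via its threshold function.

```python
import numpy as np

ABSENCE, PROBABLY_PRESENCE, PRESENCE = 0, 1, 2

def classify_three(prob, lower=0.125, upper=0.2373):
    """Classify probabilities into absence, probably-presence and presence."""
    prob = np.asarray(prob, dtype=float)
    # np.digitize returns 0 below `lower`, 1 in [lower, upper) and 2 from `upper` upwards.
    return np.digitize(prob, [lower, upper])

# Hypothetical probability values for five cells.
cells = [0.05, 0.125, 0.20, 0.2373, 0.90]
print(classify_three(cells))
```

Values exactly at a cut-off fall into the higher class, matching the rule that a value of 23.73 % or greater is classified as presence.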

Figure 5-5: Individual logistic response of the 5 most contributing environmental variables in the model predicting the presence of P. kersteni, panels a) to g) (note: the base reference in y-axis in figure ‘a’ is 0.5 and not 0)

5.4. Visual assessment of output of modelling the distribution of P. kersteni

Apart from the sample locations, the estimated distribution range (Figure 5-7), which is based on ‘expert knowledge’ and was created during the IUCN Red List assessment, can be used to assess or evaluate the model output. The range map is based on the United States Geological Survey (USGS) hydro1k watershed model, which provides the water catchments for delineating presence locations, and on the expert knowledge of the assessors (Clausnitzer et al., 2012). Thus, the map does not represent the true range but offers insights into the potential or likely distribution. Moreover, it allows a visual assessment of the predicted range of P. kersteni in Africa. The assessment is made across five different regions: a) northern Africa including the Sahel, b) eastern Africa, c) western Africa, d) central or middle Africa, and e) southern Africa.

Northern Africa and Sahel zone – In the IUCN range map, the species is absent in northern Africa. Apart from a section in the southern border region between Chad and Sudan, where species occurrences are recorded (see Figure 5-2), the species is absent in the Sahel region. The model output (Figure 5-6, right) is in agreement with this information, except for the recorded presences in the Chad and Sudan boundary region.


Figure 5-6: Probability of environmental suitability for occurrence of P. kersteni across Africa (left) and predicted classified range into presence, probably presence and absence (right) modelled using SpeeDi Tool. The black line represents the 5 different regions aggregated from countries’ boundaries.

Eastern Africa – The IUCN range map for P. kersteni presumes the species to be present in the Horn of Africa and in large parts of South Sudan, but the model did not predict the species to be present there. Similarly, large parts of southern Tanzania and northern Mozambique are predicted to have unfavourable conditions, which contradicts the IUCN range map. One reason for not predicting presences could be that the favourable climatic conditions which the region offers are missing from the training data. A visual comparison of Figure 5-2 and Figure 5-8 with Figure 5-7 reveals that the combination of elevation range (generally regionally correlated with climate) and land-cover classes found in the region is not sampled as presences and thus does not offer enough signal. Hence, the area is likely to be disadvantaged because of a data gap.

Western Africa – The range predicted by the model has a pattern similar to the IUCN range map. However, part of eastern and southern are predicted to be absence locations and some patches near to Sahel are predicted to be presences.

Central / Middle Africa – Large parts of the Congo basin have dense forest cover (Figure 5-8) and are identified as absence locations in the IUCN range map (Figure 5-7). The model output (prediction: Figure 5-6; response: Figure 5-5, a; class 1) matches this information. Yet the Central African Republic, the south of the Democratic Republic of Congo and Angola (except its western part) are missing in the prediction. Similar to the east African region, this is likely due to a lack of samples in geographic and, more importantly, in environmental space (cf. Figure 5-2 and Figure 5-8), including climatic variables.


Figure 5-7: Distributional range of P. kersteni as predicted by the IUCN Red List assessment in 2010 (Clausnitzer et al., 2010)

Southern Africa – In southern Africa, the species is present along the east and south coast regions but hardly in the central part. The IUCN range map assumes the species to be present in some areas of north-western Namibia. The model predicted a similar pattern in most areas except the west coast and north-western Namibia. Similarly, the range map assigns only a small part of eastern Botswana as presence, whereas the model also predicted some patches in the north. According to Suhling et al. (2006), P. kersteni occurs in Namibia only at very isolated places and is locally constrained to the origins of springs and wells. This information is difficult to include in a model at continental scale because of the scale and accuracy of the available geodata. Furthermore, it may require micro-ecological modelling, which statistical SDM cannot perform (cf. chapter 2.5). Apart from this constraint on modelling, Rach et al. (2008) report that the specimens recorded at three Namibian locations exhibit DNA profiles that differ from each other and from those of the species found in eastern Africa. Because of this, the populations have different structural habitat requirements (F. Suhling, personal communication), which implies that the samples from this region are outliers within the total sample set.


Figure 5-8: Land cover classes of the continent of Africa (Mayaux et al., 2004) as used for the modelling of P. kersteni; coded values are enclosed in brackets

Odonata species have been used as indicator species to monitor environmental health, especially habitat structure (Suhling et al., 2006). However, not every species can be used as an indicator species. P. kersteni is identified as one of the species to be considered as an indicator (ibid.). Short-term monitoring needs micro-ecological analysis, but long-term and large-scale monitoring can be facilitated through SDM. The purpose of this chapter was to demonstrate the functioning and usefulness of the SpeeDi Tool, which facilitates all major tasks of modelling species distributions within a harmonised user interface. All output maps and graphs presented in the chapter are created using the tool. Statistical modelling, as offered by the tool, creates several outputs which are useful for interpreting the results. Response graphics such as Figure 5-5 (d & f, and e & g) are generally not produced by other SDM tools, but they are useful in interpreting the interaction terms used for modelling. The overall predicted range for P. kersteni largely matches the pattern of the expert-drawn range map. Further, the predicted range map provides a more detailed shape than the watershed-based range map. However, some areas are not predicted as expected or do not conform to the range map. Diagnostic tests such as sensitivity analysis (chapter 6) can identify the shortcomings and constraints in the modelling.

Page| 61 6. Sensitivity analyses of the LR-based SpeeDi Tool: a comprehensive analyses of model tuning parameters

6. Sensitivity analyses of the LR-based SpeeDi Tool: a comprehensive analyses of model tuning parameters

Logistic regression modelling in the SpeeDi Tool has several tuning parameters. Here, different approaches are used to perform sensitivity analyses a) for model tuning parameters, b) for sample data and model formulation, and c) for the modelling approach. Knowing the effects of these parameters is an essential part of the modelling process. Sensitivity analysis, on the one hand, reveals these effects and, on the other hand, with proper tests offers confidence in using the tool. Sensitivity analysis can also be used to suggest default values for the tuning parameters so that the tool yields the most accurate result possible; finding a value which fits all species will not be possible, but suggesting a value which is likely to work for many species will be beneficial. The spatial distribution of sample data and the model formulation (interactions) can influence the prediction and the pattern of distribution. Tests regarding the modelling approach act as diagnostics for improving the prediction or reveal what is lacking. This chapter examines the effects of altering the following: elastic-net factor, assumed population prevalence and use of the soft-buffer-threshold (SBT), size of background samples, polynomials and interactions of predictor variables, spatial density of presence samples, modelling approach (presence with background or absence data) and environmental predictor variables. Apart from comparing the predictions with each other visually, Figure 5-7 serves for judging over- and under-predictions.

6.1. Sensitivity analysis of model tuning parameters for P. kersteni

The SpeeDi Tool has a few control parameters: a) elastic-net factor, b) initial population prevalence, and c) soft-buffer-threshold. These parameters affect the prediction results. Several sensitivity tests are run to identify their effects and to determine optimal values.

6.1.1. Elastic-net factor

Elastic-net factor is a regularisation parameter which controls how the coefficients of environmental variables are fitted within a model (see chapter 4.5). For modelling the distribution of P. kersteni, the initial number of variables (which included linear, quadratic and cubic polynomial terms and interactions of linear terms for continuous data) is 859; the number of variables that finally enter the model is 768 for L2-regularisation (see Table 6-1, elastic-net factor = 0) and just 51 for L1-regularisation (see Table 6-1, elastic-net factor = 1). The reduction in the number of variables in the final model is due to the effective filtering of noise in the data. Under L1-regularisation, it is also due to the internal removal of correlated variables. For L2-regularisation, the AUC value (see chapter 2.4) as well as the accuracy of the binary classification, with a cut-off value based on the sensitivity and specificity curves, is the lowest (Table 6-1). However, when classifying the presences at 90 percent accuracy (i.e. 10 % omission), the cut-off values for L1 and L2 are very close to each other. The total number of iterations required to fit the models is also lowest for L2-regularisation, which therefore requires less computational time. In contrast, the L1-regularised model has the highest AUC value as well as the highest accuracy for binary classification. However, it requires by far the most iterations and thus much computational time. The model with L2-regularisation tends to over-predict the spatial range, whereas the model with L1-regularisation tends to under-predict it (see Appendix A1), although the statistical results (AUC values) for both types of regularisation show good performance. Being a hybrid of L1- and L2-regularisation, elastic-net is robust and flexible: it balances over- and under-fitting of the regression parameters as well as under- and over-prediction. A further interesting effect is the number of variables that enter the model (see Table 6-1).

Table 6-1: Effect of elastic-net factor on model performance (1) measured with AUC for binary classification of the distribution of P. kersteni

| Elastic-net factor | Final variables in model (2) | AUC | Optimum cut-off % (3) | Accuracy % | Cut-off % for 90% accuracy (4) | Total iterations required |
|---|---|---|---|---|---|---|
| 0.00 | 768 | 0.8652 | 21.7077 | 77.40 | 20.0 | 742 |
| 0.05 | 375 | 0.8941 | 22.5220 | 80.51 | 12.5 | 2811 |
| 0.10 | 275 | 0.8999 | 22.7452 | 81.31 | 12.5 | 3305 |
| 0.20 | 183 | 0.9056 | 23.7326 | 81.98 | 12.5 | 4678 |
| 0.30 | 162 | 0.9087 | 25.0028 | 82.86 | 13.5 | 6167 |
| 0.40 | 131 | 0.9109 | 24.9436 | 83.09 | 15.0 | 7319 |
| 0.50 | 116 | 0.9133 | 25.8207 | 83.74 | 16.5 | 8946 |
| 1.00 | 51 | 0.9160 | 28.9778 | 84.89 | 19.0 | 26830 |

(1) all values are related to the last model (model #10), except total iterations
(2) initial number of variables used: 859
(3) calculated by the program (interpolated value between a cut-off interval of 0.1)
(4) value derived from the graph

A trend can be seen between the value of the elastic-net factor, the number of variables entering the model and the number of iterations required to fit the model (see Figure 6-1). The graph shows that the number of variables decreases as the elastic-net factor increases; however, the rate of decrease (slope) reduces dramatically beyond 0.2. As the elastic-net factor gradually increases from zero, the LASSO characteristic becomes stronger and increases sparseness, i.e. fewer variables enter the model. Hence, the initial rapid decrease in the number of variables is observed. Further, an elastic-net factor of 0.2 offers a reasonable compromise between AUC value, optimum accuracy for binary classification and total number of iterations. Thus, 0.2 is used for the other tests and is implemented as the default value in the SpeeDi Tool.
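The growing sparseness with an increasing elastic-net factor can be illustrated with a minimal sketch. It assumes the glmnet-style parameterisation (a factor of 0 gives a pure L2 penalty, a factor of 1 a pure L1 penalty) and the simplest setting of standardised, uncorrelated predictors, in which the penalised coefficient is obtained by soft-thresholding and rescaling the unpenalised one. The function names, the penalty strength `lam` and the coefficient values are hypothetical and are not taken from the SpeeDi Tool.

```python
def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_shrink(coefs, lam, alpha):
    """Coordinate-wise elastic-net shrinkage (assumed glmnet-style form).

    alpha = 0 -> L2 (ridge) penalty only, alpha = 1 -> L1 (LASSO) penalty only.
    """
    return [soft_threshold(b, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            for b in coefs]


def count_nonzero(coefs):
    """Number of variables that remain in the model."""
    return sum(1 for b in coefs if b != 0.0)


# Hypothetical unpenalised coefficients (illustrative values only).
raw = [1.2, -0.8, 0.45, -0.3, 0.15, 0.05, -0.02]
lam = 0.5  # hypothetical penalty strength

for alpha in (0.0, 0.2, 0.5, 1.0):
    shrunk = elastic_net_shrink(raw, lam, alpha)
    print(alpha, count_nonzero(shrunk))
```

With a factor of 0 all coefficients are only scaled down and none is removed, whereas larger factors set an increasing share of the small coefficients exactly to zero, which mirrors the direction of the trend in Table 6-1.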

6.1.2. Initial population prevalence

Initial population prevalence (IPP, denoted by π in chapter 4) is of relevance only when background samples are used. The term ‘initial’ indicates that the value is preliminary since the real value is unknown (Ward et al., 2009), but it is needed for model formulation with background data (see chapter 4.5.1, equations 2 and 3). Ward et al. (2009) demonstrated that the overall shape of the response is not affected by changes in IPP but that the predicted probability values shift. This can be seen in Figure 6-2, which shows the probability for four different values of the ‘unadjusted logit’ (horizontal axis; see chapter 4.5.1, equation 1), where an IPP of 0 corresponds to no adjustment; the numbers of background and presence samples used for the illustration are the same as in the modelling case for P. kersteni (chapter 5). The overall effect of IPP as observed for the modelling of P. kersteni is summarised in Table 6-2: the cut-off value for binary classification increases as IPP increases. The close values of AUC and of the corresponding accuracy show that the model performance in discriminating presences is consistent throughout.

Figure 6-1: Effect of elastic-net factor on the final number of variables entered into the model with least regularisation (10th), and on the total number of iterations required for fitting each model path consisting of 10 models (horizontal axis: elastic-net factor, 0 to 1; left vertical axis: number of variables, 0 to 800; right vertical axis: number of iterations, 0 to 30000)

Table 6-2: Comparative values of model performance regarding the initial population prevalence for modelling the distribution of P. kersteni

| IPP | AUC | Cut-off % | Accuracy % |
|---|---|---|---|
| 0.1 | 0.9136 | 23.0180 | 83.04 |
| 0.2 | 0.9083 | 33.6758 | 82.25 |
| 0.3 | 0.9042 | 44.1001 | 82.02 |
| 0.4 | 0.8999 | 53.9838 | 81.67 |
| 0.5 | 0.8929 | 63.3659 | 81.36 |

While the statistical values and the overall spatial patterns of the ranges are consistent (see Appendix A2), there are some changes at the edges of the range; this is expected as the true value of IPP is unknown. As population prevalence is the ratio of the number of presences to the total number of samples in the background, a starting value of 0.5 is suggested. This adds uncertainty to the model, but the uncertainty can be reduced by performing a sensitivity analysis to estimate a closer value of IPP (Ward et al., 2009) and then using the estimated value for the final model. Table 6-3 provides the correlations of the predicted probability distributions for different IPP values. The correlation coefficient (for probability values) between the distributions of P. kersteni with IPP values of 0.1 and 0.5 is the lowest (italicised in the table), indicating that they represent slightly different distribution patterns. Nevertheless, the null hypothesis that both models represent the same distribution is supported (p<0.00001). The smaller differences in values (and ranges) among the models for IPP of 0.2, 0.3 and 0.4 indicate that the real value may lie somewhere near those values. To find a closer value, several models would have to be run with an interval smaller than 0.1.

Figure 6-2: Effect of initial population prevalence (π) on the probability value, calculated for number of background samples = 5000 and number of presence samples = 496 (horizontal axis: unadjusted logit; vertical axis: probability value, 0 to 1; curves for π = 0 (un-adjusted), 0.1, 0.2, 0.3, 0.4 and 0.5)

Table 6-3: Initial population prevalence of background samples and the correlation matrix of the probability values for the distribution of P. kersteni

| IPP | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
|---|---|---|---|---|---|
| 0.1 | 1.00000 | 0.97714 | 0.95574 | 0.92608 | *0.88950* |
| 0.2 | 0.97714 | 1.00000 | 0.99076 | 0.96839 | 0.93599 |
| 0.3 | 0.95574 | 0.99076 | 1.00000 | 0.99222 | 0.97109 |
| 0.4 | 0.92608 | 0.96839 | 0.99222 | 1.00000 | 0.99261 |
| 0.5 | *0.88950* | 0.93599 | 0.97109 | 0.99261 | 1.00000 |
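A correlation matrix of this kind can be computed by treating each model output as a vector of per-cell probability values and correlating the vectors pairwise. The following minimal sketch assumes the Pearson correlation coefficient; the chapter does not state which coefficient the SpeeDi Tool uses, and the function names and probability values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(predictions):
    """predictions: dict mapping a run label to its per-cell probabilities."""
    labels = list(predictions)
    return {a: {b: pearson(predictions[a], predictions[b]) for b in labels}
            for a in labels}


# Hypothetical per-cell probabilities for three runs (illustrative values only).
runs = {
    "IPP 0.1": [0.05, 0.10, 0.40, 0.70, 0.90],
    "IPP 0.3": [0.10, 0.20, 0.55, 0.80, 0.95],
    "IPP 0.5": [0.20, 0.35, 0.70, 0.90, 0.97],
}
matrix = correlation_matrix(runs)
for a in runs:
    print(a, [round(matrix[a][b], 5) for b in runs])
```

The matrix is symmetric with ones on the diagonal, so only the upper or lower triangle needs to be inspected when looking for the least similar pair of runs.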

6.1.3. Soft buffer threshold for background

As the IPP value can influence the prediction output (although the difference is found to be statistically not significant, see 6.1.2) and the true value of IPP cannot be determined (Ward et al., 2009), the uncertainty related to IPP has to be reduced. For this purpose, a heuristic soft-buffer-threshold (SBT) option is conceptualised in the SpeeDi Tool (see 4.5.1, equation 4), which refines the estimated probability value for background samples within the iterative process. As this option is not yet available in any other modelling method, its impact on the estimated probability is examined here.

Table 6-4: Comparative values of model performance regarding the initial population prevalence and applying ‘soft-buffer-threshold’ for modelling the distribution of P. kersteni

| IPP | AUC | Cut-off % | Accuracy % |
|---|---|---|---|
| 0.1 | 0.9141 | 19.6228 | 83.12 |
| 0.2 | 0.9089 | 25.4708 | 82.68 |
| 0.3 | 0.9065 | 26.9848 | 82.07 |
| 0.4 | 0.9057 | 25.6054 | 82.13 |
| 0.5 | 0.9056 | 23.7326 | 81.98 |


The performance measured by AUC value and accuracy of binary classification in Table 6-4 is similar to that in Table 6-2. However, the cut-off values differ. After applying SBT, the cut-off values are closer to each other, implying that not only the shapes of the response are similar but also the predicted values. Further, Table 6-5 shows that the correlation values are also very close to each other; the lowest, 0.96606 (IPP values of 0.1 and 0.4), is much higher than the lowest value without SBT, 0.88950 (IPP values of 0.1 and 0.5; see Table 6-3). Moreover, the changes in the spatial range at the edges are reduced after applying SBT, and the ranges become more consistent (see Appendix A3).

Table 6-5: Initial population prevalence of background samples and the correlation matrix of the probability values calculated applying ‘soft-buffer-threshold’ for the distribution of P. kersteni

| IPP | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
|---|---|---|---|---|---|
| 0.1 | 1.00000 | 0.97618 | 0.96822 | 0.96606 | 0.96669 |
| 0.2 | 0.97618 | 1.00000 | 0.99755 | 0.99612 | 0.99564 |
| 0.3 | 0.96822 | 0.99755 | 1.00000 | 0.99891 | 0.99767 |
| 0.4 | 0.96606 | 0.99612 | 0.99891 | 1.00000 | 0.99799 |
| 0.5 | 0.96669 | 0.99564 | 0.99767 | 0.99799 | 1.00000 |

Since the uncertainty related to IPP is reduced by applying SBT, an IPP value of 0.5 (although the true value is not known) together with the SBT option is used for the tests of the other parameters. Further, the elastic-net factor of 0.2 is used throughout the remaining sections.

6.2. Sample data and model definition/formulation

The input samples are the basis for the prediction of any distribution. When using background samples, their characteristics (number, spatial distribution) influence the output. Further, model formulation and the distribution of presence samples (e.g. sampling density) can also alter the prediction. This section aims at revealing these effects.

6.2.1. Size of background samples

There have been a few studies on the effect of sample size in relation to species presences (Jiménez-Valverde et al., 2009; Stockwell and Peterson, 2002; Wisz et al., 2008). However, no such studies are available for background samples except that by Phillips et al. (2009), where the effect on the model’s discrimination ability was evaluated through the AUC value. No recommendation exists yet on the number of background samples required, except that Phillips and Dudík (2008) suggest using 10000 for MaxEnt modelling of species distributions. A lower number of samples may not capture the full range of environmental values but would be beneficial with regard to computational resources (e.g. memory, processing time). For P. kersteni, four different numbers of background samples are used to compare the results (see Table 6-6). The AUC score is consistently high for all four numbers, but the cut-off value for classification decreases as the number of background samples increases, partly due to the shift in the population prevalence of the overall sample (presences and background combined): as the number of background samples increases, the share of presences decreases. Nevertheless, the optimum accuracy for binary classification based on sensitivity and specificity (Figure 6-3) remains within a very small range.

Table 6-6: Model performance for different background sample sizes for modelling the distribution of P. kersteni

| Number of background samples | AUC | Cut-off % | Accuracy % | Total iterations |
|---|---|---|---|---|
| 1000 | 0.8980 | 62.9772 | 82.09 | 3884 |
| 2000 | 0.9031 | 49.6377 | 83.16 | 4825 |
| 5000 | 0.9056 | 23.7326 | 81.98 | 4678 |
| 10000 | 0.9086 | 12.3866 | 82.43 | 5455 |
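The shift in the prevalence of the overall sample can be made explicit with a short calculation. The sketch assumes the 496 presence samples stated in the caption of Figure 6-2 and ignores any sample weights; the function name is hypothetical.

```python
N_PRESENCES = 496  # presence samples, as stated in the caption of Figure 6-2


def sample_prevalence(n_presences, n_background):
    """Share of presences in the combined presence and background sample."""
    return n_presences / (n_presences + n_background)


for n_background in (1000, 2000, 5000, 10000):
    print(n_background, round(sample_prevalence(N_PRESENCES, n_background), 4))
```

Under these assumptions the share of presences falls from about one third with 1000 background samples to below 5 percent with 10000, which is in line with the decreasing cut-off values in Table 6-6.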

A visual comparison of the spatial distributions reveals that all four models performed relatively well. Comparatively, the model with 1000 background samples predicts least well, missing some areas (under-prediction) and over-predicting others (see Appendix A4, a). This is also perceptible in the specificity curve (specificity_1000 in Figure 6-3), which is flatter than the sensitivity curve (sensitivity_1000). With 2000 samples, the under-prediction in some areas is reduced but some over-prediction remains compared to the modelling results with 5000 and 10000 background samples (see Appendix A4, b, c and d). The closeness of the sensitivity curve to the diagonal for 10000 samples should be noted, which can be a sign that the prediction of presences may have been slightly over-fitted. In this particular test, the models with 1000 and 10000 samples are at opposite levels for sensitivity and specificity: when sensitivity is higher, specificity is lower and vice versa. Considering the trade-off for smoothness of sensitivity and specificity, 5000 samples (1 background sample for every 90 grid cells) seems reasonable for this scenario. Apart from the curves, the total area of the modelling extent and the spatial resolution should also be considered when determining the required number of background samples.
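The sensitivity and specificity curves and the derived cut-off can be sketched as follows. The sketch assumes that the optimum cut-off is the point where the two curves are closest to each other and that specificity is evaluated on the background samples; neither detail is confirmed by this chapter. The AUC is computed as the share of presence and background pairs in which the presence receives the higher probability. All names and probability values are hypothetical.

```python
def sensitivity_specificity(presence_probs, background_probs, cutoff):
    """Sensitivity on presences and specificity on background samples."""
    true_pos = sum(1 for p in presence_probs if p >= cutoff)
    true_neg = sum(1 for p in background_probs if p < cutoff)
    return true_pos / len(presence_probs), true_neg / len(background_probs)


def balanced_cutoff(presence_probs, background_probs, step=0.001):
    """Cut-off at which sensitivity and specificity are closest to each other."""
    best_cutoff, best_gap = 0.0, float("inf")
    n_steps = int(round(1.0 / step))
    for i in range(n_steps + 1):
        cutoff = i * step
        sens, spec = sensitivity_specificity(presence_probs, background_probs, cutoff)
        gap = abs(sens - spec)
        if gap < best_gap:
            best_cutoff, best_gap = cutoff, gap
    return best_cutoff


def auc(presence_probs, background_probs):
    """Probability that a presence scores higher than a background sample."""
    wins = 0.0
    for p in presence_probs:
        for b in background_probs:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5  # ties count as half
    return wins / (len(presence_probs) * len(background_probs))


# Hypothetical predicted probabilities (illustrative values only).
presences = [0.35, 0.55, 0.62, 0.71, 0.80, 0.93]
background = [0.02, 0.05, 0.11, 0.18, 0.24, 0.40, 0.58, 0.09]

cutoff = balanced_cutoff(presences, background)
print(cutoff)
print(sensitivity_specificity(presences, background, cutoff))
print(auc(presences, background))
```

Because the curves are step functions for small samples, the scan returns the first cut-off within the interval of smallest gap; an interpolated value, as mentioned in the footnote of Table 6-1, would need an additional step.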

6.2.2. Algorithm of random number generator for creating background samples

Background samples are generated randomly over a landscape. In general, the spatial pattern of these random samples should not influence the overall prediction pattern. There are several ways of generating random patterns, among them different algorithms for pseudo-random number generators (p-RNG) or different seed values for the same p-RNG algorithm. Here, four random number generator algorithms are used to test the consistency of the prediction: a) Additive Lagged Fibonacci Generator (ALFG; Brent, 1992), b) XorShift (Marsaglia, 2003), c) Mersenne Twister (mt19937; Matsumoto and Nishimura, 1998), and d) Subtractive Random Number Generator (SRNG; Knuth, 1998), which is available as the standard p-RNG in the DotNET API. Based on the AUC and classification accuracy criteria, the models for all four algorithms performed well (see Table 6-7), with AUC values around 0.9 and binary classification accuracy around 82 percent (see box in Figure 6-3).
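Drawing background samples with a seeded generator can be sketched as below. The example uses Python's standard `random` module, whose generator is a Mersenne Twister (MT19937); the grid dimensions are hypothetical and merely chosen so that 5000 samples correspond to one sample per 90 cells, and masking of non-land cells is ignored. This is an illustration of the principle, not the SpeeDi Tool's implementation.

```python
import random


def background_cells(n_rows, n_cols, n_samples, seed):
    """Draw n_samples distinct (row, col) grid cells uniformly at random."""
    rng = random.Random(seed)  # seeded Mersenne Twister (MT19937)
    cell_ids = rng.sample(range(n_rows * n_cols), n_samples)
    return [divmod(cell_id, n_cols) for cell_id in cell_ids]


# Hypothetical grid of 600 x 750 = 450000 cells, i.e. 90 cells per sample
# when 5000 background samples are drawn.
first = background_cells(600, 750, 5000, seed=1)
again = background_cells(600, 750, 5000, seed=1)
other = background_cells(600, 750, 5000, seed=2)

print(first == again)                # the same seed reproduces the sample set
print(len(set(first) & set(other)))  # number of cells shared by two seeds
```

Fixing the seed makes a sensitivity test repeatable, while varying it (or the generator) produces the alternative sample sets against which the consistency of the prediction can be checked.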


Figure 6-3: Sensitivity and specificity curves of different background samples for the modelled probability distribution of P. kersteni; the suffixed numbers represent the number of background samples (1000, 2000, 5000 and 10000); the box highlights the closeness of range in threshold for obtaining maximum accuracy for optimal binary classification considering sensitivity and specificity values (horizontal axis: cut-off %, 0 to 100; vertical axis: sensitivity and specificity, 0 to 1)

Table 6-7: Model performance for predicting the distribution of P. kersteni based on AUC value and accuracy when using different algorithms of pseudo-random number generator for background data generation

| Algorithm | AUC | Cut-off % | Accuracy % |
|---|---|---|---|
| ALFG | 0.9056 | 23.7326 | 81.98 |
| XorShift | 0.9012 | 25.1352 | 81.77 |
| mt19937 | 0.9056 | 24.8905 | 82.55 |
| SRNG | 0.8999 | 27.2948 | 82.14 |

The correlation of probability values among the models is also high (see Table 6-8), with the lowest correlation coefficient being 0.96287. Although none of the correlation values differ significantly from each other and the overall spatial patterns of prediction (and classification) are similar, some visible differences are observed for the SRNG algorithm (see Appendix A5), whose pattern differs more from those derived via the other algorithms. This may be an effect of the randomness characteristics of pseudo-random numbers generated through different algorithms (Marsaglia and Tsang, 2002). However, further tests with many sample sets (e.g. also using different seed values) may be required to suggest which algorithm is best. Since the effect of the different p-RNGs is not noticeable among three of the four algorithms (ALFG, XorShift and mt19937), ALFG is chosen as the standard for the sensitivity analyses.


Table 6-8: Correlation of probability values among different model outputs when using different algorithms for generating background samples

| | ALFG | XorShift | mt19937 | SRNG (Standard) |
|---|---|---|---|---|
| ALFG | 1.00000 | 0.99198 | 0.99069 | 0.97068 |
| XorShift | 0.99198 | 1.00000 | 0.99066 | 0.97289 |
| mt19937 | 0.99069 | 0.99066 | 1.00000 | 0.96287 |
| SRNG | 0.97068 | 0.97289 | 0.96287 | 1.00000 |

6.2.3. Polynomial degree and interaction term for continuous variables

Not only how the samples are distributed (presences, see chapter 6.3; backgrounds, see chapters 6.2.1 and 6.2.2), but also how the relationships of occurrence (or lack thereof) with environmental variables are modelled will influence the prediction. In general, more complex relationships will result in output matching reality more closely (e.g. Phillips and Dudík, 2008), but the complexity also makes the relationships more difficult to explain. Thus, the nature of the modelling task, i.e. finding the relationships and thus predicting closest to reality, should guide the choice of model formulation with environmental variables. To examine prediction capability against model complexity, five different complexities for continuous variables are defined:

a) Linear (1st-degree polynomial) terms and product of 2 linear variables (LP) (e.g. $x_1 + x_2 + x_1 x_2$)
b) Linear and quadratic (2nd-degree polynomial) terms (LQ) (e.g. $x_1 + x_2 + x_1^2 + x_2^2$)
c) Linear, quadratic and cubic terms (LQC) (e.g. $x_1 + x_2 + x_1^2 + x_2^2 + x_1^3 + x_2^3$)
d) Case b and product of 2 linear variables (LQP) (e.g. $x_1 + x_2 + x_1^2 + x_2^2 + x_1 x_2$)
e) Case c and product of 2 linear variables (LQCP) (e.g. $x_1 + x_2 + x_1^2 + x_2^2 + x_1^3 + x_2^3 + x_1 x_2$)
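The five complexities listed above can be generated programmatically from a list of continuous variables. The following is a minimal sketch with a hypothetical function name; it only builds the term names and makes no assumption about how the SpeeDi Tool stores or fits them.

```python
from itertools import combinations


def expand_terms(variables, complexity):
    """Return term names for one of the complexities LP, LQ, LQC, LQP, LQCP."""
    if complexity not in {"LP", "LQ", "LQC", "LQP", "LQCP"}:
        raise ValueError(f"unknown complexity: {complexity}")
    terms = list(variables)                                  # L: linear terms
    if "Q" in complexity:
        terms += [f"{v}^2" for v in variables]               # Q: quadratic terms
    if "C" in complexity:
        terms += [f"{v}^3" for v in variables]               # C: cubic terms
    if complexity.endswith("P"):
        # P: products of every pair of linear variables
        terms += [f"{a}*{b}" for a, b in combinations(variables, 2)]
    return terms


for name in ("LP", "LQ", "LQC", "LQP", "LQCP"):
    print(name, expand_terms(["x1", "x2"], name))
```

For n continuous variables, each polynomial degree adds n terms and the product option adds n(n-1)/2 terms, which explains why the number of candidate variables grows quickly with model complexity.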

All five models with different complexities performed well, with AUC values around 0.9, and all models offer more than 81 percent accuracy for binary classification (see Table 6-9). As with the other tests, a visual assessment is made, which reveals two different patterns after classification (Figure 6-4): one with LP, LQP and LQCP (a, d and e) and the other with LQ and LQC (b and c). The first group contains the model complexities which include the products of two linear variables. The second group of models under-predicted large areas compared to the first group. This shows that product interactions of environmental variables are important in modelling the distribution of P. kersteni.

Table 6-9: Performance of models of different complexities measured by AUC value and accuracy for predicting the distribution of P. kersteni (see chapter 6.2.3 for explanations on model abbreviations)

| Model complexity | AUC | Cut-off % | Accuracy % |
|---|---|---|---|
| LP | 0.9035 | 32.5191 | 81.85 |
| LQ | 0.8985 | 32.3021 | 82.94 |
| LQC | 0.9048 | 32.8332 | 83.74 |
| LQP | 0.9039 | 23.7057 | 82.03 |
| LQCP | 0.9056 | 23.7326 | 81.98 |

Since the ecology of P. kersteni is only partially known, only two predictor variables are included for further analysis (see Figure 6-5): elevation (the species is known to occur below 1800 m; Clausnitzer et al., 2010) and maximum temperature of August (one of the top five contributing predictor variables present in all five models). Similar to the grouping found in the spatial patterns, the response curves also show two patterns: one group with the inclusion of product interaction (d and e) and the other group without the product interaction (a, b and c).

Figure 6-4: Predicted distribution of P. kersteni with five different combinations of interaction terms: a) linear and product terms, b) linear and quadratic terms, c) linear, quadratic and cubic terms, d) linear, quadratic and product terms, and e) linear, quadratic, cubic and product terms (same as Figure 5-6 right)

When analysing the five most contributing variables (see Table 6-10), land cover is the most influential variable in all cases. Maximum temperature of August is another environmental variable which is within the top five contributors in all cases. The product interactions of precipitation of March and June and of precipitation of March and minimum NDVI are two further variables among the five most influential when product interactions are included in the model. The presence of two product interaction terms among the top five contributing variables shows that complexity is necessary if predictions are to be made closer to reality. Although the predictions after classification of LP, LQP and LQCP did not differ much, LQCP, the most complex of the three, is suggested. The reason is that a linear term (first-degree polynomial) of an environmental variable is too simple to reflect the actual response of species presence or absence, whereas the second-degree polynomial in LQP is symmetrical about its axis, forming a mirrored response. This symmetry holds not only for the quadratic term but for all polynomials of even degree; hence a polynomial of odd degree is the choice. Moreover, species-environment relationships are often skewed (Diniz-Filho et al., 2007), which is also observed for the predicted distribution of P. kersteni; examples are the left-skewed distributions of elevation and precipitation in March and the right-skewed distribution of temperature in February (Figure 6-6).

Figure 6-5: Response curves of elevation (upper five panels) and maximum temperature of August (lower five panels) for the predicted distribution of P. kersteni with five different combinations of interaction terms: a) linear and product terms, b) linear and quadratic terms, c) linear, quadratic and cubic terms, d) linear, quadratic and product terms, and e) linear, quadratic, cubic and product terms; panel a) for elevation is annotated ‘linear out-regularised’


Table 6-10: Ranking of environmental variables for 5 different model complexities based on the explained deviance in contributing to the calculated probability for modelling the distribution of P. kersteni

| Rank | LP | LQ | LQC | LQP | LQCP |
|---|---|---|---|---|---|
| 1 | glc2k | glc2k | glc2k | glc2k | glc2k |
| 2 | prec_03*prec_06 | prec_11 | prec_11 | prec_03*prec_06 | tmin_02 |
| 3 | prec_03*af_ndvi_min | tmax_08 | tmin_02 | tmax_08 | tmax_08 |
| 4 | tmax_08 | tmin_02 | tmax_08 | prec_03*af_ndvi_min | prec_03*prec_06 |
| 5 | af_ndvi_min | dist2water | prec_03 | af_ndvi_min | prec_03*af_ndvi_min |

Note: see chapter 6.2.3 for explanations on model abbreviations and Table 5-1 for naming of variables

6.2.4. Effect of sample density of presences

Although several studies have addressed the number of samples, the effect of differing sample densities in different regions has not been explored. It is known that bias in presence samples influences the prediction (Phillips, 2008; Phillips et al., 2009). Balancing different sampling densities should reduce such bias (see also chapter 5.2.1). To test the effect of different sampling densities, three case models are built by applying weights to the samples. In the first case, each sample is given a weight of one (i.e. no balancing). In the second case, weights are assigned to clusters of samples in different regions; the weights are judged subjectively by visual assessment of differences in sample distribution and density (for the approach, see chapter 5.2.1). For the third case, a mathematical approach is conceptualised that assigns weights based on the distances to all sample locations. For each sample, the sum of distances to all other samples is computed and stored in a new attribute field. From these sums, an average distance is calculated. Each sum is then normalised by this average distance, and the normalised value is used as the final weight.

The first case (Figure 6-7, top left) is not able to predict the range and shape of the expected distribution (Figure 5-7). The dense samples in the south and south-east have so much influence that the distribution in large parts of western Africa is not predicted. Moreover, there is over-prediction in southern Africa. In the second case (Figure 6-7, top right), samples in southern and south-eastern Africa are assigned a lower weight and samples in western Africa a higher weight. Further, the weight of each sample is refined (multiplied by a factor scaled linearly between 0.5 and 1.0) based on the distance to the nearest sample: 0.5 for samples that are 1 cell apart and 1.0 for samples that are at least 5 cells apart. This, in effect, countered the higher density in the southern part and provided a more balanced contribution from all samples. The third case (Figure 6-7, bottom), compared with the first, improved the prediction in west Africa, where the sample density is very low; compared with the second, however, its prediction is still far from the expected range (Figure 5-7).
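The distance-based weighting of the third case and the linear refinement factor of the second case can be illustrated with a minimal sketch. It is not the SpeeDi Tool implementation: the function names, the use of Euclidean distance in cell units and the toy sample coordinates are assumptions made for illustration only.

```python
import numpy as np

def pairwise_distances(xy):
    """Euclidean distances between all sample locations (assumed to be in cell units)."""
    xy = np.asarray(xy, dtype=float)
    diff = xy[:, None, :] - xy[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def distance_sum_weights(xy):
    """Third case: sum of distances to all other samples, normalised by the average sum."""
    dist_sum = pairwise_distances(xy).sum(axis=1)
    return dist_sum / dist_sum.mean()

def nearest_sample_factor(xy, near=1.0, far=5.0):
    """Second-case refinement: 0.5 if the nearest sample is 1 cell away,
    1.0 if it is at least 5 cells away, linear in between."""
    dist = pairwise_distances(xy)
    np.fill_diagonal(dist, np.inf)          # ignore the distance of a sample to itself
    nearest = dist.min(axis=1)
    return 0.5 + 0.5 * np.clip((nearest - near) / (far - near), 0.0, 1.0)

# Hypothetical toy data: three clustered samples and one isolated sample
samples = [(0, 0), (1, 0), (0, 1), (20, 20)]
weights = distance_sum_weights(samples)
factors = nearest_sample_factor(samples)
print(weights)   # the isolated sample receives the largest weight
print(factors)   # clustered samples are scaled by 0.5, the isolated one by 1.0
```

In this sketch the weights average to one by construction, so isolated samples are up-weighted and clustered samples are down-weighted relative to the unweighted first case.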

Page| 72 6. Sensitivity analyses of the LR-based SpeeDi Tool: a comprehensive analyses of model tuning parameters

Figure 6-6: Histogram of elevation (m), precipitation (mm) in March and minimum temperature (°Celsius times 10) in February inferred from the predicted presence range of P. kersteni

6.3. Different modelling approaches for predicting the distribution of P. kersteni

Modelling with presence and absence (or pseudo-absence) data and modelling with presence and background data are the main approaches applied in regression based species distribution modelling (chapter 2.4). Using different approaches can lead to different outputs (e.g. Elith and Graham, 2009). Here, these approaches are employed in different cases to compare the resulting spatial patterns and to evaluate the different results.

Figure 6-7: Models using weights for presence samples density, samples with weight of 1 throughout (top left), manually assigned and adjusted weights per cluster (top right, same as Figure 5-6 right), and weights based on the global average distance to other samples (bottom left)

6.3.1. Modelling with presences and absences derived from known distribution range

Virtual species have been used to evaluate the suitability of modelling methods (e.g. Elith and Graham, 2009; Wisz and Guisan, 2009). In this concept, the spatial distribution of a virtual species is described, typically by a mathematical relationship between the presence of the virtual species and a few environmental variables. Here, a similar concept is used to evaluate logistic regression with elastic-net regularisation. However, instead of defining a mathematical relationship, the range map of P. kersteni (Figure 6-8, left: green shaded areas) is used as the base distribution. The same predictor environmental variables are used as in the ‘actual case’ (chapter 5) so that the exercise can act as a diagnostic for the model output. This exercise provides confidence in using elastic-net penalised logistic regression for modelling species distribution. From the range map, 1000 samples (Figure 6-8, left: red and green dots) are randomly drawn from the areas where the map shows the species as present. From the areas where the species is absent, 3000 samples are randomly drawn. 500 of the 1000 presence samples and 2000 of the 3000 absence samples are used to train the model. The remaining samples are used to evaluate the model output. The choice of 500 presence samples is made because this number is close to the actual number of samples (496) for P. kersteni taken from the Odonata Database of Africa, so that comparisons and inferences can be made. Furthermore, the random samples are generated in such a way that the distance between any two samples is at least 5 cells, and all samples are given an equal weight of one.
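The sampling scheme described above (random draws inside a range mask with a minimum spacing of 5 cells, followed by a split into training and evaluation sets) could be sketched as follows. The greedy rejection procedure, the function name and the toy mask are assumptions for illustration; the thesis does not state how the SpeeDi Tool enforces the spacing.

```python
import numpy as np

def sample_cells_min_spacing(mask, n, min_dist=5, seed=0):
    """Randomly pick up to n cells where mask is True so that any two picked
    cells are at least min_dist cells apart (simple greedy rejection)."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(mask)
    picked = []
    for idx in rng.permutation(rows.size):
        cand = np.array([rows[idx], cols[idx]])
        # accept the candidate only if it is far enough from every accepted cell
        if all(np.hypot(*(cand - p)) >= min_dist for p in picked):
            picked.append(cand)
            if len(picked) == n:
                break
    return np.array(picked)

# Hypothetical toy range map: the left half of a 100 x 100 grid is "presence range"
range_mask = np.zeros((100, 100), dtype=bool)
range_mask[:, :50] = True

presence = sample_cells_min_spacing(range_mask, n=20, min_dist=5, seed=1)
train, evaluation = presence[:10], presence[10:]   # half for training, half for evaluation
print(train.shape, evaluation.shape)
```

Because the cells are visited in random order, taking the first half for training and the rest for evaluation is itself a random split. Absence samples could be drawn the same way from the inverted mask.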

Figure 6-8: Modelling with presence and absence samples generated from the assumed watershed based range of P. kersteni (Clausnitzer et al., 2012); left: light green areas as presence range and light blue as absence range, with randomly generated presence samples used for training (red dots) and evaluation (green dots) of the model; right: predicted binary classified (presence-absence) range of P. kersteni

The model output has an AUC value of 0.9873, indicating a very good performance. Figure 6-8 (right) shows the binary classified prediction result, which, with a cut-off threshold value of 25.6667%, has a reported accuracy of 94.93%. However, this accuracy is a value interpolated from the sensitivity and specificity curves; the actual accuracy given in Table 6-11 differs slightly (94.92%). Since the distribution range is defined and known, an accuracy assessment can be made with the evaluation set of samples. If the accuracy for the evaluation data closely matches the accuracy for the training data, there is no, or only negligible, over- or under-fitting. The overall accuracies of the training (94.92%) and evaluation (94.27%) sets are very high and very close, differing by just 0.65 percentage points. However, the output also has to be evaluated spatially. Comparing the spatial distributions (left and right maps of Figure 6-8) reveals a few places with discrepancies: a) small patches in the north-western region and the south where the species is wrongly predicted, b) the southern coastal region, where the model did not predict enough range, and c) the patch in the southern border region of Sudan and Chad, which is not predicted as presence area. For b) and c), the probable reason for not predicting enough range is the lack of samples at those locations: only four samples are present in the training set. Overall, the range is well predicted.

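Table 6-11: Accuracy assessment of the model performance in predicting a pre-defined arbitrary range of a species based on training and evaluation sample data

| dataset | class | samples | predicted presence | predicted absence | user’s accuracy |
|---|---|---|---|---|---|
| training | presence | 500 | 474 | 26 | 94.80% |
| training | absence | 2000 | 101 | 1899 | 94.95% |
| training | overall accuracy | | | | 94.92% |
| evaluation | presence | 500 | 478 | 22 | 95.60% |
| evaluation | absence | 1000 | 64 | 936 | 93.60% |
| evaluation | overall accuracy | | | | 94.27% |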
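The accuracy values in Table 6-11 follow directly from the counts in the table. The short sketch below recomputes them; the function name and argument names are illustrative and not part of the SpeeDi Tool.

```python
def accuracy_summary(presence_correct, presence_missed, absence_wrong, absence_correct):
    """Per-class and overall accuracy from the four counts of a 2 x 2 table."""
    presence_acc = presence_correct / (presence_correct + presence_missed)
    absence_acc = absence_correct / (absence_wrong + absence_correct)
    total = presence_correct + presence_missed + absence_wrong + absence_correct
    overall = (presence_correct + absence_correct) / total
    return presence_acc, absence_acc, overall

# Counts taken from Table 6-11
training = accuracy_summary(474, 26, 101, 1899)
evaluation = accuracy_summary(478, 22, 64, 936)

for name, result in (("training", training), ("evaluation", evaluation)):
    print(name, [f"{100 * value:.2f}%" for value in result])

# Difference in overall accuracy between the two sets, in percentage points
gap = 100 * (training[2] - evaluation[2])
print(f"difference: {gap:.2f} percentage points")
```

A small gap between the training and evaluation accuracy is what the text uses as evidence against substantial over- or under-fitting.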

6.3.2. Modelling with presences from watershed based range map and random background samples for all of Africa

The presence-absence model formulated from samples taken from the range map (Figure 5-7) performed well. With the confidence that logistic regression with elastic-net regularisation works properly (chapter 6.3.1), modelling with a combination of presence and background samples can now be compared with the presence-absence model. To assess the use of background samples in presence-background modelling, 500 randomly generated presence samples within the watershed based range map and 5000 background samples (chapter 6.2.1) distributed all over Africa are used. The AUC value of the model output is 0.8275, which is still considered an ‘excellent’ discriminating performance (cf. Hosmer and Lemeshow, 2000). However, the resulting species distribution pattern (Figure 6-9 left) differs from the presence-absence model result (Figure 6-8 right). This is expected, as the presence-background modelling scenario does not have all the information available to the presence-absence model, which specifically considers the environmental information provided at absence locations (Elith and Graham, 2009). The patches seen in northern Africa for the presence-absence model disappeared, which is a positive effect. The overall shape of the pattern looks familiar but the range has shrunk. The distribution at the horn of Africa, in central Africa (some parts of Central African Republic and some parts of Angola), in southern Africa and in some parts of Tanzania and Mozambique is under-predicted, in other words falsely predicted as absence.
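For readers less familiar with the AUC, the sketch below computes it as the share of (presence, non-presence) pairs in which the presence sample receives the higher predicted probability, and attaches the verbal categories of Hosmer and Lemeshow (2000). The function names and the toy scores are hypothetical; the label for values below 0.7 is wording chosen here, not a category quoted from that source.

```python
def auc(presence_scores, other_scores):
    """Rank-based AUC: probability that a presence sample scores higher than a
    non-presence (absence or background) sample; ties count as one half."""
    wins = 0.0
    for p in presence_scores:
        for q in other_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(presence_scores) * len(other_scores))

def discrimination_label(value):
    """Verbal categories following Hosmer and Lemeshow (2000)."""
    if value >= 0.9:
        return "outstanding"
    if value >= 0.8:
        return "excellent"
    if value >= 0.7:
        return "acceptable"
    return "below acceptable"

# Hypothetical predicted probabilities
toy_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
print(round(toy_auc, 4), discrimination_label(toy_auc))
print(discrimination_label(0.8275))   # AUC of the presence-background model
print(discrimination_label(0.9873))   # AUC of the presence-absence model of chapter 6.3.1
```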

6.3.3. Modelling with actual field samples and absences sampled from the watershed based range map

The presence-absence model result predicted using the samples from the watershed based map (Figure 6-8 right) is very close to the range based on the watershed (Figure 5-7). With the assumption that the field samples of P. kersteni are unbiased samples of the African population (but see chapter 2.5 for sample bias), it can be tested whether the presence samples in the Odonata Database of Africa are sufficient for modelling the correct distribution. However, the field samples used here are weighted according to the second case of chapter 6.2.4 (i.e. weights assigned by visual assessment of differences in sample distribution and density). With the further assumption that the habitat range derived from the watershed based map can act as the true range, 2000 random absence samples are selected for modelling (the same 2000 as used in chapter 6.3.1). The prediction of the distribution has an AUC value of 0.9881. A classification accuracy of 94.23% is achieved with a cut-off threshold of 25.6%, a value close to that of the presence-absence model of chapter 6.3.1. The spatial pattern of the distribution (Figure 6-9, right) is again close to the expected range (Figure 6-8, left), except at the horn of Africa: most parts of , and not being predicted as presence area. Similarly, a few patches in northern Africa are wrongly predicted as presence. There are also wrong predictions (false absences) in northern Namibia and Central African Republic. Overall, the predicted range matches the watershed based range. Further, the output is correct in areas of southern Africa where the models of chapters 6.3.1 (Figure 6-8 right) and 6.3.2 (Figure 6-9 left) are not able to predict correctly.

Figure 6-9: Model output using different approaches: using randomly generated presences from the watershed based range map (Clausnitzer et al., 2012) and background samples from all over Africa (left), and using field collected presence samples and randomly generated absence samples from absence regions of the watershed based range map (right)

6.3.4. Feeding the presence-background model with auxiliary absence data

Often absence data are not available, and hence background samples are a useful means for modelling species distribution in a regression based model (chapter 2.4). However, in certain cases absence data are available or can be confidently derived from specific knowledge of the habitat requirements of a particular species. Odonata need a fresh-water aquatic environment. The model output in Figure 6-10 predicted the species to be present in the salt-water environment of the Okavango delta in southern Africa, which, based on the habitat requirement, is clearly a false prediction. Since the salt-water environment is not a suitable habitat for P. kersteni and P. kersteni has not been observed during several years of visits to the delta region (personal communication F. Suhling), the wetlands around the Okavango delta can be confidently used to randomly sample absence data. Thus, 50 samples are randomly drawn from the area; this information would be missing when applying presence and background samples only. The scenario of presence-absence-background modelling is a new concept and is so far only supported in the SpeeDi Tool. Thus, the opportunity is offered to test the tool’s modelling ability in such cases as well. Although there may be other such areas, only this particular area is considered here. Indeed, the predicted range (Figure 6-10, right) improved by excluding the Okavango delta.

Figure 6-10: Example of improving the model predictions, prediction result with field collected presence samples with background samples (left, same as Figure 5-6 right) and improved prediction reducing the false presences at the Okavango Delta (see circle) by using auxiliary information of knowledge based absences locations at the Okavango Delta.

Several parameters related to species distribution modelling with the SpeeDi Tool were examined in sensitivity analyses. These analyses help to increase the understanding of the behaviour of the logistic regression model under elastic-net regularisation in the context of species distribution modelling. The results show that elastic-net regularisation is a better choice than L1- or L2-regularisation, i.e. its predicted results are superior in terms of over- and under-predicted areas (cf. Appendix A1). Background samples are increasingly used in species distribution modelling. However, the uncertain nature of population prevalence in background samples cannot be avoided because population prevalence is part of the mathematical formulation. Here, it is revealed that the use of a ‘soft-buffer-threshold’, not included in SDM so far, is effective in reducing the level of uncertainty. The number of background samples is an important factor in estimating probability values, and a sensitivity analysis, taking P. kersteni as the example species, is used to find the appropriate number of samples for modelling the distribution of African Odonata. The results show that a lower number of background samples (5000) is sufficient compared with the 10000 background samples suggested by Phillips and Dudík (2008). However, it should be noted that the suggestion of Phillips and Dudík is a general one and the appropriate number may differ from species to species. Different patterns of distribution are observed when product interactions of environmental variables are included, which highlights their importance in improving the prediction. Modelling with unbiased presence and absence data is seen as the best option; however, bias in sample data and, moreover, the lack of absence data are part of reality. Sample bias in presence data has been one of the limitations of modelling species distributions. Applying weights to the presence samples can be a measure for reducing such bias. The different tests performed show that it is very effective. Manual assignment of weights based on visual assessment of differences in sample distribution is able to reduce the effect of sample bias; however, the mathematical concept of assigning weights employed here is still far from an acceptable level. Until now, the combination of presence, absence and background samples has not been used for species distribution modelling. Such scenarios can be handled by the SpeeDi Tool; this has been tested in a small region, applying well thought-out logic in deriving absence samples, which resulted in a positive outcome.


7. Effect of spatial environmental data on modelling species distribution: sensitivity analyses regarding climate and land-cover datasets, spatial extent and resolution

Environment related geodata are common predictor variables for modelling species distribution. These variables relate the occurrences of species to favourable environmental conditions. The occurrences are, thus, used to explain the importance of environmental variables for the ecological or environmental niche of the species. Climate is fundamental to the environmental niche and hence the use of climatic variables is common in statistical species distribution models (Franklin, 2010). Similarly, land-cover is part of the habitat structure and is often used as one of the predictor variables (ibid.). Climatic variables can be applied in several ways, e.g. as observed detailed ‘raw’ climate data or as synthesised climatic variables. Land-cover can also be expressed in several ways (see schemas such as the USGS GLCC system). The use of different types (or forms) of climatic data or different outlines of land-cover classification may result in different predictions of species occurrence ranges. Furthermore, the prediction may also be affected by the spatial resolution (grain size) used for modelling, as the predictor variables become more generalised when the grain-size increases. In addition, the modelling extent offers different sets or ranges of values for the predictor variables and, thus, may result in different outcomes. This chapter investigates a) the effects of using different forms of climate data (in ‘raw’ form and synthesised, here the ‘bioclim’ data), of using only subsets of ‘bioclim’ and of supplementing it with geographic coordinates, b) the effect of using different land-cover datasets, and c) the effect of modelling grain-size and extent. Pseudagrion kersteni is used as the sample species to investigate these effects, except for the grain-size, for which Aeshna minuscula is used.
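How a predictor becomes more generalised as the grain-size increases can be shown with a small sketch that aggregates a raster to a coarser grid by block averaging. Block averaging, the function name and the toy grid are assumptions for illustration; the resampling method actually used for the environmental layers is not stated here.

```python
import numpy as np

def aggregate_mean(grid, factor):
    """Coarsen a 2-D raster by averaging non-overlapping factor x factor blocks.
    Rows and columns that do not fill a complete block are dropped."""
    rows = grid.shape[0] - grid.shape[0] % factor
    cols = grid.shape[1] - grid.shape[1] % factor
    trimmed = grid[:rows, :cols]
    blocks = trimmed.reshape(rows // factor, factor, cols // factor, factor)
    return blocks.mean(axis=(1, 3))

# Hypothetical 4 x 4 predictor layer (e.g. elevation) with values 0..15
fine = np.arange(16, dtype=float).reshape(4, 4)
coarse = aggregate_mean(fine, 2)

print(coarse)                               # 2 x 2 grid of block means
print(fine.min(), fine.max())               # value range at the fine grain
print(coarse.min(), coarse.max())           # narrower value range at the coarse grain
```

The coarse grid covers a narrower range of values than the fine grid, which is one way in which a larger grain-size can change the environmental information available to the model.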

7.1. Bioclimatic data and its influence on modelling the prediction of Pseudagrion kersteni

Bioclimatic variables are synthesised climatic variables which are commonly used to describe the differences in climatic patterns across regions (Xu and Hutchinson, 2013). Although several bioclimatic variables can be formed, temperature and precipitation related variables are those most commonly used in species distribution modelling (e.g. Willems and Hill, 2009; VanDerWal et al., 2009; Lobo et al., 2010). The worldclim database (http://www.worldclim.org) offers 19 bioclimatic variables (see Appendix B1), of which eleven are related to temperature and eight to precipitation. Synthesised variable types related to temperature are, among others, the maximum, minimum and mean of different seasons as well as the annual mean and standard deviation (seasonality); variable types related to precipitation are annual precipitation, precipitation in the wettest and driest seasons, and precipitation seasonality.


7.1.1. Using bioclimatic variables related to precipitation and temperature

The 19 bioclimatic variables, related to temperature and precipitation, are used as the primary climatic data for modelling the distribution of P. kersteni. Apart from these bioclimatic variables, the following variables are also included: a) distance to the nearest inland water bodies (rivers, lakes and wetlands), b) average minimum NDVI, c) land-cover, and d) elevation (see Table 5-1 for data sources). Temperature is one of the most important climatic factors in relation to the ecological functioning of Odonata (Corbet, 1962). However, the five most contributing variables (Table 7-1) of the model using 19 bioclimatic variables are mostly related to precipitation; the only temperature related variable is temperature seasonality (bioc_04). These five variables together explain about 36 percent of the deviance. Comparing Figure 7-1 (left) with the range predicted by IUCN (Clausnitzer et al., 2010) (see Figure 5-7), the model output is satisfactory in southern, south-eastern and western Africa, but the horn of Africa and eastern and central Africa are under-predicted. A false positive prediction in the Maghreb region is also observed.

Table 7-1: Contribution of environmental variables in predicting the distribution of P. kersteni using bioclimatic data as climatic variables

| rank | environmental variable * | contribution |
|---|---|---|
| 1 | glc2k | 13.9243 % |
| 2 | bioc_04 * bioc_18 | 7.9070 % |
| 3 | elev * bioc_15 | 5.2346 % |
| 4 | bioc_14 * bioc_18 | 4.8115 % |
| 5 | bioc_17 * bioc_18 | 3.6934 % |

* see Table 5-1 and Appendix B1 for variable notation

7.1.2. Using six selected bioclimatic variables related to precipitation and temperature based on ecological relevance for P. kersteni

Ecologists interested in the spatial distribution of species often prefer to include in models only those variables most likely to be ecologically significant for species presence or absence (Austin, 2007). This reduces the likelihood of introducing unnecessary and unwanted noise into the model. In cooperation with Odonata experts from the IUCN-SSC (International Union for Conservation of Nature - Species Survival Commission) Dragonflies specialist group (here, V. Clausnitzer and F. Suhling), six bioclimatic variables were selected as major climatic variables and thus as covariates for modelling P. kersteni. These six variables are: a) annual mean temperature (bioc_1), b) temperature seasonality (bioc_4), c) temperature annual range (bioc_7), d) mean temperature of coldest quarter (bioc_11), e) annual precipitation (bioc_12), and f) precipitation seasonality (bioc_15). Compared to the prediction using 19 bioclimatic variables, the total contribution to the model deviance of the five most contributing variables (see Table 7-2) almost doubled with the six selected bioclimatic variables (about 64 compared to 36 percent). One of the reasons is the reduced number of variables. Temperature seasonality is clearly the most important climate related variable (appearing three times), and precipitation seasonality represents the only precipitation related variable among the top five contributing variables. The prediction (Figure 7-1 right), compared to the model with 19 bioclimatic variables (left), improved on the estimated range in eastern and central Africa. However, the falsely predicted area in the Maghreb region increased (including Libya and Egypt), as it did in the western part of central Africa. Overall, in terms of spatial distribution as well as the statistical measure of explained deviance, the model with the reduced number of variables performs better than the model using 19 bioclimatic variables: it has fewer false negatives, albeit with some false positives.

Figure 7-1: Modelled distribution range of P. kersteni using different climatic data: a) with the 19 bioclimatic variables (left, chapter 7.1.1), and b) with 6 selected bioclimatic variables (right, chapter 7.1.2)

Table 7-2: Contribution of environmental variables in predicting the distribution of P. kersteni using six-selected bioclimatic data as climatic variables

| rank | environmental variable * | contribution |
|---|---|---|
| 1 | glc2k | 17.7998 % |
| 2 | bioc_04 * bioc_12 | 17.7634 % |
| 3 | bioc_04 | 15.4527 % |
| 4 | bioc_04 * af_ndvi_min | 7.0901 % |
| 5 | bioc_15 | 6.5070 % |

* see Table 5-1 and Appendix B1 for variable notation
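The comparison of "about 64 compared to 36 percent" can be checked by summing the five contributions listed in Table 7-1 and Table 7-2. The dictionary keys below are labels chosen for this sketch.

```python
# Contributions (percent of explained deviance) of the five top-ranked terms
top_five = {
    "19 bioclimatic variables (Table 7-1)": [13.9243, 7.9070, 5.2346, 4.8115, 3.6934],
    "6 selected bioclimatic variables (Table 7-2)": [17.7998, 17.7634, 15.4527, 7.0901, 6.5070],
}

totals = {name: sum(values) for name, values in top_five.items()}
for name, total in totals.items():
    print(f"{name}: {total:.4f} %")

ratio = (totals["6 selected bioclimatic variables (Table 7-2)"]
         / totals["19 bioclimatic variables (Table 7-1)"])
print(f"ratio: {ratio:.2f}")
```

The totals are 35.5708 % and 64.6130 %, i.e. a ratio of about 1.8, which is consistent with the statement that the contribution almost doubled.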

Supplementing ‘selected six bioclimatic variables’ with x-y coordinates for predicting the distribution range of P. kersteni

One of the earliest methods for analysing broad scale spatial patterns in data is trend surface analysis, which, unlike interpolation methods, uses a regression model to estimate the variable at a given location (Legendre and Legendre, 1998). A cubic trend surface is considered sufficient for most ecological phenomena (ibid.). One special application of a geographic trend surface is to account indirectly for spatially structured factors or variations that are otherwise unaccounted for, e.g. unknown but influential covariates missing from the model (Lobo et al., 2006). Although the six selected bioclimatic variables predicted better than all 19 bioclimatic variables (chapter 7.1.2), the falsely predicted areas in northern Africa indicate that influential variables are missing. Here, a trend surface can supplement that information. Apart from using the X-Y coordinates as a trend surface, other variables are allowed to interact (i.e. product interaction) with these X and Y coordinates. Indeed, the inclusion of coordinates as variables for predicting the distribution of P. kersteni (Figure 7-2) effectively removes the false predictions introduced after reducing the number of bioclimatic variables (Figure 7-1, right). Moreover, it also retains most of the other areas positively predicted after reducing the variables, although the western part of central Africa is still predicted as a false positive. Temperature related variables are still the most contributing climate related variables. The interaction of east-west position (coord_x) and north-south position (coord_y) is the second most contributing variable, which shows the effectiveness of including trend surface variables. However, the five most contributing variables explain only about 47 percent of the model deviance, which is lower than when trend surface variables are not included.
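As an illustration of what trend surface terms look like as model covariates, the sketch below builds the nine monomials of a full cubic surface in x and y and one product interaction between a coordinate and an environmental variable. Whether the SpeeDi Tool uses the full cubic set is not stated in the text; the function name, the term labels and the toy values are assumptions made for this example.

```python
import numpy as np

def cubic_trend_terms(x, y):
    """Return names and columns of all monomials x^i * y^j with 1 <= i + j <= 3."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    names, columns = [], []
    for i in range(4):
        for j in range(4 - i):
            if i + j == 0:
                continue                      # skip the constant term
            names.append(f"x^{i}*y^{j}")
            columns.append(x ** i * y ** j)
    return names, np.column_stack(columns)

# Hypothetical sample coordinates (e.g. rescaled longitude and latitude)
coord_x = np.array([0.1, 0.5, 0.9])
coord_y = np.array([0.2, 0.4, 0.8])
names, design = cubic_trend_terms(coord_x, coord_y)
print(names)
print(design.shape)

# Product interaction of a coordinate with an environmental variable,
# analogous in form to the 'coord_x * bioc_07' term of the model
bioc_07 = np.array([12.0, 15.0, 20.0])        # hypothetical values
coord_x_times_bioc_07 = coord_x * bioc_07
```

Coordinates would normally be rescaled before taking powers so that the cubic terms stay numerically comparable with the other covariates.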

Table 7-3: Contribution of environmental variables in predicting the distribution of P. kersteni using six-selected bioclimatic data as climatic variables and supplemented by geographic coordinates

| rank | environmental variable * | contribution |
|---|---|---|
| 1 | bioc_04 * bioc_12 | 12.4657 % |
| 2 | coord_x * coord_y | 12.0706 % |
| 3 | glc2k | 11.5563 % |
| 4 | coord_y | 6.3176 % |
| 5 | coord_x * bioc_07 | 4.9414 % |

* see Table 5-1 and Appendix B1 for variable notation

Figure 7-2: Predicted distribution of P. kersteni based on climate data as 6 selected bioclimatic variables supplemented by x-y geographic coordinates

It is known that adding a trend surface to models yields bias in parameter estimation (Bocquet-Appel and Bacro, 1993). The observed overall response curves for P. kersteni modelled with and without trend surface in the SpeeDi Tool, however, show similar characteristics (Figure 7-3), although the contributions of the polynomial terms differ. Here, the trend surface is able to consider (or counter) spatial bias in the prediction because the trend variables introduce a spatial term into the model. However, care has to be taken as there can be over- or under-compensation. The figure shows four out of six bioclim variables (bioclim 1, 4, 7 and 12, all related to temperature) and elevation.

[Figure 7-3 panels. Columns: With 6 Bioclim variables; With 6 Bioclim variables + X and Y. Rows: Elevation, Bioclim 1, Bioclim 4, Bioclim 7, Bioclim 12]

Figure 7-3: Response curves of five of the predictor variables (elevation and four temperature related bioclim variables) for the predicted distribution of P. kersteni based on climate data of six selected bioclimatic variables and with additional x-y geographic coordinates


Using monthly temperature and precipitation data as main climate variables for predicting the distribution of P. kersteni

The geographic trend surface, used with six bioclimatic variables selected on knowledge based considerations, performed well in making up for missing variables or spatial structures. The bioclimatic variables are synthesised from ‘raw’ monthly climate data. Their spatial patterns can identify and summarise different patterns of climate (Busby, 1991); however, they already act as proxy variables. Using monthly climate data as direct variables may be the better option for relating climate and species occurrence within the context of phenomenon (event) based predictive modelling, because the model itself summarises climate in relation to the environmental regime favourable to the species. Therefore, the climate related variables represented by bioclimatic variables (six or 19) are replaced by 36 datasets (see Table 5-1): twelve monthly datasets each for precipitation, minimum temperature and maximum temperature. The geographic trend surface is not considered, as most of the unknown structural variance may already be captured by the additional and more detailed spatial variables. The additional information provided by the raw climate data is evident from the resulting prediction (Figure 7-4 left, same as Figure 5-6 right). Among the top five contributing variables, temperature related variables contribute more than precipitation related variables (see Table 7-4). The false positives in northern Africa and the western part of central Africa have almost disappeared. However, as a side effect, the distribution range in southern Africa increased and the range in some areas of Tanzania decreased. Nevertheless, the inclusion of the original monthly climate data improved the result considerably.
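The set of 36 monthly climate layers can be enumerated in a few lines. The prefixes and the two-digit month suffix follow the notation that appears in Table 7-4 (e.g. tmin_02, tmax_08, prec_03); that all 36 layers follow exactly this naming pattern is an assumption of this sketch.

```python
# Twelve monthly layers each for precipitation, minimum and maximum temperature
prefixes = ("prec", "tmin", "tmax")
monthly_variables = [f"{prefix}_{month:02d}"
                     for prefix in prefixes
                     for month in range(1, 13)]

print(len(monthly_variables))          # number of monthly climate layers
print(monthly_variables[:3], "...", monthly_variables[-3:])
```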

Table 7-4: Contribution of environmental variables in predicting the distribution of P. kersteni using monthly temperature and precipitation data as climatic variables

| rank | environmental variable * | contribution |
|---|---|---|
| 1 | glc2k | 16.2689 % |
| 2 | tmin_02 | 5.3206 % |
| 3 | tmax_08 | 4.7287 % |
| 4 | prec_03 * prec_06 | 4.6118 % |
| 5 | prec_03 * af_ndvi_min | 3.3738 % |

* see Table 5-1 for variable notation

Role of land-cover data in modelling P. kersteni – effect of classification schemes

Land-cover is often used as the major habitat related parameter in species distribution modelling of fauna (Franklin, 1995). Since classification schemes are purpose built (e.g. Anderson et al., 1976; Loveland et al., 2000; Olsen et al., 2001), they should influence the output. Here, the land-cover datasets from two general purpose classification schemes are compared: GLC2000 and the USGS land use land cover system (LUCS) modified level 2. The GLC2000 dataset, primarily based on SPOT Vegetation satellite imagery from 1999 (Mayaux et al., 2003) with a spatial resolution of 30 arc-seconds (approx. 1 km at the equator), is among the most detailed datasets (see Figure 5-8) in terms of extensive field verification. Before GLC2000, several datasets existed, among them the datasets available from the USGS global land cover characterization (GLCC) with several classification schemes. The GLCC datasets are also based on satellite imagery, but on AVHRR (Advanced Very High Resolution Radiometer) data from April 1992 to March 1993 (Loveland et al., 2000), and also have a spatial resolution of 1 km. However, compared to GLC2000, they are not extensively field verified (ibid.). The classification scheme applied in GLC2000 is based on the FAO classification scheme (Mayaux et al., 2003), and the GLCC dataset considered here is based on the USGS land use / land cover system modified level 2 (Loveland et al., 2000). Assuming that land-cover did not change significantly between 1993 and 2000, the effect of land-cover classification schemes on the predicted distribution of P. kersteni can be tested.

Figure 7-4: Predicted distribution range of P. kersteni based on datasets using different land-cover classification schemes: GLC2000 with FAO scheme (upper left, same as Figure 5-6 right) and GLCC with USGS modified level 2 scheme (upper right); and differing classes in the belt from west to east Africa in the two datasets, GLC2000 (lower-left) and GLCC-USGS (lower-right)

Although the overall patterns are similar between the two predictions (Figure 7-4, left and right), a slightly different range is seen in and a more visible difference in the belt from west Africa to east Africa. This is likely due to the different schemes used to classify land cover and thus vegetation in the datasets. Within the modelled extent, the USGS dataset has 17 classes and the GLC2000 dataset has 16 classes, but they are far from having a one-to-one relationship. A quick visual inspection (Figure 7-4, lower part) shows that most of west Africa is classified as savannah (orange, the dominant class) in the USGS dataset (right), whereas the same area has five different land-cover classes (appearing as belts) in the GLC2000 dataset (left). Further, the USGS savannah class is the most positively contributing land-cover class (ca. 26 %). The equivalent or closest GLC2000 class, shrub cover (class 12), contributes negatively (with only 2.3 %) to the probability of presence. Hence, the modelling is sensitive to the classification scheme, although the overall output is not much different. Despite intriguing differences in the land-cover datasets, large similarities between the modelling results exist. Possible reasons might be found in the other environmental variables or in the location (distribution) of the presence samples.

7.5. Predicting the past and the future distribution of P. kersteni with scenarios for land-cover and climate

Several studies have attempted to reveal the effect of a changed climate on the future distribution range of species (e.g. Beaumont et al., 2005; Thuiller et al., 2004; Thuiller et al., 2006), and some have endeavoured to predict historical species distributions going back thousands to millions of years (e.g. Graham et al., 2004; Yesson and Culham, 2006). However, it would be interesting to look not too far into the past, somewhere in the early to mid twentieth century, and to compare this with the near future; P. kersteni is selected for this exploration. This would allow the changes in species range to be observed in a way that can be useful, e.g. in hind-casting species distributions or in planning conservation strategies. The Climatic Research Unit (CRU) offers high resolution (10 arc minutes) gridded climate surfaces of the Earth from 1901 to 2002 (Mitchell and Jones, 2005). The Climate Change, Agriculture and Food Security (CCAFS) unit of the Consultative Group on International Agricultural Research (CGIAR) hosts gridded climate data of future scenarios (Ramirez and Jarvis, 2008) for several years (at ten-year intervals from 2020 to 2080) based on the fourth assessment report of the Intergovernmental Panel on Climate Change (IPCC). The availability of these climate datasets allows the distributions to be projected to the past as well as to the future. However, the habitat-related variables (past and future land-cover geodatasets) are not readily available, or are available only at a very coarse resolution (both spatially and in the number of classes). The History Database of the Global Environment (HYDE database) 2.0 (Klein Goldewijk, 2001) offers land-cover geodatasets at half a degree spatial resolution from 1700 to 1990 at an interval of 50 years (to 1950) or 20 years (after 1950). When viewing the accompanying animation in 10-year steps, there is a distinctly visible change in land-cover for Africa from 1940 to 1950, mainly related to forested areas. This might be attributed, in part, to large-scale commercial exploitation of forests during and after the Second World War (Counsell, 2009; Nasi et al., 2006; Okali and Eyog-Matig, 2004). To capture this distinctive change in land-cover, the year 1940 is selected. With 2000 as the base year for the current scenario, selecting the year 2050 for the future scenario gives a trajectory of land-cover change over around 100 years. The challenge is to obtain realistic land-cover scenarios of the past and the future.

7.5.1. Developing the land-cover scenario for the year 1940

To develop a scenario of land-cover for around 1940, several rules are defined which are used to assign changes in land-cover classes. The GLC2000 (Figure 5-8) is used as the base dataset and the land-cover classes are sequentially altered as defined by those rules. For creating the scenario, several geodatasets have to be considered, in particular the percentage of grass cover, the percentage of crop cover and the population density of HYDE. HYDE 3.1 (Klein Goldewijk et al., 2007) is the latest dataset providing past grass cover and crop cover as the percentage of area covered in each pixel. HYDE 2.0 has two scenarios for land-use classes; scenario ‘b’ is selected because it provides a higher number of classes than scenario ‘a’ and also because it seems to have a more likely pattern of forest cover for central Africa. Table 7-5 lists the datasets, their sources and spatial resolutions, and the scenario for which each dataset was used. Spatial resolution dictates the amount of detail that can be incorporated in classifying land-cover. Further, to capture some of the forested areas which otherwise seem to have been missed by the hind-casted land-cover, 10-day composites of the earliest available NDVI datasets (1981 – 1982: Tucker et al., 2005) as well as monthly precipitation datasets from 1931 to 1940 (Mitchell and Jones, 2005) are considered. The latest version of the monthly climate time-series dataset of CRU is 2.10 (referred to as CRU-TS hereafter). All these datasets are first spatially harmonised, i.e. transformed to a common spatial reference system (Lambert azimuthal equal-area projection, see chapter 5.2.1) and resampled to a common grid cell size of 8 km.

Table 7-5: Geodatasets used for developing scenarios for 1940 and 2050, with sources and descriptive information

| Dataset | Data reference | Number of datasets | Spatial resolution | Time-period | Scenario for |
|---|---|---|---|---|---|
| GLC2000 (base data) | Mayaux et al. (2004) | 1 | 30 arc seconds | 2000 | past & future |
| Crop cover (%) | HYDE 3.1 | 1 | 5 arc minutes | 1940 | past |
| Grass cover (%) | HYDE 3.1 | 1 | 5 arc minutes | 1940 | past |
| Crop cover (%) | HYDE 3.1 | 1 | 5 arc minutes | 2005 | future |
| Grass cover (%) | HYDE 3.1 | 1 | 5 arc minutes | 2005 | future |
| Population density | HYDE 3.1 | 1 | 5 arc minutes | 1940 | past |
| Population density | HYDE 3.1 | 1 | 5 arc minutes | 2050 | future |
| Land-cover | HYDE 2.0, scenario b | 1 | 0.5 arc degree | 1950 | past |
| FAO land-use | FAO geonetwork | 1 | 5 arc minutes | 2010 | future |
| AVHRR NDVI (10-days composites) | FEWS-ADDC | 36 | 8 km | 07/1981 – 06/1982 | past |
| Precipitation * (monthly) | CRU-TS 2.10 | 120 | 0.5 arc degree | 01/1931 – 12/1940 | past |
| Precipitation (monthly **) | CCAFS-CGIAR | 12 | 5 arc minutes | 2050 | future |
| SRTM elevation | Jarvis et al. (2008) | 1 | 30 arc seconds | 2000 | past |

* for the purpose of past land-cover scenario development only
** assemblage mean of seven models (see 7.5.4)

Rules are created to upgrade the current land-cover classes retrospectively according to natural succession. The current forested areas (tree cover, GLC2000 classes 1 to 6) as well as water bodies and flooded areas (large rivers, lakes and wetlands; classes 7, 8, 15, 20) are maintained as they are. The HYDE land-cover dataset is used to extract the croplands (class 16), grassland/steppe (class 14) and hot deserts (class 19). However, the HYDE dataset seems to overestimate the croplands (compared to current land-cover), which has to be compensated for by the other rules developed. NDVI values can, in part, be used to classify different vegetation cover classes (DeFries et al., 1995; Frederiksen and Lawesson, 1992). DeFries et al. (1995) reported that the mean NDVI value was able to discriminate 68 % of the 78 different class pairs in deriving 12 different land-cover types. Detailed classification of land-cover based on NDVI was not attempted here, but NDVI is used to determine past tree cover. The idea is that if an area was forested in 1981/1982, it is highly likely to have been forested in 1940 as well. Therefore, an average NDVI is calculated from the 36 NDVI datasets and thresholds are applied (see Appendix C1, Table C1-2 for values) to extract areas with tree cover (evergreen, deciduous, mixed) as well as areas with herbaceous cover (closed, sparse or sparse shrubs). The thresholds are determined by randomly sampling average NDVI values for different land-cover classes, and only the classes that provide a clear distinction are used. The combination of percentage of crop cover, grass cover and population density is used to classify sparse herbaceous or shrub cover (class 14), herbaceous cover (class 13), mosaic of cropland and shrub (class 18) and cultivated areas (class 16). Further refinements were applied using the combination of population density and the current artificial surfaces (populated places). If the population density in 1940 was higher than 50 persons per square kilometre and the current class is artificial surfaces (class 22), those pixels are retained for that time period. The precipitation data are used to calculate the average annual precipitation in the decade leading to 1940. If an area has very high average annual precipitation (more than 1500 mm), it is likely that there was very little human activity (no agriculture), and hence any classes which are not tree cover were shrunk by 10 cells, i.e. replaced by the nearest adjacent tree-cover class. Further, at low altitudes (elevation less than 500 m) with high average annual precipitation, the chances of water-logging are high and thus these lowlands can also be considered to have a low probability of agriculture; pixel values of classes comprising areas with agricultural use (16, 17, 18) and herbaceous cover (13, 14) were replaced by the pixel values of adjacent classes. Moreover, people in Africa were mostly living in the highlands because most of the lowlands in Africa were considered endemic for malaria in the past, which is still the case in recent years (Lindsay and Martens, 1998).
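The rule-based assignment described above can be illustrated with a minimal sketch in Python. It covers only three of the rules (keeping tree cover and water as they are, retaining urban cells that were already densely populated in 1940, and detecting tree cover from the mean NDVI). The NDVI threshold, the class given to NDVI-detected tree cover, the fallback class for urban cells that are not retained and the order of the rules are assumptions made for illustration; the actual values and arrangements are those of Appendix C1.

```python
# Sketch of a per-cell 1940 hind-casting rule set on plain Python lists.
# Values marked HYPOTHETICAL are illustrative, not the thesis values (Appendix C1).

TREE_CLASSES = {1, 2, 3, 4, 5, 6}   # GLC2000 tree-cover classes, kept as they are
KEPT_CLASSES = {7, 8, 15, 20}       # flooded areas and water bodies, kept as they are
URBAN_CLASS = 22                    # artificial surfaces
URBAN_MIN_DENSITY_1940 = 50         # persons per square kilometre, from the text
NDVI_TREE_THRESHOLD = 0.6           # HYPOTHETICAL threshold on mean NDVI
NDVI_TREE_CLASS = 2                 # HYPOTHETICAL class given to NDVI-detected tree cover
URBAN_FALLBACK_CLASS = 13           # HYPOTHETICAL class for urban cells not retained


def mean_ndvi(composites):
    """Average equally sized NDVI grids (e.g. 36 ten-day composites) cell by cell."""
    n = len(composites)
    rows, cols = len(composites[0]), len(composites[0][0])
    return [[sum(grid[r][c] for grid in composites) / n for c in range(cols)]
            for r in range(rows)]


def hindcast_cell(current_class, ndvi_avg, pop_density_1940):
    """Return the assumed 1940 class of a single cell."""
    if current_class in TREE_CLASSES or current_class in KEPT_CLASSES:
        return current_class
    if current_class == URBAN_CLASS:
        # Urban cells are retained only if already densely populated in 1940.
        if pop_density_1940 > URBAN_MIN_DENSITY_1940:
            return URBAN_CLASS
        return URBAN_FALLBACK_CLASS
    if ndvi_avg >= NDVI_TREE_THRESHOLD:
        # Forested in 1981/1982 is taken as forested in 1940.
        return NDVI_TREE_CLASS
    return current_class


def hindcast_grid(landcover, composites, pop_density_1940):
    """Apply hindcast_cell to every cell of a land-cover grid."""
    ndvi = mean_ndvi(composites)
    return [[hindcast_cell(landcover[r][c], ndvi[r][c], pop_density_1940[r][c])
             for c in range(len(landcover[0]))]
            for r in range(len(landcover))]
```

A per-cell function keeps each rule readable and testable on its own; a real implementation would operate on harmonised rasters rather than nested lists.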

Figure 7-5: Land-cover scenario hind casted for the year 1940 for modelling the past distribution of Odonata species


While creating the scenario of past land-cover, some of the class assignments had to be reviewed. These classes include water bodies (class 20) and bare areas (deserts, class 19). The water bodies may have been expanded when shrinking non-tree classes based on precipitation. Similarly, areas that are not desert under current conditions are unlikely to have been desert in the past. These classes are selectively restored from the current land-cover dataset. The detailed rules and arrangements for the assignment of classes are presented in Appendix C1. The final land-cover scenario (Figure 7-5) has, in comparison to Figure 5-8, more tree-cover areas in central and western Africa as well as in south-eastern Africa. The central part is mostly broadleaved evergreen trees, typically representing the rainforests, whereas the tree cover in other areas is of deciduous types. Another visible distinction compared to GLC2000 (current condition) is the decrease in bare areas at the Horn of Africa and in the Sahel region (mind the resolution effects), as well as an elongated stretch in western Namibia being replaced by sparse herbaceous or shrub cover. Furthermore, most of the agricultural areas (classes 16, 17 and 18) are replaced mainly by tree cover stretching from southern to eastern Africa and by herbaceous cover (closed-open class) along the Sahel.

7.5.2. Developing the land-cover scenario for the year 2050

As with the land-cover scenario of 1940, certain rules were conceptualised for developing a future land-cover scenario making use of currently available geodatasets, with GLC2000 land-cover as the base dataset. While the scenario for 1940 used retrospective natural succession, the scenario for 2050 assumes degradation of natural vegetation. The FAO land-use system (FAO, 2010) is mainly focused on classifying the agricultural use of land. This enables cultivated and managed areas to be assigned in the future scenario. For defining new urban areas, projected population density (HYDE 3.1) is used to extract these areas based on a threshold of 1000 persons per square kilometre. This is a high value compared with the values for developed countries, e.g. 250 in Australia (Pink, 2011) or 400 in Canada (Statistics Canada, 2012); it is chosen partly owing to the uncertainty in the population projection for Africa but also owing to differences in settlement structure (UN-Habitat, 2011). As the scenario assumes degradation of natural vegetation, the classes representing tree cover (1-6, 9) and shrub cover (12) are shrunk by 1 cell, thus expanding the adjacent land-cover classes. Considering that 30 percent of the current forest cover in central and western Africa is under concession (Laporte et al., 2007), this can be considered a conservative approach that may not even degrade the forest cover enough. Although the FAO dataset is valid for the year 2010, it can be assumed that any classes representing agricultural use will remain cultivated and managed areas in the future. This is also true for the urban areas. So, the classes with agricultural use (19-24 of FAO-LUS) are assigned as cultivated and managed areas. Further, the FAO urban areas are extracted to be used as artificial surfaces, supplementing the areas determined by the population density in the land-cover scenario. Also, the FAO forested areas with high livestock density are assigned as sparse herbaceous or shrub cover. Similarly, the bare areas with low to moderate livestock density in FAO-LUS are assigned as herbaceous cover. From the grass and crop cover datasets of HYDE, further rules are applied to classify the cultivated and managed areas (see Appendix C2 for detailed rules). Nonetheless, areas whose class under current conditions is bare areas (e.g. deserts) are retained. As with the past scenario, some adjustments have to be made to compensate for some unlikely changes. The increase in cultivated and managed areas is driven by the increase in food demand of a growing population. Therefore, an additional rule classifies pixels as managed areas if the new land-cover class is sparse herbaceous or shrub cover, the population density is between 25 and 100 persons per sq. km and the annual average precipitation is between 300 and 500 mm, the amount of precipitation being a driving factor for agricultural activities. Another adjustment is needed where other classes were unintentionally shrunk in favour of, or expanded at the expense of, water bodies; those areas are restored from the original land-cover dataset to obtain the final scenario (Figure 7-6).
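Two of the 2050 rules, shrinking tree and shrub cover by one cell and assigning new urban areas from projected population density, can be sketched as follows. This is only an illustration under stated assumptions: the 4-cell neighbourhood, the fixed order in which neighbours are checked and the use of "greater than or equal to" for the threshold are choices made here, and a GIS shrink operation may resolve competing neighbour classes differently.

```python
# Sketch of two 2050 rules on a small land-cover grid (nested lists of class codes).

SHRINK_CLASSES = {1, 2, 3, 4, 5, 6, 9, 12}  # tree cover (1-6, 9) and shrub cover (12)
URBAN_CLASS = 22                            # artificial surfaces
URBAN_DENSITY_THRESHOLD = 1000              # persons per square kilometre, from the text


def shrink_one_cell(grid, shrink_classes=SHRINK_CLASSES):
    """Replace every shrink-class cell that touches another class by that class.

    ASSUMPTION: 4-neighbourhood, first non-shrink neighbour found wins
    (checked in the order north, south, west, east).
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # read from grid, write to out: exactly one cell
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] not in shrink_classes:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and grid[nr][nc] not in shrink_classes):
                    out[r][c] = grid[nr][nc]
                    break
    return out


def apply_urban_rule(grid, pop_density, threshold=URBAN_DENSITY_THRESHOLD):
    """Assign artificial surfaces where projected density reaches the threshold."""
    return [[URBAN_CLASS if pop_density[r][c] >= threshold else grid[r][c]
             for c in range(len(grid[0]))]
            for r in range(len(grid))]
```

Reading from the original grid while writing to a copy ensures that the classes retreat by exactly one cell per call, regardless of the order in which cells are visited.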

Figure 7-6: Land-cover scenario projected for the year 2050 for modelling the future distribution of Odonata species

7.5.3. Comparison of past and future land-cover scenarios with current situation

A comparison of the land-cover change over time based on today’s situation (year 2000) and the past (year 1940) and future (year 2050) scenarios shows (Figure 7-7, bar chart) a clear trend of decreasing tree cover (classes 1, 2 and 3) and of increasing cultivated and managed areas (classes 16, 17 and 18) as well as of increasing artificial surfaces (class 22: urban areas). Compared to the base year of 2000, the area under tree cover is 50 % (54 % including class 17) larger in 1940 and 45 % (33 % including class 17) smaller in 2050. With the reported 30 % concession of forested areas in central and western Africa already at the beginning of the 21st century (Laporte et al., 2007), and without active protection measures, fuelled by an increasing population (Turok and Parnell, 2009) as well as uncertain political situations across the region (Naudé, 2008; UNEP, 2011), the decrease in area suggested by the scenario for 2050 seems realistic. The regularly flooded areas (classes 7, 8 and 15), which form large wetlands, and large inland water bodies (class 20) are intentionally left unchanged in the scenarios. Likewise, the bare areas (class 19, mostly deserts) are not changed for 2050 owing to uncertainty. The managed and cultivated areas (classes 16, 17 and 18) cover below 7 % of the total area in 1940, increasing to almost 14 % in 2000 and to almost 24 % in 2050. However, it has to be noted that the herbaceous and shrub cover can be used for livestock but has not been included here as agricultural area. Increasing plantations (such as oil palm) replacing the natural tree cover in western Africa and expanding in central and eastern Africa (Germer and Sauerborn, 2008; Greenpeace, 2012) are considered as agriculture related (class 17: mosaic – cropland/tree cover). The shrub and herbaceous covers (classes 12, 13 and 14) differ between the three time steps: shrub cover (class 12) has the largest and sparse herbaceous or shrub cover (class 14) the smallest share in 2000, but the pattern is reversed in 1940 and in 2050, i.e. sparse herbaceous or shrub (class 14) has the largest and shrub cover (class 12) the smallest share. The mosaic of tree cover and other natural vegetation (class 9) is very subjective, and its percentage of cover changes in 1940 and 2050 only because of expanded or contracted neighbouring cells.
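A comparison of this kind amounts to counting cells per class in each scenario grid and expressing the difference relative to the base year. The sketch below assumes that all grids share the same equal-area projection and cell size, so that cell counts are proportional to area; the function names are illustrative and not part of the modelling tool.

```python
from collections import Counter


def class_shares(grid):
    """Percentage of cells per land-cover class in a grid of class codes."""
    counts = Counter(cell for row in grid for cell in row)
    total = sum(counts.values())
    return {cls: 100.0 * n / total for cls, n in counts.items()}


def relative_change(value_scenario, value_base):
    """Change of a scenario value relative to the base year, in percent."""
    return 100.0 * (value_scenario - value_base) / value_base


def group_share(shares, classes):
    """Summed share of a group of classes, e.g. classes 16, 17 and 18."""
    return sum(shares.get(cls, 0.0) for cls in classes)
```

For example, `relative_change(group_share(shares_1940, {1, 2, 3}), group_share(shares_2000, {1, 2, 3}))` would give the relative difference in tree cover between a 1940 grid and the base year, where `shares_1940` and `shares_2000` are hypothetical outputs of `class_shares`.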

Figure 7-7: Area and proportions covered by different land-cover classes and their scenarios for Africa in three time steps: 1940, 2000 and 2050 (colours and land-cover classes match those of the maps in Figures 5-8, 7-5 and 7-6)

7.5.4. Predicting the distribution of P. kersteni with land-cover and climate scenarios for the year 1940 and 2050

Although 30 years are generally used for climatological data, 40 years of data from 1901 to 1940 from the CRU-TS are used to create mean monthly minimum and maximum temperature and mean monthly precipitation. The IPCC fourth assessment report (IPCC, 2007) describes different emission scenarios influencing future climate. The real trend of emissions is uncertain and the values represented by those scenarios vary by large amounts. Here, the A1B emission scenario is considered. This scenario considers a) rapid economic growth, b) population peaking in mid-century, and c) rapid introduction of efficient technologies. Furthermore, it assumes a homogeneous society with increased cultural, religious and social interactions, reduced regional differences in per capita income and a balanced use of fossil and non-fossil energy sources. The A1B scenario represents a situation somewhere in the middle of the extremes represented by all scenarios. Seven different modelled climate datasets 15 for the A1B future scenario are available for the year 2050 from CCAFS with a spatial resolution of five arc minutes. These models also have very contrasting outputs. Hence, an assemblage mean dataset of the seven models was created for monthly minimum and maximum temperature and monthly precipitation. With the past and future climate and land-cover scenario data, predictions can be made for the past and the future distribution of P. kersteni in relation to the climate and the land-cover scenarios. The previous models included minimum annual NDVI as one of the predictor variables, but NDVI is not available for the past and the future. The highest contribution of minimum annual NDVI is less than 3.5 percent, when its interaction with the precipitation of March is considered (see Table 7-4), and hence leaving it out will make little difference. So, a model is trained only with elevation, 36 climate datasets representing the year 2000, distance to water bodies and land-cover (GLC2000). The model parameters are then projected to the assumed conditions in 1940 and 2050 to obtain past and future predicted distributions of P. kersteni.
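The two steps described here, averaging the seven climate models and projecting a model trained on the year 2000 to other conditions, can be sketched for a single grid cell. The sketch assumes a plain logistic model without interaction terms; the coefficient values and predictor names are hypothetical and serve only to show that projection means applying fixed, already fitted parameters to scenario inputs.

```python
import math


def ensemble_mean(model_values):
    """Mean of one climate variable in one cell across several climate models."""
    return sum(model_values) / len(model_values)


def presence_probability(intercept, coefficients, predictors):
    """Logistic response for one cell, given fitted coefficients and predictor values."""
    z = intercept + sum(coefficients[name] * predictors[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))


# HYPOTHETICAL parameters of a model trained on the conditions of the year 2000.
INTERCEPT = -2.0
COEFFICIENTS = {"precip_mar_mm": 0.02, "tmax_jan_degc": -0.01}

# HYPOTHETICAL March precipitation (mm) of one cell from seven climate models for 2050.
precip_2050 = ensemble_mean([80.0, 95.0, 60.0, 110.0, 70.0, 90.0, 55.0])

# The same parameters are applied to the conditions of 2000 and of the 2050 scenario.
p_2000 = presence_probability(INTERCEPT, COEFFICIENTS,
                              {"precip_mar_mm": 120.0, "tmax_jan_degc": 28.0})
p_2050 = presence_probability(INTERCEPT, COEFFICIENTS,
                              {"precip_mar_mm": precip_2050, "tmax_jan_degc": 30.0})
```

Because the parameters stay fixed, any difference between `p_2000` and `p_2050` stems from the changed environmental inputs alone.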

Figure 7-8 shows the predicted distribution of P. kersteni with the model trained on the environmental conditions for the year 2000 (upper right), and the projected predictions for the year 1940 (upper left) and the year 2050 (lower left). The predicted distribution of P. kersteni in 1940 covers a larger area than in the year 2000. Apart from the Horn of Africa, northern Kenya and some parts of Tanzania, the predicted presence range in 1940 is closer to the expected range map of IUCN (Figure 5-7). However, it also predicts more suitable areas in southern Africa, where western Namibia and almost all of South Africa are predicted as presence area. Compared to 1940, the distribution range of P. kersteni for 2000 is mostly reduced, with small reductions in Kenya, west Africa and southern Africa. However, the range in 2000 as compared to 1940 expanded in Tanzania and some parts of Ethiopia, but is reduced again in 2050 (Figure 7-8, lower right, green shades). The predicted range in 2050 lost some areas in and as well as most of the areas in Tanzania and Angola. Further, the link between west Africa and central Africa in is almost lost. Most of the range in Angola, Democratic Republic of Congo and northern Mozambique is lost and gaps are formed in central and southern Mozambique.

15 Models: CCCMA-CGCM3.1, CSIRO-MK3.0, IPSL-CM4, MPI-ECHAM5, NCAR-CCSM3, UKMO-HADCM3, UKMO-HADGEM1 (IPCC 2007)


Figure 7-8: Modelled distribution range of P. kersteni for the year 1940 (upper left), 2000 (upper right), 2050 (lower left), and change in distribution range (lower right). The model is trained with the environmental data from 2000 (see Table 5-1, except minimum NDVI) and projected for the scenarios of land-cover and climate in the years 1940 and 2050

7.6. Role of modelling extent and spatial resolution of geodata in predicting the distribution of Aeshna minuscula

As species distribution models have several ecological applications with a wide range of grain sizes, the effect of the spatial resolution is examined. A. minuscula (also known as Zosteraeschna minuscula and Aeshna dolobrata; Suhling, 2010), a species regionally occurring in southern Africa, is selected. This species occurs mainly on “small mountain streams with pools” (ibid.). The modelling task is completed with five different setups (Table 7-6). The spatial resolution of the model is the primary variable examined, and it is closely linked with the modelling extent. The highest resolution of the available datasets, except NDVI, is 1 km, and thus a fine 1 km grid cell size is selected for modelling. The number of presence samples varies with the grid cell size, as the grid size is one of the factors determining duplicate records (i.e. records falling in the same cell) when sampling the environmental variables (chapter 5.2.1). Further, the weight factor applied to the presence samples based on the cell distance is explored. In chapter 6.2.4, the weight factors were applied based on how far a sample is from an adjacent sample. When the spatial resolution of the model is 1 km, with the same scheme, the linear weight factor will be 0.5 for samples 1 km apart and 1.0 for samples at least 5 km apart. Consequently, samples which are 8 km apart will have a weight factor of 0.5 at 8 km spatial resolution, but the same samples will have a weight factor of 1.0 at 1 km spatial resolution. The comparison of such cases can thus be interesting. The environmental variables for the models are the same as in chapter 5 (i.e. Table 5-1).

Figure 7-9: Photo of Aeshna minuscula (Source: http://www.warwicktarboton.co.za/dfpgs/086minuscula.html)
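The dependence of the weight factor on the grid cell size can be made explicit with a small sketch. It reproduces the two anchor points stated above (0.5 for samples one cell apart, 1.0 for samples at least five cells apart); the linear interpolation in between and the clamping below one cell are assumptions, as the exact scheme is defined in chapter 6.2.4.

```python
def linear_weight(distance_km, cell_size_km, min_weight=0.5, full_weight_cells=5):
    """Weight of a presence sample from its distance to the adjacent sample.

    ASSUMPTION: the weight rises linearly from min_weight at a distance of one
    cell to 1.0 at full_weight_cells cells; distances are measured in cells.
    """
    cells = distance_km / cell_size_km
    if cells >= full_weight_cells:
        return 1.0
    if cells <= 1:
        return min_weight
    return min_weight + (1.0 - min_weight) * (cells - 1) / (full_weight_cells - 1)


# The same pair of samples, 8 km apart, at the two spatial resolutions:
weight_at_8km = linear_weight(8, 8)   # one cell apart
weight_at_1km = linear_weight(8, 1)   # eight cells apart
```

Expressing the distance in cells rather than kilometres is what makes the same pair of samples receive different weights at different resolutions.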

Figure 7-10: A. minuscula sample locations (ODA: Kipping et al ., 2009) and expected presence range based on watersheds (Clausnitzer et al ., 2012)

The northernmost sample of A. minuscula in the Odonata Database of Africa (ODA) lies at a latitude of 18.76 degrees south of the equator. Hence, the northern boundary for modelling is set at 15 degrees southern latitude for the southern Africa extent. Models ‘am1’ and ‘am2’ use the same 5000 background samples used in chapter 5 and are referred to as group 1. Models ‘am3’ and ‘am4’ use 736 background samples, which are the subset of the 5000 samples falling within the modelling extent.


Following the inference from chapter 6.2.1 that one background sample is required for every 90 cells, the number of background samples for the southern Africa extent at 8 km spatial resolution is 650. At a spatial resolution of 1 km, this number would be 64 times higher. The three models ‘am3’ to ‘am5’ are referred to as group 2. Figure 7-10 shows the expected distribution range for A. minuscula, based on watersheds (Clausnitzer et al., 2012).
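The factor of 64 follows from the cell count: refining the grid from 8 km to 1 km splits every cell into 8 × 8 = 64 cells. The short sketch below works this through; the cell count of 58,500 for the southern Africa extent is back-calculated here from the 650 samples and the one-per-90-cells rule, not taken from the data.

```python
CELLS_PER_BACKGROUND_SAMPLE = 90  # from chapter 6.2.1, as cited in the text


def required_background_samples(n_cells, cells_per_sample=CELLS_PER_BACKGROUND_SAMPLE):
    """Number of background samples for an extent with n_cells grid cells."""
    return round(n_cells / cells_per_sample)


def cells_after_refinement(n_cells, coarse_km, fine_km):
    """Cell count of the same extent after refining the grid cell size."""
    factor = (coarse_km / fine_km) ** 2
    return int(n_cells * factor)


# ASSUMPTION: cell count implied by 650 samples at 8 km (650 * 90), not a measured value.
cells_8km = 650 * CELLS_PER_BACKGROUND_SAMPLE
cells_1km = cells_after_refinement(cells_8km, coarse_km=8, fine_km=1)

samples_8km = required_background_samples(cells_8km)
samples_1km = required_background_samples(cells_1km)
```

Under these assumptions the 1 km grid would require 41,600 background samples, which illustrates why model ‘am5’ keeps the number calculated for the 8 km resolution.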

Table 7-6: Different modelling extents and spatial resolutions applied for modelling A. minuscula

| Model no. | Spatial resolution of model | Modelling extent | Number of presence samples | Number of background samples | Spatial resolution for weighting |
|---|---|---|---|---|---|
| am1 | 8 km | Africa | 87 | 5000 | 8 km |
| am2 | 1 km | Africa | 104 | 5000 | 1 km |
| am3 | 8 km | southern Africa | 87 | 736 (1) | 8 km |
| am4 | 1 km | southern Africa | 104 | 736 (1) | 1 km |
| am5 | 1 km | southern Africa | 104 | 650 (2) | 1 km |

(1) subset of the 5000 background samples falling within the southern Africa extent
(2) required number of background samples calculated as per chapter 6.2.1 for 8 km spatial resolution

For models ‘am1’ (Figure 7-11, top left) and ‘am2’ (top right), the predicted presence range of A. minuscula in southern Africa largely matches the expected area (Figure 7-10), except in northern Namibia. There is almost no difference in the overall prediction between the two models, except that model ‘am2’ offers finer detail along the edges (Figure 7-11, bottom). However, comparing with Figure 7-10, it is also seen that some areas in Mozambique are false positives and that the predicted range within the expected area in South Africa expands beyond certain watershed boundaries. There are several small false positive patches much further north of the native region in southern Africa. The models ‘am3’ (Figure 7-12, top left), ‘am4’ (top right) and ‘am5’ (bottom) are exclusively modelled within the expected region, i.e. southern Africa. Similar to the relationship between the models in group 1, the three models here show similarity in the predicted range. One visible difference between the group 1 and group 2 models is that most of the predicted areas outside the relevant watersheds in the group 1 models (Figure 7-11, bottom) do not exist in the group 2 models. When prediction is done using different grid sizes, as in the group 1 models, the differences in predicted ranges are mainly along the edges of the range (Figure 7-13). Most of the change in range is seen in the north, where the changed areas are predicted as absence at 8 km grid size and as class ‘probably-presence’ at 1 km grid size. This indicates that modelling within the expected range, when possible, is able to offer better predictions. However, the group 1 models offer a better similarity of predicted and observed range in southern Namibia, although the predicted range in Namibia is still smaller than the expected area based on the watersheds. Another observation is that the samples are rather well distributed. Because of this, the weighting of the samples has less influence. Nevertheless, all five models predicted species absence in northern Namibia, contrary to the expected range according to Figure 7-10. Suhling (2010) states that the species has not been seen in the region for more than 80 years; thus the models’ prediction of absence in those areas can be regarded as a correct prediction.


Figure 7-11: Predicted presence range for A. minuscula based on backgrounds sampled over the entire continent and different spatial resolutions of environmental geodatasets, 1 km (top, left) and 8 km (top, right) and the difference in predicted range in the southern Africa due to change in resolution (bottom)

The use of bioclimatic data is common among the environmental variables considered in modelling species distribution. A subset of only ecologically meaningful variables for a particular species is often selected to reduce model complexity and also to reduce data noise. The use of a subset may result in reduced information regarding the spatial structure when applied over a large geographic range, e.g. at continental level. The use of a geographic trend surface can compensate for the lack of broad-range spatial structure. However, the use of the monthly climate data from which the bioclimatic variables are derived includes the required spatial structure and offers better prediction output.


Figure 7-12: Predicted range of A. minuscula modelled only within the species’ native region of southern Africa; top (using 736 background samples) left: environmental variables with 8 km grid size, right: environmental variables with 1 km grid size, bottom: with 650 background samples and environmental variables with 1 km grid size

Figure 7-13: Difference in predicted range of presence of A. minuscula when modelled within the native region of southern Africa at different spatial resolutions of 1 km and 8 km using 736 background samples

Land-cover is one of the ecologically important habitat variables used in species distribution modelling. Different classification schemes result in different patterns of the landscape. Although the use of two different general-purpose schemes of land-cover classification resulted in similar ranges of distributions, some distinctly visible differences indicate that an aggregation of land-cover classes specific to the species concerned may result in better response curves.

One of the uses of species distribution models is to assess the change in distribution over time. While physical elements such as elevation and drainage networks remain relatively stable, climate and land-cover change over time. The hind-casted land-cover scenario of 1940 and the forecasted scenario of 2050, together with the past recorded and the future projected climate, facilitated modelling the change in the species distribution range. The three time steps show the potential change in the distribution range of P. kersteni over more than 100 years due to changes in environmental conditions.

Although the spatial resolution of environmental data offers different levels of detail, no influence of changing the grid cell size from 8 km to 1 km is observed for A. minuscula. The observations are consistent irrespective of the two different extents (Africa, southern Africa). However, selecting different extents resulted in different ranges of prediction.


8. Discussion and outlook

8.1. Predicting the spatial distribution of Pseudagrion kersteni by means of the SpeeDi Tool and sensitivity of modelling parameters

8.1.1. Predicted distribution of P. kersteni and effect of samples

The comprehensive Odonata Database of Africa (ODA: Kipping et al., 2009) offers an opportunity for modelling the African Odonata species. For this purpose, the SpeeDi Tool has been developed with an improved modelling method using elastic-net regularisation for binary logistic regression (chapter 4). To accommodate presence-only samples, background samples are used based on the expectation-maximisation concept (chapter 4.5.1). Modelling the distribution of P. kersteni showed the potential of the tool, which predicted most of the expected areas but also missed some areas (Figure 5-6). Modelling parameters, including the samples, are the likely sources for the well-predicted and not so well-predicted areas. Here, three different types of samples come into play: a) presences, b) backgrounds, and c) absences.

Distribution of presence samples – The first question is whether the SpeeDi Tool can satisfactorily predict the distribution range. The concept of a virtual species with presence and absence samples in a real landscape was applied to test the prediction ability (chapter 6.3.1). Taking Figure 6-8 left (green shade) as the true distribution range and Figure 6-8 right (red shade) as the prediction shows that the prediction is rather close to the assumed condition. The tool predicts with an accuracy of more than 94% on both the training and the evaluation datasets (Table 6-11). This indicates that the method implemented in the tool functions adequately for modelling the distribution of species. The prediction has an AUC value of 0.9873 for the training dataset; this is higher than the value (0.97) obtained by Wisz and Guisan (2009) when studying pseudo-absence selection strategies. The next step is to establish whether the expectation-maximisation algorithm is effective when modelling with presence and background samples. The same virtual species concept was used, but this time with background samples instead of absence data (chapter 6.3.2). Until now the virtual species concept has only been used in presence-absence modelling scenarios (e.g. Elith and Graham, 2009; Santika and Hutchinson, 2009; Wisz and Guisan, 2009) but not with presence and background samples. The non-inclusion of absence data reduces the information available to the model. Although the AUC score of 0.8275 for presence-background data is lower than that for presence-absence data, the value still indicates high model performance (cf. Hosmer and Lemeshow, 2000). Considering the large information gaps caused by the non-availability of absence localities, one would expect the prediction to ‘miss’ large areas; however, the prediction (Figure 6-9, left) closely resembles the assumed true distribution range (Figure 6-8, left). This removes the doubt regarding the applicability of background samples in the SpeeDi Tool.

Here, both the presence-absence and the presence-background datasets used in combination with the virtual species concept are able to predict the species range reasonably well. One condition shared with Wisz and Guisan (2009) is that the presence, absence and background samples are distributed randomly but evenly across the landscape, which contributes to the higher prediction accuracy, whereas in real-world scenarios the presence samples are often biased (Austin, 2007; Phillips and Dudík, 2008; Phillips et al., 2009). How much could this sampling bias affect the prediction? The sensitivity test with the field-presence samples and the pseudo-absences from expected absence areas provides this information. The ODA has data gaps, particularly in central Africa and central Angola, parts of Mozambique and central Tanzania (see chapter 5.2), and P. kersteni is affected by these data gaps. Although the prediction based on the presence (field sample) and (pseudo-)absence (outside the virtual species range) datasets has an AUC of 0.9881, the predicted range (Figure 6-9, right) is smaller than when modelled with either the presence-absence or the presence-background datasets (both with the virtual species concept). This shows that the presence samples of P. kersteni in the ODA do not fully represent the entire favourable environmental niche. In a real-world scenario, where the additional information that absence data could provide is not available, the output can therefore be expected to miss some of the expected areas, as can be seen in Figure 5-7. Further, “mismatches between prediction and observation do not necessarily mean that the model is wrong, but simply that the model might be incomplete” (Araújo and Peterson, 2012 p1533).

Sample density of presences and duplicate records – The number of P. kersteni presence samples in the ODA is 1026 (16 of which are without coordinates). At a spatial resolution of 8 km, some samples fall into the same grid cell and form duplicate records. Using all 1010 georeferenced samples would affect the statistical significance of the prediction by falsely inflating the statistical tests during model fitting. Hence, such duplicate records were deleted. Further, records within a radial distance of 6 km of another record were also deleted. This reduced the number of samples to 496 (chapter 5.2.1). The sensitivity analysis regarding the weights applied to the samples (Figure 6-7) revealed that the uneven sampling intensity in different regions (see Figure 8-1) has broad implications for the prediction. Phillips et al. (2009) termed the differing density of presence samples sampling bias, citing the opportunistic nature of data collection, which favours more easily accessible areas, and data from museum and herbarium records. In order to counter the bias in the sampling of presence data, Phillips et al. sampled background data with the same bias, i.e. densely sampling background samples in the areas where presence samples are also dense. Applying this approach to 226 species occurring in different regions, the authors reported improved prediction compared to Elith et al. (2006), who used the same 226 species data. Although the biased sampling of background samples by Phillips et al. (2009) performed well, a different approach, balancing the density by means of weights, was explored here, in particular because Lobo et al. (2010) advise against sampling (pseudo-)absence or background data from localities which are geographically close to presence samples. Moreover, the uncertainty regarding the population prevalence is further complicated by biased sampling of background data. However, the prevalence itself is unknown, whether the samples are biased or not.
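The two-stage reduction of presence records described in this paragraph (one record per grid cell, then removal of records lying within 6 km of a retained record) can be sketched as follows. This is a minimal illustration only and not the SpeeDi Tool's implementation: the function name `thin_records`, the greedy processing order and the use of projected coordinates in kilometres are assumptions made for the example.

```python
import math


def thin_records(points, cell_km=8.0, min_dist_km=6.0):
    """Reduce presence records in two stages (illustrative sketch).

    points: list of (x_km, y_km) tuples in a projected coordinate
    system with kilometre units (an assumption of this example).
    """
    # Stage 1: keep only the first record that falls into each grid cell.
    seen_cells = set()
    one_per_cell = []
    for x, y in points:
        cell = (math.floor(x / cell_km), math.floor(y / cell_km))
        if cell not in seen_cells:
            seen_cells.add(cell)
            one_per_cell.append((x, y))

    # Stage 2: greedy thinning. A record is kept only if it lies
    # farther than min_dist_km from every record kept so far.
    kept = []
    for x, y in one_per_cell:
        if all(math.hypot(x - kx, y - ky) > min_dist_km for kx, ky in kept):
            kept.append((x, y))
    return kept
```

Because the thinning is greedy, the outcome depends on the order of the input records; a production implementation would need a documented, reproducible ordering.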

Figure 8-1: P. kersteni samples (ODA: Kipping et al., 2009) measured as kernel density within an 80 km radius; also indicated is the expected habitat range corresponding to the watershed-based distribution map (Clausnitzer et al., 2012)

Using equal weights for the presence samples predicted correctly at the densely sampled locations in southern Africa but also over-predicted the range in the same region (see Figure 6-7, top left). At the same time, the sparsely sampled areas in western and central Africa were under-predicted. The use of a distance-based factor to adjust for sampling intensity reduced the over-prediction in southern Africa and also partially improved the prediction in western Africa (see Figure 6-7, bottom left). This exercise countered the uneven distribution by introducing lower weights for areas with dense samples and higher weights for areas with sparse samples. However, as evident from Figure 8-1, the regions differ in sampling intensity: southern Africa has a large number of locations with proportionally fewer duplicate records, whereas western Africa has few locations but proportionally more duplicate coordinate records (Figure 5-2). The manual adjustment of sample weights accounted for this regional sampling bias with higher weights for western Africa and lower weights for southern Africa. Further, a weight factor was applied to re-adjust each weight based on the distance to the nearest sample, i.e. to adjust for local sampling bias (chapter 5.2.1). This significantly improved the prediction (Figure 6-7, top right), with the predicted area having a shape similar to the expected range. It shows that sampling bias (or density) has to be balanced or harmonised regionally as well as locally. The explored mathematical concept used only a single step for assigning weights to the samples, whereas the manual adjustment used two steps (regional and local adjustment). The manual adjustment of weights undertaken here served to explore the effects of weights as well as to diagnose the underlying problem of sampling bias. However, manual adjustment of weights is open to abuse for manipulating the predicted distribution. An ideal method should rest on a ‘sound’ concept and an algorithmic adjustment, but a suitable concept is far from ready and remains to be investigated. As the two-step strategy of the manual adjustment worked well, it can be used as a starting point. One of the steps, the weight factor based on distance, is already working, and finding a proper algorithm to ‘mimic’ the balancing of regional density may provide a useful solution.
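The two-step idea (a regional factor combined with a local, distance-based factor) can be illustrated with a small sketch. The formula below is hypothetical and is not the weighting scheme of chapter 5.2.1: the names `two_step_weights`, `regional_factor` and `cap_km`, the linear scaling of the nearest-neighbour distance and the normalisation to a mean weight of 1 are all assumptions chosen only to show how the two adjustments could be combined.

```python
import math


def nearest_neighbour_distances(points):
    """Distance from each point to its nearest other point (inf if alone)."""
    distances = []
    for i, (x, y) in enumerate(points):
        others = (
            math.hypot(x - ox, y - oy)
            for j, (ox, oy) in enumerate(points)
            if j != i
        )
        distances.append(min(others, default=math.inf))
    return distances


def two_step_weights(points, regions, regional_factor, cap_km=80.0):
    """Combine a regional and a local (distance-based) weight per sample.

    points: list of (x_km, y_km); regions: region label per point;
    regional_factor: dict mapping region label to a multiplier
    (hypothetical values supplied by the analyst).
    """
    # Local step: isolated samples get weights near 1, clustered samples
    # get small weights. Duplicates (distance 0) would get weight 0, so
    # duplicate records are assumed to have been removed beforehand.
    local = [min(d, cap_km) / cap_km for d in nearest_neighbour_distances(points)]
    # Regional step: scale by the factor of the sample's region.
    raw = [regional_factor[r] * w for r, w in zip(regions, local)]
    # Normalise so that the mean weight equals 1.
    mean_raw = sum(raw) / len(raw)
    return [w / mean_raw for w in raw]
```

For example, three clustered samples labelled "south" and one isolated sample labelled "west", with a regional factor of 2.0 for "west", yield a much larger weight for the western sample than for any southern one.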

Background samples – The use of background samples together with presence samples can yield good prediction results (Elith et al., 2006; Phillips et al., 2009; see also Figure 6-9, left). However, there is no standard rule for the required number of background samples, which can depend on the number of presence samples (Phillips and Dudík, 2008). Table 6-6 shows that increasing the number of background samples increases the AUC score of the prediction. This can be linked to the fact that a larger number of background samples captures and represents the range of values of the environmental variables better. This result is similar to that of Phillips and Dudík (2008), where the average AUC score of MaxEnt models of 226 species increased before saturating once the number was increased above 8000. However, it has to be noted that AUC alone should not be used to evaluate a model’s predictive power (Lobo et al., 2008); the graph of sensitivity and specificity has to be considered too for a better understanding of the prediction (Jiménez-Valverde et al., 2009). As evident from the sensitivity and specificity graph (Figure 6-3), a higher number of background samples gives a better true negative rate with a possible increase in the type-II error (i.e. false negatives), while a lower number gives a better true positive rate with a likely increase in the type-I error (i.e. false positives). Hence, a compromise has to be made based on the purpose for which the modelling is performed (e.g. predicting the distribution of rare species to allocate conservation areas versus predicting the potential expansion range of invasive species). However, modelling based on presence and background samples only is likely to over-estimate false negatives (i.e. absences), as no detailed information about the absence area is available.
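The evaluation measures discussed in this paragraph can be made concrete with a short sketch that computes sensitivity, specificity and AUC from predicted probabilities. It is an illustration under stated assumptions rather than the evaluation code of the SpeeDi Tool: the function names are invented for the example, and choosing the cut-off at the point where the sensitivity and specificity curves cross is only one common reading of a cut-off ‘based on the sensitivity-specificity curves’. In a presence-background setting the label 0 denotes background samples, not confirmed absences.

```python
def sensitivity_specificity(probs, labels, threshold):
    """True positive rate and true negative rate at a given cut-off."""
    tp = fn = tn = fp = 0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1  # type-II error (false negative)
        elif predicted == 0:
            tn += 1
        else:
            fp += 1  # type-I error (false positive)
    return tp / (tp + fn), tn / (tn + fp)


def auc(probs, labels):
    """AUC as the share of (presence, non-presence) pairs ranked correctly."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def crossing_threshold(probs, labels, steps=100):
    """Cut-off on a regular grid where sensitivity and specificity are closest."""
    def gap(t):
        sens, spec = sensitivity_specificity(probs, labels, t)
        return abs(sens - spec)

    return min((i / steps for i in range(steps + 1)), key=gap)
```

Lowering the cut-off below the crossing point favours sensitivity (fewer missed presences), which may suit conservation planning for rare species; raising it favours specificity.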

Absence samples – The model using presence-absence data for predicting the distribution of P. kersteni has been tested with the virtual species concept because absence samples are not available. Selecting absence locations with higher certainty (Jiménez-Valverde et al., 2009) will complement the species presence locations obtained from herbarium and natural history museum collections. So far, absence information is often not recorded, but the opinions of experts who regularly make field visits could be valuable. Hence, the practice of generating absences should combine sound (e.g. logical) judgement with the active involvement of experts from the respective fields (e.g. inputs from field visits). Here, models using presence-background data were tested with virtual species as well as real-case scenarios (chapter 6.3). Absence data have not previously been used when modelling species distribution in a presence-background scenario. Here, the real potential of using absence data in a presence-background modelling scenario, although with a very limited number of absences, has been demonstrated (see Figure 6-10). The prediction is improved by the additional information provided by well-devised absence samples around the Okavango delta in southern Africa (see chapter 6.3.4). The environmental information provided by the absence data around the delta makes it possible to discriminate suitable from unsuitable conditions. Several other places could similarly have been used to generate reliable absence data; however, the intention here was only to demonstrate the functionality and usefulness of the SpeeDi Tool in supplementing presence-background modelling with auxiliary absence information.

The SpeeDi Tool introduces an additional method for modelling species distribution with presence, absence and background datasets. The use of the virtual species concept with background data, together with the performance of the model in presence-absence and presence-background modelling, provided a proof of concept that substantiates the ability of the SpeeDi Tool. The concept used for generating (pseudo-)absences can also be used for generating presences at known gaps to improve the model prediction. To reduce the influence of such pseudo-presences, they can be assigned a lower weight than field-collected data. However, this concept should be applied carefully so as to avoid direct or deliberate manipulation. Otherwise, the modelling result would no longer be based on field data but would resemble a virtual species scenario, and would lose its integrity for practical applications. Until now, regression-based modelling has been limited to presence and (pseudo-)absence scenarios and to presence-background scenarios (MaxEnt and ecogbm) only. Although studies have compared the predictive ability of different modelling methods (presence and absence, or presence and background) in different data scenarios (Brotons et al., 2004; Elith et al., 2006; Phillips et al., 2006), the different data requirements of each modelling method (e.g. type: presence/absence/background) have not allowed direct comparison (Austin, 2007). Since the SpeeDi Tool can be used for presence and absence, presence and background, or presence, absence and background scenarios, its results can serve as a ‘link’ for comparing different modelling methods (i.e. the output of the SpeeDi Tool can act as a benchmark for comparison).

8.1.2. Sensitivity regarding regression control parameters for modelling of P. kersteni

Elastic-net factor – Regression-based models have used penalisation (regularisation) techniques to ward off over-fitting of model parameters. Methods using L1-regularisation are reasoned to perform better (Phillips and Dudík, 2008). When the elastic-net factor is 0 (i.e. L2-regularisation), the model has the lowest classification accuracy, whereas the model with an elastic-net factor of 1.0 (i.e. L1-regularisation) has the highest classification accuracy (Table 6-1). L2-regularisation has a tendency to over-predict the species range (see Appendix A1, a). In contrast, L1-regularisation tends to slightly under-predict the species’ spatial distribution (see Appendix A1, d). Figure 6-1 shows how the number of variables selected by the model changes with the elastic-net factor. The interesting and important task is, however, to find an optimal value. In the figure, the elastic-net value of 0.2 lies close to the point where the slope of the curve changes abruptly. Incidentally, at this value the predicted distribution of P. kersteni is, among the results, close to the expected distribution (Appendix A1). This effect of the elastic-net is expected, as the L2 characteristic produces a grouping effect for highly correlated variables while the L1 characteristic makes the model sparse and filters noisy variables (Zou and Hastie, 2005). As the value increases from zero, the L1 characteristic gets stronger, thus reducing the effect of any noise in the data. This, in turn, reduces over-prediction. When the value is closer to one, the L2 characteristic gets weaker and the grouping effect on the correlated variables is lost, resulting in fewer variables in the model. The stronger L1 characteristic means that very few variables contribute to the model, which results in slight under-prediction. Thus, the elastic-net factor is a sensitive and very useful parameter in modelling species distribution, where the predictor variables are often correlated and, once different interactions are considered, the number of predictor variables becomes large.
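Why an elastic-net factor of 0 keeps all variables while a factor of 1 removes many of them can be seen from a single penalised update step. The sketch below assumes the commonly used parameterisation in which the penalty is lam * (alpha * |b| + (1 - alpha) / 2 * b^2), with alpha as the elastic-net factor; this matches the behaviour described in the text (0 gives pure L2, 1 gives pure L1), but the exact formulation in chapter 4 may differ. The names `elastic_net_prox`, `lam` and `step` are hypothetical.

```python
def soft_threshold(b, t):
    """Shrink b towards zero by t and set it to exactly zero if |b| <= t."""
    if b > t:
        return b - t
    if b < -t:
        return b + t
    return 0.0


def elastic_net_prox(coefs, lam, alpha, step=1.0):
    """Proximal update for the elastic-net penalty on a coefficient list.

    alpha = 0 gives pure L2 shrinkage (no coefficient becomes exactly zero);
    alpha = 1 gives pure L1 soft-thresholding (small coefficients vanish).
    """
    l1 = step * lam * alpha
    l2 = step * lam * (1.0 - alpha)
    return [soft_threshold(b, l1) / (1.0 + l2) for b in coefs]


coefs = [2.0, -0.3, 0.05]
ridge_like = elastic_net_prox(coefs, lam=0.5, alpha=0.0)  # all shrunk, none zero
lasso_like = elastic_net_prox(coefs, lam=0.5, alpha=1.0)  # small ones set to zero
```

Intermediate values of alpha combine both effects, which is consistent with the gradual decrease in the number of selected variables described for Figure 6-1.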

Assumed population prevalence in background samples – The use of background samples instead of absence samples offered a novel approach to modelling species distribution (Elith et al., 2006; Ward et al., 2009). However, this approach comes with a caveat in the form of the population prevalence, i.e. the proportion of presence samples. This is because the background samples contain unknown presences and absences; in other words, they are absences contaminated with presences (Elith and Leathwick, 2007; Phillips et al., 2009; Ward et al., 2009). As shown by Ward et al. (2009), the true value of the prevalence is not directly identifiable, but the effect of prevalence can be simulated, and the estimated parameters can still be close to the true values. Moreover, the estimated probability values shift when the assumed prevalence changes. However, sensitivity analysis can be used to obtain a closer estimate of the prevalence (ibid.). Likewise, the amount of shift is deterministic, which can be seen in the mathematical formulation (see chapter 4.5.1, equations 2 and 3) and the graphical illustration in Figure 6-2. The result (Table 6-2) shows that, as long as sensitivity and specificity are considered for binary classification (see footnote 16), the population prevalence does not influence the model’s prediction ability. This is a positive outcome supporting the use of background samples. Although a different concept was used to determine the effect of prevalence, this result is similar to that of Jiménez-Valverde et al. (2009). The studies differ in how prevalence is accounted for: Jiménez-Valverde et al. tested the actual ratio of presences to the sum of presences and absences for a virtual species, whereas here only the assumed prevalence in the background samples is considered for a real species with actual field data. Since the actual prevalence in background samples is unknown, the assumed prevalence in the background carries uncertainty. However, the uncertainty is reduced here by a soft-buffer-threshold (SBT) mechanism (see chapters 4.5.1 and 6.1.3). The correlation among the probabilities predicted with different prevalence values for the background samples increased when the SBT was applied (compare Tables 6-3 and 6-5). Moreover, the large shifts in the cut-off thresholds associated with the prevalence were reduced and the thresholds became more stable (compare Tables 6-2 and 6-4). The reason for this positive outcome lies in the way the SBT operates. The SBT adjusts the probability value of each background sample based on the relation among the dynamically updated overall prevalence (not just the prevalence in the background), the initially assumed prevalence in the background samples and the EM-estimated probability value of each sample (see chapter 4.5.1, equation 4). Thus, the concept introduced, and the option of applying the SBT in the SpeeDi Tool, have proved useful in reducing the uncertainty associated with population prevalence in background samples (e.g. reduced fluctuation in the cut-off threshold value) when modelling the distribution of P. kersteni. Since the concept is not specific to one species, it should be useful for modelling the distributions of other species too.

16 The cut-off percentage used in the table is based on the sensitivity-specificity curves.

Polynomials and product interactions of model covariates – Logistic regression, being a member of the family of generalised linear models, can include polynomials as model variables. Further transformations can be added by allowing products of two continuous variables, which increases the information available to the model as well as the model’s complexity. However, applying a model selection method (e.g. forward selection) can be difficult when polynomials and interactions are included (Engler et al., 2004). L1-regularisation can perform the model selection process (Tibshirani, 1996). An elastic-net factor greater than zero ensures the presence of the L1 characteristic in the modelling and hence eliminates this difficulty. For P. kersteni, different combinations of polynomials and interactions were tested (see chapter 6.2.3). Although the AUC and accuracy values of the models were similar (Table 6-9), the observed spatial patterns differed, with the third-degree polynomial with product interactions (LQCP; see footnote 17) performing best among the five models. Further, the result revealed that product interactions of variables are needed for better prediction. Table 6-10 shows that, for P. kersteni, the product term formed by the precipitation of March and the precipitation of June is an important predictor variable. This observation (the usefulness of interactions) is in agreement with Maggini et al. (2006), who modelled 18 forest communities using the GRASP modelling tool and found that simple (product) interactions of variables are efficient for controlling the effect of one predictor variable by another. For predicting the distribution of P. kersteni, interaction terms were useful in mining additional information from the combined effects of predictor variables, for example the precipitation of March and June. Another observation is that the broad spatial patterns are similar whenever the model includes the product interaction terms. The results suggest that models without product interactions do not predict well even with third-degree polynomials (LQC) (cf. Figure 6-4). The assumption of a first-degree polynomial (L) relationship between a predictor variable and species occurrence is not suitable because the species response to the environment is generally non-linear (Austin, 2007). A quadratic (LQ) relationship allows symmetric unimodal responses to be modelled (Guisan and Zimmermann, 2000); however, the responses of species to environmental variables are often skewed and unimodal (Austin, 2002b; Austin, 2007; Figure 6-6). The third-degree (LQC) relationship allows skewed relationships to be simulated, and higher polynomial terms allow skewed and bimodal relationships (Guisan et al., 1999). Polynomials of a degree higher than cubic, despite being able to simulate skewed responses, produce artificial and ecologically unrealistic response shapes which are not easily interpretable (Guisan and Zimmermann, 2000; Legendre and Fortin, 1989). Although, as observed, the first-degree (L) and second-degree (LQ) polynomials together with product interactions (P) predict a species range close to the one predicted by the third-degree polynomial (LQC) with product terms, the cubic relationship is more suitable because of the skewness in the relationship between species presence and the environmental variables. Hence, the use of third-degree polynomials and product interactions is suggested for modelling species distribution.
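The L, Q, C and P terms discussed in this paragraph can be generated from the raw predictor values as in the following sketch. It only illustrates the feature expansion and makes no claim about how the SpeeDi Tool builds its design matrix; the function name `expand_features` and the variable names `prec_mar` and `prec_jun` are hypothetical.

```python
from itertools import combinations


def expand_features(sample, degree=3, products=True):
    """Expand one sample into polynomial and pairwise product terms.

    sample: dict mapping variable name to its (continuous) value.
    degree: 1 = L, 2 = LQ, 3 = LQC.
    products: add the product of every pair of variables (P).
    """
    terms = {}
    for name, value in sample.items():
        for power in range(1, degree + 1):
            label = name if power == 1 else f"{name}^{power}"
            terms[label] = value ** power
    if products:
        for (a, va), (b, vb) in combinations(sample.items(), 2):
            terms[f"{a}*{b}"] = va * vb
    return terms


# LQCP expansion of two hypothetical precipitation variables.
lqcp = expand_features({"prec_mar": 2.0, "prec_jun": 3.0})
```

With k continuous variables the LQCP expansion produces 3k + k(k - 1)/2 terms, which is why a selection mechanism such as the L1 characteristic of the elastic-net becomes important as k grows.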

8.2. Role of environmental data in the prediction of species distribution with the SpeeDi Tool

8.2.1. Climate data and geographic trend surface for predicting the distribution of P. kersteni

17 L, Q, C and P – linear (straight line), quadratic, cubic and product

Bioclimatic dataset – Bioclimatic variables are commonly used as environmental data in predicting species distribution (e.g. Simaika et al., 2013; Watling et al., 2012; Willems and Hill, 2009). Their attraction lies in the fact that bioclimatic variables represent the annual (seasonal) trend, summarising the climatic variations and stratification irrespective of geographic location (Busby, 1991), i.e. in either the northern or the southern hemisphere; this matters particularly because the seasons in one hemisphere are opposite to those in the other. This alone offers their greatest advantage: harmonised seasonal trend data (Watling et al., 2012). Further, linking occurrence events to summarised bioclimatic data is easier to explain (ecologically), as the level of abstraction increases in the synthesised condition. Despite this advantage, the predictions for P. kersteni using all 19 and the six selected bioclimatic variables (see Figure 7-1) do not provide the expected species range; in particular, they predict presences at locations where the species is absent (e.g. north of the Sahara). Additionally, the prediction made using all 19 bioclimatic variables shows under-prediction (false absences, e.g. in central Africa). The under-prediction may have resulted from excess noise in the environmental dataset. Noise is common in distributional data (Jiménez-Valverde et al., 2009). It stems from multiple sources, and its effects depend on the level of contamination and its general spatial structure (ibid.). Although the elastic-net can handle noise in data (Hastie et al., 2009), excessive noise may not have been filtered completely. Some of the noise may not have been detected and/or filtered because the synthesised trend may have concealed it or because it may not be detectable above a certain spatial resolution. The selection of six ecologically significant bioclimatic variables (bioc_01, bioc_04, bioc_07, bioc_11, bioc_12, and bioc_15; see Appendix B1), especially targeted at Odonata species, reduced the under-prediction; however, it then increased over-prediction (false positives, e.g. on the west coast of central Africa). 
Using indirect variables in species distribution modelling is often not recommended, and such variables should only be used if an explanation cannot be obtained from the original variables from which they are derived (Austin, 2002a; 2007). To some extent the bioclimatic variables are used as proxy variables, e.g. for the species’ limiting range in temperature and/or precipitation extremes (Huntley et al., 2011). Stankowski and Parker (2010) observed that using six bioclimatic variables (mean annual temperature and precipitation, average temperature of the warmest and coolest month, average seasonal precipitation of the warmest and coolest quarter) for logistic regression modelling of the distributions of 30 willow species in Ontario over-predicted the range and was the least accurate of three different variable selection strategies. The result in chapter 7.1.2 agrees with Stankowski and Parker regarding the over-prediction of the range when using pre-selected bioclimatic variables. The reduction in false absences can be linked to the reduced level of noise in the environmental data. The falsely predicted Maghreb region and the species’ known presence area in the southern African coastal areas are similar in their broad-scale climatic zones (Köppen-Geiger climate zones; Peel et al., 2007). This shows one of the limitations of using bioclimatic variables for species distribution modelling and indicates that some important variables are missing from the model. Further, the increased false presences north of the Sahara when using six bioclimatic variables indicate that some sort of spatial structure may also be needed. Geographic trend surface analysis is one way to include such spatial structure (Legendre and Legendre, 1998). When the x-y coordinates were added to the predictor variables, the prediction improved (see Figure 7-2), reducing the over-prediction in the Maghreb region. 
Here, the pre-selection of the final model variables not only reduced the number of variables and the model’s complexity but, together with the added geographic trend surface, also offered a better prediction result. The addition of geographic coordinates (x and y) as predictor variables captured the spatial trend and hence provided the missing broad-scale spatial structure in the regression model, which reduces false results (cf. Ferrier et al., 2002). The results (chapters 7.1 and 7.2) show that: a) bioclimatic data alone cannot predict the species range satisfactorily, and b) the bioclimatic data lose spatial structure in the synthesis process.
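Adding a first-order geographic trend surface amounts to appending the x and y coordinates of each sample to its predictor values, as in the short sketch below. The function name `add_trend_surface` and the standardisation of the coordinates (zero mean, unit standard deviation) are assumptions of this example; the text states only that the x-y coordinates were added to the predictor variables.

```python
from statistics import mean, pstdev


def add_trend_surface(predictors, coords):
    """Append standardised x and y coordinates to each predictor row.

    predictors: list of rows (lists of environmental values).
    coords: list of (x, y) tuples, one per row.
    """
    def standardise(values):
        m, s = mean(values), pstdev(values)
        # A constant coordinate carries no trend; map it to zero.
        return [(v - m) / s if s > 0 else 0.0 for v in values]

    zx = standardise([x for x, _ in coords])
    zy = standardise([y for _, y in coords])
    return [list(row) + [x, y] for row, x, y in zip(predictors, zx, zy)]
```

Higher-order trend surfaces would follow from applying a polynomial expansion to the two coordinate columns, at the cost of additional model terms.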

‘Raw’ monthly climate data – Using the ‘raw’ monthly climate data from which the bioclimatic data are derived, instead of the bioclimatic data themselves, generated a better prediction for P. kersteni (see chapter 7.3). The species’ limiting range in temperature and/or precipitation extremes can be better represented by monthly data than by the summarised bioclimatic data. Bioclimatic data offer a summary of each season but not the trend of monthly variation within a season, e.g. the temperature difference within a season or from the middle of one season to the middle of another. The monthly climate data provide direct gradients and are thus useful for capturing the monthly variations. Information provided by, e.g., the maximum temperature of August and/or the interaction (product) term of the precipitation of March and June cannot be represented by the synthesised bioclimatic data, yet these are among the most contributing factors (see Tables 5-3, 6-10) in predicting the occurrence range of P. kersteni. The joint dependence of one event (i.e. the presence of the species) on two factors, here the precipitation of March and June, not only shows the significance of product terms but also shows that such an important variable is easily lost when the data are synthesised into bioclimatic variables. The results in chapters 7.1 (modelling with synthesised bioclimatic data) and 7.3 (modelling with ‘raw’ monthly data) differ from the conclusions of Watling et al. (2012) but agree with the conclusion of Stankowski and Parker (2010). Watling et al. predicted the distributions of one mammal, three amphibians and reptiles, and eight avian species and found no significant difference in prediction between bioclimatic variables and monthly climate variables. They compared results based on the respective AUC values and calculated probability values but did not compare the results based on classified ‘presence-absence’ maps. However, the authors did acknowledge observing some discrepancies in the spatial prediction. In contrast, Stankowski and Parker, modelling the distributions of 30 willow species, found that models using monthly climate data predict more accurately than models using bioclimatic data.

Araújo and Peterson (2012) discuss the uses and misuses of species distribution modelling. Of particular interest is whether a model is able to predict the distribution range satisfactorily or whether the contributions of variables and their response curves properly reflect ecological theory. Achieving both can be difficult, mainly because of working assumptions (ibid.; chapter 2.5) which may not be sufficient for extracting both sets of information, e.g. the use of static vs. dynamic modelling or both, and the environmental variables used (Dormann, 2007). The case briefly discussed here provides an example. By using the six selected bioclimatic variables when modelling the distribution of P. kersteni, the response curves may satisfy expectations, as the predicted range is more confined to the areas where samples are present or where broad-scale climate patterns are similar. Despite the similar climatic conditions in some coastal areas of the Maghreb region and the coastal regions of southern Africa (Peel et al., 2007), P. kersteni is observed only in southern Africa and not in the Maghreb. The prediction using bioclimatic data shows both areas as presence areas (see Figure 7-1 and Figure 7-2). Thus, despite the response curves being as expected, the spatial pattern differs from the expected pattern. On the other hand, the ‘raw’ monthly climatic data predict the species range more accurately, but testing the response curves is tedious and may be difficult because detailed ecological data, especially on the interactions among variables, are necessary for such a comparison.

8.2.2. Effects of land-cover and climate datasets on the predicted distribution range of P. kersteni

Land-cover – Land-cover is a primary surrogate for habitat for faunal species (Franklin, 2010). The predictions (Figure 7-4) show that general purpose land-cover datasets are of limited use in explaining the ecological importance of their land-cover classes when modelling the distribution of P. kersteni, although they do contribute to determining the probability values. Here, the different level and sign (positive vs. negative) of the contribution of similar classes (shrub cover in GLC2000 and savannah in USGS) show that the prediction is sensitive to the classification scheme used for the land-cover classes (see chapter 7.4), and this can lead to misinterpretation of the ecological significance of a particular land-cover class for the species concerned. Precise information on habitats is not known for most of the African Odonata species, and the available information is inferred from the environmental conditions at the observed locations (Dijkstra et al., 2011). Developing an appropriate classification scheme for each species may not be feasible. Since the systematics of species is based on similarities, aggregating land-cover classes with similar characteristics for a specific Family or Order may help in providing useful information when used in modelling. Further, using fewer classes will improve the stability of a model by offering more degrees of freedom, which in turn reduces the inflation of statistical tests.

Changing environmental conditions of climate and land-cover – Climate datasets are primary environmental variables when modelling species distribution at coarse spatial resolution and larger extent (Austin, 2007; Simaika et al ., 2013) and thus used by many researchers for the study of habitat range shifts induced by global climate change (e.g. Beaumont et al ., 2005; Hijmans and Graham, 2006; Thuiller, 2003; Thuiller et al ., 2004). Climate surfaces interpolated from historical measurements from several stations (e.g. Mitchell and Jones, 2005) and forecasted scenarios based on trends (e.g. IPCC climate scenarios; IPCC, 2007) offer opportunities for modelling past and future distributions of species based on climate variables. The CRU-TS dataset contains the climate surface data from 1901 to 2002, so choosing an arbitrary year between the times does not limit the use of data, however land-cover, the main habitat related variable, is not available. The selection of the year 1940 for modelling is reasoned based on the availability in change of land-cover indicated by the HYDE 2.0 dataset. The task of creating a past (hind casted) scenarios of land-cover in the chapter 7.5.1 uses rules for assigning land-cover classes and offered an option for explorations. The datasets for the (assumed) present conditions are representative for the year 2000. The year 2050 is chosen for future scenario. This resulted in an interval between past, present and future of 60 and 50 years offering a comparable time span. For climate data, several datasets from different future climate models are available. However, the nature of climate models is such that different models produce very different predictions (e.g. Christensen et al ., 2007; Meehl, et al ., 2007). In addition, several emission scenarios introduce further uncertainty (e.g. Kundzewicz et al ., 2007). 
The climatic conditions predicted under the A1B emission scenario of the IPCC fourth assessment report lie between the extremes among the emission scenarios (Carter et al., 2007; Meehl et al., 2007), and this scenario has thus been selected. Christensen et al. (2007) advise caution when using results of climate models for Africa, citing systematic errors and insufficient information for proper assessment. The difficulty in
selecting a specific climate model is averted by using assemblage mean of seven different models with equal weights and thus not preferring any one over other. “Ecosystems in Africa are likely to experience major shifts and changes in species range and possible extinctions” (Parry et al ., 2007 p59). Furthermore, land degradation and structural changes are projected for the Great Lakes region in east Africa ( ibid.). Past, present and future land-cover datasets facilitated using habitat surrogate in modelling spatio-temporal distribution of P. kersteni covering over 100 years. The past climate datasets (CRU-TS) and hind-casted land-cover (chapter 7.5.1, Figure 7-5) and projected future climate surfaces and land-cover scenario (chapter 7.5.2, Figure 7-6) allow simulating the effects of a changing environment on the distribution of P. kersteni and reveal the changes in distribution range. Thuiller et al . (2006) modelled 277 African mammal species for studying effects of anthropogenic climate change and land transformation. They modelled the current distribution and projected for 2050 and 2080 using A2 and B2 SRES scenarios of IPCC. They found shifts in range with the species richness and community composition shifting eastwards in southern Africa and westwards in central Africa. The changes from present distribution to the distributions in 2050 and 2080 are mainly due to the gradient in latitudinal aridity. For P. kersteni , the modelled distribution also showed change in distribution range. However, change in the range of one species cannot be compared with the overall species richness and community composition. Nevertheless, Figure 7-8 (lower left) reveals that the distribution range of P. kersteni reduced more than the area gained. From the period of 1940 to 2050, the distribution range in Guinea and in western Africa suffer loss. Although most of the predicted range of P. 
kersteni in the Democratic Republic of Congo (DRC) is outside the distribution range estimated by Clausnitzer et al. (2012), the range also reduced in the DRC. The distribution range of P. kersteni also reduced in Cameroon, the Central African Republic, Zambia and Angola in central Africa, with Angola being the country with most of the losses in predicted area. Ethiopia, Kenya, South Sudan and Tanzania in eastern Africa also show a reduction in predicted distribution range, with Mozambique losing almost half of the area. In Ethiopia, South Sudan and Tanzania, some range expansion has also been predicted. In Tanzania, despite the overall loss of predicted range, patches have been predicted as presence areas in the year 2000 but neither in 1940 nor in 2050. Overall, the distribution range of P. kersteni is reduced, but the distribution (as compared to 1940) is less affected by the change in climatic conditions in west Africa and in southern Africa.
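
The two operations described in this paragraph, averaging several climate models with equal weights and comparing predicted presence areas between two time slices, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the flattened one-dimensional grids and all values are hypothetical and do not reproduce the processing in the SpeeDi Tool or the seven climate models used.

```python
def ensemble_mean(model_grids):
    """Equal-weight mean of several model outputs, cell by cell."""
    n = len(model_grids)
    return [sum(values) / n for values in zip(*model_grids)]


def range_change(earlier, later):
    """Count cells lost, gained and kept between two presence/absence maps (1 = presence)."""
    pairs = list(zip(earlier, later))
    return {
        "lost": sum(1 for a, b in pairs if a == 1 and b == 0),
        "gained": sum(1 for a, b in pairs if a == 0 and b == 1),
        "stable": sum(1 for a, b in pairs if a == 1 and b == 1),
    }


# Three hypothetical model outputs (e.g. temperature in degrees C) for two cells.
models = [[20.0, 25.0], [22.0, 27.0], [24.0, 29.0]]
mean_surface = ensemble_mean(models)

# Hypothetical classified maps for two time slices, flattened to lists.
presence_1940 = [1, 1, 1, 0, 0, 1]
presence_2050 = [1, 0, 0, 1, 0, 1]
change = range_change(presence_1940, presence_2050)
print(mean_surface, change)
```

Counting lost and gained cells separately, rather than only the net difference, keeps a reduction in one region from being hidden by an expansion elsewhere.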

Another observation is that change in land-cover has little influence on the change in distribution of P. kersteni. A detailed look into the proportions of land-cover classes (see Figure 8-2) over the field samples, the predicted distribution range (presence and prob. presence, Figure 7-8, upper right) and the expected range based on watersheds can explain the cause. The field samples are distributed over 13 of the 16 land-cover classes present. No field samples are recorded in classes 7, 8 and 15, all three being regularly flooded areas. Species sample location records are rarely unbiased, especially when museum collections are included, and the Odonata database is no exception. An inferred expected range, such as that of Clausnitzer et al. (2012), which considers some ‘expert-knowledge’, may offer a better representation. However, the expected range is distributed over 15 of the 16 classes, only two more classes (classes 8 and 15) than the recorded samples. This is because the same sample records have been used for delineating the expected range, supplemented by the ‘expert-knowledge’ of the assessor(s). The presence range of P. kersteni predicted using the SpeeDi Tool covers all 16 land-cover classes. It shows that, indeed, almost all the land-cover classes are suitable for P. kersteni. Based on the fitted model, land-cover classes 16, 17, 18, 20 and 22 have a positive effect on the model prediction, with class 18 having the largest positive effect, while class 1 has the largest negative effect (see Figure 5-5, a). This explains why change in land-cover has a low impact on the overall change in the predicted range of P. kersteni over time.

[Plot for Figure 8-2. x-axis: land-cover class (1, 2, 3, 7, 8, 9, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22); y-axis: frequency (in %), 0 to 25; series: samples ODA (496), predicted cells (132247), watershed cells (179805)]

Figure 8-2: Distribution of sample location records in ODA, predicted range (see Figure 5-6, right) and expected range based on watershed range of P. kersteni over different land-cover classes; the colour and number of the land-cover classes corresponds to the colour and number of Figure 5-8.
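
The percentage frequencies plotted in Figure 8-2 amount to a simple tabulation of land-cover class values over a set of cells or sample locations. A minimal sketch, assuming the class values have already been extracted into a plain list (the function name and the input values are hypothetical):

```python
from collections import Counter


def class_frequency_percent(class_values):
    """Percentage of records falling into each land-cover class."""
    if not class_values:
        return {}
    counts = Counter(class_values)
    total = len(class_values)
    return {cls: 100.0 * n / total for cls, n in sorted(counts.items())}


# Invented class values at four sample locations.
sample_classes = [18, 18, 1, 16]
print(class_frequency_percent(sample_classes))
```

Applying the same function to the samples, the predicted cells and the watershed cells gives three distributions that can be compared class by class.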

8.2.3. Effect of scale and modelling extent on predicting the distribution range of A. minuscula

Extent and resolution are the two important facets of scale. Extent depends on the purpose of the modelling, e.g. investigation of the environmentally realised niche should include areas beyond the observed environmental limits (Austin, 2007). The native range of A. minuscula lies predominantly in the southernmost part of Africa. When the entire African continent is considered (see Figure 7-11), the predicted range of A. minuscula includes some areas outside the species’ native range. This is due to the similarity of the conditions in environmental space in those areas. When the modelling extent is restricted to the southern Africa region, which includes only small areas outside the native region, a more detailed pattern than at continental extent (see Figure 7-12) can be seen, with the predicted range matching the expected range more closely (Figure 7-10). This indicates that the modelling extent should not be much larger than the extent covered by the presence samples (or native range) and that the background samples should not be taken too far away from the species’ actual range. VanDerWal et al. (2009) simulated the effect of sampling the background from varying extents for 12 species in the Australian wet tropics. The interpretations of the simulated results for A. minuscula and for the 12 species of VanDerWal et al. are similar: background environmental values should not be sampled far outside the species’ native geographical range. The results themselves, however, differ.
VanDerWal et al . showed that when background samples are drawn far from the actual species range, the prediction relied very much on few variables providing less information on responses and did not resemble true distribution whereas the result for A. minuscula showed that the broad scale pattern matches within the native range whether the extent is regional or continental. Moreover, the environmental ranges where the species are recorded are same in cases of regional and continental extents but the environmental conditions offered by the background samples vary when different extents are used. The task of modelling species distribution is not only to predict where species are present but also to predict where the species are absent within the favourable environmental envelope. Thus, when the background samples are drawn from far outside the native range, the prediction can be only regarded as preliminary outcome, also because the statistical test during the model fitting process is likely to be inflated by the background samples outside the geographic region (VanDerWal et al ., 2009). Spatial resolution is relevant in modelling the species distribution and datasets with fine spatial resolution providing detailed information about environmental conditions (Elith and Leathwick, 2009). However, the final spatial resolution used for modelling species distribution relies mostly on the resolution of available datasets ( ibid.). The prediction range did not change much when modelled with spatial resolutions of 1 km and 8 km for continental as well as regional extent of modelling (see Figures 7-9 and 7-10) for A. minuscula , although it was expected that a finer resolution would give more detailed pattern. Most of the differences or fineness is observed only at the edges of the predicted range which are the direct result of change in the grid-cell size (fine or coarse). 
This may be because the datasets with 8 km resolution are not independent datasets but, except for the NDVI dataset, resampled versions of the 1 km datasets; as such, the resampled values are not far from the original values (a maximum distance of 4 cells). As the presence of Odonata species is associated with water bodies, the predicted distribution is expected to show some patterns resembling the hydrographic networks (due to proximity, i.e. distance to water bodies). At finer resolution, the proximity to water bodies may become a significant variable. However, this effect is not observed in the predicted probability values for A. minuscula. The effect of the variable ‘distance to water bodies’ can nevertheless be seen in some places in the distribution of P. kersteni even at 8 km, especially in Angola in the elevation range from 1200 m to 1800 m. The effect is prominent when the distribution is viewed as probability values but not easily noticeable when the threshold is applied (see Figure 5-6). Predicting Odonata distributions at spatial resolutions of 1 km or 8 km may not make much difference, because adult odonates are very mobile and can disperse much farther than 8 km, but the fineness of the ranges can play a significant role in ecological conservation when Odonata are used as flagship (Clausnitzer et al., 2012) and/or surrogate (Simaika et al., 2013) species.

Ranking of different parameters for predicting the distribution of P. kersteni using the SpeeDi Tool

With several factors determining the prediction outcome, insight into the influential factors is of interest. The list of factors includes the sample data quality, the environmental variables and the regression controls (e.g. elastic-net, number of samples, population prevalence). Ranking each factor is difficult because each factor influences the model’s performance and only the combined performance can be examined. However, a general ranking is proposed (see Table 8-1), reflecting the influence of each factor on predicting the range of P. kersteni.

Rank 1 – Sample data quality is the most important factor along with the environmental geodata (see rank 2). The sample quality can be linked to two aspects: a) resolution and b) density. Resolution here refers to the positional accuracy, as most of the records are geocoded from location names. It is directly linked to the values, and the quality thereof, that are sampled in environmental space. As evident from chapter 6.2.4 (see also chapter 8.1.1), sampling density (and distribution) influences the prediction drastically. When the presence locations are not sampled evenly across the regions (i.e. varying sampling density), the predicted range resembles the expectation only in the densely sampled areas. This difference can be observed by simulating the modelling using the virtual species concept (chapter 6.3). The use of weights to counteract the sampling density, both locally and regionally (chapter 6.2.4, second case), confirmed that sampling density plays a very significant role. The background samples also play a role in the outcome of the prediction, with a moderate impact in comparison to the effects of the presence samples. The background samples were distributed randomly but uniformly (i.e. low density bias); the effect of their density was thus not tested. The number of background samples had some influence, as fewer samples may not be able to represent the entire environmental space. But when the background samples are generated well outside the species’ native range, as seen for the prediction of A. minuscula (chapter 7.6), they led to over-prediction of the range.

Table 8-1: Ranking of factors affecting predictions of species distribution based on potential impact on output of SpeeDi Tool for modelling P. kersteni

| Rank | Factor | Impact | Results | Discussion |
|------|--------|--------|---------|------------|
| 1 | samples: a) presence (density and distribution) | high | 6.2.4, 6.3 | 8.1.1 |
| 1 | samples: b) background | moderate | 6.2.1, 6.2.2, 7.6 | |
| 2 | environmental data | high | 7.1 - 7.4, 7.6 | 8.2.1, 8.2.2 |
| 2 | model complexity (polynomials and interactions) | high | 6.2.3 | 8.1.2 |
| 3 | elastic-net factor | moderate to high | 6.1.1 | 8.1.2 |
| 3 | population prevalence | moderate | 6.1.2, 6.1.3 | |

Rank 2 – Another factor that is directly related to the sampling is the environmental geodata, its quality and spatial resolution. Quality and spatial resolution are also inter-linked, with quality depending partly on the spatial resolution. The selection of appropriate and ecologically meaningful environmental datasets is a difficult task and several tests may be needed. On the one hand, selecting few variables is compelling as it allows the model to be explained easily; on the other hand, using more variables cannot be ignored because of the prospect of obtaining more realistic model output (see chapters 7.1 – 7.4). Using datasets with the same theme (e.g. land-cover) but with different classification schemes can give different responses for similarly characterised classes (chapter 7.4), which can lead to misinterpretation. Model complexity is ranked together with environmental data as the complexity is directly related to the number of environmental variables. The complexity increases exponentially with the increase in the number of ‘primary’ continuous environmental variables 18. The need for product (interaction) terms between two continuous variables was revealed when P. kersteni was modelled using five different complexities (chapter 6.2.3). The polynomial terms seemed to have less influence, but the product interaction terms showed a strong influence.

18 ‘primary’ means variables without the polynomial term, but used to create polynomials
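
To give a feel for how the number of model terms depends on the number of ‘primary’ variables, the following sketch counts terms under one simple, hypothetical definition of complexity: one linear term and one squared term per variable, plus one product term per pair of variables. This definition is an assumption made for illustration and is not one of the five complexities tested in chapter 6.2.3.

```python
def count_model_terms(n_primary, with_squares=True, with_products=True):
    """Count regression terms built from n_primary continuous variables.

    Assumes one linear term per variable, optionally one squared term per
    variable and one product term per pair of variables (hypothetical scheme).
    """
    if n_primary < 0:
        raise ValueError("n_primary must be non-negative")
    terms = n_primary                              # linear terms
    if with_squares:
        terms += n_primary                         # squared (polynomial) terms
    if with_products:
        terms += n_primary * (n_primary - 1) // 2  # pairwise product terms
    return terms


for n in (3, 6, 12):
    print(n, count_model_terms(n))
```

Under these assumptions, six primary variables already give 27 terms, of which 15 are product terms, so the interaction terms soon dominate the model.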

Rank 3 – The elastic-net factor and the assumed initial population prevalence complete the list of factors. Regression models, although very useful, are known to suffer from over-fitting. The use of regularisation techniques is thus common to counter over-fitting. The hybrid nature of elastic-net can emulate the commonly used regularisation techniques L1, L2 or both at the same time. The notion that L1 regularisation performs better than L2 regularisation (Phillips and Dudík, 2008) has also been observed here (see Table 6-1). However, L1 regularisation (elastic-net factor = 1) under-predicts and L2 regularisation (elastic-net factor = 0) grossly over-predicts the distribution range (see chapter 8.1.2 and Appendix A1). An elastic-net value acting as hybrid L1-L2 regularisation performed best, and thus showed its importance in balancing over- and under-prediction. Population prevalence is an unavoidable element in presence-background regression modelling. The case studies with five different prevalence values in chapter 6.1.2 (see also Tables 6-2: cut-off % and 6-3: correlations) showed the dependence of the prediction on the prevalence and the uncertainty associated with it. However, using the SBT option (chapter 6.1.3) reduced the uncertainty (see Tables 6-4: cut-off % and 6-5: correlations). Thus, its impact is moderate compared to other factors.
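
The role of the elastic-net factor as a mixing weight between L1 and L2 regularisation can be sketched as a penalty function. The parameterisation below follows a widely used convention (factor 1 gives pure L1, factor 0 gives pure L2, with the squared L2 norm halved); whether the SpeeDi Tool uses exactly this scaling is an assumption, and the function and argument names are hypothetical.

```python
def elastic_net_penalty(coefficients, strength, mixing):
    """Penalty = strength * (mixing * L1 + (1 - mixing) * 0.5 * squared L2).

    mixing = 1 gives a pure L1 penalty, mixing = 0 a pure L2 penalty,
    values in between a hybrid of the two. The intercept is assumed to be
    excluded from `coefficients`.
    """
    if not 0.0 <= mixing <= 1.0:
        raise ValueError("mixing must lie in [0, 1]")
    l1 = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    return strength * (mixing * l1 + (1.0 - mixing) * 0.5 * l2_squared)


betas = [1.0, -2.0]
for mixing in (0.0, 0.5, 1.0):
    print(mixing, elastic_net_penalty(betas, strength=1.0, mixing=mixing))
```

The L1 part can shrink coefficients exactly to zero while the L2 part shrinks them smoothly, which is consistent with the observed tendency of pure L1 to under-predict and pure L2 to over-predict.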

The ranking covers only six factors and should not be considered a ‘de-facto’ order but a general guideline, because the difference in the level of impact between a higher ranked factor and a lower ranked factor cannot be determined. It should also be noted that the ranking is based solely on the modelling of P. kersteni; a more reliable conclusion can be drawn with the experience of modelling and testing several species.

Suitability of the SpeeDi Tool for modelling spatial distribution of African Odonata

Predictive distribution modelling of species is based on limiting factors: resources, interactions (biotic or abiotic), etc. When the distribution is modelled for a large extent and large grain size, the limiting factors are mostly abiotic factors (i.e. physical environment, e.g. climate) and broad habitat structure or patterns in the landscape (Austin 2007; Araújo and Peterson, 2012). For small extent and fine grain size, the limiting factors are mostly biotic factors (e.g. dispersal, inter- and intra-species competition) (Austin 2007). At fine scale, biotic interactions, e.g. feeding habit (predator-prey relationship), determine the limiting factors. Further, data related to biotic factors are for many species either not available or not available in a geospatial form (ibid.). This is also true for African Odonata, as most of the location data in ODA are based on museum records, literature and field notes (Clausnitzer et al., 2012). Further, adult Odonata are highly mobile with a large dispersal range, and a larger grain size is thus suitable. Obtaining spatial biotic data is more difficult for Odonata as their complete life-cycle requires both aquatic and terrestrial habitats. The lack of knowledge on the detailed life-cycles of African Odonata (Dijkstra et al., 2011) is thus a further limitation on the availability of biotic factors. These biotic factors are the core of mechanistic and analytic models (chapter 2.1). Most of the data available at continental level in spatially explicit form are for abiotic factors, mostly related to climate and habitat structure (elevation, land cover). Species distribution modelling often relies on what is available rather than on what is the
most suitable method (Austin, 2002b). The suitable modelling method for Odonata species based on the combination of available data, both the species location and the environmental data, is thus the statistical modelling methods. Often, only presence records are available and profile or envelope based models are more suited in such condition (Hirzel et al ., 2002b). Although envelope or profile based models can predict the suitable climate envelope; however, the envelope based models will not be able to locate the areas where the species will not be present within the envelope (Austin, 2002b). This is one of the explanations for over-predictions for envelope based modelling, and when possible, the suggested methods are regression based discriminatory methods such as logistic regression. Discriminatory methods need presence and absence samples to discriminate environmental suitability but absence data are seldom recorded. Studies have shown that a single modelling tool is not able to satisfactorily predict distribution of different species (e.g. Elith et al ., 2006; Farber and Kadmon, 2003; Guisan et al ., 1999; Guisan et al ., 2007; Terribile et al ., 2010). This is one of the reasons in either developing new methods or seeking improvements of existing methods. The development of discriminatory methods applying background samples instead of absence data have been efficient for species distribution modelling (Elith et al ., 2006; Phillips et al ., 2006; Ward et al ., 2009) but they are generally not readily available in most of the 'general' statistical packages and need special development (Elith and Leathwick, 2009; Phillips et al ., 2006) such as MaxEnt (maximum entropy based) and ecogbm (logistic regression based). So, any new development in the field of ‘general’ or ‘standard’ statistical method, such as elastic-net regularisation, is not readily available to be utilised in those SDM tools based on the concepts of background samples. 
Applying elastic-net regularisation for species distribution modelling using background samples is not trivial because the use of background samples is among the unique methods in predicting species distribution. The SpeeDi Tool utilising the concepts of Ward et al . (2009) is, thus, an improvement implementing elastic-net regularisation technique. The sensitivity analysis of elastic-net factor in modelling P. kersteni demonstrated the superiority of elastic-net over L1 and L2 regularisation (see chapter 6.1.1 and Appendix A1). Moreover, the inclusion of ‘soft buffer threshold’ concept is new. The investigation on the effect of unknown population prevalence in background samples (chapter 6.1.2) and the impact of applying ‘SBT’ (chapter 6.1.3) shows the usefulness of SBT in reducing the uncertainty in prevalence of background samples . Further, the facilitation of combined use of background and absence data together with presence samples (see chapter 6.3.4) is another option for species distribution modelling in SpeeDi Tool which is not yet available in other tools. The application of ‘SBT’ and combined use of background samples and auxiliary absence data were tested for predicting the distribution range of P. kersteni and further tests using more species data are required to get more confidence in using these concepts. Modelling other species, Odonata species in the context of this thesis, can exemplify the usefulness of the SpeeDi Tool.

The SpeeDi Tool from a user’s perspective for modelling species distribution

Quality measures are often the basis for measuring users’ acceptance of a software or tool. Although no quality tests were performed, the development of the SpeeDi Tool considered the factors usually required. The requirements in species distribution modelling are not just the provision of functions for statistical modelling, but also of the necessary GIS functionalities for pre- and post-processing and interactive visualisation of distributions (chapter 3.2.1). Regression based models rely on iterative processing for fitting the regression parameters and hence reliable output is very important. Further, over- and under-fitting of parameters, correlation in data and noise are associated with regression models, and measures to control them are necessary. Since modelling tasks often need to be verified, repeatable results are necessary for the model outputs to be reliable (chapter 3.2.2). These issues are carefully considered; the use of elastic-net regularisation handles the problems associated with over- and under-prediction, correlation and noise. While several parameters contribute to the overall quality, the functionality (chapter 3.2.1) and usability (chapter 3.2.3) offered by the SpeeDi Tool are discussed here.

Functionality – The core of the species distribution modelling tools is statistical methods applied to predict the distribution based on presence of species and the environment at a given location. The SpeeDi Tool uses binary logistic regression as core statistical method for prediction. Fitting the coefficients of variable in logistic regression is an iterative process and several iterative methods can be used with each method having its own parameters and complexities. Although the statistical methods are deterministic and algorithms published, the iterative methods used to fit model, sometimes become ‘black-box’ in nature when all the parameters are not exposed to users. The SpeeDi Tool has most of the relevant parameters exposed to users for modification with an exception of the convergence threshold. This threshold tells the iterative method when to stop. It has been often observed that many users are happy to use the default values provided in a tool (e.g. Åstrom et al ., 2007; Buermann et al ., 2008; Simaika, et al ., 2013). Part of this (using default values) lies in the fact that, a) for most users, it is difficult to understand the numerical methods behind those techniques but understanding them is also not necessary and b) few users are interested in knowing the effect of the parameters although the outcome depends on these parameters. However, those default values in the tool and/or suggested values are often ‘tuned-up’ values based on the modelling results (e.g. Phillips and Dudík, 2008). Although modelled in the environmental space, the amount of data processing in GIS for predicting spatial distribution cannot be overlooked (Pearson, 2007). Geodatasets are first pre-processed in a GIS before feeding into the statistical modelling task and applying the threshold to the probability values, in order to create map of presence and absence ranges of species is often performed in a GIS. 
Thus, GIS is an integral part in species distribution modelling and GIS functionalities are prerequisites for a ‘complete’ SDM tool. Converting vector datasets to raster datasets, matching pixel size and position, and transforming spatial reference system are the most basic geodata pre-processing functions in a GIS required for species distribution modelling ( ibid.). Inclusion of basic but important GIS functions in the SpeeDi Tool ensures that users have access to these functions with ease. The task of predicting the distribution range of P. kersteni (chapter 5) utilised the pre-processing functions of GIS available in the SpeeDi Tool. Often, applications that use the predicted distribution of species undergo simple to complex spatial analyses as post-processing in GIS. ‘Standard’ and ‘matured’ GIS software would be needed for complex applications making use of full capabilities of GIS. However, the SpeeDi Tool has some post-processing GIS functions too. Analysis such as change in predicted distribution range of P. kersteni over time (Figure 7-8) or change in prediction range due to the resolution of environmental datasets (Figure 7-11) can be easily performed as post-processing of results in the SpeeDi Tool.
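
The post-processing step mentioned above, applying a threshold to the predicted probability values to obtain a map of presence and absence, can be sketched as follows. The nested-list grid, the use of None for cells without data and the threshold value of 0.5 are assumptions made for illustration; they do not describe how the SpeeDi Tool stores rasters or chooses its cut-off.

```python
def classify(probabilities, threshold):
    """Turn a grid of predicted probabilities into presence (1) / absence (0) cells.

    Cells holding None (no data) are passed through unchanged.
    """
    return [
        [None if p is None else int(p >= threshold) for p in row]
        for row in probabilities
    ]


# Invented 2 x 2 probability grid with one no-data cell.
grid = [[0.1, 0.6],
        [None, 0.5]]
presence_map = classify(grid, threshold=0.5)
print(presence_map)
```

Because the threshold turns a continuous surface into two classes, subtle gradients in the probability values are no longer visible in the classified map.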

Usability – Usability relates to the ease and user friendliness of handling a tool. By using a graphical user interface and providing the necessary information, the user’s interaction with the SpeeDi Tool is made easy. Usability is not only a quality attribute in itself but is also directly related to other quality attributes such as functionality, reliability or efficiency. A graphical user interface is an intuitive technique for human-computer interaction. Implementing a consistent layout of the user interface is advocated so as not to confuse users (US-HSS, 2006). The use of common controls (e.g. buttons, selection box instead of text box) consistent with users’ expectations will speed up tasks and increase operability and efficiency. The grouping and sequencing of the functions in the SpeeDi Tool increase memorability as well as operability (cf. ibid.). The implementation of an integrated help system, where every dialog box has a ‘Help’ button linked to the context related help, will a) assist learnability (learn about the function), b) increase performance (efficiency), c) reduce the error rate (guide in selecting correct input), and d) increase memorability (frequent use of the help and learning) (see chapter 3.2.3). Note that although the functionality of the help system is implemented in the SpeeDi Tool, the detailed contents are not yet complete. However, the help system uses external files for its content, so only the content needs to be updated. The use of a GUI, the grouping and ordering of functions covering all the required utilities and the integration of the help system offer a better user experience (chapter 3.4).

Comparing the SpeeDi Tool with other tools – As is evident from the discussion above, GIS is an integral part of modelling species distributions. However, widely used tools such as MaxEnt (Phillips et al., 2006) or DesktopGarp (Scachetti-Pereira, 2002) do not themselves offer GIS functionality (see Table 3-2); the exception is Biomapper (Hirzel et al., 2002a). Separate statistical packages are often used for GLM-based modelling (Hijmans and Elith, 2011). The SpeeDi Tool differs in that regard: it is a customised GIS application that provides the necessary GIS functions together with binary logistic regression modelling. The logistic regression module is independent of any statistical package and is loosely coupled to the GIS environment. A comparable tool in terms of overall functionality and concept is Biomapper, in which the statistical modelling and, to some extent, the pre- and post-processing can be done within the tool. Furthermore, the components of Biomapper are also loosely coupled and its help system is well integrated. The main difference lies in the statistical modelling technique and the related pre-processing functions: Biomapper offers elliptical-envelope modelling using a factorisation technique with ‘presence-only’ samples (chapter 2.4), whereas the SpeeDi Tool is based on logistic regression and requires presence samples together with background and/or absence samples. If only the statistical modelling technique is compared, the SpeeDi Tool, being regression based and requiring presence and background samples, is comparable with MaxEnt and ecogbm. Ecogbm shares more similarities because it is based on logistic regression and the expectation-maximisation (EM) concept (Ward et al., 2009). The SpeeDi Tool a) uses the EM concept used in ecogbm, b) implements the newly conceived concept of SBT, c) introduces the use of background samples together with absence samples, and d) applies a better regularisation technique, the elastic-net.
Furthermore, ecogbm is a command-line tool (available as an R package), whereas the SpeeDi Tool is a GUI-based standalone application. Table 8-2 compares the features of the SpeeDi Tool with those of other species distribution modelling tools. DesktopGarp shares almost no features with the SpeeDi Tool; the only similarities are that a) it has a GUI, and b) part of its statistical modelling uses logistic regression, although with pseudo-absence samples. Component integration is relevant only when different components are used, and Biomapper is the only other tool with both GIS and statistical components. MaxEnt and ecogbm employ L1 regularisation, which is comparable to the SpeeDi Tool when its elastic-net factor is 1.0. MaxEnt has a simple help system in which all the information it provides is embedded. Using external files for the help system is more suitable because the help files can be extended and updated separately. The help of DesktopGarp, although an external file, is a simple document. In contrast, Biomapper has a context-related help system for most of its functions; this is similar to the help system implemented in the SpeeDi Tool, although the underlying technologies differ.

Table 8-2: The features of the SpeeDi Tool in comparison to other species distribution modelling tools (similar features in other tools are italicised; for reference see Tables 3-2 and 3-3)

| | Biomapper | DesktopGarp | MaxEnt | ecogbm | SpeeDi Tool |
|---|---|---|---|---|---|
| Components integration | loose coupling | no | no | no | loose coupling |
| GIS processing | yes | no | no | no | yes |
| Statistical model | factorisation and envelope based | genetic algorithm, envelope and logistic regression | maximum entropy (regression) | logistic regression based | logistic regression |
| Regularisation | factorisation (≈L2) | no | L1 [a] | L1 [a] | elastic-net (hybrid L1-L2) |
| Help system | integrated, functional, context related, external files | partially functional, external file | functional, embedded | not known | integrated, functional, context related, external files [b] |
| Sample data requirement | presence | presence and pseudo absence [c] | background | background | absence and/or background |
| User interface | GUI | GUI | GUI | Command line | GUI |

[a] can be considered similar if the elastic-net factor is set to 1 in the SpeeDi Tool
[b] the functionality is implemented; however, the contents are not complete
[c] no possibility of user-supplied pseudo-absence

Overall, the development of the SpeeDi Tool drew on the positive aspects of many other modelling tools and implemented additional functions. However, there is still room for improvement. A tool such as ecogbm, which lacks GIS analysis functionality but is rooted in statistical software, has the advantage that the data can be analysed statistically in several ways. Adding statistical functions for resampling techniques such as the bootstrap and jackknife, and for cross-validation techniques such as k-fold data splitting/modelling, would help in exploring and analysing output data and would increase confidence in the predicted results. Because most users prefer the default parameter values, the sensitivity analyses provided some insight for recommending such values. However, these values are based on experience in modelling a single Odonata species. Tests with several species could help to find common values that work across species.

9. Extended summary

This thesis introduces a new tool, the SpeeDi Tool (acronym for SPEciEs DIstribution modelling Tool), for modelling the spatial distribution of species. The new tool is applied with a particular focus on predicting the distribution ranges of diverse dragonfly species (Odonata) of Africa, drawing on the Odonata Database of Africa.

Species distribution modelling (SDM) has been used in a wide array of ecological applications, such as determining hotspots, planning reserves, designing surveys for biodiversity inventories, understanding phylogenetic and phyloclimatic relationships, or assessing the impacts of environmental change on biodiversity. Statistical modelling is a common type of modelling method in SDM; analytical and mechanistic modelling are the other types. In the last few decades, Africa has been a dynamic continent with regard to changes in landscape, population and climate. Being sensitive to changes in both terrestrial and aquatic ecosystems, Odonata can be used as flagship species for many ecological applications. Modelling the distribution ranges of Odonata species can thus be important, e.g. in planning biodiversity conservation actions. However, as in most modelling tasks, care should be taken regarding the assumptions made in the statistical modelling and the data being used. Because SDM is spatial in nature, the use of a geographic information system (GIS) for pre-processing and post-processing is an integral part of the SDM workflow. The integration of GIS and statistical modelling is therefore one of the quality aspects of the functionality of an SDM tool. Furthermore, usability is one of the most frequently sought ‘quality-in-use’ attributes. Task-centred approaches can achieve the goal of modelling species distribution ranges, but they generally do not consider how simple or difficult a tool is to handle and can therefore fall short of usability requirements. User-centred approaches, in contrast, can deliver both usability and achievement of the goal because they consider users’ abilities as well as ‘ease of use’ criteria. Pseudagrion kersteni is a dragonfly species widespread in sub-Saharan Africa, and it is taken as the species of interest to demonstrate the use of the SpeeDi Tool.
One of the aims of the thesis is to perform various sensitivity analyses to obtain information about the influence of various parameters. Although the actual distribution range of P. kersteni is not known, an expert-drawn, watershed-based range map from the IUCN served as a visual reference for the predicted range. The sensitivity analyses are performed using P. kersteni samples for different modelling approaches, modelling parameters and environmental geodatasets.

Among the special features of the SpeeDi Tool is the use of random background samples instead of pseudo-absence samples in binary logistic regression. Background samples are increasingly adopted in SDM; the environmental conditions at the locations of presence samples are compared with those at background samples. This removes the difficulty of obtaining absence records, which are rarely officially recorded or, when available, are too few to be used in modelling. In some cases, however, reliable absence records can be derived from expert knowledge. In such instances, using presence, absence and background samples together can offer better prediction results. The thesis demonstrates and discusses the concept of using presence, absence and background samples as realised in the SpeeDi Tool. Elastic-net regularisation is a more effective regularisation method than L1 or L2 regularisation. It is a hybrid of L1 and L2 regularisation and thus balances the under-prediction of L1 against the over-prediction of L2 regularisation. Although elastic-net regularisation has been available in statistics for some time, adopting it for SDM with presence and background samples is not straightforward and requires special programming and implementation, which the SpeeDi Tool offers. Sampling bias influences the prediction result and is difficult to avoid when the species records come from museum and herbarium collections. Bias is also introduced by unplanned or non-systematic survey designs. The modelling in the SpeeDi Tool adopts the concept of sample weights, which can be used to balance different sampling densities and thus reduce sampling bias. The use of weights also facilitates the inclusion of abundance (count) data. Further benefits of weights are that the statistics used in calculating significance are not inflated and that the prevalence of species presence is maintained. Although the use of background samples makes it possible to model with field records of presence samples only, the population prevalence of presences among the background samples is not known. Generally an assumed value is used, but this introduces uncertainty. To reduce this uncertainty, or to negate the effect of the assumed population prevalence in the background samples, a new heuristic approach, the ‘soft-buffer-threshold’ (SBT), has been conceptualised and implemented in the SpeeDi Tool.
Even though the processing of geodata is an integral part of SDM, GIS functionality is uncommon in SDM tools because many of them follow the task-centred approach, in which statistical modelling is regarded as the main task. The consequence is that separate GIS software is needed for basic geodata processing, and that data must be imported and exported before final results can be obtained. The implementation of the SpeeDi Tool followed the user-centred approach, which highlighted, for example, the need for GIS functionality as part of a complete SDM workflow. The user interface of the SpeeDi Tool provides the basic GIS functionality for pre- and post-processing of the geodata, and the binary logistic regression modelling is integrated via a software coupling mechanism. This gives access to the GIS and the required statistical functions directly from the main graphical user interface. Among the features mentioned above, the possibility of modelling species distributions by combining presence samples with background and absence samples, elastic-net regularisation when background samples are used, and the concept of SBT are new options in species distribution modelling that are currently not available in any other SDM tool. Although weights can be used in logistic regression, the weighting of samples has not been common in SDM tasks. Another notable feature of the tool is therefore the explicit use of a weight for each sample, which can be used to counter sampling bias.
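
The hybrid character of the elastic-net penalty described above can be illustrated with a minimal Python sketch. The function name `elastic_net_penalty` and its parameters `strength` and `mixing` are hypothetical, and the glmnet-style parametrisation is an assumption made for illustration; it is chosen only because it reproduces the behaviour noted in the comparison with MaxEnt and ecogbm, where an elastic-net factor of 1.0 corresponds to pure L1 regularisation. It is not the implementation used in the SpeeDi Tool.

```python
def elastic_net_penalty(coefficients, strength, mixing):
    """Return strength * (mixing * L1 + (1 - mixing) / 2 * L2 squared).

    Assumption: glmnet-style parametrisation, used here for illustration
    only. `mixing` plays the role of the elastic-net factor:
    1.0 gives a pure L1 penalty, 0.0 gives a pure L2 penalty.
    """
    if not 0.0 <= mixing <= 1.0:
        raise ValueError("mixing must lie in [0, 1]")
    if strength < 0.0:
        raise ValueError("strength must be non-negative")
    # L1 part: sum of absolute coefficient values
    l1 = sum(abs(b) for b in coefficients)
    # L2 part: sum of squared coefficient values
    l2_squared = sum(b * b for b in coefficients)
    return strength * (mixing * l1 + 0.5 * (1.0 - mixing) * l2_squared)


if __name__ == "__main__":
    betas = [1.0, -2.0]  # hypothetical regression coefficients
    for factor in (0.0, 0.5, 1.0):
        print(factor, elastic_net_penalty(betas, strength=1.0, mixing=factor))
```

With `mixing = 1.0` only the L1 term remains and with `mixing = 0.0` only the L2 term; intermediate values blend the two, which is the balance between under- and over-prediction referred to above.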

The modelling approaches used the virtual species concept with presence and absence data to gain confidence in the model’s ability to predict the distribution range. Since the true distribution range of a real species is not known, the virtual species concept offered a way to evaluate (or validate) the model’s performance in depicting the true distribution range. The same concept was then extended to the presence and background modelling scenario to substantiate the suitability of background samples for modelling species distributions with the SpeeDi Tool. A comparison was then made with the result obtained from real field sample data and background samples. Finally, the use of additional absence samples based on expert knowledge, together with presence and background samples, was tested and showed an improvement in prediction. In brief, the results of the sensitivity analyses on modelling parameters show:
• Elastic-net regularisation performs better than L1 and L2 regularisation. The predicted distribution of P. kersteni is sensitive to the amount of regularisation. However, the analysis offered insight into selecting an optimal elastic-net factor; the value selected for P. kersteni also performed well for modelling the distribution of A. minuscula.
• The predicted probability values are sensitive to the assumed value of population prevalence in the background samples, but the assumed value has less influence when thresholds are applied to delineate presence and absence ranges; the threshold value is based on either minimising the difference between, or maximising the sum of, the sensitivity and specificity values of the classification. However, applying SBT greatly reduced the differences in probability values (variance and range) and is thus able to reduce the uncertainty associated with population prevalence.
• Although the statistical measure of model performance, the area under the receiver operating characteristic curve, did not show much difference, the visual comparison of predicted ranges showed that the number of background samples in the model is a sensitive parameter. Too few background samples may lead to over-prediction, while too many may lead to under-prediction in areas where presence samples are lacking.
• Different pseudo-random number generators used in generating random background samples have no real effect on the predicted distribution of P. kersteni. This indicates that the ability of the SpeeDi Tool to predict the distribution range of P. kersteni is not sensitive to the distribution pattern of the background samples.
• Regarding the complexity of variables’ relationships in the model, the degree of the polynomial does not influence the prediction, but the product interaction among variables is a highly influential term in determining the probability of presence.
• Regarding the handling of uneven sampling density (often termed sampling bias), weights are assigned to the presence samples based on the spatial density of samples, and the results indicate that the use of weights can be an effective way to handle such sampling bias.
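
The two threshold rules mentioned in the list above (maximising the sum of sensitivity and specificity, or minimising their difference) can be sketched as follows. This is a minimal illustration in Python, assuming binary labels (1 = presence, 0 = absence) and that a location is classified as presence when its predicted probability is at least the threshold; the function names and the sample values are hypothetical and are not taken from the SpeeDi Tool.

```python
def confusion_rates(probabilities, labels, threshold):
    """Return (sensitivity, specificity) for p >= threshold -> presence."""
    tp = fn = tn = fp = 0
    for p, y in zip(probabilities, labels):
        predicted_presence = p >= threshold
        if y == 1:
            if predicted_presence:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_presence:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one presence and one absence")
    return tp / (tp + fn), tn / (tn + fp)


def select_threshold(probabilities, labels, rule="max_sum"):
    """Pick a threshold among the observed probability values.

    rule="max_sum":  maximise sensitivity + specificity
    rule="min_diff": minimise |sensitivity - specificity|
    """
    if rule not in ("max_sum", "min_diff"):
        raise ValueError("rule must be 'max_sum' or 'min_diff'")
    best_threshold, best_score = None, None
    for t in sorted(set(probabilities)):
        sens, spec = confusion_rates(probabilities, labels, t)
        # Express both rules as a score to be maximised
        score = sens + spec if rule == "max_sum" else -abs(sens - spec)
        if best_score is None or score > best_score:
            best_threshold, best_score = t, score
    return best_threshold


if __name__ == "__main__":
    # Hypothetical predicted probabilities and observed labels
    probs = [0.1, 0.2, 0.35, 0.4, 0.6, 0.8, 0.9]
    obs = [0, 0, 0, 1, 0, 1, 1]
    print(select_threshold(probs, obs, "max_sum"))
    print(select_threshold(probs, obs, "min_diff"))
```

The two rules need not agree: on the hypothetical data in the usage example they select different thresholds, which is why the rule in use should be reported together with the delineated range.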

The sensitivity analyses of environmental variables with different climate and land-cover datasets provided useful information for modelling the distribution of P. kersteni. Bioclimatic variables and ‘raw’ monthly climate variables are considered as climate datasets. The sensitivity to land-cover datasets is analysed using two datasets with different classification schemes: the GLC 2000 dataset, which uses the FAO land-use classification scheme, and the USGS global land cover characterisation dataset, which is based on the USGS land use land cover system (modified version 2). The following results summarise the sensitivity to climate and land-cover related datasets:
• The use of 19 bioclimatic variables describing climatic trends related to temperature and precipitation under-predicted the distribution range.
• The use of six selected bioclimatic variables (temperature: annual mean, seasonality, annual range and mean of the coldest quarter; precipitation: annual and seasonality) showed less under-prediction but increased over-prediction.
• A large part of the over-prediction was effectively removed when a geographical trend surface was added as a predictor variable to complement the six selected bioclimatic variables.
• The prediction using monthly climate datasets of temperature and precipitation showed that the ‘raw’ monthly climatic datasets are preferable to the synthesised bioclimatic datasets, because noise in the data may not be filtered effectively when synthesised datasets are used.
• Although the use of different land-cover datasets (schemes) did not influence the predicted distribution range, the results suggest that land-cover classes may induce false ecological interpretations; e.g. classes with similar characteristics (closed-open, deciduous shrub-cover in GLC2000 and savannah in the USGS dataset) differ in the level (amount) and have the opposite nature (positive/negative) of their contribution to the probability value.
• The results for the different land-cover datasets (schemes) illustrate one of the limitations of statistical modelling methods: the importance of a variable determined by a statistical model may not match the importance expected from and/or determined by ecological theory.
• Within the static realm of SDM, the model based on current conditions (base year 2000) was projected to the past climate and simulated land-cover scenario of 1940 and to future climate and land-cover scenarios for the year 2050.
• The comparison of the distribution ranges for 1940, 2000 and 2050 shows a reduction in the distribution range of P. kersteni, with central and eastern Africa losing large areas of suitable environmental conditions.
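
A geographical trend surface, as added to the bioclimatic predictors in the list above, is commonly built from polynomial terms of the map coordinates. The following Python sketch shows one way of generating such terms; the function name `trend_surface_terms`, the default polynomial degree of 2 and the ordering of the terms are assumptions for illustration, since the summary does not state how the trend surface in the SpeeDi Tool is constructed.

```python
def trend_surface_terms(x, y, degree=2):
    """Return the polynomial terms x**i * y**j with 1 <= i + j <= degree.

    For degree 2 the order is: x, y, x^2, x*y, y^2.
    The constant term is omitted because a regression model usually
    carries its own intercept.
    """
    if degree < 1:
        raise ValueError("degree must be at least 1")
    terms = []
    for total in range(1, degree + 1):
        # Walk from x**total down to y**total within each total order
        for i in range(total, -1, -1):
            j = total - i
            terms.append((x ** i) * (y ** j))
    return terms


if __name__ == "__main__":
    # Hypothetical, already rescaled coordinates of one sample location
    print(trend_surface_terms(2.0, 3.0, degree=2))
```

In practice the coordinates would normally be centred and rescaled before the terms are computed, so that higher-order terms do not dominate the other predictors numerically.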

The sensitivity analyses also covered the spatial resolution of the geodatasets and the modelling extent. For these, A. minuscula, a regional Odonata species native to southern Africa, was chosen. The modelling results suggest:
• When modelling is performed at the continental extent, the results give only an idea of the general shape of the species range.
• When modelled at the regional extent of southern Africa, the predicted range closely matches the expected distribution range. The pattern is also more detailed than when modelled at the continental extent.
• At the continental extent, the predicted distribution range of A. minuscula did not differ much between modelling at spatial resolutions of 8 km and 1 km.
• Similarly, at the regional extent of southern Africa, the predicted distribution range did not change much between spatial resolutions of 8 km and 1 km.

The aim of providing a new species distribution modelling tool is met, together with comprehensive sensitivity analyses of model parameters and environmental variables. However, only two African Odonata species have been modelled using the SpeeDi Tool, and many more Odonata species remain to be modelled for the African continent alone (but see Simaika et al., 2013). The sensitivity analyses of regression control parameters so far relate only to modelling P. kersteni; their results are suggested as default values in the SpeeDi Tool. In order to establish a sound basis and to further tune those default parameters, modelling of several species with similarly performed sensitivity analyses is suggested. In particular, the heuristic SBT technique needs more testing to increase confidence in applying it to SDM. The results showed that sampling bias can be handled in the modelling by assigning weights to biased samples; however, a proper algorithm is still lacking. Nevertheless, the results already indicate some directions for developing such an algorithm, i.e. a) balancing local density within a region (each cluster separately) and b) balancing density across regions (among clusters, e.g. the densities in southern Africa and western Africa in the case of P. kersteni). On the technical side, the help system has the required skeleton but its contents are not yet complete. Completing them should be the next step before the tool can be widely used. While Help contents on the functioning and technical aspects of the tool can be prepared in a short time, following the user-centred approach will be more time consuming but will make the contents more useful. This will require close collaboration with biologists/ecologists who use SDM, in order to close knowledge gaps between the fields of GIS and SDM and to prepare the contents with the GIS terminology used in ecology. Although a basic map layout can be produced easily, some additional ‘cosmetic’ functions for cartographic representation would enhance the map-making experience. The tool uses and/or adopts some free and open-source algorithms, but the GIS program library used in the tool is not open source. Adopting an open-source GIS library may therefore appeal to many users, given the community available for maintenance, improvements and the development of extensions over time, which can improve the tool overall as well as reduce its cost.
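
One simple way of balancing local sampling density, in the spirit of direction a) above, is to weight each presence sample by the inverse of the number of samples that share its grid cell. The Python sketch below illustrates this idea only; the function name `inverse_density_weights`, the square grid and the `cell_size` parameter are assumptions, and it is not the weighting algorithm of the SpeeDi Tool, for which, as stated above, a proper algorithm is still lacking.

```python
import math
from collections import Counter


def inverse_density_weights(points, cell_size):
    """Weight each (x, y) point by 1 / (points sharing its grid cell).

    Illustrative assumption: density is measured on a square grid with
    cells of side `cell_size`. Every occupied cell then contributes a
    total weight of 1, however many samples it contains.
    """
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    # Assign each point to the grid cell that contains it
    cells = [
        (math.floor(x / cell_size), math.floor(y / cell_size))
        for x, y in points
    ]
    counts = Counter(cells)
    return [1.0 / counts[cell] for cell in cells]


if __name__ == "__main__":
    # Hypothetical coordinates: three clustered samples and one isolated
    samples = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.9), (5.5, 5.5)]
    print(inverse_density_weights(samples, cell_size=1.0))
```

Balancing across regions, direction b), would need an additional step, for example rescaling the weights so that each cluster of samples receives a chosen share of the total weight.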

Zusammenfassung (Summary)

With the SpeeDi Tool (acronym for SPEciEs DIstribution modelling Tool), this thesis presents a new tool for modelling the spatial distribution of species. Drawing on the Odonata Database of Africa (ODA), the main focus of this work is on predicting the distribution patterns of diverse dragonfly species (Odonata) of Africa with the help of the new tool.

Species distribution modelling (SDM) is used in many ecological applications, e.g. in identifying hotspots, planning and allocating protected areas, taking inventories of biodiversity, understanding phylogenetic and phyloclimatic relationships, or assessing the effects of environmental change on biological diversity. Statistical modelling is a modelling method frequently used in SDM; analytical and mechanistic modelling are others. With regard to landscape, population and climate, Africa has undergone many changes during the last decades. Because Odonata use both terrestrial and aquatic ecosystems and react very sensitively to changes, they are a suitable flagship species for many ecological applications. Modelling the distribution patterns of Odonata can therefore play an important role, e.g. in planning nature conservation actions. As in most modelling tasks, however, careful attention must be paid to the assumptions made in the statistical modelling and to the data used. Because geographic information systems (GIS) are by nature intended for the analysis of spatial data, they should be an integral part of the SDM workflow for pre- and post-processing geodata. The combination of GIS and statistical modelling is thus a quality aspect of the functionality of an SDM tool. Usability is another of the frequently desired ‘quality-in-use’ attributes. Task-centred approaches can achieve the goal of modelling species distribution patterns, but in general they do not sufficiently consider the practical handling of the tool. User-centred approaches, in contrast, cover both usability and the achievement of the goal, because they take the user’s abilities and ease of use into account. Pseudagrion kersteni is a dragonfly species that is widespread south of the Sahara and is mainly used here to demonstrate the application of the SpeeDi Tool. One of the aims of the dissertation is to carry out various sensitivity analyses to obtain information about the influence of various parameters. Because the actual distribution range of P. kersteni is not known, a range map from the IUCN, drawn by several experts on the basis of watersheds, serves as the visual basis for comparison with the predicted range. The sensitivity analyses use sample data of P. kersteni with different modelling approaches, modelling parameters and environmental geodatasets.

One of the special features of the SpeeDi Tool is that randomly generated background samples are used instead of pseudo-absence samples in binary logistic regression. Background samples are increasingly used in SDM to compare the environmental conditions at the locations of presence samples with those at background samples. This solves the problem that records of species absence are only rarely collected officially or, where they exist, are insufficient for modelling. In some cases, reliable species absences can be derived by experts. Under such circumstances, using presence, absence and background samples can lead to better modelling results. This thesis demonstrates and discusses the concept of using these three types of samples and how it is applied in the SpeeDi Tool. Elastic-net regularisation is a more effective regularisation method than L1 or L2 regularisation. It is a hybrid of L1 and L2 regularisation and therefore compensates for the under-prediction of L1 and the over-prediction of L2 regularisation. Although elastic-net regularisation is already known in statistics, it has so far not been used in modelling species distributions with presence and background samples. The implementation requires special programming, which the SpeeDi Tool provides. Sampling bias influences the results and is difficult to avoid when the species records come from museums and herbaria. Biased samples also arise from unplanned or non-systematic surveys. The modelling in the SpeeDi Tool applies a weighting of the samples to balance different sampling densities and thus reduce sampling bias. In addition, weighting facilitates the inclusion of abundance data. Further advantages are that the statistics used in calculating significance are not inflated and that the sample prevalence of species occurrence is preserved.
Although the concept and use of background samples make it possible to work with field samples of presences, the population prevalence of occurrences in the background samples is not known. In general, an assumed value is therefore used, which in turn introduces uncertainty. To minimise this uncertainty, or to eliminate the effect of the assumed population prevalence in the background samples, a heuristic ‘soft-buffer-threshold’ (SBT) approach was conceptualised and implemented in the SpeeDi Tool. Although the processing of geodata is an integral part of SDM, GIS functionality is often not provided in SDM tools, because many of them follow a task-centred approach in which statistical modelling is regarded as the main task. The consequence is that a separate GIS program is used for basic geodata processing and that data must be imported and exported to obtain results. The implementation of the SpeeDi Tool follows the user-centred approach, which regards, for example, GIS functionality as an integral part of the entire SDM workflow. The user interface of the SpeeDi Tool provides basic GIS functionality for pre- and post-processing the geodata, and the binary logistic regression modelling is integrated via software coupling. This allows access to the GIS functionality and the required statistical functions directly from the main interface. Of the features of the SpeeDi Tool mentioned above, the possibility of modelling species distributions by combining presence samples with background and absence samples, elastic-net regularisation in the case of background samples, and the concept of SBT are new options in the SDM field that are currently not available in other SDM tools. Although weights can be used in logistic regression, the weighting of samples has so far not been common in SDM tasks. Another noteworthy feature of the tool is therefore the option of explicitly weighting individual sample points, which can be applied to counter sampling bias.

The modelling in this thesis uses the virtual species concept with presence and absence data in order to gain confidence in the model’s ability to predict distribution ranges. Because the actual distribution of a real species is unknown, the virtual species concept offers a way to evaluate how reliably the model depicts the true distribution. The same concept is then extended to presence and background modelling scenarios in order to substantiate the suitability of background samples for modelling species distributions with the SpeeDi Tool. A comparison is then made with the results from the real field samples and background samples. Finally, the use of additional expert-based absence samples together with presence and background samples is tested, which improved the prediction. In brief, the results of the sensitivity analyses of the model parameters show the following:
• Elastic-net regularisation leads to better results than L1 and L2 regularisation. The modelled distribution of P. kersteni is sensitive to the strength of regularisation. The analysis gives insight into selecting the optimal elastic-net factor; the value selected for P. kersteni also gave good results for modelling the distribution of A. minuscula.
• The predicted probability values are sensitive to the assumed value of the population prevalence in the background samples. However, the assumed value has less influence when thresholds are applied to delineate presence and absence ranges; the threshold is based on either minimising the difference between, or maximising the sum of, the sensitivity and specificity values of the classification. Applying SBT greatly reduces the differences in probability values (variance and range) and is therefore able to reduce the uncertainty associated with population prevalence.
• Although the statistical measure of predictive accuracy, the area under the receiver operating characteristic curve, showed no large differences, the visual comparison of the predicted ranges showed that the number of background samples in the model is a very sensitive parameter. Too few background samples could lead to over-prediction, while too many could lead to under-prediction in areas lacking presence samples.
• Different pseudo-random number generators used to generate random background samples have no real influence on the modelled distribution of P. kersteni. This shows that the ability of the SpeeDi Tool to predict distribution ranges is not sensitive to the distribution pattern of the background samples.
• Regarding the complexity of the relationships among the model variables, the degree of the polynomial has no influence on the prediction; however, the interaction between variables very strongly influences the probability of species occurrence.
• In the case of uneven sampling densities, often termed sampling bias, weights are assigned to the presence samples according to their spatial density; the results have shown that the use of weights can be an effective means of dealing with such sampling bias.

Im Hinblick auf Klima- und Landbedeckungsdaten liefern die Sensitivitätsanalysen der Umweltvariablen wertvolle Informationen für die Modellierung der Verbreitung von P. kersteni . Bioklimatische und monatliche Klimaroh-Daten werden hier als klimatische Datensätze berücksichtigt. Die Sensitivität von Landbedeckungsdaten wird mit Hilfe zweier Datensätze mit unterschiedlichen Klassifikationsschemata durchgeführt: Dem „GLC 2000“ Datensatz, der das Klassifikationsschema der FAO Landnutzungsklassifikation benutzt, und dem „USGS global land cover characterisation“ Datensatz, der auf der (modifizierten, zweiten) USGS Landnutzungs- /Landbedeckungsklassifikation basiert. Die folgenden Ergebnisse fassen die Sensitivität gegenüber der Klima- und Landbedeckungsdatensätze zusammen: • Die Verwendung von 19 bioklimatischen Variablen, die den Klimatrend in Bezug auf Temperatur und Niederschlag beschreiben, führt zu einer Unterprognose des Verbreitungsgebietes. • Die Verwendung von sechs bioklimatischen Variablen (Temperatur: jährliches Mittel, saisonale Schwankungen, Jahresbereich und mittel des kältesten Viertel, Niederschlag: Jahressumme und saisonale Schwankungen) zeigen weniger Unter-, aber zunehmende Überprognosen.

• Most of the over-prediction was effectively remedied by supplementing the six selected bioclimatic variables with a geographical trend surface as an additional predictor variable.
• Modelling with monthly climate data (temperature and precipitation) showed that the raw monthly climate data are preferable to the synthetic bioclimatic datasets, because noise in the synthetic data is not always filtered out effectively.
• Although the use of different land cover datasets (classification schemes) had no influence on the modelled distribution ranges, the results raise the suspicion that land cover classes can convey a misleading ecological impression; for example, classes with similar characteristics (closed to open deciduous shrubland in the GLC2000 dataset and savanna in the USGS dataset) differ in their relative contribution and have opposite effects (positive/negative) on the estimated probability values.
• With regard to the different land cover datasets (schemes), the result exposes the limits of statistical modelling methods, in which the importance of a variable is determined by the statistical model and may not correspond to its expected importance and/or to ecological theory.
• Within the static framework of SDM, predictions based solely on the model variables under current conditions (base year 2000) were projected onto the past climate and simulated land cover scenarios of 1940 and onto the future climate and land cover scenarios for 2050.
• Comparison of the modelled distribution ranges for 1940, 2000 and 2050 indicates that the range of P. kersteni appears to be shrinking and that large areas with suitable environmental conditions are being lost in Central and East Africa.
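A geographical trend surface is commonly built from polynomial terms of the map coordinates, which then enter the model alongside the environmental predictors. The following minimal sketch generates such terms for one location. The polynomial form, the default degree of 2 and all names are assumptions for illustration, since this summary does not state how the trend surface was constructed in the SpeeDi tool.

```python
def trend_surface_terms(x, y, degree=2):
    """Return the polynomial terms x**i * y**j with 1 <= i + j <= degree.

    x and y are (ideally centred and scaled) map coordinates of one cell.
    The returned dict maps a term label such as "x1y1" to its value.
    """
    terms = {}
    for total in range(1, degree + 1):
        for i in range(total, -1, -1):
            j = total - i
            terms[f"x{i}y{j}"] = (x ** i) * (y ** j)
    return terms


# Hypothetical coordinates; in practice these would be computed per grid cell
# and appended to the bioclimatic predictors before fitting the model.
print(trend_surface_terms(2.0, 3.0))
```

With degree 2 this yields five additional predictors (x, y, x², xy, y²); higher degrees allow more flexible broad-scale spatial trends at the cost of more parameters.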

As further sensitivity analyses, the influence of the spatial resolution of the geodata and of the spatial extent of the modelling is also examined. The regional Odonata species Aeshna minuscula, which occurs in South Africa, is used for this purpose. The modelling results suggest the following:
• When the modelling covers the entire continent, the results could give only a basic impression of the distribution pattern of the species.
• When the modelling is restricted to the South Africa region, the predicted distribution agrees very well with the expected distribution. In addition, the patterns were more detailed than when the whole continent was considered.
• The predicted distribution range of A. minuscula showed no large differences between the 1 km and 8 km resolutions when modelled at continental extent.
• Likewise, modelling a regional extent in South Africa at 1 km and 8 km resolution produced no significant changes in the predicted distribution range.

The aim of creating a new SDM tool, including a comprehensive sensitivity analysis of the model parameters and environmental variables, has been achieved. However, only the distributions of two African dragonfly species have been modelled with the SpeeDi tool; for the African continent alone, many more Odonata remain to be modelled (see Simaika et al., 2013). The sensitivity analyses of the regression control parameters so far relate only to the modelling of P. kersteni; the results of this modelling are proposed as default values for modelling with the SpeeDi tool. To provide a sounder basis and to improve the default parameters further, modelling of additional species, including similar sensitivity analyses, is recommended. In particular, the heuristic SBT method must be tested further to strengthen confidence in its application to SDM. The results have shown how sampling bias can be handled by assigning weights, but a suitable algorithm is still lacking. Nevertheless, the results already indicate a direction for developing such an algorithm, e.g. a) by balancing local densities within a region (each cluster separately) and b) by balancing densities across several regions (between clusters, e.g. in the case of P. kersteni the densities in southern and western Africa).

On the technical side, the framework required for a help system is already in place, but some of the content is still missing. Filling it in should be the next step before the tool can be put to general use. When developing the help content, the functionality and technical aspects of the tool can be described in a short time; following a user-centred approach takes more time but also makes the content more useful. This will require close collaboration with biologists/ecologists who use SDM, who are interested in closing the knowledge gaps between GIS and SDM, and who can prepare and add content that explains GIS terminology in an ecological context. Although a basic map layout can be created easily, a few additional ‘cosmetic’ functions for cartographic presentation would improve map production. The tool uses or includes several free and open-source algorithms, but the GIS library of the tool is not freely available. Adopting an open-source GIS could therefore appeal to many users, especially considering that the user community could then take care of maintenance, improvements and the development of extensions over time, which could benefit not only the tool itself but also the costs.
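As noted above, a suitable weighting algorithm is still lacking. Purely as an illustration of direction (a), balancing local densities, the following sketch assigns each presence record a weight inversely proportional to the number of records within a given radius and rescales the weights to a mean of 1. The neighbourhood radius, the planar distance measure and all names are hypothetical and are not part of the SpeeDi tool.

```python
import math


def density_weights(points, radius):
    """Weight each presence record by the inverse of its local record count.

    points: list of (x, y) coordinates in a planar (projected) system.
    radius: neighbourhood radius in the same units (an assumed tuning value).
    The weights are rescaled so that their mean is 1.
    """
    raw = []
    for xi, yi in points:
        # The count includes the record itself, so it is always at least 1.
        n = sum(1 for xj, yj in points
                if math.hypot(xi - xj, yi - yj) <= radius)
        raw.append(1.0 / n)
    scale = len(points) / sum(raw)
    return [w * scale for w in raw]


# Hypothetical records: three clustered points and one isolated point.
records = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
weights = density_weights(records, radius=2.0)
print(weights)  # the isolated record gets three times the weight of a clustered one
```

Direction (b), balancing between clusters, would additionally require a rule for delineating the clusters, which this sketch deliberately leaves open.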

References

Anderson, J. R., Hardy, E. E., Roach, J. T. and Witmer, R. E. 1976. A land use and land cover classification system for use with remote sensor data . US Geological Survey Professional Paper 964, USGS, Washington.

Apple. 2009. Apple human interface guidelines. Apple Inc, Cupertino.

Araújo, M. B. and Guisan, A. 2006. Five (or so) challenges for species distribution modelling. Journal of Biogeography 33 (10): 1677-1688.

Araújo, M. B. and Luoto, M. 2007. The importance of biotic interactions for modelling species distributions under climate change. Global Ecology and Biogeography 16 : 743-753.

Araújo, M. B. and Peterson, A. T. 2012. Uses and misuses of bioclimatic envelope modeling. Ecology 93 (7): 1527-1539.

Araújo, M. B., Williams, P. H. and Fuller, R. J. 2002. Dynamics of extinction and the selection of nature reserves. Proceedings of the Royal Society (Biological sciences) 269 (1504): 1971-1980.

Åström, M., Dynesius, M., Hylander, K. and Nilsson, C. 2007. Slope aspect modifies community responses to clear-cutting in boreal forests. Ecology 88 (3): 749-758.

Austin, M. P. 2002a. Case studies of the use of environmental gradients in vegetation and fauna modeling: Theory and practice in Australia and New Zealand. In: Scott, J. M., Heglund, P. J., Morrison, M. L., Haufler, J. B., Raphael, M. G., Wall, W. A., Samson, F. B. (eds.) Predicting species occurrences: Issues of accuracy and scale. Island Press, Washington, pp 73-82.

Austin, M. P. 2002b. Spatial prediction of species distribution: an interface between ecological theory and statistical modelling. Ecological Modelling 157 (2-3): 101-118.

Austin, M. 2007. Species distribution models and ecological theory: A critical assessment and some possible new approaches. Ecological Modelling 200 (1-2): 1-19.

Barry, S. and Elith, J. 2006. Error and uncertainty in habitat models. Journal of Applied Ecology 43 (3): 413-423.

Beaumont, L., Hughes, L. and Poulsen, M. 2005. Predicting species distributions: use of climatic parameters in BIOCLIM and its impact on predictions of species' current and future distributions. Ecological Modelling 186 (2): 251-270.

Bernhardsen, T. 1999. Geographic information systems: an introduction. 2nd ed. Wiley, New York.

Bevan, N. 1995. Measuring usability as quality of use. Software Quality Journal 4 (2): 115-130.

Bevan, N. 1999. Quality in use: Meeting user needs for quality. The Journal of Systems and Software 49 (1): 89-96.

Bevan, N. 2001a. International standards for HCI and usability. International Journal of Human-Computer Studies 55 (4): 533-552.

Bevan, N. 2001b. Quality in use for all. In: Stephanidis, C. (ed.) User interfaces for all: concepts, methods, and tools. Lawrence Erlbaum Associates, Mahwah, pp 353-368.

Bevan, N. 2006. Practical issues in usability measurement. Interactions 13 (6): 42-43.

Bevan, N. 2009. Extending quality in use to provide a framework for usability measurement. In: Holzinger, A., Kurosu, M. (eds.) Human Centered Design. Springer, Berlin, pp 13-22.

Bevan, N. and Macleod, M. 1994. Usability measurement in context. Behaviour and Information Technology 13 (1): 132-145.

Blackwell, A. F. 2006. The reification of metaphor as a design tool. ACM Transactions on Computer-Human Interaction 13 (4): 490-530.

Bocquet-Appel, J.-P. and Bacro, J.-N. 1993. Isolation by distance, trend surface analysis, and spatial autocorrelation. Human Biology 65 (1): 11-27.

Boitani, L., Maiorano, L., Baisero, D., Falcucci, A., Visconti, P. and Rondinini, C. 2011. What spatial data do we need to develop global mammal conservation strategies? Philosophical Transactions of the Royal Society (Biological Sciences) 366 : 2623-2632.

Bosworth, A., Box, D., Gudgin, M., Nottingham, M., Orchard, D. and Schlimmer, J. 2003. XML, SOAP, and binary data. White paper, Microsoft developer network. http://msdn.microsoft.com/en-us/library/ms996427.aspx (26-Apr-2013)

Brandmeyer, J. E. and Karimi, H. A. 2000. Coupling methodologies for environmental models. Environmental Modelling & Software 15 (5): 479-488.

Brent, R. P. 1992. Uniform random number generators for supercomputers. In: Proceedings of Fifth Australian Supercomputer Conference , Melbourne, December 1992; pp 95-104.

Brotons, L., Thuiller, W., Araújo, M. B. and Hirzel, A. H. 2004. Presence-absence versus presence-only modelling methods for predicting bird habitat suitability. Ecography 27 (4): 437-448.

Buermann, W., Saatchi, S., Smith, T. B., Zutta, B. R., Chaves, J. A., Milá, B. and Graham, C. H. 2008. Predicting species distributions across the Amazonian and Andean regions using remote sensing data. Journal of Biogeography 35 (7): 1160-1176.

Bugayevskiy, L. M. and Snyder, J. P. 1995. Map projections : a reference manual. Taylor & Francis, London.

Buisson, L., Thuiller, W., Casajus, N., Lek, S. and Grenouillet, G. 2010. Uncertainty in ensemble forecasting of species distribution. Global Change Biology 16 (4): 1145-1157.

Busby, J. R. 1986. A biogeoclimatic analysis of Nothofagus cunninghamii (Hook.) Oerst. in southeastern Australia. Austral Ecology 11 (1): 1-7.

Busby, J. R. 1991. BIOCLIM - a bioclimatic analysis and prediction system. In: Margules, C. R., Austin, M. P. (eds.) Nature conservation: cost effective biological surveys and data analysis. CSIRO, pp 64-68.

Buxton, W. 2007. Sketching user experiences: getting the design right and the right design. Morgan Kaufmann, Heidelberg.

Calabrese, J. M., Certain, G., Kraan, C. and Dormann, C. F. 2014. Stacking species distribution models and adjusting bias by linking them to macroecological models. Global Ecology and Biogeography 23 : 99-112.

Carpenter, G., Gillison, A. N. and Winter, J. 1993. DOMAIN: a flexible modelling procedure for mapping potential distributions of plants and animals. Biodiversity and Conservation 2 (6): 667-680.

Carroll, J. M. 2004. Beyond fun. Interactions 11 (5): 38-40.

Carter, T., Jones, R., Lu, X., Bhadwal, S., Conde, C., Mearns, L., O’Neill, B., Rounsevell, M. and Zurek, M. 2007. New assessment methods and the characterisation of future conditions. In: Parry, M., Canziani, O., Palutikof, J., van der Linden, P., Hanson, C. (eds.) Climate change 2007: Impacts, adaptation and vulnerability. Contribution of working group II to the fourth assessment report of the Intergovernmental Panel on Climate Change. Cambridge University Press, Cambridge, pp 133-171.

Chapman, A. D., Muñoz, M. E. and Koch, I. 2005. Environmental information: Placing biodiversity phenomena in an ecological and environmental context. Biodiversity Informatics 2: 24-41.

Christensen, J., Hewitson, B., Busuioc, A., Chen, A., Gao, X., Held, I., Jones, R., Kolli, R., Kwon, W.-T., Laprise, R., Magaña Rueda, V., Mearns, L., Menéndez, C., Räisänen, J., Rinke, A., Sarr, A. and Whetton, P. 2007. Regional climate projections. In: Solomon, S., Qin, D., Manning, M., Chen, Z., Marquis, M., Averyt, K., Tignor, M., Miller, H. (eds.) Climate change 2007: The physical science basis. Contribution of working group I to the fourth assessment report of the Intergovernmental Panel on Climate Change. Cambridge University Press, Cambridge, pp 849-940.

Clausnitzer, V., Dijkstra, K.-D. B., Koch, R., Boudot, J.-P., Darwall, W. R. T., Kipping, J., Samraoui, B., Samways, M. J., Simaika, J. P. and Suhling, F. 2012. Focus on African freshwaters: hotspots of dragonfly diversity and conservation concern. Frontiers in Ecology and the Environment 10 (3): 129-134.

Clausnitzer, V., Suhling, F. and Dijkstra, K.-D. 2010. Pseudagrion kersteni . IUCN Red List of Threatened Species. Version 2012.2. http://www.iucnredlist.org/details/full/60022/0 (2012-Sept-05)

Cooper, A. 2004. The inmates are running the asylum. Sams, Indianapolis.

Cooper, A., Reimann, R. and Cronin, D. 2007. About face: the essentials of interaction design. 3rd ed. Wiley, Indianapolis.

Corbet, P. S. 1962. A biology of dragonflies. Witherby, London.

Counsell, S. 2009. Forest governance in Africa . SAIIA Occasional Paper, No 50. http://www.saiia.org.za/images/stories/pubs/occasional_papers/saia_sop_50_counsell_20091026.pdf (5-Dec-2012)

Cox, C. B. and Moore, P. D. 2005. Biogeography: an ecological and evolutionary approach. 7th ed. Blackwell, Malden.

Dall'Olmo, G. and Karnieli, A. 2002. Monitoring phenological cycles of desert ecosystems using NDVI and LST data derived from NOAA-AVHRR imagery. International Journal of Remote Sensing 23 (19): 4055-4071.

Darwall, W., Smith, K., Allen, D., Holland, R., Harrison, I., Brooks, E. (eds.). 2011. The Diversity of Life in African Freshwaters: Under Water, Under Threat. An analysis of the status and distribution of freshwater species throughout mainland Africa. IUCN, Gland, Switzerland and Cambridge, UK, p 347.

Stockwell, D. R. B. and Peterson, A. T. 2002. Effects of sample size on accuracy of species distribution models. Ecological Modelling 148 (1): 1-13.

DeFries, R., Hansen, M. and Townshend, J. 1995. Global discrimination of land cover types from metrics derived from AVHRR pathfinder data. Remote Sensing of Environment 54 (3): 209-222.

Dempster, A., Laird, N. and Rubin, D. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39 (1): 1-38.

Dijkstra, K.-D. B., Boudot, J.-P., Clausnitzer, V., Kipping, J., Kisakye, J. J., Ogbogu, S. S., Samraoui, B., Samways, M. J., Schütte, K., Simaika, J. P., Suhling, F. and Tchibozo, S. L. 2011. Dragonflies and damselflies of Africa (Odonata): history, diversity, distribution, and conservation. In: Darwall, W. R. T., Smith, K. G., Allen, D. J., Holland, R. A., Harrison, I. J., Brooks, E. G. E. (eds.) The diversity of life in African freshwaters: under water, under threat. An analysis of the status and distribution of freshwater species throughout mainland Africa. IUCN, Gland, pp 128-177.

Diniz-Filho, J. A. F., Rangel, T. F. L., Bini, L. M. and Hawkins, B. A. 2007. Macroevolutionary dynamics in environmental space and the latitudinal diversity gradient in New World birds. Proceedings of the Royal Society B: Biological Sciences 274 (1606): 43-52.

Dormann, C. F., McPherson, J. M., Araújo, M. B., Bivand, R., Bolliger, J., Carl, G., Davies, R. G., Hirzel, A., Jetz, W., Kissling, W. D., Kühn, I., Ohlemüller, R., Peres-Neto, P. R., Reineking, B., Schröder, B., Schurr, F. M. and Wilson, R. 2007. Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. Ecography 30 : 609-628.

Dudík, M. 2007. Maximum entropy density estimation and modeling geographic distribution of species . PhD Thesis, Princeton University, USA.

Elith, J. and Burgman, M. 2002. Predictions and their validation: rare plants in the Central Highlands, Victoria, Australia. In: Scott, J. M., Heglund, P. J., Morrison, M. L., Haufler, J. B., Raphael, M. G., Wall, W. A., Samson, F. B. (eds.) Predicting species occurrences: Issues of accuracy and scale. Island Press, Washington, pp 303-314.

Elith, J. and Graham, C. H. 2009. Do they? How do they? WHY do they differ? On finding reasons for differing performances of species distribution models. Ecography 32 (1): 66-77.

Elith, J., Graham, C. H., Anderson, R. P., Dudík, M., Ferrier, S., Guisan, A., Hijmans, R. J., Huettmann, F., Leathwick, J. R., Lehmann, A., Li, J., Lohmann, L. G., Loiselle, B. A., Manion, G., Moritz, C., Nakamura, M., Nakazawa, Y., Overton, J. McC., Peterson, A. T., Phillips, S. J., Richardson, K., Scachetti-Pereira, R., Schapire, R. E., Soberón, J., Williams, S., Wisz, M. S. and Zimmermann, N. E. 2006. Novel methods improve prediction of species' distributions from occurrence data. Ecography 29 (2): 129-151.

Elith, J. and Leathwick, J. 2007. Predicting species distributions from museum and herbarium records using multiresponse models fitted with multivariate adaptive regression splines. Diversity and Distributions 13 (3): 265-275.

Elith, J. and Leathwick, J. R. 2009. Species distribution models: ecological explanation and prediction across space and time. Annual Review of Ecology, Evolution and Systematics 40 : 677-697.

Elith, J., Phillips, S. J., Hastie, T., Dudík, M., Chee, Y. E. and Yates, C. J. 2011. A statistical explanation of MaxEnt for ecologists. Diversity and Distributions 17 (1): 43-57.

Engler, R., Guisan, A. and Rechsteiner, L. 2004. An improved approach for predicting the distribution of rare and endangered species from occurrence and pseudo-absence data. Journal of Applied Ecology 41 (2): 263-274.

FAO. 2010. Land Use Systems of the World - Sub-Saharan Africa . FAO/UNEP. http://www.fao.org/geonetwork/srv/en/metadata.show?id=37048 (26-Jun-2013)

Farber, O. and Kadmon, R. 2003. Assessment of alternative approaches for bioclimatic modeling with special emphasis on the Mahalanobis distance. Ecological Modelling 160 (1-2): 115-130.

Ferrier, S., Watson, G., Pearce, J. and Drielsma, M. 2002. Extended statistical approaches to modelling spatial pattern in biodiversity in northeast New South Wales. I. Species-level modelling. Biodiversity and Conservation 11 (12): 2275-2307.

Franklin, J. 1995. Predictive vegetation mapping: geographic modelling of biospatial patterns in relation to environmental gradients. Progress in Physical Geography 19 (4): 474-499.

Franklin, J. 2010. Mapping species distributions: spatial inference and prediction. Cambridge University Press, Cambridge.

Frederiksen, P. and Lawesson, J. E. 1992. Vegetation types and patterns in Senegal based on multivariate analysis of field and NOAA-AVHRR satellite data. Journal of Vegetation Science 3 (4): 535-544.

Friedman, J. 1991. Multivariate adaptive regression splines. The Annals of Statistics 19 (1): 1-67.

Friedman, J., Hastie, T. and Tibshirani, R. 2010. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33 (1): 1-22.

Galitz, W. O. 2007. The essential guide to user interface design: an introduction to GUI design principles and techniques. 3rd ed. Wiley, Indianapolis.

García, A. 2006. Using ecological niche modelling to identify diversity hotspots for the herpetofauna of Pacific lowlands and adjacent interior valleys of Mexico. Biological Conservation 130 (1): 25-46.

Garen, K. 2007. Software Portability: Weighing Options, Making Choices. The CPA Journal 77 (11): 10-12.

Germer, J. and Sauerborn, J. 2008. Estimation of the impact of oil palm plantation establishment on greenhouse gas balance. Environment, Development and Sustainability 10 (6): 697-716.

Ghisla, A., Rocchini, D., Neteler, M., Förster, M. and Kleinschmit, B. 2012. Species distribution modelling and open source GIS: why are they still so loosely connected? In: Proceedings of 2012 International Congress on Environmental Modelling and Software, Seppelt, R., Voinov, A., Lange, S., Bankamp, D. (eds.) Leipzig, Germany; pp 1481-1488.

Golafshani, N. 2003. Understanding reliability and validity in qualitative research. The Qualitative Report 8 (4): 597-607.

Graef, F., Schmidt, G., Schröder, W. and Stachow, U. 2005. Determining ecoregions for environmental and GMO monitoring networks. Environmental Monitoring and Assessment 108 (1-3): 189-203.

Graham, C. H., Ron, S. R., Santos, J. C., Schneider, C. J. and Moritz, C. 2004. Integrating phylogenetics and environmental niche models to explore speciation mechanisms in dendrobatid frogs. Evolution 58 (8): 1781-1793.

Greenpeace. 2012. Palm oil's new frontier: How industrial expansion threatens Africa's rainforests. Greenpeace International, Amsterdam.

Grömping, U. 2006. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software 17 (1): 1-27.

Guisan, A., Broennimann, O., Engler, R., Vust, M., Yoccoz, N. G., Lehmann, A. and Zimmermann, N. E. 2006. Using niche-based models to improve the sampling of rare species. Conservation Biology 20 (2): 501-511.

Guisan, A. and Thuiller, W. 2005. Predicting species distribution: offering more than simple habitat models. Ecology Letters 8 (9): 993-1009.

Guisan, A., Weiss, S. B. and Weiss, A. D. 1999. GLM versus CCA spatial modeling of plant species distribution. Plant Ecology 143 (1): 107-122.

Guisan, A. and Zimmermann, N. E. 2000. Predictive habitat distribution models in ecology. Ecological Modelling 135 (2-3): 147-186.

Guisan, A., Zimmermann, N., Elith, J., Graham, C., Phillips, S. and Peterson, A. 2007. What matters for predicting the occurrences of trees: techniques, data, or species' characteristics? Ecological Monographs 77 (4): 615-630.

Hartley, S., Krushelnycky, P. D. and Lester, P. J. 2010. Integrating physiology, population dynamics and climate to make multi-scale predictions for the spread of an invasive insect: the Argentine ant at Haleakala National Park, Hawaii. Ecography 33 (1): 83-94.

Hassenzahl, M. 2003. The thing and I: understanding the relationship between user and product. In: Blythe, M., Monk, A., Overbeeke, C., Wright, P. (eds.) Funology: From usability to user enjoyment. Kluwer, Dordrecht, pp 31-42.

Hastie, T., Tibshirani, R. and Friedman, J. 2009. The elements of statistical learning - data mining, inference, and prediction. 2nd ed. Springer, New York.

He, F. 2010. Maximum entropy, logistic regression, and species abundance. Oikos 119 (4): 578-582.

Heikkinen, R. K., Luoto, M., Araújo, M. B., Virkkala, R., Thuiller, W. and Sykes, M. T. 2006. Methods and uncertainties in bioclimatic envelope modelling under climate change. Progress in Physical Geography 30 (6): 751-777.

Hengl, T., Sierdsema, H., Radovic, A. and Dilo, A. 2009. Spatial prediction of species' distributions from occurrence-only records: combining point pattern analysis, ENFA and regression-kriging. Ecological Modelling 220 (24): 3499-3511.

Heywood, I., Cornelius, S. and Carver, S. 2006. An introduction to geographical information systems. 3rd ed. Pearson, Harlow.

Hijmans, R. J., Cameron, S. E., Parra, J. L., Jones, P. G. and Jarvis, A. 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25 (15): 1965-1978.

Hijmans, R. and Elith, J. 2011. Species distribution modeling with R. http://cran.r-project.org/web/packages/dismo/vignettes/sdm.pdf (11-Apr-2013)

Hijmans, R. J. and Graham, C. H. 2006. The ability of climate envelope models to predict the effect of climate change on species distributions. Global Change Biology 12 (12): 2272-2281.

Hijmans, R. J., Guarino, L. and Mathur, P. 2012. Diva-GIS Version 7.5 . Manual. http://www.diva-gis.org (25-Apr-2013)

Hirzel, A. 2004. BioMapper 3 . User's manual. http://www2.unil.ch/biomapper/ (03-Feb-2012)

Hirzel, A. H., Hausser, J., Chessel, D. and Perrin, N. 2002b. Ecological-Niche Factor Analysis: how to compute habitat-suitability maps without absence data? Ecology 83 (7): 2027-2036.

Hirzel, A., Hausser, J. and Perrin, N. 2002a. Biomapper 3.1 . Lab. for Conservation Biology, Lausanne. http://www.unil.ch/biomapper (11-Apr-2013)

Hirzel, A., Le Lay, G., Helfer, V., Randin, C. and Guisan, A. 2006. Evaluating the ability of habitat suitability models to predict species presences. Ecological Modelling 199 (2): 142-152.

Holland, R., Garcia, N. and Brooks, E. 2011. Synthesis for all taxa. In: Darwall, W. R. T., Smith, K. G., Allen, D. J., Holland, R. A., Harrison, I. J., Brooks, E. G. E. (eds.) The diversity of life in African freshwaters: under water, under threat. An analysis of the status and distribution of freshwater species throughout mainland Africa. IUCN, Gland, pp 226-269.

Holt, R. D. and Barfield, M. 2009. Trophic interactions and range limits: the diverse roles of predation. Proceedings of the Royal Society B: Biological Sciences 276 : 1435-1442.

Hosmer, D. and Lemeshow, S. 2000. Applied Logistic Regression. 2nd ed. John Wiley & Sons, New York.

Huggett, R. J. 2004. Fundamentals of Biogeography. 2nd ed. Routledge, Oxfordshire.

Huntley, B., Hole, D. G. and Willis, S. G. 2011. Assessing the effectiveness of a protected area network in the face of climatic change. In: Hodkinson, T. R., Jones, M. B., Waldren, S., Parnell, J. A. N. (eds.) Climate Change, Ecology and Systematics. Cambridge University Press, Cambridge, pp 345-364.

IPCC. 2007. Climate change 2007: The physical science basis. Contribution of working group I to the fourth assessment report of the Intergovernmental Panel on Climate Change. Cambridge University Press, Cambridge.

Jankowski, P. 1995. Integrating geographical information systems and multiple criteria decision-making methods. International Journal of Geographical Information Systems 9 (3): 251-273.

Jarvis, A., Reuter, H., Nelson, A. and Guevara, E. 2008. Hole-filled seamless SRTM data V4. International Centre for Tropical Agriculture (CIAT). http://srtm.jrc.ec.europa.eu/ (13-Jul-2012).

Jeffers, J. N. R. 1999. Genetic Algorithms I. In: Fielding, A. H. (ed.) Machine Learning Methods for Ecological Applications. Kluwer Academic, Boston, pp 107-121.

Jenness, J., Dooley, J., Aguilar-Manjarrez, J. and Riva, C. 2007. African water resource database. GIS-based tools for inland aquatic resource management - Technical manual and workbook. CIFA Technical Paper. No. 33, FAO, Rome.

Jiménez-Valverde, A. and Lobo, J. 2007. Threshold criteria for conversion of probability of species presence to either-or presence-absence. Acta Oecologica 31 (3): 361-369.

Jiménez-Valverde, A., Lobo, J. M. and Hortal, J. 2009. The effect of prevalence and its interaction with sample size on the reliability of species distribution models. Community Ecology 10 (2): 196-205.

Johnson, J. 2010. Designing with the mind in mind: simple guide to understanding user interface design rules. Morgan Kaufmann, Heidelberg.

Kalkman, V. J., Clausnitzer, V., Dijkstra, K.-D. B., Orr, A. G., Paulson, D. R. and van Tol, J. 2008. Global diversity of dragonflies (Odonata) in freshwater. Hydrobiologia 595 (1): 351-363.

Karimi, H. A. and Houston, B. H. 1996. Evaluating strategies for integrating environmental models with GIS: current trends and future needs. Computers, Environment and Urban Systems 20 (6): 413-425.

Kim, J.-I. 2004. Desktop metaphor. In: Bainbridge, W. S. (ed.) Berkshire encyclopedia of human-computer interaction. Berkshire Publishing, Great Barrington, pp 158-162.

Kipping, J., Dijkstra, K.-D. B., Clausnitzer, V., Suhling, F. and Schütte, K. 2009. Odonata Database of Africa (ODA). Agrion 13 (1): 20-23.

Klein Goldewijk, K. 2001. Estimating global land use change over the past 300 years: The HYDE Database. Global Biogeochemical Cycles 15 (12): 417-433.

Klein Goldewijk, K., Van Drecht, G. and Bouwman, A. F. 2007. Mapping contemporary global cropland and grassland distributions on a 5 x 5 minute resolution. Journal of Land Use Science 2 (3): 167-190.

Knuth, D. E. 1998. The art of computer programming. Vol. 2, 3rd ed. Addison-Wesley, Reading.

Krzanowski, W. J. and Hand, D. J. 2009. ROC curves for continuous data. CRC Press, Boca Raton.

Kundzewicz, Z., Mata, L., Arnell, N., Döll, P., Kabat, P., Jiménez, B., Miller, K., Oki, T., Sen, Z. and Shiklomanov, I. 2007. Freshwater resources and their management. In: Parry, M. L., Canziani, O. F., Palutikof, J. P., van der Linden, P. J., Hanson, C. E. (eds.) Climate change 2007: Impacts, adaptation and vulnerability. Contribution of working group II to the fourth assessment report of the Intergovernmental Panel on Climate Change. Cambridge University Press, Cambridge, pp 173-210.

Kuniavsky, M. 2010. Smart things: ubiquitous computing user experience design. Morgan Kaufmann, Heidelberg.

Laporte, N. T., Stabach, J. A., Grosch, R., Lin, T. S. and Goetz, S. J. 2007. Expansion of industrial logging in Central Africa. Science 316 (5830): 1451.

Le Lay, G., Engler, R., Franc, E. and Guisan, A. 2010. Prospective sampling based on model ensembles improves the detection of rare species. Ecography 33 (6): 1015-1027.

Leaché, A. D., Koo, M. S., Spencer, C. L., Papenfuss, T. J., Fisher, R. N. and McGuire, J. A. 2009. Quantifying ecological, morphological, and genetic variation to delimit species in the coast horned lizard species complex (Phrynosoma). PNAS 106 (30): 12418-12423.

Leathwick, J., Elith, J. and Hastie, T. 2006. Comparative performance of generalized additive models and multivariate adaptive regression splines for statistical modelling of species distributions. Ecological Modelling 199 (2): 188-196.

Legendre, P. and Legendre, L. 1998. Numerical ecology. Second English ed. Elsevier, Amsterdam.

Legendre, P. and Fortin, M.-J. 1989. Spatial pattern and ecological analysis. Vegetatio 80 (2): 107-138.

Lehmann, A., Overton, J. and Leathwick, J. 2002. GRASP: generalized regression analysis and spatial prediction. Ecological Modelling 157 (2-3): 189-207.

Lehner, B. and Döll, P. 2004. Development and validation of a global database of lakes, reservoirs and wetlands. Journal of Hydrology 296 (1-4): 1-22.

Lewis, C. and Rieman, J. 1994. Task-centered user interface design: a practical introduction . University of Colorado, Colorado. http://hcibib.org/tcuid/tcuid.pdf (20-Oct-2010)

Lindsay, S. W. and Martens, W. J. 1998. Malaria in the African highlands: past, present and future. Bulletin of the World Health Organization 76 (1): 33-45.

Lobo, J. M., Jiménez-Valverde, A. and Hortal, J. 2010. The uncertain nature of absences and their importance in species distribution modelling. Ecography 33 (1): 103-114.

Lobo, J. M., Jiménez-Valverde, A. and Real, R. 2008. AUC: a misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography 17 (2): 145-151.

Lobo, J. M., Verdu, J. R. and Numa, C. 2006. Environmental and geographical factors affecting the Iberian distribution of flightless Jekelius species (Coleoptera: Geotrupidae). Diversity & Distributions 12 (2): 179-188.

Lonsdale, M. M. 2002. Biological invasion. In: Mooney, H. A., Canadell, J. G. (eds.) The Earth system: Biological and ecological dimensions of global environmental change, Encyclopedia of global environmental change. Wiley, New York, Vol. 2, pp 11-19.

Loveland, T. R., Reed, B. C., Brown, J. F., Ohlen, D. O., Zhu, Z., Yang, L. and Merchant, J. W. 2000. Development of a global land cover characteristics database and IGBP DISCover from 1 km AVHRR data. International Journal of Remote Sensing 21 (6-7): 1303-1330.

Lütolf, M., Kienast, F. and Guisan, A. 2006. The ghost of past species occurrence: improving species distribution models for presence-only data. Journal of Applied Ecology 43 (4): 802-815.

Maggini, R., Lehmann, A., Zimmermann, N. E. and Guisan, A. 2006. Improving generalized regression analysis for the spatial prediction of forest communities. Journal of Biogeography 33 (10): 1729-1749.

Maguire, M. C., Kirakowski, J. and Vereker, N. 1998. User-centred requirements handbook Version 3.3. TE2010 RESPECT WP5 Deliverable D5.3, HUSAT Research Institute, Loughborough.

Marsaglia, G. 2003. Xorshift RNGs. Journal of Statistical Software 8(14): 1-6.

Marsaglia, G. and Tsang, W. W. 2002. Some difficult-to-pass tests of randomness. Journal of Statistical Software 7(3): 1-9.

Matsumoto, M. and Nishimura, T. 1998. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Transactions on Modeling and Computer Simulation 8(1): 3-30.

Ma, J., Wang, X. and Wang, F. 2009. Research of interaction design method based on metaphor. In: Proceedings of IEEE 10th International Conference on Computer-Aided Industrial Design & Conceptual Design, 26-29 Nov. 2009, Wenzhou, China; pp 142-145.

Mayaux, P., Bartholomé, E., Fritz, S. and Belward, A. 2004. A new land-cover map of Africa for the year 2000. Journal of Biogeography 31 (6): 861-877.

Mayaux, P., Bartholomé, E., Massart, M., Van Cutsem, C., Cabral, A., Nonguierma, A., Diallo, O., Pretorius, C., Thompson, M., Cherlet, M., Pekel, J.-F., Defourny, P., Vasconcelos, M., Di Gregorio, A., Fritz, S., De Grandi, G., Elvidge, C., Vogt, P. and Belward, A. 2003. A land cover map of Africa. EUR 20665, Office for Official Publications of the European Communities, Luxembourg.

McCarthy, J. and Wright, P. 2004. Technology as experience. Interactions 11 (5): 42-43.

McClean, C. J., Lovett, J. C., Küper, W., Hannah, L., Sommer, J. H., Barthlott, W., Termansen, M., Smith, G. F., Tokumine, S. and D., J. R. 2005. African plant diversity and climate change. Annals of the Missouri Botanical Garden 92 (2): 139-152.

McCullagh, P. and Nelder, J. A. 1989. Generalized Linear Models. 2nd ed. Chapman & Hall, London.

Meehl, G., Stocker, T., Collins, W., Friedlingstein, P., Gaye, A., Gregory, J., Kitoh, A., Knutti, R., Murphy, J., Noda, A., Raper, S., Watterson, I., Weaver, A. and Zhao, Z.-C. 2007. Global climate projections. In: Solomon, S., Qin, D., Manning, M., Chen, Z., Marquis, M., Averyt, K., Tignor, M., Miller, H. (eds.) Climate change 2007: The physical science basis. Contribution of working group I to the fourth assessment report of the Intergovernmental Panel on Climate Change. Cambridge University Press, Cambridge, pp 747-845.

Microsoft. 2009. Microsoft application architecture guide. 2nd ed. Microsoft press.

Microsoft. 2010. Windows user experience interaction guidelines. http://www.microsoft.com/en-us/download/details.aspx?id=2695 (19-Jan-2012)

Mitchell, T. D. and Jones, P. D. 2005. An improved method of constructing a database of monthly climate observations and associated high-resolution grids. International Journal of Climatology 25 (6): 693-712.

Munguía, M., Peterson, A. T. and Sánchez-Cordero, V. 2008. Dispersal limitation and geographical distributions of mammal species. Journal of Biogeography 35 (10): 1879-1887.

Myers, N., Mittermeier, R. A., Mittermeier, C. G., da Fonseca, G. A. B. and Kent, J. 2000. Biodiversity hotspots for conservation priorities. Nature 403 (6772): 853-858.

Nasi, R., Cassagne, B. and Billand, A. 2006. Forest management in Central Africa: where are we? International Forestry Review 8(1): 14-20.

Naudé, W. A. 2008. Conflict, disasters and no jobs: reasons for international migration from Sub-Saharan Africa. Research paper no. 2008/85, UNU-WIDER, Helsinki.

Nielsen, J. 2003. User empowerment and fun factor. In: Blythe, M., Monk, A., Overbeeke, C., Wright, P. (eds.) Funology: From usability to user enjoyment. Kluwer, Dordrecht, pp 103-105.

Nielsen, J. 2010. What is usability? In: Wilson, C. (ed.) User experience re-mastered: your guide to getting the right design. Morgan Kaufmann, Heidelberg, pp 3-22.

Nocedal, J. and Wright, S. J. 2006. Numerical optimization. 2nd ed. Springer, New York p 664.

Norman, D. A. 1990. Design of everyday things. Doubleday, New York.

Okali, D. and Eyog-Matig, O. 2004. Rain forest management for wood production in West and Central Africa. A report prepared for the project Lessons Learnt on Sustainable Forest Management in Africa for AFORNET, KSLA and FAO.

Olson, D. M., Dinerstein, E., Wikramanayake, E. D., Burgess, N. D., Powell, G. V. N., Underwood, E. C., D'amico, J. A., Itoua, I., Strand, H. E., Morrison, J. C., Loucks, C. J., Allnutt, T. F., Ricketts, T. H., Kura, Y. and Jo. 2001. Terrestrial ecoregions of the world: A new map of life on Earth. BioScience 51 (11): 933-938.

Oxford Dictionaries. 2010. "quality". Oxford University Press. http://oxforddictionaries.com/definition/english/quality?q=quality (23-January-2013).

Parry, M., Canziani, O., Palutikof, J. and Co-authors. 2007. Technical summary. In: Parry, M. L., Canziani, O. F., Palutikof, J. P., van der Linden, P. J., Hanson, C. E. (eds.) Climate change 2007: Impacts, adaptation and vulnerability. Contribution of working group II to the fourth assessment report of the Intergovernmental Panel on Climate Change. Cambridge University Press, Cambridge, pp 23-78.

Passini, S., Strazzari, F. and Borghi, A. 2008. Icon-function relationship in toolbar icons. Displays 29 (5): 521-525.

Pearson, R. G. 2007. Species' distribution modeling for conservation educators and practitioners. Synthesis, Center for Biodiversity Conservation, American Museum of Natural History, New York. http://ncep.amnh.org (17-May-2013)

Pearson, R. G. and Dawson, T. P. 2003. Predicting the impacts of climate change on the distribution of species: are bioclimate envelope models useful? Global Ecology & Biogeography 12 (5): 361-371.

Peel, M. C., Finlayson, B. L. and McMahon, T. A. 2007. Updated world map of the Köppen-Geiger climate classification. Hydrology and Earth System Sciences Discussions 4(2): 439-473.

Peterson, A. T. 2003. Predicting the geography of species' invasions via ecological niche modeling. The Quarterly Review of Biology 78 (4): 419-433.

Petrie, H. and Bevan, N. 2009. The evaluation of accessibility, usability, and user experience. In: Stephanidis, C. (ed.) The universal access handbook. CRC Press, Boca Raton, pp 20.1 - 20.16.

Phillips, S. J. 2008. Transferability, sample selection bias and background data in presence-only modelling: a response to Peterson et al. (2007). Ecography 31 (2): 272-278.

Phillips, S., Anderson, R. and Schapire, R. 2006. Maximum entropy modeling of species geographic distributions. Ecological Modelling 190 (3-4): 231-259.

Phillips, S. J. and Dudík, M. 2008. Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography 31 (2): 161-175.

Phillips, S. J., Dudík, M., Elith, J., Graham, C. H., Lehmann, A., Leathwick, J. and Ferrier, S. 2009. Sample selection bias and presence-only distribution models: implications for background and pseudo-absence data. Ecological Applications 19 (1): 181-197.

Pink, B. 2011. Australian standard geographical classification. Australian Bureau of Statistics, Canberra.

Preece, J., Rogers, Y. and Sharp, H. 2002. Interaction design: beyond human-computer interaction. Wiley, New York.

Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. 2007. Numerical recipes: the art of scientific computing. 3rd ed. Cambridge University Press, Cambridge.

Pulliam, H. R. 2000. On the relationship between niche and distribution. Ecology Letters 3(4): 349-361.

Rach, J., DeSalle, R., Sarkar, I. N., Schierwater, B. and Hadrys, H. 2008. Character-based DNA barcoding allows discrimination of genera, species and populations in Odonata. Proceedings of the Royal Society (Biological sciences) 275 (1632): 237-247.

Ramirez, J. and Jarvis, A. 2008. High resolution statistically downscaled future climate surfaces. International Center for Tropical Agriculture (CIAT); CGIAR Research Program on Climate Change, Agriculture and Food Security (CCAFS), Cali, Colombia. http://www.ccafs-climate.org/data/ (10-Nov-2012)

Raza, A. 2011. A Usability Maturity Model for Open Source Software. PhD Thesis, The University of Western Ontario, Ontario, Canada.

Ritchie, D. M. 1980. The evolution of the Unix time-sharing system. In: Tobias, J. M. (ed.) Lecture notes in computer science - Language design and programming methodology. Springer, Heidelberg, Vol. 79, pp 25-35.

Roberts, J. J., Best, B. D., Dunn, D. C., Treml, E. A. and Halpin, P. N. 2010. Marine Geospatial Ecology Tools: An integrated framework for ecological geoprocessing with ArcGIS, Python, R, MATLAB, and C++. Environmental Modelling & Software 25 (10): 1197-1207.

Robinson, P. A. 2009. Writing and designing manuals and warnings. 4th ed. CRC Press, Boca Raton.

Rodrigues, A. S. L., Andelman, S. J., Bakarr, M. I., Boitani, L., Brooks, T. M., Cowling, R. M., Fishpool, L. D. C., da Fonseca, G. A. B., Gaston, K. J., Hoffmann, M., Long, J. S., Marquet, P. A., Pilgrim, J. D., Pressey, R. L., Schipper, J., Sechrest, W., Stuart, S. N., Underhill, L. G., Waller, R. W., Watts, M. E. J. and Yan, X. 2004. Effectiveness of the global protected area network in representing species diversity. Nature 428 : 640-643.

Roy, G. G. 1992. An evaluation of command line and menu interfaces in a CAD environment. International Journal of Computer Integrated Manufacturing 5(2): 94-106.

Sahlén, G. and Ekestubbe, K. 2001. Identification of dragonflies (Odonata) as indicators of general species richness in boreal forest lakes. Biodiversity and Conservation 10 (5): 673-690.

Salman, Y. B., Cheng, H.-I. and Patterson, P. E. 2012. Icon and user interface design for emergency medical information systems: A case study. International Journal of Medical Informatics 81 (1): 29-35.

Santika, T. and Hutchinson, M. F. 2009. The effect of species response form on species distribution model prediction and inference. Ecological Modelling 220 (19): 2365-2379.

Scachetti-Pereira, R. 2002. DesktopGarp: a software package for biodiversity and ecologic research. The University of Kansas Biodiversity Research Center. http://www.nhm.ku.edu/desktopgarp/ (22-Dec-2008)

Scheirer, J., Fernandez, R., Klein, J. and Picard, R. W. 2002. Frustrating the user on purpose: a step toward building an affective computer. Interacting with Computers 14 (2): 93-118.

Shneiderman, B. 2004. Designing for fun: how can we design user interfaces to be more fun? Interactions 11 (5): 48-50.

Shneiderman, B. and Plaisant, C. 2005. Designing the user interface: strategies for effective human-computer interaction. 4th ed. Pearson, Munich.

Simaika, J. P., Samways, M. J., Kipping, J., Suhling, F., Dijkstra, K.-D. B., Clausnitzer, V., Boudot, J.-P. and Domisch, S. 2013. Continental-scale conservation prioritization of African dragonflies. Biological Conservation 157 : 245-254.

Snyder, J. P. 1997. Flattening the earth: two thousand years of map projections. The University of Chicago Press, Chicago.

Stankowski, P. A. and Parker, W. H. 2010. Species distribution modelling: Does one size fit all? A phytogeographic analysis of Salix in Ontario. Ecological Modelling 221 (13-14): 1655-1664.

Statistics Canada. 2012. 2011 census dictionary. Statistics Canada Catalogue no. 98-301-X2011001. Ottawa, Ontario. http://www12.statcan.gc.ca/census-recensement/2011/ref/dict/98-301-X2011001-eng.pdf (25-Apr-2013).

Stauffer, H. B., Ralph, C. J. and Miller, S. L. 2002. Incorporating detection uncertainty into presence-absence surveys for marbled murrelet. In: Scott, J. M., Heglund, P. J., Morrison, M. L., Haufler, J. B., Raphael, M. G., Wall, W. A., Samson, F. B. (eds.) Predicting species occurrences: issues of scale and accuracy. Island Press, Covelo, pp 357-365.

Stockwell, D. and Peters, D. 1999. The GARP modelling system: problems and solutions to automated spatial prediction. International Journal of Geographical Information Science 13 (2): 143-158.

Stockwell, D. R. and Peterson, A. 2002. Effects of sample size on accuracy of species distribution models. Ecological Modelling 148 (1): 1-13.

Suhling, F. 2010. Zosteraeschna minuscula. IUCN Red List of Threatened Species. Version 2012.2. http://www.iucnredlist.org/details/full/63193/0 (1-Jan-2013)

Suhling, F., Sahlén, G., Martens, A., Marais, E. and Schütte, C. 2006. Dragonfly assemblages in arid tropical environments: a case study from Western Namibia. Biodiversity and Conservation 15 (1): 311-332.

Terribile, L. C., Diniz-Filho, J. A. F. and De Marco Jr., P. 2010. How many studies are necessary to compare niche-based models for geographic distributions? Inductive reasoning may fail at the end. Brazilian Journal of Biology 70 (2): 263-269.

Thomas, C., Bevan, N. (eds.). 1996. Usability context analysis: a practical guide. National Physical Laboratory, Teddington.

Thuiller, W. 2003. BIOMOD - optimizing predictions of species distributions and projecting potential future shifts under global change. Global Change Biology 9(10): 1353-1362.

Thuiller, W., Albert, C., Araújo, M. B., Berry, P. M., Cabeza, M., Guisan, A., Hickler, T., Midgley, G. F., Paterson, J., Schurr, F. M., Sykes, M. T. and Zimmermann, N. E. 2008. Predicting global change impacts on plant species' distributions: Future challenges. Perspectives in Plant Ecology, Evolution and Systematics 9(3-4): 137-152.

Thuiller, W., Broennimann, O., Hughes, G., Alkemade, J. R. M., Midgley, G. F. and Corsi, F. 2006. Vulnerability of African mammals to anthropogenic climate change under conservative land transformation assumptions. Global Change Biology 12 (3): 424-440.

Thuiller, W., Brotons, L., Araújo, M. B. and Lavorel, S. 2004. Effects of restricting environmental range of data to project current and future species distributions. Ecography 27 (2): 165-172.

Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58 (1): 267-288.

Tucker, C. J., Pinzon, J. E., Brown, M. E., Slayback, D. A., Pak, E. W., Mahoney, R., Vermote, E. F. and Saleous, N. E. 2005. An extended AVHRR 8-km NDVI dataset compatible with MODIS and SPOT vegetation NDVI data. International Journal of Remote Sensing 26(20): 4485-4498.

Tullis, T. and Albert, B. 2008. Measuring the user experience: Collecting, analyzing, and presenting usability metrics. Morgan Kaufmann, Heidelberg.

Turok, I. and Parnell, S. 2009. Reshaping cities, rebuilding nations: the role of national urban policies. Urban Forum 20 (2): 157-174.

UNEP. 2008. Africa: Atlas of our changing environment. UNEP-DEWA, Nairobi.

UNEP. 2011. Vital forest graphics. UNEP Job No: DWE/1032/NA, UNEP, FAO, UNFF, Nairobi.

UN-Habitat. 2011. Cities and climate change: global report on human settlements. Earthscan, London.

US-HHS. 2006. Research-based web design and usability guidelines. 2nd ed. U.S. Dept. of Health and Human Services: U.S. General Services Administration, Washington.

Vanderbei, R. J. 2008. Linear programming: foundations and extensions. 3rd ed. Springer, New York.

VanDerWal, J., Shoo, L. P., Graham, C. and Williams, S. E. 2009. Selecting pseudo-absence data for presence-only distribution modeling: How far should you stray from what you know? Ecological Modelling 220 (4): 589-594.

Walker, P. and Cocks, K. 1991. HABITAT: a procedure for modelling a disjoint environmental envelope for a plant or animal species. Global Ecology and Biogeography Letters 1(4): 108-118.

Ward, G. 2007. Statistics in ecological modeling presence-only data and boosted MARS. PhD Thesis, Stanford University, USA.

Ward, G., Hastie, T., Barry, S., Elith, J. and Leathwick, J. R. 2009. Presence-only data and the EM algorithm. Biometrics 65 (2): 554-563.

Watling, J. I., Romañach, S. S., Bucklin, D. N., Speroterra, C., Brandt, L. A., Pearlstine, L. G. and Mazzotti, F. J. 2012. Do bioclimate variables improve performance of climate envelope models? Ecological Modelling 246 : 79-85.

Weisfeld, M. 2009. The object-oriented thought process. 3rd ed. Addison-Wesley, Upper Saddle River, NJ.

Willems, E. P. and Hill, R. A. 2009. A critical assessment of two species distribution models: a case study of the vervet monkey (Cercopithecus aethiops). Journal of Biogeography 36 (12): 2300-2312.

Wisz, M. S. and Guisan, A. 2009. Do pseudo-absence selection strategies influence species distribution models and their predictions? An information-theoretic approach based on simulated data. BMC Ecology 9: 8.

Wisz, M. S., Hijmans, R. J., Li, J., Peterson, A. T., Graham, C. H. and Guisan, A. 2008. Effects of sample size on the performance of species distribution models. Diversity and Distributions 14 (5): 763-773.

Wisz, M. S., Pottier, J., Kissling, W. D., Pellissier, L., Lenoir, J., Damgaard, C. F., Dormann, C. F., Forchhammer, M. C., Grytnes, J.-A., Guisan, A., Heikkinen, R. K., Høye, T. T., Kühn, I., Luoto, M., Maiorano, L., Nilsson, M.-C., Normand, S., Öckinger, E., Schmidt, N. M., Termansen, M. and Timm, A. 2013. The role of biotic interactions in shaping distributions and realised assemblages of species: implications for species distribution modelling. Biological Reviews 88 (1): 15-30.

Wonham, M. 2006. Species invasion. In: Groom, M. J., Meffe, G. K., Carroll, C. R. (eds.) Principles of conservation biology. Sinauer Associates, Sunderland, pp 293-331.

Wright, P., McCarthy, J. and Meekison, L. 2003. Making sense of experience. In: Blythe, M., Monk, A., Overbeeke, C., Wright, P. (eds.) Funology: From usability to user enjoyment. Kluwer, Dordrecht, pp 43-53.

Xu, T. and Hutchinson, M. F. 2013. New developments and applications in the ANUCLIM spatial climatic and bioclimatic modelling package. Environmental Modelling & Software 40 : 267-279.

Yesson, C. and Culham, A. 2006. Phyloclimatic modeling: combining phylogenetics and bioclimatic modeling. Systematic Biology 55 (5): 785-802.

Zaniewski, A., Lehmann, A. and Overton, J. 2002. Predicting species spatial distributions using presence-only data: a case study of native New Zealand ferns. Ecological Modelling 157 (2-3): 261-280.

Zarnetske, P. L., Edwards Jr., T. C. and Moisen, G. G. 2007. Habitat classification modeling with incomplete data: pushing the habitat envelope. Ecological Applications 17 (6): 1714-1726.

Zou, H. and Hastie, T. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2): 301-320.

Appendices

Appendix A1: Predicted distribution of P. kersteni with different values of elastic-net factor

Appendix A2: Predicted distribution of P. kersteni with different assumed values of initial population prevalence

Appendix A3: Predicted distribution of P. kersteni with different assumed values of initial population prevalence after applying SBT

Appendix A4: Predicted distribution of P. kersteni using different numbers of background samples

Appendix A5: Predicted distribution of P. kersteni using different random number generators for background samples

Appendix B: List of bioclimatic variables related to temperature and precipitation (source: http://www.worldclim.org/bioclim accessed 21-01-2014)

| Bioclimatic variable | Description |
|---|---|
| bioc_01 | annual mean temperature |
| bioc_02 | mean diurnal temperature range (mean of monthly (max temp - min temp)) |
| bioc_03 | isothermality (bioc_02 / bioc_07) |
| bioc_04 | temperature seasonality (standard deviation) |
| bioc_05 | maximum temperature of warmest month |
| bioc_06 | minimum temperature of coldest month |
| bioc_07 | temperature annual range (bioc_05 - bioc_06) |
| bioc_08 | mean temperature of wettest quarter |
| bioc_09 | mean temperature of driest quarter |
| bioc_10 | mean temperature of warmest quarter |
| bioc_11 | mean temperature of coldest quarter |
| bioc_12 | annual precipitation |
| bioc_13 | precipitation of wettest month |
| bioc_14 | precipitation of driest month |
| bioc_15 | precipitation seasonality |
| bioc_16 | precipitation of wettest quarter |
| bioc_17 | precipitation of driest quarter |
| bioc_18 | precipitation of warmest quarter |
| bioc_19 | precipitation of coldest quarter |
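To illustrate how some of the listed variables relate to monthly climate values, a minimal Python sketch is given here. It is not part of the modelling tool: the function name `derive_bioclim_subset` and the toy input lists are hypothetical, and only the variables whose definition is fully stated in the list above (bioc_02, bioc_03, bioc_05, bioc_06, bioc_07, bioc_12, bioc_13, bioc_14) are computed, exactly as written there and without any scaling factor.

```python
from statistics import mean


def derive_bioclim_subset(tmax, tmin, prec):
    """Derive a few bioclimatic variables from 12 monthly values.

    tmax, tmin: monthly maximum / minimum temperature
    prec:       monthly precipitation
    (Hypothetical helper, for illustration only.)
    """
    if not (len(tmax) == len(tmin) == len(prec) == 12):
        raise ValueError("expected 12 monthly values per variable")

    bioc_02 = mean(hi - lo for hi, lo in zip(tmax, tmin))  # mean diurnal range
    bioc_05 = max(tmax)                                    # max temp of warmest month
    bioc_06 = min(tmin)                                    # min temp of coldest month
    bioc_07 = bioc_05 - bioc_06                            # temperature annual range
    if bioc_07 == 0:
        raise ValueError("annual range is zero; isothermality is undefined")
    bioc_03 = bioc_02 / bioc_07                            # isothermality

    return {
        "bioc_02": bioc_02,
        "bioc_03": bioc_03,
        "bioc_05": bioc_05,
        "bioc_06": bioc_06,
        "bioc_07": bioc_07,
        "bioc_12": sum(prec),   # annual precipitation
        "bioc_13": max(prec),   # precipitation of wettest month
        "bioc_14": min(prec),   # precipitation of driest month
    }


if __name__ == "__main__":
    # Toy values, not real climate data.
    tmax = [20, 22, 24, 26, 28, 30, 30, 28, 26, 24, 22, 20]
    tmin = [10, 12, 14, 16, 18, 20, 20, 18, 16, 14, 12, 10]
    prec = [100, 80, 60, 40, 20, 0, 0, 20, 40, 60, 80, 100]
    print(derive_bioclim_subset(tmax, tmin, prec))
```

The quarterly variables (bioc_08 to bioc_11 and bioc_16 to bioc_19) are left out because they additionally require a definition of the quarter window, which the list does not state.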

Appendix C1: Rules and steps for creating land-cover scenario of 1940

The land-cover scenario was built from the datasets listed in Table C1-1, following the steps listed in Table C1-2. Each row (rule-set) of Table C1-2 corresponds to one step: it names the dataset(s) used and states the rule or operation performed at that step. Sets 19 and 24 are intermediary datasets, and set 26 is the final land-cover scenario of 1940.

Table C1-1: Geodatasets used for creating hind-casted land-cover scenario of 1940

| Geodataset | Description | Reference |
|---|---|---|
| GLC 2000* | Global Land Cover dataset 2000 | Mayaux et al. (2004) |
| CREP1950* | HYDE land cover dataset for year 1950, scenario b | Klein Goldewijk (2001) |
| NDVI | mean value of 12 months, 36 datasets from 07/1981 to 06/1982 | Tucker et al. (2005) |
| crop | HYDE dataset of cropland for 1940; value represents percentage of cropland | Klein Goldewijk et al. (2007) |
| gras | HYDE dataset of grassland for 1940; value represents percentage of pastures | Klein Goldewijk et al. (2007) |
| CG | dataset created by summing up 'crop' and 'gras' for 1940 | |
| popd | HYDE population density for 1940 | Klein Goldewijk et al. (2007) |
| prec | CRU-TS annual average precipitation from 01/1931 to 12/1940 | Mitchell and Jones (2005) |
| elev | digital elevation model from SRTM | Jarvis et al. (2008) |

* refer to the corresponding reference (and dataset therein) for the description of coded values

Table C1-2: Geodatasets and operations (rules) for creating hind-casted land-cover scenario of 1940

| Set | Source dataset | Operation | Description |
|---|---|---|---|
| 1 | GLC 2000 | extract pixels where GLC 2000 = 1 to 6 | extract tree cover classes |
| 2 | GLC 2000 | extract pixels where GLC 2000 = 7, 8, 15, 20 | extract water bodies classes |
| 3 | CREP1950 | extract pixels where CREP1950 = 12 (grassland/steppe), set value = 14 | assign grassland/steppe to class 14 of GLC |
| 4 | CREP1950 | extract pixels where CREP1950 = 16 (hot desert), set value = 19 | assign hot desert to class 19 of GLC |
| 5 | NDVI | extract pixels where 25 < NDVI ≤ 50, set value = 14 | assign the extracted values to class 14 of GLC |
| 6 | NDVI | extract pixels where 50 < NDVI ≤ 75, set value = 13 | assign the extracted values to class 13 of GLC |
| 7 | crop, GLC 2000 | extract pixels from GLC 2000 where GLC 2000 = 9 to 12 and crop = 0 | extract tree classes where crop percentage is 0 |
| 8 | CREP1950 | extract pixels where CREP1950 = 1 (intensive cropland), set value = 16 | assign intensive cropland to class 16 of GLC |
| 9 | NDVI | extract pixels where NDVI > 163, set value = 1 | assign extracted values to class 1 of GLC |
| 10 | NDVI, GLC 2000 | extract pixels where 137 < NDVI ≤ 163 and GLC 2000 ≠ 1, set value = 2 | assign extracted values to class 2 of GLC |
| 11 | NDVI, GLC 2000 | extract pixels where 112 < NDVI ≤ 137 and GLC 2000 ≠ 1, 2, set value = 3 | assign extracted values to class 3 of GLC |
| 12 | crop, CG, popd | extract pixels where 20 crop ≤ 30, 40 CG ≤ 50 and popd > 50, set value = 17 | assign extracted values to class 17 of GLC |
| 13 | gras, crop | extract pixels where gras > 70 and crop = 0, set value = 14 | assign extracted values to class 14 of GLC |
| 14 | gras, crop | extract pixels where gras > 40 and crop < 10, set value = 13 | assign extracted values to class 13 of GLC |
| 15 | crop | extract pixels where crop > 70, set value = 16 | assign extracted values to class 16 of GLC |
| 16 | crop, popd | extract pixels where crop > 40 and popd > 50, set value = 16 | assign extracted values to class 16 of GLC |
| 17 | CG, crop | extract pixels where CG > 15 and crop > 5, set value = 17 | assign extracted values to class 17 of GLC |
| 18 | popd, GLC 2000 | extract pixels where popd > 50 and GLC 2000 = 22 | retain GLC class 22 |
| 19 | Sets 1 to 17 | merge/mosaic datasets from 1 to 17; order following the sets (top first): 1 .. 17 | intermediary dataset 1 = MOS_1 |
| 20 | MOS_1, GLC 2000 | extract pixels where MOS_1 > 9 and GLC 2000 = 9 | retain GLC class 9 |
| 21 | GLC 2000, prec | shrink pixels by 10 cells where GLC 2000 ≠ 1 to 9, 15, 20 and prec > 15000 | shrink land cover classes which are not tree and water bodies and the annual precipitation is more than 15 m (a) |
| 22 | GLC 2000 | shrink pixels by 10 cells where GLC 2000 = 13, 14, 16, 17, 18 and elev < 500 | shrink land cover classes which are shrubs and cultivated areas and elevation is less than 500 m (b) |
| 23 | MOS_1, GLC 2000 | extract pixels where MOS_1 = 19 and GLC 2000 ≠ 19 | retain pixels where the value is 19 (bare areas) in new dataset and not 19 in GLC 2000 (c) |
| 24 | Sets 18 to 22 | merge/mosaic datasets from 18 to 22 in the following order (top first): Sets 18, 20, 21, 22, 23, 19 | intermediary dataset 2 = MOS_2 |
| 25 | MOS_2 | shrink water bodies by 20 cells | shrink water bodies (d) |
| 26 | MOS_2, Set 2 | merge/mosaic in the order (top first): Set 2, MOS_2 | scenario dataset |

(a) assumption: too much water that only trees will survive; this assumption will expand areas for tree-cover and water bodies classes (shrink sequence: 12, 13, 14, 16, 17, 18, 19, 22) and adjustment is made in rule-set 25
(b) assumption: low population at lower elevation and hence no cultivated and managed area in the past (shrink sequence: 13, 14, 16, 17, 18)
(c) assumption: if the pixels are not desert in 2000, they were not desert in 1940
(d) set 21 contains increased number of pixels for water bodies and adjustment is required (shrink sequence: 7, 8, 15, 20)
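The rule-sets of Table C1-2 combine two kinds of operation: extracting the pixels that satisfy a condition (optionally assigning them a GLC class) and mosaicking the resulting layers "top first". A minimal Python sketch of these two operations follows, for illustration only. It uses small nested lists in place of raster datasets; the function names `extract` and `mosaic`, the `NODATA` marker and the toy grids are hypothetical and are not taken from the modelling tool or from any GIS package.

```python
NODATA = None  # marker for cells that a rule did not select (assumption)


def extract(grid, condition, set_value=None):
    """Keep cells where condition(cell) holds; all other cells become NODATA.

    If set_value is given, selected cells receive that class code,
    otherwise they keep their original value.
    """
    return [
        [
            (cell if set_value is None else set_value)
            if cell is not NODATA and condition(cell)
            else NODATA
            for cell in row
        ]
        for row in grid
    ]


def mosaic(layers):
    """Overlay layers 'top first': per cell, the first layer with data wins."""
    rows, cols = len(layers[0]), len(layers[0][0])
    out = [[NODATA] * cols for _ in range(rows)]
    for layer in layers:
        for r in range(rows):
            for c in range(cols):
                if out[r][c] is NODATA and layer[r][c] is not NODATA:
                    out[r][c] = layer[r][c]
    return out


if __name__ == "__main__":
    # Toy 3 x 3 grids, not real data.
    glc = [[1, 14, 19], [3, 13, 16], [20, 6, 14]]
    ndvi = [[180, 30, 10], [150, 60, 40], [5, 170, 200]]

    set_1 = extract(glc, lambda v: 1 <= v <= 6)                 # tree cover classes
    set_5 = extract(ndvi, lambda v: 25 < v <= 50, set_value=14)
    set_6 = extract(ndvi, lambda v: 50 < v <= 75, set_value=13)
    set_9 = extract(ndvi, lambda v: v > 163, set_value=1)

    # Earlier sets take priority over later ones, as in rule-set 19.
    for row in mosaic([set_1, set_5, set_6, set_9]):
        print(row)
```

Because the mosaic is evaluated top first, a pixel selected by several rule-sets keeps the value of the earliest set in the list; the order of the sets therefore acts as a priority ranking.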

Appendix C2: Rules and steps for creating land-cover scenario of 2050

The land-cover scenario was built from the datasets listed in Table C2-1, following the steps listed in Table C2-2. Each row (rule-set) of Table C2-2 corresponds to one step: it names the dataset(s) used and states the rule or operation performed at that step. Set 14 is an intermediary dataset, and set 17 is the final land-cover scenario of 2050.

Table C2-1: Geodatasets used for creating forecasted land-cover scenario of 2050

| Geodataset | Description | Reference |
|---|---|---|
| GLC 2000* | Global land cover dataset 2000 | Mayaux et al. (2004) |
| LUC* | Land-use dataset of FAO (geonetwork) for the year 2010 | FAO (2010) |
| crop | HYDE dataset of cropland for 2005; value represents percentage of cropland | Klein Goldewijk et al. (2007) |
| gras | HYDE dataset of grassland for 2005; value represents percentage of pastures | Klein Goldewijk et al. (2007) |
| CG | dataset created by summing up 'crop' and 'gras' for 2005 | |
| popd | HYDE population density for 2050 | Klein Goldewijk et al. (2007) |
| prec | monthly precipitation for the year 2050 | Ramirez and Jarvis (2008) |
| elev | digital elevation model from SRTM | Jarvis et al. (2008) |

* refer to the corresponding reference (and dataset therein) for the description of coded values

Table C2-2: Geodatasets and operations (rules) for creating forecasted land-cover scenario of 2050

| Set | Source dataset | Operation | Description |
|---|---|---|---|
| 1 | GLC 2000 | extract pixels where GLC 2000 = 19 | extract bare areas |
| 2 | GLC 2000 | extract pixels where GLC 2000 = 7, 8, 15, 20 | extract water bodies classes |
| 3 | popd | extract pixels where popd > 1000, set value = 22 | assign densely populated areas to class 22 of GLC |
| 4 | LUC | extract pixels where LUC = 25, set value = 22 | assign 'urban land' to class 22 of GLC |
| 5 | LUC, GLC 2000 | extract pixels where LUC = 19 to 24 and GLC 2000 ≠ 7, 8, 15 and, set value = 16 | assign 'crops' and 'agriculture' related classes to class 16 of GLC |
| 6 | LUC, GLC 2000 | extract pixels where LUC = 4 and GLC 2000 < 7 or GLC 2000 = 13, 14, set value = 14 | assign 'forest with livestock' to class 14 of GLC |
| 7 | LUC | extract pixels where LUC = 36, 37, set value = 13 | assign 'bare areas with livestock' to class 13 of GLC |
| 8 | GLC 2000 | shrink pixels by 1 cell (8 km resolution) where GLC 2000 = 1 to 6, 9 to 12 | shrink the tree-cover area at the edge |
| 9 | crop, gras | extract pixels where 25 < gras ≤ 50 and crop ≤ 10, set value = 14 | assign extracted pixels to class 14 of GLC |
| 10 | crop, gras | extract pixels where 50 < crop ≤ 60 and gras < 10, set value = 16 | assign extracted pixels to class 16 of GLC |
| 11 | crop | extract pixels where crop ≤ 60, set value = 16 | assign extracted pixels to class 16 of GLC |
| 12 | CG, crop | extract pixels where CG < 70 and crop < 20, set value = 16 | assign extracted pixels to class 16 of GLC |
| 13 | GLC 2000 | as it is | baseline data |
| 14 | Sets 1 to 13 | merge/mosaic datasets from sets 1 to 13; order following the sets (top first): 1 .. 13 | intermediary dataset = MOS_1 |
| 15 | MOS_1, GLC 2000 | extract pixels from GLC 2000 where MOS_1 = 7, 8, 15, 20 | extract land-cover classes from GLC 2000 where the intermediary dataset has water bodies = WB_set |
| 16 | popd, prec, MOS_1 | extract pixels where 25 < popd ≤ 100 and 300 < prec ≤ 500 and MOS_1 = 13, 14, set value = 16 | assign extracted pixels to class 16 of GLC |
| 17 | Sets 14 to 16 | merge/mosaic datasets from sets 14 to 16; order following the sets (top first): 16, 15, 14 | scenario dataset |
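The "shrink" operation used in rule-set 8 (and in rule-sets 21, 22 and 25 of Table C1-2) can be pictured with the following Python sketch. It is a simplified stand-in written for illustration, not the GIS routine used for the thesis: in each pass, every cell of a selected class that touches a cell of another class (8-neighbourhood) takes the most frequent value among those other neighbours. The function name `shrink`, the tie-breaking behaviour and the toy grid are assumptions.

```python
from collections import Counter


def shrink(grid, classes, n_cells=1):
    """Shrink the zones whose value is in `classes` by `n_cells` cells.

    Simplified sketch: boundary cells of the selected zones are replaced by
    the most frequent non-selected value in their 8-neighbourhood. Ties are
    resolved by whichever value Counter.most_common returns first.
    """
    rows, cols = len(grid), len(grid[0])
    current = [row[:] for row in grid]
    for _ in range(n_cells):
        nxt = [row[:] for row in current]
        for r in range(rows):
            for c in range(cols):
                if current[r][c] not in classes:
                    continue
                # Values of neighbouring cells that are not being shrunk.
                others = [
                    current[rr][cc]
                    for rr in range(max(r - 1, 0), min(r + 2, rows))
                    for cc in range(max(c - 1, 0), min(c + 2, cols))
                    if (rr, cc) != (r, c) and current[rr][cc] not in classes
                ]
                if others:  # cell lies on the edge of its zone
                    nxt[r][c] = Counter(others).most_common(1)[0][0]
        current = nxt
    return current


if __name__ == "__main__":
    # Toy grid: a 3 x 3 block of tree cover (class 1) inside class 14.
    grid = [[14] * 5 for _ in range(5)]
    for r in range(1, 4):
        for c in range(1, 4):
            grid[r][c] = 1

    tree_classes = set(range(1, 7)) | set(range(9, 13))  # GLC 1 to 6, 9 to 12
    for row in shrink(grid, tree_classes, n_cells=1):
        print(row)
```

Each pass works on a copy of the grid, so a cell replaced in one pass cannot influence its neighbours until the next pass; this keeps the shrink distance equal to `n_cells`.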
