A new tool for predicting distribution patterns of African dragonflies in space and time: sensitivity analyses of model parameters and environmental variables
A thesis submitted in partial fulfilment of the requirements of the degree of Doctor of Natural Sciences (Dr. rer. nat.) of the Faculty of Environment and Natural Resources, Albert-Ludwigs-Universität Freiburg im Breisgau, Germany
Submitted by Nirmal Ojha from Nepal
Freiburg im Breisgau 2014
Dean: Prof. Dr. Barbara Koch
Supervisor: Prof. Dr. Axel Drescher
2nd Reviewer: Prof. Dr. Carsten F. Dormann
Date of thesis defence: 19 Nov 2013
Acknowledgements
First, I would like to thank the Faculty of Environment and Natural Resources for accepting my application to pursue a doctoral degree. Many thanks go to Prof Dr Axel Drescher for supervising my research work, and my warm thanks to Prof Dr Carsten F Dormann for agreeing to review it. Special thanks are due to Prof Dr Gertrud Schaab from the Hochschule Karlsruhe, who supervised my work and offered her guidance throughout the entire period. I am grateful to the office of the Chancellor of the Hochschule Karlsruhe and the helpful team of the Institute of Applied Research (IAF) for providing me with a work space and allowing me to use the facilities. Acknowledgement is also due to Christian Stern, who organised the necessary ArcGIS developer’s licence each year in a timely manner. I am thankful to Dr Viola Clausnitzer, Dr Frank Suhling, Dr K. D-B Dijkstra and Jens Kipping for providing me with the records of dragonfly locations. Their discussions and feedback on the initial predicted distributions of some species helped to improve the modelling tool and thereby the prediction results. My appreciation also goes to the former members of the working group G(V)ISЯ at the IAF during the BIOTA Africa project for their friendship, moral support and cordial working environment, which made part of the journey enjoyable. Thanks also to Dorothea Heim, who helped to translate the extended summary into German. I am indebted to Prof Dr Prajwal Lal Pradhan from the Institute of Engineering, Pulchowk Campus, Nepal. He has been a motivational figure, and his advice and moral support have led me to where I am today. Finally, I would like to express my sincere gratitude to my parents for their support.
Abstract
In the last few decades, Africa has been a dynamic continent with respect to changes in landscape, population and climate. Species distribution modelling (SDM) can help to identify the effects of changing environmental conditions on biodiversity, and it has been used in a wide array of ecological applications such as determining hotspots, planning reserves, designing surveys for biodiversity inventories, and assessing the impacts of environmental change on biodiversity. Odonata, which require both terrestrial and aquatic ecosystems to complete their life cycle, are a suitable flagship group for many ecological studies. Here, a new SDM tool based on logistic regression, the ‘SpeeDi Tool’, is presented. It focuses on modelling the distribution of African Odonata species using the Odonata Database of Africa. The use of a geographic information system (GIS) in pre- and post-processing is an integral part of the SDM workflow, and GIS and statistical modelling are therefore integrated in the SpeeDi Tool. The user-centred approach to developing the SpeeDi Tool aims at usability, so that the goal (i.e. predicting the distribution range) can be achieved with ease. Pseudagrion kersteni, a widespread dragonfly species in sub-Saharan Africa, is taken as the species of interest to demonstrate the use and ability of the SpeeDi Tool. An expert-drawn, watershed-based range map from the IUCN is used for visual comparison with the modelled spatial distribution and thus for evaluating the predicted range. The SpeeDi Tool has several modelling parameters, some of which are new to the SDM field: the elastic-net factor, which has not been applied to SDM using background samples until now; the soft buffer threshold (SBT), a new concept introduced here; and weights for samples.
In addition to the use of background samples, the tool introduces modelling with presence samples combined with absence and/or background samples; the combination of presence, absence and background samples is a new option not yet found in existing SDM tools. To gain confidence in using the SpeeDi Tool, several sensitivity analyses are performed with P. kersteni samples, covering different modelling approaches, different modelling parameters and different environmental geodatasets. These sensitivity analyses are intended to determine the optimum values of the regression parameters that maximise the model’s performance, and to identify the important environmental variables and their effects on the predicted distribution ranges. A concept similar to that of a virtual species is used to evaluate the general applicability of the SpeeDi Tool. The sensitivity analyses of the modelling parameters showed that a) elastic-net regularisation is superior to L1 or L2 regularisation, b) the uncertainty in population prevalence in background samples can be reduced by applying the SBT, c) weights can be effective in reducing the effects of sampling bias, d) model fitting is sensitive to the number of background samples, and e) product interactions of variables are necessary for better prediction of the distribution range. The sensitivity analyses of the environmental datasets showed that a) monthly climate datasets should be preferred over synthesised bioclimatic datasets, and b) predicted distributions based on land-cover datasets with different classification schemes do not differ much, but the contributions of the land-cover classes in the different datasets indicate that the ecological significance of these classes can be misinterpreted. Further, the results for the modelling of Aeshna minuscula showed little difference in distribution range between spatial resolutions of 1 km and 8 km.
The results also indicated that the modelling extent should not extend too far beyond the species’ native region.
Table of Contents
Acknowledgements
Abstract
List of Figures
List of Tables
List of Boxes
List of acronyms
1. Introduction
  Background
  The thesis’ aims and limits
  Outline of the thesis
2. Background on species distribution modelling
  Three types of species distribution models
  Uses of species distribution modelling
  Empirical methods for species distribution modelling
  Use of presence, absence and background data in species distribution modelling
  General characteristics and assumptions of statistical species distribution models
  Commonly used data in statistical species distribution models
  Incorporating statistics and GIS for species distribution modelling
  Summary and main considerations
3. Quality-in-use for development of species distribution modelling tool
  Context of use
  Quality measures
    3.2.1. Functionality
    3.2.2. Reliability
    3.2.3. Usability
    3.2.4. Efficiency
    3.2.5. Maintainability
    3.2.6. Portability
  User-centred design
  User experience
    3.4.1. Cognition
    3.4.2. Metaphors
    3.4.3. Emotions
4. Developing a robust and easy to use species distribution modelling tool
  Work flow concept for geodata processing and statistical modelling for SDM in SpeeDi Tool
    4.1.1. Geodata preparation
    4.1.2. Statistical modelling
    4.1.3. Post-processing
  User-centred design and user profile for the SpeeDi Tool
  Architecture for the modelling tool
  GUI design
  Logistic regression with presence, absence and background data
    4.5.1. Formulating binary logistic regression model
    4.5.2. Control mechanism to counter over-fitting in regression model
  Functions offered by the SpeeDi Tool
    4.6.1. Pre-processing in the SpeeDi Tool
    4.6.2. Statistical modelling using logistic regression in the SpeeDi Tool
    4.6.3. Post-processing functions in the SpeeDi Tool
5. Predicted spatial distribution of Pseudagrion kersteni in Africa: an example of applying the logistic regression based modelling tool
  Odonata and database of African Odonata
  Pre-processing of location data and environmental geodata: setting the modelling scenario
    5.2.1. Geodata pre-processing with the SpeeDi Tool for predicting the spatial distribution of P. kersteni
    5.2.2. Logistic regression modelling for predicting the presences of P. kersteni
  Result of the modelling
    5.3.1. Intermediary output
    5.3.2. Post processing the intermediary output
  Visual assessment of output of modelling the distribution of P. kersteni
6. Sensitivity analyses of the LR-based SpeeDi Tool: a comprehensive analyses of model tuning parameters
  Sensitivity analysis of model tuning parameters for P. kersteni
    6.1.1. Elastic-net factor
    6.1.2. Initial population prevalence
    6.1.3. Soft buffer threshold for background
  Sample data and model definition/formulation
    6.2.1. Size of background samples
    6.2.2. Algorithm of random number generator for creating background samples
    6.2.3. Polynomial degree and interaction term for continuous variables
    6.2.4. Effect of sample density of presences
  Different modelling approach for predicting the distribution of P. kersteni
    6.3.1. Modelling with presences and absences derived from known distribution range
    6.3.2. Modelling with presences from watershed based range map and random background samples for all of Africa
    6.3.3. Modelling with actual field samples and absences sampled from the watershed based range map
    6.3.4. Feeding the presence-background model with auxiliary absence data
7. Effect of spatial environmental data on modelling species distribution: sensitivity analyses regarding climate and land-cover datasets, spatial extent and resolution
  Bioclimatic data and its influence on modelling the prediction of Pseudagrion kersteni
    7.1.1. Using bioclimatic variables related to precipitation and temperature
    7.1.2. Using six selected bioclimatic variables related to precipitation and temperature based on ecological relevance for P. kersteni
  Supplementing ‘selected six bioclimatic variables’ with x-y coordinates for predicting the distribution range of P. kersteni
  Using monthly temperature and precipitation data as main climate variables for predicting the distribution of P. kersteni
  Role of land-cover data in modelling P. kersteni – effect of classification schemes
  Predicting the past and the future distribution of P. kersteni with scenarios for land-cover and climate
    7.5.1. Developing the land-cover scenario for the year 1940
    7.5.2. Developing the land-cover scenario for the year 2050
    7.5.3. Comparison of past and future land-cover scenarios with current situation
    7.5.4. Predicting the distribution of P. kersteni with land-cover and climate scenarios for the year 1940 and 2050
  Role of modelling extent and spatial resolution of geodata in predicting the distribution of Aeshna minuscula
8. Discussion and outlook
  Predicting the spatial distribution of Pseudagrion kersteni by means of the SpeeDi Tool and sensitivity of modelling parameters
    8.1.1. Predicted distribution of P. kersteni and effect of samples
    8.1.2. Sensitivity regarding regression control parameters for modelling of P. kersteni
  Role of environmental data in the prediction of species distribution with the SpeeDi Tool
    8.2.1. Climate data and geographic trend surface for predicting the distribution of P. kersteni
    8.2.2. Effects of land-cover and climate datasets on the predicted distribution range of P. kersteni
    8.2.3. Effect of scale and modelling extent on predicting the distribution range of A. minuscula
  Ranking of different parameters for predicting the distribution of P. kersteni using the SpeeDi Tool
  Suitability of the SpeeDi Tool for modelling spatial distribution of African Odonata
  The SpeeDi Tool from a user’s perspective for modelling species distribution
9. Extended summary
Zusammenfassung
References
Appendices
List of Figures
Figure 2-1: Fitting the probability (logistic regression) of true- and pseudo-absence data shown with one predictor variable ‘Elevation’. The probability of absence data (a and b: true; c and d: pseudo) is zero and this value is used throughout the iteration process. Although pseudo-absence data are likely to contain a mixture of true absences (prob. = 0, circles) and non-absences (i.e. prob. > 0, plus signs; their actual values indicated by squares), a probability value of zero is assumed (idea taken from Ward et al., 2009, figure 3)
Figure 2-2: Fitting the probability (logistic regression) of background data (z = 0) in an iterative process shown with one predictor variable ‘Elevation’. z = 1 represents presence samples; z = 0 represents background samples; z = 0, y = 1 represents samples at favourable environment (presences); and z = 0, y = 0 represents samples at unfavourable environments (absences). At each iteration step, the value of y changes for z = 0
Figure 2-3: Loose GUI coupling of GIS and statistical SDM involving data converter and bridged via GUI (adapted from Jankowski (1995), Karimi and Houston (1996), and Brandmeyer and Karimi (2000))
Figure 2-4: Tight coupling of GIS and statistical SDM with the APIs as core component in the centre of the different systems, database and the GUI (adapted from Jankowski (1995))
Figure 2-5: Integrated coupling of GIS and SDM with systems and database interacting as a single unit (adapted from Brandmeyer and Karimi (2000))
Figure 4-1: Conceptual work flow for modelling species distribution in SpeeDi Tool with three steps: pre-processing, modelling and post-processing
Figure 4-2: Different components in the main GUI of the SpeeDi Tool
Figure 4-3: Common layout of dialog-boxes (top left) for pre- and post-processing functions in SpeeDi Tool; an example of a dialog-box for running a local function (top right) and displaying the help associated with the function when the ‘Help’ button is clicked (bottom)
Figure 4-4: Setting default preferences in the tool, accessible via the menubar; left: for logistic regression modelling, most of them related to the output graphs; right: for the modelling task, related mainly to spatial properties
Figure 4-5: Profiles of SBT adjustment for background samples for assumed pop. prev. (pi) = 0.3. The x-axis represents the original value and the adjusted value is shown on the y-axis. The legend ‘y_0.n’ represents the average probability value of all samples.
Figure 5-1: Photo of Pseudagrion kersteni
Figure 5-2: Distribution of Pseudagrion kersteni sample locations (ODA: Kipping et al., 2009) and the distribution range (Clausnitzer et al., 2012)
Figure 5-3: Illustration of the distance dataset of hydrographical features; top-left: how the pixel values are calculated for linear features based on the cell-distance; top-right: distance raster for rivers; bottom-left: distance raster for areal features (lakes, ponds, wetlands); bottom-right: minimum distance value raster combined from linear and areal datasets
Figure 5-4: ROC-curve (left) and ‘sensitivity and specificity’ graphs (right) of the distribution model for P. kersteni
Figure 5-5: Individual logistic response of the 5 most contributing environmental variables in the model predicting the presence of P. kersteni (note: the base reference on the y-axis in figure ‘a’ is 0.5 and not 0)
Figure 5-6: Probability of environmental suitability for occurrence of P. kersteni across Africa (left) and predicted range classified into presence, probable presence and absence (right), modelled using SpeeDi Tool. The black line represents the 5 different regions aggregated from countries’ boundaries.
Figure 5-7: Distributional range of P. kersteni as predicted by the IUCN Red List assessment in 2010 (Clausnitzer et al., 2010)
Figure 5-8: Land cover classes of the continent of Africa (Mayaux et al., 2004) as used for the modelling of P. kersteni; coded values are enclosed in brackets
Figure 6-1: Effect of elastic-net factor on the final number of variables entered into the model with least regularisation (10th), and total number of iterations required for fitting each model path consisting of 10 models
Figure 6-2: Effect of initial population prevalence (π) on the probability value, calculated for number of background samples = 5000, and number of presence samples = 496
Figure 6-3: Sensitivity and specificity curves of different background samples for the modelled probability distribution of P. kersteni; the suffixed numbers represent the number of background samples; the box highlights the closeness of the threshold range for obtaining maximum accuracy for optimal binary classification considering sensitivity and specificity values
Figure 6-4: Predicted distribution of P. kersteni with five different combinations of interaction terms: a) linear and product terms, b) linear and quadratic terms, c) linear, quadratic and cubic terms, d) linear, quadratic and product terms, and e) linear, quadratic, cubic and product terms (same as Figure 5-6 right)
Figure 6-5: Response curves of elevation (upper five) and maximum temperature of August (lower five) for the predicted distribution of P. kersteni with five different combinations of interaction terms: a) linear and product terms, b) linear and quadratic terms, c) linear, quadratic and cubic terms, d) linear, quadratic and product terms, and e) linear, quadratic, cubic and product terms
Figure 6-6: Histogram of elevation (m), precipitation (mm) in March and minimum temperature (°Celsius times 10) in February inferred from the predicted presence range of P. kersteni
Figure 6-7: Models using weights for presence sample density: samples with weight of 1 throughout (top left), manually assigned and adjusted weights per cluster (top right, same as Figure 5-6 right), and weights based on the global average distance to other samples (bottom left)
Figure 6-8: Modelling with presence and absence samples generated from the assumed watershed based range of P. kersteni (Clausnitzer et al., 2012); left: light green areas as presence range and light blue as absence range; randomly generated presence samples used for training (red dots) and evaluation (green dots) of the model; right: predicted binary classified (presence-absence) range of P. kersteni
Figure 6-9: Model output using different approaches: using randomly generated presences from the watershed based range map (Clausnitzer et al., 2012) and background samples from all over Africa (left), and using field collected presence samples and randomly generated absence samples from absence regions of the watershed based range map (right)
Figure 6-10: Example of improving the model predictions: prediction result with field collected presence samples and background samples (left, same as Figure 5-6 right) and improved prediction reducing the false presences at the Okavango Delta (see circle) by using auxiliary information of knowledge based absence locations at the Okavango Delta (right)
Figure 7-1: Modelled distribution range of P. kersteni using different climatic data: a) with the 19 bioclimatic variables (left, chapter 7.1.1), and b) with 6 selected bioclimatic variables (right, chapter 7.1.2)
Figure 7-2: Predicted distribution of P. kersteni based on climate data as 6 selected bioclimatic variables supplemented by x-y geographic coordinates
Figure 7-3: Response curves of five of the predictor variables (elevation and four temperature related bioclim variables) for the predicted distribution of P. kersteni based on climate data of six selected bioclimatic variables and with additional x-y geographic coordinates
Figure 7-4: Predicted distribution range of P. kersteni based on datasets using different land-cover classification schemes: GLC2000 with FAO scheme (upper left, same as Figure 5-6 right) and GLCC with USGS modified level 2 scheme (upper right); and differing classes in the belt from west to east Africa in the two datasets, GLC2000 (lower-left) and GLCC-USGS (lower-right)
Figure 7-5: Land-cover scenario hindcast for the year 1940 for modelling the past distribution of Odonata species
Figure 7-6: Land-cover scenario projected for the year 2050 for modelling the future distribution of Odonata species
Figure 7-7: Area and proportions covered by different land-cover classes and their scenarios for Africa in three time steps: 1940, 2000 and 2050 (colours and land-cover classes match those of the maps in Figures 5-8, 7-5 and 7-6)
Figure 7-8: Modelled distribution range of P. kersteni for the year 1940 (upper left), 2000 (upper right), 2050 (lower left), and change in distribution range (lower right). The model is trained with the environmental data from 2000 (see Table 5-1, except minimum NDVI) and projected for the scenarios of land-cover and climate in the years 1940 and 2050
Figure 7-9: Photo of Aeshna minuscula
Figure 7-10: A. minuscula sample locations (ODA: Kipping et al., 2009) and expected presence range based on watersheds (Clausnitzer et al., 2012)
Figure 7-11: Predicted presence range for A. minuscula based on backgrounds sampled over the entire continent and different spatial resolutions of environmental geodatasets, 1 km (top, left) and 8 km (top, right), and the difference in predicted range in southern Africa due to the change in resolution (bottom)
Figure 8-1: P. kersteni samples (ODA: Kipping et al., 2009) measured as kernel-density within 80 km radius; also indicated is the expected habitat range corresponding to the watershed based distribution map (Clausnitzer et al., 2012)
Figure 8-2: Distribution of sample location records in ODA, predicted range (see Figure 5-6, right) and expected range based on watershed range of P. kersteni over different land-cover classes; the colours and numbers of the land-cover classes correspond to those of Figure 5-8.
List of Tables
Table 2-1: Three different categories of species distribution models and their basic characteristics, primarily based on Guisan and Zimmermann (2000)
Table 2-2: Some of the approaches used in studies for generating pseudo-absences for presence-absence models
Table 3-1: Some examples of SDM users, contexts of modelling and the typical consequences of over- and under-prediction
Table 3-2: Basic functionality matrix overview of selected SDM tools
Table 3-3: Methods to improve reliability regarding noise, over-fitting and correlated variables in selected SDM tools
Table 3-4: User-centred features of some selected SDM tools
Table 4-1: Assumed user profile for using the SpeeDi Tool
Table 4-2: Functions offered by the SpeeDi Tool for species distribution modelling
Table 5-1: List of climatic and environmental geodata used for modelling of P. kersteni and their sources
Table 5-2: Total and unique sample locations of P. kersteni in the ODA
Table 5-3: Five most contributing variables of the model for predicting the distribution of P. kersteni
Table 6-1: Effect of elastic-net factor on model (1) performance measured with AUC for binary classification of distribution of P. kersteni
Table 6-2: Comparative values of model performance regarding the initial population prevalence for modelling the distribution of P. kersteni
Table 6-3: Initial population prevalence of background samples and the correlation matrix of the probability values for the distribution of P. kersteni
Table 6-4: Comparative values of model performance regarding the initial population prevalence and applying ‘soft-buffer-threshold’ for modelling the distribution of P. kersteni
Table 6-5: Initial population prevalence of background samples and the correlation matrix of the probability values calculated applying ‘soft-buffer-threshold’ for the distribution of P. kersteni
Table 6-6: Model performance for different background sample sizes for modelling the distribution of P. kersteni
Table 6-7: Model performance for predicting the distribution of P. kersteni based on AUC value and accuracy when using different pseudo-random number generator algorithms for background data generation
Table 6-8: Correlation of probability values among different model outputs when using different algorithms for generating background samples
Table 6-9: Performance of models of different complexities measured by AUC value and accuracy for predicting the distribution of P. kersteni (see chapter 6.2.3 for explanations on model abbreviations)
Table 6-10: Ranking of environmental variables for 5 different model complexities based on the explained deviance in contributing to the calculated probability for modelling distribution of P. kersteni (see chapter 6.2.3 for explanations on model abbreviations and Table 5-1 for naming of variables)
Table 6-11: Accuracy assessment of the model performance in predicting a pre-defined arbitrary range of a species based on training and evaluation sample data
Table 7-1: Contribution of environmental variables in predicting the distribution of P. kersteni using bioclimatic data as climatic variables
Table 7-2: Contribution of environmental variables in predicting the distribution of P. kersteni using six selected bioclimatic data as climatic variables
Table 7-3: Contribution of environmental variables in predicting the distribution of P. kersteni using six selected bioclimatic data as climatic variables, supplemented by geographic coordinates
Table 7-4: Contribution of environmental variables in predicting the distribution of P. kersteni using monthly temperature and precipitation data as climatic variables
Table 7-5: Geodatasets used for developing scenarios for 1940 and 2050, with sources and descriptive information
Table 7-6: Different modelling extents and spatial resolutions applied for modelling A. minuscula
Table 8-1: Ranking of factors affecting predictions of species distribution based on potential impact on output of SpeeDi Tool for modelling P. kersteni
Table 8-2: The features of SpeeDi Tool in comparison to other species distribution modelling tools (similar features in other tools are italicised)
List of Boxes
Box 3-1: Basic questions to summarise a project in determining the context of use of a system (based on Maguire et al., 1998)
Box 3-2: Typical questions for development of a species distribution modelling tool regarding user profile (based on Johnson, 2010; Lewis and Reiman, 1994; Maguire et al., 1998)
List of acronyms
AIC: Akaike Information Criterion
ALFG: Additive Lagged Fibonacci Generator
API: Application Programming Interface
AUC: Area Under the Curve (of an ROC)
BIC: Bayesian Information Criterion
BRT: Boosted Regression Tree
CART: Classification And Regression Tree
CCAFS: Climate Change, Agriculture and Food Security
CGIAR: Consultative Group on International Agricultural Research
CLI: Command Line Interface
CRU: Climatic Research Unit
CRU-TS: CRU - Time Series
EM: Expectation-Maximisation
ENFA: Ecological Niche Factor Analysis
FAO: Food and Agriculture Organisation
FEWS-ADDC: Famine Early Warning System - Africa Data Dissemination Centre
GAM: Generalised Additive Model
GARP: Genetic Algorithm and Rule-set Prediction
GBIF: Global Biodiversity Information Facility
GIS: Geographic Information System
GLC2000: Global Land Cover of the year 2000
GLCC: Global Land Cover Characterisation
GLM: Generalised Linear Model
GRASP: Generalised Regression Analysis and Spatial Prediction
GUI: Graphical User Interface
HYDE: History Database of the Global Environment
IPCC: Intergovernmental Panel on Climate Change
IPP: Initial Population Prevalence
IUCN: International Union for Conservation of Nature
LASSO: Least Absolute Shrinkage and Selection Operator
LUCS: Land Use land Cover System
LUS: Land Use System
MARS: Multivariate Adaptive Regression Spline
MaxEnt: Maximum Entropy
MDT: Minimised Difference Threshold
MST: Maximised Sum Threshold
NDVI: Normalised Difference Vegetation Index
ODA: Odonata Database of Africa
P-RNG: Pseudo-Random Number Generator
ROC: Receiver Operating Characteristics
SBT: Soft Buffer Threshold
SDM: Species Distribution Modelling
SOAP: Simple Object Access Protocol
SpeeDi Tool: SPEciEs DIstribution modelling Tool
SRNG: Subtractive Random Number Generator
TOC: Table Of Contents
USGS: United States Geological Survey
XML: eXtensible Mark-up Language
1. Introduction
Background
Africa is the continent with rapidly changing landscape with changes in population structure, land-cover and climate (UNEP, 2008). It is also a continent with some of the global hotspots of biodiversity (Myers et al ., 2000). Biodiversity inventory plays an important role in identifying hotspots and helps in planning relating to conservation activities. The location database of global records of faunal species is being dominated by vertebrates: mammals, reptiles, amphibians and birds (e.g. Rodrigues et al ., 2004). Such records are lacking for invertebrates, including insects which are the most diverse group. In this regard, Odonata (commonly known as dragonflies and damselflies) is among the front runner, at least, in Africa due to the Odonata Database of Africa (ODA) (Clausnitzer et al ., 2012). The database can, thus, add to conservation planning of biodiversity (e.g. Simaika et al ., 2013). The usefulness of the database is partly shown by the IUCN redlist assessment of African Odonata species where the location records are used for estimating and mapping species range at a spatial resolution provided by watershed boundaries (Clausnitzer et al ., 2012, Dijkstra et al ., 2011). Naturally, the resolution of such maps depends on the size of each watershed and, thus, the resolution is heterogeneous. The delineated ranges can, however, be used as ‘first level filters’ for assessing species conservation status as well as planning actions (Clausnitzer et al ., 2012). However, distribution ranges based on watershed also include large areas which do not represent the true habitable areas (Simaika et al ., 2013). Fine and homogeneous level of details (better than the levels offered by watersheds) of species ranges can be achieved through regularly- sized grid based species distribution modelling (SDM) methods which are generally founded on the correlation of recorded species’ location and the surrounding environment (Guisan and Zimmermann, 2000). 
As such, SDM have been used as part of a broader collection of chained tools in many ecological applications (Guisan and Thuiller, 2005). Here, the availability of spatio-temporal environmental datasets, especially climatic variables have contributed in rapid adoption of SDM (Buisson et al ., 2010). Most of the SDM tools are based and focused on statistical methods. However, a complete species distribution modelling task includes much work in processing the required spatial data. A geographic information system (GIS) is more suited for handling spatial data. The ability of a SDM tool to handle spatial data would not only facilitate a smooth workflow but also enhance usability and overall user experience. However, most of the currently available SDM tools lack even basic GIS capabilities. The wish is not only to have GIS capabilities integrated in a SDM tool but to have a user interface which is easy to handle and offers support for a smooth flow of all relevant tasks. Further, many SDM tools make use of statistical methods but a straight forward adoption of any new development in the field of statistical computation cannot often be easily adopted due to the nature of data (spatial vs. non-spatial) involved and the data requirements (presences and/or absences) of the statistical methods. The fact that a single SDM tool, which can satisfactorily predict distribution of each species, does not exist (e.g. Elith et al ., 2006; Farber and Kadmon, 2003; Guisan et al ., 1999; Guisan et al ., 2007; Terribile et al ., 2010) is one of the reasons for the development and availability of
several SDM tools employing different methods. Having several tools to choose from can be a dilemma for modellers, but it can also be an advantage. Several studies have found that one method may be better suited for some species and other methods for other species (e.g. Brotons et al., 2004; Elith et al., 2006; Guisan et al., 2007). Thus, a pool of different methods, as well as similar methods applying different techniques (e.g. regularisation and/or iterative methods in fitting regression coefficients), offers a better choice with which better results can be achieved by more straightforward means (Araújo and Guisan, 2006).
The thesis’ aims and limits
As the title of the thesis suggests, the main aim of the thesis is to develop a new modelling tool for predicting the distribution of African dragonflies (Odonata). This includes the conceptualisation and implementation of a method based on binary logistic regression. As the Odonata database holds only records of presence locations, applying the concept of expectation-maximisation is to be considered. The thesis also aims to introduce the recent development of elastic-net regularisation in regression to the presence-only modelling scenario. It further aims to offer insights into the effects of different model parameters on predicted species distributions through comprehensive sensitivity analyses. Investigating the level of influence of environmental geodatasets is another aim. The sensitivity analyses regarding model parameters and different environmental geodatasets will be useful when later modelling a large pool of Odonata species. As GIS functionalities are important, both GIS and statistical modelling should be available within the tool. Finally, the tool should make GIS tasks easy to handle for so-called 'non GIS-experts'.
Therefore, the tasks of the thesis are summarised as:
• Collect, analyse and process (harmonise) geodata for modelling the spatial distribution of African dragonflies species
• Develop and implement concept for binary logistic regression modelling using presence-only data
• Incorporate recent developments (e.g. elastic-net regularisation) in regression based modelling
• Use coupled mechanism for integrating logistic regression and GIS in a harmonised graphical user interface
• Offer smooth workflow regarding basic SDM tasks and required functionalities
• Make use of ODA to demonstrate the ability of the new tool in modelling species distribution
• Perform sensitivity analyses of model (regression) parameters
• Perform sensitivity analyses of environmental geodata
• Simulate historical and future distribution of Kersten’s Sprite (Pseudagrion kersteni), a widespread Odonata species in sub-Saharan Africa, using climate and land-cover scenarios of past and future
Although the tasks are thus defined, several options are possible for achieving some of the tasks. ArcGIS Engine is selected as main GIS engine and ArcObjects application programming interface (API) is used as GIS API. The Microsoft DotNET framework is selected for the programming environment. However, the thesis will neither list other options nor elaborate advantages and disadvantages of the
selected option. Freely available environmental geodata will be the basis for required environmental layers in predicting the species’ distribution range. The ODA is the primary source of the dragonflies’ location data. The thesis neither attempts to identify the outliers in the Odonata database nor tries to rectify the errors in coordinate information. So any errors in the coordinate information and outliers are ignored. Another limitation is that the thesis presents distribution of only two species. Instead, it focuses on sensitivity analyses. The information obtained from these analyses will, however, be useful in modelling distributions of other species. And finally, no attempt has been made to interpret the results ecologically.
Outline of the thesis
Chapter 2 looks into different aspects of species distribution models. It provides an overview on different modelling methods and techniques used in predicting spatial distribution of species. Further, the elaborated characteristics and assumptions of SDM offer insights on the limitations of species distribution models. Since GIS is an integral part in the overall SDM process, the chapter also looks into different ways of incorporating GIS and statistical modelling for SDM. Chapter 3 on the quality-in-use investigates the users’ need while using an SDM tool. It, therefore, looks into the context of using such an SDM tool. The context of use is relevant for different quality aspects such as functionality, reliability, usability which are essential from a user’s perspective. It also reviews the processes necessary for incorporating users’ requirements including the experiential aspect.
The task of developing a new tool is described in chapter 4 which incorporates the reviews of chapters 2 and 3. It offers the concept of the working mechanism and handling of the new tool. Moreover, the basic requirements of a complete SDM task are analysed and its implementation described. Furthermore, the chapter provides the necessary concept for logistic regression modelling implemented in the tool. The chapter also lists the functions available in the tool which are necessary and have been selected for predicting species distribution range. The demonstration of the new tool’s ability is presented in chapter 5. Here, P. kersteni is selected as the target species to be modelled. The chapter lists the environmental geodatasets, explains the different steps performed including processing of the geodatasets in predicting the distribution range of P. kersteni in Africa using the new tool and illustrates various outputs with their according explanations. It also covers the visual assessment of the predicted distribution range of P. kersteni .
Confidence is needed before adopting the results provided by a new tool for predicting species distribution, and one way of gaining confidence is to perform sensitivity analyses. Chapter 6 focuses on the sensitivity of various parameters related to the model. These analyses show the effects on the output (prediction) of different model parameters, such as the elastic-net factor, population prevalence in background samples, number of background samples, ‘soft-buffer-threshold’, distribution pattern of background samples and sampling bias of presence locations, which can provide clues for choosing optimal values to obtain the best results. The second part of the sensitivity analyses, chapter 7, looks into the role and effects of environmental geodata. Climate related projections of past and future distributions are among the most sought-after applications of species distribution modelling, so the sensitivity to using different types of climate datasets and different schemes of land-cover datasets are the foci when modelling the
distribution range of P. kersteni in Africa. Other aspects are geographical extent (or landscape) and the spatial resolution of the datasets.
Several aspects of modelling spatial distributions with the new tool, mainly for P. kersteni, are analysed in chapter 8. The discussion starts with the various impacts of the characteristics of the sample data, followed by the influences of regression parameters and the effects of environmental datasets. The chapter also discusses the new and special features the tool offers, also in the context of usability. A comparison of the new tool with other SDM tools is included too. Finally, chapter 9 provides an overview of the thesis, summarising the various findings. As improvements are always possible for any work, some of the tasks that can be taken up from there are also presented as an outlook.
2. Background on species distribution modelling
Three types of species distribution models
A predicted spatial distribution of species occurrence based on statistical techniques and GIS technology is termed a species distribution model (Guisan and Zimmermann, 2000). The data used in these spatially explicit models are the location records of species presences, or presences and absences, and different environment related measurements (e.g. bioclimatic variables) (Guisan and Thuiller, 2005). The variables chosen are normally those which are ecologically meaningful and relate to the species' niche (Hirzel et al., 2006). Guisan and Zimmermann (2000) classified species distribution models into three general classes, analytical, mechanistic and empirical models (see Table 2-1), based on the three characteristics 'generality', 'reality' and 'precision'. It has to be noted, however, that it is often difficult to differentiate exactly between the three classes, especially between analytical and mechanistic models (ibid.). The analytical method is designed for predicting the response of the species-environment relationship within the given conditions. Thus, the properties regarding precision and generality (on response curves) are analysed. The mechanistic method is primarily based on fundamental physiological processes and offers reality and generality. The performance of such a model is measured based on “the theoretical correctness of the predicted response” (Guisan and Zimmermann, 2000, p. 150). Key to such model formulation is the primary knowledge of interactions among process variables (i.e. behaviours) which define the model structure (Guisan and Thuiller, 2005). Hence, the data requirement for this type of model is related to the processes and their interactions. Empirical models are statistically derived models which focus on precision and reality. They are based on observing environmental covariates at the species presence (and absence) locations and on formulating the empirical relationship between the species and the covariates (e.g. regression based methods). These models are not necessarily expected to describe the ecological functions, mechanisms and causative relationships between the model parameters and the predicted responses (Guisan and Zimmermann, 2000).
The selection of any of the three methods is determined by the desired properties, data availability, the complexity of the model, and the spatial scale and extent. The choice of a model also depends on the purpose of modelling, such as understanding ecological dynamics, interactions among communities, or the relationship with the biophysical environment (Austin, 2007; Guisan and Zimmermann, 2000). Empirical or statistical models are commonly used as they are relatively easy to formulate compared with establishing causal relationships through the analytical and mechanistic methods (Hartley et al., 2010). Using geocoding techniques, large amounts of natural history and herbarium collections have been and are being georeferenced (Yesson and Culham, 2006) (e.g. GBIF 1). The georeferenced species data have facilitated the predictive modelling of species distributions, and several empirical modelling methods have been developed (Elith et al., 2006; Guisan and Thuiller, 2005). Although statistical models are easy to formulate, the combinatorial use of an increasing number of potential predictor variables (see chapter 2.6) has been pushing the
1 http://data.gbif.org/welcome.htm (17-Apr-2012)
limits on computation due to the complexity as well as the iterative nature of numerical algorithms. Further, the inclusion of interaction terms increases the number of predictor variables exponentially, especially in GLM and its derivatives (Guisan and Thuiller, 2005), and thus increases the computational load. However, the development of new modelling methods is also helped by increasing computing power (Buisson et al., 2010; Jeffers, 1999), which helps to handle iterative and complex numerical algorithms.
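The growth in the number of candidate terms can be illustrated with a simple count. The following is a minimal sketch, assuming that an interaction term is the product of distinct predictor variables; the function name `count_model_terms` and the predictor counts are hypothetical and chosen for illustration only.

```python
from math import comb

def count_model_terms(n_predictors: int, max_order: int) -> int:
    """Count main effects plus all interaction terms up to max_order.

    Order 1 gives the main effects, order 2 adds pairwise interactions, etc.
    """
    return sum(comb(n_predictors, k) for k in range(1, max_order + 1))

for p in (5, 10, 20):
    pairwise = count_model_terms(p, 2)      # main effects + pairwise interactions
    all_orders = count_model_terms(p, p)    # every possible interaction: 2**p - 1
    print(p, pairwise, all_orders)
```

Restricting a model to pairwise interactions keeps the growth quadratic; only when interactions of all orders are allowed does the count become exponential in the number of predictors.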
Table 2-1: Three different categories of species distribution models and their basic characteristics, primarily based on Guisan and Zimmermann (2000)
Characteristic | Analytical | Mechanistic | Empirical
Focus on | precision & generality | reality & generality | precision & reality
Basis | theoretical (mathematical) | physiological process | phenomenological
Nature | dynamic, easily transferable from one landscape to other | dynamic, easily transferable from one landscape to other | static; not (or not easily) transferable from one landscape to other (often based on indirect gradients)
Purpose | predict response in simplified reality (e.g. logistic growth equation) | describe cause and effects from response (e.g. competition, dispersal) | describe the state (e.g. present/past/future distribution)
Model formulation | medium to difficult, based on simplification | difficult | easier than mechanistic
Ecological interpretation | easy because of being directly based on theory | fairly easy, based on ecological processes | may not conform to ecological theory, based on stochastic events
Uses of species distribution modelling
Species distribution modelling (SDM) has a wide range of uses. Locating hotspots for conservation, designing cost-effective surveys, predicting impacts of environmental change, predicting potential species invasions and studying the relationship between phylogeny and climate are common examples, which are briefly reviewed below.
Determining biodiversity hotspots and reserve planning – Araújo et al . (2002) applied species distribution models for 78 breeding passerine bird species in Great Britain from the recorded locations of presence and extinction in order to prioritise conservation areas. They reported a negative correlation between the local probability of extinction and the probability of occurrence. They argue that selecting areas with high probability of occurrence is suitable for reserve selection as the probability of retaining the species in the area in future will be high in those areas. They showed that areas selected with higher probability of occurrence will lead to a reduced rate of local extinction of the species. García (2006) used SDM of 301 species to derive the patterns of Herpetofauna biodiversity in Mexico. By mapping the biodiversity, the author found several hotspots of species richness. With the large proportion of species used in the mapping being endemic and endangered, and forming the biodiversity hotspots, the author concludes that several areas should be included for prioritising conservation.
Design surveys and/or facilitate (re-)discovery of species – Some species are rare and difficult to find, hence the term rare species. The data for such species can also be used for effective conservation planning. Here, the use of species distribution models, even with limited data, can facilitate quick field sampling (Le Lay et al., 2010). Based on a predictive distribution map, Engler et al. (2004) reported discoveries of four new sites of the highly endangered species Eryngium alpinum in Switzerland. Guisan et al. (2006) also used SDM in an iterative way for a stratified sampling design for the rare species E. alpinum. With the improved records of several newly discovered populations, the authors suggested that the Swiss Red list status of the species could be downgraded from vulnerable to endangered. The process not only helped in discovering new populations but also offered an effective way, both economically and in terms of time, of assessing the species' threat and conservation status. A model-based sampling strategy was exercised by Le Lay et al. (2010) for three rare plant species (Cypripedium calceolus, Eryngium alpinum and Scorzonera laciniata) and five common plant species (Anthyllis vulneraria, Astrantia major, Briza media, Heracleum sphondylium, Pulsatilla alpina) in the western Swiss Alps. Here too, the authors concluded that the model played an important role in increasing the knowledge of the distribution of the rare species and led to the discovery of new sites.
Impacts of environmental change on biodiversity – Statistical SDM are commonly applied for predicting the current distribution of species and are used for projecting into the future via climate scenarios. The process of projecting into the future reveals the range shifts, loss or gain in suitable habitat (Thuiller, 2003). This can help e.g. for designing corridors for migration, or better formulation of conservation strategies (Thuiller et al ., 2008; McClean et al ., 2005). But, are static models able to predict the future species distributions? Hijmans and Graham (2006) compared the outputs of static models with a mechanistic model to evaluate the ability of static models for predicting the shifts, shrinkage and expansion in distribution range induced by climate change. Modelling 100 plant species, they concluded that the static models are indeed able to predict the change but suggest some cautions and approaches which can improve the predictions further.
Species invasion – Competition is natural among species, either within the same community (intra-specific) or with different communities (inter-specific). The dominance of non-indigenous species over the native (indigenous) species is termed biological invasion (Lonsdale, 2002). Invasive species may have severe effect on the ecosystem, and thus they are a risk to biodiversity, especially to endangered species. In terms of economy, there can be positive or negative impact based on the way (intentionally or accidentally), purpose and management (control) of introduction (Wonham, 2006). Although several studies have been made to find out the nature or the characteristics of invasion and few key factors have been discovered, the discoveries are not sufficient to predict whether the introduction of a new species will lead to invasion or not (Peterson, 2003; Wonham, 2006). One of the preliminary options to determine the potential range of invasion is to find out whether the environment in the landscape is habitable for the new species and the species distribution modelling offers the possibility in assessing this habitat suitability. Peterson (2003) discussed the use of species’ potential distribution for predicting the potential of invasion by an introduced species. The application was demonstrated by modelling the aquatic plant Hydrilla verticillata , a native species in Asia-Pacific and invasive in North America, based on the environmental niche of the invasive species’ native landscape and projecting these ecological characteristics (or parameters) to the new geographic space. The prediction of invasion was then compared with the data collected in the North American region.
Phylogenetic and Phyloclimatic relationships – One of the uses of species distribution models based on climate variables is the study of biogeography, either to test a hypothesis or to understand the biogeographic processes. Biogeographic studies explain why a species occurs at certain geographic locations, what the limiting factors are (e.g. climate, altitude), how it migrates with time, and so on (Cox and Moore, 2005). Phylogenetic hypotheses in combination with geographic range maps are commonly used to suggest speciation theories. To identify important speciation mechanisms in dendrobatid frogs in South America, Graham et al. (2004) used phylogenetic information and environmental niche models. They were able to show the lineages of speciation in environmental space and thus added knowledge regarding phylogeny and species distribution. Yesson and Culham (2006) showed for the plant genus Drosera that bioclimatic models reveal phylogenetic patterns. They pointed out that although central regions of Australia have, at present, a suitable climate for D. macrantha, the species is not observed in this region because the climate in the past (paleoclimate) was not suitable and thus acted as a historical limiting factor. Such studies can help to explain why a species is not observed in areas of potentially suitable environmental conditions (at present) and thus contribute to understanding the historical biogeography of the species.
Empirical methods for species distribution modelling
There have been several studies comparing the predictive strength of species distribution methods based on the applied statistical measures, and a few of these studies have included several tools (e.g. Elith et al., 2006; Guisan et al., 2007; Wisz et al., 2008). These methods include profile based models such as the hyper-rectilinear envelope (BIOCLIM model, Busby, 1991), multi-dimensional convex sub-envelopes (HABITAT, Walker and Cocks, 1991), the distance based envelope (DOMAIN, Carpenter et al., 1993) and the hyper-ellipsoid envelope (ENFA-Biomapper, Hirzel et al., 2002b); parametric linear models (GLM, Hosmer and Lemeshow, 2000); non-parametric additive linear models (e.g. GAM and MARS, Hastie et al., 2009); regression trees (e.g. CART and BRT, ibid.); and several other machine learning methods such as neural networks (ibid.), GARP (Stockwell and Peters, 1999) and MaxEnt (Phillips et al., 2006). The choice of method also depends on the type of species data available, such as presence-absence or presence-only. Some methods are briefly discussed below 2.
Profile based - The bioclimatic envelope model (BIOCLIM, Busby, 1991) was one of the first methods to offer modelling with presence-only data. A bioclimatic profile of climate variables is created within a rectilinear envelope. The suitability of the climatic habitat is then classified into ‘suitable’, ‘marginal’ and ‘unsuitable’ categories. Walker and Cocks (1991) introduced the HABITAT tool to enhance the BIOCLIM concept by introducing a simple classification and regression tree (CART) procedure to form sub-envelopes of the original rectilinear envelope. The sub-envelopes reduced the prediction error by forming a compact environmental envelope, i.e. condensed envelopes (ibid.). However, the methods based on rectilinear envelopes (and sub-envelopes) impose some constraints. The distance based DOMAIN modelling method was developed to reduce the problems due to the tightly constrained convex envelopes of the HABITAT procedure and to provide further options for modellers. It uses the environmental distance (a point-to-point similarity function) to predict species occurrence (Carpenter et al., 1993). All three envelope based methods, BIOCLIM, HABITAT and DOMAIN, were first developed at CSIRO, Australia.
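The idea of a rectilinear envelope can be sketched in a few lines. The example below is an illustration only and not the BIOCLIM implementation: it assumes that the 'suitable' core is bounded by the 5th and 95th percentiles of each variable at the presence locations and that the 'marginal' zone extends to the observed minimum and maximum. The percentile limits, function names and sample values are assumptions made for this sketch.

```python
def percentile(values, pct):
    """Linear-interpolation percentile of a sequence (pct in 0..100)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def fit_envelope(presence_env, lower_pct=5.0, upper_pct=95.0):
    """Per-variable limits (full_min, core_min, core_max, full_max)."""
    columns = list(zip(*presence_env))
    return [
        (min(col), percentile(col, lower_pct), percentile(col, upper_pct), max(col))
        for col in columns
    ]

def classify(cell_env, envelope):
    """Label one grid cell from its environmental values."""
    in_core = all(c_lo <= v <= c_hi
                  for v, (_, c_lo, c_hi, _) in zip(cell_env, envelope))
    in_full = all(f_lo <= v <= f_hi
                  for v, (f_lo, _, _, f_hi) in zip(cell_env, envelope))
    if in_core:
        return "suitable"
    return "marginal" if in_full else "unsuitable"

# Made-up presence records: (mean temperature, annual rainfall)
presences = [(10, 100), (20, 200), (30, 300), (40, 400), (50, 500)]
envelope = fit_envelope(presences)
for cell in [(30, 300), (10, 100), (60, 300)]:
    print(cell, classify(cell, envelope))
```

Because every variable is tested independently against its own limits, the envelope is a box in environmental space, which is the constraint that the sub-envelopes of HABITAT and the distance measure of DOMAIN try to relax.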
2 arranged based on nature of species sample data requirements
The ecological niche factor analysis (ENFA) method is implemented in the Biomapper software, which uses the elliptical envelope formed by factorising the variables into principal components. The first component represents the species’ marginality (difference in a central tendency measure, e.g. mean or median) and the second component represents the specialisation (comparison of variance) (Hirzel et al., 2002b). This method combines the envelope in elliptic form (created by factor analysis) with the distances (mean or median) in the space of the environmental variables. One of the drawbacks of the envelope based BIOCLIM, DOMAIN and ENFA methods is that none of them can use categorical data as input. Further, the ENFA method assumes normality of the predictor variables, which may not hold in many cases (Engler et al., 2004), but a transformation such as Box-Cox has been suggested as a workaround (Hirzel et al., 2002b; Wisz and Guisan, 2009).
Linear models and derivatives – The generalised linear model (GLM) is a set of parametric linear models where a model is created to find the systematic effects in a set of data (McCullagh and Nelder, 1989). GLM based models are able to use continuous and categorical data as input, with the categorical data transformed by using ‘dummy’ variables. Binary logistic regression has been a popular choice for probability calculation because it confines the result within the range of 0 and 1 (Hosmer and Lemeshow, 2000). However, the necessity of absence data has troubled many modellers. When absence data are lacking, presence-absence models use randomly generated pseudo-absence data (Wisz and Guisan, 2009). The generalised additive model (GAM) is a non-parametric extension of GLM. The parameters are fitted by applying smoothing functions for non-linear relationships, often with the scatter plot smoother (Hastie et al., 2009). One of the drawbacks of GAM based methods is that they are not designed for large datasets because they are computationally expensive (Hastie et al., 2009). Nevertheless, GAM based models are used successfully in SDM. GRASP (Lehmann et al., 2002) is such a GAM based tool used in the R and S-PLUS environments. For a small region, GRASP can be applied directly, but the use of a look-up table is suggested for modelling with a larger number of cells (ibid.). With inclusion of the BRUTO procedure, the computational efficiency of the model fitting process in GAM is increased by several factors (Leathwick et al., 2006). Multivariate adaptive regression splines (MARS) is an adaptive regression model which fits the response with piece-wise linear functions and includes a recursive partitioning (tree) method to improve the regression fit (Friedman, 1991). Since GAM and MARS use logistic regression for species distribution modelling, these methods also require absence data, and pseudo-absences are commonly used.
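How dummy variables and the logistic link work together can be shown with a small example. This is a minimal sketch and not the implementation of any of the tools cited above: it fits the coefficients by plain gradient ascent instead of the iteratively reweighted least squares commonly used for GLM, and the records (elevation, land-cover class, presence or pseudo-absence) are invented for illustration.

```python
import math

def dummy_code(category, levels):
    """0/1 indicators for a categorical value; the first level is the reference."""
    return [1.0 if category == level else 0.0 for level in levels[1:]]

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(beta, x):
    """Predicted probability of presence, confined between 0 and 1."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))

def fit_logistic(rows, labels, steps=5000, rate=0.1):
    """Estimate intercept and coefficients by gradient ascent on the log-likelihood."""
    beta = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, y in zip(rows, labels):
            err = y - predict(beta, x)
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        beta = [b + rate * g / len(rows) for b, g in zip(beta, grad)]
    return beta

# Invented records: (elevation in km, land-cover class, 1 = presence, 0 = pseudo-absence)
LEVELS = ["forest", "wetland", "grassland"]
records = [(0.2, "wetland", 1), (0.4, "wetland", 1), (0.3, "forest", 1),
           (1.5, "grassland", 1), (1.8, "grassland", 0), (2.0, "forest", 0),
           (0.5, "grassland", 0), (2.2, "wetland", 0)]
rows = [[elev] + dummy_code(cover, LEVELS) for elev, cover, _ in records]
labels = [float(y) for _, _, y in records]
beta = fit_logistic(rows, labels)
print(predict(beta, [0.3] + dummy_code("wetland", LEVELS)))
print(predict(beta, [2.1] + dummy_code("grassland", LEVELS)))
```

The categorical land-cover variable with three classes enters the model as two indicator columns, so each additional class costs one more coefficient.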
Other methods – Genetic algorithm for rule-set prediction (GARP) is based on a genetic algorithm and a few rules for predicting species occurrence. The rules include the bioclimatic profile rules, a simple logistic regression rule, an atomic rule and a GARP rule 3 (Stockwell and Peters, 1999). GARP uses presence and (pseudo-)absence data. GARP was one of the most widely used methods for studies with presence-only data at the time MaxEnt was developed (Phillips et al., 2006). MaxEnt is a parametric machine learning method which predicts the species distribution by maximum entropy estimation, i.e. it selects the distribution that is closest to uniform subject to the constraint that the expected value of each predictor variable is close to its empirical average (Phillips et al., 2006). The maximum entropy method is equivalent to maximising the log-likelihood (i.e. minimising the negative log-likelihood), as in logistic regression, but differs in how it is formulated. MaxEnt is formulated from a ‘Bayesian’ perspective while logistic regression is
3 Refer Stockwell and Peters (1999) for details on how rules are defined
formulated from a ‘frequentist’ perspective (He, 2010). Further, in MaxEnt the likelihood is used for performance measurement and not for parameter estimation (Dudik, 2007). Thus, the MaxEnt method is regarded as generative whereas logistic regression is a discriminative method (Phillips et al., 2006). The MaxEnt method for species distribution modelling uses presence and background data (see chapter 2.4 for background data), and hence absence data are not necessary. If background samples are not provided, the MaxEnt tool generates them before modelling. Boosting is a machine learning method for improving the fit of a model. The boosted regression tree (BRT) is a gradient boosted model which uses several regression trees. The final model is calculated from these trees in a weighted fashion (Hastie et al., 2009). The gbm 4 package of R requires presence and (pseudo-)absence data. For species distribution modelling, a special package (ecogbm 5) is derived from the original gbm package, and its requirement is presence and background data.
Use of presence, absence and background data in species distribution modelling
Species sample data requirements vary with the modelling method. Since the introduction of background samples, the literature (e.g. Elith et al., 2006; Elith and Leathwick, 2009; Phillips et al., 2009; Ward et al., 2009) has used the term ‘presence-only’ modelling for both types of methods: a) those requiring only presence samples (e.g. BIOCLIM model, ENFA), and b) those requiring presence and background samples (e.g. MaxEnt, ecogbm). In this thesis, the term ‘presence-only’ refers to methods requiring only presence samples and not to those requiring ‘presence-background’ samples. The uses of the three types of sample data are briefly described here, together with an overview of a method for measuring the performance of models using these data.
Presences – Presence samples are the locations where species occurrences have been recorded. Profile (envelope) based modelling needs only presence locations. Thus, it provides a good opportunity to make models with geo-referenced museum records. However, there were some concerns regarding the predicted areas as many non-habitable areas are predicted as suitable (Elith and Burgman, 2002). Although the ENFA approach improved the envelope-technique a lot, the suggested method, if possible, has been the presence-absence models such as GLM (Hirzel et al ., 2002b).
True-absences and pseudo-absences – Absence samples are the locations where the occurrence of a species has not been observed on several visits. The records of such instances are true-absences. Typically, these are not recorded and hence true-absences are lacking for many species (Barry and Elith, 2006; Hirzel et al., 2002b). So, in order to obtain absence locations, pseudo-absence samples are generated randomly based on some knowledge of where the species may not be present (Wisz and Guisan, 2009). In this way presence-absence models are formed, i.e. pseudo-absences are used as pure/true absences (see Figure 2-1). One of the difficulties of the pseudo-absence approach is how to generate reliable absences. Apart from sampling random points on a landscape, several strategies for generating pseudo-absences (see Table 2-2) have been devised and investigated for building binary logistic regression models. The primary purpose of devising these strategies is to obtain, indirectly,
4 http://cran.r-project.org/web/packages/gbm/index.html, accessed 28.12.2011
5 ecogbm is not available at official CRAN distribution site, a beta version is available at: http://www.stanford.edu/~hastie/Papers/Ecology/ecogbm_1.01.tar.gz, accessed 28.12.2011
as many reliable absence samples as possible. However, there is no confirmed recommended approach.
[Figure 2-1 (plot not reproduced): four panels, each showing Probability (0.0 to 1.0) against Elevation: (a) true-absence (initial), (b) true-absence (final), (c) pseudo-absence (initial), (d) pseudo-absence (final). Legend entries: pr, ab. true, ab. assumed, actual.]
Figure 2-1: Fitting the probability (logistic regression) of true- and pseudo-absence data, shown with one predictor variable ‘Elevation’. The probability of absence data (a and b: true; c and d: pseudo) is zero and this value is used throughout the iteration process. Although pseudo-absence data are likely to contain a mixture of true absences (prob. = 0, circles) and non-absences (i.e. prob. > 0, plus signs; their actual values indicated by squares), a probability value of zero is assumed (based on the idea of Ward et al., 2009, figure 3).
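Random generation of pseudo-absences guided by simple prior knowledge can be sketched as follows. The example is an assumption-laden illustration and not one of the published strategies of Table 2-2: it supposes planar (projected) coordinates and uses a hypothetical minimum distance to any presence record as the only 'knowledge' about where the species may be absent. The function name, the seed and all values are made up.

```python
import math
import random

def sample_pseudo_absences(presences, bounds, n_samples, min_distance, seed=0):
    """Draw random points inside bounds, rejecting any that lie closer than
    min_distance (same units as the coordinates) to a presence record."""
    rng = random.Random(seed)
    x_min, y_min, x_max, y_max = bounds
    accepted = []
    attempts = 0
    # The attempt limit guards against an endless loop when the buffers
    # leave too little free space in the study area.
    while len(accepted) < n_samples and attempts < 100 * n_samples:
        attempts += 1
        candidate = (rng.uniform(x_min, x_max), rng.uniform(y_min, y_max))
        if all(math.dist(candidate, p) >= min_distance for p in presences):
            accepted.append(candidate)
    return accepted

presences = [(2.0, 2.0), (8.0, 8.0)]
pseudo_absences = sample_pseudo_absences(presences, (0.0, 0.0, 10.0, 10.0),
                                         n_samples=20, min_distance=1.5)
print(len(pseudo_absences))
```

Points generated in this way are still only assumed absences: a location outside the buffer may well be habitable, which is the situation shown in panels (c) and (d) of Figure 2-1.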
Backgrounds – Phillips et al. (2006) introduced the MaxEnt method, which uses background samples for predicting species distribution in a regression model. When using background samples, the covariates of the presence samples across the landscape are compared with the covariates of the background samples (Ward et al., 2009). In the concept of Phillips et al. (2006), the background samples enter the model fitting process itself and are not merely used for comparing covariate values as in envelope based methods such as the ENFA approach in Biomapper. The method is similar to the use of pseudo-absences, but it uses the random sample locations with some prior population prevalence value (MaxEnt initialises with a probability value of 0.5; Elith et al., 2011) and then adjusts the values iteratively while fitting the model parameters. A similar concept of background samples for GLM, based on the Expectation-Maximisation (EM) concept (Dempster et al., 1977), was explored by Ward et al. (2009) and implemented in the ecogbm package of the R statistical software, showing that GLM-based models with EM applied can perform very well. The fitting process begins by assigning the background samples a prior prevalence value. The probability values for the background samples are then re-assigned iteratively (see Figure 2-2). Thus, the probability values of background samples with a suitable environment for a species increase, whereas the probability values of samples with an unsuitable environment decrease. The iteration process runs until the maximum number of iterations is reached or other statistical criteria are satisfied.
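The iterative re-assignment of background probabilities can be illustrated with a simplified sketch. It is not the ecogbm code and should not be read as the exact algorithm of Ward et al. (2009). It assumes a known prevalence, fits the logistic regression by gradient ascent, and corrects the intercept with the standard case-control offset log((n_p + prevalence * n_u) / (prevalence * n_u)), where n_p and n_u are the numbers of presence and background samples. All names and data below are invented.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def linear(beta, x):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def fit_logistic(rows, targets, beta, steps=500, rate=0.5):
    """Gradient ascent on the log-likelihood; targets may be fractional."""
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, y in zip(rows, targets):
            err = y - sigmoid(linear(beta, x))
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        beta = [b + rate * g / len(rows) for b, g in zip(beta, grad)]
    return beta

def em_presence_background(presence, background, prevalence, n_iter=30):
    """Return population-scale coefficients and the final expected presence
    of every background sample."""
    n_p, n_u = len(presence), len(background)
    # Adding n_p known presences inflates the odds of presence in the sample
    # relative to the population by this factor (case-control offset).
    offset = math.log((n_p + prevalence * n_u) / (prevalence * n_u))
    rows = presence + background
    bg_targets = [prevalence] * n_u      # start: prior prevalence everywhere
    naive = [0.0] * (len(rows[0]) + 1)
    for _ in range(n_iter):
        # M-step: refit with presences fixed at 1 and background samples
        # at their current expected value.
        naive = fit_logistic(rows, [1.0] * n_p + bg_targets, naive)
        adjusted = [naive[0] - offset] + naive[1:]
        # E-step: update the expected presence of each background sample.
        bg_targets = [sigmoid(linear(adjusted, x)) for x in background]
    return adjusted, bg_targets

# Invented one-variable example: presences cluster at low covariate values.
presence = [[0.1], [0.2], [0.3], [0.25], [0.15]]
background = [[i / 10.0] for i in range(21)]
beta, expected = em_presence_background(presence, background, prevalence=0.3)
print(beta)
```

In this toy example the expected values of background samples near the presences rise above those of distant samples over the iterations, which mirrors the behaviour described in the paragraph above.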
Model evaluation – Model performance is evaluated by comparing the numbers of correct predictions of presences and absences in a binary classification matrix. A widely used evaluation measure is the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve. The ROC curve is threshold independent and plots the sensitivity (true-positive rate) on the vertical axis against 1 − specificity (false-positive rate) on the horizontal axis for the classified model output (Krzanowski and Hand, 2009). An evaluation of a model built from presence and true-absence data gives the real performance of the model. For a model built with pseudo-absence data, the evaluation result is only realistic if the assumptions made for generating the pseudo-absences make them as valid as true absences. Models using background samples (presence-only) cannot directly provide a real performance metric for determining presence and absence locations (Hirzel et al., 2006); they can only measure true presences and false absences. For a proper evaluation, absence data (either true or pseudo) are required. When absence data are lacking, the background data are generally used as (pseudo-)absences for generating the ROC curve (Phillips et al., 2006).
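As a small worked illustration of the AUC, the sketch below computes it in two ways for made-up prediction scores: as the share of presence/absence pairs that are ranked correctly, and as the trapezoidal area under the ROC points (false-positive rate against sensitivity). The function names and the scores are hypothetical.

```python
def auc_by_pairs(pres_scores, abs_scores):
    """Share of (presence, absence) pairs where the presence scores higher; ties count 0.5."""
    wins = 0.0
    for sp in pres_scores:
        for sa in abs_scores:
            if sp > sa:
                wins += 1.0
            elif sp == sa:
                wins += 0.5
    return wins / (len(pres_scores) * len(abs_scores))

def roc_points(pres_scores, abs_scores):
    """(false-positive rate, sensitivity) for every threshold, from strict to lenient."""
    points = [(0.0, 0.0)]
    for t in sorted(set(pres_scores + abs_scores), reverse=True):
        sensitivity = sum(s >= t for s in pres_scores) / len(pres_scores)
        false_pos_rate = sum(s >= t for s in abs_scores) / len(abs_scores)
        points.append((false_pos_rate, sensitivity))
    return points

def area_under(points):
    """Trapezoidal area under a piecewise-linear curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up predicted probabilities at presence and (pseudo-)absence locations.
presence_scores = [0.9, 0.8, 0.6, 0.55]
absence_scores = [0.7, 0.4, 0.3, 0.2]

auc_pairs = auc_by_pairs(presence_scores, absence_scores)
auc_curve = area_under(roc_points(presence_scores, absence_scores))
print(auc_pairs, auc_curve)  # both 0.875 for these scores
```

Both routes give the same value because the AUC equals the probability that a randomly chosen presence receives a higher score than a randomly chosen absence.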
Table 2-2: Some of the approaches used in studies for generating pseudo-absences for presence-absence models
Approach | Devised for | Employed by
random distribution and weighted-random distribution | 43 fern species | Zaniewski et al. (2002)
ENFA-weighted pseudo-absences | Alpine herbaceous plant (Eryngium alpinum) | Engler et al. (2004)
4 different approaches based on 12 species and museum records | three butterfly species (Melitaea didyma, Coenonympha tullia and Maculinea teleius) | Lütolf et al. (2006)
habitat envelope via logical (spatial) function of 1 and 2 eco-variables | Northern Goshawk (Accipiter gentilis atricapillus) nests | Zarnetske et al. (2007)
combined ENFA and distance weighted | Root vole (Microtus oeconomus) and white-tailed eagle (Haliaeetus albicilla) nests | Hengl et al. (2009)
General characteristics and assumptions of statistical species distribution models
When modelling species distributions, assumptions are made, and the output is valid only within the assumed conditions, which give the models certain characteristics. Some general assumptions and the associated characteristics are presented here.
Static and nature at equilibrium – Statistical models are static (not dynamic) and deterministic (event based). They do not include physiological processes (see Table 2-1) (Franklin, 1995). Biotic interactions such as inter- and intra-specific competition cannot be modelled, and these models also lack micro-habitat requirements, although these interactions are vital processes determining the species range (Araújo and Luoto, 2007; Holt and Barfield, 2009; Wisz et al., 2013). Static models also do not consider dispersal and evolutionary change such as climate-induced range shifts (Pearson and Dawson, 2003). It is assumed that the species has colonised the maximum possible range and is in an optimum state (no further migration). Thus, the distribution is taken to be in a state of (pseudo-)equilibrium with the environmental conditions in its native range (Guisan and Thuiller, 2005; Guisan and Zimmermann, 2000). This is often not the case (Araújo and Peterson, 2012) because of several factors, e.g. biogeographic history (Yesson and Culham, 2006), geographic (altitudinal and ‘latitudinal’) constraints limiting dispersal (Munguía et al., 2008), or an invasion that is not fully established, i.e. continuing expansion (Guisan and Thuiller, 2005; Peterson, 2003). Further, the presence samples are considered to represent the total population in its native range.
[Figure 2-2, panels: (a) EM iteration 1; (b) EM iteration 2; (c) EM iteration 3; (d) EM final iteration; x-axis: Elevation; y-axis: Probability (0.0 to 1.0); legend: pr: z = 1; bg (pr): z = 0, y = 1; bg (ab): z = 0, y = 0]
Figure 2-2: Fitting the probability (logistic regression) of background data (z = 0) in an iterative process shown with one predictor variable ‘Elevation’. z = 1 represents presence samples; z = 0 represents background samples; z = 0, y = 1 represents samples at favourable environment (presences); and z = 0, y = 0 represents samples at unfavourable environments (absences). At each iteration step, the value of y changes for z = 0 (source: Ward et al., 2009, figure 3; redrawn and legend slightly altered)
Simplified form – As statistical models do not include all possible factors that affect the distribution pattern, the real shapes of species’ responses to environmental covariates are unknown and the modelled responses are limited (Austin, 2007). These models are thus a simplified representation of a more complex system (Barry and Elith, 2006). Furthermore, different modelling methods offer different strategies for incorporating environmental variables. The choice of modelling method, model variables and variable interactions can affect the model’s accuracy and generality (Guisan and Zimmermann, 2000).
Niche – The dynamics of species’ evolution, behaviour and adaptation to the climatic environment are not, and cannot be, directly included in a static model. Because of the static nature, and with the assumption that the samples represent the full population, the output of the models represents more of a ‘fundamental niche’, which relates mainly to climate and does not consider biotic interactions, than a ‘realised niche’, which covers a smaller range because of several biotic and abiotic limiting factors (Huggett, 2004). Currently, most models also include variables which partially and indirectly represent some of the biotic interactions (Lehmann et al., 2002), and hence the outputs lie somewhere between the fundamental and the realised niche (Farber and Kadmon, 2003). In reality, however, the samples are often incomplete and do not represent the full population, nor does the model output represent a complete fundamental niche.
Presence or absence – A site can be considered an absence location after more than 28 visits with confirmed non-sightings, at the 95% confidence level (Stauffer et al., 2002), whereas a single sighting is treated as a presence location even if the species is present for an unknown reason and the site is, in reality, not a real habitat or habitable location. Ideally, such locations should not be used for fitting a model (Guisan and Thuiller, 2005). Including presence records from unsuitable habitats introduces false presences, but this is often not considered. Moreover, for many species, absence locations (at the 95% confidence level) are not recorded. Pulliam (2000) discusses, and provides several references for, cases where species have been found in unsuitable habitat and where suitable habitat is not occupied. However, such records are often included in the modelling because information on unsuitable habitat is often not available. Discriminative methods such as GLM can use the information provided by absence data efficiently (Hirzel et al., 2002b), but when true absence locations are not recorded, important information about the species-environment relationship for determining the species range is withheld.
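The link between the number of unsuccessful visits and the confidence in an absence can be made concrete with a short calculation. The sketch assumes independent visits and a constant per-visit detection probability; the value 0.1 is an illustrative assumption (it is not stated in the text above) that happens to reproduce the figure of more than 28 visits.

```python
import math

def visits_needed(detect_prob, confidence=0.95):
    """Smallest number of visits n such that the chance of missing a species
    that is actually present on all n visits, (1 - detect_prob)**n, does not
    exceed 1 - confidence. Assumes independent visits with constant detectability."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - detect_prob))

# Hypothetical per-visit detection probability of 0.1.
n_visits = visits_needed(0.1)
print(n_visits)  # 29, i.e. more than 28 visits
```

With a higher assumed detectability the required number of visits drops quickly, e.g. `visits_needed(0.5)` gives 5.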
Sample bias – Bias in sampling is often unknown except in the case of well-designed surveys (Guisan et al., 2007). It has to be noted that one of the aims of SDM is to reduce expensive surveys. The collection of species’ locations is often opportunistic in nature (Barry and Elith, 2006; Boitani et al., 2011; Elith and Leathwick, 2009). This leads to some areas having very high sample densities whereas other areas remain under-sampled. Such differences in sampling density may lead to false predictions. Often such biases are assumed not to occur and are ignored, i.e. they are not considered when modelling the species distribution range. Phillips et al. (2009) improved on the results of Elith et al. (2006) by introducing into the background samples a bias similar to that in the presence samples. They report that, by partially incorporating the sampling bias, the predictions of different modelling tools improved for the same 226 species (and data) used by Elith et al. (2006).
Presence-background sampling ratio – Bias in presence-background modelling is introduced via the ratio of the number of background samples to presences and via the extent used for background information, both of which affect the evaluation of a model’s predictive performance (Lobo et al., 2008; VanDerWal et al., 2009). Although there have been suggestions to use weights to balance this proportional bias, only a few studies (e.g. Maggini et al., 2006; Guisan et al., 2007) have actually included weights. Maggini et al. (2006) applied weights to absence samples to obtain an overall prevalence value of 0.5 and found that the models were more stable and explained deviance better than models with non-weighted absences. However, weights that balance the sampling ratio will have very little influence if the cut-off value for the binary representation is calculated from the sensitivity and specificity values of the prediction instead of the classical 50% mark (Jiménez-Valverde et al., 2009).
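Both ideas in this paragraph can be sketched briefly: a weight for absence samples that brings the weighted prevalence to 0.5, and a cut-off chosen from sensitivity and specificity rather than fixed at 0.5. Maximising the sum of sensitivity and specificity is only one of several possible criteria and is an assumption of this sketch, as are the function names and the scores.

```python
def absence_weight(n_presence, n_absence):
    """Weight per absence sample (presence weight = 1) giving a weighted prevalence of 0.5."""
    return n_presence / n_absence

def sens_spec(pres_scores, abs_scores, threshold):
    """Sensitivity and specificity when scores >= threshold are classed as presence."""
    sensitivity = sum(s >= threshold for s in pres_scores) / len(pres_scores)
    specificity = sum(s < threshold for s in abs_scores) / len(abs_scores)
    return sensitivity, specificity

def best_threshold(pres_scores, abs_scores):
    """Cut-off maximising sensitivity + specificity (first maximum if tied)."""
    best_t, best_sum = None, -1.0
    for t in sorted(set(pres_scores + abs_scores)):
        total = sum(sens_spec(pres_scores, abs_scores, t))
        if total > best_sum:
            best_t, best_sum = t, total
    return best_t, best_sum

# Made-up predicted probabilities; presences are the minority class.
pres = [0.3, 0.35, 0.4, 0.6]
absn = [0.05, 0.1, 0.1, 0.15, 0.2, 0.45]

w_abs = absence_weight(len(pres), len(absn))
weighted_prevalence = len(pres) / (len(pres) + w_abs * len(absn))
cutoff, cutoff_sum = best_threshold(pres, absn)
fixed_sum = sum(sens_spec(pres, absn, 0.5))
print(w_abs, weighted_prevalence, cutoff, cutoff_sum, fixed_sum)
```

For these scores the data-driven cut-off lies at 0.3, well below the 50% mark, and separates the two groups better than a fixed cut-off of 0.5.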
Variable importance – The importance or ranking of environmental variables given by statistical measures may differ from ecological theory (reality but not generality; see chapter 2.1, Table 2-1 on types of models). This is because a statistical model is not based on cause-and-effect relationships between the variables and the responses (Guisan and Zimmermann, 2000). In addition, the final model uses only a subset of the variables from the initial pool. The subset is often chosen either by a ‘goodness-of-fit’ test for individual variables or by selection via ‘information criteria’ on the log-likelihood such as AIC, BIC, deviance or similar metrics (Heikkinen et al., 2006; Austin, 2007; Elith and Graham, 2009; Wisz and Guisan, 2009).
Scale – Static SDM are generally valid at larger grain sizes and for broad-scale patterns. The assumption that the environmental conditions remain constant is valid only at coarser grain sizes (Austin, 2007). Biotic interactions play a greater role in determining species’ distributions at finer resolutions (Austin, 2007; Guisan and Thuiller, 2005), but such detailed data (measurements) are often not available, especially in a spatially explicit form, because data acquisition is expensive (Hartley et al., 2010).
Independence of data – The sample locations used for modelling the species distribution are assumed to be independent of each other. However, the samples are often not truly spatially independent; sample locations closer to each other have more similar values (here environmental attribute values) than those farther apart (Dormann et al., 2007; Guisan and Zimmermann, 2000). Furthermore, combining samples from several years of field surveys and collections from natural history museums and herbaria, which is often the case for SDM, introduces temporal dependence into the data. Another characteristic, or limitation, is that once the model is fitted (trained), the parameters are stationary and not dynamic, i.e. constant throughout space (everywhere) as well as over time, whether the data stem from the past or the present.
Commonly used data in statistical species distribution models
Climatic variables are the core predictors for many SDM tasks, but they lack some important explanatory variables, mainly those related to dynamic processes. Although dynamic processes cannot be included in statistical, phenomenon-based models, indirect variables have been used to account for some of the processes and to improve predictions (Austin, 2007; Guisan and Thuiller, 2005). The view is that indirect variables relating to ecological processes or needs (e.g. resources) and causal relationships lead to an ecologically more interpretable model and can thus yield more realistic and more general output (Guisan et al., 2007; Guisan and Zimmermann, 2000). The following predictor variables (some of which may fall into more than one class, depending on how they are defined) are often used:
Elevation and derivatives – Elevation is one of the most frequently used predictors. It is also correlated (to a varying degree) with most climate variables (Austin, 2002a); for example, higher altitudes have lower temperatures. Slope and aspect are examples of variables derived from elevation that affect species richness and abundance (Åström et al., 2007). South-facing slopes are preferred by many floral species in the northern hemisphere, and north-facing slopes in the southern hemisphere. Slope may be important up to a certain grain size (Chapman et al., 2005), but in the equatorial region aspect may play only a minor role. Similarly, modelling for areas that span both sides of the equator (e.g. Africa) may not benefit from aspect. Topographic position such as ridges, peaks and valleys (Guisan et al., 1999) can be an important predictor at finer grain sizes; at larger grain sizes such details are often lost.
Climate – Climate-related variables are used to relate to the physiological needs of species (Guisan et al., 2007). Bioclimatic index variables as conceptualised by H. A. Nix (Busby, 1986) are among the most used data in species distribution models. Although 19 variables are synthesised from temperature-related (11 variables) and precipitation-related (8) data (Hijmans et al., 2005; www.worldclim.org), not all of them are useful, and the selection of variables is generally determined by the species (Beaumont et al., 2005). Among the 19 variables, the annual mean temperature and the annual precipitation are very common (ibid.), whereas temperature range, temperature seasonality and precipitation seasonality are other data that are often included. However, it has to be noted that many of the 19 bioclimatic variables are highly correlated with each other (Leaché et al., 2009).
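A simple way to screen candidate predictors for such correlation is to compute pairwise Pearson coefficients and flag pairs above a chosen limit. In the sketch below the variable labels, the sample values and the limit of 0.7 are all made up for illustration; they are not taken from WorldClim or from any of the cited studies.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)

def correlated_pairs(variables, limit=0.7):
    """Names of variable pairs whose absolute correlation exceeds the limit."""
    return [(n1, n2) for n1, n2 in combinations(variables, 2)
            if abs(pearson(variables[n1], variables[n2])) > limit]

# Hypothetical values at five sample locations.
predictors = {
    "annual_mean_temp": [18.0, 20.0, 22.0, 24.0, 26.0],
    "max_temp_warmest_month": [27.0, 29.0, 31.0, 33.0, 35.5],
    "annual_precip": [1400.0, 600.0, 1200.0, 800.0, 1000.0],
}

flagged = correlated_pairs(predictors)
print(flagged)  # the two temperature variables form the only flagged pair
```

One variable of each flagged pair would then typically be dropped, or the pair replaced by a derived variable, before model fitting.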
Other indirect variables (surface strata) – Several variables characterising the land surface are used to incorporate some biotic factors indirectly. Land cover/use has often been used as a surrogate for habitat type when modelling faunal species (Franklin, 1995). Although the combined effect of temperature and precipitation is correlated with phenology, remotely sensed data such as NDVI (an indirect index of above-ground green biomass) have also proved useful as a surrogate for resource availability (Chapman et al., 2005). Further, the use of NDVI, temperature and precipitation indirectly includes the time lag between precipitation and plant growth (Dall’Olmo and Karnieli, 2002). Factors such as soil moisture content, organic content and pH also play an important role for plant species (Franklin, 1995), but detailed geospatial data on them are often not available. Soil type and geology can be included as proxies for nutrients when modelling floral species (Engler et al., 2004). Water is one of the essential elements for any living organism. For species such as Odonata, which depend on aquatic and terrestrial ecosystems, inland water bodies can be an important predictor variable (Kalkman et al., 2008). Hence, hydrographical features should be useful predictors for such mobile species.
Incorporating statistics and GIS for species distribution modelling
Chapter 2.3 showed that various statistical modelling methods are in use for predicting the spatial distribution of species. Despite constraints such as the general assumptions (chapter 2.5) and methodological challenges still to be overcome (chapter 2.4), the examples of use (chapter 2.2) have revealed the success of these methods in various application domains. However, spatial data handling is a separate application domain, which ideally needs to be integrated with the statistical modelling framework for a smooth workflow of the overall SDM process (see chapter 4.1). Several bio- (and) physical environmental variables of ecological significance, in geospatial data formats, have been in use (chapter 2.6). Some of them need pre-processing before they can be used for modelling, while others may need pre-processing before an ecological meaning can be assigned to them. Eco-variables are often pre-processed to form proxy variables, i.e. derivatives of the original variable, which in statistical SDM sometimes represent a causal relationship providing partial mechanistic effects, or reflect an indirect relationship with the species presence and/or absence data (Guisan and Zimmermann, 2000). These pre-processing steps may require chaining several spatial functions/operations before the final variable is derived (e.g. see chapter 5.2.1). Geo-referencing museum and herbarium records and harmonising all geo-datasets to a common spatial frame and grain also belong to the pre-processing tasks. GIS is ideally suited for such purposes. Although GIS offers basic statistical functions, the available functions are not sufficient for modelling species distributions. Therefore, one option for integrating thorough statistics and geospatial processing is to pre-process the data in GIS, export the data in a format readable or importable by the statistical software, run the modelling, export the results back to GIS, and present the results in GIS.
Another option is to couple the GIS and the statistical software through a common interface for both. In total, three ways of integrating GIS and statistical software are possible: a) loose coupling, b) tight coupling, and c) integrated coupling (a single software providing GIS and statistical functions).
Loose-coupling – In this type of integration, each software works independently; one software does not know about the data structures and processes of the other. An intermediary interface (usually the transfer of data files) acts as a bridge (connector) between the two software for communicating input and output (Figure 2-3). Thus, input and output are often exchanged between the software in an exchange file format through a Graphical User Interface (GUI) and/or Command Line Interface (CLI), provided that a common exchange format exists (Jankowski, 1995). In the absence of a common exchange format, a converter can be plugged in to facilitate the exchange of files. Implementing such an interface is comparatively easy and provides full flexibility. However, care has to be taken when designing the intermediary interface so that it does not become too complex or difficult to use. From a usability aspect (chapter 3.2.3), GUI coupling is better suited: a well-designed, user-friendly GUI acting as a bridge hides the internal coupling mechanism, and such GUIs are often used in modelling applications (Brandmeyer and Karimi, 2000). GRASP (Lehmann et al., 2002) is an interface providing the GAM (statistics) of S-plus or R in ArcView GIS (version 3.x). Because of the exchange file format, upgrading the statistical software will most likely not affect the functioning of the processing.
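The file-based bridge of a loosely coupled system can be sketched as three independent steps that only share an exchange file. Everything in the sketch is hypothetical: the CSV layout, the function names standing in for the GIS and the statistical software, and the fixed logistic coefficients used in place of a fitted model.

```python
import csv
import math
import os
import tempfile

def gis_export(path, cells):
    """'GIS side': write cell identifiers and a covariate to an exchange file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["cell_id", "elevation"])
        writer.writerows(cells)

def stats_predict(in_path, out_path, b0=-3.0, b1=0.002):
    """'Statistics side': read the exchange file and write predicted probabilities.
    The coefficients are placeholders, not the result of a real model fit."""
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        writer.writerow(["cell_id", "probability"])
        for row in csv.DictReader(fin):
            eta = b0 + b1 * float(row["elevation"])
            writer.writerow([row["cell_id"], 1.0 / (1.0 + math.exp(-eta))])

def gis_import(path):
    """'GIS side': read the predictions back for mapping."""
    with open(path, newline="") as f:
        return {row["cell_id"]: float(row["probability"]) for row in csv.DictReader(f)}

with tempfile.TemporaryDirectory() as workdir:
    covariate_file = os.path.join(workdir, "covariates.csv")
    prediction_file = os.path.join(workdir, "predictions.csv")
    gis_export(covariate_file, [("c1", 500.0), ("c2", 1500.0), ("c3", 2500.0)])
    stats_predict(covariate_file, prediction_file)
    predictions = gis_import(prediction_file)

print(predictions)
```

Neither side needs to know the internal data structures of the other; replacing `stats_predict` by a different implementation leaves the GIS side untouched as long as the file layout stays the same.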
Figure 2-3: Loose GUI coupling of GIS and statistical SDM involving data converter and bridged via GUI (adapted from Jankowski (1995), Karimi and Houston (1996), and Brandmeyer and Karimi (2000))
Tight-coupling – In this type of integration, too, the systems work independently of each other. But, unlike in loose coupling, one software has information about the data structures of the other. Application Programming Interfaces (APIs) are at the centre of such systems, which are linked together via a GUI (Figure 2-4), and many software products publish APIs to facilitate the tight coupling of applications. The data can thus be shared in memory, often without involving exchange formats. A tightly coupled system can therefore benefit from the analytical capabilities of the coupled components (Karimi and Houston, 1996). Such integration is more difficult to implement and less flexible than loosely coupled software. Further, upgrading one software may affect the functioning of the system if the APIs change in the upgraded version. GeoEco (Roberts et al., 2010) is such an example, where the statistical modelling (GLM and GAM) capabilities are coupled within the ArcGIS Desktop environment.
Figure 2-4: Tight coupling of GIS and Statistical SDM with the APIs as core component in the centre of the different systems, database and the GUI (adapted from Jankowski (1995))
Integrated coupling – This is a full integration of all the systems or components, which share a common GUI and data access mechanism (Figure 2-5). Since the underlying data as well as the memory are shared across processes, users of the system will not notice a difference in handling the different types of functionality offered by the system. From the users’ perspective, the system provides one common user interface and thus a harmonised user experience (see chapter 3.4). For the developers, the full integration involves a complex implementation (Brandmeyer and Karimi, 2000).
Figure 2-5: Integrated coupling of GIS and SDM with systems and database interacting as a single unit (adapted from Brandmeyer and Karimi (2000))
Summary and main considerations
SDM fall into three classes: analytical models based on theory and mathematics, mechanistic models based on processes and behaviour, and empirical models based on stochastic events or phenomena. The analytical and mechanistic models and modelling methods are generally specific to a certain species. Because they build on specific processes and behaviours, their output can be generalised across different landscapes and their parameters can be transferred. The empirical modelling methods, which rely only on observations of stochastic events of presence or absence, are not specific to a certain species, but transferring model parameters across different landscapes is often not valid, since a phenomenon-based result cannot be generalised; hence the model output loses generality. However, empirical models offer ‘realistic’ output based on stochastic events and present ‘precise’ information based on observations with respect to environmental variables. With the aid of this ‘precision’ and ‘reality’, they are used to test and explain various ecological hypotheses and processes, and find useful applications such as unearthing hot-spots, designing strategic surveys, facilitating biodiversity assessments and determining shifts in habitat range due to environmental change. The increasing use of statistical SDM in various ecological domains has inspired the development of several methods adapted to different requirements, such as experiment or hypothesis setting, simplicity for explaining and/or relating to other phenomena, or complexity for capturing as much reality as possible in current and future environmental scenarios. The methods are based on observations of either presence or presence-absence locations, some using the notion of pseudo-absences where true absences are not available. Several variants of generating pseudo-absences have been conceived and used. The concept of background information instead of pseudo-absences opened a new dimension in species distribution modelling, offering a conceptually better model formulation than the use of uncertain pseudo-absence data. Presence-background models are, therefore, receiving more attention. Statistics is at the core of empirical SDM methods, but equally important is the role of GIS in shaping the data for modelling and in further processing for inference in various management applications.
Although having GIS and statistical modelling functionalities in one tool may be desirable, other possibilities such as coupling in a loose or tight framework can nevertheless serve the purpose. Loosely coupled systems offer flexibility in the choice of GIS and statistical methods. One benefit is that a component of a loosely coupled system can be changed with little trouble. In a tightly coupled system, this is less feasible without changing the programming code, because the APIs tie the different components together.
3. Quality-in-use for development of species distribution modelling tool
‘Quality’ is defined as “the standard of something as measured against other things of a similar kind; the degree of excellence of something” (Oxford Dictionaries, 2010). Quality-in-use is the measurement of the effectiveness, efficiency and satisfaction achieved by a user while performing a certain task in a specified environment (Bevan, 1999). Including processes that ensure a better user experience through quality checks during the development of a product can determine whether the product (here a model or modelling tool) is accepted or rejected by users (Raza, 2011). Quality is measured over several parameters within the ‘context of use’ (Bevan, 2001b). Employing user-centred design during the design stage (ibid.) and considering experience as one of the components can improve or enhance usability and user experience (Petrie and Bevan, 2009). For species distribution modelling (SDM), the quality-in-use measures the attributes in the context of efficiently modelling the species distribution with all the related tasks.
Context of use
Species distribution models have been used in several contexts (see chapter 2.2). Different purposes generally have different requirements. Thus, the quality assessment differs with the use case. Further, the species distribution modelling task involves, besides applying ecological theory, the use of GIS and statistics (see chapter 2.7). Here, the SDM task can be divided into two contexts: a) a scientific modelling context, where species prediction is the aim; and b) an applied model-output use context, where the output is used in different application domains. The necessary expertise is additionally related to the basic functionalities required for the task (see chapter 3.2.1). Although the outcome of a modelling tool will be the same for a given data set and method, distinct contexts and requirements, model evaluation methods and consequences (e.g. Table 3-1) will result in different output quality (Guisan and Zimmermann, 2000). The outcome of modelling a species distribution can contain over- or under-predictions, and their effect varies with the context for which the modelling is done. For example, in a scientific context either over- or under-prediction can lead to a wrong hypothesis being assumed. Likewise, a biodiversity inventory process may not be efficient with either too much or too little resource allocation. For a conservation planner trying to determine the possible extent of invasion by a non-native species, under-prediction might have severe after-effects on the ecosystem and on the conservation of native species because a proper assessment has failed, thus leading to a degraded quality-in-use metric for the modelling tool. With over-prediction, the conservation status might benefit (positive for species conservation), but at the cost of extra resources (negative for e.g. economic input).
Here, within the same context, over-prediction may result in larger areas being allocated for conservation, but the result may not be efficient in terms of the resources (e.g. economic input) used. Students can be made aware of the different consequences by using different modelling tools and techniques, which enables them to learn the strengths and weaknesses of the tools and the potential pitfalls.
Hence, confined in the boundary of a specified context of use, the overall ‘quality-in-use’ is assessed using various quality measures.
Table 3-1: Some examples of SDM users, contexts of modelling and the typical consequences of over- and under-prediction
User | Context (task)* | Prediction | Typical consequences*
scientist / academics | present and historical biogeography | over / under | wrong hypothesis on biogeographic distributions and ranges
scientist / academics | evolutionary (phylogenetic) and distribution | over / under | wrong hypothesis on speciation or evolution origin
biodiversity inventory official | survey design | over | (-) increased economic input; (+) detection
biodiversity inventory official | survey design | under | (+) lower economic resource; (-) likely detection missed
biodiversity inventory official | monitoring / evaluation / prioritisation | over | (-) higher resources needed
biodiversity inventory official | monitoring / evaluation / prioritisation | under | (-) probable conservation areas not being allocated
conservation planner / manager | species extinction / red-list conservation status | over | (-) higher resources; (+) more conservation areas allocation; (-) false ‘downgrade’ of threatened red-list status**
conservation planner / manager | species extinction / red-list conservation status | under | (-) less conservation areas allocation; (-) false ‘upgrade’ of threatened red-list status**
conservation planner / manager | species invasion | over | (+) increased effort on native species conservation; (-) higher resource allocation
conservation planner / manager | species invasion | under | (-) low effort on native species conservation; (-) resource for not proper assessment
conservation planner / manager | future climate change | over / under | differing management plans and policies
student | learning different contexts and applications of SDM | over or under | understanding strengths and weakness of various modelling methods, learn about potential pitfalls of over and under prediction, develop or simulate ideas for improving model output
* non-exclusive list
** several other criteria are also needed for a full assessment
(+) positive effect; (-) negative effect
Quality measures
Quality of software is measured based on functionality, reliability, usability, efficiency, maintainability, and portability (Bevan, 1999). Among the six measures, end-users are concerned
with the first four metrics, whereas the last two metrics are of interest to the support-user (Bevan, 2001a).
3.2.1. Functionality
Task analysis is central to deciding on the functionalities to be offered by a tool. While insufficient functionality can frustrate users, so that the tool does not become their choice, too much functionality can lead to under-utilisation because extra functions can have negative effects on usability measures (see 3.2.3) such as learnability, memorability or error rate (Shneiderman and Plaisant, 2005). The SDM task needs a series of steps, i.e. several functionalities (see Table 3-2 for functionalities in some SDM tools), with statistical modelling obviously being the core, together with some form of evaluation of the model output. Before a statistical model is created, the data needed for the (environmental) modelling have to be pre-processed, e.g. to ensure that all the GIS data layers are properly harmonised (Ghisla et al., 2012; Graef et al., 2005). Some SDM tools without a GIS interface or GIS functions (e.g. GARP, GRASP) ship customised scripts for a particular GIS software for this basic data preparation.
Table 3-2: Basic functionality matrix overview of selected SDM tools
Functionalities | BIOCLIM (Diva-GIS) a | ENFA (Biomapper) b | GLM/GAM/BRT (R/S-plus) c | MaxEnt d | GARP (DesktopGarp) e
Basic GIS pre-processing 1 | yes | yes | additional packages | no | ArcView 3.x scripts included
Statistical modelling | yes | yes | yes | yes | yes
Statistical evaluation 2 | yes | yes | yes | yes | yes
GIS post-processing 1 | yes | yes | additional packages | no | no
Simple visualisation | yes | yes | additional packages | yes | yes
Interactive visualisation | yes | yes | – | no | no
a Hijmans et al. (2012); b Hirzel et al. (2002a); c Hijmans and Elith (2011); d Phillips et al. (2006); e Scachetti-Pereira (2002)
1 some (basic) GIS functions; 2 some sort of evaluation of predicted result
Probability distribution maps show gradients and thus provide useful information, but they can also be misleading; reclassifying them into a few classes may actually provide a better summary, the minimum number of classes being two (binary presence-absence) based on one of several threshold criteria (Hirzel et al., 2006). This conversion is the basic GIS post-processing step that facilitates the assessment of prediction accuracy. Interactive visualisation of the distribution as a map is often sought, but even a quick overview can offer a preliminary visual evaluation of the output.
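The reclassification of a probability map into a binary presence-absence map can be illustrated with a minimal Python sketch. The grid values and the threshold of 0.5 are invented for illustration only; the function name `to_presence_absence` is hypothetical and does not correspond to any of the tools discussed here, and no particular threshold criterion is implied.

```python
def to_presence_absence(probabilities, threshold):
    """Reclassify a probability grid into presence (1) and absence (0)."""
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]


# Hypothetical 3 x 3 probability surface (values invented for illustration).
probability_map = [
    [0.05, 0.20, 0.65],
    [0.10, 0.55, 0.90],
    [0.02, 0.40, 0.75],
]

# The threshold of 0.5 is an arbitrary assumption, not a recommended criterion.
binary_map = to_presence_absence(probability_map, threshold=0.5)

for row in binary_map:
    print(row)
```

In practice the threshold would be derived from one of the published criteria rather than fixed in advance, and cells without data would need separate handling.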
3.2.2. Reliability
Reliability is a quality attribute related to performance measuring the likelihood of potential failure, how failures are handled and what measures are placed to recover from any failure (Bevan,
1999). For SDM methods, reliability includes the ability to calculate parameters fitting the covariates. Parameter fitting is an iterative process which depends on the convergence threshold (Nocedal and Wright, 2006; Vanderbei, 2008). A smaller threshold value generally requires a higher number of iterations. Although the implemented iterative methods are often guaranteed to converge (e.g. Dudík, 2007; Friedman et al., 2010), they may fail at times due to e.g. singularity of predictor variables (Press et al., 2007). However, convergence does not necessarily mean that the model is stable. SDM methods offer an option to specify the maximum number of iterations, which in the case of non-convergence ensures that the program does not run forever, although this measure is only a workaround and not a perfect solution. Further, regression with exponential forms (e.g. Poisson regression, maximum entropy) may often predict results outside the desired bounds (Phillips et al., 2006; Ward, 2007). The out-of-bounds results can be corrected by transforming the results, e.g. via the logistic transformation as implemented in MaxEnt (Phillips and Dudík, 2008), thus providing a workaround for the undesired result. Another reliability factor in regression modelling is the presence of correlated variables and noise in predictor datasets; noise tends to influence the result, sometimes also contributing to instability of the model, e.g. high standard deviations. There are methods for controlling over-fitting, removing noise in predictor variables and handling correlated variables, and some of the tools offer options for applying them or have incorporated them as standard procedure (see Table 3-3). L1-regularisation is part of MaxEnt for controlling over-fitting as well as for noise removal. Users can choose L1, L2 or elastic-net regularisation to control over-fitting when using regression based models in R; however, the use of elastic-net in SDM with continuous and categorical data has not been published. 
Generally, regularisation techniques are not often employed in SDM (Phillips and Dudík, 2008). No options are available for noise removal in BIOCLIM and ENFA; they are based on envelopes. MaxEnt does not handle correlated data well. The exact method for noise removal, over-fitting and handling correlated data in DesktopGarp is not known here. ENFA uses a factorisation technique for modelling and thus correlated data are, by nature, handled, while regression models in R can use L2-regularisation. L1 and L2 regularisations cannot be used at the same time; instead, elastic-net facilitates both. A further important consideration in modelling is the repeatability of results (Golafshani, 2003). If results cannot be reproduced consistently with the same model settings, the reliability of the model output decreases because the results cannot be cross-checked. The results can be reproduced in all tools except GARP. For GARP, the results may differ to some extent, which is related to the stochastic (non-deterministic) nature of the genetic algorithm (Stockwell and Peters, 1999). Furthermore, for large datasets subsampling is necessary, leading to different results across different model runs for the same dataset (Hastie et al., 2009). These differences can be reduced, but not eliminated, by the iterative simultaneous testing with several partitioned (subsets or sub-samples of) training and test datasets which is part of the modelling (Stockwell and Peters, 1999).
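The interplay between the convergence threshold and the maximum number of iterations discussed in this section can be sketched as follows. This is a minimal illustration in Python using plain gradient descent on a toy one-parameter loss; it is not the fitting routine of any of the SDM tools, and the names `fit_iteratively`, `toy_gradient`, `step`, `tol` and `max_iter` are assumptions made for the example.

```python
def fit_iteratively(gradient, start, step=0.1, tol=1e-6, max_iter=1000):
    """Minimise a one-parameter loss by gradient descent.

    Returns (estimate, iterations_used, converged).
    """
    estimate = start
    for iteration in range(1, max_iter + 1):
        new_estimate = estimate - step * gradient(estimate)
        # Converged when the parameter change drops below the threshold.
        if abs(new_estimate - estimate) < tol:
            return new_estimate, iteration, True
        estimate = new_estimate
    # Iteration cap reached: stop, but flag the non-convergence to the caller.
    return estimate, max_iter, False


def toy_gradient(b):
    # Gradient of the toy loss (b - 3)^2, an assumption made for illustration.
    return 2.0 * (b - 3.0)


loose = fit_iteratively(toy_gradient, start=0.0, tol=1e-2)
tight = fit_iteratively(toy_gradient, start=0.0, tol=1e-6)
capped = fit_iteratively(toy_gradient, start=0.0, tol=1e-6, max_iter=5)

print("loose threshold:", loose)
print("tight threshold:", tight)
print("capped run:", capped)
```

The tighter threshold needs more iterations than the looser one, and the capped run terminates without converging. Returning an explicit convergence flag, rather than silently returning the last estimate, lets the calling code warn the user that the result may not be trustworthy.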
3.2.3. Usability
Usability issues are the most sought-after quality attributes (Carroll, 2004). The 'easy-to-use' criterion directly influences other usability metrics and is the main focus of user (usability) testing. Usability is the subjective measure of the ease-of-use or user’s interaction while performing a specified task effectively to attain a specified level of achievement (Cooper et al., 2007; Galitz, 2007). Usability testing can reveal areas that need corrections in design to improve the user experience, efficiency and overall ‘quality in use’ of a product (Cooper et al., 2007). Usability also depends on other quality
measures, e.g. functionalities, but these are dealt with separately. Typical attributes to describe usability are a) learnability, b) operability/performance, c) memorability, d) error rate, and e) satisfaction (Galitz, 2007; Nielsen, 2010; Shneiderman and Plaisant, 2005). The aim is to achieve a high combined score across all these metrics, but trade-offs are common because of various factors (e.g. rate of errors vs. speed of performance, learnability vs. operability) and subjectivity (e.g. nature of task and needs, users’ expertise, cultural background) related to the context of use (Shneiderman and Plaisant, 2005).
Table 3-3: Methods to improve reliability regarding noise in predictor variables, over-fitting and correlated variables in selected SDM tools

| Tool | BIOCLIM (Diva-GIS) | ENFA (Biomapper) | GLM/GAM/BRT (in R) | MaxEnt | GARP (DesktopGarp) |
|---|---|---|---|---|---|
| Noise removal and over-fitting | n/a | n/a | L1-regularisation, elastic-net | L1-regularisation | yes a |
| Handling correlated variables | n/a | factorisation | L2-regularisation, elastic-net | n/a | yes a |
| Repeatable results | yes | yes | yes | yes | not necessarily b |
| References | Hijmans et al. (2012) | Hirzel et al. (2002b) | Hastie et al. (2009), Zou and Hastie (2005) | Phillips et al. (2006) | Stockwell and Peters (1999) |

a exact method not known; b difficult to achieve due to nature of the genetic algorithm
Learnability is a measure of ease-of-use when a user encounters the design for the first time while performing some tasks. The time required for getting used to the software and the amount of effort required to use it proficiently measure the learnability (Tullis and Albert, 2008). The familiarity of elements in the user interface and their expected behaviour may increase learnability, whereas a different behaviour of controls can confuse the user. Consistency in design and appearance of the dialog windows increases the users’ ability to handle the system. Further, a simpler interface enhances the user experience (Apple, 2009; Microsoft, 2010; Shneiderman and Plaisant, 2005).
Operability/performance discusses the user’s efficiency (time taken) in using the system while performing a specified task (Shneiderman and Plaisant, 2005). This has direct dependence on learnability; the easier to learn how to use, the higher is the efficiency in use. As with any system, experience in using the system has a positive effect on the operability; experienced users can operate more efficiently whereas a novice user will take time to attain a certain level of efficiency. Here, the user interface plays a vital role in interaction of the user and the system, thus effective interface design is crucial for efficient use ( ibid.).
Memorability refers to the ability of a user to attain the same level of efficiency after not having used the system for a long period of time. Good learnability helps in quickly recalling simple tasks. A frequent user of the software will have fewer problems remembering how to perform a certain task, but for a casual user a well-designed interface offering recognisable controls aids in remembering the sequences and commands within a short span of time (Nielsen, 2010).
Error rate, as measured in usability, is different from the errors measured in reliability (see chapter 3.2.2). In usability testing, it is the number of errors a user makes during execution of a specified task. Usability measures how often users make errors, whether and how users can recover from them, and whether the system informs users about the errors. Well formulated error messages can reduce future error rates, and a reduced number of errors can increase productivity or efficiency of use (Shneiderman and Plaisant, 2005). “If an error is possible, someone will make it. The designer must assume that all possible errors will occur and design so as to minimize the chance of the error in the first place” (Norman, 1990, p36). Thus, measuring errors (by users) during usability testing can help in understanding the design failures which induced incorrect action (Tullis and Albert, 2008).
Satisfaction refers to the level of acceptance of the achieved goal as well as the attractiveness of the user interface, measuring how various aspects of the interaction interface appeal to the user (Shneiderman and Plaisant, 2005; Tullis and Albert, 2008). Command-line interface and graphical user interface are two broad categories of interfaces providing interactions between users and computers. The choice of an appropriate interface depends on knowing the user and the task and is better facilitated through user-centred design (see chapter 3.3).
3.2.4. Efficiency
Generally, efficiency is measured as a ratio of input to output. For quality metrics, the input and output can differ depending on what is being measured. Petrie and Bevan (2009, p20-1) define efficiency as “the resources expended in relation to the accuracy and completeness with which users achieve goals”. Time is one of the factors often used for measuring efficiency, e.g. the amount of time spent on completing a task (Tullis and Albert, 2008). However, past experience, expertise in the task and the nature of the task can influence the completion time. Thus, measuring relative efficiency can provide a better metric, where the average time taken by users is compared directly with the average time taken by experts within the same context and environment in achieving the stipulated goal (Bevan, 2006). Including the designer in the pool of experts for measuring the time can highlight the potential gap between the designer’s ideas and concepts and the users’ perception. Another way of expressing efficiency is to evaluate the amount of effort a user requires for completing a task. The effort can be physical or cognitive in nature. The physical effort can be the number of steps to be performed via mouse clicks and key input. The cognitive effort is to find out where to click (Tullis and Albert, 2008). Efficiency can also be expressed in terms of effectiveness, where effectiveness is described as a function of the quality and quantity of the completed task. A user’s efficiency is then the ratio of effectiveness to the task-time (Bevan, 1995; Bevan and Macleod, 1994). Resources can also be computing resources such as the CPU time or memory required to run the task. For processing large amounts of data, there is often a trade-off between the memory required and the CPU time, depending on which optimisation strategy has been employed.
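The relations between effectiveness, user efficiency and relative efficiency described above can be made concrete with a small Python sketch. It assumes that effectiveness is computed as the product of the quantity and quality of the completed task (both in percent), which is one common reading of the cited definitions; the function names and all numbers are hypothetical.

```python
def effectiveness(quantity_pct, quality_pct):
    """Task effectiveness in percent, assumed here as quantity x quality / 100."""
    return quantity_pct * quality_pct / 100.0


def user_efficiency(quantity_pct, quality_pct, task_time):
    """Effectiveness achieved per unit of task time."""
    return effectiveness(quantity_pct, quality_pct) / task_time


def relative_efficiency(user_eff, expert_eff):
    """User efficiency as a percentage of expert efficiency."""
    return 100.0 * user_eff / expert_eff


# Hypothetical measurements: (quantity %, quality %, task time in minutes).
novice = user_efficiency(80.0, 90.0, 12.0)
expert = user_efficiency(100.0, 100.0, 5.0)

print("novice efficiency:", novice)
print("expert efficiency:", expert)
print("relative efficiency (%):", relative_efficiency(novice, expert))
```

Comparing against an expert measured in the same context removes part of the influence of task difficulty, which is the motivation for the relative measure.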
3.2.5. Maintainability
One of the two quality criteria that are directly related to a support-user is maintainability. Maintainability of software is the ability to modify the software for correction, improvement, adaptation to a different environment (e.g. operating system) or to change the functional requirements
and specifications (Bevan, 1999). In the ‘quality model’ defined in ISO/IEC FDIS 9126-1, maintainability is characterised by analysability, changeability, stability and testability (Bevan, 2001a, p540 figure 2). Two distinct groups can be formed based on their closeness: a) analysability and changeability, and b) stability and testability.
3.2.6. Portability
Software portability is the ability to easily move and use a program between different operating systems/platforms (and architectures) with reasonable cost and effort. For most end-users, portability means minimal changes to a program when moving to a different system, no or little (re-)training on handling of the program and the ability to work either with local or remote systems (Garen, 2007). Further, portability also refers to the mechanism which allows converting, sharing and using data in multiple software and/or hardware environments (Shneiderman and Plaisant, 2005). ISO/IEC FDIS 9126-1 characterises portability by adaptability, installability, co-existence and replaceability (Bevan, 2001a, p540 figure 2).
3.3. User-centred design
Although the task is important, how users are enabled to perform it is equally important. Thus, focusing on the user in the process of design offers a better or more pleasant experience to the end user (Lewis and Reiman, 1994). Further, the user-centred design process can ensure a stable system and reduces the risk of failure. The user-centred design process is an iterative process consisting of context definition, requirement analysis, design and test (Lewis and Reiman, 1994; Maguire et al., 1998). Maguire et al. (1998, p20) suggest twelve basic questions to summarise a project (i.e. the overall context) from the users’ viewpoint; some may not be applicable in every case, and some may have a common answer (see Box 3-1). The questions in Box 3-1 help not only to decide how a system should work but also give an idea of the next step: the analysis of requirements, which includes user characteristics, working environment, and user goals and tasks (see Box 3-2, adjusted to SDM tools). Based on the targeted users’ profile and the requirement analysis, other necessary strategies can be planned, such as the sequence of execution (see Table 3-2) or additional functions and features (e.g. see Table 3-3 for features) to complement the main functions (Shneiderman and Plaisant, 2005). The technical working environment is at the core of the user-centred design process when creating the specification. Hardware such as memory, processor capacity, architecture, input and output devices, the software platform (and additional dependent components) and the networking infrastructure are key components which describe the environment for interface design (Thomas and Bevan, 1996). If an interface is designed considering the users’ profile/characteristics, it is more likely that the user will learn faster to use the tool, use it efficiently and make fewer errors. Thereby a better user experience is offered and higher confidence in its use is stimulated. 
User interface plays an important role, not only for the user in mind but also related to the nature of the task. Shneiderman and Plaisant (2005) discuss the advantages and disadvantages of five common software user interface styles: direct manipulation, menu selection, form filling, command line and natural language. The first three styles are some form of graphical user interface (GUI) and the latter two styles are command line interface (CLI). GUI offers interactive manipulation of input and output parameters, provides easy learning and explorations with high subjective satisfaction,
and allows easy memorability. GUI has been the state-of-the-art for interacting with the software or system. CLI can offer flexible use, perform complex command sequences in a batch, and give the user the feeling of having full control of what is happening. However, the CLI comes with a slow and difficult learning curve and poor memorability, and with increased complexity it introduces errors (e.g. Roy, 1992). Although natural language commands reduce the burden of learning command syntax, different cultures and languages are hindrances to effective implementation.
Box 3-1: Basic questions to summarise a project in determining the context of use of a system (based on Maguire et al ., 1998)
• What is the system or service?
• What functions or services is the system intended to provide?
• What are the aims of the project (product)?
• Who is the system intended for (i.e. target market)?
• Who will use the system?
• Why is the system needed?
• Where will the system be used?
• How will the system be used?
• How will the user obtain the system?
• How will the user learn to use the system?
• How will the system be installed?
• How will the system be maintained?
Use of software begins with its acquisition and installation. The SDM tools listed in Table 3-4 can be obtained easily and freely from the Internet. Although installation is easy for all the tools, it may not be straightforward for R, where dependencies may have to be installed separately, or for DesktopGarp, where the installer does not include all required files. In both cases, however, the availability of an Internet connection will facilitate downloading of the required files. R informs clearly about the missing dependencies and will download them if an Internet connection is available, whereas DesktopGarp gives a cryptic message which is not understandable. The next issue a user faces is the interface for interacting with the software and the availability of support for its use. Biomapper and DesktopGarp are GUI based, whereas Diva-GIS and MaxEnt can be used either via GUI or via CLI. R is available as a command line interface. A user-support system is crucial for any interactive product. Documentation such as a user’s manual, quick-reference and other self-help materials are parts of user-centred design (Thomas and Bevan, 1996). The availability of documents describing the logic (how the software computes results) offers insights into the computation, allowing a better judgement of the appropriateness or relevance of the software for a particular task. Equally, a brief guide explaining ‘how to’ can facilitate an easy learning experience for starting immediately. The process of modelling species distribution requires several steps in sequence, e.g. data preparation (as pre-processing), statistical modelling as the main task and post-processing, the latter depending on the field of use. Depending upon the users’ experience and the data requirements of the SDM software, the pre-processing steps can vary. So, clear guidance on these inter-related steps can be an important asset (e.g. Biomapper 3 user’s manual by Hirzel, 2004). Another important aspect is the operating system on which the software runs. 
In comparing some of
the SDM tools, Biomapper is clearly a well-designed user-centred tool offering easy installation, a GUI for interaction and the necessary documentation (see Table 3-4). Although all the tools in the table provide some sort of documentation, the fundamental theory or concept behind the modelling method is not included for DIVA-GIS, R (regression based models), MaxEnt and DesktopGarp.
Box 3-2: Typical questions for development of a species distribution modelling tool regarding user profile (based on Johnson, 2010; Lewis and Reiman, 1994; Maguire et al ., 1998)
• What is the intended users’ qualification (expertise/field)?
• What special skills do users possess (e.g. regarding GIS, statistics)?
• Do users have experience with similar tools (e.g. one or many other SDM tools)?
• How much IT experience do the users have with e.g. command-line and graphical user interfaces, operating systems?
• What do users know about the details of tasks (e.g. basic and advanced pre- and post-processing in GIS)?
• Do users have previous training (on performing a similar task or using a similar system)?
• What is the frequency of use?
• What are the common terminologies used by the users (in GIS, statistics)?
• What are the factors influencing the users’ motivation or discretion to use a certain tool?
3.4. User experience
Bad usability can be a factor in the failure of otherwise good software, but good usability may not guarantee a pleasant user experience (Kuniavsky, 2010). It is the quality of the experience that leads users to acceptance or rejection (Buxton, 2007). User experience (UX) is directly related to usability but takes a different perspective. UX is an emotional consequence of good or bad usability design. It is often subjective and very personal, differing from one user to another (Bevan, 2009; Hasenzahl, 2003). UX is developing as a core concept in the perception of usability (Carroll, 2004; McCarthy and Wright, 2004). “‘Experience’ is an elusive concept that resists specification and finalisation” (Wright et al., 2003, p44). Experience is constructed through the repetitive use of the product by creating mental models. It cannot be disassembled into discrete key elements. Hence, designing an experience is difficult, and may not even be possible, “but with a sensitive and skilled way of understanding our users, we can design for experience” (p52). For software developers and designers, system software providers like Microsoft, Apple and others provide style guidelines (e.g. Apple, 2009; Microsoft, 2010), for instance on designing GUI elements and interactions based on their APIs. The user-interface design rules and guidelines are mostly based on research on human-computer interaction involving psychology and the cognitive system (Johnson, 2010). One of the objectives of these guidelines is to provide a ‘look and feel’ which is consistent within the system environment and offers a pleasant experience; ‘look’ is for the visual perception of design, while the emotional or experiential part is accounted for by ‘feel’ (Nielsen, 2003). However, and interestingly, Microsoft and Apple violate their own published guidelines (Cooper, 2004).
Table 3-4: User-centred features of some selected SDM tools

| Feature | BIOCLIM (Diva-GIS) | ENFA (Biomapper) | GLM/GAM/BRT (R/S-plus) | MaxEnt | GARP (DesktopGarp) |
|---|---|---|---|---|---|
| Installation | easy | easy | easy a | easy | not as easy b |
| Software interface | GUI/CLI | GUI | CLI | GUI/CLI | GUI |
| Documentation: fundamental theory | yes c | yes | yes c,d | yes c | yes c |
| Documentation: how to (manual) | yes | yes | yes d | yes | yes |
| Documentation: help integration in UI | – | yes | yes d | yes e | yes f |
| Operating system | Windows | Windows | Windows and UNIX based systems | Windows and UNIX based systems* | Windows up to XP |

a full package or separate individual download (careful about dependencies if not connected to Internet)
b missing installer component (error message not understandable)
c literature available in Internet (single source may not be sufficient to understand)
d individual package specific (package dependent, i.e. if the developer provided)
e not easily understandable (a bit technical)
f single file, same as ‘how to’ (manual), although installed, tricky to get it displayed or work
* requires Java virtual machine (Java interpreter)
3.4.1. Cognition
Experience is often related to cognition, i.e. what and how users perceive. Although technologies change rapidly in the current age, the fundamentals of people’s perception and thinking do not change with the same speed. Hence, knowledge of cognitive response and behaviour can help in better designing the interaction (Johnson, 2010). People learn fast when the operation is task-focused, simple and consistent. Users have to translate the task mentally into the operations offered by a tool. This cognitive process leads the user to focus on the requirements of the tool instead and then to re-focus on the task. A simpler conceptual model will provide an easy translation of tasks to functions or operations and vice versa (ibid.). The smaller the difference between the actual task and the function, the faster the user can re-focus on the task, and hence learnability increases. The difference between task and function can be reduced if the vocabulary or terminologies used for functions are mapped closely and consistently to the task (ibid.). As such, metaphors play an important role in the human (users’) cognition system. Tapping the key concepts and presenting them, when relevant, as suitable ‘cognitive’ 6 metaphors (Blackwell, 2006) offers a better experience (Kuniavsky, 2010).
6 generally metaphors are mostly considered as linguistic-metaphors; here the word is used to refer to the mental processes (as explained by Blackwell 2006, p 494).
3.4.2. Metaphors
Since the invention of the ‘Desktop metaphor’ (Kim, 2004) and the predominant use of GUIs in most interactive software, icons have been one of the most used elements to express metaphors (Cooper et al., 2007). Well-designed icons offer effective metaphors for the functions on menus and buttons in toolbars. Toolbar buttons may contain a short text explaining the button’s function, or the text may be displayed as a tool-tip or as in menu items; however, pictorial representations are grasped more quickly by the users. Although icons have served well in describing or giving hints on the function, they, as cognitive metaphors, are also contextual (Passini et al., 2008; Salman et al., 2012), similar to linguistic metaphors (e.g. the refresh button in Web browsers looks similar to the undo/redo button in Office software). Such similar-looking icons have different functionalities, and users perceive them differently depending on the context. Human perception is biased by the past (via the user's experience), the present (the current context in which it is being used) and the immediate future (the goal which users want to achieve) (Johnson, 2010). As the user would use only one software at a given time, there will be only one valid context.
3.4.3. Emotions
The first few uses of a software create a cognitive model about the software which is a basic exploratory phase. After that, the process transforms to an experiential or behavioural phase where emotions are created (Ma et al ., 2009). Emotions can be positive (e.g. fun, excitement) or negative (e.g. frustration, disappointment) associated with and resulting from needs within the context of use (McCarthy and Wright, 2004). Positive emotions are the consequences of e.g. fulfilment of goals or satisfaction in a (challenging) situation, whereas difficulty in achieving the goal induces negative emotions (Hasenzahl, 2003).
Fun – Incorporating fun, as one of the emotional aspects complementing usability, can act as an attraction for making people use a system (Carroll, 2004). Shneiderman (2004) discusses three goals when designing software: a) offer the right functions to achieve goals, b) provide better usability and reliability to avert frustration, and c) engage users with pleasant features. The first one is related to functionality (see 3.2.1). For the second goal, Shneiderman and Plaisant (2005) suggest eight ‘golden-rules’ which are similar to the ones covered in sections 3.2.2 and 3.2.3. The third goal is related to GUI design, where metaphors (e.g. through icons and animations) play a vital role and can present delights and surprises. Surprises, when occurring in a positive way, can be fun. Distractions can also surprise, but only in the short term; they become annoying after a couple of instances (Carroll, 2004). “Things are fun when they present challenges or puzzles [… ,] when they transparently suggest what can be done, provide guidance in the doing, and then instantaneous adequate feedback and task closure” (ibid., p39).
Frustration – There is hardly anyone who has not, at times, experienced frustration while using computers (Cooper, 2004) or interactive systems in general. The frustration can be due to several reasons such as software crashes, insufficient information, vague error messages, non-responsiveness, etc. These are mainly due to bad or ill-conceived design (Preece et al., 2002). Scheirer et al. (2002) performed a ‘user-testing’ experiment to find out which events are likely to stimulate frustration. Their purpose was to measure physiological and behavioural data so “a system could 'get to know' an individual's patterns of frustration (and other emotion-related responses)”
(p115). Although the experiment was designed to deliberately frustrate users, the authors also showed that such experiments can offer designers the opportunity to gather information on which of the system’s functions are likely sources of user frustration. An interesting observation in that experiment was that the level of frustration was lower for those participants who suspected that the stated intent of the experiment was not true. This suggests that if users are aware of the situation or the system’s response, the degree of frustration felt can be reduced.
Quality-in-use of software offers valuable insights into why some software is preferred by users although better alternatives, in terms of getting better results, may be available. Quality measures are core in the ‘context of use’. The measurements are designed to cover several characteristics describing the factors which affect the software’s acceptability to users. Usability, with most focus lying on ‘ease-of-use’, is among the most discussed qualities, while measures such as functionality and reliability are important in determining whether goals can be achieved. Most of the SDM tools lack complete functionality, in the sense that they focus only on the statistical modelling part but lack the basic GIS functionalities needed. Software having a high usability score but lacking the necessary functionality or being less reliable cannot be self-fulfilling and can stimulate a negative user experience. Usability and user experience, although different, are close to each other and have complementary effects. Although usability testing may not be possible for every newly developed tool, following established guidelines as well as applying previously gained experience can help in attaining certain usability levels. In the context of SDM tools, more focus is required on functionality, reliability and usability. Since these scores are dependent on the context-of-use, ‘user-centred design’ is the pivotal process for offering good quality. The user-centred design process not only helps in profiling the targeted user group but also makes sure that the technical requirements as well as the quality-in-use criteria are met at the time of design. While most of the characteristics are covered by the quality aspect, especially usability, the emotional part of using software also plays a role in its being accepted or used widely. User experience, being subjective, cannot be designed. Nonetheless, efforts can be made to design for experience.
4. Developing a robust and easy to use species distribution modelling tool
Species distribution modelling (SDM) has been used for several ecological applications (e.g. see chapter 2.2). However, a universal model that is suitable for modelling each and every type of species is not available, as shown by e.g. Brotons et al. (2004) and Elith et al. (2006): different models performed differently for data of different species. With the intent of modelling the Odonata species of Africa, a new modelling tool is being conceived and developed, focusing on usability for organisations like IUCN which can, in parts, include SDM in their workflow of assessing the threat status. This chapter investigates the requirements of modelling tasks, conceives a modelling workflow and presents a modelling tool. While there are models using presence-absence and presence-background sample data (see chapter 2.4 for different sample data types) for discriminatory and deterministic statistical models, to date not a single model offers both the presence-absence and the presence-background formulation. The tool is called SPEciEs DIstribution modelling (SpeeDi) Tool.
4.1. Work flow concept for geodata processing and statistical modelling for SDM in SpeeDi Tool
One of the important elements to be considered at the time of concept development is the type of input and output data. An SDM task mainly involves raster data, one of the two widely used basic forms of geospatial data (vector and raster) (Chapman et al., 2005). However, not all data may be available in raster format, in which case conversion of vector data into raster format is required. The input data are mainly: a) species location coordinates, and b) environmental predictors in geospatial format. The primary output of the task is the prediction of species presence or absence within the modelling extent. With inputs (various environmental geodatasets) and outputs (a basic probability distribution raster and a presence-absence raster) being defined, a workflow can be established (Figure 4-1). The process, guided through a graphical user interface (GUI) incorporating a thorough help system, is divided into three steps: a) geodata preparation, b) statistical modelling, and c) post-processing. The pre- and post-processing are done in GIS. The necessary functions of each step are discussed in chapter 4.6. The modelling step consists of statistical modelling, which is independent of GIS, and the creation of a probability distribution raster, which depends on GIS functions. The input and output are in the form of a geo-database. In Figure 4-1, the two sides (left and right) represent the presentation layer (GUI) and their respective APIs, the central part is the logic layer and the data layer is shown at the top. The gaps between the main GUI and statistical modelling represent the loose coupling of the different components. Light blue (cyan) is used for the logic and presentation layers using the DotNET API only, whereas light green (olive) is used for the layers using both the ArcObjects (ArcGIS Engine 7) and DotNET APIs. Light orange (beige) is used for representing the GUI.
7 http://www.esri.com/software/arcgis/arcgisengine/ (02-Jul-2013)
Figure 4-1: Conceptual work flow for modelling species distribution in SpeeDi Tool with three steps: pre-processing, modelling and post-processing
4.1.1. Geodata preparation
Since input datasets may come from different sources, in different formats and referenced in different spatial reference systems, they first have to be harmonised so that all datasets are confined within a gridded geospatial region defined by a common spatial reference system and a common grid size. Further, harmonisation ensures that the pixel positions match across the stack of geodatasets, which is a basic prerequisite for any GIS analysis (Bernhardsen, 1999). Species sample data may not be in a standard GIS file format (e.g. plain text files), and these sample locations have to be imported into a GIS. Moreover, when modelling with presence-background samples (see chapter 4.5), background samples have to be generated. Other pre-processing tasks include deleting duplicate records and assigning and updating the weights of the samples.
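The pre-processing tasks named above can be illustrated with a minimal sketch. It is not the SpeeDi Tool implementation (which relies on ArcObjects); the function names, the example extent and the rule that background cells exclude cells holding a presence record are assumptions made only for illustration.

```python
import random

def cell_index(x, y, x_min, y_max, cell_size):
    """Return (row, col) of the grid cell containing point (x, y).

    The grid origin is the upper-left corner (x_min, y_max).
    """
    col = int((x - x_min) // cell_size)
    row = int((y_max - y) // cell_size)
    return row, col

def deduplicate(records, x_min, y_max, cell_size):
    """Keep one presence record per grid cell (the first one wins)."""
    seen = {}
    for x, y in records:
        seen.setdefault(cell_index(x, y, x_min, y_max, cell_size), (x, y))
    return seen

def background_cells(n, n_rows, n_cols, occupied, seed=0):
    """Draw n distinct random cells that hold no presence record (assumed rule)."""
    rng = random.Random(seed)
    free = [(r, c) for r in range(n_rows) for c in range(n_cols)
            if (r, c) not in occupied]
    return rng.sample(free, n)

# Hypothetical extent: upper-left corner at (30.0, 0.0), 0.5 degree cells, 4 x 4 grid
presences = [(30.1, -0.1), (30.2, -0.2), (31.3, -1.6)]
occupied = deduplicate(presences, x_min=30.0, y_max=0.0, cell_size=0.5)
background = background_cells(5, n_rows=4, n_cols=4, occupied=occupied)
```

The first two records fall into the same grid cell and are therefore treated as duplicates once all data share a common grid.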
4.1.2. Statistical modelling
This is the stage that shapes the output of the task. The species data are sampled over the environmental layers (geodata) and the interactions among the environmental variables (e.g. polynomial degree of regression and products among variables) are selected. The model is trained by calculating the regression coefficients of the environmental variables. After training, the fitted model parameters are used to create a probability distribution raster.
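The modelling stage can be sketched as follows, assuming two hypothetical environmental predictors scaled to the range 0 to 1, a second-degree polynomial with one product term, and plain gradient ascent as the fitting procedure. The fitting method, the function names and the sample values are assumptions of this sketch and are not taken from the SpeeDi Tool.

```python
import math

def expand(x1, x2):
    """Feature vector: intercept, linear, quadratic and product terms."""
    return [1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2]

def predict(beta, features):
    """Logistic response, always between 0 and 1."""
    z = sum(b * f for b, f in zip(beta, features))
    z = max(-30.0, min(30.0, z))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit(samples, labels, rate=0.1, epochs=2000):
    """Estimate coefficients by gradient ascent on the log-likelihood."""
    rows = [expand(x1, x2) for x1, x2 in samples]
    beta = [0.0] * len(rows[0])
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for row, y in zip(rows, labels):
            err = y - predict(beta, row)
            for j, f in enumerate(row):
                grad[j] += err * f
        beta = [b + rate * g / len(rows) for b, g in zip(beta, grad)]
    return beta

# Hypothetical sampled predictor values; 1 = presence, 0 = absence or background
samples = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.7, 0.8), (0.8, 0.7), (0.9, 0.9)]
labels = [0, 0, 0, 1, 1, 1]
beta = fit(samples, labels)

# Applying the fitted coefficients to every raster cell yields the probability raster
raster = [[(0.1, 0.2), (0.9, 0.9)]]
probability = [[predict(beta, expand(*cell)) for cell in row] for row in raster]
```

The same `predict` call serves both training and the creation of the probability raster, which mirrors the separation between fitting the model and applying the fitted parameters.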
4.1.3. Post-processing
This part can be divided into two different steps: data analysis and presentation.
a) Data analysis – The predicted result is evaluated with e.g. the ROC curve (the model’s strength) and the specificity and sensitivity of the prediction (the model’s performance in discriminating presence from absence), and is turned into binary presence and absence by applying an optimised threshold value. This is the primary output, which can then be analysed further using spatial analysis techniques.
b) Presentation – Cartographic visualisation is an efficient and intuitive method of communicating geospatial data. The result of the data analysis is presented in the form of a map.
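The data analysis step can be illustrated with a small sketch. The criterion used here for the optimised threshold (maximising the sum of sensitivity and specificity) is an assumption, as are the function names and the example values; the text above does not state which criterion the tool applies.

```python
def sensitivity_specificity(probs, labels, threshold):
    """Share of presences and of absences classified correctly at a threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg

def best_threshold(probs, labels):
    """Threshold maximising sensitivity + specificity (assumed criterion)."""
    candidates = sorted(set(probs))
    return max(candidates,
               key=lambda t: sum(sensitivity_specificity(probs, labels, t)))

def auc(probs, labels):
    """Area under the ROC curve: chance that a presence outranks an absence."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities at evaluation sites (1 = presence)
probs = [0.10, 0.20, 0.30, 0.35, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1]

threshold = best_threshold(probs, labels)
binary = [1 if p >= threshold else 0 for p in probs]
area = auc(probs, labels)
```

Applying the same threshold to every cell of the probability raster gives the binary presence-absence raster that is passed on to presentation.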
4.2. User-centred design and user profile for the SpeeDi Tool
The conceptualisation of a new tool for SDM and its workflow depend partly on the users’ knowledge of similar methods and tools. With the intention of offering both GIS-related functions and statistical modelling, the handling of both techniques should be carefully considered. Lack of familiarity with one should not hinder the use of the new tool. The tool to be developed is targeted at ecological conservation planners and managers modelling species distributions, who may not frequently use either of the technologies involved; however, this target group is only one example, and several other groups of users should find the tool useful too. In order to address different user bases, a basic user profile is assumed (Table 4-1) so that a basic level of user support, e.g. a help system, can be provided. Qualification is an important criterion as it gives an overview of a user’s domain of expertise; the aim is not to find out what academic degree a user possesses. Familiarity with different IT systems, such as GUI-based and Command Line Interface (CLI) based modelling tools in different operating systems, can be an influencing factor in user experience. The tool offers a GUI for the Microsoft Windows operating system and it is expected that users are familiar with the different GUI components in that operating system. User profiling also helps to determine how much skill a user commands, for example in GIS and modelling. Since a major part of an SDM task involves working in a GIS, basic knowledge of GIS and skills in handling basic GIS operations (e.g. GIS data types, creation of geo-datasets, data type conversion, spatial reference systems) are necessary (at least the concepts, if not practical experience).
Familiarity with advanced GIS operations such as spatial analyses (with raster and vector datasets) can be beneficial, as they are often useful for transforming datasets so that they are ecologically interpretable. Furthermore, GIS has its own terminology and the same term may be expressed differently in ecology. Previous experience of SDM tasks may help a user to adapt workflows and handle the new tool easily. Even if a user has no modelling experience, previous training in GIS and species distribution modelling is beneficial. Likewise, the frequency of use of GIS and SDM is central to designing a new tool, especially when considering how to expose the functions and their input and output parameters so that users can make the most of the various functions offered by the tool in innovative ways. This is indirectly associated with experience in GIS, as more experienced users can better exploit the potential offered by the tool. Moreover, profiling the user can help in preparing support documents effectively (Robinson, 2009) and so in reducing knowledge gaps in skills relating to GIS and modelling.
Table 4-1: Assumed user profile for using the SpeeDi Tool

                                            sought attribute
Qualification/expertise                     basic GIS and SDM knowledge
IT experience
    CLI and GUI                             GUI helpful
    operating system                        MS Windows
GIS experience
    basic operations                        required
    advanced spatial analysis               not essential
knowledge on GIS specific terminologies     helpful
experience/knowledge with SDM               knowledge required, experience not essential
previous training on GIS / SDM              beneficial (not essential)
frequency of use of GIS / SDM               low (to high)
4.3. Architecture for the modelling tool
A layered approach is selected for the architecture, with three main layers: presentation layer, logic layer and data layer (see Figure 4-1). The layered approach makes it easier to update only the required part if and when necessary (Microsoft, 2009). It allows functionalities to be programmed in the traditional, task-oriented way without compromising the current focus on usability and user experience, and so offers better integration of the task-centred and user-centred approaches. For example, changes in the code of the user interface will not affect the working of the logic and vice versa. This helps when changes for improved user experience can be confined to the presentation layer, or when fixes for previously unnoticed bugs can be confined to the logic layer (ibid.).
Presentation layer – The presentation layer (Figure 4-1, beige coloured left and right boxes) provides the visual layout of the tool, offering user interaction and visualisation of data. This layer is the only layer visible to the user and is responsible for most of the user experience while interacting with the application. The visual elements are used to display data and information as well as to allow for interaction by users. For displaying geodata, two different views (see Figure 4-2) are preferred and are offered by GIS components (via the ArcObjects API): one as a general data view in the map coordinate system and the other as a map layout view in the paper coordinate system. Since geodata contain attribute data, an attribute view is also necessary, which is provided through the presentation logic component. Presentation logic queries the data, binds them to an entity (e.g. a data table) and displays them via a presentation component as, for example, a table.
Logic layer – The logic layer (Figure 4-1, cyan and olive coloured central part) is the core for retrieving, processing and managing data. The presentation layer collects user inputs and passes them to the application. When a complex set of actions requiring several functions is needed (e.g. for harmonisation of data, chapter 4.1.1), the layer executes the necessary functions in the required sequence. The logic layer is also responsible for internal (tight) and external (loose) coupling. For the tool, GIS-related functions are to be tightly integrated, whereas the communication between the GIS component and the statistical modelling component is loosely coupled. Data communication in the loosely coupled environment is made through two different strategies: a) data serialisation (Weisfeld, 2009) for complex objects, and b) data piping (Ritchie, 1980) for simple objects. The output of statistical modelling contains a complex object 8 with all the model parameters (coefficients) as well as model-specific tuning parameters (e.g. regularisation controllers), for which data serialisation is an effective means. The response curves of the environmental variables are plotted using the DotNET API; however, surface plots, which show the effects of variable interactions, cannot be created with the same API, so Gnuplot 9 is used. Creating surface plots (e.g. Figure 5-5) needs only part of the model parameters, and data piping (supported by Gnuplot) offers an easier option. Data piping, also known as redirection, is a technique of feeding (redirecting) the output of one application directly as input to another application (Ritchie, 1980). A GUI coupling is used when input and output parameters are to be set interactively; in addition, a transparent (not visible to the user) loose coupling (see Figure 4-1) can be used.
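Data piping to Gnuplot can be sketched as below. The sketch assumes a logistic surface of two interacting variables with hypothetical coefficients; the function names and the choice of the text-only `dumb` terminal are illustrative and do not reproduce the commands issued by the SpeeDi Tool.

```python
import shutil
import subprocess

def surface_commands(b0, b1, b2, b12, title="interaction surface"):
    """Build gnuplot commands for the logistic surface of two variables."""
    return "\n".join([
        "set terminal dumb",
        f"set title '{title}'",
        "set xrange [0:1]",
        "set yrange [0:1]",
        "set zrange [0:1]",
        f"p(x,y) = 1.0 / (1.0 + exp(-({b0} + {b1}*x + {b2}*y + {b12}*x*y)))",
        "splot p(x,y)",
    ]) + "\n"

def pipe_to_gnuplot(commands):
    """Feed the commands to gnuplot's standard input.

    Returns the captured output, or None if gnuplot is not installed.
    """
    exe = shutil.which("gnuplot")
    if exe is None:
        return None
    done = subprocess.run([exe], input=commands, capture_output=True, text=True)
    return done.stdout

# Hypothetical coefficients: intercept, two main effects and their product term
commands = surface_commands(-1.0, 2.0, 1.5, -0.5)
```

Calling `pipe_to_gnuplot(commands)` passes only the few coefficients needed for the plot, which is why piping suffices here while the complete model object is better served by serialisation.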
Data layer – This layer is responsible for handling different types of data (Figure 4-1, top cylindrical shapes). Attribute data also have to be written and read for modelling (modelling data in Figure 4-1). This includes how the attributes are arranged, how interactions are defined and how the original geodata are referred to in the interactions. The use of the ArcObjects API provides the logic for handling different types of geospatial data. Attribute data are stored in plain text form. To store the fitted model parameters of the statistical modelling part, a new internal data format is created which allows easy retrieval of the parameters. The new data format is saved in a binary encoded 10 file for use in the SpeeDi Tool. To allow interoperability with other software, the same data are also saved in a SOAP-XML format.
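The idea of keeping the fitted parameters in both a binary and an XML form can be sketched as follows. Python's `pickle` stands in for the binary encoding and a plain XML document stands in for SOAP-XML; the parameter names and the element layout are hypothetical and do not reflect the internal format of the SpeeDi Tool.

```python
import pickle
import xml.etree.ElementTree as ET

def to_binary(params):
    """Serialise the parameter object to bytes (machine readable only)."""
    return pickle.dumps(params)

def from_binary(blob):
    """Restore the parameter object from its binary form."""
    return pickle.loads(blob)

def to_xml(params):
    """Write the same parameters as human-readable XML for other software."""
    root = ET.Element("model")
    root.set("regularisation", repr(params["regularisation"]))
    for name, value in params["coefficients"].items():
        term = ET.SubElement(root, "term", name=name)
        term.text = repr(value)
    return ET.tostring(root, encoding="unicode")

def from_xml(text):
    """Read the parameters back from the XML form."""
    root = ET.fromstring(text)
    return {
        "regularisation": float(root.get("regularisation")),
        "coefficients": {t.get("name"): float(t.text)
                         for t in root.findall("term")},
    }

# Hypothetical fitted model: coefficients per term plus one tuning parameter
params = {
    "regularisation": 0.01,
    "coefficients": {"intercept": -1.2, "temp": 0.8, "temp*rain": -0.3},
}
binary_copy = from_binary(to_binary(params))
xml_copy = from_xml(to_xml(params))
```

Both copies restore the same object; the binary form is compact and quick to reload, whereas the XML form can be inspected and read by other programs.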
4.4. GUI design
The main application is designed to embed GIS functionality using ArcGIS Engine, whereas statistical modelling is loosely coupled and accessed from the main GUI. The statistical modelling framework is built without any linkage or dependency to ArcGIS Engine so that it can also function where ArcGIS Engine is not available. However, creation of the prediction raster would still need ArcGIS Engine. By using a loose coupling mechanism, the statistical modelling (binary logistic regression) is integrated into the GIS-based GUI. In order to achieve a harmonised user experience, the GUI of the statistical modelling part is designed with a ‘look-and-feel’ similar to the main GUI, making the user unaware of any coupling mechanism. The main GUI is divided into five main visible components (Figure 4-2): a) menu, b) toolbars, c) geodata TOC (table of contents), d) geodata and map layout view, and e) pop-up menus.
8 Most of the data are in the form of Matrix and Vector; the Matrix uses program code from http://www.codeproject.com/Articles/5835/DotNetMatrix-Simple-Matrix-Library-for-NET (26-Jan-2013)
9 http://www.gnuplot.info (13-Jan-2012)
10 Binary encoded: readable by machine (computer) but not meaningful for human reading
Figure 4-2: Different components in the main GUI of the SpeeDi tool.
The menu offers functions for saving and opening an ArcGIS map document as well as loading, editing and saving general options for SDM tasks (e.g. pre-defined cell size, spatial reference system). It also provides some advanced pre- and post-processing functions as well as printing utilities. Two distinct toolbars are present, one for general navigation of the geodata displayed in the geospatial view and another with a set of functions related to the SDM task. The functions for the SDM task are grouped and arranged in a sequence similar to the modelling work flow of pre-processing and modelling. The geodata TOC shows which geodata are loaded, which of their cartographic representations are visible (displayed) in the geospatial view, and in which order they are arranged or displayed. The geospatial view offers the mechanism to view the geodata (defined by some cartographic representation) in two tabs: a) geodata view, for viewing the data in detail, and b) print or map layout view, for viewing the data as if printed in a specified paper size and format, i.e. page layout. Pop-up menus are offered on the geodata TOC, providing specific functions related to geodata. These menus are context sensitive, i.e. the functions in the menu differ based on the type of data (vector or raster).
[Figure 4-3, layout panel: area with different visual elements for specifying input and output options; buttons: OK, Cancel, Help]
Figure 4-3: Common layout of dialog-boxes (top left) for pre- and post-processing functions in SpeeDi Tool; an example of dialog-box for running local function (top right) and displaying the help associated with the function when the ‘Help’ button is clicked (bottom)
Apart from the main GUI, the pre- and post-processing functions are also designed as interactive GUIs. Although input and output parameters differ between functions, the general layout is designed similarly (Figure 4-3) to offer consistency in appearance, which increases learnability (see chapter 3.2.3). A large part of the window (dialog-box) contains the inputs for the different parameters, and three buttons are usually arranged at the lower right corner to accept (OK), dismiss (Cancel) and seek help (Help). On clicking the help button, the integrated help shows the description of the function related to the current context (task). The availability of context-related help offers a better user experience (see chapters 3.2.3 and 3.4), as the detailed documentation is just a ‘click of a mouse button’ away. Further, feedback is provided for input parameters by offering only the options that are valid given the other selected options. This reduces the user error rate and increases user performance (see chapter 3.2.3). Tool-tips are among the elements enhancing usability (see chapter 3.2.3) by assisting learnability, operability and memorability, and they also help to increase users’ efficiency (see chapter 3.2.4).
Some of the parameters do not change for a particular task (e.g. grain size, spatial reference system); having to set such a parameter every time a function is used before the task can be continued is annoying and induces frustration. With the focus on simplicity and ‘ease-of-use’, global settings for such parameters are saved, which are loaded at the start and can be modified at any time. Whenever possible, a set of default values (Figure 4-4) is offered that is likely to work well most of the time, thus providing centralised settings as a core experience (see chapter 3.4). However, users can easily change these default values too.
Figure 4-4: Setting default preferences in the tool, accessible via menubar; left: for logistic regression modelling most of them are related to the output graphs, and right: for modelling task related mainly to spatial properties
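The handling of global settings with defaults can be sketched in a few lines. The keys, the default values and the use of JSON as the storage format are assumptions chosen for illustration; they are not the settings or the file format used by the SpeeDi Tool.

```python
import json

# Hypothetical defaults for parameters that rarely change within a task
DEFAULTS = {
    "cell_size": 0.5,
    "spatial_reference": "EPSG:4326",
    "plot_response_curves": True,
}

def load_settings(text=None):
    """Start from the defaults and overlay any saved values for known keys."""
    settings = dict(DEFAULTS)
    if text:
        saved = json.loads(text)
        settings.update({k: v for k, v in saved.items() if k in DEFAULTS})
    return settings

def save_settings(settings):
    """Serialise the current settings so they can be reloaded at start-up."""
    return json.dumps(settings, indent=2, sort_keys=True)

# First start: nothing saved yet, so the defaults apply
settings = load_settings()

# The user changes one value; it is saved once and reused by every function
settings["cell_size"] = 1.0
reloaded = load_settings(save_settings(settings))
```

Every dialog can then read its initial values from the loaded settings instead of asking the user again, while any value remains editable at any time.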
4.5. Logistic regression with presence, absence and background data
Binary logistic regression is one of the popular statistical models for calculating probability based on recorded events. Further, the fact that its predicted output value is bounded between zero and one makes it an attractive solution (Hosmer and Lemeshow, 2000). More importantly, no assumption has to be made about the statistical distribution of the predictor variables (ibid.).
4.5.1. Formulating binary logistic regression model
As the name suggests, binary logistic regression follows a binomial distribution and is calculated by equation 1 (Hosmer and Lemeshow, 2000). For SDM, it gives the probability that the environmental conditions at a given location are favourable for finding a species.