Eindhoven University of Technology

MASTER

Predicting the Oak Processionary Moth Population using Statistical Modeling

Scholtens, Tim P.H.

Award date: 2021

Department of Mathematics and Computer Science

Predicting the Oak Processionary Moth Population using Statistical Modeling

Master’s Thesis
T.P.H. Scholtens

Supervisors
Prof. dr. Jakob de Vlieg (TU/e)
Dr. Rogier Brussee (TU/e)
Dr. ir. Arie Weeren (VAA)

Assessment committee
Prof. dr. Jakob de Vlieg (TU/e)
Dr. Rogier Brussee (TU/e)
Prof. dr. ir. Boudewijn van Dongen (TU/e)
Dr. Bert M. Sadowski (TU/e)

Eindhoven, January 12, 2021

ABSTRACT

Context The oak processionary moth (OPM) is infamous within the Netherlands. During its caterpillar stage it develops hairs that cause health issues on physical contact. Current solutions to eradicate the OPM have drawbacks: they either also eradicate the OPM’s predators or are too labour intensive to be effective. Therefore, experts conclude that total eradication is no longer an option and that we should aim to control the OPM population [3].

Aim Our aim for this study is to enable organisations to make informed decisions concerning the OPM population distribution. Using statistical modeling, a model can be defined for estimating the OPM population within a given area. By estimating the population size, organisations can allocate resources proportionally, thus allowing for more efficient decisions.

Innovation Using the statistical framework Species Distribution Modeling, we defined, to the best of our knowledge, the first predictive model to estimate the OPM population within a given area. Additionally, using ecological theory, we conceptualized and evaluated a model of factors that affect the OPM population.

Conclusion Our work provides a baseline predictive model and a conceptual model, both of which can be extended further. However, due to the limited availability of OPM occurrence data, our predictive model has several limitations. Firstly, we have not accounted for spatial autocorrelation between neighbouring areas. Secondly, a temporal component is missing in our models. Therefore, we strongly recommend gathering detailed (both spatial and temporal) data, so that future research can take these aspects into account.

Keywords Oak processionary moth, Species Distribution Modeling, BIOCLIM.

PREFACE

Before you lies my thesis, carried out at VAA. It contains my research on how statistical modeling can be used to estimate the oak processionary moth population within an area.

I would like to express my gratitude towards my inspiring supervisors, without whose guidance this thesis would not have been possible. Rogier Brussee, my day-to-day supervisor, thank you for the good times and for helping me grow not only professionally, but also into a better person overall. Jakob de Vlieg, thank you for being an inspiration: even while going through an extremely difficult time, you kept going strong.

Furthermore, I would like to thank my day-to-day supervisor at VAA. Arie Weeren, thank you for your expert guidance in statistics and your never-ending patience in explaining the underlying mathematics to me.

Lastly, I would like to thank the following parties for supplying data or for their guidance on it:

Antea group, T. Grooten
BomenMonitor, W. Tims
BoomRegister, D. Voets
Eikenprocessierups expertise centrum, H. Kuppen
Gemeente Amsterdam, J. Bijleveld
Gemeente Geldrop, G. Broeren
Gemeente Leiden, R. Jonke, E. Hilgersom
Nationale databank flora en fauna
Naturalis, dr. M. Roos, P. van Aalst, J. Dercksen, V. Beckers
Provincie Gelderland
Provinciehuis Noord-brabant, S. op ’t Hof
Sovon, J. Schoppers
Vlinderstichting, J. van Deijk
Vogelbescherming, M. Platel
Wageningen Universiteit, F. Brouwer

Contents

1 Introduction
1.1 Context and topic
1.2 State of the art
1.3 Research question
1.4 Research methodology

2 Preliminaries
2.1 Ecological Theory
2.1.1 Environmental space
2.1.2 Environmental factors
2.2 Oak processionary moth

3 Literature review
3.1 Frameworks
3.1.1 Species distribution modeling
3.1.2 Population dynamics
3.1.3 Conclusion
3.2 State of the art

4 Data
4.1 Conceptualization
4.1.1 Conceptual model
4.1.2 Predictor variables
4.2 Data collection
4.2.1 Oak processionary moths
4.2.2 Climate
4.2.3 Geology
4.2.4 Oak trees
4.2.5 Predators
4.3 Data transformation
4.3.1 BIOCLIM variables
4.3.2 Spatial scale
4.4 Data organisation
4.4.1 Datasets
4.4.2 Overview

5 Analysis
5.1 Data visualization
5.1.1 Geographic bounds
5.1.2 Oak processionary moths
5.1.3 Oak trees
5.1.4 Great tits
5.1.5 Soil
5.1.6 BIOCLIM
5.1.7 Conclusion
5.2 Exploratory data analysis
5.2.1 Scatter plots
5.2.2 Correlation matrix
5.3 Feature selection
5.3.1 Method
5.3.2 Results
5.3.3 Conclusion

6 Models
6.1 Scaling
6.2 Algorithm selection
6.3 Model building
6.4 Model evaluation
6.5 Results
6.5.1 Spatial analysis
6.5.2 Test error

7 Conclusion
7.1 Concluding summary
7.2 Contributions
7.3 Limitations and future work

Bibliography

Appendix

A Data collection
A.1 BIOCLIM
A.1.1 Variables

B Feature selection
B.1 LASSO

C Models
C.1 Generalized linear models
C.2 Random Forest
C.3 Neural networks

List of Figures

1 Research methodology
2 Hutchinsonian niche
3 Species distribution modeling cycle
4 Neural network architecture
5 Conceptual model of factors affecting the OPM’s population
6 Provincial roads in Gelderland (light green)
7 Locations of the KNMI weather stations
8 Neighbourhood geographic bounds
9 Distribution OPM in Amsterdam 2018
10 Distribution OPM in Amsterdam 2019
11 Distribution OPM in Gelderland
12 Distribution oak trees Gelderland
13 Distribution oak trees Amsterdam
14 Distribution great tits Amsterdam 2018
15 Distribution great tits Amsterdam 2019
16 Distribution great tits Gelderland
17 Soil polygons region Amsterdam
18 Soil polygons province Gelderland
19 BIOCLIM variables Amsterdam
20 BIOCLIM variables Gelderland
21 Scatter plots Amsterdam 2018
22 Scatter plots Amsterdam 2019
23 Scatter plots Gelderland
24 Correlation matrix Amsterdam 2018
25 Correlation matrix Amsterdam 2019
26 Correlation matrix Gelderland
27 LASSO selected features
28 Subset selected features
29 Neural network architecture
30 Predicted OPMN distribution Amsterdam 2018
31 Predicted OPMN distribution Amsterdam 2019
32 Predicted OPMN distribution Gelderland
33 LASSO regularization parameter
34 Distribution dependent variable

List of Tables

1 Variables
2 Training and test observations
3 Model test errors
4 Fitted GLM Amsterdam 2018
5 Fitted GLM Amsterdam 2019
6 Fitted GLM Gelderland
7 Random forest hyper-parameter grid space
8 Random Forest hyper-parameter tuning Amsterdam 2018
9 Random Forest hyper-parameter tuning Amsterdam 2019
10 Random Forest hyper-parameter tuning Gelderland
11 Selected hyper-parameters Random Forest
12 Neural network hyper-parameter grid space
13 Neural network hyper-parameter tuning Amsterdam 2018
14 Neural network hyper-parameter tuning Amsterdam 2019
15 Neural network hyper-parameter tuning Gelderland
16 Selected hyper-parameters neural network

GLOSSARY

Abiotic: Chemical and non-living components in an ecosystem.
Biotic: Living or once living components in an ecosystem.
Niche: In the context of ecology, the habitat in which a species occurs.
Fundamental niche: The environmental space where the abiotic conditions allow for a species to survive and reproduce.
Realized niche: The environmental space where the biotic and abiotic conditions allow for a species to survive and reproduce.
Great tit: A widespread and common bird species throughout Europe.

ACRONYMS

MSE: Mean squared error
MAE: Mean absolute error
OPM: Oak processionary moth
SDM: Species distribution modeling
LASSO: Least absolute shrinkage and selection operator
GLM: Generalized linear model
BIOCLIM: Bioclimatic variables
KNMI: Koninklijk Nederlands Meteorologisch Instituut
ANN: Artificial neural network
EIV: Ellenberg’s Indicator Values
RDBMS: Relational database management system

1 INTRODUCTION

The field of statistics is the science of learning from data. Statistical modeling techniques provide ways to model complex topics, allowing for a better understanding of the related factors. An infamous topic within the Netherlands is the oak processionary moth (OPM). During its caterpillar stage it develops hairs that cause health issues on physical contact. These issues range from skin complaints, limited eyesight, and breathing problems to more severe symptoms such as vomiting, fever, and an overall miserable feeling. Current solutions to eradicate the OPM have drawbacks: they either also eradicate the OPM’s predators or are too labour intensive to be effective. Therefore, experts conclude that total eradication is no longer an option and that we should aim to control the OPM population [3].

1.1 CONTEXT AND TOPIC

Our motivation for this study is to enable organisations to make informed decisions concerning the population distribution of OPMs. By predicting the population size within an area, organisations can allocate resources proportionally, thus making more efficient decisions.

1.2 STATE OF THE ART

Although there exists a body of literature studying the effects of climate change on the OPM [13] and on a member of the OPM’s family, the pine processionary moth [22], we are, to the best of our knowledge, the first to study the estimation of the OPM’s population size. Additionally, as far as we are aware, there exists no research on the OPM’s ecological requirements.

1.3 RESEARCH QUESTION

By using statistical modeling techniques, we can define a quantitative model that predicts the OPM’s population in a given area. From this, we can state our main research question as follows:

How can we use statistical modeling to predict the OPM population of a given area?

We answer our main research question by formulating several sub questions:

1. What are known techniques for predicting the OPM population? Answering this question provides an overview of possible (statistical) frameworks, structuring our process for developing a statistical model.

2. Which statistical modeling methods are suitable for predicting the OPM population? The answer to this question provides possible statistical methods compatible with our chosen framework.

3. Which predictor variables are suitable for predicting the OPM population? These predictor variables are used for fitting our statistical model.

1.4 RESEARCH METHODOLOGY

Figure 1: Research methodology

Figure 1 summarizes our research methodology and its relationship to the research questions. We start our research with a literature study, exploring suitable statistical frameworks and models, thereby answering our first and second research questions. We continue our research by defining a conceptual model of factors that affect the OPM population and by identifying variables related to these factors, thereby answering our third research question. Furthermore, we describe the collection, transformation and organisation of the identified variables. Chapter 5 consists of an exploratory data analysis and feature selection, describing the characteristics of the data and determining the most relevant features. Chapter 6 consists of building the statistical models and their evaluation. Finally, in chapter 7, we provide the conclusions and limitations of our work, along with recommendations for future research.

2 PRELIMINARIES

This chapter introduces the definitions, concepts, and relevant context used throughout our study. Section 2.1 describes ecological definitions and theories. Section 2.2 describes the characteristics of our subject under study, the oak processionary moth.

2.1 ECOLOGICAL THEORY

This section describes the environmental space in which a species can exist, and the environmental factors related to this space.

2.1.1 ENVIRONMENTAL SPACE

The environmental space in which a species exists can be described using Hutchinson’s niche theory [15]. Hutchinson defines the environmental space as a hyper-volume shaped by the environmental conditions under which a species can ‘exist indefinitely’. Hutchinson further distinguishes two spaces within the environmental space: the space where environmental conditions and resources allow for a species to survive and reproduce, known as the fundamental niche, and the space which builds on the fundamental niche but also accounts for interactions with other living organisms, known as the realized niche (figure 2).

Figure 2: The niche concept of Hutchinson. The biotope describes the variety of environmental conditions that occur in an area. The fundamental niche expresses the space where the abiotic conditions allow for a species to exist indefinitely. The realized niche includes both abiotic and biotic conditions [11].

2.1.2 ENVIRONMENTAL FACTORS

Two types of environmental factors can be distinguished: proximal factors, also known as causal factors, and distal factors, also known as proxy or surrogate factors [5]. Surrogate factors are related to causal factors, but may be easier to acquire than causal factors. Additionally, the proximal factors can be further decomposed into resource and direct variables [4]. Resource variables encompass the energy and matter that are consumed by plants or animals. Direct variables are environmental factors that are of physiological importance to a species, but cannot be consumed.

2.2 OAK PROCESSIONARY MOTH

This section describes the characteristics of the OPM.

Life cycle The OPM’s life cycle consists of the stages egg, caterpillar, pupa, and moth [1]. The egg stage lasts from mid-August till mid-April. During this stage the eggs can be found as plaques of 2-3 cm long on high branches and twigs. In mid-April the caterpillars emerge from the eggs, although emergence can start as early as March due to warm weather. The caterpillar stage lasts till the end of June, during which the caterpillars pass through six developmental stages known as instars. These stages are numbered L1 to L6. In developmental stages L3 to L6 the caterpillars develop irritating hairs, and the number of hairs increases with each stage. From late June to the beginning of August the caterpillars retreat into their nests and moult to the pupal stage. During this stage, the pupae stay in their nests until they emerge as adult moths. The final stage spans a total of three to four days, in which the moths fly away from their nests and mate. The female moths usually fly 4-5 km from their nests, up to a maximum of 10 km, whereas the male moths are capable of flying greater distances [12]. After mating, the male moths die immediately, whereas the female moths first lay eggs before dying.

Resources The caterpillar stage is the only stage in which the OPM feeds. Its primary food source consists of the leaves of oak trees, and it makes little to no distinction between different oak species [12]. There are records of OPMs feeding on other tree species; however, this only occurs when no oak trees are left to feed on.

Habitat The habitat of the OPM consists of central and southern Europe and is strongly related to the areas where oak trees grow.

Predators The OPM’s predators consist of birds, insects, and bacteria. The great tit and the cuckoo favor OPMs during their caterpillar stage; they are an exception, as most bird species avoid hairy caterpillars. Other bird species that have been recorded eating OPMs are the blue tit, western jackdaw and Eurasian nuthatch [2]. In the insect realm, Chrysoperla carnea and Dendroxena quadrimaculata feed on OPMs during their egg stage.

Control Governments take mechanical, thermal and biological measures to control the number of OPMs [12]. The mechanical and thermal methods consist of vacuuming and burning the OPMs during their egg or caterpillar stage. The biological method consists of using bacteria or nematodes during the egg stage.

3 LITERATURE REVIEW

In this chapter, we review existing literature on frameworks and modelling techniques used for predicting a species’ population size. In section 3.1, we review the existing frameworks and conclude which framework fits our context best. In section 3.2, we describe the current state-of-the-art statistical modelling techniques that fit within our selected framework.

3.1 FRAMEWORKS

Two fundamentally different approaches exist for estimating the population of a species: the mechanistic and the correlative approach [18]. The mechanistic approach, also known as the process-based or biophysical approach, conducts experiments in a controlled environment to determine a species’ physiological requirements. The correlative approach assumes that the species’ current distribution is a good indicator of the species’ physiological requirements. Below, we discuss two frameworks for estimating a species’ population size, one for each approach.

3.1.1 SPECIES DISTRIBUTION MODELING

Species distribution modeling is a correlative modeling framework used to model a species’ distribution. The framework uses ecological theory to conceptualize a model of factors affecting the species’ distribution. Using this model, predictor variables can be defined which can be used in a statistical model. The framework’s modeling process is explained in depth below.

Modeling process A recently proposed protocol for building species distribution models consists of the steps conceptualisation, data preparation, model fitting, model assessment, and prediction (figure 3) [25]. These steps can be repeated if the model’s predictions do not match reality.

Figure 3: Species distribution modeling cycle

In the conceptualisation step, a conceptual model is constructed by defining the environmental factors which can affect a species’ distribution. Next, in the data preparation step, the data related to the defined factors is collected and processed. The model fitting step consists of an analysis of the predictor variables, feature selection, and model building. In the model assessment step, the built model is analyzed based on its performance statistics, and a plausibility check is conducted. Finally, in the prediction step, the final model is evaluated based on its spatial or temporal predictions for new sites.

3.1.2 POPULATION DYNAMICS

Population dynamics uses a mechanistic approach to estimate a species’ population size. The framework identifies four major demographic processes that determine a species’ population size: birth, mortality, immigration, and emigration. Using these processes, a simple model for estimating a species’ population size can be defined (equation 1), assuming there are no external factors.

$$N_{t+1} = N_t + B_t - D_t + I_t - E_t \tag{1}$$

This model, also known as the BIDE (birth, immigration, death, emigration) model [21], consists of the following variables: $N_{t+1}$, the population size at time $t+1$; $N_t$, the population size at time $t$; $B_t$, the number of births between time $t$ and $t+1$; $D_t$, the number of deaths between time $t$ and $t+1$; $I_t$, the number of individuals immigrating into the population between time $t$ and $t+1$; and $E_t$, the number of individuals emigrating from the population between time $t$ and $t+1$.

These variables are then estimated using biophysical models, for example climate-dependent models for estimating mortality and growth rate, or water–energy–nutrient models for estimating the dispersal range and the number of births [18].
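As a minimal sketch of the bookkeeping in equation 1, the following Python fragment applies the BIDE update over two time steps. The function names and all numbers are hypothetical and serve only as an illustration; in practice the four quantities would come from the biophysical models mentioned above.

```python
def bide_step(n_t, births, deaths, immigrants, emigrants):
    """Population size at time t+1 given the size at time t (equation 1)."""
    return n_t + births - deaths + immigrants - emigrants


def project(n_0, rates):
    """Apply bide_step once per time step; `rates` holds (B, D, I, E) tuples."""
    sizes = [n_0]
    for births, deaths, immigrants, emigrants in rates:
        sizes.append(bide_step(sizes[-1], births, deaths, immigrants, emigrants))
    return sizes


# Hypothetical numbers, chosen only to illustrate the update rule.
trajectory = project(100, [(30, 20, 5, 10), (25, 40, 0, 5)])
print(trajectory)
```

Each element of the returned list is the population size after one more application of equation 1, starting from the initial size.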

3.1.3 CONCLUSION

The mechanistic framework Population dynamics requires detailed information about a species’ physiological requirements and interactions with its environment [17] in order to construct the biophysical models. To the best of our knowledge, no such data is currently available for OPMs. Therefore, we conclude that the correlative framework Species Distribution Modeling fits our problem context best, as occurrence data of OPMs is currently available.

3.2 STATE OF THE ART

This section describes the state-of-the-art statistical models used for estimating a species’ population.

Generalized linear models Generalized linear models (GLMs) are extensions of linear models and can cope with non-normal distributions of the response variable [19]. GLMs allow the response variable to be modeled using the exponential family of distributions, whereas linear models are limited to the Gaussian distribution. A GLM consists of three components, which can be explained using the equation below.

$$E(Y) = g(LP) = \beta_0 + \sum_{i=1}^{p} \beta_i X_i \tag{2}$$

The systematic component $LP$, the linear predictor, is a linear function of the predictor variables. The random component $E(Y)$ is the expected value of the probability distribution of the response. The link function $g()$ links the systematic component to the random component. More specifically, it describes how the expected value of the probability distribution relates to the linear predictor.
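To make the relation between the linear predictor and the expected value concrete, the sketch below evaluates both for a single observation. It assumes a count response where the exponential function maps the linear predictor to the expected value (the inverse of the log link commonly paired with the Poisson family). This choice, the function names, the coefficients and the predictor values are all assumptions for illustration, not fitted values.

```python
import math


def linear_predictor(beta_0, betas, x):
    """Systematic component: beta_0 plus the weighted sum of the predictors."""
    return beta_0 + sum(b * x_i for b, x_i in zip(betas, x))


def expected_count(beta_0, betas, x):
    """Expected response when exp() maps the linear predictor to E(Y) (assumed log link)."""
    return math.exp(linear_predictor(beta_0, betas, x))


# Hypothetical coefficients and predictor values.
lp = linear_predictor(0.5, [0.2, -0.1], [3.0, 1.0])
mu = expected_count(0.5, [0.2, -0.1], [3.0, 1.0])
print(lp, mu)
```

The exponential keeps the expected value positive for any value of the linear predictor, which is one reason this mapping is a common choice for count data.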

Generalized additive models Generalized additive models (GAMs) are similar to GLMs, but their linear predictor is composed of a sum of smoothing functions instead of a linear function. This difference results in the following equation.

$$E(Y) = g(LP) = \beta_0 + \sum_{i=1}^{p} f_i(X_i) \tag{3}$$

The systematic component $LP$ consists of the sum of smoothing functions $f_i$. The random component $E(Y)$ and the link function $g()$ are identical to those of the GLM.

Decision tree Decision trees are non-parametric statistical models used for classification and regression. Decision trees work by recursively partitioning the data into a set of rules [7]. These rules are used to define a tree-like model, each rule representing a decision node.
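As an illustration of a single partitioning step, the sketch below searches for the threshold on one predictor that minimizes the summed squared error of the two resulting groups. The squared-error criterion is an assumption (it is the usual choice for regression trees), and the toy data is hypothetical. A full tree would apply the same search recursively to each group.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, error) for the rule x <= threshold with the lowest total SSE."""
    best = (None, float("inf"))
    for threshold in sorted(set(xs))[:-1]:  # the largest value cannot split the data
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        error = sse(left) + sse(right)
        if error < best[1]:
            best = (threshold, error)
    return best


# Hypothetical toy data: the response jumps once x exceeds 3.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 5.1, 4.8, 5.0]
threshold, error = best_split(xs, ys)
print(threshold, error)
```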

Bagged trees Bagged trees improve on decision trees by aiming to reduce overfitting. A single decision tree derives its decision rules from all training data and is therefore sensitive to outliers. Bagged trees use bootstrap aggregation, also known as bagging: the bootstrap method is used to draw resampled training sets, a tree is constructed on each of them, and the constructed models are aggregated by taking their average prediction.
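The two ingredients of bagging, bootstrap sampling and averaging, can be sketched as follows. To keep the example short, the base learner is a stand-in that simply predicts the mean response of its sample; in bagged trees it would be a decision tree. All names and data are hypothetical.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) observations from data with replacement."""
    return [rng.choice(data) for _ in data]


def bagged_prediction(data, fit, x, n_models, seed=0):
    """Fit one model per bootstrap sample and average their predictions for x."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        model = fit(bootstrap_sample(data, rng))
        predictions.append(model(x))
    return sum(predictions) / len(predictions)


def fit_mean(sample):
    """Stand-in base learner (not a tree): predicts the mean response of its sample."""
    mean = sum(y for _, y in sample) / len(sample)
    return lambda x: mean


# Hypothetical (predictor, response) pairs.
data = [(1, 2.0), (2, 4.0), (3, 6.0), (4, 8.0)]
estimate = bagged_prediction(data, fit_mean, x=2.5, n_models=50)
print(estimate)
```

Because every model sees a different resampled training set, a single outlier influences only some of the models, and averaging dampens its effect.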

Random forest Trees constructed by the bagged tree method tend to correlate, as each tree is constructed in a similar manner. The random forest aims to reduce this correlation by applying a small tweak [16]: for each decision node, only a random subset of the available predictors is considered. Due to this random element, the constructed trees are less likely to correlate.
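The tweak itself is small, as the following sketch shows: before a decision node searches for its best rule, the candidate predictors are restricted to a random subset. The predictor names and the subset size of two are hypothetical choices for illustration.

```python
import random


def candidate_predictors(predictors, n_candidates, rng):
    """Pick, without replacement, the predictors a single decision node may consider."""
    return rng.sample(predictors, n_candidates)


rng = random.Random(42)
# Hypothetical predictor names.
predictors = ["bio1", "bio12", "soil_type", "oak_count", "great_tit_count"]
per_node = [candidate_predictors(predictors, 2, rng) for _ in range(3)]
print(per_node)
```

Each node draws its own subset, so two trees grown on similar bootstrap samples can still end up with different decision rules.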

Neural networks An artificial neural network (ANN) is a non-parametric statistical model inspired by the functioning of the human brain. The model uses a set of interconnected artificial neurons that loosely model the neurons in a human brain. The ANN’s structure is composed of nodes (artificial neurons) and links. Nodes are connected to other nodes using links. Each link is assigned a weight that expresses one node’s influence on another. The nodes are organized into layers, namely the input, hidden and output layers, each having a different purpose (figure 4).

Figure 4: Neural network architecture

Input layer The input layer connects the input matrix $X_{m \times n}$ to the neural network. The number of artificial neurons in this layer equals the number of columns $n$ of the input matrix. Each artificial neuron is connected to all neurons of the successive layer.

Hidden layer The input layer’s successive layer is known as the hidden layer. There can be an arbitrary number of successive hidden layers, and this number is often related to the complexity of the problem. Each layer is fully connected to its successive layer. Each artificial neuron computes the weighted sum of its incoming values and applies an activation function $f$ to it.

Output layer The final hidden layer’s successive layer is known as the output layer. The structure of the layer depends on the prediction problem, being either regression or classification. For regression problems only one artificial neuron is used, and the predicted value equals the weighted sum of its incoming values. For classification problems, each class is related to one artificial neuron.
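The layer structure described above can be sketched as a forward pass through a network with two inputs, one hidden layer of two neurons, and a single output neuron for regression. The weights, the bias terms and the ReLU activation are assumptions chosen for illustration; they are not the architecture or parameters used in our models.

```python
def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(weighted sum of inputs + bias) per neuron."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def relu(z):
    """Assumed hidden-layer activation function f."""
    return max(0.0, z)


def identity(z):
    """The regression output neuron passes its weighted sum through unchanged."""
    return z


def forward(x):
    # Hypothetical weights and biases: 2 inputs -> 2 hidden neurons -> 1 output.
    hidden = dense(x, [[0.5, -1.0], [1.0, 1.0]], [0.0, -0.5], relu)
    output = dense(hidden, [[2.0, 1.0]], [0.1], identity)
    return output[0]


prediction = forward([1.0, 0.5])
print(prediction)
```

For a classification problem, the last call to `dense` would instead produce one value per class.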

4 DATA

This chapter describes the conceptualization, collection, transformation, and organisation of the data used throughout this study. Section 4.1 introduces the factors which affect the OPM’s population, and their related predictor variables. Section 4.2 describes the data collection processes and the datasets as a result of these. Section 4.3 describes the required transformations to align each dataset’s spatial granularity and to obtain some of the predictor variables. Section 4.4 describes the organisation of the collected data.

4.1 CONCEPTUALIZATION

This section encompasses the conceptualization phase from the species distribution modeling framework (section 3.1.1). First, a model of factors that influence the OPM’s population, also known as the conceptual model, is defined. Next, we describe the concrete predictor variables, based on the conceptual model.

4.1.1 CONCEPTUAL MODEL

Using niche theory (section 2.1) and the OPM’s characteristics (section 2.2), we define the following conceptual model of factors influencing the OPM’s population (figure 5).

Figure 5: Conceptual model of factors affecting the OPM’s population

Fundamental niche The OPM’s fundamental niche consists of the leaves of oak trees and a set of unknown other environmental variables. The OPM makes little to no distinction between different oak species; the model reflects this by excluding details of the oak species. Furthermore, because the OPM feeds primarily on oak trees, no other tree species are included. Factors that affect the oak tree population consist of temperature, light, water and soil. When the temperature is too low, the oak trees’ leaves will freeze. Water and light are needed for photosynthesis, which produces the energy the oak tree needs to survive. The soil’s pH level affects the minerals available to the oak tree, which are required for essential processes. To the best of our knowledge, little is known about the physiological requirements of the OPM. We do, however, expect that temperature can play a significant role in the survival of the OPM. This claim is supported by research showing that synchronization of budburst and insect egg hatch has a positive effect on an insect’s reproductive success [23]. Other studies that support this claim state that early hatching insects may die of starvation, whereas insects hatching too late have to cope with a rapid decline in leaf quality [8], [10].

Realized niche The realized niche accounts for interactions with the OPM’s predators (biotic interactions) and human control measures (disturbances). Both factors can have a severe negative impact on the OPM’s population, possibly even preventing its existence. The climate factors can also affect the population of the OPM’s predators.

4.1.2 PREDICTOR VARIABLES

Climate We use a common set of variables, named BIOCLIM (bioclimatic variables), to obtain a statistical description of the climate. The set consists of 19 different variables derived from the monthly temperature and rainfall. Details about the variables and their calculation can be found in appendix A.1.
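To give an impression of how such variables follow from monthly data, the sketch below computes five of them using the commonly used definitions (BIO1, BIO5, BIO6, BIO7 and BIO12). The function name and the monthly values are hypothetical; appendix A.1 remains the reference for the definitions and calculations used in this study.

```python
def bioclim_subset(tmin, tmax, prec):
    """Compute a few BIOCLIM variables from 12 monthly values per series."""
    if not (len(tmin) == len(tmax) == len(prec) == 12):
        raise ValueError("expected 12 monthly values per series")
    tmean = [(lo + hi) / 2 for lo, hi in zip(tmin, tmax)]
    bio5 = max(tmax)
    bio6 = min(tmin)
    return {
        "BIO1": sum(tmean) / 12,  # annual mean temperature
        "BIO5": bio5,             # maximum temperature of the warmest month
        "BIO6": bio6,             # minimum temperature of the coldest month
        "BIO7": bio5 - bio6,      # annual temperature range
        "BIO12": sum(prec),       # annual precipitation
    }


# Hypothetical monthly minimum/maximum temperatures (deg C) and rainfall (mm).
tmin = [0, 0, 2, 4, 8, 11, 13, 13, 10, 7, 3, 1]
tmax = [5, 6, 10, 14, 18, 21, 23, 23, 19, 14, 9, 6]
prec = [70, 55, 60, 40, 60, 65, 75, 80, 75, 80, 80, 80]
bio = bioclim_subset(tmin, tmax, prec)
print(bio)
```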

Geology Ellenberg’s Indicator Values (EIV) are a common set of ecological parameters for describing the climate and soil conditions [6]. The set comprises four parameters related to soil. The first parameter, reaction (R), describes the soil or water acidity, measured as pH. The second parameter, nitrogen (N), is related to the soil fertility. The third parameter, soil humidity (F), measures the soil’s moisture. The fourth parameter, salt (S), measures the amount of salt in the soil.

Biotic interactions We use count data of the OPM’s predators (section 2.2) to assess their effect on the OPM’s population.

4.2 DATA COLLECTION

This section describes the collection of data defined in the conceptualization section.

4.2.1 OAK PROCESSIONARY MOTHS

The oak processionary moth dataset consists of data provided by two organisations. For each organisation, we describe the collection process and the characteristics of its dataset.

Province Gelderland The provincial administration of Gelderland provides a dataset of trees inspected for OPM nests. The study was conducted in the summer of 2020 and is limited to the trees along the provincial roads (figure 6). For each inspected tree, its location and whether or not an OPM nest was found are recorded. In total 45,984 trees were inspected, of which 1,383 contained nests.

Figure 6: Provincial roads in Gelderland (light green)

Municipality Amsterdam The municipality of Amsterdam provides two studies on trees inspected for OPM nests. The studies were conducted in 2018 and 2019 and are limited to the trees under the municipality’s maintenance. For each inspected tree, its location and the presence of OPM nests are recorded. In both years 14,856 trees were inspected, of which 1,521 contained OPM nests in 2018, and 3,156 in 2019.
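For comparison across the three surveys, the counts above can be expressed as the share of inspected trees that contained nests. The short calculation below uses only the numbers reported in this section; the dictionary labels are our own shorthand.

```python
# (inspected trees, trees with OPM nests) as reported above.
surveys = {
    "Gelderland 2020": (45_984, 1_383),
    "Amsterdam 2018": (14_856, 1_521),
    "Amsterdam 2019": (14_856, 3_156),
}

share_with_nests = {
    name: nests / inspected for name, (inspected, nests) in surveys.items()
}

for name, share in share_with_nests.items():
    print(f"{name}: {share:.1%}")
```

This gives roughly 3% for Gelderland, and roughly 10% and 21% for Amsterdam in 2018 and 2019, so the share of trees with nests in Amsterdam about doubled between the two years.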

4.2.2 CLIMATE

Despite being a set of commonly used variables, no recent dataset of the ...... BIOCLIM variables exists for the Netherlands. To the best of our knowledge, the latest available dataset for the BIOCLIM...... variables, dates from 1970 till 2000 [9]. Our collected dataset on OPM...... observations

25 dates from 2018 to 2020 (section 4.2.1), resulting in a misalignment of at least 17 years. Due to the lack of an available dataset, we calculated these variables ourselves.

The BIOCLIM...... variables are derived from the monthly temperature and rainfall (section 4.1.2). Netherland’s national weather agency, KNMI,...... provides data on daily rainfall and temperature. This data is calculated using 50 weather stations, spread across the Netherlands (figure 7).

Figure 7: Locations of the KNMI weather stations

4.2.3 GEOLOGY

To the best of our knowledge, no dataset is available that describes the EIV within the Netherlands. However, we found a qualitative dataset based on the EIV parameters nitrogen (N) and soil humidity (F). The dataset is provided by Wageningen University and Research (WUR) and Wageningen Environmental Research (Alterra), and categorizes the soil type based on the amount of peat or mineral material found [24]. The study dates from 2006 and comprises 340,000 soil samples taken at 1.5 meter depth. The dataset distinguishes 9 soil types: water, peat, sand, light clay, heavy clay, light sabulous clay, heavy sabulous clay, urban, and swamp.

4.2.4 OAK TREES

The oak trees dataset is provided by the same organisations as the OPM dataset: the municipality of Amsterdam and the provincial administration of Gelderland.

Province Gelderland The provincial administration of Gelderland provides a dataset of inspected oak trees along the provincial roads. The dataset contains 45,984 unique oak trees, each with a unique coordinate and one of the following species: oak, Turkey oak, European oak, Irish oak, Cypress oak and American oak. A small set of 512 records contains no value for species; these records are excluded from the dataset, leaving a total of 45,472 oak trees.

Municipality Amsterdam The municipality of Amsterdam provides a dataset of the trees under its maintenance. The dataset contains 14,856 unique trees of 69 different oak species.

4.2.5 PREDATORS

The predator dataset is scraped from the website https://www.waarneming.nl. The website is a collaboration of volunteers and institutions, and openly shares information about species occurrences. Occurrence data is collected using crowdsourcing, and is evaluated by experts and automated verification tests.

We scraped a total of 9,000 pages for the bird species great tit, with observation dates ranging from January 2016 till December 2020. This resulted in 433,960 occurrences of the great tit over a period of 4 years. We decided to only include occurrence data from March till June, as the great tit feeds on OPMs within this period, leaving a total of 140,945 occurrences.

4.3 DATA TRANSFORMATION

This section describes the data transformations performed to obtain the BIOCLIM variables and a uniform spatial scale.

4.3.1 BIOCLIM VARIABLES

The BIOCLIM variables are calculated using the pseudocode below (algorithm 1). The algorithm takes as input a set of unknown polygons P, the coordinates wsc of a set of known points, namely the KNMI weather stations (section 4.2.2), and the weather stations’ meteorological data wsmd. Then, for each unknown polygon, the BIOCLIM variables are calculated using interpolation.

Interpolation is performed using the functions Centroid, NearestStations, InterpolateTemperature and InterpolatePrecipitation.

First, the geometric center of mass of the unknown polygon p is calculated using the function Centroid(p). The function splits the polygon into triangles and calculates the polygon’s centroid as the mean of the triangle centroids, weighted by triangle area¹.

Next, the function NearestStations(n, wsc, centroid) takes as input the number of nearest neighbors n, the known points wsc and the polygon’s centroid centroid. With n = 3, as in algorithm 1, the function calculates and returns the 3 nearest weather stations nws. The distance is measured using the Haversine formula, which takes the earth’s curvature into account and is therefore more accurate than, e.g., the Euclidean distance.

Next, the temperature is interpolated for the unknown polygon p using the function InterpolateTemperature(nws, wsmd, centroid). The function takes as input the polygon’s nearest weather stations nws, the weather stations’ meteorological data wsmd and the polygon’s centroid centroid. Using inverse distance weighting, the interpolated temperature temperature is then calculated and returned. The distance is again measured using the Haversine formula and, for simplicity, temperature refers to the minimum, maximum and average temperature.

Then, the precipitation is interpolated for the unknown polygon p using the function InterpolatePrecipitation(nws, wsmd, centroid), which takes the same inputs. Using inverse distance weighting, the interpolated precipitation is calculated and returned. Once more, the distance is measured using the Haversine formula.

Finally, the BIOCLIM variables are calculated for the unknown polygon p using the function BIOCLIM(precipitation, temperature). The function takes as input the unknown polygon’s interpolated temperature and precipitation. Using these input values and the equations described in Appendix A.1, the BIOCLIM variables are calculated for p.

¹ PostGIS function geography_centroid_from_mpoly

Algorithm 1: BIOCLIM variables

input : polygons P, weather stations coordinates wsc, weather stations meteorological data wsmd
output: BIOCLIM variables for each element of P

for p in P do
    centroid ← Centroid(p)
    nws ← NearestStations(3, wsc, centroid)
    temperature ← InterpolateTemperature(nws, wsmd, centroid)
    precipitation ← InterpolatePrecipitation(nws, wsmd, centroid)
    bioclim ← BIOCLIM(precipitation, temperature)
end
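The interpolation steps of algorithm 1 can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the station coordinates and temperatures are hypothetical, the Earth radius is an assumed constant, and the distance exponent `power=2` is an assumption, since the text does not state which exponent is used for inverse distance weighting.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_stations(n, wsc, centroid):
    """Return the n stations closest to the centroid as (distance_km, station_id)."""
    lat, lon = centroid
    distances = [(haversine_km(lat, lon, s_lat, s_lon), sid)
                 for sid, (s_lat, s_lon) in wsc.items()]
    return sorted(distances)[:n]


def interpolate(nws, values, power=2):
    """Inverse distance weighted average of the station values."""
    weighted_sum = 0.0
    weight_total = 0.0
    for distance, sid in nws:
        if distance == 0:
            return values[sid]  # centroid coincides with a station
        weight = 1.0 / distance ** power
        weighted_sum += weight * values[sid]
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical station coordinates (lat, lon) and average temperatures.
wsc = {"A": (52.10, 5.18), "B": (52.32, 4.79), "C": (52.06, 5.87), "D": (53.12, 6.58)}
avg_temperature = {"A": 11.2, "B": 11.0, "C": 11.5, "D": 10.4}

centroid = (52.37, 4.90)  # hypothetical neighbourhood centroid
nws = nearest_stations(3, wsc, centroid)
temperature = interpolate(nws, avg_temperature)
```

Because the weights are positive and normalized, the interpolated value always lies between the lowest and highest value of the selected stations.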

4.3.2 SPATIAL SCALE

The geographic boundaries of the neighbourhoods, as defined by the Dutch national mapping agency, Kadaster, are the selected spatial scale for our study. An alternative method, known as rasterizing, converts the geographic space into a grid of equally sized areas. This alternative, however, results in less interpretable areas, as an area loses its label, being the neighbourhood name.

Spatial scale matching All collected data is transformed to the spatial scale of the neighbourhoods. For the OPM, all OPM nests within a neighbourhood are counted and divided by the neighbourhood’s surface area (km2), resulting in the number of OPMN-per-km2. This division accounts for differences in neighbourhood size. The oak tree data is transformed in a similar way: the number of oak trees within a neighbourhood is counted and divided by the neighbourhood’s surface area (km2), resulting in the number of oak-trees-per-km2. The predator data is transformed likewise: the number of great tits within a neighbourhood is counted and divided by the neighbourhood’s surface area (km2), resulting in great-tits-per-km2. The neighbourhood’s soil type is determined by selecting the most abundant soil type, by m2, intersecting with the neighbourhood’s polygon. No transformation is needed for the BIOCLIM data, as these variables are already calculated per neighbourhood.
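The scale matching described above can be sketched as follows. This is an illustrative sketch only: it assumes that each observation has already been assigned to a neighbourhood (for example by a point-in-polygon query), and the function names, neighbourhood labels and numbers are hypothetical.

```python
from collections import Counter, defaultdict


def density_per_km2(observation_hoods, area_km2):
    """Count observations per neighbourhood and divide by its area in km2."""
    counts = Counter(observation_hoods)
    return {hood: counts.get(hood, 0) / area for hood, area in area_km2.items()}


def dominant_soil(intersections):
    """Return the soil type covering the largest total area (m2) of a neighbourhood."""
    totals = defaultdict(float)
    for soil, area_m2 in intersections:
        totals[soil] += area_m2
    return max(totals, key=totals.get)


# Hypothetical neighbourhood areas and the neighbourhood of each observed nest.
area_km2 = {"hood_1": 2.0, "hood_2": 0.5, "hood_3": 1.0}
nest_hoods = ["hood_1", "hood_1", "hood_1", "hood_2"]
opmn_per_km2 = density_per_km2(nest_hoods, area_km2)

# Hypothetical soil polygons intersecting one neighbourhood: (soil type, area in m2).
soil = dominant_soil([("sand", 300.0), ("urban", 500.0), ("sand", 250.0)])
```

Neighbourhoods without any observation receive a density of 0, which matches the lower bounds reported for the density variables.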

4.4 DATA ORGANISATION

This section describes the organisation of the data and provides an overview of all collected variables.

4.4.1 DATASETS

We only consider the geographical spaces of Amsterdam and the province Gelderland, as we only have OPM data for these spaces. Furthermore, the Amsterdam data is split into separate datasets for the years 2018 and 2019. This split removes the temporal dimension from the Amsterdam data, resulting in less noise when comparing it to the Gelderland dataset. Additionally, by keeping a separate dataset for each OPM data source, standardization can be applied per source, reducing the effects of possibly inconsistent measurement methods. The datasets are summarized below; for each dataset, its identifier is specified, followed by a description of its variables.

• A-2018; Consists of all neighbourhoods within the geographical space of Amsterdam, and the variables collected for the year 2018, resulting in a total of 100 records.

• A-2019; Consists of all neighbourhoods within the geographical space of Amsterdam, and the variables collected for the year 2019, resulting in a total of 100 records.

• Gelderland; Consists of all neighbourhoods within the geographical space of Gelderland having at least 1 registered oak tree. This constraint ensures that only neighbourhoods along provincial roads are selected. Furthermore, the dataset consists of the variables collected for 2020, resulting in a total of 174 records.

4.4.2 OVERVIEW

This subsection concludes the chapter and provides an overview of the variables resulting from the data collection and transformation process (sections 4.2 and 4.3). All variables are listed in table 1, and for each dataset the variable’s minimum and maximum value are given.

| Name | Description | Type | Resolution | Values A-2018 | Values A-2019 | Values Gelderland |
|---|---|---|---|---|---|---|
| ID | Unique neighbourhood identifier | Identifier | Neighbourhood | - | - | - |
| BIO_1 | Annual Mean Temperature | Quantitative | Neighbourhood | [11.1, 11.4] | [11.1, 11.3] | [11.6, 12.2] |
| BIO_2 | Mean of monthly max(temp) - min(temp) | Quantitative | Neighbourhood | [7.5, 8.9] | [7.4, 8.6] | [8.3, 9.8] |
| BIO_3 | Isothermality (100 * BIO_2 / BIO_7) | Quantitative | Neighbourhood | [27.5, 30.4] | [33.7, 37.1] | [35.6, 38.8] |
| BIO_4 | Temperature Seasonality (std. dev. * 100) | Quantitative | Neighbourhood | [641.0, 652.7] | [540.5, 549.9] | [504.5, 525.9] |
| BIO_5 | Max Temperature of Warmest Month | Quantitative | Neighbourhood | [25.2, 26.7] | [23.1, 23.9] | [25.4, 27.3] |
| BIO_6 | Min Temperature of Coldest Month | Quantitative | Neighbourhood | [−2.5, −1.9] | [0.8, 1.4] | [1.1, 2.1] |
| BIO_7 | Temperature Annual Range (BIO_5 - BIO_6) | Quantitative | Neighbourhood | [27.0, 29.2] | [21.7, 23.1] | [23.3, 26.1] |
| BIO_8 | Mean Temperature of Wettest Quarter | Quantitative | Neighbourhood | [11.5, 15.7] | [7.8, 8.2] | [6.1, 12.3] |
| BIO_9 | Mean Temperature of Driest Quarter | Quantitative | Neighbourhood | [9.4, 14.6] | [6.5, 13.3] | [11.3, 13.8] |
| BIO_10 | Mean Temperature of Warmest Quarter | Quantitative | Neighbourhood | [17.8, 18.1] | [17.3, 17.6] | [17.1, 17.8] |
| BIO_11 | Mean Temperature of Coldest Quarter | Quantitative | Neighbourhood | [3.5, 3.8] | [5.8, 6.0] | [6.1, 6.7] |
| BIO_12 | Annual Precipitation | Quantitative | Neighbourhood | [423.9, 472.2] | [610.9, 654.3] | [80.2, 678.5] |
| BIO_13 | Precipitation of Wettest Month | Quantitative | Neighbourhood | [5.1, 65.1] | [6.0, 71.4] | [28.1, 140.2] |
| BIO_14 | Precipitation of Driest Month | Quantitative | Neighbourhood | [6.4, 7.5] | [18.5, 30.4] | [0.6, 7.5] |
| BIO_15 | Precipitation Seasonality | Quantitative | Neighbourhood | [1186.6, 1249.8] | [979.8, 1131.7] | [1139.1, 1534.5] |
| BIO_16 | Precipitation of Wettest Quarter | Quantitative | Neighbourhood | [116.6, 142.3] | [178.6, 194.4] | [140.0, 243.7] |
| BIO_17 | Precipitation of Driest Quarter | Quantitative | Neighbourhood | [44.8, 120.2] | [47.8, 146.2] | [56.7, 102.9] |
| BIO_18 | Precipitation of Warmest Quarter | Quantitative | Neighbourhood | [146.9, 174.5] | [193.1, 224.9] | [130.3, 231.1] |
| BIO_19 | Precipitation of Coldest Quarter | Quantitative | Neighbourhood | [132.5, 170.1] | [151.8, 204.5] | [206.4, 259.8] |
| Soil | Soil type | Qualitative | Neighbourhood | - | - | - |
| Oak_trees_per_km2 | Oak trees per km2 | Quantitative | Neighbourhood | [0, 629.7] | [0, 629.7] | [0, 118.2] |
| Great_tits_per_km2 | Great tits per km2 | Quantitative | Neighbourhood | [0, 36.8] | [0, 17.3] | [0, 123.3] |
| OPMN_per_km2 | Oak processionary moth nests per km2 | Quantitative | Neighbourhood | [0, 65.3] | [0, 71.7] | [0, 30.3] |

Table 1: Variables
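To make the definitions in table 1 concrete, the sketch below computes a subset of the BIOCLIM variables from twelve monthly values. It follows the descriptions in the table, not necessarily the exact equations of Appendix A.1. The function name, the input layout and the use of the population standard deviation for BIO_4 are assumptions, and the monthly values are hypothetical.

```python
import statistics


def bioclim_subset(tmin, tmax, tavg, prec):
    """Compute a few BIOCLIM variables from 12 monthly values per input list."""
    bio_1 = statistics.mean(tavg)                                    # annual mean temperature
    bio_2 = statistics.mean([hi - lo for hi, lo in zip(tmax, tmin)])  # mean monthly range
    bio_5 = max(tmax)                                                # max temp of warmest month
    bio_6 = min(tmin)                                                # min temp of coldest month
    bio_7 = bio_5 - bio_6                                            # temperature annual range
    return {
        "BIO_1": bio_1,
        "BIO_2": bio_2,
        "BIO_3": 100 * bio_2 / bio_7,               # isothermality
        "BIO_4": 100 * statistics.pstdev(tavg),     # temperature seasonality
        "BIO_5": bio_5,
        "BIO_6": bio_6,
        "BIO_7": bio_7,
        "BIO_12": sum(prec),                        # annual precipitation
    }


# Hypothetical monthly minimum, maximum and average temperatures and precipitation (mm).
tmin = [0, 0, 2, 4, 8, 11, 13, 13, 10, 7, 3, 1]
tmax = [6, 7, 10, 14, 18, 21, 23, 23, 19, 14, 9, 6]
tavg = [3, 3.5, 6, 9, 13, 16, 18, 18, 14.5, 10.5, 6, 3.5]
prec = [70, 55, 50, 40, 55, 65, 80, 85, 75, 80, 80, 75]

bioclim = bioclim_subset(tmin, tmax, tavg, prec)
```

The sketch also shows why BIO_5, BIO_6 and BIO_7 are structurally related: BIO_7 is computed directly from the other two.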

5 ANALYSIS

This chapter describes the characteristics of the collected data, specified in chapter 4. Section 5.1 describes the spatial characteristics by visualizing the data on the map of the Netherlands. Section 5.2 describes the relationships between the response- and predictor variables and the predictor variables themselves. Section 5.3 describes the selection process of the predictor variables.

5.1 DATA VISUALIZATION

This section describes the spatial aspect of the collected variables.

5.1.1 GEOGRAPHIC BOUNDS

Figure 8 visualizes this study’s selected spatial scale (section 4.3.2). The figure shows the geometries of the 3,268 neighbourhoods within the Netherlands. For each neighbourhood, its unique code, name, municipality, and area size (in km2) are registered.

Figure 8: Neighbourhood geographic bounds

5.1.2 OAK PROCESSIONARY MOTHS

In this section, the distributions of the OPMs are visualized for all datasets.

Amsterdam In figures 9 and 10, the OPM distributions for Amsterdam are visualized for the years 2018 and 2019. For both years, most of the OPMs are concentrated in the south of Amsterdam, at the location of the city park Amsterdamse Bos. Furthermore, when comparing the years 2018 and 2019, it can be observed that the OPMs are spreading towards the north-west, and that the number of OPMs has doubled in the south.

Figure 9: Distribution OPM in Amsterdam 2018

Figure 10: Distribution OPM in Amsterdam 2019

Gelderland In figure 11, the OPM distribution for Gelderland is visualized for the year 2020. Please note that only OPM nests along provincial roads are registered (figure 6).

Figure 11: Distribution OPM in Gelderland

The number of OPM nests per km2 in Gelderland is considerably lower than in Amsterdam. A concentration can be found roughly in the center of the province, where the national park De Hoge Veluwe is located.

5.1.3 OAK TREES

In this section, the distributions of the oak trees are visualized for all datasets.

Gelderland In figure 12, the distribution of oak trees is visualized for the province Gelderland. It can be observed that the concentration of oak trees per km2 is considerably lower than in Amsterdam. This, however, can be explained by the dataset only including oak trees along provincial roads (figure 6). Furthermore, a small concentration of oak trees can be observed in roughly the center of Gelderland. This concentration shares its location with the national park De Hoge Veluwe.

Figure 12: Distribution oak trees Gelderland

Amsterdam In figure 13 the distribution of oak trees for Amsterdam is visualized. The concentration of oak trees is located in the south, again sharing the same location as the city park Amsterdamse Bos. Furthermore, please note that the distribution of oak trees remained static between the years 2018 and 2019.

Figure 13: Distribution oak trees Amsterdam

5.1.4 GREAT TITS

In this section the distributions of the great tits are visualized for all datasets.

Amsterdam In figures 14 and 15, the great tit distributions for Amsterdam are visualized for the years 2018 and 2019. For 2018, the distribution is scattered roughly evenly over Amsterdam, whereas for 2019 there is a concentration in the center. Furthermore, comparing the year 2018 to 2019, a decline can be observed in the number of reported great tits.

Figure 14: Distribution great tits Amsterdam 2018

Figure 15: Distribution great tits Amsterdam 2019

Gelderland In figure 16, the great tit distribution for Gelderland is visualized for the year 2020. The concentration of great tits is located roughly in the center of Gelderland, again sharing its location with the national park De Hoge Veluwe.

Figure 16: Distribution great tits Gelderland

5.1.5 SOIL

In this section, the distributions of the soil polygons are visualized for all datasets.

Amsterdam In figure 17, the soil polygons for Amsterdam are visualized for the year 2006. Amsterdam, being a highly populated city, consists almost entirely of the urban soil type.

Figure 17: Soil polygons region Amsterdam

Gelderland In figure 18, the soil polygons for the province Gelderland are visualized for the year 2006. Gelderland consists for the most part of the soil types sand and clay. The river IJssel runs roughly along the central vertical axis, which can be linked to the clay soil type.

Figure 18: Soil polygons province Gelderland

5.1.6 BIOCLIM

In this section, the BIOCLIM variables are visualized for all datasets.

Amsterdam In figures 19a and 19b, the BIOCLIM variables are visualized for the neighbourhood Nieuwmarkt, a randomly selected sample. From this sample, it can be observed that the average temperature remained approximately equal between the years 2018 and 2019. However, in 2018, there is a greater difference between the maximum and minimum temperature. Furthermore, the minimum temperature in 2019 did not drop below 0 degrees Celsius. The yearly rainfall is considerably higher in 2019 than in 2018, an increase of 35%. Furthermore, 2018 contained an exceptionally dry month, with a total of only 7 mm of rainfall.

(a) Year 2018 (b) Year 2019

Figure 19: BIOCLIM variables Amsterdam

Gelderland In figure 20, the BIOCLIM variables are visualized for a randomly selected neighbourhood in Gelderland, for the year 2020. The temperature is roughly equal to that of Amsterdam in 2019; however, there is a large difference in the total amount of rainfall.

Figure 20: BIOCLIM variables Gelderland

5.1.7 CONCLUSION

Based on the visual inspections, we decided to exclude the soil data from further use. The polygons are too coarse grained, and therefore the soil types remain mostly constant for both Amsterdam and Gelderland.

5.2 EXPLORATORY DATA ANALYSIS

In this section, the relationship between the predictor variables and the dependent variable is analyzed. In subsection 5.2.1, scatter plots are used for inspecting the data for nonlinear relationships and outliers. In subsection 5.2.2 the variables’ Pearson correlation coefficient is studied using a correlation matrix.

5.2.1 SCATTER PLOTS

Amsterdam In figure 21, the predictor variables for the dataset Amsterdam are plotted for the year 2018. A strong positive linear relationship can be observed for the predictor variable oak-trees-per-km2, whereas no strong correlation can be observed for the other variables. Furthermore, the scatter plots show no apparent outliers or nonlinear relationships.

Figure 21: Scatter plots Amsterdam 2018

In figure 22, the predictor variables for the dataset Amsterdam are plotted for the year 2019. Again, a strong positive linear relationship can be observed for the predictor variable oak-trees-per-km2, whereas no strong correlation can be observed for the other variables. Furthermore, the scatter plots show no apparent outliers or nonlinear relationships.

Figure 22: Scatter plots Amsterdam 2019

Gelderland In figure 23, the predictor variables for the dataset Gelderland are plotted for the year 2020. Again, a positive linear relationship can be observed for the predictor variable oak-trees-per-km2, whereas no strong correlation can be observed for the other variables. Furthermore, the scatter plots show no apparent outliers or nonlinear relationships.

Figure 23: Scatter plots Gelderland

5.2.2 CORRELATION MATRIX

In this section, the relationships between the predictor variables themselves and the dependent variable are analyzed. For each dataset, a correlation matrix is constructed using the Pearson correlation coefficient.
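As an illustration of how such a matrix is built, the sketch below computes the Pearson correlation coefficient for every pair of variables. The variable names and values are hypothetical and only serve to show a structurally collinear variable (a range derived from a maximum and a minimum).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of columns."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}


# Hypothetical values for four neighbourhoods.
columns = {
    "temp_max": [25.2, 25.9, 26.3, 26.7],
    "temp_min": [-2.5, -2.3, -2.0, -1.9],
    "oak_trees_per_km2": [0.0, 12.5, 80.0, 310.0],
}
# A derived variable, as in the temperature range, is collinear by construction.
columns["temp_range"] = [hi - lo for hi, lo in zip(columns["temp_max"], columns["temp_min"])]

matrix = correlation_matrix(columns)
```

The matrix is symmetric with ones on the diagonal, so in practice only one triangle needs to be inspected.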

Figure 24: Correlation matrix Amsterdam 2018

Figure 25: Correlation matrix Amsterdam 2019

Figure 26: Correlation matrix Gelderland

Multicollinearity In figures 24, 25 and 26, it can be observed that the BIOCLIM variables, the first 19 variables, are affected by multicollinearity. Upon inspection of these variables, we concluded that both data-based and structural multicollinearity occur. The predictor variables temperature-range-max-year, temperature-min-year and temperature-max-year are affected by structural multicollinearity (figures 24 and 25). This can be observed in equation 10, where one variable can be derived from the two others. Data-based multicollinearity affects the predictor variables temperature-avg-coldest-quarter and temperature-avg-driest-quarter in the Amsterdam datasets. This is because the average temperature in the driest quarter equals that of the coldest quarter.

Oak trees The predictor variable oak-trees-per-km2 is the only variable having a consistently strong correlation with the dependent variable.

5.3 FEATURE SELECTION

This section describes the feature selection process of our study. In subsection 5.3.1 the method used for selecting the features is described. Subsection 5.3.2 describes the results acquired from the proposed feature selection method. Subsection 5.3.3 concludes this section, describing the selected features used for building the predictive models.

5.3.1 METHOD

We selected the techniques Subset selection and Shrinkage for the feature selection process [16]. Subset selection emphasizes the model’s accuracy, whereas Shrinkage emphasizes the features contributing to the fit of the model. Using two different techniques allows for comparing results and thus making a more informed decision. Furthermore, the independent variables are standardized, ensuring that the variables’ scale does not affect the feature selection process. Below we discuss the application of the selected techniques to the available predictor variables.

Subset selection Subset selection is performed using an ElasticNet model and the evaluation metric adjusted R2, which penalizes the model for adding extra predictor variables. Forward stepwise selection is used for selecting predictor variable subsets; it does not consider all possible subsets, which helps prevent overfitting [16]. Finally, cross validation with 4 folds is used to improve generalizability.
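The interplay between adjusted R2 and forward stepwise selection can be sketched as follows. This is a simplified illustration, not the procedure used in this study: the `score` callback stands in for fitting and cross-validating a model on a feature subset, and the per-feature R2 gains are hypothetical numbers chosen only to show the stopping behaviour.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2: penalizes R2 for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


def forward_stepwise(candidates, score):
    """Greedily add the feature that improves the score most; stop when none does."""
    selected = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining:
        trial_best, feature = max((score(selected + [f]), f) for f in remaining)
        if trial_best <= best_score:
            break  # no remaining feature improves the score
        selected.append(feature)
        remaining.remove(feature)
        best_score = trial_best
    return selected, best_score


# Hypothetical R2 contribution of each feature (assumed additive for illustration).
r2_gain = {"oak": 0.6, "rain": 0.05, "temp": 0.001}


def score(features, n_samples=75):
    """Stand-in for model fitting: adjusted R2 of the hypothetical subset."""
    r2 = sum(r2_gain[f] for f in features)
    return adjusted_r2(r2, n_samples, len(features))


selected, best = forward_stepwise(r2_gain, score)
```

In this toy setting the third feature raises R2 slightly but lowers adjusted R2, so the search stops after two features.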

Shrinkage Feature selection using the Shrinkage technique is performed by a Least absolute shrinkage and selection operator (LASSO) model. The model is trained by first determining the regularization parameter λ. This parameter is determined by searching a grid of values, each evaluated using cross validation with 4 folds. The value with the lowest cross-validated MSE is used for fitting the LASSO model on all available data [16].
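The grid search over the regularization parameter can be illustrated with a deliberately reduced sketch. It assumes a single centred predictor without intercept, for which the LASSO solution has a closed form (soft-thresholding), whereas the models in this study use multiple predictors. The fold assignment, the grid and the toy data are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def fit_lasso_1d(x, y, lam):
    """Slope minimizing (1/2n) * sum((y - b*x)^2) + lam * |b| for one predictor."""
    n = len(x)
    xy = sum(a * b for a, b in zip(x, y)) / n
    xx = sum(a * a for a in x) / n
    return soft_threshold(xy, lam) / xx


def cv_mse(x, y, lam, folds=4):
    """Mean squared error on the held-out fold, averaged over all folds."""
    n = len(x)
    fold_errors = []
    for fold in range(folds):
        test_idx = set(range(fold, n, folds))
        train_idx = [i for i in range(n) if i not in test_idx]
        slope = fit_lasso_1d([x[i] for i in train_idx], [y[i] for i in train_idx], lam)
        squared = [(y[i] - slope * x[i]) ** 2 for i in test_idx]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / folds


def select_lambda(x, y, grid, folds=4):
    """Return the grid value with the lowest cross-validated MSE."""
    return min(grid, key=lambda lam: cv_mse(x, y, lam, folds))


# Toy data following y = 2x exactly, so no shrinkage should be preferred.
x = [-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5]
y = [2 * v for v in x]
best_lambda = select_lambda(x, y, grid=[0.0, 0.5, 1.0])
```

With noisy data and many correlated predictors, a positive value of λ would typically win, and coefficients shrunk to exactly zero indicate features that can be dropped.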

5.3.2 RESULTS

LASSO A LASSO model is trained on each dataset separately using its best performing regularization hyper-parameter λ (details can be found in appendix B). In figure 27 the fitted predictor variables’ coefficients are visualized.

Figure 27: LASSO selected features

The models are consistent in fitting a high coefficient to the predictor variable oak-trees-per-km2. There is a large difference between this coefficient and the second highest fitted coefficient, that of rain-sum-coldest-quarter.

Subset selection In figure 28 the adjusted R2 score is plotted against the number of features for each dataset.

Figure 28: Subset selected features

For datasets Amsterdam-2018 and Amsterdam-2019 the subset selection method reaches its maximum adjusted R2 score using two or three features, being oak-trees-per-km2, rain-sum-driest-quarter and rain-sum-coldest-quarter for Amsterdam-2018, and oak-trees-per-km2 and rain-sum-wettest-quarter for Amsterdam-2019. The dataset for the province Gelderland reaches its maximum adjusted R2 score using 6 features, being oak-trees-per-km2, rain-sum-warmest-quarter, temperature-avg-year, temperature-std-year, rain-sum-coldest-quarter and isothermality. However, the adjusted R2 score for Gelderland is substantially lower than for the datasets Amsterdam-2018 and Amsterdam-2019.

5.3.3 CONCLUSION

The selected predictor variables are oak-trees-per-km2, temperature-avg-coldest-quarter, temperature-avg-driest-quarter, rain-sum-driest-quarter and rain-sum-coldest-quarter. Both subset selection and shrinkage selected oak-trees-per-km2 as an important variable. Furthermore, the LASSO models of the datasets A-2018 and Gelderland indicated that rain-sum-driest-quarter and rain-sum-coldest-quarter are of importance. We also decided to include the temperature-related predictor variables temperature-avg-coldest-quarter and temperature-avg-driest-quarter, as we believe that temperature does affect the OPM population (section 4.1.1).

6 MODELS

This chapter describes the scaling, selection, training, evaluation, and our conclusion of the constructed statistical models. Section 6.1 describes the scaling of predictor variables, increasing the interpretability of the models. Section 6.2 describes the selection of algorithms used for constructing a statistical model. Section 6.3 describes the construction of the models. Section 6.4 describes the evaluation process of the constructed models. Section 6.5 describes the predictions of the constructed models and the evaluation thereof.

6.1 SCALING

The predictor variables are standardized, which makes the models’ coefficients easier to interpret. Each variable is standardized by subtracting its mean value and dividing by its standard deviation.
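A minimal sketch of this standardization is given below. Whether the population or the sample standard deviation is used is not stated in the text; the sketch assumes the population standard deviation, and the example values are hypothetical.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical oak tree densities for five neighbourhoods.
oak_trees_per_km2 = [0.0, 12.5, 40.0, 118.2, 629.7]
scaled = standardize(oak_trees_per_km2)
```

After standardization each variable has mean 0 and standard deviation 1, so coefficients of different predictors become comparable. A variable that is constant within a dataset has a standard deviation of 0 and would need to be handled separately.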

6.2 ALGORITHM SELECTION

We selected the algorithms GLM, ANN and Random Forest for constructing the statistical models. We base our selection of algorithms on the current state of the art (section 3.2). We decided not to include GAMs, as the scatter plots show no apparent nonlinear relationships for their smoothing functions to capture (section 5.2.1).

6.3 MODEL BUILDING

This section describes the construction of statistical models. The section is organised in a subsection for each selected algorithm (section 6.2). For each model, its hyper-parameters, evaluation metric and fitting results are described.

Training- and test set All models are fitted and evaluated (section 6.4) using the same training and test data. For each dataset (section 4.4.1), 75 percent is used for the training set, and the remaining 25 percent is used for the test set. The sets are constructed by random sampling, using a fixed seed for replicability. In table 2, an overview is given of the number of observations per training- and test set, after splitting the dataset.

| Dataset | Training observations | Test observations |
|---|---|---|
| A-2018 | 75 | 25 |
| A-2019 | 75 | 25 |
| Gelderland | 130 | 44 |

Table 2: Training and test observations
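The split sizes in table 2 follow from a 75/25 split with the training size rounded down. The sketch below shows one way to make such a split reproducible; the seed value 42 and the function name are assumptions, not the values used in this study.

```python
import random


def train_test_split(records, train_fraction=0.75, seed=42):
    """Shuffle the records with a fixed seed and split them into two sets."""
    rng = random.Random(seed)  # local generator, so the global state is untouched
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)  # rounds down
    return shuffled[:n_train], shuffled[n_train:]


# Record counts of the three datasets (section 4.4.1).
amsterdam_train, amsterdam_test = train_test_split(range(100))
gelderland_train, gelderland_test = train_test_split(range(174))
```

Calling the function again with the same seed returns the same split, which allows all models to be fitted and evaluated on identical data.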

Generalized linear model The response variable of the GLMs is modeled using the Poisson probability distribution. The log function is used as the link function between the linear predictor and the response variable. The coefficients of the linear predictor are determined using maximum likelihood estimation. In appendix C.1, we verify that the response variable is distributed according to the Poisson distribution and provide details of the GLM’s fitting results.
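Written out, a Poisson GLM with log link takes the following general form. The notation is ours and is given for illustration: y_i denotes the response of neighbourhood i, x_{ij} the j-th of the five standardized predictor variables, and β the coefficients obtained by maximizing the log-likelihood ℓ.

```latex
y_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \beta_0 + \sum_{j=1}^{5} \beta_j x_{ij}, \qquad
\ell(\beta) = \sum_{i=1}^{n} \left( y_i \log \mu_i - \mu_i \right) + \text{const}
```

Because of the log link, a one-unit increase in x_{ij} multiplies the expected response μ_i by exp(β_j), and the predicted value is always positive.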

Random Forest The values for the Random Forest’s hyper-parameters are selected using hyper-parameter tuning. The values are randomly sampled from a grid of values and the constructed models are evaluated using the MAE. The evaluation metric MAE is chosen over the popular metric MSE because the response variable is exponentially distributed and the MSE penalizes outliers more heavily. In appendix C.2, we provide details of the hyper-parameter tuning process.

Artificial Neural Network The neural network’s architecture is illustrated in figure 29. The input layer consists of 5 neurons, equal to the number of selected predictor variables. The number of hidden layers is variable and ranges from 0 to 5. Each hidden layer consists of 5 neurons, allowing the predictor variables to be modeled to the n-th power. The output layer consists of 1 neuron. The weighted sum of its inputs is used as input for the rectified linear unit (ReLU) function, which ensures that the predicted value is non-negative.

Figure 29: Neural network architecture

The ReLU is also used as the activation function of the neurons. The ReLU is commonly used in neural networks and is computationally inexpensive. Again, the evaluation metric MAE is chosen over the metric MSE, as it is less sensitive to outliers. Hyper-parameter tuning is used for selecting the number of hidden layers and the regularization parameter. Details of the hyper-parameter tuning process are provided in appendix C.3.
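The forward pass of such a network can be sketched in a few lines. The sketch is illustrative only: the weights are hypothetical rather than trained, and the bias terms are an assumption, since the text does not mention them.

```python
def relu(z):
    """Rectified linear unit: negative inputs are mapped to 0."""
    return max(0.0, z)


def dense(inputs, weights, biases):
    """One fully connected layer with ReLU activation."""
    return [relu(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def predict(x, hidden_layers, output_layer):
    """Pass x through 0 or more hidden layers and a single ReLU output neuron."""
    activation = x
    for weights, biases in hidden_layers:
        activation = dense(activation, weights, biases)
    out_weights, out_bias = output_layer
    return dense(activation, [out_weights], [out_bias])[0]


# Hypothetical weights: one hidden layer of 5 neurons that copies its input.
identity = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
hidden_layers = [(identity, [0.0] * 5)]
output_layer = ([1.0] * 5, 0.0)

prediction = predict([1.0, 2.0, 3.0, 4.0, 5.0], hidden_layers, output_layer)
```

With zero hidden layers the network reduces to a linear model followed by the ReLU, which is why the output can never be negative.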

6.4 MODEL EVALUATION

The models are evaluated based on their deviation from the actual population in the test dataset defined in section 6.3. This evaluation is consistent with our research objective and the model assessment phase of our chosen framework (section 3.1.1). For measuring the deviation of the predicted populations, the evaluation metric MAE is used, as the MAE is less sensitive to outliers than the MSE.
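The difference in outlier sensitivity between the two metrics can be seen in a small worked example. The observed and predicted values below are hypothetical; the last neighbourhood plays the role of an outlier.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical nests per km2: three typical neighbourhoods and one outlier.
y_true = [0, 1, 2, 60]
y_pred = [1, 1, 1, 10]

print(mae(y_true, y_pred))  # 13.0
print(mse(y_true, y_pred))  # 625.5
```

The single error of 50 contributes 2,500 of the 2,502 squared error units, so it determines the MSE almost entirely, while it enters the MAE only linearly.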

6.5 RESULTS

In this section, the constructed statistical models are evaluated using the method described in section 6.4. For each dataset, the predicted distributions’ spatial aspect is analyzed, followed by a quantitative evaluation of their performance.

6.5.1 SPATIAL ANALYSIS

Amsterdam 2018 In figure 30, Amsterdam’s actual and predicted OPMN distributions are shown for the year 2018. The models are consistent in predicting the OPMN concentration, which is located roughly in the center of Amsterdam. The GLM and Random Forest models (figures 30b, 30c) tend to overestimate the OPMN population, whereas the Neural network model (figure 30d) tends to underestimate the population.

(a) True distribution (b) GLM

(c) Random forest (d) Neural network

Figure 30: Predicted OPMN distribution Amsterdam 2018

Amsterdam 2019 In figure 31, Amsterdam’s actual and predicted OPMN distributions are shown for the year 2019. The models seem to have difficulty estimating high concentrations of OPMNs, with the Random Forest model (figure 31c) as the exception.

(a) True distribution (b) GLM

(c) Random forest (d) Neural network

Figure 31: Predicted OPMN distribution Amsterdam 2019

Gelderland In figure 32, Gelderland’s actual and predicted OPMN distributions are shown for the year 2020. The Neural Network model predicts a constant value of 0.31 for all neighbourhoods. None of the models seems able to predict both low and high concentrations of OPMNs.

(a) True distribution (b) GLM

(c) Random forest (d) Neural network

Figure 32: Predicted OPMN distribution Gelderland

6.5.2 TEST ERROR

In table 3, the test errors of the models are shown using the evaluation metric MAE. No single model consistently achieves the lowest error; instead, each model achieves the lowest error on one dataset: the Random Forest on A-2018, the Neural network on A-2019 and the GLM on Gelderland.

| Model | A-2018 | A-2019 | Gelderland |
|---|---|---|---|
| GLM | 4.76 | 8.38 | 2.96 |
| Random forest | 3.32 | 7.10 | 3.24 |
| Neural network | 3.55 | 6.36 | 3.21 |

Table 3: Model test errors

7 CONCLUSION

7.1 CONCLUDING SUMMARY

In this work, we focus on enabling organisations to make informed decisions concerning the population distribution of OPMs. To the best of our knowledge, we are the first to estimate the OPM’s population size using statistical modeling. Accordingly, we attempted, to the best of our ability, to answer the research question: How can we use statistical modeling to predict the OPM population for a given area?

Literature review By conducting a literature study, we answered RQ1: What are known techniques for predicting the OPM population? We concluded that two fundamentally different approaches exist: the mechanistic approach, which determines a species’ ecological requirements in a controlled environment, and the correlative approach, which assumes that a species’ current environment is a good estimate of its ecological requirements. The mechanistic approach, however, requires detailed biophysical data of a species, which, to the best of our knowledge, is not available for the OPM. Therefore, we concluded that the correlative approach, and thereby the Species Distribution Modeling framework, fits our problem context best. Furthermore, we continued our literature study to answer RQ2: Which statistical modeling methods are suitable for predicting the OPM population? We found that the statistical models Generalized Linear Models, Generalized Additive Models, Decision trees and Neural networks are the current state of the art within the SDM framework.

Predictor variables Following the conceptualization phase of the SDM framework, we defined a conceptual model of factors that are hypothesized to affect the OPM population (section 4.1.1). In the conceptual model we proposed that oak tree leaves are the OPM’s exclusive energy source, and that predators and climate also affect the OPM population. We used these factors to define a substantiated set of predictor variables: the BIOCLIM variables, a popular set of variables used to describe the climate of a geographical space; Ellenberg’s Indicator Values, a set of variables to describe the climate and soil conditions; and count data of the OPM’s predators. By identifying these variables, we answered our last research question: Which predictor variables are suitable for predicting the OPM population?

Analysis In chapter 5 we analyzed the spatial aspect of the predictor variables. We concluded that the soil polygons are too coarse-grained, so that soil types remain mostly unchanged across Amsterdam and Gelderland. Therefore, we decided to exclude the soil data from further use within our study. Additionally, we analyzed the statistical significance of the predictor variables. We concluded that the independent variable oak-trees-per-km2 has a strong correlation with our dependent variable. The BIOCLIM variables and the biotic interaction variable, great-tits-per-km2, showed no strong correlation. However, we decided to include the temperature-related predictor variables temperature-avg-coldest-quarter and temperature-avg-driest-quarter, as we have good reason to assume that temperature does affect the OPM population (section 4.1.1).

Results In chapter 6 we constructed statistical models for the municipality of Amsterdam and the province of Gelderland. Overall, the constructed models seem to have difficulty predicting areas with exceptionally high or low OPM populations.

7.2 CONTRIBUTIONS

In this thesis, we defined multiple models to predict the OPM population within a given area. Additionally, we defined and evaluated our conceptual model of factors affecting the OPM population. The predictive models may be used as a baseline for future research, in which our conceptual model can be further extended.

Furthermore, along our research we accomplished the following:

• The BIOCLIM variables, a common set of variables within ecological modeling studies, were computed for the past 20 years for all 3,268 neighbourhoods within the Netherlands.

• We integrated the BIOCLIM, oak tree, OPM, great tit and soil datasets into a single RDBMS, resulting in a database containing: the BIOCLIM variables for the past 20 years for all neighbourhoods within the Netherlands, 6,060 oak processionary moth nest observations, 60,840 oak tree observations, 17,340 soil polygons covering the Netherlands in its entirety, and 433,960 great tit observations.

• We built an API², allowing other researchers to easily access our collected data.

² https://thesis.scholtens.io

7.3 LIMITATIONS AND FUTURE WORK

A limitation of our chosen framework, Species Distribution Modeling, is the assumption that the species is in (quasi-)equilibrium with contemporary environmental conditions [11], [14]. It could be, however, that the OPM is currently moving towards a more suitable environment, as suggested by Groenen et al. [13].

Furthermore, several concerns exist regarding our data. Firstly, our dependent variable, the number of oak processionary moth nests, does not perfectly represent the OPM population, as the number of OPMs can vary per nest while each nest still counts as one. Secondly, the oak tree data, our most relevant predictor variable, was not collected independently from the oak processionary moth data. Additionally, the dataset provided for the province of Gelderland only takes into account oak trees along provincial roads, and is therefore not an exact representation of the OPM population within a neighbourhood.

Finally, due to the limited availability of OPM occurrence data, several modeling components are missing. Firstly, in our constructed models, we have not accounted for spatial autocorrelation between adjacent neighbourhoods. Secondly, we have not included a temporal aspect, i.e., how the OPM population develops over time. Therefore, we strongly recommend gathering detailed (both spatial and temporal) data so that future research can take these aspects into account.

References

[1] OPM manual 4 - biology and life-cycle - Forest Research.

[2] Vogels en insecten ultiem wapen tegen eikenprocessierups | Vogelbescherming.

[3] Arnold van Vliet. (65) De eikenprocessierups: waarom komen we er maar niet vanaf? - YouTube.

[4] M P Austin. Searching for a model for use in vegetation analysis. Vegetatio, 42(1):11–21, 1980.

[5] M. P. Austin. Spatial prediction of species distribution: An interface between ecological theory and statistical modelling. Ecological Modelling, 157(2-3):101–118, 11 2002.

[6] D. Benkert. Ellenberg, Heinz, Zeigerwerte der Gefäßpflanzen Mitteleuropas. Scripta Geobotanica IX. 97 S., Verlag Erich Goltze KG, Göttingen, 1974. Kartoniert, DM 17,-. Feddes Repertorium, 87(1-2):161–162, 4 2008.

[7] Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and regression trees. 1984.

[8] P. P. Feeny. Effect of oak leaf tannins on larval growth of the winter moth Operophtera brumata. Journal of Insect Physiology, 14(6):805–817, 6 1968.

[9] Stephen E. Fick and Robert J. Hijmans. WorldClim 2: new 1-km spatial resolution climate surfaces for global land areas. International Journal of Climatology, 37(12):4302–4315, 10 2017.

[10] Rebecca E. Forkner, Robert J. Marquis, and John T. Lill. Feeny revisited: Condensed tannins as anti-herbivore defences in leaf-chewing herbivore communities of Quercus. Ecological Entomology, 29(2):174–187, 4 2004.

[11] Janet Franklin and Jennifer A. Miller. Mapping species distributions: Spatial inference and prediction. Cambridge University Press, 1 2010.

[12] J.J. Fransen. Leidraad beheersing eikenprocessierups. Technical report, NVWA - Alterra, 2013.

[13] Frans Groenen and Nicolas Meurisse. Historical distribution of the oak processionary moth Thaumetopoea processionea in Europe suggests recolonization instead of expansion. Agricultural and Forest Entomology, 14(2):147–155, 5 2012.

[14] Antoine Guisan and Niklaus E. Zimmermann. Predictive habitat distribution models in ecology. Ecological Modelling, 135(2-3):147–186, 12 2000.

[15] G. E. Hutchinson. Concluding Remarks. Cold Spring Harbor Symposia on Quantitative Biology, 22(0):415–427, 1 1957.

[16] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. Springer Texts in Statistics An Introduction to Statistical Learning. Technical report.

[17] M. Kearney. Habitat, environment and niche: What are we modelling?, 10 2006.

[18] Michael Kearney and Warren Porter. Mechanistic niche modelling: Combining physiological and spatial data to predict species’ ranges. Ecology Letters, 12(4):334–350, 4 2009.

[19] P. McCullagh and J.A. Nelder. Generalized Linear Models. Chapman and Hall, London, 2nd edition, 1989.

[20] Michael S. O'Donnell and Drew A. Ignizio. Bioclimatic Predictors for Supporting Ecological Applications in the Conterminous United States. Data Series 691. Technical report.

[21] Larkin Powell and George Gale. Estimation of Parameters for Populations: a primer for the rest of us. 8 2015.

[22] Alain Roques and Andrea Battisti. Introduction. In Processionary Moths and Climate Change: An Update, pages 1–13. Springer Netherlands, 1 2015.

[23] Margriet van Asch and Marcel E. Visser. Phenology of Forest Caterpillars and Their Host Trees: The Importance of Synchrony. Annual Review of Entomology, 52(1):37–55, 1 2007.

[24] WUR-Alterra. Dataset Grondsoortenkaart van Nederland 2006, 2006.

[25] Damaris Zurell, Janet Franklin, Christian König, Phil J. Bouchet, Carsten F. Dormann, Jane Elith, Guillermo Fandos, Xiao Feng, Gurutzeta Guillera-Arroita, Antoine Guisan, José J. Lahoz-Monfort, Pedro J. Leitão, Daniel S. Park, A. Townsend Peterson, Giovanni Rapacciuolo, Dirk R. Schmatz, Boris Schröder, Josep M. Serra-Diaz, Wilfried Thuiller, Katherine L. Yates, Niklaus E. Zimmermann, and Cory Merow. A standard protocol for reporting species distribution models. Ecography, 43(9):1261–1277, 9 2020.

A DATA COLLECTION

A.1 BIOCLIM

A.1.1 VARIABLES

The described variables are taken directly from the United States Geological Survey's publication about BIOCLIM variables [20].

Notation All values are defined according to the metric system, being Celsius for temperature and millimeters for precipitation.

$i$ = Month

$T_{max}$ = Monthly mean of daily maximum temperatures

$T_{min}$ = Monthly mean of daily minimum temperatures

$T_{avg}$ = Monthly average temperature, $(T_{max} + T_{min})/2$

$PPT$ = Monthly total precipitation

$\sum_{i=1}^{12}$ = Summation of a climate measurement across all months within a given year

Description

Bio 1 - Annual Mean Temperature
The first BIOCLIM variable describes the yearly average temperature. As input it requires the average temperature of each month, $T_{avg}$. The annual mean approximates the total energy inputs for an ecosystem.

$$\mathrm{Bio}_1 = \frac{1}{12}\sum_{i=1}^{12} T_{avg_i} \tag{4}$$

Bio 2 - Annual Mean Diurnal Range
The mean of the monthly temperature ranges. As input it requires the monthly maximum temperatures $T_{max}$ and the monthly minimum temperatures $T_{min}$. This index can provide information on how a species reacts to temperature fluctuations.

$$\mathrm{Bio}_2 = \frac{1}{12}\sum_{i=1}^{12} \left(T_{max_i} - T_{min_i}\right) \tag{5}$$

Bio 3 - Isothermality
Quantifies how large the day-to-night temperature oscillations are relative to the summer-to-winter oscillations. Requires as input the BIOCLIM variables Bio 2 and Bio 7. Isothermality is generally useful for tropical, insular, and maritime environments.

$$\mathrm{Bio}_3 = \frac{\mathrm{Bio}_2}{\mathrm{Bio}_7} \cdot 100 \tag{6}$$

Bio 4 - Temperature Seasonality (Standard Deviation)
The temperature variation over a given year, based on the standard deviation of the monthly temperature averages $T_{avg}$. Requires as input the monthly average temperatures.

$$\mathrm{Bio}_4 = \mathrm{SD}\left(T_{avg_1}, \ldots, T_{avg_{12}}\right) \tag{7}$$

Bio 5 - Max Temperature of Warmest Month
The highest monthly maximum temperature within a given year. Requires as input the monthly maximum temperatures $T_{max}$. This information can be useful for examining whether species are affected by maximum temperature anomalies throughout the year.

$$\mathrm{Bio}_5 = \max\left(T_{max_1}, \ldots, T_{max_{12}}\right) \tag{8}$$

Bio 6 - Min Temperature of Coldest Month
The lowest monthly minimum temperature within a given year. Requires as input the monthly minimum temperatures $T_{min}$. This information can be useful for examining whether species are affected by minimum temperature anomalies throughout the year.

$$\mathrm{Bio}_6 = \min\left(T_{min_1}, \ldots, T_{min_{12}}\right) \tag{9}$$

Bio 7 - Annual Temperature Range
The difference between the highest monthly maximum and the lowest monthly minimum temperature of a given year. Requires as input the BIOCLIM variables Bio 5 and Bio 6. This information can be useful to examine whether species are affected by extreme temperature conditions.

$$\mathrm{Bio}_7 = \mathrm{Bio}_5 - \mathrm{Bio}_6 \tag{10}$$
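As a minimal sketch of equations (4) to (10), the temperature-based variables could be computed from twelve monthly values of Tmax and Tmin for one year and one location as follows. The function name `temperature_bioclim` and the sample temperatures are hypothetical, and the use of the population standard deviation for Bio 4 is an assumption; the convention of the source publication [20] should be checked before reuse.

```python
from statistics import mean, pstdev


def temperature_bioclim(tmax, tmin):
    """Compute Bio 1-7 from twelve monthly mean max/min temperatures (deg C)."""
    if len(tmax) != 12 or len(tmin) != 12:
        raise ValueError("expected 12 monthly values for tmax and tmin")
    # Monthly average temperature, (Tmax + Tmin) / 2.
    tavg = [(hi + lo) / 2 for hi, lo in zip(tmax, tmin)]
    bio1 = mean(tavg)                                   # eq. (4)
    bio2 = mean([hi - lo for hi, lo in zip(tmax, tmin)])  # eq. (5)
    bio4 = pstdev(tavg)   # eq. (7); assumption: population SD
    bio5 = max(tmax)      # eq. (8)
    bio6 = min(tmin)      # eq. (9)
    bio7 = bio5 - bio6    # eq. (10)
    bio3 = bio2 / bio7 * 100  # eq. (6)
    return {"bio1": bio1, "bio2": bio2, "bio3": bio3, "bio4": bio4,
            "bio5": bio5, "bio6": bio6, "bio7": bio7}


# Hypothetical monthly temperatures (January to December), not thesis data.
sample_tmax = [6, 7, 10, 14, 18, 21, 23, 23, 19, 15, 10, 7]
sample_tmin = [1, 1, 3, 5, 9, 12, 14, 14, 11, 8, 4, 2]
temperature_result = temperature_bioclim(sample_tmax, sample_tmin)
```

Bio 3 is computed last because it depends on Bio 2 and Bio 7, which mirrors the input requirements listed above.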

Bio 8 - Mean Temperature of Wettest Quarter
The mean temperature that prevails during the wettest quarter. Requires as input the monthly average temperature $T_{avg}$ and the monthly total precipitation $PPT$. This information can be useful to examine whether species are affected by seasonal occurrences.

$$\mathrm{Bio}_8 = \frac{1}{3}\sum_{i=1}^{3} T_{avg_i} \tag{11}$$

where the three selected months are those of the wettest quarter.

Bio 9 - Mean Temperature of Driest Quarter
The mean temperature that prevails during the driest quarter. Requires as input the monthly average temperature $T_{avg}$ and the monthly total precipitation $PPT$. This information can be useful to examine whether species are affected by seasonal occurrences.

$$\mathrm{Bio}_9 = \frac{1}{3}\sum_{i=1}^{3} T_{avg_i} \tag{12}$$

where the three selected months are those of the driest quarter.

Bio 10 - Mean Temperature of Warmest Quarter
The mean temperature that prevails during the warmest quarter. Requires as input the monthly average temperature $T_{avg}$. This information can be useful to examine whether species are affected by seasonal occurrences.

$$\mathrm{Bio}_{10} = \frac{1}{3}\sum_{i=1}^{3} T_{avg_i} \tag{13}$$

where the three selected months are those of the warmest quarter.

Bio 11 - Mean Temperature of Coldest Quarter
The mean temperature that prevails during the coldest quarter. Requires as input the monthly average temperature $T_{avg}$. This information can be useful to examine whether species are affected by seasonal occurrences.

$$\mathrm{Bio}_{11} = \frac{1}{3}\sum_{i=1}^{3} T_{avg_i} \tag{14}$$

where the three selected months are those of the coldest quarter.

Bio 12 - Annual Precipitation
The sum of all monthly precipitation values. Requires as input the total precipitation per month $PPT_i$. This information can be used to approximate the total water inputs and thereby the importance of water availability for a species.

$$\mathrm{Bio}_{12} = \sum_{i=1}^{12} PPT_i \tag{15}$$

Bio 13 - Precipitation of Wettest Month
Total precipitation for the wettest month. Requires as input the total precipitation per month $PPT$. This information can be used to examine if extreme precipitation conditions affect species.

$$\mathrm{Bio}_{13} = \max\left(PPT_1, \ldots, PPT_{12}\right) \tag{16}$$

Bio 14 - Precipitation of Driest Month
Total precipitation for the driest month. Requires as input the total precipitation per month $PPT$. This information can be used to examine if extreme precipitation conditions affect species.

$$\mathrm{Bio}_{14} = \min\left(PPT_1, \ldots, PPT_{12}\right) \tag{17}$$

Bio 15 - Precipitation Seasonality
This index is the ratio of the standard deviation of the monthly total precipitation to the mean monthly total precipitation. Requires as input the monthly total precipitation $PPT$. This information can be used to examine if species are affected by variation in precipitation. The 1 in the denominator is used as a smoothing factor for the case in which no rainfall has occurred.

$$\mathrm{Bio}_{15} = \frac{\mathrm{SD}\left(PPT_1, \ldots, PPT_{12}\right)}{1 + \mathrm{Bio}_{12}/12} \cdot 100 \tag{18}$$
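The monthly precipitation variables (Bio 12 to Bio 15) can be illustrated with a short sketch. It takes the annual precipitation as the sum of the twelve monthly totals, as stated in the description of Bio 12, so that Bio12/12 in equation (18) is the monthly mean. The function name `precipitation_bioclim`, the sample values, and the choice of the population standard deviation are assumptions made for illustration only.

```python
from statistics import pstdev


def precipitation_bioclim(ppt):
    """Compute Bio 12-15 from twelve monthly precipitation totals (mm)."""
    if len(ppt) != 12:
        raise ValueError("expected 12 monthly precipitation totals")
    bio12 = sum(ppt)   # annual precipitation
    bio13 = max(ppt)   # wettest month
    bio14 = min(ppt)   # driest month
    # Coefficient of variation in percent; the +1 avoids division by zero
    # when no rainfall occurred. Assumption: population SD.
    bio15 = pstdev(ppt) / (1 + bio12 / 12) * 100
    return {"bio12": bio12, "bio13": bio13, "bio14": bio14, "bio15": bio15}


# Hypothetical monthly totals (January to December), not thesis data.
sample_ppt = [70, 55, 60, 45, 60, 65, 80, 85, 75, 80, 85, 80]
precipitation_result = precipitation_bioclim(sample_ppt)
dry_year_result = precipitation_bioclim([0] * 12)  # no rainfall at all
```

The all-zero input shows why the smoothing term is present: without it, Bio 15 would divide by zero.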

Bio 16 - Precipitation of Wettest Quarter
The total sum of precipitation during the wettest quarter. Requires as input the monthly total precipitation $PPT$. This information can be used to examine if species are affected by this environmental factor.

$$\mathrm{Bio}_{16} = \sum_{i=1}^{3} PPT_i \tag{19}$$

where the three selected months are those of the wettest quarter.

Bio 17 - Precipitation of Driest Quarter
The total sum of precipitation during the driest quarter. Requires as input the monthly total precipitation $PPT$. This information can be used to examine if species are affected by this environmental factor.

$$\mathrm{Bio}_{17} = \sum_{i=1}^{3} PPT_i \tag{20}$$

where the three selected months are those of the driest quarter.

Bio 18 - Precipitation of Warmest Quarter
The total sum of precipitation during the warmest quarter. Requires as input the monthly total precipitation $PPT$ and the monthly average temperature $T_{avg}$. This information can be used to examine if species are affected by this environmental factor.

$$\mathrm{Bio}_{18} = \sum_{i=1}^{3} PPT_i \tag{21}$$

where the three selected months are those of the warmest quarter.

Bio 19 - Precipitation of Coldest Quarter
The total sum of precipitation during the coldest quarter. Requires as input the monthly total precipitation $PPT$ and the monthly average temperature $T_{avg}$. This information can be used to examine if species are affected by this environmental factor.

$$\mathrm{Bio}_{19} = \sum_{i=1}^{3} PPT_i \tag{22}$$

where the three selected months are those of the coldest quarter.
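The quarter-based variables (Bio 8 to Bio 11 and Bio 16 to Bio 19) all depend on how "the three selected months" are chosen. The sketch below assumes that a quarter is any window of three consecutive months, including windows that wrap around from December to January, and that ties are resolved in favour of the earliest window. Both choices are assumptions, as is the function name `quarter_bioclim`; the appendix does not specify them.

```python
def quarter_bioclim(tavg, ppt):
    """Compute the quarter-based BIOCLIM variables from monthly Tavg and PPT."""
    if len(tavg) != 12 or len(ppt) != 12:
        raise ValueError("expected 12 monthly values for tavg and ppt")
    # Assumption: 12 rolling three-month windows, wrapping around the year.
    quarters = [tuple((start + k) % 12 for k in range(3)) for start in range(12)]
    q_ppt = {q: sum(ppt[i] for i in q) for q in quarters}        # quarterly totals
    q_tavg = {q: sum(tavg[i] for i in q) / 3 for q in quarters}  # quarterly means
    wettest = max(quarters, key=q_ppt.get)
    driest = min(quarters, key=q_ppt.get)
    warmest = max(quarters, key=q_tavg.get)
    coldest = min(quarters, key=q_tavg.get)
    return {
        "bio8": q_tavg[wettest], "bio9": q_tavg[driest],
        "bio10": q_tavg[warmest], "bio11": q_tavg[coldest],
        "bio16": q_ppt[wettest], "bio17": q_ppt[driest],
        "bio18": q_ppt[warmest], "bio19": q_ppt[coldest],
    }


# Hypothetical monthly values (January to December), not thesis data.
sample_tavg = [3.5, 4, 6.5, 9.5, 13.5, 16.5, 18.5, 18.5, 15, 11.5, 7, 4.5]
sample_ppt = [70, 55, 60, 45, 60, 65, 80, 85, 75, 80, 85, 80]
quarter_result = quarter_bioclim(sample_tavg, sample_ppt)
```

With these sample values the coldest quarter is December to February, which is only found because the windows wrap around the year boundary.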

B FEATURE SELECTION

B.1 LASSO

This section describes the tuning of the regularization hyper-parameter for the constructed LASSO models.

Method To select the regularization parameter, 100 different λ values are generated by SKLearn and validated using 4-fold cross-validation. Finally, for each model the λ with the lowest MSE is selected.

Results In figure 33 the MSE is plotted as a function of the regularization parameter λ for each of the datasets. The LASSO models for the datasets Amsterdam-2018 and Amsterdam-2019 scored their lowest MSE using a regularization parameter of λ = 0.733 and λ = 2.581, respectively. The LASSO model for dataset Gelderland scored its lowest MSE using λ = 0.049.
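The selection procedure can be sketched with scikit-learn's `LassoCV`, which by default generates a grid of 100 candidate values and picks the one with the lowest cross-validated MSE. This is an illustrative sketch on synthetic data, not the thesis code: the data, the random seed and the variable names are assumptions. Note that scikit-learn calls the regularization parameter `alpha` rather than λ.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic stand-in for a dataset with 5 predictor variables (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=80)  # only the first predictor matters

# cv=4 gives 4-fold cross-validation; the default grid has 100 candidates.
lasso = LassoCV(cv=4).fit(X, y)

# mse_path_ has one row per candidate and one column per fold.
mean_mse = lasso.mse_path_.mean(axis=1)
best_lambda = lasso.alpha_  # candidate with the lowest mean MSE
```

Plotting `mean_mse` against `lasso.alphas_` would give a curve of the kind shown in figure 33.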

Figure 33: LASSO regularization parameter. Panels: (a) Amsterdam 2018, (b) Amsterdam 2019, (c) Gelderland.

75 C MODELS

C.1 GENERALIZED LINEAR MODELS

In this section the fitting of GLMs for the datasets A-2018, A-2019 and Gelderland is described. First, a distribution plot is used to verify that the dependent variable is exponentially distributed. Next, details of the fitted GLMs are provided.

Distribution The GLMs are fitted using the Poisson distribution, as the response variable is count data for a given space. In figure 34 it can be observed that the dependent variable is exponentially distributed. Furthermore, it can be observed that dataset Gelderland has a more dispersed distribution than the datasets A-2018 and A-2019.

Figure 34: Distribution dependent variable. Panels: (a) Amsterdam 2018, (b) Amsterdam 2019, (c) Gelderland.

Fitting results Tables 4, 5 and 6 show the fitting results of the constructed models. All models consistently fit a large positive coefficient for the predictor variable oak-trees-per-km2, which is in line with the feature selection process (section 5.3). For the remaining variables, there is no consensus on whether their contribution is positive or negative.

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.4565 | 0.058 | 25.044 | 0.000 | 1.343 | 1.571 |
| oak_trees_per_km2 | 0.5004 | 0.024 | 20.860 | 0.000 | 0.453 | 0.547 |
| temperature_avg_coldest_quarter | -0.2893 | 0.275 | -1.053 | 0.293 | -0.828 | 0.249 |
| temperature_avg_driest_quarter | 0.2090 | 0.284 | 0.735 | 0.462 | -0.348 | 0.766 |
| rain_sum_coldest_quarter | 0.3206 | 0.098 | 3.284 | 0.001 | 0.129 | 0.512 |
| rain_sum_driest_quarter | -0.1342 | 0.052 | -2.572 | 0.010 | -0.236 | -0.032 |

Table 4: Fitted GLM Amsterdam 2018

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.2880 | 0.038 | 59.797 | 0.000 | 2.213 | 2.363 |
| oak_trees_per_km2 | 0.4606 | 0.018 | 26.013 | 0.000 | 0.426 | 0.495 |
| temperature_avg_coldest_quarter | -0.0684 | 0.324 | -0.211 | 0.833 | -0.703 | 0.566 |
| temperature_avg_driest_quarter | -0.0983 | 0.385 | -0.255 | 0.798 | -0.853 | 0.656 |
| rain_sum_coldest_quarter | -0.0640 | 0.093 | -0.689 | 0.491 | -0.246 | 0.118 |
| rain_sum_driest_quarter | 0.0977 | 0.036 | 2.716 | 0.007 | 0.027 | 0.168 |

Table 5: Fitted GLM Amsterdam 2019

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.4195 | 0.124 | -3.388 | 0.001 | -0.662 | -0.177 |
| oak_trees_per_km2 | 0.4032 | 0.055 | 7.379 | 0.000 | 0.296 | 0.510 |
| temperature_avg_coldest_quarter | 1.0121 | 0.134 | 7.526 | 0.000 | 0.749 | 1.276 |
| temperature_avg_driest_quarter | -0.6183 | 0.091 | -6.817 | 0.000 | -0.796 | -0.441 |
| rain_sum_coldest_quarter | -1.2863 | 0.228 | -5.629 | 0.000 | -1.734 | -0.838 |
| rain_sum_driest_quarter | 0.2748 | 0.153 | 1.802 | 0.072 | -0.024 | 0.574 |

Table 6: Fitted GLM Gelderland
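To read the fitted coefficients, it can help to translate them into expected nest counts. The sketch below assumes the GLMs use the log link, which is the canonical link for the Poisson distribution; the appendix does not state the link function or how the predictors were scaled, so both are assumptions. Under a log link, a one-unit increase in a predictor multiplies the expected count by exp(coef). The function name `expected_count` is hypothetical; the coefficients are copied from table 4.

```python
import math

# Coefficients copied from Table 4 (Amsterdam 2018).
coef = {
    "const": 1.4565,
    "oak_trees_per_km2": 0.5004,
    "temperature_avg_coldest_quarter": -0.2893,
    "temperature_avg_driest_quarter": 0.2090,
    "rain_sum_coldest_quarter": 0.3206,
    "rain_sum_driest_quarter": -0.1342,
}


def expected_count(predictors):
    """Expected count under an assumed log link: exp(const + sum(coef * x))."""
    eta = coef["const"] + sum(coef[name] * value for name, value in predictors.items())
    return math.exp(eta)


# Expected count when every predictor equals zero (in the model's own units).
baseline = expected_count({})
# Multiplicative effect of a one-unit increase in oak_trees_per_km2.
rate_ratio_oak = math.exp(coef["oak_trees_per_km2"])
one_unit_more_oaks = expected_count({"oak_trees_per_km2": 1.0})
```

Because the effect is multiplicative, `one_unit_more_oaks / baseline` equals `rate_ratio_oak` regardless of the values of the other predictors.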

C.2 RANDOM FOREST

In this section the selection of hyper-parameters for the Random Forest models is described. First, the method for selecting the hyper-parameters is described, followed by the results using this method. Finally, the selected hyper-parameters are discussed for each model.

Method The hyper-parameters are selected by traversing a grid space of hyper-parameter values (table 7), and evaluating the model's performance and complexity. The model's performance is determined by splitting the training data into 3 sets and calculating their average MAE. The model's complexity is determined by the number of decision trees, the tree depth, and the number of samples on split, where a higher value is related to a more complex model.

| Hyper-parameter | Description | Values |
|---|---|---|
| n-estimators | Number of decision trees | [200, 400, 600 , 30, 50] |
| max-depth | Maximum number of features per decision tree | [3, 4, 5] |
| min-samples-split | Minimum number of samples needed to split | [2, 5, 10] |

Table 7: Random forest hyper-parameter grid space

To compensate for the low number of training samples (table 2), we chose a high number of decision trees (n-estimators), ranging between 200 and 2000, and bootstrapping, in an effort for the training data to better estimate the true population. The maximum number of features to consider per decision tree and node varies between 3, 4, and 5 (all available features). Fewer than 5 features are considered because of the results of the feature selection (section 5.3), which showed that only oak-trees-per-km2 is a relevant predictor variable. The minimum number of samples to consider for each node varies between 5 and 10, which represents around 10 to 20 percent of the dataset with the fewest observations (dataset A-2018, section 4.4), and keeps the model from capturing all anomalies.
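The tuning procedure, a full traversal of the grid with the average MAE over 3 splits as the score, can be sketched in plain Python. Everything named here is hypothetical: `k_fold_mae` and `grid_search` are illustrative helpers, the interleaved fold assignment is an assumption, and a toy "shrunken mean" model stands in for the Random Forest so that the example stays self-contained.

```python
from itertools import product
from statistics import mean


def k_fold_mae(fit, xs, ys, k=3):
    """Average MAE over k folds; fit(train_x, train_y) returns a predict function."""
    fold_maes = []
    for fold in range(k):
        test_idx = set(range(fold, len(xs), k))  # assumption: interleaved folds
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        predict = fit(train_x, train_y)
        errors = [abs(predict(xs[i]) - ys[i]) for i in sorted(test_idx)]
        fold_maes.append(mean(errors))
    return mean(fold_maes)


def grid_search(make_fit, grid, xs, ys, k=3):
    """Score every combination in the grid and return results sorted by MAE."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((k_fold_mae(make_fit(**params), xs, ys, k), params))
    return sorted(results, key=lambda item: item[0])


def make_shrunken_mean(shrink):
    """Toy model: always predicts shrink * mean(training targets)."""
    def fit(train_x, train_y):
        prediction = shrink * mean(train_y)
        return lambda x: prediction
    return fit


# Hypothetical data and grid, for illustration only.
toy_x = [0, 1, 2, 3, 4, 5]
toy_y = [10, 12, 11, 13, 9, 11]
search_results = grid_search(make_shrunken_mean, {"shrink": [0.5, 1.0]}, toy_x, toy_y)
best_mae, best_params = search_results[0]
```

In practice `make_fit` would wrap a Random Forest regressor and the grid would hold the hyper-parameters of table 7; the first entries of `search_results` then correspond to the "top 10" listings in the tables that follow.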

Fitting results Tables 8, 9 and 10 show the top 10 results of the hyper-parameter tuning process. As anticipated, the lowest value for the parameter min-samples-split results in the lowest MAE, as it allows for modeling anomalies. Furthermore, for each dataset the MAE is in line with the dispersion of the dependent variable (tables 4, 5, 6), where e.g. dataset Gelderland has the smallest dispersion and correspondingly the lowest MAE.

| MAE | max_depth | min_samples_split | n_estimators |
|---|---|---|---|
| 5.45 | 3 | 2 | 1000 |
| 5.46 | 3 | 2 | 600 |
| 5.46 | 3 | 2 | 800 |
| 5.49 | 3 | 2 | 200 |
| 5.49 | 3 | 2 | 1800 |
| 5.49 | 3 | 2 | 1600 |
| 5.50 | 3 | 2 | 1200 |
| 5.50 | 3 | 2 | 1400 |
| 5.51 | 3 | 2 | 2000 |
| 5.53 | 5 | 2 | 800 |

Table 8: Random Forest hyper-parameter tuning Amsterdam 2018

| MAE | max_depth | min_samples_split | n_estimators |
|---|---|---|---|
| 9.19 | 3 | 10 | 400 |
| 9.28 | 5 | 10 | 1200 |
| 9.28 | 3 | 10 | 1200 |
| 9.29 | 3 | 10 | 1800 |
| 9.29 | 5 | 10 | 2000 |
| 9.30 | 3 | 10 | 2000 |
| 9.32 | 3 | 10 | 1600 |
| 9.32 | 5 | 10 | 1600 |
| 9.32 | 5 | 10 | 1400 |
| 9.33 | 5 | 10 | 400 |

Table 9: Random Forest hyper-parameter tuning Amsterdam 2019

| MAE | max_depth | min_samples_split | n_estimators |
|---|---|---|---|
| 1.13 | 1 | 10 | 200 |
| 1.13 | 1 | 2 | 200 |
| 1.13 | 1 | 10 | 600 |
| 1.13 | 1 | 5 | 2000 |
| 1.13 | 1 | 5 | 1800 |
| 1.13 | 1 | 2 | 600 |
| 1.13 | 1 | 5 | 800 |
| 1.13 | 1 | 2 | 1400 |
| 1.14 | 1 | 5 | 200 |
| 1.14 | 1 | 10 | 1200 |

Table 10: Random Forest hyper-parameter tuning Gelderland

Selected hyper-parameters For each constructed model, a trade-off is made between the model's complexity and its predictive performance. For all datasets, we chose the hyper-parameter values associated with the lowest MAE (tables 8, 9, 10). Table 11 summarizes the selected hyper-parameters of each model.

| Model | max_depth | min_samples_split | n_estimators |
|---|---|---|---|
| A-2018 | 3 | 2 | 1000 |
| A-2019 | 3 | 10 | 400 |
| Gelderland | 1 | 10 | 200 |

Table 11: Selected hyper-parameters Random Forest

C.3 NEURAL NETWORKS

In this section, the selection of hyper-parameters for the neural network models is described. First, the method for selecting the hyper-parameters is described, followed by the fitting results using this method. Finally, the selected hyper-parameters are discussed for each model.

Method The hyper-parameters are selected by randomly traversing a grid space of hyper-parameter values (table 12), and evaluating the model's performance and generalizability. The model's performance is determined by splitting the training data into 3 sets and calculating their mean MAE. The model's generalizability is determined by the number of hidden layers and the regularization parameter l1.

| Hyper-parameter | Description | Values |
|---|---|---|
| hidden-layers | Number of layers between input and output layer | [0, 1, 2, 3, 4] |
| regularization l1 | Regularization parameter for the activation function | [0.01, 0.03, 0.05] |

Table 12: Neural network hyper-parameter grid space

Different values are evaluated for the hyper-parameter hidden-layers, searching for a balance between the model's complexity and its MAE. The regularization hyper-parameter l1 is used to avoid over-fitting, which is accomplished by discarding neuron output values close to zero.
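Randomly traversing the grid, as opposed to evaluating every combination, can be sketched as sampling distinct combinations from the grid of table 12. The helper name `random_search_candidates`, the number of sampled candidates and the seed are assumptions made for illustration; the appendix does not state how many combinations were evaluated.

```python
import random
from itertools import product

# Grid values as listed in table 12.
nn_grid = {
    "hidden_layers": [0, 1, 2, 3, 4],
    "regularization_l1": [0.01, 0.03, 0.05],
}


def random_search_candidates(grid, n_iter, seed=0):
    """Sample up to n_iter distinct hyper-parameter combinations from the grid."""
    names = list(grid)
    combos = [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(combos, k=min(n_iter, len(combos)))


# Hypothetical budget of 10 evaluations out of the 15 possible combinations.
nn_candidates = random_search_candidates(nn_grid, n_iter=10)
```

Each sampled candidate would then be trained and scored by its mean MAE over the 3 splits, in the same way as for the Random Forest models.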

Fitting results Tables 13, 14 and 15 show the top 10 results of the hyper-parameter tuning process. Similar to the models constructed using the Random Forest algorithm, the MAE is in line with the dispersion of the dependent variable (tables 4, 5, 6). Furthermore, it can be observed that increasing the number of hidden layers has a limited effect on the model's performance.

| MAE | number_of_hidden_layers | regularization_l1 |
|---|---|---|
| 4.72 | 4 | 0.03 |
| 4.90 | 3 | 0.03 |
| 5.09 | 0 | 0.01 |
| 5.16 | 1 | 0.02 |
| 5.21 | 3 | 0.01 |
| 5.26 | 0 | 0.03 |
| 5.32 | 1 | 0.01 |
| 5.45 | 1 | 0.03 |
| 5.45 | 2 | 0.01 |
| 5.58 | 0 | 0.02 |

Table 13: Neural network hyper-parameter tuning Amsterdam 2018

| MAE | number_of_hidden_layers | regularization_l1 |
|---|---|---|
| 8.44 | 0 | 0.02 |
| 8.66 | 0 | 0.03 |
| 8.97 | 1 | 0.01 |
| 9.04 | 1 | 0.03 |
| 9.36 | 3 | 0.01 |
| 9.85 | 0 | 0.01 |
| 9.87 | 4 | 0.02 |
| 10.02 | 1 | 0.02 |
| 10.48 | 4 | 0.03 |
| 10.63 | 2 | 0.02 |

Table 14: Neural network hyper-parameter tuning Amsterdam 2019

| MAE | number_of_hidden_layers | regularization_l1 |
|---|---|---|
| 0.96 | 1 | 0.03 |
| 1.00 | 4 | 0.02 |
| 1.03 | 2 | 0.01 |
| 1.04 | 4 | 0.03 |
| 1.07 | 4 | 0.01 |
| 1.09 | 0 | 0.03 |
| 1.10 | 0 | 0.01 |
| 1.11 | 1 | 0.02 |
| 1.11 | 0 | 0.02 |
| 1.23 | 3 | 0.02 |

Table 15: Neural network hyper-parameter tuning Gelderland

Selected hyper-parameters For each model, a trade-off is made between its complexity and its performance. Increasing the number of hidden layers of a neural network model significantly reduces its interpretability. Therefore, we chose the neural network models with 0 hidden layers for the datasets A-2018 and A-2019, accepting a small decrease in the model's performance for dataset A-2018 (tables 13, 14). For the model constructed for dataset Gelderland, we chose 1 hidden layer, as 0 hidden layers would have a significant negative effect on the model's performance.

| Model | number_of_hidden_layers | regularization_l1 |
|---|---|---|
| A-2018 | 1 | 0.03 |
| A-2019 | 4 | 0.02 |
| Gelderland | 2 | 0.01 |

Table 16: Selected hyper-parameters neural network