Nature Template - PC Word 97 s8

1

Methods

1 “Model set up, calibration and validation”

In this study, a Random Forest algorithm (RF) (Breiman, 2001) implemented in the R environment (R Development Core Team, 2004) was employed to predict olive tree habitat suitability in past, present and future time periods over the Mediterranean basin using a set of bioclimatic indexes as predictor variables.

The RF classifier basically consists of a combination of decision trees where each classifier is generated using a random vector sampled (mtry) independently from the input vector in a bootstrapping procedure. The bootstrap sample is randomly split into two subsets of size 66%, which is used for training, and 33% (out-of-bag sample, OOB), which is used for internal testing. This feature reduces the problem of correlated variables because they may be extracted in turn, thus contributing independently to the aggregated tree model. The trees are fully grown until a final node is reached, and each is used to predict the OOB observations.

The predicted class of an observation is determined by the majority vote of the OOB predictions for that observation..

The algorithm includes the computation of the OOB error estimate, which for each tree is calculated over the data split out of the corresponding bootstrap sample, and then averaged.

Because the OOB observations are not used in the training of the trees, these are essentially cross-validated accuracy estimates.

In the prediction mode, a calibrated RF model consists of an ensemble of classification trees each of which is allowed one vote for the model prediction. The most voted prediction from all of the trees in the random forest becomes the final model prediction. 2

The RF algorithm also allows a measure of variable importance in the modelling to be obtained, by accumulating the contribution of each variable along all the nodes and for all the trees in which it is used. Additionally, the dependence of the probability of presence of a species on one predictor variable may be investigated to describe the effect of a single predictor on the modelled response (presence/absence of a species) after averaging out the effect of all the other predictors (partial plots).

In the study we used the presence/absence of olive trees in the European domain as binary response variable (1/0 respectively) and a set of climatic indexes describing current climate as predictors.

Current climate data were derived from the average monthly precipitation and temperature values from 1950 to 1999 for the land surface of the globe at 0.5°x 0.5° resolution (Willmott

& Robeson, 1995). Monthly climatic mean temperatures (°C, 12 variables) and cumulative rainfall (mm, 12 variables) were extracted for the European domain and used as predictor variables of the presence/absence of olive trees and used for calculating the average temperature of the coldest and warmest month (January and July, respectively; T_JAN,

T_JUL) and the annual water deficit calculated according the water balance model described in McCabe et al. (2007) (W_DEF), average annual temperature (AVG_T) and the continentality index (the range between the average temperatures of the warmest and coldest months of the year (CON).

The entire climatological dataset, comprising 431 grid points belonging to olive cultivation areas and 6123 not belonging, was first split into training and test subsets. For the training subset, 80% of the points belonging to olive cultivation (334) and 80% of the points not belonging (4,902) were selected randomly. The remaining data were then used as an independent subset for model testing. 3

Because RF tends to overestimate the majority class (i.e., not belonging to olive cultivation) in the case of unbalanced data, the algorithm needed to be trained using a similar amount of test cases from both classes (Evans et al., 2011). Accordingly, during the training phase we used a bootstrap procedure to randomly select 334 grid points from the group not belonging to the olive cultivation area and repeated this procedure 10 times in order to have significant coverage of climates unsuitable for olives.

As part of the random forest procedure, 500 classification trees were built and data selected from the training dataset were used to feed each tree. For each fork in these trees, three predictor variables (mtry = 3) were selected as candidates at random from the full set of 5 predictor variables. The 10 RF-trained models were then combined into a single model, which was first used to evaluate the relative importance of each variable in the classification process and then to investigate the marginal effect of the variables on the probability of olive presence

(partial dependence plots).

The algorithm calibration included the computation of the out of bag error estimate (OOB), and the relationships between individual predictor variables and predicted probabilities of olive tree presence (partial dependence plots).

Because the RF model approach produces predicted probabilities (0/1), the production of a binary map requires the selection of a threshold to transform the scores into a set of presence/absence predictions. The possible thresholds to be applied were scaled between 0 and 1 with a step of 0.1 and for each iteration the relevant number of: true positive (a), false positive (b), false negative (c) and true negative (d) cases predicted by the model were recorded. These indices were then used to calculate the true skill statistic (TSS, Allouche et al., 2006) (eq. 1)

ad  bc TSS = [1] a+cb+ d  4

TSS ranges from -1 to +1, where +1 indicates perfect agreement and values of zero or less indicate a performance no better than random.

In the testing procedure, the performance of the calibrated RF model was evaluated by applying the selected threshold and calculating the relevant TSS.

2 “Past and future climate data and downscaling technique”

Climatic data for past and future periods were obtained from the National Center for

Atmospheric Research Climate System Model (NCAR-CSM) Version 1.4. This is a global coupled atmosphere-ocean-sea ice-land surface model without flux adjustments and is described in detail in ref. 24.

To reproduce past climates, NCAR-CSM was forced over the period 850-2000 AD using observation-based time histories of solar irradiance, spatially explicit aerosol loading from explosive volcanism, greenhouse gases (CO2, CH4, N2O, CFC-11, and CFC-12), and anthropogenic sulphate aerosols with a recurring annual cycle of ozone and natural sulphate aerosol (Ammann et al., 2007).

NCAR-CSM simulations, forced by increasing trace gases and aerosol concentrations from

2001 to 2100 according to the SRES scenario A1B1 were selected as the climatic dataset for future climate conditions.

Time slices relevant to the Medieval Warm Period (MWP, 1200-1299 AD) and to the Little

Ice Age (LIA, 1600-1699 AD) were used to drive the reconstruction of the olive cultivation area in past periods. Three time slices were considered for future periods, namely 2001-2030,

2031-2060, and 2061-2090.

A Delta change approach was used to downscale the GCM data over the observed climate data. Deltas for average temperature and rainfall were obtained as monthly average differences between the baseline period (1950-1999) and the past/future time slices 5 considered from each dataset. For temperature, delta was expressed as the absolute difference between the baseline and the past/future periods, while for rainfall a delta ratio was calculated.

The monthly deltas for temperature and rainfall were interpolated over the European domain using a spline function and added to the relevant observed climate data for the present period.

After this procedure, the average temperature of the coldest and warmest month and the annual water deficit, average annual temperature and the continentality index were recalculated and used as predictors of the trained RF model.

Results

The developed RF ecological model showed almost complete classification accuracy when driven with climate parameters for the present period, which indicated that the olive tree has been forced up to its bioclimatic limits

The RF model was applied using 10 different random datasets where the presence and absence of olive trees were equally distributed. These RF models were then merged into a single model, which resulted in an OOB error of 7.12%. With the selection of a threshold of

70% to classify the predicted presence and absence of olive trees at each grid point, the RF model had a high prediction accuracy resulting in a TSS of 0.92.

Finally, looking at the role of predictor variables, the results indicated that soil moisture

(W_DEF) played a major role in the limitation of the area of olive cultivation, whereas the remaining parameters all had a similar importance.

Partial dependence plots indicated that areas of olive cultivation are excluded from regions that are too arid or too humid (Fig. 3a), whereas the temperatures of the coldest (T_JAN) (Fig. 6

3b) and warmest months (T_JUL) (Fig. 3c) constrained areas of olives within the lower and upper temperature limits of the year.

Following these results, the calibrated RF model was considered robust and coherent and was applied to the relevant predictor variables calculated for the past (MCA and LIA) and the future time slices (2001–2030, 2031–2060 and 2061–2090) of the scenario A1b, to derive the relevant olive growing areas. 7 a 2 2 / ) e c

n 1 e s e r 0 p

f

o 0 100 200 300 400 500 600 700 800 900 1000

y t

i -1 l i b a

b -2 o r p

f o

-3 t i g o l ( -4 W_DEF (mm)

b 1 2 / ) e

c 0.5 n e s

e 0 r p -5 -3 -1 1 3 5 7 9 11 f o

-0.5 y t i l i

b -1 a b o

r -1.5 p

f o

t -2 i g o l ( -2.5 T_JAN (°C)

c 1 2 / ) e

c 0.5 n e

s 0 e r 0 5 10 15 20 25 30 35 40 p

f -0.5 o

y t

i -1 l i b

a -1.5 b o r

p -2

f o

t

i -2.5 g o l ( -3 T_JUL (°C)

Figure 3. Partial dependence plots of the most important predictor variables for: (a) water deficit (W_DEF),

(b) temperature of the coldest month (T_JAN) and (c) temperature of the warmest month (T_JUL). Plots are for RF predictions of the presence/absence of olives across the Mediterranean basin. The partial dependence of a variable gives a graphical depiction of the marginal effect of a variable on the presence probability. 8

References

Allouche, O., Tsoar, A. & Kadmon, R. (2006) Assessing the accuracy of species distribution models: prevalence, kappa and the true skill statistic (TSS). Journal of Applied Ecology 43:1223-1232.

Ammann C.M., Joos F., Schimel D.S., Otto-Bliesner B.L.,Tomas R.A (2007). Solar influence on climate during the past millennium: Results from transient simulations with the NCAR Climate System Model.

PNAS 104: 3713–3718

Breiman, L., 2001. Random forest. Mach. Learn. 45, 5–32.

Evans, J.S., Murphy, M.A., Holden, Z.A. & Cushman, A. (2011) Modelling species distribution and change using random forest. Predictive species and habitat modelling in landscape ecology: Concepts and applications (ed. by C.A. Drew, Y.F. Wiersma and F. Huettman.), pp 139-159, Springer Science+Business

Media, LLC 2011.

McCabe, G.J. & Markstrom, S.L. (2007). A monthly water-balance model driven by a graphical user interface. U.S. Geological Survey Open-File report 2007-1088, 6 p.

R Development Core Team, R: a language and environment for statistical computing. R Foundation for

Statistical Computing. Vienna, Austria, ISBN 3-900051-07-0 (2004), URL http://www.R-project.org/.

Willmott, C.J., Robeson, S.M., 1995. Climatologically Aided Interpolation (CAI) of tterrestrial air temperature. Int. J. Climatol. 15, 221–229. 9