Photo: Hipposideros commersoni?

Barbara Han, JP Schmidt, Laura Alexander, David Hayman, Sarah Bowden, John Drake

Viruses 2014, 6, 1763

Figure 1. (A) The multiple transmission pathways are shown for Ebolavirus genera viruses. The role of vectors is unlikely, but not known (dashed line). Those pathways with epidemiological uncertainty are shown with question marks. Potential reservoir dynamics are shown in blue, spillover epidemics in small (Africa), pigs (Reston ebolavirus only), duikers (Africa), primates and humans shown in red and ongoing human transmission in orange; (B) The multiple transmission pathways are shown for Marburgvirus genera viruses. The role of vectors is unlikely, but not known (dashed line). Those with epidemiological uncertainty are shown with question marks. Potential reservoir dynamics are shown in blue, spillover epidemics in primates and humans shown in red and ongoing human transmission in orange.

[Figure 1A: Ebolavirus transmission dynamics. Labels: spillover dynamics; intra- and inter-species transmission; reservoir dynamics. Source: Olival and Hayman 2014]

[Figure 1B: Marburgvirus transmission dynamics. Labels: spillover dynamics; intra- and inter-species transmission; reservoir dynamics.]

• Can we help target surveillance?
• What are additional candidate reservoir species for filoviruses?
• How are filovirus reservoirs distinct from non-reservoirs?

“Reservoir” = any evidence of a positive filovirus infection using any diagnostic test

Distribution of all species found ‘positive’ for filovirus infection (N = 23/1116)

[Map: number of filovirus-positive bat species (legend 1 to 10); Meliandou marked; scale bar 0 to 10,000 km]

To identify candidate reservoirs, we need to know what current reservoirs look like.

~70 features
DATA SOURCES: PanTHERIA; Wilman et al. 2014 Ecology; Luis et al. 2013 ProcRoySocB; Hamilton et al. 2010 ProcRoySocB; derived features

Boosted regression tree models

How does boosted regression work?

• Training data (80%)
  ▪ Build a tree (random sample of predictor variables)
  ▪ Boost (weight incorrect answers, new trees built on the residuals)
  ▪ Each tree weak, together strong ensemble predictive model
• Test data (20%)
  ▪ Assess prediction accuracy (AUC = area under the curve of true positive rate against false positive rate)

A single regression tree: X’s = randomly selected features; Y = binomial response (reservoir status)

Fig. 1 from Elith et al. 2008. J. Anim. Ecol.; R packages: gbm, dismo, caret

Page shown on the slide (Elith et al. 2008, p. 803):

[...] partly because they are considered less interpretable and therefore less open to scrutiny. It may also be that ecologists are less familiar with the modelling paradigm of ML, which differs from that of statistics. Statistical approaches to model fitting start by assuming an appropriate data model, and parameters for this model are then estimated from the data. By contrast, ML avoids starting with a data model and rather uses an algorithm to learn the relationship between the response and its predictors (Breiman 2001). The statistical approach focuses on questions such as what model will be postulated (e.g. are the effects additive, or are there interactions?), how the response is distributed, and whether observations are independent. By contrast, the ML approach assumes that the data-generating process (in the case of ecology, nature) is complex and unknown, and tries to learn the response by observing inputs and responses and finding dominant patterns. This places the emphasis on a model’s ability to predict well, and focuses on what is being predicted and how prediction success should be measured.

In this paper we discuss a relatively new technique, boosted regression trees (BRT), which draws on insights and techniques from both statistical and ML traditions. The BRT approach differs fundamentally from traditional regression methods that produce a single ‘best’ model, instead using the technique of boosting to combine large numbers of relatively simple tree models adaptively, to optimize predictive performance (e.g. Elith et al. 2006; Leathwick et al. 2006, 2008). The boosting approach used in BRT places its origins within ML (Schapire 2003), but subsequent developments in the statistical community reinterpret it as an advanced form of regression (Friedman, Hastie & Tibshirani 2000).

Despite clear evidence of strong predictive performance and reliable identification of relevant variables and interactions, BRT has been rarely used in ecology (although see Moisen et al. 2006; De’ath 2007). In this paper we aim to facilitate the wider use of BRT by ecologists, demonstrating its use in an analysis of relationships between frequency of capture of short-finned eels (Anguilla australis Richardson), and a set of predictors describing river environments in New Zealand. We first explain what BRT models are, and then show how to develop, explore and interpret an optimal model. Supporting software and a tutorial are provided as Supplementary material.

EXPLANATION OF BOOSTED REGRESSION TREES

BRT is one of several techniques that aim to improve the performance of a single model by fitting many models and combining them for prediction. BRT uses two algorithms: regression trees are from the classification and regression tree (decision tree) group of models, and boosting builds and combines a collection of models. We deal with each of these components in turn.

DECISION TREES

Modern decision trees are described statistically by Breiman et al. (1984) and Hastie, Tibshirani & Friedman (2001), and for ecological applications by De’ath & Fabricius (2000). Tree-based models partition the predictor space into rectangles, using a series of rules to identify regions having the most homogeneous responses to predictors. They then fit a constant to each region (Fig. 1), with classification trees fitting the most probable class as the constant, and regression trees fitting the mean response for observations in that region, assuming normally distributed errors. For example, in Fig. 1 the two predictor variables X1 and X2 could be temperature and rainfall, and the response Y, the mean adult weight of a species. Regions Y1, Y2, etc. are terminal nodes or leaves, and t1, t2, etc. are split points. Predictors and split points are chosen to minimize prediction errors. Growing a tree involves recursive binary splits: a binary split is repeatedly applied to its own output until some stopping criterion is reached. An effective strategy for fitting a single decision tree is to grow a large tree, then prune it by collapsing the weakest links identified through cross-validation (CV) (Hastie et al. 2001).

Decision trees are popular because they represent information in a way that is intuitive and easy to visualize, and have several other advantageous properties. Preparation of candidate predictors is simplified because predictor variables can be of any type (numeric, binary, categorical, etc.), model outcomes are unaffected by monotone transformations and differing scales of measurement among predictors, and irrelevant predictors are seldom selected. Trees are insensitive to outliers, and can accommodate missing data in predictor variables by [...]

Fig. 1. A single decision tree (upper panel), with a response Y, two predictor variables, X1 and X2 and split points t1, t2, etc. The bottom panel shows its prediction surface (after Hastie et al. 2001)

© 2008 The Authors. Journal compilation © 2008 British Ecological Society, Journal of Animal Ecology, 77, 802–813
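The train, boost, and test steps listed on the slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' analysis (which used the R packages gbm, dismo and caret): the data are synthetic, the trees are one-split stumps, and the function names (`fit_stump`, `fit_boosted_stumps`, `auc`) and all parameter values are hypothetical. Each new stump is fitted to the residuals of the current ensemble using a random sample of predictor variables, as the slide describes, and AUC is computed on the held-out 20%.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals, feature_ids):
    """Least-squares search for one split (feature, threshold) on the residuals."""
    best = None
    for j in feature_ids:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # every sampled feature was constant: fall back to the mean
        mean = sum(residuals) / len(residuals)
        return (feature_ids[0], float("inf"), mean, mean)
    return best[1:]  # (feature, threshold, left value, right value)

def stump_predict(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value

def fit_boosted_stumps(X, y, n_trees=50, learning_rate=0.1, n_sampled=2, seed=0):
    """Gradient boosting for a binary response: each stump fits the current residuals."""
    rng = random.Random(seed)
    p = sum(y) / len(y)
    intercept = math.log(p / (1.0 - p))  # start from the log-odds of the base rate
    scores = [intercept] * len(X)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        feature_ids = rng.sample(range(len(X[0])), n_sampled)  # random predictors
        stump = fit_stump(X, residuals, feature_ids)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, row)
                  for s, row in zip(scores, X)]
    return intercept, learning_rate, stumps

def predict_score(model, row):
    intercept, learning_rate, stumps = model
    return intercept + learning_rate * sum(stump_predict(s, row) for s in stumps)

def auc(labels, scores):
    """Chance that a random positive scores above a random negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic stand-in for a species-by-trait table (assumption: 4 traits, 150 species).
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(150)]
y = [1 if row[0] + 0.5 * row[1] + data_rng.gauss(0, 0.5) > 0.8 else 0 for row in X]

split = int(0.8 * len(X))  # 80% training, 20% test
model = fit_boosted_stumps(X[:split], y[:split])
train_auc = auc(y[:split], [predict_score(model, r) for r in X[:split]])
test_auc = auc(y[split:], [predict_score(model, r) for r in X[split:]])
print(f"training AUC = {train_auc:.2f}, test AUC = {test_auc:.2f}")
```

A gap between training and test AUC, as in the slide's 0.99 versus 0.85, is the usual reason for reporting the held-out figure. The pairwise `auc` function is quadratic in the number of observations, which is acceptable for a sketch of this size.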

[Figure: ROC curves, true positive rate against false positive rate. Training (AUC = 0.99); Test (AUC = 0.85)]

The model predicts filovirus-positive bat species with ~85% accuracy.

Training data = 80%; Test data = 20%; CV = 10-fold

[Figure: relative influence of traits (axis 0 to 10). Visible trait labels: Phyllostomidae, mig, torp, postnatGR, aridity]

Filovirus reservoirs tend to have larger neonates than most other bat species.

[Figure: partial dependence plots for Log NeonateBodyMass (g), WeaningBM (g), SexualMaturityAge (d), Log PopulationGrpSize, X16.1_LittersPerYear, Log Production]

[Map: number of filovirus-positive bat species (legend 1 to 10); Meliandou marked]


But the candidate reservoirs predicted by our model are much more globally distributed.

[Map: number of predicted filovirus reservoirs (90th percentile); legend bins from 1-2 to 15-17; scale bar 0 to 10,000 km]

In the 90th percentile, there were 82 candidate species predicted. The top 12 include:

• Hipposideros commersoni
• Phyllostomus hastatus
• Macroderma gigas
• Mops midas
• Dobsonia moluccensis
• Pteropus livingstonii
• Epomops buettikoferi
• Ptenochirus jagori
• Rhinolophus ferrumequinum
• Epomophorus labiatus
• Brachyphylla nana
• Pteropus rodricensis
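One way to read "in the 90th percentile" is as a cutoff on the model's predicted scores. The sketch below assumes the cutoff is taken over all scored species and that species already known to be positive are then dropped; the source does not say exactly how the 82 candidates were selected, so both choices are assumptions. The function names, species labels (`sp_01` to `sp_20`) and scores are made up for illustration.

```python
def percentile_cutoff(scores, pct):
    """Nearest-rank percentile: the score at rank ceil(pct% of n) in sorted order."""
    ordered = sorted(scores)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling, pct is an int
    return ordered[rank - 1]

def candidate_reservoirs(predictions, known_positive, pct=90):
    """Species scoring at or above the percentile cutoff, excluding known positives."""
    cutoff = percentile_cutoff(list(predictions.values()), pct)
    ranked = sorted(predictions.items(), key=lambda item: item[1], reverse=True)
    return [(name, score) for name, score in ranked
            if score >= cutoff and name not in known_positive]

# Hypothetical predicted scores for 20 placeholder species (not real model output).
predictions = {f"sp_{i:02d}": i / 20 for i in range(1, 21)}
known_positive = {"sp_20"}  # placeholder for species already found positive

candidates = candidate_reservoirs(predictions, known_positive)
for name, score in candidates:
    print(name, score)
```

With these made-up scores the cutoff falls at the 18th of 20 ranked values, so three species reach it and the two not already known to be positive are returned, highest score first.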

[Map of Asia with country labels: number of predicted filovirus reservoirs (90th percentile); legend bins from 1-2 to 15-17; scale bar 0 to 1,500 km]

• Target species
• 82 bats globally
• Decisions about how to prioritize
• Target regions: SE Asia
• Non-fruit bats

• Members of the Ebola Working Group
• Pasha Feinberg – Cary Institute
• Sean Maher – U. Missouri
• Shan Huang – U. Chicago