Divide and conquer – machine-learning integrates mammalian, viral, and network traits to predict unknown virus-mammal associations

Electronic Supplementary Materials

Maya Wardeh*, Marcus SC Blagrove, Kieran J. Sharkey, Matthew Baylis

Supplementary Note 1 – Virus-mammal interactions data

Mammal-virus associations: Species-level virus-mammal associations were extracted from the Enhanced Infectious Diseases Database [1] (EID2). EID2 contains 4,799 species of mammals and 9,605 species of viruses. EID2 extracts information on pathogens, their hosts and locations from two sources: 1) meta-data accompanying nucleotide sequences (hereafter sequences) published in GenBank [2,3]; and 2) titles and abstracts (hereafter TIABs) of publications indexed in the PubMed database [4]. To date, EID2 has extracted information from 71,397,061 sequences (and processed 100M+ sequences), and 8,643,203 TIABs. Here, we recursively aggregated virus-mammal associations: a mammal that was found to host a strain or subspecies of a virus was considered a host of the corresponding virus species (and vice versa). We further checked these species-level associations for accuracy and to eliminate laboratory-produced results. This resulted in 6,331 associations between 1,896 viruses and 1,436 terrestrial mammals.

Virus taxonomy: For the purposes of taxonomic classification of viruses (Table S1), we obtained the most recent taxonomic tree of our viruses from the NCBI taxonomy database [5] (accessed 09/2019).

Genome | Group     | N   | H    | Description
DNA    | Group I   | 726 | 491  | dsDNA viruses - double-stranded DNA viruses (e.g. herpesviruses)
DNA    | Group II  | 273 | 160  | ssDNA viruses - single-stranded DNA viruses (e.g. circoviruses)
RNA    | Group III | 46  | 156  | dsRNA viruses - double-stranded RNA viruses (e.g. )
RNA    | Group IV  | 427 | 511  | (+)ssRNA viruses - positive-sense single-stranded RNA viruses (e.g. Zika virus, Yellow fever virus)
RNA    | Group V   | 360 | 1051 | (-)ssRNA viruses - negative-sense single-stranded RNA viruses (e.g. Ebola virus, Influenza A virus)
Retro- transcribing | Group VI  | 57 | 182 | ssRNA-RT viruses - (+ strand or sense) RNA with DNA intermediate in life-cycle (e.g. retroviruses such as HIV1)
Retro- transcribing | Group VII | 7  | 41  | dsDNA-RT viruses - DNA with RNA intermediate in life-cycle (e.g. Hepatitis B virus)

Table S1 – Baltimore classification of mammalian viruses included in this study. N indicates the number of virus species in each group. H indicates the number of mammal species for which associations with viruses in each group were found in EID2.

Domestication level of mammals: We classified the domestication status of our mammalian hosts into four levels: wild mammals (N=1326), semi-domesticated mammals (N=81), domesticated mammals (N=28), and human (N=1). Table S2 lists our domesticated mammals. Our semi-domesticated mammals group included N=29 of each of Carnivora and Ruminantia, 13 Rodentia, 3 Perissodactyla, 2 of each of Proboscidea and Tylopoda, and one of each of Diprotodontia, Eulipotyphla and Suina.

Order          | N  | Species
Artiodactyla   | 16 | Bison bonasus, Bos frontalis, Bos grunniens, Bos indicus, Bos javanicus, Bos taurus, Bos mutus, Bubalus bubalis, Bubalus carabanensis, Capra hircus, Ovis aries, Camelus bactrianus, Camelus dromedarius, Lama glama, Lama pacos, Sus scrofa
Carnivora      | 3  | Canis lupus familiaris, Felis catus and Vulpes vulpes
Perissodactyla | 3  | Equus asinus, Equus caballus and Equus asinus x caballus
Rodentia       | 5  | Cavia porcellus, Mesocricetus auratus, Mus musculus, Rattus norvegicus and Rattus rattus
Lagomorpha     | 1  | Oryctolagus cuniculus

Table S2 – Domesticated mammals included in our analyses.
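The recursive species-level aggregation described in Supplementary Note 1 can be sketched as follows. This is an illustration only, not the EID2 implementation: the taxonomy dictionary, its contents, and the function names are hypothetical, and the example assumes every taxon has a species-rank ancestor.

```python
# Hypothetical toy taxonomy: taxon -> (parent taxon, rank of this taxon).
TAXONOMY = {
    "Influenza A virus (H5N1)": ("Influenza A virus", "strain"),
    "Influenza A virus": (None, "species"),
    "Canis lupus familiaris": ("Canis lupus", "subspecies"),
    "Canis lupus": (None, "species"),
}


def to_species(taxon, taxonomy):
    """Walk up the parent chain until a species-rank taxon is reached."""
    while taxon is not None:
        parent, rank = taxonomy[taxon]
        if rank == "species":
            return taxon
        taxon = parent
    raise ValueError("no species-level ancestor found")


def aggregate(associations, taxonomy):
    """Map each (virus, mammal) pair to species level and drop duplicates."""
    pairs = {
        (to_species(virus, taxonomy), to_species(mammal, taxonomy))
        for virus, mammal in associations
    }
    return sorted(pairs)


raw = [
    ("Influenza A virus (H5N1)", "Canis lupus familiaris"),
    ("Influenza A virus", "Canis lupus"),
]
species_level = aggregate(raw, TAXONOMY)
```

In this toy input both records collapse to a single species-level association, which is the behaviour the note describes for strains and subspecies on either side of the pair.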


Supplementary Note 2 – Viral Feature Space (Viral traits & features)

Virus genome and capsid: We classified the genome of each virus as RNA (binary factor, no = DNA); retro-transcribing (binary factor); negative-sense (binary factor, yes = negative-sense); and positive-sense (binary factor, yes = positive-sense). RNA viruses adapt faster [6], and are generally more fragile (cannot survive as long outside of the cell). Retroviruses are generally very conserved [7], and have to enter the nucleus [8] and insert into the genome; these additional steps may require specificity and limit host range. Sense affects the replication cycle and the range of host enzymes needed.

With regard to genome architecture and organisation, we checked if the virus has a circular or linear genome (binary factor, no = linear), as this attribute affects replication and translation. Rolling circle replication and translation are common with circular genomes, negating the need to re-enlist host enzymes [9], therefore possibly affecting host range as less specificity may be required. In addition, we noted if the virus was monopartite (has a single nucleic acid molecule protected in a shell made of proteins) or segmented (divided into two or more nucleic acid segments) (binary factor, no = monopartite). Segmented viruses can undergo recombination if two strains of the same virus infect a cell (e.g. influenza hemagglutinin & neuraminidase recombination [10]). This in turn can lead to host range changes of segments of the genome. In practice, there exists a third class of viral architecture: multipartite viruses. These viruses have their genome divided into two or more nucleic acid segments (similarly to segmented viruses), but these segments are each packaged into separate virus particles. We ignored multipartite viruses in this study because they are very rare and poorly understood [11].

Regarding capsids, we indicated if the virus is enveloped or not (binary factor). Envelopes are usually derived from the host cell membrane; this can help viruses avoid the host immune system. They may limit host range by providing antigens for other immune systems. Envelopes are very sensitive to the external environment, and enveloped viruses often need to be transferred directly between hosts; finally, because the envelope is made from the current host's cell membrane, it will change upon infection of a new host, making the virus rapidly adaptable [12].

GC-content (guanine-cytosine content), the percentage of a nucleotide sequence that is made up of either guanine or cytosine bases, and the average genome size (in bases) were also obtained for each of our viruses. GC hydrogen bonding is stronger than AT/U bonding, so high GC content usually leads to higher thermo-stability of the genome [13], including single-stranded genomes, which often self-anneal. This may affect longevity outside of hosts, and replication inside of ectothermic hosts. Genome size is indicative of many aspects of the virus such as complexity, DNA/RNA, and replication type.

Virus replication, release, and cell entry: We collated information on the replication site of the virus. We expressed these data as a binary factor indicating if the virus replicates in the cytoplasm (versus nucleus replication). Replication site is linked to RNA/DNA genome: if a virus has a DNA stage it must replicate in the nucleus. This creates extra barriers to overcome for entry to the nucleus and may restrict host range. We classified the release of the virus into three broad categories: budding, lysis, or other. The mechanism of release affects aspects such as the rate of virus production, cell life-span and means of presentation to the immune system [14]; each of these aspects could influence the virus host range. Finally, we recognised that availability of receptors influences potential host range, therefore we broadly categorised the mechanism of virus cell entry into 4 categories: cell-receptor endocytosis, clathrin-mediated endocytosis, glycoprotein-mediated or other.

Transmission routes: We categorised major transmission routes of our viruses as follows:
- direct transmission – by direct contact, via skin, broken skin or droplets (unique viruses Nv=1,168, mammalian hosts Nm=1,305, associations=4,389);
- sexual transmission (Nv=273, Nm=405, associations=1,154);
- vertical transmission – mother to child, or via breast-milk contact (Nv=252, Nm=445, associations=1,161);
- indirect transmission – e.g. via secretions/excretions/tissues (Nv=602, Nm=1,260, associations=3,224);
- ingestion – including water ingestion and faecal-oral routes (Nv=848, Nm=1,013, associations=2,792);
- inhalation – via droplet or airborne particles or dust (Nv=641, Nm=612, associations=2,029);
- environmental – through fomite, contact with environment (Nv=372, Nm=633, associations=1,596); and
- vector – strictly via arthropod vector such as ticks or mosquitoes (Nv=194, Nm=377, associations=1,019).

We adopted a three-fold strategy to capture major transmission routes of our viruses as follows:
1. Title and abstract (TIABs) mining: we utilised EID2 to extract TIABs of PubMed papers linked to single virus species (i.e. excluding TIABs with multiple viruses). These TIABs were subsequently classified via keyword matching into the transmission routes described above. The TIABs were further checked manually to ensure correctness and to remove transmission routes outside mammalian species (e.g. we removed instances of vertical transmission within arthropod vectors).
2. Manual extractions from online sources: we manually extracted, from various sources [15–18], transmission routes of viruses for which no papers were identified by the previous step, as well as routes not detected in the TIABs.
3. Within-genus generalisations: finally, we assigned transmission routes to viruses not captured by the previous two steps, by taking a minimal agreement set of within-genus transmission routes.

For the purposes of our study we utilised a simple multi-label classification of our viruses whereby each route was assigned a value of 1 or 0 corresponding to whether or not the virus was found to be transmitted via the specified route as described above.
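The within-genus generalisation and the 1/0 multi-label encoding of transmission routes can be sketched as below. This is a hedged illustration: the note does not define the "minimal agreement set" precisely, so the sketch assumes it means the intersection of the route sets of congeneric viruses with known routes. The virus names, genus labels, and function names are hypothetical.

```python
# Route labels taken from the list in Supplementary Note 2.
ROUTES = [
    "direct", "sexual", "vertical", "indirect",
    "ingestion", "inhalation", "environmental", "vector",
]


def genus_consensus(known_routes, genus_of, virus):
    """Routes shared by all congeners with known routes.

    Assumption: 'minimal agreement set' is read here as a set intersection.
    """
    route_sets = [
        routes
        for other, routes in known_routes.items()
        if other != virus and genus_of[other] == genus_of[virus]
    ]
    return set.intersection(*route_sets) if route_sets else set()


def encode(routes):
    """Multi-label encoding: 1 if the virus uses the route, else 0."""
    return [1 if route in routes else 0 for route in ROUTES]


# Hypothetical example: virus_c has no routes from steps 1 and 2.
genus_of = {"virus_a": "G", "virus_b": "G", "virus_c": "G"}
known_routes = {
    "virus_a": {"direct", "vector"},
    "virus_b": {"vector", "vertical"},
}
inferred = genus_consensus(known_routes, genus_of, "virus_c")
labels = encode(inferred)
```

Here only the vector route is shared by both congeners, so it is the only route assigned to the third virus under this reading.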


Supplementary Note 3 – Mammalian Feature Space (mammalian traits & features)

Phylogeny: Host phylogenetic distance has been found to drive sharing of pathogens, particularly viruses [20–22], between mammals. We utilised a recent mammalian supertree [23] to calculate pairwise phylogenetic distance between each mammal-mammal pair. We then aggregated these values (mean) between each mammal species and the known hosts of each of our viruses. This measure indicated, per virus, whether a potential host species was phylogenetically close to or distant from the virus's preferred host range. Mammalian species that could not be matched to this phylogeny were dropped from our analyses.

In addition, we calculated the evolutionary distinctiveness for each mammal using fair proportion [24], as implemented in the R package picante [25]. Evolutionary distinctiveness quantifies how isolated a species is on its phylogenetic tree [26], and has been shown to correlate negatively with pathogen species richness [27].

Host traits: We compiled data on morphological and life-history traits, diet and habitat for our mammal species from online databases and literature [28–33]. We selected the following traits for their known correlation with host-pathogen associations, and their wide availability. Body mass (g) represented morphological traits, as it proxies key features of metabolism and adaptation to environment. For life-history traits we included: maximum age (months), activity cycle, and migration [29]. We included gestation period length (days), litters per year, litter size, weaning age (days), and age at sexual maturity (days), to represent the reproductive characteristics of our mammalian species. Reproductive traits could be viewed as proxies for within-host virus dynamics and therefore may influence the viruses harboured by the host.

We included geographical area range [32] (in squared km) as species with wider areas might be exposed to more viruses. We incorporated habitat utilisation [32] as multiple binary indicators of whether a species uses one or more of 14 natural and artificial habitats. We hypothesised that mammals utilising similar habitats might come into contact with similar viruses, and this in turn would increase the chances of being infected with these viruses.

We used the proportional use of 10 diet categories [31] to indicate the dietary preferences of mammals. We incorporated these categories as independent variables in our models as we assumed that similar dietary habits might be associated with similar viral assemblages.

We included the above listed traits as independent variables in our virus perspective models. In addition, we utilised them to quantify the pairwise ecological distance between each pair of mammals in our study. We based this distance on a generalised form of Gower's distance matrices [34,35]. We incorporated these distances to compute the mean distance between each mammal and the known hosts of each virus (and vice versa). Similarly to the mean phylogenetic distance listed above, the mean ecological distance indicated, for each virus, whether a potential host species was ecologically close to or distant from the virus's preferred host range.

Mammalian geospatial features
The geographical distribution of a host species influences the pathogens with which it might come into contact. Geography also correlates with other factors such as climate, natural environment, and agricultural practices. Climate has been shown to influence a number of human and domestic mammal pathogens (including viruses) [36–38]. Furthermore, climate indirectly affects certain groups such as vector-borne viruses (e.g. Zika, bluetongue and West Nile viruses) through the direct effect it has on the viability of the associated arthropod vectors [39,40] and their capacity to transmit pathogens. In addition to climate, other geographical factors such as biodiversity (species richness), land cover type, agriculture and farming practices, urbanisation and human population have been found to influence certain categories of host-pathogen associations [41,42].
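The mean phylogenetic and mean ecological distances described earlier in this note share one aggregation step: averaging a pairwise distance between a candidate mammal and the known hosts of a virus. A minimal sketch follows, assuming a precomputed symmetric distance matrix; the species labels, matrix values, and the choice to exclude the candidate from its own host set are illustrative assumptions, not details confirmed by the note.

```python
def mean_distance_to_hosts(dist, species, candidate, known_hosts):
    """Mean pairwise distance between a candidate and a virus's known hosts.

    dist is a square matrix (list of lists) indexed in the order of `species`.
    Assumption: the candidate itself is excluded if it is already a known host.
    """
    index = {name: i for i, name in enumerate(species)}
    hosts = [h for h in known_hosts if h != candidate]
    if not hosts:
        return float("nan")
    row = dist[index[candidate]]
    return sum(row[index[h]] for h in hosts) / len(hosts)


# Hypothetical three-species example.
species = ["a", "b", "c"]
dist = [
    [0.0, 2.0, 6.0],
    [2.0, 0.0, 4.0],
    [6.0, 4.0, 0.0],
]
d_a = mean_distance_to_hosts(dist, species, "a", ["b", "c"])
d_b = mean_distance_to_hosts(dist, species, "b", ["b", "c"])
```

The same function applies whether `dist` holds phylogenetic distances from the supertree or Gower-type ecological distances; only the input matrix changes.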


We obtained and processed species-presence maps of mammalian species [32,43,44]. We supplemented these with grids expressing climate [45], mammalian diversity [46], human population [44], land cover (including urbanisation) [47], agriculture [47,48], and distribution of livestock [43]. This allowed us to generate the following geographical features of mammalian species: 1) we expressed climate in two features: mean temperature, and mean precipitation. 2) We quantified the diversity of natural land cover type (not directly associated with humans). 3) We quantified agricultural (including land cover associated with humans, e.g. managed vegetation) and farming practices (expressed in number of domesticated livestock and poultry in the species presence area) as entropies of the associated values in the species range. 4) We computed urbanisation as percent of urban land in the species presence area. 5) We summed human population in the species presence area. Finally, 6) we computed average mammalian diversity in the species presence area.

Species-presence maps: We obtained species-presence maps for the majority of our mammalian species from IUCN [32]. We extrapolated livestock (including horses) species-presence maps from the most recent global distribution maps [43]. Finally, we inferred presence maps for three domesticated species, dogs (Canis lupus familiaris), cats (Felis catus) and guinea pigs (Cavia porcellus), from Gridded Population of the World maps [44], by assuming they co-exist with humans where there is sufficient human population (n>100). We used the same gridded population maps to extrapolate the human species-presence map (n>0). All our geographical map manipulation was done in QGIS. We dropped any mammalian species for which no presence maps could be derived from our models trained in the viral perspective.

Geospatial features: We utilised several sources to extract geo-attributes of our mammalian species. Table S3 lists these sources, the features derived, and the reason for inclusion in our analyses. Figure S1 illustrates this process. We derived the following geo-features for our mammalian species:
1. Climate attributes:
   a. Mean temperature: mean of monthly temperatures recorded in the species-presence area, averaged between years 1900-2010 [45].
   b. Mean precipitation: sum of monthly rainfall (precipitation) recorded in the species-presence area, averaged between years 1900-2010 [45].
2. Natural land cover diversity: calculated as Shannon's entropy of the mean percent of land covered with each of the attributes listed in Table S3 under category (Natural) Land-cover.
3. Agriculture and farming diversity: calculated as Shannon's entropy of the mean percent of land utilised for the following categories: managed/cultivated vegetation, regularly flooded vegetation, cropland and pasture, and the sum of each of the livestock, horses and poultry listed in Table S3.
4. Human population: human population in the species-presence area, calculated by intersecting with a recent gridded human population dataset [44].
5. Urbanisation: mean percent of urban land in the species-presence area.
6. Mammalian diversity: mean mammalian diversity [46] in the species-presence area.
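The diversity features in items 2 and 3 rely on Shannon's entropy. The sketch below shows one way to compute it from mean land-cover percentages within a species-presence area. It is a hedged illustration: normalising the values to proportions and using the natural logarithm are assumptions, since the note does not state either choice, and the input values are invented.

```python
import math


def shannon_entropy(values):
    """Shannon entropy H = -sum(p * ln p) over the normalised values.

    Assumptions: values are non-negative, are rescaled to sum to 1,
    and zero entries contribute nothing to the sum.
    """
    total = sum(values)
    if total <= 0:
        return 0.0
    proportions = [v / total for v in values if v > 0]
    return -sum(p * math.log(p) for p in proportions)


# Hypothetical mean percent cover for four land-cover classes.
even_cover = shannon_entropy([25.0, 25.0, 25.0, 25.0])
single_cover = shannon_entropy([100.0, 0.0, 0.0, 0.0])
```

An even mix of four classes gives the maximum entropy ln(4), while a range dominated by a single class gives zero, so higher values indicate more diverse land cover.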


Category | Layer(s)/geo-attributes | Source | Res | Reason
(Natural) Land-cover | Evergreen/deciduous needle-leaf trees (%); Evergreen broad-leaf trees (%); Deciduous broad-leaf trees (%); Mixed/other trees (%); Shrubs (%); Herbaceous vegetation (%); Barren land (%) | EarthEnv [47] | 0°0′30″ | Type of land cover has been associated with distribution of various mammals [49]. It potentially increases chances of contact between mammalian reservoirs of different viruses.
Agriculture & farming | Managed/Cultivated Vegetation (%); Regularly flooded vegetation (%) | EarthEnv [47] | 0°0′30″ |
Agriculture & farming | Cropland (%); Pasture (%) | HYDE [50] | 0°5′ |
Agriculture & farming | Cattle (head count); Sheep (head count); Buffalo (head count); Pigs (head count); Horses (head count); Chicken (head count); Duck (head count) | Global distribution data (livestock) [43] | 0.0833° | Livestock farming has been linked to emergence and cross-species transmission of a number of viruses (e.g. Nipah virus, influenza viruses) [51].
Human | Human population | SEDAC [44] | 0°5′ | Urbanisation and human population density have been shown to be drivers of disease emergence and spill-over through the wildlife-domestic-human interface [42,52,53].
Human | Urban land (%) | EarthEnv [47] | 0°0′30″ |
Climate | Mean temperature; Mean precipitation | CRU TS3 [45] | 0°5′ | Climate has been shown to influence a number of human and domestic-mammal pathogens [36–38]. Furthermore, climate indirectly affects certain groups such as vector-borne viruses (e.g. Zika, Bluetongue and West Nile viruses) through the direct effect it has on the viability of the associated arthropod vectors [39,40].
Mammalian diversity | Number of different mammalian species in a grid cell | SEDAC [46] | 0°5′ | Mammalian species present in mammal-rich areas might be exposed to diverse viruses [54–57].

Table S3 - List of geographical predictor layers integrated within our framework.


Figure S1 - Geographical feature extraction. Mammalian species-presence maps were first extracted from our sources [32,43,44] (step 1); these maps were then intersected with our geo-layers (Table S3) (step 2). This enabled us to derive values of our geo-attributes for the majority of our mammalian species (step 3), which we then summarised into the geo-features included in our models (step 4).


Supplementary Note 4 – Potential motifs in bipartite networks

Figure S2 – Motifs as features of bipartite networks (domestication status). Heat-maps illustrating the distribution of motif-features (counts of potential motifs per each possible edge) in our bipartite network, grouped by domestication status of mammalian hosts and Baltimore classification of the viruses. The counts are logged to allow for better visualisation.


Figure S3 – Motifs as features of bipartite networks (mammalian order). Heat-maps illustrating the distribution of motif-features (counts of potential motifs per each possible edge) in our bipartite network, grouped by order of mammalian hosts and Baltimore classification of the viruses. The counts are logged to allow for better visualisation.


Supplementary Note 5 – Multi-perspective framework to predict unknown virus-mammal associations

Machine learning algorithms selection
Table S4 lists the machine learning algorithms (classifiers) included in each of the three perspectives of our framework (N=8). These classifiers offer a varied subset of the plethora of classifiers available for experimentation (over 179 classifiers categorised into at least 17 families [58]), and were selected due to their robustness, scalability [58], and their potentially good performance with imbalanced data classification [59]. We performed the training and optimisation (tuning) of these algorithms in each perspective using the caret R package [60,61].

Classifiers training, optimisation and validation
Class rebalancing techniques: Our bipartite virus-mammal network is sparsely connected: roughly 0.23% of potential associations are documented in our input associations set. On average each virus in our dataset affected 3.45 mammals (~0.24%), and each mammalian host was affected by 4.41 viruses (~0.24%). This presented an imbalance in our data, whereby a small percentage of instances are positive (i.e. present a documented association between a virus and a mammal species). To address this issue, we deployed two class rebalancing techniques:

1. SMOTE: We deployed SMOTE (Synthetic Minority Over-sampling Technique) [67,68] to rebalance the classes of our viral and mammalian models. SMOTE synthesises new minority class instances between existing minority instances using a flavour of the k-nearest neighbour algorithm. The SMOTE algorithm then over-samples from the minority instances and under-samples from the majority class to create a balanced training set. SMOTE was applied to balance the training set of each mammalian and viral species model. It was not applied when drawing models' predictions or when assessing performance metrics as discussed below.
2. Random balanced under-sampling: We trained our network perspective models (N=8×100) with balanced samples drawn at random (without replacement) from the set of all potential virus-mammal associations. Each sample comprised 2,000 instances: 1,000 positive (known) and 1,000 negative (true negative or unknown) virus-mammal associations.

10-fold cross validation: We performed 10-fold cross validation of our potential and constituent models (with various numbers of repeats, as detailed below). This validation method works by splitting the training data into 10 random samples; each sample is in turn held out, and the model is trained on the remaining groups. The model's predictions for the existence or absence of the mammal-virus associations in the held-out group are used to construct confusion matrices and calculate an optimisation metric (here AUC). The optimisation metric is used to select the best model in the validation process.

We repeated this cross validation for each of our constituent models as follows:
1. Models trained in the viral and mammalian perspectives:
   a. Parameter tuning and classifier selection: 10 repeats of 10-fold cross validation per model.
   b. Predictions and reported analyses: 50 repeats of 10-fold cross validation per model. Results and metrics were obtained per repeat, and median, 0.05 and 0.95 quantiles for all scores were reported for predictions and metrics.
2. Models trained in the network perspective: the 10-fold cross validation was repeated 100 times with balanced random samples drawn from the full network. Performance metrics, classifier selection, and predictions were obtained by testing each repeat against the full network. Median, 0.05 and 0.95 quantiles for all scores were reported.
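The random balanced under-sampling used for the network perspective can be sketched as follows. This is an illustration in Python rather than the R/caret pipeline used in the study; the function name, the seed argument, and the toy label vector are assumptions.

```python
import random


def balanced_sample(labels, n_per_class=1000, seed=None):
    """Draw indices of n_per_class positives and n_per_class negatives.

    Sampling is without replacement within each class. Labels are 1 for a
    known association and 0 for an unknown (or true negative) association.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    return rng.sample(positives, n_per_class) + rng.sample(negatives, n_per_class)


# Toy label vector: 2,000 positives among 100,000 potential associations.
labels = [1] * 2000 + [0] * 98000
sample = balanced_sample(labels, n_per_class=1000, seed=0)
```

Repeating the draw with different seeds yields the independent balanced samples on which separate classifiers can be trained, as in the 100-sample design described above.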


Model | Caret method | Base family | Summary
Model Averaged Neural Network (avNNet) | avNNet | Neural networks | The same neural network model is fit using different random number seeds. All resulting models are then used for prediction. To generate the overall class probability, the probability outputs from all networks are averaged.
Stochastic Gradient Boosting (GBM) | gbm | Decision trees | Stochastic Gradient Boosting (GBM) [57,62–64] fits a series of trees (weak classifiers) to random partitions of the data, and aggregates the results sequentially (i.e. boosting).
Random Forest (RF) | ranger | Decision trees | The random forests algorithm [65] constructs an ensemble (termed forest) of decision trees grown randomly from the input data. The randomness is twofold: first, each tree is built from a random sample of the data. Second, at each tree node, a subset of features (input variables) is randomly selected to generate the best split.
eXtreme Gradient Boosting (XGBoost) | xgbTree | Decision trees | Similarly to GBM, eXtreme Gradient Boosting (XGBoost) [66] constructs an ensemble (strong learner) from weak learners, typically decision trees. XGBoost has the additional advantages of speed; ease of use and parallelisation; and in many cases high predictive accuracy.
Support Vector Machines with Radial basis kernel and Class Weights (SVM-RW); Linear Support Vector Machines with Class Weights (SVM-LW); Support Vector Machines with Polynomial Kernel (SVM-P) | svmRadialWeights; svmLinearWeights; svmPoly | Support vector machines | Support vector machines aim to find a hyperplane that best separates the features (predictors) into different domains (classes). Radial SVMs utilise a Radial Basis Function (RBF) kernel, whereas polynomial SVMs adopt polynomial kernels. Linear SVMs use linear kernels. SVMs with class weights enjoy additional class-imbalance robustness as they apply a penalty to misclassification, with weights inversely proportional to class frequency.
Naive Bayes (NB) | naive_bayes | Bayesian | Naïve Bayes classifiers are a family of probabilistic classifiers centred upon Bayes' Theorem. They assume that input features (viral, mammalian and network features in our case) are independent: they contribute independently to the probability of the outcome class (a virus-mammal association in our case), regardless of any correlation between the features. Features, however, are not always independent, which is why the Naïve Bayes classifiers are labelled as naïve.

Table S4 - List of machine learning algorithms (supervised classifiers) used in our models.

Optimisation: The hyper-parameters of our viral and mammalian perspective models were tuned (optimised) via 10-fold cross validation (with 10 repeats) for each virus (or mammal) model. Hyper-parameters of the best performing models were carried across to our bagging pipelines (50 repeats), which were used to produce median predictions (of class probabilities) and generate empirical confidence intervals. Models trained with motif-features (network perspective) were optimised via 10-fold cross validation of each generated subsample (N=100).

We performed the training and tuning of our selected classifiers in each perspective using the caret R package [60,61]. We adopted an adaptive resampling approach [69] to tune the hyper-parameters of our models. This approach allows caret to adaptively resample the tuning parameter grid in a way that concentrates on values that are in the neighbourhood of the optimal settings. Due to the large number of classifiers trained in our framework, this adaptive approach allowed us to find optimal (or near optimal) values of the hyper-parameters of each included machine learning technique (Table S4) without relying on the nominal resampling process, whereby all the tuning parameter combinations are computed for all the resamples before a choice is made about which parameters are good and which are poor.

Performance metrics

Confusion matrix:
            | Detected 1 | Detected 0
Predicted 1 | A          | B
Predicted 0 | C          | D

Measure | Formula | Meaning
Sensitivity (recall) | A / (A + C) | Sensitivity is the percentage of actual positives (observed associations) that were correctly predicted. It indicates the percentage of 1s that was covered by the model.
Specificity | D / (B + D) | Specificity is the percentage of negatives (here unknown associations, not necessarily true negatives) that were correctly predicted.
Precision | A / (A + B) | Percentage of accurate predictions of the model.
AUC | Area Under the ROC Curve | AUC is a threshold-independent measure of model predictive performance that is commonly used as a validation metric for host-pathogen predictive models [70,71].
F1-score | 2 × (Precision × Recall) / (Precision + Recall) | Captures the harmonic mean of the precision and recall. It is often used with uneven class distribution. Our approach is relaxed with respect to false positives (unknown associations), hence the low F1-score recorded overall. However, in our selection process, where two classifiers produced similar AUC and TSS statistics, the best performing on F1-score was selected (conservative approach).
TSS | sensitivity + specificity − 1 | As the use of AUC has been criticised for its insensitivity to absolute predicted probability and its inclusion of a priori untenable predictions [57,72], we also calculated the True Skill Statistic (TSS) [73].

Table S5 - Measures utilised to assess the performance of our 10-fold cross-validated classifiers. Entries highlighted in grey were not used in classifier selection.

Classifiers selection per perspective
We implemented two strategies to select the constituent models of our three perspectives, after training and optimising as per the previous subsection:
1. Classifiers trained in the mammalian and viral feature spaces: for each virus (or mammal) we selected the classifier which achieved the best performance metrics (Table S5) overall (as derived from predictions). We were able to achieve this dynamic selection as we trained each classifier for each virus (or mammal) independently of other viruses (or mammals). We selected the best performing classifiers based on their F1-score, AUC, and True Skill Statistic (TSS), as defined in Table S5 above.


2. Classifiers trained in the motifs feature (network) space: we selected the classifier which obtained the best test performance metric (median of 100 classifiers in each pipeline, applied to all 2,722,656 possible associations).

Held-out test set validation and performance metrics estimation
We estimated the performance of our framework and its constituent models by training these models on a stratified random sample comprising 85% of our input data (N=2,315,391 with 5,377 known virus-mammal associations), and then calculating performance metrics (Table S5) on the held-out test set (N=407,265, containing the remaining 954 known virus-mammal associations). Figures S4-S6 illustrate the validation process of each of our perspectives.
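The threshold-dependent measures in Table S5 follow directly from the confusion-matrix counts A, B, C and D. The sketch below computes them in Python; the function name and the example counts are hypothetical, and no guard is included for empty classes (zero denominators).

```python
def confusion_metrics(A, B, C, D):
    """Measures from Table S5.

    A: predicted 1, detected 1    B: predicted 1, detected 0
    C: predicted 0, detected 1    D: predicted 0, detected 0
    """
    sensitivity = A / (A + C)
    specificity = D / (B + D)
    precision = A / (A + B)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    tss = sensitivity + specificity - 1
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "tss": tss,
    }


# Hypothetical counts for a single classifier on an unbalanced test set.
scores = confusion_metrics(A=80, B=20, C=20, D=880)
```

AUC is omitted because it is computed from ranked class probabilities rather than from a single confusion matrix.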

Figure S4 – Training and validation of mammalian perspective models. Training and test sets (85%, 15%) are split into N=699 training/test pairs, where each pair contains associations of one mammalian species (e.g. humans) with all viruses in our input set (N=1,896). For each mammalian species, the training set is first balanced using SMOTE, then 8 classifiers are trained with this balanced set. These classifiers are then tested against the (unbalanced) test set, and the best performing (per species) is carried forward to our multi-perspective framework.

Figure S5 – Training and validation of viral perspective models. The training and test sets (85%, 15%) are each split by virus species into N=556 training/test pairs, where each pair contains associations of one virus species with all mammals in our input set (N=1,436). For each virus species, the training set is first balanced using SMOTE, then 8 classifiers are trained with this balanced set. These classifiers are then tested against the (unbalanced) test set, and the best performing classifier (per species) is carried forward to our multi-perspective framework.

Figure S6 – Training and validation of network perspective models. 100 balanced samples (under-sampling) are drawn randomly from the training set (85% of all potential associations). Each sample (2,000 instances: 1,000 negative and 1,000 positive associations) is used to train 8 different classifiers (one for each of the included algorithms - table S4). A Bragging (median predicted probability) ensemble is then created for each algorithm. These ensembles are tested against the held-out test set (15%) and the best ensemble is carried forward to our multi-perspective framework.
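The sampling and aggregation scheme in the caption can be sketched as below: repeated balanced under-samples, one model per sample, and a median over the resulting probabilities. The learner `fit_toy_model` is a hypothetical one-feature stand-in for the algorithms in table S4, and the simulated feature values, seeds, and function names are assumptions made for the example.

```python
import math
import random
import statistics


def balanced_samples(positives, negatives, n_samples=100, per_class=1000, seed=0):
    """Yield n_samples under-sampled sets with per_class instances of each class."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        yield rng.sample(positives, per_class), rng.sample(negatives, per_class)


def fit_toy_model(pos, neg):
    """Hypothetical stand-in learner: a logistic curve centred between the class means."""
    midpoint = (statistics.fmean(pos) + statistics.fmean(neg)) / 2
    return lambda x: 1.0 / (1.0 + math.exp(-(x - midpoint)))


def bragging_predict(models, x):
    """Ensemble probability = median of the individual model probabilities."""
    return statistics.median(model(x) for model in models)


# Simulated one-dimensional feature: positives are shifted upwards relative to negatives.
data_rng = random.Random(42)
positives = [data_rng.gauss(2.0, 1.0) for _ in range(1500)]
negatives = [data_rng.gauss(0.0, 1.0) for _ in range(5000)]

# One model per balanced sample (100 samples of 1,000 + 1,000 instances).
models = [fit_toy_model(p, n) for p, n in balanced_samples(positives, negatives)]
high = bragging_predict(models, 3.0)
low = bragging_predict(models, -1.0)
```

Taking the median rather than the mean makes the ensemble less sensitive to an occasional unrepresentative sample.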

Supplementary Results 1 – Systematic prediction of removed virus-mammal associations
In addition to validating each constituent model as described in the supplementary text and reported in the manuscript and subsequent sections, we performed a systematic test to assess the ability of our framework to predict removed virus-mammal associations. This test differs from estimating performance metrics using a held-out test set (as per the procedures described in the supplementary text) in one important way: whereas the latter tests the framework and its constituent models against a large amount of held-out data (15% of all potential edges), the systematic test we describe below assesses our framework’s ability to predict potentially unseen virus-mammal associations by removing known (recorded in EID2) associations from our input set and recalculating all input parameters accordingly.
The test was performed systematically by removing one documented association (known to exist) between one virus and one mammal at a time. The following input parameters were then recalculated:
1. Recounting of all potential motifs as per the methods described above and in the main manuscript.
2. Recalculating the following mammalian features:
   a. Mean phylogenetic distance between focal mammal and known hosts of virus for each virus-mammal association.
   b. Mean ecological distance between focal mammal and known hosts of virus for each virus-mammal association.
3. Recalculating the following viral features:
   a. Mean phylogenetic distance between focal virus and viruses known to infect the associated mammal for each potential/known virus-mammal association.
   b. Mean phylogenetic distance between focal virus and viruses known to infect the associated mammal.
   c. Maximum phylogenetic distance between known hosts of virus.

Following these adjustments, we retrained all constituent models (mammalian, viral, and network perspectives) using 10-fold cross validation, then attempted to predict the removed link.
We factored in differences in the host ranges of viruses, and variations in the number of viruses detected per mammalian species, by adopting the following processes:
1. Viruses: we ordered virus species by the number of unique mammalian hosts detected (in EID2), and by the second to fourth letters of the virus name. We selected the first virus in each count category and ordered its hosts by the second to fourth letters of the host species name (i.e. excluding genus). We selected the first host (in order) and removed the resulting host-virus association. Table S6 lists all removed associations from the viruses’ point of view.
2. Mammals: we ordered mammalian species by the number of unique virus species detected (in EID2), and by the second to fourth letters of the species name (excluding genus). We selected the first mammalian species in each count category and ordered its viruses by the second to fourth letters of the species name. We selected the first virus (in order) and removed the resulting host-virus association. Table S7 lists all removed associations from the mammals’ point of view.
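The selection procedure for viruses (item 1) can be sketched as below; the procedure for mammals mirrors it with the roles swapped. This is an illustrative reading of the text, not the authors' code: the function names, the descending order of count categories, the use of the full name as a final tie-breaker, and the toy host lists are assumptions.

```python
from collections import defaultdict


def letters_2_to_4(name):
    """Second to fourth letters of a name, lower-cased."""
    return name[1:4].lower()


def species_epithet(binomial):
    """Drop the genus: 'Homo sapiens' -> 'sapiens'."""
    return binomial.split(" ", 1)[1] if " " in binomial else binomial


def pick_removed_associations(hosts_of_virus):
    """hosts_of_virus maps virus name -> set of host binomials.

    Returns one (n_hosts, virus, host) triple per host-count category.
    """
    by_count = defaultdict(list)
    for virus, hosts in hosts_of_virus.items():
        by_count[len(hosts)].append(virus)

    removed = []
    for count in sorted(by_count, reverse=True):
        # First virus in this count category, ordered by letters 2-4 of its name.
        virus = min(by_count[count], key=lambda v: (letters_2_to_4(v), v))
        # First host of that virus, ordered by letters 2-4 of the species epithet.
        host = min(hosts_of_virus[virus],
                   key=lambda h: (letters_2_to_4(species_epithet(h)), h))
        removed.append((count, virus, host))
    return removed


# Toy input (illustrative host lists, not the EID2 data).
toy_hosts = {
    "rabies lyssavirus": {"Vulpes vulpes", "Canis lupus", "Felis catus"},
    "zika virus": {"Homo sapiens", "Macaca mulatta"},
    "orf virus": {"Ovis aries", "Capra hircus"},
}
removed = pick_removed_associations(toy_hosts)
```

Ordering on a fixed slice of the name gives a reproducible choice that does not depend on how well studied a species is.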

Table S6 - Systematic prediction of removed virus-mammal associations (viruses): virus-mammal associations removed from viruses’ point of view. No. is the number of the association, used to link with figure S7 (panel A – viruses); N = number of hosts per selected virus. Vote indicates whether our framework predicted the removed association successfully (1) or not (0).

| No. | N | Mammal | Virus | Vote |
|---|---|---|---|---|
| 1 | 410 | Homo sapiens | tai forest ebolavirus | 1 |
| 2 | 134 | Sus scrofa | akabane orthobunyavirus | 1 |
| 3 | 128 | Bos taurus | new jersey vesiculovirus | 1 |
| 4 | 86 | Pan troglodytes | macaca mulatta polyomavirus 1 | 1 |
| 5 | 73 | Macaca mulatta | zika virus | 1 |
| 6 | 71 | Equus caballus | new jersey vesiculovirus | 1 |
| 7 | 69 | Ovis aries | akabane orthobunyavirus | 1 |
| 8 | 65 | Canis lupus familiaris | betacoronavirus 1 | 1 |
| 9 | 54 | Felis catus | rabies lyssavirus | 1 |
| 10 | 53 | Rattus norvegicus | rat astrovirus | 1 |
| 11 | 47 | Mus musculus | thailand orthohantavirus | 1 |
| 12 | 45 | Macaca fascicularis | tai forest ebolavirus | 1 |
| 13 | 44 | Eidolon helvum | rabies lyssavirus | 1 |
| 14 | 40 | Miniopterus schreibersii | bat astrovirus 1 | 1 |
| 15 | 35 | Chlorocebus aethiops | measles morbillivirus | 1 |
| 16 | 34 | Zalophus californianus | sea lion mastadenovirus a | 1 |
| 17 | 33 | Equus asinus | betacoronavirus 1 | 1 |
| 18 | 32 | Cervus elaphus | rabies lyssavirus | 1 |
| 19 | 30 | Rattus rattus | rat astrovirus | 1 |
| 20 | 28 | Rhinolophus ferrumequinum | bat bocavirus | 1 |
| 21 | 26 | Rangifer tarandus | orf virus | 1 |
| 22 | 24 | Capreolus capreolus | rotavirus a | 1 |
| 23 | 23 | Myodes glareolus | dobrava-belgrade orthohantavirus | 1 |
| 24 | 22 | Loxodonta africana | lyssavirus | 1 |
| 25 | 21 | Rousettus aegyptiacus | rabies lyssavirus | 1 |
| 26 | 20 | Odocoileus hemionus | orf virus | 1 |
| 27 | 19 | Dama dama | rabies lyssavirus | 0 |
| 28 | 18 | Saimiri sciureus | macaca mulatta polyomavirus 1 | 1 |
| 29 | 17 | Papio hamadryas | baboon orthoreovirus | 1 |
| 30 | 16 | Eptesicus serotinus | bat | 1 |
| 31 | 15 | Microtus agrestis | rabies lyssavirus | 0 |
| 32 | 14 | Syncerus caffer | rinderpest morbillivirus | 1 |
| 33 | 13 | Callithrix jacchus | rabies lyssavirus | 0 |
| 34 | 12 | Chlorocebus sabaeus | pegivirus a | 1 |
| 35 | 11 | Giraffa camelopardalis | betacoronavirus 1 | 1 |
| 36 | 10 | Pteropus scapulatus | rabies lyssavirus | 1 |
| 37 | 9 | Lontra canadensis | rabies lyssavirus | 0 |
| 38 | 8 | Camelus bactrianus | a | 1 |
| 39 | 7 | Macaca radiata | kyasanur forest disease virus | 1 |
| 40 | 6 | Vulpes lagopus | rabies lyssavirus | 1 |
| 41 | 5 | Tylonycteris pachypus | bat circovirus | 1 |
| 42 | 4 | Saguinus labiatus | hepatovirus a | 1 |
| 43 | 3 | Plecotus rafinesquii | rabies lyssavirus | 1 |

Table S7 - Systematic prediction of removed virus-mammal associations (mammals): virus-mammal associations removed from mammals’ point of view. No. is the number of the association, used to link with figure S7 (panel B – mammals); N = number of viruses per selected mammal. Vote indicates whether our framework predicted the removed association successfully (1) or not (0).

| No. | N | Mammal | Virus | Vote |
|---|---|---|---|---|
| 1 | 682 | Chrysocyon brachyurus | rabies lyssavirus | 1 |
| 2 | 97 | Callithrix jacchus | monkeypox virus | 1 |
| 3 | 78 | Chrysocyon brachyurus | canine morbillivirus | 1 |
| 4 | 62 | Vulpes macrotis | carnivore protoparvovirus 1 | 1 |
| 5 | 57 | Lama glama | rotavirus a | 1 |
| 6 | 56 | Equus caballus | west nile virus | 1 |
| 7 | 55 | Artibeus planirostris | | 0 |
| 8 | 50 | Dama dama | tick-borne encephalitis virus | 1 |
| 9 | 48 | Lama glama | pestivirus a | 1 |
| 10 | 45 | Lama pacos | bluetongue virus | 1 |
| 11 | 44 | Pteropus scapulatus | japanese encephalitis virus | 1 |
| 12 | 43 | Hipposideros caffer | bat paramyxovirus | 1 |
| 13 | 42 | Equus caballus | ovine gammaherpesvirus 2 | 1 |
| 14 | 41 | Pteropus scapulatus | dengue virus | 1 |
| 15 | 40 | Equus caballus | eastern equine encephalitis virus | 1 |
| 16 | 39 | Cynopterus brachyotis | bat | 1 |
| 17 | 38 | Equus caballus | a | 1 |
| 18 | 36 | Boselaphus tragocamelus | alcelaphine gammaherpesvirus 1 | 0 |
| 19 | 35 | Saimiri sciureus | yellow fever virus | 1 |
| 20 | 34 | Cercopithecus nictitans | lymphocryptovirus 1 | 1 |
| 21 | 32 | Hipposideros abae | rift valley fever phlebovirus | 1 |
| 22 | 31 | Puma yagouaroundi | alphacoronavirus 1 | 1 |
| 23 | 30 | Equus caballus | cowpox virus | 1 |
| 24 | 26 | Rattus tanezumi | rodent astrovirus | 1 |
| 25 | 25 | Cervus elaphus | respiratory syncytial virus | 1 |
| 26 | 24 | Equus caballus | saint louis encephalitis virus | 1 |
| 27 | 23 | Oligoryzomys flavescens | andes orthohantavirus | 1 |
| 28 | 22 | Oligoryzomys flavescens | andes orthohantavirus | 1 |
| 29 | 21 | Tayassu pecari | suid alphaherpesvirus 1 | 0 |
| 30 | 20 | Boselaphus tragocamelus | small ruminant morbillivirus | 1 |
| 31 | 19 | Acinonyx jubatus | felid alphaherpesvirus 1 | 1 |
| 32 | 18 | Pteropus giganteus | pegivirus b | 0 |
| 33 | 17 | Macaca cyclopis | macaca mulatta polyomavirus 1 | 1 |
| 34 | 16 | Equus caballus | deltapapillomavirus 4 | 1 |
| 35 | 14 | Ovis dalli | orf virus | 1 |
| 36 | 13 | Chlorocebus sabaeus | pegivirus a | 1 |
| 37 | 12 | Homo sapiens | dobrava-belgrade orthohantavirus | 1 |
| 38 | 11 | Equus caballus | akabane orthobunyavirus | 1 |
| 39 | 10 | Vulpes vulpes | rabbit hemorrhagic disease virus | 0 |
| 40 | 9 | Akodon azarae | argentinian mammarenavirus | 1 |
| 41 | 8 | Macaca radiata | kyasanur forest disease virus | 1 |
| 42 | 7 | Mandrillus sphinx | macacine gammaherpesvirus 5 | 1 |
| 43 | 6 | Homo sapiens | bhanja virus | 1 |
| 44 | 5 | Zalophus californianus | seal parapoxvirus | 1 |
| 45 | 4 | Rattus rattus | thailand orthohantavirus | 1 |
| 46 | 3 | Peromyscus leucopus | rabies lyssavirus | 1 |
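The vote columns of tables S6 and S7 can be tallied into success rates with a few lines of code. The sketch below records only the row numbers whose vote is 0, as read from the tables reproduced in this supplement; the function name and the way the rows are encoded are assumptions made for the example.

```python
# Row numbers (column "No.") with vote = 0, as read from tables S6 and S7.
failed_s6 = {27, 31, 33, 37}       # table S6 has 43 removed associations
failed_s7 = {7, 18, 29, 32, 39}    # table S7 has 46 removed associations


def success_rate(n_tests, failed_rows):
    """Share of removed associations that the framework predicted back."""
    return (n_tests - len(failed_rows)) / n_tests


rate_s6 = success_rate(43, failed_s6)
rate_s7 = success_rate(46, failed_s7)
rate_all = success_rate(43 + 46, list(failed_s6) + list(failed_s7))

print(f"Table S6: {rate_s6:.1%}, Table S7: {rate_s7:.1%}, combined: {rate_all:.1%}")
```

On this tally, 39 of 43 removed associations in table S6 and 41 of 46 in table S7 were recovered.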


Figure S7 – Results of systematic prediction of removed virus-mammal associations. Circles represent the following in order: whether the leave-out test succeeded in predicting the removed interaction (black) or not (white); the mean probability across the three perspectives; probabilities derived from the mammalian perspective; probabilities derived from the viral perspective; probabilities derived from the network (motifs) perspective; the order of the host; and Baltimore classification of the virus. Panel A – results of viruses (interactions removed for each virus selected in table S6). Panel B – results of mammals (interactions removed for each mammal selected in table S7).

Supplementary Results 2 – The mammalian perspective
Classifier selection
We investigated the effect viral traits have on the distribution of 1,896 virus species in each of the 699 mammalian species which we found to host at least 2 virus species, using 8 supervised classification algorithms (table S4). We optimised (tuned) the hyper-parameters of each classifier using 10-fold cross validation repeated 10 times, with the SMOTE class rebalancing technique. Classifier selection per mammal was achieved using the strategy outlined in Supplementary Note 4. Selected classifiers (i.e. algorithms) and their hyper-parameters (best tunes) were carried across to bagging pipelines, whereby each classifier was retrained with 10-fold cross validation repeated 50 times (with the same tuning parameters), to attend to uncertainties in our sampling regimens and the cross-validation processes, and to generate empirical confidence intervals. Predictions were generated using the median probability of the selected classifier pipelines; 90% CIs were also obtained and integrated into our framework pipeline.

Validation: When taking median predictions of our 10-fold cross validated mammalian perspective models (trained with all available data for each mammal, via cross validation), our mammalian perspective pipelines achieved AUC = 0.916, F1-score = 0.201, and TSS = 0.832. When trained with 85% of available data, our mammalian perspective performance was as follows: AUC = 0.816, F1-score = 0.104, and TSS = 0.632. Our mammalian perspective performance on the held-out test set (15%) was as follows: AUC = 0.807, F1-score = 0.089, and TSS = 0.614.
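The median prediction and empirical 90% interval described under Classifier selection can be sketched as follows. The sketch assumes that the interval is taken as the 5th and 95th percentiles of the probabilities produced by the repeated runs, and that an association is counted as predicted when the relevant summary exceeds 0.5; both choices, the function names, and the toy probabilities are assumptions rather than details confirmed by the text.

```python
import statistics


def summarise_runs(probabilities):
    """Return (median, lower, upper) with lower/upper as the 5th/95th percentiles."""
    cuts = statistics.quantiles(probabilities, n=20, method="inclusive")
    return statistics.median(probabilities), cuts[0], cuts[-1]


def count_predicted(summaries, threshold=0.5):
    """Count associations called positive at the median and at each interval bound."""
    at_median = sum(med > threshold for med, _, _ in summaries)
    at_lower = sum(low > threshold for _, low, _ in summaries)
    at_upper = sum(high > threshold for _, _, high in summaries)
    return at_median, at_lower, at_upper


# Toy probabilities from 50 repeated runs for three hypothetical associations.
runs = {
    "association_a": [0.60 + 0.005 * i for i in range(50)],  # always above 0.5
    "association_b": [0.30 + 0.005 * i for i in range(50)],  # above 0.5 only in the upper tail
    "association_c": [0.10 + 0.005 * i for i in range(50)],  # always below 0.5
}
summaries = [summarise_runs(p) for p in runs.values()]
counts = count_predicted(summaries)  # (median, lower bound, upper bound)
```

Counting at the two interval bounds as well as at the median is one way in which a single median estimate could be accompanied by a wide interval on the number of predicted associations.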

New mammal-virus associations obtained from our mammalian perspective models
When using viral traits to predict associations between mammals and their viruses, the results of our selected models (previous section) suggested that 41,537 (median, 90% CI [4,275, 238,971]) unknown associations could be missing from our original dataset. Our mammalian perspective models suggested 33.24 [6.93, 170.82] viruses per host (~8.25-fold increase [~1.36, ~37.6]), with a median of 28.93 [2.98, 166.41] new viruses per host. Figure S10 illustrates the results obtained from our mammalian perspective models.

Figure S8 - Comparison of machine learning algorithms trained in viral feature space (mammalian perspective). Boxplots show the median, quartiles, and range of performance metrics for 8 algorithms from 10-fold cross validation, each trained with the same training sets per mammal (N=699).

Figure S9 - Selected classifiers trained in viral feature space (mammalian perspective). Boxplots show the median, quartiles, and range of performance metrics for 8 algorithms from 50 runs with 10-fold cross validation of the best performing tuned classifiers (Figure S8) per mammal (N=699). Percentage of selected classifiers (number of times a classification algorithm was selected as best performing for a given mammalian species divided by number of times the algorithm was run in total) was as follows: avNNet=8.96%, GBM=10.90%, RF=11.64%, XGBoost=9.25%, SVM-RW=23.43%, SVM-LW=15.52%, SVM-P=11.19%, NB=9.10%.

Figure S10 - Results of our mammalian perspective models.
Panel A – changes in mammalian host range of viruses from known (in EID2) to predicted (by mammalian perspective models – top 30 are labelled). Rabies lyssavirus was excluded from panel A to allow for better visualisation.
Panel B – predicted host-virus associations per order between viruses and wild and semi-domesticated mammals. We selected the following orders to represent (clockwise): Carnivora, Chiroptera, Perissodactyla, Primates, Rodentia, Ruminantia, Suina and Tylopoda. Dark grey bars indicate detected interactions (in EID2).
Panel C – changes in number of viruses per wild or semi-domesticated mammalian host species from detected (in EID2) to predicted (by mammalian perspective models – top 30 are labelled).
Panel D – top 15 genera (by number of predicted wild or semi-domesticated mammalian host species) in selected orders (Other indicates results for all orders not included in the first five circles). Each order figure comprises the following circles (from outside to inside): 1) number of hosts predicted to have an association with the viral genus; 2) number of hosts detected to have an association; 3) number of hosts predicted to harbour zoonotic viruses (i.e. predicted to share at least one virus species with humans); 4) number of hosts predicted to share viruses with domesticated livestock (domesticated mammals in orders Perissodactyla, Ruminantia, Suina and Tylopoda); and 5) Baltimore classification of the selected genera.

Supplementary Results 3 – The viral perspective
Classifier selection
We examined the effect of mammalian traits on the distribution of each of the 556 virus species for which we found more than one mammalian host among 1,436 terrestrial mammals, using the same classifier selection strategy as the previous section. Predictions were generated using the median probability of the selected classifier pipelines; 90% CIs were also obtained and integrated into our pipeline.

Validation: When taking median predictions of our 10-fold cross validated viral perspective models (trained with all available data for each virus, via cross validation), our viral perspective pipelines achieved AUC = 0.867, F1-score = 0.289, and TSS = 0.734. When trained with 85% of available data, our viral perspective performance was as follows: AUC = 0.800, F1-score = 0.150, and TSS = 0.600. Our viral perspective performance on the held-out test set (15%) was as follows: AUC = 0.797, F1-score = 0.150, and TSS = 0.594.

New links (mammalian host-virus associations)
When using mammalian features to predict potential associations with multi-host viruses, the results of our selected models indicated that 21,352 (median, 90% CI [2,536, 95,630]) unknown associations could be missing from the original bipartite network. Our viral perspective models suggested a median of 14.56 [4.48, 53.78] hosts per virus (~2.96-fold increase [~1.15, ~11.84]), with a median of 11.26 [1.34, 50.44] new hosts per virus. Figure S13 illustrates the results obtained from our viral perspective models.

Figure S11 - Comparison of machine learning algorithms trained in mammalian feature space (viral perspective). Boxplots show the median, quartiles, and range of performance metrics for 8 algorithms from 10-fold cross validation, each trained with the same training sets per virus (N=556).

Figure S12 - Selected classifiers trained in mammalian feature space (viral perspective). Boxplots show the median, quartiles, and range of performance metrics for 8 algorithms from 50 runs with 10-fold cross validation of the best performing tuned classifiers (Figure S11) per virus (N=556). Percentage of selected classifiers (number of times a classification algorithm was selected as best performing for a given virus species divided by number of times the algorithm was run in total) was as follows: avNNet=6.18%, GBM=16.6%, RF=13.13%, XGBoost=20.46%, SVM-RW=14.48%, SVM-LW=4.63%, SVM-P=5.02%, NB=19.5%.

Figure S13 - Results of our viral perspective models.
Panel A – changes in mammalian host range of viruses from known (in EID2) to predicted (by viral perspective models – top 30 are labelled). Rabies lyssavirus was excluded from panel A to allow for better visualisation.
Panel B – predicted host-virus associations per order between viruses and wild and semi-domesticated mammals. We selected the following orders to represent (clockwise): Carnivora, Chiroptera, Perissodactyla, Primates, Rodentia, Ruminantia, Suina and Tylopoda. Dark grey bars indicate detected interactions (in EID2).
Panel C – changes in number of viruses per wild or semi-domesticated mammalian host species from detected (in EID2) to predicted (by viral perspective models – top 30 are labelled).
Panel D – top 15 genera (by number of predicted wild or semi-domesticated mammalian host species) in selected orders (Other indicates results for all orders not included in the first five circles). Each order figure comprises the following circles (from outside to inside): 1) number of hosts predicted to have an association with the viral genus; 2) number of hosts detected to have an association; 3) number of hosts predicted to harbour zoonotic viruses (i.e. predicted to share at least one virus species with humans); 4) number of hosts predicted to share viruses with domesticated livestock (domesticated mammals in orders Perissodactyla, Ruminantia, Suina and Tylopoda); and 5) Baltimore classification of the selected genera.

Supplementary Results 4 – The network perspective
Classifier selection
We performed full network prediction (all possible edges, N = 2,722,656) using our trained classifiers (as per the sampling and aggregating strategy listed in the main manuscript and Supplementary Note 4). We assessed the results of these classifiers (including unseen data) using the metrics outlined in Supplementary Note 4. Following this, SVM(RW) (Support Vector Machine with RBF kernel and class weights) emerged as the best classifier overall.

Validation: When taking median predictions of our 10-fold cross validated network perspective models (trained with all available virus-mammal associations, via cross validation), our network perspective pipelines achieved AUC = 0.959, F1-score = 0.139, and TSS = 0.918. When trained with 85% of available data, our network perspective performance was as follows: AUC = 0.927, F1-score = 0.300, and TSS = 0.856. Our network perspective performance on the held-out test set (15%) was as follows: AUC = 0.925, F1-score = 0.138, and TSS = 0.850.

New links (mammalian host-virus associations)
When inferring associations between mammals and their viruses using motifs as features, our models suggested that 76,081 (median, 90% CI [27,738, 205,814]) unknown associations could be missing from the original bipartite network. Our network perspective models indicated a median of 57.15 [23.22, 147.62] viruses per mammalian host (median fold increase ~14.12 [~4.64, ~34.29]), with 52.98 [19.32, 143.32] new viruses per host. Conversely, they suggested a median of 43.29 [17.59, 111.81] mammalian hosts per virus (~7.58-fold increase [~2.75, ~25.25]), with 40.13 [14.63, 108.55] new hosts per virus. Figure S15 illustrates the results obtained from our network perspective models.

Variable importance – Motifs selection
As stated in the main manuscript, we did not filter motifs prior to feeding them to the machine learning algorithms. Figure S16 illustrates the relative influence of motif counts per edge (simply motifs) in our selected models – SVM (RW).

Figure S14 - Comparison of machine learning algorithms trained in motif feature space (network perspective). Boxplots show the median, quartiles, and range of performance metrics for 8 algorithms from 10-fold cross validation (100 repeats); results were derived by applying each trained model to the set of possible mammal-virus interactions (N=2,722,656).

Figure S15 - Results of our network perspective models.
Panel A – changes in mammalian host range of viruses from known (in EID2) to predicted (by network perspective models – top 30 are labelled). Rabies lyssavirus was excluded from panel A to allow for better visualisation.
Panel B – predicted host-virus associations per order between viruses and wild and semi-domesticated mammals. We selected the following orders to represent (clockwise): Carnivora, Chiroptera, Perissodactyla, Primates, Rodentia, Ruminantia, Suina and Tylopoda. Dark grey bars indicate detected interactions (in EID2).
Panel C – changes in number of viruses per wild or semi-domesticated mammalian host species from detected (in EID2) to predicted (by network perspective models – top 30 are labelled).
Panel D – top 15 genera (by number of predicted wild or semi-domesticated mammalian host species) in selected orders (Other indicates results for all orders not included in the first five circles). Each order figure comprises the following circles (from outside to inside): 1) number of hosts predicted to have an association with the viral genus; 2) number of hosts detected to have an association; 3) number of hosts predicted to harbour zoonotic viruses (i.e. predicted to share at least one virus species with humans); 4) number of hosts predicted to share viruses with domesticated livestock (domesticated mammals in orders Perissodactyla, Ruminantia, Suina and Tylopoda); and 5) Baltimore classification of the selected genera.

Figure S16 - Relative contribution (importance) of motif-features (variables) to the AUC of our SVM-RW models trained with random balanced subsampling (n=100 times) with our “potential” motif-features as predictors of unknown virus-host associations (new links). Motifs (subgraphs) are coloured by the number of nodes (K=3, 4, 5) and correspond to figure 1 in the main manuscript. Research effort into both viruses and mammals was added as independent variables in our motifs models (coloured in yellow).

Supplementary Results 5 – Additional Results
Difference in performance metrics between training and testing sets
Table S8 – Differences in performance metrics between training and testing sets at full model and individual perspective levels.

| Perspective | Set | AUC | F1-Score | TSS |
|---|---|---|---|---|
| Full model | Training | 0.972 | 0.322 | 0.944 |
| Full model | Testing | 0.956 | 0.310 | 0.915 |
| Mammalian | Training | 0.916 | 0.138 | 0.832 |
| Mammalian | Testing | 0.890 | 0.115 | 0.781 |
| Viral | Training | 0.867 | 0.194 | 0.735 |
| Viral | Testing | 0.839 | 0.181 | 0.678 |
| Network | Training | 0.952 | 0.192 | 0.904 |
| Network | Testing | 0.947 | 0.190 | 0.893 |
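The gap between training and testing performance in table S8 can be made explicit with a short calculation. The sketch below copies the values from the table and subtracts the testing from the training metrics; the dictionary layout and variable names are assumptions made for the example.

```python
# (AUC, F1-score, TSS) per perspective, copied from table S8.
table_s8 = {
    "Full model": {"train": (0.972, 0.322, 0.944), "test": (0.956, 0.310, 0.915)},
    "Mammalian": {"train": (0.916, 0.138, 0.832), "test": (0.890, 0.115, 0.781)},
    "Viral": {"train": (0.867, 0.194, 0.735), "test": (0.839, 0.181, 0.678)},
    "Network": {"train": (0.952, 0.192, 0.904), "test": (0.947, 0.190, 0.893)},
}

# Training minus testing, rounded to the precision of the table.
gaps = {
    name: tuple(round(tr - te, 3) for tr, te in zip(row["train"], row["test"]))
    for name, row in table_s8.items()
}

# Perspective with the largest drop in TSS between training and testing.
largest_tss_drop = max(gaps, key=lambda name: gaps[name][2])

for name, (d_auc, d_f1, d_tss) in gaps.items():
    print(f"{name}: AUC {d_auc:+.3f}, F1 {d_f1:+.3f}, TSS {d_tss:+.3f}")
```

On these values, the network perspective shows the smallest drop on all three metrics and the viral perspective the largest drop in TSS.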

Viruses found in humans
Our multi-perspective framework generates predictions for each potential virus-mammal association (N=2,722,656 between 1,896 viruses and 1,436 terrestrial mammals). Here we highlight results for humans. In addition to 425 virus species known to affect humans, our model predicted 16 undocumented (in EID2) virus species that could potentially be found in humans. Table S9 presents these viruses.

Table S9 – Predicted viruses that could potentially be found in humans (not documented in input dataset). Baltimore is the Baltimore classification (7 groups, I to VII). Mammalian lists median probabilities drawn from the mammalian perspective (Homo sapiens model trained with viral traits). Viral lists median probabilities drawn for humans from the viral perspective (16 virus models trained with mammalian traits). Network lists probabilities drawn from the network perspective (trained with motif features). Mean is the average probability across the three perspectives. Voting was drawn for each perspective (1 if the probability passes the 0.5 threshold).

| Virus species | Baltimore | Mammalian | Viral | Network | Mean |
|---|---|---|---|---|---|
| murid betaherpesvirus 1 | I | 0.003 | 0.541 | 0.999 | 0.514 |
| cercopithecine alphaherpesvirus 9 | I | 0.042 | 0.781 | 0.999 | 0.607 |
| lymphocryptovirus 1 | I | 0.073 | 1.000 | 0.999 | 0.691 |
| deltapapillomavirus 4 | I | 0.002 | 0.994 | 0.997 | 0.665 |
| bat rotavirus | III | 0.005 | 0.601 | 0.988 | 0.531 |
| bat hepatovirus | IV | 0.006 | 0.979 | 0.973 | 0.653 |
| primate astrovirus | IV | 0.045 | 0.618 | 0.999 | 0.554 |
| enterovirus j | IV | 0.158 | 0.707 | 0.999 | 0.621 |
| enterovirus f | IV | 0.007 | 0.708 | 0.996 | 0.570 |
| mamastrovirus 5 | IV | 0.574 | 0.224 | 0.998 | 0.599 |
| aravan lyssavirus | V | 0.004 | 0.519 | 0.955 | 0.493 |
| beilong jeilongvirus | V | 0.001 | 0.518 | 0.990 | 0.503 |
| coastal plains tibrovirus | V | 0.008 | 0.627 | 0.996 | 0.544 |
| | V | 0.019 | 0.902 | 0.998 | 0.640 |
| simian immunodeficiency virus | VI | 0.459 | 0.696 | 0.999 | 0.718 |
| simian retrovirus 5 | VI | 0.021 | 0.575 | 0.997 | 0.531 |

Rare viruses and rarely studied mammals
Our multi-perspective framework can capture and expand associations of both rare viruses and understudied mammalian species, due to the separation of perspectives. If a virus is rare, the framework would capture potential hosts via the network and mammalian perspectives. Similarly, if a mammalian species is rarely studied, then the framework would still capture viruses potentially found in this mammalian species via the network and viral perspectives. In relation to rare viruses and understudied mammalian species, our framework was able to expand the host range of rare viruses (n=1,450) from 1,619 to 4,174 (~2.16 average fold increase per rare virus). The virus range of rare mammals (n=954) was also increased from 1,150 to 4,318 (~3.21 average fold increase per host).
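The Mean column of table S9 and the per-perspective voting can be reproduced with a small calculation. The sketch below uses four rows copied from the table; treating the 0.5 threshold as a strict inequality and summing the three votes are assumptions made for the example, since the text does not spell out either detail.

```python
def combine(mammalian, viral, network, threshold=0.5):
    """Return (mean probability rounded to 3 decimals, number of perspectives voting 1)."""
    probabilities = (mammalian, viral, network)
    votes = sum(p > threshold for p in probabilities)  # assumption: strict inequality
    return round(sum(probabilities) / 3, 3), votes


# (Mammalian, Viral, Network) probabilities copied from table S9.
rows = {
    "murid betaherpesvirus 1": (0.003, 0.541, 0.999),
    "mamastrovirus 5": (0.574, 0.224, 0.998),
    "aravan lyssavirus": (0.004, 0.519, 0.955),
    "simian immunodeficiency virus": (0.459, 0.696, 0.999),
}
combined = {virus: combine(*probs) for virus, probs in rows.items()}

for virus, (mean, votes) in combined.items():
    print(f"{virus}: mean={mean:.3f}, votes={votes}/3")
```

For aravan lyssavirus the mean is below 0.5 while two of the three perspectives vote 1, which shows that the mean and the votes need not agree.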

635 SI References 636 1. Wardeh, M., Risley, C., Mcintyre, M. K., Setzkorn, C. & Baylis, M. Database of host-pathogen and 637 related species interactions, and their global distribution. Sci. Data 2, (2015). 638 2. Benson, D. A. et al. GenBank. Nucleic Acids Res. 41, D36-42 (2013). 639 3. Bethesda (MD): National Library of Medicine (US), N. C. for B. I. GenBank [Internet]. (1982). 640 Available at: https://www.ncbi.nlm.nih.gov/nucleotide/. (Accessed: 31st December 2017) 641 4. Bethesda (MD): National Library of Medicine (US). PubMed [Internet]. (1946). Available at: 642 https://www.ncbi.nlm.nih.gov/pubmed. (Accessed: 31st December 2017) 643 5. Federhen, S. The NCBI Taxonomy database. Nucleic Acids Res. 40, D136-43 (2012). 644 6. Sanjuán, R. et al. Viral Mutation Rates Viral Mutation Rates ᰔ. J. Virol. 84, 9733–9748 (2010). 645 7. Coffin, J. M. Structure and Classification of Retroviruses. in The Retroviridae 19–49 (Springer US, 646 1992). doi:10.1007/978-1-4615-3372-6_2 647 8. Nisole, S. & Saïb, A. Early steps of retrovirus replicative cycle. Retrovirology 1, (2004). 648 9. Wawrzyniak, P., Plucienniczak, G. & Bartosik, D. The different faces of rolling-circle replication 649 and its multifunctional initiator proteins. Frontiers in Microbiology 8, (2017). 650 10. Lin, X. et al. Order and disorder control the functional rearrangement of influenza hemagglutinin. 651 Proc. Natl. Acad. Sci. U. S. A. 111, 12049–54 (2014). 652 11. Sicard, A., Michalakis, Y., Gutiérrez, S. & Blanc, S. The Strange Lifestyle of Multipartite Viruses. 653 PLoS Pathogens 12, (2016). 654 12. Rey, F. A. & Lok, S. M. Common Features of Enveloped Viruses and Implications for Immunogen 655 Design for Next-Generation Vaccines. Cell 172, 1319–1334 (2018). 656 13. Yakovchuk, P., Protozanova, E. & Frank-Kamenetskii, M. D. Base-stacking and base-pairing 657 contributions into thermal stability of the DNA double 1. Yakovchuk, P., Protozanova, E. & Frank- 658 Kamenetskii, M. D. 
Base-stacking and base-pairing contributions into thermal stability of the DNA 659 double helix. Nucleic Acids R. Nucleic Acids Res. 34, 564–574 (2006). 660 14. Komarova, N. L. Viral reproductive strategies: How can lytic viruses be evolutionarily competitive? 661 J. Theor. Biol. 249, 766–84 (2007). 662 15. Lefkowitz, E. J. et al. Virus taxonomy: The database of the International Committee on Taxonomy 663 of Viruses (ICTV). Nucleic Acids Res. 46, D708–D717 (2018). 664 16. International Committee on Taxonomy of Viruses (ICTV). Available at: https://talk.ictvonline.org/. 665 (Accessed: 21st October 2019) 666 17. Brierley, L., Vonhof, M. J., Olival, K. J., Daszak, P. & Jones, K. E. Quantifying Global Drivers of 667 Zoonotic Bat Viruses: A Process-Based Perspective. Am. Nat. 187, E53–E64 (2016). 668 18. ViralZone root. Available at: https://viralzone.expasy.org/. (Accessed: 21st October 2019) 669 19. Stekhoven, D. J. Using the missForest Package. (2011). 670 20. Olival, K. J. et al. Host and viral traits predict zoonotic spillover from mammals. Nature 546, 646– 671 650 (2017). 672 21. Guth, S., Visher, E., Boots, M. & Brook, C. E. Host phylogenetic distance drives trends in virus 673 virulence and transmissibility across the animal–human interface. Philos. Trans. R. Soc. B Biol. Sci. 674 374, 20190296 (2019). 675 22. Longdon, B., Brockhurst, M. A., Russell, C. A., Welch, J. J. & Jiggins, F. M. The Evolution and 676 Genetics of Virus Host Shifts. PLoS Pathog. 10, e1004395 (2014). 677 23. Fritz, S. A., Bininda-Emonds, O. R. P. & Purvis, A. Geographical variation in predictors of 678 mammalian extinction risk: big is bad, but only in the tropics. Ecol. Lett. 12, 538–549 (2009). 679 24. Isaac, N. J. B., Turvey, S. T., Collen, B., Waterman, C. & Baillie, J. E. M. Mammals on the EDGE: 680 Conservation Priorities Based on Threat and Phylogeny. PLoS One 2, e296 (2007). 681 25. Kembel, S. W. et al. Picante: R tools for integrating phylogenies and ecology. 
Bioinformatics 26, 682 1463–1464 (2010). 683 26. Vane-Wright, R. I., Humphries, C. J. & Williams, P. H. What to protect?—Systematics and the agony 684 of choice. Biol. Conserv. 55, 235–254 (1991). 685 27. Park, A. W. et al. Characterizing the phylogenetic specialism–generalism spectrum of mammal 686 parasites. Proc. R. Soc. B Biol. Sci. 285, 20172613 (2018). 687 28. Jones, K. E. et al. PanTHERIA: a species-level database of life history, ecology, and geography of

33

688 extant and recently extinct mammals. Ecology 90, 2648–2648 (2009). 689 29. Gnanadesikan, G. E., Pearse, W. D. & Shaw, A. K. Evolution of mammalian migrations for refuge, 690 breeding, and food. Ecol. Evol. 7, 5891–5900 (2017). 691 30. Olival, K. J. et al. Host and viral traits predict zoonotic spillover from mammals. Nature 546, 646– 692 650 (2017). 693 31. Wilman, H. et al. EltonTraits 1.0: Species-level foraging attributes of the world’s birds and mammals. 694 Ecology 95, 2027–2027 (2014). 695 32. IUCN 2018. The IUCN Red List of Threatened Species.Version 2018-2. Available at: 696 http://www.iucnredlist.org. (Accessed: 11th February 2018) 697 33. DE MAGALHÃES, J. P. & COSTA, J. A database of vertebrate longevity records and their relation 698 to other life‐history traits. J. Evol. Biol. 22, 1770–1774 (2009). 699 34. Gower, J. C. A General Coefficient of Similarity and Some of Its Properties. Biometrics 27, 857 700 (1971). 701 35. Pavoine, S., Vallet, J., Dufour, A.-B., Gachet, S. & Daniel, H. On the challenge of treating various 702 types of variables: application for improving the measurement of functional diversity. Oikos 118, 703 391–402 (2009). 704 36. McIntyre, K. M. et al. Systematic Assessment of the Climate Sensitivity of Important Human and 705 Domestic Animals Pathogens in Europe. Sci. Rep. 7, 7134 (2017). 706 37. Hay, S. I. et al. Global mapping of infectious disease. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 368, 707 20120250 (2013). 708 38. Anyamba, A. et al. Global Disease Outbreaks Associated with the 2015–2016 El Niño Event. Sci. 709 Rep. 9, 1930 (2019). 710 39. Jones, A. E. et al. Bluetongue risk under future climates. Nat. Clim. Chang. 9, 153–157 (2019). 711 40. Caminade, C., McIntyre, K. M. & Jones, A. E. Impact of recent and future climate change on vector- 712 borne diseases. Ann. N. Y. Acad. Sci. 1436, 157–173 (2019). 713 41. Karesh, W. B. et al. Ecology of zoonoses: natural and unnatural histories. Lancet 380, 1936–1945 714 (2012). 
42. Hassell, J. M., Begon, M., Ward, M. J. & Fèvre, E. M. Urbanization and Disease Emergence: Dynamics at the Wildlife-Livestock-Human Interface. Trends Ecol. Evol. 32, 55–67 (2017).
43. Gilbert, M. et al. Global distribution data for cattle, buffaloes, horses, sheep, goats, pigs, chickens and ducks in 2010. Sci. Data 5, 180227 (2018).
44. Center for International Earth Science Information Network (CIESIN), Columbia University. Gridded Population of the World, Version 4 (GPWv4): Population Density Adjusted to Match 2015 Revision UN WPP Country Totals, Revision 10. (2017).
45. Harris, I., Jones, P. D., Osborn, T. J. & Lister, D. H. Updated high-resolution grids of monthly climatic observations - the CRU TS3.10 Dataset. Int. J. Climatol. 34, 623–642 (2014).
46. International Union for Conservation of Nature (IUCN) & Center for International Earth Science Information Network (CIESIN), Columbia University. Gridded Species Distribution: Global Mammal Richness Grids, 2015 Release. (2015).
47. Hopkins, M. E. & Nunn, C. L. A global gap analysis of infectious agents in wild primates. Divers. Distrib. 13, 561–572 (2007).
48. Lloyd-Smith, J. O. et al. Epidemic Dynamics at the Human-Animal Interface. Science 326, 1362–1367 (2009).
49. Jenkins, C. N., Pimm, S. L. & Joppa, L. N. Global patterns of terrestrial vertebrate diversity and conservation. Proc. Natl. Acad. Sci. U. S. A. 110, E2602–E2610 (2013).
50. Lloyd-Smith, J. O. et al. Epidemic dynamics at the human-animal interface. Science 326, 1362–1367 (2009).
51. Jones, B. A. et al. Zoonosis emergence linked to agricultural intensification and environmental change. Proc. Natl. Acad. Sci. 110, 8399–8404 (2013).
52. Weaver, S. C. Urbanization and geographic expansion of zoonotic arboviral diseases: mechanisms and potential strategies for prevention. Trends Microbiol. 21, 360–363 (2013).
53. Eskew, E. A. & Olival, K. J. De-urbanization and Zoonotic Disease Risk. EcoHealth (2018). doi:10.1007/s10393-018-1359-9
54. Jones, K. E. et al. Global trends in emerging infectious diseases. Nature 451, 990–993 (2008).
55. Dunn, R. R., Davies, T. J., Harris, N. C. & Gavin, M. C. Global drivers of human pathogen richness and prevalence. Proc. R. Soc. B Biol. Sci. 277, 2587–2595 (2010).
56. Wolfe, N. D., Dunavan, C. P. & Diamond, J. Origins of major human infectious diseases. Nature 447, 279–283 (2007).


57. Allen, T. et al. Global hotspots and correlates of emerging zoonotic diseases. Nat. Commun. 8, 1124 (2017).
58. Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D. & Fernández-Delgado, A. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine Learning Research 15, (2014).
59. Tantithamthavorn, C., Hassan, A. E. & Matsumoto, K. The Impact of Class Rebalancing Techniques on the Performance and Interpretation of Defect Prediction Models. (2018).
60. Kuhn, M., with contributions from Wing, J. et al. caret: Classification and Regression Training. (2018).
61. Kuhn, M. Building Predictive Models in R Using the caret Package. J. Stat. Softw. 28, 1–26 (2008).
62. Friedman, J. H. Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics 29, 1189–1232 (2001).
63. Natekin, A. & Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot. 7, 21 (2013).
64. Hay, S. I. et al. Global mapping of infectious disease. Philos. Trans. R. Soc. B Biol. Sci. 368, 20120250 (2013).
65. Svetnik, V. et al. Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. 43, 1947–1958 (2003).
66. Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 13-17-August-2016, 785–794 (Association for Computing Machinery, 2016).
67. Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16, (2002).
68. Agrawal, A. & Menzies, T. Is "Better Data" Better Than "Better Data Miners"? On the Benefits of Tuning SMOTE for Defect Prediction. (2018). doi:10.1145/3180155.3180197
69. Kuhn, M. Futility Analysis in the Cross-Validation of Machine Learning Models. (2014).
70. Babayan, S. A., Orton, R. J. & Streicker, D. G. Predicting reservoir hosts and arthropod vectors from evolutionary signatures in RNA virus genomes. Science 362, 577–580 (2018).
71. Dallas, T., Park, A. W. & Drake, J. M. Predicting cryptic links in host-parasite networks. PLOS Comput. Biol. 13, e1005557 (2017).
72. Lobo, J. M., Jiménez-Valverde, A. & Real, R. AUC: a misleading measure of the performance of predictive distribution models. Glob. Ecol. Biogeogr. 17, 145–151 (2008).
73. Barbet-Massin, M., Jiguet, F., Albert, C. H. & Thuiller, W. Selecting pseudo-absences for species distribution models: how, where and how many? Methods Ecol. Evol. 3, 327–338 (2012).
