Identifying and Prioritizing Potential Human-Infecting Viruses from Their 2 Genome Sequences
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/2020.11.12.379917; this version posted June 3, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 1 Identifying and prioritizing potential human-infecting viruses from their 2 genome sequences 3 Nardus Mollentze1,2*, Simon A. Babayan2, Daniel G. Streicker1,2 4 1Medical Research Council – University of Glasgow Centre for Virus Research, Glasgow, 5 G61 1QH, United Kingdom. 6 2Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, 7 Veterinary and Life Sciences, University of Glasgow, Glasgow, G12 8QQ, United Kingdom. 8 *Corresponding author 9 Email: [email protected] (NM) 10 Short title: 11 Candidate zoonoses identified from viral genomes 12 Author contributions: 13 Conceptualization: D.G.S., S.A.B.; Data curation: D.G.S, N.M.; Formal analysis: N.M.; 14 Visualization: N.M.; Writing – original draft: N.M.; Writing – review & editing: D.G.S., 15 S.A.B., N.M. 16 Competing interests: The authors declare no competing interests. 17 Data and materials availability: Data and code related to this paper are available at: 18 https://doi.org/10.5281/zenodo.4271479 1 bioRxiv preprint doi: https://doi.org/10.1101/2020.11.12.379917; this version posted June 3, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 19 Abstract 20 Determining which animal viruses may be capable of infecting humans is currently 21 intractable at the time of their discovery, precluding prioritization of high-risk viruses for 22 early investigation and outbreak preparedness. Given the increasing use of genomics in virus 23 discovery and the otherwise sparse knowledge of the biology of newly-discovered viruses, 24 we developed machine learning models that identify candidate zoonoses solely using 25 signatures of host range encoded in viral genomes. Within a dataset of 861 viral species with 26 known zoonotic status, our approach outperformed models based on the phylogenetic 27 relatedness of viruses to known human-infecting viruses (AUC = 0.773), distinguishing high- 28 risk viruses within families that contain a minority of human-infecting species and 29 identifying putatively undetected or so far unrealized zoonoses. Analyses of the 30 underpinnings of model predictions suggested the existence of generalisable features of viral 31 genomes that are independent of virus taxonomic relationships and that may preadapt viruses 32 to infect humans. Our model reduced a second set of 645 animal-associated viruses that were 33 excluded from training to 272 high and 41 very high-risk candidate zoonoses and showed 34 significantly elevated predicted zoonotic risk in viruses from non-human primates, but not 35 other mammalian or avian host groups. A second application showed that our models could 36 have identified SARS-CoV-2 as a relatively high-risk coronavirus strain and that this 37 prediction required no prior knowledge of zoonotic SARS-related coronaviruses. Genome- 38 based zoonotic risk assessment provides a rapid, low-cost approach to enable evidence-driven 39 virus surveillance and increases the feasibility of downstream biological and ecological 40 characterisation of viruses. 2 bioRxiv preprint doi: https://doi.org/10.1101/2020.11.12.379917; this version posted June 3, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 41 Introduction 42 Most emerging infectious diseases of humans are caused by viruses that originate From other 43 animal species. IdentiFying these zoonotic threats prior to emergence is a major challenge 44 since only a small minority of the estimated 1.67 million animal viruses may infect humans 45 [1–3]. Existing models of human infection risk rely on viral phenotypic information that is 46 unknown For newly discovered viruses (e.g., the diversity of species a virus can infect) or that 47 vary insufFiciently to discriminate risk at the virus species or strain level (e.g., replication in 48 the cytoplasm), limiting their predictive value beFore the virus in question has been 49 characterised [4–6]. Since most viruses are now discovered using untargeted genomic 50 sequencing, often involving many simultaneous discoveries with limited phenotypic data, an 51 ideal approach would quantiFy the relative risk of human infectivity upon relevant exposure 52 From sequence data alone. By identiFying high risk viruses warranting Further investigation, 53 such predictions could alleviate the growing imbalance between the rapid pace of virus 54 discovery and lower throughput field and laboratory research needed to comprehensively 55 evaluate risk. 56 Current models can identiFy well-characterised human-infecting viruses From 57 genomic sequences [7,8]. However, by training algorithms on very closely related viruses 58 (i.e., strains of the same species) and potentially omitting secondary characteristics of viral 59 genomes linked to infection capability, such models are less likely to Find signals of zoonotic 60 status that generalise across viruses. Consequently, predictions may be highly sensitive to 61 substantial biases in current knowledge of viral diversity [3,9]. 62 Empirical and theoretical evidence suggests that generalisable signals of human 63 infectivity might exist within viral genomes. Viruses associated with broad taxonomic groups 64 of animal reservoirs (e.g., primates versus rodents) can be distinguished using aspects of their 3 bioRxiv preprint doi: https://doi.org/10.1101/2020.11.12.379917; this version posted June 3, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 65 genome composition, including dinucleotide, codon, and amino acid biases [10]. Whether 66 such measures of viral genome composition are speciFic enough to distinguish host range at 67 the species level remains unclear, but their speciFicity might arise through several commonly 68 hypothesised mechanisms. First, aspects of antiviral immunity that target nucleotide motiFs in 69 viral genomes might select For common mutations in diverse human-associated viruses 70 [11,12]. For example, the depletion of CpG dinucleotides in vertebrate-infecting RNA virus 71 genomes may have arisen to evade zinc-Finger antiviral protein (ZAP), an interFeron- 72 stimulated gene (ISG) that initiates the degradation of CpG-rich RNA molecules [12]. While 73 ZAP occurs widely among vertebrates, increasingly recognized lineage-speciFicity in 74 vertebrate antiviral deFences opens the possibility that analogous, undescribed nucleic-acid 75 targeting deFences might be human (or primate) speciFic [13]. Second, the Frequencies of 76 speciFic codons in virus genomes often resemble those of their reservoir hosts, possibly 77 owing to increased eFFiciency and/or accuracy of mRNA translation [14]. By driving genome 78 compositional similarity to human-adapted viruses or to the human genome, such processes 79 may preadapt viruses For human infection [15,16]. Finally, even in the absence of 80 mechanisms that assert common selective pressures on divergent viral genomes, the 81 phylogenetic relatedness of viruses could allow prediction of the potential For human 82 infectivity since closely related viruses are generally assumed to share common phenotypes 83 and host range. However, despite being a common rule of thumb For virus risk assessment, to 84 our knowledge, whether evolutionary proximity to viruses with known human infection 85 ability predicts zoonotic status remains untested. 86 We aimed to develop machine learning models which use Features engineered From 87 viral and human genome sequences to predict the probability that any animal-infecting virus 88 will infect humans given biologically relevant exposure (here, zoonotic potential). Using a 89 large dataset of viruses which had previously been assessed For human infection ability based 4 bioRxiv preprint doi: https://doi.org/10.1101/2020.11.12.379917; this version posted June 3, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 90 on published reports, we first build models which assess zoonotic potential based on virus 91 taxonomy and/or phylogenetic relatedness to known human-infecting viruses and contrast 92 these models to alternatives based on hypothesized selective pressures on viral genome 93 composition which Favour human infectivity. We then apply the best-perForminG model to 94 explore patterns in the predicted zoonotic potential of additional virus genomes sampled From 95 a range of species. 96 Results 97 We collected a single representative genome sequence from 861 RNA and DNA virus 98 species spanning 36 viral families that contain animal-infecting species (S1 FiG). We labelled 99 each virus as being capable of infecting humans or not using published reports as ground 100 truth, and trained models to classify viruses accordingly. These classifications of human 101 infectivity were obtained by merging three previously published datasets which reported data 102 at the virus species level and therefore did not consider potential for variation in host range 103 within virus species [5,9,17]. Importantly, given diagnostic limitations and the likelihood that 104 not all viruses capable of human infection have had opportunities to emerge and be detected, 105 viruses not reported to infect humans may represent unrealized, undocumented, or genuinely 106 non-zoonotic species.