Gussow et al. BMC Biology (2020) 18:186 https://doi.org/10.1186/s12915-020-00919-9

RESEARCH ARTICLE Open Access Prediction of the incubation period for COVID-19 and future outbreaks Ayal B. Gussow†, Noam Auslander*†, Yuri I. and Eugene V. Koonin*

Abstract Background: A crucial factor in mitigating respiratory viral outbreaks is early determination of the duration of the incubation period and, accordingly, the required time for potentially exposed individuals. At the time of the COVID-19 , optimization of quarantine regimes becomes paramount for public health, societal well- being, and global economy. However, biological factors that determine the duration of the virus incubation period remain poorly understood. Results: We demonstrate a strong positive correlation between the length of the incubation period and disease severity for a wide range of human pathogenic . Using a machine learning approach, we develop a predictive model that accurately estimates, solely from several virus genome features, in particular, the number of protein-coding genes and the GC content, the incubation time ranges for diverse human pathogenic RNA viruses including SARS-CoV-2. The predictive approach described here can directly help in establishing the appropriate quarantine durations and thus facilitate controlling future outbreaks. Conclusions: The length of the incubation period in viral strongly correlates with disease severity, emphasizing the biological and epidemiological importance of the incubation period. Perhaps, surprisingly, incubation times of pathogenic RNA viruses can be accurately predicted solely from generic features of virus genomes. Elucidation of the biological underpinnings of the connections between these features and disease progression can be expected to reveal key aspects of virus pathogenesis. Keywords: SARS-CoV-2, COVID-19, Coronavirus, Incubation period, Respiratory , Pandemic, Respiratory disease, Machine learning

Background is enforcing a period of quarantine on individuals that The recent outbreak of the novel SARS-CoV-2 corona- are suspected to have come in contact with the causative virus and the resulting COVID-19 disease has led to an agent until they are proven clean of [2, 3]. The unprecedented worldwide emergency [1]. Per the World length of the quarantine depends on the time from virus Health Organization (WHO) recommendations, numer- exposure to the emergence of symptoms, i.e., the incuba- ous countries have taken severe preventive measures to tion period. The duration of the incubation period is combat and stem the spread of the virus. A key effective specific to the causative virus [4]. Underestimation of measure recommended by the WHO in viral outbreaks the incubation time could lead to infected individuals being prematurely released from quarantine and spread- * Correspondence: [email protected]; [email protected] ing the disease, whereas overestimation can have a de- †Ayal B. Gussow and Noam Auslander contributed equally to this work. National Center for Biotechnology Information, National Library of Medicine, bilitating economic impact and cause detrimental National Institutes of Health, Bethesda, MD 20894, USA psychological effects [5]. Therefore, knowledge of the

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Gussow et al. BMC Biology (2020) 18:186 Page 2 of 12

range and upper limit of a virus incubation period is severity of the symptoms and associated death rate, fol- crucial to effectively combat and prevent outbreaks while lowing the descriptions of health organizations where minimizing the negative consequences of the quarantine. applicable (see the “Methods” section for details). We The length of the incubation period varies both across found that, although the incubation periods vary sub- and within virus families [4]. Investigation of different stantially for the set of viruses collected, both within and incubation periods within a single virus species has across families, the viruses that cause severe disease pre- shown that in some cases, a longer incubation period sentations tend to have significantly longer incubation corresponds to less severe symptoms [6, 7] whereas periods (Fig. 1a, p value 1.1e−5). This trend is strongest others demonstrate the opposite trend [8]. However, to when considering all 41 viruses and diseases (Fig. 1b), our knowledge, the association between the incubation but holds for both ssRNA and double-strand DNA period and severity across different human viral diseases (dsDNA) viruses separately (Fig. 2c). Furthermore, this has not been studied systematically. Further, genomic trend is significant when considering the two largest features (if any) that correlate with the incubation time viral families in this set, Coronaviridae and Herpesviri- are currently unknown. There is therefore a vital need dae (Fig. 1d), and among diseases associated with a par- for a comprehensive investigation of viral incubation ticular tissue type (Fig. 1e). The biology behind the periods and for methods that predict the incubation pe- relationship between incubation period duration and dis- riods of emerging viruses. If such methods are devel- ease severity warrants further exploration, but the sig- oped, they can be deployed in future virus outbreaks for nificant association identified here between these two early, accurate inference of the incubation period and disease-related variables stresses the importance of the immediate implementation of optimized quarantining incubation period duration for both fundamental under- interventions that will mitigate the spread of the virus standing of the diseases and practical health care issues. while minimizing the negative societal impact [9]. Here, we comprehensively assess the incubation pe- Prediction of incubation time from genomic features riods of different viruses that cause human diseases. We We next sought to develop a model that would facilitate find that, when comparing across different virus species, prediction of incubation periods solely from genomic a longer virus incubation period is significantly associ- features. To our knowledge, this is the first attempt to ated with a more severe disease presentation. This trend predict incubation periods from virus genomes. Given is maintained within and across virus families, regardless the considerable variability observed in the incubation of the affected tissue, and is especially strong among cor- periods among viruses that infect different tissues and onaviruses, and overall, for human respiratory diseases. those with different genome types (Fig. 1c, e), we sought For an in-depth examination and construction of a pre- to focus on a relatively homogenous subset of virus fam- dictive model, we narrowed our focus to respiratory, ilies, to minimize the risk of confounding the prediction non-segmented, single-strand RNA (ssRNA) viruses and with features of no direct relevance. To this end, we fo- analyzed different genomic characteristics of these vi- cused on non-segmented ssRNA viruses that cause re- ruses. We identified features that are predictive of the spiratory infections, which is the largest group of human incubation time and are generalizable across virus fam- viruses that are relatively homogenous biologically but ilies. Based on these features, we developed an elastic show considerable variation in their incubation periods net regression model that predicts virus incubation pe- (Additional file 1: Table 1). Although the predictor is riods. We extensively validated the robustness of this built on a limited set of 14 viruses, there is a sufficient model and the selected features for the prediction of the number of genomes to train the model (n = 3604 incubation time across diverse viruses and virus families, strains). Given that the quarantine time is defined as the to enable accurate early estimation of the incubation upper limit of the virus incubation time, we extracted period for future outbreaks. the upper estimates of the incubation periods for all vi- ruses in the analyzed set (Additional file 1: Table S2, see Results the “Methods” section for details). Association between incubation period and disease To train a model with the dataset in hand, we required severity a set of features that potentially could be predictive of We first curated the information on incubation periods the incubation times. Given that this is the first attempt, for viral human diseases, where such data were available to the best of our knowledge, to identify such features, (41 viruses, Additional file 1: Table 1). To gain further we were not aware of any established mechanistic rela- insight into the relevance of the viral incubation periods tionships between characteristics of virus genomes and to human disease, we investigated the relationship be- incubation periods. Thus, we selected features that are tween viral incubation periods and disease severity. We easily derived from the viral genomes and could be rele- classified diseases as severe or mild, based on the vant for the incubation period (see the “Methods” Gussow et al. BMC Biology (2020) 18:186 Page 3 of 12

Fig. 1 The incubation periods of viruses causing severe and mild human diseases. a Incubation periods (y-axis, log-scaled) of 41 human pathogenic viruses. The circle size corresponds to the length of the incubation period (the size scale is provided in the inset), and the colors correspond to the virus family. Severe diseases are indicated with a black border. b Comparison of the incubation periods (y-axis, log-scaled) between viruses causing mild (blue) vs. severe (red) human diseases, considering all 41 viruses collected. c Comparison of the incubation periods (y-axis, log-scaled) between viruses causing mild (blue) vs. severe (red) human disease, for ssRNA and dsDNA viruses. d Comparison of the incubation periods (y-axis, log-scaled) between viruses causing mild (blue) vs. severe (red) human disease, for large virus families. e Comparison of the incubation periods (y-axis, log-scaled) between viruses causing mild (blue) vs. severe (red) human disease, for diseases associated with distinct tissues. All indicated Pvaluesare for the one sided rank-sum test section for details). We constructed 8 such features Moreover, using standard criteria [14], we did not find (Fig. 2a), based on the complete genome nucleotide se- any CpG islands in our virus set, making it unlikely that quences and within-population genome alignments of all derivations of this feature would help incubation period sequenced strains of each virus (Additional files 1 and 2, prediction beyond the impact of the GC content. Ana- see the “Methods” section for details). In addition to lysis of the pairwise associations between the 8 features these 8 features, we also assessed CpG islands as a po- (Fig. 2a) confirmed some previously reported connec- tential feature, because some viruses, such as hepatitis B tions, such as the negative correlation between genome virus (HBV), have been shown to contain varied distri- length and mutation rate [15] and the positive correl- butions of CpG islands across different strains [10]. Fur- ation between GC content and codon adaptation index thermore, CpG avoidance has been reported for diverse [16] (CAI) (Fig. 2 a,b). Strikingly, our findings indicate RNA viruses including coronaviruses [11, 12], possibly that the mutation rate of SARS-CoV-2 is substantially as a result of selection against recognition by the Zinc- lower than those of other human coronaviruses (CoV), finger Antiviral Protein (ZAP) which binds to CpG mo- including its closest human-infecting relative, SARS- tifs [13]. However, the extent of CpG suppression ap- CoV (with an average of 1.4e−3 and 7.8e−5 transitions pears to be largely uniform among RNA viruses [11]. per branch point per nucleotide for SARS-CoV and Gussow et al. BMC Biology (2020) 18:186 Page 4 of 12

Fig. 2 Genomic features of ssRNA viruses causing respiratory infections. a Pairwise correlation matrix across all features. A description of feature construction is given in the “Methods” section. Each circle indicates Spearman’s correlation coefficient (ρ) between two features. The colors represent the rank-correlation coefficients (red indicates positive correlation and blue indicates negative correlation), and the circle sizes correspond to significance (p value), where significant correlations (p value < 0.05) are circled in black. b Scatter plots illustrating the relationships between features across four virus families. c Estimation of the features’ association with the virus family, based on p values (−log-scaled) from two tests applied (see the “Methods” section for details). The cutoff (p value = 0.05) is indicated with a dashed line. Lower values correspond to features that are not significantly associated with a virus family. d Boxplot and overlaid dot plot of the incubation periods across viral families. e Dot plots of different features across virus families

SARS-CoV-2, respectively; Fig. 2b, see the “Methods” feature, whether it varies more across virus families than section for details). Preliminary reports on SARS-CoV-2 within each family (see the “Methods” section for de- genome evolution indicate a similar trend [17]. tails). The results obtained with the two approaches We then sought to select features to be used for a pre- were equivalent, demonstrating that half of the consid- dictive model of the incubation time. To avoid con- ered features varied more between families than within founding the model with features that are primarily families, and therefore might confound the model driven by virus family, we formally quantified whether a (Fig. 2c). We denote such features family-specific. By given feature is significantly associated with the family contrast, the incubation time was not significantly asso- identity. To this end, we applied two complementary ap- ciated with virus family (Fig. 2d), supporting selection of proaches, namely, analysis of variance (ANOVA) and an features that are not family-specific to train a model; we empirical, non-parametric test, to estimate, for each denote such features family-generic. Four other features Gussow et al. BMC Biology (2020) 18:186 Page 5 of 12

were found to be family-generic: GC content of the virus the model. Moreover, coronaviruses include viruses with genome, variance of the number of different nucleotides both high and low incubation periods, providing a good observed per position in the alignment of the virus representation of the range of incubation period values. strains, number of protein-coding genes in the virus gen- Thus, we trained an elastic net model on the 7 human- ome, and CAI of the virus coding sequence (Fig. 2e). infecting viruses of the family Coronaviridae (Fig. 3a), Thus, these four features were included in the model. using the four family-generic features. We found that Next, we divided the analyzed dataset into training this model, which was trained on a single viral family, and test sets. To maintain a large, diverse, and independ- generalized well to viruses from the three other families ent test set that spans multiple virus families, we se- (Fig. 3b). The test mean absolute error was 1.63 days lected the 7 human-infecting viruses of the family (Fig. 3b), attesting to a close estimation of the upper Coronaviridae as the training set. By training on a single limit of the incubation time in an independent data set. viral family, we allow for a test set with the largest pos- Moreover, the model predictions strongly correlated sible number of families, encompassing high genomic di- with the ranks of the assigned incubation periods in the versity and allowing for a comprehensive evaluation of test set (Spearman’s ρ = 0.91, p value = 0.005).

Fig. 3 Assessment of the elastic net model for virus incubation time prediction. a A scatter plot of the incubation periods of the CoV training set compared to the model predictions, with the model R2 in the upper left corner. b A scatter plot of the incubation periods of the test set compared to the model predictions. The box in the bottom right corner contains Spearman’s ρ between the predictions and the true values, the p value of the Spearman’s ρ, the model R2, and the mean absolute error (MAE). c A heatmap of the coefficients of each feature using different training sets. d Abar plot of the model performance metrics using different subsets of training and testing data, with the number of samples in the testing data for each subset indicated. e A scatter plot of the incubation periods of the test set compared to the model predictions when SARS-CoV-2 is left out of training. The box in the top left corner shows the model’s performance metrics Gussow et al. BMC Biology (2020) 18:186 Page 6 of 12

Specifically, for the virus with the longest known incuba- To assess the robustness of the selected features, we tion period, , the longest incubation time, 9.7 tested models trained with different partitioning of the days, was predicted. Although measles was assigned an data into train and test sets. We found that these upper limit incubation period of 14 days in our data, the changes did not significantly change the performance of majority of the available reports are indeed in the range the models, further attesting to the robustness of the sig- of 9–12 days [18]. The second longest incubation nal obtained using the four family-generic features period was also correctly assigned to respiratory syn- (Fig. 3d). By contrast, a model trained with family- cytial virus (RSV), with a prediction of 9.1 days, specific features does not generalize to the test set, and closely approximating an assigned period of 8 days in one trained using a mixture of family-generic and our data. For parainfluenza viruses 1–3, the model family-specific features disregards the latter by nullifying predicted 7.3, 4.9, and 6.2 days, respectively, closely their coefficients (Additional file 1: Figure S1), further approximating the assigned 6 days. Metapneumovirus demonstrating the efficacy of relying on family-generic was similarly accurately predicted to have a 6.5-day features only. The coefficients assigned to the family- incubation period, within half a day of its assigned 6 generic features did not vary substantially across differ- days. Finally, the shortest incubation time predicted ent training sets, confirming that the method is not par- was correctly assigned to , with a prediction ticularly sensitive to the data used for training (Fig. 3c). of a 1.2-day incubation period. Although rhinovirus Nevertheless, the high performance of the model that is was assigned a 4-day incubation period in our data, trained exclusively on CoV seems to suggest that this most of the cases show symptoms within 1 day [19]. virus family provides a good representation of the de- Exploration of the model indicated that the stron- pendencies of the incubation period on genomic fea- gest predictive features were the number of protein- tures, and/or that training on a single family is coding genes and GC content, with higher values in preferable given the small dataset and the possibility of either feature corresponding to a longer incubation confounding effects. time (Fig. 3c). Elucidation of the mechanisms behind To evaluate the utility of our model, we examined how these associations will require extensive experimental this method would have performed during the early work. A straightforward, even if, likely, over-simplified stages of the current COVID-19 pandemic. To this end, explanation could be that the larger number of genes we removed SARS-CoV-2 from the training data and to be translated by the virus lengthens its replication trained the model on the remaining 6 CoV only, with cycle, under the assumption that the number of trans- the caveat that this training set is poorly balanced as it lation initiation events and/or subgenomic RNAs that contains only 2 viruses with incubation times longer need to be transcribed are rate-limiting factors in than 3 days and, therefore, might underestimate when virus reproduction. Similarly, a higher GC content predicting viruses with long incubation times. The incu- leads to the formation of stable secondary structures bation period of SARS-CoV-2 is still being determined, in the virus RNA, with higher kinetic barriers that with the recommended quarantine time conservatively the ribosome then needs to disrupt during translation, set at 14 days. Recent reports indicate that the vast ma- resulting in longer translation times [20]. Thus, one jority of symptomatic patients develop symptoms within possible explanation for the association between the 10.5 days, generally, within 5 days [25, 26]. Despite hav- number of protein-coding genes and the GC content ing trained the model on an imbalanced training set and longer incubation periods is that the longer cu- biased towards shorter incubation periods, the model mulative translation time extends the replication cycle predicts an incubation period of 8.8 days for SARS-CoV- and, consequently, the incubation period. Alternatively 2, correctly placing SARS-CoV-2 in the upper range of or additionally, extra genes could contribute to more incubation periods and predicting an incubation period complex interactions of the virus with the organ- duration during which current research indicates the ism, resulting in longer incubation times. In particu- majority of symptomatic patients will have shown symp- lar, the highly virulent coronaviruses with long toms. Moreover, a recent meta-analysis that examined characteristic incubation periods encode additional, reported SARS-CoV-2 incubation periods across 18 accessory proteins compared to low virulence viruses studies concluded that the quarantine time should be that have shorter incubation times [21, 22]. The shortened to 7 days [27]. Clearly, the estimate provided accessory genes are dispensable for virus reproduction by the model could have been useful in mitigating the in cell culture and have been implicated in virus-host COVID-19 pandemic (Fig. 3e; similar analysis for the interactions [23]. Some of these additional genes en- other CoV is provided in Additional file 1: Figure S2). code proteins containing distinct immunoglobulin-like We further expanded the approach to facilitate an domains, which is compatible with roles in interac- interval prediction, to provide for the prediction of the tions with the immune system of the host [24]. full range of incubation periods for a novel virus. Given Gussow et al. BMC Biology (2020) 18:186 Page 7 of 12

Fig. 4 Interval evaluation for the predicted incubation period. a True (blue) and predicted (orange) intervals for the coronaviruses in the training set. b True (blue) and predicted (orange) intervals for the viruses in the test set. c True (blue) and predicted (orange) intervals for the viruses causing hemorrhagic fever. The blue (true) circles denote the mode value of the reported incubations periods, and the orange (predicted) circles denote the incubation period predicted by the original model (as in Fig. 3) that there is no consensus as to how to define standard test set; at least, 30% of the true range is covered for all errors or confidence intervals for elastic net regression viruses in the test set, and at least 50% is covered in five models [28–30], we introduce an empirical evaluation of of the seven viruses. The average absolute deviation is lower and upper ranges of the interval of the incubation 1.6 days from the lower incubation range and 1.8 days period (see the “Methods” section for details). We find from the upper incubation range. that the model is predictive of these intervals, in both Given the success of the model when applied to re- the training set (Fig. 4a) and the test set (Fig. 4b, permu- spiratory viruses, we sought to examine whether the tation test p value < 1e−3). On average, the predicted genomic characteristics and model that was effective for range captures 54% of the true range for viruses in the respiratory diseases would be generalizable to non- Gussow et al. BMC Biology (2020) 18:186 Page 8 of 12

respiratory viruses. Given that the model construction and should comprehensively evaluate the effects of dif- and evaluation was limited to respiratory viruses, an ferent confounders on the prediction, such as segmented evaluation on other diseases or types of viruses may be genomes (for example, the genome) and differ- confounded by the different tissue types. To mitigate ences in the tissue tropism. possible confounders resulting from genome structure We also investigated the links between incubation pe- and affected tissue types, we focused on non-segmented riods of different disease-causing viruses and the disease ssRNA viruses of the families Filoviridae (negative-sense severity and found that viruses with long incubation pe- RNA viruses) and Flaviviridae (positive-sense RNA vi- riods tend to cause severe disease. Although the rela- ruses) that cause hemorrhagic fevers. The hemorrhagic tionship between incubation periods and disease severity fever viruses were selected because they consist of a has been assessed previously for specific viral diseases large enough set of viruses that are not associated with a [6–8], to our knowledge, this connection has not been specific tissue, and thus appear to be less likely to intro- studied systematically across a large collection of human duce bias in evaluation. Indeed, the model accurately pathogenic viruses. This signal is robust across different predicted the incubation period of these viruses, includ- viral families and disease types, including coronaviruses. ing 3 types of viruses, Marburg virus, dengue To date, the study of virus incubation periods has been virus, yellow fever virus, and tick-borne encephalitis largely limited to human viruses. It remains to be ex- virus (Fig. 4c, Spearman rho = 0.76, p value = 0.05). plored whether the incubation periods of animal viruses correlate with those of human viruses. If there are robust Discussion correlations, these could provide additional avenues to The emergence of novel viruses that can cause investigate the effect of the incubation period on viral remains a major threat to human health as compellingly pathogenicity and infectivity in an evolutionary context, demonstrated by the COVID-19 pandemic. A major chal- and perhaps, contribute to the development of early in- lenge in dealing with such outbreaks is the initial lack of terventions for potential zoonotic viruses. Furthermore, biological and clinical knowledge of the infectious agent, such investigation could help with uncovering new coro- which can lead to potentially avoidable fatalities until the naviruses with high pathogenic and zoonotic potential. causative agent is thoroughly characterized. Therefore, to The underlying molecular and biological mechanisms mitigate emerging outbreaks, rapid estimation of the incu- of the dependencies between the family-generic genomic bation period of novel viruses is vital, in order to define features, the incubation times, and disease severity re- the appropriate quarantine period and to estimate the rate main to be directly and functionally investigated. One of virus spread. Furthermore, we show here that the contributing factor could be a direct mechanistic con- length of the incubation period in human viral diseases nection between increased translation times in viruses significantly correlates with the disease severity which fur- with many genes and high GC content and longer incu- ther underlines the importance of the accurate prediction bation periods. Additionally, longer incubation periods of the incubation time. are indicative of complex virus-host interactions that With recent advances in sequencing technology, gen- consequently present with more severe disease symp- omic sequences of multiple isolates of novel viruses be- toms. This explanation is compatible with the observa- come available shortly after the virus emerges. Here, we tions in coronaviruses, whereby the highly virulent comprehensively examined genomic features that could strains with long characteristic incubation periods en- be predictive of the virus incubation times of human code several accessory proteins [21, 22] that are missing pathogenic ssRNA viruses and identified four family- in viruses causing milder disease and have been impli- generic features that consistently predict the incubation cated in virus-host interactions [23]. The domain con- periods with high accuracy. Using these features, we de- tent of some of these accessory proteins, indeed, seems veloped a robust model that is predictive of incubation to implicate them in interactions with the host immune times for respiratory ssRNA viruses, the most common system [24]. Another possible explanation is, simply, that cause of viral pandemics [31]. Despite having been a longer incubation period can lead to delayed medical trained and evaluated on respiratory ssRNA viruses only, intervention, so that by the time clinical symptoms ap- our model was found to be predictive also of the incuba- pear, the medical intervention is less effective, and the tion periods of viruses that cause hemorrhagic fevers. disease presents as more severe. However, confirming or Thus, the four genomic features that we identified as be- dispelling any of these hypotheses requires extensive ing family-generic allow for robust prediction of incuba- virological experimentation. tion periods for vastly different diseases caused by viruses that belong to different phyla [32]. Future ad- Conclusions vances based on this work can be expected to expand We demonstrated a robust association between virus in- the model and feature search to additional sets of viruses cubation times and the severity of disease presentation Gussow et al. BMC Biology (2020) 18:186 Page 9 of 12

and identified a set of viral genomic features that is reports going several days beyond that. Thus, we highly predictive of incubation times. To our knowledge, set the incubation period to 14 days [18]. this work is the first to demonstrate that incubation pe- f Respiratory syncytial virus (RSV, n = 1595). For riods of respiratory ssRNA viruses can be accurately pre- RSV, the incubation period was set to 8 days, in dicted by genome analysis alone. The model established accordance with the higher range of the majority through this work and the genomic features that were of reports [37]. used for training can directly facilitate early and accurate g Parainfluenza (n = 43, 58, and 345 for estimation of the required quarantine time for future parainfluenza 1, 2, and 3, respectively). The pandemics and help the responsible agencies set initial parainfluenza incubation period was consistently guidelines accordingly. Furthermore, these results have reported to be between 2 and 6 days and therefore clear applications for controlling the spread of emergent was assigned the upper limit of 6 days [4]. ssRNA respiratory viruses, the most common cause of h Rhinovirus (n = 244). Per previous reports, a 4-day pandemics. Future work can expand this method to en- incubation period was assigned to rhinovirus [4]. compass additional virus families of interest and aid in i Metapneumovirus (n = 162). Six days was assigned mitigating the effect of potentially deadly zoonotic to human metapneumovirus, given the commonly outbreaks. reported range of 4–6 days [38].

Methods We also assigned each of the 41 viruses with a binary Incubation period and severity assignment severity annotation, of either severe or mild. Diseases The incubation time for each of the strains of each of with extreme immune responses, fevers, or other ex- the 41 viruses was collected from the literature (Add- treme symptoms were considered severe, along with dis- itional file 1: Table S2). As incubation periods vary, eases with high death rates. Diseases that cause mild where possible, the upper limit was used, and a consen- respiratory symptoms or diseases that are otherwise be- sus of reports was followed. The only exception to this is nign were considered mild. In cases where either the SARS-CoV-2. Although more data is needed to assess Centers for Disease Control and Prevention (CDC) or the incubation period of SARS-CoV-2, we set the dur- the WHO explicitly described a disease as either severe ation to 14 days given the recommended quarantine or mild, that description was applied as the severity an- times [2]. We note that there are small variations in re- notation. The disease presentations and severity deter- ports of the incubation times, and the assigned values mined, along with the rationale for the determined represent the best approximation. Changing the assigned severity, are detailed in Additional file 1: Table S2. incubation times within the range of reports maintains a Each of the 41 viruses was also classified by its symp- similarly high performance of the trained model (Add- toms and affected tissues, based on CDC and WHO de- itional file 1: Figure S2). scriptions, falling into one of these categories: central The rationale for the selected incubation times for the nervous system (CNS), fever, gastro, gastro/CNS, 14 ssRNA respiratory viruses used in model construction hemorrhagic fever, immune system, liver, skin, and swol- and assessment was as follows: len glands.

a 229E-CoV (n = 25), HKU1-CoV (n = 39), NL63-CoV (n = 60), and OC43-CoV (n = 161). For these Sequence datasets coronaviruses, which are causative agents of Reference genome sequences and GenBank files were common colds, a 3-day incubation period was downloaded from the NCBI [39] for each virus (Add- assigned, following the majority of reports [33, 34]. itional file 1: Table S1, Additional file 2). For each virus, b MERS-CoV (n = 284). A 14-day incubation period additional strains were downloaded from the NCBI and was assigned to MERS-CoV, per previous reports aligned using Mafft [40] v7.407 with default parameters, [35]. resulting in an alignment file for all strains belonging to c SARS-CoV (n = 273). The estimates show 13 days each of the 14 viruses (Additional file 3). For each virus, as an upper limit in the majority of reports [4, 36]. all available strains were downloaded. Phylogenetic trees d SARS-CoV-2 (n = 92). Although more data is were generated for each virus based on the alignment needed to assess the incubation period of SARS- using FastTree [41, 42] with the “-nt” parameter. CoV-2, we set the duration to 14 days given the recommended quarantine times [2]. e Measles virus (n = 213). There is a considerable Genomic features range of reported incubation periods, with the The following genomic features with potential links to majority of reports indicating 9–12 days and some viral replication time and efficiency were evaluated: Gussow et al. BMC Biology (2020) 18:186 Page 10 of 12

a Genome length. The number of nucleotides in the average and variance of these values are used as reference genome sequence. Rationale: The length features. Rationale: The position change mean and of the genome might correlate with virus variance might correlate with the translation efficiency reproduction time. and regulation among different genomic regions and, b Number of genes. The number of genes in the thus, with the viral incubation period. reference genome’s associated GenBank file. We verified for each virus that there were no Features that rely on the multiple sequence alignment undetected genes within its genome using of different strains of the same virus were always nor- MetaGeneMark [43] gene prediction software. malized by the number of strains available, in order to Rationale: The number of genes might correlate avoid biases that could result from different strain with the total time spent on translation in the viral counts per virus. lifecycle and, thus, with the reproduction time. CpG islands were searched for in each reference gen- c Positive or negative strand RNA. Whether the RNA ome using a Python implementation (https://github. virus is positive strand or negative strand. This com/lucasnell/TaJoCGI) with standard criteria [14]. was set to 1 if the virus has a positive strand However, none was found in any of the analyzed virus genome and to 0 if it has a negative strand genomes. genome. Rationale: The positive- or negative-sense RNA might correlate with the time required to Evaluation of the specificity of the features for virus begin translation; negative ssRNA viruses require families an additional stage to synthesize the positive-sense We sought to evaluate, for each feature, whether it is asso- antigenome before translation, and accordingly, ciated with the identity of the virus family which would be could correlate with the reproduction time. a potential confounder to the model. We hence searched d Codon adaptation index (CAI). The CAI was used for features whose variance within each virus family was to analyze the codon usage bias of each virus in not significantly smaller than its overall variance. To this comparison to human. The CAI was calculated by end, each feature was evaluated using two methods. concatenating all the coding sequences (CDS) in each The first method is a one-way analysis of variance virus reference genome GenBank file and using the (ANOVA). One-way ANOVA tests the null hypothesis Biopython [44, 45] software package (version 1.74) that the means of the measurement variable are the implementation, with the CodonAdaptationIndex class same for the different categories of data, against the al- set to a reference human codon usage table [46]. ternative hypothesis that they are not all the same. Rationale: The codon adaptation index could correlate Hence, lower assigned p values signify that the null hy- with translation efficiency and thus with the viral pothesis is rejected and that different viral families have reproduction time. different population means with respect to each feature. e GC content. This was calculated for each reference We therefore consider features assigned with a p value genome using Biopython [44]. Rationale: GC greater than 0.05, for which we could accept the null hy- content could correlate with translation times [20] pothesis, and could not conclude that the feature mean and thus with the reproduction time. was associated with the viral family. The ANOVA test f Mutation rate. Raw mutation rates were was implemented in Python using the f_oneway function estimated per each virus genome alignment, in the SciPy [48] package. without accounting for selection bias, by The ANOVA test assumes that the samples are inde- detecting the ancestral base for every base in pendent, taken from normally distributed populations the genome for every non-leaf node in the tree with equal standard deviations between the groups. using maximum parsimony. Then, at each These assumptions, which must be satisfied for the asso- branch point, the transitions between both sides ciated p value to be valid, are not guaranteed and are dif- of the branch were counted, and the average ficult to evaluate. We hence implemented a second, count was then divided by the length of the empirical test, which is not parametric and does not rely genome for the final estimate. Rationale: The on any assumptions. This empirical test evaluates, for a mutation rates could correlate with the given feature, if its variance within virus families is reproduction time [47]. smaller than would be observed by random assignment g Average and variance of the changes in each of families to viruses. We reason that a feature which is position of the alignment. The change in alignment associated with the virus family would have significantly position is defined as the number of different values smaller variance within the true family assignment than observed in each position of the virus alignment, within a random family assignment. The null hypothesis divided by the number of strains in the alignment. The is that the variance of the features within each family is Gussow et al. BMC Biology (2020) 18:186 Page 11 of 12

similar to the variance across families, and the alterna- Evaluating the significance of assigned interval using tive hypothesis is that the variance of the features within permutation test each family is smaller than the variance across families. To evaluate the significance of the correlation between To perform the empirical test, the feature variance the predicted incubation intervals and the true intervals, within each virus family is calculated and averaged. we applied a permutation test. We calculated the average Next, the feature values are randomly permuted 1000 deviation of the predicted ranges from the true incuba- times and the same calculation is performed, to generate tion ranges across all viruses in the test set, which is 1.7 a null distribution. Let the number of times the variance days. We then shuffle the true intervals 1000 times, to of the permuted values is less than the variance of the generate a null distribution. Let the number of times the real values be X. The p value is calculated as (X + 1)/ average deviation of the predicted range from the per- (1000 + 1). Thus, a lower p value indicates that the fea- muted range is less than or equal to the average devi- ture’s within-family variance is smaller than our null ex- ation of the predicted range from the true range be X. pectation. We search for features with a p value greater The p value is calculated as (X + 1)/(1000 + 1). than 0.05, for which we conclude that the variance within the actual families is not smaller than that within randomly assigned families. This evaluation does not ne- Supplementary Information The online version contains supplementary material available at https://doi. cessarily indicate that the family-specific features are org/10.1186/s12915-020-00919-9. poor predictors, rather, that, with the data available, it would not be possible to discern whether the signal from Additional file 1: Figure S1. The performance of models trained with these features is primarily driven by the variation be- family-generic features. Figure S2. Model performance for ranging incu- bation time assignment. Table S1. Virus and disease features for 41 col- tween virus families. lected viruses. Table S2. Family-generic features for the 14 viruses studied. Elastic net model Additional file 2. GenBank file with the reference genomes used for The elastic net method [49] is a generalization of LASSO each of the studied viruses. using Ridge regression shrinkage, where the naïve esti- Additional file 3. Accessions of the nucleotide sequences used for each ^ of the 14 viruses studied. The sequence alignments are available through mator β is a minimizer of the criterion L(λ1, λ2, β) by: Zenodo (https://zenodo.org/record/4239675).

^ β ¼ argminβðLðλ1; λ2; βÞÞ Acknowledgements 2 2 Not applicable. ¼ argminβðjjy − Xβjj þ λ1jjβjj1 þ λ2jjβjj Þ

Authors’ contributions for any fixed, non-negative λ , λ . Elastic net was ABG and NA initiated the study; ABG, NA, and YIW performed the research; 1 2 ABG, NA, YIW, and EVK analyzed the results; ABG, NA, and EVK wrote the chosen because it has characteristics of both LASSO and manuscript that was edited and approved by all authors. Ridge regression, which are controlled by the penalties coefficients, thus outperforming other regularization and Funding variable selection approaches [49]. This research was supported by the Intramural Research Program of the The elastic net model was constructed in Python using National Library of Medicine at the NIH. the scikit-learn [50] ElasticNet function with default pa- rameters. The features were standardized before training, Availability of data and materials with the same standardization parameters used in train- Genome sequences and GenBank files were downloaded from the NCBI [39]. Virus genome sequences and alignments used for this analysis are provided ing applied to test data before prediction. through Zenodo. (https://zenodo.org/record/4239675), with the DOI: https://doi.org/10.5281/ Evaluating intervals for the predicted incubation periods zenodo.4239675. The full accessions list is additionally provided in Additional File 3. The model, features, and accompanying code are available Given that there is no consensus as to how to define at https://github.com/noamaus/incubation-model. standard errors or confidence intervals for LASSO, Ridge, and elastic net estimates [28–30], we develop an Ethics approval and consent to participate empirical estimation of the lower and upper range of the Not applicable. incubation period using the elastic net model. To this end, we trained two models on the training data (viruses Consent for publication in the Coronaviridae family), with the first model trained Not applicable. on the lower estimates of the incubation period of coro- naviruses and the second model trained on the highest Competing interests reported estimate of the incubation periods. No competing interests declared. Gussow et al. BMC Biology (2020) 18:186 Page 12 of 12

Received: 19 August 2020 Accepted: 6 November 2020 26. Linton N, Kobayashi T, Yang Y, Hayashi K, Akhmetzhanov A, Jung S, et al. Incubation period and other epidemiological characteristics of 2019 novel coronavirus infections with right truncation: a statistical analysis of publicly available case data. J Clin Med. 2020;9:538. References 27. Wassie GT, Azene AG, Bantie GM, Dessie G, Aragaw AM. Incubation period 1. World Health Organization. Statement on the second meeting of the of SARS-CoV-2: a systematic review and meta-analysis. Curr Ther Res. 2020; International Health Regulations (2005) Emergency Committee regarding 93:100607. the outbreak of novel coronavirus (2019-nCoV). 2020. 28. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc 2. World Health Organization. Considerations for quarantine of individuals in Ser B. 1996;58:267–88. the context of containment for coronavirus disease (COVID-19). 2020. 29. Kyung M, Gill J, Ghosh M, Casella G. Penalized regression, standard errors, 3. Gupta AG, Moyer CA, Stern DT. The economic impact of quarantine: SARS and Bayesian lassos. Bayesian Anal. 2010;5:369–411. – in Toronto as a case study. J Inf Secur. 2005;50:386 93. 30. Osborne MR, Presnell B, Turlach BA. On the LASSO and its dual. J Comput 4. Lessler J, Reich NG, Brookmeyer R, Perl TM, Nelson KE, Cummings DAT. Graph Stat. 2000;9:319–37. Incubation periods of acute respiratory viral infections: a systematic review. 31. Carrasco-Hernandez R, Jácome R, López Vidal Y, Ponce de León S. Are RNA – Lancet Infect Dis. 2009;9:291 300. viruses candidate agents for the next global pandemic? A review. ILAR J. 5. Hawryluck L, Gold WL, Robinson S, Pogorski S, Galea S, Styra R. SARS control 2017;58:343–58. and psychological effects of quarantine, Toronto, Canada. Emerg Infect Dis. 32. Koonin EV, Dolja VV, Krupovic M, Varsani A, Wolf YI, Yutin N, et al. Global – 2004;10:1206 12. organization and proposed megataxonomy of the virus world. Microbiol 6. Virlogeux V, Park M, Wu JT, Cowling BJ. Association between severity of Mol Biol Rev. 2020;84:e00061-19. – MERS-CoV infection and incubation period. Emerg Infect Dis. 2016;22:526 8. 33. Bradburne AF, Bynoe ML, Tyrrell DA. Effects of a “new” human respiratory 7. Virlogeux V, Fang VJ, Wu JT, Ho L-M, Peiris JSM, Leung GM, et al. Brief virus in volunteers. BMJ. 1967;3:767–9. report: incubation period duration and severity of clinical disease following 34. Tyrrell DAJ, Cohen S, Schilarb JE. Signs and symptoms in common colds. severe acute respiratory syndrome coronavirus infection. Epidemiology. Epidemiol Infect. 1993;111:143–56. – 2015;26:666 9. 35. Nishiura H, Miyamatsu Y, Mizumoto K. Objective determination of end of 8. Virlogeux V, Yang J, Fang VJ, Feng L, Tsang TK, Jiang H, et al. Association MERS outbreak, South Korea, 2015. Emerg Infect Dis. 2016;22:146–8. between the severity of influenza A(H7N9) virus infections and length of 36. Meltzer MI. Multiple contact dates and SARS incubation periods. Emerg the incubation period. PLoS One. 2016;11:e0148506. Infect Dis. 2004;10:207–9. 9. Koo JR, Cook AR, Park M, Sun Y, Sun H, Lim JT, et al. Interventions to 37. Madge P, Paton JYY, McColl JHH, Mackie PLKL. Prospective controlled study mitigate early spread of SARS-CoV-2 in Singapore: a modelling study. Lancet of four infection-control procedures to prevent nosocomial infection with – Infect Dis. 2020;20:678 88. respiratory syncytial virus. Lancet. 1992;340:1079–83. 10. Zhong C, Hou Z, Huang J, Xie Q, Zhong Y. Mutations and CpG islands among 38. Schweon SJ. Human metapneumovirus. Nursing (Lond). 2013;43:62–3. hepatitis B virus genotypes in Europe. BMC Bioinformatics. 2015;16:38. 39. Agarwala R, Barrett T, Beck J, Benson DA, Bollin C, Bolton E, et al. Database 11. Simmonds P, Xia W, Baillie JK, McKinnon K. Modelling mutational and resources of the National Center for Biotechnology Information. Nucleic selection pressures on dinucleotides in eukaryotic phyla--selection against Acids Res. 2018;46:D8–13. CpG and UpA in cytoplasmically expressed RNA and in RNA viruses. BMC 40. Katoh K, Standley DM. MAFFT multiple sequence alignment software Genomics. 2013;14:610. version 7: improvements in performance and usability. Mol Biol Evol. 2013; 12. Digard P, Lee HM, Sharp C, Grey F, Gaunt E. Intra-genome variability in the 30:772–80. dinucleotide composition of SARS-CoV-2. Virus Evol. 2020;6:veaa057. 41. Price MN, Dehal PS, Arkin AP. FastTree 2 – approximately maximum- 13. Takata MA, Gonçalves-Carneiro D, Zang TM, Soll SJ, York A, Blanco-Melo D, likelihood trees for large alignments. PLoS One. 2010;5:e9490. et al. CG dinucleotide suppression enables antiviral defence targeting non- 42. Price MN, Dehal PS, Arkin AP. Fasttree: computing large minimum evolution – self RNA. Nature. 2017;550:124 7. trees with profiles instead of a distance matrix. Mol Biol Evol. 2009;26:1641–50. 14. Takai D, Jones PA. Comprehensive analysis of CpG islands in human 43. Zhu W, Lomsadze A, Borodovsky M. Ab initio gene identification in – chromosomes 21 and 22. Proc Natl Acad Sci. 2002;99:3740 5. metagenomic sequences. Nucleic Acids Res. 2010;38:e132. 15. Sanjuán R, Nebot MR, Chirico N, Mansky LM, Belshaw R. Viral mutation rates. 44. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. – J Virol. 2010;84:9733 48. Biopython: freely available Python tools for computational molecular 16. Knight RD, Freeland SJ, Landweber LF. A simple model based on mutation and biology and bioinformatics. Bioinformatics. 2009;25:1422–3. selection explains trends in codon and amino-acid usage and GC composition 45. Sharp PM, Li W-H. The codon adaptation index-a measure of directional within and across genomes. Genome Biol. 2001;2:RESEARCH0010. synonymous codon usage bias, and its potential applications. Nucleic Acids 17. Jia Y, Shen G, Zhang Y, Huang K-S, Ho H-Y, Hor W-S, et al. Analysis of the Res. 1987;15:1281–95. mutation dynamics of SARS-CoV-2 reveals the spread history and 46. Nakamura Y. Codon usage tabulated from international DNA sequence emergence of RBD mutant with lower ACE2 binding affinity. bioRxiv. 2020; databases: status for the year 2000. Nucleic Acids Res. 2000;28:292. 2020.04.09.034942. 47. Fitzsimmons WJ, Woods RJ, McCrone JT, Woodman A, Arnold JJ, Yennawar – 18. Naim HY. Measles virus. Hum Vaccin Immunother. 2015;11:21 6. M, et al. A speed-fidelity trade-off determines the mutation rate and 19. Harris JM, Gwaltney JM. Incubation periods of experimental rhinovirus virulence of an RNA virus. PLoS Biol. 2018;16:e2006459. – infection and illness. Clin Infect Dis. 1996;23:1287 90. 48. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, 20. Qu X, Wen J-D, Lancaster L, Noller HF, Bustamante C, Tinoco I. The et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. ribosome uses two active mechanisms to unwind messenger RNA during Nat Methods. 2020;17:261–72. – translation. Nature. 2011;475:118 21. 49. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R 21. Liu DX, Fung TS, Chong KK-L, Shukla A, Hilgenfeld R. Accessory proteins of Stat Soc Ser B (statistical Methodol). 2005;67:301–20. – SARS-CoV and other coronaviruses. Antivir Res. 2014;109:97 109. 50. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. 22. Cui J, Li F, Shi Z-L. Origin and evolution of pathogenic coronaviruses. Nat Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30. Rev Microbiol. 2019;17:181–92. 23. Menachery VD, Mitchell HD, Cockrell AS, Gralinski LE, Yount BL, Graham RL, et al. MERS-CoV accessory ORFs play key role for infection and Publisher’sNote pathogenesis. MBio. 2017;8:e00665–17. Springer Nature remains neutral with regard to jurisdictional claims in 24. Tan Y, Schneider T, Leong M, Aravind L, Zhang D. Novel immunoglobulin published maps and institutional affiliations. domain proteins provide insights into evolution and pathogenesis of SARS- CoV-2-related viruses. MBio. 2020;11:e00760-00720. 25. Lauer SA, Grantz KH, Bi Q, Jones FK, Zheng Q, Meredith HR, et al. The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application. Ann Intern Med. 2020;172:577.