<<

bioRxiv preprint doi: https://doi.org/10.1101/2020.11.09.375394; this version posted November 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Network-based -Host Interaction Prediction with Application to SARS-CoV-2

Hangyu Du†, Feng Chen†, Hongfu Liu*, and Pengyu Hong*

1Department of Computer Science, Brandeis University, Waltham, Massachusetts, †These authors contribute equally *Correspondence: [email protected], [email protected]

COVID-19, caused by Severe Acute Respiratory Syndrome above two goals for new zoonotic , which we believe 2 (SARS-CoV-2), has quickly become a global can be done by leveraging the knowledge and data about health crisis since the first report of infection in December of known viruses highly relevant to the new ones. 2019. However, the infection spectrum of SARS-CoV-2 and its The research community has accumulated a great deal comprehensive protein-level interactions with hosts remain un- of knowledge about several other human clear. There is a massive amount of under-utilized data and (including SARS-CoV (7–15), HCoV-HKU1 (15), HCoV- knowledge about RNA viruses highly relevant to SARS-CoV-2 OC43 (16, 17), HCoV-NL63 (18), MERS-CoV (19–23)) and and their hosts’ proteins. More in-depth and more compre- hensive analyses of that knowledge and data can shed new in- has collected a large amount of data about them. For exam- sight into the molecular mechanisms underlying the COVID-19 ple, it was shown that human Angiotensin-Converting En- pandemic and reveal potential risks. In this work, we con- zyme 2 (ACE2) was the primary host receptor used by the structed a multi-layer virus-host interaction network to incor- S protein (S-protein) of SARS-CoV-2 for the virus to gain porate these data and knowledge. A machine learning-based entry into human cells (24) (Figure S1). ACE2 is also the method, termed Infection Mechanism and Spectrum Prediction host receptor used by SARS-CoV (14) and HCoV-NL63 (18). (IMSP), was developed to predict virus-host interactions at both The S-protein of SARS-CoV-2 binds significantly tighter to protein and organism levels. Our approach revealed five po- ACE2 than its counterpart in SARS-CoV (25). After the virus tential infection targets of SARS-CoV-2, which deserved pub- enters host cells, -stimulated genes (ISGs) are es- lic health attention, and eight highly possible interactions be- sential for a host to defend against viral infection (Figure S2). tween SARS-CoV-2 proteins and human proteins. Given a new This knowledge and data can be utilized to investigate the virus, IMSP can utilize existing knowledge and data about other highly relevant viruses to predict multi-scale interactions be- infection spectrum of SARS-CoV-2 and its interactions with tween the new virus and potential hosts. hosts at the protein level. Finally, we built a Virus-Host Inter- action Network (VHIN) of seven viruses and seventeen hosts Keywords: Coronavirus, COVID-19, SARS-CoV-2, Machine Learning, Interac- that summarizes the existing protein-protein interaction (PPI) tion Prediction, Protein-Protein-Interaction, virus-host interaction network and infection relationships among them (Figure 1A, more de- tails in Figures S3, S4 and S5). Introduction We have developed a network-based multi-level virus-host interaction modeling and prediction, termed Infection Mech- Severe Acute Respiratory Syndrome Coronavirus-2 (SARS- anism and Spectrum Prediction (IMSP) (Figure 1B, details CoV-2), a novel virus causing the COVID-19 disease, was in the Method section), which uses machine learning tech- first reported in Wuhan, China, in December of 2019. Since nique to learn from the constructed VHIN and then predict then, it has quickly become a global health crisis (1) with novel virus-host interactions at both the protein (i.e., Mecha- over 50 million people infected and 1,250,000 death over nism) and organism (i.e., Spectrum) levels. IMSP predicts 200 countries as of November of 2020 (2). The impact that the SARS-CoV-2 S-protein can bind well with ACE2 of SARS-CoV-2 has significantly surpassed previous out- receptors in seven mammalian hosts, which have not been breakings of coronaviruses, such as Severe Acute Respiratory reported. Among those hosts, five of them are predicted to Syndrome Coronavirus (SARS-CoV) in 2003 and the Mid- have high risks of being infected by SARS-CoV-2. More- dle East Respiratory Syndrome Coronavirus (MERS-CoV) over, IMSP identifies eight new interactions between SARS- in 2012. Besides humans, SARS-CoV-2 has also been con- CoV-2 proteins and human proteins. To our best knowledge, firmed to infect several other mammals closely related to hu- our work is the first to apply machine learning techniques for man activities, including dogs (3), cats (4), tigers (5), and predicting virus-host interactions at both protein and organ- golden Syrian hamsters (6). It is important to identify a com- ism levels. Previous works (26, 27) on predicting virus-host prehensive set of such mammals because they can potentially interactions focus on the relationships between SARS-CoV-2 serve as covert means to exacerbate the spread of COVID-19. proteins and human proteins, which ignore other hosts that Moreover, identifying interactions between SARS-CoV-2 might be infected by SARS-CoV-2. proteins and host proteins can deepen our understanding of the viral invasion processes and may help design treatments and . In general, we want to promptly achieve the bioRxiv preprint doi: https://doi.org/10.1101/2020.11.09.375394; this version posted November 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

5 (MDA5), Nuclear Factor Kappa-light-chain-enhancer of activated B (NF-κB), Protein Kinase interferon-inducible double-stranded RNA dependent Activator (PRKRA), TANK Binding Kinase 1 (TBK1), Retinoic Acid-Inducible Gene I (RIG-I), Signal Transducer and Activator of Transcription 1 (STAT1) and Signal Transducer and Activator of Transcrip- tion 2 (STAT2)). The viral proteins include homologs of S- protein, Membrane (M) protein, Nucleocapsid (N) protein, nsp1, nsp15, ORF3b, ORF4a, ORF4b, ORF6, and Papain- like protease (PLpro). There are four types of interactions between nodes: Protein-Protein Interactions, infection, be- longing, and similarity between protein homologs. Simi- larities between homologs were calculated using the Protein BLAST algorithm provided by NCBI. PPIs and infection re- lationships are gathered from academic publications (7–23).

SARS-CoV-2 – Host Interaction Predictions We applied IMSP on SARS-CoV-2 and other six human coronaviruses to get high-confidence predictions of PPIs and infections. Figure S1 shows the interactions between S pro- teins and ACE2. Figure S2 shows the interactions between virus proteins and IFN proteins in hosts. Figure S3 shows the network between the host group and the virus group in the Fig. 1. Infection Mechanism and Spectrum Prediction. (A) The virus-host inter- organism layer. The virus-host network’s complete informa- action network. Nodes represent proteins, viruses, and hosts; interactions repre- tion, built with two layers of data containing 246 nodes and sent relationships (i.e., PPI, infect, similarity, belong). The color of a node indicates its organism. The thickness of a similarity relationship indicates its level of similarity. 1,766 interactions, is displayed in Node Table S4 and Inter- For the full network, refer to the viral entry graph (Figure S3), interferon signaling action Table S5. All 65 infections and 457 PPIs predictions pathway graph (Figure S4), and infection graph (Figure S5). (B) IMSP learns a are shown in Table S6 and S7. representation for each potential interaction, which contains a structural embedding and a content embedding. The structural embedding captures the local structural features of an interaction. The content embedding captures the attributes that re- SARS-CoV-2 S-protein Binding Predictions veal biological aspects of an interaction. The representation of each interaction is The ability of SARS-CoV-2’s S-protein to bind to the host derived by concatenating its structural and content embeddings, where S stands for a structural embedding element, and C stands for a content embedding element. ACE2 receptors is a key factor deciding the infection ca- A Multi-layer Perceptron (MLP) is trained to classify the relationship between two pability of SARS-CoV-2. IMSP predicts that the S-protein nodes into PPI, infection, or no interaction. See the Methods section for calculat- of SARS-CoV-2 has a high probability of binding well with ing the structural and content embeddings. (C) Exemplar predicted interactions are highlighted and colored accordingly to their types. Existing interactions are dimmed. the ACE2 receptors in mice, rats, sheep, chimpanzees, cattle, squirrels, and horses (Figure2a). Mice and rats have been recognized to be susceptible to several other human coron- Results aviruses, such as, SARS-CoV (7), MERS-CoV (28), HCoV- Here, we explain the construction of our virus-host interac- OC43 (17) and HCoV-HKU1 (29, 30). The chimpanzee has tion network in detail, highlight the predicted interactions by the ACE2 with 99.38% similarity with human ACE2, which SARS-CoV-2, and present the overall performance evalua- has a high possibility to bind with the same S-protein. The tion of our interaction prediction model IMSP. overall similarity of ACE2 for the horse, squirrel, sheep, and cattle is 93.45%, 91.82%, 90.81%, and 90.56% compared to The Virus–Host Interaction Network human ACE2, respectively. These predictions still require The Virus-Host Interaction Network consists of two lay- more practical research is required to determine the bind- ers (an organism layer and a protein layer). The organ- ing affinity of these mammals’ ACE2 with SARS-CoV-2’s ism layer contains a set of viruses (including SARS-CoV-2, S-protein. It has been shown that ACE2 can tolerate up to SARS-CoV, HCoV-229E, HCoV-HKU1, HCoV-OC43, seven amino acids changes out of 20 critical ones that contact HCoV-NL63, and MERS-CoV) and a set of hosts (including with the S-protein, without losing the functionality as SARS- human, mouse, rat, dog, cat, camel, squirrel, cattle, chim- CoV-2’s receptor (31). This means that similarity might not panzee, red junglefowl, rabbit, horse, monkey, rat, sheep, be the only factor that influences the binding affinity between pig, and golden Syrian hamster). At the protein layer, we ACE2 and SARS-CoV-2’s S-protein. focus on proteins that are known to be involved in viral in- vasion as well as immune system response and suppression. SARS-CoV-2 and human interferon pathway interactome The network contains 17 host protein homolog groups ob- prediction tained from NCBI: ACE2, Dipeptidyl Peptidase 4 (DPP4), The Interferon (IFN) pathway plays a critical role in the hu- IRF3, IRF7, IRF9, Mitochondrial Anti-Viral-signaling pro- man immune response. After the virus infection is detected, tein (MAVS), Melanoma Differentiation-Associated protein the innate immune system will induce the IFN signaling, and bioRχiv | 2 H. Du & F. Chen et al. | bioRxiv preprint doi: https://doi.org/10.1101/2020.11.09.375394; this version posted November 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Fig. 2. Interaction prediction for SARS-CoV-2. (A) shows the known and predicted bindings between the S-protein in SARS-CoV-2 and ACE2 in mammalian hosts. Host names are displayed in their abbreviation form: Hom.Sap. is Homo sapiens; Mus.Mus. is Mus musculus; Fel.Cat. is Felis catus; Can.Lup. is Canis lupus familiaris; Ovi.Ari. is Ovis aries; Rat.Nor. is Rattus norvegicus; Mac.Mul. is Macaca mulatta; Rhi.Fer. is Rhinolophus ferrumequinum; Mes.Aur. is Mesocricetus auratus; Bos.Tau. is Bos taurus; Ict.Tri. is Ictidomys tridecemlineatus; Equ.Cab. is Equus caballus; Pan.Tro. is Pan troglodytes. (B)-(D) show the known and predicted interactions of PLpro, nsp15, ORF6, and N protein in SARS-CoV-2 with proteins in the human IFN signaling pathway that contribute to the IFN signaling pathway suppression. the expression of IFN genes will increase the cellular resis- tance to viral invasion. Viruses have developed various strate- gies to inhibit IFN signaling to facilitate a successful viral invasion (32). SARS-CoV and MERS-CoV have been stud- ied quite comprehensively in counteracting the IFN signaling responses compared with SARS-CoV-2. From IMSP, eight interactions between three SARS-CoV-2 proteins and human proteins in the innate immune pathway have been identified, shown in Figure 2(b)-(d). These interactions have a high probability of playing crucial roles in the SARS-CoV-2’s sup- pression of the host’s innate immune system. PLpro was predicted to interact with the nuclear factor kappa-light-chain-enhancer of activated B (NF-κB) in hu- mans, shown in Figure 2(b). In SARS-CoV (33) and MERS- CoV (34), PLpro have been identified as being capable of reducing the NF-κB reporter activity and further suppressing Fig. 3. Infection prediction for SARS-CoV-2.This figure shows SARS-CoV-2’s IRF3. From our prediction, it might also be highly possible known and predicted infected mammalian hosts. The predicted infected mammalian for SARS-CoV-2’s PLpro to have similar functions. hosts without a predicted or proven corresponding receptor binding relationship with ORF6 and nsp15 in SARS-CoV-2 have been discovered as SARS-CoV-2’s S-protein were excluded. crucial viral interferon antagonists of SARS-CoV-2. These two proteins inhibit the localization of IRF3 by interacting animals include mice, rats, sheep, cattle, and squirrels. with RIG-I (35), as a similar function is found for SARS-CoV Mice and rats are similar in every protein that we took ORF6 (36), shown in Figure 2(c)&(d). ORF6 in SARS-CoV into consideration in IMSP: all of the proteins have above has also been shown to interact with STAT1 and STAT2 (37). 85% similarity, and the receptor ACE2 even has 94% sim- From IMSP’s prediction, ORF6 and nsp15 in SARS-CoV- ilarity. Mouse has been identified as a host for all beta- 2 were suggested to have potential interactions with MDA5 coronaviruses: SASR-CoV (7), MERS-CoV (28), HCoV- and MAVS. Since MAVS works as the adaptor molecule for OC43 (17) and HCoV-HKU1 (29, 30). SARS-CoV-2 also MDA5 (38), it is possible that a viral protein interacts with ei- falls into the category of beta-coronavirus (39), which has a ther one of these two would also have the interaction with the high possibility to infect mice and rats. other. Besides, MAVS and MDA5, ORF6 was also predicted Cattle are hosts for both MERS-CoV (21) and a bovine to interact with STAT1 and STAT2, the same function as in coronavirus which is exceptionally similar to HCoV-OC43 the SARS-CoV. As nsp15 and ORF6 both function in nuclear (over 94% similarity) (16). This means that cattle can be the transport machinery after the viral entry (26), it is reasonable host for other coronaviruses. Cattle, along with sheep and that, for these two proteins, similar interactions with innate squirrels, are closely related to the human living environment immune pathways are predicted. Careful experiments should or daily diet. They could be potential mammalian hosts that be conducted to identify nsp15 and ORF6’s impact on the again transmit the virus back to human society. The investi- innate immune system. gation of these highly possible infections could help identify the virus’s path and further control the transmission SARS- SARS-CoV-2 infection prediction CoV-2 from mammalian hosts. Further research on these po- Based on the protein-level interaction prediction, we con- tential hosts might be crucial to social health and safety. cluded five predicted infections for SARS-CoV-2 that is highly possible, as these mammals are not only predicted to Interaction Prediction Evaluation be infected but also to have a successful SARS-CoV-2 virus We compared IMSP to five other representative baseline binding with the host receptors. As shown in Figure 3, these models on our dataset in a cross-validation experiment. We

H. Du & F. Chen et al. | bioRχiv | 3 bioRxiv preprint doi: https://doi.org/10.1101/2020.11.09.375394; this version posted November 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Fig. 4. Performance on PPI and Infection Predictions. This figure demonstrates IMSP’s performance on PPI and infection predictions in comparison with five other models. The ocean-blue columns represent the performance of our IMSP model derived from the average of 30 independent runs. As a result, the IMSP model achieved 0.808 for the Infection F1-score and 0.910 for the PPI F1-score. Compared to other models, our model outperformed other models in all the evaluation metrics. Specifically, in terms of infection F1-score, our model outperformed the second-best model SDNE (44) by 14.9%. In terms of PPI F1-score. Our model also surpassed the second-best model Deepwalk (41) by 6.6%. conducted 30 runs in total for the experiment. The base- training set and measure the test set’s interaction prediction line models include two famous random-walk based models performances. We repeated the entire measurement process (Deepwalk (41) and Node2vec (43)), two neural-network- for 30 runs to minimize the measurement randomness. Fi- based models (Large-scale Information Network Embed- nally, we performed two-sample heteroscedastic T-test at the ding (LINE) (42) and Structural Deep Network Embedding 0.01 significance level to test the significance of our model’s (SDNE) (44)), and a classical matrix-based model, Graph performance improvement compared to other models. Factorization (GF) (40). In each cross-validation run, we re- Table1 shows the performance comparison, measured in served 20% of each interaction type as the positive test set several commonly used evaluation metrics, between IMSP while ensuring the remaining network to be a connected one. and other existing tools. IMSP achieved an overall interac- To ensure input balance, we collected negative interactions tion prediction accuracy of 94.9% with a standard deviation (unconnected node pairs) to match the number of positive in- of 0.004, which demonstrated a 6.0% gain compared to the teractions in both training and testing sets. Firstly, we ex- second-best model. Our model also excelled in its weighted tracted known negative interactions as true negatives, which F1-score, achieving 0.944 with a standard deviation of 0.008, consisted of spike-receptor interactions that were demon- which transcended the second-best model by 7.3%. The p- strated as not existed, such as between SARS-CoV-2’s S- values for these two metrics were all smaller than 0.01, which protein and the host receptor DPP4 (the target host recep- indicated significant improvement for our model. In terms tor of MERS-CoV). We split known negative interactions by of the performance on specific types of interactions, i.e., in- a ratio of 8:2 to add to the training and testing sets. Then, fection and PPI predictions (See Fig. 4), IMSP achieved an since we still lack some number of negative interactions in F1-score of 0.808 with 0.119 standard deviation for infection both sets, we randomly selected unconnected node pairs as predictions, a 14.9% increase compared to the second-best negative interactions to add to the training and testing sets. model. The p-value was also smaller than 0.01, indicating a In this way, we confirmed that the training set’s composi- significant improvement in our model. For PPI predictions, tion was the same as the test set’s composition in terms of our model achieved an F1-score of 0.909 with 0.029 stan- interaction type. We then applied IMSP and other models dard deviation, a 6.6% increment compared to the second- on the training set to perform network learning on the same best tool. The p-value also demonstrated a significant im-

Model Accuracy Weighted Precision Weighted F1-score ROC Macro ROC Weighted GF (40) 0.885 ± 0.008 0.864 ± 0.009 0.872 ± 0.008 0.913 ± 0.009 0.934 ± 0.009 Deepwalk (41) 0.895 ± 0.006 0.871 ± 0.006 0.880 ± 0.005 0.906 ± 0.011 0.930 ± 0.009 LINE (42) 0.774 ± 0.011 0.760 ± 0.012 0.765 ± 0.012 0.884 ± 0.012 0.891 ± 0.010 Node2vec (43) 0.894 ± 0.006 0.853 ± 0.009 0.868 ± 0.007 0.807 ± 0.010 0.845 ± 0.010 SDNE (44) 0.823 ± 0.014 0.811 ± 0.015 0.815 ± 0.015 0.937 ± 0.011 0.947 ± 0.007 IMSP 0.949 ± 0.006 0.948 ± 0.006 0.944 ± 0.008 0.994 ± 0.002 0.993 ± 0.002

Table 1. Interaction Prediction Overall Performance Evaluation and Comparison. This table gives six evaluation metrics regarding interaction prediction. During performance measurement, we generated two independent sets, training and testing sets, from the network. We set the ratio between the training and the test set to be 8:2. We ensured that the composition of positive interactions in the training set was the same as the test set. To guarantee input balance, we gathered negative interactions (unconnected node pairs) to correspondingly match the number of positive interactions in both training and testing sets. Part of negative samples was extracted from known negative interactions (true negatives), which consisted of spike-receptor interactions that were demonstrated as not existed. The remaining part of negative samples was randomly selected from other unconnected node pairs which were assumed as not existed. To eliminate performance randomness, we executed the comparison algorithm by 30 runs and recorded each run’s performance. We generated new training and test sets during each run and evaluated all models on the same test set. Then, we performed two-sample heteroscedastic T-tests for these six overall performance evaluation metrics to test the significance of IMSP’s improvement. Lastly, we reported the average with standard deviation. bioRχiv | 4 H. Du & F. Chen et al. | bioRxiv preprint doi: https://doi.org/10.1101/2020.11.09.375394; this version posted November 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. provement for IMSP. In conclusion, our model showed sta- Notion Description Vi Node i in the network tistically significant improvements compared to all existing V The set of all Nodes models in all 12 evaluation metrics. Ii,j Edge (Interaction) between Node i and Node j E The set of all Edges (Interactions) in the network S IMSP’s high performance might result from its ability Ri Structure embedding vector for Node i C to take full advantage of well-studied knowledge and data Ri Content embedding vector for Node i CEi,j Content embedding vector for Interaction Ii,j from previous biology research with protein-level variations IEi,j Full Interaction embedding vector for Interaction Ii,j being considered. Thanks to our virus-host interaction net- Wi,j Edge (Interaction) weight between Node i and Node j EDi,j Euclidean distance between Node i and Node j work’s novel design, such cross-organism information and C C C C MD(Ri ,Rj ) Magnitude difference between Vector Ri and Rj C C multi-class linkage information can be well-preserved. An- TS-SSi,j TS-SS similarity between Vector Ri and Rj other reason behind IMSP’s performance improvement is that it factors essential biological metadata as textualized fea- Table 2. Notations. tures for nodes. In this way, two connected nodes’ vector- work would be built based on the information of the protein- ized content substantially helps the classifier output a cor- protein relations, protein-host relations, and host-host rela- rect predicted class when formulating interaction representa- tions. Based on these attributes in the network and currently tions. However, around 10% PPI predictions were unlikely known relationships between some of the proteins, IMSP by our definition, such as PPIs between S-protein and non- could predict the high-possibility interactions. We hope to receptor proteins. To minimize unlikely predictions, we in- use this pipeline as a tool to build a guideline for investigat- corporated the node’s function and existing biological meta- ing various similar viruses and their mechanisms with hosts. data as part of the node feature. This change significantly reduced unlikely PPI predictions to around 2%. We further pushed the boundaries by utilizing known negative interac- Methods tions in the protein layer to constitute part of the training and Infection Mechanism and Spectrum Prediction testing sets—this finally reduced the unlikely PPI predictions Our IMSP model requests three different inputs, including to less than 1%. pairwise similarity matrices for all host proteins and virus In conclusion, IMSP exhibited robust and stable per- proteins, a set of known PPIs and infection relationships, and formance in both top-level and detailed evaluation metrics, proteins’ biological metadata. For instance, appropriate pro- which is substantially improved compared to existing tools. tein metadata can include a protein’s name, layer, host, and When analyzing newly emerged viruses with limited avail- function. Given these three inputs, the model constructs a able information, namely SARS-CoV-2, IMSP can provide virus-host interaction network. Then, in the network learning reasonable and reliable predictions. phase, IMSP learns and combines the structural information and the content information of interactions to represent the Discussion heterogeneous network comprehensively. Lastly, in the pre- diction phase, IMSP trains a neural-network-based classifier, This study assembled 246 nodes (viruses, virus proteins, Multi-layer Perceptron classifier, along with strict filtering hosts, and host proteins) and 1,766 known interactions among procedures, to output high-possibility undiscovered interac- these nodes, including four types of interactions: similarities, tions. Specifically, our model generated PPI and infection infections, interactions, and belongings. Two main parts for predictions with each prediction’s likelihood. In the follow- virus-host protein-level interactions were taken into consid- ing, we elaborated on the details of our IMSP model’s two eration: viral entry and innate immune pathway interactions. main steps in terms of network construction and learning, and Accompanied by these two parts, we predicted the poten- interaction prediction with result validation. tial host for viruses and undiscovered novel PPIs. Among seven human coronaviruses, SARS-CoV and MERS-CoV Virus-host interaction network construction and were well studied about their binding with the receptors and learning interacting with the innate immune pathway, whereas HCoV- We utilized layers and groups to mark and sepa- OC43, HCoV-NL63, HCov-HKU1, HCoV-229E, and newly rate different nodes during virus-host interaction net- emerged SARS-CoV-2 remained relatively unknown. Based work construction, and used interactions to represent on protein similarity and known interactions, 322 PPIs were PPI/infection/similarity/belonging relationship between predicted with this model. The actual PPIs need to be inves- nodes. For modeling the network, we constructed a weighted tigated further based on the prediction result. and undirected heterogeneous network using NetworkX (45). Established discoveries about SARS-CoV-2’s viral inter- The network contained four groups of nodes: host, host action with host proteins are scarce. SARS-CoV-2 has been proteins, virus proteins, and virus. We organize the virus highly suspected of suppressing the innate immune response group and the host group into the same layer, i.e., the and reducing the production of IFN. In this way, IMSP’s find- organism layer, and we put two groups of proteins into the ings could help discover the relations between virus invasion protein layer. By nature, the network contained four types and host response and therefore provide clues to developing of interactions: PPI (between virus proteins groups and therapeutic strategies for the treatment of this disease. host proteins groups), infection (between viruses group and More broadly, IMSP could be applied to any other analysis hosts group), similarity (between homologs of proteins) and of the virus-host interaction network predictions. This net-

H. Du & F. Chen et al. | bioRχiv | 5 bioRxiv preprint doi: https://doi.org/10.1101/2020.11.09.375394; this version posted November 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

belonging (between organism layer and protein layer). Simi- walk with c0 denoting the starting node of the current random larity and belonging relationships were innately connected. walk. Nodes cm are generated by the following distribution: Other types of interactions were connected based on proven  molecular level knowledge from existing research (7–23). πVj ,Vi /Z if Ii,j ⊂ E P (cm = Vi | cm−1 = Vj) = , (5) The virus-host interaction network comprised of 246 nodes 0 otherwise and 1,766 interactions. After the network structure was built, where m 1, Z is the normalizing constant, and πV ,V we used Wi,j to represent the biological closeness between > j i is the unnormalized transition probability between Vi and Vi and Vj. Specifically, if Ii,j denoted the relationship V , which is calculated as π = α (V ,V ) · W . between two protein homologs, we used the full-sequence j Vj ,Vi pq t i i,j αpq(Vt,Vi), termed as search bias, is calculated as: similarity between Vi and Vj from pairwise similarity matrices to represent W . For other types of relationship, i,j 1/p if d = 0 we obtained the embeddings of node content, denoted as RC  Vt,Vi i α (V ,V )) = 1 if d = 1 , (6) for V by passing textualized node content to the Word2vec pq t i Vt,Vi i 1/q if d = 2 model (46), a famous and powerful neural-network-based Vt,Vi word embedding model. Then, we utilized the TS-SS similar- where dV ,V denotes the shortest path between Vi and Vj, ity metric (47), a robust and reliable similarity measurement t i assuming that we have just traversed from Vt to Vi and are in the field of textual mining, to calculate the similarity now evaluating the transition probability leaving V . C C i between Ri and Rj to assign to Wi,j. Precisely, for In Eq. (6), p and q are Node2vec’s hyperparameters that interactions other than similarity relations, we calculated the can be adjusted to influence the probability to go back to Vi TS-SS for all i and j if I ⊂ E as the following: i,j i,j after visiting Vj and the probability to explore undiscovered components of the network. In this way, we were able to fine- C C 0 0 TS-SSi,j = (|Ri | · |Rj | · sin(θ ) · θ · π· tune the structural embedding model, Node2vec, to generate (1) C C C C 2 a conclusive embedding for our network. (ED(Ri ,Rj ) + MD(Ri ,Rj )) /720, Secondly, to create content representations, CEi,j for all C C Ii,j in the network, we combined the textualized node con- where MD(Ri ,Rj ) (47) is defined as the magnitude differ- tent of V and V along with expected interaction type such ence between RC and RC , which is calculated as: i j i j as interacts/infects/similar/belongs, and input them into the v v Word2vec model (46) to generate vectorized interaction rep- udim RC udim RC u i u j resentations. Then, the full interaction representations for in- C C u X 2 u X 2 MD(R ,R ) = t RC − t RC , (2) teraction Ii,j are formulated as follows: i j i j n=1 n=1 S S IEi,j = [Ri ,Rj ,CEi,j], (7) S S and θ0 is defined as: IEj,i = [Rj ,Ri ,CEj,i].

0 −1 C C Note that by Word2vec’s nature, the input documents’ order θ = cos (cos(Ri ,Rj )) + 10. (3) does not affect its output, meaning that CEi,j is the same Note that θ0 is increased by 10 degrees to overcome the prob- as CEj,i. By finishing this step, we obtained all interaction representations, IEi,j, for all Vi and Vj ⊂ V and i 6= j. lem of overlapping vectors. Then, Wi,j is calculated as: Virus-host interaction prediction Wi,j = σ(T S-SSi,j/TS-SS), (4) In the interaction prediction phase, we utilized a classi- where σ is the sigmoid function, and TS-SS denotes the av- fier to perform a multi-class classification task on the in- teraction representations to learn from the existing network erage of TS-SSi,j, for all i,j, if i 6= j and Vi,Vj ⊂ V . For network learning, we captured the network’s hetero- and suggest potential undiscovered interactions. Specifically, geneity by adding the network’s heterogeneous content in- for all Ii,j 6⊂ E, we classified IEi,j into three classes, i.e. formation to its structural information. Specifically, we per- PPI/infections/no-interaction. Scikit-learn’s Multi-layer Per- formed network structural embedding assuming the network ceptron (MLP) classification model (48) was used for this is homogeneous. Then, we added the content embedding on multi-class interaction classification. top of structural embedding to model its heterogeneity. We conducted cross-validation experiments to fine-tune the Firstly, for network structural embedding, we used a power- model, where 20% of each interaction type was reserved as a ful network feature learning model, Node2vec (43), to learn positive test set. In this process, we also balanced the test set the structural embedding for nodes. Node2vec is a state- by inserting an equal number of negative interactions contain- of-the-art model for homogeneous network embedding. We ing both known negatives and unknown negatives. Specifi- took full advantage of Node2vec’s biased searching algo- cally, known negative interactions (true negatives) were gen- rithm during our application. Precisely, the Node2vec model erated from proven unconnected node pairs, such as the in- performed a fixed-length random walk for sampling, which teraction between MERS-CoV’s S-protein and mammalian hosts’ ACE2 receptors. Unknown negative interactions were takes weight into account. Let cm denote the m-th node in randomly sampled from the network, which was assumed as bioRχiv | 6 H. Du & F. Chen et al. | bioRxiv preprint doi: https://doi.org/10.1101/2020.11.09.375394; this version posted November 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

not existing. Then, by using the fine-tuned hyperparameters 2. World Health Organization et al. Coronavirus disease (covid-19) situation dashboard. World from cross-validation experiments, all positive interactions Health Organization website. https://who. sprinklr. com/. Accessed April, 17, 2020. 3. Thomas HC Sit, Christopher J Brackman, Sin Ming Ip, Karina WS Tam, Pierra YT Law, with an equal number of negative interactions were used for Esther MW To, Veronica YT Yu, Leslie D Sims, Dominic NC Tsang, Daniel KW Chu, et al. training the classifier. IMSP later took all other unused un- Infection of dogs with sars-cov-2. Nature, pages 1–6, 2020. 4. Peter J Halfmann, Masato Hatta, Shiho Chiba, Tadashi Maemura, Shufang Fan, Makoto connected node pairs as input data and classified them into Takeda, Noriko Kinoshita, Shin-ichiro Hattori, Yuko Sakai-Tagawa, Kiyoko Iwatsuki- the class of PPI/infection/no interaction. Horimoto, et al. Transmission of sars-cov-2 in domestic cats. New England Journal of Medicine, 2020. During prediction result correction, we defined rules from 5. Leyi Wang, Patrick K Mitchell, Paul P Calle, Susan L Bartlett, Denise McAloose, Mary Lea both computational and biological perspectives to mark and Killian, Fangfeng Yuan, Ying Fang, Laura B Goodman, Richard Fredrickson, et al. Complete genome sequence of sars-cov-2 in a tiger from a us zoological collection. remove unlikely predictions. Computationally, because there Resource Announcements, 9(22), 2020. exist two representations for I , i.e., IE and IE , we 6. Sin Fun Sia, Li-Meng Yan, Alex WH Chin, Kevin Fung, Ka-Tim Choy, Alvina YL Wong, i,j i,j j,i Prathanporn Kaewpreedee, Ranawaka APM Perera, Leo LM Poon, John M Nicholls, et al. defined the prediction for Ii,j as a strong one if and only if Pathogenesis and transmission of sars-cov-2 in golden hamsters. Nature, 583(7818):834– IE and IE were all classified into the same type of in- 838, 2020. i,j j,i 7. Rudragouda Channappanavar, Anthony R Fehr, Rahul Vijay, Matthias Mack, Jincun Zhao, teraction other than the no-interaction. Biologically, we as- David K Meyerholz, and Stanley Perlman. Dysregulated type i interferon and inflammatory sumed that each virus’s S-protein only interacts/binds with monocyte-macrophage responses cause lethal pneumonia in sars-cov-infected mice. Cell host & microbe, 19(2):181–193, 2016. its target receptor. Thus, all other interactions with S-protein 8. Matthew Frieman, Mark Heise, and Ralph Baric. Sars coronavirus and innate immunity. are considered unlikely. For example, the predicted interac- Virus research, 133(1):101–112, 2008. 9. Krystal Matthews, Alexandra Schäfer, Alissa Pham, and Matthew Frieman. The sars coron- tion, if any, between SARS-CoV-2’s S-protein and DPP4 (the avirus papain like protease can inhibit irf3 at a post activation step that requires deubiquiti- target receptor for MERS-CoV’s S-protein) would be elimi- nation activity. journal, 11(1):209, 2014. 10. Allison L Totura and Ralph S Baric. Sars coronavirus pathogenesis: host innate immune nated. Thirdly, we also defined that an infection interaction responses and viral antagonism of interferon. Current opinion in virology, 2(3):264–275, was likely only when the corresponding interaction between 2012. 11. Yong Hu, Wei Li, Ting Gao, Yan Cui, Yanwen Jin, Ping Li, Qingjun Ma, Xuan Liu, and Cheng the virus’ S-protein and an appropriate host receptor was pre- Cao. Sars coronavirus nucleocapsid inhibits type i interferon production by interfering with dicted. In this way, we ensured that all the remaining predic- trim25-mediated rig-i ubiquitination. Journal of Virology, 2017. 12. Xiaojuan Chen, Xingxing Yang, Yang Zheng, Yudong Yang, Yaling Xing, and Zhongbin tions are both computationally and biologically meaningful Chen. Sars coronavirus papain-like protease inhibits the type i interferon signaling path- and practically useful to guide further research. way through interaction with the sting-traf3-tbk1 complex. Protein & cell, 5(5):369–381, 2014. 13. Chris Ka-fai Li and Xiaoning Xu. Host immune responses to sars coronavirus in humans. In Resource Availability Molecular biology of the SARS-coronavirus, pages 259–278. Springer, 2010. 14. Wenhui Li, Michael J Moore, Natalya Vasilieva, Jianhua Sui, Swee Kee Wong, Michael A All data and codes are available at Github repositories: Berne, Mohan Somasundaran, John L Sullivan, Katherine Luzuriaga, Thomas C Gree- IMSP model, its predictions and performance evaluations can nough, et al. Angiotensin-converting enzyme 2 is a functional receptor for the sars coron- avirus. Nature, 426(6965):450–454, 2003. be found at 15. Yu Lei, Chris B Moore, Rachael M Liesman, Brian P O’Connor, Daniel T Bergstralh, Zhijian J https://github.com/SupremeEthan/IMSP; Chen, Raymond J Pickles, and Jenny P-Y Ting. Mavs-mediated apoptosis and its inhibition by viral proteins. PloS one, 4(5):e5466, 2009. Data and data parsing code can be found at 16. Artur Szczepanski, Katarzyna Owczarek, Monika Bzowska, Katarzyna Gula, Inga Drebot, https://github.com/SupremeEthan/IMSP-Parser. Marek Ochman, Beata Maksym, Zenon Rajfur, Judy A Mitchell, and Krzysztof Pyrc. Canine respiratory coronavirus, bovine coronavirus, and human coronavirus oc43: receptors and attachment factors. Viruses, 11(4):328, 2019. 17. Junwei Niu, Liang Shen, Baoying Huang, Fei Ye, Li Zhao, Huijuan Wang, Yao Deng, and Acknowledgement Wenjie Tan. Non-invasive bioluminescence imaging of hcov-oc43 infection and therapy in the central nervous system of live mice. Antiviral research, 173:104646, 2020. This work was partially supported NSF OAC 1920147. 18. Heike Hofmann, Krzysztof Pyrc, Lia Van Der Hoek, Martina Geier, Ben Berkhout, and Ste- fan Pöhlmann. Human coronavirus nl63 employs the severe acute respiratory syndrome coronavirus receptor for cellular entry. Proceedings of the National Academy of Sciences, 102(22):7988–7993, 2005. Authors’ Contributions 19. Bart L Haagmans, Said HS Al Dhahiry, Chantal BEM Reusken, V Stalin Raj, Monica Galiano, Richard Myers, Gert-Jan Godeke, Marcel Jonges, Elmoubasher Farag, Ayman Conceptualization, P.H. and H.L.; Methodology, H.D., F.C., Diab, et al. Middle east respiratory syndrome coronavirus in dromedary camels: an out- and H.L.; Software, H.D.; Formal Analysis: H.D.; Investiga- break investigation. The Lancet infectious diseases, 14(2):140–145, 2014. 20. Huihui Mou, V Stalin Raj, Frank JM Van Kuppeveld, Peter JM Rottier, Bart L Haagmans, tion, F.C. and H.D.; Writing – Original Draft, F.C. and H.D.; and Berend Jan Bosch. The receptor binding domain of the new middle east respiratory Writing – Review & Editing, all authors; Visualization: F.C. syndrome coronavirus maps to a 231-residue region in the spike protein that efficiently elicits neutralizing antibodies. Journal of virology, 87(16):9379–9383, 2013. and H.D.; Supervision, H.L. and P.H. 21. Ahmed Kandeil, Mokhtar Gomaa, Mahmoud Shehata, Ahmed El-Taweel, Ahmed E Kayed, Awatef Abiadh, Jamel Jrijer, Yassmin Moatasim, Omnia Kutkat, Ola Bagato, et al. Middle east respiratory syndrome coronavirus infection in non-camelid domestic mammals. Emerg- Lead Contact ing microbes & infections, 8(1):103–108, 2019. 22. Ben A Bailey-Elkin, Robert CM Knaap, Garrett G Johnson, Tim J Dalebout, Dennis K Nin- Further information and requests for code and data should be aber, Puck B Van Kasteren, Peter J Bredenbeek, Eric J Snijder, Marjolein Kikkert, and Brian L Mark. Crystal structure of the middle east respiratory syndrome coronavirus (mers- directed to and will be fulfilled by the Lead Contact, Pengyu cov) papain-like protease bound to ubiquitin facilitates targeted disruption of deubiquitinating Hong ([email protected]). activity to demonstrate its role in innate immune suppression. Journal of Biological Chem- istry, 289(50):34667–34682, 2014. 23. Pak-Yin Lui, Lok-Yin Roy Wong, Cheuk-Lai Fung, Kam-Leung Siu, Man-Lung Yeung, Kit- San Yuen, Chi-Ping Chan, Patrick Chiu-Yat Woo, Kwok-Yung Yuen, and Dong-Yan Jin. Mid- Declaration of Interests dle east respiratory syndrome coronavirus m protein suppresses type i interferon expression through the inhibition of tbk1-dependent phosphorylation of irf3. Emerging microbes & in- The authors declare no competing interests. fections, 5(1):1–9, 2016. 24. Haibo Zhang, Josef M Penninger, Yimin Li, Nanshan Zhong, and Arthur S Slutsky. Angiotensin-converting enzyme 2 (ace2) as a sars-cov-2 receptor: molecular mechanisms References and potential therapeutic target. Intensive care medicine, 46(4):586–590, 2020. 25. Jian Shang, Gang Ye, Ke Shi, Yushun Wan, Chuming Luo, Hideki Aihara, Qibin Geng, Ash- 1. Qun Li, Xuhua Guan, Peng Wu, Xiaoye Wang, Lei Zhou, Yeqing Tong, Ruiqi Ren, Kathy SM ley Auerbach, and Fang Li. Structural basis of receptor recognition by sars-cov-2. Nature, Leung, Eric HY Lau, Jessica Y Wong, et al. Early transmission dynamics in wuhan, china, 581(7807):221–224, 2020. of novel coronavirus–infected pneumonia. New England Journal of Medicine, 2020. 26. David E Gordon, Gwendolyn M Jang, Mehdi Bouhaddou, Jiewei Xu, Kirsten Obernier,

H. Du & F. Chen et al. | bioRχiv | 7 bioRxiv preprint doi: https://doi.org/10.1101/2020.11.09.375394; this version posted November 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Kris M White, Matthew J O’Meara, Veronica V Rezelj, Jeffrey Z Guo, Danielle L Swaney, et al. A sars-cov-2 protein interaction map reveals targets for drug repurposing. Nature, pages 1–13, 2020. 27. Francesco Messina, Emanuela Giombini, Chiara Agrati, Francesco Vairo, Tommaso As- coli Bartoli, Samir Al Moghazi, Mauro Piacentini, Franco Locatelli, Gary Kobinger, Markus Maeurer, et al. Covid-19: viral–host interactome analyzed by network based-approach model to study pathogenesis of sars-cov-2 infection. Journal of Translational Medicine, 18(1):1–10, 2020. 28. Adam S Cockrell, Boyd L Yount, Trevor Scobey, Kara Jensen, Madeline Douglas, Anne Beall, Xian-Chun Tang, Wayne A Marasco, Mark T Heise, and Ralph S Baric. A mouse model for mers coronavirus-induced acute respiratory distress syndrome. Nature microbi- ology, 2(2):1–11, 2016. 29. Geoffrey J Gorse, Theresa Z O’Connor, Susan L Hall, Joseph N Vitale, and Kristin L Nichol. Human coronavirus and acute respiratory illness in older adults with chronic obstructive pulmonary disease. The Journal of infectious diseases, 199(6):847–857, 2009. 30. Yvonne Xinyi Lim, Yan Ling Ng, James P Tam, and Ding Xiang Liu. Human coronaviruses: a review of virus–host interactions. Diseases, 4(3):26, 2016. 31. Xiaofeng Zhai, Jiumeng Sun, Ziqing Yan, Jie Zhang, Jin Zhao, Zongzheng Zhao, Qi Gao, Wan-Ting He, Michael Veit, and Shuo Su. Comparison of severe acute respiratory syn- drome coronavirus 2 spike protein binding to ace2 receptors from human, pets, farm ani- mals, and putative intermediate hosts. Journal of virology, 94(15), 2020. 32. Katharina S Schulz and Karen L Mossman. Viral evasion strategies in type i ifn signaling–a summary of recent developments. Frontiers in immunology, 7:498, 2016. 33. Matthew Frieman, Kiira Ratia, Robert E Johnston, Andrew D Mesecar, and Ralph S Baric. Severe acute respiratory syndrome coronavirus papain-like protease ubiquitin-like domain and catalytic domain regulate antagonism of irf3 and nf-κb signaling. Journal of virology, 83(13):6689–6705, 2009. 34. Anna M Mielech, Andy Kilianski, Yahira M Baez-Santos, Andrew D Mesecar, and Susan C Baker. Mers-cov papain-like protease has deisgylating and deubiquitinating activities. Virol- ogy, 450:64–70, 2014. 35. Chun-Kit Yuen, Joy-Yan Lam, Wan-Man Wong, Long-Fung Mak, Xiaohui Wang, Hin Chu, Jian-Piao Cai, Dong-Yan Jin, Kelvin Kai-Wang To, Jasper Fuk-Woo Chan, et al. Sars-cov-2 nsp13, nsp14, nsp15 and orf6 function as potent interferon antagonists. Emerging Microbes & Infections, pages 1–29, 2020. 36. Sarah A Kopecky-Bromberg, Luis Martínez-Sobrido, Matthew Frieman, Ralph A Baric, and Peter Palese. Severe acute respiratory syndrome coronavirus open reading frame (orf) 3b, orf 6, and nucleocapsid proteins function as interferon antagonists. Journal of virology, 81 (2):548–557, 2007. 37. Matthew Frieman, Boyd Yount, Mark Heise, Sarah A Kopecky-Bromberg, Peter Palese, and Ralph S Baric. Severe acute respiratory syndrome coronavirus orf6 antagonizes stat1 func- tion by sequestering nuclear import factors on the rough endoplasmic reticulum/golgi mem- brane. Journal of virology, 81(18):9812–9824, 2007. 38. Bin Wu and Sun Hur. How rig-i like receptors activate mavs. Current opinion in virology, 12: 91–98, 2015. 39. Roujian Lu, Xiang Zhao, Juan Li, Peihua Niu, Bo Yang, Honglong Wu, Wenling Wang, Hao Song, Baoying Huang, Na Zhu, et al. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. The Lancet, 395 (10224):565–574, 2020. 40. Amr Ahmed, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and Alexan- der J Smola. Distributed large-scale natural graph factorization. In Proceedings of the 22nd international conference on World Wide Web, pages 37–48. ACM, 2013. 41. Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 701–710, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2956-9. doi: 10.1145/2623330.2623732. 42. Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large- scale information network embedding. In WWW. ACM, 2015. 43. Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864, 2016. 44. Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In Proceed- ings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 1225–1234, New York, NY, USA, 2016. ACM. ISBN 978-1-4503- 4232-2. doi: 10.1145/2939672.2939753. 45. Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Exploring network structure, dynam- ics, and function using networkx. In Gaël Varoquaux, Travis Vaught, and Jarrod Millman, editors, Proceedings of the 7th Python in Science Conference, pages 11 – 15, Pasadena, CA USA, 2008. 46. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. 47. A. Heidarian and M. J. Dinneen. A hybrid geometric approach for measuring similarity level among documents and document clustering. In 2016 IEEE Second International Confer- ence on Big Data Computing Service and Applications (BigDataService), pages 142–151, Los Alamitos, CA, USA, apr 2016. IEEE Computer Society. doi: 10.1109/BigDataService. 2016.14. 48. Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Lay- ton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122, 2013.

bioRχiv | 8 H. Du & F. Chen et al. |