arXiv:2107.09217v1 [cs.LG] 20 Jul 2021 rtisadtedrcl eae rtismpt h P net PPI the to map a proteins et related Zhou the directly in the h proposed and HCoVs utilize proteins the we protein-relate to Here, related HCoVs proteins discovering target Figure.1. for in method learning shown deep is framework method The oietf rg htmyb eae oHCoVs. to related be may propo that algorithm drugs learning con identify deep to network a apply heterogeneous HCoVs and a the proteins, construct on related based we study, studies, this mod existing In disease coronavirus proteins. cand the HCoVs-related drug disrupt of identify to may is or disease-related goal proteins Our target related (2020). directly al. et not Gysi do networks fu spreading drugs from approved virus effectiv the Most develop prevent treatme to to need outbreak and COVID-19 urgent prevention the an develop is o to There the hard millio drugs. 2021, effective two working 20, and are June infected world By people the million coronavirus. hundred of one kind than a more also is which 19, n otlt olwd alse l 22) n21,athir a 2019, c In are (2020). 2012, al. in et Paules East Respir worldwide E Middle mortality Acute and Middle the and swept Severe which 2003, the (MERS-CoV), in including avirus losses severe HCoVs caused R which single-stranded genome. (SARS-CoV), enveloped the of in type known a is (CoVs) Coronavirus cetda okhpppra CR22 MLPCP 2021 ICLR at paper workshop as Accepted . M METHODS 2.1 AND MATERIALS 2 I 1 OIGFOR POSING H htn Jin Shuting 1 2 [email protected] { 3 { eateto optrSine imnUiest,Xiamen University, Xiamen Science, Computer of Department ainlIsiuefrDt cec nHat n Medicine, and Health in Science Data for Institute National colo nomto cec n niern,HnnUnive Hunan Engineering, and Science Information of School xzeng,slpeng stjin,huangwii,xiafeng,czjiang TRGNOSNETWORK ETEROGENEOUS NTRODUCTION ETHODS oadftr lncltil o OI-9 h oeaddat and code The COVID-19. for https://github.com/stjin-XMU/HnDR-COVID. trials clinical candidat identify future quickly network- toward to beneficial heterogeneous be powerful may which a relat method, utilizes COVID-19 work for hig this effective obtain mary, drugs We possible COVID-19. the for predicting candidates drug propo potential previously tre cover the use for and n shortcut proteins heterogeneous target best comprehensive HCoVs-related a the constructed be we co w may Therefore, Compared human repurposing world. to drug the belongs velopment, around rapidly (COVID-19) spreads 2019 which Disease (HCoVs), Virus Corona The 1,2 inxagZeng Xiangxiang , } @hnu.edu.cn COVID-19 3 e Huang Wei , A BSTRACT } 1 @stu.xmu.edu.cn 1 egXia Feng , - BASED rvninadtetetsrtge for strategies treatment and prevention e ahgncHo a ae COVID- named was HCoV pathogenic d e ypeiu okZn ta.(2020) al. et Zeng work previous by sed l ewr ytreigteneighbors the targeting by network ule rg.Seicly )Bsdo the on Based 1) Specifically, drugs. d dtsta a ietytre HCoVs- target directly may that idates rtis u idpoen ertheir near proteins bind but proteins, etn rg,dsae,adHCoVs- and diseases, drugs, necting okfrcluain ae nthese on Based calculation. for work 1 AvrsadtelretRAvirus RNA largest the and virus NA rther. around Organizations died. people n imnUiest,Xae 605 China 361005, Xiamen University, Xiamen .Zo ta.(00,w eanthese retain we (2020), al. et Zhou l. hnziJiang Changzhi , rltdtre rtispooe by proposed proteins target -related tfrCVD1,btteeaeno are there but COVID-19, for nt st,Cagh 102 China 410082, Changsha rsity, s eprtr ydoeCoron- Syndrome Respiratory ast tra fCVD1 a caused has COVID-19 of utbreak rnvrsswt ihmorbidity high with oronaviruses 605 China 361005, trgnosntokbsdon based network eterogeneous e epTe,t dis- to deepDTnet, sed tr ydoeCoro-navirus Syndrome atory dpoen.I sum- In proteins. ed eupsbedrugs repurposable e tokbsdo the on based etwork ae eplearning deep based D tn COVID-19. ating r vial at available are a efrac in performance h t e rgde- drug new ith RUG ronaviruses 1 inrn Liu Xiangrong , R EPUR - 1 hoin Peng Shaoliang & 3 Accepted as workshop paper at ICLR 2021 MLPCP

Table 1: Materials of networks Node1 Node2 Edges

Protein-protein 5390 5390 136763 Protein-disease 5390 3087 85018 Drug-protein 2953 5390 5278 Drug-drug 2953 2953 220736 Drug-disease 2953 3087 6677 4 protein similarity networks 8 drug similarity networks

HCoVs Drugs Diseases HCoVs-related targets (proteins)

Construct networks D : Low dimensional vector representation Drug-drug Encoder i Decoder for each drug vertex PCO Matrix PPMI Matrix D f Drug similarities: Chemical, Therapeutic , Protein sequence, D : Dimension of drug features f Biological process , Cellular component, D : Number of drugs Molecular function, Side effect, CoExpression ...... n HCoVs-host protein ...... D D n 1 Drug-protein Drug similarity

X X 1 S =D WH TP T L P ij i j Drug-disease Drug similarity f PU-learning Mapping DWHTP matrix completion T 1 Jaccard similarity D L S Known drug-protein network ɍ f W H

1 Protein-disease Protein similarity Encoder PCO Matrix PPMI Matrix Decoder 1 P Protein-drug Protein similarity n P : Dimension of protein features f PPIs ...... P : Number of proteins ...... T P n ...... P f Protein-protein associations

Protein similarities: P : Low dimensional vector representation Biological process, Cellular component, j X Protein sequence, Molecular function X for each protein vertex

Figure 1: A diagram illustrates the pipeline of the entire method. The method embeds the 17 types of chemical, genomic, phenotypic, and cellular networks and applies deepDTnet to predict identify candidate drugs toward COVID-19.

proteins, we integrate 17 networks including drug-protein, drug-disease, drug-drug, protein-disease, protein-protein associations network from existing related databases. 2) There are two main types of networks in this experiment: homogeneous interaction networks and heterogeneous interaction networks. For heterogeneous interaction networks, we use the Jaccard similarity coefficient to cal- culate the similarity network. 3) We apply deepDTnet Zeng et al. (2020) algorithm to infer whether two drugs share a target and predict novel drug-protein interactions, and find possible drugs that are effective for COVID-19 related proteins (see Methods).

2.1.1 CONSTRUCTING NETWORK MATERIALS

The materials of these networks are summarized in Table 1. Among them, drug-drug, drug-disease and 8 drug similarity networks directly use the data of deepDTnet work. For the rest of the target protein-related data, based on the HCoVs-related proteins Zhou et al. (2020), we mapped these pro- teins to the directly related proteins in the PPIs network, and retained these proteins to construct protein related networks.

2.1.2 CONSTRUCTIONOFSIMILARITYNETWORKS

The Jaccard similarity coefficient is a very popular statistical method for calculating the similarity between two sets. Suppose we have two sets A and B. The Jaccard similarity Niwattanakul et al. (2013) of these two sets is:

|A ∩ B| Sim(A, B)= (1) |A ∪ B|

2 Accepted as workshop paper at ICLR 2021 MLPCP

2.1.3 DRUG-PROTEIN ASSOCIATIONS PREDICTION For a certain network, the randomsurfing modelis used to extract network co-occurrence probability information and generate a corresponding probabilistic co-occurrence (PCO) matrix.After yielding the PCO matrix, we calculate a shifted PPMI matrix inspired by Bullinaria & Levy (2007). Then, we use SDAE, an advanced dimensionality reduction technology,which can help us to capture important topology information in the network and obtain better low-dimensional vectors through the learning of the PPMI matrix. The optimization process of SDAE is as follows:

2 2 min ||x − xˆ||F + λ X ||Wl||F (2) {Wl},{bl} t

Where l is the number of layers, Wl is the weight of the l layer, and bl is a biased vector of the l layer. λ is a regularization parameter and || · ||F denotes the Frobenius norm. The entire SDAE has a total of L layers, of which the first half is the encoder and the second half is the decoder, as shown in Fig.1. Finally, we use PU learning for matrix completion Hsieh et al. (2015). In detail, we make the drug- protein interaction network I. The observed drug-proteininteraction label is 1, while the unobserved is 0. After obtaining the characteristics W and H of the drug and target in the previous process, we use I to perform matrix completion on WHT . The specific optimization function is as follows:

T T 2 min (Iij − DiWH Pj ) + W,H X T T 2 2 2 (3) α X (Iij − DiWH Pj ) + λ(||W ||F + ||H||F ) (i,j)∈Ω−

Where Ω+ denotes the known response in the sample and Ω− denotes the missing term selected as the negative samples. Finally, we can approximate the probability of drug and target response:

T T Score(i, j)= DiWH Pj (4)

2.2 VERIFICATION BASED ON HETEROGENEOUS NETWORK RESULTS

To evaluate the performance, we firstly perform 5-fold cross-validation on the drug-protein predic- tion results. To reduce the data bias of cross-validation, it was repeated 10 times and the average performance was computed. The AUROC value was approximately 0.941 (Figure.2(a)) and the AUPR value was approximately 0.954 (Figure.2(b)). It proves the feasibility of the algorithm. Because our work is mainly to discover drugs related to COVID-19 proteins, we verified the pre- dicted drug-protein of the 119 HCoVs-host proteins summarized in the research by Zhouet al. (2020). We screen out the results of the drug and protein that predicted by the 119 HCoVs-host proteins and rank the correlation of 2953 approved drugs in the data based on the predicted scores. Then the 45 drugs from COVID-19 related therapeutic in clinical trials were used as positive sam- ples to verify the external data set (see Supplementary data Table 1). The obtained AUROC value was about 0.785 (Figure.2(a)) and the obtained AUPR value was about 0.803 (Figure.2(b)).

2.3 DISCOVERY OF DRUG REPURPOSING CANDIDATES FOR COVID-19

In addition to the assessment of the accuracy of the prediction results, we also analyze the drug- protein associations of the top 100 in the prediction ranking of the 119 HCoVs-host proteins (see Supplementary data Table 2) and draw a network graph (Figure.3). We find that Ruxolitinib and Imatinib in the clinical trial stage appeared in the network graph of Figure.3. To find drugs that may be relevant in addition to clinical trials, we have researched in the literature on the prediction of the top 50 drugs (see Table Supplementary data Table 3), and find that 36 drugs have been mentioned in the past studies that may have a role in the treatment of coronavirus-related diseases. These drugs are likely to play a role in COVID-19. Tamatinib, Sunitinib, Ruxolitinib, Alvocidib, Crizotinib, Dasatinib, Vandetanib among the top 10 drugs have also been found related to coronavirus through searches in related literature. We plot the most relevant top 20 proteins of these drugs in the network

3 Accepted as workshop paper at ICLR 2021 MLPCP

(a) 1 (b) 1

0.8 0.8

0.6 0.6

0.4 Precision 0.4 True Positive Rate Positive True

0.2 0.2

AUROC_5-fold = 0.941 AUPR_5-fold = 0.954 AUROC_external = 0.785 AUPR_external = 0.803 0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 False Positive Rate Recall

Figure 2: Performance of the method was assessed by 5-fold cross-validation and the external vali- dation set.

CDK2 MAPK14 CDK4 SRC MAPK14 GSK3B PRKCQ SRC INSR INSR LCK KDR CCNA2 MAPK10 MAPK10 CDK5 BRD4 PDPK1 KDR Vandetanib PDPK1 CDK4 CDK5 Sunitinib GSK3B PRKCQ BRD4 LCK MAPK11 PIK3CA MAPK11 CDK6 CDK2 CDK6 PIK3CA EPHX2 CCNA2 CDK1 DB07235 Tandutinib CDK1 AT-7519 CDK9 EPHX2 CDK9 Sunitinib Bosutinib AMN107 Staurosporine PLX-4720 CHMP4B MKRN2 H-89 MIF4GD MAPK14 Ellagic acid BTF3 CHMP2B PRKCQ CDK4 sorafenib CDK1 MAPK14 STAT5A SNAP47 HGS CDK4 GSK3B TPSAB1 Wortmannin DCTN2 MKRN3 CDK1 KPNA2 RO-4584820 MAPK10 PIK3CA CLEC4M CDK5 KDR INSR KPNB1 PSMC2 EPHX2 Afatinib NACA DNAJB1 MAPK10 Pazopanib GSK3B CDK5 KPNA4 JUN Alvocidib INSR CCNA2 EPHX2 PIK3CA SMAD3 UBE2I CD209 Ruxolitinib CDK6 Ruxolitinib PSMA2 A-674563 EPAS1 FKBP1A CHEK2 PFDN5 BRD4 PRKCQ TGFB1 GSK3A CDK6 LCK EPAS1 TERF1 STAT3 NPM1 CDK2 SRC PSMD1 SFTPD KDR BRD4 CDK2 Doramapimod EIF3F MARK3 VKORC1 Vandetanib EIF3I CDK9 PRKACA EEF1A1 RPL19 XPO1 CEACAM1 PRKACA CDK9 TMPRSS2 SRC CLEC4G DDX5 GSK3B DB07149 CCNA2 LCK DB04716 TWF2 PABPC4 ANXA2 SYNCRIP BCL2L1 IKBKB RPS20 Sertindole KIF11 PARP1 NMT1 CD9 PPIA Alvocidib ANPEP EIF3E PHB MCL1 SRP54 HSPA9 SNX9 DDX1 DPP4 Cediranib Tamatinib PABPC1 PPP1CA BAG6 RPL13A BCL2 NUDCD1 SERPING1 HNRNPA1 IRF3 dasatinib SKP2 NCL risperidone COPB2 YKT6 CAV1 RRM2 HSPD1 GSK3B SGTA Ponatinib CDK4 PTBP1 PIK3CA G3BP1 PRKRA BCL2L2 aripiprazole NONO Crizotinib HNRNPA2B1 LCK MAPK14 SCFD1 LCK SRC CDK5 STX5 Phosphonothreonine HNRNPA3 MAPK14 MAPK10 PRKCQ ziprasidone H2AFY2 ATP6V1G1 CDK2 EPHX2 G3BP2 DB08067 CDK1 INSR Tamatinib quetiapine Canertinib MAPK10 GSK3B INSR PDPK1 PD173955 CDK5 EPAS1 CDK2 erlotinib CDK9 imatinib CDK6 BRD4 Quercetin gefitinib CDK4 dasatinib KDR MAPK11 PRKACA SRC BRD4 KDR CCNA2 CDK6 CCNA2 PRKCQ PIK3CA GSK3B CDK4 EPHX2 CDK1 CDK1 CDK9 MAPK14 CDK5 MAPK10 PIK3CA CDK6 MAPK11 Crizotinib CDK2 PRKCQ LCK EPAS1 BRD4 INSR SRC EPHX2

CCNA2 CDK9 KDR

Figure 3: Drug-protein associations networks. The center’s network graphis composedof the known interactions of 119 HCoVs-host proteins in the PPI network (the orange edge) and is the drug-protein associations of the top 100 in the prediction ranking of the 119 HCoVs-host proteins (the blue edge). Through a search of existing research literature, we found that 7 of the top 10 predicted drugs are related to coronavirus. We plotted the top 20 most relevant proteins predicted by these 7 drugs (see Supplementary data Table 4).

4 Accepted as workshop paper at ICLR 2021 MLPCP

diagram. The most relevant top 20 proteins prediction results of all top 50 drugs can be found in Supplementary data Table 4. It also shows that it is effective to find drugs that may act on HCoVs- related targets through a comprehensive heterogeneous network.

3 CONCLUSIONS AND DISCUSSION

In this study, we used a deep learning model to systematically recognize drugs that may act on HCoVs-related proteins and promote the discovery and research of drug reuse in COVID-19. We used existing HCoVs-related host proteins and proteins directly related to these host proteins in the PPI network to predict drugs that can be reused. We constructed a comprehensive heterogeneous network that connects drugs, HCoVs-related proteins, and diseases. We found the method showed high performance. At present, our research still has some limitations. In this study, we have calculated the drug reuse based on the HCoVs-related host proteins described in the existing literature. The collected virus-host interaction may not be complete, and its quality may be affected by many factors. Nonetheless, our method has taken into account as much as possible the existing interactions, the expression of target proteins, the chemical structure of drugs, and similarities. In the future, we can use the calculation method to obtain more useful and reliable information, and further systematically prediction interaction between the drug and the target.

ACKNOWLEDGMENTS This work was supported by the national key R&D program of China (2017YFE0130600); the National Natural Science Foundation of China (Grant Nos.61772441,61872309, 61922020, 61425002, 61872007); Project of marine economic innovation and development in Xiamen(No. 16PFW034SF02); Natural Science Foundation of Fujian Province (No. 2017J01099). This paper is recommended by the 5th Computational Bioinformatics Conference.

REFERENCES John A Bullinaria and Joseph P Levy. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior research methods, 39(3):510–526, 2007.

Deisy Morselli Gysi, Italo´ Do Valle, Marinka Zitnik, Asher Ameli, Xiao Gan, Onur Varol, Helia Sanchez, Rebecca Marlene Baron, Dina Ghiassian, Joseph Loscalzo, et al. Network medicine framework for identifying drug repurposing opportunities for covid-19. ArXiv, 2020. Cho-Jui Hsieh, Nagarajan Natarajan, and Inderjit Dhillon. Pu learning for matrix completion. In International Conference on Machine Learning, pp. 2445–2453. PMLR, 2015. Suphakit Niwattanakul, Jatsada Singthongchai, Ekkachai Naenudorn, and Supachanun Wanapu. Us- ing of jaccard coefficient for keywords similarity. In Proceedings of the international multicon- ference of engineers and computer scientists, volume 1, pp. 380–384, 2013. Catharine I Paules, Hilary D Marston, and Anthony S Fauci. Coronavirus infections—more than just the common cold. Jama, 323(8):707–708, 2020. Xiangxiang Zeng, Siyi Zhu, Weiqiang Lu, Zehui Liu, Jin Huang, Yadi Zhou, Jiansong Fang, Yin Huang, Huimin Guo, Lang Li, et al. Target identification among known drugs by deep learning from heterogeneous networks. Chemical Science, 11(7):1775–1797, 2020. Yadi Zhou, Yuan Hou, Jiayu Shen, Yin Huang, William Martin, and Feixiong Cheng. Network- based drug repurposing for novel coronavirus 2019-ncov/sars-cov-2. Cell discovery, 6(1):1–18, 2020.

5