
DISEASE, DRUG, AND TARGET ASSOCIATION PREDICTIONS BY

INTEGRATING MULTIPLE HETEROGENEOUS SOURCES

SEN YANG

Submitted in partial fulfillment of the requirements

For the degree of Master of Science

Thesis Advisor: Dr. Jing Li

Department of Electrical Engineering and Computer Science

CASE WESTERN RESERVE UNIVERSITY

August, 2012

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis of

Sen Yang

candidate for the Master of Science degree *.

Jing Li

Z. Meral Ozsoyoglu

Soumya Ray

Xiang Zhang

7/2/2012

*We also certify that written approval has been obtained for any proprietary material contained therein.

1 Introduction

1.1 Related work

1.1.1 Drug repositioning

1.1.2 Novel target predictions

1.2 Motivation

1.3 Thesis organization

2 Analyzing Datasets

2.1 Datasets collection

2.2 Basic statistics

2.3 Intra-similarity analysis

2.4 Further analysis on similarities

3 Disease, Drug, and Target Association Predictions

3.1 Disease-drug association predictions

3.1.1 Disease-drug graph construction

3.1.2 Predictions on the disease-drug graph

3.1.3 Predictions on the global heterogeneous graph

3.2 Drug-target association predictions

3.2.1 Predictions on the drug-target graph

3.2.2 Predictions on the global heterogeneous graph

4 Experiments

4.1 Data preparation

4.2 Evaluation metrics

4.3 Comparison with existing methods

4.4 Experimental results

4.4.1 Drug repositioning

4.4.2 Novel target predictions

5 Conclusion

Appendix A. Examples of Similar Diseases with Their Associated Drugs

Appendix B. Examples of Dissimilar Diseases with Their Associated Drugs

Appendix C. Proof of Theorem 1

Appendix D. AUC values of BLM with Various Training Sample Numbers

References

Table 1. Disease, drug, and their interactions

Table 2. Drug, target, and their interactions

Table 3. Further analysis on similarities

Table 4. AUC values of disease-drug association predictions

Table 5. Case studies on disease-drug association predictions

Table 6. AUC values of drug-target association predictions

Table 7. Case studies on drug-target association predictions


Figure 1. Connectivity information flow on heterogeneous graphs

Figure 2. Initial interaction distributions

Figure 3. Intra-similarity distributions

Figure 4. The guilt-by-association assumption on the disease-drug network

Figure 5. The guilt-by-association assumption on the drug-target network

Figure 6. An illustrative disease-drug interaction graph

Figure 7. Disease, drug, and target heterogeneous graph

Figure 8. Constructing disease-target association based on drugs

Figure 9. ROC curves of disease-drug association predictions

Figure 10. The number of retrieved disease-drug interactions on different percentiles

Figure 11. ROC curves of predictions for diseases with no known drugs

Figure 12. Case studies on disease-drug association predictions

Figure 13. ROC curves of drug-target association predictions

Figure 14. The number of retrieved drug-target interactions on different percentiles

Figure 15. ROC curves of predictions for drugs with no known targets

Figure 16. Case studies on drug-target association predictions

Acknowledgements

This thesis could not have been completed without the help of many other people.

First, I would like to express my heartfelt thanks to my advisor, Dr. Jing Li, whose patient guidance throughout the thesis writing process enabled me to understand the subject thoroughly and finish the thesis.

I would like to thank Dr. Wenhui Wang, who has provided many insightful discussions during the whole process. His critical thinking and strong background in this area helped me substantially.

I wish to thank my entire family for providing a comfortable environment for me.

I especially wish to thank my dear wife, Yongwei Deng, who took care of the whole family while I was busy with the thesis. Her encouragement and unconditional support were an endless source of strength that helped me cope with all the difficulties of the thesis preparation process. To her I dedicate this thesis.

Lastly, I offer my regards and blessings to everyone who provided support during the thesis preparation process.

Disease, Drug, and Target Association Predictions by Integrating Multiple

Heterogeneous Sources

Abstract

SEN YANG

Computational methods for new drug development can greatly reduce time and costs compared with experimental methods. A core problem in computational drug discovery is to capture the hidden interactions among diseases, drugs, and targets, which includes two sub-problems, i.e. disease-drug association predictions and drug-target association predictions. In this thesis, computational approaches for novel large-scale disease, drug, and target association predictions are proposed. First, a heterogeneous disease-drug graph and a drug-target graph, both of which incorporate the initial interactions and intra-similarities, are constructed. Based on these graphs, a novel local graph-based inference method is introduced for both prediction problems. Second, to further enhance prediction performance, a global heterogeneous graph, which incorporates initial disease-drug and drug-target interactions and the intra-similarities of disease-disease, drug-drug, and target-target pairs, is built. A novel global graph-based inference approach is then proposed.

Experimental results indicate that the proposed methods in this thesis can greatly improve disease, drug, and target association prediction accuracy on large-scale datasets compared with existing representative methods.

1 Introduction

Drug development is traditionally a costly and time-consuming process. Today, new drug development suffers from inefficiency, and clinical safety or toxicology problems account for 30% of failures (Kola and Landis 2004). On the other hand, many studies have focused on the associations among diseases, drugs and target proteins and sought to relieve the crisis of new drug discovery by utilizing various computational methods (Hopkins 2008, Ashburn and Thor 2004, Barabási, Gulbahce and Loscalzo 2011, Bleakley and Yamanishi 2009, Chiang and Butte 2009, Gottlieb, et al. 2011). Computational approaches can make significant contributions to new drug development. First, compared with traditional experimental drug development methods, which usually take more than ten years before a new drug can be approved for clinical treatment (DiMasi, Hansen and Grabowski 2003, Adams and Brantner 2006), computational approaches are able to rely on existing drugs and therefore greatly speed up the process of new drug development. Second, computational methods generally do not depend heavily on expensive genetic experiments and clinical trials, which altogether may incur costs of about $500 million to $2 billion in new drug development (DiMasi, Hansen and Grabowski 2003, Adams and Brantner 2006). This means that computational methods are far less expensive than existing experimental approaches. Finally, extensive studies on existing diseases and their associated drugs, as well as on the druggable genome (i.e., a set of computationally predicted potential drug targets), provide a new possibility for computational drug development. In subsection 1.1, I briefly summarize existing computational drug development methods, most of which can be divided into two categories: disease-drug association predictions and drug-target association predictions. The intuitive motivation of the approaches proposed in this thesis is illustrated in subsection 1.2. Finally, the organization of the thesis is given in subsection 1.3.

1.1 Related work

1.1.1 Drug repositioning

Drug repositioning, which is also known as disease-drug association prediction, aims directly at reusing drugs. It means discovering new therapeutic uses for existing drugs or for drug candidates for which there is substantial safety data. Therefore, repositioning methods bypass many of the pre-approval tests that are necessary for completely new therapeutic compounds, because the repositioned agents have already been documented as safe for their original usage (Ashburn and Thor 2004).

One famous example of drug repositioning is sildenafil (Goldstein, et al. 1998), marketed as “Viagra”, which was originally developed as a cardiovascular drug for conditions such as hypertension and angina. During clinical trials, however, it was serendipitously discovered to be a potential treatment for erectile dysfunction in men. Another example is the use of duloxetine (Cymbalta) for stress urinary incontinence. Duloxetine was originally developed as an anti-depressant, and was postulated to be a more effective alternative to selective serotonin reuptake inhibitors (SSRIs) such as fluoxetine (Prozac). Its new usage was discovered by examining its mode of action (Thor and Katofiasc 1995). Although some drug repositioning discoveries have proven quite useful, there were still no systematic methods at that time.

Systematic drug repositioning methods have been proposed only recently. Kinnings et al. (2009) demonstrated the strength of their computational strategy for identifying off-targets of major pharmaceuticals on a proteome-wide scale through the discovery that existing commercially available drugs prescribed for the treatment of Parkinson's disease have the potential to treat MDR and XDR tuberculosis. Li et al. (2009) introduced a computational framework to build disease-specific drug-protein connectivity maps by integrating gene/protein and drug connectivity information based on protein interaction networks and literature mining. Taking Alzheimer's disease (AD) as a primary example, their method was divided into three steps. First, molecular interaction networks were incorporated to reduce bias and improve the relevance of AD seed proteins. Second, PubMed abstracts were used to retrieve enriched drug terms that are indirectly associated with AD through molecular mechanistic studies. Finally, a comprehensive AD connectivity map was created by relating enriched drugs and related proteins in the literature. Kotelnikova, et al. (2010) used publicly available microarray experiments for glioblastoma and the automatically constructed ResNet and ChemEffect databases to exemplify how to find potentially effective chemicals for glioblastoma, a disease that still lacks an effective treatment. Their first approach involved construction of a signaling pathway affected in glioblastoma using the scientific literature and data available in the ResNet database. Their second approach involved analysis of differential expression in glioblastoma patients using Sub-Network Enrichment Analysis (SNEA). Although the above drug repositioning methods can achieve relatively high accuracy for some specific diseases, they are not appropriate for large-scale drug repositioning.

Lamb, et al. (2006) introduced the concept of the “Connectivity Map”, which is a collection of gene-expression profiles from cultured human cells treated with bioactive small molecules, together with pattern-matching software to mine these data. They demonstrated that the Connectivity Map can be used to find connections among small molecules sharing a mechanism of action, chemical and physiological processes, and diseases and drugs. Recently, a similar study was conducted to create a disease-drug network by analyzing the genomic expression profiles of human diseases and drugs (Hu and Agarwal 2009). A network of 170,027 significant interactions was extracted from approximately 24.5 million comparisons between approximately 7,000 publicly available transcriptomic profiles. The network includes 645 disease-disease, 5,008 disease-drug, and 164,374 drug-drug relationships. Chiang and Butte (2009) proposed a novel drug usage prediction approach based on the guilt-by-association rule, which means that suggestions for novel drug uses can be generated from the uses of drugs which cure the same diseases. Therefore, this method can only predict new uses for drugs that already have some indications, which means that it cannot be used to predict drug uses for novel chemicals.

All the above drug repositioning methods suffer from either poor scalability or low prediction accuracy. These problems were not effectively addressed until very recently, when Gottlieb et al. (2011) proposed a powerful large-scale drug repositioning method named PREDICT. Its basic assumption is that similar drugs are indicated for similar diseases. The method utilizes multiple disease-disease and drug-drug similarity measures for drug repositioning and includes three main steps. In the first step, disease-disease and drug-drug similarities are constructed using various kinds of knowledge. Next, classification features are extracted from these similarities and a classification model is trained using these features. Last, the trained classifier is used to predict novel interactions. It was reported that the PREDICT method can obtain high specificity and sensitivity (AUC=0.9) in predicting drug indications, which surpassed previous methods. The good performance of PREDICT mainly comes from the use of disease-disease and drug-drug similarities that are calculated from many sources. The basic setup of PREDICT thus differs from that of this study, which utilizes disease-disease and drug-drug similarities calculated from a single source.

1.1.2 Novel target predictions

A drug target (also referred to as a target, target protein, protein, target gene, or gene; in the remainder of this thesis, the word target denotes both drug targets and members of the druggable genome where no confusion arises) is a molecular structure that can interact with drugs (Imming, Sinning and Meyer 2006). Capturing new connections between existing drugs and targets, or finding novel targets for a given drug, plays an important role in drug development. Again, experimental prediction of drug-target associations is a laborious and costly task (Haggarty, et al. 2003).

At the same time, there are numerous known targets and even more computationally predicted targets (e.g., the so-called druggable genome). The druggable genome denotes a set of human genes that encode proteins which might be able to bind drug-like molecules (Hopkins and Groom 2002). Though different sets of druggable genes have been predicted, the consensus on the number of druggable genes is around 3000 (Russ and Lampel 2005). In contrast, there are only a few hundred known targets (Imming, Sinning and Meyer 2006). Due to the large number of potential targets, examining each one of them with a specific drug becomes a tedious or even impossible task. From this point of view, an accurate approach for filtering or ranking the druggable genome is urgently needed.

Computational drug-target association prediction methods can overcome some of the limitations of experimental drug-target predictions. A large number of approaches have been proposed within the last few years. Zhu, et al. (2005) attempted to mine implicit chemical compound and gene relations from their co-occurrence in the literature. The results of their method are constrained to current knowledge. Furthermore, there are many inconsistencies in target names and drug names, which adversely affect its results. The structure-based maximal affinity model (Cheng, et al. 2007), which uses only basic biophysical principles, can generate accurate predictions of druggability based solely on the crystal structure of a target's binding site. This method, however, is only useful when the 3D structure of the target is known, which is generally not the case.

Recently, several methods combined drug-drug or target-target similarities into novel target predictions (Campillos, et al. 2008, Yamanishi, Araki, et al. 2008, Bleakley and Yamanishi 2009, Keiser, et al. 2009, Cheng, et al. 2012). Phenotypic side-effect similarities were used to build a drug-drug relation network, based on which novel drug-target associations were inferred (Campillos, et al. 2008). Yamanishi et al. (2008) formalized the drug-target interaction inference as a supervised learning problem on a bipartite graph. The learning process was based on a unified 'pharmacological space' which was constructed by combining chemical and genomic properties. It has also been shown that chemical similarities between drugs and ligands, small molecules that bind to molecular targets, can be used to predict unanticipated associations (Keiser, et al. 2009). Bipartite local models (BLM) used supervised methods to predict target proteins of a given drug, then to predict drugs targeting a given protein, and finally these two were combined to give a definitive prediction for each drug-target interaction (Bleakley and Yamanishi 2009). In another work, Perlman et al. (2011) proposed a framework which combines multiple drug-drug and gene-gene similarity measures using a logistic regression model. The final classification score was used to indicate interactions between drugs and targets.

Very recently, a network-based method was proposed to infer novel drug-target interactions (Cheng, et al. 2012). Three approaches, i.e. the drug-based similarity inference method (DBSI), the target-based similarity inference method (TBSI), and the network-based inference method (NBI), were introduced. The first two approaches are similar to the item-based collaborative filtering methods in recommendation algorithms (Sarwar, et al. 2001). Their difference is that DBSI uses 2D chemical similarities of drugs and TBSI uses genomic sequence similarities of targets. The third method, NBI, ranks drugs for a specific target based on a two-step diffusion model on the bipartite drug-target graph. The authors demonstrated that, among the three methods, NBI achieved the best evaluation results on all four validation datasets.
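To make the idea of a two-step diffusion on a bipartite graph concrete, the following is a minimal sketch of a generic resource-allocation scheme. It is written under assumptions and is not taken from Cheng, et al. (2012): the function name `two_step_diffusion`, the unweighted 0/1 adjacency matrix, and the equal-split rule at each step are all illustrative choices.

```python
def two_step_diffusion(adjacency):
    """Score drug-target pairs by a two-step resource diffusion (illustrative).

    adjacency[i][j] is 1 if drug i is known to interact with target j, else 0.
    Returns scores[i][j], the resource drug i ends up with for target j.
    """
    n_drugs = len(adjacency)
    n_targets = len(adjacency[0])
    drug_degree = [sum(row) for row in adjacency]
    target_degree = [sum(adjacency[i][j] for i in range(n_drugs))
                     for j in range(n_targets)]
    scores = [[0.0] * n_targets for _ in range(n_drugs)]
    for j in range(n_targets):
        # Initial resource: one unit on every drug known to hit target j.
        initial = [adjacency[i][j] for i in range(n_drugs)]
        # Step 1: each drug splits its resource equally among its targets.
        on_targets = [0.0] * n_targets
        for i in range(n_drugs):
            if drug_degree[i] > 0:
                share = initial[i] / drug_degree[i]
                for l in range(n_targets):
                    if adjacency[i][l]:
                        on_targets[l] += share
        # Step 2: each target splits its resource equally among its drugs.
        for l in range(n_targets):
            if target_degree[l] > 0:
                share = on_targets[l] / target_degree[l]
                for i in range(n_drugs):
                    if adjacency[i][l]:
                        scores[i][j] += share
    return scores


# Toy graph: drug 0 hits targets 0 and 1; drug 1 hits only target 1.
toy_scores = two_step_diffusion([[1, 1], [0, 1]])
```

In this toy graph, drug 1 has no known link to target 0 but still receives a positive score for it, because both drugs share target 1. Ranking drugs by such scores for a given target is the kind of output a diffusion-based method produces.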

1.2 Motivation

The guilt-by-association principle has been widely used in many different domains and applications (e.g., Jeh and Widom (2002)). It was originally proposed for novel drug use prediction by Chiang and Butte (2009). It states that suggestions for novel drug uses can be generated from the uses of drugs that cure the same diseases. This assumption was further extended by concluding that similar diseases tend to be connected with similar drugs and similar drugs tend to be connected with similar targets (Gottlieb, et al. 2011). Based on this assumption, the intra-similarity information can be incorporated into novel association predictions by constructing a heterogeneous graph, which includes both intra-similarity information (connections between the same kind of nodes, such as disease-disease connections and drug-drug connections) and interaction information (connections between different kinds of nodes, such as disease-drug connections and drug-target connections). The methods proposed in this thesis also rely on this assumption, and an intuitive interpretation is that disease-drug or drug-target connectivity information can flow on the constructed disease-drug or drug-target heterogeneous graphs. When the information flow becomes stable, hidden connections are expected to be uncovered. An illustrative example is given in Figure 1(a), where disease d and drug r2 are predicted to be connected because disease d and drug r1 are connected and the similarity between r1 and r2 is considered significant.

Since the connectivity information can flow within the disease-drug or drug-target heterogeneous graph, it can also flow across the whole disease, drug, and target heterogeneous graph, which incorporates all intra-similarities and interactions of diseases, drugs, and targets. Take the problem of disease-drug association predictions as an example: the introduction of drug-target interactions and target-target similarities can help to capture missing disease-drug interactions, which are hard to uncover by considering only the disease-drug graph. An illustrative example is given in Figure 1(b). In this example, if we only use disease and drug information, the link between disease d and drug r2 cannot be uncovered, because the similarity between drugs r1 and r2 is not significant and the connectivity information of disease d and drug r1 therefore cannot flow to disease d and drug r2. However, after introducing target information, this link can be uncovered because disease d and drug r2 are both connected with target t. Note that the link between disease d and target t can be built because they share drug r1.

Figure 1. Connectivity information flow on heterogeneous graphs. In both graphs, d represents a disease; r1 and r2 represent two different drugs. t in the right graph represents a target. The numbers above the edges in both graphs denote the weights of the corresponding edges.
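The bridging step of Figure 1(b) can be illustrated with a small path-counting sketch. It is only an illustration under assumptions: the matrices are unweighted 0/1 versions of the figure, the helper names `matmul` and `transpose` are hypothetical, and plain path counting stands in for the inference methods of this thesis, which are described in section 3.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


# Rows: disease d. Columns: drugs r1, r2. Only d-r1 is known.
disease_drug = [[1, 0]]
# Rows: drugs r1, r2. Column: target t. Both drugs interact with t.
drug_target = [[1], [1]]

# d is linked to t because they share drug r1.
disease_target = matmul(disease_drug, drug_target)

# Paths d -> t -> drug: r2 now becomes reachable from d through t.
disease_drug_via_target = matmul(disease_target, transpose(drug_target))
```

Here `disease_drug_via_target[0][1]` is nonzero although `disease_drug[0][1]` is zero, which mirrors how target information exposes the hidden d-r2 link in the figure.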

1.3 Thesis organization

The rest of the thesis is organized as follows. First, the characteristics of the large datasets used in the experiments, as well as the validation of the guilt-by-association assumption on these datasets, are introduced in section 2. In section 3, details of the proposed approaches based on both local and global heterogeneous graphs are described. Next, results of extensive experiments are given in section 4. Finally, the conclusions of the thesis and some future directions are discussed in section 5.

2 Analyzing Datasets

Before introducing the prediction approaches, it is helpful to first take a look at the heterogeneous information and study its basic statistical characteristics. In this section, I systematically study the characteristics of the datasets that are used in this thesis and assess the validity of the guilt-by-association assumption on these datasets. In short, there are three major concepts in this study: disease, drug, and target. In the following subsections, all three kinds of data are described and analyzed in detail.

2.1 Datasets collection

There are three intra-similarity matrices which represent the disease-disease similarities, drug-drug similarities, and target-target similarities, respectively. In addition, there are two interaction matrices, i.e. the disease-drug interaction matrix and the drug-target interaction matrix, which represent the connection information of disease-drug and drug-target pairs, respectively.

A phenotype-based disease-disease similarity matrix was downloaded from MimMiner (van Driel, et al. 2006). It is constructed by calculating similarities based on the numbers of occurrences of MeSH (Medical Subject Headings) terms in the medical descriptions of each pair of diseases from the Online Mendelian Inheritance in Man (OMIM) database (Lipscomb 2000, Hamosh, et al. 2005, Gottlieb, et al. 2011). More specifically, the OMIM database contains record-based textual information on diseases. For each disease, van Driel, et al. (2006) extracted the full-text (TX) and clinical synopsis (CS) fields from OMIM. They referred to this combination of the TX and CS fields as a 'record'. They then used the anatomy (A) and the disease (C) sections of MeSH to extract terms from each record. MeSH terms served as phenotype features characterizing OMIM records, i.e. every entry in the feature vector of a disease represents the number of times a MeSH term occurs in the disease's record, which reflects the strength of relevance of the term to the disease. A similarity between each pair of diseases is calculated based on their feature vectors. According to the MimMiner database description, the similarities have already been normalized to the range [0, 1].
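As a rough illustration of how a similarity can be derived from term-count feature vectors, the sketch below uses cosine similarity. This is an assumption made for illustration: the text above only states that the similarity is calculated from the feature vectors, and the actual MimMiner procedure may include additional weighting steps. The function name `cosine_similarity` and the term names are hypothetical.

```python
import math


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity of two sparse term-count vectors (dict: term -> count)."""
    dot = sum(count * counts_b.get(term, 0) for term, count in counts_a.items())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a record without any extracted term is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical term counts for two disease records.
record_1 = {"term_a": 3, "term_b": 1}
record_2 = {"term_a": 1, "term_c": 2}
similarity = cosine_similarity(record_1, record_2)
```

Because term counts are non-negative, this measure always lies in [0, 1], which is consistent with the normalized range reported for the MimMiner similarities.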

The drug-drug similarity matrix includes all the FDA-approved drugs from the DrugBank database (Knox, et al. 2011). The similarities are calculated based on the chemical structures of the drugs. First, the chemical structures of all drug compounds in the Canonical SMILES format (Weininger 1988) were downloaded from DrugBank (Knox, et al. 2011). Then, the Chemistry Development Kit (Steinbeck, et al. 2006) was used to calculate a binary fingerprint for each drug. Finally, the similarity score of two drugs was calculated using the two-dimensional Tanimoto score (Tanimoto 1957) based on their fingerprints. Again, the resulting drug-drug similarities lie in the range [0, 1].
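The Tanimoto score of two binary fingerprints is the number of bits set in both divided by the number of bits set in at least one. The sketch below shows this computation under assumptions: fingerprints are represented as Python sets of "on" bit positions rather than as fingerprint objects from a cheminformatics toolkit, the bit positions are made up, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto score of two binary fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two all-zero fingerprints
    return len(bits_a & bits_b) / union


# Hypothetical fingerprints: two shared bits out of four distinct bits.
fingerprint_1 = {1, 2, 3}
fingerprint_2 = {2, 3, 4}
score = tanimoto(fingerprint_1, fingerprint_2)  # 2 / 4 = 0.5
```

The score is 1 only when the two fingerprints are identical and 0 when they share no bits, so no extra rescaling is needed to stay within [0, 1].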

A druggable gene is defined as a human protein-coding gene that contributes to a disease phenotype and can be modified by a small-molecule drug. The term “druggable genome” has been used to denote a list of genes whose proteins can serve as suitable targets for developing therapeutic drugs (Sophic 2012). The list of druggable genes was downloaded from the Sophic Integrated Druggable Genome Database project (Sophic 2012), which includes genes from the ENSEMBL database (Flicek, et al. 2011), the DrugBank database (Knox, et al. 2011) and the InterPro-BLAST database (Hunter, et al. 2009). The gene-gene similarities were calculated using Smith-Waterman scores (Smith and Waterman 1981) based on the amino acid sequences of their corresponding proteins. To normalize the gene-gene similarities, I adopted the method proposed by Bleakley and Yamanishi (2009). Given two genes, say g1 and g2, let SW(.,.) represent the original Smith-Waterman score; the normalized gene-gene similarity score between g1 and g2 is given by

$$SW_{norm}(g_1, g_2) = \frac{SW(g_1, g_2)}{\sqrt{SW(g_1, g_1)}\,\sqrt{SW(g_2, g_2)}}$$
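The normalization can be sketched as follows, assuming that each raw score is divided by the square roots of the two self-alignment scores, as in the normalization of Bleakley and Yamanishi (2009). The function name `normalize_sw` and the raw scores in the example are hypothetical.

```python
import math


def normalize_sw(raw):
    """Normalize a symmetric matrix of raw Smith-Waterman scores.

    raw[i][j] is the raw score between genes i and j; raw[i][i] is the
    self-alignment score of gene i and is assumed to be positive.
    """
    n = len(raw)
    return [[raw[i][j] / math.sqrt(raw[i][i] * raw[j][j]) for j in range(n)]
            for i in range(n)]


# Hypothetical raw scores for two genes.
raw_scores = [[100.0, 30.0],
              [30.0, 36.0]]
normalized = normalize_sw(raw_scores)  # off-diagonal: 30 / (10 * 6) = 0.5
```

With this normalization every gene has similarity 1 to itself, which puts the gene-gene similarities on the same scale as the disease-disease and drug-drug similarities.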

Initial disease-drug interactions were obtained from Gottlieb et al. (2011), in which disease-drug interactions were assembled from diseases listed in the OMIM database (Hamosh, et al. 2005) and their indicated drugs registered in the DrugBank database (Knox, et al. 2011). If a drug is indicated for a disease, the corresponding value in the interaction matrix is set to 1. Otherwise, it is set to 0. Details about the construction of disease-drug interactions can be found in Gottlieb et al. (2011).

Initial drug-target interactions were collected from a subset of the DrugBank database (Knox, et al. 2011). The values for drugs and their associated targets were set to 1 in the drug-target interaction matrix. All other items in the drug-target matrix were set to 0.
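Both interaction matrices follow the same 0/1 construction, which can be sketched as below. The function name `build_interaction_matrix` and all identifiers in the example are placeholders invented for illustration; they are not entries from OMIM or DrugBank.

```python
def build_interaction_matrix(row_ids, col_ids, known_pairs):
    """Build a 0/1 matrix with a 1 for every known (row, column) pair."""
    row_index = {rid: i for i, rid in enumerate(row_ids)}
    col_index = {cid: j for j, cid in enumerate(col_ids)}
    matrix = [[0] * len(col_ids) for _ in row_ids]
    for rid, cid in known_pairs:
        matrix[row_index[rid]][col_index[cid]] = 1
    return matrix


# Placeholder identifiers, not real database entries.
drugs = ["drug_A", "drug_B", "drug_C"]
targets = ["target_X", "target_Y"]
known = [("drug_A", "target_X"), ("drug_A", "target_Y"), ("drug_C", "target_Y")]
drug_target_matrix = build_interaction_matrix(drugs, targets, known)
```

A row of zeros, such as the one for `drug_B` here, corresponds to an isolated node with no known interaction.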

2.2 Basic statistics

The basic statistics of diseases, drugs, targets, and their interactions are listed in Table 1 and Table 2. The total number of diseases is 5080, and the total number of drugs is 1409 in both interaction matrices. The total number of targets is 3997. Both matrices are very sparse, with many isolated nodes (nodes having no connections). For example, the total number of connections between diseases and drugs is only 1461, with 233 diseases having at least one drug and 549 drugs connecting with at least one disease. Similarly, the total number of connections between drugs and targets is only 2098, with 554 drugs having at least one known target and 602 targets connecting with at least one drug. Among the connected nodes, many have more than one connection, which means that the known information about diseases, drugs, and targets is heavily biased towards a very small subset of them. The degree distribution of each entity in each of the matrices is given in Figure 2.

Table 1. Disease, drug, and their interactions. The first row represents the minimum number of drugs connected with each disease. The corresponding items in the second and third rows represent the number of diseases that have at least the specified number of connected drugs and the number of disease-drug interactions that are associated with such diseases. For example, the second row of column 4 indicates that there are a total of 154 diseases which have at least two connected drugs. The third row of column 4 indicates that there are a total of 1382 disease-drug interactions associated with these 154 diseases.

min # of drugs for a disease       0      1      2      5     10
disease #                       5080    233    154     93     44
disease-drug interaction #      1461   1461   1382   1214    897

Table 2. Drug, target, and their interactions. The first row represents the minimum number of targets connected with each drug. The corresponding items in the second and third rows represent the number of drugs that have at least the specified number of connected targets and the number of drug-target interactions that are associated with such drugs. For example, the second row of column 5 indicates that there are a total of 131 drugs which have at least five connected targets. The third row of column 5 indicates that there are a total of 1287 drug-target interactions associated with these 131 drugs.

min # of targets for a drug        0      1      2      5     10
drug #                          1409    554    371    131     47
drug-target interaction #       2098   2098   1915   1287    750
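The rows of Table 1 and Table 2 can be derived from an interaction matrix by thresholding node degrees. The following sketch shows one way to do this; the function name `threshold_statistics` and the toy matrix are illustrative assumptions, not the thesis data.

```python
def threshold_statistics(matrix, thresholds=(0, 1, 2, 5, 10)):
    """For each threshold k, count rows with degree >= k and their interactions.

    matrix is a 0/1 interaction matrix; each row is one entity (for example
    a disease in Table 1 or a drug in Table 2).
    Returns {k: (number_of_rows, number_of_interactions)}.
    """
    degrees = [sum(row) for row in matrix]
    stats = {}
    for k in thresholds:
        selected = [d for d in degrees if d >= k]
        stats[k] = (len(selected), sum(selected))
    return stats


# Toy matrix with three entities of degree 2, 0 and 1.
toy_matrix = [[1, 1, 0],
              [0, 0, 0],
              [1, 0, 0]]
toy_stats = threshold_statistics(toy_matrix)
```

As in the tables, the interaction counts for thresholds 0 and 1 coincide, since entities of degree 0 contribute no interactions.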

Figure 2. Initial interaction distributions. The upper-left graph shows the degree distribution of diseases on disease-drug interactions. The lower-left graph shows the degree distribution of drugs on disease-drug interactions. The upper-right graph shows the degree distribution of drugs on drug-target interactions. The lower-right graph shows the degree distribution of targets on drug-target interactions.

2.3 Intra-similarity analysis

Before the heterogeneous information from the collected datasets, i.e. intra-similarities and initial interactions, can be successfully utilized in the proposed approaches, it is necessary to study the statistical characteristics of these diverse datasets. In this subsection, the distributions of all three intra-similarity matrices, i.e. the disease-disease similarity matrix, the drug-drug similarity matrix and the target-target similarity matrix (the first row of Figure 3), are examined. From Figure 3, it can easily be seen that the majority of the similarity values are quite small (smaller than 0.3). This is especially true for disease-disease and target-target similarities. According to previous studies (Chen, Jiang and Jiang 2011, van Driel, et al. 2006), low similarity values provide little information for interaction inference. Including the mass of low values might even adversely affect prediction performance.

Figure 3. Intra-similarity distributions. The x-axis represents similarity values. The y-axis represents the number of pairs that have the specific similarity value. Notice that the self-similarities were excluded from all the distributions. The first, second and last columns correspond to disease-disease, drug-drug and target-target similarity distributions, respectively. The first row gives the whole distributions and the second row provides a closer look at the distributions of very high similarity values.

Note that even though all of the self-similarity values (the similarity between a node and itself) have already been excluded, there are still many entries with value 1 (the second row of Figure 3). This is mostly due to representation limitations. Because it is normally assumed that a node can only have a similarity score of 1 to itself, there are two possible ways to ensure this property: first, grouping all nodes with a similarity score of 1 into one node; second, replacing 1 with a value that is close to but smaller than 1. I chose the second strategy in the experiments and used the value 0.99 instead of 1, which should not significantly affect the results.
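The second strategy amounts to a simple pass over the off-diagonal entries of each similarity matrix, sketched below. The function name `cap_off_diagonal_ones` and the toy matrix are illustrative assumptions; only the replacement value 0.99 comes from the text.

```python
def cap_off_diagonal_ones(similarity, cap=0.99):
    """Replace off-diagonal similarities equal to 1 with a slightly smaller value."""
    n = len(similarity)
    return [[cap if i != j and similarity[i][j] >= 1.0 else similarity[i][j]
             for j in range(n)]
            for i in range(n)]


# Toy matrix: nodes 0 and 1 are distinct but have similarity 1.
toy_similarity = [[1.0, 1.0, 0.2],
                  [1.0, 1.0, 0.4],
                  [0.2, 0.4, 1.0]]
capped = cap_off_diagonal_ones(toy_similarity)
```

The diagonal is left untouched, so after this step a score of exactly 1 appears only between a node and itself.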

2.4 Further analysis on similarities

The goal of this study is to infer novel interactions for disease-drug and drug-target relationships by utilizing heterogeneous information. The basic assumption of the proposed methods is similar to the guilt-by-association assumption (Chiang and Butte 2009). To study the validity of the assumption on the collected real datasets, similarities of drugs for the same diseases and similarities of drugs from different diseases are compared. The average similarity of drugs for the same diseases was calculated by averaging the similarities of all of the drug pairs that belong to the same curated diseases. To determine the average similarity of drugs from different diseases, the similarity values of all of the drug pairs that are across different diseases were averaged. Similarly, I examine the similarities among diseases that share the same drugs and similarities among diseases that do not share any drugs. The similarity distributions for drug-target associations were also studied in the same manner.
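The within/cross comparison described above can be sketched as follows. The sketch makes one interpretive assumption that the text does not spell out: a drug pair counts as "within" if the two drugs share at least one disease, and as "cross" otherwise. The function name and the toy data are hypothetical.

```python
from itertools import combinations


def within_and_cross_means(groups, sim):
    """groups: dict mapping a disease to the set of its drugs.
    sim: dict mapping frozenset({a, b}) to the similarity of drugs a and b.
    Returns (mean within-group similarity, mean cross-group similarity)."""
    drugs = sorted(set().union(*groups.values()))
    within, cross = [], []
    for a, b in combinations(drugs, 2):
        # Assumption: one shared disease is enough to call the pair "within".
        shared = any(a in members and b in members for members in groups.values())
        (within if shared else cross).append(sim[frozenset((a, b))])

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(within), mean(cross)


# Hypothetical toy data: two diseases with two drugs each.
groups = {"d1": {"r1", "r2"}, "d2": {"r3", "r4"}}
sim = {
    frozenset(("r1", "r2")): 0.6,
    frozenset(("r3", "r4")): 0.4,
    frozenset(("r1", "r3")): 0.1,
    frozenset(("r1", "r4")): 0.2,
    frozenset(("r2", "r3")): 0.1,
    frozenset(("r2", "r4")): 0.2,
}
within_mean, cross_mean = within_and_cross_means(groups, sim)
```

The same function applies to the other three comparisons by swapping the roles of the grouping entity and the grouped entity.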

The comparisons of similarity distributions from the disease-drug network and the drug-target network are given in Figure 4 and Figure 5, respectively. The average similarity values are also provided in Table 3. From these results, it can be concluded that drugs (diseases) which are associated with the same diseases (drugs) possess higher similarity values than those that belong to different diseases (drugs). This observation becomes even more obvious for drug-target associations. I further test the differences between the corresponding distribution pairs using the Wilcoxon rank sum test. The results show that all four tests reject the null hypothesis of equal medians at the 5% significance level.

Figure 4. Examination of the guilt-by-association assumption on the disease-drug network. The left graph represents the similarity distribution of drugs for the same diseases (blue curve) and similarity distribution of drugs from different diseases (red curve). The right graph represents the similarity distribution of diseases sharing the same drugs (blue curve) and similarity distribution of diseases not sharing drugs (red curve).

Figure 5. Examination of the guilt-by-association assumption on the drug-target network. The left graph represents the similarity distribution of targets from the same drugs (blue curve) and the similarity distribution of targets from different drugs (red curve). The right graph represents the similarity distribution of drugs from the same targets (blue curve) and the similarity distribution of drugs from different targets (red curve).

Table 3. Further analysis on similarities. The “within similarity”, e.g. drug-drug similarity from disease-drug network, is calculated by averaging the similarities of drugs from the same disease. The “cross similarity”, e.g. drug-drug similarity from disease-drug network, is calculated by averaging the similarities of drugs from different diseases.

| | within similarity | cross similarity |
| drug-drug similarity from disease-drug network | 0.1850 | 0.1436 |
| disease-disease similarity from disease-drug network | 0.1761 | 0.1044 |
| target-target similarity from drug-target network | 0.1836 | 0.0231 |
| drug-drug similarity from drug-target network | 0.2445 | 0.1429 |

3 Disease, Drug, and Target Association Predictions

In this section, I discuss in detail how to predict novel disease-drug and drug-target interactions using heterogeneous information. The novel approaches for disease-drug association predictions and drug-target association predictions are introduced in subsection 3.1 and subsection 3.2, respectively.

3.1 Disease-drug association predictions

This subsection will focus on disease-drug association predictions, i.e. drug repositioning, and illustrate how to address this problem by utilizing heterogeneous information.

3.1.1 Disease-drug graph construction

Recent studies have already brought attention to incorporating disease-disease similarities and drug-drug similarities into drug repositioning (Gottlieb, et al. 2011). By including the intra-similarity information and initial disease-drug interactions, a heterogeneous graph can be constructed that includes two kinds of nodes, i.e. disease node and drug node, and three types of edges, i.e. disease-disease edges, drug-drug edges, and disease-drug edges.

More specifically, let D{ d12 , d ,..., dm } denote all of the m diseases, and let

R{ r12 , r ,..., rn } denote all of the n drugs. Two diseases are connected if and only if their similarity value is greater than a predefined threshold. The edge weights correspond to the similarity values. Similarly, edges between two drugs can be constructed. The disease-drug edges are created by connecting a disease and a drug if and only if the drug has been approved to cure the disease. The weights on all of the disease-drug edges are initially assigned 1. Edd, Err, and Edr are used to represent disease-disease, drug-drug, and disease-drug edges respectively. Furthermore, Wdd, Wrr, and Wdr are used to represent

weights on these three kinds of edges respectively. The heterogeneous graph can be

denoted by GDREEEWWWDR{{,},{ dd , rr , dr },{ dd , rr , dr }}.
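As a minimal sketch of this construction, the following Python builds the three weight matrices of the disease-drug graph from raw similarities and a list of known disease-drug pairs. The function names and toy values are hypothetical, the default threshold of 0.3 is the one used later in the experiments, and leaving the diagonal untouched is an assumption of this sketch rather than something the text specifies.

```python
def threshold_similarity(sim, threshold=0.3):
    """Keep an off-diagonal similarity only if it is strictly greater than
    the threshold; otherwise set it to 0 (no edge). The diagonal is kept."""
    n = len(sim)
    return [
        [sim[i][j] if i == j or sim[i][j] > threshold else 0.0 for j in range(n)]
        for i in range(n)
    ]


def interaction_matrix(pairs, n_diseases, n_drugs):
    """Binary disease-drug weights: 1 for every known (disease, drug) pair."""
    w = [[0.0] * n_drugs for _ in range(n_diseases)]
    for d, r in pairs:
        w[d][r] = 1.0
    return w


# Hypothetical toy data: 3 diseases, 2 drugs.
disease_sim = [
    [1.0, 0.45, 0.10],
    [0.45, 1.0, 0.30],
    [0.10, 0.30, 1.0],
]
w_dd = threshold_similarity(disease_sim)
w_dr = interaction_matrix([(0, 0), (2, 1)], n_diseases=3, n_drugs=2)
```

In this toy example the pair with similarity exactly 0.30 is dropped, because the edge rule requires the similarity to be greater than the threshold.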

Based on this heterogeneous graph, the drug repositioning problem can be transformed into a novel disease-drug edge prediction problem on this graph. This means that the original heterogeneous graph $G_{DR}$ can be considered as an incomplete graph with missing edges between D (disease) nodes and R (drug) nodes. Capturing hidden interactions between diseases and drugs is equivalent to adding new edges to the initial incomplete graph. Here only hidden edges between diseases and drugs need to be predicted.

Formally, the disease-drug edge prediction problem can be written as follows:

Input: $G_{DR} = \{\{D, R\}, \{E_{dd}, E_{rr}, E_{dr}\}, \{W_{dd}, W_{rr}, W_{dr}\}\}$

Output: $G_{DR}^{new} = \{\{D, R\}, \{E_{dd}, E_{rr}, E_{dr}^{new}\}, \{W_{dd}, W_{rr}, W_{dr}^{new}\}\}$

where $E_{dr}^{new}$ represents the edges between diseases and drugs in the final graph, and $W_{dr}^{new}$ represents their final weights according to a prediction procedure.

3.1.2 Predictions on the disease-drug graph

To capture novel interactions between diseases and drugs, it is helpful to first take a look at the characteristics of interactions between existing diseases and drugs. It has been shown that similar diseases not only share some phenotype features but also tend to share similar disease genes and similar drugs (Chiang and Butte 2009, Barabási, Gulbahce and Loscalzo 2011). The proposed prediction methods in this study are also based on the following assumption:

Assumption 1. Similar drugs tend to cure similar diseases and dissimilar drugs are prone to cure dissimilar diseases.

The validation of Assumption 1 on the experimental datasets collected in this study was shown earlier (Figure 4 and Table 3 in Section two).

An equivalent statement of Assumption 1 is that similar diseases tend to be cured by similar drugs and dissimilar diseases are prone to be cured by dissimilar drugs. Some disease-drug interaction examples are given in Appendix A and Appendix B. From these examples, it can be found that similar diseases tend to share similar drugs, especially for the disease 'ATRIAL FIBRILLATION, FAMILIAL, 1; ATFB1' and the disease 'ATRIAL FIBRILLATION, FAMILIAL, 3; ATFB3', which share 10 identical drugs. It can also be found that drugs associated with dissimilar diseases have relatively low similarity values, e.g. disease 'ROSENTHAL-KLOEPFER SYNDROME' and disease 'HYPERLIPIDEMIA, TYPE V'.

Based on Assumption 1, the intra-similarity information and disease-drug association information can be combined to predict novel interactions between diseases and drugs. More specifically, given the graph $G_{DR} = \{\{D, R\}, \{E_{dd}, E_{rr}, E_{dr}\}, \{W_{dd}, W_{rr}, W_{dr}\}\}$, the association coefficient between each disease-drug pair (or the edge weight for each possible edge between a disease node and a drug node on graph $G_{DR}$) is calculated using Formula 1:

$$w(d, r) = \sum_{d_i \in D} \sum_{r_j \in R} w(d, d_i) \times w(d_i, r_j) \times w(r_j, r) \qquad (1)$$

where w(d,r) represents the weight on the edge (d,r).

This formula is motivated by Assumption 1. An illustrative example is given in Figure 6, which only consists of original links between disease nodes and drug nodes and intra-similarities that are greater than a threshold. Although initially $d_2$ and $r_2$ are not connected in Figure 6, the proposed method will assign a positive weight (0.72) to edge ($d_2$, $r_2$) because they are connected through diseases $d_1$ and $d_3$ and drug $r_3$. In other words, because drug $r_3$ is used to treat diseases $d_1$ and $d_3$, and because drug $r_2$ is similar to drug $r_3$ and disease $d_2$ is similar to diseases $d_1$ and $d_3$, there is a chance that drug $r_2$ can be used to treat disease $d_2$, and the chance depends on the strength of the disease-disease and drug-drug intra-similarities.
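The double sum behind Formula 1 can be illustrated with a small Python sketch. The weights below are hypothetical and are not the values of Figure 6 (so the result differs from 0.72). Diagonal similarities are set to 0 in this toy graph so that only paths through other nodes contribute, which is a simplification made for the example.

```python
def association(d, r, w_dd, w_dr, w_rr):
    """Sum, over every disease i and drug j, the product of the weights on
    the path d -> disease i -> drug j -> r."""
    return sum(
        w_dd[d][i] * w_dr[i][j] * w_rr[j][r]
        for i in range(len(w_dd))
        for j in range(len(w_rr))
    )


# Hypothetical toy graph with diseases d1..d3 and drugs r1..r3 (indices 0..2).
w_dd = [  # d2 is similar to d1 (0.6) and to d3 (0.4)
    [0.0, 0.6, 0.0],
    [0.6, 0.0, 0.4],
    [0.0, 0.4, 0.0],
]
w_rr = [  # r2 is similar to r3 (0.5)
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5],
    [0.0, 0.5, 0.0],
]
w_dr = [  # d1 is treated by r1 and r3, d3 is treated by r3
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
score_d2_r2 = association(1, 1, w_dd, w_dr, w_rr)
```

Two paths connect d2 and r2 here, d2-d1-r3-r2 with weight 0.6 × 1 × 0.5 and d2-d3-r3-r2 with weight 0.4 × 1 × 0.5, so the coefficient is their sum, 0.5.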

Figure 6. An illustrative disease-drug interaction graph. The edges between diseases and drugs all have weight 1. $d_i$ and $r_j$ represent the i-th disease and the j-th drug respectively.

Once a new value on an edge is calculated based on Formula 1, it can potentially be used again in calculating weights of other edges. Naturally, this indicates an iterative calculation procedure, in which the new interaction matrix $W_{dr}$ would be calculated based on old values in the matrix. Formula 1 thus can be rewritten using matrix notation:

$$W_{dr}^{new} = W_{dd} \times W_{dr}^{old} \times W_{rr} \qquad (2)$$

In general, there are two related issues that need to be resolved in order for the proposed iterative approach to work. First, one may want to treat the initial links between diseases and drugs differently from the inferred links because the initial links deserve more trust. Second, it is desirable that the matrix $W_{dr}$ converges after a number of iterations, which means that the information propagation is stabilized at the end. Inspired by the theoretical work of Zhou, et al. (2004), instead of using Formula 2 directly to iteratively calculate disease and drug interactions, a revised formula with an extra term including the initial matrix is proposed:

$$W_{dr}^{new} = \alpha \, W_{dd} \times W_{dr}^{old} \times W_{rr} + (1 - \alpha) \, W_{dr}^{0} \qquad (3)$$

In this formula, α represents a decay factor; its value should be between 0 and 1. $W_{dr}^{0}$ represents the initial interactions between diseases and drugs. A straightforward explanation of this new term is that in each iteration, the original links between diseases and drugs contribute to the newly constructed connections, and their contribution is controlled by the scale factor 1 - α. In other words, α can be treated as a trust index on the newly inferred information, and therefore controls the relative contributions from inferred links and original links. By iteratively using this formula, the weight/strength between a disease and a drug essentially considers all the possible paths connecting them in the heterogeneous graph. To solve the convergence issue, Dr. Wenhui Wang has proved that if $W_{dd}$ and $W_{rr}$ are properly normalized, it is guaranteed that Formula 3 will converge to a unique solution. I simply state it here as a theorem and leave Dr. Wenhui Wang's proof to Appendix C.

THEOREM 1. When $W_{dd}$ and $W_{rr}$ are properly normalized using Formula 4, it is guaranteed that Formula 3 will converge to a unique solution, which can be obtained using an iterative propagation-based algorithm.

$$w(i, j) = \frac{w(i, j)}{\sqrt{\sum_{k=1}^{n} w(i, k) \, \sum_{k=1}^{n} w(k, j)}} \qquad (4)$$

In Formula 4, n is the number of rows (or columns, because W is a square matrix) in matrix W.
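The iterative procedure can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the experiments. It assumes that Formula 4 divides each entry by the square root of the product of its row sum and column sum, it uses α = 0.4 as in the experiments, and the stopping tolerance, iteration cap, function names, and toy matrices are all hypothetical.

```python
from math import sqrt


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize(w):
    """Divide w[i][j] by sqrt(row_sum(i) * col_sum(j)); leave empty rows/columns at 0."""
    rows = [sum(row) for row in w]
    cols = [sum(col) for col in zip(*w)]
    return [
        [w[i][j] / sqrt(rows[i] * cols[j]) if rows[i] > 0 and cols[j] > 0 else 0.0
         for j in range(len(cols))]
        for i in range(len(rows))
    ]


def local_inference(w_dd, w_rr, w_dr0, alpha=0.4, tol=1e-9, max_iter=1000):
    """Iterate W <- alpha * A W B + (1 - alpha) * W0 until the largest
    entry-wise change falls below tol (A, B are the normalized similarities)."""
    a, b = normalize(w_dd), normalize(w_rr)
    w = [row[:] for row in w_dr0]
    for _ in range(max_iter):
        prop = matmul(matmul(a, w), b)
        new = [[alpha * p + (1 - alpha) * z for p, z in zip(prow, zrow)]
               for prow, zrow in zip(prop, w_dr0)]
        delta = max(abs(x - y) for nrow, orow in zip(new, w) for x, y in zip(nrow, orow))
        w = new
        if delta < tol:
            break
    return w


# Hypothetical toy graph: 3 diseases, 3 drugs.
w_dd = [[1.0, 0.6, 0.0], [0.6, 1.0, 0.4], [0.0, 0.4, 1.0]]
w_rr = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.5], [0.0, 0.5, 1.0]]
w_dr0 = [[1.0, 0.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
scores = local_inference(w_dd, w_rr, w_dr0)
```

In this toy graph the second disease has no initial drug, yet its row in `scores` becomes positive because weight propagates to it through its similar diseases.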

3.1.3 Predictions on the global heterogeneous graph

Until now, the proposed approach has relied on disease-disease similarities, drug-drug similarities, and initial disease-drug interactions to predict novel disease-drug associations. This is in general all the information that can be used in existing drug repositioning approaches. To the best of my knowledge, almost all of the existing drug repositioning algorithms are based on all or part of these three types of information. The differences among them lie in specific prediction methods, different ways in constructing intra-similarity metrics, or initial interactions.

In this study, a novel approach that can utilize more information than existing ones is brought forward, which incorporates target-target similarities and initial drug-target interactions into drug repositioning, in addition to the information mentioned earlier.

Before discussing the approach itself, it would be helpful to first summarize all the information that will be utilized, which includes disease-disease similarities, drug-drug similarities, target-target similarities, initial disease-drug interactions and initial drug-target interactions. All of the information can be incorporated in a new global heterogeneous graph

$$G_{DRT} = \{\{D, R, T\}, \{E_{dd}, E_{rr}, E_{tt}, E_{dr}, E_{rt}\}, \{W_{dd}, W_{rr}, W_{tt}, W_{dr}, W_{rt}\}\}$$

where T represents the targets. $E_{tt}$ and $E_{rt}$ represent the target-target edges and drug-target edges respectively. $W_{tt}$ and $W_{rt}$ represent their corresponding weights. The remaining items are the same as those in the previous local graph. An example of such a graph is given in Figure 7.

Figure 7. Disease, drug, and target heterogeneous graph. Purple (circle), red (triangle), and green (square) nodes represent diseases, drugs, and targets respectively. I first collected all the disease-drug and drug-target interactions. Then only interactions with node ids (a unique id assigned to each disease, drug, or target when collecting the datasets) smaller than 1500 for disease nodes, 250 for drug nodes, and 800 for target nodes were included. For intra-connections, I only drew those which have intra-similarity values greater than 0.3. This and other connection graphs were drawn using the Cytoscape tools (http://www.cytoscape.org/).

The motivation for including target information in drug repositioning is based on the assumption that a drug can treat a disease because it can bind with its targets, which may be involved in a disease pathway. Therefore, although in practice it is hard to directly link diseases with drug targets that may be specific to the diseases, one can construct such links through their shared drugs. Once the initial connections between diseases and targets are established, more interactions can be derived based on disease-disease and target-target intra-similarities, and on disease-drug and drug-target relationships, assuming that a disease and a target are more likely to be associated if they share some similar drugs. Such information can then be utilized in predicting disease-drug relationships. A similar assumption has also been studied in previous research (Jeh and Widom 2002). Jeh and Widom stated that two objects are similar if they are related to similar objects. Their algorithm has been proven successful in many application domains, such as matching text across documents or computing overlap among item-sets.

A simple illustrative graph is given in Figure 8. Based on this graph and by applying the assumption, it can be concluded that the connection between $d_1$ and $t_1$ is stronger than that between $d_1$ and $t_2$. This is because, although both $t_1$ and $t_2$ share one drug with $d_1$, $d_1$ also has a direct link to $r_3$, which is similar to $r_1$, which in turn is directly linked to $t_1$.

Figure 8. Constructing disease-target association based on drugs. Note that this is only an illustrative sample graph and the disease-disease edges and target-target edges were intentionally omitted for clarity. $d_i$, $r_i$, and $t_i$ represent diseases, drugs, and targets respectively. The numbers above the edges represent the weights of the corresponding edges. There are no weights on interaction edges (disease-drug edges and drug-target edges) because they are all initially assigned value 1.

Therefore, I first propose an association coefficient between a disease d and a target t as follows:

$$w(d, t) = \sum_{r_i \in R} \sum_{r_j \in R} w(d, r_i) \times w(r_i, r_j) \times w(r_j, t) \qquad (5)$$

which incorporates all drugs connected to d and t, as well as drug-drug similarities. Once the relations between diseases and targets have been established, new interaction coefficients between diseases and drugs, or between drugs and targets, can be obtained similarly by considering the newly established disease-target relationships:

$$w(d, r) = \sum_{t_i \in T} \sum_{t_j \in T} w(d, t_i) \times w(t_i, t_j) \times w(t_j, r) \qquad (6)$$

$$w(r, t) = \sum_{d_i \in D} \sum_{d_j \in D} w(r, d_i) \times w(d_i, d_j) \times w(d_j, t) \qquad (7)$$

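Equations 5 to 7 can also be rewritten in matrix form, as illustrated by Equations 8 to 10 respectively.

$$W_{dt}^{new} = W_{dr} \times W_{rr} \times W_{rt} \qquad (8)$$

$$W_{dr}^{new} = W_{dt} \times W_{tt} \times W_{rt}^{T} \qquad (9)$$

$$W_{rt}^{new} = W_{dr}^{T} \times W_{dd} \times W_{dt} \qquad (10)$$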

The superscript T represents the transpose of the corresponding matrix. Now, only $W_{dr}^{new}$ and $W_{rt}^{new}$, which represent the calculated disease-drug associations and drug-target associations respectively, are required for drug repositioning and novel target discovery. Therefore, $W_{dt}$ can be simply treated as a temporary value, and $W_{dt}$ on the right-hand sides of Equations 9 and 10 can be replaced by the right-hand side of Equation 8, which results in Equations 11 and 12, respectively.

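$$W_{dr}^{new} = W_{dr} \times W_{rr} \times W_{rt} \times W_{tt} \times W_{rt}^{T} = W_{dr} \times (W_{rr} \times W_{rt} \times W_{tt} \times W_{rt}^{T}) \qquad (11)$$

$$W_{rt}^{new} = W_{dr}^{T} \times W_{dd} \times W_{dr} \times W_{rr} \times W_{rt} = (W_{dr}^{T} \times W_{dd} \times W_{dr} \times W_{rr}) \times W_{rt} \qquad (12)$$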

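Similar to Equation 3, the initial connections can also be added to the newly calculated interactions as follows:

$$W_{dr}^{new} = \alpha \, W_{dr} \times (W_{rr} \times W_{rt} \times W_{tt} \times W_{rt}^{T}) + (1 - \alpha) \, W_{dr}^{0} \qquad (13)$$

$$W_{rt}^{new} = \alpha \, (W_{dr}^{T} \times W_{dd} \times W_{dr} \times W_{rr}) \times W_{rt} + (1 - \alpha) \, W_{rt}^{0} \qquad (14)$$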

Again, α represents the decay factor. It should be a value in the range (0, 1). $W_{dr}^{0}$ and $W_{rt}^{0}$ represent the initial disease-drug and drug-target interactions respectively. Similar to the prediction approach based on the disease-drug graph, these two equations (Equations 13 and 14) can be solved in an iterative propagation-based manner, after a proper normalization that is similar to the one used in Formula 4.

Now novel disease-drug interactions can be obtained based on the whole heterogeneous graph. However, the method still has one limitation: it cannot infer new disease-drug interactions for diseases that have no known connected drugs. To address this issue, the results from the global inference (inference based on the disease, drug, and target heterogeneous graph) can be taken as the input (initial interactions) for the local inference algorithm (inference based on the disease-drug heterogeneous graph). Therefore, predictions for diseases with no known drugs can also be provided. Another important benefit of cascading the global inference with the local inference is that local information and global information are incorporated together, which can help to make more precise predictions.
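The cascade of global and local inference can be sketched as below. Several choices here are assumptions of the sketch and are not stated in the text: the composite drug-drug matrix of Equation 13 is built once from the initial drug-target matrix and then normalized in the style of Formula 4 (read as division by the square root of row sum times column sum), and the tolerance, iteration cap, function names, and toy matrices are hypothetical.

```python
from math import sqrt


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def normalize(w):
    rows = [sum(row) for row in w]
    cols = [sum(col) for col in zip(*w)]
    return [
        [w[i][j] / sqrt(rows[i] * cols[j]) if rows[i] > 0 and cols[j] > 0 else 0.0
         for j in range(len(cols))]
        for i in range(len(rows))
    ]


def iterate(step, w0, alpha, tol=1e-9, max_iter=1000):
    """Repeat W <- alpha * step(W) + (1 - alpha) * W0 until W stops changing."""
    w = [row[:] for row in w0]
    for _ in range(max_iter):
        prop = step(w)
        new = [[alpha * p + (1 - alpha) * z for p, z in zip(prow, zrow)]
               for prow, zrow in zip(prop, w0)]
        delta = max(abs(x - y) for nrow, orow in zip(new, w) for x, y in zip(nrow, orow))
        w = new
        if delta < tol:
            break
    return w


def global_inference(w_dr0, w_rr, w_rt0, w_tt, alpha=0.4):
    # Composite matrix W_rr x W_rt x W_tt x W_rt^T (assumed fixed during iteration).
    m = normalize(matmul(matmul(matmul(w_rr, w_rt0), w_tt), transpose(w_rt0)))
    return iterate(lambda w: matmul(w, m), w_dr0, alpha)


def local_inference(w_dd, w_rr, w_dr0, alpha=0.4):
    a, b = normalize(w_dd), normalize(w_rr)
    return iterate(lambda w: matmul(matmul(a, w), b), w_dr0, alpha)


# Hypothetical toy data: 3 diseases, 3 drugs, 2 targets; disease 3 has no known drug.
w_dd = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0], [0.5, 0.0, 1.0]]
w_rr = [[1.0, 0.0, 0.6], [0.0, 1.0, 0.0], [0.6, 0.0, 1.0]]
w_tt = [[1.0, 0.4], [0.4, 1.0]]
w_dr0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
w_rt0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

gi = global_inference(w_dr0, w_rr, w_rt0, w_tt)
gli = local_inference(w_dd, w_rr, gi)  # global results become the initial matrix
```

In the toy data the row of the third disease stays at zero after the global step, since a disease with no drug has nothing to propagate, and becomes positive only after the local step uses disease-disease similarity.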

3.2 Drug-target association predictions

Drug-target association predictions, i.e. novel target predictions, try to uncover hidden drug-target interactions. Based on the framework of the proposed approaches, the drug-target association prediction problem is mathematically similar to the drug repositioning problem. Although the basic methodologies are the same for these two problems, they differ in graph constructions and possible threshold controls.

3.2.1 Predictions on the drug-target graph

A drug-target graph includes drug-drug similarities, target-target similarities, and drug-target interactions. There are two kinds of nodes in this graph: drug nodes and target nodes. One drug is connected with another drug if and only if their similarity is larger than a pre-defined threshold. The weight of the edge connecting these two drugs is assigned as their similarity value. Edges and weights for target pairs can be constructed similarly. Finally, a drug and a target are connected if and only if they are recorded as interacting in the original drug-target interaction dataset. The weights of all drug-target edges are originally assigned 1. Similar to the representation of the disease-drug graph, the heterogeneous drug-target graph can be represented as $G_{RT} = \{\{R, T\}, \{E_{rr}, E_{tt}, E_{rt}\}, \{W_{rr}, W_{tt}, W_{rt}\}\}$, where T represents the target node set, $E_{rt}$ and $E_{tt}$ denote the drug-target edge set and target-target edge set respectively, and $W_{rt}$ and $W_{tt}$ denote their corresponding weights.

Again, the novel target prediction problem can be transformed into a novel drug-target edge prediction problem on the constructed drug-target graph. The novel drug-target edge prediction problem can be formalized as follows:

Input: $G_{RT} = \{\{R, T\}, \{E_{rr}, E_{tt}, E_{rt}\}, \{W_{rr}, W_{tt}, W_{rt}\}\}$

Output: $G_{RT}^{new} = \{\{R, T\}, \{E_{rr}, E_{tt}, E_{rt}^{new}\}, \{W_{rr}, W_{tt}, W_{rt}^{new}\}\}$

where $E_{rt}^{new}$ and $W_{rt}^{new}$ represent the newly calculated edges and their weights respectively.

Obviously, the inference procedure for disease-drug association predictions can be used here.

3.2.2 Predictions on the global heterogeneous graph

To study the performance of novel target predictions by incorporating disease-disease similarities and disease-drug interactions, the global heterogeneous graph, which includes all intra-similarities, disease-drug interactions, and drug-target interactions, can also be constructed. Obviously, this global heterogeneous graph includes exactly the same information as the one constructed for drug repositioning. $W_{rt}^{new}$ in Formula 14 represents the newly established drug-target relationships.

4 Experiments

In this section, a number of experiments and analyses will be conducted to show the effectiveness of the proposed approaches on large-scale datasets.

4.1 Data preparation

Detailed data preparation descriptions as well as preliminary data analyses can be found in Section 2. In the following experiments, for all three intra-similarity matrices, I use 0.3 as the threshold to establish edges. I also fix α = 0.4. The effect of these parameters will be evaluated in future work.

4.2 Evaluation metrics

To compare the performance of the proposed approaches with existing approaches, the receiver operating characteristic (ROC) analysis (Sing, et al. 2005) is adopted. An ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. In this study, the declaration of true positives and false positives is based on the relative ranks of predicted relationships according to their weights in the final matrices. The area under curve (AUC) value, also known as A' or c-statistic, is the fraction of area under the ROC curve out of the whole area, which is normally 1. AUC is often used as a summary criterion of the ROC curve. Normally, higher AUC values indicate better performance and vice versa.

In order to systematically evaluate the proposed approaches on the collected datasets, I adopt a full leave-one-out cross-validation (LOOCV) strategy for all the experiments. Basically, for each observation (either an initial disease-drug relationship or an initial drug-target relationship in this study), LOOCV uses all the remaining observations as the training data and tries to recover the observation itself. For drug repositioning, LOOCV considers each disease one by one. For each disease, one of its connections to a drug is treated as the test data, and it is ranked together with all other drugs in descending order according to the calculated disease-drug association coefficients. For each specific ranking threshold (e.g., top 1%), if the rank of the testing connection is above the threshold, it is regarded as a true positive. The fraction of all tested disease-drug relationships that are recovered as true positives is regarded as the true positive rate corresponding to the specified threshold (false positive rate). The ROC curve can thus be constructed by varying the threshold. For novel target predictions, all the targets for each drug are sorted in descending order according to their calculated association coefficients. A similar ROC curve can be constructed, which has also been used for novel target predictions in previous work (Bleakley and Yamanishi 2009, Cheng, et al. 2012).
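The rank-based ROC construction can be sketched as follows. This is an illustration under assumptions rather than the exact evaluation code: the held-out drug is ranked among all drugs except the disease's other known drugs (the text does not say whether those are excluded), ties are broken in favor of the held-out drug, the area is computed with the trapezoidal rule, and all names and numbers are hypothetical.

```python
def held_out_rank_fraction(scores, held_out, other_known):
    """Rank of the held-out drug (1 = best) among the candidate drugs,
    divided by the number of candidates. Scores are sorted descending."""
    candidates = [r for r in range(len(scores)) if r not in other_known]
    better = sum(scores[r] > scores[held_out] for r in candidates)
    return (better + 1) / len(candidates)


def roc_from_ranks(rank_fractions, thresholds):
    """For each threshold t, the share of held-out edges ranked within the top t."""
    total = len(rank_fractions)
    return [(t, sum(f <= t for f in rank_fractions) / total) for t in thresholds]


def auc(points):
    """Trapezoidal area under the (threshold, rate) points, with (0, 0) and (1, 1) added."""
    pts = sorted(set([(0.0, 0.0)] + points + [(1.0, 1.0)]))
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# One hypothetical held-out edge: drug 2 is held out, drug 0 is another known drug.
fraction = held_out_rank_fraction([0.9, 0.1, 0.5, 0.3], held_out=2, other_known={0})

# Hypothetical rank fractions collected over four held-out edges.
curve = roc_from_ranks([0.1, 0.5, 0.9, 0.2], [0.0, 0.25, 0.5, 0.75, 1.0])
area = auc(curve)
```

In a full LOOCV run, the inference would be repeated once per held-out edge with that edge removed from the initial interaction matrix, and the resulting rank fractions would be pooled before the curve is built.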

Although the ROC curve provides a whole picture for analyzing prediction performance, for drug repositioning and novel target predictions, researchers are also, or even more, interested in the accuracy of top ranked results. To compare the performance on top ranked results, the numbers of correctly retrieved testing connections based on various top percentiles (the leftmost part of the ROC curve) are also provided. Furthermore, I exclude the diseases that only have a single known drug from the LOOCV experiment. Instead, I use those diseases to test the capacity of the proposed approaches in discovering drugs for diseases without known drugs.

4.3 Comparison with existing methods

There exist a large number of disease-drug and drug-target association prediction approaches. To evaluate the performance of the proposed approaches, I choose to compare their results with those from two representative approaches, namely, the bipartite local model (BLM) (Bleakley and Yamanishi 2009) and the network based inference (NBI) method (Cheng, et al. 2012).

The BLM, which extended Yamanishi's bipartite method (Yamanishi 2008), is considered one of the state-of-the-art approaches in drug-target interaction prediction research (Xia, et al. 2010). It used supervised methods to predict target proteins of a given drug, then to predict drugs targeting a given protein, and finally these two were combined to give a final prediction for each drug-target interaction. More specifically, to predict a possible link between a drug and a target protein, the authors proposed two classification models trained on different information. For the first one, a Support Vector Machine (SVM) is constructed based on drug-target relationships (as Y) and target-target similarities (as X). For the second one, an SVM model is constructed based on target-drug relationships (as Y) and drug-drug similarities (as X). Finally these trained SVM models were used to predict new drug-target interactions. The authors also proposed a unified function to combine these two models to get a final global prediction score. Although BLM was originally proposed to predict associations between drugs and targets, it can be directly used in drug repositioning. In this study, the configuration of BLM is almost the same as the one used in the original paper. To evaluate the prediction results using ROC analysis, predicted scores generated from the SVMs can be used as the ranking criterion, which means that larger predicted scores yield higher rankings (Bleakley and Yamanishi 2009).

To implement BLM successfully in this study, one needs to choose a proper set of negative samples for training SVMs. It is not wise to include all un-observed links as negative samples, for two reasons. First, the training time would be significantly longer if all negative samples were included in the SVM training process. More importantly, it may harm the prediction results. In the datasets, only a small portion of the connections between diseases and drugs or between drugs and targets are known. An un-observed link does not necessarily mean that there is no relationship between the pair. It just means the connection has not been observed. Therefore, treating all un-observed links as negative samples will introduce bias into the model and may hurt its performance. One can potentially find the best parameter by using cross-validation again. In the experiments, the performance of BLM is tested using different numbers of negative samples, and one setting is selected empirically based on the results.

The NBI method (Cheng, et al. 2012) used a two-step diffusion inference process, which can also be written in matrix form as $A = W A_0$, where $A_0$ and $A$ represent the initial and final drug-target interaction matrices, and $W = \{w(p, q)\}_{n \times n}$ with

$$w(p, q) = \frac{1}{k(d_q)} \sum_{l=1}^{m} \frac{a_{pl} \, a_{ql}}{k(t_l)}$$

Here $a_{pl}$ denotes the weight on the original connectivity between the p-th drug and the l-th target, $k(d_q)$ denotes the number of targets that interact with $d_q$, and $k(t_l)$ denotes the number of drugs that interact with $t_l$. As indicated by the authors, for a specific target t, all drugs are then sorted in descending order based on the final ranking scores, which constitutes the recommendation list of the target t. I choose this approach for comparison because it can be viewed as a simplified version of the proposed approaches, in the sense that only a two-step diffusion of the matrix is used in NBI while the proposed approach uses the converged matrix. To compare NBI with the proposed approaches in this study, the transpose of the disease-drug interaction matrix for disease-drug interaction predictions and the transpose of the drug-target interaction matrix for drug-target interaction predictions are used, so that a ranked list of drugs for a specific disease and a ranked list of targets for a specific drug can be obtained.
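The two-step diffusion can be sketched in plain Python as below. The sketch reads the weight as w(p, q) = (1 / k(d_q)) Σ_l a_pl a_ql / k(t_l), where a_pl is the entry of the initial interaction matrix for the p-th drug and l-th target, and computes the final matrix as W times the initial matrix. The function name and the toy matrix are hypothetical, and the guards for empty rows and columns are an addition of this sketch.

```python
def nbi_scores(a0):
    """a0: n x m binary matrix (rows: drugs, columns: targets).
    Returns the n x m score matrix W * a0 of the two-step diffusion."""
    n, m = len(a0), len(a0[0])
    k_d = [sum(row) for row in a0]                             # targets per drug
    k_t = [sum(a0[p][l] for p in range(n)) for l in range(m)]  # drugs per target
    w = [
        [
            sum(a0[p][l] * a0[q][l] / k_t[l] for l in range(m) if k_t[l] > 0) / k_d[q]
            if k_d[q] > 0 else 0.0
            for q in range(n)
        ]
        for p in range(n)
    ]
    return [[sum(w[p][q] * a0[q][l] for q in range(n)) for l in range(m)]
            for p in range(n)]


# Hypothetical toy data: drug 1 hits targets 1 and 2, drug 2 hits targets 2 and 3.
a0 = [
    [1, 1, 0],
    [0, 1, 1],
]
a = nbi_scores(a0)
```

In the toy data the first drug receives a score of 0.25 for the third target, which it reaches only through the target it shares with the second drug.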

4.4 Experimental results

The experimental results for drug repositioning and novel target predictions are presented in subsections 4.4.1 and 4.4.2, respectively.

For the remainder of the thesis, LI (Local Inference), GI (Global Inference) and GLI (Global Local Inference) are used to denote, respectively, the proposed inference approach based on the local graphs (i.e. the disease-drug graph or the drug-target graph), the approach based on the global heterogeneous graph, and the local inference approach taking the global inference results as its initial inputs. To obtain the main evaluation results, the leave-one-out cross validation experiment for drug repositioning (novel target predictions) was conducted on all diseases (drugs) which have at least two initially associated drugs (targets). In total, 154 such diseases with 1382 initial disease-drug edges and 371 such drugs with 1915 initial drug-target edges were considered (Table 1 and Table 2).

4.4.1 Drug repositioning

4.4.1.1 Main evaluation results

The AUC values of NBI, BLM, LI, GI, and GLI for drug repositioning are given in Table 4. The ROC curves of NBI, BLM, LI, and GLI are also provided in Figure 9. The method GI is not included in this figure and other figures because it is only considered as one step in GLI instead of a complete method. To implement BLM successfully in this study, a large number of experiments were conducted to find a proper number of negative samples for training the SVM models. The AUC values of BLM with different numbers of negative samples are provided in Appendix D, based on which the number of negative training samples was simply set to max{20, 2 × positive_training_sample_num} in the experiments. In addition, I have used different sets of negative samples, and the results of BLM are summarized as the averages over the different runs.
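The negative sampling rule can be sketched as follows. Only the count max{20, 2 × number of positive training samples} comes from the text. Uniform sampling without replacement, capping the count at the number of available un-observed pairs, and all names and data below are assumptions of this sketch.

```python
import random


def sample_negatives(positives, all_pairs, rng, floor=20, ratio=2):
    """Draw max(floor, ratio * len(positives)) un-observed pairs uniformly
    at random, or all of them if fewer are available."""
    positive_set = set(positives)
    unobserved = [pair for pair in all_pairs if pair not in positive_set]
    k = min(len(unobserved), max(floor, ratio * len(positives)))
    return rng.sample(unobserved, k)


# Hypothetical toy data: 10 drugs x 10 targets, 3 known interactions.
all_pairs = [(r, t) for r in range(10) for t in range(10)]
positives = [(0, 1), (2, 3), (4, 5)]
negatives = sample_negatives(positives, all_pairs, random.Random(0))
```

Passing a differently seeded `random.Random` instance for each run gives the different negative sets over which the results can be averaged.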

Table 4. AUC values of disease-drug association predictions. The result of BLM was achieved by averaging five runs with the same configuration. The variance of the five runs is 3.8275e-005.

| | NBI | BLM | LI | GI | GLI |
| AUC | 0.5802 | 0.8303 | 0.8367 | 0.9037 | 0.9151 |

Figure 9. ROC curves of disease-drug association predictions.

From both Table 4 and Figure 9, it is apparent that a two-step diffusion (NBI) is not sufficient to capture all the information from the local disease-drug graph. By iteratively integrating information from all possible paths, LI made significant improvements over NBI. The performance of LI is comparable to that of BLM, which suggests that methods based on the local graph (such as LI and BLM) have probably reached their limit (an AUC of around 0.83). However, when the target information was included through the global heterogeneous graph, both GI and GLI greatly improved the disease-drug association prediction performance. This demonstrates the usefulness of introducing extra target information into drug repositioning.

The numbers of correctly retrieved disease-drug interactions at different percentiles are given in Figure 10. In total there are 1382 ground-truth disease-drug interactions in the dataset. For a specified top percentile, a ground-truth disease-drug interaction is considered correctly retrieved if its predicted rank falls within that percentile. For example, for the top 5 percentile, a ground-truth disease-drug interaction is considered correctly retrieved if its predicted rank is smaller than 70 (1382 × 0.05 ≈ 69.1). Figure 10 shows that, for the top-ranked results, the improvements of GLI and LI over NBI and BLM become even more pronounced, especially for the top 1 percentile, where the GLI and LI methods correctly retrieved 304 and 290 disease-drug interactions, whereas BLM and NBI retrieved only 15 and 2.
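The counting rule can be sketched as below. It assumes that `ranks` holds one predicted rank per ground-truth interaction (obtained when that interaction is held out) and that the cutoff is rounded up, which reproduces the threshold of 70 in the example; both the rounding and the names are assumptions made for illustration.

```python
import math

def retrieved_at_percentile(ranks, percentile, total):
    """Count ground-truth interactions whose rank is below the percentile cutoff."""
    cutoff = math.ceil(total * percentile)  # e.g. ceil(1382 * 0.05) = 70
    return sum(1 for r in ranks if r < cutoff)
```

For instance, with 1382 interactions and the top 5 percentile, ranks 1 and 69 count as retrieved while ranks 70 and 500 do not.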

Figure 10. The number of retrieved disease-drug interactions on different percentiles.

4.4.1.2 Predictions for diseases with no known connected drugs

Until now, all analyses only included diseases that have at least two connected drugs. One advantage of the proposed methods over BLM is that they can also be used to predict associations for diseases that have no known drugs. To demonstrate this, all diseases that have exactly one associated drug in the dataset were collected; there are 79 such diseases in total. Only this kind of disease was included because, in the leave-one-out cross-validation experiments, a link between a disease and a drug is artificially removed for testing. Therefore, for a disease with only one associated drug, no associated drug remains known during training, and the inference is equivalent to predicting drugs for a disease with no known connected drugs. The ROC curves of NBI, LI, and GLI for these disease-drug interaction predictions are given in Figure 11. GI and BLM are not included because they are not designed for diseases that have no known connected drugs. The AUC values for NBI, LI, and GLI are 0.6055, 0.7839, and 0.7890, respectively. Again, both LI and GLI achieved much higher performance than NBI for diseases with no known drugs, especially for the top-ranked predictions.
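The selection of test cases described above can be sketched as follows. This is a minimal illustration, assuming the known associations are available as (disease, drug) pairs; the function names and the toy identifiers in the usage note are hypothetical.

```python
from collections import defaultdict

def single_drug_diseases(pairs):
    """Map each disease that has exactly one associated drug to that drug."""
    drugs_of = defaultdict(set)
    for disease, drug in pairs:
        drugs_of[disease].add(drug)
    return {d: next(iter(s)) for d, s in drugs_of.items() if len(s) == 1}

def leave_one_out_splits(pairs):
    """Yield (held_out_pair, training_pairs) for every single-drug disease.

    After removing the held-out link, the disease has no known drug left
    in the training set, which mimics a disease with no known drugs.
    """
    pairs = set(pairs)
    for disease, drug in sorted(single_drug_diseases(pairs).items()):
        held_out = (disease, drug)
        yield held_out, pairs - {held_out}
```

Given the toy pairs `[("d1", "r1"), ("d1", "r2"), ("d2", "r3")]`, only `("d2", "r3")` is held out, and the corresponding training set contains no link for `d2`.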

Figure 11. ROC curves of predictions for diseases with no known drugs.

4.4.1.3 Case studies on disease-drug association predictions

In this subsection, I discuss case studies of five diseases: Huntington disease (Huntington chorea), lung cancer, alcohol dependence, small cell lung cancer, and “polysubstance abuse, susceptibility to”. For each disease, all the drugs that are known for the disease and the top 10 ranked predicted drugs were collected. Details about these diseases and their associated drugs are provided in Table 5. Their connections are also illustrated in Figure 12.

From Table 5 and Figure 12, some interesting facts can be found. First, similar diseases, such as lung cancer and small cell lung cancer, do share some predicted drugs. Second, predictions can also be made for diseases with no known connected drugs, such as “Polysubstance abuse, susceptibility to”. Since this disease is connected only with alcohol dependence, it is reasonable that its top-ranked predicted drugs (top 3 in Figure 12) all come from the drugs that are connected with alcohol dependence.

Figure 12. Case studies on disease-drug association predictions. Only top 3 predicted drugs for each disease were included in the graph. All edges with similarities smaller than 0.3 are considered as unconnected (therefore not shown in the graph).

Table 5. Case studies on disease-drug association predictions.

Disease Initial connected drugs Top 10 ranked (DrugBank IDs) predictions Huntington disease (OMIM ID: 143100) Olanzapine (DB00334) OMIM description: Quetiapine (DB01224) An autosomal dominant progressive neurodegenerative Ziprasidone (DB00246) disorder with a distinct phenotype characterized by chorea, Baclofen (DB00181) Clozapine (DB00363) Risperidone (DB00734) dystonia, incoordination, cognitive decline, and behavioral Tetrabenazine (DB04844) difficulties. Amitriptyline (DB00321) Doxepin (DB01142) Caused by an expanded trinucleotide repeat (CAG)n, Methotrimeprazine (DB01403) encoding glutamine, in the gene encoding huntingtin (HTT; Aripiprazole (DB01238) 613004) on chromosome 4p16.3. Tramadol (DB00193) Lung cancer (OMIM ID: 211980) Cisplatin (DB00515) Carboplatin (DB00958) OMIM description: Temozolomide (DB00853) The leading cause of cancer deaths in the U.S. and Methotrexate (DB00563) Dacarbazine (DB00851) worldwide. The 2 major forms of lung cancer are nonsmall Doxorubicin (DB00997) cell lung cancer and small cell lung cancer. Triamterene (DB00384) Anastrozole (DB01217) Many different genes are associated with lung cancer. More Daunorubicin (DB00694) details can be found in OMIM disease description. Epirubicin (DB00445) Letrozole (DB01006) Lorazepam (DB00186) Alcohol dependence (OMIM ID: 103780) Citalopram (DB00215) Alprazolam (DB00404) Chlordiazepoxide (DB00475) Clonazepam (DB01068) OMIM description: Acamprosate (DB00659) Diazepam (DB00829) Escitalopram (DB01175) Multiple genes could determine the genetic susceptibility Naltrexone (DB00704) Ziprasidone (DB00246) for alcoholism, and is supported by family, twin, and other Disulfiram (DB00822) Risperidone (DB00734) studies. More details can be found in OMIM disease Ondansetron (DB00904) Pergolide (DB01186) description. 
Olanzapine (DB00334) Bromocriptine (DB01200) Small cell cancer of the lung (OMIM ID: 182280) Triamterene (DB00384) Carboplatin (DB00958) OMIM description: Cisplatin (DB00515) Temozolomide (DB00853) Accounts for about a fourth of the 110,000 new cases of Methotrexate (DB00563) Galantamine (DB00674) Pemetrexed (DB00642) lung cancer that occur annually in the United States. It is Teniposide (DB00444) clinically distinctive: usually metastases are already present Bromocriptine (DB01200) Etoposide (DB00773) Daunorubicin (DB00694) at the time of discovery so that surgery is not used. In Topotecan (DB01030) contrast to adeno- and squamous carcinoma, SCCL is Morphine (DB00295) sensitive to chemotherapy and radiotherapy. Codeine (DB00318) Olanzapine (DB00334) Chlordiazepoxide (DB00475) Polysubstance abuse, susceptibility to, Disulfiram (DB00822) (OMIM ID: 606581) Acamprosate (DB00659) Citalopram (DB00215) Escitalopram (DB01175) OMIM description: None Niacin (DB00627) Much of the genetic vulnerability to abuse of different legal Ondansetron (DB00904) and illegal addictive substances is shared and many abusers Ethosuximide (DB00593) use multiple addictive substances. Clofibrate (DB00636) Pyridoxal (DB00147)

The top-ranked predictions for each of these diseases were further studied through a literature search. For Huntington disease, the top five ranked drugs were examined, and all five have already been studied for this disease (Paleacu, Anca and Giladi 2002, Alpay and Koroshetz 2006, Bonelli, et al. 2003, van Vugt, et al. 1997, Duff, et al. 2008). For the other four diseases, the top three predicted drugs were examined. All of the top three predicted drugs for lung cancer have already been studied for treating this disease (Ardizzoni, et al. 2007, Dziadziuszko, et al. 2003). For alcohol dependence, the drug “Lorazepam” has already been under clinical trial (ClinicalTrials.gov(a) 2012). For the drugs “Alprazolam” and “Clonazepam”, no direct research on their use in treating alcohol dependence could be found; however, the DrugBank descriptions indicate that both drugs are connected with alcohol abuse. For small cell lung cancer, the second and third predicted drugs have already been under clinical trials for treating this disease (NIH Clinical Trials 2012, ClinicalTrials.gov(b) 2012). Overall, the case analysis indicates that the proposed approach can be effective in drug repositioning.

4.4.2 Novel target predictions

4.4.2.1 Main evaluation results

The AUC values of NBI, BLM, LI, GI, and GLI for novel target predictions are given in Table 6. The ROC curves of NBI, BLM, LI, and GLI are also provided in Figure 13. From both Table 6 and Figure 13, it can be seen that the proposed LI method on the drug-target graph significantly outperforms both BLM and NBI. However, unlike in drug repositioning, GLI here does not achieve much better performance than LI (AUC 0.9359 versus 0.9317). I suspect the reason is that disease-disease similarities contribute little within the whole heterogeneous graph for novel target inference.

Table 6. AUC values of drug-target association predictions. The result of BLM was obtained by averaging five runs with the same configuration. The variance of the five runs is 1.0215e-005.

Method   NBI      BLM      LI       GI       GLI
AUC      0.7293   0.8898   0.9317   0.9033   0.9359

The numbers of correctly retrieved drug-target interactions at different percentiles are given in Figure 14. In total there are 1915 ground-truth drug-target interactions in the dataset. As before, for a specified top percentile, a ground-truth drug-target interaction is considered correctly retrieved if its predicted rank falls within that percentile. Figure 14 shows that, for the top-ranked results, GLI and LI perform much better than NBI and BLM for novel drug-target predictions, especially for the top 1 percentile, where GLI and LI correctly retrieved 1271 and 1339 drug-target interactions, whereas BLM and NBI retrieved only 50 and 10.

Figure 13. ROC curves of drug-target association predictions.

Figure 14. The number of retrieved drug-target interactions on different percentiles.

4.4.2.2 Predictions for drugs with no known connected targets

To demonstrate the effectiveness of the proposed approaches for drugs that have no known connected targets, only drugs with exactly one connected target in the dataset were included; there are 183 such drugs in total. The ROC curves of NBI, LI, and GLI for these drug-target interaction predictions are given in Figure 15. The AUC values of NBI, LI, and GLI are 0.7178, 0.9306, and 0.9526, respectively. Again, GLI and LI achieved much better performance than NBI. The AUC value of GLI is also somewhat higher than that of LI. This could indicate that although disease-disease similarities contribute little to interaction predictions for drugs with known targets, they can provide useful information when very little is known about the drug being predicted, i.e. for drugs with no known targets.

Figure 15. ROC curves of predictions for drugs with no known targets.

4.4.2.3 Case studies on drug-target association predictions

To further analyze the performance of the proposed drug-target prediction approaches, six drugs, namely Citalopram, Escitalopram, Terfenadine, Diphenidol, Fexofenadine, and Naltrexone, were randomly chosen for case studies. For each drug, all of its initial target connections and the top 10 ranked predicted targets were collected. Details about these drugs and their targets are provided in Table 7. The drugs, targets, and their connections are also illustrated in Figure 16. Only the top 3 predicted targets were included in the figure.

From Table 7 and Figure 16, some interesting facts about drug-target interaction predictions can be found. First, similar drugs, such as Diphenidol and Terfenadine, tend to share similar predicted targets. Second, predictions can also be made for drugs that have no known connected targets, such as Fexofenadine. Since Fexofenadine is connected with the drug Diphenidol, one of its top-ranked predicted targets (target Entrez_ID: 1128 in Figure 16) is also connected with Diphenidol.

The interactions between these six drugs and their top predicted targets were further studied. To examine these interactions, the SuperTarget database (Hecker, et al. 2012), which is an extensive web resource for analyzing drug-target interactions, was used to verify all of the top-ranked targets for each drug. There are four drugs, i.e. Citalopram, Terfenadine, Terfenadine, and Fexofenadine, which have new targets in the SuperTarget database compared with the DrugBank database. The correctly retrieved targets for all the drugs are in bold format in Table 7. It can be found that the top-ranked targets do include some novel targets for all drugs except “Escitalopram”. It can also be found that most of the top-ranked targets are not included in the SuperTarget database. The reason could be two-fold. First, compared with the small number of known targets, the number of potential targets in the druggable genome is extremely large. Therefore, many top-ranked predictions may be real targets that have not yet been validated. Second, although the SuperTarget database includes a large number of drug-target interactions, the number of drug-target interactions that are in the SuperTarget database but not in the DrugBank database is still small.
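The verification step amounts to simple set operations, sketched below. It assumes each database is available as a set of target identifiers per drug; the function name and the placeholder identifiers in the usage note are hypothetical and carry no biological meaning.

```python
def novel_hits(predicted, known_in_drugbank, known_in_supertarget):
    """Return predicted targets confirmed by SuperTarget but absent from DrugBank.

    The order of `predicted` (best-ranked first) is preserved.
    """
    novel = set(known_in_supertarget) - set(known_in_drugbank)
    return [t for t in predicted if t in novel]
```

For example, with predictions `["T3", "T1", "T4"]`, DrugBank targets `{"T1"}`, and SuperTarget targets `{"T1", "T3"}`, only `"T3"` counts as a confirmed novel target.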

Table 7. Case studies on drug-target association predictions.

Drugs Initial targets Predicted targets (Entrez _ID) HRH1 (3269) ADRA1A (148) CHRM1 (1128) SLC6A2 (6530) CHRM3 (1131) Citalopram (DB00215) SLC6A4(6532) SLC6A3 (6531) CHRM2 (1129) ADRA1B (147) CHRM4 (1132) CHRM5 (1133) CHRM3 (1131) ADRA1A (148) CHRM2 (1129) CHRM1 (1128) ADRA1B (147) HRH1 (3269) CHRM5 (1133) CHRM4 (1132) Escitalopram (DB01175) SLC6A2 (6530) ADRA1D (146) SLC6A3 (6531) ADRB2 (154) SLC6A4 (6532) DRD2 (1813) ADRB1 (153) HTR2A (3356) ADRA1A (148) CHRM1 (1128) ADRA1B (147) CHRM3 (1131) ADRB2 (154) ADRB1 (153) Terfenadine (DB00342) HRH1 (3269) CHRM2 (1129) KCNH2 (3757) CHRM4 (1132) ADRA1D (146) CHRM5 (1133) ADRA2A (150) HRH1 (3269) ADRA1A (148) ADRA1B (147) CHRM1 (1128) ADRB2 (154) CHRM4 (1132) Diphenidol (DB01231) CHRM2 (1129) ADRB1 (153) CHRM3 (1131) CHRM5 (1133) ADRA1D (146) ADRA2A (150) SLC6A2 (6530) CHRM1 (1128) PTGS2 (5743) PTGS1 (5742) CHRM3 (1131) CHRM2 (1129) Fexofenadine (DB00950) None CHRM4 (1132) CHRM5 (1133) HRH1 (3269) ADRA1A (148) ADRB2 (154) ADRA1A (148) HTR2A (3356) ADRA2A (150) OPRD1 (4985) TOP2A (7153) DRD2 (1813) Naltrexone (DB00704) OPRK1 (4986) ADRA2B (151) OPRM1 (4988) ADRA1B (147) ADRB2 (154) ADRA2C (152) ADRB1 (153)

Figure 16. Case studies on drug-target association predictions. For clarity, only part of the whole similarity graph was shown, i.e. only drug-drug similarities and target-target similarities larger than 0.3 were included in the graph.

5 Conclusion

In this thesis, I studied the problem of disease, drug, and target association predictions. Computational association prediction methods can greatly improve the efficiency of drug development. In the past several decades, numerous computational disease-drug or drug-target association prediction algorithms have been proposed, and they have been shown to achieve relatively high accuracy. Like existing computational methods, the proposed approaches in this study are based on the assumption that two similar diseases (drugs) tend to share similar drugs (targets). Various heterogeneous graphs, i.e. the disease-drug graph, the drug-target graph, and the disease-drug-target graph, were constructed using disease-disease similarities, drug-drug similarities, target-target similarities, disease-drug interactions, and drug-target interactions. The proposed local graph-based inference approaches for disease-drug or drug-target interaction predictions collect connectivity information within the constructed disease-drug or drug-target heterogeneous graphs. To further improve performance, the global graph-based inference method, which is based on the disease-drug-target heterogeneous graph, was proposed. Experiments showed that the proposed approaches in this thesis outperform existing representative methods.

There are a few possible research directions. First, more accurate information sources are urgently needed. Many methods rely on the quality of the disease, drug, and target heterogeneous graph. Although the proposed approaches in this thesis achieved a relatively high level of accuracy, it has been shown that the disease-disease similarity information is far less accurate than the other two similarities, i.e. drug-drug similarities and target-target similarities. Once more accurate disease-disease similarities are constructed, the performance of disease-drug interaction predictions could be even better. Second, semi-supervised classification methods have recently been utilized in drug-target interaction predictions. Semi-supervised classification methods utilize more unlabeled information in the model training process than supervised classification methods. In the disease, drug, and target association prediction problem, known interactions account for only a very small part of all possible interactions, and there are no negative training samples. Therefore, semi-supervised one-class classification methods are worth further study for disease, drug, and target association predictions. Third, it has been shown that using multiple similarity metrics provides more information than using a single similarity metric. In the future, heterogeneous graphs could also be constructed by incorporating multiple similarity metrics. Inference based on such a hybrid graph could possibly provide more precise predictions. Finally, it has been noticed that the collective effect of a set of targets or drugs plays a role in disease treatments. Therefore, more powerful approaches would be those that can retrieve essential target sets or drug sets instead of simply assigning an association score to each single target or drug.

Appendix A. Examples of Similar Diseases with Their Associated Drugs.

Examples of similar diseases and their associated drugs are listed in this appendix. Note that for the disease “ATRIAL FIBRILLATION, FAMILIAL, 1; ATFB1” and the disease “ATRIAL FIBRILLATION, FAMILIAL, 3; ATFB3”, I included only the identical drug pairs and the pairs between the remaining drugs for “ATRIAL FIBRILLATION, FAMILIAL, 3; ATFB3” and the first drug, dofetilide, for “ATRIAL FIBRILLATION, FAMILIAL, 1; ATFB1”. This is because the total number of drug pairs for these two diseases is too large (10 × 15 = 150). Detailed information about all the diseases and drugs can be found in the OMIM database and the DrugBank database.

Disease (associated drugs): MGR3 (eletriptan)

Similar disease (associated drugs): MGR2 (eletriptan, topiramate, dihydroergotamine, timolol, cyproheptadine, divalproex sodium, propranolol, sumatriptan, ergotamine)
Disease-disease similarity: 0.9459
Drug-drug pair                        Drug-drug similarity
(eletriptan, eletriptan)              1.0000
(topiramate, eletriptan)              0.1223
(dihydroergotamine, eletriptan)       0.3446
(timolol, eletriptan)                 0.1377
(cyproheptadine, eletriptan)          0.1782
(divalproex sodium, eletriptan)       0.0393
(propranolol, eletriptan)             0.1051
(sumatriptan, eletriptan)             0.4264
(ergotamine, eletriptan)              0.3246

Similar disease (associated drugs): MGR4/MGOA (eletriptan)
Disease-disease similarity: 0.9638
Drug-drug pair                        Drug-drug similarity
(eletriptan, eletriptan)              1.0000

Similar disease (associated drugs): MGR5 (eletriptan)
Disease-disease similarity: 0.9187
Drug-drug pair                        Drug-drug similarity
(eletriptan, eletriptan)              1.0000

Similar disease (associated drugs): MGR6 (eletriptan)
Disease-disease similarity: 0.8701
Drug-drug pair                        Drug-drug similarity
(eletriptan, eletriptan)              1.0000

Disease (associated drugs): ATRIAL FIBRILLATION, FAMILIAL, 1; ATFB1 (dofetilide, metoprolol, ibutilide, diltiazem, digoxin, sotalol, verapamil, warfarin, quinidine, procainamide)

Similar disease (associated drugs): ATRIAL FIBRILLATION, FAMILIAL, 3; ATFB3 (dofetilide, metoprolol, ibutilide, diltiazem, digoxin, sotalol, verapamil, warfarin, quinidine, procainamide, clonidine, anisindione, carvedilol, propafenone, flecainide)
Disease-disease similarity: 0.9078
Drug-drug pair                        Drug-drug similarity
(dofetilide, dofetilide)              1.0000
(metoprolol, metoprolol)              1.0000
(ibutilide, ibutilide)                1.0000
(diltiazem, diltiazem)                1.0000
(digoxin, digoxin)                    1.0000
(sotalol, sotalol)                    1.0000
(verapamil, verapamil)                1.0000
(warfarin, warfarin)                  1.0000
(quinidine, quinidine)                1.0000
(procainamide, procainamide)          1.0000
(clonidine, dofetilide)               0.1146
(anisindione, dofetilide)             0.2167
(carvedilol, dofetilide)              0.3125
(propafenone, dofetilide)             0.3077
(flecainide, dofetilide)              0.2227

Appendix B. Examples of Dissimilar Diseases with Their Associated Drugs.

Examples of dissimilar diseases and their associated drugs are listed in this appendix.

Detailed information about all the diseases and drugs can be found in the OMIM database and the DrugBank database respectively.

Disease (associated drugs): ROSENTHAL-KLOEPFER SYNDROME (lamotrigine, felbamate, clonazepam)

Dissimilar disease (associated drugs): CARDIOMYOPATHY, FAMILIAL HYPERTROPHIC, 2; CMH2 (verapamil)
Disease-disease similarity: 0.0000
Drug-drug pair                        Drug-drug similarity
(verapamil, lamotrigine)              0.0916
(verapamil, felbamate)                0.1709
(verapamil, clonazepam)               0.1512

Dissimilar disease (associated drugs): LEWY BODY DEMENTIA (haloperidol)
Disease-disease similarity: 0.0185
Drug-drug pair                        Drug-drug similarity
(haloperidol, lamotrigine)            0.1345
(haloperidol, felbamate)              0.1884
(haloperidol, clonazepam)             0.1894

Dissimilar disease (associated drugs): HYPERLIPIDEMIA, TYPE V (niacin, clofibrate, fenofibrate, gemfibrozil)
Disease-disease similarity: 0.0000
Drug-drug pair                        Drug-drug similarity
(niacin, lamotrigine)                 0.1117
(niacin, felbamate)                   0.1458
(niacin, clonazepam)                  0.0833
(clofibrate, lamotrigine)             0.1260
(clofibrate, felbamate)               0.1981
(clofibrate, clonazepam)              0.1260
(fenofibrate, lamotrigine)            0.1947
(fenofibrate, felbamate)              0.2443
(fenofibrate, clonazepam)             0.1507
(gemfibrozil, lamotrigine)            0.1084
(gemfibrozil, felbamate)              0.1930
(gemfibrozil, clonazepam)             0.1000

Disease (associated drugs): BRANCHIAL MYOCLONUS WITH SPASTIC PARAPARESIS AND CEREBELLAR ATAXIA (cyproheptadine, guanidine)

Dissimilar disease (associated drugs): DYSCHONDROSTEOSIS AND NEPHRITIS (magnesium sulfate)
Disease-disease similarity: 0.0743
Drug-drug pair                        Drug-drug similarity
(magnesium sulfate, cyproheptadine)   0.0180
(magnesium sulfate, cyproheptadine)   0.0769

Dissimilar disease (associated drugs): SCHISTOSOMA MANSONI, INTENSITY OF INFECTION BY; SM1 (praziquantel)
Disease-disease similarity: 0.0190
Drug-drug pair                        Drug-drug similarity
(praziquantel, cyproheptadine)        0.2064
(praziquantel, cyproheptadin)         0.0250

Dissimilar disease (associated drugs): PELGER-HUET-LIKE ANOMALY AND EPISODIC FEVER WITH ABDOMINAL PAIN (hyoscyamine, alosetron, papaverine)
Disease-disease similarity: 0.0184
Drug-drug pair                        Drug-drug similarity
(hyoscyamine, cyproheptadine)         0.2165
(hyoscyamine, guanidine)              0.0224
(alosetron, cyproheptadine)           0.1302
(alosetron, guanidine)                0.0180
(papaverine, cyproheptadine)          0.1636
(papaverine, guanidine)               0.0135

Appendix C. Proof of Theorem 1.

THEOREM 1. If formula (3) uses the same normalization method as formula (4), it is guaranteed to have a convergent solution and can be solved using an iterative propagation-based algorithm.

PROOF. (The manuscript of this proof was provided by Dr. Wenhui Wang; the proof below was edited by the author.)

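To make the proof clear, let $A$, $B$, and $X$ denote $W_{dd}$, $W_{rr}$, and $W_{dr}$, respectively. $A$, $B$, and $X$ are $n \times n$, $m \times m$, and $n \times m$ matrices, respectively. $A_i$ and $A^j$ denote the $i$-th row and the $j$-th column of $A$, and $a_{i,j}$ denotes the entry in the $i$-th row and $j$-th column of $A$. The same conventions are used for $B$ and $X$.

The $(i,j)$-th entry of $AXB$ is then $A_i X B^j$. Stacking the columns of $X$ (and of $W_{dr}^{0}$) into a single vector, we get

$$
\begin{bmatrix} x_{1,1} \\ \vdots \\ x_{n,1} \\ \vdots \\ x_{1,m} \\ \vdots \\ x_{n,m} \end{bmatrix}
= \alpha
\begin{bmatrix}
a_{1,1}b_{1,1} & \cdots & a_{1,n}b_{1,1} & \cdots & a_{1,1}b_{m,1} & \cdots & a_{1,n}b_{m,1} \\
\vdots & & \vdots & & \vdots & & \vdots \\
a_{n,1}b_{1,1} & \cdots & a_{n,n}b_{1,1} & \cdots & a_{n,1}b_{m,1} & \cdots & a_{n,n}b_{m,1} \\
\vdots & & \vdots & & \vdots & & \vdots
\end{bmatrix}
\begin{bmatrix} x_{1,1} \\ \vdots \\ x_{n,1} \\ \vdots \\ x_{1,m} \\ \vdots \\ x_{n,m} \end{bmatrix}
+ (1-\alpha)\, W_{dr}^{0}
\tag{C1}
$$

If we use $A_i \times B^j$ to denote the row vector $\left[ a_{i,1}b_{1,j}, \ldots, a_{i,n}b_{1,j}, \ldots, a_{i,1}b_{m,j}, \ldots, a_{i,n}b_{m,j} \right]$, then formula (C1) can be written as

$$
\begin{bmatrix} X^1 \\ \vdots \\ X^m \end{bmatrix}
= \alpha
\begin{bmatrix} A_1 \times B^1 \\ \vdots \\ A_n \times B^1 \\ \vdots \\ A_1 \times B^m \\ \vdots \\ A_n \times B^m \end{bmatrix}
\begin{bmatrix} X^1 \\ \vdots \\ X^m \end{bmatrix}
+ (1-\alpha)\, W_{dr}^{0}
\tag{C2}
$$

Let $C$ denote $\left[ (A_1 \times B^1)^T, \ldots, (A_n \times B^1)^T, \ldots, (A_1 \times B^m)^T, \ldots, (A_n \times B^m)^T \right]^T$. Let $i = sn + t$ and $j = rn + \tau$ with $0 \le t, \tau < n$, and define

$$
s' = s\, I\{t \neq 0\} + (s-1)\, I\{t = 0\}, \qquad t' = t\, I\{t \neq 0\} + n\, I\{t = 0\},
$$

$$
r' = r\, I\{\tau \neq 0\} + (r-1)\, I\{\tau = 0\}, \qquad \tau' = \tau\, I\{\tau \neq 0\} + n\, I\{\tau = 0\}.
$$

Then we get

$$
c_{i,j} = a_{t',\tau'}\, b_{r'+1,\,s'+1} \tag{C3}
$$

and

$$
c_{j,i} = a_{\tau',t'}\, b_{s'+1,\,r'+1} \tag{C4}
$$

Since the similarity matrices $A$ and $B$ are symmetric, comparing equations (C3) and (C4) shows that $C$ is a symmetric matrix with $n \times m$ rows and columns.

If we use $X^{*}$ to represent $\left[ (X^1)^T, \ldots, (X^m)^T \right]^T$, then equation (C2) can also be written as

$$
X^{*} = \alpha\, C X^{*} + (1-\alpha)\, W_{dr}^{0} \tag{C5}
$$

Equation (C5) can be solved exactly; however, for large networks an iterative propagation-based algorithm works faster and is guaranteed to converge to the final solution (Vanunu, et al. 2010, Yamanishi 2008). According to (Vanunu, et al. 2010), in order to get a convergent solution for equation (C5), $C$ is normalized as

$$
C^{norm} = D^{-1/2}\, C\, D^{-1/2} \tag{C6}
$$

where $D$ is a diagonal matrix with $d_{i,i}$ equal to the sum of the $i$-th row of $C$. Therefore, we also have

$$
c^{norm}_{i,j} = \frac{c_{i,j}}{\sqrt{d_{i,i}\, d_{j,j}}} \tag{C7}
$$

and

$$
d_{i,i} = \sum_{u=1}^{nm} c_{i,u} = \sum_{u=1}^{nm} a_{t',\tau_u'}\, b_{r_u'+1,\,s'+1} = \sum_{p=1}^{n} \sum_{q=1}^{m} a_{t',p}\, b_{q,\,s'+1} \tag{C8}
$$

where $u = r_u n + \tau_u$. By incorporating equation (C8) into (C7), we get

$$
c^{norm}_{i,j}
= \frac{a_{t',\tau'}\, b_{r'+1,\,s'+1}}{\sqrt{\sum_{p=1}^{n}\sum_{q=1}^{m} a_{t',p}\, b_{q,\,s'+1} \; \sum_{p=1}^{n}\sum_{q=1}^{m} a_{\tau',p}\, b_{q,\,r'+1}}}
= \frac{a_{t',\tau'}\, b_{r'+1,\,s'+1}}{\sqrt{\sum_{p=1}^{n} a_{t',p} \; \sum_{p=1}^{n} a_{\tau',p} \; \sum_{q=1}^{m} b_{q,\,s'+1} \; \sum_{q=1}^{m} b_{q,\,r'+1}}}
\tag{C9}
$$

Therefore, if we normalize $A$ and $B$ as

$$
a^{norm}_{i,j} = \frac{a_{i,j}}{\sqrt{\sum_{p=1}^{n} a_{i,p} \; \sum_{p=1}^{n} a_{j,p}}} \tag{C10}
$$

and

$$
b^{norm}_{i,j} = \frac{b_{i,j}}{\sqrt{\sum_{q=1}^{m} b_{q,i} \; \sum_{q=1}^{m} b_{q,j}}} \tag{C11}
$$

we get

$$
c^{norm}_{i,j} = a^{norm}_{t',\tau'}\, b^{norm}_{r'+1,\,s'+1} \tag{C12}
$$

Comparing equation (C12) with equation (C3) obtained earlier, $C^{norm}$ is built from $A^{norm}$ and $B^{norm}$ in the same way that $C$ is built from $A$ and $B$, so equation (C5) can be rewritten as

$$
X^{*} = \alpha\, C^{norm} X^{*} + (1-\alpha)\, W_{dr}^{0} \tag{C13}
$$

Finally, we can get the solution for formula (3) using an iterative propagation-based algorithm.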
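The argument can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the matrices are random toy data (not thesis data), the weight `alpha = 0.5` is an arbitrary choice, and the update rule X ← α·A_norm·X·B_norm + (1 − α)·X0 is assumed to be the form of formula (3). It uses the fact that, for symmetric B, the matrix C acting on the column-stacked X is the Kronecker product of B and A.

```python
import numpy as np

def sym_normalize(W):
    """Return D^{-1/2} W D^{-1/2}, with D the diagonal of row sums (as in C10, C11)."""
    d = W.sum(axis=1)
    return W / np.sqrt(np.outer(d, d))

def propagate(A, B, X0, alpha=0.5, tol=1e-12, max_iter=10_000):
    """Iterate X <- alpha * A_norm X B_norm + (1 - alpha) * X0 until the update is tiny."""
    A_n, B_n = sym_normalize(A), sym_normalize(B)
    X = X0.copy()
    for _ in range(max_iter):
        X_new = alpha * (A_n @ X @ B_n) + (1 - alpha) * X0
        if np.max(np.abs(X_new - X)) < tol:
            return X_new
        X = X_new
    return X

# Toy symmetric, positive similarity matrices and a binary association matrix.
rng = np.random.default_rng(0)
n, m, alpha = 4, 3, 0.5
A = rng.random((n, n)); A = (A + A.T) / 2
B = rng.random((m, m)); B = (B + B.T) / 2
X0 = rng.integers(0, 2, size=(n, m)).astype(float)

# C acts on the column-stacked X; for symmetric B it equals kron(B, A).
C = np.kron(B, A)
d = C.sum(axis=1)
C_norm = C / np.sqrt(np.outer(d, d))                          # (C6)
C_from_parts = np.kron(sym_normalize(B), sym_normalize(A))    # (C12)

# Exact solution of (C13) versus the iterative propagation.
x0 = X0.flatten(order="F")
x_exact = np.linalg.solve(np.eye(n * m) - alpha * C_norm, (1 - alpha) * x0)
X_iter = propagate(A, B, X0, alpha)
```

Here `C_norm` and `C_from_parts` should agree entry by entry, which is the content of (C12), and the column-stacked `X_iter` should match `x_exact`, which illustrates that the iteration reaches the exact solution of (C13).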

Appendix D. AUC values of BLM with Various Training Sample Numbers.

The following figure shows the performance of BLM for disease-drug association predictions and drug-target association predictions using different negative training sample numbers.

References

Adams, Christopher P, and Van V Brantner. "Development: Is It Really $802 Million?" Health Aff 25, no. 2 (2006): 420-428.

Alpay, M, and WJ Koroshetz. "Quetiapine in the treatment of behavioral disturbances in patients with Huntington's disease." Psychosomatics 47, no. 1 (2006): 70-72.

Ardizzoni, A, et al. "Cisplatin- versus carboplatin-based chemotherapy in first-line treatment of advanced non-small-cell lung cancer: an individual patient data meta-analysis." J Natl Cancer Inst 99, no. 11 (2007): 847-857.

Ashburn, Ted T, and Karl B Thor. "Drug repositioning: identifying and developing new uses for existing drugs." Nature Reviews Drug Discovery 3, no. 8 (2004): 673-683.

Barabási, AL, N Gulbahce, and J Loscalzo. "Network medicine: a network-based approach to human disease." Nat Rev Genet 12, no. 1 (2011): 56-68.

Bleakley, K, and Y Yamanishi. "Supervised prediction of drug-target interactions using bipartite local models." Bioinformatics 25, no. 18 (2009): 2397-2403.

Bonelli, RM, BM Mayr, G Niederwieser, F Reisecker, and HP Kapfhammer. "Ziprasidone in Huntington's disease: the first case reports." J Psychopharmacol 17, no. 4 (2003): 459-460.

Campillos, M, M Kuhn, AC Gavin, LJ Jensen, and P Bork. "Drug target identification using side-effect similarity." Science 321, no. 5886 (2008): 263-266.

Chen, Y, T Jiang, and R Jiang. "Uncover disease genes by maximizing information flow in the phenome-interactome network." Bioinformatics 27, no. 13 (2011): i167-i176.

Cheng, AC, et al. "Structure-based maximal affinity model predicts small-molecule druggability." Nat Biotechnol 25, no. 1 (2007): 71-75.

Cheng, F, et al. "Prediction of Drug-Target Interactions and Drug Repositioning via Network-Based Inference." PLoS Comput Biol 8, no. 5 (2012): e1002503.

Chiang, AP, and AJ Butte. "Systematic evaluation of drug-disease relationships to identify leads for novel drug uses." Clin Pharmacol Ther 86, no. 5 (2009): 507-510.

ClinicalTrials.gov(a). Disulfiram Combined With Lorazepam for Treatment of Patients With Alcohol Dependence and Primary or Secondary Anxiety Disorder. 2012. http://clinicaltrials.gov/ct2/show/NCT00721526.

ClinicalTrials.gov(b). Temozolomide for Relapsed Sensitive or Refractory Small Cell Lung Cancer. 2012. http://clinicaltrials.gov/ct2/show/NCT00740636.

DiMasi, Joseph A, Ronald W Hansen, and Henry G Grabowski. "The price of innovation: new estimates of drug development costs." Journal of Health Economics 22, no. 2 (2003): 151-185.

Duff, K, LJ Beglinger, ME O'Rourke, P Nopoulos, HL Paulson, and JS Paulsen. "Risperidone and the treatment of psychiatric, motor, and cognitive symptoms in Huntington's disease." Ann Clin Psychiatry 20, no. 1 (2008): 1-3.

Dziadziuszko, R, et al. "Temozolomide in patients with advanced non-small cell lung cancer with and without brain metastases. a phase II study of the EORTC Lung Cancer Group (08965)." Eur J Cancer 39, no. 9 (2003): 1271-1276.

Flicek, P, et al. "Ensembl 2011." Nucleic Acids Research 39, no. Database issue (2011): 800-806.

Goldstein, I, TF Lue, H Padma-Nathan, RC Rosen, WD Steers, and PA Wicker. "Oral sildenafil in the treatment of erectile dysfunction. Sildenafil Study Group." N Engl J Med 338, no. 20 (1998): 1397-1404.

Gottlieb, A, GY Stein, E Ruppin, and R Sharan. "PREDICT: a method for inferring novel drug indications with application to personalized medicine." Mol Syst Biol 7, no. 496 (2011): 496.

Haggarty, SJ, KM Koeller, JC Wong, RA Butcher, and SL Schreiber. "Multidimensional chemical genetic analysis of diversity-oriented synthesis-derived deacetylase inhibitors using cell-based assays." Chem Biol 10, no. 5 (2003): 383-396.

Hamosh, A, AF Scott, JS Amberger, CA Bocchini, and VA McKusick. "Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders." Nucleic Acids Res 33, no. Database Issue (2005): 514-517.

Hecker, N, et al. "SuperTarget goes quantitative: update on drug-target interactions." Nucleic Acids Res 40, no. Database issue (2012): D1113-1117.

Hopkins, AL. "Network pharmacology: the next paradigm in drug discovery." Nature chemical biology 4, no. 11 (2008): 682-690.

Hopkins, AL, and CR Groom. "The druggable genome." Nat Rev Drug Discov 1, no. 9 (2002): 727-730.

Hu, G, and P Agarwal. "Human disease-drug network based on genomic expression profiles." PLoS One 4, no. 8 (2009): e6536.

Hunter, S, et al. "InterPro: the integrative protein signature database." Nucleic Acids Res 37, no. Database issue (2009): 211-215.

Imming, P, C Sinning, and A Meyer. "Drugs, their targets and the nature and number of drug targets." Nat Rev Drug Discov 5, no. 10 (2006): 821-834.

Jeh, Glen, and Jennifer Widom. "SimRank: a measure of structural-context similarity." Knowledge Discovery and Data Mining - KDD. 2002. 538-543.

Keiser, MJ, et al. "Predicting new molecular targets for known drugs." Nature 462, no. 7270 (2009): 175-181.

Kinnings, Sarah L, Nina Liu, Nancy Buchmeier, Peter J Tonge, Lei Xie, and Philip E Bourne. "Drug Discovery Using Chemical Systems Biology: Repositioning the Safe Medicine Comtan to Treat Multi-Drug and Extensively Drug Resistant Tuberculosis." PLoS Computational Biology 5, no. 7 (2009).

Knox, C, et al. "DrugBank 3.0: a comprehensive resource for 'omics' research on drugs." Nucleic Acids Res 39, no. Database issue (2011): 1035-1041.

Kola, Ismail, and John Landis. "Can the pharmaceutical industry reduce attrition rates?" Nature Review, Drug Discovery 3, no. 8 (2004): 711-716.

Kotelnikova, E, A Yuryev, I Mazo, and N Daraselia. "Computational approaches for drug repositioning and combination therapy design." J Bioinform Comput Biol 8, no. 3 (2010): 593-606.

Lamb, Justin, et al. "The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease." Science 313, no. 5795 (2006): 1929-1935.

Li, Jiao, Xiaoyan Zhu, and Jake Yue Chen. "Building Disease-Specific Drug-Protein Connectivity Maps from Molecular Interaction Networks and PubMed Abstracts." PLoS Comput Biol 5, no. 7 (2009).

Lipscomb, Carolyn E. "Medical Subject Headings (MeSH)." Bull Med Libr Assoc. 88, no. 3 (2000): 265-266.

NIH Clinical Trials. Carboplatin and Etoposide Plus LBH589 for Small Cell Lung Cancer. 2012. http://clinicaltrialsfeeds.org/clinical-trials/show/NCT00958022.

Paleacu, D, M Anca, and N Giladi. "Olanzapine in Huntington's disease." Acta Neurol Scand 105, no. 6 (2002): 441-444.

Perlman, L, A Gottlieb, N Atias, E Ruppin, and R Sharan. "Combining drug and gene similarity measures for drug-target elucidation." J Comput Biol 18, no. 2 (2011): 133-145.

Russ, AP, and S Lampel. "The druggable genome: an update." Drug Discov Today 10, no. 23-24 (2005): 1607-1610.

Sarwar, Badrul, George Karypis, Joseph Konstan, and John Riedl. "Item-based Collaborative Filtering Recommendation Algorithms." Proceedings of the 10th international conference on World Wide Web. 2001. 285-295.

Sing, T, O Sander, N Beerenwinkel, and T Lengauer. "ROCR: visualizing classifier performance in R." Bioinformatics 21, no. 20 (2005): 3940-3941.

Smith, TF, and MS Waterman. "Identification of common molecular subsequences." J Mol Biol 147, no. 1 (1981): 195-197.

Sophic. "White paper: The integrated druggable genome database." 2012. http://www.sophicalliance.com/documents/sophicdocs/White%20Paper%20Update%201-27-11/The%20Druggable%20Genome012511.pdf.

Steinbeck, C, C Hoppe, S Kuhn, M Floris, R Guha, and EL Willighagen. "Recent developments of the chemistry development kit (CDK) - an open-source java library for chemo- and bioinformatics." Curr Pharm Des 12, no. 17 (2006): 2111-2120.

Tanimoto, TT. "IBM Internal Report 17th Nov." 1957.

Thor, KB, and MA Katofiasc. "Effects of duloxetine, a combined serotonin and norepinephrine reuptake inhibitor, on central neural control of lower urinary tract function in the chloralose-anesthetized female cat." J Pharmacol Exp Ther 274, no. 2 (1995): 1014-1024.

van Driel, MA, J Bruggeman, G Vriend, HG Brunner, and JA Leunissen. "A text-mining analysis of the human phenome." Eur J Hum Genet 14, no. 5 (2006): 535-542.

van Vugt, JP, S Siesling, M Vergeer, EA van der Velde, and RA Roos. "Clozapine versus placebo in Huntington's disease: a double blind randomised comparative study." J Neurol Neurosurg Psychiatry 63, no. 1 (1997): 35-39.

Vanunu, Oron, Oded Magger, Eytan Ruppin, Tomer Shlomi, and Roded Sharan. "Associating genes and protein complexes with disease via network propagation." PLoS Comput Biol 6, no. 1 (2010).

Weininger, David. "SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules." Journal of Chemical Information and Modeling 28, no. 1 (1988): 31-36.

Xia, Z, LY Wu, X Zhou, and ST Wong. "Semi-supervised drug-protein interaction prediction from heterogeneous biological spaces." BMC Syst Biol, 2010.

Yamanishi, Y. "Supervised bipartite graph inference." Proceedings of the Conference on Advances in Neural Information Processing Systems 21. 2008. 1433-1440.

Yamanishi, Y, M Araki, A Gutteridge, W Honda, and M Kanehisa. "Prediction of drug-target interaction networks from the integration of chemical and genomic spaces." Bioinformatics 24, no. 13 (2008): i232-i240.

Zhou, Dengyong, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. "Learning with local and global consistency." Advances in Neural Information Processing Systems 16 (MIT Press), 2004: 321-328.

Zhu, S, Y Okuno, G Tsujimoto, and H Mamitsuka. "A probabilistic model for mining implicit 'chemical compound-gene' relations from literature." Bioinformatics, 2005: ii245-ii251.