
In silico Methods for Repositioning and Drug-Drug Interaction Prediction

Pathima Nusrath Hameed ORCID: 0000-0002-8118-9823

Submitted in total fulfilment of the requirements for the degree of Doctor of Philosophy

Department of Mechanical Engineering
THE UNIVERSITY OF MELBOURNE

May 2018

Copyright © 2018 Pathima Nusrath Hameed

All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.

Abstract

Drug repositioning and drug-drug interaction (DDI) prediction are two fundamental applications with a large impact on drug development and clinical care. Drug repositioning aims to identify new uses for existing drugs. Moreover, understanding harmful DDIs is essential to enhance the effects of clinical care. Exploring both the therapeutic uses and the adverse effects of drugs, or of pairs of drugs, has significant benefits in pharmacology. The use of computational methods to support drug repositioning and DDI prediction enables improvements in the speed of drug development compared to in vivo and in vitro methods.

This thesis investigates the consequences of employing a representative training sample in achieving better performance for DDI classification. The Positive-Unlabeled Learning method introduced in this thesis aims to employ representative positives as well as reliable negatives to train the binary classifier for inferring potential DDIs. Moreover, it explores the importance of a finer-grained similarity metric to represent the pairwise drug similarities.

Drug repositioning can be approached by new indication detection. In this study, the Anatomical Therapeutic Chemical (ATC) classification is used as the primary source to determine the indications/therapeutic uses of drugs for drug repositioning. This thesis presents a two-tiered clustering approach for obtaining pairwise drug similarity and heterogeneous drug data integration, which is employed for large-scale drug repositioning. Moreover, this thesis demonstrates subnetwork identification as a suitable approach for new indication detection for existing drugs. The subnetwork identification method identifies a subgraph from a large drug similarity network, connecting a set of given drugs known as ‘terminals’. In this study, the ‘terminals’ are selected according to the ATC classification system; hence meaningful subnetworks are identified. The proposed subnetwork identification method is employed to infer drug repositioning candidates for cardiovascular diseases and diseases related to the nervous system.

New target detection for existing drugs is also beneficial for drug repositioning. This thesis proposes a useful computational method for target clustering which is extended to identify new drug-target relationships. It demonstrates the significance of integrating dimensionality reduction and outlier detection to overcome the limitations arising from incomplete drug-target interaction data.

The clinical significance and literature-based evidence illustrate the relevance of the proposed methods. The proposed methods can be employed in other similar applications where applicable.

Declaration

This is to certify that

1. the thesis comprises only my original work towards the PhD,
2. due acknowledgement has been made in the text to all other material used,
3. the thesis is less than 100,000 words in length, exclusive of tables, maps, bibliographies and appendices.

Pathima Nusrath Hameed, May 2018

Acknowledgements

I am so grateful to the many people who helped me in various ways throughout this journey. First and foremost, my sincere gratitude goes to my supervisors, Professor Saman Halgamuge and Professor Karin Verspoor, for their tremendous guidance, feedback, and support throughout my doctoral studies. They devoted their valuable time to supervision and guidance, without which the completion of this research would not have been possible. In many ways, they are great examples for me. Their keen insights, passion for research and life, patience in guiding students, bright ideas, encouragement, and diligence in every respect are invaluable. I would also like to thank Karin again for the detailed comments and suggestions. I always feel lucky to have them both as my supervisors.

My academic progress was kept on track by periodic observation from my committee chair, Professor James Bailey. I would like to express my thanks to him for his friendly and constructive feedback and suggestions throughout my PhD career.

I sincerely thank Dr. Snezana Kusljic and my colleague Yahui Sun for their collaboration. The outcomes of these collaborations form important pieces of this thesis.

I am grateful to The University of Melbourne, Data61, Victoria Research Lab, West Melbourne, Australia, and National ICT Australia for giving me the opportunity to pursue my PhD and for the financial support provided by means of postgraduate scholarships.

I wish to thank all past and present members of the optimization and pattern recognition research group for their support, kindness, and friendship. Also, I express my thanks to all my lab mates and friends from The University of Melbourne. I found nice people, with different backgrounds, talking to each other with ease, humility and friendship. Chalini, thank you for supporting me in the process of university entrance.

All staff members of the University of Ruhuna are remembered warmly. I appreciate the invaluable support from Dr. Tharaka Ilayperuma and Dr. MBF Mafasiya for signing my study leave bond agreement.

I am indebted to Mr. T N Deen, my uncle and my first teacher, who is still providing advice and guidance. Also, my primary, secondary and tertiary teachers are remembered with utmost respect. I would like to thank my cousins and relatives in my close-knit family for their love and blessings.

Most importantly, my heartfelt thanks go to my parents, my parents-in-law and my sister, who always supported me in every possible way with endless and unreserved love, encouragement, immeasurable sacrifices and blessings.

Safraz, you have been my strength and the source of all my happiness. To say thank you would be inadequate, and it would be impossible to list all the reasons why I should. I am so happy that you are my husband. I would never have started a PhD if not for your constant persuasion and belief in me.

Finally, a big thank you to my lovely daughter. Your curious eyes and bright little smile made the last year of my PhD a joyful one.

Preface

This thesis includes two peer-reviewed journal articles in their published form:

• Hameed, P. N., Verspoor, K., Kusljic, S., & Halgamuge, S. (2018). A two-tiered unsupervised clustering approach for drug repositioning through heterogeneous data integration. BMC Bioinformatics, 19(1), 129.
• Hameed, P. N., Verspoor, K., Kusljic, S., & Halgamuge, S. (2017). Positive-Unlabeled Learning for inferring drug interactions based on heterogeneous attributes. BMC Bioinformatics, 18(1), 140.

The two articles, Hameed et al. (2018) and Hameed et al. (2017), are presented in Chapter 3 and Chapter 4, respectively. In both articles, Pathima Nusrath Hameed formulated the research questions, developed the specific methods and assessed their performance (substantially more than 50%), and is the lead author. Snezana Kusljic provided the clinical relevance of the findings. The co-authors Saman Halgamuge and Karin Verspoor contributed to the supervision of the work. Permission was provided by all co-authors to include these articles in their published form in this thesis.

In addition to the above articles, content from another peer-reviewed published journal article is included in Chapter 5 and Appendix A:

• Sun, Y., Hameed, P. N., Verspoor, K., & Halgamuge, S. (2016). A physarum-inspired prize-collecting steiner tree approach to identify subnetworks for drug repositioning. BMC Systems Biology, 10(5), 128.

In Sun et al. (2016), Pathima Nusrath Hameed is an equal first author. She formulated drug repositioning by Anatomical Therapeutic Chemical Classification as a suitable problem to employ Prize-Collecting Steiner Tree (PCST) based subnetwork identification. She also contributed to the design, assessment and analysis of the experiment. Yahui Sun proposed a specific PCST algorithm and contributed to the design and analysis of the experiment.

This research was supported by a Melbourne International Research Scholarship, a Melbourne International Fee Remission Scholarship, and a NICTA scholarship of National ICT Australia, now Data61 since merging with CSIRO’s Digital Productivity team.

To my husband for all his love, patience, sacrifices and encouragements...

Contents

1 Introduction
  1.1 Motivation
    1.1.1 Heterogeneous data and pairwise drug similarities
    1.1.2 Drug-drug interaction prediction
    1.1.3 Drug repositioning
    1.1.4 Anatomical Therapeutic Chemical (ATC) classification
  1.2 Research focus and thesis outline
  1.3 Contributions of this thesis
  1.4 Related publications by the author

2 Literature Review
  2.1 Pharmacological data
    2.1.1 Importance of heterogeneous data integration
  2.2 Drug repositioning
    2.2.1 Network-based inferences
    2.2.2 Machine learning-based approaches
    2.2.3 Importance of clustering in pharmacological data network analysis
  2.3 Drug-drug interaction prediction
  2.4 Summary

3 Inferring pairwise drug interactions based on heterogeneous attributes
  3.1 Published manuscript
  3.2 Summary

4 A two-tiered clustering approach for drug repositioning through heterogeneous data integration
  4.1 Published manuscript
  4.2 Results analysis using the missing ATC codes in the published manuscript
    4.2.1 Results analysis
  4.3 Summary

5 Employing subnetwork identification to repositioning drugs using ATC classification
  5.1 Background
  5.2 The proposed method
    5.2.1 Drug Similarity Network
    5.2.2 Sparse Graph generation
    5.2.3 Drug repositioning by subnetwork identification
  5.3 Results
  5.4 Discussion
  5.5 Summary

6 Inferring super-targets using dimensionality reduction and density-based clustering
  6.1 Background
  6.2 Data
  6.3 The proposed method
    6.3.1 T-distributed stochastic neighbor embedding
    6.3.2 Clustering and outliers detection using DBSCAN
    6.3.3 Evaluation
  6.4 Results
    6.4.1 Inferring super-targets
    6.4.2 Impact of dimensionality reduction
    6.4.3 Impact of DBSCAN algorithm for inferring super-targets
    6.4.4 Importance of super-target analysis for new drug-target detection
  6.5 Discussion
  6.6 Summary

7 Conclusions
  7.1 Future work

A A physarum-inspired prize-collecting steiner tree approach to identify subnetworks for drug repositioning

B The list of drugs used in Chapters 3 and 5

C The list of ATC class-N drugs used in Chapter 5

List of Figures

1.1 Workflow of the main applications in computational pharmacological data analysis
1.2 A generalized illustration of two alternative approaches involved in drug repositioning. (The notations 1* and m-n indicate one-or-many and many-to-many relationships, respectively.)
1.3 Using Jaccard Index for drug comparison

2.1 The fundamental components that directly/indirectly involved in drug-disease relationships [1]

5.1 Edge cost distribution of five different sparse DSNs generated using Sparse Graph Method 1
5.2 ATC class distribution in the repositioned ATC-N drug

6.1 The proposed workflow for target clustering
6.2 3-D scatter plots, illustrating super-target clustering of enzyme data
6.3 Dendrogram analysis of higher dimensional feature space and reduced dimensional feature space after using Barnes-Hut-SNE

List of Tables

4.1 Drugs that are considered to be not in the ATC classification system
4.2 Performance assessment of Drug Clustering Tier 1
4.3 Performance assessment of Drug Clustering Tier 2
4.4 The changes in the inferred repositioning candidates with higher confidence

5.1 ATC class-N
5.2 Sparse graph method 1
5.3 Analysis of five DSNs generated using Sparse Graph Method 1
5.4 Sparse graph method 2
5.5 Sparse graph method 4
5.6 Subnetwork Evaluation for ATC-N using Sparse Graph Method 1
5.7 Subnetwork Evaluation for ATC-N using Sparse Graph Method 2
5.8 Subnetwork Evaluation for ATC-N using Sparse Graph Method 3
5.9 Subnetwork Evaluation for ATC-N using Sparse Graph Method 4
5.10 Analysis of subnetwork identification for complete graphs vs. sparse graphs
5.11 Repositioning candidates for nervous diseases identified in multiple subnetworks

6.1 Existing gold-standard drug-target interactions
6.2 Analysis of in-degree and out-degree of drug-target interactions
6.3 Super-target analysis: Enzyme
6.4 Extrinsic evaluation of super-target prediction
6.5 Analysis of super-targets obtained using DBSCAN algorithm alone
6.6 Extrinsic evaluation of super-targets obtained using DBSCAN algorithm alone
6.7 Inferred super-targets: nuclear receptor
6.8 Inferred new drug-target relationships: nuclear receptor

Chapter 1 Introduction

The scientific study of drugs in humans includes evaluations of both the safety and efficacy of the available drugs [2]. The conventional drug discovery process requires several phases to be completed before a new drug is introduced to the public market. Due to animal welfare considerations, the animal-based testing process has also become problematic. Once a drug is discovered, it needs to be approved through clinical trials that identify its therapeutic uses as well as its off-target effects (side effects). Therefore, producing a new drug is a long process that requires a large investment of money and time. Nevertheless, it carries a high risk of failure [3].

During the last decade, there has been great interest in implementing and employing useful computational models for drug repositioning, side effect prediction and Drug-Drug Interaction (DDI) prediction. Figure 1.1, which extends the idea of Vilar et al. [4], illustrates the fundamental components of computational drug repositioning and side effect prediction. Drug repositioning and side effect prediction for a single drug aim to identify new drug-disease relationships and drug-side effect relationships, respectively. DDI prediction differs from side effect prediction: it aims to infer harmful drug combinations.

The idea of drug repositioning has been introduced as a cost-effective way to utilize existing drugs by identifying new therapeutic uses for them. Identifying new uses for drug combinations can also be considered another way of approaching drug repositioning. Some complex diseases, such as cancer, cardiovascular disease and psychiatric disorders, need to be treated with drug combinations. For instance, an expected outcome (therapeutic or side effect) of a particular drug can be replaced by different drug combinations. Optimal use of existing drugs with safety and efficacy helps to improve the pharmaceutical industry.

Figure 1.1: Workflow of the main applications in computational pharmacological data analysis

In practice, multiple drugs are often co-administered to treat patients with multiple diseases as well as patients with a complex disease. In these cases, one drug may increase or decrease the effect of the other drug. However, some of these combinations have harmful effects on patients [5]. Understanding harmful DDIs is essential for improving the well-being of patients. DDI prediction applications aim to identify the harmful drug combinations. Large-scale experimental DDI prediction is time-consuming and requires a large investment of money. Though a large number of Food and Drug Administration (FDA) approved drugs are available on the market, the knowledge of DDIs is incomplete [6]. The growing research interest in DDI prediction also reflects the incomplete nature of DDI knowledgebases.

Many computational methods have been proposed for drug repositioning and DDI prediction, but they are not comprehensive in terms of overall performance and they have certain limitations. A common issue is the lack of negatives for training and testing. The existing supervised learning methods in these contexts treat randomly selected instances from the unlabeled data as negative training exemplars, which may blur the distinction between positives and negatives. Pairwise similarity plays a significant role in these applications; therefore, methods that improve pairwise similarity measures are beneficial [4]. Evaluation of these models is also challenging, as they unveil new knowledge about existing drugs. Therefore, this thesis focuses on identifying and developing useful new computational models to overcome the limitations of existing models while discovering useful knowledge in pharmacological data.

This thesis focuses on introducing computational methods for drug repositioning and DDI prediction. Chapter 3 presents new computational methods for DDI prediction, aiming to overcome the limitations of the existing methods. In Chapters 4 and 5, new computational methods for drug repositioning are investigated, and a gold standard drug classification, the ATC classification, is exploited to design the experiments that identify repositioning drug candidates. Chapter 6 presents computational methods for target clustering, which can be extended for drug repositioning via drug-target detection and for side effect prediction via drug-off-target detection. The predictions made by computational models may not be directly applicable in clinical practice. However, the outcomes of the computational models may enable prioritizing highly probable repositioning drug candidates or DDIs for in vivo/in vitro analysis.

1.1 Motivation

1.1.1 Heterogeneous data and pairwise drug similarities

In the literature, computational methods for drug repositioning and side effect prediction are developed based on the assumption that pairs of drugs sharing similar chemical, biological, phenotypical and other characteristics are more likely to produce similar therapeutic and side effects [4,5,7–12]. Hence, developing more effective and efficient pairwise similarity models may have great benefits in improving final predictions. These pairwise similarity models can be developed based on homogeneous drug characteristics, such as chemical structures and known targets, or on an integration of them [4, 7, 8].

As shown in Figure 1.1, computational pharmacological data analysis can be performed by analyzing the drug characteristics. Many drug characteristics are expressed as binary bit vectors which represent the presence or the absence of a particular feature [13–17]. In preliminary investigations on drug repositioning and side effect prediction, computational models have been developed using homogeneous pharmacological components, but these homogeneous components have their own pros and cons [1]. Hence, recent studies have emphasized developing novel, efficient and reliable computational models to improve the final predictions using heterogeneous data integration [1, 18–21].

This thesis focuses on the application of biological, chemical and phenotypical binary fingerprints of drug data to explore the similarity between drugs and identify predictive patterns based on heterogeneous drug characteristics. Chapter 3 introduces a new Similarity Feature Representation that overcomes the limitations of summarized Similarity Feature Representations such as Jaccard, Cosine and Euclidean in the context of DDI prediction. Moreover, Chapter 4 presents an unsupervised learning drug clustering approach to explore pairwise drug similarity. Subsequently, it is extended for heterogeneous data integration, which is then used for drug repositioning.
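
To make the limitation of a summarized similarity concrete, the following minimal sketch computes the Jaccard index on binary fingerprints. The 6-bit fingerprints and the drug names are hypothetical values chosen for illustration; they are not taken from the thesis data. The two pairs receive the same score even though they share entirely different features.

```python
def jaccard(a, b):
    """Jaccard index of two equal-length binary fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def shared_features(a, b):
    """Indices of the features present in both fingerprints."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x and y]


# Hypothetical fingerprints (illustrative only).
drug_a = [1, 1, 0, 0, 1, 0]
drug_b = [1, 1, 0, 0, 0, 1]
drug_c = [0, 0, 1, 1, 1, 0]
drug_d = [0, 0, 1, 1, 0, 1]

sim_ab = jaccard(drug_a, drug_b)
sim_cd = jaccard(drug_c, drug_d)
shared_ab = shared_features(drug_a, drug_b)
shared_cd = shared_features(drug_c, drug_d)

print(sim_ab, shared_ab)
print(sim_cd, shared_cd)
```

A single scalar hides which features drive the similarity, which is the motivation for a finer-grained representation.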

1.1.2 Drug-drug interaction prediction

Drug-drug interactions (DDIs) can occur when two or more drugs are administered together. In such a case, one drug may abolish, diminish or potentiate the effect of the other drug. These DDIs can occur at the binding sites of the target protein molecules. Various factors can affect drug interactions, including binding to plasma proteins, binding to tissues and extravascular sites, the activity of the liver enzyme system and intake of certain food groups [8,22,23]. Some drug combinations may produce unexpected adverse effects in patients. Hence, investigating and understanding new harmful DDIs is crucial to improving health care and patient outcomes.

Performing experimental trials for a large number of drug pairs is not realistic in terms of cost and time. Due to animal welfare considerations, an animal-based testing process is also problematic. During the last decade, machine learning [5,7,9,10,24,25] and statistical models [26,27], including integration of text mining [28], have gained popularity for inferring DDIs.

There are a large number of drugs available on the market, but the knowledge of DDIs is incomplete. DrugBank is an online database which provides biochemical and pharmacological information about drugs, their mechanisms and their targets [29]. The most recent release, DrugBank 5.0 [30], has evolved with a significant number of enhancements over its previous releases [15, 29]. The number of DDIs has grown to 365,984 (in 2017) from 13,795 (in 2008) [15], reflecting the great interest in investigating DDIs during recent years. However, this number represents less than 1% of the total possible drug pairs for a total of 10,562 drugs [31]. Given the large space of possible interactions, it is quite likely that many DDIs have been left undiscovered. Moreover, during the last decade, various computational methods have been developed to infer potential DDIs [5, 7, 9, 10, 24–27].
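
The coverage figure quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the "total possible drug pairs" are unordered pairs of distinct drugs, that is, n(n-1)/2; the variable names are illustrative.

```python
from math import comb

n_drugs = 10_562        # drugs reported in the text
known_ddis = 365_984    # DDIs reported in the text (2017)

# Assumption: an interaction is an unordered pair of two distinct drugs.
possible_pairs = comb(n_drugs, 2)
coverage = known_ddis / possible_pairs

print(f"{possible_pairs:,} possible pairs")
print(f"{coverage:.2%} of pairs have a recorded interaction")
```

Under this assumption there are about 55.8 million possible pairs, of which roughly 0.66% have a recorded interaction, consistent with the statement that known DDIs cover less than 1% of the pair space.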
The major issue in the existing supervised learning methods is that they treat randomly selected instances from the unlabeled data as negative training exemplars. Since there is no approved source of information about drug combinations that do not interact with each other, conventional binary classification approaches to DDIs may not be accurate. Therefore, in Chapter 3, we introduce a positive-unlabeled learning method for DDI prediction using Growing Self Organizing Maps [32] that identifies reliable negatives from the unlabeled data.
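
The idea of picking reliable negatives, rather than random ones, can be sketched as follows. This is not the Growing Self Organizing Map method of Chapter 3: as a simple stand-in, it ranks unlabeled feature vectors by Euclidean distance from the centroid of the known positives and keeps the farthest ones. The function names and the two-dimensional toy vectors are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def reliable_negatives(positives, unlabeled, k):
    """Return the k unlabeled vectors farthest from the positive centroid.

    Stand-in heuristic for illustration; the thesis uses a GSOM-based method.
    """
    c = centroid(positives)
    ranked = sorted(unlabeled, key=lambda u: math.dist(u, c), reverse=True)
    return ranked[:k]


# Toy feature vectors for drug pairs (illustrative values only).
positives = [[0.9, 0.8], [0.8, 0.9], [1.0, 1.0]]
unlabeled = [[0.85, 0.9], [0.1, 0.2], [0.5, 0.5], [0.0, 0.1]]

negatives = reliable_negatives(positives, unlabeled, k=2)
print(negatives)
```

The unlabeled pair that closely resembles the positives is left out of the negative set, whereas random sampling could have selected it and mislabeled a likely interaction as a negative.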

1.1.3 Drug repositioning

As mentioned before, producing new drugs and marketing them with a complete drug profile is a long process that needs a large investment of time and money. Drug repositioning, or drug repurposing, is the process of identifying new indications for existing drugs. It can reduce the time, cost and risk of the traditional drug discovery process [1,18,33,34]. The main goal of drug repositioning is to increase the therapeutic use of the existing drugs in the medical domain. It is believed that drugs having similar profiles are more likely to behave similarly in the presence of similar targets such as genes and proteins [1, 19, 33–36]. This has been the fundamental hypothesis of computational drug repositioning. From the drug perspective, if Drug B is known to be associated with Disease B, then Drug A, which is similar to Drug B, would be an interesting repositioning candidate to treat Disease B.

In contrast to laborious in vivo and in vitro experiments, computational methods have become popular as efficient approaches for drug repositioning [1, 19, 33–35]. Developing computational methods for identifying new uses for existing drugs and finding new associations between other contributing entities, such as proteins, genes, diseases and side effects, are interesting approaches to this problem.

There are two fundamental strategies of drug repositioning: new target recognition and new indication recognition [1, 37]. Figure 1.2 illustrates a generalized view of these two drug repositioning strategies. Figure 1.2a shows the known interactions between drugs and targets and between targets and diseases, where each drug is associated with at least one target protein and vice versa. Similarly, a disease can be associated with at least one target protein and vice versa. Figure 1.2b and Figure 1.2c show new target recognition and new indication recognition, respectively. Identification of unknown drug-target relationships and identification of unknown target-disease relationships are known as new target recognition and new indication recognition, respectively; these are indicated by many-to-many (m-n) relationships.

Figure 1.2: A generalized illustration of two alternative approaches involved in drug repositioning. (The notations 1* and m-n indicate one-or-many and many-to-many relationships, respectively.)

1.1.4 Anatomical Therapeutic Chemical (ATC) classification

As defined by the World Health Organization, the Anatomical Therapeutic Chemical (ATC) classification captures the pharmacodynamic properties of drugs [38]. It employs the active ingredients of drugs as well as their anatomical, therapeutic and chemical properties when constructing the classification system. The ATC classification is a five-level classification system. The first level is based on the anatomical group; it contains 14 groups. The second level is based on pharmacological/therapeutic subgroups. The third and fourth levels denote chemical/pharmacological/therapeutic subgroups, and the fifth level refers to the chemical substance. It should be noted that some drugs have been categorized into multiple classes.

As illustrated in Figure 1.2, new target recognition and new indication recognition are two typical ways of approaching drug repositioning. Even though the ATC classification is popular in the input space for determining the anatomical/therapeutic/chemical features of drugs [4, 8, 39, 40], little research focuses directly on drug repositioning by predicting ATC classes [18, 41]. Recent research [18, 41] limited its scope to drugs that already possess an ATC class. In Chapters 4 and 5, known therapeutic uses of existing drugs are determined according to the ATC classification.

In Chapter 4, we approach drug repositioning by identifying plausible new ATC therapeutic (second level) classes for existing drugs. A drug's second level classification indicates its therapeutic uses. Hence, reclassification of drugs into ATC therapeutic (second level) classes enables inferring useful repositioning candidates.

In Chapter 5, the prize-collecting Steiner tree problem is introduced as a suitable formulation of subnetwork identification for drug repositioning. Drug repositioning is formulated as a prize-collecting Steiner tree problem, and multiple Drug Similarity Networks are created by incorporating the hierarchical structure of the ATC classification. Majority voting is a useful technique to obtain consensus results. Employing the ATC classification to design the multiple Drug Similarity Networks provides a basis for an ensemble learning platform. The identified subnetworks of the Drug Similarity Networks are explored to infer repositioning drug candidates.
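
The five-level hierarchy described above is reflected directly in the structure of an ATC code. The following minimal sketch splits a full code into its level prefixes. The character layout (one letter, two digits, one letter, one letter, two digits) follows the usual WHO convention and is an assumption here, since the text does not spell it out; the function name `atc_levels` is illustrative.

```python
def atc_levels(code):
    """Split a full 7-character ATC code into its five level prefixes.

    Assumes the usual WHO layout: letter, 2 digits, letter, letter, 2 digits.
    """
    code = code.strip().upper()
    if len(code) != 7:
        raise ValueError(f"expected a 7-character ATC code, got {code!r}")
    return {
        1: code[:1],  # anatomical main group
        2: code[:3],  # pharmacological/therapeutic subgroup
        3: code[:4],  # chemical/pharmacological/therapeutic subgroup
        4: code[:5],  # chemical/pharmacological/therapeutic subgroup
        5: code[:7],  # chemical substance
    }


levels = atc_levels("N02BE01")  # ATC code of paracetamol
print(levels)

# Two drugs share a therapeutic (second level) class if these prefixes match.
same_class = atc_levels("N02BE01")[2] == atc_levels("N02BA01")[2]
print(same_class)
```

Comparing second-level prefixes in this way is one simple route to testing whether two drugs fall into the same therapeutic class, which is the level used for repositioning in Chapters 4 and 5.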

1.2 Research focus and thesis outline

As mentioned before, drug repositioning and drug interaction prediction have a large impact on improving the well-being of patients as well as the pharmaceutical industry. This thesis focuses on investigating the following research questions for knowledge discovery in pharmacological data.

1. How can we construct a reliable and representative training set for predicting drug-drug interactions (DDIs)? Can we further improve the performance of DDI prediction by introducing a different pairwise drug similarity measure?
Chapter 3 will address this research question. Experimentally-based DDI prediction methods involve great expense and a large investment of time. Hence, there is great interest in developing efficient and useful computational methods for inferring potential DDIs. Standard binary classifiers require both positives and negatives for training. In the DDI context, drug pairs that are known to interact can serve as positives for predictive methods. However, negatives, that is, drug pairs that have been confirmed to have no interaction, are scarce. In the related work, randomly selected instances from the unlabeled data have been used as neutral DDIs (negatives) [5, 7, 8, 10, 12, 17, 24, 42]. This selection approach may introduce noisy data, resulting in a lack of distinction between positives and negatives. The overall performance of the final prediction relies on the feature representation as well as the training sample. Therefore, to address this lack of negatives, we will propose a Positive-Unlabeled Learning method for inferring potential DDIs.
As mentioned before, methods for DDI prediction are developed based on the assumption that drugs having similar characteristics are more likely to have similar profiles [5,7–12]. In the literature, the Tanimoto Coefficient, a variant of the Jaccard Index, is widely used to quantify the overall context similarity between drugs. However, two drug pairs can possess the same overall similarity index even though their individual features are different (see Figure 1.3). Hence, we will propose a new similarity metric to represent the finer-grained pairwise drug similarities, which could eventually improve the performance of the final predictions.

Figure 1.3: Using Jaccard Index for drug comparison

2. Chapter 4 will address the following research questions.

(a) How can clustering be employed to investigate pairwise drug similarities? How can clustering be adapted for heterogeneous data integration?
Drugs can be described using various characteristics, such as chemical, target and phenomic characteristics. Recent studies have emphasized developing novel, efficient and reliable computational models to improve the final predictions using heterogeneous data integration [1, 18–21]. The primary objective of heterogeneous/multi-view data integration is to understand the drug characteristics more deeply and to obtain a consensus solution [43].
Drugs in the same cluster will demonstrate similar characteristics while being relatively dissimilar to drugs in other clusters. Therefore, clusters can be used to determine drug-drug similarities. We will consider the drug-drug relationships within each of the clusters to construct the drug-drug similarity matrix. The drug-drug similarity matrix can be obtained based on the drugs’ homogeneous characteristics. Analyzing various drug-drug similarity matrices from different characteristics can provide an opportunity to achieve heterogeneous data integration.
(b) How can ATC classification be leveraged to label plausible drug repositioning candidates in an unsupervised learning environment?
As mentioned before, the ATC classification system is a drug classification system published by the World Health Organization. It consists of five levels of classification, in which the second level represents the therapeutic uses of a drug. Hence, reclassification of drugs into ATC therapeutic (second level) classes would suggest new therapeutic uses for an existing drug. The use of unsupervised clustering methods enables grouping of drugs without any prior knowledge of ATC classes. However, the identified drug clusters can be compared with reference to drugs that have been previously classified in the ATC classification. This may result in classification of drugs into new ATC therapeutic (second level) classes. Hence, this approach can be suggested to identify reliable repositioning drug candidates. Further, this approach can be used as a way to predict the indication for new compounds (as long as the clusters are based on pre-clinically available data).
Hence, this approach can be suggested to identify reliable repositioning drug candidates. Further, this approach can be used as a way to predict the indication for new compounds (as long as the clusters are based on pre-clinically available data). 1.2 Research focus and thesis outline 11

3. How can the hierarchical structure of the ATC classification system be used to design drug similarity networks for subnetwork identification?

Chapter 5 and Appendix A will attempt to address this research question.

Pharmacological data can be represented as homogeneous or heterogeneous networks based on the chosen participating entities. Existing network-based methods tend to identify new relationships between entities in the specified network. Typically, these applications identify new relationships on multiple decomposed subnetworks. Subnetwork identification using the prize-collecting Steiner tree (PCST) algorithm identifies a single small-scale subnetwork from the large network, so that drug repositioning can be performed for one indication (disease) at a time.

Subnetwork identification aims to identify a subgraph of a large drug network that connects a set of given drugs called ‘terminals’ while minimizing the net-cost of ‘edge-costs’ and ‘vertex-prizes’. In this context, we will use ATC codes to define the ‘terminals’ at the initialization phase. We will select the terminal drugs according to the ATC classification system so that the terminals are selected with a specific scope rather than at random. Herein, the ATC second level classification will be used to select the terminal drugs because it captures the therapeutic uses; hence meaningful subnetworks can be identified. Also, employing ATC classification to design multiple drug similarity networks provides a basis for obtaining consensus results. In this study, the ATC classification will be employed to construct drug similarity networks to identify repositioning candidates for cardiovascular diseases and diseases related to the nervous system.

4. How can we use incomplete drug-target interaction data effectively to cluster targets? How can we evaluate purely unsupervised target clustering?

Chapter 6 will attempt to address this research question.

As illustrated in Figure 1.1, new target detection, or identifying new drug-target relationships, is an important strategy to achieve drug repositioning. Target similarities are useful in determining new drug-target relationships for existing drugs. However, not all drug-target interactions are captured in the current databases [44]. So, in this study, we investigate how we can minimize the effects of the incompleteness of these databases in machine learning.

In contrast to supervised learning methods, which use training sets with class labels, unsupervised learning methods such as clustering do not rely on class labels. Moreover, target clustering using incomplete data may increase the uncertainty of the final predictions. An integration of dimensionality reduction and clustering by outlier detection for inferring target clustering will be investigated to overcome this issue. Non-linear relationships in drug networks will be considered when performing dimensionality reduction.

Evaluating the unsupervised learning model is also essential. However, extrinsic evaluation of target clustering is challenging without standard reference classes. The incomplete nature of the drug-target database will be considered in investigating a suitable performance evaluation method. Adapting datasets by removing 10% of known drug-target interactions may enable constructing a validation model to assess the inferred target clusters. Then, evaluation metrics such as Adjusted Mutual Information [45], Normalized Mutual Information [46] and Standardized Mutual Information [47] will be employed to assess the performance of the clustering model.
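The validation idea in research question 4 can be illustrated with a small sketch. It is one possible reading rather than the procedure used in Chapter 6: the helper names (`hold_out`, `normalized_mutual_information`) are hypothetical, and the normalization by the arithmetic mean of the two entropies is an assumption, since several NMI variants exist. Adjusted and Standardized Mutual Information add chance corrections that are not shown here.

```python
import math
import random
from collections import Counter


def hold_out(interactions, fraction=0.1, seed=0):
    """Split known (drug, target) pairs into a kept and a withheld subset."""
    pairs = sorted(interactions)
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_removed = int(round(fraction * len(pairs)))
    return pairs[n_removed:], pairs[:n_removed]


def entropy(labels):
    """Shannon entropy (natural log) of a cluster labelling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labellings of the same items."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def normalized_mutual_information(a, b):
    """NMI normalized by the arithmetic mean of the two entropies (assumed variant)."""
    denom = entropy(a) + entropy(b)
    if denom == 0.0:  # both labellings put every item in a single cluster
        return 1.0
    return 2.0 * mutual_information(a, b) / denom


# Hypothetical cluster labels for the same four targets, obtained from the
# full data and from the data with interactions withheld.
labels_full = [0, 0, 1, 1]
labels_reduced = [1, 1, 0, 0]  # same grouping, different cluster names
score = normalized_mutual_information(labels_full, labels_reduced)
```

Because mutual information ignores the names of the clusters, the two labellings above are treated as identical groupings.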

The outline of this thesis is as follows.

Chapter 2: Literature Review

This chapter reviews background information about drug repositioning and drug-drug interaction prediction, focusing on the aforementioned main research questions. The materials presented are divided into three sections. The first section gives an overview of pharmacological data and the significance of heterogeneous data integration. The second section provides an overview of the available computational methods for drug repositioning. The third section provides an overview of available computational methods for drug-drug interaction predictions. Moreover, the benefits and limitations of the existing approaches are discussed and key challenges are identified.

Chapter 3: Inferring pairwise drug interactions based on heterogeneous attributes

This chapter introduces a positive-unlabeled learning approach for drug-drug interaction prediction, in which drug-drug interactions obtained from DrugBank serve as positives and reliable non-interacting drug pairs identified from the unlabeled data serve as negatives in binary classification. Moreover, a new pairwise similarity function that quantifies the overlap of drug features along higher dimensions is proposed to improve the performance of the final predictions. Subsequently, the predicted drug-drug interactions are classified as Cytochrome P450 (CYP)-Dependent and CYP-Independent interactions to enhance the clinical significance of the predictions. Finally, a case study on three predicted CYP-Dependent DDIs is provided to evaluate the clinical relevance of this study.

Chapter 4: A two-tiered clustering approach for drug repositioning through heterogeneous data integration

Chapter 4 presents a purely unsupervised, two-tiered clustering approach for drug repositioning. The presented approach is employed for heterogeneous drug data integration, and ATC classification is employed for large-scale drug repositioning. In this study, a confidence measure is proposed to determine the significance of the inferred repositioning candidates. The significance of the findings arising from this study is twofold: (i) correctly profiling and suggesting therapeutic indications for drugs that do not possess an ATC code; and (ii) flagging the potential of some drugs to be used for other therapeutic purposes. Furthermore, we provide clinical evidence for four predicted results to support that the proposed approach can be reliably used to infer ATC codes and drug repositioning candidates.

Chapter 5: Employing subnetwork identification to repositioning drugs using ATC classification

This chapter demonstrates the significance of the ATC classification in identifying plausible drug repositioning candidates. It explores the benefits of the Prize-Collecting Steiner Tree approach as an effective way of performing subnetwork identification for drug repositioning. Moreover, the hierarchy of the ATC classification is proposed as the primary attribute to design the experiments. Further, four different sparse graph generation methods are investigated for use with the Prize-Collecting Steiner Tree approach. The two-tiered clustering approach proposed in Chapter 4 is also employed as one of the sparse graph generation methods. Here, Prize-Collecting Steiner Tree-based subnetwork identification is employed to infer drug repositioning candidates for diseases related to the nervous system.

Chapter 6: Inferring super-targets using dimensionality reduction and density-based clustering

This chapter demonstrates the significance of integrating dimensionality reduction and outlier detection for reducing the burden of incomplete information in the input space of drug-target interactions. In this study, targets are clustered based on drug-target interactions involving enzymes, ion channels, G protein-coupled receptors (GPCRs) and nuclear receptors in humans. Evaluation of target clustering is also challenging as there is no standard reference grouping of these targets. The validation technique introduced in this study enables assessing the performance of target clustering using extrinsic validation metrics such as Normalized Mutual Information, Adjusted Mutual Information and Standardized Mutual Information. The intrinsic and extrinsic validation metrics demonstrate the usefulness of the inferred target clusters, which can be used as a basis for further studies on drug repositioning by new target recognition and side effect prediction by off-target recognition.

Chapter 7: Conclusions

This chapter draws together the conclusions of each of the contribution chapters. It summarizes what can be generalized from these findings and the future work which remains to be done.

Appendix A: A physarum-inspired prize-collecting steiner tree approach to identify subnetworks for drug repositioning

This appendix presents the published journal article associated with Chapter 5, of which I am an equal first author. In this study, subnetwork identification is employed to infer drug repositioning candidates for cardiovascular disease. Drug Similarity Networks for cardiovascular diseases are designed with the aid of ATC class-C drugs. It introduces the Physarum-inspired subnetwork identification algorithm proposed by Yahui Sun as a suitable Prize-Collecting Steiner Tree approach for subnetwork identification for drug repositioning. Moreover, two sparse graph generation methods are investigated for use with the Prize-Collecting Steiner Tree approach. Furthermore, we found literature-based evidence to support previous discoveries that nitroglycerin, theophylline and acarbose may be repositioned for cardiovascular diseases.

1.3 Contributions of this thesis

In this thesis, the following contributions are made:

Chapter 3

• Proposing and employing a suitable positive-unlabeled learning approach for drug-drug interaction prediction
• Proposing and employing a new pairwise similarity measure to capture pairwise drug similarity, hence improving the drug-drug interaction prediction
• Proposing and employing a method to infer CYP-isoforms for the predicted drug-drug interactions
• Providing a case study on three predicted CYP-Dependent DDIs to evaluate the clinical relevance of this study.

Chapter 4

• Proposing and employing a two-tiered clustering approach for heterogeneous data integration and drug repositioning using ATC Classification
• Employing drug clustering using Growing Self Organizing Map and Graph clustering
• Evaluating drug clusters with reference to ATC classification
• Introducing a confidence measure to prioritize the most useful repositioning candidates associated with each cluster
• Providing clinical evidence for four predicted results to support that the proposed approach can be reliably used to infer ATC code and drug repositioning.

Chapter 5

• Proposing and employing ATC classification hierarchy to design multiple Drug Similarity Networks
• Proposing four sparse graph generation methods
• Applying subnetwork identification to identify repositioning candidates for cardiovascular diseases and diseases related to the nervous system
• Evaluating the subnetwork identification methods

Chapter 6

• Demonstrating the significance of integrating dimensionality reduction and target clustering by outlier detection to overcome the limitations arising from the incomplete drug-target interaction datasets
• Proposing an evaluation technique to assess the target clustering approach when there is no standard reference for target clusters.

1.4 Related publications by the author

Journal Articles

1. Hameed, P. N., Verspoor, K., Kusljic, S., & Halgamuge, S. (2018). A two-tiered unsupervised clustering approach for drug repositioning through heterogeneous data integration. BMC Bioinformatics, 19(1), 129.
2. Hameed, P. N., Verspoor, K., Kusljic, S., & Halgamuge, S. (2017). Positive-Unlabeled Learning for inferring drug interactions based on heterogeneous attributes. BMC Bioinformatics, 18(1), 140.
3. Sun, Y.+, Hameed, P. N.+, Verspoor, K., & Halgamuge, S. (2016). A physarum-inspired prize-collecting steiner tree approach to identify subnetworks for drug repositioning. BMC Systems Biology, 10(5), 128. (+ equal first authors)

Chapter 2 Literature Review

Employing pairwise similarity-based analysis has become a primary step in both drug repositioning and Drug-Drug Interaction (DDI) prediction applications. In drug repositioning, the pairwise similarity of individual drugs is considered, while in DDI prediction, the similarity of a ‘pair’ of drugs is considered. Various properties of pharmacology data are used to determine the pairwise similarity. First, this chapter explains the consequences of various characteristics of pharmacological data used in drug repositioning and DDI prediction. Then, it provides a detailed overview of existing computational methods for drug repositioning and DDI prediction.

2.1 Pharmacological data

There is no direct physical relationship between drugs and therapeutic or adverse effects. These relationships arise from a drug’s biological activities at target proteins or target genes [1]. Investigating and understanding the relationship between chemical structure and biological activity is needed to identify the therapeutic and adverse effects/side effects of drugs. The drug-disease and drug-side effect relationships can be identified through drug-target relationship analysis. In preliminary investigations, computational models have been developed using homogeneous/individual pharmacological characteristics such as disease, symptoms, side effects, chemical structures, proteins and genes. These characteristics have their own pros and cons.

Drug repositioning and drug interaction prediction can be accomplished by comparing pairwise similarities between drugs, targets or diseases, or a combination of them [1, 4, 36]. Figure 2.1 illustrates the fundamental attributes that directly or indirectly characterize drug-disease relationships. It shows key attributes of both drug and disease data that may be useful to determine pairwise similarities. Pairwise similarities can be obtained using direct or indirect comparisons. Comparing chemical properties, which the figure presents as a physicochemical property of the drug, is the direct approach to comparing drugs. Molecular docking is also shown as a physicochemical property, but it needs to be generated through simulations rather than derived from the drug’s salient properties. Associated indications and side effects are described as biological effects of drugs. On the other hand, pathology is illustrated as the direct way of achieving disease-based similarities, while clinical knowledge such as indications and side effects is described as an indirect way to compare disease similarities.

Figure 2.1: The fundamental components that are directly/indirectly involved in drug-disease relationships [1].

Chemical similarities

The 881 chemical substructures of drugs defined by PubChem [13] are widely used to represent drugs’ chemical properties [15, 48, 49]. Campillos et al. [50] also suggested 1024-bit chemical fingerprints to represent the chemical properties of drugs. They demonstrated that combining the chemical and side effect characteristics of drugs is an effective way to achieve drug repositioning. Moreover, they suggested employing pharmacophores and active metabolites to form a higher-resolution chemical similarity that captures the mode of action of drugs.

Similarly, Dudley et al. [1] explored drug-based discovery using the chemical structures of the drugs and showed that using chemical structural similarity alone is insufficient, as drugs undergo various metabolic and pharmacokinetic transformations. Hence, investigating drugs’ mechanism of action was encouraged. Employing connectivity maps [51] to construct molecular activity profiles based on gene expression has been introduced as a comprehensive and systematic approach towards leveraging molecular activity for drug comparison. The primary limitation of molecular activity similarity is that it relies heavily on the quality and assumptions of the experiments used to derive the molecular activity profiles. Many drugs undergo multiple chemical transformations, which might be difficult to represent and compare by a single molecular activity signature [1].
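To make the notion of fingerprint-based chemical similarity concrete, the sketch below compares binary substructure fingerprints with the Tanimoto (Jaccard) coefficient. The choice of Tanimoto is an assumption made for illustration: it is a commonly used measure for binary fingerprints, but the text above does not state which measure each cited study used. The drug names and bit positions are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:  # two empty fingerprints share no evidence of similarity
        return 0.0
    return len(a & b) / union


def similarity_matrix(fingerprints):
    """Pairwise Tanimoto similarities as a nested dict keyed by drug name."""
    names = sorted(fingerprints)
    return {
        x: {y: tanimoto(fingerprints[x], fingerprints[y]) for y in names}
        for x in names
    }


# Hypothetical fingerprints: indices of substructure bits set to 1
# (for PubChem-style fingerprints these would lie in range(881)).
fingerprints = {
    "drug_a": {1, 2, 3},
    "drug_b": {2, 3, 4},
    "drug_c": {700, 880},
}
matrix = similarity_matrix(fingerprints)
```

Representing each fingerprint as a set of 'on' bit positions keeps the example short; a bit vector of fixed length would give the same coefficient.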

Disease similarities

In early research, symptom-based similarities were exploited to analyze disease similarities and thereby identify new uses for existing drugs [35, 52]. Medical bibliographic records and the related Medical Subject Headings (MeSH) [53] metadata from PubMed [54] have been used to elucidate symptom-based characteristics of disease [55]. Zhou and Menche [55] integrated disease-gene association and protein-protein interaction (PPI) data to investigate the correlations between the symptom-based similarity of diseases and their degree of shared genes or PPIs. As a result, they found that the symptom-based similarity of two diseases correlates strongly with the number of shared genetic associations and the extent to which their associated proteins interact. However, it was realized that symptom-based similarity is inadequate to predict new therapeutic uses for existing drugs. Consequently, mRNA expression and protein-protein network data were used to investigate disease similarities [35].

Suthram et al. [35] investigated disease similarities to infer new drug-target relationships. They used gene expression data to explore disease similarities. According to Suthram et al. [35], it is difficult to decide whether two diseases are treated by the same drug due to the similarity in their molecular pathology or due to their underlying disease type. Hu and Agarwal [56] demonstrated that the gene expression profiles of diseases and drugs could be used for drug-target/pathway identification. Moreover, they explained the significance of combining ‘global similarity’ concepts such as correlation-based similarity and ‘local similarity’ enrichment scores to construct a reliable pairwise similarity model. Moreover, Li and Agarwal [57] explored pairwise disease similarities based on disease-gene relationships and associated biological pathways. As a result, they identified new disease relationships which cannot be captured through literature mining or gene overlap analysis.

Since gene expression profiles are generated under certain conditions, using gene expression alone may not be adequate to match the therapeutic effects that are not considered at the gene expression level [56]. The Connectivity Map is applied to measure the genome-wide transcriptional changes in the disease condition and generate a signature or profile of its molecular activity [1].

Side effect similarities

Drug-side effect association is another useful characteristic of human phenotypic information. Kuhn et al. [16] developed SIDER, a widely used side effect resource that connects 888 drugs and 1450 side effect terms.

Campillos et al. [50] emphasized the significance of using side effect similarity for drug repositioning. The concepts of the Coding Symbols for Thesaurus of Adverse Reaction Terms (COSTART) ontology were used as the basis for the extraction of side effects. They weighted side effect terms to avoid the redundancy caused by highly correlated side effects (e.g., drugs causing nausea also induce vomiting). Moreover, concepts such as the inverse correlation between side effect frequency and the likelihood of two drugs sharing a target protein were used when assigning weights to side effects.

Recognizing the incomplete nature of the available side effect information of drugs, Tatonetti et al. [26] implemented an adaptive data-driven approach to predict side effects of existing drugs. As a result, they produced two comprehensive databases containing 47 drug-class interactions and the predicted adverse drug effects. They showed that the side effect profiles of drugs could be used to infer new drug-target and new drug-off-target relationships. Further, their results showed that although drugs have different chemical structures and are used for various indications, two drugs having similar side effect profiles can target similar targets. However, their approach is not applicable to newly introduced drugs.

Atias and Sharan [58] employed canonical correlation to compare side effect similarities and combined it with network diffusion to predict side effects of existing drugs. Moreover, they demonstrated the importance of heterogeneous data integration for obtaining improved performance.

Even though side effect similarities can be used to link the interactions between drugs and targets, there are certain limitations. Some side effects arise due to hormonal changes in the body. Data redundancy is another challenge in the side effect domain. Also, it takes a long time to observe and confirm the complete drug-side effect profile. Hence, side effect similarity cannot be directly applied to newly introduced drugs without an explicit drug profile.
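The idea that rare side effects are more informative than common ones can be sketched with an inverse-frequency weighting. This is a simplified illustration under stated assumptions, not the weighting scheme of Campillos et al. [50], which also corrects for correlated terms: the weight formula, the function names and the drug profiles below are all hypothetical.

```python
import math


def rarity_weights(profiles):
    """Weight each side effect by log(N / n_drugs_with_effect); common effects get low weight."""
    n = len(profiles)
    counts = {}
    for effects in profiles.values():
        for effect in set(effects):
            counts[effect] = counts.get(effect, 0) + 1
    return {effect: math.log(n / c) for effect, c in counts.items()}


def weighted_similarity(effects_a, effects_b, weights):
    """Weighted Jaccard similarity of two side effect sets."""
    a, b = set(effects_a), set(effects_b)
    union_weight = sum(weights[e] for e in a | b)
    if union_weight == 0.0:  # only uninformative (universal) effects present
        return 0.0
    return sum(weights[e] for e in a & b) / union_weight


# Hypothetical side effect profiles.
profiles = {
    "drug_a": {"nausea", "hepatotoxicity"},
    "drug_b": {"nausea", "hepatotoxicity"},
    "drug_c": {"nausea", "headache"},
    "drug_d": {"nausea"},
}
weights = rarity_weights(profiles)
sim_ab = weighted_similarity(profiles["drug_a"], profiles["drug_b"], weights)
sim_ac = weighted_similarity(profiles["drug_a"], profiles["drug_c"], weights)
```

In this toy data every drug causes nausea, so sharing it contributes nothing: `drug_a` and `drug_c` score 0 here, whereas an unweighted Jaccard index would give them 1/3.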

Target Similarities

Yamanishi et al. [19] published a gold standard dataset including four types of drug-target interactions involving enzymes, ion channels, G protein-coupled receptors and nuclear receptors. They used a supervised bipartite graph inference method to predict new drug-target interactions. Their results suggest that drug-target interactions are more highly correlated with pharmacological effect similarity than with chemical similarity. This implies that drug-target relationships do not rely heavily on a drug’s chemical characteristics. However, exploring target properties such as protein-protein interactions and genome characteristics may also be useful for new target or off-target prediction for drugs. Similarly, Yildirim et al. [20] proposed a bipartite graph inference method employing drug-protein interactions.

The target similarities can be captured by comparing protein sequence data [59–63]. Li and Lu [64] used protein-protein interaction [65] data to illustrate target similarities. Moreover, UniProt [66], the largest knowledgebase of protein sequences and associated Gene annotations, is used to capture Gene Ontology-based semantic similarity [60, 63].

Raman et al. [67] demonstrated a structural-level analysis to identify the binding sites of target proteins and map them for tuberculosis. They employed flux balance analysis and network analysis of protein-protein interactions for new target detection.

ATC Classification

As explained in Section 1.1.4, the Anatomical Therapeutic Chemical (ATC) classification system is a five-level classification system. It classifies drugs into 14 main groups based on the drug’s anatomical properties. In the sub-levels, it employs the drug’s therapeutic, chemical and pharmacological properties. ATC classification is widely used as an input feature representation to determine anatomical/therapeutic/chemical features of drugs [4, 7, 8, 39, 40]. In several studies [7, 8, 39, 68], ATC therapeutic class codes are used to represent the therapeutic information of drugs. Shi et al. [40] used ATC classification to represent chemical information.

Few studies [18, 41] focus on employing ATC classes in the prediction phase. Napolitano et al. [18] proposed drug classification using a Support Vector Machine (SVM). They obtained up to 78% accuracy using an SVM classifier employing drug-chemical, drug-gene and drug-protein associations as inputs, with ATC classification used to represent the labels. Similarly, Chen et al. [41] used ATC classification in the prediction phase to represent drug classes. They proposed a hybrid predictive model integrating the predictions from ontology-based, similarity-based and interaction-based approaches. The hybrid method outperformed the individual prediction models. Napolitano et al. [18] and Chen et al. [41] have emphasized further analysis of the predicted false positives in inferring drug repositioning candidates. However, they limited their studies to drugs that already possess an ATC code.
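The five-level structure can be read directly from a full seven-character ATC code. The sketch below splits a code into its level prefixes and compares two drugs at the second (therapeutic) level; the function names are hypothetical and only complete fifth-level codes are handled.

```python
def atc_levels(code):
    """Split a full seven-character ATC code into its five hierarchical prefixes."""
    code = code.strip().upper()
    if len(code) != 7:
        raise ValueError("expected a full fifth-level ATC code such as 'C03CA01'")
    return {
        1: code[:1],  # anatomical main group
        2: code[:3],  # therapeutic subgroup
        3: code[:4],  # pharmacological subgroup
        4: code[:5],  # chemical subgroup
        5: code,      # chemical substance
    }


def same_therapeutic_class(code_a, code_b):
    """True when two drugs share the same second-level (therapeutic) ATC class."""
    return atc_levels(code_a)[2] == atc_levels(code_b)[2]


levels = atc_levels("C03CA01")  # furosemide
# levels[1] is "C" (cardiovascular system) and levels[2] is "C03" (diuretics)
```

A drug that clusters with members of a second-level class other than its own would then be a candidate for reclassification in the sense described above.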

2.1.1 Importance of heterogeneous data integration

Different drug characterizations may lead to different inferred outputs. Integrating those outputs may be useful to infer reliable predictions. Several studies have focused on developing novel, efficient and reliable computational models for drug repositioning [1, 18–21, 36, 58, 69] and DDI prediction [7–10] using heterogeneous data integration. Moreover, data integration can overcome the limitations arising from data incompleteness.

Drug repositioning

Some studies [19, 20] demonstrated the importance of spanning chemical, genomic and side effect features to infer new drug-target interactions using supervised bipartite graph inference. Yamanishi et al. [19] found that side effect similarities were more highly correlated with the new predictions than chemical similarities. They proposed a two-step strategy to combine chemical, genomic and side effect data for supervised bipartite graph learning. Moreover, a distance learning algorithm was combined with supervised bipartite graph learning to improve the efficiency. In contrast to the side effect prediction method proposed by Campillos et al. [50], the method proposed by Yamanishi et al. [19] could be used to infer side effects of any drug. However, they suggested incorporating kernel functions for genomic sequences and chemical structures to improve the overall performance. Also, they suggested integrating medical vocabulary to avoid redundant keywords.

Campillos et al. [50] and Dudley et al. [1] have also investigated the impact of chemical similarities for drug repositioning. They found that using chemical structural similarities alone is insufficient, as drugs undergo metabolic and pharmacokinetic transformations. Therefore, studying the mechanism of action of a drug is encouraged. Using connectivity maps to construct molecular activity profiles based on gene expression has been considered a better approach, as it simplifies drug comparisons. However, a molecular activity similarity-based approach may not be very accurate, as many disease conditions involve more than one molecular activity. Moreover, gene expression profiles may be generated under different conditions, such as different doses, time durations, disease stages and ages. Therefore, considering gene expression data alone may result in poor performance.

PREDICT [70] is a machine learning framework that predicts new indications for both existing and novel drugs. It uses heterogeneous data integration, in which drug-drug similarities are captured based on chemical structure, side effects, sequence data, gene ontology and disease similarities. These features were used to predict large-scale drug repositioning candidates employing logistic regression. As a result, an AUC (area under the receiver operating characteristic curve) score of 0.92 was achieved.
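Since AUC is the performance measure quoted here and in the DDI studies that follow, a minimal sketch of how it can be computed may help. It uses the rank interpretation of AUC: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for four drug-indication pairs.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)  # 3 of the 4 positive-negative pairs are ordered correctly
```

The quadratic loop is adequate for a small illustration; sorting-based implementations are preferable for large datasets.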

Drug-drug interaction prediction

Recent studies have emphasized the relevance of integrating heterogeneous characteristics, including chemical, phenotypic, biological and therapeutic features, for establishing pairwise drug similarity models for DDI prediction as well [7–10].

Vilar et al. [9, 10] employed both 2D and 3D chemical structures, side effects, chemogenomic targets and drug indications for DDI prediction. Vilar et al. [9] observed higher DDI predictive performance when heterogeneous features were integrated using Principal Component Analysis, and the highest individual predictive performance was observed when using Interaction Profile Fingerprints.

Cheng et al. [8] employed a supervised learning approach integrating phenotypic similarity, therapeutic similarity, chemical structural similarity and genomic similarity for DDI prediction. They examined various combinations of these features to identify the most suitable property for DDI prediction. As a result, they found that integrating structural or phenotypic features with other features is most useful for DDI prediction.

INDI [7] is a DDI prediction tool which uses seven heterogeneous pairwise drug similarity scores based on chemical structures, ligands, side effects, Anatomical Therapeutic Chemical classification, sequence similarity, distance on a protein-protein interaction network and Gene Ontology to construct 49 classification features. As a result, they observed AUCs of 0.93 and 0.96 for Cytochrome P450 (CYP)-Related DDIs and non-CYP-Related DDIs, respectively.
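The count of 49 features follows from pairing each of the seven similarity measures with each of the seven (7 × 7). The sketch below shows one plausible way such pairwise features could be built; the scoring rule (the geometric mean of two similarities, maximized over known interacting pairs) is an assumption made for illustration and is not taken from the text above. The function names, drugs and similarity values are hypothetical.

```python
import math
from itertools import product


def lookup(table):
    """Turn a dict keyed by frozenset({x, y}) into a symmetric similarity function."""
    def similarity(x, y):
        return 1.0 if x == y else table.get(frozenset((x, y)), 0.0)
    return similarity


def pair_features(query, known_pairs, measures):
    """One feature per ordered pair of similarity measures (len(measures) ** 2 in total)."""
    d1, d2 = query
    features = []
    for sim_first, sim_second in product(measures, repeat=2):
        best = 0.0
        for k1, k2 in known_pairs:
            if {k1, k2} == {d1, d2}:
                continue  # do not score a pair against itself
            for a, b in ((k1, k2), (k2, k1)):  # try both orientations of the known pair
                best = max(best, math.sqrt(sim_first(d1, a) * sim_second(d2, b)))
        features.append(best)
    return features


# Two hypothetical similarity measures over four hypothetical drugs.
chemical = lookup({frozenset(("a", "c")): 0.9, frozenset(("b", "d")): 0.4})
side_effect = lookup({frozenset(("a", "c")): 0.5, frozenset(("b", "d")): 0.8})
features = pair_features(("a", "b"), [("c", "d")], [chemical, side_effect])
```

With two measures the sketch yields four features; passing seven measures would yield 49.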

2.2 Drug repositioning

Producing new drugs and marketing them with a complete drug profile is a challenging task, as it is a long process and requires a large investment of time and money. Drug repositioning, or drug repurposing, is the process of identifying new therapeutic uses for existing drugs. It can reduce the time, costs and risks of the traditional drug discovery process [1, 18, 33, 34]. The main objective of drug repositioning is to increase the therapeutic uses of existing drugs in the clinical domain. It is believed that drugs having similar characteristics are more likely to demonstrate similar behavior at similar targets/binding sites [1, 19, 33–36, 71]. There is also evidence that computational drug repositioning can be improved by heterogeneous data analysis [1, 19–21, 36].

In contrast to laborious in vivo and in vitro experiments, in silico methods have become popular as effective and efficient approaches for drug repositioning [1, 19, 33–35]. These methods aim to identify new uses for existing drugs by finding new associations between other contributing entities such as proteins, genes, diseases and side effects.

As shown in Figure 1.2, there are two main strategies involved in drug repositioning: new target recognition and new indication recognition. Figure 1.2a shows the known interactions, where each of the drugs is associated with at least one target protein and vice versa; each of the targets is also associated with at least one disease and vice versa. Figures 1.2b and 1.2c show new target recognition and new indication recognition, respectively. In new target recognition, the objective is to identify novel molecular targets for a given drug, while in new indication recognition, the objective is to identify new diseases that may be impacted by one of the existing targets of the drug.

Computational methods such as network-based inference [1, 19, 20, 35, 72] and machine learning [18, 68, 73] are widely used in drug repositioning. In Sections 2.2.1 and 2.2.2 we review network-based and machine learning-based drug repositioning.

2.2.1 Network-based inferences

Pharmacological data can be represented by networks. In those networks, entities such as drugs, diseases/side effects and targets (proteins/genes) can be represented by nodes, whereas weighted/unweighted or directed/undirected edges can be formed to represent direct physical interactions, activation, inhibition, coregulation or any other relationship between the nodes [36].

As mentioned in Section 2.1.1, several studies have identified heterogeneous data integration as a significant approach for obtaining reliable predictions. However, introducing heterogeneous data increases the complexity of data representation and the number of features in the networks. Therefore, network partitioning or clustering methods can be used to simplify the large and complex pharmacology data, and predictions can be efficiently made on the identified subgroups/clusters [20, 21, 72, 74–76].

Identification of unknown relationships between the specified nodes enables drug repositioning. Various graph theory concepts that are applied to path analysis are directly or indirectly applicable to new link prediction in pharmacological networks. In very complex networks such as pharmacological networks, a path can be found between almost any two nodes that are not directly connected. However, identifying the most suitable and useful relationships is challenging.

Finding paths between two nodes that are not directly connected is a popular step in many real-life applications. However, computing all paths and the longest path in a graph is NP-hard. Therefore, Borgwardt et al. [77] suggested graph kernels based on shortest paths for classification problems, which can be computed in polynomial time. One could calculate the number of matching walks in two different graphs using random walk kernels. However, this is computationally expensive, and they describe it as an unreliable approach as it would produce high similarity scores even with small identical substructures.
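Shortest paths of the kind discussed above can be computed with Dijkstra's algorithm, a standard choice when edge weights are non-negative. The sketch below runs it on a small drug similarity network. Converting a similarity into a distance as 1 − similarity is an assumption made for illustration, and the drugs and weights are hypothetical.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm on a dict-of-dicts graph with non-negative edge weights."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue  # stale queue entry
        visited.add(u)
        if u == target:
            break
        for v, w in graph.get(u, {}).items():
            candidate = d + w
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                prev[v] = u
                heapq.heappush(heap, (candidate, v))
    if target not in dist:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path


def undirected(edges):
    """Build a symmetric adjacency dict from (node, node, distance) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


# Hypothetical distances, each taken as 1 - similarity between two drugs.
network = undirected([
    ("drug_a", "drug_b", 0.2),
    ("drug_b", "drug_c", 0.3),
    ("drug_a", "drug_c", 0.9),
    ("drug_c", "drug_d", 0.1),
])
distance, path = shortest_path(network, "drug_a", "drug_d")
```

Here the route through `drug_b` is preferred over the direct but weak link between `drug_a` and `drug_c`, which is the kind of indirect relationship a similarity network is meant to expose.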
The shortest path graph kernel is widely used in bioinformatics and systems biology. Finding the shortest path between two entities in a similarity network identifies drugs with closer similarities, which can be explored further for drug repositioning and DDI prediction. Dijkstra's algorithm is widely used as an efficient shortest path algorithm in the context of drug repositioning [20,78]. The Floyd-Warshall algorithm, which computes shortest distances between all pairs of nodes, is another option [77].

Gao et al. [78] proposed a graph-based model to identify new genes related to glioma, which is the most common and dangerous intracranial tumor. They used Dijkstra's algorithm to predict new gene relationships from the shortest path between two nodes. Betweenness and a permutation test were used to evaluate false positives. However, Gramatica et al. [79] argue that the paths through which drugs are related to pharmacological effects in biochemical interactions cannot be too short. Therefore, they demonstrated the impact of normalizing graphs and ranking paths while demonstrating their results on a sarcoidosis knowledge network. They used rank-based random walk path detection to observe the indirect relationships between peptides and diseases. Such normalization and rank-based techniques can also improve the success of drug repositioning.

A deeper understanding of systems biology and cellular interconnectedness can improve the identification of new disease-gene relationships. Barabasi et al. [80] explained this idea using three strategies. First, they employed network clustering algorithms to predict topological modules, in which nodes have a higher tendency to link to nodes in their local neighborhood than to nodes outside it. Subsequently, they predicted functional modules, in which nodes are grouped based on their role in the same network. Finally, in the disease modules, a set of nodes that share a cellular function, and whose disruption results in a particular disease phenotype, are grouped in the same module. Their inferences are based on two assumptions: (i) direct interactions between disease and proteins are likely to be associated with a similar disease phenotype and (ii) components that belong to the same module are associated with the same diseases. These types of partitioning of the large network enable more efficient drug repositioning.

Cheng et al.
[81] proposed a bipartite drug-target network and assessed the performance of drug-based similarity inference (DBSI), target-based similarity inference (TBSI) and their proposed network-based inference (NBI) approach for drug repositioning. Their NBI method outperformed both DBSI and TBSI, demonstrating that network theory concepts can be much more effective than these similarity-based approaches. The DBSI and TBSI methods performed poorly due to summarized similarity measures and redundancy issues. Similarly, Yildirim et al. [20] constructed a bipartite graph representation for drugs and proteins employing drug-target interactions based on cellular network properties.

This bipartite graph model is reduced to two separate networks: a drug network and a target network. In the drug network, two drugs are connected only when they have at least one target protein in common, while in the target protein network two proteins are connected if a common drug targets both of them. Clustering and further analysis of disease-gene and drug-target interactions were carried out using the shortest distance between proteins in the network.

Hartsperger et al. [74] also explained the importance of arranging biological entities such as diseases, genes and proteins in a meaningful k-partite graph, and the importance of clustering, which enables reclassification and annotation of the biological entities represented by the graph. They believed that graph transformations such as projecting a higher degree network into a lower degree network might discard essential relationships, which may lead to wrong interpretations and wrong inferences.

Yamanishi et al. [19] also inferred new drug-target relationships based on network analysis. In contrast to Yildirim et al. [20], they used genomic space characteristics as the bridging property to capture the correlation between chemical structures and side effects. They demonstrated that side effect data are a useful resource for reliable drug-target prediction. AUC, sensitivity, specificity and PPV (positive predictive value) were used to assess the performance of their model. Even though sensitivity values vary between 0.05 and 0.40, AUC values are relatively high, varying between 0.6 and 0.9.
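The drug-network projection described above (two drugs are linked when they share at least one target) can be sketched as follows. The drug and protein identifiers are hypothetical, and the code is only an illustration of the projection rule, not the implementation used in [20].

```python
from itertools import combinations


def project_drug_network(drug_targets):
    """Project a bipartite drug-target mapping onto a drug-drug network.

    drug_targets: dict mapping drug -> set of target proteins.
    Returns a dict mapping each connected drug pair to its shared targets.
    """
    edges = {}
    for a, b in combinations(sorted(drug_targets), 2):
        shared = drug_targets[a] & drug_targets[b]
        if shared:  # connect only when at least one target is in common
            edges[(a, b)] = shared
    return edges


# Hypothetical drug-target interactions (assumption for illustration).
drug_targets = {
    "drugA": {"P1", "P2"},
    "drugB": {"P2"},
    "drugC": {"P3"},
    "drugD": {"P1", "P3"},
}

edges = project_drug_network(drug_targets)
for pair in sorted(edges):
    print(pair, sorted(edges[pair]))
```

Keeping the set of shared targets on each edge retains some of the information that a plain unweighted projection would discard, which is the kind of loss that Hartsperger et al. [74] warn about.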

2.2.2 Machine learning-based approaches

As explained in Section 2.2.1, pharmacological data can be represented in homogeneous or heterogeneous graphs/networks. The associations between nodes and other graph properties are used as primary features to construct the feature vectors for machine learning approaches. Therefore, the majority of drug repositioning methods can be seen as hybrids of graph/network-based and machine learning approaches [19–21, 68, 72]. Graph clustering is one such hybrid approach, in which graphs of homogeneous and heterogeneous objects are grouped into small clusters based on their associations. Since pharmacology networks are large and complex, partitioning large networks produces an abstraction which simplifies their complex interaction structure.

Napolitano et al. [18] employed supervised learning techniques to classify drugs. They integrated drug-chemical, drug-gene and drug-protein representations and obtained a classification accuracy of 78% with reference to the ATC classification. Incorporating pharmacological concepts is also essential when performing drug repositioning via ATC classification, as pharmacological properties are reflected in the ATC classification system.

Wu et al. [72] also proposed a graph-based clustering approach for drug repositioning. They employed drug-drug, drug-disease and disease-disease relationships to create a drug-disease network. In their representation, nodes are drugs or diseases, while edges are defined by the genes, biological processes, pathways and phenotypes, or a combination of these features, shared by the two nodes (entities). They used the Jaccard coefficient and shortest path values to analyze the characteristics of known indications and to compute the pairwise similarity. Graph clustering algorithms such as the Louvain method and ClusterONE (Clustering with Overlapping Neighbourhood Expansion) were used to detect clusters in the large heterogeneous drug-disease network.
Consequently, they found 1008 shortest paths out of 1041 with a Jaccard coefficient greater than 0.5. They assessed their model using a held-out dataset and achieved recovery rates of 95% and 85% for ClusterONE and the Louvain method, respectively.

Lee et al. [21] proposed that drug groups (DGs) having common DG-DG interaction partners would follow similar drug mechanisms, and they employed the Molecular Complex Detection (MCODE) algorithm for module detection in the DG-DG interaction network. They investigated clustering DG-DG interactions in relation to the ATC classification. They found that DG-DG interactions would be useful in describing the mechanisms and the features of drugs.

Laarhoven et al. [82] demonstrated the relevance of drug-target topology for drug repositioning. They obtained AUPR (area under the precision-recall curve) scores of 88.5, 92.7, 71.3 and 61.0 on the enzyme, ion-channel, GPCR and nuclear receptor datasets, respectively. They attribute the poor performance on the nuclear receptor dataset to the relatively lower number of drugs in that dataset. Higher AUPR scores are obtained for the datasets with a relatively higher number of drugs. When the dataset is small, there is less information available for each entry, which may decrease the quality of the pairwise similarity. Similarly, Takeda et al. [83] demonstrated the significance of employing target proteins for improved prediction performance of off-target effects of drugs.

Bleakley et al. [61] proposed an iterative supervised learning approach for drug-target prediction using bipartite graph models. At each iteration they chose a drug-target interaction vector to be the label vector, through which they infer new drug-target relationships.

Sawada et al. [69] proposed methods for drug repositioning using similarity search on chemical and phenotype domains.
They proposed a method, which they refer to as ‘Template Matching’, that compares drug-target and target-disease interactions to infer new indications.

Alaimo et al. [44] illustrated the importance of combining biological domain knowledge with a bipartite drug-target interaction network for drug repositioning. As a result, they obtained relatively higher performance compared to other network-based inference approaches. They achieved an AUC of 100% with their proposed hybrid approach. They showed that the results rely heavily on the input data; hence they performed an a priori analysis to select the best data to learn the model. The asymptotic complexity of their approach is O(n²(m + m²)).

Yildirim et al. [20] focused on combining heterogeneous data using drug-target and disease-gene interactions, employing bipartite graph projections. Mei et al. [62] also suggested a bipartite local model for predicting drug-target interactions, while Campillos et al. [50] proposed a probability theoretic approach to span chemical and pharmaceutical axes.

Suthram et al. [35] proposed a quantitative framework by which they identified relationships between diseases. They used disease-related mRNA expression and human protein interaction networks to introduce another data-driven approach that uses the knowledge of disease relationships in a systematic way. They used microarrays to denote the disease state using gene expression data and normalized it using a z-score transformation, which allows direct comparison of various microarray samples. Moreover, they generated a matrix containing the Module Response Score (MRS), combining gene expression data and functional modules. The relationship between different diseases was then obtained from the partial Spearman correlation coefficient of the MRS values.
Further, they emphasize the importance of the partial Spearman correlation coefficient of the MRS values: it surpasses the generic Spearman correlation coefficient by providing a way to explicitly factor out possible dependencies between different gene expressions, in addition to providing a quantitative metric to assess disease similarity. In their network, the nodes are diseases, while the thickness of the edge between two diseases represents the strength of their correlation. As a result, they observed heterogeneity in the results and recognized 138 significant associations between diseases. This significance was further strengthened using statistical tests [35].
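The z-score transformation mentioned above for making microarray samples directly comparable can be sketched as follows. The expression values are made up, and the use of the population standard deviation is an assumption; the exact normalization in [35] may differ.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a sample to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population rather than sample deviation
    if sigma == 0:
        raise ValueError("z-scores are undefined for a constant sample")
    return [(v - mu) / sigma for v in values]


# Two hypothetical samples measured on different scales (illustrative values).
sample_a = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
sample_b = [20.0, 40.0, 40.0, 40.0, 50.0, 50.0, 70.0, 90.0]

z_a = z_scores(sample_a)
z_b = z_scores(sample_b)
print(z_a)
print(z_b)
```

Although the raw values differ by a factor of ten, both samples map to the same standardized profile, which is what allows values from different arrays to be compared directly.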

2.2.3 Importance of clustering in pharmacological data network analysis

Clustering is beneficial in large and heterogeneous pharmacological data networks. It is an unsupervised learning approach that groups entities based on their similarity, independently of any labels. Entities in the same cluster are more similar to one another than to entities in other clusters. Pharmacological data networks become large and complex as the number of participating features and their associations increases. Partitioning the large network is an abstraction that simplifies this complexity.

Hartsperger et al. [74] demonstrated the significance of fuzzy clustering in weighted k-partite graphs consisting of biological entities such as diseases, genes and proteins. They also investigated a supervised bipartite graph inference approach spanning chemical and pharmacological properties. They explained that graph transformations such as graph projections would lead to information loss.

Discovering drug classes or grouping drugs has become another significant task in drug discovery. Napolitano et al. [18] proposed drug classification employing a Support Vector Machine (SVM). They obtained up to 78% accuracy using an SVM classifier with drug-chemical, drug-gene and drug-protein associations as inputs and the ATC classification as labels. Lee et al. [21] assumed that drug groups (DGs) having common DG-DG interaction partners would demonstrate similar drug mechanisms and show similar interactions with other participating entities. They employed the Molecular Complex Detection (MCODE) algorithm for module detection in the DG-DG interaction network. They investigated clustering DG-DG interactions incorporating the ATC classification.

Zickenrott et al. [84] also demonstrated the significance of analyzing differential networks to classify target gene relationships for healthy and unhealthy (disease) phenotypes of drugs.
They believe differential network analysis could determine the minor differences in phenotype states.

Noeske et al. [85] examined pharmacological assay findings based on ligand similarities of metabotropic glutamate receptor (mGluR) antagonists. In their study, Self Organizing Maps (SOM) were used to locate similar molecules together and then predict potential activities of known mGluR antagonists and new binding behavior. Although SOM has not been widely used for drug repositioning, it is well known for pattern recognition and clustering in other contexts because of its unsupervised self-learning nature. It can handle high-dimensional data efficiently. However, in SOM, the map initialization and the input order affect the resulting clusters. Running different maps to fine-tune clusters also becomes time-consuming. Therefore, exploring variants of the SOM algorithm will be beneficial.

Volz et al. [86] also explored the impact of clustering in disease networks. They demonstrated that contact data such as the clustering coefficient and degree distribution could be used to parameterize the structure of the disease network, hence obtaining useful clusters.

Evaluating drug repositioning is complicated because it unveils new knowledge about available drugs. Wu et al. [72] evaluated repositioning results using a held-out dataset. However, Nepusz et al. [87] argue that removing edges from a network introduces noise into the network. Therefore, they suggested removing the node/vertex itself. Wu et al. [72] assessed their model performance using recall alone, which is insufficient to derive a conclusion.
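The point that recall alone is insufficient can be illustrated with a small worked example. The edge identifiers and the two predictors below are hypothetical: one predicts every candidate edge, the other is selective.

```python
def precision_recall(predicted, actual):
    """Precision and recall of a predicted edge set against held-out true edges."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Ten hypothetical candidate edges, two of which were held out as true edges.
candidates = {f"e{i}" for i in range(1, 11)}
held_out = {"e1", "e2"}

predict_everything = set(candidates)  # trivially recovers every held-out edge
selective = {"e1", "e3"}              # recovers one of the two held-out edges

print(precision_recall(predict_everything, held_out))
print(precision_recall(selective, held_out))
```

The predictor that returns every candidate achieves perfect recall while most of its predictions are wrong, so a recovery rate needs to be read alongside precision or a similar measure.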

2.3 Drug-drug interaction prediction

DrugBank is a comprehensive online database which provides extensive biochemical and pharmacological information about drugs, their mechanisms and their targets [29]. It uses data on drug-target, drug-enzyme and drug-transporter associations to provide insight into drug-drug interactions [29]. Moreover, it includes drug interaction information from several external sources [88] such as Physician’s Desk Reference [89], e-Therapeutics [90], Medicines Complete [91], Epocrates RX [92] and Drugs.com [93]. Therefore, it is widely used in DDI prediction applications [4].

Drug-drug interactions (DDIs) refer to the modification of the action of one drug caused by the action of another drug. Thus, DDIs occur when two or more drugs are administered together. DDIs may abolish, diminish or potentiate the effect of the drugs involved. Some DDIs can also lead to fatalities if an inappropriate drug combination has been chosen [8]. Various factors can affect drug interactions, including binding to plasma proteins, binding to tissues and extravascular sites, the activity of the liver enzyme system and intake of certain food groups [22, 23]. Hence, investigating and understanding harmful DDIs is crucial to improving health care and patient outcomes.

Performing experimental trials for a large number of drug pairs is not realistic in terms of cost and time. Due to animal welfare considerations, an animal-based testing process is also problematic. During the last decade, machine learning [5,7,9,10,24,25] and statistical models [26,27], including the integration of text mining [28], have gained popularity for inferring DDIs. These computational methods assume that drug pairs sharing similar characteristics (chemical, phenotype, biological, therapeutic, etc.) are more likely to demonstrate the same drug interactions.
The Tanimoto coefficient is a variant of the Jaccard index which is widely used to quantify the overall similarity between drugs. Vilar et al. [9] also proposed a similarity metric to quantify the overall overlap of any two drugs. Huang et al. [94] suggested a metric to compute the tightness/strength of target-centric drugs, acknowledging the impact of protein-protein interaction networks on identifying new DDIs.

Vilar et al. [5] investigated the impact of molecular structural similarities on DDI prediction and obtained sensitivity, specificity and precision of 0.68, 0.96 and 0.26, respectively. They compared pairwise drug similarity using the Tanimoto coefficient, and their predictions depend solely on it. They used various similarity cut-off values to infer useful DDIs. Later, Vilar et al. [9] published a new protocol for inferring DDIs by integrating 2D and 3D molecular structural, target and side effect similarities. In that study [9], the drug profiles were combined using linear algebraic concepts through matrix manipulations. Similarly, Zitnik and Zupan [27] proposed a probabilistic model (COPACAR) using collective matrix factorization to depict the pairwise relations in pharmacological networks. They demonstrated an AUC of 0.924 using 10-fold cross validation for predicting DDIs found in DrugBank. However, their method can focus on only one particular data type at a time.

Ning et al. [95] proposed a stochastic algorithm to learn the directional DDI-based drug-drug similarities in a DDI network. They performed graphlet similarity matching to determine the drug-drug similarity, and it provided promising drug clusters for female reproductive agents, hormones found in birth control formulations.

Liu et al. [12] provide a comprehensive analysis of overall contextual similarities to predict plausible DDIs.
They employed chemical interactions comprising the overall structural activities and reaction-based similarities. Overall pairwise target similarity was captured by analyzing the interactions of target proteins. Pathway analysis was carried out to capture the related pathways of drugs and their functions. Moreover, a minimum redundancy maximum relevance approach and incremental feature selection were used for dimensionality reduction and to remove redundant features. They achieved specificity ranging between 0.89 and 0.97 and sensitivity ranging between 0.13 and 0.71.

Cheng et al. [24] used chemical sub-structural properties to predict five major CYP isoforms by integrating various base classifiers. They constructed combined classifiers for each of the isoforms of interest, obtaining overall predictive accuracy between 72.3% and 83.7%. However, this single-isoform modeling approach does not scale well to large numbers of drugs. Similarly, INDI [7] classifies predicted DDIs as CYP-related and non-CYP-related DDIs for drug pairs generated from 805 drugs. Moreover, several studies [7,8,24] employed binary classification methods for DDI prediction by randomly selecting examples from the unlabeled data to represent the negative class, which may introduce noise, resulting in a lack of distinction between positive and negative classes in the training set. In general, the outcome of the final predictions relies on the characteristics of the training sample. Thus, selecting a representative training sample is also beneficial in improving predictive performance.

The signal detection algorithm proposed by Tatonetti et al. [28] integrates text mining and statistical inference for DDI prediction. They introduced a data-driven approach based on known drug-side effect relationships to predict unknown drug-side effect relationships.
They proposed a statistical correlation of uncharacterized bias (SCRUB) method to infer new drug-side effects and DDI-side effects. They computed the Tanimoto coefficient between the drugs’ side effect bit vectors. Moreover, multivariate linear regression and logistic regression have been used to predict the number of targets that two drugs share. They identified a linear relationship between the indications shared by a pair of drugs and the similarity of their side effect profiles, implying that this relationship could be used in predicting new uses for existing drugs.
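As an illustration of the Tanimoto coefficient on binary feature vectors, such as the side effect bit vectors mentioned above, consider the following sketch. The four drug profiles are invented for the example. It also shows a limitation of overall similarity scores: two drug pairs can receive the same score even though the features they share are different.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two equal-length bit vectors."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def shared_features(a, b):
    """Indices of the features present in both vectors."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x and y]


# Hypothetical side effect profiles over four side effects (illustrative only).
drug_1, drug_2 = [1, 1, 0, 0], [1, 0, 1, 0]
drug_3, drug_4 = [0, 0, 1, 1], [0, 1, 0, 1]

score_12 = tanimoto(drug_1, drug_2)
score_34 = tanimoto(drug_3, drug_4)
print(score_12, shared_features(drug_1, drug_2))
print(score_34, shared_features(drug_3, drug_4))
```

Both pairs score one third, yet the first pair shares only the first side effect and the second pair shares only the last one, so a single overall score hides which features drive the similarity.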

2.4 Summary

• Drug repositioning and drug-drug interaction prediction are two important applications in pharmacology data analysis. The traditional drug development process is long, costly and risky. Moreover, due to animal welfare considerations, an animal-based testing process is also problematic. Therefore, during the last two decades, applying computational methods to drug repositioning and DDI prediction has attracted increasing attention.

• Drug repositioning is the process of identifying new therapeutic uses for existing drugs. Drug repositioning methods can be categorized as drug-centric or disease-centric.

• DDIs refer to the modification of the action of one drug caused by the action of another drug. DDIs occur when two or more drugs are administered together. Hence, investigating and understanding harmful DDIs is crucial to improving health care and patient outcomes.

• Drugs and diseases can be explained using various characteristics such as disease, symptoms, side effects, chemical structures, proteins and genes. The methods used to compare pairwise similarity need to address heterogeneous data integration to understand the biological system more deeply. Analysis at biochemical and physiological levels, quantitative experimental techniques, the directionality and strength of interactions, and novel methods to quantify pairwise similarities are needed to make predictions more reliable.

• Overall context similarity measures such as the Jaccard index and its variants are widely used to quantify pairwise similarity. However, two drug pairs can possess the same overall similarity index even though their individual features are different.

• Pharmacology networks become large and complex due to the large number of available drugs and other participating entities. Meanwhile, introducing heterogeneous data types increases the complexity of data representation and the number of features. Decomposing large networks into smaller subnetworks can simplify the drug repositioning task. Network partitioning or clustering methods can be used to simplify large and complex pharmacology data, and predictions can be made efficiently on the identified subgroups.

• Biological entities such as drugs, genes and proteins may belong to multiple clusters due to their highly overlapping nature. Hence, overlapping or hierarchical clustering is beneficial.
Moreover, scalability, robustness, high dimensional features reduction, speed, intrinsic, adaptability and preserving topological order are some important characteristics that need to be considered in this context.

• It is difficult to specify a hard-cut boundary to distinguish network-based and machine learning-based approaches because many machine learning approaches employ network properties in the feature space.

• The characteristics of the ATC classification are used in the input space to represent anatomical/therapeutic/chemical features of drugs as well as a gold-standard reference in drug classification. However, related work is limited to drugs that already possess an ATC code.

• DrugBank can be considered one of the gold standard databases for understanding and obtaining approved DDIs; there is no gold standard database for non-interacting drug pairs. Such negatives are needed to train binary classifiers.

• The common problem in drug-target prediction and DDI prediction is the lack of standard negatives. The existing supervised learning methods in these contexts treat randomly selected instances from the unlabeled data as negative training exemplars. This may introduce noisy data, resulting in a lack of distinction between positives and negatives. Separating positives and negatives within the unlabeled data is beneficial, but challenging.

• Existing pharmacology knowledgebases are incomplete; thus, effective text mining techniques can be used to uncover knowledge from PubMed articles and clinical notes. Those results would be beneficial for validation.

• Evaluating the predicted drug repositioning candidates and DDIs is challenging, as these applications unveil new knowledge. Finding clinical evidence and literature-based evidence could solve this issue to a certain extent.

Chapter 3 Inferring pairwise drug interactions based on heterogeneous attributes

Adverse drug effects and drug interactions are two fundamental safety concerns having large implications for drug development and drug discovery [26,28]. Any drug can produce therapeutic effects as well as unexpected adverse drug effects [96]. Drug-drug interactions (DDIs) are the most common type of drug interaction. DDIs become the major safety concern when multiple drugs are co-administered to treat a particular disease or to treat patients with multiple diseases. In this chapter, we introduce an effective computational approach for inferring pairwise drug interactions.

The combined effect of drugs may vary with the action of the other drugs. Some drug interactions between or amongst multiple drugs may lead to increased toxicity, decreased efficacy, or both [8, 26, 28]. Investigating and understanding the interactive properties of drugs can help prevent serious adverse drug effects. It is crucial to improving health care and patient outcomes as well as the drug development process.

There are a large number of drugs available on the market, but the knowledge of DDIs is incomplete. This incompleteness has triggered interest in investigating new DDIs. DrugBank 5.0 [30] has evolved with a significant number of enhancements over its previous releases [15, 29]. The number of DDIs has grown to 365,984 (in 2017) from 13,795 (in 2008) [15], reflecting the great interest in investigating DDIs during recent years. However, this represents less than 1% of the total possible drug pairs for a total of 10,562 drugs [31]. Given the large space of possible interactions, it is quite likely that many DDIs have been left undiscovered. On the other hand, there is no approved source of information about drug combinations that do not interact with each other. Therefore, DDI prediction using binary classification is challenging.

DDIs can be associated with drug metabolism via inhibition or induction of the cytochrome P450 (CYP) system [97–100]. The CYP enzymes are amongst the most essential drug-metabolizing enzymes. CYP is an enzyme system found throughout many organs and tissues, including the kidneys, lungs, gut and liver. Moreover, most human drug oxidation is attributed to six enzymes (isoforms): CYP1A2, CYP2C9, CYP2C19, CYP2D6, CYP2E1 and CYP3A4 [97–100].

Similar to other existing work [5, 7, 9, 10, 24, 25], we assume that two drug pairs that exhibit similar pairwise drug similarities (chemical, phenotype, biological, therapeutic, etc.) are more likely to react similarly at the protein binding sites, and hence may produce similar types of DDIs. In this experiment, we focus on DDIs for 548 drugs; hence we classify 149,878 drug pairs. Moreover, to overcome the limitations of the existing computational models, this chapter introduces a positive-unlabeled learning (PUL) approach for DDI prediction in which plausible non-interacting drug pairs are identified from the unlabeled data and serve as negatives in binary classification.

This investigation makes three primary contributions: i) we propose a PUL approach based on Growing Self Organizing Map (GSOM) [32] clustering to infer DDIs; ii) we propose a new pairwise similarity function to quantify the overlap of drug features along several dimensions; and iii) we classify the predicted DDIs as Cytochrome P450 (CYP)-Dependent and CYP-Independent interactions by invoking their locations in the GSOM. Identification of CYP enzymes for the predicted DDIs is also considered significant because DDIs are involved in drug metabolism in the human liver [99].
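The pair counts quoted above follow from the number of unordered pairs among n drugs, n(n − 1)/2. The sketch below checks that arithmetic and then illustrates the general idea of choosing plausible negatives from unlabeled pairs. Note that the selection rule shown here (distance to the nearest known positive) is a simple stand-in chosen for brevity and is not the GSOM-based procedure proposed in this chapter; the feature vectors and pair names are hypothetical.

```python
from math import comb, dist


def n_pairs(n_drugs):
    """Number of unordered drug pairs among n_drugs drugs."""
    return comb(n_drugs, 2)


def reliable_negatives(positives, unlabeled, k):
    """Return the k unlabeled pairs farthest from their nearest known positive.

    positives, unlabeled: dicts mapping a drug pair to a feature vector.
    Stand-in heuristic for illustration only (not GSOM clustering).
    """
    def nearest_positive_distance(vector):
        return min(dist(vector, p) for p in positives.values())

    ranked = sorted(
        unlabeled,
        key=lambda pair: nearest_positive_distance(unlabeled[pair]),
        reverse=True,
    )
    return ranked[:k]


# Pair counts for the drug totals quoted in the text.
print(n_pairs(548))
print(365_984 / n_pairs(10_562))  # fraction of possible pairs with a known DDI

# Hypothetical two-dimensional pairwise similarity features.
positives = {("d1", "d2"): (0.9, 0.8), ("d1", "d3"): (0.8, 0.9)}
unlabeled = {
    ("d2", "d3"): (0.85, 0.85),
    ("d2", "d4"): (0.1, 0.2),
    ("d3", "d4"): (0.5, 0.4),
}
print(reliable_negatives(positives, unlabeled, k=1))
```

Unlabeled pairs that sit close to known positives are the ones most likely to be undiscovered interactions, so a selective rule of this kind avoids labeling them as negatives, which is the risk with purely random sampling.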

3.1 Published manuscript

The results of this study are presented in the following published form:

Hameed, P. N., Verspoor, K., Kusljic, S., & Halgamuge, S. (2017). Positive-Unlabeled Learning for inferring drug interactions based on heterogeneous attributes. BMC Bioinformatics, 18(1), 140.

The complete list of the predictions is available as supplementary information in the online version of the paper at https://doi.org/10.1186/s12859-017-1546-7.

Hameed et al. BMC Bioinformatics (2017) 18:140 DOI 10.1186/s12859-017-1546-7

RESEARCH ARTICLE Open Access

Positive-Unlabeled Learning for inferring drug interactions based on heterogeneous attributes

Pathima Nusrath Hameed1,2,3*, Karin Verspoor4, Snezana Kusljic5,6 and Saman Halgamuge7

Abstract

Background: Investigating and understanding drug-drug interactions (DDIs) is important in improving the effectiveness of clinical care. DDIs can occur when two or more drugs are administered together. Experimentally based DDI detection methods require a large cost and time. Hence, there is a great interest in developing efficient and useful computational methods for inferring potential DDIs. Standard binary classifiers require both positives and negatives for training. In a DDI context, drug pairs that are known to interact can serve as positives for predictive methods. But, the negatives or drug pairs that have been confirmed to have no interaction are scarce. To address this lack of negatives, we introduce a Positive-Unlabeled Learning method for inferring potential DDIs.

Results: The proposed method consists of three steps: i) application of Growing Self Organizing Maps to infer negatives from the unlabeled dataset; ii) using a pairwise similarity function to quantify the overlap between individual features of drugs and iii) using support vector machine classifier for inferring DDIs. We obtained 6036 DDIs from DrugBank database. Using the proposed approach, we inferred 589 drug pairs that are likely to not interact with each other; these drug pairs are used as representative data for the negative class in binary classification for DDI prediction. Moreover, we classify the predicted DDIs as Cytochrome P450 (CYP) enzyme-Dependent and CYP-Independent interactions invoking their locations on the Growing Self Organizing Map, due to the particular importance of these enzymes in clinically significant interaction effects. Further, we provide a case study on three predicted CYP-Dependent DDIs to evaluate the clinical relevance of this study.
Conclusion: Our proposed approach showed an absolute improvement in F1-score of 14 and 38% in comparison to the method that randomly selects unlabeled data points as likely negatives, depending on the choice of similarity function. We inferred 5300 possible CYP-Dependent DDIs and 592 CYP-Independent DDIs with the highest posterior probabilities. Our discoveries can be used to improve clinical care as well as the research outcomes of drug development. Keywords: Drug-drug interaction, Growing self organizing map (GSOM), Pairwise drug similarity, CYP isoforms, PU learning

*Correspondence: [email protected] 1Department of Mechanical Engineering, University of Melbourne, Parkville, 3010 Melbourne, Australia 2Data61, Victoria Research Lab, 3003 West Melbourne, Australia Full list of author information is available at the end of the article

© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.


Background

Drug interactions refer to the modification of the action of one drug caused by the action of another drug. Thus, drug-drug interactions (DDIs) occur when two or more drugs are administered together. DDIs may abolish, diminish or potentiate the effect of the drugs involved. Some DDIs can also lead to fatalities if an inappropriate drug combination has been chosen [1]. Various factors can affect drug interactions, including binding to plasma proteins, binding to tissues and extravascular sites, activity of the liver enzyme system, and intake of certain food groups [2, 3]. Investigating and understanding new DDIs is crucial in improving health care and patient outcomes.

There are a large number of drugs available on the market, but knowledge of DDIs is incomplete, triggering interest in investigating new DDIs. DrugBank is an online database that provides biochemical and pharmacological information about drugs, their mechanisms and their targets [4]. The most recent release (DrugBank 4.0) contains over 335 thousand biochemical and other DDIs, reflecting the great interest in investigating DDIs in recent years. However, this represents less than 1% of the possible drug pairs for the 8206 drugs listed on DrugBank [5], suggesting that many more DDIs remain undiscovered.

Performing experimental trials for a large number of drug pairs is not realistic in terms of cost and time. Due to animal welfare considerations, an animal-based testing process is also problematic. During the last decade, machine learning [6–11] and statistical models [12, 13], including integration of text mining [14], have gained popularity for inferring DDIs. Even though initial approaches focused on utilizing chemical space properties to compare drug characteristics, heterogeneous data integration has been recognized to be significant in developing reliable computational models to infer drug interactions [1, 6, 9, 10]. Therefore, we consider a range of drug characteristics in our study.

Although DrugBank can be considered the gold standard database for understanding and learning DDIs, there is no gold standard database for non-interacting DDIs. To train binary classifiers, such negatives are needed. Therefore, in this research, we propose to address DDI prediction with a Positive-Unlabeled Learning (PUL) approach and to demonstrate a suitable method for identifying potential negatives to learn a standard binary classifier. The objective is to identify potential non-interacting DDIs from unlabeled data to allow them to be treated as negatives. Learning from positives and unlabeled data has seen successful application in the medical diagnosis context, as well as for uncertain data, streaming, text, and Web [15]. In such applications, the available labeled items are only positive examples while the negative class data is unknown or unavailable, so the unlabeled data may include both positives and negatives as a mixture; this characteristic matches well to the situation with DDIs.

Our investigation makes three primary contributions: i) we propose a PUL approach based on Growing Self Organizing Map (GSOM) [16] clustering to infer DDIs; ii) we propose a new pairwise similarity function to quantify the overlap of drug features along several dimensions; and iii) we classify the predicted DDIs as Cytochrome P450 (CYP)-Dependent and CYP-Independent interactions by invoking their locations in GSOM.

Positive-unlabeled learning

In some applications, only positive examples are known and labeled while the unlabeled data may contain both negatives and unlabeled positives. Extracting positives from the unlabeled data is a useful task, and it is more challenging than traditional supervised classification problems where clear negatives exist. Figure 1 illustrates the main idea behind the Positive-Unlabeled Learning (PUL) scenario, where only positive examples are labeled while the unlabeled data contain both negatives and unlabeled positives. The ultimate goal is to identify useful positive examples from the unlabeled data.

Computational methods such as one class learning and semi-supervised approaches have been proposed to address PUL problems [17, 18]. Semi-supervised learning involves a two step learning strategy where the initial step employs random initialization of negatives from the unlabeled data and the subsequent step applies iterative learning to refine the predictions. An integrated Naive Bayes and iterative Support Vector Machine (SVM) approach has been considered as the baseline approach in PUL. The Naive Bayes classifier is used in the initial classification to identify a possible negative set and then the iterative learning uses the SVM. However, the Rocchio algorithm may outperform Naive Bayes [17]. The Rocchio algorithm is a linear classifier based on cosine similarity. However, it is difficult to extend to a multi class environment. Also, it may extract false negatives if the decision boundary is non-linear. Further purification of the Rocchio output using k-means clustering is recommended to identify false positives and false negatives [17].

The One Class Support Vector Machine (OCSVM) [19] classifier has been utilized in order to extract the positive class instances from the unlabeled data. In contrast to binary/multi class SVM, OCSVM defines a hypersphere in a higher dimensional space. One Class Logistic Regression has been proposed as a strategy to overcome the limitations of the traditional OCSVM [20]. Having revealed the poor performance of classifiers trained using uncertain data, the authors estimate probabilities for the final predictions using One Class Logistic Regression. However, not using a negative set limits the analysis and evaluation. On the other hand, Ren et al. [21] suggested the importance


Fig. 1 This diagram illustrates the main idea behind Positive-Unlabeled Learning. a Available data. b Goal

of using binary classification over OCSVM to achieve better predictive performance. They suggest weighted SVM to identify likely negatives from the unlabeled data. They report a Recall of 71%, although Precision and F-score are relatively low.

In the PUL context, clustering is used as an approach to logically separate unlabeled data [17]. It is used as a pre-step for binary classification to identify likely negatives from the unlabeled data. Then, the binary classification can be employed to further refine the identified positives using initially known positives and identified likely negatives. Unlike classification, clustering is applied on the inputs concealing their labels and may identify new patterns based on the input characteristics. Clustering algorithms like K-means and Self Organizing Map (SOM) require the number of possible clusters to be specified in advance. In medical and clinical applications, separating the unlabeled data into two groups is not appropriate as they can be divided into more than two groups based on finer characteristics. Similarly, there are different and various types of DDIs and knowing them is important when performing unsupervised learning using K-means and SOM. Therefore, clustering DDIs is a challenging task. In this paper, we propose GSOM [16] as a suitable clustering approach to cluster drug pairs to enable inferring negatives from the unlabeled dataset. The main advantage of GSOM is automatic detection of the number of subgroups/nodes. Its high dimensionality feature reduction and topology preserving nature are also useful characteristics.

In the related work, randomly selected instances from the unlabeled data have been used as neutral DDIs (negatives) [1, 6–8, 22–24]. We treat this existing method as our ‘Baseline’. This approach may introduce noisy data, resulting in a lack of distinction between positives and negatives. However, the overall performance of the final prediction relies on the feature representation and the training sample as well. We believe that the clinical relevance of the predicted DDIs can be improved by employing a representative training sample with comprehensive heterogeneous properties.

Similarity based DDI prediction

Computational approaches to DDI prediction [1, 6, 8–10, 22, 25] assume that drug pairs sharing similar characteristics (chemical, phenotype, biological, therapeutic, etc.) are more likely to share the same drug interactions. Cheng et al. [7] and Vilar et al. [8] employed chemical properties in predicting new DDIs. However, their methods showed poor sensitivity. Drug interactions like physiological properties cannot be predicted by chemical properties alone, since drugs undergo complicated metabolic transformations and other pharmacokinetic transformations as they are metabolized and physiologically distributed [26, 27].

Recent studies have emphasized the relevance of integration of heterogeneous characteristics, including chemical, phenotype, biological, and therapeutic features, for establishing pairwise drug similarity [1, 6, 9, 10]. Vilar et al. [8] investigated the impact of molecular structural similarities for DDI predictions and obtained performance values of 0.68, 0.96, and 0.26 for sensitivity, specificity, and precision, respectively. Later, Vilar et al. [9] published a new protocol for inferring DDIs integrating 2D and 3D molecular structural, target, and side-effect similarities. In their study, the drug profiles were combined by means of linear algebraic concepts through matrix manipulations. Moreover, they observed higher DDI predictive performance when heterogeneous features were integrated using Principal Component Analysis, and the highest individual predictive performance was observed when using Interaction Profile Fingerprints. Similarly, Zitnik and Zupan [13] proposed a probabilistic model using collective matrix factorization to depict the pairwise relations in pharmacological networks. They demonstrated an Area Under ROC Curve (AUC) of 0.924 using 10-fold cross validation for predicting DDIs which were found in DrugBank. However, their method can focus on only one particular data type at a time.


Cheng et al. [7] used chemical sub-structural properties in predicting five major CYP isoforms integrating various base classifiers. They constructed combined classifiers for each of the isoforms of interest, obtaining overall predictive accuracy between 72.3 and 83.7%. However, this single-isoform modeling approach does not scale well to large numbers of drugs. Realizing the value of heterogeneity for achieving useful DDI prediction, Gottlieb et al. [6] employed seven heterogeneous characteristics based on chemical, ligand, side effect, annotation, sequence, closeness in a protein-protein interaction network, and Gene Ontology, through which they constructed 49 classification features. As a result of deep analysis of CYP-based DDIs, they observed AUC of 0.93 and 0.96 for CYP-Related DDIs and non-CYP-Related DDIs, respectively. They employed binary classification by randomly selecting examples from the unlabeled data to represent the negative class, which may introduce noise, resulting in a lack of distinction between interacting and non-interacting classes in the training set. In general, the outcome of the final predictions relies on the characteristics of the training sample. Thus, selecting a representative training sample is also beneficial in improving predictive performance.

Liu et al. [22] provide a comprehensive analysis of overall contextual similarities to predict plausible DDIs. They employed chemical interactions comprising the overall structural activities and reaction-based similarities. Overall pairwise target similarity is captured by analyzing the interactions of target proteins. Pathway analysis was also carried out to capture the related pathways of drugs and their functions. Moreover, a minimum redundancy maximum relevance approach and incremental feature selection were used for dimensionality reduction and to remove redundant features. They achieved specificity ranging between 0.89 and 0.97, and sensitivity ranging between 0.13 and 0.71. In DDI prediction, positives are more important than negatives due to the physical consequences of the effect. To emphasize positive interaction prediction, we believe that measures like sensitivity, precision, and F-score are more important than specificity.

Tanimoto Coefficient is a variant of Jaccard Index which is widely used in quantifying the overall context similarity between drugs. Vilar et al. [9] proposed a similarity metric to quantify the overall overlap between any two drugs. Huang et al. [28] defined a metric to compute the tightness/strength of target centric drugs, acknowledging the impact of protein-protein interaction networks in identifying new DDIs. Similarly, other existing similarity measures provide overall context similarity to compare any two drugs. However, two drug pairs can share the same overall similarity index though their individual features are different. Hence, a detailed similarity metric is also beneficial to represent the pairwise drug similarities. Our proposed similarity metric can be used to summarize any type of drug data associations. Moreover, our results demonstrate the importance of the proposed pairwise similarity measure in obtaining better performance on DDI predictions.

Methods

Drug data

Drug characterization

We obtained four different independent sources of drug associations following the work of Wang et al. [23]: drug-chemical [29], drug-therapeutic [30], drug-protein [29], and drug-phenotype [31] associations. These associations are represented as binary relationships where ‘1’ represents a known interaction (labeled) and ‘0’ represents an unknown interaction (unlabeled). Each of the drugs has 881, 719, 775, and 1385 dimensions in chemical, therapeutic, protein, and side effect characteristic profiles, respectively. The four independent sources have 548 drugs in common. We calculate pairwise drug similarities using Jaccard Index (JI) and a proposed pairwise similarity function (see “The proposed similarity feature representation” section). The JI-based pairwise similarity measure was applied to produce four pairwise drug similarity features (one for each type of drug association) while the proposed pairwise drug similarity function produced 3760 (881+719+775+1385) pairwise drug similarity features (one for each dimension of each profile). Further, we observed 69,600, 2284, 1902, and 41,008 associations between the 548 drugs and 881 chemical features, 719 diseases, 775 proteins, and 1385 side effects, respectively.

Drug-drug interactions

DrugBank is a comprehensive database containing extensive biochemical and pharmacological information about drugs, their mechanisms and their targets. It uses data on drug-target, drug-enzyme and drug-transporter associations to provide insight on DDIs [4]. Moreover, it includes drug interaction information from several external sources [32] such as Physician’s Desk Reference [33], e-Therapeutics [34], Medicines Complete [35], Epocrates RX [36], and Drugs.com [37].

We obtained 6036 unique DDIs for the 548 drugs of interest via ‘DrugBank - Interax Drug Interaction Lookup’ [38] (Note: these known DDIs serve as positives for machine learning). Also, we noticed 994 Cytochrome P450 (CYP)-Dependent DDIs out of the 6036 DDIs. The 6036 DDIs span only 451 unique drugs out of the selected 548 drugs. There are 149,878 ((548*548-548)/2) possible unique drug pairs for these selected 548 drugs. Hence, the ratio of known (labeled) DDIs to unlabeled drug pairs is approximately 1 to 24 (1:24).
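The pair counts quoted above can be checked with a short calculation. The sketch below only restates the arithmetic in the text (548 drugs, 6036 known DDIs); the variable names are illustrative and not taken from the original study.

```python
from math import comb

n_drugs = 548        # drugs shared by the four association sources
known_ddis = 6036    # labeled positives obtained from DrugBank

# Unordered drug pairs without self-pairs: n * (n - 1) / 2
total_pairs = comb(n_drugs, 2)
unlabeled_pairs = total_pairs - known_ddis

positive_fraction = known_ddis / total_pairs
unlabeled_per_positive = unlabeled_pairs / known_ddis

print(total_pairs)                          # 149878
print(round(positive_fraction * 100, 1))    # 4.0 (percent of pairs that are labeled)
print(round(unlabeled_per_positive, 1))     # 23.8, i.e. roughly 1:24
```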


Proposed methods

Posing the overall DDI prediction task as a PUL task, we propose a two phase learning strategy integrating GSOM and SVM. Moreover, we propose an ensemble learning approach for DDI prediction based on two different pairwise similarity metrics; we build a classifier using each metric separately and then combine their individual predictions to obtain a final prediction. Figure 2 illustrates the main steps of the proposed approach for inferring DDIs. We emphasize the importance of quantifying the overlap of individual properties as well as the overall context similarity for inferring potential DDIs.

First, we employ the GSOM clustering algorithm to cluster drug pairs based on given chemical, disease, protein, and side effect (pairwise) similarities. We believe that non-interacting drug pairs are likely to contain pairwise similarities that are different from the interacting drug profiles. We therefore label GSOM nodes as positives or negatives based on the scattered pattern of known labeled positives on GSOM (see “GSOM-based positive-unlabeled learning approach” section). As a result, likely negatives can be inferred from the unlabeled data for training purposes. Second, SVM is employed for binary classification. Finally, we perform a further analysis on the predicted DDIs considering CYP isoforms (CYP1A2, CYP2C9, CYP2C19, CYP2D6, and CYP3A4). Specifically, we identify CYP isoforms for the predicted CYP-Dependent DDIs by revisiting their node on the GSOM, as explained in “Case study: clinical implications” section.

For the 548 selected drugs (see “Drug characterization” section), the 6036 known DDIs correspond to approximately 4% of possible drug pairs (149,878). To the best of our knowledge, there is no gold standard database representing drug pairs that do not interact with each other. Therefore, we pose this problem as a PUL problem. We suggest and demonstrate our proposed GSOM-based PUL approach as a suitable method for inferring DDIs. We select GSOM over other clustering methods due to its automatic detection of the number of relevant clusters; this is not a fixed parameter that needs to be set.

Fig. 2 This diagram illustrates the proposed methodology and our three main contributions for inferring DDIs, integrating Similarity Feature Representation1 (SFR1) and Similarity Feature Representation2 (SFR2)
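The final combination step of the workflow in Fig. 2 can be illustrated with a minimal sketch. It assumes that each of the two classifiers (one trained on SFR1, one on SFR2) already returns a posterior probability of interaction for every drug pair. The function names, the example probabilities and the 0.5 decision threshold are illustrative assumptions, not values from this study.

```python
def ensemble_probability(p_sfr1, p_sfr2):
    """Average the per-pair posterior probabilities of the two classifiers."""
    if len(p_sfr1) != len(p_sfr2):
        raise ValueError("both classifiers must score the same drug pairs")
    return [(a + b) / 2.0 for a, b in zip(p_sfr1, p_sfr2)]


def predict_interactions(p_sfr1, p_sfr2, threshold=0.5):
    """Flag a pair as a predicted DDI when the averaged probability
    reaches the threshold (the threshold value is an assumption)."""
    return [p >= threshold for p in ensemble_probability(p_sfr1, p_sfr2)]


# Hypothetical posterior probabilities for three unlabeled drug pairs
p_from_sfr1 = [0.75, 0.25, 0.5]
p_from_sfr2 = [0.25, 0.25, 1.0]

print(ensemble_probability(p_from_sfr1, p_from_sfr2))  # [0.5, 0.25, 0.75]
print(predict_interactions(p_from_sfr1, p_from_sfr2))  # [True, False, True]
```

Averaging keeps the two representations on an equal footing; a weighted average would be an equally plausible design if one representation were known to be more reliable.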


GSOM-based positive-unlabeled learning approach

Clustering enables grouping of drug pairs based on their similarity, independently of any labels. It can be used to distinguish drug pairs as interacting and non-interacting DDIs, by identifying pairs that have similar characteristics to (cluster with) a known DDI, and pairs that do not cluster with any known DDI, respectively. We propose GSOM [16] as a suitable approach to cluster drug pairs. GSOM is an extended version of the conventional SOM [39]. A GSOM consists of nodes where each node contains at least one input. This clustering algorithm assigns each of the instances to a node on the GSOM map. It uses the same weight adaptation and neighborhood kernel learning as SOM. Further analysis can also be carried out after a map is produced. GSOM’s topology preserving nature is useful in grouping neighboring nodes where necessary.

GSOM’s automatic detection of the number of clusters is useful in this context as we have no prior knowledge of the number of DDI types. GSOM can cluster drug pairs based on their given characteristics and is capable of handling high dimensional data. Since we are unaware of the exact number of DDI types, GSOM enables subgrouping these drug pairs without any prior knowledge of the number of clusters. The size of the map can be controlled by the growth threshold, which decreases as the Spread Factor increases, as shown in the equation below:

\[ \text{Growth Threshold} = -D \times \ln(\text{Spread Factor}) \tag{1} \]

where D is the dimensionality of the input data.

Accordingly, the number of growing nodes can be controlled by the Spread Factor, which takes values in (0,1]. A higher Spread Factor can produce a larger GSOM with a higher number of nodes. In this study, we employ the GSOM implementation of Chan et al. [40] as it provides visual aids for cluster analysis.

In this study, GSOM is used to infer negatives from the unlabeled data. We propose a node profiling algorithm to profile each of the GSOM nodes as a ‘positive’/‘negative’/‘ambiguous’ node, based on the positive proportion of the inputs clustered within the node. We define Positive Proportion as:

\[ \text{Positive Proportion} = \frac{\text{number of labeled positive instances in the GSOM node}}{\text{total number of instances in the GSOM node}} \tag{2} \]

The nodes with 100% unlabeled instances are considered as ‘negative nodes’. The nodes with 100% positive instances are considered as ‘positive nodes’ and the nodes with both positives and unlabeled instances are considered as ‘ambiguous nodes’. The instances at ‘negative nodes’ are inferred as negatives. Hence, the DDIs collected from DrugBank and the inferred negatives serve as positives and negatives, respectively, for learning the binary classifier, whereas the unlabeled instances at ‘ambiguous nodes’ are considered for DDI prediction. The algorithm in Fig. 3 also illustrates the process of profiling GSOM nodes as positive/negative/ambiguous for PUL.

Pairwise drug feature representation

Overall context similarity using Jaccard Index

In related DDI prediction tasks, pairwise drug similarities have been derived by quantifying the overall similarity of drug features such as chemical, disease, protein, and side effect [1, 6, 8, 9]. Jaccard Index (JI) based similarity representation is widely used in this context. JI performs

Fig. 3 Pseudo-code for profiling GSOM nodes as ‘positive/negative/ambiguous’ node
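As the pseudo-code of Fig. 3 is not reproduced in this text, the following is a minimal sketch of the growth threshold (Eq. 1) and of node profiling by Positive Proportion (Eq. 2), written from the prose description only. It assumes that a GSOM run has already assigned every drug pair to a node; the function names and the toy node identifiers are hypothetical and do not correspond to the GSOM implementation used in the study.

```python
import math
from collections import defaultdict


def growth_threshold(dimensions, spread_factor):
    """Eq. 1: Growth Threshold = -D * ln(Spread Factor), Spread Factor in (0, 1]."""
    if not 0.0 < spread_factor <= 1.0:
        raise ValueError("spread_factor must lie in (0, 1]")
    return -dimensions * math.log(spread_factor)


def profile_nodes(assignments, is_positive):
    """Profile each node as 'positive', 'negative' or 'ambiguous' (Eq. 2).

    assignments: node id of every instance (assumed output of a GSOM run)
    is_positive: True for a known DDI, False for an unlabeled drug pair
    """
    counts = defaultdict(lambda: [0, 0])  # node -> [labeled positives, total]
    for node, positive in zip(assignments, is_positive):
        counts[node][0] += int(positive)
        counts[node][1] += 1

    profile = {}
    for node, (positives, total) in counts.items():
        proportion = positives / total  # Positive Proportion of the node
        if proportion == 0.0:
            profile[node] = "negative"    # only unlabeled instances
        elif proportion == 1.0:
            profile[node] = "positive"    # only known DDIs
        else:
            profile[node] = "ambiguous"   # mixture of both
    return profile


def inferred_negatives(assignments, is_positive):
    """Indices of the instances that sit in 'negative' nodes."""
    profile = profile_nodes(assignments, is_positive)
    return [i for i, node in enumerate(assignments) if profile[node] == "negative"]


# Toy example: five drug pairs mapped to three hypothetical nodes
nodes = ["a", "a", "b", "b", "c"]
labels = [True, False, False, False, True]
print(profile_nodes(nodes, labels))       # {'a': 'ambiguous', 'b': 'negative', 'c': 'positive'}
print(inferred_negatives(nodes, labels))  # [2, 3]
```

In this toy case the two unlabeled pairs in node "b" would become training negatives, while the unlabeled pair in the ambiguous node "a" would be kept for prediction.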


a bitwise comparison and computes an overall similarity value as explained in the following equation:

\[ \mathrm{JI}(\mathrm{Drug}_1, \mathrm{Drug}_2) = \frac{\sum_i f_i(\mathrm{Drug}_1) \cap f_i(\mathrm{Drug}_2)}{\sum_i f_i(\mathrm{Drug}_1) \cup f_i(\mathrm{Drug}_2)} \tag{3} \]

where i is the index of the feature vector, f.

For instance, when drug-chemical associations are represented using 881 chemical sub-structures, the pairwise drug chemical similarity can be mapped into a single similarity value using JI (see Fig. 4). This JI-based pairwise drug similarity feature representation quantifies the overall similarity of any two drugs based on one set of features. Here, we consider chemical, target-protein, disease and side effect features separately, producing four JI similarity values. This results in Similarity Feature Representation1 (SFR1), which concatenates the four individual JI values into a single feature vector. Each training vector is associated with a label indicating the DDI status of the drug pair; the DDIs derived from DrugBank were considered as positive examples.

The proposed similarity feature representation

Reflecting the importance of comparing the two drugs of a pair in terms of their overlap on individual features, we propose a new similarity representation (Similarity Feature Representation2, SFR2), based on the following equation:

\[ \text{Individual Similarity}_i(\mathrm{Drug}_1, \mathrm{Drug}_2) = \mathrm{Average}\big(f_i(\mathrm{Drug}_1), f_i(\mathrm{Drug}_2)\big) \tag{4} \]

where i is the index of the ith feature in the feature vector, f.

This captures the shared properties of the drugs and generates one similarity value per feature; for our data this means 881, 719, 775, and 1385 chemical similarity values, disease similarity values, protein similarity values, and side effect similarity values, respectively. Therefore, the proposed SFR2 includes 3760 (881+719+775+1385) similarity features.

As explained in “Drug characterization” section, the initial drug features in our data are binary associations. Therefore, SFR2 takes only three values: ‘0’, ‘0.5’, and ‘1’. ‘0’ denotes that a particular feature is absent in both of the drugs, ‘0.5’ denotes that a particular feature is present in only one of the drugs, and ‘1’ denotes that a particular feature is present in both of the drugs. As for SFR1, each training vector is associated with a label indicating the DDI status of the corresponding drug pair, obtained from the DrugBank database. Figure 4 illustrates an example of deriving similarity metrics for a binary drug-feature representation. Because SFR1 produces a single summary similarity value, two pairs of drugs can have the same JI similarity value although their underlying characteristics are entirely distinct. In contrast, SFR2 measures similarity at a granular feature level. Therefore, we suggest aggregating the final results of SFR1 and SFR2.

Application to DDI prediction

The Support Vector Machine (SVM) learning method is frequently used in inferring DDIs as well as in the PUL context [1, 17, 21]. Therefore, we employ SVM as the classifier for the supervised learning task. We consider identifying likely neutral DDIs as an important step towards obtaining reliable prediction, as they can serve as negatives for binary classifiers. For the proposed approach, the DDIs obtained from the DrugBank database serve as the positives while the neutral DDIs inferred by the GSOM step (see “GSOM-based positive-unlabeled learning approach” section) serve as negatives. We then train a classifier with a training set of known positives and these inferred negatives.

Ensemble Learning is a popular way of combining multiple base classifiers and aggregating outputs to a single meta classifier. For the learning step, we apply ensemble learning with two classifiers, each employing one of the two similarity functions, SFR1 and SFR2.

We employ probabilistic outputs for Support Vector Machines to compute the posterior probabilities of the classification output following Platt [41]. Platt proposed an approach to calculate the posterior probability by fit-

Fig. 4 Example of deriving similarity metrics for drug association. Jaccard Index is the frequently used approach while Individual Similarity function is the proposed function
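The two similarity functions compared in Fig. 4 (Eq. 3 and Eq. 4) can be sketched for binary feature vectors as follows. The four toy drug profiles are invented for illustration, and returning 0.0 when neither drug has any feature is an assumed convention that the text does not specify.

```python
def jaccard_index(f1, f2):
    """Eq. 3 for binary vectors: shared features divided by features in either drug."""
    shared = sum(a & b for a, b in zip(f1, f2))
    either = sum(a | b for a, b in zip(f1, f2))
    return shared / either if either else 0.0  # empty-union convention is an assumption


def individual_similarity(f1, f2):
    """Eq. 4: per-feature average, giving 0, 0.5 or 1 for binary inputs."""
    return [(a + b) / 2 for a, b in zip(f1, f2)]


# Hypothetical 4-feature binary profiles for two drug pairs
drug_a, drug_b = [1, 1, 0, 0], [1, 0, 1, 0]
drug_c, drug_d = [0, 1, 1, 0], [0, 0, 1, 1]

# Both pairs share one feature out of three present, so JI cannot tell them apart
print(jaccard_index(drug_a, drug_b) == jaccard_index(drug_c, drug_d))  # True

# The per-feature representation keeps the difference visible
print(individual_similarity(drug_a, drug_b))  # [1.0, 0.5, 0.5, 0.0]
print(individual_similarity(drug_c, drug_d))  # [0.0, 0.5, 1.0, 0.5]
```

The example shows the motivation stated in the text: a single JI value (one entry of SFR1) can coincide for pairs whose underlying features differ, whereas the SFR2 vectors remain distinct.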


ting a sigmoid function after building the SVM. It is shown in the equation below:

\[ P(y = 1 \mid f) = \frac{1}{1 + \exp(Af + B)} \tag{5} \]

where the parameters A and B are fitted using maximum likelihood estimation from a training set (f_i, y_i).

Consequently, we estimate posterior probabilities for all SVM classifier outputs and we average the final probabilities to make final predictions.

Balanced training sets are frequently used in machine learning based DDI predictions [1, 6]. Here, we use multiple balanced training sets to strengthen the ensemble learning approach, through which we obtain an average final probability. This may help reduce the variance of the final outputs.

Baseline method: In computational DDI prediction, researchers employ binary classification as a step towards inferring drug interactions. Since there is no certain dataset representing neutral DDIs, a baseline method that randomly selects examples from the unlabeled data has been used to identify negative cases [1, 6–8, 22–24]. In this Baseline approach, known positives and randomly selected unlabeled data are used as inputs for the binary classifier.

The explicit steps of this approach are shown below:

1. Initial Data: Generate unique drug pairs for the drugs of interest
2. Collect known DDIs from reliable sources (e.g., DrugBank)
3. Label known DDIs (2) as positives in the initial data (1)
4. Randomly select an equivalent number of examples from the unlabeled data (not part of the positive set) to the number of positive cases (3)
5. Label drug pairs (4) as negatives in the initial data (1)
6. Employ binary classification using positives (3) and negatives (5) for DDI prediction

Evaluation metrics: It is important to note that there is no certain dataset for the neutral DDIs (negatives). Therefore, evaluating DDI prediction as a binary classification is a challenging task. We use the GSOM-based PUL approach to infer DDIs. Then, F1-score is used to evaluate and compare the performance of the proposed methods. Since our ultimate goal is to predict plausible positives, F1-score is selected to express the overall performance of classifying positive and neutral DDIs [42]. F1-score combines both Precision and Recall metrics as shown in the following equations:

\[ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \tag{6} \]

\[ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \tag{7} \]

\[ \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8} \]

Results

In the proposed PUL approach, we employed known positives and negatives inferred by GSOM from the unlabeled data. We compared the performance of the proposed GSOM-based PUL approach against Baseline and OCSVM [19]. We employed Baseline using known positives and randomly selected negatives from the unlabeled data while OCSVM is employed using known positives only.

In Additional file 1, we demonstrate the improved performance of the proposed GSOM-based PUL approach as compared to Baseline and OCSVM using adapted benchmark data (breast cancer and iris data). The original benchmark datasets are not directly compatible with DDI data as they are fully described with labels where clear negatives exist. Therefore, we modified the labels of these two benchmark datasets to resemble the Positive Unlabeled Learning scenario. We assessed the impact of the proportion of unreliable negatives (unlabeled positives) in the training data on the final prediction, and we defined this proportion as Unlabeled Positive Proportion (UPP). The results on the adapted benchmark datasets (see Additional file 1) suggest that the proposed GSOM clustering is a suitable approach for inferring negatives from unlabeled data. The proposed GSOM-based PUL approach outperforms Baseline and OCSVM. Moreover, the results on the adapted benchmark datasets suggest that the proposed method is most beneficial when the UPP is below 50% (see Additional file 1 for further details).

In this section, we demonstrate the results for inferring DDIs in relation to the proposed pairwise drug similarity and the proposed PUL concept. The results on DDI classification also provide evidence that the proposed methods generalize well to the more complex DDI prediction task.

Inferring plausible DDIs

Pairwise drug similarity functions

SFR1 captures the similarity based on JI (see Eq. 3), which is the quantification of four overall context similarities for chemical, therapeutic, protein, and side effect features. We employ SFR2 to quantify the overlap of individual properties between drugs (see Eq. 4), which leads to 3760 features: concatenating 881, 719, 775, and 1385 features in chemical, therapeutic, protein, and side effect

Hameed et al. BMC Bioinformatics (2017) 18:140 Page 9 of 15

characteristic profiles, respectively. SFR2 can be a sparse vector, as some of the features are unique. We therefore employed Principal Component Analysis (PCA) for data compression. PCA is a popular method for dimensionality reduction; it produces a new feature vector capturing the maximum variance of the original data. We obtained 228 PCs capturing 90% of the variance; these components were used in generating results for SFR2.

GSOM-based PUL approach for DDI prediction
In our GSOM-based PUL approach, we employed GSOM clustering to infer negatives from the unlabeled sample. (In Baseline, the 6036 known DDIs were considered as positives while 6036 randomly selected drug pairs from the unlabeled samples were considered as negatives. OCSVM is employed using the 6036 known DDIs only.)

GSOM clustering using SFR1: In GSOM, the number of growing nodes can be controlled by altering the Spread Factor, which lies in the interval (0,1]. Higher Spread Factors generate larger maps with a relatively higher number of nodes. Also, a larger GSOM produces higher coherence within the nodes. We used the Average Within Cluster Distance (AWCD) to study the effect of varying the Spread Factor within GSOMs, as shown in Eq. (9).

$$\mathrm{AWCD} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} r_{ij}\,(f_j - \bar{w}_i)^2}{n} \qquad (9)$$

where $\bar{w}_i$ is the weight vector of the winning node $i$, $f_j$ is the weight vector of input $j$, $n$ is the number of nodes in the GSOM, $m$ is the number of inputs in each node, and $r_{ij}$ is 1 if input $j$ belongs to node $i$ and 0 otherwise.

In Fig. 5 (a), we illustrate the AWCD for SFR1 when the Spread Factor is between $10^{-1}$ and $10^{-10}$. The lowest AWCD is observed at Spread Factor=0.1 because of the expected higher coherence within GSOM nodes. Moreover, it enables finer analysis on data particularly when there is no prior knowledge. We eliminated the nodes having only one instance to minimize the false negatives in the training sample. Accordingly, the number of nodes varies between 300 and 920 when the Spread Factor is between $10^{-10}$ and $10^{-1}$ (Fig. 5 (b)). In the SFR1 analysis, we selected the GSOM map with Spread Factor=0.1 having 919 nodes. Then, we profiled each of the nodes as 'positive/negative/ambiguous' according to the algorithm described in Fig. 3. For instance, when a node contains only unlabeled instances, it is profiled as a 'negative node'. Accordingly, we inferred 4066 negatives.

GSOM clustering using SFR2: Similarly, we observed GSOM node variation for SFR2, employing 228 PCs. We selected the GSOM map with 922 nodes when Spread Factor=$10^{-15}$. After profiling each of the nodes as positive/negative/ambiguous, 20,099 negatives were inferred. In order to construct a reliable negative set, we kept only the negatives captured in both approaches. As a result, only 589 common negatives (see Additional file 2) were identified. These 589 examples are the consensus negatives identified by both SFR1 and SFR2 and are therefore more likely to serve as reliable negative examples when learning the binary classifier. (Figure 6 shows the GSOM maps for SFR1 and SFR2.)

Binary classification: In related work, the use of balanced training sets has been shown to be beneficial for SVM classifiers [1, 6, 43]. Therefore, we propose an ensemble learning approach that integrates multiple balanced datasets for both SFRs (SFR1 and SFR2). The training sample for DDIs includes the 6036 known DDIs extracted from DrugBank and the aforementioned 589 inferred negatives. In order to perform ensemble learning, we constructed multiple balanced training sets using the training data for SFR1 and SFR2. For both SFR1 and SFR2, we constructed 10 training sets using the

Fig. 5 (a) The average within cluster distance (AWCD) using Similarity Feature Representation 1 and (b) variation in the number of GSOM nodes for Similarity Feature Representation 1
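The AWCD plotted in Fig. 5 (a) could be computed along the following lines. This is a minimal pure-Python sketch under two assumptions that are not stated in the text: the squared term in Eq. (9) is read as a squared Euclidean distance between an input and the weight vector of its winning node, and the membership indicator is supplied as a list `assignment` giving the node index of each input. The function and argument names are hypothetical.

```python
def awcd(inputs, node_weights, assignment):
    """Average Within Cluster Distance, as read from Eq. (9).

    inputs       : list of feature vectors, one per input
    node_weights : list of weight vectors, one per GSOM node
    assignment   : assignment[j] is the index of the winning node of input j
                   (this encodes the 0/1 membership indicator)
    """
    n_nodes = len(node_weights)
    total = 0.0
    for features, node in zip(inputs, assignment):
        weights = node_weights[node]
        # Squared Euclidean distance between the input and its winning node.
        total += sum((f - w) ** 2 for f, w in zip(features, weights))
    # Eq. (9) normalises by the number of nodes, not the number of inputs.
    return total / n_nodes


if __name__ == "__main__":
    toy_inputs = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0]]
    toy_weights = [[1.0, 0.0], [5.0, 5.0]]
    toy_assignment = [0, 0, 1]
    print(awcd(toy_inputs, toy_weights, toy_assignment))
```

Evaluating this quantity for maps trained with different Spread Factors gives the kind of curve summarised in Fig. 5 (a).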


Fig. 6 GSOM maps for DDI data: (a) shows the GSOM map for Similarity Feature Representation 1 (SFR1) when Spread Factor=0.1, which contains 919 nodes; (b) shows the GSOM map for Similarity Feature Representation 2 (SFR2) when Spread Factor=$10^{-15}$, which contains 922 nodes. The nodes shown in blue are the proposed negative nodes having only unlabeled instances, the nodes shown in grey contain both initial positives and unlabeled instances, and the nodes shown in red contain only initial positives
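The three node colours in Fig. 6 correspond to a simple profiling rule, which can be sketched as below. This is a simplified illustration and not the algorithm of Fig. 3 itself (which is not reproduced here and may use a positive-proportion threshold): it assumes a node is 'negative' when it holds only unlabeled instances, 'positive' when it holds only known DDIs, and 'ambiguous' otherwise, and it skips single-instance nodes as described in the text. All names are hypothetical.

```python
from collections import defaultdict


def profile_nodes(assignment, is_positive, min_size=2):
    """Profile GSOM nodes and collect inferred negatives.

    assignment  : assignment[k] is the node index of drug pair k
    is_positive : is_positive[k] is 1 for a known DDI, 0 for unlabeled
    min_size    : nodes with fewer members are ignored (assumed to be 2)
    """
    members = defaultdict(list)
    for idx, node in enumerate(assignment):
        members[node].append(idx)

    profiles = {}
    inferred_negatives = []
    for node, idxs in members.items():
        if len(idxs) < min_size:
            continue  # single-instance nodes are not profiled
        n_pos = sum(is_positive[i] for i in idxs)
        if n_pos == 0:
            profiles[node] = "negative"
            inferred_negatives.extend(idxs)
        elif n_pos == len(idxs):
            profiles[node] = "positive"
        else:
            profiles[node] = "ambiguous"
    return profiles, sorted(inferred_negatives)


if __name__ == "__main__":
    labels = [1, 1, 0, 0, 1, 0, 0]
    _, neg_sfr1 = profile_nodes([0, 0, 1, 1, 2, 2, 3], labels)
    _, neg_sfr2 = profile_nodes([0, 0, 1, 2, 2, 1, 1], labels)
    # Consensus negatives: pairs inferred as negative on both maps.
    print(sorted(set(neg_sfr1) & set(neg_sfr2)))
```

The final set intersection mirrors how the negatives inferred from the SFR1 and SFR2 maps are combined into a single consensus negative set.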

589 inferred negatives and 589 randomly selected known DDIs (from the 6036 known DDIs).

We used an SVM classifier employing a polynomial kernel with a polynomial order of 2. The SVM classifier was trained using 5-fold cross validation and a suitable Regularization Parameter (C) was selected accordingly. We employed the Matlab implementation for learning the SVM classifier and for computing the SVM posterior probability. For SFR1 and SFR2, the Regularization Parameter was selected to be $10^{-2}$ and $10^{-3}$, respectively.

Table 1 shows the cross validation performance of the proposed GSOM-based PUL method, the Baseline method and OCSVM for SFR1 and SFR2 using the complete balanced training dataset (589 positives + 589 negatives). It summarizes the mean performance over the 10 balanced training samples, where the mean Precision, Recall, and F1-score from 5-fold cross validation are shown. There is a significant improvement in DDI prediction for Baseline and the GSOM-based PUL approach when the proposed SFR2 is used, though there is no significant change for OCSVM. For instance, F1-score improved by 31.3 and 7.0 percentage points when using SFR2 for Baseline and the GSOM-based PUL approach, respectively. On the other hand, our proposed GSOM-based PUL approach outperformed Baseline by 38.1 and 13.8 percentage points in F1-score for SFR1 and SFR2, respectively. Moreover, there are notable improvements in Precision and Recall when our proposed GSOM-based PUL approach is used.

Table 1 Performance assessment of the proposed GSOM-based PUL approach, Baseline and OCSVM using Similarity Feature Representation 1 (SFR1) and Similarity Feature Representation 2 (SFR2)

Evaluation        Representation  Metric     Baseline  OCSVM  GSOM-based PUL
Cross validation  SFR1            Precision  0.628     0.584  0.951
                                  Recall     0.448     0.499  0.861
                                  F1-score   0.523     0.537  0.904
                  SFR2            Precision  0.823     0.622  0.974
                                  Recall     0.850     0.436  0.975
                                  F1-score   0.836     0.511  0.974

Ensemble learning: The 10 classification models of SFR1 and SFR2 obtained using our GSOM-based PUL approach are considered for DDI prediction. We employed ensemble learning, aggregating the classification results of each classification model to minimize the variance of the final output. Each prediction is made by invoking a posterior probability. Finally, the 20 predicted posterior probabilities are averaged to derive the final predictions.

As explained in the above section, our training set consists of 589 positives and 589 negatives. In order to assess the ensemble model, 80% of this complete balanced training set is used for training while 20% is held out to test the overall predictive performance of the ensemble model. In Table 2, we assess the performance of the ensemble model using SFR1 and SFR2. It summarizes the mean performance over the 10 balanced training and test samples. There is no significant improvement in the ensemble model compared to the model generated using SFR2. However, the results


of the ensemble model are considered for further analysis, as ensemble learning has been shown to minimize the variance of the final output [44]. We inferred 5892 potential DDIs with the greatest probability (see Additional file 3). In the "Case study: clinical implications" section, we provide a case study on these predicted positive DDIs and their associated plausible CYPs.

Table 2 Performance assessment of the ensemble model for the GSOM-based PUL approach for DDI prediction using Similarity Feature Representation 1 (SFR1) and Similarity Feature Representation 2 (SFR2)

Evaluation              Metric     SFR1   SFR2   Ensemble model
Cross validation (80%)  Precision  0.960  0.968  -
                        Recall     0.841  0.972  -
                        F-measure  0.896  0.970  -
Testing (20%)           Precision  0.749  0.973  0.970
                        Recall     0.648  0.979  0.975
                        F-measure  0.692  0.976  0.973

Discussion
GSOM-based positive-unlabeled learning
Our results suggest the importance of using a representative training sample in achieving better performance, and reveal the value of clustering methods for inferring negatives; clustering clearly outperforms a random strategy for selecting negatives. Since there exists no gold standard database of drug pairs that do not interact with each other, the proposed GSOM-based method can be used to identify representative negatives to train a binary classifier. Moreover, GSOM is a direct approach to group similar drug pairs with no prior knowledge about the group labels. Since the number of possible drug interaction types is unknown, DDIs in GSOM nodes can be considered as sub-groups of related DDIs. Hence, the negatives can be identified in relation to the scattered pattern of known positives.

In related work, randomly selected unlabeled data has been considered as negatives or neutral DDIs. Importantly, we have shown using benchmark data that increasing the number of unreliable negatives in the training sample reduces the predictive performance of the binary classifier.

GSOM is capable of handling high dimensional features, and therefore it is more useful when employing the proposed pairwise drug similarity. The number of growing nodes can be controlled by its spread factor, ranging between 0 and 1. A higher spread factor produces more coherence within nodes. However, highly coherent GSOM nodes may result in nodes with a single member. This approach may produce false negatives if a node consists of only unlabeled positives. In this study, the nodes with only one instance are not considered for node profiling, to minimize this error.

Selecting a suitable spread factor is challenging as we are unaware of the number of DDI types. One approach is to consider the properties of the GSOM which clearly distinguish positives and negatives in choosing the most appropriate map. This approach was used when deciding the spread factor for the adapted Breast Cancer data and Iris data (see Additional file 1). However, a method to determine the spread factor is yet to be found, particularly when there is no information about the clusters. Here, we selected the GSOM maps for SFR1 and SFR2 with a relatively similar number of nodes.

Because of the various types of DDIs, we noticed a scattered pattern of known positives in the GSOM map. Thus the proposed node profiling method was useful in identifying plausible negatives. The assigned positive proportion of the GSOM node is considered when profiling a node as 'positive node/negative node/ambiguous node'. The inferred negatives were extracted from the proposed negative nodes. The results on the adapted Breast Cancer and Iris data suggest the importance of the proposed approach, particularly when the Unlabeled Positive Proportion (UPP) is below 60% (see Additional file 1). Inferring negatives is much more accurate for datasets with lower UPPs. Since the UPP value is unknown for the DDI prediction task, the negative set has been further refined by selecting the common negatives of SFR1 and SFR2. These predicted negatives could be verified from PubMed articles and other related resources using text mining. Text mining approaches tend to identify the co-occurrence of drug pairs in biomedical publications. However, there is no guarantee that those drug pairs would interact with each other in reality. Therefore, the drug pairs identified by text mining approaches would also require further validation to verify the prediction.

The proposed GSOM-based method can also be used for identifying representative training samples in any dataset when there is uncertainty about the labeled data. For instance, in the Breast Cancer data, the proposed GSOM-based method extracted 236 (out of 286) more representative negatives while eliminating redundant or ambiguous instances (see UPP=0% in Additional file 1: Figure 1). This may enable the construction of efficient classifiers as well. Moreover, GSOM's topology preserving nature can be employed in selecting a representative training set, particularly to resolve imbalanced data issues if the GSOM clustering has clear separating boundaries between classes (as in Additional file 1: Figure 2). The nodes that are closer to the separation boundary may reflect close relationships to the other class. Therefore, selecting training instances away from the separation boundary can be a useful aspect to define


a representative training set. It may also improve the distinction between classes and the efficiency of classification. Also, GSOM can be used even when there are multiple classes.

SVM is frequently used as the binary classifier in PUL and DDI prediction applications. Results on the Breast Cancer data and Iris data demonstrate better performance when the UPP is below 50%. Since we used 6036 initial positives, predicting up to an additional 6036 DDIs with higher probability is feasible. In this study, we inferred 5892 DDIs with a posterior probability above 0.995. Importantly, the proposed GSOM-based PUL approach achieved F1-scores of 0.904 (SFR1) and 0.974 (SFR2), while the frequently used Baseline achieved F1-scores of 0.523 (SFR1) and 0.836 (SFR2). This represents an absolute improvement of approximately 38 and 14 percentage points over the Baseline approach using SFR1 and SFR2, respectively. On the other hand, OCSVM does not perform well on SFR1 and SFR2. Since the drug pairs span a large variation of DDI types, employing representative training data for the opposing class enables the classifier to accurately distinguish between positives and negatives. Our results reveal the need for a reliable training set for achieving better performance as well as inferring reliable DDIs. Whereas Liu et al. [22] achieved Specificity ranging between 0.89 and 0.97 and Sensitivity ranging between 0.13 and 0.71, our proposed approach achieved 0.97, 0.98, and 0.97 as Precision, Sensitivity/Recall, and F1-score, respectively. We attribute our performance to the strength of our method. Another approach to consider would be active learning. However, it may require a large amount of memory and time, as we are dealing with a large number of drug pairs in a high dimensional space.

Evaluation of the proposed GSOM-based PUL approach employing INDI data
INDI is a DDI prediction tool which uses seven heterogeneous pairwise drug similarity scores based on chemical structure, ligands, side effects, Anatomical Therapeutic Chemical (ATC) classification, sequence similarity, distance on a protein–protein interaction network, and Gene Ontology. The proposed approach was further evaluated employing the seven heterogeneous pairwise drug similarity scores of the 805 drugs used in INDI [6]. INDI provides the DDI labels as binary scores, categorized as CYP related DDIs and non-CYP related DDIs, for the corresponding drug pairs generated using the 805 drugs. We merged CYP related DDIs and non-CYP related DDIs to construct the DDI label of each drug pair and obtained 25103 positive DDIs. We constructed a heterogeneous feature vector similar to SFR1 with seven dimensions, where the binary class labels indicate the DDI status of the corresponding drug pair.

Employing the proposed GSOM node profiling algorithm (see Fig. 3), we inferred 4352 neutral DDIs to serve as negatives for the binary classifier. Table 3 summarizes the cross-validation performance assessment of DDI prediction using the INDI dataset. We compared the performance of the proposed method against the Baseline method and OCSVM. It is clear that the proposed GSOM-based PUL approach outperforms the other two approaches for DDI prediction using INDI data as well.

Table 3 Cross-validation performance assessment of the proposed GSOM-based PUL approach, Baseline and OCSVM for DDI prediction using INDI dataset

Metric     Baseline  OCSVM  GSOM-based PUL
Precision  0.610     0.583  0.917
Recall     0.529     0.583  0.916
F1-score   0.567     0.583  0.916

Pairwise drug similarity
The proposed pairwise drug similarity function (SFR2) is simple yet more useful for capturing finer-grained similarities than similarity metrics like the Jaccard Index (JI). This is because two drug pairs with very different underlying characteristics can have the same JI. The proposed pairwise similarity function can capture the heterogeneous similarity properties between any two drugs at a more fine-grained level of representation. It can be extended to compare more than two drugs, as well as to other domains where pairwise similarities are required.

Both pairwise similarity representations used in this paper demonstrate better performance when using our GSOM-based PUL approach. Therefore, aggregating classification models obtained from the JI-inspired pairwise similarity function (SFR1) and SFR2 enables reliable predictions. Without restricting the similarity comparison to Jaccard variants, ensemble learning using multiple distance measures like Euclidean, cosine, correlation, etc. may also increase the predictive performance.

The SFR1 and SFR2 pairwise drug similarity representations used in this paper are employed separately. Concatenating SFR1 and SFR2 is another possible extension to this approach. When using SFR2, we employed PCA for dimensionality reduction. As a result, we obtained 228 PCs capturing 90% of the variance. Further reduction of the dimensions is not realistic, as some of the pairwise drug similarity features may be unique and some data are truly high dimensional. Even though the GSOM clustering algorithm is capable of handling high dimensional data, preprocessing using PCA can be beneficial in improving memory and time efficiency.

In this work, we represented the pairwise drug similarity in terms of chemical structure, disease, protein, and side effect characteristics and obtained higher performance with these four types of features.


Nevertheless, integration of other types of similarity features, like ligand-based characteristics and 3D chemical structures, may further strengthen the predictive performance and contribute to making our approach more robust. Not all predicted drug pairs will be co-administered in practice, but deeper analysis of their individual properties may lead to useful drug repositioning opportunities.

Case study: clinical implications
Cytochrome P450 (CYP) is an enzyme system found throughout many organs and tissues including the kidneys, lungs and gut; however, its highest concentration is in the liver. Most drugs used clinically are metabolized by the six main CYP enzymes: CYP1A2, CYP2C9, CYP2C19, CYP2D6, CYP2E1, and CYP3A4 [45–47]. Interestingly, a single drug can act as an inducer, inhibitor or a substrate of these enzymes. Knowledge in this area has expanded rapidly in recent years and a number of emerging clinically relevant interactions have been addressed [3]. Clinically significant effects usually arise when drugs which either inhibit or induce a particular CYP enzyme are administered together. If one drug inhibits the metabolism of another concurrently administered drug, then this may lead to the accumulation of that second drug and possible toxicity [3, 45, 47].

Classify CYP-dependent and CYP-independent DDIs
DDIs can be associated with the cytochrome P450 (CYP) enzyme system as well as with molecular targets for drug action including receptors, ion channels and transporters or carrier molecules [2]. Most drug metabolism occurs via the CYP enzyme system. Currently, there are over seventy (70) CYP gene families, of which three main ones (CYP1, CYP2, CYP3) are involved in drug metabolism in the human liver [46]. Therefore, identification of related CYP enzymes for the predicted DDIs is considered to be significant.

In particular, clustering algorithms can be employed using features only, independently of the class labels. Hence, we extended the DDI classification task to predict plausible CYPs by revisiting the GSOM map/s (Fig. 6). According to the algorithm explained in Fig. 3, an ambiguous node contains known DDIs and unlabeled inputs. For each of the ambiguous nodes (Fig. 3; Step 4.3), we assign CYPs based on the DrugBank description of already allocated positives. Such a node is considered CYP-Independent if its known DDIs do not have any CYP-Dependent interactions. If it has at least one known CYP-Dependent interaction, then it is considered a CYP-Dependent node and the known CYP is also noted. In addition, the negative nodes are considered to be CYP-Independent (Fig. 3; Step 4.2) as they do not contain any known DDIs. Accordingly, we assign CYPs for the predicted DDIs in relation to their allocated nodes on the GSOM maps.

Consequently, we classify the predicted 5892 DDIs as CYP-Dependent and CYP-Independent DDIs by revisiting the GSOM maps. We assigned CYP/s in relation to their nodes on both GSOM maps. Accordingly, we inferred 5300 CYP-Dependent DDIs (see Additional file 4) and 592 CYP-Independent DDIs (see Additional file 5). We further categorized the CYP-Dependent DDIs into CYP1A2, CYP2C9, CYP2C19, CYP2D6, and CYP3A4 isoforms by analyzing the DrugBank description of already known DDIs at 'ambiguous nodes'. It should also be noted that 3311 CYP-Dependent DDIs were predicted to be associated with more than one CYP isoform. Furthermore, out of the 5300 predicted CYP-Dependent DDIs, 1083 DDIs overlap with the recently published CYP related DDIs predicted using quantitative structure-activity relationship models [48]. In the next section, we provide the clinical relevance of three predicted CYP-Dependent DDIs in relation to their predicted CYPs.

Clinical relevance
In this study, we inferred numerous predicted DDIs that appear to have significant clinical relevance [49]. Cardiovascular and mental health, diabetes mellitus, asthma and  control are some of the National Health Priority Areas chosen by the Australian government due to their significant contribution to the burden of illness in the Australian community [50]. In this study, we selected the predicted DDIs with a focus on medications frequently used in treating some of the common diseases: Bosentan-Abacavir, Carvedilol-Metformin and Cimetidine-Erythromycin. (It should be noted that these predictions were made on initially unlabeled data and are not mentioned in DrugBank.)

Bosentan, an endothelin receptor antagonist indicated for pulmonary arterial hypertension, is an inducer of CYP2C9 and a substrate for the same, as well as an inducer and a substrate for CYP3A4. Abacavir, an antiretroviral drug indicated for HIV infection, is mainly metabolised by the liver [49]. When Abacavir is combined with Bosentan, its efficacy can be significantly decreased because Bosentan induces the CYP enzyme system, thus speeding up Abacavir metabolism.

DDIs inferred in this study include an interaction between Carvedilol and Metformin. Carvedilol, a non-selective beta and alpha1 receptor blocker, controls cardiac contractility and blood pressure and is widely used clinically in the management of hypertension and chronic systolic heart failure. Metformin, an oral hypoglycaemic agent used in the management of type II diabetes, targets the liver in order to reduce hepatic glucose production [49]. Carvedilol is a substrate for CYP2D6 and when combined with Metformin, the activity of CYP2D6


may be altered, thus leading to loss of its therapeutic effectiveness.

Another clinically relevant DDI inferred in this study relates to an interaction between Cimetidine and Erythromycin. Cimetidine, a histamine H2 receptor antagonist that reduces gastric acid secretion, is used in the management of peptic ulcer disease, dyspepsia and gastro-oesophageal reflux disease, while Erythromycin is an anti-infective that inhibits bacterial protein synthesis and also acts as an immuno-modulatory and anti-inflammatory agent [49]. Both drugs inhibit CYP3A4, which in turn leads to a decrease in their metabolism and results in an enhancement of the adverse effects and toxicity.

Conclusions
Identifying new drug-drug interactions is crucial in improving clinical care. The number of already identified drug-drug interactions in DrugBank is low compared to the number of possible drug combinations. On the other hand, there is no established database of drugs that do not interact with each other. In this paper, we propose a new Positive-Unlabeled Learning approach based on the Growing Self Organizing Map (GSOM) to identify plausible drug-drug interactions. In particular, the proposed Positive-Unlabeled Learning approach is suitable when the number of classes is unknown or unavailable in advance. GSOM is useful in clustering drug-drug interactions based on their pairwise similarities since there are various types of drug-drug interactions. Moreover, we propose a pairwise similarity function to quantify the overlap of individual attributes of the drug feature vector. Our results reveal the importance of the proposed Positive-Unlabeled Learning approach and the proposed pairwise similarity function in identifying plausible drug-drug interactions. In addition, the proposed GSOM node labeling algorithm is used to associate cytochrome P450 isoforms with the predicted drug-drug interactions, and we describe the clinical relevance of three pairs of the predicted drug-drug interactions. The proposed approach is a promising strategy to identify plausible drug-drug interactions and our discoveries can be used to improve clinical care as well as the research outcomes of drug development. Further, the proposed GSOM-based Positive-Unlabeled Learning approach can be applied in other applications where appropriate.

Additional files
Additional file 1: Adapting Labeled Data for Positive-Unlabeled Learning. This file comprises the significance of the proposed method for adapted labeled data. (PDF 313 kb)
Additional file 2: The 589 inferred negatives for binary classification. This is Supplementary Table 1. (XLSX 18.1 kb)
Additional file 3: The inferred 5892 DDIs and their probabilities. This is Supplementary Table 2. (XLSX 134 kb)
Additional file 4: The inferred 5300 CYP-Dependent DDIs and associated CYPs. This is Supplementary Table 3. (XLSX 177 kb)
Additional file 5: The inferred 592 CYP-Independent DDIs. This is Supplementary Table 4. (XLSX 18.2 kb)

Abbreviations
AWCD: Average within cluster distance; CYP: Cytochrome P450; DDI: Drug-drug interaction; GSOM: Growing self organizing map; JI: Jaccard index; OCSVM: One class support vector machine; PCA: Principal component analysis; PUL: Positive-unlabeled learning; SFR1: Similarity feature representation 1; SFR2: Similarity feature representation 2; SOM: Self organizing map; SVM: Support vector machine; UPP: Unlabeled positive proportion

Acknowledgments
The authors would like to thank Chan et al. [40] for sharing the Growing Self Organizing Map implementation.

Funding
PNH is fully supported by the PhD scholarships of The University of Melbourne and partially supported by NICTA scholarship of National ICT Australia, now Data61 since merging CSIRO's Digital Productivity team. This work is also partially funded by Australian Research Council grant DP150103512.

Availability of data and materials
The initial drug data (third party data) used in this paper is available from: http://astro.temple.edu/~tua87106/druganalysis.html
We also provide a DOI to access the simulated data and drug data used in this work https://github.com/fathimanush786/PU-based-DDI-prediction.

Authors' contributions
PNH, KV, and SH designed the experiment and evaluation methodology. PNH conceived the idea, proposed specific methods, collected drug interactions from DrugBank, implemented the methods, assessed the performance and drafted the manuscript. SK provided expertise on the Cytochrome P450 (CYP) enzymes and provided the clinical relevance of the three predicted CYP-Dependent DDIs. KV and SH contributed to the writing and editing of the manuscript. All authors approved the final draft.

Competing interests
The authors declare that they have no competing interests.

Consent for publication
Not Applicable.

Ethics approval and consent to participate
Not Applicable.

Author details
1Department of Mechanical Engineering, University of Melbourne, Parkville, 3010 Melbourne, Australia. 2Data61, Victoria Research Lab, 3003 West Melbourne, Australia. 3Department of Computer Science, University of Ruhuna, 81000 Matara, Sri Lanka. 4Department of Computing and Information Systems, University of Melbourne, Parkville, 3010 Melbourne, Australia. 5Department of Nursing, University of Melbourne, Parkville, 3010 Melbourne, Australia. 6The Florey Institute of Neuroscience and Mental Health, University of Melbourne, Parkville, 3010 Melbourne, Australia. 7Research School of Engineering, College of Engineering and Computer Science, The Australian National University, 2601 Canberra, ACT, Australia.

Received: 9 October 2016 Accepted: 13 February 2017

References
1. Cheng F, Zhao Z. Machine learning-based prediction of drug-drug interactions by integrating drug phenotypic, therapeutic, chemical, and genomic properties. J Am Med Inform Assoc. 2014;21(e2):278–86.


2. Ai N, Fan X, Ekins S. In silico methods for predicting drug-drug interactions with cytochrome p-450s, transporters and beyond. Adv Drug Deliv Rev. 2015;86:46–60.
3. Snyder BD, Polasek TM, Doogue MP. Drug interactions: principles and practice. Aust Prescr. 2012;35(3):85–8.
4. Law V, Knox C, Djoumbou Y, Jewison T, Guo AC, Liu Y, Maciejewski A, Arndt D, Wilson M, Neveu V, et al. Drugbank 4.0: shedding new light on drug metabolism. Nucleic Acids Res. 2014;42(D1):1091–097.
5. DrugBank. DrugBank Stat. http://www.drugbank.ca/stats. Accessed 31 Mar 2016.
6. Gottlieb A, Stein GY, Oron Y, Ruppin E, Sharan R. Indi: a computational framework for inferring drug interactions and their associated recommendations. Mol Syst Biol. 2012;8(1):592.
7. Cheng F, Yu Y, Shen J, Yang L, Li W, Liu G, Lee PW, Tang Y. Classification of cytochrome p450 inhibitors and noninhibitors using combined classifiers. J Chem Inf Model. 2011;51(5):996–1011.
8. Vilar S, Harpaz R, Uriarte E, Santana L, Rabadan R, Friedman C. Drug-drug interaction through molecular structure similarity analysis. J Am Med Inform Assoc. 2012;19(6):1066–074.
9. Vilar S, Uriarte E, Santana L, Lorberbaum T, Hripcsak G, Friedman C, Tatonetti NP. Similarity-based modeling in large-scale prediction of drug-drug interactions. Nat Protoc. 2014;9(9):2147–163.
10. Vilar S, Lorberbaum T, Hripcsak G, Tatonetti NP. Improving detection of arrhythmia drug-drug interactions in pharmacovigilance data through the implementation of similarity-based modeling. PLoS ONE. 2015;10(6):0129974.
11. Tari L, Anwar S, Liang S, Cai J, Baral C. Discovering drug-drug interactions: a text-mining and reasoning approach based on properties of drug metabolism. Bioinformatics. 2010;26(18):547–53.
12. Tatonetti NP, Patrick PY, Daneshjou R, Altman RB. Data-driven prediction of drug effects and interactions. Sci Transl Med. 2012;4(125):125–3112531.
13. Zitnik M, Zupan B. Collective pairwise classification for multi-way analysis of disease and drug data. In: Pacific Symposium on Biocomputing. Big Island of Hawaii: Pacific Symposium on Biocomputing; 2016. p. 81–92.
14. Tatonetti NP, Fernald GH, Altman RB. A novel signal detection algorithm for identifying hidden drug-drug interactions in adverse event reports. J Am Med Inform Assoc. 2012;19(1):79–85.
15. Zhao Y, Kong X, Yu PS. Positive and unlabeled learning for graph classification. In: Data Mining (ICDM), 2011 IEEE 11th International Conference On. Vancouver: IEEE; 2011. p. 962–71.
16. Alahakoon D, Halgamuge SK, Srinivasan B. Dynamic self-organizing maps with controlled growth for knowledge discovery. Neural Netw IEEE Trans. 2000;11(3):601–14.
17. Li X, Liu B. Learning to classify texts using positive and unlabeled data. In: IJCAI. Acapulco: International Joint Conferences on Artificial Intelligence Organization; 2003. p. 587–92.
18. Zhao XM, Wang Y, Chen L, Aihara K. Gene function prediction using labeled and unlabeled data. BMC Bioinforma. 2008;9(1):57.
19. Khan SS, Madden MG. A survey of recent trends in one class classification.
27. Campillos M, Kuhn M, Gavin AC, Jensen LJ, Bork P. Drug target identification using side-effect similarity. Science. 2008;321(5886):263–6.
28. Huang J, Niu C, Green CD, Yang L, Mei H, Han J. Systematic prediction of pharmacodynamic drug-drug interactions through protein-protein-interaction network. PLoS Comput Biol. 2013;9(3):1002998.
29. Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M. Drugbank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008;36(suppl 1):901–6.
30. Li J, Lu Z. A new method for computational drug repositioning using drug pairwise similarity. In: Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference On; 2012. p. 1–4. doi:10.1109/BIBM.2012.6392722.
31. Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol. 2010;6(1):343.
32. Interax Drug Interaction Lookup. DrugBank. http://www.drugbank.ca/interax/drug_lookup. Accessed 01 Nov 2015.
33. Physicians' Desk Reference. PDR Netw. http://www.pdr.net/. Accessed 10 Dec 2016.
34. E-therapeutics. Canadian Pharmacists Association. http://www.e-therapeutics.ca/. Accessed 10 Dec 2016.
35. Medicines Complete. https://www.medicinescomplete.com/about/index.htm. Accessed 10 Dec 2016.
36. Epocrates Athena Health Service. http://www.epocrates.com/products/features. Accessed 10 Dec 2016.
37. Drugs.com. Wolters Kluwer Health, American Society of Health-System Pharmacists, Cerner Multum and Micromedex from Truven Health. https://www.drugs.com/. Accessed 10 Dec 2016.
38. Drugbank Documentation. DrugBank. https://www.drugbank.ca/documentation. Accessed 10 Dec 2016.
39. Teuvo K. Self-organizing Map, 3rd edn. Berlin Heidelberg: Springer; 2001.
40. Chan C-KK, Hsu AL, Halgamuge SK, Tang SL. Binning sequences using very sparse labels within a metagenome. BMC Bioinforma. 2008;9(1):215.
41. Platt J, et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classifiers. 2000;10(3):61–74.
42. Powers DM. Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation. J Mach Learn Technol. 2011;2:37–63.
43. He H, Garcia E, et al. Learning from imbalanced data. Knowl Data Eng IEEE Trans. 2009;21(9):1263–284.
44. Clemen RT. Combining forecasts: A review and annotated bibliography. Int J Forecast. 1989;5(4):559–83.
45. McKinnon RA, Sorich MJ, Ward MB. Cytochrome p450 part 1: multiplicity and function. J Pharm Pract Res. 2008;38(1):55–7.
46. Rang H, Ritter J, Flower R, Henderson G. Rang and Dale's Pharmacology, Seventh edn. Edinburgh: Elsevier Churchill Livingstone; 2012.
47. Mathew T, Chow R, Desmond P, Isaacs D, Lander C, McNeil J, Shenfield G, Wainwright D. Drug interactions and adverse drug reactions.
Aus In: Irish Conference on Artificial Intelligence and Cognitive Science. Berlin Adverse Drug React Bull. 2000;19(3):10–11. Heidelberg: Springer; 2009. p. 188–97. 48. Zakharov AV, Varlamova EV, Lagunin AA, Dmitriev AV, Muratov EN, 20. Sokolov A, Paull EO, Stuart JM. One-class detection of cell states in tumor Fourches D, Kuz’min VE, Poroikov VV, Tropsha A, Nicklaus MC. Qsar subtypes. In: Pacific Symposium on Biocomputing. Big Island of Hawaii: modeling and prediction of drug-drug interactions. Mol Pharm. Pacific Symposium on Biocomputing; 2016. p. 405–16. 2016;13(2):545–56. 21. Ren J, Liu Q, Ellis J, Li J. Positive-unlabeled learning for the prediction of 49. Rossi S, Calabretto JP, Patterson C. Australian Medicines Handbook. conformational b-cell epitopes. BMC Bioinforma. 2015;16(Suppl 18):12. Adelaide South Australia: AMH Pty Ltd; 2015. 22. Liu L, Chen L, Zhang YH, Wei L, Cheng S, Kong X, Zheng M, Huang T, 50. Australia Institute of Health and Welfare. National Health Priority Areas. Cai YD. Analysis and prediction of drug-drug interaction by minimum http://www.aihw.gov.au/national-health-priority-areas/. Accessed 10 Jan redundancy maximum relevance and incremental feature selection. J 2016. Biomol Struct Dyn. 2017;35:312–29. 23. Wang F, Zhang P, Cao N, Hu J, Sorrentino R. Exploring the associations between drug side-effects and therapeutic indications. J Biomed Inform. 2014;51:15–23. 24. Fokoue A, Sadoghi M, Hassanzadeh O, Zhang P. Predicting drug-drug interactions through large-scale similarity-based link prediction. In: International Semantic Web Conference. Kobe: Springer; 2016. p. 774–89. 25. Zhang P, Wang F, Hu J, Sorrentino R. Label propagation prediction of drug-drug interactions based on clinical side effects. Sci Reports. 2015;5: 12339–48. 26. Dudley JT, Deshpande T, Butte AJ. Exploiting drug-disease relationships for computational drug repositioning. Brief Bioinform. 2011;013:303–11. 3.1 Published manuscript 59

The results of the supplementary studies on two adapted benchmark datasets (Iris and Breast cancer) from the UCI repository [101] are also presented in their published form. They demonstrate the validity of the introduced PUL approach and explore the impact of different distributions of positive and unlabeled examples in a setting where the true labels exist and can be assumed to be complete.


Additional File 1

1 Adapting Labeled Data for Positive-Unlabeled Learning
To the best of our knowledge, there exists no benchmark dataset to test Positive-Unlabeled Learning (PUL) methods. We therefore modified two existing benchmark datasets to resemble the positive and unlabeled data scenario in order to test our proposed approach.

1.1 Breast Cancer Data
The Breast Cancer Wisconsin (Diagnostic) dataset is one of the popular benchmark datasets used in binary classification. It is available at the UCI machine learning repository [1]. It contains 569 instances, each described by 30 real-valued features derived from 10 main measurements of the cell nucleus. The 212 malignant records are considered positives and the 357 benign records are considered negatives. The original data has to be normalized to avoid bias towards a particular feature, so we normalize the features using Z-score standardization [2]. Then, 20% of the original data (113 instances) are held out for testing the method while the remaining 80% (456 instances) are used for adapting and training. In Section 1.3, we explain how we adapt labeled data for positive and unlabeled learning.
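The preprocessing described above (standardize every feature, then hold out 20% of the instances) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of the sample standard deviation, the random seed and the rounding-down of the test-set size are all assumptions chosen so that 569 instances yield the 113/456 split quoted in the text.

```python
import random
import statistics


def zscore_columns(rows):
    """Standardize each feature column to zero mean and unit sample standard deviation.

    Assumes no column is constant (a zero standard deviation would divide by zero).
    """
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.stdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


def holdout_split(n_instances, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) for a random hold-out split."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    n_test = int(n_instances * test_fraction)  # rounds down, e.g. 569 -> 113
    return indices[n_test:], indices[:n_test]


# Toy feature matrix standing in for the real 569 x 30 data.
toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
scaled = zscore_columns(toy)

train_idx, test_idx = holdout_split(569)
print(len(train_idx), len(test_idx))  # 456 113
```

The same split function applied to the 150 Iris instances gives 120 training and 30 test instances.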

1.2 Iris Data
The Iris dataset available at the UCI machine learning repository [1] is another benchmark dataset used in classification. It includes 150 instances, 50 in each of 3 classes: Setosa, Versicolor and Virginica. Each instance is represented by 4 attributes describing sepal and petal measurements. In this study, we focus on classifying Setosa (50) vs Non-Setosa (Versicolor and Virginica: 100), where Setosa instances are considered positives and the rest of the instances are considered negatives. As above, the original data has to be normalized to avoid bias towards a particular feature. We normalize the features using the standard deviation. Then, 20% of the original data (30 instances) are held out for testing the overall approach while the remaining 80% (120 instances) are used for adapting and training. In the next section (Section 1.3), we explain how we adapt labeled data for positive and unlabeled learning.

1.3 Construction of Adapted Positive and Unlabeled Datasets
The aforementioned training/adapting data is used to create PU datasets. In our experiments, 20% of the original data are held out to test the overall predictive performance while the remaining 80% are used for developing and training the model. In order to mimic the PU context for these datasets, the labels of randomly selected known positives and of all negatives are masked in the training data, so that these instances become unlabeled. The Unlabeled Positive Proportion (UPP) is varied from 0% to 90% of the positives in the training data. Accordingly, we constructed 10 different PU datasets by varying UPP. UPP is defined as follows:

\[
\text{Unlabeled Positive Proportion (UPP)} = \frac{\text{number of masked positives}}{\text{total number of positives}} \tag{1}
\]


For instance, when UPP is 10%, the PU dataset is constructed masking the labels of randomly selected 10% of known positives and all negatives.
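The construction of a PU dataset for a given UPP can be sketched as below. This is a hedged illustration rather than the published implementation: the function name `make_pu_labels`, the use of `None` to mark unlabeled instances, the seed, and the rounding of the number of masked positives are assumptions. The counts 170 and 286 are the breast cancer training-set sizes reported in Section 2.1.

```python
import random


def make_pu_labels(true_labels, upp, seed=0):
    """Mask a fraction `upp` of the positives, and every negative, as unlabeled.

    true_labels: list of 1 (positive) / 0 (negative).
    Returns a list with 1 for a known positive and None for an unlabeled instance.
    """
    positive_idx = [i for i, y in enumerate(true_labels) if y == 1]
    n_masked = round(upp * len(positive_idx))  # rounding rule is an assumption
    masked = set(random.Random(seed).sample(positive_idx, n_masked))
    return [1 if y == 1 and i not in masked else None
            for i, y in enumerate(true_labels)]


# Breast cancer training set: 170 positives and 286 negatives.
labels = [1] * 170 + [0] * 286
pu = make_pu_labels(labels, upp=0.10)
known_positives = sum(1 for y in pu if y == 1)
unlabeled = sum(1 for y in pu if y is None)
print(known_positives, unlabeled)  # 153 303
```

With UPP = 10%, 17 of the 170 positives are masked, which leaves 153 known positives and 286 + 17 = 303 unlabeled instances.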

2 GSOM-based PUL Approach for Adapted PU Datasets
In this section, we demonstrate the significance of the proposed PUL approach using the adapted benchmark datasets. We compared the proposed GSOM-based PUL approach against Baseline and OCSVM for varying UPP (see Section 1.3). In the GSOM-based PUL approach, we employed known positives and negatives inferred by GSOM from the unlabeled data. We employed Baseline using known positives and randomly selected negatives from the unlabeled data while OCSVM is employed using only known positives.

2.1 Adapted Breast Cancer Data
There are 170 positives and 286 negatives in the training set, and 42 positives and 71 negatives in the test set. Fig 1 illustrates GSOM's capability of selecting a representative training sample for the unlabeled negative class when UPP varies between 0% and 90%. The proposed GSOM-based PUL approach identifies potential negatives reliably for UPP between 0% and 50%: the accuracy of the negative extraction lies above 90% when UPP is below 50%. Although the number of false negatives gradually increases as UPP increases, it remains low compared to the number of initial unlabeled instances. In addition, Fig 2 shows the correct identification of negative nodes for the adapted breast cancer data when UPP is 20%. It should also be noted that node-112 contains one known positive but is topologically located among negative nodes, as it shows closer similarity to the negatives. On the other hand, node-61 is incorrectly profiled as a negative node because it contains only one instance, an unlabeled positive. This type of error is minimized by removing nodes with only one instance.
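The node profiling step discussed above can be illustrated with a small sketch. It assumes that the trained map is available as a mapping from node id to member instances; the function names, the `min_size` parameter and the toy node memberships are hypothetical, and the accuracy is taken here to mean the fraction of extracted instances that are truly negative, which is an assumption about how the reported accuracy is computed.

```python
def extract_negative_nodes(nodes, min_size=2):
    """Pick map nodes that hold only unlabeled instances and at least `min_size` of them.

    nodes: dict mapping node id -> list of (pu_label, true_label) pairs, where
    pu_label is 1 for a known positive and None for an unlabeled instance.
    """
    return {
        node_id: members
        for node_id, members in nodes.items()
        if len(members) >= min_size and all(pu is None for pu, _ in members)
    }


def extraction_accuracy(negative_nodes):
    """Fraction of extracted instances whose hidden true label is negative (0)."""
    true_labels = [t for members in negative_nodes.values() for _, t in members]
    if not true_labels:
        return 0.0
    return sum(1 for t in true_labels if t == 0) / len(true_labels)


# Hypothetical toy map; node ids and memberships are made up for illustration.
toy_nodes = {
    "n1": [(None, 0), (None, 0), (None, 0)],  # only unlabeled -> negative node
    "n2": [(1, 1), (None, 1)],                # holds a known positive -> not negative
    "n3": [(None, 1)],                        # single unlabeled positive -> dropped
    "n4": [(None, 0), (None, 1)],             # unlabeled only, one hidden positive
}
selected = extract_negative_nodes(toy_nodes)
print(sorted(selected), extraction_accuracy(selected))  # ['n1', 'n4'] 0.8
```

Node `n3` plays the role of node-61 in the text: a single unlabeled positive that would be profiled as negative if single-instance nodes were kept.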

Figure 1: For Breast Cancer Data: (a) Negatives identified by the proposed GSOM-based PUL method for varying Unlabeled Positive Proportion (UPP): The dotted line shown in green represents the actual negatives which is 286 in the original training set, the dotted line shown in blue represents the total unlabeled data; (b) Accuracy of the proposed GSOM-based PUL method in inferring negatives.

Fig 3 illustrates F1-score for classifying malignant and benign instances using the proposed GSOM-based PUL approach against Baseline and OCSVM. It shows


Figure 2: GSOM maps (Breast Cancer Data): (a) Profiling of GSOM nodes when Unlabeled-Positive Proportion (UPP) is 20%; (b) Profiling of GSOM nodes using original labels. The nodes shown in blue are the proposed negative nodes having only unlabeled instances, the nodes shown in grey contain both initial positives and unlabeled instances, and the nodes shown in red contain only initial positives.

the results only for UPP from 0% to 50% (UPP above 50% is ignored due to lack of adequate positive instances in the training samples). It is clear that the GSOM-based PUL approach outperforms the Baseline and OCSVM in all instances. Moreover, the final classification performance correlates strongly with the accuracy of the training set.
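For reference, the F1-score used in this comparison is the harmonic mean of precision and recall for the positive class. The sketch below computes it from true and predicted labels; the function name and the toy labels are illustrative only and are not taken from the experiments.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(round(f1_score(y_true, y_pred), 3))  # 0.667
```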

Figure 3: For Breast Cancer Data: Performance assessment of the GSOM-based PUL approach, Baseline and OCSVM for varying Unlabeled-Positive-Proportion (UPP) using Linear SVM classifier when Regularisation Parameter (C)=1. (Here we included the cross-validation (CV) and testing performance (Test) on binary classification).

2.2 Adapted Iris Data
Fig 4 illustrates the intrinsic evaluation of the proposed GSOM-based PUL approach in identifying reliable negatives for UPP varying between 0% and 90%. The approach correctly identifies reliable negatives when UPP is below 40%, but the performance degrades as UPP increases. According to Fig 4, GSOM's performance in negative extraction decreases when UPP is above 60%. Thus, predictive performance may further degrade when UPP is above 60%. Fig 5 illustrates F1-score


for classifying Setosa and Non-Setosa using the proposed GSOM-based PUL approach against Baseline and OCSVM. It shows the results only for UPP from 0% to 50% (UPP above 50% is ignored due to lack of adequate positive instances in the training samples). It is clear that the GSOM-based PUL approach outperforms the Baseline and OCSVM in all instances. Moreover, the final classification performance correlates strongly with the accuracy of the training set.

Figure 4: For Iris Data: (a) Negatives identified by the proposed GSOM-based PUL method for varying Unlabeled Positive Proportion (UPP): The dotted line shown in green represents the boundary of the actual negatives which is 80 in the original training set, the dotted line shown in blue represents the boundary of the total unlabeled data; (b) Accuracy of the proposed GSOM-based PUL method in inferring negatives.

Figure 5: For Iris Data: Performance assessment of the GSOM-based PUL approach, Baseline and OCSVM for varying Unlabeled-Positive-Proportion (UPP) using Linear SVM classifier when Regularisation Parameter (C)=1. (Here we included the cross-validation (CV) and testing performance (Test) on binary classification.)

References
1. Asuncion, A., Newman, D.: UCI machine learning repository (2007)
2. Larose, D.T.: Discovering Knowledge in Data: an Introduction to Data Mining. John Wiley & Sons (2014)

3.2 Summary

The results of this study demonstrate the importance of employing representative training samples for DDI prediction and the significance of the new pairwise similarity measures for improved classification accuracies. Clustering drug pairs based on their pairwise similarities enables grouping of similar types of DDIs. The GSOM clustering approach used in this study suits DDI clustering well as the number of DDI clusters does not need to be specified in advance. Also, GSOM is capable of handling higher dimensional data. The proposed approach showed a significant improvement in the performance on DDI prediction compared to the Baseline approach (which randomly selects negatives from the unlabeled data) and One Class SVM. The results on DDI classification also provide evidence that the proposed methods can be generalized well to more complex DDI prediction tasks. The classification of the inferred harmful DDIs associated with the CYP isoforms (CYP1A2, 2C9, 2C19, 2D6, 2E1 and 3A4) enables investigating the clinical significance of the predicted DDIs.

Chapter 4
A two-tiered clustering approach for drug repositioning through heterogeneous data integration

As mentioned in Chapters 1 and 2, drug repositioning is the process of identifying new uses for existing drugs and it can be achieved by assessing pairwise drug similarities. Heterogeneous data integration is a fundamental problem in drug data analysis because a drug can be represented in terms of various characteristics such as chemical structures, known targets, known therapeutic/side effects, etc. This chapter introduces a heterogeneous data integration methodology that supports identification of drug repositioning candidates by analyzing pairwise drug similarities. The primary objective of heterogeneous/multi-view data integration is to understand the drug characteristics more deeply and to obtain a consensus solution. A purely unsupervised, two-tiered clustering approach is introduced here for heterogeneous drug data integration and drug clustering.

In this study, vector-based clustering and graph clustering algorithms are employed for drug clustering. Clustering is applied to group drugs with similar characteristics. Here, clustering drugs based on chemical, disease, gene, protein and side effect properties enables grouping drugs from five different perspectives. This chapter proposes a consensus clustering solution based on heterogeneous data integration for drug repositioning. Drugs in the same cluster are expected to share the same therapeutic use; hence clustering provides an opportunity to achieve drug repositioning.

The mapping between drug clusters and therapeutic uses is accomplished in this work based on the Anatomical Therapeutic Chemical (ATC) classification published by the World Health Organization. Assessing the inferred drug repositioning candidates is challenging. The confidence measure introduced in this study is used to prioritize the most appropriate drug repositioning candidates. The drug repositioning candidates identified consistently by multiple clustering algorithms and with high confidence have a higher possibility of being effective repositioning candidates.
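To make the idea of combining five per-view clusterings into one drug-drug relation concrete, the sketch below builds a co-association matrix, a device commonly used in consensus clustering. It is an assumption for illustration only: this introduction does not specify how the drug-drug relation matrices of the published method are constructed, and the function name, drug names and cluster ids are all hypothetical.

```python
from itertools import combinations


def co_association(views, drugs):
    """Fraction of views in which each pair of drugs falls in the same cluster.

    views: dict mapping view name -> dict mapping drug -> cluster id.
    Returns a dict mapping an alphabetically ordered drug pair -> value in [0, 1].
    """
    matrix = {}
    for a, b in combinations(sorted(drugs), 2):
        together = sum(1 for clusters in views.values() if clusters[a] == clusters[b])
        matrix[(a, b)] = together / len(views)
    return matrix


# Hypothetical first-tier cluster assignments for three placeholder drugs.
views = {
    "chemical":    {"drugA": 0, "drugB": 0, "drugC": 1},
    "disease":     {"drugA": 2, "drugB": 2, "drugC": 2},
    "gene":        {"drugA": 0, "drugB": 1, "drugC": 1},
    "protein":     {"drugA": 5, "drugB": 5, "drugC": 3},
    "side_effect": {"drugA": 1, "drugB": 1, "drugC": 0},
}
relation = co_association(views, ["drugA", "drugB", "drugC"])
print(relation[("drugA", "drugB")])  # 0.8
```

A weighted graph built from such pairwise values could then be passed to a graph clustering algorithm as the second tier; that hand-off is likewise a sketch of the general idea, not a description of the published pipeline.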

4.1 Published manuscript

The results of this study are presented in the following published form:
Hameed, P. N., Verspoor, K., Kusljic, S., & Halgamuge, S. (2018). A two-tiered unsupervised clustering approach for drug repositioning through heterogeneous data integration. BMC Bioinformatics, 19(1), 129.
The supplementary materials are available in the online version of the paper at https://doi.org/10.1186/s12859-018-2123-4.

Hameed et al. BMC Bioinformatics (2018) 19:129 https://doi.org/10.1186/s12859-018-2123-4

RESEARCH ARTICLE Open Access
A two-tiered unsupervised clustering approach for drug repositioning through heterogeneous data integration
Pathima Nusrath Hameed1,2,3*, Karin Verspoor4, Snezana Kusljic5,6 and Saman Halgamuge7

Abstract
Background: Drug repositioning is the process of identifying new uses for existing drugs. Computational drug repositioning methods can reduce the time, costs and risks of drug development by automating the analysis of the relationships in pharmacology networks. Pharmacology networks are large and heterogeneous. Clustering drugs into small groups can simplify large pharmacology networks, and these subgroups can also be used as a starting point for repositioning drugs. In this paper, we propose a two-tiered drug-centric unsupervised clustering approach for drug repositioning, integrating heterogeneous drug data profiles: drug-chemical, drug-disease, drug-gene, drug-protein and drug-side effect relationships.
Results: The proposed drug repositioning approach has three steps: (i) clustering drugs based on their homogeneous profiles using the Growing Self Organizing Map (GSOM); (ii) clustering drugs based on drug-drug relation matrices derived from the previous step, considering three state-of-the-art graph clustering methods; and (iii) inferring drug repositioning candidates and assigning a confidence value for each identified candidate. In this paper, we compare our two-tiered clustering approach against two existing heterogeneous data integration approaches with reference to the Anatomical Therapeutic Chemical (ATC) classification, using GSOM. Our approach yields Normalized Mutual Information (NMI) and Standardized Mutual Information (SMI) of 0.66 and 36.11, respectively, while the two existing methods yield NMI of 0.60 and 0.64 and SMI of 22.26 and 33.59. Moreover, the two existing approaches failed to produce useful cluster separations when using graph clustering algorithms while our approach is able to identify useful clusters for drug repositioning. Furthermore, we provide clinical evidence for four predicted results (Chlorthalidone, Indomethacin, Metformin and Thioridazine) to support that our proposed approach can be reliably used to infer ATC code and drug repositioning.
Conclusion: The proposed two-tiered unsupervised clustering approach is suitable for drug clustering and enables heterogeneous data integration. It also enables identifying reliable repositioning drug candidates with reference to ATC therapeutic classification. The repositioning drug candidates identified consistently by multiple clustering algorithms and with high confidence have a higher possibility of being effective repositioning candidates.
Keywords: Drug repurposing, ATC classification, Drug clustering, Data integration, Heterogeneity
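The NMI values quoted in the abstract compare a clustering of drugs with their ATC classes. The sketch below shows one common NMI variant, in which the mutual information is divided by the arithmetic mean of the two entropies. Which normalization the paper uses is not stated here, so this choice, the function names and the toy labels are assumptions; SMI is not covered.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log((c * n) / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """Mutual information divided by the mean of the two entropies."""
    denom = (entropy(a) + entropy(b)) / 2
    # Two single-cluster labelings are treated as identical (convention, assumed).
    return mutual_information(a, b) / denom if denom > 0 else 1.0


# Toy example: cluster ids against hypothetical ATC first-level codes.
clusters = [0, 0, 1, 1, 2, 2]
atc = ["C", "C", "N", "N", "A", "A"]
print(round(nmi(clusters, atc), 6))  # 1.0
```

A value of 1 means the clusters reproduce the reference classes exactly, while statistically independent labelings give a value of 0.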

*Correspondence: [email protected] 1Department of Mechanical Engineering, University of Melbourne, Parkville, 3010 Melbourne, Australia 2Data61, Victoria Research Lab, West Melbourne 3003, Australia Full list of author information is available at the end of the article

© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.


Background
Producing new drugs and marketing them with a complete drug profile is a challenging task as it is a long process and requires a large investment of time and money. Drug repositioning or drug repurposing is the process of identifying new therapeutic uses for existing drugs. It can reduce the time, costs and risks of the traditional drug discovery process [1–4]. The main goal of drug repositioning is to increase the therapeutic use of the existing drugs in the clinical and medical domain. It is believed that drugs having similar profiles are more likely to share similar behavior in presence of similar targets (e.g. proteins) [1, 3–7]. There is also evidence that computational drug repositioning can be improved by heterogeneous data analysis [1, 5, 7–9]. In contrast to laborious in-vivo and in-vitro experiments, computational methods for drug repositioning have become popular as effective and efficient approaches for drug repositioning [1, 3–6]. These methods focus on identifying new uses for existing drugs and finding new associations between other contributing entities like proteins, genes, diseases and side effects to approach this problem.

There are two main concepts behind drug repositioning: new target recognition and new indication recognition. Figure 1 illustrates a general view of these two drug repositioning concepts. Figure 1a shows the known interactions where each of the drugs is associated with at least one target protein and vice versa; each of the targets is also associated with at least one disease and vice versa. Figures 1b and c show new target recognition and new indication recognition, respectively. In new target recognition, the objective is to identify novel molecular targets for a given drug while in new indication recognition, the objective is to identify new diseases that may be impacted by one of the existing targets of the drug.

Computational methods like network based inferencing [1, 5, 6, 8, 10], machine learning [2, 11, 12], and text mining approaches [13, 14] are widely used for drug repositioning. In recent computational approaches, the Anatomical Therapeutic Chemical (ATC) classification system [15] is considered as an intermediate source to identify useful drug repositioning candidates where the ATC therapeutic classes are used to identify repositioning candidates [9, 11, 16]. Every repositioning candidate identified by computational models may not be directly applicable in clinical practice. However, the outcomes of the computational models may enable prioritizing repositioning candidates for in-vivo/in-vitro analysis.

Pharmacological data can be represented in homogeneous or heterogeneous graphs/networks. Therefore, most of the drug repositioning approaches can be seen as hybrid methods of graph/network theory concepts and machine learning [5, 8–10, 12]. Graph clustering is one such hybrid approach where graphs of homogeneous and heterogeneous objects can be grouped into small clusters based on their associations. Since pharmacology networks are large and complex, partitioning large networks produces an abstraction which simplifies their complex interaction structure. Realizing the importance of simplifying drug-data networks, research [2, 8, 10, 17, 18] has approached partitioning pharmacological networks using various graph theory concepts.

Yildirim et al. [8] focused on combining heterogeneous data using drug-target and disease-gene interactions employing bipartite graph projections while Hartsperger et al. [19] demonstrated the importance of fuzzy clustering for arranging the biological entities like disease, gene and proteins in a meaningful weighted k-partite graph. Moreover, Klamt et al. [20] demonstrated that graph transformations such as graph projection methods would lead to information loss. In contrast, Yamanishi et al. [5] investigated a supervised bipartite graph inferencing approach by integrating chemical and pharmacological properties. Campillos et al. [18] suggested a probability theoretic approach to integrate chemical and pharmaceutical properties.

Napolitano et al. [2] proposed useful drug reclassifications for ATC classification using supervised machine learning. They integrated drug-chemical, drug-gene and drug-protein representations and obtained classification accuracy of 78%. But, integrating pharmacological concepts is also important when focusing drug repositioning using ATC classification. In general, taking second/higher order derivatives of objects is a popular method for highlighting special features. Lee et al. [9] proposed that drug groups (DG) having common DG-DG interaction partners would share similar drug mechanisms and they have proposed Molecular Complex Detection (MCODE) algorithm for module detection in DG-DG interaction network. They investigated clustering DG-DG interactions in relation to ATC classification and they believe DG-DG interactions would be useful in describing the mechanisms and the features of drugs.

The importance of heterogeneous data integration
In preliminary investigations of drug repositioning, computational models for pharmacological data have been developed using homogeneous components such as disease, symptoms, side effects, chemical structures, proteins and genes. But, each homogeneous component has its own pros and cons [1]. Although many findings acknowledge the benefits of phenome space properties like disease and side effects [18, 21], chemical structures are also important to make predictions. Different drug characterizations may lead to identifying various repositioning candidates based on different aspects. Hence, combining the results of different drug characterizations can lead to identifying reliable repositioning candidates. Recent studies have


Fig. 1 A generalized illustration of two alternative approaches involved in drug repositioning; (a), (b) and (c) represent the known interactions, New Target Recognition and New Indication Recognition, respectively. (The notations 1*-1* and m-n indicate one-or-many and many-to-many relationships, respectively)

focused on the development of novel, efficient and reliable computational models to improve the final predictions using heterogeneous data integration [1, 2, 5, 8, 9].

In early research, symptom similarities have been employed to analyze disease similarities and in turn to identify new uses for existing drugs [22]. However, it was realized that symptom-based similarities alone are inadequate to predict new therapeutic uses for existing drugs. Consequently, mRNA expression and protein-protein interaction networks have been used in investigating disease similarities [6]. Campillos et al. [18] demonstrated the significance of using side effect similarity for drug repositioning. Even though side effect similarities can be used to link the interactions between drugs and targets, there are certain limitations as well. Some side effects arise due to hormonal changes of the body. Also, side effects may require a long time to observe and construct a strong drug-side effect profile. Hence, it cannot be directly applied to the newly arrived drugs without an explicit drug profile. Since many side effects are common among various drugs, data redundancy is another problem in the side effect domain.

Campillos et al. [18] and Dudley et al. [1] have also investigated the impact of chemical similarities for drug repositioning. They found that using chemical structural similarities alone is insufficient as drugs undergo metabolic transformations and pharmacokinetic transformations. Therefore, studying the mechanism of action of a drug is encouraged. Using connectivity maps to construct the molecular activity profiles based on gene expression has been considered as a better approach as it simplifies drug comparisons. However, a molecular activity similarity based approach may not be very accurate as many disease conditions involve more than one molecular activity. Moreover, gene expression profiles may be generated under different conditions such as different doses, time durations, different disease stages and ages. Therefore, considering gene expression alone may result in poor performance.

Yamanishi et al. [5] have demonstrated the importance of spanning chemical, genomic and pharmacological space features in discovering new drug-target interactions using supervised bipartite graph inference. They found that pharmacological effect similarities more strongly correlate with new predictions than chemical similarities. Moreover, they proposed a two-step strategy to combine chemical, genomic and pharmacological properties using supervised bipartite graph learning and hence obtained reliable drug-target associations.

In-silico drug repositioning has become very popular during the last decade as it contributes to accelerating drug development and drug discovery. Moreover, recent research has identified heterogeneous data integration as important for obtaining reliable predictions. However, introducing heterogeneous data types increases the complexity of data representation and the number of features. Therefore, network partitioning or clustering methods can be used to simplify large and complex pharmacology data and predictions can be efficiently made on identified subgroups [8–10, 19, 23]. Consensus clustering is a method used for ensemble clustering [24]. It has been introduced to overcome the limitations of basic clustering algorithms. It can also be considered as a method to integrate multiple sources. However, the existing consensus clustering algorithms require the number of clusters to be defined in advance. In this study, we propose a two-tiered clustering approach for drug repositioning inspired by consensus clustering. Here, we selected clustering algorithms which could be employed without any prior knowledge about drug clusters.

Pharmacology networks are large and heterogeneous; drugs can be considered as the main hubs in these networks. The main objective of this study is to construct a consistent computational model for drug repositioning

Hameed et al. BMC Bioinformatics (2018) 19:129 Page 4 of 18

through heterogeneous data integration. Drug-chemical, drug-gene, drug-protein, drug-disease and drug-side effect relationships are useful to represent different aspects of drugs such as chemical, biological and phenome characteristics, respectively. We therefore cluster drugs based on their heterogeneous associations. Specifically, we apply clustering of drugs to simplify the large drug-centric pharmacology networks. In this study, we propose a two-tiered clustering approach, an unsupervised learning approach for drug repositioning via ATC classification. This proposed approach enables clustering drugs based on heterogeneous data integration which is used as the drug similarity model for drug repositioning. Hence, the final clustering is an overall solution that groups similar drugs using a variety of drug characteristics. The identified drug clusters are compared against the already published ATC classification to infer useful repositioning candidates. The identified drug clusters can be used as a source to understand drug-drug similarities as well as drug-group similarities.

As illustrated in Fig. 1, new target recognition and new indication recognition are two typical ways of approaching drug repositioning. Even though the use of ATC classification is popular in the input space to determine anatomical/therapeutic/chemical features of drugs [25–27], little research directly focuses on drug repositioning by ATC classification [2, 16, 28]. Recent studies [2, 28] were limited to drugs that already possess an ATC code. Recently, Sun et al. [16] proposed a semi-supervised learning approach based on a physarum-inspired prize-collecting Steiner tree approach for drug repositioning. It infers a single subnetwork at a time, where the ATC-C class is used to reposition drugs for cardiovascular diseases.

This paper fills the gap with a purely unsupervised learning approach by heterogeneous data integration where ATC classification is employed for large-scale drug repositioning of drugs with and without an assigned ATC class. This study also presents a confidence measure which is used to determine the significance of the inferred repositioning candidates. Moreover, the significance of findings arising from this study is twofold: (i) correctly profile and suggest therapeutic indications for drugs that do not possess an ATC code; (ii) flag the potential of some drugs to be used for other therapeutic purposes. Furthermore, we provide clinical evidence for four predicted results (Chlorthalidone, Indomethacin, Metformin and Thioridazine) to support that our proposed approach can be reliably used to infer ATC codes and drug repositioning.

Methods
As explained in “Background” section, drug repositioning candidates can be identified by analyzing drug-drug similarities. This study proposes an unsupervised two-tiered clustering model to identify drug similarities based on heterogeneous drug characteristics. Figure 2 illustrates the main steps of the proposed approach. A two-tiered clustering approach is proposed to build the drug similarity model for drug repositioning. In Drug Clustering Tier 1, clustering is performed based on drugs’ chemical, therapeutic, gene, protein and side effect associations separately to illustrate how close two drugs are, along each

Fig. 2 The proposed approach


dimension. Drug Clustering Tier 2 is a heterogeneous data integration phase, in which the results of Drug Clustering Tier 1 are combined to produce an overall similarity that considers all aspects of the drug similarity. Drug repositioning is carried out employing ATC classification for the drug clusters identified at Drug Clustering Tier 2. The therapeutic classification of the ATC classification is used to label each cluster from which we identify plausible repositioning candidates.

The particular drug profile leading to identifying similar therapeutic uses may vary from drug to drug; choosing an appropriate representation for drug repositioning is challenging. Therefore, making a similarity decision based on heterogeneous drug profiles such as chemical, disease, gene, protein and side effect profiles is worthwhile. Moreover, some dimensions can be incomplete. If the data in one drug profile is inaccurate or incomplete, it may be compensated by better data in other drug profiles. Therefore, making the final conclusions based on consolidated heterogeneous data leads to fewer errors. ATC classification is used as the gold standard reference classification. We expect that drugs that are in the same ATC class should be clustered together and hence we can use this to validate our clusters.

In “Data” section, the drug data and their ATC classification codes used in this study are explained. In “The proposed approach” section, we explain the selected clustering algorithms, the proposed two-tiered clustering approach, the evaluation process for the identified drug clusters and the computation of the confidence measure.

Data
Drug profiles
We use five different homogeneous drug profiles, four of which are obtained from the DyDruma [29] database: drug-chemical, drug-therapeutic, drug-protein and drug-side effect profiles. We obtained the KEGG gene data used in Wu et al. [10] to represent drug-gene relationships. This allows us to link drug associations in the genomic space, adding a fifth homogeneous drug dimension. These drug profiles are represented as binary associations where values 1 and 0 represent the presence and absence of a particular feature, respectively.

• drug-chemical features [881]: Each drug is associated with relevant chemical fingerprints, based on the 881 fingerprints (2D chemical structures) defined by PubChem [30]. We assume one feature for each fingerprint. If a drug contains a given structural fingerprint, the corresponding feature will have a value of 1.
• drug-therapeutic features [719]: The therapeutic uses of the drugs have been obtained by extracting treatment relationships between drugs and diseases from the Unified Medical Language System (UMLS) [31]. These are the treatment relationships between drugs and diseases from the National Drug File-Reference Terminology.
• drug-protein features [775]: The target protein information of drugs has been obtained from Drugbank [32] and mapped using the UniProt Knowledgebase [33].
• drug-side effect features [1385]: The drug-side effect information has been extracted from the SIDER database [34] which uses the UMLS library to map the side effect keywords.
• drug-gene features [1504]: We constructed a drug-gene binary profile for the 1504 KEGG gene data used in Wu et al. [10] to represent drug-gene relationships.

These five sources have 417 drugs in common. The drug profiles of the selected drugs are available at https://github.com/fathimanush786/two_tiered_clustrering_data.

ATC classification
As defined by the World Health Organization, the Anatomical Therapeutic Chemical (ATC) classification [15] captures the pharmacodynamic properties of drugs. This resource uses active ingredients of drugs as well as their anatomical, therapeutic and chemical properties when constructing the classification system. ATC is a five level classification system. The first level classification is based on the anatomical group; it contains 14 groups. The second level classification is based on pharmacological/therapeutic subgroups. The third and fourth levels denote chemical/pharmacological/therapeutic subgroups and the fifth level refers to the chemical substance. Some drugs have been categorized into multiple classes. These classifications may also be updated based on new research findings. We obtained ATC classes for 405 drugs out of the 417 selected drugs; 12 drugs had not yet been assigned an ATC classification. We focus on classifying only up to the second (therapeutic) level as our broader goal is to infer new therapeutic uses for existing drugs. We observe 66 unique classes at the ATC second level classification for these 405 drugs. These 66 classes are used as the reference clustering to evaluate the performance of the drug clusters identified by our method. The ATC classification of the selected 417 drugs is available at https://github.com/fathimanush786/two_tiered_clustrering_data.

The proposed approach
Our two-tiered unsupervised clustering model is proposed as a similarity model to identify drugs with closer relationships. Unsupervised clustering is an approach to grouping similar objects together without any prior knowledge of their class labels. Objects that are in a


given cluster should demonstrate higher similarity to each other and relatively higher dissimilarity with the objects in other clusters. In general, clustering is popular as a powerful technique which can identify useful patterns in an unsupervised learning environment. Numerous clustering algorithms have been proposed, but there is no acknowledged single preferred algorithm. Each algorithm has its own pros and cons. However, properties such as scalability, robustness, handling of high dimensional features, speed, intrinsic nature, adaptability and preservation of topological order are some interesting characteristics which we have considered in this context.

In the context of drug data, we can apply clustering algorithms by adopting a representation of each drug that allows drug similarity to be computed. We propose a two-tiered clustering approach to cluster drugs into smaller groups based on heterogeneous data integration. We employ four clustering algorithms for partitioning the pharmacology network. We employ Growing Self Organizing Map (GSOM) [35, 36] which is a vector-based clustering algorithm and three state-of-the-art graph clustering algorithms: Markov Clustering (MCL) algorithm [37, 38], Clustering with Overlapping Neighborhood Expansion (ClusterONE) [39] and Molecular Complex Detection (MCODE) [40]. In general, these selected clustering algorithms can be applied without any prior knowledge about the number of classes, which is more useful in this context. We compare the performance of clusters identified by each algorithm to the classes of the ATC classification. We demonstrate the performance evaluation of drug clustering using internal and external evaluation measures. The identified drug clusters are used for drug repositioning via ATC classification.

Selected clustering algorithms
GSOM Growing Self Organizing Map (GSOM) [35, 36] is an extended version of the Self-organizing map (SOM) [41], which is a popular vector-based clustering algorithm capable of handling large-scale and high dimensional features. It is popular for its growing nature while preserving the topological order. It also demonstrates an emergent nature where it starts with one node and assigns data points considering the shortest Euclidean distance. Spread factor is the parameter which controls the granularity of the cluster map. A smaller spread factor results in fewer nodes in the GSOM map while a larger spread factor enables a high growth of the GSOM map.

ClusterONE Clustering with Overlapping Neighborhood Expansion (ClusterONE) [39] is a graph partitioning algorithm initially proposed for identifying overlapping protein modules in protein-protein interaction networks and also used in a drug repositioning application [10]. It uses a seeded growing concept where it starts with one vertex and adds or removes vertices in a greedy approach to achieve better cluster separations with high cohesiveness.

MCL The Markov Clustering (MCL) [37, 38] algorithm is another graph clustering algorithm which is also widely used as a protein module detection algorithm for large protein networks. It has been used in a recent drug repositioning application as well [23]. It is popular for its scalability and speed, and for its intrinsic, adaptable and emergent nature. It uses a stochastic flow simulation based concept to partition graphs/networks. Its parameter ‘inflation’ can be used to control the number of clusters, where smaller inflation produces lower granularity with large clusters.

MCODE The Molecular Complex Detection (MCODE) [40] algorithm includes three stages: vertex weighting, complex prediction and optionally post-processing to filter or add inputs in the resulting complexes by certain connectivity criteria (haircut and fluffing). MCODE uses a method based on the clustering coefficient when assigning weights for vertices. The vertex weight threshold parameter can be used to define the density of the resulting complex. A threshold that is closer to the weight of the seed vertex identifies a smaller, denser network region around the seed vertex.

Drug Clustering Tier 1
According to fundamental graph theory concepts, any drug-feature/drug-drug associations can be represented in two ways: (i) graph representation and (ii) vector/matrix representation. Therefore, we can obtain an adjacency matrix to represent the drug-feature associations as shown in Fig. 3. An adjacency matrix demonstrates which vertices/nodes of a graph/network are adjacent to which other vertices/nodes. In this manner, we have adjacency matrices (data matrices) of 417×881, 417×719, 417×1504, 417×775 and 417×1385 for the drug-chemical, drug-disease, drug-gene, drug-protein and drug-side effect associations, respectively. Then, we cluster drugs with respect to these independent homogeneous features using the GSOM algorithm.

Drug Clustering Tier 2
The clustering solutions obtained from Drug Clustering Tier 1 are used to derive drug-drug relation (DDR) matrices. Hence, we produce one DDR matrix per dimension considering their Tier 1 cluster assignments. We then cluster drugs based on combining these individual DDR matrices in order to capture overall drug similarities of the aggregated features used in Tier 1. Figure 4 illustrates the mechanism for deriving the DDR matrix using drug clusters (from Drug Clustering Tier 1). We construct five DDR matrices for chemical, disease, gene, protein and side


Fig. 3 Drug-feature associations can be captured in a bipartite graph, as shown in (a); the corresponding adjacency matrix is shown in (b). D(1,2,3) denotes the drugs while F(1,2,3,4) denotes the features, such as chemical, disease, protein and side effect features
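
The mapping from a bipartite drug-feature graph to its adjacency matrix can be sketched as follows. This is a minimal illustration, not the code used in the study: the drug and feature labels mirror those of Fig. 3, but the specific edges and the helper name `to_adjacency_matrix` are hypothetical.

```python
# Hypothetical toy data: which features each drug is associated with.
# Labels follow the D1..D3 / F1..F4 naming of Fig. 3; the edges are invented.
associations = {
    "D1": {"F1", "F2"},
    "D2": {"F2", "F3"},
    "D3": {"F4"},
}


def to_adjacency_matrix(associations, features):
    """Return (drugs, matrix) where matrix[i][j] is 1 if drug i has feature j, else 0."""
    drugs = sorted(associations)
    matrix = [
        [1 if feature in associations[drug] else 0 for feature in features]
        for drug in drugs
    ]
    return drugs, matrix


features = ["F1", "F2", "F3", "F4"]
drugs, matrix = to_adjacency_matrix(associations, features)

# Print one binary row per drug, in the order given by `features`.
for drug, row in zip(drugs, matrix):
    print(drug, row)
```

Each row is the binary profile of one drug, so a 417-drug profile over 881 chemical fingerprints would have the same shape as the 417×881 data matrix described in the text.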

effects separately, based on the individual Tier 1 clustering for each type of feature. We then integrate the DDR matrices of Tier 1 clustering into a single relation matrix by averaging the individual DDR matrices. The averaged relation matrix is used to cluster drugs. By performing this second round of clustering, we aim to improve the reliability of the drug clustering. We employ ClusterONE, MCL, MCODE as well as GSOM in Drug Clustering Tier 2.

Alternative approaches
Concatenating all features into a single vector
A straightforward approach to integrating heterogeneous features is to concatenate all individual features into a single vector [16, 42]. Let $D = \{D_1, D_2, D_3, \ldots, D_n\}$ be a set of drugs, let $C = \{C_1, C_2, C_3, \ldots, C_k\}$ be the binary vector of chemical features of drug $D_i$ and let $T = \{T_1, T_2, T_3, \ldots, T_l\}$ be the binary vector of therapeutic features of drug $D_i$. Then, we can construct a heterogeneous data representation ($H_y$) of chemical and therapeutic features by concatenating features from different domains, where $H_y = \{C_1, C_2, C_3, \ldots, C_k, T_1, T_2, T_3, \ldots, T_l\}$ is the heterogeneous data integrated binary vector of drug $D_i$, for $i \in \{1, 2, 3, \ldots, n\}$. Similarly, we can extend this to integrate drug profiles of multiple domains.

Averaging summarized pairwise similarities
Another way of integrating heterogeneous features is to average the similarity measure for each member of a drug pair according to each individual type of feature, to obtain a single summary similarity score [2]. The Jaccard coefficient is widely used to obtain the similarity measure between two drugs. Let $Sim_C(D_i, D_j)$ and $Sim_T(D_i, D_j)$ be the chemical and therapeutic similarity measures of a pair of drugs $D_i$ and $D_j$, respectively. Then, we can construct a heterogeneous data representation ($H_z$) by averaging $Sim_C$ and $Sim_T$, where $H_z = \frac{Sim_C + Sim_T}{2}$, which yields an $n \times n$ square DDR matrix (where $n$ is the number of drugs). We can extend this to integrate

Fig. 4 (a) illustrates drug clusters while (b) illustrates the corresponding drug-drug associations. D(1,2,3) and C(1,2) denote the drugs and the clusters, respectively
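
The derivation of a DDR matrix from Tier 1 cluster assignments, and the averaging of several DDR matrices for Tier 2, can be sketched as below. The sketch assumes a binary encoding in which an entry is 1 when two drugs share a Tier 1 cluster (including the diagonal) and 0 otherwise; this encoding, the function names and the toy assignments are illustrative assumptions rather than the exact procedure of the study.

```python
def ddr_matrix(cluster_of, drugs):
    """Binary drug-drug relation matrix: 1 if two drugs share a cluster, else 0.

    Assumption: co-membership in a Tier 1 cluster is encoded as 1.
    """
    return [
        [1 if cluster_of[a] == cluster_of[b] else 0 for b in drugs]
        for a in drugs
    ]


def average_matrices(matrices):
    """Element-wise mean of equally sized square matrices (equal weights)."""
    size = len(matrices[0])
    count = len(matrices)
    return [
        [sum(m[i][j] for m in matrices) / count for j in range(size)]
        for i in range(size)
    ]


drugs = ["D1", "D2", "D3"]

# Hypothetical Tier 1 cluster assignments for two feature types.
chemical_clusters = {"D1": "C1", "D2": "C1", "D3": "C2"}
disease_clusters = {"D1": "C1", "D2": "C2", "D3": "C2"}

ddr_chemical = ddr_matrix(chemical_clusters, drugs)
ddr_disease = ddr_matrix(disease_clusters, drugs)

# The averaged relation matrix is what a Tier 2 clustering would take as input.
averaged = average_matrices([ddr_chemical, ddr_disease])
for drug, row in zip(drugs, averaged):
    print(drug, row)
```

In this toy case D1 and D2 are co-clustered in only one of the two profiles, so their averaged relation is 0.5; equal weighting corresponds to giving each drug profile the same importance.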


drug profiles in terms of more than two dimensions of similarity.

Evaluation
Internal evaluation
The objective of internal validation is to examine the compactness/cohesion and the separation of the clusters [43]. There are various internal validation measures and they are variations of these two, but there is no acknowledged measurement of choice. Silhouette analysis is used as an internal evaluation technique to assess the consistency within a cluster/class because it takes both compactness/cohesion and separation into account. Moreover, Silhouette can be interpreted using visual aids for in-depth analysis.

Silhouette analysis [44, 45] measures the similarity of an object to its own cluster/class compared to the other clusters/classes. If the object has a greater similarity to its own cluster/class than to the other clusters/classes, the Silhouette value approaches +1, and if the object has greater dissimilarity to its own cluster/class than to the other clusters/classes, the Silhouette value approaches -1. The following equation defines the Silhouette measure for an object i:

$$\mathrm{Silhouette}(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \tag{1}$$

where a(i) is the dissimilarity of the object i to its own cluster/class and b(i) is the dissimilarity of the object i to the other clusters/classes.

External evaluation
We employed ATC classification to compare the performance of our two-tiered clustering approach as well as the performance of the clustering algorithms used in this study. We selected adjusted measures: Normalized Mutual Information (NMI) [24] and Standardized Mutual Information (SMI) [46] to evaluate the identified clusters with reference to ATC classification. These are information theoretic measures derived from mutual information. NMI provides a normalized measure using mutual information which ranges between 0 and 1. SMI provides a statistical adjustment for the mutual information which is beneficial in adjusting selection bias and increasing the interpretability. SMI further reduces the bias in clustering comparisons towards selecting clusterings with more clusters and where clustering involves fewer data points. The upper bound of SMI varies based on the reference clustering used; however, a higher SMI value indicates better clustering. The equations for NMI [24] and SMI [46] to compare clustering solutions U and V are shown below:

$$\mathrm{NMI}_{sqrt}(U, V) = \frac{MI(U, V)}{\sqrt{H(U)\,H(V)}} \tag{2}$$

$$\mathrm{SMI}(U, V) = \frac{MI(U, V) - E[MI(U, V)]}{\sqrt{\mathrm{var}(MI(U, V))}} \tag{3}$$

where MI is the mutual information, H is the associated entropy value, E is the expected value and var is the variance.

Assigning confidence measure
Since a drug can belong to more than one ATC class, identifying drug clusters with a 100% pure ATC class is challenging. Therefore, we identify the majority class for each drug cluster and assign a confidence measure for each identified majority class. Then, we predict the identified majority class as a reclassification for the drugs belonging to the minority classes, with the confidence measure defined by the following equation:

$$\mathrm{confidence}_i = \frac{\text{number of drugs belonging to the major ATC class of cluster } i}{\text{total number of drugs in cluster } i} \tag{4}$$

where i is the cluster number/id. Hence, we can employ the confidence measure to filter the most useful repositioning candidates.

Drug repositioning via ATC therapeutic classes
As explained in “ATC classification” section, ATC classification consists of five levels where the second level determines a drug’s therapeutic uses/properties. In this study, we approach drug repositioning by identifying plausible new ATC therapeutic (second level) classes for existing drugs. Identifying a drug’s second level classification implies its therapeutic uses. We believe reclassification of drugs into ATC therapeutic (second level) classes would enable inferring repositioning candidates.

The use of unsupervised clustering methods enables grouping of drugs without any prior knowledge of ATC classes. We expect that drugs in the same cluster will demonstrate similar characteristics while being relatively dissimilar to drugs in other clusters. Therefore, new drug-drug similarities can be identified by analyzing the drug clusters. The identified new drug-drug similarities lead us to propose classification of drugs into new ATC therapeutic (second level) classes. These proposals are inferred based on the majority ATC class associated with each cluster. Classes with higher confidence (see “Assigning confidence measure” section) can be prioritized for reclassification. Since we compare the drug clustering solutions with reference to ATC therapeutic (second level) classes, this reclassification step enables inference of repositioning candidates via ATC therapeutic classes.
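
The external evaluation of Eq. (2) and the confidence measure of Eq. (4) can be illustrated with a short sketch. It assumes each drug carries a single cluster label and a single ATC label; the function names and the toy labels are hypothetical, and SMI (Eq. 3) is omitted because it additionally requires the expected value and variance of the mutual information.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy H of a labelling (natural logarithm)."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information MI(U, V) between two labellings of the same objects."""
    n = len(u)
    count_u, count_v, count_uv = Counter(u), Counter(v), Counter(zip(u, v))
    return sum(
        (c / n) * log((c * n) / (count_u[a] * count_v[b]))
        for (a, b), c in count_uv.items()
    )


def nmi_sqrt(u, v):
    """Eq. (2): MI(U, V) normalised by sqrt(H(U) * H(V))."""
    denominator = sqrt(entropy(u) * entropy(v))
    return mutual_information(u, v) / denominator if denominator > 0 else 0.0


def majority_confidence(atc_classes):
    """Eq. (4): majority ATC class of one cluster and its share of the cluster."""
    major_class, size = Counter(atc_classes).most_common(1)[0]
    return major_class, size / len(atc_classes)


# Hypothetical labels: cluster ids from a clustering and reference ATC classes.
clusters = [0, 0, 1, 1]
atc = ["N05", "N05", "C03", "C03"]
print(nmi_sqrt(clusters, atc))

# Hypothetical cluster whose members mostly belong to N05.
print(majority_confidence(["N05", "N05", "N05", "N05", "C03"]))
```

In the second call the minority drug would be proposed for reclassification into the majority class of its cluster, with a confidence equal to the majority share.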


Results
Drug Clustering Tier 1
First, we clustered drugs based on their individual, homogeneous properties: chemical, disease, gene, protein and side effects. We employed GSOM to cluster drugs in Drug Clustering Tier 1 because it is a vector based clustering algorithm. In this study, we used the GSOM implementation of Chan et al. [47] because of its convenient visual aids for cluster analysis. As mentioned in “GSOM” section, we tuned the parameter, spread factor (SF), to obtain GSOM maps of different sizes. As a result, we obtained GSOM maps of 68 (SF = 0.0001), 69 (SF = 0.25), 66 (SF = 0.8), 63 (SF = 0.2) and 63 (SF = 0.001) nodes for chemical, disease, gene, protein and side effect profiles, respectively. Out of 417 drugs, 405 drugs have already been classified into at least one ATC class. Moreover, we noticed 66 unique ATC classes (2nd level ATC classification) relating to these 405 drugs. We evaluated drug clustering solutions for these 405 drugs with reference to the ATC classification.

Table 1 shows NMI and SMI values for Drug Clustering Tier 1. Accordingly, the NMI varies between 0.46 and 0.68 and SMI varies between 2.91 and 39.33. In the ATC classification, anatomical and therapeutic features are considered in the first two classification levels. Hence, drug clusterings using disease and protein profiles demonstrate relatively higher NMI and SMI. The NMI and SMI of chemical and side effect profiles are relatively lower than those of disease and protein profiles as they are considered in the third, fourth and fifth levels of ATC classification. On the other hand, the clustering solution on gene profiles shows the least closeness to ATC classification as this type of information is not considered in the ATC classification system. Unlike NMI where the upper bound is always 1.0, the upper bound for SMI depends on the choice of reference clustering; the upper bound for the ATC reference clustering is 98.18. Notably, the ranking order of these clustering solutions is largely consistent for both NMI and SMI; only the chemical and side effect profiles, whose scores are close, swap places.

Approximately 16% of the drugs (out of 405 drugs) are assigned to multiple classes. Therefore, we randomly selected one ATC class for those drugs having multiple classes when constructing the reference class list. Additional file 1: Figure S1 corresponds to the Silhouette analysis for chemical, disease, gene, protein and side effect profiles, respectively. It is clear that most of the drugs show negative Silhouette values, illustrating higher variations within ATC classes. The mean Silhouette values of ATC classification based on chemical, disease, gene, protein and side effect profiles are -0.31, -0.06, -0.49, -0.25 and -0.33, respectively. However, disease profiles provide relatively greater consistency with the ATC classification compared to other drug profiles.

Moreover, the Silhouette analysis on the drug clusters identified by GSOM demonstrates relatively higher Silhouette values than ATC classification: the mean Silhouette values for chemical, disease, gene, protein and side effect profiles using the GSOM algorithm are 0.13, 0.09, 0.22, 0.15 and -0.07, respectively (see Additional file 1: Figure S2 for the Silhouette analysis).

Furthermore, we examined the closeness of the clustering solutions between different drug properties used in this study. In Tables 2 and 3, we show the clustering comparison between different drug profiles using NMI and SMI, respectively. In these tables, we compare the drug clusters generated by one type of drug profile with the drug clusters generated by another type of drug profile. For instance, drug clusters generated using chemical properties are compared against drug clusters generated by disease, gene, protein and side effect profiles. According to Table 2, NMI values of 0.55, 0.48, 0.59 and 0.56 have been observed between drug clusters generated by the chemical profile and drug clusters generated by disease, gene, protein and side effect profiles, respectively. Similarly, according to Table 3, SMI values of 12.71, 0.50, 20.85 and 9.98 have been observed between drug clusters generated by the chemical profile and drug clusters generated by disease, gene, protein and side effect profiles, respectively.

According to NMI, drug clusters of chemical profiles, disease profiles, protein profiles and side effect profiles show relatively closer similarities, varying between 0.55 and 0.59. On the other hand, the highest SMI is noticed between clusters of disease and protein profiles. Notably, drug clusters of gene profiles are relatively far away from other drug clustering solutions. This deviation might have been caused by the highly sparse nature of the gene profiles. Moreover, the clusters identified by gene profiles lie much further away from ATC classification than the other clusters. Therefore, we selected chemical,

Table 1 Performance assessment of Drug Clustering Tier 1

Drug profile    NMI     SMI
Chemical        0.59    20.09
Disease         0.68    39.33
Gene            0.46    2.91
Protein         0.63    30.38
Side effect     0.58    21.07

Table 2 Drug clustering comparison between drug profiles based on Normalized Mutual Information (NMI)

                Disease   Gene    Protein   Side effect
Chemical        0.55      0.48    0.59      0.56
Disease         -         0.45    0.59      0.55
Gene            -         -       0.45      0.47
Protein         -         -       -         0.56


Table 3 Drug clustering comparison between drug profiles based on Standardized Mutual Information (SMI)

                Disease   Gene    Protein   Side effect
Chemical        12.71     0.50    20.85     9.98
Disease         -         1.08    22.06     14.89
Gene            -         -       2.58      0.89
Protein         -         -       -         16.43

disease, protein and side effect profiles for further analysis to identify drug repositioning candidates using ATC classification.

We identified a set of 26 pairs of drugs (see Additional file 2) which occur together in each drug cluster, generated based on individual chemical, disease, protein and side effect profiles. 25 out of these 26 drug pairs are assigned to the same ATC class (second level), indicating meaningfulness of the identified drug clusters. Fluphenazine and Thioridazine are also identified in the same cluster in all four clustering solutions. However, Thioridazine does not belong to any of the ATC classes while Fluphenazine belongs to ATC class N05 (psycholeptics). Therefore, we believe Thioridazine may share a similar drug profile to that of Fluphenazine and we propose to classify Thioridazine into N05 (psycholeptics).

Drug Clustering Tier 2
As explained above, we employed the four drug clusterings generated based on chemical, disease, protein and side effect profiles in Drug Clustering Tier 2. We constructed four DDR matrices based on these four identified drug clustering solutions (as explained in “Drug Clustering Tier 1” section in “Methods” section). We propose merging these DDR matrices into a single matrix as a way of heterogeneous data integration. The merged DDR matrix can be constructed by giving equal importance to each of the drug clusterings or by ranking the drug clusterings based on different evaluation measures such as NMI and SMI. However, no single type of homogeneous drug characteristic has been identified as providing an efficient and effective drug classification or drug repositioning [1]. Giving equal importance to each of the drug clusterings, we constructed a heterogeneous DDR matrix by averaging the four DDR matrices.

We used the averaged DDR matrix to identify drug clusters, employing the graph clustering algorithms ClusterONE, MCODE and MCL as well as the GSOM algorithm. In this study, we used the ClusterONE, MCL and MCODE implementations available in the MATLAB Systems Biology and Evolution Toolbox (SBEToolbox) [48]. We obtained a GSOM map of 63 nodes when SF is 0.2. We identified 64 clusters using MCODE when the threshold parameter is set to 0.9. Increasing the threshold from (0, 0.9] increased the number of clusters. We identified 66 clusters using MCL when the inflation parameter is set to 0.048. The number of clusters increases when the inflation parameter is increased. We obtained two clustering solutions, $CL1_I$ and $CL1_{II}$, employing ClusterONE. $CL1_I$ is obtained when the density parameter is set to 0.6 and ‘nodes’ is used as the seed method while $CL1_{II}$ is obtained when the density parameter is set to 0.8 and ‘unused-nodes’ is used as the seed method. $CL1_I$ resulted in 61 clusters including all 417 drugs while $CL1_{II}$ resulted in 58 clusters including only 405 drugs. In ClusterONE, choosing ‘nodes’ as the seed method enables every node to be used as a seed, and subgroups smaller than a given density are thrown away.

Table 4 summarizes NMI and SMI values for Drug Clustering Tier 2 using GSOM, MCL, $CL1_I$ and MCODE. The GSOM results are relatively higher, measuring NMI and SMI with reference to the ATC classification. The NMI and SMI values of Drug Clustering Tier 2 are 0.66 and 36.11 while they are 0.68 and 39.33 for disease profiles in Drug Clustering Tier 1. However, the NMI and SMI values of Drug Clustering Tier 2 are relatively higher than those of the other four drug profiles. Since we employed the ATC therapeutic class as the reference cluster, the results in Drug Clustering Tier 1 are more favorable towards disease profiles.

Table 4 Performance assessment of Drug Clustering Tier 2 using four different clustering algorithms

Algorithm              NMI     SMI
GSOM                   0.66    36.11
MCL                    0.59    26.49
ClusterONE ($CL1_I$)   0.56    21.37
MCODE                  0.52    11.57

We predicted new ATC therapeutic classes based on the identified majority ATC classes in the corresponding clusters, which led to reclassification of the existing drugs. In order to filter the most reliable repositioning candidates, we assigned a confidence measure for each prediction (see “Assigning confidence measure” section). We therefore filter the repositioning candidates with high confidence as reliable drug repositioning candidates. The highest confidence measures of the identified major classes are 0.85, 0.83, 0.75 and 0.5 for MCL, ClusterONE, MCODE and GSOM, respectively.

Comparing the proposed approach against existing methods
We compared the performance of the proposed two-tiered clustering approach against two recently used heterogeneous data integration methods for drug repositioning (see “Alternative approaches” section). Table 5 shows the performance assessments of these three

Hameed et al. BMC Bioinformatics (2018) 19:129 Page 11 of 18

different methods for heterogeneous data integration using GSOM algorithm only. In Drug Clustering Tier 2, GSOM demonstrates NMI and SMI of 0.66 and 36.11, respectively. The all concatenated heterogeneous feature representation method (Hy) demonstrates NMI and SMI of 0.60 and 22.26, respectively while averaging summarized heterogeneous (pairwise) similarities (Hz) demonstrates NMI and SMI of 0.64 and 33.59, respectively. There is a significant improvement in the proposed approach compared to the alternative method Hy. Even though there is no significant improvement in the proposed approach compared to the alternative method Hz, Hz fails to produce useful clusters when graph clustering algorithms are used.

Table 5 Comparison of the proposed approach against two existing methods for heterogeneous data integration
Method                                                            NMI   SMI
The proposed two-tiered clustering                                0.66  36.11
Concatenating all heterogeneous features into a single vector (Hy)  0.60  22.26
Averaging summarized heterogeneous (pairwise) similarities (Hz)   0.64  33.59

It should be noted that these three heterogeneous data integration methods did not outperform drug clusters identified by disease characteristics where NMI and SMI are 0.68 and 39.33, respectively. Since we employed ATC therapeutic class as the reference cluster, the results in Drug Clustering Tier 1 are more favorable towards disease profiles. Our proposed approach and alternative method Hz outperformed other three clusterings identified by chemical, protein and side effects profiles in Drug Clustering Tier 1 whereas alternative method Hy outperformed clusterings identified by chemical and side effects profiles in Drug Clustering Tier 1.
The alternative method Hz, explained in this study, produces a complete graph while the proposed two-tiered clustering approach involves the removal of noisy edges, resulting in a sparse graph for efficient graph clustering. The graph clustering algorithms used in this study are not able to identify useful clusters on the given complete graph where they resulted in producing only one module at all time with all three graph clustering algorithms. Therefore, the proposed two-tiered drug clustering approach as a heterogeneous data integration approach demonstrates better performance and can be considered as a reliable method for both vector-based and graph clustering.

Drug Repositioning via ATC therapeutic class
We analyzed the drug clusters identified by MCL, MCODE, CL1I and GSOM to infer useful drug repositioning candidates. In Table 6, we show 39 repositioning candidates having a minimum confidence measure of 0.5. Out of these, 4 drugs (Chlorthalidone, Thioridazine, Orphenadrine and Indomethacin) have not been assigned to ATC classification yet. We infer these unclassified Chlorthalidone, Thioridazine, Orphenadrine and Indomethacin for ATC classes C03-diuretics (confidence: 0.83), N05-psycholeptics (confidence: 0.80), R06-antihistamines (confidence: 0.64) and M01-antiinflammatory and antirheumatic (confidence: 0.57), respectively. Interestingly, in Drug Clustering Tier 1, Thioridazine is inferred to have a similar drug profile as of Fluphenazine which also belongs to ATC class N05-psycholeptics. Moreover, in the predicted repositioning list, Amlodipine is inferred to be repositioned for diseases related to renin-angiotensin system (C09) with the highest confidence measure of 0.85. Even though Amlodipine is not directly classified into C09, fixed combinations of aliskiren, valsartan, …, ACE inhibitors, etc. are already classified in C09 [15].
Different algorithms may produce different clustering solutions. However, different algorithms may have similarities too. We identified 79 reclassification predictions which are generated consistently by at least two clustering algorithms or in at least two different clusters (in ClusterONE). Table 7 summarizes 11 reclassification candidates identified consistently by at least two clustering algorithms with relatively high confidence measures (see Additional file 3 for the complete list). ClusterONE algorithm produces overlapping clusters. Therefore, some drugs are assigned to more than one cluster. Table 7 illustrates three drug reclassification candidates (Cyproheptadine, … and Dolasetron) that are identified by more than one cluster in ClusterONE results.
In this study, ATC classification is considered as the gold standard classification, therefore, we obtained clustering performance with reference to the ATC classification. We used only up to its second level classification as it captures the therapeutic uses. The drugs used in this study include 12 drugs that are not yet assigned into ATC classification. However, our method enables inferring suitable ATC classification for them (see Additional file 4 for the complete list of predictions). Moreover, the inferred new ATC codes of other drugs can be used for drug repositioning. “Clinical significance of our findings” section summarizes some clinical evidence to support these findings. We therefore suggest that cluster-based classification and reclassification into the ATC classification system is a viable method for drug repositioning.
Clustering enables partitioning the large pharmacology network into smaller subgroups and hence simplifies the drug repositioning process. Since drugs can be considered as the main component of the pharmacological networks, drug clustering provides an indirect way of clustering the networks, where associations to related entities (e.g.,


Table 6 The inferred repositioning candidates with higher confidence
Drug name          Cluster ID  Old ATC name   New ATC name  Confidence  Algorithm
Amlodipine         403         C08            C09           0.85        MCL
Chlorthalidone     2           -              C03           0.83        CL1
…                  51          N04            N05           0.80        CL1
Thioridazine       51          -              N05           0.80        CL1
Hydroxyzine        30          N05            C09           0.75        MCODE
Cyproheptadine     46          R06            N06           0.70        CL1
Amlodipine         11          C08            C09           0.70        CL1
Carvedilol         11          C07            C09           0.70        CL1
Cetirizine         11          R06            C09           0.70        CL1
…                  414         D05            D10           0.67        MCL
Brinzolamide       48          S01            L02           0.67        MCODE
Orphenadrine       392         -              R06           0.64        MCL
…                  56          C02, N02, S01  N05           0.62        CL1
Thioridazine       56          -              N05           0.62        CL1
…                  399         C01            L02           0.60        MCL
Cyproheptadine     35          R06            N06           0.59        CL1
…                  35          C02            N06           0.59        CL1
Dipivefrin         44          S01            N05           0.57        CL1
Indomethacin       7           -              M01           0.57        MCODE
…                  57          C08            N06           0.57        CL1
Cyproheptadine     4           R06            N06           0.54        CL1
…                  4           N07            N06           0.54        CL1
Arsenic Trioxide   4           L01            P01           0.50        MCODE
Atropine           48          A03, S01       N04           0.50        CL1
Atropine           393         A03, S01       N04           0.50        MCL
Dacarbazine        79          L01            A10           0.50        GSOM
Hexachlorophene    350         D08            D05           0.50        MCL
Isocarboxazid      50          N06            N05           0.50        MCODE
…                  346         N03            L01           0.50        MCL
Lithium            6           N05            N06           0.50        GSOM
Mercaptopurine     4           L01            P01           0.50        MCODE
Metformin          79          A10            L01           0.50        GSOM
Moexipril          26          C09            C07           0.50        CL1
Mycophenolic Acid  342         L04            N03           0.50        MCL
…                  66          N03            C01           0.50        GSOM
…                  350         D05            D08           0.50        MCL
Tolterodine        59          G04            C01           0.50        GSOM
Topotecan          346         L01            N03           0.50        MCL
…                  342         N03            L04           0.50        MCL
Note: ATC code names are given in Additional file 5
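To make the confidence column of Table 6 easier to interpret, the sketch below scores a single cluster. It assumes, for illustration only, that the confidence of a prediction is the share of ATC-annotated drugs in the cluster that carry the cluster's majority second-level ATC code; the definition actually used is given in the “Assigning confidence measure” section and may differ. The drug identifiers and the function names `majority_class` and `candidates` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Set, Tuple


def majority_class(cluster: Dict[str, Set[str]]) -> Tuple[Optional[str], float]:
    """Return the most frequent ATC code and its share among annotated drugs.

    Assumption: drugs without any ATC code do not count towards the denominator.
    """
    counts = Counter(code for codes in cluster.values() for code in codes)
    annotated = sum(1 for codes in cluster.values() if codes)
    if annotated == 0:
        return None, 0.0
    code, hits = counts.most_common(1)[0]
    return code, hits / annotated


def candidates(cluster: Dict[str, Set[str]], code: str) -> List[str]:
    """Drugs in the cluster that do not yet carry the majority code."""
    return sorted(drug for drug, codes in cluster.items() if code not in codes)


# Hypothetical cluster: four N05 drugs, one N04 drug, one unclassified drug.
cluster = {
    "drug_a": {"N05"},
    "drug_b": {"N05"},
    "drug_c": {"N05"},
    "drug_d": {"N05"},
    "drug_e": {"N04"},
    "drug_f": set(),
}

code, confidence = majority_class(cluster)
new_members = candidates(cluster, code)
print(code, round(confidence, 2), new_members)
```

Under this assumed definition, both the reclassified drug and the unclassified drug inherit the same confidence, which mirrors how rows from the same cluster in Table 6 share one value.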

chemical, target and phenomic) can be incorporated as a basis for clustering. Hence, the proposed two-tiered drug-centric drug clustering can be extended by employing all the other related heterogeneous data at each of the cluster levels. It enables other participating entities to present in more than one cluster. Then, new associations between chemical, target and phenome can be predicted for each of the clusters as well. Moreover, it enables investigation


Table 7 Repositioning candidates identified consistently by more than one clustering algorithm
Drug name          Cluster ID  Old ATC name   New ATC name  Confidence  Algorithm
Amlodipine         403         C08            C09           0.85        MCL
Amlodipine         11          C08            C09           0.70        CL1

Cyproheptadine     46          R06            N06           0.70        CL1
Cyproheptadine     35          R06            N06           0.59        CL1
Cyproheptadine     4           R06            N06           0.54        CL1
Cyproheptadine     56          R06            N06           0.25        MCODE

Brinzolamide       48          S01            L02           0.67        MCODE
Brinzolamide       9           S01            L02           0.17        CL1

Atropine           48          A03, S01       N04           0.50        CL1
Atropine           393         A03, S01       N04           0.50        MCL
Atropine           20          A03, S01       N04           0.46        GSOM

Metformin          79          A10            L01           0.50        GSOM
Metformin          21          A10            L01           0.33        MCODE

Mycophenolic Acid  342         L04            N03           0.50        MCL
Mycophenolic Acid  22          L04            N03           0.27        GSOM

Carbamazepine      46          N03            N05           0.43        GSOM
Carbamazepine      42          N03            N05           0.23        CL1
Carbamazepine      20          N03            N05           0.20        MCODE

Carbamazepine      46          N03            N06           0.43        GSOM
Carbamazepine      25          N03            N06           0.27        CL1
Carbamazepine      20          N03            N06           0.20        MCODE

Droperidol         28          N05            N01           0.42        GSOM
Droperidol         359         N05            N03           0.40        MCL
Droperidol         40          N05            N01           0.32        CL1
Droperidol         30          N05            N03           0.17        CL1

Fulvestrant        23          L02            A10           0.42        GSOM
Fulvestrant        13          L02            A10           0.42        CL1

Dolasetron         2           A04            A02           0.40        MCODE
Dolasetron         40          A04            A02           0.20        GSOM
Dolasetron         59          A04            L01           0.20        CL1
Dolasetron         24          A04            L01           0.12        CL1
Note: ATC code names are given in Additional file 5
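The consistency filter behind Table 7 can be illustrated as follows. The sketch groups predictions by drug and inferred ATC class and keeps those supported by at least two distinct (algorithm, cluster) pairs, which covers both agreement across algorithms and agreement across overlapping ClusterONE clusters. The input rows are copied from Tables 6 and 7; the function name `consistent_candidates` and the `min_support` parameter are assumptions made for this example, not part of the published method.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (drug, cluster ID, new ATC class, confidence, algorithm), taken from Tables 6 and 7.
Prediction = Tuple[str, int, str, float, str]

predictions: List[Prediction] = [
    ("Amlodipine", 403, "C09", 0.85, "MCL"),
    ("Amlodipine", 11, "C09", 0.70, "CL1"),
    ("Hydroxyzine", 30, "C09", 0.75, "MCODE"),
    ("Metformin", 79, "L01", 0.50, "GSOM"),
    ("Metformin", 21, "L01", 0.33, "MCODE"),
    ("Cyproheptadine", 46, "N06", 0.70, "CL1"),
    ("Cyproheptadine", 35, "N06", 0.59, "CL1"),
]


def consistent_candidates(
    preds: List[Prediction], min_support: int = 2
) -> Dict[Tuple[str, str], float]:
    """Map (drug, new ATC class) to its best confidence if enough sources agree."""
    support: Dict[Tuple[str, str], Set[Tuple[str, int]]] = defaultdict(set)
    best: Dict[Tuple[str, str], float] = {}
    for drug, cluster_id, new_atc, confidence, algorithm in preds:
        key = (drug, new_atc)
        support[key].add((algorithm, cluster_id))
        best[key] = max(best.get(key, 0.0), confidence)
    return {key: best[key] for key, sources in support.items()
            if len(sources) >= min_support}


consistent = consistent_candidates(predictions)
for (drug, new_atc), confidence in sorted(consistent.items()):
    print(f"{drug} -> {new_atc} (best confidence {confidence:.2f})")
```

Hydroxyzine is dropped here because only one clustering supports it, whereas Cyproheptadine is kept on the strength of two different ClusterONE clusters.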

of multiple links connecting drugs and may prove useful for pathway analysis.

Clinical significance of our findings
The significance of findings arising from this study is twofold; (i) correctly profile and suggest therapeutic indication for drugs that do not possess the ATC code; (ii) flag potential of some drugs to be used for other therapeutic purposes. More interestingly, the inferred therapeutic uses are significantly different to the one for which these drugs were initially developed and trialed. This section summarizes clinical evidence for four findings of this study: Chlorthalidone, Indomethacin, Metformin and Thioridazine.
Our study interestingly inferred the ATC code, C03 and therapeutic use, diuretics, for a drug known as Chlorthalidone (see Table 6), which until now does not belong to the ATC classification. Chlorthalidone is a potent diuretic; a drug that promotes water loss and is currently used in the management of hypertension or high blood pressure and fluid retention associated with heart failure [49]. In fact, Chlorthalidone has better clinical outcome in terms


of lowering blood pressure than other more commonly prescribed diuretics [50, 51].
Furthermore, Indomethacin is another drug that does not have an ATC code yet. According to our findings, Indomethacin was indicated to be used as an anti-inflammatory and anti-rheumatic agent (see Table 6). This perfectly matches the clinical situations for which this drug is used; Indomethacin is indicated for managing pain associated with inflammation, rheumatoid arthritis as well as osteoarthritis [52, 53].
Another interesting finding arising from our work relates to Metformin (see Tables 6 and 7). Metformin is used to manage type 2 diabetes and its initial classification was an oral hypoglycaemic, a drug that lowers blood sugar level [54]. In the past ten years, Metformin was also found to be therapeutically effective in other diseases such as polycystic ovarian syndrome and metabolic syndrome [55, 56]. Emerging evidence is strongly suggesting that Metformin can now be used as an adjuvant treatment in bowel and prostate cancer due to its antineoplastic properties; it can inhibit cancer growth [57, 58]. This is a significant deviation from its original therapeutic use and was correctly inferred in our study by the ATC code L01 and therapeutic class antineoplastic agent.
Furthermore, it is important to mention that our proposed drug repositioning method accurately flagged Thioridazine, a drug that does not possess an existing ATC code, as being a … agent (see Table 6). Thioridazine is clinically effective in treating patients with schizophrenia since its discovery [59, 60], however, it was withdrawn from the market in 2005 due to its ability to cause toxicity to the heart [61].

Discussion

Clustering
Clustering enables partitioning the large pharmacology network into smaller subgroups and hence simplifies the drug repositioning process. Since drugs can be considered as the main component of the pharmacological networks, drug clustering provides an indirect way of clustering the networks, where associations to related entities (e.g., chemical, target and phenomic) can be incorporated as a basis for clustering. Hence, the proposed two-tiered drug-centric drug clustering can be extended by employing all the other related heterogeneous data at each of the cluster levels. It enables other participating entities to present in more than one cluster. Then, new associations between chemical, target and phenome can be predicted for each of the clusters as well. Moreover, it enables investigation of multiple links connecting drugs and may prove useful for pathway analysis.
Clustering algorithms such as k-means, SOM, GSOM and mixture models can be employed in Drug Clustering Tier 1. But, k-means, SOM and mixture models are not suitable for drug clustering because the number of clusters and the cluster shapes need to be known and specified in advance [62]. In drug clustering, we cannot expect to have a priori knowledge about the grouping and the cluster shapes. Moreover, higher dimensional feature space in pharmacology data could potentially hinder the efficiency and effectiveness of the machine learning algorithms.
GSOM is well-suited for Drug Clustering Tier 1 and 2 because it is capable of handling higher dimensional features and the number of clusters is defined automatically. In GSOM, the parameter spread factor is used to control the size of the GSOM map or the number of clusters. This spread factor does not depend on the dimensionality of the data. Moreover, it preserves the topological order.
MCL, MCODE and ClusterONE algorithms used in Drug Clustering Tier 2 are graph clustering algorithms that are popular in the context of pharmacology data analysis. They are also capable of handling high dimensional features and the number of clusters is defined automatically. Unlike vector-based algorithms, these graph clustering algorithms are not appropriate for Drug Clustering Tier 1 because they result in clustering drugs as well as their corresponding features.
Interestingly, we observed relatively close number of drug clusters in Drug Clustering Tier 1 and Drug Clustering Tier 2 after tuning cluster parameters. Table … summarizes the parameters and their effect on generating the clusters. The number of clusters generated by GSOM using chemical, disease, gene, protein and side effect profiles are 68, 69, 66, 63 and 63, respectively. In GSOM, parameter spread factor can be used to tune the number of clusters. In Drug Clustering Tier 2, GSOM, MCL, MCODE and ClusterONE generated 63, 64, 66 and 61 clusters, respectively. In MCODE, increasing the threshold from (0, 0.9] increased the number of clusters. However, further incrementing the threshold parameter after 0.9 resulted in a decrement of the number of clusters. Interestingly, we identified 64 clusters using MCODE when the threshold parameter is set to 0.9, strengthening our confidence that the number of clusters lies around 64.
Overlapping clustering algorithms may be more suitable for drug clustering as some drugs are used to treat multiple diseases. Moreover, overlapping clusters may enable identifying more repositioning candidates. ClusterONE, MCL and MCODE algorithms used in this study can handle overlapping clusters. But, in the current analysis, we observed overlapping clusters only from ClusterONE. It should be noted that some repositioning candidates identified by ClusterONE are identified by GSOM, MCL and MCODE as well. Therefore, we believe the repositioning candidates identified by non-overlapping clusters could still be prospective candidates for further in-depth analysis. The repositioning candidates identified


by multiple clustering algorithms increase our confidence that they might be interesting.
Since the four clustering algorithms used in this study are capable of handling higher dimensional feature representations, we did not employ dimensionality reduction. Dimensionality reduction techniques may be useful to remove noisy information. But, it is not appropriate for MCL, MCODE and ClusterONE, graph clustering algorithms as they use drug-drug/drug-feature relationships to be the input.
GSOM typically uses Euclidean distance to compute the pairwise distance between input vector and weight vector. The performance of GSOM may be further improved by employing Jaccard similarity or squared Euclidean distance or taking the average distance based on multiple metrics when binary data are used.

Heterogeneous data integration
Drugs can be explained using various characteristics such as chemical, target and phenomic, etc. The primary objective of heterogeneous/multi-view data integration is to more deeply understand the predictive model and to obtain a consensus solution [63].
Multi-view data integration can be performed at the input/intermediate/output phase [63]. The proposed methods can be seen as a type of multi-view data integration. In the alternative methods Hy and Hz, data integration is performed at the input phase and the intermediate phase, respectively, while our presented two-tiered clustering approach performs data integration at the output phase. In our method, the outputs from various individual views are combined and the consensus clustering results are obtained at the second tier. Moreover, in multi-view data integration, a kernel matrix is typically used as an input for kernel classification, regression and clustering [63]. In our study, the clustering results of Drug Clustering Tier 1 are used to construct the Drug-Drug Relation matrix which can be viewed as a type of kernel matrix. Hence, the proposed method is compatible with existing kernel learning approaches.
Methods such as kernelized Bayesian matrix factorization and random walk methods can be effectively applied on bi-partite graphs as a means of data integration. Multimodal deep learning can also be applied to heterogeneous drug data integration where output of each view can be integrated into higher layers [63]. Deep Boltzmann machine would be a suitable approach for drug data clustering where binary data are considered.

Cluster evaluation
Since drugs can belong to more than one class, the classes induced from the ATC classification can have distantly related drugs which will result in a higher number of false positives in the compared clustering solution. Some drugs in other classes may share higher similarity though they have distinct uses which will also result in a higher number of false positives. Moreover, many ATC classes have very high intra-cluster variations which will result in a higher number of false negatives. Therefore, we cannot expect the identified drug clusters to be highly correlated with ATC classification.
Using Silhouette values to fine tune the parameters would be another approach that we could use when determining the number of clusters. But, it should be noted that the mean Silhouette value of ATC classification based on chemical, disease, gene, protein and side effects are −0.31, −0.06, −0.49, −0.25 and −0.33, respectively which illustrates the higher variations within ATC classes. Hence, higher variations within drug clusters are expected. Hence fine tuning the parameters of the clustering algorithms in comparison to ATC classification may not be very accurate.
According to Silhouette values of ATC classification, obtaining a clustering close to the ATC classification is challenging due to the large variation within ATC classes, misclassifications and missing information in the ATC classification. The mismatches between the clustering solution and the ATC classification arise due to the identified new drug classes (drug-drug similarities) for the existing drugs. As explained in “Drug repositioning via ATC therapeutic classes” section, the reclassification into ATC therapeutic classes can be interpreted as repositioning opportunities. Also, the clustering solutions enable identifying more useful drug-drug relationships.
External clustering evaluation is an important task though it is challenging. Consequently, various external clustering comparison measures have been proposed. Pair-counting based measures include RI and ARI while MI, NMI and AMI are information theoretic based measures useful to compare clustering solutions against a reference clustering. There is no clear evidence that one measure is superior to another. NMI, AMI and SMI are the adjusted measures for MI and have important benefits [46]. Moreover, SMI is proportional to AMI. We therefore performed clustering evaluation using NMI and SMI.
It is important that the drugs within a cluster are more similar to each other than the other drugs. Our primary objective of this study is not to present a model to predict the ATC classification. Fine tuning parameters against an external reference may not be a better option since our broader focus is to determine the repositioning candidates where they deviate from the current ATC class. Moreover, the false positives predicted by the clusters is not necessarily an undesired result and optimizing clusters for NMI and SMI measure might prevent us from detecting interesting novel clusters or repositioning candidates.
Drug Clustering Tier 2 achieved 11.9, 4.8, and 13.8% gain in NMI compared to chemical, protein, and side effect, respectively of Drug Clustering Tier 1 whereas there is


a 2.9% loss in NMI compared to disease profile of Drug Clustering Tier 1. Since we employed ATC therapeutic class as the reference cluster, the results in Drug Clustering Tier 1 are more favorable towards disease profiles.
The predicted clusters that do not provide a higher Silhouette value is not necessarily an undesired result. Moreover, not all identified clusters may be useful for drug repositioning. As explained in “Assigning confidence measure” section, we defined a confidence measure so that we can identify the highly probable repositioning candidates. The drug repositioning candidates that are commonly identified by multiple clustering algorithms also have higher probability to be chosen as repositioning candidates. As explained in “Drug Clustering Tier 2” section in “Methods” section, drug-drug relation matrix represents a kernel matrix or similarity matrix, illustrating the similarity means of cluster relationships of drugs. Hence, the drug-drug relational matrix can be straightforwardly incorporated in kernel-based supervised and unsupervised learning methods such as support vector machines, spectral clustering, multiple kernel learning, etc [63, 64].

Conclusions
Computational drug repositioning provides new strategies for drug development. It has been argued that using heterogeneous features results in better drug repositioning predictions. In this study, we proposed an unsupervised learning approach to achieve drug repositioning by, first, performing drug-centric drug clustering and, second, associating inferred clusters to ATC therapeutic classes based on known drug classifications. Moreover, the proposed two-tiered clustering approach enables drug clustering through heterogeneous data integration. The drug clustering based on core drug features produces clusters that align well with the existing ATC classification levels. The repositioning candidates identified consistently by multiple clustering algorithms and with high confidence have a higher possibility for reliable drug repositioning. Furthermore, the identified drug clusters can be used as an intermediate source to explore drug similarities. The clinical significance of the predicted results also suggests that the proposed two-tiered clustering approach can be safely used to infer new ATC code as well as new therapeutic uses based on the given drug characteristics.

Additional files
Additional file 1: Silhouette analysis for ATC classification and GSOM clustering. This file includes the figures illustrating the Silhouette values of drugs based on ATC classification and GSOM clustering using chemical, disease, gene, protein and side effect profiles. (PDF 224 kb)
Additional file 2: The 26 pairs of drugs which occur together in Drug Clustering Tier 1. These drug pairs occur together in each drug cluster, generated based on individual chemical, disease, protein and side effect profiles. (PDF 59 kb)
Additional file 3: The repositioning candidates identified consistently by at least two clustering algorithms. This includes the complete list of consistent repositioning candidates, algorithm names and their confidence measures. (PDF 104 kb)
Additional file 4: The complete prediction list. This includes the complete list of predicted new classifications into ATC therapeutic class and their confidence measures. (PDF 892 kb)
Additional file 5: The ATC code list. This includes the ATC codes (Second Level) and the corresponding ATC therapeutic class names. (PDF 363 kb)

Abbreviations
ATC: Anatomical therapeutic chemical; ClusterONE: Clustering with overlapping neighborhood expansion; DDR: Drug-drug relation; GSOM: Growing self organizing map; MCODE: Molecular complex detection; NMI: Normalized mutual information; SF: Spread factor; SMI: Standardized mutual information

Acknowledgments
Not applicable.

Funding
PNH is fully supported by the PhD scholarships of The University of Melbourne and partially supported by NICTA scholarship of National ICT Australia, now Data61 since merging CSIRO’s Digital Productivity team. Article processing charge is funded by Australian Research Council Discovery Grant DP150103512.

Availability of data and materials
The data supporting the results of this article are cited within the article.

Authors’ contributions
PNH, KV and SH designed the experiment and evaluation methodology. PNH conceived the idea, proposed specific methods, collected ATC classification in 2016, implemented the methods, assessed the performance and drafted the manuscript. SK provided the clinical significance of the predicted repositioning candidates. KV and SH contributed to the writing and editing of the manuscript. All authors approved the final draft.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details
1 Department of Mechanical Engineering, University of Melbourne, Parkville, 3010 Melbourne, Australia. 2 Data61, Victoria Research Lab, West Melbourne 3003, Australia. 3 Department of Computer Science, University of Ruhuna, Matara 81000, Sri Lanka. 4 Department of Computing and Information Systems, University of Melbourne, Parkville, 3010 Melbourne, Australia. 5 Department of Nursing, University of Melbourne, Parkville, Melbourne 3010, Australia. 6 The Florey Institute of Neuroscience and Mental Health, University of Melbourne, Parkville, Melbourne 3010, Australia. 7 Research School of Engineering, College of Engineering & Computer Science, The Australian National University, 2601 Canberra, ACT, Australia.

Received: 28 September 2017 Accepted: 21 March 2018

References
1. Dudley JT, Deshpande T, Butte AJ. Exploiting drug–disease relationships for computational drug repositioning. Brief Bioinforma. 2011;12:013.
2. Napolitano F, Zhao Y, Moreira VM, Tagliaferri R, Kere J, D’Amato M, Greco D. Drug repositioning: a machine-learning approach through data integration. J Cheminformatics. 2013;5:30.


3. Li J, Zheng S, Chen B, Butte AJ, Swamidass SJ, Lu Z. A survey of current trends in computational drug repositioning. Brief Bioinforma. 2016;17(1):2–12.
4. U Sahu N, S Kharkar P. Computational drug repositioning: A lateral approach to traditional drug discovery? Curr Top Med Chem. 2016;16(19):2069–77.
5. Yamanishi Y, Kotera M, Kanehisa M, Goto S. Drug-target interaction prediction from chemical, genomic and pharmacological data in an integrated framework. Bioinformatics. 2010;26(12):246–54.
6. Suthram S, Dudley JT, Chiang AP, Chen R, Hastie TJ, Butte AJ. Network-based elucidation of human disease similarities reveals common functional modules enriched for pluripotent drug targets. PLoS Comput Biol. 2010;6(2):1000662.
7. Berger SI, Iyengar R. Network analyses in systems pharmacology. Bioinformatics. 2009;25(19):2466–72.
8. Yildirim MA, Goh K-I, Cusick ME, Barabasi A-L, Vidal M. Drug-target network. Nat Biotechnol. 2007;25(10):1119–26.
9. Lee M, Park K, Kim D. Interaction network among functional drug groups. BMC Syst Biol. 2013;7(3):1.
10. Wu C, Gudivada RC, Aronow BJ, Jegga AG. Computational drug repositioning through heterogeneous network clustering. BMC Syst Biol. 2013;7(Suppl 5):6.
11. Chen L, Zeng W-M, Cai Y-D, Feng K-Y, Chou K-C. Predicting anatomical therapeutic chemical (atc) classification of drugs by integrating chemical-chemical interactions and similarities. PloS ONE. 2012;7(4):35254.
12. Cheng F, Li W, Wu Z, Wang X, Zhang C, Li J, Liu G, Tang Y. Prediction of polypharmacological profiles of drugs by the integration of chemical, side effect, and therapeutic space. J Chem Inf Model. 2013;53(4):753–62.
13. Tari LB, Patel JH. Systematic drug repurposing through text mining. Biomed Lit Min. 2014;1159:253–67.
14. Xu R, Wang Q. Large-scale extraction of accurate drug-disease treatment pairs from biomedical literature for drug repurposing. BMC Bioinformatics. 2013;14(1):181.
15. World Health Organization. Anatomical Therapeutic Chemical (ATC) Classification System. 2016. http://www.whocc.no.
16. Sun Y, Hameed PN, Verspoor K, Halgamuge S. A physarum-inspired prize-collecting steiner tree approach to identify subnetworks for drug repositioning. BMC Syst Biol. 2016;10(5):25.
17. Lin S-F, Xiao K-T, Huang Y-T, Chiu C-C, Soo V-W. Analysis of adverse drug reactions using drug and drug target interactions and graph-based methods. Artif Intell Med. 2010;48(2):161–6.
18. Campillos M, Kuhn M, Gavin A-C, Jensen LJ, Bork P. Drug target identification using side-effect similarity. Science. 2008;321(5886):263–6.
19. Hartsperger ML, Blöchl F, Stümpflen V, Theis F. Structuring heterogeneous biological information using fuzzy clustering of k-partite graphs. BMC Bioinformatics. 2010;11(1):522.
20. Klamt S, Haus U-U, Theis F. Hypergraphs and cellular networks. PLoS Comput Biol. 2009;5(5):1000385.
21. Tatonetti NP, Fernald GH, Altman RB. A novel signal detection algorithm for identifying hidden drug-drug interactions in adverse event reports. J Am Med Inform Assoc. 2012;19(1):79–85.
22. Zhang K, Chai Y, Yang SX. Self-organizing feature map for cluster analysis in multi-disease diagnosis. Expert Syst Appl. 2010;37(9):6359–67.
23. Zhou B, Wang R, Wu P, Kong D-X. Drug repurposing based on drug–drug interaction. Chem Biol Drug Des. 2015;85(2):137–44.
24. Strehl A, Ghosh J. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J Mach Learn Res. 2002;3(Dec):583–617.
25. Zhao X-M, Iskar M, Zeller G, Kuhn M, Van Noort V, Bork P. Prediction of drug combinations by integrating molecular and pharmacological data. PLoS Comput Biol. 2011;7(12):1002323.
26. Shi J-Y, Li J-X, Lu H-M. Predicting existing targets for new drugs base on strategies for missing interactions. BMC Bioinformatics. 2016;17(8):282.
27. Vilar S, Hripcsak G. The role of drug profiles as similarity metrics: applications to repurposing, adverse effects detection and drug–drug interactions. Brief Bioinforma. 2016;18(4):670–81.
28. Chen L, Lu J, Zhang N, Huang T, Cai Y-D. A hybrid method for prediction and repositioning of drug anatomical therapeutic chemical classes. Mol BioSyst. 2014;10(4):868–77.
29. Wang F, Zhang P, Cao N, Hu J, Sorrentino R. Exploring the associations between drug side-effects and therapeutic indications. J Biomed Inform. 2014;51:15–23.
30. Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Bryant SH. Pubchem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009;37(suppl 2):623–33.
31. Bodenreider O. The unified medical language system (umls): integrating biomedical terminology. Nucleic Acids Res. 2004;32(suppl 1):267–70.
32. Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M. Drugbank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008;36(suppl 1):901–6.
33. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, et al. Uniprot: the universal protein knowledgebase. Nucleic Acids Res. 2004;32(suppl 1):115–9.
34. Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol. 2010;6(1):343.
35. Alahakoon D, Halgamuge SK, Srinivasan B. Dynamic self-organizing maps with controlled growth for knowledge discovery. IEEE Trans Neural Netw. 2000;11(3):601–14.
36. Hsu AL, Tang S-L, Halgamuge SK. An unsupervised hierarchical dynamic self-organizing approach to cancer class discovery and marker gene identification in microarray data. Bioinformatics. 2003;19(16):2131–40.
37. Van Dongen S. Graph clustering by flow simulation. 2001.
38. Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002;30(7):1575–84.
39. Nepusz T, Yu H, Paccanaro A. Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods. 2012;9(5):471–2.
40. Bader GD, Hogue CW. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003;4(1):1.
41. Kohonen T, Maps S. Self-organizing Maps. Springer; 1995, p. 30.
42. Liu M, Wu Y, Chen Y, Sun J, Zhao Z, Chen X-W, Matheny ME, Xu H. Large-scale prediction of adverse drug reactions using chemical, biological, and phenotypic properties of drugs. J Am Med Inform Assoc. 2012;19(e1):28–35.
43. Hassani M, Seidl T. Using internal evaluation measures to validate the quality of diverse stream clustering algorithms. Vietnam J Comput Sci. 2017;4(3):171–83.
44. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53–65.
45. Rendón E, Abundez I, Arizmendi A, Quiroz E. Internal versus external cluster validation indexes. Int J Comput Commun. 2011;5(1):27–34.
46. Romano S, Bailey J, Nguyen XV, Verspoor K. Standardized mutual information for clustering comparisons: One step further in adjustment for chance. In: ICML. Beijing: International Conference on Machine Learning; 2014. p. 1143–51.
47. Chan C-KK, Hsu AL, Halgamuge SK, Tang S-L. Binning sequences using very sparse labels within a metagenome. BMC Bioinformatics. 2008;9(1):1.
48. Konganti K, Wang G, Yang E, Cai JJ. Sbetoolbox: a matlab toolbox for biological network analysis. Evol Bioinforma. 2013;9:355.
49. Roush GC, Kaur R, Ernst ME. Diuretics: a review and update. J Cardiovasc Pharmacol Ther. 2014;19(1):5–13.
50. Bakris GL, Sica D, White WB, Cushman WC, Weber MA, Handley A, Song E, Kupfer S. Antihypertensive efficacy of hydrochlorothiazide vs chlorthalidone combined with azilsartan medoxomil. Am J Med. 2012;125(12):1229–1.
51. Ernst ME, Carter BL, Goerdt CJ, Steffensmeier JJ, Phillips BB, Zimmerman MB, Bergus GR. Comparative antihypertensive effects of hydrochlorothiazide and chlorthalidone on ambulatory and office blood pressure. Hypertension. 2006;47(3):352–8.
52. Rossi S, Calabretto J-P, Patterson C. Australian Medicines Handbook. Adelaide: AMH Pty Ltd; 2017.
53. Crilly MA, Mangoni AA. Non-steroidal anti-inflammatory drug (nsaid) related inhibition of aldosterone glucuronidation and arterial dysfunction in patients with rheumatoid arthritis: a cross-sectional clinical study. BMJ Open. 2011;1(1):000076.
54. Duhault J, Lavielle R. History and evolution of the concept of oral therapy in diabetes. Diabetes Res Clin Pract. 1991;14:9–13.
55. Bianchi C, Penno G, Romero F, Del Prato S, Miccoli R. Treating the metabolic syndrome. Expert Rev Cardiovasc Ther. 2007;5(3):491–506.
56. Diamanti-Kandarakis E, Economou F, Palimeri S, Christakou C. Metformin in polycystic ovary syndrome. Ann N Y Acad Sci. 2010;1205(1):192–8.
57. Coyle C, Cafferty F, Vale C, Langley R. Metformin as an adjuvant treatment for cancer: a systematic review and meta-analysis. Ann Oncol. 2016;27(12):2184–95.

Hameed et al. BMC Bioinformatics (2018) 19:129 Page 18 of 18



4.2 Results analysis using the missing ATC codes in the published manuscript

We are thankful to the anonymous thesis examiner who noticed the ATC class codes for Thioridazine, Chlorthalidone and Indomethacin. In this study, we obtained the ATC classification from https://www.whocc.no/atc_ddd_index/. Unfortunately, we did not obtain the ATC classification codes for the 12 drugs shown in Table 4.1. We noticed that there is a drug name conflict between the dataset used in our study and the ATC classification system.

However, it should be noted that the unsupervised clustering approach proposed in this study does not rely on the class labels of the drugs; it clusters drugs without any prior knowledge of the class labels. Therefore, the clusters predicted in this study are completely independent of the ATC codes. ATC codes are used only in the post-processing phase to infer drug repositioning candidates.

It can be seen from Table 4.1 that our method has correctly inferred the second-level ATC code for 3 of the 11 drugs that already possess an ATC class: Chlorthalidone (C03), Thioridazine (N05) and Indomethacin (M01), with confidence scores of 0.83, 0.80 and 0.57, respectively. These findings were explained in the published manuscript, where these drugs were categorized as not possessing an ATC code. The confidence scores of Aspirin, Hydroxyurea, Glycopyrrolate, Moricizine, Methimazole, Ethacrynic Acid and Nitroglycerin are below 0.5 because they are assigned to clusters that include multiple ATC classes. Therefore, we did not recommend those predictions as reliable drug repositioning candidates. For instance, Methimazole is clustered with 6 other drugs from six different ATC classes, so the confidence score of this prediction was 0.17.

In this section, we apply all the available ATC codes, including those of the drugs shown in Table 4.1, to reassess the proposed two-tiered clustering approach and the inferred drug repositioning candidates.

Table 4.1: Drugs that are considered to be not in the ATC classification system

Drug name in     Drug name in the ATC     ATC code                              Predicted ATC code             Confidence  Predicted
our dataset      classification system                                          (second level)                 score       algorithm

Chlorthalidone                            C03BA04                               C03                            0.83        CL1
Thioridazine     Thioridazine             N05AC02                               N05                            0.80        CL1
Orphenadrine     Orphenadrine (Chloride)  N04AB02                               R06                            0.64        MCL
Indomethacin     Indometacin              C01EB03, M01AB01, M02AA23, S01BC01    M01                            0.57        MCODE
Aspirin          Acetylsalicylic acid     A01AD05, B01AC06, N02BA01             M01                            0.47        CL1
Hydroxyurea      Hydroxycarbamide         L01XX05                               A02                            0.40        MCODE
Glycopyrrolate   Glycopyrronium           A03AB02, R03BB06                      A10, D11, L01                  0.33        MCODE
Moricizine       Moracizine               C01BG01                               C01, B01, N03, N05, N06        0.20        MCODE
Methimazole      Thiamazole               H03BB02                               H03, A10, L01, N04, N05, R03   0.17        MCODE
Cysteamine                                                                      L01                            0.15
Ethacrynic Acid  Etacrynic acid           C03CC01                               A10, N03                       0.12        CL1
Nitroglycerin    Glyceryl trinitrate      C01DA02, C05AE01                      L01                            0.12        CL1

4.2.1 Results analysis

Approximately 17% of the 411 drugs are assigned to multiple ATC classes. When M drugs are each assigned to at least 2 ATC classes, at least 2M-1 solutions would need to be analyzed. It should be noted that a drug in the selected dataset is assigned to at most 8 ATC classes (Prednisolone and Dexamethasone are each assigned to 8 different ATC classes). Moreover, computing the NMI and SMI values for one instance of a random reference class list takes 0.002 seconds and 24.55 minutes, respectively, using MATLAB R2015b on a computer with 8 GB RAM and an Intel(R) Core(TM) i7-3770 CPU. Therefore, in this study, we randomly selected one ATC class for each drug having multiple ATC classes when constructing the reference class list.

In this section, we compare the performance of Drug Clustering Tier 1 and Tier 2 using 5 random reference class lists that include all available ATC classes. Moreover, we quote the performance measures from the published manuscript to provide a clear comparison.
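The construction of a random reference class list can be sketched as follows. This is a minimal illustration rather than the MATLAB code used in the study: the function name `sample_reference_classes` and the seeding scheme are assumptions, and the toy input uses the second-level classes of three drugs from Table 4.1.

```python
import random


def sample_reference_classes(drug_to_classes, seed=None):
    """Pick one ATC class per drug, uniformly at random among its classes.

    `drug_to_classes` maps a drug name to the set of its second-level ATC
    classes. The function name and the seeding are illustrative assumptions.
    """
    rng = random.Random(seed)
    return {
        drug: rng.choice(sorted(classes))
        for drug, classes in drug_to_classes.items()
    }


# Toy input: second-level ATC classes derived from the codes in Table 4.1.
drug_to_classes = {
    "Thioridazine": {"N05"},
    "Glycopyrrolate": {"A03", "R03"},
    "Aspirin": {"A01", "B01", "N02"},
}

# Five random reference class lists, one per (hypothetical) seed.
reference_lists = [
    sample_reference_classes(drug_to_classes, seed=s) for s in range(5)
]
```

Each list assigns exactly one class to every drug, so a single-label measure such as NMI can be computed against it.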

Drug Clustering Tier 1

Table 4.2 summarizes the NMI and SMI values for Drug Clustering Tier 1. These measures illustrate that there is no significant difference in NMI and SMI values across the five random reference class lists. The mean NMI values of the clusters identified by Chemical, Disease, Gene, Protein and Side effect characteristics are 0.58, 0.68, 0.45, 0.62 and 0.58, respectively. The mean SMI values of the clusters identified by Chemical, Disease, Gene, Protein and Side effect characteristics are 19.68, 38.97, 2.71, 30.81 and 21.20, respectively.

Moreover, it should be noted that the NMI values of the clusters identified by Chemical, Disease, Gene, Protein and Side effect characteristics in the published manuscript are 0.59, 0.68, 0.45, 0.63 and 0.58, respectively. The SMI values of the clusters identified by Chemical, Disease, Gene, Protein and Side effect characteristics in the published manuscript are 20.09, 39.33, 2.91, 30.38 and 21.07, respectively. Therefore, there is no significant difference from the NMI and SMI values that were shown in the published manuscript (please see pages 75-76 for more details).

Table 4.2: Performance assessment of Drug Clustering Tier 1

Reference list   Measure   Chemical   Disease   Gene    Protein   Side effect

Random 1         NMI       0.58       0.68      0.45    0.62      0.58
                 SMI       20.05      39.89     2.90    30.92     21.45
Random 2         NMI       0.58       0.67      0.45    0.62      0.58
                 SMI       18.91      38.86     2.42    30.80     20.88
Random 3         NMI       0.58       0.67      0.45    0.62      0.58
                 SMI       19.92      38.24     2.38    30.75     20.78
Random 4         NMI       0.58       0.68      0.46    0.63      0.59
                 SMI       19.73      38.75     3.17    30.90     22.31
Random 5         NMI       0.58       0.68      0.45    0.63      0.58
                 SMI       19.40      38.76     2.49    31.08     20.73

Drug Clustering Tier 2

Table 4.3 summarizes the NMI and SMI values for Drug Clustering Tier 2. These measures illustrate that there is no significant difference in NMI and SMI values across the five random reference class lists. The mean NMI values of the clusters identified by GSOM, MCL, CL1 and MCODE are 0.65, 0.59, 0.55 and 0.51, respectively. The mean SMI values of the clusters identified by GSOM, MCL, CL1 and MCODE are 35.49, 26.31, 21.49 and 11.32, respectively.

Moreover, it should be noted that the NMI values of the clusters identified by GSOM, MCL, CL1 and MCODE in the published manuscript are 0.66, 0.59, 0.56 and 0.52, respectively. The SMI values of the clusters identified by GSOM, MCL, CL1 and MCODE in the published manuscript are 36.11, 26.49, 21.37 and 11.57, respectively. Therefore, there is no significant difference from the NMI and SMI values that were shown in the published manuscript (please see pages 75-76 for more details).

Table 4.3: Performance assessment of Drug Clustering Tier 2

Reference list   Measure   GSOM     MCL      CL1      MCODE

Random 1         NMI       0.65     0.59     0.55     0.52
                 SMI       36.51    27.07    21.90    11.93
Random 2         NMI       0.64     0.59     0.55     0.51
                 SMI       34.82    26.76    21.32    11.63
Random 3         NMI       0.64     0.59     0.55     0.51
                 SMI       34.43    26.04    21.40    10.99
Random 4         NMI       0.65     0.59     0.55     0.52
                 SMI       36.17    26.25    21.15    11.21
Random 5         NMI       0.65     0.58     0.55     0.51
                 SMI       35.52    25.46    21.66    10.84

Drug repositioning via ATC classification

In Table 6 of the published manuscript (on page 78), we showed 39 drug repositioning candidates having a minimum confidence score of 0.5. With the new ATC class list, the list of repositioning candidates having a minimum confidence score of 0.5 does not change significantly, except for one instance, which is shown in Table 4.4. In the published paper, Metformin is repositioned for the therapeutic class of antineoplastic agents, represented by L01, with a confidence score of 0.50. With the new ATC class list, this score increases to 0.67. In GSOM, Metformin is identified in a cluster with Dacarbazine and Hydroxyurea. Previously, we were unaware that the ATC code of Hydroxyurea is L01 (as it is given as Hydroxycarbamide in the ATC classification system). Since 2 of the 3 drugs in this cluster (Dacarbazine and Hydroxyurea) belong to the L01 therapeutic class, we can reposition Metformin for L01 with the increased confidence score of 0.67.
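The change in the Metformin score can be reproduced with a small sketch. The formula below is an assumption inferred from the numbers quoted in this section (the share of drugs with a known ATC class in a cluster that belong to the candidate class); it is not copied from the published manuscript. The function name `confidence`, the one-class-per-drug simplification and the A10 label for Metformin are likewise assumptions made for illustration.

```python
def confidence(cluster, atc_class, target):
    """Share of labelled drugs in `cluster` whose ATC class equals `target`.

    Assumed definition, inferred from the scores quoted in this section.
    Drugs without a known ATC class are ignored in the denominator.
    """
    labelled = [drug for drug in cluster if drug in atc_class]
    if not labelled:
        return 0.0
    hits = sum(1 for drug in labelled if atc_class[drug] == target)
    return hits / len(labelled)


# GSOM cluster containing Metformin, as described in the text.
cluster = ["Metformin", "Dacarbazine", "Hydroxyurea"]

# Before: Hydroxyurea has no known ATC class (A10 for Metformin is assumed).
old_labels = {"Metformin": "A10", "Dacarbazine": "L01"}
# After: Hydroxyurea is recognised as L01 (Hydroxycarbamide).
new_labels = dict(old_labels, Hydroxyurea="L01")

old_score = round(confidence(cluster, old_labels, "L01"), 2)
new_score = round(confidence(cluster, new_labels, "L01"), 2)
```

Under this assumed definition, the old and new scores match the 0.50 and 0.67 reported for Metformin.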

Table 4.4: The changes in the inferred repositioning candidates with higher confidence

Drug name   Cluster ID   Old ATC class   New ATC class   Confidence score   Predicted algorithm

Metformin   79           N06             L01             0.67               GSOM

In the published manuscript, we have already explained the clinical significance of this finding.

In the proposed unsupervised two-tiered clustering approach, the drugs are grouped without any prior knowledge of their ATC classes. Therefore, only the repositioning candidates and confidence scores derived from the clusters affected by the new ATC codes have changed. Moreover, predictions with a confidence score below 0.5 are not considered reliable. Further, we can conclude that the methods proposed in this study are valid and applicable to drug repositioning as well as to predicting the ATC codes of any drug.

4.3 Summary

This chapter introduces an unsupervised learning approach to grouping drugs based on various drug characteristics. It is a two-tiered clustering approach. In the first tier, drugs are clustered based on homogeneous characteristics. In the second tier, drugs are clustered based on an integrated summary of these homogeneous characteristics into a heterogeneous similarity. The clusters identified at the second tier are investigated to identify useful drug repositioning candidates for multiple diseases via the Anatomical Therapeutic Chemical Classification. Importantly, this approach can infer ATC classes for drugs that do not possess an ATC code. Moreover, a confidence measure is introduced to filter the highly probable drug repositioning candidates. The two-tiered unsupervised clustering approach can be used to construct a pairwise similarity model in other applications such as drug-drug interaction prediction. In Chapter 5, this approach is used to generate sparse graphs from complete drug similarity networks.

Chapter 5

Employing subnetwork identification to repositioning drugs using ATC classification

Some content of this chapter has been extracted from the published manuscript: Sun, Y.+, Hameed, P. N.+, Verspoor, K., & Halgamuge, S. (2016). A physarum-inspired prize-collecting steiner tree approach to identify subnetworks for drug repositioning. BMC systems biology, 10(5), 128.

5.1 Background

Drug repositioning aims to identify new uses for existing drugs. Methods to achieve this aim are typically based on the similarity of drugs. It is believed that drugs with similar anatomical, chemical, therapeutic, phenotypic and other characteristics will demonstrate similar therapeutic and adverse drug effects [4, 7, 36, 102]. Network-based methods are among the emerging methods for drug repositioning; they use networks/graphs to represent pharmacological data [103]. Network decomposition or network partitioning is a fundamental concept used in these methods [35, 36, 103]. They reposition drugs by identifying drug candidates in multiple decomposed subnetworks.

Subnetwork identification is a technique that identifies a single small-scale subnetwork from a large network. It differs from previous network-based methods in that only a single identified subnetwork, focusing on a single therapeutic use, needs to be analyzed. Though subnetwork identification is not popular for pharmacology data analysis, it has already been proven to be efficient in simplifying the visualization and interpretation of other biological networks [104–110]. In our recently published article entitled ‘A physarum-inspired prize-collecting steiner tree approach to identify subnetworks for drug repositioning’ [111], subnetwork identification is proposed as a reliable method for drug repositioning. The published article is attached in Appendix A for further reference.

Subnetwork identification identifies a subgraph from a large drug network that connects a set of given drugs called ‘terminals’ while minimizing the net cost of ‘edge-costs’ and ‘vertex-prizes’. We selected the terminal drugs according to the Anatomical Therapeutic Chemical (ATC) classification system so that the terminals are selected with a specific scope rather than at random. Herein, the ATC second level classification is used to select the terminal drugs as it captures the therapeutic uses; hence, meaningful subnetworks are identified. The drugs in the identified subnetwork should share similar characteristics with the selected terminals. In our published article [111], the suitability of the physarum-inspired subnetwork identification method for identifying repositioning candidates for cardiovascular diseases is investigated.

Herein, the drug repositioning problem is designed as a semi-supervised learning problem where some of the drugs used for a ‘single therapeutic use (known)’ are initialized as labeled data by incorporating the ATC classification. The Drug Similarity Network proposed in this chapter represents pairwise similarity with respect to the selected terminal drugs. The Physarum-inspired subnetwork identification algorithm proposed by Yahui Sun can be applied to solve such semi-supervised learning problems. In contrast to the methods proposed in Chapter 4, subnetwork identification infers a single subnetwork for a single ATC class at a time.

The Prize-Collecting Steiner Tree (PCST) approach has gained attention in subnetwork identification because it is heuristic (i.e. it is not an exact solution) but deterministic. PCST aims to find a connected subgraph which contains all the terminals while minimizing the total edge-cost and maximizing the total vertex-prize; the optimal solution of PCST is called the Steiner Minimum Tree. In drug repositioning, the selected terminals should represent the same therapeutic purposes.

Even though subnetwork identification is popular as an efficient technique to simplify the visualization and interpretation of other biological networks, this approach has not been applied to drug networks so far. The Physarum-inspired subnetwork identification algorithm (PSIA) proposed in our published article [111] follows the PCST approach for subnetwork identification on Drug Similarity Networks (DSNs). It requires a positive number of terminal vertices to be determined at the initial phase. The algorithm finds a low-cost and high-prize subnetwork including all terminals.

In this chapter, we explain how the ATC classification can be used to construct a Drug Similarity Network (DSN), in the context of drug repositioning as a Prize-Collecting Steiner Tree Problem (PCSTP). Moreover, we demonstrate the suitability of the proposed approach by employing it to identify repositioning candidates for diseases related to the nervous system (i.e. ATC class-N). By applying it to another ATC class with different characteristics, we demonstrate the generalization of this method for drug repositioning. Further, we illustrate the significance of PSIA for drug repositioning. According to the literature, Prize-Collecting Steiner Tree algorithms perform well on sparse graphs [111, 112]. Hence, in this chapter, we also explore the suitability of PCST-based subnetwork identification using different representations of sparse DSNs, including the two-tiered drug clustering approach proposed in Chapter 4.

5.2 The proposed method

In this study, drug repositioning is perceived as a semi-supervised learning problem where known drugs from a therapeutic group are considered as labeled data. Here, the hierarchy of the ATC classification is employed to initialize the labeled data. Majority voting is a useful machine learning approach to obtain consensus results. Employing the ATC classification to design multiple DSNs provides a platform from which consensus results can be obtained.

Prize-Collecting Steiner Tree (PCST) is used for identifying a connected subgraph G’ = (V’, E’) of a given undirected graph G = (V, E, p, c), where V is the set of vertices, E is the set of edges, p is the prize, a function which maps each vertex in V to a non-negative number, c is the cost, a function which maps each edge in E to a positive number, V’ ⊆ V and E’ ⊆ E. The subgraphs are identified by means of a set of terminals, T ⊂ V, so that the identified subgraph G’ connects all the terminals while minimizing the objective function $c(G') = \sum_{e \in E'} c(e) - \sum_{v \in V'} p(v)$.

A required component of this approach is an undirected graph. Hence, Drug Similarity Networks (DSNs) are constructed as explained in Section 5.2.1. In this study, the hierarchy of the ATC classification system is taken into account when constructing multiple DSNs. Here, we explore repositioning candidates for diseases related to the nervous system. In the ATC classification, ATC class-N includes 112 drugs related to the nervous system. Based on the ATC second level classification, these 112 drugs of ATC class-N have been divided into 7 subclasses, where some drugs are present in multiple subclasses. Hence, 7 different DSNs are generated by varying the terminals (see Section 5.2.1).

Similar to our published article [111], two PCST algorithms are employed to identify subnetworks, but from different DSNs and with the aim of identifying repositioning candidates for a different type of disease (diseases related to the nervous system). The DSNs are created so that the vertex-prize represents the drug-terminal similarity and the edge-cost represents the drug-drug dissimilarity.
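To make the PCST objective concrete, the sketch below evaluates the net cost of a candidate subgraph and checks that it is connected and contains all terminals. It is an illustration only: the helper names `net_cost` and `is_feasible` and the four-vertex toy graph are hypothetical, and the sketch scores given candidates rather than solving the PCST problem.

```python
def net_cost(vertices, edges, prize, cost):
    """Objective c(G') = sum of edge costs minus sum of vertex prizes."""
    return sum(cost[e] for e in edges) - sum(prize[v] for v in vertices)


def is_feasible(vertices, edges, terminals):
    """True if the subgraph is connected and contains every terminal."""
    vertices = set(vertices)
    if not vertices or not set(terminals) <= vertices:
        return False
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        if u not in vertices or v not in vertices:
            return False
        adjacency[u].add(v)
        adjacency[v].add(u)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:  # depth-first search over the candidate subgraph
        for w in adjacency[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices


# Hypothetical toy graph: prizes on vertices, costs on undirected edges.
prize = {"a": 0.5, "b": 0.2, "c": 0.4, "d": 0.1}
cost = {("a", "b"): 0.3, ("b", "c"): 0.6, ("a", "c"): 0.9, ("c", "d"): 0.8}
terminals = {"a", "c"}

via_b = ({"a", "b", "c"}, [("a", "b"), ("b", "c")])  # detour through b
direct = ({"a", "c"}, [("a", "c")])                  # direct edge a-c

cost_via_b = net_cost(*via_b, prize, cost)    # about -0.2
cost_direct = net_cost(*direct, prize, cost)  # about 0.0
```

In this toy case the detour through b has the lower net cost, because the prize collected at b outweighs the extra edge cost; this is the trade-off that the objective function encodes.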

5.2.1 Drug Similarity Network

We used four different drug characteristics following the work of Wang et al. [17]: drug-chemical [15], drug-therapeutic [64], drug-protein [15] and drug-phenotype [16]. Each drug has 881, 719, 775 and 1385 features in the chemical, therapeutic, protein and side effect domains, respectively, and the four datasets have 548 drugs in common. Appendix B includes the list of 548 drugs used in this study. The created Drug Similarity Networks (DSNs) include the following components:

1. Vertex (V): a vertex represents a drug. The 548 common drugs in the chemical, therapeutic, protein and phenotype datasets are considered when creating the DSNs. Each drug is represented by a binary bit vector of length 3760, in which 881 chemical features, 719 therapeutic features, 775 protein features and 1385 phenotype features are concatenated. This binary feature vector represents the presence or absence of each feature.

2. Edge (E): an edge represents the association between two drugs.

3. Terminal (T): a terminal is a special vertex that serves as a starting point and must be included in the identified subnetwork. Terminal vertices are selected according to the ATC second level classification. In this chapter, we choose ATC-N (i.e. Nervous System diseases).

4. Edge-cost (c): the edge-cost represents the pairwise dissimilarity between two drugs, for which the Jaccard coefficient is used:

$$c_{ij} = 1 - \frac{\sum_{k=1}^{n} \upsilon_i(k) \cap \upsilon_j(k)}{\sum_{k=1}^{n} \upsilon_i(k) \cup \upsilon_j(k)} \tag{5.1}$$

where i and j are indexes of two different drugs, $c_{ij}$ is the cost of edge (i, j), n is the total number of features considered for each drug, which is 3760, and $\upsilon_i$ is the feature vector of drug i.

5. Vertex-prize (p): the vertex-prize represents the average similarity between a drug and all the other drugs represented by terminals. It is calculated according to the following equation:

$$p_i = \frac{\sum_{j \in T,\, j \neq i} \frac{1}{1 + c_{ij}}}{|T|} \tag{5.2}$$

where $p_i$ is the prize of vertex i, T is the set of terminals and |T| is the total number of terminals.
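Equations (5.1) and (5.2) can be illustrated on toy data as follows. This is a minimal sketch, assuming binary feature vectors stored as Python lists; the names `jaccard_cost` and `vertex_prize`, the four-bit vectors and the convention of returning a cost of 1 for two all-zero vectors are assumptions made for the example and are not taken from the study.

```python
def jaccard_cost(u, v):
    """Edge-cost of Eq. (5.1): one minus the Jaccard coefficient."""
    intersection = sum(a & b for a, b in zip(u, v))
    union = sum(a | b for a, b in zip(u, v))
    if union == 0:
        return 1.0  # assumed convention when both vectors are all zero
    return 1.0 - intersection / union


def vertex_prize(drug, terminals, features):
    """Vertex-prize of Eq. (5.2): average similarity to the terminals."""
    total = sum(
        1.0 / (1.0 + jaccard_cost(features[drug], features[t]))
        for t in terminals
        if t != drug
    )
    return total / len(terminals)


# Hypothetical 4-bit vectors (the study uses 3760 bits per drug).
features = {
    "d1": [1, 1, 0, 0],
    "d2": [1, 0, 0, 0],
    "d3": [0, 0, 1, 1],
}
terminals = ["d1", "d2"]

cost_d1_d2 = jaccard_cost(features["d1"], features["d2"])  # 1 - 1/2
cost_d1_d3 = jaccard_cost(features["d1"], features["d3"])  # 1 - 0/4
prize_d3 = vertex_prize("d3", terminals, features)         # (1/2 + 1/2) / 2
prize_d1 = vertex_prize("d1", terminals, features)         # (1/1.5) / 2
```

Note that a terminal's own prize sums over the other terminals only but is still divided by |T|, exactly as Equation (5.2) is written.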

Multiple DSNs can be created according to the above specifications simply by varying the terminal set. The vertices, edges and edge-costs are identical in all DSNs, whereas the terminal set and the vertex-prizes vary with the selected terminal set. Since there are 7 subclasses in ATC class-N, 7 complete DSNs are constructed. Table 5.1 lists the 7 subclasses of ATC class-N with the corresponding class names. Appendix C includes the list of ATC class-N drugs used in this study.

Table 5.1: ATC class-N

ATC Class Code   ATC Class Name

N01              ANESTHETICS
N02
N03              ANTIEPILEPTICS
N04              ANTI-PARKINSON DRUGS
N05              PSYCHOLEPTICS
N06              PSYCHOANALEPTICS
N07              OTHER NERVOUS SYSTEM DRUGS

5.2.2 Sparse Graph generation

According to the literature, PCST algorithms perform well on sparse graphs [112]. In this chapter, four sparse graph generation methods are used to prune the complete DSNs. In addition to the two sparse graph generation methods (Sparse Graph Method 1 and Sparse Graph Method 2) used in Sun et al. [111], two further methods are proposed here to investigate repositioning candidates for nervous system diseases: Sparse Graph Method 3, which uses the two-tiered clustering approach introduced in Chapter 4, and Sparse Graph Method 4. Applied to the 7 complete DSNs, these four sparse graph generation methods produce 28 different sparse DSNs for ATC class-N.

Sparse Graph Method 1

Table 5.2 illustrates the procedure of Sparse Graph Method 1. It generates a sparse graph by adding edges to the Minimum Spanning Tree (MST) of the complete graph, which is found using Prim’s algorithm [113]. The edges are probabilistically added to the MST until the expected number of edges is reached. This method is used to generate Sparse DSNs: SG1-SG7.

Table 5.3 and Figure 5.1 illustrate the statistics and the edge cost distribution, respectively, of five sparse DSNs generated using Sparse Graph Method 1. In these DSNs, the edge costs lie between 0.15 and 0.98. The mean edge cost is approximately 0.63. Moreover, there is no significant difference between the sparse DSNs generated using Sparse Graph Method 1. Therefore, we can ensure the robustness of this method for generating sparse DSNs.

Table 5.2: Sparse graph method 1

Input: a complete graph Cg = (V, E’’, p, c)
Output: a sparse graph G = (V, E, p, c)
1: Find MST of Cg as G
2: While |E| < De do
3:   For i = 1 to |V| − 1 do
4:     For j = i + 1 to |V| do
5:       If rand(1) ≤ Pro then
6:         Add edge (i, j) to G
7:         Break
8:       If |E| == De then
9:         Break

|E| is the number of edges in the sparse graph, |V| is the number of vertices, De is the expected number of edges in the sparse graph, Pro is the probability of adding edges to MST.
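The procedure of Table 5.2 can be sketched in Python as follows. This is an illustrative reading, not the implementation used in the study: the names `prim_mst` and `sparse_graph_method_1`, the cost-matrix input, the `seed` argument and the 4-vertex toy matrix are assumptions, and the sketch sweeps over all vertex pairs instead of reproducing the `Break` statements literally.

```python
import random


def prim_mst(cost):
    """Prim's algorithm on a complete graph given as a symmetric matrix."""
    n = len(cost)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known link into the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = set()
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.add((min(u, parent[u]), max(u, parent[u])))
        for v in range(n):
            if not in_tree[v] and cost[u][v] < best[v]:
                best[v], parent[v] = cost[u][v], u
    return edges


def sparse_graph_method_1(cost, target_edges, prob, seed=None):
    """MST plus randomly added edges until `target_edges` (De) is reached."""
    if not 0 < prob <= 1:
        raise ValueError("prob must lie in (0, 1]")
    n = len(cost)
    rng = random.Random(seed)
    target = min(target_edges, n * (n - 1) // 2)  # cannot exceed complete graph
    edges = prim_mst(cost)
    while len(edges) < target:
        for i in range(n - 1):
            for j in range(i + 1, n):
                if len(edges) >= target:
                    break
                if (i, j) not in edges and rng.random() <= prob:
                    edges.add((i, j))
            if len(edges) >= target:
                break
    return edges


# Hypothetical 4-drug cost matrix (symmetric, zero diagonal).
cost = [
    [0.0, 0.2, 0.9, 0.7],
    [0.2, 0.0, 0.4, 0.8],
    [0.9, 0.4, 0.0, 0.3],
    [0.7, 0.8, 0.3, 0.0],
]
mst = prim_mst(cost)
sparse = sparse_graph_method_1(cost, target_edges=5, prob=0.3, seed=1)
```

Because the MST edges are always kept, the resulting graph stays connected regardless of which additional edges are drawn.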

Table 5.3: Analysis of five DSNs generated using Sparse Graph Method 1

          Run 1          Run 2          Run 3          Run 4          Run 5

Range     [0.15, 0.98]   [0.15, 0.98]   [0.15, 0.94]   [0.15, 0.96]   [0.15, 0.98]
Median    0.66           0.66           0.65           0.65           0.66
Mean      0.63           0.63           0.63           0.63           0.64

Sparse Graph Method 2

Table 5.4 illustrates the procedure of Sparse Graph Method 2. In this method, edges are selected using two thresholds: t1 and t2. In the first step, the edges with cost below t1 are deleted from the complete graph. Then, edges with cost below t2 are further deleted while maintaining the connectivity. t1 and t2 are selected such that t1 < t2, with t1 small enough to maintain the graph connectivity. This method is used to generate Sparse DSNs: SG8-SG14.

In Sparse Graph Method 2, most edge costs are between 0.9 and 1. The edges with a cost below 0.9 are deleted to improve the running time of the algorithm. In computational trials, the runtime of Sparse Graph Method 2 is 29.24 seconds when t1=0.9 and t2=0.95, whereas it is 7870.69 seconds when t1=0.5 and t2=0.95. Moreover, the graph becomes disconnected when t1 is 0.95. Accordingly, a large t1 results in a faster implementation but at the risk of ruining the graph connectivity, and a large t2 makes the graph sparse but at the cost of a long running time.

Figure 5.1: Edge cost distribution of five different sparse DSNs generated using Sparse Graph Method 1.
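A minimal Python sketch of Sparse Graph Method 2 is given below. It follows the pseudocode of Table 5.4, which checks the connectivity before every deletion in both passes; the function names, the cost-matrix input and the 4-vertex toy matrix are assumptions made for illustration.

```python
from collections import deque


def is_connected(n, edges):
    """Breadth-first search from vertex 0 over an undirected edge set."""
    adjacency = {v: [] for v in range(n)}
    for i, j in edges:
        adjacency[i].append(j)
        adjacency[j].append(i)
    seen, queue = {0}, deque([0])
    while queue:
        u = queue.popleft()
        for w in adjacency[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == n


def sparse_graph_method_2(cost, t1, t2):
    """Delete edges cheaper than t1, then t2, while the graph stays connected."""
    n = len(cost)
    edges = {(i, j) for i in range(n - 1) for j in range(i + 1, n)}
    for threshold in (t1, t2):
        for i in range(n - 1):
            for j in range(i + 1, n):
                if (i, j) in edges and cost[i][j] < threshold:
                    edges.discard((i, j))
                    if not is_connected(n, edges):
                        edges.add((i, j))  # restore: removal disconnects G
    return edges


# Hypothetical 4-drug cost matrix with most costs between 0.9 and 1.
cost = [
    [0.00, 0.20, 0.92, 0.97],
    [0.20, 0.00, 0.96, 0.99],
    [0.92, 0.96, 0.00, 0.93],
    [0.97, 0.99, 0.93, 0.00],
]
sparse = sparse_graph_method_2(cost, t1=0.9, t2=0.95)
```

The connectivity check is repeated for every candidate deletion, which is one plausible reason why the running time reported above grows sharply when many edges fall below the thresholds.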

Table 5.4: Sparse graph method 2

Input: a complete graph Cg = (V, E’’, p, c)
Output: a sparse graph G = (V, E, p, c)
1: Save Cg as G
2: For i = 1 to |V| − 1 do
3:   For j = i + 1 to |V| do
4:     If cij < t1 then
5:       Check the connectivity of G without edge (i, j)
6:       If G is still connected without edge (i, j) then
7:         Delete edge (i, j) from G
8: For i = 1 to |V| − 1 do
9:   For j = i + 1 to |V| do
10:    If cij < t2 then
11:      Check the connectivity of G without edge (i, j)
12:      If G is still connected without edge (i, j) then
13:        Delete edge (i, j) from G

Sparse Graph Method 3

The two-tiered clustering approach introduced in Chapter 4 is used as the third sparse graph generation method. Clustering groups drugs together based on their shared characteristics. Hence, pairwise drug associations can be inferred for drugs in the same cluster.

The sparse graph generated by this method may depend on the parameters of the clustering algorithm. Unlike the other sparse graph generation methods, the sparseness of this method depends on the number of clusters generated by the underlying clustering algorithm. One can fine-tune the parameters of the clustering algorithm to increase or decrease the number of clusters. Increasing the number of clusters produced by the clustering method results in fewer data points in each cluster; hence, fewer edges between drugs are included. Moreover, if the number of clusters is high, the sparseness of the network remains high. On the other hand, if the number of clusters is low, the sparseness remains low. Therefore, the sparseness of this approach depends only on the parameters of the clustering algorithm. This method is used to generate Sparse DSNs: SG15-SG21.
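The idea behind Sparse Graph Method 3 can be sketched as follows: an edge is kept only between drugs that share a cluster. This is a simplified illustration under stated assumptions. The function name `sparse_graph_method_3` and the cluster assignment are hypothetical (only the grouping of Metformin, Dacarbazine and Hydroxyurea in cluster 79 comes from Section 4.2), each drug is assumed to belong to a single cluster, and the sketch does not address how drugs in different clusters are linked, which this section does not describe.

```python
from itertools import combinations


def sparse_graph_method_3(cluster_of):
    """Return the edges between drugs that are assigned to the same cluster.

    `cluster_of` maps each drug to one cluster identifier (an assumption;
    overlapping clusters are not handled in this sketch).
    """
    members = {}
    for drug, cluster_id in cluster_of.items():
        members.setdefault(cluster_id, []).append(drug)
    edges = set()
    for drugs in members.values():
        edges.update(combinations(sorted(drugs), 2))
    return edges


# Hypothetical assignment; cluster 80 and its members are made up.
cluster_of = {
    "Metformin": 79,
    "Dacarbazine": 79,
    "Hydroxyurea": 79,
    "Thioridazine": 80,
    "Chlorthalidone": 80,
}
edges = sparse_graph_method_3(cluster_of)
```

A cluster of s drugs contributes s(s − 1)/2 edges, so splitting the same drugs into more, smaller clusters yields fewer edges, which matches the relationship between the number of clusters and the sparseness described above.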

Sparse Graph Method 4

Table 5.5 illustrates the procedure of Sparse Graph Method 4. Unlike Sparse Graph Method 2, this method gives priority to edges with lower cost. Since our focus is to minimize the edge-cost of the subnetwork, prioritizing edges from the lower range of the edge-cost distribution may help in identifying a subnetwork with lower dissimilarity. Therefore, the sparse graph is created by adding all edges with dissimilarity below 0.5. In order to ensure the connectivity of the sparse graph, more edges are added by gradually increasing the dissimilarity threshold. As a result, most of the included edges come from the lower edge-cost range. This method is used to generate Sparse DSNs: SG22-SG28.

Table 5.5: Sparse graph method 4

Input: a complete graph Cg = (V, E’’, p, c)
Output: a sparse graph G = (V, E, p, c)
1: Save Cg as G
2: For i = 1 to |V| − 1 do
3:   For j = i + 1 to |V| do
4:     If cij < t1 then
5:       Add edge (i, j) to G
6: End For
7: While G is not connected do
8:   For i = 1 to |V| − 1 do
9:     For j = i + 1 to |V| do
10:      If rand(1) ≤ Pro then
11:        Add edge (i, j) to G
12:        Break
13:      If G is connected then
14:        Break
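Sparse Graph Method 4 can be sketched in Python as shown below. The sketch follows the description in the text (all edges below a starting threshold are added, and the threshold is raised gradually until the graph is connected) rather than the random edge addition of Table 5.5. The function names, the `step` parameter, the union-find connectivity check and the toy cost matrix are assumptions made for illustration.

```python
def is_connected(n, edges):
    """Union-find check that the edge set joins all n vertices."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n
    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            components -= 1
    return components <= 1


def sparse_graph_method_4(cost, t1=0.5, step=0.05):
    """Keep edges cheaper than a threshold, raised until G is connected."""
    if step <= 0:
        raise ValueError("step must be positive")
    n = len(cost)
    pairs = [(i, j) for i in range(n - 1) for j in range(i + 1, n)]
    threshold = t1
    while True:
        edges = {(i, j) for i, j in pairs if cost[i][j] < threshold}
        # Terminates: once the threshold exceeds the largest cost, the
        # complete graph is recovered, which is connected.
        if is_connected(n, edges):
            return edges, threshold
        threshold += step


# Hypothetical 4-drug cost matrix with two tight pairs and one bridge.
cost = [
    [0.00, 0.20, 0.90, 0.90],
    [0.20, 0.00, 0.62, 0.90],
    [0.90, 0.62, 0.00, 0.30],
    [0.90, 0.90, 0.30, 0.00],
]
edges, final_threshold = sparse_graph_method_4(cost, t1=0.5, step=0.05)
```

In the toy example the two low-cost pairs are added at the initial threshold of 0.5, and the bridge of cost 0.62 is added once the threshold has been raised far enough to connect them.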

5.2.3 Drug repositioning by subnetwork identification

Similar to our published article [111], in this study, two PCST algorithms: the physarum- inspired subnetwork identification algorithm (PSIA) proposed by Sun et al. [111] and the Goemans Williamson (GW) algorithm proposed by Michel X. Goemans and David P. Williamson [112] has been employed for subnetwork identification of sparse DSNs (SG1-SG28). We used the PSIA and GW implementations which is publicly available at https://github.com/YahuiSun/PSIA-to-identify-subnetworks. The definition of Prize-Collecting Steiner Tree (PCST) Problem is given as follows: let G = (V, E, p, c) be a connected, undirected graph, where V is the set of vertices, E is the set of edges, p is a function which maps each vertex in V to a non-negative number called the prize and c is a function which maps each edge in E to a positive number called the cost. Let T be a subset of V called terminals. The aim of PCST is to find a connected subgraph G0 = (V0, E0), V0 ⊆ V, E0 ⊆ E which contains all the terminals 0 while minimizing the objective function c(G ) = ∑e∈E0 c(e) − ∑v∈V0 p(v) and the optimal 5.2 The proposed method 101 solution of PCSTP is called Steiner Minimum Tree (SMT) in G for T. The large amoeboid organism called Physarum polycephalum shows many intelli- gent behaviors [114, 115]. In our published article [111], my colleague Yahui Sun pro- posed Physarum-inspired Subnetwork Identification Algorithm (PSIA) to identify sub- networks in DSNs. PSIA is inspired by the Lowest-cost Network Physarum Optimiza- tion algorithm (LNPO) [116] (refer Appendix A for the detailed explanation). In PSIA, a single terminal is chosen probabilistically to be the sink node; then all the other terminals become source nodes. Therefore, PSIA may identify different subnetworks for the same DSN. Moreover, we used the widely used GW algorithm to identify subnetworks in DSNs. GW algorithm was proposed by Michel X. Goemans and David P. Williamson [22] (refer Appendix A for the detailed explanation). 
The GW algorithm, however, is designed to solve PCSTP instances with a single terminal, called the root. Since there are multiple terminals in these DSNs, the GW algorithm is applied by randomly choosing a single terminal as the root and assigning large prizes to the other terminals: the prize of each remaining terminal is set to a number greater than the net vertex-prize. We repeat this process multiple times with different root nodes, and the solution with the minimum net cost of edge costs and vertex prizes is output as the final solution. The complete explanation of this algorithm is available in Appendix A, on pages 151-152.

An optimal subnetwork identified by PSIA/GW must include the drugs that are highly similar to the terminals. Since the terminals are selected from an ATC therapeutic class, the non-terminal drugs identified in the subnetwork can be repositioned for the diseases represented by that ATC therapeutic class. In this chapter, terminals are chosen from ATC class-N. Hence, the identified non-terminals can be repositioned for diseases related to the nervous system.

We employed both PSIA and the GW algorithm on the 28 DSNs, SG1-SG28 (see Section 5.2.2). The subnetworks identified by these two algorithms are analyzed to identify reliable repositioning candidates. Repositioning candidates identified in multiple subnetworks are more likely to be suitable for repositioning to diseases related to the nervous system (ATC class-N). An identified subnetwork includes terminals, True Positives (TPs) and False Positives (FPs). TPs are the drugs already in ATC class-N (excluding terminals), whereas FPs are the drugs that do not belong to ATC class-N. True Negatives (TNs) and False Negatives (FNs) are the drugs not included in the identified subnetwork: TNs are non-ATC class-N drugs, whereas FNs are ATC class-N drugs. In this chapter, Rand Index (RI), Precision, Sensitivity and Specificity are used to analyze the identified subnetworks. They are defined as shown in the equations below.

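$$\mathrm{RI} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \tag{5.3}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \times 100\% \tag{5.4}$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \times 100\% \tag{5.5}$$

$$\mathrm{Specificity} = \frac{TN}{FP + TN} \times 100\% \tag{5.6}$$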
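Equations 5.3-5.6 can be evaluated with a short helper such as the sketch below. The function name `subnetwork_metrics`, the example counts and the choice to return 0.0 when a denominator is zero are assumptions made for illustration; they are not part of the published method.

```python
def subnetwork_metrics(tp, fp, tn, fn):
    """Compute RI, Precision, Sensitivity and Specificity (all in %) per Eqs. 5.3-5.6."""

    def pct(numerator, denominator):
        # Assumption: report 0.0 when the denominator is zero (e.g. an empty prediction set).
        return 100.0 * numerator / denominator if denominator else 0.0

    return {
        "RI": pct(tp + tn, tp + tn + fp + fn),
        "Precision": pct(tp, tp + fp),
        "Sensitivity": pct(tp, tp + fn),
        "Specificity": pct(tn, fp + tn),
    }


# Hypothetical counts, not taken from any table in this chapter.
metrics = subnetwork_metrics(tp=8, fp=2, tn=85, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}%")
```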

5.3 Results

This section presents the analysis of the subnetworks identified by PSIA and the GW algorithm. With the aim of identifying concise subnetworks for drug repositioning, RI was used to evaluate the subnetworks in our published article [111]. In this section, in addition to RI, we report the properties of the solution subnetworks in terms of Precision, Sensitivity and Specificity. Moreover, TP, FP, TN and FN are given for further reference.

Tables 5.6, 5.7, 5.8 and 5.9 correspond to the subnetworks obtained using the sparse DSNs generated by Sparse Graph Methods 1, 2, 3 and 4, respectively. These tables summarize the properties of the initial DSN and the identified subnetworks: Graph ID, |T| and T-source are the name of the sparse DSN, the number of terminals and the origin of the terminal set in each DSN, respectively, whereas |V'| and |E'| are the numbers of vertices and edges in each identified subnetwork, respectively. Note that the number of drugs (vertices) is the same in all 28 DSNs. The number of edges in a sparse DSN depends on the choice of the sparse graph generation method.

Table 5.6 summarizes the results of the subnetworks identified by PSIA and the GW algorithm for SG1-SG7. As in our published article [111], the number of edges in these sparse DSNs is increased up to 1500, starting from an MST of 547 edges. The subnetworks identified by GW are clearly larger than those identified by PSIA. Average RI, Precision, Sensitivity and Specificity of 80.3, 30.8, 7.3 and 96.5 are observed, respectively, in the subnetworks identified by PSIA, and of 47.8, 18.4, 55.3 and 47.2, respectively, in the subnetworks identified by GW.

Table 5.7 summarizes the results of the subnetworks identified by PSIA and the GW algorithm for the sparse DSNs SG8-SG14. As in our published article, these sparse DSNs contain 1391 edges when t1 and t2 are set to 0.9 and 0.95, respectively. First, the edges of cost below 0.9 are deleted from the complete graph; then the edges of cost below 0.95 are deleted while maintaining the graph connectivity. Here, t1 and t2 are set in such a way that t1

Table 5.6: Subnetwork Evaluation for ATC-N using Sparse Graph Method 1

Sparse DSN               Solution Subnetwork
Graph ID  |T|  T-source  Algo  |V'|  |E'|  ATC-N drugs   TP   FP   TN   FN  RI (%)  Precision (%)  Sensitivity (%)  Specificity (%)
SG1        14  N01       PSIA    36    35           20    6   16  428   84    79.8           27.3              6.1             96.4
SG1        14  N01       GW     304   303           68   54  236  208   36    47.6           18.6             55.1             46.9
SG2         9  N02       PSIA    29    28           13    4   16  428   91    78.7           20.0              3.9             96.4
SG2         9  N02       GW     308   307           67   58  241  203   37    46.9           19.4             56.3             45.7
SG3        15  N03       PSIA    41    40           26   11   15  429   78    81.1           42.3             11.3             96.6
SG3        15  N03       GW     310   309           68   53  242  202   36    46.3           18.0             54.6             45.5
SG4        10  N04       PSIA    26    25           16    6   10  434   88    80.3           37.5              5.9             97.8
SG4        10  N04       GW     281   280           63   53  218  226   41    50.4           19.6             52.0             50.9
SG5        27  N05       PSIA    61    60           37   10   24  420   67    81.0           29.4             11.7             94.6
SG5        27  N05       GW     314   313           77   50  237  207   27    47.8           17.4             58.8             46.6
SG6        28  N06       PSIA    59    58           36    8   23  421   68    81.0           25.8              9.5             94.8
SG6        28  N06       GW     325   324           78   50  247  197   26    46.0           16.8             59.5             44.4
SG7         9  N07       PSIA    18    17           12    3    6  438   92    80.3           33.3              2.9             98.7
SG7         9  N07       GW     280   279           61   52  219  225   43    49.9           19.2             50.5             50.7

Table 5.7: Subnetwork Evaluation for ATC-N using Sparse Graph Method 2

Sparse DSN               Solution Subnetwork
Graph ID  |T|  T-source  Algo  |V'|  |E'|  ATC-N drugs   TP   FP   TN   FN  RI (%)  Precision (%)  Sensitivity (%)  Specificity (%)
SG8        14  N01       PSIA    84    83           23    9   61  383   81    71.9           12.9              9.2             86.3
SG8        14  N01       GW      22    21           14    0    8  436   90    80.1            0.0              0.0             98.2
SG9         9  N02       PSIA    17    16           11    2    6  438   93    80.1           25.0              1.9             98.6
SG9         9  N02       GW      18    17           11    2    7  437   93    80.0           22.2              1.9             98.4
SG10       15  N03       PSIA    20    19           16    1    4  440   88    81.2           20.0              1.0             99.1
SG10       15  N03       GW      21    20           16    1    5  439   88    81.1           16.7              1.0             98.9
SG11       10  N04       PSIA    17    16           13    3    4  440   91    80.9           42.9              2.9             99.1
SG11       10  N04       GW      22    21           13    3    9  435   91    79.9           25.0              2.9             98.0
SG12       27  N05       PSIA    92    91           33    6   59  385   71    73.5            9.2              7.1             86.7
SG12       27  N05       GW      41    40           29    2   12  432   75    81.8           14.3              2.4             97.3
SG13       28  N06       PSIA    62    61           38   10   24  420   66    81.2           29.4             11.9             94.6
SG13       28  N06       GW      40    39           29    1   11  433   75    81.9            8.3              1.2             97.5
SG14        9  N07       PSIA    15    14           10    1    5  439   94    80.1           16.7              1.0             98.9
SG14        9  N07       GW      18    17           11    2    7  437   93    80.0           22.2              1.9             98.4

GW always includes all 548 vertices. Therefore, it is not suitable for drug repositioning.

Table 5.8: Subnetwork Evaluation for ATC-N using Sparse Graph Method 3

Sparse DSN               Solution Subnetwork
Graph ID  |T|  T-source  Algo  |V'|  |E'|  ATC-N drugs   TP   FP   TN   FN  RI (%)  Precision (%)  Sensitivity (%)  Specificity (%)
SG15       14  N01       PSIA    23    22           18    4    5  439   86    81.5           44.4              4.1             98.9
SG16        9  N02       PSIA    14    13           12    3    2  442   92    81.1           60.0              2.9             99.5
SG17       15  N03       PSIA    25    24           20    5    5  439   84    81.8           50.0              5.2             98.9
SG18       10  N04       PSIA    14    13           13    3    1  443   91    81.4           75.0              3.0             99.8
SG19       27  N05       PSIA    42    41           35    8    7  437   69    83.9           53.3              9.4             98.4
SG20       28  N06       PSIA    37    36           33    5    4  440   71    84.0           55.6              6.0             99.0
SG21        9  N07       PSIA    15    14           13    4    2  442   91    81.3           66.7              3.9             99.5

Table 5.9 summarizes the results of the subnetworks identified by PSIA and the GW algorithm for SG22-SG28. These sparse DSNs include 2675 edges, of which 47.35% have an edge cost below 0.5. The other edges are added by gradually increasing the edge-cost (dissimilarity) threshold. Including edges with lower edge costs enables the identification of useful subnetworks for drug repositioning. Average RI, Precision, Sensitivity and Specificity of 80.1, 27.9, 6.4 and 96.4 are observed, respectively, in the subnetworks identified by PSIA, and of 34.8, 16.7, 65.3 and 29.4, respectively, in the subnetworks identified by GW. Notably, the subnetworks identified by GW are relatively larger than those identified by PSIA.

The subnetworks identified by GW include 55.3%, 4.7% and 71.6% of the drugs (vertices) in the sparse DSNs generated by Sparse Graph Methods 1, 2 and 4, respectively, whereas the subnetworks identified by PSIA include 7.0%, 8.0%, 4.4% and 6.9% of the vertices in the sparse DSNs generated by Sparse Graph Methods 1, 2, 3 and 4, respectively. The subnetworks identified by PSIA demonstrate a relatively higher RI, Precision and Specificity in 7, 5 and 7 of the 7 sparse DSNs generated by Sparse Graph Methods 1, 2 and 4, respectively. As mentioned previously, the GW algorithm is unable to identify useful subnetworks on the sparse DSNs with 14.5% sparseness that are generated using Sparse Graph Method 3. It is clear from Table 5.8 that, unlike the GW algorithm, PSIA is capable of handling the 14.5% sparse DSNs constructed according to Sparse Graph Method 3. Moreover, PSIA is likely to identify smaller subnetworks, which are more suitable for drug repositioning.

Table 5.9: Subnetwork Evaluation for ATC-N using Sparse Graph Method 4

Sparse DSN               Solution Subnetwork
Graph ID  |T|  T-source  Algo  |V'|  |E'|  ATC-N drugs   TP   FP   TN   FN  RI (%)  Precision (%)  Sensitivity (%)  Specificity (%)
SG22       14  N01       PSIA    40    39           20    6   20  424   84    79.0           23.1              6.1             95.5
SG22       14  N01       GW     395   394           76   62  319  125   28    33.5           16.3             63.3             28.2
SG23        9  N02       PSIA    24    23           14    5   10  434   90    80.0           33.3              4.9             97.8
SG23        9  N02       GW     388   387           76   67  312  132   28    35.4           17.7             65.1             29.7
SG24       15  N03       PSIA    33    32           20    5   13  431   84    80.3           27.8              5.2             97.1
SG24       15  N03       GW     393   392           80   65  313  131   24    35.3           17.2             67.0             29.5
SG25       10  N04       PSIA    31    30           16    6   15  429   88    79.4           28.6              5.9             96.6
SG25       10  N04       GW     385   384           77   67  308  136   27    36.2           17.9             65.7             30.6
SG26       27  N05       PSIA    62    61           38   11   24  420   66    81.2           31.4             13.0             94.6
SG26       27  N05       GW     396   395           82   55  314  130   22    34.0           14.9             64.7             29.3
SG27       28  N06       PSIA    53    52           33    5   20  424   71    81.0           20.0              6.0             95.5
SG27       28  N06       GW     393   392           83   55  310  134   21    34.8           15.1             65.5             30.2
SG28        9  N07       PSIA    22    21           13    4    9  435   91    80.0           30.8              3.9             98.0
SG28        9  N07       GW     396   395           77   68  319  125   27    34.3           17.6             66.0             28.2

Table 5.10: Analysis of subnetwork identification for complete graphs vs. sparse graphs

Algorithm  ATC Class  Number of Terminals  Complete Graph  Sparse Graph Method 1  Sparse Graph Method 2  Sparse Graph Method 3  Sparse Graph Method 4
GW         N01                         14             548                    304                     22                    548                    395
GW         N02                          9             548                    308                     18                    548                    388
GW         N03                         15             548                    310                     21                    548                    393
GW         N04                         10             548                    281                     22                    548                    385
GW         N05                         27             548                    314                     41                    548                    396
GW         N06                         28             548                    325                     40                    548                    393
GW         N07                          9             548                    280                     18                    548                    396
PSIA       N01                         14               0                     36                     84                     23                     40
PSIA       N02                          9               0                     29                     17                     14                     24
PSIA       N03                         15               0                     41                     20                     25                     33
PSIA       N04                         10               0                     26                     17                     14                     31
PSIA       N05                         27               0                     61                     92                     42                     62
PSIA       N06                         28               0                     59                     62                     37                     53
PSIA       N07                          9               0                     18                     15                     15                     22

According to the literature, PCST algorithms perform well on sparse graphs [112]. As explained above, in this chapter we used four sparse graph generation methods to prune the complete DSNs. Table 5.10 shows the significance of using sparse DSNs for subnetwork identification by PCST algorithms. It can be seen from the table that neither GW nor PSIA is capable of identifying useful subnetworks for drug repositioning when the complete graphs are used. The GW algorithm returns a subnetwork that includes all 548 vertices of the initial graph, as it becomes more sensitive to the node weights when the graph is complete: all the nodes are connected to the root node because they can be easily connected in the complete graph. It should be noted that the GW algorithm was unable to produce useful subnetworks even for the sparse DSNs with 14.5% sparseness constructed using Sparse Graph Method 3. PSIA, on the other hand, does not return any subnetwork, as the adjacency matrix used to denote the edge weights becomes nearly singular when the graph is complete. It should also be noted that, even though we prune edges from the complete graph to construct four different kinds of sparse graphs, the subnetworks identified by these PCST algorithms can include any drug, as no vertex has been deleted in any of these sparse graphs.

Overall, PSIA can be seen as a useful algorithm for drug repositioning by subnetwork identification using sparse DSNs. As explained before, in this study we used 28 different sparse DSNs, varying the sparsity and the terminal sets. Some drugs occur in the subnetworks of multiple DSNs. We chose the repositioning candidates that occur frequently in multiple subnetworks identified by PSIA for further analysis. Frequently occurring drugs are more likely to be prioritized as repositioning candidates. We have high confidence that the frequently occurring non-terminal, non-ATC-class-N drugs in these subnetworks can be repositioned for nervous diseases.
Table 5.11 shows the 55 frequently occurring drug repositioning candidates seen in at least 3 subnetworks. Nimodipine, Alclometasone, Aminophylline, Lisinopril, Theophylline, Vinblastine, Isoniazid and Naftifine have the highest possibility of being repositioned for diseases related to the nervous system, as they occur in at least 5 subnetworks. However, proposing them for specific nervous diseases would be beneficial.

Figure 5.2 illustrates the current ATC class distribution across the inferred repositioning candidates for nervous diseases that occur in at least three subnetworks. The majority of the inferred repositioning candidates are from ATC class-C (20%), related to the cardiovascular system. Dermatologicals (ATC class-D, 17%) form the second highest overlapping ATC class with ATC class-N drugs. Fewer drugs from ATC classes P, L, H, B and V are seen in the inferred list, and no drugs from ATC class-M are found among the inferred repositioning candidates.

Figure 5.2: ATC class distribution in the repositioned ATC-N drug

5.4 Discussion

In our published article [111], repositioning candidates for cardiovascular diseases were identified based on ATC class-C (Cardiovascular System) drugs. As a result, nitroglycerin, theophylline, arsenic trioxide, isocarboxazid, lincomycin, acarbose, , haloperidol, malathion and neomycin were inferred as plausible drug repositioning candidates for treating cardiovascular diseases. Moreover, we found evidence in the existing literature that nitroglycerin, theophylline and acarbose are already being used to treat cardiovascular diseases, although they are not included in the current ATC class-C.

In this chapter, we applied PSIA and the popular GW algorithm [112] to 28 different DSNs with the aim of identifying repositioning candidates for nervous diseases. The main objective of this experiment was to demonstrate that the subnetwork identification method generalizes beyond the particular context explored in our published article [111]. In the ATC classification, drugs related to the nervous system are represented by ATC class-N. By applying this method to ATC class-N, plausible repositioning candidates for treating diseases related to the nervous system are identified. Similarly, this approach can be applied to the other 12 main ATC classes.

An identified subnetwork includes terminals, TPs and FPs. TPs are the drugs already classified in ATC class-N (excluding terminals). FPs are the drugs that are currently not in ATC class-N. Theoretically, the FPs are the predicted repositioning candidates. Unlike in other classification applications, selecting the most appropriate evaluation metric for drug repositioning is challenging, as the chosen evaluation metric should allow a reasonable number of FPs in the subnetwork. Thus, determining the best subnetwork based on an evaluation metric may be inappropriate.
An evaluation metric that focuses on TPs and FNs is not the best option, since our broader focus is to determine the repositioning candidates through FPs (predicted positives). Can all FPs identified in a subnetwork be inferred as repositioning candidates? This is the other question that requires attention. The idea of drug repositioning by subnetwork identification requires precise assessment of the predicted positives. Due to the complex characteristics of drugs, a drug may be similar to various other drugs. From Tables 5.6, 5.7, 5.8 and 5.9, it can be seen that a subnetwork with a relatively higher number of TPs includes a higher number of FPs as well. Sensitivity is designed to assess the positives, whereas Specificity is designed to assess the negatives. The RI and Specificity metrics take TNs into account (see Equations 5.3-5.6).

Moreover, RI considers TP, FP, TN and FN, which was the reason for its use in our published article [111]. Sensitivity is likely to favor the larger subnetwork, as it does not account for FPs and TNs. On the other hand, Precision, Specificity and RI are likely to favor the smaller subnetwork. In large subnetworks, a process to identify the most reliable FPs as repositioning candidates becomes necessary. Hence, in this application, assessing FPs and TNs has a major impact on obtaining reliable repositioning candidates. Subnetworks with a higher Precision may provide more reliable repositioning candidates, but depending merely on Precision would restrict the detection of more interesting drug repositioning candidates that are encountered as FPs.

The subnetworks identified by GW become complicated, as they contain a relatively higher number of drug repositioning candidates (FPs). The subnetworks identified by PSIA are generally small and include relatively fewer FPs to be analyzed. Moreover, PSIA identifies useful subnetworks in all four types of sparse DSNs used in this study. Therefore, the subnetworks identified by PSIA are further analyzed to identify FPs that occur in multiple subnetworks. These subnetworks are identified based on DSNs with different characteristics (7 ATC subclasses and 4 sparse graph generation methods). FPs that occur in multiple subnetworks increase the confidence that those candidates are interesting prospective drug repositioning candidates for further in-depth analysis.

There is evidence from the literature that PCST algorithms perform better on sparse graphs than on complete graphs [112]. Moreover, the PSIA and GW algorithms used in this study failed to produce useful subnetworks when the complete DSNs were used. Therefore, the complete DSNs created according to Section 5.2.2 are pruned to construct sparse DSNs. In sparse DSNs, edges are pruned from the complete DSNs while the vertices are kept unchanged.
Since the vertices are kept unchanged, the subnetworks may include common drugs. In addition to the two sparse graph methods used in our published article [111], Sparse Graph Methods 3 and 4 were introduced in this chapter to extend the generalization of subnetwork identification methods for drug repositioning. Sparse Graph Method 3 employs the two-tiered clustering approach proposed in Chapter 4. Unlike the other sparse graph generation methods, the sparseness of this method depends on the number of clusters generated through the underlying clustering method. Increasing the number of clusters produced by the clustering method results in fewer data points in each cluster; hence, fewer edges between drugs are included. Thus, if the number of clusters is high, the sparseness of the network remains high, and if the number of clusters is low, the sparseness remains low. The sparseness of this approach depends only on the parameters of the clustering algorithm. Even though the sparseness is 14.5% for the sparse DSNs (SG15-SG21) generated using Sparse Graph Method 3, GW is unable to produce useful subnetworks in these sparse DSNs, whereas PSIA produced useful subnetworks. In contrast to the other three methods, in Sparse Graph Method 4, edges with low cost are given preference when generating the sparse DSNs. This may enable the subnetwork identification method to learn subnetworks with low edge costs (dissimilarity) effectively. Further, in Sparse Graph Methods 1, 2 and 4, the pairwise drug similarities are computed using only the Jaccard distance measure. Combining various similarity measures, such as Euclidean and cosine, would be another option to ensure reliable pairwise similarities.

The proposed approach leads to consensus solutions for reliable drug repositioning. Constructing multiple DSNs by varying the terminals enables the identification of multiple subnetworks from various perspectives.
One approach to determining the terminals is the widely used random selection. A random proportion (80%, 60%, 50%, etc.) of known drugs can be used as terminals. However, the subnetwork identification method can benefit when the random proportion is kept at a lower value, due to the highly overlapping nature of drugs. In contrast to traditional approaches, in this study we employed the hierarchy of the ATC classification to choose the terminals. The subclasses of the second-level ATC classification are employed as the main source to determine the terminals. This provides a meaningful terminal selection, as the terminals represent a particular therapeutic class.

Employing different types of DSNs by varying the terminals and the sparse graph generation methods provides an opportunity to assess the drug repositioning candidates on the basis of a consensus solution. The 4 different sparse graph generation methods and the 7 subclasses of ATC class-N enable the construction of 28 different sparse DSNs. We aggregate the predicted drug repositioning candidates from the 28 subnetworks and prioritize the candidates that occur in multiple subnetworks. This strategy strengthens the confidence that the drug repositioning candidates occurring frequently in multiple subnetworks are reliable.
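The aggregation step described above can be sketched as a simple frequency count over the identified subnetworks. The function name `consensus_candidates`, the placeholder drug identifiers and the default threshold are hypothetical; the sketch only assumes that each subnetwork is available as a set of drug names and that the drugs already in the ATC class of interest are known.

```python
from collections import Counter


def consensus_candidates(subnetworks, known_class_drugs, min_count=3):
    """Count how often each drug outside the known class occurs across subnetworks."""
    counts = Counter()
    for drugs in subnetworks:
        # Only drugs outside the ATC class (the FPs) are repositioning candidates.
        counts.update(set(drugs) - set(known_class_drugs))
    frequent = [(drug, n) for drug, n in counts.items() if n >= min_count]
    # Most frequent first; ties are broken alphabetically for a stable order.
    return sorted(frequent, key=lambda item: (-item[1], item[0]))


# Hypothetical subnetworks; "n1".."n3" stand for drugs already in the class.
subnetworks = [
    {"n1", "n2", "x", "y"},
    {"n1", "x", "z"},
    {"x", "y", "n3"},
    {"y", "z"},
]
candidates = consensus_candidates(subnetworks, known_class_drugs={"n1", "n2", "n3"})
print(candidates)
```

With the threshold of three used here, only the placeholder drugs that appear in at least three subnetworks are retained, which mirrors the selection rule used for Table 5.11.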

5.5 Summary

This chapter emphasizes the significance of the ATC classification in identifying plausible drug repositioning candidates. In this study, the hierarchy of the ATC classification is proposed as the primary attribute for designing the experiments. This chapter also explores the benefits of the Prize-Collecting Steiner Tree approach as a useful way of identifying subnetworks for drug repositioning. The Physarum-inspired subnetwork identification algorithm (PSIA) employed in this study can be applied to solve semi-supervised learning problems such as drug repositioning, where some of the drugs used for a therapeutic purpose are known in advance. In the published article attached in Appendix A, drug repositioning candidates for cardiovascular disease are discovered using PSIA; therefore, ATC class-C drugs are used in those experiments. In addition, we employed this method, based on drug similarity networks, to infer drug repositioning candidates for nervous diseases via ATC class-N.

This extended study demonstrates that the previously published method for drug repositioning in one disease can be generalized to a new disease context. Moreover, two additional sparse graph methods are introduced to strengthen the consensus solution. Drug repositioning candidates inferred by multiple subnetworks increase the confidence that those candidates might be interesting for further in-depth analysis.

Table 5.11: Repositioning candidates for nervous diseases identified in multiple subnetworks

Drug Name         Frequency  Current ATC Class  Drug Name         Frequency  Current ATC Class
Nimodipine                7  C                  Dexamethasone             3  A, C, D, H, S, R
Alclometasone             6  D, S                                         3  C, V
Aminophylline             5  R                  Didanosine                3  J
Lisinopril                5  C                                            3  C
Theophylline              5  R                  Dopamine                  3  C
Vinblastine               5  L                  Enalapril                 3  C
Isoniazid                 5  J                  Flumazenil                3  V
Naftifine                 5  D                  Fluorometholone           3  C, D, S
Amprenavir                4  J                  Hexachlorophene           3  D
Budesonide                4  A, D, R            Hydroxocobalamin          3  B, V
                          4  D, R               Loratadine                3  R
Norepinephrine            4  C                  Milrinone                 3  C
                          4  P                                            3  C
Tacrolimus                4  D, L               Neomycin                  3  A, B, D, J, S, R
Tranexamic Acid           4  B                                            3  A
Apraclonidine             4  S                  Phenoxybenzamine          3  C
Brinzolamide              4  S                                            3  C
                          4  R                  Propantheline             3  A
Miglitol                  4  A                  Pyrazinamide              3  J
                          3  C                  Simvastatin               3  C
Amlodipine                3  C                  Levocabastine             3  S, R
                          3  C                  Oxiconazole               3  D, G, J, S
Azithromycin              3  J, S               Penciclovir               3  D, J
Chloramphenicol           3  D, G, J, S         Prednicarbate             3  D
Clotrimazole              3  A, D, G                                      3  A
Cyproheptadine            3  R                  Rimexolone                3  H, S
Dapsone                   3  D, J               Tolterodine               3  G
Delavirdine               3  J

A: Alimentary tract and metabolism, B: Blood and blood forming organs, C: Cardiovascular system, D: Dermatologicals, G: Genito urinary system and sex hormones, H: Systemic hormonal preparations, excl. sex hormones and insulins, J: Antiinfectives for systemic use, L: Antineoplastic and immunomodulating agents, R: Respiratory system, P: Antiparasitic products, insecticides and repellents, S: Sensory organs, V: Various

Chapter 6 Inferring super-targets using dimensionality reduction and density-based clustering

6.1 Background

Drugs may interact with target proteins and off-target proteins. Generally, drug-target interactions produce therapeutic effects, whereas drug-off-target interactions produce unexpected side effects. Drugs may react similarly with similar types of target proteins or families of proteins. A drug known to interact with Target A is assumed to interact with the targets related to Target A; this set of targets is known as a ‘super-target’ group [117]. In this chapter, a target cluster is referred to as a ‘super-target’. Clustering targets into groups enables the identification of super-targets. Investigating super-targets provides an opportunity to explore target-target similarities, through which new drug-target detection can be performed.

Much research is focused on the potential of drug repositioning through new target detection [20, 62, 69, 118–121]. The rationale of new target detection is the prediction of new relationships between an existing drug and potential target proteins. As explained earlier, investigating drug-target relationships can span two axes of clinical pharmacology: drug-disease and drug-side effect detection.

Over the past two decades, many computational approaches based on machine learning, network analysis and text mining, or blends of those methods, have been proposed for new target detection [19, 35, 78, 102, 120]. Computational methods for drug-target recognition play an essential role in drug repositioning. Experimentally identifying new drug-target relationships requires substantial time and cost, while computational methods can be applied at large scale. Therefore, computational predictions can be used to narrow down the experimental focus to a smaller set of experiments. Moreover, the increasing interest in new target detection reflects the incomplete nature of the current drug-target interaction data, in which many interactions are still unknown [19, 44, 81, 120, 122, 123].

Machine learning approaches can be classified as supervised, unsupervised and semi-supervised learning methods, in which labeled, unlabeled and a mixture of labeled and unlabeled examples are used, respectively, in the learning process. In this study, we present a purely unsupervised learning approach that can be applied to unlabeled target data representing drug-target interactions. The fundamental objectives of this study are (i) to assess the impact of employing incomplete drug-target interaction data for target clustering, (ii) to improve target clustering by dimensionality reduction and outlier detection and (iii) to identify a strategy for evaluating purely unsupervised target clustering when there is no standard reference.

Clustering targets based on drug-target interactions alone may introduce noise due to missing information in the current databases. Hence, we perform dimensionality reduction to extract the most dominant features from the current data before target clustering. T-distributed stochastic neighbor embedding (t-SNE) is capable of transforming high-dimensional features into a reduced number of dimensions using nonlinear dimensionality reduction techniques [124, 125]. It is well suited for applications such as drug-target interaction data with complex nonlinear relationships.
Moreover, detection of outliers in the identified clusters provides an opportunity to choose the most suitable target candidates for the inferred super-targets while eliminating the outlier targets. A target can be an outlier due to its incomplete feature representation or due to highly deviating characteristics. Hence, target clustering can be enhanced by removing the outlier targets. The density-based spatial clustering of applications with noise (DBSCAN) [126] algorithm is used to identify super-targets based on drug-target interaction data with noise or incomplete information; it is also capable of inferring outliers. In DBSCAN, data points are classified into clusters or outliers based on the density-reachability and density-connectivity criteria [126]. It has a large number of applications in the field of bioinformatics [127–130]. To the best of my knowledge, the integration of the t-SNE technique and the DBSCAN algorithm has not been applied to drug repositioning or target clustering.

Evaluating the unsupervised learning model is also essential. However, extrinsic evaluation of target clustering is challenging without a standard reference target clustering. Here, the incomplete nature of the drug-target interactions is taken into account to design a suitable performance evaluation model for target clustering. Then, evaluation metrics such as Adjusted Mutual Information [45], Normalized Mutual Information [46] and Standardized Mutual Information [47] are employed to assess the performance of the clustering model.

6.2 Data

We obtained the gold standard drug-target interaction data used in Yamanishi et al. [19]. It contains drug-target interactions involving enzymes, ion channels, G protein-coupled receptors (GPCRs) and nuclear receptors in humans, providing four different drug-target datasets (see Table 6.1). In these datasets, 445, 210, 223 and 54 drugs were found to interact with 664, 204, 95 and 26 enzymes, ion channels, GPCRs and nuclear receptors, respectively. Moreover, the known drug-target interactions represent only 1.98%, 3.60%, 3.00% and 6.41% of the total number of possible drug-target interactions for enzymes, ion channels, GPCRs and nuclear receptors, respectively.

Table 6.1: Existing gold-standard drug-target interactions

Target type       |Drugs|  |Targets|  |Known interaction|  %Known interaction
Enzyme                445        664                 2926                 2.0
Ion channel           210        204                 1476                 3.6
GPCR                  223         95                  635                 3.0
Nuclear receptor       54         26                   90                 6.4

Table 6.2: Analysis of in-degree and out-degree of drug-target interactions

Target type       Interaction Type  1st Quantile (Q1)  2nd Quantile (Q2)  3rd Quantile (Q3)
Enzyme            drug → target                     1                  2                  5
Enzyme            target → drug                     1                  2                  4
Ion channel       drug → target                     1                  3                  9
Ion channel       target → drug                     2                  5                 10
GPCR              drug → target                     1                  2                  3
GPCR              target → drug                     1                  3                  8
Nuclear receptor  drug → target                     1                  1                  2
Nuclear receptor  target → drug                     1                  3                  4

Table 6.2 summarizes the analysis of the in-degree and out-degree of the existing drug-target interactions. Q1, Q2 and Q3 indicate the 25th, 50th and 75th percentiles, respectively. It shows that 75% of the drugs target at most 5 enzymes, 9 ion channels, 3 GPCRs and 2 nuclear receptors. Likewise, 75% of the enzymes, ion channels, GPCRs and nuclear receptors are targeted by at most 4, 10, 8 and 4 drugs, respectively.
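A degree analysis of this kind can be sketched as follows, assuming the interactions are available as (drug, target) pairs. The function name `degree_quartiles`, the toy pairs and the use of the inclusive quantile method of Python's `statistics` module are assumptions for illustration; the quartile convention used to produce Table 6.2 is not stated in the text.

```python
from collections import Counter
from statistics import quantiles


def degree_quartiles(pairs):
    """Return (Q1, Q2, Q3) of targets per drug and of drugs per target."""
    drug_degree = Counter(drug for drug, _ in pairs)        # drug -> target
    target_degree = Counter(target for _, target in pairs)  # target -> drug

    def quartiles(degree):
        # Assumption: inclusive method, i.e. the data are treated as the full population.
        return quantiles(sorted(degree.values()), n=4, method="inclusive")

    return quartiles(drug_degree), quartiles(target_degree)


# Hypothetical interactions: one promiscuous drug (D4) and one popular target (T1).
pairs = {
    ("D1", "T1"), ("D2", "T1"),
    ("D3", "T1"), ("D3", "T2"),
    ("D4", "T1"), ("D4", "T2"), ("D4", "T3"), ("D4", "T4"), ("D4", "T5"),
}
drug_q, target_q = degree_quartiles(pairs)
print("drug -> target quartiles:", drug_q)
print("target -> drug quartiles:", target_q)
```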

6.3 The proposed method

Clustering can be used to group targets. A cluster of targets is referred to as a ‘super-target’, as introduced by Shi et al. [117]. Targets in the same super-target share similar behavior compared with targets in other clusters. New target detection can be performed by leveraging the target-target similarities; this is useful for drug repositioning and side effect prediction. In this study, we propose an unsupervised learning approach for inferring super-targets, which can be investigated further for new target recognition. The integration of dimensionality reduction and clustering with outlier detection is proposed to enhance the performance of super-target identification.

Figure 6.1 illustrates the proposed workflow. In the present work, dimensionality reduction, specifically the t-distributed stochastic neighbor embedding (t-SNE) technique, is employed as a pre-processing step. It is proposed for three reasons: (i) it enables faster learning, (ii) it enables filtering of the least dominant features on incomplete data and (iii) the DBSCAN algorithm is more effective on lower-dimensional feature representations [131]. We propose that the impact of the incompleteness on the algorithms can be minimized when the most dominant features of each target are extracted by reducing the high-dimensional features. Then, clustering is performed using density-based spatial clustering of applications with noise (DBSCAN).
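To make the clustering step concrete, the sketch below gives a minimal pure-Python version of the standard DBSCAN procedure applied to two-dimensional points. It is not the implementation used in this study: the dimensionality reduction step is omitted, the points are hypothetical coordinates standing in for an embedding of targets, and the values of `eps` and `min_pts` are arbitrary illustrative choices.

```python
import math

NOISE = -1  # label for outlier points


def dbscan(points, eps, min_pts):
    """Label each point with a cluster index (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep expanding
    return labels


# Hypothetical 2-D coordinates standing in for embedded targets.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

In this toy example the two dense groups would correspond to two inferred super-targets, and the isolated point would be reported as an outlier target.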

Proposed validation technique

The validation of the inferred super-targets is challenging as there is no standard reference grouping of these targets. We therefore propose an evaluation method to assess the performance of the proposed unsupervised model. We randomly removed 10% of the existing drug-target interactions to construct five adapted datasets for each target category (enzyme, ion channel, GPCR and nuclear receptor). More adapted datasets can be constructed in the same way by varying the percentage of interactions removed. These adapted datasets provide an opportunity to assess the effectiveness of the proposed model on incomplete datasets. The proposed dimensionality reduction using t-SNE followed by clustering using DBSCAN is performed on both the original and the adapted datasets. Finally, the inferred super-target clusters are assessed by means of intrinsic and extrinsic evaluation. We assess whether the proposed integrated unsupervised learning model can infer similar super-targets on both the original and the corresponding adapted datasets.
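The construction of the adapted datasets can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the thesis: the function name `make_adapted_datasets`, the representation of interactions as a set of (drug, target) pairs, the fixed seed and the toy data are all hypothetical.

```python
import random


def make_adapted_datasets(interactions, removal_fraction=0.10, n_sets=5, seed=0):
    """Return n_sets copies of the interaction set, each with a random
    removal_fraction of the known drug-target pairs removed."""
    pairs = sorted(interactions)          # fixed order so the seed is reproducible
    n_remove = round(len(pairs) * removal_fraction)
    rng = random.Random(seed)             # seed is an assumption, for repeatability
    adapted = []
    for _ in range(n_sets):
        removed = set(rng.sample(pairs, n_remove))
        adapted.append(set(pairs) - removed)
    return adapted


# Toy example: 20 hypothetical drug-target pairs (not thesis data).
toy = {(f"D{d}", f"T{t}") for d in range(5) for t in range(4)}
adapted_sets = make_adapted_datasets(toy)
```

Each adapted set is drawn independently from the original interactions, so the five sets generally differ in which interactions are missing.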

6.3.1 T-distributed stochastic neighbor embedding

Dimensionality reduction is the transformation of high-dimensional data into a meaningful lower dimensional representation. Unlike the widely used Principal Component Analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE) and its variants such as Barnes-Hut stochastic neighbor embedding (Barnes-Hut-SNE) transform high-dimensional features into reduced dimensions using nonlinear dimensionality reduction techniques [124, 125].

Figure 6.1: The proposed workflow for target clustering

The t-SNE technique constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked, while dissimilar objects have an extremely small probability of being picked. T-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback-Leibler divergence between the two distributions with respect to the locations of the points in the map. Barnes-Hut-SNE is very good at preserving the local structure of the data in the embedding. The two primary steps of the Barnes-Hut-SNE algorithm are explained in brief next (see [125] for the complete explanation):

1. constructs a sparse approximation to the similarities between input objects using vantage-point trees.
2. approximates the t-SNE gradient using a variant of the Barnes-Hut algorithm.

Two properties of Barnes-Hut-SNE are beneficial when dealing with large networks such as pharmacological networks:

• It runs in O(N log N) time rather than O(N^2) and requires only O(N) memory; hence it is substantially faster than standard t-SNE.
• It facilitates the visualization of data sets with millions of data objects in scatter plots.
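For reference, the two probability distributions and the objective described earlier in this section take the following standard form in the t-SNE literature. The symbols are assumptions introduced here for illustration: x_i denotes a high-dimensional object, y_i its low-dimensional map point, sigma_i the per-object Gaussian bandwidth and N the number of objects.

```latex
% Similarities in the high-dimensional space
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Similarities in the low-dimensional map (Student-t kernel)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Objective minimized with respect to the map points y_i
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Barnes-Hut-SNE keeps this objective and approximates its gradient, which is where the O(N log N) running time comes from.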

6.3.2 Clustering and outliers detection using DBSCAN

To the best of my knowledge, density-based methods have not previously been applied to target clustering on drug-target interaction data. Density-based spatial clustering of applications with noise (DBSCAN) [126] has become a popular clustering technique for data with high noise. It also offers more flexibility than traditional clustering methods such as K-means and Self-Organizing Maps. Similar to the Growing Self Organizing Map employed in Chapter 3 and Chapter 4, DBSCAN can infer clusters of arbitrary shape, and the number of clusters is determined automatically. In addition, DBSCAN uses connectivity as an additional condition when forming clusters, which may be useful when detecting outliers and forming super-targets. In DBSCAN, all points within a cluster are mutually density-connected. If a point is density-reachable from any point of the cluster, it is part of the cluster as well. The parameters ε and minpts define the radius of the neighborhood of each point and the minimum number of points required within that neighborhood (and hence the minimum cluster size), respectively. In DBSCAN, data points are classified into clusters or outliers based on the density-reachability and density-connectivity criteria as follows (see [126] for the complete explanation):

1. Core point (p): a point that has at least minpts points within its ε-neighborhood.
2. Density-reachable point (q): a point that can be reached from a core point p through a chain of core points, each lying within the ε-neighborhood of the previous one.
3. Density-connected points: two points p and q are density-connected if there is a point o such that both p and q are density-reachable from o.
4. Outlier (n): a point that is not density-reachable from any core point.

DBSCAN has been shown to be much more effective on lower dimensional data [131]. Given the significance of dimensionality reduction, Barnes-Hut-SNE is employed before the clustering step, which may also help to address the problem of missing and incomplete interactions in the current datasets. Moreover, detecting outliers can further refine the super-targets from which useful repositioning candidates can be inferred.
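The point categories listed above can be illustrated with a compact sketch of the DBSCAN procedure. It is not the Matlab implementation used in this study; it assumes Euclidean distance by default (any distance function can be passed in), counts a point as a member of its own neighborhood, and uses the hypothetical convention that outliers receive the label -1.

```python
import math


def dbscan(points, eps, minpts, dist=math.dist):
    """Label each point with a cluster index (0, 1, ...) or -1 for outliers."""
    n = len(points)
    # eps-neighborhood of every point (a point counts as its own neighbor)
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    is_core = [len(nb) >= minpts for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster from the unassigned core point i.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not extend it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    frontier.append(q)
        cluster += 1
    # Points still labelled -1 are not density-reachable from any core point.
    return labels


# Toy 2-D data (hypothetical): two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, minpts=3)
```

In this toy run the first four points form one cluster, the next three form a second, and the isolated point is reported as an outlier.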

6.3.3 Evaluation

Evaluating target clustering is challenging. Intrinsic and extrinsic evaluation of the identified target clusters may be useful in demonstrating the significance of the proposed approach.

Intrinsic evaluation

Silhouette analysis is used as an intrinsic evaluation technique to assess the consistency within a cluster/class [132, 133]. It measures the similarity of an object to its own cluster/class compared to the other clusters/classes. The Silhouette value ranges from -1 to +1: it approaches +1 when the object is much more similar to its own cluster/class than to the other clusters/classes, and it approaches -1 when the object is more dissimilar to its own cluster/class than to the others. The following equation defines the Silhouette measure for an object i:

$$\mathrm{Silhouette}(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \tag{6.1}$$

where a(i) is the dissimilarity of the object i to its own cluster/class and b(i) is the dissimilarity of the object i to the other clusters/classes.
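Equation 6.1 can be evaluated directly, as in the sketch below. It assumes the conventional reading of the two terms (a(i) as the mean dissimilarity to the other members of the object's own cluster, b(i) as the smallest mean dissimilarity to any other cluster), Euclidean distance, at least two clusters, and a value of 0 for singleton clusters; the function name and toy data are hypothetical.

```python
import math


def silhouette(i, points, labels, dist=math.dist):
    """Silhouette value of object i following Equation 6.1."""
    own = labels[i]
    # a(i): mean dissimilarity to the other members of i's own cluster
    same = [dist(points[i], points[j]) for j in range(len(points))
            if labels[j] == own and j != i]
    if not same:
        return 0.0  # assumed convention for singleton clusters
    a = sum(same) / len(same)
    # b(i): smallest mean dissimilarity to any other cluster
    b = min(
        sum(dist(points[i], points[j])
            for j in range(len(points)) if labels[j] == c) / labels.count(c)
        for c in set(labels) if c != own
    )
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom


# Toy example (hypothetical): two well-separated clusters.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
lab = [0, 0, 1, 1]
s0 = silhouette(0, pts, lab)
```

For the first point, a is 1 and b is roughly 10, so the value is close to +1, as expected for a well-separated cluster.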

Extrinsic evaluation

There is no acknowledged measure of choice to compare clusterings. Similar to Chapter 4, our focus in this chapter is on information theoretic measures, which are beneficial in adjusting for selection bias. Adjusted Mutual Information (AMI) [134], Normalized Mutual Information (NMI) [46] and Standardized Mutual Information (SMI) [47] are used to compare the clusters identified from the original and the adapted data. All three are derived from mutual information. AMI has an expected value of 0 when the clusterings are independent and equals 1 when the two clusterings are identical. NMI normalizes the mutual information so that it ranges between 0 and 1. SMI further reduces the bias in clustering comparisons towards selecting clusterings with more clusters and towards clusterings that involve fewer data points. The upper bound of SMI varies with the reference clustering used; however, a higher SMI value indicates better clustering. The equations for AMI [134], NMI [46] and SMI [47] to compare clustering solutions U and V are shown below:

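$$\mathrm{AMI}(U, V) = \frac{\mathrm{MI}(U, V) - E[\mathrm{MI}(U, V)]}{\sqrt{H(U)H(V)} - E[\mathrm{MI}(U, V)]} \tag{6.2}$$

$$\mathrm{NMI}_{\mathrm{sqrt}}(U, V) = \frac{\mathrm{MI}(U, V)}{\sqrt{H(U)H(V)}} \tag{6.3}$$

$$\mathrm{SMI}(U, V) = \frac{\mathrm{MI}(U, V) - E[\mathrm{MI}(U, V)]}{\sqrt{\mathrm{var}(\mathrm{MI}(U, V))}} \tag{6.4}$$

where MI is the mutual information, H is the associated entropy value, E is the expected value and var is the variance.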
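As an illustration of Equation 6.3, the sketch below computes the mutual information and entropies from two label lists and combines them into NMI. The function names are hypothetical and natural logarithms are assumed. AMI and SMI additionally need the expectation and variance of MI under a random model of the clusterings, which are not attempted here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H of a clustering given as a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information MI(U, V) between two clusterings of the same objects."""
    n = len(u)
    joint = Counter(zip(u, v))
    pu, pv = Counter(u), Counter(v)
    return sum((c / n) * math.log(c * n / (pu[a] * pv[b]))
               for (a, b), c in joint.items())


def nmi_sqrt(u, v):
    """NMI with square-root normalization, as in Equation 6.3."""
    denom = math.sqrt(entropy(u) * entropy(v))
    return mutual_information(u, v) / denom if denom > 0 else 0.0


# Identical clusterings up to relabelling, and two independent clusterings.
same = nmi_sqrt([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
indep = nmi_sqrt([0, 0, 1, 1], [0, 1, 0, 1])
```

The first comparison yields a value of 1 (up to floating-point error) because the two labelings describe the same partition, and the second yields 0 because knowing one label says nothing about the other.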

6.4 Results

This section presents the analysis of results for the proposed purely unsupervised learning approach for inferring super-targets by integrating Barnes-Hut-SNE (a variant of t-SNE) and DBSCAN. In the present study, we demonstrate the benefits of preprocessing drug-target data using the Barnes-Hut-SNE dimensionality reduction technique, through which the errors arising from missing interactions can be minimized. The performance assessment of the super-targets inferred by DBSCAN also demonstrates the suitability of this clustering algorithm for inferring super-targets while detecting plausible outliers. Intrinsic and extrinsic evaluation of the super-targets demonstrates that the proposed approach could be explored further in research where target similarities are used. Section 6.4.1 presents the intrinsic and extrinsic evaluation of the super-targets inferred by integrating Barnes-Hut-SNE and DBSCAN. Section 6.4.2 demonstrates the significance of dimensionality reduction. In Section 6.4.3, the importance of outlier detection is assessed.

6.4.1 Inferring super-targets

As explained in Section 6.3.1, Barnes-Hut-SNE is used for dimensionality reduction prior to target clustering. The original target-drug interaction feature vectors of enzyme, ion channel, GPCR and nuclear receptor are reduced to three dimensions from 445, 210, 223 and 54 dimensions, respectively. The reduced three-dimensional feature vectors are used as inputs for inferring super-targets with the DBSCAN algorithm. Here, the Matlab DBSCAN implementation of Yarpiz [135] is used with the cosine distance metric. Table 6.3 presents the intrinsic performance assessment of the identified target clusters for the original dataset and the five adapted datasets for enzyme, ion channel, GPCR and nuclear receptor. A cluster is formed only when it contains at least 3 targets; that is, the DBSCAN parameter minpts is set to 3 to control the size of the clusters. Tuning ε is challenging as there is no standard reference for super-targets. Hence, the value of ε is set so that the Silhouette value is maximized. Table 6.3 shows the intrinsic evaluation of the selected clusterings, for which the Silhouette value is relatively high. The ε values and the associated Silhouette values of the original and the adapted datasets are relatively close.

The drug-enzyme interactions comprise interactions between 445 drugs and 664 targets (see Section 6.2). Each enzyme is therefore represented by a feature vector of 445 dimensions, which is reduced to 3 dimensions using Barnes-Hut-SNE. DBSCAN then identified 32 clusters and 51 outliers on the original data, and an average of 35 clusters and 51 outliers on the five adapted datasets. The average Silhouette value of the clusters is 0.86 for both the five adapted datasets and the original data.

The drug-ion channel interactions comprise interactions between 210 drugs and 204 targets. Each ion channel is therefore represented by a feature vector of 210 dimensions, which is reduced to 3 dimensions using Barnes-Hut-SNE. DBSCAN then identified 17 clusters and 11 outliers on the original data, and an average of 17 clusters and 13 outliers on the five adapted datasets. The average Silhouette values of the clusters obtained from the five adapted datasets and the original data are 0.79 and 0.72, respectively.

The drug-GPCR interactions comprise interactions between 223 drugs and 95 targets. Each GPCR is therefore represented by a feature vector of 223 dimensions, which is reduced to 3 dimensions using Barnes-Hut-SNE. DBSCAN then identified 12 clusters and 20 outliers on the original data, and an average of 12 clusters and 30 outliers on the five adapted datasets. The average Silhouette values of the clusters obtained from the five adapted datasets and the original data are 0.82 and 0.78, respectively.

Table 6.3: Super-target analysis

Target type         Dataset          ε      Super-targets   Outliers   Silhouette
Enzyme              Original         2.5    32              51         0.86
                    Adapted set 1    2.5    31              47         0.86
                    Adapted set 2    3      30              48         0.85
                    Adapted set 3    2      42              56         0.87
                    Adapted set 4    2      37              55         0.86
                    Adapted set 5    2.5    34              49         0.88
Ion channel         Original         7.5    17              11         0.72
                    Adapted set 1    2      18              11         0.77
                    Adapted set 2    2      18              15         0.81
                    Adapted set 3    2      15              6          0.79
                    Adapted set 4    2      16              23         0.82
                    Adapted set 5    2      19              11         0.78
GPCR                Original         70     12              20         0.78
                    Adapted set 1    60     12              39         0.83
                    Adapted set 2    60     12              25         0.80
                    Adapted set 3    50     12              28         0.83
                    Adapted set 4    50     11              32         0.81
                    Adapted set 5    40     12              28         0.84
Nuclear receptor    Original         450    3               15         0.77
                    Adapted set 1    450    2               16         0.87
                    Adapted set 2    450    3               8          0.75
                    Adapted set 3    400    3               8          0.77
                    Adapted set 4    350    2               7          0.77
                    Adapted set 5    400    3               9          0.82

The drug-nuclear receptor interactions comprise interactions between 54 drugs and 26 targets. Each nuclear receptor is therefore represented by a feature vector of 54 dimensions, which is reduced to 3 dimensions using Barnes-Hut-SNE. DBSCAN then identified 3 clusters and 15 outliers on the original data, and an average of 3 clusters and 10 outliers on the five adapted datasets. The average Silhouette values of the clusters obtained from the five adapted datasets and the original data are 0.80 and 0.77, respectively.

The above results indicate that DBSCAN is an effective algorithm for target clustering, since the Silhouette values of the clustering models generated using the original data and the adapted data are relatively high and similar. It should also be noted that the ε values chosen for the models based on the original data are relatively similar to those chosen for the five adapted datasets.

Table 6.4: Extrinsic evaluation of super-target prediction

Target type         AMI     NMI     SMI
Enzyme              0.95    0.96    112.9
Ion channel         0.93    0.94    49.8
GPCR                0.99    0.99    24.2
Nuclear receptor    0.95    0.96    4.6

Extrinsic evaluation of super-targets is essential; however, it is challenging as there is no standard reference against which to evaluate the identified super-targets. The secondary motive of this chapter is to present a method to evaluate the inferred super-targets. We have shown that removing 10% of known interactions is an effective strategy for determining the performance of the unsupervised learning approach.

Table 6.4 summarises the extrinsic evaluation, in which the super-targets inferred from the original and the adapted datasets are compared using the AMI, NMI and SMI metrics. The average AMI values are 0.95, 0.93, 0.99 and 0.95 for the target clusters of enzyme, ion channel, GPCR and nuclear receptor, respectively. The average NMI values are 0.96, 0.94, 0.99 and 0.96 for the target clusters of enzyme, ion channel, GPCR and nuclear receptor, respectively. The average SMI values are 112.9, 49.8, 24.2 and 4.6 for the target clusters of enzyme, ion channel, GPCR and nuclear receptor, respectively. The upper bound of SMI changes according to the reference clustering; the upper bounds of SMI for enzyme, ion channel, GPCR and nuclear receptor are 124.7, 56.1, 28.1 and 5.7, respectively. These performance assessments indicate that the presented model is well suited for drug-target interaction data in which some interactions are missing.

Moreover, from Table 6.3 and Table 6.4, it can be seen that the SMI values increase with the number of super-targets/clusters being compared. For instance, 32, 17, 12 and 3 clusters are identified for the enzyme, ion channel, GPCR and nuclear receptor, respectively, whereas their corresponding SMI values are 112.9, 49.8, 24.2 and 4.6, respectively.

6.4.2 Impact of dimensionality reduction

Table 6.5: Analysis of super-targets obtained using DBSCAN algorithm alone

Enzyme           Super-targets   Outliers   Silhouette
Original data    42              182        1.0
Adapted set 1    42              238        1.0
Adapted set 2    43              246        1.0
Adapted set 3    47              239        1.0
Adapted set 4    43              237        1.0
Adapted set 5    41              243        1.0

To determine the significance of the dimensionality reduction step, super-targets are inferred using the DBSCAN algorithm alone, with the initial higher dimensional drug-enzyme interactions as the inputs. The super-targets inferred using DBSCAN alone are compared with the super-targets inferred by integrating Barnes-Hut-SNE and DBSCAN. The experiments are carried out for the drug-enzyme data and the intrinsic cluster evaluation results are summarized in Table 6.5. Higher Silhouette values indicate that the targets have a greater similarity to their own cluster than to the other clusters; here the Silhouette value is +1 for both the original and the adapted data. However, it is worthwhile to note that out of 664 targets, 182 and 241 targets are removed as outliers in the original and the adapted data (average), respectively. DBSCAN has thus discarded as outliers many enzyme targets that introduce noise in the high-dimensional input space.

Table 6.6 gives the extrinsic evaluation of the super-targets (enzyme) inferred by the DBSCAN algorithm alone, without dimensionality reduction. For the extrinsic evaluation, the super-targets inferred using the original data are compared with the super-targets inferred using the 5 corresponding adapted datasets. The average AMI, NMI and SMI values are 0.84, 0.88 and 73.3, respectively. Compared with the integrated approach (Table 6.4), this corresponds to an absolute loss of 11% in AMI, 8% in NMI and 39.6 in SMI. Even though DBSCAN can infer clusters with higher Silhouette values, the inferred clusters are not consistent for the drug-enzyme data in the original higher dimensional space. Hence, performing dimensionality reduction before carrying out target clustering demonstrates a significant improvement in the overall performance and may alleviate the problem of unknown drug-target interactions.
Table 6.6: Extrinsic evaluation of super-targets obtained using DBSCAN algorithm alone

Adapted set      AMI     NMI     SMI
Adapted set 1    0.85    0.89    74.7
Adapted set 2    0.85    0.89    73.1
Adapted set 3    0.83    0.87    71.9
Adapted set 4    0.84    0.88    74.3
Adapted set 5    0.84    0.88    72.7
Average          0.84    0.88    73.3
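The loss figures quoted for DBSCAN alone can be re-derived from the enzyme rows of Table 6.4 and the per-set values of Table 6.6. The short check below is only an arithmetic illustration; the variable names are hypothetical.

```python
from statistics import mean

# DBSCAN alone on enzyme data, one value per adapted set (Table 6.6)
ami_alone = [0.85, 0.85, 0.83, 0.84, 0.84]
nmi_alone = [0.89, 0.89, 0.87, 0.88, 0.88]
smi_alone = [74.7, 73.1, 71.9, 74.3, 72.7]

# Barnes-Hut-SNE followed by DBSCAN on enzyme data (Table 6.4)
ami_full, nmi_full, smi_full = 0.95, 0.96, 112.9

# Absolute differences between the integrated model and DBSCAN alone
ami_loss = round(ami_full - mean(ami_alone), 2)
nmi_loss = round(nmi_full - mean(nmi_alone), 2)
smi_loss = round(smi_full - mean(smi_alone), 1)
```

The differences round to 0.11 for AMI, 0.08 for NMI and 39.6 for SMI. Since AMI and NMI lie between 0 and 1, the first two can be read as 11 and 8 percentage points, whereas the SMI difference is in SMI units rather than a percentage.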

6.4.3 Impact of DBSCAN algorithm for inferring super-targets

Here, the importance of a clustering algorithm for target clustering is assessed. To visualize the spread of the super-targets, the higher dimensional target-drug interaction feature vectors have to be transformed into a 3-dimensional representation. In Figure 6.2, enzyme data is visualized via 3D scatter plots. Figure 6.2a visualizes the 3-dimensional data after applying the Barnes-Hut-SNE technique to the original (higher dimensional) data. Figure 6.2b visualizes the super-targets identified by the DBSCAN algorithm on the 3-dimensional data. In Figure 6.2b, the targets shown in black are the outliers and the super-targets are shown in other colors. The scatter plots clearly demonstrate the need for a clustering algorithm to distinguish the super-targets and to detect the outliers. In Figure 6.2a, the super-targets located outside the marked region can be distinguished manually by visual inspection, but a clustering algorithm is needed to distinguish the super-targets located inside the marked region. The DBSCAN algorithm was capable of effectively handling outliers and inferring super-targets.

Figure 6.2: 3-D scatter plots, illustrating super-target clustering of enzyme data

6.4.4 Importance of super-target analysis for new drug-target detection

In this section, we present the biological significance of the proposed method using the identified nuclear receptor super-targets. We expect to publish our results on enzyme, ion channel and GPCR in future publications. DBSCAN clustered 11 nuclear receptors into 3 clusters while removing 15 nuclear receptors as outliers. Therefore, only the most similar nuclear receptors are clustered into the same group. Table 6.7 shows the 11 nuclear receptors and their corresponding super-target index.

Table 6.7: Inferred super-targets: nuclear receptor

Super-target index   Nuclear receptor ID   Nuclear receptor name
1                    hsa:2099/ESR1         Estrogen Receptor 1
1                    hsa:2100/ESR2         Estrogen Receptor 2
1                    hsa:5241/PGR          Progesterone Receptor
2                    hsa:2908/NR3C1        Nuclear Receptor Subfamily 3 Group C Member 1
2                    hsa:367/AR            Androgen Receptor
2                    hsa:4306/NR3C2        Nuclear Receptor Subfamily 3 Group C Member 2
3                    hsa:190/NR0B1         Nuclear Receptor Subfamily 0 Group B Member 1
3                    hsa:5914/RARA         Retinoic Acid Receptor Alpha
3                    hsa:6096/RORB         RAR Related Orphan Receptor B
3                    hsa:6257/RXRB         Retinoid X Receptor Beta
3                    hsa:6258/RXRG         Retinoid X Receptor Gamma

We analyzed the drugs interacting with each of these targets to infer new drug-target relationships. Tazarotene interacts with four out of five nuclear receptors in super-target 3; hence the relationship between Tazarotene and NR0B1 is inferred as a plausible new drug-target relationship. Similarly, we identified 13 new drug-target relationships that were not included in our initial data. We believe these 13 drug-target relationships could be used to identify new drug-disease and drug-side effect relationships. Further, to evaluate the significance of this study, supporting evidence for three of these relationships (Desogestrel-ESR2, Diethylstilbestrol-PGR and Levonorgestrel-ESR2) was found in the published literature.

For Desogestrel-ESR2, Grandi et al. [136] showed that desogestrel and its metabolite (etonogestrel, ETN) are very suitable molecules for use in hormonal contraception. Moreover, they showed that the estrogenic component increases the contraindications, forcing the prescription towards the safer progestin-only preparations, the desogestrel progestin-only pill or the ETN implant. For Diethylstilbestrol-PGR, Cunha et al. [137] showed that diethylstilbestrol could induce ESR1 in epithelia of the uterine corpus and cervix, and globally induced PGR in most cells of the developing human female reproductive tract. For Levonorgestrel-ESR2, Wu et al. [138] showed that levonorgestrel might inhibit the recurrence and formation of endometrial polyps by lowering the expression of the estrogen receptor.
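The inference step described in this section can be sketched as follows. The text does not define the confidence score reported in Table 6.8, so this sketch assumes it is the fraction of targets in a super-target that a drug is already known to interact with, which agrees with the Tazarotene example (4 out of 5 gives 0.80). The function name, the `min_score` cut-off and the data layout are hypothetical.

```python
def infer_new_targets(super_target, known_interactions, min_score=0.5):
    """Propose (drug, target, score) triples for one super-target.

    score = fraction of the super-target's members that the drug is already
    known to interact with (an assumed definition, not stated in the text).
    """
    members = set(super_target)
    drugs = {d for d, t in known_interactions if t in members}
    proposals = []
    for drug in sorted(drugs):
        hit = {t for d, t in known_interactions if d == drug and t in members}
        score = len(hit) / len(members)
        if score > min_score:  # hypothetical cut-off
            proposals += [(drug, t, round(score, 2)) for t in sorted(members - hit)]
    return proposals


# Super-target 3 from Table 6.7; Tazarotene interacts with four of its five members.
super_target_3 = ["NR0B1", "RARA", "RORB", "RXRB", "RXRG"]
known = {("Tazarotene", t) for t in ["RARA", "RORB", "RXRB", "RXRG"]}
proposals = infer_new_targets(super_target_3, known)
```

Under this assumed scoring, the only proposal for the example is Tazarotene with NR0B1 at a score of 0.8. A drug that already interacts with every member of a super-target yields no proposal.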

6.5 Discussion

Unlike supervised learning methods that use training sets with class labels, this study uses a fully unsupervised learning approach which does not rely on class labels. The proposed approach integrates dimensionality reduction and clustering for inferring super-targets. In pharmacological data analysis, earlier studies have used dimensionality reduction by PCA to avoid the curse of dimensionality [139, 140]. In the present study, the primary motive for using dimensionality reduction by Barnes-Hut-SNE is to alleviate the problems of nonlinearity and unknown drug-target interactions. Moreover, clustering by DBSCAN enables identification of the most probable super-targets while eliminating confusing targets as outliers. The results demonstrate the effectiveness of the inferred super-targets in terms of NMI, AMI and SMI. Each super-target inferred by DBSCAN represents a group of targets that could be repositioned to treat similar types of diseases.

Table 6.8: Inferred new drug-target relationships: nuclear receptor

Drug name                   Nuclear receptor ID   Confidence score
Tazarotene                  hsa:190/NR0B1         0.80
Desogestrel                 hsa:2100/ESR2         0.67
Levonorgestrel              hsa:2100/ESR2         0.67
Norgestrel                  hsa:2100/ESR2         0.67
Progesterone                hsa:2100/ESR2         0.67
Ethynodiol diacetate        hsa:2100/ESR2         0.67
Diethylstilbestrol          hsa:5241/PGR          0.67
Estradiol                   hsa:5241/PGR          0.67
Estramustine                hsa:5241/PGR          0.67
Raloxifene hydrochloride    hsa:5241/PGR          0.67
Fulvestrant                 hsa:5241/PGR          0.67
Etretinate                  hsa:190/NR0B1         0.60
Etretinate                  hsa:6096/RORB         0.60

Target clustering with outlier detection has greater benefits on data with noise or incomplete information. Clustering targets with incomplete data may infer false clusters. Removing outliers from the super-targets increases the confidence that the targets in the same super-target can be repositioned for similar types of diseases.

Unlike the widely used K-means and SOM, the DBSCAN algorithm infers the number of clusters automatically and it infers clusters of arbitrary shapes. The Growing Self Organizing Map (GSOM) algorithm used in Chapter 3 and Chapter 4 is also capable of identifying clusters of different shapes and identifying the number of clusters automatically. Similar to the GSOM algorithm, DBSCAN starts cluster initialization with an arbitrary point and then the other points are assigned based on their pairwise similarity. However, the density-reachable and density-connected properties of the DBSCAN algorithm enable identification of more coherent super-targets while eliminating outliers. Herein, cosine similarity is employed to compute the pairwise similarity and a super-target is defined only if it consists of at least 3 (minpts) targets. In DBSCAN, there are two types of points in a cluster: core-points are the points inside the cluster and border-points are the points on the border of the cluster. Border-points may violate the density-connected and density-reachable properties when minpts is set to a higher value. Thus, in this study, minpts is set to 3.

Figure 6.3: Dendrogram analysis of higher dimensional feature space and reduced dimensional feature space after using Barnes-Hut-SNE

Targets demonstrate complex behaviors in the pharmacological network as they are involved in multiple disease pathways; hence they may overlap between classes, which makes purely unsupervised clustering difficult. Determining super-targets requires the clustering algorithm to accurately differentiate targets whose properties are known to be different. It should also have the capacity to determine the similarities or differences between targets whose complete characteristics are unknown.

The dendrograms shown in Figure 6.3 also illustrate the significance of the dimensionality reduction step, where targets correspond to the leaf nodes in the dendrogram and the heights denote the distance between nodes. As a result of the Barnes-Hut-SNE technique, the pairwise distance range has been expanded from 2.57 - 8.05 to 6.95 - 40.32. Therefore, employing dimensionality reduction by Barnes-Hut-SNE emphasizes the differences between targets and is better suited to this situation. It may also enable faster learning of the clustering algorithm.

Feature extraction and a reduced number of feature dimensions enable the machine learning algorithm to learn well and infer useful patterns. Known drug-target interactions may not represent the complete drug-target relationships, which motivates the idea of drug repositioning by new target detection. Learning patterns based only on existing drug-target interactions may yield an increased number of false positives and false negatives. Dimensionality reduction techniques can be used to extract the most dominant features so that the errors arising from missing interactions can be minimized. In this study, Barnes-Hut-SNE, a variant of the t-SNE technique, is used for dimensionality reduction; it is a fast technique that uses nonlinear learning to map the higher dimensional data into two or three dimensions. However, it is infeasible to embed into more than three dimensions because the size of the tree in the embedding space grows exponentially with the number of dimensions [124]. This approach is also popular for visualization: the reduced 2 or 3 dimensions can easily be plotted in a 2D or 3D map to visualize patterns. Due to the complex nature of target networks, clear cluster separations were not found using the Barnes-Hut-SNE technique alone. However, the results demonstrate the benefit of dimensionality reduction in inferring super-targets.
Tables 6.5 and 6.6 show the intrinsic and extrinsic cluster evaluation, respectively, of the super-targets for enzyme data using DBSCAN alone. Table 6.5 demonstrates the effectiveness of the DBSCAN clustering algorithm: DBSCAN is capable of inferring clusters with the highest Silhouette value of +1 while discriminating confusing outliers. However, it should be noted that the outliers detected by DBSCAN alone amount to 27% and 36% of the targets for the original and the adapted data, respectively. The outliers detected by the proposed integrated model of Barnes-Hut-SNE and DBSCAN amount to only 8% of the targets, and the Silhouette value is +0.86.

The extrinsic evaluation of super-targets is challenging without a standard reference clustering. Proposing a performance evaluation method is the secondary motive of this chapter. Introducing 5 adapted datasets for each target type by removing 10% of known drug-target interactions enabled a validation approach to assess the inferred super-targets using AMI, NMI and SMI. The extrinsic evaluation results shown in Tables 6.6 and 6.4 demonstrate an absolute improvement of the proposed integrated model of 11% in AMI, 8% in NMI and 39.6 in SMI for drug-enzyme data. Therefore, the dimensionality reduction technique is beneficial in inferring consistent super-targets, particularly when drug-target interaction information is incomplete. It can be concluded that the presented model is a useful clustering model which can be explored further for drug repositioning tasks or other relevant research.

Identifying similar targets enables inferring new drug-target as well as target-disease and target-side effect relationships. Moreover, the proposed approach can be employed to cluster 'super-drugs' so that drug repositioning by new target detection can be accomplished from both directions: drug and target. Exploring super-drugs further may be useful when learning pairwise similarities between drugs. Moreover, the results inferred by this study may provide complementary and supporting evidence to experimental studies and may constitute a useful step in the analysis of pairwise similarities.

6.6 Summary

New target detection is one of the main concepts in drug repositioning. Due to the high cost, time and risk of the conventional drug development process, computational drug-target detection has become popular. As mentioned in previous chapters, pairwise similarity is widely used to infer repositioning candidates in computational drug repositioning. However, incomplete knowledge of drug-target interactions is challenging for machine learning methods.

In this chapter, drug-target interactions are analyzed to infer super-targets. Targets are clustered based on their relationships with interacting drugs. We integrated the Barnes-Hut-SNE technique and the density-based spatial clustering of applications with noise (DBSCAN) algorithm to cluster targets based on drug-target interaction data. Inferring super-targets by a density-based clustering technique is shown to be an effective method to distinguish clusters with high coherence. Investigating super-targets provides an opportunity to explore target-target similarities through which new drug-target detection can be performed; hence, new drug-disease and drug-side effect relationships can be identified.

Moreover, the results demonstrate the benefit of reduced dimensions in improving clusters, particularly for incomplete data. Integrating dimensionality reduction by Barnes-Hut-SNE and outlier detection by DBSCAN can reduce the limitations arising from incomplete data in the input space. Hence, it is a good combination for exploring drug-target data and it improves the performance of target clustering. Extrinsic evaluation is also challenging as there is no standard reference clustering. The validation approach suggested in this study made use of NMI, AMI and SMI to evaluate the inferred super-targets. The intrinsic and extrinsic measures demonstrate the effectiveness of the proposed method for inferring super-targets. These methods can be used as a basis for further studies on drug repositioning by new target recognition as well as side effect prediction by off-target analysis.

Chapter 7 Conclusions

In Chapter 3, we introduced an approach to DDI prediction based on positive-unlabeled learning (PUL), a method that enables using reliable positive and negative DDIs to improve the prediction of new DDIs. The DDIs collected from DrugBank serve as positive DDIs, while negative DDIs are inferred by applying a Growing Self Organizing Map (GSOM) to cluster instances. Our GSOM-based PUL approach outperforms the widespread strategy of selecting negatives at random. Secondly, we introduced a new pairwise drug similarity function (SFR2) as a more useful representation to capture finer-grained similarities compared to the summarised pairwise similarity representation (SFR1) inspired by the Jaccard Index. SFR2 could overcome the limitations arising from the widely-used summarized pairwise similarity measure. Our GSOM-based PUL approach achieved an absolute improvement in F1-score of 38% and 14% in comparison to the method that randomly selects unlabeled data points as likely negatives using SFR1 and SFR2, respectively. DDIs can be associated with cytochrome P450 (CYP) enzyme systems. Therefore, we classified the inferred DDIs as CYP-Dependent and CYP-Independent interactions, invoking their locations on the Growing Self Organizing Map. A case study on Bosentan-Abacavir, Carvedilol-Metformin and Cimetidine-Erythromycin (three drug pairs that are frequently used in treating some of the common diseases) interactions inferred by this method illustrates the clinical relevance of the presented methods.

Chapter 4 presents a two-tiered unsupervised clustering approach for drug clustering. Drugs in the same cluster will demonstrate similar characteristics while being relatively dissimilar to drugs in other clusters. Initially, drug clustering is used to explore pairwise drug similarity. Subsequently, it is extended for heterogeneous data integration. The primary objective of heterogeneous/multi-view data integration is to understand the drug characteristics more deeply and to obtain a consensus solution. Finally, the drug clusters are compared with the current Anatomical Therapeutic Chemical (ATC) therapeutic classes. The mismatches between the identified drug clustering and the current ATC classes led to identifying new ATC therapeutic classes and reliable drug repositioning candidates. In this study, vector-based and graph-based clustering algorithms are employed for drug clustering. The repositioning candidates identified consistently by multiple clustering algorithms and with high confidence have a higher possibility for reliable drug repositioning. The clinical evidence for four repositioning candidates identified by this approach (Chlorthalidone, Indomethacin, Metformin and Thioridazine) provides support that the approach is reliable for inferring new therapeutic uses of drugs.

Chapter 5 explores the benefits of the Prize-Collecting Steiner Tree approach for subnetwork identification for drug repositioning. The hierarchy of the ATC classification is employed as the primary attribute for establishing multiple drug similarity networks. Moreover, the drugs classified in the ATC classification system are used to label some nodes in the drug network, transforming drug repositioning into a semi-supervised learning problem. The Physarum-inspired subnetwork identification algorithm (PSIA) explained in Appendix A is an effective subnetwork identification algorithm and can be applied in semi-supervised learning problems such as drug repositioning. PSIA is applied to identify repositioning drug candidates for a single disease at a time, and it can be employed when some of the drugs for that particular disease are known in advance. In this study, PSIA is employed to infer drug repositioning candidates for cardiovascular diseases and nervous system diseases via ATC class C and ATC class N, respectively.

In the last contribution chapter, Chapter 6, attention is turned to methods that avoid the limitations arising from incomplete or missing data in drug-target interactions. Targets are clustered based on their relationships with interacting drugs. Initially, a validation method is introduced for target clustering when there is no standard reference cluster. Subsequently, an integration of the Barnes-Hut-SNE technique and the density-based spatial clustering of applications with noise (DBSCAN) algorithm is presented to cluster targets based on drug-target interaction data. The integrated clustering approach demonstrates an absolute improvement in AMI, NMI and SMI of 11%, 8% and 39.6% compared to the clustering approach without dimensionality reduction. Further, the identified super-targets on nuclear receptors are analyzed in depth to infer new drug-target relationships. Employing DBSCAN clustering was shown to be effective in identifying outliers and useful super-targets.
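To make the extrinsic measure quoted above more concrete, the following is a minimal sketch of how NMI between a clustering and a reference labelling can be computed. It is an illustration only, not the implementation used in this thesis: the function names are hypothetical, and normalising by the arithmetic mean of the two entropies is an assumption, since other normalisations are also in common use.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a cluster labelling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(labels_a, labels_b):
    """NMI between two labellings of the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("labellings must be non-empty and of equal length")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information summed over the non-empty cells of the contingency table.
    mutual = sum(
        (n_ab / n) * log(n * n_ab / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )
    # Assumption: normalise by the arithmetic mean of the two entropies.
    mean_entropy = (entropy(labels_a) + entropy(labels_b)) / 2.0
    if mean_entropy == 0.0:
        return 1.0  # both labellings put every item in a single cluster
    return mutual / mean_entropy
```

Two labellings that describe the same partition under different cluster names score 1, while statistically independent labellings score 0, which is why the measure suits comparisons where cluster identifiers carry no meaning.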

7.1 Future work

In this study, chemical, disease, protein and side effect characteristics of drugs have been considered for DDI prediction and drug repositioning. Considering these heterogeneous characteristics provides an opportunity to describe drugs from four different perspectives. Employing other characteristics such as ligands, gene expressions, etc. may also improve the clinical significance of the final prediction. However, the predicted DDIs might be biased towards pharmacokinetic interactions. Incorporating pharmacodynamic properties such as drug concentration at its site of action may enable exploring pharmacodynamic interactions [141].

The drug characteristics data used in this study may contain noisy or incomplete data. Unlike the chemical structure of the drug, the drug-disease, drug-protein and drug-side effect relationships may be noisy. Giving them smaller weights in the similarity feature vector will help to minimize the problems arising from missing data. Moreover, the methods proposed in Chapter 4 can be used to assign weights to each drug characteristic.

The DDI prediction model introduced in this thesis employs a Support Vector Machine as the binary classifier. Applying an ensemble learning model, for instance using multiple classifiers such as Logistic Regression, Random Forest, etc., may further improve the inference of DDIs. The introduced pairwise similarity representation can also be extended to represent similarities between more than two drugs. Further, the two-tiered drug clustering approach proposed in Chapter 4 could be employed to capture the pairwise drug similarities and heterogeneous data integration. However, memory requirements should be considered when moving to the second tier of clustering. In comparison with the drug clustering context, DDI clustering requires more space to store and manipulate the similarity matrix.
In a drug network with N drugs and N(N-1)/2 pairs of interactions, the size of the similarity matrix representation would be N(N-1)/2 × N(N-1)/2. Therefore, the second tier of the two-tiered clustering approach becomes expensive for larger networks with a higher number of drugs. Furthermore, the unsupervised learning model introduced in Chapter 6 that integrates dimensionality reduction and outlier detection can be applied to improve the inference of DDIs.

Since ATC therapeutic class labels were used to determine the class label for the identified drug clusters in Chapter 4, drug repositioning is limited to the therapeutic classes that are captured in the ATC classification system. However, this method can be applied to infer drug repositioning candidates by employing other therapeutic class references and to infer other functions such as mechanism of action (MOA). Investigating new therapeutic uses for drug combinations is another useful option in drug repositioning [142]. The Orange Book [143], published and controlled by the U.S. Food and Drug Administration, provides information on currently allowed drug combinations. Awareness of harmful DDIs is beneficial for extracting the most suitable drug combinations for drug repositioning. Integrating the introduced PUL approach with the two-tiered clustering approach could be employed for inferring plausible drug combinations.

In subnetwork identification, drug repositioning on the drug similarity networks is formulated as a semi-supervised learning problem; hence semi-supervised learning methods such as seeded-K-means [144] and seeded-GSOM [145] can also be applied to infer repositioning drug candidates. Moreover, the introduced subnetwork identification can be employed to infer repositioning drug candidates for other diseases.

It is evidenced that identifying similar targets enables inferring new drug-target, target-disease and target-side effect relationships.
It follows that similarities within super-targets/target-clusters can be explored further for drug repositioning and side effect prediction via target similarities and off-target similarities, respectively. Similar to the methods explained in Chapter 4, cluster relationships can be used to explore pairwise target similarities. In the drug-target interaction data used in this study, every interaction is given the same weight. Employing information such as the strength of target interaction and binding affinity can improve the final predictions by distinguishing the interaction types.

Moreover, association rule mining could be used to determine the relationships among targets within super-targets. Association rule mining of targets on the input space of drug-target interactions may not provide higher inter-target similarities. Therefore, employing it on the super-target/target-cluster space can be extrapolated to infer new targets for drugs. Having removed the outliers in the super-targets/target-clusters may increase the confidence that the association rules derived on the basis of super-targets/target-clusters are accurate.

Appendix A A physarum-inspired prize-collecting steiner tree approach to identify subnetworks for drug repositioning

The content of this appendix is presented in the following published form where I am an equal first author: Sun, Y.+ , Hameed, P. N.+, Verspoor, K., & Halgamuge, S. (2016). A physarum-inspired prize-collecting steiner tree approach to identify subnetworks for drug repositioning. BMC Systems Biology, 10(5), 128.


Sun et al. BMC Systems Biology 2016, 10(Suppl 5):128 DOI 10.1186/s12918-016-0371-3

RESEARCH Open Access A physarum-inspired prize-collecting steiner tree approach to identify subnetworks for drug repositioning Yahui Sun1†, Pathima Nusrath Hameed1,2,3†, Karin Verspoor4 and Saman Halgamuge5*

From 15th International Conference On Bioinformatics (INCOB 2016) Queenstown, Singapore, 21-23 September 2016

Abstract Background: Drug repositioning can reduce the time, costs and risks of drug development by identifying new therapeutic effects for known drugs. It is challenging to reposition drugs as pharmacological data is large and complex. Subnetwork identification has already been used to simplify the visualization and interpretation of biological data, but it has not been applied to drug repositioning so far. In this paper, we fill this gap by proposing a new Physarum-inspired Prize-Collecting Steiner Tree algorithm to identify subnetworks for drug repositioning. Results: Drug Similarity Networks (DSN) are generated using the chemical, therapeutic, protein, and phenotype features of drugs. In DSNs, vertex prizes and edge costs represent the similarities and dissimilarities between drugs respectively, and terminals represent drugs in the cardiovascular class, as defined in the Anatomical Therapeutic Chemical classification system. A new Physarum-inspired Prize-Collecting Steiner Tree algorithm is proposed in this paper to identify subnetworks. We apply both the proposed algorithm and the widely-used GW algorithm to identify subnetworks in our 18 generated DSNs. In these DSNs, our proposed algorithm identifies subnetworks with an average Rand Index of 81.1%, while the GW algorithm can only identify subnetworks with an average Rand Index of 64.1%. We select 9 subnetworks with high Rand Index to find drug repositioning opportunities. 10 frequently occurring drugs in these subnetworks are identified as candidates to be repositioned for cardiovascular diseases. Conclusions: We find evidence to support previous discoveries that nitroglycerin, theophylline and acarbose may be able to be repositioned for cardiovascular diseases. Moreover, we identify seven previously unknown drug candidates that also may interact with the biological cardiovascular system. 
These discoveries show our proposed Prize-Collecting Steiner Tree approach as a promising strategy for drug repositioning. Keywords: Steiner tree problem, Subnetwork identification, Drug similarity network, Big data, Physarum polycephalum

*Correspondence: [email protected] †Equal contributors 5Research School of Engineering, College of Engineering & Computer Science, The Australian National University, 2601 Canberra, ACT, Australia Full list of author information is available at the end of the article

© The Author(s). 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.


Background

Drug repositioning aims to identify new therapeutic effects for known drugs. By repositioning known drugs, drug development time, costs and risks can be reduced significantly [1–3]. There are mainly two challenges to reposition drugs. First, pharmacological data is usually big and difficult to analyze [4, 5]. Second, pharmacological data is highly complex and involves various drug characteristics, including their chemical structures, molecular targets and induced gene expression signatures [6].

Existing drug repositioning methods can be divided into three categories: data-driven methods [1–3, 6], text-mining methods [7, 8], and network-based methods [3, 9–11]. The data-driven methods reposition drugs by analyzing pharmacological data using statistical and machine learning concepts such as statistical estimations, classification and clustering [1, 6, 10]. Because of the overlapping nature of pharmacological data [3, 11], the evaluation process of the data-driven methods is complicated [11]. On the other hand, text mining methods use efficient text analytics and semantic inference approaches to reposition drugs [7, 8], but their application is limited by the availability of relevant biomedical publications and reports. Network-based methods are emerging methods that use networks to represent pharmacological data [10]. These methods typically reposition drugs by identifying drug candidates in multiple decomposed subnetworks [10–12]. Even though multiple therapeutic effects are expected to be found, it requires a long time to analyze these multiple decomposed subnetworks.

Subnetwork identification is a technique to identify a single small-scale subnetwork from a large-scale network. It differs from previous network-based methods in that we only need to analyze a single identified subnetwork. This method has already been proven to be efficient to simplify the visualization and interpretation of protein-protein interaction networks [13–16], protein-DNA interaction networks [17], gene-regulatory networks [18] and metabolic networks [19]. However, to our knowledge, no one has applied subnetwork identification to pharmacological networks so far. This paper will fill this gap by exploring the application of subnetwork identification to drug repositioning for the first time.

The Prize-Collecting Steiner Tree (PCST) approach is gaining traction in subnetwork identification, but it has not yet been applied to pharmacological data. Exact PCST algorithms are slow on large graphs. The method proposed in this paper is heuristic, i.e. it does not guarantee an exact solution, and it is not deterministic, because its sink node is chosen probabilistically. The definition of the Prize-Collecting Steiner Tree Problem (PCSTP) is given as follows: let $G = (V, E, p, c)$ be a connected, undirected graph, where $V$ is the set of vertices, $E$ is the set of edges, $p$ is a function which maps each vertex in $V$ to a non-negative number called the prize, and $c$ is a function which maps each edge in $E$ to a positive number called the cost. Let $T$ be a subset of $V$ called terminals. The aim of PCSTP is to find a connected subgraph $G' = (V', E')$, $V' \subseteq V$, $E' \subseteq E$, which contains all the terminals while minimizing the objective function $c(G') = \sum_{e \in E'} c(e) - \sum_{v \in V'} p(v)$, and the optimal solution of PCSTP is called Steiner Minimum Tree (SMT) in $G$ for $T$.

The algorithms for PCSTP can be divided into two groups: exact algorithms and heuristic algorithms. Exact algorithms can find SMT, but are slow in large graphs [20]. On the contrary, heuristic algorithms can find solutions faster, but they may only find close approximations to SMT [21]. The Drug Similarity Networks (DSN) we used in this paper are large graphs with 548 vertices and thousands of edges. Thus, it is necessary for us to use heuristic algorithms in DSNs. Many heuristic algorithms have been proposed to solve PCSTP; the GW algorithm (named for Michel X. Goemans and David P. Williamson) is the most popular one [22–25]. However, in our simulations we observe that GW algorithm does not perform well in DSNs. Physarum-inspired algorithms are emerging heuristic algorithms that have already been used to solve PCSTP [26]. In this paper, we propose a new Physarum-inspired algorithm called Physarum-inspired Subnetwork Identification Algorithm (PSIA) to identify subnetworks in DSNs. Our proposed PSIA outperforms the popular GW algorithm by identifying more suitable subnetworks for drug repositioning. Furthermore, by analyzing the identified subnetworks, we find evidence to support previous discoveries that some drugs could be repositioned for cardiovascular diseases. These discoveries show that our proposed Prize-Collecting Steiner Tree approach is effective and efficient to reposition drugs.

Methods

Generation of drug similarity networks

We build Drug Similarity Networks (DSNs) to represent the similarities between drugs. There are several pharmacological databases at present, such as PharmGKB [27], DrugBank [5, 28], SIDER [29], etc. We generate DSNs using the data following the work of Zhang et al. [30], which includes data from DrugBank and SIDER. Similarities between drugs are quantified in DSNs based on their chemical, therapeutic, protein and phenotype features. There are 881 chemical features, 719 therapeutic features, 775 protein features, and 1385 phenotype features considered for each drug. Therefore, 3760 (881+719+775+1385) features in total are considered for each drug.

The DSNs we generated have five components:

vertex: Each vertex represents a drug. There are 548 drugs included in each of our generated DSNs [30]. Each drug is associated with a 1 × 3760 feature vector where binary numbers represent the presence or


absence of each individual feature that we consider. Note that binary numbers have already been widely used to describe drug features [6, 30, 31].

edge: Each edge represents the association between two drugs.

terminal: Each terminal represents a vertex which must be contained in the identified subnetworks of DSNs. In each DSN, the terminal set represents a cardiovascular subclass of drugs in the Anatomical Therapeutic Chemical (ATC) classification system [32]. ATC is used for the classification of active ingredients of drugs according to the organ or system on which they act and their therapeutic, pharmacological and chemical properties. There are 9 subclasses in the cardiovascular class (C): cardiac therapy (C01), antihypertensives (C02), diuretics (C03), peripheral vasodilators (C04), vasoprotectives (C05), beta blocking agents (C07), calcium channel blockers (C08), agents acting on the renin-angiotensin system (C09), and lipid modifying agents (C10). There are 104 drugs in total in these subclasses. (Notably, there is no C06 in the ATC classification system.)

edge cost: Each edge cost represents the quantified dissimilarity between two drugs. The bigger the edge cost is, the more dissimilar the two drugs are. The edge cost is calculated using the Jaccard coefficient, as shown in the formula below.

$$c_{ij} = 1 - \frac{\sum_{k=1}^{n} \upsilon_i(k) \cap \upsilon_j(k)}{\sum_{k=1}^{n} \upsilon_i(k) \cup \upsilon_j(k)} \qquad (1)$$

where $i$ and $j$ are indexes of two different drugs, $c_{ij}$ is the cost of edge $(i, j)$, $n$ is the total number of features considered for each drug, which is 3760, and $\upsilon_i$ is the feature vector of drug $i$.

vertex prize: A prize is associated with each vertex to signify the similarity between the drug represented by this vertex and all the drugs represented by terminals. The vertex prize is calculated using the following equation.

$$p_i = \frac{\sum_{j \in T,\, j \neq i} \frac{1}{1 + c_{ij}}}{|T|} \qquad (2)$$

where $p_i$ is the prize of vertex $i$, $T$ is the set of terminals, and $|T|$ is the total number of terminals.

The objective of PCSTP is to minimize the net-cost of edge costs and vertex prizes. Thus, the subnetwork identified using the PCST approach tends to include edges with small costs and vertices with big prizes. In our generated DSNs, edges with small costs connect drugs with big similarities, and vertices with big prizes represent drugs that are similar to the drugs represented by terminals. Hence, a subnetwork of DSN that includes drugs similar to the drugs represented by terminals is expected to be identified using the PCST approach.

Complete graphs with different sets of terminals can be generated using the five graph components defined above. Since the sets of vertices are identical, the sets of edge costs are also the same in different complete graphs. However, the sets of vertex prizes are different as the sets of terminals are different in different complete graphs. PCSTP algorithms perform better in sparse graphs than in complete graphs [22]. Therefore, we propose two sparse graph generation algorithms to prune the complete graphs to produce sparse graphs for DSNs.

In our first proposed algorithm, the Minimum Spanning Tree (MST) of the complete graphs is found using Prim's algorithm [33]. Then, edges are added probabilistically to MST until the total number of edges is increased to the desired number. This algorithm is outlined in Fig. 1, in which $|E|$ is the number of edges in the sparse graph, $|V|$ is the number of vertices, $De$ is the desired number of edges in the sparse graph, and $Pro$ is the probability of adding edges to MST.

Fig. 1 The first proposed sparse graph generation algorithm

Our first proposed algorithm generates a sparse graph by adding edges to the MST of a complete graph, while our second proposed algorithm generates a sparse graph by deleting edges from a complete graph. The challenge of generating a sparse graph by deleting edges is to delete as many edges as possible while maintaining the graph connectivity. The graph connectivity can be checked using Tarjan's algorithm, which has the complexity of $O(|V| + |E|)$ [34]. It takes a long time to generate a sparse graph if the connectivity is checked every time after an edge is deleted. In our second proposed algorithm, two threshold values, $t_1$ and $t_2$, are used to delete edges in two steps. In the first step, all the edges which have a cost below $t_1$ are deleted from the complete graph. In the second step, all the edges which have a cost below $t_2$ are deleted from the


graph when deleting the edge will not make the graph disconnected. Set $t_1 < t_2$, and make sure $t_1$ is small enough to maintain the graph connectivity. The purpose of deleting edges in two steps is to make the algorithm fast by only checking the graph connectivity in the second step. Our second proposed algorithm is outlined in Fig. 2.

Fig. 2 The second proposed sparse graph generation algorithm

Two sparse graphs can be generated from each complete graph using the two algorithms proposed above. These two algorithms generate sparse graphs by only considering the edge costs. Since the sets of edge costs are the same in different complete graphs, the sparse graphs generated using the same proposed algorithm will have the same set of edges. Therefore, sparse graphs with two different sets of edges are generated using the two proposed algorithms, and these two types of sparse graphs are visualized in Fig. 3. Fig. 3a visualizes the sparse graphs generated using the first proposed algorithm, each of which has 548 vertices and 1500 edges. Fig. 3b visualizes the sparse graphs generated using the second proposed algorithm, each of which has 548 vertices and 1391 edges.

The distributions of edge costs in the complete graphs and two types of sparse graphs are shown in Fig. 4. It can be seen from Fig. 4a that most edges in the complete graphs have costs between 0.5 and 0.9. It can be seen from Fig. 4b that most edges in the first type of sparse graphs also have costs between 0.5 and 0.9. The reason why the complete graphs and the first type of sparse graphs have similar distributions of edge costs is that, in the first proposed algorithm, edges are randomly added to the MST of the complete graphs without considering their costs. However, it can be seen from Fig. 4c that most edges in the second type of sparse graphs have costs between 0.9 and 1. This is because $t_1$ and $t_2$ are set to 0.9 and 0.95 respectively in the second proposed algorithm, and all the edges which have a cost below 0.9 have been deleted. In the computational trials, it takes the second proposed algorithm 29.24 s to generate a sparse graph when $t_1 = 0.9$ and $t_2 = 0.95$. In contrast, it takes the second proposed algorithm 7870.69 s to generate a sparse graph when $t_1 = 0.5$ and $t_2 = 0.95$. Moreover, the graph becomes disconnected when $t_1 = 0.95$. Therefore, the computational trials show that a big $t_1$ makes the second proposed algorithm fast, but at the risk of ruining the graph connectivity, and a big $t_2$ makes the graph sparse, but at the cost of long running time.

DSNs in sparse graphs are generated using the two proposed algorithms. Because no vertex has been deleted in any of these sparse graphs, subnetworks containing similar drugs can be identified in the sparse graphs generated by both proposed algorithms. Nevertheless, in our simulations we find that PCSTP algorithms have better performances in DSNs generated using the second proposed algorithm than in DSNs generated using the first proposed algorithm.

The proposed physarum-inspired subnetwork identification algorithm

Physarum polycephalum is a large amoeboid organism that has displayed many intelligent behaviors [35–37]. The Physarum-inspired Subnetwork Identification Algorithm (PSIA) is proposed in this paper to identify subnetworks in DSNs. The proposal of PSIA is inspired by the Lowest-cost Network Physarum Optimization algorithm (LNPO) [26]. LNPO is designed to find PCSTP solutions as close to SMT as possible. There are two iteration processes in LNPO, the inner iteration process and the outer iteration process. A feasible PCSTP solution can be found in each inner iteration process. The outer iteration process is used to find multiple solutions and choose the solution which is closest to SMT as the final solution. However, SMT or close approximations to SMT may not be suitable for drug repositioning. There is no need to use the outer iteration process in PSIA. Thus, only the inner iteration process is included in PSIA. Moreover, the subnetwork identified in the inner iteration process may not be a tree. Hence, a post-processing technique is used in PSIA to ensure that the identified subnetwork is a tree.

In our proposed PSIA algorithm, a single terminal is chosen probabilistically to be the sink node, and all the other terminals will become source nodes. Let $l(i)$ be the total cost of edges linked to terminal $i$. Name the terminals in such a way that $l(1) \le l(2) \le \cdots \le l(|T|)$, where $|T|$ is the number of terminals. Then, the probability of choosing terminal $i$ as the sink node can be obtained by

$$P(i) = \frac{l(|T| - i + 1)}{\sum_{j=1}^{|T|} l(j)} \qquad (3)$$

There is flux flowing through each edge, and the flux $Q_{ij}$ in edge $(i, j)$ is given by

$$Q_{ij} = \frac{D_{ij}}{C_{ij}} \left( Pr_i - Pr_j \right) \qquad (4)$$


Fig. 3 Visualization of two types of sparse graphs. a shows the first type of sparse graphs, which are generated using the first proposed algorithm. b shows the second type of sparse graphs, which are generated using the second proposed algorithm
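The second type of sparse graph shown in Fig. 3b comes from the two-threshold edge deletion procedure described in the text. The following is a minimal sketch of that procedure under stated assumptions, not the authors' MATLAB code: the function names are hypothetical, connectivity is checked with a plain depth-first search rather than Tarjan's algorithm, and the order in which step-2 candidates are visited (ascending cost) is an assumption because the text does not specify it.

```python
from collections import defaultdict


def is_connected(vertices, edges):
    """Return True if the undirected graph (vertices, edges) is connected."""
    vertices = list(vertices)
    if not vertices:
        return True
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = {vertices[0]}
    stack = [vertices[0]]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(vertices)


def two_threshold_sparsify(vertices, costs, t1, t2):
    """Prune a graph in two steps.

    costs maps an edge (u, v) to its cost c_uv.
    Step 1 drops every edge with cost below t1 without any connectivity check.
    Step 2 drops an edge with cost below t2 only if the graph stays connected.
    """
    if not t1 < t2:
        raise ValueError("expected t1 < t2")
    kept = {edge for edge, cost in costs.items() if cost >= t1}
    if not is_connected(vertices, kept):
        raise ValueError("t1 is too large: step 1 disconnected the graph")
    # Assumption: step-2 candidates are visited in ascending cost order.
    candidates = sorted(
        (edge for edge in kept if costs[edge] < t2),
        key=lambda edge: (costs[edge], edge),
    )
    for edge in candidates:
        kept.remove(edge)
        if not is_connected(vertices, kept):
            kept.add(edge)  # removal would disconnect the graph, so keep it
    return kept
```

Checking connectivity only in step 2 is what keeps the procedure fast: with a larger t1 more edges are removed without a check, which matches the running-time trade-off reported in the text.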

Fig. 4 The distributions of edge costs in the complete graphs and the sparse graphs. a shows the distribution of edge costs in the complete graphs. b shows the distribution of edge costs in the first type of sparse graphs, which are generated using the first proposed algorithm. c shows the distribution of edge costs in the second type of sparse graphs, which are generated using the second proposed algorithm
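The edge costs whose distributions appear in Fig. 4 are Jaccard dissimilarities between binary feature vectors (Eq. 1), and the vertex prizes are derived from them (Eq. 2). The sketch below illustrates both quantities on toy data. The function names and the handling of two all-zero vectors are assumptions made for illustration; the prize formula follows our reading of Eq. 2 as the average of 1/(1 + c_ij) over the terminal set.

```python
def jaccard_cost(features_i, features_j):
    """Edge cost c_ij = 1 - |intersection| / |union| of two binary vectors (Eq. 1)."""
    if len(features_i) != len(features_j):
        raise ValueError("feature vectors must have the same length")
    both = sum(1 for a, b in zip(features_i, features_j) if a and b)
    either = sum(1 for a, b in zip(features_i, features_j) if a or b)
    if either == 0:
        # Assumption: two drugs with no features at all are maximally dissimilar.
        return 1.0
    return 1.0 - both / either


def vertex_prize(i, terminals, cost):
    """Vertex prize p_i = (sum over terminals j != i of 1 / (1 + c_ij)) / |T| (Eq. 2)."""
    total = sum(1.0 / (1.0 + cost(i, j)) for j in terminals if j != i)
    return total / len(terminals)


# Toy data: three drugs described by four binary features each.
features = {
    0: [1, 1, 0, 0],
    1: [1, 0, 1, 0],
    2: [0, 0, 1, 1],
}


def toy_cost(i, j):
    """Edge cost between two toy drugs."""
    return jaccard_cost(features[i], features[j])


toy_terminals = [1, 2]
prize_of_drug_0 = vertex_prize(0, toy_terminals, toy_cost)
```

A drug that shares many features with the terminals has small costs to them and therefore a large prize, which is what draws it into the identified subnetwork.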


$$C_{ij} = c_{ij} - \frac{p_i}{d_i} - \frac{p_j}{d_j} + 2N \qquad (5)$$

where $D_{ij}$ is the edge conductivity, $C_{ij}$ is a net-cost for edge $(i, j)$, $Pr_i$ and $Pr_j$ are pressures at vertex $i$ and $j$, $c_{ij}$ is the cost of edge $(i, j)$, $p_i$ and $p_j$ are the prizes of vertex $i$ and $j$, $d_i$ and $d_j$ are the degrees of vertex $i$ and $j$, and $N = \max_{k \in V} p_k$.

The flux flows into the network from each source node, and the flux flows out of the network from the single sink node. By considering the conservation law of flux at each vertex, the network Poisson equation is described below.

$$\sum_{i \in V(j)} \frac{D_{ij}}{C_{ij}} \left( Pr_i - Pr_j \right) = \begin{cases} -I_0, & j = \text{source} \\ +(|T| - 1) I_0, & j = \text{sink} \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$

where $V(j)$ is the set of vertices linked to vertex $j$, and $I_0$ is the flux flowing into each source node. Let the pressure at the sink node be 0, and other pressures can be calculated by solving the network Poisson equation. In our simulations, we find that the net-costs of edges in DSNs are quite close to each other. In this case, if all the edge conductivities are the same, then the network Poisson equation may not be solvable. Thus, we give each edge conductivity a random initial value to make the network Poisson equation solvable.

After the calculation of pressures, the flux in each edge can be obtained. Edge conductivities will be updated using the conductivity update equation below.

$$D_{ij}(k + 1) = D_{ij}(k) + \alpha |Q_{ij}(k)| - \mu D_{ij}(k) \qquad (7)$$

where $k$ is the number of conductivity update times, and $\alpha$ and $\mu$ are two constants. Edges with conductivities smaller than the threshold value $\epsilon$ will be cut from the graph. We iteratively update the edge conductivities and cut edges for $K$ times to find a subnetwork. However, this subnetwork may not be a tree. Thus, MST of this subnetwork is found to be the final identified subnetwork. The proposed PSIA is outlined in Fig. 5 (the MATLAB coding of PSIA is publicly available at https://github.com/YahuiSun/PSIA-to-identify-subnetworks).

Because the sink node is chosen probabilistically in PSIA, different subnetworks can be identified in a single DSN by employing PSIA multiple times. To reposition drugs, we employ PSIA multiple times in each DSN to identify multiple subnetworks. Then, we select the most suitable subnetwork from them for drug repositioning.

GW algorithm

Besides the proposed PSIA, we also use the popular GW algorithm to identify subnetworks in DSNs. GW algorithm was proposed by Michel X. Goemans and David P. Williamson [22], and it is widely used to solve PCSTP [23–25]. However, GW algorithm is designed to solve PCSTP instances with a single terminal, which is called the root, whereas in DSNs there are multiple terminals. In this paper, we apply GW algorithm to DSNs by randomly choosing a single terminal to be the root and giving the other terminals big prizes.

We first choose a single terminal to be the root. Then, we give each of the other terminals a big prize $M$, where $M > \sum_{(i,j) \in E} c_{ij}$. This big prize ensures that all the terminals will be included in the subnetwork identified by GW algorithm.

To identify a subnetwork, we initially set each vertex as a component. Each component has a surplus (initially the vertex prize). A component is active when its surplus is bigger than 0. However, the root component will always be inactive. In addition, each edge has a deficit (initially the edge cost), and an edge is active when it is not connecting two vertices in the same component.

Setting a constant $\epsilon$, we iteratively do this: the surplus of every active component is reduced by $\epsilon$, the deficit of any active edge adjacent to a single active component is reduced by $\epsilon$, and the deficit of any active edge adjacent to two active components is reduced by $2\epsilon$. After the update of surpluses and deficits, we check that: if an edge's deficit is not above 0, we merge the two components linked by this edge and give the new merged component the sum of surpluses of the two components being merged; if a component's surplus is not above 0, we deactivate this component. The iteration continues until

Fig. 5 The proposed physarum-inspired subnetwork identification algorithm
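As a rough illustration of the quantities used by PSIA, the following Python sketch evaluates the net-cost of Eq. (5), the edge flux that appears as the summand of Eq. (6), the conductivity update of Eq. (7), and the edge-cutting step. It is an independent sketch and not the authors' MATLAB implementation: the function names, the toy numbers, and the dictionary-based edge representation are all assumptions made for illustration, and the pressure solve itself is omitted.

```python
def net_cost(c_ij, p_i, p_j, d_i, d_j, n_max):
    """Eq. (5): C_ij = c_ij - p_i/d_i - p_j/d_j + 2N, with N the largest vertex prize."""
    return c_ij - p_i / d_i - p_j / d_j + 2 * n_max


def edge_flux(d_ij, cost_ij, pr_i, pr_j):
    """Summand of Eq. (6): Q_ij = (D_ij / C_ij) * (Pr_i - Pr_j)."""
    return d_ij / cost_ij * (pr_i - pr_j)


def update_conductivity(d_ij, q_ij, alpha, mu):
    """Eq. (7): D_ij(k+1) = D_ij(k) + alpha * |Q_ij(k)| - mu * D_ij(k)."""
    return d_ij + alpha * abs(q_ij) - mu * d_ij


def cut_edges(conductivities, threshold):
    """Drop every edge whose conductivity is smaller than the threshold."""
    return {edge: d for edge, d in conductivities.items() if d >= threshold}


# Toy values (hypothetical, chosen only so the arithmetic is easy to follow).
cost = net_cost(c_ij=4.0, p_i=2.0, p_j=3.0, d_i=2, d_j=3, n_max=3.0)
flux = edge_flux(d_ij=1.0, cost_ij=cost, pr_i=2.0, pr_j=0.0)
new_d = update_conductivity(d_ij=1.0, q_ij=flux, alpha=0.5, mu=0.5)
kept = cut_edges({("a", "b"): new_d, ("b", "c"): 0.01}, threshold=0.05)
print(cost, flux, new_d, kept)
```

In this toy case the edge that carries flux keeps a conductivity above the threshold, while the edge with almost no conductivity is cut, which is the mechanism that gradually shrinks the graph towards a subnetwork.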

Sun et al. BMC Systems Biology 2016, 10(Suppl 5):128 Page 31 of 63

there is no active component that is disconnected from the root component.

After the iteration, the vertices and the edges in the root component form a tree. Then, we delete some vertices and edges by strong pruning the tree. The strong pruning idea was proposed by Johnson et al. in 2000 [25]. In the general GW algorithm, it is recommended to find the MST of the strong pruned tree to increase the total net-prize of the identified subnetwork. However, in this paper, the aim of identifying subnetworks is to identify drug candidates, which are vertices in DSNs. Therefore, it is not necessary to find the MST of the strong pruned tree, and we directly use the strong pruned tree as the identified subnetwork for drug repositioning. The MATLAB code of the GW algorithm is publicly available at https://github.com/YahuiSun/GW-to-identify-subnetworks.

Subnetwork evaluation for drug repositioning

As described above, we select each of the 9 cardiovascular subclasses individually as the terminal set, and all the other drugs in the DSN are considered as non-terminal vertices. We then apply two sparse graph generation algorithms to generate two sparse graphs for each cardiovascular subclass, resulting in 18 DSNs. We name each DSN D_i_a or D_i_b, in which i represents the origin of the terminal set (subclass C01, C02, C03, C04, C05, C07, C08, C09, or C10), and a or b represents the first or the second sparse graph generation algorithm that is used to generate that particular DSN.

Both PSIA and the GW algorithm have been applied to each of the 18 DSNs to identify subnetworks. PSIA can identify multiple subnetworks in each DSN, while the GW algorithm can only identify a single subnetwork in each DSN. Each identified subnetwork contains all the terminals and may also contain some non-terminal vertices. In DSNs, the drugs represented by terminals are in a certain cardiovascular subclass, while the drugs represented by non-terminal vertices may or may not be in the other cardiovascular subclasses. The aim of subnetwork identification is to reposition drugs for cardiovascular diseases. Drugs in the cardiovascular class are closely related to cardiovascular diseases. Moreover, the identified subnetwork is supposed to contain drugs that are closely related to each other. Therefore, a subnetwork that is suitable for drug repositioning for cardiovascular diseases should contain a high percentage of drugs that are in the cardiovascular class and a low percentage of drugs that are not in the cardiovascular class. Hence, we propose the Rand Index (RI) [38] as the metric to evaluate the identified subnetworks, and it is defined as

$$RI = \frac{I_c + N_{nc}}{|V| - |T|} \times 100\% \tag{8}$$

where $I_c$ is the number of non-terminal vertices that represent drugs that are in both the identified subnetwork and the cardiovascular class (C; including drugs in all 9 cardiovascular subclasses), $N_{nc}$ is the number of vertices that represent drugs that are neither in the identified subnetwork nor in the cardiovascular class, $|V|$ is the number of vertices in the DSN ($|V| = 548$ in this paper), and $|T|$ is the number of terminals in the DSN. Notably, our computational trials show that identifying true positives ($I_c$) and true negatives ($N_{nc}$) are both important to subnetwork identification for drug repositioning.

We evaluate all the subnetworks identified by PSIA and the GW algorithm. Then, we select the subnetworks with high RI as the suitable subnetworks for drug repositioning. Most drugs in these selected subnetworks have already been classified into the cardiovascular class. However, there may still be drugs in these selected subnetworks that have not been classified into the cardiovascular class yet. We consider the 'not-classified-yet' drugs that occur frequently in these selected subnetworks as candidates for drug repositioning.

Results

There are two groups of DSNs generated in this paper. Each group contains 9 DSNs that are generated using 9 cardiovascular subclasses (C01, C02, C03, C04, C05, C07, C08, C09, C10). The DSNs in the first group (D_01_a to D_10_a) are generated using the first proposed sparse graph generation algorithm (Fig. 1), while the DSNs in the second group (D_01_b to D_10_b) are generated using the second proposed sparse graph generation algorithm (Fig. 2). These DSNs are publicly available at https://github.com/YahuiSun/Drug-Similarity-Network.

Both PSIA and the GW algorithm are used to identify subnetworks in the two groups of DSNs. Since PSIA can identify multiple subnetworks in a single DSN, we run PSIA three times in each DSN to identify three subnetworks. In each DSN, the subnetwork with the highest RI identified by PSIA is selected for comparison with the subnetwork identified by the GW algorithm. The comparison results are shown in Tables 1 and 2, in which ID is the name of the DSN, |V|, |E|, |T| are the numbers of vertices, edges, and terminals in each DSN, T-Origin is the origin of the terminal set in each DSN, and |V'| and |E'| are the numbers of vertices and edges in each identified subnetwork.

The identified subnetwork with the higher RI in each DSN has been highlighted in Tables 1 and 2. It can be seen that every highlighted subnetwork has fewer vertices than the other subnetwork in the same DSN, except in D_08_b, where both subnetworks have 19 vertices. Thus, we observe that

Observation 1: In each DSN, the identified subnetwork which has a higher RI is generally smaller than the other identified subnetwork.
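The Rand Index of Eq. (8) can be computed from a handful of counts, as in the minimal Python sketch below. Two points are assumptions rather than statements from the paper: the function name and argument names are hypothetical, and the total number of cardiovascular drugs in the DSN is taken to be 104 (the number of cardiovascular drugs reported for the nine selected subnetworks, each of which contains all terminals of its subclass). Under that assumption the sketch reproduces the Rand Index values listed for D_01_a.

```python
def rand_index(n_vertices, n_terminals, sub_vertices, i_c, n_class):
    """Rand Index of Eq. (8), in percent.

    n_vertices   -- |V|, vertices in the DSN
    n_terminals  -- |T|, terminals (all inside the subnetwork and the class)
    sub_vertices -- |V'|, vertices in the identified subnetwork
    i_c          -- I_c, non-terminal subnetwork vertices in the cardiovascular class
    n_class      -- total cardiovascular drugs in the DSN (assumed, see lead-in)
    """
    non_terminals = n_vertices - n_terminals
    sub_non_terminals = sub_vertices - n_terminals
    class_non_terminals = n_class - n_terminals
    # Non-terminals left outside the subnetwork ...
    outside = non_terminals - sub_non_terminals
    # ... of which these still belong to the cardiovascular class.
    class_outside = class_non_terminals - i_c
    n_nc = outside - class_outside  # true negatives, N_nc
    return 100.0 * (i_c + n_nc) / non_terminals


# D_01_a: |V| = 548, |T| = 22; PSIA gives |V'| = 60, Ic = 7; GW gives |V'| = 354, Ic = 53.
print(round(rand_index(548, 22, 60, 7, 104), 1))
print(round(rand_index(548, 22, 354, 53, 104), 1))
```

The formulation makes the trade-off visible: every non-cardiovascular drug pulled into the subnetwork lowers $N_{nc}$ by one, so a large subnetwork is penalized unless most of its extra vertices are cardiovascular drugs.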


Table 1 Subnetwork identification results in drug similarity networks D_01_a to D_10_a

DSN                                     Identified subnetwork
ID      |V|  |E|   |T|  T-Origin        Algorithm  |V'|  |E'|  Ic  Rand Index (%)
D_01_a  548  1500  22   C01             PSIA       60    59    7   79.8*
D_01_a  548  1500  22   C01             GW         354   353   53  41.4
D_02_a  548  1500  12   C02             PSIA       37    36    10  81.9*
D_02_a  548  1500  12   C02             GW         339   338   62  45.0
D_03_a  548  1500  13   C03             PSIA       35    34    4   80.4*
D_03_a  548  1500  13   C03             GW         330   329   61  46.5
D_04_a  548  1500  4    C04             PSIA       9     8     1   81.1*
D_04_a  548  1500  4    C04             GW         322   321   66  47.4
D_05_a  548  1500  9    C05             PSIA       25    24    4   80.9*
D_05_a  548  1500  9    C05             GW         281   280   52  51.2
D_07_a  548  1500  15   C07             PSIA       25    24    1   81.8*
D_07_a  548  1500  15   C07             GW         301   300   55  50.3
D_08_a  548  1500  8    C08             PSIA       23    22    2   80.2*
D_08_a  548  1500  8    C08             GW         320   319   63  47.8
D_09_a  548  1500  16   C09             PSIA       29    28    4   82.5*
D_09_a  548  1500  16   C09             GW         322   321   56  47.0
D_10_a  548  1500  8    C10             PSIA       18    17    1   80.7*
D_10_a  548  1500  8    C10             GW         354   353   66  42.6

The highlighted numbers (marked with *) indicate the higher Rand Index and the corresponding Ic in each instance

Table 2 Subnetwork identification results in drug similarity networks D_01_b to D_10_b

DSN                                     Identified subnetwork
ID      |V|  |E|   |T|  T-Origin        Algorithm  |V'|  |E'|  Ic  Rand Index (%)
D_01_b  548  1391  22   C01             PSIA       41    40    2   81.6
D_01_b  548  1391  22   C01             GW         32    31    1   82.9*
D_02_b  548  1391  12   C02             PSIA       22    21    2   81.7*
D_02_b  548  1391  12   C02             GW         25    24    1   80.8
D_03_b  548  1391  13   C03             PSIA       18    17    2   82.8*
D_03_b  548  1391  13   C03             GW         20    19    1   82.1
D_04_b  548  1391  4    C04             PSIA       9     8     2   81.4*
D_04_b  548  1391  4    C04             GW         10    9     1   80.9
D_05_b  548  1391  9    C05             PSIA       12    11    1   82.2*
D_05_b  548  1391  9    C05             GW         17    16    1   81.3
D_07_b  548  1391  15   C07             PSIA       23    22    2   82.6*
D_07_b  548  1391  15   C07             GW         24    23    1   82.0
D_08_b  548  1391  8    C08             PSIA       19    18    1   80.6*
D_08_b  548  1391  8    C08             GW         19    18    0   80.2
D_09_b  548  1391  16   C09             PSIA       22    21    1   82.7*
D_09_b  548  1391  16   C09             GW         26    25    1   82.0
D_10_b  548  1391  8    C10             PSIA       54    53    5   75.6
D_10_b  548  1391  8    C10             GW         14    13    1   81.5*

The highlighted numbers (marked with *) indicate the higher Rand Index and the corresponding Ic in each instance
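The per-DSN Rand Index values in Tables 1 and 2 can be aggregated with a few lines of Python. The sketch below transcribes the Rand Index column of both tables into a dictionary (the variable and function names are illustrative assumptions) and computes the mean Rand Index of each algorithm over the 18 DSNs, together with the number of DSNs in which PSIA has the higher value.

```python
# (PSIA Rand Index, GW Rand Index) in percent, transcribed from Tables 1 and 2.
RAND_INDEX = {
    "D_01_a": (79.8, 41.4), "D_02_a": (81.9, 45.0), "D_03_a": (80.4, 46.5),
    "D_04_a": (81.1, 47.4), "D_05_a": (80.9, 51.2), "D_07_a": (81.8, 50.3),
    "D_08_a": (80.2, 47.8), "D_09_a": (82.5, 47.0), "D_10_a": (80.7, 42.6),
    "D_01_b": (81.6, 82.9), "D_02_b": (81.7, 80.8), "D_03_b": (82.8, 82.1),
    "D_04_b": (81.4, 80.9), "D_05_b": (82.2, 81.3), "D_07_b": (82.6, 82.0),
    "D_08_b": (80.6, 80.2), "D_09_b": (82.7, 82.0), "D_10_b": (75.6, 81.5),
}


def summarize(table):
    """Return (mean PSIA RI, mean GW RI, number of DSNs where PSIA is higher)."""
    psia = [p for p, _ in table.values()]
    gw = [g for _, g in table.values()]
    psia_wins = sum(p > g for p, g in table.values())
    return sum(psia) / len(psia), sum(gw) / len(gw), psia_wins


psia_mean, gw_mean, psia_wins = summarize(RAND_INDEX)
print(f"PSIA mean RI: {psia_mean:.2f}%")
print(f"GW mean RI:   {gw_mean:.2f}%")
print(f"PSIA has the higher RI in {psia_wins} of {len(RAND_INDEX)} DSNs")
```

The means come out at roughly 81.1% for PSIA and 64.1% for the GW algorithm, and PSIA has the higher Rand Index in 16 of the 18 DSNs (all nine in the first group and seven in the second).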


It is preferable to choose small subnetworks rather than large subnetworks for drug repositioning, as analysis can be done more efficiently in small subnetworks. Most drugs included in our generated DSNs are not in the cardiovascular class. Hence, it is important for subnetworks to identify true negatives ($N_{nc}$ in Eq. (8)) and thus avoid false positives (drugs in the subnetwork that are not in the cardiovascular class). For example, the subnetworks identified by the GW algorithm in D_01_a to D_10_a contain many false positives, and thus are large and not suitable for drug repositioning.

In D_01_a to D_10_a, all the highlighted subnetworks are identified by PSIA. In D_01_b to D_10_b, 7 out of 9 highlighted subnetworks are identified by PSIA. Over the 18 DSNs, the average RI of the subnetworks identified by PSIA is 81.1%, while the average RI of the subnetworks identified by the GW algorithm is 64.1%. Therefore, the conclusion below can be made.

Conclusion 1: In our generated DSNs, PSIA generally outperforms the GW algorithm in identifying subnetworks for drug repositioning.

D_01_a to D_10_a are generated using the first proposed sparse graph generation algorithm (Fig. 1), while D_01_b to D_10_b are generated using the second proposed sparse graph generation algorithm (Fig. 2). 8 out of 9 highlighted subnetworks in D_01_b to D_10_b (all except D_02_b) have a higher RI than the corresponding highlighted subnetworks in D_01_a to D_10_a (two DSNs correspond to each other when they use the same cardiovascular subclass as the terminal set; see Tables 1 and 2). Hence, the conclusion below can be made.

Conclusion 2: The second proposed sparse graph generation algorithm is more suitable than the first proposed sparse graph generation algorithm for DSN generation.

We select the nine highlighted subnetworks in D_01_b to D_10_b (which are generated using the second proposed sparse graph generation algorithm) for drug repositioning. These subnetworks are visualized in Fig. 6, in which S01-S09 are IDs of the highlighted subnetworks in D_01_b to D_10_b, the numbers in the visualized subnetworks represent the drug index (see drug names in Additional file 1), the green vertices represent drugs that are in the cardiovascular class, and the white vertices represent drugs that are not in the cardiovascular class. Drug candidates are selected from the frequently occurring drugs that are not in the cardiovascular class. These drug candidates are closely related to the cardiovascular system, and they could be repositioned for cardiovascular diseases.

Discussion

Because developing new drugs takes a long time, is costly, and carries high risks, drug repositioning is important since it finds new therapeutic effects for known drugs. In this paper, we propose subnetwork identification as a new method to reposition drugs. Because cardiovascular disease contributes significantly to the burden of illness and injury in the Australian community [39], and the Prize-Collecting Steiner Tree (PCST) approach is a good way to identify subnetworks, we focus on using the PCST approach to reposition drugs for cardiovascular diseases.

To identify subnetworks for drug repositioning, we generate Drug Similarity Networks (DSNs) comprising five components: vertices, vertex prizes, edges, edge costs, and terminals. A PCSTP algorithm tends to identify a subnetwork constructed from vertices with big prizes and edges with small costs. In our DSNs, the vertex prizes represent similarities between drugs, and the edge costs represent dissimilarities between drugs. Moreover, terminals represent drugs in the cardiovascular class. Therefore, a subnetwork of drugs that are closely related to the cardiovascular system is expected to be identified using the PCST approach.

18 DSNs are generated using 9 cardiovascular subclasses and 2 sparse graph generation algorithms. After generating the DSNs, PCSTP algorithms are used to identify subnetworks. The GW algorithm is one of the most popular PCSTP algorithms. However, the GW algorithm is designed for the single-terminal (root) case, while there are multiple terminals in DSNs. Therefore, we first adapt the GW algorithm for the multiple-terminal case and then use it to identify subnetworks in DSNs. Nevertheless, the GW algorithm can only identify a single subnetwork in each DSN, and this subnetwork may not be suitable for drug repositioning. Hence, we also propose a new PCSTP algorithm, the Physarum-inspired Subnetwork Identification Algorithm (PSIA), to identify subnetworks in DSNs; PSIA can identify multiple subnetworks in each DSN.

We employ both PSIA and the GW algorithm in the 18 DSNs. In each DSN, one subnetwork is identified by the GW algorithm, and three subnetworks are identified by PSIA. Since the Rand Index gives equal weight to the identification of true positives and true negatives, it can be used to select suitable subnetworks for drug repositioning. Thus, we evaluate these subnetworks using their Rand Index. Furthermore, the subnetwork identified by the GW algorithm and the best subnetwork identified by PSIA are compared with each other in each DSN.

Based on the comparison results shown in Tables 1 and 2, we first observe that smaller subnetworks generally have a higher Rand Index than larger subnetworks in the same DSN. Then, we conclude that PSIA outperforms the GW algorithm in DSNs. Moreover, we conclude that the second proposed sparse graph generation algorithm is more suitable than the first proposed sparse graph generation algorithm for DSN generation.


Fig. 6 Visualization of the highlighted subnetworks in D_01_b to D_10_b. S01-S09 are IDs of the highlighted subnetworks in D_01_b to D_10_b. The numbers in the visualized subnetworks represent the indexes of drugs. The green vertices represent drugs that are in the cardiovascular class. The white vertices represent drugs that are not in the cardiovascular class

Drug repositioning for cardiovascular diseases

After the evaluation of all the identified subnetworks, we select the nine most suitable subnetworks to reposition drugs for cardiovascular diseases. These nine subnetworks are visualized in Fig. 6. The drugs contained in these subnetworks are supposed to be closely related to the cardiovascular system. There are 134 drugs contained in these subnetworks, of which 104 are already in the cardiovascular class, while 30 are not in the cardiovascular class yet. Therefore, we consider these 30 drugs as newly identified drugs for drug repositioning. These 30 drugs are listed in Table 3, in which Index is the drug index, Freq is the number of selected subnetworks in which each drug has been identified, and S01-S09 are IDs of the nine selected subnetworks.

It can be seen from Table 3 that ten newly identified drugs occur more than once in the selected subnetworks, while the other 20 drugs occur only once. We consider the ten drugs which occur more than once as candidates for drug repositioning. These ten drug candidates are nitroglycerin, theophylline, arsenic trioxide, isocarboxazid, lincomycin, acarbose, adapalene, haloperidol, malathion, and neomycin.

We believe that these ten drug candidates could be repositioned for cardiovascular diseases. Thus, we evaluate each drug candidate using published pharmacological discoveries. The existing discoveries on three candidates (nitroglycerin, theophylline and acarbose) are introduced below.

For nitroglycerin, Koch-Weser et al. [40] found that nitroglycerin can produce a sharp fall in the cardiac filling pressures and the pulmonary arterial pressures. Moreover, the vasodilatory effects of nitroglycerin also have the potential to be used in cardiovascular therapeutics [41]. For theophylline, Sollevi et al. [42] found that theophylline antagonizes cardiovascular responses to dipyridamole. For acarbose, Chiasson et al. [43] found that treating patients with impaired glucose tolerance with acarbose is associated with a significant reduction in the risk of cardiovascular disease and hypertension.


Table 3 Newly identified drugs in the selected subnetworks (the per-subnetwork membership marks in columns S01-S09 are not reproduced; Freq gives the number of marks in each row)

Index  Drug Name          Freq
368    nitroglycerin      7
496    theophylline       5
32     arsenic trioxide   3
261    isocarboxazid      3
287    lincomycin         3
2      acarbose           2
7      adapalene          2
239    haloperidol        2
298    malathion          2
359    neomycin           2
10     alclometasone      1
14     amcinonide         1
39     azathioprine       1
70     caffeine           1
74                        1
93     ceftazidime        1
135                       1
165    droperidol         1
217    formoterol         1
241    hexachlorophene    1
367    nitrofurantoin     1
417    pramipexole        1
422    prednisone         1
429    procyclidine       1
449    repaglinide        1
466    selegiline         1
497    thiabendazole      1
513                       1
518    tranexamic acid    1
526    triiodothyronine   1
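The Freq column of Table 3 is a simple occurrence count over the selected subnetworks. A minimal Python sketch of that selection step is given below; the function name, the `min_freq` parameter, and the toy drug indexes are hypothetical and do not correspond to the actual subnetworks S01-S09.

```python
from collections import Counter


def candidate_drugs(subnetworks, cardiovascular, min_freq=2):
    """Count how often each non-cardiovascular drug occurs across subnetworks.

    subnetworks    -- iterable of collections of drug indexes (one per subnetwork)
    cardiovascular -- set of drug indexes already in the cardiovascular class
    min_freq       -- minimum number of subnetworks a candidate must occur in
    """
    counts = Counter(
        drug
        for subnetwork in subnetworks
        for drug in set(subnetwork)  # count each drug once per subnetwork
        if drug not in cardiovascular
    )
    return {drug: n for drug, n in counts.items() if n >= min_freq}


# Toy data: three subnetworks; drugs 1 and 5 are already cardiovascular drugs.
toy_subnetworks = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
print(candidate_drugs(toy_subnetworks, cardiovascular={1, 5}))
```

In the toy data, drugs 2 and 3 occur in more than one subnetwork and are returned as candidates, while drug 4 occurs only once and is dropped, mirroring the "more than once" rule used to pick the ten candidates.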

It can be seen from these discoveries that nitroglycerin, theophylline and acarbose have already been suspected of having therapeutic effects for cardiovascular diseases. Therefore, our results provide evidence to support these previous discoveries. As for the other seven drug candidates, we believe that they may also interact with the cardiovascular system. This evidence shows the effectiveness and efficiency of our proposed PCST approach for drug repositioning.

Different types of drug similarities

In our generated DSNs, the edge cost represents the quantified dissimilarity between drugs, and the vertex prize represents the similarity between the drug represented by this vertex and all the drugs represented by terminals. There are different types of drug similarities with physical meanings, such as chemical similarity, therapeutic similarity, phenotype similarity, and similarity based on interacting targets (such as proteins) [44].

In our generated DSNs, drug similarities are calculated using four types of drug features, which are the chemical, therapeutic, protein, and phenotype features. In this section, we generate new DSNs based on new drug similarities, and show that the initial drug similarities calculated using four types of drug features are the best drug similarities for drug repositioning.

We generate four new types of DSNs, and in each of them the drug similarities are calculated using a single type of drug features. The drug features used are the chemical, therapeutic, protein, and phenotype features. We compare the standard deviations of vertex prizes and edge costs in the initial type of DSNs and the four new types of DSNs.

The comparison results are shown in Table 4, in which SD_VP is the average standard deviation of vertex prizes, SD_EC is the standard deviation of edge costs in the corresponding complete graphs, DSN_C is the type of DSNs where drug similarities are calculated using the chemical features, DSN_T is the type of DSNs where drug similarities are calculated using the therapeutic features, DSN_Pr is the type of DSNs where drug similarities are calculated using the protein features, DSN_Ph is the type of DSNs where drug similarities are calculated using the phenotype features, and DSN_01_a/b to DSN_10_a/b are the initial type of DSNs used for drug repositioning, where drug similarities are calculated using all four types of drug features.

It can be seen from Table 4 that SD_VP and SD_EC of DSN_C are higher than those of the other types of DSNs. It is recommended to select DSNs with high standard deviations for drug repositioning, as it is hard to identify drug repositioning candidates in DSNs with low standard deviations. However, many drugs undergo complex and largely uncharacterized metabolic transformations, and the physiological effects of drugs may not be predictable from their chemical properties alone [45]. Therefore, it is not appropriate to consider only chemical similarities for drug repositioning. Similarly, it is not appropriate to consider only any other homogeneous drug similarity either [11]. The initial drug similarities are heterogeneous, as they are calculated using multiple types of drug features. It can also be seen from Table 4 that SD_VP and SD_EC of DSN_01_a/b to DSN_10_a/b are relatively high. Therefore, the initial heterogeneous drug similarities calculated using four types of drug features are the best drug similarities for drug repositioning.

The running time in large drug similarity networks

We use the PCST approach to identify subnetworks for drug repositioning. The Prize-Collecting Steiner Tree Problem is NP-hard [46], which means that the time required to solve it may increase exponentially as the graph size increases. Large DSNs with thousands of vertices can be generated using the existing pharmacology data. Thus, it is necessary to ensure that we can use the PCST approach to identify subnetworks in large DSNs.

In this section, random DSNs of different sizes are generated. We employ both PSIA and the GW algorithm in these DSNs using MATLAB R2014a on a computer with 16 GB RAM and an Intel(R) Core(TM) i7-4770 CPU. The running time of PSIA and the GW algorithm in these DSNs is shown in Table 5, in which DSN_X means a DSN with X vertices. The unit of the running time is minutes.

It can be seen from Table 5 that both PSIA and the GW algorithm can identify subnetworks in large DSNs with up to 3000 vertices in a reasonable time. Moreover, the running time above can be further shortened by using a low-level programming language. Thus, we can use the PCST approach to identify subnetworks in large DSNs. Notably, even though the running time of PSIA is longer than that of the GW algorithm, PSIA is considered better as it can identify more suitable subnetworks for drug repositioning.

Conclusions

Drug repositioning is important for drug development. In this paper, the subnetwork identification method is used to reposition drugs for the first time. A new Prize-Collecting Steiner Tree algorithm is proposed in this paper to identify subnetworks. The popular GW algorithm is also used for comparison with our proposed algorithm. Drug Similarity Networks are generated, in which vertex prizes and edge costs represent the similarities and dissimilarities between drugs respectively, and terminals represent drugs in the cardiovascular class, as defined in the Anatomical Therapeutic Chemical classification system. In the generated Drug Similarity Networks, our proposed algorithm identifies subnetworks with a higher Rand Index than the popular algorithm. Furthermore, the nine most suitable subnetworks are selected for drug repositioning, and ten drug candidates are identified from these subnetworks. We find evidence to support previous discoveries that nitroglycerin, theophylline and acarbose may be able to be repositioned for cardiovascular diseases. Moreover, we identify seven previously unknown drug candidates that may also interact with the cardiovascular system. Therefore, our proposed Prize-Collecting Steiner Tree approach is shown to be a promising strategy for drug repositioning.

Table 4 Standard deviations of vertex prizes and edge costs

        DSN_C   DSN_T   DSN_Pr   DSN_Ph   DSN_01_a/b to DSN_10_a/b
SD_VP   3.68    3.34    2.32     2.24     2.42
SD_EC   15.63   7.72    7.43     8.90     10.14

Table 5 The running time of PSIA and GW algorithm in DSNs with different sizes

        DSN_100     DSN_548     DSN_1000    DSN_3000
GW      0.001 min   0.039 min   0.196 min   7.345 min
PSIA    0.036 min   0.638 min   2.102 min   19.169 min
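To get a rough feel for how the running times in Table 5 grow with network size, one can fit a power law t ∝ n^k between two measurements. The Python sketch below does this for the DSN_1000 and DSN_3000 columns. This is only back-of-the-envelope arithmetic on two data points from a single machine, added here as an illustration; the function name is hypothetical, and the result says nothing about worst-case complexity.

```python
import math


def scaling_exponent(n1, t1, n2, t2):
    """Exponent k of a power law t ~ n**k fitted through two (size, time) points."""
    return math.log(t2 / t1) / math.log(n2 / n1)


# Running times in minutes, transcribed from Table 5.
gw_k = scaling_exponent(1000, 0.196, 3000, 7.345)
psia_k = scaling_exponent(1000, 2.102, 3000, 19.169)
print(f"GW:   t ~ n^{gw_k:.2f}")
print(f"PSIA: t ~ n^{psia_k:.2f}")
```

Between 1000 and 3000 vertices the fitted exponent is a little above 3 for the GW algorithm and close to 2 for PSIA, so on these instances the gap between the two running times narrows as the network grows, even though PSIA remains slower in absolute terms.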


Additional file

Additional file 1: This file contains a long table of the drug indexes and names. (PDF 42 kb)

Declarations
This article has been published as part of BMC Systems Biology Volume 10 Supplement 5, 2016. 15th International Conference On Bioinformatics (INCOB 2016): systems biology. The full contents of the supplement are available online http://bmcsystbiol.biomedcentral.com/articles/supplements/volume-10-supplement-5.

Funding
Both YS and PNH are supported by the PhD scholarships of The University of Melbourne. PNH is also partially supported by NICTA scholarship of National ICT Australia, now Data61 since merging CSIRO's Digital Productivity team. This work is partially funded by Australian Research Council grant DP1096296.

Availability of data and materials
The data supporting the results of this article are included and cited within the article.

Authors' contributions
YS proposed the idea of applying subnetwork identification to pharmacology networks. PNH proposed the idea of repositioning drugs for cardiovascular diseases using subnetwork identification. PNH collected ATC classification (2016) data. Both YS and PNH designed and generated the Drug Similarity Networks. YS proposed the PSIA algorithm for drug repositioning. YS applied both PSIA and GW algorithm to identify subnetworks. PNH evaluated the identified subnetworks. Both YS and PNH analyzed the identified subnetworks and the drug candidates. YS and PNH drafted the manuscript. Both SH and KV provided assistance on this work. All authors edited and approved the final draft.

Competing interests
The authors declare that they have no competing interests.

Consent for publication
Not applicable.

Ethics approval and consent to participate
Not applicable.

Author details
1 Department of Mechanical Engineering, University of Melbourne, Parkville, 3010 Melbourne, Australia. 2 Data61, Victoria Research Lab, 3003 West Melbourne, Australia. 3 Department of Computer Science, University of Ruhuna, 81000 Matara, Sri Lanka. 4 Department of Computing and Information Systems, University of Melbourne, Parkville, 3010 Melbourne, Australia. 5 Research School of Engineering, College of Engineering & Computer Science, The Australian National University, 2601 Canberra, ACT, Australia.

Published: 5 December 2016

References
1. Huang H, Nguyen T, Ibrahim S, Shantharam S, Yue Z, Chen JY. Dmap: a connectivity map database to enable identification of novel drug repositioning candidates. BMC Bioinforma. 2015;16(Suppl 13):4.
2. Li J, Lu Z. Pathway-based drug repositioning using causal inference. BMC Bioinforma. 2013;14(16):1.
3. Wu C, Gudivada RC, Aronow BJ, Jegga AG. Computational drug repositioning through heterogeneous network clustering. BMC Syst Biol. 2013;7(Suppl 5):6.
4. Sawada R, Iwata H, Mizutani S, Yamanishi Y. Target-based drug repositioning using large-scale chemical–protein interactome data. J Chem Inf Model. 2015;55(12):2717–730.
5. Law V, Knox C, Djoumbou Y, Jewison T, Guo AC, Liu Y, Maciejewski A, Arndt D, Wilson M, Neveu V, et al. Drugbank 4.0: shedding new light on drug metabolism. Nucleic Acids Res. 2014;42(D1):1091–1097.
6. Napolitano F, Zhao Y, Moreira VM, Tagliaferri R, Kere J, D'Amato M, Greco D. Drug repositioning: a machine-learning approach through data integration. J Cheminformatics. 2013;5:30.
7. Xu R, Wang Q. Large-scale extraction of accurate drug-disease treatment pairs from biomedical literature for drug repurposing. BMC Bioinforma. 2013;14(1):181.
8. Tari LB, Patel JH. Systematic drug repurposing through text mining. Biomed Lit Min. 2014;1159:253–67.
9. Zhang Y, Tao C, Jiang G, Nair AA, Su J, Chute CG, Liu H. Network-based analysis reveals distinct association patterns in a semantic medline-based drug-disease-gene network. J Biomed Semant. 2014;5:33.
10. Wu Z, Wang Y, Chen L. Network-based drug repositioning. Mol BioSyst. 2013;9(6):1268–1281.
11. Berger SI, Iyengar R. Network analyses in systems pharmacology. Bioinformatics. 2009;25(19):2466–472.
12. Suthram S, Dudley JT, Chiang AP, Chen R, Hastie TJ, Butte AJ. Network-based elucidation of human disease similarities reveals common functional modules enriched for pluripotent drug targets. PLoS Comput Biol. 2010;6(2):1000662.
13. Sadeghi A, Fröhlich H. Steiner tree methods for optimal sub-network identification: an empirical study. BMC Bioinforma. 2013;14(1):144.
14. Tuncbag N, Braunstein A, Pagnani A, Huang S-SC, Chayes J, Borgs C, Zecchina R, Fraenkel E. Simultaneous reconstruction of multiple signaling pathways via the prize-collecting steiner forest problem. J Comput Biol. 2013;20(2):124–36.
15. Yosef N, Ungar L, Zalckvar E, Kimchi A, Kupiec M, Ruppin E, Sharan R. Toward accurate reconstruction of functional protein networks. Mol Syst Biol. 2009;5(1):248.
16. Shih YK, Parthasarathy S. A single source k-shortest paths algorithm to infer regulatory pathways in a gene network. Bioinformatics. 2012;28(12):49–58.
17. Scott MS, Perkins T, Bunnell S, Pepin F, Thomas DY, Hallett M. Identifying regulatory subnetworks for a set of genes. Mol Cell Proteomics. 2005;4(5):683–92.
18. Bailly-Bechet M, Braunstein A, Pagnani A, Weigt M, Zecchina R. Inference of sparse combinatorial-control networks from gene-expression data: a message passing approach. BMC Bioinforma. 2010;11(355):1–12.
19. Faust K, Dupont P, Callut J, Van Helden J. Pathway discovery in metabolic networks by subgraph extraction. Bioinformatics. 2010;26(9):1211–1218.
20. Ljubic I, Weiskircher R, Pferschy U, Klau GW, Mutzel P, Fischetti M. Solving the prize-collecting steiner tree problem to optimality. In: Proc. of the Seventh Workshop on Algorithm Engineering and Experiments (ALENEX 05). Vancouver: SIAM; 2005. p. 68–76.
21. Gutner S. Elementary approximation algorithms for prize collecting steiner tree problems. Comb Optim Appl. 2008;5165:246–54.
22. Goemans MX, Williamson DP. A general approximation technique for constrained forest problems. SIAM J Comput. 1995;24(2):296–317.
23. Archer A, Bateni M, Hajiaghayi M, Karloff H. Improved approximation algorithms for prize-collecting steiner tree and tsp. SIAM J Comput. 2011;40(2):309–32.
24. Cole R, Hariharan R, Lewenstein M, Porat E. A faster implementation of the goemans-williamson clustering algorithm. In: Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms. Washington: SIAM; 2001. p. 17–25.
25. Johnson DS, Minkoff M, Phillips S. The prize collecting steiner tree problem: theory and practice. In: Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms. San Francisco: SIAM; 2000. p. 760–9.
26. Sun Y, Halgamuge S. Fast algorithms inspired by physarum polycephalum for node weighted steiner tree problem with multiple terminals. In: IEEE Congress on Evolutionary Computation (CEC); 2016. p. 3254–260.
27. Thorn CF, Klein TE, Altman RB. Pharmgkb: the pharmacogenomics knowledge base. Pharmacogenomics: Methods and Protocols. 2013;1015:311–20.
28. Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M. Drugbank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008;36(suppl 1):901–6.
29. Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol. 2010;6(343):1–6.
30. Zhang P, Wang F, Hu J, Sorrentino R. Exploring the relationship between drug side-effects and therapeutic indications. In: AMIA Annual Symposium Proceedings. Washington DC: AMIA; 2013. p. 1568.


31. Gottlieb A, Stein GY, Ruppin E, Sharan R. Predict: a method for inferring novel drug indications with application to personalized medicine. Mol Syst Biol. 2011;7(1):496.
32. World Health Organization. Anatomical Therapeutic Chemical (ATC) Classification System. 2016. http://www.whocc.no. Accessed 15 Mar 2016.
33. Prim RC. Shortest connection networks and some generalizations. Bell Syst Tech J. 1957;36(6):1389–1401.
34. Tarjan R. Depth-first search and linear graph algorithms. SIAM J Comput. 1972;1(2):146–60.
35. Nakagaki T, Yamada H, Tóth Á. Intelligence: Maze-solving by an amoeboid organism. Nature. 2000;407(407):470–0.
36. Saigusa T, Tero A, Nakagaki T, Kuramoto Y. Amoebae anticipate periodic events. Phys Rev Lett. 2008;100(1):018101.
37. Nakagaki T, Yamada H, Hara M. Smart network solutions in an amoeboid organism. Biophys Chem. 2004;107(1):1–5.
38. Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66(336):846–50.
39. Peiris D, Patel AA, Cass A, Howard MP, Tchan ML, Brady JP, De Vries J, Rickards B, Yarnold D, Hayman N, et al. Cardiovascular disease risk management for aboriginal and torres strait islander peoples in primary health care settings: findings from the kanyini audit. Med J Aust. 2009;191:304–9.
40. Koch-Weser J, Cohn JN, Franciosa JA. Vasodilator therapy of cardiac failure. N Engl J Med. 1977;297(1):27–31.
41. Steinhorn BS, Loscalzo J, Michel T. Nitroglycerin and nitric oxide–a rondo of themes in cardiovascular therapeutics. N Engl J Med. 2015;373(3):277–80.
42. Sollevi A, Östergren J, Fagrell B, Hjemdahl P. Theophylline antagonizes cardiovascular responses to dipyridamole in man without affecting increases in plasma adenosine. Acta Physiol Scand. 1984;121(2):165–71.
43. Chiasson JL, Josse RG, Gomis R, et al. Acarbose treatment and the risk of cardiovascular disease and hypertension in patients with impaired glucose tolerance: the stop-niddm trial. Jama. 2003;290(4):486–94.
44. Gottlieb A, Stein GY, Oron Y, Ruppin E, Sharan R. Indi: a computational framework for inferring drug interactions and their associated recommendations. Mol Syst Biol. 2012;8(1):592.
45. Dudley JT, Deshpande T, Butte AJ. Exploiting drug–disease relationships for computational drug repositioning. Brief Bioinform. 2011;12:303–11.
46. da Cunha AS, Lucena A, Maculan N, Resende MG. A relax-and-cut algorithm for the prize-collecting steiner problem in graphs. Discret Appl Math. 2009;157(6):1198–1217.

Appendix B The list of drugs used in Chapters 3 and 5


Table 1: Drug indexes and names

| Index | Drug Name | Index | Drug Name | Index | Drug Name |
|---|---|---|---|---|---|
| 1 | abacavir | 2 | acarbose | 3 | |
| 4 | | 5 | acitretin | 6 | acyclovir |
| 7 | adapalene | 8 | adenosine | 9 | albendazole |
| 10 | alclometasone | 11 | alendronate | 12 | allopurinol |
| 13 | amantadine | 14 | amcinonide | 15 | amifostine |
| 16 | amikacin | 17 | amiloride | 18 | |
| 19 | aminophylline | 20 | | 21 | amitriptyline |
| 22 | amlodipine | 23 | amoxapine | 24 | amoxicillin |
| 25 | ampicillin | 26 | amprenavir | 27 | anagrelide |
| 28 | anastrozole | 29 | apraclonidine | 30 | argatroban |
| 31 | aripiprazole | 32 | arsenic trioxide | 33 | aspirin |
| 34 | atenolol | 35 | atomoxetine | 36 | atorvastatin |
| 37 | | 38 | atropine | 39 | azathioprine |
| 40 | azelaic acid | 41 | azelastine | 42 | azithromycin |
| 43 | aztreonam | 44 | bacitracin | 45 | |
| 46 | balsalazide | 47 | beclomethasone | 48 | benazepril |
| 49 | benztropine | 50 | betaxolol | 51 | bethanechol |
| 52 | | 53 | bicalutamide | 54 | bimatoprost |
| 55 | biperiden | 56 | bisoprolol | 57 | bleomycin |
| 58 | bortezomib | 59 | bosentan | 60 | |
| 61 | brimonidine | 62 | brinzolamide | 63 | bromocriptine |
| 64 | budesonide | 65 | | 66 | bupivacaine |
| 67 | bupropion | 68 | buspirone | 69 | cabergoline |
| 70 | caffeine | 71 | candesartan | 72 | capecitabine |
| 73 | captopril | 74 | carbachol | 75 | carbamazepine |
| 76 | carbinoxamine | 77 | carnitine | 78 | carteolol |
| 79 | carvedilol | 80 | cefaclor | 81 | cefadroxil |
| 82 | cefazolin | 83 | cefdinir | 84 | cefditoren |
| 85 | cefepime | 86 | cefixime | 87 | cefoperazone |
| 88 | cefotaxime | 89 | cefotetan | 90 | cefoxitin |
| 91 | cefpodoxime | 92 | cefprozil | 93 | ceftazidime |
| 94 | ceftizoxime | 95 | ceftriaxone | 96 | cefuroxime |
| 97 | celecoxib | 98 | cephalexin | 99 | cephradine |
| 100 | cetirizine | 101 | chloramphenicol | 102 | |
| 103 | | 104 | chlorpheniramine | 105 | chlorpromazine |
| 106 | | 107 | chlorthalidone | 108 | |
| 109 | ciclesonide | 110 | cidofovir | 111 | cimetidine |
| 112 | cinoxacin | 113 | ciprofloxacin | 114 | citalopram |
| 115 | clemastine | 116 | | 117 | clobetasol |
| 118 | clofazimine | 119 | clomipramine | 120 | clonidine |
| 121 | clotrimazole | 122 | clozapine | 123 | cromolyn |
| 124 | cyclobenzaprine | 125 | cycloserine | 126 | cyproheptadine |
| 127 | cysteamine | 128 | dacarbazine | 129 | danazol |
| 130 | dantrolene | 131 | dapsone | 132 | daunorubicin |
| 133 | delavirdine | 134 | | 135 | desflurane |
| 136 | desipramine | 137 | desloratadine | 138 | desoximetasone |
| 139 | dexamethasone | 140 | dexmedetomidine | 141 | dexrazoxane |
| 142 | diazoxide | 143 | diclofenac | 144 | dicloxacillin |
| 145 | dicyclomine | 146 | didanosine | 147 | diflunisal |
| 148 | digoxin | 149 | dihydroergotamine | 150 | |
| 151 | diphenhydramine | 152 | dipivefrin | 153 | |
| 154 | disulfiram | 155 | dobutamine | 156 | docetaxel |
| 157 | dofetilide | 158 | dolasetron | 159 | donepezil |
| 160 | dopamine | 161 | | 162 | doxepin |
| 163 | doxorubicin | 164 | | 165 | droperidol |
| 166 | dutasteride | 167 | echothiophate | 168 | econazole |
| 169 | efavirenz | 170 | emedastine | 171 | enalapril |
| 172 | enflurane | 173 | entacapone | 174 | epinephrine |
| 175 | | 176 | eprosartan | 177 | ergotamine |
| 178 | ertapenem | 179 | erythromycin | 180 | |
| 181 | estradiol | 182 | estramustine | 183 | ethacrynic acid |
| 184 | ethambutol | 185 | ethionamide | 186 | ethosuximide |
| 187 | ethotoin | 188 | etodolac | 189 | etomidate |
| 190 | etoposide | 191 | exemestane | 192 | ezetimibe |
| 193 | famciclovir | 194 | famotidine | 195 | |
| 196 | fenofibrate | 197 | | 198 | fenoprofen |
| 199 | fexofenadine | 200 | finasteride | 201 | flavoxate |
| 202 | flecainide | 203 | fluconazole | 204 | fludarabine |
| 205 | fludrocortisone | 206 | flumazenil | 207 | flunisolide |
| 208 | fluocinonide | 209 | fluorometholone | 210 | fluoxetine |
| 211 | fluphenazine | 212 | flurbiprofen | 213 | flutamide |
| 214 | fluticasone | 215 | fluvastatin | 216 | fluvoxamine |
| 217 | formoterol | 218 | foscarnet | 219 | fosfomycin |
| 220 | fosinopril | 221 | fosphenytoin | 222 | fulvestrant |
| 223 | | 224 | gabapentin | 225 | galantamine |
| 226 | ganciclovir | 227 | gatifloxacin | 228 | gefitinib |
| 229 | gemcitabine | 230 | gemfibrozil | 231 | gemifloxacin |
| 232 | | 233 | | 234 | glycopyrrolate |
| 235 | griseofulvin | 236 | | 237 | guanfacine |
| 238 | | 239 | haloperidol | 240 | halothane |
| 241 | hexachlorophene | 242 | hydralazine | 243 | hydrochlorothiazide |
| 244 | hydroflumethiazide | 245 | hydroxocobalamin | 246 | |
| 247 | hydroxyurea | 248 | hydroxyzine | 249 | ibuprofen |
| 250 | | 251 | idarubicin | 252 | iloprost |
| 253 | imatinib | 254 | imipramine | 255 | imiquimod |
| 256 | | 257 | indinavir | 258 | indomethacin |
| 259 | irbesartan | 260 | irinotecan | 261 | isocarboxazid |
| 262 | isoflurane | 263 | isoniazid | 264 | isoproterenol |
| 265 | isosorbide dinitrate | 266 | | 267 | itraconazole |
| 268 | | 269 | kanamycin | 270 | ketamine |
| 271 | ketoconazole | 272 | ketoprofen | 273 | ketorolac |
| 274 | labetalol | 275 | lactulose | 276 | lamivudine |
| 277 | lamotrigine | 278 | lansoprazole | 279 | latanoprost |
| 280 | leflunomide | 281 | letrozole | 282 | leucovorin |
| 283 | levetiracetam | 284 | levobunolol | 285 | levocabastine |
| 286 | lidocaine | 287 | lincomycin | 288 | lindane |
| 289 | lisinopril | 290 | lithium | 291 | lomefloxacin |
| 292 | loperamide | 293 | loratadine | 294 | lorazepam |
| 295 | losartan | 296 | lovastatin | 297 | loxapine |
| 298 | malathion | 299 | maprotiline | 300 | mebendazole |
| 301 | | 302 | medroxyprogesterone | 303 | |
| 304 | mefloquine | 305 | megestrol | 306 | meloxicam |
| 307 | meperidine | 308 | mephenytoin | 309 | mepivacaine |
| 310 | mercaptopurine | 311 | meropenem | 312 | mesoridazine |
| 313 | metaproterenol | 314 | metformin | 315 | methadone |
| 316 | methazolamide | 317 | methimazole | 318 | |
| 319 | | 320 | | 321 | methylphenidate |
| 322 | methylprednisolone | 323 | metipranolol | 324 | metoclopramide |
| 325 | | 326 | | 327 | |
| 328 | | 329 | miconazole | 330 | midodrine |
| 331 | mifepristone | 332 | miglitol | 333 | milrinone |
| 334 | minocycline | 335 | | 336 | mirtazapine |
| 337 | mitotane | 338 | mitoxantrone | 339 | modafinil |
| 340 | moexipril | 341 | molindone | 342 | mometasone |
| 343 | montelukast | 344 | moricizine | 345 | morphine |
| 346 | moxifloxacin | 347 | mupirocin | 348 | mycophenolic acid |
| 349 | nabilone | 350 | nabumetone | 351 | nadolol |
| 352 | naftifine | 353 | nalbuphine | 354 | naloxone |
| 355 | naltrexone | 356 | naproxen | 357 | |
| 358 | nelfinavir | 359 | neomycin | 360 | nevirapine |
| 361 | nicardipine | 362 | nicotine | 363 | |
| 364 | nilutamide | 365 | nimodipine | 366 | nitric oxide |
| 367 | nitrofurantoin | 368 | nitroglycerin | 369 | nitroprusside |
| 370 | nizatidine | 371 | norepinephrine | 372 | norfloxacin |
| 373 | nortriptyline | 374 | ofloxacin | 375 | olanzapine |
| 376 | | 377 | omeprazole | 378 | ondansetron |
| 379 | orlistat | 380 | orphenadrine | 381 | oxacillin |
| 382 | oxandrolone | 383 | oxaprozin | 384 | oxazepam |
| 385 | oxcarbazepine | 386 | oxiconazole | 387 | oxybutynin |
| 388 | oxytetracycline | 389 | paclitaxel | 390 | pamidronate |
| 391 | pantoprazole | 392 | paricalcitol | 393 | paromomycin |
| 394 | paroxetine | 395 | | 396 | penciclovir |
| 397 | penicillin G | 398 | penicillin V | 399 | pentazocine |
| 400 | pentobarbital | 401 | pentoxifylline | 402 | perindopril |
| 403 | perphenazine | 404 | phenelzine | 405 | phenoxybenzamine |
| 406 | | 407 | phenytoin | 408 | pilocarpine |
| 409 | pimecrolimus | 410 | pimozide | 411 | |
| 412 | pioglitazone | 413 | piperacillin | 414 | pirbuterol |
| 415 | piroxicam | 416 | | 417 | pramipexole |
| 418 | pravastatin | 419 | prazosin | 420 | prednicarbate |
| 421 | prednisolone | 422 | prednisone | 423 | |
| 424 | primidone | 425 | probenecid | 426 | |
| 427 | procaine | 428 | prochlorperazine | 429 | procyclidine |
| 430 | progesterone | 431 | | 432 | |
| 433 | propantheline | 434 | propofol | 435 | |
| 436 | propylthiouracil | 437 | protriptyline | 438 | pyrazinamide |
| 439 | pyridostigmine | 440 | pyrimethamine | 441 | quetiapine |
| 442 | quinapril | 443 | | 444 | rabeprazole |
| 445 | raloxifene | 446 | ramipril | 447 | |
| 448 | remifentanil | 449 | repaglinide | 450 | |
| 451 | ribavirin | 452 | rifabutin | 453 | rifapentine |
| 454 | riluzole | 455 | rimexolone | 456 | risedronate |
| 457 | risperidone | 458 | ritonavir | 459 | rivastigmine |
| 460 | ropinirole | 461 | ropivacaine | 462 | rosiglitazone |
| 463 | salmeterol | 464 | saquinavir | 465 | secobarbital |
| 466 | selegiline | 467 | sertraline | 468 | sildenafil |
| 469 | simvastatin | 470 | | 471 | spectinomycin |
| 472 | | 473 | stavudine | 474 | streptomycin |
| 475 | sufentanil | 476 | | 477 | sulfasalazine |
| 478 | sulfinpyrazone | 479 | sulfisoxazole | 480 | sulindac |
| 481 | sumatriptan | 482 | tacrolimus | 483 | tadalafil |
| 484 | tamoxifen | 485 | tamsulosin | 486 | tazarotene |
| 487 | telmisartan | 488 | temazepam | 489 | teniposide |
| 490 | terazosin | 491 | terbinafine | 492 | terbutaline |
| 493 | terconazole | 494 | testosterone | 495 | |
| 496 | theophylline | 497 | thiabendazole | 498 | thioridazine |
| 499 | thyroxine | 500 | tiagabine | 501 | ticlopidine |
| 502 | tiludronate | 503 | timolol | 504 | tiotropium |
| 505 | tirofiban | 506 | tizanidine | 507 | tobramycin |
| 508 | | 509 | tolazoline | 510 | |
| 511 | tolmetin | 512 | tolterodine | 513 | topiramate |
| 514 | topotecan | 515 | toremifene | 516 | tramadol |
| 517 | trandolapril | 518 | tranexamic acid | 519 | travoprost |
| 520 | trazodone | 521 | treprostinil | 522 | triamcinolone |
| 523 | | 524 | trifluoperazine | 525 | trihexyphenidyl |
| 526 | triiodothyronine | 527 | trimethoprim | 528 | trimetrexate |
| 529 | trimipramine | 530 | tropicamide | 531 | trovafloxacin |
| 532 | valacyclovir | 533 | valproic acid | 534 | valsartan |
| 535 | vancomycin | 536 | venlafaxine | 537 | |
| 538 | vinblastine | 539 | vincristine | 540 | vinorelbine |
| 541 | voriconazole | 542 | warfarin | 543 | zafirlukast |
| 544 | zalcitabine | 545 | zidovudine | 546 | ziprasidone |
| 547 | zolpidem | 548 | zonisamide | | |

Appendix C The list of ATC class-N drugs used in Chapter 5


| Drug Index | Drug name | ATC class (Level 2) | Drug Index | Drug name | ATC class (Level 2) |
|---|---|---|---|---|---|
| 13 | Amantadine | N04 | 309 | Mepivacaine | N01 |
| 21 | Amitriptyline | N06 | 312 | Mesoridazine | N05 |
| 23 | Amoxapine | N06 | 315 | Methadone | N07 |
| 31 | Aripiprazole | N05 | 321 | Methylphenidate | N06 |
| 35 | Atomoxetine | N06 | 336 | Mirtazapine | N06 |
| 51 | Bethanechol | N07 | 339 | Modafinil | N06 |
| 55 | Biperiden | N04 | 341 | Molindone | N05 |
| 63 | Bromocriptine | N04 | 345 | Morphine | N02 |
| 66 | Bupivacaine | N01 | 353 | Nalbuphine | N02 |
| 67 | Bupropion | N06 | 355 | Naltrexone | N07 |
| 68 | Buspirone | N05 | 362 | Nicotine | N07 |
| 69 | Cabergoline | N04 | 373 | Nortriptyline | N06 |
| 70 | Caffeine | N06 | 375 | Olanzapine | N05 |
| 74 | Carbachol | N07 | 384 | Oxazepam | N05 |
| 75 | Carbamazepine | N03 | 385 | Oxcarbazepine | N03 |
| 105 | Chlorpromazine | N05 | 394 | Paroxetine | N06 |
| 114 | Citalopram | N06 | 399 | Pentazocine | N02 |
| 119 | Clomipramine | N06 | 400 | Pentobarbital | N05 |
| 120 | Clonidine | N02 | 403 | Perphenazine | N05 |
| 122 | Clozapine | N05 | 404 | Phenelzine | N06 |
| 135 | Desflurane | N01 | 407 | Phenytoin | N03 |
| 136 | Desipramine | N06 | 408 | Pilocarpine | N07 |
| 140 | Dexmedetomidine | N05 | 410 | Pimozide | N05 |
| 147 | Diflunisal | N02 | 417 | Pramipexole | N04 |
| 149 | Dihydroergotamine | N02 | 424 | Primidone | N03 |
| 154 | Disulfiram | N07 | 427 | Procaine | N01 |
| 159 | Donepezil | N06 | 428 | Prochlorperazine | N05 |
| 162 | Doxepin | N06 | 429 | Procyclidine | N04 |
| 165 | Droperidol | N05 | 434 | Propofol | N01 |
| 172 | Enflurane | N01 | 437 | Protriptyline | N06 |
| 173 | Entacapone | N04 | 439 | Pyridostigmine | N07 |
| 177 | Ergotamine | N02 | 441 | Quetiapine | N05 |
| 186 | Ethosuximide | N03 | 448 | Remifentanil | N01 |
| 187 | Ethotoin | N03 | 454 | Riluzole | N07 |
| 189 | Etomidate | N01 | 457 | Risperidone | N05 |
| 210 | Fluoxetine | N06 | 459 | Rivastigmine | N06 |
| 211 | Fluphenazine | N05 | 460 | Ropinirole | N04 |
| 216 | Fluvoxamine | N06 | 461 | Ropivacaine | N01 |
| 221 | Fosphenytoin | N03 | 465 | Secobarbital | N05 |
| 224 | Gabapentin | N03 | 466 | Selegiline | N04 |
| 225 | Galantamine | N06 | 467 | Sertraline | N06 |
| 239 | Haloperidol | N05 | 475 | Sufentanil | N01 |
| 240 | Halothane | N01 | 481 | Sumatriptan | N02 |
| 248 | Hydroxyzine | N05 | 488 | Temazepam | N05 |
| 254 | Imipramine | N06 | 500 | Tiagabine | N03 |
| 261 | Isocarboxazid | N06 | 513 | Topiramate | N03 |
| 262 | Isoflurane | N01 | 516 | Tramadol | N02 |
| 270 | Ketamine | N01 | 520 | Trazodone | N06 |
| 277 | Lamotrigine | N03 | 524 | Trifluoperazine | N05 |
| 283 | Levetiracetam | N03 | 525 | Trihexyphenidyl | N04 |
| 286 | Lidocaine | N01 | 529 | Trimipramine | N06 |
| 290 | Lithium | N05 | 533 | Valproic Acid | N03 |
| 294 | Lorazepam | N05 | 536 | Venlafaxine | N06 |
| 297 | Loxapine | N05 | 546 | Ziprasidone | N05 |
| 299 | Maprotiline | N06 | 547 | Zolpidem | N05 |
| 308 | Mephenytoin | N03 | 548 | Zonisamide | N03 |

Bibliography

[1] J. T. Dudley, T. Deshpande, and A. J. Butte, “Exploiting drug–disease relationships for computational drug repositioning,” Briefings in bioinformatics, p. bbr013, 2011.

[2] A. J. Atkinson Jr, S.-M. Huang, J. J. Lertora, and S. P. Markey, Principles of clinical pharmacology. Academic Press, 2012.

[3] I. Kola and J. Landis, “Can the pharmaceutical industry reduce attrition rates?” Nature reviews Drug discovery, vol. 3, no. 8, pp. 711–716, 2004.

[4] S. Vilar and G. Hripcsak, “The role of drug profiles as similarity metrics: applications to repurposing, adverse effects detection and drug–drug interactions,” Briefings in bioinformatics, p. bbw048, 2016.

[5] S. Vilar, R. Harpaz, E. Uriarte, L. Santana, R. Rabadan, and C. Friedman, “Drug-drug interaction through molecular structure similarity analysis,” Journal of the American Medical Informatics Association, vol. 19, no. 6, pp. 1066–1074, 2012.

[6] L. B. Peters, N. Bahr, and O. Bodenreider, “Evaluating drug-drug interaction information in ndf-rt and drugbank,” Journal of biomedical semantics, vol. 6, no. 1, p. 19, 2015.

[7] A. Gottlieb, G. Y. Stein, Y. Oron, E. Ruppin, and R. Sharan, “Indi: a computational framework for inferring drug interactions and their associated recommendations,” Molecular systems biology, vol. 8, no. 1, p. 592, 2012.

[8] F. Cheng and Z. Zhao, “Machine learning-based prediction of drug-drug interactions by integrating drug phenotypic, therapeutic, chemical, and genomic properties,” Journal of the American Medical Informatics Association, vol. 21, no. e2, pp. e278–e286, 2014.

[9] S. Vilar, E. Uriarte, L. Santana, T. Lorberbaum, G. Hripcsak, C. Friedman, and N. P. Tatonetti, “Similarity-based modeling in large-scale prediction of drug-drug interactions,” Nature protocols, vol. 9, no. 9, pp. 2147–2163, 2014.

[10] S. Vilar, T. Lorberbaum, G. Hripcsak, and N. P. Tatonetti, “Improving detection of arrhythmia drug-drug interactions in pharmacovigilance data through the implementation of similarity-based modeling,” PLoS ONE, vol. 10, no. 6, p. e0129974, 2015.

[11] P. Zhang, F. Wang, J. Hu, and R. Sorrentino, “Label propagation prediction of drug-drug interactions based on clinical side effects,” Scientific Reports, vol. 5, 2015.

[12] L. Liu, L. Chen, Y.-H. Zhang, L. Wei, S. Cheng, X. Kong, M. Zheng, T. Huang, and Y.-D. Cai, “Analysis and prediction of drug-drug interaction by minimum redundancy maximum relevance and incremental feature selection,” Journal of Biomolecular Structure and Dynamics, no. just-accepted, pp. 1–452, 2016.

[13] Y. Wang, J. Xiao, T. O. Suzek, J. Zhang, J. Wang, and S. H. Bryant, “Pubchem: a public information system for analyzing bioactivities of small molecules,” Nucleic acids research, vol. 37, no. suppl 2, pp. W623–W633, 2009.

[14] O. Bodenreider, “The unified medical language system (umls): integrating biomedical terminology,” Nucleic acids research, vol. 32, no. suppl 1, pp. D267–D270, 2004.

[15] D. S. Wishart, C. Knox, A. C. Guo, D. Cheng, S. Shrivastava, D. Tzur, B. Gautam, and M. Hassanali, “Drugbank: a knowledgebase for drugs, drug actions and drug targets,” Nucleic acids research, vol. 36, no. suppl 1, pp. D901–D906, 2008.

[16] M. Kuhn, M. Campillos, I. Letunic, L. J. Jensen, and P. Bork, “A side effect resource to capture phenotypic effects of drugs,” Molecular systems biology, vol. 6, no. 1, 2010.

[17] F. Wang, P. Zhang, N. Cao, J. Hu, and R. Sorrentino, “Exploring the associations between drug side-effects and therapeutic indications,” Journal of biomedical informatics, vol. 51, pp. 15–23, 2014.

[18] F. Napolitano, Y. Zhao, V. M. Moreira, R. Tagliaferri, J. Kere, M. D’Amato, and D. Greco, “Drug repositioning: a machine-learning approach through data integration,” J. Cheminformatics, vol. 5, p. 30, 2013.

[19] Y. Yamanishi, M. Kotera, M. Kanehisa, and S. Goto, “Drug-target interaction prediction from chemical, genomic and pharmacological data in an integrated framework,” Bioinformatics, vol. 26, no. 12, pp. i246–i254, 2010.

[20] M. A. Yildirim, K.-I. Goh, M. E. Cusick, A.-L. Barabasi, and M. Vidal, “Drug-target network,” Nature biotechnology, vol. 25, no. 10, pp. 1119–1126, 2007.

[21] M. Lee, K. Park, and D. Kim, “Interaction network among functional drug groups,” BMC systems biology, vol. 7, no. 3, p. 1, 2013.

[22] N. Ai, X. Fan, and S. Ekins, “In silico methods for predicting drug-drug interactions with cytochrome p-450s, transporters and beyond,” Advanced Drug Delivery Reviews, 2015.

[23] B. D. Snyder, T. M. Polasek, and M. P. Doogue, “Drug interactions: principles and practice,” Australian Prescriber, vol. 35, no. 3, pp. 85–8, 2012.

[24] F. Cheng, Y. Yu, J. Shen, L. Yang, W. Li, G. Liu, P. W. Lee, and Y. Tang, “Classification of cytochrome p450 inhibitors and noninhibitors using combined classifiers,” Journal of Chemical Information and Modeling, vol. 51, no. 5, pp. 996–1011, 2011.

[25] L. Tari, S. Anwar, S. Liang, J. Cai, and C. Baral, “Discovering drug-drug interactions: a text-mining and reasoning approach based on properties of drug metabolism,” Bioinformatics, vol. 26, no. 18, pp. i547–i553, 2010.

[26] N. P. Tatonetti, P. P. Ye, R. Daneshjou, and R. B. Altman, “Data-driven prediction of drug effects and interactions,” Science translational medicine, vol. 4, no. 125, p. 125ra31, 2012.

[27] M. Zitnik and B. Zupan, “Collective pairwise classification for multi-way analysis of disease and drug data,” in Pacific Symposium on Biocomputing, vol. 21. Pacific Symposium on Biocomputing, 2016, pp. 81–92.

[28] N. P. Tatonetti, G. H. Fernald, and R. B. Altman, “A novel signal detection algorithm for identifying hidden drug-drug interactions in adverse event reports,” Journal of the American Medical Informatics Association, vol. 19, no. 1, pp. 79–85, 2012.

[29] V. Law, C. Knox, Y. Djoumbou, T. Jewison, A. C. Guo, Y. Liu, A. Maciejewski, D. Arndt, M. Wilson, V. Neveu et al., “Drugbank 4.0: shedding new light on drug metabolism,” Nucleic Acids Research, vol. 42, no. D1, pp. D1091–D1097, 2014.

[30] D. S. Wishart, Y. D. Feunang, A. C. Guo, E. J. Lo, A. Marcu, J. R. Grant, T. Sajed, D. Johnson, C. Li, Z. Sayeeda et al., “Drugbank 5.0: a major update to the drugbank database for 2018,” Nucleic acids research, vol. 46, no. D1, pp. D1074–D1082, 2017.

[31] Drugbank Statistics. Drugbank. [Online]. Available: https://www.drugbank.ca/stats

[32] D. Alahakoon, S. K. Halgamuge, and B. Srinivasan, “Dynamic self-organizing maps with controlled growth for knowledge discovery,” IEEE Transactions on neural networks, vol. 11, no. 3, pp. 601–614, 2000.

[33] J. Li, S. Zheng, B. Chen, A. J. Butte, S. J. Swamidass, and Z. Lu, “A survey of current trends in computational drug repositioning,” Briefings in bioinformatics, vol. 17, no. 1, pp. 2–12, 2016.

[34] N. U. Sahu and P. S. Kharkar, “Computational drug repositioning: A lateral approach to traditional drug discovery?” Current topics in medicinal chemistry, vol. 16, no. 19, pp. 2069–2077, 2016.

[35] S. Suthram, J. T. Dudley, A. P. Chiang, R. Chen, T. J. Hastie, and A. J. Butte, “Network-based elucidation of human disease similarities reveals common functional modules enriched for pluripotent drug targets,” PLoS Comput Biol, vol. 6, no. 2, p. e1000662, 2010.

[36] S. I. Berger and R. Iyengar, “Network analyses in systems pharmacology,” Bioinformatics, vol. 25, no. 19, pp. 2466–2472, 2009.

[37] T. A. Hemphill, “Repurposing pharmaceuticals: Does united states intellectual property law and regulatory policy assign sufficient value to new use patents?” International Journal of Innovation Management, vol. 16, no. 04, p. 1250016, 2012.

[38] Anatomical Therapeutic Chemical Classification. World Health Organization. [Online]. Available: http://www.whocc.no/atc_ddd_index/

[39] X.-M. Zhao, M. Iskar, G. Zeller, M. Kuhn, V. Van Noort, and P. Bork, “Prediction of drug combinations by integrating molecular and pharmacological data,” PLoS computational biology, vol. 7, no. 12, p. e1002323, 2011.

[40] J.-Y. Shi, J.-X. Li, and H.-M. Lu, “Predicting existing targets for new drugs base on strategies for missing interactions,” BMC bioinformatics, vol. 17, no. 8, p. 282, 2016.

[41] L. Chen, J. Lu, N. Zhang, T. Huang, and Y.-D. Cai, “A hybrid method for prediction and repositioning of drug anatomical therapeutic chemical classes,” Molecular BioSystems, vol. 10, no. 4, pp. 868–877, 2014.

[42] A. Fokoue, M. Sadoghi, O. Hassanzadeh, and P. Zhang, “Predicting drug-drug interactions through large-scale similarity-based link prediction,” in International Semantic Web Conference. Springer, 2016, pp. 774–789.

[43] Y. Li, F.-X. Wu, and A. Ngom, “A review on machine learning principles for multi-view biological data integration,” Briefings in bioinformatics, p. bbw113, 2016.

[44] S. Alaimo, A. Pulvirenti, R. Giugno, and A. Ferro, “Drug-target interaction prediction through domain-tuned network-based inference,” Bioinformatics, vol. 29, no. 16, pp. 2004–2008, 2013.

[45] N. X. Vinh, J. Epps, and J. Bailey, “Information theoretic measures for clusterings comparison: is a correction for chance necessary?” in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 1073–1080.

[46] A. Strehl and J. Ghosh, “Cluster ensembles - a knowledge reuse framework for combining multiple partitions,” Journal of machine learning research, vol. 3, no. Dec, pp. 583–617, 2002.

[47] S. Romano, J. Bailey, X. V. Nguyen, and K. Verspoor, “Standardized mutual information for clustering comparisons: One step further in adjustment for chance,” in ICML, 2014, pp. 1143–1151.

[48] E. Pauwels, V. Stoven, and Y. Yamanishi, “Predicting drug side-effect profiles: a chemical fragment-based approach,” BMC bioinformatics, vol. 12, no. 1, p. 169, 2011.

[49] M. Schenone, V. Dančík, B. K. Wagner, and P. A. Clemons, “Target identification and mechanism of action in chemical biology and drug discovery,” Nature chemical biology, vol. 9, no. 4, p. 232, 2013.

[50] M. Campillos, M. Kuhn, A.-C. Gavin, L. J. Jensen, and P. Bork, “Drug target identification using side-effect similarity,” Science, vol. 321, no. 5886, pp. 263–266, 2008.

[51] J. Lamb, E. D. Crawford, D. Peck, J. W. Modell, I. C. Blat, M. J. Wrobel, J. Lerner, J.-P. Brunet, A. Subramanian, K. N. Ross et al., “The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease,” Science, vol. 313, no. 5795, pp. 1929–1935, 2006.

[52] R. Kalaria, “Similarities between alzheimer’s disease and vascular dementia,” Journal of the Neurological Sciences, vol. 203, pp. 29–34, 2002.

[53] H. J. Lowe and G. O. Barnett, “Understanding and using the medical subject headings (mesh) vocabulary to perform literature searches,” Jama, vol. 271, no. 14, pp. 1103–1108, 1994.

[54] Medline database. National Center for Biotechnology Information. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed

[55] X. Zhou, J. Menche, A.-L. Barabási, and A. Sharma, “Human symptoms–disease network,” Nature communications, vol. 5, p. 4212, 2014.

[56] G. Hu and P. Agarwal, “Human disease-drug network based on genomic expression profiles,” PloS one, vol. 4, no. 8, p. e6536, 2009.

[57] Y. Li and P. Agarwal, “A pathway-based view of human diseases and disease relationships,” PloS one, vol. 4, no. 2, p. e4346, 2009.

[58] N. Atias and R. Sharan, “An algorithmic framework for predicting side effects of drugs,” Journal of Computational Biology, vol. 18, no. 3, pp. 207–218, 2011.

[59] M. Kanehisa, S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, and M. Hirakawa, “From genomics to chemical genomics: new developments in kegg,” Nucleic acids research, vol. 34, no. suppl_1, pp. D354–D357, 2006.

[60] H. Ding, I. Takigawa, H. Mamitsuka, and S. Zhu, “Similarity-based machine learning methods for predicting drug–target interactions: a brief review,” Briefings in bioinformatics, vol. 15, no. 5, pp. 734–747, 2013.

[61] K. Bleakley and Y. Yamanishi, “Supervised prediction of drug–target interactions using bipartite local models,” Bioinformatics, vol. 25, no. 18, pp. 2397–2403, 2009.

[62] J.-P. Mei, C.-K. Kwoh, P. Yang, X.-L. Li, and J. Zheng, “Drug-target interaction prediction by learning from local information and neighbors,” Bioinformatics, vol. 29, no. 2, pp. 238–245, 2013.

[63] S. Fakhraei, B. Huang, L. Raschid, and L. Getoor, “Network-based drug-target interaction prediction with probabilistic soft logic,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 11, no. 5, pp. 775–787, 2014.

[64] J. Li and Z. Lu, “A new method for computational drug repositioning using drug pairwise similarity,” in Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on, Oct 2012, pp. 1–4.

[65] T. Keshava Prasad, R. Goel, K. Kandasamy, S. Keerthikumar, S. Kumar, S. Mathivanan, D. Telikicherla, R. Raju, B. Shafreen, A. Venugopal et al., “Human protein reference database-2009 update,” Nucleic acids research, vol. 37, no. suppl_1, pp. D767–D772, 2008.

[66] R. Apweiler, A. Bairoch, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane et al., “Uniprot: the universal protein knowledgebase,” Nucleic acids research, vol. 32, no. suppl 1, pp. D115–D119, 2004.

[67] K. Raman, K. Yeturu, and N. Chandra, “targettb: a target identification pipeline for mycobacterium tuberculosis through an interactome, reactome and genome-scale structural analysis,” BMC systems biology, vol. 2, no. 1, p. 109, 2008.

[68] F. Cheng, W. Li, Z. Wu, X. Wang, C. Zhang, J. Li, G. Liu, and Y. Tang, “Prediction of polypharmacological profiles of drugs by the integration of chemical, side effect, and therapeutic space,” Journal of chemical information and modeling, vol. 53, no. 4, pp. 753–762, 2013.

[69] R. Sawada, H. Iwata, S. Mizutani, and Y. Yamanishi, “Target-based drug repositioning using large-scale chemical–protein interactome data,” Journal of chemical information and modeling, vol. 55, no. 12, pp. 2717–2730, 2015.

[70] A. Gottlieb, G. Y. Stein, E. Ruppin, and R. Sharan, “Predict: a method for inferring novel drug indications with application to personalized medicine,” Molecular systems biology, vol. 7, no. 1, p. 496, 2011.

[71] I. Kosti, N. Jain, D. Aran, A. J. Butte, and M. Sirota, “Cross-tissue analysis of gene and protein expression in normal and cancer tissues,” Scientific reports, vol. 6, p. 24799, 2016.

[72] C. Wu, R. C. Gudivada, B. J. Aronow, and A. G. Jegga, “Computational drug repositioning through heterogeneous network clustering,” BMC systems biology, vol. 7, no. Suppl 5, p. S6, 2013.

[73] L. Chen, W.-M. Zeng, Y.-D. Cai, K.-Y. Feng, and K.-C. Chou, “Predicting anatomical therapeutic chemical (atc) classification of drugs by integrating chemical-chemical interactions and similarities,” PloS one, vol. 7, no. 4, p. e35254, 2012.

[74] M. L. Hartsperger, F. Blochl, V. Stumpflen, and F. J. Theis, “Structuring heterogeneous biological information using fuzzy clustering of k-partite graphs,” BMC Bioinformatics, vol. 11, no. 1, p. 522, 2010.

[75] B. Zhou, R. Wang, P. Wu, and D.-X. Kong, “Drug repurposing based on drug–drug interaction,” Chemical biology & drug design, vol. 85, no. 2, pp. 137–144, 2015.

[76] C. Zhu, C. Wu, and A. G. Jegga, “Network biology methods for drug repositioning,” Post Genom. Approaches Drug Vaccine Dev, vol. 5, p. 115, 2015.

[77] K. M. Borgwardt and H.-P. Kriegel, “Shortest-path kernels on graphs,” in Data Mining, Fifth IEEE International Conference on. IEEE, 2005, pp. 8–pp.

[78] Y.-F. Gao, Y. Shu, L. Yang, Y.-C. He, L.-P. Li, G. Huang, H.-P. Li, and Y. Jiang, “A graphic method for identification of novel glioma related genes,” BioMed research international, vol. 2014, 2014.

[79] R. Gramatica, T. Di Matteo, S. Giorgetti, M. Barbiani, D. Bevec, and T. Aste, “Graph theory enables drug repurposing–how a mathematical model can drive the discovery of hidden mechanisms of action,” PloS one, vol. 9, no. 1, p. e84912, 2014.

[80] A.-L. Barabási, N. Gulbahce, and J. Loscalzo, “Network medicine: a network-based approach to human disease,” Nature Reviews Genetics, vol. 12, no. 1, pp. 56–68, 2011.

[81] F. Cheng, C. Liu, J. Jiang, W. Lu, W. Li, G. Liu, W. Zhou, J. Huang, and Y. Tang, “Prediction of drug-target interactions and drug repositioning via network-based inference,” PLoS Comput Biol, vol. 8, no. 5, p. e1002503, 2012.

[82] T. van Laarhoven, S. B. Nabuurs, and E. Marchiori, “Gaussian interaction profile kernels for predicting drug–target interaction,” Bioinformatics, vol. 27, no. 21, pp. 3036–3043, 2011.

[83] T. Takeda, M. Hao, T. Cheng, S. H. Bryant, and Y. Wang, “Predicting drug–drug interactions through drug structural similarities and interaction networks incorporating pharmacokinetics and pharmacodynamics knowledge,” Journal of Cheminformatics, vol. 9, no. 1, p. 16, 2017.

[84] S. Zickenrott, V. Angarica, B. Upadhyaya, and A. Del Sol, “Prediction of disease–gene–drug relationships following a differential network analysis,” Cell death & disease, vol. 7, no. 1, p. e2040, 2016.

[85] T. Noeske, B. C. Sasse, H. Stark, C. G. Parsons, T. Weil, and G. Schneider, “Predicting compound selectivity by self-organizing maps: Cross-activities of metabotropic glutamate receptor antagonists,” ChemMedChem, vol. 1, no. 10, pp. 1066–1068, 2006.

[86] E. M. Volz, J. C. Miller, A. Galvani, and L. A. Meyers, “Effects of heterogeneous and clustered contact patterns on infectious disease dynamics,” PLoS computational biology, vol. 7, no. 6, p. e1002042, 2011.

[87] T. Nepusz, H. Yu, and A. Paccanaro, “Detecting overlapping protein complexes in protein-protein interaction networks,” Nature methods, vol. 9, no. 5, pp. 471–472, 2012.

[88] Interax drug interaction lookup. DrugBank. [Online]. Available: http://www.drugbank.ca/interax/drug_lookup

[89] Physicians’ desk reference. PDR Network. [Online]. Available: http://www.pdr.net/

[90] E-therapeutics. Canadian Pharmacists Association. [Online]. Available: http://www.e-therapeutics.ca/

[91] Medicines complete. [Online]. Available: https://www.medicinescomplete.com/about/index.htm

[92] Epocrates athena health service. [Online]. Available: http://www.epocrates.com/products/features

[93] Drugs.com. Wolters Kluwer Health, American Society of Health-System Pharmacists, Cerner Multum and Micromedex from Truven Health. [Online]. Available: https://www.drugs.com/

[94] J. Huang, C. Niu, C. D. Green, L. Yang, H. Mei, and J. Han, “Systematic prediction of pharmacodynamic drug-drug interactions through protein-protein-interaction network,” PLoS Comput Biol, vol. 9, no. 3, p. e1002998, 2013.

[95] X. Ning, T. Schleyer, L. Shen, and L. Li, “Pattern discovery from directional high-order drug-drug interaction relations,” in Healthcare Informatics (ICHI), 2017 IEEE International Conference on. IEEE, 2017, pp. 154–162.

[96] I. R. Edwards and J. K. Aronson, “Adverse drug reactions: definitions, diagnosis, and management,” The lancet, vol. 356, no. 9237, pp. 1255–1259, 2000.

[97] L. E. Shapiro, S. R. Knowles, and N. H. Shear, “Drug interactions of clinical significance for the dermatologist,” American journal of clinical dermatology, vol. 4, no. 9, pp. 623–639, 2003.

[98] R. A. McKinnon, M. J. Sorich, and M. B. Ward, “Cytochrome p450 part 1: multiplicity and function,” Journal of Pharmacy Practice and Research, vol. 38, no. 1, pp. 55–57, 2008.

[99] H. Rang, J. Ritter, R. Flower, and G. Henderson, Rang and Dale’s Pharmacology, Seventh ed. Elsevier Churchill Livingstone, 2012.

[100] T. Mathew, R. Chow, P. Desmond, D. Isaacs, C. Lander, J. McNeil, G. Shenfield, and D. Wainwright, “Drug interactions and adverse drug reactions,” Australian Adverse Drug Reactions Bulletin, vol. 19, no. 3, pp. 10–11, 2000.

[101] A. Asuncion and D. Newman, “Uci machine learning repository,” 2007.

[102] X. Chen, M.-X. Liu, and G.-Y. Yan, “Drug-target interaction prediction by random walk on the heterogeneous network,” Molecular BioSystems, vol. 8, no. 7, pp. 1970–1978, 2012.

[103] Z. Wu, Y. Wang, and L. Chen, “Network-based drug repositioning,” Molecular BioSystems, vol. 9, no. 6, pp. 1268–1281, 2013.

[104] A. Sadeghi and H. Fröhlich, “Steiner tree methods for optimal sub-network identification: an empirical study,” BMC bioinformatics, vol. 14, no. 1, p. 144, 2013.

[105] N. Tuncbag, A. Braunstein, A. Pagnani, S.-S. C. Huang, J. Chayes, C. Borgs, R. Zecchina, and E. Fraenkel, “Simultaneous reconstruction of multiple signaling pathways via the prize-collecting steiner forest problem,” Journal of Computational Biology, vol. 20, no. 2, pp. 124–136, 2013.

[106] N. Yosef, L. Ungar, E. Zalckvar, A. Kimchi, M. Kupiec, E. Ruppin, and R. Sharan, “Toward accurate reconstruction of functional protein networks,” Molecular systems biology, vol. 5, no. 1, p. 248, 2009.

[107] Y.-K. Shih and S. Parthasarathy, “A single source k-shortest paths algorithm to infer regulatory pathways in a gene network,” Bioinformatics, vol. 28, no. 12, pp. i49–i58, 2012.

[108] M. S. Scott, T. Perkins, S. Bunnell, F. Pepin, D. Y. Thomas, and M. Hallett, “Identifying regulatory subnetworks for a set of genes,” Molecular & Cellular Proteomics, vol. 4, no. 5, pp. 683–692, 2005.

[109] M. Bailly-Bechet, A. Braunstein, A. Pagnani, M. Weigt, and R. Zecchina, “Inference of sparse combinatorial-control networks from gene-expression data: a message passing approach,” BMC bioinformatics, vol. 11, no. 355, 2010.

[110] K. Faust, P. Dupont, J. Callut, and J. Van Helden, “Pathway discovery in metabolic networks by subgraph extraction,” Bioinformatics, vol. 26, no. 9, pp. 1211–1218, 2010.

[111] Y. Sun, P. N. Hameed, K. Verspoor, and S. Halgamuge, “A physarum-inspired prize-collecting steiner tree approach to identify subnetworks for drug repositioning,” BMC Systems Biology, vol. 10, no. 5, p. 25, 2016.

[112] M. X. Goemans and D. P. Williamson, “A general approximation technique for constrained forest problems,” SIAM Journal on Computing, vol. 24, no. 2, pp. 296–317, 1995.

[113] R. C. Prim, “Shortest connection networks and some generalizations,” Bell Labs Technical Journal, vol. 36, no. 6, pp. 1389–1401, 1957.

[114] T. Nakagaki, H. Yamada, and Á. Tóth, “Intelligence: Maze-solving by an amoeboid organism,” Nature, vol. 407, no. 6803, p. 470, 2000.

[115] T. Saigusa, A. Tero, T. Nakagaki, and Y. Kuramoto, “Amoebae anticipate periodic events,” Physical review letters, vol. 100, no. 1, p. 018101, 2008.

[116] Y. Sun and S. Halgamuge, “Fast algorithms inspired by physarum polycephalum for node weighted steiner tree problem with multiple terminals,” in Evolutionary Computation (CEC), 2016 IEEE Congress on. IEEE, 2016, pp. 3254–3260.

[117] J.-Y. Shi, S.-M. Yiu, Y. Li, H. C. Leung, and F. Y. Chin, “Predicting drug–target interaction for new drugs using enhanced similarity measures and super-target clustering,” Methods, vol. 83, pp. 98–104, 2015.

[118] A. L. Hopkins, “Network pharmacology: the next paradigm in drug discovery,” Nature chemical biology, vol. 4, no. 11, pp. 682–690, 2008.

[119] M. Dunkel, S. Günther, J. Ahmed, B. Wittig, and R. Preissner, “Superpred: drug classification and target prediction,” Nucleic acids research, vol. 36, no. suppl_2, pp. W55–W59, 2008.

[120] Y. Yamanishi, M. Araki, A. Gutteridge, W. Honda, and M. Kanehisa, “Prediction of drug–target interaction networks from the integration of chemical and genomic spaces,” Bioinformatics, vol. 24, no. 13, pp. i232–i240, 2008.

[121] S.-F. Lin, K.-T. Xiao, Y.-T. Huang, C.-C. Chiu, and V.-W. Soo, “Analysis of adverse drug reactions using drug and drug target interactions and graph-based methods,” Artificial intelligence in medicine, vol. 48, no. 2, pp. 161–166, 2010.

[122] Z. Liu, H. Fang, K. Reagan, X. Xu, D. L. Mendrick, W. Slikker Jr, and W. Tong, “In silico drug repositioning–what we need to know,” Drug discovery today, vol. 18, no. 3-4, pp. 110–115, 2013.

[123] M. Wen, Z. Zhang, S. Niu, H. Sha, R. Yang, Y. Yun, and H. Lu, “Deep-learning-based drug–target interaction prediction,” Journal of proteome research, vol. 16, no. 4, pp. 1401–1409, 2017.

[124] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.

[125] L. Van Der Maaten, “Barnes-hut-sne,” arXiv preprint arXiv:1301.3342, 2013.

[126] M. Ester, H.-P. Kriegel, J. Sander, X. Xu et al., “A density-based algorithm for discovering clusters in large spatial databases with noise.” in Kdd, vol. 96, no. 34, 1996, pp. 226–231.

[127] D. R. Edla, P. K. Jana et al., “A prototype-based modified dbscan for gene clustering,” Procedia Technology, vol. 6, pp. 485–492, 2012.

[128] J. Lin, M. J. Wester, M. S. Graus, K. A. Lidke, and A. K. Neumann, “Nanoscopic cell-wall architecture of an immunogenic ligand in candida albicans during antifungal drug treatment,” Molecular biology of the cell, vol. 27, no. 6, pp. 1002–1014, 2016.

[129] P. Banerjee, J. Erehman, B.-O. Gohlke, T. Wilhelm, R. Preissner, and M. Dunkel, “Super natural ii-a database of natural products,” Nucleic acids research, vol. 43, no. D1, pp. D935–D939, 2014.

[130] D. Herath, S.-L. Tang, K. Tandon, D. Ackland, and S. K. Halgamuge, “Comet: a workflow using contig coverage and composition for binning a metagenomic sample with high precision,” BMC bioinformatics, vol. 18, no. 16, p. 571, 2017.

[131] A. Hinneburg and D. A. Keim, “Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering,” 1999.

[132] P. J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,” Journal of computational and applied mathematics, vol. 20, pp. 53–65, 1987.

[133] E. Rendón, I. Abundez, A. Arizmendi, and E. Quiroz, “Internal versus external cluster validation indexes,” International Journal of computers and communications, vol. 5, no. 1, pp. 27–34, 2011.

[134] N. X. Vinh, J. Epps, and J. Bailey, “Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance,” Journal of Machine Learning Research, vol. 11, no. Oct, pp. 2837–2854, 2010.

[135] Dbscan clustering in matlab. YARPIZ. [Online]. Available: http://yarpiz.com/255/ypml110-dbscan-clustering

[136] G. Grandi, A. Cagnacci, and A. Volpe, “Pharmacokinetic evaluation of desogestrel as a female contraceptive,” Expert opinion on drug metabolism & toxicology, vol. 10, no. 1, pp. 1–10, 2014.

[137] G. R. Cunha, T. Kurita, M. Cao, J. Shen, S. J. Robboy, and L. Baskin, “Response of xenografts of developing human female reproductive tracts to the synthetic estrogen, diethylstilbestrol,” Differentiation, vol. 98, pp. 35–54, 2017.

[138] X. Wu, X. Liu, X. Jin, and X. Xu, “Effects of levonorgestrel intrauterine system on the expressions of estrogen receptor, progesterone receptor and insulin-like growth factor-1,” Zhonghua yi xue za zhi, vol. 94, no. 35, pp. 2763–2765, 2014.

[139] H. Yu, J. Chen, X. Xu, Y. Li, H. Zhao, Y. Fang, X. Li, W. Zhou, W. Wang, and Y. Wang, “A systematic prediction of multiple drug-target interactions from chemical, genomic, and pharmacological data,” PloS one, vol. 7, no. 5, p. e37608, 2012.

[140] A. Koul, E. Arnoult, N. Lounis, J. Guillemont, and K. Andries, “The challenge of new drug discovery for tuberculosis,” Nature, vol. 469, no. 7331, p. 483, 2011.

[141] J. Van Den Berg, H. Vereecke, J. Proost, D. Eleveld, J. Wietasch, A. Absalom, and M. Struys, “Pharmacokinetic and pharmacodynamic interactions in anaesthesia. a review of current knowledge and how it can be used to optimize anaesthetic drug administration,” British journal of anaesthesia, vol. 118, no. 1, pp. 44–57, 2017.

[142] H. Iwata, R. Sawada, S. Mizutani, M. Kotera, and Y. Yamanishi, “Large-scale prediction of beneficial drug combinations using drug efficacy and target profiles,” Journal of chemical information and modeling, vol. 55, no. 12, pp. 2705–2716, 2015.

[143] Orange book: Approved drug products with therapeutic equivalence evaluations. Orange Book. [Online]. Available: https://www.accessdata.fda.gov/scripts/cder/ob/index.cfm

[144] S. Basu, A. Banerjee, and R. Mooney, “Semi-supervised clustering by seeding,” in Proceedings of 19th International Conference on Machine Learning (ICML-2002). Citeseer, 2002.

[145] C.-K. K. Chan, A. L. Hsu, S. K. Halgamuge, and S.-L. Tang, “Binning sequences using very sparse labels within a metagenome,” BMC bioinformatics, vol. 9, no. 1, p. 1, 2008.

Minerva Access is the Institutional Repository of The University of Melbourne

Author/s: Hameed, Pathima Nusrath

Title: In silico methods for drug repositioning and drug-drug interaction prediction

Date: 2018

Persistent Link: http://hdl.handle.net/11343/219484

File Description: In silico methods for drug repositioning and drug-drug interaction prediction

Terms and Conditions: Copyright in works deposited in Minerva Access is retained by the copyright owner. The work may not be altered without permission from the copyright owner. Readers may only download, print and save electronic copies of whole works for their own personal non-commercial use. Any use that exceeds these limits requires permission from the copyright owner. Attribution is essential when quoting or paraphrasing from these works.