<<

Hameed et al. BMC Bioinformatics (2018) 19:129 https://doi.org/10.1186/s12859-018-2123-4

RESEARCH ARTICLE Open Access A two-tiered unsupervised clustering approach for drug repositioning through heterogeneous data integration Pathima Nusrath Hameed1,2,3* , Karin Verspoor4, Snezana Kusljic5,6 and Saman Halgamuge7

Abstract Background: Drug repositioning is the process of identifying new uses for existing drugs. Computational drug repositioning methods can reduce the time, costs and risks of drug development by automating the analysis of the relationships in networks. Pharmacology networks are large and heterogeneous. Clustering drugs into small groups can simplify large pharmacology networks, these subgroups can also be used as a starting point for repositioning drugs. In this paper, we propose a two-tiered drug-centric unsupervised clustering approach for drug repositioning, integrating heterogeneous drug data profiles: drug-chemical, drug-disease, drug-gene, drug-protein and drug-side effect relationships. Results: The proposed drug repositioning approach is threefold; (i) clustering drugs based on their homogeneous profiles using the Growing Self Organizing Map (GSOM); (ii) clustering drugs based on drug-drug relation matrices based on the previous step, considering three state-of-the-art graph clustering methods; and (iii) inferring drug repositioning candidates and assigning a confidence value for each identified candidate. In this paper, we compare our two-tiered clustering approach against two existing heterogeneous data integration approaches with reference to the Anatomical Therapeutic Chemical (ATC) classification, using GSOM. Our approach yields Normalized Mutual Information (NMI) and Standardized Mutual Information (SMI) of 0.66 and 36.11, respectively, while the two existing methods yield NMI of 0.60 and 0.64 and SMI of 22.26 and 33.59. Moreover, the two existing approaches failed to produce useful cluster separations when using graph clustering algorithms while our approach is able to identify useful clusters for drug repositioning. Furthermore, we provide clinical evidence for four predicted results (Chlorthalidone, Indomethacin, Metformin and ) to support that our proposed approach can be reliably used to infer ATC code and drug repositioning. Conclusion: The proposed two-tiered unsupervised clustering approach is suitable for drug clustering and enables heterogeneous data integration. It also enables identifying reliable repositioning drug candidates with reference to ATC therapeutic classification. The repositioning drug candidates identified consistently by multiple clustering algorithms and with high confidence have a higher possibility of being effective repositioning candidates. Keywords: Drug repurposing, ATC classification, Drug clustering, Data integration, Heterogeneity

*Correspondence: [email protected] 1Department of Mechanical Engineering, University of Melbourne, Parkville, 3010 Melbourne, Australia 2Data61, Victoria Research Lab, West Melbourne 3003, Australia Full list of author information is available at the end of the article

© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Hameed et al. BMC Bioinformatics (2018) 19:129 Page 2 of 18

Background heterogeneous objects can be grouped into small clus- Producing new drugs and marketing them with a com- ters based on their associations. Since pharmacology net- plete drug profile is a challenging task as it is a long pro- works are large and complex, partitioning large networks cess and requires a large investment of time and money. produces an abstraction which simplifies their complex Drug repositioning or drug repurposing is the process of interaction structure. Realizing the importance of simpli- identifying new therapeutic uses for existing drugs. It can fying drug-data network, research [2, 8, 10, 17, 18]has reduce the time, costs and risks of the traditional drug approached partitioning pharmacological networks using discovery process [1–4].Themaingoalofdrugreposi- various graph theory concepts. tioning is to increase the therapeutic use of the existing Yildirim et al. [8] focused on combining heterogeneous drugs in the clinical and medical domain. It is believed data using drug-target and disease-gene interactions that drugs having similar profiles are more likely to share employing bipartite graph projections while Hartsperger similar behavior in presence of similar targets (e.g. pro- et al. [19] demonstrated the importance of fuzzy cluster- teins) [1, 3–7]. There is also evidence that computational ing for arranging the biological entities like disease, gene drug repositioning can be improved by heterogeneous and proteins in a meaningful weighted k-partite graph. data analysis [1, 5, 7–9]. In contrast to laborious in- Moreover, Klamt et al. [20] demonstrates graph transfor- vivo and in-vitro experiments, computational methods for mations such as graph projection methods would lead to drug repositioning have become popular as effective and information loss. In contrast, Yaminishi et al. [5]investi- efficient approaches for drug repositioning [1, 3–6]. These gated a supervised bipartite graph inferencing approach methods focus on identifying new uses for existing drugs by integrating chemical and pharmacological properties. and finding new associations between other contributing Campillos et al. [18] suggested a probability theoretic entities like proteins, genes, diseases and side effects to approach to integrate chemical and pharmaceutical prop- approach this problem. erties. There are two main concepts behind drug reposi- Napolitano et al. [2] proposed useful drug reclassifi- tioning: new target recognition and new indication cations for ATC classification using supervised machine recognition. Figure 1 illustrates a general view of these learning. They integrated drug-chemical, drug-gene and two drug repositioning concepts. Figure 1a shows the drug-protein representations and obtained classification known interactions where each of the drugs is associated accuracy of 78%. But, integrating pharmacological con- with at least one target protein and vice versa; each of the cepts is also important when focusing drug repositioning targetsisalsoassociatedwithatleastonediseaseandvice using ATC classification. In general, taking second/higher versa. Figures 1b and c show new target recognition and order derivatives of objects is a popular method for high- new indication recognition, respectively. In new target lighting special features. Lee et al. [9]proposedthat recognition, the objective is to identify novel molecular drug groups (DG) having common DG-DG interaction targets for a given drug while in new indication recog- partners would share similar drug mechanisms and they nition, the objective is to identify new diseases that may have proposed Molecular Complex Detection (MCODE) be impacted by one of the existing targets of the drug. algorithm for module detection in DG-DG interaction Computational methods like network based inferencing network. They investigated clustering DG-DG interac- [1, 5, 6, 8, 10], machine learning [2, 11, 12], and text tions in relation to ATC classification and they believe mining approaches [13, 14] are widely used for drug DG-DG interactions would be useful in describing the repositioning. In recent computational approaches, the mechanisms and the features of drugs. Anatomical Therapeutic Chemical (ATC) classification system [15] is considered as an intermediate source to The importance of heterogeneous data integration In identify useful drug repositioning candidates where the preliminary investigations of drug repositioning, compu- ATC therapeutic classes are used to identify reposition- tational models for pharmacological data have been devel- ing candidates [9, 11, 16]. Every repositioning candidate oped using homogeneous components such as disease, identified by computational models may not be directly symptoms, side effects, chemical structures, proteins and applicable in clinical practice. However, the outcomes genes. But, each homogeneous component has its own of the computational models may enable prioritizing pros and cons [1]. Although many findings acknowledge repositioning candidates for in-vivo/in-vitro analysis. the benefits of phenome space properties like disease and Pharmacological data can be represented in homo- side effects [18, 21], chemical structures are also impor- geneous or heterogeneous graphs/networks. Therefore, tant to make predictions. Different drug characterizations most of the drug repositioning approaches can be seen may lead to identifying various repositioning candidates as hybrid methods of graph/network theory concepts based on different aspects. Hence, combining the results and machine learning [5, 8–10, 12]. Graph clustering is of different drug characterizations can lead to identify- such hybrid approach where graphs of homogeneous and ing reliable repositioning candidates. Recent studies have Hameed et al. BMC Bioinformatics (2018) 19:129 Page 3 of 18

Fig. 1 A generalized illustration of two alternative approaches involving in drug repositioning; (a), (b)and(c) represent the known interactions, New Target Recognition and New Indication Recognition, respectively. (The notations 1*-1* and m-n indicate one-or-many and many-to-many relationships, respectively) focused on the development of novel, efficient and reliable Therefore, considering gene expression alone may result computational models to improve the final predictions in poor performance. using heterogeneous data integration [1, 2, 5, 8, 9]. Yamanishi et al. [5] have demonstrated the impor- In early research, symptom similarities have been tance of spanning chemical, genomic and pharmaco- employed to analyze disease similarities and in turn to logical space features in discovering new drug-target identify new uses for existing drugs [22]. However, it was interactions using supervised bipartite graph inference. realized that symptom-based similarities alone are inad- They found that pharmacological effect similarities more equate to predict new therapeutic uses for existing drugs. strongly correlate with new predictions than chemical Consequently, mRNA expression and protein-protein similarities. Moreover, they proposed a two-step strategy interaction networks have been used in investigating to combine chemical, genomic and pharmacological prop- disease similarities [6]. Campillos et al. [18] demonstrated erties using supervised bipartite graph learning and hence the significance of using side effect similarity for drug obtained reliable drug-target associations. repositioning. Even though side effect similarities can be In-silico drug repositioning has become very popular used to link the interactions between drugs and targets, during the last decade as it contributes to accelerating there are certain limitations as well. Some side effects drug development and drug discovery. Moreover, recent arise due to hormonal changes of the body. Also, side research has identified heterogeneous data integration effects may require a long time to observe and construct as important for obtaining reliable predictions. However, a strong drug-side effect profile. Hence, it cannot be introducing heterogeneous data types increases the com- directly applied to the newly arrived drugs without an plexity of data representation and the number of features. explicit drug profile. Since many side effects are com- Therefore, network partitioning or clustering methods mon among various drugs, data redundancy is another can be used to simplify large and complex pharmacology problem in the side effect domain. data and predictions can be efficiently made on identi- Campillos et al. [18]andDudleyetal.[1]havealso fied subgroups [8–10, 19, 23]. Consensus clustering is a investigated the impact of chemical similarities for drug method used for ensemble clustering [24]. It has been repositioning. They found that using chemical struc- introduced to overcome the limitations of basic cluster- tural similarities alone is insufficient as drugs undergo ing algorithms. It can also be considered as a method metabolic transformations and pharmacokinetic transfor- to integrate multiple sources. However, the existing con- mations. Therefore, studying the mechanism of action of a sensus clustering algorithms require the number of clus- drug is encouraged. Using connectivity maps to construct ters to be defined in advance. In this study, we propose the molecular activity profiles based on gene expression a two-tiered clustering approach for drug repositioning has been considered as a better approach as it simplifies inspired by consensus clustering. Here, we selected clus- drug comparisons. However, a molecular activity simi- tering algorithms which could be employed without any larity based approach may not be very accurate as many prior knowledge about drug clusters. disease conditions involve in more than one molecu- Pharmacology networks are large and heterogeneous; lar activity. Moreover, gene expression profiles may be drugs can be considered as the main hubs in these net- generated under different conditions such as different works. The main objective of this study is to construct a doses, time durations, different disease stages and ages. consistent computational model for drug repositioning Hameed et al. BMC Bioinformatics (2018) 19:129 Page 4 of 18

through heterogeneous data integration. Drug-chemical, repositioning. It applies to infer a single subnetwork at a drug-gene, drug-protein, drug-disease and drug-side time, where ATC-C class is used to reposition drugs for effect relationships are useful to represent different Cardiovascular diseases. aspects of drugs such as chemical, biological and phe- This paper fills the gap with a purely unsupervised nome characteristics, respectively. We therefore cluster learning approach by heterogeneous data integration drugs based on their heterogeneous associations. Specif- where ATC classification is employed for large-scaled ically, we apply clustering of drugs to simplify the large drug repositioning of drugs with and without assigned drug-centric pharmacology networks. In this study, we ATC class. This study also presents a confidence measure propose a two-tiered clustering approach, an unsuper- which is used to determine the significance of the inferred vised learning approach for drug repositioning via ATC repositioning candidates. Moreover, the significance of classification. This proposed approach enables clustering findings arising from this study is twofold; (i) correctly drugs based on heterogeneous data integration which is profile and suggest therapeutic indication for drugs that used as the drug similarity model for drug repositioning. do not possess the ATC code; (ii) flag potential of some Hence, the final clustering is an overall solution that drugs to be used for other therapeutic purposes. Fur- groups similar drugs using a variety of drug character- thermore, we provide clinical evidence for four predicted istics. The identified drug clusters are compared against results (Chlorthalidone, Indomethacin, Metformin and already published ATC classification to infer useful repo- Thioridazine) to support that our proposed approach can sitioning candidates. The identified drug clusters can be be reliably used to infer ATC code and drug repositioning. used as a source to understand drug-drug similarities as well as drug-group similarities. Methods As illustrated in Fig. 1, new target recognition and As explained in “Background” section, drug repositioning new indication recognition are two typical ways of candidates can be identified by analyzing drug-drug sim- approaching drug repositioning. Even though the use of ilarities. This study proposes an unsupervised two-tiered ATC classification is popular in the input space to deter- clustering model to identifying drug similarities based on mine anatomical/therapeutic/chemical features of drugs heterogeneous drug characteristics. Figure 2 illustrates [25–27], little research directly focuses on drug reposi- the main steps of the proposed approach. A two-tiered tioning by ATC classification [2, 16, 28]. Recent research clustering approach is proposed to build the drug simi- [2, 28] limited their studies only for the drugs that already larity model for drug repositioning. In Drug Clustering possess an ATC code. Recently, Sun et al. [16]proposeda Tier 1, clustering is performed based on drugs’ chemical, semi-supervised learning approach based on a physarum- therapeutic, gene, protein and side effect associations sep- inspired prize-collecting steiner tree approach, for drug arately to illustrate how close two drugs are, along each

Fig. 2 The proposed approach Hameed et al. BMC Bioinformatics (2018) 19:129 Page 5 of 18

dimension. Drug clustering Tier 2 is a heterogeneous data from the Unified Medical Language System (UMLS) integration phase, in which the results of Drug Cluster- [31]. These are the treatment relationships between ing Tier 1 are combined to produce an overall similarity drugs and diseases from the National Drug that considers all aspects of the drug similarity. Drug repo- File-Reference Terminology. sitioning is carried out employing ATC classification for • drug-protein features [775]: The target protein the drug clusters identified at Drug Clustering Tier 2. information of drugs has been obtained from The therapeutic classification of the ATC classification is Drugbank [32] and they have been mapped using used to label each cluster from which we identify plausible UniProt Knowledgebase [33]. repositioning candidates. • drug-side effect features [1385]: The drug-side The particular drug profile leading to identifying simi- effect information has been extracted from the lar therapeutic uses may vary from drug to drug; choosing SIDER database [34] which uses UMLS library to an appropriate representation for drug repositioning is map the side effect keywords. challenging. Therefore, making a similarity decision based • drug-gene features [1504]: We constructed a drug- on heterogeneous drug profiles such as chemical, disease, gene binary profile for the 1504 KEGG gene data used genes, proteins and side effect is worthwhile. Moreover, in Wu et al. [10] to represent drug-gene relationships. some dimensions can be incomplete. If the data in one drug profile is inaccurate or incomplete, it may be com- These five sources have 417 drugs in common. The pensated by better data in other drug profiles. Therefore, drug profiles of the selected drugs are available at https:// making the final conclusions based on consolidated het- github.com/fathimanush786/two_tiered_clustrering_ erogeneous data enables less errors. ATC classification data. is used as the gold standard reference classification. We expect that drugs that are in the same ATC class should be ATC classification clustered together and hence we can use this to validate As defined by World Health Organization, the Anatomical our clusters. Therapeutic Chemical (ATC) classification [15]captures In “Data” section, the drug data and their ATC clas- the pharmacodynamic properties of drugs. This resource sification codes used in this study are explained. In uses active ingredients of drugs as well as their anatomical, “The proposed approach” section, we explain the selected therapeutic and chemical properties when constructing clustering algorithms, the proposed two-tiered clustering the classification system. ATC is a five level classifica- approach, the evaluation process for the identified drug tion system. The first level classification is based on the clusters and the computation of confidence measure. anatomical group; it contains 14 groups. The second level classification is based on pharmacological/therapeutic Data subgroups. The third and fourth levels denote chemi- Drug profiles cal/pharmacological/therapeutic subgroups and the fifth We use five different homogeneous drug profiles where level refers to the chemical substance. Some drugs have four of them are obtained from DyDruma [29]database: been categorized into multiple classes. These classifica- drug-chemical, drug-therapeutic, drug-protein and drug- tions may also be updated based on new research findings. side effect profiles. We obtained the KEGG gene data used We obtained ATC classes for 405 drugs out of the 417 in Wu et al. [10] to represent drug-gene relationships. selected drugs and 12 drugs had not yet been assigned This allows us to link drug associations in the genomic into ATC classification. We focus on classifying only up space, adding a fifth homogeneous drug dimension. These to the second (therapeutic) level as our broader goal is to drug profiles are represented as binary associations where infer new therapeutic uses for existing drugs. We observe values 1 and 0 represent the presence and absence of a 66 unique classes at ATC second level classification for particular feature, respectively. these 405 drugs. These 66 classes are used as the reference clustering to evaluate the performance of the drug clus- • drug-chemical features [881]: Each drug is ters identified by our method. The ATC classification of associated to relevant chemical fingerprints, based on the selected 417 drugs are available at https://github.com/ the 881 fingerprints (2D chemical structures) defined fathimanush786/two_tiered_clustrering_data. by PubChem [30]. We assume one feature for each fingerprint. If a drug contains a given structural The proposed approach fingerprint, the corresponding feature will have a Our two-tiered unsupervised clustering model is pro- value of 1. posed as a similarity model to identify drugs with closer • drug-therapeutic features [719]: The therapeutic relationships. Unsupervised clustering is an approach uses of the drugs have been obtained by extracting to grouping similar objects together without any prior treatment relationships between drugs and diseases knowledge of their class labels. Objects that are in a Hameed et al. BMC Bioinformatics (2018) 19:129 Page 6 of 18

given cluster should demonstrate higher similarity to each one vertex and it adds or removes vertices in greedy other and relatively higher dissimilarity with the objects approach to achieve better cluster separations with high in other clusters. In general, clustering is popular as a cohesiveness. powerful technique which can identify useful patterns in an unsupervised learning environment. There are numer- MCL Markov Clustering (MCL) [37, 38] algorithm is ous clustering algorithms that have been proposed. But, another graph clustering algorithm which is also widely there is no acknowledged single preferred algorithm. Each used as a protein module detection algorithm for large algorithm has its own pros and cons. However, scalabil- protein networks. It has been used in a recent drug repo- ity, robustness, handling high dimensional features, speed, sitioning application as well [23]. It is popular for its intrinsic nature, adaptability and preserving topological scalability, fast, intrinsic, adaptable and emergent nature. order like properties are some interesting characteristics It uses a stochastic flow simulation based concept to par- which we have considered in this context. tition graphs/networks. It’s parameter ‘inflation’ can be In the context of drug data, we can apply clustering used to control the number of clusters where smaller algorithms by adopting a representation of each drug inflation produces lower granularity with large clusters. that allows drug similarity to be computed. We propose a two-tiered clustering approach to cluster drugs into MCODE The Molecular Complex Detection (MCODE) smaller groups based on heterogeneous data integration. [40] algorithm includes three stages: vertex weighting, We employ four clustering algorithms for partitioning complex prediction and optionally post-processing to fil- the pharmacology network. We employ Growing Self ter or add inputs in the resulting complexes by certain Organizing Map (GSOM) [35, 36] which is a vector- connectivity criteria (haircut and fluffing). MCODE uses based clustering algorithm and three state-of-the-art a method based on clustering coefficient when assigning graph clustering algorithms: Markov Clustering (MCL) weights for vertices. The vertex weight threshold param- algorithm [37, 38], Clustering with Overlapping Neigh- eter can be used to define the density of the resulting borhood Expansion (ClusterONE) [39]andMolecular complex. A threshold that is closer to the weight of the Complex Detection (MCODE) [40]. In general, these seed vertex identifies a smaller, denser network region selected clustering algorithms can be applied without any around the seed vertex. prior knowledge about the number of classes, which is more useful in this context. We compare the performance Drug Clustering Tier 1 of clusters identified by each algorithm to the classes of According to the fundamental graph theory concepts, the ATC classification. We demonstrate the performance any drug-feature/drug-drug associations can be repre- evaluation of drug clustering using internal and external sented in two ways; (i) graph representation and (ii) vec- evaluation measures. The identified drug clusters are tor/matrix representation. Therefore, we can obtain an used for drug repositioning via ATC classification. adjacencymatrixtorepresentthedrug-featureassocia- tionsasshowninFig.3. An adjacency matrix demon- Selected clustering algorithms strates which vertices/nodes of a graph/network are GSOM Growing Self Organizing Map (GSOM) [35, 36] adjacent to which other vertices/nodes. In this manner, is an extended version of Self-organization map (SOM) we have adjacency matrices (data matrices) of 417×881, [41] which is a popular vector-based clustering algorithm, 417×719, 417×1504, 417×775 and 417×1385 for each capable of handling large-scale and high dimensional fea- drug-chemical, drug-disease, drug-genes, drug-protein tures. It is popular for its growing nature while preserving and drug-side effect associations, respectively. Then, we the topological order. It also demonstrates an emergent cluster drugs with respect to these independent homoge- nature where it starts with one node and it assigns data neous features using GSOM algorithm. points considering the shortest Euclidean distance. Spread factor is the parameter which controls the granularity of Drug Clustering Tier 2 the cluster map. Smaller spread factor results in a fewer The clustering solutions obtained from Drug Clustering number of nodes in the GSOM map while larger spread Tier 1 are used to derive drug-drug relation (DDR) matri- factor enables a high growth of the GSOM map. ces. Hence, we produce one DDR matrix per dimension considering their Tier 1 cluster assignments. We then ClusterONE Clustering with Overlapping Neighbor- cluster drugs based on combining these individual DDR hood Expansion (ClusterONE) [39] is a graph partitioning matrices in order to capture overall drug similarities of algorithm initially proposed for identifying overlapping aggregated features used in Tier 1. Figure 4 illustrates protein modules in protein-protein interaction network the mechanism for deriving the DDR matrix using drug and also used in a drug repositioning application [10]. clusters (from Drug Clustering Tier 1). We construct five It uses a seeded growing concept where it starts with DDR matrices for chemical, disease, gene, protein and side Hameed et al. BMC Bioinformatics (2018) 19:129 Page 7 of 18

Fig. 3 Drug-feature associations could capture in a bipartite graph as shown on (a) and its corresponding adjacency matrix is shown on (b). D(1,2,3) denotes the drugs while F(1,2,3,4) denotes the features such as chemical, disease, protein and side effect effects separately, based on the individual Tier 1 clustering therapeutic features by concatenating features from differ- foreachtypeoffeature.WethenintegratetheDDRmatri- ent domains where Hy={C1, C2, C3, ...Ck, T1, T2, T3, ..., Tl} ces of Tier 1 clustering into a single relation matrix by be the heterogeneous data integrated binary vector of averaging the individual DDR matrices. The averaged rela- drug Di,fori ∈ 1, 2, 3, ..., n. Similarly, we can extend this tion matrix is used to cluster drugs. By performing this to integrate drug profiles of multiple domains. second round of clustering, we aim to improve the reliabil- ity of the drug clustering. We employ ClusterONE, MCL, Averaging summarized pairwise similarities MCODEaswellasGSOMinDrugClusteringTier2. Another way of integrating heterogeneous features is to average the similarity measure for each member of a Alternative approaches drug pair according to each individual type of feature, Concatenating all features into a single vector to obtain a single summary similarity score [2]. Jaccard A straightforward approach to integrating heteroge- coefficient is widely used to obtain the similarity measure neous features is to concatenate all individual features between two drugs. Let SimC(Di, Dj) and SimT (Di, Dj) into a single vector [16, 42]. Let D beasetofdrugs be the chemical and therapeutic similarity measures of {D1, D2, D3, ..., Dn}whereC={C1, C2, C3, ...Ck}bethe apairofdrugsDi and Dj, respectively. Then, we can binary vector of chemical features of drug Di and construct a heterogeneous data representation (Hz)by + ... SimC SimT T={T1, T2, T3, , Tl} be the binary vector of thera- averaging SimC and SimT where Hz = 2 which peutic features of drug Di.Then,wecanconstructa would lead to provide a nxn square DDR matrix (where n heterogeneous data representation (Hy) of chemical and isthenumberofdrugs).Wecanextendthistointegrate

Fig. 4 a illustrates drug clusters while (b) illustrates its corresponding drug-drug associations. D(1,2,3) and C(1,2) denote the drugs and the clusters, respectively Hameed et al. BMC Bioinformatics (2018) 19:129 Page 8 of 18

drug profiles in terms of more than two dimensions of shown below: similarity. (MI(U, V)) NMIsqrt(U, V) = √ (2) H(U)H(V) Evaluation Internal evaluation MI(U, V) − E [MI(U, V)] The objective of internal validation is to examine the SMI(U, V) = √ (3) ( ( )) compactness/cohesion and the separation of the clusters var MI U, V [43]. There are various internal validation measures where MI is the mutual information, H is the associ- and they are variations of these two. But, there is no ated entropy value, E is the expected value and var is the acknowledged measurement of choice. Silhouette anal- variance. ysis is used as an internal evaluation technique to assess the consistency within a cluster/class because it takes Assigning confidence measure both compactness/cohesion and separation into account. Since a drug can belong to more than one ATC class, Moreover, Silhouette can be interpreted using visual aids identifying drug clusters with 100% pure ATC class is chal- for in-depth analysis. lenging. Therefore, we identify the majority class for each Silhouette analysis is used as an internal evaluation drugclusterandassignaconfidencemeasureforeach technique to assess the consistency within a cluster/class identified majority class. Then, we predict the identified [44, 45]. It measures the similarity of an object to its own majority class as a reclassification for the drug/s belongs cluster/class compared to the other clusters/classes. If the to minority class/s with the confidence measure as defined object has a greater similarity to its own cluster/class than by the following equation: to its other clusters/classes, the Silhouette value would be number of drugs belong to the major ATC class of cluster i confidence = +1 and if the object has greater dissimilarity to its own i total number of drugs of cluster i cluster/class than to the other clusters/classes, the Silhou- (4) ette value would be -1. The following equation defines the Silhouette measure for an object i: where i is the cluster number/id. Hence, we can employ the confidence measure to filter the most useful reposi- ( ) − ( ) tioning candidates. ( ) = b i a i Silhouette i { ( ) ( )} (1) max a i , b i Drug repositioning via ATC therapeutic classes As explained in “ATC classification” section, ATC classifi- where a(i) and b(i) are the dissimilarity of the object i to cation consists of five levels where the second level deter- its own cluster/class and the dissimilarity of the object i to mines drug’s therapeutic uses/properties. In this study, we the other clusters/classes. approach drug repositioning by identifying plausible new ATC therapeutic (second level) classes for existing drugs. External evaluation Identifying the drug’s second level classification implies We employed ATC classification to compare the per- its therapeutic uses. We believe reclassification of drugs formance of our two-tiered clustering approach as well into ATC therapeutic (second level) class would enable as the performance the clustering algorithms used in inferring repositioning candidates. this study. We selected adjusted measures: Normalized The use of unsupervised clustering methods enables Mutual Information (NMI) [24] and Standardized Mutual grouping of drugs without any prior knowledge of ATC Information (SMI) [46] to evaluate the identified clusters classes. We expect that drugs in the same cluster will with reference to ATC classification. These are informa- demonstrate similar characteristics while being relatively tion theoretic measures derived based on mutual informa- dissimilar to drugs in other clusters. Therefore, new drug- tion. NMI provides a normalized measure using mutual drug similarities can be identified by analyzing the drug information where it ranges between 0 and 1. SMI pro- clusters. The identified new drug-drug similarities lead vides a statistical adjustment for the mutual information to propose classification of drugs into new ATC thera- which is beneficial in adjusting selection bias and to peutic (second level) classes. These proposals are inferred increase the interpretability. SMI further reduces the bias based on the majority ATC class associated with each in clustering comparisons towards selecting clusterings cluster. Classes with higher confidence (see “Assigning with more clusters and where clustering involves fewer confidence measure” section) can be prioritized for reclas- data points. The upper bound of SMI varies based on the sification. Since we compare the drug clustering solutions used reference clustering, however, higher SMI value indi- with reference to ATC therapeutic (second level) classes, cates better clustering. The equations for NMI [24]and this reclassification step enables inference of repositioning SMI [46] to compare clustering solutions U and V are candidates via ATC therapeutic classes. Hameed et al. BMC Bioinformatics (2018) 19:129 Page 9 of 18

Results show negative Silhouette values, illustrating higher vari- Drug Clustering Tier 1 ations within ATC classes. The mean Silhouette value of First, we clustered drugs based on their individual, homo- ATC classification based on chemical, disease, gene, pro- geneous properties; chemical, disease, gene, protein and tein and side effects are − 0.31, − 0.06, − 0.49, − 0.25 side effects. We employed GSOM to cluster drugs in Drug and − 0.33, respectively. However, disease profiles provide Clustering Tier 1 because it is a vector based clustering relatively greater consistency with the ATC classification algorithm. In this study, we used the GSOM implementa- compared to other drug profiles. tion of Chan et al. [47] because of its convenient visual aids Moreover, the Silhouette analysis on GSOM identified for cluster analysis. As mentioned in “GSOM”section,we drug clusters demonstrates relatively higher Silhouette tuned the parameter, spread factor (SF), to obtain GSOM values than ATC classification where the mean Silhouette maps of different sizes. As a result, we obtained GSOM value for chemical, disease, gene, protein and side effect maps of 68 (SF = 0.0001), 69 (SF = 0.25), 66 (SF = 0.8), using GSOM algorithm are 0.13, 0.09, 0.22, 0.15 and -0.07, 63 (SF = 0.2) and 63 (SF = 0.001) nodes for chemical, dis- respectively which are relatively higher than ATC classifi- ease, gene, protein and side effects profiles, respectively. cation (see Additional file 1: Figure S2 for the Silhouette Out of 417 drugs, 405 drugs have already classified into at analysis). least one ATC class. Moreover, we noticed 66 unique ATC Furthermore, we examined the closeness of the clus- classes (2nd level ATC classification) relating these 405 tering solutions between different drug properties used drugs. We evaluated drug clustering solutions for these in this study. In Tables 2 and 3, we show the clustering 405 drugs with reference to the ATC classification. comparison between different drug profiles using NMI Table 1 shows NMI and SMI values for Drug Clustering and SMI, respectively. In these tables, we compare the Tier 1. Accordingly, the NMI varies between 0.46 and 0.68 drug clusters generated by one type of drug profile with and SMI varies between 2.91 and 39.33. As of ATC classifi- the drug clusters generated by another type of drug pro- cation, anatomical and therapeutic features are considered file. For instance, drug clusters generated using chemical in its first two classification levels. Hence, drug clustering properties are compared against drug clusters generated using disease and protein profiles demonstrate relatively by disease, gene, protein and side effect profiles. Accord- higher NMI and SMI. The NMI and SMI of chemical and ing to Table 2, NMI of 0.55, 0.48, 0.59 and 0.56 have side effect profiles are relatively lower than disease and been observed between drug clusters generated by chem- protein profiles as they are considered in the third, fourth ical profile and drug clusters generated by disease, gene, and fifth levels of ATC classification. On the other hand, protein and side effect, respectively. Similarly, accord- clustering solution on gene profiles shows the least close- ing to Table 3, SMI of 12.71, 0.50, 20.85 and 9.98 have ness to ATC classification as this type of information is been observed between drug clusters generated by chem- not considered in ATC classification system. Unlike NMI ical profile and drug clusters generated by disease, gene, where the upper bound is always 1.0, the upper bound protein and side effect, respectively. for SMI depends on the choice of reference clustering; According to NMI, drug clusters of chemical profiles, the upper bound for ATC reference clustering is 98.18. disease profiles, protein profiles and side effect profiles Notably, the ranking order of these clustering solutions is show relatively closer similarities where they vary between consistent for both NMI and SMI. 0.55 and 0.59. On the other hand, the highest SMI is Approximately 16% of the drugs (out of 405 drugs) noticed between clusters of disease and protein profiles. are assigned to multiple classes. Therefore, we randomly Notably, drug clusters of gene profiles are relatively far selected one ATC class for those drugs having multi- away from other drug clustering solutions. This devia- ple classes when constructing the reference class list. tion might have caused due the highly sparse nature of Additional file 1: Figure S1 corresponds to the Silhouette the gene profiles. Moreover, the clusters identified by gene analysis for chemical, disease, gene, protein and side effect profiles lie relatively very far away from ATC classification profiles, respectively. It is clear that most of the drugs than the other clusters. Therefore, we selected chemical,

Table 1 Performance assessment of Drug clustering Tier 1 Table 2 Drug clustering comparison between drug profiles Drug profiles NMI SMI based on Normalized Mutual Information (NMI) Chemical 0.59 20.09 Disease Gene Protein Side effect Disease 0.68 39.33 Chemical 0.55 0.48 0.59 0.56 Gene 0.46 2.91 Disease 0.45 0.59 0.55 Protein 0.63 30.38 Gene 0.45 0.47 Side Effect 0.58 21.07 Side effect 0.56 Hameed et al. BMC Bioinformatics (2018) 19:129 Page 10 of 18

Table 3 Drug clustering comparison between drug profiles (0, 0.9] increased the number of clusters. We identified based on Standardized Mutual Information (SMI) 66 clusters using MCL when inflation parameter is set to Disease Gene Protein Side effect 0.048. The number of clusters increases when the infla- Chemical 12.71 0.50 20.85 9.98 tion parameter is increased. We obtained two clustering solutions; CL1 and CL1 employing ClusterONE. CL1 Disease 1.08 22.06 14.89 I II I is obtained when the density parameter is set to 0.6 and Gene 2.58 0.89 ‘nodes’ is used as the seed method while CL1II is obtained Side effect 16.43 when the density parameter is set to 0.8 and ‘unused- nodes’ is used as the seed method. CL1I resulted in 61 clusters including all 417 drugs while CL1II resulted in 58 disease, protein and side effect profiles for further anal- clusters including only 405 drugs. In ClusterONE, choos- ysis to identify drug repositioning candidates using ATC ing ‘nodes’ as the seed method enables every node to be classification. used as a seed and subgroups smaller than a given density We identified a set of 26 pairs of drugs (see Additional are thrown away. file 2) which occur together in each drug cluster, generated Table 4 summarizes NMI and SMI values for Drug Clus- based on individual chemical, disease, protein and side tering Tier 2 using GSOM, MCL, CL1I and MCODE. The effect profiles. 25 out of these 26 drug pairs are assigned GSOM results are relatively higher, measuring NMI and to the same ATC class (second level), indicating mean- SMI with reference to the ATC classification. The NMI ingfulness of the identified drug clusters. and SMI values of Drug Clustering Tier 2 are0.66and and Thioridazine are also identified in the same cluster in 36.11 while they are 0.68 and 39.33 for disease profiles in all four clustering solutions. However, Thioridazine does Drug Clustering Tier 1. However, NMI and SMI values of notbelongtoanyoftheATCclasseswhileFluphenazine Drug Clustering Tier 2 are relatively higher than other four belongs to ATC class N05 (-psycholeptics). Therefore, we drug profiles. Since we employed ATC therapeutic class as believe Thioridazine may share similar drug profile as of the reference cluster, the results in Drug Clustering Tier 1 Fluphenazine and we propose to classify Thioridazine into are more favorable towards disease profiles. N05 (-psycholeptics). We predicted new ATC therapeutic classes based on the identified majority ATC classes in the corresponding clus- Drug Clustering Tier 2 ters which led to reclassification of the existing drugs. In As explained above, we employed the four drug clus- order to filter the most reliable repositioning candidates, terings generated based on chemical, disease, protein we assigned a confidence measure for each prediction (see and side effect profiles in Drug Clustering Tier 2. We “Assigning confidence measure” section). We therefore constructed four DDR matrices based on these four filter the repositioning candidates with high confidence as identified drug clustering solutions (as explained in “Drug reliable drug repositioning candidates. The highest con- Clustering Tier 1”sectionin“Methods”section).We fidence measures of the identified major classes are 0.85, propose merging of these DDR matrices into a single 0.83, 0.75 and 0.5 for MCL, ClusterONE, MCODE and matrix as a way of heterogeneous data integration. The GSOM, respectively. merged DDR matrix can be constructed by giving equal importance to each of the drug clusterings or by ranking Comparing the proposed approach against existing the drug clusterings based on different evaluation mea- methods sures such as NMI and SMI. However, there is no single We compared the performance of the proposed two- type of homogeneous drug characteristics identified to tiered clustering approach against two recently used provide an efficient and effective drug classification or heterogeneous data integration methods for drug repo- drug repositioning [1]. Giving equal importance to each sitioning (see “Alternative approaches” section). Table 5 of the drug clusterings, we constructed a heterogeneous shows the performance assessments of these three DDR matrix by averaging the four DDR matrices. We used the averaged DDR matrix to identify drug clusters, employing the graph clustering algorithms: Clus- Table 4 Performance assessment of Drug Clustering Tier 2 using terONE, MCODE and MCL as well as the GSOM algo- four different clustering algorithms rithm. In this study, we used ClusterONE, MCL and Algorithm NMI SMI MCODE implementations available in MATLAB Systems GSOM 0.66 36.11 Biology and Evolution Toolbox (SBEToolbox) [48]. We MCL 0.59 26.49 obtained a GSOM map of 63 nodes when SF is 0.2. We ClusterONE (CL1I)0.5621.37 identified 64 clusters using MCODE when the thresh- MCODE 0.52 11.57 old parameter is set to 0.9. Increasing the threshold from Hameed et al. BMC Bioinformatics (2018) 19:129 Page 11 of 18

Table 5 Comparison of the proposed approach against two candidates having a minimum confidence measure existing methods for heterogeneous data integration of 0.5. Out of these, 4 drugs (Chlorthalidone, Thior- Method NMI SMI idazine, Orphenadrine and Indomethacin) have not The proposed two-tiered clustering 0.66 36.11 been assigned to ATC classification yet. We infer these unclassified Chlorthalidone, Thioridazine, Orphenadrine Concatenating all heterogeneous 0.60 22.26 features into a single vector (Hy) and Indomethacin for ATC classes C03- (con- Averaging summarized heteroge- 0.64 33.59 fidence: 0.83), N05-psycholeptics (confidence: 0.80), neous (pairwise) similarities (Hz) R06 - (confidence: 0.64) and M01- antiinflammatory and antirheumatic (confidence: 0.57), respectively. Interestingly, in Drug Clustering Tier 1, different methods for heterogeneous data integration Thioridazine is inferred to have a similar drug profile as using GSOM algorithm only. In Drug Clustering Tier 2, of Fluphenazine which also belongs to ATC class N05- GSOM demonstrates NMI and SMI of 0.66 and 36.11, psycholeptics. Moreover, in the predicted repositioning respectively. The all concatenated heterogeneous fea- list, Amlodipine is inferred to be repositioned for diseases ture representation method (Hy) demonstrates NMI related to - system (C09) with the high- and SMI of 0.60 and 22.26, respectively while averaging est confidence measure of 0.85. Even though Amlodipine summarized heterogeneous (pairwise) similarities (Hz) is not directly classified into C09, fixed combinations of demonstrates NMI and SMI of 0.64 and 33.59, respec- ,, hydrochlorothiazide, ACE inhibitors, tively. There is a significant improvement in the proposed etc. are already classified in C09 [15]. approach compared to the alternative method Hy.Even Different algorithms may produce different clustering though there is no significant improvement in the pro- solutions. However, different algorithms may have simi- posed approach compared to the alternative method Hz, larities too. We identified 79 reclassification predictions Hz fails to produce useful clusters when graph clustering which are generated consistently by at least two clus- algorithms are used. tering algorithms or in at least two different clusters It should be noted that these three heterogeneous data (in ClusterONE). Table 7 summarizes 11 reclassification integration methods did not outperform drug clusters candidates identified consistently by at least two cluster- identified by disease characteristics where NMI and SMI ing algorithms with relatively high confidence measures are 0.68 and 39.33, respectively. Since we employed ATC (see Additional file 3 for the complete list). ClusterONE therapeutic class as the reference cluster, the results in algorithm produces overlapping clusters. Therefore, some Drug Clustering Tier 1 are more favorable towards disease drugs are assigned to more than one cluster. Table 7 profiles. Our proposed approach and alternative method illustrates three drug reclassification candidates (Cypro- Hz outperformed other three clusterings identified by heptadine, and Dolasetron) that are identified chemical, protein and side effects profiles in Drug Cluster- by more than one cluster in ClusterONE results. ing Tier 1 whereas alternative method Hy outperformed In this study, ATC classification is considered as the gold clusterings identified by chemical and side effects profiles standard classification, therefore, we obtained clustering in Drug Clustering Tier 1. performance with reference to the ATC classification. We The alternative method Hz, explained in this study pro- used only up to its second level classification as it captures duces a complete graph while the proposed two-tiered the therapeutic uses. The drugs used in this study include clustering approach involves the removal of noisy edges, 12 drugs that are not yet assigned into ATC classifica- resulting in a sparse graph for efficient graph cluster- tion. However, our method enables inferring suitable ATC ing. The graph clustering algorithms used in this study classification for them (see Additional file 4 for the com- are not able to identify useful clusters on the given plete list of predictions). Moreover, the inferred new ATC complete graph where they resulted in producing only codes of other drugs can be used for drug repositioning. one module at all time with all three graph cluster- “Clinical significance of our findings” section summa- ing algorithms. Therefore, the proposed two-tiered drug rizes some clinical evidence to support these findings. clustering approach as a heterogeneous data integration We therefore suggest that cluster-based classification and approach demonstrates better performance and can be reclassification into the ATC classification system is a considered as a reliable method for both vector-based and viable method for drug repositioning. graph clustering. Clustering enables partitioning the large pharmacology network into smaller subgroups and hence simplifies the Drug Repositioning via ATC therapeutic class drug repositioning process. Since drugs can be considered We analyzed the drug clusters identified by MCL, as the main component of the pharmacological networks, MCODE, CL1I and GSOM to infer useful drug reposi- drug clustering provides an indirect way of clustering tioning candidates. In Table 6, we show 39 repositioning the networks, where associations to related entities (e.g., Hameed et al. BMC Bioinformatics (2018) 19:129 Page 12 of 18

Table 6 The inferred repositioning candidates with higher confidence Drug name Cluster ID Old ATC name New ATC name Confidence Algorithm Amlodipine 403 C08 C09 0.85 MCL Chlorthalidone 2 C03 0.83 CL1 Amantadine 51 N04 N05 0.80 CL1 Thioridazine 51 N05 0.80 CL1 30 N05 C09 0.75 MCODE 46 R06 N06 0.70 CL1 Amlodipine 11 C08 C09 0.70 CL1 Carvedilol 11 C07 C09 0.70 CL1 Cetirizine 11 R06 C09 0.70 CL1 414 D05 D10 0.67 MCL 48 S01 L02 0.67 MCODE Orphenadrine 392 R06 0.64 MCL 56 C02, N02, S01 N05 0.62 CL1 Thioridazine 56 N05 0.62 CL1 Dofetilide 399 C01 L02 0.60 MCL Cyproheptadine 35 R06 N06 0.59 CL1 35 C02 N06 0.59 CL1 Dipivefrin 44 S01 N05 0.57 CL1 Indomethacin 7 M01 0.57 MCODE Nicardipine 57 C08 N06 0.57 CL1 Cyproheptadine 4 R06 N06 0.54 CL1 Methadone 4 N07 N06 0.54 CL1 4 L01 P01 0.50 MCODE 48 A03, S01 N04 0.50 CL1 Atropine 393 A03, S01 N04 0.50 MCL 79 L01 A10 0.50 GSOM Hexachlorophene 350 D08 D05 0.50 MCL Isocarboxazid 50 N06 N05 0.50 MCODE Levetiracetam 346 N03 L01 0.50 MCL 6 N05 N06 0.50 GSOM 4 L01 P01 0.50 MCODE Metformin 79 A10 L01 0.50 GSOM 26 C09 C07 0.50 CL1 Mycophenolic Acid 342 L04 N03 0.50 MCL Phenytoin 66 N03 C01 0.50 GSOM Tazarotene 350 D05 D08 0.50 MCL Tolterodine 59 G04 C01 0.50 GSOM 346 L01 N03 0.50 MCL Zonisamide 342 N03 L04 0.50 MCL Note: ATC code names are given in Additional file 5 chemical, target and phenomic) can be incorporated as a levels. It enables other participating entities to present in basis for clustering. Hence, the proposed two-tiered drug- more than one cluster. Then, new associations between centric drug clustering can be extended by employing all chemical, target and phenome can be predicted for each the other related heterogeneous data at each of the cluster of the clusters as well. Moreover, it enables investigation Hameed et al. BMC Bioinformatics (2018) 19:129 Page 13 of 18

Table 7 Repositioning candidates identified consistently by more than one clustering algorithm Drug name Cluster ID Old ATC name New ATC name Confidence Algorithm Amlodipine 403 C08 C09 0.85 MCL Amlodipine 11 C08 C09 0.70 CL1

Cyproheptadine 46 R06 N06 0.70 CL1 Cyproheptadine 35 R06 N06 0.59 CL1 Cyproheptadine 4 R06 N06 0.54 CL1 Cyproheptadine 56 R06 N06 0.25 MCODE

Brinzolamide 48 S01 L02 0.67 MCODE Brinzolamide 9 S01 L02 0.17 CL1

Atropine 48 A03, S01 N04 0.50 CL1 Atropine 393 A03, S01 N04 0.50 MCL Atropine 20 A03, S01 N04 0.46 GSOM

Metformin 79 A10 L01 0.50 GSOM Metformin 21 A10 L01 0.33 MCODE

Mycophenolic Acid 342 L04 N03 0.50 MCL Mycophenolic Acid 22 L04 N03 0.27 GSOM

Carbamazepine 46 N03 N05 0.43 GSOM Carbamazepine 42 N03 N05 0.23 CL1 Carbamazepine 20 N03 N05 0.20 MCODE

Carbamazepine 46 N03 N06 0.43 GSOM Carbamazepine 25 N03 N06 0.27 CL1 Carbamazepine 20 N03 N06 0.20 MCODE

Droperidol 28 N05 N01 0.42 GSOM Droperidol 359 N05 N03 0.40 MCL Droperidol 40 N05 N01 0.32 CL1 Droperidol 30 N05 N03 0.17 CL1

Fulvestrant 23 L02 A10 0.42 GSOM Fulvestrant 13 L02 A10 0.42 CL1

Dolasetron 2 A04 A02 0.40 MCODE Dolasetron 40 A04 A02 0.20 GSOM Dolasetron 59 A04 L01 0.20 CL1 Dolasetron 24 A04 L01 0.12 CL1 Note: ATC code names are given in Additional file 5 of multiple links connecting drugs and may prove useful section summarizes clinical evidence for four findings of for pathway analysis. this study: Chlorthalidone, Indomethacin, Metformin and Thioridazine. Clinical significance of our findings Our study interestingly inferred the ATC code, C03 and The significance of findings arising from this study is therapeutic use, diuretics, for a drug known as Chlorthali- twofold; (i) correctly profile and suggest therapeutic indi- done (see Table 6), which until now does not belong to cation for drugs that do not possess the ATC code; (ii) the ATC classification. Chlorthalidone is a potent ; flag potential of some drugs to be used for other ther- a drug that promotes water loss and is currently used in apeutic purposes. More interestingly, the inferred thera- the management of or high pressure peutic uses are significantly different to the one for which and fluid retention associated with [49]. In these drugs were initially developed and trialed. This fact, Chlorthalidone has better clinical outcome in terms Hameed et al. BMC Bioinformatics (2018) 19:129 Page 14 of 18

of lowering blood pressure than other more commonly not suitable for drug clustering because the number of prescribed diuretics [50, 51]. clusters and the cluster shapes need to be known and Furthermore, Indomethacin is another drug that does specified in advance [62]. In drug clustering, we cannot not have an ATC code yet. According to our find- expect to have a priori knowledge about the grouping ings, Indomethacin was indicated to be used as an anti- and the cluster shapes. Moreover, higher dimensional fea- inflammatory and anti-rheumatic agent (see Table 6). This ture space in pharmacology data could potentially hinder perfectly matches the clinical situations for which this the efficiency and effectiveness of the machine learning drug is used; Indomethacin is indicated for managing pain algorithms. associated with inflammation, rheumatoid arthritis as well GSOM is well-suited for Drug Clustering Tier 1 and 2 as osteoarthritis [52, 53]. because it is cable of handling higher dimensional fea- Another interesting finding arising from our work tures and the number of clusters is defined automatically. relates to Metformin (see Tables 6 and 7). Metformin is In GSOM, the parameter spread factor is used to control used to manage type 2 diabetes and its initial classifica- the size of the GSOM map or the number of clusters. This tion was an oral hypoglycaemic, drug that lowers blood spread factor does not depend on the dimensionality of sugar level [54]. In the past ten years, Metformin was also the data. Moreover, it preserves the topological order. found to be therapeutically effective in other diseases such MCL, MCODE and ClusterONE algorithms used in as polycystic ovarian syndrome and metabolic syndrome Drug Clustering Tier 2, are graph clustering algorithms [55, 56]. Emerging evidence is strongly suggesting that that are popular in the context of pharmacology data anal- Metformin can now be used as an adjuvant treatment ysis. They are also capable of handling high dimensional in bowel and prostate cancer due to its antineoplastic features and the number of clusters is defined automat- properties; can inhibit cancer growth [57, 58]. This is a ically. Unlike vector-based algorithms, these graph clus- significant deviation from its original therapeutic use and tering algorithms are not appropriate for Drug Clustering was correctly inferred in our study by the ATC code L01 Tier 1 because they result in clustering drugs as well as and therapeutic class antineoplastic agent. their corresponding features. Furthermore, it is important to mention that our Interestingly, we observed relatively close number of proposed drug repositioning method accurately flagged drug clusters in Drug Clustering Tier 1 and Drug Clus- Thioridazine, a drug that does not possess an existing tering Tier 2 after tuning cluster parameters. Table sum- ATC code, as being psycholeptic agent (see Table 6). marizes the parameters and their effect on generating Thioridazine is clinically effective in treating patients with the clusters. The number of clusters generated by GSOM schizophrenia since its discovery [59, 60], however, it was using chemical, disease, gene, protein and side effect pro- withdrawn from the market in 2005 due to its ability to files are 68, 69, 66, 63 and 63, respectively. In GSOM, cause toxicity to the heart [61]. parameter spread factor canbeusedtotunethenumberof clusters. In Drug Clustering Tier 2, GSOM, MCL, MCODE Discussion and ClusterONE, generated 63, 64, 66 and 61 clusters, Clustering respectively. In MCODE, increasing the threshold from (0, Clustering enables partitioning the large pharmacology 0.9] increased the number of clusters. However, further network into smaller subgroups and hence simplifies the incrementing the threshold parameter after 0.9 resulted in drug repositioning process. Since drugs can be considered a decrement of the number of clusters. Interestingly, we as the main component of the pharmacological networks, identified 64 clusters using MCODE when the threshold drug clustering provides an indirect way of clustering parameter is set to 0.9, strengthening our confidence that the networks, where associations to related entities (e.g., the number of clusters lies around 64. chemical, target and phenomic) can be incorporated as a Overlapping clustering algorithms may be more suit- basis for clustering. Hence, the proposed two-tiered drug- able for drug clustering as some drugs are used to treat centric drug clustering can be extended by employing all multiple diseases. Moreover, overlapping clusters may the other related heterogeneous data at each of the cluster enable identifying more repositioning candidates. Clus- levels. It enables other participating entities to present in terONE, MCL and MCODE algorithms used in this study more than one cluster. Then, new associations between can handle overlapping algorithms. But, in the current chemical, target and phenome can be predicted for each analysis, we observed overlapping clusters only from Clus- of the clusters as well. Moreover, it enables investigation terONE. It should be noted that some repositioning can- of multiple links connecting drugs and may prove useful didates identified by ClusterONE are identified by GSOM, for pathway analysis. MCL and MCODE as well. Therefore, we believe the Clustering algorithms such as k-means, SOM, GSOM repositioning candidates identified by non-overlapping and mixture models can be employed in Drug Cluster- clusters could still be prospective candidates for further ing Tier 1. But, K-means, SOM and mixture models are in-depth analysis. The repositioning candidates identified Hameed et al. BMC Bioinformatics (2018) 19:129 Page 15 of 18

by multiple clustering algorithms increase our confidence in other classes may share higher similarity though they that they might be interesting. have distinct uses which will also result in a higher num- Since the four clustering algorithms used in this study ber of false positives. Moreover, many ATC classes have are capable of handling higher dimensional feature repre- very high intra-cluster variations which will result in a sentations, we did not employ dimensionality reduction. higher number of false negatives. Therefore, we cannot Dimensionality reduction techniques may be useful to expect the identified drug clusters to be highly correlated remove noisy information. But, it is not appropriate for with ATC classification. MCL, MCODE and ClusterONE, graph clustering algo- Using Silhouette values to fine tune the parameters rithms as they use drug-drug/drug-feature relationships would be another approach that we could use when deter- to be the input. mining the number of clusters. But, it should be noted that GSOM typically uses Euclidean distance to compute the the mean Silhouette value of ATC classification based on pairwise distance between input vector and weight vec- chemical, disease, gene, protein and side effects are − 0.31, tor. The performance of GSOM may be further improved − 0.06, − 0.49, − 0.25 and − 0.33, respectively which illus- by employing Jaccard similarity or squared Euclidean dis- trates the higher variations within ATC classes. Hence, tance or taking the average distance based on multiple higher variations within drug clusters are expected. Hence fine metrics when binary data are used. tuning the parameters of the clustering algorithms com- parison to ATC classification may not be very accurate. Heterogeneous data integration According to Silhouette values of ATC classification, Drugs can be explained using various characteristics such obtaining a clustering close to the ATC classification is as chemical, target and phenomic, etc. The primary objec- challenging due to the large variation within ATC classes, tive of heterogeneous/multi-view data integration is to misclassifications and missing information in the ATC more deeply understand the predictive model and to classification. The mismatches between the clustering obtain a consensus solution [63]. solution and the ATC classification arise due to the iden- Multi-view data integration can be performed at tified new drug classes (drug-drug similarities) for the the input/intermediate/output phase [63]. The proposed existing drugs. As explained in “Drug repositioning via methods can be seen as a type of multi-view data integra- ATC therapeutic classes” section, the reclassification into tion. In the alternative methods Hy and Hz,dataintegra- ATC therapeutic classes can be interpreted as reposition- tion is performed at the input phase and the intermediate ing opportunities. Also, the clustering solutions enable phase, respectively, while our presented two-tiered clus- identifying more useful drug-drug relationships. tering approach performs data integration at the output External clustering evaluation is an important task phase. In our method, the outputs from various individual though it is challenging. Consequently, various exter- views are combined and the consensus clustering results nal clustering comparison measures have been proposed. are obtained at the second tier. Moreover, in multi-view Pair-counting based measures include RI and ARI while data integration, a kernel matrix is typically used as an MI, NMI and AMI are information theoretic based mea- input for kernel classification, regression and clustering sures useful to compare clustering solutions against a [63]. In our study, the clustering results of Drug Cluster- reference clustering. There is no clear evidence that one ing Tier 1 are used to construct the Drug-Drug Relation measure is superior to another. NMI, AMI and SMI are matrix which can be viewed as a type of kernel matrix. the adjusted measures for MI and have important benefits Hence, the proposed method is compatible with existing [46]. Moreover, SMI is proportional to AMI. We therefore kernel learning approaches. performed clustering evaluation using NMI and SMI. Methods such as kernelized Bayesian matrix factoriza- It is important that the drugs within a cluster are more tion, random walk methods can be effectively applied on similar to each other than the other drugs. Our primary bi-partite graphs as a mean of data integration. Multi- objective of this study is not to present a model to predict modal deep learning can also be applied to heterogeneous the ATC classification. Fine tuning parameters against an drug data integration where output of each view can external reference may not be a better option since our be integrated into higher layers [63]. Deep Boltzmann broader focus is to determine the repositioning candidates machine would be a suitable approach for drug data clus- where they deviate from the current ATC class. More- tering where binary data are considered. over, the false positives predicted by the clusters is not necessarily an undesired result and optimizing clusters for Cluster evaluation NMIandSMImeasuremightpreventusfromdetecting Sincedrugscanbelongtomorethanoneclass,theclasses interesting novel clusters or repositioning candidates. induced from the ATC classification can have distantly Drug Clustering Tier 2 achieved 11.9, 4.8, and 13.8% gain related drugs which will result in a higher number of false in NMI compared to chemical, protein, and side effect, positives in the compared clustering solution. Some drugs respectively of Drug Clustering Tier 1 whereas there is Hameed et al. BMC Bioinformatics (2018) 19:129 Page 16 of 18

a2.9%lossinNMIcomparedtodiseaseprofileofDrug Additional file 3: The repositioning candidates identified consistently by Clustering Tier 1. Since we employed ATC therapeutic at least two clustering algorithms. This includes the complete list of class as the reference cluster, the results in Drug Clustering consistent repositioning candidates, algorithm names and their confidence Tier 1 are more favorable towards disease profiles. measures. (PDF 104 kb) Additional file 4: The complete prediction list. This includes the The predicted clusters that do not provide a higher complete list of predicted new classifications into ATC therapeutic class Silhouette value is not necessarily an undesired result. and their confidence measures. (PDF 892 kb) Moreover, not all identified clusters may be useful for Additional file 5: The ATC code list. This includes the ATC codes (Second drug repositioning. As explained in “Assigning confidence Level) and the corresponding ATC therapeutic class names. (PDF 363 kb) measure” section, we defined a confidence measure so that we can identify the highly probable repositioning can- Abbreviations ATC: Anatomical therapeutic chemical; ClusterONE: Clustering with didates. The drug repositioning candidates that are com- overlapping neighborhood expansion; DDR: Drug-drug relation; GSOM: monly identified by multiple clustering algorithms also Growing self organizing map; MCODE: Molecular complex detection; NMI: have higher probability to be chosen as repositioning can- Normalized mutual information; SF: Spread factor; SMI: Standardized mutual information didates. As explained in “Drug Clustering Tier 2”section in “Methods” section, drug-drug relation matrix repre- Acknowledgments Not applicable. sents a kernel matrix or similarity matrix, illustrating the similarity means of cluster relationships of drugs. Hence, Funding PNH is fully supported by the PhD scholarships of The University of Melbourne the drug-drug relational matrix can be straightforwardly and partially supported by NICTA scholarship of National ICT Australia, now incorporated in kernel-based supervised and unsuper- Data61 since merging CSIRO’s Digital Productivity team. Article processing vised learning methods such as support vector machines, charge is funded by Australian Reseach Council Discovery Grant DP150103512. spectral clustering, multiple kernel learning, etc [63, 64]. Availability of data and materials The data supporting the results of this article are cited within the article.

Conclusions Authors’ contributions Computational drug repositioning provides new strate- PNH, KV and SH designed the experiment and evaluation methodology. PNH gies for drug development. It has been argued that using conceived the idea, proposed specific methods, collected ATC classification in 2016, implemented the methods, assessed the performance and drafted the heterogeneous features results in better drug reposition- manuscript. SK provided the clinical significance of the predicted ing predictions. In this study, we proposed an unsuper- repositioning candidates. KV and SH contributed to the writing and editing of vised learning approach to achieve drug repositioning the manuscript. All authors approved the final draft. by, first, performing drug-centric drug clustering and, Ethics approval and consent to participate second, associating inferred clusters to ATC therapeutic Not applicable. classes based on known drug classifications. Moreover, Consent for publication the proposed two-tiered clustering approach enables drug Not applicable. clustering through heterogeneous data integration. The drug clustering based on core drug features produces Competing interests The authors declare that they have no competing interests. clusters that align well with the existing ATC classifica- tion levels. The repositioning candidates identified con- Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in sistently by multiple clustering algorithms and with high published maps and institutional affiliations. confidence have a higher possibility for reliable drug repo- Author details sitioning. Furthermore, the identified drug clusters can be 1Department of Mechanical Engineering, University of Melbourne, Parkville, used as an intermediate source to explore drug similari- 3010 Melbourne, Australia. 2Data61, Victoria Research Lab, West Melbourne ties. The clinical significance of the predicted results also 3003, Australia. 3Department of Computer Science, University of Ruhuna, Matara 81000, Sri Lanka. 4Department of Computing and Information Systems, suggests that the proposed two-tiered clustering approach University of Melbourne, Parkville, 3010 Melbourne, Australia. 5Department of can be safely used to infer new ATC code as well as new Nursing, University of Melbourne, Parkville, Melbourne 3010, Australia. 6The therapeutic uses based on the given drug characteristics. Florey Institute of Neuroscience and Mental Health, University of Melbourne, Parkville, Melbourne 3010, Australia. 7Research School of Engineering, College of Engineering & Computer Science, The Australian National University, 2601 Additional files Canberra, ACT, Australia.

Received: 28 September 2017 Accepted: 21 March 2018 Additional file 1: Silhouette analysis for ATC classification and GSOM clustering. This file includes the figures illustrating the Silhouette values of drugs based on ATC classification and GSOM clustering using chemical, disease, gene, protein and side effect profiles. (PDF 224 kb) References 1. Dudley JT, Deshpande T, Butte AJ. Exploiting drug–disease relationships Additional file 2: The 26 pairs of drugs which occur together in Drug for computational drug repositioning. Brief Bioinforma. 2011;12:013. Clustering Tier 1. These drug pairs occur together in each drug cluster, 2. Napolitano F, Zhao Y, Moreira VM, Tagliaferri R, Kere J, D’Amato M, generated based on individual chemical, disease, protein and side effect Greco D. Drug repositioning: a machine-learning approach through data profiles. (PDF 59 kb) integration. J Cheminformatics. 2013;5:30. Hameed et al. BMC Bioinformatics (2018) 19:129 Page 17 of 18

3. Li J, Zheng S, Chen B, Butte AJ, Swamidass SJ, Lu Z. A survey of 30. Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Bryant SH. Pubchem: a current trends in computational drug repositioning. Brief Bioinforma. public information system for analyzing bioactivities of small molecules. 2016;17(1):2–12. Nucleic Acids Res. 2009;37(suppl 2):623–33. 4. U Sahu N, S Kharkar P. Computational drug repositioning: A lateral 31. Bodenreider O. The unified medical language system (umls): integrating approach to traditional drug discovery? Curr Top Med Chem. 2016;16(19): biomedical terminology. Nucleic Acids Res. 2004;32(suppl 1):267–70. 2069–77. 32. Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, 5. Yamanishi Y, Kotera M, Kanehisa M, Goto S. Drug-target interaction Hassanali M. Drugbank: a knowledgebase for drugs, drug actions and prediction from chemical, genomic and pharmacological data in an drug targets. Nucleic Acids Res. 2008;36(suppl 1):901–6. integrated framework. Bioinformatics. 2010;26(12):246–54. 33. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, 6. Suthram S, Dudley JT, Chiang AP, Chen R, Hastie TJ, Butte AJ. Gasteiger E, Huang H, Lopez R, Magrane M, et al. Uniprot: the universal Network-based elucidation of human disease similarities reveals protein knowledgebase. Nucleic Acids Res. 2004;32(suppl 1):115–9. common functional modules enriched for pluripotent drug targets. 34. Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P. A side effect resource PLoS Comput Biol. 2010;6(2):1000662. to capture phenotypic effects of drugs. Mol Syst Biol. 2010;6(1):343. 7. Berger SI, Iyengar R. Network analyses in systems pharmacology. 35. Alahakoon D, Halgamuge SK, Srinivasan B. Dynamic self-organizing Bioinformatics. 2009;25(19):2466–72. maps with controlled growth for knowledge discovery. IEEE Trans Neural 8. Yildirim MA, Goh K-I, Cusick ME, Barabasi A-L, Vidal M. Drug-target Netw. 2000;11(3):601–14. network. Nat Biotechnol. 2007;25(10):1119–26. 36. Hsu AL, Tang S-L, Halgamuge SK. An unsupervised hierarchical dynamic 9. Lee M, Park K, Kim D. Interaction network among functional drug self-organizing approach to cancer class discovery and marker gene groups. BMC Syst Biol. 2013;7(3):1. identification in microarray data. Bioinformatics. 2003;19(16):2131–40. 10. Wu C, Gudivada RC, Aronow BJ, Jegga AG. Computational drug 37. Van Dongen S. Graph clustering by flow simulation. 2001. repositioning through heterogeneous network clustering. BMC Syst Biol. 38. Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for 2013;7(Suppl 5):6. large-scale detection of protein families. Nucleic Acids Res. 2002;30(7): 11. Chen L, Zeng W-M, Cai Y-D, Feng K-Y, Chou K-C. Predicting anatomical 1575–84. therapeutic chemical (atc) classification of drugs by integrating 39. Nepusz T, Yu H, Paccanaro A. Detecting overlapping protein complexes chemical-chemical interactions and similarities. PloS ONE. 2012;7(4): in protein-protein interaction networks. Nat Methods. 2012;9(5):471–2. 35254. 40. Bader GD, Hogue CW. An automated method for finding molecular 12. Cheng F, Li W, Wu Z, Wang X, Zhang C, Li J, Liu G, Tang Y. Prediction of complexes in large protein interaction networks. BMC Bioinformatics. polypharmacological profiles of drugs by the integration of chemical, 2003;4(1):1. side effect, and therapeutic space. J Chem Inf Model. 2013;53(4):753–62. 41. Kohonen T, Maps S. Self-organizing Maps. Springer; 1995, p. 30. 13. Tari LB, Patel JH. Systematic drug repurposing through text mining. 42. Liu M, Wu Y, Chen Y, Sun J, Zhao Z, Chen X-W, Matheny ME, Xu H. Biomed Lit Min. 2014;1159:253–67. Large-scale prediction of adverse drug reactions using chemical, 14. Xu R, Wang Q. Large-scale extraction of accurate drug-disease treatment biological, and phenotypic properties of drugs. J Am Med Inform Assoc. pairs from biomedical literature for drug repurposing. BMC 2012;19(e1):28–35. Bioinformatics. 2013;14(1):181. 43. Hassani M, Seidl T. Using internal evaluation measures to validate the 15. World Health Organization. Anatomical Therapeutic Chemical (ATC) quality of diverse stream clustering algorithms. Vietnam J Comput Sci. Classification System. 2016. http://www.whocc.no. 2017;4(3):171–83. 16. Sun Y, Hameed PN, Verspoor K, Halgamuge S. A physarum-inspired 44. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and prize-collecting steiner tree approach to identify subnetworks for drug validation of cluster analysis. J Comput Appl Math. 1987;20:53–65. repositioning. BMC Syst Biol. 2016;10(5):25. 45. Rendón E, Abundez I, Arizmendi A, Quiroz E. Internal versus external 17. Lin S-F, Xiao K-T, Huang Y-T, Chiu C-C, Soo V-W. Analysis of adverse cluster validation indexes. Int J Comput Commun. 2011;5(1):27–34. drug reactions using drug and drug target interactions and graph-based 46. Romano S, Bailey J, Nguyen XV, Verspoor K. Standardized mutual methods. Artif Intell Med. 2010;48(2):161–6. information for clustering comparisons: One step further in adjustment 18. Campillos M, Kuhn M, Gavin A-C, Jensen LJ, Bork P. Drug target for chance. In: ICML. Beijing: International Conference on Machine identification using side-effect similarity. Science. 2008;321(5886):263–6. Learning; 2014. p. 1143–51. 47. Chan C-KK, Hsu AL, Halgamuge SK, Tang S-L. Binning sequences using 19. Hartsperger ML, Blöchl F, Stümpflen V, Theis F. Structuring very sparse labels within a metagenome. BMC Bioinformatics. 2008;9(1):1. heterogeneous biological information using fuzzy clustering of k-partite 48. Konganti K, Wang G, Yang E, Cai JJ. Sbetoolbox: a matlab toolbox for graphs. BMC Bioinformatics. 2010;11(1):522. biological network analysis. Evol Bioinforma. 2013;9:355. 20. Klamt S, Haus U-U, Theis F. Hypergraphs and cellular networks. PLoS 49. Roush GC, Kaur R, Ernst ME. Diuretics: a review and update. J Cardiovasc Comput Biol. 2009;5(5):1000385. Pharmacol Ther. 2014;19(1):5–13. 21. Tatonetti NP, Fernald GH, Altman RB. A novel signal detection algorithm 50. Bakris GL, Sica D, White WB, Cushman WC, Weber MA, Handley A, for identifying hidden drug-drug interactions in adverse event reports. Song E, Kupfer S. Antihypertensive efficacy of hydrochlorothiazide vs J Am Med Inform Assoc. 2012;19(1):79–85. chlorthalidone combined with medoxomil. Am J Med. 22. Zhang K, Chai Y, Yang SX. Self-organizing feature map for cluster analysis 2012;125(12):1229–1. in multi-disease diagnosis. Expert Syst Appl. 2010;37(9):6359–67. 51. Ernst ME, Carter BL, Goerdt CJ, Steffensmeier JJ, Phillips BB, 23. Zhou B, Wang R, Wu P, Kong D-X. Drug repurposing based on Zimmerman MB, Bergus GR. Comparative antihypertensive effects of drug–drug interaction. Chem Biol Drug Des. 2015;85(2):137–44. hydrochlorothiazide and chlorthalidone on ambulatory and office blood 24. Strehl A, Ghosh J. Cluster ensembles - a knowledge reuse framework for pressure. Hypertension. 2006;47(3):352–8. combining multiple partitions. J Mach Learn Res. 2002;3(Dec):583–617. 52. Rossi S, Calabretto J-P, Patterson C. Australian Medicines Handbook. 25. Zhao X-M, Iskar M, Zeller G, Kuhn M, Van Noort V, Bork P. Prediction of Adelaide: AMH Pty Ltd; 2017. drug combinations by integrating molecular and pharmacological data. 53. Crilly MA, Mangoni AA. Non-steroidal anti-inflammatory drug (nsaid) PLoS Comput Biol. 2011;7(12):1002323. related inhibition of aldosterone glucuronidation and arterial dysfunction 26. Shi J-Y, Li J-X, Lu H-M. Predicting existing targets for new drugs base on in patients with rheumatoid arthritis: a cross-sectional clinical study. BMJ strategies for missing interactions. BMC Bioinformatics. 2016;17(8):282. Open. 2011;1(1):000076. 27. Vilar S, Hripcsak G. The role of drug profiles as similarity metrics: 54. Duhault J, Lavielle R. History and evolution of the concept of oral therapy applications to repurposing, adverse effects detection and drug–drug in diabetes. Diabetes Res Clin Pract. 1991;14:9–13. interactions. Brief Bioinforma. 2016;18(4):670–81. 55. Bianchi C, Penno G, Romero F, Del Prato S, Miccoli R. Treating the 28. Chen L, Lu J, Zhang N, Huang T, Cai Y-D. A hybrid method for prediction metabolic syndrome. Expert Rev Cardiovasc Ther. 2007;5(3):491–506. and repositioning of drug anatomical therapeutic chemical classes. Mol 56. Diamanti-Kandarakis E, Economou F, Palimeri S, Christakou C. Metformin BioSyst. 2014;10(4):868–77. in polycystic ovary syndrome. Ann N Y Acad Sci. 2010;1205(1):192–8. 29. Wang F, Zhang P, Cao N, Hu J, Sorrentino R. Exploring the associations 57. Coyle C, Cafferty F, Vale C, Langley R. Metformin as an adjuvant between drug side-effects and therapeutic indications. J Biomed Inform. treatment for cancer: a systematic review and meta-analysis. Ann Oncol. 2014;51:15–23. 2016;27(12):2184–95. Hameed et al. BMC Bioinformatics (2018) 19:129 Page 18 of 18

58. Hankinson SJ, Fam M, Patel NN. A review for clinicians: Prostate cancer and the antineoplastic properties of metformin. In: Urologic Oncology: Seminars and Original Investigations. Netherlands: Elsevier; 2017. p. 21–9. 59. Meltzer H, Sachar E, Frantz A. Dopamine antagonism by thioridazine in schizophrenia. Biol Psychiatry. 1975;10(1):53–7. 60. Sultana A, Reilly J, Fenton M. Thioridazine for schizophrenia. Cochrane Libr. 2000;2:. Art. No.: CD001944. 61. Purhonen M, Koponen H, Tiihonen J, Tanskanen A. Outcome of patients after market withdrawal of thioridazine: a retrospective analysis in a nationwide cohort. Pharmacoepidemiol Drug Saf. 2012;21(11):1227–31. 62. Xu R, Wunsch D. Survey of clustering algorithms. IEEE Trans Neural Netw. 2005;16(3):645–78. 63. Li Y, Wu F-X, Ngom A. A review on machine learning principles for multi-view biological data integration. Brief Bioinformatics. 2016;19:113. 64. Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B. Large scale multiple kernel learning. J Mach Learn Res. 2006;7(Jul):1531–65.

Submit your next manuscript to BioMed Central and we will help you at every step:

• We accept pre-submission inquiries • Our selector tool helps you to find the most relevant journal • We provide round the clock customer support • Convenient online submission • Thorough peer review • Inclusion in PubMed and all major indexing services • Maximum visibility for your research

Submit your manuscript at www.biomedcentral.com/submit