
Machine learning-based prediction and characterization of drug-drug interactions

A thesis submitted to the faculty of University of Cincinnati

in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

in the College of Engineering & Applied Science

by

Jaswanth Kumar Yella

Bachelor of Technology, Computer Science

Joginpally BR Engineering College, May 2013

Research Advisor and Committee Chair: Dr. Anil G. Jegga

Academic Advisor and Committee Member: Dr. Ali A. Minai

Committee Member: Dr. Raj K. Bhatnagar

Abstract

Polypharmacy, the simultaneous use of two or more drugs, is often unavoidable in the elderly population, who frequently suffer from multiple complex conditions. A drug-drug interaction (DDI) is a change in the effect of a drug due to polypharmacy. Identifying and characterizing DDIs is important to avoid hazardous complications and would also help reduce development costs for de novo drug discovery. An in-silico method that predicts these DDIs a priori using existing drug profiles can not only mitigate DDI-related adverse event risks but also reduce health care costs.

In this thesis, drug-related feature data such as pathways, targets, SMILES, MeSH, indications, adverse events, and contraindications are collected from various sources. Drug-drug similarity for each individual feature is calculated and integrated along with DDI labels collected from Drugs.com for 1,067,991 interactions. To handle the high imbalance of labels in the dataset, the Synthetic Minority Over-sampling Technique (SMOTE) is applied. Then, using the final dataset, a computational machine learning framework is developed to evaluate classifier performance across multiple datasets and identify the best-performing classifier.

Random Forest is identified as the best predictive model in this thesis when compared with 5 other classifiers using 5-fold stratified cross-validation. DDI severity characterization is performed using Random Forest for multi-class classification, where the labels are safe, minor, moderate, and major DDI. The results show that the framework can identify DDIs and characterize their severity from pairwise drug feature-similarity data, and can therefore be useful in drug development and pharmacovigilance studies.

Acknowledgment

I would like to thank my mentor and research advisor Dr. Anil Jegga for the patient guidance, advice, and constant encouragement he has provided since the day I joined the lab. I gratefully acknowledge the funding he provided me for my graduate program. I have been extremely lucky to receive opportunities to publish papers and attend conferences under his supervision. I would also like to thank my academic advisor Dr. Ali Minai for encouraging me to explore new directions in my research and providing constructive feedback. I greatly appreciate his availability, his concern for my health, and his suggestions on the direction of my research. In particular, I would like to thank Dr. Raj Bhatnagar for serving on my committee panel and for the valuable suggestions he provided on my thesis.

This journey wouldn’t have started without the support of Mr. Suresh Paladugu, who provided funding for a semester through North Alley Pvt. Ltd. I am extremely thankful for all the support provided to commence my academic journey.

A sincere and special thanks to my friends Shravan Nalla, Ranadheer Thummeti, and Nithin Palakurthi for constantly believing in me all these years. More than words could ever express, I am grateful and indebted for all the support provided morally, financially, and sarcastically.

I would like to thank my lab mates, especially Yunguan Wang (Jake) and Suryanarayana Yaddanapudi, for the encouragement and suggestions provided to complete the thesis. I thank the Cincinnati Children's Hospital bioinformatics infrastructure team for the technical support and for resolving server-related issues promptly.

I extend my sincere gratitude to my cousins Lakshmi Yella and Raj Sekhar Yella for their encouragement and guidance in pursuing higher studies. My heartfelt thanks to my parents and brother for believing in me and encouraging my dreams.

Table of Contents

1 Introduction
1.1 Background
1.2 Characterizing DDI Using Machine Learning
1.3 Structure and Organization of the Thesis
2 Related Work
2.1 Methods to Detect or Predict DDIs
2.2 DDI Prediction Using In Silico Techniques
2.2.1 Statistics-Based Methods
2.2.2 Network-Based Methods
2.2.3 Machine Learning-Based Methods
2.3 Motivation and Research Objectives
3 Materials and Methods
3.1 Normalizing Drug Nomenclature
3.2 Drug-Drug Interaction Networks
3.2.1 Binary Classification Datasets
3.2.2 Multi-Class Classification Dataset
3.3 Feature Extraction
3.3.1 Drug-SMILES
3.3.2 Drug-Pathway
3.3.3 Drug-Adverse Events
3.3.4 Drug-ATC
3.3.5 Drug-Target
3.3.6 Drug-MeSH
3.3.7 Drug-Indications
3.3.8 Drug-Contraindications
3.4 Data Representation for Learning
3.5 Learning Algorithms
3.5.1 Logistic Regression
3.5.2 K-Nearest Neighbors
3.5.3 Support Vector Machine
3.5.4 Artificial Neural Networks
3.5.5 AdaBoost
3.5.6 Random Forest
3.5.7 Multiclass Classification Using Binary Classifiers
3.6 Imbalanced Data and Synthetic Minority Oversampling Technique
3.7 Evaluation Metrics
3.7.1 Accuracy
3.7.2 Receiver Operating Characteristic Curve
3.7.3 Precision, Recall and F1
3.7.4 Micro-Averaging and Macro-Averaging
3.7.5 Cohen's Kappa
4 Experimental Design and Results
4.1 Experimental Design
4.2 Results
4.2.1 Drugs.com+DCDB
4.2.2 Random Safe+Drugs.com
4.2.3 DCDB + DrugBank
4.2.4 Comparison with Cheng's Model
4.2.5 Multi-Class Classification for DDI Severity Characterization
4.2.6 Comparing Original and Predicted DDI Using Fisher's Exact Test
4.2.7 Predicted vs. Cytochrome Drugs
5 Conclusions and Future Directions
Appendices
Appendix A – Drug Feature Construction and Extraction
Appendix B – Calculating Drug-Drug Similarity Using Jaccard Index
Appendix C – Dataset Construction Using Similarity Scores and DDI Labels
Appendix D – Data Analysis, Machine Learning-Based Classification and Visualization
Appendix E – Evaluating Classifiers' Performance Across Multiple Datasets

List of Figures

Figure 3.1 SMILES representation of Ibuprofen (chemical formula $C_{13}H_{18}O_2$)
Figure 3.2 Feature correlation plot using DDI similarity data, where dark brown indicates no correlation between features and light brown indicates correlation between features
Figure 4.1 Drugs intersection Venn diagram for DCDB safe and Drugs.com not-safe DDI
Figure 4.2 An overview of the machine learning-based DDI characterization framework consisting of data sources, pre-processing techniques, and classification goals
Figure 4.3 Feature distribution plot of safe (left) and not-safe (right) DDI for the Drugs.com+DCDB dataset
Figure 4.4 AUCROC plot for safe vs not-safe DDI classification using 6 different classifiers with 5-fold stratified cross-validation, where 0 indicates not-safe DDI and 1 indicates safe DDI
Figure 4.5 Feature distribution plot of safe (left) and not-safe (right) DDI for the Random Safe+Drugs.com dataset, where 0 indicates not-safe DDI and 1 indicates safe DDI
Figure 4.6 AUCROC plot for random safe vs not-safe DDI from Drugs.com classification using 6 different classifiers with 5-fold stratified cross-validation
Figure 4.7 Feature distribution plot of safe (left) and not-safe (right) DDI for the DCDB+DrugBank dataset, where 0 indicates not-safe DDI and 1 indicates safe DDI
Figure 4.8 AUCROC plot for DCDB safe vs DrugBank not-safe DDI classification using 6 different classifiers with 5-fold stratified cross-validation
Figure 4.9 AUCROC plot using features from Cheng's study with 5-fold stratified cross-validation
Figure 4.10 AUCROC plot for the collected features in the current study with 5-fold stratified cross-validation
Figure 4.11 Feature distribution plot of DDI severity for the DCDB+Drugs.com severity DDI dataset
Figure 4.12 KNN AUCROC plot for DDI characterization
Figure 4.13 Logistic Regression AUCROC plot for DDI characterization
Figure 4.14 SVM AUCROC plot for DDI characterization
Figure 4.15 Multi-Layer Perceptron AUCROC plot for DDI characterization
Figure 4.16 AdaBoost AUCROC plot for DDI characterization
Figure 4.17 Random Forest AUCROC plot for DDI characterization
Figure 4.18 Feature importance plot for Random Forest DDI characterization
Figure 4.19 Major DDI enrichment network constructed using Fisher's exact test in the Drugs.com dataset, where the red lines indicate interactions between the ATC classes and the size of a node indicates the number of interactions the class has
Figure 4.20 Safe DDI enrichment using Fisher's exact test in the DCDB dataset
Figure 4.21 Drugs intersection Venn diagram for predicted and cytochrome drugs

List of Tables

Table 3.1 Various datasets used to evaluate the classifier performance, along with the number of safe and not-safe DDI pairs in the datasets
Table 3.2 The final DDI dataset class count used for training learning algorithms
Table 3.3 Class-imbalance ratios for safe vs not-safe DDI datasets
Table 3.4 Class-imbalance ratios for severity DDI datasets
Table 4.1 Classifiers' performance over the Drugs.com+DCDB DDI dataset with 5-fold stratified cross-validation
Table 4.2 Classifiers' performance over the Random Safe+Drugs.com DDI dataset with 5-fold stratified cross-validation
Table 4.3 Classifiers' performance over the DCDB+DrugBank DDI dataset with SMOTE balancing and 5-fold stratified cross-validation
Table 4.4 Classifiers' performance over the Cheng DDI dataset using 5-fold stratified cross-validation
Table 4.5 Classifiers' performance using the collected features with Cheng DDI on 5-fold stratified cross-validation
Table 4.6 Comparison of features used in the Cheng dataset and features used in this thesis, along with the feature importance (FI) scores; Cheng source is the source of drug-feature data, whereas thesis source is the source of features used in this thesis work
Table 4.7 ATC enrichment analysis based on predicted major DDI
Table 4.8 ATC enrichment analysis based on predicted safe DDI

1 Introduction

1.1 Background

Due to the complicated interplay of various pathophysiological processes underlying most complex diseases, single-drug treatment or monotherapy is often unable to treat or provide relief in these diseases. Hence, combination-drugs or drug cocktails are prescribed. While these are efficacious and also potentially reduce drug-dosage and thereby dose-related drug-toxicity [1], they can result in inadvertent drug-drug interactions (DDI). These DDIs either increase the efficacy of the drug treatment and/or may cause unexpected adverse events (AEs) [2]. In the drug development cycle, pharmacokinetics (PK) refers to the absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of a drug, while pharmacodynamics (PD) studies the relationship between the drug and its targets, its mechanism of action, and its therapeutic effects [3]. DDI-related AEs are triggered by certain combinations of drugs wherein the PD-PK properties of one of the drugs are altered because of another drug which is concomitantly administered or inadvertently taken.

Drug-related AEs are among the leading causes of severe fatalities and increased health care costs [4-6]. As per one estimate, DDIs account for 30% of unexpected AEs leading to patient morbidity and mortality in the United States [7-9]. According to the 2007-2008 United States report from the Centers for Disease Control (CDC), the percentage of people taking two or more drugs increased from 25% to 31%, and the use of five or more drugs increased from 4.0% to 10.1% [10]. A literature review of the Medline and Embase databases (1990-2006) found that DDIs were responsible for 0.054% of emergency department visits, 0.57% of hospital admissions, and 0.12% of re-hospitalizations in the United States [11]. Most hospitals employ clinical decision support (CDS) systems, which are designed to assist physicians in prescribing drugs with necessary drug-dosing adjustments and alerts related to contraindications of drug combinations [12]. However, a DDI alert research study found that physicians overrode two-thirds of the DDI alerts from the CDS systems, and that eight drugs out of a study of 113 drugs were responsible for three-quarters of important DDI alerts [13]. These drugs included some widely prescribed medications such as azithromycin, citalopram, sildenafil, simvastatin, tamsulosin, tramadol, and warfarin (present in the top 100 most used drugs in the Drugs.com resource).

Although DDIs can occur for many reasons such as drug metabolism, environmental exposure, or genetic factors, a majority of them occur due to drug metabolism via the cytochrome P450 (CYP450) system. The cytochrome P450 family of isozymes is responsible for the biotransformation of many drugs via oxidation. Inhibition or induction of these enzymes causes DDIs, leading to potentially serious AEs [14]. Identifying all DDIs experimentally (in-vitro and in-vivo) with multiple combinations for all approved drugs and investigational compounds is challenging, non-trivial, and expensive. Hence, many researchers are exploring in-silico techniques for pre-screening or predicting DDIs with the help of available drug profiles.

In the recent past, several computational approaches have been developed and evaluated to identify DDIs using different data sources. However, current cutting-edge research on DDI is mostly limited to identifying whether or not a DDI exists for co-medicated drugs. In the study reported here, we have developed and evaluated machine learning-based approaches not only to predict DDIs but also to characterize them, namely by predicting the severity of the DDIs. The DDIs are categorized into four severity classes as follows:

• Major DDI: Drugs having a major/severe DDI must be avoided at all costs. The interaction of the drugs causes more toxic effects than any efficacy in treatment.

• Moderate DDI: Drugs with moderate interaction severity are still dangerous; however, they may be used under special circumstances. Patients taking these drugs must be constantly monitored by their health care provider.

• Minor DDI: The risk associated with a minor DDI is minimal. Patients, however, need to be careful and institute a monitoring plan for dosage control.

• Safe DDI: The drug combination leads to improved or increased efficacy of the therapeutic regime without causing any significant AEs.

The overall goal of this thesis is to design and develop a framework to identify and characterize DDIs using drug-profile data.

1.2 Characterizing DDI using Machine Learning

Two drugs have a DDI when they could influence each other's therapeutic effect(s) by sharing a common biological element. In order to computationally identify the existence of an interaction for an unknown pair, similarity-based learning methods were applied. First, 276,914 DDI pairs for 1,462 approved drugs were retrieved from Drugs.com [15] and CredibleMeds [16], classified based on severity into minor (20,238), moderate (201,985), and major (54,691) DDI categories. Additionally, 934 safe pairs from DCDB [17] were also retrieved. Then, random combinations of the 1,462 drugs were generated, and enriched target and pathway association properties were extracted from CTDbase [18], chemical structures from DrugBank [19], AEs of drugs from ADReCS [20], and MeSH terms, indications, and contraindications of drugs from DrugCentral [21].


With the extracted drug properties, a binary drug-feature matrix was constructed for each of the features, wherein 1 represents the presence of the biological element of the feature while 0 indicates its absence. This matrix can also be considered an adjacency matrix of a graph. The Jaccard Index was applied to calculate the similarity of the drugs for every drug-feature matrix. All of the feature-similarity pairs along with the DDI pairs were merged together to form a learnable dataset of 119,053 pairs in total (training set).

The framework presented in this thesis applies multiple machine learning algorithms to the DDI training set. In order to test the robustness of the framework, two different methods for DDI characterization are employed. Method 1 involves identification of safe vs not-safe interactions, which is treated as a binary classification problem, whereas Method 2 is a multi-class classification problem characterizing the severity of novel interaction pairs. However, due to the high class imbalance in the data, most algorithms failed to detect the DDI severity class for under-represented classes. In order to address this issue, the synthetic minority over-sampling technique (SMOTE) [22] is used to handle the imbalance in the data. Then, each classifier is trained along with hyper-parameter tuning methods such as random search and grid search. Finally, the best algorithm is picked based on the performance metrics and statistical significance across both methods.

1.3 Structure and Organization of the Thesis

The thesis is organized as follows:

• Chapter-1, the current chapter, introduces the background of the DDI problem, the types of interactions, and a brief discussion of the implemented DDI characterization framework.

• Chapter-2 presents the various in-silico techniques used to solve the problem and the challenges this study aims to address.

• Chapter-3 presents a discussion of the data sources, computational methods, and analysis techniques used in the framework to predict and characterize DDIs.

• Chapter-4 discusses the experimental setup of the problem and compares results with current state-of-the-art methods.

• Chapter-5 concludes the thesis with a summary and suggests directions for further research.


2 Related Work

2.1 Methods to Detect or Predict DDIs

DDIs can be detected or predicted using in vitro, in vivo, in populo, or in silico approaches. In vitro studies refer to studies performed outside a living organism (e.g., in a test tube or culture dish). For the de novo drug discovery process in the pharmaceutical industry, the FDA provides a comprehensive guide for characterizing a potential new molecular entity's (NME) drug interactions by addressing the important efficacy and safety concerns for drugs metabolized by polymorphic enzymes [23]. In vivo studies are performed using living or model organisms. In populo refers to studies performed in human populations. The FDA suggests that in vitro drug interaction studies performed early in the drug development process can determine whether further DDI investigations are necessary. However, in vitro [23], in vivo [24], and in populo [25] experiments are expensive to perform, and have limitations in determining the cause of adverse events and in characterizing pharmacologic mechanisms or efficacy-related issues with respect to DDIs.

2.2 DDI Prediction using In Silico Techniques

In silico methods refer to studies performed computationally using results from experimental data. These methods take advantage of data available from large-scale studies on de novo drug discovery. In the last few years, several in silico methods for DDI prediction and assessment have been developed. Thus far, the existing in silico or computational methods for DDI detection can be categorized broadly into three types:

2.2.1 Statistics-Based Methods

A few statistics-based methods have been developed to identify the adverse events associated with DDIs by analyzing medical reports, databases, and electronic medical records. Duke et al. combined a literature mining approach with validation using electronic medical records (EMR) to predict and evaluate DDIs [26]. Their method evaluated possible molecular mechanisms of predicted DDIs and also focused on clinically and statistically significant DDIs that increase the risk of myopathy. Gottlieb et al. built large-scale prediction models integrating drug-similarity features to infer DDIs from both pharmacokinetic (CYP-related DDIs) and pharmacodynamic (non-CYP-related DDIs) data [27].

Tatonetti et al. developed a signal detection algorithm which identifies latent adverse event signals from reporting systems [9]. They applied their methodology to the FDA's Adverse Event Reporting System (AERS) and predicted 171 novel interactions. Lu et al. developed a statistical method to identify compounds that might interact based on the co-occurrence information available in corresponding MEDLINE records [28]. Their method also predicted candidate proteins that may be involved in DDIs, apart from predicting novel DDIs.

2.2.2 Network-Based Methods

Biological networks help in understanding the complex interactions and relations of different components in a biological system and reveal high-level relationships, system-wide properties, and enrichment patterns [29]. Wang et al. explored drug combinations in the context of gene interaction networks and pathways using effective drug combinations and random combinations [30]. They reported that drug combinations tend to target proteins which are close in the gene interaction network and are more likely to modulate functionally related pathways. Lee et al. used a heterogeneous biological network which was constructed by combining multiple drug-related databases consisting of pathways, proteins, side-effects, and interaction information [31]. They proposed a metric to measure the strength of the relation between two nodes in the network to predict and score DDIs. They reported that drugs sharing a disease are more likely to have a DDI than drugs that share a common target.

Huang et al. developed a pharmacodynamic DDI prediction system using a Bayesian probabilistic model with DrugBank DDI data as the gold-standard dataset [32]. Using their metric "S-score", which measures the strength of network connection between drug targets, they were able to predict DDIs and understand the potential physiological effects underlying DDIs. Udrescu et al. used modularity-based and energy-model layout community detection algorithms on DrugBank DDI data to detect 9 relevant pharmacological properties [33]. Their clustering technique was able to predict 85% of pharmacological properties when cross-checked with other data sources (such as Drugs.com, RxList, etc.) and a literature survey.

2.2.3 Machine Learning-Based Methods

The drug-feature heterogeneous network data can also be used to train machine learning models for node classification and link prediction tasks. Cheng et al. developed a comprehensive DDI network incorporating 6,946 interactions of 721 approved drugs using data from DrugBank [34]. They calculated drug-drug pair similarity values for phenotype, ATC, chemical, and target features. Finally, they applied five machine learning-based models - naive Bayes, decision tree, k-nearest neighbor, logistic regression, and support vector machine - and predicted DDIs with an AUCROC score of 0.67 using five-fold cross-validation. They tested their model on antipsychotic DDIs and found literature support for predictions involving weight gain and P450 inhibition.

Similarly, Zhang et al. collected a variety of drug-related association data such as substructure, target, enzyme, transporter, pathway, indication, side-effect, and known DDIs [35]. They calculated drug-drug similarity scores for each feature and built prediction models using neighbor recommender, random walk, and matrix perturbation methods. They integrated the models with ensemble rules and reported higher prediction performance compared to the performance of the individual algorithms.

Kastrin et al. treated link prediction as a binary classification task using unsupervised and supervised approaches for predicting unknown interactions between drugs in five arbitrarily chosen large-scale DDI databases, namely DrugBank, KEGG, NDF-RT, SemMedDB, and Twosides [36]. They applied prediction models such as classification tree, k-nearest neighbors, support vector machine, random forest, and gradient boosting classifiers based on topological and semantic similarity features of the network. Out of all the datasets, the Twosides network performed best, with precision and recall scores of 0.93 for both the random forest and gradient boosting machine classifiers.


2.3 Motivation and Research Objectives

Given that there are diverse, heterogeneous types of information available on drug activity, and that all of these contribute to drug metabolism and its interactions in a living system, it is essential to integrate and leverage all these information types to build an in silico framework that can predict DDIs. Multiple drug usage (e.g., combination drugs, over-the-counter medications, different drugs used for different co-morbid conditions, etc.), also referred to as polypharmacy, is common in patients with complex diseases or co-morbid conditions. Discovering DDIs due to polypharmacy is an important challenge, not only to alert patients after the approval of a drug but also to help forecast safe or not-safe combinations in the de novo drug discovery process. The following are some of the limitations and challenges apparent from a review of the current research methods for computational prediction of DDIs:

• In the current state-of-the-art models, the DDI problem is addressed as a binary classification problem [34-36]. These methods use the not-safe DDI pairs from known sources such as DrugBank [19], the FDA Adverse Event Reporting System (FAERS) [37], or TWOSIDES [38], and any unknown DDI pair is considered a safe pair. However, none of the approaches characterize the severity of a DDI.

• There have been no studies that directly compare the known safe pairs of drugs (e.g., drug combinations as compiled in the DCDB database) with not-safe pairs of drugs. Doing this can help in identifying features which can potentially be used in a computational framework to predict DDIs – safe or not-safe – a priori.

• There is a critical and unmet need for a more systematic analysis of multiple drug-related heterogeneous data sources and their potential for not only predicting but also characterizing DDIs. A major challenge in the existing heterogeneous data is the class-imbalance issue, which can impact the training process.

The focus of this thesis is to predict and characterize DDIs while addressing the following questions:

• How is the class-imbalance problem addressed in the drug annotation datasets? Does it affect the performance of the DDI prediction model?

• Which classifier(s) perform the best in predicting and characterizing DDIs?

• Are there specific metric(s) to evaluate the performance of the models?

• How do predicted interactions correlate or compare with known DDIs?

In this thesis, the framework is based on similarity-based approaches wherein 7 different drug profiles - SMILES, AEs, pathways, indications, targets, MeSH, and contraindications - are used to calculate drug-drug similarities. These individual drug-drug similarity scores are used as features along with the class information from Drugs.com, and are trained with multiple machine learning algorithms for the prediction and characterization of DDIs.

3 Materials and Methods

The DDI prediction framework presented in this thesis is developed by constructing a heterogeneous knowledge graph integrating data from various sources. The data is compiled from 7 different publicly available sources on the web: Drugs.com [39], CredibleMeds [16], DCDB [17], CTDbase [18], DrugBank [19], ADReCS [20], and DrugCentral [21]. The following sections describe the details of the extracted data, pre-processing techniques, learning algorithms, metrics, and analytical methods used in the framework.


3.1 Normalizing Drug Nomenclature

Drug compounds are often represented using several alternate names (synonyms, market names, etc.). For example, simvastatin is also known as Simvastatina, Simvastatinum, or Synvinolin in different countries. This leads to information loss and unnecessary duplicate records, which can be potentially problematic when building networks around drugs. DrugBank provides synonym information, which is used to map all drug names to DrugBank identifiers in our framework. This mapping ensures the consistency of the drugs represented in the heterogeneous drug network and enables computing similarity between nodes, as explained in Section 3.3.
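A minimal sketch of this normalization step is shown below, assuming a two-column synonym export (DrugBank ID, synonym) from DrugBank; the file name and column layout are hypothetical.

```python
# A minimal sketch of synonym-based drug-name normalization, assuming a
# hypothetical two-column CSV export (DrugBank ID, synonym).
import csv

def load_synonym_map(path):
    """Build a case-insensitive synonym -> DrugBank identifier lookup."""
    synonym_map = {}
    with open(path, newline="", encoding="utf-8") as f:
        for drugbank_id, synonym in csv.reader(f):
            synonym_map[synonym.strip().lower()] = drugbank_id
    return synonym_map

def normalize_drug_name(name, synonym_map):
    """Map any alternate drug name to its canonical DrugBank identifier."""
    return synonym_map.get(name.strip().lower())

# Usage: "Simvastatina", "Simvastatinum", and "Synvinolin" would all map
# to the same DrugBank identifier if listed as synonyms in the export.
```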

3.2 Drug-Drug Interaction Networks

The not-safe drug interaction pairs were collected from Drugs.com and DrugBank, while the safe interaction drug pairs were downloaded from the Drug Combination Database (DCDB). All of the collected drug pairs were mapped to DrugBank drug identifiers for nomenclature consistency across the different networks. In most previous studies, identification of DDIs is considered a binary classification problem and validation is performed using limited dataset(s). In this work, in addition to performing binary classification on multiple datasets and implementing the framework to identify the best-performing feature/algorithm, the severity of DDIs was also characterized. The various DDI networks used for classifier performance evaluation are described in the following sections.


3.2.1 Binary Classification Datasets

The binary classification datasets contain class information indicating whether a DDI is safe or not-safe. In the combined DCDB+Drugs.com dataset, the safe pairs (from DCDB) are labeled as 0 while the not-safe pairs (from Drugs.com) are labeled as 1. In the Random Pairs+Drugs.com dataset, not-safe DDIs (from Drugs.com) are marked as 1 and all random pairs of drugs formed through combinations which do not have interaction information are marked as 0 (or safe).

DrugBank contains interaction information between two drugs extracted from drug labels and scientific publications. In the DCDB+DrugBank dataset, the drug pairs from this corpus are extracted and marked as potential not-safe DDIs (1), while the pairs from DCDB are marked as 0 (or safe).

Below are a few examples illustrating the drug interaction information available for a drug <drug-1> in DrugBank (angle brackets denote placeholders):

"The metabolism of <drug-1> can be decreased when combined with <drug-2>."

"The serum concentration of <drug-1> can be increased when it is combined with <drug-2>."

"The risk or severity of <adverse event> can be increased when <drug-1> is combined with <drug-2>."

Dataset                  | DDIs - Safe Pairs | DDIs - Not-Safe Pairs
DCDB + Drugs.com         | 385               | 35,313
Random Pairs + Drugs.com | 192,085           | 35,313
DCDB + DrugBank          | 401               | 48,896

Table 3.1: Various datasets used to evaluate the classifier performance, along with the number of safe and not-safe DDI pairs in each dataset.


3.2.2 Multi-Class Classification Dataset

We retrieved 276,914 DDI pairs for 1,462 approved drugs from Drugs.com, classified based on their severity into 20,238 pairs with minor severity, 201,985 pairs with moderate severity, and 54,691 pairs with major severity. We merged this dataset with 934 safe DDI pairs from DCDB to form a multi-class training dataset of DDIs. However, after merging with all the features, the DDI class counts were reduced (Table 3.2) because of information loss (see Section 3.3).

Dataset        | Safe | Minor | Moderate | Major
DCDB+Drugs.com | 385  | 2,197 | 27,297   | 5,819

Table 3.2: The final DDI dataset class counts used for training the learning algorithms.

3.3 Feature Extraction

3.3.1 Drug-SMILES

The chemical structure of a drug compound is denoted using an ASCII string format called SMILES (Simplified Molecular-Input Line-Entry System). It provides a flexible and unambiguous method for specifying the topological structure of molecules. Each chemical structure for a drug compound is unique and functions as a fingerprint (see the example in Figure 3.1).


Figure 3.1 SMILES representation of Ibuprofen (chemical formula $C_{13}H_{18}O_2$)

DrugBank, a repository of 2,475 approved small-molecule drugs and over 5,478 experimental drugs, provides SMILES data, which was downloaded and processed with RDKit, an open-source cheminformatics toolkit, to calculate DDI similarity (see Section 3.4).
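A minimal sketch of this structure-similarity step with RDKit is shown below; the fingerprint settings (Morgan radius 2, 2048 bits) are common defaults assumed here, not settings stated in the thesis.

```python
# A minimal sketch of SMILES-based drug-drug similarity with RDKit.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

ibuprofen = Chem.MolFromSmiles("CC(C)Cc1ccc(cc1)C(C)C(=O)O")
naproxen = Chem.MolFromSmiles("COc1ccc2cc(ccc2c1)C(C)C(=O)O")

# Hash each structure into a fixed-length binary fingerprint
# (assumed settings: Morgan radius 2, 2048 bits).
fp1 = AllChem.GetMorganFingerprintAsBitVect(ibuprofen, 2, nBits=2048)
fp2 = AllChem.GetMorganFingerprintAsBitVect(naproxen, 2, nBits=2048)

# Tanimoto similarity on binary fingerprints equals the Jaccard Index.
print(DataStructs.TanimotoSimilarity(fp1, fp2))
```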

3.3.2 Drug-Pathway

A biological pathway is a series of interactions among molecules in a cell that either leads to a change in the cell or a certain product, or enables specific functions or processes in the cell. Pathways have the ability to turn genes on or off. CTDbase provides pre-computed drug-pathway associations based on pathway enrichment of drug targets. For this study, the enrichment p-value cut-off was set to less than or equal to 0.01. As described earlier, all 1,899 drugs were mapped to DrugBank identifiers and a drug-pathway matrix (1,899 drugs × 2,054 pathways) was created. The pathway-based drug-drug similarity was calculated using the Jaccard Index to retrieve drugs which may potentially act similarly.


3.3.3 Drug-Adverse Events

Adverse event information for 1,473 drugs was obtained from the Adverse Drug Reaction Classification System (ADReCS), a comprehensive adverse event (AE) ontology database that provides AEs for drugs. ADReCS comprises 134,022 drug-AE pairs. A drug-AE feature matrix was constructed using 1,473 drugs and 7,496 adverse events to identify DDIs by calculating AE-based drug-drug similarity.

3.3.4 Drug-ATC

The Anatomical Therapeutic Chemical (ATC) classification system classifies the active ingredients of drugs based on characteristics such as their chemical, pharmacological, and therapeutic actions and the organ systems they affect. The ATC system consists of multiple hierarchies, with codes representing different organs affected by drugs and different chemical and pharmacological properties. ATC codes for 2,379 approved and investigational drugs were obtained from DrugBank, with 5 different levels in the system: the first level is the main group, the second the therapeutic group, the third the pharmacological group, the fourth the chemical subgroup, and the fifth the drug compound. Drug-drug similarity was calculated based on ATC codes to identify DDIs among similar drugs.

3.3.5 Drug-Target

A drug target is usually a protein in the body that is intrinsically associated with a particular disease process and that could be modulated by a drug to produce a desired therapeutic effect [40]. CTDbase contains enriched target associations for drugs based on annotated genes. Drug targets were extracted from CTDbase with an enrichment p-value less than or equal to 0.01. A drug-target feature matrix consisting of 2,043 drugs and 21,005 targets was formed for the drug-drug similarity calculation.

3.3.6 Drug-MeSH

MeSH (Medical Subject Headings) is a comprehensive controlled vocabulary of medical and related terms updated frequently by the United States National Library of Medicine. Most approved drugs are annotated with MeSH terms. These drug-MeSH term associations or classifications are available in DrugBank. For example, analgesics, anticoagulants, anti-inflammatory agents, and non-steroidal anti-inflammatory compounds are some of the categories associated with the drug aspirin. There are 2,198 MeSH categories mapped to MeSH identifiers in DrugBank. A drug-feature matrix consisting of 2,475 drugs and 2,198 MeSH identifiers was compiled to calculate a MeSH-based drug-drug similarity matrix.

3.3.7 Drug-Indications

DrugCentral has drug data along with therapeutic indications and contraindications (CIs) compiled from various data sources such as SNOMED, DrugBank, and literature mining. A drug-indication matrix consisting of 1,903 drugs and 1,464 indications was constructed from the DrugCentral database for use as a feature in DDI prediction.

3.3.8 Drug-Contraindications

Drug contraindications are the warnings associated with drugs, usually provided on drug labels. For 1,259 drugs in the DrugCentral database, there are 1,145 unique contraindications listed. A drug-contraindication matrix was constructed to calculate CI-based drug-drug similarity and predict DDIs using the learning algorithms.

3.4 Data Representation for Learning

A drug-feature network is represented as an adjacency matrix where 1 indicates the presence of a feature and 0 indicates its absence. Adjacency matrices are constructed for all the drug-feature data: drug-chemical substructures from DrugBank, drug-side-effects from ADReCS, drug-pathway and drug-target data from CTDbase, and drug-indications, drug-contraindications, and drug-MeSH categories from DrugCentral. Then, drug-drug similarity is calculated using the Jaccard Index (JI), a commonly used similarity metric in information retrieval. Given drugs x and y, the JI is formulated as:

$$Sim(x, y) = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}$$

where $\Gamma(x)$ is the set of features present in the binary feature vector of drug x. Thus, the JI between two drugs x and y is the ratio of the number of shared features to the total number of features across both drugs. Based on this similarity measure, drug-drug similarity is calculated for each individual feature. The feature correlation plot in Figure 3.2 shows that there are no very high correlations between features, indicating that the collected features are not dependent on each other; this is important since correlated features, in general, degrade model performance by introducing harmful bias.


Figure 3.2 Feature correlation plot using DDI similarity data, where dark brown indicates no correlation between features and light brown indicates correlation between features.
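A minimal sketch of the Jaccard computation above, applied to a binary drug-feature matrix (rows = drugs, columns = features), could look as follows; the toy matrix is illustrative only.

```python
# A minimal sketch of pairwise Jaccard similarity over a binary
# drug-feature adjacency matrix M (rows = drugs, columns = features).
import numpy as np

def jaccard_similarity_matrix(M):
    """Pairwise Jaccard Index between the rows of a binary matrix."""
    B = (np.asarray(M) > 0).astype(np.int64)
    intersection = B @ B.T                      # |Γ(x) ∩ Γ(y)| for all pairs
    row_sums = B.sum(axis=1)
    union = row_sums[:, None] + row_sums[None, :] - intersection
    # Guard against empty feature sets (union of size zero).
    return np.divide(intersection, union,
                     out=np.zeros(intersection.shape), where=union > 0)

# Toy example: two drugs sharing one of three annotated features.
M = np.array([[1, 1, 0],
              [0, 1, 1]])
print(jaccard_similarity_matrix(M))  # off-diagonal entries equal 1/3
```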

3.5 Learning Algorithms

3.5.1 Logistic Regression

Logistic regression is a widely used statistical model for binary classification as well as regression [41]. The model estimates the probability of a response variable based on the independent predictor variables using a sigmoid function. The sigmoid or logistic function is defined as:

$$\sigma(t) = \frac{1}{1 + e^{-t}}$$


In logistic regression, for a binary outcome or response variable $Y$, the aim is to estimate $\Pr(Y = 1 | X = x)$. Given an independent variable $X$ and a probability $p(x)$ that lies between 0 and 1, the model is

$$\ln\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x$$

where $p(x)$ is the probability of having a DDI, defined as:

$$p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

If the probability of class 1, i.e., $P(Y = 1 | X) > 0.5$, then the DDI between the two drugs is not-safe; otherwise, it is deemed safe.
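As a worked illustration with assumed coefficients (the actual fitted values depend on the training data), suppose $\beta_0 = -2$, $\beta_1 = 4$, and a drug pair has a similarity score $x = 0.8$. Then

$$p(x) = \frac{1}{1 + e^{-(-2 + 4 \times 0.8)}} = \frac{1}{1 + e^{-1.2}} \approx 0.77 > 0.5,$$

so the pair would be assigned to class 1 (not-safe).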

3.5.2 K-Nearest Neighbors

The K-Nearest Neighbors (KNN) algorithm is one of the most commonly used non-parametric classification algorithms. Since KNN doesn't learn a discriminative function from the training data but rather memorizes the complete dataset, it is referred to as a lazy-learner algorithm. The KNN classifier works by taking a majority vote among the K most similar instances to a given unseen observation.

For example, if K=1, the object is assigned to the class of its nearest neighbor. If K=3 and two of the three neighbors belong to class A while the third belongs to class B, the new data point is assigned to class A by majority voting. Common distance metrics used to calculate the distance between two data points are the Euclidean, Manhattan, Chebyshev, and Hamming distances. In the experiments, K values ranging from 3 to 35 were tested to detect DDI pair neighbors.


3.5.3 Support Vector Machine

The support vector machine (SVM) is a flexible machine learning model capable of performing linear or non-linear classification, regression analysis, and outlier detection [42]. SVM, introduced by Vapnik et al., is a popular technique for binary classification as it is efficient in handling large datasets while also being capable of learning from small ones.

In the case of a linear classification problem, the SVM model works by searching for an optimal separating hyperplane that divides the data into the two classes and ensures maximal separation between them. In non-linear problems, kernel methods are used to map the data from the original feature space into another space where it becomes linearly separable. The kernel method works by computing the dot product of two vectors $x$ and $y$ in a high-dimensional feature space. The kernel $k$ for a feature map $\varphi$ into the new space is then

$$k(x, y) = \varphi(x)^T \varphi(y)$$

Both the linear and kernel methods are helpful in capturing significant details in the data.

3.5.4 Artificial Neural Networks

Artificial neural networks (ANN) are a biologically inspired information-processing paradigm with a remarkable ability to extract relationships from complicated or imprecise data [43]. ANNs are based on networks of nodes called neurons which mimic brain cells. The connections between neurons have weights, which determine how each neuron responds to the activity of other neurons. Initially, the weights are assigned arbitrarily and are then adjusted by learning. Classification is usually done with feed-forward neural networks, with neurons arranged in layers.


The input is presented at the input layer, which represents the feature space. Each neuron of the next layer, called a hidden layer, calculates a weighted sum of its inputs, which is then passed through the neuron's non-linear activation function. The same procedure is repeated in successive layers, and the output (class label) is read from the final, or output, layer. In the current instance, the input layer represents the drug features and the output layer consists of two nodes representing the safe or not-safe DDI classification. Intuitively, the layered ANN architecture can be seen as performing successive remappings of the original feature space until the data is mapped into a space where it can be separated linearly by the output layer.

The weights of multi-layer feed-forward ANNs are typically trained using a supervised learning algorithm called backpropagation. The algorithm uses steepest descent to minimize a cost function, typically chosen to be mean-squared error or cross-entropy.

3.5.5 AdaBoost

AdaBoost is a meta-algorithm which linearly combines a set of weak classifiers to form a strong classifier. A weak learner is a classifier which performs only slightly better than random guessing, i.e., above 50% accuracy. AdaBoost works in a greedy fashion, building a succession of weak classifiers based strategically on previously built classifiers to obtain a strong classifier whose performance is better than that of any individual weak classifier. Consider $f_t$ as a weak classifier that takes a drug feature vector $x$ as input and returns the safe or not-safe classification as its response. Then the boosted classifier with $T$ weak classifiers takes the form

$$F_T(x) = \sum_{t=1}^{T} f_t(x)$$

where $f_t(x)$ is the weighted response of the $t$-th weak classifier. At each iteration of the learning process, a weak learner is selected such that the training error of the resulting boosted classifier is minimized. In the experiments, logistic regression was used as the base learner instead of decision stumps due to a significant improvement in performance.
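A minimal sketch of this configuration in scikit-learn is shown below; the hyperparameter values are illustrative assumptions, not the tuned thesis settings.

```python
# A minimal sketch of AdaBoost with logistic regression as the weak learner.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

boost = AdaBoostClassifier(
    estimator=LogisticRegression(max_iter=1000),  # `base_estimator` in scikit-learn < 1.2
    n_estimators=100,   # illustrative value
    learning_rate=0.5,  # illustrative value
)
# boost.fit(X_train, y_train)
# y_pred = boost.predict(X_test)
```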

3.5.6 Random Forest

Random Forest is a supervised ensemble learning algorithm. It is an ensemble of decision trees used for both regression and classification problems. It works by building numerous decision trees, which are weak learners combined to make a strong learner. A decision tree predicts the value of a target variable by learning simple decision rules inferred from the features. Random Forest applies a technique called bootstrap aggregation, or bagging, to the tree learners: random samples are repeatedly selected with replacement from the training set and trees are fitted to them. After training, predictions for a novel sample $x'$ are made by averaging the predictions of all $B$ individual trees:

$$\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$$

Random Forest thus compiles the results from all tree learners to predict the final output, i.e., the DDI safe vs not-safe classification.

3.5.7 Multiclass Classification using Binary Classifiers

In multiclass classification, the algorithm needs to classify instances into one of three or more classes. In order to characterize the severity of the interaction between two drugs, ranging from safe to major interaction, one-vs-rest classification is adopted. In the one-vs-rest strategy, one classifier is fitted per class, i.e., each class is fitted against all other classes. For the $k$-th classifier, let the positive instances be all points in class $k$ and let everything else be negative instances. The predicted response $\hat{y}$ is then given by the classifier, out of all $K$ classifiers, that reports the highest confidence score:

$$\hat{y} = \underset{k \in \{1, \dots, K\}}{\operatorname{argmax}} \, f_k(x)$$
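A minimal sketch of this strategy with scikit-learn is shown below; the wrapped estimator and its parameters are illustrative assumptions.

```python
# A minimal sketch of one-vs-rest severity classification
# (safe, minor, moderate, major).
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier

ovr = OneVsRestClassifier(RandomForestClassifier(n_estimators=200))
# ovr.fit(X_train, y_train)   # fits one binary classifier per severity class
# ovr.predict(X_test)         # argmax over the K per-class confidence scores
```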

3.6 Imbalanced Data and Synthetic Minority Oversampling Technique

With not-safe DDI samples constituting up to 98% of a dataset, machine learning algorithms can suffer from high misclassification rates during training. This can result in poor predictive accuracy for the minority class, since most samples tend to be assigned to the majority class by the classifier. As shown in Tables 3.3 and 3.4, all datasets except Random Pairs+Drugs.com are highly imbalanced. In order to overcome this problem, the synthetic minority oversampling technique (SMOTE) was applied to the training data.

Dataset                | Safe    | Not Safe | Imbalance Ratio
DCDB+Drugs.com         | 385     | 35,313   | 1:91
Random Pairs+Drugs.com | 192,085 | 35,313   | 5:1
DCDB+DrugBank          | 401     | 48,896   | 1:121

Table 3.3: Class-imbalance ratios for safe vs not-safe DDI datasets


Dataset        | Safe | Minor | Moderate | Major | Imbalance Ratio
DCDB+Drugs.com | 385  | 2,197 | 27,297   | 5,819 | 1:6:70:15

Table 3.4: Class-imbalance ratios for severity DDI datasets

SMOTE is an oversampling method that creates artificial or synthetic examples. The minority class is over-sampled by taking each minority-class sample and introducing synthetic examples along the line segments joining it to k randomly chosen minority-class neighbors. This leads to a class-balanced training set, which is then used to train the classifier.
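A minimal sketch of this step with the imbalanced-learn package is shown below; the parameter values are illustrative assumptions.

```python
# A minimal sketch of SMOTE balancing with imbalanced-learn; it is applied
# to the training split only, so synthetic samples never leak into the
# evaluation folds.
from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=5, random_state=42)  # illustrative settings
# X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)
```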

3.7 Evaluation Metrics

Several classification metrics were used to evaluate the prediction performance of the machine learning algorithms. These are reviewed in this section.

3.7.1 Accuracy

Accuracy is the proportion of instances that are correctly predicted. It is the most common metric for classification problems and is often misinterpreted. If the distribution of classes is uniform, classification accuracy reflects the closeness of the predictions to the real values. However, if the data is imbalanced, the classification error in the minority class is masked by the majority-class samples. Accuracy (Acc) is defined as:

$$Acc = \frac{TP + TN}{TP + FP + FN + TN}$$

where TP (True Positive) is an outcome where the classifier correctly predicts the positive class, TN (True Negative) is when the model correctly predicts the negative class, FP (False Positive) is when the model incorrectly predicts the positive class, and FN (False Negative) is when the model incorrectly predicts the negative class.

3.7.2 Receiver Operating Characteristic Curve

A receiver operating characteristic (ROC) curve is a graph created by plotting the true positive rate (TPR) against the false positive rate (FPR). This curve shows the performance of a classifier at all classification thresholds. TPR, also known as recall, is defined as $TPR = \frac{TP}{TP + FN}$, and FPR is defined as $FPR = \frac{FP}{FP + TN}$. The area under the ROC curve (AUC) provides an aggregate measure of performance across all classification thresholds.

3.7.3 Precision, Recall and F1

Precision is the fraction of samples classified as positive that are truly positive, defined as $Precision = \frac{TP}{TP + FP}$, whereas recall measures the fraction of positive samples that are correctly labeled, defined as $Recall = \frac{TP}{TP + FN}$. The F1 score is the harmonic mean of precision and recall. The F1 score is usually more useful in class-imbalanced data problems since it takes both false positives and false negatives into account. It is defined as $F1 = 2 \times \frac{Recall \times Precision}{Recall + Precision}$.

3.7.4 Micro-Averaging and Macro-Averaging

In order to evaluate the performance of multi-class classification, macro-averaging and micro-averaging were used. In the micro-averaging approach, the ROC curve is calculated from the pooled true-positive and false-positive counts over all DDI classes, i.e., for DDI class $c_i$ with true positives $tp_i$ and false positives $fp_i$, the micro-average ($\mu$) over $m$ classes is given as

$$\mu = \frac{\sum_{i=1}^{m} tp_i}{\sum_{i=1}^{m} (tp_i + fp_i)}$$

In the macro-averaging approach, the precision for each individual class is calculated and then averaged, i.e., the macro-average ($M$) over $m$ classes is given as

$$M = \frac{\sum_{i=1}^{m} \frac{tp_i}{tp_i + fp_i}}{m}$$

3.7.5 Cohen’s Kappa

Cohen's kappa is a statistical measure used to assess inter-rater reliability for categorical variables, i.e., safe vs not-safe in the binary classification problem or safe vs minor vs moderate vs major DDI in the multi-class classification problem [44]. Cohen's kappa is calculated as

$$\kappa = \frac{\Pr(o) - \Pr(e)}{1 - \Pr(e)}$$

where $\Pr(o)$ is the observed agreement and $\Pr(e)$ is the hypothetical probability of chance agreement, computed from the observed data using the probability of each classifier randomly assigning each type of interaction. Cohen's kappa ranges from -1 to +1, where +1 indicates perfect agreement between raters and -1 total disagreement. According to Cohen, a $\kappa$ value between 0.01 and 0.20 indicates slight agreement, 0.21-0.40 fair agreement, 0.41-0.60 moderate agreement, 0.61-0.80 substantial agreement, and 0.81-1.00 almost perfect agreement.
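A minimal sketch of how these metrics can be computed with scikit-learn is shown below; y_true, y_pred, and y_score are placeholders for the fold labels, hard predictions, and class probabilities.

```python
# A minimal sketch of the Section 3.7 metrics using scikit-learn.
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

# acc = accuracy_score(y_true, y_pred)
# kappa = cohen_kappa_score(y_true, y_pred)
# prec = precision_score(y_true, y_pred)
# rec = recall_score(y_true, y_pred)
# For the multi-class severity task, micro- and macro-averaged F1:
# f1_micro = f1_score(y_true, y_pred, average="micro")
# f1_macro = f1_score(y_true, y_pred, average="macro")
# AUCROC is computed from class probabilities rather than hard labels:
# auc = roc_auc_score(y_true, y_score)
```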


4 Experimental Design and Results

4.1 Experimental Design

The machine learning-based DDI characterization framework was developed by compiling a large set of data sources for DDIs and drug profiles. A few drug profiles are missing feature data, and the corresponding drug pairs were dropped from the dataset. For 1,462 approved drugs, we obtained 1,067,991 drug pairs through combinations. Of these, 385 safe, 1,994 minor, 25,858 moderate, 5,569 major, and 182,907 unknown pairs had complete drug profiles, covering 653 unique drugs. Multiple datasets (DCDB+Drugs.com, Random Pairs+Drugs.com, and DCDB+DrugBank) were formed by calculating drug similarities and then merging all seven individual features: drug-SMILES, drug-target, drug-pathway, drug-indications, drug-contraindications, drug-adverse events, and drug-MeSH. An intersection analysis of DCDB safe pairs and Drugs.com showed that over 485 pairs listed as safe in DCDB are not-safe DDIs when cross-checked against Drugs.com (Figure 4.1). These pairs were marked as not-safe during training. Due to the high class-imbalance ratio in the datasets, the synthetic minority oversampling technique was applied to balance the data before training the machine learning classifiers. Finally, 6 different machine learning algorithms - Logistic Regression, K-Nearest Neighbors, Support Vector Classifier, Neural Networks, AdaBoost, and Random Forest - were applied for DDI identification and characterization, as shown in Figure 4.2.


Figure 4.1 – Drugs intersection Venn diagram for DCDB safe and Drugs.com not-safe DDI

Figure 4.2 – An overview of the machine learning-based DDI characterization framework consisting of data sources, pre-processing techniques, and classification goals.


All the machine learning classifiers were implemented using the Scikit-learn [45] package in Python. In order to determine the best hyperparameters for the 6 machine learning classifiers used in the framework, random search hyperparameter tuning was used to search for up to 10 optimal hyperparameter settings. For the best parameter selection, a scoring function was defined for the binary and multi-class classification datasets which returned the best parameters based on the highest F1 score. A 5-fold stratified cross-validation was applied during the hyperparameter search to ensure that all the unique drug-combination pairs were represented during training.
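A minimal sketch of this tuning loop is shown below; the parameter grid is an illustrative assumption rather than the grid used in the thesis.

```python
# A minimal sketch of random-search tuning with 5-fold stratified CV
# and F1-based model selection.
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions={"n_estimators": randint(100, 500),
                         "max_depth": randint(3, 20)},
    n_iter=10,                        # up to 10 sampled settings
    scoring="f1",
    cv=StratifiedKFold(n_splits=5),
)
# search.fit(X_train, y_train)
# best_model = search.best_estimator_
```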

Each of the 6 classifiers used in this research is referenced by its Scikit-learn name: K-Nearest Neighbors as KNeighborsClassifier, Logistic Regression as LogisticRegression, Support Vector Machines as SVC, Neural Networks as MLPClassifier, AdaBoost as AdaBoostClassifier, and Random Forest as RandomForestClassifier. The CredibleMeds dataset, which consists of 51 unique drugs, is built with an emphasis on cardiovascular medications and with consideration of more recent DDI evidence. Its 71 unique pairs were removed from the training dataset and used to validate classifier performance. In binary classification, all labels except safe were marked as 1, i.e., not-safe; in multi-class classification the labels were used as is, i.e., 2 for a moderate and 3 for a major interaction.

Finally, to investigate the characterization of DDIs, the performance of the classifiers was compared across multiple datasets. Statistical analysis of the severity DDI predictions was performed for ATC classes to find the correlation between the original and predicted datasets. Although severity characterization of DDIs has not been reported previously, evaluation metrics such as ROC, precision, recall, and F1 score were used to enable comparison with state-of-the-art models for DDI identification.

4.2 Results

4.2.1 Drugs.com+DCDB

The Drugs.com+DCDB dataset consists of 385 safe pairs from DCDB, while the 35,313 minor, moderate, and major pairs are considered not-safe pairs. Figure 4.3 shows the feature distribution plots for safe and not-safe DDIs in the dataset. Due to the imbalanced relative class frequencies of roughly 0.01:0.99, SMOTE was applied to balance the training dataset and reduce the classification bias. Figure 4.4 and Table 4.1 show the evaluation and comparison of the classifiers' performance.

Figure 4.3 – Feature distribution plot of safe (left) and not-safe (right) DDI for Drugs.com+DCDB dataset


Figure 4.4 – AUCROC plot for safe vs not-safe DDI classification using 6 different classifiers with 5-fold stratified cross-validation, where 0 indicates not-safe DDI and 1 indicates safe DDI

Classifier             | CV     | Accuracy | Kappa | Precision | Recall | F1   | Val.Acc
KNeighborsClassifier   | 5-fold | 0.97     | 0.18  | 0.56      | 0.64   | 0.59 | 1.00
LogisticRegression     | 5-fold | 0.94     | 0.14  | 0.54      | 0.75   | 0.56 | 1.00
SVC                    | 5-fold | 0.94     | 0.15  | 0.55      | 0.76   | 0.57 | 1.00
MLPClassifier          | 5-fold | 0.95     | 0.17  | 0.55      | 0.74   | 0.58 | 1.00
AdaBoostClassifier     | 5-fold | 0.94     | 0.17  | 0.55      | 0.75   | 0.58 | 1.00
RandomForestClassifier | 5-fold | 0.99     | 0.21  | 0.64      | 0.59   | 0.61 | 1.00

Table 4.1 – Classifiers' performance over the Drugs.com+DCDB DDI dataset with 5-fold stratified cross-validation.


In terms of AUCROC, Random Forest achieved the highest score of 0.88. In a highly skewed dataset, a high F1-score signifies that the classifier has a low misclassification rate across both labels; Random Forest achieved the highest F1 and Cohen's kappa scores compared to the rest of the classifiers. Random Forest also achieved 100% validation accuracy (Val.Acc).

4.2.2 Random Safe+Drugs.com

In the Random Safe+Drugs.com dataset, combinations of all the drugs from Drugs.com were formed. The drug pairs which have an interaction in Drugs.com were marked as not-safe (1) and the rest of the pairs as safe (0). Figure 4.5 shows the feature distribution plot for DDIs in this dataset. There are 184,378 safe pairs and 35,313 not-safe pairs, i.e., a normalized safe vs not-safe frequency ratio of 0.84:0.16. Figure 4.6 and Table 4.2 show the evaluation and comparison of the classifiers' performance.

Figure 4.5 – Feature distribution plot of safe (left) and not-safe (right) DDI for the Random Safe+Drugs.com dataset, where 0 indicates not-safe DDI and 1 indicates safe DDI


Figure 4.6 – AUCROC plot for random safe vs not-safe DDI from Drugs.com classification using 6 different classifiers with 5-fold stratified cross-validation

Classifier CV Accuracy Kappa Precision Recall F1 Val.Acc

KNeighborsClassifier 5-fold 0.81 0.18 0.62 0.58 0.59 0.28

LogisticRegression 5-fold 0.70 0.24 0.60 0.67 0.60 0.92

SVC 5-fold 0.45 -0.01 0.42 0.49 0.28 0.86

MLPClassifier 5-fold 0.84 0.12 0.72 0.54 0.54 0.07

AdaBoostClassifier 5-fold 0.72 0.24 0.61 0.67 0.61 0.93

RandomForestClassifier 5-fold 0.70 0.25 0.61 0.69 0.61 0.90

Table 4.2 – Classifiers performance over the Random Safe+Drugs.com DDI dataset with 5-fold stratified cross-validation.


In this experiment, the Random Forest classifier scored the highest AUCROC of 0.75. In terms of F1 and Cohen's kappa, almost all the classifiers achieved very low scores. The Random Forest classifier's precision of 0.85 suggests that not all random pairs are safe DDIs.

4.2.3 DCDB + DrugBank

In the DCDB+DrugBank dataset, all the interaction pairs collected from DrugBank were marked as not-safe (1) and all the pairs from the DCDB database as safe (0). Figure 4.7 shows the feature distribution plot for DDI in this dataset. There were 48,896 interacting not-safe pairs and 401 safe pairs, a normalized not-safe vs. safe frequency ratio of 0.991:0.008. Due to the high class imbalance, SMOTE was applied to balance the training dataset and reduce the classification bias.

Figure 4.7 – Feature distribution plot of safe (left) and not-safe (right) DDI for the DCDB+DrugBank dataset, where 0 indicates not-safe DDI and 1 indicates safe DDI


Figure 4.8 – AUCROC plot for DCDB safe vs DrugBank not-safe DDI classification using 6 different classifiers with 5-fold stratified cross-validation

Classifier CV Accuracy Kappa Precision Recall F1 Val.Acc

KNeighborsClassifier 5-Fold 0.97 0.14 0.55 0.62 0.57 0.83

LogisticRegression 5-Fold 0.94 0.12 0.54 0.75 0.55 0.31

SVC 5-Fold 0.94 0.12 0.54 0.74 0.55 0.01

MLPClassifier 5-Fold 0.96 0.14 0.54 0.72 0.56 0.03

AdaBoostClassifier 5-Fold 0.95 0.14 0.54 0.74 0.56 0.77

RandomForestClassifier 5-Fold 0.99 0.20 0.61 0.59 0.60 1.00

Table 4.3 – Classifiers performance over the DCDB+DrugBank DDI dataset with SMOTE balancing and 5-fold stratified cross-validation.


Similar to the previous datasets, Table 4.3 shows that Random Forest performed well, with an average F1-score of 0.59 and 100% validation accuracy.

4.2.4 Comparison with Cheng’s Model

To evaluate the features collected for the model, the features used in Cheng's dataset [34] were compared with those used in the current study. Cheng et al. calculated similarity features based on structural and ATC similarity (DrugBank [19]), target similarity (Therapeutic Target Database, TTD [46]), and AE similarity (MetaADEDB [47]). Using the same set of DDI pairs from Cheng's study, but with the features collected in the current study, the DDI pairs were generated and analyzed.

Figure 4.9 – AUCROC plot using features from Cheng's study with 5-fold stratified cross-validation

Figure 4.10 – AUCROC plot for the collected features in the current study with 5-fold stratified cross-validation


Classifier CV Accuracy Kappa Precision Recall F1
KNeighborsClassifier 5-Fold 0.60 0.20 0.60 0.60 0.60
LogisticRegression 5-Fold 0.61 0.22 0.61 0.61 0.61
SVC 5-Fold 0.61 0.22 0.61 0.61 0.61
MLPClassifier 5-Fold 0.62 0.24 0.62 0.62 0.62
AdaBoostClassifier 5-Fold 0.59 0.19 0.59 0.59 0.59
RandomForestClassifier 5-Fold 0.60 0.20 0.60 0.60 0.60

Table 4.4 – Classifiers performance over the Cheng DDI dataset using 5-fold stratified cross-validation.

Classifier CV Accuracy Kappa Precision Recall F1
KNeighborsClassifier 5-Fold 0.64 0.29 0.64 0.64 0.64
LogisticRegression 5-Fold 0.65 0.30 0.65 0.65 0.65
SVC 5-Fold 0.65 0.31 0.65 0.65 0.65
MLPClassifier 5-Fold 0.66 0.31 0.66 0.66 0.65
AdaBoostClassifier 5-Fold 0.64 0.28 0.64 0.64 0.64
RandomForestClassifier 5-Fold 0.65 0.31 0.65 0.65 0.65

Table 4.5 – Classifiers performance using the collected features with the Cheng DDI pairs on 5-fold stratified cross-validation.

Feature            Cheng Source  Cheng FI scores  Thesis Source  Thesis FI scores
SMILES             DrugBank      0.32             DrugBank       0.14
ATC                DrugBank      0.22             DrugBank       0.06
AE                 MetaADEDB     0.31             ADReCS         0.13
Target             TTD           0.15             CTDBase        0.14
Pathway            -             -                CTDBase        0.09
MeSH               -             -                DrugBank       0.22
Indications        -             -                DrugCentral    0.03
Contraindications  -             -                DrugCentral    0.15

Table 4.6 – Comparison of features used in the Cheng dataset and features used in this thesis, along with the feature importance (FI) scores. Cheng Source is the source of drug-feature data, whereas Thesis Source is the source of features used in this thesis work.


Based on the results of this comparative study (Cheng's study vs. the current study), the features used for computing DDI similarity in the current study appear to contribute to its improved performance, and almost all the classifiers performed well in terms of AUCROC and F1 metrics. While the study design described in Section 4.2.3 is similar to the one used in Cheng's study, there are two principal differences. First, in Cheng's study, random DDIs were categorized as safe instead of using DDI pairs that are reported to be safe. Second, Cheng's model and study design focused on detecting whether two drugs interact or not, whereas the current study focused on both detecting and characterizing the DDIs.

4.2.5 Multi-class Classification for DDI severity Characterization

For the DCDB+Drugs.com severity DDI dataset, all the safe interaction pairs were collected from DCDB and the severity information from Drugs.com. The two sources were merged to form 385 safe, 2,197 minor, 27,297 moderate, and 5,819 major interaction pairs, a normalized frequency ratio of 0.01:0.06:0.76:0.16. SMOTE was applied to balance the dataset, and then one-vs-rest classification was performed using the 6 classifiers.
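A minimal sketch of the one-vs-rest setup on toy data (the class weights and sizes are illustrative, not the thesis dataset):

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

# Toy 4-class problem standing in for safe/minor/moderate/major (0-3)
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5, n_classes=4,
                           weights=[0.05, 0.10, 0.60, 0.25], random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

# One-vs-rest trains one binary classifier per severity class
ovr = OneVsRestClassifier(RandomForestClassifier(random_state=0)).fit(X_res, y_res)
print(ovr.predict(X[:5]))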

Figure 4.11 – Feature distribution plot of DDI severity for DCDB+Drugs.com severity DDI dataset


Figure 4.12 – KNN AUCROC plot for DDI characterization

Figure 4.13 – Logistic Regression AUCROC plot for DDI characterization

Figure 4.14 – SVM AUCROC plot for DDI characterization

Figure 4.15 – Multi-layer Perceptron AUCROC plot for DDI characterization

Figure 4.16 – AdaBoost AUCROC plot for DDI characterization

Figure 4.17 – Random Forest AUCROC plot for DDI characterization

In a multi-class classification setup, the micro-average is preferable since the contributions of all classes are aggregated to compute the average metric, whereas the macro-average computes the metric (e.g., precision) independently for each class and then takes the average. In Figure 4.17, Random Forest has the highest micro-average of 0.91 compared to the rest of the classifiers, with a CredibleMeds validation accuracy of 0.66. Figure 4.18 is the feature importance plot for the Random Forest classifier, where the Indications and ATC features are shown to have high importance.
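A brief illustration of the micro vs. macro distinction on toy multi-class labels (values illustrative):

from sklearn.metrics import precision_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2, 2]

# Micro: pool all decisions, so the dominant class drives the score
print(precision_score(y_true, y_pred, average='micro'))  # 0.80
# Macro: average per-class precision, so rare classes weigh equally
print(precision_score(y_true, y_pred, average='macro'))  # ~0.72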


Figure 4.18 – Feature importance plot for Random Forest DDI characterization

4.2.6 Comparing Original and Predicted DDI using Fisher’s Exact Test

A statistical analysis of the Drugs.com and DCDB DDI datasets was performed using Fisher's exact test to study the frequency of DDI among the ATC classes, retaining class pairs with a p-value below 0.05. Figures 4.19 and 4.20 show the DDI interactions among the ATC classes. The predicted novel DDI pairs were validated against the Physicians' Desk Reference (PDR) [48], and a similar pattern was observed in the predictions for major and safe DDI (Tables 4.6 and 4.7).
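A minimal sketch of the per-class-pair test with scipy (the 2x2 counts below are hypothetical; Appendix D shows how the real contingency table is assembled):

from scipy import stats

# Rows: major vs. not-major; columns: this ATC class pair vs. all other pairs
table = [[12, 40],
         [310, 9200]]
oddsratio, pvalue = stats.fisher_exact(table)
print(oddsratio, pvalue)  # class pairs with p < 0.05 are treated as enriched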


Figure 4.19 – Major DDI enrichment network constructed using Fisher's exact test in the Drugs.com dataset, where the red lines indicate interactions between the ATC classes and node size indicates the number of interactions the class has.


Drug1 Drug2 ATC1 ATC2 DDI
Fluorouracil Tolmetin L01 M01 MAJOR
Acetaminophen Lamotrigine N02 N03 MAJOR
Amphetamine Aripiprazole N06 N05 MAJOR
Amphetamine Buspirone N06 N05 MAJOR
Azithromycin Foscarnet J01 J05 MAJOR
Azithromycin Ritonavir J01 J05 MAJOR
Bumetanide Pentamidine C03 P01 MAJOR
Pindolol Propafenone C07 C01 MAJOR
Dronedarone Flutamide C01 L02 MAJOR
Fingolimod Vorinostat L04 L01 MAJOR

Table 4.6 – ATC enrichment analysis-based predicted major DDI


Figure 4.20 – Safe DDI enrichment using Fisher's exact test in the DCDB dataset.

Drug1 Drug2 ATC1 ATC2 DDI

Amlodipine Cerivastatin C09 C10 SAFE

Telmisartan Acebutolol C09 C07 SAFE

Etodolac Podofilox M01 D06 SAFE

Acetaminophen Tolbutamide N02 V04 SAFE

Aminosalicylic Acid Zidovudine J04 J05 SAFE

Vandetanib Prednisone L01 H02 SAFE

Acetic Acid S02 L04 SAFE

Dexamethasone Lomustine S02 L01 SAFE

Ceftazidime Famotidine J01 A02 SAFE

Topiramate Gabapentin N03 N03 SAFE

Table 4.7 – ATC enrichment analysis-based predicted safe DDI


4.2.7 Predicted vs. Cytochrome Drugs

While most DDIs are due to the inhibition of cytochrome P450 enzymes, an analysis of the predicted drugs found that 38% of the predicted novel contraindications are cytochrome P450 inhibitors. Figure 4.21 is a Venn diagram showing the overlap between the predicted drugs and the cytochrome drugs.
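A minimal sketch of such a diagram, assuming the matplotlib_venn package and hypothetical drug-name sets (the real sets come from the prediction output and a list of known CYP450 inhibitors):

import matplotlib.pyplot as plt
from matplotlib_venn import venn2

predicted = {'drugA', 'drugB', 'drugC', 'drugD'}
cyp_inhibitors = {'drugB', 'drugC', 'drugE'}

venn2([predicted, cyp_inhibitors], set_labels=('Predicted', 'Cytochrome inhibitors'))
plt.show()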

Figure 4.21 – Drugs intersection Venn diagram for predicted and cytochrome drugs.


5 Conclusions and Future Directions

Several in-silico methods have been developed to better understand DDIs through similarity-based and knowledge-based learning methods. Based on the literature review, this is the first time a computational framework has been developed to systematically identify DDIs across multiple datasets and characterize their severity. In this thesis, 7 different drug-feature profiles were integrated by calculating similarities between drugs using the Jaccard index. Six different classifiers were used to identify and characterize the severity of DDIs on multiple datasets. This experimental setup indeed gave better performance in predicting DDIs than more limited sets of features. Based on the results of these experiments, the following conclusions can be drawn:

1. For the 6 machine learning models, 5-fold stratified cross-validation was performed along with hyper-parameter tuning. With consistently high average F1 scores across the different datasets after applying SMOTE, balancing with synthetic samples improved performance significantly.

2. Based on our results, Random Forest seems to be the best model for DDI characterization. Furthermore, SMOTE balancing of the dataset improved the classification performance. Using the CredibleMeds DDI pairs as the validation set, Random Forest was able to predict DDI with a validation accuracy between 0.9 and 1.0 across each of the three datasets and with an accuracy of 0.66 for DDI characterization.

3. Although previous models [34-36] used AUCROC scores to evaluate DDI prediction performance, given the high class imbalance, the F1-score for the binary classification problem and the micro-average ROC score for the multi-class problem were also adopted to evaluate performance. Compared to the other classifiers tested, the Random Forest approach also showed significantly better AUCROC, F1, and Cohen's kappa scores. To further show the applicability of the Random Forest classifier for predicting DDIs, an enrichment analysis was performed on DDIs based on ATC pharmacological class (Level-3) drugs using Fisher's exact test. Comparing the enrichment results of the original and predicted DDIs clearly shows that the predictions of the classifier align with ATC-level DDIs.

4. The feature importance scores of the classifier also reveal that, among the drug features, drug indications and drug ATC features were significantly more important in predicting DDIs. Thus, it can be conjectured that drugs that have the same indication or belong to the same ATC class could result in a DDI. Additionally, a statistical analysis using Fisher's exact test based on ATC class level-3 enrichment of the original DDIs showed that the trend appears to be the same for major and safe DDI in the novel predictions as well.

One limitation of the current study is that there exists no “gold standard” for DDI characterization. As a result, a comparison with previous studies is not currently feasible; the majority of earlier studies focused primarily on predicting whether two drugs interact or not. However, a DDI between two drugs can also be safe. Hence, in this study, the framework was developed to not only predict DDIs but also characterize them. Second, the relatively low ROC scores (Section 4.2.2, the Random Safe+Drugs.com dataset) suggest that considering all random pairs as safe may not be appropriate, since many drug pairs might have a not-safe DDI that has not yet been reported or documented. It would be interesting to see whether there is an inherent difference in the classification scores depending on the drugs or disease studied. In other words, is the classifier's performance dependent on certain specific adverse events, a class of drugs, or all drugs indicated for a specific disease?


Like other in-silico methods, drug feature similarity-based learning techniques suffer from information loss due to missing feature values or class information, resulting in high class imbalance. Exploring feature-level imputation techniques with network topology information, such as network embedding or tensor factorization methods, could potentially improve DDI characterization performance. Since some safe and not-safe combination data are available in the FDA's adverse event reporting system, utilizing that corpus along with the existing heterogeneous data in a multi-modal learning setup could reduce the reliance on synthetic samples and would also be effective in predicting DDIs for approved drugs and those in clinical trials. A learning algorithm predicting safe combinations could also reduce the time and cost of studying a new drug in clinical trial phases. In summary, the framework proposed in this thesis evaluated machine learning models across multiple drug-related datasets, and the results demonstrated that integrating multiple drug features is a potentially efficient strategy to predict and characterize DDIs.


Bibliography

1. Fitzgerald, J.B., B. Schoeberl, U.B. Nielsen, and P.K. Sorger, Systems biology and combination therapy in the quest for clinical efficacy. Nat Chem Biol, 2006. 2(9): p. 458-66.
2. Jia, J., F. Zhu, X. Ma, Z. Cao, Z.W. Cao, Y. Li, Y.X. Li, and Y.Z. Chen, Mechanisms of drug combinations: interaction and network perspectives. Nat Rev Drug Discov, 2009. 8(2): p. 111-28.
3. Sousa, M., A. Pozniak, and M. Boffito, Pharmacokinetics and pharmacodynamics of drug interactions involving rifampicin, rifabutin and antimalarial drugs. J Antimicrob Chemother, 2008. 62(5): p. 872-8.
4. Classen, D.C., S.L. Pestotnik, R.S. Evans, J.F. Lloyd, and J.P. Burke, Adverse drug events in hospitalized patients. Excess length of stay, extra costs, and attributable mortality. JAMA, 1997. 277(4): p. 301-6.
5. Vinh, N.X., J. Epps, and J. Bailey, Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 2010. 11(Oct): p. 2837-2854.
6. Cullen, D.J., B.J. Sweitzer, D.W. Bates, E. Burdick, A. Edmondson, and L.L. Leape, Preventable adverse drug events in hospitalized patients: a comparative study of intensive care and general care units. Crit Care Med, 1997. 25(8): p. 1289-97.
7. Huang, S.M., R. Temple, D.C. Throckmorton, and L.J. Lesko, Drug interaction studies: study design, data analysis, and implications for dosing and labeling. Clin Pharmacol Ther, 2007. 81(2): p. 298-304.
8. Strandell, J., A. Bate, M. Lindquist, and I.R. Edwards, Drug-drug interactions - a preventable patient safety issue? Br J Clin Pharmacol, 2008. 65(1): p. 144-6.
9. Tatonetti, N.P., G.H. Fernald, and R.B. Altman, A novel signal detection algorithm for identifying hidden drug-drug interactions in adverse event reports. J Am Med Inform Assoc, 2012. 19(1): p. 79-85.
10. Gu, Q., C. Dillon, and V. Burt, Prescription drug use continues to increase: US prescription drug data for 2007-2008. NCHS Data Brief, 2010(42): p. 1.
11. Becker, M.L., M. Kallewaard, P.W. Caspers, L.E. Visser, H.G. Leufkens, and B.H. Stricker, Hospitalisations and emergency department visits due to drug-drug interactions: a literature review. Pharmacoepidemiol Drug Saf, 2007. 16(6): p. 641-51.
12. Chertow, G.M., J. Lee, G.J. Kuperman, E. Burdick, J. Horsky, D.L. Seger, R. Lee, A. Mekala, J. Song, A.L. Komaroff, and D.W. Bates, Guided medication dosing for inpatients with renal insufficiency. JAMA, 2001. 286(22): p. 2839-44.
13. Slight, S.P., D.L. Seger, K.C. Nanji, I. Cho, N. Maniam, P.C. Dykes, and D.W. Bates, Are we heeding the warning signs? Examining providers' overrides of computerized drug-drug interaction alerts in primary care. PLoS One, 2013. 8(12): p. e85071.
14. Ogu, C.C. and J.L. Maxa, Drug interactions due to cytochrome P450. Proc (Bayl Univ Med Cent), 2000. 13(4): p. 421-3.
15. Drugs.com. Drug Interaction Checker. [cited 2018]; Available from: https://www.drugs.com/drug_interactions.html.
16. Woosley, R.L., C. Heise, T. Gallo, J. Tate, D. Woosley, and K.A. Romero. QT Drugs List. [cited 2018 09-28-2018]; Available from: www.crediblemeds.org.


17. Liu, Y., Q. Wei, G. Yu, W. Gai, Y. Li, and X. Chen, DCDB 2.0: a major update of the drug combination database. Database (Oxford), 2014. 2014: p. bau124.
18. Davis, A.P., C.J. Grondin, R.J. Johnson, D. Sciaky, B.L. King, R. McMorran, J. Wiegers, T.C. Wiegers, and C.J. Mattingly, The Comparative Toxicogenomics Database: update 2017. Nucleic Acids Res, 2017. 45(D1): p. D972-D978.
19. Wishart, D.S., Y.D. Feunang, A.C. Guo, E.J. Lo, A. Marcu, J.R. Grant, T. Sajed, D. Johnson, C. Li, Z. Sayeeda, N. Assempour, I. Iynkkaran, Y. Liu, A. Maciejewski, N. Gale, A. Wilson, L. Chin, R. Cummings, D. Le, A. Pon, C. Knox, and M. Wilson, DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res, 2018. 46(D1): p. D1074-D1082.
20. Cai, M.C., Q. Xu, Y.J. Pan, W. Pan, N. Ji, Y.B. Li, H.J. Jin, K. Liu, and Z.L. Ji, ADReCS: an ontology database for aiding standardization and hierarchical classification of adverse drug reaction terms. Nucleic Acids Res, 2015. 43(Database issue): p. D907-13.
21. Ursu, O., J. Holmes, J. Knockel, C.G. Bologa, J.J. Yang, S.L. Mathias, S.J. Nelson, and T.I. Oprea, DrugCentral: online drug compendium. Nucleic Acids Res, 2017. 45(D1): p. D932-D939.
22. Chawla, N.V., K.W. Bowyer, L.O. Hall, and W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 2002. 16: p. 321-357.
23. Huang, S.M., J.M. Strong, L. Zhang, K.S. Reynolds, S. Nallani, R. Temple, S. Abraham, S.A. Habet, R.K. Baweja, G.J. Burckart, S. Chung, P. Colangelo, D. Frucht, M.D. Green, P. Hepp, E. Karnaukhova, H.S. Ko, J.I. Lee, P.J. Marroum, J.M. Norden, W. Qiu, A. Rahman, S. Sobel, T. Stifano, K. Thummel, X.X. Wei, S. Yasuda, J.H. Zheng, H. Zhao, and L.J. Lesko, New era in drug interaction evaluation: US Food and Drug Administration update on CYP enzymes, transporters, and the guidance process. J Clin Pharmacol, 2008. 48(6): p. 662-70.
24. Quinney, S.K., X. Zhang, A. Lucksiri, J.C. Gorski, L. Li, and S.D. Hall, Physiologically based pharmacokinetic model of mechanism-based inhibition of CYP3A by clarithromycin. Drug Metab Dispos, 2010. 38(2): p. 241-8.
25. Schelleman, H., W.B. Bilker, C.M. Brensinger, X. Han, S.E. Kimmel, and S. Hennessy, Warfarin with fluoroquinolones, sulfonamides, or azole antifungals: interactions and the risk of hospitalization for gastrointestinal bleeding. Clin Pharmacol Ther, 2008. 84(5): p. 581-8.
26. Duke, J.D., X. Han, Z. Wang, A. Subhadarshini, S.D. Karnik, X. Li, S.D. Hall, Y. Jin, J.T. Callaghan, M.J. Overhage, D.A. Flockhart, R.M. Strother, S.K. Quinney, and L. Li, Literature based drug interaction prediction with clinical assessment using electronic medical records: novel myopathy associated drug interactions. PLoS Comput Biol, 2012. 8(8): p. e1002614.
27. Gottlieb, A., G.Y. Stein, Y. Oron, E. Ruppin, and R. Sharan, INDI: a computational framework for inferring drug interactions and their associated recommendations. Mol Syst Biol, 2012. 8: p. 592.
28. Lu, Y., D. Shen, M. Pietsch, C. Nagar, Z. Fadli, H. Huang, Y.C. Tu, and F. Cheng, A novel algorithm for analyzing drug-drug interactions from MEDLINE literature. Sci Rep, 2015. 5: p. 17357.
29. Zhang, B., Y. Tian, and Z. Zhang, Network biology in medicine and beyond. Circ Cardiovasc Genet, 2014. 7(4): p. 536-47.


30. Wang, Y.Y., K.J. Xu, J. Song, and X.M. Zhao, Exploring drug combinations in genetic interaction network. BMC Bioinformatics, 2012. 13 Suppl 7: p. S7.
31. Lee, K., S. Lee, M. Jeon, J. Choi, and J. Kang, Drug-drug interaction analysis using heterogeneous biological information network, in Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on. 2012, IEEE.
32. Huang, J., C. Niu, C.D. Green, L. Yang, H. Mei, and J.D. Han, Systematic prediction of pharmacodynamic drug-drug interactions through protein-protein-interaction network. PLoS Comput Biol, 2013. 9(3): p. e1002998.
33. Udrescu, L., L. Sbarcea, A. Topirceanu, A. Iovanovici, L. Kurunczi, P. Bogdan, and M. Udrescu, Clustering drug-drug interaction networks with energy model layouts: community analysis and drug repurposing. Sci Rep, 2016. 6: p. 32745.
34. Cheng, F. and Z. Zhao, Machine learning-based prediction of drug-drug interactions by integrating drug phenotypic, therapeutic, chemical, and genomic properties. J Am Med Inform Assoc, 2014. 21(e2): p. e278-86.
35. Zhang, W., Y. Chen, F. Liu, F. Luo, G. Tian, and X. Li, Predicting potential drug-drug interactions by integrating chemical, biological, phenotypic and network data. BMC Bioinformatics, 2017. 18(1): p. 18.
36. Kastrin, A., P. Ferk, and B. Leskosek, Predicting potential drug-drug interactions on topological and semantic similarity features using statistical learning. PLoS One, 2018. 13(5): p. e0196865.
37. Pratt, L.A. and P.N. Danese, More eyeballs on AERS. Nat Biotechnol, 2009. 27(7): p. 601-2.
38. Tatonetti, N.P., P.P. Ye, R. Daneshjou, and R.B. Altman, Data-driven prediction of drug effects and interactions. Sci Transl Med, 2012. 4(125): p. 125ra31.
39. Drugs.com. Drug Interactions Checker. [cited 2018 July 30, 2018]; Available from: https://www.drugs.com/drug_interactions.html.
40. Imming, P., C. Sinning, and A. Meyer, Drugs, their targets and the nature and number of drug targets. Nat Rev Drug Discov, 2006. 5(10): p. 821-34.
41. Hall, G.H. and A.P. Round, Logistic regression--explanation and use. J R Coll Physicians Lond, 1994. 28(3): p. 242-6.
42. Furey, T.S., N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, and D. Haussler, Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 2000. 16(10): p. 906-14.
43. Agatonovic-Kustrin, S. and R. Beresford, Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J Pharm Biomed Anal, 2000. 22(5): p. 717-27.
44. Cohen, J., A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 1960. 20(1): p. 37-46.
45. Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, and V. Dubourg, Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011. 12(Oct): p. 2825-2830.
46. Zhu, F., Z. Shi, C. Qin, L. Tao, X. Liu, F. Xu, L. Zhang, Y. Song, X. Liu, J. Zhang, B. Han, P. Zhang, and Y. Chen, Therapeutic target database update 2012: a resource for facilitating target-oriented drug discovery. Nucleic Acids Res, 2012. 40(Database issue): p. D1128-36.


47. Cheng, F., W. Li, X. Wang, Y. Zhou, Z. Wu, J. Shen, and Y. Tang, Adverse drug events: database construction and in silico prediction. J Chem Inf Model, 2013. 53(4): p. 744-52.
48. PDR.net - Physicians' Desk Reference. [cited 2018]; Available from: https://www.pdr.net/.


Appendices


Appendix A – Drug Feature Construction and Extraction

# Drug feature construction
import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET
import os
import shutil

# Constructs and exports all features to a folder
def export_all_features(map_type='', export_path='raw_features', delete_existing=False):
    print('='*10)
    print('Initializing {} mapper'.format(map_type))
    if delete_existing:
        shutil.rmtree(export_path)
    if not os.path.exists(export_path):
        os.makedirs(export_path)
    mapper = None
    if map_type == 'drugbank':
        # Map drug synonyms to DrugBank identifiers
        drugbank = pd.read_csv('download/drugbank_synonym_v2.txt', sep='\t', encoding='utf-8')
        drugbank = drugbank[['Synonym', 'DBID']]
        drugbank = drugbank.drop_duplicates()
        dbmap = drugbank.set_index("Synonym").DBID
        dbmap.index = dbmap.index.str.lower()
        dbmap = dbmap.loc[~dbmap.index.duplicated(keep='first')]
        mapper = dbmap
    elif map_type == 'cui':
        # Map drug names to UMLS concept identifiers (CUIs)
        cui_df = pd.read_csv('download/revised_cui_v1.txt', sep='\t', encoding='utf-8')
        cui_df = cui_df[['STR', 'CUI_rev']]
        cui_df = cui_df.drop_duplicates()
        cui_map = cui_df.set_index("STR").CUI_rev
        cui_map.index = cui_map.index.str.lower()
        cui_map = cui_map.loc[~cui_map.index.duplicated(keep='first')]
        mapper = cui_map
    else:
        print('Invalid Mapping Option. Choose drugbank or cui')
        exit(0)
    print('Initiating Feature Processing:')
    adrecs(export_path, mapper)
    ctdbase_pathway(export_path, mapper)
    # ctdbase_go(export_path, mapper)
    ctdbase_gene_interaction(export_path, mapper)
    drugbank_mesh(export_path, mapper)
    drug_indications(export_path, mapper)
    drug_contraindications(export_path, mapper)
    atc(export_path, mapper)
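An example invocation is sketched below; it assumes the download/ folder contains the source files referenced above:

export_all_features(map_type='drugbank', export_path='raw_features', delete_existing=False)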

def save_data(df, path, name):
    df.to_csv(path + '/' + name + '_data.txt', sep='\t', index=False)
    print('File Saved: ' + path + '/' + name + '_data.txt')

def ctd_mapper():
    # Map CTD chemical names to DrugBank identifiers
    ctd_chems = pd.read_csv('download/CTD_drugbank.csv', sep='\t')
    ctd_chems = ctd_chems[['ChemicalName', 'DrugBankIDs']]
    ctd_chems = ctd_chems.drop_duplicates()
    ctdmap = ctd_chems.set_index("ChemicalName").DrugBankIDs
    ctdmap.index = ctdmap.index.str.lower()
    ctdmap = ctdmap.loc[~ctdmap.index.duplicated(keep='first')]
    return ctdmap

# Adverse Event feature
def adrecs(export_path, map):
    print('='*10)
    print('Processing ADRECS')
    adrecs = 'download/ADReCS_Drug_info.xml'
    e = ET.parse(adrecs).getroot()
    columns = ['DrugName', 'ADR_TERM', 'ADRECS_ID', 'Indications']
    rows = []
    for drug in e.findall('Drug_BADD'):
        DRUG_NAME = drug.find('DRUG_NAME').text
        INDICATION = drug.find('INDICATIONS').text
        for adr in drug.find('ADRs').findall('ADR'):
            ADR_TERM = adr.find('ADR_TERM').text
            ADRECS_ID = adr.find('ADRECS_ID').text
            row = [DRUG_NAME, ADR_TERM, ADRECS_ID, INDICATION]
            rows.append(row)
    AdrecsDF = pd.DataFrame(rows, columns=columns)
    AdrecsDF = AdrecsDF.drop_duplicates()
    AdrecsDF['drug_id'] = AdrecsDF.DrugName.str.lower().map(map)
    AdrecsDF = AdrecsDF[AdrecsDF['drug_id'].notnull()]
    AdrecsDF = AdrecsDF[['drug_id', 'ADRECS_ID']]
    AdrecsDF.dropna(inplace=True)
    AdrecsDF.drop_duplicates(inplace=True)
    save_data(AdrecsDF, export_path, 'AE')

# Pathway feature
def ctdbase_pathway(export_path, map):
    print('='*10)
    print('Processing CTDBase Pathways')
    drugbank_data = 'download/CTD_chem_pathways_enriched.xml'
    e = ET.parse(drugbank_data).getroot()
    columns = ['ChemicalName', 'ChemicalID', 'PathwayName', 'PathwayID', 'PValue',
               'CorrectedPValue', 'TargetMatchQty', 'TargetTotalQty',
               'BackgroundMatchQty', 'BackgroundTotalQty']
    rows = []
    for drug in e.findall('Row'):
        row = list()
        for c in columns:
            row.append(drug.find(c).text)
        rows.append(row)
    Drug_Pathway_Score = pd.DataFrame(rows, columns=columns)
    # Drug_Pathway_Score = Drug_Pathway_Score[['ChemicalName', 'PathwayID', 'CorrectedPValue']]
    Drug_Pathway_Score.CorrectedPValue = Drug_Pathway_Score.CorrectedPValue.astype(float)
    # Keep only pathway associations with corrected p-value <= 0.05
    Drug_Pathway_Score = Drug_Pathway_Score[Drug_Pathway_Score['CorrectedPValue'] <= 0.05]
    Drug_Pathway_Score.to_csv('download/Drug_Pathway_Score.txt', index=False, sep='\t')
    Drug_Pathway_Score['drug_id'] = Drug_Pathway_Score.ChemicalName.str.lower().map(map)
    Drug_Pathway_Score = Drug_Pathway_Score[['drug_id', 'PathwayID']]
    Drug_Pathway_Score.dropna(inplace=True)
    Drug_Pathway_Score.drop_duplicates(inplace=True)
    save_data(Drug_Pathway_Score, export_path, 'Pathway')

# Target feature
def ctdbase_gene_interaction(export_path, map):
    drugbank_data = 'download/CTD_chem_gene_ixns.xml'
    e = ET.parse(drugbank_data).getroot()
    columns = ['ChemicalName', 'ChemicalID', 'GeneSymbol', 'GeneID', 'Organism']
    rows = []
    for drug in e.findall('Row'):
        row = list()
        for c in columns:
            try:
                row.append(drug.find(c).text)
            except:
                # Keep row lengths aligned with the columns when a field is missing
                row.append(None)
        rows.append(row)
    Drug_gene_Score = pd.DataFrame(rows, columns=columns)
    # Restrict to human gene interactions
    Drug_gene_Score = Drug_gene_Score[Drug_gene_Score['Organism'] == 'Homo sapiens']
    Drug_gene_Score['drug_id'] = Drug_gene_Score.ChemicalName.str.lower().map(map)
    Drug_gene_Score = Drug_gene_Score[Drug_gene_Score['drug_id'].notnull()]
    Drug_gene_Score = Drug_gene_Score[['drug_id', 'GeneID']]
    Drug_gene_Score.dropna(inplace=True)
    Drug_gene_Score.drop_duplicates(inplace=True)
    save_data(Drug_gene_Score, export_path, 'Target')

# ATC feature
def atc(export_path, map):
    drugbank_data = 'download/drugbank_database_3.xml'
    e = ET.parse(drugbank_data).getroot()
    columns = ['drug_id', 'DrugName', 'ATC', 'Class', 'Level']
    rows = []
    ns = '{http://www.drugbank.ca}'
    for i, drug in enumerate(e):
        row = []
        assert drug.tag == ns + 'drug'
        DBID = drug.find(ns + "drugbank-id[@primary='true']").text
        DrugName = drug.find(ns + "name").text
        try:
            for code in drug.find(ns + "atc-codes").findall(ns + 'atc-code'):
                # Get root node code (the full level-5 code for the drug itself)
                atc_code = code.get('code')
                code_cls = DrugName
                row = [DBID, DrugName, atc_code, code_cls, 5]
                rows.append(row)
                for lvl_node in code.findall(ns + 'level'):
                    atc_code = lvl_node.get('code')
                    code_cls = lvl_node.text
                    # ATC code length determines the hierarchy level
                    if len(atc_code) == 1:
                        lvl = 1
                    elif len(atc_code) == 3:
                        lvl = 2
                    elif len(atc_code) == 4:
                        lvl = 3
                    elif len(atc_code) == 5:
                        lvl = 4
                    elif len(atc_code) == 7:
                        lvl = 5
                    row = [DBID, DrugName, atc_code, code_cls, lvl]
                    rows.append(row)
        except:
            continue
    Drugbank = pd.DataFrame(rows, columns=columns)
    Drugbank = Drugbank[Drugbank['drug_id'].notnull()]
    Drugbank = Drugbank[['drug_id', 'ATC']]
    save_data(Drugbank, export_path, 'ATC')

# MeSH feature
def drugbank_mesh(export_path, map):
    print('='*10)
    print('Processing Drugbank Mesh Categories')
    Drugbank = pd.read_table('download/Drugbank_Mesh.txt')
    Drugbank = Drugbank[['drug_id', 'MeshID']]
    Drugbank = Drugbank.dropna()
    Drugbank.drop_duplicates(inplace=True)
    save_data(Drugbank, export_path, 'MeSH')


# Indication feature
def drug_indications(export_path, map):
    print('='*10)
    print('Processing Drugbank Indications')
    drug_indications = pd.read_csv('download/drugbank-indications.txt', sep='\t')
    drug_indications.rename(columns={'identifier': 'drug_id'}, inplace=True)
    drug_indications = drug_indications[['drug_id', 'umls_cui']]
    drug_indications.dropna(inplace=True)
    drug_indications.drop_duplicates(inplace=True)
    save_data(drug_indications, export_path, 'Indications')

# Contraindication feature
def drug_contraindications(export_path, map):
    print('='*10)
    print('Processing Drugbank ContraIndications')
    drug_contraindications = pd.read_csv('download/drugbank-contraindications.txt', sep='\t')
    drug_contraindications.rename(columns={'identifier': 'drug_id'}, inplace=True)
    drug_contraindications = drug_contraindications[['drug_id', 'umls_cui']]
    drug_contraindications.dropna(inplace=True)
    drug_contraindications.drop_duplicates(inplace=True)
    save_data(drug_contraindications, export_path, 'Contraindications')


Appendix B – Calculating Drug-Drug Similarity Using Jaccard Index

# Similarity calculation using the Jaccard index
import pandas as pd
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
import os
from tqdm import tqdm
import shutil

def calculate_jaccard(mat):
    mat = mat.T
    cols_sum = mat.getnnz(axis=0)
    ab = mat.T * mat  # pairwise co-occurrence counts (intersection sizes)
    # for rows
    aa = np.repeat(cols_sum, ab.getnnz(axis=0))
    # for columns
    bb = cols_sum[ab.indices]
    similarities = ab.copy()
    # Jaccard = |A and B| / (|A| + |B| - |A and B|)
    similarities.data = similarities.data / (aa + bb - ab.data)
    return similarities
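A quick sanity check of calculate_jaccard on a toy incidence matrix (illustrative only): two drugs sharing one of three features should score 1/3.

toy = sparse.csr_matrix(np.array([[1, 1, 0],
                                  [0, 1, 1]], dtype=float))
print(calculate_jaccard(toy).todense())  # off-diagonal entries are 0.333...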

def get_jaccard_df(df):
    index = df[df.columns[0]].unique()
    columns = df[df.columns[1]].unique()
    print('drugs:{}'.format(len(index)))
    print('features:{}'.format(len(columns)))
    # Build a binary drug x feature incidence matrix
    binaryDF = pd.DataFrame(index=index, columns=columns).fillna(value=0)
    for _, row in df.iterrows():
        drug = row[df.columns[0]]
        feature = row[df.columns[1]]
        binaryDF.at[drug, feature] = 1
    sparse_mat = sparse.csr_matrix(binaryDF.values.astype(float))
    sim_mat = calculate_jaccard(sparse_mat)
    # Convert the csr matrix back to a dense DataFrame
    sim_mat = sim_mat.todense()
    df = pd.DataFrame(sim_mat, index=binaryDF.index, columns=binaryDF.index)
    return df

def get_tfidf_cos_df(df):
    index = df[df.columns[0]].unique()
    columns = df[df.columns[1]].unique()
    binaryDF = pd.DataFrame(index=index, columns=columns).fillna(value=0)
    for _, row in df.iterrows():
        drug = row[df.columns[0]]
        feature = row[df.columns[1]]
        binaryDF.at[drug, feature] = 1
    # TF-IDF weighting followed by cosine similarity
    tfidf = TfidfTransformer(norm='l2')
    x = tfidf.fit_transform(binaryDF.values.astype(float))
    df = pd.DataFrame(cosine_similarity(x), index=binaryDF.index, columns=binaryDF.index)
    return df

def compute_feature_similarities(folder='raw_features', similarity='jaccard', id_type='drugbank', delete_existing=False):
    '''
    similarity : 'jaccard' or 'tfidf-cosine'
    '''
    print('='*10)
    print('Calculating similarities for {} features'.format(id_type))
    if similarity == 'jaccard':
        output = 'data/jacc_drug_sim_features'
    else:
        output = 'data/tfcos_drug_sim_features'
    if delete_existing:
        shutil.rmtree(output)
    if not os.path.exists(output):
        os.makedirs(output)
    features = os.listdir(folder)
    features = [x for x in features if x.endswith('.csv') or x.endswith('.txt')]
    features = [x for x in features if not x.startswith('_')]
    for filename in features:
        print("Processing : %s" % filename)
        df = pd.read_csv(folder + '/' + filename, sep='\t')
        df.dropna(inplace=True)
        if similarity == 'jaccard':
            df = get_jaccard_df(df)
        else:
            df = get_tfidf_cos_df(df)
        filename = filename.split('_')[0] + '_similarity.csv'
        df.to_csv(output + '/' + filename)
    # Export chemical structure similarity (precomputed Tanimoto scores)
    filename = 'SMILES_similarity_unstacked.csv'
    print('='*10)
    print('Processing: Chemical Structure')
    if id_type == 'drugbank':
        pd.read_csv('download/drugbank_chem_tanimoto_similarity.tsv').to_csv(output + '/' + filename, index=False)
    else:
        pd.read_csv('download/cui_chem_similarity.csv').to_csv(output + '/' + filename, index=False)
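An example invocation is sketched below (it assumes export_all_features has already populated raw_features/):

compute_feature_similarities(folder='raw_features', similarity='jaccard', id_type='drugbank')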


Appendix C – Dataset Construction Using Similarity Scores and DDI Labels

# Constructing datasets using similarity scores and DDI labels
import pandas as pd
import numpy as np
import os
from tqdm import tqdm
import itertools as it

def process_all_features(folder='features', output_file='DrugsMultiClassSeverity', map_type='drugbank', dataset='drugs.com', include_safe=True):
    '''
    Process all features provided in the features folder.
    folder = folder to parse the similarity files
    output_file = output file name
    '''
    def generate_pairs(mapper):
        print('='*10)
        print('Generating Random Pairs')
        if dataset == 'drugs.com':
            print('Using drugs.com data')
            interaction_df = pd.read_csv('download/drugs.com_interactions_1463.csv', sep='\t')
            # Drop combination products and keep only drug-drug interactions
            interaction_df = interaction_df[~(interaction_df.InteractingDrug.str.contains('/')) &
                                            (interaction_df.InteractionType == 'drug')]
            interaction_df['InteractingDrug'] = interaction_df['InteractingDrug'].str.strip()
            interaction_df['drug1'] = interaction_df.DrugName.str.lower().map(mapper)
            interaction_df['drug2'] = interaction_df.InteractingDrug.str.lower().map(mapper)
            interaction_df = interaction_df.dropna()
            # Encode severity labels: minor=1, moderate=2, major=3
            interaction_df['InteractionLevel'] = interaction_df['InteractionLevel'].str.replace('minor', '1')
            interaction_df['InteractionLevel'] = interaction_df['InteractionLevel'].str.replace('moderate', '2')
            interaction_df['InteractionLevel'] = interaction_df['InteractionLevel'].str.replace('major', '3')
        else:
            print('Using drugbank data')
            interaction_df = pd.read_csv('download/drugbank_interactions.csv')
            interaction_df = interaction_df.dropna()
            interaction_df = interaction_df[['drug1', 'drug2', 'InteractionLevel']]
        if include_safe:
            # Safe combinations from DCDB are labeled 0
            dcdb_df = pd.read_table('download/dcdb_mapped.txt', index_col=0)
            dcdb_df['drug1'] = dcdb_df['drug1'].str.lower().map(mapper)
            dcdb_df['drug2'] = dcdb_df['drug2'].str.lower().map(mapper)
            dcdb_df = dcdb_df.dropna()
            dcdb_df['InteractionLevel'] = 0
            dcdb_df = dcdb_df[['drug1', 'drug2', 'InteractionLevel']]
            dcdb_df = restack(dcdb_df, target_col_name='InteractionLevel')
            dcdb_df = dcdb_df.drop_duplicates()
            interaction_df = pd.concat([dcdb_df, interaction_df])
        interaction_df = interaction_df.drop_duplicates()
        drugs = set(interaction_df.drug1.unique().tolist())
        drugs = drugs.union(set(interaction_df.drug2.unique().tolist()))
        drugs = list(drugs)
        # Generate all pairwise combinations; unlabeled random pairs get -1
        drug_random_combinations = list(it.combinations(drugs, 2))
        random_interaction_df = pd.DataFrame(np.array(drug_random_combinations), columns=['drug1', 'drug2'])
        random_interaction_df = pd.merge(interaction_df, random_interaction_df, how='outer', left_on=['drug1', 'drug2'], right_on=['drug1', 'drug2'])
        random_interaction_df['InteractionLevel'] = random_interaction_df['InteractionLevel'].fillna(-1)
        print(random_interaction_df['InteractionLevel'].value_counts())
        return random_interaction_df

    def get_mapper():
        if map_type == 'drugbank':
            print('Using Drugbank Mapper')
            drugbank = pd.read_csv('download/drugbank_synonym_v2.txt', sep='\t', encoding='utf-8')
            drugbank = drugbank[['Synonym', 'DBID']]
            drugbank = drugbank.drop_duplicates()
            dbmap = drugbank.set_index("Synonym").DBID
            dbmap.index = dbmap.index.str.lower()
            dbmap = dbmap.loc[~dbmap.index.duplicated(keep='first')]
            mapper = dbmap
        elif map_type == 'cui':
            print('Using CUI Mapper')
            cui_df = pd.read_csv('download/revised_cui_v1.txt', sep='\t', encoding='utf-8')
            cui_df = cui_df[['STR', 'CUI_rev']]
            cui_df = cui_df.drop_duplicates()
            cui_map = cui_df.set_index("STR").CUI_rev
            cui_map.index = cui_map.index.str.lower()
            cui_map = cui_map.loc[~cui_map.index.duplicated(keep='first')]
            mapper = cui_map
        return mapper

    def restack(df, target_col_name='value'):
        # A square frame is an unstacked similarity matrix; stack it to long form
        if df.shape[0] == df.shape[1]:
            df = df.stack().reset_index()
        columns = df.columns
        df.rename(columns={columns[0]: 'drug1', columns[1]: 'drug2', columns[2]: target_col_name}, inplace=True)
        # Mirror the pairs so (drug1, drug2) and (drug2, drug1) both exist
        test = df.copy(deep=False)
        test.rename(columns={'drug1': 'drug2', 'drug2': 'drug1'}, inplace=True)
        df = pd.concat([df, test])
        return df

    def getTotalDrugs(df):
        drug1 = df.drug1.unique().tolist()
        drug2 = df.drug2.unique().tolist()
        drugs = set(drug1 + drug2)
        drugs = list(drugs)
        return len(drugs)

    def remove_mirror_duplicates(df):
        # Sort each pair alphabetically and keep the first occurrence
        return df.loc[pd.DataFrame(np.sort(df[['drug1', 'drug2']], 1), index=df.index).drop_duplicates(keep='first').index]

    mapper = get_mapper()
    # Get drugs.com and random interaction pairs
    drugsDF = generate_pairs(mapper)
    drugsDF = drugsDF[(drugsDF['drug1'].notnull()) & (drugsDF['drug2'].notnull())]
    features = os.listdir(folder)
    features = [x for x in features if x.endswith('.csv') or x.endswith('.txt')]
    features = [x for x in features if not x.startswith('_')]
    combinedSimilarity = None
    prev_df_name = curr_df_name = ''
    pbar = tqdm(features)
    for feature_path in pbar:
        curr_df_name = feature_path.split('_')[0]
        pbar.set_description("Processing Feature: %s" % curr_df_name)
        if '_unstacked' in feature_path:
            similarity_df = pd.read_csv(folder + '/' + feature_path)
        else:
            similarity_df = pd.read_csv(folder + '/' + feature_path, index_col=0)
            similarity_df = restack(similarity_df)
        if combinedSimilarity is None:
            combinedSimilarity = similarity_df.copy(deep=False)
            prev_df_name = feature_path.split('_')[0]
        else:
            # Outer-join each feature's similarity onto the combined frame
            combinedSimilarity = pd.merge(combinedSimilarity, similarity_df, how='outer', left_on=['drug1', 'drug2'], right_on=['drug1', 'drug2'])
            combinedSimilarity = combinedSimilarity.dropna()
            if 'value_x' in combinedSimilarity.columns:
                combinedSimilarity.rename(columns={'value_x': prev_df_name, 'value_y': curr_df_name}, inplace=True)
            else:
                combinedSimilarity.rename(columns={'value': curr_df_name}, inplace=True)
        if combinedSimilarity.shape[0] == 0:
            print('Error')
            exit(0)

    # Create the CredibleMeds validation dataset
    credibleMeds = pd.read_csv('data/credibleMeds.csv')
    credibleMeds['drug1'] = credibleMeds.object.str.lower().map(mapper)
    credibleMeds['drug2'] = credibleMeds.precipitant.str.lower().map(mapper)
    credibleMeds = credibleMeds[['drug1', 'drug2', 'Interaction']]
    credibleMeds = credibleMeds.dropna()
    credibleMeds = remove_mirror_duplicates(credibleMeds)
    credibleMeds['source'] = 'cm'
    credibleMeds = pd.merge(combinedSimilarity, credibleMeds, how='right', left_on=['drug1', 'drug2'], right_on=['drug1', 'drug2'])
    credibleMeds = remove_mirror_duplicates(credibleMeds.dropna())
    del credibleMeds['source']
    credibleMeds = credibleMeds.dropna()
    credibleMeds.to_csv('validation/credibleMeds_validate.csv')
    print('Validation set: CredibleMeds with features exported!')

    finalMatrix = pd.merge(combinedSimilarity, drugsDF, how='outer', left_on=['drug1', 'drug2'], right_on=['drug1', 'drug2'])
    finalMatrix['InteractionLevel'] = finalMatrix['InteractionLevel'].fillna(-1)
    finalMatrix = finalMatrix.drop_duplicates()
    finalMatrix = remove_mirror_duplicates(finalMatrix)
    # Remove the validation set from the final matrix
    cm = pd.read_csv('data/credibleMeds.csv')
    cm = cm[['object_drugbank', 'precipitant_drugbank']]
    cm['source'] = 'cm'
    cm.rename(columns={'object_drugbank': 'drug1', 'precipitant_drugbank': 'drug2'}, inplace=True)
    common = finalMatrix.merge(cm, on=['drug1', 'drug2'])
    finalMatrix = finalMatrix[(~finalMatrix.drug1.isin(common.drug1)) & (~finalMatrix.drug2.isin(common.drug2))]
    finalMatrix = finalMatrix.dropna()
    uniqueDrugs = getTotalDrugs(drugsDF)
    uniqueDrugsWFeatures = getTotalDrugs(finalMatrix)
    finalMatrix.to_csv('data/' + output_file + '.csv', index=False)
    print('Train Set Exported!')
    print('Actual Unique Drugs: {}'.format(uniqueDrugs))
    print('Current Unique Drugs after including features: {}'.format(uniqueDrugsWFeatures))
    print('Interactions:')
    print(finalMatrix.InteractionLevel.value_counts(dropna=False))
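An example invocation is sketched below; the folder argument points at the similarity files produced in Appendix B and is illustrative:

process_all_features(folder='data/jacc_drug_sim_features',
                     output_file='DrugsMultiClassSeverity',
                     map_type='drugbank', dataset='drugs.com', include_safe=True)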


Appendix D – Data Analysis, Machine Learning-based Classification and Visualization

# DDI classification using ML algorithms
from tqdm import tqdm
import pickle
from collections import Counter

# Data management tools
import pandas as pd
import numpy as np

# Preprocessing
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.preprocessing import label_binarize

# Data balance
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.datasets import make_imbalance

# Model selection
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import train_test_split
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import StratifiedKFold

# Classifiers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier

# Hyper-parameter tuning
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer, matthews_corrcoef

# Metrics
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score, precision_score, recall_score, f1_score, precision_recall_curve
from sklearn.metrics import precision_recall_fscore_support, fbeta_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve, auc, average_precision_score
import scipy.stats as stats

# Plotting & visualization
import seaborn as sns
import matplotlib.pyplot as plt

def get_train_data(dataset, binarize=False, scale_features=None, include_random=False):
    dataset = dataset.dropna()
    # Feature columns sit between the two drug-id columns and the label
    indi = slice(2, len(dataset.columns) - 1)
    if not include_random:
        dataset = dataset[dataset.InteractionLevel > -1]
    if binarize:
        # Safe (0) stays 0; minor/moderate/major collapse to 1 (not-safe)
        dataset['InteractionLevel'] = np.where(dataset['InteractionLevel'] <= 0, 0, 1)
    if scale_features:
        scaler = MinMaxScaler()
        for feature in scale_features:
            dataset[feature] = scaler.fit_transform(dataset[[feature]])
    X = dataset.iloc[:, indi].values
    y = dataset.iloc[:, -1].values
    return X, y, dataset

def plot_features_with_labels(df, binarize=False, figsize=(20, 10)):
    df = df[df.InteractionLevel > -1]
    indi = slice(2, len(df.columns))
    cols = df.columns[indi]
    df = df[cols]
    if binarize:
        df['InteractionLevel'] = np.where(df['InteractionLevel'] == 0, 1, 0)
    indi = slice(0, len(df.columns) - 1)
    # Melt to long form so each feature becomes a grouped boxplot
    dd = pd.melt(df, id_vars=['InteractionLevel'], value_vars=df.columns[indi], var_name='Features')
    plt.figure(figsize=figsize)
    sns.boxplot(x='InteractionLevel', y='value', data=dd, hue='Features')
    plt.show()

def plot_classification_report(y_tru, y_prd, figsize=(5, 5), ax=None, y_labels=['Safe', 'Interaction']):
    plt.clf()
    plt.figure(figsize=figsize)
    xticks = ['precision', 'recall', 'f1-score', 'support']
    yticks = y_labels
    yticks += ['avg']
    rep = np.array(precision_recall_fscore_support(y_tru, y_prd)).T
    avg = np.mean(rep, axis=0)
    avg[-1] = np.sum(rep[:, -1])  # the support column is summed, not averaged
    rep = np.insert(rep, rep.shape[0], avg, axis=0)
    sns.heatmap(rep,
                annot=True,
                cbar=False,
                xticklabels=xticks,
                yticklabels=yticks,
                fmt='g',
                linewidths=.5,
                ax=ax)
    plt.show()


def plot_classifiers_auroc(classifiers, X, y, imbalance=True, plt_name=''):
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                        random_state=42,
                                                        test_size=0.25)
    if imbalance:
        # Oversample only the training split so no synthetic samples leak into the test set
        oversampler = SMOTE(random_state=0, kind='svm')
        X_train, y_train = oversampler.fit_sample(X_train, y_train)
    pbar = tqdm(classifiers)
    for clf in pbar:
        name = clf.__class__.__name__
        pbar.set_description("Training using %s" % name)
        clf.fit(X_train, y_train)
        train_predictions = clf.predict(X_test)
        # Compute fpr, tpr, thresholds and ROC AUC
        fpr, tpr, thresholds = roc_curve(y_test, train_predictions)
        roc_auc = roc_auc_score(y_test, train_predictions)
        # Plot ROC curve
        plt.plot(fpr, tpr, label=name + ' = %0.3f' % roc_auc)
    plt.plot([0, 1], [0, 1], 'k--')  # random predictions curve
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate or (1 - Specificity)')
    plt.ylabel('True Positive Rate or (Sensitivity)')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.savefig('saves/' + plt_name + '.png')
    plt.show()

def feature_correlation_matrix(df):
    indi = slice(2, len(df.columns) - 1)
    corr_matrix = df.iloc[:, indi].corr().abs()
    # upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
    # to_drop = [column for column in upper.columns if any(upper[column] > 0.80)]
    return corr_matrix

def plot_classifiers_pr(classifiers, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                        random_state=42,
                                                        test_size=0.25)
    oversampler = SMOTE(random_state=0, kind='svm')
    X_train, y_train = oversampler.fit_sample(X_train, y_train)
    pbar = tqdm(classifiers)
    for clf in pbar:
        name = clf.__class__.__name__
        pbar.set_description("Training using %s" % name)
        clf.fit(X_train, y_train)
        train_predictions = clf.predict(X_test)
        # Compute precision-recall pairs for each threshold
        precision, recall, _ = precision_recall_curve(y_test, train_predictions)
        average_recall = recall_score(y_test, train_predictions)
        # Plot precision-recall curve
        plt.plot(recall, precision, label=name + ' = %0.3f' % average_recall)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.05])
    plt.ylim([0.0, 1.05])
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision Recall Curve')
    plt.legend(loc="lower left")
    plt.show()

def overall_average_score(actual, prediction):
    precision, recall, f1_score, _ = precision_recall_fscore_support(
        actual, prediction, average='binary')
    # Average of MCC, accuracy, precision, recall, and F1
    total_score = (matthews_corrcoef(actual, prediction) +
                   accuracy_score(actual, prediction) + precision + recall + f1_score)
    return total_score / 5

def randomsearch_parameters(model, random_grid, cv, X, y, binarize=False, iterations=10, imbalance=True):
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                        random_state=42,
                                                        test_size=0.25)
    if imbalance:
        oversampler = SMOTE(random_state=0, kind='svm')
        X_train, y_train = oversampler.fit_sample(X_train, y_train)
    if binarize:
        # grid_scorer = make_scorer(fbeta_score, beta=2, greater_is_better=True)
        grid_scorer = make_scorer(roc_auc_score, greater_is_better=True)
    else:
        # y_train = label_binarize(y_train, classes=[0, 1, 2, 3])
        # y_test = label_binarize(y_test, classes=[0, 1, 2, 3])
        grid_scorer = make_scorer(f1_score, average='macro')
    clf = RandomizedSearchCV(estimator=model, param_distributions=random_grid, n_iter=iterations,
                             cv=cv, verbose=0, refit=True, scoring=grid_scorer, n_jobs=-1)
    # Fit the random search model
    clf.fit(X_train, y_train)
    return clf.best_estimator_

def predict_dataset(clf, dataset, mapper='drugbank', binarize=False):
    if mapper == 'drugbank':
        drugbank = pd.read_csv('download/drugbank_synonym_v2.txt', sep='\t', encoding='utf-8')
        drugbank = drugbank[['Synonym', 'DBID']]
        drugbank = drugbank.drop_duplicates()
        dbmap = drugbank.set_index("DBID").Synonym
        dbmap.index = dbmap.index.str.lower()
        dbmap = dbmap.loc[~dbmap.index.duplicated(keep='first')]
        mapper = dbmap
    else:
        cui_df = pd.read_csv('data/revised_cui_v1.txt', sep='\t', encoding='utf-8')
        cui_df = cui_df[['STR', 'CUI_rev']]
        cui_df = cui_df.drop_duplicates()
        cui_map = cui_df.set_index("CUI_rev").STR
        cui_map.index = cui_map.index.str.lower()
        cui_map = cui_map.loc[~cui_map.index.duplicated(keep='first')]
        mapper = cui_map
    indi = slice(2, len(dataset.columns) - 1)
    validate = dataset.iloc[:, indi].values
    output = dataset.copy()
    classifierName = clf.__class__.__name__
    if binarize:
        predictedClass = clf.predict(validate)
    else:
        # Multi-class: take the class with the highest predicted probability
        predictedClass = clf.predict_proba(validate)
        predictedClass = np.argmax(predictedClass, axis=1)
    output[classifierName + '_class'] = predictedClass
    output['DB1'] = output.drug1.str.lower().map(mapper)
    output['DB2'] = output.drug2.str.lower().map(mapper)
    return output

def validate_with_crediblemeds(classifiers, binarize=False, file_name='val_xyz'):
    rows = []
    for clf in classifiers:
        clf_name = clf.__class__.__name__
        credibleMeds = pd.read_csv('validation/credibleMeds_validate.csv', index_col=0)
        if binarize:
            credibleMeds['Interaction'].replace([1, 2], [1, 1], inplace=True)
        # Pass binarize by keyword so it is not mistaken for the mapper argument
        cm_validation = predict_dataset(clf, credibleMeds, binarize=binarize)
        if not binarize:
            # Shift the severity labels so they align with the multi-class encoding
            cm_validation['Interaction'] = cm_validation['Interaction'].apply(lambda x: x + 1)
        cm_test = cm_validation['Interaction'].values
        cm_predictions = cm_validation[clf_name + '_class'].values
        acc = accuracy_score(cm_test, cm_predictions)
        print('{} : {}'.format(clf_name, acc))
        rows.append([clf_name, acc])
    pd.DataFrame(rows, columns=['Classifier', 'Accuracy']).to_csv('saves/' + file_name + '.csv', index=False)

def save_model(clf, binarize=False):
    clf_name = clf.__class__.__name__
    if binarize:
        clf_name = clf_name + '-binary.sav'
    else:
        clf_name = clf_name + '-severity.sav'
    pickle.dump(clf, open('models/' + clf_name, 'wb'))
    print('Model saved at : models/' + clf_name)

def load_model(filename):
    return pickle.load(open(filename, 'rb'))

def find_optimal_threshold(target, predicted):
    """ Find the optimal probability cutoff point for a classification model related to event rate
    Parameters
    ----------
    target : Matrix with dependent or target data, where rows are observations
    predicted : Matrix with predicted data, where rows are observations
    Returns
    -------
    list type, with optimal cutoff value
    """
    fpr, tpr, threshold = roc_curve(target, predicted)
    i = np.arange(len(tpr))
    roc = pd.DataFrame({'tf': pd.Series(tpr - (1 - fpr), index=i), 'threshold': pd.Series(threshold, index=i)})
    # Pick the threshold where tpr is closest to 1 - fpr
    roc_t = roc.iloc[(roc.tf - 0).abs().argsort()[:1]]
    return roc_t['threshold'].values[0]


def train_dataset(clf, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                        random_state=42,
                                                        test_size=0.30)
    y_test_dbn = pd.DataFrame(pd.Series(y_test), columns=['dbn'])
    oversampler = SMOTE(random_state=0, kind='svm')
    print('Original dataset shape {}'.format(Counter(y_train)))
    X_train, y_train = oversampler.fit_sample(X_train, y_train)
    print('Resampled dataset shape {}'.format(Counter(y_train)))
    optimised_classifiers = []
    if type(clf) != list:
        optimised_classifiers.append(clf)
    else:
        optimised_classifiers = clf
    rows = []
    for clf in optimised_classifiers:
        clf.fit(X_train, y_train)
        classifierName = clf.__class__.__name__
        train_predictions = clf.predict(X_test)
        # train_predictions = train_predictions[:, 1]
        # opt_threshold = find_optimal_threshold(y_test, train_predictions)
        # vfunc = np.vectorize(lambda x: 1 if x > opt_threshold else 0)
        # train_predictions = vfunc(train_predictions)
        acc = accuracy_score(y_test, train_predictions)
        ckappa = cohen_kappa_score(y_test, train_predictions)
        roc = roc_auc_score(y_test, train_predictions)
        pr = f1_score(y_test, train_predictions, average='macro')
        rows.append([classifierName, acc, ckappa, roc, pr])
    return pd.DataFrame(rows, columns=['Classifier', 'Accuracy', 'CohensKappa', 'AUCROC', 'AUPR'])

def stratifiedKFold_train(optimised_classifiers, X, y, fold=5, imbalance=False, file_name='xyz'):
    skf = StratifiedKFold(n_splits=fold)
    rows = []
    for clf in optimised_classifiers:
        classifierName = clf.__class__.__name__
        acc = ckappa = roc = recall = precision = f1 = 0
        fpr = tpr = 0
        print('Processing:{}'.format(classifierName))
        for i, (train_index, test_index) in enumerate(skf.split(X, y)):
            X_train, X_test = X[train_index], X[test_index]
            y_train, y_test = y[train_index], y[test_index]
            if imbalance:
                # Oversample the training folds only
                oversampler = SMOTE(random_state=0, kind='svm')
                X_train, y_train = oversampler.fit_sample(X_train, y_train)
            clf.fit(X_train, y_train)
            train_predictions = clf.predict(X_test)
            # Accumulate metrics across folds, then average
            acc += accuracy_score(y_test, train_predictions)
            ckappa += cohen_kappa_score(y_test, train_predictions)
            roc += roc_auc_score(y_test, train_predictions, average='macro')
            f1 += f1_score(y_test, train_predictions, average='macro')
            recall += recall_score(y_test, train_predictions, average='macro')
            precision += precision_score(y_test, train_predictions, average='macro')
            f, t, thresholds = roc_curve(y_test, train_predictions)
            fpr += f
            tpr += t
        plt.plot(fpr / fold, tpr / fold, label=classifierName + ' = %0.3f' % (roc / fold))
        rows.append([classifierName, acc / fold, ckappa / fold, roc / fold, precision / fold, recall / fold, f1 / fold])
    df = pd.DataFrame(rows, columns=['Classifier', 'Accuracy', 'CohensKappa', 'AUCROC', 'Precision', 'Recall', 'F1'])
    df.to_csv('saves/' + file_name + '.csv', index=False)
    plt.plot([0, 1], [0, 1], 'k--')  # random predictions curve
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate or (1 - Specificity)')
    plt.ylabel('True Positive Rate or (Sensitivity)')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.savefig('saves/' + file_name + '.png')
    plt.show()
    print('plot saved at saves/' + file_name + '.png')
    return df

def getTotalDrugs(df):

drug1 = df.drug1.unique().tolist()

drug2 = df.drug2.unique().tolist()

drugs = set(drug1+drug2)

drugs = list(drugs)

return drugs def get_drugscom_contingency_table(pair):

    # Keep only rows with ATC codes for both drugs and a known interaction level
    atc_df = drugs_df[(drugs_df.atc_drug1.notnull()) & (drugs_df.atc_drug2.notnull()) &
                      (drugs_df.InteractionLevel > -1)]
    # n: major-severity interactions (InteractionLevel == 2) within this ATC category pair
    n = atc_df[((atc_df.atc_drug1 == pair[0]) & (atc_df.atc_drug2 == pair[1]) &
                (atc_df.InteractionLevel == 2)) |
               ((atc_df.atc_drug2 == pair[0]) & (atc_df.atc_drug1 == pair[1]) &
                (atc_df.InteractionLevel == 2))].shape[0]
    # N: non-major interactions within this ATC category pair
    N = atc_df[((atc_df.atc_drug1 == pair[0]) & (atc_df.atc_drug2 == pair[1]) &
                (atc_df.InteractionLevel < 2)) |
               ((atc_df.atc_drug2 == pair[0]) & (atc_df.atc_drug1 == pair[1]) &
                (atc_df.InteractionLevel < 2))].shape[0]
    # r: major interactions outside this category pair (note: the conjunction of
    # inequalities also excludes pairs sharing only one of the two ATC classes)
    r = atc_df[((atc_df.atc_drug1 != pair[0]) & (atc_df.atc_drug2 != pair[1]) &
                (atc_df.InteractionLevel == 2)) &
               ((atc_df.atc_drug2 != pair[0]) & (atc_df.atc_drug1 != pair[1]) &
                (atc_df.InteractionLevel == 2))][['atc_drug1', 'atc_drug2']].shape[0]
    # R: non-major interactions outside this category pair
    R = atc_df[((atc_df.atc_drug1 != pair[0]) & (atc_df.atc_drug2 != pair[1]) &
                (atc_df.InteractionLevel < 2)) &
               ((atc_df.atc_drug2 != pair[0]) & (atc_df.atc_drug1 != pair[1]) &
                (atc_df.InteractionLevel < 2))][['atc_drug1', 'atc_drug2']].shape[0]
    # 2x2 table: rows = this pair vs. other pairs, columns = major vs. not-major
    rows = [[n, N], [r, R]]
    index = [','.join(pair) + ' Pair', 'Other Pairs']
    columns = ['major', 'not-major']
    oddsratio, pvalue = stats.fisher_exact([[n, N], [r, R]])
    print('P-val:{}'.format(pvalue))
    return pd.DataFrame(rows, columns=columns, index=index)
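fisher_exact tests whether major-severity DDIs are enriched in the given ATC category pair relative to all other pairs. A hypothetical call (the ATC codes below are illustrative placeholders, not results from the thesis data):

# Illustrative only: requires the global drugs_df with ATC annotations loaded
table = get_drugscom_contingency_table(('C03', 'C09'))
print(table)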


Appendix E – Evaluating Classifiers’ Performance Across Multiple Datasets

import pandas as pd
import numpy as np
from DrugFeatures import *
from ComputeSimilarity import *
from FeaturePreprocess import *
from InteractionModel import *

if __name__ == "__main__":
    # Dataset 1: Drugs.com severity labels, without randomly sampled negative pairs
    dataset = pd.read_csv('data/DrugsMultiClassSeverity.csv')
    X, y, dataset = get_train_data(dataset, binarize=True, include_random=False)

    # Each entry maps a classifier name to [estimator, hyperparameter search space]
    classifiers = {
        'KNeighborsClassifier': [KNeighborsClassifier(),
                                 dict(n_neighbors=list(range(1, 31)))],
        'LogisticRegression': [LogisticRegression(class_weight='balanced', solver='sag',
                                                  penalty='l2', max_iter=1000, n_jobs=-1),
                               dict(tol=[1e-4, 1e-3, 1e-2],
                                    C=[0.001, 0.01, 0.1, 1, 10, 100, 1000])],
        'SVM': [svm.SVC(max_iter=-1, probability=True),
                dict(C=[1, 10, 100], tol=[1e-4, 1e-3, 1e-2], kernel=['rbf', 'linear'])],
        'MLPClassifier': [MLPClassifier(max_iter=1000),
                          dict(hidden_layer_sizes=range(50, 180, 10), alpha=[0.01, 0.1, 1])],
        'AdaBoostClassifier': [AdaBoostClassifier(
                                   base_estimator=LogisticRegression(class_weight='balanced',
                                                                     solver='sag', penalty='l2',
                                                                     max_iter=1000, n_jobs=-1)),
                               dict(n_estimators=range(20, 400, 10),
                                    learning_rate=[0.001, 0.01, 0.1, 1.0])],
        'RandomForestClassifier': [RandomForestClassifier(class_weight='balanced'),
                                   dict(n_estimators=[int(x) for x in np.linspace(start=10, stop=400, num=10)],
                                        max_depth=[int(x) for x in np.linspace(10, 110, num=11)],
                                        min_samples_split=[2, 5, 10],
                                        min_samples_leaf=[1, 2, 4],
                                        bootstrap=[True, False])],
    }

    optimised_classifiers = []
    pbar = tqdm(classifiers)
    for clf in pbar:
        pbar.set_description("Optimising %s" % clf)
        opt_clf = randomsearch_parameters(classifiers[clf][0], classifiers[clf][1], 5, X, y,
                                          binarize=True, iterations=10, imbalance=True)
        optimised_classifiers.append(opt_clf)
    save_object(optimised_classifiers, 'dataset1')

    dataset1_classifiers = pickle.load(open('models/dataset1.sav', 'rb'))
    stratifiedKFold_train(dataset1_classifiers, X, y, imbalance=True, file_name='dataset1')
    validate_with_crediblemeds(dataset1_classifiers, binarize=True, file_name='dataset1_cm_val')

    # Dataset 2: same labels, with randomly sampled negative pairs included
    random_dataset = pd.read_csv('data/DrugsMultiClassSeverity.csv')
    X, y, random_dataset = get_train_data(random_dataset, binarize=True, include_random=True)

    optimised_classifiers = []
    pbar = tqdm(classifiers)
    for clf in pbar:
        pbar.set_description("Optimising %s" % clf)
        opt_clf = randomsearch_parameters(classifiers[clf][0], classifiers[clf][1], 5, X, y,
                                          binarize=True, iterations=10, imbalance=True)
        optimised_classifiers.append(opt_clf)
    save_object(optimised_classifiers, 'dataset2')

    dataset2_classifiers = pickle.load(open('models/dataset2.sav', 'rb'))
    stratifiedKFold_train(dataset2_classifiers, X, y, imbalance=True, file_name='dataset2')
    validate_with_crediblemeds(dataset2_classifiers, binarize=True, file_name='dataset2_cm_val')

    # Dataset 3: alternate severity file (db_ prefix), with randomly sampled negative pairs
    random_dataset = pd.read_csv('data/db_DrugsMultiClassSeverity.csv')
    X, y, random_dataset = get_train_data(random_dataset, binarize=True, include_random=True)

    optimised_classifiers = []
    pbar = tqdm(classifiers)
    for clf in pbar:
        pbar.set_description("Optimising %s" % clf)
        opt_clf = randomsearch_parameters(classifiers[clf][0], classifiers[clf][1], 5, X, y,
                                          binarize=True, iterations=10, imbalance=True)
        optimised_classifiers.append(opt_clf)
    save_object(optimised_classifiers, 'dataset3')

    dataset3_classifiers = pickle.load(open('models/dataset3.sav', 'rb'))
    stratifiedKFold_train(dataset3_classifiers, X, y, imbalance=True, file_name='dataset3')
    validate_with_crediblemeds(dataset3_classifiers, binarize=True, file_name='dataset3_cm_val')
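randomsearch_parameters is a project helper imported from the modules above. For reference, a minimal sketch of an equivalent search built directly on scikit-learn; the macro-F1 scoring choice is an assumption:

from sklearn.model_selection import RandomizedSearchCV

def randomsearch_sketch(estimator, param_space, folds, X, y, iterations=10):
    # Sample `iterations` random hyperparameter settings from param_space and
    # score each with stratified k-fold cross-validation (macro F1 assumed)
    search = RandomizedSearchCV(estimator, param_distributions=param_space,
                                n_iter=iterations, cv=folds, scoring='f1_macro')
    search.fit(X, y)
    return search.best_estimator_

Unlike the project helper, this sketch does not oversample within each fold; combining SMOTE with the estimator in an imblearn Pipeline would be needed to reproduce the imbalance=True behaviour.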
