
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020

Polypharmacy Side Effect Prediction with Graph Convolutional Neural Network based on Heterogeneous Structural and Biological Data

JUAN SEBASTIAN DIAZ BOADA

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ENGINEERING SCIENCES


Degree Projects in Scientific Computing (30 ECTS credits)
Master’s Programme in Computer Simulations for Science and Engineering
KTH Royal Institute of Technology, year 2020
Supervisor at KI Algorithmic Dynamics Lab, Center for Molecular Medicine: Narsis A. Kiani
Supervisor at KTH: Michael Hanke
Examiner at KTH: Michael Hanke

TRITA-SCI-GRU 2020:390 MAT-E 2020:097

Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Acknowledgements

This thesis and its experiments were carried out in the Algorithmic Dynamics Lab of the Center for Molecular Medicine. Special thanks to Amir Amanzadi for creating the affinity score dataset, Jesper Tegnér for his comments while analyzing the results, and Linus Johnson for his help with the Swedish translation.


Abstract

The prediction of polypharmacy side effects is crucial to reduce the mortality and morbidity of patients suffering from complex diseases. However, determining these side effects experimentally is infeasible due to the many possible drug combinations, leaving in silico tools as the most promising way of addressing this problem. This thesis improves the performance and robustness of a state-of-the-art graph convolutional network designed to predict polypharmacy side effects by feeding it with complexity properties of the drug-protein network. The modifications also involve the creation of a direct pipeline to reproduce the results and test it with different datasets.

Summary (Sammanfattning)

Prediction of side effects from polypharmacy with graph convolutional neural networks based on heterogeneous structural and biological data

To reduce mortality and morbidity in patients suffering from complex diseases, it is crucial to be able to predict side effects from polypharmacy. Predicting the side effects experimentally is, however, infeasible because of the large number of possible drug combinations, which leaves in silico tools as the most promising way to solve this problem. This work improves the performance and robustness of one of the latest graph convolutional networks designed to predict side effects from polypharmacy, by feeding it with the complexity properties of the drug-protein network. The changes also involve the creation of a direct pipeline to reproduce the results and test it with different datasets.

Contents

Acknowledgements

1 Introduction
  1.1 Statement of the problem
  1.2 Thesis Objective
  1.3 Outline of Thesis

2 Theoretical Framework
  2.1 Supervised Learning
    2.1.1 Linear Models
    2.1.2 Tree-based Methods
    2.1.3 Support Vector Machines
    2.1.4 Bayesian Methods
  2.2 Deep Learning
    2.2.1 Feedforward Neural Networks
    2.2.2 Training Feed-forward Neural Networks
    2.2.3 Convolutional Neural Networks
    2.2.4 Graph Convolutional Networks
  2.3 Algorithmic Complexity

3 Related Work and State of the Art
  3.1 Traditional calculations
  3.2 General Methods
  3.3 Trainable Methods
    3.3.1 Linear Regression Methods
    3.3.2 Tree-based methods
    3.3.3 Other Machine Learning Approaches
  3.4 Deep Learning Methods
    3.4.1 Standard Deep Learning Methods
    3.4.2 Decagon
    3.4.3 Decagon-based methods

4 Materials and Methods
  4.1 Datasets
  4.2 Original Implementation of Decagon
    4.2.1 Data structure organisation
  4.3 Contributions and improvements to Decagon
    4.3.1 Data Treatment and Preparation
    4.3.2 Implementation of Algorithmic Complexity Features
    4.3.3 Containers and GPU Configuration
    4.3.4 Minibatch sampling and the data leakage problem
    4.3.5 Incorporation of edge features
    4.3.6 Other improvements
    4.3.7 Overall Pipeline

5 Results and Discussion
  5.1 First experiments: Feature selection
  5.2 Node features as possible method stabilisers
  5.3 Experiments with side effects with the lowest performance
  5.4 Extension to experiments with a full graph

6 Conclusions

Bibliography

A Additional figures

Acronyms and Abbreviations

ADR adverse drug reaction.
AI Artificial intelligence.
ANN artificial neural network.
AP algorithmic probability.
AP@K average precision at k.
API application programming interface.
ATC Anatomical Therapeutic Chemical.
AUPRC area under the precision-recall curve.
AUROC area under the receiver operating characteristic curve.

BDM Block decomposition method or KC features calculated with the block decomposition method.

CNN convolutional neural network.
CPU central processing unit.
CSR compressed sparse row.
CTM coding theorem method.

DDI drug-drug interaction.
DL Deep learning.
DSE Single drug side effects.
DTI drug-target interaction.

EMI Edge Minibatch Iterator.

FN false negatives.
FP false positives.

GCN graph convolutional network.
GPU graphics processing unit.

KC Kolmogorov complexity.

MedDRA Medical Dictionary for Regulatory Activities.
ML Machine learning.
MSE mean squared error.

PPI protein-protein interaction.

ReLU rectified linear unit.
RF Random forests.

SGD stochastic gradient descent.
SVMs Support vector machines.

TN true negatives.
TP true positives.

UTM universal Turing machine.

w2 Simulations including DSE and BDM features.

Chapter 1

Introduction

1.1 Statement of the problem

In many complex diseases, single-drug therapies fall short in helping patients recover. This lower performance occurs because complex diseases, such as cancer or AIDS, involve processes controlled by multiple biochemical mechanisms, which give redundancy to their functioning [1–3]. Usually, only a few of the targets that a drug may have are known, and these give insight into which diseases the drug can treat. Single-drug therapies target only a limited number of pathways in the pathogenesis of a disease, which sometimes leads to an incomplete treatment and, therefore, perpetuates the disease.

As a result, new procedures have shifted towards multi-drug therapies, which have proven to boost the efficacy of cancer, AIDS and fungal infection treatments over single-drug therapies [4–7]. The potentiated polypharmacy effect comes from drug synergy, which occurs when multiple drugs tackle the same disease by simultaneously targeting different pathophysiological pathways [1, 5]. Due to this effect, the single-drug doses can be reduced [8, 9], which contributes to reducing the individual toxicity of the drugs [2, 4–6, 10] and can even reduce the drug resistance of the disease [6, 8, 11, 12].

Nevertheless, polypharmacy is associated with a much higher risk of adverse drug reactions (ADRs) due to drug-drug interactions [1, 4, 13]. Single drugs may modulate the activity of various untracked proteins in what is known as off-target interactions, which are challenging to trace [1, 14]. Multiple interactions of this kind could give rise to unexpected polypharmacy ADRs. These interactions usually go unnoticed in clinical trials, due to the limited time spent testing drug combinations [15], and are most often discovered once the drug is already on the market [14]. As the mechanistic understanding of drug-drug interactions is low, it is difficult to predict these side effects [2]. Furthermore, polypharmacy therapies are becoming more common [16], which makes this a growing problem and the cause of a significant fraction of patient hospitalisations due to unexpected ADRs [17, 18].

Until recently, the prognostication of polypharmacy ADRs was mainly based on clinical experience [2, 4] and medical expertise [8]. Some classical quantitative methods to predict the effect of drug combinations, such as the Loewe model (see section 3.1), were also used but failed to fully explain non-linear interactions such as synergy [4, 5, 19]. Clinical experiments can give a solution for a few combinations, but they are time-consuming and expensive [2, 5]. In vitro approaches like high-throughput screening can lead to cheaper procedures, but the vast combinatorial space of drugs makes it infeasible to test all drug combinations [2, 4].

Therefore, some development in the preclinical trials is necessary to make the procedure more sustainable and efficient. In silico approaches come in handy to solve these problems. These are computational methods that simulate the effect of drug combinations rapidly and with low resource investment. While many studies address the problem of synergy or toxicity prediction, very few try to predict specific side effects coming from drug combinations. A method of this kind could reduce mortality and morbidity among polypharmacy treatment users and save considerable expenses in healthcare. As the nature of the problem involves interactions of agents in densely connected networks, a solution using a network approach could exploit this kind of structure and extract information that could be overlooked by other methods.

1.2 Thesis Objective

Decagon [1] is an algorithm that has been proposed in the literature to tackle the multiple-drug side effect prediction problem. This method formulates polypharmacy side effects as a graph learning problem and solves it using a graph convolutional network (GCN). However, its current implementation misses some key components in its pipeline, performs unnecessary calculations, and has too many parameters, which hurts its efficiency and leads to overfitting. Moreover, its form of data manipulation is not robust and may lead to overestimated results. This work aims to improve its performance and generalizability.

More precisely, a current state-of-the-art deep learning method designed to predict the side effects of a drug combination is modified and tuned to improve its performance. As this project is firmly based on the premise that the underlying mechanisms of undesired side effects come from the structure of the human interactome, an existing graph convolutional architecture (see section 2.2.4) is chosen as a suitable solution for the problem, among many other machine learning and deep learning techniques.

Additionally, the input data used to train the neural network is enriched with a wide variety of features extracted from the graph properties of the involved networks, namely the protein-protein interaction (PPI) network, the drug-target interaction (DTI) network and the drug-drug interaction (DDI) network, or network of known polypharmacy side effects. These features include algorithmic complexity features of the involved networks, obtained by applying algorithmic numerical methods over the graphs before training the neural model. In this fashion, there is a redundancy in the learning data, as the GCN learns the main components of the network from the data while being fed with meta-features obtained independently from the network. Furthermore, tests with secondary structure protein features and drug-target affinity scores are performed to explore possible feature extraction domains.

1.3 Outline of Thesis

Chapter 2 gives a contextualization of machine learning and an explanation of the mathematical theory behind it. The chapter briefly covers the most common machine learning methods and goes deeper into artificial neural networks and graph convolutional networks. It also explains the theory and motivation behind algorithmic complexity and its numerical implementation. Chapter 3 tracks some of the computational methods that have been used to address the problem of predicting the effects of drug combinations, covering trainable and non-trainable methods, but focusing on the deep learning approaches. The chapter closes with the introduction of the method used, Decagon. Chapter 4 gives details about the datasets and the method used. It includes an explanation of the original implementation and the modifications made to it. Chapter 5 presents the results of the experiments through all the development stages of the project. Finally, Chapter 6 presents the conclusions of the thesis, including the contributions, the findings, the limitations and the future work.

Chapter 2

Theoretical Framework

This chapter gives a broad summary of the theory behind some of the cutting-edge models for solving the polypharmacy ADR prediction problem. It starts with a background of supervised learning and some of its fundamental methods. An explanation of deep learning and its relevant variations for this study follows. Finally, the chapter closes with a brief description of algorithmic complexity.

2.1 Supervised Learning

Machine learning (ML) is transforming almost every sector, from medical diagnosis and stock market prediction to virtual personal assistants and social networks [20]. This branch of artificial intelligence (AI) has grown in popularity due to the increase in data availability and the rapid growth of computing power. The reason is that ML techniques analyze vast amounts of data and interpret them to discover patterns and make decisions that humans cannot, mainly due to the large size or complexity of the datasets. Any process that requires an unbiased analysis of numerous quantified factors to generate an outcome is suitable to be solved by ML.

Machine Learning categories

Machine learning is the name given to the branch of computational methods that learn from data without being explicitly programmed¹ [21]. More specifically, it is the family of computational methods with a data-driven approach that extracts patterns from data without knowing a precise mathematical model in advance. ML methods can be classified into two big groups, depending on the learning goal and the type of data: supervised and unsupervised learning. More recent developments have extended these categories, adding semi-supervised and reinforcement learning. Supervised learning uses labelled data to perform tasks like regression or classification, while unsupervised learning uses unlabelled data to perform clustering or dimensionality reduction. Semi-supervised learning deals with problems where labelled data is scarce and can be considered a middle ground between supervised and unsupervised learning. Reinforcement learning methods, in contrast, teach themselves based on their actions under specific circumstances, considering the penalties and rewards those actions may yield. As the present study focuses exclusively on supervised learning methods, this section gives an overview of the current approaches within this field.

¹ This definition was given in 1959 by Arthur Samuel, one of the pioneers of Artificial Intelligence.

Regression vs Classification

In practice, supervised learning methods are characterized by exploiting a predictive ability. This means that generally², their primary function is to predict the label of an unseen instance. The labels of the data can be of different types, such as numerical or categorical. Generally speaking, the type of label defines the nature of the supervised problem: if the output is numerical, the problem is a regression problem; otherwise, it is a classification problem.

² Although prediction is the most practical application of supervised learning, it is sometimes necessary to know the mechanisms of learning, or what is known as interpretability or inference. Inference gives some insight into the nature of the problem and the path to the solution.

The challenge of overfitting

ML methods use statistical tools to find patterns or infer properties from the data. One of the greatest challenges of ML is to infer, from the limited samples available, inherent properties that represent the population correctly. When a model learns particular properties from the training samples that do not represent the general behaviour of the population, the model is said to be overfitted. In these cases, the performance measured on the training data is high, but the model performs poorly on new instances.

There are many approaches to solve the overfitting problem. The most general one involves increasing the number of data points for the learning, in an attempt to make the training dataset as diverse as possible. This enlarging can be done either by mining more data points or by augmenting the current dataset by artificial means. Other methods to reduce overfitting are more specific to the type of model or the optimisation method. The most suitable strategy to reduce overfitting will then depend on the ML algorithm used.
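The gap between training and test performance described above can be illustrated with a small synthetic experiment. The following is only a sketch under stated assumptions: the data-generating function, the noise level, the sample sizes and the polynomial degrees are all hypothetical choices made for illustration and are not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Draw n noisy samples of a smooth function (hypothetical toy problem)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Few training samples, many unseen test samples from the same population.
x_train, y_train = make_data(12)
x_test, y_test = make_data(200)

# errors[degree] = (training error, test error)
errors = {}
for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))

for degree, (train_err, test_err) in errors.items():
    print(f"degree {degree}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```

The higher-degree polynomial always fits the twelve training points at least as well as the cubic, yet its error on unseen points is larger than its training error, which is the signature of overfitting.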

2.1.1 Linear Models

Linear regression is a simple yet widely applicable method. The method finds an optimal coefficient $\beta_i$ for each feature or component $x_i$ of the data, plus an independent term or bias $\beta_0$. With these coefficients, it generates a linear combination that predicts the labels of the data points as follows:

$$ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p = \sum_{i=1}^{p} \beta_i x_i + \beta_0 = \boldsymbol{\beta} \cdot \mathbf{x} + \beta_0, \tag{2.1} $$

where $p$ is the number of input dimensions.

Equally important, logistic regression is a linear method of binary classification based on linear regression. This method uses the vector product calculated by linear regression as the argument of a sigmoid function (equation (2.2)), which gives an output between 0 and 1. This output is then rounded to the nearest integer to make the classification:

$$ \sigma(\boldsymbol{\beta} \cdot \mathbf{x}) = \frac{1}{1 + e^{-\boldsymbol{\beta} \cdot \mathbf{x}}}. \tag{2.2} $$

To deal with overfitting, linear models often use regularisation techniques. These techniques impose limits on different norms of the $\boldsymbol{\beta}$ vector. As examples, the Lasso and the Ridge regularisations impose constraints on the $L_1$ and $L_2$ norms, respectively, while elastic net regularisation uses a linear combination of both $L_1$ and $L_2$ norm restrictions.

The relevance of linear models lies in their simplicity and interpretability. When analyzing the result of a linear model, the vector $\boldsymbol{\beta}$ can directly quantify the relevance of each feature of the data. Nevertheless, most practical problems have non-linear behaviour, so the application of linear models can be limited.
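As an illustration of equations (2.1) and (2.2), the prediction step of both models can be sketched in a few lines. The coefficients and data points below are hypothetical values chosen by hand, not fitted parameters, and the bias term is added explicitly to the sigmoid argument.

```python
import numpy as np

def linear_predict(X, beta, beta0):
    """Equation (2.1): y = beta . x + beta0 for every row x of X."""
    return X @ beta + beta0

def sigmoid(z):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_predict(X, beta, beta0):
    """Return class probabilities and the labels obtained by thresholding at 0.5."""
    prob = sigmoid(linear_predict(X, beta, beta0))
    return prob, (prob >= 0.5).astype(int)

# Hypothetical coefficients for a problem with p = 2 features.
beta = np.array([2.0, -1.0])
beta0 = 0.5

X = np.array([[1.0, 1.0],
              [0.0, 2.0],
              [-1.0, 0.0]])

y_hat = linear_predict(X, beta, beta0)
prob, labels = logistic_predict(X, beta, beta0)
print(y_hat, prob, labels)
```

Thresholding the probability at 0.5 is equivalent to checking the sign of the linear combination, which is why logistic regression still produces a linear decision boundary.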

2.1.2 Tree-based Methods

State-of-the-art tree-based methods are ensemble learning methods³ consisting of multiple decision trees, mainly random forests and boosted classifiers. A decision tree is a structure that successively splits the input data into groups based on some splitting criteria. In each level, a node evaluates a feature of the input group and classifies the instances accordingly. The goal is to divide the data repeatedly into groups until there are only pure nodes (nodes whose instances belong to only one category), called leaves. This data division is interpreted as the partition of the feature space into a set of rectangle-like areas with a single category [22]. A trained decision tree consists of the optimal splitting criteria on the optimal features. The standard way to find the optimal splitting conditions is by using the Gini impurity or the entropy to measure the "purity" of each subset.

³ Ensemble learning methods consist of multiple individual models. The final output of these methods takes into account the outputs given by the different individual models.

Although the leaves are said to hold a category, decision trees can be used for regression and classification problems. In the case of regression, the value on the leaves is a real number. It is important to note that the function modelled by a regression decision tree is piecewise constant (a step function), and the resolution depends on the number of leaves.

Decision trees implement pruning to overcome overfitting, which consists of trimming the branches before the leaves hold a pure category. In this way, pruning may decrease the training accuracy but increase the generalisation. However, the best way to overcome overfitting is to use ensemble methods of several trees. Random forests (RF) is a commonly used ensemble method. In short, RF grow several uncorrelated trees, each trained with a different subset of data points or data features [22]. The final output of the forest is computed by majority vote or average of the outputs of individual trees for classification and regression tasks, respectively.

While RF train each tree independently, boosted classifiers grow trees sequentially, each one correcting its predecessor. Gradient boosting methods build a tree adjusting its parameters based on an error function gradient, calculated over the performance of the previous tree [22].

Tree-based methods have much more applicability than linear models without sacrificing their interpretability. Moreover, they can achieve state-of-the-art performance in many tasks. However, ensemble tree-based models are still very susceptible to overfitting and often lack robustness.
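To make the notion of node purity concrete, the sketch below computes the Gini impurity and the entropy of a set of labels, and scores a candidate split on a single feature by the weighted impurity of the two resulting groups. The function names and the toy data are hypothetical; real tree libraries search over all features and thresholds, which is omitted here.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2; it is 0 for a pure node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy -sum_k p_k log2 p_k; it is 0 for a pure node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def split_impurity(feature, labels, threshold, criterion=gini_impurity):
    """Weighted impurity of the two groups produced by feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return (len(left) / n) * criterion(left) + (len(right) / n) * criterion(right)

# Toy data: one feature, two categories.
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = np.array([0, 0, 0, 1, 1, 1])

good_split = split_impurity(feature, labels, threshold=3.5)
poor_split = split_impurity(feature, labels, threshold=1.5)
print(good_split, poor_split)
```

A lower weighted impurity means purer groups, so a tree-growing procedure would prefer the threshold 3.5, which separates the two categories perfectly in this toy example.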

2.1.3 Support Vector Machines

Support vector machines (SVMs) are binary maximal margin classifiers. In other words, they find an optimal decision boundary between two classes by maximising the distance from the data points of the different classes to the boundary that separates them. The decision boundary takes the form of a hyperplane, dividing the feature space linearly in two, corresponding to the desired categories. Formally, for $N$ data points, the task is to find the hyperplane described by the vector $[\beta_0, \boldsymbol{\beta}]$ such that

$$ \max_{\beta_0,\, \boldsymbol{\beta},\, \|\boldsymbol{\beta}\| = 1} M \quad \text{subject to} \quad y_i (\mathbf{x}_i^T \boldsymbol{\beta} + \beta_0) \geq M, \quad i = 1, \ldots, N, \tag{2.3} $$

where $y_i \in \{1, -1\}$ is the class label of the $i$-th data point, $\mathbf{x}_i$ is the $i$-th data point and $M$ is the margin [22].

The real power of SVMs comes with the use of kernels, which extend their scope beyond linearity. With kernels, data is transformed into a high-dimensional space where the linear classification takes place. This property makes SVMs capable of separating classes that are not linearly separable in the original feature space.

SVMs are based on the minimal distance between the data and the boundary, so only the data points closest to the boundary are used for training. This data selection makes the method computationally cheap and functional even with small datasets. Moreover, the use of the so-called kernel trick makes it possible to increase the dimensionality of the problem without running into complex calculations, as the only operations needed are inner products. The downside of this method is that the kernel has to be carefully chosen for the task, which sometimes reduces to trial and error. It may also perform poorly when the number of dimensions of the original data is larger than the number of data points.
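The quantity maximised in equation (2.3) can be evaluated directly for any candidate hyperplane. The sketch below does not train an SVM; it only computes the margin $M$ of two hand-picked hyperplanes on a hypothetical toy dataset, to show which of them the optimisation would prefer.

```python
import numpy as np

def margin(X, y, beta, beta0):
    """Smallest signed distance y_i (x_i^T beta + beta0) / ||beta|| over all points.

    Dividing by ||beta|| rescales beta and beta0 together, which enforces the
    unit-norm constraint of equation (2.3). A negative value means that at
    least one point lies on the wrong side of the hyperplane.
    """
    beta = np.asarray(beta, dtype=float)
    return float(np.min(y * (X @ beta + beta0) / np.linalg.norm(beta)))

# Hypothetical, linearly separable toy data with labels in {1, -1}.
X = np.array([[2.0, 2.0],
              [3.0, 3.0],
              [-2.0, -2.0],
              [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

m_diag = margin(X, y, beta=[1.0, 1.0], beta0=0.0)  # boundary x1 + x2 = 0
m_axis = margin(X, y, beta=[1.0, 0.0], beta0=0.0)  # boundary x1 = 0
print(m_diag, m_axis)
```

Both hyperplanes classify every point correctly, but the diagonal one leaves a wider gap to the closest points, so it is the better candidate under the maximal margin criterion.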

2.1.4 Bayesian Methods

In the previous methods (and in neural networks), the main idea is to update the parameters of the model iteratively, maximizing the probability that the parameters explain the current data. As each new data point arrives, the parameters are adjusted closer to their "real" value. Bayesian methods change this paradigm. They use Bayes’ Theorem shown in equation (2.4), where the left-hand side of the equation is the posterior distribution of the parameters given the data, and the right-hand side relates the likelihood of the data $p(x \mid \theta)$ and the prior of the parameters $p(\theta)$:

$$ p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta). \tag{2.4} $$

The basic idea of Bayesian ML is that the parameters are treated as another random variable taken from the prior distribution. This distribution holds previous knowledge about the parameters given by the problem. The goal then is to maximize the posterior distribution with a fixed dataset. This basic idea can be used to convert previous methods into Bayesian methods. Examples are Bayesian linear and logistic regression, or Bayesian networks.

Bayesian ML methods are more oriented towards inference. Although the prediction ability of a model is the priority in practice, it is sometimes essential to understand the mechanisms rather than obtain a correct prediction with a black-box method. For these problems, Bayesian methods are very suitable. Their limitation is that most complex Bayesian ML methods require heavy numerical computations, and therefore vast computational resources.
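Equation (2.4) can be evaluated numerically for a one-parameter model. The sketch below assumes a hypothetical example that does not appear in the thesis: a Bernoulli likelihood with success probability $\theta$, a Beta(2, 2) prior, and a dataset of 7 successes in 10 trials. The posterior is computed on a grid and its maximum is taken as the estimate.

```python
import numpy as np

# Grid of candidate parameter values, excluding the endpoints 0 and 1.
theta = np.linspace(0.001, 0.999, 999)

def posterior_on_grid(k, n, a=2.0, b=2.0):
    """Normalised posterior p(theta | x) on the grid for k successes in n trials.

    Prior: Beta(a, b), up to a constant. Likelihood: Bernoulli, up to a constant.
    Working in log space avoids numerical underflow for larger datasets.
    """
    log_prior = (a - 1.0) * np.log(theta) + (b - 1.0) * np.log(1.0 - theta)
    log_likelihood = k * np.log(theta) + (n - k) * np.log(1.0 - theta)
    log_post = log_prior + log_likelihood
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

post = posterior_on_grid(k=7, n=10)
theta_map = float(theta[np.argmax(post)])
print(theta_map)
```

For this conjugate pair the posterior is again a Beta distribution, so the grid maximum can be checked against the closed-form mode $(a + k - 1) / (a + b + n - 2)$. In more complex models no closed form exists, which is where the heavy numerical computations mentioned above come from.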

2.2 Deep Learning

Deep learning (DL) is the name given to models based on multi-layered artificial neural networks. They have become one of the most used ML methods due to their high performance on diverse tasks and, above all, because they are self-learning models. In other words, the model finds the function that relates the input data and the output by itself, with no other input from the user than the specification of the complexity of the model.

Deep Learning: The Evolution of Machine Learning

In conventional ML models, the input gives direct insights to the model on how to perform on the assigned task. Thus, the choice and manipulation of the data are crucial to the performance of the method. As a consequence, a pre-training step known as feature engineering is, most of the time, a decisive phase of the algorithms. Feature engineering is the process of transforming or selecting relevant components of the data to feed an ML algorithm with the optimal inputs in order to achieve peak performance. This process requires enormous effort and sometimes expert knowledge of the application field to manipulate the data correctly so that the algorithm can extract the correct information from it [20].

DL models stand out from conventional ML techniques as they can receive raw data as input (without the need for feature engineering) and achieve great results. This considerable advantage, which is the core of DL, is achieved by sequentially transforming the inputs through layers, creating an internal representation of the data in each layer in what is known as representation learning [20]. By creating abstractions of higher or lower dimension, DL algorithms can decide autonomously which aspects of the data are representative of the problem and amplify them, while suppressing the least relevant components. Consequently, one can say that each layer performs an automatic feature engineering procedure. With the stacking of more layers, more definite or complex traits can be extracted from the data, facilitating the final task. Hence, the number of layers and learning units per layer can be chosen at discretion to adjust the complexity of the model. Furthermore, non-linearities added at the end of each layer give DL models the ability to model virtually any desired function with arbitrary accuracy [23].

2.2.1 Feedforward Neural Networks

The perceptron: An artificial neural unit

Not surprisingly, artificial neural networks (ANNs) were inspired by biological neural networks. In an ANN, learning units are connected to mimic the behaviour of a biological neural network. Biological networks are composed of fundamental learning units called neurons. A neuron receives stimuli from other neurons through connections called synapses, which can be excitatory or inhibitory. Multiple synapses can stimulate a neuron simultaneously, having an overall effect equal to the sum of excitatory and inhibitory stimuli (Figure 2.1a). If a neuron receives enough inputs to reach a certain threshold, it will trigger a strong response in the neuron called an action potential, which is the primary way of transmitting information between neurons. The action potential is a binary response, which is why it is often described as an "all or none" response [24].

A perceptron is the artificial analogue of a biological neuron. It receives a set of $p$ inputs $x_i$ modulated by a corresponding set of weights $w_i$ that mimic the excitatory and inhibitory stimuli of a biological neuron (see Figure 2.1b).

Figure 2.1: The resemblance between biological neurons and perceptrons. (a) Stimulation of a biological neuron: red neurons give activation/inhibition stimuli to the blue neuron. When the sum of the stimuli reaches the threshold, the blue neuron generates an output. (b) Perceptron: the inputs are combined linearly (Σ) with some weights and introduced in a non-linear activation function (σ).

Each input represents a feature of the dataset, meaning that a set of simultaneous inputs represents a single data point being evaluated in the neuron. The inputs are then transformed by an inner vector product, often called the pre-activation or logit $z$ (equation (2.5)). An additional term $b$, called the bias, is usually added to the vector product. Then, the "all or none" response $h$ is modelled by evaluating the logit with a step function $\sigma$, which outputs 1 if the logit exceeds the desired threshold $t$ and 0 otherwise:

$$ z = w_1 x_1 + w_2 x_2 + \cdots + w_p x_p + b = \sum_{i=1}^{p} w_i x_i + b = \mathbf{w} \cdot \mathbf{x} + b \tag{2.5} $$

$$ h = \sigma(z) = \begin{cases} 0, & \text{for } z \leq t \\ 1, & \text{for } z > t. \end{cases} \tag{2.6} $$

The perceptron is a simple supervised learning method that can be trained to solve binary classification problems, as it divides the feature space in two with a hyperplane. The algorithm by itself can have a decent performance in simple linearly separable tasks. However, it fails when the data is not linearly separable. This substantial limitation motivates extending the power of perceptrons further.
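Equations (2.5) and (2.6) translate almost directly into code. The sketch below evaluates a single perceptron with hand-picked weights that implement the logical AND of two binary inputs. The weights, bias and threshold are hypothetical values chosen for illustration, and no training is performed.

```python
import numpy as np

def perceptron(x, w, b, t=0.0):
    """Evaluate one perceptron: logit z = w . x + b, then a step at threshold t."""
    z = np.dot(w, x) + b   # equation (2.5)
    return int(z > t)      # equation (2.6)

# Hand-picked parameters: the logit is positive only when both inputs are 1.
w = np.array([1.0, 1.0])
b = -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [perceptron(np.array(x, dtype=float), w, b) for x in inputs]
print(and_outputs)
```

AND is linearly separable, so a single hyperplane suffices. The XOR function on the same four inputs is the classic counterexample: no choice of weights and bias reproduces it with one perceptron.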

Figure 2.2: A single-layer neural network with four input neurons (input layer), five hidden neurons (hidden layer) and one output neuron (output layer). Each hidden neuron receives a linear combination of the inputs and transforms it non-linearly. The output neuron does the same procedure with the hidden neurons.

Artificial neural network: A perceptron network

In artificial and biological neural networks, a single neuron has no significant influence if it is not connected to others. Therefore, multiple perceptrons can be connected to tackle more complicated tasks. A set of connected artificial neurons working together to perform a specific task is known as an artificial neural network (ANN). The different ways in which the neurons are connected define different architectures of ANNs suitable for different tasks.

One way of integrating multiple neurons is by connecting them in parallel or layerwise. With this configuration, each neuron receives its own linear combination of the inputs, since every neuron has a different set of weights. After the respective activation functions evaluate their logits, each neuron returns its binary response. A subsequent neuron, called the output neuron, can then linearly combine these results to generate the output of the network (Figure 2.2).

The network described above is composed of three parts: the input neurons, each of them representing a feature of the dataset; the set of neurons evaluating the inputs, known as the hidden layer; and the output neuron, which consolidates the responses of the neurons in the hidden layer. Such a network is called a single-layer network.

Figure 2.3: Multi-layered network for a multiple-label classification problem. Inputs are linearly combined by the Σ unit and non-linearly transformed by the σ unit of each neuron in each layer. The outputs of a layer constitute the input of the next. Each output neuron gives the probability that the input belongs to a particular class.

Multi-layered networks: The heart of deep learning

A more complex model can be created by stacking multiple hidden layers of neurons between the input and the output, so that each layer transforms the output of the previous one. That is, the outputs of one layer are linearly combined with a different set of weights in the following layer, which in turn can be connected to a subsequent layer (Figure 2.3). The number of layers of the model can be chosen at the user's discretion. The more layers a network has, the more complex the model and the higher its capability to solve intricate problems. In this way, each layer acts as a feature extractor, where initial layers capture fine details and the last layers extract complex traits. This property of transforming the inputs non-linearly several times gives ANNs the ability to extract meaningful information from untreated features. This characteristic of networks having multiple layers is what gave deep learning its name.

In the network shown in Figure 2.3, every neuron in one layer transmits its output to every neuron in the next layer. These layers are called fully connected layers. It is also important to note that there are no direct connections between neurons of the same layer, nor loop connections of a neuron with itself. These properties are representative of a feedforward neural network.

Mathematically, the output of this network can be described with matrix products of the inputs with matrices of weights for each layer. In the perceptron, the weights are grouped in a vector w. In a layer of a feedforward network with m neurons and a p-dimensional input vector x, the weights are instead expressed by a matrix W of dimensions m × p, so that the product W · x is an m-dimensional vector. The logits z of the lth layer can be expressed as the vector

$$z^{(l)} = W^{(l)} \cdot x + b^{(l)}, \tag{2.7}$$

while the outputs h of the lth layer are just the non-linear activations of the logits

$$h^{(l)} = \sigma(z^{(l)}). \tag{2.8}$$

Then, plugging the right hand side of equation (2.7) into equation (2.8), with the outputs of layer l acting as the inputs of layer l + 1, gives the expression for the next layer

$$h^{(l+1)} = \sigma\left(W^{(l+1)} \cdot h^{(l)} + b^{(l+1)}\right), \tag{2.9}$$

which is the general propagation rule for feed-forward ANNs.

Feedforward neural networks are suitable for both regression and classification tasks. For regression tasks, the output node of the network must return a numeric value. In binary classification, the output node returns either 0 or 1, depending on the predicted category. For multi-class classification, several output neurons are needed, as shown in Figure 2.3. In this case, each output neuron represents a category, and it outputs 1 when the input belongs to the corresponding category and 0 otherwise. In practice, the values of the output neurons are real values between 0 and 1, which represent the probability of belonging to the given class.
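As an illustration of the propagation rule in equation (2.9), the following is a minimal sketch in plain Python, not the implementation used in this thesis. It assumes a sigmoid activation in every layer, stores each weight matrix as a list of rows (one row per neuron), and uses hypothetical toy weights; the names `dense_layer` and `forward` are illustrative only.

```python
import math

def sigmoid(z):
    """Logistic activation, used here as an assumed choice of sigma."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(W, b, h_prev, activation=sigmoid):
    """One layer of equation (2.9): h = sigma(W . h_prev + b).

    W is a list of rows (one per neuron), b a list of biases,
    h_prev the outputs of the previous layer (or the inputs).
    """
    return [activation(sum(w * x for w, x in zip(row, h_prev)) + bias)
            for row, bias in zip(W, b)]

def forward(layers, x):
    """Apply the layers one after another (forward propagation)."""
    h = x
    for W, b in layers:
        h = dense_layer(W, b, h)
    return h

# Hypothetical toy network: 3 inputs -> 2 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.2]),
]
y_hat = forward(layers, [1.0, 2.0, 3.0])
print(y_hat)  # a single value between 0 and 1
```

Because the output neuron also uses a sigmoid in this sketch, its value can be read as a probability, in line with the last remark of the paragraph above.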

2.2.2 Training Feed-forward Neural Networks

Previously it was assumed that the networks had optimal weights for performing the desired task. However, these parameters have to be trained to achieve their optimal values. These trainable parameters in ANNs are the weight matrices W and biases b of each neuron. For simplicity, the biases are usually included in the W matrix in the mathematical expressions.

Training dataset: Learning from examples

It is the essence of supervised learning methods that they require a training phase in which they learn from labelled examples. Although each ML method may have its peculiarities, all of them follow the same general procedure. The process starts by selecting a representative subset of the available data, known as the training dataset, while leaving the remaining examples as the test dataset. The training dataset is used to tune the parameters, while the test dataset is left to evaluate the method's generalisation ability after training. It is recommended to choose a significant proportion of the data as the training set, usually around 70%, to maximize the probability that the model captures the general traits of the population and not the individual characteristics of instances.

Sometimes, an additional subset of the dataset, known as the validation dataset, is built to monitor the performance during training and to adjust the values of non-trainable parameters, known as hyperparameters. The basic hyperparameters of a neural network are the number of layers and the number of neurons per layer.

Cost function: Quantification of the error

The next step is to measure the error between the predictions of the model and the ground truth. The ground truth of an instance $x_i$ is its label $y_i$. The prediction $\hat{y}_i$ of instance $x_i$, on the other hand, is the output given by the network upon evaluating $x_i$. This evaluation, which implies performing the corresponding linear combinations and activations sequentially through each layer, is known as forward propagation. Therefore, the required error function should measure the discrepancy between $y_i$ and $\hat{y}_i$ for all training examples. This function has to take high values when $y_i$ and $\hat{y}_i$ are very different, and values close to zero when they are alike. Such a function J is known as the cost function.

The specific cost function has to be chosen depending on the problem. Some cost functions are best suited for classification problems, and others for regression. One common choice of cost function for regression is the mean squared error (MSE)

$$J = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2, \tag{2.10}$$

where N is the number of instances in the training dataset. For binary classification problems, cross-entropy is often used:

$$J = -\sum_{i=1}^{N} \left[ y_i \log_2(\hat{y}_i) + (1 - y_i) \log_2(1 - \hat{y}_i) \right]. \tag{2.11}$$

For multi-class classification, the cross-entropy takes the form

$$J = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log_2(\hat{y}_{i,k}), \tag{2.12}$$

where K is the number of different classes.
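The three cost functions of equations (2.10) to (2.12) can be written down directly. The sketch below is only an illustration in plain Python: the function names are hypothetical, labels for the multi-class case are assumed to be one-hot encoded, and predictions are assumed to lie strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error, equation (2.10)."""
    n = len(y_true)
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n

def binary_cross_entropy(y_true, y_pred):
    """Binary cross-entropy with base-2 logarithms, equation (2.11)."""
    return -sum(t * math.log2(p) + (1 - t) * math.log2(1 - p)
                for t, p in zip(y_true, y_pred))

def categorical_cross_entropy(y_true, y_pred):
    """Multi-class cross-entropy, equation (2.12).

    y_true holds one-hot rows (an assumption of this sketch);
    terms with y = 0 contribute nothing and are skipped.
    """
    return -sum(t * math.log2(p)
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p) if t > 0)

print(mse([1.0, 2.0], [1.0, 4.0]))                   # 2.0
print(binary_cross_entropy([1, 0], [0.5, 0.5]))      # 2.0
print(categorical_cross_entropy([[1, 0, 0]], [[0.25, 0.5, 0.25]]))  # 2.0
```

In the second call each prediction of 0.5 costs exactly one bit, which is why an uninformative binary classifier has a cross-entropy of one per instance under the base-2 convention used here.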

This cost function depends on the values of $\hat{y}_i$, which in turn depend on every trainable parameter w. As the goal is to find the optimal value of each weight, the cost function has to be minimized, and the values of w corresponding to that minimum are then used. However, due to the high dimension of the parameter space and the non-linear activation functions in the hidden layers, the cost function is non-convex [25], so the minimum has to be estimated with iterative algorithms.

Gradient descent: Moving in the steepest slope direction

The preferred choice for optimising neural networks is gradient descent. This iterative algorithm finds a local minimum of a cost function, updating the parameter values based on the direction of the gradient. The algorithm displaces a chosen starting point $w^0$ in the negative direction of the gradient by a step η called the learning rate, as shown in equation (2.13). This is the direction in which the function decreases at the highest rate (the steepest slope). As the cost function is a hypersurface over the parameter space, the starting point $w^0$ and the gradient $\nabla_w J(w)$ are p-dimensional vectors, where p here denotes the number of trainable parameters of the model. Each entry of the gradient is the partial derivative of J with respect to one weight (equation (2.14)), and each entry of $w^j$ is the value of that weight (equation (2.15)):

$$w^{j+1} = w^{j} - \eta \nabla_{w^j} J(w^j) \tag{2.13}$$

$$\nabla_{w^j} J(w^j) = \begin{pmatrix} \dfrac{\partial J}{\partial w_1^j} \\ \dfrac{\partial J}{\partial w_2^j} \\ \vdots \\ \dfrac{\partial J}{\partial w_p^j} \end{pmatrix} \tag{2.14}$$

 j w1 wj j  2 w =  .  , (2.15)  .  j wp where the superindex j is the number of the iteration. The steepest gradient method is known to have a low speed of convergence, and the value of the learning rate is one of the factors that determines the convergence speed. A learning rate chosen too small would make tiny steps towards the solution, making the algorithm take too many iterations to reach the minimum. On the other hand, choosing a too big learning rate may miss the desired minimum, causing divergence of the algorithm. The value of the learning rate is a hyperparameter that has to be tuned using the validation error. Sometimes, an adaptable learning rate is used, to make large steps in the first iterations and reduce the steps close to the solution. This adaptation could increase the convergence rate of the algorithm. Due to the updating of parameters in each iteration, the current solution moves to a new point in the cost function, so a new gradient must be calculated. The number of iterations depends on the number of batches, which are subsets of training data points used to estimate the cost function and update the network’s weights. Using the whole training dataset as the batch may result in a faster convergence when the dataset is big, but it may be slower when the number of features is large, and it may have a risk of getting stuck in a saddle point with a low performance-set of parameters. An alternative is stochastic gra- dient descent (SGD), where only one or a few instances are used to calculate the gradient. In this method, the cost function is estimated with fewer data- points but over more iterations. This means that the gradient calculation may not be exact but it has several iterations to correct itself, reducing the risk of getting stuck in a flat region [26]. However, this method may result in a slow convergence and even a risk of divergence close to the minimum. 
Usually, a middle-ground method is used, which divides the dataset into batches of a few hundred examples.

Most iterative algorithms stop when they find a value that is good enough given a tolerance. However, in neural network training, one usually sets as the stopping criterion the number of times the whole dataset is used in gradient descent. Each of these complete passes is called an epoch. The number of iterations can then be calculated by multiplying the number of epochs by the number of batches into which the training dataset is divided.
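The interplay of learning rate, batches and epochs can be illustrated with a small mini-batch gradient descent loop. The sketch below is an assumption-laden toy, not the training procedure of this thesis: it fits a hypothetical one-feature linear model y = w x + b with the MSE cost on noise-free synthetic data, and all names and hyperparameter values are illustrative.

```python
import random

def minibatch_gradient_descent(xs, ys, lr=0.1, epochs=200, batch_size=4, seed=0):
    """Fit y = w * x + b by mini-batch gradient descent on the MSE cost."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0                      # starting point w^0
    indices = list(range(len(xs)))
    iterations = 0
    for _ in range(epochs):              # one epoch = one pass over the data
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the MSE estimated on the current batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Update rule of equation (2.13).
            w -= lr * grad_w
            b -= lr * grad_b
            iterations += 1
    return w, b, iterations

# Synthetic, noise-free data generated from y = 3x + 1 (hypothetical example).
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b, iterations = minibatch_gradient_descent(xs, ys)
print(round(w, 3), round(b, 3), iterations)
```

With 20 examples and a batch size of 4 there are 5 batches per epoch, so 200 epochs give 1000 iterations, matching the product of epochs and batches described above.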

Backpropagation: Applied chain rule of differentiation

Historically, calculating the gradient has been the most challenging aspect of SGD, as it may be a very computationally expensive procedure [23]. As a result, the development of an algorithm called backpropagation to calculate the gradient numerically was a significant achievement in the field. Backpropagation takes advantage of the recursive dependence between variables in the network: forward propagation makes it possible to express each layer as a function of the previous ones.

Specifically, in a simple single-layer network (recall Figure 2.2), the output $\hat{y}$ can be defined as the activation of a logit $z^{(2)}$

$$\hat{y} = \sigma(z^{(2)}). \tag{2.16}$$

The logit is a linear combination of the outputs $h_i^{(1)}$ of the neurons in the hidden layer

$$z^{(2)} = \sum_{i=1}^{m_2} w_i^{(2)} h_i^{(1)} + b_2. \tag{2.17}$$

In turn, these neuron responses $h_i^{(1)}$ are activations of linear combinations of the inputs:

$$h_i^{(1)} = \sigma(z_i^{(1)}); \qquad z_i^{(1)} = \sum_{j=1}^{m_1} w_{ij}^{(1)} x_j + b_1, \tag{2.18}$$

where, in equations (2.17) and (2.18), the index i runs over the $m_2$ neurons in the hidden layer and the index j runs over the $m_1$ inputs. Following this idea, to calculate an entry of the gradient vector of equation (2.14), i.e. a partial derivative of the cost function with respect to a single weight, an expression using the chain rule of differentiation is derived:

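$$\frac{\partial J}{\partial w_{11}^{(1)}} = \frac{\partial J}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z^{(2)}} \left( \frac{\partial z^{(2)}}{\partial h_1^{(1)}} \frac{\partial h_1^{(1)}}{\partial z_1^{(1)}} \frac{\partial z_1^{(1)}}{\partial w_{11}^{(1)}} + \dots + \frac{\partial z^{(2)}}{\partial h_{m_2}^{(1)}} \frac{\partial h_{m_2}^{(1)}}{\partial z_{m_2}^{(1)}} \frac{\partial z_{m_2}^{(1)}}{\partial w_{11}^{(1)}} \right). \tag{2.19}$$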
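To make the chain rule of equation (2.19) concrete, the sketch below evaluates the derivative with respect to the first weight of the first hidden neuron and checks it against a finite-difference estimate. Several assumptions are made for illustration only: a sigmoid activation, a squared-error cost for a single example, hypothetical toy weights, and a shared scalar bias per layer as in equations (2.17) and (2.18). Since that weight only enters the logit of the first hidden neuron, only the first term of the sum in equation (2.19) is non-zero.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W1, b1, w2, b2, x):
    """Equations (2.16)-(2.18): returns hidden activations and the output."""
    z1 = [sum(w * xj for w, xj in zip(row, x)) + b1 for row in W1]
    h1 = [sigmoid(z) for z in z1]
    z2 = sum(w * h for w, h in zip(w2, h1)) + b2
    return h1, sigmoid(z2)

def cost(W1, b1, w2, b2, x, y):
    """Assumed cost for one example: J = (y_hat - y)^2."""
    _, y_hat = forward(W1, b1, w2, b2, x)
    return (y_hat - y) ** 2

def grad_w11(W1, b1, w2, b2, x, y):
    """dJ/dw11 of the hidden layer via the chain rule of equation (2.19)."""
    h1, y_hat = forward(W1, b1, w2, b2, x)
    dJ_dyhat = 2.0 * (y_hat - y)
    dyhat_dz2 = y_hat * (1.0 - y_hat)    # derivative of the sigmoid
    dz2_dh1 = w2[0]
    dh1_dz1 = h1[0] * (1.0 - h1[0])
    dz1_dw11 = x[0]
    return dJ_dyhat * dyhat_dz2 * dz2_dh1 * dh1_dz1 * dz1_dw11

# Hypothetical toy values: 2 inputs, 2 hidden neurons, 1 output.
W1 = [[0.4, -0.6], [0.3, 0.9]]
b1, w2, b2 = 0.1, [0.7, -0.5], -0.2
x, y = [1.5, -0.5], 1.0

analytic = grad_w11(W1, b1, w2, b2, x, y)

# Central finite difference as an independent check of the derivative.
eps = 1e-6
W_plus = [[W1[0][0] + eps, W1[0][1]], W1[1]]
W_minus = [[W1[0][0] - eps, W1[0][1]], W1[1]]
numeric = (cost(W_plus, b1, w2, b2, x, y) - cost(W_minus, b1, w2, b2, x, y)) / (2 * eps)

print(analytic, numeric)  # the two values should agree closely
```

Backpropagation proper reuses the shared factors (here the first two derivatives) for all weights instead of recomputing them, which is what makes it efficient.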

Nevertheless, the previous explanation exposes an obvious problem with backpropagation: the non-linear function σ(·) (equation (2.6) and Figure 2.4a), defined previously as the step function, is not differentiable over its whole domain. Moreover, its derivative is zero wherever it is defined, which reduces all the chained derivatives to zero, so the weights are not updated in the gradient descent algorithm. For this reason, other non-linear functions have to be used. Differentiable alternatives to the step function, such as the sigmoid function (equation (2.2) and Figure 2.4b) or the inverse tangent (Figure 2.4c), can be used as substitutes. However, it has been shown [23] that using a rectified linear unit (ReLU) function (equation (2.20) and Figure 2.4d) as the activation function in hidden layers can achieve better performance than saturating functions (see footnote 4).

$$\sigma_{\mathrm{ReLU}}(x) = \max\{0, x\} \tag{2.20}$$

[Figure 2.4 plots, each over inputs from −10 to 10: (a) Step function, (b) Sigmoid, (c) Inverse tangent, (d) Rectified linear unit (ReLU).]

Figure 2.4: Some of the possible non-linear functions that can be used as activations for the neurons.
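The four activations of Figure 2.4 and their derivatives can be compared numerically. The following sketch is illustrative only; in particular, the convention that the step function returns 1 for non-negative inputs is an assumption, since equation (2.6) is not reproduced here, and the derivative of ReLU at zero is set to 0 by convention.

```python
import math

def step(z):
    """Step function; returning 1 at z = 0 is an assumed convention."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified linear unit, equation (2.20)."""
    return max(0.0, z)

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_arctan(z):
    """Derivative of the inverse tangent math.atan(z)."""
    return 1.0 / (1.0 + z * z)

def d_relu(z):
    """Derivative of ReLU; undefined at 0, set to 0 there by convention."""
    return 1.0 if z > 0 else 0.0

for z in (0.0, 2.0, 10.0):
    print(z, d_sigmoid(z), d_arctan(z), d_relu(z))
```

For large inputs the derivatives of the sigmoid and the inverse tangent shrink towards zero, whereas the derivative of ReLU stays at 1 for every positive input.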

The previously explained dependence of outer-layer variables on inner-layer variables makes ANNs suitable for being represented as acyclic computational graphs. In such graphs, each variable is a node, linked to other variables through edges representing operations. This representation is precisely what implementation packages such as TensorFlow use to perform the algorithm's computations.

In those computational graphs, each edge represents an operation between two tensors. The training of neural networks can involve a considerable number of these operations. This heavy load of operations is why graphics processing units (GPUs) are commonly used to train neural networks instead of central processing units (CPUs). In brief, GPUs have more cores and higher bandwidth than CPUs, making them more suitable for performing the operations in the computational graph, as many of them can be calculated in parallel. Nevertheless, GPUs usually have more limited memory than CPUs, which is the bottleneck of their usage.

Footnote 4: Saturating functions have horizontal asymptotes as their input approaches infinity and negative infinity. In other words, over a large part of their domain, a given increase or decrease in the input does not cause a significant increase or decrease in the output. As a consequence, their derivatives are close to zero in those regions.

Training challenges

Gradient descent is used as a resource in the absence of an easy exact solution to the non-convex minimisation problem. This alternative gives successful results in a vast number of situations, but several theoretical and practical issues must be taken into consideration.

Among the many practical considerations, an important one is dealing with vanishing gradients in backpropagation. The gradient with respect to a parameter very deep in the network is usually a product of many partial derivative terms. If only one or a few of them are close to zero, the whole product may vanish and leave that parameter untrained. The deeper the layer is located, the more likely its derivative is to contain a factor close to zero. Additionally, vanishing gradients are persistent when using saturating activation functions such as the sigmoid and the inverse tangent, due to the extensive part of their domain where their derivative is close to zero. Two main approaches can be used to solve this problem. The first one uses non-saturating functions as activations of inner layers, such as ReLU or other functions that only have a finite number of points where the derivative is not defined [23]. The second option is to use a particular weight initialisation strategy. One of the currently most used initialisation strategies is the Glorot & Bengio initialisation [27], which samples the initial value of the weights from a uniform distribution bounded between

$$\pm \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}, \tag{2.21}$$

where $n_{in}$ and $n_{out}$ are the number of inputs and outputs of the layer, respectively. The reader is referred to [27] for an in-depth explanation of this initialisation method.

Among the theoretical considerations, one that stands out is that gradient descent does not guarantee convergence to the absolute minimum, since the solution found depends on the starting point. Although this may seem troublesome, recent studies have shown that converging to a local minimum is not a severe issue since, for networks with many parameters, the cost function has abundant minima of similar quality to the absolute minimum [20]. However, the optimisation procedure may arrive at a saddle point, where most of the derivatives are close to zero, which could reduce the average speed of convergence. For these cases, momentum-based optimisation methods were developed, Adam being one of the most successful. Adam optimisation [28] differs from regular SGD in two ways: first, it keeps an individual learning rate $\eta_i$ for every trainable parameter $w_i$; second, the learning rates are adaptively changed using exponentially decaying averages of the means and variances of the gradients. Equation (2.22) shows the updating of the mean m and uncentered variance s estimates of the gradients, and how they are involved in the parameter updating:

$$\begin{aligned} m_i^{(l)} &= \beta_1 m_i^{(l-1)} + (1 - \beta_1)\nabla_{w_i} J(w) \\ s_i^{(l)} &= \beta_2 s_i^{(l-1)} + (1 - \beta_2)\left(\nabla_{w_i} J(w)\right)^2 \\ w_i^{(l+1)} &= w_i^{(l)} - \eta_i \frac{m_i^{(l)}}{\left(1 - \beta_1^{l+1}\right)\left(\sqrt{\dfrac{s_i^{(l)}}{1 - \beta_2^{l+1}}} + \epsilon\right)}, \end{aligned} \tag{2.22}$$

where l is the iteration, i is the index of the parameter, $\beta_1$ and $\beta_2$ are the exponential decay rates of the mean and the variance, respectively, and ε is a small number that prevents division by zero in the implementation. These last three variables are hyperparameters set by the user, but the values of $\beta_1$ and $\beta_2$ have to be between 0 and 1 to generate the exponential decay in the values of mean and variance, and ε is usually of the order of $10^{-7}$.
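The update of equation (2.22) can be sketched in a few lines. The code below is an illustrative toy rather than a library implementation: it uses a single learning rate for all parameters, commonly used default values for the decay rates (assumed here, not taken from this thesis), and a hypothetical two-parameter quadratic cost whose minimum is at (3, −1).

```python
import math

def adam_minimise(grad, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-7, steps=500):
    """Minimise a function given its gradient, following equation (2.22)."""
    w = list(w0)
    m = [0.0] * len(w)   # running mean of the gradients
    s = [0.0] * len(w)   # running uncentered variance of the gradients
    for l in range(steps):
        g = grad(w)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            s[i] = beta2 * s[i] + (1 - beta2) * g[i] ** 2
            # Bias-corrected estimates, as in the denominator of (2.22).
            m_hat = m[i] / (1 - beta1 ** (l + 1))
            s_hat = s[i] / (1 - beta2 ** (l + 1))
            w[i] -= lr * m_hat / (math.sqrt(s_hat) + eps)
    return w

def quadratic_grad(w):
    """Gradient of the toy cost J(w) = (w1 - 3)^2 + 10 (w2 + 1)^2."""
    return [2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)]

w_opt = adam_minimise(quadratic_grad, [0.0, 0.0])
print(w_opt)  # close to [3, -1]
```

Although the two directions of this toy cost have very different curvatures, the division by the square root of the variance estimate gives both parameters steps of comparable size, which is the per-parameter adaptation described above.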
Adam optimisation helps gradient descent overcome saddle points efficiently by keeping the "momentum" of previous iterations.

Finally, as for any other ML method, there is a constant threat of overfitting. Deep ANNs can have thousands or even millions of trainable parameters [20], which gives them a tremendous ability to model complex functions but also makes them exceptionally prone to overfitting. To overcome this drawback, some models implement a strategy called dropout. With this approach, each neuron in the ANN is turned off during a training iteration with a probability p, called the dropout rate. Consequently, the model trains with fewer parameters, reducing its complexity during the current iteration. The choice of dropped neurons is made anew in every iteration, giving previously dropped neurons the possibility to reactivate. The probability p is a hyperparameter, but it usually takes values around 0.1.
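A dropout step as described above can be sketched as a random mask applied to the activations of a layer. This is a minimal illustration with hypothetical names; it only zeroes the dropped neurons. Many implementations additionally rescale the surviving activations by 1/(1 − p) during training, a variant that is not shown here.

```python
import random

def dropout(activations, p, rng):
    """Turn each neuron off (output 0) independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]

rng = random.Random(0)                 # fixed seed, for reproducibility only
hidden = [0.7, 0.1, 0.9, 0.4, 0.3]     # hypothetical hidden-layer outputs

# A new mask is drawn at every training iteration.
for iteration in range(3):
    print(iteration, dropout(hidden, p=0.1, rng=rng))
```

At evaluation time the mask is not applied, so all neurons contribute to the prediction.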

Performance Metrics

It is crucial to know how well the models are performing. The correct interpretation of performance measures can diagnose all of the previously described problems that affect ANNs. In classification tasks, two measures are frequently used to describe a method's performance: precision and recall. These metrics are mostly used for binary classification methods, but their use can be extended to multi-class classification by evaluating each class against the rest [29].

Precision measures the accuracy of the positive predictions. It is defined as the ratio of the instances correctly classified as positive (true positives, abbreviated TP) to the total number of instances (correctly and incorrectly) classified as positive, as shown in equation (2.23)

$$P = \frac{TP}{TP + FP}, \tag{2.23}$$

where FP is the number of false positives, i.e. the number of negative instances classified as positive. Precision is useful to identify classifiers that are very good at recognising negative instances, i.e. that rarely classify a negative instance as positive. However, it says nothing about the chance of a positive instance being classified as negative.

On the other hand, recall (also called sensitivity or true positive rate, abbreviated TPR) measures how many of the total positive instances were correctly classified as positive. It is defined as the ratio of TP to the total number of positive instances in the dataset, as shown in equation (2.24)

$$R = \frac{TP}{TP + FN}, \tag{2.24}$$

where FN stands for false negatives, i.e. instances that are positive but were classified as negative. Thus, recall is useful to identify when a classifier can identify the positive class correctly, but it gives no information about the performance on the negative class. As a result, a classifier with good recall may not necessarily be a good classifier overall, as it may only perform optimally in the positive class. For instance, a classifier that assigns the positive class to every instance will have a recall of 1 but misclassify the opposite class completely.

The separate roles of precision and recall result in a tradeoff between them: in practice, one cannot achieve perfect recall and perfect precision at the same time. A classifier can then tune its parameters to favour one or the other, depending on the specific task. For example, a neural network with an output neuron using a sigmoid function will distinguish between two classes depending on the value of the logit. Tweaking the weights of the network to make the logits more negative (or simply displacing the sigmoid to change its threshold) will favour negative classification and, therefore, make positive classification more restrictive. This shift reduces the number of false positives and hence increases the precision; but at the same time, it also increases the false negatives and, as a consequence, reduces recall.

The precision-recall tradeoff can be visualised in a curve of precision vs recall, as shown in Figure 2.5. One common way of evaluating the performance of the method is calculating the area under the precision-recall curve (AUPRC), which gives a value between 0 and 1. Although AUPRC is mostly used for binary classification, combining it with the one-vs-rest approach can give a more accurate evaluation of a method's performance on heavily unbalanced datasets [30].

[Figure 2.5 plots: left, precision-recall curve (precision against recall, both from 0 to 1); right, receiver operating characteristic curve (true positive rate against false positive rate, both from 0 to 1).]

Figure 2.5: Examples of a precision-recall curve (left) and a receiver operating characteristic curve (right). The precision-recall curve shows the tradeoff between precision and recall: as one of these quantities approaches 1, the other tends to decrease. The receiver operating characteristic curve shows the opposite relation between the true positive and false positive rates, which increase together.
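The effect of the decision threshold on precision and recall, equations (2.23) and (2.24), can be reproduced on a toy example. The labels and scores below are hypothetical, and the convention of returning a precision of 1 when nothing is predicted positive is an assumption of this sketch.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0   # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels and classifier scores (e.g. sigmoid outputs).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

low = precision_recall(y_true, scores, threshold=0.2)    # (0.6, 1.0)
high = precision_recall(y_true, scores, threshold=0.75)  # (1.0, 0.666...)
print(low, high)
```

Raising the threshold from 0.2 to 0.75 removes both false positives, so precision rises from 0.6 to 1, but one positive instance is now missed and recall drops from 1 to 2/3.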

Theoretically, the AUPRC is calculated as an integral of precision as a function of recall. Computationally, the integral is approximated by the average precision, which is a finite sum over the precision and recall values at a set of thresholds of the model [31]:

$$AP = \sum_n (R_n - R_{n-1}) P_n, \tag{2.25}$$

where $P_n$ and $R_n$ are the precision and recall at threshold n.

Another important metric is the area under the receiver operating characteristic curve (AUROC) (Figure 2.5, right). The receiver operating characteristic curve is a graph of recall against the false positive rate (FPR), defined as

$$FPR = \frac{FP}{FP + TN}, \tag{2.26}$$

where TN is the number of negative instances correctly classified. Similarly to AUPRC, the AUROC is a value between 0 and 1 that measures how well a model can distinguish between classes. Models with values close to one perform optimally, while a random classifier has an AUROC of 0.5. AUROC is often used in multi-class classification tasks.

An approach for multi-label prediction is the average precision at k (AP@K), where k is the number of labels predicted. First, the precision at each k (P@K) must be calculated, that is, the fraction of the first k predicted labels that are correct. The average precision at k is simply the average of P@K over several k, namely

$$AP@K = \frac{1}{k}\sum_{i=1}^{k} P@i. \tag{2.27}$$

This metric may be useful when a known number of top-ranked labels is more important than the others. However, it fails to describe the performance over the whole label space.
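Equations (2.25) and (2.27) can be evaluated by ranking the instances by score. The sketch below is illustrative: it assumes distinct scores, uses every score as a threshold, and follows equation (2.27) literally (the mean of P@i for i = 1, ..., k), which differs from some other definitions of AP@K found in the literature. The data are hypothetical.

```python
def average_precision(y_true, scores):
    """Average precision, equation (2.25), using each score as a threshold."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(y_true)
    tp, ap, prev_recall = 0, 0.0, 0.0
    for rank, i in enumerate(order, start=1):
        tp += y_true[i]
        precision = tp / rank
        recall = tp / total_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

def average_precision_at_k(y_true, scores, k):
    """AP@K as in equation (2.27): the mean of P@i for i = 1..k."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    hits, total = 0, 0.0
    for i, idx in enumerate(order, start=1):
        hits += y_true[idx]
        total += hits / i            # P@i
    return total / k

# Hypothetical labels and scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

ap = average_precision(y_true, scores)             # 11/12
ap_at_3 = average_precision_at_k(y_true, scores, 3)  # 8/9
print(ap, ap_at_3)
```

In this example the first two ranked instances are positive, so the sum in equation (2.25) gains 1/3 at each of them, nothing at the third (a negative), and 1/3 × 3/4 at the fourth, giving 11/12.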

2.2.3 Convolutional Neural Networks

Despite the drastic improvements in the performance of conventional DL models, they still had some significant shortcomings. Networks with fully connected layers can have a number of parameters that grows intractably with the number of inputs and hidden neurons. This increase means that the models can become impractically large when solving problems with high-dimensional data. Moreover, in the case of images, standard neural networks work fine for datasets with simple images, i.e. small images with binary values, but as the input grows in dimension and complexity (as in high-resolution RGB images), the models do not manage to learn the proper components of the data.

One of the reasons for this low performance was the lack of consideration of the spatial structure of images. For recognition tasks, an object has to be identified in an image no matter its location, orientation, or size. In images, local groups of values are highly correlated, forming distinctive motifs [20]. A standard neural network would flatten the image into a vector and treat it as a standard input vector. In this way, the information on pixel adjacency is lost.

[Figure 2.6 diagram, with labels: Input map, Feature map, Output map.]

Figure 2.6: A convolutional filter or kernel linearly combines the entries of a small section of the input map to form an output map entry. A non-linear activation such as ReLU usually follows the linear combination.

Convolution: An artificial neuronal receptive field

Convolutional neural networks (CNNs) emerged to satisfy the need for a DL method that could capture information coming from 2-dimensional data, such as images. Just like conventional ANNs, CNNs mimic the functioning of biological structures, in this case visual ones. It was found that neurons in the visual cortex have a small receptive field which responds to specific local spatial patterns [32]. To imitate this local receptive field, a CNN leaves behind the previous architecture of fully connected layers and takes a more local approach. A layer uses small trainable matrices called filters or kernels that go over the entire input, creating responses to local patches of the same size as the filter. A response involves an elementwise multiplication of the filter and the patch, whose products are summed and passed to a non-linear activation function as in traditional ANNs. The filter moves across the input generating responses to every possible patch, as in a 2-dimensional convolution. As a result, a 2-dimensional output is generated by arranging the responses accordingly in what is known as a feature map (Figure 2.6). Several different filters can be convolved over a data instance in a layer, each generating its own feature map. Each filter is supposed to detect specific patterns in the data. As the weights that compose the filters are trained using backpropagation, the network decides which patterns are relevant.

[Figure 2.7 diagram: pipeline of Convolution + ReLU, Pooling, Convolution + ReLU, Pooling, Fully connected and Output predictions, with example outputs tree (0.01), sun (0.99), cloud (0.99).]

Figure 2.7: Deep convolutional neural network: An image is put through several convolutional and pooling layers, then flattened and fed to a standard neural layer. The final layer performs the multi-label classification.
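The filter response described above can be sketched for a single-channel input. The code is a minimal illustration under stated assumptions: no padding, a stride of one, a ReLU activation, and a hypothetical hand-made kernel that responds to a dark-to-bright vertical border. As in most deep learning libraries, the kernel is not flipped, so the operation is strictly a cross-correlation.

```python
def conv2d(image, kernel, bias=0.0):
    """Slide the kernel over the image and apply ReLU to each response."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Elementwise product of the kernel and the patch, then summed.
            z = bias + sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw))
            row.append(max(0.0, z))      # ReLU activation
        feature_map.append(row)
    return feature_map

# Hypothetical 4x4 image with a vertical border between columns 1 and 2.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
# Hypothetical kernel that responds to dark-to-bright vertical borders.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d(image, kernel)
print(feature_map)  # [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
```

The feature map is non-zero only in the column where the border lies, and the same four kernel weights are reused at every position of the image.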

Deep CNNs: Convolution and pooling layers

In image recognition tasks, it is crucial to stack multiple layers to achieve good performance. The successive transformations of the feature maps reveal the intricate hierarchy of patterns that is characteristic of objects. Specifically, the first convolutional layers identify the most primitive patterns in images, such as borders. The generated feature maps then contain information on the presence or absence of borders in local patches. The following layers try to identify, in the same way, patterns of borders that form motifs. Finally, the last convolutional layers identify groups of these motifs arranged in specific ways that form recognisable objects. On the whole, multiple layers guarantee multiple levels of abstraction, from fine details to composite objects, in a hierarchical way. This multiple-level architecture of CNNs is shown in Figure 2.7.

The algorithm's convolutional nature guarantees that a structure can be identified with translational invariance, meaning that the object can be recognised in any location in the image, as many times as necessary. Meanwhile, pooling layers also play an essential role in CNNs. A pooling layer condenses a feature map by obtaining a descriptor of small patches in the map, usually the maximum or the average. In maximum pooling (max pooling for short), the maximum value of each patch is chosen to form the condensed feature map. This step helps to reduce the dimensionality of the data and improves the generalisation of the method. Max pooling layers can make the model more robust, overcome small positional variations and improve statistical efficiency [23].

After several sets of convolutional and pooling layers, the original high-dimensional input is reduced substantially in size. In the final phase of the algorithm, the resulting feature map is flattened, so that fully connected layers can use the reduced feature vector to perform standard tasks like classification, as seen in Figure 2.7.
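Max pooling as described above can be sketched as follows. The example assumes non-overlapping square patches (stride equal to the patch size) and a hypothetical 4 × 4 feature map; other strides and patch shapes are possible.

```python
def max_pool2d(feature_map, size=2):
    """Condense a feature map by keeping the maximum of each size x size patch."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(feature_map[r + i][c + j]
                           for i in range(size) for j in range(size)))
        pooled.append(row)
    return pooled

# Hypothetical 4x4 feature map.
feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]]

pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 2], [2, 8]]
```

The 4 × 4 map is reduced to 2 × 2, and moving a strong response by one position inside a patch leaves the pooled output unchanged, which illustrates the robustness to small positional variations mentioned above.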

Sparse interactions: fewer trainable parameters

The convolutional approach previously described successfully tackles the problem of spatial dependency in the data. Additionally, it provides a considerable reduction in the number of trainable parameters due to its sparse interactions, which brings several practical benefits. First of all, a method with fewer parameters is less complex and hence less prone to overfitting: the reduction in parameters, a consequence of the convolutional architecture and the max pooling layers, gives robustness to the algorithm. Second, having fewer parameters means lower memory requirements and fewer calculations, which makes the method faster and more efficient. Additionally, the characteristic weight sharing of the method improves its statistical efficiency: traditional DL methods use each weight only once, while CNNs reuse their parameters across the input, making each of them more significant.
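The size of this reduction can be estimated with simple arithmetic. The sketch below compares a fully connected layer with a convolutional layer for hypothetical sizes chosen only for illustration: a 224 × 224 RGB input, 64 neurons or filters, and 3 × 3 kernels. Each neuron or filter is assumed to have one bias.

```python
def dense_params(n_inputs, n_neurons):
    """Weights plus one bias per neuron in a fully connected layer."""
    return n_inputs * n_neurons + n_neurons

def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus one bias per filter in a convolutional layer."""
    return (kernel_h * kernel_w * in_channels + 1) * n_filters

# Hypothetical sizes: a 224 x 224 RGB image, 64 neurons or 64 filters.
n_inputs = 224 * 224 * 3
dense = dense_params(n_inputs, 64)
conv = conv_params(3, 3, 3, 64)

print(dense)  # 9633856
print(conv)   # 1792
```

The number of parameters of the convolutional layer does not depend on the image size at all, since the same filter weights are shared across every position.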

2.2.4 Graph Convolutional Networks

CNNs have proven to be successful in problems where data has a grid-like structure. However, there is a vast supply of problems that involve data on irregular or non-Euclidean domains, such as those structured as graphs. For example, data coming from social networks, log data on communication networks, bibliographical data [33], text data on word embeddings and biological interactions involve pairwise relationships that are better described by graphs [34]. Most of these problems cannot be appropriately addressed without considering their graph nature. Additionally, many tasks currently being tackled by standard DL methods could be boosted by exploiting the graph structure of the data, like the classification of molecules based on their 3D structure [35]. In this case, the molecule structure can be transformed into a graph, turning atoms and bonds into nodes and edges, respectively. Consequently, graph convolutional networks (GCNs) emerged as an adaptation of CNNs to irregular grids, generalizing the notion of convolution, which is otherwise defined only on structured domains [1].

Graph Theory: The basics

A graph is a mathematical structure that represents heterogeneous pairwise relationships of the objects of a set [34]. A graph can be described as a pair $G = (V, E)$, where $V$ is a set of nodes or vertices and $E$ the set of connections between them, called edges.

Graphs are directed if the nodes $i, j$ involved in an edge are not interchangeable; in other words, the edges $(i, j)$ and $(j, i)$ represent different interactions. On the contrary, an undirected graph has interactions that do not depend on the order of the nodes involved, such that $(i, j) = (j, i)$ [36].

A weighted graph is a graph with a function $w : E \to \mathbb{R}$ that relates each edge to a real number, giving each one a quantitative level of importance [36].

A bipartite graph is an undirected graph in which the set of nodes $V$ can be partitioned into two non-overlapping subsets $V_1$ and $V_2$, such that nodes in one subset can only interact with nodes from the other. More precisely, for every edge $(i, j)$ in $E$, either $i \in V_1$ and $j \in V_2$, or $i \in V_2$ and $j \in V_1$ [36].

The degree of a node is the number of connections that the node has to other nodes. In directed graphs, the in-degree is defined as the number of connections coming from other nodes, while the out-degree is the number of connections originating in the node.

The adjacency matrix $A$ of a graph is a matrix of dimension $|V| \times |V|$, where $|V|$ is the number of nodes in the graph, such that $a_{ij} = 1$ if $(i, j) \in E$, and $a_{ij} = 0$ otherwise. In weighted graphs, the 1 is replaced by the number given by the function $w$. The adjacency matrix is an essential structure for describing a graph, as it is the numerical base of the notion of locality in GCNs.

Another important representation of a graph is the Laplacian matrix $L$, which is defined as

$$L = D - A, \tag{2.28}$$

where $D$ is a diagonal matrix holding the degree of each node, defined as $D_{ii} = \sum_j A_{ij}$, and $A$ is the adjacency matrix [34]. Sometimes a normalisation is applied to the Laplacian in the form

$$L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}, \tag{2.29}$$

where $I$ is the identity matrix. This transformation is known as the symmetric normalized Laplacian [37].

In applications of interest for this study, the graph nodes represent some agent, usually described by some properties. In the case of biological networks, these nodes are either proteins or chemical compounds. In general, a feature vector $x$ is created by stacking attributes of the agent to describe it numerically. Subsequently, these vectors are stacked into a feature matrix $X$ of dimension $n \times p$, where $n$ is the number of nodes in the graph and $p$ is the dimension of the feature vectors.
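As a minimal sketch of the definitions above, the following Python snippet builds the adjacency matrix, the Laplacian of equation (2.28) and the symmetric normalized Laplacian of equation (2.29) with NumPy. The four-node path graph and the function names are hypothetical and serve only as an illustration; the sketch assumes an undirected, unweighted graph without isolated nodes.

```python
import numpy as np

# Hypothetical undirected, unweighted path graph 0 - 1 - 2 - 3.
edges = [(0, 1), (1, 2), (2, 3)]
n_nodes = 4

def adjacency_matrix(edges, n_nodes):
    """Symmetric 0/1 adjacency matrix of an undirected graph."""
    A = np.zeros((n_nodes, n_nodes))
    for i, j in edges:
        A[i, j] = 1.0
        A[j, i] = 1.0
    return A

def laplacians(A):
    """Return (L, L_sym) following equations (2.28) and (2.29).

    Assumes every node has at least one edge, so that D^{-1/2} exists.
    """
    degrees = A.sum(axis=1)                 # D_ii = sum_j A_ij
    L = np.diag(degrees) - A                # L = D - A
    D_inv_sqrt = np.diag(1.0 / np.sqrt(degrees))
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt     # D^{-1/2} L D^{-1/2}
    return L, L_sym

A = adjacency_matrix(edges, n_nodes)
L, L_sym = laplacians(A)
```

Both forms of equation (2.29) can be compared numerically on this example, and the rows of $L$ sum to zero, as expected from $L = D - A$.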

From Images to Graphs: Evolution of CNN

GCNs are a relatively new domain in DL, with many promising approaches and many unexploited applications. Previous attempts to implement CNN-like tools in graphs involved unsupervised approaches based on wavelet analysis, creating multi-layered transformations of signals propagated in graphs [37].

Bruna et al. [37] paved the way to GCNs by imitating the limited filter support and weight sharing of convolutional layers, together with the dimensionality reduction of pooling layers, using two distinct approaches. The first approach used a measure of locality of nodes to cluster them iteratively and transform their features to reduce dimensionality. The second approach is rooted in the convolution theorem and spectral theory, and it was this idea that founded a whole new family of GCN methods.

The convolution theorem states that a convolution in the spatial domain is simply a multiplication in the Fourier domain [34]. It is therefore desirable to express graph convolution in the Fourier domain. To this end, the Laplacian can be diagonalised as

$$L = U \Lambda U^T, \tag{2.30}$$

where $\Lambda$ is the diagonal matrix of eigenvalues and $U$ the matrix of orthonormal eigenvectors. Since the Laplacian is symmetric positive semidefinite, its spectrum is nonnegative. In this way, the eigenvectors form a basis of the Fourier space, which means that for a graph signal $x$ (i.e., a vector having a value in each node), its graph Fourier transform is simply $U^T x$.

Defferrard et al. [34] continued the work of Bruna et al. by characterising convolutions on graphs as filterings $g_\theta$ of a signal $x$ in the Fourier space,

$$g_\theta * x = U g_\theta U^T x, \tag{2.31}$$

where $*$ is the convolution operator. To make the calculation computationally feasible, they implemented a truncated expansion of the filter $g_\theta$ in terms of the Chebyshev polynomials, treating the filter as a function of the eigenvalues $\Lambda$ of the Laplacian,

$$g_\theta(\Lambda) \approx \sum_{k=0}^{K} \theta_k T_k(\Lambda), \tag{2.32}$$

where $T_k$ is the Chebyshev polynomial of order $k$, $K$ is the truncation index and $\theta_k$ the Chebyshev coefficient for term $k$. The Chebyshev polynomials are recursively defined as $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$ with $T_0(x) = 1$ and $T_1(x) = x$. Using equations (2.31), (2.32) and the normalisation $\hat{\Lambda} = \frac{2}{\lambda_{max}} \Lambda - I$, the convolution can be defined in terms of the Chebyshev polynomials:

$$g_\theta * x \approx \sum_{k=0}^{K} \theta_k T_k(\hat{L}) x, \tag{2.33}$$

with $\hat{L} = U \hat{\Lambda} U^T$.

Finally, Kipf & Welling [33] consolidated the previous work in a simple and well-behaved layer-wise propagation rule using only the adjacency matrix and the feature matrix of the nodes. Using equation (2.33), the Chebyshev polynomials up to the 1st order and the approximation $\lambda_{max} \approx 2$ for the normalized Laplacian, they expressed the convolution as

$$g_\theta * x \approx \theta_0 x - \theta_1 D^{-1/2} A D^{-1/2} x. \tag{2.34}$$
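The Chebyshev filtering of equation (2.33) and its first-order special case in equation (2.34) can be illustrated with a short NumPy sketch. The four-node cycle graph, the signal and the coefficients are arbitrary values chosen for illustration, and the function name is hypothetical. The sketch applies the recurrence directly to the rescaled Laplacian and compares the result with the same filter evaluated in the Fourier domain as in equation (2.31).

```python
import numpy as np

def chebyshev_filter(L_hat, x, theta):
    """Apply sum_k theta[k] * T_k(L_hat) @ x using the Chebyshev recurrence."""
    t_prev = x                              # T_0(L_hat) x = x
    out = theta[0] * t_prev
    if len(theta) == 1:
        return out
    t_curr = L_hat @ x                      # T_1(L_hat) x = L_hat x
    out = out + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_next = 2.0 * (L_hat @ t_curr) - t_prev
        out = out + theta[k] * t_next
        t_prev, t_curr = t_curr, t_next
    return out

# Hypothetical 4-node cycle graph and graph signal.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_norm = d_inv_sqrt @ A @ d_inv_sqrt
L_sym = np.eye(4) - A_norm

eigvals, U = np.linalg.eigh(L_sym)          # L_sym = U diag(eigvals) U^T
lam_max = eigvals.max()
L_hat = 2.0 * L_sym / lam_max - np.eye(4)   # rescaled Laplacian

x = np.array([1.0, -2.0, 0.5, 3.0])
theta = np.array([0.5, -0.3, 0.2])          # K = 2, arbitrary coefficients

# Filtering without any eigendecomposition, equation (2.33).
y_spatial = chebyshev_filter(L_hat, x, theta)

# Same filter evaluated in the Fourier domain, equation (2.31).
lam_hat = 2.0 * eigvals / lam_max - 1.0
g = np.polynomial.chebyshev.chebval(lam_hat, theta)
y_spectral = U @ (g * (U.T @ x))

# First-order truncation (K = 1): for this graph lam_max equals 2, so the
# result coincides with the right-hand side of equation (2.34).
y_first_order = chebyshev_filter(L_hat, x, theta[:2])
y_eq_2_34 = theta[0] * x - theta[1] * (A_norm @ x)
```

The practical point of the recurrence is that it only needs matrix-vector products with the Laplacian, so the eigendecomposition used here for comparison is not required when applying the filter.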

Motivated by overfitting reduction and computational efficiency, they further constrained the number of parameters by setting $\theta = \theta_0 = -\theta_1$, which yielded

$$g_\theta * x \approx \theta \left( I_N + D^{-1/2} A D^{-1/2} \right) x. \tag{2.35}$$

A subsequent renormalisation is required to give numerical stability to the method, specifically to avoid vanishing and exploding gradients (equation (2.36)). This step adds an identity matrix $I_N$ to the adjacency matrix, which amounts to adding self-loops to the nodes. Updating the adjacency matrix and the degree matrix yields

$$\tilde{A} = A + I_N, \qquad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}. \tag{2.36}$$

Kipf & Welling’s generalisation uses the feature matrix X as an input signal and a matrix Θ of trainable parameters to create a compact layer equation:

$$Z = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X \Theta, \tag{2.37}$$

which is the standard GCN equation for numerical implementations.

The next step was taken by Schlichtkrull et al. [38], who proposed the relational GCN (R-GCN), the first GCN to tackle the problem of multirelational link prediction in graphs, that is, predicting one or several types of edges by using an encoder-decoder complex⁵. The encoder generates feature embeddings for each node using

$$h_i^{(l+1)} = \sigma \left( \sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)} \right), \tag{2.38}$$

where $h_i$ is the embedding of node $i$, $l$ is the current iteration (or layer), $N_i^r$ is the set of neighbours of node $i$ under relation $r$, $R$ is the set of all possible relation types, $W_r$ is a relation-specific trainable weight matrix, $W_0$ is the trainable matrix that modulates the influence of node $i$'s own features, $\sigma$ is the non-linear activation function (in this case ReLU), and $c_{i,r}$ is a normalisation constant. In the first iteration, the input $h_i^{(0)}$ is the node feature vector $x_i$.

Next, to predict an unknown link between two nodes, their corresponding embeddings are used by a DistMult [39] tensor factorisation decoder to predict the relation between the corresponding nodes:

$$f(e_i, r, e_j) = e_i^T R_r e_j, \tag{2.39}$$

where $e_i$ and $e_j$ represent the embedding vectors of the two nodes created by the encoder, and $R_r$ is a relation-specific diagonal parameter matrix. The output is the occurrence probability of relation $r$ in the link between nodes $i$ and $j$.
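The encoder-decoder pair of equations (2.38) and (2.39) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the implementation of [38]: the normalisation constant is assumed to be $c_{i,r} = |N_i^r|$, embeddings are stored as rows (so the weight matrices act from the right), a sigmoid is assumed to map the decoder score to a probability, and all graphs, weights and names are hypothetical.

```python
import numpy as np

def rgcn_layer(H, adj_by_relation, W_by_relation, W_self):
    """One encoder layer in the spirit of equation (2.38), with ReLU.

    H: (n, d_in) node embeddings, one row per node.
    adj_by_relation[r]: (n, n) 0/1 float adjacency matrix of relation r.
    W_by_relation[r], W_self: (d_in, d_out) weight matrices.
    Assumption: c_{i,r} = |N_i^r|, the neighbour count under relation r.
    """
    out = H @ W_self                                   # self-connection term
    for r, A_r in adj_by_relation.items():
        counts = A_r.sum(axis=1, keepdims=True)
        norm = np.divide(1.0, counts, out=np.zeros_like(counts), where=counts > 0)
        out = out + (norm * A_r) @ H @ W_by_relation[r]
    return np.maximum(out, 0.0)                        # ReLU

def distmult_score(e_i, e_j, r_diag):
    """DistMult score e_i^T diag(r_diag) e_j, equation (2.39)."""
    return float(np.sum(e_i * r_diag * e_j))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy graph with 4 nodes and 2 relation types.
rng = np.random.default_rng(0)
n, d_in, d_out = 4, 3, 2
H0 = rng.normal(size=(n, d_in))
adj = {
    "r1": np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float),
    "r2": np.array([[0, 0, 1, 0], [0, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]], dtype=float),
}
W = {r: rng.normal(size=(d_in, d_out)) for r in adj}
W0 = rng.normal(size=(d_in, d_out))
R_r1 = rng.normal(size=d_out)                          # diagonal of R_r for "r1"

H1 = rgcn_layer(H0, adj, W, W0)
prob_01 = sigmoid(distmult_score(H1[0], H1[1], R_r1))
```

Because $R_r$ is diagonal, the score is symmetric in the two nodes, which fits undirected relations.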

⁵ As shown in section 3.4.2, this encoder proposal was crucial to open the gate for Decagon and the methods that came afterwards.

Additionally, they implemented a regularisation technique in which the relation-specific weight matrices are built from a shared set of basis matrices. In this way, parameters are implicitly shared among relations, and the encoder only has to train a few coefficients per relation on top of the shared basis.

It is important to note that the inner sum of equation (2.38) can be efficiently implemented as a multiplication of the (sparse) adjacency matrix with the node embeddings and the weight matrix, as in the classical formulation of equation (2.37). This formulation removes the need to explicitly calculate the sums over every neighbouring node.
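For reference, the propagation rule of equations (2.36) and (2.37), which the paragraph above relates to the R-GCN encoder, can be written as a single matrix product. The sketch below uses dense NumPy arrays for readability and a hypothetical three-node path graph; a practical implementation would typically store the adjacency matrix in a sparse format.

```python
import numpy as np

def gcn_layer(A, X, Theta):
    """Z = D~^{-1/2} A~ D~^{-1/2} X Theta, equations (2.36) and (2.37)."""
    A_tilde = A + np.eye(A.shape[0])        # add self-loops
    d_tilde = A_tilde.sum(axis=1)           # degrees of A~, always >= 1
    d_inv_sqrt = 1.0 / np.sqrt(d_tilde)
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return A_hat @ X @ Theta

# Hypothetical path graph 0 - 1 - 2 with one-hot node features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.eye(3)
Theta = np.eye(3)                            # identity weights expose A_hat itself

Z = gcn_layer(A, X, Theta)
```

With identity features and weights, `Z` is the renormalised adjacency matrix itself, which makes the effect of the self-loops and of the degree normalisation easy to inspect.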

2.3 Algorithmic Complexity

Complex structures like graphs can be characterised based on their complexity. A complex entity can be interpreted as one that involves a high degree of randomness in its generation. In contrast, less complex structures may be the product of a condensed systematic algorithm; in other words, they can be generated by a short program following basic rules. It has been common to estimate complexity using statistical or information-theoretic approaches such as Shannon entropy. However, these measures lack invariance over representations of the structures, which means that they may give different estimates of complexity for the same object depending on the descriptor [40]. Moreover, statistical methods do not account for the "true" mechanism that generates the structures; instead, they exploit statistical regularities and miss their generative properties.
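The two limitations just mentioned can be illustrated with a small, self-contained Python sketch. The sequences and the function name are hypothetical. A strictly periodic sequence and a shuffled version of it have exactly the same symbol-level Shannon entropy, although only the first one is generated by a short rule; describing the periodic sequence by pairs of bits instead of single bits changes its entropy estimate altogether.

```python
import math
import random
from collections import Counter

def shannon_entropy(symbols):
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

periodic = "01" * 500                        # generated by a very short rule
rng = random.Random(0)
shuffled_bits = list(periodic)
rng.shuffle(shuffled_bits)                   # same symbols, no obvious rule
shuffled = "".join(shuffled_bits)

h_periodic = shannon_entropy(periodic)       # entropy over single bits
h_shuffled = shannon_entropy(shuffled)       # identical symbol counts

# Same periodic sequence, described by pairs of bits instead of single bits.
pairs = [periodic[i:i + 2] for i in range(0, len(periodic), 2)]
h_pairs = shannon_entropy(pairs)
```

Here `h_periodic` and `h_shuffled` coincide because both sequences contain the same number of zeros and ones, while `h_pairs` drops to zero because the only pair that ever occurs is `01`.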

Algorithmic Complexity: A robust indicator

One measure of the complexity of a system is the Kolmogorov complexity (KC) or algorithmic complexity. In short, the algorithmic complexity $K$ of a sequence $s$ can be defined as the length in bits of the shortest computer program that can generate the sequence in a universal Turing machine (UTM) and halt [40],

$$K_T(s) = \min\{|p| : T(p) = s\}, \tag{2.40}$$

where $p$ is a program and $T$ is a universal prefix-free⁶ Turing machine [41]. Hence, when the complexity of a sequence is similar to its length in bits, the sequence is highly complex or random, while its complexity is much lower than its length when the sequence is highly structured.

The use of KC provides a more robust measure of complexity, based on algorithmic regularities rather than statistical patterns. Additionally, KC satisfies an invariance theorem, which states that the algorithmic complexity of an object depends on the chosen UTM (or programming language) only up to a constant [40]. More precisely, given two UTMs $U_1$, $U_2$ and their complexity values $K_{U_1}(s)$, $K_{U_2}(s)$ for $s$, there exists a constant $c_{U_1 U_2}$ independent of $s$ such that

$$|K_{U_1}(s) - K_{U_2}(s)| < c_{U_1 U_2}. \tag{2.41}$$

⁶ A group of programs forms a prefix-free set if no element is a prefix of another [40]. In other words, a program cannot be an extension of another.

The main shortcoming of the algorithmic complexity is that it is uncomputable. Its uncomputability arises from the fact that it is impossible, in general, to determine beforehand whether a program will ever halt. However, algorithmic complexity is upper semi-computable, which means that it can be approximated from above: upper bounds can be found. Traditionally, lossless compression methods were used to estimate algorithmic complexity, motivated by the fact that compressibility implies non-randomness [40]. Nevertheless, most compression algorithms rely heavily on statistical patterns, so they are closely related to Shannon's entropy and inherit all its previously stated issues.

An alternative way of estimating KC uses algorithmic probability (AP), denoted $m$, which assigns a universal prior probability to objects [42]. The AP of a sequence $s$ is the probability that a random binary computer program produces the sequence [43]. In general, it is the sum over all the programs $p$ that output $s$ in a UTM and halt,

$$m(s) = \sum_{p : T(p) = s} \frac{1}{2^{|p|}}. \tag{2.42}$$
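The compression-based estimate mentioned above can be sketched with the `zlib` module of the Python standard library. This is only an illustration of the idea, with hypothetical inputs: the compressed length is a crude proxy that reflects the statistical regularities a particular compressor can exploit, not KC itself.

```python
import random
import zlib

def compressed_length_bits(data: bytes) -> int:
    """Length in bits of the zlib-compressed data, a crude proxy for K."""
    return 8 * len(zlib.compress(data, 9))

# Highly structured input: the two bytes "01" repeated 5000 times.
structured = b"01" * 5000

# Pseudo-random input of the same length (fixed seed for reproducibility).
rng = random.Random(0)
random_like = bytes(rng.getrandbits(8) for _ in range(10000))

k_structured = compressed_length_bits(structured)
k_random = compressed_length_bits(random_like)
raw_bits = 8 * len(structured)
```

The structured input compresses to a small fraction of its raw length, whereas the pseudo-random input barely compresses at all, even though it was itself produced by a short program (a seeded generator). This is one concrete way in which compressors can overestimate algorithmic complexity.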

The AP magnitude depends on the length of the program. The long programs contribute with minimal terms while the short ones compose the most signif- icant parts of the sum. Following this logic, AP is composed primarily of the term involving the shortest program, which is directly related to KC. In this way, the Coding Theorem establishes a connection between AP and KC in the following way:

K(s) = − log2 m(s) + O(1). (2.43)

Moreover, to approximate KC straightforwardly, one can use the coding theorem method (CTM), which explores a space of Turing machines and counts those that produce the given sequence. Knowing the number of symbols $t$ and states $k$ of a Turing machine $T$, one can generate an empirical distribution $D(t, k)$ that approximates the algorithmic probability of $s$ as

$$D(t, k)(s) = \frac{|\{T \in (t, k) : T \text{ outputs } s\}|}{|\{T \in (t, k) : T \text{ halts}\}|}. \tag{2.44}$$

Using the results from equation (2.44), an estimation of KC using the CTM becomes

CTM(t, k)(s) = − log2 D(t, k)(s). (2.45)
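Equations (2.44) and (2.45) amount to counting outputs and taking a logarithm. The sketch below shows this bookkeeping on a made-up list of outputs; the list is hypothetical and does not come from an actual enumeration of Turing machines.

```python
import math
from collections import Counter

# Hypothetical outputs of the halting machines in a small enumeration.
halting_outputs = ["0", "0", "1", "1", "00", "01", "0", "1", "11", "00"]

def ctm_table(outputs):
    """Map each output s to -log2 D(s), where D(s) is its output frequency."""
    counts = Counter(outputs)
    total = len(outputs)                     # number of halting machines
    return {s: -math.log2(c / total) for s, c in counts.items()}

ctm = ctm_table(halting_outputs)
```

Sequences produced by many machines receive a low complexity estimate, while sequences produced by few machines receive a high one.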

Block Decomposition Method

Despite the previous efforts, the halting condition of programs is still an obstacle, as seen in the denominator of equation (2.44). This limitation makes KC virtually incomputable and the CTM inapplicable to the long sequences found in practical datasets. Nevertheless, the CTM can still be used with brief sequences. In particular, the values of $D$ are known for binary sequences of length $n < 5$.

Building on this, a method for estimating KC proposed by Zenil et al. [40] involves the decomposition of long sequences into smaller blocks with a known value of KC. The total KC is calculated by an aggregation rule that adds the complexities of the individual blocks plus a term involving the number of occurrences of each block:

$$BDM(t, k)(s) = \sum_i \left[ CTM(t, k)(s_i) + \log_2(n_i) \right], \tag{2.46}$$

where the sum goes over all the individual partitions $i$, and $n_i$ represents the number of occurrences of partition $i$.

The block decomposition method (BDM) sacrifices memory in exchange for efficiency, as the known complexities of all possible blocks are stored and looked up to calculate the sum in equation (2.46). The BDM algorithm is implemented in a Python library and is explained in section 4.3.2. The implementation can currently calculate the approximation of KC for binary objects of arbitrary size in 1 and 2 dimensions.
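The aggregation rule of equation (2.46) can be sketched as follows. The lookup table `TOY_CTM` contains made-up values for blocks of length 2 and the function is hypothetical; it is not the API of the library mentioned above. The sketch assumes non-overlapping blocks and drops any trailing remainder.

```python
import math
from collections import Counter

# Hypothetical CTM values (in bits) for all binary blocks of length 2.
# Real values come from precomputed tables; these numbers are made up.
TOY_CTM = {"00": 2.0, "11": 2.0, "01": 2.5, "10": 2.5}

def bdm(sequence, block_size=2, ctm=TOY_CTM):
    """Sum over distinct blocks of CTM(block) + log2(count), equation (2.46).

    Assumption: non-overlapping blocks; a trailing remainder is dropped.
    """
    n_blocks = len(sequence) // block_size
    blocks = [sequence[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
    counts = Counter(blocks)
    return sum(ctm[block] + math.log2(n) for block, n in counts.items())

bdm_regular = bdm("01010101")    # one distinct block, seen 4 times
bdm_mixed = bdm("01001110")      # four distinct blocks, each seen once
```

A repeated block contributes its CTM value only once plus the logarithm of its multiplicity, so the regular sequence receives a lower estimate than the mixed one of the same length.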

Applications of algorithmic complexity in ML

Although algorithmic complexity is a well-studied field, its uncomputability made its applications very limited before the development of BDM. This recent approach has put KC on the list of tools used by ML, but complexity-powered ML methods are still in their infancy. So far, the approaches involve incorporating KC-dependent terms as regularisers in the cost functions, penalising high-complexity models and forcing them to learn with fewer parameters.

Hernandez-Orozco et al. pioneered this idea, using BDM among other approximation methods to impose complexity restrictions on the model [43]. Their approach, however, implied a non-smooth loss surface, not trainable by standard gradient-based methods like SGD. Flood et al. overcame this limitation by stating the regularisation term in a probabilistic fashion and calculating its gradient directly, so that the loss surface can be trained by traditional optimisers [42].

Chapter 3

Related Work and State of the Art

The rise in computational power has made it possible to tackle many unsolved biomedical problems with computational methods [14]. In the field of pharmacology, computational techniques, and especially ML techniques, have recently been used for drug design [4], repurposing [44], and the prediction of molecular or pharmacological properties [45], toxicity [46, 47] and side effects [48], among others. In particular, the prediction of drug side effects has been very successful, with models constantly increasing their performance and complexity. As these models rely heavily on data, they have benefited from the advances in experimental and bioinformatic techniques that keep feeding the field with information.

The methods of interest in this thesis concern the optimality and consequences of drug combinations. Despite constituting a small fraction of the computational methods in pharmacology, they still span an immense variety of techniques, objectives and types of input [14]. Hence, these studies can be classified in multiple ways. In terms of goal, some studies focus on identifying synergistic drug combinations, while others predict the presence of inherent ADRs. From another perspective, methods can be sorted by the type of data used, which is a determining factor of performance, especially in ML approaches. One standard input is data or features concerning the specific drugs. These features can be chemical (reactivity, structure) [49], pharmacological (pharmacodynamic and pharmacokinetic features) [49, 50], drug classification information (Anatomical Therapeutic Chemical (ATC) codes) [51], or individual side effects, among many others. These features are usually used in methods that look for a degree of similarity among drugs to predict synergy or side effects [3]. Another popular choice is network data. In these cases, the biological network is built from known interactions among genes (and their respective proteins), interactions among drugs, and interactions between the two groups (drug-target interactions). The general idea behind these methods is to associate specific diseases with pathways of the biological network and to track which drugs affect these pathways.

Finally, in this chapter, the methods are classified according to the computational method used, starting from the most general methods, moving to simple trainable ML methods such as linear regression, tree-based methods or support vector machines, and culminating with the state-of-the-art technique in ML: deep artificial neural networks. First, a summary of the traditional methods for quantifying drug interactions introduces the reader to the topic.

3.1 Traditional synergy calculations

The concepts of additivity, antagonism and synergy among drugs have been studied for many years, together with attempts to quantify them. Historically, two standard metrics have been used: the Loewe additivity and the Bliss independence. They make different assumptions about the operating mechanisms of drugs and may come to different conclusions, generating a still active debate on which one is more appropriate [19].

The Loewe additivity assumes that the drugs act through a similar mechanism. Assuming that two drugs, drug A and drug B, can each achieve a given effect $X$ alone with concentrations $[I_A]_X$ and $[I_B]_X$ respectively, the same effect can be achieved with a combination of the two drugs in concentrations $[C_A]_X$ and $[C_B]_X$ respectively. To quantify the combined effect of drugs, the Loewe additivity performs an isobologram¹ analysis. The isobologram in this case is given by equation (3.1).

$$1 = \frac{[C_A]_X}{[I_A]_X} + \frac{[C_B]_X}{[I_B]_X} \tag{3.1}$$

Note that the concentrations $C$ and $I$ for a given drug are the same if the concentration of the other drug is zero, which is shown in the isobolograms at points $[0, I_B]$ and $[I_A, 0]$ in Figure 3.1.

¹ A diagram showing the varying concentrations of two compounds that give constant activity or effect.

[Figure 3.1 image: three isobologram panels titled Additivity, Synergy and Antagonism; horizontal axis: concentration of drug A, from 0 to [IA]50%; vertical axis: concentration of drug B, from 0 to [IB]50%.]

Figure 3.1: Different drug interactions result in different isobole shapes: additive interactions result in a linear isobole, synergic interactions give a convex curve that lies below the additive isobole (the dashed grey line), while antagonistic interactions give a concave isobole that lies above the additive isobole. Figure adapted from [52].

On the other hand, the Bliss independence assumes that the mechanisms of action of the two drugs are different. This assumption gives rise to the concept of effect multiplication, where the result is calculated assuming both events come from independent probability distributions [52]. Accordingly, the Bliss independence estimates that the total effect of an additive multi-drug therapy will be the multiplication of the individual effects,

$$E_T = E_A \times E_B, \tag{3.2}$$

where each effect $E$ is expressed as a fraction representing activity compared to control measures representing full inhibition (0%) and null inhibition (100%) [52]. In this fashion, when the combined effect of the treatment is lower than the additive $E_T$, the combination of drugs is antagonistic, and when the combined effect is higher, it is synergistic.

As the reader can infer, both the Loewe additivity and the Bliss independence need experimental results of drug combinations to calculate the isobolograms and the total effect $E_T$ respectively, which limits their predictive capabilities for new pairs. Besides, given their strong assumptions, their effectiveness is doubtful for drug pairs with uncertain mechanisms [19].
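The two reference models of this section reduce to one-line formulas, sketched below with hypothetical concentrations and effects. The Loewe function evaluates the right-hand side of equation (3.1) for an arbitrary dose pair: a value of 1 places the pair on the additive isobole, and a value below 1 places it below that line, in the region that Figure 3.1 associates with synergy. The Bliss function returns the expected combined value of equation (3.2).

```python
def loewe_index(c_a, c_b, i_a, i_b):
    """Right-hand side of equation (3.1): c_a / i_a + c_b / i_b."""
    return c_a / i_a + c_b / i_b

def bliss_expected(e_a, e_b):
    """Expected combined effect under Bliss independence, equation (3.2)."""
    return e_a * e_b

# Hypothetical single-drug concentrations reaching a 50% effect.
I_A, I_B = 10.0, 4.0

# Dose pair on the additive isobole: half of each single-drug dose.
index_additive = loewe_index(5.0, 2.0, I_A, I_B)

# Dose pair assumed to reach the same effect with a quarter of each dose.
index_below = loewe_index(2.5, 1.0, I_A, I_B)

# Hypothetical single-drug effects, expressed as fractions of control activity.
expected_combined = bliss_expected(0.6, 0.5)
```

Whether an observed value above or below `expected_combined` is read as synergy depends on whether the effect is expressed as remaining activity or as inhibition, so the sketch deliberately stops at the expected value.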

3.2 General Methods

Not so long ago, drug combination strategies were based on physical evidence of synergy, preclinical or clinical insights into nonoverlapping toxicities [53], medical expertise and intuition, or merely trial and error [51]. An early attempt to find an optimal drug combination for cancer treatment using computational means consisted of a genetic algorithm that detected the promising drug combinations in ongoing multiple-drug in vitro experiments [53]. These combinations were then prioritised in the screening process, while the least promising combinations were considered to have a high probability of failure and were discarded. This algorithm worked by optimising a fitness function defined by specific attributes of the combinations, mainly the number of cells that remained alive after applying the drugs. Although the method was useful for directing experimental resources to combinations with high chances of synergy, it relied heavily on experimental data and could not predict a drug combination in advance.

Huang et al. [11] attempted to choose drug pairs by grouping drugs that affected the signalling pathways of the diseases they wanted to treat. To achieve this, they grouped the different drugs by genomic profiling, using a Bayesian non-negative matrix factorisation to represent the drugs' functional network. At the same time, they reconstructed the disease's signalling network modules with protein interactome data integrated with the disease genomic data. The condition for a pair to be chosen was then that both drugs should inhibit the disease's signalling network.

Sun et al. [12] tried a semi-supervised approach using a manifold ranking algorithm over anti-cancer drug pairs with carefully engineered features. Starting from targeting network and transcriptomic profile data, they assembled a seven-feature vector to perform a synergy ranking and pick the top synergic drug pairs. They validated their predictions with experimental data and in vitro experiments on zebrafish, obtaining considerably better results than their predecessors.

Zhang's method [54] could be recognised as an early precursor of the current method, as it is a first attempt to propagate node features through a network using an iterative matrix multiplication method. More importantly, it is the first to relate the importance of single side effects of the drugs to the polypharmacy side effect prediction, using them in one-hot encoded feature vectors for the drugs. Using drug network data from TWOSIDES and OFFSIDES [15], side effect data from SIDER [55] and FAERS [56], and chemical structure data from PubChem [57], the method built similarity matrices from the network and feature vectors to identify drug pairs that may potentially interact.

A subsequent study performed over network data by Chen et al. [58] used protein-protein interactions and the novel concept of pathway-pathway interactions of drugs. Drug pairs were given a synergy score according to how close their action pathways and targets were.

Among the studies using similarity measures is the work of Ferdousi et al., where binary vectors for drugs were assembled denoting the presence or absence of individual carriers, transporters, enzymes and targets [59]. The study then aimed to find drug interactions by comparing the vectors using inner product-based similarity measures.

3.3 Trainable Methods

ML methods stand out from other computational methods because they find patterns autonomously from data. In this sense, the following methods differ from the ones described in the previous section. In this thesis, the term ML methods refers specifically to supervised training methods, i.e. methods trained on examples with ground truth labels to predict the labels of new instances. It is important to remember that ML is not limited to these methods, although the terminology used here may suggest otherwise.

3.3.1 Linear Regression Methods

One widely used trainable method is logistic regression. Its popularity comes from its simplicity and intuitiveness. It is not surprising that many studies take this method as the primary approach to solve the drug combination problem, especially when the task is a binary classification problem.

Takeda et al. [49] used structural data from drugs to compute similarity measures and used them in a logistic regression to find appropriate drug pair candidates. Gottlieb et al. [50] used a similar approach but added chemical descriptors, ligand-based descriptors, ATC codes, single side effects and network data such as PPI. They integrated this data to construct features for each pair and train the classifier to predict adverse side effects and their severity. Other studies, like Huang et al. [60], include L1 regularisation in their logistic regression, using data from individual side effects to predict suitable drug combinations.

As logistic regression is a simple model, it sometimes becomes necessary to preprocess the data before classification, as in the work of Shi et al. [51]. This work consisted in transforming drug features from the drug network (DDI, DTI), individual side effects and ATC codes into a 4-feature vector for each drug. Drug pair features were formed by summing the individual features of the drugs in the pair. Finally, the logistic regression classifier was trained with the pair features to determine the synergy of pair candidates.

3.3.2 Tree-based methods

Another popular family of methods is that of tree-based methods. These methods can sometimes seem very simple, but they can outperform some of the most complex approaches while offering much better interpretability of results [8]. Methods such as RF and XGBoost (extreme gradient boosting) belong to this family. One example is the work of Li et al. [10], who used drug physicochemical and network features along with designed pharmacogenomic features from gene expression profiles to feed an RF algorithm that looked for synergistic pairs of drugs. Alternatively, Janizek et al. addressed the same problem by developing an XGBoost method called "TreeCombo", which used physical and chemical properties of drugs and gene expression levels of cell lines, with better performance than many previous approaches, including some DL methods [8]. Finally, one recent study [6] compared the two main tree-based methods (RF and XGBoost), finding that XGBoost performs slightly better than RF on data from the NCI-ALMANAC [61] database, which comprises known drug combinations, chemical structures and cell line expressions.

3.3.3 Other Machine Learning Approaches

Other methods that are less common in this field but equally important are probabilistic methods and SVMs. One example of the first group is the work of Li et al. [2], which predicts effective drug combinations and possible side effects using a Probabilistic Ensemble Approach (PEA). Their method used molecular and pharmacological phenotypes and target information to design six features for each drug pair. They compared the features of query drug pairs to those of known drug pairs and used a Bayesian network to predict the effect of the pair based on feature similarities.

On the other hand, Zheng et al. [62] developed a method using SVMs in a low-dimensional space to classify pairs of drugs as potentially adverse or safe. The algorithm begins by forming features for drugs using chemical structure, substituents and pathway information, and then appends single-drug features to form drug pair features. They developed an innovative way of creating a negative interaction dataset, by relating a drug pair to the pathway of a specific disease using scores and choosing the pairs with the lowest scores as negative examples.

Finally, in one study by Cheng et al. [63], most of the methods mentioned above were used to predict the occurrence of ADRs. They designed similarity feature vectors based on network, chemical structure, ATC codes and genomic data to feed five different algorithms: Naive Bayes, k-Nearest Neighbours (k-NN), linear regression, decision trees and SVMs. They found that, although most of the methods come to similar conclusions, SVMs perform slightly better.

3.4 Deep Learning Methods

3.4.1 Standard Deep Learning Methods

Naturally, DL algorithms are widely used in pharmacology, diagnosis and medical image analysis. Ryu et al. [16] used DL to tackle DDI prediction using exclusively structural information of individual drugs. Starting from chemical data in Simplified Molecular Input Line Entry System (SMILES) [64] format, feature vectors were created for each drug and later concatenated to form drug pair feature vectors. Afterwards, they performed a dimensionality reduction step (principal component analysis) on the vectors. The result was then fed to a deep feedforward neural network called DeepDDI, which explicitly states the type of interaction between the drugs (e.g. synergy, potential side effects, lower absorption) in a human-readable phrase.

Another relevant approach was DeepSynergy [4], a deep feedforward neural network that takes as inputs the chemical descriptors of the drugs in the pair together with the genomic information of a specific cell line. The purpose was to compute a synergy value of the drug pair over specific cell lines.

Later, more complex models like the one of Chen et al. [5] tackled synergy prediction. This method involved the use of several generative stochastic neural networks called Restricted Boltzmann Machines (RBM), stacked one after the other in an architecture known as a Deep Belief Network. The study combined information from ontology fingerprints, gene expression and pathway information to predict drug synergy.

Another study combining genomic and pharmacologic data was AuDNNSynergy, a method developed by Zhang et al. [65]. This method used a special kind of neural network called an autoencoder, which performs linear and non-linear transformations on genomic data to extract their most relevant components. Afterwards, it fed the transformed data and the drug physicochemical properties to a standard deep feedforward network to make the synergy prediction.

3.4.2 Decagon

The work that this thesis is based on is Decagon, a graph autoencoder for link prediction. Decagon stands out among the many existing methods for being the first to predict the side effects caused by the combination of drugs, not only the presence of synergy or toxicity. It differs from the work of Ryu et al. [16] in that they predicted a type of interaction, while Decagon predicts the specific side effect. Moreover, Decagon performs multirelational link prediction, which means that it can predict multiple side effects for a single drug pair. This novel functionality leads to a massive potential for immediate applications in the healthcare industry.

Decagon was chosen as the starting point of this thesis due to its ability to learn the graph structure of the data using a DL model. In this sense, it uses DL's characteristic representation learning properties and extends them to graph datasets. Decagon's neighbour propagation methodology matches the characteristics of biological networks because it exploits how altering a protein can alter several of its neighbours that share a common path. Additionally, the model can be enriched with node features, allowing the testing of algorithmic complexity features, which is one of the main goals of this thesis.

Decagon makes use of a multimodal graph consisting of a PPI network linked with a DDI network by a DTI network (Figure 3.2). The PPI network outlines the interactions between individual proteins in the metabolic pathways, where the proteins are the nodes and the interactions are the edges of the network. The DDI network represents known polypharmacy side effects between pairs of drugs, where the drugs are the nodes and the side effects are the edges. Finally, the DTI network links a drug in the DDI network with its target protein in the PPI network. All three graphs are unweighted and undirected, and the DTI graph is also bipartite between drugs and proteins. Also, the DDI network allows a pair of drugs to have multiple interactions, which gives the problem its multirelational characteristic. Additionally, every node (protein or drug) can have a feature vector describing its properties. A detailed explanation of the dataset can be found in section 4.1.

More formally, the method uses a graph $G = (V, E)$ with nodes $v_i$ in $V$ and edges $(v_i, v_j)$ in $E$, each of the latter having one or several categories $r_i$ in the set of possible categories $R$. The goal of the method is to classify an edge $(v_i, v_j)$ into one or more categories $r$. Considering the structure of the data, a category $r$ can be a link between a pair of proteins, a link between a drug and a protein, or one of the known polypharmacy side effects.

Figure 3.2: Multimodal graph consisting of the PPI, DTI and DDI networks. The triangular nodes represent drugs, while the circular ones represent proteins. Relations between drugs marked in red represent polypharmacy side effects. Additionally, each node can hold a particular feature vector. Figure obtained from [1].²

² This article is available under the Creative Commons CC-BY-NC license and permits non-commercial use, distribution and reproduction in any medium, provided the original work is properly cited.

Based on the work of Schlichtkrull et al. [38], Decagon starts with a graph convolutional encoder similar to the one in equation (2.38), which generates low-dimensional embeddings of each node. The embeddings are then processed by a relation-specific tensor factorization decoder, which outputs the predicted link labels for a pair of embeddings. The method is trained end-to-end using gradient-based methods.

As seen previously, this architecture allows for the use of node features. The original approach of Decagon uses single-drug side effects as the drug node features. The feature vectors were assembled following a "bag of words" intuition: they were binary vectors of dimension equal to the number of side effects in the data, where each entry marked the presence or absence of a side effect with a 1 or a 0, respectively. To avoid interpreting a polypharmacy side effect as a side effect coming from the use of one drug alone, the two sets of side effects were made mutually exclusive, in the sense that no side effect would be present in both sets. Protein features, on the other hand, were not included in the original work, which used an identity matrix as the protein feature matrix.
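The "bag of words" construction described above can be illustrated with a minimal sketch. The function name, the drug names and the side effect names below are hypothetical and chosen only for illustration; they are not taken from the Decagon code or data.

```python
def build_feature_vectors(drug_side_effects):
    """Build binary 'bag of words' vectors from a drug -> side effects mapping.

    Hypothetical helper: each vector has one entry per distinct side effect,
    set to 1 if the drug has that side effect and 0 otherwise.
    """
    # Fixed ordering of all side effects that appear in the data.
    vocabulary = sorted({se for effects in drug_side_effects.values() for se in effects})
    index = {se: k for k, se in enumerate(vocabulary)}

    features = {}
    for drug, effects in drug_side_effects.items():
        vector = [0] * len(vocabulary)
        for se in effects:
            vector[index[se]] = 1
        features[drug] = vector
    return vocabulary, features


# Toy data (made up for this example).
example = {"drugA": {"nausea", "headache"}, "drugB": {"rash"}}
vocabulary, features = build_feature_vectors(example)
```

With the toy data, the vocabulary has three entries, so every drug receives a vector of length three regardless of how many side effects it has.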

Encoder

The encoder’s purpose is to capture the essential features of each node and its neighbourhood in a compressed, lower-dimensional embedding vector. More precisely, it takes the feature vector $x_i$ of every node and generates a $d$-dimensional embedding $z_i$ for it. To achieve this, the encoder performs a layer-wise procedure of calculations, in which each layer transforms and propagates the feature vector of a node to its immediate neighbours, as shown in Figure 3.3. In a given layer, a node’s feature vector is updated into a linear combination of itself and its immediate neighbours’ vectors, and then transformed non-linearly by an activation function:

$$h_i^{(k+1)} = \sigma\left( \sum_{r \in R} \sum_{j \in N_r^i} c_r^{ij} W_r^{(k)} h_j^{(k)} + c_r^i h_i^{(k)} \right). \qquad (3.3)$$

This encoding method differs from the R-GCN of Schlichtkrull et al. [38] (equation (2.38)) only in lacking a trainable matrix transforming the feature vector of node $i$ and in the constants $c_r^{ij}$ and $c_r^i$, which take the values $1/\sqrt{|N_r^i||N_r^j|}$ and $1/|N_r^i|$, respectively. Here $N_r^i$ is the set of nodes connected to node $i$ through the category $r$, as in [38]. Moreover, Decagon does not apply any regularization technique to the trainable matrices $W$.

Just as in traditional DL techniques, the output of a layer is the input of the next one. This means that $K$ successive layers propagate node feature information to the $K$th-order neighbours. Following this reasoning, the final embedding is computed as $z_i = h_i^{(K)}$. Each application of equation (3.3) depends on the neighbourhood of the given node, which forces the encoder to create a different neural architecture for every node, all of them sharing the same parameters.
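To make the propagation rule of equation (3.3) concrete, the following NumPy sketch computes one layer on a toy graph. It is not the Decagon implementation. It assumes square weight matrices (so that the self term and the neighbour term have the same dimension), it adds one self term per relation, and it skips relations under which a node has no neighbours; all three choices are assumptions made for this illustration.

```python
import numpy as np


def encoder_layer(h, neighbours, weights, activation=np.tanh):
    """One propagation step in the spirit of equation (3.3).

    h          : (n_nodes, d) array with the current node vectors
    neighbours : dict relation -> dict node -> list of neighbour nodes
                 (assumed symmetric, i.e. undirected edges)
    weights    : dict relation -> (d, d) matrix W_r
    """
    out = np.zeros_like(h, dtype=float)
    for r, adjacency in neighbours.items():
        W = weights[r]
        for i in range(h.shape[0]):
            N_i = adjacency.get(i, [])
            if not N_i:
                continue  # assumption: no contribution when N_r^i is empty
            for j in N_i:
                # c_r^{ij} = 1 / sqrt(|N_r^i| |N_r^j|)
                c_ij = 1.0 / np.sqrt(len(N_i) * len(adjacency[j]))
                out[i] += c_ij * (W @ h[j])
            # Self term with c_r^i = 1 / |N_r^i|
            out[i] += h[i] / len(N_i)
    return activation(out)


# Toy graph: node 0 is linked to node 1 by relation "r1" and to node 2 by "r2".
toy_neighbours = {"r1": {0: [1], 1: [0]}, "r2": {0: [2], 2: [0]}}
toy_weights = {"r1": np.eye(3), "r2": np.eye(3)}
toy_h = np.eye(3)
toy_out = encoder_layer(toy_h, toy_neighbours, toy_weights, activation=lambda x: x)
```

Stacking two such calls would let information from second-order neighbours reach each node, which mirrors the layer-wise reasoning in the text.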

Figure 3.3: Decagon’s encoder: For a node a, the encoder sorts its first neighbours by edge type. For each edge type, the encoder performs linear combinations of the corresponding nodes’ vectors with an edge type-specific weight matrix $W_r$. The linear combinations are summed together and non-linearly activated by a function $\sigma$ to form the layer’s output $h$.

Figure 3.4: Decagon’s decoder for a pair of drugs: The decoder takes the embeddings $z_i$ and $z_j$ from two drug nodes and performs a DEDICOM factorization using a global trainable matrix $R$ and side effect-specific trainable matrices $D_r$.

Decoder

The decoder predicts the type of association between nodes by performing tensor factorization operations on their embeddings. In contrast to Schlichtkrull et al. [38], who use a DistMult factorization in every case, Decagon uses a different decoder depending on the type of relation. In detail, the decoder applies a bilinear form when one of the nodes is a protein (i.e. when the relation is a PPI or a DTI) and a rank-d DEDICOM [66] factorization when both nodes are drugs:

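$$g(v_i, r, v_j) = \begin{cases} z_i^T D_r R D_r z_j, & \text{if } v_i \text{ and } v_j \text{ are drugs} \\ z_i^T M_r z_j, & \text{if either } v_i \text{ or } v_j \text{ is a protein} \end{cases} \qquad (3.4)$$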

The bilinear form uses a relation-specific matrix $M_r$, and the DEDICOM factorization uses a diagonal relation-specific matrix $D_r$ and a global matrix $R$ shared by all side effects, as shown in Figure 3.4.

The importance of using two different matrices when training the DDI edges lies in distinguishing between different side effects ($D_r$) while maintaining shared parameters among the drug interactions ($R$). The latter acts, most importantly, as a regularization strategy, i.e., it reduces overfitting.

Following the use of the decoder, the occurrence probability of an edge, $p_r^{ij}$, is calculated by applying a sigmoid function to the factorized output:

$$p_r^{ij} = p((v_i, r, v_j) \in R) = \sigma(g(v_i, r, v_j)). \qquad (3.5)$$
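The two decoder branches of equation (3.4) and the sigmoid of equation (3.5) can be sketched in a few lines of NumPy. The function names and the toy matrices are assumptions made for illustration; in Decagon these matrices are trainable parameters rather than fixed values.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def dedicom_score(z_i, z_j, d_r_diag, R):
    """DEDICOM branch of equation (3.4): z_i^T D_r R D_r z_j with diagonal D_r."""
    D_r = np.diag(d_r_diag)
    return z_i @ D_r @ R @ D_r @ z_j


def bilinear_score(z_i, z_j, M_r):
    """Bilinear branch of equation (3.4): z_i^T M_r z_j."""
    return z_i @ M_r @ z_j


def edge_probability(z_i, z_j, both_drugs, d_r_diag=None, R=None, M_r=None):
    """Equation (3.5): sigmoid of the score chosen by the node types."""
    if both_drugs:
        score = dedicom_score(z_i, z_j, d_r_diag, R)
    else:
        score = bilinear_score(z_i, z_j, M_r)
    return sigmoid(score)


# Toy embeddings and parameters (made up for this example).
z_i = np.array([1.0, 0.0])
z_j = np.array([0.0, 1.0])
R_global = np.array([[0.0, 2.0], [0.0, 0.0]])
d_side_effect = np.array([1.0, 3.0])

drug_pair_score = dedicom_score(z_i, z_j, d_side_effect, R_global)
protein_pair_prob = edge_probability(z_i, z_j, both_drugs=False, M_r=np.eye(2))
```

Because $D_r$ is diagonal, each side effect only rescales the embedding dimensions, while the interaction pattern between dimensions is carried by the shared matrix $R$.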

3.4.3 Decagon-based methods

The Decagon model has been the starting point of some other studies. For example, Jiang et al. [13] used a Decagon-based network with cell line-specific data. They used the same gene and drug network data but incorporated a cell line-specific synergy DDI database by O’Neil [9] that included Loewe scores of the drug pairs. In this fashion, they created synergy labels for drug pairs and trained different versions of Decagon with networks corresponding to each cell line to find synergistic drug pairs.

Finally, another approach based on Decagon exploited its graph characteristics. However, instead of building a trainable matrix for each side effect, it formed an embedded vector using a method for deriving distributed vector representations from structured biomedical knowledge, known as Embedding of Semantic Predications (ESP). The reduction of dimensionality of the relation embeddings makes this method faster and presumably better performing than the original graph convolutional methods.

Chapter 4

Materials and Methods

The starting point of the implementation of Decagon is the code that was made available by the authors [67]. The code can be used directly to train Decagon using a toy dataset, randomly generating artificial graphs representing protein and drug networks, and feeding them to the model for training. The authors also gathered and treated several different datasets containing the biological and pharmacological networks used to obtain their results [68]. However, using and understanding the model with the real dataset provided is not straightforward. For this reason, considerable work had to be done before training Decagon with the real dataset or considering it for deployment.

In this chapter, the details of the method’s implementation and modifications are explained. The first part explains the datasets used in the original publication. The second part explains the implementation of Decagon found in [67]. Finally, the third part explains the contributions and modifications proposed in this thesis regarding the implementation and feature aggregation.

4.1 Datasets

This section describes the data sources and the treatment applied to them by the authors of Decagon. The treated datasets can be downloaded from [68]. It also describes the additional datasets that were used in an attempt to improve Decagon’s performance.


PPI

The human PPI network used in Decagon was assembled from different sources. The main contributor was the STRING database (v2017) [69], which includes direct (physical) interactions as well as indirect (functional) interactions. The second source was the BioGRID interaction database (2015 update) [70], containing interactions curated from biomedical literature. Other relevant sources were the Human Reference Interactome HuRI (v2014) [71] and the database compiled by Menche et al. in 2015 [72].

These database interactions were cross-referenced, eliminating repeated interactions and standardising the protein identifiers in the NCBI Entrez Gene ID format.¹ As a result, the condensed database consisted of a .csv file organised in two columns, where each row corresponds to a PPI, with the ID of one protein in each column.

DTI

The source database for DTIs was STITCH (v2016) [73], which integrates various chemical substances and protein targets. This extensive dataset was reduced to only the experimentally verified interactions. The resulting dataset was a .csv file organised in two columns, one with the drug identifiers in STITCH format, and the second with their corresponding target protein in Entrez [74] format. One drug could have several target proteins, and several drugs could target a single protein. However, each interaction involving a unique drug-protein pair was described in a single line of the file.

DSE

Single drug side effects (DSE) used as features for the drug nodes were pulled from the SIDER database (v2015) [55] and the OFFSIDES database (v2012) [15]. The combined dataset was a .csv file with one side effect per row, in which each row contained three columns: one with the STITCH identifier for the drug, one with the identifier for the side effect in Medical Dictionary for Regulatory Activities (MedDRA) format, and a third one with the medical term for the side effect in English. As with the DTIs, one drug could have several side effects, but each one was listed in a separate row.

¹ The Entrez ID does not correspond to the protein directly, but to the gene that codes for the protein. Consequently, the terms gene and protein are sometimes used interchangeably in the datasets and related work.

DDI

The polypharmacy side effects were obtained from the TWOSIDES database (v2012) [15]. The .csv file for this dataset was organised in four columns: the first two containing the STITCH identifiers of the two drugs involved in the interaction, the third containing the MedDRA identifier of the side effect, and the fourth containing the medical term for the side effect in English.

Data inconsistencies

In their work [1], the authors reported the number of proteins, drugs, and interactions of each dataset after the treatment they performed. Nevertheless, for most fields their numbers differ from the ones counted in their dataset. Table 4.1 shows the authors’ numbers and the numbers counted in their dataset. Several GitHub users also reported these inconsistencies as issues on the authors’ GitHub repository [75]. As these datasets are treated in a subsequent step, most of these quantities are reduced drastically (see section 4.3.1).

Table 4.1: Inconsistencies between the number of nodes and interactions reported by the authors and the ones found in the dataset. The fields where inconsistencies were found are marked with an asterisk (*).

Quantity              Reported by the authors    Found in dataset
PPI genes *                            19,085              19,081
PPI interactions                      715,612             715,612
DTI genes *                             8,934               7,795
DTI drugs *                           519,022               1,774
DTI interactions *                     18,596              18,955
DSE drugs *                             1,332                 639
DSE side effects *                     10,097               9,702
DSE interactions *                    487,530             174,977
DDI drugs                                 645                 645
DDI side effects *                      1,318               1,317
DDI interactions *                  4,651,131           4,649,441

Additional Datasets

To evaluate different approaches to improving Decagon’s performance, additional datasets were used to complement the current ones. One approach was to use relevant protein features instead of the one-hot encoded vectors used by the authors. The specific features were chosen to be significant descriptors of the 3-dimensional structure of the protein. This choice was motivated by the fact that interactions between drugs and proteins depend on their geometrical conformation. In this sense, three secondary structure features were used: the number of α-helices, the number of β-strands and the number of turns. Another relevant value extracted was the length of the protein in amino acids. The average length for all proteins was calculated and used to transform the number of structures (namely α-helices, β-strands and turns) into a density of structures. This value was then normalised by dividing by the maximum density, creating features with values between 0 and 1. The data was obtained from the UniProt database [76].

As an alternative strategy, various scoring functions, or measures of compatibility between drugs and targets, were used. These functions give a numerical value to the strength of the interaction between chemicals and proteins. Scores for drug-target affinity were calculated by external collaborators for some of the drug-protein interactions present in the DTI dataset, using the DeepDTA [44] model and the Schrödinger software [77].

4.2 Original Implementation of Decagon

The Decagon GitHub repository contains the GCN model implementation in Python 2.7 using, among other modules, TensorFlow 1.8.0. As published, the code is only functional with a toy dataset, generated before the training of the model. In the main.py file, the toy dataset is created and organised in the data format accepted by Decagon. The toy dataset creation consists of generating random adjacency matrices given the number of proteins and drugs. No features are considered in the toy dataset, so identity matrices are used instead of feature matrices. Next, the TensorFlow placeholders are created. Placeholders are variables (such as scalars, vectors or tensors) that form part of the computational graph created internally by TensorFlow to represent the model. They are initially empty objects, updated with the correct values only when an operation that involves them is executed. Following this, the code builds the Edge Minibatch Iterator, the Decagon Model, and the Decagon Optimizer, which are the fundamental objects forming the backbone of Decagon. In the final part of the code, the model is trained, and the performance metrics are calculated. The following sections describe each part of the implementation of Decagon.

Figure 4.1: Arrangement of the Python dictionaries used to store the data. A) 2-dimensional dictionary holding a PPI structure in the top-left cell, a DTI structure in the top-right cell, a transposed DTI structure in the lower-left cell and the DDI structures in the lower-right cell. B) 1-dimensional dictionary holding protein node structures in the top cell and drug node structures in the lower cell.

4.2.1 Data structure organisation

The data enters the model as adjacency matrices, feature matrices, and degree lists. In the first stage of Decagon, these objects are sorted into Python dictionaries to facilitate the training.² These dictionaries have different structures, but they can be summarised in two types: 2-dimensional dictionaries with four cells in a 2 × 2 configuration and 1-dimensional dictionaries with two cells in a 2 × 1 configuration. The 2-dimensional dictionaries sort objects according to the edge category with which they are associated: the top-left cell (index (0,0)) holds objects relating to PPIs, the top-right cell (index (0,1)) holds objects relating to DTIs, the bottom-left cell (index (1,0)) relates to the inverse relation of DTIs (referred to as DTITs or TDIs), and the bottom-right cell (index (1,1)) holds DDI objects. The 1-dimensional dictionaries sort the objects depending on the type of nodes they refer to: the top cell (index 0) holds protein objects, and the bottom cell (index 1) holds drug objects. The dictionary types are illustrated in Figure 4.1.

² Python dictionaries are structures that map specific keys, which here are indices, to values, which can be numbers, lists, matrices or strings.

Overall, eight Python dictionaries are created to organise the data:

• adj_mats_orig: A 2-dimensional dictionary of original adjacency matrices. A list containing the PPI adjacency matrix and its transpose is stored in the PPI cell. Next, the DTI adjacency matrix and its transpose are saved in their respective cells (top right and bottom left, respectively). Finally, a list containing all the DDI adjacency matrices, concatenated with a list containing a transposed copy of each matrix, is stored in the DDI cell.
• degrees: A 1-dimensional dictionary of node degrees. In the top cell, two concatenated copies of the degree list of the protein nodes are stored (corresponding to the PPI adjacency matrix and its transpose), while the bottom cell holds two concatenated copies of the list of degree lists for the DDI matrices.
• feat: A 1-dimensional dictionary of feature matrices. The protein and drug feature matrices are stored in their respective cells.
• num_feat: A 1-dimensional dictionary with integer numbers, each holding the number of features for proteins and drugs, respectively.
• nonzero_feat: A 1-dimensional dictionary with integer numbers, each holding the number of non-zero entries of the feature matrices for proteins and drugs, respectively.
• edge_type2dim: A 2-dimensional dictionary with shapes of matrices. It holds the shapes of the matrices in adj_mats_orig (including the transposed versions), keeping the same dictionary structure.
• edge_type2decoder: A 2-dimensional dictionary with strings. Each cell has the name of the respective tensor factorisation decoder type. For PPI, DTI and DTIT, which involve protein nodes, the string is bilinear, while for DDI the string is dedicom.
• edge_types: A 2-dimensional dictionary holding in each cell the respective number of matrices stored in adj_mats_orig. That is, 2, 1, 1 and $2N_{se}$ for PPI, DTI, DTI$^T$ and DDI, respectively.

The data structures are illustrated in Figure 4.2.

It is unclear why there is a transposed copy of each matrix. The most probable explanation is that it generalises the method to directed graphs. Decagon was initially designed to solve a polypharmacy side effect prediction problem, which has undirected interactions among nodes. However, in general, it can be applied to any multimodal graph link-prediction problem where the graphs can be directed, meaning that the adjacency matrices can be non-symmetric. In that case, it makes sense to hold an additional transposed copy of each matrix.
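The eight dictionaries listed above can be sketched on a toy graph as follows. The dictionary names follow the text, but the graph sizes, the random matrices and the use of dense NumPy arrays are assumptions made to keep the example short and self-contained; they are not copied from the original code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prot, n_drug, n_se = 5, 3, 2  # toy sizes (assumed)


def random_symmetric(n):
    """Random symmetric 0/1 adjacency matrix without self-loops."""
    upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1)
    return upper + upper.T


ppi = random_symmetric(n_prot)
dti = rng.integers(0, 2, size=(n_prot, n_drug))  # proteins x drugs
ddi_list = [random_symmetric(n_drug) for _ in range(n_se)]

# 2-dimensional dictionaries are keyed by (row node type, column node type),
# with 0 = protein and 1 = drug.
adj_mats_orig = {
    (0, 0): [ppi, ppi.T],
    (0, 1): [dti],
    (1, 0): [dti.T],
    (1, 1): ddi_list + [m.T for m in ddi_list],
}
# 1-dimensional dictionaries are keyed by node type.
degrees = {
    0: [ppi.sum(axis=0)] * 2,
    1: [m.sum(axis=0) for m in ddi_list] * 2,
}
feat = {0: np.eye(n_prot), 1: np.eye(n_drug)}  # identity features, as in the toy case
num_feat = {k: m.shape[1] for k, m in feat.items()}
nonzero_feat = {k: int(np.count_nonzero(m)) for k, m in feat.items()}
edge_type2dim = {k: [m.shape for m in mats] for k, mats in adj_mats_orig.items()}
edge_type2decoder = {
    (0, 0): "bilinear",
    (0, 1): "bilinear",
    (1, 0): "bilinear",
    (1, 1): "dedicom",
}
edge_types = {k: len(mats) for k, mats in adj_mats_orig.items()}
```

In this toy case edge_types[(1, 1)] equals 4, that is, twice the number of side effects, which matches the $2N_{se}$ count given in the list.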

Figure 4.2: The Python dictionaries that hold the different data structures used for the training of Decagon. A) Adjacency matrix dictionary. B) Dimension of the adjacency matrices dictionary. C) Degrees of the nodes dictionary. D) Feature matrices dictionary. E) Number of features dictionary. F) Number of non-zero features dictionary. G) Decoder type dictionary. H) Number of adjacency matrices dictionary.

After creating the data structures, the code defines essential parameters of the model, such as the number of epochs, hidden neurons per layer, dropout value, batch size, learning rate and train-validation-test proportion. Subsequently, it initialises the placeholders required for the model.

Edge Minibatch Iterator

The Edge Minibatch Iterator (EMI) object has two main functions. The first one is to take the structured data and separate it into train, validation and test datasets, which is performed in the object initialisation. The second is to sample batches of data points and feed them to the gradient-based optimiser. This task is performed in every training iteration, as each iteration handles a different batch of data.

Dataset Separation. Decagon deals with data points in the form of edges. These are found in the adjacency matrices as entries equal to one. To sort the edges into the three datasets, the EMI creates, for each dataset, a 2-dimensional dictionary with the structure shown in Figure 4.1A. For each edge type, the EMI places a fraction of the edges into its respective cell in the dataset dictionary. The fraction of data points corresponds to the train-validation-test proportion, which, in Decagon, is 90-5-5.³ An edge is defined by its position (row, column) in the adjacency matrix. The edge positions can be efficiently manipulated by converting the matrix from compressed sparse row (CSR) format into a tuple format, making the coordinates of non-zero elements easily accessible. Moreover, as the matrices have a very high degree of sparsity, the edge list has a manageable length.

The cross-entropy loss function calculation also requires negative edge examples, that is, zero entries in the adjacency matrix. The model needs the same number of negative samples and positive samples (edges) for each dataset to be appropriately trained and validated. However, the manipulation of the negative examples brings some complications. First, there is no way of accessing all the coordinates of negative samples efficiently, as there is for positive ones. Second, the number of negative examples is enormous, and handling an array containing all of them requires considerable computing resources. Therefore, the negative edges are selected with a random sampling method. This sampling process consists of picking a random entry from the matrix and checking that it is zero and has not been sampled before. If it fulfils these conditions, it is assigned to the corresponding negative example dataset. The negative datasets are stored in 2-dimensional dictionaries in the same fashion as positive edges. This process seems to work for the test and validation datasets, as the number of samples is small compared to the matrix size. However, the negative training dataset sampling is not implemented here; those edges are sampled later in the DecagonOptimizer using a built-in TensorFlow function (see page 58).

Finally, after the train edges have been chosen, a train adjacency matrix (excluding test and validation positive edges) is formed. This matrix is renormalised using equation (2.36).
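The splitting and rejection sampling steps described above can be sketched in plain Python. This is an illustration of the idea only: the function names, the guard against requesting too many negatives and the use of edge lists instead of sparse matrices are assumptions, not details of the original implementation.

```python
import random


def split_edges(edges, val_frac=0.05, test_frac=0.05, seed=0):
    """Shuffle the positive edges and split them 90-5-5 by default."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    n_val = int(len(edges) * val_frac)
    n_test = int(len(edges) * test_frac)
    val = edges[:n_val]
    test = edges[n_val:n_val + n_test]
    train = edges[n_val + n_test:]
    return train, val, test


def sample_negative_edges(positive, n_rows, n_cols, n_samples, seed=0):
    """Rejection sampling of zero entries of an n_rows x n_cols adjacency matrix."""
    positive = set(positive)
    if n_samples > n_rows * n_cols - len(positive):
        raise ValueError("not enough zero entries to sample from")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_samples:
        candidate = (rng.randrange(n_rows), rng.randrange(n_cols))
        # Keep the candidate only if it is a zero entry not sampled before.
        if candidate in positive or candidate in negatives:
            continue
        negatives.add(candidate)
    return sorted(negatives)


# Toy example: 100 positive edges in a 101 x 101 matrix.
toy_edges = [(i, i + 1) for i in range(100)]
toy_train, toy_val, toy_test = split_edges(toy_edges)
toy_val_negatives = sample_negative_edges(toy_edges, 101, 101, len(toy_val))
```

Rejection sampling is cheap here because the matrix is sparse, so almost every random entry is a valid negative; the same argument explains why it becomes less attractive for the much larger training set.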

Batch updating. The batch updating functions of the minibatch iterator are called in the training phase. Every edge type (side effect) is trained independently in each training iteration, so the batch updating function must be called every time a new edge type is trained.

The batch updating starts by randomly selecting an edge type to train. This random sampling follows a specific discrete probability distribution: the PPI, DTI and DTIT types each have a 25% probability of being chosen. The remaining 25% is uniformly distributed among the polypharmacy side effects. Once the edge type is selected, enough positive edges to fill a batch are picked from the training dataset. The number of edges is defined in the variable batch_size. Before the edges are picked, it is verified that the dataset has enough training edges left to fill a batch; if there are not enough, another edge type is selected. The process ends by assigning the batch edges and the selected edge type to placeholder variables.

³ The train-validation-test proportion has a default value of 90-5-5 in the code implementation.
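The edge type distribution described above can be written down explicitly. The following sketch uses hypothetical labels for the edge types and the standard library random module; it only illustrates the probabilities stated in the text.

```python
import random


def edge_type_probabilities(n_side_effects):
    """25% each for PPI, DTI and TDI; the last 25% split evenly over side effects."""
    probs = {"PPI": 0.25, "DTI": 0.25, "TDI": 0.25}
    for k in range(n_side_effects):
        probs[("DDI", k)] = 0.25 / n_side_effects
    return probs


def sample_edge_type(probs, rng):
    """Draw one edge type according to the given probabilities."""
    types = list(probs)
    weights = [probs[t] for t in types]
    return rng.choices(types, weights=weights, k=1)[0]


toy_probs = edge_type_probabilities(4)
toy_choice = sample_edge_type(toy_probs, random.Random(0))
```

With many side effects, each individual side effect is selected rarely, so a given side effect may go many iterations without being trained.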

Decagon Model

The GCN model explained in section 3.4.2 is implemented in an object called Decagon Model. The object creates the encoder and the decoder of Decagon in different structures.

Encoder. The encoder is implemented using a 1-dimensional Python dictionary per layer. The dictionary has a similar structure to the base dictionary in Figure 4.1B. In each cell, an object of the predefined class Layer is held. The object in the top cell has the correct input dimensions to propagate the protein node information, while the lower cell holds the drug node equivalent. This Layer object performs the operation described in equation (3.3), but for the sake of efficiency, it replaces the sums with matrix multiplications, similarly to equation (2.37). The object also creates the trainable matrices W with Glorot & Bengio initialisation (section 2.2.2) and stores them as TensorFlow variables.

As Decagon uses two convolutional layers, only two dictionaries are created. The first layer, or input layer, is designed to receive sparse inputs (node features). This layer uses 64 neurons to transform the inputs and return the first set of embeddings. Then, the output is non-linearly activated using a ReLU function. The second layer differs from the first one because it receives dense but lower-dimensional inputs. The embeddings created by the first layer are fed in and transformed once again, this time by 32 neurons, creating the final embeddings. Although a non-linear activation is reported for the model, the second layer does not apply one to its outputs.
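The layer sizes and activations described above can be sketched with NumPy. The sketch deliberately leaves out the neighbourhood propagation and only shows the shapes: a Glorot-style uniform initialisation (assumed here to be the uniform variant with limit sqrt(6 / (fan_in + fan_out))), a first layer with 64 units followed by ReLU, and a second layer with 32 units and no activation. The number of input features is made up.

```python
import numpy as np


def glorot_uniform(fan_in, fan_out, rng):
    """Glorot & Bengio uniform initialisation (assumed variant)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


rng = np.random.default_rng(0)
n_nodes = n_features = 10  # toy size; identity features as in the toy dataset

W1 = glorot_uniform(n_features, 64, rng)  # first layer: 64 units
W2 = glorot_uniform(64, 32, rng)          # second layer: 32 units

x = np.eye(n_nodes)                 # one-hot node features
h1 = np.maximum(x @ W1, 0.0)        # ReLU after the first layer
z = h1 @ W2                         # no activation after the second layer
```

The final array z has one 32-dimensional embedding per node, which is the input expected by the decoder.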

Decoder. The decoder is implemented in a 2-dimensional dictionary that follows the structure shown in Figure 4.1A. Each cell of the dictionary holds a Layer object of the decoder class that corresponds to the edge type (using the dictionary in Figure 4.2G). This dictionary handles the creation of the trainable matrices $R$, $D_r$ and $M_r$, initialised with Glorot & Bengio initialisation, and their assignment to TensorFlow variables.

Decagon Optimizer

The Decagon Optimizer object creates the computational graph that will optimise the GCN. The object starts by sampling the negative training edges that are needed to compute the cost function. It also implements the functions required to compute the predictions of equation (3.4) for the batch and for the validation and test datasets.

Negative training edge sampler. The Decagon optimiser uses a TensorFlow built-in function called fixed_unigram_candidate_sampler [78] to sample the negative edges. The function creates a negative edge using one of the nodes of a positive edge and sampling the other from the node degree distribution. The strategy of looking for a negative interaction that involves a node of a positive interaction reduces the number of nodes involved in a batch. This reduction implies that fewer embedding vectors have to be kept in memory for a given training iteration, making the training more efficient.
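The idea of keeping one node of a positive edge and drawing the other from the degree distribution can be illustrated with a NumPy stand-in. This is not the TensorFlow sampler and does not reproduce its options; the function name, the rejection of known edges and the retry limit are assumptions added for this sketch.

```python
import numpy as np


def sample_negative_for(positive_edge, degrees, known_edges, rng, max_tries=100):
    """Keep the first node of a positive edge and draw the second node
    with probability proportional to its degree.

    Known edges are rejected (an added safeguard, assumed here). Returns
    None if no negative edge is found within max_tries draws.
    """
    i, _ = positive_edge
    probabilities = degrees / degrees.sum()
    for _ in range(max_tries):
        j = int(rng.choice(len(degrees), p=probabilities))
        if (i, j) not in known_edges:
            return (i, j)
    return None


# Toy example: four nodes with degrees 0, 3, 1 and 2.
toy_degrees = np.array([0.0, 3.0, 1.0, 2.0])
toy_known = {(0, 1)}
toy_negative = sample_negative_for((0, 1), toy_degrees, toy_known,
                                   np.random.default_rng(0))
```

Because every negative edge reuses a node that already appears in the batch, at most one new node per negative sample needs its embedding computed.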

Iterative weight updating. Once the negative training examples have been sampled, a calculation of the loss hypersurface for a given set of parameters is possible. It is then necessary to calculate each batch edge prediction value using the generated embeddings and the decoder tensor factorisations. The Decagon Optimizer object implements equation (3.4) to calculate the score of all the pairs of given positive and negative edges. With these scores, the object uses the cross-entropy (equation (2.11)) to calculate the loss value.

Next, a TensorFlow Adam optimiser (section 2.2.2) object is created using the default parameter values: $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-7}$. This object has all the functions needed to perform the optimisation already implemented and refined. In particular, it has a function minimize that is called in every training iteration. This function accesses all the trainable matrices of the encoder and the decoder in the background and trains them using backpropagation (2.13).

Training

The training of the model is implemented explicitly, looping over the epochs and the iterations of an epoch. At the beginning of each epoch, the training data is shuffled to avoid any bias caused by the order of the data points. Next, the EMI functions for sampling an edge type and picking the batch edges are called, followed by the optimiser minimisation function. The epoch finishes when all the batches for the different edge types have been used.

To validate the results, Decagon prints some performance metrics every 150 training iterations. These metrics are the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC) and the average precision at k = 50 (AP@50) on the validation dataset, together with the current train loss and the time elapsed during the iteration. For these calculations, the placeholders are updated with the correct data variables. The calculation of the metrics makes use of pre-implemented functions of the Python module scikit-learn. However, these measures are not calculated in a standardised way that would show the continuous improvement of the performance. Instead, they show how the method performs on a random edge type, as every iteration samples a random edge type to train. This sampling makes the printout unsuitable for evaluating the method, as some edge types may never have had their performance evaluated.

After the training, the metrics for the test dataset are calculated for all the edge types. However, there is no implementation of a method to export and manipulate these measures, as the final output of Decagon is the test performance for every edge type, using the previously mentioned metrics, printed on the prompt screen.

The original work specifies early stopping in training with a window of 2, which means that the training is stopped if there is no increase in validation performance after two epochs. However, there is no function implemented in the code that performs early stopping. In this sense, the validation dataset is not used for anything more than monitoring the performance during the training.
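The early stopping rule with a window of 2 mentioned above can be sketched as follows. Since the original code contains no such function, this is only one possible reading of the rule: training stops once the validation score has failed to improve for two consecutive epochs. The function name and the list of precomputed validation scores are assumptions for illustration.

```python
def early_stopping_epoch(validation_scores, window=2):
    """Return (stopping epoch, best score) for a sequence of per-epoch
    validation scores, stopping after `window` epochs without improvement."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(validation_scores):
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= window:
            return epoch, best
    # The window was never exhausted: training ran for all epochs.
    return len(validation_scores) - 1, best


# Toy validation scores (made up): no improvement in epochs 2 and 3.
stop_epoch, best_score = early_stopping_epoch([0.70, 0.75, 0.74, 0.73, 0.80])
```

In the toy sequence the rule stops at epoch 3 and never sees the later score of 0.80, which illustrates the trade-off of a short window.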

4.3 Contributions and improvements to Decagon

The implementation of Decagon, although functional, has several issues. Apart from the data inconsistencies already mentioned, the method has not been implemented to use real datasets, only randomly generated toy examples. Specifically, there is no way of feeding the .csv files containing the network interactions into the code. In addition, the code lacks documentation and comments. These problems have made it difficult for the academic community to reproduce the authors’ results [75]. Furthermore, Decagon is implemented in Python 2.7, which is no longer supported and has many outdated modules that cause incompatibility issues.

However, a more critical defect of Decagon resides in the separation of the training, validation and test datasets. The sampling method lacks a rigorous checking procedure that could guarantee that the chosen interactions are not already present in the other datasets. Moreover, the inclusion of two adjacency matrices per edge type aggravates this problem, as the random sampling is done separately for two matrices representing the same edges.

4.3.1 Data Treatment and Preparation

Remove Outliers: standardise datasets

As a requisite for the training, the four datasets needed to be compatible to generate a complete multimodal graph. However, not all the nodes were shared among all datasets, and there were small unconnected clusters. Therefore, it was essential to identify disconnected or incomplete nodes and remove them from the data, to ensure the correct propagation of information through the GCN. This filtering process was done in a phase called "Remove Outliers", which has the following steps:

1. Find the intersection between the drugs in DDI and the drugs in DSE: This meant discarding the nodes in the drug network that did not have features, and discarding the drug features not present in the drug network. The step was motivated by the idea that mixing complete and incomplete nodes would create a bias towards nodes with features.

2. Intersect the previous set with the drugs of DTI, which is divided into two operations:
(a) Firstly, discarding the drugs in the network that did not have a target protein. This step was necessary because a drug with no known target may have unpredictable effects in the body, even if it has known side effects. Implementation-wise, a node of this type would need an extra layer to propagate its features to the network, which, in a GCN of two layers, means being isolated.
(b) Secondly, discarding drugs with a known target but without side effect information (DSE). The number of drugs with side effects was the main bottleneck of the whole filtering, as it was the smallest dataset with only 639 drugs.

Table 4.2: Node and edge numbers before and after performing the Reduced Dataset filtering with and without protein features.

Quantity                Before filtering   After filtering   After filtering with proteins
PPI genes                         19,081            19,081                          17,929
PPI interactions                 715,612           319,409                         311,225
DTI genes                          7,795             3,640                           3,587
DTI drugs                          1,774               283                             283
DTI interactions                  18,955            18,595                          18,291
DSE drugs                            639               639                             639
DSE side effects                   9,702             9,702                           9,702
DSE interactions                 174,977           174,977                         174,977
DDI drugs                            645               639                             639
DDI side effects                   1,317             1,317                           1,317
DDI interactions               4,649,441         4,615,522                       4,615,522
Protein with features             18,991            18,991                          17,929

3. Partially intersect the previously filtered DTI database with the PPI database, which was done in two steps:
(a) Firstly, discard DTI interactions with proteins not present in the PPI network, as these proteins are not integrated into any metabolic pathway in the data and hence have no effect in creating adverse side effects.
(b) Secondly, select the PPI interactions that have at least one protein in the DTI database. This process removed proteins located more than two steps away from any drug. As the GCN has only two layers, these nodes would never propagate to the drugs, or receive any influence from them. This step also removed unlinked clusters and outliers in the PPI network.

4. If protein features were included, the PPI dataset was intersected with the protein feature dataset in a similar way to the drug network.

As a consequence of this filtering, the dataset was reduced both in nodes and in edges. Table 4.2 shows the reduction in the different elements of the dataset.
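The steps of the "Remove Outliers" phase, taken exactly as listed, amount to a chain of set intersections. The sketch below illustrates them on toy edge lists; it is not the thesis script, all identifiers are invented, and the optional protein feature step is left out.

```python
# Toy edge lists; all identifiers are invented for illustration.
ddi = [("d1", "d2"), ("d2", "d3"), ("d3", "d4")]                 # drug-drug pairs
dse = {"d1": ["se1"], "d2": ["se2"], "d3": ["se1"]}              # single-drug side effects
dti = [("d1", "p1"), ("d2", "p2"), ("d2", "p9"), ("d4", "p3")]   # drug-target pairs
ppi = [("p1", "p2"), ("p2", "p4"), ("p4", "p5"), ("p6", "p7")]   # protein-protein pairs

# Step 1: drugs present in both DDI and DSE.
drugs = {d for pair in ddi for d in pair} & set(dse)

# Step 2: of those, keep the drugs that also have a known target.
drugs &= {d for d, _ in dti}
ddi = [(a, b) for a, b in ddi if a in drugs and b in drugs]
dse = {d: effects for d, effects in dse.items() if d in drugs}
dti = [(d, p) for d, p in dti if d in drugs]

# Step 3(a): drop DTI rows whose protein is absent from the PPI network.
ppi_proteins = {p for pair in ppi for p in pair}
dti = [(d, p) for d, p in dti if p in ppi_proteins]

# Step 3(b): keep PPI rows that touch at least one drug target.
targets = {p for _, p in dti}
ppi = [(a, b) for a, b in ppi if a in targets or b in targets]
```

In this toy case, the protein pair ("p4", "p5") is dropped because neither protein is a drug target, which mirrors the removal of proteins more than two steps away from any drug.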

Reduced Dataset: A tool for generating consistent graphs

After performing the filtering of the "Remove Outliers" phase, the data is consistent enough to train the model. However, the data still holds many data points, which made the first simulation attempts take too long to finish. Furthermore, the authors mention that some polypharmacy side effects are relatively infrequent. A more in-depth inspection of the dataset confirms that some side effects appear only once in the whole dataset, making their training unfeasible. For this reason, the authors decided to use only the side effects present in at least 500 drug combinations, which turn out to be 964 side effects [1]. The reason for this choice was to be able to fill a minibatch of around 500 edges per side effect.

These considerations require a new stage of filtering called "Reduced Data Structures". In this phase, a consistent partition of the multimodal graph is selected based on the desired number of polypharmacy side effects. This new reduction ensures that the resulting graph has full connectivity and feature vectors for all nodes. The "Reduced Data Structures" filtering is necessary to select the 964 side effects chosen by the authors and reproduce their results. It is also useful to select smaller networks that can be trained rapidly to evaluate the functioning of the implementation. The "Reduced Data Structures" filtering phase has the following steps:

1. Sort the polypharmacy side effects in the DDI dataset by frequency in descending order. Next, choose the N most frequent side effects, where N is the desired number of side effects. By default, N was set to 964. This step can be substituted by choosing the desired side effects manually by name.

2. Select the entries in the DDI dataset that have the chosen side effects and discard the rest.

3. Select the DTI entries that involve a drug in the new DDI dataset and discard the rest.

4. Select the PPI entries that have at least one protein in the DTI dataset and discard the rest.

5. If protein features are included, select only the entries with proteins having features.
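Steps 1 to 4 of the "Reduced Data Structures" phase can be sketched with the standard library alone. The records below are invented for illustration, and N is set to 2 instead of the default of 964.

```python
from collections import Counter

# Toy DDI records: (drug_a, drug_b, side_effect); identifiers are invented.
ddi = [
    ("d1", "d2", "nausea"), ("d1", "d3", "nausea"), ("d2", "d3", "nausea"),
    ("d1", "d2", "rash"), ("d2", "d4", "rash"),
    ("d3", "d4", "vertigo"),
]
dti = [("d1", "p1"), ("d4", "p2"), ("d5", "p3")]
ppi = [("p1", "p4"), ("p2", "p5"), ("p3", "p6"), ("p7", "p8")]

N = 2  # the thesis uses 964 by default

# Step 1: the N most frequent polypharmacy side effects.
counts = Counter(se for _, _, se in ddi)
chosen = {se for se, _ in counts.most_common(N)}

# Step 2: DDI entries with a chosen side effect.
ddi_small = [record for record in ddi if record[2] in chosen]

# Step 3: DTI entries whose drug is still present in the DDI dataset.
drugs = {d for a, b, _ in ddi_small for d in (a, b)}
dti_small = [(d, p) for d, p in dti if d in drugs]

# Step 4: PPI entries with at least one protein in the DTI dataset.
targets = {p for _, p in dti_small}
ppi_small = [(a, b) for a, b in ppi if a in targets or b in targets]
```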

Data Structure Generator

The final data processing step transforms the datasets into structures that obey the input data format requirements of Decagon. Decagon requires one adjacency matrix for the PPI subgraph, one for the DTI subgraph and one list of adjacency matrices, one for each side effect subgraph of the DDI network. Additionally, the PPI graph and the DDI subgraphs must each have a list of node degrees, and feature matrices are needed for drugs and proteins.

The adjacency matrices are constructed from a matrix of zeros of the correct dimensions, replacing the entries corresponding to interactions by 1s. The single-drug side effect matrix was built using the same strategy. Note that the PPI and DDI matrices are square and symmetric, with sizes Np × Np and Nd × Nd respectively, where Np is the number of proteins and Nd is the number of drugs. In contrast, the DTI matrix is rectangular, of size Np × Nd, and the drug feature matrix has size Nd × Ndse, where Ndse is the number of single-drug side effects (DSE). Next, the degree lists are easily calculated by summing either over the rows or over the columns of the symmetric adjacency matrices. When protein features are incorporated, their matrix is built by simply stacking the features in a matrix of dimensions Np × Npf, where Npf is the number of protein features. All the matrices were then stored in CSR format.

As the process has to loop over all the entries of the datasets, building the adjacency matrices of big datasets (964 side effects) can take a few minutes. A parallel implementation was used to reduce computing time and build the different side effect adjacency matrices in the list simultaneously, using the module Joblib from Python 3.7. Fortunately, the datasets have numbers of interactions much lower than the total possible combinations of node pairs (n(n − 1)/2), making the matrices considerably sparse and letting the process finish in a suitable time.

Finally, four indexing dictionaries were created: the first two mapped protein IDs and drug IDs to their indices in the adjacency matrices, respectively; a third one mapped the single-drug side effect IDs to their indices in the drug feature matrix; and the last one mapped polypharmacy side effect IDs to the index of their corresponding adjacency matrix in the list.

To sum up, the three phases of data treatment were combined in a single Python 3.7 script which imported the four .csv files with the networks plus a protein feature file, removed the outlier nodes, reduced the size of the network, created the relevant data structures and exported them in a Python dictionary. The exported structures were:

• four name-to-index dictionaries,
• one list of DDI adjacency matrices,
• one list of DDI degree lists,
• one DTI adjacency matrix,
• one PPI adjacency matrix,
• one PPI degree list,
• one drug feature matrix,
• one protein feature matrix.

The data dictionary was exported using the pickle package from Python into a pickle-readable file. The reading and processing of the data were done with Python 3.7 and the modules Pandas, NumPy and SciPy, and the parallelisation was done using the Joblib module.
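The construction of an index dictionary, an adjacency matrix and a degree list can be sketched as follows. This is a simplified illustration with invented identifiers that uses dense NumPy arrays; the thesis stores the matrices in CSR format, a conversion that is omitted here.

```python
import numpy as np

# Toy PPI edge list; protein IDs are invented.
ppi_edges = [("p1", "p2"), ("p2", "p3"), ("p1", "p4")]

# Name-to-index dictionary, sorted so that the mapping is reproducible.
proteins = sorted({p for edge in ppi_edges for p in edge})
prot2idx = {name: i for i, name in enumerate(proteins)}

# Start from zeros and write a 1 for every interaction, in both directions
# because the graph is undirected.
n_p = len(proteins)
ppi_adj = np.zeros((n_p, n_p), dtype=np.int8)
for a, b in ppi_edges:
    i, j = prot2idx[a], prot2idx[b]
    ppi_adj[i, j] = 1
    ppi_adj[j, i] = 1

# Degree list: row sums of the symmetric adjacency matrix.
ppi_degrees = ppi_adj.sum(axis=1)

# Rectangular DTI matrix of size Np x Nd.
dti_edges = [("d1", "p1"), ("d2", "p3")]
drugs = sorted({d for d, _ in dti_edges})
drug2idx = {name: i for i, name in enumerate(drugs)}
dti_adj = np.zeros((n_p, len(drugs)), dtype=np.int8)
for d, p in dti_edges:
    dti_adj[prot2idx[p], drug2idx[d]] = 1
```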

4.3.2 Implementation of Algorithmic Complexity Features

In addition to the single-drug side effects and the structural protein features, one of the main objectives of this project was to evaluate the use of complexity descriptors of the networks as features for the GCN model. As explained in section 2.3, algorithmic complexity can be estimated for binary 2-dimensional arrays of arbitrary size using the BDM, enabling the complexity estimation of adjacency matrices. This numerical method is implemented in a Python 3 module called PyBDM.

Using the PyBDM package

The implementation of the method is based on the Python class BDM, which provides a proper environment to perform the required calculations. This class characterises a dataset by its number of dimensions and number of symbols, as different objects are handled differently. Once the class has been instantiated and initialised with the characteristics of the dataset (here, an adjacency matrix), it is enough to call the method bdm on the dataset to get its complexity value. The calculation of the complexity of an object is divided into four stages:

1. Partition (decomposition): The object is decomposed into smaller pieces for which the algorithmic complexity can be estimated using the CTM. The different ways in which the object can be partitioned are discussed in the next section.

2. Lookup: Although the complexity of each small piece of the decomposed object could be estimated, it is not practical to calculate it for every partition, as the procedure is computationally expensive. Furthermore, the space of possible small pieces is small enough to keep track of the complexity of each one. Therefore, the pre-calculated CTM values for each possible piece are stored in the CTM reference dataset, and in this stage, the values for every partition are looked up efficiently.

3. Count: The unique components of the dataset are counted and arranged together with their CTM values efficiently.

4. Aggregation: The complexity value is calculated using equation (2.46).
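The four stages can be illustrated with a toy version of the calculation. The sketch assumes that equation (2.46) is the usual BDM aggregation, i.e. the sum over the unique pieces of their CTM value plus the base-2 logarithm of their multiplicity. The function `toy_ctm` is a stand-in invented for this example and does not return real CTM values; PyBDM looks them up in its reference dataset instead.

```python
import math
from collections import Counter

BLOCK = 4  # CTM values are available for blocks of up to 4x4


def toy_ctm(block):
    """Stand-in for the CTM lookup table (NOT real CTM values)."""
    return 10.0 + sum(sum(row) for row in block)


def partition_ignore(matrix, size=BLOCK):
    """Stage 1: non-overlapping size x size blocks; leftovers are dropped."""
    n_rows = len(matrix) // size * size
    n_cols = len(matrix[0]) // size * size
    for r in range(0, n_rows, size):
        for c in range(0, n_cols, size):
            yield tuple(tuple(matrix[r + i][c:c + size]) for i in range(size))


def toy_bdm(matrix):
    """Stages 2 to 4: look up, count and aggregate the unique blocks."""
    counts = Counter(partition_ignore(matrix))
    return sum(toy_ctm(block) + math.log2(n) for block, n in counts.items())


# An 8x8 matrix made of four identical all-zero 4x4 blocks.
zeros8 = [[0] * 8 for _ in range(8)]
value = toy_bdm(zeros8)
```

The four identical blocks contribute a single lookup value plus log2(4) = 2, which shows how repeated pieces add little to the estimate.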

Edge perturbation analysis

The previous method estimates the complexity of a whole network. A different approach is required if the intention is to calculate the complexity contribution of a given element of the network. Thus, a perturbation perspective is taken to analyse these contributions and return a feature vector [79].

A perturbation experiment studies changes of complexity under changes applied to the underlying dataset [80]. The analysis of these changes gives insights into which parts of the system have causal significance and which contribute noise to the system. The main idea behind perturbation analysis is to measure the complexity of a system before and after an element has been changed. When using binary datasets such as adjacency matrices, the change is merely a flip: switching a one for a zero or vice versa. The difference between the two measures indicates how relevant the altered component is for the network. In other words, if the difference between the measurements is positive, the removed element contributes to the randomness of the system, while if it is negative, the element gives order to the system.

PyBDM has a perturbation functionality that performs the previously described procedure: for a given binary matrix, it calculates the difference between the complexity before and after switching a value. In network terms, this means that for a given pair of nodes, if there is a connection between them, the function removes it; otherwise, it creates a connection. This procedure yields a feature vector of length N × N, where N is the number of nodes, which quantifies the complexity contribution of each edge in the network.

Nevertheless, Decagon admits only node features as inputs, not edge features. This issue required creating a "node equivalence" functionality that averages the edge contributions, i.e. the contributions from all the possible additions and removals of edges for each node. Moreover, functionalities for calculating node-equivalent complexity contributions for only removing existing edges and for only adding new edges were also implemented. These complements were motivated by the idea that nodes with different degrees could yield similar complexities when edge addition and edge removal are averaged together. Finally, the matrices representing bipartite networks had to be considered. In such a case, as the matrix represents different sets of nodes in the rows and columns, two different node feature vectors were constructed, representing drug and protein nodes separately.
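The edge perturbation and the "node equivalence" averaging can be sketched for a symmetric matrix as below. This is an illustration under stated assumptions, not the PyBDM or thesis code: the function names are hypothetical, the sign convention (complexity before minus complexity after) is chosen to match the interpretation given in the text, and a simple count of ones stands in for the BDM estimate so that the example stays self-contained.

```python
import numpy as np


def edge_perturbation(adj, complexity):
    """N x N array of complexity(before) - complexity(after flipping (i, j))."""
    adj = np.asarray(adj)
    n = adj.shape[0]
    before = complexity(adj)
    delta = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            flipped = adj.copy()
            flipped[i, j] = 1 - flipped[i, j]
            flipped[j, i] = flipped[i, j]  # keep the graph undirected
            delta[i, j] = before - complexity(flipped)
    return delta


def node_equivalent(adj, delta, mode="all"):
    """Average the edge contributions of each node.

    mode: "all" (every flip), "remove" (existing edges only) or
    "add" (missing edges only).
    """
    adj = np.asarray(adj)
    mask = {"all": np.ones_like(adj, dtype=bool),
            "remove": adj == 1,
            "add": adj == 0}[mode]
    totals = np.where(mask, delta, 0.0).sum(axis=1)
    counts = mask.sum(axis=1)
    return np.divide(totals, counts, out=np.zeros(len(adj)), where=counts > 0)


# Path graph 0 - 1 - 2 with a stand-in complexity (number of ones).
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
delta = edge_perturbation(path, complexity=lambda a: float(a.sum()))
removal = node_equivalent(path, delta, mode="remove")
addition = node_equivalent(path, delta, mode="add")
```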

Implementation of node perturbation analysis. Based on the edge perturbation analysis, a novel perturbation experiment implementation was created for this thesis. Node perturbation experiments study changes of the complexity of a network when existing nodes are removed. This approach gives a more direct evaluation of the complexity contribution of a node.

The method calculates the complexity of a matrix before and after removing a node. The node removal implies deleting a row and a column of an adjacency matrix, or only a row or a column in the case of bipartite networks. For matrices containing information on large networks, this can be a time-consuming procedure. Consequently, a parallelisation option was implemented, where the user can specify the number of workers that the algorithm will use. With the parallel mode, the method calculates the contribution of each node simultaneously in different workers. In the end, the contributions are concatenated in the output vector. The parallelisation is performed using the Joblib module of Python.
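A node perturbation experiment of this kind can be sketched as follows. The function names are hypothetical, the sign convention (before minus after) is an assumption, a count of ones stands in for the BDM estimate, and the standard-library `ThreadPoolExecutor` is used here in place of Joblib only to keep the example free of external dependencies.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def remove_node(adj, k, bipartite_axis=None):
    """Delete node k: row and column, or one axis only for bipartite matrices."""
    if bipartite_axis is None:
        return np.delete(np.delete(adj, k, axis=0), k, axis=1)
    return np.delete(adj, k, axis=bipartite_axis)


def node_perturbation(adj, complexity, workers=1, bipartite_axis=None):
    """complexity(before) - complexity(after removing each node)."""
    adj = np.asarray(adj)
    before = complexity(adj)
    n = adj.shape[0 if bipartite_axis is None else bipartite_axis]

    def contribution(k):
        return before - complexity(remove_node(adj, k, bipartite_axis))

    # Every node is independent, so the loop can be spread over workers;
    # pool.map returns the contributions in node order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return np.array(list(pool.map(contribution, range(n))))


# Star graph: node 0 is connected to nodes 1, 2 and 3.
star = np.array([[0, 1, 1, 1],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0]])
contrib = node_perturbation(star, complexity=lambda a: float(a.sum()), workers=2)
```

With the stand-in measure, removing the hub changes the value far more than removing a leaf, which is the kind of per-node signal the experiment is meant to capture.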

Choosing a partition method. Large binary matrices can be decomposed into smaller partitions to estimate their complexity. Up to now, the CTM can give estimates of the complexity of binary 2-dimensional arrays up to a size of 4x4. However, in the general case, the matrix is not necessarily divisible into full partitions of 4x4. It is then necessary to decide how to proceed with the remaining parts of the dataset that are excluded from the initial decomposition. In PyBDM three approaches are considered:

1. Ignore: The remaining parts of the dataset are ignored, and the BDM estimate is done only with the 4x4 partitions.

2. Recursive: The remaining parts are decomposed into smaller pieces (down to some minimum size) for which the complexity is known. For 2-dimensional arrays, the minimum can be down to 2x2 arrays.

3. Correlated: Use a sliding window in a convolutional fashion, using parts that overlap to a certain extent.

These multiple methods of decomposition imply that the estimation of BDM for an object may not be unique. Depending on the method, some parts of the data can be ignored or counted twice. Although the authors state that the three methods converge to the same value for big datasets [40], when performing perturbation experiments, the choice of the partition algorithm can yield very different results. As the complexity of individual sections of the network is evaluated, ignoring or overcounting the contributions of a section of the matrix can lead to erroneous results.

For this reason, exploratory perturbation experiments for nodes were performed using the three different methods to choose the best partition algorithm. A random symmetric binary adjacency matrix of size 10x10 was generated, and its node contributions were calculated using edge perturbation and node perturbation. The calculations were performed 100 times, and the results are shown in Figure 4.3.

The results of these experiments lead to several conclusions. First, the complexity results for a small matrix can vary significantly among the three methods, being especially different for the correlated algorithm.
For this method, the error bars show a huge variance between the different repetitions of the experiment, while the variance for the other methods is much lower. The results from the "ignore" and the "recursive" methods have similar values for all the nodes and very low variance among experiments. However, the "ignore" method has a value of zero complexity and zero variance among experiments in the last two nodes. This result comes from ignoring the last two nodes, as the method divides the matrix into four sections of 4x4, ignoring the remainder.

[Figure 4.3: two panels plotting the complexity contribution against the node index for the "ignore", "correlated" and "recursive" partitions. Top panel: "Node Complexity". Bottom panel: "Average edge complexity per node".]

Figure 4.3: Comparison between the different types of dataset partitions of the PyBDM package. The top plot shows the average node complexity per node with the standard deviation and the bottom plot, the average edge complexity per node. Both plots use 10x10 binary symmetric adjacency matrices and 100 experiments to calculate the mean.

From the previous results, it is clear not only that the "ignore" and the "correlated" partition algorithms must be discarded for perturbation analysis, the "recursive" method being the best choice, but also that the BDM complexity estimation may yield results that depend on the descriptor, in this case, the adjacency matrix. In an adjacency matrix, a permutation of the nodes, which by no means alters the network structure, may place important information in the ignored or in the correlated zone, and miscalculations can occur. It is also important to remember that in the "recursive" method, the sections that are ignored by other methods are decomposed into smaller sections down to some minimum size, but any part remaining after that is still ignored.
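The zero values of the "ignore" method in the last two nodes follow directly from the arithmetic of the partition. The helper functions below are hypothetical and only reproduce that arithmetic for a square matrix; they are not part of PyBDM.

```python
def ignored_nodes(n_nodes, block=4):
    """Indices of the rows/columns left out by the 'ignore' partition."""
    covered = (n_nodes // block) * block
    return list(range(covered, n_nodes))


def n_full_blocks(n_nodes, block=4):
    """Number of full block x block pieces in an n x n matrix."""
    return (n_nodes // block) ** 2


# The 10x10 matrix of the exploratory experiment.
blocks_10 = n_full_blocks(10)   # four 4x4 sections cover the first 8 nodes
ignored_10 = ignored_nodes(10)  # nodes 8 and 9 never enter the estimate
```

A matrix whose size is a multiple of four, such as 12x12, leaves no node out, so the effect depends on the matrix size and on the ordering of the nodes.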

Binary BDM

Decagon was designed to receive sparse binary adjacency matrices and sparse feature matrices for the nodes. Although it can handle dense vectors, it becomes highly inefficient, as its computational demands increase substantially. In this sense, BDM features present a problem, as they are dense and contain non-integer numbers, which can take much space in memory. For this reason, it was decided to sacrifice precision for memory by making the BDM feature matrices sparse and filled with integer values. The method, called "Binary BDM", consisted in grouping the BDM values in intervals using an upper and a lower threshold. In this way, the BDM values below the lower threshold became -1, the values above the upper threshold became 1, and the values between the thresholds became 0. This change became necessary when using GPUs, as their memory is restricted, and in some cases, the total size of the feature vectors exceeded the memory of the GPU. The thresholds were chosen to be two standard deviations above and below the mean. Several other values for the thresholds were tried first, but they did not sparsify the vectors enough to fit them in the GPU memory.
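The thresholding can be sketched in a few lines. The function name `binarise_bdm` is hypothetical, and the use of the population standard deviation (the NumPy default) is an assumption, as the text does not specify which estimator was used.

```python
import numpy as np


def binarise_bdm(values, n_std=2.0):
    """Map values to -1, 0 or 1 using mean +/- n_std standard deviations."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    lower, upper = mean - n_std * std, mean + n_std * std
    out = np.zeros(values.shape, dtype=np.int8)
    out[values < lower] = -1
    out[values > upper] = 1
    return out


# Eighteen unremarkable values and two extreme ones (made-up numbers).
features = [0.0] * 18 + [50.0, -50.0]
binary = binarise_bdm(features)
```

Only the two extreme values survive as non-zero entries, so the result can be stored efficiently in a sparse integer matrix.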

BDM integration with Decagon

Node and edge perturbation experiments were performed over the different networks to calculate their node features. For each matrix, three features were calculated per node: node complexity, mean edge addition complexity and mean edge removal complexity. In this sense, the PPI network contributed three features to the protein nodes. For the DDI, as there are as many adjacency matrices as side effects, the contribution was three features per side effect for each drug node. In the case of DTI, the three features were extracted for both protein and drug nodes. In summary, for protein nodes, a 6-dimensional BDM vector was obtained by combining the three features from PPI and the three from DTI, while the drug nodes received a 3(Nse + 1)-dimensional vector, with 3Nse features coming from DDI and three from DTI, where Nse is the number of polypharmacy side effects.

These BDM features were then concatenated with the drug and protein features obtained from the datasets. If no other protein or drug features were used, the BDM features replaced the identity matrices used to represent an absence of features. These, along with the adjacency matrices, were converted into a sparse format and consolidated in the Python dictionaries explained in Figure 4.2.

The transposed matrices added in the original version of the code were removed from this data structure. More specifically, the additional PPI^T and the list of transposed DDI matrices were not included in the dictionary, yet the TDI matrix, i.e. the transpose of the DTI matrix, was kept to preserve the 2×2 dictionary structure. A further explanation for the removal is found in section 4.3.4. As with the saving of the data structures, the final output was exported using the pickle module from Python.
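As a quick check of the dimensions given above, the sizes of the BDM vectors for the default of 964 side effects, and the concatenation with the dataset features, can be computed as follows. The toy sizes in the second part are invented for illustration.

```python
import numpy as np

n_se = 964                      # polypharmacy side effects used by default
protein_bdm_dim = 3 + 3         # three features from PPI, three from DTI
drug_bdm_dim = 3 * (n_se + 1)   # three per side effect plus three from DTI

# Concatenation with the dataset features on a toy example.
n_drugs, n_dse, toy_n_se = 4, 5, 2
dse_features = np.ones((n_drugs, n_dse))                  # single-drug side effects
bdm_features = np.zeros((n_drugs, 3 * (toy_n_se + 1)))    # BDM features
drug_features = np.hstack([dse_features, bdm_features])   # one row per drug
```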

4.3.3 Containers and GPU Configuration

One substantial inconvenience of using Decagon is that it is implemented in Python 2.7, an old and already unsupported version of Python. At the same time, essential modules used in the code, such as TensorFlow, are required in old versions too. The old version of TensorFlow (1.8.0) caused incompatibilities between the code and many of the devices, especially GPUs, as the newest drivers only support the most recent versions of TensorFlow. This incompatibility is aggravated because modules like PyBDM and Joblib are only available for Python 3. The constant switching between Python versions created several complications, which were solved using containers.

Container: A Virtual Environment

In simple terms, a container is an isolated environment in a computer that has all the requisites for running a specific program. No matter the operating system or the versions installed on the host machine, the container installs the needed packages in the required versions, without interfering with the main configuration of the machine. It differs from a virtual machine in that a container does not emulate hardware using software, but uses the kernel of the host computer to build its own environment. Logically, all containers must have the same kernel (the Linux kernel) as the host machine.

Conveniently, containers do not need to be built from scratch. Predesigned images can be used as a base for containers, and the required versions of the packages can then be added. For this work, Singularity containers pulling images from Docker Hub [81] were used. Specifically, three different containers were used:

• Python 3: This container was built pulling a base Python 3 image from Docker. Further, special packages like Joblib and PyBDM were installed over it. This container was used to treat the dataset, calculate the BDM values and consolidate the dataset into the Decagon dictionaries.

• Decagon main: This container was built pulling a base Python 2.7 image from Docker. The Python modules in the required versions specified in [67] were installed over it, including TensorFlow 1.8.0. This container was used to create the Minibatch Iterator objects (section 4.3.4) and train Decagon using CPUs in preliminary tests with limited data.

• Decagon GPU: This container was built pulling a base Nvidia image with CUDA version 9.0 and the Deep Neural Network developer package (cudnn) version 7. Python was then installed over this configuration, followed by the packages required for running Decagon. In this case, the specialised version TensorFlow-GPU 1.8.0 was installed instead of the standard one (see footnote 5).

Hardware

All the experiments were conducted on a server with dual Intel Xeon(R) Gold 6230 processors (2.10 GHz) with a total of 40 cores and 80 threads, 4 NVIDIA Tesla V100 SXM2 GPUs (16 GB of memory each), and 768 GB of RAM, running CentOS Linux 8 (Core).

4.3.4 Minibatch sampling and the data leakage problem

The Edge Minibatch Iterator class had certain flaws. The edge sampling presented a data leakage problem in both positive and negative edge sampling. The problem with the positive edges resided in the transposed adjacency matrices, which, as mentioned earlier, were removed in the final version. As to the negative edges, the source was the stochasticity of the random edge sampler. This same data leakage issue has already been reported on GitHub [82].

Footnote 5: This change of package was required because, in TensorFlow 1, the GPU and CPU functions belong to different packages, which was fixed in TensorFlow 2.

Duplicates of adjacency matrices

As the current situation involves only undirected graphs, the transposed adjacency matrices are equal to the original ones. Therefore, the transposed duplicates of the adjacency matrices are theoretically not needed. Besides, they could be a source of errors and slow down the training. For this reason, it was decided to remove these duplicate matrices to see if this resulted in an improvement.

Additionally, the data leakage problem mentioned earlier is rooted in the fact that a corresponding pair of adjacency and transposed matrices represents the same edge type, but their edges are sampled independently. With this double sampling, training edges of one matrix can be present in the validation or test edges of its transpose, and vice versa. Moreover, the inclusion of duplicated edge types meant adding more categories to the problem, which could also lead to incorrect results. Furthermore, deleting these matrices considerably reduces the number of parameters, which can help avoid overfitting and make the training more efficient. Despite the evident negative influence of the transposed adjacency matrices, it was decided to keep the transposed DTI matrix (DTI^T), as its removal would change the whole dictionary-based structure of the data, involving drastic changes at all the implementation levels.
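The leakage caused by sampling a matrix and its transpose independently can be reproduced with a toy experiment. The sketch below uses invented data and a hypothetical `split` helper; it is not the Decagon sampler.

```python
import random


def split(edges, rng, test_fraction=0.15):
    """Randomly split a list of edges into (train, test) sets."""
    edges = list(edges)
    rng.shuffle(edges)
    n_test = int(len(edges) * test_fraction)
    return set(edges[n_test:]), set(edges[:n_test])


rng = random.Random(0)
edges = [(i, j) for i in range(30) for j in range(i + 1, 30)]   # undirected pairs
transposed = [(j, i) for i, j in edges]                          # same edges, flipped

# Independent sampling of the matrix and of its transpose.
train_a, test_a = split(edges, rng)
train_t, test_t = split(transposed, rng)
leaked = {(i, j) for i, j in train_a if (j, i) in test_t}

# Sampling once and mirroring the result keeps both views consistent.
train, test = split(edges, rng)
train_both = train | {(j, i) for i, j in train}
test_both = test | {(j, i) for i, j in test}
```

Every edge in `leaked` is seen during training in one orientation and evaluated in the other, which inflates the measured test performance.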

The negative sampling problem

The random sampling of negative edges avoided the need to handle large arrays; however, it did not guarantee that an edge would not be sampled again in another dataset. Although the probability was small, this implied a potential data leakage problem that could lead to overestimating the method's performance. To fix this problem, the EMI became in charge of sampling the three negative datasets (train, validation and test), without the fixed_unigram_candidate_sampler function from the optimiser. An additional condition was imposed to verify that a data point could not be included in several datasets simultaneously. Moreover, the idea from the fixed_unigram_candidate_sampler function of sampling negative edges from the same node was kept to make efficient use of the memory. This new negative training edge sampler saved the sampled edges, allowing the model to calculate training error metrics.

The new dataset creation was substantially larger than the previous one, because the number of training edges is higher than the number of test and validation edges. This increase in size made the random sampling process (which cannot be parallelised) much slower and, by extension, the whole training procedure, as the EMI creation was included in the same script as the training. Consequently, the EMI creation was separated from the rest of the Decagon implementation and defined in another file, where it could save the object containing the sorted dataset in a pickle-readable file. This procedure also allows multiple models to be trained with the same datasets, not only to save time but also to guarantee that the models train with the same data sorting. The new approach can be used for comparing models and for hyperparameter tuning, as the performance changes between models and configurations can be attributed to their parameters and not to stochastic variations in the dataset.
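A negative sampler with the additional uniqueness condition can be sketched as follows. This is a simplified illustration, not the EMI code: the function name is hypothetical, and reusing the source node of a positive edge is only one possible reading of "sampling negative edges from the same node".

```python
import random


def sample_negatives(positives, n_nodes, sizes, seed=0):
    """Sample disjoint negative edge sets, e.g. for train, validation and test.

    positives: set of (i, j) known edges of an undirected graph.
    sizes: dict mapping a split name to the number of negatives wanted.
    """
    rng = random.Random(seed)
    sources = sorted({i for i, _ in positives})
    # Edges that may not be sampled: positives in both orientations.
    used = set(positives) | {(j, i) for i, j in positives}
    splits = {}
    for name, size in sizes.items():
        chosen = []
        while len(chosen) < size:
            i = rng.choice(sources)        # node taken from a positive edge
            k = rng.randrange(n_nodes)     # random partner
            if i == k or (i, k) in used:
                continue                   # reject positives and repeats
            used.update({(i, k), (k, i)})  # remember the edge for all splits
            chosen.append((i, k))
        splits[name] = chosen
    return splits


positives = {(0, 1), (1, 2), (2, 3), (3, 4)}
negatives = sample_negatives(positives, n_nodes=10,
                             sizes={"train": 7, "val": 2, "test": 2})
```

Because the sampled edges are stored and a fixed seed is used, the same negative sets can be reused across models, as described above for the exported EMI object.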

4.3.5 Incorporation of edge features

Current graph convolutional approaches are based on propagating node features in graphs, while edge features are not considered. In this thesis, one group of experiments involves incorporating pharmacodynamic features of the interactions between drugs and proteins. One way of including these edge features in the model was to alter the adjacency matrices, in this case the DTI matrix and its transpose, which normally contain only zeros and ones. This modification was done by performing an elementwise multiplication between the binary adjacency matrix and a matrix of edge features. The latter contained the calculated affinity values for the given drug-protein pairs. This elementwise multiplication turned the network representation into a weighted graph.
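The elementwise multiplication can be illustrated with a small example. The matrices below contain made-up numbers and do not come from the affinity dataset.

```python
import numpy as np

# Binary DTI adjacency (proteins x drugs) and a matrix of affinity scores.
# All numbers are invented for illustration.
dti_adj = np.array([[1, 0],
                    [0, 1],
                    [1, 1]])
affinity = np.array([[0.8, 0.3],
                     [0.5, 0.9],
                     [0.2, 0.4]])

dti_weighted = dti_adj * affinity      # elementwise (Hadamard) product
dti_weighted_t = dti_weighted.T        # weights for the transposed matrix
```

Affinity values for pairs without an interaction are zeroed out, so the sparsity pattern of the original adjacency matrix is preserved.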

4.3.6 Other improvements

One crucial change was the train-validation-test proportion. In the original implementation [67], the default proportion is 90-5-5. These values are unusual, as such a big proportion of training instances may overfit the model. The proportion was changed to 70-15-15, as it is a more typical value for these models [83]. Another fix involved the insertion of a ReLU activation in the second layer of the encoder. This activation was mentioned in the authors' work and is present in the previous GCN equations, but its implementation was not found in the code. Other technical improvements included:

• Training metrics were calculated and exported once per epoch for all edge types (with the modifications to the EMI).
• Unnecessary calculations and unimplemented functions were deleted to make the code more readable.
• A ReLU activation was included in the second layer of the encoder, which was not originally found in the implementation.
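The effect of the change in proportions on the size of each dataset can be checked with a short calculation. The helper `split_sizes` is hypothetical and uses integer percentages to avoid rounding issues.

```python
def split_sizes(n_edges, val_percent=15, test_percent=15):
    """Return (train, validation, test) sizes for a given number of edges."""
    n_val = n_edges * val_percent // 100
    n_test = n_edges * test_percent // 100
    return n_edges - n_val - n_test, n_val, n_test


new_split = split_sizes(1000)          # 70-15-15 proportion used in this work
old_split = split_sizes(1000, 5, 5)    # 90-5-5 default of the original code
```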

4.3.7 Overall Pipeline Finally, an important contribution of this work is creating of a pipeline that makes possible the training of Decagon from scratch with real data and good performance. Furthermore, all the code is appropriately documented and pub- lic in https://github.com/diitaz93/polypharm_predict. The operational flow starts with downloading and extracting the dataset from the web, using the bash script download_data, generating the .csv files. Next, the user must get into the Python 3 Singularity container explained in section 4.3.3. Inside the container, the code DS_generator.py would im- plement the "Remove Outliers" and the "Reduced Data Structures" phases. All Python scripts in this pipeline have been equipped with an argument parser us- ing Python module argparse, which allows the user to enter the number of desired side effects for the filtering, which is set to 964 by default. This func- tionality also allows to print help instructions using the flag –help. The out- put of DS_generator.py is the adjacency matrices, feature matrices and other raw data structures consolidated in a pickle file. Later, the adjacency matrices are used by the BDM scripts to calculate the node and edge com- plexity perturbation values. Each edge category has its own BDM calculation file where the complexity is computed in parallel. Then the bin_BDM.py script binarises the BDM values as explained in section 4.3.2. To end the data pre-processing stage, the script DECAGON_struct.py will consolidate all the features and build the Decagon dictionary-based data structures and ex- port them in a pickle file with Python 2 compatibility. This script will accept the –dse and –bdm flags depending if the user wants DSE or BDM features, respectively; or both. For the following stage, the user must change to a Python 2 container. The container has to be chosen depending on the device where the model will be trained. 
Once in the container, the next step is to run the MINIBATCH_saving.py script, which asks for arguments corresponding to the path of the files with the structures, the batch size (512 by default) and the test proportion (0.15 by default). This code creates the EMI object and exports it in a pickle file. Once the minibatch iterator has been exported, the training can begin. To start the training, the user must run either main.py or main_gpu.py. Both of them ask for the number of epochs and the batch size of the minibatch. The GPU version also asks for the IDs of the GPUs that will be used to train the model. Finally, this script outputs the training metrics, validation, and test datasets as NumPy arrays stored in pickle-readable files. The full pipeline is illustrated in Figure 4.4.

[Figure 4.4: flow diagram. Node labels: download_data.sh, .csv, Python 3.7, DS_generator.py, ddi_bdm.py, dti_bdm.py, ppi_bdm.py, bin_BDM.py, DECAGON_struct.py, Python 2.7, MINIBATCH_saving.py, EdgeMinibatchIterator, main_gpu.py, results_training.]

Figure 4.4: The overall pipeline used to execute the Decagon code. The pipeline starts with a bash script to download and decompress the original dataset into .csv files. The Python 3.7 Singularity container (green rounded rectangle) is then invoked to run the filtering and BDM feature generation. Later, the container is changed to Python 2.7 with (or without) GPU compatibility to sort the data and train the model. The training returns its performance metrics.

Chapter 5

Results and Discussion

Several experiments were performed to test Decagon over all of its development phases. In this chapter, the results of training Decagon are shown and discussed for every stage. All the experiments performed in this thesis used Decagon with two convolutional layers: one input layer composed of 64 neurons and one output layer of 32 neurons. The dropout and the learning rate were kept constant at 0.1 and 0.001, respectively.

5.1 First experiments: Feature selection

The first experiments involved incorporating all the node features, including those mined from the datasets and BDM. The version of Decagon used for this stage had few alterations with respect to the version in [67], having only the real data incorporation pipeline and a very primitive result-saving mechanism. The exported metrics included AUROC, AUPRC and APK@50 of the validation dataset for every edge type used, as well as the value of the training cost function.

As the first simulations used an implementation without GPU compatibility, they were trained using a reduced dataset to get fast results. The reduced dataset was formed using a consistent network with the six most frequently occurring polypharmacy side effects. This number was set manually, generating a number of nodes that adjusted to this condition. Hence, this network contained 16269 protein nodes, 630 drug nodes and 9688 different DSE, and held 16 adjacency matrices, of which 4 corresponded to protein edges and 12 to side effect edges, as every edge type still kept its transpose adjacency matrix. This experiment was run on CPUs for 20 epochs, with a batch size of 512 and a train-validation-test split of 90-5-5.

With the generated graph, six repetitions of the experiment were made. The repetitions involved the following variations of the features:

1. One control experiment with no features.
2. DSE binary features for drugs, no protein features.
3. DSE binary features for drugs and secondary structure features for proteins.
4. DSE binary features for drugs and normalized secondary structure features for proteins.
5. BDM dense features for drugs and BDM dense features plus secondary structure features for proteins. BDM features were 2-dimensional vectors representing node and edge perturbation results, respectively.
6. BDM dense features for drugs and BDM dense features plus normalized secondary structure features for proteins.
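Of the exported metrics, APK@50 (average precision at k = 50) is the least standard, so a minimal sketch is given here. It follows the common definition of average precision at k and is not taken from the thesis code; the function name and the toy edges are assumptions made for this example.

```python
def average_precision_at_k(actual, predicted, k=50):
    """Average precision at k for a ranked list of predicted items.

    actual: collection of true items (e.g. true edges).
    predicted: items ranked from most to least confident.
    Hypothetical helper following the usual AP@k definition.
    """
    actual = set(actual)
    if not actual:
        return 0.0
    hits = 0
    score = 0.0
    seen = set()
    for rank, item in enumerate(predicted[:k], start=1):
        # Count each true item once, at the first rank where it appears.
        if item in actual and item not in seen:
            hits += 1
            score += hits / rank   # precision at this rank
        seen.add(item)
    return score / min(len(actual), k)


# Toy ranking of made-up edges: true edges appear at ranks 1 and 3.
true_edges = [("d1", "d2"), ("d3", "d4")]
ranked = [("d1", "d2"), ("d5", "d6"), ("d3", "d4")]
print(average_precision_at_k(true_edges, ranked, k=3))
```

In this toy case the precision is 1/1 at the first hit and 2/3 at the second, so the score is (1 + 2/3) / 2.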

[Figure 5.1: line plot titled "Mean validation AUPRC for different edges". Horizontal axis: Iteration; vertical axis: AUPRC. Series: Protein, Side effect.]

Figure 5.1: Performance in AUPRC of the validation dataset during the training iterations in the replica without features for the two main groups of edge types: in blue, the average of the protein-related edges and, in red, the average of the side effects.

Due to the probability distribution of edge sampling described on page 56, edge types involving proteins, namely PPI, DTI and TDI, are chosen for training with a higher probability. Therefore, there is a strong tendency to overfit these edge types. This tendency is shown in Figure 5.1, where the validation AUPRC of the no_feat experiment is averaged over all the protein edges and shown in blue, with an error margin representing the standard deviation. The same procedure is applied to the side effect edges, shown in red. The figure shows a clear gap between the final performances of both types of edges. It is also noticeable that the protein edges reach their maximum performance in fewer iterations and more directly than the side effects, whose performance oscillates more.

[Figure 5.2: two bar charts titled "Mean validation AUPRC for protein edges" and "Mean validation AUPRC for side effects". Vertical axis: AUPRC. Bars: no_feat, DSE, DSE+PF, DSE+N, BDM+PF, BDM+N.]

Figure 5.2: Protein feature experiment AUPRC for the validation dataset in the last training iteration for the six different replicas. To the left, the average performance for the protein-related edges and their standard deviations. To the right, the equivalent for side effect edges.

Figure 5.2 shows the performance expressed as AUPRC for the different replicas of the experiment, divided into two plots: on the left, the average performance over the protein edges and, on the right, the average performance over the side effect edges, both with their corresponding standard deviations as error bars. The AUPRC values correspond to the validation dataset in the last training iteration. These results show that the use of the protein features visibly reduces the performance and adds variability to the protein-related edge types, while slightly reducing the performance of the other edges. This effect indicates that protein features introduce noise to the neural network, as it does not discover any clear associations between the protein network, the protein features and the drug properties. From these results, nothing conclusive can be said about the impact of DSE or BDM features on performance or efficiency, although they suggest that this aggregation does not change the results significantly.

[Figure 5.3: bar chart titled "Execution time". Vertical axis: Hours. Bars: no_feat, DSE, DSE+PF, DSE+N, BDM+PF, BDM+N.]

Figure 5.3: Protein feature experiments execution time in hours for the six replicas.

Although different kinds of features were used in these replicas, their execution time was not significantly different, as shown in Figure 5.3. This relatively short and uniform training time can be attributed to the small size of the network and the low number of epochs. The fruitless results of adding protein features were the motivation to discard them from the rest of the experiments of this project. More insight and expertise in the topic are needed to select relevant protein features that could satisfy the model's needs.

5.2 Node features as possible method stabilisers

Many changes were applied to Decagon for the next experiments, starting with discarding the secondary structure protein features. Additionally, the test performance was exported at the end of the training using AUROC, AUPRC and accuracy. The performance on the validation dataset of all edges was also exported once per epoch using these same metrics. Finally, the train-validation-test proportion was set to 70-15-15 to obtain a more reliable measure of the performance, and the number of epochs was increased to 50.

BDM was modified by splitting the edge perturbation experiment into adding and removing edges. The BDM vector is thus now a 3-dimensional vector with node perturbation, edge addition and edge removal features. The reason behind this approach was to quantify the impact of the existing edges and of the addition of hypothetical edges separately.

Without the protein features, there were no other variables that could interfere with evaluating the impact of BDM features in Decagon. However, with limited resources to run simulations of the complete dataset, the side effects used to generate the consistent graph had to be chosen carefully. Following this reasoning, the three best-performing and three worst-performing side effects were chosen to create the network and see the influence of features in opposite scenarios. This choice generated a completely different network with 16266 proteins, 627 drugs, 9688 DSE and 16 adjacency matrices representing the new side effects and their transposed matrices. It was noticed that the worst-performing side effects had far fewer edges than the best-performing ones. In the worst case, only 1016 edges were found, compared to the 45750 of the best. This low number of edges forced the EMI to reduce the batch size to 128, to fit more batches and make more iterations per side effect.
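The effect of the batch size on the number of iterations can be checked with a short calculation. The sketch below assumes that 70% of the edges of a side effect are used for training and that only full batches are counted; both are simplifying assumptions for illustration, and the function name is hypothetical rather than part of the EMI.

```python
def full_batches(n_edges, batch_size, train_percent=70):
    """Number of full training batches for one edge type.

    Assumes train_percent of the edges are used for training and that
    an incomplete final batch is discarded (illustrative assumptions).
    """
    n_train = n_edges * train_percent // 100
    return n_train // batch_size


# Edge counts quoted in the text for the smallest and largest side effects.
for batch_size in (512, 128):
    smallest = full_batches(1016, batch_size)
    largest = full_batches(45750, batch_size)
    print(batch_size, smallest, largest)
```

Under these assumptions, the smallest side effect yields a single full batch with a batch size of 512 and five with a batch size of 128, which illustrates why the smaller batch size was needed.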

[Figure 5.4: three bar charts under the title "Mean AUPRC for different sets of edges". Panels: Protein edges, Best performing edges, Worst performing edges. Vertical axis: AUPRC. Bars: no_feat, DSE, BDM, w2.]

Figure 5.4: Test AUPRC for the three different edge groups for all replicas. To the left, the average of the protein-related edges; in the center, the average of the three best-performing side effects; and to the right, the average over the three worst-performing side effects.

The experiment consisted of four replicas with different features: one control experiment with no features, one only with DSE, one only with BDM and one with both DSE and BDM, referred to in the figures as w2. The average performance over the protein, best and worst edges is shown in Figure 5.4. It can be seen that the impact of BDM features is significantly different for the two sets of side effect edges. BDM seems to lower the performance of the best-performing side effects while giving a minor increase to the worst-performing side effects compared to the control. These results also make clear that the feature acting as the most effective regulariser is the DSE. The w2 replica seems to work well with the best-performing side effects, but it had the lowest value for the worst-performing ones. Protein edges are also negatively affected by BDM, but not as much as with protein features. Also notable is the increase in variance that the BDM replicas give to the protein edge types, which does not happen with side effect edge types.

A disadvantage of running experiments with reduced graphs was that the negative edge sampling was done randomly for each replica, which means that the datasets were not identical, and this could influence the results. This issue is fixed for the final results by remodelling the EMI.

Figure 5.5 gives a more detailed picture of the different replicas during training for the different edge types, showing the average of the validation AUPRC and its dispersion in blue and the test AUPRC as a green dashed line. The side effect plots show that the three best side effects increase their performance steadily through the epochs with low variance among them. In contrast, there is a high variance and visible oscillations in the training results of the three worst side effects, with one case even decreasing its performance during some epochs. This is a consequence of the data imbalance between edges: as the optimiser estimates a loss function with fewer data points, meaning fewer iterations per epoch, its generalisation ability decreases. Another observation is that BDM adds considerable variation to the performance of the protein edges, without being so drastic in the side effect edges.

Figure 5.6 shows a comparison of the execution times. Curiously, the difference in the number of data points does not affect the execution time, with the four replicas taking almost the same time to train. Nevertheless, the execution time increased compared to the previous experiment, mainly because the number of epochs was extended.

This set of experiments suggests that adding features to the GCN has a more substantial negative impact on highly performing edge types than on poorly performing ones. It is also noted that the number of data points plays a crucial role in the performance of an edge type, resulting in the need for training the model on larger graphs to achieve more convincing results.

[Figure 5.5: grid of line plots titled "Mean test and validation for each method". Columns: Proteins, 3 best, 3 worst. Rows: no_feat, DSE, BDM, w2. Horizontal axis: Epoch; vertical axis: AUPRC. Legend: validation, test.]

Figure 5.5: Average test and validation AUPRC for each edge type group in the columns and for each replica in the rows. The blue lines represent the mean validation AUPRC with its deviation for each group of edge types. The green dotted line represents the test AUPRC after 50 epochs. The scales are kept the same column-wise for an easier comparison between methods, but are changed for every group of edge types.

[Figure 5.6: bar chart titled "Execution time". Vertical axis: Hours. Bars: no_feat, DSE, BDM, w2.]

Figure 5.6: Execution time in hours for 50 epochs of the four replicas.

5.3 Experiments with side effects with the lowest performance

Despite having an adverse influence on some side effects, BDM could be having different effects on other edge types. As the more interesting results came from the worst-performing side effects, the next experiments shifted towards them. The six worst-performing side effects were therefore chosen for the following experiments. The authors reported these side effects as the ones with the lowest performance in the original model but, due to the changes to the code and the reduced data sampling, the worst-performing side effects found in this stage are no longer the same as those reported by the authors.

The most significant change of this stage was the removal of the double matrices. Also, the sampling of the negative edges was fixed to isolate the test and validation datasets. These two changes contributed to reducing the data leakage problem found in the implementation. Additionally, the EMI creation was separated into a different stage of the pipeline, and the sampled edges were exported in an additional file. This extra step guaranteed that the different replicas would use identical test and validation datasets. These changes gave a notable improvement in the performance of the following experiments.

With the implemented changes, the new experiment was run with the same four replicas as before. The graph created with the six chosen side effects had 16814 proteins, 636 drugs and 9700 DSE.
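The idea behind the corrected negative sampling, namely drawing negative edges once, keeping the validation, test and training samples disjoint, and reusing them in every replica, can be sketched as follows. This is not the EMI code: the function name, its parameters, the seeds and the toy graph are assumptions made for illustration.

```python
import random


def sample_negatives(n_rows, n_cols, positives, reserved, n_samples, seed):
    """Draw distinct (row, col) pairs that are neither positive edges
    nor already reserved for another dataset (hypothetical helper)."""
    blocked = set(positives) | set(reserved)
    in_range = sum(1 for r, c in blocked if 0 <= r < n_rows and 0 <= c < n_cols)
    if n_samples > n_rows * n_cols - in_range:
        raise ValueError("not enough candidate pairs left to sample from")
    rng = random.Random(seed)  # fixed seed: every replica gets the same sample
    negatives = set()
    while len(negatives) < n_samples:
        pair = (rng.randrange(n_rows), rng.randrange(n_cols))
        if pair not in blocked:
            negatives.add(pair)
    return sorted(negatives)


# Toy bipartite graph with 20 x 30 candidate pairs and made-up positive edges.
positives = [(0, 1), (2, 5), (7, 7), (10, 29), (19, 0)]
val_neg = sample_negatives(20, 30, positives, [], 5, seed=1)
test_neg = sample_negatives(20, 30, positives, val_neg, 5, seed=2)
# Training negatives must avoid everything held out for validation and test.
train_neg = sample_negatives(20, 30, positives, val_neg + test_neg, 20, seed=3)
print(len(val_neg), len(test_neg), len(train_neg))  # 5 5 20
```

Storing the three sampled lists in a file, as the text describes for the sampled edges, lets every replica load exactly the same held-out sets instead of resampling them.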

Extension to pharmacodynamic features

In this stage, a new variation was attempted in parallel to the standard Decagon procedure. Descriptors related to the plausibility of the interactions between different drugs and proteins, called drug-target affinities, were added to the data. As these are edge features, they were implemented directly in the adjacency matrix, as explained in section 4.1.

This new set of features generated a reduced graph based on network consistency rules, using the six worst-performing side effects used in the previous experiment along with the same replicas. As not all the drug-target pairs had a calculated affinity value, the network size was reduced considerably, having only 16271 proteins, 276 drugs and 8120 DSE. Despite being trained for 50 epochs, this experiment's training time was significantly shorter than the previous ones, mostly due to the reduced network size, as shown in Figure 5.7.

[Figure 5.7: bar chart titled "Execution time in hours". Bars: no_feat, DSE, BDM, w2.]

Figure 5.7: Execution time in hours for 50 epochs of the four replicas in the affinity experiment.

The results of this experiment and the previous one without affinity scores are shown in Figure 5.8. Additionally, the performance reported by the authors is included as a green bar on the right-hand side of the plot. For the experiment without affinity scores, the results show an increase in performance with respect to the one calculated over the three worst-performing side effects. Still, the improvements cannot be attributed to the BDM features, as their replica is the worst-performing of the four. In this case, however, DSE did not improve the performance over the featureless model. On the other hand, the use of affinity features decreases the performance in all replicas. This reduction in performance is not necessarily attributable to a direct influence of the features, but most likely to the network reduction that comes with them. The figure also shows that the side effect performance is much more similar between the replicas than in the previous experiment, with the DSE replica slightly surpassing the rest. This similarity could mean that, as the network shrinks, the features become more relevant in offering useful information to the classifier. The error bars show that the affinity experiments have a much larger variance than their corresponding controls. For more details on the progress of performance over the epochs for these experiments, the reader is referred to Figures A.1 and A.2 in Appendix A.

[Figure 5.8: bar chart titled "Mean AUPRC for 6 worst side effects". Vertical axis: AUPRC. Bar groups: no_feat, DSE, BDM, w2, DECAGON. Legend: 6w, AF, DEC.]

Figure 5.8: Average AUPRC for the six worst-performing side effects with and without affinity. The blue bars show the regular, modified Decagon averaged over the six side effects. In red, the equivalent is shown for the affinity replicas. Both are compared with the authors' results using the original version of Decagon, shown in green.

[Figure 5.9: six bar charts, one per side effect. Panels: Diarrhea, Emesis, Body Temperature Increased, Bleeding, Renal Disorder, Leucopenia. Vertical axis: AUPRC. Bar groups: no_feat, DSE, BDM, w2, DECAGON. Legend: 6w, AF, DEC.]

Figure 5.9: Test AUPRC for the six worst-performing side effects. Each plot shows data for one side effect. The blue bars show the regular, modified Decagon. In red, the equivalent is shown for the affinity replicas. Both are compared with the authors' results using the original version of Decagon, shown in green.

When comparing the results, it is clear that both models perform, on average, better than the original Decagon for every replica. It should be highlighted that the modified versions achieve these performances with fewer data points, as they are trained with only six side effects. It is also evident that the version with affinity features performs worse than the regular, modified version and has a much larger variance among side effects, as shown by the black error bars.

A comparison of the experiments and their replicas over the individual side effects is shown in Figure 5.9. The figure shows that, for all side effects, the original Decagon is beaten by the modified version in all replicas. The version with affinity features beats Decagon in all replicas for half of the side effects and is beaten in all replicas only for emesis. For every side effect, the best-performing method belongs to the regular modified version (blue bars): the replica with no features is the best method in four side effects, and the DSE replica in two. The performance of the affinity experiments shows no clear pattern: they are superior in renal disorder using DSE, BDM and w2, but perform poorly in all replicas for emesis. However, BDM and DSE features improve the performance of the affinity replicas more frequently than that of the regular version, supporting the theory of features as regularisers for small graphs.

5.4 Extension to experiments with a full graph

It is now clear that the number of data points is a decisive factor in the performance of Decagon-based models. Therefore, simulations with the complete dataset have to be run in order to achieve peak performance. Still, several attempts to train models with bigger datasets proceeded abnormally slowly. As a result, the code was inspected and cleaned, removing any unnecessary calculations and fixing any performance issues that it may have had. However, it was not until the complete working pipeline was finished with the GPU implementation that a full graph simulation could be run.

In Figure 5.10, time and performance results are shown for different attempts at improving the efficiency of the method using a small portion of the dataset. From left to right, the bars represent the regular version used until now, one version in which the explicit calculation of the gradients was omitted, one using a ReLU in the output layer of the encoder, one with the implementation of the DecagonModel defined in fewer loops, and a version run on GPU. The results show an outstanding reduction in time with no significant reduction in performance using the GPU. This improvement cleared the path for simulations using the complete dataset.

[Figure 5.10: two bar charts titled "Mean AUPRC for side effects" and "Execution time in hours". Vertical axes: AUPRC and Hours. Bars: BASE, NoGRAD, ReLU, MODRED, GPU.]

Figure 5.10: Comparison of performance (AUPRC) on the right figure and execution time in hours on the left, for the different variations in the implementation. From left to right, the simulations are: no modifications, no explicit gradient calculation, ReLU in second layer of encoder, decoder reduction and implementation in GPU.

At the same time, improvements were made to the sampling of negative train edges. The sampling was now performed separately in the EMI instead of inside the optimiser, verifying that edges would not be included in the test or validation datasets and, more importantly, storing them to calculate the train performance.

BDM went through multiple modifications as well. The feature matrices generated in BDM replicas resulted in large, dense tensor multiplications too big to fit in the GPU memory. To reduce the size of BDM, first, the feature vector corresponding to the edge addition perturbation was removed, as it lacked biological meaning¹. Second, the remaining BDM vectors were replaced by a binary version stored as a sparse matrix.

With these modifications, two new replicas of the model and a control version were developed to compare their results. The control version tried to preserve most of the original implementation of Decagon to replicate the authors' results, while including some of the new modifications to allow it to run efficiently.

¹ In the biological network, creating an artificial connection between two random nodes is unrealistic, as proteins only interact with the proteins in their pathway and with few other neighbours. Measuring the perturbation of these connections does not add any applicable information.

[Figure 5.11: two bar charts titled "Mean performance metrics for protein edges" and "Mean performance metrics for side effects". Bar groups: BASE, DSE, w2 (plus DECAGON in the side effect chart). Legend: AUROC, AUPRC.]

Figure 5.11: Mean performances for the test dataset using the complete database (964 side effects). To the left, the mean performances of protein-related edge types and, to the right, the performances of side effect edges. The bar groups represent, from left to right, the basic Decagon, the complete implementation with DSE and the complete implementation with both DSE and BDM. Additionally, the right plot shows the performances reported by the authors. The blue bars show the AUROC and the red bars the AUPRC.

For instance, the double matrices were discarded, as keeping them would have implied dealing with a much bigger file of data structures and training with twice the number of edges and parameters. The control used the whole dataset, that is, 19081 proteins, 639 drugs and 964 side effects. The two replicas with the modified model included all the implementations described so far (explained in detail in section 4.3). For them, a dataset including only second neighbours of protein nodes was used instead of the full dataset, that is, a network with 16837 proteins, 639 drugs, 9702 DSE and 964 side effects. One of the replicas included DSE exclusively, while the other included DSE and BDM (w2). Both of them used the same EMI object holding the datasets, which took approximately 11 hours to create.

The performances of the replicas for this experiment are shown in Figure 5.11. On the left, the average performances over the protein edges measured with AUROC (blue) and AUPRC (red) are shown, while on the right the same metrics are shown for the side effects. Additionally, the right plot includes the performance reported by the authors using the original Decagon. Firstly, the figure shows that the base simulation is outperformed by all others, including the reported results. This means that it was impossible to reproduce the authors' results with a runnable method based on their implementation. Secondly, both modified versions making use of node features surpassed the original performances reported by the authors. Thirdly, the performances of both modified methods on the side effects are practically equal. For more details on the progress of performance over the epochs for these experiments, the reader is referred to Figure A.3 in Appendix A.

Next, Figures 5.12 and 5.13 show the performance in AUPRC of the different methods on some of the side effects with results reported by the authors, namely the three best and the three worst, respectively. The figures show the simulation results for the three respective side effects in the different rows, and the base, DSE and w2 replicas in the columns. The validation performance is shown as a blue dotted line, the test performance as a green dotted line and the authors' result as a purple dotted line. The training performance was calculated for the two modified replicas and is shown as a red line.

Figure 5.12 shows in its first row of plots the performance for mumps, the authors' best-performing side effect. For this side effect, the authors' value of performance slightly outperforms the ones from both modified replicas. On the other hand, both modified replicas surpass the authors' performance for their second-best side effect, carbuncle, by a minimal margin. Next, for coccydynia, the third-best side effect, only the w2 replica surpasses the reported values. It is important to note that these side effects reported as best by the authors are not the best-performing ones in the modified replicas. Finally, the first column shows that the base method gives significantly lower results than those officially reported. For these side effects, both modified replicas give better results for the test and validation datasets than for the training, which is not unfavourable but is unusual. The curves' expected behaviour is that the training dataset would perform slightly better due to some degree of overfitting, which is not noticed here. This anomaly may be caused by some peculiar edges in the dataset that are hard to classify. As the training dataset is larger than the test dataset, it is more likely for those edges to be present in the training dataset, reducing its performance.

Next, Figure 5.13 shows how all three cases, including the base simulation, exceed by a wide margin the reported performance for the three worst side effects. Similarly to the previous case, the authors' worst-performing side effects are not the same as those in any other replicas. These six side effects were chosen to show a comparison between the four sets of results. Unfortunately, there is no available information on the performance of the remaining edge

[Figure 5.12: grid of line plots titled "Performance for 3 best side effects". Columns: BASE, DSE, w2. Rows: Mumps, Carbuncle, Coccydynia. Horizontal axis: Epochs; vertical axis: AUPRC. Legend: train, val, test, DECAGON.]

Figure 5.12: Performances in AUPRC of the different replicas in the columns and for the three best-performing side effects according to the authors in the rows. In all the plots, the blue dotted lines represent the validation performance over the epochs, the green straight dotted lines the test performance after 50 epochs, and the purple dotted line the performance reported by the authors. Additionally, in the two right columns, the red line shows the training performance over the epochs.

Performance for 3 worst side effects

BASE DSE w2 0.820 0.825 0.82 0.82 0.76 0.760 0.80 0.80

0.74 0.78 0.78

0.76 0.76 0.72

AUPRC AUPRC 0.74 AUPRC 0.74

Bleeding train train 0.72 0.72 0.70 val val val test 0.70 test 0.70 test DECAGON DECAGON DECAGON 0.68 0.679 0.68 0.679 0.68 0.679 0 20 40 0 20 40 0 20 40 Epochs Epochs Epochs

0.771 0.822 0.82 0.818 0.82 0.76 0.80 0.80

0.74 0.78 0.78

0.76 0.76 0.72 AUPRC AUPRC 0.74 AUPRC 0.74 train train

Temperature 0.72 0.72 0.70 val val val Increased Body test 0.70 test 0.70 test DECAGON DECAGON DECAGON 0.68 0.680 0.68 0.680 0.68 0.680 0 20 40 0 20 40 0 20 40 Epochs Epochs Epochs

0.77 0.84 0.84 0.762 0.824 0.822 0.76 0.82 0.82

0.75 0.80 0.80

0.74 0.78 0.78 0.73 0.76 0.76 AUPRC 0.72 AUPRC AUPRC Emesis 0.74 train 0.74 train 0.71 val val val 0.72 0.72 0.70 test test test 0.693 DECAGON 0.70 DECAGON 0.70 DECAGON 0.69 0.693 0.693 0 20 40 0 20 40 0 20 40 Epochs Epochs Epochs

Figure 5.13: Performances in AUPRC of the different replicas in the columns and for the worst three performing side effects according to the authors in the rows. In all the plots, the blue dotted lines represent the validation performance over the epochs, the green straight dotted lines the test performance after 50 epochs, and the purple dotted line the reported performance by the authors. Additionally, in the two right columns, the red line shows training performance over the epochs. CHAPTER 5. RESULTS AND DISCUSSION 93

Performance for 3 worst side effects

BASE DSE w2 0.879 0.88 0.875 0.868 0.86 0.70 0.850 0.84 0.825 0.82 0.65 0.800 0.80 gravis

AUPRC 0.603 AUPRC AUPRC 0.60 0.775 0.78

Myasthenia 0.76 0.750 train train 0.55 0.74 val 0.725 val val test test 0.72 test 0.700 0 20 40 0 20 40 0 20 40 Epochs Epochs Epochs

val 0.85 test 0.86 0.75 0.729 0.80 0.84 0.70 0.75

burn 0.729 0.82 AUPRC AUPRC AUPRC degree Second 0.65 0.70 0.804 0.80 train train 0.65 0.60 val 0.78 val test test

0 20 40 0 20 40 0 20 40 Epochs Epochs Epochs

0.737 0.90 0.88

0.70 0.85 0.841 0.86

0.84 0.80 0.65 0.82 train val 0.75 0.80 AUPRC AUPRC AUPRC test 0.60 0.70 0.78 Panniculitis train 0.76 0.55 val val 0.65 0.748 test test 0.74

0 20 40 0 20 40 0 20 40 Epochs Epochs Epochs

Figure 5.14: Performances in AUPRC of the different replicas in the columns and for three different side effects in the rows. These side effects correspond, from top to bottom, to the worst performing in the base replica, the DSE replica and the w2 replica. In this sense, the plots in the diagonal of the array show the worst performing side effect for its corresponding replica. The blue dot- ted lines represent the validation performance over the epochs and the green straight dotted lines the test performance after 50 epochs. 94 CHAPTER 5. RESULTS AND DISCUSSION

Frequency of performances

100 BASE DSE w2 DECAGON

80

mean

60

worst best

Occurrences 40

20

0 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 AUPRC

Figure 5.15: Frequency distribution of performances (AUPRC). The blue his- togram corresponds to the base replica, the red for the DSE replica and the green for the w2 replica. Additionally, the mean, worst and best performance reported by the authors are shown as blue vertical dashed lines. types provided in the original work. Then, equivalent plots were generated for each replica’s worst-performing side effects to evaluate how the method performs on their weakest edge types. Hence, Figure 5.14 shows the worst performing side effect of the base, DSE and w2 replicas in the corresponding order. In this sense, the plots in the diago- nal show each method performing on their weakest edge type. In most of these plots, the test performance’s previous superiority over the test does not occur. Instead, the weak edges are characterised by a considerable gap between the final train performance and the test performance, in other words, overfitting. It is also noticeable the different and curious behaviour of the validation curves to the previous cases. For example, in the case of panniculitis trained in w2, there is a bizarre decreasing in the validation performance. This phenomenon may be explained by wide heterogeneity in the dataset and a fortunately good initialisation of parameters. Figure 5.15 exposes a more substantial outlook of the wide range of perfor- mances among side effects. The figure shows histograms of the base replica performances in blue, for DSE in red and w2 in green. The histograms show an CHAPTER 5. RESULTS AND DISCUSSION 95

Execution time Resident set size memory 250

1000 974.5 924.2 204.8 200 194.1 195.1

800

150 600

501.6 Gb Hours 100 400

50 200

0 0 BASE DSE w2 BASE DSE w2

Figure 5.16: Execution time in hours and memory usage in gigabytes for the three replicas of the experiment with the complete dataset. evident superiority of the modified replicas over the base method. Although it is challenging to assume a distribution for the performances, it is clear that practically all of the modified replicas’ side effects perform better than the base replica average. On the other hand, there is no noticeable difference be- tween the modified replicas’ distributions. Therefore, further statistical anal- ysis should be done to establish if one has better mean performance or more dispersion than the other. The figure also shows in vertical lines the mean, worst and best performances of the authors’ results. It can be seen that the mean and the best performances are not very different from those of the modified implementations. However, the worst performance is considerable lower than the ones in the modified replicas. This could suggest that the distribution of performances of the au- thors’ results is more widespread, i.e. have a high variance, while the modified implementations tend to have a more uniform performance among side effects. The previous models were trained simultaneously in one GPU each. The exe- cution time in hours is shown on the left side of the Figure 5.16, which equals 20.9, 38.5 and 40.6 days for the base, DSE and w2 replicas, respectively. The right side of the figure shows the used physical RAM used for the training of each replica. The higher memory used by the base replica is due to the larger number of nodes that held its network. Training times as the ones shown in Figure 5.16 are considerably long for practical uses of a model. Other models handling network datasets with similar 96 CHAPTER 5. RESULTS AND DISCUSSION

sizes do not take more than a couple of days to train [84]. One possible reason for this long training time is the vast number of parameters that these models hold, and the fact that they scale with the number of classes and the number of input features. For the encoder, one matrix per layer per edge type is needed, which in the Decagon case are of dimensions Nf × 64 and 64 × 32 for the first and second layer, respectively, where Nf is the number of input features. Alternatively, the decoder needs four global 32 × 32 matrices and one 32 × 32 diagonal matrix per side effect. For the modified methods of this project, this gives approximately 600 million parameters, most of them coming from the encoder. If transposed matrices were considered, the number of parameters would double. These facts question the scalability and the applicability of Decagon in practical situations. Chapter 6

Conclusions

Many computational methods have addressed the problem of predicting drug-drug interactions. However, almost all of them give binary predictions of toxicity or synergy, without pointing out specific adverse side effects. Decagon emerged as the first method to tackle this problem, using a graph convolutional encoder that extracts useful information from the network generated by the interactions of drugs and proteins. In this work, the method was improved and enriched with different sets of features, focusing on algorithmic descriptors of the network.

Contributions

This work presents a modified version of Decagon that outperforms the reported results in various metrics. The proposed version solves fundamental implementation problems related to data leakage and compatibility, among others. Furthermore, a complete pipeline was developed to run Decagon with real data on a GPU, from filtering the dataset and converting it into suitable data structures to exporting the performances for model evaluation. All of this is done through an application programming interface (API) that frees the user from dealing with package incompatibilities or implementation details.

Among the side contributions, new functionalities for the existing Python package PyBDM [80] were created and implemented. These functionalities extend the package’s scope to generate algorithmic complexity node features from a given graph. Additionally, a method to incorporate edge features into the GCN training was tested, by including them in the adjacency matrix as in a weighted graph.
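The idea of placing scalar edge features in the adjacency matrix can be sketched as follows. This is a minimal illustration, not the code used in this thesis: the function names, the toy edges and the affinity values are hypothetical, the weights are assumed to be non-negative, and the symmetric normalisation with self-loops is the one commonly used for GCNs [33], which may differ from the normalisation applied in the actual pipeline.

```python
import numpy as np


def weighted_adjacency(n_nodes, edges, weights):
    """Build a symmetric adjacency matrix whose entries are scalar edge features."""
    adj = np.zeros((n_nodes, n_nodes))
    for (i, j), w in zip(edges, weights):
        adj[i, j] = w
        adj[j, i] = w  # undirected graph: mirror the weight
    return adj


def normalise(adj):
    """Symmetric normalisation D^{-1/2} (A + I) D^{-1/2}; assumes non-negative weights."""
    adj_self = adj + np.eye(adj.shape[0])  # add self-loops
    deg = adj_self.sum(axis=1)             # weighted degree of each node
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return adj_self * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


# Hypothetical toy graph: three nodes, two edges with made-up affinity scores.
edges = [(0, 1), (1, 2)]
affinity = [0.9, 0.2]
adj = weighted_adjacency(3, edges, affinity)
adj_norm = normalise(adj)
```

With this construction, a stronger edge contributes a larger coefficient to the propagation step than a weaker one, which is the only way a one-dimensional edge feature can influence the embeddings in this scheme.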

Findings

This work corroborated the findings of [1] and [38] about the effectiveness of combining a graph convolutional neural encoder with a tensor factorisation decoder. These architectures perform reliably on big, highly connected networks, but their performance degrades when they are applied to smaller sub-graphs with fewer elements. Their propagation rules benefit from densely connected neighbourhoods, which propagate information through the graph quickly. However, implementation-wise, a highly connected graph may generate dense adjacency matrices, which would hamper the computation. Thus, it is crucial for graph datasets to have a "goldilocks" level of connectivity: dense enough to transmit information quickly but sparse enough to be computed efficiently.

As the datasets shrink in size, data disparities become more noticeable, affecting to a greater extent the performance of the classes with fewer training instances. In these cases, the use of appropriate features can improve the performance of classes with lower prediction capability, but possibly at the cost of affecting the top-performing classes. It was thus shown that features can bring stability, and therefore robustness, to GCN models operating with small datasets, where the propagation of information is challenging.

Experiments have shown that algorithmic complexity features, in general, reduce the overall performance of GCN methods. However, in scarce-data scenarios where the model cannot propagate information easily, there were cases where they improved the performance of the worst-performing classes. In data-rich experiments, BDM was shown not to affect the performance of the methods significantly, as the GCN can learn the essential properties well enough by itself, and probably discards the contributions of BDM as irrelevant through the non-linear activation. Additionally, the sparse version of the BDM features helped to mitigate their negative effect by denoising the dense vectors.

Nevertheless, it has been shown that some features can affect the performance of the method negatively. As an example, secondary structure protein features added noise to the network, decreasing its prediction capabilities. On the other hand, further work is needed to quantify the direct influence of drug-target affinity features, as their use in this work implied using a reduced network, whose effects could have dominated the results and minimised their influence. The feature selection process is far from trivial, even in the field of deep learning, as it still requires some domain knowledge and intuition.

Decagon needs a complete multimodal graph to predict polypharmacy side effects. The minimum information needed to predict the side effect caused by a new pair of drugs is their targets; otherwise, the information would not propagate through the network. Therefore, it is vital to have a reliable and broad database of DTI to apply this model in practice.
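The dependence on known targets can be illustrated with a single, simplified propagation step. The sketch below is an assumption-laden toy, not Decagon's propagation rule: it uses one edge type, mean aggregation over neighbours and no self-connection, whereas Decagon sums over edge types with its own normalisation constants. The graph, the one-hot features and the function name are hypothetical.

```python
import numpy as np


def neighbour_messages(adj, features):
    """Mean of the neighbours' feature vectors for each node (zero for isolated nodes)."""
    deg = adj.sum(axis=1, keepdims=True)
    safe_deg = np.where(deg > 0, deg, 1.0)  # avoid division by zero
    return (adj @ features) / safe_deg


# Hypothetical toy graph: nodes 0 and 1 are drugs, nodes 2 and 3 are proteins.
# Drug 0 targets both proteins; drug 1 has no known targets.
adj = np.array([
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
], dtype=float)
features = np.eye(4)  # one-hot node features
messages = neighbour_messages(adj, features)
```

In this toy setting the row of `messages` for drug 1 is entirely zero: without at least one drug-target edge, nothing from the protein side of the graph reaches the drug, so its embedding can only depend on its own features.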

Limitations

Decagon has shown several practical limitations. First of all, the version currently published by the authors is implemented using deprecated libraries of Python 2.7, an outdated and no longer supported version of the language. The work done in this thesis proposed a method to overcome these limitations, but the long-lasting solution is to create a version using a more recent version of Python and TensorFlow 2.

However, the most significant limitation is related to the execution time. It was shown that the time required to train the models using the complete datasets is excessive, even using the most advanced equipment. This unexpected result imposes serious obstacles to practical applications of Decagon.

Another implementation limitation relates to the compatibility between the side effects that are being predicted and the ones being used as drug features, i.e., single-drug side effects. The authors explicitly stated that the two sets of side effects were made mutually exclusive, to avoid interference between them. As there are 9702 side effects in the DSE dataset (compared to 964 in the DDI dataset), those are potentially meaningful side effects that are not being predicted, which is a considerable disadvantage of DSE.

In this work, a method for handling edge features was proposed. However, it only deals with 1-dimensional edge features. The GCN architecture can only accept node features, but network datasets frequently contain important multi-dimensional or categorical edge features that could add relevant information to the classification. Therefore, an efficient way of incorporating and extracting edge features could boost performance by providing new information.
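Returning to the execution-time limitation, the way the model size scales can be made concrete with a short parameter count. The sketch below follows the layer sizes stated in Chapter 5 (first layer N_f × 64, second layer 64 × 32, four global 32 × 32 decoder matrices and one diagonal 32 × 32 matrix per side effect). The function names and the example sizes are hypothetical, and it is an assumption of this sketch that each diagonal matrix contributes only its 32 diagonal entries and that transposed edge types are not counted separately.

```python
def encoder_parameters(n_edge_types, n_input_features, hidden=64, embedding=32):
    """One weight matrix per layer per edge type: N_f x hidden and hidden x embedding."""
    per_edge_type = n_input_features * hidden + hidden * embedding
    return n_edge_types * per_edge_type


def decoder_parameters(n_side_effects, embedding=32, n_global=4):
    """Global dense matrices plus one diagonal matrix per side effect."""
    global_params = n_global * embedding * embedding
    diagonal_params = n_side_effects * embedding  # only the diagonal entries are free
    return global_params + diagonal_params


# Hypothetical sizes, chosen only to show how the count behaves.
n_edge_types = 10
n_input_features = 1000
n_side_effects = 7

total = (encoder_parameters(n_edge_types, n_input_features)
         + decoder_parameters(n_side_effects))
```

The encoder term grows with the product of the number of edge types and the number of input features, while the decoder grows only linearly in the number of side effects, which is consistent with the observation that most parameters come from the encoder.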

Future work

The most urgent improvement needed for Decagon is a reduction in training time. The most logical way to tackle this problem is by reducing the number of parameters. One strategy to reduce parameters is already implemented in [38], where restrictions were imposed on the weight matrices so that all of them are linear combinations of a shared set of basis matrices. Another option considered in the same study but not implemented was building the weight matrices as block-diagonal matrices. Although this strategy was conceived to reduce overfitting, it may also reduce the execution time by reducing the number of parameters.

Another way of reducing parameters would be reducing the dimensions of the weight matrices, in other words, reducing the number of neurons per layer and therefore the dimensions of the embeddings. Although this could accelerate the training, it could also lower performance. However, as BDM has been shown to be useful in scarce-data scenarios, it could also help to improve performance by contributing valuable information when the embedding dimension is low. Further experiments are needed to provide evidence for this hypothesis.

Other improvements to the execution time could come from the EMI creation, i.e., the sampling of the dataset. As GPU memory is a bottleneck for performance, choosing the batches so that they hold the minimum number of nodes would reduce the number of batch-embedding vectors. If all the embedding vectors fit in the GPU memory, less communication with external memory is needed, and there would therefore be less computational overhead. Similar work with network datasets has been done in [84], where a clustering technique is used to select edges belonging to a small number of nodes.

Despite taking a long time, the training phase has to be done only once. By exporting the node embeddings and the tensor factorisation matrices in the same way as the performances, the whole model is saved for future use. This simple functionality could be the basis for a practical application of polypharmacy side effect prediction, turning the problem into a simple database retrieval process. The application could perform the tensor factorisation with minimal computational power, and return the probability of a side effect given only two drugs in the dataset.

Another interesting variant would be expanding the training to cell line-specific networks, especially cancer cells, or gene expression networks. These are networks that may change their connections depending on multiple factors. An application of this kind could be the first step towards creating a personalised polypharmacy side effect prediction. Additionally, the option of experimenting further with different types of protein features that could relate to the network is still open. However, the feature choice has to be made based on biomedical expertise.

Finally, the prediction procedure of Decagon is oriented towards finding side effects and completely ignores the discovery of safe drug combinations. A simple modification to Decagon could involve a dataset of safe drug combinations, i.e., a dataset of drug pairs that have no polypharmacy side effects, and include an additional class corresponding to "no side effect". By training with this additional edge type, the functionality of Decagon could expand drastically with few modifications.

Bibliography

[1] Marinka Zitnik, Monica Agrawal, and Jure Leskovec. “Modeling polypharmacy side effects with graph convolutional networks”. In: Bioinformatics 34.13 (2018), pp. i457–i466.
[2] Peng Li et al. “Large-scale exploration and analysis of drug combinations”. In: Bioinformatics 31.12 (2015), pp. 2007–2016. url: https://academic.oup.com/bioinformatics/article/31/12/2007/214330.
[3] Eden L. Romm and Igor F. Tsigelny. “Artificial Intelligence in Drug Treatment”. In: Annual Review of Pharmacology and Toxicology 60.1 (2020). PMID: 31348869, pp. 353–369. doi: 10.1146/annurev-pharmtox-010919-023746. eprint: https://doi.org/10.1146/annurev-pharmtox-010919-023746. url: https://doi.org/10.1146/annurev-pharmtox-010919-023746.
[4] Kristina Preuer et al. “DeepSynergy: predicting anti-cancer drug synergy with Deep Learning”. In: Bioinformatics 34.9 (2018), pp. 1538–1546.
[5] Guocai Chen et al. “Predict effective drug combination by deep belief network and ontology fingerprints”. In: Journal of biomedical informatics 85 (2018), pp. 149–154.
[6] Pavel Sidorov et al. “Predicting synergism of cancer drug combinations using NCI-ALMANAC data”. In: Frontiers in chemistry 7 (2019), p. 509.
[7] Xing Chen et al. “NLLSS: predicting synergistic drug combinations based on semi-supervised learning”. In: PLoS computational biology 12.7 (2016), e1004975.


[8] Joseph D. Janizek, Safiye Celik, and Su-In Lee. “Explainable machine learning prediction of synergistic drug combinations for precision cancer medicine”. In: bioRxiv (2018). doi: 10.1101/331769. eprint: https://www.biorxiv.org/content/early/2018/05/27/331769.full.pdf. url: https://www.biorxiv.org/content/early/2018/05/27/331769.
[9] Jennifer O’Neil et al. “An Unbiased Oncology Compound Screen to Identify Novel Combination Strategies”. In: Molecular Cancer Therapeutics 15.6 (2016), pp. 1155–1162. issn: 1535-7163. doi: 10.1158/1535-7163.MCT-15-0843. eprint: https://mct.aacrjournals.org/content/15/6/1155.full.pdf. url: https://mct.aacrjournals.org/content/15/6/1155.
[10] Xiangyi Li et al. “Prediction of synergistic anti-cancer drug combinations based on drug target network and drug induced gene expression profiles”. In: Artificial intelligence in medicine 83 (2017), pp. 35–43.
[11] Lei Huang et al. “DrugComboRanker: drug combination discovery based on target network analysis”. In: Bioinformatics 30.12 (2014), pp. i228–i236.
[12] Yi Sun et al. “Combining genomic and network characteristics for extended capability in predicting synergistic drugs for cancer”. In: Nature communications 6.1 (2015), pp. 1–10.
[13] Peiran Jiang et al. “Deep graph embedding for prioritizing synergistic anticancer drug combinations”. In: Computational and Structural Biotechnology Journal 18 (2020), pp. 427–438.
[14] Ahmet Sureyya Rifaioglu et al. “Recent applications of deep learning and machine intelligence on in silico drug discovery: methods, tools and databases”. In: Briefings in bioinformatics 20.5 (2019), pp. 1878–1912.
[15] Nicholas P Tatonetti et al. “Data-driven prediction of drug effects and interactions”. In: Science translational medicine 4.125 (2012), 125ra31–125ra31.
[16] Jae Yong Ryu, Hyun Uk Kim, and Sang Yup Lee. “Deep learning improves prediction of drug–drug and drug–food interactions”. In: Proceedings of the National Academy of Sciences 115.18 (2018), E4304–E4311.
[17] Susmitha Shankar et al. “Predicting Adverse Drug Reactions of Two-drug Combinations using Structural and Transcriptomic Drug Representations to Train an Artificial Neural Network”. In: BioRxiv (2020).

[18] Elizabeth D Kantor et al. “Trends in prescription drug use among adults in the United States from 1999-2012”. In: Jama 314.17 (2015), pp. 1818–1830.
[19] Jonathan B Fitzgerald et al. “Systems biology and combination therapy in the quest for clinical efficacy”. In: Nature chemical biology 2.9 (2006), pp. 458–466.
[20] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: Nature 521.7553 (2015), pp. 436–444. url: https://www.nature.com/articles/nature14539#citeas.
[21] Aurélien Géron. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O’Reilly Media, 2019.
[22] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Vol. 1. 10. Springer series in statistics New York, 2008.
[23] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016. url: http://www.deeplearningbook.org/.
[24] Warren S McCulloch and Walter Pitts. “A logical calculus of the ideas immanent in nervous activity”. In: The bulletin of mathematical biophysics 5.4 (1943), pp. 115–133.
[25] Prateek Jain and Purushottam Kar. “Non-convex optimization for machine learning”. In: arXiv preprint arXiv:1712.07897 (2017).
[26] Nikhil Buduma and Nicholas Locascio. Fundamentals of deep learning: Designing next-generation machine intelligence algorithms. O’Reilly Media, Inc., 2017.
[27] Xavier Glorot and Yoshua Bengio. “Understanding the difficulty of training deep feedforward neural networks”. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010, pp. 249–256.
[28] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014).
[29] Foster Provost and Pedro Domingos. “Well-trained PETs: Improving probability estimation trees”. In: Technical report IS-00-04, Stern School of Business, New York University (2000).

[30] Takaya Saito and Marc Rehmsmeier. “The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets”. In: PloS one 10.3 (2015), e0118432.
[31] scikit-learn. Precision-Recall. https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html. Accessed: 2020-12-19.
[32] David H Hubel and Torsten N Wiesel. “Receptive fields of single neurones in the cat’s striate cortex”. In: The Journal of physiology 148.3 (1959), p. 574.
[33] Thomas N Kipf and Max Welling. “Semi-supervised classification with graph convolutional networks”. In: arXiv preprint arXiv:1609.02907 (2016).
[34] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. “Convolutional neural networks on graphs with fast localized spectral filtering”. In: Advances in neural information processing systems. 2016, pp. 3844–3852. url: https://proceedings.neurips.cc/paper/2016/hash/04df4d434d481c5bb723be1b6df1ee65-Abstract.html.
[35] David K Duvenaud et al. “Convolutional networks on graphs for learning molecular fingerprints”. In: Advances in neural information processing systems. 2015, pp. 2224–2232. url: https://proceedings.neurips.cc/paper/2015/hash/f9be311e65d81a9ad8150a60844bb94c-Abstract.html.
[36] Georgios A Pavlopoulos et al. “Using graph theory to analyze biological networks”. In: BioData mining 4.1 (2011), p. 10.
[37] Joan Bruna et al. “Spectral networks and locally connected networks on graphs”. In: arXiv preprint arXiv:1312.6203 (2013). url: https://arxiv.org/abs/1312.6203.
[38] Michael Schlichtkrull et al. “Modeling relational data with graph convolutional networks”. In: European Semantic Web Conference. Springer. 2018, pp. 593–607. url: https://link.springer.com/chapter/10.1007/978-3-319-93417-4_38.
[39] Bishan Yang et al. “Embedding entities and relations for learning and inference in knowledge bases”. In: arXiv preprint arXiv:1412.6575 (2014).

[40] Hector Zenil et al. “A decomposition method for global evaluation of Shannon entropy and local estimations of algorithmic complexity”. In: Entropy 20.8 (2018), p. 605.
[41] Fernando Soler-Toscano et al. “Calculating Kolmogorov complexity from the output frequency distributions of small Turing machines”. In: PloS one 9.5 (2014), e96223.
[42] Paris DL Flood, Ramon Viñas, and Pietro Liò. “Kolmogorov Regularization for Link Prediction”. In: arXiv preprint arXiv:2006.04258 (2020). url: https://arxiv.org/abs/2006.04258.
[43] Santiago Hernández-Orozco et al. “Algorithmic Probability-guided Supervised Machine Learning on Non-differentiable Spaces”. In: arXiv preprint arXiv:1910.02758 (2019).
[44] Xiangxiang Zeng et al. “deepDR: a network-based deep learning approach to in silico drug repositioning”. In: Bioinformatics 35.24 (May 2019), pp. 5191–5198. issn: 1367-4803. doi: 10.1093/bioinformatics/btz418. eprint: https://academic.oup.com/bioinformatics/article-pdf/35/24/5191/31797781/btz418.pdf. url: https://doi.org/10.1093/bioinformatics/btz418.
[45] Alexander Aliper et al. “Deep learning applications for predicting pharmacological properties of drugs and drug repurposing using transcriptomic data”. In: Molecular pharmaceutics 13.7 (2016), pp. 2524–2530. url: https://pubs.acs.org/doi/abs/10.1021/acs.molpharmaceut.6b00248.
[46] Limeng Pu et al. “eToxPred: a machine learning-based approach to estimate the toxicity of drug candidates”. In: BMC Pharmacology and Toxicology 20.1 (2019), p. 2.
[47] Yunyi Wu and Guanyu Wang. “Machine learning based toxicity prediction: from chemical structural description to transcriptome analysis”. In: International journal of molecular sciences 19.8 (2018), p. 2358.
[48] Chun Yen Lee and Yi-Ping Phoebe Chen. “Prediction of drug adverse events using deep learning in pharmaceutical discovery”. In: Briefings in Bioinformatics (2020). url: https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbaa040/5826453.

[49] Takako Takeda et al. “Predicting drug–drug interactions through drug structural similarities and interaction networks incorporating pharmacokinetics and pharmacodynamics knowledge”. In: Journal of cheminformatics 9.1 (2017), pp. 1–9.
[50] Assaf Gottlieb et al. “INDI: a computational framework for inferring drug interactions and their associated recommendations”. In: Molecular systems biology 8.1 (2012), p. 592.
[51] Jian-Yu Shi et al. “Predicting combinative drug pairs towards realistic screening via integrating heterogeneous features”. In: BMC bioinformatics 18.12 (2017), pp. 1–9.
[52] Karen A Ryall and Aik Choon Tan. “Systems biology approaches for advancing the discovery of effective drug combinations”. In: Journal of cheminformatics 7.1 (2015), pp. 1–15.
[53] Ralph G Zinner et al. “Algorithmic guided screening of drug combinations of arbitrary size for activity against cancer cells”. In: Molecular cancer therapeutics 8.3 (2009), pp. 521–532.
[54] Ping Zhang et al. “Label propagation prediction of drug-drug interactions based on clinical side effects”. In: Scientific reports 5.1 (2015), pp. 1–10.
[55] Michael Kuhn et al. “The SIDER database of drugs and side effects”. In: Nucleic acids research 44.D1 (2016), pp. D1075–D1079. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702794/.
[56] US Food and Drug Administration. FDA Adverse Event Reporting System (FAERS). http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/default.htm. Accessed: 2020-12-19.
[57] Sunghwan Kim et al. “PubChem substance and compound databases”. In: Nucleic acids research 44.D1 (2016), pp. D1202–D1213.
[58] Di Chen et al. “Synergy evaluation by a pathway–pathway interaction network: a new way to predict drug combination”. In: Molecular BioSystems 12.2 (2016), pp. 614–623. url: https://pubs.rsc.org/en/content/articlehtml/2016/mb/c5mb00599j.
[59] Reza Ferdousi, Reza Safdari, and Yadollah Omidi. “Computational prediction of drug-drug interactions based on drugs functional similarities”. In: Journal of biomedical informatics 70 (2017), pp. 54–64.

[60] Hui Huang et al. “Systematic prediction of drug combinations based on clinical side-effects”. In: Scientific reports 4 (2014), p. 7160.
[61] Susan L Holbeck et al. “The National Cancer Institute ALMANAC: a comprehensive screening resource for the detection of anticancer drug pairs with enhanced therapeutic activity”. In: Cancer research 77.13 (2017), pp. 3564–3576.
[62] Yi Zheng et al. “Predicting adverse drug reactions of combined medication from heterogeneous pharmacologic databases”. In: BMC bioinformatics 19.19 (2018), pp. 49–59.
[63] Feixiong Cheng and Zhongming Zhao. “Machine learning-based prediction of drug–drug interactions by integrating drug phenotypic, therapeutic, chemical, and genomic properties”. In: Journal of the American Medical Informatics Association 21.e2 (2014), e278–e286.
[64] David Weininger. “SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules”. In: Journal of chemical information and computer sciences 28.1 (1988), pp. 31–36.
[65] Tianyu Zhang et al. “Synergistic drug combination prediction by integrating multi-omics data in deep learning models”. In: arXiv preprint arXiv:1811.07054 (2018).
[66] Evangelos E Papalexakis, Christos Faloutsos, and Nicholas D Sidiropoulos. “Tensors for data mining and data fusion: Models, applications, and scalable algorithms”. In: ACM Transactions on Intelligent Systems and Technology (TIST) 8.2 (2016), pp. 1–44.
[67] Marinka Zitnik. Decagon: Representation Learning on Multimodal Graphs. https://github.com/mims-harvard/decagon. 2018.
[68] Marinka Zitnik, Monica Agrawal, and Jure Leskovec. Graph Neural Networks for Multirelational Link Prediction. http://snap.stanford.edu/decagon/. Accessed: 2020-12-10.
[69] Damian Szklarczyk et al. “The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible”. In: Nucleic Acids Research 45.D1 (Oct. 2016), pp. D362–D368. issn: 0305-1048. eprint: https://academic.oup.com/nar/article-pdf/45/D1/D362/8847225/gkw937.pdf. url: https://doi.org/10.1093/nar/gkw937.

[70] Andrew Chatr-aryamontri et al. “The BioGRID interaction database: 2015 update”. In: Nucleic Acids Research 43.D1 (Nov. 2014), pp. D470–D478. issn: 0305-1048. doi: 10.1093/nar/gku1204. eprint: https://academic.oup.com/nar/article-pdf/43/D1/D470/7329292/gku1204.pdf. url: https://doi.org/10.1093/nar/gku1204.
[71] Thomas Rolland et al. “A Proteome-Scale Map of the Human Interactome Network”. In: Cell 159.5 (2014), pp. 1212–1226. issn: 0092-8674. doi: https://doi.org/10.1016/j.cell.2014.10.050. url: http://www.sciencedirect.com/science/article/pii/S0092867414014226.
[72] Jörg Menche et al. “Uncovering disease-disease relationships through the incomplete interactome”. In: Science 347.6224 (2015). issn: 0036-8075. doi: 10.1126/science.1257601. eprint: https://science.sciencemag.org/content/347/6224/1257601.full.pdf. url: https://science.sciencemag.org/content/347/6224/1257601.
[73] Damian Szklarczyk et al. “STITCH 5: augmenting protein–chemical interaction networks with tissue and affinity data”. In: Nucleic Acids Research 44.D1 (Nov. 2015), pp. D380–D384. issn: 0305-1048. doi: 10.1093/nar/gkv1277. eprint: https://academic.oup.com/nar/article-pdf/44/D1/D380/9483631/gkv1277.pdf. url: https://doi.org/10.1093/nar/gkv1277.
[74] Eric W Sayers et al. “Database resources of the national center for biotechnology information”. In: Nucleic acids research 47.Database issue (2019), p. D23.
[75] General issues. https://github.com/mims-harvard/decagon/issues/3. Accessed: 2020-12-10.
[76] UniProt Consortium. “UniProt: a worldwide hub of protein knowledge”. In: Nucleic acids research 47.D1 (2019), pp. D506–D515.
[77] Schrödinger, Inc. Schrödinger Software. https://www.schrodinger.com/. Accessed: 2020-12-17.
[78] TensorFlow. Fixed unigram Candidate Sampler. https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/random/fixed_unigram_candidate_sampler. Accessed: 2020-12-19.

[79] Hector Zenil et al. “Causal deconvolution by algorithmic generative models”. In: Nature Machine Intelligence 1.1 (2019), pp. 58–66. [80] Szymon Talaga and Kostas Tsampourakis. PyBDM: Python interface to the Block Decomposition Method. https : / / github . com / sztal/pybdm. 2019. [81] Docker Inc. Python - Docker official images. https://hub.docker. com/_/python. Accessed: 2020-06-19. [82] Data leakage issue. https://github.com/mims-harvard/ decagon/issues/9. Accessed: 2020-12-10. [83] Fatai Anifowose, Amar Khoukhi, and Abdulazeez Abdulraheem. “In- vestigating the effect of training–testing data stratification on the perfor- mance of soft computing techniques: an experimental study”. In: Jour- nal of Experimental & Theoretical Artificial Intelligence 29.3 (2017), pp. 517–535. doi: 10.1080/0952813X.2016.1198936. eprint: https://doi.org/10.1080/0952813X.2016.1198936. url: https : / / doi . org / 10 . 1080 / 0952813X . 2016 . 1198936. [84] Wei-Lin Chiang et al. “Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks”. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019, pp. 257–266. url: https://dl.acm.org/ doi/abs/10.1145/3292500.3330925. Appendix A

Additional figures

This appendix shows more details of the training of the different experiments.


Figure A.1: Mean AUPRC performance for the six worst side effects and for the protein edge types in the same experiments. Each row represents a different replica of the experiment. The mean validation performance over the epochs is shown in blue, with its standard deviation; the test performance at the end of training is shown in green. The scales for the two edge type groups were kept the same to facilitate comparison between replicas.

Figure A.2: Mean AUPRC performance for the six worst side effects and for the protein edge types in the affinity experiments. Each row represents a different replica of the experiment. The mean validation performance over the epochs is shown in blue, with its standard deviation; the test performance at the end of training is shown in green. The scales for the two edge type groups were kept the same to facilitate comparison between replicas.

[Plot for Figure A.3, titled "Performance measures for final simulation". Panel rows: BASE, DSE, w2. Panel columns: Proteins, SE. Each panel plots AUPRC against Epochs (0 to 50). Legend: training, validation, test. Annotated AUPRC values per row: BASE 0.752 and 0.957; DSE 0.846 and 0.985; w2 0.846 and 0.974.]

Figure A.3: Mean performances for the replicas of the final experiment. The rows represent the three replicas used in the experiment, and the columns separate the protein edge types from the side effects. In all the plots, the blue dotted lines represent the validation performance over the epochs, the straight green dotted lines the test performance after 50 epochs, and the purple dotted line the performance reported by the authors. Additionally, in the two right columns, the red line shows the training performance over the epochs.
