Polypharmacy Side Effect Prediction with Graph Convolutional Neural Network Based on Heterogeneous Structural and Biological Data
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

JUAN SEBASTIAN DIAZ BOADA

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES

Degree Projects in Scientific Computing (30 ECTS credits)
Master’s Programme in Computer Simulations for Science and Engineering
KTH Royal Institute of Technology, year 2020
Supervisor at KI Algorithmic Dynamics Lab, Center for Molecular Medicine: Narsis A. Kiani
Supervisor at KTH: Michael Hanke
Examiner at KTH: Michael Hanke

TRITA-SCI-GRU 2020:390
MAT-E 2020:097

Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci

Acknowledgements

This thesis and its experiments were performed in the Algorithmic Dynamics Lab of the Center for Molecular Medicine. Special thanks to Amir Amanzadi for creating the affinity score dataset, Jesper Tegnér for his comments analyzing results and Linus Johnson for his help with the Swedish translation.

Abstract

The prediction of polypharmacy side effects is crucial to reduce the mortality and morbidity of patients suffering from complex diseases. However, its experimental prediction is unfeasible due to the many possible drug combinations, leaving in silico tools as the most promising way of addressing this problem. This thesis improves the performance and robustness of a state-of-the-art graph convolutional network designed to predict polypharmacy side effects, by feeding it with complexity properties of the drug-protein network. The modifications also involve the creation of a direct pipeline to reproduce the results and test it with different datasets.
Summary (Sammanfattning)

Prediction of polypharmacy side effects with graph convolutional neural networks based on heterogeneous structural and biological data

To reduce mortality and morbidity in patients suffering from complex diseases, it is crucial to be able to predict side effects from polypharmacy. Predicting these side effects experimentally is, however, unfeasible because of the large number of possible drug combinations, which leaves in silico tools as the most promising way to solve this problem. This work improves the performance and robustness of one of the latest graph convolutional networks designed to predict polypharmacy side effects, by feeding it with the complexity properties of the drug-protein network. The changes also involve the creation of a direct pipeline to reproduce the results and test it with different datasets.

Contents

Acknowledgements
1 Introduction
  1.1 Statement of the problem
  1.2 Thesis Objective
  1.3 Outline of Thesis
2 Theoretical Framework
  2.1 Supervised Learning
    2.1.1 Linear Models
    2.1.2 Tree-based Methods
    2.1.3 Support Vector Machines
    2.1.4 Bayesian Methods
  2.2 Deep Learning
    2.2.1 Feedforward Neural Networks
    2.2.2 Training Feed-forward Neural Networks
    2.2.3 Convolutional Neural Networks
    2.2.4 Graph Convolutional Networks
  2.3 Algorithmic Complexity
3 Related Work and State of the Art
  3.1 Traditional synergy calculations
  3.2 General Methods
  3.3 Trainable Methods
    3.3.1 Linear Regression Methods
    3.3.2 Tree-based methods
    3.3.3 Other Machine Learning Approaches
  3.4 Deep Learning Methods
    3.4.1 Standard Deep Learning Methods
    3.4.2 Decagon
    3.4.3 Decagon-based methods
4 Materials and Methods
  4.1 Datasets
  4.2 Original Implementation of Decagon
    4.2.1 Data structure organisation
  4.3 Contributions and improvements to Decagon
    4.3.1 Data Treatment and Preparation
    4.3.2 Implementation of Algorithmic Complexity Features
    4.3.3 Containers and GPU Configuration
    4.3.4 Minibatch sampling and the data leakage problem
    4.3.5 Incorporation of edge features
    4.3.6 Other improvements
    4.3.7 Overall Pipeline
5 Results and Discussion
  5.1 First experiments: Feature selection
  5.2 Node features as possible method stabilisers
  5.3 Experiments with side effects with the lowest performance
  5.4 Extension to experiments with a full graph
6 Conclusions
Bibliography
A Additional figures

Acronyms and Abbreviations

ADR: adverse drug reaction.
AI: artificial intelligence.
ANN: artificial neural network.
AP: algorithmic probability.
AP@K: average precision at k.
API: application programming interface.
ATC: Anatomical Therapeutic Chemical.
AUPRC: area under the precision-recall curve.
AUROC: area under the receiver operating characteristic curve.
BDM: block decomposition method, or KC features calculated with the block decomposition method.
CNN: convolutional neural network.
CPU: central processing unit.
CSR: compressed sparse row.
CTM: coding theorem method.
DDI: drug-drug interaction.
DL: deep learning.
DSE: single drug side effects.
DTI: drug-target interaction.
EMI: Edge Minibatch Iterator.
FN: false negatives.
FP: false positives.
GCN: graph convolutional network.
GPU: graphics processing unit.
KC: Kolmogorov complexity.
MedDRA: Medical Dictionary for Regulatory Activities.
ML: machine learning.
MSE: mean squared error.
PPI: protein-protein interaction.
ReLU: rectified linear unit.
RF: random forests.
SGD: stochastic gradient descent.
SVMs: support vector machines.
TN: true negatives.
TP: true positives.
UTM: universal Turing machine.
w2: simulations including DSE and BDM features.

Chapter 1
Introduction

1.1 Statement of the problem

In many complex diseases, single-drug therapies fall short of helping patients recover. This lower performance occurs because complex diseases such as cancer or AIDS involve processes controlled by multiple biochemical mechanisms, which give redundancy to their functioning [1–3]. Usually, of all the targets that a drug may have, only a few are known, and these give insight into which diseases the drug can treat. Single-drug therapies target only a limited number of pathways in the pathogenesis of a disease, which sometimes leads to an incomplete treatment and, therefore, perpetuates the disease.

As a result, new procedures have shifted towards multi-drug therapies, which have proven to boost the efficacy of cancer, AIDS and fungal infection treatments over single-drug therapies [4–7]. The potentiated polypharmacy effect comes from drug synergy, which occurs when multiple drugs act on the same disease by simultaneously targeting different pathophysiological pathways [1, 5]. Due to this effect, the single-drug doses can be reduced [8, 9], which contributes to reducing the individual toxicity of the drugs [2, 4–6, 10] and can even reduce the drug resistance of the disease [6, 8, 11, 12].
Nevertheless, polypharmacy is associated with a much higher risk of adverse drug reactions (ADRs) due to drug-drug interactions [1, 4, 13]. Single drugs may modulate the activity of various untracked proteins in what are known as off-target interactions, which are challenging to trace [1, 14]. Multiple interactions of this kind could give rise to unexpected polypharmacy ADRs. These interactions usually go unnoticed in clinical trials, due to the limited time spent testing drug combinations [15], and are most often discovered once the drug is already on the market [14]. Because the mechanistic understanding of drug-drug interactions is limited, it is difficult to predict these side effects [2]. Furthermore, polypharmacy therapies are becoming more common [16], which makes this a growing problem and the cause of a significant fraction of patient hospitalisations due to unexpected ADRs [17, 18].

Until recently, the prediction of polypharmacy ADRs was mainly based on clinical experience [2, 4] and medical expertise [8]. Some classical quantitative methods to predict the effect of drug combinations, such as the Loewe model (see section 3.1), were also used but failed to fully explain non-linear interactions such as synergy [4, 5, 19]. Clinical experiments can provide answers for a few combinations, but they are time-consuming and expensive [2, 5]. In vitro approaches like high-throughput screening can lead to cheaper procedures, but the vast combinatorial space of drugs makes it unfeasible to test all drug combinations [2, 4].

Therefore, preclinical testing needs further development to make the procedure more sustainable and efficient. In silico approaches come in handy to solve these problems. These are computational methods that simulate the effect of drug combinations rapidly and with low resource investment.
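
The combinatorial argument above can be made concrete with a short calculation. The sketch below counts the unordered combinations of k drugs out of a catalogue of n drugs using the binomial coefficient. The catalogue size of 1000 and the function name `count_combinations` are assumptions chosen only for illustration; they are not figures or code from this thesis.

```python
from math import comb


def count_combinations(n_drugs: int, k: int = 2) -> int:
    """Return the number of unordered combinations of k drugs out of n_drugs."""
    return comb(n_drugs, k)


# Hypothetical catalogue size, used only to illustrate the growth rate.
n_drugs = 1000

pairs = count_combinations(n_drugs, 2)    # two-drug combinations
triples = count_combinations(n_drugs, 3)  # three-drug combinations

print(f"{n_drugs} drugs: {pairs} pairs, {triples} triples")
```

Under this assumed catalogue size, the number of candidate combinations grows from hundreds of thousands of pairs to hundreds of millions of triples, and each combination could in turn be associated with many side effect types.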