Modeling Quantitative Structure-Activity Relationships for G-Protein Coupled Receptor Ligands Project Report
MVE385: Project course in mathematical and statistical modelling

Modeling Quantitative Structure-Activity Relationships for G-protein Coupled Receptor ligands

Project report

Students: Adrià Amell Tosas, Richard Martin, Sebastian Oleszko

Partners: Peder Svensson, Mattias Sundén, Fredrik Wallner, Erik Lorentzen

2020-12-18

Background

IRLAB Therapeutics is a biotech company engaged in the discovery and development of novel pharmaceuticals to treat disorders of the brain, currently focusing on Parkinson's disease. The research is based on a so-called phenotypic screening approach, which means that the effects of new chemical compounds are evaluated on a system level, to ensure that both direct and indirect effects on different neurotransmitters and brain pathways are captured. Evaluation of potential target receptor interactions is performed in silico as well as in different in vitro systems. Quantitative structure-activity relationships (QSAR) are a key component in the design of new compounds and in understanding how different structural elements of the compounds affect biological effects. Our current drug discovery projects focus mainly on G-protein coupled receptors (GPCRs), the main class of molecular targets for CNS pharmaceuticals.

Project description

In this project, students will access large datasets covering chemical descriptors of a series of molecules, combined with data on receptor interactions. The focus will be on GPCRs of the monoamine family, but other target proteins will also be included. Chemical descriptors, including e.g. physico-chemical property estimates, shape, lipophilicity, ionization states, molecular graphs and fingerprints, are obtained from public databases and generated in-house at IRLAB.
The task is to find statistical QSAR models that describe how biological activity, in this case receptor affinities, relates to chemical properties of the ligand molecule. Such models can be used to guide the design of novel compounds. Linear, principal component-based models as well as non-linear methods will be investigated. Key points to consider in the project will be:

- Properties of the chemical descriptor space
- Data distribution: are transforms needed?
- Different model types: linear, e.g. PLS, MR; non-linear, e.g. SVM, neural networks, random forest
- Choice of dependent variables for the models: some relations to the independent variables (chemical descriptors) will be common for several dependent variables (receptor affinities), some are unique to specific Y variables. Depending on the statistical modelling approach, either separate models for each dependent variable of interest or a multiple Y block can be used.
- Diagnostics: how to assess the predictive capability of models

IRLAB will provide chemical descriptors and biological activity data for one or more series of compounds of interest for the pharmacological modulation of GPCRs, focusing on monoamine targets. Smartr will provide supervision regarding statistical models.

1 Abstract

G protein-coupled receptors (GPCRs) are the main class of molecular targets for central nervous system pharmaceuticals. Dopamine receptors are GPCRs that are activated by the neurotransmitter dopamine; they are central players in brain function and are involved in, for example, motor control. Quantitative structure-activity relationship (QSAR) models are theoretical models that relate a quantitative measure of chemical structure to a physical property or a biological activity. They are key in the design of new drugs and in understanding how different structural elements of the compounds affect biological activity.
This project presents a number of QSAR models that can help describe relationships between series of organic molecules and the dopamine receptors D2 or D3 by predicting the corresponding inhibitory constant Ki. The molecules of study and Ki values were obtained from ChEMBL, a public chemical database of manually curated bioactive molecules with drug-like properties, while the descriptors were generated computationally. Linear-based models, a genetic algorithm and tree-ensemble methods are motivated and evaluated on the constructed dataset. The tree-ensemble methods performed best, closely followed by the genetic algorithm. Finally, recommendations for any QSAR modelling are discussed.

Contents

1 Abstract
2 Introduction
3 Construction of the data set to model
  3.1 Selection of compounds and inhibitory constants
  3.2 Activity cliffs
4 Modelling
  4.1 Motivation for methods
    4.1.1 Linear-based models
    4.1.2 Selection by a Genetic algorithm
    4.1.3 Ensemble methods
  4.2 Implementation details
    4.2.1 Linear-based models
    4.2.2 Variable selection
    4.2.3 Genetic algorithm
    4.2.4 Ensemble methods
5 Results
  5.1 Variable selection with linear-based models
  5.2 Evaluating the linear-based models
  5.3 Subset and model refinement with Genetic algorithm
  5.4 Ensemble models
6 Discussion
  6.1 Data discussions
  6.2 Applicability domain
  6.3 Modelling conclusions
    6.3.1 Computational limitations
    6.3.2 Feature subsets and importance
    6.3.3 Model interpretability
    6.3.4 Comparison of model candidates
  6.4 Future recommendations
7 Acknowledgements
A Result tables
  A.1 Variable selection chosen descriptors
  A.2 Specifications of the linear-based models
  A.3 Final descriptor sets of the genetic algorithm
B Ensemble Parameter Tuning
C Code
  C.1 Variable selection scripts
  C.2 Linear-based model scripts
  C.3 Genetic algorithm script

2 Introduction

Human cells are constantly communicating with each other and the surrounding environment. This requires a molecular mechanism for transmission of information over the cell plasma membrane. G protein-coupled receptors (GPCRs) are proteins located at the cell plasma membrane that provide this mechanism by transferring signals upon binding of a ligand. This makes them the main class of molecular targets for central nervous system pharmaceuticals. Dopamine receptors are GPCRs. They are activated by the neurotransmitter dopamine, are central players in brain function and are involved in, for example, motor control.

This project covered the construction and modelling of a data set of chemical compounds whose targets are either of the dopamine receptors D2 or D3, also referred to as D2R and D3R, respectively, given that IRLAB Therapeutics, one of the partners, is currently focused on Parkinson's disease. The purpose of the modelling is to predict an activity, in particular the inhibitory constant Ki, for different chemical compounds with dopamine receptors D2 and D3 as targets.
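Because Ki values can span several orders of magnitude, QSAR studies commonly model them on a logarithmic scale rather than directly. The following is a minimal sketch of such a transform, assuming Ki is reported in nanomolar; the function name and the example values are hypothetical illustrations and are not taken from the report's data set or code.

```python
import math


def ki_nm_to_pki(ki_nm: float) -> float:
    """Convert Ki in nanomolar to pKi = -log10(Ki in mol/L).

    Assumption: the input unit is nM, so Ki [mol/L] = ki_nm * 1e-9
    and therefore pKi = 9 - log10(ki_nm).
    """
    if ki_nm <= 0:
        raise ValueError("Ki must be a positive concentration")
    return 9.0 - math.log10(ki_nm)


# Hypothetical Ki values in nM, chosen only to illustrate the scale change.
example_ki_nm = [0.5, 10.0, 250.0, 10000.0]
example_pki = [ki_nm_to_pki(k) for k in example_ki_nm]

for ki, pki in zip(example_ki_nm, example_pki):
    print(f"Ki = {ki:g} nM  ->  pKi = {pki:.2f}")
```

On this scale a higher value means a stronger binder, and a tenfold change in Ki corresponds to a change of one pKi unit, which tends to give a less skewed response variable for regression.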
Such activity prediction is usually carried out as a stage in drug discovery projects, where the properties of the compounds and their relation to receptor interactions are a key component. The kind of compound-receptor models described in this report are called quantitative structure-activity relationship (QSAR) models [1], and are important in the development of new pharmaceuticals.

In this report, Section 3 presents the data source and the process followed to construct the data set of compounds and activities to model. Section 4 motivates and presents the methods employed, which are regression models. They consist of linear models, a genetic algorithm and tree-based ensemble methods. Modelling results are found in Section 5. Finally, a discussion on the construction of the data set, the modelling applicability domain and interpretability, as well as future recommendations, is found in Section 6. Appendix B explains technical details on the implementations and the hyperparameter tuning in the models. All the code and data sets used are provided as supplementary material.

3 Construction of the data set to model

In the initial stage of the project, the data set of chemical compounds was collected, analyzed and processed in order to learn about its properties, to make it suitable for modelling and to help decide on appropriate model choices. This section describes the resources used to obtain the data set on which the modelling is based, as well as the steps performed in compiling and cleaning the data.

3.1 Selection of compounds and inhibitory constants

ChEMBL is a large, open-access, manually curated database of bioactive molecules with drug-like properties, maintained by the European Bioinformatics Institute of the European Molecular Biology Laboratory. The information in the database about small molecules and their biological activity is extracted from medicinal chemistry journals and integrated with data on approved drugs and clinical development