Machine Learning Algorithms for Prediction of Biological
Total Page:16
File Type:pdf, Size:1020Kb
MACHINE LEARNING ALGORITHMS FOR PREDICTION OF BIOLOGICAL ACTIVITY AND CHEMICAL PROPERTIES By Ralf Mueller Dissertation Submitted to the Faculty of the Graduate School of Vanderbilt University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY In Chemistry August, 2011 Nashville, Tennessee Approved: Professor Jens Meiler Professor Brian Bachmann Professor Prasad Polavarapu Professor Dave Weaver DEDICATION To my wife, my parents, and my grandfather. ii ACKNOWLEDGMENTS First I would like to acknowledge my advisor Jens Meiler, Ph.D. for giving me the opportunity to work on these projects, valuable input, and great discussions. Furthermore I would like to acknowledge my committee for keeping me on track and my co-workers Kristian Kaufmann, Eric Dawson, and Nils Woetzel for fruitful discussions. I thank the Conn lab in collaboration with Craig Lindsley and Dave Weaver for providing me the experimental data for the virtual high-throughput screening experiments. Financial support was granted by the National Institute of Health and the Vanderbilt Institute for Chemical Biology. Lastly, I would like to thank my wife and my family for their unwavering support over four continents. iii TABLE OF CONTENTS Page DEDICATION .............................................................................................................................. ii ACKNOWLEDGEMENTS ......................................................................................................... iii TABLE OF CONTENTS ............................................................................................................ iv LIST OF TABLES ..................................................................................................................... xiii LIST OF FIGURES ................................................................................................................... xiv LIST OF ABBREVIATIONS.................................................................................................... xvi SUMMARY............................................................................................................................. xviii Chapter I. INTRODUCTION .............................................................................................................. 1 Quantitative Structure Activity Relationships .............................................................. 1 Quantitative Structure Property Relationships ............................................................. 1 QSAR/QSPR through machine learning ...................................................................... 1 QSAR/QSPR general workflow ................................................................................... 2 Identifying publicly and commercially available carbon chemical shift data bases .... 2 Local high-throughput screens provided small molecule activity data towards mGlu4/5 ........................................................................................................................ 2 Commercial ADRIANA descriptors were developed into atom-based descriptors ..... 3 Small molecule conformations were computed with CORINA ................................... 3 iv Constitutional chemical shift descriptors based on atom/bond types and ring closure 3 Strong synergy between QSAR and QSPR projects .................................................... 4 Neural Networks as robust tools for QSAR/QSPR ...................................................... 4 Allosteric modulation of metabotropic glutamate receptors ........................................ 5 High-throughput screens in drug discovery ................................................................. 5 Virtual high-throughput screens ................................................................................... 6 Positive allosteric modulation of metabotropic glutamate receptor subtype 5 ............. 7 Positive allosteric modulation of metabotropic glutamate receptor subtype 4 ............. 8 Negative allosteric modulation of metabotropic glutamate receptor subtype 5 ........... 9 Prediction of carbon chemical shift ............................................................................ 10 II. IDENTIFICATION OF METABOTROPIC GLUTAMATE RECEPTOR SUBTYPE 5 POTENTIATORS USING VIRTUAL HIGH-THROUGHPUT SCREENING .............. 13 Introduction ................................................................................................................ 13 Activators of mGlu5 may provide a novel approach to treatment of schizophrenia ............................................................................................................................. 14 High-throughput screening in drug discovery ..................................................... 15 Quantitative structure activity relations in drug discovery .................................. 15 Numerical descriptors of chemical structure for QSARs..................................... 16 Fragment-independent transformation-invariant descriptor schemes .................. 17 Application of machine learning algorithms to establish QSARs ....................... 18 Quantitative structure activity relation models for mGlu5 positive allosteric modulation ........................................................................................................... 18 v Results and Discussion ............................................................................................... 19 Discussion of concentration response curves in the experimental high-throughput screen ................................................................................................................... 19 Input sensitivity is a reliable measure to prioritize descriptors ............................ 20 Optimization of molecular descriptor set improves prediction accuracy of the ANN model .......................................................................................................... 21 Balancing the datasets through oversampling yields better results than two undersampling strategies ...................................................................................... 26 Radial distribution functions and electronegativity contribute most to an accurate prediction ............................................................................................................. 28 Virtual screening of ChemBridge compound library ........................................... 30 Analysis of the newly identified set of mGlu5 potentiators ................................. 31 Major scaffolds are evenly distributed throughout training, monitoring, and independent datasets ............................................................................................ 31 The majority of hit compounds share a scaffold with previously identified potentiator compounds ......................................................................................... 32 Benzamides, benzoxazepines, and MPEP-like compounds are enriched among active compounds in the post-screen ................................................................... 32 Inactive compounds in the post-screen library contain 53 % benzamides, benzoxazepines, and MPEP-like compounds ...................................................... 33 Significant numbers of hit compounds are non-trivial modifications of original HTS screen hits .................................................................................................... 34 vi High potency cutoff may introduce bias to close derivatives of original HTS screen hit compounds ........................................................................................... 34 Fragment-independent numerical description deals efficiently with multiple scaffolds ............................................................................................................... 35 Conclusions ................................................................................................................ 35 Methods ...................................................................................................................... 36 Experimental high-throughput screen for mGlu5 potentiators and hit validation 36 Generation of numerical descriptors for training of QSAR models .................... 39 Oversampling was used for balanced training ..................................................... 40 A monitoring dataset was introduced to early terminate ANN training .............. 41 Artificial neural network (ANN) architecture and training ................................. 42 Selection of the optimal set of descriptors of chemical structure ........................ 43 Enrichment and area under the curve as quality measures ....................... 45 Implementation .................................................................................................... 47 III. VIRTUAL HIGH-THROUGHPUT SCREENING AS A ROBUST TOOL TO IDENTIFY METABOTROPIC GLUTAMATE RECEPTOR SUBTYPE 4 POTENTIATORS ............................................................................................................. 48 Introduction ................................................................................................................ 48 Results and Discussion ............................................................................................... 50 Descriptor categories were selected according to input sensitivity ..................... 50 Optimization of molecular descriptor set improves prediction results ................ 51 Jury