
Locality-Dependent Training and Descriptor Sets for QSAR Modeling

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Bryan Christopher Hobocienski

Graduate Program in Chemical Engineering

The Ohio State University

2020

Dissertation Committee

Dr. James Rathman, Advisor

Dr. Bhavik Bakshi

Dr. Jeffrey Chalmers

Copyrighted by

Bryan Christopher Hobocienski

2020

Abstract

Quantitative Structure-Activity Relationships (QSARs) are empirical or semi-empirical models which correlate the structure of chemical compounds with their biological activities. QSAR analysis frequently finds application in drug development and environmental and human health protection. It is here that these models are employed to predict pharmacological endpoints for candidate drugs or to assess the toxicological potential of chemical ingredients found in commercial products, respectively.

Fields such as drug design and health regulation share the necessity of managing a plethora of chemicals for which sufficient experimental data on their application-relevant profiles is often lacking; the time and resources required to conduct the necessary in vitro and in vivo tests to properly characterize these compounds make a purely experimental approach impossible. QSAR analysis successfully alleviates the problems posed by these data gaps through interpretation of the wealth of information already contained in existing databases.

This research involves the development of a novel QSAR workflow utilizing a local modeling strategy. By far the most common QSAR models reported in the literature are “global” models; they use all available training molecules and a single set of chemical descriptors to learn the relationship between structure and the endpoint of interest. Additionally, accepted QSAR models frequently use linear transformations such as principal component analysis or partial least squares regression to reduce the dimensionality of complex chemical data sets. In contrast to these conventional approaches, the proposed methodology uses a locality-defining radius to identify a subset of training compounds in proximity to a test query and learns an individual model for that query.

Furthermore, descriptor selection is utilized to isolate the subset of available chemical descriptors tailored specifically to explain the activity of each test compound. Finally, this work adapts a non-linear dimensional reduction technique, t-Distributed Stochastic Neighbor Embedding (t-SNE), for the refinement of global descriptor spaces before local training sets are identified. The resulting ensemble of local models is used to generate predictions for the test set.

The proposed local QSAR workflow is evaluated using two data sets from the literature, one concerning Ames mutagenicity and the other blood-brain barrier permeability. Performance statistics are determined by a 5-fold cross-validation strategy.

Local model ensembles frequently outperform global models, especially for smaller to medium-sized local training sets. Illustrating this point, local model ensembles from the proposed methodology outperform global models by as much as 5% to 10% when the two approaches are compared at the same modeling-space dimension. A sizeable portion of this work concerns implementation of the t-SNE algorithm to resolve the problems associated with identifying training samples neighboring test compounds in high-dimensional spaces. t-SNE-based local model ensembles afford performance competitive with PLS-based local model ensembles; for instance, when the test set coverage is approximately 25%, the accuracy of t-SNE-based local model ensembles is 86.1% whereas that of the PLS-based local model ensembles is 81.8%. When coverage increases to 93%, predicting most of the test molecules, the accuracy of t-SNE-based local model ensembles is 73.8% versus 71.2% for PLS-based local model ensembles. Furthermore, the novel QSAR workflow offers performance comparable to literature-reported QSAR models. On the Ames mutagenicity data set, AUC values derived from the proposed methodology range from 0.79 to 0.81 whereas those from literature models range from 0.79 to 0.86. Likewise, when predicting blood-brain barrier permeability, Matthews correlation coefficients range between 0.321 and 0.645 using popular machine learning methods and between 0.478 and 0.565 from the proposed methodology.

Finally, the proposed local QSAR workflow offers several interpretability-based features. An open criticism of local modeling strategies, due to their fragmented nature, involves the difficulty in recognizing relationships present throughout the entire training set. This problem is addressed by demonstrating how the frequencies and associations of significant descriptors occurring across local models can be extracted. As a concrete example, this analytic approach successfully identifies acryl halides as a structural alert for positive Ames mutagenicity. Additionally, valuable information is provided on the local level such as the descriptor spaces and decision boundaries used for predicting individual query compounds.


Dedication

I dedicate this dissertation to my mother and father as without their love and support I would not be where I am today.


Acknowledgments

I would like to express my utmost gratitude to my advisor, Dr. James Rathman, as I very likely would not have completed or even pursued a doctoral degree without his mentorship.

Dr. Rathman is always available to offer a new perspective on ideas, to provide encouragement during setbacks, or to just lighten the mood with some witty humor. I would also like to thank Dr. Chihae Yang for introducing me to the broader field of cheminformatics and for offering constructive feedback during the earlier years of my graduate studies. I am also thankful to my fellow students, João Ribeiro, Nicholas Wood, and Darshan Mehta, who were at different times a part of our research group. Our shared experience only served to strengthen our individual research endeavors and lessened the overall amount of stress any one of us would have otherwise faced alone.

Finally, I am very thankful to my parents for their unwavering emotional and financial support throughout my many college years. Their encouragement, especially during the most difficult times, made completing this dissertation possible.


Vita

December 2011………………………………Bachelor of Science in Chemical Engineering, The Ohio State University

August 2014………………………………….Master of Science in Chemical Engineering, The Ohio State University

January 2015 - December 2019……………...Graduate Teaching Associate, Department of Chemical Engineering, The Ohio State University

Fields of Study

Major Field: Chemical Engineering


Table of Contents

Abstract ...... ii

Dedication ...... v

Acknowledgments ...... vi

Vita ...... vii

List of Tables...... xii

List of Figures ...... xv

Chapter 1. Introduction ...... 1

1.1 Chemoinformatics ...... 3

1.2 QSAR Modeling ...... 11

1.3 Applications of QSAR Modeling ...... 21

Chapter 2: Background ...... 30

2.1 Global vs. Local QSAR Modeling ...... 30

2.2 Descriptor Selection ...... 37

2.3 Local Descriptor Selection ...... 43

2.4 Dimensional Reduction ...... 45


2.4.1 t-Distributed Stochastic Neighbor Embedding ...... 53

2.5 Learning Algorithms ...... 61

2.6 Research Objectives ...... 69

Chapter 3: Methodology ...... 72

3.1 Proposed Methodology ...... 72

3.1.1 Pre-processing Phase ...... 76

3.1.2 Global Phase ...... 76

3.1.3 Local Phase ...... 79

3.2 Model Optimization ...... 81

3.3 Evaluation Data Sets ...... 83

3.3.1 Ames mutagenicity - Hansen et al., 2009 ...... 83

3.3.2 Blood-Brain Barrier - Muehlbacher et al., 2001 ...... 87

3.4 Computational Tools and Molecular Descriptors ...... 91

Chapter 4: Results and Discussion ...... 95

4.1 In-Depth Analysis of the Ames mutagenicity Data Set ...... 96

4.1.1 The Effect of Radius on the Fraction of Predicted Compounds ...... 96

4.1.2 The Effect of Global and Local Descriptor Space Dimension on Model Performance ...... 104

4.1.3 The Effect of Univariate Descriptor Selection on Model Performance ...... 116


4.1.4 The Effect of Dimensional Reduction Method on Model Performance ...... 119

4.1.5 Prediction Confidence and Smallest Radii Local Model Ensembles ...... 134

4.1.6 Case Study of Select Molecules from the Ames mutagenicity Data Set...... 145

4.2 Performance Comparison between the Proposed Methodology and State-of-the-Art Models from the Literature ...... 163

Chapter 5: Conclusions and Future Work ...... 169

5.1 Conclusions ...... 169

5.2 Future Work...... 174

Bibliography ...... 179

Appendix A: Sample Input and Output Files ...... 194

A.1 Parameter file (input) ...... 194

A.2 Data file (input) ...... 196

A.3 Data types file (input) ...... 197

A.4 Partition set file (input) ...... 199

A.5 Set labels file (input) ...... 200

A.6 Data structures file (input) ...... 201

A.7 Model performance summary file (output) ...... 202

A.8 Model predictions file (output) ...... 207

Appendix B: Chemical Descriptors ...... 208


Appendix C. Python Code ...... 212


List of Tables

Table 1: Bitstring representations of the Figure 3 molecules using five MACCS keys and pairwise Tanimoto similarities...... 9

Table 2: Example QSAR molecular descriptors by category...... 14

Table 3: Techniques used in a QSAR workflow with examples...... 16

Table 4: Generic confusion matrix for a binary classification problem...... 18

Table 5: The most frequent QSAR modeling methods in years 2009 and 2014...... 62

Table 6: Source distribution of molecules within the Ames mutagenicity data set...... 85

Table 7: Partitioning of the Ames mutagenicity data set for benchmarking...... 86

Table 8: Source distribution of molecules within the blood-brain barrier data set...... 90

Table 9: PLS, GS(ON), LS(ON), G(2), L(2) global and local performance at the maximum of the applicability domain...... 108

Table 10: Performance of PLS, GS(ON), LS(ON), G(8), L(2), local models predicting 60% of the test molecules...... 109

Table 11: Performance of PLS, G(8), L(2), local models predicting greater than 90% of the test set under various univariate selection configurations...... 118

Table 12: t-SNE, GS(ON), LS(ON), G(2), L(2) select global and local model ensemble performance statistics...... 126


Table 13: Comparison of select PLS- and t-SNE-derived local QSAR model ensembles...... 129

Table 14: t-SNE, GS(ON), LS(ON), G(2), L(2) ...... 132

Table 15: t-SNE, GS(ON), LS(ON), G(2), L(2), radius 0.1 ...... 133

Table 16: PLS, GS(ON), LS(ON), G(8), L(2) Ames mutagenicity test compound predictions vs. local model radii...... 135

Table 17: Number of predictions and Matthews correlation coefficients ...... 140

Table 18: Comparison of select local model ensembles with and without varying radii in 2 and 8 reduced local dimensions...... 145

Table 19: PLS, GS(ON), LS(ON), G(8), L(2), local model predictions of 1-(4-Chlorophenyl)-3,3-dimethyltriazene with increasing radius...... 149

Table 20: PLS, GS(ON), LS(ON), G(8), L(2), local model predictions of ...... 156

Table 21: Identifier, activity, distance, and select descriptor data for physostigmine and its five nearest neighbors from most to least similar...... 158

Table 22: Area under the receiver operating characteristics curve values for literature models and models from the proposed methodology on Ames mutagenicity...... 164

Table 23: Performance statistics on literature models and models from the proposed methodology ...... 166

Table 24: Comparison of local, global-on-local, and global overall models from the proposed methodology...... 168

Table 25: Sample parameter file (input)...... 195

Table 26: Sample data file (input)...... 197


Table 27: Numeric code for descriptor types...... 198

Table 28: Sample data types file (input)...... 198

Table 29: Sample partition set file (input)...... 200

Table 30: Sample labels file (input)...... 201

Table 31: Sample model performance summary file (output)...... 202

Table 32: Sample model predictions file (output)...... 207

Table 33: Descriptors calculated by CORINA Symphony to describe the Ames mutagenicity and blood-brain barrier molecular data sets...... 208


List of Figures

Figure 1: Representations of isonicotinic acid by a) skeletal formula, b) MDL Molfile format, and c) SMILES string...... 5

Figure 2: The Morgan algorithm applied to isonicotinic acid...... 7

Figure 3: An example query with three sample molecules for similarity comparison...... 9

Figure 4: QSAR development workflow...... 20

Figure 5: The drug discovery and development pipeline...... 22

Figure 6: QSAR applied to virtual screening in drug development...... 24

Figure 7: QSAR applied to skin sensitizers in consumer products...... 28

Figure 8: Hypothetical k-NN classifier example...... 33

Figure 9: The basic QSAR approach...... 38

Figure 10: Demonstration of overfitting...... 39

Figure 11: Illustration of the effects of high-dimensionality...... 47

Figure 12: Illustration of principal component analysis...... 48

Figure 13: Illustration of partial least squares regression compared to principal component analysis...... 51

Figure 14: Linear (PCA) and non-linear (LLE, IsoMap) dimensional reduction techniques applied to hypothetical S-shaped data...... 52

Figure 15: MNIST handwritten digits...... 58

Figure 16: Principal component analysis of the MNIST handwritten digits...... 59

Figure 17: t-Distributed stochastic neighbor embedding applied to the MNIST handwritten digits...... 60

Figure 18: Example logistic regression with two descriptor variables...... 65

Figure 19: General QSAR modeling workflow...... 74

Figure 20: Proposed local QSAR methodology...... 81

Figure 21: The Ames test for mutagenicity...... 84

Figure 22: Molecular weight distribution of compounds within the Ames mutagenicity data set...... 86

Figure 23: Illustration of the blood-brain barrier with annotated mechanisms of transport...... 88

Figure 24: Molecule weight (left) and log(BB) (right) distributions of the blood-brain barrier data set...... 90

Figure 25: A screenshot of the ChemoTyper program querying the Ames mutagenicity data set for the nitro chemotype (right). A molecule from the set (left) contains the nitro group and is highlighted accordingly...... 93

Figure 26: A screenshot of the ChemoTyper program querying the Ames mutagenicity data set for an aromatic ether chemotype (right). Several chemicals from the set (left) contain the chemotype and are highlighted...... 94

Figure 27: Model input parameter string...... 95

Figure 28: The effect of radius on coverage in a global descriptor space of low dimension...... 97

Figure 29: Distribution of local training set sizes for a small radius...... 98


Figure 30: Distribution of local training set sizes for a large radius...... 99

Figure 31: The effect of radius on coverage in a global descriptor space of high dimension...... 100

Figure 32: The effect of radius on coverage in a global descriptor space of high dimension without univariate descriptor selection...... 101

Figure 33: The effect of radius on coverage for a t-SNE embedded global descriptor space...... 102

Figure 34: A 2-dimensional, t-SNE embedded global descriptor space...... 103

Figure 35: PLS, GS(ON), LS(ON), G(2), L(2) sensitivity, specificity, and accuracy .... 106

Figure 36: Global and local PLS, GS(ON), LS(ON), G(8), L(2) model sensitivity, specificity, ...... 110

Figure 37: PLS, GS(ON), LS(ON), G(8) local model ensemble sensitivity, specificity, and ...... 113

Figure 38: PLS, G(8) sensitivity, specificity, and accuracy versus coverage for a collection of global ...... 115

Figure 39: PLS, G(8), G(2) sensitivity, specificity, and accuracy versus the coverage for a collection ...... 117

Figure 40: 2-dimensional, t-SNE embedded Ames mutagenicity ...... 120

Figure 41: 2-dimensional, PLS transformed Ames mutagenicity ...... 121

Figure 42: t-SNE, GS(ON), LS(ON), G(2), L(2) local model ensemble sensitivity, specificity, and ...... 123


Figure 43: t-SNE, GS(ON), LS(ON), G(2), L(2) sensitivity, specificity, and accuracy versus coverage ...... 125

Figure 44: Performance comparison of the PLS and t-SNE derived local model ensembles...... 128

Figure 45: PLS, GS(ON), LS(ON), G(8), L(2) local model ensemble sensitivity, specificity, and accuracy ...... 137

Figure 46: PLS, GS(ON), LS(ON), G(8), L(2) local model ensemble sensitivity, specificity, and accuracy ...... 139

Figure 47: PLS, GS(ON), LS(ON), G(8), L(2) local model ensemble sensitivity, specificity, and accuracy ...... 143

Figure 48: PLS, GS(ON), LS(ON), G(8), L(8) local model ensemble sensitivity, specificity, and accuracy ...... 144

Figure 49: Skeletal formula of 1-(4-Chlorophenyl)-3,3-dimethyltriazene, ...... 146

Figure 50: PLS, GS(ON), G(8) global logistic regression coefficients of 20 numeric descriptors with largest magnitude...... 148

Figure 51: PLS, GS(ON), G(8) global logistic regression coefficients of 20 chemotype descriptors with largest magnitude...... 148

Figure 52: Training molecules within 0.75 units to 1-(4-Chlorophenyl)-3,3- dimethyltriazene...... 150

Figure 53: The PLS transformed local descriptor space for the logistic regression model predicting 1-(4-Chlorophenyl)-3,3-dimethyltriazene...... 151


Figure 54: PLS, GS(ON), LS(ON), G(8), G(2) local logistic regression coefficients for the model predicting 1-(4-Chlorophenyl)-3,3-dimethyltriazene...... 152

Figure 55: Skeletal formula for physostigmine, ...... 153

Figure 56: PLS, GS(ON), G(8) global logistic regression coefficients of 20 largest magnitude numeric descriptors...... 154

Figure 57: PLS, GS(ON), G(8) global logistic regression coefficients of 20 largest magnitude chemotypes...... 154

Figure 58: The PLS transformed local descriptor space for the logistic regression model predicting physostigmine...... 157

Figure 59: The five nearest neighbors of physostigmine from most to least similar in the transformed local descriptor space...... 158

Figure 60: PLS, GS(ON), LS(ON), G(8), G(2) local logistic regression coefficients for the model predicting physostigmine...... 159

Figure 61: Frequency and association of significant descriptors among local training sets ...... 161

Figure 62: Frequency and association of significant chemotypes among local training sets ...... 161

Figure 63: CAS 14882-94-1, an Ames negative compound, with highlighted alkenyl (blue) and acyclic (orange) carboxylic ester chemotypes...... 162


Chapter 1. Introduction

The first use of computers toward understanding the effects of chemical structure on biological activity is often credited to Hansch and Fujita’s investigations into plant growth regulators and antibiotics in the 1950s and 1960s [1], [2]. Most consequential of their work was the proposed equation relating the extracellular concentration of test compound required to elicit a biological response, C, with the octanol-water partition coefficient, log P, and the Hammett constant, σ [2], [3]:

$$\log\left(\frac{1}{C}\right) = k\pi - k'\pi^{2} + \rho\sigma + k'' \qquad (1)$$

As Hansch and Fujita compared unsubstituted, parent compounds to series of substituted derivatives, π denotes the difference in log P between the parent and the derivative in question.

Likewise, the Hammett constant is proportional to the difference in the logarithm of the rate constant (or equilibrium constant) of the parent and derivative chemicals for a particular reaction. The variables k, k′, and k″ represent coefficients to be found by fitting to experimental data via multiple linear regression. In doing so, Hansch and Fujita had proposed the first quantitative structure-activity relationship, or QSAR, model [1], [2], [3].

As elaborated on by Martin, a number of decisions made by Hansch and Fujita laid the foundation for the beginnings of modern QSAR analysis [1], [2]:

1. Introducing log P as a descriptor of biological activity; it models the various aqueous and lipid environments small molecules must traverse to reach their cellular targets.

2. Recognizing the quadratic relationship between log P and biological potency, with a maximum value; molecular movement can be hindered by overly hydrophilic or lipophilic character.

3. Introducing the Hammett constant as an additional descriptor of a molecule’s electronic effects.

4. Using a computer instead of manual calculation to fit the regression equation, allowing larger data sets to be analyzed much more quickly.

Notable additions to this list include:

5. Utilizing familiar, relatively simple statistical methods (i.e. linear regression) to derive the mathematical relationships between chemical descriptors and response variables.

6. Reporting goodness-of-fit measures for models such as the coefficient of determination and the root-mean-squared error (a fitting sketch follows below).
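A minimal sketch of points 4-6 above: fitting Equation (1) by multiple linear regression and reporting a goodness-of-fit measure. The π, σ, and log(1/C) values are made-up illustrative numbers, not data from [2], and numpy is assumed purely for illustration.

```python
# Fit log(1/C) = k*pi - k'*pi^2 + rho*sigma + k'' to hypothetical data.
import numpy as np

pi = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])           # delta log P vs. parent
sigma = np.array([0.0, -0.17, 0.23, 0.45, 0.06, 0.78])  # Hammett constants
log_inv_C = np.array([2.1, 2.9, 3.4, 3.5, 3.2, 3.0])    # observed potencies

# Design matrix columns correspond to k, k', rho, and the intercept k''.
X = np.column_stack([pi, -pi**2, sigma, np.ones_like(pi)])
coeffs, *_ = np.linalg.lstsq(X, log_inv_C, rcond=None)
k, k_prime, rho, k_double_prime = coeffs

# Goodness of fit, as in point 6 of the list above.
predictions = X @ coeffs
ss_res = np.sum((log_inv_C - predictions) ** 2)
ss_tot = np.sum((log_inv_C - log_inv_C.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"k={k:.3f}, k'={k_prime:.3f}, rho={rho:.3f}, k''={k_double_prime:.3f}")
print(f"R^2 = {r_squared:.3f}")
```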

From these humble beginnings, QSAR analysis has continued to evolve over the last 60 years in conjunction with the larger field of chemoinformatics and computer technology to arrive at its present state.

Today, QSAR analysis is fully recognized in the academic community as an important field of study, and some of its many facets have become active areas of research themselves. Public and private databases containing millions of compounds are routinely sourced for investigations into structure-activity relationships [1], [4]. At the extreme, the Chemical Abstracts Service (CAS) REGISTRY database contains information on more than 151 million organic and inorganic compounds [5]. Such databases continue to expand in size due to the explosion of data generated from combinatorial chemical synthesis and high-throughput screening. Thousands of descriptors have been conceived to capture the 1D, 2D, 3D (i.e. conformation-dependent), and even 4D (i.e. time-dependent dynamics) aspects of chemical structure [1], [6], [7]. Statistical modeling techniques, for example, partial least squares regression, have been developed and employed in QSAR studies to handle frequently encountered problems such as when the number of descriptor variables exceeds the number of compounds within a data set. Machine learning methods, such as random forests, support vector machines, and deep neural networks to name a few, are being utilized with greater frequency to uncover complex, non-linear relationships between chemical structure and function [4], [8], [9]. These developments have come together to benefit society in multiple domains, with the most prominent contributions being made in medicinal science with the discovery and optimization of new pharmaceuticals [4], [10], [11], [12]. QSAR also enjoys a leading role in predicting the toxicities of chemicals used in consumer products and justifying regulatory rules enacted to protect the environment and public health [4], [13], [14], [15].

1.1 Chemoinformatics

Any discussion of QSAR would be remiss without mentioning chemoinformatics, of which the former is a subfield and without which QSAR as it is practiced now would not be possible. Johann Gasteiger and Thomas Engel, two scientists who have contributed significantly to the progression of this area of study, broadly define chemoinformatics as “the application of informatics methods to solve chemical problems” [4], [16], [17]. Since “informatics methods” is rather vague, a more elucidating definition also referenced by Engel describes chemoinformatics as “a generic term that encompasses the design, creation, organization, management, retrieval, analysis, dissemination, visualization, and use of chemical information” [16]. Chemoinformatics is a vast topic; therefore, only some of the problems addressed by the field are presented here to familiarize the reader with the subject.

The digital representation of molecules is critical to chemoinformatics since without such capability chemical databases would not exist and chemical information could not be readily exchanged. Molecules are represented computationally in many ways, just as they are traditionally represented to humans by names, molecular formulas, skeletal formulas (i.e. 2D graphs), and ball-and-stick models (i.e. 3D representations) [16], [17].

The most widely-used line notation for compounds is the Simplified Molecular Input Line Entry Specification (SMILES), developed in 1986 by David Weininger. The SMILES line notation uses keyboard characters to represent molecules, including branched substructures, rings, and stereochemistry. Second, since matrices are used to represent graphs in mathematics, the connection table is used to represent molecules within the Molfile format developed by MDL Information Systems [16], [17]. The connection table is divided into blocks containing the compound name, atomic coordinates, bonded atom connections, and their bond orders. Figure 1 shows representations of an example compound, isonicotinic acid, by skeletal formula, by annotated MDL Molfile format, and by SMILES string. Note the skeletal formula is numbered according to the atom and bond blocks in the Molfile.

Inspired from [18].

Figure 1: Representations of isonicotinic acid by a) skeletal formula, b) MDL Molfile format, and c) SMILES string.

Several other file formats, some extensions of the MDL Molfile format, exist for representing multiple molecules, reactions, 3D coordinates of large biological molecules, crystallographic data, and spectroscopic data [16].

Molecules must be represented uniquely within a database to avoid storing and retrieving duplicate entries. Resolving this problem poses a unique challenge in chemoinformatics since it is quite easy to represent the same compound via multiple names, graphs, SMILES strings, or MDL Molfiles. In theory, a molecule with N atoms can have up to N! possible valid connection tables. Illustrating further, valid SMILES strings for isonicotinic acid include OC(=O)c1ccncc1, O=C(O)c1ccncc1, and n1ccc(C(=O)O)cc1, among others depending on the starting atom and encoding path. A canonical representation of a molecule uses an algorithm to order the atoms of a chemical the same way each time it is applied. The Morgan algorithm, outlined below, is a widely-known method for attempting to represent molecules canonically [16], [17], [19]:

1. Each heavy atom takes a connectivity value equal to the number of neighboring heavy atoms with which it shares a bond.

2. The connectivity value of each heavy atom is updated to the sum of the values of its neighboring atoms from the previous iteration.

3. If the number of unique connectivity values reaches a maximum, proceed to step 4. Otherwise, return to step 2.

4. The atoms are labeled canonically from largest to smallest connectivity values.

An example of the Morgan algorithm applied to isonicotinic acid is shown in Figure 2. The algorithm stops at the third iteration since the fourth iteration generates the same number of unique connectivity values (i.e. n = 6). While the Morgan algorithm does substantially reduce the number of possible chemical representations, in some instances there remain multiple canonical numberings of equal validity.


Figure 2: The Morgan algorithm applied to isonicotinic acid.

As seen in the isonicotinic acid example shown in Figure 2, the algorithm produces three pairs of atoms with equivalent connectivity values, and these equivalencies are resolved arbitrarily. Since its inception, the Morgan algorithm has been extended and/or adapted to address these issues, such as the stereochemically-extended Morgan algorithm (SEMA) and CANGEN designed to produce canonical SMILES strings [17], [19].
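A minimal sketch of the connectivity-value iteration in steps 1-3 above, applied to the heavy-atom graph of isonicotinic acid. The adjacency list is hand-built for illustration rather than parsed from a Molfile; running it reproduces the behavior described for Figure 2 (the iteration stops when the count of unique values stalls at six, leaving three tied pairs of symmetric ring atoms).

```python
# Heavy-atom graph of isonicotinic acid, OC(=O)c1ccncc1 (indices arbitrary).
ADJACENCY = {
    0: [1],          # hydroxyl O
    1: [0, 2, 3],    # carboxyl C
    2: [1],          # carbonyl O
    3: [1, 4, 8],    # ring C bearing the carboxyl group
    4: [3, 5],       # ring C
    5: [4, 6],       # ring C
    6: [5, 7],       # ring N
    7: [6, 8],       # ring C
    8: [7, 3],       # ring C
}

def morgan_values(adjacency):
    """Iterate extended connectivity until the number of unique values stops growing."""
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}   # step 1
    n_unique = len(set(values.values()))
    while True:
        updated = {atom: sum(values[n] for n in nbrs)                # step 2
                   for atom, nbrs in adjacency.items()}
        new_unique = len(set(updated.values()))
        if new_unique <= n_unique:                                   # step 3
            return values
        values, n_unique = updated, new_unique

final = morgan_values(ADJACENCY)
# Step 4: label atoms from largest to smallest value; ties (the symmetric
# ring atoms here) must be broken by additional rules such as CANGEN's.
print(sorted(final, key=final.get, reverse=True))
```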

Once chemical information is stored in a database, there must be an efficient means to extract it for use in research projects. Databases can be searched by names, registration or identifier numbers, certain structural features, or similarity threshold values, to name a few. Substructure searching in particular is a routine task when querying chemical databases. Searching molecule-by-molecule through the entire database for the presence of the query fragment via graph theoretical techniques would be prohibitively time-consuming and inefficient due to the nature of the algorithms involved [17], [19]. To resolve this issue, database software utilizes screening mechanisms to substantially reduce the possible number of matches before proceeding to analyze each remaining compound [17], [19]. For example, the binary screening method is a commonly employed screening technique where each molecule within the database is represented as a bitstring; the presence or absence of a given substructure is indicated by a “1” or “0”, respectively. A pre-defined “dictionary” of fragments or “keys” dictates the construction of the bitstring for each molecule [17], [19]. The MACCS keys comprise a frequently used structural dictionary containing 166 different fragments [17], [20]. The illustration in Figure 3 shows an example query molecule and three additional molecules for which similarity comparisons are made. In Table 1, the molecules of Figure 3 are represented as bitstrings using five MACCS keys. A search of each molecule against the query can eliminate compounds 2 and 3 since they lack the required hydroxyl groups and atoms. However, molecule 1 would require further scrutiny to determine if the query is contained within its structure. Noting this application, dictionaries are composed of keys such that they occur frequently enough to be useful yet remain unique enough to maximize discrimination between sets of organic molecules [17], [19].
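A minimal sketch of that binary screen, using the Table 1 bit patterns (leftmost key taken as the most significant bit). A candidate can contain the query substructure only if every bit set in the query is also set in the candidate, so a cheap bitwise test prunes the database before any expensive graph matching; as in the discussion above, only Mol 1 survives.

```python
# Hypothetical 5-key bitstrings mirroring Table 1.
query_bits = 0b10001
database = {
    "Mol 1": 0b11011,
    "Mol 2": 0b10110,
    "Mol 3": 0b00100,
}

# Keep only molecules whose bits are a superset of the query's bits.
survivors = [name for name, bits in database.items()
             if bits & query_bits == query_bits]
print(survivors)  # ['Mol 1'] proceeds to full graph-based matching
```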


Figure 3: An example query with three sample molecules for similarity comparison.

Table 1: Bitstring representations of the Figure 3 molecules using five MACCS keys and pairwise Tanimoto similarities.

Molecule   Key 1   Key 2   Key 3   Key 4   Key 5   Tanimoto similarity to query
Query        1       0       0       0       1            1.00
Mol 1        1       1       0       1       1            0.50
Mol 2        1       0       1       1       0            0.25
Mol 3        0       0       1       0       0            0.00

Comparisons between molecules are also frequently made through similarity and dissimilarity (i.e. distance) calculations. A common similarity metric is the Tanimoto or Jaccard coefficient [17], [19]:

$$S_{A,B} = \frac{c}{a + b - c} \qquad (2)$$

where for two molecules A and B represented by bitstrings of structural fragments, a denotes the number of features present in A, b the number of features present in B, and c the number of features present in both A and B. Values of the Tanimoto coefficient fall on the range [0,1]; two molecules are “maximally” similar if their coefficient is “1” and maximally dissimilar if their coefficient is “0”. An alternative viewpoint to similarity is the notion of distance; items that are “similar” are “close” in terms of distance in a metric space and vice versa. Therefore, the Tanimoto or Jaccard distance is defined as:

$$D_{A,B} = 1 - S_{A,B} = 1 - \frac{c}{a + b - c} \qquad (3)$$

Other similarity/distance metrics have been defined and are regularly used, including the Dice, Cosine, Hamming, Euclidean, and Soergel measures, among others [17]. The similarity and/or distance between two molecules is highly dependent on the fragments used to represent the compounds as bitstrings. Demonstrating its use, the Tanimoto similarity between the query and example molecules of Figure 3 is included as the last column in Table 1. A similarity search can provide a set of molecules with a score greater than or equal to a specified threshold, ordered from highest to lowest, and may be combined with other search modalities to further refine the search.
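A minimal sketch of Equations (2) and (3) applied to the Table 1 bitstrings; the five keys per molecule are the same hypothetical MACCS-style keys used above, and the printed similarities reproduce the table's last column.

```python
def tanimoto_similarity(a_bits, b_bits):
    a = sum(a_bits)                                  # features present in A
    b = sum(b_bits)                                  # features present in B
    c = sum(x & y for x, y in zip(a_bits, b_bits))   # features present in both
    return c / (a + b - c)

query = [1, 0, 0, 0, 1]
molecules = {
    "Mol 1": [1, 1, 0, 1, 1],
    "Mol 2": [1, 0, 1, 1, 0],
    "Mol 3": [0, 0, 1, 0, 0],
}

for name, bits in molecules.items():
    s = tanimoto_similarity(query, bits)
    print(f"{name}: similarity={s:.2f}, distance={1 - s:.2f}")  # Eq. (2), (3)
```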

This concludes a very brief introduction to some of the problems encountered in chemoinformatics and the solutions the field has developed. While great strides have been made in this field, outstanding problems such as recognition of drug-like compounds and prediction of toxicity are, at least as of now, too complex to be solved deductively.

Therefore, chemoinformatics must transform chemical and biological data into knowledge using inductive learning. This transformation is completed via QSAR modeling and analysis.

1.2 QSAR Modeling

QSAR models are empirical or semi-empirical relationships which seek to exploit existing experimental data with the goal of explaining the yet unknown properties and/or biological activities of chemical compounds. In its simplest form, a QSAR model is a mathematical expression relating a certain endpoint of interest to the molecular structures of chemical compounds:

$$y = f(d_1, d_2, \ldots, d_p) \qquad (4)$$

where y denotes the response variable in question, $d_1$ through $d_p$ represent p molecular descriptors chosen to encode the structural information of molecules in numerical form, and f is a function which maps the descriptor space to the response space [21], [22]. In 2004, the Organization for Economic Co-operation and Development adopted five principles for QSAR models used for regulatory purposes. These principles stipulate a QSAR model to have [23]:

1. A well-defined endpoint.

2. An unambiguous algorithm.

3. A defined domain of applicability (i.e. a method to determine if a test query is being interpolated or extrapolated by the QSAR model).

4. Appropriate measures for goodness-of-fit, robustness, and predictivity.

5. A mechanistic interpretation, if possible.

These principles serve as an initial roadmap for constructing a QSAR model for professional use.

The most important ingredient for a good QSAR model is data of the highest possible quality. Typically, researchers developing models obtain data from experimentalists directly or, more commonly, from the literature. Models are built from individual data sets or those compiled from multiple sources. Expectedly, mistakes in data collection and reporting processes will affect the performance of any resulting models using such data. Furthermore, even if data is collected and reported accurately, there may exist differences in the form of chemical compounds used (e.g. isomers or lack of purity), experimental procedures followed, or assays utilized to assess the given endpoint [6]. All these aspects have the potential to introduce unexplainable variability in the data unless these variables are specifically accounted for within the models. As Tropsha discussed in his work on best practices in QSAR, a particularly troublesome source of error is misrepresentation of chemical structures within databases. Error rates in structures were found to range from 0.1% to 8% across public and commercial databases [22].

Furthermore, chemoinformatics software may occasionally function erroneously; for example, the generation of incorrect structures from correct digital encodings. Tropsha concluded that automatic and manual inspection should be instituted before modeling.

Automatic inspection, executed by chemoinformatics software, includes actions such as removal of inorganic compounds, salts, and mixtures; curation of tautomeric forms; neutralization of formal charges; and the deletion of duplicate structures. Manual inspection involves viewing at least a fraction of the dataset for errors potentially missed by software [22]. Lastly, it is not immediately evident in the literature how to systematically assess the quality of experimental data. Comparing measurements of activity across sources for consistency is one viable strategy; compounds with values exhibiting a small variance across laboratories can be averaged to a single value. On the other hand, molecules whose activity is grossly inconsistent between sources should be excluded [24]. Albeit anecdotal, the reliability of data sets may rely on established relationships between modelers and experimentalists.
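A minimal sketch of a few of the automatic-inspection steps just described: dropping unparseable records, de-duplicating by canonical structure, and averaging replicate activities. RDKit is assumed here purely for illustration (the dissertation does not prescribe a toolkit), and the records are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from rdkit import Chem

raw_records = [("OC(=O)c1ccncc1", 2.31),   # hypothetical (SMILES, activity)
               ("O=C(O)c1ccncc1", 2.27),   # same structure, second measurement
               ("not_a_smiles", 9.99)]     # malformed entry to be discarded

grouped = defaultdict(list)
for smiles, activity in raw_records:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                        # remove records that fail to parse
        continue
    canonical = Chem.MolToSmiles(mol)      # one representation per structure
    grouped[canonical].append(activity)

# Replicates with small variance are collapsed to their mean value.
curated = {smiles: mean(vals) for smiles, vals in grouped.items()}
print(curated)
```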

Implicit in Equation (4) is an axiom fundamental to QSAR analysis known as the similarity principle: chemical compounds with similar structures are observed to have similar properties [22], [25]. Dramatic or indistinct changes in f over small movements of the descriptor space are thought to indicate inadequacies of the selected descriptors and/or distance metric explaining the molecules for the endpoint in question. Stated differently, if two “similar” but non-identical compounds are found to truly differ in their measured responses, then a failure to distinguish between the compounds must result from either the representation of their structure, the definition of similarity among objects in the descriptor space, or a combination of the two. As a result, the descriptors selected to represent the compounds of a data set are a crucial aspect of QSAR modeling. The generation of novel chemical descriptors and algorithms to recognize relevant explanatory structures for a particular endpoint are fields of research unto themselves. The complete set of chemical descriptors available today numbers in the several thousands, if not more [1]. For example, the two-part work Molecular Descriptors for Chemoinformatics includes some 3,300 entries [26]. Generally, molecular descriptors fall into five categories: constitutional (i.e. chemical information devoid of atomic connectivity), topological (i.e. counts or indices calculated from the 2D arrangement of atoms and bonds within a molecule, typically represented as a graph), geometrical (i.e. derived from the 3D conformation of a molecule after optimization), electronic (i.e. quantum mechanical aspects of a molecule), and thermodynamic (i.e. founded in thermodynamic principles) [1], [6], [7], [8], [27].

Physiochemical descriptors obtained from experimental measurements (e.g. partition coefficients, pharmaco-kinetic or -dynamic measures, etc.) or from quantitative structure- property relationship (QSPR) models are also frequently used [6], [7]. Some common molecular descriptors are shown in Table 2:

Table 2: Example QSAR molecular descriptors by category.

*Frequently calculated from 2D QSPR models. Produced with information from [7], [8], [19], [28].

Constitutional: molecular weight, sum of atomic van der Waals volumes, number of atoms, bonds (by type), rings (by member), etc.

Topological: binary presence/absence or counts of fragments, functional groups, linear or circular fingerprints, H-bond donor/acceptor atoms; Wiener index, 2D autocorrelation vector, complexity, etc.

Geometric: molecular eccentricity/asphericity, radius of gyration, MoRSE descriptor, radial distribution function, inertial moments, 3D autocorrelation vector, McGowan volume*, topological polar surface area*, approximate surface area*, etc.

Electronic: highest occupied/lowest unoccupied molecular orbital energies, dipole moment, Hammett constant, etc.

Thermodynamic: heat of formation, molar refractivity, acid dissociation constant, etc.

Physiochemical: octanol-water partition coefficient*, water solubility*, physiological distribution coefficient*

If possible, descriptors are chosen on theoretical grounds or from previous studies indicating the response variable in question does depend on the molecular characteristics the descriptors capture. Otherwise, useful descriptors are “mined” from a large pool during model development. The identification of new “structural alerts”, or molecular features recognized to adversely influence a chemical’s biological activity (usually in reference to toxicity), benefits from the latter case [29]. Lastly, and perhaps unsurprisingly, not all descriptors are useful toward describing the endpoint of interest, and large sets of descriptors likely contain redundant information, necessitating the employment of descriptor selection strategies.

Development of a QSAR model is initiated once experimental data on a sufficient number of compounds is obtained and curated, and after descriptor calculation from the molecular structures is completed. Many possible development strategies exist; typically, the data set is divided into three portions: the training set, test set, and external validation set [6], [22], [27], [30]. Approximately 1% to 20% of the data is reserved for external validation. The other 80% to 90% may be split into training and test sets either statically or by a process called cross-validation [22]. Under n-fold cross-validation, the data is split into n sets of equal proportion. Iteratively, each of the n sets serves as the test set while the remaining n-1 sets are allocated for model training. The value of n is taken to be 5 or 10 in most studies, with the limiting case of leave-one-out cross-validation also seen. This process can be repeated multiple times for the generation of a distribution of performance statistics (a code sketch of this procedure follows Table 3). Model training itself employs various techniques such as descriptor selection, dimensional reduction, applicability domain determination, and supervised learning, common examples of which are shown in Table 3. It is also during this time that any of the aforementioned methods with tunable parameters undergo optimization. Multiple QSAR models may result from using combinations of descriptor sets found from various feature selection routines and different supervised learning algorithms.

Table 3: Techniques used in a QSAR workflow with examples.

Produced with information from [23], [31], [32], [33], [34].

Descriptor selection: unbalanced correlation score, Fisher score, information gain, chi-square test, odds ratio, Shannon entropy, forward selection, backward elimination, step-wise regression, genetic algorithm, simulated annealing, particle swarms, ant colony system, replacement method, etc.

Dimensional reduction: multi-dimensional scaling, principal component analysis (PCA), kernel principal component analysis, local linear embedding, t-distributed stochastic neighbor embedding, autoencoder, linear discriminant analysis, etc.

Applicability domain: bounding box, PCA bounding box, convex hull, distance-based methods, leverage-based methods, k-nearest neighbors, probability density distribution approaches, local density methods, decision tree approaches, etc.

Supervised learning: multiple linear regression, k-nearest neighbors, logistic regression, naive Bayes, partial least squares regression, decision trees, random forest, artificial neural networks, support vector machines, etc.
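A minimal sketch of the 5-fold cross-validation scheme described above, pairing two techniques listed in Table 3 (PCA for dimensional reduction, logistic regression for supervised learning). scikit-learn is assumed purely for illustration, and X and y are synthetic placeholders for descriptors and a binary endpoint.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))    # 200 molecules, 50 descriptors
y = rng.integers(0, 2, size=200)  # binary activity labels

model = make_pipeline(PCA(n_components=8), LogisticRegression(max_iter=1000))

accuracies = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(X, y):
    model.fit(X[train_idx], y[train_idx])                     # n-1 folds train
    accuracies.append(model.score(X[test_idx], y[test_idx]))  # held-out fold

print(f"5-fold accuracy: {np.mean(accuracies):.3f} +/- {np.std(accuracies):.3f}")
```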

Each QSAR model will have performance statistics associated with the predictions it makes on test sets. Data sets with continuous-valued endpoints can be evaluated using criteria such as the coefficient of determination and the root mean squared error (RMSE) [6]:

$$R^2 = \left( \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2}\, \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}} \right)^2 \qquad (5)$$

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} \qquad (6)$$

where $y_i$ is the activity of the i-th molecule, $\hat{y}_i$ is the predicted activity of the i-th molecule from the model, $\bar{y}$ is the average activity of all molecules, and $\bar{\hat{y}}$ is the average activity of all model predictions. The coefficient of determination is equivalent to the square of the Pearson correlation coefficient between data activities and model predictions, ranging from 0.0 (no correlation) to 1.0 (perfect correlation). The RMSE measures the variance, or uncertainty, of the model’s predictions. Likewise, classification models are assessed using statistics derived from a confusion matrix, which conveys the degree of agreement between the experimental and predicted values of the endpoint. A generic confusion matrix for a binary classification problem is shown in Table 4 [6], [30]:

Table 4: Generic confusion matrix for a binary classification problem.

                     Positive response               Negative response
Positive prediction  Number of true positives (TP)   Number of false positives (FP)
Negative prediction  Number of false negatives (FN)  Number of true negatives (TN)

A number of descriptive statistics may be generated from the confusion matrix such as the sensitivity, specificity, and accuracy:

$$Sensitivity = \frac{TP}{TP + FN} \qquad (7)$$

$$Specificity = \frac{TN}{TN + FP} \qquad (8)$$

$$Accuracy = \frac{TP + TN}{TP + FN + TN + FP} \qquad (9)$$

where TP, FP, FN, and TN are defined by the confusion matrix in Table 4. The sensitivity, specificity, and accuracy range from 0.0 to 1.0. Analyzing all three statistics allows for a better characterization of the model’s performance; for example, a trivial model which only makes negative predictions will achieve high accuracy on a highly unbalanced data set.

However, such a model will have zero sensitivity. Alternatively, a number of measures have been developed to summarize a binary classifier’s performance into a single value.

Consider the Matthews correlation coefficient (MCC) below:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}} \qquad (10)$$

The Matthews correlation coefficient is a binary discretization of the Pearson correlation coefficient (PCC) between a set of predictions and the accompanying set of actual values.

Like the PCC, the MCC lies on the interval [-1,1]; a value of 1 represents a perfect classification, a value of 0 means the classifier’s performance is equivalent to random guessing, and a value of -1 means the classifier’s predictions are exactly opposite of the true response values [35], [36]. Other measures, such as Cohen’s κ (kappa) or the area under the curve (AUC) of a receiver operating characteristics (ROC) plot, convey similar information consolidated into a single metric [6], [30].
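A minimal sketch of Equations (7)-(10) computed from raw confusion-matrix counts; the counts themselves are hypothetical.

```python
from math import sqrt

TP, FP, FN, TN = 45, 12, 9, 60

sensitivity = TP / (TP + FN)                        # Equation (7)
specificity = TN / (TN + FP)                        # Equation (8)
accuracy = (TP + TN) / (TP + FN + TN + FP)          # Equation (9)
mcc = ((TP * TN - FP * FN) /                        # Equation (10)
       sqrt((TP + FN) * (TP + FP) * (TN + FP) * (TN + FN)))

print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
print(f"accuracy={accuracy:.3f}, MCC={mcc:.3f}")
```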

Returning to the QSAR development workflow, models deemed sufficient by pre-defined performance thresholds are further tested on one or more external validation sets [6], [22], [24], [27]. Models with good internal and external predictive performance are then used individually or in consensus (i.e. ensemble) toward practical QSAR applications.

Since these models will guide future experiments, they are updated and re-validated once the results of those experiments are available [22]. A diagram of the QSAR development workflow as described in its totality is shown in Figure 4. Not all studies strictly adhere to the QSAR development workflow as outlined in the figure. For instance, if a data set is deemed too small to be split into training, test, and external validation sets, a smaller study involving only cross-validation is performed, although internal performance statistics are usually overly optimistic compared to proper external validation.

Benchmark data sets compiled to compare modeling methodologies across research groups partition data according to other schemes as well.


Adapted from [22].

Figure 4: QSAR development workflow.

1.3 Applications of QSAR Modeling

A literature review on in silico methods in drug design and development finds their primary application to be “virtual screening”: the utilization of, among other computational tools, QSAR models to recognize and prioritize potential drug-like molecules over those predicted to have properties unsuitable for further consideration [10], [11], [12], [37]. The drug discovery and development process, excluding the pre-clinical and clinical phases, can be decomposed into a series of steps: target identification, target validation, hit identification, hit-to-lead identification, and lead optimization. A summary of each step is found in Figure 5. QSAR models are most frequently applied during the hit identification, lead identification, and lead optimization phases of drug development. During hit identification, these models are used to quantitatively rank which molecules in the library should be selected for initial experimental evaluation. Furthermore, models can screen virtual libraries, guiding the selection of compounds which should be synthesized and added to the ligand collections. Later, during lead identification and optimization, there is a simultaneous requirement to maintain or enhance what is already acceptable target potency and selectivity while correcting potentially deficient pharmacological characteristics. Models constructed on hit and/or lead series can identify which core scaffolds and structural fragments are responsible for target interaction and how modifications to these sites can improve ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) properties.


ADMET: Absorption, Distribution, Metabolism, Excretion, and Toxicity. Produced with information from [36].

Figure 5: The drug discovery and development pipeline.

QSAR models can predict a wide range of pharmacological endpoints including plasma protein binding, lipophilicity, solubility, gastrointestinal absorption, blood-brain barrier passage, oral bioavailability, cytochrome metabolism, off-target interactions, metabolite formation, acute toxicity, skin sensitization, and genotoxicity, among many others [38]. Vital to the success of these models is sufficient training data in the form of physiochemical and structural information on previously approved drugs or molecules whose biological activities are typically ascertained from high-throughput assays. Equally important is training data on compounds with deleterious or no effect for comparison between classes [10], [11], [12], [37]. While direct measurement of target efficacy, selectivity, and ADMET characteristics for every hit or lead molecule would be ideal, conducting the required in vitro or in vivo tests is impractical due to their time-consuming, labor-intensive, and expensive nature [10], [38]. Conveniently, data collected from various assays conducted through the progression of the drug discovery process can function in a feedback loop, providing more high-quality data and subsequent validation to existing QSAR screens. Tropsha provides a concrete example of the effectiveness of QSAR analysis toward hit identification through the discovery of novel Geranylgeranyltransferase type I (GGTase-I) inhibitors [22]. These inhibitors have the capacity to treat conditions such as inflammation, multiple sclerosis, and atherosclerosis. Using a form of the modeling workflow outlined in Figure 4, a virtual screen on 9.5 million commercial compounds yielded 47 computational hits, seven of which were shown to be active and selective upon in vitro testing. Two of these compounds are shown in Figure 6.


Two compounds found to have Geranylgeranyltransferase type I inhibitory activity (values shown) via QSAR virtual screening. IC50 is the concentration of compound reducing the enzymatic activity by 50%. Adapted from [22].

Figure 6: QSAR applied to virtual screening in drug development.

Any effort to improve the performance of pharmaceutically-applicable QSAR models is of paramount importance for the success of the industry and, infinitely more important, the betterment of human health. Estimates place the cost of developing a new drug at $800 million to $1.8 billion over a 10 to 15 year development period [11], [39].

Additionally, more recent estimates place development costs at greater than $5 billion and possibly up to $13 billion over the same time frame, depending on the company scrutinized [40]. In all, R&D spending over the last 40 years on drug discovery and development has steadily increased while the number of New Molecular Entity (NME) registrations submitted to the U.S. Food and Drug Administration has remained relatively constant [40].

Attrition is cited as a problem undermining the pharmaceutical industry and driving up costs, with more than 95% of the small molecules reaching clinical trials ending in failure [39], [40], [41], [42]. The highest source of attrition in the 1980s was poor human pharmacokinetic (PK) properties, accounting for nearly 40% of all failures. Since the advancement of pre-clinical screens, including virtual QSAR screens which specifically address PK and other ADMET characteristics, PK failures dropped to approximately 10% by 2005 [39], [40], [41], [42]. Today, efficacy in clinical trials is the greatest cause of new drug failures, with safety issues following closely [39], [40], [41], [42]. Addressing these sources of attrition will require new science: better mechanistic insights into the cause of disease; better translation between disease models in animals and later in human patients; and better experimental designs focused on understanding side effects [40], [43]. The data generated from these advancements will feed computational methods and ultimately offer faster, cheaper pathways toward new drugs.

Continuing the discussion on QSAR applicability, computational methods are increasingly employed toward environmental and human health risk assessment and regulation. At present, there are tens of thousands of industrial chemicals in use, with hundreds more added each year, and the toxicological profiles of these compounds are usually lacking or absent [1], [13], [44]. These chemicals are found in many commercial products (e.g. foods, cosmetics, etc.) and pose a potential risk to human health. Additionally, such chemicals frequently enter the environment (e.g. pesticides) and impact human, animal, and plant life. Regulatory agencies, particularly in the U.S. and Europe, are concerned with characterizing and mitigating the potential impact of these substances [13], [44].

Completing the necessary in vitro and in vivo toxicity tests on any one chemical would require a considerable time and capital investment. Such tests also require animal subjects, and regulatory authorities are reducing reliance on animal testing due to ethical concerns [1], [44]. Computational methods are sought to quickly and cheaply fill in missing data on commercial compounds. However, predicting toxicological outcomes using computational models presents its own difficulties; training data is relatively scarce, and the biological mechanisms-of-action are multi-faceted and poorly understood. Ongoing programs such as ToxCast and Tox21 seek to generate high-throughput screening (HTS) data on diverse sets of commercial chemicals using hundreds of biological assays [1], [44]. A combination of physiochemical/structural descriptors and HTS data are used to group compounds hypothesized to share similar mechanistic pathways to certain toxicity endpoints, and then semi-empirical models based on statistical analysis and/or machine learning algorithms predict those endpoints from chemical structure. Such a paradigm has been used to predict toxicological outcomes such as bacterial mutagenicity, developmental toxicity, and skin sensitization, among others [1], [44].

A concrete example of QSAR applied to risk assessment concerns mitigation of allergic contact dermatitis (ACD). Alves et al. investigated the use of QSAR models for predicting skin sensitization over existing animal tests [45]. ACD can pose a hazard to humans and the environment through exposure to sensitizing compounds in commercial products. ACD occurs by a two-step process: induction and immune response elicitation [45]. First, the chemical contacts the epidermis and binds to skin proteins to form an immunogenic complex. Then, allergen-specific T-cells initiate an inflammatory response.

Skin sensitization is tested in both animals and humans. The murine local lymph node assay (LLNA) is the preferred animal test by European and U.S. regulatory agencies since it shows good correlation with human sensitization [45]. In humans, the repeated insult patch and maximization tests are used, the latter of which results in irritated skin. However, for ethical reasons, animal testing is being eliminated or reduced across several industries. For example, animal tests for cosmetic ingredients have been banned in Europe since 2009 [45]. There is also the question of validity; the LLNA test does fail in several instances to predict human skin sensitization. With these facts in mind, QSAR modeling is an attractive alternative to future animal testing. These models can simultaneously exploit data already collected from human and animal studies while introducing additional information in the form of molecular descriptors, potentially absent from mechanisms underlying the assays. Alves et al. compared QSAR models constructed using human data to predictions produced from the LLNA assay.

LLNA assays had higher sensitivity, 83% vs. 65%, due to the frequency of labeling molecules as sensitizers [45]. This result highlights the ability for QSAR models to replace or at least supplement animal testing assays in commercial practice. The aforementioned

QSAR models provide pathways for modifying existing sensitizers into non-sensitizers, shown in Figure 7, while potentially preserving the useful properties of the original compound.

27

Phenyl benzoate (top), skin sensitizer, along with three suggested compounds (bottom) predicted non-skin sensitizers to potentially replace it in products. Adapted from [45].

Figure 7: QSAR applied to skin sensitizers in consumer products.

This ends the introductory chapter describing in detail the fields of cheminformatics and QSAR analysis, and the primary applications of QSAR modeling. The remaining chapters of this dissertation are organized as follows: Chapter 2 provides background information on implementing QSAR models, including considerations and techniques present throughout the process. It also outlines the objectives of the research contained herein. Chapter 3 describes the methodology of the research conducted herein, namely, a

QSAR modeling workflow which constructs local model ensembles complete with query- specific descriptor sets. It also discusses the computational tools and data sets used to complete and evaluate this research. Chapter 4 presents results and discussion from the

28 application of the proposed research on the evaluation data sets. It also offers performance comparisons between the work and comparable models reported in the literature. Finally,

Chapter 5 re-iterates the conclusions drawn from the results and discussion and outlines areas of improvement to be pursued in future endeavors.

29 Chapter 2: Background

As the field of chemical engineering has only recently trended toward biomolecular applications, it is understandable that many chemical engineers are unfamiliar with chemoinformatics and, in particular, QSAR modeling. Therefore, it is necessary to discuss some of the most pertinent aspects of QSAR modeling in further detail, review previous literature related to it, and explain specific language common to the discipline. Naturally, topics of QSAR modeling which are the subject of the research presented herein are emphasized. Lastly, the research problem addressed by this research is presented along with an outline of the proposed solution.

2.1 Global vs. Local QSAR Modeling

Within the context of a single data set of chemical compounds, the choice of whether or not to use all molecules for training a QSAR model can present additional opportunities to enhance predictive performance. As Guha et al. explains, traditional QSAR models use all available compounds in a given training set for model construction. From the perspective of molecules as observations scattered about in molecular descriptor space, such models are referred to as “global” models [46]. Global QSAR models are most useful when the data set is largely homogeneous and the underlying relationship between activity and molecular structure is relatively simple. However, if the data set is particularly large and/or structurally

30 diverse, there may exist certain regions in descriptor space where the mechanism of action for the activity of the compounds within those regions differ significantly from the rest of the molecules of the data set. Consequently, the descriptors relevant for explaining the activity of these regions may be “drowned out” in favor of more global trends [46], [47].

In order to identify smaller structure-activity regions distributed within a data set, one might consider partitioning the whole data set into a series of subsets whose members share similar structure. These subsets might be found either by a priori clustering or through the use of distance thresholds drawn from points distributed throughout descriptor space

[46], [47]. An individual QSAR model can be built on each of these subsets. The underlying structure-activity relationship should be relatively simple since the molecules in these subsets are likely to be similar as a result of clustering or through their proximity to each other within descriptor space. The models resulting from such an approach are referred to as “local” models, and the strategy as a whole known as “local learning” [46], [47].

Clustering and model training may proceed before any test or query compounds are presented for prediction. Future queries are then placed within pre-existing clusters usually by a similarity-based decision function, and the model of that cluster predicts the activity of the query. Distance-based learning algorithms tend to be “lazy”, the name adopted to an algorithm which does not initiate until a query is presented for prediction [46], [47]. For these approaches, the position of the query in descriptor space dictates the training observations used to predict its activity. Finally, there exists approaches between global and local learning. Generally termed “local weighting”, such methods use all training observations but their contribution toward predicting the query is adjusted in proportion to their distance from the query [47]. For example, LOESS (locally-weighted least squares 31 regression) and LOWESS (robust locally-weighted regression), using Taylor expansions defined by kernel functions, have been applied to various data modeling problems for some time [48]. For simple regression, the polynomial minimization problem for LOESS is given by:

2 푁 푝 푥푖 − 푧 min ∑ (푦푖 − ∑ 훽푗(푥푖 − 푧)) 퐾 ( ) (11) 훽0,…,훽푝 ℎ 푖=1 푗=0 where N is the number of observations, p is the order of the polynomial fit, 푥푖 is the value of the descriptor value for the i-th observation, 푦푖 is the value of the response for the i-th observation, K is the kernel function, h is the bandwidth parameter, and z is the position in the domain where the regression parameters 훽푗 are valid [48]. As Expected, Equation (11) can be applied to the multivariate case as well. As opposed to regular polynomial regression in which the resulting fitted model is assumed reasonable over the entire domain, the model found by minimizing Equation (11) only applies at z, the position of the query for a prediction problem. The investigator is left to the choice of kernel function and bandwidth parameter, that latter of which affects the weight of nearby observations toward the smoothness of the fit and is chosen to reflect the density of the data [48]. Perhaps the most widely-known local learning algorithm is the k-nearest neighbor’s (k-NN) algorithm. As explained by Mitchell, the majority class of a query’s k nearest neighbor’s, where k is an integer greater than zero, decide the predicted class for the query [49]. The distances from the training observations to the query are found in the descriptor space used for the model; the Euclidean distance metric on the normalized space or Mahalanobis distance metric are

32 used the most often. A toy example demonstrating the k-NN algorithm is shown in Figure

8:

A query (yellow triangle) among a two-class data set. If k is taken to be three, two negative neighbors (blue crosses) outnumber the one positive neighbor (red circle) and the query is predicted negative. By similar reasoning, if k is taken to be five, the query is predicted positive.

Figure 8: Hypothetical k-NN classifier example.

As mentioned previously, the contribution of each training observation’s class toward predicting the query may be weighted by its distance to the query by a weighting function.

Ultimately, the value of k and choice of weighting is optimized during the training process by cross or external validation sets [49]. Exemplifying its capabilities, Hansen et al. compared four parametric methods, support vector machines, Gaussian Processes (GPs), 33 random forest, and k-NN, on a public data set of 6,500 compounds constructed using Ames mutagenicity as the endpoint. Of the parametric classifiers, the k-NN algorithm offered comparable performance; the area under the receiver operating characteristic curve values, which serve as an overall measure of classification performance, were 0.86, 0.84, 0.83, and

0.79, respectively [50].

The premise of the local learning strategy aligns closely with the similarity principle of QSAR analysis; that is, the similarities and differences of the training compounds in proximity to the query compound are the most informative for deciding its biological activity. Local algorithms are advantageous through their capacity to simplify globally non- linear relationships as a collection of locally-linear approximations [46]. These local approximations not only provide accurate predictions but also interpretations into the mechanisms governing the observed activities due to their linear nature [51]. This latter aspect, model interpretability, can be much more difficult to obtain from non-linear, “black box” algorithms such as kernel learning methods and artificial neural networks [49], [52].

However, not all local learning algorithms are amenable to interpretation, for example, the k-nearest neighbors algorithm gives no immediate indication of descriptor importance without detailed analysis. Others criticize local learning methods for failing to explain global trends generalizable to the entire data set; the importance of certain descriptors may be closely linked to the position in descriptor space [46]. Helgee et al. observed on a study of simulated and real data sets that local models gave no reliable increase in predictive performance compared to global models [52]. Local models are said to be “risky” in a sense that failure to identify the “correct” locality for a query leads to prediction error. Helgee et al. also argue that local models ignore useful information contained in the totality of the 34 data set which, especially combined with non-linear learning algorithms, global methods sufficiently retain. Lastly, local learning methods are found to require rather large training sets or at least sets with regions of high density to reliably learn local models. Since a local model is built for each query, the computational resources required to predict many queries can be substantial compared to a global modeling strategy. In all, the local learning concept is not new, and some local QSAR modeling approaches found in the literature are now briefly reviewed.

Guha et al. adapted the k-NN algorithm to identify training observations local to queries for predicting three separate, continuous-valued biological endpoints: anti-malarial activity, platelet-derived growth factor receptor inhibition, and dihydrofolate reductase inhibition [46]. For each test compound to be predicted, it’s k-nearest training set neighbors were selected to serve as a local training set specific for that query. Ridge regression was applied to learn the relationship between the molecular descriptors and activity for these neighborhoods. The minimization function for ridge regression is given by [53]:

푁 푝 ′ 2 min ∑(푦푖 − 풙풊휷) + 휆 ∑ 훽푗 휷 (12) 푖=1 푗=1

Ridge regression differs from ordinary least squares regression by the addition of the right- most term which penalizes large regression coefficients. Such a modification, while biasing the regression coefficient estimates, reduces their variance compared to the ordinary least squares minimization problem. This is beneficial for regression problems with small training sets and many descriptors in which the ordinary least squares regression coefficient estimates may be unstable. The performance of the local lazy algorithm was compared to global multiple linear regression models using the RMSE measured on external test sets. Of 35 the three aforementioned data sets, the global and local models obtained the following performance statistics: 0.92 log units vs. 0.94 log units; 0.36 log units vs. 0.31 log units; and

2.16 log units vs. 2.01 log units, respectively. In summary, the local lazy model outperformed the global model in two of the three data sets. However, the authors noted that local models performed poorly for test molecules situated around “activity cliffs”, or regions where the biological activity of the molecule differs greatly despite no significant changes in structure.

Buchwald et al. devised a local modeling strategy which initiates with structural clustering of the training set guided by pre-defined size thresholds [51]. All training molecules are subjected to a graph mining algorithm which identities frequently occurring substructures given a minimum frequency constraint. Clusters are defined by a common substructure whose size must exceed a user-defined proportion of the whole molecule size.

Furthermore, small molecules are excluded from cluster membership by a minimum size threshold. As a result, training molecules may fit into no clusters, one cluster, or several clusters simultaneously if their structural features satisfy the aforementioned cluster inclusion criteria. After clustering, a global model is built using all training molecules while individual local models are built using each cluster. When a query compound is presented for prediction, it is assessed for cluster membership. If the query belongs to no clusters, a global model prediction is made. Otherwise, local model predictions are made for the query using each cluster. If the query belongs to two or more clusters, a consensus prediction is made using weights proportional to cluster size. Thus, larger clusters are assumed to be more confident in their predictions and have more influence. This methodology was benchmarked on 14 QSAR data sets ranging from 282 to 1216 molecules. By adjusting size 36 thresholds, the average number of clusters ranged from approximately 5 to 40 clusters whereas cluster size ranged from approximate 20 to 80 molecules, depending on the data set in question. Global and local models were completed using a variety of supervised learning algorithms: Gaussian Processes, k-nearest neighbors, MSi nearest neighbors, M5P model trees, and support vector machines. Model performance was determined using a 100 times hold-out validation procedure where a random 2/3 of the data set was used for training and the remaining 1/3 for testing. For the regression data sets, a significant improvement in mean absolute error was observed in more than 75% of the cases using the local modeling scheme. The global models outperformed the local models in less than 10% of the cases.

Likewise, for the classification data sets, local models achieved significantly improved accuracy in 15 of the 20 cases. For the remaining instances, global models outperformed in two and no significant difference was found in the final three.

2.2 Descriptor Selection

The descriptors chosen to represent the compounds of a data set is of great importance toward the successful development of a QSAR model. Presently, the relationships between chemical structure and biological activity are far too complex to be explained from first principles. Additionally, there is currently no accepted methodology for directly inputting the 2D or 3D structural representation of molecules into a learning algorithm. Therefore, as seen in Figure 9, molecular descriptors are the only venue for communicating structural information with the mathematical models which describe their correlations to the observed activities:

37

Adapted from [4].

Figure 9: The basic QSAR approach.

Demonstrating the importance of descriptor choice over the choice of learning algorithm,

Young et al. performed an experiment concerning the labeling of 1280 molecules into their correct class: adenosine, antibiotic, antibiotic-cephamycin, cholinergic, GABA, or hormone

[54]. Two different descriptors sets (binary atom-pair and BCUT-like) and three different learning methods (support vector machine, random forest, and boosting) were selected. All possible combinations of descriptor set and learning method were used to predict each class.

The authors found no significant difference between the learning algorithms’ performance for any molecular class; however, performance was dependent on the descriptor set considered.

Ideally, a QSAR model will have as few relevant descriptors as possible to adequately explain the variation of the endpoint in question. Models with fewer numbers of descriptors are the easiest to interpret. Naturally, models built with uninformative or noisy descriptor sets will be less predictive [32]. Those constructed with multiple redundant

38 features will be difficult to interpret since, in most instances of QSAR modeling, the descriptors are not orthogonal. Supervised learning methods applied to such data will obtain high levels of uncertainty in regard to descriptor coefficient estimates since the model cannot distinguish the source of the associations with the response. Furthermore, models trained with too many descriptors, especially in relation to the number of observations present in the data set, will have too many degrees of freedom when fitting to the training set. Although these “overfitted” models explain the training data well, they will often fail to generalize toward the prediction of new test observations [32]. A simple example of overfitting is shown in Figure 10:

Data (black scatter) is generated according to y = x + N(0,0.25). The actual relationship (dashed black line), simple linear regression (blue line), and 6-th degree polynomial regression (red line) are shown. While the polynomial model fits the data better, the simple model agrees more with the underlying relationship.

Figure 10: Demonstration of overfitting.

39 Ultimately, decisions made from QSAR models with the aforementioned discrepancies will be doomed to failure, wasting precious time and resources. Descriptor selection algorithms are designed to alleviate these problems and produce the most useful models possible.

Descriptor or feature selection methods are generally divided into two categories:

filter and wrapper methods. The first type, filter methods, derive their name from the fact that they “filter” or screen the number of descriptors available as a pre-processing step before the primary learning algorithm is applied [7], [21], [31], [32]. Filtering is accomplished by fulfillment of a criterion such as a certain degree of correlation between the descriptor and the response variables. These methods are advantageous in that they scale well with high-dimensional data sets. They are also computationally fast and less intensive relative to wrapper and hybrid/embedded methods [7], [21], [31], [32]. On the other hand,

filter methods do not participate in any form with the learning method employed to explain the variation in the data set. As a result, filter methods usually underperform relative to wrapper/hybrid approaches. Lastly, many filter methods are univariate-based, meaning they only consider one descriptor at a time. Univariate approaches by their nature are unable to consider the benefits of combinations of descriptors toward improving predictive performance or detrimental issues stemming from redundancy/multicollinearity [7], [21],

[31], [32].

Univariate descriptor selection is used exclusively within the proposed work due to the computational expense associated with filtering a large number of descriptors and the fact that a modeling strategy may be tasked with building many local models. For interval or count-based descriptors (i.e. physiochemical properties), the observations are split into

40 groups according to the classes of the response. Next, a one-way ANOVA is conducted on the descriptor by calculating the appropriate test statistic [55]:

푆푆푇푟푒푎푡푚푒푛푡푠/(푎 − 1) 푀푆푇푟푒푎푡푚푒푛푡푠 퐹0 = = (13) 푆푆퐸푟푟표푟/(푁 − 푎) 푀푆퐸푟푟표푟 where SSTreatments is the descriptor variance between response classes, SSError is the descriptor variance within response classes, N is the total number of observations, and a is the number of classes of the response variable. The descriptor is deemed significantly correlated with the response if its associated F-statistic exceeds the 95-th percentile of its respective sampling distribution. Otherwise, the descriptor is dropped from further consideration [55]:

≤ 퐹0.05,훼−1,푁−푎 푡ℎ푒푛 푟푒푡푎𝑖푛 푑푒푠푐푟𝑖푝푡표푟 퐼푓 퐹0 𝑖푠 { (14) > 퐹0.05,훼−1,푁−푎 푡ℎ푒푛 푑𝑖푠푐푎푟푑 푑푒푠푐푟𝑖푝푡표푟

For binary descriptors (i.e. the presence or absence of structural fragments), the Fisher’s exact test is used to determine descriptor significance. This test is appropriate for potentially small, unbalanced data as opposed to large sample tests like that derived from the chi- squared statistic. The relevant test statistic, using the notation of the confusion matrix shown in Table 4, follows the hypergeometric distribution [56]:

푇푃 + 퐹푃 퐹푁 + 푇푁 푇푃 + 퐹푃 퐹푁 + 푇푁 ( ) ( ) ( ) ( ) 푝 = 푇푃 퐹푁 = 퐹푃 푇푁 푇푃 + 퐹푃 + 퐹푁 + 푇푁 푇푃 + 퐹푃 + 퐹푁 + 푇푁 (15) ( 푇푃 + 퐹푁 ) ( 퐹푃 + 푇푁 ) Binary descriptors are deemed significantly correlated with the response if the value of p calculated in Equation (15) is less than the level of significance, taken to be 0.05 for this work. Although not utilized here, a variety of other filtering methods have been proposed

41 such as information gain, mutual information, factor analysis, Shannon entropy, etc. [7],

[21], [31], [32].

The other category of descriptor selection methods, wrappers, optimize which descriptors are best suited for the model through coordination with a supervised learning algorithm [7], [21], [31], [32]. First, the algorithm determines a subset of the available descriptors for consideration. Next, the training data expressed in terms of the descriptor subset is fitted by the learning algorithm and evaluated. The error of the learning algorithm in predicting the data is used as a score to guide the selection of another descriptor subset.

In this sense, the process wraps back around to the subset selection portion of the method, repeating until a stopping criterion is reached. The learning algorithm used for feature selection may be independent from the primary method used for final training and prediction. Wrapper methods enjoy one distinct advantage through the coupled use of a learning algorithm; they can consider the improvement conferred by combinations of several descriptors to the structure-activity relationship. For this reason, they generally outperform most filter methods [7], [21], [31], [32]. However, wrapper methods are much more computationally intensive, have a greater risk of over-fitting the training data, and the selection of subsets is heavily dependent on the learning algorithm employed. The most widely recognized wrapper methods include forward selection, backward elimination, step- wise selection, and genetic algorithm, among others [7], [21], [31], [32].

42 2.3 Local Descriptor Selection

The notion that the importance of individual descriptors toward explaining the response variable is a function of the location within the complete descriptor space is a relatively unexplored topic in QSAR analysis. Guha et al. discussed the notion of incorporating a local feature selection strategy into a QSAR modeling workflow but did not pursue this idea further [46]. Traditionally, descriptor selection identifies a single subset of descriptors which are used for building either global or local models. Indeed, the very concept of distance between training observations and queries depends on first defining a descriptor space. An alternative approach, and a particular aspect of the proposed research, is to allow the composition of descriptor subsets to vary throughout the full descriptor space. Such a process grants each local model the capacity to have its own optimal set of observations and descriptors, though overlap between local models would certainly be possible or even likely depending on the particularities of the data. If a data set is rather large and diverse, multiple mechanisms could be causing the observed variability in the end point. Global descriptor selection algorithms may overlook the influence of certain descriptors if their contribution is subtle or does not occur readily enough throughout the entire descriptor space. A localized approach would strengthen the signal-to-noise ratio of these descriptors by removing irrelevant information. As mentioned, methodologies describing the use of local descriptor sets for prediction are not as prevalent as global selection techniques. Two examples, the

first solving a linear optimization problem and the second utilizing Gaussian Process regressions, are reviewed next.

43 Armanfard et al. introduced the Localized Feature Selection (LFS) method which treats each training observation as a representative of its local descriptor space [57].

Descriptors are chosen such that, for each training observation, the squared Euclidean distance to neighbors with the same class and different classes are simultaneously minimized and maximized, respectively. Over the entire training set, the problem becomes a constrained linear optimization and can be reformulated so that a solution exists. A query is assigned to the class with the largest similarity score; the score is determined by totaling the classes of the nearest training neighbor in all possible local feature sets (i.e. the size of the training set). The LFS method was compared against 6 common descriptor selection algorithms using 10 data sets, 6 of which consisted of cancer-related gene expression microarray data. LFS performed better than all other selection techniques for 9 of the 10 data sets considered [57].

Pichara and Soto devised a classification technique in which the discriminative potential of every descriptor can be estimated for any point in global descriptor space via

Gaussian Process (GP) regressions [58]. A GP regression is learned for each descriptor and estimates that descriptor’s ability to segregate the training data over its domain. Local descriptor subsets are constructed sequentially by adding the most discriminative descriptors at the position of the query. Once the local subset is determined, the query and all training observations are projected to this subspace and a k-nearest neighbors classifier predicts the query’s class. The methodology was tested using 4 data sets: breast cancer biopsy images, letter speech recognition data, infrared astronomical data, and x-ray images.

The technique outperformed 7 common descriptor selection methods for two of the data sets and offered similar performance on the remaining data sets [58]. 44 2.4 Dimensional Reduction

The motivation for dimensional reduction is similar to that for descriptor selection; to reduce the number of descriptor variables, and to alleviate the detrimental effects of too many descriptor variables, before application of a supervised learning algorithm. In many studies, investigators may have no prior knowledge as to the biological endpoint they wish to model and instead wish to “mine” the data set to uncover novel relationships. In these cases, a wide variety of descriptors may be calculated rather indiscriminately. An excess of descriptor variables during QSAR modeling is particularly troublesome given the large number of descriptor variables available for calculation from chemoinformatics software packages.

Furthermore, datasets are not constructed according to experimental designs in terms of molecular structure; that is to say, it is highly likely that descriptors determined from these sets will exhibit moderate to high degrees of correlation. Intercorrelation, or multicollinearity, among predictor variables is a problem for linear learning methods such as ordinary multivariate linear and logistic regression. Ng demonstrates the problem by pointing out that, with a completely orthogonal descriptor set, the coefficients from a p- dimensional multivariate regression are [59]:

푥̃′ 푦̃ 푥̃′ 푦̃ 푥̃′ 푦̃ ̂′ 1 2 푝 훽 = [ ′ , ′ , … , ′ ] (16) 푥̃1푥̃1 푥̃2푥̃2 푥̃푝푥̃푝

As seen in Equation (16), the multivariate regression in the orthogonal case is simply a series of univariate regressions along each dimension. If one were to orthogonalize a descriptor set exhibiting multicollinearity using the Gram-Schmidt procedure, then the response variable is regressed against the residual of 푥̃푗 regressed on 푥̃1, 푥̃2, …, 푥̃(푗−1),

45 푥̃(푗+1), …, 푥̃푝. If 푢̃푗 is such a residual, and if 푥̃푗 is highly correlated with at least one of 푥̃1,

푥̃2, …, 푥̃(푗−1), 푥̃(푗+1), …, 푥̃푝, then 푢̃푗 will be close to 0̃. From Equation (16), the regression coefficient for 푥̃푗 is:

푢̃푗푦̃ 훽푗 = ′ (17) 푢̃푗푢̃푗

This regression coefficient will be highly unstable in the situation described as a result of the intercorrelation with the rest of the descriptors. Aside from multicollinearity, high dimensional datasets pose additional problems for supervised learning collectively referred to as the “curse of dimensionality”. As Verleysen and Francois explain, the first problem is the number of samples required to properly describe a region of the descriptor space which increases exponentially with the number of descriptors [60]. For example, if 10 samples are needed to smoothly fit a relationship in a 1-dimensional space, 100 samples are needed to

fit an equivalent 2-dimensional space, 1,000 samples to fit an equivalent 3-dimensional space, etc. This issue often arises in QSAR studies since the number of predictor variables frequently outnumber the molecules. Second, the geometric properties of high-dimensional spaces are very different from their low-dimensional counterparts. Figure 11 shows, as a function of the dimension d of the space, the volume of a hypersphere of radius 1 (left) and the ratio of the volume of a hypersphere of radius 1 to the volume of a hypercube with side length 2 of which the hypersphere is inscribed (right). As evident from the plots, once the dimension is greater than approximately 20, the volume of the hypersphere is essentially 0.

Furthermore, at dimensions exceeding roughly 10, most of the volume of the hypersphere inscribed within the hypercube is found at the edges of the hypercube. Data drawn uniformly

46 and randomly from such a space would be found at the edges of the hypercube with probability close to 1.

The volume of a hypersphere of radius 1 vs. the dimension of the hypersphere (left). The ratio of a hypersphere of radius 1 inscribed in a hypercube of side length 2 vs. the dimension of the hypersphere/cube (right). Adapted from [60].

Figure 11: Illustration of the effects of high-dimensionality.

In other words, the distance between data points and the center of the distribution in these high-dimensional spaces are large and concentrated in a narrow interval. The concept of locality is heavily distorted; for example, the distance between a query and its neighbors for arbitrary distance metrics becomes approximately equal [60]. As a consequence, learning algorithms which discriminate on distance within descriptor space lose their predictive power. Thus, dimensional reduction techniques are employed to address the problems encountered by high-dimensional, intercorrelated data. 47 The most common dimensional reduction technique is principal component analysis

(PCA). The principal components (PCs) of a data set are the eigenvectors of its variance- covariance matrix; unit vectors oriented in the directions of maximum variance [61]. A 2- dimension example of PCA is shown in Figure 12:

The vector 푒̃ is the 1st PC of the original data (blue circles) and is oriented in the direction of maximum variance. The data are projected linearly onto the first PC (red circles) and are referred to as scores. Inspired from [62].

Figure 12: Illustration of principal component analysis.

All principal components are successively orthogonal. Samples of a p-dimensional data set are projected linearly onto the m-dimensional PCs, where m is usually much less than p.

Information is lost in the p - m components not retained for modeling. PCA is done as a pre- processing step to modeling and is unsupervised; correlation with the response variable has 48 no bearing on the directions of the components. A common approach is to perform multivariate or logistic regression on the PCs, a method known as principal component regression (PCR). While PCR addresses the problems associated with high dimensionality and multicollinearity, as a linear technique, a data set with a relationship that is largely non- linear with the response will likely be lost after the projection. Furthermore, being unsupervised, descriptors important for explaining the endpoint may be lost if they are not correlated with the PCs.

Partial least squares (PLS) regression is a dimensional reduction and learning technique that addresses the unsupervised nature of PCA. The general form of PLS is that of a multivariate regression [63]:

̅ 푌 = 훽0 + 훽1푇1 + 훽2푇2 + ⋯ + 훽푚푇푚 (18) where 푌̅ is the mean value of the response variable, 훽푖 are the regression coefficients for the i-th PLS component or latent variable, 푇1 is the i-th PLS latent variable, and m is the number of PLS latent variables retained, usually much less than the total number of descriptors, p.

The construction of the PLS latent variables is of the most interest. First, each mean- centered descriptor is fitted to the mean-centered response variable univariately. Next, the

PLS latent variable is constructed as a weighed linear combination of the mean-centered descriptors according to [63]:

푇푖 = ∑ 푤푖푗 푏푖푗 (푋푗 − 푥푗̅ ) (19) 푗=1 where 푤푖푗 is the weighting for the i-th PLS latent variable on the j-th descriptor variable, 푏푖푗 is the univariate regression coefficient of the j-th descriptor fit against the response for the

49 i-th PLS latent variable, 푋푗 is the j-th descriptor variable, and 푥푗̅ is the mean vector for the j-th descriptor. Subsequent variability in the descriptor data orthogonal to 푇푖 may be useful for explaining the endpoint. For subsequent PLS latent variables, “residual variability” in Y is regressed against “residual information” for each descriptor variable according to [63]:

′ 푡푖 푣푖푗 푉(푖+1)푗 = 푉푖푗 − ( ′ ) 푇푖 (20) 푡푖 푡푖

′ 푡푖 푢푖 푈푖+1 = 푈푖 − ( ′ ) 푇푖 (21) 푡푖 푡푖

푇푖+1 = ∑ 푤(푖+1)푗푏(푖+1)푗푉(푖+1)푗 (22) 푗=1 where 푉푖푗 = 푋푗 − 푥푗̅ correlated with the i-th PLS latent variable, 푈푖 = 푌 − 푦̅ correlated with the i-th PLS latent variable, 푏(푖+1)푗 is the univariate regression coefficient of 푈푖+1 regressed against 푉(푖+1)푗, and the lower-case variables are their sample equivalents. This process repeats until the final number of PLS latent variables is reached, a value usually determined via cross-validation. A 2-dimensional example of PCA compared to PLS is shown in Figure

13:

50

The PC component, 푒̃ (black dashed line), is oriented in the direction of maximal variance irrespective of class. The PLS latent variable, 푞̃ (green dashed line), is oriented in the direction of maximal covariance between the descriptor data and response classes. Inspired from [64].

Figure 13: Illustration of partial least squares regression compared to principal component analysis.

Similar to PCA, the PLS latent variables are pair-wise orthogonal to each another. In the majority of instances, PLS outperforms PCR due to its consideration of the response variable during construction [63]. Both PCA and PLS constitute linear transformations in the descriptors. However, if molecules are hypothesized to follow a non-linear placement in high dimensional descriptor space, then linear projections will distort this structure after projection, perhaps leading to the false identification of a test molecule’s neighbors. The concept is illustrated in Figure 14:

51

PCA – Principal Component Analysis; LLE – Locally Linear Embedding. Taken from [65].

Figure 14: Linear (PCA) and non-linear (LLE, IsoMap) dimensional reduction techniques applied to hypothetical S-shaped data.

Many non-linear dimensional reduction techniques have been proposed to preserve, if they exist, the structure of non-linear data on a low-dimensional manifold embedded in a high- dimensional space. These methods include Sammon mapping, curvilinear components analysis, stochastic neighbor embedding, IsoMap, maximum variance unfolding, locally linear embedding, and Laplacian eigenmaps [66], [67], [68]. In Figure 14, the original data, albeit manufactured, conforms to an “S” shaped manifold in 3-dimensional space. The PCA

52 transformation fails to discovery and preserve the structure of the manifold, and clearly some of the samples which are distinctly distance from each other in 3-dimensional space are now in close proximity in 2-dimensional space. The non-linear dimensional reduction techniques, specifically locally linear embedding and IsoMap, unravel the manifold and maintain the global and local characteristics of the original data set. As part of the proposed work, a non-linear dimensional reduction technique, namely t-distributed stochastic neighbor embedding (t-SNE), will be compared with PLS. The goal of this modification will be to understand if non-linear dimensional reduction techniques can retain the inherit non-linear character of the datasets, should they exist, and enhance the predictive performance of a local QSAR modeling workflow. The t-SNE algorithm is now explained in detail.

2.4.1 t-Distributed Stochastic Neighbor Embedding

On the topic of manifold learning, van der Maaten and Hinton point out that dimensional reduction methods which operate by linear transformation prioritize keeping the low- dimensional representations of dissimilar data points separate [68]. In contrast, for high- dimensional data which lies upon or near a low-dimensional, non-linear manifold, the authors note it is often necessary to keep low-dimensional representations of highly similar data points close together. This latter necessity is usually not possible with linear reduction techniques. On the other hand, many non-linear reduction methods lose global structure while trying to preserve local structure. As a modification to stochastic neighbor embedding, van der Maaten and Hinton propose t-distributed stochastic neighbor embedding (t-SNE) in

53 an effort to maintain both local and global trends in the high-dimensional data following reduction to the lower dimension. While the methodology is designed primarily for , it is possible to adapt the technique for modeling and prediction. In order to understand t-SNE, it is first necessary to review stochastic neighbor embedding (SNE).

The foundation of SNE rests upon the representation of similarity between data points thought of as Gaussian-centered conditional probabilities [68]. In high-dimensional space, the similarity of 푥푖 and 푥푗 is modeled by 푝푗|푖, or the probability that 푥푖 would select

푥푗 as its neighbor if neighbors were selected in proportion to their probability density under a Gaussian situated at 푥푖, according to:

2 2 푒푥푝 (−‖푥푖 − 푥푗‖ /2휎푖 ) 푝푗|푖 = 2 2 (23) ∑푘≠푖 푒푥푝 (−‖푥푖 − 푥푘‖ /2휎푖 ) where 휎푖 is the variance of the Gaussian centered at 푥푖. From Equation (23), it is evident that if 푥푖 and 푥푗 are in close proximity, then 푝푗|푖will be large. Oppositely, 푝푗|푖 approaches 0 for distantly-spaced points. The value of 푝푖|푖 is set to zero since it is of no interest to compare an object’s similarity with itself. Likewise, the similarity between 푦푖 and 푦푗, the low- dimensional mapping of points 푥푖 and 푥푗, respectively, is modeled by 푞푗|푖 according to:

2 푒푥푝 (−‖푦푖 − 푦푗‖ ) 푞푗|푖 = 2 (24) ∑푘≠푖 푒푥푝 (−‖푦푖 − 푦푘‖ )

Since every point in the high-dimensional space has its own Gaussian distribution of variance 휎푖, and the lower-dimensional representation does not have this feature, it is not possible for the lower-dimensional mapping to perfectly model its higher dimensional counterpart. Naturally, if the reduction remains true to the original data, then 푝푗|푖 and 푞푗|푖

54 will be equal. SNE seeks to minimize the pair-wise error between 푝푗|푖 and 푞푗|푖 using the sum of the Kullback-Leibler divergences as a cost function:

푝푗|푖 퐶 = ∑ ∑ 푝푗|푖푙표푔 ( ) (25) 푞푗|푖 푖 푗

The cost function in Equation (25) punishes mapping errors asymmetrically; if two close data points are mapped far apart (i.e. 푞푗|푖 ≪ 푝푗|푖), the cost is inflated much greater than if two distant data points are mapped close together (i.e. 푞푗|푖 ≫ 푝푗|푖). It is for this reason that the SNE mapping is said to emphasize retainment of the local structure of the data [68]. The value of the variance of the Gaussians centered on each high-dimensional data point xi is determined by a user-defined value named the perplexity:

퐻 푝푒푟푝푙푒푥𝑖푡푦 = 2 (26)

퐻 = − ∑ 푝 푙표푔 (푝 ) 푗|푖 2 푗|푖 (27) 푗 where H is the Shannon entropy measured in bits. The perplexity, viewed as a smoothing parameter over the effective number of neighbors, has been found to work well with values between 5 to 50 [68]. Smaller and larger perplexities tune the influence of local and global structure, respectively. Resolving the transformation through minimization of Equation (25) involves implementing a gradient descent according to:

휕퐶 훾(푡) = 훾(푡−1) + 휂 + 훼(푡)(훾(푡−1) − 훾(푡−2)) 휕훾 (28) where 훾(푡) is the set of points for the k-th generation, 휂 is the step size, 휕퐶⁄휕훾 is the gradient of the cost function with respect to the map points, and 훼(푡) is an exponential decay term (i.e. decreases to zero exponentially as generations progress). Quite simply, Equation 55 (28) represents successive movement in the direction of decreasing cost. The gradient descent is initialized by sampling lower-dimensional points from an isotropic Gaussian of small variance centered at the origin. With the basics of SNE complete, it is now possible to consider the modifications which constitute t-SNE.

The t-SNE algorithm differs from SNE through two improvements: 1) use of a symmetric cost function and 2) use of a heavy-tailed distribution for calculating probabilities in the lower-dimensional space [68]. The first alteration, the symmetric cost function, is given by:

푝푖푗 퐶 = ∑ ∑ 푝푖푗 푙표푔 ( ) 푞푖푗 (29) 푖 푗

The key difference between Equation (25) and Equation (29) are the conditional and joint probabilities, respectively. In Equation (25), there is no such stipulation that 푝푗|푖 = 푝푖|푗 or

푞푗|푖 = 푞푖|푗. On the other hand, in Equation (29), it is true that 푝푖푗 = 푝푗푖 and 푞푖푗 = 푞푗푖 for all i, j. The high-dimensional joint probabilities are defined in terms of the previous conditional probabilities according to:

푝푖|푗 + 푝푗|푖 푝 = 푝 = 푖푗 푗푖 2푛 (30) where n is the number of observations. This benefit of this change is a gradient which is faster to compute compared to SNE. Second, the low-dimensional joint probabilities are calculated using a Student’s t-distribution with one degree of freedom:

2 −1 (1 + ‖푦푖 − 푦푗‖ ) 푞푖푗 = 2 −1 (31) ∑푘≠푙(1 + ‖푦푘 − 푦푙 ‖ )

56 The purpose of using a distribution with heavier tails is to avoid a shortcoming associated with the SNE algorithm known as “crowding” [68]. Compared to a lower-dimensional space, a higher-dimensional space contains more volume to separate distance points. This concept is best illustrated by a simple example. In 2 dimensions, it is possible for 3 points to be spaced equidistant from one another in the form of an equilateral triangle. However, in 1 dimension (i.e. line), there is no option to preserve this equidistant, pair-wise orientation. Consequently, at least one pair of points will be separated farther in the lower- dimensional space than its corresponding separation in the high-dimensional space. In terms of joint probabilities, 푞푗푖 ≪ 푝푗푖 since ‖푦푖 − 푦푗‖ is greater for such a pair of points.

Extending this phenomenon to a data set of greater dimension and many more samples, the

SNE optimization will attempt to “correct” for these representation errors by attracting map points closer together, resulting in the inappropriate crowding of already proximal, low- dimensional points. A distribution with heavier tails counteracts the aforementioned attractive force by allocating greater probability to 푞푖푗 for distantly-spaced map point pairs.

Ultimately, t-SNE takes advantage of SNE’s ability to model similar data points close together in lower-dimensional space while at the same time trying to retain dissimilar data points farther apart in lower-dimensional space. These capabilities make t-SNE better equipped to capture the local and global structure of high-dimensional, non-linear data sets.

Inspection of the t-SNE algorithm on a frequently used data set serves to better familiarize the reader with its functionality. A classic problem common to pattern recognition is the identification of handwritten digits. Although unrelated to chemoinformatics and QSAR modeling, examining these techniques applied to different high-dimensional data sets can be insightful. Derksen visually compared PCA and t-SNE 57 applied to the MNIST database of handwritten digits using the Python programming language, the code of which has been modified for this presentation [69]. The MNIST database contains 70,000 images of handwritten digits each of size 28 x 28 pixels. Each pixel has a grayscale intensity value ranging from 0 (black) to 1 (white). Therefore, each sample lies in a 784-dimensional descriptor space where each pixel is an individual descriptor. Illustrating the input data, 15 randomly-selected digits are shown in Figure 15:

Figure 15: MNIST handwritten digits.

For computation expediency, the analysis presented here is limited to 10,000 of the total

70,000 samples contained within the MNIST dataset. First, PCA is applied to the 10,000 samples as shown in Figure 16:

58

Each point represents a digit such as those shown in Figure 15. The points are color-coded by digit as illustrated in the figure’s legend. “PC1” and “PC2” refer to the first and second principal components, respectively.

Figure 16: Principal component analysis of the MNIST handwritten digits.

The first and second principal components shown in Figure 16 only retain approximately

9.8% and 7.0% of the variability of the data set, respectively. Scrutinizing the plot more closely, PCA is capable of loosely grouping the digits into separate clusters. For example, the 0’s (blue) on the right side of the plot are segregated completely from the 1’s (orange) toward the upper left side of the plot. However, there is considerable overlap of several digit groups at the origin and between similarly-shaped digit clusters at their respective locations

(e.g. the 7’s (gray) and 9’s (cyan) occupy much of the same positions toward the bottom left of the plot). Next, the t-SNE algorithm is applied to the same subset of handwritten digits.

59 van der Maaten and Hinton recommend reducing the dimension of an excessively high dimensional data set to less than or equal to 50 dimensions before applying t-SNE.

Therefore, PCA is used to reduce the dimensionality to 50 components before proceeding with t-SNE. Additionally, the perplexity is set at the default value of 30. The results of the algorithm are shown in Figure 17:

Note: “Dim1” and Dim2” refer to the two t-SNE embeddings, respectively.

Figure 17: t-Distributed stochastic neighbor embedding applied to the MNIST handwritten digits.

Examining the resulting figure, the most striking difference is t-SNE’s ability to partition the digits into largely distinct clusters. Thus, it would be said that t-SNE is preserving the local character of the data set by keeping similar samples in close proximity within the

60 embedded space. On the other hand, the PCA and t-SNE representations do share a number of similarities. First, clusters of similarly-shaped digits do have a degree of overlap. As an example, the 3’s (red) form a secondary cluster within the primary 5’s cluster (brown), itself having a secondary cluster. Second, each of the clusters have other digits nested inside.

Scrutinizing the plots of each method from a global perspective, the clusters of similarly- shaped digits are positioned close to each other. For example, the 3’s (red), 5’s (brown), and

8’s (yellow)all occupy the center of the plot. Furthermore, the 0’s (blue) and 1’s (orange) are placed on the opposite sides of the reduced dimensional representation, similar to the

PCA representation. Regarding a known drawback of t-SNE, the exact placement of clusters and the distances between clusters appear to be arbitrary. As noted by Wattenberg, Vigas, and Johnson on their commentary of the t-SNE algorithm, the global configuration of sample clusters within the reduced space is not guaranteed to be meaningful [70], [71], [72].

In all, t-SNE is a stochastic process highly dependent upon its input parameters and designed primarily for visualization [68], [71]. A parametric form of t-SNE utilizing a deep neural network for regression or classification tasks was put forth by van der Maaten in 2009 [73].

The benefits, if any, of the application of such an algorithm to a high-dimensional, chemical data set are presently unexplored.

2.5 Learning Algorithms

A number of statistical models and machine learning algorithms have been for formulated to explain trends in experimental data and predict future observations. In 2015, Devinyak and Lesyk reviewed the literature to uncover the most frequently used learning methods in

61 QSAR analysis [9]. The authors conducted a bibliometric-based analysis of the QSAR literature from 2009 to 2014 using the top-10 molecular modeling and medicinal chemistry journals ranked by Google Scholar. The most widely used QSAR modeling methods from molecular modeling journals in 2009 and 2014, in order from highest to lowest occurring, are shown in Table 5:

Table 5: The most frequent QSAR modeling methods in years 2009 and 2014.

PLS-2D and PLS-3D refer to the use of PLS for model building using 2D and 3D descriptors, respectively. Taken from [9].

2009 2014 Approximate Approximate Method Method percentage percentage PLS-3D 32 PLS-3D 28 MLR 29 RF 19 SVM 12 MLR 11 PLS-2D 10 SVM 10 ANN 5 PLS-2D 8 RF 4 NB 7

Over the 5-year period examined, PLS-3D remained the most frequently used modeling technique. Other machine learners, such as random forest and naive Bayes, witnessed roughly 4-fold and 8-fold increases in frequency, respectively. Multiple linear regression, a conventional technique, saw a substantial drop-off in usage and is increasing viewed as

“inferior to more complex and advanced” machine learning approaches [9]. The intent of this subsection is to provide a brief summary of the most common learning algorithms applied in QSAR modeling. 62 The most straightforward method for describing the relationship between a quantitative response variable and a set of descriptor variables is multiple linear regression

(MLR). The MLR model assumes the response, Y, can be decomposed into its mean value dependent upon a linear function of the descriptors and a normally-distributed error term

[6], [53], [61], [74]. Expressed in matrix notation [53], [61]:

′ 2 푌̃ = 퐸(푌̃) + 휀̃ = 푋̃ 훽̃ + 휀̃ where 퐸(휀̃) = 0̃ and 퐶표푣(휀̃) = 휎 퐼̃ (32) where 퐸(푌̃) is the expected value of the response variable, 푋̃ is the design matrix, 훽̃ contains the regression coefficients, and 휀̃ is the term accounting for measurement error and the effects of any variables not directly considered by the model. The least squares estimates of the regression coefficients in Equation (32) are as follows:

푏̃ = (푋̃′푋̃)−1푋̃′푦̃ (33) where 푦̃ are the sample response values. The MLR model is easy to understand and interpret; for example, the estimate, 푏푗, the regression coefficient of the j-th descriptor, informs the investigator how the response is expected to change with respect to a unit increase in the corresponding descriptor when all other descriptors are held constant. On the other hand,

MLR suffers from instabilities when the number of samples is too few compared to the number of descriptors or when the descriptor variables are highly intercorrelated, both frequent occurrences in QSAR studies [53], [59], [61]. Lastly, for complex, likely non-linear chemical phenomena encountered in QSAR, the predictive performance of MLR is often below that of other machine learning algorithms [6].

QSAR models are often developed to solve classification problems. For instance, an investigator might want to determine if a chemical ingredient in a commercial product poses

63 a mutagenetic risk, or if a psychoactive drug candidate will permeate the blood-brain barrier to reach its target. The experimental data of such problems include dichotomous response variables, that is, taking a value of 0 if a compound is inactive and 1 if it is active. It is convenient to treat the response variable of such problems as a Bernoulli random variable.

Recalling the discrete probability distribution of a Bernoulli random variable [53]:

푃(푌 = 1) = 휋 and 푃(푌 = 0) = 1 − 휋 (34) where π is the probability of a positive outcome. If the MLR model from Equation (32) is applied to this situation, then it is evident that the probability of the response taking a value of 1 is a linear function of the descriptors. From the perspective of MLR, such a model is problematic for several reasons: 1) the error terms are not normally distributed, 2) the error variance is not constant, and 3) the linear function of the descriptors is unbounded whereas the probability it is modeling is given by [53]:

0 ≤ 퐸(푌̃) = 휋̃ ≤ 1 (35)

A solution to the aforementioned shortcomings of applying the MLR model to a binary response variable is to instead describe the expectation of the response using a logistic function:

′ 푒푋̃ 훽̃ 퐸(푌̃) = 휋̃ = (36) 1 + 푒푋̃′훽̃ The logistic regression coefficients in Equation (36) are found numerically via maximum likelihood estimation [53]. To illustrate, an example logistic regression with two descriptor variables is shown in Figure 18:

64

The first two plots show the marginal fitted probability curves. The third plot shows the decision boundary in descriptor space. “LV1” and “LV2” refer to the first and second PLS latent variables, respectively.

Figure 18: Example logistic regression with two descriptors variables.

The two left-most plots in Figure 18 show how the probability of a positive outcome changes with corresponding descriptor variable in addition to the data producing the fit. The right- most plot shows the logistic regression decision boundary in the descriptor space; observations to the right and left of the boundary are predicted to be positive and negative, respectively. Logistic regression has the same advantages and disadvantages of MLR.

Similar to MLR, a unit increase in a descriptor 푋푗 translates into an change in the estimated odds of the response, 휋̂⁄(1 − 휋̂), by a factor of 푒푏푗 . However, logistic regression still employs a linear decision boundary and therefore cannot adapt well to non-linear phenomena. An almost identical method functionally to logistic regression is linear discriminant analysis (LDA) [61]. LDA projects the data onto a vector which maximizes the ratio of the interclass to intraclass sample variance of the response variable. Whereas logistic regression makes no assumptions about the distribution of the descriptor variables, 65 LDA assumes the descriptors are distributed multivariate normally with equal variance- covariance matrices [61], [75].

Another simple classification algorithm is the Naive Bayes classifier. As discussed by Mitchell, this algorithm makes predictions using Bayes theorem and assumptions of conditional independence [76]. The objective of the classifier is to learn 푃(푌|푋), the probability of a positive outcome given descriptor data. Using the extended form of Bayes theorem and sample values, this probability may be estimated by the following [76]:

푃(푋 = 푥푘|푌 = 푦푖)푃(푌 = 푦푖) 푃(푌 = 푦푖|푋 = 푥푘) = (37) ∑푖 푃(푋 = 푥푘|푌 = 푦푖)푃(푌 = 푦푖) where 푦푖 is the i-th value of the discrete random variable 푌 and 푥푘 is the k-th value vector of a collection of p variables such that 푋 = [푋1, 푋2, … , 푋푝]. The estimation of 푃(푋|푌) can be grossly simplified by assuming each 푋푗 is mutually independent of all other 푋푘’s given

푌. A classifier will determine the value of 푌 given the set of descriptor variables

푋1, 푋2, … , 푋푝, or, mathematically [76]:

푃(푋1, 푋2, … , 푋푝|푌 = 푦푖 )푃(푌 = 푦푖) 푃(푌 = 푦푖 |푋1, 푋2, … , 푋푝) = (38) ∑푖 푃(푋1, 푋2, … , 푋푝|푌 = 푦푖)푃(푌 = 푦푖 )

Assuming conditional independence in the descriptor variables, Equation (38) becomes

[76]:

푃(푌 = 푦푖) ∏푘 푃(푋푘|푌 = 푦푖) 푃(푌 = 푦푖|푋1, 푋2, … , 푋푝) = (39) ∑푖 푃(푌 = 푦푖) ∏푘 푃(푋푘|푌 = 푦푖)

Predicting the class for a query consists of finding the value of 푌 for which Equation (39) is maximized.

66 Support vector machines (SVMs) are typically employed for learning QSAR classification problems that are not easily separable linearly. The goal of SVM is to transform the samples of the original descriptor space into a space of higher dimensions such that a hyperplane separating the classes is optimized. Such optimization involves maximizing the margin, or distant between the points closest to the discriminating hyperplane, those points themselves referred to as support vectors [6], [8], [49], [77], [78].

From linear algebra, distances from the margin are determined via computation of dot products. The transformation to the higher-dimensional space, which may not be explicitly known, is accomplished via the “kernel trick”; calculating the kernel of two points within the input space can be shown to be equivalent to calculating the dot product of the points in the higher-dimensional space [6], [8], [49], [77], [78]:

′ 퐾(푥̃푖, 푥̃푗) = 휙(푥̃푖) 휙(푥̃푗) (40) where 퐾 denotes the kernel function and 휙(푥̃푖) the high dimensional representation of the data pointr 푥̃푖 in the original descriptor space. It is through the choice of kernel function, with polynomial or radial basis functions being the most common, that a non-linear boundary (relative to the input space) can be learned. A shortcoming of SVMs, among technical difficulties, is that interpreting the results of the trained model can be difficult or impossible due to the obfuscation introduced by the kernel function [6], [8], [49], [77], [78].

The random forest (RF) algorithm is a popular learning method within the field of QSAR modeling [6], [8], [49]. As outlined by Lewis and Wood, the method consists of a large number (e.g. 100 to 500) of individual decision or regression trees together constituting a "forest" [6]. Each tree is constructed using a random sample of observations with replacement and a random subset of descriptors considered at each tree node split. Averaging or summing the predictions across the forest after training typically produces accurate predictions. The method is advantageous since the importance of individual descriptors can be gleaned from the fitted model and the variance of predictions across the trees can be used to estimate the overall error [6].
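A hedged sketch of this procedure using scikit-learn's RandomForestClassifier (toy data; the parameter choices are illustrative, not those of any study cited here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A "forest" of 200 trees; each tree sees a bootstrap sample of observations
# and a random subset of descriptors at every split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X, y)

# Descriptor importances can be read directly from the fitted model.
print(rf.feature_importances_)
```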

Finally, artificial neural networks are exhibiting some resurgence in QSAR modeling with the emergence of deep learning. As discussed by Mitchell and Lo et al., an artificial neural network (ANN) is a machine learning approach based upon a rudimentary understanding of the structure and organization of biological neurons [8], [49], [78]. In the context of ANN, a "neuron" is essentially a function, such as the logistic function, which calculates an output, or "activation", given several inputs. The output of a neuron is its activation function applied to a linear combination of the neuron's inputs and the set of weights associated with that neuron.

The simplest of ANNs consists of three layers of neurons: an input layer, a hidden layer, and an output layer. Multiple neurons are associated with each layer and layers may be connected consecutively. The output layer has one neuron per response variable to be predicted and is therefore capable of modeling multiple responses. During the training process, a gradient descent optimization is conducted via the backpropagation procedure. The ultimate result of the optimization is to adjust the neuron weights to learn abstract features and minimize predictive error [8], [49], [78]. Queries are assessed by passing their descriptors as input through the final network. Deep learning is an extension of traditional ANNs characterized by many hidden layers and more complex architectures [8], [49]. While the performance of ANNs may rival the best machine learning techniques, they often require large training sets and are computationally slow to learn [8], [49], [78]. Furthermore, the importance of the descriptors is lost as the signal traverses the network [78].
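As a minimal illustration of this three-layer architecture, the sketch below fits a one-hidden-layer network with logistic activations using scikit-learn; the data and layer sizes are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Input layer (20 descriptors) -> one hidden layer of 10 logistic neurons
# -> output layer; weights are adjusted by gradient descent with backpropagation.
ann = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                    solver="sgd", learning_rate_init=0.1, max_iter=2000,
                    random_state=0)
ann.fit(X, y)
print(ann.predict(X[:5]))
```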

2.6 Research Objectives

In the context of chemical space and QSAR modeling, it is reasonable to hypothesize that the training molecules most similar to a query compound are of most interest for predicting its biological activity. Large, diverse data sets in which multiple underlying mechanisms likely explain the observed relationships between chemical structure and in vivo response favor this expectation. Evidence to substantiate these notions is provided by the previous work of Guha et al. and Buchwald et al. Therefore, the first objective of this work is to implement a local QSAR modeling method. This method will adopt a radius-based approach to gather, upon presentation of a query for prediction, a subset of proximal training samples which it will use to learn a model specific to the query.

Another facet of local QSAR modeling is the idea that the descriptors most relevant toward capturing the structural information correlated with the endpoint of interest may be dependent upon the location in chemical space. Biological activity is frequently the result of multiple biochemical pathways, and thus different structural features can ultimately induce similar effects. QSAR studies on congeneric sets of compounds are typically more successful from a learning perspective since non-congeneric sets containing information across several mechanisms can serve to confuse the model [1]. Allowing local QSAR models the degree of freedom to select descriptors most pertinent to their region of the chemical space can prevent the introduction of noise from global or other local trends. The limited work on this topic by Armanfard et al. and Pichara and Soto supports this line of reasoning, although their approaches are somewhat complex and unexplored in the field of QSAR analysis. Thus, a secondary objective of the proposed work is to conduct descriptor selection on the local training data of each query. This will be accomplished by re-introducing all descriptor data once a local training subset has been identified, followed by the use of descriptor selection to tailor the final descriptor subset specifically for the query in question.

Critical to a majority of QSAR investigations is dimensional reduction. Data sets with many features, especially those lacking the requisite number of training samples for characterizing such an expansive chemical space, will suffer from the effects of the curse of dimensionality and multi-collinearity. Dimensional reduction techniques like PCA have been employed in a multitude of studies to alleviate these problems. However, techniques such as PCA constitute a linear projection from the original space, and thus can distort the true configuration of the observations. Naturally, this is a complication for local modeling strategies, which depend upon the proper identification of training instances close to the query. The aforementioned distortion will be more extreme if the data conforms to a non-linear manifold. Addressing these issues in turn, the tertiary objective is to incorporate both PLS and t-SNE as options for dimensional reduction. The data will be processed globally using these methods before local subsets are identified for query-specific modeling.

The fourth and final objective of this work focuses on model interpretability. While non-linear machine learning methods such as Support Vector Machines are known for their superior predictive performance, their obscurity and lack of mechanistic transparency hinder their acceptance in molecular design and regulatory settings. The proposed workflow retains the use of linear methods specifically to explain the correlations between the descriptor and response variables. In some instances, depending on the nature of the data, local linear models can approximate non-linear behavior quite well. In cases where the proposed approach executes a non-linear dimensional reduction technique, the original descriptor space is lost to interpretability. By utilizing local linear models at the query level after training subsets have been obtained, a meaningful descriptor space is restored.

The satisfaction of these objectives by the proposed method constitutes a novel QSAR modeling workflow which aims to achieve predictive performance comparable to existing techniques while providing explanatory information specifically tailored for each test instance.

Chapter 3: Methodology

3.1 Proposed Methodology

The primary purpose of this chapter is to leave the reader with a clear, detailed understanding of the proposed local QSAR methodology. Accomplishing this task begins by reviewing a general QSAR modeling workflow like those discussed in Chapter 1.

Constructing a model begins by obtaining a training data set containing the response variable(s) of interest. Applying the model requires a test data set for which predictions on the response variable are desired. This test set may be generated as a subset of the training set through such processes as cross-validation or it may be truly external to the training data. In the former case, model predictions are compared with the actual endpoint values to characterize performance. It is assumed here that the training and test sets have already been curated so that the digital representations of the molecules used in any experimental assays are correct and that the numerical data is overall free from errors. Furthermore, it is assumed that the molecules of the training and test sets have already been represented by a set of chemical descriptors. The details regarding the particular curation and descriptor calculation software chosen in this work will be reviewed later in this chapter. The general model training and prediction-generating process is as follows:

1. The training and test sets undergo pre-processing, where samples with missing values are resolved, descriptors are removed based on a variance threshold, and numerical descriptors are mean-centered and standardized to remove bias resulting from scale.
2. A descriptor selection algorithm may be applied to the data.
3. A dimensional reduction algorithm may be applied to the data.
4. An algorithm is applied to learn the applicability domain of the model, deciding which test observations can be predicted with sufficient confidence.
5. A learning algorithm is applied to find a mapping from the descriptor to response space.
6. The mapping from the learning algorithm is applied to the test set(s) and the performance evaluated.

It should be noted that this modeling workflow is not rigid; it can take on various forms depending on the nature of the algorithms selected to complete each step. For example, if a wrapper method is used for descriptor selection, then the overall algorithm will iterate between the descriptor selection and learning algorithms until a stopping condition is obtained. The process is summarized graphically in Figure 19.


Figure 19: General QSAR modeling workflow.
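A rough sketch of steps 1 through 3 and 5 of this workflow in scikit-learn (the package used later in this work) may help fix ideas; step 4, the applicability domain, is handled separately later in this chapter, and all data and parameter values below are placeholders:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for descriptor matrices and activities.
X_train, y_train = make_classification(n_samples=300, n_features=50, random_state=0)
X_test = X_train[:20]  # stand-in for an external test set

# Step 1. Pre-processing: scale the test set with the TRAINING statistics.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Step 2. Univariate descriptor selection (ANOVA F-test, as an example).
selector = SelectKBest(f_classif, k=20).fit(X_train_s, y_train)
X_train_f, X_test_f = selector.transform(X_train_s), selector.transform(X_test_s)

# Step 3. Dimensional reduction with PLS (binary endpoint coded 0/1).
pls = PLSRegression(n_components=2).fit(X_train_f, y_train)
T_train, T_test = pls.transform(X_train_f), pls.transform(X_test_f)

# Step 5. Learning algorithm mapping the reduced space to the response.
clf = LogisticRegression().fit(T_train, y_train)
print(clf.predict(T_test))
```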

At this point, adapting the general workflow for local modeling requires inclusion of a mechanism to identify which training observations are in proximity to the test queries in the global descriptor space. Such identification is invoked before the learning algorithm is applied. The word "global" denotes that all operations performed on the data up to now have involved the descriptor space characterized by the entire training set. Proceeding forward, two approaches are possible: identify the test observation's k-nearest training neighbors, or identify all training samples contained within a hypersphere of pre-defined radius r in descriptor space centered at the query. The former approach is advantageous in that it can adapt to the varying density of the data set and guarantee a local training set of size k. On the other hand, the k-nearest neighbors of relatively isolated queries may be at such a great distance as to be too dissimilar to represent the query's structure appropriately. The alternative, a radius-based approach, will gather local training sets of various sizes contingent on the density of the data set and, for isolated queries, may not identify enough training samples to properly learn a model. However, the proximity of the data to the test instances is guaranteed to be preserved by the bound. In this work, a radius-based approach is used for locality assessment, sacrificing the ability to predict every query for preservation of proximity in the global descriptor space.

Once "local" training sets have been identified for each test query, a simple means of building a predictive model for each instance is to repeat the general QSAR modeling process outlined in Figure 19. Before executing the general workflow, all descriptors, some of which may have been eliminated due to statistical insignificance with the response variable during descriptor selection at the global level, are re-introduced at the local level. This distinction separates the proposed methodology from the majority of previous local QSAR studies, which use the globally significant subset of descriptors to learn a model for each query. In contrast, the workflow presented herein performs descriptor selection on the whole set of descriptors in combination with the local training samples specific to each test observation, allowing each descriptor subset to be unique to the test observation in question.

The final algorithm for the proposed work can now be described by the following steps, divided into “phases” for clarity:

3.1.1 Pre-processing Phase

The purpose of pre-processing is to prepare the data for model building by addressing issues such as missing sample information, differences in scale among descriptors, and sparsity of binary descriptors.

1. Training and test molecules with missing values on any of the chemical descriptors are removed from the data set. An alternative "data-saving" option would be mean imputation of missing values, but this is not performed in this work.
2. Chemical descriptors with no variance are eliminated from the data set. The remaining continuous-valued descriptor data is mean-centered and standardized to avoid introducing scale biases during dimensional reduction and parameter estimation of the learning algorithm.
3. Binary descriptors, such as those representing the presence or absence of certain molecular fragments, are removed if their frequency among the training molecules does not exceed a specified value, set to 3 in this work.
4. The test data is likewise mean-centered and standardized using the statistics of the training data. A sketch of these pre-processing steps appears below.
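The following is a minimal sketch of the pre-processing phase, assuming the descriptor data is held in hypothetical pandas DataFrames with binary columns coded 0/1; the helper name is illustrative:

```python
import pandas as pd

def preprocess(train, test, min_count=3):
    """Sketch of the pre-processing phase on hypothetical descriptor DataFrames."""
    # Step 1: drop molecules with missing descriptor values (no imputation).
    train, test = train.dropna().copy(), test.dropna().copy()

    # Step 2: remove zero-variance descriptors.
    keep = train.columns[train.var() > 0]
    train, test = train[keep].copy(), test[keep].copy()

    # Step 3: remove sparse binary descriptors (frequency <= min_count).
    is_binary = train.isin([0, 1]).all()
    sparse = is_binary & (train.sum() <= min_count)
    train, test = train.loc[:, ~sparse].copy(), test.loc[:, ~sparse].copy()

    # Steps 2 and 4: mean-center and standardize continuous columns
    # using the TRAINING statistics for both sets.
    cont = train.columns[~is_binary[train.columns]]
    mu, sd = train[cont].mean(), train[cont].std()
    train[cont] = (train[cont] - mu) / sd
    test[cont] = (test[cont] - mu) / sd
    return train, test
```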

3.1.2 Global Phase

The primary purpose of the global phase is to identify subsets of training samples which are in proximity to the test instances. A secondary objective is defining the method's applicability domain. Finally, conclusion of the global phase results in a global model which is compared to the subsequently developed local models to investigate which strategy offers the best performance.

1. The training and test molecules are cast into a single set of descriptors. This descriptor set comprises the "global" descriptor space. Two options are explored and compared within this work: either using all available descriptors of non-zero variance or refining the set using descriptor selection. Regarding the latter, univariate descriptor selection is employed to find descriptors correlating well with the endpoint.
   a. For continuous and count-based descriptors, a one-way analysis of variance (ANOVA) is conducted between the two levels of the response (active and inactive).
   b. For binary descriptors, Fisher's exact test is applied. For all tests, a level of significance of 0.05 determines the cutoff between descriptors retained and eliminated.
   Global models with and without univariate descriptor selection enabled are compared to understand how this process affects performance.
2. The dimensionality of the global descriptor space is reduced in preparation for the application of a distance metric for applicability domain and locality assessments. Partial least squares (PLS) regression and t-distributed Stochastic Neighbor Embedding (t-SNE) are used exclusively in this workflow. The two methods are not used together within the same model; the performance of models using each is compared. Dimensionality reduction is necessary to avoid complications associated with high-dimensional spaces and to further discriminate descriptor data correlated with the endpoint.
3. The density of the training data around the test queries is evaluated to define the model's applicability domain. Test molecules with few or no training neighbors represent an extrapolation and therefore should be excluded from prediction. This is accomplished using a two-parameter approach (sketched after this list):
   a. A nearest neighbor integer k and quantile value q are chosen. For each training molecule, the Euclidean distance to its k-th nearest neighbor among all remaining training molecules is determined.
   b. The distance corresponding to the q-th quantile of the distribution of neighborhood distances determines the query domain threshold. If any test molecule does not have at least k training molecules within a hypersphere of radius corresponding to the query domain threshold, then the test molecule is said to be out of domain and will not be predicted. For this work, k and q are taken to be 3 and 0.95, respectively.
4. The training molecules in proximity to test molecules are found by examining the latter's locality in the resolved descriptor space. Training molecules local to a test query constitute the query's "neighborhood" and are selected by specifying a radius r such that all training molecules within a hypersphere centered on the test molecule's position are considered neighborhood members. The Euclidean distance metric is chosen to calculate all distances. The members of a test molecule's neighborhood serve to construct the local model most relevant for predicting the molecule's response.
5. A learning algorithm is applied to the entire training set to determine a mapping from the resolved descriptor space to the response space. This is referred to as a "global model" since it is constructed through use of all available training molecules. Logistic regression, which models the probability of a test molecule being active or inactive, is selected as it is most analogous to multiple linear regression when mapping to a dichotomous response variable. Such a linear function is favorable from an interpretability perspective because the contribution of each descriptor toward explaining the endpoint in question can be quantified through examination of the sample regression coefficients. It follows naturally to compare the performance of global and local models since global models are the most common type of modeling strategy found in literature QSAR studies.
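The two-parameter applicability domain of step 3 might be sketched as follows, using scikit-learn's NearestNeighbors; the function names are hypothetical:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def domain_threshold(train_X, k=3, q=0.95):
    """Distance threshold: the q-th quantile of the k-th nearest neighbor
    distances among training molecules (k=3, q=0.95 in this work)."""
    # k+1 neighbors because the nearest neighbor of a training point is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(train_X)
    dist, _ = nn.kneighbors(train_X)
    return np.quantile(dist[:, k], q)  # column k = distance to the k-th true neighbor

def in_domain(test_X, train_X, k=3, q=0.95):
    """A query is in-domain if at least k training molecules lie within
    the threshold radius of it."""
    r = domain_threshold(train_X, k, q)
    nn = NearestNeighbors().fit(train_X)
    dist, _ = nn.kneighbors(test_X, n_neighbors=k)
    return dist[:, k - 1] <= r  # k-th training neighbor of the query within r
```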

3.1.3 Local Phase

The purpose of the local phase is to iterate over each test molecule and execute the general QSAR workflow of Figure 19, ultimately producing an explanatory local model and prediction for each query.

1. The raw data for each training molecule in the test molecule's neighborhood is re-introduced. To clarify, chemical descriptors removed during the global phase due to univariate selection are again applicable if they have non-zero variance within the query's local training set.
2. The local training and test data are mean-centered and standardized.
3. The neighborhood training molecules and query molecule are cast into a single set of descriptors. This descriptor set comprises a local descriptor space. The choice of using univariate selection to refine the final subset of local descriptors remains an option at this step. A performance comparison is made between models utilizing and not utilizing descriptor selection at the local level.
4. The dimensionality of the local descriptor space is reduced in preparation for the application of a learning algorithm. Partial least squares (PLS) regression is used exclusively for this task. Similar to the global phase, dimensionality reduction is employed to further discriminate local variance correlated with the endpoint and to aid in the algorithm's estimation of the model parameters.
5. The learning algorithm is applied to estimate the parameters defining the local relationship between the chemical descriptors and the endpoint. Again, logistic regression is the learning algorithm used. This choice facilitates fairer comparison with the resulting global models. A sketch of the local phase as a whole follows this list.
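Putting the local phase together, a simplified sketch of the per-query loop is given below. Local univariate selection (step 3) is omitted for brevity, and all function and variable names are hypothetical:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def predict_locally(X_global, X_raw_train, y_train, X_raw_test, queries_global,
                    r=0.5, d_local=2):
    """Sketch of the local phase: one model per query.

    X_global / queries_global: training and test coordinates in the reduced
    global space; X_raw_train / X_raw_test: the re-introduced raw descriptors.
    """
    nn = NearestNeighbors(radius=r).fit(X_global)
    _, neighborhoods = nn.radius_neighbors(queries_global)
    preds = {}
    for i, idx in enumerate(neighborhoods):
        if len(idx) < 2 or len(np.unique(y_train[idx])) < 2:
            continue  # too few samples (or one class only): no local model
        # Steps 1-2: re-introduce raw descriptors, scale with local statistics.
        scaler = StandardScaler().fit(X_raw_train[idx])
        X_loc = scaler.transform(X_raw_train[idx])
        x_query = scaler.transform(X_raw_test[i:i + 1])
        # Step 4: local PLS reduction; Step 5: local logistic regression.
        pls = PLSRegression(n_components=min(d_local, len(idx) - 1))
        pls.fit(X_loc, y_train[idx])
        clf = LogisticRegression().fit(pls.transform(X_loc), y_train[idx])
        preds[i] = clf.predict(pls.transform(x_query))[0]
    return preds  # queries without enough neighbors are left unpredicted
```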

The proposed methodology in its totality is summarized graphically in Figure 20:


Figure 20: Proposed local QSAR methodology.

3.2 Model Optimization

The various options permitted within the local QSAR modeling workflow necessitate a grid search over the input parameter space in order to optimize performance for the data set in question. Later in the document, the effect of these parameters on model performance is discussed. The following input parameters are explored in this work:

1. The option to use univariate descriptor selection to refine the global descriptor space. The alternative is to retain all available descriptors.
2. The choice of PLS or t-SNE for reducing the dimension of the global descriptor space.
3. The number of latent or embedded variables into which the global descriptor space is reduced. Recall that local training sets for each query are determined from this space. In the interest of capturing a sufficient amount of data variability (i.e. more latent variables) and retaining model interpretability (i.e. fewer latent variables), the following values are explored: 2, 3, 4, 6, and 8.
4. The number of latent or embedded variables into which a query's local descriptor space is reduced. The same value is used for each query, if possible, given the variability of the local training set data. For the same reasons as with global dimensional reduction, the following values are used: 2, 3, 4, 6, and 8.
5. The option to utilize univariate descriptor selection to refine a query's local descriptor space. The alternative is to use all available descriptors.
6. The radius r defining the size of a query's local training set in the global descriptor space. As the radius increases, local training sets become larger and more queries have a sufficient number of samples to learn a predictive model. Therefore, a series of incrementally increasing values is used such that approximately 0 to 95% of the query compounds are predicted by local models (the upper limit results from the applicability domain assessment).

The remaining parameters, for example the level of significance for univariate selection steps, are not varied and could be the topic of future investigations.
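The grid search over these six parameters can be expressed compactly; the sketch below is illustrative only, and the radius values and the build_and_evaluate_model callback are hypothetical:

```python
from itertools import product

# Hypothetical grid over the six input parameters explored in this work.
grid = product(
    [True, False],               # global univariate selection on/off
    ["PLS", "TSNE"],             # global dimensional reduction method
    [2, 3, 4, 6, 8],             # global dimensions d_g
    [2, 3, 4, 6, 8],             # local dimensions d_l
    [True, False],               # local univariate selection on/off
    [0.05, 0.1, 0.2, 0.4, 0.8],  # locality radius r (illustrative values)
)
for gs, method, d_g, d_l, ls, r in grid:
    pass  # build_and_evaluate_model(gs, method, d_g, d_l, ls, r)  # hypothetical
```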

3.3 Evaluation Data Sets

The following two data sets, an Ames mutagenicity data set compiled by Hansen et al. in 2009 and a blood-brain barrier data set compiled by Muehlbacher et al. in 2011, are used to benchmark the proposed methodology on real-world QSAR applications [50], [79].

3.3.1 Ames mutagenicity - Hansen et al., 2009

The Ames test, developed by Bruce Ames and colleagues in the late 1960s and early 1970s, is frequently used by pharmaceutical companies to provide an early indicator of potential carcinogenicity and/or teratogenicity for candidate compounds [50], [80]. It is also necessary to assess the mutagenic potential of commercial and industrial chemicals for their safe use. The test consists of exposing histidine-dependent strains of Salmonella typhimurium grown on a histidine-deficient medium to the test compound. Molecules capable of interacting with genetic material can restore histidine synthesis and bacterial colony growth (i.e. "revertants"). Some compounds, such as aromatic amines or polycyclic aromatic hydrocarbons, become mutagenic only after activation by metabolic enzymes; therefore, a mammalian metabolizing system (i.e. "S9 fraction", usually derived from rat liver tissue) is often added along with the test compound [50], [80]. A test molecule is judged Ames positive if it produces significant revertant colony growth in at least one of five commonly used strains either with or without the addition of S9 fraction. Oppositely, a test molecule is deemed Ames negative if no significant revertant colony growth is observed from any strain with and without S9 fraction [50]. The assay is illustrated in Figure 21:


Taken from [81].

Figure 21: The Ames test for mutagenicity.

The application of QSAR models toward the prediction of Ames mutagenicity serves the purpose of quickly screening potential drug compounds computationally, therefore saving time and financial resources associated with actually performing the assay on all candidates. Facilitating the development of better QSAR models, the data set compiled by Hansen et al. is intended to serve as a large, publicly transparent Ames mutagenicity data set for QSAR model benchmarking and performance comparisons [50].

The data is collected across six sources with the distribution shown in Table 6:


Table 6: Source distribution of molecules within the Ames mutagenicity data set.

Taken from [50].

Source                AMES positive   AMES negative   Total
CCRIS [82]                     1359            1180    2539
Kazius et al. [83]             1375             849    2224
Helma et al. [84]                81              57     138
Feng et al. [85]                280             111     391
VITIC [86]                      386             808    1194
GeneTox [87]                     22               4      26
Total                          3503            3009    6512

Furthermore, the distribution of compounds by molecular weight is illustrated in Figure 22. The authors partitioned the molecules into a static training set and five split sets, as depicted in Table 7. Benchmarking a model involves a cross-validation procedure, training on the static set and four of the five split sets followed by prediction on the remaining split set. The procedure repeats until predictions are generated for all split sets. For the results presented on this data set later in the document, this procedure is followed, and performance statistics are derived from predictions on the combined splits. A sketch of this benchmarking loop follows.
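The protocol can be sketched as below, with hypothetical fit and predict callbacks standing in for the full modeling workflow:

```python
import numpy as np

def benchmark(static_set, splits, fit, predict):
    """Hansen et al. protocol sketch: `splits` is a list of five index arrays;
    `fit` and `predict` are hypothetical model callbacks."""
    predictions = {}
    for i, held_out in enumerate(splits):
        # Train on the static set plus the other four splits.
        train_idx = np.concatenate(
            [static_set] + [s for j, s in enumerate(splits) if j != i])
        model = fit(train_idx)
        predictions[i] = predict(model, held_out)
    return predictions  # statistics are computed over the combined splits
```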


Adapted from [50].

Figure 22: Molecule weight distribution of compounds within the Ames mutagenicity data set.

Table 7: Partitioning of the Ames mutagenicity data set for benchmarking.

Taken from [50].

                 Static training set   Split 1   Split 2   Split 3   Split 4   Split 5
# of compounds                  1585       984       985       984       987       987

Lastly, the experimental reproducibility of the Ames test is noted to be approximately 85%, with the difference due to interlaboratory error [49]. As a result, this finding places an upper limit on the performance of any modeling attempt.

3.3.2 Blood-Brain Barrier - Muehlbacher et al., 2011

The blood-brain barrier (BBB), as the name implies, is a highly regulated boundary which separates the bloodstream from the brain's interstitial fluid (ISF) [79], [88]. The capillary endothelial cells within the brain are closely packed, forming tight junctions which severely reduce paracellular transport, unlike that observed within other tissues of the human body.

Approximately 98% of all small molecules are rejected by the barrier, making the development of new drugs to treat diseases associated with the brain difficult. Molecular access to the brain’s interstitial fluid is limited to 1) lipid-mediated diffusion (aka “passive transport”) and 2) carrier or receptor-mediated transport (aka “active transport”). A diagram of the blood-brain barrier with mechanisms of transport is shown in Figure 23.

The vast majority of existing drugs which do target the brain arrive there via lipid-mediated diffusion and are lipophilic in nature, having a molecular weight less than 400 and forming fewer than 8 hydrogen bonds in water [88]. There is a great need to develop brain-targeting pharmaceuticals considering that a number of cancers (e.g. glioblastoma multiforme) and neurological diseases (e.g. Alzheimer's and Parkinson's diseases) lack effective treatments.


Taken from [89].

Figure 23: Diagram of the blood-brain barrier with annotated mechanisms of transport.

Better understanding of the transport potential of chemical compounds across the BBB would allow potential drug molecules which act on targets within the brain to be identified and scrutinized in a time and resource-efficient manner. Additionally, the side effects and safety profile of any drug or commercial compound may change significantly if the compound can reach the ISF through the BBB [79], [88].

Characterizing BBB transport begins with measurement of a molecule’s ability to move through the boundary, typically represented by its log(BB) value [79]:

$$\log(BB) = \log\left(\frac{C_{brain}}{C_{blood}}\right) \tag{41}$$

where $C_{brain}$ and $C_{blood}$ are the concentrations of the compound in the ISF and blood, respectively. Experimental methods measuring BBB permeability use artificial membranes, cell cultures, and in vivo studies [79]. Muehlbacher et al. compiled a large log(BB) data set to evaluate the predictive performance of a cross-sectional area descriptor developed by their collaboration [79]. The data is collected across 12 sources with the distribution shown in Table 8. The Muehlbacher et al. data set contains 362 unique compounds (there is overlap between sources in Table 8), 199 of which are positive and 163 of which are negative. Of the 362 molecules, 18 compounds are removed as they are known substrates of P-glycoprotein and therefore may move through the BBB by active instead of passive transport. Another 6 compounds are removed due to ambiguity of their reported values in the literature. Lastly, after cleaning the structure-data file provided by Muehlbacher et al., an additional 10 molecules are identified as duplicates and are removed.

Therefore, 328 molecules from the BBB data set are presented to the proposed workflow for modeling. Figure 24 shows the distributions of molecular weights and log(BB) values for these compounds:

Table 8: Source distribution of molecules within the blood-brain barrier data set.

Produced with information from listed sources and [79].

Source                        # of compounds   Experimental details
Vilar et al. [90]                        195   in vivo (mostly rats)
Platts et al. [91]                       119   in vivo (rat) and in vitro
Naranayan and Gunturi [92]                38   in vivo (rat)
Mente and Lombardo [93]                   94   in vivo (mostly mouse and rat)
Zhang et al. [94]                        147   unspecified
Abraham et al. [95]                      197   in vivo (rat) and in vitro (human and rat)
Garg and Verma [96]                      168   unspecified
Guerra et al. [97]                       106   in vivo (rat)
Rose et al. [98]                          95   in vivo (rat)
Kelder et al. [99]                        36   in vivo (rat)
Konovalov et al. [100]                   165   in vivo and in vitro (rat)
Zerara et al. [101]                       36   unspecified

Adapted from [79].

Figure 24: Molecule weight (left) and log(BB) (right) distributions of the blood-brain barrier data set.

3.4 Computational Tools and Molecular Descriptors

The proposed modeling workflow is programmed in Python 3.7 and executed within the integrated development environment (IDE) Spyder, both obtained via the Anaconda distribution [102], [103], [104]. The Anaconda distribution is an open-source platform containing Python and many associated packages to enable data manipulation, statistical analysis, machine learning, and graphics generation [104]. A few notable packages are relied upon greatly for implementation of the proposed work. Subroutines for PLS, t-SNE, and logistic regression are implemented using the machine learning package scikit-learn [105]. The manipulation and analysis of tabular data is completed with the pandas package [106]. RDKit, a chemoinformatics package, is used for processing of structure-data files and the generation of molecule images [20]. Additional packages are used but not listed here; all packages can be identified through study of the source code available in the appendix. Finally, some images of molecules contained herein are produced using MarvinSketch and/or MarvinView, software for editing and displaying chemicals from ChemAxon Ltd [107].

Molecular structure curation and descriptor calculation for the Ames mutagenicity and blood-brain barrier data sets are completed using CORINA Symphony, chemoinformatics software from Molecular Networks GmbH and Altamira, LLC [28]. The continuous- and count-type descriptors used in this work are described in Appendix B. In addition to these descriptors, the ToxPrint chemotypes are included in the numerical representation of all chemicals. As explained by Yang et al., in this context a chemotype is a structural fragment (either connected continuously or disjointed) optionally combined with physicochemical properties on individual atoms, bonds, fragments, electronic systems, or possibly the whole compound [108]. Molecules can be queried against the definition of the chemotype, and fulfillment or non-fulfillment of the definition by the molecule can be indicated by a "1" or "0", respectively, constituting a binary descriptor. Chemotypes are expressed in Chemical Subgraphs and Reactions Mark-up Language (CSRML), an XML-based language designed uniquely for their application. Furthermore, a publicly-available software application, ChemoTyper, was developed by Molecular Networks GmbH and Altamira, LLC for visualization and querying of structural data sets with chemotype definitions [108]. This program is used herein to provide examples of a select few ToxPrint chemotypes to showcase the chemical information they capture.

Specifically, the ToxPrint chemotypes library is a publicly-available, pre-defined set of 729 chemotype definitions containing structural fragments and properties derived from toxicity data covering over 100,000 compounds [108]. These compounds, which serve as a learning set, are quite diverse in origin; sources include a variety of domains such as pharmaceuticals, food and cosmetic ingredients, and industrial chemicals. The ToxPrint chemotypes themselves were designed using a bottom-up approach starting with simple functional groups and fragments identified as relevant from FDA and EPA safety assessment rules and previous drug development studies. An iterative process of molecule clustering, pattern searching against target compound sets, and refinement of chemotype definitions resulted in a more complex and discriminating set of descriptors [108]. The full compilation of ToxPrint chemotypes is available via the ChemoTyper program, which is free upon registry at https://toxprint.org/. Illustrating their versatility, a few chemotypes present within molecules of the Ames mutagenicity data set are included below. First, the data set is searched for the nitro functional group, and the results of the query are shown in Figure 25. In this example, the structures for the nitro group do not need to be queried individually, as is the case with traditional subgraph search algorithms, because the chemotype specifies the appropriate π-electron system. Another instance, shown in Figure 26, illustrates the search results for an aromatic ether structure. Multiple compounds fulfilling the query are depicted in the results pane:

Software from [108].

Figure 25: A screenshot of the ChemoTyper program querying the Ames mutagenicity data set for the nitro functional group chemotype (right). A molecule from the set (left) contains the nitro group and is highlighted accordingly.


Software from [108].

Figure 26: A screenshot of the ChemoTyper program querying the Ames mutagenicity data set for an aromatic ether chemotype (right). Several chemicals from the set (left) contain the chemotype and are highlighted.

Chapter 4: Results and Discussion

Before proceeding with a presentation of the results and discussion of the proposed methodology, some preliminary information is provided to aid the reader's understanding of the chapter. The design of the proposed local QSAR workflow is such that only a certain fraction of the test set is predicted at any given time. The fraction of test set molecules predicted by any model is referred to as its "coverage" in this document. Coverage varies from 0 to q, where q is the quantile defining the boundary of the applicability domain and is set at 0.95 in this work. The following subsections of this chapter explore the effects of various input parameters on model performance, and the results are often depicted graphically as a function of the model's coverage. Individual data points within these plots represent the performance characteristics of unique models as specified by the models' sets of input parameters. Most figures and tables are titled with a string similar to that shown in Figure 27 to inform the reader of the input parameters chosen for the models in the graph:

Figure 27: Model input parameter string.

The first indicator displays whether PLS or t-SNE is used to reduce the dimension of the global descriptor space. "GS" and "LS" correspond to "global univariate selection" and "local univariate selection", respectively, and can be "ON" or "OFF" depending on whether descriptor selection is performed at that point in the workflow. Next, the values of $d_g$ and $d_l$ indicate the numbers of dimensions to which the global and local descriptor spaces are reduced, respectively. Lastly, the value of $r$ indicates the radius used to construct query-specific, local training sets within the global descriptor space. Some of the aforementioned information may be removed or other information added depending on the context of the graph.

4.1 In-Depth Analysis of the Ames mutagenicity Data Set

The Ames mutagenicity data set consists of 6,512 molecules in total. These molecules are partitioned into a static training set of 1,585 compounds for which no predictions are made and a test set of 4,927 compounds predicted using a 5-fold cross-validation scheme.

4.1.1 The Effect of Radius on the Fraction of Predicted Compounds

As discussed in Chapter 3, the local training set for a test or query compound is defined by those molecules positioned within a hypersphere of radius r centered at the query in the global descriptor space. The coverage is largely dependent upon the magnitude of the radius used to determine local training sets for each query. Naturally, smaller radii will tend to produce local training sets with fewer molecules while larger radii will tend to result in local training sets with many compounds. Test compounds with sparse local training sets will fail to learn a model for multiple reasons; for example, from a lack of significant descriptors (i.e. no significant variance among too few compounds) or from instabilities in the numerical maximization of the likelihood function when estimating the logistic regression coefficients. This notion is conveyed in Figure 28, showing the coverage increasing monotonically with radius across several models:

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global descriptor selection enabled; “LS(ON)” – local descriptor selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 28: The effect of radius on coverage in a global descriptor space of low dimension.

This relationship is further demonstrated by examining the distribution of compounds among query-specific training sets. Figure 29 displays the distribution derived from the point (0.075, 0.329) in Figure 28. The mean and standard deviation of the distribution are 9 and 6 molecules, respectively, and no query has more than 37 training samples:

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global descriptor selection enabled; “LS(ON)” – local descriptor selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions; Locality-defining radius of 0.075.

Figure 29: Distribution of local training set sizes for a small radius.

Figure 30 portrays the distribution derived from the point (0.8, 0.934) in Figure 28. The mean local training set size is 616 compounds and the standard deviation 410 compounds. The smallest and largest sets contain 6 and 1,990 molecules, respectively. Naturally, an objective of local modeling is to optimize the size of the queries' training sets; sets that are consistently too small will lack enough samples to successfully learn models and will ultimately fail to predict a useful portion of all test compounds. On the other hand, while many large sets will make it possible to generate more predictions, the larger radii required to produce them will diminish their individual local character. However, in the context of the Ames mutagenicity data set, even a set with 1,990 compounds is still to some degree "local" as it utilizes roughly 36% of the available training samples.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global descriptor selection enabled; “LS(ON)” – local descriptor selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions; Locality-defining radius of 0.8.

Figure 30: Distribution of local training set sizes for a large radius.

The density of molecules within a descriptor space is affected by the number of dimensions used to represent that descriptor space; as a result, the distribution of compounds found within local training sets will depend on the number of dimensions. For an equal number of samples, projections into lower dimensional spaces tend to be more densely populated than higher dimensional spaces since the latter have more "volume" for samples to occupy. It is expected that coverage will increase more gradually with radius in a global space with many versus few dimensions due to differences in sample density. This expectation is confirmed by comparing Figure 28 with Figure 31:

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global descriptor selection enabled; “LS(ON)” – local descriptor selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 31: The effect of radius on coverage in a global descriptor space of high dimension.

Global univariate descriptor selection has no obvious effect on the change in predicted test set fraction with radius. This is evident by comparing Figure 31 with Figure 32:


“PLS” – dimensional reduction technique reducing global descriptor space; “GS(OFF)” – global descriptor selection disabled; “LS(OFF)” – local descriptor selection disabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 32: The effect of radius on coverage in a global descriptor space of high dimension without univariate descriptor selection.

Descriptor selection applied to the global training set reduces the number of descriptors from approximately 500 to 250, both of which greatly exceed the 2 to 8 dimensions resulting from dimensional reduction. Because both the selected and unselected descriptor sets are of such high dimension and both are reduced to fewer than 10 dimensions, the magnitude of the reduction is largely equivalent in either case, and the resulting sample density is governed by the projection to the much lower dimensional descriptor space.

Reducing the dimension of the global descriptor space using t-SNE can introduce, at least when compared to similar graphs generated from PLS, an apparent irregularity in the relationship between radius and coverage. This is illustrated in Figure 33, where the fraction of predicted molecules appears relatively constant when the radius is between approximately 0.03 and 0.1:

“TSNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global descriptor selection enabled; “LS(ON)” – local descriptor selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(8)” – local space reduced to 8 dimensions.

Figure 33: The effect of radius on coverage for a t-SNE embedded global descriptor space.

Additionally, the relationship between radius and predicted fraction does not increase monotonically, in contrast to the progression of models derived from PLS transformations.

The stochastic nature of the t-SNE algorithm can orient similar molecules in unpredictably tight or sparse groupings in the reduced dimensional embedding depending on the solution to the gradient descent minimization. An example of a 2-dimensional t-SNE reduced global descriptor space is shown in Figure 34:


“TSNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global descriptor selection enabled; “LS(ON)” – local descriptor selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(8)” – local space reduced to 8 dimensions.

Figure 34: A 2-dimensional, t-SNE embedded global descriptor space.

Examining the graph, the scatter of molecules around the origin is largely uniform. However, there do appear to be some tight groupings of compounds, especially toward the periphery. Wattenberg et al. demonstrated with a series of toy examples that the t-SNE algorithm tends to "equalize" original data densities by location; tight clusters tend to be expanded and loose clusters contracted in the final embeddings [71]. The presence of equalized sample clusters would explain a range of radii over which queries contain an approximately equal number of training samples. For queries situated among these clusters, radii extending to the average size of the groupings would incorporate the same local training samples until increased to such a size as to capture samples in neighboring clusters or from adjacent scatter. This underlying structure would translate to a flat portion in the coverage versus radius plot over the range of radii corresponding to the size of the clusters themselves.

4.1.2 The Effect of Global and Local Descriptor Space Dimension on Model Performance

The following analysis presents the change in four performance measures, namely sensitivity, specificity, accuracy, and the Matthews correlation coefficient (MCC), with the reduced dimension of the global and local descriptor spaces across multiple models. Recall from Chapter 2 the definitions of the aforementioned measures:

$$sensitivity = \frac{TP}{TP + FN} \tag{42}$$

$$specificity = \frac{TN}{TN + FP} \tag{43}$$

$$accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{44}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}} \tag{45}$$

where $TP$, $TN$, $FP$, and $FN$ are the numbers of true positive, true negative, false positive, and false negative predictions, respectively, made by the model in question. Figure 35 illustrates the change in sensitivity, specificity, and accuracy with the fraction of locally predicted test set compounds across several local model ensembles. For all models, global and local descriptor spaces are reduced to the smallest dimension possible at 2 PLS latent variables. Univariate descriptor selection is implemented during both global and local phases. The light, dark, and dashed lines in each graph indicate the performance measures of the local, global-on-local, and global models, respectively. The "global-on-local" data points represent the global model's predictions on the same fraction of test molecules predicted by the local model ensembles. The performance of any model typically improves as the applicability domain is narrowed, in essence limiting predictions to those test compounds far from the classifier's decision boundary. Comparing local and global model performance on the same fraction of test molecules differentiates the benefits conferred by constructing local models from those of a global model predicting only its most confident instances.
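For reference, Equations (42) through (45) translate directly into code; the helper below is a hypothetical convenience, not part of the workflow's source:

```python
import numpy as np

def classification_metrics(tp, tn, fp, fn):
    """Compute Equations (42)-(45) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn)))
    return sensitivity, specificity, accuracy, mcc
```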

Examining each plot in Figure 35, the global model has a uniform sensitivity, specificity, accuracy, and MCC of 78.7%, 46.4%, 65.6%, and 0.253, respectively. Global model predictions do not change as global models always predict approximately 95% of the test set molecules deemed within the applicability domain.


In each plot, the lighter color indicates local model ensemble performance, the darker color global-on-local performance, and the dashed line overall global model performance (on the entire test set). “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 35: PLS, GS(ON), LS(ON), G(2), L(2) sensitivity, specificity, and accuracy versus the fraction of predicted test set compounds.

In contrast, predictions made on individual compounds by local models can revert class or withhold prediction as the training sets change with radius. Regarding sensitivity performance, the global model consistently outperforms the local models on the same subset of test molecules, though their performance converges as the coverage reaches its maximum. This does not necessarily indicate a good global model, though, as the correspondingly low specificity indicates a tendency to make blanket positive predictions without proper discrimination. On the other hand, the local model ensembles consistently perform better at predicting negative queries than the global model, the specificity of the former being about 10% to 15% greater than that of the latter. Scrutinizing a collection of local models predicting 4,620 queries, 1,802 of these models have a prevalence (i.e. the percentage of positive training samples) less than 50% whereas the global model has a prevalence of 54%. The presence of local models with a greater number of negative training samples biases specificity performance in favor of the local strategy. Finally, the accuracy of the global model is slightly better than that of the local model ensembles until the coverage reaches roughly 60%. Past this point, the accuracy of the local model ensembles surpasses the global model, the difference being due to the local model ensembles' classification of negative queries. The performance of the global and local models at the maximum of the test set's applicability domain is shown in Table 9:

Table 9: PLS, GS(ON), LS(ON), G(2), L(2) global and local performance at the maximum of the applicability domain.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions.

          Coverage   Sensitivity   Specificity   Accuracy     MCC
Local        93.4%         78.8%         69.0%      71.1%   0.360
Global       93.4%         78.8%         46.4%      65.6%   0.254

Generally, the performance of the local model ensembles in each graph increases with coverage, the latter of which is directly related to radius and local training set size. This behavior is explained by the few latent variables used to describe the global and local descriptor spaces. Much of the variance in the descriptor data is lost with so few latent variables; therefore, incorporating more samples into the local models represents an information gain and further assists the classifiers in discriminating compound activity.

Figure 36 displays how sensitivity, specificity, and accuracy vary with coverage when the global descriptor space is reduced to a higher dimension, 8 PLS latent variables, compared to the 2 PLS latent variables shown in Figure 35. All other parameters remain unchanged. Immediately evident is the gain in sensitivity, specificity, and accuracy for virtually all local model ensembles across the coverage domain. Furthermore, the local models outperform the global model by approximately 3% to 5% for the majority of the coverage domain on each measure. Local model performance does decrease consistently as coverage increases, eventually matching or slightly underperforming relative to the global model once 93% of the queries are predicted. Highlighting an instance of the latter, the specificity of the local model ensemble is 60.4% whereas for the global model it is 64.5%. The increasing global character of the individual local models with increasing radius explains the decrease and eventual convergence in performance observed in each graph.

The maximum difference between local and global models occurs at approximately 60% coverage and is shown in detail in Table 10:

Table 10: Performance of PLS, GS(ON), LS(ON), G(8), L(2), local models predicting 60% of the test molecules.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

                    Coverage   Sensitivity   Specificity   Accuracy     MCC
Local                  60.0%         82.0%         68.8%      76.9%   0.456
Global                 60.0%         77.3%         65.6%      72.8%   0.393
Global (Overall)       94.0%         77.2%         64.8%      72.2%   0.383


In each plot, the lighter color indicates local model performance, the darker color global-on-local performance, and the dashed line overall global model performance (on the entire test set). An anomaly occurs around 0% coverage in which the models only make positive predictions resulting in 100% sensitivity and 0% specificity. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 36: Global and local PLS, GS(ON), LS(ON), G(8), L(2) model sensitivity, specificity, and accuracy versus coverage.

Of note, it is much more difficult to visualize and interpret the global descriptor space of the local model ensembles conveyed in Figure 36 than that of Figure 35. The reason for this difficulty is the number of dimensions; the global descriptor space of the models depicted in Figure 36 comprises 8 reduced dimensions, each a linear combination of 254 original molecular descriptors. Such high-dimensional spaces cannot be visualized without generating multiple 2-D or 3-D plots, and extracting meaningful information from the 254-by-8 matrix of PLS rotations (coefficients correlating the original descriptors with the PLS latent variables) is likewise a difficult, time-consuming task.

Alternatively, the descriptor spaces of the local models from both figures, retaining 2 PLS latent components, are more easily visualized and interpreted. The most populated local training sets are derived from, at maximum, 134 significant descriptors. Furthermore, 50% of local training sets have fewer than 54 significant descriptors, constituting a more manageable number. Analyzing prediction models at the local scale offers the advantage of elucidating mechanisms responsible for observed activity which are not as evident from global models. Conversely, while the associations underlying the predictions of individual queries can be ascertained, an acknowledged disadvantage of this approach is the lack of readily identifiable, globally applicable trends without systematic review and coordination of the outcomes across multiple local models.

Evidence toward the effect of local dimension reduction on sensitivity, specificity, and accuracy is presented in Figure 37. It is anticipated that additional latent variables will capture more descriptor information correlated with the response so long as the local training sets have sufficient samples and variance to benefit from those added components. Each plot depicts performance for local model ensembles with 2, 4, and 8 PLS latent variables. To clarify, the same curve shown in Figure 36 with 2 PLS latent variables is again shown in Figure 37 along with the overall global model. Closely examining each graph, increasing the number of dimensions used to represent the local descriptor spaces confers no clear and consistent benefit. For a coverage of approximately 10% to 35%, more dimensions provide a modest improvement in each measure of 1% to 3%. In some models, additional latent variables are detrimental; for example, the accuracy of local models decreases from 79% to 75.3% when predicting roughly 45% of the test set. Finally, the local model ensembles with 4 and 8 latent variables do not lose performance as quickly as the models constructed with 2 latent variables for coverages ranging from 85% to 95%, though the differences are relatively modest. Illustrating this latter observation, the MCC at roughly 90% coverage for 2 versus 8 latent variables is 0.414 and 0.434, respectively. While additional dimensions may be required to sufficiently capture the information available to local training sets with larger sample sizes, increasing their number also introduces a greater possibility of overfitting due to the added degrees of freedom available to the learning method. In all, if the sole objective is to maximize model performance on as many test instances as possible, then additional dimensions should be considered. However, these perceived gains could be overshadowed by the loss in interpretability and the potential lack of generalizability these models may exhibit when applied to an external test set.


In each plot, the blue, green, and purple lines indicate the corresponding performance of local model ensembles whose local training sets are reduced to 2, 4, 8 dimensions, respectively. The gray line indicates the performance of the global model. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions.

Figure 37: PLS, GS(ON), LS(ON), G(8) local model ensemble sensitivity, specificity, and accuracy versus coverage.

This subsection concludes with a comparison of the highest performing global models and local model ensembles. Global and local model ensemble performance is plotted on a common axis, that is, the number of dimensions of the descriptor space in which the logistic regressions are learned. The search for optimal global models involves a parameter space between 2 and 100 PLS latent variables with univariate descriptor selection both enabled and disabled. This search is depicted by the black and gray lines in Figure 38. In terms of accuracy and MCC, the best performing global model without descriptor selection occurs with 17 latent variables at 76.7% and 0.451, respectively. Likewise, the best performing global model including descriptor selection has an accuracy of 75.5% and an MCC of 0.434 with 36 PLS latent variables. It should be mentioned that comparable models can be obtained with as few as 15 latent variables; therefore, the graphs in Figure 38 only span up to 20 dimensions. Alternatively, the local model ensembles shown in Figure 38 are derived from an 8-dimensional global space and between 2- and 8-dimensional local spaces. Univariate selection is either utilized or absent simultaneously during the global and local phases of model construction. In order to keep the comparison as fair as possible, all local models displayed predict more than 90% of the test set molecules. Examining each measure, the local models outperform the global models by roughly 3% to 5%, and in some cases as much as 10%, over the range of the dimension of the local descriptor space. For a more detailed comparison, the best local model ensemble with 8 local dimensions has a sensitivity, specificity, accuracy, and MCC of 81.0%, 69.7%, 76.5%, and 0.451, respectively. On the other hand, the best global model has a sensitivity, specificity, accuracy, and MCC of 83.4%, 66.8%, 76.7%, and 0.451, respectively, but must retain 17 global dimensions to obtain this result.


In each plot, the blue and red lines represent local model ensembles with univariate selection configurations GS(OFF), LS(OFF) and GS(ON), LS(ON), respectively. The gray and black lines represent global models with univariate selection configuration GS(OFF) and GS(ON), respectively. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” or “GS(OFF)” – global selection enabled or disabled; “LS(ON)” or “LS(OFF)” – local selection enabled or disabled; “G(8)” – global space reduced to 8 dimensions.

Figure 38: PLS, G(8) sensitivity, specificity, and accuracy versus modeling space dimension for a collection of global and local model ensembles.

Through identification of training samples more relevant to explaining the activity of test queries than the training set as a whole, the proposed methodology is able to perform as well as global approaches in a significantly smaller modeling space. These more parsimonious local models benefit from easier interpretability and at least equal generalizability compared to their global counterparts.

4.1.3 The Effect of Univariate Descriptor Selection on Model Performance

This subsection explores how univariate descriptor selection implemented during the global and local phases of the workflow affects model performance. Because univariate selection may or may not be utilized at each phase, four possible input configurations are available for this parameter alone. Figure 39 depicts the performance measures of a collection of local model ensembles versus coverage under each univariate selection scheme. In the figure, 8 PLS latent variables are retained globally and 2 PLS latent variables are retained locally. Scrutinizing the major trends within each graph, utilizing univariate selection during the local phase of model construction is beneficial particularly when local training set sizes tend to be smaller, corresponding to coverages less than roughly 60%. For example, the sensitivity and accuracy of local models with local univariate selection enabled (blue and green) in this range are approximately 3% to 5% above those of models not utilizing univariate selection locally (orange and red). The separation in specificity between models with and without local univariate selection is less obvious, and perhaps of no meaningful difference, with the exception of the smallest and largest models at the extremes of the graph.
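
The exact univariate filter used in the workflow is specified in Chapter 3; purely as an illustration of a filter of this general kind, a per-descriptor class-separation test might look as follows (the Welch t-test criterion and the alpha threshold are assumptions of this sketch, not the workflow's actual settings):

import numpy as np
from scipy.stats import ttest_ind

def univariate_select(X, y, alpha=0.05):
    # X: (n_samples, n_descriptors) matrix; y: binary activity labels.
    # Keep descriptors whose class means differ significantly. Note the
    # filter judges each descriptor in isolation, ignoring multivariate
    # effects -- the weakness discussed later in this subsection.
    keep = []
    for j in range(X.shape[1]):
        pos, neg = X[y == 1, j], X[y == 0, j]
        if pos.std() == 0 and neg.std() == 0:
            continue  # constant descriptor: no discriminating power
        _, p = ttest_ind(pos, neg, equal_var=False)
        if p < alpha:
            keep.append(j)
    return np.array(keep, dtype=int)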


In each plot, the blue, orange, green, and red lines correspond to univariate selection configurations GS(ON), LS(ON); GS(ON), LS(OFF); GS(OFF), LS(ON); and GS(OFF), LS(OFF), respectively. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” or “GS(OFF)” – global selection enabled or disabled; “LS(ON)” or “LS(OFF)” – local selection enabled or disabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 39: PLS, G(8), L(2) sensitivity, specificity, and accuracy versus coverage for a collection of local model ensembles under various univariate selection configurations.

When examining the local model ensembles predicting greater than 90% of the queries, implementing univariate selection during the local phase induces worse performance, most evidently with regard to specificity and accuracy, and to a smaller extent with sensitivity. These changes are reflected in the accompanying MCC values as well, all of which are illustrated in Table 11 below:

Table 11: Performance of PLS, G(8), L(2) local models predicting greater than 90% of the test set under various univariate selection configurations.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” or “GS(OFF)” – global selection enabled or disabled; “LS(ON)” or “LS(OFF)” – local selection enabled or disabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Configuration       Coverage   Sensitivity   Specificity   Accuracy   MCC
GS(ON), LS(ON)      92.9%      78.6%         60.5%         71.2%      0.362
GS(OFF), LS(ON)     92.9%      78.1%         61.3%         71.3%      0.365
GS(ON), LS(OFF)     93.9%      80.4%         66.2%         74.7%      0.420
GS(OFF), LS(OFF)    93.7%      80.5%         67.4%         75.2%      0.429

An explanation for the decrease in performance involves the increasingly global character of the local models derived from large radii. In essence, an increasing fraction of these local models become individual global models and adopt the performance characteristics of the global model constructed from 2 PLS latent variables. It is known from Figure 35 that the sensitivity, specificity, and accuracy of this particular global model are weaker than the amalgam of several local models; therefore, the presence of global-like models among the remainder of the local models detracts from overall performance. The question of why local models utilizing univariate selection are particularly susceptible to this detrimental trend involves univariate descriptor selection itself. Univariate selection is an imperfect means of optimizing a set of descriptors, as it is incapable of considering the contribution one descriptor may have in the presence of others when discriminating classes. If this filtering process removes descriptors that are useful in a multivariate context, then it is expected that performance would decrease accordingly in their absence. Additionally, if these discriminating descriptors are removed prior to dimensional reduction in the local phase, then it would be expected that the resulting local PLS projections would reflect the loss of information and transform the observations into a reduced space in which the classes overlap to a greater extent.

4.1.4 The Effect of Dimensional Reduction Method on Model Performance

The purpose of this subsection is to understand the influence of t-SNE as a global phase dimensional reduction technique. Recall from Chapter 3 that either PLS or t-SNE may be implemented to reduce the global descriptor space before identification of local training sets. Since t-SNE is designed to group similar, high-dimensional samples in close proximity within a lower-dimensional space, this algorithm may be of particular usefulness to local modeling strategies.
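
To make the two reduction routes concrete, the sketch below reduces a stand-in descriptor matrix to 2 dimensions with each method using scikit-learn (the synthetic data and parameter values are assumptions for illustration, not the workflow's actual inputs):

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 120))          # stand-in for the descriptor matrix
y = rng.integers(0, 2, size=500)         # stand-in binary activity labels

# Supervised route: PLS latent variables aligned with the response
pls = PLSRegression(n_components=2)
scores = pls.fit_transform(X, y)[0]      # (500, 2) latent-variable scores

# Unsupervised route: t-SNE embedding (no explicit out-of-sample map)
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(X)   # (500, 2) components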

To begin, the t-SNE algorithm is used in Figure 40 to visualize the Ames mutagenicity data in 2 embedded components. Immediately following is an equivalent, 2-dimensional PLS transformation shown in Figure 41 for comparison purposes:

“TSNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(2)” – global space reduced to 2 dimensions.

Figure 40: 2-dimensional, t-SNE embedded Ames mutagenicity global descriptor space.


“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(2)” – global space reduced to 2 dimensions.

Figure 41: 2-dimensional, PLS transformed Ames mutagenicity global descriptor space.

Not unexpectedly, the t-SNE embedding and PLS transformation differ considerably. The former tends to group molecules with similar activity into tight clusters of various sizes. These groupings appear more prevalent at the periphery of the plot, while molecules are more uniformly scattered toward the origin. In contrast to t-SNE, PLS generally separates molecule classes along its first latent variable: Ames positive compounds tend to have positive scores on the first latent variable, whereas Ames negative compounds tend to have negative scores. There are also a number of separate, “outlying” compounds distant from the center of mass. Finally, it is readily apparent that both t-SNE and PLS exhibit a high degree of overlap between molecules of opposite classes.

Next, the effect of perplexity on local model ensembles resulting from a global t-SNE embedding is explored. Recall that perplexity is viewed as a smoothing parameter over the effective number of neighbors between samples. This parameter is typically set between 5 and 50, with larger values corresponding to increased emphasis on preservation of the global structure of the unembedded data. Figure 42 depicts three collections of local model ensembles produced from a t-SNE embedding to 2 global dimensions and a PLS transformation to 2 local dimensions for each query. Univariate selection is utilized during the global and local phases. From the graphs, the sensitivity of models between 10% and 30% coverage with a perplexity of 5 trails behind those at 30 and 50 by a modest 2% to 4%. Otherwise, perplexity has no appreciable effect on the performance of the local model ensembles. A perplexity of 30 is used for the remainder of the models involving t-SNE, justified by the results of Figure 42 and by its theoretical interpretation as a balance between preserving local and global data structure during optimization.


“TSNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 42: t-SNE, GS(ON), LS(ON), G(2), L(2) local model ensemble sensitivity, specificity, and accuracy versus coverage at three levels of perplexity.

The effect of increasing radius, or coverage, on the performance of a series of t-SNE derived local model ensembles is shown in Figure 43. Univariate selection is conducted during both the global and local phases. Furthermore, the global descriptor space comprises 2 t-SNE embedded components and the local space 2 PLS latent variables. After examining each plot, immediately noticeable is the performance of the local model ensembles above that of the global models when restricted to the same subset of test compounds. This observation is most extreme with regard to specificity and accuracy. On the other hand, the differences in sensitivity between local and global models are less prominent, and the global model sometimes performs equal to or better than the local model ensembles as the coverage approaches 95%. Table 12 compares global and local performance for a few select data points in greater detail, reiterating the previously noted trends. For all measures, the performance of the local model ensembles decreases with increasing coverage, a result of the loss of local character of the individual models with increasing radius. The specificity of the local models appears somewhat immune to this change, remaining relatively constant at approximately 70-75% until a drop is witnessed once a majority of the test set is classified.


In each plot, the lighter color indicates local model ensemble performance, the darker color global-on-local performance, and the dashed line overall global model performance (on the entire test set). “TSNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 43: t-SNE, GS(ON), LS(ON), G(2), L(2) sensitivity, specificity, and accuracy versus coverage for global and local model ensembles.

Table 12: t-SNE, GS(ON), LS(ON), G(2), L(2) select global and local model ensemble performance statistics.

“t-SNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions.

                 Coverage   Sensitivity   Specificity   Accuracy   MCC
Local #1         25.6%      90.5%         77.7%         86.1%      0.602
Global #1        25.6%      83.1%         33.2%         66.1%      0.186
Local #2         48.4%      85.4%         72.8%         80.5%      0.513
Global #2        48.4%      75.2%         37.8%         60.8%      0.139
Local #3         94.2%      77.0%         65.7%         72.3%      0.386
Global #3        94.2%      81.3%         28.5%         59.6%      0.114
Overall Global   94.4%      75.0%         38.2%         59.8%      0.140

Naturally, a comparison of performance between t-SNE- and PLS-based models is warranted. Recall that Figure 35 presents the performance of an equivalent series of PLS models to those shown in Figure 43. Considering the global models first, the PLS transformation retains more endpoint-specific information than its counterpart. For example, the sensitivity, specificity, accuracy, and MCC of the PLS-based global model are 78.7%, 46.4%, 65.6%, and 0.253, respectively. As found in Table 12, one t-SNE derived global model obtains 75.0%, 38.2%, 59.8%, and 0.140, respectively, on the same four measures. Models derived from both PLS and t-SNE dimensional reductions are biased toward the most prevalent class, but the global t-SNE-based model is particularly poor at accurately predicting negative queries relative to the global PLS-based model. It is not unexpected that a global t-SNE derived model would be substandard considering the algorithm's origin as an unsupervised data visualization tool. There is no obvious linear combination of the two embedded dimensions which would discriminate positive and negative molecules in Figure 40. Conversely, the specific intent of PLS is to construct components which both maximize descriptor space variance and are aligned to distinguish response classes, which, as mentioned, is apparent along the first and second latent variables.

Continuing the analysis is an evaluation of the local model ensembles derived from the PLS and t-SNE algorithms. Recall that in 2 dimensions, local models built from the PLS reduced space represented a loss of useful information at small radii, as evident from the general upward trend in performance these models exhibit as the coverage approaches 95%. Switching focus to the t-SNE reduced space, the results of the local training sets determined from its embedded components closely resemble those of the local models constructed from an 8-dimensional PLS reduced descriptor space found in Figure 36. These two sets of local model ensembles are graphed together in Figure 44. Most remarkable for this collection of plots is the gain in specificity exhibited by the t-SNE embeddings relative to the PLS projected space. For example, at test set fractions of approximately 10% and 30%, the specificity is 10% and 5% greater than PLS, respectively. A few select models from the figure are listed in Table 13. Note that coverage, a function of the radii used to define locality, is not directly controllable; therefore, the models of the table do not represent a perfect one-to-one comparison in terms of the exact test compounds receiving predictions.


In the figure, light colored lines correspond to the PLS-based local model ensembles whereas dark colored lines correspond to t-SNE-based local model ensembles. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” or “G(2)” – global space reduced to 8 or 2 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 44: Performance comparison of the PLS and t-SNE derived local model ensembles as a function of coverage.

Table 13: Comparison of select PLS- and t-SNE-derived local QSAR model ensembles.

           Coverage   Sensitivity   Specificity   Accuracy   MCC
PLS #1     25.4%      86.4%         71.8%         81.8%      0.531
t-SNE #1   25.6%      90.5%         77.7%         86.1%      0.602
PLS #2     71.8%      81.2%         65.4%         74.9%      0.423
t-SNE #2   65.2%      78.8%         71.0%         75.6%      0.440
PLS #3     93.0%      78.6%         60.5%         71.2%      0.362
t-SNE #3   93.5%      78.1%         67.8%         73.8%      0.410

The statistics shown in Table 13 highlight the success of the t-SNE-derived local QSAR modeling strategy relative to its PLS-based equivalent. This workflow configuration produces some of the best performing models of the entire investigation, though the differences are moderate and the PLS-based local models remain competitive. Local training sets corresponding to smaller radii obtain the most benefit, reflecting a potential optimum arising from the equalization of cluster densities found in the lower-dimensional mapping. That is to say, a balance exists between radii of sufficient size to capture all relevant training samples based upon their descriptor information and larger radii which induce a degradation in local character. In totality, these results provide evidence of the ability of t-SNE to cluster both positive and negative molecules of a complex QSAR data set which, when coupled with the local modeling strategy, is conducive to prediction and interpretation. Despite the fact that the algorithm was never intended for such applications, there is a demonstrable benefit to applying a distance metric to the embedded space to facilitate the identification of training samples sharing structural characteristics with test samples. Finally, the t-SNE-based approach allows for visualization of the high-dimensional descriptor space in a single 2-D plot. This is in contrast to PLS which, with 8 latent variables, would require $\binom{8}{2} = 28$ separate bi-plots, though it may be argued that some latent variables only explain a small fraction of the total variance and are of minimal importance. A valid counterargument to the former involves the dubious nature of the global structure conveyed by t-SNE’s embeddings.

The t-SNE technique, by itself and in the context of its use in predictive modeling, is not without some notable limitations. First to be discussed is the adaptation required to handle the lack of an explicit map from high-dimensional descriptor spaces to the low-dimensional embeddings. van der Maaten proposed a parametric form of t-SNE which learns a mapping by training a feed-forward neural network [73]. As mentioned in Chapter 3, in this work the t-SNE algorithm operates on the combined training and test sets when learning a representation; said another way, the test samples influence the positions of each other and of the training samples during the optimization process. This can result in consistency problems; interpretations of the effect of molecular structure on the endpoint of interest no longer depend solely on the training data but also on the influence of the particular test data under investigation. There may also be issues with applicability domain assessment for the same reason. Because the global structure of the t-SNE embedding has no inherent interpretation, the positions of and distances between clusters in the embedded space are arbitrary. Furthermore, the aforementioned equalization of data densities could eliminate the notion of outlying data points. With these known behaviors in mind, it is not unreasonable to presume that query compounds could be positioned closer to training molecules than they reasonably should be, thus resulting in suboptimal local training sets for modeling.
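
The combined-embedding adaptation just described amounts to stacking the training and test descriptor matrices before fitting; a minimal sketch (synthetic arrays; only the split bookkeeping matters here):

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 80))     # stand-in training descriptors
X_test = rng.normal(size=(60, 80))       # stand-in test descriptors

# t-SNE has no explicit map, so train and test are embedded together; the
# test rows therefore influence the training rows' coordinates, which is
# the consistency concern raised above.
X_all = np.vstack([X_train, X_test])
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_all)
emb_train, emb_test = emb[: len(X_train)], emb[len(X_train):]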

This topic of discussion is concluded with a review of the stochastic nature of the t-SNE algorithm and its impact on model performance. Recall from Chapter 2 that initialization of the algorithm involves sampling lower-dimensional points from an isotropic Gaussian of small variance centered at the origin [68]. Additionally, the Kullback–Leibler divergence, which is the objective function to be minimized by gradient descent, is non-convex. Therefore, each representation results from one of several possible local minima, and representations obtained from the t-SNE technique can vary between runs [109]. This behavior is demonstrated in Table 14 and Table 15, where performance statistics are presented from 10 repetitions of global and local t-SNE-derived models, respectively. Reviewing each table, the global models exhibit more variance between runs than the local model ensembles. From inspection of the MCC values, some of the individual global models may indeed be no better than random guessing at predicting Ames mutagenicity; the remainder of the global models perform poorly. In contrast, the local models remain much more stable between runs, lending credence to the reliability of t-SNE derived predictions and interpretations for at least this particular set of model input parameters. However, at present it remains unknown whether a more expanded search of the input parameter space might produce better local model ensembles or how the variability in performance of those ensembles might change as those parameters are manipulated.
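
The run-to-run variability has a simple operational source: each repetition corresponds to a different random initialization, so the embedding, and everything built on it, changes with the seed. A sketch of how such repetitions might be generated (synthetic data; random_state is scikit-learn's seeding mechanism):

import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(1).normal(size=(200, 50))  # stand-in descriptors

# The KL objective is non-convex, so each seed converges to a different
# local minimum and yields a different embedding.
embeddings = [
    TSNE(n_components=2, perplexity=30, random_state=seed).fit_transform(X)
    for seed in range(10)
]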


Table 14: t-SNE, GS(ON), LS(ON), G(2), L(2) global model ensemble statistics from 10 repetitions.

“t-SNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions.

Run    Coverage   Sensitivity   Specificity   Accuracy   MCC
1      94.1%      77.4%         33.7%         59.5%      0.122
2      94.2%      76.8%         31.1%         57.9%      0.089
3      94.5%      75.8%         35.5%         59.2%      0.122
4      94.4%      72.4%         28.1%         54.2%      0.005
5      93.9%      74.0%         36.0%         58.4%      0.109
6      94.4%      77.2%         32.1%         58.6%      0.104
7      94.2%      76.0%         35.1%         59.2%      0.121
8      94.1%      75.7%         33.9%         58.5%      0.105
9      94.3%      81.9%         25.4%         58.6%      0.088
10     94.2%      74.3%         38.2%         59.4%      0.132
Min    93.9%      72.4%         25.4%         54.2%      0.005
Max    94.5%      81.9%         38.2%         59.5%      0.132
Mean   94.2%      76.2%         32.9%         58.4%      0.100
Std    0.177%     2.552%        3.860%        1.544%     0.036


Table 15: t-SNE, GS(ON), LS(ON), G(2), L(2), radius 0.1 local model ensemble statistics from 10 repetitions.

“t-SNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions; locality-defining radius of 0.1.

Run    Coverage   Sensitivity   Specificity   Accuracy   MCC
1      65.0%      78.0%         70.3%         74.9%      0.430
2      66.1%      79.3%         70.7%         75.8%      0.441
3      65.6%      77.8%         69.0%         74.2%      0.418
4      66.0%      79.1%         70.6%         75.7%      0.442
5      65.2%      80.3%         69.9%         76.1%      0.446
6      66.5%      79.3%         68.6%         74.9%      0.427
7      65.4%      77.9%         70.7%         75.0%      0.430
8      65.1%      79.3%         68.3%         74.9%      0.427
9      65.7%      78.5%         69.7%         74.9%      0.429
10     66.0%      78.5%         70.7%         75.4%      0.437
Min    65.0%      77.8%         68.3%         74.2%      0.418
Max    66.5%      80.3%         70.7%         76.1%      0.446
Mean   65.7%      78.8%         69.9%         75.2%      0.433
Std    0.490%     0.797%        0.922%        0.563%     0.009
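
The summary rows of Table 15 follow directly from the ten run-level values; for instance, the MCC column can be checked as follows (the sample standard deviation reproduces the tabulated 0.009):

import numpy as np

mcc = np.array([0.430, 0.441, 0.418, 0.442, 0.446,
                0.427, 0.430, 0.427, 0.429, 0.437])  # Table 15, MCC column
print(f"min={mcc.min():.3f}  max={mcc.max():.3f}  "
      f"mean={mcc.mean():.3f}  std={mcc.std(ddof=1):.3f}")
# -> min=0.418  max=0.446  mean=0.433  std=0.009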

4.1.5 Prediction Confidence and Smallest Radii Local Model Ensembles

The first topic of discussion is the reliability of predictions made by the local QSAR methodology. Recalling the trends observed in Figure 36 and Figure 43 and their associated tables, it is clear that although more test molecules receive predictions when the locality-defining radii are large, the performance of the local model ensembles eventually declines to match or even fall below that of the equivalent global model. On the other hand, model performance is relatively high for local model ensembles derived from small to medium-sized radii. An unanswered question remains as to whether predictions for test compounds in the latter category are consistently more accurate than those in the former. If such consistency exists, it represents a confidence of prediction: test queries first obtaining predictions from smaller local training sets tend to be “easier” to predict and have a greater likelihood of being correct than those which require larger local training sets in order to receive predictions. By analogy, just as one anticipates the model performance obtained from cross-validation to extend similarly to an external test set, individual queries are anticipated to share the same ease-of-prediction as validation samples resulting from local models of similar size.

Testing the confidence of the local modeling workflow involves identifying a subset of test molecules and tracking the predictions of its members as a function of the locality-defining radius. One such subset is shown in Table 16 for illustration purposes.

Note the progression of predictions for each compound; some test instances alternate between one class and no prediction, or revert from one class to another, between successive local models. This behavior can result from a loss of significant descriptors or a change in training set composition, either of which serves to alter the regression model’s decision boundary.

Table 16: PLS, GS(ON), LS(ON), G(8), L(2) Ames mutagenicity test compound predictions vs. local model radii.

“0” and “1” denote negative and positive Ames mutagenicity predictions, respectively. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

           Test Compound Identifiers
Radius     2849-98-1   163275-58-9   4514-19-6   99523-60-1
0.25       -           -             1           -
0.50       0           1             1           0
0.75       -           0             1           0
1.00       -           0             1           0
1.20       0           1             1           0
1.40       0           1             1           0
1.60       0           1             1           0
1.80       0           1             1           0
2.00       0           1             1           0
2.20       0           1             1           0
2.40       0           1             1           0
2.60       0           0             1           1

Figure 45 displays local modeling performance for the first subset, consisting of 397 test compounds (yellow) for which predictions are available at the smallest of radii. The performance on this subset is compared to that on the entire test set (blue). These plots use radius instead of coverage as the x-axis variable because following the predictions for the same set of test compounds corresponds to a constant coverage. Examining each graph, the performance measures on the subset of test molecules remain largely constant with increasing radius. This is in contrast to the performance on the entire test set, which gradually decreases with increasing radius. There is a modest decline in the performance of the subset at the largest radii; however, the performance statistics of the subset remain consistently above those of the entire test set. This test subset, containing approximately 400 compounds, represents a coverage of approximately 7% to 8% of the total test set.


In the figure, the blue and yellow lines display local model performance for all test molecules and a subset of test molecules, respectively. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 45: PLS, GS(ON), LS(ON), G(8), L(2) local model ensemble sensitivity, specificity, and accuracy versus locality-defining radius for a subset of 397 test compounds and the entire test set.

Figure 46 depicts a similar comparison with a second test subset of 1,252 compounds, or roughly 22% to 23% of the entire test set. In this example, the sensitivity of the subset deteriorates slowly, the specificity falls in close agreement with the entire test set, and the accuracy reflects the same trend observed with the sensitivity. The number of predictions and Matthews correlation coefficients for the models of both subsets as a function of the locality-defining radius are shown in Table 17. The blank entries at the top of the table result from a requirement that the subsets contain a certain number of compounds; the initial radii did not produce enough predictions to fulfill this specification. Furthermore, not every member of each subset receives a prediction at every radius value investigated. Using the MCC as an overall measure of model performance, predictions on the test molecules of these subsets enjoy their initial level of success until the radius reaches approximately 2.2 units. Once the radius becomes large, the local training set sizes have increased to the point where they begin to lose their local character. By similar reasoning, the second subset does not share the performance characteristics of the first subset, since a greater number of the test compounds in the second set are more difficult to predict in comparison to the first.


In the figure, the blue and yellow lines display local model performance for the full set and a subset of test molecules, respectively. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 46: PLS, GS(ON), LS(ON), G(8), L(2) local model ensemble sensitivity, specificity, and accuracy versus locality-defining radius for a subset of 1,252 test compounds and the entire test set.


Table 17: Number of predictions and Matthews correlation coefficients for subsets 1 and 2 versus locality-defining radius.

           Subset 1               Subset 2
Radius     # Predicted   MCC      # Predicted   MCC
0.25       -             -        -             -
0.50       397           0.66     -             -
0.75       306           0.658    1252          0.58
1.00       276           0.658    920           0.634
1.20       316           0.655    1024          0.599
1.40       345           0.648    1132          0.557
1.60       381           0.681    1194          0.572
1.80       392           0.686    1229          0.556
2.00       395           0.689    1242          0.547
2.20       393           0.617    1244          0.526
2.40       397           0.584    1250          0.517
2.60       397           0.545    1252          0.485

The conclusion of this subsection involves a performance-enhancing modification to the proposed methodology gleaned from the previous discussion on prediction confidence. Summarizing that discussion, test compounds predicted from local model ensembles with smaller local training sets tend to enjoy improved performance over those predicted from larger local training sets. This observation forces a trade-off between predicting a relatively small set of test molecules with high quality and confidence and predicting a larger subset of test compounds with reduced quality and confidence.

Additionally, because the locality-defining radius is fixed for any one particular ensemble of local models, test samples which are predicted correctly at smaller radii can become inaccurate at larger radii. A possible solution to this problem is to allow the radius to vary across the ensemble of local models; each test compound retains the prediction from the smallest local training set which generated one. This is accomplished by building multiple local model ensembles over a range of radii from small to large such that the coverage varies from 0% to 95%, similar to a grid search. Once a particular local model ensemble renders a prediction for a test compound, that prediction becomes fixed and the test molecule is no longer predicted by any subsequent local models; a sketch of this sweep is given below. Initial results from this modification, referred to as the smallest radius approach, are considered next.
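
A minimal sketch of the smallest radius sweep follows; predict_at_radius is a hypothetical stand-in for one full pass of the local modeling workflow at a fixed radius, returning a class label for each covered test compound and None otherwise:

def smallest_radius_predictions(radii, predict_at_radius, n_test):
    # Sweep radii from small to large; each test compound keeps the first
    # (smallest-radius) prediction it receives and is never re-predicted.
    final = [None] * n_test
    for r in sorted(radii):
        preds = predict_at_radius(r)
        for i in range(n_test):
            if final[i] is None and preds[i] is not None:
                final[i] = preds[i]  # fix the prediction permanently
    return final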

Figure 47 displays results of the smallest radius approach with PLS as the dimensional reduction method; the results for t-SNE derived local models are found to be similar. Specifically, the dotted gray line in each graph indicates the performance of local model ensembles with variable locality-defining radii. Studying each plot, there is not much difference in performance between local model ensembles with and without varying radii. However, the specificity of local model ensembles with variable radii does not exhibit the sharp decline seen for those without varying radii as the coverage reaches its maximum. This change is also reflected in the accuracy of both types of models, serving to keep the local modeling strategy relevant when attempting to predict as many test molecules as possible. It can be concluded that the modification prevents local models of increasingly global character from overwriting the negative predictions of earlier models derived from smaller, more proximal training sets.

A more noticeable improvement is witnessed when the local training sets are reduced to 8 instead of 2 dimensions, as shown in Figure 48. Here, the decline in sensitivity, specificity, and accuracy is more gradual with increasing radius, retaining an advantage of about 5% over the local models with non-variable radii for intermediate coverage values. Interpreting this trend, predictions frequently revert to an incorrect class when more reduced dimensions are available to describe the local descriptor spaces. This was not observed with the lower dimensional local descriptor spaces of Figure 47, in which predictions tended to remain constant with or without variable radii. Of further note, the difference between the two approaches is insignificant when the fraction of predicted test compounds approaches 95%. A final, numerical comparison of the smallest radius local modeling ensembles of both high and low dimension is available in Table 18, expressing overall performance with the Matthews correlation coefficient.


In the figure, the light, dark, dashed, and dotted (gray) curves correspond to the local, global-on-local, global overall, and smallest radius model performances, respectively. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 47: PLS, GS(ON), LS(ON), G(8), L(2) local model ensemble sensitivity, specificity, and accuracy vs. coverage including the smallest radius approach.


In the figure, the light, dark, dashed, and dotted (gray) curves correspond to the local, global-on-local, global overall, and smallest radius model performances, respectively. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(8)” – local space reduced to 8 dimensions.

Figure 48: PLS, GS(ON), LS(ON), G(8), L(8) local model ensemble sensitivity, specificity, and accuracy vs. coverage illustrating the smallest radius approach.

Table 18: Comparison of select local model ensembles with and without varying radii in 2 and 8 reduced local dimensions.

Predicted fraction   MCC           MCC            MCC           MCC
(approximate)        2D, static    2D, dynamic    8D, static    8D, dynamic
                     radii         radii          radii         radii
20%                  0.531         0.532          0.588         0.587
45%                  0.489         0.482          0.431         0.523
70%                  0.423         0.445          0.441         0.452
90%                  0.362         0.419          0.434         0.431

4.1.6 Case Study of Select Molecules from the Ames Mutagenicity Data Set

The intent of this subsection is to showcase some of the interpretive features made available by the proposed methodology. Such information is of most use to toxicologists, medicinal chemists, and chemoinformaticians to assist in the development of safer and more efficacious chemical products. Accomplishing this task involves reviewing global models and local model ensembles and their predictions for some example molecules of the Ames mutagenicity data set. Specifically, two compounds from the set are discussed in detail; the first compound’s mutagenic response is correctly predicted by local model ensembles but incorrectly predicted by the global model. Conversely, the response of the second compound is incorrectly predicted by local model ensembles yet correctly predicted by the global model. In totality, this review should provide a better understanding of the advantages and disadvantages of the proposed QSAR modeling workflow.

The first molecule to be considered is CAS 7203-90-9, otherwise known as 1-(4-Chlorophenyl)-3,3-dimethyltriazene, the skeletal formula of which is shown in Figure 49.

Figure 49: Skeletal formula of 1-(4-Chlorophenyl)-3,3-dimethyltriazene, or CAS 7203-90-9.

Speaking to the specific structural aspects of the compound, the triazenyl group (i.e. three adjacent nitrogen atoms, in blue) is known in the literature to be associated with mutagenic and anti-neoplastic properties [110]. Furthermore, this functional group is included in the ToxPrint chemotypes, which are derived from toxicological data sets and are anticipated to cover toxicologically relevant chemical structures. In all, this information lends explanation to 1-(4-Chlorophenyl)-3,3-dimethyltriazene’s Ames positive activity (i.e. “1”), which is sourced from the U.S. Environmental Protection Agency data [50].

The global, logistic regression-based model, reduced to 8 PLS latent dimensions after univariate feature selection, estimates the probability of the test compound being positive at 0.338. Since the decision boundary occurs at 0.5, the model incorrectly predicts this molecule to be Ames negative (i.e. “0”). While not extremely distant from the decision boundary, this prediction is at the same time not equivocal; any prediction might be thought dubious if the estimated probability of a test compound lies in close proximity to the boundary. Figure 50 and Figure 51 show the global model logistic regression coefficients transformed to be expressed in terms of the original numerical and chemotype descriptors, respectively. Each plot includes the 20 descriptors with the largest magnitudes related to each class. For example, this data set indicates that nitrogen-based acceptors are associated with positive Ames mutagenicity whereas approximate surface area (ASA) is correlated with negative Ames mutagenicity. Similarly, hydroxyl groups bonded to nitrogen, oxygen, sulfur, or phosphorus atoms constitute a chemotype correlated with Ames positive activity while, on the other hand, the generic carbonyl group is associated with Ames negative activity. It is important to note that these relationships are not necessarily causal, and the activity of any one particular compound must be interpreted by considering the effects of all significant descriptors. For instance, in reference to 1-(4-Chlorophenyl)-3,3-dimethyltriazene, the global model does recognize a positive correlation between Ames mutagenicity and the triazenyl group; the logistic regression coefficient of the triazenyl group is 0.0135. This compound also contains 3 nitrogen-based hydrogen bond acceptors. However, the remainder of the molecule’s descriptor values, relative to those of the training set, place its predicted activity decisively into negative character.
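
The back-transformation behind Figures 50 and 51 relies on the PLS scores being (up to centering and scaling) a linear function of the selected descriptors, so the latent-space logistic coefficients map to one coefficient per original descriptor. A sketch with synthetic data (scikit-learn attribute names; the dissertation's own implementation may differ):

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 60))       # stand-in for the selected descriptors
y = rng.integers(0, 2, size=400)     # stand-in binary Ames labels

pls = PLSRegression(n_components=8).fit(X, y)
T = pls.transform(X)                 # latent-variable scores
clf = LogisticRegression().fit(T, y)

# Scores are linear in the (centered, scaled) descriptors,
# T = X_std @ x_rotations_, so the logit X_std @ (x_rotations_ @ b) yields
# one coefficient per original descriptor (in standardized units):
beta_descriptor = pls.x_rotations_ @ clf.coef_.ravel()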


“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “G(8)” – global space reduced to 8 dimensions.

Figure 50: PLS, GS(ON), G(8) global logistic regression coefficients of 20 numeric descriptors with largest magnitude.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “G(8)” – global space reduced to 8 dimensions.

Figure 51: PLS, GS(ON), G(8) global logistic regression coefficients of 20 chemotype descriptors with largest magnitude.

Shifting focus now to the predictions made by the local model ensemble, the series of predictions for 1-(4-Chlorophenyl)-3,3-dimethyltriazene as a function of the locality-defining radius is shown in Table 19:

Table 19: PLS, GS(ON), LS(ON), G(8), L(2) local model predictions of 1-(4-Chlorophenyl)-3,3-dimethyltriazene with increasing radius.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Radius                   0.5   0.75   1.0    1.2    1.4    1.6    1.8    2.0    2.2    2.4    2.6
Coverage (%)             8.1   25.4   45.0   59.9   71.8   81.1   86.8   89.5   91.5   92.4   93.0
Local model prediction   -     1      1      1      1      1      1      1      1      1      1

Initially, no classification is made due to an insufficient number of training instances. The remaining local models do make predictions, all of which correctly identify the test molecule as Ames positive. At a radius of 0.75 units, the training set contains only 3 sample molecules, which is the minimum number required for the algorithm to continue forward and attempt model construction. The structures of these neighboring training samples are shown in Figure 52 in order from most to least similar. All of these compounds are Ames positive. Visually, these training samples are clearly very similar to 1-(4-Chlorophenyl)-3,3-dimethyltriazene, each differing by only a single structural substitution or deletion. Regarding the prediction produced by this small local training set: because the set is uniformly positive, the modeling algorithm defaults to the uniform class and makes a positive prediction, as sketched below.
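
The fallback logic for very small local training sets can be summarized in a few lines; a sketch under the stated assumptions (a minimum of 3 neighbors, defaulting to the uniform class; fit_and_predict is a hypothetical stand-in for fitting the local logistic regression):

import numpy as np

MIN_TRAIN = 3  # minimum local training set size noted above

def local_prediction(y_local, fit_and_predict):
    if len(y_local) < MIN_TRAIN:
        return None                      # too few neighbors: no prediction
    classes = np.unique(y_local)
    if len(classes) == 1:
        return int(classes[0])           # uniform class: default to it
    return fit_and_predict()             # otherwise fit the local model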

Figure 52: Training molecules within 0.75 units of 1-(4-Chlorophenyl)-3,3-dimethyltriazene.

In terms of fitting a local logistic regression, the first model learned for 1-(4-Chlorophenyl)-3,3-dimethyltriazene occurs at a radius of 1.0. This model is built from a training set consisting of 26 compounds. A plot of the local training set, the logistic regression decision boundary, and the position of 1-(4-Chlorophenyl)-3,3-dimethyltriazene in the locally transformed space is depicted in Figure 53. Note that the test molecule falls on the positive side of the boundary and is therefore predicted Ames positive with an estimated probability of 0.981. Interpretive information is shown in Figure 54, which includes a graphical depiction of the local logistic regression coefficients expressed in terms of the original descriptor variables. Only 10 descriptors are deemed significant, and none of them are chemotypes. From the figure, rotatable bonds and nitrogen-based hydrogen bond acceptors are associated with a positive response whereas oxygen-based hydrogen bond acceptors are correlated with negative activity.

In the figure, the black cross indicates the test molecule, the red markers positive training compounds, the blue markers negative compounds, and the dashed line the logistic regression decision boundary. “LV1” and “LV2” refer to the first and second PLS latent variables, respectively.

Figure 53: The PLS transformed local descriptor space for the logistic regression model predicting 1-(4-Chlorophenyl)-3,3-dimethyltriazene.

Broadly, 1-(4-Chlorophenyl)-3,3-dimethyltriazene itself has multiple features associated with positive activity in this smaller data set; the compound contains 3 nitrogen-based hydrogen bond acceptors and 2 rotatable bonds. Conversely, the test compound has no oxygen-based hydrogen bond acceptors, which are found to correlate most with negative Ames character. These same associations are found to exist within the totality of the training set as learned by the global model. However, other structural correlations with negative Ames mutagenicity ultimately resulted in 1-(4-Chlorophenyl)-3,3-dimethyltriazene being incorrectly classified as a negative compound by that model. In summary, this test molecule benefited from the narrowed focus provided by its local training set, resulting in a correct prediction. Furthermore, interpretation of the test molecule’s predicted activity is considerably easier through examination of its local model than the global model due to the former’s limited complexity.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 54: PLS, GS(ON), LS(ON), G(8), L(2) local logistic regression coefficients for the model predicting 1-(4-Chlorophenyl)-3,3-dimethyltriazene.

Next to be presented are the models and predictions for CAS 57-47-6, otherwise known as physostigmine. This chemical compound is illustrated in Figure 55:

Figure 55: Skeletal formula for physostigmine, or CAS 57-47-6.

Physostigmine is a cholinesterase inhibitor used in medicine to treat glaucoma and anticholinergic toxicity [111], [112], [113]. The global logistic regression-based model, constructed from an 8-dimensional PLS latent variable space following univariate feature selection, predicts the test compound as Ames negative; the estimated probability of positive Ames mutagenicity is 0.407. This prediction agrees with data retrieved from the Chemical Carcinogenesis Research Information System (CCRIS), which labels physostigmine as Ames negative [50], [82]. Because physostigmine is in a separate cross-validation set than 1-(4-Chlorophenyl)-3,3-dimethyltriazene, a different global model is responsible for its predicted activity. Figure 56 and Figure 57 show the logistic regression coefficients for this global model expressed in terms of the original numerical and chemotype descriptors, respectively. Each plot includes the 20 descriptors with the largest magnitudes related to each class.


“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “G(8)” – global space reduced to 8 dimensions.

Figure 56: PLS, GS(ON), G(8) global logistic regression coefficients of 20 largest magnitude numeric descriptors.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “G(8)” – global space reduced to 8 dimensions.

Figure 57: PLS, GS(ON), G(8) global logistic regression coefficients of 20 largest magnitude chemotypes.

For example, looking at the numeric descriptors, an increasing number of rotatable bonds is found to correlate with positive Ames mutagenicity whereas molecules with larger dipole moments are more commonly associated with Ames negative character. Examining physostigmine specifically, this compound is rather rigid for its size due to its complicated ring structure and therefore has a limited number of rotatable bonds. Furthermore, the size of the test compound confers upon it a number of features associated with negative mutagenicity, including the number of atoms, the number of bonds, ring complexity, and McGowan volume, to name a few. In regard to chemotypes, many of the same structural relationships with Ames mutagenicity exist between the models for 1-(4-Chlorophenyl)-3,3-dimethyltriazene and physostigmine, which is expected if the chemotypes encode toxicological information of a more fundamental nature. Finally, physostigmine itself does not include any notable chemotypes with the exception of a cyclic ethyl feature associated with negative activity in this training data.

The predictions from the ensemble of local models for physostigmine are considered next. The progression of predictions with increasing locality-defining radius is shown in Table 20:

Table 20: PLS, GS(ON), LS(ON), G(8), L(2) local model predictions of physostigmine with increasing radius.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Radius                   0.5   0.75   1.0    1.2    1.4    1.6    1.8    2.0    2.2    2.4    2.6
Coverage (%)             8.1   25.4   45.0   59.9   71.8   81.1   86.8   89.5   91.5   92.4   93.0
Local model prediction   -     -      1      1      1      1      1      1      1      1      1

Unsurprisingly, no predictions are made for physostigmine at the smallest radii due to a lack of sufficient local training samples. Beginning at 1.0 units in the transformed PLS descriptor space and proceeding thereafter, all local models incorrectly predict the test molecule to be Ames positive. In more detail, the local model corresponding to a locality-defining radius of 1.2 units is learned using 36 samples. The local training set and logistic regression decision boundary are visualized in Figure 58:


In the figure, the black cross indicates the test molecule (indicated by arrow), the red markers positive training compounds, the blue markers negative compounds, and the dashed line the logistic regression decision boundary. “LV1” and “LV2” refer to the first two PLS latent variables, consecutively.

Figure 58: The PLS transformed local descriptor space for the logistic regression model predicting physostigmine.

Immediately evident is the small amount of variance along the second latent variable; most of the training molecules are scattered along the first latent variable by comparison. Furthermore, the test compound appears close to the decision boundary. However, the model’s estimated probability of physostigmine being Ames positive is 0.789, which is still quite distant from the decision boundary relative to the rest of the training set. The structures of the five nearest neighbors of physostigmine, in order from most to least similar, are shown in Figure 59, along with corresponding activity, distance, and select descriptor data found in Table 21:


Figure 59: The five nearest neighbors of physostigmine from most to least similar in the transformed local descriptor space.

Table 21: Identifier, activity, distance, and select descriptor data for physostigmine and its five nearest neighbors from most to least similar.

CAS #      57-47-6   100325-51-7   113698-18-3   732-11-6   113124-69-9   6098-44-8
Activity   0         0             1             1          1             1
Distance   0         0.437         0.656         0.715      0.790         0.836
Diameter   12.8      12.4          13.1          11.5       13.1          12.5
Weight     275.3     330.3         296.3         317.3      282.3         281.3

From the table, the compound most similar to physostigmine is Ames negative while the remainder, all of slightly greater dissimilarity, are Ames positive. Examining the structural characteristics associated with mutagenicity, the significant numeric descriptors for the aforementioned local model are presented in Figure 60. Diameter and radius of gyration, which are size and mass distribution descriptors respectively, are associated with positive activity. At the same time, molecular weight and span (i.e. the radius which encloses all atoms from the molecule’s center of mass) are correlated with negative activity. These coefficients suggest a non-obvious set of associations between Ames mutagenicity and molecular size, shape, and distribution of mass which are loosely observable in the data of Table 21.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 60: PLS, GS(ON), LS(ON), G(8), L(2) local logistic regression coefficients for the model predicting physostigmine.

Compounds with greater weight tend to be Ames negative whereas those with more elongated shapes (i.e. larger diameters) are more frequently associated with positive Ames activity. At first glance, two positively correlated descriptors with opposite effects on the response may seem contradictory, but such relationships are possible. As a simple analogy, consider the relationship between plant growth, rainfall, and cloud cover. Rainfall and cloud cover share a positive association, yet rainfall generally increases plant growth whereas cloud cover detracts from it. Briefly shifting attention to the ToxPrint chemotypes, only generic halides are found to be significantly correlated with negative Ames mutagenicity. Physostigmine does not contain this chemotype while its nearest neighbor does. Speculation as to the source of disagreement between the local model predictions and the true activity of physostigmine could be multi-factorial; undersampling of the relevant molecular space, a lack of truly discriminating descriptors, or the presence of an “activity cliff,” in which Ames mutagenicity varies greatly from point to point in this portion of the descriptor space, are all possible explanations.

Concluding the discussion on interpretability, it is possible to extract QSAR information applicable to the entire training data through examination of the significant descriptors across the local model ensemble and their associations with the response. Figure 61 shows the frequency with which numeric descriptors are deemed significant among local models and the sign of their association with Ames mutagenicity. For example, approximate surface area (ASA) is a significant descriptor in roughly 45 separate local models; it is positively associated with mutagenicity in approximately 20 of these models and negatively associated with mutagenicity in about 25. The same information is depicted for the ToxPrint chemotypes in Figure 62. Reviewing the numeric descriptors first, none of these structural features are found to discriminate completely between positive and negative Ames mutagenicity.


“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions; locality-defining radius of 1.0.

Figure 61: Frequency and association of significant descriptors among local training sets (logistic regression coefficients) – PLS, GS(ON), LS(ON), G(8), L(2), radius of 1.0.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions; locality-defining radius of 1.0.

Figure 62: Frequency and association of significant chemotypes among local training sets (logistic regression coefficients) – PLS, GS(ON), LS(ON), G(8), L(2), radius of 1.0.

That is to say, each numeric descriptor appears both positively and negatively correlated with the endpoint in almost equal proportion. Moving to the chemotypes, there are some descriptors uniquely related to one activity class. This is anticipated, since the purpose of the ToxPrint chemotypes is to describe toxicologically based structure-activity relationships. As a specific example, the acyl halide functional group is only found to correlate with Ames positive character, indicating the possibility of a structural alert for mutagenicity. This notion is confirmed by a literature search which notes acyl halides as known indicators of Ames mutagenicity [114]. Alternatively, alkenyl and acyclic carboxylic esters are found to associate only with negative Ames mutagenicity. A compound from the Hansen et al. data set containing both features is shown in Figure 63:

Figure 63: CAS 14882-94-1, an Ames negative compound, with highlighted alkenyl (blue) and acyclic (orange) carboxylic ester chemotypes.

Conjecture as to why these chemotypes are related to negative activity, if any causal relationship exists, includes a possible incompatibility with interacting with DNA molecules or a potential cellular clearance mechanism. Furthermore, the frequency and correlation data shown in Figure 61 and Figure 62 are subject to possible change between local model ensembles. The emphasis here is that it is demonstrably possible to extract latent, global trends present in the data set even when adopting the proposed local modeling strategy.
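
Operationally, the tallies plotted in Figures 61 and 62 reduce to counting coefficient signs across the ensemble; a sketch of that aggregation (the dict-based representation of each local model's significant coefficients is an assumption of this illustration):

from collections import defaultdict

def sign_frequencies(local_coefficient_dicts):
    # Each element maps descriptor name -> logistic coefficient for one
    # local model (only significant descriptors are present). Returns, per
    # descriptor, [number of positive, number of negative] associations.
    counts = defaultdict(lambda: [0, 0])
    for coefs in local_coefficient_dicts:
        for name, b in coefs.items():
            counts[name][0 if b > 0 else 1] += 1
    return dict(counts)

# e.g. an output entry {"ASA": [20, 25]} would reproduce the ASA split
# described for Figure 61.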

4.2 Performance Comparison between the Proposed Methodology and State-of-the-Art Models from the Literature

The purpose of this section is to compare the performance of the proposed methodology to that of models reported in the literature. Table 22 shows area under the receiver operating characteristic curve (AUC) values for different learning methods reported by Hansen et al. and for select models generated by the proposed methodology on the Ames mutagenicity data set. Briefly, the AUC measures the ability of a classifier to distinguish between classes and falls on the interval [0, 1]. Its value is generated from the receiver operating characteristic (ROC) curve, which plots the sensitivity vs. (1 - specificity) of a classifier as the decision boundary is varied from 0 to 1.0. A perfect classifier results in an AUC of 1.0, whereas a classifier performing no better than random guessing has a value of approximately 0.5. Finally, a classifier with an AUC value of 0.0 predicts the opposite class perfectly (i.e. all positives are predicted negative and vice versa). As described by Hansen et al., the models reported from the literature are constructed using a series of DragonX software descriptors which include constitutional, topological, and geometric information, functional group counts, atom-centered fragments, and molecular properties, with no other data processing algorithms specified [50].
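
The threshold sweep described above can be written directly; a sketch of the trapezoidal AUC computation (ties between scores are ignored for brevity; a library routine such as sklearn.metrics.roc_auc_score handles them properly):

import numpy as np

def roc_auc(y_true, scores):
    # Sort by decreasing predicted probability; each prefix of this
    # ordering corresponds to one decision threshold on the ROC curve.
    y = np.asarray(y_true)[np.argsort(-np.asarray(scores))]
    tpr = np.concatenate(([0.0], np.cumsum(y) / y.sum()))            # sensitivity
    fpr = np.concatenate(([0.0], np.cumsum(1 - y) / (1 - y).sum()))  # 1 - specificity
    return np.trapz(tpr, fpr)   # area under the ROC curve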

Table 22: Area under the receiver operating characteristic curve values for literature models and models from the proposed methodology on Ames mutagenicity.

“PLS” or “t-SNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” or “GS(OFF)”– global descriptor selection enabled or disabled; “LS(ON)” or “LS(OFF)” – local selection enabled or disabled; “G(8)” or “G(2)” – global space reduced to 8 or 2 dimensions; “L(8)” or “L(2)”– local space reduced to 8 or 2 dimensions.

Source                      Model                                AUC
Hansen et al., 2009 [50]    Support Vector Machines              0.86
                            Gaussian Process                     0.84
                            Random Forest                        0.83
                            k-Nearest Neighbor                   0.79
Proposed Methodology        PLS, GS(ON), LS(ON), G(8), L(2)      0.79
                            PLS, GS(ON), LS(ON), G(8), L(8)      0.81
                            PLS, GS(OFF), LS(OFF), G(8), L(2)    0.80
                            PLS, GS(OFF), LS(OFF), G(8), L(8)    0.80
                            t-SNE, GS(ON), LS(ON), G(2), L(2)    0.79

Examining Table 22, the performance of the several models generated from the local QSAR workflow described herein, as judged by the AUC, is roughly equivalent to or slightly worse than that of the models reported by Hansen et al. Specifically, the local modeling methodology achieves equivalent or slightly better discriminating potential than the k-NN classifier, but less than that of the SVM, GP, or Random Forest classifiers. As a final note, some uncontrolled sources of variability between the two sets of models include the number and types of chemical descriptors employed and the coverage of the test sets obtained by each. Addressing the latter aspect, all local models reported in Table 22 predict more than 90% of the validation test sets, for as fair a comparison as possible. Hansen et al. do not clearly indicate whether any test instances are excluded from prediction on grounds of pre-processing error or domain of applicability.

The local QSAR modeling workflow is also evaluated using the blood-brain barrier (BBB) data set compiled by Muehlbacher et al., as described in Chapter 3. This data set is considerably smaller than the Ames mutagenicity data set; the Ames mutagenicity set contains 6,512 molecules, whereas the blood-brain barrier set numbers 362 compounds in total. Benchmarking the proposed methodology against a smaller data set investigates whether the local modeling strategy remains useful and competitive when the number of training samples is limited and the data are more homogeneous. The comparisons made here are broader and more qualitative in nature, considering the differences in training set sizes, selection of descriptors, learning methods, and coverages between the proposed methodology and the models reported in the literature. With that in mind, performance statistics for both sources are shown in Table 23. Note that the models reported by Li et al. derive from a separate data set than that compiled by Muehlbacher et al.; however, both data sets concern experimentally obtained blood-brain barrier permeability as the endpoint of interest. In order to mitigate potential sources of bias, all models from the local QSAR workflow listed in Table 23 represent predictions on greater than 90% of the molecules in the test sets under a 5-fold cross-validation scheme.

Table 23: Performance statistics on literature models and models from the proposed methodology on blood-brain barrier permeability.

“PLS” or “t-SNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” or “GS(OFF)”– global descriptor selection enabled or disabled; “LS(ON)” or “LS(OFF)” – local selection enabled or disabled; “G(8)” or “G(2)” – global space reduced to 8 or 2 dimensions; “L(8)” or “L(2)”– local space reduced to 8 or 2 dimensions.

Source                 Model                                Training   Sensitivity   Specificity   Accuracy   MCC
                                                            Set Size   (%)           (%)           (%)
Li et al., 2005 [115]  Logistic Regression                  415        83.9          46.4          71.0       0.321
                       Linear Discriminant Analysis         415        78.2          58.3          71.2       0.360
                       k-Nearest Neighbors                  415        85.5          61.4          77.1       0.477
                       Support Vector Machines              415        88.6          75.0          83.7       0.645
Proposed Methodology   PLS, GS(ON), LS(ON), G(8), L(2)      328        84.1          71.6          79.5       0.498
                       PLS, GS(ON), LS(ON), G(8), L(8)      328        86.6          65.1          78.7       0.478
                       PLS, GS(OFF), LS(OFF), G(8), L(2)    328        86.0          68.1          79.3       0.490
                       PLS, GS(OFF), LS(OFF), G(8), L(8)    328        89.4          75.2          84.0       0.565
                       t-SNE, GS(ON), LS(ON), G(2), L(2)    328        85.3          72.6          80.1       0.503

As is readily apparent from the table, the performance of models from this work, as determined by the Matthews correlation coefficient, exceeds that of the logistic regression and linear discriminant analysis based models from Li et al. Furthermore, the discriminative capacity of the local QSAR workflow is equal to or greater than that of the k-nearest neighbors classifier. Finally, the support vector machine derived model is superior to all others displayed in the table. Of note, while the local QSAR model ensembles of Table 23 may seem rather successful, their performance is at times significantly below that of accompanying global models also generated from the workflow. A few select instances of this observation are illustrated in Table 24. The scarcity of training data within the blood-brain barrier data set is compounded by the local modeling strategy, which further reduces the number of training compounds available to predict queries. The resulting loss of information relative to the global model is only sometimes recovered when the local radii increase to a size sufficient to encompass large portions of the overall training samples. These results indicate that the local QSAR workflow is better suited toward larger data compilations. Such compilations are by construction typically more diverse and therefore likely to involve multiple underlying mechanisms which can be isolated and exploited by the local strategy.
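For reference, the Matthews correlation coefficient used in Tables 23 and 24 is computed from the full confusion matrix; a minimal sketch with scikit-learn [105], using illustrative rather than actual predictions:

from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Illustrative observed and predicted classes (1 = BBB permeable).
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), ranging
# from -1 (total disagreement) through 0 (random) to +1 (perfect).
print(matthews_corrcoef(y_true, y_pred))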


Table 24: Comparison of local, global-on-local, and global overall models from the proposed methodology applied to the blood-brain barrier data set.

“PLS” or “t-SNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” or “GS(OFF)”– global descriptor selection enabled or disabled; “LS(ON)” or “LS(OFF)” – local selection enabled or disabled; “G(8)” or “G(2)” – global space reduced to 8 or 2 dimensions; “L(8)” or “L(2)”– local space reduced to 8 or 2 dimensions.

Input Parameters                     Model               Coverage   Sensitivity   Specificity   Accuracy   MCC
                                                         (%)        (%)           (%)           (%)
PLS, GS(ON), LS(ON), G(8), L(2)      Local               50.5       88.4          52.4          79.1       0.417
                                     Global-on-local     50.5       90.1          54.8          81.0       0.457
                                     Global (Overall)    94.7       86.3          68.1          79.4       0.492
PLS, GS(OFF), LS(OFF), G(8), L(8)    Local               93.2       89.4          75.2          84.0       0.565
                                     Global-on-local     93.2       90.9          79.6          86.7       0.606
                                     Global (Overall)    95.3       91.0          80.7          87.0       0.608

Chapter 5: Conclusions and Future Work

5.1 Conclusions

The research presented in this document is multi-faceted and its objectives are summarized in the following enumeration:

1. Develop a novel QSAR workflow utilizing a local modeling strategy. In this context, "local" refers to the use of subsets of training compounds deemed proximal to test molecules in the descriptor space, so that individual models can be learned to explain and predict their activity.

2. Implement a mechanism for identifying descriptor subsets for specific use toward individual test compounds. This is in contrast to traditional modeling strategies, which identify and use a single set of "global" descriptors, correlated with the endpoint, for all test molecules.

3. Incorporate within the proposed local QSAR workflow a non-linear dimensional reduction technique. Dimensional reduction is used to alleviate the detrimental effects of high-dimensional, multicollinear data. Traditional QSAR models frequently use linear dimensional reduction methods despite the fact that these can misrepresent the non-linear nature of real-world data.

4. Benchmark the performance capabilities of the proposed methodology, especially in regard to the aforementioned design features, against the traditional, global modeling strategy and against accepted models from the QSAR literature.

5. Highlight the unique interpretability aspects of the novel local QSAR workflow.

As described in Chapter 3, a novel QSAR modeling workflow is presented in detail which accomplishes the goals of points 1-3. Addressing the first point, after pre-processing and transforming input data into an acceptable form, a radius-based mechanism identifies training samples in proximity to test queries in the resolved descriptor space. These training molecules constitute a subset of “local” data used to explain the endpoint for its corresponding test compound. This process is repeated for all test molecules requiring prediction, in effect constituting an ensemble of local models covering all test samples deemed within the applicability domain.
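A rough sketch of this radius-based mechanism is given below; the variable names are illustrative, and the minimum-sample and non-uniform-response checks follow the settings discussed in Section 5.2 rather than reproducing the workflow code of Appendix C:

import numpy as np

def local_training_set(X_train, y_train, x_query, radius, min_samples=3):
    """Return the training samples within `radius` of a query point in
    the reduced descriptor space, or None when the locality criteria
    (enough samples, non-uniform response) are not met."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean
    mask = distances <= radius
    if mask.sum() < min_samples or len(set(y_train[mask])) < 2:
        return None  # no local model can be learned for this query
    return X_train[mask], y_train[mask]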

Second, the proposed methodology accomplishes the task of identifying unique subsets of descriptors for predicting individual query molecules. These "local" descriptor sets derive from two sources. The first is the variation of descriptor data naturally present within the training samples in proximity to the test compound: some descriptors are removed from consideration because they have no variance among the subset of training molecules. Second, the QSAR modeling workflow conducts univariate descriptor selection on each local descriptor set, thereby preventing irrelevant descriptors from influencing the resulting query-specific models.
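A minimal sketch of this two-stage filtering follows, assuming a variance screen and a univariate Fisher exact test for binary descriptors (the test named in the sample parameter file of Appendix A); continuous descriptors would require a different univariate statistic and are simply passed through here:

import numpy as np
from scipy.stats import fisher_exact

def select_local_descriptors(X_local, y_local, binary_cols, alpha=0.05):
    """Drop zero-variance descriptors, then keep binary descriptors whose
    association with the binary endpoint passes a Fisher exact test."""
    keep = []
    for j in range(X_local.shape[1]):
        col = X_local[:, j]
        if np.var(col) == 0:  # no variance within the local training set
            continue
        if j in binary_cols:
            # 2x2 contingency table: descriptor presence vs. activity
            table = [[np.sum((col == v) & (y_local == c)) for c in (0, 1)]
                     for v in (0, 1)]
            _, p = fisher_exact(table)
            if p >= alpha:
                continue  # not significantly associated with the endpoint
        keep.append(j)
    return keep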

Third, the local QSAR modeling workflow presented herein successfully incorporates a non-linear dimensional reduction technique by adapting the work of van der Maaten and Hinton, t-Distributed Stochastic Neighbor Embedding (t-SNE), for use in QSAR modeling. The t-SNE algorithm generates a 2-D or 3-D representation such that data points which are similar or dissimilar in the high-dimensional space are close together or far apart, respectively, in the low-dimensional mapping. t-SNE, by design, emphasizes the preservation of local structure contained within high-dimensional data spaces, which should make the algorithm particularly amenable to a local QSAR modeling strategy. Moreover, because t-SNE does not provide a mathematical function from the high- to low-dimensional space, the proposed methodology combines the training and test data into a single set such that the samples of each can be repositioned together. Models derived from this process are compared to those based on Partial Least Squares projections, a common dimensional reduction technique found in QSAR analysis.
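A minimal sketch of this joint embedding using scikit-learn's t-SNE implementation [105] (the data and parameter values are illustrative, not the workflow code of Appendix C):

import numpy as np
from sklearn.manifold import TSNE

# Hypothetical descriptor matrices in the high-dimensional space.
X_train = np.random.rand(100, 50)
X_test = np.random.rand(25, 50)

# Stack the two sets so both are repositioned together, since t-SNE
# provides no mapping for projecting new points into an embedding.
X_all = np.vstack([X_train, X_test])
embedding = TSNE(n_components=2, perplexity=30).fit_transform(X_all)

emb_train = embedding[:len(X_train)]
emb_test = embedding[len(X_train):]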

The fourth and fifth objectives of this work are accomplished in the results and discussion of Chapter 4, which documents the unique behaviors and capabilities of the proposed methodology. This is completed through a detailed analysis of the predictions made on the Ames mutagenicity data set compiled by Hansen et al. Summarizing the most significant discoveries, local models exhibit significant improvement over comparable global models, especially when predicting the same portions of test compounds. The best performing models result from a global descriptor space reduced to high dimension (i.e., 8) and local descriptor spaces reduced to low dimension (i.e., 2). Furthermore, these local model ensembles are convenient when seeking to visualize the distribution of training samples and decision boundaries for individual queries. Incorporating univariate descriptor selection is most useful when applied during the local phase and when the local training sets are small to medium-sized. However, the performance of these models falls below that of global models when the coverage approaches the maximum dictated by the applicability domain. Such a detrimental outcome is alleviated either by increasing the dimension of the local descriptor spaces (i.e., to 8) or by allowing the locality-defining radius to vary between test compounds. This latter adaptation involves testing a range of locality-defining radii and retaining, for each query molecule, the prediction from the smallest radius, and by extension the smallest local training set, capable of providing a prediction.

Lastly, one of the most important discoveries is the observation that local models outperform global models by as much as 5% to 10% when compared by the dimension of the modeling space. Local models without univariate selection confer the most improvement, followed by those with univariate selection enabled.

A sizeable portion of the results and discussion concerns the implementation of t-SNE as a non-linear dimensional reduction technique used during the global phase of the proposed methodology. As anticipated, t-SNE-based global QSAR models display substandard performance, since the algorithm is not designed to discriminate on a global scale. On the other hand, t-SNE-based local model ensembles offer performance competitive with PLS-based local model ensembles. Perplexity, a tunable parameter of t-SNE which can be interpreted as the number of neighbors influencing similarity determinations as the algorithm optimizes, is found to have a limited effect on local model ensemble performance. A benefit of utilizing the t-SNE technique is visualization of the global descriptor space from which local models are derived; the t-SNE algorithm can display in two dimensions what requires eight dimensions with a PLS transformation.

The interpretability features of the proposed methodology are illustrated by review of select test compounds and the global and local models which generated predictions for them. A particular example, 1-(4-Chlorophenyl)-3,3-dimethyltriazene, is predicted incorrectly by global models but correctly classified by local models. For this test molecule, the local modeling strategy is shown to effectively attenuate the influence of descriptors correlating with negative Ames mutagenicity on a global level. Examining the descriptor coefficients of the logistic regression on individual local models is frequently easier with respect to interpretability, since local models have significantly fewer training samples and descriptors than their global model counterparts. Finally, a valid criticism of local modeling methods involves the difficulty of recognizing relationships present throughout the entire training set. This problem is addressed by extracting the frequencies and associations of significant descriptors occurring across local model ensembles; for example, this analytic approach successfully identified acyl halides as a structural alert for positive Ames mutagenicity.

Finally, the results and discussion conclude with a comparison of the local QSAR workflow with state-of-the-art QSAR models reported in the literature. Data sets on two endpoints, Ames mutagenicity and permeability through the blood-brain barrier, are considered. The local QSAR workflow offers similar performance to models derived from machine learning techniques such as k-nearest neighbors, linear discriminant analysis, Gaussian processes, random forests, and support vector machines. On the Ames mutagenicity data set from Hansen et al., AUC values from the former range between 0.79 and 0.81, whereas those of the latter range from 0.79 to 0.86. Likewise, on the blood-brain barrier data set compiled by Muehlbacher et al., Matthews correlation coefficients range from 0.321 to 0.645 for the literature-reported models and from 0.478 to 0.565 for the proposed methodology. Ultimately, performance from the proposed methodology is frequently below that of comparable global models on the smaller blood-brain barrier data set, suggesting its use should be reserved for larger data compilations.

5.2 Future Work

As with any research project, the proposed methodology is not without certain limitations and open questions. Several aspects of the local QSAR workflow may be better understood and improved through future efforts. The goal of this section is to describe some of the most visible areas of improvement and the direction of future work.

A contentious design aspect of the proposed methodology is the minimum number of training molecules required to learn a model, particularly for local models, where the number of samples is frequently small. The results presented in Chapter 4 are derived with this number set at 3 samples within a distance r of the query compound in the global descriptor space, where r is the locality-defining radius. Therefore, as long as the response values of these local training compounds are not uniform, the workflow will proceed to learn a logistic regression model. This is problematic, as there are very likely too few samples to estimate the regression coefficients without considerable variability (i.e., uncertainty) and, ultimately, poor performance. Addressing this difficulty, some heuristics on the number of samples recommended to effectively learn a reliable model have been presented in the literature. The term "events per variable" (EPV), in a classification setting, is defined as the ratio of the number of observations in the smallest class to the number of descriptors considered for model construction [116]. In van Smeden et al.'s article on the topic, an EPV of at least 10 is sought in the medical literature, though some simulation studies have found that an EPV of at least 50 is necessary to achieve acceptable performance [116]. Ultimately, the outcome of van Smeden et al.'s simulations on binary logistic regression calls for abandoning any EPV criterion in exchange for validation studies using statistics such as the root mean squared prediction error (rMSPE) or mean absolute prediction error (MAPE) [116]. In light of this information, future work on the local QSAR workflow will need to consider a range of minimum sample values for local training sets, optimized under cross-validation and/or external validation schemes for the data set being modeled. If the minimum number of samples is set too large, the coverage of the local model ensemble could become unacceptably small, since the number of local training samples is linked to the locality-defining radius; in other words, the minimum number of samples will not fall within a distance r for many test compounds. Additionally, increasing the size of the radii to meet this requirement will diminish the local character of the resulting models, eventually reducing performance to no better than that of comparable global models.
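The EPV criterion itself is simple to evaluate; a minimal sketch following the definition in [116]:

def events_per_variable(y_local, n_descriptors):
    """EPV: observations in the smallest class divided by the number of
    descriptors considered for model construction [116]."""
    n_positive = sum(1 for y in y_local if y == 1)
    n_negative = len(y_local) - n_positive
    return min(n_positive, n_negative) / n_descriptors

# A local set with 12 positives, 30 negatives, and 2 reduced descriptors
# gives an EPV of 6, below the common heuristic of 10.
print(events_per_variable([1] * 12 + [0] * 30, 2))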

Another area of potential improvement involves the univariate descriptor selection technique employed by the proposed methodology. Recalling the background information from Chapter 2, descriptor selection methods are typically classified as either filter or wrapper methods. Filter methods, using a scoring metric, rank descriptors either individually or in subsets as a pre-processing step prior to model training [32], [117]. On the other hand, wrapper methods utilize a specific learning algorithm, perhaps separate from the primary training algorithm, in an iterative process to determine a descriptor subset with optimal predictive performance [32], [117]. Univariate descriptor selection, a filter method, is chosen as the descriptor selection method for this work due to its simplicity, its low computational demand, and its ability to reduce overfitting through the removal of potentially uninformative descriptors. Computational demand is particularly important since the proposed methodology is tasked with learning hundreds to thousands of models, especially during parameter optimization. However, univariate descriptor selection is not without disadvantages, such as its inability to recognize the predictive power or redundancy of some descriptors in the presence of others. Because of these shortcomings, univariate feature selection is typically regarded as one of the worst descriptor selection methods available.

A possible augmentation to the proposed algorithm is to use correlation-based feature selection (CFS) during the global and local phases. CFS is a subroutine which determines the merit of a subset of descriptors according to Equation (46):

\mathrm{Merit}_S = \frac{k\,\bar{r}_{cf}}{\sqrt{k + k(k-1)\,\bar{r}_{ff}}} \tag{46}

where \mathrm{Merit}_S is the score of descriptor subset S containing k descriptors, \bar{r}_{cf} is the mean feature-class (i.e., endpoint) correlation, and \bar{r}_{ff} is the mean feature-feature correlation [118]. A forward, best-first search strategy from an empty set with a non-improving stopping condition is used to find descriptor subsets. Hall and Smith compared CFS to Naïve Bayes and Decision Tree wrappers and found it to give equal or better predictive performance on various data sets [118]. Addressing the intricacies of the proposed methodology specifically, CFS is an order of magnitude faster computationally than the wrapper methods and accounts for descriptor multicollinearity. However, the increase in computational demand above that of the current univariate feature selection method may make the change impractical.
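A sketch of Equation (46) and the forward search it drives appears below; this is an illustrative implementation assuming non-constant descriptor columns, not the CFS code of Hall and Smith:

import numpy as np

def cfs_merit(X, y, subset):
    """Merit of descriptor subset S per Equation (46): mean feature-class
    correlation rewarded, mean feature-feature correlation penalized."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                    for i in subset for j in subset if i < j])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward_search(X, y):
    """Forward search from the empty set with a non-improving stop."""
    subset, best = [], -np.inf
    candidates = set(range(X.shape[1]))
    while candidates:
        scores = {j: cfs_merit(X, y, subset + [j]) for j in candidates}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:
            break  # no single addition improves the merit
        subset.append(j_best)
        best = scores[j_best]
        candidates.remove(j_best)
    return subset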

Finally, at present it is not fully understood to what extent local descriptor selection contributes to the creation of unique local descriptor sets, nor how strongly such sets correlate with model performance. Implementing univariate feature selection during the local phase of the workflow did indeed result in more sensitive local model ensembles, as is evident in Figure 39. To explore this notion further, a more rigorous and quantitative investigation should be conducted to confirm or refute the premise that unique local descriptor sets are correlated with improved model performance. For each test molecule, the uniqueness of its local descriptor set can be determined by calculating the similarity between it and the globally significant descriptors from the preceding phase of the workflow. More concretely, the Tanimoto similarity from Chapter 2 could be used to complete this calculation:

S_{Tanimoto} = \frac{C}{A + B - C} \tag{47}

where, in the present context, C is the number of descriptors found in both the local and global sets, A is the number of descriptors in the local set, and B is the number of descriptors in the global set. Similarities calculated in this fashion will tend to be small, since the number of global descriptors usually outnumbers those in the local sets considerably.
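Equation (47) translates directly into code; a minimal sketch assuming the descriptor names are held as Python sets (the names below are illustrative):

def tanimoto_similarity(local_set, global_set):
    """Tanimoto similarity between a local descriptor set and the
    globally significant descriptors, per Equation (47)."""
    c = len(local_set & global_set)          # descriptors in both sets
    a, b = len(local_set), len(global_set)   # size of each set
    return c / (a + b - c)

# A local set sharing 2 of its 3 descriptors with an 8-descriptor global
# set yields 2 / (3 + 8 - 2), a small similarity, as expected.
print(tanimoto_similarity({"HAcc", "Weight", "ASA"},
                          {"HAcc", "Weight", "Atoms", "Bonds",
                           "BondsRot", "Complex", "ComplexRing", "HDon"}))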

Finishing the analysis involves partitioning the collection of local descriptor set similarities into two populations by the accuracy of their predictions: those which made a correct prediction (i.e., true positives and true negatives) and those which made an incorrect prediction (i.e., false positives and false negatives). If the uniqueness of the local descriptor sets has no effect on performance, then the full range of similarities should be dispersed throughout both true and false predictions. Conversely, the premise dictates that similarities for correct predictions should be significantly smaller than those from incorrect predictions. This problem description fits the formulation of a one-sided Welch's t-test with the following hypotheses:

H_0: \bar{S}_{correct} = \bar{S}_{incorrect} \tag{48}

H_1: \bar{S}_{correct} < \bar{S}_{incorrect} \tag{49}

Welch's t-test assumes the variances of the two populations are unequal and is robust to skewed distributions and large sample sizes [119]. Lastly, even if the null hypothesis is rejected for local model ensembles in which local descriptor selection is active, the uniqueness of the descriptor sets may still result mainly from the variances naturally present in the local training sets. Repeating the above analysis on comparable local model ensembles in which local feature selection is disabled would provide sufficient evidence toward the question of whether refinement of local descriptor sets contributes significantly to improved model performance.
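The test in Equations (48) and (49) is available in SciPy (included in the Anaconda distribution of [102]); a minimal sketch with hypothetical similarity values (the one-sided alternative keyword requires SciPy 1.6 or later):

import numpy as np
from scipy.stats import ttest_ind

# Hypothetical Tanimoto similarities, split by prediction outcome.
s_correct = np.array([0.08, 0.12, 0.10, 0.05, 0.15, 0.09])
s_incorrect = np.array([0.18, 0.22, 0.14, 0.25, 0.20])

# equal_var=False selects Welch's t-test; alternative="less" encodes the
# hypothesis that correct predictions have the smaller similarities.
t_stat, p_value = ttest_ind(s_correct, s_incorrect,
                            equal_var=False, alternative="less")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")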

Bibliography

[1] A. Cherkasov, E. N. Muratov, D. Fourches, A. Varnek, I. I. Baskin, M. Cronin, J. Dearden, P. Gramatica, Y. C. Martin, R. Todeschini, V. Consonni, V. E. Kuz'min, R. Cramer, R. Benigni, C. Yang, J. Rathman, L. Terfloth, J. Gasteiger, A. Richard and A. Tropsha, "QSAR Modeling: Where Have You Been? Where Are You Going To?," Journal of Medicinal Chemistry, vol. 57, no. 12, pp. 4877-5010, 2013.

[2] Y. C. Martin, "Hansch analysis 50 years on," Wiley Interdisciplinary Reviews: Computational Molecular Science, vol. 2, no. 3, pp. 435-442, 2012.

[3] C. Hansch and T. Fujita, "p-σ-π Analysis. A Method for the Correlation of Biological Activity and Chemical Structure," Journal of the American Chemical Society, vol. 86, no. 8, pp. 1616-1626, 1964.

[4] J. Gasteiger, "Chemoinformatics: Achievements and Challenges, a Personal View," Molecules, vol. 21, no. 151, pp. 1-15, 2016.

[5] "CAS REGISTRY - The gold standard for chemical substance information," American Chemical Society, 2019. [Online]. Available: https://www.cas.org/support/documentation/chemical-substances. [Accessed 1st June 2019].

[6] R. Lewis and D. Wood, "Modern 2D QSAR for drug discovery," Wiley Interdisciplinary Reviews: Computational Molecular Science, vol. 4, no. 6, pp. 505-522, 2014.

[7] A. Khan, "Descriptors and their selection methods in QSAR analysis: Paradigm for drug design," Drug Discovery Today, vol. 21, no. 8, pp. 1291-1302, 2016.

[8] Y.-C. Lo, S. E. Rensi, W. Torng and R. B. Altman, "Machine learning in chemoinformatics and drug discovery," Drug Discovery Today, vol. 23, no. 8, pp. 1538-1546, 2018.

[9] O. T. Devinyak and R. B. Lesyk, "5-Year Trends in QSAR and its Machine Learning Methods," Current Computer-Aided Drug Design, vol. 12, no. 4, pp. 265-271, 2016.

[10] D. Rognan, "The impact of in silico screening in the discovery of novel and safer drug candidates," Pharmacology & Therapeutics, vol. 175, pp. 47-66, 2017.

[11] S. J. Y. Macalino, V. Gosu, S. Hong and S. Choi, "Role of computer-aided drug design in modern drug discovery," Archives of Pharmacal Research, vol. 38, no. 9, pp. 1686-1701, 2015.

[12] S. Ekins, J. Mestres and B. Testa, "In silico pharmacology for drug discovery: Methods for virtual screening and profiling," British Journal of Pharmacology, vol. 152, no. 1, pp. 9-20, 2007.

[13] K.-Y. Kim, S. E. Shin and K. T. No, "Assessment of quantitative structure-activity relationship of toxicity prediction models for Korean chemical substance control legislation," Environmental Health and Toxicology, vol. 30, pp. 1-10, 2015.

[14] N. L. Kruhlak, R. D. Benz, H. Zhou and T. J. Colatsky, "(Q)SAR Modeling and Safety Assessment in Regulatory Review," Clinical Pharmacology & Therapeutics, vol. 91, no. 3, pp. 529 - 534, 2012.

[15] A. Amberg, L. Beilke, J. Bercu, D. Bower, A. Brigo, K. P. Cross, L. Custer, K. Dobo, E. Dowdy, K. A. Ford, S. Glowienke, J. V. Gompel, J. Harvey, C. Hasselgren, M. Honma, R. Jolly, R. Kemper, M. Kenyon, N. Kruhlak, P. Leavitt, S. Miller, W. Muster, J. Nicolette, A. Plaper, M. Powley, D. P. Quigley, M. V. Reddy, H.-P. Spirkl, L. Stavitskaya, A. Teasdale, S. Weiner, D. S. Welch, A. White, J. Wichard and G. J. Myatt, "Principles and procedures for implementation of ICH M7 recommended (Q)SAR analyses," Regulatory Toxicology and Pharmacology, vol. 77, pp. 13-24, 2016.

[16] T. Engel, "Basic Overview of Chemoinformatics," Journal of Chemical Information and Modeling, vol. 46, no. 6, pp. 2267-2277, 2007.

[17] A. R. Leach and V. J. Gillet, An Introduction to Chemoinformatics, Revised ed., Dordrecht: Springer Netherlands, 2007.

[18] A. Dalby, J. G. Nourse, W. D. Hounshell, A. K. I. Gushurst, D. L. Grier, B. A. Leland and J. Laufer, "Description of Several Chemical Structure File Formats Used by Computer Programs Developed at Molecular Design Limited," Journal of Chemical Information and Modeling, vol. 32, no. 3, pp. 244-255, 1992.

[19] J. Gasteiger and T. Engel, Eds., Chemoinformatics: A Textbook, Darmstadt: WILEY-VCH Verlag GmbH & Co., 2003.

[20] G. Landrum, RDKit, Open-Source Cheminformatics (version 2019.03.01), 2019.

[21] A. Z. Dudek, T. Arodz and J. Galvez, "Computational Methods in Developing Quantitative Structure-Activity Relationships (QSAR): A Review," Combinatorial Chemistry & High Throughput Screening, vol. 9, no. 3, pp. 213-228, 2006.

[22] A. Tropsha, "Best Practices for QSAR Model Development, Validation, and Exploitation," Molecular Informatics, vol. 29, no. 6-7, pp. 476-488, 2010.

[23] F. Sahigara, K. Mansouri, D. Ballabio, A. Mauri, V. Consonni and R. Todeschini, "Comparison of Different Approaches to Define the Applicability Domain of QSAR Models," Molecules, vol. 17, pp. 4791-4810, 2012.

[24] A. Golbraikh, X. S. Wang, H. Zhu and A. Tropsha, "Predictive QSAR Modeling: Methodsand Applications in Drug Discoveryand Chemical Risk Assessment," in Predictive QSAR Modeling: Methods and Applications in Drug Discovery and Chemical Risk Assessment, J. Leszczynski, Ed., Dordrecht, Springer-Verlag, 2016, pp. 1-36.

[25] V. Khanna and S. Ranganathan, "Molecular Similarity and Diversity Approaches in Chemoinformatics," Drug Development Research, vol. 72, no. 1, pp. 74-84, 2011.

[26] R. Todeschini and V. Consonni, Molecular Descriptors for Chemoinformatics, 2nd ed., Weinheim: Wiley-VCH Verlag GmbH & Co. KGaA, 2009.

[27] F. Grisoni, D. Ballabio, R. Todeschini and V. Consonni, "Molecular Descriptors for Structure–Activity Applications: A Hands-On Approach," in Computational Toxicology: Methods and Protocols, O. Nicolotti, Ed., New York, New York: Humana Press, 2018, pp. 3-53.

[28] "CORINA Symphony - Managing and Profiling Molecular Datasets - Program Manual (v1.1)," Molecular Networks GmbH, Germany and Altamira, LLC, USA, Columbus, Ohio, USA and Nürnberg, Germany, 2018.

[29] A. B. Raies and V. B. Bajic, "In silico toxicology: computational methods for the prediction of chemical toxicity," Wiley Interdisciplinary Reviews: Computational Molecular Science, vol. 6, no. 2, p. 147–172, 2016.

[30] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861-874, 2006.

[31] M. Shahlaei, "Descriptor Selection Methods in Quantitative Structure-Activity Relationship Studies: A Review Study," Chemical Reviews, vol. 113, no. 10, pp. 8093-8103, 2013.

[32] M. Goodarzi, B. Dejaegher and Y. V. Heyden, "Feature Selection Methods in QSAR Studies," Journal of AOAC International, vol. 95, no. 3, pp. 636-651, 2012.

[33] M. Mathea, W. Klingspohn and K. Baumann, "Chemoinformatic Classification Methods and their Applicability Domain," Molecular Informatics, vol. 35, no. 5, pp. 160-180, 2016.

[34] G. Idakwo, J. Luttrell IV, M. Chen, H. Hong, P. Gong and Z. Chaoyang, "A Review of Feature Reduction Methods for QSAR-Based Toxicity Prediction," in Advances in Computational Toxicology: Methodologies and Applications in Regulatory Science., H. Hong, Ed., Cham, Springer, Cham, 2019, pp. 119-139.

[35] S. Boughorbel, F. Jarray and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS One, vol. 12, no. 2: e0177678, 2017.

[36] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442-451, 1975.

[37] S. P. Leelananda and S. Lindert, "Computational methods in drug discovery," Beilstein Journal of Organic Chemistry, vol. 12, no. 1, pp. 2694-2718, 2016.

[38] Y. Wang, J. Xing, N. Zhou, J. Peng, Z. Xiong, X. Liu, X. Luo, C. Lou, K. Chen, M. Zheng and H. Jiang, "In silico ADME/T modeling for rational drug design," Quarterly Reviews of Biophysics, vol. 48, no. 4, pp. 488-515, 2015.

[39] S. M. Paul, D. S. Mytelka, C. T. Dunwiddie, C. C. Persinger, B. H. Munos, S. R. Lindborg and A. L. Schacht, "How to Improve R&D Productivity: The Pharmaceutical Industry's Grand Challenge," Nature Reviews Drug Discovery, vol. 9, no. 3, pp. 203-214, 2010.

[40] S. Boyer, C. Brealey and A. M. Davis, "Attrition in Drug Discovery and Development," in Attrition in the Pharmaceutical Industry: Reasons, Implications, and Pathways Forward, A. Alex, C. J. Harris and D. A. Smith, Eds., Hoboken, New Jersey: John Wiley & Sons, Inc., 2016, pp. 5-45.

[41] D. Cook, D. Brown, R. Alexander, R. March, P. Morgan, G. Satterthwaite and M. N. Pangalos, "Lessons learned from the fate of AstraZeneca's drug pipeline: a five-dimensional framework," Nature Reviews Drug Discovery, vol. 13, no. 6, pp. 419-431, 2014.

[42] M. J. Waring, J. Arrowsmith, A. R. Leach, P. D. Leeson, S. Mandrell, R. M. Owen, G. Pairaudeau, W. D. Pennie, S. D. Pickett, J. Wang, O. Wallace and A. Weir, "An analysis of the attrition of drug candidates from four major pharmaceutical companies," Nature Reviews Drug Discovery, vol. 14, no. 7, pp. 475-486, 2015.

[43] J. Mittra, The New Health Bioeconomy: R&D Policy and Innovation for the Twenty-First Century, Hampshire: Palgrave MacMillan, 2016.

[44] R. S. Judson, K. A. Houck, R. J. Kavlock, T. B. Knudsen, M. T. Martin, H. M. Mortensen, M. D. Reif, D. M. Rotroff, I. Shah, A. M. Richard and D. J. Dix, "In Vitro Screening of Environmental Chemicals for Targeted Testing Prioritization: The ToxCast Project," Environmental Health Perspectives, vol. 118, no. 4, pp. 485-492, 2010.

[45] V. M. Alves, S. J. Capuzzi, E. Muratov, R. C. Braga, T. Thornton, D. Fourches, J. Strickland, N. Kleinstreuer, C. H. Andrade and A. Tropsha, "QSAR models of human data can enrich or replace LLNA testing for human skin sensitization," Green Chemistry, vol. 18, no. 24, pp. 6501-6515, 2016.

[46] R. Guha, D. Dutta, P. C. Jurs and T. Chen, "Local Lazy Regression: Making Use of the Neighborhood to Improve QSAR Predictions," Journal of Chemical Information and Modeling, vol. 46, no. 4, pp. 1836-1847, 2006.

[47] A. Yu and K. Grauman, "Predicting Useful Neighborhoods for Lazy Local Learning," in Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Canada, 2014.

[48] I. Gijbels and I. Prosdocimi, "Loess," Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 5, pp. 590-599, 2010.

[49] J. B. O. Mitchell, "Machine learning methods in chemoinformatics," Wiley Interdisciplinary Reviews: Computational Molecular Science, vol. 4, no. 5, pp. 468-481, 2014.

[50] K. Hansen, S. Mika, T. Schroeter, A. Sutter, A. ter Laak, T. Steger-Hartmann, N. Heinrich and K.-R. Muller, "Benchmark Data Set for in Silico Prediction of Ames Mutagenicity," Journal of Chemical Information and Modeling, vol. 49, no. 9, pp. 2077-2081, 2009.

[51] F. Buchwald, T. Girschick, M. Seeland and S. Kramer, "Using Local Models to Improve (Q)SAR Predictivity," Molecular Informatics, vol. 30, no. 2-3, pp. 205- 218, 2011.

[52] E. Ahlberg, L. Carlsson, S. Boyer and U. Norinder, "Evaluation of Quantitative Structure-Activity Relationship Modeling Strategies: Local and Global Models," Journal of Chemical Information and Modeling, vol. 50, no. 4, pp. 677-689, 2010.

[53] M. H. Kutner, C. J. Nachtsheim and J. Neter, Applied Linear Regression Models, 4 ed., New York, New York: The McGraw-Hill Companies, Inc., 2004.

[54] S. S. Young, F. Yuan and M. Zhu, "Chemical Descriptors Are More Important Than Learning Algorithms for Modeling.," Molecular Informatics, vol. 31, no. 10, pp. 707-710, 2012.

[55] D. C. Montgomery, Design and Analysis of Experiments, 7 ed., Hoboken, New Jersey: John Wiley & Sons, Inc., 2009.

[56] A. Agresti, Categorical Data Analysis, 3 ed., Hoboken, New Jersey: John Wiley & Sons, Inc., 2013.

[57] N. Armanfard, J. P. Reilly and M. Komeili, "Local Feature Selection for Data Classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 6, pp. 1217-1227, 2016.

[58] K. Pichara and A. Soto, "Local Feature Selection Using Gaussian Process Regression," Intelligent Data Analysis, vol. 18, no. 3, pp. 319-336, 2014.

[59] K. S. Ng, "A Simple Explanation of Partial Least Squares," 27 April 2013. [Online]. Available: http://users.cecs.anu.edu.au/~kee/pls.pdf.

[60] M. Verleysen and D. François, "The Curse of Dimensionality in Data Mining and Time Series Prediction," in IWANN'05 Proceedings of the 8th International Conference on Artificial Neural Networks: Computational Intelligence and Bioinspired Systems, Barcelona, Spain, 2005.

[61] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, 6 ed., Upper Saddle River, New Jersey: Pearson Education, Inc., 2007.

[62] L. Ungar, "CIS 520 Machine Learning - 2018," 2018. [Online]. Available: https://alliance.seas.upenn.edu/~cis520/dynamic/2018/wiki/uploads/Lectures/pca-example-1D-of-2D.png. [Accessed 14 July 2019].

[63] P. H. Garthwaite, "An Interpretation of Partial Least Squares," Journal of the American Statistical Association, vol. 89, no. 425, pp. 122-127, 1994.

[64] D. Ruiz-Perez and G. Narasimhan, "So you think you can PLS-DA?," Cold Spring Harbor Laboratory, 2018. [Online]. Available: https://www.biorxiv.org/content/10.1101/207225v2. [Accessed 15 July 2019].

[65] J. VanderPlas, "Comparison of PCA and Manifold Learning," AstroML Developers, 2013.

[66] P. Henderson, "Sammon Mapping," Pattern Recognition Letters, vol. 18, no. 11- 13, p. 1307–1316, 1997.

[67] S. T. Roweis and L. K. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, vol. 290, no. 5500, pp. 2323-2326, 2000.

[68] L. van der Maaten and G. Hinton, "Visualizing Data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579-2605, 2008.

[69] L. Derksen, "Visualising high-dimensional datasets using PCA and t-SNE in Python," 29 April 2019. [Online]. Available: https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca- and-t-sne-in-python-8ef87e7915b. [Accessed 17 June 2019].

[70] T. A. Abeo, X.-J. Shen, E. D. Ganaa, Q. Zhu, B.-K. Bao and Z.-J. Zha, "Manifold Alignment via Global and Local Structures Preserving PCA Framework," IEEE Access, vol. 7, pp. 38123-38134, 2019.

[71] M. Wattenberg, F. Viégas and I. Johnson, "How to Use t-SNE Effectively," Distill, 2016.

[72] Y. Zhou and T. O. Sharpee, "Using global t-SNE to preserve inter-cluster data structure," bioRxiv, 2018.

[73] L. van der Maaten, "Learning a Parametric Embedding by Preserving Local Structure," in Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, Clearwater Beach, Florida, USA, 2009.

[74] P. Liu and W. Long, "Current Mathematical Methods Used in QSAR/QSPR Studies," International Journal of Molecular Sciences, vol. 10, no. 5, pp. 1978-1998, 2009.

[75] M. Perme, M. Blas and S. Turk, "Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study," Metodološki Zvezki, vol. 1, no. 1, pp. 143-161, 2004.

[76] T. M. Mitchell, "Machine Learning, Tom Mitchell, McGraw Hill," September 2017. [Online]. Available: http://www.cs.cmu.edu/%7Etom/mlbook/NBayesLogReg.pdf. [Accessed 28th July 2019].

[77] A. Ben-Hur and J. Weston, "A User’s Guide to Support Vector Machines," in Data Mining Techniques for the Life Sciences: Methods in Molecular Biology (Methods and Protocols), vol. 609, O. Carugo and F. Eisenhaber, Eds., New York: Humana Press, 2010, pp. 223-239.

[78] B. Lantz, "Chapter 7: Black Box Methods - Neural Networks and Support Vector Machines," in Machine Learning with R, 2 ed., Birmingham, Packt Publishing Ltd., 2015, pp. 219-258.

[79] M. Muehlbacher, G. Spitzer, K. R Liedl and J. Kornhuber, "Qualitative Prediction of Blood–Brain Barrier Permeability on a Large and Refined Dataset," Journal of Computer-Aided Molecular Design, vol. 25, no. 12, pp. 1095-1106, 2011.

[80] K. E. Mortelmans and E. Zeiger, "The Ames Salmonella/Microsome Mutagenicity Assay," Mutation Research, vol. 455, pp. 29-60, 2000.

[81] Histidine, "Ames test procedure.," Licensed under CC BY-SA 3.0. 5 February 2011. [Online]. Available: https://en.wikipedia.org/wiki/Ames_test#/media/File:Ames_test.svg. [Accessed 26 July 2019].

[82] "Chemical Carcinogenesis Research Information System on the NCRI Informatics Initiative Homepage," 2009. [Online]. Available: https://toxnet.nlm.nih.gov/newtoxnet/ccris.htm.

[83] J. Kazius, R. McGuire and R. Bursi, "Derivation and Validation of Toxicophores for Mutagenicity Prediction," Journal of Medicinal Chemistry, vol. 48, pp. 312-320, 2005.

[84] C. Helma, T. Cramer, S. Kramer and L. D. Raedt, "Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds," Journal of Chemical Information and Modeling, vol. 44, pp. 1402-1411, 2004.

[85] J. Feng, L. Lurati, H. Ouyang, T. Robinson, Y. Wang, S. Yuan and S. S. Young, "Predictive toxicology: benchmarking molecular descriptors and statistical methods," Journal of Chemical Information and Modeling, vol. 43, pp. 1463- 1470, 2003.

[86] P. N. Judson and et al., "Towards the creation of an international toxicology information centre," Toxicology, vol. 213, pp. 117-128, 2005.

[87] "Genetic Toxicity, Reproductive and Development Toxicity, and Carcinogenicity Database," 2009. [Online]. Available: https://www.fda.gov/AboutFDA/CentersOffices/CDER/ucm092217.htm.

[88] W. M. Pardridge, "Drug Transport Across the Blood-Brain Barrier," Journal of Cerebral Blood Flow & Metabolism, vol. 32, no. 11, pp. 1959-1972, 2012.

[89] K. A. Kübelbeck, "Sketch showing the transport types at the blood-brain barrier," Licensed under CC BY 3.0. 12 November 2011. [Online]. Available: https://commons.wikimedia.org/wiki/File:Blood-brain_barrier_transport_en.png. [Accessed 17 September 2019].

[90] S. Vilar, M. Chakrabarti and S. Costanzi, "Prediction of passive blood-brain barrier partitioning: straightforward and effective classification models based on in silico derived physicochemical descriptors," Journal of Molecular Graphics and Modelling, vol. 28, no. 8, pp. 899-903, 2010.

[91] J. A. Platts, M. H. Abraham, Y. Zhao, A. Hersey, L. Ijaz and Butina D, "Correlation and prediction of a large blood-brain distribution data set - an LFER study.," European Journal of Medicinal Chemistry, vol. 36, no. 9, pp. 719-730, 2001.

[92] R. Narayanan and S. B. Gunturi, "In silico ADME modelling: prediction models for blood-brain barrier permeation using a systematic variable selection method," Bioorganic & Medicinal Chemistry, vol. 13, no. 8, pp. 3017-3028, 2005.

[93] S. R. Mente and F. Lombardo, "A recursive-partitioning model for blood-brain barrier permeation," Journal of Computer-Aided Molecular Design, vol. 19, no. 7, pp. 465-481, 2005.

[94] L. Zhang, H. Zhu, T. I. Oprea, A. Golbraikh and A. Tropsha, "QSAR modeling of the blood-brain barrier permeability for diverse organic compounds," Pharmaceutical Research, vol. 25, no. 8, pp. 1902-1914, 2008.

[95] M. H. Abraham, A. Ibrahim, Y. Zhao and W. E. Acree, "A data base for partition of volatile organic compounds and drugs from blood/plasma/serum to brain, and an LFER analysis of the data," Journal of Pharmaceutical Sciences, vol. 95, no. 10, pp. 2091-2100, 2006.

[96] P. Garg and J. Verma, "In silico prediction of blood brain barrier permeability: an artificial neural network model," Journal of Chemical Information and Modeling, vol. 46, no. 1, pp. 289-297, 2006.

[97] A. Guerra, J. A. Paez and N. E. Campillo, "Artificial neural networks in ADMET modeling: prediction of blood-brain barrier permeation," QSAR & Combinatorial Science, vol. 27, no. 5, pp. 586-594, 2008.

[98] K. Rose, L. H. Hall and L. B. Kier, "Modeling blood-brain barrier partitioning using the electrotopological state," Journal of Chemical Information and Modeling, vol. 42, no. 3, pp. 651-666, 2002.

[99] J. Kelder, P. D. J. Grootenhuis, D. M. Bayada, L. P. C. Delbressine and J.-P. Ploemen, "Polar molecular surface as a dominating determinant for oral absorption and brain penetration of drugs," Pharmaceutical Research, vol. 16, no. 10, pp. 1514-1519, 1999.

[100] D. A. Konovalov, D. Coomans, E. Deconinck and Y. Vander Heyden, "Benchmarking of QSAR models for blood-brain barrier permeation," Journal of Chemical Information and Modeling, vol. 47, no. 4, pp. 1648-1656, 2007.

[101] M. Zerara, J. Brickmann, R. Kretschmer and T. E. Exner, "Parameterization of an empirical model for the prediction of n-octanol, alkane and /water as well as brain/blood partition coefficients," Journal of Computer-Aided Molecular Design, vol. 23, no. 2, pp. 105-111, 2009.

[102] "Anaconda Distribution, version 2019.03 for Python 3.7," Anaconda, Inc., 2019. https://www.anaconda.com/distribution/.

[103] "Python Language Reference, version 3.7," Python Software Foundation, 2019. https://www.python.org.

[104] "Spyder: The Scientific Python Development Environment, version 3.3.3," The Spyder Website Contributors, 2018. https://www.spyder-ide.org/.

[105] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay, "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.

[106] W. McKinney, "Data Structures for Statistical Computing in Python," in Proceedings of the 9th Python in Science Conference, 51-56, 2010.

[107] "Marvin Suite, version 19.20," ChemAxon Ltd., 2019, https://chemaxon.com.

[108] C. Yang, A. Tarkhov , J. Marusczyk, B. Bienfait, J. Gasteiger, T. Kleinoeder, T. Magdziarz, O. Sacher, C. H. Schwab, J. Schwoebel, L. Terfloth, K. Arvidson, A. Richard, A. Worth and J. Rathman, "New Publicly Available Chemical Query Language, CSRML, To Support Chemotype Representations for Application to Data Mining and Modeling," J. Chem. Inf. Model., vol. 55, pp. 510-528, 2015.

[109] S. Arora, W. Hu and P. Kothari, "An Analysis of the t-SNE Algorithm for Data Visualization," ArXiv, vol. abs/1803.01768, 2018.

[110] F. Marchesi, M. Turriziani, G. Tortorelli, G. Avvisati, F. Torino and L. Vecchis, "Triazene Compounds: Mechanism of Action and Related DNA Repair Systems," Pharmacological Research, vol. 56, no. 4, pp. 275-287, 2007.

[111] "ChemIDplus: A TOXNET Database," U.S. National Library of Medicine, National Institute of Health, U.S. Department of Health and Human Services, 2019. [Online]. Available: https://chem.nlm.nih.gov/chemidplus/rn/76-03-9. [Accessed 2nd November 2019].

[112] G. K. L. O. Pinheiro, I. Araujo Filho, I. Araujo Neto, A. C. M. Rego, E. P. Azevedo, F. I. Pinheiro and A. A. S. Lima Filho, "Nature as a source of drugs for ophthalmology," Arquivos Brasileiros de Oftalmologia, vol. 81, no. 5, pp. 443-454, 2018.

[113] A. M. Arens and T. Kearney, "Adverse Effects of Physostigmine," Journal of Medical Toxicology, vol. 15, no. 3, pp. 184-191, 2019.

[114] M. Honma, A. Kitazawa, A. Cayley, R. V. Williams, C. Barber, T. Hanser, R. Saiakhov, S. Chakravarti, G. J. Myatt, K. P. Cross, E. Benfenati, G. Raitano, O. Mekenyan, P. Petkov, C. Bossa, R. Benigni, C. L. Battistelli, A. Giuliani, O. Tcheremenskaia, C. DeMeo, U. Norinder, H. Koga, C. Jose, N. Jeliazkova, N. Kochev, V. Paskaleva, C. Yang, P. R. Daga, R. D. Clark and J. Rathman, "Improvement of quantitative structure–activity relationship (QSAR) tools for predicting Ames mutagenicity: outcomes of the Ames/QSAR International Challenge Project," Mutagenesis, vol. 34, no. 1, pp. 3-16, 2019.

[115] H. Li, C. W. Yap, C. Y. Ung, Y. Xue, Z. W. Cao and Y. Z. Chen, "Effect of selection of molecular descriptors on the prediction of blood-brain barrier penetrating and nonpenetrating agents by statistical learning methods," Journal of Chemical Information and Modeling, vol. 45, no. 5, pp. 1376-1384, 2005.

[116] M. van Smeden, K. G. M. Moons, J. A. H. de Groot, G. S. Collins, D. G. Altman, M. J. C. Eijkemans and J. B. Reitsma, "Sample size for binary logistic prediction models: Beyond events per variable criteria," Statistical Methods in Medical Research, vol. 28, no. 8, pp. 2455-2474, 2019.

[117] I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.

[118] M. A. Hall and L. A. Smith, "Feature Selection for Machine Learning: Comparing a Correlation-based Filter Approach to the Wrapper," in Twelfth International FLAIRS Conference, Orlando, FL, 1999.

[119] M. H. DeGroot and M. J. Schervish, Probability and Statistics, Boston, MA: Pearson Education, Inc., 2012.

[120] C. Lipinski, F. Lombardo, B. Dominy and P. Feeney, "Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings," Advanced Drug Delivery Reviews, vol. 23, pp. 3-25, 1997.

[121] J. B. Hendrickson, P. Huang and A. G. Toczko, "Molecular Complexity: A Simplified Formula Adapted to Individual Atoms," Journal of Chemical Information and Modeling, vol. 27, pp. 63-67, 1987.

[122] J. Gasteiger and C. Jochum, "Algorithm for the Perception of Synthetically Important Rings," Journal of Chemical Information and Modeling, vol. 19, pp. 43- 48, 1979.

[123] P. Labute, "A Widely Applicable Set of Descriptors," Journal of Molecular Graphics and Modelling, vol. 18, pp. 464-477, 2000.

[124] Y. Zhao, M. H. Abraham and A. Zissimos, "Determination of McGowan Volumes for Ions and Correlation with van der Waals Volumes," Journal of Chemical Information and Modeling, vol. 43, pp. 1848-1854, 2003.

[125] P. Ertl, B. Rohde and P. Selzer, "Fast Calculation of Molecular Polar Surface Area as a Sum of Fragment-Based Contributions and Its Application to the Prediction of Drug Transport Properties," Journal of Medicinal Chemistry, vol. 43, pp. 3714-3717, 2000.

[126] R. Wang, Y. Gao and L. Lai, " Calculating Partition Coefficient by Atom-Additive Method," Perspectives in Drug Discovery and Design, vol. 19, pp. 47-66, 2000.

[127] M. Petitjean, "Applications of the radius-diameter diagram to the classification of topological and geometrical shapes of chemical compounds," Journal of Chemical Information and Modeling, vol. 32, pp. 331-337, 1992.

[128] M. Volkenstein, Configurational Statistics of Polymeric Chains, New York: Wiley-Interscience, 1963.

[129] C. Tanford, Physical Chemistry of Macromolecules, New York: Wiley, 1961.

[130] H. Yang, L. Sun, W. Li, G. Liu and Y. Tang, "In Silico Prediction of Chemical Toxicity for Drug Design Using Machine Learning Methods and Structural Alerts," Frontiers in Chemistry, vol. 6, pp. 1-12, 2018.

[131] R. Wang, Y. Gao and L. Lai, "Calculating partition coefficient by atom-additive method," Perspect. Drug Discovery Des., vol. 19, pp. 47-66, 2000.

[132] K. P. C. Vollhardt and N. E. Schore, "Electrophilic Attack on Derivatives of Benzene," in Organic Chemistry, 5th ed., New York, New York: W. H. Freeman and Company, 2007, pp. 721-762.

[133] J. K. Seydel and M. Wiese, "Octanol-Water Partitioning versus Partitioning into Membranes," in Drug-Membrane Interactions: Analysis, Drug Distribution, Modeling, vol. 15, J. K. S. a. M. Wiese, Ed., Weinhiem, Germany, Wiley-VCH Verlag GmbH & Co. KGaA, 2002, pp. 35-50.

[134] J. Sangster, Octanol-Water Partition Coefficients: Fundamentals and Physical Chemistry, Chichester, England: John Wiley & Sons Ltd., 1997.

[135] F. Sahigara, D. Ballabio, R. Todeschini and V. Consonni, "Defining a novel k- nearest neighbours approach to assess the applicability domain of a QSAR model for reliable predictions," Journal of Cheminformatics, vol. 5, no. 1, pp. 1-9, 2013.

[136] K. Roy, S. Kar and R. N. Das, Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment, Cambridge, Massachusetts: Academic Press, 2015.

[137] M. Pohar, M. Blas and S. Turk, "Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study," Metodološki zvezki, vol. 1, no. 1, pp. 143-161, 2004.

[138] S. Nembri, F. Grisoni, V. Consonni and R. Todeschini, "In Silico Prediction of Cytochrome P450-Drug Interaction: QSARs for CYP3A4 and CYP2C9," Int. J. Mol. Sci, vol. 17, no. 914, pp. 1-19, 2016.

[139] Y. C. Martin, J. L. Kofron and L. M. Traphagen, "Do Structurally Similar Molecules Have Similar Biological Activity?," J. Med. Chem., vol. 45, no. 19, p. 4350–4358, 2002.

[140] D. Loughney, B. L. Claus and S. R. Johnson, "To measure is to know: an approach to CADD performance metrics," Drug Discov. Today, vol. 16, pp. 548-554, 2011.

[141] C. A. Lipinski, "Drug-like properties and the causes of poor solubility and poor permeability," J. Pharmacol. Toxicol., vol. 44, pp. 235-249, 2000.

[142] G. Landrum, "RDKit: Open-source cheminformatics (v2019.03)".

[143] J. Hughes, S. Rees, S. Kalindjian and K. Philpott, "Principles of early drug discovery," British Journal of Pharmacology, vol. 162, no. 6, pp. 1239-1249, 2010.

[144] E. A. Helgee, L. Carlsson, S. Boyer and U. Norinder, "Evaluation of Quantitative Structure - Activity Relationship Modeling Strategies: Local and Global Models," J. Chem. Inf. Model., vol. 50, no. 4, pp. 677-689, 2010.

[145] I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection," J. Mach. Learn. Res., vol. 3, pp. 1157-1182, 2003.

[146] P. Ertl, B. Rohde and P. Selzer, "Fast Calculation of Molecular Polar Surface Area as a Sum of Fragment-Based Contributions and Its Application to the Prediction of Drug Transport Properties," J. Med. Chem., vol. 43, pp. 3714-3717, 2000.

[147] K. E. Mortelmans and E. Zeiger, "The Ames Salmonella/Microsome Mutagenicity Assay," Mutation Research, vol. 455, no. 1-2, pp. 29-60, 2000.

[148] Danishuddin and A. U. Khan, "Descriptors and their selection methods in QSAR analysis: paradigm for drug design," Drug Discov. Today, vol. 21, no. 8, pp. 1291- 1302, 2016.

[149] K. S. Beyer, J. Goldstein, R. Ramakrishnan and U. Shaft, "When Is ''Nearest Neighbor'' Meaningful?," in ICDT '99 Proceedings of the 7th International Conference on Database Theory, London, UK, 1999.

[150] I. I. Baskin and A. Varnek, "Fragment Descriptors in SAR/QSAR/QSPR Studies, Molecular Similarity Analysis and in Virtual Screening," in Chemoinformatics Approaches to Virtual Screening, Cambridge, UK, The Royal Society of Chemistry, 2008, pp. 1-43.

[151] A. Avdeef, Absorption and Drug Development: Solubility, Permeability, and Charge State, Hoboken, New Jersey: John Wiley & Sons, Inc., 2012.

[152] "Fingerprints - Screening and Similarity," Daylight Chemical Information Systems, Inc., 2008. [Online]. Available: http://www.daylight.com/dayhtml/doc/theory/theory.finger.html.

[153] D. Veber, S. Johnson, H.-Y. Cheng, B. Smith, K. Ward and K. Kopple, "Molecular Properties That Influence the Oral Bioavailability of Drug Candidates," Journal of Medicinal Chemistry, vol. 45, no. 12, pp. 2615-2623, 2002.

[154] C. Yang, A. Tarkhov, J. Marusczyk, B. Bienfait, J. Gasteiger, T. Kleinöder, T. Magdziarz, O. Sacher, C. Schwab, J. Schwöbel, L. Terfloth, K. Arvidson, A. Richard, A. Worth and J. Rathman, "New Publicly Available Chemical Query Language, CSRML, To Support Chemotype Representations for Application to Data Mining and Modeling," Journal of Chemical Information and Modeling, vol. 55, no. 3, pp. 510-528, 2015.

[155] J. Kazius, R. McGuire and R. Bursi, "Derivation and Validation of Toxicophores for Mutagenicity Prediction," Journal of Medicinal Chemistry, vol. 48, no. 1, pp. 312-320, 2005.

[156] P. N. Judson, P. A. Cooke, D. N. G., G. N., R. P. Hanzlik, C. Hardy, A. Hartmann, D. Hinchliffe, J. Holder, L. Muller, T. Steger-Hartmann, A. Rothfuss, M. Smith, K. Thomas, J. D. Vessey and E. Zeiger, "Towards the Creation of an International Toxicology Information Centre," Toxicology, vol. 213, no. 1-2, pp. 117-128, 2005.

[157] M. Goodarzi, B. Dejaegher and Y. V. Heyden, "Feature Selection Methods in QSAR Studies," J. AOAC Int., vol. 95, no. 3, pp. 636-651, 2012.

Appendix A: Sample Input and Output Files

This appendix contains sample input files necessary to use the code included in Appendix C. Also included are sample output files generated as a result of executing the aforementioned code, namely, model predictions and performance summaries. These files are of the comma-separated values (.csv) format, unless otherwise stated, and are represented in tabular form herein. In some instances, the contents of a file do not fit the page, and the table has been reshaped accordingly.

A.1 Parameter file (input)

This file, shown in Table 25, contains the input parameters used to control the various options available to the user when implementing the proposed methodology. In the source file it is a 3 x 36 table: the first row (header) lists parameter names, the second row the global phase parameter options, and the third row the local phase parameter options. It is presented here in transposed form to fit the appendix. In instances where it is appropriate, multiple parameter values may be specified to perform a grid search.

Table 25: Sample parameter file (input).

Parameter                                      global      local
Preprocess.option                              TRUE        TRUE
Preprocess.missing_data_method                 remove      remove
Preprocess.missing_data_indicator              np.nan      np.nan
Preprocess.variance_threshold_value            0           0
Preprocess.infrequent_threshold_value          3           3
Preprocess.transform_descriptors               None        None
Preprocess.transform_function                  None        None
Preprocess.transform_inverse                   None        None
Preprocess.standardize                         TRUE        TRUE
VariableSelection.option                       FALSE       FALSE
VariableSelection.significance_level           0.05        0.05
VariableSelection.method                       univariate  univariate
VariableSelection.univariate_binary_method     fisher      fisher
DimensionalReduction.option                    TRUE        TRUE
DimensionalReduction.method                    pls         pls
DimensionalReduction.n_reduced_dimensions      8           8
DimensionalReduction.pca_method                component   component
DimensionalReduction.ratio_explained_variance  0.99        0.99
DimensionalReduction.perplexity                30          0
DimensionalReduction.k_neighbors               15          15
DimensionalReduction.min_dist                  0.1         0.1
DimensionalReduction.metric                    euclidean   euclidean
DomainAssessment.option                        TRUE        TRUE
DomainAssessment.domain_k                      3           3
DomainAssessment.domain_quantile               0.95        0.95
LocalityAssessment.option                      TRUE        TRUE
LocalityAssessment.method                      radius      radius
LocalityAssessment.radius                      2.1         0
LocalityAssessment.locality_k                  3           3
LocalityAssessment.locality_quantile           0.95        0.95
Train.option                                   TRUE        TRUE
Train.technique                                logistic    logistic
Train.n_neighbors                              5           5
Train.weights                                  uniform     uniform
Train.n_components                             1           1
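For orientation, the following is a minimal sketch of how a parameter file of this layout might be read with pandas; the file name (params.csv) is hypothetical, and the first column is assumed to hold the phase label.

import pandas as pd

# Read the parameter file; the first (index) column is assumed to hold
# the phase label ('global' or 'local'); the file name is hypothetical
params = pd.read_csv('params.csv', index_col = 0)

# Extract the options of each modeling phase as dictionaries, e.g.
# global_params['DimensionalReduction.method'] -> 'pls'
global_params = params.loc['global'].to_dict()
local_params = params.loc['local'].to_dict()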

A.2 Data file (input)

This file, a portion of which is shown in Table 26, contains the raw training set data in the form of an N x (P+1) table, where N is the number of training samples and P is the number of descriptors. An extra column, labeled “Activity”, holds the response for each training observation. The index of compound identifiers is labeled “CAS_NO”.

Table 26: Sample data file (input).

CAS_NO Activity Atoms Bonds BondsRot HAcc …

2475-33-4 0 68 78 0 8 …

820-75-7 1 18 17 3 7 …

2435-76-9 1 12 12 0 6 …

817-99-2 1 16 15 3 6 …

116539-70-9 1 35 35 7 7 …

… … … … … … …
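As a minimal illustration of this layout (the file name data.csv is hypothetical), the table can be split into a descriptor matrix and a response vector in the same manner as the from_file constructor of Appendix C:

import pandas as pd

# Load the raw training data and separate the response column from the
# descriptor matrix; the file name is hypothetical
df = pd.read_csv('data.csv').set_index('CAS_NO')
y = df['Activity']                 # response vector
X = df.drop('Activity', axis = 1)  # N x P descriptor matrix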

A.3 Data types file (input)

This file, shown in Table 28, contains a list of all descriptors included in the data input file, together with an integer specifying each descriptor’s type. The numeric code for descriptor types is shown in Table 27. Note that Table 27 does not represent an input file.

Table 27: Numeric code for descriptor types.

Descriptor Type Value

Response 1

Continuous-valued (interval) 2

Count-based (integer) 3

Ordinal 4

Nominal 5

Binary 6

Returning to the data types file, the name of the column of descriptors is “Variable” and the column of types “Type”.

Table 28: Sample data types file (input).

Variable Type

Atoms 3

Bonds 3

BondsRot 3

Weight 2

Complex 2

ComplexRing 2

ASA 2

atom:element_main_group 6

atom:element_metal_group_I_II 6

atom:element_metal_group_III 6

atom:element_metal_metalloid 6

atom:element_metal_poor_metal 6

… …
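A brief sketch of how the numeric codes of Table 27 might be used to subset descriptors by type (the file name types.csv is hypothetical):

import pandas as pd

# Load the data types file and select, for example, the binary (type 6)
# and count-based (type 3) descriptors; the file name is hypothetical
var_types = pd.read_csv('types.csv').set_index('Variable')
binary_names = var_types.index[var_types['Type'] == 6].tolist()
count_names = var_types.index[var_types['Type'] == 3].tolist()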

A.4 Partition set file (input)

This file, shown in Table 29, indicates the partition in which each training observation may be found under an n-fold cross-validation scheme, where n is the number of folds. The column of training observations is named “CAS_NO” and the column of folds “Set”. An observation in fold “0” is part of a static training set which does not change and never serves in a test set during cross-validation.

Table 29: Sample partition set file (input).

CAS_NO Set

2475-33-4 3

820-75-7 3

2435-76-9 1

817-99-2 2

116539-70-9 4

115-02-6 5

122341-55-3 4

105149-00-6 4

108-78-1 0

2425-85-6 3

67019-24-3 2

… …
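The following sketch shows one way such a file could be generated for a list of compound identifiers, mirroring the partition method of Appendix C; the identifiers and file name are illustrative only.

import numpy as np
import pandas as pd

# Randomly assign each compound to one of five folds (1-5); compounds
# intended for the static training set would instead receive fold 0
ids = pd.Index(['2475-33-4', '820-75-7', '2435-76-9'], name = 'CAS_NO')
folds = np.random.choice(np.arange(1, 6), size = ids.size)
pd.DataFrame({'Set': folds}, index = ids).to_csv('partition.csv')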

A.5 Set labels file (input)

This file, shown in Table 30, indicates the set in which each training observation may be found under an n-fold cross-validation scheme, where n is the number of folds. It is a copy of the partition set file, but with the folds given in string form. An observation labeled “TRAIN” is part of the static training set. The column of training observations is named “CAS_NO” and the column of folds “Set”.

Table 30: Sample labels file (input).

CAS_NO Set

2475-33-4 CV3

820-75-7 CV3

2435-76-9 CV1

817-99-2 CV2

116539-70-9 CV4

115-02-6 CV5

122341-55-3 CV4

105149-00-6 CV4

108-78-1 TRAIN

2425-85-6 CV3

67019-24-3 CV2

… …
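Because this file is simply the partition set file recast as strings, it can be derived from that file directly; a minimal sketch (file names hypothetical):

import pandas as pd

# Map fold 0 to 'TRAIN' and fold k to 'CVk'
partition = pd.read_csv('partition.csv', index_col = 'CAS_NO')
labels = partition['Set'].apply(lambda k: 'TRAIN' if k == 0 else 'CV' + str(k))
labels.to_frame().to_csv('labels.csv')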

A.6 Data structures file (input)

This is a structure-data file (.sdf) containing the structural information for all molecules referenced in the data file. These files are similar to the MDL Molfiles described in Chapter 1 (particularly Figure 1) except that they can contain multiple compounds and associated data properties. The identifier of each compound should be annotated within its record.
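A minimal sketch of reading such a file with RDKit, as the from_file constructor of Appendix C does (the file name structures.sdf is hypothetical):

from rdkit import Chem

# Iterate over the records of the SD file; records RDKit cannot parse
# are returned as None
for mol in Chem.SDMolSupplier('structures.sdf'):
    if mol is None:
        continue
    # Prefer the annotated identifier, falling back on the title line
    cas = mol.GetProp('CAS_NO') if mol.HasProp('CAS_NO') \
        else mol.GetProp('_Name')
    print(cas, mol.GetNumAtoms())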

A.7 Model performance summary file (output)

This file, shown in Table 31, lists summary performance statistics for the accompanying global, local, and global-on-local predictions made by the workflow, together with the input parameters used to generate the predictions. If multiple values are specified for one or more parameters, the file contains one entry per grid point, the number of grid points being given by the product of the number of values specified for each parameter. Note that summary results on an external test set are equivalent to a single grid point.
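The grid size follows the product rule; for example, with sklearn's ParameterGrid (imported by the code of Appendix C), three locality radii crossed with two training techniques give 3 x 2 = 6 grid points (the parameter values here are illustrative):

from sklearn.model_selection import ParameterGrid

# Three radii crossed with two training techniques yield six
# parameter combinations (grid points)
grid = ParameterGrid({'LocalityAssessment.radius': [1.5, 2.1, 2.7],
                      'Train.technique': ['logistic', 'knn']})
print(len(grid))  # -> 6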

Table 31: Sample model performance summary file (output).

                                               grid point 1  grid point 1  grid point 1
Parameter / statistic                          global        local         global-on-local
Preprocess.option                              TRUE          TRUE          TRUE
Preprocess.missing_data_method                 remove        remove        remove
Preprocess.missing_data_indicator              -             -             -
Preprocess.variance_threshold_value            FALSE         FALSE         FALSE
Preprocess.infrequent_threshold_value          3             3             3
Preprocess.transform_descriptors               -             -             -
Preprocess.transform_function                  -             -             -
Preprocess.transform_inverse                   -             -             -
Preprocess.standardize                         TRUE          TRUE          TRUE
VariableSelection.option                       FALSE         FALSE         FALSE
VariableSelection.significance_level           0.05          0.05          0.05
VariableSelection.method                       univariate    univariate    univariate
VariableSelection.univariate_binary_method     fisher        fisher        fisher
DimensionalReduction.option                    TRUE          TRUE          TRUE
DimensionalReduction.method                    pls           pls           pls
DimensionalReduction.n_reduced_dimensions      8             8             8
DimensionalReduction.pca_method                component     component     component
DimensionalReduction.ratio_explained_variance  0.99          0.99          0.99
DimensionalReduction.perplexity                30            FALSE         FALSE
DimensionalReduction.k_neighbors               15            15            15
DimensionalReduction.min_dist                  0.1           0.1           0.1
DimensionalReduction.metric                    euclidean     euclidean     euclidean
DomainAssessment.option                        TRUE          TRUE          TRUE
DomainAssessment.domain_k                      3             3             3
DomainAssessment.domain_quantile               0.95          0.95          0.95
LocalityAssessment.option                      TRUE          TRUE          TRUE
LocalityAssessment.method                      radius        radius        radius
LocalityAssessment.radius                      2.1           FALSE         FALSE
LocalityAssessment.locality_k                  3             3             3
LocalityAssessment.locality_quantile           0.95          0.95          0.95
Train.option                                   TRUE          TRUE          TRUE
Train.technique                                logistic      logistic      logistic
Train.n_neighbors                              5             5             5
Train.weights                                  uniform       uniform       uniform
Train.n_components                             TRUE          TRUE          TRUE
nobs                                           6512          6512          6512
npred                                          4617          4498          4498
fpred                                          0.70899877    0.69072482    0.69072482
prev                                           0.5947585     0.59737661    0.59737661
tps                                            2152          2178          2114
fns                                            594           509           573
fps                                            646           547           625
tns                                            1225          1264          1186
sens                                           0.78368536    0.81056941    0.78675102
spec                                           0.65473009    0.69795693    0.6548868
accu                                           0.73142733    0.76522899    0.7336594
ppv                                            0.7691208     0.79926606    0.77181453
npv                                            0.67344695    0.71291596    0.67424673
auc                                            0.79092995    0.80467663    0.72081891

A.8 Model predictions file (output)

This file contains a record of the binary predictions made on each test observation by the global and local models for each grid point explored. Prediction results on an external test set are equivalent to a single grid point. Global-on-local predictions are not shown, since these are simply the global prediction values on the compounds for which a local prediction was provided.
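For clarity, the following toy sketch shows how the global-on-local predictions follow from the global and local prediction series (the identifiers and values are illustrative):

import numpy as np
import pandas as pd

# Retain the global prediction only where a local prediction exists
global_preds = pd.Series([1, 0, 1, 1], index = ['a', 'b', 'c', 'd'])
local_preds = pd.Series([1, np.nan, 0, np.nan], index = ['a', 'b', 'c', 'd'])
global_on_local = global_preds.where(local_preds.notnull())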

Table 32: Sample model predictions file (output).

                     2475-33-4  820-75-7  2435-76-9  817-99-2  …

grid point 1 global  -          1         0          1         …

grid point 1 local   -          1         0          1         …

Appendix B: Chemical Descriptors

This appendix lists the physicochemical descriptors used to represent the molecules and derive the QSAR models within this work. Information on the 729 ToxPrint chemotypes can be found at https://toxprint.org/.

Table 33: Descriptors calculated by CORINA Symphony to describe the Ames mutagenicity and blood-brain barrier molecular data sets.

Adapted from [28].

Descriptor | Key | Description | Unit
Number of atoms | Atoms | Total number of atoms in the molecule (including hydrogen) | -
Number of bonds | Bonds | Total number of bonds in the molecule | -
Number of rotatable bonds | BondsRot | Number of open-chain, single rotatable bonds | -
Number of hydrogen bonding acceptors | HAcc | Total number of hydrogen bonding acceptors derived from the sum of nitrogen and oxygen atoms in the molecule | -
Number of oxygen atom-based hydrogen bonding acceptors | HAccO | Number of hydrogen bonding acceptors derived from the sum of oxygen atoms in the molecule | -
Number of nitrogen atom-based hydrogen bonding acceptors | HAccN | Number of hydrogen bonding acceptors derived from the sum of nitrogen atoms in the molecule | -
Number of hydrogen bonding donors | HDon | Total number of hydrogen bonding donors derived from the sum of nitrogen and oxygen atoms in the molecule | -
Number of oxygen atom-based hydrogen bonding donors | HDonO | Number of hydrogen bonding donors derived from the sum of oxygen atoms in the molecule | -
Number of nitrogen atom-based hydrogen bonding donors | HDonN | Number of hydrogen bonding donors derived from the sum of nitrogen atoms in the molecule | -
Number of Rule-of-Five violations | Ro5Viol | Number of violations of Lipinski's rule of 5 (Weight > 500, XlogP > 5, HDon > 5, HAcc > 10) | -
Number of extended Rule-of-Five violations | Ro5ViolExt | Number of violations of the extended Lipinski's rule of 5 (additional rule: number of rotatable bonds > 10) [120] | -
Total number of tetrahedral stereocenters | Stereo | Total number of tetrahedral chiral centers in the molecule [120] | -
Molecular weight | Weight | Molecular weight derived from the gross formula | Da
Molecular complexity | Complex | Molecular complexity according to the approach by Hendrickson [121] | -
Ring complexity | ComplexRing | Ring complexity according to the approach by Gasteiger and Jochum [122] | -
Approximate surface area | ASA | Approximate surface area of the molecule [123] | Å2
McGowan molecular volume | McGowan | McGowan molecular volume approximated by fragment contributions [124] | mL/mol
Topological polar surface area | TPSA | Topological polar surface area of the molecule derived from polar 2D fragments [125] | Å2
Molecular dipole moment | Dipole | Dipole moment of the molecule | D
Mean molecular polarizability | Polariz | Mean molecular polarizability of the molecule | Å3
Aqueous solubility (logS) | LogS | Solubility of the molecule in water in log units | log units
Octanol/water partition coefficient (logP) | XlogP | Octanol/water partition coefficient of the molecule in log units following the XlogP approach [126] | log units
Molecular asphericity | Aspheric | Molecular asphericity [26] | -
Molecular eccentricity | Eccentric | Molecular eccentricity [26] | -
Molecular diameter | Diameter | Maximum distance between two atoms in the molecule [127] | Å
Principal moment of inertia, 1st principal axis | InertiaX | Principal component of the inertia tensor in x-direction [26] | Da·Å2
Principal moment of inertia, 2nd principal axis | InertiaY | Principal component of the inertia tensor in y-direction [26] | Da·Å2
Principal moment of inertia, 3rd principal axis | InertiaZ | Principal component of the inertia tensor in z-direction [26] | Da·Å2
Radius of gyration | Rgyr | Molecular radius of gyration [128], [129] | Å
Molecular span | Span | Radius of the smallest sphere centered at the center of mass which completely encloses all atoms in the molecule [128] | Å
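As an illustration of the Ro5Viol descriptor defined above, the violation count can be approximated with RDKit; note that RDKit's Crippen logP is only a stand-in for the XlogP value computed by CORINA Symphony, so this sketch is illustrative rather than a reproduction of the descriptor.

from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski

# Count Lipinski rule-of-five violations
# (Weight > 500, logP > 5, HDon > 5, HAcc > 10)
def ro5_violations(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return sum([Descriptors.MolWt(mol) > 500,
                Crippen.MolLogP(mol) > 5,
                Lipinski.NumHDonors(mol) > 5,
                Lipinski.NumHAcceptors(mol) > 10])

print(ro5_violations('CCO'))  # ethanol -> 0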

Appendix C: Python Code

This appendix contains the Python code used to execute the proposed methodology, the results of which are presented in this document. At present, portions of the script are activated or deactivated by commenting, depending on the specific application of the user.

Sample input and output files are included in Appendix A.

""" Author: Bryan C. Hobocienski Purpose: Code required to execute the local modeling workflow Date: 11/8/2018 Revised: 11/20/2019 """

# Import necessary libraries
import os
import copy
import pickle
import numpy as np
import pandas as pd
from warnings import warn
import scipy.stats as stats
from collections import OrderedDict

# Sklearn specific libraries
from sklearn.manifold import TSNE
from sklearn.metrics import r2_score
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# UMAP (umap-learn); required by MoleculeSet.dimensional_reduction when
# method = 'umap' but missing from the original import list
import umap

# Libraries related to RDKit
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import AllChem

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

# Create a molecule class which stores information on molecules
class Molecule:

# General names for the identifier and response identifier_name = 'CAS_NO' response_name = 'Activity' partition_name = 'Set'

# The default number of Molecule class objects n_molecules = 0

# The initialization method for the Molecule class def __init__(self, identifier = None, descriptors = None, response = None, set_label = None, pickled_rdk_molecule = None):

# Attributes common to all molecules self.identifier = identifier self.descriptors = descriptors self.response = response # Might not know response self.pickled_rdk_molecule = pickled_rdk_molecule self.set_label = set_label # Training, validation, test etc.

# Set attributes for the molecule self.partition_label = None # Which fold/split the molecule participates

# Eventual global and local predictions self.global_prediction = None self.local_prediction = None

# The MoleculeSet object corresponding to the local training set self.local_molecule_set = None # (MoleculeSet object)

# Increment the number of Molecule objects in existence Molecule.n_molecules += 1

# Define an instance method to visualize Molecule class objects def show_molecule(self, write_path = None, show = True):

# If a pickled rdk molecule is present for the Molecule object if self.pickled_rdk_molecule:

# Unpickle the rdk molecule mol = pickle.loads(self.pickled_rdk_molecule) # Compute the 2D coordinates of the molecule tmp = AllChem.Compute2DCoords(mol)

if write_path:

# Draw the molecule and write to the specified path Draw.MolToFile(mol, write_path)

# If show is true, produce a Matplotlib figure of the molecule if show == True:

# Produce a Matplotlib figure of the molecule molfig = Draw.MolToMPL(mol) molfig.suptitle(self.identifier)

# Define a class composed of a single set of one or multiple molecules class MoleculeSet:

# General names for the identifier, response, partition, and variable types identifier_name = 'CAS_NO' response_name = 'Activity' partition_name = 'Set' variable_name = 'Variable'

# A counter for the number of molecule set instances in existence n_molecule_sets = 0

# Define the molecule set initialization method def __init__(self, raw_descriptors = None, raw_response = None, set_label = None, var_types = None, pickled_rdk_molecules = None):

# The raw descriptors and responses, variable types, set label, # and RDK molecule objects are the starting information of a molecule # set object self.raw_descriptors = raw_descriptors self.raw_response = raw_response self.var_types = var_types self.set_label = set_label self.pickled_rdk_molecules = pickled_rdk_molecules

# The processed descriptors and responses change as the data is # manipulated by the workflow but the raw data is retained try: self.processed_descriptors = copy.deepcopy(raw_descriptors) except: pass try: self.processed_response = copy.deepcopy(raw_response) except: pass # Partition labels are used for cross-validation self.partition_key = None # self.partition_label = None #

# Prediction and performance results from modeling if not isinstance(self.raw_response, np.int64):

# self.global_predictions = pd.Series(np.nan, self.raw_response.index) self.global_prediction_probabilities = pd.Series(np.nan, self.raw_response.index)

self.local_predictions = pd.Series(np.nan, self.raw_response.index) self.local_prediction_probabilities = pd.Series(np.nan, self.raw_response.index) else:

# self.global_predictions = None self.global_prediction_probabilities = None

# self.local_predictions = None self.local_prediction_probabilities = None

# self.global_on_local_predictions = None

# self.global_performance_summary = None self.local_performance_summary = None self.global_on_local_performance_summary = None

# Attributes related to preprocessing self.observations_missing_values = None self.zero_variance_descriptors = None self.infrequent_binary_descriptors = None

# Attributes related to variable selection self.descriptor_statistics = None self.significant_descriptors = None self.binary_selection_statistics = None self.numerical_selection_statistics = None

# Sklearn objects related to data manipulation self.imputation_object = None self.variance_threshold_object = None self.transformation_object = None self.standardization_object = None self.dimensional_reduction_object = None self.modeling_technique_object = None

# Variance explained by dimensional reduction # self.dimensional_reduction_explained_variance # Increment the number of molecule sets in existence by one MoleculeSet.n_molecule_sets += 1

# Define a molecule set constructor method @classmethod def from_file(cls, csv_path = None, sdf_path = None, var_type_path = None, set_label_path = None, sdf_id = None):

# Initialize lists for holding read objects csv_ids, sdf_ids, pkl_rdk_mols = [], [], {}

# If the the path to the variable types file has been passed if var_type_path:

# Read and set the index of the variable types var_types = pd.read_csv(var_type_path) var_types = var_types.set_index(cls.variable_name)

# If the path to the data as a .csv file has been passed if csv_path:

# Try to read and parse the .csv file try:

# Read and parse the .csv file df = pd.read_csv(csv_path) df = df.set_index(cls.identifier_name) response = df[cls.response_name] descriptors = df.drop(cls.response_name, axis = 1) csv_ids = df.index.tolist()

# If the .csv file could not be read except:

# Not being able to read the .csv data is a fatal error raise Exception('The .csv file could not be read.')

# If the path to the SD file has been passed if sdf_path:

# Try to read and parse the SD file try:

# Initialize an RDK sd file reader object (generator) suppl = Chem.SDMolSupplier(sdf_path)

# Iterate over the molecules in the sd file # (there should be only one molecule) for mol in suppl:

# if sdf_id:

# if mol.HasProp(sdf_id): identifier = mol.GetProp(sdf_id)

# else:

# Attempt to read the identifier name for the molecule if mol.HasProp(cls.identifier_name): identifier = mol.GetProp(cls.identifier_name) elif mol.HasProp('_Name'): identifier = mol.GetProp('_Name') elif mol.HasProp('Name'): identifier = mol.GetProp('Name')

# Pickle the molecules before storing them pkl = pickle.dumps(mol) pkl_rdk_mols[identifier] = pkl sdf_ids.append(identifier)

# If the SD file could not be read except:

# Not reading the SD file is not a fatal error warn('The SD file could not be read.')

# if set_label_path: try: set_label = pd.read_csv(set_label_path) set_label = set_label.set_index(cls.identifier_name) except: warn('The set label file could not be read') else: set_label = None

# Check the identifiers of the molecules within the .csv and .sd files if csv_ids and sdf_ids:

# If there is a discrepancy between the two data files if not set(csv_ids) == set(sdf_ids):

# Warn the user if the data files do not agree warn('The identifiers between the .csv and .sdf sets do not match.')

# remove_ids = [str(i) for i in list(set(csv_ids) \ .symmetric_difference(set(sdf_ids)))]

# if isinstance(descriptors.index[0], np.integer):

# descriptors = descriptors.reindex([str(i) \ for i in descriptors.index.tolist()]) response = response.reindex([str(i) \ for i in response.index.tolist()])

# data_remove_ids = list(set(remove_ids).intersection(set(descriptors.index.tolist())))

# descriptors = descriptors.drop(data_remove_ids, axis = 0) response = response.drop(data_remove_ids, axis = 0)

# try:

# mols = []

# suppl = Chem.SDMolSupplier(sdf_path)

# for mol in suppl:

# identifier = None

# if sdf_id:

# if mol.HasProp(sdf_id): identifier = mol.GetProp(sdf_id)

# else:

# if mol.HasProp(cls.identifier_name): identifier = mol.GetProp(cls.identifier_name) elif mol.HasProp('_Name'): identifier = mol.GetProp('_Name') elif mol.HasProp('Name'): identifier = mol.GetProp('Name')

# if identifier not in remove_ids:

# mols.append(mol) # w = Chem.SDWriter(sdf_path) for m in mols: w.write(m)

# except:

# warn('The SD file could not be read to remove mismatches.')

# for key in remove_ids: if key in pkl_rdk_mols: del pkl_rdk_mols[key]

# Return a molecule set object with from the read data return cls(raw_descriptors = descriptors, raw_response = response, set_label = set_label, pickled_rdk_molecules = pkl_rdk_mols, var_types = var_types)

# Define a method to create Molecule objects of the identifiers specified def to_molecules(self, identifiers = None):

# Initialize a dictionary to hold molecule objects molecule_objects = {}

# Only specified molecules by identifier will become molecule objects if identifiers:

# Check if the specified identifiers are within the molecule set identifiers = [ide for ide in identifiers \ if ide in self.pickled_rdk_molecules.keys()]

# if not identifiers:

# identifiers = [ide for ide in self.pickled_rdk_molecules.keys()]

# Iterate over the identifiers present for identifier in identifiers:

# Extract the information relevant to the molecule mol_descriptors = self.raw_descriptors.loc[identifier] mol_response = self.raw_response.loc[identifier] mol_pkl_rdk = self.pickled_rdk_molecules[identifier] mol_label = self.set_label.loc[identifier].tolist()[0]

# Instantiate a molecule object with the information provided molecule = Molecule(identifier = identifier, descriptors = mol_descriptors, response = mol_response, pickled_rdk_molecule = mol_pkl_rdk, set_label = mol_label)

# Add the molecule to the dictionary of molecule objects molecule_objects[identifier] = molecule

# Return the molecule objects return molecule_objects

# Define a function to generate 2D images of molecules within the data set # (identifiers allows the user to specify a subset of molecules to show) def show_structures(self, write_path = None, identifiers = None):

# If no path is given to write the image file if not write_path:

# Notify the user that a write path must be specified raise \ Exception('The complete path with name and file extension ' \ 'of the image file must be specified.')

# If there are pickled RDK molecules for the data set if self.pickled_rdk_molecules:

# If identifiers of a subset of molecules were passed if identifiers:

# Only depict molecules actually present within the data set identifiers \ = [ide for ide in identifiers \ if ide in self.pickled_rdk_molecules.keys()]

# Unpickle the RDK molecules mols = [pickle.loads(self.pickled_rdk_molecules[ide]) \ for ide in identifiers]

# Compute coordinates for the molecules before drawing for m in mols: tmp = AllChem.Compute2DCoords(m)

row_length = 10 if identifiers: if len(identifiers) < 10: row_length = len(identifiers)

# Depict the molecules in a grid with their identifier's # labeled imgs = Draw.MolsToGridImage(mols, molsPerRow = row_length, subImgSize = (500,500), legends = identifiers)

# Save the molecule depictions to an image file at the path # provided imgs.save(write_path)

# else:

# undepictable_mols = []

# if not os.path.isdir(write_path): os.mkdir(write_path)

# Unpickle the RDK molecules mols = [pickle.loads(self.pickled_rdk_molecules[ide]) \ for ide in self.pickled_rdk_molecules] identifiers = list(self.pickled_rdk_molecules.keys())

# Compute coordinates for the molecules before drawing for m in range(len(mols)):

# try: tmp = AllChem.Compute2DCoords(mols[m]) Draw.MolToFile(mols[m], write_path + identifiers[m] + '.png')

# except: undepictable_mols.append(identifiers[m])

if undepictable_mols: undepictable_mols = pd.Series(undepictable_mols) undepictable_mols.to_csv(write_path)

# Define a method to partition the observations of the molecule set # into separate groups def partition(self, method = 'kfold', n_folds = 5, train_size = 0.8, file_path = None):

# Accepted data partition methods: 'kfold', 'split', 'file'

# if method == 'kfold':

# Initialize a list to record which fold each training observation # serves as a test set member partition_ids = pd.DataFrame( np.random.choice(a = np.arange(1, n_folds + 1), size = self.raw_descriptors.shape[0]), index = self.raw_descriptors.index)

partition_ids.columns = [self.partition_name]

# If a train/test split is to be performed elif method == 'split':

# Initialize a list to record which fold each training observation # serves as a test set member (Pandas Series) partition_ids = pd.DataFrame( np.random.choice(a = [1, 2], size = self.raw_descriptors.shape[0], p = [train_size, 1-train_size]), index = self.raw_descriptors.index)

partition_ids.columns = [self.partition_name]

# if method == 'file':

# if not file_path:

# raise Exception('The path to the file containing the ' \ 'partition key was not specified')

# else:

# partition_ids = pd.read_csv(file_path, index_col \ = MoleculeSet.identifier_name)

# partition_ids.name = 'partition_set' self.partition_key = partition_ids

# def to_moleculesets(self, training_partition = False):

# moleculesets = []

# if self.partition_key is None:

warn('The function needs a partition key from which to construct' \ ' MoleculeSets objects.') return

# if isinstance(self.partition_key, pd.DataFrame):

# self.partition_key = self.partition_key[self.partition_name]

# max_part = self.partition_key.max()

# for j in range(1, max_part + 1):

# If there is a static training set (e.g. Hansen et al.) if training_partition == True:

# The static training set will always be denoted by 0 static_training_indices = self.partition_key \ .index[(self.partition_key == 0).tolist()].tolist()

# training_indices = self.partition_key \ .index[((self.partition_key != j) \ & (self.partition_key != 0)).tolist()].tolist()

# training_indices = static_training_indices + training_indices

# test_indices = self.partition_key \ .index[(self.partition_key == j).tolist()].tolist()

# Typical n-fold cross-validation else:

# training_indices = self.partition_key \ .index[(self.partition_key != j).tolist()].tolist()

# test_indices = self.partition_key \ .index[(self.partition_key == j).tolist()].tolist()

# test_descriptors = self.processed_descriptors.loc[test_indices] test_response = self.processed_response.loc[test_indices] test_set_label = self.set_label.loc[test_indices]

test_pickled_rdk_molecules \ = {teidx: self.pickled_rdk_molecules[teidx] for teidx in test_indices}

# training_descriptors = self.processed_descriptors.loc[training_indices] training_response = self.processed_response.loc[training_indices] training_set_label = self.set_label.loc[training_indices] training_pickled_rdk_molecules \ = {tridx: self.pickled_rdk_molecules[tridx] \ for tridx in training_indices}

# test_moleculeset \ = MoleculeSet(raw_descriptors = copy.deepcopy(test_descriptors), raw_response = copy.deepcopy(test_response), set_label = copy.deepcopy(test_set_label), var_types = copy.deepcopy(self.var_types), pickled_rdk_molecules \ = copy.deepcopy(test_pickled_rdk_molecules))

# training_moleculeset \ = MoleculeSet(raw_descriptors = copy.deepcopy(training_descriptors), raw_response = copy.deepcopy(training_response), set_label = copy.deepcopy(training_set_label), var_types = copy.deepcopy(self.var_types), pickled_rdk_molecules \ = copy.deepcopy(training_pickled_rdk_molecules))

# moleculesets_object = MoleculeSets(name = 'hansen_set_' + str(j+1), training_set = training_moleculeset, test_set = test_moleculeset)

# moleculesets.append(moleculesets_object)

# return moleculesets

# Define a method to conduct data set pre-processing def preprocess(self, missing_data_method = 'remove', missing_data_indicator = np.nan, variance_threshold_value = 0.0, infrequent_threshold_value = 3, transform_descriptors = None, transform_function = None, transform_inverse = None, standardize = True):

# Treat observations with missing data self.missing_data_imputation(method = missing_data_method, indicator_type = missing_data_indicator)

# Treat descriptors with little to no variance self.variance_threshold_removal(threshold = variance_threshold_value)

# Treat binary descriptors which occur infrequently within the data set self.infrequent_descriptor_removal(threshold \ = infrequent_threshold_value)

# Descriptor transformation (e.g. taking the square root of a count # descriptor) self.descriptor_transformation(transform_descriptors \ = transform_descriptors, transform_function = transform_function, transform_inverse = transform_inverse)

# Standardize the data set self.standardization(standardize = standardize)

# Define a method to perform variable selection on the data set def variable_selection(self, significance_level = 0.05, method = 'univariate', univariate_binary_method = 'fisher'):

# if self.significant_descriptors is None:

# Identify interval descriptor names (Pandas index object) non_binary_descriptor_names \ = self.var_types.index[((self.var_types == 2) | \ (self.var_types == 3))['Type']].tolist()

non_binary_descriptor_names \ = [name for name in non_binary_descriptor_names \ if name in self.processed_descriptors.columns.tolist()]

# Extract the interval descriptor data non_binary_descriptors \ = self.processed_descriptors[non_binary_descriptor_names]

# Identify the binary descriptor names (Pandas index object) binary_descriptor_names \ = self.var_types.index[(self.var_types == 6)['Type']].tolist()

binary_descriptor_names \ = [name for name in binary_descriptor_names \ if name in self.processed_descriptors.columns.tolist()]

# Extract the binary descriptor data binary_descriptors \ = self.processed_descriptors[binary_descriptor_names]

# If the variable selection method is univariate if method == 'univariate':

# If there is interval descriptor data if not non_binary_descriptors.empty:

### ANOVA via SciPy ###

# Group the non-binary descriptor data by response class response_grouping = [response_class_data \ for response_class_data in non_binary_descriptors \ .groupby(self.processed_response)] # Initialize a list to hold ANOVA statistics anova_stats = []

# Iterate over each non-binary descriptor for descript in non_binary_descriptors.columns:

# If the within treatment variance is NOT zero OR # undefined (sample variance is normalized by 1/(n-1)) # AND the sum of the observations of each treatment are # larger than sums of squares degrees of freedom (I-J) if not ((response_grouping[0][1][descript].var() < 0.001 \ or pd.isnull(response_grouping[0][1][descript].var())) \ and (response_grouping[1][1][descript].var() < 0.001 \ or pd.isnull(response_grouping[1][1][descript].var()))) \ and (response_grouping[0][1][descript].shape[0] \ + response_grouping[1][1][descript].shape[0] > 2):

# Perform ANOVA on the distribution of the non-binary # descriptors by the response treatment levels anova_stats \ .append( \ stats.f_oneway(response_grouping[0][1][descript], response_grouping[1][1][descript]))

# One-way ANOVA cannot be performed else:

# Drop the descriptor from consideration # (label insignificant) non_binary_descriptors = \ non_binary_descriptors.drop(descript, axis = 1)

# Process the ANOVA statistics if it could be applied if anova_stats:

# Pre-format the ANOVA association statistics anova_stats = [[anova_stats[p][0], anova_stats[p][1]] \ for p in range(len(anova_stats))]

# Place the ANOVA association statistics into a Pandas # Data Frame anova_statistics \ = pd.DataFrame(anova_stats, index = non_binary_descriptors.columns, columns = ['f_stat','p-val'])

# Add the ANOVA statistics to those returned for later # review as a molecule set attribute self.numerical_selection_statistics \ = copy.deepcopy(anova_statistics)

# Extract the names of non-binary descriptors which appear # to have a correlation to the response significant_non_binary_descriptor_names \ = pd.Series( \ anova_statistics.index[anova_statistics['p-val'] \ <= significance_level])

# else:

# significant_non_binary_descriptor_names = None

# If there are no non binary descriptors or none that are # significant else: significant_non_binary_descriptor_names = None

# Only perform chi-squared or fisher exact tests if there are # binary descriptors if not binary_descriptors.empty:

# Produce a contingency array for each binary descriptor # enumerating the occurrence of the binary response and # descriptor variables binary_descriptor_contingency_array \ = [np.array([((self.processed_response == 1) \ & (binary_descriptors[dscpt] == 1)).sum(), \ ((self.processed_response == 0) \ & (binary_descriptors[dscpt] == 1)).sum(), \ ((self.processed_response == 1) \ & (binary_descriptors[dscpt] == 0)).sum(), \ ((self.processed_response == 0) \ & (binary_descriptors[dscpt] == 0)).sum()]) \ for dscpt in binary_descriptors]

# Recast the contingency array into a Pandas Data Frame binary_descriptor_contingency_array \ = pd.DataFrame(binary_descriptor_contingency_array, index = binary_descriptors.columns.tolist(), columns = ['TP','FP','FN','TN'])

# Find any descriptors (by index) where an infinite odds ratio # will occur infinite_odds_ratio_index \ = np.where( \ (binary_descriptor_contingency_array['FP'] == 0) \ | (binary_descriptor_contingency_array['FN'] == 0))

# If there are binary descriptors with an infinite odds ratio if infinite_odds_ratio_index[0].size != 0:

# Implement a correction to artificially sway the # descriptors' significance toward the null hypothesis binary_descriptor_contingency_array \ .iloc[infinite_odds_ratio_index[0]] \ = binary_descriptor_contingency_array \ .iloc[infinite_odds_ratio_index[0]] + np.ones(4)

# Calculate the sample odds ratio for each binary descriptor # and add to the contingency array sample_odds_ratios \ = [np.divide(np.multiply( \ binary_descriptor_contingency_array.iloc[r,0], \ binary_descriptor_contingency_array.iloc[r,3]), \ np.multiply( \ binary_descriptor_contingency_array.iloc[r,1], \ binary_descriptor_contingency_array.iloc[r,2])) \ for r in range(binary_descriptor_contingency_array \ .shape[0])]

# Place the sample odds ratios into a Pandas Series sample_odds_ratios \ = pd.Series(sample_odds_ratios, index = binary_descriptors.columns)

# Name the sample odds ratio statistic sample_odds_ratios.name = 'OR'

# Add the sample odds ratios to the binary descriptor # contingency array as a column binary_descriptor_contingency_array \ = pd.concat([binary_descriptor_contingency_array, sample_odds_ratios], axis = 1)

# Determine the correlation of the binary descriptors using the # Chi-2 test if univariate_binary_method == 'chi2':

# Conduct the chi2 test for association on each of the # binary descriptors chi2_statistics \ = [stats.chi2_contingency( \ np.array( \ [[binary_descriptor_contingency_array.iloc[t,0], binary_descriptor_contingency_array.iloc[t,1]], [binary_descriptor_contingency_array.iloc[t,2], binary_descriptor_contingency_array.iloc[t,3]]])) \ for t in range(binary_descriptors.shape[1])]

# Extract the p-values from the chi2 test of association # and cast as a Pandas Series chi2_p_values \ = pd.Series([stat[1] for stat in chi2_statistics], index = binary_descriptors.columns)

# Name the Series holding p-values chi2_p_values.name = 'p-val' # Add the p-values from the test of association to the # contingency array binary_descriptor_contingency_array \ = pd.concat([binary_descriptor_contingency_array, \ chi2_p_values], axis = 1)

# Determine the correlation of the binary descriptors using the # Fisher exact test if univariate_binary_method == 'fisher':

# Initialize a list to record fisher exact statistics fisher_statistics = []

# Iterate over the binary descriptors for des_idx in range(binary_descriptors.shape[1]):

# Extract the 2x2 contingency table for the current # descriptor for the contingency array contingency_table \ = np.array(\ [[binary_descriptor_contingency_array \ .iloc[des_idx,0], \ binary_descriptor_contingency_array \ .iloc[des_idx,1]], \ [binary_descriptor_contingency_array \ .iloc[des_idx,2], \ binary_descriptor_contingency_array \ .iloc[des_idx,3]]])

# Calculate the odds ratio and the p-value for # significance on the current descriptor using the # Fisher exact test fisher_statistics \ .append(stats.fisher_exact(contingency_table))

# Extract the p-values from the Fisher exact test of # association and cast as a Pandas Series fisher_p_values \ = pd.Series([fisher_stat[1] \ for fisher_stat in fisher_statistics], \ index = binary_descriptors.columns)

# Name the Series holding the p-values fisher_p_values.name = 'p-val'

# Add the p-values from the test of association to the # contingency array binary_descriptor_contingency_array \ = pd.concat([binary_descriptor_contingency_array, \ fisher_p_values], axis = 1)

# Add the binary variable selection statistics to those # returned as an attribute of the molecule set object contingency_array \ = pd.DataFrame(binary_descriptor_contingency_array)

# self.binary_selection_statistics \ = copy.deepcopy(contingency_array)

# Determine which binary descriptors are thought to correlate # with the response based on the association method used significant_binary_descriptor_names \ = pd.Series( \ binary_descriptor_contingency_array \ .index[binary_descriptor_contingency_array['p-val'] \ <= significance_level])

# If there are no binary descriptors or none that are significant else: significant_binary_descriptor_names = None

# if significant_non_binary_descriptor_names is not None \ and significant_binary_descriptor_names is not None:

# Combine the names of the significant non-binary and binary # descriptors significant_descriptors_names \ = pd.concat((significant_non_binary_descriptor_names, significant_binary_descriptor_names))

elif significant_non_binary_descriptor_names is not None \ and significant_binary_descriptor_names is None:

# Combine the names of the significant non-binary and binary # descriptors significant_descriptors_names \ = significant_non_binary_descriptor_names

elif significant_non_binary_descriptor_names is None \ and significant_binary_descriptor_names is not None:

# Combine the names of the significant non-binary and binary # descriptors significant_descriptors_names \ = significant_binary_descriptor_names

# else:

# significant_descriptors_names = None

# if significant_descriptors_names is not None:

# Reset the index of the significant descriptor names object significant_descriptors_names.reset_index(drop = True, \ inplace = True)

# Set the significant descriptor names as an attribute of the # molecule set self.significant_descriptors = significant_descriptors_names

# The processed descriptors attribute consists of the statistically # significant descriptor data if self.significant_descriptors is not None: self.processed_descriptors \ = self.processed_descriptors[self.significant_descriptors]

# else:

# if isinstance(self.processed_descriptors, pd.Series):

# Find descriptors in the set deemed insignificant by training insignificant_descriptors \ = [d for d in self.processed_descriptors.index.tolist() \ if d not in self.significant_descriptors.values.tolist()]

# Drop the insignificant descriptors self.processed_descriptors \ = self.processed_descriptors \ .drop(insignificant_descriptors, axis = 0)

else:

# Find descriptors in the set deemed insignificant by training insignificant_descriptors \ = [d for d in self.processed_descriptors.columns.tolist() \ if d not in self.significant_descriptors.values.tolist()]

# Drop the insignificant descriptors self.processed_descriptors \ = self.processed_descriptors \ .drop(insignificant_descriptors, axis = 1)

# Define a method to conduct dimensional reduction on the data set # descriptors def dimensional_reduction(self, method = 'pls', n_reduced_dimensions = 2, pca_method = 'component', ratio_explained_variance = 0.99, n_neighbors = 15, min_dist = 0.1, metric = 'euclidean'):

# If the data set descriptors do not already have a dimensional # reduction object attribute if self.dimensional_reduction_object is None:

# If the dimensional reduction method is PCA or PLS if method == 'pca' or method == 'pls':

# If the requested number of reduced dimensions is greater than the # smaller of the number of observations minus one or the number # of untransformed variables if not n_reduced_dimensions \ > min(self.processed_descriptors.shape[0]-1, \ self.processed_descriptors.shape[1]):

# If the dimensional reduction method is Principal # Component Analysis if method == 'pca':

# If the Principal Component Analysis method is to use # the number of components specified if pca_method == 'component':

# Initialize the dimensional reduction object with # number of components specified self.dimensional_reduction_object \ = PCA(n_components = int(n_reduced_dimensions), svd_solver = 'full')

# Fit the dimensional reduction object to the data # set descriptors self.dimensional_reduction_object \ .fit(self.processed_descriptors)

# If the Principal Component Analysis method is to use # as many components necessary to explain the variance # ratio provided elif pca_method == 'variance':

# Start the number of reduced dimensions at two n_reduced_dimensions = 2

# Initialize the dimensional reduction object self.dimensional_reduction_object \ = PCA(n_components = int(n_reduced_dimensions), svd_solver = 'full')

# Fit the dimensional reduction object to the data # set descriptors self.dimensional_reduction_object \ .fit(self.processed_descriptors)

# While the variance explained by the reduced # dimensions is less than that desired while np.sum(self.dimensional_reduction_object \ .explained_variance_ratio_) \ < ratio_explained_variance:

# Increase the number of reduced dimensions by # one n_reduced_dimensions += 1

# Re-initialize the dimensional reduction # object using the new number of reduced # dimensions self.dimensional_reduction_object \ = PCA(n_components = int(n_reduced_dimensions), svd_solver = 'full')

# Re-fit the dimensional reduction object to # the data set descriptors self.dimensional_reduction_object \ .fit(self.processed_descriptors)

# If the dimensional reduction object is Partial Least # Squares elif method == 'pls':

# Initialize a dimensional reduction object with the # number of reduced dimensions specified self.dimensional_reduction_object \ = PLSRegression(n_components \ = int(n_reduced_dimensions), \ scale = False)

# Perform PLS on the training descriptors self.dimensional_reduction_object \ .fit(self.processed_descriptors, self.processed_response)

# if method == 'umap':

# Initialize a dimensional reduction object with the # number of reduced dimensions specified self.dimensional_reduction_object \ = umap.UMAP(n_components = int(n_reduced_dimensions), n_neighbors = n_neighbors, min_dist = min_dist, metric = metric)

# Perform UMAP on the training descriptors self.dimensional_reduction_object \ .fit(self.processed_descriptors)

### PLS explained descriptor space matrix ###

# If the descriptors data set contains one observation if self.processed_descriptors.index.size == 1 \ or isinstance(self.processed_descriptors, pd.Series):

# Apply the dimensional reduction to the descriptors data descriptors_scores = self.dimensional_reduction_object \ .transform(self.processed_descriptors.values.reshape(1,-1))

# Create a header labeling each of the reduced dimensions n_reduced_dimensions = descriptors_scores.shape[-1] column_header = ['Comp' + str(i+1) \ for i in range(n_reduced_dimensions)]

# Update the processed descriptors attribute with the # transformed, reduced descriptor space self.processed_descriptors = pd.Series(descriptors_scores[0], index = column_header)

# Otherwise, the descriptors data set contains multiple # dimensions else:

# descriptors_scores = self.dimensional_reduction_object.transform( self.processed_descriptors)

# Create a header labeling each of the reduced dimensions n_reduced_dimensions = descriptors_scores.shape[-1] column_header = ['Comp' + str(i+1) for i in range(n_reduced_dimensions)]

# Update the processed descriptors attribute with the # transformed, reduced descriptor space self.processed_descriptors = pd.DataFrame(descriptors_scores, index = self.processed_descriptors.index.tolist(), \ columns = column_header)

# Define a method for removing or imputing observations with missing data def missing_data_imputation(self, method = 'remove', indicator_type = np.nan):

# Find the identifiers of observations missing data if isinstance(self.processed_descriptors, pd.Series):

# missing_value_labels \ = self.processed_descriptors.index \ [self.processed_descriptors.isnull().values.any(axis = 0)]

else:

# missing_value_labels \ = self.processed_descriptors.index \ [self.processed_descriptors.isnull().values.any(axis = 1)]

# if missing_value_labels.size != 0:

# Package the observations missing values into a Pandas Series self.observations_missing_values \ = pd.Series(missing_value_labels, name = 'observations_missing_values')

else:

self.observations_missing_values \ = pd.Series(name = 'observations_missing_values') return

# If the observations are to be removed if method == 'remove':

# Drop observations/rows with any missing values self.processed_descriptors = self.processed_descriptors.dropna()

# If the response variable exists for the data set if self.processed_response is not None:

# Drop the same observations from the response data self.processed_response \ = self.processed_response.drop(missing_value_labels)

# Otherwise, the observations will be imputed else:

# If an imputation object was not passed if self.imputation_object is None:

# Initialize an imputation object # (NaN instances are recognised as missing values, imputed by # column/descriptor means) self.imputation_object \ = SimpleImputer(missing_values = indicator_type, strategy = method)

# Fit the descriptors to the imputation object self.imputation_object.fit(self.processed_descriptors)

# Impute the descriptors and package them into a Pandas DataFrame descriptors_array = \ self.imputation_object.transform(self.processed_descriptors)

# Recast the data set descriptors as a Pandas DataFrame self.processed_descriptors = \ pd.DataFrame(descriptors_array, index = self.processed_descriptors.index, columns = self.processed_descriptors.columns)

# Define a function to remove descriptors with variance below the desired # threshold def variance_threshold_removal(self, threshold = 0.0):

# If a variance-based descriptor removal object does not exist if self.variance_threshold_object is None:

# Initialize the variance-based feature selector self.variance_threshold_object = VarianceThreshold(threshold)

# Fit the descriptors to the variance-based feature selector self.variance_threshold_object.fit(self.processed_descriptors)

# Find the names of the descriptors removed zero_variance_descriptors \ = self.processed_descriptors.columns \ [self.variance_threshold_object.variances_ <= threshold]

# Find the names of the descriptors retained non_zero_variance_descriptors \ = self.processed_descriptors.columns \ [self.variance_threshold_object.variances_ > threshold]

# Package the zero variance descriptors into a Pandas Series self.zero_variance_descriptors \ = pd.Series(zero_variance_descriptors, name = 'zero_variance_descriptors')

# Apply the variance threshold remove to the data set descriptor_array \ = self.variance_threshold_object\ .transform(self.processed_descriptors)

# Recast the selected descriptors into a Pandas DataFrame self.processed_descriptors \ = pd.DataFrame(descriptor_array, index = self.processed_descriptors.index, columns = non_zero_variance_descriptors)

# Define a method to remove infrequent binary descriptors def infrequent_descriptor_removal(self, threshold = 3):

# If the data set has no infrequent binary descriptor attribute # (such would be the case of a training set) if self.infrequent_binary_descriptors is None:

# Identify the binary descriptors (Pandas index object) binary_descriptor_names \ = self.var_types.index[(self.var_types == 6)['Type'] \ .tolist()].tolist()

# If there are binary descriptors if binary_descriptor_names:

# binary_descriptor_names \ = set(self.processed_descriptors.columns.tolist()) \ .intersection(set(binary_descriptor_names))

# Retrieve the binary descriptors data from the current # data set binary_descriptors \ = self.processed_descriptors.loc[:,binary_descriptor_names]

# Find the binary descriptors deemed too infrequent to use # for modeling infreq_binary_descriptor_names \ = binary_descriptors.columns[binary_descriptors.sum(axis = 0) \ <= threshold]

# Record the infrequent descriptors removed as an attribute of # the data set self.infrequent_binary_descriptors \ = pd.Series(infreq_binary_descriptor_names, name = 'infrequent_binary_descriptors')

# The data set will have infrequent binary descriptors if found # or assigned as an attribute ahead of time (such is the case with # a test set) if self.infrequent_binary_descriptors is not None:

# If the data set is a Series (local modeling) if isinstance(self.processed_descriptors, pd.Series):

# Drop the infrequent binary descriptors from the data set self.processed_descriptors \ = self.processed_descriptors. \ drop(self.infrequent_binary_descriptors.tolist())

# If the data is a Data Frame (pre-processing or global modeling) else:

# Drop the infrequent binary descriptors from the data set self.processed_descriptors \ = self.processed_descriptors. \ drop(self.infrequent_binary_descriptors.tolist(), axis = 1)

# Define a method to perform a user-specified transformation on a user- # specified list of descriptors def descriptor_transformation(self, transform_descriptors = None, transform_function = None, transform_inverse = None):

# transformed_descriptor_names = None

# if transform_descriptors:

# The desired descriptors to be transformed actually present within # the data set transformed_descriptor_names \ = [d for d in transform_descriptors \ if d in self.processed_descriptors.columns.tolist()]

# Code to preserve the order of the processed descriptors order = self.processed_descriptors.columns.tolist() transformed_order \ = [d for d in order if d in transformed_descriptor_names] non_transformed_order \ = [d for d in order if d not in transformed_descriptor_names] order = transformed_order + non_transformed_order

# If there are descriptors present to be transformed if transformed_descriptor_names:

# Extract the transformed descriptor data transformed_descriptors \ = self.processed_descriptors.loc[:,transformed_descriptor_names]

# Drop the transformed descriptor data from the set self.processed_descriptors \ = self.processed_descriptors.drop(transformed_descriptors, axis = 1)

# If there is not already a transformation object if self.transformation_object is None:

# Initialize a transformation object and fit it to the # transformed descriptor data self.transformation_object \ = FunctionTransformer(func = transform_function, inverse_func = transform_inverse, validate = False) self.transformation_object.fit(transformed_descriptors)

# Apply the transformation to the desired descriptors transformed_descriptors \ = self.transformation_object.transform(transformed_descriptors)

# Combine the transformed and non-transformed descriptors self.processed_descriptors \ = pd.merge(self.processed_descriptors, transformed_descriptors, left_index = True, right_index = True)

# Re-establish the original descriptor order self.processed_descriptors = self.processed_descriptors[order]

# Define a method to standardize numerical (interval or count-based) # descriptor data def standardization(self, standardize = True):

# if standardize:

if isinstance(self.processed_descriptors, pd.Series):

descriptors = self.processed_descriptors.index.tolist()

else:

descriptors = self.processed_descriptors.columns.tolist()

# Identify numerical descriptors (Pandas index object) numerical_descriptor_names \ = self.var_types.index[((self.var_types == 2) | \ (self.var_types == 3))['Type']].tolist()

# If there are numerical descriptors if numerical_descriptor_names:

# The numerical descriptors present within the data set numerical_descriptors_present \ = list(set(numerical_descriptor_names) \ .intersection(set(descriptors)))

# Code to preserve the order of the processed descriptors numerical_order \ = [d for d in descriptors \ if d in numerical_descriptors_present] non_numerical_order \ = [d for d in descriptors \ if d not in numerical_descriptors_present] order = numerical_order + non_numerical_order

# If numerical descriptors are present in the data set if numerical_descriptors_present:

# The numerical descriptor data numerical_descriptors \ = self.processed_descriptors \ [numerical_descriptors_present]

# Drop any non-numerical descriptors so they are not # standardized if isinstance(self.processed_descriptors, pd.Series):

numerical_descriptor_names = numerical_descriptors \ .index.tolist()

numerical_descriptors = numerical_descriptors \ .values.reshape(1,-1)

self.processed_descriptors \ = self.processed_descriptors \ .drop(numerical_descriptor_names, axis = 0)

else:

self.processed_descriptors \ = self.processed_descriptors \ .drop(numerical_descriptors.columns.tolist(), axis = 1)

# If no standardization object exists (training set), then # one must be initialized and fit if self.standardization_object is None:

# Initialize and fit a standardization object to the # numerical data self.standardization_object = StandardScaler() self.standardization_object.fit(numerical_descriptors)

# Standardize the numerical descriptors standardized_descriptors \ = self.standardization_object \ .transform(numerical_descriptors)

# if isinstance(self.processed_descriptors, pd.Series):

# standardized_descriptors = standardized_descriptors[0]

# standardized_descriptors \ = pd.Series(standardized_descriptors, index = numerical_descriptor_names)

else:

# standardized_descriptors \ = pd.DataFrame(standardized_descriptors, index = numerical_descriptors.index, columns = numerical_descriptors.columns)

# Add the non-numerical descriptors to the standardized, # numerical descriptors if isinstance(self.processed_descriptors, pd.Series):

# self.processed_descriptors \ = self.processed_descriptors \ .append(standardized_descriptors)

else:

# self.processed_descriptors \ = pd.merge(self.processed_descriptors, standardized_descriptors, left_index = True, right_index = True)

# Re-establish the order of the now standardized data self.processed_descriptors \ = self.processed_descriptors[order]

# Set the descriptor statistics attribute as a summary of the # descriptor statistics self.descriptor_statistics \ = self.processed_descriptors.describe()

# def combine_predictions(self, molesets_collection = None):

# for molesets in molesets_collection:

# if not isinstance(molesets, MoleculeSets):

# raise Exception('The objects in the passed collection must be' \ ' MoleculeSets objects.')

# self.global_predictions \ [molesets.test_set.global_predictions.index.tolist()] \ = molesets.test_set.global_predictions

# self.local_predictions \ [molesets.test_set.local_predictions.index.tolist()] \ = molesets.test_set.local_predictions

# self.global_prediction_probabilities \ [molesets.test_set.global_prediction_probabilities.index.tolist()] \ = molesets.test_set.global_prediction_probabilities

# self.local_prediction_probabilities \ [molesets.test_set.local_prediction_probabilities.index.tolist()] \ = molesets.test_set.local_prediction_probabilities

# Define a method to evaluate the global model on only those compounds # for which a local prediction was also made def global_on_local_evaluation(self): g = copy.deepcopy(self.global_predictions) l = copy.deepcopy(self.local_predictions) v = g[(((g == 1)|(g == 0))&((l == 1)|(l == 0)))]

self.global_on_local_predictions \ = pd.Series(np.nan, index = self.global_predictions.index)

self.global_on_local_predictions[v.index] = v

gol_pred_probs = self.global_predictions[v.index]

self.global_on_local_performance_summary \ = self.evaluate(response = self.raw_response, predictions = self.global_on_local_predictions, probabilities = gol_pred_probs)
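The mask in global_on_local_evaluation() restricts the comparison to queries for which both the global and the local models returned a binary decision, so the two approaches are scored on the same molecules. A brief sketch of that mask with hypothetical prediction vectors:

import numpy as np
import pandas as pd

g = pd.Series([1, 0, np.nan, 1], index=['m1', 'm2', 'm3', 'm4'])   # global predictions
l = pd.Series([1, np.nan, 0, 0], index=['m1', 'm2', 'm3', 'm4'])   # local predictions

both_decided = ((g == 1) | (g == 0)) & ((l == 1) | (l == 0))
v = g[both_decided]
print(v.index.tolist())   # ['m1', 'm4']: only queries predicted by both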

    def evaluate(self, response=None, predictions=None,
                 probabilities=None, phase=None):

        if response is None and predictions is None:

            if self.raw_response is not None:

                response = copy.deepcopy(self.raw_response)

            if phase == 'global' and self.global_predictions is not None:

                predictions = copy.deepcopy(self.global_predictions)

            elif phase == 'local' and self.local_predictions is not None:

                predictions = copy.deepcopy(self.local_predictions)

        if probabilities is None:

            if phase == 'global' \
                    and self.global_prediction_probabilities is not None:

                probabilities = copy.deepcopy(
                    self.global_prediction_probabilities)

            elif phase == 'local' \
                    and self.local_prediction_probabilities is not None:

                probabilities = copy.deepcopy(
                    self.local_prediction_probabilities)

            else:

                warn('No prediction probabilities for evaluation.')

        if response is None or predictions is None:

            warn('The test set must have both responses and predictions '
                 'for the model generating the predictions to be evaluated.')
            return

        # If the numbers of responses and predictions do not agree
        if response.size != predictions.size:

            # Inform the user that the numbers of responses and predictions
            # do not agree
            warn('The response and prediction arrays contain an unequal '
                 'number of values.')
            return

        n_observations = response.size
        n_no_predictions = np.sum(predictions.isnull())
        n_predictions = predictions.shape[0] - n_no_predictions

        # Calculate the fraction of test observations predicted
        fraction_predicted = n_predictions / n_observations

        # Count the numbers of true positives, true negatives, false
        # positives, and false negatives
        true_positives = np.sum((response == 1) & (predictions == 1))
        true_negatives = np.sum((response == 0) & (predictions == 0))
        false_positives = np.sum((response == 0) & (predictions == 1))
        false_negatives = np.sum((response == 1) & (predictions == 0))

        # Calculate conditions positive and negative
        condition_positive = true_positives + false_negatives
        condition_negative = true_negatives + false_positives

        # Calculate the fraction of positives within the response of those
        # observations with model predictions
        prevalence = (((response == 1) & (predictions == 1))
                      | ((response == 1) & (predictions == 0))).sum() \
            / n_predictions

        # Calculate predictives positive and negative
        predictive_positive = true_positives + false_positives
        predictive_negative = true_negatives + false_negatives

        # AUC calculation
        if response is not None and probabilities is not None:

            probabilities = probabilities.loc[~probabilities.isnull()]
            response = response.loc[probabilities.index.tolist()]

            # The AUC is undefined when only one class is present
            if all(response == 1) or all(response == 0):

                auc = np.nan

            else:

                auc = roc_auc_score(response, probabilities)

        else:

            auc = np.nan

        # If predictions were made
        if n_predictions > 0:

            # Calculate the accuracy
            accuracy = (true_positives + true_negatives) \
                / (true_positives + true_negatives
                   + false_positives + false_negatives)

        # No predictions were made
        else:

            # The accuracy is null
            accuracy = np.nan

        # If positive observations are present
        if condition_positive != 0:

            # Calculate the sensitivity (probability the model predicts
            # positive given a positive response)
            sensitivity = true_positives / condition_positive

        # No positive observations are present
        else:

            # The sensitivity is null
            sensitivity = np.nan

        # If negative observations are present
        if condition_negative != 0:

            # Calculate the specificity (probability the model predicts
            # negative given a negative response)
            specificity = true_negatives / condition_negative

        # No negative observations are present
        else:

            # The specificity is null
            specificity = np.nan

        # If positive predictions were made
        if predictive_positive != 0:

            # Calculate the positive predictive value (probability the
            # response is positive given the model predicts positive)
            positive_predictive_value = true_positives / predictive_positive

        # No positive predictions were made
        else:

            # The positive predictive value is null
            positive_predictive_value = np.nan

        # If negative predictions were made
        if predictive_negative != 0:

            # Calculate the negative predictive value (probability the
            # response is negative given the model predicts negative)
            negative_predictive_value = true_negatives / predictive_negative

        # No negative predictions were made
        else:

            # The negative predictive value is null
            negative_predictive_value = np.nan

        # Collect the description of the model performance
        model_evaluations = [n_observations, n_predictions,
                             fraction_predicted, prevalence,
                             true_positives, false_negatives,
                             false_positives, true_negatives,
                             sensitivity, specificity, accuracy,
                             positive_predictive_value,
                             negative_predictive_value, auc]

        # Convert the list of model evaluations to a pandas Series
        model_evaluation = pd.Series(model_evaluations,
                                     index=['nobs', 'npred', 'fpred',
                                            'prev', 'tps', 'fns',
                                            'fps', 'tns', 'sens',
                                            'spec', 'accu', 'ppv',
                                            'npv', 'auc'])

        # Return the model evaluation to the calling script
        if not phase:
            return model_evaluation

        if phase == 'global':
            self.global_performance_summary = model_evaluation

        if phase == 'local':
            self.local_performance_summary = model_evaluation
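For reference, the confusion-matrix statistics assembled by evaluate() can be reproduced on a small example; the response and prediction vectors below are hypothetical, chosen only to make the arithmetic easy to follow:

import numpy as np
import pandas as pd

response = pd.Series([1, 1, 0, 0, 1])
predictions = pd.Series([1, 0, 0, 1, 1])

tp = np.sum((response == 1) & (predictions == 1))   # 2
tn = np.sum((response == 0) & (predictions == 0))   # 1
fp = np.sum((response == 0) & (predictions == 1))   # 1
fn = np.sum((response == 1) & (predictions == 0))   # 1

sensitivity = tp / (tp + fn)            # 2/3: P(predict positive | positive)
specificity = tn / (tn + fp)            # 1/2: P(predict negative | negative)
accuracy = (tp + tn) / response.size    # 3/5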

class MoleculeSets:

    n_sets = 0

    def __init__(self, name=None, training_set=None, test_set=None,
                 domain_object=None, locality_object=None):

        self.name = name

        if isinstance(training_set, MoleculeSet):
            self.training_set = training_set
        else:
            raise Exception('The training set object is not an instance '
                            'of MoleculeSet.')

        if isinstance(test_set, MoleculeSet):
            self.test_set = test_set
        else:
            raise Exception('The test set object is not an instance '
                            'of MoleculeSet.')

        self.domain_object = domain_object
        self.locality_object = locality_object

        self.outside_domain_test_identifiers = None
        self.outside_locality_test_identifiers = None

        self.local_neighborhood_statistics = None
        self.local_information = {}
        self.local_molecule_sets = {}

        MoleculeSets.n_sets += 1

    def tsne_embedding(self, parameters=None):

        self.training_to_test_transfer(parameters)

        # Combine the training and test sets into a single set
        training_ids = self.training_set.processed_descriptors.index.tolist()
        test_ids = self.test_set.processed_descriptors.index.tolist()
        data_set = pd.concat([self.training_set.processed_descriptors,
                              self.test_set.processed_descriptors])
        idx = data_set.index.tolist()

        # It is suggested to reduce the dimension of data with many
        # dimensions (e.g., > 50) before applying t-SNE
        if data_set.shape[1] > 50:

            # PCA is used because we don't know the test set response
            pca = PCA(n_components=50, svd_solver='full')
            data_set = pca.fit_transform(data_set)
            data_set = pd.DataFrame(
                data_set, index=idx,
                columns=['PC' + str(i) for i in range(1, 51)])

        # Apply the t-SNE embedding to the combined training and test data
        tsne = TSNE(
            n_components=int(parameters.dimensional_reduction.n_reduced_dimensions),
            perplexity=int(parameters.dimensional_reduction.perplexity))
        self.training_set.dimensional_reduction_object = tsne
        self.test_set.dimensional_reduction_object = tsne
        data_set = tsne.fit_transform(data_set)
        data_set = pd.DataFrame(
            data_set, index=idx,
            columns=['EC' + str(i) for i in
                     range(1, int(parameters.dimensional_reduction.n_reduced_dimensions) + 1)])

        # Separate the training set from the test set
        self.training_set.processed_descriptors = data_set.loc[training_ids]
        self.test_set.processed_descriptors = data_set.loc[test_ids]
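Because t-SNE provides no out-of-sample transform, tsne_embedding() embeds the training and test descriptors jointly, optionally compressing to 50 principal components first. A self-contained sketch of that PCA-then-t-SNE pipeline on random stand-in data:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(200, 120)))       # 120 descriptors > 50

X_50 = PCA(n_components=50, svd_solver='full').fit_transform(X)
embedding = TSNE(n_components=2, perplexity=30).fit_transform(X_50)
print(embedding.shape)                               # (200, 2)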

    def training_to_test_transfer(self, parameters=None):

        if isinstance(parameters, Parameters):

            ### Removal or imputation of observations with missing values ###

            # Transfer the imputation object from the training to the test set
            self.test_set.imputation_object = copy.deepcopy(
                self.training_set.imputation_object)

            # Conduct missing data imputation on the test set
            self.test_set.missing_data_imputation(
                method=parameters.preprocess.missing_data_method,
                indicator_type=parameters.preprocess.missing_data_indicator)

            ### Remove descriptors with variance less than or equal to the ###
            ### threshold ###

            # Transfer the variance threshold object from the training to
            # the test set
            self.test_set.zero_variance_descriptors = copy.deepcopy(
                self.training_set.zero_variance_descriptors)
            self.test_set.variance_threshold_object = copy.deepcopy(
                self.training_set.variance_threshold_object)

            # Variance threshold removal will error on the test set without
            # an object fitted on a training set
            if self.test_set.zero_variance_descriptors is not None:

                if parameters.phase == 'local':

                    remove_test_descriptors = [
                        d for d in self.test_set.processed_descriptors.index.tolist()
                        if d in self.test_set.zero_variance_descriptors.tolist()]

                    # Conduct variance threshold removal on the test set
                    self.test_set.processed_descriptors = \
                        self.test_set.processed_descriptors.drop(
                            remove_test_descriptors, axis=0)

                else:

                    remove_test_descriptors = [
                        d for d in self.test_set.processed_descriptors.columns.tolist()
                        if d in self.test_set.zero_variance_descriptors.tolist()]

                    # Conduct variance threshold removal on the test set
                    self.test_set.processed_descriptors = \
                        self.test_set.processed_descriptors.drop(
                            remove_test_descriptors, axis=1)

            ### Remove binary descriptors deemed too infrequent within the ###
            ### training set ###

            # Transfer the infrequent binary descriptors from the training
            # set to the test set
            self.test_set.infrequent_binary_descriptors = copy.deepcopy(
                self.training_set.infrequent_binary_descriptors)

            # If the training and test sets have infrequent binary descriptors
            if self.test_set.infrequent_binary_descriptors is not None:

                # Remove the infrequent binary descriptors from the test set
                self.test_set.infrequent_descriptor_removal()

            ### Perform transformations on the descriptors indicated by the ###
            ### user ###

            # Transfer the transformation object from the training set to
            # the test set
            self.test_set.transformation_object = copy.deepcopy(
                self.training_set.transformation_object)

            # If there are descriptors listed for transformation and a
            # transformation object
            if parameters.preprocess.transform_descriptors \
                    and self.test_set.transformation_object:

                # Apply the transformations to the test set descriptors
                self.test_set.descriptor_transformation(
                    transform_descriptors=parameters.preprocess.transform_descriptors)

            ### Standardize the test set descriptors ###

            # Transfer the standardization object from the training set to
            # the test set
            self.test_set.standardization_object = copy.deepcopy(
                self.training_set.standardization_object)

            # If there is a standardization object fit on the training set
            if self.test_set.standardization_object:

                # Standardize the test set
                self.test_set.standardization(parameters.preprocess.standardize)

            ### Select the significant descriptors of the test set ###

            # If the training set has no significant descriptors
            if self.training_set.significant_descriptors is None:

                # Warn the user that having no significant descriptors stops
                # the current execution of the workflow
                warn('The training set has no significant descriptors, '
                     'which will stop the current execution of the workflow.')
                return

            # Transfer the descriptors deemed significant by variable
            # selection from the training to the test set
            self.test_set.significant_descriptors = copy.deepcopy(
                self.training_set.significant_descriptors)

            # Perform variable selection on the test set
            self.test_set.variable_selection()

            ### Reduce the dimension of the test set ###

            if parameters.dimensional_reduction.method != 'tsne':

                # Transfer the dimensional reduction object from the
                # training set to the test set
                self.test_set.dimensional_reduction_object = copy.deepcopy(
                    self.training_set.dimensional_reduction_object)

                # If the test set has a fitted dimensional reduction object
                if self.test_set.dimensional_reduction_object:

                    # Reduce the dimension of the test set
                    self.test_set.dimensional_reduction()

    def domain_assessment(self, domain_k=5, domain_quantile=0.95):

        if self.domain_object is None:

            # Initialize a nearest neighbors object
            # (An additional neighbor is required as the first neighbor is
            # the observation itself)
            self.domain_object = NearestNeighbors(n_neighbors=int(domain_k + 1))

            # Fit the training observations to the nearest neighbors object
            self.domain_object.fit(self.training_set.processed_descriptors)

        # Determine the distances and indices of the training set
        # observations to the observations fit to the domain object
        domain_distances, domain_indices = self.domain_object.kneighbors(
            self.training_set.processed_descriptors)

        # Isolate the distances to the kth nearest neighbors
        kth_distances = domain_distances[:, -1]

        # Take as the domain radius the domain quantile of the distances to
        # the kth nearest neighbors
        domain_radius = np.percentile(kth_distances, domain_quantile * 100)

        # Find the raw indices of the training observations within the
        # domain radius of the test observations
        test_neighbor_distances, test_neighbor_indices = \
            self.domain_object.radius_neighbors(
                self.test_set.processed_descriptors, domain_radius,
                return_distance=True)

        # Find the identifiers of the test observations out of domain
        self.outside_domain_test_identifiers = \
            self.test_set.processed_descriptors.index[
                np.array([neighs.size for neighs in test_neighbor_indices])
                < domain_k]

        # Remove the out-of-domain test observations from the test
        # descriptors and response
        self.test_set.processed_descriptors = \
            self.test_set.processed_descriptors.drop(
                self.outside_domain_test_identifiers)
        self.test_set.processed_response = \
            self.test_set.processed_response.drop(
                self.outside_domain_test_identifiers)
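domain_assessment() derives the applicability domain from the distribution of kth-nearest-neighbor distances within the training set, then excludes test compounds with fewer than k training neighbors inside that radius. A standalone sketch of the same idea on random stand-in data:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
train = rng.normal(size=(100, 2))
test = rng.normal(size=(10, 2))

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(train)   # +1: first neighbor is the point itself
distances, _ = nn.kneighbors(train)
radius = np.percentile(distances[:, -1], 95)          # 95th percentile of kth-neighbor distances

_, neighborhoods = nn.radius_neighbors(test, radius)
in_domain = np.array([n.size for n in neighborhoods]) >= k
print(in_domain)                                      # boolean domain membership per test point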

    # Define a method to determine which data points in one set are in
    # proximity to the data points in another set
    # (This method will create as many instances of the MoleculeSets class
    # as there are observations in the test set)
    def locality_assessment(self, method='radius', radius=0.5,
                            locality_k=5, locality_quantile=0.95):

        # Acceptable methods: 'radius', 'quantile', 'neighbor'
        if method == 'quantile' or method == 'radius':

            if method == 'quantile':

                if not self.locality_object:

                    # Initialize a nearest neighbors object using locality k
                    # (An additional neighbor is required as the first
                    # neighbor is the observation itself)
                    self.locality_object = NearestNeighbors(
                        n_neighbors=locality_k + 1)

                    # Fit the training observations
                    self.locality_object.fit(
                        self.training_set.processed_descriptors)

                # Find the distances and raw indices of the training set
                # to itself
                locality_distances, locality_indices = \
                    self.locality_object.kneighbors(
                        self.training_set.processed_descriptors)

                # Isolate the distances to the kth nearest neighbors
                locality_kth_distances = locality_distances[:, -1]

                # Take as the locality radius the locality quantile of the
                # distances to the kth nearest neighbors
                radius = np.percentile(locality_kth_distances,
                                       locality_quantile * 100)

            elif method == 'radius':

                if not self.locality_object:

                    self.locality_object = NearestNeighbors(radius=radius)
                    self.locality_object.fit(
                        self.training_set.processed_descriptors)

            # Find the raw indices of the training observations (all
            # observations) within the locality radius of the test
            # observations
            local_neighbor_distances, local_neighbor_indices = \
                self.locality_object.radius_neighbors(
                    self.test_set.processed_descriptors, radius,
                    return_distance=True)

        elif method == 'neighbor':

            if not self.locality_object:

                # Define a k-NN object with the number of neighbors k
                # specified by the user and fit the training descriptors
                # to it
                locality_k = int(locality_k)
                self.locality_object = NearestNeighbors(n_neighbors=locality_k)
                self.locality_object.fit(
                    self.training_set.processed_descriptors)

            # Determine the k-nearest training neighbors of the test
            # queries and the distances to them
            local_neighbor_distances, local_neighbor_indices = \
                self.locality_object.kneighbors(
                    self.test_set.processed_descriptors)

        # Map the raw neighbor indices to identifiers and responses
        local_neighbor_identifiers = [
            [self.training_set.processed_descriptors.index[raw_ind]
             for raw_ind in loc] for loc in local_neighbor_indices]
        local_neighbor_responses = [
            [self.training_set.processed_response[ident] for ident in qry]
            for qry in local_neighbor_identifiers]

        # Iterate over the test observations in consideration
        for tidx in range(self.test_set.processed_descriptors.index.size):

            test_observation_identifier = \
                self.test_set.processed_descriptors.index[tidx]

            loc_info = np.array([local_neighbor_distances[tidx],
                                 local_neighbor_responses[tidx]])
            loc_info = loc_info.transpose()

            local_information_df = pd.DataFrame(
                loc_info,
                index=local_neighbor_identifiers[tidx],
                columns=['Distance', 'Response'])

            self.local_information[test_observation_identifier] = \
                local_information_df

            # If the current test query has at least locality k neighbors
            if len(local_neighbor_indices[tidx]) >= locality_k:

                # Construct the local learning sets using the raw training
                # and test observation data
                query_training_set = MoleculeSet(
                    raw_descriptors=self.training_set.raw_descriptors
                    .loc[local_neighbor_identifiers[tidx], :],
                    raw_response=self.training_set.raw_response
                    .loc[local_neighbor_identifiers[tidx]],
                    var_types=self.training_set.var_types,
                    pickled_rdk_molecules={
                        i: v for i, v in
                        self.training_set.pickled_rdk_molecules.items()
                        if i in local_neighbor_identifiers[tidx]})

                query_data = MoleculeSet(
                    raw_descriptors=self.test_set.raw_descriptors
                    .loc[test_observation_identifier, :],
                    raw_response=self.test_set.raw_response
                    .loc[test_observation_identifier],
                    var_types=self.test_set.var_types,
                    pickled_rdk_molecules={
                        test_observation_identifier:
                        self.test_set.pickled_rdk_molecules[
                            test_observation_identifier]})

                self.local_molecule_sets[test_observation_identifier] = \
                    MoleculeSets(name=test_observation_identifier,
                                 training_set=query_training_set,
                                 test_set=query_data)

        # Summarize the neighborhood distance statistics for the queries
        # with neighbors
        queries_with_neighbors, neighbor_statistics = [], []

        for query, neighborhood in self.local_information.items():

            if not neighborhood.empty:

                queries_with_neighbors.append(query)
                neighbor_statistics.append(neighborhood['Distance'].describe())

        if neighbor_statistics:

            neighbor_statistics_df = pd.concat(neighbor_statistics, axis=1)
            neighbor_statistics_df.columns = queries_with_neighbors
            self.local_neighborhood_statistics = neighbor_statistics_df
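In the 'radius' and 'quantile' modes, radius_neighbors returns a different number of training neighbors for each query, which is what makes the local training sets variable in size. A brief sketch with stand-in data:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(1)
train = rng.normal(size=(50, 2))
queries = rng.normal(size=(3, 2))

nn = NearestNeighbors(radius=0.5).fit(train)
dists, inds = nn.radius_neighbors(queries, return_distance=True)

# One local training set per query, each with its own size
local_training_sets = [train[i] for i in inds]
print([len(s) for s in local_training_sets])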

    def train(self, technique='pls', n_neighbors=5, weights='uniform',
              n_components=2):

        # Techniques: 'pls', 'lda', 'knn', 'logistic'

        training_descriptors = self.training_set.processed_descriptors
        training_response = self.training_set.processed_response

        # If the requested modeling technique is knn
        if technique == 'knn':

            # Try to train a knn model
            try:

                # Initialize a knn classifier object with the number of
                # neighbors and weighting passed to the function
                self.training_set.modeling_technique_object = \
                    KNeighborsClassifier(n_neighbors=int(n_neighbors),
                                         weights=weights)

                # Train the model on the training descriptors and responses
                self.training_set.modeling_technique_object.fit(
                    training_descriptors, training_response)

            # If something went wrong during model training
            except:

                warn('An error occurred while attempting to train the model.')

        # If the requested modeling technique is logistic regression
        if technique == 'logistic':

            # Try to train a logistic model
            try:

                # Initialize a logistic regression object
                self.training_set.modeling_technique_object = \
                    LogisticRegression(solver='lbfgs', max_iter=1000, C=1e10)

                # Train the model on the training descriptors and responses
                self.training_set.modeling_technique_object.fit(
                    training_descriptors, training_response)

            # If something went wrong during model training
            except:

                warn('An error occurred while attempting to train the model.')

        # If the modeling technique is linear discriminant analysis
        if technique == 'lda':

            # Try to train a linear discriminant model
            try:

                # Initialize a linear discriminant object
                self.training_set.modeling_technique_object = \
                    LinearDiscriminantAnalysis()

                # Train the model on the training descriptors and responses
                self.training_set.modeling_technique_object.fit(
                    training_descriptors, training_response)

            # If something went wrong during model training
            except:

                warn('An error occurred while attempting to train the model.')

        # If the modeling technique is partial least squares discriminant
        # analysis
        if technique == 'pls':

            # Initialize a pls object used for classification
            self.training_set.modeling_technique_object = \
                PLSRegression(n_components=int(n_components), scale=False)

            # Build a pls model using the training descriptors and responses
            self.training_set.modeling_technique_object.fit(
                training_descriptors, training_response)

    def predict(self):

        if not self.test_set.modeling_technique_object:

            warn('The test set does not have a fitted model necessary '
                 'to make predictions.')
            return

        if isinstance(self.test_set.processed_response, pd.Series):

            if self.test_set.global_predictions is None:
                self.test_set.global_predictions = pd.Series(
                    np.nan, index=self.test_set.processed_response.index)

            if self.test_set.local_predictions is None:
                self.test_set.local_predictions = pd.Series(
                    np.nan, index=self.test_set.processed_response.index)

        # If a prediction is being made on a single observation
        if len(self.test_set.processed_descriptors.shape) < 2:

            # Reshape the 1d series data to avoid an sklearn warning
            test_descriptors = self.test_set.processed_descriptors \
                .values.reshape(1, -1).copy()

        else:

            test_descriptors = self.test_set.processed_descriptors.copy()

        # A PLS model predicts a continuous score for a classification
        # problem which must be converted into a binary decision
        if hasattr(self.test_set.modeling_technique_object, 'y_loadings_'):

            # Make predictions on the test descriptors
            predictions = self.test_set.modeling_technique_object.predict(
                test_descriptors)

            # Keep the continuous scores in the same two-column layout as a
            # predict_proba output so they can serve as probabilities below
            prediction_probabilities = np.hstack(
                (1 - predictions.reshape(-1, 1), predictions.reshape(-1, 1)))

            # Convert the predictions from continuous to binary
            predictions = np.array([1 if pred >= 0.5 else 0
                                    for pred in predictions])

        else:

            # Make predictions on the test descriptors
            predictions = self.test_set.modeling_technique_object.predict(
                test_descriptors)

            prediction_probabilities = self.test_set \
                .modeling_technique_object.predict_proba(test_descriptors)

        # If a prediction is being made on a single observation
        if len(self.test_set.processed_descriptors.shape) < 2:

            self.test_set.local_predictions = predictions

            self.test_set.local_prediction_probabilities = \
                prediction_probabilities[:, 1]

        else:

            self.test_set.global_predictions[
                self.test_set.processed_descriptors.index.tolist()] \
                = predictions

            self.test_set.global_prediction_probabilities[
                self.test_set.processed_descriptors.index.tolist()] \
                = prediction_probabilities[:, 1]
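The PLS branch above treats classification as regression on a 0/1 response and thresholds the continuous output at 0.5. A self-contained sketch of that PLS-DA decision rule on synthetic data:

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(60, 8))
y = (X[:, 0] + 0.1 * rng.normal(size=60) > 0).astype(int)   # synthetic 0/1 response

pls = PLSRegression(n_components=2, scale=False)
pls.fit(X, y)

scores = pls.predict(X).ravel()            # continuous scores near 0 or 1
classes = np.where(scores >= 0.5, 1, 0)    # binary decision at the 0.5 threshold
print((classes == y).mean())               # training accuracy of the sketch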

    def evaluate(self, response=None, predictions=None,
                 probabilities=None, phase=None):

        err_flg = False

        if response is None and predictions is None:

            if self.test_set.raw_response is not None:

                response = copy.deepcopy(self.test_set.raw_response)

            if phase == 'global' \
                    and self.test_set.global_predictions is not None:

                predictions = copy.deepcopy(self.test_set.global_predictions)

            elif phase == 'local' \
                    and self.test_set.local_predictions is not None:

                predictions = copy.deepcopy(self.test_set.local_predictions)

        if probabilities is None:

            if phase == 'global' \
                    and self.test_set.global_prediction_probabilities is not None:

                probabilities = copy.deepcopy(
                    self.test_set.global_prediction_probabilities)

            elif phase == 'local' \
                    and self.test_set.local_prediction_probabilities is not None:

                probabilities = copy.deepcopy(
                    self.test_set.local_prediction_probabilities)

            else:

                warn('No prediction probabilities for evaluation.')

        if response is None or predictions is None:

            warn('The test set must have both responses and predictions '
                 'for the model generating the predictions to be evaluated.')
            err_flg = True

        else:

            # If the numbers of responses and predictions do not agree
            if response.size != predictions.size:

                # Inform the user that the numbers of responses and
                # predictions do not agree
                warn('The response and prediction arrays contain an unequal '
                     'number of values.')
                err_flg = True

            # If no predictions were made
            if np.all(np.isnan(predictions)):

                warn('The model did not make any predictions.')
                err_flg = True

        # Stop the evaluation if any of the checks above failed
        if err_flg:
            return

        n_observations = response.size
        n_no_predictions = np.sum(predictions.isnull())
        n_predictions = predictions.shape[0] - n_no_predictions

        # Calculate the fraction of test observations predicted
        fraction_predicted = n_predictions / n_observations

        # Count the numbers of true positives, true negatives, false
        # positives, and false negatives
        true_positives = np.sum((response == 1) & (predictions == 1))
        true_negatives = np.sum((response == 0) & (predictions == 0))
        false_positives = np.sum((response == 0) & (predictions == 1))
        false_negatives = np.sum((response == 1) & (predictions == 0))

        # Calculate conditions positive and negative
        condition_positive = true_positives + false_negatives
        condition_negative = true_negatives + false_positives

        # Calculate the fraction of positives within the response of those
        # observations with model predictions
        prevalence = (((response == 1) & (predictions == 1))
                      | ((response == 1) & (predictions == 0))).sum() \
            / n_predictions

        # Calculate predictives positive and negative
        predictive_positive = true_positives + false_positives
        predictive_negative = true_negatives + false_negatives

        # AUC calculation
        if response is not None and probabilities is not None:

            probabilities = probabilities.loc[~probabilities.isnull()]
            response = response.loc[probabilities.index.tolist()]

            # The AUC is undefined when only one class is present
            if all(response == 1) or all(response == 0):

                auc = np.nan

            else:

                auc = roc_auc_score(response, probabilities)

        else:

            auc = np.nan

        # If predictions were made
        if n_predictions > 0:

            # Calculate the accuracy
            accuracy = (true_positives + true_negatives) \
                / (true_positives + true_negatives
                   + false_positives + false_negatives)

        # No predictions were made
        else:

            # The accuracy is null
            accuracy = np.nan

        # If positive observations are present
        if condition_positive != 0:

            # Calculate the sensitivity (probability the model predicts
            # positive given a positive response)
            sensitivity = true_positives / condition_positive

        # No positive observations are present
        else:

            # The sensitivity is null
            sensitivity = np.nan

        # If negative observations are present
        if condition_negative != 0:

            # Calculate the specificity (probability the model predicts
            # negative given a negative response)
            specificity = true_negatives / condition_negative

        # No negative observations are present
        else:

            # The specificity is null
            specificity = np.nan

        # If positive predictions were made
        if predictive_positive != 0:

            # Calculate the positive predictive value (probability the
            # response is positive given the model predicts positive)
            positive_predictive_value = true_positives / predictive_positive

        # No positive predictions were made
        else:

            # The positive predictive value is null
            positive_predictive_value = np.nan

        # If negative predictions were made
        if predictive_negative != 0:

            # Calculate the negative predictive value (probability the
            # response is negative given the model predicts negative)
            negative_predictive_value = true_negatives / predictive_negative

        # No negative predictions were made
        else:

            # The negative predictive value is null
            negative_predictive_value = np.nan

        # Collect the description of the model performance
        model_evaluations = [n_observations, n_predictions,
                             fraction_predicted, prevalence,
                             true_positives, false_negatives,
                             false_positives, true_negatives,
                             sensitivity, specificity, accuracy,
                             positive_predictive_value,
                             negative_predictive_value, auc]

        # Convert the list of model evaluations to a pandas Series
        model_evaluation = pd.Series(model_evaluations,
                                     index=['nobs', 'npred', 'fpred',
                                            'prev', 'tps', 'fns',
                                            'fps', 'tns', 'sens',
                                            'spec', 'accu', 'ppv',
                                            'npv', 'auc'])

        # Return the model evaluation to the calling script
        if not phase:
            return model_evaluation

        elif phase == 'global':
            self.test_set.global_performance_summary = model_evaluation

        else:
            self.test_set.local_performance_summary = model_evaluation

class Preprocess:

    def __init__(self, option=True, missing_data_method='remove',
                 missing_data_indicator=np.nan, variance_threshold_value=0,
                 infrequent_threshold_value=3, transform_descriptors=None,
                 transform_function=None, transform_inverse=None,
                 standardize=True):

        self.option = option
        self.missing_data_method = missing_data_method
        self.missing_data_indicator = missing_data_indicator
        self.variance_threshold_value = variance_threshold_value
        self.infrequent_threshold_value = infrequent_threshold_value
        self.transform_descriptors = transform_descriptors
        self.transform_function = transform_function
        self.transform_inverse = transform_inverse
        self.standardize = standardize

class VariableSelection:

    def __init__(self, option=True, significance_level=0.05,
                 method='univariate', univariate_binary_method='fisher'):

        self.option = option
        self.significance_level = significance_level
        self.method = method
        self.univariate_binary_method = univariate_binary_method

class DimensionalReduction:

    def __init__(self, option=True, method='pls', n_reduced_dimensions=2,
                 pca_method='component', ratio_explained_variance=0.99,
                 perplexity=30, k_neighbors=15, min_dist=0.1,
                 metric='euclidean'):

        self.option = option
        self.method = method
        self.n_reduced_dimensions = n_reduced_dimensions
        self.pca_method = pca_method
        self.ratio_explained_variance = ratio_explained_variance
        self.perplexity = perplexity
        self.k_neighbors = k_neighbors
        self.min_dist = min_dist
        self.metric = metric

class DomainAssessment:

    def __init__(self, option=True, domain_k=5, domain_quantile=0.95):

        self.option = option
        self.domain_k = domain_k
        self.domain_quantile = domain_quantile

class LocalityAssessment:

    def __init__(self, option=True, method='radius', radius=0.5,
                 locality_k=5, locality_quantile=0.95):

        self.option = option
        self.method = method
        self.radius = radius
        self.locality_k = locality_k
        self.locality_quantile = locality_quantile

class Train:

    def __init__(self, option=True, technique='pls', n_neighbors=5,
                 weights='uniform', n_components=2):

        self.option = option
        self.technique = technique
        self.n_neighbors = n_neighbors
        self.weights = weights
        self.n_components = n_components

class Parameters:

    def __init__(self, name=None, phase='global', preprocess=None,
                 variable_selection=None, dimensional_reduction=None,
                 domain_assessment=None, locality_assessment=None,
                 train=None):

        self.name = name
        self.phase = phase
        self.preprocess = preprocess
        self.variable_selection = variable_selection
        self.dimensional_reduction = dimensional_reduction
        self.domain_assessment = domain_assessment
        self.locality_assessment = locality_assessment
        self.train = train

    @classmethod
    def from_file(cls, filepath=None, grid=False):

        ### Initialize storage objects and read the parameter input file ###

        parameter_sets, parameter_objects = [], []
        param_file = pd.read_csv(filepath, index_col=0)

        ### Read the parameters file and generate the grid ###
        if grid == True:

            paramsets = []

            df = pd.read_csv(filepath, index_col=0)
            param_order = df.columns.tolist()
            par = OrderedDict()
            par['name'], par['phase'] = None, None
            for param in param_order:
                dot_loc = param.find('.')
                param_name = param[:dot_loc]
                param_field = param[dot_loc + 1:]
                if param_name not in par.keys():
                    par[param_name] = OrderedDict()
                if param_field not in par[param_name].keys():
                    par[param_name][param_field] = None

            glo = df.iloc[0]
            glo.index = [g + '.global' for g in glo.index.tolist()]
            glo_dict = glo.to_dict()
            for k, v in glo_dict.items():
                glo_dict[k] = Parameters.cleaner(v)

            loc = df.iloc[1]
            loc.index = [l + '.local' for l in loc.index.tolist()]
            loc_dict = loc.to_dict()
            for k, v in loc_dict.items():
                loc_dict[k] = Parameters.cleaner(v)

            grid_dict = copy.deepcopy(glo_dict)
            for k, v in loc_dict.items():
                grid_dict[k] = v

            grid = ParameterGrid(grid_dict)

            for i in range(len(grid)):

                glob, loca = OrderedDict(), OrderedDict()

                glob['name'], glob['phase'] = 'Grid Point ' + str(i + 1), 'global'
                loca['name'], loca['phase'] = 'Grid Point ' + str(i + 1), 'local'

                for para in param_order:

                    if para + '.global' in grid[i].keys():
                        glob[para] = grid[i][para + '.global']

                    if para + '.local' in grid[i].keys():
                        loca[para] = grid[i][para + '.local']

                paramsets.append(glob)
                paramsets.append(loca)

            # Convert the grid point objects for Parameters object
            # construction
            for ps in paramsets:

                converted_params = copy.deepcopy(par)

                for ke, va in ps.items():
                    if '.' in ke:
                        dloc = ke.find('.')
                        pname = ke[:dloc]
                        pvalue = ke[dloc + 1:]
                    else:
                        pname = ke
                        pvalue = None

                    if not pvalue:
                        converted_params[pname] = va
                    else:
                        converted_params[pname][pvalue] = [va]

                parameter_sets.append(converted_params)

        ### Read a parameters file already containing a grid ###
        else:

            param_dict = OrderedDict()
            param_dict['name'] = None

            for param_classes in param_file.columns.tolist():

                if '.' not in param_classes:

                    param_class = copy.deepcopy(param_classes)
                    class_attribute = None

                else:

                    loc = param_classes.find('.')
                    param_class = param_classes[:loc]
                    class_attribute = param_classes[loc + 1:]

                if param_class not in param_dict.keys():
                    param_dict[param_class] = OrderedDict()

                if class_attribute and class_attribute \
                        not in param_dict[param_class]:
                    param_dict[param_class][class_attribute] = None

            for idx, vals in param_file.iterrows():

                gp_params = copy.deepcopy(param_dict)
                gp_params['name'] = idx

                for param_classes in param_file.columns.tolist():

                    if '.' not in param_classes:

                        param_class = copy.deepcopy(param_classes)
                        class_attribute = None

                    else:

                        loc = param_classes.find('.')
                        param_class = param_classes[:loc]
                        class_attribute = param_classes[loc + 1:]

                    if class_attribute is not None:

                        gp_params[param_class][class_attribute] = \
                            Parameters.cleaner(vals[param_classes])

                    else:

                        gp_params[param_class] = vals[param_classes]

                parameter_sets.append(gp_params)

        ### Convert the parameter grid points into Parameters objects ###

        for param_point in parameter_sets:

            if param_point['Train']['technique'] != ['pls'] \
                    or param_point['Train']['n_components'] \
                    <= param_point['DimensionalReduction']['n_reduced_dimensions']:

                depend_objs = []

                for class_name, class_dict in param_point.items():

                    if class_name == 'name' or class_name == 'phase':

                        depend_objs.append(class_dict)

                    elif class_name in globals():

                        class_obj = globals()[class_name](
                            *[v[0] for k, v in class_dict.items()])
                        depend_objs.append(class_obj)

                param_obj = Parameters(*depend_objs)
                parameter_objects.append(param_obj)

        return parameter_objects

    def to_series(self, writepath=None):

        s_dict = {}

        for attr, value in self.__dict__.items():

            if attr in ['name', 'phase']:

                s_dict[attr] = value

            try:

                if value.__class__.__name__ != 'str':

                    for subattr, subvalue in value.__dict__.items():

                        s_dict[value.__class__.__name__ + '.' + subattr] = subvalue

            except:

                pass

        s = pd.Series(s_dict)

        return s
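The grid expansion in from_file() above is built on scikit-learn's ParameterGrid, which enumerates the Cartesian product of every listed value. A minimal illustration with hypothetical keys:

from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({'Train.n_components.global': [2, 3],
                      'LocalityAssessment.radius.local': [0.5, 1.0]})

for point in grid:
    print(point)   # four dictionaries, one per combination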

    @staticmethod
    def cleaner(data=None):

        def clean(d=None):

            if d == 'None':
                d = None

            if d == 'TRUE' or d == 'True' or d == True:
                d = True
            elif d == 'FALSE' or d == 'False' or d == False:
                d = False
            elif d == 'np.nan':
                d = np.nan
            else:
                try:
                    d = float(d)
                except:
                    pass

            return d

        clean_data = []

        try:

            data = data.split(',')

            for datum in data:

                datum = clean(datum)
                clean_data.append(datum)

        except:

            data = clean(data)
            clean_data.append(data)

        return clean_data
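Two brief, hypothetical usage sketches for the classes above: cleaner() normalizes the string values read from a parameter file, and a Parameters object can also be assembled by hand from the configuration classes instead of from a file. Both assume only the class definitions in this appendix:

# cleaner() conversions (each call returns a list)
Parameters.cleaner('True')       # [True]
Parameters.cleaner('0.5')        # [0.5]
Parameters.cleaner('pls,lda')    # ['pls', 'lda']
Parameters.cleaner('None')       # [None]

# Hand-built parameter set equivalent to one grid point
params = Parameters(name='Example', phase='global',
                    preprocess=Preprocess(),
                    variable_selection=VariableSelection(),
                    dimensional_reduction=DimensionalReduction(method='pls'),
                    domain_assessment=DomainAssessment(),
                    locality_assessment=LocalityAssessment(method='radius',
                                                           radius=0.5),
                    train=Train(technique='pls', n_components=2))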

def premodeling(mol_sets_obj=None, parameters=None, qid=None):

    # Check that the local prediction has more than one training observation
    if parameters.phase == 'local':
        if mol_sets_obj.training_set.processed_descriptors.shape[0] <= 1:
            mol_sets_obj.test_set.local_predictions = np.nan
            return

    # Check if the local training set has a uniform response
    if parameters.phase == 'local':
        if not any(mol_sets_obj.training_set.processed_response.values.tolist()):
            mol_sets_obj.test_set.local_predictions = 0
            return
        if all(mol_sets_obj.training_set.processed_response.values.tolist()):
            mol_sets_obj.test_set.local_predictions = 1
            return

    ### Preprocessing ###
    if parameters.preprocess.option == True:
        mol_sets_obj.training_set.preprocess(
            missing_data_method=parameters.preprocess.missing_data_method,
            missing_data_indicator=parameters.preprocess.missing_data_indicator,
            variance_threshold_value=parameters.preprocess.variance_threshold_value,
            infrequent_threshold_value=parameters.preprocess.infrequent_threshold_value,
            transform_descriptors=parameters.preprocess.transform_descriptors,
            transform_function=parameters.preprocess.transform_function,
            transform_inverse=parameters.preprocess.transform_inverse,
            standardize=parameters.preprocess.standardize)

    ### Variable selection ###
    if parameters.variable_selection.option == True:
        mol_sets_obj.training_set.variable_selection(
            significance_level=parameters.variable_selection.significance_level,
            method=parameters.variable_selection.method,
            univariate_binary_method=parameters.variable_selection.univariate_binary_method)
    else:
        mol_sets_obj.training_set.significant_descriptors = pd.Series(
            mol_sets_obj.training_set.processed_descriptors.columns.tolist())

    # Check that the local training set still has descriptors after
    # variable selection
    if parameters.phase == 'local':
        if mol_sets_obj.training_set.processed_descriptors.empty:
            mol_sets_obj.test_set.local_predictions = np.nan
            return

    ### Dimensional reduction ###
    if parameters.dimensional_reduction.option == True:

        if parameters.phase == 'global' \
                and parameters.dimensional_reduction.method == 'tsne':

            mol_sets_obj.tsne_embedding(parameters)

        elif parameters.dimensional_reduction.n_reduced_dimensions > \
                min(mol_sets_obj.training_set.processed_descriptors.shape[0] - 1,
                    mol_sets_obj.training_set.processed_descriptors.shape[1]):

            if parameters.phase == 'local':
                mol_sets_obj.test_set.local_predictions = np.nan
                return
            else:
                warn('The global dimensional reduction could not be applied.')

        else:
            mol_sets_obj.training_set.dimensional_reduction(
                method=parameters.dimensional_reduction.method,
                n_reduced_dimensions=parameters.dimensional_reduction.n_reduced_dimensions,
                pca_method=parameters.dimensional_reduction.pca_method,
                ratio_explained_variance=parameters.dimensional_reduction.ratio_explained_variance)

# def workflow(moleculesets_object = None, global_parameters = None, local_parameters = None, verbose = False, path = None):

# if not isinstance(moleculesets_object, MoleculeSets): raise Exception('The training and test set objects must be ' \ 'MoleculeSet instances.')

# 270 if not isinstance(global_parameters, Parameters) and \ not isinstance(local_parameters, Parameters): raise Exception('Parameters must be Parameters instances.')

# Premodeling premodeling(moleculesets_object, global_parameters)

# Transfer if global_parameters.dimensional_reduction.method != 'tsne': moleculesets_object.training_to_test_transfer(global_parameters)

# Domain assessment if global_parameters.domain_assessment.option == True: moleculesets_object \ .domain_assessment(global_parameters.domain_assessment.domain_k, global_parameters.domain_assessment \ .domain_quantile)

# Training moleculesets_object.train(technique = global_parameters.train.technique, n_neighbors = global_parameters.train.n_neighbors, weights = global_parameters.train.weights, n_components = global_parameters.train.n_components)

# Prediction moleculesets_object.test_set.modeling_technique_object \ = copy.deepcopy(moleculesets_object.training_set.modeling_technique_object) moleculesets_object.predict()

# Evaluation #moleculesets_object.evaluate(phase = 'global')

# Locality assessment if global_parameters.locality_assessment.option == True: moleculesets_object \ .locality_assessment(method = global_parameters \ .locality_assessment.method, radius = global_parameters \ .locality_assessment.radius, locality_k = global_parameters \ .locality_assessment.locality_k, locality_quantile = global_parameters \ .locality_assessment.locality_quantile)

# localsets = moleculesets_object.local_molecule_sets

### Write the global information to file ###

# If writing information is desired if verbose == True and path is not None:

global_path = path + r'global_info/' descriptor_path = path + r'descriptor_sets/' 271 dim_reduct_path = global_path + r'reduction_info/' model_path = global_path + r'model_info/'

# Create a folder at the specified directory to store global information try: if not os.path.exists(global_path): os.makedirs(global_path) if not os.path.exists(descriptor_path): os.makedirs(descriptor_path) if not os.path.exists(dim_reduct_path): os.makedirs(dim_reduct_path) if not os.path.exists(model_path): os.makedirs(model_path) except: raise Exception('The directories could not be created.')

### Training ### if moleculesets_object.training_set.observations_missing_values is not None: if not moleculesets_object.training_set.observations_missing_values.empty: moleculesets_object.training_set.observations_missing_values. \ to_csv(global_path + r'global_training_set_observations_missing_values.csv') if moleculesets_object.training_set.zero_variance_descriptors is not None: if not moleculesets_object.training_set.zero_variance_descriptors.empty: moleculesets_object.training_set.zero_variance_descriptors. \ to_csv(global_path + r'global_training_set_zero_variance_descriptors.csv') if moleculesets_object.training_set.infrequent_binary_descriptors is not None: if not moleculesets_object.training_set.infrequent_binary_descriptors.empty: moleculesets_object.training_set.infrequent_binary_descriptors. \ to_csv(global_path + r'global_training_set_infrequent_binary_descriptors.csv') if moleculesets_object.training_set.significant_descriptors is not None: if not moleculesets_object.training_set.significant_descriptors.empty: moleculesets_object.training_set.significant_descriptors. \ to_csv(global_path + r'global_significant_descriptors.csv') if moleculesets_object.training_set.descriptor_statistics is not None: if not moleculesets_object.training_set.descriptor_statistics.empty: moleculesets_object.training_set.descriptor_statistics. \ to_csv(global_path + r'global_descriptor_statistics.csv') if moleculesets_object.training_set.significant_descriptors is not None: if not moleculesets_object.training_set.significant_descriptors.empty: moleculesets_object.training_set.significant_descriptors. \ to_csv(descriptor_path + r'global_significant_descriptors.csv') if moleculesets_object.training_set.processed_descriptors is not None: if not moleculesets_object.training_set.processed_descriptors.empty: moleculesets_object.training_set.processed_descriptors. \ to_csv(global_path + r'global_training_set_processed_descriptors.csv') if moleculesets_object.training_set.processed_response is not None: if not moleculesets_object.training_set.processed_response.empty: 272 moleculesets_object.training_set.processed_response. \ to_csv(global_path + r'global_training_set_processed_response.csv')

if moleculesets_object.training_set.dimensional_reduction_object is not None: if global_parameters.dimensional_reduction.method != 'tsne': W = moleculesets_object.training_set.dimensional_reduction_object.x_weights_ P = moleculesets_object.training_set.dimensional_reduction_object.x_loadings_ C = moleculesets_object.training_set.dimensional_reduction_object.y_weights_ if W.size != 0 and P.size != 0 and C.size != 0: try: S = np.matmul(P.transpose(), W) R = np.matmul(W, S) B = np.matmul(R, C.transpose()).flatten() B = pd.Series(B, index = moleculesets_object.training_set.significant_descriptors) R = pd.DataFrame(R, index = moleculesets_object.training_set.significant_descriptors, \ columns = moleculesets_object.training_set.processed_descriptors.columns) R.to_csv(dim_reduct_path + 'global_pls_rotations.csv') B.to_csv(dim_reduct_path + 'global_pls_descriptor_coefficients.csv') except: pass

# if moleculesets_object.training_set.dimensional_reduction_explained_variance is not None: # moleculesets_object.training_set.dimensional_reduction_explained_variance \ # .to_csv(dim_reduct_path + 'global_pls_explained_variance.csv')

if moleculesets_object.training_set.modeling_technique_object is not None: log_coeffs = pd.Series(moleculesets_object.training_set.modeling_technique_object.coef_.flatten(), \ index = moleculesets_object.training_set.processed_descriptors.columns.tolist()) log_intcpt = pd.Series(moleculesets_object.training_set.modeling_technique_object.intercept_.flatten(), \ index = ['Intercept']) if not log_coeffs.empty: log_coeffs.to_csv(model_path + r'global_model_coefficients.csv') if not log_intcpt.empty: log_intcpt.to_csv(model_path + r'global_model_intercept.csv')

### Test ### if moleculesets_object.test_set.observations_missing_values is not None: if not moleculesets_object.test_set.observations_missing_values.empty: moleculesets_object.test_set.observations_missing_values. \ to_csv(global_path + r'global_test_set_observations_missing_values.csv')

if moleculesets_object.test_set.zero_variance_descriptors is not None: if not moleculesets_object.test_set.zero_variance_descriptors.empty: moleculesets_object.test_set.zero_variance_descriptors. \ to_csv(global_path + r'global_test_set_zero_variance_descriptors.csv')

if moleculesets_object.test_set.infrequent_binary_descriptors is not None: if not moleculesets_object.test_set.infrequent_binary_descriptors.empty: moleculesets_object.test_set.infrequent_binary_descriptors. \ to_csv(global_path + r'global_test_set_infrequent_binary_descriptors.csv') if moleculesets_object.test_set.processed_descriptors is not None: if not moleculesets_object.test_set.processed_descriptors.empty: 273 moleculesets_object.test_set.processed_descriptors. \ to_csv(global_path + r'global_test_set_processed_descriptors.csv')

if moleculesets_object.test_set.modeling_technique_object is not None: glo_pred_probs = pd.Series(moleculesets_object.test_set.global_prediction_probabilities, \ index = moleculesets_object.test_set.processed_descriptors.index) if not glo_pred_probs.empty: glo_pred_probs.to_csv(global_path + r'global_prediction_probabilities.csv')

### Training & Test Set ### if moleculesets_object.outside_domain_test_identifiers is not None: outside_domain_test_identifiers = pd.Series(moleculesets_object.outside_domain_test_identifiers) if not outside_domain_test_identifiers.empty: outside_domain_test_identifiers.to_csv(global_path + r'global_outside_domain_test_identifiers.csv')

if moleculesets_object.outside_locality_test_identifiers is not None: outside_locality_test_identifiers = pd.Series(moleculesets_object.outside_locality_test_identifiers) if not outside_locality_test_identifiers.empty: outside_locality_test_identifiers.to_csv(global_path + r'global_outside_locality_test_identifiers.csv')

### End portion of code writing global information ###

### Begin Local Modeling ###

# for query_id, query_moleculesets_object in localsets.items():

# if not query_moleculesets_object.training_set \ .processed_descriptors.empty:

# premodeling(query_moleculesets_object, local_parameters, query_id)

# Do not continue on with modeling if a decision has already been made # during pre-modeling if query_moleculesets_object.test_set.local_predictions \ not in [0,1,np.nan] and query_moleculesets_object.training_set \ .processed_descriptors.shape[1] >= local_parameters.train \ .n_components:

# query_moleculesets_object.training_to_test_transfer(local_parameters)

# query_moleculesets_object.train(technique = local_parameters \ .train.technique, n_neighbors = local_parameters \ .train.n_neighbors, weights = local_parameters \ .train.weights, n_components = local_parameters \ 274 .train.n_components)

# query_moleculesets_object.test_set.modeling_technique_object \ = copy.deepcopy(query_moleculesets_object \ .training_set.modeling_technique_object)

# query_moleculesets_object.predict()

# moleculesets_object.test_set.local_predictions[query_id] \ = query_moleculesets_object.test_set.local_predictions

# moleculesets_object.test_set.local_prediction_probabilities[query_id] \ = query_moleculesets_object.test_set.local_prediction_probabilities

### Write the local information to file ###

# If writing information is desired if verbose == True and path is not None:

# descriptor_path = path + r'descriptor_sets/' qry_path = path + r'local_info/' + str(query_id) + r'/' qry_dim_reduct_path = qry_path + r'reduction_info/' qry_model_path = qry_path + r'model_info/'

# Create a folder at the specified directory to store global information try: if not os.path.exists(descriptor_path): os.makedirs(descriptor_path) if not os.path.exists(qry_path): os.makedirs(qry_path) if not os.path.exists(qry_dim_reduct_path): os.makedirs(qry_dim_reduct_path) if not os.path.exists(qry_model_path): os.makedirs(qry_model_path)

except: raise Exception('The directories could not be created.')

### Training ### if query_moleculesets_object.training_set.observations_missing_values is not None: if not query_moleculesets_object.training_set.observations_missing_values.empty: query_moleculesets_object.training_set.observations_missing_values. \ to_csv(qry_path + str(query_id) + '_training_set_observations_missing_values.csv')

if query_moleculesets_object.training_set.zero_variance_descriptors is not None: if not query_moleculesets_object.training_set.zero_variance_descriptors.empty: query_moleculesets_object.training_set.zero_variance_descriptors. \ to_csv(qry_path + str(query_id) + '_training_set_zero_variance_descriptors.csv')

275 if query_moleculesets_object.training_set.infrequent_binary_descriptors is not None: if not query_moleculesets_object.training_set.infrequent_binary_descriptors.empty: query_moleculesets_object.training_set.infrequent_binary_descriptors. \ to_csv(qry_path + str(query_id) + '_training_set_infrequent_binary_descriptors.csv')

if query_moleculesets_object.training_set.significant_descriptors is not None: if not query_moleculesets_object.training_set.significant_descriptors.empty: query_moleculesets_object.training_set.significant_descriptors. \ to_csv(qry_path + str(query_id) + '_significant_descriptors.csv')

if query_moleculesets_object.training_set.descriptor_statistics is not None: if not query_moleculesets_object.training_set.descriptor_statistics.empty: query_moleculesets_object.training_set.descriptor_statistics. \ to_csv(qry_path + str(query_id) + '_descriptor_statistics.csv')

if query_moleculesets_object.training_set.processed_descriptors is not None: if not query_moleculesets_object.training_set.processed_descriptors.empty: query_moleculesets_object.training_set.processed_descriptors. \ to_csv(qry_path + str(query_id) + '_training_set_processed_descriptors.csv')

if query_moleculesets_object.training_set.processed_response is not None: if not query_moleculesets_object.training_set.processed_response.empty: query_moleculesets_object.training_set.processed_response. \ to_csv(qry_path + str(query_id) + '_training_set_processed_response.csv')

if query_moleculesets_object.training_set.significant_descriptors is not None: if not query_moleculesets_object.training_set.significant_descriptors.empty: query_moleculesets_object.training_set.significant_descriptors. \ to_csv(descriptor_path + str(query_id) + '_significant_descriptors.csv')

if query_moleculesets_object.training_set.dimensional_reduction_object is not None: if global_parameters.dimensional_reduction.method != 'tsne': W = query_moleculesets_object.training_set.dimensional_reduction_object.x_weights_ P = query_moleculesets_object.training_set.dimensional_reduction_object.x_loadings_ C = query_moleculesets_object.training_set.dimensional_reduction_object.y_weights_ if W.size != 0 and P.size != 0 and C.size != 0: try: S = np.matmul(P.transpose(), W) R = np.matmul(W, S) B = np.matmul(R, C.transpose()).flatten() B = pd.Series(B, index = query_moleculesets_object.training_set.significant_descriptors) R = pd.DataFrame(R, index = query_moleculesets_object.training_set.significant_descriptors, \ columns = query_moleculesets_object.training_set.processed_descriptors.columns) R.to_csv(qry_dim_reduct_path + str(query_id) + '_pls_rotations.csv') B.to_csv(qry_dim_reduct_path + str(query_id) + '_pls_descriptor_coefficients.csv') except: pass # if query_moleculesets_object.training_set.dimensional_reduction_explained_variance is not None: # query_moleculesets_object.training_set.dimensional_reduction_explained_variance \ # .to_csv(qry_dim_reduct_path + str(query_id) + '_pls_explained_variance.csv')

if query_moleculesets_object.training_set.modeling_technique_object is not None: 276 log_coeffs = pd.Series(query_moleculesets_object.training_set.modeling_technique_object.coef_.flatten(), \ index = query_moleculesets_object.training_set.processed_descriptors.columns.tolist()) log_intcpt = pd.Series(query_moleculesets_object.training_set.modeling_technique_object.intercept_.flatten(), \ index = ['Intercept']) if not log_coeffs.empty: log_coeffs.to_csv(qry_model_path + str(query_id) + r'_model_coefficients.csv') if not log_intcpt.empty: log_intcpt.to_csv(qry_model_path + str(query_id) + r'_model_intercept.csv')

### Test ### if query_moleculesets_object.test_set.observations_missing_values is not None: if not query_moleculesets_object.test_set.observations_missing_values.empty: query_moleculesets_object.test_set.observations_missing_values. \ to_csv(qry_path + '\global_test_set_observations_missing_values.csv')

if query_moleculesets_object.test_set.zero_variance_descriptors is not None: if not query_moleculesets_object.test_set.zero_variance_descriptors.empty: query_moleculesets_object.test_set.zero_variance_descriptors. \ to_csv(qry_path + '\global_test_set_zero_variance_descriptors.csv')

if query_moleculesets_object.test_set.infrequent_binary_descriptors is not None: if not query_moleculesets_object.test_set.infrequent_binary_descriptors.empty: query_moleculesets_object.test_set.infrequent_binary_descriptors. \ to_csv(qry_path + '\global_test_set_infrequent_binary_descriptors.csv')

if query_moleculesets_object.test_set.processed_descriptors is not None: if not query_moleculesets_object.test_set.processed_descriptors.empty: query_moleculesets_object.test_set.processed_descriptors. \ to_csv(qry_path + str(query_id) + '_test_set_processed_descriptors.csv')

if query_moleculesets_object.test_set.modeling_technique_object is not None: query_pred_probs = pd.Series(query_moleculesets_object.test_set.local_prediction_probabilities, \ index = [str(query_id)]) if not query_pred_probs.empty: query_pred_probs.to_csv(qry_model_path + str(query_id) + r'_prediction_probabilities.csv')

if query_moleculesets_object.test_set.local_predictions is not None: query_response = pd.Series(query_moleculesets_object.test_set.local_predictions, index = ['Activity']) if not query_response.empty: query_response.to_csv(qry_path + str(query_id) + '_locally_predicted_activity.csv')

if moleculesets_object.test_set.global_predictions[str(query_id)] is not None:
    query_response = pd.Series(moleculesets_object.test_set.global_predictions[str(query_id)], index = ['Activity'])
    if not query_response.empty:
        query_response.to_csv(qry_path + str(query_id) + '_globally_predicted_activity.csv')

# If writing information is desired
if verbose and path is not None:

    local_path = path + r'local_info/'
    if not os.path.exists(local_path):
        os.makedirs(local_path)

    if moleculesets_object.local_information is not None:
        for k, df in moleculesets_object.local_information.items():
            if not df.empty:
                qry_path = path + r'local_info/' + str(k) + r'/'
                if os.path.exists(qry_path):
                    df.to_csv(qry_path + str(k) + '_neighborhood.csv')

    if moleculesets_object.test_set.local_prediction_probabilities is not None:
        if not moleculesets_object.test_set.local_prediction_probabilities.empty:
            moleculesets_object.test_set.local_prediction_probabilities \
                .to_csv(local_path + 'local_prediction_probabilities.csv')

### End portion of code writing local information to file ###
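# A hypothetical post-processing sketch (not part of the original program): the
# per-query CSVs written above can be gathered back into a single table with
# pandas. The 'local_info/<query id>/' directory layout is assumed from the
# writing code above; commented out so module behavior is unchanged.
#
# import glob, os
# import pandas as pd
# preds = {}
# for f in glob.glob(os.path.join(local_path, '*', '*_locally_predicted_activity.csv')):
#     query_id = os.path.basename(f).split('_')[0]
#     preds[query_id] = pd.read_csv(f, index_col = 0).squeeze()
# local_predictions_table = pd.Series(preds)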

# moleculesets_object.evaluate(phase = 'local')

def global_workflow(moleculesets_object = None, global_parameters = None):
    """Fit and evaluate a single global model on a MoleculeSets object."""

    if not isinstance(moleculesets_object, MoleculeSets):
        raise Exception('moleculesets_object must be a MoleculeSets instance.')

    if not isinstance(global_parameters, Parameters):
        raise Exception('global_parameters must be a Parameters instance.')

    # Premodeling
    premodeling(moleculesets_object, global_parameters)

    # Transfer
    if global_parameters.dimensional_reduction.method != 'tsne':
        moleculesets_object.training_to_test_transfer(global_parameters)

    # Domain assessment
    if global_parameters.domain_assessment.option:
        moleculesets_object \
            .domain_assessment(global_parameters.domain_assessment.domain_k,
                               global_parameters.domain_assessment.domain_quantile)

    # Training
    moleculesets_object.train(technique = global_parameters.train.technique,
                              n_neighbors = global_parameters.train.n_neighbors,
                              weights = global_parameters.train.weights,
                              n_components = global_parameters.train.n_components)

    # Prediction: copy the fitted global model onto the test set, then predict
    moleculesets_object.test_set.modeling_technique_object \
        = copy.deepcopy(moleculesets_object.training_set.modeling_technique_object)
    moleculesets_object.predict()

    # Evaluation
    moleculesets_object.evaluate(phase = 'global')
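# A minimal usage sketch (added for illustration; 'some_split_moleculesets' is a
# hypothetical MoleculeSets instance, not a name from this module). Commented
# out so the script's behavior is unchanged.
#
# gparam = Parameters.from_file(filepath = r'C:/Users/Hobo/Desktop/parameter_grid.csv',
#                               grid = True)[0]
# global_workflow(moleculesets_object = some_split_moleculesets,
#                 global_parameters = gparam)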

### Operating Program Hansen Dataset Global/Local Cross-Validation ###
if __name__ == '__main__':

    param_grid = Parameters.from_file(
        filepath = r'C:/Users/Hobo/Desktop/parameter_grid.csv', grid = True)

    hansen_moleset = MoleculeSet.from_file(
        csv_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\hansen_ames_data.csv',
        sdf_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\AMES_Darshan_Modified.sdf',
        var_type_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\hansen_ames_types.csv',
        set_label_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\hansen_set_labels.csv')

    hansen_moleset.partition(
        method = 'file',
        file_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\hansen_partition_set.csv')

    hansen_split_moleculesets = hansen_moleset.to_moleculesets(training_partition = True)

    result_types = ['global', 'local', 'global-on-local']
    n_grid_points = ['grid point ' + str(n+1) for n in range(int(len(param_grid)/2))]
    grid_point_col = [p for p in n_grid_points for r in result_types]
    result_type_col = [r for p in n_grid_points for r in result_types]
    grid_point_index = [i + ' ' + j for i, j in zip(grid_point_col, result_type_col)]
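    # For illustration (an added note, not in the original run): with two grid
    # points the paired comprehensions above yield
    # ['grid point 1 global', 'grid point 1 local', 'grid point 1 global-on-local',
    #  'grid point 2 global', 'grid point 2 local', 'grid point 2 global-on-local'];
    # a standalone check, commented out:
    #
    # pts = ['grid point 1', 'grid point 2']
    # idx = [p + ' ' + r for p in pts for r in ['global', 'local', 'global-on-local']]
    # print(idx)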

    performance_summary = pd.DataFrame(
        index = grid_point_index,
        columns = param_grid[0].to_series().index.tolist()[2:]
                  + ['nobs', 'npred', 'fpred', 'prev', 'tps', 'fns', 'fps',
                     'tns', 'sens', 'spec', 'accu', 'ppv', 'npv', 'auc'])
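    # For reference (added commentary; these are the standard binary-classification
    # definitions, not formulas stated in this listing), the summary columns
    # relate as:
    #
    # sens = tps / (tps + fns)
    # spec = tns / (tns + fps)
    # accu = (tps + tns) / (tps + fns + fps + tns)
    # ppv  = tps / (tps + fps)
    # npv  = tns / (tns + fns)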

    pred_types = ['global', 'local']
    pred_grid_point_col = [p for p in n_grid_points for r in pred_types]
    pred_type_col = [r for p in n_grid_points for r in pred_types]
    pred_grid_index = [i + ' ' + j for i, j in zip(pred_grid_point_col, pred_type_col)]

    grid_predictions = pd.DataFrame(index = pred_grid_index)
    for mol in hansen_moleset.raw_response.index.tolist():
        grid_predictions[mol] = np.nan

    for n in range(int(len(param_grid)/2)):

        oriset = copy.deepcopy(hansen_moleset)
        molsets = copy.deepcopy(hansen_split_moleculesets)

        print('')
        print('### Grid Point ' + str(n+1) + ' ###')
        print('')

        gparam, lparam = param_grid[2*n], param_grid[2*n+1]

        for molesetobj in molsets:

            workflow(moleculesets_object = molesetobj,
                     global_parameters = gparam,
                     local_parameters = lparam,
                     verbose = True,
                     path = r'C:/Users/Hobo/Desktop/Hansen set/Data/')

        oriset.combine_predictions(molsets)

        oriset.evaluate(phase = 'global')
        oriset.evaluate(phase = 'local')
        oriset.global_on_local_evaluation()

        grid_predictions.loc['grid point ' + str(n+1) + ' global',
                             oriset.global_predictions.index] = oriset.global_predictions
        grid_predictions.loc['grid point ' + str(n+1) + ' local',
                             oriset.global_predictions.index] = oriset.local_predictions

        performance_summary.loc['grid point ' + str(n+1) + ' global'] \
            = gparam.to_series().tolist()[2:] \
            + oriset.global_performance_summary.tolist()

        performance_summary.loc['grid point ' + str(n+1) + ' local'] \
            = lparam.to_series().tolist()[2:] \
            + oriset.local_performance_summary.tolist()

        performance_summary.loc['grid point ' + str(n+1) + ' global-on-local'] \
            = lparam.to_series().tolist()[2:] \
            + oriset.global_on_local_performance_summary.tolist()

    grid_predictions.to_csv(r'C:\Users\Hobo\Desktop\Hansen set\Data\predictions.csv')
    performance_summary.to_csv(r'C:\Users\Hobo\Desktop\Hansen set\Data\summary.csv')
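    # A hypothetical follow-up sketch for inspecting the saved summary (added
    # for illustration, not part of the original program); commented out:
    #
    # summary = pd.read_csv(r'C:\Users\Hobo\Desktop\Hansen set\Data\summary.csv', index_col = 0)
    # local_rows = summary.index.str.endswith(' local')
    # global_rows = summary.index.str.endswith(' global')
    # print(summary.loc[local_rows, 'accu'].values - summary.loc[global_rows, 'accu'].values)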

### Operating Program Hansen Dataset Global Only ###
#if __name__ == '__main__':
#
#    param_grid = Parameters.from_file(
#        filepath = r'C:/Users/Hobo/Desktop/parameter_grid.csv', grid = True)
#
#    hansen_moleset = MoleculeSet.from_file(
#        csv_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\hansen_ames_data.csv',
#        sdf_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\AMES_Darshan_Modified.sdf',
#        var_type_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\hansen_ames_types.csv',
#        set_label_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\hansen_set_labels.csv')
#
#    hansen_moleset.partition(
#        method = 'file',
#        file_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\hansen_partition_set.csv')
#
#    hansen_split_moleculesets = hansen_moleset.to_moleculesets(training_partition = True)
#
#    master_index = ['grid point ' + str(n+1) + ' global' for n in range(int(len(param_grid)/2))]
#
#    performance_summary = pd.DataFrame(
#        index = master_index,
#        columns = param_grid[0].to_series().index.tolist()[2:]
#                  + ['nobs', 'npred', 'fpred', 'prev', 'tps', 'fns', 'fps',
#                     'tns', 'sens', 'spec', 'accu', 'ppv', 'npv'])
#
#    grid_predictions = pd.DataFrame(index = master_index)
#    for mol in hansen_moleset.raw_response.index.tolist():
#        grid_predictions[mol] = np.nan
#
#    for n in range(int(len(param_grid)/2)):
#
#        oriset = copy.deepcopy(hansen_moleset)
#        molsets = copy.deepcopy(hansen_split_moleculesets)
#
#        print('')
#        print('### Grid Point ' + str(n+1) + ' ###')
#        print('')
#
#        gparam, lparam = param_grid[2*n], param_grid[2*n+1]
#
#        for molesetobj in molsets:
#
#            global_workflow(moleculesets_object = molesetobj,
#                            global_parameters = gparam)
#
#        oriset.combine_predictions(molsets)
#        oriset.evaluate(phase = 'global')
#
#        grid_predictions.loc['grid point ' + str(n+1) + ' global',
#                             oriset.global_predictions.index] = oriset.global_predictions
#
#        performance_summary.loc['grid point ' + str(n+1) + ' global'] \
#            = gparam.to_series().tolist()[2:] \
#            + oriset.global_performance_summary.tolist()
#
#    grid_predictions.to_csv(r'C:\Users\Hobo\Desktop\Hansen set\Data\predictions.csv')
#    performance_summary.to_csv(r'C:\Users\Hobo\Desktop\Hansen set\Data\summary.csv')

### Operating Program Muehlbacher Dataset ###
#if __name__ == '__main__':
#
#    param_grid = Parameters.from_file(
#        filepath = r'C:/Users/Hobo/Desktop/parameter_grid.csv', grid = True)
#
#    muehlbacher_moleset = MoleculeSet.from_file(
#        csv_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_data.csv',
#        sdf_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_molecules.sdf',
#        var_type_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_types.csv',
#        set_label_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_set_labels.csv')
#
#    muehlbacher_moleset.partition(
#        method = 'file',
#        file_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_partition_set.csv')
#
#    muehlbacher_split_moleculesets = muehlbacher_moleset.to_moleculesets(training_partition = False)
#
#    result_types = ['global', 'local', 'global-on-local']
#    n_grid_points = ['grid point ' + str(n+1) for n in range(int(len(param_grid)/2))]
#    grid_point_col = [p for p in n_grid_points for r in result_types]
#    result_type_col = [r for p in n_grid_points for r in result_types]
#    grid_point_index = [i + ' ' + j for i, j in zip(grid_point_col, result_type_col)]
#
#    performance_summary = pd.DataFrame(
#        index = grid_point_index,
#        columns = param_grid[0].to_series().index.tolist()[2:]
#                  + ['nobs', 'npred', 'fpred', 'prev', 'tps', 'fns', 'fps',
#                     'tns', 'sens', 'spec', 'accu', 'ppv', 'npv'])
#
#    pred_types = ['global', 'local']
#    pred_grid_point_col = [p for p in n_grid_points for r in pred_types]
#    pred_type_col = [r for p in n_grid_points for r in pred_types]
#    pred_grid_index = [i + ' ' + j for i, j in zip(pred_grid_point_col, pred_type_col)]
#
#    grid_predictions = pd.DataFrame(index = pred_grid_index)
#    for mol in muehlbacher_moleset.raw_response.index.tolist():
#        grid_predictions[mol] = np.nan
#
#    for n in range(int(len(param_grid)/2)):
#
#        oriset = copy.deepcopy(muehlbacher_moleset)
#        molsets = copy.deepcopy(muehlbacher_split_moleculesets)
#
#        print('')
#        print('### Grid Point ' + str(n+1) + ' ###')
#        print('')
#
#        gparam, lparam = param_grid[2*n], param_grid[2*n+1]
#
#        for molesetobj in molsets:
#
#            workflow(moleculesets_object = molesetobj,
#                     global_parameters = gparam,
#                     local_parameters = lparam)
#
#        oriset.combine_predictions(molsets)
#
#        oriset.evaluate(phase = 'global')
#        oriset.evaluate(phase = 'local')
#        oriset.global_on_local_evaluation()
#
#        grid_predictions.loc['grid point ' + str(n+1) + ' global',
#                             oriset.global_predictions.index] = oriset.global_predictions
#        grid_predictions.loc['grid point ' + str(n+1) + ' local',
#                             oriset.global_predictions.index] = oriset.local_predictions
#
#        performance_summary.loc['grid point ' + str(n+1) + ' global'] \
#            = gparam.to_series().tolist()[2:] \
#            + oriset.global_performance_summary.tolist()
#
#        performance_summary.loc['grid point ' + str(n+1) + ' local'] \
#            = lparam.to_series().tolist()[2:] \
#            + oriset.local_performance_summary.tolist()
#
#        performance_summary.loc['grid point ' + str(n+1) + ' global-on-local'] \
#            = lparam.to_series().tolist()[2:] \
#            + oriset.global_on_local_performance_summary.tolist()
#
#    grid_predictions.to_csv(r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\predictions.csv')
#    performance_summary.to_csv(r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\summary.csv')

### Operating Program Muehlbacher Dataset Global Only ###
#if __name__ == '__main__':
#
#    param_grid = Parameters.from_file(
#        filepath = r'C:/Users/Hobo/Desktop/parameter_grid.csv', grid = True)
#
#    muehlbacher_moleset = MoleculeSet.from_file(
#        csv_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_data.csv',
#        sdf_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_molecules.sdf',
#        var_type_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_types.csv',
#        set_label_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_set_labels.csv')
#
#    muehlbacher_moleset.partition(
#        method = 'file',
#        file_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_partition_set.csv')
#
#    muehlbacher_split_moleculesets = muehlbacher_moleset.to_moleculesets(training_partition = True)
#
#    master_index = ['grid point ' + str(n+1) + ' global' for n in range(int(len(param_grid)/2))]
#
#    performance_summary = pd.DataFrame(
#        index = master_index,
#        columns = param_grid[0].to_series().index.tolist()[2:]
#                  + ['nobs', 'npred', 'fpred', 'prev', 'tps', 'fns', 'fps',
#                     'tns', 'sens', 'spec', 'accu', 'ppv', 'npv'])
#
#    grid_predictions = pd.DataFrame(index = master_index)
#    for mol in muehlbacher_moleset.raw_response.index.tolist():
#        grid_predictions[mol] = np.nan
#
#    for n in range(int(len(param_grid)/2)):
#
#        oriset = copy.deepcopy(muehlbacher_moleset)
#        molsets = copy.deepcopy(muehlbacher_split_moleculesets)
#
#        print('')
#        print('### Grid Point ' + str(n+1) + ' ###')
#        print('')
#
#        gparam, lparam = param_grid[2*n], param_grid[2*n+1]
#
#        for molesetobj in molsets:
#
#            global_workflow(moleculesets_object = molesetobj,
#                            global_parameters = gparam)
#
#        oriset.combine_predictions(molsets)
#        oriset.evaluate(phase = 'global')
#
#        grid_predictions.loc['grid point ' + str(n+1) + ' global',
#                             oriset.global_predictions.index] = oriset.global_predictions
#
#        performance_summary.loc['grid point ' + str(n+1) + ' global'] \
#            = gparam.to_series().tolist()[2:] \
#            + oriset.global_performance_summary.tolist()
#
#    grid_predictions.to_csv(r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\predictions.csv')
#    performance_summary.to_csv(r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\summary.csv')
