
Locality-Dependent Training and Descriptor Sets for QSAR Modeling

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Bryan Christopher Hobocienski

Graduate Program in Chemical Engineering

The Ohio State University

2020

Dissertation Committee

Dr. James Rathman, Advisor

Dr. Bhavik Bakshi

Dr. Jeffrey Chalmers

Copyrighted by

Bryan Christopher Hobocienski

2020

Abstract

Quantitative Structure-Activity Relationships (QSARs) are empirical or semi-empirical models which correlate the structure of chemical compounds with their biological activities. QSAR analysis frequently finds application in drug development and environmental and human health protection. It is here that these models are employed to predict pharmacological endpoints for candidate drugs or to assess the toxicological potential of chemical ingredients found in commercial products, respectively.

Fields such as drug design and health regulation share the necessity of managing a plethora of chemicals for which sufficient experimental data on their application-relevant profiles is often lacking; the time and resources required to conduct the necessary in vitro and in vivo tests to properly characterize these compounds make a purely experimental approach impossible. QSAR analysis successfully alleviates the problems posed by these data gaps through interpretation of the wealth of information already contained in existing databases.

This research involves the development of a novel QSAR workflow utilizing a local modeling strategy. By far the most common QSAR models reported in the literature are “global” models; they use all available training molecules and a single set of chemical descriptors to learn the relationship between structure and the endpoint of interest. Additionally, accepted QSAR models frequently use linear transformations such as principal component analysis or partial least squares regression to reduce the dimensionality of complex chemical data sets. In contrast to these conventional approaches, the proposed methodology uses a locality-defining radius to identify a subset of training compounds in proximity to a test query and learns an individual model for that query.

Furthermore, descriptor selection is utilized to isolate the subset of available chemical descriptors tailored specifically to explain the activity of each test compound. Finally, this work adapts a non-linear dimensional reduction technique, t-Distributed Stochastic Neighbor Embedding (t-SNE), for the refinement of global descriptor spaces before local training sets are identified. The resulting ensemble of local models is used to generate predictions for the test set.

The proposed local QSAR workflow is evaluated using two data sets from the literature, one concerning Ames mutagenicity and the other blood-brain barrier permeability. Performance statistics are determined by a 5-fold cross-validation strategy.

Local model ensembles frequently outperform global models, especially for smaller to medium-sized local training sets. Illustrating this point, local model ensembles from the proposed methodology outperform global models by as much as 5% to 10% when the two approaches are compared at the same modeling-space dimension. A sizeable portion of this work concerns implementation of the t-SNE algorithm to resolve the problems associated with identifying training samples neighboring test compounds in high-dimensional spaces. t-SNE-based local model ensembles afford performance competitive with PLS-based local model ensembles; for instance, when the test set coverage is approximately 25%, the accuracy of t-SNE-based local model ensembles is 86.1% whereas that of the PLS-based local model ensembles is 81.8%. When coverage increases to 93%, predicting most of the test molecules, the accuracy of t-SNE-based local model ensembles is 73.8% versus 71.2% for PLS-based local model ensembles. Furthermore, the novel QSAR workflow offers performance comparable to literature-reported QSAR models. On the Ames mutagenicity data set, AUC values derived from the proposed methodology range from 0.79 to 0.81 whereas those from literature models range from 0.79 to 0.86. Likewise, when predicting blood-brain barrier permeability, Matthews correlation coefficients range between 0.321 and 0.645 using popular machine learning methods and between 0.478 and 0.565 from the proposed methodology.

Finally, the proposed local QSAR workflow offers several interpretability-based features. An open criticism of local modeling strategies, due to their fragmented nature, involves the difficulty in recognizing relationships present throughout the entire training set. This problem is addressed by demonstrating how the frequencies and associations of significant descriptors occurring across local models can be extracted. As a concrete example, this analytic approach successfully identifies acryl halides as a structural alert for positive Ames mutagenicity. Additionally, valuable information is provided on the local level such as the descriptor spaces and decision boundaries used for predicting individual query compounds.


Dedication

I dedicate this dissertation to my mother and father as without their love and support I would not be where I am today.


Acknowledgments

I would like to express my utmost gratitude to my advisor, Dr. James Rathman, as I very likely would not have completed or even pursued a doctoral degree without his mentorship.

Dr. Rathman is always available to offer a new perspective on ideas, to provide encouragement during setbacks, or to just lighten the mood with some witty humor. I would also like to thank Dr. Chihae Yang for introducing me to the broader field of cheminformatics and for offering constructive feedback during the earlier years of my graduate studies. I am also thankful to my fellow students, João Ribeiro, Nicholas Wood, and Darshan Mehta, who were at different times a part of our research group. Our shared experience only served to strengthen our individual research endeavors and lessened the overall amount of stress any one of us would have otherwise faced alone.

Finally, I am very thankful to my parents for their unwavering emotional and financial support throughout my many college years. Their encouragement, especially during the most difficult times, made completing this dissertation possible.


Vita

December 2011………………………………Bachelor of Science in Chemical Engineering, The Ohio State University

August 2014………………………………….Master of Science in Chemical Engineering, The Ohio State University

January 2015 - December 2019……………...Graduate Teaching Associate, Department of Chemical Engineering, The Ohio State University

Fields of Study

Major Field: Chemical Engineering


Table of Contents

Abstract ...... ii

Dedication ...... v

Acknowledgments ...... vi

Vita ...... vii

List of Tables...... xii

List of Figures ...... xv

Chapter 1. Introduction ...... 1

1.1 Chemoinformatics ...... 3

1.2 QSAR Modeling ...... 11

1.3 Applications of QSAR Modeling ...... 21

Chapter 2: Background ...... 30

2.1 Global vs. Local QSAR Modeling ...... 30

2.2 Descriptor Selection ...... 37

2.3 Local Descriptor Selection ...... 43

2.4 Dimensional Reduction ...... 45


2.4.1 t-Distributed Stochastic Neighbor Embedding ...... 53

2.5 Learning Algorithms ...... 61

2.6 Research Objectives ...... 69

Chapter 3: Methodology ...... 72

3.1 Proposed Methodology ...... 72

3.1.1 Pre-processing Phase ...... 76

3.1.2 Global Phase ...... 76

3.1.3 Local Phase ...... 79

3.2 Model Optimization ...... 81

3.3 Evaluation Data Sets ...... 83

3.3.1 Ames mutagenicity - Hansen et al., 2009 ...... 83

3.3.2 Blood-Brain Barrier - Muehlbacher et al., 2001 ...... 87

3.4 Computational Tools and Molecular Descriptors ...... 91

Chapter 4: Results and Discussion ...... 95

4.1 In-Depth Analysis of the Ames mutagenicity Data Set ...... 96

4.1.1 The Effect of Radius on the Fraction of Predicted Compounds ...... 96

4.1.2 The Effect of Global and Local Descriptor Space Dimension on Model Performance ...... 104

4.1.3 The Effect of Univariate Descriptor Selection on Model Performance ...... 116


4.1.4 The Effect of Dimensional Reduction Method on Model Performance ...... 119

4.1.5 Prediction Confidence and Smallest Radii Local Model Ensembles ...... 134

4.1.6 Case Study of Select Molecules from the Ames mutagenicity Data Set...... 145

4.2 Performance Comparison between the Proposed Methodology and State-of-the-Art Models from the Literature ...... 163

Chapter 5: Conclusions and Future Work ...... 169

5.1 Conclusions ...... 169

5.2 Future Work...... 174

Bibliography ...... 179

Appendix A: Sample Input and Output Files ...... 194

A.1 Parameter file (input) ...... 194

A.2 Data file (input) ...... 196

A.3 Data types file (input) ...... 197

A.4 Partition set file (input) ...... 199

A.5 Set labels file (input) ...... 200

A.6 Data structures file (input) ...... 201

A.7 Model performance summary file (output) ...... 202

A.8 Model predictions file (output) ...... 207

Appendix B: Chemical Descriptors ...... 208


Appendix C. Python Code ...... 212


List of Tables

Table 1: Bitstring representations of the Figure 3 molecules using five MACCS keys and pairwise Tanimoto similarities...... 9

Table 2: Example QSAR molecular descriptors by category...... 14

Table 3: Techniques used in a QSAR workflow with examples...... 16

Table 4: Generic confusion matrix for a binary classification problem...... 18

Table 5: The most frequent QSAR modeling methods in years 2009 and 2014...... 62

Table 6: Source distribution of molecules within the Ames mutagenicity data set...... 85

Table 7: Partitioning of the Ames mutagenicity data set for benchmarking...... 86

Table 8: Source distribution of molecules within the blood-brain barrier data set...... 90

Table 9: PLS, GS(ON), LS(ON), G(2), L(2) global and local performance at the maximum of the applicability domain...... 108

Table 10: Performance of PLS, GS(ON), LS(ON), G(8), L(2), local models predicting 60% of the test molecules...... 109

Table 11: Performance of PLS, G(8), L(2), local models predicting greater than 90% of the test set under various univariate selection configurations...... 118

Table 12: t-SNE, GS(ON), LS(ON), G(2), L(2) select global and local model ensemble performance statistics...... 126


Table 13: Comparison of select PLS- and t-SNE-derived local QSAR model ensembles...... 129

Table 14: t-SNE, GS(ON), LS(ON), G(2), L(2) ...... 132

Table 15: t-SNE, GS(ON), LS(ON), G(2), L(2), radius 0.1 ...... 133

Table 16: PLS, GS(ON), LS(ON), G(8), L(2) Ames mutagenicity test compound predictions vs. local model radii...... 135

Table 17: Number of predictions and Matthews correlation coefficients ...... 140

Table 18: Comparison of select local model ensembles with and without varying radii in 2 and 8 reduced local dimensions...... 145

Table 19: PLS, GS(ON), LS(ON), G(8), L(2), local model predictions of 1-(4-Chlorophenyl)-3,3-dimethyltriazene with increasing radius...... 149

Table 20: PLS, GS(ON), LS(ON), G(8), L(2), local model predictions of ...... 156

Table 21: Identifier, activity, distance, and select descriptor data for physostigmine and its five nearest neighbors from most to least similar...... 158

Table 22: Area under the receiver operating characteristics curve values for literature models and models from the proposed methodology on Ames mutagenicity...... 164

Table 23: Performance statistics on literature models and models from the proposed methodology ...... 166

Table 24: Comparison of local, global-on-local, and global overall models from the proposed methodology...... 168

Table 25: Sample parameter file (input)...... 195

Table 26: Sample data file (input)...... 197


Table 27: Numeric code for descriptor types...... 198

Table 28: Sample data types file (input)...... 198

Table 29: Sample partition set file (input)...... 200

Table 30: Sample labels file (input)...... 201

Table 31: Sample model performance summary file (output)...... 202

Table 32: Sample model predictions file (output)...... 207

Table 33: Descriptors calculated by CORINA Symphony to describe the Ames mutagenicity and blood-brain barrier molecular data sets...... 208


List of Figures

Figure 1: Representations of isonicotinic acid by a) skeletal formula, b) MDL Molfile format, and c) SMILES string...... 5

Figure 2: The Morgan algorithm applied to isonicotinic acid...... 7

Figure 3: An example query with three sample molecules for similarity comparison...... 9

Figure 4: QSAR development workflow...... 20

Figure 5: The drug discovery and development pipeline...... 22

Figure 6: QSAR applied to virtual screening in drug development...... 24

Figure 7: QSAR applied to skin sensitizers in consumer products...... 28

Figure 8: Hypothetical k-NN classifier example...... 33

Figure 9: The basic QSAR approach...... 38

Figure 10: Demonstration of overfitting...... 39

Figure 11: Illustration of the effects of high-dimensionality...... 47

Figure 12: Illustration of principal component analysis...... 48

Figure 13: Illustration of partial least squares regression compared to principal component analysis...... 51

Figure 14: Linear (PCA) and non-linear (LLE, IsoMap) dimensional reduction techniques applied to hypothetical S-shaped data...... 52

Figure 15: MNIST handwritten digits...... 58

Figure 16: Principal component analysis of the MNIST handwritten digits...... 59

Figure 17: t-Distributed stochastic neighbor embedding applied to the MNIST handwritten digits...... 60

Figure 18: Example logistic regression with two descriptor variables...... 65

Figure 19: General QSAR modeling workflow...... 74

Figure 20: Proposed local QSAR methodology...... 81

Figure 21: The Ames test for mutagenicity...... 84

Figure 22: Molecular weight distribution of compounds within the Ames mutagenicity data set...... 86

Figure 23: Illustration of the blood-brain barrier with annotated mechanisms of transport...... 88

Figure 24: Molecule weight (left) and log(BB) (right) distributions of the blood-brain barrier data set...... 90

Figure 25: A screenshot of the ChemoTyper program querying the Ames mutagenicity data set for the nitro chemotype (right). A molecule from the set (left) contains the nitro group and is highlighted accordingly...... 93

Figure 26: A screenshot of the ChemoTyper program querying the Ames mutagenicity data set for an aromatic ether chemotype (right). Several chemicals from the set (left) contain the chemotype and are highlighted...... 94

Figure 27: Model input parameter string...... 95

Figure 28: The effect of radius on coverage in a global descriptor space of low dimension...... 97

Figure 29: Distribution of local training set sizes for a small radius...... 98


Figure 30: Distribution of local training set sizes for a large radius...... 99

Figure 31: The effect of radius on coverage in a global descriptor space of high dimension...... 100

Figure 32: The effect of radius on coverage in a global descriptor space of high dimension without univariate descriptor selection...... 101

Figure 33: The effect of radius on coverage for a t-SNE embedded global descriptor space...... 102

Figure 34: A 2-dimensional, t-SNE embedded global descriptor space...... 103

Figure 35: PLS, GS(ON), LS(ON), G(2), L(2) sensitivity, specificity, and accuracy .... 106

Figure 36: Global and local PLS, GS(ON), LS(ON), G(8), L(2) model sensitivity, specificity, ...... 110

Figure 37: PLS, GS(ON), LS(ON), G(8) local model ensemble sensitivity, specificity, and ...... 113

Figure 38: PLS, G(8) sensitivity, specificity, and accuracy versus coverage for a collection of global ...... 115

Figure 39: PLS, G(8), G(2) sensitivity, specificity, and accuracy versus the coverage for a collection ...... 117

Figure 40: 2-dimensional, t-SNE embedded Ames mutagenicity ...... 120

Figure 41: 2-dimensional, PLS transformed Ames mutagenicity ...... 121

Figure 42: t-SNE, GS(ON), LS(ON), G(2), L(2) local model ensemble sensitivity, specificity, and ...... 123


Figure 43: t-SNE, GS(ON), LS(ON), G(2), L(2) sensitivity, specificity, and accuracy versus coverage ...... 125

Figure 44: Performance comparison of the PLS and t-SNE derived local model ensembles...... 128

Figure 45: PLS, GS(ON), LS(ON), G(8), L(2) local model ensemble sensitivity, specificity, and accuracy ...... 137

Figure 46: PLS, GS(ON), LS(ON), G(8), L(2) local model ensemble sensitivity, specificity, and accuracy ...... 139

Figure 47: PLS, GS(ON), LS(ON), G(8), L(2) local model ensemble sensitivity, specificity, and accuracy ...... 143

Figure 48: PLS, GS(ON), LS(ON), G(8), L(8) local model ensemble sensitivity, specificity, and accuracy ...... 144

Figure 49: Skeletal formula of 1-(4-Chlorophenyl)-3,3-dimethyltriazene, ...... 146

Figure 50: PLS, GS(ON), G(8) global logistic regression coefficients of 20 numeric descriptors with largest magnitude...... 148

Figure 51: PLS, GS(ON), G(8) global logistic regression coefficients of 20 chemotype descriptors with largest magnitude...... 148

Figure 52: Training molecules within 0.75 units to 1-(4-Chlorophenyl)-3,3- dimethyltriazene...... 150

Figure 53: The PLS transformed local descriptor space for the logistic regression model predicting 1-(4-Chlorophenyl)-3,3-dimethyltriazene...... 151


Figure 54: PLS, GS(ON), LS(ON), G(8), G(2) local logistic regression coefficients for the model predicting 1-(4-Chlorophenyl)-3,3-dimethyltriazene...... 152

Figure 55: Skeletal formula for physostigmine, ...... 153

Figure 56: PLS, GS(ON), G(8) global logistic regression coefficients of 20 largest magnitude numeric descriptors...... 154

Figure 57: PLS, GS(ON), G(8) global logistic regression coefficients of 20 largest magnitude chemotypes...... 154

Figure 58: The PLS transformed local descriptor space for the logistic regression model predicting physostigmine...... 157

Figure 59: The five nearest neighbors of physostigmine from most to least similar in the transformed local descriptor space...... 158

Figure 60: PLS, GS(ON), LS(ON), G(8), G(2) local logistic regression coefficients for the model predicting physostigmine...... 159

Figure 61: Frequency and association of significant descriptors among local training sets ...... 161

Figure 62: Frequency and association of significant chemotypes among local training sets ...... 161

Figure 63: CAS 14882-94-1, an Ames negative compound, with highlighted alkenyl (blue) and acyclic (orange) carboxylic ester chemotypes...... 162


Chapter 1. Introduction

The first use of computers toward understanding the effects of chemical structure on biological activity is often credited to Hansch and Fujita’s investigations into plant growth regulators and antibiotics in the 1950s and 1960s [1], [2]. Most consequential of their work was the proposed equation relating the extracellular concentration of test compound required to elicit a biological response, C, with the octanol-water partition coefficient, log P, and the Hammett constant, σ [2], [3]:

$$\log\left(\frac{1}{C}\right) = k\pi - k'\pi^{2} + \rho\sigma + k'' \qquad (1)$$

As Hansch and Fujita compared unsubstituted, parent compounds to series of substituted derivatives, π denotes the difference in log P between the parent and the derivative in question.

Likewise, the Hammett constant is proportional to the difference in the logarithm of the rate constant (or equilibrium constant) of the parent and derivative chemicals for a particular reaction. The variables k, k′, and k″ represent coefficients to be found by fitting to experimental data via multiple linear regression. In doing so, Hansch and Fujita had proposed the first quantitative structure-activity relationship, or QSAR, model [1], [2], [3].

As elaborated on by Martin, a number of decisions made by Hansch and Fujita laid the foundation for the beginnings of modern QSAR analysis [1], [2]:

1. Introducing log P as a descriptor of biological activity; it models the various aqueous and lipid environments small molecules must traverse to reach their cellular targets.

2. Recognizing the quadratic relationship between log P and biological potency, with a maximum value; molecular movement can be hindered by overly hydrophilic or lipophilic character.

3. Introducing the Hammett constant as an additional descriptor of a molecule’s electronic effects.

4. Using a computer instead of manual calculation to fit the regression equation, allowing larger data sets to be analyzed much more quickly.

Notable additions to this list include:

5. Utilizing familiar, relatively simple statistical methods (i.e. linear regression) to derive the mathematical relationships between chemical descriptors and response variables.

6. Reporting goodness-of-fit measures for models such as the coefficient of determination and the root-mean-squared error (a fitting sketch follows below).
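A minimal sketch of points 4-6 above: fitting Equation (1) by multiple linear regression and reporting a goodness-of-fit measure. The π, σ, and log(1/C) values are made-up illustrative numbers, not data from [2], and numpy is assumed purely for illustration.

```python
# Fit log(1/C) = k*pi - k'*pi^2 + rho*sigma + k'' to hypothetical data.
import numpy as np

pi = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])           # delta log P vs. parent
sigma = np.array([0.0, -0.17, 0.23, 0.45, 0.06, 0.78])  # Hammett constants
log_inv_C = np.array([2.1, 2.9, 3.4, 3.5, 3.2, 3.0])    # observed potencies

# Design matrix columns correspond to k, k', rho, and the intercept k''.
X = np.column_stack([pi, -pi**2, sigma, np.ones_like(pi)])
coeffs, *_ = np.linalg.lstsq(X, log_inv_C, rcond=None)
k, k_prime, rho, k_double_prime = coeffs

# Goodness of fit, as in point 6 of the list above.
predictions = X @ coeffs
ss_res = np.sum((log_inv_C - predictions) ** 2)
ss_tot = np.sum((log_inv_C - log_inv_C.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"k={k:.3f}, k'={k_prime:.3f}, rho={rho:.3f}, k''={k_double_prime:.3f}")
print(f"R^2 = {r_squared:.3f}")
```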

From these humble beginnings, QSAR analysis has continued to evolve over the last 60 years in conjunction with the larger field of chemoinformatics and computer technology to arrive at its present state.

Today, QSAR analysis is fully recognized in the academic community as an important field of study, and some of its many facets have become active areas of research themselves. Public and private databases containing millions of compounds are routinely sourced for investigations into structure-activity relationships [1], [4]. At the extreme, the Chemical Abstracts Service (CAS) REGISTRY database contains information on more than 151 million organic and inorganic compounds [5]. Such databases continue to expand in size due to the explosion of data generated from combinatorial chemical synthesis and high-throughput screening. Thousands of descriptors have been conceived to capture the 1D, 2D, 3D (i.e. conformation-dependent), and even 4D (i.e. time-dependent dynamics) aspects of chemical structure [1], [6], [7]. Statistical modeling techniques, for example, partial least squares regression, have been developed and employed in QSAR studies to handle frequently encountered problems such as when the number of descriptor variables exceeds the number of compounds within a data set. Machine learning methods, such as random forests, support vector machines, and deep neural networks to name a few, are being utilized with greater frequency to uncover complex, non-linear relationships between chemical structure and function [4], [8], [9]. These developments have come together to benefit society in multiple domains, with the most prominent contributions being made in medicinal science with the discovery and optimization of new pharmaceuticals [4], [10], [11], [12]. QSAR also enjoys a leading role in predicting the toxicities of chemicals used in consumer products and justifying regulatory rules enacted to protect the environment and public health [4], [13], [14], [15].

1.1 Chemoinformatics

Any discussion of QSAR would be remiss without mentioning chemoinformatics, of which the former is a subfield and without which QSAR as it is practiced now would not be possible. Johann Gasteiger and Thomas Engel, two scientists who have contributed significantly to the progression of this area of study, broadly define chemoinformatics as “the application of informatics methods to solve chemical problems” [4], [16], [17]. Since “informatics methods” is rather vague, a more elucidating definition also referenced by Engel describes chemoinformatics as “a generic term that encompasses the design, creation, organization, management, retrieval, analysis, dissemination, visualization, and use of chemical information” [16]. Chemoinformatics is a vast topic; therefore, only some of the problems addressed by the field are presented here to familiarize the reader with the subject.

The digital representation of molecules is critical to chemoinformatics since without such capability chemical databases would not exist and chemical information could not be readily exchanged. Molecules are represented computationally in many ways, just as they are traditionally represented to humans by names, molecular formulas, skeletal formulas (i.e. 2D graphs), and ball-and-stick models (i.e. 3D representations) [16], [17].

The most widely-used line notation for compounds is the Simplified Molecular Input Line Entry Specification (SMILES), developed in 1986 by David Weininger. The SMILES line notation uses keyboard characters to represent molecules, including branched substructures, rings, and stereochemistry. Second, since matrices are used to represent graphs in mathematics, the connection table is used to represent molecules within the Molfile format developed by MDL Information Systems [16], [17]. The connection table is divided into blocks containing the compound name, atomic coordinates, bonded atom connections, and their bond orders. Figure 1 shows representations of an example compound, isonicotinic acid, by skeletal formula, by annotated MDL Molfile format, and by SMILES string. Note the skeletal formula is numbered according to the atom and bond blocks in the Molfile.

Inspired from [18].

Figure 1: Representations of isonicotinic acid by a) skeletal formula, b) MDL Molfile format, and c) SMILES string.

Several other file formats, some extensions of the MDL Molfile format, exist for representing multiple molecules, reactions, 3D coordinates of large biological molecules, crystallographic data, and spectroscopic data [16].

Molecules must be represented uniquely within a database to avoid storing and retrieving duplicate entries. Resolving this problem poses a unique challenge in chemoinformatics since it is quite easy to represent the same compound via multiple names, graphs, SMILES strings, or MDL Molfiles. In theory, a molecule with N atoms can have up to N! possible valid connection tables. Illustrating further, valid SMILES strings for isonicotinic acid include OC(=O)c1ccncc1, O=C(O)c1ccncc1, and n1ccc(C(=O)O)cc1, among others depending on the starting atom and encoding path. A canonical representation of a molecule uses an algorithm to order the atoms of a chemical the same way each time it is applied. The Morgan algorithm, outlined below, is a widely-known method for attempting to represent molecules canonically [16], [17], [19]:

1. Each heavy atom takes a connectivity value equal to the number of neighboring heavy atoms with which it shares a bond.

2. The connectivity value of each heavy atom is updated to the sum of the values of its neighboring atoms from the previous iteration.

3. If the number of unique connectivity values reaches a maximum, proceed to step 4. Otherwise, return to step 2.

4. The atoms are labeled canonically from largest to smallest connectivity values.

An example of the Morgan algorithm applied to isonicotinic acid is shown in Figure 2. The algorithm stops at the third iteration since the fourth iteration generates the same number of unique connectivity values (i.e. n = 6). While the Morgan algorithm does substantially reduce the number of possible chemical representations, in some instances there remain multiple canonical numberings of equal validity.


Figure 2: The Morgan algorithm applied to isonicotinic acid.

As seen in the isonicotinic acid example shown in Figure 2, the algorithm produces three pairs of atoms with equivalent connectivity values, and these equivalencies are resolved arbitrarily. Since its inception, the Morgan algorithm has been extended and/or adapted to address these issues, such as the stereochemically-extended Morgan algorithm (SEMA) and CANGEN designed to produce canonical SMILES strings [17], [19].
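A minimal sketch of the connectivity-value iteration in steps 1-3 above, applied to the heavy-atom graph of isonicotinic acid. The adjacency list is hand-built for illustration rather than parsed from a Molfile; running it reproduces the behavior described for Figure 2 (the iteration stops when the count of unique values stalls at six, leaving three tied pairs of symmetric ring atoms).

```python
# Heavy-atom graph of isonicotinic acid, OC(=O)c1ccncc1 (indices arbitrary).
ADJACENCY = {
    0: [1],          # hydroxyl O
    1: [0, 2, 3],    # carboxyl C
    2: [1],          # carbonyl O
    3: [1, 4, 8],    # ring C bearing the carboxyl group
    4: [3, 5],       # ring C
    5: [4, 6],       # ring C
    6: [5, 7],       # ring N
    7: [6, 8],       # ring C
    8: [7, 3],       # ring C
}

def morgan_values(adjacency):
    """Iterate extended connectivity until the number of unique values stops growing."""
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}   # step 1
    n_unique = len(set(values.values()))
    while True:
        updated = {atom: sum(values[n] for n in nbrs)                # step 2
                   for atom, nbrs in adjacency.items()}
        new_unique = len(set(updated.values()))
        if new_unique <= n_unique:                                   # step 3
            return values
        values, n_unique = updated, new_unique

final = morgan_values(ADJACENCY)
# Step 4: label atoms from largest to smallest value; ties (the symmetric
# ring atoms here) must be broken by additional rules such as CANGEN's.
print(sorted(final, key=final.get, reverse=True))
```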

Once chemical information is stored in a database, there must be an efficient means to extract it for use in research projects. Databases can be searched by names, registration or identifier numbers, certain structural features, or similarity threshold values, to name a few. Substructure searching in particular is a routine task when querying chemical databases. Searching molecule-by-molecule through the entire database for the presence of the query fragment via graph theoretical techniques would be prohibitively time-consuming and inefficient due to the nature of the algorithms involved [17], [19]. To resolve this issue, database software utilizes screening mechanisms to substantially reduce the possible number of matches before proceeding to analyze each remaining compound [17], [19]. For example, the binary screening method is a commonly employed screening technique where each molecule within the database is represented as a bitstring; the presence or absence of a given substructure is indicated by a “1” or “0”, respectively. A pre-defined “dictionary” of fragments or “keys” dictates the construction of the bitstring for each molecule [17], [19]. The MACCS keys comprise a frequently used structural dictionary containing 166 different fragments [17], [20]. The illustration in Figure 3 shows an example query molecule and three additional molecules for which similarity comparisons are made. In Table 1, the molecules of Figure 3 are represented as bitstrings using five MACCS keys. A search of each molecule against the query can eliminate compounds 2 and 3 since they lack the required hydroxyl groups and atoms. However, molecule 1 would require further scrutiny to determine if the query is contained within its structure. Noting this application, dictionaries are composed of keys such that they occur frequently enough to be useful yet remain unique enough to maximize discrimination between sets of organic molecules [17], [19].
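A minimal sketch of that binary screen, using the Table 1 bit patterns (leftmost key taken as the most significant bit). A candidate can contain the query substructure only if every bit set in the query is also set in the candidate, so a cheap bitwise test prunes the database before any expensive graph matching; as in the discussion above, only Mol 1 survives.

```python
# Hypothetical 5-key bitstrings mirroring Table 1.
query_bits = 0b10001
database = {
    "Mol 1": 0b11011,
    "Mol 2": 0b10110,
    "Mol 3": 0b00100,
}

# Keep only molecules whose bits are a superset of the query's bits.
survivors = [name for name, bits in database.items()
             if bits & query_bits == query_bits]
print(survivors)  # ['Mol 1'] proceeds to full graph-based matching
```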


Figure 3: An example query with three sample molecules for similarity comparison.

Table 1: Bitstring representations of the Figure 3 molecules using five MACCS keys and pairwise Tanimoto similarities.

Molecule   Key 1   Key 2   Key 3   Key 4   Key 5   Tanimoto similarity to query
Query        1       0       0       0       1            1.00
Mol 1        1       1       0       1       1            0.50
Mol 2        1       0       1       1       0            0.25
Mol 3        0       0       1       0       0            0.00

Comparisons between molecules are also frequently made through similarity and dissimilarity (i.e. distance) calculations. A common similarity metric is the Tanimoto or Jaccard coefficient [17], [19]:

$$S_{A,B} = \frac{c}{a + b - c} \qquad (2)$$

where for two molecules A and B represented by bitstrings of structural fragments, a denotes the number of features present in A, b the number of features present in B, and c the number of features present in both A and B. Values of the Tanimoto coefficient fall on the range [0,1]; two molecules are “maximally” similar if their coefficient is “1” and maximally dissimilar if their coefficient is “0”. An alternative viewpoint to similarity is the notion of distance; items that are “similar” are “close” in terms of distance in a metric space and vice versa. Therefore, the Tanimoto or Jaccard distance is defined as:

$$D_{A,B} = 1 - S_{A,B} = 1 - \frac{c}{a + b - c} \qquad (3)$$

Other similarity/distance metrics have been defined and are regularly used, including the Dice, Cosine, Hamming, Euclidean, and Soergel measures, among others [17]. The similarity and/or distance between two molecules is highly dependent on the fragments used to represent the compounds as bitstrings. Demonstrating its use, the Tanimoto similarity between the query and example molecules of Figure 3 is included as the last column in Table 1. A similarity search can provide a set of molecules with a score greater than or equal to a specified threshold, ordered from highest to lowest, and may be combined with other search modalities to further refine the search.
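A minimal sketch of Equations (2) and (3) applied to the Table 1 bitstrings; the five keys per molecule are the same hypothetical MACCS-style keys used above, and the printed similarities reproduce the table's last column.

```python
def tanimoto_similarity(a_bits, b_bits):
    a = sum(a_bits)                                  # features present in A
    b = sum(b_bits)                                  # features present in B
    c = sum(x & y for x, y in zip(a_bits, b_bits))   # features present in both
    return c / (a + b - c)

query = [1, 0, 0, 0, 1]
molecules = {
    "Mol 1": [1, 1, 0, 1, 1],
    "Mol 2": [1, 0, 1, 1, 0],
    "Mol 3": [0, 0, 1, 0, 0],
}

for name, bits in molecules.items():
    s = tanimoto_similarity(query, bits)
    print(f"{name}: similarity={s:.2f}, distance={1 - s:.2f}")  # Eq. (2), (3)
```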

This concludes a very brief introduction to some of the problems encountered in chemoinformatics and the solutions the field has developed. While great strides have been made in this field, outstanding problems such as recognition of drug-like compounds and prediction of toxicity are, at least as of now, too complex to be solved deductively.

Therefore, chemoinformatics must transform chemical and biological data into knowledge using inductive learning. This transformation is completed via QSAR modeling and analysis.

1.2 QSAR Modeling

QSAR models are empirical or semi-empirical relationships which seek to exploit existing experimental data with the goal of explaining the yet unknown properties and/or biological activities of chemical compounds. In its simplest form, a QSAR model is a mathematical expression relating a certain endpoint of interest to the molecular structures of chemical compounds:

$$y = f(d_1, d_2, \ldots, d_p) \qquad (4)$$

where y denotes the response variable in question, $d_1$ through $d_p$ represent p molecular descriptors chosen to encode the structural information of molecules in numerical form, and f is a function which maps the descriptor space to the response space [21], [22]. In 2004, the Organization for Economic Co-operation and Development adopted five principles for QSAR models used for regulatory purposes. These principles stipulate a QSAR model to have [23]:

1. A well-defined endpoint.

2. An unambiguous algorithm.

3. A defined domain of applicability (i.e. a method to determine if a test query is being interpolated or extrapolated by the QSAR model).

4. Appropriate measures for goodness-of-fit, robustness, and predictivity.

5. A mechanistic interpretation, if possible.

These principles serve as an initial roadmap for constructing a QSAR model for professional use.

The most important ingredient for a good QSAR model is data of the highest possible quality. Typically, researchers developing models obtain data from experimentalists directly or, more commonly, from the literature. Models are built from individual data sets or those compiled from multiple sources. Expectedly, mistakes in data collection and reporting processes will affect the performance of any resulting models using such data. Furthermore, even if data is collected and reported accurately, there may exist differences in the form of chemical compounds used (e.g. isomers or lack of purity), experimental procedures followed, or assays utilized to assess the given endpoint [6]. All these aspects have the potential to introduce unexplainable variability in the data unless these variables are specifically accounted for within the models. As Tropsha discussed in his work on best practices in QSAR, a particularly troublesome source of error is misrepresentation of chemical structures within databases. Error rates in structures were found to range from 0.1% to 8% across public and commercial databases [22].

Furthermore, chemoinformatics software may occasionally function erroneously; for example, the generation of incorrect structures from correct digital encodings. Tropsha concluded that automatic and manual inspection should be instituted before modeling.

Automatic inspection, executed by chemoinformatics software, includes actions such as removal of inorganic compounds, salts, and mixtures; curation of tautomeric forms; neutralization of formal charges; and the deletion of duplicate structures. Manual inspection involves viewing at least a fraction of the dataset for errors potentially missed by software [22]. Lastly, it is not immediately evident in the literature how to systematically assess the quality of experimental data. Comparing measurements of activity across sources for consistency is one viable strategy; compounds with values exhibiting a small variance across laboratories can be averaged to a single value. On the other hand, molecules whose activity is grossly inconsistent between sources should be excluded [24]. Albeit anecdotal, the reliability of data sets may rely on established relationships between modelers and experimentalists.
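A minimal sketch of a few of the automatic-inspection steps just described: dropping unparseable records, de-duplicating by canonical structure, and averaging replicate activities. RDKit is assumed here purely for illustration (the dissertation does not prescribe a toolkit), and the records are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from rdkit import Chem

raw_records = [("OC(=O)c1ccncc1", 2.31),   # hypothetical (SMILES, activity)
               ("O=C(O)c1ccncc1", 2.27),   # same structure, second measurement
               ("not_a_smiles", 9.99)]     # malformed entry to be discarded

grouped = defaultdict(list)
for smiles, activity in raw_records:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                        # remove records that fail to parse
        continue
    canonical = Chem.MolToSmiles(mol)      # one representation per structure
    grouped[canonical].append(activity)

# Replicates with small variance are collapsed to their mean value.
curated = {smiles: mean(vals) for smiles, vals in grouped.items()}
print(curated)
```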

Implicit in Equation (4) is an axiom fundamental to QSAR analysis known as the similarity principle: chemical compounds with similar structures are observed to have similar properties [22], [25]. Dramatic or indistinct changes in f over small movements of the descriptor space are thought to indicate inadequacies of the selected descriptors and/or distance metric explaining the molecules for the endpoint in question. Stated differently, if two “similar” but non-identical compounds are found to truly differ in their measured responses, then a failure to distinguish between the compounds must result from either the representation of their structure, the definition of similarity among objects in the descriptor space, or a combination of the two. As a result, the descriptors selected to represent the compounds of a data set are a crucial aspect of QSAR modeling. The generation of novel chemical descriptors and algorithms to recognize relevant explanatory structures for a particular endpoint are fields of research unto themselves. The complete set of chemical descriptors available today numbers in the several thousands, if not more [1]. For example, the two-part work Molecular Descriptors for Chemoinformatics includes some 3,300 entries [26]. Generally, molecular descriptors fall into five categories: constitutional (i.e. chemical information devoid of atomic connectivity), topological (i.e. counts or indices calculated from the 2D arrangement of atoms and bonds within a molecule, typically represented as a graph), geometrical (i.e. derived from the 3D conformation of a molecule after optimization), electronic (i.e. quantum mechanical aspects of a molecule), and thermodynamic (i.e. founded in thermodynamic principles) [1], [6], [7], [8], [27].

Physiochemical descriptors obtained from experimental measurements (e.g. partition coefficients, pharmaco-kinetic or -dynamic measures, etc.) or from quantitative structure- property relationship (QSPR) models are also frequently used [6], [7]. Some common molecular descriptors are shown in Table 2:

Table 2: Example QSAR molecular descriptors by category.

*Frequently calculated from 2D QSPR models. Produced with information from [7], [8], [19], [28].

Constitutional: molecular weight, sum of atomic van der Waals volumes, number of atoms, bonds (by type), rings (by member), etc.

Topological: binary presence/absence or counts of fragments, functional groups, linear or circular fingerprints, H-bond donor/acceptor atoms; Wiener index, 2D autocorrelation vector, complexity, etc.

Geometric: molecular eccentricity/asphericity, radius of gyration, MoRSE descriptor, radial distribution function, inertial moments, 3D autocorrelation vector, McGowan volume*, topological polar surface area*, approximate surface area*, etc.

Electronic: highest occupied/lowest unoccupied molecular orbital energies, dipole moment, Hammett constant, etc.

Thermodynamic: heat of formation, molar refractivity, acid dissociation constant, etc.

Physiochemical: octanol-water partition coefficient*, water solubility*, physiological distribution coefficient*

If possible, descriptors are chosen on theoretical grounds or from previous studies indicating the response variable in question does depend on the molecular characteristics the descriptors capture. Otherwise, useful descriptors are “mined” from a large pool during model development. The identification of new “structural alerts”, or molecular features recognized to adversely influence a chemical’s biological activity (usually in reference to toxicity), benefits from the latter case [29]. Lastly, and perhaps unsurprisingly, not all descriptors are useful toward describing the endpoint of interest, and large sets of descriptors likely contain redundant information, necessitating the employment of descriptor selection strategies.

Development of a QSAR model is initiated once experimental data on a sufficient number of compounds is obtained and curated, and after descriptor calculation from the molecular structures is completed. Many possible development strategies exist; typically, the data set is divided into three portions: the training set, test set, and external validation set [6], [22], [27], [30]. Approximately 1% to 20% of the data is reserved for external validation. The other 80% to 90% may be split into training and test sets either statically or by a process called cross-validation [22]. Under n-fold cross-validation, the data is split into n sets of equal proportion. Iteratively, each of the n sets serves as the test set while the remaining n-1 sets are allocated for model training. The value of n is taken to be 5 or 10 in most studies, with the limiting case of leave-one-out cross-validation also seen. This process can be repeated multiple times for the generation of a distribution of performance statistics (a code sketch of this procedure follows Table 3). Model training itself employs various techniques such as descriptor selection, dimensional reduction, applicability domain determination, and supervised learning, common examples of which are shown in Table 3. It is also during this time that any of the aforementioned methods with tunable parameters undergo optimization. Multiple QSAR models may result from using combinations of descriptor sets found from various feature selection routines and different supervised learning algorithms.

Table 3: Techniques used in a QSAR workflow with examples.

Produced with information from [23], [31], [32], [33], [34].

Descriptor selection: unbalanced correlation score, Fisher score, information gain, chi-square test, odds ratio, Shannon entropy, forward selection, backward elimination, step-wise regression, genetic algorithm, simulated annealing, particle swarms, ant colony system, replacement method, etc.

Dimensional reduction: multi-dimensional scaling, principal component analysis (PCA), kernel principal component analysis, local linear embedding, t-distributed stochastic neighbor embedding, autoencoder, linear discriminant analysis, etc.

Applicability domain: bounding box, PCA bounding box, convex hull, distance-based methods, leverage-based methods, k-nearest neighbors, probability density distribution approaches, local density methods, decision tree approaches, etc.

Supervised learning: multiple linear regression, k-nearest neighbors, logistic regression, naive Bayes, partial least squares regression, decision trees, random forest, artificial neural networks, support vector machines, etc.
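A minimal sketch of the 5-fold cross-validation scheme described above, pairing two techniques listed in Table 3 (PCA for dimensional reduction, logistic regression for supervised learning). scikit-learn is assumed purely for illustration, and X and y are synthetic placeholders for descriptors and a binary endpoint.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))    # 200 molecules, 50 descriptors
y = rng.integers(0, 2, size=200)  # binary activity labels

model = make_pipeline(PCA(n_components=8), LogisticRegression(max_iter=1000))

accuracies = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=0).split(X, y):
    model.fit(X[train_idx], y[train_idx])                     # n-1 folds train
    accuracies.append(model.score(X[test_idx], y[test_idx]))  # held-out fold

print(f"5-fold accuracy: {np.mean(accuracies):.3f} +/- {np.std(accuracies):.3f}")
```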

Each QSAR model will have performance statistics associated with the predictions it makes on test sets. Data sets with continuous-valued endpoints can be evaluated using criteria such as the coefficient of determination and the root mean squared error (RMSE) [6]:

$$R^2 = \left( \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2}\, \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}} \right)^2 \qquad (5)$$

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} \qquad (6)$$

where $y_i$ is the activity of the i-th molecule, $\hat{y}_i$ is the predicted activity of the i-th molecule from the model, $\bar{y}$ is the average activity of all molecules, and $\bar{\hat{y}}$ is the average activity of all model predictions. The coefficient of determination is equivalent to the square of the Pearson correlation coefficient between data activities and model predictions, ranging from 0.0 (no correlation) to 1.0 (perfect correlation). The RMSE measures the variance, or uncertainty, of the model’s predictions. Likewise, classification models are assessed using statistics derived from a confusion matrix, which conveys the degree of agreement between the experimental and predicted values of the endpoint. A generic confusion matrix for a binary classification problem is shown in Table 4 [6], [30]:

Table 4: Generic confusion matrix for a binary classification problem.

                     Positive response               Negative response
Positive prediction  Number of true positives (TP)   Number of false positives (FP)
Negative prediction  Number of false negatives (FN)  Number of true negatives (TN)

A number of descriptive statistics may be generated from the confusion matrix such as the sensitivity, specificity, and accuracy:

$$Sensitivity = \frac{TP}{TP + FN} \qquad (7)$$

$$Specificity = \frac{TN}{TN + FP} \qquad (8)$$

$$Accuracy = \frac{TP + TN}{TP + FN + TN + FP} \qquad (9)$$

where TP, FP, FN, and TN are defined by the confusion matrix in Table 4. The sensitivity, specificity, and accuracy range from 0.0 to 1.0. Analyzing all three statistics allows for a better characterization of the model’s performance; for example, a trivial model which only makes negative predictions will achieve high accuracy on a highly unbalanced data set.

However, such a model will have zero sensitivity. Alternatively, a number of measures have been developed to summarize a binary classifier’s performance into a single value.

Consider the Matthews correlation coefficient (MCC) below:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}} \qquad (10)$$

The Matthews correlation coefficient is a binary discretization of the Pearson correlation coefficient (PCC) between a set of predictions and the accompanying set of actual values.

Like the PCC, the MCC lies on the interval [-1,1]; a value of 1 represents a perfect classification, a value of 0 means the classifier’s performance is equivalent to random guessing, and a value of -1 means the classifier’s predictions are exactly opposite of the true response values [35], [36]. Other measures, such as Cohen’s κ (kappa) or the area under the curve (AUC) of a receiver operating characteristics (ROC) plot, convey similar information consolidated into a single metric [6], [30].
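A minimal sketch of Equations (7)-(10) computed from raw confusion-matrix counts; the counts themselves are hypothetical.

```python
from math import sqrt

TP, FP, FN, TN = 45, 12, 9, 60

sensitivity = TP / (TP + FN)                        # Equation (7)
specificity = TN / (TN + FP)                        # Equation (8)
accuracy = (TP + TN) / (TP + FN + TN + FP)          # Equation (9)
mcc = ((TP * TN - FP * FN) /                        # Equation (10)
       sqrt((TP + FN) * (TP + FP) * (TN + FP) * (TN + FN)))

print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
print(f"accuracy={accuracy:.3f}, MCC={mcc:.3f}")
```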

Returning to the QSAR development workflow, models deemed sufficient by pre-defined performance thresholds are further tested on one or more external validation sets [6], [22], [24], [27]. Models with good internal and external predictive performance are then used individually or in consensus (i.e. ensemble) toward practical QSAR applications.

Since these models will guide future experiments, they are updated and re-validated once the results of those experiments are available [22]. A diagram of the QSAR development workflow as described in its totality is shown in Figure 4. Not all studies strictly adhere to the QSAR development workflow as outlined in the figure. For instance, if a data set is deemed too small to be split into training, test, and external validation sets, a smaller study involving only cross-validation is performed, although internal performance statistics are usually overly optimistic compared to proper external validation.

Benchmark data sets compiled to compare modeling methodologies across research groups partition data according to other schemes as well.


Adapted from [22].

Figure 4: QSAR development workflow.

1.3 Applications of QSAR Modeling

A literature review on in silico methods in drug design and development finds their primary application to be “virtual screening”: the utilization of, among other computational tools, QSAR models to recognize and prioritize potential drug-like molecules over those predicted to have properties unsuitable for further consideration [10], [11], [12], [37]. The drug discovery and development process, excluding the pre-clinical and clinical phases, can be decomposed into a series of steps: target identification, target validation, hit identification, hit-to-lead identification, and lead optimization. A summary of each step is found in Figure 5. QSAR models are most frequently applied during the hit identification, lead identification, and lead optimization phases of drug development. During hit identification, these models are used to quantitatively rank which molecules in the library should be selected for initial experimental evaluation. Furthermore, models can screen virtual libraries, guiding the selection of compounds which should be synthesized and added to the ligand collections. Later, during lead identification and optimization, there is a simultaneous requirement to maintain or enhance what is already acceptable target potency and selectivity while correcting potentially deficient pharmacological characteristics. Models constructed on hit and/or lead series can identify which core scaffolds and structural fragments are responsible for target interaction and how modifications to these sites can improve ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) properties.


ADMET: Absorption, Distribution, Metabolism, Excretion, and Toxicity. Produced with information from [36].

Figure 5: The drug discovery and development pipeline.

QSAR models can predict a wide range of pharmacological endpoints including plasma protein binding, lipophilicity, solubility, gastrointestinal absorption, blood-brain barrier passage, oral bioavailability, cytochrome metabolism, off-target interactions, metabolite formation, acute toxicity, skin sensitization, and genotoxicity, among many others [38]. Vital to the success of these models is sufficient training data in the form of physiochemical and structural information on previously approved drugs or molecules whose biological activities are typically ascertained from high-throughput assays. Equally important is training data on compounds with deleterious or no effect for comparison between classes [10], [11], [12], [37]. While direct measurement of target efficacy, selectivity, and ADMET characteristics for every hit or lead molecule would be ideal, conducting the required in vitro or in vivo tests is impractical due to their time-consuming, labor-intensive, and expensive nature [10], [38]. Conveniently, data collected from various assays conducted through the progression of the drug discovery process can function in a feedback loop, providing more high-quality data and subsequent validation to existing QSAR screens. Tropsha provides a concrete example of the effectiveness of QSAR analysis toward hit identification through the discovery of novel Geranylgeranyltransferase type I (GGTase-I) inhibitors [22]. These inhibitors have the capacity to treat conditions such as inflammation, multiple sclerosis, and atherosclerosis. Using a form of the modeling workflow outlined in Figure 4, a virtual screen on 9.5 million commercial compounds yielded 47 computational hits, seven of which were shown to be active and selective upon in vitro testing. Two of these compounds are shown in Figure 6.


Two compounds found to have Geranylgeranyltransferase type I inhibitory activity (values shown) via QSAR virtual screening. IC50 is the concentration of compound reducing the enzymatic activity by 50%. Adapted from [22].

Figure 6: QSAR applied to virtual screening in drug development.

Any effort to improve the performance of pharmaceutically-applicable QSAR models is of paramount importance for the success of the industry and, infinitely more important, the betterment of human health. Estimates place the cost of developing a new drug at $800 million to $1.8 billion over a 10 to 15 year development period [11], [39].

Additionally, more recent estimates place development costs at greater than $5 billion and possibly up to $13 billion over the same time frame, depending on the company scrutinized [40]. In all, R&D spending over the last 40 years on drug discovery and development has steadily increased while the number of New Molecular Entity (NME) registrations submitted to the U.S. Food and Drug Administration has remained relatively constant [40].

Attrition is cited as a problem undermining the pharmaceutical industry and driving up costs, with more than 95% of the small molecules reaching clinical trials ending in failure [39], [40], [41], [42]. The highest source of attrition in the 1980s was poor human pharmacokinetic (PK) properties, accounting for nearly 40% of all failures. Since the advancement of pre-clinical screens, including virtual QSAR screens which specifically address PK and other ADMET characteristics, PK failures dropped to approximately 10% by 2005 [39], [40], [41], [42]. Today, efficacy in clinical trials is the greatest cause of new drug failures, with safety issues following closely [39], [40], [41], [42]. Addressing these sources of attrition will require new science: better mechanistic insights into the cause of disease; better translation between disease models in animals and later in human patients; and better experimental designs focused on understanding side effects [40], [43]. The data generated from these advancements will feed computational methods and ultimately offer faster, cheaper pathways toward new drugs.

Continuing the discussion on QSAR applicability, computational methods are increasingly employed toward environmental and human health risk assessment and regulation. At present, there are tens of thousands of industrial chemicals in use, with hundreds more added each year, and the toxicological profiles of these compounds are usually lacking or absent [1], [13], [44]. These chemicals are found in many commercial products (e.g. foods, cosmetics, etc.) and pose a potential risk to human health. Additionally, such chemicals frequently enter the environment (e.g. pesticides) and impact human, animal, and plant life. Regulatory agencies, particularly in the U.S. and Europe, are concerned with characterizing and mitigating the potential impact of these substances [13], [44].

Completing the necessary in vitro and in vivo toxicity tests on any one chemical would require a considerable time and capital investment. Such tests also require animal subjects, and regulatory authorities are reducing reliance on animal testing due to ethical concerns [1], [44]. Computational methods are sought to quickly and cheaply fill in missing data on commercial compounds. However, predicting toxicological outcomes using computational models presents its own difficulties; training data is relatively scarce, and the biological mechanisms-of-action are multi-faceted and poorly understood. Ongoing programs such as ToxCast and Tox21 seek to generate high-throughput screening (HTS) data on diverse sets of commercial chemicals using hundreds of biological assays [1], [44]. A combination of physiochemical/structural descriptors and HTS data are used to group compounds hypothesized to share similar mechanistic pathways to certain toxicity endpoints, and then semi-empirical models based on statistical analysis and/or machine learning algorithms predict those endpoints from chemical structure. Such a paradigm has been used to predict toxicological outcomes such as bacterial mutagenicity, developmental toxicity, and skin sensitization, among others [1], [44].

A concrete example of QSAR applied to risk assessment concerns mitigation of allergic contact dermatitis (ACD). Alves et al. investigated the use of QSAR models for predicting skin sensitization over existing animal tests [45]. ACD can pose a hazard to humans and the environment through exposure to sensitizing compounds in commercial products. ACD occurs by a two-step process: induction and immune response elicitation [45]. First, the chemical contacts the epidermis and binds to skin proteins to form an immunogenic complex. Then, allergen-specific T-cells initiate an inflammatory response.

Skin sensitization is tested in both animals and humans. The murine local lymph node assay (LLNA) is the preferred animal test by European and U.S. regulatory agencies since it shows good correlation with human sensitization [45]. In humans, the repeated insult patch and maximization tests are used, the latter of which results in irritated skin. However, for ethical reasons, animal testing is being eliminated or reduced across several industries. For example, animal tests for cosmetic ingredients have been banned in Europe since 2009 [45]. There is also the question of validity; the LLNA test does fail in several instances to predict human skin sensitization. With these facts in mind, QSAR modeling is an attractive alternative to future animal testing. These models can simultaneously exploit data already collected from human and animal studies while introducing additional information in the form of molecular descriptors, potentially absent from mechanisms underlying the assays. Alves et al. compared QSAR models constructed using human data to predictions produced from the LLNA assay.

LLNA assays had higher sensitivity, 83% vs. 65%, due to the frequency of labeling molecules as sensitizers [45]. This result highlights the ability for QSAR models to replace or at least supplement animal testing assays in commercial practice. The aforementioned

QSAR models provide pathways for modifying existing sensitizers into non-sensitizers, shown in Figure 7, while potentially preserving the useful properties of the original compound.

27

Phenyl benzoate (top), skin sensitizer, along with three suggested compounds (bottom) predicted non-skin sensitizers to potentially replace it in products. Adapted from [45].

Figure 7: QSAR applied to skin sensitizers in consumer products.

This ends the introductory chapter describing in detail the fields of cheminformatics and QSAR analysis, and the primary applications of QSAR modeling. The remaining chapters of this dissertation are organized as follows: Chapter 2 provides background information on implementing QSAR models, including considerations and techniques present throughout the process. It also outlines the objectives of the research contained herein. Chapter 3 describes the methodology of the research conducted herein, namely, a

QSAR modeling workflow which constructs local model ensembles complete with query- specific descriptor sets. It also discusses the computational tools and data sets used to complete and evaluate this research. Chapter 4 presents results and discussion from the

28 application of the proposed research on the evaluation data sets. It also offers performance comparisons between the work and comparable models reported in the literature. Finally,

Chapter 5 re-iterates the conclusions drawn from the results and discussion and outlines areas of improvement to be pursued in future endeavors.

29 Chapter 2: Background

As the field of chemical engineering has only recently trended toward biomolecular applications, it is understandable that many chemical engineers are unfamiliar with chemoinformatics and, in particular, QSAR modeling. Therefore, it is necessary to discuss some of the most pertinent aspects of QSAR modeling in further detail, review previous literature related to it, and explain specific language common to the discipline. Naturally, topics of QSAR modeling which are the subject of the research presented herein are emphasized. Lastly, the research problem addressed by this research is presented along with an outline of the proposed solution.

2.1 Global vs. Local QSAR Modeling

Within the context of a single data set of chemical compounds, the choice of whether or not to use all molecules for training a QSAR model can present additional opportunities to enhance predictive performance. As Guha et al. explains, traditional QSAR models use all available compounds in a given training set for model construction. From the perspective of molecules as observations scattered about in molecular descriptor space, such models are referred to as “global” models [46]. Global QSAR models are most useful when the data set is largely homogeneous and the underlying relationship between activity and molecular structure is relatively simple. However, if the data set is particularly large and/or structurally

30 diverse, there may exist certain regions in descriptor space where the mechanism of action for the activity of the compounds within those regions differ significantly from the rest of the molecules of the data set. Consequently, the descriptors relevant for explaining the activity of these regions may be “drowned out” in favor of more global trends [46], [47].

In order to identify smaller structure-activity regions distributed within a data set, one might consider partitioning the whole data set into a series of subsets whose members share similar structure. These subsets might be found either by a priori clustering or through the use of distance thresholds drawn from points distributed throughout descriptor space

[46], [47]. An individual QSAR model can be built on each of these subsets. The underlying structure-activity relationship should be relatively simple since the molecules in these subsets are likely to be similar as a result of clustering or through their proximity to each other within descriptor space. The models resulting from such an approach are referred to as “local” models, and the strategy as a whole known as “local learning” [46], [47].

Clustering and model training may proceed before any test or query compounds are presented for prediction. Future queries are then placed within pre-existing clusters usually by a similarity-based decision function, and the model of that cluster predicts the activity of the query. Distance-based learning algorithms tend to be “lazy”, the name adopted to an algorithm which does not initiate until a query is presented for prediction [46], [47]. For these approaches, the position of the query in descriptor space dictates the training observations used to predict its activity. Finally, there exists approaches between global and local learning. Generally termed “local weighting”, such methods use all training observations but their contribution toward predicting the query is adjusted in proportion to their distance from the query [47]. For example, LOESS (locally-weighted least squares 31 regression) and LOWESS (robust locally-weighted regression), using Taylor expansions defined by kernel functions, have been applied to various data modeling problems for some time [48]. For simple regression, the polynomial minimization problem for LOESS is given by:

2 푁 푝 푥푖 − 푧 min ∑ (푦푖 − ∑ 훽푗(푥푖 − 푧)) 퐾 ( ) (11) 훽0,…,훽푝 ℎ 푖=1 푗=0 where N is the number of observations, p is the order of the polynomial fit, 푥푖 is the value of the descriptor value for the i-th observation, 푦푖 is the value of the response for the i-th observation, K is the kernel function, h is the bandwidth parameter, and z is the position in the domain where the regression parameters 훽푗 are valid [48]. As Expected, Equation (11) can be applied to the multivariate case as well. As opposed to regular polynomial regression in which the resulting fitted model is assumed reasonable over the entire domain, the model found by minimizing Equation (11) only applies at z, the position of the query for a prediction problem. The investigator is left to the choice of kernel function and bandwidth parameter, that latter of which affects the weight of nearby observations toward the smoothness of the fit and is chosen to reflect the density of the data [48]. Perhaps the most widely-known local learning algorithm is the k-nearest neighbor’s (k-NN) algorithm. As explained by Mitchell, the majority class of a query’s k nearest neighbor’s, where k is an integer greater than zero, decide the predicted class for the query [49]. The distances from the training observations to the query are found in the descriptor space used for the model; the Euclidean distance metric on the normalized space or Mahalanobis distance metric are

32 used the most often. A toy example demonstrating the k-NN algorithm is shown in Figure

8:

A query (yellow triangle) among a two-class data set. If k is taken to be three, two negative neighbors (blue crosses) outnumber the one positive neighbor (red circle) and the query is predicted negative. By similar reasoning, if k is taken to be five, the query is predicted positive.

Figure 8: Hypothetical k-NN classifier example.

As mentioned previously, the contribution of each training observation’s class toward predicting the query may be weighted by its distance to the query by a weighting function.

Ultimately, the value of k and choice of weighting is optimized during the training process by cross or external validation sets [49]. Exemplifying its capabilities, Hansen et al. compared four parametric methods, support vector machines, Gaussian Processes (GPs), 33 random forest, and k-NN, on a public data set of 6,500 compounds constructed using Ames mutagenicity as the endpoint. Of the parametric classifiers, the k-NN algorithm offered comparable performance; the area under the receiver operating characteristic curve values, which serve as an overall measure of classification performance, were 0.86, 0.84, 0.83, and

0.79, respectively [50].

The premise of the local learning strategy aligns closely with the similarity principle of QSAR analysis; that is, the similarities and differences of the training compounds in proximity to the query compound are the most informative for deciding its biological activity. Local algorithms are advantageous through their capacity to simplify globally non- linear relationships as a collection of locally-linear approximations [46]. These local approximations not only provide accurate predictions but also interpretations into the mechanisms governing the observed activities due to their linear nature [51]. This latter aspect, model interpretability, can be much more difficult to obtain from non-linear, “black box” algorithms such as kernel learning methods and artificial neural networks [49], [52].

However, not all local learning algorithms are amenable to interpretation, for example, the k-nearest neighbors algorithm gives no immediate indication of descriptor importance without detailed analysis. Others criticize local learning methods for failing to explain global trends generalizable to the entire data set; the importance of certain descriptors may be closely linked to the position in descriptor space [46]. Helgee et al. observed on a study of simulated and real data sets that local models gave no reliable increase in predictive performance compared to global models [52]. Local models are said to be “risky” in a sense that failure to identify the “correct” locality for a query leads to prediction error. Helgee et al. also argue that local models ignore useful information contained in the totality of the 34 data set which, especially combined with non-linear learning algorithms, global methods sufficiently retain. Lastly, local learning methods are found to require rather large training sets or at least sets with regions of high density to reliably learn local models. Since a local model is built for each query, the computational resources required to predict many queries can be substantial compared to a global modeling strategy. In all, the local learning concept is not new, and some local QSAR modeling approaches found in the literature are now briefly reviewed.

Guha et al. adapted the k-NN algorithm to identify training observations local to queries for predicting three separate, continuous-valued biological endpoints: anti-malarial activity, platelet-derived growth factor receptor inhibition, and dihydrofolate reductase inhibition [46]. For each test compound to be predicted, it’s k-nearest training set neighbors were selected to serve as a local training set specific for that query. Ridge regression was applied to learn the relationship between the molecular descriptors and activity for these neighborhoods. The minimization function for ridge regression is given by [53]:

푁 푝 ′ 2 min ∑(푦푖 − 풙풊휷) + 휆 ∑ 훽푗 휷 (12) 푖=1 푗=1

Ridge regression differs from ordinary least squares regression by the addition of the right- most term which penalizes large regression coefficients. Such a modification, while biasing the regression coefficient estimates, reduces their variance compared to the ordinary least squares minimization problem. This is beneficial for regression problems with small training sets and many descriptors in which the ordinary least squares regression coefficient estimates may be unstable. The performance of the local lazy algorithm was compared to global multiple linear regression models using the RMSE measured on external test sets. Of 35 the three aforementioned data sets, the global and local models obtained the following performance statistics: 0.92 log units vs. 0.94 log units; 0.36 log units vs. 0.31 log units; and

2.16 log units vs. 2.01 log units, respectively. In summary, the local lazy model outperformed the global model in two of the three data sets. However, the authors noted that local models performed poorly for test molecules situated around “activity cliffs”, or regions where the biological activity of the molecule differs greatly despite no significant changes in structure.

Buchwald et al. devised a local modeling strategy which initiates with structural clustering of the training set guided by pre-defined size thresholds [51]. All training molecules are subjected to a graph mining algorithm which identities frequently occurring substructures given a minimum frequency constraint. Clusters are defined by a common substructure whose size must exceed a user-defined proportion of the whole molecule size.

Furthermore, small molecules are excluded from cluster membership by a minimum size threshold. As a result, training molecules may fit into no clusters, one cluster, or several clusters simultaneously if their structural features satisfy the aforementioned cluster inclusion criteria. After clustering, a global model is built using all training molecules while individual local models are built using each cluster. When a query compound is presented for prediction, it is assessed for cluster membership. If the query belongs to no clusters, a global model prediction is made. Otherwise, local model predictions are made for the query using each cluster. If the query belongs to two or more clusters, a consensus prediction is made using weights proportional to cluster size. Thus, larger clusters are assumed to be more confident in their predictions and have more influence. This methodology was benchmarked on 14 QSAR data sets ranging from 282 to 1216 molecules. By adjusting size 36 thresholds, the average number of clusters ranged from approximately 5 to 40 clusters whereas cluster size ranged from approximate 20 to 80 molecules, depending on the data set in question. Global and local models were completed using a variety of supervised learning algorithms: Gaussian Processes, k-nearest neighbors, MSi nearest neighbors, M5P model trees, and support vector machines. Model performance was determined using a 100 times hold-out validation procedure where a random 2/3 of the data set was used for training and the remaining 1/3 for testing. For the regression data sets, a significant improvement in mean absolute error was observed in more than 75% of the cases using the local modeling scheme. The global models outperformed the local models in less than 10% of the cases.

Likewise, for the classification data sets, local models achieved significantly improved accuracy in 15 of the 20 cases. For the remaining instances, global models outperformed in two and no significant difference was found in the final three.

2.2 Descriptor Selection

The descriptors chosen to represent the compounds of a data set is of great importance toward the successful development of a QSAR model. Presently, the relationships between chemical structure and biological activity are far too complex to be explained from first principles. Additionally, there is currently no accepted methodology for directly inputting the 2D or 3D structural representation of molecules into a learning algorithm. Therefore, as seen in Figure 9, molecular descriptors are the only venue for communicating structural information with the mathematical models which describe their correlations to the observed activities:

37

Adapted from [4].

Figure 9: The basic QSAR approach.

Demonstrating the importance of descriptor choice over the choice of learning algorithm,

Young et al. performed an experiment concerning the labeling of 1280 molecules into their correct class: adenosine, antibiotic, antibiotic-cephamycin, cholinergic, GABA, or hormone

[54]. Two different descriptors sets (binary atom-pair and BCUT-like) and three different learning methods (support vector machine, random forest, and boosting) were selected. All possible combinations of descriptor set and learning method were used to predict each class.

The authors found no significant difference between the learning algorithms’ performance for any molecular class; however, performance was dependent on the descriptor set considered.

Ideally, a QSAR model will have as few relevant descriptors as possible to adequately explain the variation of the endpoint in question. Models with fewer numbers of descriptors are the easiest to interpret. Naturally, models built with uninformative or noisy descriptor sets will be less predictive [32]. Those constructed with multiple redundant

38 features will be difficult to interpret since, in most instances of QSAR modeling, the descriptors are not orthogonal. Supervised learning methods applied to such data will obtain high levels of uncertainty in regard to descriptor coefficient estimates since the model cannot distinguish the source of the associations with the response. Furthermore, models trained with too many descriptors, especially in relation to the number of observations present in the data set, will have too many degrees of freedom when fitting to the training set. Although these “overfitted” models explain the training data well, they will often fail to generalize toward the prediction of new test observations [32]. A simple example of overfitting is shown in Figure 10:

Data (black scatter) is generated according to y = x + N(0,0.25). The actual relationship (dashed black line), simple linear regression (blue line), and 6-th degree polynomial regression (red line) are shown. While the polynomial model fits the data better, the simple model agrees more with the underlying relationship.

Figure 10: Demonstration of overfitting.

39 Ultimately, decisions made from QSAR models with the aforementioned discrepancies will be doomed to failure, wasting precious time and resources. Descriptor selection algorithms are designed to alleviate these problems and produce the most useful models possible.

Descriptor or feature selection methods are generally divided into two categories:

filter and wrapper methods. The first type, filter methods, derive their name from the fact that they “filter” or screen the number of descriptors available as a pre-processing step before the primary learning algorithm is applied [7], [21], [31], [32]. Filtering is accomplished by fulfillment of a criterion such as a certain degree of correlation between the descriptor and the response variables. These methods are advantageous in that they scale well with high-dimensional data sets. They are also computationally fast and less intensive relative to wrapper and hybrid/embedded methods [7], [21], [31], [32]. On the other hand,

filter methods do not participate in any form with the learning method employed to explain the variation in the data set. As a result, filter methods usually underperform relative to wrapper/hybrid approaches. Lastly, many filter methods are univariate-based, meaning they only consider one descriptor at a time. Univariate approaches by their nature are unable to consider the benefits of combinations of descriptors toward improving predictive performance or detrimental issues stemming from redundancy/multicollinearity [7], [21],

[31], [32].

Univariate descriptor selection is used exclusively within the proposed work due to the computational expense associated with filtering a large number of descriptors and the fact that a modeling strategy may be tasked with building many local models. For interval or count-based descriptors (i.e. physiochemical properties), the observations are split into

40 groups according to the classes of the response. Next, a one-way ANOVA is conducted on the descriptor by calculating the appropriate test statistic [55]:

푆푆푇푟푒푎푡푚푒푛푡푠/(푎 − 1) 푀푆푇푟푒푎푡푚푒푛푡푠 퐹0 = = (13) 푆푆퐸푟푟표푟/(푁 − 푎) 푀푆퐸푟푟표푟 where SSTreatments is the descriptor variance between response classes, SSError is the descriptor variance within response classes, N is the total number of observations, and a is the number of classes of the response variable. The descriptor is deemed significantly correlated with the response if its associated F-statistic exceeds the 95-th percentile of its respective sampling distribution. Otherwise, the descriptor is dropped from further consideration [55]:

≤ 퐹0.05,훼−1,푁−푎 푡ℎ푒푛 푟푒푡푎𝑖푛 푑푒푠푐푟𝑖푝푡표푟 퐼푓 퐹0 𝑖푠 { (14) > 퐹0.05,훼−1,푁−푎 푡ℎ푒푛 푑𝑖푠푐푎푟푑 푑푒푠푐푟𝑖푝푡표푟

For binary descriptors (i.e. the presence or absence of structural fragments), the Fisher’s exact test is used to determine descriptor significance. This test is appropriate for potentially small, unbalanced data as opposed to large sample tests like that derived from the chi- squared statistic. The relevant test statistic, using the notation of the confusion matrix shown in Table 4, follows the hypergeometric distribution [56]:

푇푃 + 퐹푃 퐹푁 + 푇푁 푇푃 + 퐹푃 퐹푁 + 푇푁 ( ) ( ) ( ) ( ) 푝 = 푇푃 퐹푁 = 퐹푃 푇푁 푇푃 + 퐹푃 + 퐹푁 + 푇푁 푇푃 + 퐹푃 + 퐹푁 + 푇푁 (15) ( 푇푃 + 퐹푁 ) ( 퐹푃 + 푇푁 ) Binary descriptors are deemed significantly correlated with the response if the value of p calculated in Equation (15) is less than the level of significance, taken to be 0.05 for this work. Although not utilized here, a variety of other filtering methods have been proposed

41 such as information gain, mutual information, factor analysis, Shannon entropy, etc. [7],

[21], [31], [32].

The other category of descriptor selection methods, wrappers, optimize which descriptors are best suited for the model through coordination with a supervised learning algorithm [7], [21], [31], [32]. First, the algorithm determines a subset of the available descriptors for consideration. Next, the training data expressed in terms of the descriptor subset is fitted by the learning algorithm and evaluated. The error of the learning algorithm in predicting the data is used as a score to guide the selection of another descriptor subset.

In this sense, the process wraps back around to the subset selection portion of the method, repeating until a stopping criterion is reached. The learning algorithm used for feature selection may be independent from the primary method used for final training and prediction. Wrapper methods enjoy one distinct advantage through the coupled use of a learning algorithm; they can consider the improvement conferred by combinations of several descriptors to the structure-activity relationship. For this reason, they generally outperform most filter methods [7], [21], [31], [32]. However, wrapper methods are much more computationally intensive, have a greater risk of over-fitting the training data, and the selection of subsets is heavily dependent on the learning algorithm employed. The most widely recognized wrapper methods include forward selection, backward elimination, step- wise selection, and genetic algorithm, among others [7], [21], [31], [32].

42 2.3 Local Descriptor Selection

The notion that the importance of individual descriptors toward explaining the response variable is a function of the location within the complete descriptor space is a relatively unexplored topic in QSAR analysis. Guha et al. discussed the notion of incorporating a local feature selection strategy into a QSAR modeling workflow but did not pursue this idea further [46]. Traditionally, descriptor selection identifies a single subset of descriptors which are used for building either global or local models. Indeed, the very concept of distance between training observations and queries depends on first defining a descriptor space. An alternative approach, and a particular aspect of the proposed research, is to allow the composition of descriptor subsets to vary throughout the full descriptor space. Such a process grants each local model the capacity to have its own optimal set of observations and descriptors, though overlap between local models would certainly be possible or even likely depending on the particularities of the data. If a data set is rather large and diverse, multiple mechanisms could be causing the observed variability in the end point. Global descriptor selection algorithms may overlook the influence of certain descriptors if their contribution is subtle or does not occur readily enough throughout the entire descriptor space. A localized approach would strengthen the signal-to-noise ratio of these descriptors by removing irrelevant information. As mentioned, methodologies describing the use of local descriptor sets for prediction are not as prevalent as global selection techniques. Two examples, the

first solving a linear optimization problem and the second utilizing Gaussian Process regressions, are reviewed next.

43 Armanfard et al. introduced the Localized Feature Selection (LFS) method which treats each training observation as a representative of its local descriptor space [57].

Descriptors are chosen such that, for each training observation, the squared Euclidean distance to neighbors with the same class and different classes are simultaneously minimized and maximized, respectively. Over the entire training set, the problem becomes a constrained linear optimization and can be reformulated so that a solution exists. A query is assigned to the class with the largest similarity score; the score is determined by totaling the classes of the nearest training neighbor in all possible local feature sets (i.e. the size of the training set). The LFS method was compared against 6 common descriptor selection algorithms using 10 data sets, 6 of which consisted of cancer-related gene expression microarray data. LFS performed better than all other selection techniques for 9 of the 10 data sets considered [57].

Pichara and Soto devised a classification technique in which the discriminative potential of every descriptor can be estimated for any point in global descriptor space via

Gaussian Process (GP) regressions [58]. A GP regression is learned for each descriptor and estimates that descriptor’s ability to segregate the training data over its domain. Local descriptor subsets are constructed sequentially by adding the most discriminative descriptors at the position of the query. Once the local subset is determined, the query and all training observations are projected to this subspace and a k-nearest neighbors classifier predicts the query’s class. The methodology was tested using 4 data sets: breast cancer biopsy images, letter speech recognition data, infrared astronomical data, and x-ray images.

The technique outperformed 7 common descriptor selection methods for two of the data sets and offered similar performance on the remaining data sets [58]. 44 2.4 Dimensional Reduction

The motivation for dimensional reduction is similar to that for descriptor selection; to reduce the number of descriptor variables, and to alleviate the detrimental effects of too many descriptor variables, before application of a supervised learning algorithm. In many studies, investigators may have no prior knowledge as to the biological endpoint they wish to model and instead wish to “mine” the data set to uncover novel relationships. In these cases, a wide variety of descriptors may be calculated rather indiscriminately. An excess of descriptor variables during QSAR modeling is particularly troublesome given the large number of descriptor variables available for calculation from chemoinformatics software packages.

Furthermore, datasets are not constructed according to experimental designs in terms of molecular structure; that is to say, it is highly likely that descriptors determined from these sets will exhibit moderate to high degrees of correlation. Intercorrelation, or multicollinearity, among predictor variables is a problem for linear learning methods such as ordinary multivariate linear and logistic regression. Ng demonstrates the problem by pointing out that, with a completely orthogonal descriptor set, the coefficients from a p- dimensional multivariate regression are [59]:

푥̃′ 푦̃ 푥̃′ 푦̃ 푥̃′ 푦̃ ̂′ 1 2 푝 훽 = [ ′ , ′ , … , ′ ] (16) 푥̃1푥̃1 푥̃2푥̃2 푥̃푝푥̃푝

As seen in Equation (16), the multivariate regression in the orthogonal case is simply a series of univariate regressions along each dimension. If one were to orthogonalize a descriptor set exhibiting multicollinearity using the Gram-Schmidt procedure, then the response variable is regressed against the residual of 푥̃푗 regressed on 푥̃1, 푥̃2, …, 푥̃(푗−1),

45 푥̃(푗+1), …, 푥̃푝. If 푢̃푗 is such a residual, and if 푥̃푗 is highly correlated with at least one of 푥̃1,

푥̃2, …, 푥̃(푗−1), 푥̃(푗+1), …, 푥̃푝, then 푢̃푗 will be close to 0̃. From Equation (16), the regression coefficient for 푥̃푗 is:

푢̃푗푦̃ 훽푗 = ′ (17) 푢̃푗푢̃푗

This regression coefficient will be highly unstable in the situation described as a result of the intercorrelation with the rest of the descriptors. Aside from multicollinearity, high dimensional datasets pose additional problems for supervised learning collectively referred to as the “curse of dimensionality”. As Verleysen and Francois explain, the first problem is the number of samples required to properly describe a region of the descriptor space which increases exponentially with the number of descriptors [60]. For example, if 10 samples are needed to smoothly fit a relationship in a 1-dimensional space, 100 samples are needed to

fit an equivalent 2-dimensional space, 1,000 samples to fit an equivalent 3-dimensional space, etc. This issue often arises in QSAR studies since the number of predictor variables frequently outnumber the molecules. Second, the geometric properties of high-dimensional spaces are very different from their low-dimensional counterparts. Figure 11 shows, as a function of the dimension d of the space, the volume of a hypersphere of radius 1 (left) and the ratio of the volume of a hypersphere of radius 1 to the volume of a hypercube with side length 2 of which the hypersphere is inscribed (right). As evident from the plots, once the dimension is greater than approximately 20, the volume of the hypersphere is essentially 0.

Furthermore, at dimensions exceeding roughly 10, most of the volume of the hypersphere inscribed within the hypercube is found at the edges of the hypercube. Data drawn uniformly

46 and randomly from such a space would be found at the edges of the hypercube with probability close to 1.

The volume of a hypersphere of radius 1 vs. the dimension of the hypersphere (left). The ratio of a hypersphere of radius 1 inscribed in a hypercube of side length 2 vs. the dimension of the hypersphere/cube (right). Adapted from [60].

Figure 11: Illustration of the effects of high-dimensionality.

In other words, the distance between data points and the center of the distribution in these high-dimensional spaces are large and concentrated in a narrow interval. The concept of locality is heavily distorted; for example, the distance between a query and its neighbors for arbitrary distance metrics becomes approximately equal [60]. As a consequence, learning algorithms which discriminate on distance within descriptor space lose their predictive power. Thus, dimensional reduction techniques are employed to address the problems encountered by high-dimensional, intercorrelated data. 47 The most common dimensional reduction technique is principal component analysis

(PCA). The principal components (PCs) of a data set are the eigenvectors of its variance- covariance matrix; unit vectors oriented in the directions of maximum variance [61]. A 2- dimension example of PCA is shown in Figure 12:

The vector 푒̃ is the 1st PC of the original data (blue circles) and is oriented in the direction of maximum variance. The data are projected linearly onto the first PC (red circles) and are referred to as scores. Inspired from [62].

Figure 12: Illustration of principal component analysis.

All principal components are successively orthogonal. Samples of a p-dimensional data set are projected linearly onto the m-dimensional PCs, where m is usually much less than p.

Information is lost in the p - m components not retained for modeling. PCA is done as a pre- processing step to modeling and is unsupervised; correlation with the response variable has 48 no bearing on the directions of the components. A common approach is to perform multivariate or logistic regression on the PCs, a method known as principal component regression (PCR). While PCR addresses the problems associated with high dimensionality and multicollinearity, as a linear technique, a data set with a relationship that is largely non- linear with the response will likely be lost after the projection. Furthermore, being unsupervised, descriptors important for explaining the endpoint may be lost if they are not correlated with the PCs.

Partial least squares (PLS) regression is a dimensional reduction and learning technique that addresses the unsupervised nature of PCA. The general form of PLS is that of a multivariate regression [63]:

̅ 푌 = 훽0 + 훽1푇1 + 훽2푇2 + ⋯ + 훽푚푇푚 (18) where 푌̅ is the mean value of the response variable, 훽푖 are the regression coefficients for the i-th PLS component or latent variable, 푇1 is the i-th PLS latent variable, and m is the number of PLS latent variables retained, usually much less than the total number of descriptors, p.

The construction of the PLS latent variables is of the most interest. First, each mean- centered descriptor is fitted to the mean-centered response variable univariately. Next, the

PLS latent variable is constructed as a weighed linear combination of the mean-centered descriptors according to [63]:

푇푖 = ∑ 푤푖푗 푏푖푗 (푋푗 − 푥푗̅ ) (19) 푗=1 where 푤푖푗 is the weighting for the i-th PLS latent variable on the j-th descriptor variable, 푏푖푗 is the univariate regression coefficient of the j-th descriptor fit against the response for the

49 i-th PLS latent variable, 푋푗 is the j-th descriptor variable, and 푥푗̅ is the mean vector for the j-th descriptor. Subsequent variability in the descriptor data orthogonal to 푇푖 may be useful for explaining the endpoint. For subsequent PLS latent variables, “residual variability” in Y is regressed against “residual information” for each descriptor variable according to [63]:

′ 푡푖 푣푖푗 푉(푖+1)푗 = 푉푖푗 − ( ′ ) 푇푖 (20) 푡푖 푡푖

′ 푡푖 푢푖 푈푖+1 = 푈푖 − ( ′ ) 푇푖 (21) 푡푖 푡푖

푇푖+1 = ∑ 푤(푖+1)푗푏(푖+1)푗푉(푖+1)푗 (22) 푗=1 where 푉푖푗 = 푋푗 − 푥푗̅ correlated with the i-th PLS latent variable, 푈푖 = 푌 − 푦̅ correlated with the i-th PLS latent variable, 푏(푖+1)푗 is the univariate regression coefficient of 푈푖+1 regressed against 푉(푖+1)푗, and the lower-case variables are their sample equivalents. This process repeats until the final number of PLS latent variables is reached, a value usually determined via cross-validation. A 2-dimensional example of PCA compared to PLS is shown in Figure

13:

50

The PC component, 푒̃ (black dashed line), is oriented in the direction of maximal variance irrespective of class. The PLS latent variable, 푞̃ (green dashed line), is oriented in the direction of maximal covariance between the descriptor data and response classes. Inspired from [64].

Figure 13: Illustration of partial least squares regression compared to principal component analysis.

Similar to PCA, the PLS latent variables are pair-wise orthogonal to each another. In the majority of instances, PLS outperforms PCR due to its consideration of the response variable during construction [63]. Both PCA and PLS constitute linear transformations in the descriptors. However, if molecules are hypothesized to follow a non-linear placement in high dimensional descriptor space, then linear projections will distort this structure after projection, perhaps leading to the false identification of a test molecule’s neighbors. The concept is illustrated in Figure 14:

51

PCA – Principal Component Analysis; LLE – Locally Linear Embedding. Taken from [65].

Figure 14: Linear (PCA) and non-linear (LLE, IsoMap) dimensional reduction techniques applied to hypothetical S-shaped data.

Many non-linear dimensional reduction techniques have been proposed to preserve, if they exist, the structure of non-linear data on a low-dimensional manifold embedded in a high- dimensional space. These methods include Sammon mapping, curvilinear components analysis, stochastic neighbor embedding, IsoMap, maximum variance unfolding, locally linear embedding, and Laplacian eigenmaps [66], [67], [68]. In Figure 14, the original data, albeit manufactured, conforms to an “S” shaped manifold in 3-dimensional space. The PCA

52 transformation fails to discovery and preserve the structure of the manifold, and clearly some of the samples which are distinctly distance from each other in 3-dimensional space are now in close proximity in 2-dimensional space. The non-linear dimensional reduction techniques, specifically locally linear embedding and IsoMap, unravel the manifold and maintain the global and local characteristics of the original data set. As part of the proposed work, a non-linear dimensional reduction technique, namely t-distributed stochastic neighbor embedding (t-SNE), will be compared with PLS. The goal of this modification will be to understand if non-linear dimensional reduction techniques can retain the inherit non-linear character of the datasets, should they exist, and enhance the predictive performance of a local QSAR modeling workflow. The t-SNE algorithm is now explained in detail.

2.4.1 t-Distributed Stochastic Neighbor Embedding

On the topic of manifold learning, van der Maaten and Hinton point out that dimensional reduction methods which operate by linear transformation prioritize keeping the low- dimensional representations of dissimilar data points separate [68]. In contrast, for high- dimensional data which lies upon or near a low-dimensional, non-linear manifold, the authors note it is often necessary to keep low-dimensional representations of highly similar data points close together. This latter necessity is usually not possible with linear reduction techniques. On the other hand, many non-linear reduction methods lose global structure while trying to preserve local structure. As a modification to stochastic neighbor embedding, van der Maaten and Hinton propose t-distributed stochastic neighbor embedding (t-SNE) in

53 an effort to maintain both local and global trends in the high-dimensional data following reduction to the lower dimension. While the methodology is designed primarily for , it is possible to adapt the technique for modeling and prediction. In order to understand t-SNE, it is first necessary to review stochastic neighbor embedding (SNE).

The foundation of SNE rests upon the representation of similarity between data points thought of as Gaussian-centered conditional probabilities [68]. In high-dimensional space, the similarity of 푥푖 and 푥푗 is modeled by 푝푗|푖, or the probability that 푥푖 would select

푥푗 as its neighbor if neighbors were selected in proportion to their probability density under a Gaussian situated at 푥푖, according to:

2 2 푒푥푝 (−‖푥푖 − 푥푗‖ /2휎푖 ) 푝푗|푖 = 2 2 (23) ∑푘≠푖 푒푥푝 (−‖푥푖 − 푥푘‖ /2휎푖 ) where 휎푖 is the variance of the Gaussian centered at 푥푖. From Equation (23), it is evident that if 푥푖 and 푥푗 are in close proximity, then 푝푗|푖will be large. Oppositely, 푝푗|푖 approaches 0 for distantly-spaced points. The value of 푝푖|푖 is set to zero since it is of no interest to compare an object’s similarity with itself. Likewise, the similarity between 푦푖 and 푦푗, the low- dimensional mapping of points 푥푖 and 푥푗, respectively, is modeled by 푞푗|푖 according to:

2 푒푥푝 (−‖푦푖 − 푦푗‖ ) 푞푗|푖 = 2 (24) ∑푘≠푖 푒푥푝 (−‖푦푖 − 푦푘‖ )

Since every point in the high-dimensional space has its own Gaussian distribution of variance 휎푖, and the lower-dimensional representation does not have this feature, it is not possible for the lower-dimensional mapping to perfectly model its higher dimensional counterpart. Naturally, if the reduction remains true to the original data, then 푝푗|푖 and 푞푗|푖

54 will be equal. SNE seeks to minimize the pair-wise error between 푝푗|푖 and 푞푗|푖 using the sum of the Kullback-Leibler divergences as a cost function:

푝푗|푖 퐶 = ∑ ∑ 푝푗|푖푙표푔 ( ) (25) 푞푗|푖 푖 푗

The cost function in Equation (25) punishes mapping errors asymmetrically; if two close data points are mapped far apart (i.e. 푞푗|푖 ≪ 푝푗|푖), the cost is inflated much greater than if two distant data points are mapped close together (i.e. 푞푗|푖 ≫ 푝푗|푖). It is for this reason that the SNE mapping is said to emphasize retainment of the local structure of the data [68]. The value of the variance of the Gaussians centered on each high-dimensional data point xi is determined by a user-defined value named the perplexity:

퐻 푝푒푟푝푙푒푥𝑖푡푦 = 2 (26)

퐻 = − ∑ 푝 푙표푔 (푝 ) 푗|푖 2 푗|푖 (27) 푗 where H is the Shannon entropy measured in bits. The perplexity, viewed as a smoothing parameter over the effective number of neighbors, has been found to work well with values between 5 to 50 [68]. Smaller and larger perplexities tune the influence of local and global structure, respectively. Resolving the transformation through minimization of Equation (25) involves implementing a gradient descent according to:

휕퐶 훾(푡) = 훾(푡−1) + 휂 + 훼(푡)(훾(푡−1) − 훾(푡−2)) 휕훾 (28) where 훾(푡) is the set of points for the k-th generation, 휂 is the step size, 휕퐶⁄휕훾 is the gradient of the cost function with respect to the map points, and 훼(푡) is an exponential decay term (i.e. decreases to zero exponentially as generations progress). Quite simply, Equation 55 (28) represents successive movement in the direction of decreasing cost. The gradient descent is initialized by sampling lower-dimensional points from an isotropic Gaussian of small variance centered at the origin. With the basics of SNE complete, it is now possible to consider the modifications which constitute t-SNE.

The t-SNE algorithm differs from SNE through two improvements: 1) use of a symmetric cost function and 2) use of a heavy-tailed distribution for calculating probabilities in the lower-dimensional space [68]. The first alteration, the symmetric cost function, is given by:

푝푖푗 퐶 = ∑ ∑ 푝푖푗 푙표푔 ( ) 푞푖푗 (29) 푖 푗

The key difference between Equation (25) and Equation (29) are the conditional and joint probabilities, respectively. In Equation (25), there is no such stipulation that 푝푗|푖 = 푝푖|푗 or

푞푗|푖 = 푞푖|푗. On the other hand, in Equation (29), it is true that 푝푖푗 = 푝푗푖 and 푞푖푗 = 푞푗푖 for all i, j. The high-dimensional joint probabilities are defined in terms of the previous conditional probabilities according to:

푝푖|푗 + 푝푗|푖 푝 = 푝 = 푖푗 푗푖 2푛 (30) where n is the number of observations. This benefit of this change is a gradient which is faster to compute compared to SNE. Second, the low-dimensional joint probabilities are calculated using a Student’s t-distribution with one degree of freedom:

2 −1 (1 + ‖푦푖 − 푦푗‖ ) 푞푖푗 = 2 −1 (31) ∑푘≠푙(1 + ‖푦푘 − 푦푙 ‖ )

56 The purpose of using a distribution with heavier tails is to avoid a shortcoming associated with the SNE algorithm known as “crowding” [68]. Compared to a lower-dimensional space, a higher-dimensional space contains more volume to separate distance points. This concept is best illustrated by a simple example. In 2 dimensions, it is possible for 3 points to be spaced equidistant from one another in the form of an equilateral triangle. However, in 1 dimension (i.e. line), there is no option to preserve this equidistant, pair-wise orientation. Consequently, at least one pair of points will be separated farther in the lower- dimensional space than its corresponding separation in the high-dimensional space. In terms of joint probabilities, 푞푗푖 ≪ 푝푗푖 since ‖푦푖 − 푦푗‖ is greater for such a pair of points.

Extending this phenomenon to a data set of greater dimension and many more samples, the

SNE optimization will attempt to “correct” for these representation errors by attracting map points closer together, resulting in the inappropriate crowding of already proximal, low- dimensional points. A distribution with heavier tails counteracts the aforementioned attractive force by allocating greater probability to 푞푖푗 for distantly-spaced map point pairs.

Ultimately, t-SNE takes advantage of SNE’s ability to model similar data points close together in lower-dimensional space while at the same time trying to retain dissimilar data points farther apart in lower-dimensional space. These capabilities make t-SNE better equipped to capture the local and global structure of high-dimensional, non-linear data sets.

Inspection of the t-SNE algorithm on a frequently used data set serves to better familiarize the reader with its functionality. A classic problem common to pattern recognition is the identification of handwritten digits. Although unrelated to chemoinformatics and QSAR modeling, examining these techniques applied to different high-dimensional data sets can be insightful. Derksen visually compared PCA and t-SNE 57 applied to the MNIST database of handwritten digits using the Python programming language, the code of which has been modified for this presentation [69]. The MNIST database contains 70,000 images of handwritten digits each of size 28 x 28 pixels. Each pixel has a grayscale intensity value ranging from 0 (black) to 1 (white). Therefore, each sample lies in a 784-dimensional descriptor space where each pixel is an individual descriptor. Illustrating the input data, 15 randomly-selected digits are shown in Figure 15:

Figure 15: MNIST handwritten digits.

For computation expediency, the analysis presented here is limited to 10,000 of the total

70,000 samples contained within the MNIST dataset. First, PCA is applied to the 10,000 samples as shown in Figure 16:

58

Each point represents a digit such as those shown in Figure 15. The points are color-coded by digit as illustrated in the figure’s legend. “PC1” and “PC2” refer to the first and second principal components, respectively.

Figure 16: Principal component analysis of the MNIST handwritten digits.

The first and second principal components shown in Figure 16 only retain approximately

9.8% and 7.0% of the variability of the data set, respectively. Scrutinizing the plot more closely, PCA is capable of loosely grouping the digits into separate clusters. For example, the 0’s (blue) on the right side of the plot are segregated completely from the 1’s (orange) toward the upper left side of the plot. However, there is considerable overlap of several digit groups at the origin and between similarly-shaped digit clusters at their respective locations

(e.g. the 7’s (gray) and 9’s (cyan) occupy much of the same positions toward the bottom left of the plot). Next, the t-SNE algorithm is applied to the same subset of handwritten digits.

59 van der Maaten and Hinton recommend reducing the dimension of an excessively high dimensional data set to less than or equal to 50 dimensions before applying t-SNE.

Therefore, PCA is used to reduce the dimensionality to 50 components before proceeding with t-SNE. Additionally, the perplexity is set at the default value of 30. The results of the algorithm are shown in Figure 17:

Note: “Dim1” and Dim2” refer to the two t-SNE embeddings, respectively.

Figure 17: t-Distributed stochastic neighbor embedding applied to the MNIST handwritten digits.

Examining the resulting figure, the most striking difference is t-SNE’s ability to partition the digits into largely distinct clusters. Thus, it would be said that t-SNE is preserving the local character of the data set by keeping similar samples in close proximity within the

60 embedded space. On the other hand, the PCA and t-SNE representations do share a number of similarities. First, clusters of similarly-shaped digits do have a degree of overlap. As an example, the 3’s (red) form a secondary cluster within the primary 5’s cluster (brown), itself having a secondary cluster. Second, each of the clusters have other digits nested inside.

Scrutinizing the plots of each method from a global perspective, the clusters of similarly- shaped digits are positioned close to each other. For example, the 3’s (red), 5’s (brown), and

8’s (yellow)all occupy the center of the plot. Furthermore, the 0’s (blue) and 1’s (orange) are placed on the opposite sides of the reduced dimensional representation, similar to the

PCA representation. Regarding a known drawback of t-SNE, the exact placement of clusters and the distances between clusters appear to be arbitrary. As noted by Wattenberg, Vigas, and Johnson on their commentary of the t-SNE algorithm, the global configuration of sample clusters within the reduced space is not guaranteed to be meaningful [70], [71], [72].

In all, t-SNE is a stochastic process highly dependent upon its input parameters and designed primarily for visualization [68], [71]. A parametric form of t-SNE utilizing a deep neural network for regression or classification tasks was put forth by van der Maaten in 2009 [73].

The benefits, if any, of the application of such an algorithm to a high-dimensional, chemical data set are presently unexplored.

2.5 Learning Algorithms

A number of statistical models and machine learning algorithms have been for formulated to explain trends in experimental data and predict future observations. In 2015, Devinyak and Lesyk reviewed the literature to uncover the most frequently used learning methods in

61 QSAR analysis [9]. The authors conducted a bibliometric-based analysis of the QSAR literature from 2009 to 2014 using the top-10 molecular modeling and medicinal chemistry journals ranked by Google Scholar. The most widely used QSAR modeling methods from molecular modeling journals in 2009 and 2014, in order from highest to lowest occurring, are shown in Table 5:

Table 5: The most frequent QSAR modeling methods in years 2009 and 2014.

PLS-2D and PLS-3D refer to the use of PLS for model building using 2D and 3D descriptors, respectively. Taken from [9].

2009 2014 Approximate Approximate Method Method percentage percentage PLS-3D 32 PLS-3D 28 MLR 29 RF 19 SVM 12 MLR 11 PLS-2D 10 SVM 10 ANN 5 PLS-2D 8 RF 4 NB 7

Over the 5-year period examined, PLS-3D remained the most frequently used modeling technique. Other machine learners, such as random forest and naive Bayes, witnessed roughly 4-fold and 8-fold increases in frequency, respectively. Multiple linear regression, a conventional technique, saw a substantial drop-off in usage and is increasing viewed as

“inferior to more complex and advanced” machine learning approaches [9]. The intent of this subsection is to provide a brief summary of the most common learning algorithms applied in QSAR modeling. 62 The most straightforward method for describing the relationship between a quantitative response variable and a set of descriptor variables is multiple linear regression

(MLR). The MLR model assumes the response, Y, can be decomposed into its mean value dependent upon a linear function of the descriptors and a normally-distributed error term

[6], [53], [61], [74]. Expressed in matrix notation [53], [61]:

′ 2 푌̃ = 퐸(푌̃) + 휀̃ = 푋̃ 훽̃ + 휀̃ where 퐸(휀̃) = 0̃ and 퐶표푣(휀̃) = 휎 퐼̃ (32) where 퐸(푌̃) is the expected value of the response variable, 푋̃ is the design matrix, 훽̃ contains the regression coefficients, and 휀̃ is the term accounting for measurement error and the effects of any variables not directly considered by the model. The least squares estimates of the regression coefficients in Equation (32) are as follows:

푏̃ = (푋̃′푋̃)−1푋̃′푦̃ (33) where 푦̃ are the sample response values. The MLR model is easy to understand and interpret; for example, the estimate, 푏푗, the regression coefficient of the j-th descriptor, informs the investigator how the response is expected to change with respect to a unit increase in the corresponding descriptor when all other descriptors are held constant. On the other hand,

MLR suffers from instabilities when the number of samples is too few compared to the number of descriptors or when the descriptor variables are highly intercorrelated, both frequent occurrences in QSAR studies [53], [59], [61]. Lastly, for complex, likely non-linear chemical phenomena encountered in QSAR, the predictive performance of MLR is often below that of other machine learning algorithms [6].

QSAR models are often developed to solve classification problems. For instance, an investigator might want to determine if a chemical ingredient in a commercial product poses

63 a mutagenetic risk, or if a psychoactive drug candidate will permeate the blood-brain barrier to reach its target. The experimental data of such problems include dichotomous response variables, that is, taking a value of 0 if a compound is inactive and 1 if it is active. It is convenient to treat the response variable of such problems as a Bernoulli random variable.

Recalling the discrete probability distribution of a Bernoulli random variable [53]:

푃(푌 = 1) = 휋 and 푃(푌 = 0) = 1 − 휋 (34) where π is the probability of a positive outcome. If the MLR model from Equation (32) is applied to this situation, then it is evident that the probability of the response taking a value of 1 is a linear function of the descriptors. From the perspective of MLR, such a model is problematic for several reasons: 1) the error terms are not normally distributed, 2) the error variance is not constant, and 3) the linear function of the descriptors is unbounded whereas the probability it is modeling is given by [53]:

0 ≤ 퐸(푌̃) = 휋̃ ≤ 1 (35)

A solution to the aforementioned shortcomings of applying the MLR model to a binary response variable is to instead describe the expectation of the response using a logistic function:

′ 푒푋̃ 훽̃ 퐸(푌̃) = 휋̃ = (36) 1 + 푒푋̃′훽̃ The logistic regression coefficients in Equation (36) are found numerically via maximum likelihood estimation [53]. To illustrate, an example logistic regression with two descriptor variables is shown in Figure 18:

64

The first two plots show the marginal fitted probability curves. The third plot shows the decision boundary in descriptor space. “LV1” and “LV2” refer to the first and second PLS latent variables, respectively.

Figure 18: Example logistic regression with two descriptors variables.

The two left-most plots in Figure 18 show how the probability of a positive outcome changes with corresponding descriptor variable in addition to the data producing the fit. The right- most plot shows the logistic regression decision boundary in the descriptor space; observations to the right and left of the boundary are predicted to be positive and negative, respectively. Logistic regression has the same advantages and disadvantages of MLR.

Similar to MLR, a unit increase in a descriptor 푋푗 translates into an change in the estimated odds of the response, 휋̂⁄(1 − 휋̂), by a factor of 푒푏푗 . However, logistic regression still employs a linear decision boundary and therefore cannot adapt well to non-linear phenomena. An almost identical method functionally to logistic regression is linear discriminant analysis (LDA) [61]. LDA projects the data onto a vector which maximizes the ratio of the interclass to intraclass sample variance of the response variable. Whereas logistic regression makes no assumptions about the distribution of the descriptor variables, 65 LDA assumes the descriptors are distributed multivariate normally with equal variance- covariance matrices [61], [75].

Another simple classification algorithm is the Naive Bayes classifier. As discussed by Mitchell, this algorithm makes predictions using Bayes theorem and assumptions of conditional independence [76]. The objective of the classifier is to learn 푃(푌|푋), the probability of a positive outcome given descriptor data. Using the extended form of Bayes theorem and sample values, this probability may be estimated by the following [76]:

푃(푋 = 푥푘|푌 = 푦푖)푃(푌 = 푦푖) 푃(푌 = 푦푖|푋 = 푥푘) = (37) ∑푖 푃(푋 = 푥푘|푌 = 푦푖)푃(푌 = 푦푖) where 푦푖 is the i-th value of the discrete random variable 푌 and 푥푘 is the k-th value vector of a collection of p variables such that 푋 = [푋1, 푋2, … , 푋푝]. The estimation of 푃(푋|푌) can be grossly simplified by assuming each 푋푗 is mutually independent of all other 푋푘’s given

푌. A classifier will determine the value of 푌 given the set of descriptor variables

푋1, 푋2, … , 푋푝, or, mathematically [76]:

푃(푋1, 푋2, … , 푋푝|푌 = 푦푖 )푃(푌 = 푦푖) 푃(푌 = 푦푖 |푋1, 푋2, … , 푋푝) = (38) ∑푖 푃(푋1, 푋2, … , 푋푝|푌 = 푦푖)푃(푌 = 푦푖 )

Assuming conditional independence in the descriptor variables, Equation (38) becomes

[76]:

푃(푌 = 푦푖) ∏푘 푃(푋푘|푌 = 푦푖) 푃(푌 = 푦푖|푋1, 푋2, … , 푋푝) = (39) ∑푖 푃(푌 = 푦푖) ∏푘 푃(푋푘|푌 = 푦푖)

Predicting the class for a query consists of finding the value of 푌 for which Equation (39) is maximized.

66 Support vector machines (SVMs) are typically employed for learning QSAR classification problems that are not easily separable linearly. The goal of SVM is to transform the samples of the original descriptor space into a space of higher dimensions such that a hyperplane separating the classes is optimized. Such optimization involves maximizing the margin, or distant between the points closest to the discriminating hyperplane, those points themselves referred to as support vectors [6], [8], [49], [77], [78].

From linear algebra, distances from the margin are determined via computation of dot products. The transformation to the higher-dimensional space, which may not be explicitly known, is accomplished via the “kernel trick”; calculating the kernel of two points within the input space can be shown to be equivalent to calculating the dot product of the points in the higher-dimensional space [6], [8], [49], [77], [78]:

′ 퐾(푥̃푖, 푥̃푗) = 휙(푥̃푖) 휙(푥̃푗) (40) where 퐾 denotes the kernel function and 휙(푥̃푖) the high dimensional representation of the data pointr 푥̃푖 in the original descriptor space. It is through the choice of kernel function, with polynomial or radial basis functions being the most common, that a non-linear boundary (relative to the input space) can be learned. A shortcoming of SVMs, among technical difficulties, is that interpreting the results of the trained model can be difficult or impossible due to the obfuscation introduced by the kernel function [6], [8], [49], [77], [78].

The random forest (RF) algorithm is a popular learning method within the field of QSAR modeling [6], [8], [49]. As outlined by Lewis and Wood, the method consists of a large number (e.g. 100 to 500) of individual decision or regression trees together constituting a "forest" [6]. Each tree is constructed using a random sample of observations with replacement and a random subset of descriptors considered at each tree node split. Averaging or summing the predictions across the forest after training typically produces accurate predictions. The method is advantageous since the importance of individual descriptors can be gleaned from the fitted model and the variance of predictions across the trees can be used to estimate the overall error [6].
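A hedged sketch of this procedure using scikit-learn's RandomForestClassifier (toy data; the parameter choices are illustrative, not those of any study cited here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A "forest" of 200 trees; each tree sees a bootstrap sample of observations
# and a random subset of descriptors at every split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X, y)

# Descriptor importances can be read directly from the fitted model.
print(rf.feature_importances_)
```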

Finally, artificial neural networks are exhibiting some resurgence in QSAR modeling with the emergence of deep learning. As discussed by Mitchell and Lo et al., an artificial neural network (ANN) is a machine learning approach based upon a rudimentary understanding of the structure and organization of biological neurons [8], [49], [78]. In the context of ANN, a "neuron" is essentially a function, such as the logistic function, which calculates an output, or "activation", given several inputs. The output of a neuron is its activation function applied to a linear combination of the neuron's inputs and the set of weights associated with that neuron.

The simplest of ANNs consists of three layers of neurons: an input layer, a hidden layer, and an output layer. Multiple neurons are associated with each layer and layers may be connected consecutively. The output layer has one neuron per response variable to be predicted and is therefore capable of modeling multiple responses. During the training process, a gradient descent optimization is conducted via the backpropagation procedure. The ultimate result of the optimization is to adjust the neuron weights to learn abstract features and minimize predictive error [8], [49], [78]. Queries are assessed by passing their descriptors as input through the final network. Deep learning is an extension of traditional ANNs characterized by many hidden layers and more complex architectures [8], [49]. While the performance of ANNs may rival the best machine learning techniques, they often require large training sets and are computationally slow to learn [8], [49], [78]. Furthermore, the importance of the descriptors is lost as the signal traverses the network [78].
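As a minimal illustration of this three-layer architecture, the sketch below fits a one-hidden-layer network with logistic activations using scikit-learn; the data and layer sizes are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Input layer (20 descriptors) -> one hidden layer of 10 logistic neurons
# -> output layer; weights are adjusted by gradient descent with backpropagation.
ann = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                    solver="sgd", learning_rate_init=0.1, max_iter=2000,
                    random_state=0)
ann.fit(X, y)
print(ann.predict(X[:5]))
```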

2.6 Research Objectives

In the context of chemical space and QSAR modeling, it is reasonable to hypothesize that the training molecules most similar to a query compound are of most interest for predicting its biological activity. Large, diverse data sets in which multiple underlying mechanisms likely explain the observed relationships between chemical structure and in vivo response favor this expectation. Evidence to substantiate these notions is provided by the previous work of Guha et al. and Buchwald et al. Therefore, the first objective of this work is to implement a local QSAR modeling method. This method will adopt a radius-based approach to gather, upon presentation of a query for prediction, a subset of proximal training samples which it will use to learn a model specific to the query.

Another facet of local QSAR modeling is the idea that the descriptors most relevant toward capturing the structural information correlated with the endpoint of interest may be dependent upon the location in chemical space. Biological activity is frequently the result of multiple biochemical pathways, and thus different structural features can ultimately induce similar effects. QSAR studies on congeneric sets of compounds are typically more successful from a learning perspective since non-congeneric sets containing information across several mechanisms can serve to confuse the model [1]. Allowing local QSAR models the degree of freedom to select descriptors most pertinent to their region of the chemical space can prevent the introduction of noise from global or other local trends. The limited work on this topic by Armanfard et al. and Pichara and Soto supports this line of reasoning, although their approaches are somewhat complex and unexplored in the field of QSAR analysis. Thus, a secondary objective of the proposed work is to conduct descriptor selection on the local training data of each query. This will be accomplished by re-introducing all descriptor data once a local training subset has been identified, followed by the use of descriptor selection to tailor the final descriptor subset specifically for the query in question.

Critical to a majority of QSAR investigations is dimensional reduction. Data sets with many features, especially those lacking the requisite number of training samples for characterizing such an expansive chemical space, will suffer from the effects of the curse of dimensionality and multi-collinearity. Dimensional reduction techniques like PCA have been employed in a multitude of studies to alleviate these problems. However, techniques such as PCA constitute a linear projection from the original space, and thus can distort the true configuration of the observations. Naturally, this is a complication for local modeling strategies, which depend upon the proper identification of training instances close to the query. The aforementioned distortion will be more extreme if the data conforms to a non-linear manifold. Addressing these issues in turn, the tertiary objective is to incorporate both PLS and t-SNE as options for dimensional reduction. The data will be processed globally using these methods before local subsets are identified for query-specific modeling.

The fourth and final objective of this work focuses on model interpretability. While non-linear machine learning methods such as Support Vector Machines are known for their superior predictive performance, their obscurity and lack of mechanistic transparency hinder their acceptance in molecular design and regulatory settings. The proposed workflow retains the use of linear methods specifically to explain the correlations between the descriptor and response variables. In some instances, depending on the nature of the data, local linear models can approximate non-linear behavior quite well. In cases where the proposed approach executes a non-linear dimensional reduction technique, the original descriptor space is lost to interpretability. By utilizing local linear models at the query level after training subsets have been obtained, a meaningful descriptor space is restored.

The satisfaction of these objectives by the proposed method constitutes a novel QSAR modeling workflow which aims to achieve predictive performance comparable to existing techniques while providing explanatory information specifically tailored for each test instance.

Chapter 3: Methodology

3.1 Proposed Methodology

The primary purpose of this chapter is to leave the reader with a clear, detailed understanding of the proposed local QSAR methodology. Accomplishing this task begins by reviewing a general QSAR modeling workflow like those discussed in Chapter 1.

Constructing a model begins by obtaining a training data set containing the response variable(s) of interest. Applying the model requires a test data set for which predictions on the response variable are desired. This test set may be generated as a subset of the training set through such processes as cross-validation or it may be truly external to the training data. In the former case, model predictions are compared with the actual endpoint values to characterize performance. It is assumed here that the training and test sets have already been curated so that the digital representations of the molecules used in any experimental assays are correct and that the numerical data is overall free from errors. Furthermore, it is assumed that the molecules of the training and test sets have already been represented by a set of chemical descriptors. The details regarding the particular curation and descriptor calculation software chosen in this work will be reviewed later in this chapter. The general model training and prediction-generating process is as follows:

1. The training and test sets undergo pre-processing, where samples with missing values are resolved, descriptors are removed based on a variance threshold, and numerical descriptors are mean-centered and standardized to remove bias resulting from scale.
2. A descriptor selection algorithm may be applied to the data.
3. A dimensional reduction algorithm may be applied to the data.
4. An algorithm is applied to learn the applicability domain of the model, deciding which test observations can be predicted with sufficient confidence.
5. A learning algorithm is applied to find a mapping from the descriptor to response space.
6. The mapping from the learning algorithm is applied to the test set(s) and the performance evaluated.

It should be noted that this modeling workflow is not rigid; it can take on various forms depending on the nature of the algorithms selected to complete each step. For example, if a wrapper method is used for descriptor selection, then the overall algorithm will iterate between the descriptor selection and learning algorithms until a stopping condition is obtained. The process is summarized graphically in Figure 19.


Figure 19: General QSAR modeling workflow.
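A rough sketch of steps 1 through 3 and 5 of this workflow in scikit-learn (the package used later in this work) may help fix ideas; step 4, the applicability domain, is handled separately later in this chapter, and all data and parameter values below are placeholders:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for descriptor matrices and activities.
X_train, y_train = make_classification(n_samples=300, n_features=50, random_state=0)
X_test = X_train[:20]  # stand-in for an external test set

# Step 1. Pre-processing: scale the test set with the TRAINING statistics.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Step 2. Univariate descriptor selection (ANOVA F-test, as an example).
selector = SelectKBest(f_classif, k=20).fit(X_train_s, y_train)
X_train_f, X_test_f = selector.transform(X_train_s), selector.transform(X_test_s)

# Step 3. Dimensional reduction with PLS (binary endpoint coded 0/1).
pls = PLSRegression(n_components=2).fit(X_train_f, y_train)
T_train, T_test = pls.transform(X_train_f), pls.transform(X_test_f)

# Step 5. Learning algorithm mapping the reduced space to the response.
clf = LogisticRegression().fit(T_train, y_train)
print(clf.predict(T_test))
```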

At this point, adapting the general workflow for local modeling requires inclusion of a mechanism to identify which training observations are in proximity to the test queries in the global descriptor space. Such identification is invoked before the learning algorithm is applied. The word "global" denotes that all operations performed on the data up to now have involved the descriptor space characterized by the entire training set. Proceeding forward, two approaches are possible: identify the test observation's k-nearest training neighbors, or identify all training samples contained within a hypersphere of pre-defined radius r in descriptor space centered at the query. The former approach is advantageous in that it can adapt to the varying density of the data set and guarantee a local training set of size k. On the other hand, the k-nearest neighbors of relatively isolated queries may be at such a great distance as to be too dissimilar to represent the query's structure appropriately. The alternative, a radius-based approach, will gather local training sets of various sizes contingent on the density of the data set and, for isolated queries, may not identify enough training samples to properly learn a model. However, the proximity of the data to the test instances is guaranteed to be preserved by the bound. In this work, a radius-based approach is used for locality assessment, sacrificing the ability to predict every query for preservation of proximity in the global descriptor space.

Once "local" training sets have been identified for each test query, a simple means of building a predictive model for each instance is to repeat the general QSAR modeling process outlined in Figure 19. Before executing the general workflow, all descriptors, some of which may have been eliminated due to statistical insignificance with the response variable during descriptor selection at the global level, are re-introduced at the local level. This distinction separates the proposed methodology from the majority of previous local QSAR studies, which use the globally significant subset of descriptors to learn a model for each query. In contrast, the workflow presented herein performs descriptor selection on the whole set of descriptors in combination with the local training samples specific to each test observation, allowing each descriptor subset to be unique to the test observation in question.

The final algorithm for the proposed work can now be described by the following steps, divided into “phases” for clarity:

3.1.1 Pre-processing Phase

The purpose of pre-processing is to prepare the data for model building by addressing issues such as missing sample information, differences in scale among descriptors, and sparsity of binary descriptors.

1. Training and test molecules with missing values on any of the chemical descriptors are removed from the data set. An alternative "data-saving" option would be mean imputation of missing values, but this is not performed in this work.
2. Chemical descriptors with no variance are eliminated from the data set. The remaining continuous-valued descriptor data is mean-centered and standardized to avoid introducing scale biases during dimensional reduction and parameter estimation of the learning algorithm.
3. Binary descriptors, such as those representing the presence or absence of certain molecular fragments, are removed if their frequency among the training molecules does not exceed a specified value, set to 3 in this work.
4. The test data is likewise mean-centered and standardized using the statistics of the training data. A sketch of these pre-processing steps appears below.
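The following is a minimal sketch of the pre-processing phase, assuming the descriptor data is held in hypothetical pandas DataFrames with binary columns coded 0/1; the helper name is illustrative:

```python
import pandas as pd

def preprocess(train, test, min_count=3):
    """Sketch of the pre-processing phase on hypothetical descriptor DataFrames."""
    # Step 1: drop molecules with missing descriptor values (no imputation).
    train, test = train.dropna().copy(), test.dropna().copy()

    # Step 2: remove zero-variance descriptors.
    keep = train.columns[train.var() > 0]
    train, test = train[keep].copy(), test[keep].copy()

    # Step 3: remove sparse binary descriptors (frequency <= min_count).
    is_binary = train.isin([0, 1]).all()
    sparse = is_binary & (train.sum() <= min_count)
    train, test = train.loc[:, ~sparse].copy(), test.loc[:, ~sparse].copy()

    # Steps 2 and 4: mean-center and standardize continuous columns
    # using the TRAINING statistics for both sets.
    cont = train.columns[~is_binary[train.columns]]
    mu, sd = train[cont].mean(), train[cont].std()
    train[cont] = (train[cont] - mu) / sd
    test[cont] = (test[cont] - mu) / sd
    return train, test
```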

3.1.2 Global Phase

The primary purpose of the global phase is to identify subsets of training samples which are in proximity to the test instances. A secondary objective is defining the method's applicability domain. Finally, conclusion of the global phase results in a global model which is compared to the subsequently developed local models to investigate which strategy offers the best performance.

1. The training and test molecules are cast into a single set of descriptors. This descriptor set comprises the "global" descriptor space. Two options are explored and compared within this work: either using all available descriptors of non-zero variance or refining the set using descriptor selection. Regarding the latter, univariate descriptor selection is employed to find descriptors correlating well with the endpoint.
   a. For continuous and count-based descriptors, a one-way analysis of variance (ANOVA) is conducted between the two levels of the response (active and inactive).
   b. For binary descriptors, Fisher's exact test is applied. For all tests, a level of significance of 0.05 determines the cutoff between descriptors retained and eliminated.
   Global models with and without univariate descriptor selection enabled are compared to understand how this process affects performance.
2. The dimensionality of the global descriptor space is reduced in preparation for the application of a distance metric for applicability domain and locality assessments. Partial least squares (PLS) regression and t-distributed Stochastic Neighbor Embedding (t-SNE) are used exclusively in this workflow. The two methods are not used together within the same model; the performance of models using each is compared. Dimensionality reduction is necessary to avoid complications associated with high-dimensional spaces and to further discriminate descriptor data correlated with the endpoint.
3. The density of the training data around the test queries is evaluated to define the model's applicability domain. Test molecules with few or no training neighbors represent an extrapolation and therefore should be excluded from prediction. This is accomplished using a two-parameter approach (sketched after this list):
   a. A nearest neighbor integer k and quantile value q are chosen. For each training molecule, the Euclidean distance to its k-th nearest neighbor among all remaining training molecules is determined.
   b. The distance corresponding to the q-th quantile of the distribution of neighborhood distances determines the query domain threshold. If any test molecule does not have at least k training molecules within a hypersphere of radius corresponding to the query domain threshold, then the test molecule is said to be out of domain and will not be predicted. For this work, k and q are taken to be 3 and 0.95, respectively.
4. The training molecules in proximity to test molecules are found by examining the latter's locality in the resolved descriptor space. Training molecules local to a test query constitute the query's "neighborhood" and are selected by specifying a radius r such that all training molecules within a hypersphere centered on the test molecule's position are considered neighborhood members. The Euclidean distance metric is chosen to calculate all distances. The members of a test molecule's neighborhood serve to construct the local model most relevant for predicting the molecule's response.
5. A learning algorithm is applied to the entire training set to determine a mapping from the resolved descriptor space to the response space. This is referred to as a "global model" since it is constructed through use of all available training molecules. Logistic regression, which models the probability of a test molecule being active or inactive, is selected as it is most analogous to multiple linear regression when mapping to a dichotomous response variable. Such a linear function is favorable from an interpretability perspective because the contribution of each descriptor toward explaining the endpoint in question can be quantified through examination of the sample regression coefficients. It follows naturally to compare the performance of global and local models since global models are the most common type of modeling strategy found in literature QSAR studies.
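The two-parameter applicability domain of step 3 might be sketched as follows, using scikit-learn's NearestNeighbors; the function names are hypothetical:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def domain_threshold(train_X, k=3, q=0.95):
    """Distance threshold: the q-th quantile of the k-th nearest neighbor
    distances among training molecules (k=3, q=0.95 in this work)."""
    # k+1 neighbors because the nearest neighbor of a training point is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(train_X)
    dist, _ = nn.kneighbors(train_X)
    return np.quantile(dist[:, k], q)  # column k = distance to the k-th true neighbor

def in_domain(test_X, train_X, k=3, q=0.95):
    """A query is in-domain if at least k training molecules lie within
    the threshold radius of it."""
    r = domain_threshold(train_X, k, q)
    nn = NearestNeighbors().fit(train_X)
    dist, _ = nn.kneighbors(test_X, n_neighbors=k)
    return dist[:, k - 1] <= r  # k-th training neighbor of the query within r
```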

3.1.3 Local Phase

The purpose of the local phase is to iterate over each test molecule and execute the general QSAR workflow of Figure 19, ultimately producing an explanatory local model and prediction for each query.

1. The raw data for each training molecule in the test molecule's neighborhood is re-introduced. To clarify, chemical descriptors removed during the global phase due to univariate selection are again applicable if they have non-zero variance within the query's local training set.
2. The local training and test data are mean-centered and standardized.
3. The neighborhood training molecules and query molecule are cast into a single set of descriptors. This descriptor set comprises a local descriptor space. The choice of using univariate selection to refine the final subset of local descriptors remains an option at this step. A performance comparison is made between models utilizing and not utilizing descriptor selection at the local level.
4. The dimensionality of the local descriptor space is reduced in preparation for the application of a learning algorithm. Partial least squares (PLS) regression is used exclusively for this task. Similar to the global phase, dimensionality reduction is employed to further discriminate local variance correlated with the endpoint and to aid in the algorithm's estimation of the model parameters.
5. The learning algorithm is applied to estimate the parameters defining the local relationship between the chemical descriptors and the endpoint. Again, logistic regression is the learning algorithm used. This choice facilitates fairer comparison with the resulting global models. A sketch of the local phase as a whole follows this list.
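Putting the local phase together, a simplified sketch of the per-query loop is given below. Local univariate selection (step 3) is omitted for brevity, and all function and variable names are hypothetical:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def predict_locally(X_global, X_raw_train, y_train, X_raw_test, queries_global,
                    r=0.5, d_local=2):
    """Sketch of the local phase: one model per query.

    X_global / queries_global: training and test coordinates in the reduced
    global space; X_raw_train / X_raw_test: the re-introduced raw descriptors.
    """
    nn = NearestNeighbors(radius=r).fit(X_global)
    _, neighborhoods = nn.radius_neighbors(queries_global)
    preds = {}
    for i, idx in enumerate(neighborhoods):
        if len(idx) < 2 or len(np.unique(y_train[idx])) < 2:
            continue  # too few samples (or one class only): no local model
        # Steps 1-2: re-introduce raw descriptors, scale with local statistics.
        scaler = StandardScaler().fit(X_raw_train[idx])
        X_loc = scaler.transform(X_raw_train[idx])
        x_query = scaler.transform(X_raw_test[i:i + 1])
        # Step 4: local PLS reduction; Step 5: local logistic regression.
        pls = PLSRegression(n_components=min(d_local, len(idx) - 1))
        pls.fit(X_loc, y_train[idx])
        clf = LogisticRegression().fit(pls.transform(X_loc), y_train[idx])
        preds[i] = clf.predict(pls.transform(x_query))[0]
    return preds  # queries without enough neighbors are left unpredicted
```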

The proposed methodology in its totality is summarized graphically in Figure 20:


Figure 20: Proposed local QSAR methodology.

3.2 Model Optimization

The various options permitted within the local QSAR modeling workflow necessitate a grid search over the input parameter space in order to optimize performance for the data set in question. Later in the document, the effect of these parameters on model performance is discussed. The following input parameters are explored in this work:

1. The option to use univariate descriptor selection to refine the global descriptor space. The alternative is to retain all available descriptors.
2. The choice of PLS or t-SNE for reducing the dimension of the global descriptor space.
3. The number of latent or embedded variables into which the global descriptor space is reduced. Recall that local training sets for each query are determined from this space. In the interest of capturing a sufficient amount of data variability (i.e. more latent variables) and retaining model interpretability (i.e. fewer latent variables), the following values are explored: 2, 3, 4, 6, and 8.
4. The number of latent or embedded variables into which a query's local descriptor space is reduced. The same value is used for each query, if possible, given the variability of the local training set data. For the same reasons as with global dimensional reduction, the following values are used: 2, 3, 4, 6, and 8.
5. The option to utilize univariate descriptor selection to refine a query's local descriptor space. The alternative is to use all available descriptors.
6. The radius r defining the size of a query's local training set in the global descriptor space. As the radius increases, local training sets become larger and more queries have a sufficient number of samples to learn a predictive model. Therefore, a series of incrementally increasing values is used such that approximately 0 to 95% of the query compounds are predicted by local models (the upper limit results from the applicability domain assessment).

The remaining parameters, for example the level of significance for univariate selection steps, are not varied and could be the topic of future investigations.
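The grid search over these six parameters can be expressed compactly; the sketch below is illustrative only, and the radius values and the build_and_evaluate_model callback are hypothetical:

```python
from itertools import product

# Hypothetical grid over the six input parameters explored in this work.
grid = product(
    [True, False],               # global univariate selection on/off
    ["PLS", "TSNE"],             # global dimensional reduction method
    [2, 3, 4, 6, 8],             # global dimensions d_g
    [2, 3, 4, 6, 8],             # local dimensions d_l
    [True, False],               # local univariate selection on/off
    [0.05, 0.1, 0.2, 0.4, 0.8],  # locality radius r (illustrative values)
)
for gs, method, d_g, d_l, ls, r in grid:
    pass  # build_and_evaluate_model(gs, method, d_g, d_l, ls, r)  # hypothetical
```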

3.3 Evaluation Data Sets

The following two data sets, an Ames mutagenicity data set compiled by Hansen et al. in 2009 and a blood-brain barrier data set compiled by Muehlbacher et al. in 2011, are used to benchmark the proposed methodology on real-world QSAR applications [50], [79].

3.3.1 Ames mutagenicity - Hansen et al., 2009

The Ames test, developed by Bruce Ames and colleagues in the late 1960s and early 1970s, is frequently used by pharmaceutical companies to provide an early indicator of potential carcinogenicity and/or teratogenicity for candidate compounds [50], [80]. It is also necessary to assess the mutagenic potential of commercial and industrial chemicals for their safe use. The test consists of exposing histidine-dependent strains of Salmonella typhimurium grown on a histidine-deficient medium to the test compound. Molecules capable of interacting with genetic material can restore histidine synthesis and bacterial colony growth (i.e. "revertants"). Some compounds, such as aromatic amines or polycyclic aromatic hydrocarbons, become mutagenic only after activation by metabolic enzymes; therefore, a mammalian metabolizing system (i.e. "S9 fraction", usually derived from rat liver tissue) is often added along with the test compound [50], [80]. A test molecule is judged Ames positive if it produces significant revertant colony growth in at least one of five commonly used strains either with or without the addition of S9 fraction. Oppositely, a test molecule is deemed Ames negative if no significant revertant colony growth is observed from any strain with and without S9 fraction [50]. The assay is illustrated in Figure 21:


Taken from [81].

Figure 21: The Ames test for mutagenicity.

The application of QSAR models toward the prediction of Ames mutagenicity serves the purpose of quickly screening potential drug compounds computationally, therefore saving time and financial resources associated with actually performing the assay on all candidates. Facilitating the development of better QSAR models, the data set compiled by Hansen et al. is intended to serve as a large, publicly transparent Ames mutagenicity data set for QSAR model benchmarking and performance comparisons [50].

The data is collected across six sources with the distribution shown in Table 6:


Table 6: Source distribution of molecules within the Ames mutagenicity data set.

Taken from [50].

Source                AMES positive   AMES negative   Total
CCRIS [82]                     1359            1180    2539
Kazius et al. [83]             1375             849    2224
Helma et al. [84]                81              57     138
Feng et al. [85]                280             111     391
VITIC [86]                      386             808    1194
GeneTox [87]                     22               4      26
Total                          3503            3009    6512

Furthermore, the distribution of compounds by molecular weight is illustrated in Figure 22. The authors partitioned the molecules into a static training set and five split sets, as depicted in Table 7. Benchmarking a model involves a cross-validation procedure, training on the static set and four of the five split sets followed by prediction on the remaining split set. The procedure repeats until predictions are generated for all split sets. For the results presented on this data set later in the document, this procedure is followed, and performance statistics are derived from predictions on the combined splits. A sketch of this benchmarking loop follows.
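The protocol can be sketched as below, with hypothetical fit and predict callbacks standing in for the full modeling workflow:

```python
import numpy as np

def benchmark(static_set, splits, fit, predict):
    """Hansen et al. protocol sketch: `splits` is a list of five index arrays;
    `fit` and `predict` are hypothetical model callbacks."""
    predictions = {}
    for i, held_out in enumerate(splits):
        # Train on the static set plus the other four splits.
        train_idx = np.concatenate(
            [static_set] + [s for j, s in enumerate(splits) if j != i])
        model = fit(train_idx)
        predictions[i] = predict(model, held_out)
    return predictions  # statistics are computed over the combined splits
```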


Adapted from [50].

Figure 22: Molecule weight distribution of compounds within the Ames mutagenicity data set.

Table 7: Partitioning of the Ames mutagenicity data set for benchmarking.

Taken from [50].

                 Static training set   Split 1   Split 2   Split 3   Split 4   Split 5
# of compounds                  1585       984       985       984       987       987

Lastly, the experimental reproducibility of the Ames test is noted to be approximately 85%, with the difference due to interlaboratory error [49]. As a result, this finding places an upper limit on the performance of any modeling attempt.

3.3.2 Blood-Brain Barrier - Muehlbacher et al., 2011

The blood-brain barrier (BBB), as the name implies, is a highly regulated boundary which separates the bloodstream from the brain's interstitial fluid (ISF) [79], [88]. The capillary endothelial cells within the brain are closely packed, forming tight junctions which severely reduce paracellular transport, unlike that observed within other tissues of the human body.

Approximately 98% of all small molecules are rejected by the barrier, making the development of new drugs to treat diseases associated with the brain difficult. Molecular access to the brain’s interstitial fluid is limited to 1) lipid-mediated diffusion (aka “passive transport”) and 2) carrier or receptor-mediated transport (aka “active transport”). A diagram of the blood-brain barrier with mechanisms of transport is shown in Figure 23.

The vast majority of existing drugs which do target the brain arrive there via lipid-mediated diffusion and are lipophilic in nature, having a molecular weight less than 400 and forming fewer than 8 hydrogen bonds in water [88]. There is a great need to develop brain-targeting pharmaceuticals considering that a number of cancers (e.g. glioblastoma multiforme) and neurological diseases (e.g. Alzheimer's and Parkinson's diseases) lack effective treatments.


Taken from [89].

Figure 23: Diagram of the blood-brain barrier with annotated mechanisms of transport.

Better understanding of the transport potential of chemical compounds across the BBB would allow potential drug molecules which act on targets within the brain to be identified and scrutinized in a time and resource-efficient manner. Additionally, the side effects and safety profile of any drug or commercial compound may change significantly if the compound can reach the ISF through the BBB [79], [88].

Characterizing BBB transport begins with measurement of a molecule’s ability to move through the boundary, typically represented by its log(BB) value [79]:

$$\log(BB) = \log\left(\frac{C_{brain}}{C_{blood}}\right) \tag{41}$$

where $C_{brain}$ and $C_{blood}$ are the concentrations of the compound in the ISF and blood, respectively. Experimental methods measuring BBB permeability use artificial membranes, cell cultures, and in vivo studies [79]. Muehlbacher et al. compiled a large log(BB) data set to evaluate the predictive performance of a cross-sectional area descriptor developed by their collaboration [79]. The data is collected across 12 sources with the distribution shown in Table 8. The Muehlbacher et al. data set contains 362 unique compounds (there is overlap between sources in Table 8), 199 of which are positive and 163 of which are negative. Of the 362 molecules, 18 compounds are removed as they are known substrates of P-glycoprotein and therefore may move through the BBB by active instead of passive transport. Another 6 compounds are removed due to ambiguity of their reported values in the literature. Lastly, after cleaning the structure-data file provided by Muehlbacher et al., an additional 10 molecules are identified as duplicates and are removed.

Therefore, 328 molecules from the BBB data set are presented to the proposed workflow for modeling. Figure 24 shows the distributions of molecular weights and log(BB) values for these compounds:

Table 8: Source distribution of molecules within the blood-brain barrier data set.

Produced with information from listed sources and [79].

Source                        # of compounds   Experimental details
Vilar et al. [90]                        195   in vivo (mostly rats)
Platts et al. [91]                       119   in vivo (rat) and in vitro
Naranayan and Gunturi [92]                38   in vivo (rat)
Mente and Lombardo [93]                   94   in vivo (mostly mouse and rat)
Zhang et al. [94]                        147   unspecified
Abraham et al. [95]                      197   in vivo (rat) and in vitro (human and rat)
Garg and Verma [96]                      168   unspecified
Guerra et al. [97]                       106   in vivo (rat)
Rose et al. [98]                          95   in vivo (rat)
Kelder et al. [99]                        36   in vivo (rat)
Konovalov et al. [100]                   165   in vivo and in vitro (rat)
Zerara et al. [101]                       36   unspecified

Adapted from [79].

Figure 24: Molecule weight (left) and log(BB) (right) distributions of the blood-brain barrier data set.

3.4 Computational Tools and Molecular Descriptors

The proposed modeling workflow is programmed in Python 3.7 and executed within the integrated development environment (IDE) Spyder, both obtained via the Anaconda distribution [102], [103], [104]. The Anaconda distribution is an open-source platform containing Python and many associated packages to enable data manipulation, statistical analysis, machine learning, and graphics generation [104]. A few notable packages are relied upon greatly for implementation of the proposed work. Subroutines for PLS, t-SNE, and logistic regression are implemented using the machine learning package scikit-learn [105]. The manipulation and analysis of tabular data is completed with the pandas package [106]. RDKit, a chemoinformatics package, is used for processing of structure-data files and the generation of molecule images [20]. Additional packages are used but not listed here; all packages can be identified through study of the source code available in the appendix. Finally, some images of molecules contained herein are produced using MarvinSketch and/or MarvinView, software for editing and displaying chemicals from ChemAxon Ltd [107].

Molecular structure curation and descriptor calculation for the Ames mutagenicity and blood-brain barrier data sets are completed using CORINA Symphony, chemoinformatics software from Molecular Networks GmbH and Altamira, LLC [28]. The continuous- and count-type descriptors used in this work are described in Appendix B. In addition to these descriptors, the ToxPrint chemotypes are included in the numerical representation of all chemicals. As explained by Yang et al., in this context a chemotype is a structural fragment (either connected continuously or disjointed) optionally combined with physicochemical properties on individual atoms, bonds, fragments, electronic systems, or possibly the whole compound [108]. Molecules can be queried against the definition of the chemotype, and fulfillment or non-fulfillment of the definition by the molecule can be indicated by a "1" or "0", respectively, constituting a binary descriptor. Chemotypes are expressed in Chemical Subgraphs and Reactions Mark-up Language (CSRML), an XML-based language designed uniquely for their application. Furthermore, a publicly-available software application, ChemoTyper, was developed by Molecular Networks GmbH and Altamira, LLC for visualization and querying of structural data sets with chemotype definitions [108]. This program is used herein to provide examples of a select few ToxPrint chemotypes to showcase the chemical information they capture.

Specifically, the ToxPrint chemotypes library is a publicly-available, pre-defined set of 729 chemotype definitions containing structural fragments and properties derived from toxicity data covering over 100,000 compounds [108]. These compounds, which serve as a learning set, are quite diverse in origin; sources include a variety of domains such as pharmaceuticals, food and cosmetic ingredients, and industrial chemicals. The ToxPrint chemotypes themselves were designed using a bottom-up approach starting with simple functional groups and fragments identified as relevant from FDA and EPA safety assessment rules and previous drug development studies. An iterative process of molecule clustering, pattern searching against target compound sets, and refinement of chemotype definitions resulted in a more complex and discriminating set of descriptors [108]. The full compilation of ToxPrint chemotypes is available via the ChemoTyper program, which is free upon registry at https://toxprint.org/. Illustrating their versatility, a few chemotypes present within molecules of the Ames mutagenicity data set are included below. First, the data set is searched for the nitro functional group, and the results of the query are shown in Figure 25. In this example, the structures for the nitro group do not need to be queried individually, as is the case with traditional subgraph search algorithms, because the chemotype specifies the appropriate π-electron system. Another instance, shown in Figure 26, illustrates the search results for an aromatic ether structure. Multiple compounds fulfilling the query are depicted in the results pane:

Software from [108].

Figure 25: A screenshot of the ChemoTyper program querying the Ames mutagenicity data set for the nitro functional group chemotype (right). A molecule from the set (left) contains the nitro group and is highlighted accordingly.


Software from [108].

Figure 26: A screenshot of the ChemoTyper program querying the Ames mutagenicity data set for an aromatic ether chemotype (right). Several chemicals from the set (left) contain the chemotype and are highlighted.

Chapter 4: Results and Discussion

Before proceeding with a presentation of the results and discussion of the proposed methodology, some preliminary information is provided to aid the reader's understanding of the chapter. The design of the proposed local QSAR workflow is such that only a certain fraction of the test set is predicted at any given time. The fraction of test set molecules predicted by any model is referred to as its "coverage" in this document. Coverage varies from 0 to q, where q is the quantile defining the boundary of the applicability domain and is set at 0.95 in this work. The following subsections of this chapter explore the effects of various input parameters on model performance, and the results are often depicted graphically as a function of the model's coverage. Individual data points within these plots represent the performance characteristics of unique models as specified by the models' sets of input parameters. Most figures and tables are titled with a string similar to that shown in Figure 27 to inform the reader of the input parameters chosen for the models in the graph:

Figure 27: Model input parameter string.

The first indicator displays whether PLS or t-SNE is used to reduce the dimension of the global descriptor space. "GS" and "LS" correspond to "global univariate selection" and "local univariate selection", respectively, and can be "ON" or "OFF" depending on whether descriptor selection is performed at that point in the workflow. Next, the values of $d_g$ and $d_l$ indicate the numbers of dimensions to which the global and local descriptor spaces are reduced, respectively. Lastly, the value of $r$ indicates the radius used to construct query-specific, local training sets within the global descriptor space. Some of the aforementioned information may be removed or other information added depending on the context of the graph.

4.1 In-Depth Analysis of the Ames mutagenicity Data Set

The Ames mutagenicity data set consists of 6,512 molecules in total. These molecules are partitioned into a static training set of 1,585 compounds for which no predictions are made and a test set of 4,927 compounds predicted using a 5-fold cross-validation scheme.

4.1.1 The Effect of Radius on the Fraction of Predicted Compounds

As discussed in Chapter 3, the local training set for a test or query compound is defined by those molecules positioned within a hypersphere of radius r centered at the query in the global descriptor space. The coverage is largely dependent upon the magnitude of the radius used to determine local training sets for each query. Naturally, smaller radii will tend to produce local training sets with fewer molecules while larger radii will tend to result in local training sets with many compounds. Test compounds with sparse local training sets will fail to learn a model for multiple reasons; for example, from a lack of significant descriptors (i.e. no significant variance among too few compounds) or from instabilities in the numerical maximization of the likelihood function when estimating the logistic regression coefficients. This notion is conveyed in Figure 28, showing the coverage increasing monotonically with radius across several models:

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global descriptor selection enabled; “LS(ON)” – local descriptor selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 28: The effect of radius on coverage in a global descriptor space of low dimension.

This relationship is further demonstrated by examining the distribution of compounds among query-specific training sets. Figure 29 displays the distribution derived from the point (0.075, 0.329) in Figure 28. The mean and standard deviation of the distribution are 9 and 6 molecules, respectively, and no query has more than 37 training samples:

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global descriptor selection enabled; “LS(ON)” – local descriptor selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions; Locality-defining radius of 0.075.

Figure 29: Distribution of local training set sizes for a small radius.

Figure 30 portrays the distribution derived from the point (0.8, 0.934) in Figure 28. The mean local training set size is 616 compounds and the standard deviation 410 compounds. The smallest and largest sets contain 6 and 1,990 molecules, respectively. Naturally, an objective of local modeling is to optimize the size of the queries' training sets; sets that are consistently too small will lack enough samples to successfully learn models and will ultimately fail to predict a useful portion of all test compounds. On the other hand, while many large sets will make it possible to generate more predictions, the larger radii required to produce them will diminish their individual local character. However, in the context of the Ames mutagenicity data set, even a set with 1,990 compounds is still to some degree "local" as it utilizes roughly 36% of the available training samples.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global descriptor selection enabled; “LS(ON)” – local descriptor selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions; Locality-defining radius of 0.8.

Figure 30: Distribution of local training set sizes for a large radius.

The density of molecules within a descriptor space is affected by the number of dimensions used to represent that descriptor space; as a result, the distribution of compounds found within local training sets will depend on the number of dimensions. For an equal number of samples, projections into lower dimensional spaces tend to be more densely populated than higher dimensional spaces since the latter have more "volume" for samples to occupy. It is expected that coverage will increase more gradually with radius in a global space with many versus few dimensions due to differences in sample density. This expectation is confirmed by comparing Figure 28 with Figure 31:

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global descriptor selection enabled; “LS(ON)” – local descriptor selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 31: The effect of radius on coverage in a global descriptor space of high dimension.

Global univariate descriptor selection has no obvious effect on the change in predicted test set fraction with radius. This is evident by comparing Figure 31 with Figure 32:


“PLS” – dimensional reduction technique reducing global descriptor space; “GS(OFF)” – global descriptor selection disabled; “LS(OFF)” – local descriptor selection disabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 32: The effect of radius on coverage in a global descriptor space of high dimension without univariate descriptor selection.

Descriptor selection applied to the global training set reduces the number of descriptors from approximately 500 to 250, both of which greatly exceed the 2 to 8 dimensions resulting from dimensional reduction. Because both the selected and unselected descriptor sets are of such high dimension and both are reduced to fewer than 10 dimensions, the magnitude of the reduction is largely equivalent in either case, and the resulting sample density is governed by the projection to the much lower dimensional descriptor space.

Reducing the dimension of the global descriptor space using t-SNE can introduce, at least when compared to similar graphs generated from PLS, an apparent irregularity in the relationship between radius and coverage. This is illustrated in Figure 33, where the fraction of predicted molecules appears relatively constant when the radius is between approximately 0.03 and 0.1:

“TSNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global descriptor selection enabled; “LS(ON)” – local descriptor selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(8)” – local space reduced to 8 dimensions.

Figure 33: The effect of radius on coverage for a t-SNE embedded global descriptor space.

Additionally, the relationship between radius and predicted fraction does not increase monotonically, in contrast to the progression of models derived from PLS transformations.

The stochastic nature of the t-SNE algorithm can orient similar molecules in unpredictably tight or sparse groupings in the reduced dimensional embedding depending on the solution to the gradient descent minimization. An example of a 2-dimensional t-SNE reduced global descriptor space is shown in Figure 34:


“TSNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global descriptor selection enabled; “LS(ON)” – local descriptor selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(8)” – local space reduced to 8 dimensions.

Figure 34: A 2-dimensional, t-SNE embedded global descriptor space.

Examining the graph, the scatter of molecules around the origin is largely uniform. However, there do appear to be some tight groupings of compounds, especially toward the periphery. Wattenberg et al. demonstrated with a series of toy examples that the t-SNE algorithm tends to "equalize" original data densities by location; tight clusters tend to be expanded and loose clusters contracted in the final embeddings [71]. The presence of equalized sample clusters would explain a range of radii over which queries contain an approximately equal number of training samples. For queries situated among these clusters, radii extending to the average size of the groupings would incorporate the same local training samples until increased to such a size as to capture samples in neighboring clusters or from adjacent scatter. This underlying structure would translate to a flat portion in the coverage versus radius plot over the range of radii corresponding to the size of the clusters themselves.

4.1.2 The Effect of Global and Local Descriptor Space Dimension on Model Performance

The following analysis presents the change in four performance measures, namely sensitivity, specificity, accuracy, and the Matthews correlation coefficient (MCC), with the reduced dimension of the global and local descriptor spaces across multiple models. Recall from Chapter 2 the definitions of the aforementioned measures:

$$sensitivity = \frac{TP}{TP + FN} \tag{42}$$

$$specificity = \frac{TN}{TN + FP} \tag{43}$$

$$accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{44}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}} \tag{45}$$

where $TP$, $TN$, $FP$, and $FN$ are the numbers of true positive, true negative, false positive, and false negative predictions, respectively, made by the model in question. Figure 35 illustrates the change in sensitivity, specificity, and accuracy with the fraction of locally predicted test set compounds across several local model ensembles. For all models, global and local descriptor spaces are reduced to the smallest dimension possible at 2 PLS latent variables. Univariate descriptor selection is implemented during both global and local phases. The light, dark, and dashed lines in each graph indicate the performance measures of the local, global-on-local, and global models, respectively. The "global-on-local" data points represent the global model's predictions on the same fraction of test molecules predicted by the local model ensembles. The performance of any model typically improves as the applicability domain is narrowed, in essence limiting predictions to those test compounds far from the classifier's decision boundary. Comparing local and global model performance on the same fraction of test molecules differentiates the benefits conferred by constructing local models from those of a global model predicting only its most confident instances.
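For reference, Equations (42) through (45) translate directly into code; the helper below is a hypothetical convenience, not part of the workflow's source:

```python
import numpy as np

def classification_metrics(tp, tn, fp, fn):
    """Compute Equations (42)-(45) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn)))
    return sensitivity, specificity, accuracy, mcc
```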

Examining each plot in Figure 35, the global model has a uniform sensitivity, specificity, accuracy, and MCC of 78.7%, 46.4%, 65.6%, and 0.253, respectively. Global model predictions do not change as global models always predict approximately 95% of the test set molecules deemed within the applicability domain.


In each plot, the lighter color indicates local model ensemble performance, the darker color global-on-local performance, and the dashed line overall global model performance (on the entire test set). “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 35: PLS, GS(ON), LS(ON), G(2), L(2) sensitivity, specificity, and accuracy versus the fraction of predicted test set compounds.

In contrast, predictions made on individual compounds by local models can revert class or withhold prediction as the training sets change with radius. Regarding sensitivity performance, the global model consistently outperforms the local models on the same subset of test molecules, though their performance converges as the coverage reaches its maximum. This does not necessarily indicate a good global model, though, as the correspondingly low specificity indicates a tendency to make blanket positive predictions without proper discrimination. On the other hand, the local model ensembles consistently perform better at predicting negative queries than the global model, the specificity of the former being about 10% to 15% greater than that of the latter. Scrutinizing a collection of local models predicting 4,620 queries, 1,802 of these models have a prevalence (i.e. the percentage of positive training samples) less than 50% whereas the global model has a prevalence of 54%. The presence of local models with a greater number of negative training samples biases specificity performance in favor of the local strategy. Finally, the accuracy of the global model is slightly better than that of the local model ensembles until the coverage reaches roughly 60%. Past this point, the accuracy of the local model ensembles surpasses the global model, the difference being due to the local model ensembles' classification of negative queries. The performance of the global and local models at the maximum of the test set's applicability domain is shown in Table 9:

Table 9: PLS, GS(ON), LS(ON), G(2), L(2) global and local performance at the maximum of the applicability domain.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions.

          Coverage   Sensitivity   Specificity   Accuracy     MCC
Local        93.4%         78.8%         69.0%      71.1%   0.360
Global       93.4%         78.8%         46.4%      65.6%   0.254

Generally, the performance of the local model ensembles in each graph increases with coverage, the latter of which is directly related to radius and local training set size. This behavior is explained by the few latent variables used to describe the global and local descriptor spaces. Much of the variance in the descriptor data is lost with so few latent variables; therefore, incorporating more samples into the local models represents an information gain and further assists the classifiers in discriminating compound activity.

Figure 36 displays how sensitivity, specificity, and accuracy vary with coverage when the global descriptor space is reduced to a higher dimension, 8 PLS latent variables, compared to the 2 PLS latent variables shown in Figure 35. All other parameters remain unchanged. Immediately evident is the gain in sensitivity, specificity, and accuracy for virtually all local model ensembles across the coverage domain. Furthermore, the local models outperform the global model by approximately 3% to 5% for the majority of the coverage domain on each measure. Local model performance does decrease consistently as coverage increases, eventually matching or slightly underperforming relative to the global model once 93% of the queries are predicted. Highlighting an instance of the latter, the specificity of the local model ensemble is 60.4% whereas for the global model it is 64.5%. The increasing global character of the individual local models with increasing radius explains the decrease and eventual convergence in performance observed in each graph.

The maximum difference between local and global models occurs at approximately 60% coverage and is shown in detail in Table 10:

Table 10: Performance of PLS, GS(ON), LS(ON), G(8), L(2), local models predicting 60% of the test molecules.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

                    Coverage   Sensitivity   Specificity   Accuracy     MCC
Local                  60.0%         82.0%         68.8%      76.9%   0.456
Global                 60.0%         77.3%         65.6%      72.8%   0.393
Global (Overall)       94.0%         77.2%         64.8%      72.2%   0.383


In each plot, the lighter color indicates local model performance, the darker color global-on-local performance, and the dashed line overall global model performance (on the entire test set). An anomaly occurs around 0% coverage in which the models only make positive predictions resulting in 100% sensitivity and 0% specificity. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 36: Global and local PLS, GS(ON), LS(ON), G(8), L(2) model sensitivity, specificity, and accuracy versus coverage.

Of note, it is much more difficult to visualize and interpret the global descriptor space of the local model ensembles conveyed in Figure 36 than that of Figure 35. The reason for this difficulty is the number of dimensions; the global descriptor space of the models depicted in Figure 36 comprises 8 reduced dimensions, each a linear combination of 254 original molecular descriptors. Such high-dimensional spaces cannot be visualized without generating multiple 2-D or 3-D plots, and extracting meaningful information from the 254-by-8 matrix of PLS rotations (coefficients correlating the original descriptors with the PLS latent variables) is likewise a difficult, time-consuming task.

Alternatively, the descriptor spaces of the local models from both figures, retaining 2 PLS latent components, are more easily visualized and interpreted. The most populated local training sets are derived from, at maximum, 134 significant descriptors. Furthermore, 50% of local training sets have fewer than 54 significant descriptors, constituting a more manageable number. Analyzing prediction models at the local scale offers the advantage of elucidating mechanisms responsible for observed activity which are not as evident from global models. Conversely, while the associations underlying the predictions of individual queries can be ascertained, an acknowledged disadvantage of this approach is the lack of readily identifiable, globally applicable trends without systematic review and coordination of the outcomes across multiple local models.

Evidence toward the effect of local dimension reduction on sensitivity, specificity, and accuracy is presented in Figure 37. It is anticipated that additional latent variables will capture more descriptor information correlated with the response so long as the local training sets have sufficient samples and variance to benefit from those added components. Each plot depicts performance for local model ensembles with 2, 4, and 8 PLS latent variables. To clarify, the same curve shown in Figure 36 with 2 PLS latent variables is again shown in Figure 37 along with the overall global model. Closely examining each graph, increasing the number of dimensions used to represent the local descriptor spaces confers no clear and consistent benefit. For a coverage of approximately 10% to 35%, more dimensions provide a modest improvement in each measure of 1% to 3%. In some models, additional latent variables are detrimental; for example, the accuracy of local models decreases from 79% to 75.3% when predicting roughly 45% of the test set. Finally, the local model ensembles with 4 and 8 latent variables do not lose performance as quickly as the models constructed with 2 latent variables for coverages ranging from 85% to 95%, though the differences are relatively modest. Illustrating this latter observation, the MCC at roughly 90% coverage for 2 versus 8 latent variables is 0.414 and 0.434, respectively. While additional dimensions may be required to sufficiently capture the information available to local training sets with larger sample sizes, increasing their number also introduces a greater possibility of overfitting due to the added degrees of freedom available to the learning method. In all, if the sole objective is to maximize model performance on as many test instances as possible, then additional dimensions should be considered. However, these perceived gains could be overshadowed by the loss in interpretability and the potential lack of generalizability these models may exhibit when applied to an external test set.


In each plot, the blue, green, and purple lines indicate the corresponding performance of local model ensembles whose local training sets are reduced to 2, 4, 8 dimensions, respectively. The gray line indicates the performance of the global model. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions.

Figure 37: PLS, GS(ON), LS(ON), G(8) local model ensemble sensitivity, specificity, and accuracy versus coverage.

This subsection concludes with a comparison of the highest performing global models and local model ensembles. Global and local model ensemble performance is plotted on a common axis, that is, the number of dimensions of the descriptor space in which the logistic regressions are learned. The search for optimal global models involves a parameter space between 2 and 100 PLS latent variables with univariate descriptor selection both enabled and disabled. This search is depicted by the black and gray lines in Figure 38. In terms of accuracy and MCC, the best performing global model without descriptor selection occurs with 17 latent variables at 76.7% and 0.451, respectively. Likewise, the best performing global model including descriptor selection has an accuracy of 75.5% and an MCC of 0.434 with 36 PLS latent variables. It should be mentioned that comparable models can be obtained with as few as 15 latent variables; therefore, the graphs in Figure 38 only span up to 20 dimensions. Alternatively, the local model ensembles shown in Figure 38 are derived from an 8-dimensional global space and between 2- and 8-dimensional local spaces. Univariate selection is either utilized or absent simultaneously during the global and local phases of model construction. In order to keep the comparison as fair as possible, all local models displayed predict more than 90% of the test set molecules. Examining each measure, the local models outperform the global models by roughly 3% to 5%, and in some cases as much as 10%, over the range of the dimension of the local descriptor space. For a more detailed comparison, the best local model ensemble with 8 local dimensions has a sensitivity, specificity, accuracy, and MCC of 81.0%, 69.7%, 76.5%, and 0.451, respectively. On the other hand, the best global model has a sensitivity, specificity, accuracy, and MCC of 83.4%, 66.8%, 76.7%, and 0.451, respectively, but must retain 17 global dimensions to obtain this result.


In each plot, the blue and red lines represent local model ensembles with univariate selection configurations GS(OFF), LS(OFF) and GS(ON), LS(ON), respectively. The gray and black lines represent global models with univariate selection configuration GS(OFF) and GS(ON), respectively. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” or “GS(OFF)” – global selection enabled or disabled; “LS(ON)” or “LS(OFF)” – local selection enabled or disabled; “G(8)” – global space reduced to 8 dimensions.

Figure 38: PLS, G(8) sensitivity, specificity, and accuracy versus modeling space dimension for a collection of global and local model ensembles.

Through identification of training samples more relevant to explaining the activity of test queries than the training set as a whole, the proposed methodology is able to perform as well as global approaches in a significantly smaller modeling space. These more parsimonious local models benefit from easier interpretability and at least equal generalizability compared to their global counterparts.

4.1.3 The Effect of Univariate Descriptor Selection on Model Performance

This subsection explores how univariate descriptor selection implemented during the global and local phases of the workflow affects model performance. Because univariate selection may or may not be utilized at each phase, four possible input configurations are available for this parameter alone. Figure 39 depicts the performance measures of a collection of local model ensembles versus coverage under each univariate selection scheme. In the figure, 8 PLS latent variables are retained globally and 2 PLS latent variables are retained locally. Scrutinizing the major trends within each graph, utilizing univariate selection during the local phase of model construction is beneficial particularly when local training set sizes tend to be smaller, corresponding to coverages less than roughly 60%. For example, the sensitivity and accuracy of local models with local univariate selection enabled (blue and green) in this range are approximately 3% to 5% above those of models not utilizing univariate selection locally (orange and red). The separation in specificity between models with and without local univariate selection is less obvious, and perhaps of no meaningful difference, with the exception of the smallest and largest models at the extremes of the graph.
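
The exact univariate filter used in the workflow is specified in Chapter 3; purely as an illustration of a filter of this general kind, a per-descriptor class-separation test might look as follows (the Welch t-test criterion and the alpha threshold are assumptions of this sketch, not the workflow's actual settings):

import numpy as np
from scipy.stats import ttest_ind

def univariate_select(X, y, alpha=0.05):
    # X: (n_samples, n_descriptors) matrix; y: binary activity labels.
    # Keep descriptors whose class means differ significantly. Note the
    # filter judges each descriptor in isolation, ignoring multivariate
    # effects -- the weakness discussed later in this subsection.
    keep = []
    for j in range(X.shape[1]):
        pos, neg = X[y == 1, j], X[y == 0, j]
        if pos.std() == 0 and neg.std() == 0:
            continue  # constant descriptor: no discriminating power
        _, p = ttest_ind(pos, neg, equal_var=False)
        if p < alpha:
            keep.append(j)
    return np.array(keep, dtype=int)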


In each plot, the blue, orange, green, and red lines correspond to univariate selection configurations GS(ON), LS(ON); GS(ON), LS(OFF); GS(OFF), LS(ON); and GS(OFF), LS(OFF), respectively. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” or “GS(OFF)” – global selection enabled or disabled; “LS(ON)” or “LS(OFF)” – local selection enabled or disabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 39: PLS, G(8), L(2) sensitivity, specificity, and accuracy versus coverage for a collection of local model ensembles under various univariate selection configurations.

When examining the local model ensembles predicting greater than 90% of the queries, implementing univariate selection during the local phase induces worse performance, most evidently with regard to specificity and accuracy, and to a smaller extent with sensitivity. These changes are reflected in the accompanying MCC values as well, all of which are illustrated in Table 11 below:

Table 11: Performance of PLS, G(8), L(2) local models predicting greater than 90% of the test set under various univariate selection configurations.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” or “GS(OFF)” – global selection enabled or disabled; “LS(ON)” or “LS(OFF)” – local selection enabled or disabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Configuration       Coverage   Sensitivity   Specificity   Accuracy   MCC
GS(ON), LS(ON)      92.9%      78.6%         60.5%         71.2%      0.362
GS(OFF), LS(ON)     92.9%      78.1%         61.3%         71.3%      0.365
GS(ON), LS(OFF)     93.9%      80.4%         66.2%         74.7%      0.420
GS(OFF), LS(OFF)    93.7%      80.5%         67.4%         75.2%      0.429

An explanation for the decrease in performance involves the increasingly global character of the local models derived from large radii. In essence, an increasing fraction of these local models become individual global models and adopt the performance characteristics of the global model constructed from 2 PLS latent variables. It is known from Figure 35 that the sensitivity, specificity, and accuracy of this particular global model are weaker than the amalgam of several local models; therefore, the presence of global-like models among the remainder of the local models detracts from overall performance. The question of why local models utilizing univariate selection are particularly susceptible to this detrimental trend involves univariate descriptor selection itself. Univariate selection is an imperfect means of optimizing a set of descriptors, as it is incapable of considering the contribution one descriptor may have in the presence of others when discriminating classes. If this filtering process removes descriptors that are useful in a multivariate context, then it is expected that performance would decrease accordingly in their absence. Additionally, if these discriminating descriptors are removed prior to dimensional reduction in the local phase, then it would be expected that the resulting local PLS projections would reflect the loss of information and transform the observations into a reduced space in which the classes overlap to a greater extent.

4.1.4 The Effect of Dimensional Reduction Method on Model Performance

The purpose of this subsection is to understand the influence of t-SNE as a global phase dimensional reduction technique. Recall from Chapter 3 that either PLS or t-SNE may be implemented to reduce the global descriptor space before identification of local training sets. Since t-SNE is designed to group similar, high-dimensional samples in close proximity within a lower-dimensional space, this algorithm may be of particular usefulness to local modeling strategies.
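
To make the two reduction routes concrete, the sketch below reduces a stand-in descriptor matrix to 2 dimensions with each method using scikit-learn (the synthetic data and parameter values are assumptions for illustration, not the workflow's actual inputs):

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 120))          # stand-in for the descriptor matrix
y = rng.integers(0, 2, size=500)         # stand-in binary activity labels

# Supervised route: PLS latent variables aligned with the response
pls = PLSRegression(n_components=2)
scores = pls.fit_transform(X, y)[0]      # (500, 2) latent-variable scores

# Unsupervised route: t-SNE embedding (no explicit out-of-sample map)
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(X)   # (500, 2) components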

To begin, the t-SNE algorithm is used in Figure 40 to visualize the Ames mutagenicity data in 2 embedded components. Immediately following is an equivalent, 2-dimensional PLS transformation shown in Figure 41 for comparison purposes:

“TSNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(2)” – global space reduced to 2 dimensions.

Figure 40: 2-dimensional, t-SNE embedded Ames mutagenicity global descriptor space.


“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(2)” – global space reduced to 2 dimensions.

Figure 41: 2-dimensional, PLS transformed Ames mutagenicity global descriptor space.

Not unexpectedly, the t-SNE embedding and PLS transformation differ considerably. The former tends to group molecules with similar activity into tight clusters of various sizes. These groupings appear more prevalent at the periphery of the plot, while molecules are more uniformly scattered toward the origin. In contrast to t-SNE, PLS generally separates molecule classes along its first latent variable: Ames positive compounds tend to have positive scores on the first latent variable, whereas Ames negative compounds tend to have negative scores. There are also a number of separate, “outlying” compounds distant from the center of mass. Finally, it is readily apparent that both t-SNE and PLS exhibit a high degree of overlap between molecules of opposite classes.

Next, the effect of perplexity on local model ensembles resulting from a global t-SNE embedding is explored. Recall that perplexity is viewed as a smoothing parameter over the effective number of neighbors between samples. This parameter is typically set between 5 and 50, with larger values corresponding to increased emphasis on preservation of the global structure of the unembedded data. Figure 42 depicts three collections of local model ensembles produced from a t-SNE embedding to 2 global dimensions and a PLS transformation to 2 local dimensions for each query. Univariate selection is utilized during the global and local phases. From the graphs, the sensitivity of models between 10% and 30% coverage with a perplexity of 5 trails behind those at 30 and 50 by a modest 2% to 4%. Otherwise, perplexity has no appreciable effect on the performance of the local model ensembles. A perplexity of 30 is used for the remainder of the models involving t-SNE, justified by the results of Figure 42 and by its theoretical interpretation as a balance between preserving local and global data structure during optimization.


“TSNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 42: t-SNE, GS(ON), LS(ON), G(2), L(2) local model ensemble sensitivity, specificity, and accuracy versus coverage at three levels of perplexity.

The effect of increasing radius, or coverage, on the performance of a series of t-SNE derived local model ensembles is shown in Figure 43. Univariate selection is conducted during both the global and local phases. Furthermore, the global descriptor space comprises 2 t-SNE embedded components and the local space 2 PLS latent variables. After examining each plot, immediately noticeable is the performance of the local model ensembles above that of the global models when restricted to the same subset of test compounds. This observation is most extreme with regard to specificity and accuracy. On the other hand, the differences in sensitivity between local and global models are less prominent, and the global model sometimes performs equal to or better than the local model ensembles as the coverage approaches 95%. Table 12 compares global and local performance for a few select data points in greater detail, reiterating the previously noted trends. For all measures, the performance of the local model ensembles decreases with increasing coverage, a result of the loss of local character of the individual models with increasing radius. The specificity of the local models appears somewhat immune to this change, remaining relatively constant at approximately 70-75% until a drop is witnessed once a majority of the test set is classified.


In each plot, the lighter color indicates local model ensemble performance, the darker color global-on-local performance, and the dashed line overall global model performance (on the entire test set). “TSNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 43: t-SNE, GS(ON), LS(ON), G(2), L(2) sensitivity, specificity, and accuracy versus coverage for global and local model ensembles.

Table 12: t-SNE, GS(ON), LS(ON), G(2), L(2) select global and local model ensemble performance statistics.

“t-SNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions.

                 Coverage   Sensitivity   Specificity   Accuracy   MCC
Local #1         25.6%      90.5%         77.7%         86.1%      0.602
Global #1        25.6%      83.1%         33.2%         66.1%      0.186
Local #2         48.4%      85.4%         72.8%         80.5%      0.513
Global #2        48.4%      75.2%         37.8%         60.8%      0.139
Local #3         94.2%      77.0%         65.7%         72.3%      0.386
Global #3        94.2%      81.3%         28.5%         59.6%      0.114
Overall Global   94.4%      75.0%         38.2%         59.8%      0.140

Naturally, a comparison of performance between t-SNE- and PLS-based models is warranted. Recall that Figure 35 presents the performance of an equivalent series of PLS models to those shown in Figure 43. Considering the global models first, the PLS transformation retains more endpoint-specific information than its counterpart. For example, the sensitivity, specificity, accuracy, and MCC of the PLS-based global model are 78.7%, 46.4%, 65.6%, and 0.253, respectively. As found in Table 12, one t-SNE derived global model obtains 75.0%, 38.2%, 59.8%, and 0.140, respectively, on the same four measures. Models derived from both PLS and t-SNE dimensional reductions are biased toward the most prevalent class, but the global t-SNE-based model is particularly poor at accurately predicting negative queries relative to the global PLS-based model. It is not unexpected that a global t-SNE derived model would be substandard considering the algorithm's origin as an unsupervised data visualization tool. There is no obvious linear combination of the two embedded dimensions which would discriminate positive and negative molecules in Figure 40. Conversely, the specific intent of PLS is to construct components which both maximize descriptor space variance and are aligned to distinguish response classes, which, as mentioned, is apparent along the first and second latent variables.

Continuing the analysis is an evaluation of the local model ensembles derived from the PLS and t-SNE algorithms. Recall that in 2 dimensions, local models built from the PLS reduced space represented a loss of useful information at small radii, as evident from the general upward trend in performance these models exhibit as the coverage approaches 95%. Switching focus to the t-SNE reduced space, the results of the local training sets determined from its embedded components closely resemble those of the local models constructed from an 8-dimensional PLS reduced descriptor space found in Figure 36. These two sets of local model ensembles are graphed together in Figure 44. Most remarkable for this collection of plots is the gain in specificity exhibited by the t-SNE embeddings relative to the PLS projected space. For example, at test set fractions of approximately 10% and 30%, the specificity is 10% and 5% greater than PLS, respectively. A few select models from the figure are listed in Table 13. Note that coverage, a function of the radii used to define locality, is not directly controllable; therefore, the models of the table do not represent a perfect one-to-one comparison in terms of the exact test compounds receiving predictions.


In the figure, light colored lines correspond to the PLS-based local model ensembles whereas dark colored lines correspond to t-SNE-based local model ensembles. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” or “G(2)” – global space reduced to 8 or 2 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 44: Performance comparison of the PLS and t-SNE derived local model ensembles as a function of coverage.

Table 13: Comparison of select PLS- and t-SNE-derived local QSAR model ensembles.

           Coverage   Sensitivity   Specificity   Accuracy   MCC
PLS #1     25.4%      86.4%         71.8%         81.8%      0.531
t-SNE #1   25.6%      90.5%         77.7%         86.1%      0.602
PLS #2     71.8%      81.2%         65.4%         74.9%      0.423
t-SNE #2   65.2%      78.8%         71.0%         75.6%      0.440
PLS #3     93.0%      78.6%         60.5%         71.2%      0.362
t-SNE #3   93.5%      78.1%         67.8%         73.8%      0.410

The statistics shown in Table 13 highlight the success of the t-SNE-derived local QSAR modeling strategy relative to its PLS-based equivalent. This workflow configuration produces some of the best performing models of the entire investigation, though the differences are moderate and the PLS-based local models remain competitive. Local training sets corresponding to smaller radii obtain the most benefit, reflecting a potential optimum arising from the equalization of cluster densities found in the lower-dimensional mapping. That is to say, a balance exists between radii of sufficient size to capture all relevant training samples based upon their descriptor information and larger radii which induce a degradation in local character. In totality, these results provide evidence of the ability of t-SNE to cluster both positive and negative molecules of a complex QSAR data set which, when coupled with the local modeling strategy, is conducive to prediction and interpretation. Despite the fact that the algorithm was never intended for such applications, there is a demonstrable benefit to applying a distance metric to the embedded space to facilitate the identification of training samples sharing structural characteristics with test samples. Finally, the t-SNE-based approach allows for visualization of the high-dimensional descriptor space in a single 2-D plot. This is in contrast to PLS which, with 8 latent variables, would require $\binom{8}{2} = 28$ separate bi-plots, though it may be argued that some latent variables only explain a small fraction of the total variance and are of minimal importance. A valid counterargument to the former involves the dubious nature of the global structure conveyed by t-SNE’s embeddings.

The t-SNE technique, by itself and in the context of its use in predictive modeling, is not without some notable limitations. First to be discussed is the adaptation required to handle the lack of an explicit map from high-dimensional descriptor spaces to the low-dimensional embeddings. van der Maaten proposed a parametric form of t-SNE which learns a mapping by training a feed-forward neural network [73]. As mentioned in Chapter 3, in this work the t-SNE algorithm operates on the combined training and test sets when learning a representation; said another way, the test samples influence the positions of each other and of the training samples during the optimization process. This can result in consistency problems; interpretations of the effect of molecular structure on the endpoint of interest no longer depend solely on the training data but also on the influence of the particular test data under investigation. There may also be issues with applicability domain assessment for the same reason. Because the global structure of the t-SNE embedding has no inherent interpretation, the positions of and distances between clusters in the embedded space are arbitrary. Furthermore, the aforementioned equalization of data densities could eliminate the notion of outlying data points. With these known behaviors in mind, it is not unreasonable to presume that query compounds could be positioned closer to training molecules than they reasonably should be, thus resulting in suboptimal local training sets for modeling.
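
The combined-embedding adaptation just described amounts to stacking the training and test descriptor matrices before fitting; a minimal sketch (synthetic arrays; only the split bookkeeping matters here):

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 80))     # stand-in training descriptors
X_test = rng.normal(size=(60, 80))       # stand-in test descriptors

# t-SNE has no explicit map, so train and test are embedded together; the
# test rows therefore influence the training rows' coordinates, which is
# the consistency concern raised above.
X_all = np.vstack([X_train, X_test])
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_all)
emb_train, emb_test = emb[: len(X_train)], emb[len(X_train):]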

This topic of discussion is concluded with a review of the stochastic nature of the t-SNE algorithm and its impact on model performance. Recall from Chapter 2 that initialization of the algorithm involves sampling lower-dimensional points from an isotropic Gaussian of small variance centered at the origin [68]. Additionally, the Kullback–Leibler divergence, which is the objective function to be minimized by gradient descent, is non-convex. Therefore, each representation results from one of several possible local minima, and representations obtained from the t-SNE technique can vary between runs [109]. This behavior is demonstrated in Table 14 and Table 15, where performance statistics are presented from 10 repetitions of global and local t-SNE-derived models, respectively. Reviewing each table, the global models exhibit more variance between runs than the local model ensembles. From inspection of the MCC values, some of the individual global models may indeed be no better than random guessing at predicting Ames mutagenicity; the remainder of the global models perform poorly. In contrast, the local models remain much more stable between runs, lending credence to the reliability of t-SNE derived predictions and interpretations for at least this particular set of model input parameters. However, at present it remains unknown whether a more expanded search of the input parameter space might produce better local model ensembles or how the variability in performance of those ensembles might change as those parameters are manipulated.
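
The run-to-run variability has a simple operational source: each repetition corresponds to a different random initialization, so the embedding, and everything built on it, changes with the seed. A sketch of how such repetitions might be generated (synthetic data; random_state is scikit-learn's seeding mechanism):

import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(1).normal(size=(200, 50))  # stand-in descriptors

# The KL objective is non-convex, so each seed converges to a different
# local minimum and yields a different embedding.
embeddings = [
    TSNE(n_components=2, perplexity=30, random_state=seed).fit_transform(X)
    for seed in range(10)
]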


Table 14: t-SNE, GS(ON), LS(ON), G(2), L(2) global model ensemble statistics from 10 repetitions.

“t-SNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions.

Run    Coverage   Sensitivity   Specificity   Accuracy   MCC
1      94.1%      77.4%         33.7%         59.5%      0.122
2      94.2%      76.8%         31.1%         57.9%      0.089
3      94.5%      75.8%         35.5%         59.2%      0.122
4      94.4%      72.4%         28.1%         54.2%      0.005
5      93.9%      74.0%         36.0%         58.4%      0.109
6      94.4%      77.2%         32.1%         58.6%      0.104
7      94.2%      76.0%         35.1%         59.2%      0.121
8      94.1%      75.7%         33.9%         58.5%      0.105
9      94.3%      81.9%         25.4%         58.6%      0.088
10     94.2%      74.3%         38.2%         59.4%      0.132
Min    93.9%      72.4%         25.4%         54.2%      0.005
Max    94.5%      81.9%         38.2%         59.5%      0.132
Mean   94.2%      76.2%         32.9%         58.4%      0.100
Std    0.177%     2.552%        3.860%        1.544%     0.036


Table 15: t-SNE, GS(ON), LS(ON), G(2), L(2), radius 0.1 local model ensemble statistics from 10 repetitions.

“t-SNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(2)” – global space reduced to 2 dimensions; “L(2)” – local space reduced to 2 dimensions; locality-defining radius of 0.1.

Run    Coverage   Sensitivity   Specificity   Accuracy   MCC
1      65.0%      78.0%         70.3%         74.9%      0.430
2      66.1%      79.3%         70.7%         75.8%      0.441
3      65.6%      77.8%         69.0%         74.2%      0.418
4      66.0%      79.1%         70.6%         75.7%      0.442
5      65.2%      80.3%         69.9%         76.1%      0.446
6      66.5%      79.3%         68.6%         74.9%      0.427
7      65.4%      77.9%         70.7%         75.0%      0.430
8      65.1%      79.3%         68.3%         74.9%      0.427
9      65.7%      78.5%         69.7%         74.9%      0.429
10     66.0%      78.5%         70.7%         75.4%      0.437
Min    65.0%      77.8%         68.3%         74.2%      0.418
Max    66.5%      80.3%         70.7%         76.1%      0.446
Mean   65.7%      78.8%         69.9%         75.2%      0.433
Std    0.490%     0.797%        0.922%        0.563%     0.009
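
The summary rows of Table 15 follow directly from the ten run-level values; for instance, the MCC column can be checked as follows (the sample standard deviation reproduces the tabulated 0.009):

import numpy as np

mcc = np.array([0.430, 0.441, 0.418, 0.442, 0.446,
                0.427, 0.430, 0.427, 0.429, 0.437])  # Table 15, MCC column
print(f"min={mcc.min():.3f}  max={mcc.max():.3f}  "
      f"mean={mcc.mean():.3f}  std={mcc.std(ddof=1):.3f}")
# -> min=0.418  max=0.446  mean=0.433  std=0.009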

4.1.5 Prediction Confidence and Smallest Radii Local Model Ensembles

The first topic of discussion is the reliability of predictions made by the local QSAR methodology. Recalling the trends observed in Figure 36 and Figure 43 and their associated tables, it is clear that although more test molecules receive predictions when the locality-defining radii are large, the performance of the local model ensembles eventually declines to match or even fall below that of the equivalent global model. On the other hand, model performance is relatively high for local model ensembles derived from small to medium-sized radii. An unanswered question remains as to whether predictions for test compounds in the latter category are consistently more accurate than those in the former. If such consistency exists, it represents a confidence of prediction: test queries first obtaining predictions from smaller local training sets tend to be “easier” to predict and have a greater likelihood of being correct than those which require larger local training sets in order to receive predictions. By analogy, just as one anticipates the model performance obtained from cross-validation to extend similarly to an external test set, individual queries are anticipated to share the same ease-of-prediction as validation samples resulting from local models of similar size.

Testing the confidence of the local modeling workflow involves identifying a subset of test molecules and tracking the predictions of its members as a function of the locality-defining radius. One such subset is shown in Table 16 for illustration purposes.

Note the progression of predictions for each compound; some test instances alternate between one class and no prediction, or revert from one class to another, between successive local models. This behavior can result from a loss of significant descriptors or a change in training set composition, either of which serves to alter the regression model’s decision boundary.

Table 16: PLS, GS(ON), LS(ON), G(8), L(2) Ames mutagenicity test compound predictions vs. local model radii.

“0” and “1” denote negative and positive Ames mutagenicity predictions, respectively. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

           Test Compound Identifiers
Radius     2849-98-1   163275-58-9   4514-19-6   99523-60-1
0.25       -           -             1           -
0.50       0           1             1           0
0.75       -           0             1           0
1.00       -           0             1           0
1.20       0           1             1           0
1.40       0           1             1           0
1.60       0           1             1           0
1.80       0           1             1           0
2.00       0           1             1           0
2.20       0           1             1           0
2.40       0           1             1           0
2.60       0           0             1           1

Figure 45 displays local modeling performance for the first subset, consisting of 397 test compounds (yellow) for which predictions are available at the smallest of radii. The performance on this subset is compared to that on the entire test set (blue). These plots use radius instead of coverage as the x-axis variable because following the predictions for the same set of test compounds corresponds to a constant coverage. Examining each graph, the performance measures on the subset of test molecules remain largely constant with increasing radius. This is in contrast to the performance on the entire test set, which gradually decreases with increasing radius. There is a modest decline in the performance of the subset at the largest radii; however, the performance statistics of the subset remain consistently above those of the entire test set. This test subset, containing approximately 400 compounds, represents a coverage of approximately 7% to 8% of the total test set.


In the figure, the blue and yellow lines display local model performance for all test molecules and a subset of test molecules, respectively. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 45: PLS, GS(ON), LS(ON), G(8), L(2) local model ensemble sensitivity, specificity, and accuracy versus locality-defining radius for a subset of 397 test compounds and the entire test set.

Figure 46 depicts a similar comparison with a second test subset of 1,252 compounds, or roughly 22% to 23% of the entire test set. In this example, the sensitivity of the subset deteriorates slowly, the specificity falls in close agreement with the entire test set, and the accuracy reflects the same trend observed with the sensitivity. The number of predictions and Matthews correlation coefficients for the models of both subsets as a function of the locality-defining radius are shown in Table 17. The blank entries at the top of the table result from a requirement that the subsets contain a certain number of compounds; the initial radii did not produce enough predictions to fulfill this specification. Furthermore, not every member of each subset receives a prediction at every radius value investigated. Using the MCC as an overall measure of model performance, predictions on the test molecules of these subsets enjoy their initial level of success until the radius reaches approximately 2.2 units. Once the radius becomes large, the local training set sizes have increased to the point where they begin to lose their local character. By similar reasoning, the second subset does not share the performance characteristics of the first subset, since a greater number of the test compounds in the second set are more difficult to predict in comparison to the first.


In the figure, the blue and yellow lines display local model performance for the full set and a subset of test molecules, respectively. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 46: PLS, GS(ON), LS(ON), G(8), L(2) local model ensemble sensitivity, specificity, and accuracy versus locality-defining radius for a subset of 1,252 test compounds and the entire test set.


Table 17: Number of predictions and Matthews correlation coefficients for subsets 1 and 2 versus locality-defining radius.

           Subset 1               Subset 2
Radius     # Predicted   MCC      # Predicted   MCC
0.25       -             -        -             -
0.50       397           0.66     -             -
0.75       306           0.658    1252          0.58
1.00       276           0.658    920           0.634
1.20       316           0.655    1024          0.599
1.40       345           0.648    1132          0.557
1.60       381           0.681    1194          0.572
1.80       392           0.686    1229          0.556
2.00       395           0.689    1242          0.547
2.20       393           0.617    1244          0.526
2.40       397           0.584    1250          0.517
2.60       397           0.545    1252          0.485

The conclusion of this subsection involves a performance-enhancing modification to the proposed methodology gleaned from the previous discussion on prediction confidence. Summarizing that discussion, test compounds predicted from local model ensembles with smaller local training sets tend to enjoy improved performance over those predicted from larger local training sets. This observation forces a trade-off between predicting a relatively small set of test molecules with high quality and confidence and predicting a larger subset of test compounds with reduced quality and confidence.

Additionally, because the locality-defining radius is fixed for any one particular ensemble of local models, test samples which are predicted correctly at smaller radii can become inaccurate at larger radii. A possible solution to this problem is to allow the radius to vary across the ensemble of local models; each test compound retains the prediction from the smallest local training set which generated one. This is accomplished by building multiple local model ensembles over a range of radii from small to large such that the coverage varies from 0% to 95%, similar to a grid search. Once a particular local model ensemble renders a prediction for a test compound, that prediction becomes fixed and the test molecule is no longer predicted by any subsequent local models; a sketch of this sweep is given below. Initial results from this modification, referred to as the smallest radius approach, are considered next.
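
A minimal sketch of the smallest radius sweep follows; predict_at_radius is a hypothetical stand-in for one full pass of the local modeling workflow at a fixed radius, returning a class label for each covered test compound and None otherwise:

def smallest_radius_predictions(radii, predict_at_radius, n_test):
    # Sweep radii from small to large; each test compound keeps the first
    # (smallest-radius) prediction it receives and is never re-predicted.
    final = [None] * n_test
    for r in sorted(radii):
        preds = predict_at_radius(r)
        for i in range(n_test):
            if final[i] is None and preds[i] is not None:
                final[i] = preds[i]  # fix the prediction permanently
    return final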

Figure 47 displays results of the smallest radius approach with PLS as the dimensional reduction method; the results for t-SNE derived local models are found to be similar. Specifically, the dotted gray line in each graph indicates the performance of local model ensembles with variable locality-defining radii. Studying each plot, there is not much difference in performance between local model ensembles with and without varying radii. However, the specificity of local model ensembles with variable radii does not exhibit the sharp decline seen for those without varying radii as the coverage reaches its maximum. This change is also reflected in the accuracy of both types of models, serving to keep the local modeling strategy relevant when attempting to predict as many test molecules as possible. It can be concluded that the modification prevents local models of increasingly global character from overwriting the negative predictions of earlier models derived from smaller, more proximal training sets.

A more noticeable improvement is witnessed when the local training sets are reduced to 8 instead of 2 dimensions, as shown in Figure 48. Here, the decline in sensitivity, specificity, and accuracy is more gradual with increasing radius, retaining an advantage of about 5% over the local models with non-variable radii for intermediate coverage values. Interpreting this trend, predictions frequently revert to an incorrect class when more reduced dimensions are available to describe the local descriptor spaces. This was not observed with the lower dimensional local descriptor spaces of Figure 47, in which predictions tended to remain constant with or without variable radii. Of further note, the difference between the two approaches is insignificant when the fraction of predicted test compounds approaches 95%. A final, numerical comparison of the smallest radius local modeling ensembles of both high and low dimension is available in Table 18, expressing overall performance with the Matthews correlation coefficient.


In the figure, the light, dark, dashed, and dotted (gray) curves correspond to the local, global-on-local, global overall, and smallest radius model performances, respectively. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 47: PLS, GS(ON), LS(ON), G(8), L(2) local model ensemble sensitivity, specificity, and accuracy vs. coverage including the smallest radius approach.


In the figure, the light, dark, dashed, and dotted (gray) curves correspond to the local, global-on-local, global overall, and smallest radius model performances, respectively. “PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(8)” – local space reduced to 8 dimensions.

Figure 48: PLS, GS(ON), LS(ON), G(8), L(8) local model ensemble sensitivity, specificity, and accuracy vs. coverage illustrating the smallest radius approach.

Table 18: Comparison of select local model ensembles with and without varying radii in 2 and 8 reduced local dimensions.

Predicted fraction   MCC           MCC            MCC           MCC
(approximate)        2D, static    2D, dynamic    8D, static    8D, dynamic
                     radii         radii          radii         radii
20%                  0.531         0.532          0.588         0.587
45%                  0.489         0.482          0.431         0.523
70%                  0.423         0.445          0.441         0.452
90%                  0.362         0.419          0.434         0.431

4.1.6 Case Study of Select Molecules from the Ames Mutagenicity Data Set

The intent of this subsection is to showcase some of the interpretive features made available by the proposed methodology. Such information is of most use to toxicologists, medicinal chemists, and chemoinformaticians to assist in the development of safer and more efficacious chemical products. Accomplishing this task involves reviewing global models and local model ensembles and their predictions for some example molecules of the Ames mutagenicity data set. Specifically, two compounds from the set are discussed in detail; the first compound’s mutagenic response is correctly predicted by local model ensembles but incorrectly predicted by the global model. Conversely, the response of the second compound is incorrectly predicted by local model ensembles yet correctly predicted by the global model. In totality, this review should provide a better understanding of the advantages and disadvantages of the proposed QSAR modeling workflow.

The first molecule to be considered is CAS 7203-90-9, otherwise known as 1-(4-Chlorophenyl)-3,3-dimethyltriazene, the skeletal formula of which is shown in Figure 49.

Figure 49: Skeletal formula of 1-(4-Chlorophenyl)-3,3-dimethyltriazene, or CAS 7203-90-9.

Speaking to the specific structural aspects of the compound, the triazenyl group (i.e. three adjacent nitrogen atoms, in blue) is known in the literature to be associated with mutagenic and anti-neoplastic properties [110]. Furthermore, this functional group is included in the ToxPrint chemotypes, which are derived from toxicological data sets and are anticipated to cover toxicologically relevant chemical structures. In all, this information lends explanation to 1-(4-Chlorophenyl)-3,3-dimethyltriazene’s Ames positive activity (i.e. “1”), which is sourced from the U.S. Environmental Protection Agency data [50].

The global, logistic regression-based model, reduced to 8 PLS latent dimensions after univariate feature selection, estimates the probability of the test compound being positive at 0.338. Since the decision boundary occurs at 0.5, the model incorrectly predicts this molecule to be Ames negative (i.e. “0”). While not extremely distant from the decision boundary, this prediction is at the same time not equivocal; any prediction might be thought dubious if the estimated probability of a test compound lies in close proximity to the boundary. Figure 50 and Figure 51 show the global model logistic regression coefficients transformed to be expressed in terms of the original numerical and chemotype descriptors, respectively. Each plot includes the 20 descriptors with the largest magnitudes related to each class. For example, this data set indicates that nitrogen-based acceptors are associated with positive Ames mutagenicity whereas approximate surface area (ASA) is correlated with negative Ames mutagenicity. Similarly, hydroxyl groups bonded to nitrogen, oxygen, sulfur, or phosphorus atoms constitute a chemotype correlated with Ames positive activity while, on the other hand, the generic carbonyl group is associated with Ames negative activity. It is important to note that these relationships are not necessarily causal, and the activity of any one particular compound must be interpreted by considering the effects of all significant descriptors. For instance, in reference to 1-(4-Chlorophenyl)-3,3-dimethyltriazene, the global model does recognize a positive correlation between Ames mutagenicity and the triazenyl group; the logistic regression coefficient of the triazenyl group is 0.0135. This compound also contains 3 nitrogen-based hydrogen bond acceptors. However, the remainder of the molecule’s descriptor values, relative to those of the training set, place its predicted activity decisively into negative character.
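
The back-transformation behind Figures 50 and 51 relies on the PLS scores being (up to centering and scaling) a linear function of the selected descriptors, so the latent-space logistic coefficients map to one coefficient per original descriptor. A sketch with synthetic data (scikit-learn attribute names; the dissertation's own implementation may differ):

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 60))       # stand-in for the selected descriptors
y = rng.integers(0, 2, size=400)     # stand-in binary Ames labels

pls = PLSRegression(n_components=8).fit(X, y)
T = pls.transform(X)                 # latent-variable scores
clf = LogisticRegression().fit(T, y)

# Scores are linear in the (centered, scaled) descriptors,
# T = X_std @ x_rotations_, so the logit X_std @ (x_rotations_ @ b) yields
# one coefficient per original descriptor (in standardized units):
beta_descriptor = pls.x_rotations_ @ clf.coef_.ravel()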


“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “G(8)” – global space reduced to 8 dimensions.

Figure 50: PLS, GS(ON), G(8) global logistic regression coefficients of 20 numeric descriptors with largest magnitude.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “G(8)” – global space reduced to 8 dimensions.

Figure 51: PLS, GS(ON), G(8) global logistic regression coefficients of 20 chemotype descriptors with largest magnitude.

Shifting focus now to the predictions made by the local model ensemble, the series of predictions for 1-(4-Chlorophenyl)-3,3-dimethyltriazene as a function of the locality-defining radius is shown in Table 19:

Table 19: PLS, GS(ON), LS(ON), G(8), L(2) local model predictions of 1-(4-Chlorophenyl)-3,3-dimethyltriazene with increasing radius.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Radius                   0.5   0.75   1.0    1.2    1.4    1.6    1.8    2.0    2.2    2.4    2.6
Coverage (%)             8.1   25.4   45.0   59.9   71.8   81.1   86.8   89.5   91.5   92.4   93.0
Local model prediction   -     1      1      1      1      1      1      1      1      1      1

Initially, no classification is made due to an insufficient number of training instances. The remaining local models do make predictions, all of which correctly identify the test molecule as Ames positive. At a radius of 0.75 units, the training set contains only 3 sample molecules, which is the minimum number required for the algorithm to continue forward and attempt model construction. The structures of these neighboring training samples are shown in Figure 52 in order from most to least similar. All of these compounds are Ames positive. Visually, these training samples are clearly very similar to 1-(4-Chlorophenyl)-3,3-dimethyltriazene, each differing by only a single structural substitution or deletion. Regarding the prediction produced by this small local training set: because the set is uniformly positive, the modeling algorithm defaults to the uniform class and makes a positive prediction, as sketched below.
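
The fallback logic for very small local training sets can be summarized in a few lines; a sketch under the stated assumptions (a minimum of 3 neighbors, defaulting to the uniform class; fit_and_predict is a hypothetical stand-in for fitting the local logistic regression):

import numpy as np

MIN_TRAIN = 3  # minimum local training set size noted above

def local_prediction(y_local, fit_and_predict):
    if len(y_local) < MIN_TRAIN:
        return None                      # too few neighbors: no prediction
    classes = np.unique(y_local)
    if len(classes) == 1:
        return int(classes[0])           # uniform class: default to it
    return fit_and_predict()             # otherwise fit the local model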

Figure 52: Training molecules within 0.75 units of 1-(4-Chlorophenyl)-3,3-dimethyltriazene.

In terms of fitting a local logistic regression, the first model learned for 1-(4-Chlorophenyl)-3,3-dimethyltriazene occurs at a radius of 1.0. This model is built from a training set consisting of 26 compounds. A plot of the local training set, the logistic regression decision boundary, and the position of 1-(4-Chlorophenyl)-3,3-dimethyltriazene in the locally transformed space is depicted in Figure 53. Note that the test molecule falls on the positive side of the boundary and is therefore predicted Ames positive with an estimated probability of 0.981. Interpretive information is shown in Figure 54, which includes a graphical depiction of the local logistic regression coefficients expressed in terms of the original descriptor variables. Only 10 descriptors are deemed significant, and none of them are chemotypes. From the figure, rotatable bonds and nitrogen-based hydrogen bond acceptors are associated with a positive response whereas oxygen-based hydrogen bond acceptors are correlated with negative activity.

In the figure, the black cross indicates the test molecule, the red markers positive training compounds, the blue markers negative compounds, and the dashed line the logistic regression decision boundary. “LV1” and “LV2” refer to the first and second PLS latent variables, respectively.

Figure 53: The PLS transformed local descriptor space for the logistic regression model predicting 1-(4-Chlorophenyl)-3,3-dimethyltriazene.

Broadly, 1-(4-Chlorophenyl)-3,3-dimethyltriazene itself has multiple features associated with positive activity in this smaller data set; the compound contains 3 nitrogen-based hydrogen bond acceptors and 2 rotatable bonds. Conversely, the test compound has no oxygen-based hydrogen bond acceptors, which are found to correlate most with negative Ames character. These same associations are found to exist within the totality of the training set as learned by the global model. However, other structural correlations with negative Ames mutagenicity ultimately resulted in 1-(4-Chlorophenyl)-3,3-dimethyltriazene being incorrectly classified as a negative compound by that model. In summary, this test molecule benefited from the narrowed focus provided by its local training set, resulting in a correct prediction. Furthermore, interpretation of the test molecule’s predicted activity is considerably easier through examination of its local model than the global model due to the former’s limited complexity.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 54: PLS, GS(ON), LS(ON), G(8), L(2) local logistic regression coefficients for the model predicting 1-(4-Chlorophenyl)-3,3-dimethyltriazene.

Next to be presented are the models and predictions for CAS 57-47-6, otherwise known as physostigmine. This chemical compound is illustrated in Figure 55:

Figure 55: Skeletal formula for physostigmine, or CAS 57-47-6.

Physostigmine is a cholinesterase inhibitor used in medicine to treat glaucoma and anticholinergic toxicity [111], [112], [113]. The global logistic regression-based model, constructed from an 8-dimensional PLS latent variable space following univariate feature selection, predicts the test compound as Ames negative; the estimated probability of positive Ames mutagenicity is 0.407. This prediction agrees with data retrieved from the Chemical Carcinogenesis Research Information System (CCRIS), which labels physostigmine as Ames negative [50], [82]. Because physostigmine is in a separate cross-validation set than 1-(4-Chlorophenyl)-3,3-dimethyltriazene, a different global model is responsible for its predicted activity. Figure 56 and Figure 57 show the logistic regression coefficients for this global model expressed in terms of the original numerical and chemotype descriptors, respectively. Each plot includes the 20 descriptors with the largest magnitudes related to each class.


“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “G(8)” – global space reduced to 8 dimensions.

Figure 56: PLS, GS(ON), G(8) global logistic regression coefficients of 20 largest magnitude numeric descriptors.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “G(8)” – global space reduced to 8 dimensions.

Figure 57: PLS, GS(ON), G(8) global logistic regression coefficients of 20 largest magnitude chemotypes.

For example, looking at the numeric descriptors, an increasing number of rotatable bonds is found to correlate with positive Ames mutagenicity whereas molecules with larger dipole moments are more commonly associated with Ames negative character. Examining physostigmine specifically, this compound is rather rigid for its size due to its complicated ring structure and therefore has a limited number of rotatable bonds. Furthermore, the size of the test compound confers upon it a number of features associated with negative mutagenicity, including the number of atoms, the number of bonds, ring complexity, and McGowan volume, to name a few. In regard to chemotypes, many of the same structural relationships with Ames mutagenicity exist between the models for 1-(4-Chlorophenyl)-3,3-dimethyltriazene and physostigmine, which is expected if the chemotypes encode toxicological information of a more fundamental nature. Finally, physostigmine itself does not include any notable chemotypes with the exception of a cyclic ethyl feature associated with negative activity in this training data.

The predictions from the ensemble of local models for physostigmine are considered next. The progression of predictions with increasing locality-defining radius is shown in Table 20:

Table 20: PLS, GS(ON), LS(ON), G(8), L(2) local model predictions of physostigmine with increasing radius.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Radius                   0.5   0.75   1.0    1.2    1.4    1.6    1.8    2.0    2.2    2.4    2.6
Coverage (%)             8.1   25.4   45.0   59.9   71.8   81.1   86.8   89.5   91.5   92.4   93.0
Local model prediction   -     -      1      1      1      1      1      1      1      1      1

Unsurprisingly, no predictions are made for physostigmine at the smallest radii due to a lack of sufficient local training samples. Beginning at 1.0 units in the transformed PLS descriptor space and proceeding thereafter, all local models incorrectly predict the test molecule to be Ames positive. In more detail, the local model corresponding to a locality-defining radius of 1.2 units is learned using 36 samples. The local training set and logistic regression decision boundary are visualized in Figure 58:


In the figure, the black cross indicates the test molecule (indicated by arrow), the red markers positive training compounds, the blue markers negative compounds, and the dashed line the logistic regression decision boundary. “LV1” and “LV2” refer to the first two PLS latent variables, consecutively.

Figure 58: The PLS transformed local descriptor space for the logistic regression model predicting physostigmine.

Immediately evident is the small amount of variance along the second latent variable; most of the training molecules are scattered along the first latent variable by comparison. Furthermore, the test compound appears close to the decision boundary. However, the model’s estimated probability of physostigmine being Ames positive is 0.789, which is still quite distant from the decision boundary relative to the rest of the training set. The structures of the five nearest neighbors of physostigmine, in order from most to least similar, are shown in Figure 59, along with corresponding activity, distance, and select descriptor data found in Table 21:


Figure 59: The five nearest neighbors of physostigmine from most to least similar in the transformed local descriptor space.

Table 21: Identifier, activity, distance, and select descriptor data for physostigmine and its five nearest neighbors from most to least similar.

CAS #      57-47-6   100325-51-7   113698-18-3   732-11-6   113124-69-9   6098-44-8
Activity   0         0             1             1          1             1
Distance   0         0.437         0.656         0.715      0.790         0.836
Diameter   12.8      12.4          13.1          11.5       13.1          12.5
Weight     275.3     330.3         296.3         317.3      282.3         281.3

From the table, the compound most similar to physostigmine is Ames negative while the remainder, all of slightly greater dissimilarity, are Ames positive. Examining the structural characteristics associated with mutagenicity, the significant numeric descriptors for the aforementioned local model are presented in Figure 60. Diameter and radius of gyration, which are size and mass distribution descriptors respectively, are associated with positive activity. At the same time, molecular weight and span (i.e. the radius which encloses all atoms from the molecule’s center of mass) are correlated with negative activity. These coefficients suggest a non-obvious set of associations between Ames mutagenicity and molecular size, shape, and distribution of mass which are loosely observable in the data of Table 21.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions.

Figure 60: PLS, GS(ON), LS(ON), G(8), L(2) local logistic regression coefficients for the model predicting physostigmine.

Compounds with greater weight tend to be Ames negative whereas those with more elongated shapes (i.e. larger diameters) are more frequently associated with positive Ames activity. At first glance, two positively correlated descriptors with opposite effects on the response may seem contradictory, but such relationships are possible. As a simple analogy, consider the relationship between plant growth, rainfall, and cloud cover. Rainfall and cloud cover share a positive association, yet rainfall generally increases plant growth whereas cloud cover detracts from it. Briefly shifting attention to the ToxPrint chemotypes, only generic halides are found to be significantly correlated with negative Ames mutagenicity. Physostigmine does not contain this chemotype while its nearest neighbor does. Speculation as to the source of disagreement between the local model predictions and the true activity of physostigmine could be multi-factorial; undersampling of the relevant molecular space, a lack of truly discriminating descriptors, or the presence of an “activity cliff,” in which Ames mutagenicity varies greatly from point to point in this portion of the descriptor space, are all possible explanations.

Concluding the discussion on interpretability, it is possible to extract QSAR information applicable to the entire training data through examination of the significant descriptors across the local model ensemble and their associations with the response. Figure 61 shows the frequency with which numeric descriptors are deemed significant among local models and the sign of their association with Ames mutagenicity. For example, approximate surface area (ASA) is a significant descriptor in roughly 45 separate local models; it is positively associated with mutagenicity in approximately 20 of these models and negatively associated with mutagenicity in about 25. The same information is depicted for the ToxPrint chemotypes in Figure 62. Reviewing the numeric descriptors first, none of these structural features are found to discriminate completely between positive and negative Ames mutagenicity.


“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions; locality-defining radius of 1.0.

Figure 61: Frequency and association of significant descriptors among local training sets (logistic regression coefficients) – PLS, GS(ON), LS(ON), G(8), L(2), radius of 1.0.

“PLS” – dimensional reduction technique reducing global descriptor space; “GS(ON)” – global selection enabled; “LS(ON)” – local selection enabled; “G(8)” – global space reduced to 8 dimensions; “L(2)” – local space reduced to 2 dimensions; locality-defining radius of 1.0.

Figure 62: Frequency and association of significant chemotypes among local training sets (logistic regression coefficients) – PLS, GS(ON), LS(ON), G(8), L(2), radius of 1.0.

That is to say, each numeric descriptor appears both positively and negatively correlated with the endpoint in almost equal proportion. Moving to the chemotypes, there are some descriptors uniquely related to one activity class. This is anticipated, since the purpose of the ToxPrint chemotypes is to describe toxicologically based structure-activity relationships. As a specific example, the acyl halide functional group is only found to correlate with Ames positive character, indicating the possibility of a structural alert for mutagenicity. This notion is confirmed by a literature search which notes acyl halides as known indicators of Ames mutagenicity [114]. Alternatively, alkenyl and acyclic carboxylic esters are found to associate only with negative Ames mutagenicity. A compound from the Hansen et al. data set containing both features is shown in Figure 63:

Figure 63: CAS 14882-94-1, an Ames negative compound, with highlighted alkenyl (blue) and acyclic (orange) carboxylic ester chemotypes.

Conjecture as to why these chemotypes are related to negative activity, if any causal relationship exists, includes a possible incompatibility with interacting with DNA molecules or a potential cellular clearance mechanism. Furthermore, the frequency and correlation data shown in Figure 61 and Figure 62 are subject to possible change between local model ensembles. The emphasis here is that it is demonstrably possible to extract latent, global trends present in the data set even when adopting the proposed local modeling strategy.
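
Operationally, the tallies plotted in Figures 61 and 62 reduce to counting coefficient signs across the ensemble; a sketch of that aggregation (the dict-based representation of each local model's significant coefficients is an assumption of this illustration):

from collections import defaultdict

def sign_frequencies(local_coefficient_dicts):
    # Each element maps descriptor name -> logistic coefficient for one
    # local model (only significant descriptors are present). Returns, per
    # descriptor, [number of positive, number of negative] associations.
    counts = defaultdict(lambda: [0, 0])
    for coefs in local_coefficient_dicts:
        for name, b in coefs.items():
            counts[name][0 if b > 0 else 1] += 1
    return dict(counts)

# e.g. an output entry {"ASA": [20, 25]} would reproduce the ASA split
# described for Figure 61.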

4.2 Performance Comparison between the Proposed Methodology and State-of-the-Art Models from the Literature

The purpose of this section is to compare the performance of the proposed methodology to that of models reported in the literature. Table 22 shows area under the receiver operating characteristic curve (AUC) values for different learning methods reported by Hansen et al. and for select models generated by the proposed methodology on the Ames mutagenicity data set. Briefly, the AUC measures the ability of a classifier to distinguish between classes and falls on the interval [0, 1]. Its value is generated from the receiver operating characteristic (ROC) curve, which plots the sensitivity vs. (1 - specificity) of a classifier as the decision boundary is varied from 0 to 1.0. A perfect classifier results in an AUC of 1.0, whereas a classifier performing no better than random guessing has a value of approximately 0.5. Finally, a classifier with an AUC value of 0.0 predicts the opposite class perfectly (i.e. all positives are predicted negative and vice versa). As described by Hansen et al., the models reported from the literature are constructed using a series of DragonX software descriptors which include constitutional, topological, and geometric information, functional group counts, atom-centered fragments, and molecular properties, with no other data processing algorithms specified [50].
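
The threshold sweep described above can be written directly; a sketch of the trapezoidal AUC computation (ties between scores are ignored for brevity; a library routine such as sklearn.metrics.roc_auc_score handles them properly):

import numpy as np

def roc_auc(y_true, scores):
    # Sort by decreasing predicted probability; each prefix of this
    # ordering corresponds to one decision threshold on the ROC curve.
    y = np.asarray(y_true)[np.argsort(-np.asarray(scores))]
    tpr = np.concatenate(([0.0], np.cumsum(y) / y.sum()))            # sensitivity
    fpr = np.concatenate(([0.0], np.cumsum(1 - y) / (1 - y).sum()))  # 1 - specificity
    return np.trapz(tpr, fpr)   # area under the ROC curve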

Table 22: Area under the receiver operating characteristic curve values for literature models and models from the proposed methodology on Ames mutagenicity.

“PLS” or “t-SNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” or “GS(OFF)”– global descriptor selection enabled or disabled; “LS(ON)” or “LS(OFF)” – local selection enabled or disabled; “G(8)” or “G(2)” – global space reduced to 8 or 2 dimensions; “L(8)” or “L(2)”– local space reduced to 8 or 2 dimensions.

Source                      Model                                AUC
Hansen et al., 2009 [50]    Support Vector Machines              0.86
                            Gaussian Process                     0.84
                            Random Forest                        0.83
                            k-Nearest Neighbor                   0.79
Proposed Methodology        PLS, GS(ON), LS(ON), G(8), L(2)      0.79
                            PLS, GS(ON), LS(ON), G(8), L(8)      0.81
                            PLS, GS(OFF), LS(OFF), G(8), L(2)    0.80
                            PLS, GS(OFF), LS(OFF), G(8), L(8)    0.80
                            t-SNE, GS(ON), LS(ON), G(2), L(2)    0.79

Examining Table 22, the performance of the several models generated from the local QSAR workflow described herein, as judged by the AUC, is roughly equivalent to or slightly worse than that of the models reported by Hansen et al. Specifically, the local modeling methodology achieves equivalent or slightly better discriminating potential than the k-NN classifier, but less than that of the SVM, GP, or Random Forest classifiers. As a final note, some uncontrolled sources of variability between the two sets of models include the number and types of chemical descriptors employed and the coverage of the test sets obtained by each. Addressing the latter aspect, all local models reported in Table 22 predict more than 90% of the validation test sets, for as fair a comparison as possible. Hansen et al. do not clearly indicate whether any test instances are excluded from prediction on grounds of pre-processing error or domain of applicability.

The local QSAR modeling workflow is also evaluated using the blood-brain barrier (BBB) data set compiled by Muehlbacher et al., as described in Chapter 3. This data set is considerably smaller than the Ames mutagenicity data set; the Ames mutagenicity set contains 6,512 molecules, whereas the blood-brain barrier set numbers 362 compounds in total. Benchmarking the proposed methodology against a smaller data set investigates whether the local modeling strategy remains useful and competitive when the number of training samples is limited and the data are more homogeneous. The comparisons made here are broader and more qualitative in nature, considering the differences in training set sizes, selection of descriptors, learning methods, and coverages between the proposed methodology and the models reported in the literature. With that in mind, performance statistics for both sources are shown in Table 23. Note that the models reported by Li et al. derive from a separate data set than that compiled by Muehlbacher et al.; however, both data sets concern experimentally obtained blood-brain barrier permeability as the endpoint of interest. In order to mitigate potential sources of bias, all models from the local QSAR workflow listed in Table 23 represent predictions on greater than 90% of the molecules in the test sets under a 5-fold cross-validation scheme.

Table 23: Performance statistics on literature models and models from the proposed methodology on blood-brain barrier permeability.

“PLS” or “t-SNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” or “GS(OFF)”– global descriptor selection enabled or disabled; “LS(ON)” or “LS(OFF)” – local selection enabled or disabled; “G(8)” or “G(2)” – global space reduced to 8 or 2 dimensions; “L(8)” or “L(2)”– local space reduced to 8 or 2 dimensions.

Source                 Model                                Training   Sensitivity   Specificity   Accuracy   MCC
                                                            Set Size   (%)           (%)           (%)
Li et al., 2005 [115]  Logistic Regression                  415        83.9          46.4          71.0       0.321
                       Linear Discriminant Analysis         415        78.2          58.3          71.2       0.360
                       k-Nearest Neighbors                  415        85.5          61.4          77.1       0.477
                       Support Vector Machines              415        88.6          75.0          83.7       0.645
Proposed Methodology   PLS, GS(ON), LS(ON), G(8), L(2)      328        84.1          71.6          79.5       0.498
                       PLS, GS(ON), LS(ON), G(8), L(8)      328        86.6          65.1          78.7       0.478
                       PLS, GS(OFF), LS(OFF), G(8), L(2)    328        86.0          68.1          79.3       0.490
                       PLS, GS(OFF), LS(OFF), G(8), L(8)    328        89.4          75.2          84.0       0.565
                       t-SNE, GS(ON), LS(ON), G(2), L(2)    328        85.3          72.6          80.1       0.503

As is readily apparent from the table, the performance of models from this work, as determined by the Matthews correlation coefficient, exceeds that of the logistic regression and linear discriminant analysis based models from Li et al. Furthermore, the discriminative capacity of the local QSAR workflow is equal to or greater than that of the k-nearest neighbors classifier. Finally, the support vector machine derived model is superior to all others displayed in the table. Of note, while the local QSAR model ensembles of Table 23 may seem rather successful, their performance is at times significantly below that of accompanying global models also generated from the workflow. A few select instances of this observation are illustrated in Table 24. The scarcity of training data within the blood-brain barrier data set is compounded by the local modeling strategy, which further reduces the number of training compounds available to predict queries. The resulting loss of information relative to the global model is only sometimes recovered when the local radii increase to a size sufficient to encompass large portions of the overall training samples. These results indicate that the local QSAR workflow is better suited toward larger data compilations. Such compilations are by construction typically more diverse and therefore likely to involve multiple underlying mechanisms which can be isolated and exploited by the local strategy.
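For reference, the Matthews correlation coefficient used in Tables 23 and 24 is computed from the full confusion matrix; a minimal sketch with scikit-learn [105], using illustrative rather than actual predictions:

from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Illustrative observed and predicted classes (1 = BBB permeable).
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), ranging
# from -1 (total disagreement) through 0 (random) to +1 (perfect).
print(matthews_corrcoef(y_true, y_pred))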


Table 24: Comparison of local, global-on-local, and global overall models from the proposed methodology applied to the blood-brain barrier data set.

“PLS” or “t-SNE” – dimensional reduction technique reducing global descriptor space; “GS(ON)” or “GS(OFF)”– global descriptor selection enabled or disabled; “LS(ON)” or “LS(OFF)” – local selection enabled or disabled; “G(8)” or “G(2)” – global space reduced to 8 or 2 dimensions; “L(8)” or “L(2)”– local space reduced to 8 or 2 dimensions.

Input Parameters                     Model               Coverage   Sensitivity   Specificity   Accuracy   MCC
                                                         (%)        (%)           (%)           (%)
PLS, GS(ON), LS(ON), G(8), L(2)      Local               50.5       88.4          52.4          79.1       0.417
                                     Global-on-local     50.5       90.1          54.8          81.0       0.457
                                     Global (Overall)    94.7       86.3          68.1          79.4       0.492
PLS, GS(OFF), LS(OFF), G(8), L(8)    Local               93.2       89.4          75.2          84.0       0.565
                                     Global-on-local     93.2       90.9          79.6          86.7       0.606
                                     Global (Overall)    95.3       91.0          80.7          87.0       0.608

Chapter 5: Conclusions and Future Work

5.1 Conclusions

The research presented in this document is multi-faceted and its objectives are summarized in the following enumeration:

1. Develop a novel QSAR workflow utilizing a local modeling strategy. In this context, "local" refers to the use of subsets of training compounds deemed proximal to test molecules in the descriptor space, so that individual models can be learned to explain and predict their activity.

2. Implement a mechanism for identifying descriptor subsets for specific use toward individual test compounds. This is in contrast to traditional modeling strategies, which identify and use a single set of "global" descriptors, correlated with the endpoint, for all test molecules.

3. Incorporate within the proposed local QSAR workflow a non-linear dimensional reduction technique. Dimensional reduction is used to alleviate the detrimental effects of high-dimensional, multicollinear data. Traditional QSAR models frequently use linear dimensional reduction methods despite the fact that these can misrepresent the non-linear nature of real-world data.

4. Benchmark the performance capabilities of the proposed methodology, especially in regard to the aforementioned design features, against the traditional, global modeling strategy and against accepted models from the QSAR literature.

5. Highlight the unique interpretability aspects of the novel local QSAR workflow.

As described in Chapter 3, a novel QSAR modeling workflow is presented in detail which accomplishes the goals of points 1-3. Addressing the first point, after pre-processing and transforming input data into an acceptable form, a radius-based mechanism identifies training samples in proximity to test queries in the resolved descriptor space. These training molecules constitute a subset of “local” data used to explain the endpoint for its corresponding test compound. This process is repeated for all test molecules requiring prediction, in effect constituting an ensemble of local models covering all test samples deemed within the applicability domain.
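A rough sketch of this radius-based mechanism is given below; the variable names are illustrative, and the minimum-sample and non-uniform-response checks follow the settings discussed in Section 5.2 rather than reproducing the workflow code of Appendix C:

import numpy as np

def local_training_set(X_train, y_train, x_query, radius, min_samples=3):
    """Return the training samples within `radius` of a query point in
    the reduced descriptor space, or None when the locality criteria
    (enough samples, non-uniform response) are not met."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean
    mask = distances <= radius
    if mask.sum() < min_samples or len(set(y_train[mask])) < 2:
        return None  # no local model can be learned for this query
    return X_train[mask], y_train[mask]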

Second, the proposed methodology accomplishes the task of identifying unique subsets of descriptors for predicting individual query molecules. These "local" descriptor sets derive from two sources. The first is the variation of descriptor data naturally present within the training samples in proximity to the test compound: some descriptors are removed from consideration because they have no variance among the subset of training molecules. Second, the QSAR modeling workflow conducts univariate descriptor selection on each local descriptor set, thereby preventing irrelevant descriptors from influencing the resulting query-specific models.
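A minimal sketch of this two-stage filtering follows, assuming a variance screen and a univariate Fisher exact test for binary descriptors (the test named in the sample parameter file of Appendix A); continuous descriptors would require a different univariate statistic and are simply passed through here:

import numpy as np
from scipy.stats import fisher_exact

def select_local_descriptors(X_local, y_local, binary_cols, alpha=0.05):
    """Drop zero-variance descriptors, then keep binary descriptors whose
    association with the binary endpoint passes a Fisher exact test."""
    keep = []
    for j in range(X_local.shape[1]):
        col = X_local[:, j]
        if np.var(col) == 0:  # no variance within the local training set
            continue
        if j in binary_cols:
            # 2x2 contingency table: descriptor presence vs. activity
            table = [[np.sum((col == v) & (y_local == c)) for c in (0, 1)]
                     for v in (0, 1)]
            _, p = fisher_exact(table)
            if p >= alpha:
                continue  # not significantly associated with the endpoint
        keep.append(j)
    return keep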

Third, the local QSAR modeling workflow presented herein successfully incorporates a non-linear dimensional reduction technique by adapting the work of van der Maaten and Hinton, t-Distributed Stochastic Neighbor Embedding (t-SNE), for use in QSAR modeling. The t-SNE algorithm generates a 2-D or 3-D representation such that data points which are similar or dissimilar in the high-dimensional space are close together or far apart, respectively, in the low-dimensional mapping. t-SNE, by design, emphasizes the preservation of local structure contained within high-dimensional data spaces, which should make the algorithm particularly amenable to a local QSAR modeling strategy. Moreover, because t-SNE does not provide a mathematical function from the high- to low-dimensional space, the proposed methodology combines the training and test data into a single set such that the samples of each can be repositioned together. Models derived from this process are compared to those based on Partial Least Squares projections, a common dimensional reduction technique found in QSAR analysis.
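A minimal sketch of this joint embedding using scikit-learn's t-SNE implementation [105] (the data and parameter values are illustrative, not the workflow code of Appendix C):

import numpy as np
from sklearn.manifold import TSNE

# Hypothetical descriptor matrices in the high-dimensional space.
X_train = np.random.rand(100, 50)
X_test = np.random.rand(25, 50)

# Stack the two sets so both are repositioned together, since t-SNE
# provides no mapping for projecting new points into an embedding.
X_all = np.vstack([X_train, X_test])
embedding = TSNE(n_components=2, perplexity=30).fit_transform(X_all)

emb_train = embedding[:len(X_train)]
emb_test = embedding[len(X_train):]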

The fourth and fifth objectives of this work are accomplished in the results and discussion of Chapter 4, which documents the unique behaviors and capabilities of the proposed methodology. This is completed through a detailed analysis of the predictions made on the Ames mutagenicity data set compiled by Hansen et al. Summarizing the most significant discoveries, local models exhibit significant improvement over comparable global models, especially when predicting the same portions of test compounds. The best performing models result from a global descriptor space reduced to high dimension (i.e., 8) and local descriptor spaces reduced to low dimension (i.e., 2). Furthermore, these local model ensembles are convenient when seeking to visualize the distribution of training samples and decision boundaries for individual queries. Incorporating univariate descriptor selection is most useful when applied during the local phase and when the local training sets are small to medium-sized. However, the performance of these models falls below that of global models when the coverage approaches the maximum dictated by the applicability domain. Such a detrimental outcome is alleviated either by increasing the dimension of the local descriptor spaces (i.e., to 8) or by allowing the locality-defining radius to vary between test compounds. This latter adaptation involves testing a range of locality-defining radii and retaining, for each query molecule, the prediction from the smallest radius, and by extension the smallest local training set, capable of providing a prediction.

Lastly, one of the most important discoveries is the observation that local models outperform global models by as much as 5% to 10% when compared by the dimension of the modeling space. Local models without univariate selection confer the most improvement, followed by those with univariate selection enabled.

A sizeable portion of the results and discussion concerns the implementation of t-SNE as a non-linear dimensional reduction technique used during the global phase of the proposed methodology. As anticipated, t-SNE-based global QSAR models display substandard performance, since the algorithm is not designed to discriminate on a global scale. On the other hand, t-SNE-based local model ensembles offer performance competitive with PLS-based local model ensembles. Perplexity, a tunable parameter of t-SNE which can be interpreted as the number of neighbors influencing similarity determinations as the algorithm optimizes, is found to have a limited effect on local model ensemble performance. A benefit of utilizing the t-SNE technique is visualization of the global descriptor space from which local models are derived; the t-SNE algorithm can display in two dimensions what requires eight dimensions with a PLS transformation.

The interpretability features of the proposed methodology are illustrated by review of select test compounds and the global and local models which generated predictions for them. A particular example, 1-(4-Chlorophenyl)-3,3-dimethyltriazene, is predicted incorrectly by global models but correctly classified by local models. For this test molecule, the local modeling strategy is shown to effectively attenuate the influence of descriptors correlating with negative Ames mutagenicity on a global level. Examining the descriptor coefficients of the logistic regression on individual local models is frequently easier with respect to interpretability, since local models have significantly fewer training samples and descriptors than their global model counterparts. Finally, a valid criticism of local modeling methods involves the difficulty of recognizing relationships present throughout the entire training set. This problem is addressed by extracting the frequencies and associations of significant descriptors occurring across local model ensembles; for example, this analytic approach successfully identified acyl halides as a structural alert for positive Ames mutagenicity.

Finally, the results and discussion conclude with a comparison of the local QSAR workflow with state-of-the-art QSAR models reported in the literature. Data sets on two endpoints, Ames mutagenicity and permeability through the blood-brain barrier, are considered. The local QSAR workflow offers similar performance to models derived from machine learning techniques such as k-nearest neighbors, linear discriminant analysis, Gaussian processes, random forests, and support vector machines. On the Ames mutagenicity data set from Hansen et al., AUC values from the former range between 0.79 and 0.81, whereas those of the latter range from 0.79 to 0.86. Likewise, on the blood-brain barrier data set compiled by Muehlbacher et al., Matthews correlation coefficients range from 0.321 to 0.645 for the literature-reported models and from 0.478 to 0.565 for the proposed methodology. Ultimately, performance from the proposed methodology is frequently below that of comparable global models on the smaller blood-brain barrier data set, suggesting its use should be reserved for larger data compilations.

5.2 Future Work

As with any research project, the proposed methodology is not without certain limitations and open questions. Several aspects of the local QSAR workflow may be better understood and improved through future efforts. The goal of this section is to describe some of the most visible areas of improvement and the direction of future work.

A contentious design aspect of the proposed methodology is the minimum number of training molecules required to learn a model, particularly for local models, where the number of samples is frequently small. The results presented in Chapter 4 are derived with this number set at 3 samples within a distance r of the query compound in the global descriptor space, where r is the locality-defining radius. Therefore, as long as the response values of these local training compounds are not uniform, the workflow will proceed to learn a logistic regression model. This is problematic, as there are very likely too few samples to estimate the regression coefficients without considerable variability (i.e., uncertainty) and, ultimately, poor performance. Addressing this difficulty, some heuristics on the number of samples recommended to effectively learn a reliable model have been presented in the literature. The term "events per variable" (EPV), in a classification setting, is defined as the ratio of the number of observations in the smallest class to the number of descriptors considered for model construction [116]. In van Smeden et al.'s article on the topic, an EPV of at least 10 is sought in the medical literature, though some simulation studies have found that an EPV of at least 50 is necessary to achieve acceptable performance [116]. Ultimately, the outcome of van Smeden et al.'s simulations on binary logistic regression calls for abandoning any EPV criterion in exchange for validation studies using statistics such as the root mean squared prediction error (rMSPE) or mean absolute prediction error (MAPE) [116]. In light of this information, future work on the local QSAR workflow will need to consider a range of minimum sample values for local training sets, optimized under cross-validation and/or external validation schemes for the data set being modeled. If the minimum number of samples is set too large, the coverage of the local model ensemble could become unacceptably small, since the number of local training samples is linked to the locality-defining radius; in other words, the minimum number of samples will not fall within a distance r for many test compounds. Additionally, increasing the size of the radii to meet this requirement will diminish the local character of the resulting models, eventually reducing performance to no better than that of comparable global models.
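The EPV criterion itself is simple to evaluate; a minimal sketch following the definition in [116]:

def events_per_variable(y_local, n_descriptors):
    """EPV: observations in the smallest class divided by the number of
    descriptors considered for model construction [116]."""
    n_positive = sum(1 for y in y_local if y == 1)
    n_negative = len(y_local) - n_positive
    return min(n_positive, n_negative) / n_descriptors

# A local set with 12 positives, 30 negatives, and 2 reduced descriptors
# gives an EPV of 6, below the common heuristic of 10.
print(events_per_variable([1] * 12 + [0] * 30, 2))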

Another area of potential improvement involves the univariate descriptor selection technique employed by the proposed methodology. Recalling the background information from Chapter 2, descriptor selection methods are typically classified as either filter or wrapper methods. Filter methods, using a scoring metric, rank descriptors either individually or in subsets as a pre-processing step prior to model training [32], [117]. On the other hand, wrapper methods utilize a specific learning algorithm, perhaps separate from the primary training algorithm, in an iterative process to determine a descriptor subset with optimal predictive performance [32], [117]. Univariate descriptor selection, a filter method, is chosen as the descriptor selection method for this work due to its simplicity, its low computational demand, and its ability to reduce overfitting through the removal of potentially uninformative descriptors. Computational demand is particularly important since the proposed methodology is tasked with learning hundreds to thousands of models, especially during parameter optimization. However, univariate descriptor selection is not without disadvantages, such as its inability to recognize the predictive power or redundancy of some descriptors in the presence of others. Because of these shortcomings, univariate feature selection is typically regarded as one of the worst descriptor selection methods available.

A possible augmentation to the proposed algorithm is to use correlation-based feature selection (CFS) during the global and local phases. CFS is a subroutine which determines the merit of a subset of descriptors according to Equation (46):

\mathrm{Merit}_S = \frac{k\,\bar{r}_{cf}}{\sqrt{k + k(k-1)\,\bar{r}_{ff}}} \tag{46}

where \mathrm{Merit}_S is the score of descriptor subset S containing k descriptors, \bar{r}_{cf} is the mean feature-class (i.e., endpoint) correlation, and \bar{r}_{ff} is the mean feature-feature correlation [118]. A forward, best-first search strategy from an empty set with a non-improving stopping condition is used to find descriptor subsets. Hall and Smith compared CFS to Naïve Bayes and Decision Tree wrappers and found it to give equal or better predictive performance on various data sets [118]. Addressing the intricacies of the proposed methodology specifically, CFS is an order of magnitude faster computationally than the wrapper methods and accounts for descriptor multicollinearity. However, the increase in computational demand above that of the current univariate feature selection method may make the change impractical.
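A sketch of Equation (46) and the forward search it drives appears below; this is an illustrative implementation assuming non-constant descriptor columns, not the CFS code of Hall and Smith:

import numpy as np

def cfs_merit(X, y, subset):
    """Merit of descriptor subset S per Equation (46): mean feature-class
    correlation rewarded, mean feature-feature correlation penalized."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                    for i in subset for j in subset if i < j])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward_search(X, y):
    """Forward search from the empty set with a non-improving stop."""
    subset, best = [], -np.inf
    candidates = set(range(X.shape[1]))
    while candidates:
        scores = {j: cfs_merit(X, y, subset + [j]) for j in candidates}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:
            break  # no single addition improves the merit
        subset.append(j_best)
        best = scores[j_best]
        candidates.remove(j_best)
    return subset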

Finally, at present it is not fully understood to what extent local descriptor selection contributes to the creation of unique local descriptor sets, nor how strongly such sets correlate with model performance. Implementing univariate feature selection during the local phase of the workflow did indeed result in more sensitive local model ensembles, as is evident in Figure 39. To explore this notion further, a more rigorous and quantitative investigation should be conducted to confirm or refute the premise that unique local descriptor sets are correlated with improved model performance. For each test molecule, the uniqueness of its local descriptor set can be determined by calculating the similarity between it and the globally significant descriptors from the preceding phase of the workflow. More concretely, the Tanimoto similarity from Chapter 2 could be used to complete this calculation:

S_{Tanimoto} = \frac{C}{A + B - C} \tag{47}

where, in the present context, C is the number of descriptors found in both the local and global sets, A is the number of descriptors in the local set, and B is the number of descriptors in the global set. Similarities calculated in this fashion will tend to be small, since the number of global descriptors usually outnumbers those in the local sets considerably.
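Equation (47) translates directly into code; a minimal sketch assuming the descriptor names are held as Python sets (the names below are illustrative):

def tanimoto_similarity(local_set, global_set):
    """Tanimoto similarity between a local descriptor set and the
    globally significant descriptors, per Equation (47)."""
    c = len(local_set & global_set)          # descriptors in both sets
    a, b = len(local_set), len(global_set)   # size of each set
    return c / (a + b - c)

# A local set sharing 2 of its 3 descriptors with an 8-descriptor global
# set yields 2 / (3 + 8 - 2), a small similarity, as expected.
print(tanimoto_similarity({"HAcc", "Weight", "ASA"},
                          {"HAcc", "Weight", "Atoms", "Bonds",
                           "BondsRot", "Complex", "ComplexRing", "HDon"}))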

Finishing the analysis involves partitioning the collection of local descriptor set similarities into two populations by the accuracy of their predictions: those which made a correct prediction (i.e., true positives and true negatives) and those which made an incorrect prediction (i.e., false positives and false negatives). If the uniqueness of the local descriptor sets has no effect on performance, then the full range of similarities should be dispersed throughout both true and false predictions. Conversely, the premise dictates that similarities for correct predictions should be significantly smaller than those from incorrect predictions. This problem description fits the formulation of a one-sided Welch's t-test with the following hypotheses:

H_0: \bar{S}_{correct} = \bar{S}_{incorrect} \tag{48}

H_1: \bar{S}_{correct} < \bar{S}_{incorrect} \tag{49}

Welch's t-test assumes the variances of the two populations are unequal and is robust to skewed distributions and large sample sizes [119]. Lastly, even if the null hypothesis is rejected for local model ensembles in which local descriptor selection is active, the uniqueness of the descriptor sets may still result mainly from the variances naturally present in the local training sets. Repeating the above analysis on comparable local model ensembles in which local feature selection is disabled would provide sufficient evidence toward the question of whether refinement of local descriptor sets contributes significantly to improved model performance.
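The test in Equations (48) and (49) is available in SciPy (included in the Anaconda distribution of [102]); a minimal sketch with hypothetical similarity values (the one-sided alternative keyword requires SciPy 1.6 or later):

import numpy as np
from scipy.stats import ttest_ind

# Hypothetical Tanimoto similarities, split by prediction outcome.
s_correct = np.array([0.08, 0.12, 0.10, 0.05, 0.15, 0.09])
s_incorrect = np.array([0.18, 0.22, 0.14, 0.25, 0.20])

# equal_var=False selects Welch's t-test; alternative="less" encodes the
# hypothesis that correct predictions have the smaller similarities.
t_stat, p_value = ttest_ind(s_correct, s_incorrect,
                            equal_var=False, alternative="less")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")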

Bibliography

[1] A. Cherkasov, E. N. Muratov, D. Fourches, A. Varnek, I. I. Baskin, M. Cronin, J. Dearden, P. Gramatica, Y. C. Martin, R. Todeschini, V. Consonni, V. E. Kuz'min, R. Cramer, R. Benigni, C. Yang, J. Rathman, L. Terfloth, J. Gasteiger, A. Richard and A. Tropsha, "QSAR Modeling: Where Have You Been? Where Are You Going To?," Journal of Medicinal Chemistry, vol. 57, no. 12, pp. 4877-5010, 2013.

[2] Y. C. Martin, "Hansch analysis 50 years on," Wiley Interdisciplinary Reviews: Computational Molecular Science, vol. 2, no. 3, pp. 435-442, 2012.

[3] C. Hansch and T. Fujita, "p-σ-π Analysis. A Method for the Correlation of Biological Activity and Chemical Structure," Journal of the American Chemical Society, vol. 86, no. 8, pp. 1616-1626, 1964.

[4] J. Gasteiger, "Chemoinformatics: Achievements and Challenges, a Personal View," Molecules, vol. 21, no. 151, pp. 1-15, 2016.

[5] "CAS REGISTRY - The gold standard for chemical substance information," American Chemical Society, 2019. [Online]. Available: https://www.cas.org/support/documentation/chemical-substances. [Accessed 1st June 2019].

[6] R. Lewis and D. Wood, "Modern 2D QSAR for drug discovery," Wiley Interdisciplinary Reviews: Computational Molecular Science, vol. 4, no. 6, pp. 505-522, 2014.

[7] A. Khan, "Descriptors and their selection methods in QSAR analysis: Paradigm for drug design," Drug Discovery Today, vol. 21, no. 8, pp. 1291-1302, 2016.

[8] Y.-C. Lo, S. E. Rensi, W. Torng and R. B. Altman, "Machine learning in chemoinformatics and drug discovery," Drug Discovery Today, vol. 23, no. 8, pp. 1538-1546, 2018.

[9] O. T. Devinyak and R. B. Lesyk, "5-Year Trends in QSAR and its Machine Learning Methods," Current Computer-Aided Drug Design, vol. 12, no. 4, pp. 265-271, 2016.

[10] D. Rognan, "The impact of in silico screening in the discovery of novel and safer drug candidates," Pharmacology & Therapeutics, vol. 175, pp. 47-66, 2017.

[11] S. J. Y. Macalino, V. Gosu, S. Hong and S. Choi, "Role of computer-aided drug design in modern drug discovery," Archives of Pharmacal Research, vol. 38, no. 9, pp. 1686-1701, 2015.

[12] S. Ekins, J. Mestres and B. Testa, "In silico pharmacology for drug discovery: Methods for virtual screening and profiling," British Journal of Pharmacology, vol. 152, no. 1, pp. 9-20, 2007.

[13] K.-Y. Kim, S. E. Shin and K. T. No, "Assessment of quantitative structure-activity relationship of toxicity prediction models for Korean chemical substance control legislation," Environmental Health and Toxicology, vol. 30, pp. 1-10, 2015.

[14] N. L. Kruhlak, R. D. Benz, H. Zhou and T. J. Colatsky, "(Q)SAR Modeling and Safety Assessment in Regulatory Review," Clinical Pharmacology & Therapeutics, vol. 91, no. 3, pp. 529 - 534, 2012.

[15] A. Amberg, L. Beilke, J. Bercu, D. Bower, A. Brigo, K. P. Cross, L. Custer, K. Dobo, E. Dowdy, K. A. Ford, S. Glowienke, J. V. Gompel, J. Harvey, C. Hasselgren, M. Honma, R. Jolly, R. Kemper, M. Kenyon, N. Kruhlak, P. Leavitt, S. Miller, W. Muster, J. Nicolette, A. Plaper, M. Powley, D. P. Quigley, M. V. Reddy, H.-P. Spirkl, L. Stavitskaya, A. Teasdale, S. Weiner, D. S. Welch, A. White, J. Wichard and G. J. Myatt, "Principles and procedures for implementation of ICH M7 recommended (Q)SAR analyses," Regulatory Toxicology and Pharmacology, vol. 77, pp. 13-24, 2016.

[16] T. Engel, "Basic Overview of Chemoinformatics," Journal of Chemical Information and Modeling, vol. 46, no. 6, pp. 2267-2277, 2007.

[17] A. R. Leach and V. J. Gillet, An Introduction to Chemoinformatics, Revised ed., Dordrecht: Springer Netherlands, 2007.

[18] A. Dalby, J. G. Nourse, W. D. Hounshell, A. K. I. Gushurst, D. L. Grier, B. A. Leland and J. Laufer, "Description of Several Chemical Structure File Formats Used by Computer Programs Developed at Molecular Design Limited," Journal of Chemical Information and Modeling, vol. 32, no. 3, pp. 244-255, 1992.

[19] J. Gasteiger and T. Engel, Eds., Chemoinformatics: A Textbook, Darmstadt: WILEY-VCH Verlag GmbH & Co., 2003.

[20] G. Landrum, RDKit, Open-Source Cheminformatics (version 2019.03.01), 2019.

[21] A. Z. Dudek, T. Arodz and J. Galvez, "Computational Methods in Developing Quantitative Structure-Activity Relationships (QSAR): A Review," Combinatorial Chemistry & High Throughput Screening, vol. 9, no. 3, pp. 213-228, 2006.

[22] A. Tropsha, "Best Practices for QSAR Model Development, Validation, and Exploitation," Molecular Informatics, vol. 29, no. 6-7, pp. 476-488, 2010.

[23] F. Sahigara, K. Mansouri, D. Ballabio, A. Mauri, V. Consonni and R. Todeschini, "Comparison of Different Approaches to Define the Applicability Domain of QSAR Models," Molecules, vol. 17, pp. 4791-4810, 2012.

[24] A. Golbraikh, X. S. Wang, H. Zhu and A. Tropsha, "Predictive QSAR Modeling: Methodsand Applications in Drug Discoveryand Chemical Risk Assessment," in Predictive QSAR Modeling: Methods and Applications in Drug Discovery and Chemical Risk Assessment, J. Leszczynski, Ed., Dordrecht, Springer-Verlag, 2016, pp. 1-36.

[25] V. Khanna and S. Ranganathan, "Molecular Similarity and Diversity Approaches in Chemoinformatics," Drug Development Research, vol. 72, no. 1, pp. 74-84, 2011.

[26] R. Todeschini and V. Consonni, Molecular Descriptors for Chemoinformatics, 2nd ed., Weinheim: Wiley-VCH Verlag GmbH & Co. KGaA, 2009.

[27] F. Grisoni, D. Ballabio, R. Todeschini and V. Consonni, "Molecular Descriptors for Structure–Activity Applications: A Hands-On Approach," in Computational Toxicology: Methods and Protocols, O. Nicolotti, Ed., New York, New York: Humana Press, 2018, pp. 3-53.

[28] "CORINA Symphony - Managing and Profiling Molecular Datasets - Program Manual (v1.1)," Molecular Networks GmbH, Germany and Altamira, LLC, USA, Columbus, Ohio, USA and Nürnberg, Germany, 2018.

[29] A. B. Raies and V. B. Bajic, "In silico toxicology: computational methods for the prediction of chemical toxicity," Wiley Interdisciplinary Reviews: Computational Molecular Science, vol. 6, no. 2, p. 147–172, 2016.

[30] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861-874, 2006.

[31] M. Shahlaei, "Descriptor Selection Methods in Quantitative Structure-Activity Relationship Studies: A Review Study," Chemical Reviews, vol. 113, no. 10, pp. 8093-8103, 2013.

[32] M. Goodarzi, B. Dejaegher and Y. V. Heyden, "Feature Selection Methods in QSAR Studies," Journal of AOAC International, vol. 95, no. 3, pp. 636-651, 2012.

[33] M. Mathea, W. Klingspohn and K. Baumann, "Chemoinformatic Classification Methods and their Applicability Domain," Molecular Informatics, vol. 35, no. 5, pp. 160-180, 2016.

[34] G. Idakwo, J. Luttrell IV, M. Chen, H. Hong, P. Gong and Z. Chaoyang, "A Review of Feature Reduction Methods for QSAR-Based Toxicity Prediction," in Advances in Computational Toxicology: Methodologies and Applications in Regulatory Science., H. Hong, Ed., Cham, Springer, Cham, 2019, pp. 119-139.

[35] S. Boughorbel, F. Jarray and M. El-Anbari, "Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric," PLoS One, vol. 12, no. 2: e0177678, 2017.

[36] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442-451, 1975.

[37] S. P. Leelananda and S. Lindert, "Computational methods in drug discovery," Beilstein Journal of Organic Chemistry, vol. 12, no. 1, pp. 2694-2718, 2016.

[38] Y. Wang, J. Xing, N. Zhou, J. Peng, Z. Xiong, X. Liu, X. Luo, C. Lou, K. Chen, M. Zheng and H. Jiang, "In silico ADME/T modeling for rational drug design," Quarterly Reviews of Biophysics, vol. 48, no. 4, pp. 488-515, 2015.

[39] S. M. Paul, D. S. Mytelka, C. T. Dunwiddie, C. C. Persinger, B. H. Munos, S. R. Lindborg and A. L. Schacht, "How to Improve R&D Productivity: The Pharmaceutical Industry's Grand Challenge," Nature Reviews Drug Discovery, vol. 9, no. 3, pp. 203-214, 2010.

[40] S. Boyer, C. Brealey and A. M. Davis, "Attrition in Drug Discovery and Development," in Attrition in the Pharmaceutical Industry: Reasons, Implications, and Pathways Forward, A. Alex, C. J. Harris and D. A. Smith, Eds., Hoboken, New Jersey: John Wiley & Sons, Inc., 2016, pp. 5-45.

[41] D. Cook, D. Brown, R. Alexander, R. March, P. Morgan, G. Satterthwaite and M. N. Pangalos, "Lessons learned from the fate of AstraZeneca's drug pipeline: a five-dimensional framework," Nature Reviews Drug Discovery, vol. 13, no. 6, pp. 419-431, 2014.

[42] M. J. Waring, J. Arrowsmith, A. R. Leach, P. D. Leeson, S. Mandrell, R. M. Owen, G. Pairaudeau, W. D. Pennie, S. D. Pickett, J. Wang, O. Wallace and A. Weir, "An analysis of the attrition of drug candidates from four major pharmaceutical companies," Nature Reviews Drug Discovery, vol. 14, no. 7, pp. 475-486, 2015.

[43] J. Mittra, The New Health Bioeconomy: R&D Policy and Innovation for the Twenty-First Century, Hampshire: Palgrave MacMillan, 2016.

[44] R. S. Judson, K. A. Houck, R. J. Kavlock, T. B. Knudsen, M. T. Martin, H. M. Mortensen, M. D. Reif, D. M. Rotroff, I. Shah, A. M. Richard and D. J. Dix, "In Vitro Screening of Environmental Chemicals for Targeted Testing Prioritization: The ToxCast Project," Environmental Health Perspectives, vol. 118, no. 4, pp. 485-492, 2010.

[45] V. M. Alves, S. J. Capuzzi, E. Muratov, R. C. Braga, T. Thornton, D. Fourches, J. Strickland, N. Kleinstreuer, C. H. Andrade and A. Tropsha, "QSAR models of human data can enrich or replace LLNA testing for human skin sensitization," Green Chemistry, vol. 18, no. 24, pp. 6501-6515, 2016.

[46] R. Guha, D. Dutta, P. C. Jurs and T. Chen, "Local Lazy Regression: Making Use of the Neighborhood to Improve QSAR Predictions," Journal of Chemical Information and Modeling, vol. 46, no. 4, pp. 1836-1847, 2006.

[47] A. Yu and K. Grauman, "Predicting Useful Neighborhoods for Lazy Local Learning," in Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Canada, 2014.

[48] I. Gijbels and I. Prosdocimi, "Loess," Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 5, pp. 590-599, 2010.

[49] J. B. O. Mitchell, "Machine learning methods in chemoinformatics," Wiley Interdisciplinary Reviews: Computational Molecular Science, vol. 4, no. 5, pp. 468-481, 2014.

[50] K. Hansen, S. Mika, T. Schroeter, A. Sutter, A. ter Laak, T. Steger-Hartmann, N. Heinrich and K.-R. Muller, "Benchmark Data Set for in Silico Prediction of Ames Mutagenicity," Journal of Chemical Information and Modeling, vol. 49, no. 9, pp. 2077-2081, 2009.

[51] F. Buchwald, T. Girschick, M. Seeland and S. Kramer, "Using Local Models to Improve (Q)SAR Predictivity," Molecular Informatics, vol. 30, no. 2-3, pp. 205- 218, 2011.

[52] E. Ahlberg, L. Carlsson, S. Boyer and U. Norinder, "Evaluation of Quantitative Structure-Activity Relationship Modeling Strategies: Local and Global Models," Journal of Chemical Information and Modeling, vol. 50, no. 4, pp. 677-689, 2010.

[53] M. H. Kutner, C. J. Nachtsheim and J. Neter, Applied Linear Regression Models, 4 ed., New York, New York: The McGraw-Hill Companies, Inc., 2004.

[54] S. S. Young, F. Yuan and M. Zhu, "Chemical Descriptors Are More Important Than Learning Algorithms for Modeling.," Molecular Informatics, vol. 31, no. 10, pp. 707-710, 2012.

[55] D. C. Montgomery, Design and Analysis of Experiments, 7 ed., Hoboken, New Jersey: John Wiley & Sons, Inc., 2009.

[56] A. Agresti, Categorical Data Analysis, 3 ed., Hoboken, New Jersey: John Wiley & Sons, Inc., 2013.

[57] N. Armanfard, J. P. Reilly and M. Komeili, "Local Feature Selection for Data Classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 6, pp. 1217-1227, 2016.

[58] K. Pichara and A. Soto, "Local Feature Selection Using Gaussian Process Regression," Intelligent Data Analysis, vol. 18, no. 3, pp. 319-336, 2014.

[59] K. S. Ng, "A Simple Explanation of Partial Least Squares," 27 April 2013. [Online]. Available: http://users.cecs.anu.edu.au/~kee/pls.pdf.

[60] M. Verleysen and D. François, "The Curse of Dimensionality in Data Mining and Time Series Prediction," in IWANN'05 Proceedings of the 8th International Conference on Artificial Neural Networks: Computational Intelligence and Bioinspired Systems, Barcelona, Spain, 2005.

[61] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, 6 ed., Upper Saddle River, New Jersey: Pearson Education, Inc., 2007.

[62] L. Ungar, "CIS 520 Machine Learning - 2018," 2018. [Online]. Available: https://alliance.seas.upenn.edu/~cis520/dynamic/2018/wiki/uploads/Lectures/pca-example-1D-of-2D.png. [Accessed 14 July 2019].

[63] P. H. Garthwaite, "An Interpretation of Partial Least Squares," Journal of the American Statistical Association, vol. 89, no. 425, pp. 122-127, 1994.

[64] D. Ruiz-Perez and G. Narasimhan, "So you think you can PLS-DA?," Cold Spring Harbor Laboratory, 2018. [Online]. Available: https://www.biorxiv.org/content/10.1101/207225v2. [Accessed 15 July 2019].

[65] J. VanderPlas, "Comparison of PCA and Manifold Learning," AstroML Developers, 2013.

[66] P. Henderson, "Sammon Mapping," Pattern Recognition Letters, vol. 18, no. 11- 13, p. 1307–1316, 1997.

[67] S. T. Roweis and L. K. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, vol. 290, no. 5500, pp. 2323-2326, 2000.

[68] L. van der Maaten and G. Hinton, "Visualizing Data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579-2605, 2008.

[69] L. Derksen, "Visualising high-dimensional datasets using PCA and t-SNE in Python," 29 April 2019. [Online]. Available: https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca- and-t-sne-in-python-8ef87e7915b. [Accessed 17 June 2019].

[70] T. A. Abeo, X.-J. Shen, E. D. Ganaa, Q. Zhu, B.-K. Bao and Z.-J. Zha, "Manifold Alignment via Global and Local Structures Preserving PCA Framework," IEEE Access, vol. 7, pp. 38123-38134, 2019.

[71] M. Wattenberg, F. Viégas and I. Johnson, "How to Use t-SNE Effectively," Distill, 2016.

[72] Y. Zhou and T. O. Sharpee, "Using global t-SNE to preserve inter-cluster data structure," bioRxiv, 2018.

[73] L. van der Maaten, "Learning a Parametric Embedding by Preserving Local Structure," in Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, Clearwater Beach, Florida, USA, 2009.

[74] P. Liu and W. Long, "Current Mathematical Methods Used in QSAR/QSPR Studies," International Journal of Molecular Sciences, vol. 10, no. 5, pp. 1978-1998, 2009.

[75] M. Perme, M. Blas and S. Turk, "Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study," Metodološki Zvezki, vol. 1, no. 1, pp. 143-161, 2004.

[76] T. M. Mitchell, "Machine Learning, Tom Mitchell, McGraw Hill," September 2017. [Online]. Available: http://www.cs.cmu.edu/%7Etom/mlbook/NBayesLogReg.pdf. [Accessed 28th July 2019].

[77] A. Ben-Hur and J. Weston, "A User’s Guide to Support Vector Machines," in Data Mining Techniques for the Life Sciences: Methods in Molecular Biology (Methods and Protocols), vol. 609, O. Carugo and F. Eisenhaber, Eds., New York: Humana Press, 2010, pp. 223-239.

[78] B. Lantz, "Chapter 7: Black Box Methods - Neural Networks and Support Vector Machines," in Machine Learning with R, 2 ed., Birmingham, Packt Publishing Ltd., 2015, pp. 219-258.

[79] M. Muehlbacher, G. Spitzer, K. R Liedl and J. Kornhuber, "Qualitative Prediction of Blood–Brain Barrier Permeability on a Large and Refined Dataset," Journal of Computer-Aided Molecular Design, vol. 25, no. 12, pp. 1095-1106, 2011.

[80] K. E. Mortelmans and E. Zeiger, "The Ames Salmonella/Microsome Mutagenicity Assay," Mutation Research, vol. 455, pp. 29-60, 2000.

[81] Histidine, "Ames test procedure.," Licensed under CC BY-SA 3.0. 5 February 2011. [Online]. Available: https://en.wikipedia.org/wiki/Ames_test#/media/File:Ames_test.svg. [Accessed 26 July 2019].

[82] "Chemical Carcinogenesis Research Information System on the NCRI Informatics Initiative Homepage," 2009. [Online]. Available: https://toxnet.nlm.nih.gov/newtoxnet/ccris.htm.

[83] J. Kazius, R. McGuire and R. Bursi, "Derivation and Validation of Toxicophores for Mutagenicity Prediction," Journal of Medicinal Chemistry, vol. 48, pp. 312-320, 2005.

[84] C. Helma, T. Cramer, S. Kramer and L. D. Raedt, "Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds," Journal of Chemical Information and Modeling, vol. 44, pp. 1402-1411, 2004.

[85] J. Feng, L. Lurati, H. Ouyang, T. Robinson, Y. Wang, S. Yuan and S. S. Young, "Predictive toxicology: benchmarking molecular descriptors and statistical methods," Journal of Chemical Information and Modeling, vol. 43, pp. 1463- 1470, 2003.

[86] P. N. Judson and et al., "Towards the creation of an international toxicology information centre," Toxicology, vol. 213, pp. 117-128, 2005.

[87] "Genetic Toxicity, Reproductive and Development Toxicity, and Carcinogenicity Database," 2009. [Online]. Available: https://www.fda.gov/AboutFDA/CentersOffices/CDER/ucm092217.htm.

[88] W. M. Pardridge, "Drug Transport Across the Blood-Brain Barrier," Journal of Cerebral Blood Flow & Metabolism, vol. 32, no. 11, pp. 1959-1972, 2012.

[89] K. A. Kübelbeck, "Sketch showing the transport types at the blood-brain barrier," Licensed under CC BY 3.0. 12 November 2011. [Online]. Available: https://commons.wikimedia.org/wiki/File:Blood-brain_barrier_transport_en.png. [Accessed 17 September 2019].

[90] S. Vilar, M. Chakrabarti and S. Costanzi, "Prediction of passive blood-brain barrier partitioning: straightforward and effective classification models based on in silico derived physicochemical descriptors," Journal of Molecular Graphics and Modelling, vol. 28, no. 8, pp. 899-903, 2010.

[91] J. A. Platts, M. H. Abraham, Y. Zhao, A. Hersey, L. Ijaz and Butina D, "Correlation and prediction of a large blood-brain distribution data set - an LFER study.," European Journal of Medicinal Chemistry, vol. 36, no. 9, pp. 719-730, 2001.

[92] R. Narayanan and S. B. Gunturi, "In silico ADME modelling: prediction models for blood-brain barrier permeation using a systematic variable selection method," Bioorganic & Medicinal Chemistry, vol. 13, no. 8, pp. 3017-3028, 2005.

[93] S. R. Mente and F. Lombardo, "A recursive-partitioning model for blood-brain barrier permeation," Journal of Computer-Aided Molecular Design, vol. 19, no. 7, pp. 465-481, 2005.

[94] L. Zhang, H. Zhu, T. I. Oprea, A. Golbraikh and A. Tropsha, "QSAR modeling of the blood-brain barrier permeability for diverse organic compounds," Pharmaceutical Research, vol. 25, no. 8, pp. 1902-1914, 2008.

[95] M. H. Abraham, A. Ibrahim, Y. Zhao and W. E. Acree, "A data base for partition of volatile organic compounds and drugs from blood/plasma/serum to brain, and an LFER analysis of the data," Journal of Pharmaceutical Sciences, vol. 95, no. 10, pp. 2091-2100, 2006.

[96] P. Garg and J. Verma, "In silico prediction of blood brain barrier permeability: an artificial neural network model," Journal of Chemical Information and Modeling, vol. 46, no. 1, pp. 289-297, 2006.

[97] A. Guerra, J. A. Paez and N. E. Campillo, "Artificial neural networks in ADMET modeling: prediction of blood-brain barrier permeation," QSAR & Combinatorial Science, vol. 27, no. 5, pp. 586-594, 2008.

[98] K. Rose, L. H. Hall and L. B. Kier, "Modeling blood-brain barrier partitioning using the electrotopological state," Journal of Chemical Information and Modeling, vol. 42, no. 3, pp. 651-666, 2002.

[99] J. Kelder, P. D. J. Grootenhuis, D. M. Bayada, L. P. C. Delbressine and J.-P. Ploemen, "Polar molecular surface as a dominating determinant for oral absorption and brain penetration of drugs," Pharmaceutical Research, vol. 16, no. 10, pp. 1514-1519, 1999.

[100] D. A. Konovalov, D. Coomans, E. Deconinck and Y. Vander Heyden, "Benchmarking of QSAR models for blood-brain barrier permeation," Journal of Chemical Information and Modeling, vol. 47, no. 4, pp. 1648-1656, 2007.

[101] M. Zerara, J. Brickmann, R. Kretschmer and T. E. Exner, "Parameterization of an empirical model for the prediction of n-octanol, alkane and /water as well as brain/blood partition coefficients," Journal of Computer-Aided Molecular Design, vol. 23, no. 2, pp. 105-111, 2009.

[102] "Anaconda Distribution, version 2019.03 for Python 3.7," Anaconda, Inc., 2019. https://www.anaconda.com/distribution/.

[103] "Python Language Reference, version 3.7," Python Software Foundation, 2019. https://www.python.org.

[104] "Spyder: The Scientific Python Development Environment, version 3.3.3," The Spyder Website Contributors, 2018. https://www.spyder-ide.org/.

[105] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay, "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.

[106] W. McKinney, "Data Structures for Statistical Computing in Python," in Proceedings of the 9th Python in Science Conference, 51-56, 2010.

[107] "Marvin Suite, version 19.20," ChemAxon Ltd., 2019, https://chemaxon.com.

[108] C. Yang, A. Tarkhov , J. Marusczyk, B. Bienfait, J. Gasteiger, T. Kleinoeder, T. Magdziarz, O. Sacher, C. H. Schwab, J. Schwoebel, L. Terfloth, K. Arvidson, A. Richard, A. Worth and J. Rathman, "New Publicly Available Chemical Query Language, CSRML, To Support Chemotype Representations for Application to Data Mining and Modeling," J. Chem. Inf. Model., vol. 55, pp. 510-528, 2015.

[109] S. Arora, W. Hu and P. Kothari, "An Analysis of the t-SNE Algorithm for Data Visualization," ArXiv, vol. abs/1803.01768, 2018.

[110] F. Marchesi, M. Turriziani, G. Tortorelli, G. Avvisati, F. Torino and L. Vecchis, "Triazene Compounds: Mechanism of Action and Related DNA Repair Systems," Pharmacological Research, vol. 56, no. 4, pp. 275-287, 2007.

[111] "ChemIDplus: A TOXNET Database," U.S. National Library of Medicine, National Institute of Health, U.S. Department of Health and Human Services, 2019. [Online]. Available: https://chem.nlm.nih.gov/chemidplus/rn/76-03-9. [Accessed 2nd November 2019].

[112] G. K. L. O. Pinheiro, I. Araujo Filho, I. Araujo Neto, A. C. M. Rego, E. P. Azevedo, F. I. Pinheiro and A. A. S. Lima Filho, "Nature as a source of drugs for ophthalmology," Arquivos Brasileiros de Oftalmologia, vol. 81, no. 5, pp. 443-454, 2018.

[113] A. M. Arens and T. Kearney, "Adverse Effects of Physostigmine," Journal of Medical Toxicology, vol. 15, no. 3, pp. 184-191, 2019.

[114] M. Honma, A. Kitazawa, A. Cayley, R. V. Williams, C. Barber, T. Hanser, R. Saiakhov, S. Chakravarti, G. J. Myatt, K. P. Cross, E. Benfenati, G. Raitano, O. Mekenyan, P. Petkov, C. Bossa, R. Benigni, C. L. Battistelli, A. Giuliani, O. Tcheremenskaia, C. DeMeo, U. Norinder, H. Koga, C. Jose, N. Jeliazkova, N. Kochev, V. Paskaleva, C. Yang, P. R. Daga, R. D. Clark and J. Rathman, "Improvement of quantitative structure–activity relationship (QSAR) tools for predicting Ames mutagenicity: outcomes of the Ames/QSAR International Challenge Project," Mutagenesis, vol. 34, no. 1, pp. 3-16, 2019.

[115] H. Li, C. W. Yap, C. Y. Ung, Y. Xue, Z. W. Cao and Y. Z. Chen, "Effect of selection of molecular descriptors on the prediction of blood-brain barrier penetrating and nonpenetrating agents by statistical learning methods," Journal of Chemical Information and Modeling, vol. 45, no. 5, pp. 1376-1384, 2005.

[116] M. van Smeden, K. G. M. Moons, J. A. H. de Groot, G. S. Collins, D. G. Altman, M. J. C. Eijkemans and J. B. Reitsma, "Sample size for binary logistic prediction models: Beyond events per variable criteria," Statistical Methods in Medical Research, vol. 28, no. 8, pp. 2455-2474, 2019.

[117] I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.

[118] M. A. Hall and L. A. Smith, "Feature Selection for Machine Learning: Comparing a Correlation-based Filter Approach to the Wrapper," in Twelfth International FLAIRS Conference, Orlando, FL, 1999.

[119] M. H. DeGroot and M. J. Schervish, Probability and Statistics, Boston, MA: Pearson Education, Inc., 2012.

[120] C. Lipinski, F. Lombardo, B. Dominy and P. Feeney, "Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings," Advanced Drug Delivery Reviews, vol. 23, pp. 3-25, 1997.

[121] J. B. Hendrickson, P. Huang and A. G. Toczko, "Molecular Complexity: A Simplified Formula Adapted to Individual Atoms," Journal of Chemical Information and Modeling, vol. 27, pp. 63-67, 1987.

[122] J. Gasteiger and C. Jochum, "Algorithm for the Perception of Synthetically Important Rings," Journal of Chemical Information and Modeling, vol. 19, pp. 43- 48, 1979.

[123] P. Labute, "A Widely Applicable Set of Descriptors," Journal of Molecular Graphics and Modelling, vol. 18, pp. 464-477, 2000.

[124] Y. Zhao, M. H. Abraham and A. Zissimos, "Determination of McGowan Volumes for Ions and Correlation with van der Waals Volumes," Journal of Chemical Information and Modeling, vol. 43, pp. 1848-1854, 2003.

[125] P. Ertl, B. Rohde and P. Selzer, "Fast Calculation of Molecular Polar Surface Area as a Sum of Fragment-Based Contributions and Its Application to the Prediction of Drug Transport Properties," Journal of Medicinal Chemistry, vol. 43, pp. 3714-3717, 2000.

[126] R. Wang, Y. Gao and L. Lai, " Calculating Partition Coefficient by Atom-Additive Method," Perspectives in Drug Discovery and Design, vol. 19, pp. 47-66, 2000.

[127] M. Petitjean, "Applications of the radius-diameter diagram to the classification of topological and geometrical shapes of chemical compounds," Journal of Chemical Information and Modeling, vol. 32, pp. 331-337, 1992.

[128] M. Volkenstein, Configurational Statistics of Polymeric Chains, New York: Wiley-Interscience, 1963.

[129] C. Tanford, Physical Chemistry of Macromolecules, New York: Wiley, 1961.

[130] H. Yang, L. Sun, W. Li, G. Liu and Y. Tang, "In Silico Prediction of Chemical Toxicity for Drug Design Using Machine Learning Methods and Structural Alerts," Frontiers in Chemistry, vol. 6, pp. 1-12, 2018.

[131] R. Wang, Y. Gao and L. Lai, "Calculating partition coefficient by atom-additive method," Perspect. Drug Discovery Des., vol. 19, pp. 47-66, 2000.

[132] K. P. C. Vollhardt and N. E. Schore, "Electrophilic Attack on Derivatives of Benzene," in Organic Chemistry, 5th ed., New York, New York: W. H. Freeman and Company, 2007, pp. 721-762.

[133] J. K. Seydel and M. Wiese, "Octanol-Water Partitioning versus Partitioning into Membranes," in Drug-Membrane Interactions: Analysis, Drug Distribution, Modeling, vol. 15, J. K. S. a. M. Wiese, Ed., Weinhiem, Germany, Wiley-VCH Verlag GmbH & Co. KGaA, 2002, pp. 35-50.

[134] J. Sangster, Octanol-Water Partition Coefficients: Fundamentals and Physical Chemistry, Chichester, England: John Wiley & Sons Ltd., 1997.

[135] F. Sahigara, D. Ballabio, R. Todeschini and V. Consonni, "Defining a novel k- nearest neighbours approach to assess the applicability domain of a QSAR model for reliable predictions," Journal of Cheminformatics, vol. 5, no. 1, pp. 1-9, 2013.

[136] K. Roy, S. Kar and R. N. Das, Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment, Cambridge, Massachusetts: Academic Press, 2015.

[137] M. Pohar, M. Blas and S. Turk, "Comparison of Logistic Regression and Linear Discriminant Analysis: A Simulation Study," Metodološki zvezki, vol. 1, no. 1, pp. 143-161, 2004.

[138] S. Nembri, F. Grisoni, V. Consonni and R. Todeschini, "In Silico Prediction of Cytochrome P450-Drug Interaction: QSARs for CYP3A4 and CYP2C9," Int. J. Mol. Sci, vol. 17, no. 914, pp. 1-19, 2016.

[139] Y. C. Martin, J. L. Kofron and L. M. Traphagen, "Do Structurally Similar Molecules Have Similar Biological Activity?," J. Med. Chem., vol. 45, no. 19, p. 4350–4358, 2002.

[140] D. Loughney, B. L. Claus and S. R. Johnson, "To measure is to know: an approach to CADD performance metrics," Drug Discov. Today, vol. 16, pp. 548-554, 2011.

[141] C. A. Lipinski, "Drug-like properties and the causes of poor solubility and poor permeability," J. Pharmacol. Toxicol., vol. 44, pp. 235-249, 2000.

[142] G. Landrum, "RDKit: Open-source cheminformatics (v2019.03)".

[143] J. Hughes, S. Rees, S. Kalindjian and K. Philpott, "Principles of early drug discovery," British Journal of Pharmacology, vol. 162, no. 6, pp. 1239-1249, 2010.

[144] E. A. Helgee, L. Carlsson, S. Boyer and U. Norinder, "Evaluation of Quantitative Structure - Activity Relationship Modeling Strategies: Local and Global Models," J. Chem. Inf. Model., vol. 50, no. 4, pp. 677-689, 2010.

[145] I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection," J. Mach. Learn. Res., vol. 3, pp. 1157-1182, 2003.

[146] P. Ertl, B. Rohde and P. Selzer, "Fast Calculation of Molecular Polar Surface Area as a Sum of Fragment-Based Contributions and Its Application to the Prediction of Drug Transport Properties," J. Med. Chem., vol. 43, pp. 3714-3717, 2000.

[147] K. E. Mortelmans and E. Zeiger, "The Ames Salmonella/Microsome Mutagenicity Assay," Mutation Research, vol. 455, no. 1-2, pp. 29-60, 2000.

[148] Danishuddin and A. U. Khan, "Descriptors and their selection methods in QSAR analysis: paradigm for drug design," Drug Discov. Today, vol. 21, no. 8, pp. 1291- 1302, 2016.

[149] K. S. Beyer, J. Goldstein, R. Ramakrishnan and U. Shaft, "When Is ''Nearest Neighbor'' Meaningful?," in ICDT '99 Proceedings of the 7th International Conference on Database Theory, London, UK, 1999.

[150] I. I. Baskin and A. Varnek, "Fragment Descriptors in SAR/QSAR/QSPR Studies, Molecular Similarity Analysis and in Virtual Screening," in Chemoinformatics Approaches to Virtual Screening, Cambridge, UK, The Royal Society of Chemistry, 2008, pp. 1-43.

[151] A. Avdeef, Absorption and Drug Development: Solubility, Permeability, and Charge State, Hoboken, New Jersey: John Wiley & Sons, Inc., 2012.

[152] "Fingerprints - Screening and Similarity," Daylight Chemical Information Systems, Inc., 2008. [Online]. Available: http://www.daylight.com/dayhtml/doc/theory/theory.finger.html.

[153] D. Veber, S. Johnson, H.-Y. Cheng, B. Smith, K. Ward and K. Kopple, "Molecular Properties That Influence the Oral Bioavailability of Drug Candidates," Journal of Medicinal Chemistry, vol. 45, no. 12, pp. 2615-2623, 2002.

[154] C. Yang, A. Tarkhov, J. Marusczyk, B. Bienfait, J. Gasteiger, T. Kleinöder, T. Magdziarz, O. Sacher, C. Schwab, J. Schwöbel, L. Terfloth, K. Arvidson, A. Richard, A. Worth and J. Rathman, "New Publicly Available Chemical Query Language, CSRML, To Support Chemotype Representations for Application to Data Mining and Modeling," Journal of Chemical Information and Modeling, vol. 55, no. 3, pp. 510-528, 2015.

[155] J. Kazius, R. McGuire and R. Bursi, "Derivation and Validation of Toxicophores for Mutagenicity Prediction," Journal of Medicinal Chemistry, vol. 48, no. 1, pp. 312-320, 2005.

[156] P. N. Judson, P. A. Cooke, D. N. G., G. N., R. P. Hanzlik, C. Hardy, A. Hartmann, D. Hinchliffe, J. Holder, L. Muller, T. Steger-Hartmann, A. Rothfuss, M. Smith, K. Thomas, J. D. Vessey and E. Zeiger, "Towards the Creation of an International Toxicology Information Centre," Toxicology, vol. 213, no. 1-2, pp. 117-128, 2005.

[157] M. Goodarzi, B. Dejaegher and Y. V. Heyden, "Feature Selection Methods in QSAR Studies," J. AOAC Int., vol. 95, no. 3, pp. 636-651, 2012.

Appendix A: Sample Input and Output Files

This appendix contains sample input files necessary to use the code included in Appendix C. Also included are sample output files generated as a result of executing the aforementioned code, namely, model predictions and performance summaries. These files are of the comma-separated values (.csv) format, unless otherwise stated, and are represented in tabular form herein. In some instances, the contents of a file do not fit the page, and the table has been reshaped accordingly.

A.1 Parameter file (input)

This file, shown in Table 25, contains the input parameters used to control the various options available to the user when implementing the proposed methodology. In the source file it is a 3 x 36 table: the first row (header) lists parameter names, the second row the global phase parameter options, and the third row the local phase parameter options. It is presented here in transposed form to fit the appendix. In instances where it is appropriate, multiple parameter values may be specified to perform a grid search.

Table 25: Sample parameter file (input).

Parameter                                      global      local
Preprocess.option                              TRUE        TRUE
Preprocess.missing_data_method                 remove      remove
Preprocess.missing_data_indicator              np.nan      np.nan
Preprocess.variance_threshold_value            0           0
Preprocess.infrequent_threshold_value          3           3
Preprocess.transform_descriptors               None        None
Preprocess.transform_function                  None        None
Preprocess.transform_inverse                   None        None
Preprocess.standardize                         TRUE        TRUE
VariableSelection.option                       FALSE       FALSE
VariableSelection.significance_level           0.05        0.05
VariableSelection.method                       univariate  univariate
VariableSelection.univariate_binary_method     fisher      fisher
DimensionalReduction.option                    TRUE        TRUE
DimensionalReduction.method                    pls         pls
DimensionalReduction.n_reduced_dimensions      8           8
DimensionalReduction.pca_method                component   component
DimensionalReduction.ratio_explained_variance  0.99        0.99
DimensionalReduction.perplexity                30          0
DimensionalReduction.k_neighbors               15          15
DimensionalReduction.min_dist                  0.1         0.1
DimensionalReduction.metric                    euclidean   euclidean
DomainAssessment.option                        TRUE        TRUE
DomainAssessment.domain_k                      3           3
DomainAssessment.domain_quantile               0.95        0.95
LocalityAssessment.option                      TRUE        TRUE
LocalityAssessment.method                      radius      radius
LocalityAssessment.radius                      2.1         0
LocalityAssessment.locality_k                  3           3
LocalityAssessment.locality_quantile           0.95        0.95
Train.option                                   TRUE        TRUE
Train.technique                                logistic    logistic
Train.n_neighbors                              5           5
Train.weights                                  uniform     uniform
Train.n_components                             1           1
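For orientation, the following is a minimal sketch of how a parameter file of this layout might be read with pandas; the file name (params.csv) is hypothetical, and the first column is assumed to hold the phase label.

import pandas as pd

# Read the parameter file; the first (index) column is assumed to hold
# the phase label ('global' or 'local'); the file name is hypothetical
params = pd.read_csv('params.csv', index_col = 0)

# Extract the options of each modeling phase as dictionaries, e.g.
# global_params['DimensionalReduction.method'] -> 'pls'
global_params = params.loc['global'].to_dict()
local_params = params.loc['local'].to_dict()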

A.2 Data file (input)

This file, a portion of which is shown in Table 26, contains the raw training set data in the form of an N x (P+1) table, where N is the number of training samples and P is the number of descriptors. An extra column, labeled “Activity”, holds the response for each training observation. The index of compound identifiers is labeled “CAS_NO”.

Table 26: Sample data file (input).

CAS_NO Activity Atoms Bonds BondsRot HAcc …

2475-33-4 0 68 78 0 8 …

820-75-7 1 18 17 3 7 …

2435-76-9 1 12 12 0 6 …

817-99-2 1 16 15 3 6 …

116539-70-9 1 35 35 7 7 …

… … … … … … …
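As a minimal illustration of this layout (the file name data.csv is hypothetical), the table can be split into a descriptor matrix and a response vector in the same manner as the from_file constructor of Appendix C:

import pandas as pd

# Load the raw training data and separate the response column from the
# descriptor matrix; the file name is hypothetical
df = pd.read_csv('data.csv').set_index('CAS_NO')
y = df['Activity']                 # response vector
X = df.drop('Activity', axis = 1)  # N x P descriptor matrix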

A.3 Data types file (input)

This file, shown in Table 28, contains a list of all descriptors included in the data input file, together with an integer specifying each descriptor’s type. The numeric code for descriptor types is shown in Table 27. Note that Table 27 does not represent an input file.

Table 27: Numeric code for descriptor types.

Descriptor Type Value

Response 1

Continuous-valued (interval) 2

Count-based (integer) 3

Ordinal 4

Nominal 5

Binary 6

Returning to the data types file, the name of the column of descriptors is “Variable” and the column of types “Type”.

Table 28: Sample data types file (input).

Variable Type

Atoms 3

Bonds 3

BondsRot 3

Weight 2

Complex 2

ComplexRing 2

ASA 2

atom:element_main_group 6

atom:element_metal_group_I_II 6

atom:element_metal_group_III 6

atom:element_metal_metalloid 6

atom:element_metal_poor_metal 6

… …
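A brief sketch of how the numeric codes of Table 27 might be used to subset descriptors by type (the file name types.csv is hypothetical):

import pandas as pd

# Load the data types file and select, for example, the binary (type 6)
# and count-based (type 3) descriptors; the file name is hypothetical
var_types = pd.read_csv('types.csv').set_index('Variable')
binary_names = var_types.index[var_types['Type'] == 6].tolist()
count_names = var_types.index[var_types['Type'] == 3].tolist()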

A.4 Partition set file (input)

This file, shown in Table 29, indicates the partition in which each training observation may be found under an n-fold cross-validation scheme, where n is the number of folds. The column of training observations is named “CAS_NO” and the column of folds “Set”. An observation in fold “0” is part of a static training set which does not change and never serves in a test set during cross-validation.

Table 29: Sample partition set file (input).

CAS_NO Set

2475-33-4 3

820-75-7 3

2435-76-9 1

817-99-2 2

116539-70-9 4

115-02-6 5

122341-55-3 4

105149-00-6 4

108-78-1 0

2425-85-6 3

67019-24-3 2

… …
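The following sketch shows one way such a file could be generated for a list of compound identifiers, mirroring the partition method of Appendix C; the identifiers and file name are illustrative only.

import numpy as np
import pandas as pd

# Randomly assign each compound to one of five folds (1-5); compounds
# intended for the static training set would instead receive fold 0
ids = pd.Index(['2475-33-4', '820-75-7', '2435-76-9'], name = 'CAS_NO')
folds = np.random.choice(np.arange(1, 6), size = ids.size)
pd.DataFrame({'Set': folds}, index = ids).to_csv('partition.csv')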

A.5 Set labels file (input)

This file, shown in Table 30, indicates the set in which each training observation may be found under an n-fold cross-validation scheme, where n is the number of folds. It is a copy of the partition set file, but with the folds given in string form. An observation labeled “TRAIN” is part of the static training set. The column of training observations is named “CAS_NO” and the column of folds “Set”.

Table 30: Sample labels file (input).

CAS_NO Set

2475-33-4 CV3

820-75-7 CV3

2435-76-9 CV1

817-99-2 CV2

116539-70-9 CV4

115-02-6 CV5

122341-55-3 CV4

105149-00-6 CV4

108-78-1 TRAIN

2425-85-6 CV3

67019-24-3 CV2

… …
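Because this file is simply the partition set file recast as strings, it can be derived from that file directly; a minimal sketch (file names hypothetical):

import pandas as pd

# Map fold 0 to 'TRAIN' and fold k to 'CVk'
partition = pd.read_csv('partition.csv', index_col = 'CAS_NO')
labels = partition['Set'].apply(lambda k: 'TRAIN' if k == 0 else 'CV' + str(k))
labels.to_frame().to_csv('labels.csv')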

A.6 Data structures file (input)

This is a structure-data file (.sdf) containing the structural information for all molecules referenced in the data file. These files are similar to the MDL Molfiles described in Chapter 1 (particularly Figure 1) except that they can contain multiple compounds and associated data properties. The identifier of each compound should be annotated within its record.
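A minimal sketch of reading such a file with RDKit, as the from_file constructor of Appendix C does (the file name structures.sdf is hypothetical):

from rdkit import Chem

# Iterate over the records of the SD file; records RDKit cannot parse
# are returned as None
for mol in Chem.SDMolSupplier('structures.sdf'):
    if mol is None:
        continue
    # Prefer the annotated identifier, falling back on the title line
    cas = mol.GetProp('CAS_NO') if mol.HasProp('CAS_NO') \
        else mol.GetProp('_Name')
    print(cas, mol.GetNumAtoms())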

A.7 Model performance summary file (output)

This file, shown in Table 31, lists summary performance statistics for the accompanying global, local, and global-on-local predictions made by the workflow, together with the input parameters used to generate the predictions. If multiple values are specified for one or more parameters, the file contains one entry per grid point, the number of grid points being given by the product of the number of values specified for each parameter. Note that summary results on an external test set are equivalent to a single grid point.
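The grid size follows the product rule; for example, with sklearn's ParameterGrid (imported by the code of Appendix C), three locality radii crossed with two training techniques give 3 x 2 = 6 grid points (the parameter values here are illustrative):

from sklearn.model_selection import ParameterGrid

# Three radii crossed with two training techniques yield six
# parameter combinations (grid points)
grid = ParameterGrid({'LocalityAssessment.radius': [1.5, 2.1, 2.7],
                      'Train.technique': ['logistic', 'knn']})
print(len(grid))  # -> 6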

Table 31: Sample model performance summary file (output).

                                               grid point 1  grid point 1  grid point 1
Parameter / statistic                          global        local         global-on-local
Preprocess.option                              TRUE          TRUE          TRUE
Preprocess.missing_data_method                 remove        remove        remove
Preprocess.missing_data_indicator              -             -             -
Preprocess.variance_threshold_value            FALSE         FALSE         FALSE
Preprocess.infrequent_threshold_value          3             3             3
Preprocess.transform_descriptors               -             -             -
Preprocess.transform_function                  -             -             -
Preprocess.transform_inverse                   -             -             -
Preprocess.standardize                         TRUE          TRUE          TRUE
VariableSelection.option                       FALSE         FALSE         FALSE
VariableSelection.significance_level           0.05          0.05          0.05
VariableSelection.method                       univariate    univariate    univariate
VariableSelection.univariate_binary_method     fisher        fisher        fisher
DimensionalReduction.option                    TRUE          TRUE          TRUE
DimensionalReduction.method                    pls           pls           pls
DimensionalReduction.n_reduced_dimensions      8             8             8
DimensionalReduction.pca_method                component     component     component
DimensionalReduction.ratio_explained_variance  0.99          0.99          0.99
DimensionalReduction.perplexity                30            FALSE         FALSE
DimensionalReduction.k_neighbors               15            15            15
DimensionalReduction.min_dist                  0.1           0.1           0.1
DimensionalReduction.metric                    euclidean     euclidean     euclidean
DomainAssessment.option                        TRUE          TRUE          TRUE
DomainAssessment.domain_k                      3             3             3
DomainAssessment.domain_quantile               0.95          0.95          0.95
LocalityAssessment.option                      TRUE          TRUE          TRUE
LocalityAssessment.method                      radius        radius        radius
LocalityAssessment.radius                      2.1           FALSE         FALSE
LocalityAssessment.locality_k                  3             3             3
LocalityAssessment.locality_quantile           0.95          0.95          0.95
Train.option                                   TRUE          TRUE          TRUE
Train.technique                                logistic      logistic      logistic
Train.n_neighbors                              5             5             5
Train.weights                                  uniform       uniform       uniform
Train.n_components                             TRUE          TRUE          TRUE
nobs                                           6512          6512          6512
npred                                          4617          4498          4498
fpred                                          0.70899877    0.69072482    0.69072482
prev                                           0.5947585     0.59737661    0.59737661
tps                                            2152          2178          2114
fns                                            594           509           573
fps                                            646           547           625
tns                                            1225          1264          1186
sens                                           0.78368536    0.81056941    0.78675102
spec                                           0.65473009    0.69795693    0.6548868
accu                                           0.73142733    0.76522899    0.7336594
ppv                                            0.7691208     0.79926606    0.77181453
npv                                            0.67344695    0.71291596    0.67424673
auc                                            0.79092995    0.80467663    0.72081891

A.8 Model predictions file (output)

This file contains a record of the binary predictions made on each test observation by the global and local models for each grid point explored. Prediction results on an external test set are equivalent to a single grid point. Global-on-local predictions are not shown, since these are simply the global prediction values on the compounds for which a local prediction was provided.
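For clarity, the following toy sketch shows how the global-on-local predictions follow from the global and local prediction series (the identifiers and values are illustrative):

import numpy as np
import pandas as pd

# Retain the global prediction only where a local prediction exists
global_preds = pd.Series([1, 0, 1, 1], index = ['a', 'b', 'c', 'd'])
local_preds = pd.Series([1, np.nan, 0, np.nan], index = ['a', 'b', 'c', 'd'])
global_on_local = global_preds.where(local_preds.notnull())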

Table 32: Sample model predictions file (output).

                     2475-33-4  820-75-7  2435-76-9  817-99-2  …

grid point 1 global  -          1         0          1         …

grid point 1 local   -          1         0          1         …

Appendix B: Chemical Descriptors

This appendix lists the physicochemical descriptors used to represent the molecules and derive the QSAR models within this work. Information on the 729 ToxPrint chemotypes can be found at https://toxprint.org/.

Table 33: Descriptors calculated by CORINA Symphony to describe the Ames mutagenicity and blood-brain barrier molecular data sets.

Adapted from [28].

Descriptor | Key | Description | Unit
Number of atoms | Atoms | Total number of atoms in the molecule (including hydrogen) | -
Number of bonds | Bonds | Total number of bonds in the molecule | -
Number of rotatable bonds | BondsRot | Number of open-chain, single rotatable bonds | -
Number of hydrogen bonding acceptors | HAcc | Total number of hydrogen bonding acceptors derived from the sum of nitrogen and oxygen atoms in the molecule | -
Number of oxygen atom-based hydrogen bonding acceptors | HAccO | Number of hydrogen bonding acceptors derived from the sum of oxygen atoms in the molecule | -
Number of nitrogen atom-based hydrogen bonding acceptors | HAccN | Number of hydrogen bonding acceptors derived from the sum of nitrogen atoms in the molecule | -
Number of hydrogen bonding donors | HDon | Total number of hydrogen bonding donors derived from the sum of nitrogen and oxygen atoms in the molecule | -
Number of oxygen atom-based hydrogen bonding donors | HDonO | Number of hydrogen bonding donors derived from the sum of oxygen atoms in the molecule | -
Number of nitrogen atom-based hydrogen bonding donors | HDonN | Number of hydrogen bonding donors derived from the sum of nitrogen atoms in the molecule | -
Number of Rule-of-Five violations | Ro5Viol | Number of violations of Lipinski's rule of 5 (Weight > 500, XlogP > 5, HDon > 5, HAcc > 10) | -
Number of extended Rule-of-Five violations | Ro5ViolExt | Number of violations of the extended Lipinski's rule of 5 (additional rule: number of rotatable bonds > 10) [120] | -
Total number of tetrahedral stereocenters | Stereo | Total number of tetrahedral chiral centers in the molecule [120] | -
Molecular weight | Weight | Molecular weight derived from the gross formula | Da
Molecular complexity | Complex | Molecular complexity according to the approach by Hendrickson [121] | -
Ring complexity | ComplexRing | Ring complexity according to the approach by Gasteiger and Jochum [122] | -
Approximate surface area | ASA | Approximate surface area of the molecule [123] | Å2
McGowan molecular volume | McGowan | McGowan molecular volume approximated by fragment contributions [124] | mL/mol
Topological polar surface area | TPSA | Topological polar surface area of the molecule derived from polar 2D fragments [125] | Å2
Molecular dipole moment | Dipole | Dipole moment of the molecule | D
Mean molecular polarizability | Polariz | Mean molecular polarizability of the molecule | Å3
Aqueous solubility (logS) | LogS | Solubility of the molecule in water in log units | log units
Octanol/water partition coefficient (logP) | XlogP | Octanol/water partition coefficient of the molecule in log units following the XlogP approach [126] | log units
Molecular asphericity | Aspheric | Molecular asphericity [26] | -
Molecular eccentricity | Eccentric | Molecular eccentricity [26] | -
Molecular diameter | Diameter | Maximum distance between two atoms in the molecule [127] | Å
Principal moment of inertia, 1st principal axis | InertiaX | Principal component of the inertia tensor in x-direction [26] | Da·Å2
Principal moment of inertia, 2nd principal axis | InertiaY | Principal component of the inertia tensor in y-direction [26] | Da·Å2
Principal moment of inertia, 3rd principal axis | InertiaZ | Principal component of the inertia tensor in z-direction [26] | Da·Å2
Radius of gyration | Rgyr | Molecular radius of gyration [128], [129] | Å
Molecular span | Span | Radius of the smallest sphere centered at the center of mass which completely encloses all atoms in the molecule [128] | Å
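As an illustration of the Ro5Viol descriptor defined above, the violation count can be approximated with RDKit; note that RDKit's Crippen logP is only a stand-in for the XlogP value computed by CORINA Symphony, so this sketch is illustrative rather than a reproduction of the descriptor.

from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski

# Count Lipinski rule-of-five violations
# (Weight > 500, logP > 5, HDon > 5, HAcc > 10)
def ro5_violations(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return sum([Descriptors.MolWt(mol) > 500,
                Crippen.MolLogP(mol) > 5,
                Lipinski.NumHDonors(mol) > 5,
                Lipinski.NumHAcceptors(mol) > 10])

print(ro5_violations('CCO'))  # ethanol -> 0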

Appendix C: Python Code

This appendix contains the Python code used to execute the proposed methodology, the results of which are presented in this document. At present, portions of the script are activated or deactivated by commenting, depending on the specific application of the user.

Sample input and output files are included in Appendix A.

""" Author: Bryan C. Hobocienski Purpose: Code required to execute the local modeling workflow Date: 11/8/2018 Revised: 11/20/2019 """

# Import necessary libraries
import os
import copy
import pickle
import numpy as np
import pandas as pd
from warnings import warn
import scipy.stats as stats
from collections import OrderedDict

# Sklearn specific libraries
from sklearn.manifold import TSNE
from sklearn.metrics import r2_score
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# UMAP (umap-learn); required by MoleculeSet.dimensional_reduction when
# method = 'umap' but missing from the original import list
import umap

# Libraries related to RDKit
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import AllChem

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

# Create a molecule class which stores information on molecules
class Molecule:

# General names for the identifier and response identifier_name = 'CAS_NO' response_name = 'Activity' partition_name = 'Set'

# The default number of Molecule class objects n_molecules = 0

# The initialization method for the Molecule class def __init__(self, identifier = None, descriptors = None, response = None, set_label = None, pickled_rdk_molecule = None):

# Attributes common to all molecules self.identifier = identifier self.descriptors = descriptors self.response = response # Might not know response self.pickled_rdk_molecule = pickled_rdk_molecule self.set_label = set_label # Training, validation, test etc.

# Set attributes for the molecule self.partition_label = None # Which fold/split the molecule participates

# Eventual global and local predictions self.global_prediction = None self.local_prediction = None

# The MoleculeSet object corresponding to the local training set self.local_molecule_set = None # (MoleculeSet object)

# Increment the number of Molecule objects in existence Molecule.n_molecules += 1

# Define an instance method to visualize Molecule class objects def show_molecule(self, write_path = None, show = True):

# If a pickled rdk molecule is present for the Molecule object if self.pickled_rdk_molecule:

# Unpickle the rdk molecule mol = pickle.loads(self.pickled_rdk_molecule) # Compute the 2D coordinates of the molecule tmp = AllChem.Compute2DCoords(mol)

if write_path:

# Draw the molecule and write to the specified path Draw.MolToFile(mol, write_path)

# If show is true, produce a Matplotlib figure of the molecule if show == True:

# Produce a Matplotlib figure of the molecule molfig = Draw.MolToMPL(mol) molfig.suptitle(self.identifier)

# Define a class composed of a single set of one or multiple molecules class MoleculeSet:

# General names for the identifier, response, partition, and variable types identifier_name = 'CAS_NO' response_name = 'Activity' partition_name = 'Set' variable_name = 'Variable'

# A counter for the number of molecule set instances in existence n_molecule_sets = 0

# Define the molecule set initialization method def __init__(self, raw_descriptors = None, raw_response = None, set_label = None, var_types = None, pickled_rdk_molecules = None):

# The raw descriptors and responses, variable types, set label, # and RDK molecule objects are the starting information of a molecule # set object self.raw_descriptors = raw_descriptors self.raw_response = raw_response self.var_types = var_types self.set_label = set_label self.pickled_rdk_molecules = pickled_rdk_molecules

# The processed descriptors and responses change as the data is # manipulated by the workflow but the raw data is retained try: self.processed_descriptors = copy.deepcopy(raw_descriptors) except: pass try: self.processed_response = copy.deepcopy(raw_response) except: pass # Partition labels are used for cross-validation self.partition_key = None # self.partition_label = None #

# Prediction and performance results from modeling if not isinstance(self.raw_response, np.int64):

# self.global_predictions = pd.Series(np.nan, self.raw_response.index) self.global_prediction_probabilities = pd.Series(np.nan, self.raw_response.index)

self.local_predictions = pd.Series(np.nan, self.raw_response.index) self.local_prediction_probabilities = pd.Series(np.nan, self.raw_response.index) else:

# self.global_predictions = None self.global_prediction_probabilities = None

# self.local_predictions = None self.local_prediction_probabilities = None

# self.global_on_local_predictions = None

# self.global_performance_summary = None self.local_performance_summary = None self.global_on_local_performance_summary = None

# Attributes related to preprocessing self.observations_missing_values = None self.zero_variance_descriptors = None self.infrequent_binary_descriptors = None

# Attributes related to variable selection self.descriptor_statistics = None self.significant_descriptors = None self.binary_selection_statistics = None self.numerical_selection_statistics = None

# Sklearn objects related to data manipulation self.imputation_object = None self.variance_threshold_object = None self.transformation_object = None self.standardization_object = None self.dimensional_reduction_object = None self.modeling_technique_object = None

# Variance explained by dimensional reduction # self.dimensional_reduction_explained_variance # Increment the number of molecule sets in existence by one MoleculeSet.n_molecule_sets += 1

# Define a molecule set constructor method @classmethod def from_file(cls, csv_path = None, sdf_path = None, var_type_path = None, set_label_path = None, sdf_id = None):

# Initialize lists for holding read objects csv_ids, sdf_ids, pkl_rdk_mols = [], [], {}

# If the the path to the variable types file has been passed if var_type_path:

# Read and set the index of the variable types var_types = pd.read_csv(var_type_path) var_types = var_types.set_index(cls.variable_name)

# If the path to the data as a .csv file has been passed if csv_path:

# Try to read and parse the .csv file try:

# Read and parse the .csv file df = pd.read_csv(csv_path) df = df.set_index(cls.identifier_name) response = df[cls.response_name] descriptors = df.drop(cls.response_name, axis = 1) csv_ids = df.index.tolist()

# If the .csv file could not be read except:

# Not being able to read the .csv data is a fatal error raise Exception('The .csv file could not be read.')

# If the path to the SD file has been passed if sdf_path:

# Try to read and parse the SD file try:

# Initialize an RDK sd file reader object (generator) suppl = Chem.SDMolSupplier(sdf_path)

# Iterate over the molecules in the sd file # (there should be only one molecule) for mol in suppl:

# if sdf_id:

# if mol.HasProp(sdf_id): identifier = mol.GetProp(sdf_id)

# else:

# Attempt to read the identifier name for the molecule if mol.HasProp(cls.identifier_name): identifier = mol.GetProp(cls.identifier_name) elif mol.HasProp('_Name'): identifier = mol.GetProp('_Name') elif mol.HasProp('Name'): identifier = mol.GetProp('Name')

# Pickle the molecules before storing them pkl = pickle.dumps(mol) pkl_rdk_mols[identifier] = pkl sdf_ids.append(identifier)

# If the SD file could not be read except:

# Not reading the SD file is not a fatal error warn('The SD file could not be read.')

# if set_label_path: try: set_label = pd.read_csv(set_label_path) set_label = set_label.set_index(cls.identifier_name) except: warn('The set label file could not be read') else: set_label = None

# Check the identifiers of the molecules within the .csv and .sd files if csv_ids and sdf_ids:

# If there is a discrepancy between the two data files if not set(csv_ids) == set(sdf_ids):

# Warn the user if the data files do not agree warn('The identifiers between the .csv and .sdf sets do not match.')

# remove_ids = [str(i) for i in list(set(csv_ids) \ .symmetric_difference(set(sdf_ids)))]

# if isinstance(descriptors.index[0], np.integer):

# descriptors = descriptors.reindex([str(i) \ for i in descriptors.index.tolist()]) response = response.reindex([str(i) \ for i in response.index.tolist()])

# data_remove_ids = list(set(remove_ids).intersection(set(descriptors.index.tolist())))

# descriptors = descriptors.drop(data_remove_ids, axis = 0) response = response.drop(data_remove_ids, axis = 0)

# try:

# mols = []

# suppl = Chem.SDMolSupplier(sdf_path)

# for mol in suppl:

# identifier = None

# if sdf_id:

# if mol.HasProp(sdf_id): identifier = mol.GetProp(sdf_id)

# else:

# if mol.HasProp(cls.identifier_name): identifier = mol.GetProp(cls.identifier_name) elif mol.HasProp('_Name'): identifier = mol.GetProp('_Name') elif mol.HasProp('Name'): identifier = mol.GetProp('Name')

# if identifier not in remove_ids:

# mols.append(mol) # w = Chem.SDWriter(sdf_path) for m in mols: w.write(m)

# except:

# warn('The SD file could not be read to remove mismatches.')

# for key in remove_ids: if key in pkl_rdk_mols: del pkl_rdk_mols[key]

# Return a molecule set object with from the read data return cls(raw_descriptors = descriptors, raw_response = response, set_label = set_label, pickled_rdk_molecules = pkl_rdk_mols, var_types = var_types)

# Define a method to create Molecule objects of the identifiers specified def to_molecules(self, identifiers = None):

# Initialize a dictionary to hold molecule objects molecule_objects = {}

# Only specified molecules by identifier will become molecule objects if identifiers:

# Check if the specified identifiers are within the molecule set identifiers = [ide for ide in identifiers \ if ide in self.pickled_rdk_molecules.keys()]

# if not identifiers:

# identifiers = [ide for ide in self.pickled_rdk_molecules.keys()]

# Iterate over the identifiers present for identifier in identifiers:

# Extract the information relevant to the molecule mol_descriptors = self.raw_descriptors.loc[identifier] mol_response = self.raw_response.loc[identifier] mol_pkl_rdk = self.pickled_rdk_molecules[identifier] mol_label = self.set_label.loc[identifier].tolist()[0]

# Instantiate a molecule object with the information provided molecule = Molecule(identifier = identifier, descriptors = mol_descriptors, response = mol_response, pickled_rdk_molecule = mol_pkl_rdk, set_label = mol_label)

# Add the molecule to the dictionary of molecule objects molecule_objects[identifier] = molecule

# Return the molecule objects return molecule_objects

# Define a function to generate 2D images of molecules within the data set # (identifiers allows the user to specify a subset of molecules to show) def show_structures(self, write_path = None, identifiers = None):

# If no path is given to write the image file if not write_path:

# Notify the user that a write path must be specified raise \ Exception('The complete path with name and file extension ' \ 'of the image file must be specified.')

# If there are pickled RDK molecules for the data set if self.pickled_rdk_molecules:

# If identifiers of a subset of molecules were passed if identifiers:

# Only depict molecules actually present within the data set identifiers \ = [ide for ide in identifiers \ if ide in self.pickled_rdk_molecules.keys()]

# Unpickle the RDK molecules mols = [pickle.loads(self.pickled_rdk_molecules[ide]) \ for ide in identifiers]

# Compute coordinates for the molecules before drawing for m in mols: tmp = AllChem.Compute2DCoords(m)

row_length = 10 if identifiers: if len(identifiers) < 10: row_length = len(identifiers)

# Depict the molecules in a grid with their identifier's # labeled imgs = Draw.MolsToGridImage(mols, molsPerRow = row_length, subImgSize = (500,500), legends = identifiers)

# Save the molecule depictions to an image file at the path # provided imgs.save(write_path)

# else:

# undepictable_mols = []

# if not os.path.isdir(write_path): os.mkdir(write_path)

# Unpickle the RDK molecules mols = [pickle.loads(self.pickled_rdk_molecules[ide]) \ for ide in self.pickled_rdk_molecules] identifiers = list(self.pickled_rdk_molecules.keys())

# Compute coordinates for the molecules before drawing for m in range(len(mols)):

# try: tmp = AllChem.Compute2DCoords(mols[m]) Draw.MolToFile(mols[m], write_path + identifiers[m] + '.png')

# except: undepictable_mols.append(identifiers[m])

if undepictable_mols: undepictable_mols = pd.Series(undepictable_mols) undepictable_mols.to_csv(write_path)

# Define a method to partition the observations of the molecule set # into separate groups def partition(self, method = 'kfold', n_folds = 5, train_size = 0.8, file_path = None):

# Accepted data partition methods: 'kfold', 'split', 'file'

# if method == 'kfold':

# Initialize a list to record which fold each training observation # serves as a test set member partition_ids = pd.DataFrame( np.random.choice(a = np.arange(1, n_folds + 1), size = self.raw_descriptors.shape[0]), index = self.raw_descriptors.index)

partition_ids.columns = [self.partition_name]

# If a train/test split is to be performed elif method == 'split':

# Initialize a list to record which fold each training observation # serves as a test set member (Pandas Series) partition_ids = pd.DataFrame( np.random.choice(a = [1, 2], size = self.raw_descriptors.shape[0], p = [train_size, 1-train_size]), index = self.raw_descriptors.index)

partition_ids.columns = [self.partition_name]

# if method == 'file':

# if not file_path:

# raise Exception('The path to the file containing the ' \ 'partition key was not specified')

# else:

# partition_ids = pd.read_csv(file_path, index_col \ = MoleculeSet.identifier_name)

# partition_ids.name = 'partition_set' self.partition_key = partition_ids

# def to_moleculesets(self, training_partition = False):

# moleculesets = []

# if self.partition_key is None:

warn('The function needs a partition key from which to construct' \ ' MoleculeSets objects.') return

# if isinstance(self.partition_key, pd.DataFrame):

# self.partition_key = self.partition_key[self.partition_name]

# max_part = self.partition_key.max()

# for j in range(1, max_part + 1):

# If there is a static training set (e.g. Hansen et al.) if training_partition == True:

# The static training set will always be denoted by 0 static_training_indices = self.partition_key \ .index[(self.partition_key == 0).tolist()].tolist()

# training_indices = self.partition_key \ .index[((self.partition_key != j) \ & (self.partition_key != 0)).tolist()].tolist()

# training_indices = static_training_indices + training_indices

# test_indices = self.partition_key \ .index[(self.partition_key == j).tolist()].tolist()

# Typical n-fold cross-validation else:

# training_indices = self.partition_key \ .index[(self.partition_key != j).tolist()].tolist()

# test_indices = self.partition_key \ .index[(self.partition_key == j).tolist()].tolist()

# test_descriptors = self.processed_descriptors.loc[test_indices] test_response = self.processed_response.loc[test_indices] test_set_label = self.set_label.loc[test_indices]

test_pickled_rdk_molecules \ = {teidx: self.pickled_rdk_molecules[teidx] for teidx in test_indices}

# training_descriptors = self.processed_descriptors.loc[training_indices] training_response = self.processed_response.loc[training_indices] training_set_label = self.set_label.loc[training_indices] training_pickled_rdk_molecules \ = {tridx: self.pickled_rdk_molecules[tridx] \ for tridx in training_indices}

# test_moleculeset \ = MoleculeSet(raw_descriptors = copy.deepcopy(test_descriptors), raw_response = copy.deepcopy(test_response), set_label = copy.deepcopy(test_set_label), var_types = copy.deepcopy(self.var_types), pickled_rdk_molecules \ = copy.deepcopy(test_pickled_rdk_molecules))

# training_moleculeset \ = MoleculeSet(raw_descriptors = copy.deepcopy(training_descriptors), raw_response = copy.deepcopy(training_response), set_label = copy.deepcopy(training_set_label), var_types = copy.deepcopy(self.var_types), pickled_rdk_molecules \ = copy.deepcopy(training_pickled_rdk_molecules))

# moleculesets_object = MoleculeSets(name = 'hansen_set_' + str(j+1), training_set = training_moleculeset, test_set = test_moleculeset)

# moleculesets.append(moleculesets_object)

# return moleculesets

# Define a method to conduct data set pre-processing def preprocess(self, missing_data_method = 'remove', missing_data_indicator = np.nan, variance_threshold_value = 0.0, infrequent_threshold_value = 3, transform_descriptors = None, transform_function = None, transform_inverse = None, standardize = True):

# Treat observations with missing data self.missing_data_imputation(method = missing_data_method, indicator_type = missing_data_indicator)

# Treat descriptors with little to no variance self.variance_threshold_removal(threshold = variance_threshold_value)

# Treat binary descriptors which occur infrequently within the data set self.infrequent_descriptor_removal(threshold \ = infrequent_threshold_value)

# Descriptor transformation (e.g. taking the square root of a count # descriptor) self.descriptor_transformation(transform_descriptors \ = transform_descriptors, transform_function = transform_function, transform_inverse = transform_inverse)

# Standardize the data set self.standardization(standardize = standardize)

# Define a method to perform variable selection on the data set def variable_selection(self, significance_level = 0.05, method = 'univariate', univariate_binary_method = 'fisher'):

# if self.significant_descriptors is None:

# Identify interval descriptor names (Pandas index object) non_binary_descriptor_names \ = self.var_types.index[((self.var_types == 2) | \ (self.var_types == 3))['Type']].tolist()

non_binary_descriptor_names \ = [name for name in non_binary_descriptor_names \ if name in self.processed_descriptors.columns.tolist()]

# Extract the interval descriptor data non_binary_descriptors \ = self.processed_descriptors[non_binary_descriptor_names]

# Identify the binary descriptor names (Pandas index object) binary_descriptor_names \ = self.var_types.index[(self.var_types == 6)['Type']].tolist()

binary_descriptor_names \ = [name for name in binary_descriptor_names \ if name in self.processed_descriptors.columns.tolist()]

# Extract the binary descriptor data binary_descriptors \ = self.processed_descriptors[binary_descriptor_names]

# If the variable selection method is univariate if method == 'univariate':

# If there is interval descriptor data if not non_binary_descriptors.empty:

### ANOVA via SciPy ###

# Group the non-binary descriptor data by response class response_grouping = [response_class_data \ for response_class_data in non_binary_descriptors \ .groupby(self.processed_response)] # Initialize a list to hold ANOVA statistics anova_stats = []

# Iterate over each non-binary descriptor for descript in non_binary_descriptors.columns:

# If the within treatment variance is NOT zero OR # undefined (sample variance is normalized by 1/(n-1)) # AND the sum of the observations of each treatment are # larger than sums of squares degrees of freedom (I-J) if not ((response_grouping[0][1][descript].var() < 0.001 \ or pd.isnull(response_grouping[0][1][descript].var())) \ and (response_grouping[1][1][descript].var() < 0.001 \ or pd.isnull(response_grouping[1][1][descript].var()))) \ and (response_grouping[0][1][descript].shape[0] \ + response_grouping[1][1][descript].shape[0] > 2):

# Perform ANOVA on the distribution of the non-binary # descriptors by the response treatment levels anova_stats \ .append( \ stats.f_oneway(response_grouping[0][1][descript], response_grouping[1][1][descript]))

# One-way ANOVA cannot be performed else:

# Drop the descriptor from consideration # (label insignificant) non_binary_descriptors = \ non_binary_descriptors.drop(descript, axis = 1)

# Process the ANOVA statistics if it could be applied if anova_stats:

# Pre-format the ANOVA association statistics anova_stats = [[anova_stats[p][0], anova_stats[p][1]] \ for p in range(len(anova_stats))]

# Place the ANOVA association statistics into a Pandas # Data Frame anova_statistics \ = pd.DataFrame(anova_stats, index = non_binary_descriptors.columns, columns = ['f_stat','p-val'])

# Add the ANOVA statistics to those returned for later # review as a molecule set attribute self.numerical_selection_statistics \ = copy.deepcopy(anova_statistics)

# Extract the names of non-binary descriptors which appear # to have a correlation to the response significant_non_binary_descriptor_names \ = pd.Series( \ anova_statistics.index[anova_statistics['p-val'] \ <= significance_level])

# else:

# significant_non_binary_descriptor_names = None

# If there are no non binary descriptors or none that are # significant else: significant_non_binary_descriptor_names = None

# Only perform chi-squared or fisher exact tests if there are # binary descriptors if not binary_descriptors.empty:

# Produce a contingency array for each binary descriptor # enumerating the occurrence of the binary response and # descriptor variables binary_descriptor_contingency_array \ = [np.array([((self.processed_response == 1) \ & (binary_descriptors[dscpt] == 1)).sum(), \ ((self.processed_response == 0) \ & (binary_descriptors[dscpt] == 1)).sum(), \ ((self.processed_response == 1) \ & (binary_descriptors[dscpt] == 0)).sum(), \ ((self.processed_response == 0) \ & (binary_descriptors[dscpt] == 0)).sum()]) \ for dscpt in binary_descriptors]

# Recast the contingency array into a Pandas Data Frame binary_descriptor_contingency_array \ = pd.DataFrame(binary_descriptor_contingency_array, index = binary_descriptors.columns.tolist(), columns = ['TP','FP','FN','TN'])

# Find any descriptors (by index) where an infinite odds ratio # will occur infinite_odds_ratio_index \ = np.where( \ (binary_descriptor_contingency_array['FP'] == 0) \ | (binary_descriptor_contingency_array['FN'] == 0))

# If there are binary descriptors with an infinite odds ratio if infinite_odds_ratio_index[0].size != 0:

# Implement a correction to artificially sway the # descriptors' significance toward the null hypothesis binary_descriptor_contingency_array \ .iloc[infinite_odds_ratio_index[0]] \ = binary_descriptor_contingency_array \ .iloc[infinite_odds_ratio_index[0]] + np.ones(4)

# Calculate the sample odds ratio for each binary descriptor # and add to the contingency array sample_odds_ratios \ = [np.divide(np.multiply( \ binary_descriptor_contingency_array.iloc[r,0], \ binary_descriptor_contingency_array.iloc[r,3]), \ np.multiply( \ binary_descriptor_contingency_array.iloc[r,1], \ binary_descriptor_contingency_array.iloc[r,2])) \ for r in range(binary_descriptor_contingency_array \ .shape[0])]

# Place the sample odds ratios into a Pandas Series sample_odds_ratios \ = pd.Series(sample_odds_ratios, index = binary_descriptors.columns)

# Name the sample odds ratio statistic sample_odds_ratios.name = 'OR'

# Add the sample odds ratios to the binary descriptor # contingency array as a column binary_descriptor_contingency_array \ = pd.concat([binary_descriptor_contingency_array, sample_odds_ratios], axis = 1)

# Determine the correlation of the binary descriptors using the # Chi-2 test if univariate_binary_method == 'chi2':

# Conduct the chi2 test for association on each of the # binary descriptors chi2_statistics \ = [stats.chi2_contingency( \ np.array( \ [[binary_descriptor_contingency_array.iloc[t,0], binary_descriptor_contingency_array.iloc[t,1]], [binary_descriptor_contingency_array.iloc[t,2], binary_descriptor_contingency_array.iloc[t,3]]])) \ for t in range(binary_descriptors.shape[1])]

# Extract the p-values from the chi2 test of association # and cast as a Pandas Series chi2_p_values \ = pd.Series([stat[1] for stat in chi2_statistics], index = binary_descriptors.columns)

# Name the Series holding p-values chi2_p_values.name = 'p-val' # Add the p-values from the test of association to the # contingency array binary_descriptor_contingency_array \ = pd.concat([binary_descriptor_contingency_array, \ chi2_p_values], axis = 1)

# Determine the correlation of the binary descriptors using the # Fisher exact test if univariate_binary_method == 'fisher':

# Initialize a list to record fisher exact statistics fisher_statistics = []

# Iterate over the binary descriptors for des_idx in range(binary_descriptors.shape[1]):

# Extract the 2x2 contingency table for the current # descriptor for the contingency array contingency_table \ = np.array(\ [[binary_descriptor_contingency_array \ .iloc[des_idx,0], \ binary_descriptor_contingency_array \ .iloc[des_idx,1]], \ [binary_descriptor_contingency_array \ .iloc[des_idx,2], \ binary_descriptor_contingency_array \ .iloc[des_idx,3]]])

# Calculate the odds ratio and the p-value for # significance on the current descriptor using the # Fisher exact test fisher_statistics \ .append(stats.fisher_exact(contingency_table))

# Extract the p-values from the Fisher exact test of # association and cast as a Pandas Series fisher_p_values \ = pd.Series([fisher_stat[1] \ for fisher_stat in fisher_statistics], \ index = binary_descriptors.columns)

# Name the Series holding the p-values fisher_p_values.name = 'p-val'

# Add the p-values from the test of association to the # contingency array binary_descriptor_contingency_array \ = pd.concat([binary_descriptor_contingency_array, \ fisher_p_values], axis = 1)

# Add the binary variable selection statistics to those # returned as an attribute of the molecule set object contingency_array \ = pd.DataFrame(binary_descriptor_contingency_array)

# self.binary_selection_statistics \ = copy.deepcopy(contingency_array)

# Determine which binary descriptors are thought to correlate # with the response based on the association method used significant_binary_descriptor_names \ = pd.Series( \ binary_descriptor_contingency_array \ .index[binary_descriptor_contingency_array['p-val'] \ <= significance_level])

# If there are no binary descriptors or none that are significant else: significant_binary_descriptor_names = None

# if significant_non_binary_descriptor_names is not None \ and significant_binary_descriptor_names is not None:

# Combine the names of the significant non-binary and binary # descriptors significant_descriptors_names \ = pd.concat((significant_non_binary_descriptor_names, significant_binary_descriptor_names))

elif significant_non_binary_descriptor_names is not None \ and significant_binary_descriptor_names is None:

# Combine the names of the significant non-binary and binary # descriptors significant_descriptors_names \ = significant_non_binary_descriptor_names

elif significant_non_binary_descriptor_names is None \ and significant_binary_descriptor_names is not None:

# Combine the names of the significant non-binary and binary # descriptors significant_descriptors_names \ = significant_binary_descriptor_names

# else:

# significant_descriptors_names = None

# if significant_descriptors_names is not None:

# Reset the index of the significant descriptor names object significant_descriptors_names.reset_index(drop = True, \ inplace = True)

# Set the significant descriptor names as an attribute of the # molecule set self.significant_descriptors = significant_descriptors_names

# The processed descriptors attribute consists of the statistically # significant descriptor data if self.significant_descriptors is not None: self.processed_descriptors \ = self.processed_descriptors[self.significant_descriptors]

# else:

# if isinstance(self.processed_descriptors, pd.Series):

# Find descriptors in the set deemed insignificant by training insignificant_descriptors \ = [d for d in self.processed_descriptors.index.tolist() \ if d not in self.significant_descriptors.values.tolist()]

# Drop the insignificant descriptors self.processed_descriptors \ = self.processed_descriptors \ .drop(insignificant_descriptors, axis = 0)

else:

# Find descriptors in the set deemed insignificant by training insignificant_descriptors \ = [d for d in self.processed_descriptors.columns.tolist() \ if d not in self.significant_descriptors.values.tolist()]

# Drop the insignificant descriptors self.processed_descriptors \ = self.processed_descriptors \ .drop(insignificant_descriptors, axis = 1)

# Define a method to conduct dimensional reduction on the data set # descriptors def dimensional_reduction(self, method = 'pls', n_reduced_dimensions = 2, pca_method = 'component', ratio_explained_variance = 0.99, n_neighbors = 15, min_dist = 0.1, metric = 'euclidean'):

# If the data set descriptors do not already have a dimensional # reduction object attribute if self.dimensional_reduction_object is None:

# If the dimensional reduction method is PCA or PLS if method == 'pca' or method == 'pls':

# If the requested number of reduced dimensions is greater than the # smaller of the number of observations minus one or the number # of untransformed variables if not n_reduced_dimensions \ > min(self.processed_descriptors.shape[0]-1, \ self.processed_descriptors.shape[1]):

# If the dimensional reduction method is Principal # Component Analysis if method == 'pca':

# If the Principal Component Analysis method is to use # the number of components specified if pca_method == 'component':

# Initialize the dimensional reduction object with # number of components specified self.dimensional_reduction_object \ = PCA(n_components = int(n_reduced_dimensions), svd_solver = 'full')

# Fit the dimensional reduction object to the data # set descriptors self.dimensional_reduction_object \ .fit(self.processed_descriptors)

# If the Principal Component Analysis method is to use # as many components necessary to explain the variance # ratio provided elif pca_method == 'variance':

# Start the number of reduced dimensions at two n_reduced_dimensions = 2

# Initialize the dimensional reduction object self.dimensional_reduction_object \ = PCA(n_components = int(n_reduced_dimensions), svd_solver = 'full')

# Fit the dimensional reduction object to the data # set descriptors self.dimensional_reduction_object \ .fit(self.processed_descriptors)

# While the variance explained by the reduced # dimensions is less than that desired while np.sum(self.dimensional_reduction_object \ .explained_variance_ratio_) \ < ratio_explained_variance:

# Increase the number of reduced dimensions by # one n_reduced_dimensions += 1

# Re-initialize the dimensional reduction # object using the new number of reduced # dimensions self.dimensional_reduction_object \ = PCA(n_components = int(n_reduced_dimensions), svd_solver = 'full')

# Re-fit the dimensional reduction object to # the data set descriptors self.dimensional_reduction_object \ .fit(self.processed_descriptors)

# If the dimensional reduction object is Partial Least # Squares elif method == 'pls':

# Initialize a dimensional reduction object with the # number of reduced dimensions specified self.dimensional_reduction_object \ = PLSRegression(n_components \ = int(n_reduced_dimensions), \ scale = False)

# Perform PLS on the training descriptors self.dimensional_reduction_object \ .fit(self.processed_descriptors, self.processed_response)

# if method == 'umap':

# Initialize a dimensional reduction object with the # number of reduced dimensions specified self.dimensional_reduction_object \ = umap.UMAP(n_components = int(n_reduced_dimensions), n_neighbors = n_neighbors, min_dist = min_dist, metric = metric)

# Perform UMAP on the training descriptors self.dimensional_reduction_object \ .fit(self.processed_descriptors)

### PLS explained descriptor space matrix ###

# If the descriptors data set contains one observation if self.processed_descriptors.index.size == 1 \ or isinstance(self.processed_descriptors, pd.Series):

# Apply the dimensional reduction to the descriptors data descriptors_scores = self.dimensional_reduction_object \ .transform(self.processed_descriptors.values.reshape(1,-1))

# Create a header labeling each of the reduced dimensions n_reduced_dimensions = descriptors_scores.shape[-1] column_header = ['Comp' + str(i+1) \ for i in range(n_reduced_dimensions)]

# Update the processed descriptors attribute with the # transformed, reduced descriptor space self.processed_descriptors = pd.Series(descriptors_scores[0], index = column_header)

# Otherwise, the descriptors data set contains multiple # dimensions else:

# descriptors_scores = self.dimensional_reduction_object.transform( self.processed_descriptors)

# Create a header labeling each of the reduced dimensions n_reduced_dimensions = descriptors_scores.shape[-1] column_header = ['Comp' + str(i+1) for i in range(n_reduced_dimensions)]

# Update the processed descriptors attribute with the # transformed, reduced descriptor space self.processed_descriptors = pd.DataFrame(descriptors_scores, index = self.processed_descriptors.index.tolist(), \ columns = column_header)

# Define a method for removing or imputing observations with missing data def missing_data_imputation(self, method = 'remove', indicator_type = np.nan):

# Find the identifiers of observations missing data if isinstance(self.processed_descriptors, pd.Series):

# missing_value_labels \ = self.processed_descriptors.index \ [self.processed_descriptors.isnull().values.any(axis = 0)]

else:

# missing_value_labels \ = self.processed_descriptors.index \ [self.processed_descriptors.isnull().values.any(axis = 1)]

# if missing_value_labels.size != 0:

# Package the observations missing values into a Pandas Series self.observations_missing_values \ = pd.Series(missing_value_labels, name = 'observations_missing_values')

else:

self.observations_missing_values \ = pd.Series(name = 'observations_missing_values') return

# If the observations are to be removed if method == 'remove':

# Drop observations/rows with any missing values self.processed_descriptors = self.processed_descriptors.dropna()

# If the response variable exists for the data set if self.processed_response is not None:

# Drop the same observations from the response data self.processed_response \ = self.processed_response.drop(missing_value_labels)

# Otherwise, the observations will be imputed else:

# If an imputation object was not passed if self.imputation_object is None:

# Initialize an imputation object # (NaN instances are recognised as missing values, imputed by # column/descriptor means) self.imputation_object \ = SimpleImputer(missing_values = indicator_type, strategy = method)

# Fit the descriptors to the imputation object self.imputation_object.fit(self.processed_descriptors)

# Impute the descriptors and package them into a Pandas DataFrame descriptors_array = \ self.imputation_object.transform(self.processed_descriptors)

# Recast the data set descriptors as a Pandas DataFrame self.processed_descriptors = \ pd.DataFrame(descriptors_array, index = self.processed_descriptors.index, columns = self.processed_descriptors.columns)

# Define a function to remove descriptors with variance below the desired # threshold def variance_threshold_removal(self, threshold = 0.0):

# If a variance-based descriptor removal object does not exist if self.variance_threshold_object is None:

# Initialize the variance-based feature selector self.variance_threshold_object = VarianceThreshold(threshold)

# Fit the descriptors to the variance-based feature selector self.variance_threshold_object.fit(self.processed_descriptors)

# Find the names of the descriptors removed zero_variance_descriptors \ = self.processed_descriptors.columns \ [self.variance_threshold_object.variances_ <= threshold]

# Find the names of the descriptors retained non_zero_variance_descriptors \ = self.processed_descriptors.columns \ [self.variance_threshold_object.variances_ > threshold]

# Package the zero variance descriptors into a Pandas Series self.zero_variance_descriptors \ = pd.Series(zero_variance_descriptors, name = 'zero_variance_descriptors')

# Apply the variance threshold remove to the data set descriptor_array \ = self.variance_threshold_object\ .transform(self.processed_descriptors)

# Recast the selected descriptors into a Pandas DataFrame self.processed_descriptors \ = pd.DataFrame(descriptor_array, index = self.processed_descriptors.index, columns = non_zero_variance_descriptors)

# Define a method to remove infrequent binary descriptors def infrequent_descriptor_removal(self, threshold = 3):

# If the data set has no infrequent binary descriptor attribute # (such would be the case of a training set) if self.infrequent_binary_descriptors is None:

# Identify the binary descriptors (Pandas index object) binary_descriptor_names \ = self.var_types.index[(self.var_types == 6)['Type'] \ .tolist()].tolist()

# If there are binary descriptors if binary_descriptor_names:

# binary_descriptor_names \ = set(self.processed_descriptors.columns.tolist()) \ .intersection(set(binary_descriptor_names))

# Retrieve the binary descriptors data from the current # data set binary_descriptors \ = self.processed_descriptors.loc[:,binary_descriptor_names]

# Find the binary descriptors deemed too infrequent to use # for modeling infreq_binary_descriptor_names \ = binary_descriptors.columns[binary_descriptors.sum(axis = 0) \ <= threshold]

# Record the infrequent descriptors removed as an attribute of # the data set self.infrequent_binary_descriptors \ = pd.Series(infreq_binary_descriptor_names, name = 'infrequent_binary_descriptors')

# The data set will have infrequent binary descriptors if found # or assigned as an attribute ahead of time (such is the case with # a test set) if self.infrequent_binary_descriptors is not None:

# If the data set is a Series (local modeling) if isinstance(self.processed_descriptors, pd.Series):

# Drop the infrequent binary descriptors from the data set self.processed_descriptors \ = self.processed_descriptors. \ drop(self.infrequent_binary_descriptors.tolist())

# If the data is a Data Frame (pre-processing or global modeling) else:

# Drop the infrequent binary descriptors from the data set self.processed_descriptors \ = self.processed_descriptors. \ drop(self.infrequent_binary_descriptors.tolist(), axis = 1)

# Define a method to perform a user-specified transformation on a user- # specified list of descriptors def descriptor_transformation(self, transform_descriptors = None, transform_function = None, transform_inverse = None):

# transformed_descriptor_names = None

# if transform_descriptors:

# The desired descriptors to be transformed actually present within # the data set transformed_descriptor_names \ = [d for d in transform_descriptors \ if d in self.processed_descriptors.columns.tolist()]

# Code to preserve the order of the processed descriptors order = self.processed_descriptors.columns.tolist() transformed_order \ = [d for d in order if d in transformed_descriptor_names] non_transformed_order \ = [d for d in order if d not in transformed_descriptor_names] order = transformed_order + non_transformed_order

# If there are descriptors present to be transformed if transformed_descriptor_names:

# Extract the transformed descriptor data transformed_descriptors \ = self.processed_descriptors.loc[:,transformed_descriptor_names]

# Drop the transformed descriptor data from the set self.processed_descriptors \ = self.processed_descriptors.drop(transformed_descriptors, axis = 1)

# If there is not already a transformation object if self.transformation_object is None:

# Initialize a transformation object and fit it to the # transformed descriptor data self.transformation_object \ = FunctionTransformer(func = transform_function, inverse_func = transform_inverse, validate = False) self.transformation_object.fit(transformed_descriptors)

# Apply the transformation to the desired descriptors transformed_descriptors \ = self.transformation_object.transform(transformed_descriptors)

# Combine the transformed and non-transformed descriptors self.processed_descriptors \ = pd.merge(self.processed_descriptors, transformed_descriptors, left_index = True, right_index = True)

# Re-establish the original descriptor order self.processed_descriptors = self.processed_descriptors[order]

# Define a method to standardize numerical (interval or count-based) # descriptor data def standardization(self, standardize = True):

# if standardize:

if isinstance(self.processed_descriptors, pd.Series):

descriptors = self.processed_descriptors.index.tolist()

else:

descriptors = self.processed_descriptors.columns.tolist()

# Identify numerical descriptors (Pandas index object) numerical_descriptor_names \ = self.var_types.index[((self.var_types == 2) | \ (self.var_types == 3))['Type']].tolist()

# If there are numerical descriptors if numerical_descriptor_names:

# The numerical descriptors present within the data set numerical_descriptors_present \ = list(set(numerical_descriptor_names) \ .intersection(set(descriptors)))

# Code to preserve the order of the processed descriptors numerical_order \ = [d for d in descriptors \ if d in numerical_descriptors_present] non_numerical_order \ = [d for d in descriptors \ if d not in numerical_descriptors_present] order = numerical_order + non_numerical_order

# If numerical descriptors are present in the data set if numerical_descriptors_present:

# The numerical descriptor data numerical_descriptors \ = self.processed_descriptors \ [numerical_descriptors_present]

# Drop any non-numerical descriptors so they are not # standardized if isinstance(self.processed_descriptors, pd.Series):

numerical_descriptor_names = numerical_descriptors \ .index.tolist()

numerical_descriptors = numerical_descriptors \ .values.reshape(1,-1)

self.processed_descriptors \ = self.processed_descriptors \ .drop(numerical_descriptor_names, axis = 0)

else:

self.processed_descriptors \ = self.processed_descriptors \ .drop(numerical_descriptors.columns.tolist(), axis = 1)

# If no standardization object exists (training set), then # one must be initialized and fit if self.standardization_object is None:

# Initialize and fit a standardization object to the # numerical data self.standardization_object = StandardScaler() self.standardization_object.fit(numerical_descriptors)

# Standardize the numerical descriptors standardized_descriptors \ = self.standardization_object \ .transform(numerical_descriptors)

# if isinstance(self.processed_descriptors, pd.Series):

# standardized_descriptors = standardized_descriptors[0]

# standardized_descriptors \ = pd.Series(standardized_descriptors, index = numerical_descriptor_names)

else:

# standardized_descriptors \ = pd.DataFrame(standardized_descriptors, index = numerical_descriptors.index, columns = numerical_descriptors.columns)

# Add the non-numerical descriptors to the standardized, # numerical descriptors if isinstance(self.processed_descriptors, pd.Series):

# self.processed_descriptors \ = self.processed_descriptors \ .append(standardized_descriptors)

else:

# self.processed_descriptors \ = pd.merge(self.processed_descriptors, standardized_descriptors, left_index = True, right_index = True)

# Re-establish the order of the now standardized data self.processed_descriptors \ = self.processed_descriptors[order]

# Set the descriptor statistics attribute as a summary of the # descriptor statistics self.descriptor_statistics \ = self.processed_descriptors.describe()

# def combine_predictions(self, molesets_collection = None):

# for molesets in molesets_collection:

# if not isinstance(molesets, MoleculeSets):

# raise Exception('The objects in the passed collection must be' \ ' MoleculeSets objects.')

# self.global_predictions \ [molesets.test_set.global_predictions.index.tolist()] \ = molesets.test_set.global_predictions

# self.local_predictions \ [molesets.test_set.local_predictions.index.tolist()] \ = molesets.test_set.local_predictions

# self.global_prediction_probabilities \ [molesets.test_set.global_prediction_probabilities.index.tolist()] \ = molesets.test_set.global_prediction_probabilities

# self.local_prediction_probabilities \ [molesets.test_set.local_prediction_probabilities.index.tolist()] \ = molesets.test_set.local_prediction_probabilities

# Define a method to evaluate the global model on only those compounds # for which a local prediction was also made def global_on_local_evaluation(self): g = copy.deepcopy(self.global_predictions) l = copy.deepcopy(self.local_predictions) v = g[(((g == 1)|(g == 0))&((l == 1)|(l == 0)))]

self.global_on_local_predictions \ = pd.Series(np.nan, index = self.global_predictions.index)

self.global_on_local_predictions[v.index] = v

gol_pred_probs = self.global_predictions[v.index]

self.global_on_local_performance_summary \ = self.evaluate(response = self.raw_response, predictions = self.global_on_local_predictions, probabilities = gol_pred_probs)
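The mask in global_on_local_evaluation() restricts the comparison to queries for which both the global and the local models returned a binary decision, so the two approaches are scored on the same molecules. A brief sketch of that mask with hypothetical prediction vectors:

import numpy as np
import pandas as pd

g = pd.Series([1, 0, np.nan, 1], index=['m1', 'm2', 'm3', 'm4'])   # global predictions
l = pd.Series([1, np.nan, 0, 0], index=['m1', 'm2', 'm3', 'm4'])   # local predictions

both_decided = ((g == 1) | (g == 0)) & ((l == 1) | (l == 0))
v = g[both_decided]
print(v.index.tolist())   # ['m1', 'm4']: only queries predicted by both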

    def evaluate(self, response=None, predictions=None,
                 probabilities=None, phase=None):

        if response is None and predictions is None:

            if self.raw_response is not None:

                response = copy.deepcopy(self.raw_response)

            if phase == 'global' and self.global_predictions is not None:

                predictions = copy.deepcopy(self.global_predictions)

            elif phase == 'local' and self.local_predictions is not None:

                predictions = copy.deepcopy(self.local_predictions)

        if probabilities is None:

            if phase == 'global' \
                    and self.global_prediction_probabilities is not None:

                probabilities = copy.deepcopy(
                    self.global_prediction_probabilities)

            elif phase == 'local' \
                    and self.local_prediction_probabilities is not None:

                probabilities = copy.deepcopy(
                    self.local_prediction_probabilities)

            else:

                warn('No prediction probabilities for evaluation.')

        if response is None or predictions is None:

            warn('The test set must have both responses and predictions '
                 'for the model generating the predictions to be evaluated.')
            return

        # If the numbers of responses and predictions do not agree
        if response.size != predictions.size:

            # Inform the user that the numbers of responses and predictions
            # do not agree
            warn('The response and prediction arrays contain an unequal '
                 'number of values.')
            return

        n_observations = response.size
        n_no_predictions = np.sum(predictions.isnull())
        n_predictions = predictions.shape[0] - n_no_predictions

        # Calculate the fraction of test observations predicted
        fraction_predicted = n_predictions / n_observations

        # Count the numbers of true positives, true negatives, false
        # positives, and false negatives
        true_positives = np.sum((response == 1) & (predictions == 1))
        true_negatives = np.sum((response == 0) & (predictions == 0))
        false_positives = np.sum((response == 0) & (predictions == 1))
        false_negatives = np.sum((response == 1) & (predictions == 0))

        # Calculate conditions positive and negative
        condition_positive = true_positives + false_negatives
        condition_negative = true_negatives + false_positives

        # Calculate the fraction of positives within the response of those
        # observations with model predictions
        prevalence = (((response == 1) & (predictions == 1))
                      | ((response == 1) & (predictions == 0))).sum() \
            / n_predictions

        # Calculate predictives positive and negative
        predictive_positive = true_positives + false_positives
        predictive_negative = true_negatives + false_negatives

        # AUC calculation
        if response is not None and probabilities is not None:

            probabilities = probabilities.loc[~probabilities.isnull()]
            response = response.loc[probabilities.index.tolist()]

            # The AUC is undefined when only one class is present
            if all(response == 1) or all(response == 0):

                auc = np.nan

            else:

                auc = roc_auc_score(response, probabilities)

        else:

            auc = np.nan

        # If predictions were made
        if n_predictions > 0:

            # Calculate the accuracy
            accuracy = (true_positives + true_negatives) \
                / (true_positives + true_negatives
                   + false_positives + false_negatives)

        # No predictions were made
        else:

            # The accuracy is null
            accuracy = np.nan

        # If positive observations are present
        if condition_positive != 0:

            # Calculate the sensitivity (probability the model predicts
            # positive given a positive response)
            sensitivity = true_positives / condition_positive

        # No positive observations are present
        else:

            # The sensitivity is null
            sensitivity = np.nan

        # If negative observations are present
        if condition_negative != 0:

            # Calculate the specificity (probability the model predicts
            # negative given a negative response)
            specificity = true_negatives / condition_negative

        # No negative observations are present
        else:

            # The specificity is null
            specificity = np.nan

        # If positive predictions were made
        if predictive_positive != 0:

            # Calculate the positive predictive value (probability the
            # response is positive given the model predicts positive)
            positive_predictive_value = true_positives / predictive_positive

        # No positive predictions were made
        else:

            # The positive predictive value is null
            positive_predictive_value = np.nan

        # If negative predictions were made
        if predictive_negative != 0:

            # Calculate the negative predictive value (probability the
            # response is negative given the model predicts negative)
            negative_predictive_value = true_negatives / predictive_negative

        # No negative predictions were made
        else:

            # The negative predictive value is null
            negative_predictive_value = np.nan

        # Collect the description of the model performance
        model_evaluations = [n_observations, n_predictions,
                             fraction_predicted, prevalence,
                             true_positives, false_negatives,
                             false_positives, true_negatives,
                             sensitivity, specificity, accuracy,
                             positive_predictive_value,
                             negative_predictive_value, auc]

        # Convert the list of model evaluations to a pandas Series
        model_evaluation = pd.Series(model_evaluations,
                                     index=['nobs', 'npred', 'fpred',
                                            'prev', 'tps', 'fns',
                                            'fps', 'tns', 'sens',
                                            'spec', 'accu', 'ppv',
                                            'npv', 'auc'])

        # Return the model evaluation to the calling script
        if not phase:
            return model_evaluation

        if phase == 'global':
            self.global_performance_summary = model_evaluation

        if phase == 'local':
            self.local_performance_summary = model_evaluation
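For reference, the confusion-matrix statistics assembled by evaluate() can be reproduced on a small example; the response and prediction vectors below are hypothetical, chosen only to make the arithmetic easy to follow:

import numpy as np
import pandas as pd

response = pd.Series([1, 1, 0, 0, 1])
predictions = pd.Series([1, 0, 0, 1, 1])

tp = np.sum((response == 1) & (predictions == 1))   # 2
tn = np.sum((response == 0) & (predictions == 0))   # 1
fp = np.sum((response == 0) & (predictions == 1))   # 1
fn = np.sum((response == 1) & (predictions == 0))   # 1

sensitivity = tp / (tp + fn)            # 2/3: P(predict positive | positive)
specificity = tn / (tn + fp)            # 1/2: P(predict negative | negative)
accuracy = (tp + tn) / response.size    # 3/5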

class MoleculeSets:

    n_sets = 0

    def __init__(self, name=None, training_set=None, test_set=None,
                 domain_object=None, locality_object=None):

        self.name = name

        if isinstance(training_set, MoleculeSet):
            self.training_set = training_set
        else:
            raise Exception('The training set object is not an instance '
                            'of MoleculeSet.')

        if isinstance(test_set, MoleculeSet):
            self.test_set = test_set
        else:
            raise Exception('The test set object is not an instance '
                            'of MoleculeSet.')

        self.domain_object = domain_object
        self.locality_object = locality_object

        self.outside_domain_test_identifiers = None
        self.outside_locality_test_identifiers = None

        self.local_neighborhood_statistics = None
        self.local_information = {}
        self.local_molecule_sets = {}

        MoleculeSets.n_sets += 1

    def tsne_embedding(self, parameters=None):

        self.training_to_test_transfer(parameters)

        # Combine the training and test sets into a single set
        training_ids = self.training_set.processed_descriptors.index.tolist()
        test_ids = self.test_set.processed_descriptors.index.tolist()
        data_set = pd.concat([self.training_set.processed_descriptors,
                              self.test_set.processed_descriptors])
        idx = data_set.index.tolist()

        # It is suggested to reduce the dimension of data with many
        # dimensions (e.g., > 50) before applying t-SNE
        if data_set.shape[1] > 50:

            # PCA is used because we don't know the test set response
            pca = PCA(n_components=50, svd_solver='full')
            data_set = pca.fit_transform(data_set)
            data_set = pd.DataFrame(
                data_set, index=idx,
                columns=['PC' + str(i) for i in range(1, 51)])

        # Apply the t-SNE embedding to the combined training and test data
        tsne = TSNE(
            n_components=int(parameters.dimensional_reduction.n_reduced_dimensions),
            perplexity=int(parameters.dimensional_reduction.perplexity))
        self.training_set.dimensional_reduction_object = tsne
        self.test_set.dimensional_reduction_object = tsne
        data_set = tsne.fit_transform(data_set)
        data_set = pd.DataFrame(
            data_set, index=idx,
            columns=['EC' + str(i) for i in
                     range(1, int(parameters.dimensional_reduction.n_reduced_dimensions) + 1)])

        # Separate the training set from the test set
        self.training_set.processed_descriptors = data_set.loc[training_ids]
        self.test_set.processed_descriptors = data_set.loc[test_ids]
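Because t-SNE provides no out-of-sample transform, tsne_embedding() embeds the training and test descriptors jointly, optionally compressing to 50 principal components first. A self-contained sketch of that PCA-then-t-SNE pipeline on random stand-in data:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(200, 120)))       # 120 descriptors > 50

X_50 = PCA(n_components=50, svd_solver='full').fit_transform(X)
embedding = TSNE(n_components=2, perplexity=30).fit_transform(X_50)
print(embedding.shape)                               # (200, 2)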

    def training_to_test_transfer(self, parameters=None):

        if isinstance(parameters, Parameters):

            ### Removal or imputation of observations with missing values ###

            # Transfer the imputation object from the training to the test set
            self.test_set.imputation_object = copy.deepcopy(
                self.training_set.imputation_object)

            # Conduct missing data imputation on the test set
            self.test_set.missing_data_imputation(
                method=parameters.preprocess.missing_data_method,
                indicator_type=parameters.preprocess.missing_data_indicator)

            ### Remove descriptors with variance less than or equal to the ###
            ### threshold ###

            # Transfer the variance threshold object from the training to
            # the test set
            self.test_set.zero_variance_descriptors = copy.deepcopy(
                self.training_set.zero_variance_descriptors)
            self.test_set.variance_threshold_object = copy.deepcopy(
                self.training_set.variance_threshold_object)

            # Variance threshold removal will error on the test set without
            # an object fitted on a training set
            if self.test_set.zero_variance_descriptors is not None:

                if parameters.phase == 'local':

                    remove_test_descriptors = [
                        d for d in self.test_set.processed_descriptors.index.tolist()
                        if d in self.test_set.zero_variance_descriptors.tolist()]

                    # Conduct variance threshold removal on the test set
                    self.test_set.processed_descriptors = \
                        self.test_set.processed_descriptors.drop(
                            remove_test_descriptors, axis=0)

                else:

                    remove_test_descriptors = [
                        d for d in self.test_set.processed_descriptors.columns.tolist()
                        if d in self.test_set.zero_variance_descriptors.tolist()]

                    # Conduct variance threshold removal on the test set
                    self.test_set.processed_descriptors = \
                        self.test_set.processed_descriptors.drop(
                            remove_test_descriptors, axis=1)

            ### Remove binary descriptors deemed too infrequent within the ###
            ### training set ###

            # Transfer the infrequent binary descriptors from the training
            # set to the test set
            self.test_set.infrequent_binary_descriptors = copy.deepcopy(
                self.training_set.infrequent_binary_descriptors)

            # If the training and test sets have infrequent binary descriptors
            if self.test_set.infrequent_binary_descriptors is not None:

                # Remove the infrequent binary descriptors from the test set
                self.test_set.infrequent_descriptor_removal()

            ### Perform transformations on the descriptors indicated by the ###
            ### user ###

            # Transfer the transformation object from the training set to
            # the test set
            self.test_set.transformation_object = copy.deepcopy(
                self.training_set.transformation_object)

            # If there are descriptors listed for transformation and a
            # transformation object
            if parameters.preprocess.transform_descriptors \
                    and self.test_set.transformation_object:

                # Apply the transformations to the test set descriptors
                self.test_set.descriptor_transformation(
                    transform_descriptors=parameters.preprocess.transform_descriptors)

            ### Standardize the test set descriptors ###

            # Transfer the standardization object from the training set to
            # the test set
            self.test_set.standardization_object = copy.deepcopy(
                self.training_set.standardization_object)

            # If there is a standardization object fit on the training set
            if self.test_set.standardization_object:

                # Standardize the test set
                self.test_set.standardization(parameters.preprocess.standardize)

            ### Select the significant descriptors of the test set ###

            # If the training set has no significant descriptors
            if self.training_set.significant_descriptors is None:

                # Warn the user that having no significant descriptors stops
                # the current execution of the workflow
                warn('The training set has no significant descriptors, '
                     'which will stop the current execution of the workflow.')
                return

            # Transfer the descriptors deemed significant by variable
            # selection from the training to the test set
            self.test_set.significant_descriptors = copy.deepcopy(
                self.training_set.significant_descriptors)

            # Perform variable selection on the test set
            self.test_set.variable_selection()

            ### Reduce the dimension of the test set ###

            if parameters.dimensional_reduction.method != 'tsne':

                # Transfer the dimensional reduction object from the
                # training set to the test set
                self.test_set.dimensional_reduction_object = copy.deepcopy(
                    self.training_set.dimensional_reduction_object)

                # If the test set has a fitted dimensional reduction object
                if self.test_set.dimensional_reduction_object:

                    # Reduce the dimension of the test set
                    self.test_set.dimensional_reduction()

    def domain_assessment(self, domain_k=5, domain_quantile=0.95):

        if self.domain_object is None:

            # Initialize a nearest neighbors object
            # (An additional neighbor is required as the first neighbor is
            # the observation itself)
            self.domain_object = NearestNeighbors(n_neighbors=int(domain_k + 1))

            # Fit the training observations to the nearest neighbors object
            self.domain_object.fit(self.training_set.processed_descriptors)

        # Determine the distances and indices of the training set
        # observations to the observations fit to the domain object
        domain_distances, domain_indices = self.domain_object.kneighbors(
            self.training_set.processed_descriptors)

        # Isolate the distances to the kth nearest neighbors
        kth_distances = domain_distances[:, -1]

        # Take as the domain radius the domain quantile of the distances to
        # the kth nearest neighbors
        domain_radius = np.percentile(kth_distances, domain_quantile * 100)

        # Find the raw indices of the training observations within the
        # domain radius of the test observations
        test_neighbor_distances, test_neighbor_indices = \
            self.domain_object.radius_neighbors(
                self.test_set.processed_descriptors, domain_radius,
                return_distance=True)

        # Find the identifiers of the test observations out of domain
        self.outside_domain_test_identifiers = \
            self.test_set.processed_descriptors.index[
                np.array([neighs.size for neighs in test_neighbor_indices])
                < domain_k]

        # Remove the out-of-domain test observations from the test
        # descriptors and response
        self.test_set.processed_descriptors = \
            self.test_set.processed_descriptors.drop(
                self.outside_domain_test_identifiers)
        self.test_set.processed_response = \
            self.test_set.processed_response.drop(
                self.outside_domain_test_identifiers)
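domain_assessment() derives the applicability domain from the distribution of kth-nearest-neighbor distances within the training set, then excludes test compounds with fewer than k training neighbors inside that radius. A standalone sketch of the same idea on random stand-in data:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
train = rng.normal(size=(100, 2))
test = rng.normal(size=(10, 2))

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(train)   # +1: first neighbor is the point itself
distances, _ = nn.kneighbors(train)
radius = np.percentile(distances[:, -1], 95)          # 95th percentile of kth-neighbor distances

_, neighborhoods = nn.radius_neighbors(test, radius)
in_domain = np.array([n.size for n in neighborhoods]) >= k
print(in_domain)                                      # boolean domain membership per test point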

    # Define a method to determine which data points in one set are in
    # proximity to the data points in another set
    # (This method will create as many instances of the MoleculeSets class
    # as there are observations in the test set)
    def locality_assessment(self, method='radius', radius=0.5,
                            locality_k=5, locality_quantile=0.95):

        # Acceptable methods: 'radius', 'quantile', 'neighbor'
        if method == 'quantile' or method == 'radius':

            if method == 'quantile':

                if not self.locality_object:

                    # Initialize a nearest neighbors object using locality k
                    # (An additional neighbor is required as the first
                    # neighbor is the observation itself)
                    self.locality_object = NearestNeighbors(
                        n_neighbors=locality_k + 1)

                    # Fit the training observations
                    self.locality_object.fit(
                        self.training_set.processed_descriptors)

                # Find the distances and raw indices of the training set
                # to itself
                locality_distances, locality_indices = \
                    self.locality_object.kneighbors(
                        self.training_set.processed_descriptors)

                # Isolate the distances to the kth nearest neighbors
                locality_kth_distances = locality_distances[:, -1]

                # Take as the locality radius the locality quantile of the
                # distances to the kth nearest neighbors
                radius = np.percentile(locality_kth_distances,
                                       locality_quantile * 100)

            elif method == 'radius':

                if not self.locality_object:

                    self.locality_object = NearestNeighbors(radius=radius)
                    self.locality_object.fit(
                        self.training_set.processed_descriptors)

            # Find the raw indices of the training observations (all
            # observations) within the locality radius of the test
            # observations
            local_neighbor_distances, local_neighbor_indices = \
                self.locality_object.radius_neighbors(
                    self.test_set.processed_descriptors, radius,
                    return_distance=True)

        elif method == 'neighbor':

            if not self.locality_object:

                # Define a k-NN object with the number of neighbors k
                # specified by the user and fit the training descriptors
                # to it
                locality_k = int(locality_k)
                self.locality_object = NearestNeighbors(n_neighbors=locality_k)
                self.locality_object.fit(
                    self.training_set.processed_descriptors)

            # Determine the k-nearest training neighbors of the test
            # queries and the distances to them
            local_neighbor_distances, local_neighbor_indices = \
                self.locality_object.kneighbors(
                    self.test_set.processed_descriptors)

        # Map the raw neighbor indices to identifiers and responses
        local_neighbor_identifiers = [
            [self.training_set.processed_descriptors.index[raw_ind]
             for raw_ind in loc] for loc in local_neighbor_indices]
        local_neighbor_responses = [
            [self.training_set.processed_response[ident] for ident in qry]
            for qry in local_neighbor_identifiers]

        # Iterate over the test observations in consideration
        for tidx in range(self.test_set.processed_descriptors.index.size):

            test_observation_identifier = \
                self.test_set.processed_descriptors.index[tidx]

            loc_info = np.array([local_neighbor_distances[tidx],
                                 local_neighbor_responses[tidx]])
            loc_info = loc_info.transpose()

            local_information_df = pd.DataFrame(
                loc_info,
                index=local_neighbor_identifiers[tidx],
                columns=['Distance', 'Response'])

            self.local_information[test_observation_identifier] = \
                local_information_df

            # If the current test query has at least locality k neighbors
            if len(local_neighbor_indices[tidx]) >= locality_k:

                # Construct the local learning sets using the raw training
                # and test observation data
                query_training_set = MoleculeSet(
                    raw_descriptors=self.training_set.raw_descriptors
                    .loc[local_neighbor_identifiers[tidx], :],
                    raw_response=self.training_set.raw_response
                    .loc[local_neighbor_identifiers[tidx]],
                    var_types=self.training_set.var_types,
                    pickled_rdk_molecules={
                        i: v for i, v in
                        self.training_set.pickled_rdk_molecules.items()
                        if i in local_neighbor_identifiers[tidx]})

                query_data = MoleculeSet(
                    raw_descriptors=self.test_set.raw_descriptors
                    .loc[test_observation_identifier, :],
                    raw_response=self.test_set.raw_response
                    .loc[test_observation_identifier],
                    var_types=self.test_set.var_types,
                    pickled_rdk_molecules={
                        test_observation_identifier:
                        self.test_set.pickled_rdk_molecules[
                            test_observation_identifier]})

                self.local_molecule_sets[test_observation_identifier] = \
                    MoleculeSets(name=test_observation_identifier,
                                 training_set=query_training_set,
                                 test_set=query_data)

        # Summarize the neighborhood distance statistics for the queries
        # with neighbors
        queries_with_neighbors, neighbor_statistics = [], []

        for query, neighborhood in self.local_information.items():

            if not neighborhood.empty:

                queries_with_neighbors.append(query)
                neighbor_statistics.append(neighborhood['Distance'].describe())

        if neighbor_statistics:

            neighbor_statistics_df = pd.concat(neighbor_statistics, axis=1)
            neighbor_statistics_df.columns = queries_with_neighbors
            self.local_neighborhood_statistics = neighbor_statistics_df
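In the 'radius' and 'quantile' modes, radius_neighbors returns a different number of training neighbors for each query, which is what makes the local training sets variable in size. A brief sketch with stand-in data:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(1)
train = rng.normal(size=(50, 2))
queries = rng.normal(size=(3, 2))

nn = NearestNeighbors(radius=0.5).fit(train)
dists, inds = nn.radius_neighbors(queries, return_distance=True)

# One local training set per query, each with its own size
local_training_sets = [train[i] for i in inds]
print([len(s) for s in local_training_sets])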

    def train(self, technique='pls', n_neighbors=5, weights='uniform',
              n_components=2):

        # Techniques: 'pls', 'lda', 'knn', 'logistic'

        training_descriptors = self.training_set.processed_descriptors
        training_response = self.training_set.processed_response

        # If the requested modeling technique is knn
        if technique == 'knn':

            # Try to train a knn model
            try:

                # Initialize a knn classifier object with the number of
                # neighbors and weighting passed to the function
                self.training_set.modeling_technique_object = \
                    KNeighborsClassifier(n_neighbors=int(n_neighbors),
                                         weights=weights)

                # Train the model on the training descriptors and responses
                self.training_set.modeling_technique_object.fit(
                    training_descriptors, training_response)

            # If something went wrong during model training
            except:

                warn('An error occurred while attempting to train the model.')

        # If the requested modeling technique is logistic regression
        if technique == 'logistic':

            # Try to train a logistic model
            try:

                # Initialize a logistic regression object
                self.training_set.modeling_technique_object = \
                    LogisticRegression(solver='lbfgs', max_iter=1000, C=1e10)

                # Train the model on the training descriptors and responses
                self.training_set.modeling_technique_object.fit(
                    training_descriptors, training_response)

            # If something went wrong during model training
            except:

                warn('An error occurred while attempting to train the model.')

        # If the modeling technique is linear discriminant analysis
        if technique == 'lda':

            # Try to train a linear discriminant model
            try:

                # Initialize a linear discriminant object
                self.training_set.modeling_technique_object = \
                    LinearDiscriminantAnalysis()

                # Train the model on the training descriptors and responses
                self.training_set.modeling_technique_object.fit(
                    training_descriptors, training_response)

            # If something went wrong during model training
            except:

                warn('An error occurred while attempting to train the model.')

        # If the modeling technique is partial least squares discriminant
        # analysis
        if technique == 'pls':

            # Initialize a pls object used for classification
            self.training_set.modeling_technique_object = \
                PLSRegression(n_components=int(n_components), scale=False)

            # Build a pls model using the training descriptors and responses
            self.training_set.modeling_technique_object.fit(
                training_descriptors, training_response)

    def predict(self):

        if not self.test_set.modeling_technique_object:

            warn('The test set does not have a fitted model necessary '
                 'to make predictions.')
            return

        if isinstance(self.test_set.processed_response, pd.Series):

            if self.test_set.global_predictions is None:
                self.test_set.global_predictions = pd.Series(
                    np.nan, index=self.test_set.processed_response.index)

            if self.test_set.local_predictions is None:
                self.test_set.local_predictions = pd.Series(
                    np.nan, index=self.test_set.processed_response.index)

        # If a prediction is being made on a single observation
        if len(self.test_set.processed_descriptors.shape) < 2:

            # Reshape the 1d series data to avoid an sklearn warning
            test_descriptors = self.test_set.processed_descriptors \
                .values.reshape(1, -1).copy()

        else:

            test_descriptors = self.test_set.processed_descriptors.copy()

        # A PLS model predicts a continuous score for a classification
        # problem which must be converted into a binary decision
        if hasattr(self.test_set.modeling_technique_object, 'y_loadings_'):

            # Make predictions on the test descriptors
            predictions = self.test_set.modeling_technique_object.predict(
                test_descriptors)

            # Keep the continuous scores in the same two-column layout as a
            # predict_proba output so they can serve as probabilities below
            prediction_probabilities = np.hstack(
                (1 - predictions.reshape(-1, 1), predictions.reshape(-1, 1)))

            # Convert the predictions from continuous to binary
            predictions = np.array([1 if pred >= 0.5 else 0
                                    for pred in predictions])

        else:

            # Make predictions on the test descriptors
            predictions = self.test_set.modeling_technique_object.predict(
                test_descriptors)

            prediction_probabilities = self.test_set \
                .modeling_technique_object.predict_proba(test_descriptors)

        # If a prediction is being made on a single observation
        if len(self.test_set.processed_descriptors.shape) < 2:

            self.test_set.local_predictions = predictions

            self.test_set.local_prediction_probabilities = \
                prediction_probabilities[:, 1]

        else:

            self.test_set.global_predictions[
                self.test_set.processed_descriptors.index.tolist()] \
                = predictions

            self.test_set.global_prediction_probabilities[
                self.test_set.processed_descriptors.index.tolist()] \
                = prediction_probabilities[:, 1]
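The PLS branch above treats classification as regression on a 0/1 response and thresholds the continuous output at 0.5. A self-contained sketch of that PLS-DA decision rule on synthetic data:

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(60, 8))
y = (X[:, 0] + 0.1 * rng.normal(size=60) > 0).astype(int)   # synthetic 0/1 response

pls = PLSRegression(n_components=2, scale=False)
pls.fit(X, y)

scores = pls.predict(X).ravel()            # continuous scores near 0 or 1
classes = np.where(scores >= 0.5, 1, 0)    # binary decision at the 0.5 threshold
print((classes == y).mean())               # training accuracy of the sketch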

    def evaluate(self, response=None, predictions=None,
                 probabilities=None, phase=None):

        err_flg = False

        if response is None and predictions is None:

            if self.test_set.raw_response is not None:

                response = copy.deepcopy(self.test_set.raw_response)

            if phase == 'global' \
                    and self.test_set.global_predictions is not None:

                predictions = copy.deepcopy(self.test_set.global_predictions)

            elif phase == 'local' \
                    and self.test_set.local_predictions is not None:

                predictions = copy.deepcopy(self.test_set.local_predictions)

        if probabilities is None:

            if phase == 'global' \
                    and self.test_set.global_prediction_probabilities is not None:

                probabilities = copy.deepcopy(
                    self.test_set.global_prediction_probabilities)

            elif phase == 'local' \
                    and self.test_set.local_prediction_probabilities is not None:

                probabilities = copy.deepcopy(
                    self.test_set.local_prediction_probabilities)

            else:

                warn('No prediction probabilities for evaluation.')

        if response is None or predictions is None:

            warn('The test set must have both responses and predictions '
                 'for the model generating the predictions to be evaluated.')
            err_flg = True

        else:

            # If the numbers of responses and predictions do not agree
            if response.size != predictions.size:

                # Inform the user that the numbers of responses and
                # predictions do not agree
                warn('The response and prediction arrays contain an unequal '
                     'number of values.')
                err_flg = True

            # If no predictions were made
            if np.all(np.isnan(predictions)):

                warn('The model did not make any predictions.')
                err_flg = True

        # Stop the evaluation if any of the checks above failed
        if err_flg:
            return

        n_observations = response.size
        n_no_predictions = np.sum(predictions.isnull())
        n_predictions = predictions.shape[0] - n_no_predictions

        # Calculate the fraction of test observations predicted
        fraction_predicted = n_predictions / n_observations

        # Count the numbers of true positives, true negatives, false
        # positives, and false negatives
        true_positives = np.sum((response == 1) & (predictions == 1))
        true_negatives = np.sum((response == 0) & (predictions == 0))
        false_positives = np.sum((response == 0) & (predictions == 1))
        false_negatives = np.sum((response == 1) & (predictions == 0))

        # Calculate conditions positive and negative
        condition_positive = true_positives + false_negatives
        condition_negative = true_negatives + false_positives

        # Calculate the fraction of positives within the response of those
        # observations with model predictions
        prevalence = (((response == 1) & (predictions == 1))
                      | ((response == 1) & (predictions == 0))).sum() \
            / n_predictions

        # Calculate predictives positive and negative
        predictive_positive = true_positives + false_positives
        predictive_negative = true_negatives + false_negatives

        # AUC calculation
        if response is not None and probabilities is not None:

            probabilities = probabilities.loc[~probabilities.isnull()]
            response = response.loc[probabilities.index.tolist()]

            # The AUC is undefined when only one class is present
            if all(response == 1) or all(response == 0):

                auc = np.nan

            else:

                auc = roc_auc_score(response, probabilities)

        else:

            auc = np.nan

        # If predictions were made
        if n_predictions > 0:

            # Calculate the accuracy
            accuracy = (true_positives + true_negatives) \
                / (true_positives + true_negatives
                   + false_positives + false_negatives)

        # No predictions were made
        else:

            # The accuracy is null
            accuracy = np.nan

        # If positive observations are present
        if condition_positive != 0:

            # Calculate the sensitivity (probability the model predicts
            # positive given a positive response)
            sensitivity = true_positives / condition_positive

        # No positive observations are present
        else:

            # The sensitivity is null
            sensitivity = np.nan

        # If negative observations are present
        if condition_negative != 0:

            # Calculate the specificity (probability the model predicts
            # negative given a negative response)
            specificity = true_negatives / condition_negative

        # No negative observations are present
        else:

            # The specificity is null
            specificity = np.nan

        # If positive predictions were made
        if predictive_positive != 0:

            # Calculate the positive predictive value (probability the
            # response is positive given the model predicts positive)
            positive_predictive_value = true_positives / predictive_positive

        # No positive predictions were made
        else:

            # The positive predictive value is null
            positive_predictive_value = np.nan

        # If negative predictions were made
        if predictive_negative != 0:

            # Calculate the negative predictive value (probability the
            # response is negative given the model predicts negative)
            negative_predictive_value = true_negatives / predictive_negative

        # No negative predictions were made
        else:

            # The negative predictive value is null
            negative_predictive_value = np.nan

        # Collect the description of the model performance
        model_evaluations = [n_observations, n_predictions,
                             fraction_predicted, prevalence,
                             true_positives, false_negatives,
                             false_positives, true_negatives,
                             sensitivity, specificity, accuracy,
                             positive_predictive_value,
                             negative_predictive_value, auc]

        # Convert the list of model evaluations to a pandas Series
        model_evaluation = pd.Series(model_evaluations,
                                     index=['nobs', 'npred', 'fpred',
                                            'prev', 'tps', 'fns',
                                            'fps', 'tns', 'sens',
                                            'spec', 'accu', 'ppv',
                                            'npv', 'auc'])

        # Return the model evaluation to the calling script
        if not phase:
            return model_evaluation

        elif phase == 'global':
            self.test_set.global_performance_summary = model_evaluation

        else:
            self.test_set.local_performance_summary = model_evaluation

class Preprocess:

    def __init__(self, option=True, missing_data_method='remove',
                 missing_data_indicator=np.nan, variance_threshold_value=0,
                 infrequent_threshold_value=3, transform_descriptors=None,
                 transform_function=None, transform_inverse=None,
                 standardize=True):

        self.option = option
        self.missing_data_method = missing_data_method
        self.missing_data_indicator = missing_data_indicator
        self.variance_threshold_value = variance_threshold_value
        self.infrequent_threshold_value = infrequent_threshold_value
        self.transform_descriptors = transform_descriptors
        self.transform_function = transform_function
        self.transform_inverse = transform_inverse
        self.standardize = standardize

class VariableSelection:

    def __init__(self, option=True, significance_level=0.05,
                 method='univariate', univariate_binary_method='fisher'):

        self.option = option
        self.significance_level = significance_level
        self.method = method
        self.univariate_binary_method = univariate_binary_method

class DimensionalReduction:

    def __init__(self, option=True, method='pls', n_reduced_dimensions=2,
                 pca_method='component', ratio_explained_variance=0.99,
                 perplexity=30, k_neighbors=15, min_dist=0.1,
                 metric='euclidean'):

        self.option = option
        self.method = method
        self.n_reduced_dimensions = n_reduced_dimensions
        self.pca_method = pca_method
        self.ratio_explained_variance = ratio_explained_variance
        self.perplexity = perplexity
        self.k_neighbors = k_neighbors
        self.min_dist = min_dist
        self.metric = metric

class DomainAssessment:

    def __init__(self, option=True, domain_k=5, domain_quantile=0.95):

        self.option = option
        self.domain_k = domain_k
        self.domain_quantile = domain_quantile

class LocalityAssessment:

    def __init__(self, option=True, method='radius', radius=0.5,
                 locality_k=5, locality_quantile=0.95):

        self.option = option
        self.method = method
        self.radius = radius
        self.locality_k = locality_k
        self.locality_quantile = locality_quantile

class Train:

    def __init__(self, option=True, technique='pls', n_neighbors=5,
                 weights='uniform', n_components=2):

        self.option = option
        self.technique = technique
        self.n_neighbors = n_neighbors
        self.weights = weights
        self.n_components = n_components

class Parameters:

    def __init__(self, name=None, phase='global', preprocess=None,
                 variable_selection=None, dimensional_reduction=None,
                 domain_assessment=None, locality_assessment=None,
                 train=None):

        self.name = name
        self.phase = phase
        self.preprocess = preprocess
        self.variable_selection = variable_selection
        self.dimensional_reduction = dimensional_reduction
        self.domain_assessment = domain_assessment
        self.locality_assessment = locality_assessment
        self.train = train

    @classmethod
    def from_file(cls, filepath=None, grid=False):

        ### Initialize storage objects and read the parameter input file ###

        parameter_sets, parameter_objects = [], []
        param_file = pd.read_csv(filepath, index_col=0)

        ### Read the parameters file and generate the grid ###
        if grid == True:

            paramsets = []

            df = pd.read_csv(filepath, index_col=0)
            param_order = df.columns.tolist()
            par = OrderedDict()
            par['name'], par['phase'] = None, None
            for param in param_order:
                dot_loc = param.find('.')
                param_name = param[:dot_loc]
                param_field = param[dot_loc + 1:]
                if param_name not in par.keys():
                    par[param_name] = OrderedDict()
                if param_field not in par[param_name].keys():
                    par[param_name][param_field] = None

            glo = df.iloc[0]
            glo.index = [g + '.global' for g in glo.index.tolist()]
            glo_dict = glo.to_dict()
            for k, v in glo_dict.items():
                glo_dict[k] = Parameters.cleaner(v)

            loc = df.iloc[1]
            loc.index = [l + '.local' for l in loc.index.tolist()]
            loc_dict = loc.to_dict()
            for k, v in loc_dict.items():
                loc_dict[k] = Parameters.cleaner(v)

            grid_dict = copy.deepcopy(glo_dict)
            for k, v in loc_dict.items():
                grid_dict[k] = v

            grid = ParameterGrid(grid_dict)

            for i in range(len(grid)):

                glob, loca = OrderedDict(), OrderedDict()

                glob['name'], glob['phase'] = 'Grid Point ' + str(i + 1), 'global'
                loca['name'], loca['phase'] = 'Grid Point ' + str(i + 1), 'local'

                for para in param_order:

                    if para + '.global' in grid[i].keys():
                        glob[para] = grid[i][para + '.global']

                    if para + '.local' in grid[i].keys():
                        loca[para] = grid[i][para + '.local']

                paramsets.append(glob)
                paramsets.append(loca)

            # Convert the grid point objects for Parameters object
            # construction
            for ps in paramsets:

                converted_params = copy.deepcopy(par)

                for ke, va in ps.items():
                    if '.' in ke:
                        dloc = ke.find('.')
                        pname = ke[:dloc]
                        pvalue = ke[dloc + 1:]
                    else:
                        pname = ke
                        pvalue = None

                    if not pvalue:
                        converted_params[pname] = va
                    else:
                        converted_params[pname][pvalue] = [va]

                parameter_sets.append(converted_params)

        ### Read a parameters file already containing a grid ###
        else:

            param_dict = OrderedDict()
            param_dict['name'] = None

            for param_classes in param_file.columns.tolist():

                if '.' not in param_classes:

                    param_class = copy.deepcopy(param_classes)
                    class_attribute = None

                else:

                    loc = param_classes.find('.')
                    param_class = param_classes[:loc]
                    class_attribute = param_classes[loc + 1:]

                if param_class not in param_dict.keys():
                    param_dict[param_class] = OrderedDict()

                if class_attribute and class_attribute \
                        not in param_dict[param_class]:
                    param_dict[param_class][class_attribute] = None

            for idx, vals in param_file.iterrows():

                gp_params = copy.deepcopy(param_dict)
                gp_params['name'] = idx

                for param_classes in param_file.columns.tolist():

                    if '.' not in param_classes:

                        param_class = copy.deepcopy(param_classes)
                        class_attribute = None

                    else:

                        loc = param_classes.find('.')
                        param_class = param_classes[:loc]
                        class_attribute = param_classes[loc + 1:]

                    if class_attribute is not None:

                        gp_params[param_class][class_attribute] = \
                            Parameters.cleaner(vals[param_classes])

                    else:

                        gp_params[param_class] = vals[param_classes]

                parameter_sets.append(gp_params)

        ### Convert the parameter grid points into Parameters objects ###

        for param_point in parameter_sets:

            if param_point['Train']['technique'] != ['pls'] \
                    or param_point['Train']['n_components'] \
                    <= param_point['DimensionalReduction']['n_reduced_dimensions']:

                depend_objs = []

                for class_name, class_dict in param_point.items():

                    if class_name == 'name' or class_name == 'phase':

                        depend_objs.append(class_dict)

                    elif class_name in globals():

                        class_obj = globals()[class_name](
                            *[v[0] for k, v in class_dict.items()])
                        depend_objs.append(class_obj)

                param_obj = Parameters(*depend_objs)
                parameter_objects.append(param_obj)

        return parameter_objects

    def to_series(self, writepath=None):

        s_dict = {}

        for attr, value in self.__dict__.items():

            if attr in ['name', 'phase']:

                s_dict[attr] = value

            try:

                if value.__class__.__name__ != 'str':

                    for subattr, subvalue in value.__dict__.items():

                        s_dict[value.__class__.__name__ + '.' + subattr] = subvalue

            except:

                pass

        s = pd.Series(s_dict)

        return s
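The grid expansion in from_file() above is built on scikit-learn's ParameterGrid, which enumerates the Cartesian product of every listed value. A minimal illustration with hypothetical keys:

from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({'Train.n_components.global': [2, 3],
                      'LocalityAssessment.radius.local': [0.5, 1.0]})

for point in grid:
    print(point)   # four dictionaries, one per combination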

    @staticmethod
    def cleaner(data=None):

        def clean(d=None):

            if d == 'None':
                d = None

            if d == 'TRUE' or d == 'True' or d == True:
                d = True
            elif d == 'FALSE' or d == 'False' or d == False:
                d = False
            elif d == 'np.nan':
                d = np.nan
            else:
                try:
                    d = float(d)
                except:
                    pass

            return d

        clean_data = []

        try:

            data = data.split(',')

            for datum in data:

                datum = clean(datum)
                clean_data.append(datum)

        except:

            data = clean(data)
            clean_data.append(data)

        return clean_data
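Two brief, hypothetical usage sketches for the classes above: cleaner() normalizes the string values read from a parameter file, and a Parameters object can also be assembled by hand from the configuration classes instead of from a file. Both assume only the class definitions in this appendix:

# cleaner() conversions (each call returns a list)
Parameters.cleaner('True')       # [True]
Parameters.cleaner('0.5')        # [0.5]
Parameters.cleaner('pls,lda')    # ['pls', 'lda']
Parameters.cleaner('None')       # [None]

# Hand-built parameter set equivalent to one grid point
params = Parameters(name='Example', phase='global',
                    preprocess=Preprocess(),
                    variable_selection=VariableSelection(),
                    dimensional_reduction=DimensionalReduction(method='pls'),
                    domain_assessment=DomainAssessment(),
                    locality_assessment=LocalityAssessment(method='radius',
                                                           radius=0.5),
                    train=Train(technique='pls', n_components=2))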

def premodeling(mol_sets_obj=None, parameters=None, qid=None):

    # Check that the local prediction has more than one training observation
    if parameters.phase == 'local':
        if mol_sets_obj.training_set.processed_descriptors.shape[0] <= 1:
            mol_sets_obj.test_set.local_predictions = np.nan
            return

    # Check if the local training set has a uniform response
    if parameters.phase == 'local':
        if not any(mol_sets_obj.training_set.processed_response.values.tolist()):
            mol_sets_obj.test_set.local_predictions = 0
            return
        if all(mol_sets_obj.training_set.processed_response.values.tolist()):
            mol_sets_obj.test_set.local_predictions = 1
            return

    ### Preprocessing ###
    if parameters.preprocess.option == True:
        mol_sets_obj.training_set.preprocess(
            missing_data_method=parameters.preprocess.missing_data_method,
            missing_data_indicator=parameters.preprocess.missing_data_indicator,
            variance_threshold_value=parameters.preprocess.variance_threshold_value,
            infrequent_threshold_value=parameters.preprocess.infrequent_threshold_value,
            transform_descriptors=parameters.preprocess.transform_descriptors,
            transform_function=parameters.preprocess.transform_function,
            transform_inverse=parameters.preprocess.transform_inverse,
            standardize=parameters.preprocess.standardize)

    ### Variable selection ###
    if parameters.variable_selection.option == True:
        mol_sets_obj.training_set.variable_selection(
            significance_level=parameters.variable_selection.significance_level,
            method=parameters.variable_selection.method,
            univariate_binary_method=parameters.variable_selection.univariate_binary_method)
    else:
        mol_sets_obj.training_set.significant_descriptors = pd.Series(
            mol_sets_obj.training_set.processed_descriptors.columns.tolist())

    # Check that the local training set still has descriptors after
    # variable selection
    if parameters.phase == 'local':
        if mol_sets_obj.training_set.processed_descriptors.empty:
            mol_sets_obj.test_set.local_predictions = np.nan
            return

    ### Dimensional reduction ###
    if parameters.dimensional_reduction.option == True:

        if parameters.phase == 'global' \
                and parameters.dimensional_reduction.method == 'tsne':

            mol_sets_obj.tsne_embedding(parameters)

        elif parameters.dimensional_reduction.n_reduced_dimensions > \
                min(mol_sets_obj.training_set.processed_descriptors.shape[0] - 1,
                    mol_sets_obj.training_set.processed_descriptors.shape[1]):

            if parameters.phase == 'local':
                mol_sets_obj.test_set.local_predictions = np.nan
                return
            else:
                warn('The global dimensional reduction could not be applied.')

        else:
            mol_sets_obj.training_set.dimensional_reduction(
                method=parameters.dimensional_reduction.method,
                n_reduced_dimensions=parameters.dimensional_reduction.n_reduced_dimensions,
                pca_method=parameters.dimensional_reduction.pca_method,
                ratio_explained_variance=parameters.dimensional_reduction.ratio_explained_variance)

# def workflow(moleculesets_object = None, global_parameters = None, local_parameters = None, verbose = False, path = None):

# if not isinstance(moleculesets_object, MoleculeSets): raise Exception('The training and test set objects must be ' \ 'MoleculeSet instances.')

# 270 if not isinstance(global_parameters, Parameters) and \ not isinstance(local_parameters, Parameters): raise Exception('Parameters must be Parameters instances.')

# Premodeling premodeling(moleculesets_object, global_parameters)

# Transfer if global_parameters.dimensional_reduction.method != 'tsne': moleculesets_object.training_to_test_transfer(global_parameters)

# Domain assessment if global_parameters.domain_assessment.option == True: moleculesets_object \ .domain_assessment(global_parameters.domain_assessment.domain_k, global_parameters.domain_assessment \ .domain_quantile)

# Training moleculesets_object.train(technique = global_parameters.train.technique, n_neighbors = global_parameters.train.n_neighbors, weights = global_parameters.train.weights, n_components = global_parameters.train.n_components)

# Prediction moleculesets_object.test_set.modeling_technique_object \ = copy.deepcopy(moleculesets_object.training_set.modeling_technique_object) moleculesets_object.predict()

# Evaluation #moleculesets_object.evaluate(phase = 'global')

# Locality assessment if global_parameters.locality_assessment.option == True: moleculesets_object \ .locality_assessment(method = global_parameters \ .locality_assessment.method, radius = global_parameters \ .locality_assessment.radius, locality_k = global_parameters \ .locality_assessment.locality_k, locality_quantile = global_parameters \ .locality_assessment.locality_quantile)

# localsets = moleculesets_object.local_molecule_sets

### Write the global information to file ###

# If writing information is desired if verbose == True and path is not None:

global_path = path + r'global_info/' descriptor_path = path + r'descriptor_sets/' 271 dim_reduct_path = global_path + r'reduction_info/' model_path = global_path + r'model_info/'

# Create a folder at the specified directory to store global information try: if not os.path.exists(global_path): os.makedirs(global_path) if not os.path.exists(descriptor_path): os.makedirs(descriptor_path) if not os.path.exists(dim_reduct_path): os.makedirs(dim_reduct_path) if not os.path.exists(model_path): os.makedirs(model_path) except: raise Exception('The directories could not be created.')

### Training ### if moleculesets_object.training_set.observations_missing_values is not None: if not moleculesets_object.training_set.observations_missing_values.empty: moleculesets_object.training_set.observations_missing_values. \ to_csv(global_path + r'global_training_set_observations_missing_values.csv') if moleculesets_object.training_set.zero_variance_descriptors is not None: if not moleculesets_object.training_set.zero_variance_descriptors.empty: moleculesets_object.training_set.zero_variance_descriptors. \ to_csv(global_path + r'global_training_set_zero_variance_descriptors.csv') if moleculesets_object.training_set.infrequent_binary_descriptors is not None: if not moleculesets_object.training_set.infrequent_binary_descriptors.empty: moleculesets_object.training_set.infrequent_binary_descriptors. \ to_csv(global_path + r'global_training_set_infrequent_binary_descriptors.csv') if moleculesets_object.training_set.significant_descriptors is not None: if not moleculesets_object.training_set.significant_descriptors.empty: moleculesets_object.training_set.significant_descriptors. \ to_csv(global_path + r'global_significant_descriptors.csv') if moleculesets_object.training_set.descriptor_statistics is not None: if not moleculesets_object.training_set.descriptor_statistics.empty: moleculesets_object.training_set.descriptor_statistics. \ to_csv(global_path + r'global_descriptor_statistics.csv') if moleculesets_object.training_set.significant_descriptors is not None: if not moleculesets_object.training_set.significant_descriptors.empty: moleculesets_object.training_set.significant_descriptors. \ to_csv(descriptor_path + r'global_significant_descriptors.csv') if moleculesets_object.training_set.processed_descriptors is not None: if not moleculesets_object.training_set.processed_descriptors.empty: moleculesets_object.training_set.processed_descriptors. \ to_csv(global_path + r'global_training_set_processed_descriptors.csv') if moleculesets_object.training_set.processed_response is not None: if not moleculesets_object.training_set.processed_response.empty: 272 moleculesets_object.training_set.processed_response. \ to_csv(global_path + r'global_training_set_processed_response.csv')

if moleculesets_object.training_set.dimensional_reduction_object is not None: if global_parameters.dimensional_reduction.method != 'tsne': W = moleculesets_object.training_set.dimensional_reduction_object.x_weights_ P = moleculesets_object.training_set.dimensional_reduction_object.x_loadings_ C = moleculesets_object.training_set.dimensional_reduction_object.y_weights_ if W.size != 0 and P.size != 0 and C.size != 0: try: S = np.matmul(P.transpose(), W) R = np.matmul(W, S) B = np.matmul(R, C.transpose()).flatten() B = pd.Series(B, index = moleculesets_object.training_set.significant_descriptors) R = pd.DataFrame(R, index = moleculesets_object.training_set.significant_descriptors, \ columns = moleculesets_object.training_set.processed_descriptors.columns) R.to_csv(dim_reduct_path + 'global_pls_rotations.csv') B.to_csv(dim_reduct_path + 'global_pls_descriptor_coefficients.csv') except: pass

# if moleculesets_object.training_set.dimensional_reduction_explained_variance is not None: # moleculesets_object.training_set.dimensional_reduction_explained_variance \ # .to_csv(dim_reduct_path + 'global_pls_explained_variance.csv')

if moleculesets_object.training_set.modeling_technique_object is not None: log_coeffs = pd.Series(moleculesets_object.training_set.modeling_technique_object.coef_.flatten(), \ index = moleculesets_object.training_set.processed_descriptors.columns.tolist()) log_intcpt = pd.Series(moleculesets_object.training_set.modeling_technique_object.intercept_.flatten(), \ index = ['Intercept']) if not log_coeffs.empty: log_coeffs.to_csv(model_path + r'global_model_coefficients.csv') if not log_intcpt.empty: log_intcpt.to_csv(model_path + r'global_model_intercept.csv')

### Test ### if moleculesets_object.test_set.observations_missing_values is not None: if not moleculesets_object.test_set.observations_missing_values.empty: moleculesets_object.test_set.observations_missing_values. \ to_csv(global_path + r'global_test_set_observations_missing_values.csv')

if moleculesets_object.test_set.zero_variance_descriptors is not None: if not moleculesets_object.test_set.zero_variance_descriptors.empty: moleculesets_object.test_set.zero_variance_descriptors. \ to_csv(global_path + r'global_test_set_zero_variance_descriptors.csv')

if moleculesets_object.test_set.infrequent_binary_descriptors is not None: if not moleculesets_object.test_set.infrequent_binary_descriptors.empty: moleculesets_object.test_set.infrequent_binary_descriptors. \ to_csv(global_path + r'global_test_set_infrequent_binary_descriptors.csv') if moleculesets_object.test_set.processed_descriptors is not None: if not moleculesets_object.test_set.processed_descriptors.empty: 273 moleculesets_object.test_set.processed_descriptors. \ to_csv(global_path + r'global_test_set_processed_descriptors.csv')

if moleculesets_object.test_set.modeling_technique_object is not None: glo_pred_probs = pd.Series(moleculesets_object.test_set.global_prediction_probabilities, \ index = moleculesets_object.test_set.processed_descriptors.index) if not glo_pred_probs.empty: glo_pred_probs.to_csv(global_path + r'global_prediction_probabilities.csv')

### Training & Test Set ### if moleculesets_object.outside_domain_test_identifiers is not None: outside_domain_test_identifiers = pd.Series(moleculesets_object.outside_domain_test_identifiers) if not outside_domain_test_identifiers.empty: outside_domain_test_identifiers.to_csv(global_path + r'global_outside_domain_test_identifiers.csv')

if moleculesets_object.outside_locality_test_identifiers is not None: outside_locality_test_identifiers = pd.Series(moleculesets_object.outside_locality_test_identifiers) if not outside_locality_test_identifiers.empty: outside_locality_test_identifiers.to_csv(global_path + r'global_outside_locality_test_identifiers.csv')

### End portion of code writing global information ###

### Begin Local Modeling ###

# for query_id, query_moleculesets_object in localsets.items():

# if not query_moleculesets_object.training_set \ .processed_descriptors.empty:

# premodeling(query_moleculesets_object, local_parameters, query_id)

# Do not continue on with modeling if a decision has already been made # during pre-modeling if query_moleculesets_object.test_set.local_predictions \ not in [0,1,np.nan] and query_moleculesets_object.training_set \ .processed_descriptors.shape[1] >= local_parameters.train \ .n_components:

# query_moleculesets_object.training_to_test_transfer(local_parameters)

# query_moleculesets_object.train(technique = local_parameters \ .train.technique, n_neighbors = local_parameters \ .train.n_neighbors, weights = local_parameters \ .train.weights, n_components = local_parameters \ 274 .train.n_components)

# query_moleculesets_object.test_set.modeling_technique_object \ = copy.deepcopy(query_moleculesets_object \ .training_set.modeling_technique_object)

# query_moleculesets_object.predict()

# moleculesets_object.test_set.local_predictions[query_id] \ = query_moleculesets_object.test_set.local_predictions

# moleculesets_object.test_set.local_prediction_probabilities[query_id] \ = query_moleculesets_object.test_set.local_prediction_probabilities

### Write the local information to file ###

# If writing information is desired if verbose == True and path is not None:

# descriptor_path = path + r'descriptor_sets/' qry_path = path + r'local_info/' + str(query_id) + r'/' qry_dim_reduct_path = qry_path + r'reduction_info/' qry_model_path = qry_path + r'model_info/'

# Create a folder at the specified directory to store global information try: if not os.path.exists(descriptor_path): os.makedirs(descriptor_path) if not os.path.exists(qry_path): os.makedirs(qry_path) if not os.path.exists(qry_dim_reduct_path): os.makedirs(qry_dim_reduct_path) if not os.path.exists(qry_model_path): os.makedirs(qry_model_path)

except: raise Exception('The directories could not be created.')

### Training ### if query_moleculesets_object.training_set.observations_missing_values is not None: if not query_moleculesets_object.training_set.observations_missing_values.empty: query_moleculesets_object.training_set.observations_missing_values. \ to_csv(qry_path + str(query_id) + '_training_set_observations_missing_values.csv')

if query_moleculesets_object.training_set.zero_variance_descriptors is not None: if not query_moleculesets_object.training_set.zero_variance_descriptors.empty: query_moleculesets_object.training_set.zero_variance_descriptors. \ to_csv(qry_path + str(query_id) + '_training_set_zero_variance_descriptors.csv')

275 if query_moleculesets_object.training_set.infrequent_binary_descriptors is not None: if not query_moleculesets_object.training_set.infrequent_binary_descriptors.empty: query_moleculesets_object.training_set.infrequent_binary_descriptors. \ to_csv(qry_path + str(query_id) + '_training_set_infrequent_binary_descriptors.csv')

if query_moleculesets_object.training_set.significant_descriptors is not None: if not query_moleculesets_object.training_set.significant_descriptors.empty: query_moleculesets_object.training_set.significant_descriptors. \ to_csv(qry_path + str(query_id) + '_significant_descriptors.csv')

if query_moleculesets_object.training_set.descriptor_statistics is not None: if not query_moleculesets_object.training_set.descriptor_statistics.empty: query_moleculesets_object.training_set.descriptor_statistics. \ to_csv(qry_path + str(query_id) + '_descriptor_statistics.csv')

if query_moleculesets_object.training_set.processed_descriptors is not None: if not query_moleculesets_object.training_set.processed_descriptors.empty: query_moleculesets_object.training_set.processed_descriptors. \ to_csv(qry_path + str(query_id) + '_training_set_processed_descriptors.csv')

if query_moleculesets_object.training_set.processed_response is not None: if not query_moleculesets_object.training_set.processed_response.empty: query_moleculesets_object.training_set.processed_response. \ to_csv(qry_path + str(query_id) + '_training_set_processed_response.csv')

if query_moleculesets_object.training_set.significant_descriptors is not None: if not query_moleculesets_object.training_set.significant_descriptors.empty: query_moleculesets_object.training_set.significant_descriptors. \ to_csv(descriptor_path + str(query_id) + '_significant_descriptors.csv')

if query_moleculesets_object.training_set.dimensional_reduction_object is not None: if global_parameters.dimensional_reduction.method != 'tsne': W = query_moleculesets_object.training_set.dimensional_reduction_object.x_weights_ P = query_moleculesets_object.training_set.dimensional_reduction_object.x_loadings_ C = query_moleculesets_object.training_set.dimensional_reduction_object.y_weights_ if W.size != 0 and P.size != 0 and C.size != 0: try: S = np.matmul(P.transpose(), W) R = np.matmul(W, S) B = np.matmul(R, C.transpose()).flatten() B = pd.Series(B, index = query_moleculesets_object.training_set.significant_descriptors) R = pd.DataFrame(R, index = query_moleculesets_object.training_set.significant_descriptors, \ columns = query_moleculesets_object.training_set.processed_descriptors.columns) R.to_csv(qry_dim_reduct_path + str(query_id) + '_pls_rotations.csv') B.to_csv(qry_dim_reduct_path + str(query_id) + '_pls_descriptor_coefficients.csv') except: pass # if query_moleculesets_object.training_set.dimensional_reduction_explained_variance is not None: # query_moleculesets_object.training_set.dimensional_reduction_explained_variance \ # .to_csv(qry_dim_reduct_path + str(query_id) + '_pls_explained_variance.csv')

if query_moleculesets_object.training_set.modeling_technique_object is not None: 276 log_coeffs = pd.Series(query_moleculesets_object.training_set.modeling_technique_object.coef_.flatten(), \ index = query_moleculesets_object.training_set.processed_descriptors.columns.tolist()) log_intcpt = pd.Series(query_moleculesets_object.training_set.modeling_technique_object.intercept_.flatten(), \ index = ['Intercept']) if not log_coeffs.empty: log_coeffs.to_csv(qry_model_path + str(query_id) + r'_model_coefficients.csv') if not log_intcpt.empty: log_intcpt.to_csv(qry_model_path + str(query_id) + r'_model_intercept.csv')

### Test ### if query_moleculesets_object.test_set.observations_missing_values is not None: if not query_moleculesets_object.test_set.observations_missing_values.empty: query_moleculesets_object.test_set.observations_missing_values. \ to_csv(qry_path + '\global_test_set_observations_missing_values.csv')

if query_moleculesets_object.test_set.zero_variance_descriptors is not None: if not query_moleculesets_object.test_set.zero_variance_descriptors.empty: query_moleculesets_object.test_set.zero_variance_descriptors. \ to_csv(qry_path + '\global_test_set_zero_variance_descriptors.csv')

if query_moleculesets_object.test_set.infrequent_binary_descriptors is not None: if not query_moleculesets_object.test_set.infrequent_binary_descriptors.empty: query_moleculesets_object.test_set.infrequent_binary_descriptors. \ to_csv(qry_path + '\global_test_set_infrequent_binary_descriptors.csv')

if query_moleculesets_object.test_set.processed_descriptors is not None: if not query_moleculesets_object.test_set.processed_descriptors.empty: query_moleculesets_object.test_set.processed_descriptors. \ to_csv(qry_path + str(query_id) + '_test_set_processed_descriptors.csv')

if query_moleculesets_object.test_set.modeling_technique_object is not None: query_pred_probs = pd.Series(query_moleculesets_object.test_set.local_prediction_probabilities, \ index = [str(query_id)]) if not query_pred_probs.empty: query_pred_probs.to_csv(qry_model_path + str(query_id) + r'_prediction_probabilities.csv')

if query_moleculesets_object.test_set.local_predictions is not None: query_response = pd.Series(query_moleculesets_object.test_set.local_predictions, index = ['Activity']) if not query_response.empty: query_response.to_csv(qry_path + str(query_id) + '_locally_predicted_activity.csv')

if moleculesets_object.test_set.global_predictions[str(query_id)] is not None:
    query_response = pd.Series(moleculesets_object.test_set.global_predictions[str(query_id)], index = ['Activity'])
    if not query_response.empty:
        query_response.to_csv(qry_path + str(query_id) + '_globally_predicted_activity.csv')

# If writing information is desired
if verbose and path is not None:

    local_path = path + r'local_info/'
    if not os.path.exists(local_path):
        os.makedirs(local_path)

    if moleculesets_object.local_information is not None:
        for k, df in moleculesets_object.local_information.items():
            if not df.empty:
                qry_path = path + r'local_info/' + str(k) + r'/'
                if os.path.exists(qry_path):
                    df.to_csv(qry_path + str(k) + '_neighborhood.csv')

    if moleculesets_object.test_set.local_prediction_probabilities is not None:
        if not moleculesets_object.test_set.local_prediction_probabilities.empty:
            moleculesets_object.test_set.local_prediction_probabilities \
                .to_csv(local_path + 'local_prediction_probabilities.csv')

### End portion of code writing local information to file ###
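# A hypothetical post-processing sketch (not part of the original program): the
# per-query CSVs written above can be gathered back into a single table with
# pandas. The 'local_info/<query id>/' directory layout is assumed from the
# writing code above; commented out so module behavior is unchanged.
#
# import glob, os
# import pandas as pd
# preds = {}
# for f in glob.glob(os.path.join(local_path, '*', '*_locally_predicted_activity.csv')):
#     query_id = os.path.basename(f).split('_')[0]
#     preds[query_id] = pd.read_csv(f, index_col = 0).squeeze()
# local_predictions_table = pd.Series(preds)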

# moleculesets_object.evaluate(phase = 'local')

def global_workflow(moleculesets_object = None, global_parameters = None):
    """Fit and evaluate a single global model on a MoleculeSets object."""

    if not isinstance(moleculesets_object, MoleculeSets):
        raise Exception('moleculesets_object must be a MoleculeSets instance.')

    if not isinstance(global_parameters, Parameters):
        raise Exception('global_parameters must be a Parameters instance.')

    # Premodeling
    premodeling(moleculesets_object, global_parameters)

    # Transfer
    if global_parameters.dimensional_reduction.method != 'tsne':
        moleculesets_object.training_to_test_transfer(global_parameters)

    # Domain assessment
    if global_parameters.domain_assessment.option:
        moleculesets_object \
            .domain_assessment(global_parameters.domain_assessment.domain_k,
                               global_parameters.domain_assessment.domain_quantile)

    # Training
    moleculesets_object.train(technique = global_parameters.train.technique,
                              n_neighbors = global_parameters.train.n_neighbors,
                              weights = global_parameters.train.weights,
                              n_components = global_parameters.train.n_components)

    # Prediction: copy the fitted global model onto the test set, then predict
    moleculesets_object.test_set.modeling_technique_object \
        = copy.deepcopy(moleculesets_object.training_set.modeling_technique_object)
    moleculesets_object.predict()

    # Evaluation
    moleculesets_object.evaluate(phase = 'global')
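# A minimal usage sketch (added for illustration; 'some_split_moleculesets' is a
# hypothetical MoleculeSets instance, not a name from this module). Commented
# out so the script's behavior is unchanged.
#
# gparam = Parameters.from_file(filepath = r'C:/Users/Hobo/Desktop/parameter_grid.csv',
#                               grid = True)[0]
# global_workflow(moleculesets_object = some_split_moleculesets,
#                 global_parameters = gparam)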

### Operating Program Hansen Dataset Global/Local Cross-Validation ###
if __name__ == '__main__':

    param_grid = Parameters.from_file(
        filepath = r'C:/Users/Hobo/Desktop/parameter_grid.csv', grid = True)

    hansen_moleset = MoleculeSet.from_file(
        csv_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\hansen_ames_data.csv',
        sdf_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\AMES_Darshan_Modified.sdf',
        var_type_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\hansen_ames_types.csv',
        set_label_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\hansen_set_labels.csv')

    hansen_moleset.partition(
        method = 'file',
        file_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\hansen_partition_set.csv')

    hansen_split_moleculesets = hansen_moleset.to_moleculesets(training_partition = True)

    result_types = ['global', 'local', 'global-on-local']
    n_grid_points = ['grid point ' + str(n+1) for n in range(int(len(param_grid)/2))]
    grid_point_col = [p for p in n_grid_points for r in result_types]
    result_type_col = [r for p in n_grid_points for r in result_types]
    grid_point_index = [i + ' ' + j for i, j in zip(grid_point_col, result_type_col)]
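    # For illustration (an added note, not in the original run): with two grid
    # points the paired comprehensions above yield
    # ['grid point 1 global', 'grid point 1 local', 'grid point 1 global-on-local',
    #  'grid point 2 global', 'grid point 2 local', 'grid point 2 global-on-local'];
    # a standalone check, commented out:
    #
    # pts = ['grid point 1', 'grid point 2']
    # idx = [p + ' ' + r for p in pts for r in ['global', 'local', 'global-on-local']]
    # print(idx)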

    performance_summary = pd.DataFrame(
        index = grid_point_index,
        columns = param_grid[0].to_series().index.tolist()[2:]
                  + ['nobs', 'npred', 'fpred', 'prev', 'tps', 'fns', 'fps',
                     'tns', 'sens', 'spec', 'accu', 'ppv', 'npv', 'auc'])
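    # For reference (added commentary; these are the standard binary-classification
    # definitions, not formulas stated in this listing), the summary columns
    # relate as:
    #
    # sens = tps / (tps + fns)
    # spec = tns / (tns + fps)
    # accu = (tps + tns) / (tps + fns + fps + tns)
    # ppv  = tps / (tps + fps)
    # npv  = tns / (tns + fns)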

    pred_types = ['global', 'local']
    pred_grid_point_col = [p for p in n_grid_points for r in pred_types]
    pred_type_col = [r for p in n_grid_points for r in pred_types]
    pred_grid_index = [i + ' ' + j for i, j in zip(pred_grid_point_col, pred_type_col)]

    grid_predictions = pd.DataFrame(index = pred_grid_index)
    for mol in hansen_moleset.raw_response.index.tolist():
        grid_predictions[mol] = np.nan

    for n in range(int(len(param_grid)/2)):

        oriset = copy.deepcopy(hansen_moleset)
        molsets = copy.deepcopy(hansen_split_moleculesets)

        print('')
        print('### Grid Point ' + str(n+1) + ' ###')
        print('')

        gparam, lparam = param_grid[2*n], param_grid[2*n+1]

        for molesetobj in molsets:

            workflow(moleculesets_object = molesetobj,
                     global_parameters = gparam,
                     local_parameters = lparam,
                     verbose = True,
                     path = r'C:/Users/Hobo/Desktop/Hansen set/Data/')

        oriset.combine_predictions(molsets)

        oriset.evaluate(phase = 'global')
        oriset.evaluate(phase = 'local')
        oriset.global_on_local_evaluation()

        grid_predictions.loc['grid point ' + str(n+1) + ' global',
                             oriset.global_predictions.index] = oriset.global_predictions
        grid_predictions.loc['grid point ' + str(n+1) + ' local',
                             oriset.global_predictions.index] = oriset.local_predictions

        performance_summary.loc['grid point ' + str(n+1) + ' global'] \
            = gparam.to_series().tolist()[2:] \
            + oriset.global_performance_summary.tolist()

        performance_summary.loc['grid point ' + str(n+1) + ' local'] \
            = lparam.to_series().tolist()[2:] \
            + oriset.local_performance_summary.tolist()

        performance_summary.loc['grid point ' + str(n+1) + ' global-on-local'] \
            = lparam.to_series().tolist()[2:] \
            + oriset.global_on_local_performance_summary.tolist()

    grid_predictions.to_csv(r'C:\Users\Hobo\Desktop\Hansen set\Data\predictions.csv')
    performance_summary.to_csv(r'C:\Users\Hobo\Desktop\Hansen set\Data\summary.csv')
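    # A hypothetical follow-up sketch for inspecting the saved summary (added
    # for illustration, not part of the original program); commented out:
    #
    # summary = pd.read_csv(r'C:\Users\Hobo\Desktop\Hansen set\Data\summary.csv', index_col = 0)
    # local_rows = summary.index.str.endswith(' local')
    # global_rows = summary.index.str.endswith(' global')
    # print(summary.loc[local_rows, 'accu'].values - summary.loc[global_rows, 'accu'].values)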

### Operating Program Hansen Dataset Global Only ###
#if __name__ == '__main__':
#
#    param_grid = Parameters.from_file(
#        filepath = r'C:/Users/Hobo/Desktop/parameter_grid.csv', grid = True)
#
#    hansen_moleset = MoleculeSet.from_file(
#        csv_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\hansen_ames_data.csv',
#        sdf_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\AMES_Darshan_Modified.sdf',
#        var_type_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\hansen_ames_types.csv',
#        set_label_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\hansen_set_labels.csv')
#
#    hansen_moleset.partition(
#        method = 'file',
#        file_path = r'C:\Users\Hobo\Desktop\Hansen set\Data\hansen_partition_set.csv')
#
#    hansen_split_moleculesets = hansen_moleset.to_moleculesets(training_partition = True)
#
#    master_index = ['grid point ' + str(n+1) + ' global' for n in range(int(len(param_grid)/2))]
#
#    performance_summary = pd.DataFrame(
#        index = master_index,
#        columns = param_grid[0].to_series().index.tolist()[2:]
#                  + ['nobs', 'npred', 'fpred', 'prev', 'tps', 'fns', 'fps',
#                     'tns', 'sens', 'spec', 'accu', 'ppv', 'npv'])
#
#    grid_predictions = pd.DataFrame(index = master_index)
#    for mol in hansen_moleset.raw_response.index.tolist():
#        grid_predictions[mol] = np.nan
#
#    for n in range(int(len(param_grid)/2)):
#
#        oriset = copy.deepcopy(hansen_moleset)
#        molsets = copy.deepcopy(hansen_split_moleculesets)
#
#        print('')
#        print('### Grid Point ' + str(n+1) + ' ###')
#        print('')
#
#        gparam, lparam = param_grid[2*n], param_grid[2*n+1]
#
#        for molesetobj in molsets:
#
#            global_workflow(moleculesets_object = molesetobj,
#                            global_parameters = gparam)
#
#        oriset.combine_predictions(molsets)
#        oriset.evaluate(phase = 'global')
#
#        grid_predictions.loc['grid point ' + str(n+1) + ' global',
#                             oriset.global_predictions.index] = oriset.global_predictions
#
#        performance_summary.loc['grid point ' + str(n+1) + ' global'] \
#            = gparam.to_series().tolist()[2:] \
#            + oriset.global_performance_summary.tolist()
#
#    grid_predictions.to_csv(r'C:\Users\Hobo\Desktop\Hansen set\Data\predictions.csv')
#    performance_summary.to_csv(r'C:\Users\Hobo\Desktop\Hansen set\Data\summary.csv')

### Operating Program Muehlbacher Dataset ###
#if __name__ == '__main__':
#
#    param_grid = Parameters.from_file(
#        filepath = r'C:/Users/Hobo/Desktop/parameter_grid.csv', grid = True)
#
#    muehlbacher_moleset = MoleculeSet.from_file(
#        csv_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_data.csv',
#        sdf_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_molecules.sdf',
#        var_type_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_types.csv',
#        set_label_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_set_labels.csv')
#
#    muehlbacher_moleset.partition(
#        method = 'file',
#        file_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_partition_set.csv')
#
#    muehlbacher_split_moleculesets = muehlbacher_moleset.to_moleculesets(training_partition = False)
#
#    result_types = ['global', 'local', 'global-on-local']
#    n_grid_points = ['grid point ' + str(n+1) for n in range(int(len(param_grid)/2))]
#    grid_point_col = [p for p in n_grid_points for r in result_types]
#    result_type_col = [r for p in n_grid_points for r in result_types]
#    grid_point_index = [i + ' ' + j for i, j in zip(grid_point_col, result_type_col)]
#
#    performance_summary = pd.DataFrame(
#        index = grid_point_index,
#        columns = param_grid[0].to_series().index.tolist()[2:]
#                  + ['nobs', 'npred', 'fpred', 'prev', 'tps', 'fns', 'fps',
#                     'tns', 'sens', 'spec', 'accu', 'ppv', 'npv'])
#
#    pred_types = ['global', 'local']
#    pred_grid_point_col = [p for p in n_grid_points for r in pred_types]
#    pred_type_col = [r for p in n_grid_points for r in pred_types]
#    pred_grid_index = [i + ' ' + j for i, j in zip(pred_grid_point_col, pred_type_col)]
#
#    grid_predictions = pd.DataFrame(index = pred_grid_index)
#    for mol in muehlbacher_moleset.raw_response.index.tolist():
#        grid_predictions[mol] = np.nan
#
#    for n in range(int(len(param_grid)/2)):
#
#        oriset = copy.deepcopy(muehlbacher_moleset)
#        molsets = copy.deepcopy(muehlbacher_split_moleculesets)
#
#        print('')
#        print('### Grid Point ' + str(n+1) + ' ###')
#        print('')
#
#        gparam, lparam = param_grid[2*n], param_grid[2*n+1]
#
#        for molesetobj in molsets:
#
#            workflow(moleculesets_object = molesetobj,
#                     global_parameters = gparam,
#                     local_parameters = lparam)
#
#        oriset.combine_predictions(molsets)
#
#        oriset.evaluate(phase = 'global')
#        oriset.evaluate(phase = 'local')
#        oriset.global_on_local_evaluation()
#
#        grid_predictions.loc['grid point ' + str(n+1) + ' global',
#                             oriset.global_predictions.index] = oriset.global_predictions
#        grid_predictions.loc['grid point ' + str(n+1) + ' local',
#                             oriset.global_predictions.index] = oriset.local_predictions
#
#        performance_summary.loc['grid point ' + str(n+1) + ' global'] \
#            = gparam.to_series().tolist()[2:] \
#            + oriset.global_performance_summary.tolist()
#
#        performance_summary.loc['grid point ' + str(n+1) + ' local'] \
#            = lparam.to_series().tolist()[2:] \
#            + oriset.local_performance_summary.tolist()
#
#        performance_summary.loc['grid point ' + str(n+1) + ' global-on-local'] \
#            = lparam.to_series().tolist()[2:] \
#            + oriset.global_on_local_performance_summary.tolist()
#
#    grid_predictions.to_csv(r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\predictions.csv')
#    performance_summary.to_csv(r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\summary.csv')

### Operating Program Muehlbacher Dataset Global Only ###
#if __name__ == '__main__':
#
#    param_grid = Parameters.from_file(
#        filepath = r'C:/Users/Hobo/Desktop/parameter_grid.csv', grid = True)
#
#    muehlbacher_moleset = MoleculeSet.from_file(
#        csv_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_data.csv',
#        sdf_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_molecules.sdf',
#        var_type_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_types.csv',
#        set_label_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_set_labels.csv')
#
#    muehlbacher_moleset.partition(
#        method = 'file',
#        file_path = r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\muehlbacher_bbb_partition_set.csv')
#
#    muehlbacher_split_moleculesets = muehlbacher_moleset.to_moleculesets(training_partition = True)
#
#    master_index = ['grid point ' + str(n+1) + ' global' for n in range(int(len(param_grid)/2))]
#
#    performance_summary = pd.DataFrame(
#        index = master_index,
#        columns = param_grid[0].to_series().index.tolist()[2:]
#                  + ['nobs', 'npred', 'fpred', 'prev', 'tps', 'fns', 'fps',
#                     'tns', 'sens', 'spec', 'accu', 'ppv', 'npv'])
#
#    grid_predictions = pd.DataFrame(index = master_index)
#    for mol in muehlbacher_moleset.raw_response.index.tolist():
#        grid_predictions[mol] = np.nan
#
#    for n in range(int(len(param_grid)/2)):
#
#        oriset = copy.deepcopy(muehlbacher_moleset)
#        molsets = copy.deepcopy(muehlbacher_split_moleculesets)
#
#        print('')
#        print('### Grid Point ' + str(n+1) + ' ###')
#        print('')
#
#        gparam, lparam = param_grid[2*n], param_grid[2*n+1]
#
#        for molesetobj in molsets:
#
#            global_workflow(moleculesets_object = molesetobj,
#                            global_parameters = gparam)
#
#        oriset.combine_predictions(molsets)
#        oriset.evaluate(phase = 'global')
#
#        grid_predictions.loc['grid point ' + str(n+1) + ' global',
#                             oriset.global_predictions.index] = oriset.global_predictions
#
#        performance_summary.loc['grid point ' + str(n+1) + ' global'] \
#            = gparam.to_series().tolist()[2:] \
#            + oriset.global_performance_summary.tolist()
#
#    grid_predictions.to_csv(r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\predictions.csv')
#    performance_summary.to_csv(r'C:\Users\Hobo\Desktop\Muehlbacher_set\Data\summary.csv')
