
NEW FRAGMENTATION METHOD TO ENHANCE STRUCTURE-BASED

IN SILICO MODELING OF CHEMICALLY-INDUCED TOXICITY

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Darshan Mehta, M.S.

Graduate Program in Chemical Engineering

The Ohio State University

2016

Dissertation Committee:

James Rathman, Advisor

Chihae Yang, Co-advisor

Aravind Asthagiri

Bhavik Bakshi

Copyright by

Darshan Mehta

2016

ABSTRACT

Evaluating the potential toxicity of chemical compounds is an important step in the development of new products, ranging from drugs to food ingredients to materials in consumer products. This evaluation is necessary for obtaining approval from regulatory agencies and for minimizing safety concerns. Current methods for assessing toxicity largely rely on experimental techniques that are time-consuming and resource-intensive. The development of computational methods is therefore of great interest, as it helps to reduce and prioritize the experimental tests. Principles of chemical informatics (also called chemoinformatics), where chemical descriptors are used to capture and represent structural information, are widely used to predict the toxicity of chemicals based on their molecular structure. These descriptors abstract the structural information of a molecule into a mathematical quantity, which can then be related to a particular question, such as the similarity between two molecules, or used as factors in a computational model. In spite of developments in identifying newer descriptors, there is still a need for more efficient methods that can reduce the high-dimensional descriptor space, give meaningful descriptors, and yield better toxicity prediction results.

In this research, we developed novel chemical descriptors that were able to overcome some of the above-mentioned limitations. These novel descriptors are linear subgraphs of chemical structures that are annotated with atom-based features such as atom identity and partial charge.

These features provide flexibility in defining chemical fragments, and thus allow one to explore different levels of structural detail from the same chemical fragment. Even though the fragments are linear in composition, they also capture branched structural information through the annotating features. Our particular interest in linear fragments was due to the potential of building Markov chain models for prediction purposes. Markov models have proved useful in bioinformatics for the sequence analysis of DNA, RNA, and other nucleotide sequences. We developed a similar model tailored to analyzing linear chemical fragments that helped to quantify the relationship between these fragments and chemically-induced toxicity.

We evaluated the performance of annotated linear fragments using datasets on two toxicity endpoints, namely skin sensitization and Ames mutagenicity. The first part of this evaluation was to explore the fragment space to see whether descriptors related to these two toxicity endpoints could be identified. We were able to identify 15 unique descriptors corresponding to structural features well known to cause skin-sensitizing effects and 12 descriptors known to cause mutagenic effects. The second part of the evaluation was to explore the performance of Markov chain models for predicting the mutagenicity of chemicals. We developed several models using different annotation schemes and fragment lengths and explored their predictive performances. These models performed significantly better than, or comparably to, other non-parametric approaches reported in the literature. We also explored the performance of kNN models using annotated linear fragments to substantiate the effectiveness of our novel descriptors.


DEDICATION

To the presiding deities at Columbus Krishna House:

Sri Sri Gaura Nitai, Sri Sri Radha Natabara, and Sri Sri Jagannath Baladev Subhadra

To my spiritual master, His Holiness Radhanath Swami

And to all innocent creatures used for experimental testing of chemicals


ACKNOWLEDGMENTS

I would sincerely like to acknowledge the guidance and support that I received from my advisors, Drs. James Rathman and Chihae Yang. Without their expert leadership, clear vision, and constructive feedback, it would not have been possible for me to complete this monumental task. I would also like to acknowledge the help that I received from my lab members through their regular feedback and advice. In particular, I would like to thank Aleksandra Mostrag-Szlichtyng, Dimitar Hristozov, and Bryan Hobocienski. I am especially grateful to Aleksandra for compiling the skin sensitization dataset and to Dimitar for helping me set up the RDKit package in Python. I also sincerely thank the professors in the Department of Statistics for teaching me the fundamentals of statistical data analysis. This played a significant role during the course of my research and eventually led me to pursue a Master's degree in Applied Statistics.

Finally, I would like to acknowledge the emotional and spiritual support that I received from my parents, brother, wife, and all the wonderful devotees at Columbus Krishna House. My stay in Columbus wouldn't have been the same without them.


VITA

2008……………………………………B. Chem. Engg. (Bachelor of Chemical Engineering), Institute of Chemical Technology.

2008 to 2010……………………………Manager, Central Technical Services Dept., Reliance Industries Limited, Hazira, India.

2010 to present…………………………Graduate Research/Teaching Associate, Department of Chemical and Biomolecular Engineering, The Ohio State University.

2013……………………………………M.A.S. (Master of Applied Statistics), The Ohio State University.

2014……………………………………M.S. Chemical Engineering, The Ohio State University.

FIELDS OF STUDY

Major Field: Chemical Engineering


TABLE OF CONTENTS

Abstract ...... ii

Dedication ...... iv

Acknowledgments...... v

Vita ...... vi

List of Tables ...... xi

List of Figures ...... xvi

CHAPTER 1: INTRODUCTION ...... 1

1.1 QSAR Modeling ...... 7

CHAPTER 2: BACKGROUND ...... 11

2.1 Representing Chemical Structures ...... 11

2.2 Structural Descriptors ...... 15


2.3 Application of QSAR Methods ...... 19

2.4 Research Problem ...... 23

2.5 Proposed Solution ...... 24

CHAPTER 3: DATASETS AND COMPUTATIONAL TOOLS ...... 27

3.1 Training Datasets ...... 27

3.2 Computational Tools ...... 32

CHAPTER 4: METHODS ...... 35

4.1 Generation of Linear Fragments ...... 35

4.2 Chemical Annotations ...... 38

4.3 Compound-Fragment Data Matrix ...... 41

4.4 Developing Markov Chain Models ...... 44

4.5 Evaluating Model Performance ...... 53

CHAPTER 5: RESULTS ON IDENTIFICATION OF STRUCTURAL ALERTS ...... 55

5.1 Distinguishing Structurally Similar Compounds ...... 55


5.2 Identifying Structural Alerts for Skin Sensitization ...... 59

5.3 Comparison of Different Annotation Schemes ...... 71

5.4 Identifying Structural Alerts for Ames Mutagenicity ...... 73

CHAPTER 6: RESULTS ON CLASSIFICATION OF COMPOUNDS USING MARKOV CHAIN MODELS ...... 84

6.1 Markov Chain Models Based on One-step Connections ...... 84

6.2 Markov Chain Models Using Different Annotation Schemes ...... 94

6.3 Markov Chain Models Using Fragments of Longer Lengths ...... 98

6.4 Markov Chain Models with Improved Specificity ...... 104

6.5 Additional Results Using Annotated Linear Fragments Coupled with kNN Models ...... 109

CHAPTER 7: CONCLUDING REMARKS ...... 125

7.1 Summary ...... 125

7.2 Future Work ...... 127

7.3 Conclusion ...... 129


REFERENCES ...... 131

Appendix A. Computer Programs: Algorithms and Python Scripts ...... 139

Appendix B. Statistical Results for Identification of Structural Alerts ...... 147

Appendix C. Results for 5-Fold Cross-Validation of Benchmark Dataset for Ames Mutagenicity ...... 164


LIST OF TABLES

Table 1: Compound counts for the 5-fold cross-validation (CV) scheme ...... 30

Table 2: Iteration steps of depth-first search algorithm for atom #1 of m-ethyl phenol ... 37

Table 3: Atom symbols using different annotation schemes for m-ethyl phenol ...... 39

Table 4: Example of count matrix ...... 45

Table 5: Example of transition probability matrix (generated using count matrix data in Table 4) ...... 45

Table 6: Example of connection vector with counts and probabilities (generated using count matrix data in Table 4) ...... 47

Table 7: Connection count vector and corresponding probabilities for training set with binary toxicity outcome (generated using data from Table 6) ...... 48

Table 8: General 2x2 confusion matrix ...... 54

Table 9: Properties of two isomers, β-Terpinene and β-Phellandrene ...... 56

Table 10: Tanimoto distances between the two isomers ...... 58

Table 11: Contingency table for fragment ['10-', '30+', '10-'] ...... 64

Table 12: Summary statistics for skin sensitization dataset...... 66

Table 13: Significant fragments in skin sensitization dataset along with the annotation schemes that were able to identify them ...... 67

Table 14: General 2x2 contingency table ...... 76

Table 15: Summary statistics for Ames mutagenicity dataset ...... 79

Table 16: Significant fragments identified in Ames mutagenicity dataset ...... 80

Table 17: Important toxicophores for Ames mutagenicity and corresponding fragments 83

Table 18: Unique symbols identified using training set for CV 1 ...... 85

Table 19: One-step connections with counts and probabilities using training set for CV 1 (partial output) ...... 86

Table 20: Calculation of overall log-likelihood for 3-nitro-o-xylene molecule ...... 90

Table 21: Confusion matrix for test set of CV 1 using {AI, nC, nH} annotation scheme 91

Table 22: Performance parameters for 5-fold cross-validation of Hansen dataset using {AI, nC, nH} annotation scheme and one-step connection probabilities ...... 93

Table 23: Comparison of performance with other non-parametric methods (averaged over 5-fold cross-validation splits) ...... 93

Table 24: Performance parameters for Hansen dataset using Markov chain models with different annotation schemes ...... 95

Table 25: 5-bin classification of partial charge (PC) annotation feature ...... 96

Table 26: Performance parameters for Hansen dataset using different annotation schemes (with PC binned into 5 categories and ring atom (RA) annotation added) ...... 97

Table 27: Calculation of overall log-likelihood for 3-nitro-o-xylene molecule using fragments of length 3 ...... 100


Table 28: Performance parameters for Hansen dataset using Markov chain models with fragments of longer lengths ...... 101

Table 29: 5-level categorization of the ring atom (RA) annotation feature ...... 107

Table 30: Five nearest neighbors identified for the 3-nitro-o-xylene molecule from the training set for CV 1 ...... 112

Table 31: Performance parameters for 5-fold cross-validation of Hansen dataset using kNN modeling method with 5 nearest neighbors and fragments of length 3 generated using {AI, nC, nH} annotation scheme ...... 113

Table 32: Averaged performance parameters for Hansen dataset using kNN modeling method with 5 nearest neighbors and fragments of different lengths generated using a selected set of 3 different annotation schemes ...... 114

Table 33: Tanimoto distance and mutagenic activity of 5 nearest neighbors identified in the training set for some hypothetical test compound ...... 118

Table 34: Averaged performance parameters for Hansen dataset using weighted kNN modeling method with 5 nearest neighbors and fragments of different lengths generated using a selected set of 3 different annotation schemes ...... 119

Table 35: Averaged performance parameters for Hansen dataset using weighted kNN modeling method with 7 nearest neighbors and fragments of different lengths generated using a selected set of 3 different annotation schemes ...... 122


Table 36: Significant and positively correlated fragments for skin sensitization dataset using {AI} annotation scheme with χ2 and γ statistic values ...... 147

Table 37: Significant and positively correlated fragments for skin sensitization dataset using {AI, nC} annotation scheme with χ2 and γ statistic values ...... 148

Table 38: Significant and positively correlated fragments for skin sensitization dataset using {AI, nC, nH} annotation scheme with χ2 and γ statistic values ...... 150

Table 39: Significant and positively correlated fragments for skin sensitization dataset using {nC, nH, PC} annotation scheme with χ2 and γ statistic values ...... 154

Table 40: Significant and positively correlated fragments for skin sensitization dataset using {AI, nC, nH, PC} annotation scheme with χ2 and γ statistic values ...... 159

Table 41: Performance parameters for 5-fold cross-validation of benchmark dataset for Ames mutagenicity using preliminary Markov chain models based on one-step connection probabilities (14 sub-tables for 14 different annotation schemes) ...... 164

Table 42: Performance parameters for 5-fold cross-validation of benchmark dataset for Ames mutagenicity using Markov models based on longer fragment lengths (9 sub-tables with 3 different annotation schemes and 3 different fragment lengths) ...... 169

Table 43: Performance parameters for 5-fold cross-validation of benchmark dataset for Ames mutagenicity using preliminary Markov chain models based on one-step connection probabilities with different annotation schemes and classification criteria (4 sub-tables) ...... 172

Table 44: Performance parameters for 5-fold cross-validation of benchmark dataset for Ames mutagenicity using kNN modeling methods with 5 nearest neighbors (9 sub-tables with 3 different annotation schemes and 3 different fragment lengths) ...... 174

Table 45: Performance parameters for 5-fold cross-validation of benchmark dataset for Ames mutagenicity using weighted kNN models with 5 nearest neighbors (9 sub-tables with 3 different annotation schemes and 3 different fragment lengths) ...... 177

Table 46: Performance parameters for 5-fold cross-validation of benchmark dataset for Ames mutagenicity using weighted kNN models with 7 nearest neighbors (9 sub-tables with 3 different annotation schemes and 3 different fragment lengths) ...... 181


LIST OF FIGURES

Figure 1: In silico approach ...... 2

Figure 2: Drug discovery pipeline ...... 3

Figure 3: Antidiabetic drug molecules ...... 6

Figure 4: Role of structural descriptors ...... 8

Figure 5: Flow diagram of QSAR-based in silico approach ...... 10

Figure 6: Representation of chemical structure ...... 12

Figure 7: m-ethyl phenol molecule ...... 12

Figure 8: SD file for m-ethyl phenol molecule ...... 14

Figure 9: Number of compounds in different categories in skin sensitization dataset ..... 28

Figure 10: Distribution of molecular weights for skin sensitization dataset ...... 28

Figure 11: Number of compounds in different categories in the benchmark dataset for Ames mutagenicity ...... 29

Figure 12: Distribution of molecular weights for compounds in the benchmark dataset for Ames mutagenicity ...... 30

Figure 13: Compound counts for training sets in the 5-fold CV scheme ...... 31

Figure 14: Compound counts for test sets in the 5-fold CV scheme ...... 31

Figure 15: Typical example using MarvinSketch software ...... 32

Figure 16: Typical example using MarvinView software ...... 33

Figure 17: Linear fragments using graph theory and depth-first search algorithm ...... 35

Figure 18: Unique fragments generated using different annotation schemes (partial output) ...... 40

Figure 19: Typical compound-fragment data matrix ...... 41

Figure 20: Sample training set ...... 42

Figure 21: Compound-fragment data matrix (partial) using {AI} annotation scheme ..... 42

Figure 22: Compound-fragment data matrix (partial) using {AI, nC, nH} annotation scheme...... 42

Figure 23: Excel interface for generating compound-fragment data matrix using annotated linear chemical fragments ...... 43

Figure 24: One-step connection probabilities (based on data in Table 7) ...... 49

Figure 25: One-step connection probability ratios (based on data in Table 7) ...... 49

Figure 26: Overview of classification strategy ...... 52

Figure 27: Two isomers: β-Terpinene and β-Phellandrene ...... 55

Figure 28: Activity cliff phenomenon for chemical properties ...... 57

Figure 29: Possible explanation of activity cliff phenomenon using annotated linear chemical fragments ...... 58

Figure 30: Total fragments generated from skin sensitization dataset with fragment lengths between 2 and 15 using 5 different annotation schemes ...... 59


Figure 31: Fragment distribution for skin sensitization dataset using 5 different annotation schemes ...... 60

Figure 32: Total fragments generated from skin sensitization after removing singletons/doubletons ...... 61

Figure 33: Fragment distribution for skin sensitization dataset after removing singletons/doubletons ...... 62

Figure 34: Histogram of fragments using {nC, nH, PC} annotation scheme (before removing singletons/doubletons) ...... 63

Figure 35: Histogram of fragments using {nC, nH, PC} annotation scheme (after removing singletons/doubletons) ...... 63

Figure 36: Total fragments generated from Ames mutagenicity dataset with path lengths between 3 and 7 using 5 different annotation schemes ...... 73

Figure 37: Fragment distribution for Ames mutagenicity dataset before removing singleton/doubleton fragments ...... 74

Figure 38: Fragment distribution for Ames mutagenicity dataset after removing singleton/doubleton fragments ...... 74

Figure 39: Histogram of fragments using {AI, nC, nH} annotation scheme (before removing singleton/doubleton fragments) ...... 75

Figure 40: Histogram of fragments using {AI, nC, nH} annotation scheme (after removing singleton/doubleton fragments) ...... 75

Figure 41: One-step connection probabilities obtained using the training set for CV 1 (partial output) ...... 87

Figure 42: One-step connection probability ratios obtained using the training set for CV 1 (partial output) ...... 87

Figure 43: 3-nitro-o-xylene molecule from test set for CV 1 ...... 89

Figure 44: 2D molecular graph of 3-nitro-o-xylene using {AI, nC, nH} annotation scheme...... 89

Figure 45: ROC plot comparing performance of Markov chain models with different annotation schemes and other non-parametric approaches reported in literature ...... 95

Figure 46: ROC plot comparing performance of Markov chain models with different annotation schemes (PC binned into 5 categories and ring atom (RA) annotation added) ...... 97

Figure 47: Graphical comparison of performance parameters for Hansen dataset obtained using Markov chain models with fragments of longer lengths ...... 102

Figure 48: Distribution of log-likelihood values for compounds predicted as false positives using a preliminary Markov chain model on the test set for CV 1 and using {nC, nH, PC, RA} annotation scheme ...... 105

Figure 49: Distribution of log-likelihood values for compounds predicted as true positives using a preliminary Markov chain model on the test set for CV 1 and using {nC, nH, PC, RA} annotation scheme ...... 105

Figure 50: ROC plot comparing performance of the best Markov chain model developed using {nC, nH, PC, RA} annotation scheme with a previous Markov chain model developed using {AI, nC, nH} annotation scheme and other non-parametric approaches reported in the literature ...... 108

Figure 51: Distribution of Tanimoto distances of 3-nitro-o-xylene from all compounds in the training set for CV 1...... 111

Figure 52: Graphical comparison of performance parameters for Hansen dataset obtained using kNN modeling method with 5 nearest neighbors and fragments of different lengths generated using a selected set of 3 different annotation schemes ...... 115

Figure 53: Graphical comparison of performance parameters for Hansen dataset obtained using weighted kNN modeling method with 5 nearest neighbors and fragments of different lengths generated using a selected set of 3 different annotation schemes ...... 120

Figure 54: Graphical comparison of performance parameters for Hansen dataset obtained using weighted kNN modeling method with 7 nearest neighbors and fragments of different lengths generated using a selected set of 3 different annotation schemes ...... 123

Figure 55: ROC plot comparing the performance of best kNN model and best Markov chain model developed using annotated linear fragments with other non-parametric approaches reported in the literature ...... 124


CHAPTER 1: INTRODUCTION

Chemical informatics (also commonly referred to as chemoinformatics) is a branch of molecular informatics that applies computational and informational techniques to solve chemical problems. It emerged as a separate discipline in the late 1960s and early 1970s in order to track the increasing amount of information available about chemical compounds and their properties.

Presently, there are more than 100 million compounds known in the CAS (Chemical Abstracts Service) registry1,2 and the amount of information is constantly increasing. Every year, more than 1 million new compounds are synthesized and more than 700,000 publications contribute in some way to chemical information3. Thus, chemoinformatics fills the need for a specialized discipline at the interface of chemistry and information technology that helps to leverage the vast amount of chemical information to inform and guide future work.

Chemoinformatics methods have traditionally been used for performing substructure search operations and for retrieving specific compounds from large databases. They have also been frequently used for the analysis of relational databases and for data mining. The field has grown ever since and has found applications in areas such as predictive modeling, drug discovery, toxicity and risk assessment, and structure-based formulations. Of particular interest in this research is the application of chemoinformatics methods for evaluating the potential toxicity of chemical compounds. This is important in the development of all new products of regulatory interest, ranging from drugs to food ingredients to materials in consumer products.


It is important to note the fundamental difference between informatics modeling and molecular modeling approaches. Molecular modeling is based on an atomistic-level description of molecular systems and uses first principles such as Newtonian mechanics and quantum mechanics to model the physical properties of molecules. Using such methods, one can accurately predict properties like the dipole moment and polarizability of any given molecule. However, these first principles generally fail to predict a molecule's distribution in the environment (partitioning) or its toxic properties. For example, given an ensemble of drug molecules, it is not possible to predict from first principles their activity against a particular target receptor. Similarly, given the sequence of nucleotides in a gene, it is not possible to predict the biological function of the protein coded for by the gene. In such situations, informatics methods are very useful.

The goal of chemoinformatics methods is to transform chemical data into information, and then information into knowledge, to help make better and faster decisions as efficiently as possible4,5. They involve the development of computational models (also commonly referred to as "in silico" models) that use the results obtained from in vitro and/or in vivo experiments to intelligently predict the properties of compounds for which experimental data is not available.

Figure 1 shows a schematic representation of this approach.

Figure 1: In silico approach


As seen in Figure 1, chemical data is used in the first step in conjunction with experimental results to develop a predictive model. This model is then used to predict properties of virtual compounds or to generate virtually safe compounds. In the third step, the resultant predictions are used to synthesize these compounds and new experiments are performed to validate the predictions.

Thus, in silico models help to develop causal relationships between chemical structures and their properties based on inductive methods of learning. They are especially helpful for identifying possible problems early in the discovery process and for prioritizing experimental work. For example, consider the drug discovery pipeline shown in Figure 2.

Figure 2: Drug discovery pipeline1

1 Image taken from the book Stem Cell Biology in Normal Life and Diseases: Chapter 6 Stem Cell Predictive Hemotoxicology by Holli Harper and Ivan Rich. DOI: 10.5772/54430.

As seen in Figure 2, the drug discovery process generally begins by collecting a repository of all possible drug molecules (chemical structures) that are expected to be active against the target disease. These are referred to as potential drug candidates. They can be hypothetical chemical structures and need not be existing or already-synthesized chemicals. These drug candidates are then passed through a series of coarse and fine screening steps that screen out the non-active and potentially harmful or toxic drug molecules. The remaining drug candidates are then optimized for lead identification and sent for pre-clinical animal testing.

One of the important steps during the screening stages is to check for the potential toxicity of drug candidates and their associated chemical impurities. Typical toxicity outcomes tested are genetic toxicity (mutagenicity), carcinogenicity, and skin sensitization. Current methods for assessing toxicity largely rely on experimental measurements. However, these methods have several disadvantages. First, it is not possible to conduct a thorough experimental study of every candidate compound, as it is too expensive and time-consuming. Second, a drug molecule has to be synthesized before it can be tested, which largely limits the number of candidate drug molecules. Also, the experimental results for one drug cannot be generalized to other drugs. Lastly, experimental methods are generally based on animal testing, a concern that has gained considerable attention in recent years. Thus, computational methods are favored to reduce and prioritize the experimental tests.

This approach of using chemoinformatics-based computational models for predicting adverse health effects of chemicals is called computational toxicology. The US Environmental Protection Agency (EPA) defines computational toxicology as "the application of mathematical and computer models to predict adverse effects and to better understand the single or multiple mechanisms through which a given chemical induces harm." Computational toxicology is a vibrant and rapidly developing discipline6.


There are several advantages of using computational methods for toxicity prediction. With the availability of large chemical datasets in recent years, it has been possible to develop computational models with a high degree of accuracy and robustness. These models are generally very fast and reliable in their predictive ability, and their predictions can easily be generalized over a large range of compounds. Thus, computational methods significantly reduce the time and money required for screening candidate compounds. They also help to reduce the need for animal testing of chemicals. It should be noted, however, that the goal is not to replace experimental methods, but to complement them so that the overall process becomes more efficient.

Computational methods can also help to identify potentially toxic drug candidates that may not be detected during the experimental screening and pre-clinical trial phases. Such drugs cause harm through mechanisms generally known as idiosyncratic drug reactions. In these cases, problems arise only after the drug has been approved for use and thousands of patients have been treated.

One famous example is the antidiabetic drug Troglitazone, used for the treatment of Type 2 diabetes. Troglitazone was the first antidiabetic agent of its class approved for clinical use, in 1997. However, it had to be withdrawn from the market in 2000 because it caused serious idiosyncratic hepatotoxicity (liver toxicity) in patients. It was replaced by two new drug molecules, rosiglitazone and pioglitazone, which have a similar mechanism of action but without the inherent side effect of liver toxicity. Figure 3 shows the chemical structures of the three drug molecules under consideration.

Several mechanisms of action have been proposed using in vitro and in vivo studies to explain the liver toxicity caused by troglitazone7–9. However, it remains a complex phenomenon, and many interacting factors contribute to the observed idiosyncratic toxicity. This strongly justifies the need for chemoinformatics-based computational models, as they can help to predict toxicity outcomes based solely on the chemical structural information of the drug candidates.


Figure 3: Antidiabetic drug molecules


1.1 QSAR Modeling

Chemoinformatics methods help to model complex chemical phenomena such as chemically-induced toxicity through a combined application of chemistry, information science, computational methods, and statistical data analysis. They help to leverage historical experimental data by building models that relate chemical structures and/or physicochemical properties to observed biological activities. These models are called Structure-Activity Relationship (SAR) models. SAR models that quantify these relationships using data mining and statistical tools are called Quantitative Structure-Activity Relationship (QSAR) models. Thus, a QSAR is defined as a statistical model that relates structural or property descriptors of a compound to its chemical or biological activity10.

In this approach, chemical structures are treated as independent predictor variables and descriptors that are suitable for mathematical modeling are extracted from them. Since it is generally not possible to directly correlate the structure of a compound to its activity, it is necessary to generate structure-based descriptors that can be used to calculate or estimate the toxicity outcome. This is shown graphically in Figure 4. Statistical models are then developed that correlate these structural descriptors to the toxicity of chemicals.

Most descriptors differ in the way they capture chemical structural information. A key assumption in this approach is that the descriptors appropriately capture features and properties of a compound responsible for its biological activity. This is where chemistry plays a central role by identifying the relevant structural features. A wise choice of the descriptor space based on chemically relevant features helps to make the model mechanistically interpretable and distinguishes it from the “black box” approach.
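As a concrete illustration of descriptor extraction, the sketch below derives a few crude count-based descriptors from a SMILES string with a deliberately naive tokenizer. This is a hypothetical simplification for illustration only; real work would use a chemoinformatics toolkit such as RDKit, and the descriptor names chosen here are not from any standard set.

```python
import re

# Naive SMILES tokenizer -- illustration only. It recognizes two-letter
# halogens (Cl, Br), common one-letter elements, and lowercase aromatic
# atoms; it ignores bonds, charges, stereochemistry, and bracket atoms.
ATOM_RE = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def simple_descriptors(smiles):
    """Return crude count-based descriptors for a SMILES string."""
    atoms = ATOM_RE.findall(smiles)
    counts = {}
    for a in atoms:
        counts[a.capitalize()] = counts.get(a.capitalize(), 0) + 1
    return {
        "atom_counts": counts,
        "heavy_atoms": len(atoms),
        "has_ring": any(ch.isdigit() for ch in smiles),  # ring-closure digits
        "has_halogen": any(a in ("F", "Cl", "Br", "I") for a in atoms),
    }

# 2-chlorophenol as a small worked example
print(simple_descriptors("Clc1ccccc1O"))
# → {'atom_counts': {'Cl': 1, 'C': 6, 'O': 1}, 'heavy_atoms': 8,
#    'has_ring': True, 'has_halogen': True}
```

Descriptors of this kind form the columns of the compound-descriptor data matrix that a statistical model is then fit against.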

7

Figure 4: Role of structural descriptors

A typical example of representing molecular structures in a finite descriptor space is the use of MACCS keys. The MACCS keys are a set of questions about chemical structures that have a binary outcome. For example: are there fewer than 3 occurrences of a given atom in the compound? Is there a ring of size 4? Is there a halogen atom (F, Cl, Br, I) present? Thus, every compound has a corresponding binary value (0 or 1) associated with each question. These binary values are arranged in a linear sequence to form the "bitstring" or "fingerprint" for that compound. Considering the 3 example questions described above, compounds can have different fingerprints such as 111, 110, 101, 011, etc. These fingerprints help to characterize each compound and can be used to perform similarity search calculations. The original version of MACCS keys had 166 bits, though newer versions with 320 bits are also available.
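The bitstring idea and the similarity calculation it enables can be sketched in a few lines. The three yes/no questions below are illustrative stand-ins, not actual MACCS key definitions, and the `props` dictionary of precomputed structural facts is a hypothetical schema:

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length binary fingerprints:
    (bits set in both) / (bits set in either)."""
    both = sum(a & b for a, b in zip(fp1, fp2))
    either = sum(a | b for a, b in zip(fp1, fp2))
    return both / either if either else 0.0

def fingerprint(props):
    """Build a toy 3-bit fingerprint from yes/no structural questions."""
    questions = [
        props["n_oxygen"] < 3,     # key 1: fewer than 3 oxygen atoms?
        props["has_ring_size_4"],  # key 2: ring of size 4 present?
        props["has_halogen"],      # key 3: any halogen (F, Cl, Br, I)?
    ]
    return tuple(int(q) for q in questions)

a = fingerprint({"n_oxygen": 1, "has_ring_size_4": False, "has_halogen": True})
b = fingerprint({"n_oxygen": 4, "has_ring_size_4": False, "has_halogen": True})
print(a, b, tanimoto(a, b))  # → (1, 0, 1) (0, 0, 1) 0.5
```

Real fingerprints work the same way, only with hundreds of keys, which is why two structurally similar compounds end up with a high Tanimoto similarity.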

The other popular example is the CACTVS substructure keys used by PubChem. This set of keys has 881 bits in total, divided into seven sections depending on the questions being addressed. For example, the first section has 115 bits that characterize hierarchic element counts and test for the presence or count of individual chemical elements. It should be noted that these substructure-based keys form a class of descriptors that typically uses 1D and/or 2D chemical information.

There are many other descriptors that have been proposed, developed, and reported in the literature. These will be described in detail in the Structural Descriptors section (2.2) of the Background chapter.

Summarizing the discussion so far: principles of chemoinformatics are used in the development of computational models for quick and reliable predictions of toxicity outcomes. This approach is especially important in the initial phase of the product development pipeline, where it is necessary to eliminate potentially toxic or hazardous candidates. To accomplish this, chemical structures are represented as a function of structural descriptors, and statistical methods are developed that relate these descriptors to the desired toxicity outcome.

The first step in this process is to compile a set of chemical structures for which the toxicity outcome is known from previous experimental studies; this is called the training set. This set is used to train the statistical model and optimize its parameters. The optimized model is then used to predict the toxicity outcomes of compounds for which experimental data are not available; this set of compounds is called the test set. The performance of any proposed set of descriptors and statistical model is judged by the number of test compounds that are correctly classified: the lower the prediction error rate on the test compounds, the better the descriptors and model. A graphical illustration of the QSAR-based in silico approach is shown in Figure 5.

Apart from being used to predict toxicity outcomes, the QSAR-based in silico approach can also be used to identify common descriptors observed in toxic compounds. Thus, for a given toxicity outcome, a set of descriptors can be generated from the training set that contains only those descriptors prominently observed in the toxic compounds. These descriptors are called structural alerts, and they can be used to identify and highlight potentially toxic descriptors in test compounds. This approach has tremendous potential and can lead to the discovery of new structural alerts, thus shedding light on the mechanisms by which chemicals induce toxicity.


Figure 5: Flow diagram of QSAR-based in silico approach

This concludes the introduction to the general discipline of chemoinformatics and its applications. The remaining chapters in this dissertation are organized as follows.

Chapter 2 provides a background of the fundamental concepts in chemoinformatics and describes the research problem with an outline of the proposed solution. Chapter 3 describes computational tools and toxicity datasets used in the course of this research. Chapter 4 describes the algorithm used for generating annotated linear chemical fragments and subsequent Markov chain models.

Chapter 5 presents results on the identification of relevant descriptors and structural alerts. Chapter 6 presents results on the Markov chain modeling approach, with detailed analysis of different annotation schemes and fragment lengths; classification results obtained using the kNN modeling method coupled with annotated linear fragments are also presented. Chapter 7 summarizes and highlights the significant accomplishments of this research and describes potential research that could stem from this work.


CHAPTER 2: BACKGROUND

As chemoinformatics is a relatively new field of research for chemical engineers, it is necessary to get acquainted with some background information and technical jargon specific to this discipline. This section briefly describes different methods of representing chemical structures in electronic form, types of structural descriptors, QSAR methods, and an overview of Markov chain based statistical models. It also describes the problem statement addressed in this research along with an outline of the proposed solution.

2.1 Representing Chemical Structures

In order to generate structural descriptors from chemical structures, it is very important to define a common nomenclature for representing all compounds in a machine-readable form. Consider the case of generating MACCS keys, for example. Here, it is a fundamental requirement that the machine be able to recognize and process chemical structures so that it can extract relevant information such as whether there are fewer than three of a given atom in the molecule, whether there is a ring of size 4, and whether a halogen atom is present.

This representation of chemical structures has various levels of refinement. It starts from linear notations and 2D chemical graphs and culminates in 3D structure representations and molecular surfaces3. This is shown graphically in Figure 6.


Figure 6: Representation of chemical structure2

This section describes the most widely used forms for representing chemical structures using m-ethyl phenol as an example. The structure of m-ethyl phenol is shown in Figure 7.

Figure 7: m-ethyl phenol molecule

2 Image taken from the book Chemoinformatics: A Textbook by Johann Gasteiger (page 17).

This molecule can be represented in different forms as follows:

1. Molecular Formula:

The molecular formula for m-ethyl phenol is C8H10O. This is the most general way of representing any molecule. It gives the count of the different atoms present in the molecule, and it is mainly used as a shorthand notation in chemical and biological databases.

2. SMILES:

SMILES is an acronym for Simplified Molecular Input Line Entry Specification. SMILES notation was created in 1986 by David Weininger for chemical data processing11 and it has found widespread use as a universal chemical nomenclature system3. It is a way of representing chemical structures using a line notation that describes the atoms and their connectivity. The SMILES notation for m-ethyl phenol is c1c(CC)cccc1O. Aliphatic atoms are denoted by capital letters and aromatic atoms by lowercase letters. Branching atoms are enclosed in parentheses. There can be many possible SMILES notations for a molecule depending on the position of the starting atom.
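The aromatic/aliphatic case convention can be illustrated with a minimal sketch. This is not a SMILES parser: it only tallies single-letter organic-subset symbols plus Cl and Br, and ignores digits, bonds, and parentheses, which is enough to show the lowercase-vs-uppercase rule at work.

```python
# Minimal sketch: count aromatic (lowercase) vs aliphatic (uppercase) atoms
# in a simple SMILES string. Handles only single-letter organic-subset
# symbols plus "Cl"/"Br"; it is NOT a general SMILES parser.

def atom_counts(smiles):
    aromatic, aliphatic = 0, 0
    i = 0
    while i < len(smiles):
        if smiles[i:i + 2] in ("Cl", "Br"):  # two-letter aliphatic atoms
            aliphatic += 1
            i += 2
            continue
        ch = smiles[i]
        if ch in "bcnops":                   # aromatic subset symbols
            aromatic += 1
        elif ch in "BCNOPSFI":               # aliphatic subset symbols
            aliphatic += 1
        i += 1                               # digits, bonds, ( ) are skipped
    return aromatic, aliphatic

# m-ethyl phenol: 6 aromatic ring carbons; ethyl carbons and hydroxyl
# oxygen are written in uppercase
print(atom_counts("c1c(CC)cccc1O"))  # -> (6, 3)
```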

3. XML:

XML stands for eXtensible Markup Language and provides a web-friendly way of representing chemical structures; the major XML-based format is CML (Chemical Markup Language). Closely related is InChI (International Chemical Identifier), a line notation designed for web use: one can search for a molecule on the internet by typing its InChI into a search engine. The InChI for m-ethyl phenol is InChI=1S/C8H10O/c1-2-7-4-3-5-8(9)6-7/h3-6,9H,2H2,1H3.


4. Structure-Data Files:

Structure-Data (SD) file formats store molecular information using connection tables. The SD file format was designed by MDL (Molecular Design Limited) and contains information about atoms, atomic coordinates, bonds, connections, and associated properties. A typical SD file for m-ethyl phenol is shown in Figure 8.

Figure 8: SD file for m-ethyl phenol molecule

An SD file can store information about many compounds. Each compound’s information is contained in a separate block called the Mol file, and the Mol files are separated from each other by the ‘$$$$’ delimiter. The first step in this method is numbering all the heavy atoms. This numbering can be arbitrary and can start at any atom, though canonical numbering systems can be helpful to number the atoms uniformly. Hydrogen atoms can be defined explicitly and included in the connection table, or they can be defined implicitly. The structure of the Mol file is quite rigid, and every blank space and line has some associated significance. The first block is called the Header block and contains the chemical identifier of the molecule. The next block is the Counts block and contains the number of atoms and bonds in the molecule. The Atom block contains the spatial coordinates of the atoms along with the identity of each atom; it may also contain additional data such as mass difference and charge. The Bond block is the most important block and contains information about the connectivity between all the atoms. For example, the first line in this block is read as ‘Atom #1 is connected to Atom #2 by a single bond’; the second line is read as ‘Atom #1 is connected to Atom #6 by a double bond’. This block can also include information about aromatic bonds and stereochemistry. The ‘M END’ line at the end of the Bond block indicates the end of the connection table. A Property block may follow and can contain information like IUPAC name, molecular weight, total charge, hydrogen bond acceptors, etc.; this block can be as long as desired. We will use the SD file format as the standard method of representing chemical information in this research.
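The layout described above (Header, Counts, Atom, and Bond blocks) can be sketched as a small reader. This is a simplification, not a full reader for the format: real Mol files use fixed-width columns, while the sketch below splits on whitespace and assumes a minimal, well-formed V2000 block; the formaldehyde example is hypothetical.

```python
# Sketch of reading a V2000 connection table: line 4 (the Counts line)
# gives the number of atoms and bonds; each Atom-block line carries
# x, y, z coordinates then the element symbol; each Bond-block line reads
# "atom1 atom2 bond_order ...". Whitespace splitting is a simplification
# -- real Mol files use fixed-width columns.

def parse_molblock(text):
    lines = text.splitlines()
    counts = lines[3].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atom_block = lines[4:4 + n_atoms]
    bond_block = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    atoms = [ln.split()[3] for ln in atom_block]                    # element symbol
    bonds = [tuple(int(x) for x in ln.split()[:3]) for ln in bond_block]
    return atoms, bonds

# Hypothetical minimal Mol block for formaldehyde (hydrogens implicit)
mol = """formaldehyde
  editor
comment
  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0
    1.2000    0.0000    0.0000 O   0  0
  1  2  2  0
M  END
"""
atoms, bonds = parse_molblock(mol)
print(atoms)   # -> ['C', 'O']
print(bonds)   # -> [(1, 2, 2)]  i.e. atom 1 double-bonded to atom 2
```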

2.2 Structural Descriptors

Structural descriptors help to abstract structural information into a mathematical quantity. They are a useful way of generating secondary information from the primary information contained in a molecule's structure. Structural descriptors can be broadly classified into two categories: steric and electronic. Steric descriptors are related to the connectivity and shape of a molecule, and they characterize structures according to their size, degree of branching, and overall shape. They capture information like distances between atoms, torsion angles, radius of gyration, and molecular diameter. These descriptors are typically characterized by a single value based on the topology of the molecule; several such indices, including the Branching index, Wiener index, and Zagreb index, are in common use.
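As a concrete example of a single-value topological descriptor, the Wiener index is the sum of shortest-path distances (counted in bonds) over all pairs of heavy atoms. A minimal BFS-based sketch on an adjacency-list molecular graph:

```python
from collections import deque

# Wiener index: sum of shortest-path distances (in bonds) over all
# unordered pairs of heavy atoms, computed here by breadth-first search
# from every atom of an adjacency-list graph.

def wiener_index(adj):
    total = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2  # each unordered pair was counted twice

# n-butane carbon skeleton: 0-1-2-3
butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(wiener_index(butane))  # -> 10

# isobutane: central atom 0 bonded to 1, 2, 3
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(wiener_index(isobutane))  # -> 9
```

The comparison shows how such an index encodes branching: the more branched isomer has the smaller Wiener index.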

Electronic descriptors are related to the electronic configuration of atoms and bonds. These descriptors characterize the atoms depending on their charge and presence of lone pairs, and they characterize the bonds by classifying them as sigma and pi bonds. They also classify atoms based on their electronegativity, dipole moment, polarizability, and partial charges.

Structural descriptors can also be classified based on their dimensions. 3D descriptors are those that capture the shape, size, and orientation of a molecule as it might actually exist in the system of interest. Examples of these include mean dipole moment, surface area/volume, and moment of inertia. These can be calculated from quantum mechanical calculations. 2D descriptors are fragments or substructures of molecules that are generated from a graphical representation of molecules. These are typically used for substructure search and similarity calculations. 1D descriptors include different features of a molecule such as its molecular weight, number of rotatable bonds, hydrogen bond donors, hydrogen bond acceptors, and rings.

These descriptors can be used individually for modeling, or two or more descriptors can complement each other to form a more robust predictive model. Different descriptors can also be used sequentially for step-by-step screening of query molecules. It should be noted that all these types of descriptors have pros and cons associated with them. For example, 3D descriptors have interpretable meanings and help to capture molecular structures as they might actually exist in a particular 3D configuration; however, they are difficult to generate and can be computationally expensive. 2D descriptors, on the other hand, are relatively easy to generate but tend to give rise to a large number of redundant fragments or fragments that cannot be well interpreted. It should, however, be noted that 2D fragments form the building blocks of most QSAR applications and can perform as well as 3D descriptors for modeling purposes12–14. These 2D fragments are of particular interest in this research.


2D fragments can be further classified into two categories: predefined and dynamic.

Predefined fragments are those that are derived based on knowledge from human experts. In this approach, descriptor libraries are generated using the experience of domain experts15, and guidelines relating structural alerts to different toxic endpoints are established as rules of thumb. The ChemoTyper16,17 application recently developed by Molecular Networks GmbH is a pertinent example: it contains a library of fragments for genotoxic carcinogen rules and allows searching for different structural fragments and highlighting them in large datasets of molecules. Other examples include the Leadscope Predictive Data Miner (LPDM) by Leadscope Inc.18 and a web-based platform developed by Sushko et al.19 for collecting toxicological structural alerts from the literature. Although these predefined fragments help to make sound judgments, they require considerable time, effort, and detailed analysis of relevant literature15.

Dynamic fragments, on the other hand, are data-dependent and are generated on-the-fly from a dataset of chemical structures. These fragments automate the knowledge discovery process and help in the discovery of new structural alerts, thus complementing the predefined fragments approach. In order to generate dynamic fragments, each chemical structure is considered as a topological graph with atoms as nodes and bonds as edges. The complexity of the chemical structure is then assessed and desired descriptors are extracted by atom-based, bond-based, or circular fragmentation methods. Thus, there are different types of dynamic fragments such as linear, radial, dendritic, etc.
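The linear variant of dynamic fragmentation can be sketched as a depth-first enumeration of simple paths in the molecular graph. The sketch below is a minimal illustration of the idea, not this dissertation's actual algorithm: it labels atoms only by element symbol (no annotations) and keeps one canonical copy of each path and its reverse.

```python
# Sketch of dynamic linear-fragment generation: enumerate all simple
# (non-repeating) paths up to a maximum length by depth-first search,
# keeping one canonical copy of each path/reverse pair.

def linear_fragments(atoms, adj, max_len):
    """atoms: index -> element symbol; adj: index -> neighbor indices."""
    found = set()

    def dfs(path):
        labels = tuple(atoms[i] for i in path)
        found.add(min(labels, labels[::-1]))  # path and reverse are the same fragment
        if len(path) == max_len:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:               # simple paths only
                dfs(path + [nxt])

    for start in adj:
        dfs([start])
    return sorted("-".join(f) for f in found)

# acetic acid heavy atoms: C(0)-C(1), with O(2) and O(3) both on atom 1
atoms = {0: "C", 1: "C", 2: "O", 3: "O"}
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
print(linear_fragments(atoms, adj, 3))
# -> ['C', 'C-C', 'C-C-O', 'C-O', 'O', 'O-C-O']
```

Replacing the plain element labels with annotated atom types (partial charge, connectivity, etc.) is what turns such paths into the annotated linear fragments developed later in this work.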

Some of these descriptors are proprietary while others are open source, with algorithms for generating them freely available online or reported in the literature. Sykora et al.20 describe in detail the development of the Chemical Descriptors Library (CDL), a generic and open source software library for chemical informatics. Klopman21 describes an algorithm for generating labeled linear subgraphs for use as descriptors. Rogers et al.13 have developed an algorithm for generating circular fragments, also called extended-connectivity fingerprints, a novel class of topological descriptors specifically developed for structure-activity relationship modeling. Faulon et al.22 describe another algorithm for generating a new signature descriptor based on extended valence sequence.

Several computer programs have also been developed to automate the calculation of molecular descriptors. Examples include PaDEL-Descriptor23, an open source software package for calculating molecular descriptors and fingerprints, Dragon24, and Mold225. Many proprietary software programs are also available, for example, MC4PC by MultiCASE, Inc.26,27, CORINA Symphony28 by Molecular Networks GmbH, and MolConnZ29.

In their paper14, Duan et al. analyze and compare the predictive performance of eight different types of 2D fragments using the chemoinformatics package Canvas30 on a well-validated dataset with five targets. The fragments generated were Linear, Dendritic, Radial, MACCS, MOLPRINT2D, Pairwise, Triplet, and Torsion. They conclude that most fragments have similar retrieval rates on average and there is no single best method for all target classes; however, for a given target and query there were significant differences between the methods. The MACCS fragments had the lowest retrieval rate: they are not sufficiently discriminating because they code only the presence or absence of substructures, without any connectivity information between the features. The pairwise and triplet fragments perform better when the active molecules being sought are similar in size to the query molecules; these fragments encode information from all pairs or triplets of distances and are therefore sensitive to molecular size. The MOLPRINT2D and radial fragments gave the best performance. The overall success of radial fragments suggests that discriminating power is increased not just by considering a fragment, but also the manner in which that fragment connects to its surroundings.


2.3 Application of QSAR Methods

Extensive studies have been carried out to correlate the biological properties of many important pharmaceutical drugs to their chemical structure. QSAR models have been used to investigate adverse health effects of drugs used for thyroid31, tumor32, and tuberculosis33 treatments. QSAR models have also been used to predict toxic endpoints of chemicals such as cytotoxicity34, hepatotoxicity35, cardiac toxicity36, and carcinogenicity37,38. Many statistical methods and software programs have been developed39,40 and their performances have been compared in the literature41,42.

One example of an open source QSAR tool is VEGA43 (Virtual models for property Evaluation of chemicals within a Global Architecture). VEGA was developed as part of the European legislation for chemical substances called REACH (Registration, Evaluation, Authorization, and Restriction of Chemicals). It provides models for the following endpoints and properties: BCF (bio-concentration factor), skin sensitization, mutagenicity, carcinogenicity, developmental toxicity, LC50 aquatic toxicity, ready biodegradability, and LogP. It has a very good graphical user interface and comes with precise documentation and guidelines. The following two case studies highlight the procedure generally used for developing QSAR models.

Case study 1: Carcinogenicity

In his paper, Gilles Klopman21 describes a method for identifying significant biophores that help to explain the carcinogenicity of N-nitrosamines in rats. The effect of N-nitrosamines in rats has been extensively studied in the literature and their mechanism of action is well documented. Klopman used his technique to reproduce these results and to identify known relevant descriptors. He developed a model using a training set consisting of 39 compounds, with 27 categorized as active carcinogens and 12 as inactive. He then used his Computer Automated Structure Evaluation (CASE) program to generate linear fragments (called subunits) containing between 3 and 12 interconnected heavy atoms. Each atom had two labels assigned to it: one to indicate the multiplicity of bonds and the other to indicate the presence of a side chain. A statistical analysis of the fragment distribution was then made. A binomial distribution was assumed, and a fragment was classified as active if its probability of occurrence in active compounds as compared to inactive compounds was more than 0.95. This led to the identification of three significant fragments. The likelihood of a compound being active was then computed based on the weights of all of its fragments. This approach gave correct predictions for all 27 active compounds and for 10 out of the 12 inactive compounds. The predictive power of the program was tested by applying it to 4 compounds that were withheld from the training set, and the program predicted correct results for all 4 of them.
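The binomial significance criterion in this case study can be sketched as follows. This is a simplified reconstruction, not Klopman's actual CASE code: as an assumption, the null hypothesis takes each occurrence of a fragment to be equally likely in an active or an inactive compound (p0 = 0.5), ignoring the 27/12 class imbalance and CASE's fragment weighting.

```python
from math import comb

# Sketch of a binomial fragment-significance test. A fragment seen in
# k_active of n_total compounds is flagged as "active" when a split this
# skewed toward actives would arise by chance less than 5% of the time
# under the (assumed) null p0 = 0.5.

def binomial_tail(k, n, p0=0.5):
    """P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0 ** i * (1 - p0) ** (n - i)
               for i in range(k, n + 1))

def is_active_fragment(k_active, n_total, confidence=0.95):
    return binomial_tail(k_active, n_total) < 1 - confidence

print(is_active_fragment(8, 8))  # fragment seen only in actives -> True
print(is_active_fragment(5, 8))  # 5 of 8 occurrences in actives -> False
```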

Case study 2: Developmental toxicity

In their paper31, Cunningham et al. describe the development of a QSAR model for the prediction of developmental toxicity of anti-thyroid drugs. The model was developed from data for 323 compounds, with 130 categorized as developmental toxicants and 193 as non-toxicants. The authors generated 2D fragments using the Tripos Sybyl HQSAR module, which lets the user specify attributes for fragment determination such as atom counts, atomic connections, bond types, hydrogen atoms, chirality, and hydrogen bond donor and acceptor groups. The authors chose to specify fragments between three and seven atoms in size and considered atoms, atomic connections, and bond types. Upon completion of this in silico fragmentation, a compound-fragment data matrix was generated with compounds in rows and fragments in columns. This data matrix was subsequently analyzed with the cat-SAR (Categorical Structure-Activity Relationship) expert system in order to identify structural features associated with the toxic and non-toxic classes of compounds. The model was validated using self-fit, leave-one-out (LOO), and multiple leave-many-out (LMO) cross-validation methods. Average concordance, sensitivity, and specificity values were computed and documented. Using this model, 13 fragments were identified for the three anti-thyroid medications used to treat hyperthyroidism; 9 were associated with developmental toxicants and 4 with non-toxicants. Based on structural analysis, it was concluded that all three drugs available for treating hyperthyroidism were capable of producing developmental toxicity, emphasizing the need to develop new molecules with structural attributes that suppress thyroid function while minimizing the risk of developmental toxicity.
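The three validation statistics mentioned in this case study are simple functions of the classification tallies. The counts below are hypothetical, chosen only to match the 130-toxicant/193-non-toxicant class sizes; they are not the paper's actual results.

```python
# Sketch of the three validation statistics, computed from counts of true
# positives (tp), true negatives (tn), false positives (fp), and false
# negatives (fn).

def validation_stats(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)                   # toxicants correctly flagged
    specificity = tn / (tn + fp)                   # non-toxicants correctly cleared
    concordance = (tp + tn) / (tp + tn + fp + fn)  # overall agreement
    return concordance, sensitivity, specificity

# hypothetical tallies for 130 toxicants and 193 non-toxicants
c, se, sp = validation_stats(tp=110, tn=160, fp=33, fn=20)
print(round(c, 2), round(se, 2), round(sp, 2))  # -> 0.84 0.85 0.83
```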

In order to effectively use the results from QSAR models, we need improved statistical methods and data mining techniques. In recent times, the fundamental paradigm of statistical analysis has shifted from “system identification” to “predictive modeling”. System identification aims to reconstruct the true underlying probability distributions, while predictive modeling uses simple probability distributions, though not necessarily correct ones, to build models with the highest predictive performance44.

Initially, QSAR models focused on the use of multiple linear regression, though in recent times a wide variety of statistical algorithms have been developed. These include support vector machines (SVM), decision trees, random forests, k-nearest neighbors (kNN), and artificial neural networks (ANN). A brief description of these algorithms is given by Xu et al.45 in their paper to compare the performance of different machine learning methods. Many of these methods have been implemented in open source machine learning packages such as AZOrange46 and KNIME. This is of great use to researchers lacking extensive machine learning knowledge as it helps them to create flexible applications using a graphical programming environment.


Specialized software packages have also been developed that combine the generation of structural descriptors and/or a pre-defined library of structural alerts with statistical algorithms for machine learning. Examples include DEREK47 (Deductive Estimation of Risk from Existing Knowledge) by Lhasa Ltd., CASE Ultra48 by MultiCASE, Inc., Leadscope Model Applier by Leadscope, Inc.18, and Toxtree49 by Ideaconsult Ltd. (developed under terms of contract with the European Commission Joint Research Center).

In June 2014, the International Conference on Harmonization of technical requirements for registration of pharmaceuticals for human use (ICH) released the M7 guideline50 for the assessment and control of mutagenic impurities in pharmaceuticals in order to limit potential carcinogenic risk. The ICH M7 guideline states that manufacturers of pharmaceutical products must submit a report that includes a detailed analysis of the mutagenic potential of the active pharmaceutical ingredient (API) as well as the associated impurities in order to obtain regulatory clearance. Under this guideline, manufacturers are allowed to submit QSAR model results in lieu of in vitro testing; however, QSAR submissions must include both rule-based expert methods (predefined fragments) and structure-based statistical methods (dynamic fragments).

QSAR models should also be OECD (Organisation for Economic Co-operation and Development) compliant51. According to OECD recommendations, QSAR models should have a defined endpoint, an unambiguous algorithm, a defined domain of applicability, appropriate measures of predictive performance, and, where possible, a mechanistic interpretation. This has given a tremendous boost to efforts for building structure-based QSAR models with high reliability and predictive performance. Software packages such as those developed by Lhasa, MultiCASE, and Leadscope have integrated both the rule-based and structure-based approaches in their products. These packages are widely used and accepted as a standard by both pharmaceutical companies and regulatory agencies.


2.4 Research Problem

As discussed above, there is a great need for the development of novel structure-based QSAR models. Several types of structural descriptors and statistical models have been proposed, developed, and reported in the literature. The 3D descriptors seem most promising, as they can capture the interactions of chemicals in biological systems by identifying regions in a molecule that are likely to be involved in binding to a receptor site. The drawback, however, is that the calculations for 3D descriptors are computationally expensive, and not all 3D descriptors generated are good from the modeling standpoint.

On the other hand, 2D descriptors have been found to have predictive performance as good as or even better than 3D descriptors in certain cases. Thus, many methods for generating 2D descriptors have been developed, such as the linear, radial, and dendritic fragments discussed previously in section 2.2. However, since chemically-induced toxicity is a very complex phenomenon, 2D descriptors cannot be expected to do complete justice to the problem at hand: they fail to take into account factors such as the route by which a chemical enters the human body, the rate of its absorption into the bloodstream, and chemical transformations and reactions within the body. In any case, there is a great deal of information contained in the 2D structure of a chemical, and we need to explore this information to the fullest extent possible by developing novel descriptors. This will help to wisely narrow down the list of potential candidates for further rigorous screening steps.

Most of the currently available methods for generating 2D descriptors suffer from several drawbacks, such as the extremely large number of descriptors generated and the difficulty in interpreting them. For example, consider the circular fragments developed by Rogers et al. These fragments have been found to be very effective in SAR modeling, with consistently better prediction accuracies for a wide variety of datasets and toxicity endpoints. However, this method yields more than 200,000 circular fragments of radius 4 from a library of 50,000 compounds extracted from the Derwent World Drug Index13. This is a much larger set of fragments than is common for other methods, and they cannot be decoded back into chemical structures to be interpreted in a meaningful way. Thus, although the predictive performance of circular fragments is very high, the results cannot be translated back into a mechanistic interpretation. There is a need to develop better 2D descriptors that can reduce the high-dimensional descriptor space, give more meaningful results, and yield better prediction accuracies at the same time.

2.5 Proposed Solution

To address this need, we propose a novel method for dynamic generation of linear descriptors. These descriptors are linear subgraphs of chemical structures that can be annotated with atom-based features such as atom identity, connectivity, and partial charge. The main advantage of this approach is that although these descriptors are 2D linear fragments, they can be annotated with features that capture “superior” 3D information such as partial charge and stereochemical information. Thus, they help to incorporate steric as well as electronic information about the atoms within the same descriptor. These descriptors can be flexibly defined depending on the information desired to be extracted from them.

We are specifically interested in developing linear descriptors so that we can use them to develop Markov chain based statistical models for classification purposes. The Markov chain model is a very useful statistical tool for modeling linear sequences of events, and it has been used with good success in bioinformatics methods for the analysis of nucleic acids and proteins. We have developed a similar model that is specifically tailored for analyzing chemical structures.
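The core idea can be sketched in a few lines. The sketch below is a toy illustration of the general technique, not this dissertation's model: first-order transition probabilities over a small, assumed atom alphabet are estimated separately from "toxic" and "non-toxic" linear fragment sets (with add-one smoothing), and a query sequence is scored by its log-likelihood ratio; the training sequences are invented.

```python
from collections import defaultdict
from math import log

# Sketch of a first-order Markov chain classifier over linear atom
# sequences, in the spirit of sequence models used in bioinformatics.

ALPHABET = ["C", "N", "O"]  # toy atom alphabet (an assumption)

def train(sequences):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    probs = {}
    for a in ALPHABET:
        total = sum(counts[a].values()) + len(ALPHABET)  # add-one smoothing
        probs[a] = {b: (counts[a][b] + 1) / total for b in ALPHABET}
    return probs

def log_likelihood(seq, probs):
    return sum(log(probs[a][b]) for a, b in zip(seq, seq[1:]))

# invented training fragments for illustration
toxic = [["C", "N", "O"], ["N", "O", "C"], ["C", "N", "N"]]
nontoxic = [["C", "C", "C"], ["C", "C", "O"], ["O", "C", "C"]]
p_tox, p_non = train(toxic), train(nontoxic)

query = ["C", "N", "O"]
score = log_likelihood(query, p_tox) - log_likelihood(query, p_non)
print("toxic-like" if score > 0 else "non-toxic-like")  # -> toxic-like
```

A positive log-likelihood ratio means the query fragment is better explained by the chain trained on the toxic class.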


It should be noted that other methods of generating descriptors might appear more useful because they can capture branched structural information as well, thus accounting for more structural detail. However, as mentioned in the research problem, there are several disadvantages associated with them. One is that they give a huge number of descriptors, of which only a few might be useful from the modeling standpoint. Also, nonlinear fragments cannot be used to apply sequence analysis techniques or to build Markov chain models. The linear fragments developed in this research, on the other hand, have been shown to capture branched structural information as well, due to the provision of annotating features. These novel descriptors combine most of the advantages of other approaches while simultaneously reducing the dimension of the descriptor space and providing a mechanistic interpretation of chemical fragments. Thus, I focus my research on the development of annotated linear chemical fragments for use in the identification of structural alerts and in the development of novel QSAR methods based on Markov chain models.

Specific tasks undertaken in this project were as follows.

1. Development of search algorithms:

The initial phase of the research involved the development of search algorithms for generating linear subgraphs from a database of chemical structures. Principles of graph theory and a depth-first search algorithm were used.

2. Specification of annotation features:

Chemical annotations were incorporated in the algorithm to provide flexibility in defining the criteria used for generating unique fragments. Only atom types were annotated; bond type annotations are implicit in the atom types.


3. Generation of compound-fragment data matrix:

After identifying the unique fragments, a compound-fragment data matrix was generated with compounds in rows and fragments in columns. Each entry in this table is ‘1’ if the fragment is present in the compound and ‘0’ otherwise.

4. Identification of relevant descriptors:

Relevant descriptors and structural alerts were identified using datasets on skin sensitization and Ames mutagenicity. Several statistical tests were employed to identify significant fragments that helped to distinguish the toxic class of compounds from the non-toxic class.

5. Development of statistical models:

Many statistical models can be developed for classifying query compounds of interest. We focus our efforts on developing Markov chain models and compare them with results obtained using kNN (k-nearest neighbors) models. The performance of the different models was compared using cross-validation methods.
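Tasks 3 and 5 can be tied together in a short sketch. The fragment sets, vocabulary, and labels below are invented for illustration, and the kNN distance is taken to be 1 minus the Tanimoto similarity of the binary matrix rows; this is a generic baseline, not the dissertation's tuned models.

```python
from collections import Counter

# Sketch: build a binary compound-fragment matrix from per-compound
# fragment sets (task 3), then classify a query compound by majority vote
# among its k nearest neighbors under Tanimoto similarity (task 5).

def matrix(frag_sets, vocabulary):
    return [[1 if f in s else 0 for f in vocabulary] for s in frag_sets]

def tanimoto(row1, row2):
    both = sum(a & b for a, b in zip(row1, row2))
    either = sum(a | b for a, b in zip(row1, row2))
    return both / either if either else 0.0

def knn_predict(query_row, rows, labels, k=3):
    ranked = sorted(zip(rows, labels),
                    key=lambda rl: -tanimoto(query_row, rl[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# invented fragment vocabulary and training data
vocab = ["C-C", "C-O", "C-N", "N-O"]
train_frags = [{"C-C", "C-O"}, {"C-C"}, {"C-N", "N-O"}, {"C-N"}]
labels = ["nontoxic", "nontoxic", "toxic", "toxic"]
rows = matrix(train_frags, vocab)

query = matrix([{"C-N", "N-O", "C-C"}], vocab)[0]
print(knn_predict(query, rows, labels, k=3))  # -> toxic
```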


CHAPTER 3: DATASETS AND COMPUTATIONAL TOOLS

3.1 Training Datasets

This section briefly describes the training sets that were used in this research for the identification of relevant structural descriptors and for modeling purposes. Two particular endpoints were considered, namely skin sensitization and mutagenicity.

Skin sensitization is an important toxicity endpoint and it has been very widely studied52,53. Several QSAR models have been developed to predict this endpoint43,54. The skin sensitization potency of chemicals is characterized using LLNA (local lymph node assay) EC3 values, the dose required to give a 3-fold response between the treatment and control groups. The dataset used in our study is compiled from several sources55,56 and contains 467 compounds classified into five groups based on their relative measure of sensitization potency. The five LLNA categories are: non-sensitizer, weak sensitizer, moderate sensitizer, strong sensitizer, and extreme sensitizer. The last two categories are grouped together, and the resulting four categories, denoted NON, WEAK, MOD, and STR, have 138, 106, 128, and 95 compounds respectively. This is shown in Figure 9. The distribution of molecular weights for these compounds is shown in Figure 10, giving an indication of the structural diversity of compounds in this dataset.


Figure 9: Number of compounds in different categories in skin sensitization dataset

Figure 10: Distribution of molecular weights for skin sensitization dataset


For mutagenicity, the Ames test data was used. The Ames test has been widely used for initial screening of new chemicals and drugs57. It is an in vitro test performed on strains of

Salmonella typhimurium bacteria and it classifies chemicals into two categories: positive and negative. Hansen et al.58 have collected a benchmark dataset comprising 6512 chemicals (3503 Ames positive and 3009 Ames negative) for in silico prediction of Ames mutagenicity. They have also specified a 5-fold cross-validation scheme along with the chemical structures to be used in each fold of the cross-validation analysis. This allows for direct comparison of different modeling methods, making it a good benchmark for optimizing Ames mutagenicity prediction.

Figure 11 shows the counts of Ames positive and negative compounds in this dataset and

Figure 12 shows their molecular weight distribution. Table 1 shows the compound counts to be used in the training and test sets for the 5-fold cross-validation scheme. This is shown graphically in Figure 13 and Figure 14. The actual compound numbers to be used in the cross-validation analysis can be found in the Supporting Information section of the paper by Hansen et al58.

Figure 11: Number of compounds in different categories in the benchmark dataset for Ames mutagenicity

Figure 12: Distribution of molecular weights for compounds in the benchmark dataset for Ames mutagenicity

             Training set                  Test set
       Ames POS  Ames NEG  Total    Ames POS  Ames NEG  Total
CV 1     2919      2609     5528      584       400      984
CV 2     2933      2594     5527      570       415      985
CV 3     2932      2596     5528      571       413      984
CV 4     2932      2593     5525      571       416      987
CV 5     2930      2595     5525      573       414      987

Table 1: Compound counts for the 5-fold cross-validation (CV) scheme


Figure 13: Compound counts for training sets in the 5-fold CV scheme

Figure 14: Compound counts for test sets in the 5-fold CV scheme


3.2 Computational Tools

This section describes the different programming languages and software packages that were used during the course of this research. For converting chemical structures into SD file format, a software package called MarvinSketch was used. MarvinSketch is an advanced chemical editor from ChemAxon for drawing chemical structures, queries, and reactions59. The most important feature of MarvinSketch used in this research was its ability to convert a SMILES or InChI notation of any molecule into a Mol file. In cases where the SMILES or InChI notation for a molecule was not available, the GUI of MarvinSketch was used for drawing the chemical structure.

Different Mol files were then concatenated into a single SD file with *.sdf extension. Figure 15 shows a typical example of how chemical structures are displayed in MarvinSketch.

Figure 15: Typical example using MarvinSketch software

In order to view the chemical structures in an SD file, a software package called MarvinView was used. MarvinView is an advanced chemical viewer from ChemAxon that supports viewing of a large number of molecules in a spreadsheet or matrix layout60. Additional fields such as a molecule’s name, CAS number, and SMILES string can also be displayed. Figure 16 shows a typical example of how chemical structures are displayed in MarvinView.

Figure 16: Typical example using MarvinView software


The Python61 programming language was used as the main scripting language in this research. Python is an open source programming language whose design philosophy emphasizes code readability62. Its focus on readability, coherence, and software quality in general sets it apart from other tools in the scripting world63. Chemical information handling in Python was simplified by using the RDKit64 package. RDKit is a chemical informatics and machine learning package that handles operations such as numbering of atoms, identifying neighboring atoms and ring atoms, and calculation of partial charges.

Other programming languages such as Visual Basic, MATLAB, and R were also used frequently. Visual Basic is a third-generation event-driven programming language65 that can be used in conjunction with Microsoft Excel to yield user-friendly outputs and displays. MATLAB is a numerical computing environment and fourth-generation programming language developed by MathWorks. It allows matrix manipulations, plotting of functions and data, and implementation of algorithms66. R67 is an open source programming language and software environment for statistical computing and graphics.


CHAPTER 4: METHODS

4.1 Generation of Linear Fragments

As stated in the Proposed Solution section (2.5) of the Background chapter, the first task in this research was to develop search algorithms for the dynamic generation of linear fragments from any given set of compounds. To generate linear fragments from chemical structures, we used principles of graph theory and the depth-first search algorithm. Graph theory is the study of mathematical structures that are used to model pair-wise relations between objects from a certain collection. A graph in this context refers to a collection of nodes that are connected by edges. In our case, we consider each heavy atom in a compound as a node and each bond as an edge. Starting at one particular node, we traverse the molecular graph using a depth-first search and trace all possible longest paths from that node. This is shown in Figure 17, where the longest linear paths are extracted from atom #1 of the m-ethyl phenol molecule.

Figure 17: Linear fragments using graph theory and depth-first search algorithm

The depth-first search algorithm is designed such that it also captures information about paths of shorter lengths. The algorithm is described below using m-ethyl phenol as an example.

1. For each heavy atom in the compound, do the following:

a. Identify its connections. For atom #1, these are atoms 2, 6, and 9.

b. Create new arrays and store the connections. For atom #1, three arrays will be

created and they will be [1, 2], [1, 6], and [1, 9].

c. For each connection identified, do the following:

i. Identify the next connected heavy atom.

ii. If the connected atom has no branching, append it to the array.

iii. If the connected atom has branching, create new arrays and append

corresponding connections.

iv. Stop if a terminal atom is encountered or the end of a ring is detected.

d. Repeat the above steps for each old and new array created.

2. Repeat the above steps for all heavy atoms in the compound.

Table 2 shows how the arrays proceed with each iteration for atom #1. As seen in Table 2, the algorithm terminates after 11 iterations and yields 5 arrays containing the longest paths. These are [1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 7, 8], [1, 6, 5, 4, 3, 2], [1, 6, 5, 7, 8], and [1, 9]. Fragments of smaller lengths like [1, 2], [1, 2, 3], etc. are then extracted from this information.

Iteration Arrays

1 [1, 2] [1, 6] [1, 9]

2 [1, 2, 3] [1, 6] [1, 9]

3 [1, 2, 3, 4] [1, 6] [1, 9]

4 [1, 2, 3, 4, 5] [1, 6] [1, 9]

5 [1, 2, 3, 4, 5, 6] [1, 2, 3, 4, 5, 7] [1, 6] [1, 9]

6 [1, 2, 3, 4, 5, 6] [1, 2, 3, 4, 5, 7, 8] [1, 6] [1, 9]

7 [1, 2, 3, 4, 5, 6] [1, 2, 3, 4, 5, 7, 8] [1, 6, 5] [1, 9]

8 [1, 2, 3, 4, 5, 6] [1, 2, 3, 4, 5, 7, 8] [1, 6, 5, 4] [1, 6, 5, 7] [1, 9]

9 [1, 2, 3, 4, 5, 6] [1, 2, 3, 4, 5, 7, 8] [1, 6, 5, 4, 3] [1, 6, 5, 7] [1, 9]

10 [1, 2, 3, 4, 5, 6] [1, 2, 3, 4, 5, 7, 8] [1, 6, 5, 4, 3, 2] [1, 6, 5, 7] [1, 9]

11 [1, 2, 3, 4, 5, 6] [1, 2, 3, 4, 5, 7, 8] [1, 6, 5, 4, 3, 2] [1, 6, 5, 7, 8] [1, 9]

Table 2: Iteration steps of depth-first search algorithm for atom #1 of m-ethyl phenol

After generating linear fragments of different lengths from all heavy atoms, the next step is to identify and remove the redundant ones. For example, one of the paths generated from atom #1 will be [1, 2, 3]. Similarly, one of the paths generated from atom #3 will be [3, 2, 1]. Since chemical fragments have no inherent directionality, these two paths are redundant. One of them needs to be removed in order to reduce the dimension of the descriptor space and avoid confounding due to perfectly correlated descriptors. This is done by comparing the fragments in both forward and reverse directions; if two fragments are found to be identical, the one that begins with the lowest number is retained. So, in the above case, the fragment [1, 2, 3] will be retained and the fragment [3, 2, 1] will be discarded.
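The traversal and redundancy-removal steps described above can be sketched in Python as follows. This is a minimal illustration on the m-ethyl phenol graph, not the exact script from Appendix A; the atom numbering follows Figure 17.

```python
def all_linear_paths(adj):
    """Enumerate all simple linear paths (2 or more atoms) in a
    molecular graph given as an adjacency list {atom: [neighbors]},
    keeping one canonical orientation per path (fragments have no
    inherent direction)."""
    paths = set()

    def dfs(path):
        if len(path) >= 2:
            # a fragment read forward or backward is the same fragment;
            # keep the orientation that begins with the lower atom number
            paths.add(min(tuple(path), tuple(reversed(path))))
        for nbr in adj[path[-1]]:
            if nbr not in path:      # stop at ring closure / visited atoms
                dfs(path + [nbr])

    for atom in adj:                 # repeat from every heavy atom
        dfs([atom])
    return paths

# m-ethyl phenol, numbered as in Figure 17: benzene ring 1-6,
# ethyl branch 7-8 on atom 5, hydroxyl oxygen 9 on atom 1
adj = {1: [2, 6, 9], 2: [1, 3], 3: [2, 4], 4: [3, 5],
       5: [4, 6, 7], 6: [5, 1], 7: [5, 8], 8: [7], 9: [1]}
frags = all_linear_paths(adj)
```

Because every recursive call records its current path, the prefixes of the longest paths (e.g., [1, 2] and [1, 2, 3]) are captured automatically, and the `min(path, reversed path)` step implements the lowest-starting-number rule used for removing redundant fragments.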

4.2 Chemical Annotations

Atom-based selection features called as chemical annotations are then introduced to remove chemically redundant fragments. In the m-ethyl phenol molecule, the fragments [2, 1, 6] and [6, 5, 4] are identified as unique because they are numbered differently. However, referring back to the chemical structure, we can see that these two fragments are identical. They are fragments of length 3 that are part of an aromatic ring with one single bond and one double bond

(or aromatic bonds). Chemical annotations help to detect this redundancy by specifying the resolution of information desired to be extracted from chemical structures. The algorithm was initially developed with four annotation options as follows:

1. Atom identity (AI): This is the chemical identity of the atom. This annotation can be a

single character as in C and O or a two-character variable as in Cl and Br.

2. Number of connections (nC): This is the number of heavy atoms that are connected to the

atom. This annotation is a single digit variable and can take values from 0 to 9.

3. Number of hydrogens (nH): This is the number of hydrogen atoms attached to the atom.

This includes both implicitly and explicitly defined hydrogen atoms. This annotation is a

single digit variable and can take values from 0 to 9.

4. Partial charge (PC): This is the partial charge on an atom calculated using the

GasteigerCharge feature in RDKit64. Since partial charge is a continuous variable, we bin

it into three categories for annotation purposes. If the partial charge is less than -0.05, it is

classified as negative and denoted by ‘-’. If the partial charge is between -0.05 and 0.05, it

is classified as neutral and denoted by ‘0’. If the partial charge is greater than 0.05, it is

classified as positive and denoted by ‘+’.

An annotation scheme is then defined by selecting any possible combination of the annotation options. Based on the annotation scheme selected, each heavy atom in the chemical structure is identified by a unique atom symbol. For example, if ‘AI’ and ‘nC’ features are selected, then the annotation scheme is defined as {AI, nC} and the atom symbols are C2, O1, etc. Table 3 shows these atom symbols for the 9 heavy atoms in m-ethyl phenol molecule using 5 different annotation schemes.

Atom {AI} {AI, nC} {AI, nC, nH} {nC, nH, PC} {AI, nC, nH, PC}

1 C C3 C30 30+ C30+

2 C C2 C21 210 C210

3 C C2 C21 21- C21-

4 C C2 C21 21- C21-

5 C C3 C30 300 C300

6 C C2 C21 210 C210

7 C C2 C22 220 C220

8 C C1 C13 13- C13-

9 O O1 O11 11- O11-

Table 3: Atom symbols using different annotation schemes for m-ethyl phenol
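The symbol-building logic can be sketched as follows. This is a simplified stand-in for the RDKit-based implementation: the per-atom properties are supplied directly rather than computed, and the partial-charge value used for atom #1 is illustrative, not an actual Gasteiger charge.

```python
def bin_charge(q):
    """Bin a partial charge into '-', '0', or '+' using the
    -0.05 / +0.05 cutoffs described above."""
    if q < -0.05:
        return '-'
    if q > 0.05:
        return '+'
    return '0'

def atom_symbol(props, scheme):
    """Build an annotated atom symbol from per-atom properties.
    props: dict with keys 'AI', 'nC', 'nH', 'PC';
    scheme: ordered subset of those keys, e.g. ('AI', 'nC', 'nH')."""
    parts = []
    for opt in scheme:
        if opt == 'PC':
            parts.append(bin_charge(props['PC']))
        else:
            parts.append(str(props[opt]))
    return ''.join(parts)

# atom #1 of m-ethyl phenol: aromatic carbon bearing the OH group
# (the 0.12 charge is an illustrative value, not a computed one)
a1 = {'AI': 'C', 'nC': 3, 'nH': 0, 'PC': 0.12}
atom_symbol(a1, ('AI', 'nC', 'nH', 'PC'))   # 'C30+', as in Table 3
```

Running the same atom through smaller schemes such as ('AI',) or ('AI', 'nC') reproduces the remaining columns of Table 3.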

As seen in Table 3, only atom types are annotated; the bond types (e.g., single bond, double bond) are implicit in the atom annotations. It should be noted that once the annotation scheme is specified, the same annotations are used for all compounds in the training set. In other words, it is not possible to specify different annotation schemes for different compounds. However, the algorithm can be executed multiple times with different annotation schemes and the results can be used to compare their performance. Figure 18 shows a partial output of the fragments generated when the algorithm was run twice for m-ethyl phenol; the first time using the {AI} and the second time using the

{AI, nC, nH} annotation schemes.

Figure 18: Unique fragments generated using different annotation schemes (partial output)

As seen in Figure 18, increasing the number of annotations in the annotation scheme helps to better distinguish between different fragments. For example, when the {AI} annotation scheme was used, the carbon-carbon connection was identified only as [‘C’, ‘C’], but using the {AI, nC, nH} annotation scheme, the same connection was identified by 4 different fragments: [‘C30’, ‘C21’], [‘C21’, ‘C21’], [‘C30’, ‘C22’], and [‘C22’, ‘C13’]. Thus, selecting more annotation options provides better resolution and helps to capture structural information in greater detail. However, the drawback is that using more annotations is computationally expensive and usually yields many more fragments than desired. These additional fragments may not necessarily capture more details about the chemical structure.

4.3 Compound-Fragment Data Matrix

After identifying the unique fragments from all compounds in a training set, a compound-fragment data matrix is generated with compounds in rows and fragments in columns. Each entry in this table is ‘1’ if the fragment is present in the compound and ‘0’ otherwise. Thus, each compound is uniquely represented as a binary vector over all the fragments, also called the fingerprint of the compound. Figure 19 shows a typical compound-fragment data matrix for a training set containing m compounds and n fragments.

Figure 19: Typical compound-fragment data matrix
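Construction of this matrix can be sketched as follows; the two-compound training set and its {AI}-annotated fragments are hypothetical, chosen only to illustrate the layout.

```python
def fragment_matrix(compound_frags):
    """Build a binary compound-fragment data matrix. Rows are
    compounds, columns are the union of unique fragments; an entry
    is 1 if the fragment occurs in the compound and 0 otherwise."""
    columns = sorted(set().union(*compound_frags.values()))
    rows = {name: [1 if frag in frags else 0 for frag in columns]
            for name, frags in compound_frags.items()}
    return columns, rows

# hypothetical two-compound training set with {AI}-annotated fragments
training = {'compound_1': {('C', 'C'), ('C', 'O')},
            'compound_2': {('C', 'C'), ('C', 'N')}}
columns, rows = fragment_matrix(training)
# columns: [('C', 'C'), ('C', 'N'), ('C', 'O')]
```

Each row of the result is the fingerprint of one compound over the shared fragment columns.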

This algorithm was executed for a training set containing three compounds as shown in

Figure 20. A part of the actual output generated is shown in Figure 21 and Figure 22. In total, 33 unique fragments were generated when the {AI} annotation scheme was used and 90 unique fragments were generated when the {AI, nC, nH} annotation scheme was used. The output shown here is only a small subset of the fragments generated.


Figure 20: Sample training set

Figure 21: Compound-fragment data matrix (partial) using {AI} annotation scheme

Figure 22: Compound-fragment data matrix (partial) using {AI, nC, nH} annotation scheme

The scripts for generating annotated linear fragments and the compound-fragment data matrix are included in Appendix A. We integrated these Python scripts into Microsoft Excel and developed a seamless workflow with a user-friendly interface. This interface allows the user to specify names for the input SD file and the output Excel or text (.tsv format) file for storing the compound-fragment data matrix, select annotation options and fragment lengths, remove singleton and doubleton fragments (sparsely observed fragments), and generate histogram outputs displaying the distribution of fragment lengths. A snapshot of this Excel interface is shown in Figure 23.


Figure 23: Excel interface for generating compound-fragment data matrix using annotated linear chemical fragments


4.4 Developing Markov Chain Models

Markov chains are very useful for modeling linear sequences of events. A Markov chain model assumes that the probability of each event depends only on the outcome of the preceding event and not the entire previous sequence. In formal terms, the Markov property is that

Pr{X_{n+1} = j | X_0 = i_0, …, X_{n-1} = i_{n-1}, X_n = i_n} = Pr{X_{n+1} = j | X_n = i_n}

for all time points n and all states i_0, …, i_{n-1}, i, j.

The probability of X_{n+1} being in state j given that X_n is in state i is called the one-step transition probability and is denoted by P_{i,j}^{n,n+1}. When the one-step transition probabilities are independent of the time variable n, the Markov chain is said to have stationary transition probabilities; thus, P_{i,j}^{n,n+1} = P_{i,j}. A stationary Markov chain is completely defined by its transition probability matrix and the specification of the initial probability distribution of the states.

Markov chains have found some use in chemoinformatics applications such as the generation of new molecules using computer-assisted design tools68.

Before beginning the discussion on the development of Markov chains for modeling toxicity, it is important to note the difference between global and local classification models. Global models are those that make use of all relevant information from the training set in order to make prediction on a test compound, whereas local models are those that use information from compounds that are identified to be similar using fingerprint or descriptor similarity calculations69.

A typical example of a global model is the standard least squares regression model, and a typical example of a local model is the k-nearest neighbors (kNN) model. The Markov chain models developed in this research fall under the category of global models, where information from all compounds in the training set is used to build the model for classification of the test set.

We developed Markov chain models for predicting toxicity and classifying test compounds by using the chemically annotated atom symbols. As a first step, all unique symbols from the training set are identified based on the annotation scheme selected. A count matrix is then generated by counting the number of transitions from one symbol to another using structural information contained in the SD file. The transition probability matrix (TPM) is then obtained by summing the counts in each row and dividing each element of the count matrix by the corresponding row sum.

Table 4 and Table 5 show a typical example of count matrix and TPM for a hypothetical training set with four unique annotated symbols – C1, C2, C3, and O1.

C1 C2 C3 O1

C1 0 10 12 6

C2 10 2 5 3

C3 12 5 1 7

O1 6 3 7 1

Table 4: Example of count matrix

C1 C2 C3 O1

C1 0.00 0.36 0.43 0.21

C2 0.50 0.10 0.25 0.15

C3 0.48 0.20 0.04 0.28

O1 0.35 0.18 0.41 0.06

Table 5: Example of transition probability matrix (generated using count matrix data in Table 4)
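The row-normalization step described above translates directly into code; applying it to the Table 4 counts reproduces the Table 5 values.

```python
def transition_probability_matrix(count_matrix):
    """Convert a symbol-symbol count matrix into a transition
    probability matrix by dividing each entry by its row sum,
    so that every row sums to 1."""
    return [[c / sum(row) for c in row] for row in count_matrix]

# count matrix from Table 4 (symbols C1, C2, C3, O1)
counts = [[0, 10, 12, 6],
          [10, 2, 5, 3],
          [12, 5, 1, 7],
          [6, 3, 7, 1]]
tpm = transition_probability_matrix(counts)
# e.g. tpm[1][0] = 10/20 = 0.50, as in Table 5
```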


It should be noted here that the count matrix is always symmetric but the TPM is not. The main criterion for a TPM is that all its rows should sum to 1. Now, in order to develop a working

Markov chain model, we had to overcome several challenges. First, chemical paths do not have any inherent direction. Thus, the Markov probability for any given sequence must be the same regardless of whether the sequence is read left-to-right or right-to-left. This is generally not the case because the TPM is not a symmetric matrix.

Secondly, it should be considered that the test set might contain several fragments that were never observed in the training set. Thus, we had to find a way to assign finite probabilities to all possible transitions. Lastly, the test set might contain new symbols that were not present in the training set.

Thus, we had to find an effective way to incorporate new symbols in the analysis. These constraints also helped to define the domain of applicability for the models.

In order to overcome these challenges, we came up with a preliminary Markov chain model based on one-step connection probabilities. Since the word ‘transition’ inherently implies direction, we used the idea of ‘connection’ probability between two symbols. As the word suggests, connection probability is the probability of observing any two symbols connected to each other in the 2D molecular graph. Thus, it takes into account only fragments of length 2, or in other words, two symbols that are at a distance of one bond length from each other. These are referred to as one- step connections. This approach reduces the TPM into a vector of length n(n+1)/2, where n is the total number of unique symbols. Corresponding probabilities are then obtained by summing all the counts and dividing each element by the sum. Table 6 shows an example of connection count vector and corresponding probabilities.


Connection count Connection probability

C1-C1 0 0

C1-C2 10 0.21

C1-C3 12 0.26

C1-O1 6 0.13

C2-C2 2 0.04

C2-C3 5 0.11

C2-O1 3 0.06

C3-C3 1 0.02

C3-O1 7 0.15

O1-O1 1 0.02

Table 6: Example of connection vector with counts and probabilities (generated using count matrix data in Table 4)
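Collapsing the count matrix into the undirected connection vector can be sketched as follows; with the Table 4 counts (total 47), it reproduces the Table 6 probabilities.

```python
from itertools import combinations_with_replacement

def connection_probabilities(count_matrix, symbols):
    """Collapse a symmetric count matrix into undirected one-step
    connection counts and probabilities: a vector of length
    n(n+1)/2 for n unique symbols."""
    counts = {}
    for i, j in combinations_with_replacement(range(len(symbols)), 2):
        counts[(symbols[i], symbols[j])] = count_matrix[i][j]
    total = sum(counts.values())
    probs = {pair: c / total for pair, c in counts.items()}
    return counts, probs

symbols = ['C1', 'C2', 'C3', 'O1']
cm = [[0, 10, 12, 6],        # count matrix from Table 4
      [10, 2, 5, 3],
      [12, 5, 1, 7],
      [6, 3, 7, 1]]
counts, probs = connection_probabilities(cm, symbols)
# e.g. probs[('C1', 'C3')] = 12/47 ≈ 0.26, as in Table 6
```

Only the upper triangle of the matrix is visited, which is what reduces the TPM to a vector of length n(n+1)/2.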

Now, suppose we know a binary toxicity outcome (positive (POS)/ negative (NEG)) for all compounds in this training set. We can then split the connection counts and corresponding probabilities in Table 6 into two sets depending on which chemicals are assigned to each category.

This is shown in Table 7, where the probability ratios are calculated as P_POS/P_NEG. The data are shown graphically in Figure 24 and Figure 25.


          Connection count        Connection probability
          Positive  Negative     Positive  Negative   Probability
          (C_POS)   (C_NEG)      (P_POS)   (P_NEG)    ratio

C1-C1 0 0 0 0 NaN

C1-C2 6 4 0.22 0.2 1.1

C1-C3 10 2 0.37 0.1 3.7

C1-O1 4 2 0.15 0.1 1.5

C2-C2 1 1 0.04 0.05 0.8

C2-C3 2 3 0.07 0.15 0.47

C2-O1 1 2 0.04 0.10 0.4

C3-C3 1 0 0.04 0 NaN

C3-O1 2 5 0.07 0.25 0.28

O1-O1 0 1 0 0.05 0

Table 7: Connection count vector and corresponding probabilities for training set with binary toxicity outcome (generated using data from Table 6)


Figure 24: One-step connection probabilities (based on data in Table 7)

Figure 25: One-step connection probability ratios (based on data in Table 7)


As seen here, some one-step connections such as C1-C1, C1-O1, and C2-C2 are observed with roughly equal probabilities in both the POS and NEG class of compounds. These connections do not have much discriminating power and are not of much use from the modeling standpoint.

There are other connections such as C2-C3, C2-O1, and C3-O1 that are mainly observed in the

NEG class of compounds. These connections can be considered as safe and their presence in compounds increases their likelihood of being non-toxic. On the other hand, there are connections such as C1-C3 that are mainly observed in the POS class of compounds. These connections are considered as alerts and their presence in compounds increases their likelihood of being toxic.

After calculating the connection counts and probabilities, we calculate the sequence probability for a fragment having length greater than 2. Consider the fragment [‘C2’, ‘C1’, ‘O1’] for example. There are 2 one-step connections in this fragment, viz. [‘C2’, ‘C1’] and [‘C1’, ‘O1’]. It should be noted here that the connection [‘C2’, ‘C1’] is the same as [‘C1’, ‘C2’]. The sequence probability of this fragment is called its likelihood and is calculated as follows:

likelihood = p(C1-C2) × p(C1-O1)

A log-likelihood is then calculated by taking the natural log of both sides. We then calculate the difference in log-likelihood under the POS and NEG Markov models. Thus, we get a log-likelihood for each fragment.

log-likelihood = log(p(C1-C2)) + log(p(C1-O1))

(log-likelihood)_{fragment i} = (log-likelihood)_i^{POS} − (log-likelihood)_i^{NEG}


This information can then be used to identify important structural alerts. For fragments with positive log-likelihood values, the higher the value, the greater the probability that the fragment is a potentially toxic structural alert. Similarly, for fragments with negative log-likelihood values, the higher the absolute value, the greater the probability that the fragment is potentially safe. We repeat this analysis for all the fragments in the compound-fragment data matrix. An overall log-likelihood for the whole compound is then calculated as:

(log-likelihood)_compound = Σ_{i=1}^{m} (log-likelihood)_{fragment i}

where m is the number of fragments in the compound-fragment data matrix observed in the compound of interest.

If this overall log-likelihood is calculated to be greater than or equal to zero, then the compound is classified as POS. If it is less than zero, then the compound is classified as NEG. An overview of this classification strategy is shown graphically in Figure 26.

If a test compound contains atom symbols that were not observed in the training set, then that test compound is considered to be out of the model’s applicability domain and is classified as

‘NP’ (not predicted). If a test compound contains at least one connection observed in only POS class and at least one connection observed in only NEG class, then that test compound is out of the model’s applicability domain and is classified as ‘NP’. If a test compound contains atom symbols that were all observed in the training set, but contains certain one-step connections that were not observed in the training set, then those connections are skipped and prediction is made on the test compound using the remaining one-step connections.



Figure 26: Overview of classification strategy


4.5 Evaluating Model Performance

We can explore different Markov chain models with different ranges of fragment lengths and annotation schemes. In order to choose the best model, we need to evaluate the performance of each model. The performance of any model is determined by its prediction error, i.e., how well the model makes predictions about new data. This is called external validation of the model, where predictions are made on a completely different set of compounds that were in no way part of the training set used to train the model. This measure of performance is called the Mean Squared Prediction Error (MSPE).

A statistical technique called k-fold cross-validation was used to determine the MSPE. In this technique, the dataset is split into k roughly equal-sized parts. The model is trained using data from k−1 parts and is validated on the remaining part. This is repeated for all k parts and an average MSPE is calculated. This average MSPE is used as the measure of performance for evaluating the efficiency of the model. Three terms are important in calculating the model performance: sensitivity, specificity, and concordance. Sensitivity measures the proportion of true positives and specificity measures the proportion of true negatives. Concordance is the proportion of true predictions and measures the overall accuracy of the model. Thus,

Sensitivity = Pr(Y_pred = 1 | Y = 1)

Specificity = Pr(Y_pred = 0 | Y = 0)

In order to calculate these performance parameters, the classification results are generally compiled into a 2x2 confusion matrix. This matrix classifies the results into 4 categories, namely true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) as shown in

Table 8. The sensitivity, specificity, and concordance values are then calculated using the following equations.

                     Predicted
                 Positive   Negative
Actual Positive     TP         FN
       Negative     FP         TN

Table 8: General 2x2 confusion matrix

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Concordance = (TP + TN) / (TP + FP + TN + FN)

An effective way to display this information graphically is a receiver operating characteristic (ROC) curve. The ROC curve plots sensitivity as a function of (1 − specificity) for the outcomes predicted from different models. As a rule of thumb, the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the model.
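The three performance equations translate directly into code; the confusion-matrix counts in the example below are made up for illustration.

```python
def performance(tp, fn, fp, tn):
    """Sensitivity, specificity, and concordance from the four cells
    of the 2x2 confusion matrix (TP, FN, FP, TN)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    concordance = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, concordance

# hypothetical results: 40 TP, 10 FN, 20 FP, 30 TN
sens, spec, conc = performance(40, 10, 20, 30)   # → 0.8, 0.6, 0.7
```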


CHAPTER 5: RESULTS ON IDENTIFICATION OF STRUCTURAL ALERTS

5.1 Distinguishing Structurally Similar Compounds

We first explain the results obtained using two specific compounds, namely β-Terpinene and β-Phellandrene. This analysis demonstrates how the annotated linear chemical fragments help to distinguish between apparently similar-looking compounds based on the annotation options selected. The chemical structures of β-Terpinene and β-Phellandrene are shown in Figure 27.

Figure 27: Two isomers: β-Terpinene and β-Phellandrene

In chemoinformatics methods, the general assumption is that chemicals with similar structures have similar properties. Although this assumption holds true most of the time, it is not always the case, especially when complex endpoints such as chemically-induced toxicity are considered70. The two isomers shown above have similar properties for most endpoints such as density, boiling point, vapor pressure, and refractivity. However, they differ significantly in their

skin sensitization potencies and bio-concentration factors (BCF). The properties of these two isomers were found using ChemSpider71 and are shown in Table 9.

                            β-Terpinene      β-Phellandrene     Units

Density                     0.8 ± 0.1        0.8 ± 0.1          g/cc

Boiling point               173.5            175                °C

Vapor pressure              1.7 ± 0.1        1.6 ± 0.1          mmHg

Flash point                 46.5 ± 15.2      44.0 ± 13          °C

Refractive index            1.467            1.467

Surface tension             25.4 ± 5         25.4 ± 5           dyne/cm

LogP                        4.47             4.35

BCF (at pH 5.5)             893.99           700.94

Skin sensitization potency  Non-sensitizer   Strong sensitizer

Table 9: Properties of two isomers, β-Terpinene and β-Phellandrene

As seen in Table 9, β-Terpinene is a non-sensitizer to skin, whereas β-Phellandrene is a strong skin sensitizer. Similarly, the BCF of β-Terpinene at pH 5.5 is 893.99, whereas the BCF of β-Phellandrene at pH 5.5 is 700.94. This phenomenon, where a set of similar chemicals show similar properties for most endpoints but differ significantly with respect to some other endpoints, is commonly referred to as an activity cliff72,73. A simplistic representation of this phenomenon is depicted in Figure 28 using properties of the two isomers. Thus, it is not always trivial to predict all the properties of compounds from their chemical structures, especially properties that are determined by complex mechanisms of action.

Figure 28: Activity cliff phenomenon for chemical properties

The annotated linear chemical fragments developed in this research have been found to provide some explanation for this phenomenon for the two isomers under consideration. In order to explain this, we calculated a similarity metric between these two compounds based on Tanimoto distances. Tanimoto distance (TD) between two compounds is defined as

TD = 1 − c / (a + b − c)

where a = number of on-bits (1) in compound 1;
b = number of on-bits (1) in compound 2;
c = number of on-bits (1) in common between compounds 1 and 2.
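With fingerprints represented as sets of on-bits, this definition is a short function; the bit names in the example are hypothetical.

```python
def tanimoto_distance(fp1, fp2):
    """Tanimoto distance between two binary fingerprints given as
    sets of on-bits: TD = 1 - c/(a + b - c)."""
    c = len(fp1 & fp2)                  # on-bits in common
    return 1 - c / (len(fp1) + len(fp2) - c)

# hypothetical fragment fingerprints: a=3, b=3, c=2
tanimoto_distance({'f1', 'f2', 'f3'}, {'f2', 'f3', 'f4'})   # → 0.5
```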

Tanimoto distances range from 0 to 1, where 0 means completely similar and 1 means completely dissimilar. We calculated Tanimoto distances between the two isomers using different annotation schemes. Table 10 shows the results obtained when fragments of different lengths were considered in this analysis.


Path lengths {AI} {AI, nC} {AI, nC, nH}

2-6 0 0 0.89

3-5 0 0 0.91

5-6 0 0 1

Table 10: Tanimoto distances between the two isomers

It can be seen from Table 10 that the two isomers are indistinguishable when the {AI} and {AI, nC} annotation schemes are used. However, when the {AI, nC, nH} annotation scheme is used, the two isomers are found to be very dissimilar. Thus, it can be concluded that for most properties such as density, boiling point, and vapor pressure, only the atom identity (AI) and number of connections (nC) are important, whereas for complex properties such as skin sensitization and BCF, the positioning of the hydrogen atoms (nH) is extremely important. This explanation for the activity cliff phenomenon is shown in Figure 29. Thus, the annotated linear fragments help to identify subtle differences in chemical structures and can hint at possible reaction mechanisms for complex phenomena such as skin sensitization and BCF.

Figure 29: Possible explanation of activity cliff phenomenon using annotated linear chemical fragments

5.2 Identifying Structural Alerts for Skin Sensitization

This section describes the analysis of different annotation schemes and fragment lengths in the identification of relevant structural alerts from the skin sensitization dataset. As mentioned in the Training Datasets section (3.1) of the Datasets and Computational Tools chapter, the skin sensitization dataset contains 467 compounds classified into five categories based on their relative skin sensitizing potencies. The objective of this evaluation is to explore the fragment space to see whether descriptors related to skin sensitizing effects can be identified.

In order to do so, we analyzed and compared the performance of five different annotation schemes: {AI}, {AI, nC}, {AI, nC, nH}, {nC, nH, PC}, and {AI, nC, nH, PC}. For each annotation scheme, we first processed all the compounds through the search algorithm and extracted linear fragments of lengths 2 to 15. Figure 30 shows the total number of fragments generated using the five annotation schemes discussed above.
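The extraction step amounts to enumerating simple paths in the molecular graph. The depth-first sketch below captures the idea under simplified assumptions (an adjacency-dict encoding and a small hypothetical molecule); it is not the dissertation's actual search algorithm:

```python
def linear_fragments(adjacency, labels, min_len=2, max_len=15):
    """Enumerate annotated linear fragments (simple paths) of min_len..max_len
    atoms from a molecular graph given as an adjacency dict. Each fragment is
    canonicalized as the lexicographically smaller of its two read directions,
    so a path and its reverse count once."""
    found = set()

    def extend(path):
        if len(path) >= min_len:
            symbols = tuple(labels[i] for i in path)
            found.add(min(symbols, symbols[::-1]))
        if len(path) == max_len:
            return
        for nxt in adjacency[path[-1]]:
            if nxt not in path:          # simple paths only: no revisits
                extend(path + [nxt])

    for start in adjacency:
        extend([start])
    return found

# Hypothetical 4-atom chain O-C-C-C, annotated with {AI} labels only
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = {0: "O", 1: "C", 2: "C", 3: "C"}
frags = linear_fragments(adj, labels, min_len=2, max_len=4)
```

For this chain the five canonical fragments are (C,O), (C,C), (C,C,O), (C,C,C), and (C,C,C,O).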

Figure 30: Total fragments generated from skin sensitization dataset with fragment lengths between 2 and 15 using 5 different annotation schemes


As expected, Figure 30 shows that more fragments are generated when more annotation options are selected. This is because selecting more annotation options increases the resolution for distinguishing between different fragments, thus generating more fragments in the process. It is also observed that more fragments are generated when PC annotation is used instead of AI (comparing annotation schemes 3 and 4), even though the PC annotation has only three possible states. Figure 31 shows the distribution of fragment lengths using the five annotation schemes.

Figure 31: Fragment distribution for skin sensitization dataset using 5 different annotation schemes

This chart gives an idea of the structural diversity of chemicals in the dataset and can be used to characterize groups of chemicals. It can be seen that these distributions are unimodal. This is because there are fewer unique fragments at the extreme path lengths: a large number of fragments are generated at smaller path lengths, but most of them are redundant, while only a few fragments are generated at longer path lengths, as these are limited by the size of compounds in the dataset.

It can also be seen that the number of fragments generated is quite large. For example, using the {nC, nH, PC} annotation scheme, 34,925 fragments are generated in total. Thus, the next step is to reduce the dimension of this fragment space further. This is done by removing those fragments that are observed in fewer than 3 compounds out of the 467 compounds. These fragments are called singletons and doubletons. They are observed very sparsely, and it is assumed that they cannot help in explaining the sensitization potency of compounds. Figure 32 shows the total number of fragments generated and Figure 33 shows the corresponding fragment distribution after the singletons and doubletons have been removed.
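The singleton/doubleton filter can be sketched as follows; the fragment names and the five-compound dataset are hypothetical:

```python
from collections import Counter

def remove_rare_fragments(compound_fragments, min_support=3):
    """Drop fragments observed in fewer than min_support compounds
    (singletons and doubletons for min_support=3).
    compound_fragments: one set of fragments per compound."""
    support = Counter()
    for frags in compound_fragments:
        support.update(set(frags))   # count each fragment once per compound
    keep = {f for f, n in support.items() if n >= min_support}
    return [frags & keep for frags in compound_fragments], keep

# Hypothetical dataset of five compounds: fragment A appears in 3 compounds,
# B in 3, C in 2 (doubleton), D in 1 (singleton).
dataset = [{"A", "B"}, {"A", "C"}, {"A", "B"}, {"B", "D"}, {"C"}]
filtered, kept = remove_rare_fragments(dataset)
```

Only A and B survive the filter; the last compound ends up with an empty fragment set.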

Figure 32: Total fragments generated from skin sensitization after removing singletons/doubletons

As seen from Figure 32, the total number of fragments generated reduces drastically. After removing singletons and doubletons, the total number of fragments generated using {nC, nH, PC} annotation scheme reduces to 2,914.


Figure 33: Fragment distribution for skin sensitization dataset after removing singletons/doubletons

It can be seen from Figure 33 that for path lengths less than 6, the number of fragments increases with increasing annotation options; while for path lengths greater than 6, they gradually decrease. This fragment space is further reduced by considering a smaller range of path lengths.

The analysis that follows is based on fragments with path lengths 3 to 7. Figure 34 and Figure 35 show histograms of the fragments categorized by their length before and after removing singletons/doubletons using the {nC, nH, PC} annotation scheme. It can be seen from Figure 34 that the total number of fragments generated increases with increasing fragment length. Also, the proportion of singletons and doubletons increases as the fragment length increases. After removing the singletons and doubletons, it can be seen from Figure 35 that the total number of fragments generated drops drastically, especially for fragments of higher lengths. The total number of fragments generated thus forms a unimodal distribution with respect to fragment length.


Figure 34: Histogram of fragments using {nC, nH, PC} annotation scheme (before removing singletons/doubletons)

Figure 35: Histogram of fragments using {nC, nH, PC} annotation scheme (after removing singletons/doubletons)


In order to identify distinguishing fragments, we generate a contingency table for each fragment. A contingency table contains counts of compounds in which the fragment is present or absent. These compound counts are classified by LLNA category. Table 11 shows the contingency table for fragment [`10-', `30+', `10-'] generated using {nC, nH, PC} scheme.

             NON        WEAK       MOD        STR       Total
Present      6          3          8          17        34
(Expected)   (10.05)    (7.72)     (9.32)     (6.92)
Absent       132        103        120        78        433
(Expected)   (127.95)   (98.28)    (118.68)   (88.08)
Total        138        106        128        95        467

Table 11: Contingency table for fragment [`10-', `30+', `10-']

As seen from this table, the fragment [`10-', `30+', `10-'] is present in 6 compounds in the NON category, 3 compounds in the WEAK category, 8 compounds in the MOD category, and 17 compounds in the STR category. The null hypothesis (H0) is that the presence/absence of the fragment has no effect on the degree of skin sensitization. Thus,

$$p_{NON} = p_{WEAK} = p_{MOD} = p_{STR} = \frac{34}{467} = 0.0728$$

This is the proportion of compounds in which we expect to see the fragment assuming that H0 is true. We then compute the counts in each category using this expected proportion. These expected counts are shown in parentheses in Table 11. We then compute a χ² test statistic as follows:


$$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{4}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where O is the observed count, E is the expected count, i indexes the presence/absence of the fragment, and j indexes the LLNA category. This test statistic is called Pearson's χ² statistic, and it is distributed as χ²(3) when H0 is true. Assuming a significance level (α) of 0.01, a fragment is considered to be significant if its χ² statistic is greater than 11.3. Fragments with higher χ² values tend to be more distinguishing. The χ² statistic for [`10-', `30+', `10-'] is found to be 20.91.

Thus, this fragment is statistically significant and it helps to distinguish between strong and weak skin sensitizing compounds.
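The χ² computation for Table 11 can be reproduced with a short plain-Python sketch; the dissertation reports 20.91, and rounding in the expected counts accounts for the difference in the last digit:

```python
def pearson_chi2(observed_rows):
    """Pearson chi-square statistic for an r x c contingency table
    given as a list of row lists."""
    row_totals = [sum(r) for r in observed_rows]
    col_totals = [sum(c) for c in zip(*observed_rows)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed_rows):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand   # expected count
            chi2 += (obs - exp) ** 2 / exp
    return chi2

# Counts from Table 11 for fragment ['10-', '30+', '10-']
present = [6, 3, 8, 17]       # NON, WEAK, MOD, STR
absent = [132, 103, 120, 78]
chi2 = pearson_chi2([present, absent])
print(round(chi2, 2))  # 20.92, well above the 11.3 cutoff for alpha = 0.01
```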

We then prepare a list of all fragments sorted by their χ² values. This helped us to identify 174 significant fragments with path lengths between 3 and 7. We are now interested in only those fragments that are positively correlated with skin sensitization. Thus, we calculate a γ-statistic for each fragment and check whether the correlation is positive. The γ-statistic measures the correlation between two ordinal variables and is calculated as

 −   = 퐶 퐷 퐶 + 퐷

where C and D are the number of concordant and discordant pairs respectively. We then sort the fragments by their respective  values. Fragments with  values greater than 0.35 were considered to be positively correlated with skin sensitization. This narrowed down the list to 131 fragments that are significant (2 > 11.3) as well as positively correlated ( > 0.35).


Table 12 shows the counts of total number of fragments generated, significant fragments, and positively correlated fragments for each annotation scheme. The complete results (for significant and positively correlated fragments) with actual statistics are included in Appendix B.

Annotation scheme    Total fragments   After removing           Significant fragments   Positively correlated fragments
                                       singletons/doubletons    (χ² > 11.3)             (χ² > 11.3 and γ > 0.35)
{AI}                 585               299                      35                      17
{AI, nC}             4895              1548                     107                     75
{AI, nC, nH}         8812              1956                     163                     103
{nC, nH, PC}         12684             2233                     174                     131
{AI, nC, nH, PC}     13330             2232                     174                     135

Table 12: Summary statistics for skin sensitization dataset

The list of fragments thus obtained is hierarchical by nature. In other words, many different fragments refer to the same descriptor. For example, we find that the fragment [`10-', `30+', `30+', `21+', `30+'] is significant, and thus, many smaller fragments that are contained within this fragment are also found to be significant. In order to eliminate this confounding, we manually go through each fragment in the narrowed-down list and identify the descriptor it corresponds to. These unique descriptors are shown in Table 13 along with the annotation schemes that were able to identify them. The ‘Descriptor’ column lists the fragments identified along with the descriptors that they correspond to. The ‘Representative structure’ column depicts these fragments, where they are highlighted in red within a representative compound structure.

Table 13: Significant fragments in skin sensitization dataset along with the annotation schemes that were able to identify them. (In the original table, check marks indicate which of the five annotation schemes identified each fragment, and each fragment is highlighted in red within a representative compound structure.)

Fragment 1: [N,C,C,C,N,O] (Aromatic dinitro)
Fragment 2: [O,N,O] (Nitro)
Fragment 3: [O,C,O,C,O] (Ring esters)
Fragment 4: [Cl,C,C,C,C,N] (p-chloro anilines)
Fragment 5: [O1,C3,C3,C2,C3,O1] (Phenolic esters)
Fragment 6: [N1,C3,C3,C2,C3,C2] (o- and p-substituted anilines)
Fragment 7: [N1,C3,C2,C2,C3,N1] (p-phenylene diamines)
Fragment 8: [Cl1,C3,C3,C2,C3] (o- and p-substituted chlorobenzenes)
Fragment 9: [C2,N3,C3,C2,C3] (tert-aromatic amines)
Fragment 10: [O11,C30,C21,C30,C21] (m-substituted phenols)
Fragment 11: [O11,C30,C30,C30,O11] (Diphenols)
Fragment 12: [C21,C21,C21,O10] (α,β-unsaturated aldehydes)
Fragment 13: [O10,C30,C21,C21] (α,β-unsaturated ketones)
Fragment 14: [30+,21+,20-] (5-member ring containing N)
Fragment 15: [10-,220,300,21-] (Benzyl halide; Br can be replaced by other halogen atoms)

These results show that the annotated linear chemical fragments are able to capture important descriptors responsible for imparting high skin sensitization potential to chemicals. Most of the descriptors identified in Table 13 are known to cause skin sensitizing effects, and their mechanism of action is well-known. In their paper, Aptula et al.52 describe the common reaction mechanisms by which chemicals induce skin sensitization and classify compounds based on their reaction mechanism. In the following paragraph, we discuss the five classes of compounds described by them and see how our annotated fragments correspond to them.

The first class of compounds is the Michael and pro-Michael type acceptors. These are mainly α,β-unsaturated aldehydes and ketones. As seen in Table 13, we were able to identify these using the last three annotation schemes (fragments 12 and 13). The second class of compounds is the SNAr electrophiles. These are mainly chlorinated dinitrobenzenes. We were able to identify these using all five annotation schemes (fragment 1). The third class of compounds is the SN2 electrophiles. These are mainly substituted 5-member ring compounds containing N and S. We were able to identify these using annotation scheme 4 (fragment 14). The fourth class of compounds is the Schiff base formers. These are mainly aliphatic aldehydes and activated ketones. We were able to identify these somewhat indirectly, as seen in fragments 8 and 11. These fragments correspond to substituted chlorobenzenes and diphenols, but they have a carbonyl group attached to an aromatic ring that might be responsible for their skin sensitizing activity. The fifth class of compounds is the acylating agents. These are mainly esters of acidic alcohols such as phenols and carboxylic anhydrides. We were able to identify these using almost all of the five annotation schemes (fragments 3 and 5). Apart from these, we were also able to identify several other fragments such as diamines (fragment 7) and aromatic tertiary amines (fragment 9). These fragments might hint at a different reaction mechanism by which chemicals induce skin sensitization, thus potentially leading to the development of newer structural alerts. This analysis shows that the annotated chemical fragments are effective in capturing important structural descriptors from a given set of chemical structures with a defined toxicity endpoint.


We can also see from Table 12 that the total number of fragments generated is relatively low, especially after removing the singleton and doubleton fragments. This is very helpful in reducing computational time and in identifying only those fragments that are highly probable to cause skin sensitizing effects. These fragments are also easily convertible to the structural fragments that they correspond to in the actual chemical structures. This makes them more meaningful and interpretable by chemists for determining their mechanism of action. And even though the fragments are linear in composition, they are capable of capturing branched fragments as well due to the power of annotating features. For example, consider the o- and p-substituted anilines (fragment 6). The fragment identified in this case is the benzene ring and attached –NH2 group. However, because of the ‘nC’ and ‘nH’ annotating features, we were able to capture the ortho- and para- substituted ring structure as well. This makes our method more powerful and enables it to capture higher dimensional information.

5.3 Comparison of Different Annotation Schemes

As seen in Table 13, 15 unique descriptors were identified by using different combinations of the annotating features provided. The first three descriptors (nitro compounds, dinitro compounds, and ring esters) were identified by all five annotation schemes. This suggests that these descriptors are quite obvious and can be identified relatively easily from a sufficiently large set of compounds. The next descriptor (p-chloro anilines) was identified by the {AI} scheme alone. When more annotating features are selected, the structural resolution increases and the information regarding the descriptors under consideration can get diffused, making it more difficult to identify them. Thus, these descriptors could be identified using the ‘AI’ feature alone, suggesting that they might be responsible for weak or moderate sensitizing effects.


When the ‘nC’ feature was added to ‘AI’, we were able to identify five more descriptors (phenolic esters, anilines, dianilines, chlorobenzenes, and aromatic tertiary amines). Then, adding the ‘nH’ feature, we were able to identify four more descriptors (phenols, diphenols, and unsaturated aldehydes and ketones). It can be seen that the {AI, nC, nH} scheme was able to identify almost all the descriptors described in Table 13. Thus, we can conclude that this annotation scheme has the right combination of annotating features that gives the optimum resolution for a set of compounds with skin sensitization as the toxicity endpoint. When the ‘PC’ feature was added, we were able to identify two more descriptors (5-member ring compounds containing N, and benzyl halides). Thus, partial charge helped in capturing additional information that was not previously captured by atom identity. In the benzyl halide case, the bromine atom could be substituted by other halogen atoms. This shows that the identity of the atom in this case is not important. What is important is that a negatively charged halide atom is attached to a benzyl group.

However, there is also a downside to using the PC feature. As seen in Table 13, annotation scheme {nC, nH, PC} was not able to identify diamines, chlorobenzenes, and aromatic tertiary amines (fragments 7 - 9). Thus, specifying partial charge can sometimes over-specify structural information, thereby missing some important descriptors in the dataset.

This is clearly seen when the fifth annotation scheme {AI, nC, nH, PC} was used. The simultaneous use of ‘AI’ and ‘PC’ features over-specified the atomic structural information and thus failed to identify five of the important descriptors described in Table 13. This shows that an annotation scheme should be wisely chosen depending on the dataset and toxicity endpoint under consideration. An optimized annotation scheme will help to provide the best resolution for capturing structural information as well as reduce the computational time by avoiding the generation of less meaningful fragments.


5.4 Identifying Structural Alerts for Ames Mutagenicity

This section describes the identification of structural alerts from the Ames mutagenicity dataset using different annotation schemes and fragment lengths. As discussed in the Training Datasets section (3.1) of the Datasets and Computational Tools chapter, Hansen et al.58 have compiled a dataset for Ames mutagenicity containing 6512 compounds. A part of this dataset containing 984 compounds (test set for CV 1) was used in this analysis.

As a first step, we extracted fragments with path lengths between 3 and 7 from this dataset using the five annotation schemes mentioned earlier. Figure 36 shows the total number of fragments generated. Figure 37 and Figure 38 show the distribution of fragment lengths before and after removing the singleton/doubleton fragments respectively. Figure 39 and Figure 40 show histograms of fragments categorized by their length before and after removing singletons/doubleton fragments respectively using the {AI, nC, nH} annotation scheme.

Figure 36: Total fragments generated from Ames mutagenicity dataset with path lengths between 3 and 7 using 5 different annotation schemes

Figure 37: Fragment distribution for Ames mutagenicity dataset before removing singleton/doubleton fragments

Figure 38: Fragment distribution for Ames mutagenicity dataset after removing singleton/doubleton fragments


Figure 39: Histogram of fragments using {AI, nC, nH} annotation scheme (before removing singleton/doubleton fragments)

Figure 40: Histogram of fragments using {AI, nC, nH} annotation scheme (after removing singleton/doubleton fragments)


In order to identify distinguishing fragments, we generate a 2x2 contingency table for each fragment. We use Fisher’s exact test for analysis and compute positive predictive value (PPV), negative predictive value (NPV), and odds ratio (OR) statistics for each fragment. Table 14 shows a 2x2 contingency table for a general fragment ‘F1’.

           Mutagenic   Non-mutagenic   Total
Present    N11         N12             n1.
Absent     N21         N22             n2.
Total      n.1         n.2             n..

Table 14: General 2x2 contingency table

Fisher’s exact test is developed by conditioning on row and column totals. The key result used is as follows.

$$\text{If } N_{11} \sim \mathrm{Bin}(n_{\cdot 1}, p) \text{ and } N_{12} \sim \mathrm{Bin}(n_{\cdot 2}, p),$$

$$\text{then } N_{11} \mid (N_{11} + N_{12} = t) \sim \mathrm{Hypergeometric}$$

It should be noted that this result is valid only when N11 and N12 are independent. In mathematical terms,

$$\Pr\{N_{11} = w \mid N_{11} + N_{12} = n_{1\cdot}\} = \frac{\binom{n_{\cdot 1}}{w}\binom{n_{\cdot 2}}{n_{1\cdot} - w}}{\binom{n_{\cdot\cdot}}{n_{1\cdot}}}$$


Let pM and pN denote the probabilities of observing the fragment in mutagenic and non-mutagenic compounds respectively. The hypotheses are then

$$H_0: p_M = p_N \qquad\qquad H_1: p_M > p_N$$

The test statistic is N11 and it has a hypergeometric distribution. Thus,

$$p\text{-value} = \Pr\{N_{11} > w \mid N_{11} + N_{12} = n_{1\cdot}\}$$

Since n.1 and n.2 are very large for our dataset, we can use the normal approximation to Fisher’s exact test. Thus,

$$N_{11} \sim N\!\left(n_{\cdot 1}\, p_1,\; n_{\cdot 1}\, p_1 (1 - p_1)\right)$$

$$N_{12} \sim N\!\left(n_{\cdot 2}\, p_2,\; n_{\cdot 2}\, p_2 (1 - p_2)\right)$$

This gives rise to the following test statistic:

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_{\cdot 1}} + \frac{1}{n_{\cdot 2}}\right)}}$$

$$\text{where } \hat{p}_1 = \frac{N_{11}}{n_{\cdot 1}}; \quad \hat{p}_2 = \frac{N_{12}}{n_{\cdot 2}}; \quad \hat{p} = \frac{N_{11} + N_{12}}{n_{\cdot 1} + n_{\cdot 2}}$$


When H0 is true, this z-statistic is distributed as N(0, 1). Assuming a significance level (α) of 0.01, a fragment is considered to be statistically significant if its z-statistic value is greater than 2.33. Fragments with higher z-statistic values tend to be more distinguishing. A list of fragments is then prepared, sorted by their z-statistic values. We then compute the PPV, NPV, and OR statistics as follows.

$$PPV = \frac{N_{11}}{n_{1\cdot} + 2}$$

$$NPV = \frac{N_{22}}{n_{2\cdot} + 2}$$

$$OR = \frac{(N_{11} + 1)(N_{22} + 1)}{(N_{12} + 1)(N_{21} + 1)}$$
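These statistics can be computed together for any fragment's 2x2 table. The counts below are hypothetical, chosen only to illustrate a fragment that clears all three cutoffs:

```python
from math import sqrt

def fragment_stats(n11, n12, n21, n22):
    """z statistic (normal approximation to Fisher's exact test) plus
    smoothed PPV, NPV, and odds ratio for a fragment's 2x2 table.
    Rows: fragment present/absent; columns: mutagenic/non-mutagenic."""
    nc1, nc2 = n11 + n21, n12 + n22              # column totals n.1, n.2
    p1, p2 = n11 / nc1, n12 / nc2
    p = (n11 + n12) / (nc1 + nc2)                # pooled proportion
    z = (p1 - p2) / sqrt(p * (1 - p) * (1 / nc1 + 1 / nc2))
    ppv = n11 / (n11 + n12 + 2)                  # row total n1. plus 2
    npv = n22 / (n21 + n22 + 2)                  # row total n2. plus 2
    odds = (n11 + 1) * (n22 + 1) / ((n12 + 1) * (n21 + 1))
    return z, ppv, npv, odds

# Hypothetical counts for illustration only
z, ppv, npv, odds = fragment_stats(40, 5, 460, 479)
```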

Table 15 shows the counts of total number of fragments generated, significant fragments, fragments with PPV > 0.75, and fragments with OR > 4 for each annotation scheme.


Annotation scheme    Total fragments   After removing           Significant fragments   Fragments with   Fragments with
                                       singletons/doubletons    (z-stat > 2.33)         PPV > 0.75       OR > 4
{AI}                 1457              725                      44                      48               35
{AI, nC}             16272             5721                     257                     215              232
{AI, nC, nH}         30999             7999                     360                     361              411
{nC, nH, PC}         48858             10961                    443                     530              717
{AI, nC, nH, PC}     53369             11357                    460                     561              807

Table 15: Summary statistics for Ames mutagenicity dataset

As it was observed with the skin sensitization dataset, the list of fragments obtained here is also hierarchical by nature. We manually go through this list of fragments and identify the unique descriptors that they correspond to. These unique descriptors are structural alerts that help in the identification of potentially mutagenic compounds. Table 16 shows these unique descriptors along with the annotation schemes that were able to identify them. The ‘Descriptor’ column lists the fragments identified along with the descriptors that they correspond to. The ‘Representative structure’ column depicts these fragments, where they are highlighted in red within a representative compound structure.


Table 16: Significant fragments identified in Ames mutagenicity dataset. (In the original table, check marks indicate which of the five annotation schemes identified each fragment, and each fragment is highlighted in red within a representative compound structure.)

Fragment 1: [O,N,O] (Nitro)
Fragment 2: [O,N,C,C,C,N] (Aromatic dinitro)
Fragment 3: [O,C,O,N,C,O] (CO & NO groups)
Fragment 4: [O,C,N,N,O] (Nitroso)
Fragment 5: [C,N,N] (Azo-type)
Fragment 6: [N1,C3,C2,C2,C3] (Aromatic amines)
Fragment 7: [C3,C3,C3,C2,C2,C2,C3] (Polycyclic aromatic system)
Fragment 8: [N2,C3,S2,C2] (5-member ring containing N and S)
Fragment 9: [C2,C2,Cl1] (Aliphatic halides)
Fragment 10: [C3,N2,O1] (Hydroxyl amines)
Fragment 11: [C30,C30,C30,C31,O20] (Epoxides)
Fragment 12: [N30,C30,N20,C30,C30,N20] (Fused rings with alternating N atoms)

As seen in Table 16, the descriptors identified are quite different from the ones identified for the skin sensitization dataset. These results support the fact that the annotated linear fragments can capture meaningful descriptors responsible for inducing mutagenicity as well. Thus, depending on the dataset supplied and the toxicity endpoint of interest, the linear annotated fragments are able to capture relevant descriptors accordingly. This is of crucial importance for dynamically generated fragments because they are data-dependent and automatic.

Most of the descriptors identified in Table 16 are known to cause mutagenic effects and their mechanism of action is well-known. In their paper, Kazius et al.74 derive and validate toxicophores for mutagenicity prediction using a dataset of 4337 compounds. They identified 8 major classes of toxicophores, which were further expanded and characterized into a set of 29 specific descriptors. This set of descriptors enabled them to classify and predict mutagenicity of different compounds with error percentages as low as 15%. Thus, we used their toxicophores as a benchmark and analyzed how the annotated linear fragments compared with them. Table 17 shows the 8 classes of toxicophores identified by Kazius et al. and the annotated linear fragments that correspond to them (refer to Table 16 for chemical fragment numbers).

As seen in Table 17, the annotated linear fragments were able to identify all of the toxicophore classes. As our dataset had only 984 compounds, we might have missed some of the structural diversity that was contained in the larger dataset (4337 compounds) used by Kazius et al. In any case, we were still able to identify all of the eight major toxicophore classes. Thus, our approach is very promising in identifying relevant structural alerts.

Toxicophore name                                 Fragment number
Aromatic nitro                                   1, 2
Aromatic amine                                   6
Three-membered heterocycle                       11
Nitroso                                          4
Unsubstituted hetero-atom bonded hetero-atom     3, 10
Azo type                                         5
Aliphatic halide                                 9
Polycyclic aromatic system                       7

Table 17: Important toxicophores for Ames mutagenicity and corresponding fragments

Apart from these important toxicophores, we were also able to identify other fragments such as 5-member rings containing nitrogen and sulfur (fragment 8), and fused rings with alternating nitrogen atoms (fragment 12). These fragments might hint at a different reaction mechanism by which chemicals induce mutagenicity. We can also see that the annotation scheme {AI, nC, nH} captured most of the descriptors in Table 16. Thus, {AI, nC, nH} can be considered to be the optimized annotation scheme for modeling Ames mutagenicity. It should also be noted that the annotated linear fragments were able to capture even non-linear information such as polycyclic ring systems (fragment 7) due to the power of annotating features.


CHAPTER 6: RESULTS ON CLASSIFICATION OF COMPOUNDS USING MARKOV CHAIN MODELS

6.1 Markov Chain Models Based on One-step Connections

This section describes the predictive performance of Markov chain models developed using the one-step connection approach as described in the Developing Markov Chain Models section (4.4) of the Methods chapter. The complete dataset consisting of 6512 compounds compiled by Hansen et al. is used in this analysis. The Markov chain models are analyzed by calculating the performance parameters of sensitivity, specificity, and concordance for the 5-fold cross-validation scheme specified earlier.

For each cross-validation fold, we first identified the unique symbols in the corresponding training set and generated connection probability vectors. We then analyzed the predictive performance on the test sets by using the five different annotation schemes considered previously. In the discussion that follows, data from the first cross-validation fold (CV 1) and the {AI, nC, nH} annotation scheme have been used for illustration purposes. As seen in Table 1 in the Training Datasets section (3.1) of the Datasets and Computational Tools chapter, there are 5528 compounds in the training set for CV 1, where 2919 are Ames positive and 2609 are Ames negative. Using the structural information of compounds in each of these classes, we identified 38 unique symbols present in the entire training set. These symbols are shown in Table 18.


B30 Br10 C11 C12 C13 C20 C21 C22 C30 C31

C40 Cl10 Cl40 F10 I10 N10 N11 N12 N20 N21

N30 N40 O10 O11 O20 P30 P31 P40 P41 S10

S11 S20 S21 S30 S31 S40 Se20 Si40

Table 18: Unique symbols identified using training set for CV 1

We then processed each structure in both classes of the training set and generated a connection vector that contained all the observed one-step connections and their corresponding counts. The counts within each class were summed up and a probability for each connection was calculated by dividing the count for that connection by the total sum. Using the training data in CV 1, 214 one-step connections were observed, with a total count of 53,628 for the Ames POS class and 45,539 for the Ames NEG class. The probability for each connection i under the POS and NEG scenarios was then calculated as follows. Table 19 shows data for a selected set of 10 connections along with their counts and probabilities for both POS and NEG classes. This is shown graphically in Figure 41 and Figure 42.

$$P_i^{POS} = \frac{Count_i^{POS}}{53{,}628}$$

$$P_i^{NEG} = \frac{Count_i^{NEG}}{45{,}539}$$
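Building the per-class connection probability vector amounts to counting bonds and normalizing. A minimal sketch, with hypothetical compounds:

```python
from collections import Counter

def connection_probabilities(bond_lists):
    """Estimate one-step connection probabilities for one class.
    bond_lists: one list of (symbol, symbol) bonds per compound.
    Each bond is stored in a canonical sorted order so that 'A-B'
    and 'B-A' pool into the same connection."""
    counts = Counter()
    for bonds in bond_lists:
        for a, b in bonds:
            counts[tuple(sorted((a, b)))] += 1
    total = sum(counts.values())
    return {conn: n / total for conn, n in counts.items()}

# Hypothetical POS-class compounds described by their annotated bonds
pos_compounds = [
    [("N30", "O10"), ("N30", "O10"), ("N30", "C30")],
    [("C30", "C30"), ("N30", "O10")],
]
p_pos = connection_probabilities(pos_compounds)
# 3 of the 5 observed bonds are N30-O10, so its probability is 0.6
```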


Connection    Count (POS)   Count (NEG)   Probability (PPOS)   Probability (PNEG)   Probability ratio
C13 – C30     848           762           0.0158               0.0167               0.945
C21 – C30     12865         8068          0.2399               0.1772               1.354
C21 – C21     8202          5490          0.1529               0.1206               1.269
C30 – C30     8236          4268          0.1536               0.0937               1.639
C30 – N30     1578          664           0.0294               0.0146               2.018
N30 – O10     1700          338           0.0317               0.0074               4.271
N20 – C30     1505          870           0.0281               0.0191               1.469
C31 – O11     416           560           0.0078               0.0123               0.631
C30 – O10     1451          1855          0.0271               0.0407               0.664
C40 – Cl10    63            113           0.0012               0.0025               0.474

Table 19: One-step connections with counts and probabilities using training set for CV 1 (partial output)


Figure 41: One-step connection probabilities obtained using the training set for CV 1 (partial output)

Figure 42: One-step connection probability ratios obtained using the training set for CV 1 (partial output)


As seen from this data, there are certain connections that are predominantly observed in the Ames POS class of compounds. For example, the ‘N30 – O10’ connection is more than 4 times as likely to be observed in the Ames POS class. Thus, it can be inferred that the presence of such connections in compounds increases their likelihood of being mutagenic. On the other hand, there are other connections such as ‘C30 – O10’ and ‘C40 – Cl10’ that are predominantly observed in the Ames NEG class. The presence of these connections in compounds increases their likelihood of being non-mutagenic. There are also connections such as ‘C13 – C30’ that are roughly equally observed in both POS and NEG classes. These connections are not very helpful in distinguishing between the two classes of compounds.

After generating the connection vector along with their respective counts and probabilities, we then begin processing compounds in the test set. For each compound in the test set, we go through all the one-step connections observed in that compound and calculate a log-likelihood value based on the above-stated probabilities.

$$(\text{log-likelihood})_{compound} = \sum_{i=1}^{n}\log\!\left(P_i^{POS}\right) - \sum_{i=1}^{n}\log\!\left(P_i^{NEG}\right)$$

where n is the total number of one-step connections in the compound. If this value is calculated to be greater than or equal to 0, then the compound is predicted as Ames POS, and if it is less than 0, then the compound is predicted as Ames NEG. This calculation is shown below using the example of the 3-nitro-o-xylene molecule from the test set for CV 1.


Figure 43: 3-nitro-o-xylene molecule from test set for CV 1

As a first step, we annotate this molecule using the {AI, nC, nH} annotation scheme. After annotating the atom types, the 2D graph of this molecule looks as shown below.

Figure 44: 2D molecular graph of 3-nitro-o-xylene using {AI, nC, nH} annotation scheme

In this molecule, there are 11 bonds in total, and thus 11 one-step connections as well. These connections are numbered in red in Figure 44. For each of these one-step connections, we calculate the natural logarithm of the respective connection probabilities for both POS and NEG classes. We then add all the log-probabilities for each class and calculate the overall log-likelihood for the compound using the equation above. This is illustrated in Table 20, where the training data as shown in Table 19 has been used for calculation purposes.

No.   Connection   PPOS      PNEG      log(PPOS)   log(PNEG)
1     O10 – N30    0.0317    0.0074    -3.451      -4.906
2     N30 – O10    0.0317    0.0074    -3.451      -4.906
3     N30 – C30    0.0294    0.0146    -3.527      -4.227
4     C30 – C21    0.2399    0.1772    -1.428      -1.730
5     C21 – C21    0.1529    0.1206    -1.878      -2.115
6     C21 – C21    0.1529    0.1206    -1.878      -2.115
7     C21 – C30    0.2399    0.1772    -1.428      -1.730
8     C30 – C30    0.1536    0.0937    -1.873      -2.368
9     C30 – C30    0.1536    0.0937    -1.873      -2.368
10    C30 – C13    0.0158    0.0167    -4.148      -4.092
11    C30 – C13    0.0158    0.0167    -4.148      -4.092
                             Sum       -29.083     -34.649
                   Overall log-likelihood: 5.566

Table 20: Calculation of overall log-likelihood for 3-nitro-o-xylene molecule


As seen from Table 20, the overall log-likelihood for the 3-nitro-o-xylene molecule is calculated to be 5.566. Since this number is greater than 0, the compound is predicted to belong to the Ames POS class, and thus mutagenic. Referring back to the test set, it is found that this compound is actually mutagenic, so our prediction is correct in this case. These calculations are repeated for all molecules in the test set, and the model’s performance is evaluated by generating its confusion matrix and calculating the sensitivity, specificity, and concordance parameter values.
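The Table 20 calculation can be reproduced directly from the Table 19 probabilities. Because the probabilities below are the rounded values printed in that table, the result differs from 5.566 in the third decimal place:

```python
from math import log

# Connection probabilities from Table 19 (training set for CV 1)
P_POS = {"O10-N30": 0.0317, "N30-C30": 0.0294, "C30-C21": 0.2399,
         "C21-C21": 0.1529, "C30-C30": 0.1536, "C30-C13": 0.0158}
P_NEG = {"O10-N30": 0.0074, "N30-C30": 0.0146, "C30-C21": 0.1772,
         "C21-C21": 0.1206, "C30-C30": 0.0937, "C30-C13": 0.0167}

# The 11 one-step connections of 3-nitro-o-xylene (Table 20), with each
# connection reordered to match the canonical keys above
connections = (["O10-N30"] * 2 + ["N30-C30"] + ["C30-C21"] * 2 +
               ["C21-C21"] * 2 + ["C30-C30"] * 2 + ["C30-C13"] * 2)

loglik = (sum(log(P_POS[c]) for c in connections)
          - sum(log(P_NEG[c]) for c in connections))
print(round(loglik, 2))  # 5.57: positive, so the compound is predicted Ames POS
```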

The confusion matrix for the test set for CV 1 is shown in Table 21.

                      Predicted
                      Positive   Negative
Actual   Positive     435        149
         Negative     173        227

Table 21: Confusion matrix for test set of CV 1 using {AI, nC, nH} annotation scheme

∴ Sensitivity = TP / (TP + FN) = 435/584 = 0.745

Specificity = TN / (TN + FP) = 227/400 = 0.568

Concordance = (TP + TN) / (TP + FP + TN + FN) = 662/984 = 0.673
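These three performance parameters can be computed directly from the confusion-matrix counts. The following sketch, using the CV 1 counts from Table 21, is illustrative:

```python
def confusion_metrics(tp, fn, tn, fp):
    """Performance parameters from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)                    # true positive rate
    specificity = tn / (tn + fp)                    # true negative rate
    concordance = (tp + tn) / (tp + fp + tn + fn)   # overall accuracy
    return sensitivity, specificity, concordance

# CV 1 counts from Table 21: TP = 435, FN = 149, TN = 227, FP = 173
se, sp, co = confusion_metrics(435, 149, 227, 173)
```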


It should be noted here that if a certain atom symbol in a test compound is not observed in the training set, then that test compound is classified as being outside the model’s applicability domain and is labeled ‘NP’ (not predicted). In certain cases, a particular connection in a test compound may not be observed in the training set, but as long as both of its symbols are present in the training set, a prediction will still be made: the unobserved connection is ignored and the prediction is made using the probability values of the other connections in the molecule. It might also happen that certain connections in a test compound are observed only in Ames POS compounds in the training set. In such cases, the test compound is immediately predicted as Ames POS without taking into account the probability values of the other connections in the compound. The same applies if a test compound has a connection that is observed only in Ames NEG compounds in the training set; the test compound is then immediately predicted as Ames NEG. Finally, a test compound might have some connections observed only in Ames POS compounds and others observed only in Ames NEG compounds in the training set. In such cases, the log-likelihood cannot be computed for the compound, and it is classified as ‘NP’.
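These applicability-domain rules can be summarized in a short sketch. The function below is a hypothetical illustration (dictionary-based probability tables, hyphen-separated connection strings), not the actual implementation:

```python
import math

def predict_with_domain_rules(connections, p_pos, p_neg, train_symbols):
    """Apply the applicability-domain rules before the log-likelihood test."""
    # Rule 1: any atom symbol never seen in training -> 'NP'
    for conn in connections:
        if any(sym not in train_symbols for sym in conn.split("-")):
            return "NP"
    pos_only = [c for c in connections if c in p_pos and c not in p_neg]
    neg_only = [c for c in connections if c in p_neg and c not in p_pos]
    # Rule 2: one-sided evidence forces an immediate prediction,
    # unless both sides occur, in which case the compound is 'NP'
    if pos_only and neg_only:
        return "NP"
    if pos_only:
        return "POS"
    if neg_only:
        return "NEG"
    # Rule 3: connections unseen in training (but with known symbols)
    # are simply skipped; classify on the remaining shared connections
    shared = [c for c in connections if c in p_pos and c in p_neg]
    ll = sum(math.log(p_pos[c]) - math.log(p_neg[c]) for c in shared)
    return "POS" if ll >= 0 else "NEG"
```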

Using these rules, calculations were performed for all 5 folds of the Hansen dataset. Table 22 shows the sensitivity, specificity, and concordance values for each cross-validation fold. Note that the sensitivity value for CV 1 is 0.746 (and not 0.745 as calculated from the confusion matrix above) because one Ames POS compound was not predicted. The sensitivity was therefore calculated as follows.

Sensitivity = 435 / (584 − 1) = 435/583 = 0.746


          Sensitivity   Specificity   Concordance   No. of compounds
                                                    not predicted (NP)
CV 1         0.746         0.568         0.673              1
CV 2         0.765         0.559         0.678              0
CV 3         0.755         0.533         0.662              0
CV 4         0.781         0.498         0.662              0
CV 5         0.752         0.542         0.664              2
Average      0.760         0.540         0.668

Table 22: Performance parameters for 5-fold cross-validation of Hansen dataset using {AI, nC, nH} annotation scheme and one-step connection probabilities

As seen from these data, the preliminary results are quite promising. We were able to correctly predict 76% of the Ames POS compounds and 54% of the Ames NEG compounds, giving an overall accuracy, or concordance, of 66.8%. Though these results are not exceptionally good, they are comparable to those reported in the literature. Hansen et al. have reported the performance of three non-parametric approaches on this Ames benchmark dataset [58]. These results are shown in Table 23.

                  Sensitivity   Specificity
Pipeline Pilot       0.84          0.64
MultiCASE            0.78          0.57
DEREK                0.73          0.50

Table 23: Comparison of performance with other non-parametric methods (averaged over 5-fold cross-validation splits)

As discussed earlier, DEREK and MultiCASE are commercial software packages that are widely used in industry. The DEREK software used by Hansen et al. was based on a static set of rules derived from a largely unknown dataset and expert knowledge. The MultiCASE software (AZ2 module) used by them was based on a fixed set of mainly 2D descriptors. Pipeline Pilot, on the other hand, is a graphical scientific workflow tool developed by BIOVIA [75] with machine learning capabilities. It is also equipped with a chemical fingerprint technology that can generate different dynamic descriptors. Hansen et al. used the Bayesian categorization model in Pipeline Pilot combined with extended connectivity fingerprints (circular fragments) to obtain classification results.

The results in Table 22 show that our approach based on a preliminary version of Markov chain model using one-step connection probabilities gave better prediction results than DEREK software and comparatively similar results to those obtained using MultiCASE software.

6.2 Markov Chain Models Using Different Annotation Schemes

We then explored different annotation schemes and analyzed the performance results. The results averaged over the 5-fold cross-validation splits are shown in Table 24; the individual results for the 5 folds are included in Appendix C. Figure 45 shows an ROC plot comparing the results obtained using Markov chain models with different annotation schemes and the commercial software packages. As mentioned earlier, the closer a point is to the upper left corner of the ROC plot, the better its overall predictive performance. As seen in Figure 45, Pipeline Pilot gives the best predictive performance, followed by MultiCASE and the Markov chain models with the {AI, nC} and {AI, nC, nH, PC} annotation schemes; the predictive performance of these three methods is broadly similar.

Annotation Scheme       Sensitivity   Specificity   Concordance
{AI}                       0.653         0.616         0.638
{AI, nC}                   0.770         0.565         0.685
{AI, nC, nH}               0.760         0.540         0.668
{nC, nH, PC}               0.767         0.535         0.671
{AI, nC, nH, PC}           0.756         0.588         0.686

Table 24: Performance parameters for Hansen dataset using Markov chain models with different annotation schemes

Figure 45: ROC plot comparing performance of Markov chain models with different annotation schemes and other non-parametric approaches reported in literature


In order to further improve the performance of our preliminary Markov chain models, we adopted two different strategies. First, we increased the number of bin categories for the partial charge annotation feature to 5. Thus, partial charge (PC) was classified into 5 categories as follows.

PC bin                 Classified as        Denoted as
PC < -0.15             Strongly negative       ‘-’
-0.15 < PC < -0.05     Mildly negative         ‘n’
-0.05 < PC < 0.05      Neutral                 ‘o’
0.05 < PC < 0.15       Mildly positive         ‘p’
PC > 0.15              Strongly positive       ‘+’

Table 25: 5-bin classification of partial charge (PC) annotation feature

This helped to increase the amount of information captured by the PC annotation feature. Since partial charge is a “superior” feature, it was important to classify it into a larger number of bins so that it could capture information at higher structural resolution. The second strategy was to increase the number of annotation options available. We executed this by introducing a new annotation feature called ‘ring atom (RA)’. This annotation captures whether a given atom is part of a ring or a chain substructure. It is a single-digit binary variable with value 0 for a chain atom and 1 for a ring atom. Thus, with the new categorization of partial charge and the addition of the ring atom annotation, atom symbols under the {AI, nC, nH, PC, RA} annotation scheme are denoted as C30p1, C21n0, O11-0, etc. The predictive performance on the Ames benchmark dataset was then calculated using several different annotation schemes. The results are tabulated in Table 26 and the corresponding ROC plot is shown in Figure 46.
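The construction of such atom symbols can be illustrated with a small sketch. The function names are hypothetical, and the handling of values exactly on a bin boundary is an assumption (Table 25 uses strict inequalities, leaving boundary values unspecified):

```python
def pc_bin(pc):
    """5-bin partial-charge code from Table 25."""
    if pc < -0.15: return "-"   # strongly negative
    if pc < -0.05: return "n"   # mildly negative
    if pc < 0.05:  return "o"   # neutral
    if pc < 0.15:  return "p"   # mildly positive
    return "+"                  # strongly positive

def atom_symbol(element, n_carbon, n_hydrogen, pc, in_ring):
    """Atom symbol under the {AI, nC, nH, PC, RA} scheme, e.g. 'C30p1':
    element, carbon-neighbour count, hydrogen count, PC bin, ring flag."""
    return f"{element}{n_carbon}{n_hydrogen}{pc_bin(pc)}{int(in_ring)}"
```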


Annotation Scheme         Sensitivity   Specificity   Concordance
{AI, RA}                     0.753         0.532         0.661
{AI, nC, RA}                 0.765         0.559         0.679
{AI, nH, RA}                 0.751         0.580         0.679
{nC, nH, PC}                 0.765         0.592         0.693
{AI, nC, nH, PC}             0.760         0.609         0.697
{AI, nC, PC, RA}             0.760         0.628         0.705
{AI, nH, PC, RA}             0.757         0.622         0.701
{nC, nH, PC, RA}             0.789         0.603         0.711
{AI, nC, nH, PC, RA}         0.795         0.601         0.714

Table 26: Performance parameters for Hansen dataset using different annotation schemes (with PC binned into 5 categories and ring atom (RA) annotation added)

Figure 46: ROC plot comparing performance of Markov chain models with different annotation schemes (PC binned into 5 categories and ring atom (RA) annotation added)

As seen from these results, the predictive performance of the Markov chain models improved substantially. In particular, the {nC, nH, PC, RA} and {AI, nC, nH, PC, RA} annotation schemes yielded better prediction results than those obtained using the DEREK and MultiCASE software packages (refer Table 23). Thus, we can conclude that it was beneficial to categorize partial charge into 5 categories and to add a new ring annotation option.

It should be noted here that although the predictions obtained using the two annotation schemes mentioned above are quite similar, considerably fewer compounds were left unpredicted (NP) using the {nC, nH, PC, RA} annotation scheme (refer Appendix C for the actual results). Thus, the {nC, nH, PC, RA} scheme appears to be the best for modeling Ames mutagenicity data, with very few compounds falling outside the model’s applicability domain.

6.3 Markov Chain Models Using Fragments of Longer Lengths

We then explored Markov chain models with fragments of longer lengths in an effort to increase the predictive performance of our models. These models are still based on one-step connection probabilities; however, they also include information about longer fragments from the compound-fragment data matrix. The following calculations explain how a prediction is made on a test compound, using 3-nitro-o-xylene (refer Figure 43) as an example.

As a first step, we generate a compound-fragment data matrix using all compounds in the test set, depending on the fragment lengths desired and the annotation scheme selected by the user. Suppose we want to generate fragments of length 3 using the {AI, nC, nH} annotation scheme. Then, for the 3-nitro-o-xylene molecule, 10 unique fragments are identified. One of these fragments is [‘C30’, ‘C30’, ‘C21’]. The sequence probability of this fragment is called its likelihood.

The likelihood of this fragment is the product of its one-step connection probabilities; taking the natural logarithm gives the log-likelihood:

likelihood = p(C30 – C30) × p(C30 – C21)

log-likelihood = log p(C30 – C30) + log p(C30 – C21)

Referring back to the one-step connection probabilities calculated using the training set (Table 19), we obtain the log-likelihood of this fragment under both the Ames POS and NEG scenarios:

(log-likelihood)POS = log(0.1536) + log(0.2399) = −3.301

Similarly,

(log-likelihood)NEG = log(0.0937) + log(0.1772) = −4.098

The overall log-likelihood of the fragment is then calculated as the difference:

(log-likelihood)fragment = (log-likelihood)POS − (log-likelihood)NEG = −3.301 − (−4.098) = 0.797
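This fragment-level calculation can be sketched as follows, using the probabilities quoted above; the data structures (a fragment as a list of atom symbols, probabilities keyed by hyphen-separated connection strings) are illustrative:

```python
import math

def fragment_log_likelihood(fragment, p_pos, p_neg):
    """Overall log-likelihood of a linear fragment, treated as a chain
    of one-step connections (first-order Markov assumption)."""
    steps = [f"{a}-{b}" for a, b in zip(fragment, fragment[1:])]
    ll_pos = sum(math.log(p_pos[s]) for s in steps)
    ll_neg = sum(math.log(p_neg[s]) for s in steps)
    return ll_pos - ll_neg

# Fragment 1 of Table 27, with probabilities from Table 19:
ll = fragment_log_likelihood(["C30", "C30", "C21"],
                             {"C30-C30": 0.1536, "C30-C21": 0.2399},
                             {"C30-C30": 0.0937, "C30-C21": 0.1772})
```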

This calculation is repeated for all the unique fragments identified in the test compound. The overall log-likelihood for the compound is then calculated by summing the log-likelihoods of all its fragments. If this overall log-likelihood is greater than or equal to 0, the compound is predicted as Ames POS; otherwise it is predicted as Ames NEG. Table 27 shows this calculation for the 3-nitro-o-xylene molecule using the 10 unique fragments identified.

                              Log-likelihood          Overall
No.   Fragment              Positive    Negative    log-likelihood
 1    C30 - C30 - C21        -3.301      -4.098         0.797
 2    C13 - C30 - C21        -5.576      -5.822         0.246
 3    C13 - C30 - C30        -6.021      -6.460         0.439
 4    C30 - C21 - C21        -3.306      -3.845         0.539
 5    C30 - C30 - C30        -3.746      -4.736         0.990
 6    C21 - C21 - C21        -3.756      -4.230         0.474
 7    C21 - C30 - N30        -4.955      -5.957         1.002
 8    C30 - N30 - O10        -6.978      -9.133         2.155
 9    C30 - C30 - N30        -5.400      -6.595         1.195
10    O10 - N30 - O10        -6.902      -9.812         2.910
      Overall log-likelihood                           10.747

Table 27: Calculation of overall log-likelihood for 3-nitro-o-xylene molecule using fragments of length 3

As seen from Table 27, the overall log-likelihood for the 3-nitro-o-xylene molecule is calculated to be 10.747. Since this number is greater than 0, the compound is predicted to be Ames POS. Referring back to the calculations done using one-step connections in Table 20, the overall log-likelihood for 3-nitro-o-xylene was found to be 5.566. Thus, using fragments of longer lengths gives a higher overall likelihood, and thus much stronger evidence for classifying a compound into the Ames POS category. Table 28 shows the prediction results using fragments of lengths 2, 3, and 4 for a selected set of three annotation schemes. Figure 47 shows a graphical comparison of the prediction results.

Annotation Scheme       Sensitivity   Specificity   Concordance

Fragment length 2
{AI, nC, nH}               0.640         0.683         0.658
{AI, nC, nH, PC}           0.739         0.638         0.697
{nC, nH, PC, RA}           0.757         0.648         0.711

Fragment length 3
{AI, nC, nH}               0.724         0.556         0.653
{AI, nC, nH, PC}           0.770         0.574         0.688
{nC, nH, PC, RA}           0.785         0.582         0.700

Fragment length 4
{AI, nC, nH}               0.762         0.486         0.646
{AI, nC, nH, PC}           0.793         0.532         0.684
{nC, nH, PC, RA}           0.807         0.531         0.692

Table 28: Performance parameters for Hansen dataset using Markov chain models with fragments of longer lengths


Figure 47: Graphical comparison of performance parameters for Hansen dataset obtained using Markov chain models with fragments of longer lengths


As seen from these results, considering fragments of longer lengths does not necessarily improve the predictive performance of the Markov chain models. In fact, there is a slight deterioration in the overall performance: the sensitivity increases slightly with increasing fragment length, but the specificity drops significantly, decreasing the overall predictive performance of the model.

The prominent reason for this seems to be the overlap between different fragments of longer lengths. For example, considering fragments of length 3, there were 10 unique fragments identified in the 3-nitro-o-xylene molecule. This gives rise to a total of 30 one-step connections, as opposed to the 11 obtained using the purely one-step connection approach (refer Table 20). This increases the redundant information and biases the prediction of a molecule towards the Ames POS category. Thus, more and more compounds that should have been predicted as Ames NEG begin to be predicted as Ames POS. It was also found that the models developed using fragments of longer lengths failed to yield a prediction for more compounds (classified as ‘NP’) because some compounds in the test set are relatively small.

Thus, it can be concluded that the preliminary model developed using the one-step connection approach yields the best predictive performance within the family of Markov chain models. This is because it takes each one-step connection in a molecule into account exactly once; there is no overlap between fragments and thus no redundant or repetitive information. It was also found that the choice of annotation scheme had a significant impact on the performance of the model: the models developed using the {AI, nC, PC, RA} and {nC, nH, PC, RA} annotation schemes performed exceedingly well (refer Table 26). Finally, we found that considering fragments of longer lengths in the model was not of much help from the modeling standpoint; in fact, the overall performance of the model deteriorated slightly with increasing fragment length.


6.4 Markov Chain Models with Improved Specificity

This section focuses on the efforts taken to increase the specificity of the preliminary Markov chain models developed using the one-step connection approach. As observed in the results of the previous sections, the sensitivity of a model is significantly greater than its specificity. One of the main reasons for this is presumed to be the natural bias in the training sets used to develop these models. As described in the Training Datasets section (3.1) of the Datasets and Computational Tools chapter, the training sets contained significantly more compounds in the Ames POS class than in the Ames NEG class. Thus, it is likely that the model will predict Ames POS compounds in the test sets with greater accuracy than Ames NEG compounds.

Since we were already able to achieve reasonably high sensitivity values (true positives), it would help tremendously if we could adjust some classification parameters of the model to improve its specificity (true negatives). We therefore developed a new classification strategy. The current strategy classifies a compound as Ames POS if its overall log-likelihood is greater than or equal to 0, and as Ames NEG otherwise. If we change this criterion so that a compound is classified as Ames POS only if its overall log-likelihood is greater than or equal to some value x, and as Ames NEG otherwise, then we might be able to increase the specificity of the model with minimal penalty on sensitivity. In order to determine an optimum value for x, we analyzed the distribution of log-likelihood values for those compounds that were incorrectly predicted as Ames POS (false positives) and correctly predicted as Ames POS (true positives). Figure 48 and Figure 49 show the distributions of log-likelihood values obtained using a preliminary Markov chain model on the first cross-validation (CV 1) fold of the Hansen dataset with the {nC, nH, PC, RA} annotation scheme.
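The modified decision rule amounts to a one-line change of the classification cut-off, sketched here for illustration:

```python
def classify_with_threshold(log_likelihood, x=0.0):
    """Decision rule with an adjustable cut-off x: raising x above 0
    turns borderline POS calls into NEG, trading a little sensitivity
    for higher specificity."""
    return "POS" if log_likelihood >= x else "NEG"
```

For example, a compound with an overall log-likelihood of 0.3 is predicted POS under the original rule (x = 0) but NEG under x = 0.5.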


Figure 48: Distribution of log-likelihood values for compounds predicted as false positives using a preliminary Markov chain model on the test set for CV 1 and using {nC, nH, PC, RA} annotation scheme

Figure 49: Distribution of log-likelihood values for compounds predicted as true positives using a preliminary Markov chain model on the test set for CV 1 and using {nC, nH, PC, RA} annotation scheme

As seen in Figure 48, there were 12 compounds falsely predicted as Ames positive with a log-likelihood value of less than 0.25, and 23 compounds falsely predicted as Ames positive with a cumulative log-likelihood value of less than 0.5. Similarly, from Figure 49 we can see that the corresponding numbers of compounds truly predicted as Ames positive were 7 and 13, respectively. Thus, if the x value for classification is chosen to be 0.5, the model will predict 23 additional compounds as Ames NEG at the penalty of only 13 compounds falsely predicted as Ames NEG.

We therefore executed the model with the new classification criterion (x = 0.5) over the 5-fold cross-validation splits of the Hansen dataset. The average values of sensitivity, specificity, and concordance were calculated to be 0.766, 0.635, and 0.711, respectively. Compared with the classification results obtained previously (refer Table 26), there was an increase in the specificity of the model with a slight penalty on its sensitivity. However, since the model results were averaged over the 5-fold cross-validation splits, the gain in specificity balanced the loss in sensitivity, and the overall concordance of the model remained the same.

We performed a similar analysis for Markov chain models developed using different annotation schemes. With {AI, nC, nH, PC, RA} annotation scheme, we did find an improvement in the overall predictive performance of the model. The sensitivity, specificity, and concordance values for the new model were 0.772, 0.637, and 0.716 respectively as compared to 0.795, 0.601, and 0.714 obtained previously (refer Table 26). Thus, there was a slight overall improvement in the model’s performance due to a higher gain in specificity as compared to the loss in sensitivity.

In any case, since this improvement in performance is not very significant and many test compounds are found to be outside the model’s applicability domain, we decided to retain the model developed previously using the {nC, nH, PC, RA} annotation scheme with a classification criterion of x = 0.

We then developed additional Markov chain models by increasing the resolution of the information captured by the ring atom (RA) annotation option. Initially, RA was a binary variable assigned a value of ‘1’ if the atom was part of a ring, and ‘0’ otherwise. Even though this information is helpful by itself, we felt the need to incorporate additional information, such as whether the atom is part of a 5-membered ring, a 6-membered ring, or both. In order to capture this information, we classified the RA annotation option into 5 categories as follows.

Condition                                         Value
If chain atom                                       0
If atom part of 5-membered ring                     5
If atom part of 6-membered ring                     6
If atom part of both 5- and 6-membered rings        7
If atom part of ring of any other size              4

Table 29: 5-level categorization of the ring atom (RA) annotation feature
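This categorization can be expressed as a small function. How an atom belonging to, say, both a 5- and a 7-membered ring should be coded is not specified above, so the precedence used below (5- and 6-membered rings take priority over "other") is an assumption:

```python
def ra_category(ring_sizes):
    """5-level ring-atom code of Table 29.

    ring_sizes: set of the sizes of the rings containing the atom
    (empty set for a chain atom).
    """
    if not ring_sizes:
        return 0                      # chain atom
    in5, in6 = 5 in ring_sizes, 6 in ring_sizes
    if in5 and in6:
        return 7                      # in both a 5- and a 6-membered ring
    if in5:
        return 5
    if in6:
        return 6
    return 4                          # ring of any other size
```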

Preliminary Markov chain models based on the one-step connection approach were then developed using this updated annotation option. The performance parameters on the Hansen dataset using the {nC, nH, PC, RA} annotation scheme were found to be 0.793 (sensitivity), 0.623 (specificity), and 0.723 (concordance). Thus, there was a significant improvement in performance compared with the classification results obtained previously using the binary RA annotation option (refer Table 26). It was therefore beneficial to categorize the RA annotation option into 5 categories as described above. Figure 50 shows an ROC plot comparing the performance of this best Markov chain model with the results of other non-parametric approaches reported in the literature by Hansen et al. (refer Table 23).


Figure 50: ROC plot comparing performance of the best Markov chain model developed using {nC, nH, PC, RA} annotation scheme with a previous Markov chain model developed using {AI, nC, nH} annotation scheme and other non-parametric approaches reported in the literature

As seen here, the performance of the Markov chain model using the {nC, nH, PC, RA} annotation scheme exceeds that of the DEREK and MultiCASE software packages. This performance is also much better than that obtained using the first Markov chain model with the {AI, nC, nH} annotation scheme (refer Table 22). Comparing the two models, we find an improvement of 0.033 (4.34%) in sensitivity and 0.083 (15.37%) in specificity, with an overall improvement of 0.055 (8.23%) in concordance. It should also be noted that very few chemicals are not predicted (NP) using the Markov chain approach. In fact, the proportion of compounds not predicted (due to being outside the model’s applicability domain) for the {nC, nH, PC, RA} annotation scheme is well below 2%, giving a coverage of at least 98% for all test sets of the 5-fold cross-validation splits.


6.5 Additional Results Using Annotated Linear Fragments Coupled with kNN Models

In this section, we describe the results obtained on the Hansen dataset using annotated linear fragments coupled with kNN (k-nearest neighbors) modeling methods. kNN is a classical modeling technique that has been successfully applied on various datasets with a wide variety of descriptors. The application of kNN models provides a means to validate the performance of our novel descriptors and allows the use of fragments of longer lengths in the analysis.

In order to apply kNN models, we first generate a compound-fragment data matrix using all compounds in the training set, based on the annotation scheme selected and the fragment lengths chosen. We then go through each compound in the test set, generate its corresponding fingerprint vector, and map it onto the fragment space generated from the training set. We then use the Tanimoto distance metric to compute the distances of the test compound from all compounds in the training set. Based on these distances, the k (typically 5 or 7) training compounds with the smallest distance to the test compound are identified. These k compounds are called the nearest neighbors and are the most similar to the test compound in the descriptor space generated (in our case, the descriptor space depends on the annotation scheme and fragment lengths chosen for analysis).

The activity of the test compound is then predicted by considering the activities of its k nearest neighbors. The simplest approach is for each nearest neighbor to “vote” for a certain outcome, and the test compound is classified according to the majority vote. Suppose k is 5: if 3 or more nearest neighbors are mutagenic (Ames POS), the test compound is predicted as Ames POS; similarly, if 3 or more neighbors are non-mutagenic (Ames NEG), the test compound is predicted as Ames NEG.
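The distance calculation and majority vote can be sketched as follows. Fingerprints are represented here as sets of "on" fragment indices, which is an illustrative choice rather than the actual implementation:

```python
def tanimoto_distance(a, b):
    """Tanimoto distance between two binary fingerprints, represented
    as sets of 'on' fragment indices."""
    union = len(a | b)
    return 1.0 if union == 0 else 1.0 - len(a & b) / union

def knn_predict(test_fp, train_fps, train_labels, k=5):
    """Majority vote over the k nearest training compounds;
    labels are 1 for Ames POS and 0 for Ames NEG."""
    dists = [(tanimoto_distance(test_fp, fp), lab)
             for fp, lab in zip(train_fps, train_labels)]
    dists.sort(key=lambda t: t[0])            # nearest first
    votes = sum(lab for _, lab in dists[:k])  # count POS votes among the k
    return 1 if votes > k // 2 else 0
```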


It should be noted here that the kNN method is computationally more demanding than the Markov chain approach described earlier. For each test compound, we need to calculate its pairwise distance to all compounds in the training set and then select the k closest neighbors. Thus, if there are m compounds in the training set and n compounds in the test set, a total of m × n pairwise distances must be calculated. These distances are then sorted and the k nearest neighbors identified. Typically, m is very large compared to k (m is generally in the thousands, while k is 5 or 7), rendering most of the calculated distances of no further use.

In the discussion that follows, we provide an example of how the kNN model was applied in our analysis. We used the Ames mutagenicity dataset compiled by Hansen et al. with the 5-fold cross-validation scheme provided by them for analyzing the performance of our novel descriptors.

We consider compounds in the first cross-validation (CV 1) fold for demonstration purposes along with an annotation scheme of {AI, nC, nH} and fragments of length 3.

As a first step, we generate a compound-fragment data matrix using all compounds in the training set. The CV 1 fold in the Hansen dataset has 5528 compounds assigned to the training set (refer Table 1). Using this structural information, 1086 unique fragments of length 3 were identified. We then removed the singleton and doubleton fragments, which reduced the number of unique fragments to 681. Thus, the descriptor space generated has a dimension of 681, with each dimension taking a binary value of either 0 or 1.

Now, consider the 3-nitro-o-xylene molecule in the test set for CV 1. In order to make a prediction on this compound, we first generate its fingerprint vector. As discussed in a previous section, using an annotation scheme of {AI, nC, nH}, there are 10 unique fragments of length 3 in the 3-nitro-o-xylene molecule. These fragments are identified in the 681-dimension space and the corresponding bits are marked as 1. This enables the calculation of Tanimoto distance metric between 3-nitro-o-xylene and all compounds in the training set.
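The fingerprint construction can be sketched as follows; the mapping from fragment to bit position (`fragment_index`) is an illustrative stand-in for the 681-dimensional fragment space built from the training set:

```python
def fingerprint(fragments, fragment_index):
    """Binary fingerprint of a compound over the training-set fragment
    space; fragments not observed in training are simply ignored."""
    bits = [0] * len(fragment_index)
    for frag in fragments:
        idx = fragment_index.get(frag)
        if idx is not None:           # skip fragments unseen in training
            bits[idx] = 1
    return bits
```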


It should be noted here that if any fragment in the test compound is not observed in the descriptor space generated by the training set, that fragment is simply ignored. Thus, only those fragments in a test compound that were actually observed in the training set are considered for the distance calculations. All these distances were calculated for the 3-nitro-o-xylene molecule, and their distribution is shown in Figure 51.

Figure 51: Distribution of Tanimoto distances of 3-nitro-o-xylene from all compounds in the training set for CV 1

As seen from this figure, most compounds in the training set are very distant from the 3-nitro-o-xylene molecule in the test set. However, we are interested in identifying only the k nearest neighbors, i.e., those with the least distance from the test molecule under consideration. Table 30 shows the 5 nearest neighbors identified for the 3-nitro-o-xylene molecule from the training set for CV 1, along with their chemical structures, Tanimoto distances, and mutagenic activities.


Nearest Neighbor (NN)   Tanimoto distance   Mutagenic activity
NN #1                        0.0                    1
NN #2                        0.0                    1
NN #3                        0.0                    1
NN #4                        0.091                  1
NN #5                        0.091                  1

(The chemical structures of the test compound and its neighbors are depicted in the original table.)

Table 30: Five nearest neighbors identified for the 3-nitro-o-xylene molecule from the training set for CV 1

It can be seen from Table 30 that the nearest neighbors identified from the training set have chemical structures very similar to that of the test compound. In this case, all 5 nearest neighbors have a mutagenic activity of ‘1’, indicating that they are all Ames positive. Since the number of mutagenic nearest neighbors is greater than 3, we classify the test compound as belonging to the Ames POS category. Referring back to the test set, we find that the prediction is correct in this case.

This procedure is then repeated for all compounds in the test set, and a prediction is made for each test compound. A 2×2 confusion matrix is then generated, and the performance parameters of sensitivity, specificity, and concordance are computed as described earlier. Using fragments of length 3 and the {AI, nC, nH} annotation scheme, the performance parameters for the CV 1 fold were found to be 0.793, 0.626, and 0.725, respectively. We then performed these calculations for all 5 folds of the cross-validation scheme and calculated the average performance parameters. These values are shown in Table 31.

          Sensitivity   Specificity   Concordance   No. of compounds
                                                    not predicted (NP)
CV 1         0.793         0.626         0.725              8
CV 2         0.820         0.676         0.759              8
CV 3         0.836         0.662         0.763              6
CV 4         0.820         0.636         0.742              1
CV 5         0.823         0.652         0.751              6
Average      0.818         0.650         0.748

Table 31: Performance parameters for 5-fold cross-validation of Hansen dataset using kNN modeling method with 5 nearest neighbors and fragments of length 3 generated using {AI, nC, nH} annotation scheme

It should be noted here that several test compounds were classified as ‘NP’ due to being outside the model’s applicability domain. A compound was considered as such if any one of its nearest neighbors was at a distance of 1.0 from it; thus, a prediction was made on a test compound only if all its nearest neighbors were at a distance of less than 1.0. This exercise was then repeated with different annotation schemes and fragment lengths. Table 32 shows the results obtained using a selected set of 3 annotation schemes and 3 fragment lengths. The actual results for the 5-fold cross-validation splits are included in Appendix C. Figure 52 shows a graphical comparison of the prediction results.
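The applicability-domain check described above reduces to a single condition, sketched here for illustration:

```python
def knn_in_domain(neighbor_distances):
    """A prediction is made only if every one of the k nearest neighbors
    lies at a Tanimoto distance strictly below 1.0; otherwise the test
    compound is labeled 'NP' (not predicted)."""
    return all(d < 1.0 for d in neighbor_distances)
```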

Annotation Scheme         Sensitivity   Specificity   Concordance

Fragment length 2
{AI, nC, nH}                 0.788         0.663         0.736
{AI, nC, nH, PC}             0.800         0.651         0.738
{AI, nC, nH, PC, RA}         0.799         0.662         0.741

Fragment length 3
{AI, nC, nH}                 0.818         0.650         0.748
{AI, nC, nH, PC}             0.805         0.666         0.747
{AI, nC, nH, PC, RA}         0.799         0.669         0.745

Fragment length 4
{AI, nC, nH}                 0.819         0.640         0.745
{AI, nC, nH, PC}             0.800         0.651         0.738
{AI, nC, nH, PC, RA}         0.800         0.664         0.744

Table 32: Averaged performance parameters for Hansen dataset using kNN modeling method with 5 nearest neighbors and fragments of different lengths generated using a selected set of 3 different annotation schemes

Figure 52: Graphical comparison of performance parameters for Hansen dataset obtained using kNN modeling method with 5 nearest neighbors and fragments of different lengths generated using a selected set of 3 different annotation schemes


As seen from these results, the performance of the kNN models does not improve significantly when more annotation options are added to the annotation scheme. The overall concordance of the model stays around 0.74 for all 3 annotation schemes shown above. Thus, the addition of the PC and RA annotation features is not particularly helpful from the kNN modeling standpoint. It can also be seen from the above results that considering fragments of longer lengths in the analysis does not significantly improve the predictive performance of the model.

In fact, kNN models with fragments of length 4 were observed to perform consistently worse than those developed with fragments of length 3. The model developed using fragments of length 3 with the {AI, nC, nH} annotation scheme gave the best overall predictive performance, with a sensitivity of 0.818, specificity of 0.650, and concordance of 0.748. Comparing these results to those obtained using the best Markov chain model (refer section 6.4), we find that the kNN models give better sensitivities as well as specificities for this classification problem.

It should be noted that even though the kNN models yield better predictive performance, they are computationally much more demanding than the Markov chain models developed in this research. This is because kNN is a local method, in which a model is effectively rebuilt around each test compound, whereas Markov chains are global models that are pre-built from the structural information contained in the training set. Consider the CV 1 fold of the Hansen dataset as an example. In the Markov chain approach, the model is first developed by processing the 5528 compounds in the training set; a prediction is then made on each test compound by calculating its likelihood from the one-step connection probabilities. In the kNN method, on the other hand, the pairwise distance from each test compound to every compound in the training set must be calculated before the nearest neighbors can be identified. Thus, for the CV 1 fold of the Hansen dataset with 984 compounds in the test set, a total of 5528 × 984 (≈ 5.44 million) distances need to be calculated before predictions can be made on all the test compounds.
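The distance step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's actual implementation: each compound is assumed to be represented by the set of its annotated linear fragments, and the Tanimoto distance (1 − |A∩B| / |A∪B|) is used, consistent with the Tanimoto distances reported in Table 33. All function and variable names are hypothetical.

```python
def tanimoto_distance(frags_a, frags_b):
    """Tanimoto distance between two fragment sets: 1 - |A∩B| / |A∪B|."""
    if not frags_a and not frags_b:
        return 0.0
    inter = len(frags_a & frags_b)
    union = len(frags_a | frags_b)
    return 1.0 - inter / union

def nearest_neighbors(test_frags, training, k=5):
    """Return the k (distance, activity) pairs closest to the test compound.

    `training` is a list of (fragment_set, activity) pairs; for the
    CV 1 fold of the Hansen dataset this loop runs over all 5528
    training compounds for each of the 984 test compounds.
    """
    dists = [(tanimoto_distance(test_frags, frags), act)
             for frags, act in training]
    return sorted(dists)[:k]
```

Running `nearest_neighbors` once per test compound is where the roughly 5.44 million distance calculations per CV fold come from.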


This is computationally very intensive, especially when fragments of longer lengths are considered. For fragments of length 2 with the {AI, nC, nH} annotation scheme, the processing time for the kNN models was about 5 minutes per CV fold; with the additional annotation options of the {AI, nC, nH, PC, RA} scheme, it increased to about 10 minutes. For fragments of length 3, the corresponding times were about 15 and 45 minutes, and for fragments of length 4, about 25 and 90 minutes, respectively. Thus, for the kNN models developed using fragments of length 4 and the {AI, nC, nH, PC, RA} annotation scheme, it took about 450 minutes (7.5 hours) to obtain prediction results for all 5 folds of the cross-validation scheme.

The Markov chain approach, on the other hand, gave prediction results for all 5 folds in under 30 minutes for the same fragment length and annotation scheme.

We then focused on improving the predictive performance of the kNN models. The results above showed that adding annotation options and considering longer fragments did not help much, which ran counter to our expectation that performance would improve with richer annotation. We therefore developed weighted kNN models in which the "vote" of each nearest neighbor was weighted by its distance to the test compound under consideration. We expected that this weighting would bias the vote toward closer neighbors and thus better capture the effect of additional annotation options and fragment lengths.

For example, consider a hypothetical test compound with 5 nearest neighbors identified in the training set. Table 33 shows the distances of these neighbors along with their mutagenic activities.


            NN #1   NN #2   NN #3   NN #4   NN #5
Distance     0.10    0.15    0.85    0.90    0.95
Activity        0       0       1       1       1

Table 33: Tanimoto distance and mutagenic activity of the 5 nearest neighbors identified in the training set for a hypothetical test compound

As seen here, 3 of the 5 nearest neighbors are mutagenic, so a standard (unweighted) kNN model would predict the test compound as Ames POS. In the weighted-distance approach, however, a score is calculated as follows: if the score is greater than 0.5, the compound is classified as Ames POS, and Ames NEG otherwise.

Score = [ Σ_{i=1..k} (Activity_i / Distance_i) ] / [ Σ_{i=1..k} (1 / Distance_i) ]

For the neighbors in Table 33:

Score = (0/0.10 + 0/0.15 + 1/0.85 + 1/0.90 + 1/0.95) / (1/0.10 + 1/0.15 + 1/0.85 + 1/0.90 + 1/0.95)
      = 3.34 / 20.01
      = 0.167

Since this score is less than 0.5, the test compound is classified as Ames NEG. Thus, even though only 2 of the neighbors are non-mutagenic, because they are much closer to the test compound than the 3 mutagenic neighbors, the prediction is biased toward the non-mutagenic category.
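The worked example above can be reproduced with a short function. This is a sketch of the scoring rule as stated in the text, including the 0.05 floor applied to zero distances mentioned later in this section; the function name is hypothetical.

```python
def weighted_knn_score(distances, activities, min_dist=0.05):
    """Distance-weighted vote: Score = sum(a_i/d_i) / sum(1/d_i).

    Distances of 0.0 are floored at `min_dist` (0.05) to prevent
    division by zero, as described in the text.
    """
    inv = [1.0 / max(d, min_dist) for d in distances]
    num = sum(a * w for a, w in zip(activities, inv))
    return num / sum(inv)

# Worked example from Table 33: two close non-mutagenic neighbors
# outvote three distant mutagenic ones.
score = weighted_knn_score([0.10, 0.15, 0.85, 0.90, 0.95], [0, 0, 1, 1, 1])
prediction = "Ames POS" if score > 0.5 else "Ames NEG"  # score ≈ 0.167 → "Ames NEG"
```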


Table 34 shows the classification results obtained using the weighted kNN approach for 3 different fragment lengths and a selected set of 3 different annotation schemes. These results are shown graphically in Figure 53.

Annotation Scheme            Sensitivity   Specificity   Concordance

Fragment length 2
{AI, nC, nH}                    0.787         0.670         0.738
{AI, nC, nH, PC}                0.801         0.668         0.746
{AI, nC, nH, PC, RA}            0.802         0.684         0.753

Fragment length 3
{AI, nC, nH}                    0.815         0.655         0.748
{AI, nC, nH, PC}                0.799         0.676         0.748
{AI, nC, nH, PC, RA}            0.797         0.688         0.751

Fragment length 4
{AI, nC, nH}                    0.818         0.646         0.747
{AI, nC, nH, PC}                0.797         0.658         0.739
{AI, nC, nH, PC, RA}            0.803         0.672         0.749

Table 34: Averaged performance parameters for Hansen dataset using weighted kNN modeling method with 5 nearest neighbors and fragments of different lengths generated using a selected set of 3 different annotation schemes


Figure 53: Graphical comparison of performance parameters for Hansen dataset obtained using weighted kNN modeling method with 5 nearest neighbors and fragments of different lengths generated using a selected set of 3 different annotation schemes


As seen from these results, the weighted approach showed a considerable increase in performance with increasing annotation options, although no significant improvement was observed with increasing fragment lengths. The best performance was obtained with fragments of length 2 and the {AI, nC, nH, PC, RA} annotation scheme, giving sensitivity, specificity, and concordance values of 0.802, 0.684, and 0.753 respectively. These results are considerably better than those obtained using the unweighted kNN models. Note that if a neighbor was found at a distance of 0.0 from the test compound, the distance was corrected to 0.05 to prevent division by zero.

We then built weighted kNN models with 7 nearest neighbors, expecting that considering more neighbors would improve the predictive performance of the model. The same classification criterion was used: if the metric score was greater than 0.5, the compound was predicted as Ames POS, and Ames NEG otherwise. If any of the 7 nearest neighbors was at a distance of 1.0 from the test compound, the test compound was considered to be outside the model's applicability domain and was predicted as 'NP'. Table 35 shows the classification results obtained using the weighted kNN approach with 7 nearest neighbors for 3 different fragment lengths and a selected set of 3 different annotation schemes. These results are shown graphically in Figure 54.

As seen from these results, there was no significant improvement in predictive performance with increasing annotation options or fragment lengths; the concordance of all models remained around 0.745. These results are also comparable to those obtained using 5 nearest neighbors, so considering 7 nearest neighbors offers no significant benefit. It also increased the number of test compounds that fell outside the model's applicability domain and were therefore predicted as 'NP'.
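The 7-neighbor variant with the applicability-domain check can be sketched as follows. This is an illustrative reconstruction of the rules stated in the text (distance-weighted vote, 0.5 threshold, 0.05 floor on zero distances, and an 'NP' prediction when any neighbor is at Tanimoto distance 1.0); the names are hypothetical.

```python
def predict_weighted_knn(neighbors, min_dist=0.05):
    """Classify a test compound from its (distance, activity) neighbor pairs.

    Returns 'NP' (not predicted) if any neighbor is at Tanimoto
    distance 1.0, i.e. the compound lies outside the applicability
    domain; otherwise applies the distance-weighted vote with a
    0.5 threshold.
    """
    if any(d >= 1.0 for d, _ in neighbors):
        return "NP"
    inv = [1.0 / max(d, min_dist) for d, _ in neighbors]
    num = sum(act * w for (_, act), w in zip(neighbors, inv))
    score = num / sum(inv)
    return "Ames POS" if score > 0.5 else "Ames NEG"
```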


Annotation Scheme            Sensitivity   Specificity   Concordance

Fragment length 2
{AI, nC, nH}                    0.797         0.673         0.745
{AI, nC, nH, PC}                0.813         0.655         0.747
{AI, nC, nH, PC, RA}            0.812         0.668         0.752

Fragment length 3
{AI, nC, nH}                    0.825         0.643         0.749
{AI, nC, nH, PC}                0.811         0.665         0.750
{AI, nC, nH, PC, RA}            0.807         0.674         0.752

Fragment length 4
{AI, nC, nH}                    0.820         0.643         0.746
{AI, nC, nH, PC}                0.800         0.662         0.743
{AI, nC, nH, PC, RA}            0.806         0.659         0.745

Table 35: Averaged performance parameters for Hansen dataset using weighted kNN modeling method with 7 nearest neighbors and fragments of different lengths generated using a selected set of 3 different annotation schemes


Figure 54: Graphical comparison of performance parameters for Hansen dataset obtained using weighted kNN modeling method with 7 nearest neighbors and fragments of different lengths generated using a selected set of 3 different annotation schemes


Thus, it can be concluded that the best kNN model was the one developed with fragments of length 2 generated using the {AI, nC, nH, PC, RA} annotation scheme, weighting the 5 nearest neighbors by their distance from the test compound. Figure 55 shows an ROC plot summarizing the results obtained with the best kNN model and the best Markov chain model and compares them to other non-parametric approaches reported in the literature. The kNN and Markov chain models are denoted kNN-ALF and Markov-ALF respectively, where ALF stands for annotated linear fragments.

As seen from these results, both the kNN-ALF and Markov-ALF models perform significantly better than DEREK and MultiCASE. The kNN-ALF model also gives considerably higher specificity than Pipeline Pilot, although its sensitivity is slightly lower. Thus, both the kNN-ALF and Markov-ALF models give results that are significantly better than, or comparable to, other non-parametric approaches reported in the literature.

Figure 55: ROC plot comparing the performance of the best kNN model and the best Markov chain model developed using annotated linear fragments with other non-parametric approaches reported in the literature

CHAPTER 7: CONCLUDING REMARKS

7.1 Summary

We began this research with the goal of developing novel linear descriptors for use in chemoinformatics applications and subsequent Markov chain modeling methods for classification. Over the course of the research, we successfully developed novel linear descriptors that capture not only the linear connectivity between atoms but also "superior" atomic information such as partial charge and ring annotation. These descriptors were found to reduce the dimension of the descriptor space compared to some other methods reported in the literature. They also gave meaningful interpretations, as it was possible to convert them back into chemical paths and identify them in actual chemical structures.

The final version of the algorithm for generating these descriptors was equipped with five annotation options: atom identity (AI), number of heavy-atom connections (nC), number of attached hydrogen atoms (nH), partial charge on the atom (PC), and ring annotation (RA). The Python scripts for this algorithm were written to facilitate the easy addition of new annotation features, which simplified the later addition of the RA option.

The Python scripts were also seamlessly integrated into an Excel interface for user-friendly display and output. This facilitated the generation of the compound-fragment data matrix and its storage in an Excel or text file, and allowed repeated processing of a dataset with different annotation schemes and fragment lengths.


Using these novel descriptors and several statistical tests, we were able to identify potential structural alerts from sets of compounds with known toxicity outcomes. We identified 15 structural alerts from the skin sensitization dataset containing 467 compounds, whose compounds were classified into 4 categories depending on their relative skin sensitizing potency. Most of the identified alerts were already known to cause skin sensitizing effects and had been reported in the literature, demonstrating the efficacy of our novel descriptors. Similarly, we identified 12 structural alerts from the Ames mutagenicity dataset containing 984 compounds with a binary toxicity outcome.

We then developed Markov chain models to explore the information contained in the annotated linear fragments for modeling Ames mutagenicity. These models were initially based on one-step connection probabilities and were later extended to include fragments of longer lengths. Analysis of the performance of these models on the benchmark dataset compiled by Hansen et al. showed that using longer fragments did not help; rather, the predictive performance deteriorated slightly with increasing fragment length.

The best Markov model was found to be the one developed using preliminary one-step connection probability approach with {nC, nH, PC, RA} annotation scheme. This model gave sensitivity, specificity, and concordance values of 0.789, 0.603, and 0.711 respectively.
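The one-step connection idea behind these models can be illustrated schematically. This sketch is not the Chapter 6 implementation: it simply estimates one-step transition probabilities from training sequences of annotated atomic symbols and scores a test sequence by its summed log-probability, skipping connections never observed in training (the same skipping rule noted in Section 7.2). In practice, class-conditional models (mutagenic vs. non-mutagenic) would be compared; all names here are hypothetical.

```python
from collections import defaultdict
import math

def train_transition_model(sequences):
    """Estimate one-step connection probabilities P(b | a) from
    sequences of annotated atomic symbols along fragments."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    model = {}
    for a, nxt in counts.items():
        total = sum(nxt.values())
        model[a] = {b: n / total for b, n in nxt.items()}
    return model

def log_likelihood(model, seq):
    """Sum log P(b | a) over the observed one-step connections;
    connections unseen in training are simply skipped."""
    ll = 0.0
    for a, b in zip(seq, seq[1:]):
        p = model.get(a, {}).get(b)
        if p is not None:
            ll += math.log(p)
    return ll
```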

We also applied kNN models for predicting Ames mutagenicity to test the effectiveness of our novel descriptors, building several models with different annotation schemes and fragment lengths. The best kNN model was developed using the weighted approach with 5 nearest neighbors, the {AI, nC, nH, PC, RA} annotation scheme, and fragments of length 2. This model gave sensitivity, specificity, and concordance values of 0.802, 0.684, and 0.753 respectively. This result corroborates that annotated linear fragments are able to capture the structural diversity of chemicals and to identify molecules with similar chemical structures.


7.2 Future Work

The results obtained from this research are very promising, and several useful lines of research can stem from it. One immediate extension would be to test the Markov chain models developed here on different toxicity endpoints such as skin sensitization and carcinogenicity. An analysis of those results would help to validate the effectiveness of annotated linear fragments as well as the subsequent Markov chain models.

There is also a need to explore more annotation options that would take into account additional steric and electronic properties of atoms. One possibility would be to add the stereochemistry annotation option, which would help to distinguish between different stereoisomers.

It would also be helpful to analyze a range of fragment lengths while developing the predictive models. In our study, we considered one fragment length for each model; it is quite possible that combining a range of fragment lengths would improve predictive performance, and an important exercise would be to determine the optimal range of lengths to use. This leads to the next possibility of exploring the co-occurrence and proximity of different fragments in a molecule. Chemicals can induce toxicity through more than one mechanism of action, so several structural features may be responsible for imparting toxic properties. In some cases, certain fragments might not induce toxicity when considered alone but, in the presence of other structural features, might act through a different mechanism and induce toxicity. It would therefore be of crucial importance to quantify and understand the simultaneous presence of structural features in a molecule that lead to an observed toxicity outcome.

There is also a need to define the applicability domain for the Markov chain models more rigorously. In this study, we considered a test molecule to be outside the model's applicability domain if it contained atomic symbols not observed in the training set; as long as its atomic symbols were observed in the training set, a prediction was made. If certain one-step connections in the test compound were not observed in the training set, those connections were simply skipped in the calculations. More rigorous applicability-domain criteria could improve the performance parameters of the model at the expense of decreased coverage of test set predictions. There is also a need for more stringent classification criteria. Generally, classifying a toxic compound as non-toxic (a false negative) is a more costly mistake than classifying a non-toxic compound as toxic (a false positive), so the cost of misclassification needs to be incorporated in the analysis.
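One simple way to incorporate misclassification cost is a cost-weighted decision threshold; this is a hypothetical sketch, not a method from this research. Under standard decision theory, if a false negative costs fn_cost and a false positive costs fp_cost, the expected-cost-minimizing rule predicts positive whenever the score exceeds fp_cost / (fp_cost + fn_cost) rather than 0.5.

```python
def classify_with_cost(score, fn_cost=3.0, fp_cost=1.0):
    """Lower the decision threshold when false negatives (toxic
    predicted as non-toxic) are costlier than false positives.

    Threshold = fp_cost / (fp_cost + fn_cost); with equal costs
    this reduces to the usual 0.5 cutoff. The 3:1 default cost
    ratio is an arbitrary illustration.
    """
    threshold = fp_cost / (fp_cost + fn_cost)
    return "Ames POS" if score > threshold else "Ames NEG"
```

For example, a compound with a weighted-kNN score of 0.30 would be called Ames NEG under the plain 0.5 cutoff but Ames POS under a 3:1 false-negative cost ratio (threshold 0.25).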

Another possible extension of this research could be to develop innovative Markov chain models for prediction purposes. The fundamental challenge in developing Markov models for analyzing chemical structures is to tackle the inherent non-directionality of chemical paths. In our study, we accomplished this by considering one-step connections between the different atomic symbols. There is a need to develop better ways to incorporate this non-directionality constraint in sequence probability calculations. There is also a need to develop sequence analysis techniques similar to the ones used in bioinformatics methods to discuss the similarity between chemical structures in terms of their alignment scores. This would be a novel way to define and interpret similarity between different molecules.

Finally, there is a need to develop better quality datasets for training and testing new models. In this study, we had to manually correct several chemical structures in the benchmark Ames mutagenicity dataset before processing the training and test sets through our algorithm, largely because of charge violations in the structural specifications of the original SD file. Thus, there is a need for chemically correct and reliable datasets covering a wide variety of toxicity endpoints.


7.3 Conclusion

In today’s information age, there is a great need to utilize the wealth of abundant information generated by experimental studies to inform and guide future work. This need is especially realized in the pharmaceutical industry where it is becoming increasingly important to develop computational models for predicting toxicity endpoints of candidate drug molecules.

Newer methods are being developed for generating relevant structural descriptors and advances are being made towards statistical models with high predictive performances. The research described in this dissertation is a significant step and yet a humble contribution to these ongoing efforts.

In this research, we developed a novel algorithm for the dynamic generation of linear fragments from chemical structures. These fragments carry annotated atom types that provide flexibility in defining them. Although the fragments are linear in composition, the annotation features allow them to capture branched structural information as well as polycyclic ring systems, making them a powerful tool for identifying relevant descriptors in chemoinformatics applications. Using these novel descriptors, we identified several important and well-known alerts for two toxicity endpoints, namely skin sensitization and Ames mutagenicity. This showed that our method captures meaningful descriptors and might lead to the discovery of new structural alerts, thus complementing the predefined-fragments approach. We were also able to reduce the dimension of the descriptor space compared to some other methods reported in the literature.

From the modeling standpoint, the innovative development and application of Markov chain models for predicting Ames mutagenicity was quite successful. Different annotation schemes and fragment lengths were explored and the models gave considerably high prediction accuracies.

These were significantly better than, or comparable to, those obtained using other descriptors and non-parametric modeling methods with the same training and test sets. The high predictive performance of the kNN models developed using the annotated linear fragments as descriptors further demonstrated the efficacy of our novel descriptors.

Thus, the work conducted in this research gave very promising results and it has tremendous potential for growth and applications. We hope that future efforts will refine this method further and give rise to better descriptors and statistical models. I would like to conclude this section by stating some of the benefits that could be realized from this research. First, this research can assist in expediting the process of bringing new products to market while minimizing the associated safety concerns. Second, it can significantly lower the costs required to screen candidate compounds and reduce the need for animal testing of chemicals. And lastly, it can help in the identification of new structural alerts, thus increasing our understanding of the mechanisms by which chemicals induce toxicity.


REFERENCES

1. CAS, Chemical Abstracts Service Home Page. http://www.cas.org/. Accessed October 30, 2015.

2. CAS Assigns the 100 Millionth CAS Registry Number to a Substance Designed to Treat Acute Myeloid Leukemia. http://www.cas.org/news/media-releases/100-millionth-substance. Accessed October 30, 2015.

3. Engel T. Basic Overview of Chemoinformatics. J Chem Inf Model. 2006;46(6):2267-2277. doi:10.1021/ci600234z.

4. Brown FK. Chapter 35. Chemoinformatics: What is it and How does it Impact Drug Discovery. In: Bristol JA, ed. Vol 33. Annual Reports in Medicinal Chemistry. Academic Press; 1998:375-384. doi:10.1016/S0065-7743(08)61100-8.

5. Brown FK. Editorial Opinion: Chemoinformatics – a ten year update. Curr Opin Drug Discov Dev. 2005;8(3):296-302.

6. Reisfeld B, Mayeno AN. Computational Toxicology. 2012;929:3-7. doi:10.1007/978-1-62703-050-2.

7. He K, Talaat RE, Pool WF, et al. Metabolic activation of troglitazone: identification of a reactive metabolite and mechanisms involved. Drug Metab Dispos. 2004;32(6):639-646. doi:10.1124/dmd.32.6.639.

8. Julie NL, Julie IM, Kende AI, Wilson GL. Mitochondrial dysfunction and delayed hepatotoxicity: another lesson from troglitazone. Diabetologia. 2008;51(11):2108-2116. doi:10.1007/s00125-008-1133-6.


9. Hu D, Wu C, Li Z, et al. Characterizing the mechanism of thiazolidinedione- induced hepatotoxicity: An in vitro model in mitochondria. Toxicol Appl Pharmacol. 2015;284(6):134-141. doi:10.1016/j.taap.2015.02.018.

10. Burello E, Worth A. Computational nanotoxicology: Predicting toxicity of nanoparticles. Nat Nanotechnol. 2011;6(3):138-139. doi:10.1038/nnano.2011.27.

11. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Model. 1988;28(1):31-36.

12. Khashan R, Zheng W, Tropsha A. The Development of Novel Chemical Fragment-Based Descriptors Using Frequent Common Subgraph Mining Approach and Their Application in QSAR Modeling. Mol Inform. 2014;33(3):201-215. doi:10.1002/minf.201300165.

13. Rogers D, Hahn M. Extended-connectivity fingerprints. J Chem Inf Model. 2010;50(5):742-754. doi:10.1021/ci100050t.

14. Duan J, Dixon SL, Lowrie JF, Sherman W. Analysis and comparison of 2D fingerprints: insights into database screening performance using eight fingerprint methods. J Mol Graph Model. 2010;29(2):157-170. doi:10.1016/j.jmgm.2010.05.008.

15. Sherhod R, Gillet VJ, Judson PN, Vessey JD. Automating knowledge discovery for toxicity prediction using jumping emerging pattern mining. J Chem Inf Model. 2012;52(11):3074-3087. doi:10.1021/ci300254w.

16. ChemoTyper Community Website. https://chemotyper.org/. Accessed October 30, 2015.

17. Yang C, Tarkhov A, Marusczyk J, et al. New publicly available chemical query language, CSRML, to support chemotype representations for application to data mining and modeling. J Chem Inf Model. 2015;55(3):510-528. doi:10.1021/ci500667v.

18. Leadscope, Inc.: Leadscope - Chemoinformatics Platform for Drug Discovery. http://leadscope.com/. Accessed October 30, 2015.

19. Sushko I, Salmina E, Potemkin VA, Poda G, Tetko I V. ToxAlerts: A Web Server of Structural Alerts for Toxic Chemicals and Compounds with Potential Adverse Reactions. J Chem Inf Model. 2012;52(8):2310-2316. doi:10.1021/ci300245q.

20. Sykora VJ, Leahy DE. Chemical Descriptors Library (CDL): a generic, open source software library for chemical informatics. J Chem Inf Model. 2008;48(10):1931-1942. doi:10.1021/ci800135h.

21. Klopman G. Artificial Intelligence Approach to Structure-Activity Studies. Computer Automated Structure Evaluation of Biological Activity of Organic Molecules. J Am Chem Soc. 1984;106:7315-7321.

22. Faulon J-L, Visco DP, Pophale RS. The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies. J Chem Inf Comput Sci. 2003;43(3):707-720. doi:10.1021/ci020345w.

23. Yap CW. PaDEL-Descriptor : An Open Source Software to Calculate Molecular Descriptors and Fingerprints. J Comput Chem. 2011;32(7):1466-1474. doi:10.1002/jcc.

24. Todeschini R, Consonni V, Wiese M. Handbook of Molecular Descriptors. Wiley- VCH, Weinheim, Germany; 2001.

25. Hong H, Xie Q, Ge W, et al. Mold 2 , Molecular Descriptors from 2D Structures for Chemoinformatics and Toxicoinformatics. J Chem Inf Model. 2008;2(48):1337-1344.

26. MultiCASE High quality software for in-silico ICH M7 safety assessment. http://multicase.com/. Accessed October 30, 2015.

27. Klopman G, Ivanov JM, Saiakhov RD, Chakravarti SK. MC4PC - An artificial intelligence approach to the discovery of quantitative structure-toxic activity relationship. In: Helma C, ed. Predictive Toxicology. Boca Raton FL, USA: CRC Press; 2005:423-457.

28. CORINA Symphony - Managing and Profiling Molecular Datasets | Inspiring Chemical Discovery. https://www.molecular-networks.com/products/corinasymphony. Accessed October 30, 2015.

29. Molconn-Z(TM) 4.00. http://www.edusoft-lc.com/molconn/. Accessed February 5, 2015.

30. Canvas- Product Features. http://www.schrodinger.com/Canvas/. Accessed July 21, 2015.

31. Cunningham AR, Carrasquer CA, Mattison DR. A categorical structure-activity relationship analysis of the developmental toxicity of antithyroid drugs. Int J Pediatr Endocrinol. 2009;2009:936154. doi:10.1155/2009/936154.

32. Farag AM, Mayhoub AS, Eldebss TMA, et al. Synthesis and structure-activity relationship studies of pyrazole-based heterocycles as antitumor agents. Arch Pharm (Weinheim). 2010;343(7):384-396. doi:10.1002/ardp.200900176.

33. Moraski GC, Chang M, Villegas-Estrada A, Franzblau SG, Möllmann U, Miller MJ. Structure-activity relationship of new anti-tuberculosis agents derived from oxazoline and benzyl esters. Eur J Med Chem. 2010;45(5):1703-1716. doi:10.1016/j.ejmech.2009.12.074.

34. Pillon NJ, Soulère L, Vella RE, et al. Quantitative structure-activity relationship for 4-hydroxy-2-alkenal induced cytotoxicity in L6 muscle cells. Chem Biol Interact. 2010;188(1):171-180. doi:10.1016/j.cbi.2010.06.015.

35. Greene N, Fisk L, Naven RT, Note RR, Patel ML, Pelletier DJ. Developing structure-activity relationships for the prediction of hepatotoxicity. Chem Res Toxicol. 2010;23(7):1215-1222. doi:10.1021/tx1000865.

36. Frid AA, Matthews EJ. Prediction of drug-related cardiac adverse effects in humans--B: use of QSAR programs for early detection of drug-induced cardiac toxicities. Regul Toxicol Pharmacol. 2010;56(3):276-289. doi:10.1016/j.yrtph.2009.11.005.

37. Patlewicz G, Rodford R, Walker JD. Quantitative structure-activity relationships for predicting mutagenicity and carcinogenicity. Environ Toxicol Chem. 2003;22(8):1885-1893. doi:10.1897/01-461.

38. Helguera AM, Cabrera Pérez MA, González MP, Ruiz RM, González Díaz H. A topological substructural approach applied to the computational prediction of rodent carcinogenicity. Bioorg Med Chem. 2005;13(7):2477-2488. doi:10.1016/j.bmc.2005.01.035.

39. Klopman G, Chakravarti SK, Zhu H, Ivanov JM, Saiakhov RD. ESP: a method to predict toxicity and pharmacological properties of chemicals using multiple MCASE databases. J Chem Inf Comput Sci. 2004;44(2):704-715. doi:10.1021/ci030298n.

40. Fjodorova N, Vracko M, Novic M, Roncaglioni A, Benfenati E. New public QSAR model for carcinogenicity. Chem Cent J. 2010;4 Suppl 1(Suppl 1):S3. doi:10.1186/1752-153X-4-S1-S3.

41. Benigni R. The first US National Toxicology Program exercise on the prediction of rodent carcinogenicity: definitive results. Mutat Res. 1997;387(1):35-45.

42. Benfenati E, Benigni R, Demarini DM, et al. Predictive models for carcinogenicity and mutagenicity: frameworks, state-of-the-art, and perspectives. J Environ Sci Health C Environ Carcinog Ecotoxicol Rev. 2009;27(2):57-90. doi:10.1080/10590500902885593.

43. VEGA | Virtual models for property Evaluation of chemicals within a Global Architecture. http://www.vega-qsar.eu/. Accessed February 5, 2015.

44. Varnek A, Baskin I. Machine learning methods for property prediction in chemoinformatics: Quo Vadis? J Chem Inf Model. 2012;52(6):1413-1437. doi:10.1021/ci200409x.

45. Xu C, Cheng F, Chen L, et al. In silico prediction of chemical Ames mutagenicity. J Chem Inf Model. 2012;52(11):2840-2847. doi:10.1021/ci300400a.

46. Stalring JC, Carlsson LA, Almeida P, Boyer S. AZOrange - High performance open source machine learning for QSAR modeling in a graphical programming environment. J Cheminform. 2011;3(1):28. doi:10.1186/1758-2946-3-28.

47. Sanderson DM, Earnshaw CG. Computer Prediction of Possible Toxic Action from Chemical Structure; The DEREK System. Hum Exp Toxicol. 1991;10:261-273.


48. CASE Ultra Models: High quality in-silico toxicity QSAR models. http://multicase.com/case-ultra-models. Accessed October 13, 2015.

49. Toxtree — EURL ECVAM. https://eurl-ecvam.jrc.ec.europa.eu/laboratories- research/predictive_toxicology/qsar_tools/toxtree. Accessed October 13, 2015.

50. ICH M7 - Genotoxic Impurities - Assessment and Control of DNA Reactive (Mutagenic) Impurities to Limit Potential Carcinogenic Risk. Guideline. 2014:30.

51. OECD. Report on the Regulatory Uses and Applications in OECD Member Countries of (Quantitative) Structure-Activity Relationship [(Q)SAR] Models in the Assessment of New and Existing Chemicals. Paris, France: Organisation for Economic Co-operation and Development; 2006. doi:ENV/JM/MONO(2007)10.

52. Aptula A, Patlewicz G, Roberts D. Skin sensitization: reaction mechanistic applicability domains for structure-activity relationships. Chem Res Toxicol. 2005;18(9):1420-1426.

53. Alves VM, Muratov E, Fourches D, et al. Predicting chemically-induced skin reactions. Part I: QSAR models of skin sensitization and their application to identify potentially hazardous compounds. Toxicol Appl Pharmacol. January 2015. doi:10.1016/j.taap.2014.12.014.

54. Home - QSAR. http://www.qsartoolbox.org/. Accessed February 5, 2015.

55. Gerberick GF, Ryan CA, Kern PS, et al. Compilation of historical local lymph node data for evaluation of skin sensitization alternative methods. Dermatitis. 2005;16(4):157-202.

56. Kern PS, Gerberick GF, Ryan CA, Kimber I, Aptula A, Basketter DA. Local Lymph Node Data for the Evaluation of Skin Sensitization Alternatives: A Second Compilation. Dermatitis. 2010;21(1):8-32. doi:10.2310/6620.2009.09038.

57. Mortelmans K, Zeiger E. The Ames Salmonella/microsome mutagenicity assay. Mutat Res. 2000;455(1-2):29-60.

58. Hansen K, Mika S, Schroeter T, et al. Benchmark data set for in silico prediction of Ames mutagenicity. J Chem Inf Model. 2009;49(9):2077-2081. doi:10.1021/ci900161g.

59. MarvinSketch – advanced chemical drawing software « ChemAxon – cheminformatics platforms and desktop applications. http://www.chemaxon.com/products/marvin/marvinsketch/. Accessed October 20, 2015.

60. MarvinView, a generic 2D/3D molecule renderer « ChemAxon – cheminformatics platforms and desktop applications. https://www.chemaxon.com/products/marvin/marvinview/. Accessed October 20, 2015.

61. Python.org. https://www.python.org/. Accessed October 30, 2015.

62. Python (programming language). Wikipedia. https://en.wikipedia.org/wiki/Python_(programming_language).

63. Lutz M. Learning Python. O’Reilly Media; 2009.

64. Landrum G. RDKit: Open-source cheminformatics. http://rdkit.org/. Accessed January 30, 2015.

65. Visual Basic. Wikipedia. https://en.wikipedia.org/wiki/Visual_Basic.

66. MATLAB. Wikipedia. https://en.wikipedia.org/wiki/MATLAB.

67. R. Wikipedia. https://en.wikipedia.org/wiki/R.

68. Kutchukian PS, Lou D, Shakhnovich EI. FOG: Fragment Optimized Growth Algorithm for the de Novo Generation of Molecules occupying Druglike Chemical Space. J Chem Inf Model. 2009;49(7):1630-1642. doi:10.1021/ci9000458.

69. Helgee EA, Carlsson L, Boyer S, Norinder U. Evaluation of quantitative structure-activity relationship modeling strategies: local and global models. J Chem Inf Model. 2010;50:677-689.


70. Martin YC, Kofron JL, Traphagen LM. Do structurally similar molecules have similar biological activity? J Med Chem. 2002;45(19):4350-4358. doi:10.1021/jm020155c.

71. ChemSpider | Search and share chemistry. http://www.chemspider.com/. Accessed July 20, 2015.

72. Stumpfe D, Bajorath J. Exploring activity cliffs in medicinal chemistry. J Med Chem. 2012;55(7):2932-2942. doi:10.1021/jm201706b.

73. Maggiora GM. On outliers and activity cliffs - Why QSAR often disappoints. J Chem Inf Model. 2006;46(4):1535. doi:10.1021/ci060117s.

74. Kazius J, McGuire R, Bursi R. Derivation and validation of toxicophores for mutagenicity prediction. J Med Chem. 2005;48(1):312-320. doi:10.1021/jm040835a.

75. BIOVIA Pipeline Pilot | Scientific Workflow Authoring Application for Data Analysis. http://accelrys.com/products/collaborative-science/biovia-pipeline-pilot/. Accessed October 22, 2015.


APPENDIX A. COMPUTER PROGRAMS: ALGORITHMS AND PYTHON SCRIPTS

1. Generation of annotated linear fragments and compound-fragment data matrix

From any given set of compounds, annotated linear fragments are extracted and the corresponding compound-fragment data matrix is generated using the following algorithm.

a. Inputs taken: name and directory location of the SD file containing the compounds, annotation options (different combinations of annotation options give different annotation schemes), and the minimum and maximum lengths of fragments to be considered.
b. For each compound in the SD file, relevant information such as the number of atoms, number of bonds, and compound name is first collected.
c. Then, for each atom in the compound, its identity, number of hydrogens, number of heavy-atom connections, partial charge, and ring information is identified and stored.
d. The longest linear paths are then identified for each compound using a depth-first search algorithm. These paths are simply sequences of numbers, where each number represents a heavy atom in the compound.
e. From the information contained in the longest linear paths, all smaller sub-paths are then identified. Path redundancy is checked in both forward and reverse directions and only unique paths are retained.
f. Chemical annotations are then used to identify and remove chemically redundant paths. The linear paths are referred to as fragments from this point onwards because they now capture and represent actual chemical structural information.
g. An array is then created that contains all fragments identified in the entire dataset of compounds. Fragment redundancy is checked in both forward and reverse directions and only unique fragments are retained.
h. A matrix is then created with dimensions m by n, where m is the total number of compounds in the dataset and n is the number of unique fragments identified (of the desired path lengths). We then go through each compound in the dataset and mark the corresponding bits as '1' if the fragments are actually observed in that compound.
i. An option is also provided to export this compound-fragment data matrix to an Excel file or a text file.
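The matrix-building steps (g and h) can be sketched in a few lines. The following is a minimal Python 3 illustration (the dissertation scripts below are Python 2); the fragment lists are made up for the example, not output of the actual pipeline:

```python
# Sketch of steps g and h: collect unique fragments, treating a fragment
# and its reverse as identical, then build the compound-fragment 0/1 matrix.
def build_matrix(per_compound_fragments):
    unique = []
    for frags in per_compound_fragments:
        for f in frags:
            # forward/reverse redundancy check (step g)
            if f not in unique and f[::-1] not in unique:
                unique.append(f)
    matrix = []
    for frags in per_compound_fragments:
        # set a bit when the fragment (or its reverse) occurs (step h)
        row = [1 if (u in frags or u[::-1] in frags) else 0 for u in unique]
        matrix.append(row)
    return unique, matrix

compounds = [
    [['C', 'N'], ['C', 'C', 'N']],   # compound 1
    [['N', 'C'], ['C', 'O']],        # compound 2
]
unique, matrix = build_matrix(compounds)
print(unique)   # [['C', 'N'], ['C', 'C', 'N'], ['C', 'O']]
print(matrix)   # [[1, 1, 0], [1, 0, 1]]
```

Note how ['N', 'C'] in compound 2 does not create a new column: its reverse ['C', 'N'] is already present, so only the bit in the existing column is set.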

The following Python script was written to accomplish this task. It relies on several function files, whose scripts are also provided below.

A. Parent script – fingerprint_table.py

# Function to generate linear annotated chemical fragments
# 5 annotation features provided (AI, nC, nH, PC, RA)
# This function returns the fingerprint table ('uni_sympath' and 'finger' lists)
#
# INPUTS:
#   fName: The name of the input SD file
#   ann:   Annotation scheme
#   minL:  Starting path length
#   maxL:  Ending path length

def fingerprint_table2(fName, ann, minL, maxL):
    import sys
    # Raw string so that backslashes in the Windows path are not read as escapes
    sys.path.append(r"C:\Users\Darshan Mehta\Documents\Research\Programs\Graph Theory\Markov Modeling")
    import pSearch_det2, subPaths, sym_path3, uni_sym, fingerprint
    from rdkit import Chem
    from rdkit.Chem import AllChem

    training = Chem.SDMolSupplier('C:\\Users\\Darshan Mehta\\Documents\\Research\\Programs\\Graph Theory\\Markov Modeling\\Ames_Benchmark\\' + fName + '.sdf')
    iComp, nAtom, nBond, all_Iden = [], [], [], []
    counter, nHyd, cTable, all_sympath, activity, cas_no = [], [], [], [], [], []

    print '\nNumber of compounds in SD file = ', len(training)

    # Bin each Gasteiger partial charge into one of five symbols
    def classify(charge):
        if charge < -0.15:
            charge_classify = '-'
        elif charge < -0.05:
            charge_classify = 'n'
        elif charge < 0.05:
            charge_classify = 'o'
        elif charge < 0.15:
            charge_classify = 'p'
        else:
            charge_classify = '+'
        return charge_classify

    all_symbol = []
    mol_count = 0  # Keeps track of molecule count in sd file
    for mol_i in training:
        mol = Chem.RemoveHs(mol_i)  # Remove explicit hydrogens
        iComp.append(mol.GetProp("_Name"))
        activity.append(int(mol.GetProp("Activity")))
        cas_no.append(mol.GetProp("CAS_NO"))
        nAtom.append(mol.GetNumAtoms())
        nBond.append(mol.GetNumBonds())
        iD = [atom.GetSymbol() for atom in mol.GetAtoms()]
        all_Iden.append(iD)
        maxx = nAtom[mol_count]

        # Partial-charge annotation (PC)
        if int(ann[3]) == 1:
            AllChem.ComputeGasteigerCharges(mol)
            gCharge = [round(float(mol.GetAtomWithIdx(k).GetProp('_GasteigerCharge')), 4)
                       for k in range(0, maxx, 1)]
            gCharge_classify = [classify(gCharge[l]) for l in range(0, maxx, 1)]
        else:
            gCharge_classify = []

        # Ring-atom annotation (RA)
        if int(ann[4]) == 1:
            ringinfo = [int(mol.GetAtomWithIdx(k).IsInRing()) for k in range(0, maxx, 1)]
            ringinfo2 = ringinfo
            for m in range(0, len(ringinfo), 1):
                if ringinfo[m] != 0:
                    ring1 = int(mol.GetAtomWithIdx(m).IsInRingSize(5))
                    ring2 = int(mol.GetAtomWithIdx(m).IsInRingSize(6))
                    if ring1 == ring2 == 1:
                        ringinfo2[m] = 7
                    elif ring1 == 1:
                        ringinfo2[m] = 5
                    elif ring2 == 1:
                        ringinfo2[m] = 6
                    elif ring1 == ring2 == 0:
                        ringinfo2[m] = 4
        else:
            ringinfo2 = []

        from numpy import zeros, arange
        cTab_Mat = zeros((maxx, maxx))

        conn, hyd, neighbor = [], [], []  # 'neighbor' is eq to 'relation' in fPathSearch_Call
        for j in range(0, maxx, 1):
            atom = mol.GetAtomWithIdx(j)
            conn.append(len(atom.GetNeighbors()))
            hyd.append(atom.GetTotalNumHs())
            neighbor.append([x.GetIdx() for x in atom.GetNeighbors()])
            for k in range(0, conn[j], 1):
                temp = neighbor[j][k]
                cTab_Mat[j][temp] = 1

        longpath = pSearch_det2.pSearch_det2(cTab_Mat, arange(1, maxx+1), arange(1, maxx+1), [], [], int(maxL))
        path = subPaths.subPaths(longpath)
        symbol, sympath = sym_path3.sym_path3(iD, conn, hyd, gCharge_classify, ringinfo2, path, ann)
        all_symbol.append(symbol)
        sympath_dL = []
        for i in range(0, len(sympath), 1):
            if len(sympath[i]) >= minL:
                sympath_dL.append(sympath[i])
        all_sympath.append(sympath_dL)
        mol_count += 1

    uni_sympath = uni_sym.uni_sym(all_sympath)
    finger = fingerprint.fingerprint(all_sympath, uni_sympath)

    return activity, cas_no, uni_sympath, finger

B. Function for depth-first search algorithm – pSearch_det2.py

def pSearch_det2(iMat, iD, iNum, iPath, mRoad, maxL):

    # Function for finding all possible paths using a deterministic
    # depth-first search algorithm.
    # Function modified to find longest possible paths up to a specified length 'maxL'

    # Variables:
    # INPUT
    #   iMat  - Matrix form of connection table for compound i
    #   iD    - Node numbers (from 1 to maxx)
    #   iNum  - Path-determining node numbers (typically same as iD)
    #   iPath - Path history
    #   mRoad - Empty list (paths will be added as function runs multiple times)
    #   maxL  - Maximum path length desired
    # OUTPUT
    #   mRoad - List of all longest possible paths from nodes specified in iNum

    # numpy.delete is used here; the scipy alias of the same function
    # has been removed from modern SciPy releases
    from numpy import argwhere, ones, delete
    for i in range(0, len(iNum), 1):
        j = int(argwhere(iD == iNum[i]))  # for node 1, j will be 0
        k = ones(len(iMat))
        k[j] = 0
        m = iD[iMat[j, :] != 0]
        if len(m) != 0 and len(iPath) < (maxL - 1):
            temp1 = delete(iMat, j, 0)
            temp2 = delete(temp1, j, 1)
            pSearch_det2(temp2, iD[k != 0], m, iPath + [iNum[i]], mRoad, maxL)
        else:
            mRoad.append(iPath + [iNum[i]])

    return mRoad
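The recursion above can be expressed more compactly against an adjacency list instead of repeatedly deleting rows and columns of the connection-table matrix. This Python 3 sketch (the names `all_paths` and `adj`, and the 3-node chain, are illustrative, not from the thesis code) records a path exactly when the walk dead-ends or reaches `maxL`, matching the behavior of `pSearch_det2`:

```python
# Enumerate all maximal simple paths (up to length maxL) from a start node.
def all_paths(adj, node, path, out, maxL):
    path = path + [node]
    nxt = [n for n in adj[node] if n not in path]  # unvisited neighbors
    if nxt and len(path) < maxL:
        for n in nxt:
            all_paths(adj, n, path, out, maxL)
    else:
        out.append(path)  # dead end or length limit reached
    return out

# Linear 3-atom graph: 1-2-3
adj = {1: [2], 2: [1, 3], 3: [2]}
paths = []
for start in adj:
    all_paths(adj, start, [], paths, maxL=3)
print(paths)  # [[1, 2, 3], [2, 1], [2, 3], [3, 2, 1]]
```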

C. Function for identifying sub-paths – subPaths.py

def subPaths(mRoad):

    # Function for breaking all longest paths into unique sub-paths

    # Variables:
    # INPUT
    #   mRoad - List of all longest possible paths
    # OUTPUT
    #   road  - List of all unique sub-paths

    road = []
    for i in range(0, len(mRoad), 1):
        for j in range(0, len(mRoad[i]), 1):
            if (mRoad[i][:j+1] not in road) and (mRoad[i][:j+1][::-1] not in road):
                road.append(mRoad[i][:j+1])

    def bylength(word1, word2):
        return len(word1) - len(word2)
    road.sort(cmp=bylength)

    return road
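The same prefix expansion with forward/reverse deduplication can be sketched in Python 3 (where the `cmp` argument to `sort` no longer exists and `key=len` is used instead); `sub_paths` is an illustrative rewrite, not the original function:

```python
# Every prefix of every longest path is collected, keeping one
# representative per forward/reverse pair, sorted shortest-first.
def sub_paths(longest):
    road = []
    for p in longest:
        for j in range(len(p)):
            sp = p[:j + 1]
            if sp not in road and sp[::-1] not in road:
                road.append(sp)
    road.sort(key=len)  # Python 3 replacement for sort(cmp=bylength)
    return road

print(sub_paths([[1, 2, 3], [3, 2, 1]]))
# [[1], [3], [1, 2], [3, 2], [1, 2, 3]]
```

Note that [3, 2, 1] itself is discarded because its reverse [1, 2, 3] was already collected, while the shorter prefixes [3] and [3, 2] survive.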

D. Function for identifying symbolic paths using chemical annotations – sym_path3.py

def sym_path3(iD, conn, hyd, gchg, ring, path, ann):

    # Function for converting node-based paths into symbol-based paths

    # Variables:
    # INPUT
    #   iD   - Identity of each heavy atom in compound i
    #   conn - Connectivity of all heavy atoms in compound i
    #   hyd  - #H-atoms connected to each heavy atom in compound i
    #   gchg - Gasteiger charge (partial charge) on each heavy atom in compound i
    #   ring - Ring info (ring atom or not) about each heavy atom in compound i
    #   path - List of all unique sub-paths
    #   ann  - List of user-specified annotations
    #          1. Match atom identity (0/1)
    #          2. Match atom connectivity (0/1)
    #          3. Match atom nHyd (0/1)
    #          4. Match atom partial charge (0/1)
    #          5. Match atom ring info (0/1)
    # OUTPUT
    #   symbol  - List of symbolic notation for each heavy atom in compound i
    #   symPath - List of all unique symbolic sub-paths

    if not any(ann):
        print '\n ERROR: Select at least one criterion'
        return []

    # Build one composite symbol per atom by concatenating the values of the
    # selected annotations (e.g. identity + connectivity + nHyd -> 'C30')
    mega = [iD, conn, hyd, gchg, ring]
    k = 0
    for i in range(0, len(ann)):
        if ann[i] != 0 and k == 1:
            temp = ''.join([str(item) for item in mega[i]])
            symbol = map(''.join, zip(symbol, temp))
        if ann[i] != 0 and k == 0:
            if any(ann[i+1:]):
                if i == 0:
                    symbol = mega[i]
                    k = 1
                else:
                    temp = ''.join([str(item) for item in mega[i]])
                    symbol = temp
                    k = 1
            else:
                symbol = mega[i]

    symPath = []
    for i in range(0, len(path), 1):
        sym1 = []
        for j in range(0, len(path[i]), 1):
            sym1.append(symbol[path[i][j] - 1])
        if sym1 not in symPath and sym1[::-1] not in symPath:
            symPath.append(sym1)

    return symbol, symPath
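For a concrete sense of the symbols this produces, the following Python 3 sketch composes per-atom symbols from identity, connectivity, and hydrogen count. The atom values are invented for the example, but the resulting symbols ('C13', 'C22', ...) match the style of the {AI, nC, nH} fragments in Appendix B:

```python
# Per-atom annotation values for a hypothetical 3-atom compound
iD   = ['C', 'C', 'N']   # AI: atom identity
conn = [1, 2, 1]         # nC: heavy-atom connections
hyd  = [3, 2, 2]         # nH: attached hydrogens

# One composite symbol per atom: identity + connectivity + hydrogen count
symbol = [a + str(c) + str(h) for a, c, h in zip(iD, conn, hyd)]
print(symbol)  # ['C13', 'C22', 'N12']

# A node path [1, 2, 3] then becomes the annotated fragment:
print([symbol[n - 1] for n in [1, 2, 3]])  # ['C13', 'C22', 'N12']
```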

E. Function for collecting all unique symbolic paths – uni_sym.py

def uni_sym(all_sympath):

    # Function for collecting unique symbol-based sub-paths from all compounds

    # Variables:
    # INPUT
    #   all_sympath - List of symbolic paths for all compounds in sd file
    # OUTPUT
    #   uni_sympath - List of all unique symbolic paths

    uni_sympath = []

    for i in range(0, len(all_sympath), 1):
        for j in range(0, len(all_sympath[i]), 1):
            if all_sympath[i][j] not in uni_sympath and all_sympath[i][j][::-1] not in uni_sympath:
                uni_sympath.append(all_sympath[i][j])

    def bylength(word1, word2):
        return len(word1) - len(word2)

    uni_sympath.sort(cmp=bylength)

    return uni_sympath


F. Function for generating compound-fragment data matrix – fingerprint.py

def fingerprint(iSympath, uniSympath):

    # Function to create a list of fingerprints (0s and 1s)

    # Variables:
    # INPUT
    #   iSympath   - List of symbolic paths for all compounds in sd file
    #   uniSympath - List of all unique symbolic paths
    # OUTPUT
    #   fPrint     - List of fingerprints for all compounds

    from numpy import zeros

    fPrint = []

    for i in range(0, len(iSympath), 1):
        tempPrint = [0]*len(uniSympath)
        for j in range(0, len(uniSympath), 1):
            # Following code revised after a bug was detected (10/03/2013)
            if uniSympath[j] in iSympath[i] or uniSympath[j][::-1] in iSympath[i]:
                tempPrint[j] = 1
        fPrint.append(tempPrint)

    return fPrint


APPENDIX B. STATISTICAL RESULTS FOR IDENTIFICATION OF STRUCTURAL ALERTS

1. Skin sensitization

Number   Fragment   χ2-statistic   γ-statistic
1   ['N', 'C', 'C', 'C', 'N']   24.525   0.601
2   ['N', 'C', 'C', 'C', 'C', 'C', 'N']   23.308   0.566
3   ['O', 'N', 'O']   20.911   0.466
4   ['N', 'C', 'C', 'C', 'N', 'O']   20.575   0.723
5   ['C', 'N', 'O']   18.508   0.390
6   ['C', 'C', 'N', 'O']   18.508   0.390
7   ['O', 'N', 'C', 'C', 'C', 'N', 'O']   17.793   0.853
8   ['N', 'C', 'C', 'C', 'C', 'N']   17.259   0.667
9   ['N', 'C', 'C', 'N']   16.260   0.656
10   ['C', 'N', 'C']   15.626   0.373
11   ['N', 'C', 'C', 'N', 'O']   14.597   0.834
12   ['C', 'C', 'S']   12.039   0.361
13   ['Cl', 'C', 'C', 'C', 'C', 'N']   11.934   0.758
14   ['N', 'C', 'S', 'C', 'C', 'N']   11.830   1.000
15   ['O', 'C', 'O', 'C', 'O']   11.540   0.900
16   ['C', 'N', 'C', 'C', 'C', 'Cl']   11.540   0.900
17   ['C', 'N', 'C', 'C', 'C', 'N', 'C']   11.485   0.641
Table 36: Significant and positively correlated fragments for skin sensitization dataset using {AI} annotation scheme with χ2 and γ statistic values
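The two statistics in these tables can be reproduced from a 2×2 fragment/activity contingency table. The sketch below assumes χ2 is Pearson's chi-square without continuity correction and γ is the Goodman-Kruskal gamma, which for a 2×2 table reduces to (ad − bc)/(ad + bc); the counts are illustrative, not taken from the dissertation data:

```python
# a, b: actives/inactives containing the fragment
# c, d: actives/inactives lacking the fragment
def chi2_gamma(a, b, c, d):
    n = a + b + c + d
    # Pearson's chi-square for a 2x2 table, no continuity correction
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Goodman-Kruskal gamma; for 2x2 this equals Yule's Q
    gamma = (a * d - b * c) / (a * d + b * c)
    return chi2, gamma

chi2, gamma = chi2_gamma(15, 5, 40, 150)
print(round(chi2, 3), round(gamma, 3))
```

Under this reading, the γ = 1.000 rows above correspond to fragments whose off-diagonal product bc is zero, e.g. a fragment that never occurs in an inactive compound.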


Number   Fragment   χ2-statistic   γ-statistic
1   ['C3', 'C2', 'C3', 'N3']   37.356   0.678
2   ['C2', 'C3', 'N3']   32.580   0.545
3   ['C2', 'C3', 'C2', 'C3', 'N3']   31.304   0.721
4   ['C2', 'C2', 'C3', 'C2', 'C3', 'N3']   27.670   0.770
5   ['C3', 'C2', 'C2', 'C3', 'C3', 'N3']   24.425   0.755
6   ['C3', 'C2', 'C2', 'C3', 'C2', 'C3', 'N3']   24.425   0.755
7   ['C2', 'C3', 'C2', 'C2', 'C3', 'C3', 'N3']   24.425   0.755
8   ['C2', 'C2', 'C3', 'C3', 'N3']   22.426   0.610
9   ['C2', 'C3', 'C2', 'C3', 'N3', 'O1']   21.720   0.666
10   ['O1', 'N3', 'O1']   20.911   0.466
11   ['C2', 'C3', 'N3', 'O1']   19.865   0.489
12   ['O1', 'N3', 'C3', 'C2', 'C3', 'N3', 'O1']   19.776   0.892
13   ['C2', 'N3', 'C3', 'C2', 'C3']   19.160   0.930
14   ['C3', 'C3', 'C2', 'C3', 'N3']   18.774   0.621
15   ['C2', 'C2', 'C3', 'C3', 'N3', 'O1']   18.746   0.644
16   ['C3', 'C2', 'C2', 'C3', 'C3', 'N3', 'O1']   18.448   0.716
17   ['C2', 'C2', 'C3', 'C2', 'C3', 'N3', 'O1']   18.448   0.716
18   ['C3', 'N3', 'O1']   18.416   0.447
19   ['C2', 'C3', 'C3', 'C2', 'C3', 'N3']   18.290   0.638
20   ['C2', 'C3', 'C3', 'C2', 'C2', 'C3', 'N3']   18.290   0.638
21   ['C2', 'C2', 'C3', 'C3', 'C2', 'C3', 'N3']   18.290   0.638
22   ['C2', 'C2', 'C3', 'N3']   18.208   0.468
23   ['C3', 'C2', 'C3', 'N3', 'O1']   17.946   0.564
24   ['N3', 'C3', 'C2', 'C3', 'N3', 'O1']   17.793   0.853
25   ['N3', 'C3', 'C2', 'C3', 'N3']   17.793   0.853
26   ['C2', 'C3', 'C3', 'N3']   17.789   0.505
27   ['C3', 'C3', 'N3', 'O1']   16.859   0.505
28   ['C3', 'C3', 'N3']   16.508   0.445
29   ['N1', 'C3', 'C3', 'C2', 'C3']   16.378   0.783
Continued
Table 37: Significant and positively correlated fragments for skin sensitization dataset using {AI, nC} annotation scheme with χ2 and γ statistic values


Table 37 continued

Number   Fragment   χ2-statistic   γ-statistic
30   ['C3', 'C3', 'C2', 'C2', 'C3', 'N3']   16.292   0.537
31   ['N1', 'C3', 'C3', 'C2', 'C3', 'C2', 'C2']   16.190   0.877
32   ['N1', 'C3', 'C3', 'C2', 'C3', 'C2']   16.190   0.877
33   ['N1', 'C3', 'C2', 'C2', 'C3', 'C2', 'C3']   16.190   0.877
34   ['O1', 'C3', 'C2', 'C3', 'C2', 'C3', 'O1']   15.889   1.000
35   ['O1', 'C3', 'C2', 'C3', 'C2', 'C3', 'C3']   15.889   1.000
36   ['O1', 'C3', 'C2', 'C3', 'C2', 'C3']   15.889   1.000
37   ['N1', 'C3', 'C3', 'C2', 'C3', 'N1']   15.889   1.000
38   ['C1', 'C2', 'N3', 'C3', 'C2', 'C3']   15.889   1.000
39   ['N2', 'C3', 'C2', 'C2', 'C3']   15.859   0.554
40   ['C2', 'C2', 'N3', 'C3', 'C2', 'C3']   15.288   0.918
41   ['N2', 'C3', 'C2', 'C3', 'C3', 'C2']   15.243   0.703
42   ['N2', 'C3', 'C2', 'C3', 'C3']   15.243   0.703
43   ['C2', 'C3', 'C3', 'N3', 'O1']   14.047   0.519
44   ['N1', 'C3', 'C2', 'C2', 'C3', 'C2']   13.971   0.405
45   ['C2', 'C3', 'C3', 'C3', 'C3', 'C3', 'C2']   13.294   0.680
46   ['O1', 'C3', 'C3', 'C3', 'C2', 'C3', 'C2']   12.882   0.857
47   ['N1', 'C3', 'C3', 'C2']   12.690   0.594
48   ['C3', 'N2', 'C2']   12.272   0.493
49   ['S2', 'C3', 'C3', 'C2', 'C3', 'C2', 'C2']   11.830   1.000
50   ['S2', 'C3', 'C3', 'C2', 'C3', 'C2']   11.830   1.000
51   ['S2', 'C3', 'C3', 'C2', 'C3']   11.830   1.000
52   ['S2', 'C3', 'C2', 'C2', 'C3', 'C2']   11.830   1.000
53   ['S2', 'C3', 'C2', 'C2', 'C3']   11.830   1.000
54   ['O2', 'C3', 'C3', 'C2', 'C3', 'O1']   11.830   1.000
55   ['N3', 'C3', 'C3', 'S2']   11.830   1.000
56   ['N3', 'C3', 'C2', 'C2', 'C3', 'C3', 'C1']   11.830   1.000
57   ['Cl1', 'C3', 'C3', 'C2', 'C3', 'C2']   11.830   1.000
58   ['C3', 'C3', 'C2', 'C3', 'N3', 'C2', 'C1']   11.830   1.000
59   ['C3', 'C2', 'C3', 'C2', 'C2', 'C3', 'S2']   11.830   1.000
Continued


Table 37 continued

Number   Fragment   χ2-statistic   γ-statistic
60   ['C2', 'O2', 'C3', 'C3', 'C2', 'C3', 'O1']   11.830   1.000
61   ['C2', 'N3', 'C3', 'C2', 'C3', 'C2', 'C2']   11.830   1.000
62   ['C2', 'N3', 'C3', 'C2', 'C3', 'C2']   11.830   1.000
63   ['C2', 'N3', 'C3', 'C2', 'C3', 'C1']   11.830   1.000
64   ['C2', 'N2', 'C3', 'C2', 'C3', 'C3', 'C2']   11.830   1.000
65   ['C2', 'N2', 'C3', 'C2', 'C3', 'C3']   11.830   1.000
66   ['C2', 'N2', 'C3', 'C2', 'C3']   11.830   1.000
67   ['C2', 'C2', 'N3', 'C3', 'C2', 'C3', 'C1']   11.830   1.000
68   ['C1', 'C3', 'C2', 'C3', 'N3', 'C2', 'C1']   11.830   1.000
69   ['N3', 'C3', 'C2', 'C2', 'C3', 'C3', 'N3']   11.647   0.810
70   ['O1', 'C3', 'O2', 'C3', 'O1']   11.540   0.900
71   ['N1', 'C3', 'C2', 'C2', 'C3', 'N1']   11.540   0.900
72   ['Cl1', 'C3', 'C3', 'C2', 'C3']   11.540   0.900
73   ['O1', 'C3', 'C3', 'C2', 'C3', 'O1']   11.485   0.641
74   ['N2', 'C3', 'C2', 'C3', 'C3', 'C2', 'C2']   11.485   0.641
75   ['C3', 'C3', 'C2', 'C3', 'N3', 'O1']   11.430   0.543

Number   Fragment   χ2-statistic   γ-statistic
1   ['C30', 'C21', 'C30', 'N30']   38.794   0.683
2   ['C21', 'C30', 'C21', 'C30', 'N30']   34.360   0.797
3   ['C21', 'C30', 'N30']   32.580   0.545
4   ['C30', 'C30', 'C21', 'C30', 'C21']   32.449   0.364
5   ['C21', 'C30', 'C30', 'C21', 'C30']   29.490   0.376
6   ['C21', 'C30', 'C30', 'C21', 'C30', 'C21']   28.543   0.365
7   ['C30', 'C30', 'C21', 'C30']   28.310   0.357
8   ['O11', 'C30', 'C21', 'C30']   28.088   0.638
Continued
Table 38: Significant and positively correlated fragments for skin sensitization dataset using {AI, nC, nH} annotation scheme with χ2 and γ statistic values

Table 38 continued

Number   Fragment   χ2-statistic   γ-statistic
9   ['O11', 'C30', 'C21', 'C30', 'C21']   27.976   0.788
10   ['C21', 'C21', 'C30', 'C21', 'C30', 'N30']   27.670   0.770
11   ['C30', 'C21', 'C21', 'C30', 'C30', 'N30']   24.425   0.755
12   ['C30', 'C21', 'C21', 'C30', 'C21', 'C30', 'N30']   24.425   0.755
13   ['C21', 'C30', 'C21', 'C30', 'N30', 'O10']   24.425   0.755
14   ['C21', 'C30', 'C21', 'C21', 'C30', 'C30', 'N30']   24.425   0.755
15   ['C21', 'C21', 'C30', 'C30', 'N30']   22.426   0.610
16   ['O10', 'N30', 'O10']   20.911   0.466
17   ['C30', 'C30', 'N30']   20.676   0.495
18   ['C21', 'C30', 'N30', 'O10']   19.865   0.489
19   ['O10', 'N30', 'C30', 'C21', 'C30', 'N30', 'O10']   19.776   0.892
20   ['C22', 'N30', 'C30', 'C21', 'C30']   19.160   0.930
21   ['C30', 'C30', 'C21', 'C30', 'N30']   18.774   0.621
22   ['C21', 'C21', 'C30', 'C30', 'N30', 'O10']   18.746   0.644
23   ['C30', 'C21', 'C21', 'C30', 'C30', 'N30', 'O10']   18.448   0.716
24   ['C21', 'C21', 'C30', 'C21', 'C30', 'N30', 'O10']   18.448   0.716
25   ['C30', 'N30', 'O10']   18.416   0.447
26   ['C21', 'C30', 'C30', 'C21', 'C30', 'N30']   18.290   0.638
27   ['C21', 'C30', 'C30', 'C21', 'C21', 'C30', 'N30']   18.290   0.638
28   ['C21', 'C21', 'C30', 'C30', 'C21', 'C30', 'N30']   18.290   0.638
29   ['C21', 'C21', 'C30', 'N30']   18.208   0.468
30   ['C30', 'C21', 'C30', 'N30', 'O10']   17.946   0.564
31   ['N30', 'C30', 'C21', 'C30', 'N30', 'O10']   17.793   0.853
32   ['N30', 'C30', 'C21', 'C30', 'N30']   17.793   0.853
33   ['C21', 'C30', 'C30', 'N30']   17.789   0.505
34   ['C30', 'C30', 'N30', 'O10']   16.859   0.505
35   ['N12', 'C30', 'C30', 'C21', 'C30']   16.378   0.783
36   ['C30', 'C30', 'C21', 'C21', 'C30', 'N30']   16.292   0.537
37   ['C21', 'C30', 'C21', 'C21', 'C30', 'C30', 'C13']   16.275   0.722
38   ['N12', 'C30', 'C30', 'C21', 'C30', 'C21', 'C21']   16.190   0.877
Continued


Table 38 continued

Number   Fragment   χ2-statistic   γ-statistic
39   ['N12', 'C30', 'C30', 'C21', 'C30', 'C21']   16.190   0.877
40   ['N12', 'C30', 'C21', 'C21', 'C30', 'C21', 'C30']   16.190   0.877
41   ['O11', 'C30', 'C30', 'C30', 'C21', 'C30', 'C21']   15.889   1.000
42   ['O11', 'C30', 'C21', 'C30', 'C21', 'C30', 'O11']   15.889   1.000
43   ['O11', 'C30', 'C21', 'C30', 'C21', 'C30', 'C30']   15.889   1.000
44   ['O11', 'C30', 'C21', 'C30', 'C21', 'C30']   15.889   1.000
45   ['N12', 'C30', 'C30', 'C21', 'C30', 'N12']   15.889   1.000
46   ['C13', 'C22', 'N30', 'C30', 'C21', 'C30']   15.889   1.000
47   ['O11', 'C30', 'C30', 'C30', 'C21', 'C30']   15.415   0.803
48   ['O11', 'C30', 'C30', 'C30', 'O11']   15.322   0.628
49   ['C22', 'C22', 'N30', 'C30', 'C21', 'C30']   15.288   0.918
50   ['O11', 'C30', 'C21', 'C30', 'C21', 'C21', 'C30']   15.227   0.714
51   ['C30', 'C30', 'C21', 'C30', 'O11']   14.666   0.519
52   ['O11', 'C30', 'C30', 'O11']   14.610   0.757
53   ['O11', 'C30', 'C21', 'C30', 'C21', 'C21']   14.515   0.700
54   ['C21', 'C30', 'C30', 'N30', 'O10']   14.047   0.519
55   ['N12', 'C30', 'C21', 'C21', 'C30', 'C21']   13.971   0.405
56   ['C21', 'C21', 'C21', 'O10']   13.392   0.440
57   ['C13', 'C21', 'C21', 'C30']   13.388   0.503
58   ['C21', 'C30', 'C30', 'C30', 'C21', 'C30']   13.294   0.446
59   ['N20', 'C30', 'C21']   13.237   0.487
60   ['N20', 'C30', 'C21', 'C21']   13.035   0.450
61   ['C13', 'C30', 'C21', 'C30', 'C21', 'C21', 'C30']   12.916   0.688
62   ['N12', 'C30', 'C30', 'C21']   12.690   0.594
63   ['C13', 'C21', 'C21']   12.352   0.464
64   ['C21', 'C21', 'C21', 'C30', 'C21', 'C21', 'C30']   12.178   0.611
65   ['C21', 'C30', 'C30', 'C30', 'C21']   12.129   0.377
66   ['C21', 'C30', 'C30', 'C30', 'C30', 'C30', 'C21']   11.934   0.758
67   ['C21', 'C21', 'C30', 'C30', 'C30', 'C30', 'C30']   11.934   0.758
68   ['S20', 'C30', 'C30', 'C21', 'C30', 'C21', 'C21']   11.830   1.000
Continued


Table 38 continued

Number   Fragment   χ2-statistic   γ-statistic
69   ['S20', 'C30', 'C30', 'C21', 'C30', 'C21']   11.830   1.000
70   ['S20', 'C30', 'C30', 'C21', 'C30']   11.830   1.000
71   ['S20', 'C30', 'C21', 'C21', 'C30', 'C21']   11.830   1.000
72   ['S20', 'C30', 'C21', 'C21', 'C30']   11.830   1.000
73   ['O20', 'C30', 'C30', 'C21', 'C30', 'O11']   11.830   1.000
74   ['N30', 'C30', 'C30', 'S20']   11.830   1.000
75   ['N30', 'C30', 'C21', 'C21', 'C30', 'C30', 'C13']   11.830   1.000
76   ['N20', 'C30', 'N20']   11.830   1.000
77   ['N20', 'C30', 'C21', 'C30', 'C30', 'C21']   11.830   1.000
78   ['N20', 'C30', 'C21', 'C30', 'C30']   11.830   1.000
79   ['Cl10', 'C30', 'C30', 'C21', 'C30', 'C21']   11.830   1.000
80   ['C30', 'C30', 'O20', 'C22', 'C22', 'C22', 'C22']   11.830   1.000
81   ['C30', 'C30', 'C21', 'C30', 'N30', 'C22', 'C13']   11.830   1.000
82   ['C30', 'C21', 'C30', 'C21', 'C21', 'C30', 'S20']   11.830   1.000
83   ['C22', 'O20', 'C30', 'C30', 'C21', 'C30', 'O11']   11.830   1.000
84   ['C22', 'O20', 'C30', 'C30', 'C21', 'C30', 'C30']   11.830   1.000
85   ['C22', 'N30', 'C30', 'C21', 'C30', 'C21', 'C21']   11.830   1.000
86   ['C22', 'N30', 'C30', 'C21', 'C30', 'C21']   11.830   1.000
87   ['C22', 'N30', 'C30', 'C21', 'C30', 'C13']   11.830   1.000
88   ['C22', 'C22', 'N30', 'C30', 'C21', 'C30', 'C13']   11.830   1.000
89   ['C13', 'C30', 'C21', 'C30', 'N30', 'C22', 'C13']   11.830   1.000
90   ['C30', 'C30', 'C21', 'C21', 'C30', 'C30', 'C21']   11.786   0.670
91   ['C30', 'C21', 'C21', 'C30', 'C30', 'C13']   11.769   0.624
92   ['C30', 'C30', 'C21', 'C30', 'C30', 'C30', 'O11']   11.697   0.762
93   ['N30', 'C30', 'C21', 'C21', 'C30', 'C30', 'N30']   11.647   0.810
94   ['O10', 'C30', 'O20', 'C30', 'O10']   11.540   0.900
95   ['O10', 'C30', 'C30', 'C21', 'C30', 'C30', 'O11']   11.540   0.900
96   ['N12', 'C30', 'C21', 'C21', 'C30', 'N12']   11.540   0.900
97   ['Cl10', 'C30', 'C30', 'C21', 'C30']   11.540   0.900
98   ['O10', 'C30', 'C21', 'C21']   11.529   0.404
Continued


Table 38 continued

Number   Fragment   χ2-statistic   γ-statistic
99   ['O10', 'C30', 'C30', 'C21', 'C30', 'O11']   11.485   0.641
100   ['C21', 'C30', 'C30', 'C30', 'C21', 'C30', 'C30']   11.485   0.641
101   ['C30', 'C30', 'C21', 'C30', 'N30', 'O10']   11.430   0.543
102   ['C13', 'C30', 'C30', 'C21', 'C21']   11.430   0.543
103   ['C30', 'C21', 'C30', 'C21', 'C30', 'C30']   11.341   0.566

Number   Fragment   χ2-statistic   γ-statistic
1   ['30+', '30+', '21+']   43.150   0.921
2   ['10-', '30+', '30+', '21+']   35.534   0.908
3   ['30+', '30+', '30+']   25.377   0.408
4   ['30+', '30+', '21+', '30+']   24.563   0.880
5   ['30+', '21+', '30+']   24.563   0.880
6   ['21+', '30+', '30+', '210']   24.563   0.880
7   ['10-', '30+', '30+', '21+', '30+']   24.563   0.880
8   ['11-', '30+', '30+', '30+', '210']   23.373   0.644
9   ['300', '210', '210', '300', '210']   23.005   0.854
10   ['10-', '30+', '10-']   20.911   0.466
11   ['210', '30+', '30+', '30+']   20.831   0.364
12   ['30+', '30+', '21+', '30+', '30+']   19.776   0.892
13   ['10-', '30+', '30+', '21+', '30+', '30+', '10-']   19.776   0.892
14   ['10-', '30+', '30+', '21+', '30+', '30+']   19.776   0.892
15   ['210', '210', '300', '210', '300', '130']   19.721   1.000
16   ['210', '210', '300', '210', '300']   19.721   1.000
Continued
Table 39: Significant and positively correlated fragments for skin sensitization dataset using {nC, nH, PC} annotation scheme with χ2 and γ statistic values


Table 39 continued

Number   Fragment   χ2-statistic   γ-statistic
17   ['12-', '300', '210', '210', '300', '210']   19.721   1.000
18   ['30+', '210', '210', '30+', '30+', '30+']   19.596   0.837
19   ['11-', '30+', '30+', '30+', '11-']   19.121   0.747
20   ['300', '210', '210', '300']   19.038   0.815
21   ['30+', '30+', '30+', '10-']   18.961   0.386
22   ['30+', '30+', '210', '210', '30+', '21+']   17.793   0.853
23   ['30+', '30+', '21+', '30+', '210', '210']   17.793   0.853
24   ['30+', '30+', '21+', '30+', '210']   17.793   0.853
25   ['30+', '210', '210', '30+', '21+', '30+', '30+']   17.793   0.853
26   ['30+', '210', '210', '30+', '21+', '30+']   17.793   0.853
27   ['30+', '210', '210', '30+', '21+']   17.793   0.853
28   ['30+', '21+', '30+', '30+', '210', '210']   17.793   0.853
29   ['30+', '21+', '30+', '30+', '210']   17.793   0.853
30   ['30+', '21+', '30+', '210', '210']   17.793   0.853
31   ['30+', '21+', '30+', '210']   17.793   0.853
32   ['210', '30+', '21+', '30+', '30+', '210']   17.793   0.853
33   ['21+', '30+', '30+', '210', '210', '30+']   17.793   0.853
34   ['21+', '30+', '30+', '210', '210']   17.793   0.853
35   ['21+', '30+', '210', '210', '30+', '30+', '30+']   17.793   0.853
36   ['21+', '30+', '210', '210']   17.793   0.853
37   ['21+', '30+', '210']   17.793   0.853
38   ['10-', '30+', '30+', '21+', '30+', '210', '210']   17.793   0.853
39   ['10-', '30+', '30+', '21+', '30+', '210']   17.793   0.853
40   ['210', '210', '30+', '30+', '30+']   17.362   0.662
41   ['11-', '30+', '210', '300']   16.928   0.665
42   ['30+', '210', '210', '30+', '30+', '30+', '10-']   16.574   0.820
43   ['21-', '300', '300', '300', '300', '21-']   15.889   1.000
44   ['21-', '300', '21-', '21-', '300', '300']   15.889   1.000
45   ['21-', '21-', '300', '300', '300', '300', '21-']   15.889   1.000
46   ['21-', '21-', '300', '300', '300', '21-', '21-']   15.889   1.000
Continued


Table 39 continued

Number   Fragment   χ2-statistic   γ-statistic
47   ['11-', '30+', '210', '300', '21-']   15.889   1.000
48   ['300', '210', '210', '300', '12-']   15.288   0.918
49   ['210', '300', '210', '300', '130']   15.288   0.918
50   ['300', '210', '300', '210']   15.227   0.714
51   ['210', '210', '30+', '30+', '30+', '10-']   15.084   0.640
52   ['11-', '30+', '30+', '11-']   14.610   0.757
53   ['30+', '30+', '210', '210', '30+', '30+', '30+']   14.597   0.834
54   ['21-', '210', '21+', '10-']   13.392   0.440
55   ['21-', '210', '21+']   13.392   0.440
56   ['30+', '210', '210', '30+']   12.936   0.352
57   ['210', '30+', '300', '210']   12.915   0.464
58   ['210', '30+', '30+', '21+', '30+', '30+', '10-']   12.882   0.857
59   ['210', '30+', '30+', '21+', '30+', '30+']   12.882   0.857
60   ['210', '210', '30+', '30+', '21+', '30+', '30+']   12.882   0.857
61   ['21+', '30+', '30+', '210', '210', '30+', '30+']   12.882   0.857
62   ['300', '300', '300', '21-', '21-', '21-']   12.728   0.481
63   ['300', '300', '300', '21-', '21-']   12.728   0.481
64   ['300', '300', '300', '21-']   12.728   0.481
65   ['21-', '210', '30+', '10-']   12.567   0.492
66   ['10-', '30+', '30+', '30+', '10-']   12.497   0.551
67   ['30+', '30+', '210', '210', '30+', '30+', '10-']   12.259   0.570
68   ['30+', '30+', '210', '210', '30+', '30+']   12.259   0.570
69   ['300', '300', '300', '21-', '21-', '300', '300']   11.830   1.000
70   ['300', '300', '210', '300', '210', '210']   11.830   1.000
71   ['300', '300', '210', '210', '300', '210']   11.830   1.000
72   ['300', '300', '210', '210', '300']   11.830   1.000
73   ['300', '300', '21-', '21-', '300', '300', '21-']   11.830   1.000
74   ['300', '300', '21-', '21-', '300', '300']   11.830   1.000
75   ['300', '210', '300', '300', '12-']   11.830   1.000
76   ['300', '210', '300', '30-', '220', '130']   11.830   1.000
Continued


Table 39 continued

Number   Fragment   χ2-statistic   γ-statistic
77   ['300', '210', '300', '210', '210', '300', '12-']   11.830   1.000
78   ['300', '210', '210', '300', '300', '130']   11.830   1.000
79   ['300', '210', '210', '300', '210', '300', '130']   11.830   1.000
80   ['300', '210', '210', '300', '210', '300']   11.830   1.000
81   ['300', '21-', '21-', '300', '300', '300', '300']   11.830   1.000
82   ['300', '21-', '21-', '300', '300', '300']   11.830   1.000
83   ['300', '21-', '21-', '300', '300', '21-', '21-']   11.830   1.000
84   ['30-', '300', '210', '300', '130']   11.830   1.000
85   ['30-', '300', '210', '300']   11.830   1.000
86   ['30+', '30+', '210', '30+', '30+', '30+', '11-']   11.830   1.000
87   ['30+', '30+', '21+', '20-']   11.830   1.000
88   ['30+', '30+', '20-', '22+', '220', '22-', '22-']   11.830   1.000
89   ['30+', '21+', '20-']   11.830   1.000
90   ['220', '30-', '300', '210', '300', '130']   11.830   1.000
91   ['220', '30-', '300', '210', '300']   11.830   1.000
92   ['220', '22+', '20-', '30+', '30+', '210', '30+']   11.830   1.000
93   ['22+', '20-', '30+', '30+', '210', '30+', '30+']   11.830   1.000
94   ['22+', '20-', '30+', '30+', '210', '30+', '11-']   11.830   1.000
95   ['22+', '20-', '30+', '30+', '210', '30+']   11.830   1.000
96   ['210', '300', '300', '210', '300', '210']   11.830   1.000
97   ['210', '300', '300', '210', '210', '300']   11.830   1.000
98   ['210', '300', '210', '300', '300', '12-']   11.830   1.000
99   ['210', '300', '210', '210', '300', '300', '130']   11.830   1.000
100   ['210', '30+', '200']   11.830   1.000
101   ['210', '210', '300', '300', '210', '300']   11.830   1.000
102   ['210', '210', '300', '210', '300', '300', '12-']   11.830   1.000
103   ['21-', '300', '300', '300', '300', '300', '21-']   11.830   1.000
104   ['21-', '300', '300', '300', '300', '300']   11.830   1.000
105   ['21-', '300', '300', '300', '21-', '21-', '300']   11.830   1.000
Continued


Table 39 continued

Number   Fragment   χ2-statistic   γ-statistic
106   ['21-', '300', '21-', '21-', '300', '300', '300']   11.830   1.000
107   ['21-', '21-', '300', '300', '300', '300', '300']   11.830   1.000
108   ['21-', '21-', '300', '21-', '21-', '300', '300']   11.830   1.000
109   ['21-', '21-', '21-', '300', '21-', '21-', '300']   11.830   1.000
110   ['20-', '30+', '30+', '210', '30+', '30+', '30+']   11.830   1.000
111   ['20-', '30+', '30+', '210', '30+', '11-']   11.830   1.000
112   ['130', '300', '210', '300', '30-', '220', '130']   11.830   1.000
113   ['12-', '300', '210', '300']   11.830   1.000
114   ['11-', '30+', '30+', '30+', '210', '30+', '210']   11.830   1.000
115   ['11-', '30+', '30+', '30+', '210', '30+']   11.830   1.000
116   ['11-', '30+', '30+', '210', '21-', '300', '210']   11.830   1.000
117   ['11-', '30+', '30+', '210', '21-', '300']   11.830   1.000
118   ['11-', '30+', '210', '300', '21-', '210', '30+']   11.830   1.000
119   ['11-', '30+', '210', '300', '21-', '210']   11.830   1.000
120   ['11-', '30+', '210', '30+', '210', '30+', '30+']   11.830   1.000
121   ['11-', '30+', '210', '30+', '210', '30+', '11-']   11.830   1.000
122   ['11-', '30+', '210', '30+', '210', '30+']   11.830   1.000
123   ['10-', '30+', '30+', '21+', '20-']   11.830   1.000
124   ['10-', '220', '300', '21-']   11.830   1.000
125   ['10-', '220', '300']   11.830   1.000
126   ['10-', '30+', '20-', '30+', '10-']   11.540   0.900
127   ['300', '21-', '21-', '300', '300', '21-']   11.485   0.641
128   ['21-', '300', '300', '300', '21-', '21-', '21-']   11.485   0.641
129   ['21-', '300', '300', '300', '21-', '21-']   11.485   0.641
130   ['21-', '300', '300', '300', '21-']   11.485   0.641
131   ['11-', '30+', '30+', '30+']   11.390   0.378


Number   Fragment   χ2-statistic   γ-statistic
1   ['C30+', 'C30+', 'C21+']   35.534   0.908
2   ['C21+', 'C30+', 'N30+', 'O10-']   31.763   0.900
3   ['C21+', 'C30+', 'N30+']   31.763   0.900
4   ['C30+', 'C30+', 'C21+', 'C30+']   24.563   0.880
5   ['C30+', 'C21+', 'C30+', 'N30+', 'O10-']   24.563   0.880
6   ['C30+', 'C21+', 'C30+', 'N30+']   24.563   0.880
7   ['C30+', 'C21+', 'C30+']   24.563   0.880
8   ['C21+', 'C30+', 'C30+', 'C210']   24.563   0.880
9   ['O11-', 'C30+', 'C30+', 'C30+', 'C210']   23.373   0.644
10   ['C300', 'C210', 'C210', 'C300', 'C210']   23.005   0.854
11   ['O10-', 'N30+', 'O10-']   22.494   0.508
12   ['C210', 'C210', 'C30+', 'C30+', 'N30+', 'O10-']   21.589   0.826
13   ['C210', 'C210', 'C30+', 'C30+', 'N30+']   21.589   0.826
14   ['C30+', 'C30+', 'N30+', 'O10-']   20.771   0.636
15   ['C30+', 'C30+', 'N30+']   20.771   0.636
16   ['C30+', 'N30+', 'O10-']   19.865   0.489
17   ['O10-', 'N30+', 'C30+', 'C21+', 'C30+', 'N30+', 'O10-']   19.776   0.892
18   ['N30+', 'C30+', 'C21+', 'C30+', 'N30+', 'O10-']   19.776   0.892
19   ['N30+', 'C30+', 'C21+', 'C30+', 'N30+']   19.776   0.892
20   ['C30+', 'C30+', 'C21+', 'C30+', 'N30+', 'O10-']   19.776   0.892
21   ['C30+', 'C30+', 'C21+', 'C30+', 'N30+']   19.776   0.892
22   ['N12-', 'C300', 'C210', 'C210', 'C300', 'C210']   19.721   1.000
23   ['C210', 'C210', 'C300', 'C210', 'C300', 'C130']   19.721   1.000
24   ['C210', 'C210', 'C300', 'C210', 'C300']   19.721   1.000
25   ['O11-', 'C30+', 'C30+', 'C30+', 'O11-']   19.121   0.747
26   ['C300', 'C210', 'C210', 'C300']   19.038   0.815
27   ['C30+', 'C30+', 'C210', 'C210', 'C30+', 'C21+']   17.793   0.853
28   ['C30+', 'C30+', 'C21+', 'C30+', 'C210', 'C210']   17.793   0.853
29   ['C30+', 'C30+', 'C21+', 'C30+', 'C210']   17.793   0.853
30   ['C30+', 'C210', 'C210', 'C30+', 'C30+', 'N30+', 'O10-']   17.793   0.853
Continued
Table 40: Significant and positively correlated fragments for skin sensitization dataset using {AI, nC, nH, PC} annotation scheme with χ2 and γ statistic values

Table 40 continued

Number  Fragment  χ2 statistic  γ statistic
31  ['C30+', 'C210', 'C210', 'C30+', 'C30+', 'N30+']  17.793  0.853
32  ['C30+', 'C210', 'C210', 'C30+', 'C21+', 'C30+', 'N30+']  17.793  0.853
33  ['C30+', 'C210', 'C210', 'C30+', 'C21+', 'C30+']  17.793  0.853
34  ['C30+', 'C210', 'C210', 'C30+', 'C21+']  17.793  0.853
35  ['C30+', 'C21+', 'C30+', 'C30+', 'C210', 'C210']  17.793  0.853
36  ['C30+', 'C21+', 'C30+', 'C30+', 'C210']  17.793  0.853
37  ['C30+', 'C21+', 'C30+', 'C210', 'C210']  17.793  0.853
38  ['C30+', 'C21+', 'C30+', 'C210']  17.793  0.853
39  ['C210', 'C30+', 'C21+', 'C30+', 'N30+', 'O10-']  17.793  0.853
40  ['C210', 'C30+', 'C21+', 'C30+', 'N30+']  17.793  0.853
41  ['C210', 'C30+', 'C21+', 'C30+', 'C30+', 'C210']  17.793  0.853
42  ['C210', 'C210', 'C30+', 'C21+', 'C30+', 'N30+', 'O10-']  17.793  0.853
43  ['C210', 'C210', 'C30+', 'C21+', 'C30+', 'N30+']  17.793  0.853
44  ['C21+', 'C30+', 'C30+', 'C210', 'C210', 'C30+']  17.793  0.853
45  ['C21+', 'C30+', 'C30+', 'C210', 'C210']  17.793  0.853
46  ['C21+', 'C30+', 'C210', 'C210', 'C30+', 'C30+', 'N30+']  17.793  0.853
47  ['C21+', 'C30+', 'C210', 'C210']  17.793  0.853
48  ['C21+', 'C30+', 'C210']  17.793  0.853
49  ['O11-', 'C30+', 'C210', 'C300']  16.928  0.665
50  ['C210', 'C30+', 'C30+', 'N30+', 'O10-']  16.007  0.597
51  ['C210', 'C30+', 'C30+', 'N30+']  16.007  0.597
52  ['C21-', 'C300', 'C300', 'C300', 'C300', 'C21-']  15.889  1.000
53  ['C21-', 'C300', 'C21-', 'C21-', 'C300', 'C300']  15.889  1.000
54  ['C21-', 'C21-', 'C300', 'C300', 'C300', 'C300', 'C21-']  15.889  1.000
55  ['C21-', 'C21-', 'C300', 'C300', 'C300', 'C21-', 'C21-']  15.889  1.000
56  ['C300', 'C210', 'C210', 'C300', 'N12-']  15.288  0.918
57  ['C210', 'C300', 'C210', 'C300', 'C130']  15.288  0.918
58  ['C300', 'C210', 'C300', 'C210']  15.227  0.714
59  ['O11-', 'C30+', 'C30+', 'O11-']  14.610  0.757
60  ['C21-', 'C210', 'C21+', 'O10-']  13.392  0.440
Continued

Table 40 continued

Number  Fragment  χ2 statistic  γ statistic
61  ['C21-', 'C210', 'C21+']  13.392  0.440
62  ['C21-', 'C210', 'C30+', 'O10-']  13.074  0.537
63  ['C30+', 'C210', 'C210', 'C30+']  12.936  0.352
64  ['C210', 'C30+', 'C300', 'C210']  12.915  0.464
65  ['N30+', 'C30+', 'C210', 'C210', 'C30+', 'C30+', 'N30+']  12.882  0.857
66  ['C210', 'C30+', 'C30+', 'C21+', 'C30+', 'N30+', 'O10-']  12.882  0.857
67  ['C210', 'C30+', 'C30+', 'C21+', 'C30+', 'N30+']  12.882  0.857
68  ['C210', 'C210', 'C30+', 'C30+', 'C21+', 'C30+', 'N30+']  12.882  0.857
69  ['C21+', 'C30+', 'C30+', 'C210', 'C210', 'C30+', 'N30+']  12.882  0.857
70  ['C300', 'C300', 'C300', 'C21-', 'C21-', 'C21-']  12.728  0.481
71  ['C300', 'C300', 'C300', 'C21-', 'C21-']  12.728  0.481
72  ['C300', 'C300', 'C300', 'C21-']  12.728  0.481
73  ['O20-', 'C30+', 'C30+', 'C210', 'C30+', 'O11-']  11.830  1.000
74  ['O20-', 'C30+', 'C30+', 'C210', 'C30+', 'C30+', 'C30+']  11.830  1.000
75  ['O11-', 'C30+', 'C30+', 'C30+', 'C210', 'C30+', 'C210']  11.830  1.000
76  ['O11-', 'C30+', 'C30+', 'C30+', 'C210', 'C30+']  11.830  1.000
77  ['O11-', 'C30+', 'C30+', 'C210', 'C21-', 'C300', 'C210']  11.830  1.000
78  ['O11-', 'C30+', 'C30+', 'C210', 'C21-', 'C300']  11.830  1.000
79  ['O11-', 'C30+', 'C210', 'C300', 'C21-', 'C210', 'C30+']  11.830  1.000
80  ['O11-', 'C30+', 'C210', 'C300', 'C21-', 'C210']  11.830  1.000
81  ['O11-', 'C30+', 'C210', 'C300', 'C21-']  11.830  1.000
82  ['O11-', 'C30+', 'C210', 'C30+', 'C210', 'C30+', 'O11-']  11.830  1.000
83  ['O11-', 'C30+', 'C210', 'C30+', 'C210', 'C30+', 'C30+']  11.830  1.000
84  ['O11-', 'C30+', 'C210', 'C30+', 'C210', 'C30+']  11.830  1.000
85  ['N30-', 'C300', 'C210', 'C300', 'C130']  11.830  1.000
86  ['N30-', 'C300', 'C210', 'C300']  11.830  1.000
87  ['N20-', 'C30+', 'N20-']  11.830  1.000
88  ['N20-', 'C30+', 'C300', 'C130']  11.830  1.000
89  ['N20-', 'C30+', 'C210', 'C30+']  11.830  1.000
90  ['N12-', 'C300', 'C210', 'C300']  11.830  1.000
Continued

Table 40 continued

Number  Fragment  χ2 statistic  γ statistic
91  ['C300', 'C300', 'C300', 'C21-', 'C21-', 'C300', 'C300']  11.830  1.000
92  ['C300', 'C300', 'C210', 'C300', 'C210', 'C210']  11.830  1.000
93  ['C300', 'C300', 'C210', 'C210', 'C300', 'C210']  11.830  1.000
94  ['C300', 'C300', 'C210', 'C210', 'C300']  11.830  1.000
95  ['C300', 'C300', 'C21-', 'C21-', 'C300', 'C300', 'C21-']  11.830  1.000
96  ['C300', 'C300', 'C21-', 'C21-', 'C300', 'C300']  11.830  1.000
97  ['C300', 'C210', 'C300', 'N30-', 'C220', 'C130']  11.830  1.000
98  ['C300', 'C210', 'C300', 'C300', 'N12-']  11.830  1.000
99  ['C300', 'C210', 'C300', 'C210', 'C210', 'C300', 'N12-']  11.830  1.000
100  ['C300', 'C210', 'C210', 'C300', 'Cl10-']  11.830  1.000
101  ['C300', 'C210', 'C210', 'C300', 'C300', 'C130']  11.830  1.000
102  ['C300', 'C210', 'C210', 'C300', 'C210', 'C300', 'C130']  11.830  1.000
103  ['C300', 'C210', 'C210', 'C300', 'C210', 'C300']  11.830  1.000
104  ['C300', 'C21-', 'C21-', 'C300', 'C300', 'C300', 'C300']  11.830  1.000
105  ['C300', 'C21-', 'C21-', 'C300', 'C300', 'C300']  11.830  1.000
106  ['C300', 'C21-', 'C21-', 'C300', 'C300', 'C21-', 'C21-']  11.830  1.000
107  ['C30+', 'C30+', 'O20-', 'C22+', 'C220', 'C22-', 'C22-']  11.830  1.000
108  ['C30+', 'C30+', 'C30+', 'C21+']  11.830  1.000
109  ['C30+', 'C30+', 'C210', 'C30+', 'C30+', 'C30+', 'O11-']  11.830  1.000
110  ['C30+', 'C30+', 'C210', 'C30+', 'C30+', 'C30+', 'C210']  11.830  1.000
111  ['C220', 'N30-', 'C300', 'C210', 'C300', 'C130']  11.830  1.000
112  ['C220', 'N30-', 'C300', 'C210', 'C300']  11.830  1.000
113  ['C220', 'C22+', 'O20-', 'C30+', 'C30+', 'C210', 'C30+']  11.830  1.000
114  ['C22+', 'O20-', 'C30+', 'C30+', 'C210', 'C30+', 'O11-']  11.830  1.000
115  ['C22+', 'O20-', 'C30+', 'C30+', 'C210', 'C30+', 'C30+']  11.830  1.000
116  ['C22+', 'O20-', 'C30+', 'C30+', 'C210', 'C30+']  11.830  1.000
117  ['C210', 'C300', 'C300', 'C210', 'C300', 'C210']  11.830  1.000
118  ['C210', 'C300', 'C300', 'C210', 'C210', 'C300']  11.830  1.000
119  ['C210', 'C300', 'C210', 'C300', 'C300', 'N12-']  11.830  1.000
120  ['C210', 'C300', 'C210', 'C210', 'C300', 'C300', 'C130']  11.830  1.000
Continued

Table 40 continued

Number  Fragment  χ2 statistic  γ statistic
121  ['C210', 'C210', 'C300', 'C300', 'C210', 'C300']  11.830  1.000
122  ['C210', 'C210', 'C300', 'C210', 'C300', 'C300', 'N12-']  11.830  1.000
123  ['C21-', 'C300', 'C300', 'C300', 'C300', 'C300', 'C21-']  11.830  1.000
124  ['C21-', 'C300', 'C300', 'C300', 'C300', 'C300']  11.830  1.000
125  ['C21-', 'C300', 'C300', 'C300', 'C21-', 'C21-', 'C300']  11.830  1.000
126  ['C21-', 'C300', 'C21-', 'C21-', 'C300', 'C300', 'C300']  11.830  1.000
127  ['C21-', 'C21-', 'C300', 'C300', 'C300', 'C300', 'C300']  11.830  1.000
128  ['C21-', 'C21-', 'C300', 'C21-', 'C21-', 'C300', 'C300']  11.830  1.000
129  ['C21-', 'C21-', 'C21-', 'C300', 'C21-', 'C21-', 'C300']  11.830  1.000
130  ['C130', 'C300', 'C210', 'C300', 'N30-', 'C220', 'C130']  11.830  1.000
131  ['O10-', 'C30+', 'O20-', 'C30+', 'O10-']  11.540  0.900
132  ['C300', 'C21-', 'C21-', 'C300', 'C300', 'C21-']  11.485  0.641
133  ['C21-', 'C300', 'C300', 'C300', 'C21-', 'C21-', 'C21-']  11.485  0.641
134  ['C21-', 'C300', 'C300', 'C300', 'C21-', 'C21-']  11.485  0.641
135  ['C21-', 'C300', 'C300', 'C300', 'C21-']  11.485  0.641

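The χ2 and γ statistics reported for each fragment above can be computed from the fragment's 2×2 contingency table (fragment present/absent versus active/inactive compounds). The sketch below uses the standard Pearson χ2 formula and Goodman-Kruskal γ (which reduces to Yule's Q for a 2×2 table); the counts are illustrative, not taken from the dataset.

```python
def chi2_and_gamma(a, b, c, d):
    """Association statistics for a fragment's 2x2 contingency table.

    a: active compounds containing the fragment
    b: inactive compounds containing the fragment
    c: active compounds lacking the fragment
    d: inactive compounds lacking the fragment
    """
    n = a + b + c + d
    # Pearson chi-squared statistic for a 2x2 table
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Goodman-Kruskal gamma; for a 2x2 table this is Yule's Q
    gamma = (a * d - b * c) / (a * d + b * c)
    return chi2, gamma

# Illustrative counts: fragment present in 10 active and 2 inactive
# compounds, absent from 40 active and 60 inactive compounds
chi2, gamma = chi2_and_gamma(10, 2, 40, 60)
```

A positive γ, as in the tables above, indicates that the fragment's presence is positively associated with activity.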

APPENDIX C. RESULTS FOR 5-FOLD CROSS-VALIDATION OF BENCHMARK DATASET FOR AMES MUTAGENICITY

1. Preliminary Markov chain models with one-step connection probabilities
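Before the tables, a minimal sketch of the idea behind these models: class-specific one-step connection (transition) probabilities between annotated atom types are estimated from training compounds, and a query is assigned to the class under which its atom sequence is more likely. The atom labels, the unseen-transition floor, and the decision rule below are illustrative assumptions, not the dissertation's exact implementation.

```python
from collections import defaultdict
from math import log

def one_step_probs(sequences):
    """Estimate one-step connection probabilities between annotated
    atom types from a set of atom-type sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nbrs.values()) for b, n in nbrs.items()}
            for a, nbrs in counts.items()}

def log_likelihood(seq, probs, floor=1e-6):
    """Log-likelihood of a sequence under a one-step transition model;
    transitions unseen in training get a small floor probability."""
    return sum(log(probs.get(a, {}).get(b, floor))
               for a, b in zip(seq, seq[1:]))

# Illustrative {AI}-annotated sequences for the two training classes
active = [['C', 'N', 'O'], ['C', 'N', 'N', 'O']]
inactive = [['C', 'C', 'C', 'O'], ['C', 'C', 'O']]
p_act, p_inact = one_step_probs(active), one_step_probs(inactive)

# Classify a query by comparing class-conditional log-likelihoods
query = ['C', 'N', 'O']
prediction = ('active' if log_likelihood(query, p_act) >
              log_likelihood(query, p_inact) else 'inactive')
```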

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.611         0.618         0.614         0
CV 2          0.646         0.607         0.629         0
CV 3          0.641         0.600         0.624         0
CV 4          0.688         0.623         0.661         0
CV 5          0.681         0.633         0.661         0
Average       0.653         0.616         0.638
A. Annotation scheme: {AI}

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.765         0.527         0.669         0
CV 2          0.761         0.593         0.690         0
CV 3          0.771         0.579         0.690         0
CV 4          0.800         0.553         0.696         0
CV 5          0.752         0.575         0.678         0
Average       0.770         0.565         0.685
B. Annotation scheme: {AI, nC}
Continued

Table 41: Performance parameters for 5-fold cross-validation of the benchmark dataset for Ames mutagenicity using preliminary Markov chain models based on one-step connection probabilities (14 sub-tables, one for each annotation scheme)


Table 41 continued

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.746         0.568         0.673         1
CV 2          0.765         0.559         0.678         0
CV 3          0.755         0.533         0.662         0
CV 4          0.781         0.498         0.662         0
CV 5          0.752         0.542         0.664         2
Average       0.760         0.540         0.668
C. Annotation scheme: {AI, nC, nH}

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.771         0.520         0.669         0
CV 2          0.784         0.536         0.680         1
CV 3          0.755         0.557         0.672         0
CV 4          0.776         0.520         0.668         1
CV 5          0.750         0.544         0.664         2
Average       0.767         0.535         0.671
D. Annotation scheme: {nC, nH, PC} (3-level PC)

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.749         0.594         0.686         4
CV 2          0.768         0.604         0.700         3
CV 3          0.739         0.578         0.671         1
CV 4          0.792         0.570         0.699         5
CV 5          0.733         0.594         0.675         6
Average       0.756         0.588         0.686
E. Annotation scheme: {AI, nC, nH, PC} (3-level PC)
Continued

Table 41 continued

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.731         0.500         0.637         0
CV 2          0.767         0.557         0.678         0
CV 3          0.764         0.542         0.671         0
CV 4          0.779         0.507         0.665         0
CV 5          0.724         0.553         0.652         0
Average       0.753         0.532         0.661
F. Annotation scheme: {AI, RA} (binary RA annotation)

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.760         0.521         0.663         1
CV 2          0.754         0.589         0.685         1
CV 3          0.772         0.567         0.686         0
CV 4          0.793         0.524         0.680         0
CV 5          0.747         0.594         0.683         0
Average       0.765         0.559         0.679
G. Annotation scheme: {AI, nC, RA} (binary RA annotation)

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.735         0.583         0.673         0
CV 2          0.761         0.598         0.692         0
CV 3          0.762         0.567         0.680         0
CV 4          0.774         0.541         0.676         0
CV 5          0.721         0.611         0.675         0
Average       0.751         0.580         0.679
H. Annotation scheme: {AI, nH, RA} (binary RA annotation)
Continued

Table 41 continued

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.756         0.584         0.686         4
CV 2          0.781         0.608         0.708         2
CV 3          0.751         0.596         0.686         0
CV 4          0.795         0.586         0.707         1
CV 5          0.740         0.588         0.676         1
Average       0.765         0.592         0.693
I. Annotation scheme: {nC, nH, PC} (5-level PC)

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.756         0.615         0.699         8
CV 2          0.775         0.619         0.709         4
CV 3          0.736         0.599         0.678         5
CV 4          0.788         0.599         0.708         7
CV 5          0.745         0.614         0.690         7
Average       0.760         0.609         0.697
J. Annotation scheme: {AI, nC, nH, PC} (5-level PC)

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.729         0.596         0.675         6
CV 2          0.754         0.652         0.711         2
CV 3          0.757         0.654         0.711         4
CV 4          0.803         0.608         0.721         6
CV 5          0.758         0.631         0.705         5
Average       0.760         0.628         0.705
K. Annotation scheme: {AI, nC, PC, RA} (5-level PC and binary RA)
Continued

Table 41 continued

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.730         0.632         0.690         2
CV 2          0.763         0.632         0.708         2
CV 3          0.759         0.630         0.705         2
CV 4          0.787         0.591         0.704         7
CV 5          0.746         0.627         0.696         6
Average       0.757         0.622         0.701
L. Annotation scheme: {AI, nH, PC, RA} (5-level PC and binary RA)

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.778         0.581         0.698         4
CV 2          0.788         0.626         0.720         3
CV 3          0.782         0.610         0.710         2
CV 4          0.819         0.599         0.726         4
CV 5          0.777         0.598         0.702         1
Average       0.789         0.603         0.711
M. Annotation scheme: {nC, nH, PC, RA} (5-level PC and binary RA)

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.779         0.598         0.705         10
CV 2          0.801         0.613         0.722         7
CV 3          0.792         0.602         0.712         8
CV 4          0.818         0.591         0.722         8
CV 5          0.784         0.599         0.707         9
Average       0.795         0.601         0.714
N. Annotation scheme: {AI, nC, nH, PC, RA} (5-level PC and binary RA)

2. Markov chain models with fragments of longer lengths
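The fragment lengths in the tables below count bonds spanned by a linear fragment. As background, a sketch of enumerating such linear (path) fragments from a molecular graph follows; it treats a fragment and its reverse as distinct, which is an assumption — the dissertation may canonicalize direction.

```python
def linear_fragments(adjacency, labels, length):
    """Enumerate linear (path) fragments spanning `length` bonds from a
    molecular graph given as an adjacency dict and per-atom labels."""
    frags = set()

    def extend(path):
        if len(path) == length + 1:          # `length` bonds -> length+1 atoms
            frags.add(tuple(labels[i] for i in path))
            return
        for nbr in adjacency[path[-1]]:
            if nbr not in path:              # simple paths only, no revisits
                extend(path + [nbr])

    for atom in adjacency:
        extend([atom])
    return frags

# Illustrative 4-atom chain C-C-N-O with {AI} labels
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = {0: 'C', 1: 'C', 2: 'N', 3: 'O'}
frags2 = linear_fragments(adj, labels, 2)    # fragments spanning 2 bonds
```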

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.611         0.713         0.652         1
CV 2          0.649         0.680         0.662         0
CV 3          0.643         0.663         0.651         0
CV 4          0.665         0.666         0.666         0
CV 5          0.634         0.692         0.658         1
Average       0.640         0.683         0.658
A. Annotation scheme: {AI, nC, nH} and Fragment length: 2

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.734         0.636         0.694         3
CV 2          0.744         0.639         0.700         2
CV 3          0.716         0.630         0.680         1
CV 4          0.773         0.612         0.705         4
CV 5          0.729         0.671         0.705         2
Average       0.739         0.638         0.697
B. Annotation scheme: {AI, nC, nH, PC} and Fragment length: 2
Continued

Table 42: Performance parameters for 5-fold cross-validation of the benchmark dataset for Ames mutagenicity using Markov chain models based on longer fragment lengths (9 sub-tables: 3 annotation schemes × 3 fragment lengths)


Table 42 continued

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.748         0.632         0.701         2
CV 2          0.756         0.658         0.715         3
CV 3          0.736         0.644         0.698         2
CV 4          0.784         0.645         0.725         4
CV 5          0.759         0.661         0.718         1
Average       0.757         0.648         0.711
C. Annotation scheme: {nC, nH, PC, RA} and Fragment length: 2

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.722         0.601         0.672         4
CV 2          0.718         0.553         0.648         1
CV 3          0.716         0.530         0.638         1
CV 4          0.737         0.502         0.638         0
CV 5          0.726         0.596         0.671         1
Average       0.724         0.556         0.653
D. Annotation scheme: {AI, nC, nH} and Fragment length: 3

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.771         0.551         0.682         6
CV 2          0.760         0.592         0.689         3
CV 3          0.759         0.579         0.683         2
CV 4          0.799         0.557         0.697         4
CV 5          0.759         0.591         0.688         2
Average       0.770         0.574         0.688
E. Annotation scheme: {AI, nC, nH, PC} and Fragment length: 3
Continued

Table 42 continued

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.780         0.554         0.688         5
CV 2          0.770         0.608         0.702         4
CV 3          0.771         0.593         0.696         3
CV 4          0.817         0.575         0.715         4
CV 5          0.787         0.579         0.700         1
Average       0.785         0.582         0.700
F. Annotation scheme: {nC, nH, PC, RA} and Fragment length: 3

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.758         0.506         0.656         10
CV 2          0.761         0.506         0.653         4
CV 3          0.756         0.483         0.641         5
CV 4          0.760         0.429         0.621         7
CV 5          0.774         0.505         0.661         7
Average       0.762         0.486         0.646
G. Annotation scheme: {AI, nC, nH} and Fragment length: 4

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.793         0.527         0.685         12
CV 2          0.803         0.533         0.689         6
CV 3          0.776         0.537         0.676         6
CV 4          0.817         0.511         0.689         11
CV 5          0.775         0.554         0.682         8
Average       0.793         0.532         0.684
H. Annotation scheme: {AI, nC, nH, PC} and Fragment length: 4
Continued

Table 42 continued

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.798         0.510         0.681         11
CV 2          0.803         0.546         0.695         7
CV 3          0.795         0.541         0.689         7
CV 4          0.835         0.522         0.704         11
CV 5          0.805         0.534         0.692         7
Average       0.807         0.531         0.692
I. Annotation scheme: {nC, nH, PC, RA} and Fragment length: 4

3. Markov chain models with improved specificity
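The classification criteria x = 0.5 and x = 0 in the sub-tables below are thresholds on the model's class scores. One plausible reading, shown here as an assumption rather than the dissertation's exact rule, is to call a compound active only when its active-class score exceeds its inactive-class score by more than x, and to leave a compound unscored (NP) when no score can be computed:

```python
def classify(score_active, score_inactive, x):
    """Threshold-based classification sketch (hypothetical rule).

    Predicts 'active' only when the active score exceeds the inactive
    score by more than x; compounds with no computable score (e.g. all
    fragments unseen in training) are 'not predicted' (NP).
    """
    if score_active is None or score_inactive is None:
        return 'NP'
    return 'active' if score_active - score_inactive > x else 'inactive'

# Raising x makes the 'active' call more conservative, trading
# sensitivity for specificity
call_lenient = classify(1.2, 1.0, 0.0)    # 'active'
call_strict = classify(1.2, 1.0, 0.5)     # 'inactive'
```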

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.756         0.639         0.708         4
CV 2          0.768         0.641         0.715         3
CV 3          0.756         0.646         0.710         2
CV 4          0.801         0.633         0.730         4
CV 5          0.749         0.617         0.694         1
Average       0.766         0.635         0.711
A. Annotation scheme: {nC, nH, PC, RA} and Classification criteria: x = 0.5
Continued

Table 43: Performance parameters for 5-fold cross-validation of the benchmark dataset for Ames mutagenicity using preliminary Markov chain models based on one-step connection probabilities, with different annotation schemes and classification criteria (4 sub-tables)


Table 43 continued

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.758         0.659         0.718         10
CV 2          0.769         0.635         0.713         7
CV 3          0.765         0.644         0.714         8
CV 4          0.806         0.627         0.730         8
CV 5          0.764         0.619         0.703         9
Average       0.772         0.637         0.716
B. Annotation scheme: {AI, nC, nH, PC, RA} and Classification criteria: x = 0.5

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.776         0.602         0.706         15
CV 2          0.804         0.636         0.734         12
CV 3          0.780         0.645         0.723         5
CV 4          0.822         0.612         0.735         14
CV 5          0.784         0.620         0.715         4
Average       0.793         0.623         0.723
C. Annotation scheme: {nC, nH, PC, RA} (5-level RA) and Classification criteria: x = 0

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.787         0.615         0.718         24
CV 2          0.809         0.635         0.736         19
CV 3          0.798         0.647         0.734         13
CV 4          0.821         0.599         0.728         16
CV 5          0.787         0.613         0.715         12
Average       0.800         0.622         0.726
D. Annotation scheme: {AI, nC, nH, PC, RA} (5-level RA) and Classification criteria: x = 0

4. kNN models
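The kNN and weighted kNN results below classify a query by the majority vote of its k most similar training compounds. The sketch below assumes Tanimoto (Jaccard) similarity on fragment sets and similarity-weighted votes for the weighted variant; the dissertation's exact similarity measure and weighting may differ.

```python
from collections import Counter

def tanimoto(frags_a, frags_b):
    """Tanimoto (Jaccard) similarity between two fragment sets."""
    if not frags_a and not frags_b:
        return 0.0
    return len(frags_a & frags_b) / len(frags_a | frags_b)

def knn_predict(query, train, k=5, weighted=False):
    """Classify by the k most similar training compounds.

    train: list of (fragment_set, label) pairs.
    weighted=True weights each neighbor's vote by its similarity
    to the query, as in a weighted kNN model.
    """
    ranked = sorted(train, key=lambda t: tanimoto(query, t[0]),
                    reverse=True)[:k]
    votes = Counter()
    for frags, label in ranked:
        votes[label] += tanimoto(query, frags) if weighted else 1
    return votes.most_common(1)[0][0]

# Illustrative fragment sets (1 = mutagenic, 0 = non-mutagenic)
train = [({'CN', 'NO'}, 1), ({'CN', 'NN'}, 1), ({'CC', 'CO'}, 0),
         ({'CC', 'CCl'}, 0), ({'NO', 'NN'}, 1)]
pred = knn_predict({'CN', 'NO', 'NN'}, train, k=3)
```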

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.789         0.658         0.736         3
CV 2          0.774         0.694         0.740         3
CV 3          0.786         0.668         0.737         1
CV 4          0.811         0.601         0.722         0
CV 5          0.780         0.695         0.744         1
Average       0.788         0.663         0.736
A. Annotation scheme: {AI, nC, nH} and Fragment length: 2

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.785         0.632         0.723         5
CV 2          0.791         0.682         0.745         4
CV 3          0.805         0.666         0.746         2
CV 4          0.832         0.614         0.740         1
CV 5          0.789         0.660         0.735         2
Average       0.800         0.651         0.738
B. Annotation scheme: {AI, nC, nH, PC} and Fragment length: 2
Continued

Table 44: Performance parameters for 5-fold cross-validation of the benchmark dataset for Ames mutagenicity using kNN models with 5 nearest neighbors (9 sub-tables: 3 annotation schemes × 3 fragment lengths)


Table 44 continued

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.780         0.650         0.727         5
CV 2          0.813         0.694         0.763         5
CV 3          0.793         0.666         0.739         2
CV 4          0.811         0.641         0.739         1
CV 5          0.798         0.658         0.739         2
Average       0.799         0.662         0.741
C. Annotation scheme: {AI, nC, nH, PC, RA} and Fragment length: 2

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.793         0.626         0.725         8
CV 2          0.820         0.676         0.759         8
CV 3          0.836         0.662         0.763         6
CV 4          0.820         0.636         0.742         1
CV 5          0.823         0.652         0.751         6
Average       0.818         0.650         0.748
D. Annotation scheme: {AI, nC, nH} and Fragment length: 3

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.796         0.646         0.735         11
CV 2          0.821         0.705         0.772         11
CV 3          0.802         0.683         0.752         8
CV 4          0.810         0.638         0.738         7
CV 5          0.798         0.656         0.738         9
Average       0.805         0.666         0.747
E. Annotation scheme: {AI, nC, nH, PC} and Fragment length: 3
Continued

Table 44 continued

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.788         0.640         0.728         15
CV 2          0.793         0.708         0.757         14
CV 3          0.800         0.674         0.747         10
CV 4          0.817         0.654         0.748         10
CV 5          0.799         0.668         0.744         14
Average       0.799         0.669         0.745
F. Annotation scheme: {AI, nC, nH, PC, RA} and Fragment length: 3

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.808         0.619         0.732         35
CV 2          0.828         0.686         0.768         21
CV 3          0.835         0.635         0.751         25
CV 4          0.806         0.607         0.723         27
CV 5          0.817         0.655         0.749         26
Average       0.819         0.640         0.745
G. Annotation scheme: {AI, nC, nH} and Fragment length: 4

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.788         0.625         0.723         48
CV 2          0.780         0.692         0.743         37
CV 3          0.817         0.672         0.755         36
CV 4          0.810         0.606         0.725         37
CV 5          0.807         0.662         0.746         37
Average       0.800         0.651         0.738
H. Annotation scheme: {AI, nC, nH, PC} and Fragment length: 4
Continued

Table 44 continued

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.782         0.616         0.716         65
CV 2          0.772         0.729         0.754         53
CV 3          0.814         0.682         0.759         52
CV 4          0.817         0.631         0.740         52
CV 5          0.815         0.661         0.750         47
Average       0.800         0.664         0.744
I. Annotation scheme: {AI, nC, nH, PC, RA} and Fragment length: 4

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.791         0.676         0.744         3
CV 2          0.786         0.689         0.745         3
CV 3          0.796         0.678         0.747         1
CV 4          0.800         0.608         0.719         0
CV 5          0.764         0.700         0.737         1
Average       0.787         0.670         0.738
A. Annotation scheme: {AI, nC, nH} and Fragment length: 2
Continued

Table 45: Performance parameters for 5-fold cross-validation of the benchmark dataset for Ames mutagenicity using weighted kNN models with 5 nearest neighbors (9 sub-tables: 3 annotation schemes × 3 fragment lengths)


Table 45 continued

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.790         0.665         0.740         5
CV 2          0.796         0.684         0.749         4
CV 3          0.807         0.678         0.753         2
CV 4          0.828         0.631         0.745         1
CV 5          0.784         0.682         0.741         2
Average       0.801         0.668         0.746
B. Annotation scheme: {AI, nC, nH, PC} and Fragment length: 2

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.784         0.673         0.739         5
CV 2          0.817         0.709         0.771         5
CV 3          0.796         0.695         0.754         2
CV 4          0.813         0.655         0.746         1
CV 5          0.801         0.687         0.753         2
Average       0.802         0.684         0.753
C. Annotation scheme: {AI, nC, nH, PC, RA} and Fragment length: 2

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.784         0.629         0.721         8
CV 2          0.818         0.674         0.757         8
CV 3          0.832         0.676         0.767         6
CV 4          0.828         0.639         0.748         1
CV 5          0.811         0.659         0.747         6
Average       0.815         0.655         0.748
D. Annotation scheme: {AI, nC, nH} and Fragment length: 3
Continued

Table 45 continued

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.789         0.658         0.736         11
CV 2          0.814         0.727         0.777         11
CV 3          0.797         0.683         0.749         8
CV 4          0.803         0.646         0.737         7
CV 5          0.792         0.668         0.740         9
Average       0.799         0.676         0.748
E. Annotation scheme: {AI, nC, nH, PC} and Fragment length: 3

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.790         0.660         0.737         15
CV 2          0.796         0.732         0.769         14
CV 3          0.793         0.686         0.748         10
CV 4          0.813         0.671         0.753         10
CV 5          0.793         0.690         0.750         14
Average       0.797         0.688         0.751
F. Annotation scheme: {AI, nC, nH, PC, RA} and Fragment length: 3

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.805         0.627         0.733         35
CV 2          0.819         0.695         0.767         21
CV 3          0.837         0.635         0.752         25
CV 4          0.819         0.617         0.734         27
CV 5          0.812         0.658         0.747         26
Average       0.818         0.646         0.747
G. Annotation scheme: {AI, nC, nH} and Fragment length: 4
Continued

Table 45 continued

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.782         0.647         0.729         48
CV 2          0.776         0.697         0.743         37
CV 3          0.817         0.672         0.755         36
CV 4          0.809         0.606         0.724         37
CV 5          0.801         0.668         0.745         37
Average       0.797         0.658         0.739
H. Annotation scheme: {AI, nC, nH, PC} and Fragment length: 4

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.787         0.658         0.736         65
CV 2          0.778         0.716         0.752         53
CV 3          0.813         0.682         0.758         52
CV 4          0.821         0.637         0.744         52
CV 5          0.815         0.668         0.753         47
Average       0.803         0.672         0.749
I. Annotation scheme: {AI, nC, nH, PC, RA} and Fragment length: 4


           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.796         0.678         0.748         4
CV 2          0.805         0.689         0.757         3
CV 3          0.795         0.685         0.749         1
CV 4          0.806         0.613         0.724         0
CV 5          0.782         0.702         0.748         1
Average       0.797         0.673         0.745
A. Annotation scheme: {AI, nC, nH} and Fragment length: 2

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.801         0.625         0.729         5
CV 2          0.817         0.694         0.766         4
CV 3          0.822         0.661         0.755         2
CV 4          0.841         0.631         0.753         1
CV 5          0.784         0.664         0.734         3
Average       0.813         0.655         0.747
B. Annotation scheme: {AI, nC, nH, PC} and Fragment length: 2
Continued

Table 46: Performance parameters for 5-fold cross-validation of the benchmark dataset for Ames mutagenicity using weighted kNN models with 7 nearest neighbors (9 sub-tables: 3 annotation schemes × 3 fragment lengths)


Table 46 continued

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.789         0.637         0.727
CV 2          0.815         0.704         0.768         5
CV 3          0.814         0.678         0.757         2
CV 4          0.835         0.639         0.753         1
CV 5          0.808         0.681         0.755         3
Average       0.812         0.668         0.752
C. Annotation scheme: {AI, nC, nH, PC, RA} and Fragment length: 2

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.807         0.624         0.733         10
CV 2          0.809         0.674         0.752         8
CV 3          0.840         0.659         0.764         6
CV 4          0.837         0.623         0.747         2
CV 5          0.830         0.637         0.749         6
Average       0.825         0.643         0.749
D. Annotation scheme: {AI, nC, nH} and Fragment length: 3

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.810         0.649         0.745         13
CV 2          0.817         0.707         0.771         11
CV 3          0.811         0.663         0.749         8
CV 4          0.813         0.637         0.740         8
CV 5          0.804         0.668         0.747         10
Average       0.811         0.665         0.750
E. Annotation scheme: {AI, nC, nH, PC} and Fragment length: 3
Continued

Table 46 continued

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.802         0.640         0.736         17
CV 2          0.800         0.727         0.769         15
CV 3          0.814         0.676         0.756         11
CV 4          0.824         0.645         0.749         14
CV 5          0.797         0.680         0.748         15
Average       0.807         0.674         0.752
F. Annotation scheme: {AI, nC, nH, PC, RA} and Fragment length: 3

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.813         0.620         0.736         37
CV 2          0.815         0.686         0.760         22
CV 3          0.844         0.642         0.758         29
CV 4          0.810         0.627         0.734         29
CV 5          0.819         0.639         0.743         28
Average       0.820         0.643         0.746
G. Annotation scheme: {AI, nC, nH} and Fragment length: 4

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.798         0.627         0.730         50
CV 2          0.792         0.702         0.754         39
CV 3          0.810         0.671         0.751         41
CV 4          0.797         0.624         0.725         40
CV 5          0.802         0.684         0.753         41
Average       0.800         0.662         0.743
H. Annotation scheme: {AI, nC, nH, PC} and Fragment length: 4
Continued


Table 46 continued

           Sensitivity   Specificity   Concordance   Not predicted (NP)
CV 1          0.803         0.636         0.737         72
CV 2          0.784         0.708         0.752         56
CV 3          0.821         0.677         0.760         62
CV 4          0.815         0.623         0.736         60
CV 5          0.807         0.650         0.742         53
Average       0.806         0.659         0.745
I. Annotation scheme: {AI, nC, nH, PC, RA} and Fragment length: 4
